Specialty Forums
Security and Virus
General Hardware
CPUs/Overclocking
Networking
Digital Photo/Video
Office Software
PC Gaming
Console Gaming
Programming
Database
Web Development
Digital Home

General Forums
Windows XP
Windows Vista
Windows 95/98
Windows Me
Windows NT
Windows 2000
Win Server 2008
Win Server 2003
Windows 3.1
Linux
PDAs
BeOS
Novell Netware
OpenVMS
Solaris
Disk Op. System
Unix
Mac
OS/2

Drivers
Driver Scan
Driver Forum

Software
Automatic Updates

BIOS Updates

My Computing.Net

Solution Center

Free IT eBook

Howtos

Site Search

Message Find

RSS Feeds

Install Guides

Data Recovery

About

Home
Reply to Message Icon Go to Main Page Icon

bash script

Original Message
Name: mike171562
Date: July 15, 2007 at 08:38:04 Pacific
Subject: bash script
OS: ubuntu
CPU/Ram: p4
Model/Manufacturer: intel
Comment:
Hello, I have been searching for a way to extact urls from google cache url search results,

I have a file with a list of urls like this

""http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22s---%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8""

what i need to do is extract the actual url www.worldwidewords.org/qa/qa-shi3.htm which lies between the : and the + and remove the google cache url from the list so I will have a list of regular urls, I also have normal urls in the list which I would like to keep in the list.

any help would be appreciated


This is the bash script I am using, it searches google and give you a list of urls, takes out everything but the link and pipes them to a file


#!/bin/bash
#
# google.sh
# ---------
# Automatic Google search from the command line.
#
# Syntax : $ google {search terms}
#
if [ -z $1 ]
then
# If no keyword is entered echo try again
#
echo "you didnt tell me what to search....try again"
else
#url variable with the maximum search results (100) per page
#
url='http://google.ca/search?num=100&hl=en&safe=off&q='

appended=0
for searchTerm in "$@"
do
# Replace white spaces in the search terms
#
searchTerm=`echo $searchTerm | sed 's/ /%20/g'`

url="$url%22$searchTerm%22"

if [ $appended -lt `expr $# - 1` ]
then
url="$url"\+
else
url="$url"\&btnG\=Google\+Search\&meta\=
fi

let "appended+=1"
done

lynx -dump $url >> googleresult1
sed 's/http/\^http/g' googleresult1 | tr -s "^" "\n" | grep http| sed 's/\ .*//g' >> googleresults2 #this command extract only the urs
rm googleresult1
cat googleresults2
sed -e '/google/d' googleresults2 >> urls.txt
fi

The sed command at the end removes the results with google.com in them which are the following pages of results
I have tried this sed -n '/:/,/+/p' url.txt but there are three colons in the cache url and I need the text between the third : and the +


Report Offensive Message For Removal


Response Number 1
Name: lankrypt0
Date: July 16, 2007 at 12:30:14 Pacific
Subject: bash script
Reply: (edit)
If the URLs are standard like that, try:

sed -e "s/^http:\/\/.*\/search?q=cache:.*www/www/g"|awk -F+ '{print $1}'


Report Offensive Follow Up For Removal




Use following form to reply to current message:

   Name: From My Computing.Net Settings
 E-Mail: From My Computing.Net Settings

Subject: bash script

Comments:

 
  Homepage URL (*): 
Homepage Title (*): 
         Image URL: 
 


Data Recovery Software




DSHUB24 Connection Problems

need help with dsl and dial up

novel 3.12

help mandriva install last straw!

Icon Scaling in Explorer Bar


The information on Computing.Net is the opinions of its users. Such opinions may not be accurate and they are to be used at your own risk. Computing.Net cannot verify the validity of the statements made on this site. Computing.Net and Computing.Net, LLC hereby disclaim all responsibility and liability for the content of Computing.Net and its accuracy.
PLEASE READ THE FULL DISCLAIMER AND LEGAL TERMS BY CLICKING HERE

All content ©1996-2007 Computing.Net, LLC