Hello, I have been searching for a way to extract URLs from Google cache URL search results. I have a file with a list of URLs like this:
""http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22s---%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8""
What I need to do is extract the actual URL, www.worldwidewords.org/qa/qa-shi3.htm, which lies between the third : and the +, and remove the Google cache URL from the list so that I end up with a list of regular URLs. The list also contains normal URLs, which I would like to keep.
Any help would be appreciated.
This is the bash script I am using. It searches Google and gives you a list of URLs, strips out everything but the links, and writes them to a file:
#!/bin/bash
#
# google.sh
# ---------
# Automatic Google search from the command line.
#
# Syntax : $ google {search terms}
#
if [ -z "$1" ]
then
# If no keyword is entered echo try again
#
echo "you didnt tell me what to search....try again"
else
#url variable with the maximum search results (100) per page
#
url='http://google.ca/search?num=100&hl=en&safe=off&q='
appended=0
for searchTerm in "$@"
do
# Replace white spaces in the search terms
#
searchTerm=$(echo "$searchTerm" | sed 's/ /%20/g')
url="$url%22$searchTerm%22"
if [ "$appended" -lt $(($# - 1)) ]
then
url="$url"\+
else
url="$url"\&btnG\=Google\+Search\&meta\=
fi
let "appended+=1"
done
# Quote the URL so the shell does not glob on the ? or split on spaces
lynx -dump "$url" >> googleresult1
sed 's/http/\^http/g' googleresult1 | tr -s "^" "\n" | grep http | sed 's/\ .*//g' >> googleresults2 # this command extracts only the URLs
rm googleresult1
cat googleresults2
sed -e '/google/d' googleresults2 >> urls.txt
fi
The sed command at the end removes the results with google in them, which are the links to the following pages of results.
I have tried sed -n '/:/,/+/p' url.txt, but there are three colons in the cache URL and I need the text between the third : and the +.
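For what it is worth, a range like '/:/,/+/p' selects whole lines between a line matching : and a line matching +, so it cannot cut text out of the middle of a single line. A substitution is probably closer to what is needed. Here is a minimal sketch, assuming every cache URL has the form http://HOST/search?q=cache:ID:REALURL+..., that REALURL itself contains no + or &, and that sed supports -E (GNU and BSD sed do). The function name extract_cached_urls is made up for illustration.

```shell
#!/bin/bash
# Sketch: replace each Google cache URL with the URL it wraps.
# Lines that do not look like a cache URL are passed through unchanged.
# Assumption: cache URLs look like
#   http://HOST/search?q=cache:ID:REALURL+more&params
extract_cached_urls() {
    # ^"*                 optional leading quote characters
    # https?://[^/]*      scheme and host (first colon)
    # /search\?q=cache:   literal cache prefix (second colon)
    # [^:]*:              cache ID up to the third colon
    # ([^+&"]*)           the real URL, up to the first +, & or quote
    sed -E 's|^"*https?://[^/]*/search\?q=cache:[^:]*:([^+&"]*).*$|\1|'
}

# Demo with the cache URL from the post and one ordinary URL
printf '%s\n' \
    'http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22s---%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8' \
    'http://www.example.com/page.html' | extract_cached_urls
# prints:
#   www.worldwidewords.org/qa/qa-shi3.htm
#   http://www.example.com/page.html
```

To run it over the whole list, something like extract_cached_urls < urls.txt > clean_urls.txt should work (clean_urls.txt is just a placeholder name). Note that the extracted URLs come out without the http:// prefix, because the cache URL does not carry one.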