bash script

Name: mike171562
Date: July 15, 2007 at 08:38:04 Pacific
OS: ubuntu
CPU/Ram: p4
Product: intel
Comment:

Hello, I have been searching for a way to extract URLs from Google cache search results.

I have a file with a list of URLs like this:

""http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22s---%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8""

What I need to do is extract the actual URL, www.worldwidewords.org/qa/qa-shi3.htm, which lies between the : and the +, and drop the Google cache part, so that I end up with a list of regular URLs. The list also contains normal URLs, which I would like to keep as they are.

Any help would be appreciated.


This is the bash script I am using. It searches Google, gets a list of URLs, strips out everything but the links, and writes them to a file:


#!/bin/bash
#
# google.sh
# ---------
# Automatic Google search from the command line.
#
# Syntax : $ google {search terms}
#
if [ -z "$1" ]
then
# If no keyword is entered echo try again
#
echo "you didnt tell me what to search....try again"
else
#url variable with the maximum search results (100) per page
#
url='http://google.ca/search?num=100&hl=en&safe=off&q='

appended=0
for searchTerm in "$@"
do
# Replace white spaces in the search terms
#
searchTerm=`echo "$searchTerm" | sed 's/ /%20/g'`

url="$url%22$searchTerm%22"

if [ $appended -lt `expr $# - 1` ]
then
url="$url"\+
else
url="$url"\&btnG\=Google\+Search\&meta\=
fi

let "appended+=1"
done

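# Quote $url so the shell does not try to glob the ? in it
lynx -dump "$url" > googleresult1
# Extract only the URLs: put each one on its own line, then cut everything after the first space.
# Overwrite googleresults2 so old results are not appended to urls.txt again on the next run.
sed 's/http/\^http/g' googleresult1 | tr -s "^" "\n" | grep http | sed 's/\ .*//g' > googleresults2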
rm googleresult1
cat googleresults2
sed -e '/google/d' googleresults2 >> urls.txt
fi

The sed command at the end removes the results with "google" in them, which are the links to the following pages of results.
I have tried sed -n '/:/,/+/p' url.txt, but there are three colons in the cache URL and I need the text between the third : and the +.




Response Number 1
Name: lankrypt0
Date: July 16, 2007 at 12:30:14 Pacific
Reply:

If the URLs are standard like that, try:

sed -e "s/^http:\/\/.*\/search?q=cache:.*www/www/g"|awk -F+ '{print $1}'
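
If some of the cached pages do not start with www, or the lines in the file really are wrapped in quotes as in the example, a single sed substitution keyed on "cache:" may be a bit more forgiving. This is only a sketch and rests on a few assumptions: the function name extract_cached_urls is made up for illustration, the example.com line is an invented stand-in for a "normal" URL, and -E needs a reasonably recent sed (older GNU sed spells it -r). It keeps the text between the third : and the first +, and leaves lines without "/search?q=cache:" untouched.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): reads URLs on stdin, one per line.
# For Google cache URLs, keep only the part after "cache:<id>:" and before the
# first +, & or quote. Lines that do not match are printed unchanged.
extract_cached_urls() {
    sed -E 's#^"*https?://[^/]+/search\?q=cache:[^:]+:([^+&"]+).*#\1#'
}

# Demo input: the cache URL from the question plus an invented normal URL.
printf '%s\n' \
    '""http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22s---%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8""' \
    'http://www.example.com/page.html' | extract_cached_urls
```

To run it over the list from the question, something like extract_cached_urls < urls.txt > clean_urls.txt should work (clean_urls.txt is just a placeholder name). Because nothing is split on + afterwards, normal URLs that happen to contain a + are not truncated.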

