Solved: Unix - scrape a website

March 3, 2013 at 02:20:06
Specs: Unix
I have written a script to scrape job scheduling information from a website and return it in a format which I can apply to a visual monitoring aid. The script returns the data, but too slowly, so the visual aid constantly times out. I have attached the script below; is there a quicker way of returning this data?

#!/bin/ksh

JOBSDEF=$1
SCHEDULER=MIDAS_RW_PROD_LN1C_1

echo "Job Name,Last Start Date,Last Start Time,Last End Date,Last End Time,Last Run Result,Job Status,Time Zone"

while read line
do
   JOBNAME=$line
   echo
   OUTPUTCOUNT=`curl -s "http://cfmwps13p-phys.nam.nsroot.net:7005/controlFreqWeb/Export_JobDetails?JobName=$JOBNAME&SchedulerName=$SCHEDULER" | wc -l`
   if [ $OUTPUTCOUNT -ge 1 ]
   then
      OUTPUT=`curl -s "http://cfmwps13p-phys.nam.nsroot.net:7005/controlFreqWeb/Export_JobDetails?JobName=$JOBNAME&SchedulerName=$SCHEDULER" | tail -1`
      Job=`echo $OUTPUT | cut -d "," -f2`
      LastStartD=`echo $OUTPUT | cut -d "," -f3`
      LastStartT=`echo $OUTPUT | cut -d "," -f4`
      LastEndD=`echo $OUTPUT | cut -d "," -f5`
      LastEndT=`echo $OUTPUT | cut -d "," -f6`
      LastRun=`echo $OUTPUT | cut -d "," -f7`
      JobStatus=`echo $OUTPUT | cut -d "," -f8`
      TimeZone=`echo $OUTPUT | cut -d "," -f9`
      echo $JOBNAME,$LastStartD,$LastStartT,$LastEndD,$LastEndT,$LastRun,$JobStatus,$TimeZone
   fi
done < "$JOBSDEF"


#1
March 3, 2013 at 15:19:54
While the Unix cut command has a small footprint, your script spawns eight subprocesses per job just for the parsing. This is probably what is causing your timeout. Instead of parsing with cut, try using the set command, which eliminates all of those subprocesses.

Since your output fields are comma delimited, this command does the parsing and sets all eight fields using only one subprocess:

# UNTESTED
set $(IFS=","; echo $OUTPUT)
Job=$2
LastStartD=$3
LastStartT=$4
.
.
TimeZone=$9
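
If the set trick misbehaves (an empty field would collapse and shift everything left), a minimal alternative sketch in the same spirit is ksh's built-in read with a comma IFS; the variable names below are taken from the original script:

# UNTESTED sketch: split one exported record into named fields with read.
# "skip" absorbs field 1, which the original script discards; no subprocess is spawned.
IFS="," read skip Job LastStartD LastStartT LastEndD LastEndT LastRun JobStatus TimeZone <<EOF
$OUTPUT
EOF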


#2
March 4, 2013 at 00:14:21
I'll try this today, thanks very much for your response.

#3
March 4, 2013 at 14:39:13
I still saw the same performance with the set command, so I'm coming at it from a different angle now. The script below returns all jobs on the scheduler, numbered from 14. What I want to do is format the loop so that it starts at 14 and ends after a count of 200. Can you help, please?

#!/bin/ksh

x=14
y=0

for i in `curl -s "http://cfmwps13p-phys.nam.nsroot.net:7005/controlFreqWeb/Export_JobDetails?SchedulerName=$1&db=1" | cut -d "," -f 2- | sed 's/ //g'`
do
   let x=$x+1
   if [ $x -eq 1 ]; then
      echo ID,$i
   else
      let y=$x-1
      echo $y,$i
   fi
done


#4
March 4, 2013 at 16:18:35
✔ Best Answer
First, if speed is an issue, you should eliminate the useless use of a for loop over command substitution (a relative of the Useless Use of Cat) and pipe straight into a while loop:

curl -s "http://cfmwps13p-phys.nam.nsroot.net:7005/controlFreqWeb/Export_JobDetails?SchedulerName=$1&db=1" | cut -d "," -f 2- | sed 's/ //g' | while read i
do
    .
    .
done

Read about the UUOC here:
http://partmaps.org/era/unix/award.html

OK, this stub program counts from 14 in a while loop and terminates when x is greater than 200. Note that I do not use the let command but the built-in (( )) arithmetic syntax:

#!/bin/ksh

x=14

while true
do
   ((x+=1))
   if [[ $x -gt 200 ]]
   then
      break
   fi
   echo $x
done
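
And if it helps, here is a sketch of the two pieces combined: your curl pipeline from above feeding a while read loop with the bounded counter. The URL, cut and sed stages are copied from your script; untested against the real scheduler:

#!/bin/ksh
# UNTESTED sketch: stream the export once, print only records 14 through 200.
x=0
curl -s "http://cfmwps13p-phys.nam.nsroot.net:7005/controlFreqWeb/Export_JobDetails?SchedulerName=$1&db=1" |
   cut -d "," -f 2- | sed 's/ //g' |
   while read i
   do
      ((x+=1))
      [[ $x -lt 14 ]] && continue   # skip the leading records
      [[ $x -gt 200 ]] && break     # stop after record 200
      echo $x,$i
   done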

Let me know if you have any further questions.

#5
March 7, 2013 at 04:36:53
That worked a treat, thanks a lot.

#6
March 7, 2013 at 15:14:01
Hi, I ran into another problem with the loop. I can't use a counter, as the order of the jobs on the scheduler will change. I need to pass in an assigned list of jobs in a .txt file as a parameter and read the details for each job that way. Any ideas how I can do this?

#7
March 7, 2013 at 16:25:44
Sorry, but I don't understand your question. Please elaborate.

#8
March 7, 2013 at 21:35:14
Okay, currently I have a job scheduler, MARSRT_CNTG_LNX1C, which contains over 400 jobs. When I run the script below:

#!/bin/ksh

x=14
y=0

for i in `curl -s "http://cfmwps13p-phys.nam.nsroot.net:7005/controlFreqWeb/Export_JobDetails?SchedulerName=$1&db=1" | cut -d "," -f 2- | sed 's/ //g'`
do
   let x=$x+1
   if [ $x -eq 1 ]; then
      echo ID,$i
   else
      let y=$x-1
      echo $y,$i
   fi
done

from the command line, executing it like this and passing the scheduler name in as a parameter:

./scriptname MARSRT_CNTG_LNX1C

it returns all jobs on the scheduler, most of which I don't need. I tried using a loop and counter to extract the specific jobs I need, i.e. starting at number 14 right through to 200. This is fine until new jobs are added to the scheduler: since jobs are listed in alphabetical order, my loop will then return jobs I don't require. Therefore I need to find a way of extracting only the data for specific jobs from the scheduler. Hope this helps.


#9
March 8, 2013 at 09:53:28
Ok, I understand a little better, but you don't tell us how to determine which specific jobs aren't required. Do they exist in a file somewhere? Is there something in the job name that tells you they aren't required?

We need more information.


#10
March 9, 2013 at 10:35:52
The job scheduler MARSRT_CNTG_LNX1C contains jobs for numerous applications. I am only concerned with the jobs for one application, for which I have been given a list of job names. For those job names I must extract the run details from the job scheduler.

The jobs on the scheduler do not have a unique key to identify them or a flag which states they are not required; the jobs are simply listed in alphabetical order. Is there a way of adding the list of jobs I am concerned with to a .txt file and passing that file to the script as a parameter?

Or is there another way to go about this ?

Your help is appreciated, let me know if you require any further detail.

Thanks


#11
March 11, 2013 at 10:27:01

Yes, there should be a way of "adding the list of jobs I am concerned with" to a text file, but you don't provide enough detail for me to help.

BTW, I don't mean to belabor the obvious, but you can pass more than one parameter to a shell script.
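
For example, one way this might look is a sketch that pulls the full export once and filters it against a job-list file. This assumes each required job name appears verbatim in its exported line; untested:

#!/bin/ksh
# UNTESTED sketch: $1 = scheduler name, $2 = .txt file with one required job name per line.
# Fetch the whole export once, then keep only the lines matching a listed job name.
curl -s "http://cfmwps13p-phys.nam.nsroot.net:7005/controlFreqWeb/Export_JobDetails?SchedulerName=$1&db=1" |
   cut -d "," -f 2- | sed 's/ //g' |
   grep -F -f "$2"

You would call it as something like ./scriptname MARSRT_CNTG_LNX1C joblist.txt, where joblist.txt is a hypothetical name for your list file.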


#12
March 11, 2013 at 13:27:18
I've resolved it anyway, thanks for your help.
