Equivalent perl script for sed

Name: braveking
Date: January 28, 2004 at 12:59:58 Pacific
OS: solaris, AIX
CPU/Ram: 512
Comment:

Currently the following commands take 2 hours to format a file of 6 million records.
A friend who works in computing suggested I use a Perl script. I am not that familiar with Perl scripting. Can somebody suggest the equivalent of the following Unix commands? I plan to replace the following statements
with a Perl script. Your help would be greatly appreciated.

csplit -ks -f ${DWH_OUT}/other/a1prefix ${DWH_OUT}/other/a1_coremat.tmp 3
cat ${DWH_OUT}/other/a1prefix01|sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g' >${DWH_OUT}/other/a1_corema.tmp




Response Number 1
Name: James Boothe
Date: January 28, 2004 at 13:18:23 Pacific
Reply:

Converting to perl may gain a little, but there is a major time savings to be gained here ...

It is a common mistake to use an extra process and pipe by cat'ing a file and piping into sed or awk or whatever. Instead of:

cat myfile | sed '(sed code)'

just do:

sed '(sed code)' myfile

That will eliminate one pass through those 6 million records. This reminds me of the old adage:

"In unix scripting, thou shalt not use a superfluous cat process, especially when dealing with 6 million records."

OK, so maybe that's a new adage.
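
A minimal sketch of the point above, assuming a tiny made-up sample file (`myfile` in a temporary directory) in place of the real 6-million-record input. It runs the thread's three sed expressions both ways and keeps the results so they can be compared; the sample data is hypothetical.

```shell
# Build a small sample file standing in for the real input (hypothetical data).
tmpdir=$(mktemp -d)
printf 'a 1.|b|?c\nd 2.|e|?f\n' > "$tmpdir/myfile"

# Variant 1: an extra cat process feeds sed through a pipe.
with_cat=$(cat "$tmpdir/myfile" | sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g')

# Variant 2: sed opens the file itself; no cat, no pipe.
without_cat=$(sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g' "$tmpdir/myfile")

rm -rf "$tmpdir"

# Both variants produce the same text, so only one needs printing.
printf '%s\n' "$without_cat"
```

The output is identical either way; the second form simply avoids one process and one pipe.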



Response Number 2
Name: braveking
Date: January 28, 2004 at 17:33:16 Pacific
Reply:

Hi James,
I changed the code as follows and reran the script. It failed. Did I get the syntax right?

sed '(sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g' >${DWH_OUT}/other/a1_corema.tmp)' ${DWH_OUT}/other/a1prefix01



Response Number 3
Name: James Boothe
Date: January 29, 2004 at 07:25:33 Pacific
Reply:

Sorry for the confusion.

Below, I recoded the second command line and split it across multiple lines with line continuations for readability.

csplit -ks -f ${DWH_OUT}/other/a1prefix ${DWH_OUT}/other/a1_coremat.tmp 3
sed -e 's/ //g'     \
    -e 's/\.|/|/g'  \
    -e 's/|?/|/g'   \
 ${DWH_OUT}/other/a1prefix01 \
 > ${DWH_OUT}/other/a1_corema.tmp



Response Number 4
Name: FishMonger
Date: January 29, 2004 at 12:34:31 Pacific
Reply:

braveking,

Did you try the Perl code that I provided you in your previous question? The benchmark tests that I ran with that code took 2 minutes to process a file with 6 million lines. With a little more tweaking, we might be able to improve that a little.



Response Number 5
Name: braveking
Date: January 30, 2004 at 10:52:13 Pacific
Reply:

Hi James, this time it ran successfully but it didn't make any difference. It took exactly two hours.





Response Number 6
Name: braveking
Date: January 30, 2004 at 10:53:53 Pacific
Reply:

Hi Fish, I will try your Perl script today and let you know. Thanks.



Response Number 7
Name: braveking
Date: January 30, 2004 at 11:27:34 Pacific
Reply:

Hi Fish, it has already been half an hour since I submitted the script, and it's still running.

Basically the script does an unload and then formatting. The unload only takes about 2 to 5 minutes. Below are the formatting statements that follow the unload. Further down there are mv and rm commands to move and delete the files, so I haven't pasted them.

As you can see, I replaced the sed command with the following perl commands.
I haven't got any errors so far. I am hoping I got the syntax right.

csplit -ks -f ${DWH_OUT}/other/a1prefix ${DWH_OUT}/other/a1_coremat.tmp 3
#sed -e 's/ //g' \
# -e 's/\.|/|/g' \
# -e 's/|?/|/g' \
# ${DWH_OUT}/other/a1prefix01 \
# > ${DWH_OUT}/other/a1_corema.tmp
perl -pe "s/ //g" \
-pe "s/\.|/|/g" \
-pe "s/|?/|/g" \
${DWH_OUT}/other/a1prefix01 \
> ${DWH_OUT}/other/a1_corema.tmp




Response Number 8
Name: FishMonger
Date: January 30, 2004 at 22:39:08 Pacific
Reply:

Even though at home I have both a Solaris and Linux box, almost everything I do is on my Windows box (one of these days I'll get smart and reverse that). So, my unix skills are rusty and my shell scripting is worse, but let me see if I understand what you're doing (based on your 2 questions).

1) Splitting up your source file into temp files of only 3 lines each.

2) Using either sed or perl regexes to reformat the temp files as they are being created.

3) Combining the temp files back into 1 file, then moving or deleting the temp files.

If that's right then I'd say, since you're working with a source file that has 6 million lines and you're creating 2 million temp files, 98% of the run time is tied up in the creating/moving/deleting of those temp files. And, you're also calling the perl/sed script 2 million times.
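
Whether the job really creates millions of temp files depends on how csplit is invoked. A quick check, assuming standard POSIX csplit behavior: with a single line-number operand such as `3` and no repeat count, csplit should cut the input once, just before line 3, rather than every 3 lines. The six-line sample below is hypothetical; its two leading "header" lines are only a guess at why the original script splits at line 3.

```shell
tmpdir=$(mktemp -d)
# Six-line sample: two header-like lines followed by four data lines.
printf 'header\n------\nrow1\nrow2\nrow3\nrow4\n' > "$tmpdir/in.txt"

# Same options as in the thread: -k keep files on error, -s silent,
# -f output file prefix; the operand 3 means "split before line 3".
csplit -ks -f "$tmpdir/part" "$tmpdir/in.txt" 3

# Count the pieces and the lines in each before cleaning up.
file_count=$(ls "$tmpdir" | grep -c '^part')
first_lines=$(wc -l < "$tmpdir/part00" | tr -d ' ')
second_lines=$(wc -l < "$tmpdir/part01" | tr -d ' ')
rm -rf "$tmpdir"

printf 'files=%s part00=%s lines part01=%s lines\n' \
    "$file_count" "$first_lines" "$second_lines"
```

If the real command behaves the same way, `a1prefix00` holds the first two lines and `a1prefix01` holds everything else, so sed would run once over one large file.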

From what you've posted, you can get rid of the csplit and do this with one short perl command that uses only 1 regex and reads/modifies the source file directly without using the temp files. Based on my benchmark test, this whole process can be shortened to 2 to 3 minutes. If you need or want, perl can easily and quickly split and move the file(s).

Here's the perl command line (which is what I posted in your first question) to do this:

perl -pe 's/\.\|\s*/|/g' input.txt > output.txt
or
perl -pi -e 's/\.\|\s*/|/g' input.txt
or
perl -pi.bak -e 's/\.\|\s*/|/g' input.txt

BTW, that's also the answer to your question:
“Can i get the equivalent perl script for the following unix command”?
cat a1prefix01 | sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g' > a1_coremat.tmp



Response Number 9
Name: FishMonger
Date: January 31, 2004 at 12:09:29 Pacific
Reply:

One point I hinted at but didn't clarify is that, since you're using 3 regexes, each line is being processed 3 times, which means your processing jumped from 6 million to 18 million lines.
 
Another thing to consider is that there might be something else in the script that is causing it to take 2 hours to complete.  If you want to post your complete script, James and I can look for other possible problems.



Response Number 10
Name: braveking
Date: February 2, 2004 at 14:18:06 Pacific
Reply:

Sorry to confuse you guys.
Well, I re-evaluated the script closely. It looks like I was mistaken about the root cause of the 2 hours of processing. Basically the script runs against a Teradata database to unload data from a table to a file, and then formats the unloaded data to create a file.

Basically the script has the following steps:
Unload query and export to a file
Format File (i.e using commands csplit, sed etc)

When I submit the script, after about 5 minutes I usually get the following message on the console.
*** Query completed. 5993158 rows found. 5 columns returned.
*** Total elapsed time was one minute and 45 seconds.

This led me to believe that the unload had completed. Since the next commands are csplit and sed, I was blaming them.

Now I have tried splitting the script into two parts, one for the unload and the other for formatting. The unload is taking close to two hours. I noticed that the time is going to file I/O. Here is how the unload statement is coded on Teradata; please let me know if anybody knows a better way to code it on Teradata.


bteq << !EOF

.SESSIONS 1;

.LOGON ${USERID},${PASSWORD}${ACCTSTR_SATURN}

.SET QUIET OFF;
.SET SEPARATOR "|";
.SET WIDTH 120;

.EXPORT REPORT FILE = ${DWH_OUT}/other/a1_coremat.tmp;
SELECT DISTINCT a.current_card_nbr,
a.customer_id,
a.household_id,
a.mail_allowed_id,
a.email_address_txt
FROM ${DWH_CUSTOMER_DB}.lu_customer a
WHERE a.customer_id >= ${CUSTOMERSTARTNBR}
AND a.customer_id <= ${CUSTOMERENDNBR}
ORDER BY a.household_id,
a.current_card_nbr;

.IF ERRORCODE <> 0 THEN .QUIT 1

.EXPORT RESET;

.LOGOFF;
.QUIT 0
!EOF
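
The split into an unload part and a formatting part described above can also be timed inside a single script. This is only a sketch: `unload_step` and `format_step` are hypothetical placeholders for the real bteq and csplit/sed commands, and it assumes a `date` that supports `+%s`, which older Solaris and AIX versions may not.

```shell
# Hypothetical stand-ins for the two halves of the job; replace the bodies
# with the real bteq unload and the csplit/sed formatting commands.
unload_step() { :; }
format_step() { :; }

# Record epoch seconds before, between, and after the two stages.
start=$(date +%s)
unload_step
mid=$(date +%s)
format_step
end=$(date +%s)

unload_secs=$((mid - start))
format_secs=$((end - mid))
printf 'unload: %ss  format: %ss\n' "$unload_secs" "$format_secs"
```

Printing the two durations side by side shows which stage accounts for the two hours without relying on console messages.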


