Currently the following commands take 2 hours to format a file of 6 million records.
A friend who works with computers suggested I use a Perl script, but I am not that familiar with Perl scripting. Can somebody suggest the Perl equivalent of the following Unix commands? I plan to replace these statements with a Perl script. Your help would be greatly appreciated.
csplit -ks -f ${DWH_OUT}/other/a1prefix ${DWH_OUT}/other/a1_coremat.tmp 3
cat ${DWH_OUT}/other/a1prefix01|sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g' >${DWH_OUT}/other/a1_corema.tmp

Converting to perl may gain a little, but there is a major time savings to be gained here ...
It is a common mistake to use an extra process and pipe by cat'ing a file and piping into sed or awk or whatever. Instead of:
cat myfile | sed '(sed code)'
just do:
sed '(sed code)' myfile
That will eliminate one pass through those 6 million records. This reminds me of the old adage:
"In unix scripting, thou shalt not use a superfluous cat process, especially when dealing with 6 million records."
OK, so maybe that's a new adage.
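
As a concrete illustration, here is a small sketch of that change applied to the three sed expressions from the question. The sample line and the temporary file are made up for the demonstration; the real input would be a1prefix01.

```shell
# Made-up sample record; the real input file would be a1prefix01.
sample=$(mktemp)
printf 'a .|b |?c\n' > "$sample"

# Original style: an extra cat process feeding sed through a pipe.
with_cat=$(cat "$sample" | sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g')

# Same expressions, with sed opening the file itself.
without_cat=$(sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g' "$sample")

# Both forms should produce the same text.
if [ "$with_cat" = "$without_cat" ]; then
    echo "identical output: $without_cat"
fi
rm -f "$sample"
```

The sed code stays exactly as it was; only the file name moves from the cat command to the end of the sed command.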

Hi James,
I changed the code as follows and reran the script. It failed. Did I get the syntax right?
sed '(sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g' >${DWH_OUT}/other/a1_corema.tmp)' ${DWH_OUT}/other/a1prefix01

Sorry for the confusion.
Below, I recoded the second command line and split it across multiple lines with line continuations for readability.
csplit -ks -f ${DWH_OUT}/other/a1prefix ${DWH_OUT}/other/a1_coremat.tmp 3
sed -e 's/ //g' \
-e 's/\.|/|/g' \
-e 's/|?/|/g' \
${DWH_OUT}/other/a1prefix01 \
> ${DWH_OUT}/other/a1_corema.tmp

braveking,
Did you try the Perl code that I provided in your previous question? In the benchmark tests that I ran, that code took 2 minutes to process a file with 6 million lines. With a little more tweaking, we might be able to improve on that a little.

Hi Fish, it has already been half an hour since I submitted the script, and it is still running.
Basically the script does an unload and then formatting. The unload only takes about 2 to 5 minutes. Below are the statements that run after the unload to format the file. Further down there are mv and rm commands to move and delete the files, so I haven't pasted them.
As you can see, I replaced the sed command with the following perl commands.
I haven't got any errors so far. I am hoping I got the syntax right.
csplit -ks -f ${DWH_OUT}/other/a1prefix ${DWH_OUT}/other/a1_coremat.tmp 3
#sed -e 's/ //g' \
# -e 's/\.|/|/g' \
# -e 's/|?/|/g' \
# ${DWH_OUT}/other/a1prefix01 \
# > ${DWH_OUT}/other/a1_corema.tmp
perl -pe "s/ //g" \
-pe "s/\.|/|/g" \
-pe "s/|?/|/g" \
${DWH_OUT}/other/a1prefix01 \
> ${DWH_OUT}/other/a1_corema.tmp

Even though at home I have both a Solaris and Linux box, almost everything I do is on my Windows box (one of these days I'll get smart and reverse that). So, my unix skills are rusty and my shell scripting is worse, but let me see if I understand what you're doing (based on your 2 questions).
1) Splitting up your source file into temp files of only 3 lines each.
2) Using either sed or perl regexes to reformat the temp files as they are being created.
3) Combining the temp files back into 1 file, then moving or deleting the temp files.
If that's right then I'd say, since you're working with a source file that has 6 million lines and you're creating 2 million temp files, 98% of the run time is tied up in the creating/moving/deleting of those temp files. And, you're also calling the perl/sed script 2 million times.
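
It may be worth checking how many pieces csplit really produces before blaming temp-file overhead. As far as I know, a bare line-number operand such as 3, with no repeat count like {*}, splits the file once at that line rather than every 3 lines. A small sketch with a made-up 6-line file (the directory and file names here are hypothetical):

```shell
# Scratch directory and 6-line input file, both made up for this demo.
dir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\nline6\n' > "$dir/input"

# Same options as the command in the question: keep files on error (-k),
# suppress the size output (-s), use a custom prefix (-f), split at line 3.
csplit -ks -f "$dir/part" "$dir/input" 3

# Count the pieces and the lines in each one.
pieces=$(ls "$dir"/part* | wc -l | tr -d ' ')
first_lines=$(wc -l < "$dir/part00" | tr -d ' ')
second_lines=$(wc -l < "$dir/part01" | tr -d ' ')
echo "pieces=$pieces first=$first_lines second=$second_lines"
rm -rf "$dir"
```

If this reports only two pieces on your system, then a1prefix01 is simply everything from line 3 onward, and the sed or perl step runs once over that single file.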
From what you've posted, you can get rid of the csplit and do this with one short perl command that uses only 1 regex and reads/modifies the source file directly without using the temp files. Based on my benchmark test, this whole process can be shortened to 2 to 3 minutes. If you need or want, perl can easily and quickly split and move the file(s).
Here's the perl command line (which is what I posted in your first question) to do this:
perl -pe 's/\.\|\s*/|/g' input.txt > output.txt
or
perl -pi -e 's/\.\|\s*/|/g' input.txt
or
perl -pi.bak -e 's/\.\|\s*/|/g' input.txt
BTW, that's also the answer to your question:
"Can I get the equivalent perl script for the following unix command?"
cat a1prefix01 | sed -e 's/ //g' -e 's/\.|/|/g' -e 's/|?/|/g' > a1_coremat.tmp

One point I hinted at but didn't clarify: since you're using 3 regexes, each line is being processed 3 times, which means your processing jumped from 6 million to 18 million regex applications.
Another thing to consider is that there might be something else in the script that is causing it to take 2 hours to complete. If you want to post your complete script, James and I can look for other possible problems.
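
One way to narrow that down is to time each stage of the script separately. Below is a minimal sketch of a timing wrapper; time_step is a made-up helper name, and the unload.sh and format.sh scripts in the comments are hypothetical stand-ins for the two halves of the real script.

```shell
# Run a command and report how many whole seconds it took.
time_step() {
    label=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start))s (exit status $status)"
    return $status
}

# Hypothetical usage, one call per stage of the real script:
#   time_step unload ./unload.sh
#   time_step format ./format.sh
time_step "sleep demo" sleep 1
```

Whichever stage reports the large number is the one worth tuning first.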

Sorry to confuse you guys.
Well, I re-evaluated the script closely. It looks like I was mistaken about the root cause of the 2 hours of processing. Basically the script runs against a Teradata database to unload data from a table to a file, then formats the unloaded data to create a file.
The script has the following steps:
Unload query and export to a file
Format file (i.e. using commands csplit, sed, etc.)
When I submit the script, within about 5 minutes I usually get the following message on the console:
*** Query completed. 5993158 rows found. 5 columns returned.
*** Total elapsed time was one minute and 45 seconds.
That led me to believe that the unload had completed. Since the next commands are csplit and sed, I was blaming them.
Now I have tried splitting the script into two parts, one for the unload and the other for formatting. The unload is taking close to two hours. I noticed that the time is being spent on I/O to the file. Here is how the unload statement is coded on Teradata; please let me know if anybody is aware of a better way to code it on Teradata.
bteq << !EOF
.SESSIONS 1;
.LOGON ${USERID},${PASSWORD}${ACCTSTR_SATURN}
.SET QUIET OFF;
.SET SEPARATOR "|";
.SET WIDTH 120;
.EXPORT REPORT FILE = ${DWH_OUT}/other/a1_coremat.tmp;
SELECT DISTINCT a.current_card_nbr,
a.customer_id,
a.household_id,
a.mail_allowed_id,
a.email_address_txt
FROM ${DWH_CUSTOMER_DB}.lu_customer a
WHERE a.customer_id >= ${CUSTOMERSTARTNBR}
AND a.customer_id <= ${CUSTOMERENDNBR}
ORDER BY a.household_id,
a.current_card_nbr;
.IF ERRORCODE <> 0 THEN .QUIT 1
.EXPORT RESET;
.LOGOFF;
.QUIT 0
!EOF
