Perl, parsing & regex

Name: Shr0Om
Date: June 6, 2007 at 08:02:17 Pacific
OS: Win XP
CPU/Ram: amd 64 3200
Product: custom
Comment:

Hi, I'm having some trouble with a tab-delimited txt file that I want to restructure somewhat. I talked to some people on #perlhelp (EFnet), and one of them gave me a one-liner that does what I want, but I need to rewrite it as a script because I'm going to convert it to an .exe. Also, the one-liner doesn't make much sense to me as I'm pretty new to Perl, and I would really like to understand what I'm doing.

So, here's what I want to do with the txt file:

1. Skip all lines containing one or more of the following words: "CL08001 squib Salgsorganisation Salgskanal Division Bruger reklamatGrund Type". (This is a header that occurs every 20th line.)
I tried with:

next if /^ZCL08001 squib Salgsorganisation Salgskanal Division Bruger reklamat Grund Type/;
but that didn't work for some reason.

2. In the file, there are multiple tabs between some of the strings. I want each run of multiple tabs replaced by a single tab.
I've tried "$_ =~ s/\t+/\t/;" inside a while loop over the file, but it only replaced some of the tabs for some reason.

3. I want to remove newlines (\n) until Perl reaches a blank line. This is because the data is usually spread over 3 lines, followed by 2 blank lines. I guess I can join the lines or remove the newlines somehow, but I'm a little unsure how to do this.
Also, all blank lines should be skipped.
I guess the one-liner does that with "print join qq{\t}, split /[\n;\t]+/".
But how do I insert it into a while loop (and how does it work)?

"while( <FILE> ) {"
...
...
}

So the result file should look like this: multiple tabs replaced with just one tab, and newlines removed until a blank line is reached; then the blank lines are skipped until a line with text occurs again.

I hope this makes sense and that someone can help and perhaps explain the approach.
Below is the one-liner I was given, which I want to rewrite and am trying to understand.


(oneliner)
perl -00 -nwle"next if /CL08001 squib Salgsorganisation Salgskanal Division Bruger reklamatGrund Type/; print join qq{\t}, split /[\n;\t]+/" foo.txt



Response Number 1
Name: FishMonger
Date: June 6, 2007 at 12:06:34 Pacific
Reply:

Let's start by looking at the switches.

-00 read the file in paragraph mode (i.e., in blocks separated by blank lines).

-n use a while loop to go through each line (paragraph/block) of the file.

-w enable warnings

-l chomp the line terminator and add it back in the print statement.

-e execute the following perl code.
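
For reference, here is a rough sketch of what those switches amount to when written out as a script. The helper name read_paragraphs and the in-memory sample data are made up for illustration; a real script would open foo.txt instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;                       # roughly what -w gives you

# Hypothetical helper: returns the paragraphs of an open filehandle.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = "";                  # -00: paragraph mode
    my @paragraphs;
    while (my $para = <$fh>) {      # -n: loop over every record
        chomp $para;                # -l: strip the record terminator
                                    # (in paragraph mode, all trailing newlines)
        push @paragraphs, $para;
    }
    return @paragraphs;
}

# Made-up sample: two paragraphs separated by blank lines.
my $sample = "a\tb\nc\n\n\nd\te\n";
open my $fh, '<', \$sample or die "open failed: $!";
# Printing a newline by hand stands in for -l adding the output terminator.
print "[$_]\n" for read_paragraphs($fh);
close $fh;
```

Each element returned is one whole paragraph, so a multi-line record still has its inner newlines at this point.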

=================================================================

Now let's look at the code.

next if /CL08001 squib Salgsorganisation Salgskanal Division Bruger reklamatGrund Type/;

Skip any paragraph that includes the header string (must be an exact match i.e., each of those words separated by a single space [not a tab])

print join qq{\t}, split /[\n;\t]+/

Let's read the print statement from right to left.

Split the string on any of the following characters: \t tab, ; semi-colon, or \n newline

The results of the split are joined with \t tab characters and passed to the print command.

The string is printed and the \n line terminator is added (via the -l switch).
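
A small sketch of the split/join step on its own may make this easier to see. The function name flatten_record and the sample strings are invented for the example, not taken from the real data file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: same split/join as in the one-liner.
sub flatten_record {
    my ($record) = @_;
    # Split on runs of newline, semi-colon, or tab, then rejoin with one tab.
    return join "\t", split /[\n;\t]+/, $record;
}

# Made-up record with repeated tabs, a semi-colon, and embedded newlines.
my $record = "A\t\t\tB;C\nD\t\nE";
print flatten_record($record), "\n";

# A record that starts with separators keeps one empty leading field,
# so the result begins with a tab; trailing separators are dropped.
print flatten_record("\t\tA\t\t"), "\n";
```

Note that split does not treat spaces as separators here, so a field such as "From Date" stays in one piece.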

=================================================================

I see possible problems with that 1 liner.

It assumes that you have a blank line both before and after each of the header strings. If that's not the case, it'll also be skipping over some of the tab separated data that you want to keep.

The header string is hard coded in the regex to look for those words separated by a single space, not a tab.

Without seeing a sample of your real data, I can't be sure, but this may do what you need.

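#!/usr/bin/perl

use strict;
use warnings;

# Paragraph mode: each read returns one block of lines up to the next blank line(s).
$/ = "";

open (F, 'foo.txt') || die "open failed $!";

while(<F>) {
    # Skip any paragraph that contains the header string.
    next if /CL08001 squib Salgsorganisation Salgskanal Division Bruger reklamatGrund Type/;
    # print join qq{\t}, split /[\n;\t]+/;
    # Collapse every run of whitespace (tabs, spaces, newlines) into a single tab.
    s/[\t\s]+/\t/g;
    print "$_\n";
}
close F;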



Response Number 2
Name: Shr0Om
Date: June 6, 2007 at 13:21:50 Pacific
Reply:

Hi, and big thanks for the help, FishMonger :)
I have some further questions I hope you can find time to answer.

$/ = ""; <-- Why this one??

open (F, 'foo.txt') || die "open failed $!";

while(<F>) {
next if /CL08001 squib Salgsorganisation Salgskanal Division Bruger reklamatGrund Type/;
# print join qq{\t}, split /[\n;\t]+/;
s/[\t\s]+/\t/g; <-- Could you give a little explanation on this one as i cannot quite relate it to the oneliner you just explained.
print "$_\n";
}
close F;

The discarding of the header still doesn't work. The header looks like this (since this post won't show multiple tabs or newlines, I've written them out):
\t\tSalgsorganisation\t\t\t\t\t\t:\t29\t\tDANMARK\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n
\t\tSalgskanal\t\t\t\t\t\t:\t10\t\tEkstern\t\t\t\t\t\t\t\tFrom Date\t\t\t 01.03.2007\tTo Date\t\t31.03.2007 \t\t\t\t\t\t\t\t\n

So any line with one or more of the strings should just be discarded. Remember, not all of the strings are found on one line.
I've used system 'findstr /v "CL08......." >tmpfile' previously, and it works fine and very efficiently, but I don't see why I shouldn't do it all in Perl.

By the way, there is always a blank line before and after the header.



Response Number 3
Name: FishMonger
Date: June 6, 2007 at 14:03:49 Pacific
Reply:

>> $/ = ""; <-- Why this one??

That has the same effect as the -00 switch.

>> s/[\t\s]+/\t/g; <-- Could you give a little explanation on this

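That replaces every run of whitespace (tabs, spaces, and newlines, since \s matches all of them) with a single tab. It's very close, but not exactly what is being done in the split/join combination: the split leaves spaces inside a field alone and also splits on semi-colons.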
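
To compare the variants side by side, here is a short sketch. The three function names and the sample record are made up for the example; the first function is the attempt from the original question, which lacks the /g modifier and so only changes the first run of tabs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Attempt from the question: no /g, so only the FIRST run of tabs is replaced.
sub collapse_first_tab_run {
    my ($s) = @_;
    $s =~ s/\t+/\t/;
    return $s;
}

# Same pattern with /g: every run of tabs becomes one tab.
sub collapse_tab_runs {
    my ($s) = @_;
    $s =~ s/\t+/\t/g;
    return $s;
}

# The substitution from the script: \s also matches spaces and newlines,
# so those are turned into tabs as well.
sub collapse_whitespace_runs {
    my ($s) = @_;
    $s =~ s/[\t\s]+/\t/g;
    return $s;
}

# Made-up record; tabs are shown as <TAB> and newlines as <NL> in the output.
my $record = "10\t\tEkstern\t\t\tFrom Date\t\t01.03.2007\n";
for my $result (collapse_first_tab_run($record),
                collapse_tab_runs($record),
                collapse_whitespace_runs($record)) {
    (my $visible = $result) =~ s/\t/<TAB>/g;
    $visible =~ s/\n/<NL>/g;
    print "$visible\n";
}
```

If single spaces inside fields need to survive, the tabs-only version with /g or the split/join approach is probably the closer fit.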

===================================================================

It sounds like the regex needs to be modified to use alternation instead of matching the complete exact string.

next if /(CL08001|squib|Salgsorganisation|Salgskanal|Division|Bruger|reklamatGrund|Type)/;
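
As a minimal sketch of how that alternation behaves, the snippet below runs it against two shortened header lines like the ones posted above plus one invented data line. The name is_header and the data line "12345 Widget 3" are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Alternation: matches if ANY one of the header words appears in the text.
my $header_re = qr/CL08001|squib|Salgsorganisation|Salgskanal|Division|Bruger|reklamatGrund|Type/;

# Hypothetical helper: returns 1 if the text contains a header word, else 0.
sub is_header {
    my ($text) = @_;
    return $text =~ $header_re ? 1 : 0;
}

my @lines = (
    "\t\tSalgsorganisation\t\t:\t29\t\tDANMARK\n",   # header (shortened)
    "\t\tSalgskanal\t\t:\t10\t\tEkstern\n",          # header (shortened)
    "12345\tWidget\t3\n",                            # invented data line
);

for my $line (@lines) {
    next if is_header($line);   # skip header lines
    print $line;                # only non-header lines reach this point
}
```

One thing to watch: short words such as Type or Division could also appear inside real data, in which case those records would be skipped too. Whether that matters depends on the actual file.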

I'm tied up the rest of the day, but if you want to email me a sample of your data file and how it should look after processing, I'll look at it tomorrow.


