Perl help: What does ^_ represent?

Apple / MAC PRO
May 21, 2009 at 20:03:01
Specs: Mac OS X 10.5.6, Octo 2.8Ghz Xeon
Using Perl, I'm trying to parse a very large (1 GB) file containing
several different values per line. I'd like to put each line in an array so
that I can append certain text to specific elements.

This file was written by a shell/Perl script, which appears to have
inserted the characters "^_" between several fields/elements. At least,
that is what I see when I view the file in UNIX; in a GUI they are not
visible, and the two fields that would normally surround the character
run together. Is there any way to 'split' each line into an array using
those characters, or whatever they represent? I've tried splitting on the
characters themselves (and on \s+), but it doesn't work.

The following is a single line in the file:

AACA^_wOption^_AA       wDeleteDate^_string^_2009/03/21 
wExpirationDate^_string^_2009/03/21     wStrikePrice^_price^_12.5       
wExerciseStyle^_string^_A       wPremiumCurrency^_string^_USD   
wStrikeCurrency^_string^_USD    wIssueSymbol^_string^_AACA      
wPutCall^_string^_C     wPosLimitNearTerm^_int^_0       
wSettleOnOpenInd^_string^_N     wPosLimit^_int^_25000000        

The following is a small sampling of my code. This portion uses regex
to search for a specific symbol range on each line:

open (OCC_RAW, $rawOccFile) || die $!;
while (my $line=<OCC_RAW>){
	my @line = split /wOption/, $line;
	if ($line[0] =~ /^A\.[AWYXBNQ]$/ or $line[0] =~ /^A[A-O][A-Z]*\.[AWYXBNQ]/) {print BBO_1 "$line";}
	elsif ($line[0] =~ /^A[P-Z][A-Z]*\.[AWYXBNQ]/ or $line[0] =~ /^B\.[AWYXBNQ]$/ or $line[0] =~ /^B[A-M][A-Z]*\.[AWYXBNQ]/) {print BBO_2 "$line";}
	else {open (NaE, ">> $notFoundLog") || die print NaE "$line[0]\.NaEwOption$line[1]";}
}

What I'm trying to append is "NaE" to the symbols (AACA, in this example). As you can see, I'm splitting the line on "wOption" (and, consequently, adding wOption back later in the script, after appending NaE to the symbol) to get the symbol on its own. The problem is that the symbol appears once more, near the middle of the line.

I'm just trying to determine whether it is possible to break each line into an array using whatever it is that sits between the fields. Does anyone have any recommendations, or a smarter way of doing this? I'm sure that my chosen method is a reach, though it got the job done before I needed to address this issue.

Thanks in advance.


May 22, 2009 at 03:06:13
^_ is probably the Unit Separator ASCII control character, shown in Caret Notation.
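For anyone wondering how caret notation maps to a byte value: the character after the caret has its 0x40 bit flipped, so "_" (0x5F) becomes 0x1F. A minimal sketch follows; caret_to_char is a made-up helper name for illustration, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a caret-notation pair such as "^_" into the
# control character it names by flipping the 0x40 bit of the second character.
sub caret_to_char {
    my ($caret) = @_;
    my ($letter) = $caret =~ /^\^(.)$/
        or die "not caret notation: $caret";
    return chr( ord( uc $letter ) ^ 0x40 );
}

my $code = ord caret_to_char('^_');
printf "^_ is byte 0x%02X (octal %03o)\n", $code, $code;
# prints: ^_ is byte 0x1F (octal 037)
```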

May 22, 2009 at 03:50:37
I'd probably start by looking at the script that created the file to see how it constructed the lines.

Or I'd use od to dump a portion of the file.

head filename | od -c
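If od is not handy, roughly the same check can be done from Perl itself. This is only a sketch: show_controls is a hypothetical name, and the sample string is modelled on the line in the original post, assuming the wide gaps there are tabs.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $text with every control byte
# (including tab and newline) rendered as a visible \xNN escape.
sub show_controls {
    my ($text) = @_;
    (my $visible = $text) =~ s/([\x00-\x1F\x7F])/sprintf '\\x%02X', ord $1/ge;
    return $visible;
}

# Sample data only; in practice each line would come from the file.
print show_controls("AACA\x1FwOption\x1FAA\twDeleteDate\n"), "\n";
# prints: AACA\x1FwOption\x1FAA\x09wDeleteDate\x0A
```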

May 22, 2009 at 03:59:41
On a side note, this line has a problem.
else {open (NaE, ">> $notFoundLog") || die print NaE "$line[0]\.NaEwOption$line[1]";}

If the open call fails, you try to print the error message to the filehandle that failed to open? And, die and print should not be used in the same statement. die sends its output to STDERR and print sends it to the currently selected (or specified) filehandle.

You're also quoting lone variables, as in print BBO_1 "$line"; for info on why you should not do this:

perldoc -q quoting
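One way the open/die/print pieces could be separated is sketched below. It is not the poster's actual code: append_record, $path and $record are placeholder names, and the temporary file exists only so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: append one record to a log file. die reports a failed
# open on STDERR; print is only reached once the handle is known to be good.
sub append_record {
    my ($path, $record) = @_;
    open my $fh, '>>', $path
        or die "Cannot open '$path' for appending: $!";
    print {$fh} $record, "\n";
    close $fh or die "Cannot close '$path': $!";
    return;
}

# Stand-in for the not-found log; removed automatically at exit.
my (undef, $log) = tempfile( UNLINK => 1 );
append_record( $log, 'AACA.NaE' );
```

The three-argument open with a lexical filehandle also avoids surprises if the file name ever contains characters such as ">" or "|".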

May 22, 2009 at 18:03:14
Thanks Klint/Fishmonger

'od' indeed revealed that it is a Unit Separator (octal 037). Do you
guys know if it's possible to split the line up using this as a delimiter?

Thanks for the note on the 'open' call. I just threw that in there
to simulate what was being done with the line, for this post.
The actual code is different.
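On the question of splitting on the Unit Separator: split accepts \x1F in its pattern like any other character. The sketch below assumes the groups on each line are tab-separated, which the post does not confirm; parse_record and the sample line are made up for illustration.

```perl
use strict;
use warnings;

my $US = "\x1F";    # ASCII Unit Separator, shown as ^_ in caret notation

# Hypothetical helper: split a record into groups on tabs (an assumption
# about the file), then split each group into its parts on \x1F.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    return map { [ split /\x1F/, $_ ] } split /\t/, $line;
}

# Sample line built from two of the groups shown in the original post.
my $line = join "\t",
    join( $US, 'AACA',         'wOption', 'AA' ),
    join( $US, 'wIssueSymbol', 'string',  'AACA' );

my @groups = parse_record($line);
$groups[0][0] .= '.NaE';    # append to the leading symbol only

# Put the record back together with the original separators.
my $rebuilt = join "\t", map { join $US, @$_ } @groups;
print join( ' | ', @$_ ), "\n" for @groups;
```

With every field in its own array element, the second occurrence of the symbol (the wIssueSymbol value here) can be changed the same way, without splitting on "wOption" at all.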

May 30, 2009 at 13:46:33
If it helps anyone else, I found that the control character can
be entered as follows:

-hold down the <ctrl> key
-press the letter "v" (the caret '^' should appear)
-while continuing to hold down the <ctrl> key, press the letter or
 symbol you wish to use

In this case: <ctrl> + <v>, then <ctrl> + <shift> + <_>

Didn't know it was that simple. I was chasing down articles
on Perl and Unicode.
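Inside a Perl script the literal keystroke may not be needed at all, since the same byte can be written with an escape. A small sketch showing that the common spellings all produce byte 31 (the hash and its labels are just for illustration):

```perl
use strict;
use warnings;

# Four spellings of the Unit Separator; none requires typing a literal
# control character into the source file.
my %spelling = (
    'hex escape   "\x1F"' => "\x1F",
    'octal escape "\037"' => "\037",
    'control form "\c_"'  => "\c_",
    'chr(31)'             => chr(31),
);

for my $name ( sort keys %spelling ) {
    printf "%-22s -> byte %d\n", $name, ord $spelling{$name};
}
```

Each line should report byte 31, and the same escapes work inside a pattern, for example split /\x1F/, $line.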
