perl substitution speed

April 10, 2009 at 22:54:45
Specs: Windows XP
Hello World,

I am trying to determine the fastest way to insert a string into another string at a predefined location. I ran a profile comparing substr vs. regex, and substr is faster, but it seems like it could be even faster if substr had an insertion mode instead of replacement, so I wouldn't have to use two substr calls. I am also interested in how memory is handled with large strings. I could also work with files if that is a better way to handle this, e.g., if something can be shoved into the middle of a file at a specific position without having to read the rest of the file into memory. Here is the small example I used to profile, and the results:

#!/usr/bin/perl -w

#use Rexx qw( :all );

#sub insert {
#    my ($source, $target, $pos, $len, $char) = @_;
#
#    my $targlen = length $target;
#
#    ($pos <= $targlen) ? substr( $target, $pos, 0, $source )
#                       : ($target .= $char x ($pos - $targlen) . $source);
#    $target;
#}

BEGIN {
    #$num = 100000000;
    $num   = 1000000;
    $loops = 10000;

    $t = time;
    print "start substr \n";

    for ($i = 0; $i < $loops; $i++) {
        $string2 = " awesomely, " x $i;
        #print "string2 is $string2\n\n";
        $string = "happy!" x $num;
        #print "string before is $string\n\n";
        # insert $string2 at offset 10 via an lvalue substr assignment
        substr($string, 10) = $string2 . substr($string, 10);
        #print "string after is $string\n\n\n\n";
    }

    print "end substr\n";
    $t = time - $t;
    print "time elapsed was $t \n";
    $t = time;
    print "start regex\n";

    for ($i = 0; $i < $loops; $i++) {
        $string2 = " awesomely, " x $i;
        #print "string2 is $string2\n\n";
        $string = "happy!" x $num;
        #print "string before is $string\n\n";
        # insert $string2 at offset 10 via a regex substitution
        $string =~ s/^(.{10})(.*)$/$1$string2$2/;
        #print "string after is $string\n\n\n\n";
    }
    print "end regex\n";
    $t = time - $t;
    print "time elapsed was $t \n";
}

Here is the output:

start substr
end substr
time elapsed was 341
start regex
end regex
time elapsed was 415


Any comments on a better way to do this would be very much appreciated.

Thanks,
Mark


#1
April 11, 2009 at 06:05:48
Are you asking for a better way to profile/benchmark, or a better way to do the insertion?

Benchmark - benchmark running times of Perl code
http://search.cpan.org/~nwclark/per...

Devel::Profile - tell me why my perl program runs so slowly
http://search.cpan.org/~jaw/Devel-P...
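
If you just want to profile the script you already have, Devel::Profile can be loaded from the command line without touching the code. From memory it dumps its report to a file called prof.out, but double-check the module's docs:

perl -d:Profile your_test_script.pl
# then look through the prof.out file it leaves behind

(your_test_script.pl is just a placeholder for whatever your test file is called.)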



#2
April 11, 2009 at 13:03:29
I'm asking for a better way to do the insertion.

Thanks,
Mark



#3
April 11, 2009 at 16:17:23
The first thing to do is come up with a realistic test. Working with string lengths of 6,000,000 and 12,000,000, and a post-insertion length of 18,000,000, is absurd.

Start with a realistic string and length that represent the actual data you plan to use. Next, come up with a few possible solutions and wrap each of them in a sub. Then use the Benchmark module to handle the comparison.

Start with this script and build onto it as needed.

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw/cmpthese/;

my $string1 = 'A quick brown fox';
my $string2 = 'and big ';
my $pos = 8;

cmpthese(-1, {
               substring => \&substring,
               regex => \&regex,
});



sub substring {
    my $new_str = $string1;
    $new_str = substr($new_str, 0, $pos) . $string2 . substr($new_str, $pos);
}

sub regex {
    my $new_str = $string1;
    $new_str =~ s/(.{$pos})(.*)/$1$string2$2/;
}
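
If what you ultimately need is an insertion rather than a rebuild-by-concatenation, it would also be worth adding a third candidate that uses the 4-argument form of substr, which splices the new text in place:

sub substring_4arg {
    my $new_str = $string1;
    # replace a zero-length slice at $pos with $string2, i.e. an in-place insert
    substr($new_str, $pos, 0, $string2);
    return $new_str;
}

Then add substring_4arg => \&substring_4arg, to the cmpthese() hash.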



#4
April 12, 2009 at 01:34:14
Hi Fishmonger,

Thanks for taking the time to reply to my posts.

The strings are being used to parse a netlist, modify the parse tree, and rewrite the netlist with the alterations. Netlists can be very large. The alterations can be small and inserted into the original string at anchors calculated using pos during the parse.

As I alluded to in my original post, one of my concerns is memory usage. It would be nice to be able to handle files as if they were not streams, but I'm not sure if that would work for writing. I haven't had a chance to look into the details of the file system implementation. I'm pretty sure that substr is faster than the regex (although not by an order of magnitude or anything like that), and it will work. It will suffice for now, but I was hoping to at least find a way where I wouldn't have to read the string into memory twice, which is what I'm guessing my implementation above will do.

Ideally, I'd like to read it into memory once and then create a string that is a pointer to the two pieces and the new inserted piece (or something like that), and not have to think about the implementation thereof.
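
For what it's worth, the closest I've come to that idea so far is to not splice the big string at all, but to record the insertions during the parse and interleave the pieces only when writing the output. Roughly like this (untested sketch; @inserts would be [byte_offset, text] pairs collected from pos(), and netlist.out is a placeholder name):

# Write the original string with the insertions applied on the way out,
# without ever building the combined string in memory.
open(my $out, '>', 'netlist.out') or die "open: $!";
my $prev = 0;
for my $ins (sort { $a->[0] <=> $b->[0] } @inserts) {
    my ($pos, $text) = @$ins;
    print {$out} substr($string, $prev, $pos - $prev), $text;
    $prev = $pos;
}
print {$out} substr($string, $prev);   # whatever is left of the original
close $out;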

I'm also curious whether, if I use substr, the full string is read into memory or just the portion required, and when the stack is used vs. the heap, etc.

I know I could look at the Perl source and get specific answers, and will probably do that at some point; however, I was just hoping to get a quick answer as to a good (simple) way to handle it efficiently. If using files or some other mechanism would be better and could also be done as a one-liner, that would be great to know also.

Eventually, I will probably rewrite the parser using a compiler compiler and C or C++. Still, perl seems to have some good features that might port over well and be usable as a library, so I'd like to put a little time into learning it better.

If using substr as above is the most efficient, simple way of handling big strings, that is great to know too. (Also, it'd be great to know if there is a better, but still simple, way than using big strings to perform large file manipulation, etc.)

Thanks,
Mark



#5
April 12, 2009 at 06:57:23
I'm not familiar with netlists, but see if these help.

Verilog::Parser
http://search.cpan.org/~wsnyder/Ver...

Verilog::Netlist
http://search.cpan.org/~wsnyder/Ver...

http://search.cpan.org/search?query...
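
Their docs include synopses; from a quick look, the Verilog::Netlist one starts out roughly like this (untested here, so double-check the current docs):

use Verilog::Netlist;

# parse the source files into a netlist object
my $nl = Verilog::Netlist->new();
$nl->read_file(filename => 'your_design.v');   # placeholder filename
$nl->link();            # resolve references between modules
$nl->exit_if_error();

# walk the top-level modules
for my $mod ($nl->top_modules_sorted) {
    print $mod->name, "\n";
}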



#6
April 12, 2009 at 09:34:57
I had looked at Verilog-Perl, Icarus, et al., before I started. I'll look again to see if there are some good ideas I can glean.

Does anyone have any comments on handling large files and/or strings in general? I know I can break them up at the cost of a bit of complexity, and if I start to run out of memory or run into severe performance issues, that's probably what I'll do. Otherwise, is the substr($string, $pos) = $new . substr($string, $pos) approach the best way to handle an insertion into a large string? Does the full string get read into memory at each call? Are there other effects one should consider?
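
In case it helps to make that concrete, this is the kind of file-based insertion I was wondering about; only the tail after the insertion point gets read into memory, not the whole file (untested sketch; $pos, $insert, and netlist.v are placeholders):

open(my $fh, '+<', 'netlist.v') or die "open: $!";
my $pos    = 12_345;             # byte offset of the anchor (placeholder)
my $insert = ' awesomely, ';     # text to splice in (placeholder)
seek($fh, $pos, 0) or die "seek: $!";
my $tail = do { local $/; <$fh> };   # slurp from $pos to end of file
seek($fh, $pos, 0) or die "seek: $!";
print {$fh} $insert, $tail;          # rewrite the tail with the insertion in front
close $fh;

If the anchors were line numbers rather than byte offsets, Tie::File (which presents a file as an array of lines) might also be an option, though I don't know how well it behaves on files this size.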

Thanks,
Mark

