Split txt inputfile into n output files based on content

January 13, 2014 at 08:41:17
Specs: windows xp
Hi,
I am trying to create a bat file that would pick up input file if.csv and split it into 0-n output files named of_<groupid>.csv. I say 0-n because if.csv may contain no records at all which in case no output file is needed. if.csv is a csv file with ; (semicolon) as delimiter.
Input file if.csv is already sorted on the first delimited value (<groupid>) of each record with exception of the first record which is a header record. <groupid> is max 34 positions but can be shorter.

So basically what I am trying to do is get an output file per <groupid> with all records containg that <groupid>. Each output file should contain a copy of the if.csv header record as first record. Each output file would need to be created with <groupid> in the name, e.g.: of_<groupid1>.csv of_<groupid2>.csv etc.

Any help would be gratly appreciated


See More: Split txt inputfile into n output files based on content

Report •

#1
January 13, 2014 at 10:25:51
Beware, script not tested.

@echo off & setlocal EnableDelayedExpansion
for %%i in (if.csv) do if %%~zi equ 0 exit
set /P head=<if.csv
set group=???
for /F "skip=1 tokens=1,* delims=;" %%i in (if.csv) do (
  if not "!group!"=="%%~i" (
    echo.!head!> "of_%%~i.csv" 
   set group=%%~i
  )
 echo.%%i;%%j>> "of_%%~i.csv" 
)


message edited by IVO


Report •

#2
January 14, 2014 at 11:08:11
Thanks a lot, much appreciated!

Report •

#3
January 15, 2014 at 05:20:45
So this worked as a charm except for one problem with the header I didn't foresee and am struggling to solve.

The header in the output files contains only the first 1024 characters of the header record from the input file. What I hadn't realised is that the input file has a header that is longer then 1024 characters (2060 in total to be precise). When opening the input file in Windows Notepad (word wrap is switched off) the header record is actually spread over first three visible lines with respectively 1024, 1024 and 12 characters.
Fourth line is then the first data record which never exceeds 1024 characters.
In short..how should I adapt script to cater for the complete header (2060) to be copied to each output file instead of just first 1024 characters?


Report •

Related Solutions

#4
January 15, 2014 at 07:16:50
It's a limitation of the SET /P approach. If your lines are over 1KB in size, you need to use this clunkier method.
set head=
for /f "usebackq" %%a in ("if.csv") do if not defined head set head=%%a

How To Ask Questions The Smart Way


Report •

#5
January 15, 2014 at 14:51:29
If you're willing/interested in looking at other options besides a bat file, here's a Perl option. I'm sure this can be done just as easily in VB or any number of other languages, but Perl is my preferred choice.

open my $fh, '<', 'if.txt' or die "failed to open 'if.txt' <$!>";
my $header = <$fh>;

my %csv;
%csv = map { push $csv{ (split /;/, $_)[0]}, $_ } <$fh>;
close $fh;

foreach my $id ( keys %csv ) {
    open my $csv_fh, '>', "of_$id.csv" or die "failed to open 'of_$id.csv' <$!>";
    print {$csv_fh} $header, @{$csv{$id}};
    close $csv_fh;
}


Report •

#6
January 16, 2014 at 10:55:33
Here my enhanced version of the original script supporting large headers

@echo off & setlocal EnableDelayedExpansion
if not exist if.csv exit
for %%i in (if.csv) do if %%~zi equ 0 exit
echo.Splitting "if.csv", please wait...
for /F "delims=" %%i in (if.csv) do (echo.%%i> "head.csv" & goto :DONE)
:DONE
set group=???
for /F "skip=1 tokens=1,* delims=;" %%i in (if.csv) do (
  if not "!group!"=="%%~i" (
    echo.  Generating "of_%%~i.csv" 
    copy "head.csv" "of_%%~i.csv" > nul
    set group=%%~i
  )
  echo.%%i;%%j>> "of_%%~i.csv" 
)
del "head.csv"
echo.DONE


Report •

#7
January 17, 2014 at 04:56:46
Nice!

One small issue remaining (I didn't see it in initial test of the script) is that a space appears to be added at the end of each record except for the last record.

Is there a way to prevent that?

Thanks again everyone.

message edited by zouthoutthee


Report •

#8
January 17, 2014 at 08:19:19
The statement used to transfer each line of the source .csv to the target file

echo.%%i;%%j>> "of_%%~i.csv"

exactly moves each original character without stripping or adding any
thing as any guy skilled on batch sctipting can confirm. So the space must exist in the original file. It is possible to strip it away, but the issue of the last line may do the fix complex.

Report •

#9
January 17, 2014 at 09:48:45
I know it shouldn't happen but it does.
I'll let you know when I figure it out.

Report •

#10
January 17, 2014 at 10:23:47
Check your code, and make sure it doesn't have any errant spaces. If it's not exactly what IVO posted, post your version and I'm sure someone will find the problem.

IVO: as any guy skilled on batch sctipting can confirm.
Well, unless %%j ends in <space>1. Then it'd still get directed to stdout, and the ending 1 would be stripped. Any other digit would cause the line to not be saved to the file.

How To Ask Questions The Smart Way


Report •

#11
January 18, 2014 at 03:47:38
@Razor2.3, as usual you are right and I was aware of that, but decided to ignore the side effect to avoid cumbersome code. Indeed the safe format of the statement would be

(echo.%%i;%%j)>> "of_%%~i.csv"


Report •

#12
January 19, 2014 at 19:31:20
Yep. I'm technically correct, and I think we can all agree that's the best kind of correct.

How To Ask Questions The Smart Way


Report •

#13
January 19, 2014 at 21:28:54
Might want to check your source-files for "non-displayed" char.s such as zero or 255 (as in unicode etc.) Use DEBUG or EDIT /70 or other means to ensure that no binary stuff is in your source-file.

Report •

Ask Question