extract columns

August 2, 2010 at 11:23:23
Specs: Windows XP
I have a large space delimited text file (~1000 columns x ~30,000 lines) that I need to extract certain columns out of.

Starting with column #2 I need every 3rd column, so that if the column data was:
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4
I would get (columns #2, #5, #8, etc):
0.2 0.5 0.8 1.1 1.4

The simple way that I know how to extract columns is by doing the following:

For /F "tokens=1-5 delims= " %%1 in (input.txt) Do ( Echo %%2, %%5>> output.txt)

But I don't want to (or have time to) code this for 1000 columns, and I can't get anything to work.

After that I need to sum all the columns, but I don't think you can sum decimal values in batch.

Thanks.


See More: extract columns

Report •


#1
August 2, 2010 at 13:31:46
About extracting the columns the following (untested) would do the job

@echo off & setlocal EnableDelayedExpansion > output.txt
for /F "delims=" %%i in (input.txt) do (
  set cnt=2
  set row=
  for %%j in (%%i) do (
    set /A cnt-=1
    if !cnt! equ 0 (set cnt=3 & set row=!row! %%j)
  )
  set row=!row:~1!
  echo.!row!>> output.txt
)
:: End_Of_Batch

About summing the columns batch math enables integers only, but if the decimal digits are fixed (e.g. two digits) that is still possible.


Report •

#2
August 2, 2010 at 15:24:08
The code worked perfectly. Thanks.

The follow-up is how to make the math work for the new output file. Let's say everything is fixed decimal, how would you go about outputing a file that has the sum of all values in the row? Everything I come across says that it truncates to an integer so it would be nice to know.


Report •

#3
August 2, 2010 at 19:07:30
when parsing files and doing text processing and doing maths (and practically everything else), use a good text/maths programming tool. Eg Perl/Python/gawk ( even vbscript, etc)

Here's and example using gawk for windows
Eg
The one liner below prints column 2, 5, 8 etc

c:\> gawk "{ for(i=2;i<=NF;i+=3) print $i }" file

this one liner sums all the values in the row, including values with decimal points

c:\test>  gawk "{ for(i=1;i<=NF;i++) {sum+=$i} ;print sum;sum=0}" file

you can also combine the 2 statements into one

c:\test> gawk "{ for(i=2;i<=NF;i+=3) print $i  ; for(i=1;i<=NF;i++) {sum+=$i} ;print sum;sum=0} " file


Batch will not do for decimal maths, date manipulation and a whole lot of other limitations. If you want to be productive, use a good programming language. Or start to learn to use vbscript if you are prone to use native tools on windows.


GNU win32 packages | Gawk


Report •

Related Solutions

#4
August 3, 2010 at 02:11:30
Please, let me know you prefer to follow ghostdog's advice or to go ahead on the batch road.

In that case do you want the output file has an additional row holding the sum of each column values?

How many decimal digits are coded and are they always expressed (i.e. 1.10 not 1.1 or 2.00 not 2)?

Post some possible values as example.


Report •

#5
August 3, 2010 at 08:35:05
Thanks Ghostdog. I have obviously never utilized gawk before but it was useful. I ended up combining the 2: My final code for those that would like to know was:
:: This takes an input file and keeps every
:: third column starting with column #2:
:: if input.txt = 1 2 3 4 5 6 7 8
:: then output1.txt = 2 5 8
::
@echo off & setlocal EnableDelayedExpansion > output1.txt
for /F "delims=" %%i in (input.txt) do (
set cnt=2
set row=
for %%j in (%%i) do (
set /A cnt-=1
if !cnt! equ 0 (set cnt=3 & set row=!row! %%j)
)
set row=!row:~1!
echo.!row!>> output1.txt
)
:: This potion of the code takes the output
:: from above and sums the entire row
:: to a file output2.txt. It utilizes gawk.exe
gawk "{ for(i=1;i<=NF;i++) {sum+=$i} ;print sum;sum=0}" output1.txt>>output2.txt

The output2.txt is a single column file with the sum of the rows from output1.txt.


IVO,
I am curious as to how one would go about utilizing batch to solve the decimal summation problem. If you have a few moments to instruct me that would be great. For my example everything is always written to 2 decimals with a lot of 0.00 values, and ideally a new file output is best, but I could always just extract the column if it added it to the original. A sample is as follows:

0.30 0.30 0.20 0.00 0.00 0.10 0.60 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.00 0.00

Thanks again to all. I have come to depend on this programming forum for time saving programming tips.


Report •

#6
August 3, 2010 at 14:17:53
Just to show my power (LOL), obviously not so terse as ghostdog's gawk and probably slower... but built in plain Windows software.

Actually the complexity is due to the (in)famous leading 0 meaning octal. Beware more test is needed.

@echo off & setlocal EnableDelayedExpansion

type nul > output.txt
type nul > outsum.txt

for /F "delims=" %%i in (input.txt) do (
  set hit=0 & set cnt=2 & set sum=0 & set row=
  for %%j in (%%i) do (set /A cnt-=1 & if !cnt! equ 0 call :SUM %%j)
  echo.!row:~1!>> output.txt
  set /A int= !sum:~0,-2! - !hit!
  set sum=!int!.!sum:~-2,2!
  echo.!sum!>> outsum.txt
)
goto :EOF

:SUM
  set cnt=3 & set row=%row% %1
  set val=%1
  set val=%val:0.=1.%
  if not "%1"=="%val%" set /A hit+=1
  set val=%val:.=%
  set /A sum+=%val%
goto :EOF

:: End_Of_Batch


Report •

#7
August 3, 2010 at 15:31:02
Thanks for the response IVO.

I ran the all windows based version for one of my files (I already utilized the GAWK method for everything this morning) and it worked perfectly. The interesting thing, as you mentioned, was that the more compact gawk code ran significantly faster. I ran it on my smallest file and the gawk took about 20 seconds to run versus the all DOS based was about 6 minutes. But in the end they both worked perfectly and have there uses. thanks again to all.


Report •

#8
August 3, 2010 at 19:53:42
@OP, gawk is a neat little programming language. If you use it, there is no need to use the batch portion.Just do everything with gawk. Here's how

C:\test>gawk "{ for(i=2;i<=NF;i+=3){ t+=$i; print $i > \"output1.txt\"} }END{print t >\"output2.txt\"}" input.txt

C:\test>more output1.txt
2
5
8

C:\test>more output2.txt
15

The above say, as you calculate every 3rd column, sum them up using variable "t". At the same time, print every third column to output1.txt At the end of it all, print the value of "t" to output2.txt.

GNU win32 packages | Gawk


Report •

#9
August 3, 2010 at 20:22:11
@OP,

>>But in the end they both worked perfectly and have there uses. thanks again to all.

how does it work for you? I ran IVO's code, but my sum is negative

C:\test>more input.txt
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4

C:\test>test.bat

C:\test>more output.txt
0.2 0.5 0.8 1.1 1.4

C:\test>more outsum.txt
-3.70

GNU win32 packages | Gawk


Report •

#10
August 3, 2010 at 21:26:10
The code is set to a fixed decimal value of 2. If you simply add a zero to make everything 2 decimals it will work:

Input.txt:
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4
0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 1.10 1.20 1.30 1.40

Output.txt:
0.2 0.5 0.8 1.1 1.4
0.20 0.50 0.80 1.10 1.40

Outsum.txt:
-3.70
4.00

So it works for the second line. For the small file I referred to earlier I copied the final sum for the 2 different methods into excel and compared and the answers were identical. The one difference though was that the windows based returned a value with a fixed decimal (4.00) versus the gawk method returns a value without the extraneous zeros (4).


Report •

#11
August 3, 2010 at 21:34:51
Have you tested on all possible cases of decimal places? what if you have 3 decimal places? According to my results, 3 decimal places failed with IVO's code.

GNU win32 packages | Gawk


Report •

#12
August 4, 2010 at 09:11:47
From the looks of it, it is specific to 2 decimal places. But I believe that by modifying the following lines of code you could make it work for 3 decimal places, but it would need to be a fixed number of decimals for the entire file:

ORIGINAL:
set /A int= !sum:~0,-2! - !hit!
set sum=!int!.!sum:~-2,2!

MODIFIED:
set /A int= !sum:~0,-3! - !hit!
set sum=!int!.!sum:~-3,3!

I would assume you could do this for any number of fixed decimals (not verified).


Report •

#13
August 4, 2010 at 10:35:58
Ok, one last post to beat this to the ground. Being a novice I enjoy seeing the final code. I have added some comments for myself so that in 2 weeks when I have already forgotten about this I can at least know what I was attempting to accomplish. I have only done the windows based one here. The gawk code is so short and posted above that no explanation is needed. Here is my final windows based code with comments.

:: This batch will extract every third column starting with column #2
:: the columns in the file "output.txt will be 2 5 8 11 etc...
::
:: The batch then sums every row and outputs another text file outsum.txt
:: with a single column which is just the sum of the row.
:: The batch is setup for a space delimited input file.

@echo off & setlocal EnableDelayedExpansion
type nul > output.txt
type nul > outsum.txt

:: Modify the following code to change the starting column.
:: If you want to start in row 3 it would be:
:: set hit=0 & set cnt=3 & set sum=0 & set row=
for /F "delims=" %%i in (input.txt) do (
set hit=0 & set cnt=2 & set sum=0 & set row=
for %%j in (%%i) do (set /A cnt-=1 & if !cnt! equ 0 call :SUM %%j)
echo.!row:~1!>> output.txt

set /A int= !sum:~0,-2! - !hit!
set sum=!int!.!sum:~-2,2!
echo.!sum!>> outsum.txt
)
:: By editing the above lines you can modify the
:: number of decimals to be anything you need.
::
:: For 3 decimal places (original code written for 2 decimals) it would look like:
:: set /A int= !sum:~0,-3! - !hit!
:: set sum=!int!.!sum:~-3,3!

goto :EOF

:: If you want to change the column interval (to be every 5th column for instance)
:: change the following code to (original code written for every 3rd column):
:: set cnt=5 & set row=%row% %1
:SUM
set cnt=3 & set row=%row% %1
set val=%1
set val=%val:0.=1.%
if not "%1"=="%val%" set /A hit+=1
set val=%val:.=%
set /A sum+=%val%
goto :EOF

:: End_Of_Batch


Report •

Ask Question