Name: maxbre Date: March 20, 2008 at 08:01:07 Pacific Subject: batch merge/substitute tokens OS: win xp CPU/Ram: 2 gb
Comment:
Hi all you great masters of batch scripting!
Let's suppose I have a file named 'output_utm.txt' with 5 tokens per line, of which I need to process the first two (a rather complex computation, in fact, performed by another batch using a specific *.exe). The result of this computation on the first two tokens is then temporarily stored in a file called gbo.txt. Up to this point I have managed quite well, but now comes the tricky part: I need to merge/substitute the new tokens into the corresponding positions of the old file 'output_utm.txt' and save everything in a new file, 'output_gbo.txt'. My rough attempt is the following:
::batch merge/substitute tokens
@Echo Off > output_gbo.txt
setLocal EnableDelayedExpansion
For /F "skip=2 tokens=1-5" %%a in (output_utm.txt) Do (
  set /a N+=1
  if !N! equ 1 (
    echo %%a %%b %%c %%d %%e >> output_gbo.txt
  ) else (
    For /F "skip=1 tokens=1-2" %%f in (gbo.txt) Do (
      echo %%f %%g %%c %%d %%e >> output_gbo.txt
    )
  )
)
:: end
But obviously this messes everything up, because the nested 'for' produces a 'cartesian product' of the lines, so I need a way to restrict it somehow... Any help? Thank you, max
to m2 sorry it's my fault I did not explain myself well;
I have the file a.txt, with headers a b c d e:
a b c d e
a1 b1 c1 d1 e1
a2 b2 c2 d2 e2
a3 b3 c3 d3 e3
a4 b4 c4 d4 e4
a5 b5 c5 d5 e5
I have the file b.txt, with headers a b:
a b
a1new b1new
a2new b2new
a3new b3new
a4new b4new
a5new b5new
I want to get the file c.txt like this:
a b c d e
a1new b1new c1 d1 e1
a2new b2new c2 d2 e2
a3new b3new c3 d3 e3
a4new b4new c4 d4 e4
a5new b5new c5 d5 e5
to ghostdog: I very much appreciate your hints on gawk; I have already installed it and I'm really struggling to use it, but for now I do not feel very confident with it (nor with batch, actually); I'm still in the middle of the gawk manual! In any case your help is a good way to learn this powerful language; thank you again for that. At the moment I'm also interested in comparing different solutions (batch vs gawk) and learning as much as possible...
you can read the gawk user guide here: http://www.gnu.org/software/gawk/manual/gawk.html
using the above code, it produces your desired output.
c:\test>gawk -f script.awk file1 file2
a b c d e
a1new b1new c1 d1 e1
a2new b2new c2 d2 e2
a3new b3new c3 d3 e3
a4new b4new c4 d4 e4
a5new b5new c5 d5 e5
gawk first processes file1, then file2. FNR is the input record number in the current input file; NR is the total number of input records seen so far. So FNR==NR is true only while the first file is being read, i.e. it selects all records of the first file. Arrays in gawk are associative arrays. The default field separator, if none is specified, is whitespace. Fields are denoted by $1, $2 and so on: $1 is the first field, $2 is the second field, and so on.
therefore, a[FNR]=$3" "$4" "$5 means to store in array "a" the 3rd field, followed by a space, followed by the 4th field, and so on. This means: a[1] = c d e, a[2] = c1 d1 e1, etc. The statement "next" skips the remaining rules and goes to the next record.
{ print $1,$2,a[FNR] }
means to print the 1st and 2nd fields of each record of the next file, i.e. file2, followed by a[FNR], whose values are already stored in array "a"
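For anyone who wants to compare the logic with another language, here is a minimal Python sketch of the two-pass merge explained above. It is not ghostdog's script: the function name merge_first_two and the in-memory sample lines are made up for illustration. It stores fields 3 to 5 of each line of the first file by line number, then emits fields 1 and 2 of each line of the second file followed by the stored tail.

```python
def merge_first_two(full_lines, new_lines):
    """Replace the first two fields of each line in full_lines.

    full_lines: lines with five whitespace-separated fields (like a.txt).
    new_lines:  lines with the two replacement fields (like b.txt).
    Lines are paired by position, the way a[FNR] pairs them in the explanation.
    """
    # First pass: remember fields 3..5 of every line, keyed by line number.
    tails = {}
    for line_no, line in enumerate(full_lines, start=1):
        fields = line.split()
        tails[line_no] = " ".join(fields[2:5])

    # Second pass: new fields 1 and 2, followed by the stored tail.
    merged = []
    for line_no, line in enumerate(new_lines, start=1):
        fields = line.split()
        merged.append(f"{fields[0]} {fields[1]} {tails.get(line_no, '')}")
    return merged


# Sample data shaped like a.txt and b.txt (shortened to three lines).
a_txt = ["a b c d e", "a1 b1 c1 d1 e1", "a2 b2 c2 d2 e2"]
b_txt = ["a b", "a1new b1new", "a2new b2new"]

for row in merge_first_two(a_txt, b_txt):
    print(row)
```

Each file is read exactly once, which is why this approach avoids the cartesian product of the nested 'for'.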
@echo off > output_gbo.txt
setlocal enabledelayedexpansion

set count=0
for /f "tokens=1-5" %%a in (output_utm.txt) do (
  set /a count+=1
  if !count! equ 1 (
    echo.%%a %%b %%c %%d %%e>> output_gbo.txt
  ) else (
    rem number the lines of gbo.txt and keep only the one matching the current line
    for /f "tokens=1-3 delims=[] " %%f in ('type gbo.txt ^| find /N /V ""') do (
      if !count! equ %%f echo.%%g %%h %%c %%d %%e>> output_gbo.txt
    )
  )
)
:: End_Of_Batch
I suggest you follow ghostdog's tips, as the above batch is highly inefficient. You can solve this issue with a batch, but it takes a modified logic.
yes, I do agree with all of you; I promise I'll try to learn gawk (it really seems to be the proper language for my frequent text manipulation tasks); but on the other hand I'm still curious to explore the 'batch limits' and I'll let you know the results of my thoughts... Thank you all for now, and a happy Easter to everybody! max
ps to ivo: what do you mean by 'modified logic'? Does it mean you have 'steered' my really goofy attempt in a direction that could work, but that you would otherwise have found a different batch solution? If this is the case, I would be really happy to know how you would have tackled the problem (imagining you had just the batch solution in your powerful hands...) bye again
Let me summarize the initial problem: there is a file named output_utm.txt with this layout
a b c d e
a1 b1 c1 d1 e1
a2 b2 c2 d2 e2
a3 b3 c3 d3 e3
a4 b4 c4 d4 e4
a5 b5 c5 d5 e5
whose first two elements have to be replaced via a complex computation performed by an external .exe program. The result is the output_gbo.txt file outlined here
a b c d e
a1new b1new c1 d1 e1
a2new b2new c2 d2 e2
a3new b3new c3 d3 e3
a4new b4new c4 d4 e4
a5new b5new c5 d5 e5
maxbre's proposed solution is:
Read output_utm.txt, generate the computed tokens for each row and store them in an intermediate gbo.txt file
a b
a1new b1new
a2new b2new
a3new b3new
a4new b4new
a5new b5new
then merge it with the native one to build the required target file output_gbo.txt.
The problem arises because a batch script can browse only one file at a time, so the merge requires reading the whole other file for each line of the main one. This approach is implemented in the following code (tested and working)
for /f "tokens=1-6 delims=[] " %%a in (utm.tmp) do (
  if %%a equ 1 (
    echo.%%b %%c %%d %%e %%f>> output_gbo.txt
  ) else (
    for /f "tokens=1-3 delims=[] " %%g in (gbo.tmp) do (
      if %%a equ %%g echo.%%h %%i %%d %%e %%f>> output_gbo.txt
    )
  )
)
del *.tmp
:: End_Of_Batch
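The nested scan can be pictured with the following Python sketch. It is only an illustration, not part of the batch: the name nested_merge and the (line number, fields) tuples are assumptions standing in for the numbered lines of utm.tmp and gbo.tmp. Every non-header line of the main file triggers a full pass over the numbered gbo lines, and only the line with the matching number is kept.

```python
def nested_merge(utm_rows, gbo_rows):
    """Merge by line number with a full inner scan, like the nested 'for'.

    utm_rows: list of (line_no, [five fields])
    gbo_rows: list of (line_no, [two new fields])
    Returns the merged lines and the number of inner-loop reads.
    """
    merged = []
    inner_reads = 0
    for utm_no, utm_fields in utm_rows:
        if utm_no == 1:
            # Header line: copied unchanged, no inner scan.
            merged.append(" ".join(utm_fields))
            continue
        for gbo_no, gbo_fields in gbo_rows:
            inner_reads += 1  # every gbo line is read again for each utm line
            if gbo_no == utm_no:
                merged.append(" ".join(gbo_fields + utm_fields[2:]))
    return merged, inner_reads


utm = [(1, "a b c d e".split()),
       (2, "a1 b1 c1 d1 e1".split()),
       (3, "a2 b2 c2 d2 e2".split())]
gbo = [(1, "a b".split()),
       (2, "a1new b1new".split()),
       (3, "a2new b2new".split())]

lines, reads = nested_merge(utm, gbo)
print(lines)
print("inner reads:", reads)
```

The inner read count grows roughly with the square of the number of lines, which is the cost this approach pays for matching the two files line by line.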
Given that output_utm.txt has N lines, the bulk of the I/O is
N (numbering utm) + N (numbering gbo) + N (reading utm) + N*N (merging) + N (previous reading of utm) = 4N + N^2
E.g. for N=100 the I/O is 10400
A better way is to read the input file and generate the target one on the fly as the I/O now is just N (or 2N as we will see).
The suggested script is (tested and working too)
@echo off > output_gbo.txt
setLocal EnableDelayedExpansion
set tk=?
for /f "tokens=1-5" %%a in (output_utm.txt) do (
  if !tk!==? (
    echo.%%a %%b %%c %%d %%e>> output_gbo.txt
    set tk=#
  ) else (
    call maxsub %%a %%b
    set /P tk=<gbo.tmp
    echo.!tk! %%c %%d %%e>> output_gbo.txt
  )
)
del gbo.tmp
:: End_Of_Batch
where maxsub is the batch performing the token manipulation, here simulated by
@echo.%1new %2new>gbo.tmp
In real life it would embed the .exe program that performs the computation.
Because a called batch script invokes a secondary command processor, its results have to be passed back using a temporary file holding just one line (that accounts for N I/O).
In the above case the total I/O for N=100 is just 200, as opposed to the 10400 of the first script: a dramatic improvement (for really large tabular files).
Now the lesson is over; remember: even scripts have to deal with performance factors.
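The arithmetic behind these two figures can be checked with a few lines of Python. This is just a sketch of the counts stated in this post; the function names io_nested and io_on_the_fly are made up for the example.

```python
def io_nested(n):
    """I/O count of the nested-loop script: 4N + N^2."""
    numbering_utm = n
    numbering_gbo = n
    reading_utm = n
    merging = n * n
    previous_reading_utm = n
    return (numbering_utm + numbering_gbo + reading_utm
            + merging + previous_reading_utm)


def io_on_the_fly(n):
    """I/O count of the on-the-fly script: N input lines + N temp-file reads."""
    return 2 * n


for n in (100, 1000):
    print(n, io_nested(n), io_on_the_fly(n))
```

The quadratic term dominates quickly, so the gap between the two scripts widens as the input file grows.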
what a great lesson from all of you! Thank you again, I think I have enough brain food here for the next 3 years, assuming a good 'performance factor' on my side in understanding your help!
sorry for going back to this long and troubling story, but real life is unfortunately always much more complicated...
that's because when running the following:
::merge.bat
@echo off > output_gbo.tmp
setLocal EnableDelayedExpansion

set tk=?
for /f "skip=2 tokens=1-5" %%a in (output_utm.txt) do (
  if !tk!==? (
    echo.%%a %%b %%c %%d %%e>> output_gbo.tmp
    set tk=#
  ) else (
    call traspunto.bat
    set /p tk=<gbo.tmp
    echo.!tk! %%c %%d %%e>> output_gbo.tmp
  )
)
::end of batch

::traspunto.bat
:: this batch is performing calculation on first two tokens
traspunto.exe f pia pia en utm.tmp ED50 ROMA40 32 O 32 O >gbo.tmp
::end of batch
I ran into the problem that echoing !tk! always reproduces just the tokens of the first row of the file gbo.tmp: why? (Please consider that I must necessarily go through the creation of the intermediate gbo.tmp.) ...In the end it seems that batch is definitely not the proper solution for this case... but I still have to explore m2's post #10, which is really hard in some parts; I will eventually let you know... thanks again, max
I guess you have not understood at all how my "optimized" script works. It passes the first two tokens of each row of output_utm.txt to the batch subroutine "maxsub" via the actual parameters %%a %%b
call maxsub %%a %%b
If you rename "maxsub" to "traspunto.bat", where are its tail parameters, and how can traspunto.exe be aware of the tokens currently being processed?
traspunto.exe f pia pia en utm.tmp ED50 ROMA40 32 O 32 O >gbo.tmp
This command is a nightmare for me, but I do not see formal parameters in it (%1 for %%a and %2 for %%b), and moreover, why is utm.tmp referenced here?
What a mess!
Anyway, if you explain the meaning of the command tail to me, maybe I can give you a tip.
Last but not least, it is not a good idea to name the subroutine "traspunto.bat" when there is also a "traspunto.exe": the native suffix precedence is .com .exe .bat, which may lead to a big mess.
You are perfectly right, but I was forced to go through the mess I posted in #13. The main reason is that the program 'traspunto.exe' only works with an input file, here named 'utm.tmp' (passed with the obscure parameter 'f' in the command tail inside traspunto.bat, along with many other strange parameters which I think are not very important to explain here; they are just options specifying the type of transformation of geographic coordinates from one reference system to another). I tried echoing the results of 'traspunto.exe' in order to stay in the direction you suggested, i.e. generating the target on the fly, but I was unsuccessful. 'traspunto.exe' is a program accepting an input file, here named utm.tmp (composed of just two fields, x and y), whose final output (the calculation on the two fields x and y) I have redirected to the file named gbo.tmp. Do you think it is possible to redirect the final output to standard output (the screen) and so catch the results for these two fields on the fly? Thank you again for your patience and precious help, and sorry for wasting your time with my trivial questions.
Max
PS to M2: I finally managed to understand the code you posted in #10, for which I thank you greatly (but what hard work it was for me!). I will keep the code, among many others posted here, as a reference; it works perfectly, but unfortunately it suffers from the same (in)efficiency problem highlighted by ivo (about a 10 min run to complete the task with my real data, 1600 rows); nevertheless I consider it a great lesson in batch scripting.
I don't understand the exact behavior of traspunto.exe:
- Is it forced to read the whole utm.tmp when running, creating the resultant output file holding the transposed lines from the input, or does utm.tmp contain just ONE row with two tokens?
- Where does utm.tmp come from in your #13 post, and how is it related to output_utm.txt?
- As far as I can see, there is no f parameter on the traspunto.bat tail; indeed there are no parameters at all.
It is not right to say that batch is not the way to go here, especially if it is the scripting language you master best; it just needs to be properly exploited (on a dramatic night years ago I was forced to code a utility in Fortran (!) on the fly to get a financial system restarted as soon as possible).
P.S.: Ten minutes is not a biblical time for a math run.
OK, now sit back, because this is the complete story (hoping not to bore you too much!). I'm working on environmental modelling of air pollution, and for that I'm using software compiled in Fortran (please do not ask me to modify it, because it really is a nightmare: more than 5000 lines of code with hundreds of includes!). The final output of a run is a file called output_utm.txt containing, among others, two fields X and Y with geographic coordinates (longitude and latitude) referring to a specific geographic reference system called UTM (Universal Transverse Mercator).
The file utm.tmp comes from output_utm.txt: it is the extraction of the two tokens X and Y, with about 1600 rows (i.e. the information to be processed by traspunto.exe). traspunto.exe is a compiled Fortran executable whose source code I UNFORTUNATELY do not have access to; it is a third-party exe performing a really complex transformation of coordinates (Fourier transformation). It is forced to read the whole utm.tmp when running, creating the resultant output file holding the transposed lines from the input.
The following command:
traspunto.exe f pia pia en utm.tmp ED50 ROMA40 32 O 32 O >gbo.tmp
means: apply traspunto.exe to the file (f) utm.tmp in order to transform planar (pia pia) geographic coordinates in two fields X Y, longitude and latitude (en), from one reference system (ED50 32 O) to another (ROMA40 32 O), and finally store the results in the file gbo.tmp. The input file utm.tmp must have a fixed format: just two fields X and Y, with rows containing the information to be processed.
Finally, I need to merge/substitute the information in gbo.tmp into output_utm.txt and save the result in a file output_gbo.txt, to be imported into a GIS (geographic information system) that works only with a specific reference system (GBO, Gauss-Boaga).
Now you may wonder why I'm not writing a piece of Fortran code to perform this task and am instead hanging on so obstinately to batch scripting; that is quickly said: I need to redistribute the tool to others who are not very familiar with programming (can you imagine someone less familiar than me?) and give them a one-click solution. Bye Max
ps ten minutes repeated hundreds of times makes a big difference!
the following slightly modified batch should be tailored to your purpose. It generates a one-row intermediate file (utm.tmp) for each native input line to be transposed. There is no need to call subroutines, just the traspunto.exe application. If the run time is acceptable and the solution correct, you are done; otherwise let me know, as there are other ways ahead.
::merge.bat
@echo off > output_gbo.tmp
setLocal EnableDelayedExpansion

set tk=?
for /f "skip=2 tokens=1-5" %%a in (output_utm.txt) do (
  if !tk!==? (
    echo.%%a %%b %%c %%d %%e>> output_gbo.tmp
    set tk=#
  ) else (
    rem write the current pair of tokens as a one-row input file
    echo.%%a %%b> utm.tmp
    traspunto.exe f pia pia en utm.tmp ED50 ROMA40 32 O 32 O >gbo.tmp
    set /p tk=<gbo.tmp
    echo.!tk! %%c %%d %%e>> output_gbo.tmp
  )
)
rem delete only the intermediate files; "del *.tmp" would also remove output_gbo.tmp
del utm.tmp gbo.tmp
::end of batch
there are indeed alternative ways to handle this problem in batch, should the proposed one fail, but to quote a sci-fi movie, "the future must be left unknown". Let us just say the roads not taken are quite tricky.
Your questions are unconventional and challenging, so besides posting future questions here, you may also contact me by e-mail using the Computing.net messaging system (in a private message, speaking our mother tongue).