Extract from huge XML file

October 11, 2010 at 20:36:41
Specs: Windows XP
Hi,
This is my first time working with shell programming, and I have been struggling with the following issue for some time without finding a proper solution. I would really appreciate any help.

There is a huge XML file (about 1.3M or even larger) consisting of only ONE line. It looks like:
<?xml version="1.0" encoding="UTF-8"?><ZIMGMA01>A</ZIMGMA01>..........<ZALLF>X</ZALLF>

The problem is that the content of the XML file is sometimes duplicated, like:
<?xml version="1.0" encoding="UTF-8"?><ZIMGMA01>A</ZIMGMA01>..........<ZALLF>X</ZALLF><?xml version="1.0" encoding="UTF-8"?><ZIMGMA01>A</ZIMGMA01>..........<ZALLF>X</ZALLF>

What I want to do is remove the duplicated part of the content. The command I'm using now is as follows:
awk '{a=index($0,"><?xml"); if (a) print substr($0,1,a); else print $0}' "${DirName}/backup/${FileName}" > "${DirName}/${FileName}"

But on our Unix server, awk cannot process input longer than 3,000 bytes. I don't know how to solve this problem.

Thanks & Best Regards



#1
October 16, 2010 at 05:40:49
Try sed. Something like:

sed 's/<?xml.*<?xml/<?xml/' thefile.xml



#2
October 16, 2010 at 09:40:00
I really appreciate your kind help. I tried the sed command, but it seems that sed can only handle content that ends with '\r' or '\n'. However, this XML file contains only one line, with no '\r' or '\n' at the end.

#3
October 16, 2010 at 11:51:56
Works on my Linux box with no line ends.

Win XP? Where does your 'sed' come from? Or 'awk' for that matter.



#4
October 17, 2010 at 07:52:07
I am working on HP-UX. sed produces no output at all, and awk fails with an error message about an input limit of 3000 bytes.


#5
October 18, 2010 at 12:00:48
The GNU versions of the commands (e.g. on Linux) do not have the input line buffer limitations that the legacy versions (e.g. on HP-UX, Solaris, etc.) have.



#6
October 18, 2010 at 17:37:02
I knew that. The question is how to solve this problem on HP-UX. Thanks.


#7
October 18, 2010 at 19:25:03
Although I personally have never done it, I have heard of folks downloading the GNU tools for their particular unix version.

If that is of interest, you might try searching http://www.gnu.org or at an HP forum site:

http://forums13.itrc.hp.com/service...



#8
October 18, 2010 at 19:48:13
Really appreciate your help. Thanks much.


#9
October 19, 2010 at 17:50:33
What other tools do you have? If you have Perl or Python, you might be able to read the whole file in as a string. And have you tried nawk instead?
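
If Python happens to be available on the HP-UX box, the "read the whole file in as a string" idea could look roughly like the sketch below. It is only an illustration, not something tested on HP-UX: it assumes Python 3, assumes the file fits in memory (1.3M easily does), and assumes the duplicate always starts at a second "<?xml" declaration, as in the examples above. The function name strip_duplicate is made up for this example.

```python
import sys

XML_DECL = b"<?xml"  # marker that starts each copy of the document


def strip_duplicate(data: bytes) -> bytes:
    """Return everything before the second XML declaration, if there is one.

    Works on raw bytes, so there is no per-line buffer limit and the
    missing trailing newline does not matter.
    """
    first = data.find(XML_DECL)
    if first == -1:
        return data  # no declaration at all: leave the content alone
    second = data.find(XML_DECL, first + 1)
    if second == -1:
        return data  # only one copy: nothing to remove
    return data[:second]


# Hypothetical command-line use: python dedupe_xml.py <input> <output>
if __name__ == "__main__" and len(sys.argv) == 3:
    src, dst = sys.argv[1], sys.argv[2]
    with open(src, "rb") as f:
        content = f.read()
    with open(dst, "wb") as f:
        f.write(strip_duplicate(content))
```

Reading from the backup copy and writing to a different path, as the original awk command does, avoids truncating the input before it has been read.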


