How to Extract Data from a Corrupt DOCX File


By: mmcconaghy
December 24, 2015

The newer .docx files are actually compressed ZIP archives.
If all else fails, you may be able to recover some, if not all, of your data by going into the archive and extracting it by hand.

If you wish to try, this process has worked for me in the past:

Open the .docx file with an archive program such as 7-Zip, which is open source. Any other ZIP utility should work as well.

Once you have the file open, you should see three directories and one file:

     docProps
     word
     _rels
     [Content_Types].xml
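
If you would rather check the archive contents from a script than from an archive program, the listing above can be reproduced with a short Python sketch. This is only an illustration: the function name list_docx_parts is made up for this example, and it assumes the ZIP structure is still intact enough for Python's standard zipfile module to read. A badly damaged file will raise zipfile.BadZipFile instead.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of all entries stored in the .docx archive.

    list_docx_parts is a hypothetical helper name used for this example.
    """
    # A .docx file is an ordinary ZIP archive, so the standard zipfile
    # module can open it directly.
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Calling list_docx_parts on your damaged file should return entries such as word/document.xml and [Content_Types].xml if the archive is readable.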

Open the directory: word

Inside the word directory, there should be a file called: document.xml

This is the file that contains your text, along with a lot of XML markup, which you will have to remove.

Copy the file document.xml to someplace you can work on it.
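
The copy step can also be done in Python, which may help if your archive program refuses to extract the file. The sketch below is an assumption-laden example, not a guaranteed fix: extract_document_xml and both path arguments are hypothetical names, and it only works when zipfile can still read the archive.

```python
import zipfile


def extract_document_xml(docx_path, out_path):
    """Copy word/document.xml out of the archive; return the bytes written.

    extract_document_xml is a hypothetical helper name used for this example.
    """
    # Read the raw bytes of the main document part from the archive.
    with zipfile.ZipFile(docx_path) as archive:
        data = archive.read("word/document.xml")
    # Write them unchanged to a separate working copy.
    with open(out_path, "wb") as out:
        out.write(data)
    return len(data)
```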

Next, open the document.xml file in a text editor such as jEdit or Notepad++.

(Regular Notepad or WordPad will not accept the search string below, because it is a regular expression.)

To start, open Find and Replace, turn on regular-expression search mode, and search for:
    <[^>]+>
Leave the Replace field empty so that every match is replaced with nothing.

The search string above matches a "<", then any run of characters other than ">", then the closing ">". It should remove most, if not all, of the XML markup.
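
The same search-and-replace can be scripted. Here is a minimal Python sketch using the same pattern; the function name strip_tags is made up for this example. It adds two steps that are not part of the recipe above and are my own assumptions about what is useful: it turns each paragraph end tag (</w:p>) into a line break so paragraphs do not run together, and it decodes entities such as &amp; back into plain characters.

```python
import html
import re


def strip_tags(xml_text):
    """Remove XML tags from the text of document.xml.

    strip_tags is a hypothetical helper name used for this example.
    """
    # Extra step (an assumption, not in the original recipe): keep
    # paragraph boundaries by turning each paragraph end tag into a newline.
    text = xml_text.replace("</w:p>", "\n")
    # Same pattern as the editor search: "<", anything but ">", then ">".
    text = re.sub(r"<[^>]+>", "", text)
    # Extra step (also an assumption): decode entities such as &amp; and &lt;.
    return html.unescape(text)
```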

Go through the whole document and make sure you got it all, then save the result under a new file name. You can then import that text into a new Word file.

If you know XML, you may have a better method. If you do, I'd like to hear about it.
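
One XML-aware possibility, offered as a sketch rather than something tested on damaged files: parse document.xml with Python's standard xml.etree.ElementTree module and collect the text runs (w:t elements) of each paragraph (w:p element). The function name paragraphs_from_xml is hypothetical. Note the limitation: a real XML parser only works if document.xml is still well-formed, so on a truncated or damaged file it will raise a ParseError, and the plain search-and-replace above remains the fallback.

```python
import xml.etree.ElementTree as ET

# Namespace that the w: prefix refers to in document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_from_xml(xml_data):
    """Return one string per w:p paragraph, built from its w:t text runs.

    paragraphs_from_xml is a hypothetical helper name used for this example.
    xml_data is the content of document.xml (bytes read from the file work).
    """
    root = ET.fromstring(xml_data)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join every text run inside this paragraph, in document order.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Joining the returned list with line breaks gives plain text that can be pasted into a new Word file.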

Good luck!

