Docx extraction may be incomplete ("Binary file (standard input) matches") #13

Open
andreaswachowski opened this issue Apr 23, 2022 · 0 comments


The code to extract text from docx files includes the following snippet:

unzip -p #{to_shell(file_path)} | grep ...

However, streaming the docx data to grep can cause a problem when the docx zip archive contains binary data (e.g. JPEGs):

With the pipe, grep does not process the archive file by file; it sees the whole archive as one continuous stream and reads it in chunks. How exactly the stream is split into chunks is AFAIK uncontrollable: it depends on the pipe buffering logic and on how the OS switches between the concurrent processes on the left and right side of the pipe.

A single chunk can therefore contain both matching text and binary data. When that happens, the whole chunk is discarded with the message "Binary file (standard input) matches" (see the description of the option "--binary-files" in https://www.gnu.org/software/grep/manual/grep.html), and hence the extraction of text is incomplete.

This can be very difficult to detect: I had a case where the extraction almost always worked. Very rarely, the result showed some extracted text followed by "Binary file (standard input) matches", with the remaining data truncated. When it failed, it turned out that word/document.xml had been processed properly, but word/footer2.xml and a JPEG file had been read as one chunk. It happened so rarely because the footer file and the JPEG file were almost 2MB apart in the archive. Only under exceptional circumstances did the I/O end up processing footer2.xml and the JPEG in one chunk.
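
Because the failure is intermittent, one way to make it visible is to check the extracted text for grep's notice. The following is a minimal sketch, not part of doc_ripper: the constant and method names are hypothetical, and it assumes the notice appears on its own line in the captured output, as in the case described above.

```ruby
# Hypothetical helper (not part of doc_ripper): flags extraction output
# that contains grep's binary-match notice.
# In Ruby, ^ and $ anchor at line boundaries, so the notice is found
# even when it follows lines of correctly extracted text.
BINARY_MATCH_NOTICE = /^Binary file .* matches$/

def truncated_by_grep?(output)
  output.match?(BINARY_MATCH_NOTICE)
end

puts truncated_by_grep?("Hello world, this is a test\r\n")
puts truncated_by_grep?("Some text\nBinary file (standard input) matches\n")
```

A document whose own text contains such a line would be flagged as well, so this is a diagnostic aid rather than a fix.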

I can reproduce this fairly regularly (perhaps 20-50% of the time) with docx_with_image.docx on the Unix command line, using a one-liner shell script containing

unzip -p "$1" | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

Sometimes the output is (correctly) Hello world, this is a test, sometimes it is incorrectly Binary file (standard input) matches.
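
For comparison, the filtering stages of that one-liner can be approximated in Ruby, which does not apply grep's binary-file heuristic to the stream. This is only a sketch of a swapped-in technique, not how doc_ripper works: the method name is hypothetical, and it assumes the output of unzip -p has already been captured into a string.

```ruby
# Hypothetical helper (not part of doc_ripper): a Ruby stand-in for
#   grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
def extract_wt_text(raw)
  raw.scrub('')                                          # drop bytes that are not valid UTF-8
     .each_line
     .select { |line| line.include?('<w:t') }            # keep lines containing <w:t
     .map { |line| line.delete_suffix("\n").gsub(/<[^<]*>/, '') } # strip tags
     .reject { |line| line.match?(/\A[[:space:]]*\z/) }  # drop blank lines
end

sample = "<w:p><w:r><w:t>Hello world, this is a test</w:t></w:r></w:p>\n" \
         "\xFF\xD8\xFF\xE0 JFIF\n"
puts extract_wt_text(sample)
```

Binary bytes on a line that happens to contain <w:t would still end up in the result, so restricting what gets unzipped remains the more direct remedy.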

Replacing spec/fixtures/lorem.docx with the above docx file and explicitly testing for the text will also sometimes reveal the failure:

index 2602b61..a7fb0da 100644
--- a/spec/doc_ripper/doc_ripper_spec.rb
+++ b/spec/doc_ripper/doc_ripper_spec.rb
@@ -48,7 +48,7 @@ module DocRipper

       it 'should respond with text to valid file extensions' do
         expect(DocRipper.rip(doc_path)).not_to eq(nil)
-        expect(DocRipper.rip(docx_path)).not_to eq(nil)
+        expect(DocRipper.rip(docx_path)).to eq("Hello world, this is a test\r\n")
         expect(DocRipper.rip(pdf_path)).not_to eq(nil)

Failures:

  1) provide a clean api to return the text from a document #rip should respond with text to valid file extensions
     Failure/Error: expect(DocRipper.rip(docx_path)).to eq(docx_text)

       expected: "Hello world, this is a test\r\n"
            got: "Binary file (standard input) matches\n"

       (compared using ==)

       Diff:
       @@ -1 +1 @@
       -Hello world, this is a test
       +Binary file (standard input) matches

     # ./spec/doc_ripper/doc_ripper_spec.rb:59:in `block (3 levels) in <module:DocRipper>'

I suppose a valid fix would be to unzip only the XML files (though I am not proficient enough in Office Open XML to know whether all relevant text is found only there):

unzip -p #{to_shell(file_path)} '*.xml' | grep ...
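
As a rough sketch of how such a command could be assembled in Ruby: the helper name below is hypothetical, and Shellwords.escape stands in for the project's to_shell, whose exact behavior is not shown here. The pattern is escaped so that the shell passes it to unzip unexpanded.

```ruby
require 'shellwords'

# Hypothetical helper (not part of doc_ripper). Shellwords.escape is an
# assumption standing in for the project's own to_shell.
# Builds an unzip command that streams only archive members matching
# the given pattern, so binary members such as JPEGs never reach grep.
def docx_xml_stream_command(file_path, pattern = '*.xml')
  "unzip -p #{Shellwords.escape(file_path)} #{Shellwords.escape(pattern)}"
end

puts docx_xml_stream_command('my report.docx')
# prints: unzip -p my\ report.docx \*.xml
```

The resulting string could then be piped into the existing grep stage unchanged.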