The code to extract text from docx files includes the following snippet:
unzip -p #{to_shell(file_path)} | grep ...
However, streaming the docx data to grep can cause a problem when the docx zip archive contains binary data (e.g. JPEGs):
With the pipe, grep does not process the archive file by file; it sees the whole archive as one continuous stream and reads it in chunks. How exactly the stream is split into chunks is, as far as I know, not controllable: it depends on the pipe buffering and on how the OS schedules the two processes on either side of the pipe.
A chunk can therefore contain both matching text and binary data. In that case the whole chunk is discarded with the message "Binary file (standard input) matches" (see the description of the "--binary-files" option in https://www.gnu.org/software/grep/manual/grep.html), and the extracted text is incomplete.
This can be very difficult to detect: I had a case where the extraction almost always worked. Very rarely, the result contained some extracted text followed by "Binary file (standard input) matches", i.e. part of the data was truncated. When it failed, it turned out that word/document.xml had been processed properly, but word/footer2.xml and a JPEG had been read as one chunk. It happened so rarely because the footer file and the JPEG were almost 2 MB apart in the archive; only under exceptional circumstances did the I/O end up putting footer2.xml and the JPEG into the same chunk.
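The underlying grep behaviour can be sketched in isolation, without any zip archive involved (the exact wording of the notice varies between grep versions, so treat the commented output as an approximation):
# A line of plain text on stdin matches as usual:
printf 'Hello world, this is a test\n' | grep 'Hello'
# As soon as the piped data contains a NUL byte, grep classifies the stream as binary
# and, with the default --binary-files=binary, reports something like
# "Binary file (standard input) matches" instead of (or in addition to) the matching text:
printf 'Hello world, this is a test\n\0' | grep 'Hello'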
I can reproduce this fairly regularly (perhaps 20-50% of the time) with docx_with_image.docx on the Unix command line with a one-liner shell script containing
unzip -p "$1" | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
Sometimes the output is (correctly) "Hello world, this is a test", sometimes it is (incorrectly) "Binary file (standard input) matches".
Replacing spec/fixtures/lorem.docx with the above docx file and explicitly testing for the extracted text also reveals the failure (sometimes):
index 2602b61..a7fb0da 100644
--- a/spec/doc_ripper/doc_ripper_spec.rb
+++ b/spec/doc_ripper/doc_ripper_spec.rb
@@ -48,7 +48,7 @@ module DocRipper
it 'should respond with text to valid file extensions' do
expect(DocRipper.rip(doc_path)).not_to eq(nil)
- expect(DocRipper.rip(docx_path)).not_to eq(nil)
+ expect(DocRipper.rip(docx_path)).to eq("Hello world, this is a test\r\n")
expect(DocRipper.rip(pdf_path)).not_to eq(nil)
Running the modified spec then sometimes fails with:
Failures:
1) provide a clean api to return the text from a document #rip should respond with text to valid file extensions
Failure/Error: expect(DocRipper.rip(docx_path)).to eq(docx_text)
expected: "Hello world, this is a test\r\n"
got: "Binary file (standard input) matches\n"
(compared using ==)
Diff:
@@ -1 +1 @@
-Hello world, this is a test
+Binary file (standard input) matches
# ./spec/doc_ripper/doc_ripper_spec.rb:59:in `block (3 levels) in <module:DocRipper>'
I suppose a valid fix would be to unzip only the XML files (though I am not proficient enough in Office Open XML to know whether all relevant text is found only there):
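A minimal sketch of that idea, assuming the relevant text lives in the word/*.xml members of the archive (the glob is my assumption, not something I have verified against the OOXML spec); as an alternative, grep could be told to treat the piped stream as text via -a / --binary-files=text, which also prevents the chunk-dependent "Binary file ... matches" behaviour:
# Variant 1 (assumption: all relevant text is in word/*.xml): only pipe the XML members to grep.
unzip -p "$1" "word/*.xml" | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
# Variant 2: keep extracting everything, but force grep to treat the stream as text,
# so binary chunks are simply searched (and normally not matched) instead of being reported.
unzip -p "$1" | grep -a '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
Either variant leaves the rest of the pipeline untouched; the first avoids feeding binary members to grep at all, the second only changes how grep handles them.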