Object too large error in preprocessing script #102
If you end up splitting the output files in the
I had the same problem and the same error message. After doing some more preprocessing, however, I no longer get the "bytes object is too large" ValueError. Preprocessing steps: (1) removed duplicates, (2) stripped whitespace from sentence ends, (3) removed sentences longer than 2,520 characters, (4) removed sentences shorter than 11 characters. Those four steps cut my dataset by 74%, from 7,487,357 sentences to 1,978,295, so I'm not sure which of them fixed it, but the error is gone.
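A minimal sketch of those four steps, assuming the corpus is already split into one sentence per item; the function name, argument names, and ordering are illustrative, not code from this repo:

```python
# Rough sketch of the preprocessing described above: strip trailing
# whitespace, drop very short or very long sentences, and deduplicate.
# The length thresholds mirror the comment; everything else is assumed.
def preprocess(sentences, min_chars=11, max_chars=2520):
    seen = set()
    kept = []
    for sent in sentences:
        sent = sent.rstrip()           # (2) strip whitespace from sentence end
        if len(sent) > max_chars:      # (3) drop sentences longer than 2,520 chars
            continue
        if len(sent) < min_chars:      # (4) drop sentences shorter than 11 chars
            continue
        if sent in seen:               # (1) remove duplicates
            continue
        seen.add(sent)
        kept.append(sent)
    return kept
```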
How big is each of your documents? Is each one a sentence, or is it more like a news article? Mine are around 500 words (3,000-4,000 characters), so if yours are sentence-length, that could be keeping you below the memory limit. (It could also explain why you're getting 2,000 docs/second while I'm getting 100/second on 14 cores.) In general, though, it isn't ideal to have to trim the corpus to avoid an out-of-memory error. I'm about to train vectors on a much larger corpus, so I'll see how the splitting solution in #103 works.
Each of my documents is a sentence of about 120 characters on average, so I agree with your points.
I've been getting a "bytes object is too large" error when processing a large-ish number of documents using the 01_parse.py script. Creating several smaller doc_bin objects resolves the issue. Full error:
ValueError: bytes object is too large
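For reference, a minimal sketch of that workaround: split the corpus across several smaller DocBin objects and write each one to its own file. The chunk size, output naming, and model are assumptions, not the actual 01_parse.py code.

```python
# Flush a fresh DocBin to disk every `max_docs` documents so no single
# serialized blob grows large enough to hit the "bytes object is too large"
# ValueError. Chunk size, file naming, and the loaded model are assumptions.
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def parse_in_chunks(texts, output_dir="parsed", max_docs=100_000):
    nlp = spacy.load("en_core_web_sm")  # assumed model; use whatever the script loads
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    doc_bin = DocBin()
    part = 0
    for doc in nlp.pipe(texts):
        doc_bin.add(doc)
        if len(doc_bin) >= max_docs:
            # Write this chunk and start a new, smaller DocBin.
            (out_dir / f"part_{part}.spacy").write_bytes(doc_bin.to_bytes())
            part += 1
            doc_bin = DocBin()
    if len(doc_bin) > 0:
        (out_dir / f"part_{part}.spacy").write_bytes(doc_bin.to_bytes())
```

Downstream steps then just need to iterate over the output files instead of loading a single doc_bin.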