Handling of Invalid sofa indexes #162
Comments
Thank you for the report. Do you have lots of Unicode in your documents? Is it possible to create a minimal example of what does not work? How did you create the documents in the first place? I fear that the index mapping code is not 100% reliable and want to understand the error. I would add an optional flag to ignore index errors; the problem then is just that writing back will not work.
Alright, I suspect this may, at least in part, actually be an error in the documents. I tried to minimize the examples a bit but didn't really get anywhere, so since the files are open source I'll just link them here; maybe you can make sense of it. The typesystem can be found here, but I had to extend it manually with a few missing annotations:

    import cassis

    with open(typesystem_path, "rb") as f:
        typesystem = cassis.load_typesystem(f)

    # We have to add some types that apparently are not in the XML
    ep = typesystem.create_type("de.uniwue.mk.kall.Erzaehlpassage")
    sa = typesystem.create_type("de.uniwue.mk.kall.Sprechakt")
    typesystem.add_feature(type_=sa, name="Aufbau", rangeTypeName="uima.cas.String")
    typesystem.create_type("de.uniwue.mk.kall.SprechaktText")
    dialogue = typesystem.create_type("de.uniwue.mk.kall.Dialog")
    typesystem.add_feature(type_=dialogue, name="Sprechakte", rangeTypeName="uima.cas.Integer")

I looked at these two errors specifically:
These two files also gave me errors:
This may all be down to the files being malformed, I don't know. The logic for determining byte offsets does seem straightforward enough.
What needs to be done here?
- Rename offset converter class
- If an offset is invalid according to the mapping strategy, do not map it: pass it through as is and log a warning
- Added test that invalid offsets are passed through on import and export and that warnings are logged
…a-indexes #162 - Handling of Invalid sofa indexes
- Switch to warnings.warn and adjust tests
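The pass-through behaviour described in these commit notes can be illustrated with a minimal sketch. This is not the library's actual converter: the class name `OffsetMapper`, the method `to_python`, and the assumption that the external offsets count UTF-16 code units are all hypothetical and chosen only for illustration.

```python
import warnings


class OffsetMapper:
    """Hypothetical helper, not part of cassis.

    Assumes external offsets count UTF-16 code units and maps them to
    Python string (code point) offsets.
    """

    def __init__(self, text: str):
        self._external_to_python = {}
        utf16_pos = 0
        for py_pos, char in enumerate(text):
            self._external_to_python[utf16_pos] = py_pos
            # Characters outside the BMP occupy two UTF-16 code units.
            utf16_pos += 2 if ord(char) > 0xFFFF else 1
        # The end-of-text offset is valid as an exclusive annotation end.
        self._external_to_python[utf16_pos] = len(text)

    def to_python(self, offset: int) -> int:
        try:
            return self._external_to_python[offset]
        except KeyError:
            # Invalid offset: do not map it, pass it through and warn.
            warnings.warn(f"Offset {offset} cannot be mapped; passing it through unchanged")
            return offset


mapper = OffsetMapper("a\U0001F600b")
valid = mapper.to_python(3)             # start of "b" in UTF-16 units
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    invalid = mapper.to_python(500000)  # far beyond the end of the text
```

With this design, a document containing bad offsets still loads, and the warning tells the caller that those annotations should not be trusted.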
Is your feature request related to a problem? Please describe.
I had to load data with invalid sofa indexes; don't ask me how they got in there. They are just comically out of bounds (in the hundreds of thousands when the document length is in the tens of thousands).
Describe the solution you'd like
I fixed this for myself with this hack which just completely discards indexes in such a case: hatzel@8765c42
Doesn't feel great, let me know if you want to take something like this on board. Otherwise feel free to close this issue.
Let me know how/if you would like to handle this, I can provide a minimal example and potentially a better fix if you are interested.
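For comparison, the discard approach mentioned above could look roughly like the following. This is only a sketch and not the code from the linked commit: `Span` and `drop_out_of_bounds` are hypothetical names, and the annotations are modelled as plain begin/end pairs rather than real cassis objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for an annotation with begin/end offsets."""
    begin: int
    end: int


def drop_out_of_bounds(spans: List[Span], text_length: int) -> List[Span]:
    """Keep only spans whose offsets lie within the document text."""
    return [s for s in spans if 0 <= s.begin <= s.end <= text_length]


spans = [Span(0, 5), Span(300000, 300010), Span(39990, 40000)]
kept = drop_out_of_bounds(spans, text_length=40000)
```

The drawback is that the dropped annotations are lost silently, so writing the document back no longer reproduces the input.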
Describe alternatives you've considered
You could just emit a warning, but I am unsure whether that would really be a great solution.