Handling of Invalid sofa indexes #162

Closed
hatzel opened this issue Mar 4, 2021 · 3 comments

@hatzel

hatzel commented Mar 4, 2021

Is your feature request related to a problem? Please describe.
I had to load data with invalid sofa indexes; don't ask me how they got in there. They are just comically out of bounds (in the hundreds of thousands when the document length is in the tens of thousands).

Describe the solution you'd like
I fixed this for myself with this hack which just completely discards indexes in such a case: hatzel@8765c42

Doesn't feel great, let me know if you want to take something like this on board. Otherwise feel free to close this issue.

Let me know how/if you would like to handle this, I can provide a minimal example and potentially a better fix if you are interested.

Describe alternatives you've considered
You could just emit a warning, but I am unsure whether that would really be a great solution.

@jcklie
Collaborator

jcklie commented Mar 8, 2021

Thank you for the report. Do you have a lot of Unicode in your documents? Is it possible to create a minimal example of what does not work? How did you create the documents in the first place? I fear that the index mapping code is not 100% reliable and want to understand the error. I would add an optional flag to ignore index errors; the problem then is just that writing back will not work.
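
To illustrate why Unicode matters for the index mapping, here is a minimal sketch. It assumes the offsets in the XMI count UTF-16 code units (as Java strings do), while Python indexes strings by code point. The function name is made up for illustration and this is not the actual cassis code.

```python
from typing import Dict


def utf16_to_codepoint_map(text: str) -> Dict[int, int]:
    """Map each UTF-16 code unit offset at a character boundary to a Python str index."""
    mapping = {}
    utf16_pos = 0
    for cp_pos, ch in enumerate(text):
        mapping[utf16_pos] = cp_pos
        # Characters outside the BMP take two UTF-16 code units (a surrogate pair).
        utf16_pos += 2 if ord(ch) > 0xFFFF else 1
    # The end-of-text offset is valid too (exclusive end of an annotation).
    mapping[utf16_pos] = len(text)
    return mapping


# "a", one emoji outside the BMP, "b"
text = "a\U0001F600b"
mapping = utf16_to_codepoint_map(text)
```

With a plain dictionary like this, any offset that is not a known character boundary (for example one far beyond the end of the text) raises a `KeyError` on lookup.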

@jcklie jcklie self-assigned this Mar 8, 2021
@jcklie jcklie added this to the 0.6.0 milestone Mar 8, 2021
@hatzel
Author

hatzel commented Mar 9, 2021

Alright, I suspect this may, at least in part, actually be an error in the documents. I tried to minimize the examples a bit but didn't really get anywhere, so since the files are open source I'll just link them here; maybe you can make sense of it.

The typesystem can be found here but I had to extend it manually with a few missing annotations:

    import cassis

    # typesystem_path points to the type system XML mentioned above
    with open(typesystem_path, "rb") as f:
        typesystem = cassis.load_typesystem(f)
        # We have to add some types that apparently are not in the XML
        ep = typesystem.create_type("de.uniwue.mk.kall.Erzaehlpassage")
        sa = typesystem.create_type("de.uniwue.mk.kall.Sprechakt")
        typesystem.add_feature(type_=sa, name="Aufbau", rangeTypeName="uima.cas.String")
        typesystem.create_type("de.uniwue.mk.kall.SprechaktText")
        dialogue = typesystem.create_type("de.uniwue.mk.kall.Dialog")
        typesystem.add_feature(type_=dialogue, name="Sprechakte", rangeTypeName="uima.cas.Integer")

I looked at these two errors specifically:

  • KeyError: 447683: in the tag <type:AlreadyHandled xmi:id="15560" sofa="222088" begin="447683" end="447688"/>, the offsets are far out of bounds relative to the sofa length.
  • KeyError: 18369: if I am not mistaken, this one is out of bounds by only a single byte, so it may be the more interesting example: <type:temp5 xmi:id="161223" sofa="1" begin="17917" end="18369"/>

These two files also gave me errors: Arnim,-Bettina-von__Die Günderode.xmi.xmi.xmi and Lewald,-Fanny__Jenny.xml.xmi.xmi.txt.xmi.xmi.xmi.

This may all be down to the files being malformed, I don't know. The logic for determining byte offsets does seem straightforward enough.
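
One way to check whether the files themselves are malformed is to compare each annotation's offsets against the length of the sofa it refers to. The following is a rough sketch using only the standard library. It assumes the usual UIMA XMI layout (a `cas:Sofa` element carrying `sofaString`, annotations as direct children of the root with `sofa`, `begin` and `end` attributes) and that offsets count UTF-16 code units; `find_out_of_bounds` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces as they commonly appear in UIMA XMI files (assumed here).
XMI_NS = "http://www.omg.org/XMI"
CAS_NS = "http:///uima/cas.ecore"


def find_out_of_bounds(xmi_text):
    """Return (xmi:id, begin, end, sofa_length) for annotations with suspicious offsets."""
    root = ET.fromstring(xmi_text)
    # Length of each sofa in UTF-16 code units, keyed by the sofa's xmi:id.
    sofa_lengths = {}
    for sofa in root.iter("{%s}Sofa" % CAS_NS):
        sofa_string = sofa.get("sofaString", "")
        sofa_lengths[sofa.get("{%s}id" % XMI_NS)] = len(sofa_string.encode("utf-16-le")) // 2
    problems = []
    for elem in root:
        if "begin" not in elem.attrib or "end" not in elem.attrib:
            continue  # not an annotation
        length = sofa_lengths.get(elem.get("sofa"))
        begin, end = int(elem.get("begin")), int(elem.get("end"))
        # Flag unknown sofas as well as offsets outside [0, length].
        if length is None or not (0 <= begin <= end <= length):
            problems.append((elem.get("{%s}id" % XMI_NS), begin, end, length))
    return problems


sample = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmlns:type="http:///type.ecore" xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="Hello"/>
  <type:Ok xmi:id="2" sofa="1" begin="0" end="5"/>
  <type:Bad xmi:id="3" sofa="1" begin="3" end="9"/>
  <type:Orphan xmi:id="4" sofa="99" begin="0" end="1"/>
</xmi:XMI>"""

problems = find_out_of_bounds(sample)
```

A `None` sofa length in the result means the annotation points at a sofa id that does not exist in the file, which would be a different kind of problem than an offset that is off by one.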

@reckart
Member

reckart commented Sep 23, 2021

What needs to be done here?

@reckart reckart modified the milestones: 0.6.0, 0.7.0 Sep 27, 2021
@reckart reckart assigned reckart and unassigned jcklie Dec 17, 2021
reckart added a commit that referenced this issue Dec 17, 2021
- Rename offset converter class
- If an offset is invalid according to the mapping strategy, do not map it - pass through as is and log warning
- Added test that invalid offsets are passed through on import and export and that warnings are logged
reckart added a commit that referenced this issue Dec 17, 2021
…a-indexes

#162 - Handling of Invalid sofa indexes
@reckart reckart closed this as completed Dec 17, 2021
reckart added a commit that referenced this issue Dec 17, 2021
- Switch to warnings.warn and adjust tests
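
The behaviour described in the commit messages above (pass an invalid offset through unchanged and emit a warning) could look roughly like the following. This is a sketch only: `map_offset`, the warning text and the dictionary-based mapping are hypothetical and not taken from the cassis source.

```python
import warnings


def map_offset(offset, mapping):
    """Return the mapped offset, or the offset unchanged (with a warning) if it is unknown."""
    try:
        return mapping[offset]
    except KeyError:
        # Invalid according to the mapping: do not map, pass through as is.
        warnings.warn("Not mapping invalid offset %d" % offset)
        return offset


# Toy mapping from external offsets to Python string indexes.
mapping = {0: 0, 1: 1, 3: 2, 4: 3}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = [map_offset(3, mapping), map_offset(447683, mapping)]
```

Using `warnings.warn` rather than a logger lets callers decide via the standard warning filters whether such offsets should be ignored, reported once, or turned into errors.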