[ML] find_file_structure not detecting CSV header with many long and highly variable field values #45047
Labels: :ml (Machine learning)
Comments
Pinging @elastic/ml-core
Possibly look at the ratio of numeric characters, in addition to excluding very lengthy fields.
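The numeric-character-ratio suggestion above can be sketched as follows. This is an illustrative example, not the actual Elasticsearch implementation; the field values and the idea that a header row contains almost no digits while data rows often contain many are the only assumptions.

```python
def numeric_char_ratio(row):
    """Fraction of characters across all fields that are digits.

    A header row typically contains almost no digits, while data
    rows in a numeric-heavy CSV contain many, so this ratio can
    help distinguish the two.
    """
    text = "".join(row)
    if not text:
        return 0.0
    return sum(c.isdigit() for c in text) / len(text)

# Hypothetical rows loosely modelled on listings data.
header = ["id", "listing_name", "price", "review_score"]
data_row = ["2539", "Clean & quiet apt home by the park", "149", "9.8"]

assert numeric_char_ratio(header) == 0.0   # no digits at all
assert numeric_char_ratio(data_row) > 0.1  # digits throughout
```

A real implementation would compare the ratio of the first row against the distribution of ratios in the remaining rows rather than using a fixed threshold.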
droberts195
added a commit
to droberts195/elasticsearch
that referenced
this issue
Aug 1, 2019
When doing a fieldwise Levenshtein distance comparison between CSV rows, this change ignores all fields that have long values, not just the longest field. This approach works better for CSV formats that have multiple freeform text fields rather than just a single "message" field. Fixes elastic#45047
droberts195
added a commit
that referenced
this issue
Aug 1, 2019
When doing a fieldwise Levenshtein distance comparison between CSV rows, this change ignores all fields that have long values, not just the longest field. This approach works better for CSV formats that have multiple freeform text fields rather than just a single "message" field. Fixes #45047
droberts195
added a commit
that referenced
this issue
Aug 2, 2019
When doing a fieldwise Levenshtein distance comparison between CSV rows, this change ignores all fields that have long values, not just the longest field. This approach works better for CSV formats that have multiple freeform text fields rather than just a single "message" field. Fixes #45047
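The change described in the commit message can be sketched like this. It is a minimal illustration of the stated approach, assuming a textbook dynamic-programming Levenshtein distance and a hypothetical length cutoff (`MAX_FIELD_LEN`); the real logic lives in Elasticsearch's file-structure finder and differs in detail.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

MAX_FIELD_LEN = 20  # hypothetical cutoff for "long" freeform fields

def fieldwise_distance(row_a, row_b):
    """Sum of per-field edit distances between two CSV rows,
    ignoring *all* fields whose value is long in either row,
    not just the single longest field."""
    total = 0
    for a, b in zip(row_a, row_b):
        if len(a) > MAX_FIELD_LEN or len(b) > MAX_FIELD_LEN:
            continue  # likely freeform text: skip it entirely
        total += levenshtein(a, b)
    return total
```

With this, two data rows that differ only in several long description fields still compare as similar, so a genuinely different header row stands out even when a file has multiple freeform text columns.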
elastic/kibana#42114 contains an example of a CSV file where the find_file_structure endpoint didn't detect that the first row contained the column names.
The explanation is:
The file in question contains AirBNB listings data. Some owners have written a huge amount about their properties while others have written very little, and the two current tests are confused by this.
To a human it is blatantly obvious that the first row is a header row, so we should be able to improve the detection.
One idea is to extend the existing check from _excluding_ the biggest difference to excluding all fields that are over a certain length in any row, as this indicates likely freeform text fields (and the AirBNB data has more than one such field per row).
Another idea would be to look at the number of distinct characters in each row. In the AirBNB data this could well notice a difference between the first row and others because the first row is all commas, lowercase letters and underscores whereas the other lines have many other characters.
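The distinct-character idea above can be sketched in a few lines. This is a hedged illustration only; the sample lines are hypothetical and the comparison is the bare heuristic, not any shipped check.

```python
def distinct_chars(line):
    """Number of distinct characters in a raw CSV line."""
    return len(set(line))

# Hypothetical lines: a header made of lowercase names, underscores
# and commas versus a data line with digits, punctuation and mixed case.
header = "id,name,host_id,neighbourhood,price"
data = '2539,"Clean & quiet apt home by the park",2787,Kensington,149'

# The data line draws on a visibly larger alphabet than the header.
assert distinct_chars(data) > distinct_chars(header)
```

A header row restricted to commas, lowercase letters and underscores will usually score well below data rows, which mix in digits, quotes, spaces and uppercase letters.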