[ML] find_file_structure not detecting CSV header with many long and highly variable field values #45047
Labels: :ml (Machine learning)
Comments
Pinging @elastic/ml-core
Possibly look at the ratio of numeric characters, in addition to excluding very lengthy fields.
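The numeric-character-ratio suggestion above can be sketched as follows. This is an illustrative example, not the actual Elasticsearch implementation; the field values and the idea that a header row contains almost no digits while data rows often contain many are the only assumptions.

```python
def numeric_char_ratio(row):
    """Fraction of characters across all fields that are digits.

    A header row typically contains almost no digits, while data
    rows in a numeric-heavy CSV contain many, so this ratio can
    help distinguish the two.
    """
    text = "".join(row)
    if not text:
        return 0.0
    return sum(c.isdigit() for c in text) / len(text)

# Hypothetical rows loosely modelled on listings data.
header = ["id", "listing_name", "price", "review_score"]
data_row = ["2539", "Clean & quiet apt home by the park", "149", "9.8"]

assert numeric_char_ratio(header) == 0.0   # no digits at all
assert numeric_char_ratio(data_row) > 0.1  # digits throughout
```

A real implementation would compare the ratio of the first row against the distribution of ratios in the remaining rows rather than using a fixed threshold.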
droberts195
added a commit
to droberts195/elasticsearch
that referenced
this issue
Aug 1, 2019
When doing a fieldwise Levenshtein distance comparison between CSV rows, this change ignores all fields that have long values, not just the longest field. This approach works better for CSV formats that have multiple freeform text fields rather than just a single "message" field. Fixes elastic#45047
droberts195
added a commit
that referenced
this issue
Aug 1, 2019
When doing a fieldwise Levenshtein distance comparison between CSV rows, this change ignores all fields that have long values, not just the longest field. This approach works better for CSV formats that have multiple freeform text fields rather than just a single "message" field. Fixes #45047
droberts195
added a commit
that referenced
this issue
Aug 2, 2019
When doing a fieldwise Levenshtein distance comparison between CSV rows, this change ignores all fields that have long values, not just the longest field. This approach works better for CSV formats that have multiple freeform text fields rather than just a single "message" field. Fixes #45047
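The change described in the commit message can be sketched like this. It is a minimal illustration of the stated approach, assuming a textbook dynamic-programming Levenshtein distance and a hypothetical length cutoff (`MAX_FIELD_LEN`); the real logic lives in Elasticsearch's file-structure finder and differs in detail.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

MAX_FIELD_LEN = 20  # hypothetical cutoff for "long" freeform fields

def fieldwise_distance(row_a, row_b):
    """Sum of per-field edit distances between two CSV rows,
    ignoring *all* fields whose value is long in either row,
    not just the single longest field."""
    total = 0
    for a, b in zip(row_a, row_b):
        if len(a) > MAX_FIELD_LEN or len(b) > MAX_FIELD_LEN:
            continue  # likely freeform text: skip it entirely
        total += levenshtein(a, b)
    return total
```

With this, two data rows that differ only in several long description fields still compare as similar, so a genuinely different header row stands out even when a file has multiple freeform text columns.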
elastic/kibana#42114 contains an example of a CSV file where the find_file_structure endpoint didn't detect that the first row contained the column names.
The explanation is:
The file in question contains AirBNB listings data. Some owners have written a huge amount about their properties while others have written very little, and the two current tests are confused by this.
To a human it is blatantly obvious that the first row is a header row, so we should be able to improve the detection.
One idea is to extend the existing check from _excluding_ the biggest difference to excluding all fields that are over a certain length in any row, as this indicates likely freeform text fields (and the AirBNB data has more than one such field per row).
Another idea would be to look at the number of distinct characters in each row. In the AirBNB data this could well notice a difference between the first row and others because the first row is all commas, lowercase letters and underscores whereas the other lines have many other characters.
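The distinct-character idea above can be sketched in a few lines. This is a hedged illustration only; the sample lines are hypothetical and the comparison is the bare heuristic, not any shipped check.

```python
def distinct_chars(line):
    """Number of distinct characters in a raw CSV line."""
    return len(set(line))

# Hypothetical lines: a header made of lowercase names, underscores
# and commas versus a data line with digits, punctuation and mixed case.
header = "id,name,host_id,neighbourhood,price"
data = '2539,"Clean & quiet apt home by the park",2787,Kensington,149'

# The data line draws on a visibly larger alphabet than the header.
assert distinct_chars(data) > distinct_chars(header)
```

A header row restricted to commas, lowercase letters and underscores will usually score well below data rows, which mix in digits, quotes, spaces and uppercase letters.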