As a BHL developer I want a shorter version of a dump with filtered data #61

Closed
dimus opened this issue Jan 3, 2023 · 1 comment

Comments


dimus commented Jan 3, 2023

According to @mlichtenberg, the previous version of bhlindex applies filters to its output. The columns needed in the dump depend on whether the new output is filtered:

If the output WILL be filtered, then the needed columns are:

names.csv

NameID
DetectedName
MatchedCanonical
MatchedFullName
RecordID
DataSourceID

occurrences.csv

NameID
PageID

If the output will NOT be filtered, then the needed columns are:

names.csv

NameID
DetectedName
MatchedCanonical
MatchedFullName
RecordID
DataSourceID
MatchSortOrder
MatchType
OddsLog10
Curation
Error

occurrences.csv

NameID
PageID
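
To make the two column sets concrete, here is a minimal Python sketch that trims a names.csv stream down to the columns listed above. It assumes the dump has a header row using exactly these column names; `project_names` and the constant names are hypothetical and are not part of bhlindex.

```python
import csv
import io

# Column sets copied from the lists above.
FILTERED_NAME_COLUMNS = [
    "NameID", "DetectedName", "MatchedCanonical",
    "MatchedFullName", "RecordID", "DataSourceID",
]
UNFILTERED_NAME_COLUMNS = FILTERED_NAME_COLUMNS + [
    "MatchSortOrder", "MatchType", "OddsLog10", "Curation", "Error",
]
OCCURRENCE_COLUMNS = ["NameID", "PageID"]  # same in both cases


def project_names(src, dst, filtered):
    """Copy only the needed names.csv columns from src to dst.

    Assumption: src starts with a header row naming the columns.
    Columns not in the selected set are dropped.
    """
    columns = FILTERED_NAME_COLUMNS if filtered else UNFILTERED_NAME_COLUMNS
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


if __name__ == "__main__":
    # Made-up sample row, for illustration only.
    sample = io.StringIO(
        "NameID,DetectedName,MatchedCanonical,MatchedFullName,"
        "RecordID,DataSourceID,MatchSortOrder,MatchType,OddsLog10,"
        "Curation,Error\n"
        "1,Aus bus,Aus bus,Aus bus L.,r1,1,0,ExactMatch,5.2,Curated,\n"
    )
    out = io.StringIO()
    project_names(sample, out, filtered=True)
    print(out.getvalue())
```

The filtered set is a strict prefix of the unfiltered one, so the extra five columns are only needed when the consumer has to apply the filter itself.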

dimus commented Jan 3, 2023

Filter:

COPY (
  SELECT n.name, n.matched_name, n.matched_canonical
  FROM name_strings n INNER JOIN name_statuses st ON n.name = st.name
  WHERE (n.match_type IN ('ExactMatch', 'ExactCanonicalMatch') AND n.curation <> 'Unknown')
    OR (n.match_type IN ('FuzzyCanonical', 'FuzzyPartial') AND (st.odds > 1000000 OR n.edit_distance IN (0,1) OR n.stem_edit_distance IN (0,1)))
    OR (n.match_type IN ('NoMatch', '') AND st.odds > 1000000)
    OR (n.match_type = 'ExactPartialMatch')
) TO STDOUT DELIMITER '|'
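
For anyone applying the same filter outside the database, the WHERE clause can be read as a per-row predicate. The following is a rough Python sketch of that reading, not the bhlindex implementation: `keep_name` and the constants are hypothetical names, the arguments mirror the columns used in the query, and SQL NULL handling is ignored (all values are assumed to be present).

```python
# Match-type groups and threshold taken from the query above.
EXACT = {"ExactMatch", "ExactCanonicalMatch"}
FUZZY = {"FuzzyCanonical", "FuzzyPartial"}
NO_MATCH = {"NoMatch", ""}
ODDS_THRESHOLD = 1_000_000


def keep_name(match_type, curation, odds, edit_distance, stem_edit_distance):
    """Return True if a row would pass the filter query's WHERE clause.

    The match-type groups are disjoint, so the OR-ed SQL branches
    can be checked one after another.
    """
    if match_type in EXACT:
        return curation != "Unknown"
    if match_type in FUZZY:
        return (
            odds > ODDS_THRESHOLD
            or edit_distance in (0, 1)
            or stem_edit_distance in (0, 1)
        )
    if match_type in NO_MATCH:
        return odds > ODDS_THRESHOLD
    # Exact partial matches are always kept; anything else is dropped.
    return match_type == "ExactPartialMatch"
```

Read this way, exact matches are kept unless their curation is 'Unknown', fuzzy matches need either high odds or a small edit distance, unmatched names need high odds, and exact partial matches always pass.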

dimus closed this as completed in c13c270 on Jan 10, 2023