
Map item catch #365

Merged
merged 18 commits into from
Oct 12, 2023

Conversation

dale-wahl
Member

This works... but it logs to the original dataset, not the new one being created (plus I'm warning via the database, since datasets do not have the 4CAT log). I think it may make more sense to wrap map_item at the processor level. But map_item is a static method, and thus get_mapped_item (or whatever I call the new method) ought to be as well... but then I don't have a log or dataset or anything else useful. Also, iterate_items is the main user of map_item and exists on the dataset. So perhaps I need a processor method to check whether map_item works, and then use get_mapped_item.
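The constraint described above can be illustrated with a minimal sketch (class and method names here are hypothetical, not the actual 4CAT code): a @staticmethod receives no self, so a wrapper defined as a static method would have no dataset or log to warn through, whereas an instance-level wrapper does.

```python
# Minimal illustration: a @staticmethod has no access to instance state,
# so only an instance-level wrapper can report problems to a log.
class Processor:
    @staticmethod
    def map_item(item):
        # datasource-specific mapper; raises KeyError on a malformed item
        return {"id": item["id"]}

    def __init__(self, log):
        self.log = log  # stands in for the 4CAT dataset/database log

    def wrapped_map_item(self, item):
        # as an *instance* method, the wrapper can log, unlike map_item itself
        try:
            return self.map_item(item)
        except KeyError as e:
            self.log.append(f"could not map item: missing key {e}")
            return None

log = []
p = Processor(log)
p.wrapped_map_item({})  # logs a warning instead of crashing
print(log)              # ["could not map item: missing key 'id'"]
```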

@dale-wahl
Member Author

dale-wahl commented May 24, 2023

Created BasicProcessor class methods get_mapped_item, which checks whether map_item returns actual data (and raises a new exception if not), and map_item_method_available, which checks whether a processor and dataset can use map_item.

I did not combine these methods, as it seems unnecessary to check the compatibility of processor and dataset every time get_mapped_item is used.

Note: get_mapped_item could also be used to catch things like the KeyErrors that are sometimes seen from map_item. That would allow a processor to skip an item missing some particular key. However, I did not do that, as I felt it may be better to crash the processor and force us to contend with the malformed item. In essence, get_mapped_item only catches purposefully skipped map_item results, e.g. when map_item returns {} (or False, None, some other Pythonic false, etc.).
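The two methods described above could look roughly like this (a hedged sketch: the exception name, extension attribute, and bodies are illustrative stand-ins, not the actual 4CAT implementation):

```python
class MappedItemSkippedException(Exception):
    """Hypothetical exception: map_item deliberately returned no data."""

class BasicProcessor:
    extension = "ndjson"  # file type this processor's map_item expects

    @staticmethod
    def map_item(item):
        # datasource-specific mapper; returns {} to purposefully skip an item
        return {"id": item["id"]} if "id" in item else {}

    @classmethod
    def map_item_method_available(cls, dataset_extension):
        # compatible only if the processor defines map_item AND the dataset
        # file has the extension the processor expects
        return hasattr(cls, "map_item") and dataset_extension == cls.extension

    @classmethod
    def get_mapped_item(cls, item):
        mapped_item = cls.map_item(item)
        if not mapped_item:
            # any Pythonic false ({}, None, False, ...) is treated as a
            # purposeful skip and surfaced as an exception the caller
            # (e.g. an iterate_items loop) can catch
            raise MappedItemSkippedException("map_item returned no data")
        return mapped_item
```

A caller iterating over items would then catch MappedItemSkippedException and move on, rather than silently yielding an empty item.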

@dale-wahl dale-wahl requested a review from stijn-uva May 24, 2023 14:40
@dale-wahl dale-wahl marked this pull request as ready for review May 24, 2023 14:41
@dale-wahl
Member Author

Tested it on all the ZeeSchuimer datasources. Apparently there is some other oddity in LinkedIn. This update works as intended and notifies both us and the user.

dale-wahl added 3 commits May 30, 2023 11:48
`get_item_keys` would also raise an error on a non-CSV/NDJSON file. It will return the first item's keys even when `map_item` does not exist... but that is how it always functioned, so I am leaving it the same.
"""
# only run item mapper if extension of processor == extension of
# data file, for the scenario where a csv file was uploaded and
# converted to an ndjson-based data source, for example
Member

@dale-wahl is there any datasource that does this at the moment?

Member Author

That's some old voodoo (I think it is yours, to be honest!)... I extracted that code and comment into a new method to use as a check of whether map_item should be run, but I did not modify the code. I believe it has to do with custom imports of, say, a Twitter dataset. You could convert the dataset type from custom to a Twitter datasource type, but it would not be an NDJSON as expected. That would also presumably apply to any ZeeSchuimer datasource (a custom Douyin/Instagram CSV upload would not be able to use map_item). In practice, I am not sure how often we ran into this problem, since users cannot change data types, only admins can (am I wrong about that?).
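The extension check being discussed can be sketched as a small standalone function (names are illustrative; the real check lives in the processor/dataset classes): only run map_item when the file on disk actually has the extension the processor's mapper expects, so a relabelled CSV is never fed to an NDJSON mapper.

```python
from pathlib import Path

def map_item_allowed(processor_extension, results_path):
    """Illustrative check: True only if the data file's actual extension
    matches the extension the processor's map_item expects. Guards against
    e.g. a CSV upload whose dataset type was changed to an NDJSON-based
    datasource."""
    actual = Path(results_path).suffix.lstrip(".").lower()
    return actual == processor_extension.lower()

print(map_item_allowed("ndjson", "data/twitter-import.csv"))    # False
print(map_item_allowed("ndjson", "data/twitter-crawl.ndjson"))  # True
```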

backend/lib/search.py (outdated review comments, resolved)
# not a valid NDJSON file?
return []

if (self.get_results_path().suffix.lower() == ".csv") or (self.get_results_path().suffix.lower() == ".ndjson" and self.get_own_processor() is not None and self.get_own_processor().map_item_method_available(dataset=self)):
Member

Are the extension checks necessary here? I think if there is a map_item there will be columns and if not then not, regardless of extension

Member

(I know they were already there, but I think they might be a left-over from when we were less extension-agnostic)

Member Author

I vaguely remember an issue here with something in the frontend: get_item_keys would fail because get_own_processor would return None. But perhaps the extension checks are redundant. I can test this.

Member Author

CSV datasets do not have map_item, but do have columns we wish to return. NDJSON datasets should only return columns if they can be mapped (else keys are not consistent enough to be used as "columns"). Some datasets do not have processors (rare; from what I can tell only deprecated datasets) and a check here avoids an error being raised by iterate_items.
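The decision table in the comment above can be condensed into a short sketch (a hypothetical standalone function for illustration; the real logic is a condition inside the dataset class): CSV datasets always expose columns, NDJSON datasets only when map_item is available, and datasets without a processor expose none.

```python
def get_columns(extension, has_processor, map_item_available, first_item_keys):
    # CSV datasets always expose their columns.
    if extension == "csv":
        return first_item_keys
    # NDJSON datasets only expose columns when map_item can produce
    # consistent keys; deprecated datasets without a processor expose none.
    if extension == "ndjson" and has_processor and map_item_available:
        return first_item_keys
    return []

print(get_columns("csv", False, False, ["id", "body"]))    # ['id', 'body']
print(get_columns("ndjson", True, False, ["id", "body"]))  # []
```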

Potentially we could combine the two Dataset methods get_item_keys and get_columns; they are almost the same thing. get_item_keys is used in various processors, while get_columns is used by the frontend, most often via get_options. Currently, get_item_keys takes advantage of iterate_items and grabs the first item's keys. get_columns was a copy of the code from iterate_items, which is the code that I'm modifying in this PR.

When attempting to consolidate the code, I realized that get_columns could not be completely replaced by get_item_keys because of the previously mentioned instances where we do not wish to return columns. Possibly we do not wish to return item keys either in these instances via get_item_keys, but I did not explore all usages of that function, as no errors were occurring. Most likely, get_columns returns False via the frontend, and so the backend never runs into a misuse of get_item_keys.
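The relationship described above, where get_item_keys simply takes the first item yielded by iterate_items and returns its keys, can be sketched as follows (illustrative stand-ins over a plain list of dicts, not the actual 4CAT methods):

```python
def iterate_items(rows):
    # stand-in for the dataset's item iterator, which in 4CAT would
    # apply map_item to each raw row where available
    for row in rows:
        yield row

def get_item_keys(rows):
    """Return the keys of the first iterated item, or [] for empty data."""
    for item in iterate_items(rows):
        return list(item.keys())
    return []

print(get_item_keys([{"id": 1, "body": "hello"}]))  # ['id', 'body']
print(get_item_keys([]))                             # []
```

This shows why the two methods are "almost the same thing": the only difference is that get_columns additionally suppresses the keys in the cases where columns should not be exposed.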

@stijn-uva stijn-uva merged commit 1101a0a into master Oct 12, 2023
1 check passed