[WhoScored] Ignore cached events file if empty #420

Merged on Nov 9, 2023 (6 commits)

Conversation

shufinskiy
Contributor

@shufinskiy shufinskiy commented Oct 26, 2023

Hello, @probberechts.

I propose a solution to the problem of empty files in the WhoScored cache.

In issue 98 you suggested deleting empty files with a bash command that filters by file size.

I added a method, _size_file, that does the same with Path.stat().st_size. If the file is no larger than the threshold, it is treated as not cached:

    def _size_file(
        self,
        filepath: Optional[Path] = None,
        filter_size: int = 60
    ) -> bool:
        """Check whether `filepath` exists and is larger than `filter_size`.

        Parameters
        ----------
        filepath : Path, optional
            Path where the file should be cached. If None, return False.
        filter_size : int
            File size threshold in bytes. If the file is not larger than
            this, return False.

        Raises
        ------
        TypeError
            If filter_size is not an integer.

        Returns
        -------
        bool
            True in case of a cache hit, otherwise False.
        """
        if filepath is None:
            return False
        if not isinstance(filter_size, int):
            raise TypeError("filter_size must be of type int")
        try:
            file_size = filepath.stat().st_size
        except FileNotFoundError:
            return False
        # stat() succeeded above, so the file is known to exist here.
        return file_size > filter_size

@probberechts added the enhancement, common, and WhoScored labels and removed the common label on Nov 6, 2023
@probberechts
Owner

The problem only exists for the WhoScored scraper but your solution affects all scrapers. For some scrapers, an empty reply might actually be a valid answer. For example, if a new team is promoted, an empty result is expected in the ClubElo scraper.
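
To illustrate the concern, here is a minimal sketch (not code from this PR) of how a size-only check treats every small file as a cache miss, whether or not its content is valid. The `is_cache_hit_by_size` helper, the file names, and the header-only CSV payload are hypothetical; only the 60-byte default is taken from the proposed `_size_file` method.

```python
import tempfile
from pathlib import Path


def is_cache_hit_by_size(filepath: Path, filter_size: int = 60) -> bool:
    """Hypothetical stand-in for a size-only cache check."""
    try:
        return filepath.stat().st_size > filter_size
    except FileNotFoundError:
        return False


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        # The case the PR wants to catch: a cached file holding only `null`.
        empty_events = Path(tmp) / "events.json"
        empty_events.write_bytes(b"null")

        # Assumed example of a short but legitimate reply (header row only).
        header_only = Path(tmp) / "team.csv"
        header_only.write_bytes(b"date,team,elo\n")

        return {
            "empty_events": is_cache_hit_by_size(empty_events),
            "header_only": is_cache_hit_by_size(header_only),
        }


results = demo()
print(results)
```

Both files fall under the threshold, so the legitimate short reply would be downloaded again on every call if the check were applied to all scrapers.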

Moreover, the bash script checks the file size simply because that was easy to write as a bash command. What it really should check is whether the file contains an empty JSON object, which could easily be done in Python.
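
A content-based check along these lines might look like the sketch below. The `is_empty_json` helper is hypothetical and not part of the library; it assumes that "empty" means a JSON `null`, `{}`, or `[]`, and that a missing or unparsable file should also count as empty.

```python
import json
import tempfile
from pathlib import Path


def is_empty_json(filepath: Path) -> bool:
    """Return True if the file is missing, unparsable, or holds an empty JSON value."""
    try:
        with filepath.open("rb") as f:
            data = json.load(f)
    except (FileNotFoundError, ValueError):
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        return True
    return data is None or data == {} or data == []


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        null_file = Path(tmp) / "null.json"
        null_file.write_bytes(b"null")
        small_file = Path(tmp) / "small.json"
        small_file.write_bytes(b'{"id": 1}')
        return {
            "null": is_empty_json(null_file),
            "small_but_valid": is_empty_json(small_file),
            "missing": is_empty_json(Path(tmp) / "missing.json"),
        }


checks = demo()
print(checks)
```

Unlike a size threshold, this keeps a small but valid payload such as `{"id": 1}` in the cache.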

@shufinskiy
Contributor Author

Yes, you're right. I'll think about how it can be implemented in a different way.

@shufinskiy
Contributor Author

@probberechts
I changed the verification logic: the WhoScored.read_events method now checks the first 4 bytes of the cached file. If they are b'null' (that is, the file holds only a JSON null), the get method is called again with no_cache=True.

reader = self.get(
    url,
    filepath,
    var="requirejs.s.contexts._.config.config.params.args.matchCentreData",
    no_cache=live,
)
if reader.read(4) == b'null':
    reader = self.get(
        url,
        filepath,
        var="requirejs.s.contexts._.config.config.params.args.matchCentreData",
        no_cache=True,
    )
reader.seek(0)
json_data = json.load(reader)
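
The same control flow can be tried in isolation with the sketch below. It is not the library's code: `fake_get` is a hypothetical stand-in for the scraper's `get` method that returns `null` from the "cache" and real data on a forced download, and `read_with_refresh` and the sample payload are assumptions made for illustration.

```python
import io
import json
from typing import IO, Any, List

calls: List[bool] = []  # records the no_cache flag of each call


def fake_get(no_cache: bool = False) -> IO[bytes]:
    """Hypothetical stand-in for `get`: stale cache holds `null`, fresh download holds data."""
    calls.append(no_cache)
    payload = b'{"events": [1, 2, 3]}' if no_cache else b"null"
    return io.BytesIO(payload)


def read_with_refresh(live: bool = False) -> Any:
    reader = fake_get(no_cache=live)
    # Peek at the first 4 bytes; a cached JSON null means the file is empty.
    if reader.read(4) == b"null":
        reader = fake_get(no_cache=True)
    # Rewind, because the peek consumed bytes when no refresh happened.
    reader.seek(0)
    return json.load(reader)


data = read_with_refresh()
print(data, calls)
```

The `seek(0)` matters on the path where the cached file is valid: without it, `json.load` would start parsing at the fifth byte.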

@probberechts probberechts linked an issue Nov 9, 2023 that may be closed by this pull request
@probberechts
Owner

Nice solution! Thanks.

@probberechts changed the title from "Size cache" to "[WhoScored] Ignore cached events file if empty" on Nov 9, 2023
@probberechts probberechts merged commit aae7e5b into probberechts:master Nov 9, 2023
8 checks passed
@shufinskiy shufinskiy deleted the size_cache branch November 12, 2023 11:14
Labels
enhancement (New feature or request), WhoScored (Issue or pull request related to the WhoScored scraper)
Development

Successfully merging this pull request may close these issues.

[WhoScored] Do not cache empty event files
2 participants