-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[archiveorg] Fix extraction #23827
[archiveorg] Fix extraction #23827
Conversation
youtube_dl/extractor/archiveorg.py
Outdated
r"(?s)Play\('[^']+'\s*,\s*(\[.+\])\s*,\s*{.*?}\)", | ||
webpage, 'jwplayer playlist'), video_id) | ||
r'.*\s+value\s*=\s*[\'"](.+)[\'"][\s/]', | ||
input_element_with_playlist, 'playlist data'), video_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract_attributes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointer. I changed the code to use extract_attributes()
in commit e910f49.
It was still bugging me that I couldn't use get_element_by_class()
in the line before as it only returns the content of the element and not the element (as the name of the function indicates). So I changed get_element_by_class()
(and it's associated functions) in commit b98d1c0 to accept an optional include_tag
parameter. By default it's set to False
. So no existing code will be affected by it.
This should prove useful in the future and make those functions more intuitive to use.
I also added tests to test_utils.py
. While doing so I noticed that get_element_by_class()
didn't work for class names starting with a hyphen. I fixed that, too.
youtube_dl/extractor/archiveorg.py
Outdated
@@ -52,7 +55,7 @@ def get_optional(metadata, field): | |||
metadata = self._download_json( | |||
'http://archive.org/details/' + video_id, video_id, query={ | |||
'output': 'json', | |||
})['metadata'] | |||
}).get('metadata', {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Point of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to make the extraction more resilient against website changes. With metadata being optional for downloading videos, it's probably better not to crash if the metadata element changes its name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not enough. If you reconsider this optional then you must also:
- Handle non dict metadata scenario.
- Handle download failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Thanks. Commit 1326a5a fixes this by setting fatal=False
when calling _download_json()
and by checking the return value.
Use get_element_by_class() from utils to get rid of yet another regex. This function used to return only the content of the element, and not the element itself, including its tag and attributes. The whole group of get_element_by_X() functions are a bit of a misnomer, as they all return the *content* of the element and not the element itself. All these functions can now return the whole element when setting their `include_tag` parameter to `True`. By default it is `False` so no other code will be affected by this change. Tests have been added to test/test_utils.py accordingly. This uncovered a bug which prevented elements starting with a hyphen as their class name from being found. This has been fixed by fixing the regex used in get_elements_by_class().
@dstftw this pull request needs more work? |
Please follow the guide below
x
into all the boxes [ ] relevant to your pull request (like that [x])Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Description of your pull request and other information
The archive.org website has changed the way it embeds the JW Player playlist. That's why the regular expression can't find the playlist and the extractor is currently not working.
This pull request updates the extraction of the playlist to work with the new version of archive.org.
There's an
input
element with the classjs-play8-playlist
. Its value contains the playlist.Closes #21330, closes #23586, closes #23700.