Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[archiveorg] Fix extraction #23827

Closed
wants to merge 4 commits into from

Conversation

TinyToweringTree
Copy link

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

The archive.org website has changed the way it embeds the JW Player playlist. That's why the regular expression can't find the playlist and the extractor is currently not working.

This pull request updates the extraction of the playlist to work with the new version of archive.org.

There's an input element with the class js-play8-playlist. Its value contains the playlist.

Closes #21330, closes #23586, closes #23700.

r"(?s)Play\('[^']+'\s*,\s*(\[.+\])\s*,\s*{.*?}\)",
webpage, 'jwplayer playlist'), video_id)
r'.*\s+value\s*=\s*[\'"](.+)[\'"][\s/]',
input_element_with_playlist, 'playlist data'), video_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_attributes.

Copy link
Author

@TinyToweringTree TinyToweringTree Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer. I changed the code to use extract_attributes() in commit e910f49.

It was still bugging me that I couldn't use get_element_by_class() in the line before as it only returns the content of the element and not the element (as the name of the function indicates). So I changed get_element_by_class() (and it's associated functions) in commit b98d1c0 to accept an optional include_tag parameter. By default it's set to False. So no existing code will be affected by it.

This should prove useful in the future and make those functions more intuitive to use.

I also added tests to test_utils.py. While doing so I noticed that get_element_by_class() didn't work for class names starting with a hyphen. I fixed that, too.

@@ -52,7 +55,7 @@ def get_optional(metadata, field):
metadata = self._download_json(
'http://archive.org/details/' + video_id, video_id, query={
'output': 'json',
})['metadata']
}).get('metadata', {})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point of this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to make the extraction more resilient against website changes. With metadata being optional for downloading videos, it's probably better not to crash if the metadata element changes its name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not enough. If you reconsider this optional then you must also:

  1. Handle non dict metadata scenario.
  2. Handle download failure.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Thanks. Commit 1326a5a fixes this by setting fatal=False when calling _download_json() and by checking the return value.

Use get_element_by_class() from utils to get rid of yet another regex.
This function used to return only the content of the element, and not
the element itself, including its tag and attributes. The whole group
of get_element_by_X() functions are a bit of a misnomer, as they all
return the *content* of the element and not the element itself.

All these functions can now return the whole element when setting
their `include_tag` parameter to `True`. By default it is `False` so
no other code will be affected by this change. Tests have been added
to test/test_utils.py accordingly.

This uncovered a bug which prevented elements starting with a hyphen as
their class name from being found. This has been fixed by fixing the
regex used in get_elements_by_class().
@lukaarma
Copy link

@dstftw this pull request needs more work?

@adrianheine adrianheine mentioned this pull request Feb 3, 2021
11 tasks
@dirkf dirkf closed this Aug 1, 2023
@dirkf dirkf added the defunct PR source branch is not accessible label Oct 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defunct PR source branch is not accessible
Projects
None yet
4 participants