[archiveorg] Fix extraction #23827

TinyToweringTree · 2020-01-24T17:28:13Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

At least skimmed through adding new extractor tutorial and youtube-dl coding conventions sections
Searched the bugtracker for similar pull requests
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

The archive.org website has changed the way it embeds the JW Player playlist. As a result, the extractor's regular expression no longer finds the playlist, and extraction currently fails.

This pull request updates the extraction of the playlist to work with the new version of archive.org.

There's an input element with the class js-play8-playlist. Its value contains the playlist.
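As a rough illustration of the approach described above, the following standalone sketch reads the `value` attribute of an `input` element with the class `js-play8-playlist` and decodes it as JSON. It uses only the Python standard library rather than youtube-dl's helpers. The names `PlaylistInputParser`, `extract_playlist`, and `SAMPLE_PAGE` are hypothetical, and the sample markup is an assumption about the page's shape, not a copy of the real archive.org page.

```python
import json
from html.parser import HTMLParser


class PlaylistInputParser(HTMLParser):
    """Collects the value attribute of the first <input> carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching <input> is of interest.
        if tag != 'input' or self.value is not None:
            return
        attributes = dict(attrs)
        classes = (attributes.get('class') or '').split()
        if self.class_name in classes:
            # HTMLParser has already unescaped entities such as &quot;.
            self.value = attributes.get('value')


def extract_playlist(webpage):
    """Return the decoded playlist, or None if no matching input exists."""
    parser = PlaylistInputParser('js-play8-playlist')
    parser.feed(webpage)
    if parser.value is None:
        return None
    return json.loads(parser.value)


# Hypothetical markup, assumed for illustration only.
SAMPLE_PAGE = '''
<input class="js-play8-playlist" type="hidden"
       value="[{&quot;title&quot;: &quot;Example&quot;, &quot;sources&quot;: []}]"/>
'''

playlist = extract_playlist(SAMPLE_PAGE)
```

Parsing the attribute with an HTML parser, instead of a hand-written regular expression, means quoting style and attribute order in the markup no longer matter.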

Closes #21330, closes #23586, closes #23700.

dstftw · 2020-02-05T16:49:24Z

youtube_dl/extractor/archiveorg.py

-            r"(?s)Play\('[^']+'\s*,\s*(\[.+\])\s*,\s*{.*?}\)",
-            webpage, 'jwplayer playlist'), video_id)
+            r'.*\s+value\s*=\s*[\'"](.+)[\'"][\s/]',
+            input_element_with_playlist, 'playlist data'), video_id)


extract_attributes.

Thanks for the pointer. I changed the code to use extract_attributes() in commit e910f49.

It was still bugging me that I couldn't use get_element_by_class() in the line before, because it returns only the content of the element and not the element itself (as the function name suggests it would). So in commit b98d1c0 I changed get_element_by_class() (and its associated functions) to accept an optional include_tag parameter. It defaults to False, so no existing code is affected.

This should prove useful in the future and make those functions more intuitive to use.
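To make the difference between the two return modes concrete, here is a minimal sketch of a lookup with an `include_tag` switch. It is a simplified, hypothetical stand-in (`get_element_by_class_sketch`), not the code from commit b98d1c0: it assumes a double-quoted `class` attribute and does not handle nested elements of the same tag.

```python
import re


def get_element_by_class_sketch(class_name, html, include_tag=False):
    """Return the first element with the given class.

    With include_tag=False only the element's content is returned;
    with include_tag=True the whole element, including its opening
    and closing tags, is returned.
    """
    pattern = (
        r'(?s)<(?P<tag>[a-zA-Z0-9]+)\b[^>]*\bclass="[^"]*'
        r'(?<![\w-])%s(?![\w-])[^"]*"[^>]*>'
        r'(?P<content>.*?)</(?P=tag)>' % re.escape(class_name)
    )
    match = re.search(pattern, html)
    if match is None:
        return None
    return match.group(0) if include_tag else match.group('content')


sample = '<div class="note big">Hello</div>'
content_only = get_element_by_class_sketch('note', sample)
whole_element = get_element_by_class_sketch('note', sample, include_tag=True)
```

Because the parameter defaults to `False`, callers that pass only the class name and the HTML keep receiving the content, which is the backward-compatibility property described above.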

I also added tests to test_utils.py. While doing so I noticed that get_element_by_class() didn't work for class names starting with a hyphen. I fixed that, too.
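One plausible cause of the hyphen problem, assuming the lookup relied on word boundaries (`\b`) around the class name, is that `\b` only matches between a word character and a non-word character. A leading hyphen preceded by a quote or a space has non-word characters on both sides, so the boundary never matches. The sketch below demonstrates this with plain `re`; the lookaround variant is one possible fix, not necessarily the one used in the pull request.

```python
import re

html = '<span class="-highlight">text</span>'
class_name = '-highlight'

# \b needs a word character on exactly one side. Between '"' and '-'
# both sides are non-word characters, so this pattern cannot match.
with_word_boundary = re.search(
    r'class="[^"]*\b%s\b[^"]*"' % re.escape(class_name), html)

# Lookarounds only require that the class name is not glued to another
# word character or hyphen, which also works for a leading hyphen.
with_lookarounds = re.search(
    r'class="[^"]*(?<![\w-])%s(?![\w-])[^"]*"' % re.escape(class_name), html)
```

Here `with_word_boundary` is `None`, while `with_lookarounds` is a match object.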

dstftw · 2020-02-05T16:49:36Z

youtube_dl/extractor/archiveorg.py

@@ -52,7 +55,7 @@ def get_optional(metadata, field):
        metadata = self._download_json(
            'http://archive.org/details/' + video_id, video_id, query={
                'output': 'json',
-            })['metadata']
+            }).get('metadata', {})


Point of this?

The idea is to make the extraction more resilient against website changes. With metadata being optional for downloading videos, it's probably better not to crash if the metadata element changes its name.

This is not enough. If you consider this optional, then you must also:

Handle non dict metadata scenario.

Handle download failure.

You are right. Thanks. Commit 1326a5a fixes this by setting fatal=False when calling _download_json() and by checking the return value.
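The two review points (non-dict metadata and download failure) can be sketched as a small defensive helper. This is an illustration under stated assumptions, not the code from commit 1326a5a: `extract_metadata` is a hypothetical name, and it assumes that a non-fatal download that fails hands back some non-dict value such as `False` or `None`.

```python
def extract_metadata(response):
    """Return the 'metadata' dict from a decoded JSON response, or {}.

    Covers a failed download (response is not a dict), a missing
    'metadata' key, and a 'metadata' value that is not a dict.
    """
    if not isinstance(response, dict):
        return {}
    metadata = response.get('metadata')
    return metadata if isinstance(metadata, dict) else {}


# Assumed failure values and one well-formed response.
failed_download = extract_metadata(False)
wrong_type = extract_metadata({'metadata': ['not', 'a', 'dict']})
well_formed = extract_metadata({'metadata': {'title': ['Example']}})
```

With a helper of this kind, later lookups can call `.get()` on the result without first checking whether the download or the parsing succeeded.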

Use get_element_by_class() from utils to get rid of yet another regex.

This function used to return only the content of the element, not the element itself with its tag and attributes. The whole group of get_element_by_X() functions is something of a misnomer, as they all return the *content* of the element rather than the element. All of these functions can now return the whole element when their `include_tag` parameter is set to `True`. It defaults to `False`, so no other code is affected by this change.

Tests have been added to test/test_utils.py accordingly. They uncovered a bug that prevented elements whose class name starts with a hyphen from being found. This has been fixed by correcting the regex used in get_elements_by_class().

lukaarma · 2020-12-14T12:17:57Z

@dstftw does this pull request need more work?

[archiveorg] Fix extraction (closes #21330, closes #23586, closes #23…

8df0c2c

…700)

TinyToweringTree requested a review from dstftw January 25, 2020 20:00

dstftw requested changes Feb 5, 2020

TinyToweringTree added 3 commits February 19, 2020 22:04

[archiveorg] Use extract_attributes()

e910f49

[archiveorg] Make metadata extraction more robust

1326a5a

dinosore mentioned this pull request Mar 25, 2020

Internet Archive: Unable to extract jwplayer playlist #23586

Closed

dstftw force-pushed the master branch from 7b956a1 to 5e26784 Compare September 13, 2020 13:50

underhandedness mentioned this pull request Nov 20, 2020

archive.org video stream 'Unable to extract' vs pull/23827 #27109

Closed

adrianheine mentioned this pull request Feb 3, 2021

[ArchiveOrg] Fix extractor #28063

Closed

11 tasks

dirkf force-pushed the master branch from 01bf89e to 4c6fba3 Compare August 26, 2022 07:51

dirkf closed this Aug 1, 2023

dirkf added the defunct PR source branch is not accessible label Oct 2, 2023
