[motherless] Fix broken download on recent videos and other values extraction #27450

Closed
wants to merge 3 commits into from

@cladmi cladmi commented Dec 16, 2020

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

This pull request fixes one blocking issue with date retrieval, plus missing stats extraction, when downloading videos from motherless.

Invalid date on recent videos

The blocking issue is that recent videos use a different date format: instead of an absolute date such as 1 Jan 2020, the page shows a relative one such as 1d ago or 1h ago.
This is fixed by the first commit.
No test was added, as the URL would need to be dynamic. The fix can be tested against the videos listed at https://motherless.com/videos/recent.
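
As an illustration of the conversion described above, here is a minimal sketch, not the code from the commits. The helper name `relative_to_upload_date`, the `_UNIT_TO_KWARG` table and the `now` parameter are assumptions made for this example; `now` is there only so the result can be checked deterministically.

```python
import datetime
import re

# Hypothetical lookup table: maps the unit letter shown on the page
# to the matching datetime.timedelta keyword argument.
_UNIT_TO_KWARG = {'h': 'hours', 'd': 'days'}


def relative_to_upload_date(text, now=None):
    """Convert '20h ago' or '1d ago' into a YYYYMMDD string.

    Returns None when the text is not in the relative format.
    """
    match = re.match(r'\s*(\d+)\s*([hd])\s+[aA]go\s*$', text)
    if match is None:
        return None
    now = now or datetime.datetime.now()
    amount = int(match.group(1))
    delta = datetime.timedelta(**{_UNIT_TO_KWARG[match.group(2)]: amount})
    return (now - delta).strftime('%Y%m%d')
```

The result depends on the local clock, so an hours-based value can land on either the current or the previous day. This is also why a static test expectation is hard to write for such videos.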

Date is not found
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-j', 'https://motherless.com/3FE63E0']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.14
[debug] Lazy loading extractors enabled
[debug] Git HEAD: 1bc1520ad
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
ERROR: Unable to extract upload date; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/home/cladmi/git/youtube-dl/youtube_dl/YoutubeDL.py", line 803, in wrapper
    return func(self, *args, **kwargs)
  File "/home/cladmi/git/youtube-dl/youtube_dl/YoutubeDL.py", line 824, in __extract_info
    ie_result = ie.extract(url)
  File "/home/cladmi/git/youtube-dl/youtube_dl/extractor/common.py", line 532, in extract
    ie_result = self._real_extract(url)
  File "/home/cladmi/git/youtube-dl/youtube_dl/extractor/motherless.py", line 94, in _real_extract
    upload_date = self._html_search_regex(
  File "/home/cladmi/git/youtube-dl/youtube_dl/extractor/common.py", line 1019, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
  File "/home/cladmi/git/youtube-dl/youtube_dl/extractor/common.py", line 1010, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract upload date; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Date is matched `20201215` for 22h ago and yesterday
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-j', 'https://motherless.com/F077C76', 'https://motherless.com/3FE63E0']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.14
[debug] Lazy loading extractors enabled
[debug] Git HEAD: 254f878b2
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[debug] Default format spec: bestvideo+bestaudio/best
{"id": "F077C76", "title": "Cruel ball whipping", "upload_date": "20201215", "uploader_id": "Soa60", "thumbnail": "https://cdn5-thumbs.motherlessmedia.com/thumbs/F077C76.jpg", "categories": null, "view_count": 26, "like_count": 0, "comment_count": 0, "age_limit": 18, "url": "https://cdn5-videos.motherlessmedia.com/videos/F077C76.mp4", "extractor": "Motherless", "webpage_url": "https://motherless.com/F077C76", "webpage_url_basename": "F077C76", "extractor_key": "Motherless", "playlist": null, "playlist_index": null, "thumbnails": [{"url": "https://cdn5-thumbs.motherlessmedia.com/thumbs/F077C76.jpg", "id": "0"}], "display_id": "F077C76", "requested_subtitles": null, "format_id": "0", "format": "0 - unknown", "ext": "mp4", "protocol": "https", "http_headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.61 Safari/537.36", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-us,en;q=0.5"}, "fulltitle": "Cruel ball whipping", "_filename": "Cruel ball whipping-F077C76.mp4"}
[debug] Default format spec: bestvideo+bestaudio/best
{"id": "3FE63E0", "title": "My slut", "upload_date": "20201215", "uploader_id": "Kinkyrthbetter", "thumbnail": "https://cdn5-thumbs.motherlessmedia.com/thumbs/3FE63E0.jpg", "categories": null, "view_count": 22, "like_count": 0, "comment_count": 0, "age_limit": 18, "url": "https://cdn5-videos.motherlessmedia.com/videos/3FE63E0.mp4", "extractor": "Motherless", "webpage_url": "https://motherless.com/3FE63E0", "webpage_url_basename": "3FE63E0", "extractor_key": "Motherless", "playlist": null, "playlist_index": null, "thumbnails": [{"url": "https://cdn5-thumbs.motherlessmedia.com/thumbs/3FE63E0.jpg", "id": "0"}], "display_id": "3FE63E0", "requested_subtitles": null, "format_id": "0", "format": "0 - unknown", "ext": "mp4", "protocol": "https", "http_headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.61 Safari/537.36", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-us,en;q=0.5"}, "fulltitle": "My slut", "_filename": "My slut-3FE63E0.mp4"}

View count and like count not parsed correctly when over 1000

When these values are over 1000, they are separated by commas at the thousand and million boundaries.
The fix handles both a comma and a dot, to match what is handled by str_to_int.

For example, the E0E8F2B video has both a view count and a like count over 1000.
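
A minimal sketch of the separator handling, assuming a simplified page markup. `parse_count` and `extract_count` are hypothetical names for this example and stand in for the extractor's real regexes and for str_to_int; the HTML in the usage note is invented for illustration.

```python
import re


def parse_count(text):
    """Strip ',' and '.' group separators, then convert to int (or None)."""
    cleaned = re.sub(r'[,.]', '', text.strip())
    return int(cleaned) if re.fullmatch(r'[0-9]+', cleaned) else None


def extract_count(webpage, label):
    """Find '<digits with optional separators> <label>' and parse it.

    The '>... label<' shape is an assumption about the markup.
    """
    match = re.search(
        r'>\s*([0-9][0-9,.]*)\s+%s<' % re.escape(label), webpage)
    return parse_count(match.group(1)) if match else None
```

With a hypothetical snippet such as `<span>1,305,837 Views</span>`, `extract_count(snippet, 'Views')` gives 1305837. A pattern that only accepts `\d+` would not match that text, which is consistent with the null values in the log without the fix.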

Without the fix "view_count": null, "like_count": null,
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-j', 'https://motherless.com/E0E8F2B']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.14
[debug] Lazy loading extractors enabled
[debug] Git HEAD: 1bc1520ad
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
WARNING: unable to extract view count; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
WARNING: unable to extract like count; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
[debug] Default format spec: bestvideo+bestaudio/best
{"id": "E0E8F2B", "title": "Threesome has wife yelling don't stop", "upload_date": "20181111", "uploader_id": "dirty_fun19", "thumbnail": "https://cdn5-thumbs.motherlessmedia.com/thumbs/E0E8F2B.jpg", "categories": ["tumblr", "Amateur", "cell", "Homemade", "private", "stolen", "tattoo", "swing", "mmf", "doggy", "from behind", "head", "suck", "blow", "threesome"], "view_count": null, "like_count": null, "comment_count": 37, "age_limit": 18, "url": "https://cdn5-videos.motherlessmedia.com/videos/E0E8F2B.mp4", "extractor": "Motherless", "webpage_url": "https://motherless.com/E0E8F2B", "webpage_url_basename": "E0E8F2B", "extractor_key": "Motherless", "playlist": null, "playlist_index": null, "thumbnails": [{"url": "https://cdn5-thumbs.motherlessmedia.com/thumbs/E0E8F2B.jpg", "id": "0"}], "display_id": "E0E8F2B", "requested_subtitles": null, "format_id": "0", "format": "0 - unknown", "ext": "mp4", "protocol": "https", "http_headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.111 Safari/537.36", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-us,en;q=0.5"}, "fulltitle": "Threesome has wife yelling don't stop", "_filename": "Threesome has wife yelling don't stop-E0E8F2B.mp4"}
With the fix "view_count": 1305837, "like_count": 3673,
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-j', 'https://motherless.com/E0E8F2B']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.14
[debug] Lazy loading extractors enabled
[debug] Git HEAD: 4280e1581
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[debug] Default format spec: bestvideo+bestaudio/best
{"id": "E0E8F2B", "title": "Threesome has wife yelling don't stop", "upload_date": "20181111", "uploader_id": "dirty_fun19", "thumbnail": "https://cdn5-thumbs.motherlessmedia.com/thumbs/E0E8F2B.jpg", "categories": ["tumblr", "Amateur", "cell", "Homemade", "private", "stolen", "tattoo", "swing", "mmf", "doggy", "from behind", "head", "suck", "blow", "threesome"], "view_count": 1305837, "like_count": 3673, "comment_count": 37, "age_limit": 18, "url": "https://cdn5-videos.motherlessmedia.com/videos/E0E8F2B.mp4", "extractor": "Motherless", "webpage_url": "https://motherless.com/E0E8F2B", "webpage_url_basename": "E0E8F2B", "extractor_key": "Motherless", "playlist": null, "playlist_index": null, "thumbnails": [{"url": "https://cdn5-thumbs.motherlessmedia.com/thumbs/E0E8F2B.jpg", "id": "0"}], "display_id": "E0E8F2B", "requested_subtitles": null, "format_id": "0", "format": "0 - unknown", "ext": "mp4", "protocol": "https", "http_headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3591.3 Safari/537.36", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-us,en;q=0.5"}, "fulltitle": "Threesome has wife yelling don't stop", "_filename": "Threesome has wife yelling don't stop-E0E8F2B.mp4"}

Closes #26495

Comment on lines 101 to 106
if re.search(r'd\s+[aA]go', upload_date):
    days = int(re.search(r'([0-9]+)', upload_date).group(1))
    upload_date = (datetime.datetime.now() - datetime.timedelta(days=days)).strftime('%Y%m%d')
elif re.search(r'h\s+[aA]go', upload_date):
    hours = int(re.search(r'([0-9]+)', upload_date).group(1))
    upload_date = (datetime.datetime.now() - datetime.timedelta(hours=hours)).strftime('%Y%m%d')
Collaborator

DRY.

Author

I fixed this one too. The code is a bit longer, though, and I am not sure it is clearer.

Author

I was not sure about splitting it into a sub-function. If you have a format you prefer, just point me to an example in the repo and I will update it again.

Comment on lines 98 to 99
r'class=["\']count[^>]+>(\d+d\s+[aA]go)<', # 1d ago
r'class=["\']count[^>]+>(\d+h\s+[aA]go)<', # 20h ago
Collaborator

DRY.

Author

Fixed

Less than a week old videos use a '20h ago' or '1d ago' format.

I kept the support for 'Ago' with an uppercase first letter as it was already in the code.
On my end, the view count is using a comma separated number.

I matched it with ',' and '.' in case it could be locale dependent as both are
supported by str_to_int.
On my end, the Favorites count is using a comma separated number.

I matched it with ',' and '.' in case it could be locale dependent as both are
supported by str_to_int.

cladmi commented Dec 17, 2020

I pushed fixup commits and then autosquashed them to get to a mergeable state, but for some reason GitHub no longer displays the two fixup commits as I thought it would.

It was the following ones: https://github.com/cladmi/youtube-dl/commit/a4311a85b https://github.com/cladmi/youtube-dl/commit/ae8261943

Collaborator

@dstftw dstftw left a comment

Add tests.

webpage, 'like count', fatal=False))

upload_date = self._html_search_regex(
(r'class=["\']count[^>]+>(\d+\s+[a-zA-Z]{3}\s+\d{4})<',
r'class=["\']count[^>]+>(\d+[hd])\s+[aA]go<', # 20h/1d ago
Collaborator

Do not mix different scenarios in single regex. Upload date should be extracted first. If this fails it should fallback on ago pattern extraction.
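
The ordering asked for here could be sketched as follows. The two patterns are the ones quoted in the diff above; the function name `find_upload_date_text`, the returned tuple shape and the HTML used in the usage note are assumptions for illustration, not the merged code.

```python
import re

# Patterns quoted from the diff under review.
ABSOLUTE_RE = r'class=["\']count[^>]+>(\d+\s+[a-zA-Z]{3}\s+\d{4})<'
RELATIVE_RE = r'class=["\']count[^>]+>(\d+[hd])\s+[aA]go<'


def find_upload_date_text(webpage):
    """Try the absolute date first and fall back on the 'ago' pattern.

    Returns (kind, text), where kind is 'absolute' or 'relative',
    or (None, None) when neither pattern matches.
    """
    for kind, pattern in (('absolute', ABSOLUTE_RE),
                          ('relative', RELATIVE_RE)):
        match = re.search(pattern, webpage)
        if match:
            return kind, match.group(1)
    return None, None
```

For a hypothetical `<span class="count">22h ago</span>` this returns `('relative', '22h')`, so the caller knows which scenario it is handling instead of guessing from the captured text.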

unit = relative.group(2)
if unit == 'h':
    delta_t = datetime.timedelta(hours=delta)
else:  # unit == 'd'
Collaborator

This should be in assert not in comment.

if unit == 'h':
    delta_t = datetime.timedelta(hours=delta)
else:  # unit == 'd'
    delta_t = datetime.timedelta(days=delta)
Collaborator

DRY 105, 107.

@dstftw dstftw closed this in ecae54a Jan 5, 2021
@cladmi cladmi deleted the pr/fix_some_motherless_issues branch January 5, 2021 08:34

cladmi commented Jan 5, 2021

Thank you for integrating the changes and taking care of the last fixes. I did not manage to address them myself since then.

Cheers.

ThirumalaiK pushed a commit to ThirumalaiK/youtube-dl that referenced this pull request Jan 28, 2021