[motherless] Fix broken download on recent videos and other values extraction #27450

cladmi · 2020-12-16T10:37:43Z

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
- Only the view count and favorite count are fixed in [motherless] bug#fail to parse view_like and like_count when comma is… #26495, with a more permissive match. That PR also lacks the PR checklist.
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
~~Covered the code with tests (note that PRs without tests will be REJECTED)~~
- A recent upload date stops being recent after a day or a week, and such dynamic values cannot be tested with the test_download script.
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

This pull request fixes one blocking issue with date retrieval, as well as missing stats extraction, when downloading videos from motherless.

Invalid date on recent videos

The blocking issue is that the date format is different on new videos. Instead of a `1 Jan 2020` format, they use a `1d ago` or `1h ago` format.
This is fixed by the first commit.
No test was added, as the URL would need to be dynamic. The fix can be tested by going through the videos listed at https://motherless.com/videos/recent.
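
The conversion described above can be sketched as follows. This is a minimal illustration rather than the code from the commit: `relative_to_upload_date` is a hypothetical helper, and `now` is passed in explicitly (an assumption made here) so that the result is reproducible.

```python
import datetime
import re


def relative_to_upload_date(text, now):
    """Convert '20h ago' / '1d ago' into a YYYYMMDD string relative to `now`.

    Hypothetical helper for illustration; returns None when `text` is not
    in the relative format.
    """
    match = re.search(r'(\d+)\s*([hd])\s+[aA]go', text)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    if unit == 'h':
        delta = datetime.timedelta(hours=amount)
    else:
        delta = datetime.timedelta(days=amount)
    return (now - delta).strftime('%Y%m%d')
```

With `now` set to 2020-12-16 10:00, both `22h ago` and `1d ago` give `20201215`, while an absolute date such as `1 Jan 2020` returns `None` and has to be handled by the existing pattern.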

Date is not found

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-j', 'https://motherless.com/3FE63E0']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.14
[debug] Lazy loading extractors enabled
[debug] Git HEAD: 1bc1520ad
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
ERROR: Unable to extract upload date; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/home/cladmi/git/youtube-dl/youtube_dl/YoutubeDL.py", line 803, in wrapper
    return func(self, *args, **kwargs)
  File "/home/cladmi/git/youtube-dl/youtube_dl/YoutubeDL.py", line 824, in __extract_info
    ie_result = ie.extract(url)
  File "/home/cladmi/git/youtube-dl/youtube_dl/extractor/common.py", line 532, in extract
    ie_result = self._real_extract(url)
  File "/home/cladmi/git/youtube-dl/youtube_dl/extractor/motherless.py", line 94, in _real_extract
    upload_date = self._html_search_regex(
  File "/home/cladmi/git/youtube-dl/youtube_dl/extractor/common.py", line 1019, in _html_search_regex
    res = self._search_regex(pattern, string, name, default, fatal, flags, group)
  File "/home/cladmi/git/youtube-dl/youtube_dl/extractor/common.py", line 1010, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract upload date; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Date is matched `20201215` for 22h ago and yesterday

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-j', 'https://motherless.com/F077C76', 'https://motherless.com/3FE63E0']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.14
[debug] Lazy loading extractors enabled
[debug] Git HEAD: 254f878b2
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[debug] Default format spec: bestvideo+bestaudio/best
{"id": "F077C76", "title": "Cruel ball whipping", "upload_date": "20201215", "uploader_id": "Soa60", "thumbnail": "https://cdn5-thumbs.motherlessmedia.com/thumbs/F077C76.jpg", "categories": null, "view_count": 26, "like_count": 0, "comment_count": 0, "age_limit": 18, "url": "https://cdn5-videos.motherlessmedia.com/videos/F077C76.mp4", "extractor": "Motherless", "webpage_url": "https://motherless.com/F077C76", "webpage_url_basename": "F077C76", "extractor_key": "Motherless", "playlist": null, "playlist_index": null, "thumbnails": [{"url": "https://cdn5-thumbs.motherlessmedia.com/thumbs/F077C76.jpg", "id": "0"}], "display_id": "F077C76", "requested_subtitles": null, "format_id": "0", "format": "0 - unknown", "ext": "mp4", "protocol": "https", "http_headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.61 Safari/537.36", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-us,en;q=0.5"}, "fulltitle": "Cruel ball whipping", "_filename": "Cruel ball whipping-F077C76.mp4"}
[debug] Default format spec: bestvideo+bestaudio/best
{"id": "3FE63E0", "title": "My slut", "upload_date": "20201215", "uploader_id": "Kinkyrthbetter", "thumbnail": "https://cdn5-thumbs.motherlessmedia.com/thumbs/3FE63E0.jpg", "categories": null, "view_count": 22, "like_count": 0, "comment_count": 0, "age_limit": 18, "url": "https://cdn5-videos.motherlessmedia.com/videos/3FE63E0.mp4", "extractor": "Motherless", "webpage_url": "https://motherless.com/3FE63E0", "webpage_url_basename": "3FE63E0", "extractor_key": "Motherless", "playlist": null, "playlist_index": null, "thumbnails": [{"url": "https://cdn5-thumbs.motherlessmedia.com/thumbs/3FE63E0.jpg", "id": "0"}], "display_id": "3FE63E0", "requested_subtitles": null, "format_id": "0", "format": "0 - unknown", "ext": "mp4", "protocol": "https", "http_headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.61 Safari/537.36", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-us,en;q=0.5"}, "fulltitle": "My slut", "_filename": "My slut-3FE63E0.mp4"}

View count and like count not parsed correctly when over 1000

When these values exceed 1000, the digits are grouped with commas at the thousand and million boundaries.
The fix handles both a comma and a dot as separators, matching what str_to_int handles.

For example, the E0E8F2B video has both a view count and a like count over 1000.
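
A rough sketch of the idea, under stated assumptions: the HTML layout matched below is made up for illustration (the real page markup is not shown in this PR), and `to_int` is a local stand-in, not youtube-dl's `str_to_int`.

```python
import re


def to_int(text):
    """Drop ',' and '.' group separators and convert to int (local stand-in)."""
    return int(re.sub(r'[,.]', '', text))


def extract_count(html, label):
    """Return the number following `label` in the hypothetical markup, or None."""
    # \d[\d,.]* accepts plain digits as well as '1,305,837' or '1.305.837'.
    match = re.search(r'%s\s*</strong>\s*(\d[\d,.]*)' % re.escape(label), html)
    return to_int(match.group(1)) if match else None
```

A pattern that only accepts `\d+` stops at the first separator or fails to match, which is consistent with the `null` values in the log without the fix.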

Without the fix "view_count": null, "like_count": null,

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-j', 'https://motherless.com/E0E8F2B']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.14
[debug] Lazy loading extractors enabled
[debug] Git HEAD: 1bc1520ad
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
WARNING: unable to extract view count; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
WARNING: unable to extract like count; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
[debug] Default format spec: bestvideo+bestaudio/best
{"id": "E0E8F2B", "title": "Threesome has wife yelling don't stop", "upload_date": "20181111", "uploader_id": "dirty_fun19", "thumbnail": "https://cdn5-thumbs.motherlessmedia.com/thumbs/E0E8F2B.jpg", "categories": ["tumblr", "Amateur", "cell", "Homemade", "private", "stolen", "tattoo", "swing", "mmf", "doggy", "from behind", "head", "suck", "blow", "threesome"], "view_count": null, "like_count": null, "comment_count": 37, "age_limit": 18, "url": "https://cdn5-videos.motherlessmedia.com/videos/E0E8F2B.mp4", "extractor": "Motherless", "webpage_url": "https://motherless.com/E0E8F2B", "webpage_url_basename": "E0E8F2B", "extractor_key": "Motherless", "playlist": null, "playlist_index": null, "thumbnails": [{"url": "https://cdn5-thumbs.motherlessmedia.com/thumbs/E0E8F2B.jpg", "id": "0"}], "display_id": "E0E8F2B", "requested_subtitles": null, "format_id": "0", "format": "0 - unknown", "ext": "mp4", "protocol": "https", "http_headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.111 Safari/537.36", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-us,en;q=0.5"}, "fulltitle": "Threesome has wife yelling don't stop", "_filename": "Threesome has wife yelling don't stop-E0E8F2B.mp4"}

With the fix "view_count": 1305837, "like_count": 3673,

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-j', 'https://motherless.com/E0E8F2B']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2020.12.14
[debug] Lazy loading extractors enabled
[debug] Git HEAD: 4280e1581
[debug] Python version 3.9.0 (CPython) - Linux-5.9.13-arch1-1-x86_64-with-glibc2.32
[debug] exe versions: ffmpeg 4.3.1, ffprobe 4.3.1, rtmpdump 2.4
[debug] Proxy map: {}
[debug] Default format spec: bestvideo+bestaudio/best
{"id": "E0E8F2B", "title": "Threesome has wife yelling don't stop", "upload_date": "20181111", "uploader_id": "dirty_fun19", "thumbnail": "https://cdn5-thumbs.motherlessmedia.com/thumbs/E0E8F2B.jpg", "categories": ["tumblr", "Amateur", "cell", "Homemade", "private", "stolen", "tattoo", "swing", "mmf", "doggy", "from behind", "head", "suck", "blow", "threesome"], "view_count": 1305837, "like_count": 3673, "comment_count": 37, "age_limit": 18, "url": "https://cdn5-videos.motherlessmedia.com/videos/E0E8F2B.mp4", "extractor": "Motherless", "webpage_url": "https://motherless.com/E0E8F2B", "webpage_url_basename": "E0E8F2B", "extractor_key": "Motherless", "playlist": null, "playlist_index": null, "thumbnails": [{"url": "https://cdn5-thumbs.motherlessmedia.com/thumbs/E0E8F2B.jpg", "id": "0"}], "display_id": "E0E8F2B", "requested_subtitles": null, "format_id": "0", "format": "0 - unknown", "ext": "mp4", "protocol": "https", "http_headers": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3591.3 Safari/537.36", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Encoding": "gzip, deflate", "Accept-Language": "en-us,en;q=0.5"}, "fulltitle": "Threesome has wife yelling don't stop", "_filename": "Threesome has wife yelling don't stop-E0E8F2B.mp4"}

Closes #26495

dstftw · 2020-12-17T01:23:41Z

youtube_dl/extractor/motherless.py

+        if re.search(r'd\s+[aA]go', upload_date):
            days = int(re.search(r'([0-9]+)', upload_date).group(1))
            upload_date = (datetime.datetime.now() - datetime.timedelta(days=days)).strftime('%Y%m%d')
+        elif re.search(r'h\s+[aA]go', upload_date):
+            hours = int(re.search(r'([0-9]+)', upload_date).group(1))
+            upload_date = (datetime.datetime.now() - datetime.timedelta(hours=hours)).strftime('%Y%m%d')


I fixed this one too. The code is a bit longer, though I am not sure it is clearer.

I was not sure about splitting it into a sub-function. If you have a format you prefer, just point me to an example in the repo and I will update it again.

dstftw · 2020-12-17T01:23:47Z

youtube_dl/extractor/motherless.py

+             r'class=["\']count[^>]+>(\d+d\s+[aA]go)<',  # 1d ago
+             r'class=["\']count[^>]+>(\d+h\s+[aA]go)<',  # 20h ago


Videos less than a week old use a '20h ago' or '1d ago' format. I kept the support for 'Ago' with an uppercase first letter, as it was already in the code.

On my end, the view count is using a comma-separated number. I matched it with ',' and '.' in case it could be locale-dependent, as both are supported by str_to_int.

On my end, the Favorites count is using a comma-separated number. I matched it with ',' and '.' in case it could be locale-dependent, as both are supported by str_to_int.

cladmi · 2020-12-17T09:11:13Z

I pushed fixup commits and then autosquashed them to get to a mergeable state, but for some reason GitHub no longer displays the two fixup commits as I thought it would.

They were the following: https://github.com/cladmi/youtube-dl/commit/a4311a85b https://github.com/cladmi/youtube-dl/commit/ae8261943

dstftw

Add tests.

dstftw · 2020-12-19T15:57:08Z

youtube_dl/extractor/motherless.py

            webpage, 'like count', fatal=False))

        upload_date = self._html_search_regex(
            (r'class=["\']count[^>]+>(\d+\s+[a-zA-Z]{3}\s+\d{4})<',
+             r'class=["\']count[^>]+>(\d+[hd])\s+[aA]go<',  # 20h/1d ago


Do not mix different scenarios in a single regex. The upload date should be extracted first. If this fails, it should fall back on the ago pattern extraction.
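
One possible reading of this comment, sketched outside the extractor: try the absolute date pattern first and consult the relative one only if it fails. The two patterns are taken from the diff above; `find_date_text` and the sample markup in the usage note are hypothetical.

```python
import re

ABSOLUTE_RE = r'class=["\']count[^>]+>(\d+\s+[a-zA-Z]{3}\s+\d{4})<'
RELATIVE_RE = r'class=["\']count[^>]+>(\d+[hd])\s+[aA]go<'


def find_date_text(html):
    """Return (kind, text), trying the absolute pattern before the relative one."""
    for kind, pattern in (('absolute', ABSOLUTE_RE), ('relative', RELATIVE_RE)):
        match = re.search(pattern, html)
        if match:
            return kind, match.group(1)
    return None, None
```

For `<span class="count">22h ago</span>` this yields `('relative', '22h')`; keeping the two cases apart lets the caller decide how each kind of text is turned into a date.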

dstftw · 2020-12-19T15:57:52Z

youtube_dl/extractor/motherless.py

+            unit = relative.group(2)
+            if unit == 'h':
+                delta_t = datetime.timedelta(hours=delta)
+            else:  # unit == 'd'


This should be an assert, not a comment.

dstftw · 2020-12-19T15:58:27Z

youtube_dl/extractor/motherless.py

+            if unit == 'h':
+                delta_t = datetime.timedelta(hours=delta)
+            else:  # unit == 'd'
+                delta_t = datetime.timedelta(days=delta)


DRY 105, 107.
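
A sketch of how the two review comments above might be addressed together, assuming "DRY" refers to the two near-identical `datetime.timedelta(...)` lines. `UNIT_TO_KWARG` and `relative_delta` are hypothetical names, not the code that was eventually merged.

```python
import datetime

# Map the captured unit letter to the matching timedelta keyword argument.
UNIT_TO_KWARG = {'h': 'hours', 'd': 'days'}


def relative_delta(amount, unit):
    """Build the timedelta for '<amount><unit> ago' with a single constructor call."""
    # The expectation about `unit` is enforced here instead of noted in a comment.
    assert unit in UNIT_TO_KWARG, 'unexpected unit: %r' % unit
    return datetime.timedelta(**{UNIT_TO_KWARG[unit]: amount})
```

Supporting another unit would then only require a new dictionary entry.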

cladmi · 2021-01-05T08:47:26Z

Thank you for integrating the changes and taking care of the last fixes; I had not managed to address them myself since then.

Cheers.

dstftw requested changes Dec 17, 2020

dstftw added the pending-fixes label Dec 17, 2020

cladmi added 3 commits December 17, 2020 10:06

[motherless] Fix upload date on recent videos

97df7a7

Videos less than a week old use a '20h ago' or '1d ago' format. I kept the support for 'Ago' with an uppercase first letter, as it was already in the code.

[motherless] Fix view counts using commas

fa8457a

On my end, the view count is using a comma-separated number. I matched it with ',' and '.' in case it could be locale-dependent, as both are supported by str_to_int.

[motherless] Fix like counts using commas

28d69f3

On my end, the Favorites count is using a comma-separated number. I matched it with ',' and '.' in case it could be locale-dependent, as both are supported by str_to_int.

dstftw requested changes Dec 19, 2020

dstftw closed this in ecae54a Jan 5, 2021

cladmi deleted the pr/fix_some_motherless_issues branch January 5, 2021 08:34

ThirumalaiK pushed a commit to ThirumalaiK/youtube-dl that referenced this pull request Jan 28, 2021

[motherless] Fix review issues and improve extraction (closes ytdl-or…

e9dca9b

…g#26495, closes ytdl-org#27450)
