-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[motherless] Fix broken download on recent videos and other values extraction #27450
Conversation
youtube_dl/extractor/motherless.py
Outdated
if re.search(r'd\s+[aA]go', upload_date): | ||
days = int(re.search(r'([0-9]+)', upload_date).group(1)) | ||
upload_date = (datetime.datetime.now() - datetime.timedelta(days=days)).strftime('%Y%m%d') | ||
elif re.search(r'h\s+[aA]go', upload_date): | ||
hours = int(re.search(r'([0-9]+)', upload_date).group(1)) | ||
upload_date = (datetime.datetime.now() - datetime.timedelta(hours=hours)).strftime('%Y%m%d') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DRY.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed this one too, the code is a bit longer though not sure it is clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not sure about splitting in a sub-function. If you have a format you prefer, just point me to an example in the repo and I will update it again.
youtube_dl/extractor/motherless.py
Outdated
r'class=["\']count[^>]+>(\d+d\s+[aA]go)<', # 1d ago | ||
r'class=["\']count[^>]+>(\d+h\s+[aA]go)<', # 20h ago |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DRY.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Less than a week old videos use a '20h ago' or '1d ago' format. I kept the support for 'Ago' with uppercase start at is was already in the code.
On my end, the view count is using a comma separated number. I matched it with ',' and '.' in case it could be locale dependant as both are supported by str_to_int.
On my end, the Favorites count is using a comma separated number. I matched it with ',' and '.' in case it could be locale dependant as both are supported by str_to_int.
I did push fixup commits and then autosquash them to be in a mergeable state, but for some reason github does not display the two fixup commits anymore as I thought it would. It was the following ones: https://github.com/cladmi/youtube-dl/commit/a4311a85b https://github.com/cladmi/youtube-dl/commit/ae8261943 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests.
webpage, 'like count', fatal=False)) | ||
|
||
upload_date = self._html_search_regex( | ||
(r'class=["\']count[^>]+>(\d+\s+[a-zA-Z]{3}\s+\d{4})<', | ||
r'class=["\']count[^>]+>(\d+[hd])\s+[aA]go<', # 20h/1d ago |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not mix different scenarios in single regex. Upload date should be extracted first. If this fails it should fallback on ago pattern extraction.
unit = relative.group(2) | ||
if unit == 'h': | ||
delta_t = datetime.timedelta(hours=delta) | ||
else: # unit == 'd' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in assert not in comment.
if unit == 'h': | ||
delta_t = datetime.timedelta(hours=delta) | ||
else: # unit == 'd' | ||
delta_t = datetime.timedelta(days=delta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DRY 105, 107.
Thank you for integrating the changes and taking care of the last fixes, I did not manage to address them myself since then. Cheers. |
Before submitting a pull request make sure you have:
Covered the code with tests (note that PRs without tests will be REJECTED)In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Description of your pull request and other information
This pull request fixes 1 blocking issues with date retrieval, and missing stats extraction while downloading videos from motherless.
Invalid date on recent videos
The blocking issue is that the date format is different on new videos. Instead of being a
1 Jan 2020
format, it is a1d ago
or1h ago
format.This is fixed by the first commit.
No test added as the url would need to be dynamic. Can be tested by checking through https://motherless.com/videos/recent videos.
Date is not found
Date is matched `20201215` for 22h ago and yesterday
View count and like count not parsed correctly when over 1000
When these values are over 1000 they are separated by commas on the thousand and million boundaries.
The code fix was written to handle both a comma and a dot to match what was is handled by
str_to_int
.For example, the E0E8F2B video matches both.
Without the fix
"view_count": null, "like_count": null,
With the fix
"view_count": 1305837, "like_count": 3673,
Closes #26495