-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CCMA] Fix CCMA extractor (closes #24347) #27994
Conversation
For some reason, provided UTC timestamp does not comply ISO8601, as its format is YYYY-DD-MM instead of expected YYYY-MM-DD. This can be checked with the also provided "text" field of emission date object. Example: "data_emissio": { "text": "14/05/2002�21:39", "utc": "2002-14-05T21:39:28+0200" } This commit fixes this behavior.
youtube_dl/extractor/ccma.py
Outdated
'upload_date': '20161108', | ||
} | ||
}, { | ||
'url': 'http://www.ccma.cat/tv3/alacarta/crims/crims-josep-tallada-lespereu-me-capitol-1/video/6031387/', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add new tests at the end of the array.
# utc date is in format YYYY-DD-MM | ||
data_utc = informacio.get('data_emissio', {}).get('utc') | ||
try: | ||
data_iso8601 = data_utc[:5] + data_utc[8:10] + '-' + data_utc[5:7] + data_utc[10:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use datetime
class directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this is not a standardized UTC timestamp... it is provided in format YYYY-DD-MM
, thus this manual parsing was needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's why i think it's better to directly use python's datetime builtin module to parse the non standardized format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I don't follow...
I can only think of two options:
- Modifying
data_utc
as so that it is ISO-compliant and accepted asdatetime.datetime.fromisoformat(data_utc)
, but this is what is already being done... - Parsing each segment as an
int
to calldatetime.datetime(year, month, day, hour=0, minute=0, second=0, microsecond=0...
constructor, but I see no advantage and only added complexity in comparison to what is currently being done.
Could you please give me another hint on what you have in mind?
Thanks :)
youtube_dl/extractor/ccma.py
Outdated
if sub_url: | ||
subtitles.setdefault( | ||
subtitols.get('iso') or subtitols.get('text') or 'ca', []).append({ | ||
st.get('iso') or 'ca', []).append({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep the old fallback code.
Thanks for fixing the timestamp issue! I had just encountered this, as some videos would end up having no timestamp at all because of it. One request, however. Would there be a way to use the title from the |
Added test is one of the cases of broken compatibility. Issue is in featuring multiple languages in the subtitles field.
CCMA extractor used to raise an exception when attempting the download of a URL featuring multiple languages in the subtitles. When a single language is available, the field is the expected dict. When multiple languages are available, a list of dicts is provided. This commit fixes this issue.
Keeps old fallback code, as suggested in PR review.
Please follow the guide below
x
into all the boxes [ ] relevant to your pull request (like that [x])Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Description of your pull request and other information
Fixing two bugs present in CCMA extractor.
For some reason, provided UTC timestamp does not comply ISO8601, as its format is YYYY-DD-MM instead of expected YYYY-MM-DD.
This is made evident in the also provided
"text"
field of emission date object. Example:The extractor used to raise an exception when attempting the download of a URL featuring multiple subtitle languages.
Behavior of the
subtitols
field is:When a single language is available, the field is the expected
dict
.When multiple languages are available, a
list
ofdict
is provided.A test is added with an URL featuring multiple subtitles, to ensure no exception is raised during extraction.