[CCMA] Fix CCMA extractor (closes #24347) #27994

guillemglez · 2021-01-28T13:47:13Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

This PR fixes two bugs in the CCMA extractor.

Incorrect timestamp:

For some reason, the provided UTC timestamp does not comply with ISO 8601: its date part is formatted as YYYY-DD-MM instead of the expected YYYY-MM-DD.

This is evident from the "text" field of the same emission date object. Example:

    "data_emissio": {
        "text": "14/05/2002 21:39",
        "utc": "2002-14-05T21:39:28+0200"
    }
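A minimal sketch of the reordering, assuming the "utc" value always has the shape YYYY-DD-MMTHH:MM:SS+ZZZZ shown above. The helper name swap_day_month is hypothetical and not part of the extractor; it mirrors the slicing approach and returns None when the field is missing or too short.

```python
def swap_day_month(data_utc):
    """Turn 'YYYY-DD-MM...' into 'YYYY-MM-DD...'; return None if unusable."""
    # Hypothetical helper, for illustration only.
    if not isinstance(data_utc, str) or len(data_utc) < 10:
        return None
    year, day, month = data_utc[:4], data_utc[5:7], data_utc[8:10]
    # Everything after the date part (time and offset) is kept untouched.
    return '%s-%s-%s%s' % (year, month, day, data_utc[10:])


informacio = {
    'data_emissio': {
        'text': '14/05/2002 21:39',
        'utc': '2002-14-05T21:39:28+0200',
    }
}

fixed = swap_day_month(informacio.get('data_emissio', {}).get('utc'))
print(fixed)  # 2002-05-14T21:39:28+0200
```

The reordered string agrees with the day and month given in the "text" field (14/05/2002).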

Multiple subtitles support ([CCMA] Videos with several subtitles won't download #24347)

The extractor used to raise an exception when attempting the download of a URL featuring multiple subtitle languages.

The subtitols field behaves as follows:

When a single language is available, the field is the expected dict.
When multiple languages are available, a list of dicts is provided.
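One way to handle both shapes is to normalize the field to a list before iterating. This is only a sketch: the helper name as_subtitle_list, the 'url' key and the example URLs are assumptions made for illustration, not taken from the extractor or the CCMA API.

```python
def as_subtitle_list(subtitols):
    """Normalize the `subtitols` field to a list of dicts."""
    # Hypothetical helper, for illustration only.
    if isinstance(subtitols, dict):
        return [subtitols]          # single language: wrap the dict
    if isinstance(subtitols, list):
        return [st for st in subtitols if isinstance(st, dict)]
    return []                       # missing or unexpected value


# Assumed payload shapes; the 'url' key is a placeholder.
single = {'iso': 'ca', 'url': 'https://example.invalid/ca.xml'}
multiple = [
    {'iso': 'ca', 'url': 'https://example.invalid/ca.xml'},
    {'iso': 'en', 'url': 'https://example.invalid/en.xml'},
]

for field in (single, multiple, None):
    print([st['iso'] for st in as_subtitle_list(field)])
```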

A test is added with a URL featuring multiple subtitles, to ensure no exception is raised during extraction.

For some reason, the provided UTC timestamp does not comply with ISO 8601, as its format is YYYY-DD-MM instead of the expected YYYY-MM-DD. This can be checked against the "text" field of the emission date object. Example: "data_emissio": { "text": "14/05/2002 21:39", "utc": "2002-14-05T21:39:28+0200" } This commit fixes this behavior.

remitamine · 2021-01-30T12:16:27Z

youtube_dl/extractor/ccma.py

+            'upload_date': '20161108',
+        }
+    }, {
+        'url': 'http://www.ccma.cat/tv3/alacarta/crims/crims-josep-tallada-lespereu-me-capitol-1/video/6031387/',


Add new tests at the end of the array.

remitamine · 2021-01-30T12:19:25Z

youtube_dl/extractor/ccma.py

+        # utc date is in format YYYY-DD-MM
+        data_utc = informacio.get('data_emissio', {}).get('utc')
+        try:
+            data_iso8601 = data_utc[:5] + data_utc[8:10] + '-' + data_utc[5:7] + data_utc[10:]


use datetime class directly.

guillemglez · Feb 2, 2021

Note that this is not a standardized UTC timestamp: it is provided in the format YYYY-DD-MM, which is why this manual parsing was needed.

remitamine · Feb 2, 2021

That's why I think it's better to use Python's built-in datetime module directly to parse the non-standardized format.

guillemglez · Feb 2, 2021

Sorry, but I don't follow...

I can only think of two options:

Modifying data_utc so that it is ISO-compliant and accepted by datetime.datetime.fromisoformat(data_utc), but this is what is already being done...

Parsing each segment as an int to call the datetime.datetime(year, month, day, hour=0, minute=0, second=0, microsecond=0, ...) constructor, but I see no advantage, only added complexity, compared to what is currently being done.

Could you please give me another hint on what you have in mind?

Thanks :)
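For illustration, one possible reading of "use datetime directly" is to describe the non-standard field order in a format string and hand it to datetime.strptime. This is a sketch assuming Python 3, where %z accepts an offset such as +0200; the names CCMA_UTC_FORMAT and parse_ccma_utc are hypothetical, and this is not necessarily what the reviewer had in mind or what was eventually merged.

```python
from datetime import datetime, timedelta

# Day and month are swapped with respect to ISO 8601 (hypothetical constant).
CCMA_UTC_FORMAT = '%Y-%d-%mT%H:%M:%S%z'


def parse_ccma_utc(data_utc):
    """Parse a 'YYYY-DD-MMTHH:MM:SS+ZZZZ' string; return None on failure."""
    try:
        return datetime.strptime(data_utc, CCMA_UTC_FORMAT)
    except (TypeError, ValueError):
        # TypeError: value is None; ValueError: value does not match the format.
        return None


dt = parse_ccma_utc('2002-14-05T21:39:28+0200')
print(dt.isoformat())         # 2002-05-14T21:39:28+02:00
print(dt.strftime('%Y%m%d'))  # 20020514
```

Compared with string slicing, the format string documents the swapped fields explicitly and rejects values that do not match.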

remitamine · 2021-01-30T12:20:33Z

youtube_dl/extractor/ccma.py

            if sub_url:
                subtitles.setdefault(
-                    subtitols.get('iso') or subtitols.get('text') or 'ca', []).append({
+                    st.get('iso') or 'ca', []).append({


keep the old fallback code.
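A sketch of what keeping the fallback could look like once each subtitle entry is handled on its own: the language key is taken from 'iso', then 'text', then the default 'ca'. The helper name subtitle_lang, the 'url' key and the sample entries are assumptions for illustration only.

```python
def subtitle_lang(st, default='ca'):
    """Pick the language key for one subtitle entry."""
    # Same fallback order as the original line: 'iso', then 'text', then 'ca'.
    return st.get('iso') or st.get('text') or default


# Assumed sample entries; the 'url' key is a placeholder.
entries = [
    {'iso': 'ca', 'url': 'https://example.invalid/1.xml'},
    {'text': 'English', 'url': 'https://example.invalid/2.xml'},
    {'url': 'https://example.invalid/3.xml'},
    {'iso': 'es'},  # no URL: skipped
]

subtitles = {}
for st in entries:
    sub_url = st.get('url')
    if sub_url:
        subtitles.setdefault(subtitle_lang(st), []).append({'url': sub_url})

print(sorted(subtitles))  # ['English', 'ca']
```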

mid-kid · 2021-02-02T00:19:55Z

Thanks for fixing the timestamp issue! I had just encountered this, as some videos would end up having no timestamp at all because of it.
I wonder if the timestamp is affected by the locale selection in the headers or on the server hosting the website... Maybe it'll change again in the future...?

One request, however. Would there be a way to use the title from the titol_complet key instead of the one in titol? This title differs from video to video but can sometimes include episode information which is very useful.
EDIT: There's a .informacio.capitol key in the JSON that exposes the episode number. Can this be added?

CCMA extractor used to raise an exception when attempting the download of a URL featuring multiple languages in the subtitles. When a single language is available, the field is the expected dict. When multiple languages are available, a list of dicts is provided. This commit fixes this issue.

guillemglez changed the title ~~[CCMA] Fix CCMA extraction~~ [CCMA] Fix CCMA extractor Jan 28, 2021

guillemglez changed the title ~~[CCMA] Fix CCMA extractor~~ [CCMA] Fix CCMA extractor (closes #24347) Jan 28, 2021

remitamine requested changes Jan 30, 2021

guillemglez added 4 commits February 2, 2021 23:38

[CCMA] Add test with multiple subtitles

196c168

Added test is one of the cases of broken compatibility. Issue is in featuring multiple languages in the subtitles field.

[CCMA] Avoid exception when 'utc' is not found

5601e48

[CCMA] Use also "text" as a fallback for subtitle language parsing

efa4d6f

Keeps old fallback code, as suggested in PR review.

remitamine closed this in 07f7aad Feb 3, 2021
