Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[utils] Fix handling of comments in js_to_json #28446

Closed
wants to merge 5 commits into from

Conversation

marado
Copy link

@marado marado commented Mar 14, 2021

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

js_to_json() removes commas that appear right before a comment line, creating invalid JSON.

This pull request fixes this by making the conversion a two-step process. It first removes all comments and then the actual JS to JSON fix-up happens. This also somewhat simplifies the conversion as comment handling is no longer needed in the second step.

Doing it in one step makes it hard to deal properly with both commas right before comments and trailing commas at the end of a dictionary or array. A two-step approach is easy to follow and works.

The regular expression used for comment removal is not thrown off by comments within string literals. It is based on this Stack Overflow answer: https://stackoverflow.com/a/25735600

Additional tests in test/test_utils.py based on issue #23785 are also provided.

Closes #23785.

This Pull Request is a rebase of the PR provided (and approved) at #23877. As such:
Closes #23877 .

@violetmage
Copy link

Fixes #28335

and many other related issues

am thinking, is good fix

@somini
Copy link

somini commented Apr 7, 2021

CI is failing on flake8 issues:

diff --git i/youtube_dl/utils.py w/youtube_dl/utils.py
index d7162e401..30830af42 100644
--- i/youtube_dl/utils.py
+++ w/youtube_dl/utils.py
@@ -4110,6 +4110,7 @@ def js_to_json(code):
         [0-9]+(?=\s*:)
         ''', fix_kv, code)

+
 def qualities(quality_ids):
     """ Get a numeric quality value out of a list of possible values """
     def q(qid):

The other failures stem from the fact that the ! prefixes are no longer removed.

@somini
Copy link

somini commented Apr 7, 2021

Input:

{
            a: !0,
            b: !1,
            c: !!0,
            d: !!42.42,
            e: !!![],
            f: !"abc",
            g: !"",
            !42: 42
        }

Output from the cleanup function:

{
            "a": !0,
            "b": !1,
            "c": !!0,
            "d": !!42.42,
            "e": !!![],
            "f": !"abc",
            "g": !"",
            !"42": 42
        }

This is not valid JSON.

While this results in wrong values, it is the previous behaviour and
what the tests are currently expecting.
@marado
Copy link
Author

marado commented Apr 22, 2021

@dstftw can you please review/remove the pending-fixes label?

# Remove all comments first, including all whitespace leading up to them.
# This regular expression is based on this Stack Overflow answer:
# https://stackoverflow.com/a/25735600
code = re.sub(r'("(?:[^"\\]|\\[\s\S])*"|\'(?:[^\'\\]|\\[\s\S])*\')|[ \t]*//.*|[ \t]*/\*(?:[^*]|\*(?!/))*\*/',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Use re.X to lay out the RE
  2. Supposing [ \t] is meant to be space and not newline, use (?:(?!\n)\s)
  3. Use re.s to avoid [\s\S] and re.M to match $ at an end of line
Suggested change
code = re.sub(r'("(?:[^"\\]|\\[\s\S])*"|\'(?:[^\'\\]|\\[\s\S])*\')|[ \t]*//.*|[ \t]*/\*(?:[^*]|\*(?!/))*\*/',
code = re.sub(
r'''(?xmi)
("(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')|
(?:(?!\n)\s)*
(?:
//.*$|
/\*(?:[^*]|\*(?!/))*\*/
)
''',

Comment on lines +4130 to +4131
# Remove all ! characters
code = code.replace('!', '')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a different problem? !0 for true, etc?

Comment on lines 4133 to 4138
def fix_kv(m):
v = m.group(0)
if v in ('true', 'false', 'null'):
return v
elif v.startswith('/*') or v.startswith('//') or v.startswith('!') or v == ',':
elif v == ',':
return ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def fix_kv(m):
v = m.group(0)
if v in ('true', 'false', 'null'):
return v
elif v.startswith('/*') or v.startswith('//') or v.startswith('!') or v == ',':
elif v == ',':
return ""
INVERTS = {'true': 'false, 'false': 'true', 'null': 'true'}
def js_not(x):
if isinstance(x, compat_basestring):
return INVERTS.get(x) or ''
try:
return not json.dumps(x)
except BaseException:
return ''
def fix_kv(m):
v = m.group(0)
i = 0
while v[i] == '!':
i = i + 1
v = v[i:]
if v in INVERTS:
return v if i % 2 == 0 else js_not(v)
elif v in ('undefined', 'void 0'):
return 'null' if i == 0 else 'true' if i % 2 == 1 else 'false'
elif v == ',':
return ''

@dirkf
Copy link
Contributor

dirkf commented Mar 4, 2024

The tests pass with d9d07a9:
In [1]: import json

In [2]: from youtube_dl.utils import js_to_json

In [3]: inp = '''{
   ...:             foo: "value",
   ...:             // bar: { nested:'x' },
   ...:             bar: { nested:'x' },
   ...:             chaff: "something"
   ...:         }'''

In [4]: json.loads(js_to_json(inp)) == json.loads('''{
   ...:             "foo": "value",
   ...: 
   ...:             "bar": { "nested":"x" },
   ...:             "chaff": "something"
   ...:         }''')
Out[4]: True

In [5]:         inp = '''{
   ...:             id: "player_prog",
   ...:             googleCast: true,
   ...:             //extraSettings: { googleCastReceiverAppId:'1A6F2224', skin:'s3',  skinAccentCol
   ...: or: '0073FF'},
   ...:             extraSettings: { googleCastReceiverAppId:'1A6F2224'},
   ...:             mediaType: "video",
   ...:         }'''

In [6]: json.loads(js_to_json(inp)) == json.loads('''{
   ...:             "id": "player_prog",
   ...:             "googleCast": true,
   ...:             "extraSettings": { "googleCastReceiverAppId":"1A6F2224"},
   ...:             "mediaType": "video"
   ...:         }''')
Out[6]: True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

utils.js_to_json is broken: mishandles commented lines
6 participants