the :not selector don't work as expected. #212

jimages · 2021-02-09T05:41:19Z

the minimal code which can reproduce the bug lists below

import bs4
b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
b.body.a['foo'] = None  # str(b) ->  <html><body><a foo href="http://www.example.com"></a></body></html>
b.select("a:not([foo])")  # -> [<a foo href="http://www.example.com"></a>]

in this case, the tag a shouldn't be selected.

The text was updated successfully, but these errors were encountered:

facelessuser · 2021-02-09T06:36:37Z

I do not view this as a bug, but as a consequence of doing an undefined action. Basically, you've stuffed nonsense into the attribute. Let me demonstrate what an empty attribute actually looks like and why it behaves as it does:

import bs4
b = bs4.BeautifulSoup("<a foo href=\"http://www.example.com\"></a>") 
print(b.body.a.attrs)
print(b.select("a:not([foo])"))

Notice that when an attribute has no value that it is assigned an empty string, not a None.

{'foo': '', 'href': 'http://www.example.com'}
[]

Now, in your case you've placed essentially something that doesn't belong, a None. While I can see how it may seem intuitive, it does not belong. You've created an undefined state. Just because BS doesn't choke on it, doesn't mean you should be doing that. So, does it work as you intend? Maybe not, but this is an unexpected case.

The fact that BS happens to convert the random value is frankly dumb luck:

import bs4
b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
b.body.a['foo'] = {'1': 1}
print(b.body.a)

<a foo="{'1': 1}" href="http://www.example.com"></a>

As you can see, BeautifulSoup is a bit too forgiving of what gets shoved into these attributes.

So, now we have a None as the value of the attribute. If we are using the following logic to see if a key is present, we expect that None is returned via get if name is not found in the dictionary.

el.attrs.get(name)

And now you see why what you are doing is throwing a wrench into the machine.

I think it is beyond the scope of SoupSieve to try and anticipate the ways in which a user may abuse the BS structure. We check the elements based on the expected internals. Attribute values are either strings or a list of strings (BS breaks classes up into a list of strings). Anything above and beyond that is undefined.

facelessuser · 2021-02-09T06:46:03Z

I think there is sufficient reasoning to assert there is nothing to fix here. I am aware of no cases that BS ever inserts None as an attribute value itself (though I am open to being proven wrong 🙂). As things stand now, I am flagging this as a wontfix. Maybe there is some case I am overlooking?

@gir-bot remove S: triage
@gir-bot add S: wontfix

facelessuser · 2021-02-09T14:11:24Z

Closing the issue for now.

facelessuser · 2021-02-09T14:42:50Z

I wanted to provide one more case that adds validity to our decision not to address this. The official BS documentation also demonstrates removing and checking attributes with get and how it will return None when you remove attributes: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes.

I think by assigning None, to an attribute, you break the user's expectation of the attribute behavior and such assignments should be avoided, regardless of BS's forgiving behavior. As we've shown earlier, BS will take almost anything as an attribute value sometimes, accidentally giving you what you expect and sometimes not at all.

facelessuser · 2021-02-09T17:03:56Z

So, I was looking through the BS code. And their search normalizes both search terms and attributes and such through a _normalize_search_value: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1943.

Now it is confusing as to why they would normalize their attribute searches the same way they normalize their attribute values. For instance, a regular expression pattern assigned to an attribute will be pass through as the regex object, but when you output the HTML, it will convert that to a string. So it will never match via SoupStrainer, but it will be output as an attribute value string on export. This is really wrong, and maybe we should submit a Pull request sometime in the future. BS also doesn't handle dicts correctly in matching but outputs them safely as a string.

So, maybe we should at least normalize our attribute values to at least be on par or better than BS.

Since BS does it a little buggy, we'll write our own.

Fixes #212

facelessuser · 2021-02-09T17:38:59Z

Not a huge fan of having to handle these weird cases, but I can see BS really gets users accustomed to the idea that they can put anything into attributes and it should still come out reasonable. It turned out that it was quite trivial to do due to the way we wrote things, so it all worked out in the end 🙂.

Pull #213 will fix this issue.

@gir-bot remove S: wontfix
@gir-bot add T: bug

Fixes #212

facelessuser · 2021-02-09T20:05:18Z

@jimages Took me a while to come to the conclusion we needed to fix this, but it is now available in the 2.2 release.

jimages · 2021-02-11T09:59:07Z

Sorry for the late reply since I am on a long journey. First of all, the reason why I do the undefined action is that the only possible way to get attributes without a value. I will show you an example.

b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></>", "html5lib")
b.body.a['foo']= "" # so we get <a foo="" href="http://www.example.com"></a>

as you can see, it only outputs an attribute with a value an empty string. but what I want is a bare attribute without a value, though these are somehow equivalent.

ref: https://stackoverflow.com/questions/49991604/python-beautifulsoup-add-attribute-to-tag-without-value/49991727

In the end, it's my pleasure to see that this bug is fixed. Thank you!

jimages · 2021-02-11T10:25:20Z

maybe the None method is intended. https://bazaar.launchpad.net/~dguitarbite/beautifulsoup/beautifulsoup/view/head:/bs4/element.py#L1087 Isaac Muse <notifications@github.com> 于 2021年2月10日周三上午4:05写道：

…

@jimages <https://github.com/jimages> Took me a while to come to the conclusion we needed to fix this, but it is now available in the 2.2 release. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#212 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AANTCUMYVA3TATU6LPAUS33S6GIQ3ANCNFSM4XKKU2ZQ> .

facelessuser · 2021-02-11T13:57:35Z

as you can see, it only outputs an attribute with a value an empty string. but what I want is a bare attribute without a value, though these are somehow equivalent.

Yeah, I would argue this is a bug with BS as BS imports a bare attribute with an empty string, but then you have to trick it into outputting it the same way you imported it.

maybe the None method is intended.

If it was intended, I'd argue it would assign None to attributes itself, but it doesn't. BS tries everything in its power to output a sane HTML output, and that means trying its best to convert to something that doesn't fail, that doesn't mean it is supposed to allow users to do that. I illustrated that earlier with:

import bs4
b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
b.body.a['foo'] = {'1': 1}
print(b.body.a)

But it'll even take something weird like functions:

import bs4

def func():
    pass

b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
b.body.a['foo'] = func
print(b.body.a)
``

Output:

```html
<a foo="&lt;function func at 0x10aedc160&gt;" href="http://www.example.com"></a>

Just because it is handled, doesn't mean you should ever do something like this.

With that said, out of everything, None, at the very least, seems like a more sane thing to handle than things like functions and dictionaries, but I'd argue, that if this is intentional, BS should also convert <a foo></a> to a['foo']=None, but it doesn't. At least for attributes, on input, it should handle <a foo=''></a> and <a foo></a> the same in the output for these is very different currently.

facelessuser · 2021-02-11T15:01:38Z

I guess you could force empty attributes with something like this:

import bs4
from bs4.formatter import HTMLFormatter

class EmptyAttr(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            if v == '':
                v = None
            yield k, v

b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>", 'html5lib')
b.body.a['foo'] = ''
print(b.body.a.encode(formatter=EmptyAttr()))

Output

b'<a href="http://www.example.com" foo></a>'

I may bring up this whole None handling over with the author of BS.

facelessuser · 2021-02-11T15:35:40Z

Testing it out in real CSS, as far as HTML is concerned, foo="" and foo are both equivalent. and [foo] and [foo=""] will match both, so I don't see why BS can't export foo="" as a bare foo by default. I may request that over at BS and see what the author says.

facelessuser · 2021-02-11T16:00:39Z

Here is the issue I created over at BeautifulSoup: https://bugs.launchpad.net/beautifulsoup/+bug/1915424.

facelessuser · 2021-02-11T16:16:25Z

In general, I guess now None will work, so you can keep doing that. Generally, I don't think people should use None and this should be changed in the HTML formatter to output bare attributes. I think overriding the HTML formatter is probably a cleaner option to do this on output, especially if the author doesn't want to change the output in HTML moving forward (XML should not do this though).

I guess we'll see what he decides. But I guess you have options now.

jimages · 2021-02-12T06:31:03Z

If it was intended, I'd argue it would assign None to attributes itself, but it doesn't. BS tries everything in its power to output a sane HTML output, and that means trying its best to convert to something that doesn't fail, that doesn't mean it is supposed to allow users to do that.

Yeah, this may be reasonable. But I still want to argue about the undefined action. It's true that it doesn't assign None to attributes itself nor set similar test cases. if it is an undefined behavior it means BS wouldn't assign None to attributes itself. But if not, it doesn't mean BS should assign. let's see what the author thinks.

facelessuser · 2021-02-13T16:06:41Z

Sounds like he's open to trying this out on the default HTML5 formatter, so that is a start. I'll try to get a pull in there so at least on the HTML5 formatter, empty strings should output as bare attributes.

gir-bot added the S: triage Issue needs triage. label Feb 9, 2021

gir-bot added S: wontfix The issue will not be fixed for the stated reasons. and removed S: triage Issue needs triage. labels Feb 9, 2021

facelessuser closed this as completed Feb 9, 2021

facelessuser reopened this Feb 9, 2021

facelessuser added a commit that referenced this issue Feb 9, 2021

Handle weird attribute inputs

7716645

Fixes #212

facelessuser mentioned this issue Feb 9, 2021

Handle weird attribute inputs #213

Merged

gir-bot added T: bug Bug. and removed S: wontfix The issue will not be fixed for the stated reasons. labels Feb 9, 2021

facelessuser closed this as completed in #213 Feb 9, 2021

facelessuser added a commit that referenced this issue Feb 9, 2021

Handle weird attribute inputs (#213)

0ba6ce8

Fixes #212

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

the :not selector don't work as expected. #212

the :not selector don't work as expected. #212

jimages commented Feb 9, 2021

facelessuser commented Feb 9, 2021 •

edited

Loading

facelessuser commented Feb 9, 2021

facelessuser commented Feb 9, 2021

facelessuser commented Feb 9, 2021

facelessuser commented Feb 9, 2021 •

edited

Loading

facelessuser commented Feb 9, 2021

facelessuser commented Feb 9, 2021

jimages commented Feb 11, 2021

jimages commented Feb 11, 2021 via email

facelessuser commented Feb 11, 2021 •

edited

Loading

facelessuser commented Feb 11, 2021

facelessuser commented Feb 11, 2021

facelessuser commented Feb 11, 2021

facelessuser commented Feb 11, 2021

jimages commented Feb 12, 2021

facelessuser commented Feb 13, 2021

the :not selector don't work as expected. #212

the :not selector don't work as expected. #212

Comments

jimages commented Feb 9, 2021

facelessuser commented Feb 9, 2021 • edited Loading

facelessuser commented Feb 9, 2021

facelessuser commented Feb 9, 2021

facelessuser commented Feb 9, 2021

facelessuser commented Feb 9, 2021 • edited Loading

facelessuser commented Feb 9, 2021

facelessuser commented Feb 9, 2021

jimages commented Feb 11, 2021

jimages commented Feb 11, 2021 via email

facelessuser commented Feb 11, 2021 • edited Loading

facelessuser commented Feb 11, 2021

facelessuser commented Feb 11, 2021

facelessuser commented Feb 11, 2021

facelessuser commented Feb 11, 2021

jimages commented Feb 12, 2021

facelessuser commented Feb 13, 2021

facelessuser commented Feb 9, 2021 •

edited

Loading

facelessuser commented Feb 9, 2021 •

edited

Loading

facelessuser commented Feb 11, 2021 •

edited

Loading