the :not selector don't work as expected. #212
Comments
I do not view this as a bug, but as a consequence of doing an undefined action. Basically, you've stuffed nonsense into the attribute. Let me demonstrate what an empty attribute actually looks like and why it behaves as it does:

    import bs4
    b = bs4.BeautifulSoup("<a foo href=\"http://www.example.com\"></a>")
    print(b.body.a.attrs)
    print(b.select("a:not([foo])"))

Notice that when an attribute has no value that it is assigned an empty string, not a

    {'foo': '', 'href': 'http://www.example.com'}
    []

Now, in your case you've placed essentially something that doesn't belong, a The fact that BS happens to convert the random value is frankly dumb luck:

    import bs4
    b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>")
    b.body.a['foo'] = {'1': 1}
    print(b.body.a)

As you can see, BeautifulSoup is a bit too forgiving of what gets shoved into these attributes. So, now we have a el.attrs.get(name) And now you see why what you are doing is throwing a wrench into the machine. I think it is beyond the scope of SoupSieve to try and anticipate the ways in which a user may abuse the BS structure. We check the elements based on the expected internals. Attribute values are either strings or a list of strings (BS breaks classes up into a list of strings). Anything above and beyond that is undefined. |
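The lookup problem mentioned in the comment above can be sketched with plain dictionaries standing in for a tag's attributes. This is only an illustration under assumptions: it supposes presence is decided from the result of `attrs.get(name)`, as the comment mentions, and the helper names `has_attr_via_get` and `has_attr_via_in` are hypothetical, not SoupSieve functions.

```python
# Plain dicts stand in for a tag's `attrs` mapping (an assumption for illustration).
attrs_missing = {"href": "http://www.example.com"}
attrs_none = {"foo": None, "href": "http://www.example.com"}
attrs_empty = {"foo": "", "href": "http://www.example.com"}


def has_attr_via_get(attrs, name):
    """Hypothetical presence check: a None result is read as 'attribute absent'."""
    return attrs.get(name) is not None


def has_attr_via_in(attrs, name):
    """Hypothetical presence check based on key membership only."""
    return name in attrs


for label, attrs in [("missing", attrs_missing), ("None", attrs_none), ("empty", attrs_empty)]:
    # Print both checks side by side for each kind of attribute value.
    print(label, has_attr_via_get(attrs, "foo"), has_attr_via_in(attrs, "foo"))
```

With a `get`-based check, an attribute set to `None` looks the same as a missing attribute, which would be consistent with `a:not([foo])` selecting the tag in the report. A membership check does not have that ambiguity.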
I think there is sufficient reasoning to assert there is nothing to fix here. I am aware of no cases that BS ever inserts |
Closing the issue for now. |
I wanted to provide one more case that adds validity to our decision not to address this. The official BS documentation also demonstrates removing and checking attributes with I think by assigning |
So, I was looking through the BS code. Their search normalizes both search terms and attributes and such through a Now it is confusing as to why they would normalize their attribute searches the same way they normalize their attribute values. For instance, a regular expression pattern assigned to an attribute will be passed through as the regex object, but when you output the HTML, it will be converted to a string. So it will never match via SoupStrainer, but it will be output as an attribute value string on export. This is really wrong, and maybe we should submit a pull request sometime in the future. BS also doesn't handle dicts correctly in matching but outputs them safely as a string. So, maybe we should normalize our attribute values to be at least on par with, or better than, BS. Since BS does it a little buggy, we'll write our own. |
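As a rough idea of what normalizing attribute values before matching could look like, here is a minimal sketch. It is not the code from the pull request; the function names `_to_text` and `normalize_value` are hypothetical, and treating `None` as an empty string is an assumption made for this example.

```python
def _to_text(value):
    """Hypothetical helper: coerce a single attribute value to a string."""
    if value is None:
        # Assumption: treat None like a bare attribute, i.e. an empty string.
        return ""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    # Dicts, functions, numbers, etc. fall back to their string form.
    return str(value)


def normalize_value(value):
    """Hypothetical normalizer: return a string or a list of strings."""
    if isinstance(value, (list, tuple)):
        # Multi-valued attributes (such as class) are normalized item by item.
        return [_to_text(item) for item in value]
    return _to_text(value)


print(normalize_value(None) == "")
print(normalize_value({"1": 1}))
print(normalize_value(["a", 1]))
```

After a step like this, matching only ever sees strings or lists of strings, which are the two shapes the earlier comment describes as the expected internals.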
Not a huge fan of having to handle these weird cases, but I can see BS really gets users accustomed to the idea that they can put anything into attributes and it should still come out reasonable. It turned out that it was quite trivial to do due to the way we wrote things, so it all worked out in the end 🙂. Pull #213 will fix this issue. |
@jimages Took me a while to come to the conclusion we needed to fix this, but it is now available in the 2.2 release. |
Sorry for the late reply; I am on a long journey. First of all, the reason I performed the undefined action is that it is the only possible way to get attributes without a value. I will show you an example.
As you can see, it only outputs an attribute whose value is an empty string. But what I want is a bare attribute without a value, though the two are somehow equivalent. In the end, I'm pleased to see that this bug is fixed. Thank you! |
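The difference between the two outputs described above can be modeled with a small stand-alone sketch. This is not BeautifulSoup's serializer; `render_attrs` is a hypothetical function that only mimics the behavior reported in this thread, where a `None` value yields a bare attribute and an empty string yields `foo=""`.

```python
from html import escape


def render_attrs(attrs):
    """Hypothetical serializer: None -> bare attribute, anything else -> name="value"."""
    parts = []
    for name, value in attrs.items():
        if value is None:
            # A None value is written as the attribute name alone.
            parts.append(name)
        else:
            parts.append('{}="{}"'.format(name, escape(str(value), quote=True)))
    return " ".join(parts)


print(render_attrs({"foo": "", "href": "http://www.example.com"}))
print(render_attrs({"foo": None, "href": "http://www.example.com"}))
```

In this model the empty string and `None` differ only in how they are written out, which matches the point that the two forms are roughly equivalent as far as the resulting HTML is concerned.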
maybe the None method is intended.
https://bazaar.launchpad.net/~dguitarbite/beautifulsoup/beautifulsoup/view/head:/bs4/element.py#L1087
Isaac Muse <notifications@github.com> wrote on Wednesday, February 10, 2021 at 4:05 AM:
… @jimages <https://github.com/jimages> Took me a while to come to the
conclusion we needed to fix this, but it is now available in the 2.2
release.
|
Yeah, I would argue this is a bug with BS as BS imports a bare attribute with an empty string, but then you have to trick it into outputting it the same way you imported it.
If it was intended, I'd argue it would assign

```python
import bs4
b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>")
b.body.a['foo'] = {'1': 1}
print(b.body.a)
```

But it'll even take something weird like functions:

```python
import bs4

def func():
    pass

b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>")
b.body.a['foo'] = func
print(b.body.a)
```

Output:

```html
<a foo="<function func at 0x10aedc160>" href="http://www.example.com"></a>
```

Just because it is handled, doesn't mean you should ever do something like this. With that said, out of everything, |
I guess you could force empty attributes with something like this:

    import bs4
    from bs4.formatter import HTMLFormatter

    class EmptyAttr(HTMLFormatter):
        def attributes(self, tag):
            for k, v in tag.attrs.items():
                if v == '':
                    v = None
                yield k, v

    b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>", 'html5lib')
    b.body.a['foo'] = ''
    print(b.body.a.encode(formatter=EmptyAttr()))

Output
I may bring up this whole |
Testing it out in real CSS, as far as HTML is concerned, |
Here is the issue I created over at BeautifulSoup: https://bugs.launchpad.net/beautifulsoup/+bug/1915424. |
In general, I guess now None will work, so you can keep doing that. Generally, I don't think people should use I guess we'll see what he decides. But I guess you have options now. |
Yeah, this may be reasonable. But I still want to argue about the undefined action. It's true that BS neither assigns None to attributes itself nor has similar test cases. If this is undefined behavior, it means BS wouldn't assign None to attributes itself. But if it is not undefined, that doesn't mean BS should assign None either. Let's see what the author thinks. |
Sounds like he's open to trying this out on the default HTML5 formatter, so that is a start. I'll try to get a pull in there so at least on the HTML5 formatter, empty strings should output as bare attributes. |
The minimal code that reproduces the bug is listed below.
In this case, the tag `a` shouldn't be selected.