Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fqdn() to always include suffix if private suffix enabled and private suffix exists #300

Merged
merged 8 commits into from
Sep 13, 2023

Conversation

elliotwutingfeng
Copy link
Contributor

@elliotwutingfeng elliotwutingfeng commented Jul 12, 2023

Closes #178

Changes

  • If private suffixes are enabled and there is a private suffix, fqdn() will always return the suffix + domain and/or subdomain if they exist as well.

Misc

  • This can be a step towards fixing Semantics of registered_domain property for private domains #138. Would need a consensus on whether registered_domain should strictly refer to domain + public suffix regardless of whether private suffixes are enabled.

  • Is there a better way to expose is_private than putting it in the existing NamedTuple?

  • The CLI parser behaviour remains unchanged; it only prints out the str attributes of the NamedTuple.

@john-kurkowski
Copy link
Owner

Is the point of this PR mainly to fix the .fqdn property in #178? Is there a way to fix the property without a breaking change? I don't like that extracting a public suffix now gives a different result than extracting a private suffix. I like treating public and private (when enabled) as similarly as possible.

# entire URL is a public suffix
>>> tldextract.extract("gov.au", include_psl_private_domains=True)
ExtractResult(subdomain='', domain='', suffix='gov.au') # not domain='gov', suffix='au'?

# entire URL is a private suffix
>>> tldextract.extract("blogspot.com", include_psl_private_domains=True)
ExtractResult(subdomain='', domain='blogspot', suffix='com')

Also, I don't know how useful it is to check if an entire URL was a suffix, but you could check that before this PR. After this PR, I don't think you can.

>>> result = tldextract.extract("blogspot.com", include_psl_private_domains=True)
>>> is_suffix = result.suffix and not (result.subdomain or result.domain)

@elliotwutingfeng
Copy link
Contributor Author

elliotwutingfeng commented Jul 16, 2023

Is there a way to fix the property without a breaking change?

This is a breaking change so perhaps we should use a flag to optionally enable this behaviour.

This can be done with an opt-in flag (not yet implemented as of now) to enable the new behaviour.

tldextract.extract("blogspot.com", include_psl_private_domains=False, opt_in_flag_name=False)

If the caller doesn't set this flag to True, the old behaviour applies. If you find this approach feasible, what would be an appropriate name to replace opt_in_flag_name?

@john-kurkowski
Copy link
Owner

What desirable new behavior would the flag let users opt into? Is the flag just a bugfix for #178? If so, breaking changes, new configuration options, and something the caller has to do all seem like a chainsaw, for what calls for a knife. I'm afraid to maintain 2 different string splitting experiences or more options to the already too many to construct a TLDExtract.

@elliotwutingfeng
Copy link
Contributor Author

elliotwutingfeng commented Jul 16, 2023

Yes, this flag is just a bugfix for #178. If an opt-in flag is used, none of the existing code out there should be affected, and hence it would not be a breaking change.

However I do agree that this complicates the string splitting algorithm; having to handle 2 different cases. If it is too much, I think #178 (which is currently still open) should be closed as "not planned".


After re-reading the issue, I realised your solution of using the is_private property is simpler and there won't be a need ti opt-in, but it would change the output of FQDN for the private domains.

@john-kurkowski
Copy link
Owner

After re-reading the issue, I realised your solution of using the is_private property is simpler and there won't be a need ti opt-in, but it would change the output of FQDN for the private domains.

You got it. That approach is what I'm interested in. #178 is a legit bug, so I'll keep it open for now. The FQDN of private domains was never intended to be blank, so changing that output is desirable.

@elliotwutingfeng elliotwutingfeng changed the title Accept next largest suffix if entire URL is private suffix fqdn() to always include suffix if private suffix enabled and private suffix exists Aug 1, 2023
@john-kurkowski
Copy link
Owner

This PR changed a lot since my last comment! I think it's much closer to the solution I was seeking in my last comment. Nice!

My only remaining concern is the semi-breaking change of adding another field to this project's class ExtractResult(NamedTuple). This project's docs previously advocated iterating through the fields. With this PR, anybody still iterating through the namedtuple is going to build some weird strings or more likely raise a TypeError. That is, without guards e.g. isinstance(str) in this PR's updated docs. I noted a similar compatibility issue on the abandoned #273.

I wonder if there's a better way? Do we add an attribute to the namedtuple, instead of extending its fields? Do we bump the breaking version number of the project? Do we move away from namedtuple?

I don't want to make this PR's fix go on much longer. Just noodling what compatibility I'm comfortable with. 🤔

@john-kurkowski
Copy link
Owner

Do we add an attribute to the namedtuple, instead of extending its fields?

I've played around with this, and it doesn't feel viable. __slots__ and __new__ cannot be overridden on a namedtuple.

Make … an attribute of ExtractResult but not a member of the tuple, like urllib.parse.urlsplit does

I still haven't investigated what the stdlib is doing differently in e.g. urllib.parse.urlsplit's example. I'm not sure it's worth the hoop jumping for this project.

Do we bump the breaking version number of the project?

Doing this for every field added to the namedtuple seems like a lot. I'd just steer people away from iterating over all fields, unless they really know what they're doing.

All this to say, I'm leaning toward merging this as is, releasing, and seeing how niche the compatibility issues are.

@banagale
Copy link

banagale commented Oct 6, 2023

My only remaining concern is the semi-breaking change of adding another field to this project's class ExtractResult(NamedTuple). This project's docs previously advocated iterating through the fields. With this PR, anybody still iterating through the namedtuple is going to build some weird strings or more likely raise a TypeError. That is, without guards e.g. isinstance(str) in this PR's updated docs. I noted a similar compatibility issue on the abandoned #273.

I had some code using tldextract.extract(url)[1:] and hit this TypeError! Came across it despite not having test coverage on this portion of the product.

Glad this package is getting improved and am okay with the breaking change,

I don't know if I would have caught it, but if this had kicked the project up to 4.0.0 (to help highlight the breaking change). I might have checked the release notes more carefully. I realize not everyone uses that convention, and that this is not such a deeply breaking change to possibly warrant it. But throwing it out there nonetheless.

As it was, I just updated to site_str: str = '.'.join(tldextract.extract(url)[1:3]) and got back to it.

Thanks for any and all efforts to maintain and further extend this great package.

@john-kurkowski
Copy link
Owner

@banagale thank you for weighing in! Per #305, 3.6.0 is yanked and republished as 4.0.0. Your tuple slicing code will work in 3.x and 4.0.0.

However, I've also published 5.0.0, which moves away from a tuple return type entirely. In 5.0.0, you will need to directly reference the fields you're interested in.

ext = tldextract.extract(url)
'.'.join((ext.domain, ext.suffix))

@banagale
Copy link

@banagale thank you for weighing in! Per #305, 3.6.0 is yanked and republished as 4.0.0. Your tuple slicing code will work in 3.x and 4.0.0.

However, I've also published 5.0.0, which moves away from a tuple return type entirely. In 5.0.0, you will need to directly reference the fields you're interested in.

ext = tldextract.extract(url)
'.'.join((ext.domain, ext.suffix))

Cool. Makes good sense.

bmwiedemann pushed a commit to bmwiedemann/openSUSE that referenced this pull request Oct 23, 2023
https://build.opensuse.org/request/show/1119465
by user mia + anag+factory
- Update to 5.0.1:
Bugfixes:
  * Indicate MD5 not used in a security context (FIPS compliance)
    #gh/john-kurkowski/tldextract#309
Misc.:
  * Increase typecheck aggression
- Changes in 5.0.0:
Breaking Changes:
  * Migrate `ExtractResult` from `namedtuple` to `dataclass`
    #gh/john-kurkowski/tldextract#306
Bugfixes:
  * Drop support for EOL Python 3.7
- Changes in 4.0.0:
Breaking Bugfixes:
  * Always include suffix if private suffix enabled and private
    suffix exists
    #gh/john-kurkowski/tldextract#300
- Changes in 3.5.0:
Features:
  * Support IPv6 addresses
    #gh/john-kurkowski/tldextract#298
Bugfixes:
  * Accept only 4 decimal octet IPv4 addresses
    #gh/john-kurkowski/tldextract#292
  * Support IPv4 addresses with unicode dots
  * Reject IPv4 addresses with trailing whitespace
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect empty FQDN when no subdomain and domain
3 participants