-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes bugs where the globals are None #126
Conversation
It seems like the right call here would be to move the downloading of this data into an I prefer that over adding special logic to each determinant function, which essentially means we're putting the burden on each Maybe that function is best written as part of fixing #99? But if it's easy enough to do here without having to refactor the caching logic itself, then we could do it here. |
Hmm, I like the idea of having an ...
if suffix_list is None:
initialize_caches()
... I still think we want something in these three functions like above to make sure that if they get called (e.g. by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
So I don't agree, here -- I don't think the So I would expect the test harness to generally run For this PR, I'd really just like to clean up the three |
As far as the test harness is concerned, I actually found a way already to work around them not being loaded: https://github.com/dhs-ncats/pshtt/blob/537f4120396b188ee3f20c8c0b3610ee9bad988d/tests/test_cli.py#L11-L20 I'm not sure that I agree that the While tests are one example of how you might use these functions directly, another is the case where someone is using the internals of pshtt as a library. In that case, they may not make use of the Consider the following snippet, which would fail without us making some kind of change: from pshtt.pshtt import is_hsts_preloaded
from pshtt.models import Domain
domain_url = 'example.com'
my_domain = Domain(domain_url)
my_domain.http = Endpoint("http", "root", domain_url)
my_domain.httpwww = Endpoint("http", "www", domain_url)
my_domain.https = Endpoint("https", "root", domain_url)
my_domain.httpswww = Endpoint("https", "www", domain_url)
assert(is_hsts_preloaded(my_domain) == True) The above will currently raise an exception due to In [1]: from pshtt.pshtt import is_hsts_preloaded
In [2]: from pshtt.models import Domain, Endpoint
In [3]: domain_url = 'example.com'
In [4]: my_domain = Domain(domain_url)
...: my_domain.http = Endpoint("http", "root", domain_url)
...: my_domain.httpwww = Endpoint("http", "www", domain_url)
...: my_domain.https = Endpoint("https", "root", domain_url)
...: my_domain.httpswww = Endpoint("https", "www", domain_url)
...:
In [5]: assert(is_hsts_preloaded(my_domain) == True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-8d7a9de74754> in <module>()
----> 1 assert(is_hsts_preloaded(my_domain) == True)
~/projs/pshtt/pshtt/pshtt.py in is_hsts_preloaded(domain)
910 Whether a domain is contained in Chrome's HSTS preload list.
911 """
--> 912 return domain.domain in preload_list
913
914
TypeError: argument of type 'NoneType' is not iterable I can imagine something like the above being used to very quickly write an "are these domains in the HSTS preload / pending-preload lists" script. |
OK, I find that convincing, that those methods should be context-aware of whether their relevant data's been loaded. However, I still think this could be handled a bit differently. While it's not the worst thing, I don't think it would be ideal if So what about insisting that that method needs to have |
(Also, small note that since merging #125, there's a merge conflict here that should be resolved.) |
7b3e70d
to
8e43ae4
Compare
(Rebased and pushed to fix conflicts.) So it sounds like you are leaning towards something like the following where we raise an exception and force an explicit load of the data, rather than having the data implicitly loaded when it's needed if it hasn't been yet... E.g.: def parent_domain_for(hostname):
if suffix_list is None:
raise RuntimeError('suffix_list data has not been loaded. You must run `initialize_caches()` in order to load')
return suffix_list.get_public_suffix(hostname) |
Yep, that's exactly right. Do you think that's good/doable? |
Certainly doable, though I think I personally lean towards implicitly doing the data fetch if they need it, rather than making it explicitly required. In defense of those other services, since these get defined as globals, they would only be fetched once per library invocation and held in memory. Maybe @h-m-f-t + @jsf9k should weigh in with their thoughts? |
As an aside, I will be squashing this branch prior to merging to resolve the strange commits as we've been discussing what this ought to look like |
That is true, and would likely mitigate accidental DDoS in many cases. That said, there are still cases where people may use the Python API but still orchestrate their overall batching/parallelization in ways that don't benefit from this in-memory use. You can imagine this coming up if someone called pshtt functions inside of a Python-based Lambda function, for instance. Based on my own experience at doing high-parallelization batching and exploring use of Lambda, I favor approaches that make sure that people are handling this network outreach, which can have implications for both clients (in time/cost) and servers (in load/cost), explicitly rather than implicitly. Very open to other thoughts, of course! |
I think this approach is reasonable, especially when viewed in terms of (ours, others) bandwidth costs. I'm okay raising an hard error to the user to enforce that. |
This is the case when these functions are called directly, and where the `inspect_domains()` function is not used. Currently these lists are defined only if `inspect_domains()` has been called, which may not be the case when testing (e.g. in #28 / #125), or when using pshtt as a library. The current solution is to raise an exception if the initialize function is not called explicitly, which is now its own function to handle initializing all the third party data. This begins work towards a more complete solution for #99, and allows for the initialize function to be mocked / called separately when working in tests.
791fb14
to
e6a4b17
Compare
I think I've addressed all the concerns from @konklone and @h-m-f-t (and @jsf9k already liked the changes ;)). Squashed and settled on raising a hard exception for now. Let me know if I missed anything. I'm inclined to make this an incremental improvement on the path to a more complete caching solution, rather than kicking it down the road too far. |
As always, many thanks @IanLee1521 |
…nd-wheel-with-pip Install/upgrade setuptools and wheel when upgrading pip
Add support for Python version 3.12 in build workflow
This is the case when these functions are called directly, and where the
inspect_domains()
function is not used.Currently these lists are defined only if
inspect_domains()
has beencalled, which may not be the case when testing (e.g. in #28 / #125).
This breaks the tests I am creating in #125, but I thought these decisions might warrant some discussion with @h-m-f-t and @konklone as to whether this is the right approach, so I've split it out into it's own PR.