Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes bugs where the globals are None #126

Merged
merged 1 commit into from
Oct 26, 2017
Merged

Fixes bugs where the globals are None #126

merged 1 commit into from
Oct 26, 2017

Conversation

IanLee1521
Copy link
Collaborator

This is the case when these functions are called directly, and where the
inspect_domains() function is not used.

Currently these lists are defined only if inspect_domains() has been
called, which may not be the case when testing (e.g. in #28 / #125).

This breaks the tests I am creating in #125, but I thought these decisions might warrant some discussion with @h-m-f-t and @konklone as to whether this is the right approach, so I've split it out into it's own PR.

@coveralls
Copy link

coveralls commented Oct 21, 2017

Coverage Status

Coverage decreased (-0.2%) to 23.414% when pulling 14f41a9 on default-globals into c99ea6a on master.

@konklone
Copy link
Collaborator

It seems like the right call here would be to move the downloading of this data into an initialize_data or initialize_cache function or something, which can then also be easily mocked within tests (and called as part of test setup) to set them all to static data or empty arrays.

I prefer that over adding special logic to each determinant function, which essentially means we're putting the burden on each is_ function to know whether it's being operated in a test environment or not.

Maybe that function is best written as part of fixing #99? But if it's easy enough to do here without having to refactor the caching logic itself, then we could do it here.

@IanLee1521
Copy link
Collaborator Author

Hmm, I like the idea of having an initialize_cache or similar function.. Then we could change what I've proposed here to be something like:

...
if suffix_list is None:
    initialize_caches()
...

I still think we want something in these three functions like above to make sure that if they get called (e.g. by the is_* functions, that they actually have done something to set the values on suffix_list, preload_list, and preload_pending (unlike now where those are only set as part of calling the inspect_domains function).

@coveralls
Copy link

coveralls commented Oct 23, 2017

Coverage Status

Coverage decreased (-0.1%) to 23.494% when pulling b48cb70 on default-globals into c99ea6a on master.

Copy link
Member

@jsf9k jsf9k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@konklone
Copy link
Collaborator

I still think we want something in these three functions like above to make sure that if they get called (e.g. by the is_* functions, that they actually have done something to set the values on suffix_list, preload_list, and preload_pending (unlike now where those are only set as part of calling the inspect_domains function).

So I don't agree, here -- I don't think the is_* functions should have to be aware of things like this, and it leads to redundant code. In this case, I think the test harness should do what it needs to do to ensure that those methods are properly initialized with what they need.

So I would expect the test harness to generally run initialize_caches(), as well as mock out that function so it doesn't hit the network but loads those arrays with whatever generic arrays makes sense in a test environment. (And you could make it easy to override per-test for those that exercise those functions explicitly.)

For this PR, I'd really just like to clean up the three is_* functions by having the test harness call initialize_caches() so that those functions don't crash without the explicit check. Does that sound doable?

@IanLee1521
Copy link
Collaborator Author

As far as the test harness is concerned, I actually found a way already to work around them not being loaded: https://github.com/dhs-ncats/pshtt/blob/537f4120396b188ee3f20c8c0b3610ee9bad988d/tests/test_cli.py#L11-L20

I'm not sure that I agree that the is_hsts_preload_pending, is_hsts_preloaded and parent_domain_for should fail with an exception if they don't have the corresponding lists available.

While tests are one example of how you might use these functions directly, another is the case where someone is using the internals of pshtt as a library. In that case, they may not make use of the inspect_domains function, which is currently the only place those lists get created.

Consider the following snippet, which would fail without us making some kind of change:

from pshtt.pshtt import is_hsts_preloaded
from pshtt.models import Domain

domain_url = 'example.com'

my_domain = Domain(domain_url)
my_domain.http = Endpoint("http", "root", domain_url)
my_domain.httpwww = Endpoint("http", "www", domain_url)
my_domain.https = Endpoint("https", "root", domain_url)
my_domain.httpswww = Endpoint("https", "www", domain_url)

assert(is_hsts_preloaded(my_domain) == True)

The above will currently raise an exception due to preload_list being undefined:

In [1]: from pshtt.pshtt import is_hsts_preloaded

In [2]: from pshtt.models import Domain, Endpoint

In [3]: domain_url = 'example.com'

In [4]: my_domain = Domain(domain_url)
   ...: my_domain.http = Endpoint("http", "root", domain_url)
   ...: my_domain.httpwww = Endpoint("http", "www", domain_url)
   ...: my_domain.https = Endpoint("https", "root", domain_url)
   ...: my_domain.httpswww = Endpoint("https", "www", domain_url)
   ...:

In [5]: assert(is_hsts_preloaded(my_domain) == True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-8d7a9de74754> in <module>()
----> 1 assert(is_hsts_preloaded(my_domain) == True)

~/projs/pshtt/pshtt/pshtt.py in is_hsts_preloaded(domain)
    910     Whether a domain is contained in Chrome's HSTS preload list.
    911     """
--> 912     return domain.domain in preload_list
    913
    914

TypeError: argument of type 'NoneType' is not iterable

I can imagine something like the above being used to very quickly write an "are these domains in the HSTS preload / pending-preload lists" script.

@konklone
Copy link
Collaborator

OK, I find that convincing, that those methods should be context-aware of whether their relevant data's been loaded.

However, I still think this could be handled a bit differently. While it's not the worst thing, I don't think it would be ideal if is_hsts_preloaded() suddenly downloaded network data without warning. Instead, I think it would be better to treat it as if it were an API call that needs to have third party credentials set ahead of time (e.g. using the censys Python library to make a Censys API call, without having set API credentials).

So what about insisting that that method needs to have initialize_caches (or maybe since we're exposing it, a more explanatory name, e.g. get_third_party_data) run first, and having the method raise a specific exception if it's not been run?

@konklone
Copy link
Collaborator

(Also, small note that since merging #125, there's a merge conflict here that should be resolved.)

@IanLee1521
Copy link
Collaborator Author

(Rebased and pushed to fix conflicts.)

So it sounds like you are leaning towards something like the following where we raise an exception and force an explicit load of the data, rather than having the data implicitly loaded when it's needed if it hasn't been yet... E.g.:

def parent_domain_for(hostname):
    if suffix_list is None:
        raise RuntimeError('suffix_list data has not been loaded. You must run `initialize_caches()` in order to load')

    return suffix_list.get_public_suffix(hostname)

@konklone
Copy link
Collaborator

Yep, that's exactly right. Do you think that's good/doable?

@IanLee1521
Copy link
Collaborator Author

Certainly doable, though I think I personally lean towards implicitly doing the data fetch if they need it, rather than making it explicitly required. In defense of those other services, since these get defined as globals, they would only be fetched once per library invocation and held in memory.

Maybe @h-m-f-t + @jsf9k should weigh in with their thoughts?

@IanLee1521
Copy link
Collaborator Author

As an aside, I will be squashing this branch prior to merging to resolve the strange commits as we've been discussing what this ought to look like

@konklone
Copy link
Collaborator

Certainly doable, though I think I personally lean towards implicitly doing the data fetch if they need it, rather than making it explicitly required. In defense of those other services, since these get defined as globals, they would only be fetched once per library invocation and held in memory.

That is true, and would likely mitigate accidental DDoS in many cases. That said, there are still cases where people may use the Python API but still orchestrate their overall batching/parallelization in ways that don't benefit from this in-memory use. You can imagine this coming up if someone called pshtt functions inside of a Python-based Lambda function, for instance.

Based on my own experience at doing high-parallelization batching and exploring use of Lambda, I favor approaches that make sure that people are handling this network outreach, which can have implications for both clients (in time/cost) and servers (in load/cost), explicitly rather than implicitly. Very open to other thoughts, of course!

@h-m-f-t
Copy link
Member

h-m-f-t commented Oct 24, 2017

Based on my own experience at doing high-parallelization batching and exploring use of Lambda, I favor approaches that make sure that people are handling this network outreach, which can have implications for both clients (in time/cost) and servers (in load/cost), explicitly rather than implicitly.

I think this approach is reasonable, especially when viewed in terms of (ours, others) bandwidth costs. I'm okay raising an hard error to the user to enforce that.

This is the case when these functions are called directly, and where the
`inspect_domains()` function is not used.

Currently these lists are defined only if `inspect_domains()` has been
called, which may not be the case when testing (e.g. in #28 / #125), or
when using pshtt as a library. The current solution is to raise an
exception if the initialize function is not called explicitly, which is
now its own function to handle initializing all the third party data.

This begins work towards a more complete solution for #99, and allows
for the initialize function to be mocked / called separately when
working in tests.
@IanLee1521
Copy link
Collaborator Author

I think I've addressed all the concerns from @konklone and @h-m-f-t (and @jsf9k already liked the changes ;)).

Squashed and settled on raising a hard exception for now. Let me know if I missed anything. I'm inclined to make this an incremental improvement on the path to a more complete caching solution, rather than kicking it down the road too far.

@h-m-f-t h-m-f-t merged commit 745a629 into master Oct 26, 2017
@h-m-f-t h-m-f-t deleted the default-globals branch October 26, 2017 12:33
@h-m-f-t
Copy link
Member

h-m-f-t commented Oct 26, 2017

As always, many thanks @IanLee1521

cisagovbot pushed a commit that referenced this pull request Apr 26, 2023
…nd-wheel-with-pip

Install/upgrade setuptools and wheel when upgrading pip
cisagovbot pushed a commit that referenced this pull request Dec 19, 2023
Add support for Python version 3.12 in build workflow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants