Re-architecting of CSV / MD / JSON writing #125
Conversation
This re-works the CLI main function to write out each domain of data as it is generated. This works by converting the `inspect_domains` function to yield the results of each domain inspection as they occur, rather than waiting until the very end of the generation.
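Below is a minimal sketch of the streaming pattern this describes, assuming a hypothetical `inspect()` helper, a `headers` list, and dict-shaped results; the actual pshtt signatures may differ.

```python
import csv
import sys

def inspect(domain, options):
    # Placeholder for pshtt's real per-domain scanning logic; returns a result dict.
    return {"Domain": domain}

def inspect_domains(domains, options):
    """Yield each domain's results as soon as its inspection finishes."""
    for domain in domains:
        yield inspect(domain, options)

def write_csv_incrementally(domains, options, headers, out=sys.stdout):
    """Write a CSV row for each result as it is yielded, instead of at the very end."""
    writer = csv.writer(out)
    writer.writerow(headers)
    for result in inspect_domains(domains, options):
        writer.writerow([result.get(header, "") for header in headers])
```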
@h-m-f-t, it looks like you may have left Coveralls configured to only allow increases in coverage, but since this PR removes some code, coverage actually dropped. Until we can get the testing framework more intact, maybe we should bump up the allowed coverage drop for a pull request to something like 2%? 5%? It should be configurable at https://coveralls.io/github/dhs-ncats/pshtt/settings (though you as admin would need to do this).
Done by ignoring the flake8 error when using Python2
@h-m-f-t, for this particular case, I ended up just updating the pull request to add more tests along the way. :)
Alright, I know I was doing a bunch of work after submitting this, but I'm done for a while. The last thing I would like to do with this is to add the testing of the
I made a few small requests in the tests, to avoid hardcoding data/strings where possible. But this overall looks great.
Is there any exception handling present, or does this improve existing exception handling, so that an exception during parsing still results in incomplete data getting written to CSV?
tests/test_cli.py (outdated)
with open(self.temp_filename) as fh:
    content = fh.read()

expected = 'Domain,Base Domain,Canonical URL,Live,Redirect,Redirect To,Valid HTTPS,Defaults to HTTPS,Downgrades HTTPS,Strictly Forces HTTPS,HTTPS Bad Chain,HTTPS Bad Hostname,HTTPS Expired Cert,HTTPS Self Signed Cert,HSTS,HSTS Header,HSTS Max Age,HSTS Entire Domain,HSTS Preload Ready,HSTS Preload Pending,HSTS Preloaded,Base Domain HSTS Preloaded,Domain Supports HTTPS,Domain Enforces HTTPS,Domain Uses Strong HSTS,Unknown Error\n'
Could we generate this by pulling in the global var and joining the headers with `,`? That would prevent us from having to update the test data every time a column header changes.
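For instance, a hedged sketch of that suggestion, reusing the `_pshtt.HEADERS` global mentioned later in this thread (the exact import path is an assumption):

```python
from pshtt import pshtt as _pshtt  # assumed import path for the module exposing HEADERS

# Build the expected header row from the module's own column list instead of
# hardcoding the string, so the test tracks any future header changes.
expected = ','.join(_pshtt.HEADERS) + '\n'
```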
tests/test_cli.py (outdated)
content = fh.read()

expected = ''
expected += 'Domain,Base Domain,Canonical URL,Live,Redirect,Redirect To,Valid HTTPS,Defaults to HTTPS,Downgrades HTTPS,Strictly Forces HTTPS,HTTPS Bad Chain,HTTPS Bad Hostname,HTTPS Expired Cert,HTTPS Self Signed Cert,HSTS,HSTS Header,HSTS Max Age,HSTS Entire Domain,HSTS Preload Ready,HSTS Preload Pending,HSTS Preloaded,Base Domain HSTS Preloaded,Domain Supports HTTPS,Domain Enforces HTTPS,Domain Uses Strong HSTS,Unknown Error\n'
Same note as above.
tests/test_cli.py (outdated)
expected = ''
expected += 'Domain,Base Domain,Canonical URL,Live,Redirect,Redirect To,Valid HTTPS,Defaults to HTTPS,Downgrades HTTPS,Strictly Forces HTTPS,HTTPS Bad Chain,HTTPS Bad Hostname,HTTPS Expired Cert,HTTPS Self Signed Cert,HSTS,HSTS Header,HSTS Max Age,HSTS Entire Domain,HSTS Preload Ready,HSTS Preload Pending,HSTS Preloaded,Base Domain HSTS Preloaded,Domain Supports HTTPS,Domain Enforces HTTPS,Domain Uses Strong HSTS,Unknown Error\n'
expected += 'example.com,example.com,http://example.com,False,False,,False,False,False,False,False,False,False,False,False,,,False,False,False,False,False,False,False,False,False\n'
I think this would be best managed in the test data as an array which is also joined by a comma in code before appending to `expected`. It still might require changing when the headers change, but it would be much more obvious and direct how to make that change.
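A rough sketch of that idea, using values from the test data quoted above and eliding the trailing columns for brevity:

```python
# Keep the expected row as a list so each value maps to a single column.
expected_row = [
    'example.com',         # Domain
    'example.com',         # Base Domain
    'http://example.com',  # Canonical URL
    'False',               # Live
    # ... remaining values, one per header ...
]
expected = ','.join(expected_row) + '\n'
```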
(Replying to all three comments) Yeah, I actually started down that route originally. For the headers, as you say, using `','.join(_pshtt.HEADERS)` is pretty simple; the issue for me then became how to convert the data itself, as doing a list of the values will make it tougher to map each back to the appropriate header...
Maybe creating a full domain object and setting all the values there would be best... Let me give that a shot.
> as doing a list of the values will make it tougher to map each back to the appropriate header...
This is true, but it's probably an acceptable annoyance, IMO. I think it's likely to be less brittle than a concatenated string, and so just turning that into a static array would be enough for me for this PR.
Side question - @h-m-f-t, the Coveralls integration is a bit noisy. Is there a way to tune it down somewhat?
The coveralls output might be more my fault... It fires on every push, and since I'd kept working on this PR after creating it (which I probably shouldn't have), that's why it got a bit noisy.
Instead uses a list of tuples for the columnar data. This feedback was provided by @konklone
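A hedged sketch of what the (header, value) tuple approach could look like in the test; only a few columns are shown and the variable names are illustrative:

```python
# Pair each expected value with its header so the mapping stays explicit.
expected_data = [
    ('Domain', 'example.com'),
    ('Base Domain', 'example.com'),
    ('Canonical URL', 'http://example.com'),
    ('Live', 'False'),
    # ... remaining (header, value) pairs ...
]
expected = ','.join(header for header, _ in expected_data) + '\n'
expected += ','.join(value for _, value in expected_data) + '\n'
```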
LGTM, though I'd also like @jsf9k to give a look.
Looks good to me.
One question, though... when I try to run the tests locally using the command line `tox`, the flake8 output is overwhelming. (The tests do pass, though.) Am I running it correctly? I suspect user error on my part.
@jsf9k -- Huh, I don't see those... this is what I see: [... snip ...]
flake8 installed: asn1crypto==0.23.0,beautifulsoup4==4.6.0,certifi==2017.7.27.1,cffi==1.11.2,chardet==3.0.4,cryptography==1.9,DataProperty==0.25.6,docopt==0.6.2,dominate==2.3.1,elasticsearch==5.4.0,flake8==3.4.1,idna==2.6,jsonschema==2.6.0,Logbook==1.1.0,markdown2==2.3.4,mbstrdecoder==0.2.2,mccabe==0.6.1,nassl==0.17.0,path.py==10.4,pathvalidate==0.16.2,pshtt==0.3.0.dev0,publicsuffix==1.1.0,pycodestyle==2.3.1,pycparser==2.18,pyflakes==1.5.0,pyOpenSSL==17.3.0,pyparsing==2.2.0,pytablereader==0.13.4,pytablewriter==0.24.0,python-dateutil==2.6.1,pytz==2017.2,requests==2.18.4,requests-cache==0.4.13,SimpleSQLite==0.16.0,six==1.11.0,SSLyze==1.1.4,tls-parser==1.1.0,toml==0.9.3,typepy==0.0.20,urllib3==1.22,wget==3.2,xlrd==1.1.0,XlsxWriter==1.0.2,xlwt==1.3.0
flake8 runtests: PYTHONHASHSEED='4263820095'
flake8 runtests: commands[0] | flake8
___________________________________________________________________________________ summary ____________________________________________________________________________________
py27: commands succeeded
SKIPPED: py34: InterpreterNotFound: python3.4
SKIPPED: py35: InterpreterNotFound: python3.5
py36: commands succeeded
flake8: commands succeeded
congratulations :)

Can you paste what you're seeing?
Absolutely not -- filing a PR and then continuing to modify it in-place is a design feature of GitHub and a positive part of collaborative development. Please keep doing that! We should just make coveralls less noisy so as not to punish that workflow.
@IanLee1521, here is the output I get from
@konklone does https://github.com/dhs-ncats/pshtt/pull/125/files#diff-09f983a1db5a7aca6102f2e2a260c88cR55 work better for you w.r.t. the static string?
That works (I also would have been fine with a straight array with header names in the comments, but all good). |
This is the case when these functions are called directly, and where the `inspect_domains()` function is not used. Currently these lists are defined only if `inspect_domains()` has been called, which may not be the case when testing (e.g. in #28 / #125), or when using pshtt as a library. The current solution is to raise an exception if the initialization function has not been called explicitly; that initialization is now its own function responsible for loading all the third-party data. This begins work towards a more complete solution for #99, and allows the initialize function to be mocked or called separately when working in tests.
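As a rough illustration of the pattern described here (names and data are hypothetical, not pshtt's actual identifiers):

```python
PRELOAD_LIST = None  # hypothetical third-party data, not loaded at import time

def fetch_preload_list():
    # Stand-in for downloading/parsing the real preload data.
    return set()

def initialize():
    """Explicitly load all third-party data before any inspection runs."""
    global PRELOAD_LIST
    PRELOAD_LIST = fetch_preload_list()

def is_hsts_preloaded(domain):
    # Fail loudly if initialize() was never called, rather than returning bad data.
    if PRELOAD_LIST is None:
        raise RuntimeError("initialize() must be called before inspecting domains")
    return domain in PRELOAD_LIST
```

In tests, `initialize()` (or the loader it wraps) can then be mocked or called directly without going through `inspect_domains()`.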
This resolves #28 by re-working the CLI main function to write out each domain of data as it is generated. This works by converting the `inspect_domains` function to yield the results of each domain inspection as they occur, rather than waiting until the very end of the generation. This PR also simplifies the logic surrounding writing out the other formats (JSON and MD) by using a unified `smart_open` context manager to manage outputting to a file vs. to stdout.
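A minimal sketch of a `smart_open`-style helper, assuming the goal is simply "write to a file when a path is given, otherwise to stdout"; the PR's actual implementation may differ in its details:

```python
import contextlib
import sys

@contextlib.contextmanager
def smart_open(filename=None):
    """Yield a writable handle: a real file if filename is given, else stdout."""
    if filename:
        handle = open(filename, 'w')
        try:
            yield handle
        finally:
            handle.close()
    else:
        yield sys.stdout

# Usage: the same writing code serves both destinations.
with smart_open(None) as out:
    out.write("Domain,Base Domain\n")
```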