Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-architecting of CSV / MD / JSON writing #125

Merged
merged 16 commits into from
Oct 24, 2017
Merged

Conversation

IanLee1521
Copy link
Collaborator

This resolves #28, by re-working the cli main function to write out each domain of data as it is generated. This works by converting the inspect_domains function yield the results of each domain inspection as they occur, rather than waiting until the very end of the generation.

This PR also simplifies the logic surrounding writing out the other formats (JSON and MD) by using a unified smart_open context manager to manage outputting to a file vs to stdout.

This re-works the cli main function to write out each domain of data as
it is generated. This works by converting the `inspect_domains` function
yield the results of each domain inspection as they occur, rather than
waiting until the very end of the generation.
@coveralls
Copy link

coveralls commented Oct 21, 2017

Coverage Status

Coverage decreased (-0.3%) to 23.338% when pulling 3ccc408 on gh-28-output-as-we-go into c99ea6a on master.

@IanLee1521
Copy link
Collaborator Author

@h-m-f-t , it looks like you may have left coveralls configured to require only increases in coverage, but since this removes some code, it actually dropped the coverage.

Until we can get the testing framework more intact, maybe we should bump up the threshold for a pull request dropping the coverage to, maybe something like, 2%? 5%?

Should be able to be configured at: https://coveralls.io/github/dhs-ncats/pshtt/settings (though you as admin would need to do this)

@coveralls
Copy link

coveralls commented Oct 21, 2017

Coverage Status

Coverage increased (+0.6%) to 24.228% when pulling 760dc18 on gh-28-output-as-we-go into c99ea6a on master.

@coveralls
Copy link

coveralls commented Oct 21, 2017

Coverage Status

Coverage increased (+0.6%) to 24.228% when pulling 905b91f on gh-28-output-as-we-go into c99ea6a on master.

@coveralls
Copy link

coveralls commented Oct 21, 2017

Coverage Status

Coverage increased (+1.2%) to 24.846% when pulling 3225035 on gh-28-output-as-we-go into c99ea6a on master.

@IanLee1521
Copy link
Collaborator Author

@h-m-f-t , For this particular case, I ended up just updating the pull request to add more tests along the way. :)

@coveralls
Copy link

coveralls commented Oct 21, 2017

Coverage Status

Coverage increased (+1.2%) to 24.846% when pulling f99e244 on gh-28-output-as-we-go into c99ea6a on master.

IanLee1521 added a commit that referenced this pull request Oct 21, 2017
This is the case when these functions are called directly, and where the
`inspect_domains()` function is not used.

Currently these lists are defined only if `inspect_domains()` has been
called, which may not be the case when testing (e.g. in #28 / #125).
@coveralls
Copy link

coveralls commented Oct 21, 2017

Coverage Status

Coverage increased (+21.2%) to 44.801% when pulling bf31668 on gh-28-output-as-we-go into c99ea6a on master.

@IanLee1521
Copy link
Collaborator Author

Alright, I know I was doing a bunch of work after submitting this, but I'm done for a while. The last thing I would like to do with this is to add the testing of the to_json and to_markdown functions I factored out, but I can do that in this PR or a separate one if we get this merged sooner.

Copy link
Collaborator

@konklone konklone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a a few small requests in the tests, to avoid hardcoding data/strings where possible. But this overall looks great.

Is there any exception handling present, or does this improve existing exception handling, so that an exception during parsing still results in incomplete data getting written to CSV?

with open(self.temp_filename) as fh:
content = fh.read()

expected = 'Domain,Base Domain,Canonical URL,Live,Redirect,Redirect To,Valid HTTPS,Defaults to HTTPS,Downgrades HTTPS,Strictly Forces HTTPS,HTTPS Bad Chain,HTTPS Bad Hostname,HTTPS Expired Cert,HTTPS Self Signed Cert,HSTS,HSTS Header,HSTS Max Age,HSTS Entire Domain,HSTS Preload Ready,HSTS Preload Pending,HSTS Preloaded,Base Domain HSTS Preloaded,Domain Supports HTTPS,Domain Enforces HTTPS,Domain Uses Strong HSTS,Unknown Error\n'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we generate this by pulling in the global var and joining them with .? That would prevent us from having to update the test data every time a column header changes.

content = fh.read()

expected = ''
expected += 'Domain,Base Domain,Canonical URL,Live,Redirect,Redirect To,Valid HTTPS,Defaults to HTTPS,Downgrades HTTPS,Strictly Forces HTTPS,HTTPS Bad Chain,HTTPS Bad Hostname,HTTPS Expired Cert,HTTPS Self Signed Cert,HSTS,HSTS Header,HSTS Max Age,HSTS Entire Domain,HSTS Preload Ready,HSTS Preload Pending,HSTS Preloaded,Base Domain HSTS Preloaded,Domain Supports HTTPS,Domain Enforces HTTPS,Domain Uses Strong HSTS,Unknown Error\n'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note as above.


expected = ''
expected += 'Domain,Base Domain,Canonical URL,Live,Redirect,Redirect To,Valid HTTPS,Defaults to HTTPS,Downgrades HTTPS,Strictly Forces HTTPS,HTTPS Bad Chain,HTTPS Bad Hostname,HTTPS Expired Cert,HTTPS Self Signed Cert,HSTS,HSTS Header,HSTS Max Age,HSTS Entire Domain,HSTS Preload Ready,HSTS Preload Pending,HSTS Preloaded,Base Domain HSTS Preloaded,Domain Supports HTTPS,Domain Enforces HTTPS,Domain Uses Strong HSTS,Unknown Error\n'
expected += 'example.com,example.com,http://example.com,False,False,,False,False,False,False,False,False,False,False,False,,,False,False,False,False,False,False,False,False,False\n'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be best managed in the test data as an array which is also joined by a comma in code before appending to expected. It still might require changing when the headers change, but it would be much more obvious and direct how to make that change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Replying to all three comments) Yeah, I actually started down that route originally. For the headers, as you say, using ','.join(_pshtt.HEADERS) is pretty simple, the issue for me then became how to convert the data itself, as doing a list of the values will make it tougher to map each back to the appropriate header...

Maybe creating a full domain object and setting all the values there would be best... Let me give that a shot.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as doing a list of the values will make it tougher to map each back to the appropriate header...

This is true, but it's probably an acceptable annoyance, IMO. I think it's likely to be less brittle than a concatenated string, and so just turning that into a static array would be enough for me for this PR.

@konklone
Copy link
Collaborator

Side question - @h-m-f-t, the Coveralls integration is a bit noisy. Is there a way to tune it down somewhat?

@IanLee1521
Copy link
Collaborator Author

The coveralls output might be more my fault... It fires on every push, and since I'd kept working on this PR after creating it (which I probably shouldn't have) that's why it got a bit noisy).

Instead uses a list of tuples for the columnar data.

This feedback was provided by @konklone
@coveralls
Copy link

coveralls commented Oct 23, 2017

Coverage Status

Coverage increased (+21.2%) to 44.801% when pulling 537f412 on gh-28-output-as-we-go into c99ea6a on master.

Copy link
Member

@h-m-f-t h-m-f-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I'd also like @jsf9k to give a look.

@h-m-f-t h-m-f-t requested a review from jsf9k October 23, 2017 16:01
Copy link
Member

@jsf9k jsf9k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

One question, though...when I try to run the tests locally using the command line "tox" the flake8 output is overwhelming. (The tests do pass, though.) Am I running it correctly? I suspect user error on my part.

@IanLee1521
Copy link
Collaborator Author

IanLee1521 commented Oct 23, 2017

@jsf9k -- Huh, I don't see those... this is what I see:

[... snip ...]
flake8 installed: asn1crypto==0.23.0,beautifulsoup4==4.6.0,certifi==2017.7.27.1,cffi==1.11.2,chardet==3.0.4,cryptography==1.9,DataProperty==0.25.6,docopt==0.6.2,dominate==2.3.1,elasticsearch==5.4.0,flake8==3.4.1,idna==2.6,jsonschema==2.6.0,Logbook==1.1.0,markdown2==2.3.4,mbstrdecoder==0.2.2,mccabe==0.6.1,nassl==0.17.0,path.py==10.4,pathvalidate==0.16.2,pshtt==0.3.0.dev0,publicsuffix==1.1.0,pycodestyle==2.3.1,pycparser==2.18,pyflakes==1.5.0,pyOpenSSL==17.3.0,pyparsing==2.2.0,pytablereader==0.13.4,pytablewriter==0.24.0,python-dateutil==2.6.1,pytz==2017.2,requests==2.18.4,requests-cache==0.4.13,SimpleSQLite==0.16.0,six==1.11.0,SSLyze==1.1.4,tls-parser==1.1.0,toml==0.9.3,typepy==0.0.20,urllib3==1.22,wget==3.2,xlrd==1.1.0,XlsxWriter==1.0.2,xlwt==1.3.0
flake8 runtests: PYTHONHASHSEED='4263820095'
flake8 runtests: commands[0] | flake8
___________________________________________________________________________________ summary ____________________________________________________________________________________
  py27: commands succeeded
SKIPPED:  py34: InterpreterNotFound: python3.4
SKIPPED:  py35: InterpreterNotFound: python3.5
  py36: commands succeeded
  flake8: commands succeeded
  congratulations :)

Can you paste what you're seeing?

@konklone
Copy link
Collaborator

The coveralls output might be more my fault... It fires on every push, and since I'd kept working on this PR after creating it (which I probably shouldn't have) that's why it got a bit noisy).

Absolutely not -- filing a PR and then continuing to modify it in-place is a design feature of GitHub and a positive part of collaborative development. Please keep doing that! We should just make coveralls less noisy so as not to punish that workflow.

@jsf9k
Copy link
Member

jsf9k commented Oct 24, 2017

@IanLee1521, here is the output I get from tox.

@IanLee1521
Copy link
Collaborator Author

@konklone does https://github.com/dhs-ncats/pshtt/pull/125/files#diff-09f983a1db5a7aca6102f2e2a260c88cR55 work better for you w.r.t. the static string versus something better?

@IanLee1521
Copy link
Collaborator Author

@jsf9k -- It looks like the issue is that you have a virtualenv directory in your top level directory. Therefore flake8 is picking up those files... One way to solve this is #127. Does that work for you?

@konklone
Copy link
Collaborator

@konklone does https://github.com/dhs-ncats/pshtt/pull/125/files#diff-09f983a1db5a7aca6102f2e2a260c88cR55 work better for you w.r.t. the static string versus something better?

That works (I also would have been fine with a straight array with header names in the comments, but all good).

@konklone konklone merged commit b3cff8c into master Oct 24, 2017
@konklone konklone deleted the gh-28-output-as-we-go branch October 24, 2017 19:06
IanLee1521 added a commit that referenced this pull request Oct 24, 2017
This is the case when these functions are called directly, and where the
`inspect_domains()` function is not used.

Currently these lists are defined only if `inspect_domains()` has been
called, which may not be the case when testing (e.g. in #28 / #125).
IanLee1521 added a commit that referenced this pull request Oct 26, 2017
This is the case when these functions are called directly, and where the
`inspect_domains()` function is not used.

Currently these lists are defined only if `inspect_domains()` has been
called, which may not be the case when testing (e.g. in #28 / #125), or
when using pshtt as a library. The current solution is to raise an
exception if the initialize function is not called explicitly, which is
now its own function to handle initializing all the third party data.

This begins work towards a more complete solution for #99, and allows
for the initialize function to be mocked / called separately when
working in tests.
cisagovbot pushed a commit that referenced this pull request Feb 8, 2023
cisagovbot pushed a commit that referenced this pull request Dec 19, 2023
…for-codeql-workflow

Add a diagnostics job to the CodeQL workflow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants