
Process for adding python/R/Julia/Ubuntu/etc. packages #622

Closed · 1 of 2 tasks
jemrobinson opened this issue May 1, 2020 · 17 comments · Fixed by #1387
Labels: documentation (Improvements to documentation)
@jemrobinson (Member) commented May 1, 2020

We have two ways in which software packages can be made available for users inside an SRE:

  1. Baked into our "batteries included" VM image
     • Ubuntu packages
     • Julia packages
     • Python packages
     • R packages
  2. Available from our mirrors of external repositories
     • Python packages
     • R packages

We need a process for deciding what gets approved for (1) and what gets approved for (2). Specifically, under what circumstances would we say no to a request for a package, and how do we decide whether an approved package belongs in (1) or (2)? Where does the liability lie if a malicious package is included?

@thobson88 is going to take a look at this (initially in the context of the request for ~100 new R packages on #615).

Stages

  • Design a process for deciding whether to include any particular package (done in Package whitelisting policy & request form #671)
  • Automatically generate a one-page PDF summary for each proposed package that highlights any CVEs/safety concerns and lists metadata such as download statistics
@thobson88 (Contributor)

Here's a proposal for a policy to decide between options 1 and 2 (above) for R and Python packages. I'll consider the other question (when to accept or reject a package request) in another comment.

These are the criteria I used to judge the ~100 R packages in #615. I don't see any real possibility of automating this process, as it requires a judgement of whether the package is broadly useful to a cross-section of researchers.

To be deemed generally useful, and therefore included in the VM image, the package should:

  • implement at least one generic (i.e. not domain-specific) statistical algorithm or method, or
  • provide support for a cross-cutting analysis technique (e.g. geospatial data analysis, NLP), or
  • facilitate data science or software development best practices (e.g. for robustness, correctness, reproducibility), or
  • enhance the presentational features of the programming language (e.g. for producing plots, notebooks, articles, websites), or
  • enhance the usability of the programming language or development environment (e.g. RStudio, PyCharm).

NOTE: based on these criteria, the following packages that are currently in the CRAN VM image list belong instead on the CRAN mirror: ape, bedr, COMBAT, fmsb, quantmod, Seurat, SnowballC, surveillance

@thobson88 (Contributor) commented May 14, 2020

The other decision, of whether to accept or reject a package, has (at least theoretically) security implications as there's no way to guarantee that a package does not contain malicious code. It's already been considered in #312.

As noted in that ticket, the whitelisting of a particular package does not mean that it's immediately included in either the VM image or the repository mirror. Instead, it's added to the list of packages that will be approved by default in the event that they are requested by a user.

But I think we need to revisit the six criteria originally proposed in #312, because they would exclude some well-established packages, and some of them (points 2 and 3) seem arbitrary without adding any obvious benefit.

We ought to be able to justify the criteria, so the question is: what makes open source software trustworthy? I'd argue that either of the following is a reasonable basis for trust:

  • plenty of people have had plenty of time to use (and scrutinise) it, or
  • it's signed by trusted author(s).

Given that most packages on CRAN and PyPI are not digitally signed, we should focus on the first of these. This could be judged based on some combination of the following metadata (available from the package repository):

  • date of first publication
  • date of publication of current version
  • download statistics since those two dates

We could compute a measure of the "time spent in use", measured in days or weeks. For each week since publication, we multiply the number of downloads in that week by the number of weeks that have passed since then, and sum the results. We do this separately for the package as a whole and for its current version, and if either exceeds some agreed threshold, we add the package to the "approve-by-default" whitelist.
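A minimal sketch of that calculation, assuming weekly download counts have already been fetched from the repository (oldest week first); the function name and weighting convention are illustrative, not anything agreed in this thread:

```python
def weeks_in_use(weekly_downloads):
    """Sum each week's downloads, weighted by how many weeks have elapsed since."""
    n_weeks = len(weekly_downloads)
    return sum(
        # most recent week counts once, the oldest week counts n_weeks times
        downloads * (n_weeks - week_index)
        for week_index, downloads in enumerate(weekly_downloads)
    )

# Example: 10 downloads/week over four weeks gives 10*4 + 10*3 + 10*2 + 10*1 = 100
print(weeks_in_use([10, 10, 10, 10]))
```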

@jemrobinson any thoughts?

@jemrobinson (Member, Author) commented May 14, 2020

I really like your criteria for inclusion in the base image. I think we should add these to the docs directory and try to adhere to them when considering whether to add new packages (and possibly remove some old ones if anyone has time to look back over what we currently install).

I also completely agree with:

  • plenty of people have had plenty of time to use (and scrutinise) it
  • signed by trusted author(s)

as sensible criteria for whitelisting.

There's then a question of whether this can easily be turned into a computable metric, which is, I think, where #312 got stuck. If there is a sensible way to compute this (and therefore possibly automate the decision) then that's great, but otherwise we could keep these as criteria that admins should consider before adding new packages.

We should note that although we have complete control over which packages are included in Turing deployments, that's not true in general and the admins for other deployments might have different ideas about what to whitelist or not.

Do you have any further thoughts on this @martintoreilly?

@thobson88 (Contributor) commented May 18, 2020

> There's then a question of whether this can easily be turned into a computable metric

Here's the "days in use" metric described above, computed for all the packages in the current CRAN whitelist (all versions, during the last year), in a histogram with log scale:
[Figure: histogram of log10 "days in use" over the last year for whitelisted CRAN packages]

Assuming we can get this data for the other package repos, could we use it to pick a threshold for "days in use" above which a package is whitelisted-by-default?
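For illustration, one way a threshold could be chosen is as a low percentile of the scores of packages we already trust; the placeholder values and the percentile choice below are assumptions, not anything agreed in this thread:

```python
import numpy as np

# Placeholder "days in use" scores for packages already on the whitelist (illustrative only).
whitelist_scores = np.array([7.8e3, 1.2e4, 2.1e5, 3.4e5, 9.9e6])

# e.g. whitelist-by-default anything scoring above the 10th percentile of trusted packages
threshold = np.percentile(whitelist_scores, 10)
print(f"Whitelist-by-default threshold: {threshold:.0f} days in use")
```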

@martintoreilly (Member)

> We could compute a measure of the "time spent in use", measured in days or weeks. For each week since publication, we multiply the number of downloads in that week by the number of weeks that have passed since, and then sum all the results.

I think this is a decaying proxy for days in actual use. An obsolete package that was massively popular 10 years ago will have a high score that will keep increasing, even if no-one has downloaded it for a decade. It feels like we ideally want a consistent level of downloads, stretching to relatively recently. If we weren't worried about having a machine-computable metric to threshold or weight in an automated decision or quality score for a package, would we really just want to eyeball a downloads-over-time chart for each package?

Are there other ways we can test security more directly (e.g. checking CVEs for package versions, static analysis etc)?

Do we want to consider new versions of packages as inheriting all the quality metrics accumulated over previous versions? I can see cases where the new version might be much less trusted (e.g. after a massive refactor) or much more trusted (e.g. it fixes a high-risk security bug).

@martintoreilly (Member)

@thobson88 Where did you get the download data from?

@JimMadge (Member)

Tangentially related: @jemrobinson and I have started building a core/whitelisted packages document from scratch here (https://hackmd.io/@nHslnPpLRmCxPOmQBcOW-g/SJa8Em4oI), due to the (probably necessarily) large size of the current lists.

@jemrobinson (Member, Author) commented May 22, 2020

@thobson88 Maybe reversing your recency weighting would help (i.e. a download one year ago is worth less than a download yesterday)? Possibly exponentially? Something like ndownloads * exp(-A * ndays) rather than ndownloads * ndays inside your sum? This should deal with @martintoreilly's point.
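A minimal sketch of that variant, reusing the assumed weekly download counts from the earlier example; the decay constant and the weekly granularity are illustrative choices, not agreed values:

```python
import math

def decayed_weeks_in_use(weekly_downloads, decay_rate=0.5):
    """Weight each week's downloads by exp(-decay_rate * weeks_ago), so recent
    downloads count for more than old ones."""
    n_weeks = len(weekly_downloads)
    return sum(
        downloads * math.exp(-decay_rate * (n_weeks - 1 - week_index))
        for week_index, downloads in enumerate(weekly_downloads)
    )

# Downloads concentrated long ago are down-weighted relative to recent activity.
old_spike = [1000, 1000, 0, 0, 0, 0]            # popular six weeks ago, unused since
steady_recent = [200, 200, 200, 200, 200, 200]  # modest but consistent downloads
print(decayed_weeks_in_use(old_spike))      # ≈ 217
print(decayed_weeks_in_use(steady_recent))  # ≈ 483
```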

@thobson88 (Contributor)

> @thobson88 Where did you get the download data from?

I used the cran_downloads function from the cranlogs package. I'm currently looking for equivalents for the other repositories.
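For reference, a rough sketch of pulling the same data without R, via the public cranlogs.r-pkg.org web API that the cranlogs package wraps; the endpoint format and response shape here are assumptions worth double-checking against the cranlogs documentation:

```python
import requests

# Assumed cranlogs web API endpoint: daily downloads for one package over the last month.
package = "ggplot2"
url = f"https://cranlogs.r-pkg.org/downloads/daily/last-month/{package}"
response = requests.get(url, timeout=30)
response.raise_for_status()

# Assumed response: a JSON array with one object per package, each holding a
# "downloads" list of {"day": ..., "downloads": ...} entries.
daily = response.json()[0]["downloads"]
print(sum(day["downloads"] for day in daily))
```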

@martintoreilly (Member)

For PyPI, download stats can be accessed from the Google BigQuery PyPI stats tables; the PyPI Stats API points users to the BigQuery tables for bulk operations.
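A rough sketch of that bulk route, assuming the public bigquery-public-data.pypi.file_downloads table and the google-cloud-bigquery client (both assumptions based on general knowledge of PyPI's stats tooling, not something specified in this thread):

```python
from google.cloud import bigquery

# Assumes Google Cloud credentials are configured; queries against this public
# table scan a lot of data, so they can incur BigQuery costs.
client = bigquery.Client()
query = """
    SELECT COUNT(*) AS downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = 'numpy'
      AND DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""
rows = list(client.query(query).result())
print(f"Downloads in the last 30 days: {rows[0].downloads}")
```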

@martintoreilly (Member) commented May 26, 2020

Vulnerability checking

Vulnerability databases

Vulnerability scanners

Python

  • Bandit. This is the tool GitLab CI supports for Static Application Security Testing (SAST) of Python code (see the sketch after this list).
  • Jake (blog). From Sonatype, the same folks who make the Nexus package repository proxy with caching and blacklisting. Runs against conda environments.
  • RATS (also covers C/C++)
  • Pyntch? (more of a runtime error detector)
  • Packagr. Package security scanning with a paid tier.
  • OWASP ZAP (Zed Attack Proxy). This is the tool used by GitLab's Auto DAST check.
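To make the first option concrete, a hedged sketch of running Bandit over a candidate package's source tree before approval; the directory name is a placeholder and this assumes bandit is installed locally:

```python
import json
import subprocess

# Run Bandit recursively over the downloaded source of a candidate package
# and collect its findings as JSON (Bandit exits non-zero when issues are found,
# so we don't pass check=True).
result = subprocess.run(
    ["bandit", "-r", "candidate_package/", "-f", "json"],
    capture_output=True,
    text=True,
)
report = json.loads(result.stdout)
print(f"{len(report['results'])} potential issues found")
```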

General

@jemrobinson (Member, Author) commented May 26, 2020

What are we worried about?

  • Package typo-squatting (getting a malicious package instead of the intended one)
  • Exploiting the available compute for purposes unrelated to the data science problem at hand (e.g. using the Azure compute resource for mining Bitcoin [obviously this specific example isn't a good one!])
  • Privilege escalation attacks that allow local users to gain root access (this is why we do not currently support Docker) or access to restricted groups

What are we not (so) worried about?

  • Attacks that target webservers (we will not be exposing any of these outside of the SRE)
  • Attacks that get the user to run arbitrary code (Safe Haven users are unprivileged)

What are the risks for the Safe Haven?

At a deeper level the main worries are:

  • approved users having access to data that they shouldn't (eg. from data mixing)
  • unapproved users having access to data (eg. from a data breach)

Which types of actors are we worried about?

  • User who wants to "just get it done" and install something randomly found online without considering security implications or more appropriate alternatives.
  • Administrator mistakenly adding a package that has not gone through the approval process due to lack of clarity on the process or the expected documentation supporting approval.

Which types of actors are we not worried about?

  • Malicious administrator (covered by existing access controls and logging)
  • Malicious developer (covered by existing access controls and logging)

Conclusions: It seems like we want to push the burden of decision making onto the person making the request.

Policy

@martintoreilly 's thoughts:

  • update compute image every month
  • check CVE databases for all whitelisted packages every month
    • ideally blacklist specific versions of packages (check whether this works)
  • require users to justify why they want additional packages to be whitelisted
    • decision made by one or more decision makers (could be data provider representative or delegated)
    • biased towards saying 'yes' to additions that will improve research productivity

Questions for request-makers

  • Is this package the most widely supported way to do the thing you want to do?
  • What will you be able to do with this package that you can't currently do? What alternatives are there?
  • What risks to data integrity/security might arise from including this package?

Summary statistics for decision makers

  • Number of downloads in the last month (or a graph like the one @thobson88 showed)
  • Number of contributors
  • List of known vulnerabilities

@thobson88 (Contributor) commented May 28, 2020

@jemrobinson I've written up a first draft in PR #671.

@martintoreilly (Member) commented Jun 3, 2020

Existence on other default package lists

I think we should add whether a package has been included in other supported package lists as part of our quality-assurance signal. This surely says something about how widely useful something is?

  • Anaconda: Has a list of included packages. There is a separate list per OS/Python version combination, but the list includes Linux packages. A subset of the packages are marked "In installer", which might be a signal that they are considered more widely used / core.
    • Linux/Python 3.6: 682 packages (304 "in installer")
  • CoCalc: From SageMath. The documentation says "The default environment is very large, well tested, regularly maintained and matured over many years. This is what a project runs by default." Preinstalled:
    • Python
      • Python 2: 618 packages
      • Python 3: 1051 packages
      • SageMath: 496 packages
      • Anaconda: 511 packages
    • R
      • R-project: 4871 packages
      • SageMath R: 449 packages
    • Linux: 212 packages
    • Julia: 431 packages
    • Octave: 35 packages

@martintoreilly (Member) commented Jun 4, 2020

PyPI malware checks

As of February/March 2020, the Warehouse repository backing PyPI has had tooling in place to run malware checks on package upload, on an automated schedule, and on manual trigger by an admin.

It looks like the hooks are there, but it's not clear what real checks, if any, are running in production.

Another part of the same work is looking to incorporate The Update Framework (TUF) for more secure package updates.

@martintoreilly (Member)

I found an R package CVE - https://www.cvedetails.com/cve/CVE-2008-3931/

@JimMadge (Member)

Closing as stale and open-ended. This would be better placed in a discussion until we have a concrete proposal.
