Chiara Marmo edited this page Jan 15, 2020 · 2 revisions

How to access GitHub data on scikit-learn

  • via Google Cloud (BigQuery) with queries like

    /* select all issue events for scikit-learn in 2011 */
    SELECT * FROM `githubarchive.year.2011`
    WHERE type = 'IssuesEvent' AND repo.name = 'scikit-learn/scikit-learn'
    

    Issues

    • results for 2013 (?)
    • Query results can be downloaded as JSON, but the download size is limited
    • Starting from 2014, queries fail with the error
      Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
  • via download from GH Archive, e.g. activity for all of January 2015

    wget https://data.gharchive.org/2015-01-{01..31}-{0..23}.json.gz
    

    Issues

    • Most of the downloaded data is irrelevant to the information I need to process: 5 months of 2015 take up 28 GB
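
One possible way to reduce the downloaded volume is to filter each hourly file right after download and keep only the scikit-learn events. The sketch below is not part of the original notes. It assumes that every line of a `.json.gz` file is one JSON event with a top-level `type` field and a nested `repo.name` field, i.e. the same fields used in the BigQuery query above; older archives may use a different layout. The helper names `filter_events` and `shrink` are hypothetical.

```python
import gzip
import json

REPO = "scikit-learn/scikit-learn"


def filter_events(path, repo=REPO, event_type="IssuesEvent"):
    """Yield the events of one type for one repository from an hourly dump.

    Assumption: one JSON object per line, with "type" and "repo.name" fields.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            event = json.loads(line)
            repo_info = event.get("repo") or {}
            if event.get("type") == event_type and repo_info.get("name") == repo:
                yield event


def shrink(src, dst, **kwargs):
    """Write only the matching events of src to dst; return how many were kept."""
    kept = 0
    with gzip.open(dst, "wt", encoding="utf-8") as out:
        for event in filter_events(src, **kwargs):
            out.write(json.dumps(event) + "\n")
            kept += 1
    return kept
```

For example, `shrink("2015-01-01-0.json.gz", "sklearn-2015-01-01-0.json.gz")` would write a reduced copy of the first hourly file, after which the original can be deleted.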