More options for pull-requests: --state, --org, and --search (#80)
* always ask for 100 items when paginating (helps #79)
* fix typos in README.md
* ignore test and build artifacts
* --org and --state options for pull-requests
* --search for pull-requests, but it can only get 1000 results (a GitHub API limit)
nedbat authored Dec 10, 2023
1 parent 56f2aee commit a0a711b
Showing 4 changed files with 90 additions and 24 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -8,4 +8,5 @@ venv
 .eggs
 .pytest_cache
 *.egg-info
-
+.coverage
+build/
22 changes: 17 additions & 5 deletions README.md
@@ -82,13 +82,25 @@ You can use the `--pull-request` option one or more times to load specific pull
 
 Note that the `merged_by` column on the `pull_requests` table will only be populated for pull requests that are loaded using the `--pull-request` option - the GitHub API does not return this field for pull requests that are loaded in bulk.
 
+You can load only pull requests in a certain state with the `--state` option:
+
+    $ github-to-sqlite pull-requests --state=open github.db simonw/datasette
+
+Pull requests across an entire organization (or more than one) can be loaded with `--org`:
+
+    $ github-to-sqlite pull-requests --state=open --org=psf --org=python github.db
+
+You can use a search query to find pull requests. Note that no more than 1000 will be loaded (this is a GitHub API limitation), and some data will be missing (base and head SHAs). When using searches, other filters are ignored; put all criteria into the search itself:
+
+    $ github-to-sqlite pull-requests --search='org:python defaultdict state:closed created:<2023-09-01' github.db
+
 Example: [pull_requests table](https://github-to-sqlite.dogsheep.net/github/pull_requests)
 
 ## Fetching issue comments for a repository
 
 The `issue-comments` command retrieves all of the comments on all of the issues in a repository.
 
-It is recommended you run `issues` first, so that each imported comment can have a foreign key poining to its issue.
+It is recommended you run `issues` first, so that each imported comment can have a foreign key pointing to its issue.
 
     $ github-to-sqlite issues github.db simonw/datasette
     $ github-to-sqlite issue-comments github.db simonw/datasette
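
Once pull requests have been loaded by any of the options above, the resulting tables can be queried like any ordinary SQLite database. A hypothetical example using the `sqlite-utils` Python library (which this tool builds on); the join reflects that `pull_requests.repo` is a foreign key to `repos.id`, but the query itself is purely illustrative:

    import sqlite_utils

    db = sqlite_utils.Database("github.db")
    # Hypothetical query: count open pull requests per repository.
    sql = """
        select repos.full_name, count(*) as open_prs
        from pull_requests
        join repos on pull_requests.repo = repos.id
        where pull_requests.state = 'open'
        group by repos.full_name
        order by open_prs desc
    """
    for row in db.query(sql):
        print(row["full_name"], row["open_prs"])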
@@ -101,7 +113,7 @@ Example: [issue_comments table](https://github-to-sqlite.dogsheep.net/github/iss
 
 ## Fetching commits for a repository
 
-The `commits` command retrieves details of all of the commits for one or more repositories. It currently fetches the sha, commit message and author and committer details - it does no retrieve the full commit body.
+The `commits` command retrieves details of all of the commits for one or more repositories. It currently fetches the SHA, commit message and author and committer details; it does not retrieve the full commit body.
 
     $ github-to-sqlite commits github.db simonw/datasette simonw/sqlite-utils
 
@@ -156,7 +168,7 @@ You can pass more than one username to fetch for multiple users or organizations
 
     $ github-to-sqlite repos github.db simonw dogsheep
 
-Add the `--readme` option to save the README for the repo in a column called `readme`. Add `--readme-html` to save the HTML rendered version of the README into a collumn called `readme_html`.
+Add the `--readme` option to save the README for the repo in a column called `readme`. Add `--readme-html` to save the HTML rendered version of the README into a column called `readme_html`.
 
 Example: [repos table](https://github-to-sqlite.dogsheep.net/github/repos)
 
@@ -216,7 +228,7 @@ You can fetch a list of every emoji supported by GitHub using the `emojis` comma
 
     $ github-to-sqlite emojis github.db
 
-This will create a table callad `emojis` with a primary key `name` and a `url` column.
+This will create a table called `emojis` with a primary key `name` and a `url` column.
 
 If you add the `--fetch` option the command will also fetch the binary content of the images and place them in an `image` column:
 
@@ -235,7 +247,7 @@ The `github-to-sqlite get` command provides a convenient shortcut for making aut
 
 This will make an authenticated call to the URL you provide and pretty-print the resulting JSON to the console.
 
-You can ommit the `https://api.github.com/` prefix, for example:
+You can omit the `https://api.github.com/` prefix, for example:
 
     $ github-to-sqlite get /gists
 
49 changes: 42 additions & 7 deletions github_to_sqlite/cli.py
@@ -1,5 +1,6 @@
 import click
 import datetime
+import itertools
 import pathlib
 import textwrap
 import os
@@ -104,19 +105,53 @@ def issues(db_path, repo, issue_ids, auth, load):
     type=click.Path(file_okay=True, dir_okay=False, allow_dash=True, exists=True),
     help="Load pull-requests JSON from this file instead of the API",
 )
-def pull_requests(db_path, repo, pull_request_ids, auth, load):
+@click.option(
+    "--org",
+    "orgs",
+    help="Fetch all pull requests from this GitHub organization",
+    multiple=True,
+)
+@click.option(
+    "--state",
+    help="Only fetch pull requests in this state",
+)
+@click.option(
+    "--search",
+    help="Find pull requests with a search query",
+)
+def pull_requests(db_path, repo, pull_request_ids, auth, load, orgs, state, search):
     "Save pull_requests for a specified repository, e.g. simonw/datasette"
     db = sqlite_utils.Database(db_path)
     token = load_token(auth)
-    repo_full = utils.fetch_repo(repo, token)
-    utils.save_repo(db, repo_full)
     if load:
+        repo_full = utils.fetch_repo(repo, token)
+        utils.save_repo(db, repo_full)
         pull_requests = json.load(open(load))
+        utils.save_pull_requests(db, pull_requests, repo_full)
+    elif search:
+        repos_seen = set()
+        search += " is:pr"
+        pull_requests = utils.fetch_searched_pulls_or_issues(search, token)
+        for pull_request in pull_requests:
+            pr_repo_url = pull_request["repository_url"]
+            if pr_repo_url not in repos_seen:
+                pr_repo = utils.fetch_repo(url=pr_repo_url)
+                utils.save_repo(db, pr_repo)
+                repos_seen.add(pr_repo_url)
+            utils.save_pull_requests(db, [pull_request], pr_repo)
     else:
-        pull_requests = utils.fetch_pull_requests(repo, token, pull_request_ids)
-
-    pull_requests = list(pull_requests)
-    utils.save_pull_requests(db, pull_requests, repo_full)
+        if orgs:
+            repos = itertools.chain.from_iterable(
+                utils.fetch_all_repos(token=token, org=org)
+                for org in orgs
+            )
+        else:
+            repos = [utils.fetch_repo(repo, token)]
+        for repo_full in repos:
+            utils.save_repo(db, repo_full)
+            repo = repo_full["full_name"]
+            pull_requests = utils.fetch_pull_requests(repo, state, token, pull_request_ids)
+            utils.save_pull_requests(db, pull_requests, repo_full)
     utils.ensure_db_shape(db)
 
 
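The `--org` branch above fans pull-request fetching out across organizations with `itertools.chain.from_iterable`, which lazily flattens one repo iterator per organization into a single stream. A minimal illustration of the idiom (the organization and repo names are hypothetical stand-ins for what `utils.fetch_all_repos` would yield):

    import itertools

    orgs = ["psf", "python"]  # hypothetical --org values
    # Each inner iterator stands in for utils.fetch_all_repos(token=token, org=org)
    repo_iterators = (iter([f"{org}/repo-a", f"{org}/repo-b"]) for org in orgs)
    for repo in itertools.chain.from_iterable(repo_iterators):
        print(repo)
    # -> psf/repo-a, psf/repo-b, python/repo-a, python/repo-b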
40 changes: 29 additions & 11 deletions github_to_sqlite/utils.py
@@ -2,6 +2,7 @@
 import requests
 import re
 import time
+import urllib.parse
 import yaml
 
 FTS_CONFIG = {
@@ -170,17 +171,21 @@ def save_pull_requests(db, pull_requests, repo):
         # Add repo key
         pull_request["repo"] = repo["id"]
         # Pull request _links can be flattened to just their URL
-        pull_request["url"] = pull_request["_links"]["html"]["href"]
-        pull_request.pop("_links")
+        if "_links" in pull_request:
+            pull_request["url"] = pull_request["_links"]["html"]["href"]
+            pull_request.pop("_links")
+        else:
+            pull_request["url"] = pull_request["pull_request"]["html_url"]
         # Extract user
         pull_request["user"] = save_user(db, pull_request["user"])
         labels = pull_request.pop("labels")
         # Extract merged_by, if it exists
         if pull_request.get("merged_by"):
             pull_request["merged_by"] = save_user(db, pull_request["merged_by"])
         # Head sha
-        pull_request["head"] = pull_request["head"]["sha"]
-        pull_request["base"] = pull_request["base"]["sha"]
+        if "head" in pull_request:
+            pull_request["head"] = pull_request["head"]["sha"]
+            pull_request["base"] = pull_request["base"]["sha"]
         # Extract milestone
         if pull_request["milestone"]:
             pull_request["milestone"] = save_milestone(
@@ -292,12 +297,13 @@ def save_issue_comment(db, comment):
     return last_pk
 
 
-def fetch_repo(full_name, token=None):
+def fetch_repo(full_name=None, token=None, url=None):
     headers = make_headers(token)
     # Get topics:
     headers["Accept"] = "application/vnd.github.mercy-preview+json"
-    owner, slug = full_name.split("/")
-    url = "https://api.github.com/repos/{}/{}".format(owner, slug)
+    if url is None:
+        owner, slug = full_name.split("/")
+        url = "https://api.github.com/repos/{}/{}".format(owner, slug)
     response = requests.get(url, headers=headers)
     response.raise_for_status()
     return response.json()
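The new `url=` parameter lets callers fetch a repo straight from an API URL, as the search branch in cli.py does. An illustrative sketch of both call forms (the repo name is a hypothetical example, and a token would normally be passed):

    from github_to_sqlite import utils

    by_name = utils.fetch_repo("simonw/datasette", token=None)
    by_url = utils.fetch_repo(url="https://api.github.com/repos/simonw/datasette")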
@@ -358,7 +364,7 @@ def fetch_issues(repo, token=None, issue_ids=None):
         yield from issues
 
 
-def fetch_pull_requests(repo, token=None, pull_request_ids=None):
+def fetch_pull_requests(repo, state=None, token=None, pull_request_ids=None):
     headers = make_headers(token)
     headers["accept"] = "application/vnd.github.v3+json"
     if pull_request_ids:
@@ -370,11 +376,20 @@ def fetch_pull_requests(repo, token=None, pull_request_ids=None):
             response.raise_for_status()
             yield response.json()
     else:
-        url = "https://api.github.com/repos/{}/pulls?state=all&filter=all".format(repo)
+        state = state or "all"
+        url = f"https://api.github.com/repos/{repo}/pulls?state={state}"
         for pull_requests in paginate(url, headers):
             yield from pull_requests
 
 
+def fetch_searched_pulls_or_issues(query, token=None):
+    headers = make_headers(token)
+    url = "https://api.github.com/search/issues?"
+    url += urllib.parse.urlencode({"q": query})
+    for pulls_or_issues in paginate(url, headers):
+        yield from pulls_or_issues["items"]
+
+
 def fetch_issue_comments(repo, token=None, issue=None):
     assert "/" in repo
     headers = make_headers(token)
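A minimal sketch of driving the new search helper directly (illustrative, not part of the diff; assumes a valid personal access token, and results stop at the 1000-item search cap noted in the README):

    from github_to_sqlite import utils

    token = "..."  # hypothetical personal access token
    for pr in utils.fetch_searched_pulls_or_issues("repo:simonw/datasette is:pr", token):
        print(pr["html_url"])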
@@ -445,13 +460,15 @@ def fetch_stargazers(repo, token=None):
         yield from stargazers
 
 
-def fetch_all_repos(username=None, token=None):
-    assert username or token, "Must provide username= or token= or both"
+def fetch_all_repos(username=None, token=None, org=None):
+    assert username or token or org, "Must provide username= or token= or org= or a combination"
     headers = make_headers(token)
     # Get topics for each repo:
     headers["Accept"] = "application/vnd.github.mercy-preview+json"
     if username:
         url = "https://api.github.com/users/{}/repos".format(username)
+    elif org:
+        url = "https://api.github.com/orgs/{}/repos".format(org)
     else:
         url = "https://api.github.com/user/repos"
     for repos in paginate(url, headers):
@@ -469,6 +486,7 @@ def fetch_user(username=None, token=None):
 
 
 def paginate(url, headers=None):
+    url += ("&" if "?" in url else "?") + "per_page=100"
     while url:
         response = requests.get(url, headers=headers)
         # For HTTP 204 no-content this yields an empty list
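The one-line addition to `paginate` asks GitHub for 100 items per page (the API maximum) instead of the default 30, cutting the number of requests for large result sets — the change referenced by #79 in the commit message. The rest of the function is collapsed in this view; a minimal sketch of the usual shape of such a loop under GitHub's `Link`-header pagination (an assumption about the collapsed body, not a copy of it):

    import requests

    def paginate(url, headers=None):
        # Ask for the largest page GitHub allows (100 items).
        url += ("&" if "?" in url else "?") + "per_page=100"
        while url:
            response = requests.get(url, headers=headers)
            yield response.json()
            # requests parses the Link header into response.links;
            # no "next" link means this was the last page.
            url = response.links.get("next", {}).get("url")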
