More options for pull-requests: --state, --org, and --search (#80)
* always ask for 100 items when paginating (helps #79)
* fix typos in README.md
* ignore test and build artifacts
* --org and --state options for pull-requests
* --search for pull-requests, but it can only get 1000 results (a GitHub API limit)
nedbat authored Dec 10, 2023
1 parent 56f2aee commit a0a711b
Showing 4 changed files with 90 additions and 24 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -8,4 +8,5 @@ venv
 .eggs
 .pytest_cache
 *.egg-info
-
+.coverage
+build/
22 changes: 17 additions & 5 deletions README.md
@@ -82,13 +82,25 @@ You can use the `--pull-request` option one or more times to load specific pull
 
 Note that the `merged_by` column on the `pull_requests` table will only be populated for pull requests that are loaded using the `--pull-request` option - the GitHub API does not return this field for pull requests that are loaded in bulk.
 
+You can load only pull requests in a certain state with the `--state` option:
+
+    $ github-to-sqlite pull-requests --state=open github.db simonw/datasette
+
+Pull requests across an entire organization (or more than one) can be loaded with `--org`:
+
+    $ github-to-sqlite pull-requests --state=open --org=psf --org=python github.db
+
+You can use a search query to find pull requests. Note that no more than 1000 will be loaded (this is a GitHub API limitation), and some data will be missing (base and head SHAs). When using searches, other filters are ignored; put all criteria into the search itself:
+
+    $ github-to-sqlite pull-requests --search='org:python defaultdict state:closed created:<2023-09-01' github.db
+
 Example: [pull_requests table](https://github-to-sqlite.dogsheep.net/github/pull_requests)
 
 ## Fetching issue comments for a repository
 
 The `issue-comments` command retrieves all of the comments on all of the issues in a repository.
 
-It is recommended you run `issues` first, so that each imported comment can have a foreign key poining to its issue.
+It is recommended you run `issues` first, so that each imported comment can have a foreign key pointing to its issue.
 
     $ github-to-sqlite issues github.db simonw/datasette
     $ github-to-sqlite issue-comments github.db simonw/datasette
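
Once pull requests have been loaded by any of the options above, the resulting tables can be queried like any ordinary SQLite database. A hypothetical example using the `sqlite-utils` Python library (which this tool builds on); the join reflects that `pull_requests.repo` is a foreign key to `repos.id`, but the query itself is purely illustrative:

    import sqlite_utils

    db = sqlite_utils.Database("github.db")
    # Hypothetical query: count open pull requests per repository.
    sql = """
        select repos.full_name, count(*) as open_prs
        from pull_requests
        join repos on pull_requests.repo = repos.id
        where pull_requests.state = 'open'
        group by repos.full_name
        order by open_prs desc
    """
    for row in db.query(sql):
        print(row["full_name"], row["open_prs"])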
@@ -101,7 +113,7 @@ Example: [issue_comments table](https://github-to-sqlite.dogsheep.net/github/iss
 
 ## Fetching commits for a repository
 
-The `commits` command retrieves details of all of the commits for one or more repositories. It currently fetches the sha, commit message and author and committer details - it does no retrieve the full commit body.
+The `commits` command retrieves details of all of the commits for one or more repositories. It currently fetches the SHA, commit message and author and committer details; it does not retrieve the full commit body.
 
     $ github-to-sqlite commits github.db simonw/datasette simonw/sqlite-utils
 
@@ -156,7 +168,7 @@ You can pass more than one username to fetch for multiple users or organizations
 
     $ github-to-sqlite repos github.db simonw dogsheep
 
-Add the `--readme` option to save the README for the repo in a column called `readme`. Add `--readme-html` to save the HTML rendered version of the README into a collumn called `readme_html`.
+Add the `--readme` option to save the README for the repo in a column called `readme`. Add `--readme-html` to save the HTML rendered version of the README into a column called `readme_html`.
 
 Example: [repos table](https://github-to-sqlite.dogsheep.net/github/repos)
 
@@ -216,7 +228,7 @@ You can fetch a list of every emoji supported by GitHub using the `emojis` comma
 
     $ github-to-sqlite emojis github.db
 
-This will create a table callad `emojis` with a primary key `name` and a `url` column.
+This will create a table called `emojis` with a primary key `name` and a `url` column.
 
 If you add the `--fetch` option the command will also fetch the binary content of the images and place them in an `image` column:
 
@@ -235,7 +247,7 @@ The `github-to-sqlite get` command provides a convenient shortcut for making aut
 
 This will make an authenticated call to the URL you provide and pretty-print the resulting JSON to the console.
 
-You can ommit the `https://api.github.com/` prefix, for example:
+You can omit the `https://api.github.com/` prefix, for example:
 
     $ github-to-sqlite get /gists
 
49 changes: 42 additions & 7 deletions github_to_sqlite/cli.py
@@ -1,5 +1,6 @@
 import click
 import datetime
+import itertools
 import pathlib
 import textwrap
 import os
@@ -104,19 +105,53 @@ def issues(db_path, repo, issue_ids, auth, load):
     type=click.Path(file_okay=True, dir_okay=False, allow_dash=True, exists=True),
     help="Load pull-requests JSON from this file instead of the API",
 )
-def pull_requests(db_path, repo, pull_request_ids, auth, load):
+@click.option(
+    "--org",
+    "orgs",
+    help="Fetch all pull requests from this GitHub organization",
+    multiple=True,
+)
+@click.option(
+    "--state",
+    help="Only fetch pull requests in this state",
+)
+@click.option(
+    "--search",
+    help="Find pull requests with a search query",
+)
+def pull_requests(db_path, repo, pull_request_ids, auth, load, orgs, state, search):
     "Save pull_requests for a specified repository, e.g. simonw/datasette"
     db = sqlite_utils.Database(db_path)
     token = load_token(auth)
-    repo_full = utils.fetch_repo(repo, token)
-    utils.save_repo(db, repo_full)
     if load:
+        repo_full = utils.fetch_repo(repo, token)
+        utils.save_repo(db, repo_full)
         pull_requests = json.load(open(load))
+        utils.save_pull_requests(db, pull_requests, repo_full)
+    elif search:
+        repos_seen = set()
+        search += " is:pr"
+        pull_requests = utils.fetch_searched_pulls_or_issues(search, token)
+        for pull_request in pull_requests:
+            pr_repo_url = pull_request["repository_url"]
+            if pr_repo_url not in repos_seen:
+                pr_repo = utils.fetch_repo(url=pr_repo_url)
+                utils.save_repo(db, pr_repo)
+                repos_seen.add(pr_repo_url)
+            utils.save_pull_requests(db, [pull_request], pr_repo)
     else:
-        pull_requests = utils.fetch_pull_requests(repo, token, pull_request_ids)
-
-    pull_requests = list(pull_requests)
-    utils.save_pull_requests(db, pull_requests, repo_full)
+        if orgs:
+            repos = itertools.chain.from_iterable(
+                utils.fetch_all_repos(token=token, org=org)
+                for org in orgs
+            )
+        else:
+            repos = [utils.fetch_repo(repo, token)]
+        for repo_full in repos:
+            utils.save_repo(db, repo_full)
+            repo = repo_full["full_name"]
+            pull_requests = utils.fetch_pull_requests(repo, state, token, pull_request_ids)
+            utils.save_pull_requests(db, pull_requests, repo_full)
     utils.ensure_db_shape(db)
 
 
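The `--org` branch above fans pull-request fetching out across organizations with `itertools.chain.from_iterable`, which lazily flattens one repo iterator per organization into a single stream. A minimal illustration of the idiom (the organization and repo names are hypothetical stand-ins for what `utils.fetch_all_repos` would yield):

    import itertools

    orgs = ["psf", "python"]  # hypothetical --org values
    # Each inner iterator stands in for utils.fetch_all_repos(token=token, org=org)
    repo_iterators = (iter([f"{org}/repo-a", f"{org}/repo-b"]) for org in orgs)
    for repo in itertools.chain.from_iterable(repo_iterators):
        print(repo)
    # -> psf/repo-a, psf/repo-b, python/repo-a, python/repo-b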
40 changes: 29 additions & 11 deletions github_to_sqlite/utils.py
@@ -2,6 +2,7 @@
 import requests
 import re
 import time
+import urllib.parse
 import yaml
 
 FTS_CONFIG = {
@@ -170,17 +171,21 @@ def save_pull_requests(db, pull_requests, repo):
         # Add repo key
         pull_request["repo"] = repo["id"]
         # Pull request _links can be flattened to just their URL
-        pull_request["url"] = pull_request["_links"]["html"]["href"]
-        pull_request.pop("_links")
+        if "_links" in pull_request:
+            pull_request["url"] = pull_request["_links"]["html"]["href"]
+            pull_request.pop("_links")
+        else:
+            pull_request["url"] = pull_request["pull_request"]["html_url"]
         # Extract user
         pull_request["user"] = save_user(db, pull_request["user"])
         labels = pull_request.pop("labels")
         # Extract merged_by, if it exists
         if pull_request.get("merged_by"):
             pull_request["merged_by"] = save_user(db, pull_request["merged_by"])
         # Head sha
-        pull_request["head"] = pull_request["head"]["sha"]
-        pull_request["base"] = pull_request["base"]["sha"]
+        if "head" in pull_request:
+            pull_request["head"] = pull_request["head"]["sha"]
+            pull_request["base"] = pull_request["base"]["sha"]
         # Extract milestone
         if pull_request["milestone"]:
             pull_request["milestone"] = save_milestone(
@@ -292,12 +297,13 @@ def save_issue_comment(db, comment):
     return last_pk
 
 
-def fetch_repo(full_name, token=None):
+def fetch_repo(full_name=None, token=None, url=None):
     headers = make_headers(token)
     # Get topics:
     headers["Accept"] = "application/vnd.github.mercy-preview+json"
-    owner, slug = full_name.split("/")
-    url = "https://api.github.com/repos/{}/{}".format(owner, slug)
+    if url is None:
+        owner, slug = full_name.split("/")
+        url = "https://api.github.com/repos/{}/{}".format(owner, slug)
     response = requests.get(url, headers=headers)
     response.raise_for_status()
     return response.json()
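The new `url=` parameter lets callers fetch a repo straight from an API URL, as the search branch in cli.py does. An illustrative sketch of both call forms (the repo name is a hypothetical example, and a token would normally be passed):

    from github_to_sqlite import utils

    by_name = utils.fetch_repo("simonw/datasette", token=None)
    by_url = utils.fetch_repo(url="https://api.github.com/repos/simonw/datasette")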
@@ -358,7 +364,7 @@ def fetch_issues(repo, token=None, issue_ids=None):
         yield from issues
 
 
-def fetch_pull_requests(repo, token=None, pull_request_ids=None):
+def fetch_pull_requests(repo, state=None, token=None, pull_request_ids=None):
     headers = make_headers(token)
     headers["accept"] = "application/vnd.github.v3+json"
     if pull_request_ids:
@@ -370,11 +376,20 @@ def fetch_pull_requests(repo, token=None, pull_request_ids=None):
             response.raise_for_status()
             yield response.json()
     else:
-        url = "https://api.github.com/repos/{}/pulls?state=all&filter=all".format(repo)
+        state = state or "all"
+        url = f"https://api.github.com/repos/{repo}/pulls?state={state}"
         for pull_requests in paginate(url, headers):
             yield from pull_requests
 
 
+def fetch_searched_pulls_or_issues(query, token=None):
+    headers = make_headers(token)
+    url = "https://api.github.com/search/issues?"
+    url += urllib.parse.urlencode({"q": query})
+    for pulls_or_issues in paginate(url, headers):
+        yield from pulls_or_issues["items"]
+
+
 def fetch_issue_comments(repo, token=None, issue=None):
     assert "/" in repo
     headers = make_headers(token)
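A minimal sketch of driving the new search helper directly (illustrative, not part of the diff; assumes a valid personal access token, and results stop at the 1000-item search cap noted in the README):

    from github_to_sqlite import utils

    token = "..."  # hypothetical personal access token
    for pr in utils.fetch_searched_pulls_or_issues("repo:simonw/datasette is:pr", token):
        print(pr["html_url"])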
@@ -445,13 +460,15 @@ def fetch_stargazers(repo, token=None):
         yield from stargazers
 
 
-def fetch_all_repos(username=None, token=None):
-    assert username or token, "Must provide username= or token= or both"
+def fetch_all_repos(username=None, token=None, org=None):
+    assert username or token or org, "Must provide username= or token= or org= or a combination"
     headers = make_headers(token)
     # Get topics for each repo:
     headers["Accept"] = "application/vnd.github.mercy-preview+json"
     if username:
         url = "https://api.github.com/users/{}/repos".format(username)
+    elif org:
+        url = "https://api.github.com/orgs/{}/repos".format(org)
     else:
         url = "https://api.github.com/user/repos"
     for repos in paginate(url, headers):
@@ -469,6 +486,7 @@ def fetch_user(username=None, token=None):
 
 
 def paginate(url, headers=None):
+    url += ("&" if "?" in url else "?") + "per_page=100"
     while url:
         response = requests.get(url, headers=headers)
         # For HTTP 204 no-content this yields an empty list
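The one-line addition to `paginate` asks GitHub for 100 items per page (the API maximum) instead of the default 30, cutting the number of requests for large result sets — the change referenced by #79 in the commit message. The rest of the function is collapsed in this view; a minimal sketch of the usual shape of such a loop under GitHub's `Link`-header pagination (an assumption about the collapsed body, not a copy of it):

    import requests

    def paginate(url, headers=None):
        # Ask for the largest page GitHub allows (100 items).
        url += ("&" if "?" in url else "?") + "per_page=100"
        while url:
            response = requests.get(url, headers=headers)
            yield response.json()
            # requests parses the Link header into response.links;
            # no "next" link means this was the last page.
            url = response.links.get("next", {}).get("url")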
