[crawler] Capture project details #26

Open
themightychris opened this issue Sep 24, 2019 · 16 comments

@themightychris
Collaborator

themightychris commented Sep 24, 2019

The prototype crawler from #19 only captures a few basic project details. Code for Kenya/HURUmap provides a good example of a record with all fields filled in well, and with some natural changes to them already in the history.

What are some more details a v2 should capture? Any thoughts on how we should organize it? (TOML has great support for grouping things any number of levels deep)

I don't think we want to capture any details in the index that routinely change day-by-day in the life of a project (e.g. number of open issues, number of contributors), BUT maybe we do capture things like that as binary or tiered buckets (e.g. has-issue=true or contributors=5-10)
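
As a rough illustration of the tiered-bucket idea, here is a minimal Python sketch. The tier boundaries and the function names `contributors_bucket` and `has_issues` are assumptions made up for this example, not anything the crawler defines.

```python
def contributors_bucket(count: int) -> str:
    """Map an exact contributor count to a coarse tier label."""
    if count < 0:
        raise ValueError("contributor count cannot be negative")
    # Illustrative tier boundaries (assumed, not decided in this thread).
    tiers = [(0, 0, "0"), (1, 1, "1"), (2, 4, "2-4"), (5, 10, "5-10"), (11, 50, "11-50")]
    for low, high, label in tiers:
        if low <= count <= high:
            return label
    return "51+"


def has_issues(open_issue_count: int) -> bool:
    """Collapse a daily-changing issue count into a stable yes/no flag."""
    return open_issue_count > 0
```

A project only produces a change in the index when it crosses a tier boundary, so day-to-day fluctuations stay out of the history.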

I think we should pull in the GitHub description and/or opening paragraph of the README directly, and then for other big wordy things record their presence, link to them, and maybe measure their health or summarize them if there's a valuable way to do so. (e.g. we can record which license is used and link to the license, we can record which of GitHub's standard community health files are present and link to them)
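
Recording the presence of community health files could look something like the sketch below. It assumes we already have a flat list of repository file paths; the list of file names and search directories reflects my understanding of GitHub's documented conventions and should be double-checked, and `find_health_files` is a hypothetical helper name.

```python
from pathlib import PurePosixPath

# Assumed set of community health file names (without extension).
HEALTH_FILES = ("CODE_OF_CONDUCT", "CONTRIBUTING", "SECURITY", "SUPPORT", "GOVERNANCE", "FUNDING")
# Directories where such files are assumed to be recognized: root, .github, docs.
SEARCH_DIRS = ("", ".github", "docs")


def find_health_files(repo_paths):
    """Return {file name: first matching path, or None if absent}."""
    found = {name: None for name in HEALTH_FILES}
    for raw in repo_paths:
        path = PurePosixPath(raw)
        parent = "" if str(path.parent) == "." else str(path.parent)
        stem = path.stem.upper()
        if parent in SEARCH_DIRS and stem in found and found[stem] is None:
            found[stem] = raw
    return found
```

The resulting mapping gives both the presence flag (`is not None`) and a path that can be turned into a link.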

We should also record the presence of any civic.json or publiccode.yml file, and pull some or all of their contents into a normalized form.

@themightychris
Collaborator Author

I just found that publiccode.yml got formalized and submitted to the OSI to be incubated as a stewarded project. It's been through a ton more iteration than civic.json and has first-class support for describing that a project isBasedOn another project, something our network sorely needs that is noticeably missing from civic.json.

So it seems to me we should hew as closely as we can to publiccode.yml's schema and progressively fill out a massive database of project records, autofilled as best we can.

@nikolajbaer
Collaborator

Sorry, itchy trigger finger.

I have a couple of notes on this from the statusboard perspective. These are debatable, as they mainly help with reading through the index without a secondary "processing" task.

  1. It would be nice to have a "manifest" at the organization folder so we can see what organizations are there. This might also help with routing renamed groups.

  2. An explicit "name" in the .toml files for both orgs and projects would be nice, rather than the implicit folder / filename convention.

  3. Likewise a manifest in the brigade's projects folder, maybe even indicating which ones are active vs. archived, would be helpful in processing.
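
To make the manifest idea concrete, here is a small Python sketch that derives one from the folder layout. It assumes a hypothetical layout of `<org>/projects/<project>.toml` under the index root; the real layout may differ, and `build_manifest` is an invented name.

```python
from pathlib import Path


def build_manifest(index_root: Path) -> dict:
    """List every organization folder and the project files found inside it."""
    manifest = {}
    for org_dir in sorted(p for p in index_root.iterdir() if p.is_dir()):
        projects_dir = org_dir / "projects"  # assumed sub-folder name
        if projects_dir.is_dir():
            projects = sorted(f.stem for f in projects_dir.glob("*.toml"))
        else:
            projects = []
        manifest[org_dir.name] = projects
    return manifest
```

A manifest like this could be written out next to the data so consumers do not have to walk the tree themselves; an active/archived flag per project would be a natural extension.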

As to civic.json and publiccode.yml, I am looking forward to seeing that data in the .toml files!

@gregboyer
Contributor

Two questions I'm wondering about, and maybe they don't belong at this step but rather when presenting info:

  1. How do we know which brigades to include? (Maybe scrape everything but only display opt-in brigades?)
  2. Ditto, but more important for projects

My reasoning is that we need to have a threshold that allows us to present accurate and helpful information, but not add noise such as old or poorly documented projects

@gregboyer
Contributor

I do also want to add that I love this middle layer for data standardization and how it'll allow us to make more flexible decisions in the future regarding our sources of information without impacting anything downstream.

@themightychris
Collaborator Author

@gregboyer for which brigades to include, we're pulling from the already-curated dataset maintained for cfapi: https://github.com/codeforamerica/brigade-information

My thinking is that we scrape everything into the index repo so people can build different sorts of tools/analysis on top of it. We can then layer whatever quality/completion metrics we can into the data for easy filtering by tools, like those aspiring contributors might search through.

@tdooner
Contributor

tdooner commented Jan 9, 2020

I'd like to add to the conversation a desire to have some information about project activity, for purposes of more easily determining what a Brigade is actively working on.

For example, imagine being able to sort OpenOakland's projects by last commit timestamp (i.e. Github default sort).

When combined with other relevance signals, this gets pretty close to making a fully useful search engine for brigade projects.

> I don't think we want to capture any details in the index that routinely change day-by-day in the life of a project (e.g. number of open issues, number of contributors), BUT maybe we do capture things like that as binary or tiered buckets (e.g. has-issue=true or contributors=5-10)

I assume that this preference is a consequence of the fact that we're storing the entire revision history in Git and don't want it to be too big with daily changes to this stuff. I think we should give it a try, though, given how valuable this kind of thing could be. Perhaps if we change the Git commit pattern to be one commit per day instead of one commit per day per organization then the history will be a bit more scannable.

@themightychris
Collaborator Author

themightychris commented Jan 10, 2020

Git could handle the volume, but we want to distribute the dataset in a fan-out to many applications, and keeping track of granularity down to "time of last commit" is just needlessly noisy for all involved.

For almost all intents and purposes, I think it would be just as valuable if we tracked something like last_commit_within = week | month | quarter | year | 2years | 5years | 10years

That would provide a much cleaner signal over time about how active a project is. Think evolving classifications rather than time-series datapoints. We'll have the URL to the git repo in there too if anyone wants to go analyze the timestamps of every commit for a project
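
A minimal Python sketch of that classification follows. The bucket labels come from the list above, but the day counts behind each label are my own assumptions (e.g. a month as 30 days), as is returning `None` for anything older than ten years.

```python
from datetime import datetime, timedelta
from typing import Optional

# Labels from the proposal; the durations are assumed approximations.
BUCKETS = [
    ("week", timedelta(days=7)),
    ("month", timedelta(days=30)),
    ("quarter", timedelta(days=91)),
    ("year", timedelta(days=365)),
    ("2years", timedelta(days=730)),
    ("5years", timedelta(days=1825)),
    ("10years", timedelta(days=3650)),
]


def last_commit_within(last_commit: datetime, now: datetime) -> Optional[str]:
    """Return the smallest bucket containing the commit, or None if older than all."""
    age = now - last_commit
    for label, limit in BUCKETS:
        if age <= limit:
            return label
    return None
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to test and to re-run against historical snapshots.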

@nikolajbaer
Collaborator

nikolajbaer commented Jan 16, 2020

Both are great thoughts! I agree on the signal/noise concerns, but also the value of a metric of activity.

What if we checked the most recent commit on a trigger, and then only update the index data if it slips from one "bucket" to another, e.g. when it transitions between the "last_commit_within" buckets (although maybe start at "quarter" as that might reduce the "noise" a bit). The bottom rung could be "active" for anything that has commit activity within the last 3 months. The challenge will be getting the threshold right so we don't have a lot of projects flip/flopping between "quarter" and "active".
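
Roughly, the "only write when the bucket changes" idea could be sketched like this in Python. The record shape (a dict with a `last_commit_within` key) and both helper names are assumptions for illustration only; the finer bucket labels are taken from the earlier comment.

```python
# Finer buckets that the proposal would fold into a single "active" rung.
ACTIVE_BUCKETS = {"week", "month", "quarter"}


def coarse_bucket(bucket: str) -> str:
    """Collapse anything within roughly three months into 'active'."""
    return "active" if bucket in ACTIVE_BUCKETS else bucket


def apply_bucket(record: dict, new_bucket: str) -> bool:
    """Update the record only on a bucket transition; return True if it changed."""
    if record.get("last_commit_within") == new_bucket:
        return False  # same bucket, nothing to commit
    record["last_commit_within"] = new_bucket
    return True
```

This does not by itself prevent flip-flopping right at a boundary; it only ensures that checks which land in the same bucket produce no index change.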

Edit: realizing this approach was already somewhat mentioned in the original and follow-up comments, so sorry for the duplication, but 👍 for the tiered idea.

tdooner added a commit to tdooner/brigade-project-index that referenced this issue Sep 15, 2020
For multiple types of our users (CfA Staff, Brigade Leader, Project
Leaders), being able to tell which projects are still active is a
crucial aspect of the index.

In civictechindex#26 we discuss using a bucketed approach so as to not create
unnecessary noise by committing the timestamp for every update. This
commit implements a coarse timestamp: for projects updated within the
last week, month, year, or over a year ago.
tdooner added a commit that referenced this issue Sep 21, 2020
For multiple types of our users (CfA Staff, Brigade Leader, Project
Leaders), being able to tell which projects are still active is a
crucial aspect of the index.

In #26 we discuss using a bucketed approach so as to not create
unnecessary noise by committing the timestamp for every update. This
commit implements a coarse timestamp: for projects updated within the
last week, month, year, or over a year ago.
@giosce
Collaborator

giosce commented Dec 10, 2020

Can we retrieve the programming languages?

@giosce
Collaborator

giosce commented Sep 16, 2021

It could be interesting to save the last_commit on the default branch (but it is an API call per repo).
We could also save the number of GitHub open_issues and the language.

@themightychris
Collaborator Author

Right now, all we capture from GitHub is what's included in the GitHub Repo API object.

Once we start capturing content from the git repo, though, we'll be fetching the latest commit on the default branch and, yeah, could record details about it. That could present churn issues, though.

@themightychris
Collaborator Author

I wonder if we could get a list of committers from github without having to fetch complete history ourselves

@k3KAW8Pnf7mkmdSMPHz27
Contributor

List of committers or contributors? Are we getting https://docs.github.com/en/rest/reference/repos#list-repository-contributors ?

@giosce
Collaborator

giosce commented Oct 19, 2021

Let's decide what we would like to add.
Also, @themightychris, how do you get the current info from GH? Which API? I was assuming organization/repos but wonder how you get readme/description and topics.

I'd like the day of the last push on the default branch, the number of contributors, and the languages. Anything else?
And let's not bucket last push or open issues.

If we are making this call https://api.github.com/repos/codeforamerica/brigade-project-index
we should be getting this response

{
"id": 193432663,
"node_id": "MDEwOlJlcG9zaXRvcnkxOTM0MzI2NjM=",
"name": "brigade-project-index",
"full_name": "codeforamerica/brigade-project-index",
"private": false,
"owner": {
"login": "codeforamerica",
"id": 337792,
"node_id": "MDEyOk9yZ2FuaXphdGlvbjMzNzc5Mg==",
"avatar_url": "https://avatars.githubusercontent.com/u/337792?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/codeforamerica",
"html_url": "https://github.com/codeforamerica",
"followers_url": "https://api.github.com/users/codeforamerica/followers",
"following_url": "https://api.github.com/users/codeforamerica/following{/other_user}",
"gists_url": "https://api.github.com/users/codeforamerica/gists{/gist_id}",
"starred_url": "https://api.github.com/users/codeforamerica/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/codeforamerica/subscriptions",
"organizations_url": "https://api.github.com/users/codeforamerica/orgs",
"repos_url": "https://api.github.com/users/codeforamerica/repos",
"events_url": "https://api.github.com/users/codeforamerica/events{/privacy}",
"received_events_url": "https://api.github.com/users/codeforamerica/received_events",
"type": "Organization",
"site_admin": false
},
"html_url": "https://github.com/codeforamerica/brigade-project-index",
"description": "Brigade Project Index",
"fork": false,
"url": "https://api.github.com/repos/codeforamerica/brigade-project-index",
"forks_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/forks",
"keys_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/keys{/key_id}",
"collaborators_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/collaborators{/collaborator}",
"teams_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/teams",
"hooks_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/hooks",
"issue_events_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/issues/events{/number}",
"events_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/events",
"assignees_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/assignees{/user}",
"branches_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/branches{/branch}",
"tags_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/tags",
"blobs_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/git/blobs{/sha}",
"git_tags_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/git/tags{/sha}",
"git_refs_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/git/refs{/sha}",
"trees_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/git/trees{/sha}",
"statuses_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/statuses/{sha}",
"languages_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/languages",
"stargazers_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/stargazers",
"contributors_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/contributors",
"subscribers_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/subscribers",
"subscription_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/subscription",
"commits_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/commits{/sha}",
"git_commits_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/git/commits{/sha}",
"comments_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/comments{/number}",
"issue_comment_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/issues/comments{/number}",
"contents_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/contents/{+path}",
"compare_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/compare/{base}...{head}",
"merges_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/merges",
"archive_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/{archive_format}{/ref}",
"downloads_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/downloads",
"issues_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/issues{/number}",
"pulls_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/pulls{/number}",
"milestones_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/milestones{/number}",
"notifications_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/notifications{?since,all,participating}",
"labels_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/labels{/name}",
"releases_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/releases{/id}",
"deployments_url": "https://api.github.com/repos/codeforamerica/brigade-project-index/deployments",
"created_at": "2019-06-24T04:17:53Z",
"updated_at": "2021-10-14T23:30:37Z",
"pushed_at": "2021-10-19T19:14:37Z",
"git_url": "git://github.com/codeforamerica/brigade-project-index.git",
"ssh_url": "git@github.com:codeforamerica/brigade-project-index.git",
"clone_url": "https://github.com/codeforamerica/brigade-project-index.git",
"svn_url": "https://github.com/codeforamerica/brigade-project-index",
"homepage": "https://brigade.cloud/",
"size": 38142,
"stargazers_count": 10,
"watchers_count": 10,
"language": "JavaScript",
"has_issues": true,
"has_projects": true,
"has_downloads": true,
"has_wiki": true,
"has_pages": true,
"forks_count": 16,
"mirror_url": null,
"archived": false,
"disabled": false,
"open_issues_count": 13,
"license": null,
"allow_forking": true,
"is_template": false,
"topics": [
"civic-tech",
"civic-tech-projects-catalog",
"civic-tech-projects-index",
"civic-tech-projects-map",
"civic-tech-projects-volunteers",
"code-for-america"
],
"visibility": "public",
"forks": 16,
"open_issues": 13,
"watchers": 10,
"default_branch": "master",
"temp_clone_token": null,
"organization": {
"login": "codeforamerica",
"id": 337792,
"node_id": "MDEyOk9yZ2FuaXphdGlvbjMzNzc5Mg==",
"avatar_url": "https://avatars.githubusercontent.com/u/337792?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/codeforamerica",
"html_url": "https://github.com/codeforamerica",
"followers_url": "https://api.github.com/users/codeforamerica/followers",
"following_url": "https://api.github.com/users/codeforamerica/following{/other_user}",
"gists_url": "https://api.github.com/users/codeforamerica/gists{/gist_id}",
"starred_url": "https://api.github.com/users/codeforamerica/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/codeforamerica/subscriptions",
"organizations_url": "https://api.github.com/users/codeforamerica/orgs",
"repos_url": "https://api.github.com/users/codeforamerica/repos",
"events_url": "https://api.github.com/users/codeforamerica/events{/privacy}",
"received_events_url": "https://api.github.com/users/codeforamerica/received_events",
"type": "Organization",
"site_admin": false
},
"network_count": 16,
"subscribers_count": 13
}

For more details we need additional calls per project:
https://api.github.com/repos/codeforamerica/brigade-project-index/languages
https://api.github.com/repos/codeforamerica/brigade-project-index/contributors
https://api.github.com/repos/codeforamerica/brigade-project-index/branches/{default_branch returned by first call}
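
The follow-up URLs can be derived from the first response without hard-coding them. Below is a small Python sketch that takes the already-parsed repo object (as in the JSON above) and returns the three URLs; `followup_urls` is a hypothetical helper name, and no network requests are made here.

```python
def followup_urls(repo: dict) -> dict:
    """Build the per-repo detail URLs from fields of the Repo API object."""
    # "branches_url" is a URI template ending in "{/branch}".
    branch_url = repo["branches_url"].replace("{/branch}", "/" + repo["default_branch"])
    return {
        "languages": repo["languages_url"],
        "contributors": repo["contributors_url"],
        "default_branch": branch_url,
    }
```

Each of these is one extra request per repository, which is worth keeping in mind for API rate limits when crawling many projects.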

@themightychris
Collaborator Author

Shared by @ExperimentsInHonesty, this page provides great practical examples for projects to publish community health files, which we should capture: https://100automations.org/guides/community-support-for-automations.html

@themightychris
Collaborator Author

I think this issue can be closed by splitting off a few more specific tickets:

  • enable crawler to load content from inside Git repository default branch [Extract content from git repository #62]
  • capture any present publiccode.yml into the snapshot [Read PublicCode.yml #55]
  • capture any present civic.json into the snapshot
  • capture any present README.md into the snapshot (ideally using some sort of library to parse the Markdown document into structured object reflecting the document's TOC so we have visibility into what headers it contains and its structure)
  • capture all the standard community health Markdown files (and ideally parse them by TOC too)
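
For the TOC-style parsing mentioned in the last two items, a real Markdown library would be the better choice; as a rough stand-in, here is a dependency-free Python sketch. It only recognizes ATX headings (`#` through `######`), skips fenced code blocks, and ignores setext-style headings; `markdown_toc` is an invented name.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def markdown_toc(text: str) -> list:
    """Return a flat list of {'level', 'title'} dicts for each ATX heading."""
    toc = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if in_fence:
            continue  # '#' lines inside code blocks are not headings
        match = HEADING.match(line)
        if match:
            toc.append({"level": len(match.group(1)), "title": match.group(2)})
    return toc
```

The flat list keeps heading levels, so a nested structure (or a simple "does this README have an Installation section?" check) can be built on top of it.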


7 participants