[crawler] Capture project details #26
I just found that So it seems to me we should hew as closely as we can to
Sorry, itchy trigger finger. I have a couple of notes on this from the statusboard perspective. These are debatable, as they are more helpful for reading through it without a secondary "processing" task.
As for civic.json and publiccode.yaml, I am looking forward to seeing that data in the .toml files!
Two questions I'm wondering about, which maybe don't belong at this step but rather when presenting the info:
My reasoning is that we need a threshold that allows us to present accurate and helpful information without adding noise such as old or poorly documented projects.
I also want to add that I love this middle layer for data standardization and how it'll allow us to make more flexible decisions in the future about our sources of information without impacting anything downstream.
@gregboyer For which brigades to include, we're pulling from the already-curated dataset maintained for cfapi: https://github.com/codeforamerica/brigade-information. My thinking is that we scrape everything into the index repo so people can build different sorts of tools/analysis on top of it, plus whatever quality/completion metrics we can layer into the data for easy filtering by tools like what aspiring contributors might search through.
I'd like to add to the conversation a desire to have some information about project activity, for purposes of more easily determining what a Brigade is actively working on. For example, imagine being able to sort OpenOakland's projects by last commit timestamp (i.e. the GitHub default sort). When combined with other relevance signals, this gets pretty close to making a fully useful search engine for brigade projects.
I assume that this preference is a consequence of the fact that we're storing the entire revision history in Git and don't want it to be too big with daily changes to this stuff. I think we should give it a try, though, given how valuable this kind of thing could be. Perhaps if we change the Git commit pattern to be one commit per day instead of one commit per day per organization, then the history will be a bit more scannable.
Git could handle the volume, but we want to distribute the dataset in a fan-out to many applications, and keeping track of granularity down to "time of last commit" is just needlessly noisy for all involved. For almost all intents and purposes, I think it would be just as valuable if we tracked something like That would provide a much cleaner signal over time about how active a project is. Think evolving classifications rather than time-series datapoints. We'll have the URL to the git repo in there too if anyone wants to go analyze the timestamps of every commit for a project.
Both are great thoughts! I agree on the signal/noise concerns, but also on the value of a metric of activity. What if we checked the most recent commit on a trigger, and then only updated the index data when it slips from one "bucket" to another, i.e. when it transitions between the "last_commit_within" buckets (although maybe start at "quarter", as that might reduce the noise a bit)? The bottom rung could be "active" for anything that has commit activity within the last 3 months. The challenge will be getting the threshold right so we don't have a lot of projects flip-flopping between "quarter" and "active". Edit: realizing this approach was already somewhat mentioned in the original and follow-up comments, so sorry for the duplication, but 👍 for the tiered idea.
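The tiered idea above can be sketched in a few lines. This is a minimal, hypothetical helper (the function name and bucket labels are illustrative, not the project's actual schema): it classifies a last-commit timestamp into a coarse bucket, so the index only changes when a project crosses a threshold rather than on every commit.

```python
from datetime import datetime, timedelta, timezone

def last_commit_bucket(last_commit, now=None):
    """Map a last-commit timestamp to a coarse activity bucket.

    Illustrative labels only; the point is that the stored value changes
    rarely (when a project crosses a threshold), not on every commit.
    """
    now = now or datetime.now(timezone.utc)
    age = now - last_commit
    if age <= timedelta(days=7):
        return "week"
    if age <= timedelta(days=30):
        return "month"
    if age <= timedelta(days=90):
        return "quarter"
    if age <= timedelta(days=365):
        return "year"
    return "over_a_year"
```

The flip-flopping concern would be addressed on top of this, e.g. by only moving a project down a bucket after it has been past the threshold for some grace period.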
For multiple types of our users (CfA Staff, Brigade Leader, Project Leaders), being able to tell which projects are still active is a crucial aspect of the index. In civictechindex#26 we discuss using a bucketed approach so as to not create unnecessary noise by committing the timestamp for every update. This commit implements a coarse timestamp: for projects updated within the last week, month, year, or over a year ago.
Can we retrieve the programming languages?
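For what it's worth, the GitHub REST API does expose per-repo language byte counts at `GET /repos/{owner}/{repo}/languages` (one extra call per repo). A minimal sketch, with a pure helper over that payload; the function names here are hypothetical and error handling/auth are omitted:

```python
import json
from urllib.request import urlopen

def fetch_languages(owner, repo):
    # GET /repos/{owner}/{repo}/languages returns a mapping of
    # language name -> bytes of code detected by GitHub.
    url = f"https://api.github.com/repos/{owner}/{repo}/languages"
    with urlopen(url) as resp:
        return json.load(resp)

def top_language(languages):
    # Dominant language by byte count, or None for an empty repo.
    if not languages:
        return None
    return max(languages, key=languages.get)
```

The full byte counts would churn constantly, but the dominant language (or an ordered list of languages) would be a fairly stable value to store in the index.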
It could be interesting to save the last_commit on the default branch (but that's an API call per repo).
Right now all we capture from GitHub is what's included in the GitHub Repo API object. Once we start capturing content from the git repo, though, we'll be fetching the latest commit on the default branch, and yeah, we could record details about it. That could present churn issues though.
I wonder if we could get a list of committers from GitHub without having to fetch complete history ourselves.
List of committers or contributors? Are we getting https://docs.github.com/en/rest/reference/repos#list-repository-contributors ?
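The list-repository-contributors endpoint linked above does give us this without walking commit history ourselves: it returns one object per contributor with a login and an aggregate contributions count, and supports an `anon=1` query parameter to include anonymous (email-only) contributors. A rough sketch, assuming unauthenticated access and no error handling:

```python
import json
from urllib.request import urlopen

def list_contributors(owner, repo, page=1):
    # GET /repos/{owner}/{repo}/contributors, up to 100 entries per page.
    url = (f"https://api.github.com/repos/{owner}/{repo}/contributors"
           f"?per_page=100&anon=1&page={page}")
    with urlopen(url) as resp:
        return json.load(resp)

def contributor_logins(payload):
    # Anonymous entries have no "login"; keep only real user accounts.
    return [c["login"] for c in payload if c.get("type") == "User"]
```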
Shared by @ExperimentsInHonesty, this page provides great practical examples of projects publishing community health files, which we should capture: https://100automations.org/guides/community-support-for-automations.html
I think this issue can be closed by splitting off a few more specific tickets:
The prototype crawler from #19 only captures a few initial basic project details. Code for Kenya/HURUmap provides a good example of a record with all fields filled in well, and some natural changes to them in the history already. What are some more details a v2 should capture? Any thoughts on how we should organize it? (TOML has great support for grouping things any number of levels deep.)
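As a starting point for discussion, a record with nested groups might look something like this. Every field name here is purely illustrative, not a proposed schema:

```toml
# Hypothetical v2 project record; all keys are examples for discussion.
name = "example-project"
code_url = "https://github.com/example-org/example-project"
description = "Short description pulled from the GitHub repo."
last_commit_within = "month"

[license]
spdx = "MIT"
url = "https://github.com/example-org/example-project/blob/main/LICENSE"

[community_health]
code_of_conduct = true
contributing = true
readme = true
```

Grouping like `[license]` and `[community_health]` keeps the top level scannable while leaving room to add fields within each group later.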
I don't think we want to capture any details in the index that routinely change day-by-day in the life of a project (e.g. number of open issues, number of contributors), but maybe we do capture things like that as binary or tiered buckets (e.g. has-issue=true or contributors=5-10).
I think we should pull in the GitHub description and/or opening paragraph of the README directly, and then for other big wordy things record their presence, link to them, and maybe measure their health or summarize them if there's a valuable way to do so. (e.g. we can record which license is used and link to the license, we can record which of GitHub's standard community health files are present and link to them)
We should also record the presence of any civic.json or publiccode.yaml file, and pull in some or all of their contents into a normalized form.