Report for Google Summer of Code 21 Project @ Software Heritage
Project Details | |
---|---|
Initial Proposal | Advanced search features for the SWH Archive |
Repository | swh-search and swh-web |
Mentors | Valentin Lorentz and Vincent Sellier |
Contributions | swh-search and swh-web |
Documentation | Search query language syntax |
Duration | 3m 7d (17 May'21 - 23 Aug'21) |
Software Heritage is on a mission to collect, preserve, and share all publicly available software, together with its source code and development history. The archive periodically crawls sources such as GitHub, GitLab, Debian, and PyPI. It has preserved more than 11 billion unique source code files and 2.3 billion commits, spanning more than 163 million software projects.
The archive has a search feature to find repositories based on the repository URL or the metadata. This metadata includes the package name, version, description, license, etc. My GSoC project was all about improving the archive search, on both the back end and the front end.
I made the archive search more expressive by adding advanced search features: filters, sorting options, and a search query language (a custom DSL) with autocomplete (which was an optional goal).
Tasks completed:
- Introduced new fields in the search service (based on Elasticsearch), ingested data from other swh services through their RPC APIs or through the journal service (Kafka), and built filters/sorting features.
- Designed a grammar, built a parser and a translator (using TreeSitter) that traverses the AST to translate the DSL queries into Elasticsearch queries.
- Implemented autocomplete features for the query language in the Web UI. (almost completed, needs some improvements before merging)
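To make the first task more concrete, here is a minimal sketch of how visit-related fields such as `nb_visits`, `last_visit_date`, and `last_eventful_visit_date` could be derived from a stream of visit events. It is an illustration only: the `VisitStatus` class and the `summarize_visits` function are hypothetical names, and the sketch processes a whole batch at once, whereas the real service may well update the Elasticsearch documents incrementally.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class VisitStatus:
    # Hypothetical, simplified stand-in for a visit status message.
    date: datetime
    snapshot_id: Optional[str]  # None when the visit produced no snapshot


def summarize_visits(statuses):
    """Fold the visit statuses of one origin into its search fields."""
    doc = {
        "nb_visits": 0,
        "last_visit_date": None,
        "last_eventful_visit_date": None,
    }
    previous_snapshot = None
    # Process visits in chronological order, whatever order they arrived in.
    for status in sorted(statuses, key=lambda s: s.date):
        doc["nb_visits"] += 1
        doc["last_visit_date"] = status.date.isoformat()
        # A visit counts as "eventful" when its snapshot id differs from
        # the snapshot id seen on the previous eventful visit.
        if status.snapshot_id is not None and status.snapshot_id != previous_snapshot:
            doc["last_eventful_visit_date"] = status.date.isoformat()
            previous_snapshot = status.snapshot_id
    return doc


if __name__ == "__main__":
    utc = timezone.utc
    print(summarize_visits([
        VisitStatus(datetime(2021, 6, 1, tzinfo=utc), "snapshot-a"),
        VisitStatus(datetime(2021, 7, 1, tzinfo=utc), "snapshot-a"),
    ]))
```

In this example the second visit found the same snapshot as the first one, so it increases the visit count and moves the last visit date, but leaves the last eventful visit date unchanged.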
- Ingested the following fields into Elasticsearch from the journal service (Kafka) or from other swh services:
  - `nb_visits`: Number of visits of an origin (D5824)
  - `last_visit_date`: Last visit date (D5824)
  - `last_eventful_visit_date`: Last visit date when the snapshot id changed (i.e. change in the content of the repo) (D5878)
  - `last_revision_date`: Last revision (commit) date (D5883)
  - `last_release_date`: Last release date (based on tags) (D5883)
- Introduced `sort_by` and `limit` in origin search (D5918)
- Added support for searching `license`, `programmingLanguage`, and `keywords` (repo tags + README/description) from the repository metadata (D5949, D5963)
- Added support for searching `date_created`, `date_modified`, and `date_published` from the repository metadata (D5964)
- Implemented a search query language using TreeSitter. (D5990, D6005)
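The translation step can be sketched roughly as follows. This is not the actual swh-search code and does not use TreeSitter: the AST is modelled as plain nested tuples, the node shapes (`"and"`, `"or"`, `"cmp"`) and the functions `translate` and `build_search_body` are hypothetical, and the leading `-` for descending order in `sort_by` is an assumed convention. Only the output format, the Elasticsearch query DSL (`bool`, `term`, `range`, `sort`, `size`), is standard.

```python
def translate(node):
    """Translate a tiny filter AST into an Elasticsearch query clause."""
    kind = node[0]
    if kind in ("and", "or"):
        # Boolean nodes look like ("and", child, child, ...).
        occur = "must" if kind == "and" else "should"
        clause = {"bool": {occur: [translate(child) for child in node[1:]]}}
        if kind == "or":
            # Require at least one alternative to match.
            clause["bool"]["minimum_should_match"] = 1
        return clause
    if kind == "cmp":
        # Comparison nodes look like ("cmp", field, operator, value).
        _, field, op, value = node
        if op == "=":
            return {"term": {field: value}}
        range_ops = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}
        if op not in range_ops:
            raise ValueError(f"unknown operator: {op!r}")
        return {"range": {field: {range_ops[op]: value}}}
    raise ValueError(f"unknown node type: {kind!r}")


def build_search_body(ast, sort_by=(), limit=10):
    """Wrap the translated query with sorting and a result limit."""
    body = {"query": translate(ast), "size": limit}
    if sort_by:
        # Assumed convention: a leading "-" requests descending order.
        body["sort"] = [
            {name[1:]: {"order": "desc"}} if name.startswith("-")
            else {name: {"order": "asc"}}
            for name in sort_by
        ]
    return body


if __name__ == "__main__":
    ast = ("and",
           ("cmp", "nb_visits", ">", 5),
           ("cmp", "last_visit_date", ">=", "2021-01-01"))
    print(build_search_body(ast, sort_by=["-nb_visits"], limit=20))
```

Walking the tree recursively keeps each node type's translation local, so supporting a new filter would mostly mean adding one more branch.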
The search query language has parsers (compiled as .so and .wasm files) that serve two different purposes:
- Making the autocomplete suggestions dynamic using data from other swh services.
- Fixing the swh-indexers to mine repository data (currently some of the filters work on the basis of metadata which can be wrong or unavailable)
- Improving the search UI to support the search DSL (Advanced mode) as well as UI components (Basic mode). (Similar to Jira query language)
Most of my work around the query language will take some time before it is visible to the public. As of now, only the beta users and the team members have access (and it's searching across 170M+ repos!). Once it is made public, you can use it via the Software Heritage archive.