Skip to content

In this project, Adam and Kwame use web scraping and natural language processing to predict the programming language of GitHub repositories.

Notifications You must be signed in to change notification settings

NLP-404-Not-Found/nlp-project

Repository files navigation

Predicting the Language of GitHub Repositories by README using NLP

About the Project

Goals

Using natural language processing, web scraping, and classification, we aim to create a machine learning model to predict the primary programming language of a given repository on GitHub, based on the contents of its README.

Background

GitHub automatically shows the percentages of what coding languages are used in the files of a repository. In this project, we are seeking to label and predict on only the primary language of each repository. These languages include Java, Python, Javascript, Ruby, HTML, and C++.

Deliverables

  • A well-documented Jupyter Notebook that contains our analysis
  • A Google Slides presentation suitable for a general audience that summarizes our findings and includes visualizations

Data Dictionary

Feature Name Description Additional Info
repo The end URL to the project. Can append to https://github.com/ to get a repo's full URL. Consists of the user, slash the name of the repository. object
language The primary programming language of a repository, according to GitHub's auto-analysis. object
readme_contents Messy, uncleaned text from a repo's README file as a single string. object
stemmed Readme_contents with each word stemmed, i.e. dimensionality reduction such that 'call', 'called', and 'calling' are treated as the same word. Stems are not the same as root words and do not always appear in the dictionary. object
lemmatized Similar to stemmed, but reduces words to its root word, which will always appear in the dictionary. object
clean Lemmatized readme_contents with stopwords removed. object
stopwords_removed The number of stopwords that were removed from the clean text. int64
doc_length How long a repo's README is. int64
words The clean text in array form. object

Tools & Requirements

  • Python v3.85 (including packages WordCloud, NLTK, and Scikit-Learn)
  • GitHub's API

License & Reproduction

Anyone can reproduce this project. All we ask is that you credit us if you use our work as part of your own project.

  1. Clone this repo.
  2. Acquire the data:
    • a. Go here and generate a personal access token. You do not need select any scopes, i.e. leave all the checkboxes unchecked.
    • b. Save it in your env.py file under the variable github_token.
    • c. Add your github username to your env.py file under the variable github_username.
    • d. Add more repositories to the REPOS list below if you so choose.
  3. Add any extra stop-words in the prepare.py file.
  4. Run the code in the nlp_model Jupyter Notebook.

Creators

Kwame V. Taylor, Data Scientist
Adam Gomez, Data Scientist

About

In this project, Adam and Kwame use web scraping and natural language processing to predict the programming language of GitHub repositories.

Resources

Stars

Watchers

Forks