This project analyzes publicly available data from the GitHub API, focusing on users in Mozambique. We extract insights from user profiles, their repositories, and the repositories they have starred in order to identify development trends and explore the dynamics of Mozambique's developer community.
The analysis follows these steps:
- Data Collection: All data is obtained through the GitHub API. We use its endpoints to collect information about users, their repositories, and their interactions with other repositories. The analyses rely solely on publicly available data, whose availability and accessibility are subject to the platform's policies and limitations.
- Data Preprocessing: After collection, we clean and structure the data. This involves removing irrelevant data, handling null values, converting formats, and applying any other transformations needed to make the data suitable for analysis.
- Analysis: We explore trends in the collected data, including the entry of new users over time, trending technologies, and the popularity of specific programming languages.
- Interactive Visualizations: We create visualizations to facilitate data exploration and understanding, using libraries such as Matplotlib to build charts, plots, and other visual representations of the results.
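The collection and preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in the notebooks: it uses only the standard library (the notebooks may well use `requests` and `pandas` instead), and the helper names `build_search_url`, `fetch_user_page`, and `clean_users` are hypothetical. It assumes users are found through GitHub's user search endpoint with a `location:` qualifier and that profile records come from `GET /users/{login}`.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def build_search_url(location, page=1, per_page=100):
    """Return the user-search URL for one page of results."""
    query = urllib.parse.urlencode(
        {"q": f"location:{location}", "per_page": per_page, "page": page}
    )
    return f"{API_ROOT}/search/users?{query}"


def fetch_user_page(location, page=1, headers=None):
    """Download one page of search results (performs a network request)."""
    request_headers = {"Accept": "application/vnd.github+json"}
    request_headers.update(headers or {})
    request = urllib.request.Request(
        build_search_url(location, page), headers=request_headers
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def clean_users(records):
    """Drop records without an id, remove duplicates, and normalise locations.

    `records` is assumed to be a list of user-profile dicts such as those
    returned by GET /users/{login}.
    """
    seen = set()
    cleaned = []
    for record in records:
        user_id = record.get("id")
        if user_id is None or user_id in seen:
            continue
        seen.add(user_id)
        # Treat missing or whitespace-only locations as null values.
        location = (record.get("location") or "").strip() or None
        cleaned.append(
            {"id": user_id, "login": record.get("login"), "location": location}
        )
    return cleaned
```

Calling `fetch_user_page("Mozambique")` would return a dict whose `items` list holds up to 100 search hits. Note that GitHub's search API returns at most 1,000 results per query, so a complete collection may need to split the query into smaller ones.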
Some examples of obtained insights include:
- Distribution of users by province: Identifies which provinces have a significant presence of users on the platform.
- Popularity of programming languages in Mozambique over time: Shows trends and preferences regarding programming languages in the country.
- Entry of Mozambicans on GitHub over time: Displays the growth and adoption of GitHub by the developer community in Mozambique.
- Trending topics in Mozambique: Identifies trending topics within the developer community by analyzing repository topics. This helps us understand the community's areas of interest and focus.
- Percentage of users who starred national repositories: Lets us evaluate how much Mozambican developers engage with and support local projects and initiatives.
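As a rough illustration of how two of these insights could be computed, the sketch below counts repository languages per creation year and measures the share of users who starred at least one national repository. The function names and the input shapes are assumptions made for this example; only the `language` and `created_at` field names come from GitHub's repository objects.

```python
from collections import Counter, defaultdict


def language_counts_by_year(repos):
    """Count repositories per primary language, grouped by creation year.

    Each repo is assumed to be a dict with GitHub's `language` and
    `created_at` (ISO 8601, e.g. "2020-01-15T10:00:00Z") fields.
    """
    counts = defaultdict(Counter)
    for repo in repos:
        language = repo.get("language")
        created_at = repo.get("created_at")
        if not language or not created_at:
            continue  # skip repositories with no detected language or date
        year = int(created_at[:4])
        counts[year][language] += 1
    return {year: dict(counter) for year, counter in sorted(counts.items())}


def share_starring_national(starred_by_user, national_repo_ids):
    """Return the percentage of users who starred at least one national repo.

    `starred_by_user` maps a user login to the ids of repos they starred;
    `national_repo_ids` holds the ids of repos owned by users in the dataset.
    """
    if not starred_by_user:
        return 0.0
    national = set(national_repo_ids)
    supporters = sum(
        1 for starred in starred_by_user.values() if national.intersection(starred)
    )
    return 100.0 * supporters / len(starred_by_user)
```

The per-year dictionaries can be passed to a plotting library such as Matplotlib to draw one line per language.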
It is important to acknowledge the limitations of the project in order to interpret the results carefully and understand its constraints. The main limitations include:
- Data Limitations: The analyses are based on the data available in the GitHub API. Any limitations or restrictions imposed by the API, such as request limits or the availability of specific data, can affect the scope and accuracy of the analyses. The quality and consistency of the data also depend on how accurate and up to date the information provided by GitHub users is.
- Public Access Restrictions: Not all repositories are publicly available, so the analysis covers only accessible public data. Private repositories are excluded.
- Assumptions and Generalizations: The analysis may rely on assumptions and generalizations drawn from the available data. These may not apply to all contexts or fully reflect the complexity and diversity of projects and contributions on GitHub.
- Analysis Biases: Data analysis is subject to inherent biases, such as selection or sampling biases. For example, the choice of specific repositories or contributors for analysis may bias the interpretation of the results. It is important to be aware of these biases and interpret the results with caution.
- Location: The analyses are restricted to users from Mozambique who have registered their location on GitHub. Other Mozambican users may have outdated or missing location information, which affects how representative the results are.
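Two of these limitations, request limits and free-text locations, can be made concrete with a small sketch. The helpers below are hypothetical and not taken from the notebooks. The first reads GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers to decide how long to pause; the second shows why location matching is lossy, since a profile that says only "Maputo" never mentions the country.

```python
import time
import unicodedata


def seconds_until_reset(headers, now=None):
    """Return how long to wait before the next request, in seconds.

    `headers` is a mapping of response headers. Header names are compared
    case-insensitively. X-RateLimit-Reset is a UTC epoch timestamp.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(lowered.get("x-ratelimit-reset", 0))
    current = time.time() if now is None else now
    return max(0.0, float(reset_at - current))


def mentions_mozambique(location):
    """Return True if a free-text location names the country explicitly.

    Accents are stripped so that "Moçambique" matches as well. This check
    is an assumption for illustration, not the project's matching rule.
    """
    if not location:
        return False
    decomposed = unicodedata.normalize("NFKD", location)
    plain = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    return "mozambique" in plain or "mocambique" in plain
```

Unauthenticated requests to the REST API are limited to 60 per hour, while authenticated requests get 5,000 per hour, so collecting without a token is possible but slow.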
Before running the project, make sure you meet the following prerequisites:
- Python: Make sure you have Python installed on your machine. You can download the latest version of Python at https://www.python.org/downloads/.
- GitHub API Access Token (optional): You can create a personal access token to authenticate requests to the GitHub API. Follow the instructions at https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token to generate one.
- Python Libraries: It is recommended to have the following Python libraries installed:
  - pandas
  - numpy
  - matplotlib
  - csv (part of the Python standard library; no installation needed)
  - requests
  - python-dotenv (imported as `dotenv`)
  - wordcloud
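A minimal sketch of how the optional token might be loaded and attached to requests is shown below. The environment variable name `GITHUB_TOKEN` and the helper names are assumptions for this example; check the notebooks for the name they actually read. The import is guarded so the sketch still runs when `python-dotenv` is not installed.

```python
import os

try:
    # Provided by the python-dotenv package; optional in this sketch.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


def load_token(env_var="GITHUB_TOKEN"):
    """Return the token from the environment or a local .env file, if any.

    GITHUB_TOKEN is a hypothetical variable name chosen for this example.
    """
    if load_dotenv is not None:
        load_dotenv()  # reads key=value pairs from a .env file, if present
    return os.environ.get(env_var) or None


def github_headers(token=None):
    """Build request headers, adding authentication only when a token exists."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

With a `.env` file containing a line such as `GITHUB_TOKEN=<your token>`, `github_headers(load_token())` yields headers that can be passed to `requests.get(url, headers=...)`. Keep the `.env` file out of version control.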
| Step | Description | Notebook Path |
|---|---|---|
| 1 | Data Collection: Get Users IDs | data_collect/get_users_ids.ipynb |
| 2 | Cleaning and Structuring: Structure IDs | cleaning_and_structuring/structure_ids.ipynb |
| 3 | Data Collection: Get Users Data | data_collect/get_users_data.ipynb |
| 4 | Data Collection: Get Repos Data | data_collect/get_repos_data.ipynb |
| 5 | Data Collection: Get Starred Data | data_collect/get_starred_data.ipynb |
| 6 | Visualization: Users Insights | visualization/users_insights.ipynb |
| 7 | Visualization: Repos Insights | visualization/repos_insights.ipynb |
| 8 | Visualization: Starred Insights | visualization/starred_insights.ipynb |
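If the notebooks are meant to be run in the order listed, the sequence could be automated with a short script like the one below. This is only a sketch: it assumes Jupyter and `nbconvert` are installed (they are not in the prerequisites list) and that the script is started from the repository root. The helper names are hypothetical.

```python
import subprocess

# Notebook paths in the order given in the table above.
NOTEBOOKS = [
    "data_collect/get_users_ids.ipynb",
    "cleaning_and_structuring/structure_ids.ipynb",
    "data_collect/get_users_data.ipynb",
    "data_collect/get_repos_data.ipynb",
    "data_collect/get_starred_data.ipynb",
    "visualization/users_insights.ipynb",
    "visualization/repos_insights.ipynb",
    "visualization/starred_insights.ipynb",
]


def nbconvert_command(path):
    """Build the command that executes one notebook and saves it in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        path,
    ]


def run_all(notebooks=NOTEBOOKS):
    """Execute each notebook in order, stopping at the first failure."""
    for path in notebooks:
        subprocess.run(nbconvert_command(path), check=True)
```

Calling `run_all()` executes all eight steps. The collection notebooks make many API requests, so a full run may take a long time, especially without a token.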
If you wish to contribute to this project, feel free to open an issue with your suggestions or submit a pull request with your changes. Your contribution will be greatly appreciated!