Description: This project analyzes user interaction data from Mastodon, a decentralized, open-source social network. MapReduce is used for distributed processing to extract user engagement patterns, content popularity, and other key metrics. The processed data is stored in HBase, a highly scalable NoSQL database suited to sparse, semi-structured data. Apache Airflow, a workflow management and orchestration platform, schedules the pipeline and provides automation and monitoring of each run. The resulting view of user behavior on Mastodon is intended to give marketing managers and product developers actionable insights.
The objectives can be classified into two categories:
- Analyzing User Engagement
  - Identifying users with the highest number of followers
  - Calculating user engagement rates
  - Studying user growth over time
  - Identifying users mentioned in the most used tags
- Identifying Content Popularity
  - Identifying the most shared external websites
  - Categorizing posts based on their language
  - Counting the number of posts with attached multimedia content
  - Identifying the most frequently used tags
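As an illustration of how one of these objectives maps onto MapReduce, the sketch below counts the most frequently used tags. It is a minimal in-memory version, not the project's actual job: it assumes the input holds one JSON post per line and that each post carries a `tags` list of objects with a `name` key, as Mastodon statuses do. The function names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_tags(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (tag, 1) for every hashtag attached to a post."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        # Assumption: each post has a "tags" list of {"name": ...} objects.
        for tag in post.get("tags", []):
            yield tag["name"].lower(), 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the counts emitted for each key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def top_n(totals: Dict[str, int], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent keys, ties broken alphabetically."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

The same map and reduce shape applies to the other counting objectives (posts per language, posts with media) by changing what the mapper emits as the key.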
Mastodon is an open-source social network. It offers a comprehensive API (Application Programming Interface) that allows developers to interact with various aspects of the platform. Here's a brief summary of what the Mastodon API encompasses:
- Mastodon's API provides secure authentication methods, allowing developers to implement secure user authentication and access control.
- The API enables the management of user accounts, including user profile data, preferences, and settings.
- Developers can create, retrieve, and manage toots (Mastodon's equivalent of tweets) through the API, including features like posting, fetching, and deleting toots.
- The API allows access to notifications, enabling developers to fetch and manage notifications such as mentions, likes, and reposts.
- Mastodon's API provides access to various timelines, including the home timeline, local timeline, and federated timeline, allowing developers to retrieve and interact with posts from different timelines.
- The API facilitates interactions between users, including following/unfollowing users, liking toots, and reposting (boosting) content.
- Mastodon's API supports search functionality, enabling users to search for specific content, users, or hashtags within the Mastodon network.
- The API offers streaming capabilities, allowing developers to implement real-time updates for activities such as new toots, notifications, and other interactions.
- Mastodon's API provides information about instances and federation, enabling developers to retrieve data about instances, their policies, and the federated network of instances.
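Timeline access is the part of the API this project relies on most. The sketch below shows one way to page through the public timeline using the `max_id` parameter of `GET /api/v1/timelines/public`. The instance URL, function names, and output path are placeholders, and the `session` argument is assumed to be a `requests.Session` (or any object with a compatible `get` method). Some instances require an access token for this endpoint, which the sketch does not handle.

```python
import json
from typing import Any, Dict, List, Optional

# Placeholder instance; substitute the instance being analyzed.
INSTANCE_URL = "https://mastodon.social"


def fetch_public_timeline(session: Any, max_posts: int = 200, page_size: int = 40,
                          instance_url: str = INSTANCE_URL) -> List[Dict[str, Any]]:
    """Collect up to max_posts statuses, walking backwards with max_id."""
    posts: List[Dict[str, Any]] = []
    max_id: Optional[str] = None
    while len(posts) < max_posts:
        params: Dict[str, Any] = {"limit": min(page_size, max_posts - len(posts))}
        if max_id is not None:
            params["max_id"] = max_id  # only return statuses older than this ID
        response = session.get(f"{instance_url}/api/v1/timelines/public",
                               params=params, timeout=30)
        response.raise_for_status()
        page = response.json()
        if not page:  # an empty page means there is nothing older to fetch
            break
        posts.extend(page)
        max_id = page[-1]["id"]  # statuses arrive newest first
    return posts


def save_json(posts: List[Dict[str, Any]], path: str) -> None:
    """Write the collected statuses to a single JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(posts, handle, ensure_ascii=False)
```

With the third-party `requests` package installed, a call such as `save_json(fetch_public_timeline(requests.Session(), max_posts=100), "mastodon_posts.json")` would produce the JSON file used by the later steps. Passing the session in as an argument keeps the pagination logic easy to test without network access.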
Data Type | Fields/Attributes |
---|---|
User Data | Username, Display Name, Bio, Avatar Image, Header Image, Follower Count, Following Count, Account Creation Date |
User Preferences | Privacy Settings, Notification Preferences, Account Visibility Options, Content Viewing Preferences |
Toots (Posts) | Toot ID, Content Text, Attached Media, Creation Timestamp, Visibility Settings, Content Tags, Reblogs (Boosts) Count, Likes (Favourites) Count, Mentioned Users |
Notifications | Notification ID, Notification Type, Related Toot ID, Timestamp, Notifying User |
Instance Data | Instance Name, Instance Description, Instance Rules and Policies, Instance Admins and Moderators |
Federation Data | Connected Instances, Federation Policies, Interaction Policies with External Instances |
Metadata | Hashtag Name, Associated Toots, Media ID, Media Type, Media URL, Language of the Toot |
Interaction Data | Follows (Follower ID, Followed User ID), Likes (User ID, Liked Toot ID), Boosts (User ID, Boosted Toot ID) |
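To make the table concrete, the sketch below flattens one status returned by the API into a record holding a subset of these fields. The input keys (`tags`, `media_attachments`, `mentions`, `reblogs_count`, `favourites_count`, `account`) follow the Mastodon status entity; the output field names, the `flatten_status` and `pseudonymize` helpers, and the salt are hypothetical. Hashing the account ID is shown only as one possible pseudonymization step and is not by itself a guarantee of GDPR compliance.

```python
import hashlib
from typing import Any, Dict


def pseudonymize(value: str, salt: str = "change-me") -> str:
    """Replace an identifier with a truncated salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def flatten_status(status: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields the analysis needs from one status."""
    account = status.get("account", {})
    return {
        "toot_id": status["id"],
        "created_at": status.get("created_at"),
        "language": status.get("language"),
        "tags": [tag["name"] for tag in status.get("tags", [])],
        "has_media": bool(status.get("media_attachments")),
        "reblogs_count": status.get("reblogs_count", 0),
        "favourites_count": status.get("favourites_count", 0),
        "mention_count": len(status.get("mentions", [])),
        # The raw account ID is not stored, only its digest.
        "user_key": pseudonymize(str(account.get("id", ""))),
        "followers_count": account.get("followers_count", 0),
    }
```

Dropping unused fields before the data reaches HDFS also keeps the MapReduce input smaller.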
The project is carried out in the following steps:
- Plan the project tasks using Jira.
- Fetch data from the Mastodon API: retrieve posts from the public timeline using the Python requests library with pagination, and save the fetched data to a JSON file.
- Upload the JSON file to HDFS (Hadoop Distributed File System).
- Implement the MapReduce jobs for the project objectives while complying with the GDPR.
- Create a script to execute all MapReduce jobs simultaneously.
- Load the MapReduce output into HBase.
- Use Apache Airflow to schedule and execute the main script.
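The step that runs all MapReduce jobs at once could be sketched as follows, assuming the jobs are written for Hadoop Streaming with one mapper and one reducer script per job. The jar path, HDFS paths, job names, and script naming scheme are all hypothetical; the source does not state how the jobs are launched.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

# Hypothetical locations and job names, for illustration only.
STREAMING_JAR = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar"
INPUT_PATH = "/mastodon/raw/mastodon_posts.json"
JOBS = ["top_tags", "posts_by_language", "media_posts"]


def build_command(job: str) -> List[str]:
    """Build the Hadoop Streaming command line for one job."""
    return [
        "hadoop", "jar", STREAMING_JAR,
        "-files", f"{job}_mapper.py,{job}_reducer.py",  # ship scripts to the cluster
        "-mapper", f"python3 {job}_mapper.py",
        "-reducer", f"python3 {job}_reducer.py",
        "-input", INPUT_PATH,
        "-output", f"/mastodon/output/{job}",
    ]


def run_command(command: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(command, check=False).returncode


def run_all(jobs: Sequence[str],
            runner: Callable[[Sequence[str]], int] = run_command) -> Dict[str, int]:
    """Launch every job in its own thread and collect the exit codes."""
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        codes = list(pool.map(lambda job: runner(build_command(job)), jobs))
    return dict(zip(jobs, codes))
```

Calling `run_all(JOBS)` returns a mapping from job name to exit code, so a scheduler such as Airflow can fail the task when any value is non-zero.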
The project uses the following tools:
- Python
- Hadoop Distributed File System (HDFS)
- Apache HBase
- Apache Airflow
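As an example of how Python and HBase can be tied together, the sketch below turns tab-separated MapReduce output lines into HBase puts. The `table` argument is assumed to behave like a table object from the third-party HappyBase library, which exposes `put(row, data)` and talks to HBase through its Thrift server; the column name `stats:count` and the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def parse_output(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Split each 'key<TAB>value' line produced by a reducer."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, value = line.partition("\t")
        yield key, value


def to_hbase_row(key: str, value: str,
                 column: bytes = b"stats:count") -> Tuple[bytes, Dict[bytes, bytes]]:
    """Encode one result as a row key plus a {column: value} mapping."""
    return key.encode("utf-8"), {column: value.encode("utf-8")}


def load_into_table(table: Any, lines: Iterable[str],
                    column: bytes = b"stats:count") -> int:
    """Write every parsed line to the table and return the number of rows."""
    written = 0
    for key, value in parse_output(lines):
        row_key, data = to_hbase_row(key, value, column)
        table.put(row_key, data)  # assumption: HappyBase-style put(row, data)
        written += 1
    return written
```

With HappyBase installed and a Thrift server running, the table could be obtained with something like `happybase.Connection("localhost").table("tag_counts")`, where the host and table name are placeholders and the `stats` column family must already exist.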