This repository contains the source code for assignment 1 of the COMP90024 Cluster and Cloud Computing course at the University of Melbourne.
Submission Details:
- Student name: Matthias Bachfischer
- Student ID: 1133751
data/ -- datasets used for testing
doc/ -- documentation and implementation notes
output/ -- output from previous submission runs on Spartan
playground/ -- scripts used for Twitter API communication
slurm/ -- slurm scripts for submission to Spartan queue
tweetanalyzer/ -- helper and utility functions
To submit a job to the Spartan cluster, run the command sbatch path_to_slurm_script, replacing path_to_slurm_script with the path to the SLURM script that you want to run.
Identify the top 10 most commonly used hashtags and the number of times each appears. Hashtags are matched case-insensitively, e.g. #covid19 and #COVID19 count as the same hashtag. A hashtag must follow the Twitter rules: it cannot contain spaces or punctuation, so a valid hashtag is the string following a # up to the next space or punctuation character (the underscore _ is allowed).
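The hashtag rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in tweetanalyzer/: the function name top_hashtags, the pattern name and the sample texts are made up for this example, and it assumes the tweet text is already available as plain strings. It also assumes that "no punctuation except underscore" can be approximated with the regex class \w (letters, digits and underscore).

```python
import re
from collections import Counter

# Assumption: a hashtag is "#" followed by word characters (letters, digits,
# underscore); any space or other punctuation character ends the hashtag.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def top_hashtags(texts, n=10):
    """Return the n most common hashtags as (hashtag, count) pairs.

    Hashtags are lowercased before counting so that #covid19 and
    #COVID19 are treated as the same hashtag.
    """
    counts = Counter()
    for text in texts:
        for tag in HASHTAG_PATTERN.findall(text):
            counts["#" + tag.lower()] += 1
    return counts.most_common(n)


# Hypothetical sample texts, for illustration only.
sample = [
    "Stay safe #covid19 #StayHome",
    "Numbers rising again #COVID19, sadly",
    "Working remotely #stay_home #covid19!",
]

if __name__ == "__main__":
    for hashtag, count in top_hashtags(sample):
        print(hashtag, count)
```

Lowercasing before counting is one simple way to get case-insensitive matching; the trailing comma and exclamation mark in the sample texts are not part of the hashtag because they are not word characters.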
Identify the languages used for tweeting and the number of times each language is used in the provided tweets.
Documentation: https://developer.twitter.com/en/docs/twitter-for-websites/twitter-for-websites-supported-languages/overview
Standard for language code: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
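Counting languages can be sketched in the same way. This is an illustrative example under stated assumptions, not the repository's implementation: it assumes each tweet is a dictionary with a top-level "lang" field holding a language code, which may not match the layout of the actual dataset. The function name count_languages, the "und" fallback for tweets without a usable code, and the sample tweets are choices made for this sketch.

```python
from collections import Counter


def count_languages(tweets):
    """Return (language_code, count) pairs, most frequent first.

    Assumption: each tweet is a dict with a top-level "lang" key.
    Tweets without a usable code are counted under "und" (undetermined).
    """
    counts = Counter()
    for tweet in tweets:
        lang = tweet.get("lang") or "und"
        counts[lang] += 1
    return counts.most_common()


# Hypothetical sample tweets, for illustration only.
sample_tweets = [
    {"text": "Good morning", "lang": "en"},
    {"text": "Hello again", "lang": "en"},
    {"text": "おはよう", "lang": "ja"},
    {"text": "???"},  # no language code available
]

if __name__ == "__main__":
    for code, count in count_languages(sample_tweets):
        print(code, count)
```

The codes can then be mapped to readable language names using the ISO 639-1 list linked above.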
Cloud
This partition is best suited for general-purpose single-node jobs. Multiple node jobs will work, but communication between nodes will be comparatively slow.
Physical
Each node is connected by high-speed 25Gb networking with 1.15 µsec latency, making this partition suited to multi-node jobs (e.g. those using OpenMPI).