The following report details the Naive Bayes weather prediction algorithm I developed for a project in a machine learning model competition. When evaluated against the test cases, my algorithm achieved the highest accuracy in a submission pool of 100 competitors, earning the top ranking.
The code can be found here: Naive Bayes Weather Classifier

- Clone the repository.
- Run:

```
python3 WeatherClassifier.py training.xlsx tests ground_truth.json
```

- `training.xlsx` is an Excel sheet containing ~7000 rows of weather data, where each row corresponds to a single day and each column corresponds to a weather feature (e.g. cloudy, precipitation, wind speed).
- `tests` is a folder containing 1000 Excel files, each holding 28 rows of data representing a month of weather. The classifier aims to predict the 29th day's weather for each file.
- `ground_truth.json` contains the actual weather for `tests`; it is used to calculate the accuracy of the model.
| Attribute | time | weather_description | precip | visibility | temperature |
|---|---|---|---|---|---|
| Data Format | Categorical | Categorical | Categorical | Categorical | Continuous |
| Naive Bayes Variant | Categorical | Categorical | Categorical | Categorical | Gaussian |

| Attribute | wind_speed | wind_degree | heatindex | wd1 | wd2 | wd3 |
|---|---|---|---|---|---|---|
| Data Format | Continuous | Continuous | Continuous | Categorical | Categorical | Categorical |
| Naive Bayes Variant | Gaussian | Gaussian | Gaussian | Categorical | Categorical | Categorical |
Figure 1: Hybrid Naive Bayes Approach
I built a single class, called `NaiveBayes`, with six class variables:

```python
self.labels = {}
# A dictionary of the possible 'weather_description' labels;
# key = label, value = frequency in the training dataset.

self.totalEntries = totalEntries
# The total number of rows in the training dataset
# (used to calculate frequency ratios).

self.attributes = {}
# A dictionary of the training attributes and their different values;
# key = attribute, value = possible values of this attribute
# (e.g. 0, 600, 1200, 1800 for 'time').

self.condFrequencies = {}
# A dictionary of the conditional frequencies for each attribute given each label;
# key = (attribute, value, label), value = conditional frequency.
# These conditional frequencies are used to calculate the conditional
# probabilities of categorical attributes.

self.condProbabilities = {}
# A dictionary of the conditional probabilities for each attribute given each label;
# key = (attribute, value, label), value = conditional probability.

self.smoothing_factor = 20
# A smoothing factor applied to the formulas for conditional frequency
# and conditional probability, e.g.:
# condCount = len(y_subset[y_subset.index.isin(x_subset.index)]) + self.smoothing_factor
```
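To make the role of the smoothing factor concrete, here is a minimal sketch of additive smoothing applied to a categorical conditional probability. The function and count names are illustrative, not the report's actual implementation; the report only shows the smoothed-count line above.

```python
def cond_probability(count_attr_and_label, count_label, n_values, smoothing_factor=20):
    """Estimate P(attribute = value | label) with additive smoothing.

    count_attr_and_label: rows where the attribute takes this value AND the label matches.
    count_label: rows with this label.
    n_values: number of possible values of the attribute (keeps the
    smoothed probabilities summing to 1 across values).
    """
    return (count_attr_and_label + smoothing_factor) / (
        count_label + smoothing_factor * n_values
    )

# Example: a precip value seen 30 times among 200 days of one label,
# with 4 possible precip values.
p = cond_probability(30, 200, 4)
```

Without smoothing, an attribute value never seen with a label would zero out that label's entire product of probabilities; the added counts keep every probability strictly positive.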
Within the `NaiveBayes` class, I defined six functions to populate and use these dictionaries:

- `getLabelFrequencies()`
- `getAttributes()`
- `getCondFrequencies()`
- `getCondProbabilities()`: fills `self.condProbabilities` for categorical attributes.
- `getCondProbabilitiesContinuous()`: fills `self.condProbabilities` for continuous attributes using the Gaussian PDF formula.
- `predict()`: iterates through the 28th day's attributes and generates keys (attribute, value, label) to look up the corresponding conditional probabilities in `self.condProbabilities`.
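The lookup-and-combine step that `predict()` performs can be sketched as follows. This is a simplified stand-in, not the competition code: the names are illustrative, and it combines probabilities in log space (the report multiplies them directly), which avoids numeric underflow with many attributes.

```python
import math

def predict(day, labels, cond_probabilities, total_entries):
    """Pick the label maximizing P(label) * prod P(attr = value | label).

    day: {attribute: value} for the day being classified.
    labels: {label: frequency in the training set}.
    cond_probabilities: {(attribute, value, label): probability}.
    """
    best_label, best_score = None, float("-inf")
    for label, freq in labels.items():
        score = math.log(freq / total_entries)  # prior P(label)
        for attribute, value in day.items():
            # Unseen (attribute, value, label) keys get a tiny floor probability.
            p = cond_probabilities.get((attribute, value, label), 1e-9)
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, with priors `{"Sunny": 60, "Rainy": 40}` out of 100 days and a single attribute whose likelihood strongly favors "Sunny", the function returns "Sunny".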
The following screenshot illustrates the attributes that I trained my model on (`wd1`, `wd2`, `wd3` are the three weather descriptions immediately preceding the weather description in question). A single instance (day) in the dataset is represented as a row in a pandas DataFrame object.

`x_categorical` is the DataFrame in which each column is one of my selected categorical attributes; `x_continuous` is the analogous DataFrame for the continuous attributes. I calculate their conditional probabilities differently, using `getCondProbabilities()` and `getCondProbabilitiesContinuous()` respectively.
To train my classifier, I wrote separate functions (see above) to calculate and pre-save the conditional probabilities depending on the kind of attribute in question (categorical or continuous), using the formulas provided in the lecture slides. In my `predict()` function, I calculated the probability of each possible weather label given the attributes in the 28th day's row by retrieving the necessary conditional probabilities from my `self.condProbabilities` dictionary and multiplying them together.
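For the continuous attributes, the Gaussian variant evaluates the day's value under a normal distribution fit per (attribute, label) pair. A minimal sketch of that PDF evaluation, with illustrative names and example statistics rather than values from the actual training data:

```python
import math

def gaussian_pdf(x, mean, std):
    """Likelihood of x under a normal distribution.

    mean and std are fit from the training rows of one label, e.g. the
    mean and standard deviation of 'temperature' among "Sunny" days.
    """
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return coeff * math.exp(exponent)

# Evaluate the 28th day's temperature under one label's distribution:
likelihood = gaussian_pdf(22.0, mean=25.0, std=5.0)
```

This replaces the categorical frequency-ratio lookup: instead of counting matching rows, the continuous value's likelihood is plugged directly into the product of probabilities.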
When looking through the `training.xlsx` file, I noticed that the presence of any precipitation ruled out any possibility of the weather description being "Sunny", "Clear", "Cloudy", "Partly cloudy", or "Overcast". After printing my incorrect predictions on the given test dataset, I realized that most of my missed predictions were false "Sunny" and "Clear" predictions for days which had precipitation. As such, I coded a check in my `predict()` function to set the probability of the prediction being "Sunny", "Clear", "Cloudy", "Partly cloudy", or "Overcast" to 0 if the 28th day's `precip` attribute was "Light precipitation", "Moderate precipitation", or "Heavy precipitation". This check bumped my accuracy from 0.669 to 0.786.
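The precipitation rule can be sketched as a post-processing step on the per-label scores before taking the argmax. This is an illustrative reconstruction; the constant names and dictionary shape are assumptions, though the label and precip values come from the report.

```python
# Labels that the training data showed never co-occur with precipitation.
RULED_OUT = {"Sunny", "Clear", "Cloudy", "Partly cloudy", "Overcast"}
PRECIP_VALUES = {"Light precipitation", "Moderate precipitation", "Heavy precipitation"}

def apply_precip_rule(scores, precip_value):
    """Zero out the scores of labels incompatible with precipitation."""
    if precip_value in PRECIP_VALUES:
        return {label: (0.0 if label in RULED_OUT else s)
                for label, s in scores.items()}
    return scores
```

Hard-coding a deterministic rule like this is effectively giving one attribute infinite weight; it works here because the training data showed the exclusion holding without exception.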
Accuracy: 0.786. Runtime: 18.37 seconds.

I attribute my improvements in accuracy to including the previous three days of weather descriptions in my training dataset and to filtering out "Sunny", "Clear", "Cloudy", "Partly cloudy", and "Overcast" predictions for days preceded by any sort of precipitation.
One of the biggest challenges I faced was selecting the optimal combination of attributes to train my model. Initially, my approach was uninformed and random—I tested arbitrary combinations of features without understanding their relationships with the target variable (weather_description). Unsurprisingly, this led to inconsistent results and limited accuracy improvements.
To address this, I focused on feature engineering to uncover patterns within the training dataset. I started by creating additional features, such as the previous three days' weather descriptions (wd1, wd2, and wd3) and the 3-day moving averages of numerical attributes like temperature and wind speed. These features helped capture temporal patterns in the data and provided context for making predictions.
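The temporal features described above can be derived with pandas along the lines of the sketch below. The function name is mine, and I am assuming the training frame has one row per day in date order with the column names from the tables above.

```python
import pandas as pd

def add_temporal_features(df):
    """Add prior-day weather descriptions and 3-day moving averages.

    Assumes df is ordered by date with 'weather_description',
    'temperature', and 'wind_speed' columns.
    """
    df = df.copy()
    # wd1..wd3: the weather descriptions of the previous three days.
    for k in (1, 2, 3):
        df[f"wd{k}"] = df["weather_description"].shift(k)
    # 3-day moving averages of the numeric attributes.
    for col in ("temperature", "wind_speed"):
        df[f"{col}_ma3"] = df[col].rolling(window=3).mean()
    return df
```

The first rows of the shifted and rolled columns are NaN by construction, so those days either need to be dropped or handled specially during training.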
Additionally, I wrote a Python script to calculate the correlation coefficients between the target variable and each attribute in the dataset. This helped me identify which features were most strongly correlated with weather outcomes. For example:
- Attributes like precipitation and humidity showed high correlations with cloudy or rainy conditions.
- Attributes like temperature and visibility were more indicative of sunny or clear days.

Using this data-driven approach, I refined my feature selection, focusing on high-impact attributes while excluding noisy or irrelevant ones. For example, wind direction (`wind_degree`) was found to have minimal correlation with the target variable and was excluded from the final training set.
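A correlation screen of this kind might look like the following sketch. The function name is mine, and note the caveat baked into it: factorizing a categorical target imposes an arbitrary ordering on the labels, so Pearson correlation against it is only a rough heuristic for ranking numeric features, not a rigorous association test.

```python
import pandas as pd

def rank_numeric_features(df, target="weather_description"):
    """Rank numeric columns by |correlation| with an integer-encoded target.

    Heuristic only: pd.factorize assigns arbitrary integer codes to the
    target categories, so the resulting correlations are a coarse screen.
    """
    encoded = pd.Series(pd.factorize(df[target])[0], index=df.index)
    numeric = df.select_dtypes("number")
    corrs = numeric.apply(lambda col: col.corr(encoded))
    return corrs.abs().sort_values(ascending=False)
```

A column that tracks the target perfectly ranks first with |correlation| 1.0, while an unrelated column falls to the bottom; in the report's case, `wind_degree` landed near the bottom of such a ranking and was dropped.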
Through these adjustments, I not only improved the relevance of the features used in training but also enhanced the model's interpretability and accuracy. This systematic approach replaced my initial guesswork and laid the foundation for the model’s success.