The following report details the Naive Bayes weather prediction algorithm I developed for a project in a machine learning model competition. When evaluated against the test cases, my algorithm achieved the highest accuracy in a submission pool of 100 competitors, earning the top ranking.
The code can be found here: Naive Bayes Weather Classifier

- Clone the repository.
- Run:

```
python3 WeatherClassifier.py training.xlsx tests ground_truth.json
```

- `training.xlsx` is an Excel sheet containing ~7000 rows of weather data, where each row corresponds to a single day and each column corresponds to a weather feature (e.g. cloudy, precipitation, wind speed).
- `tests` is a folder containing 1000 Excel files, each holding 28 rows of data representing a month of weather. The classifier aims to predict the 29th day's weather for each file.
- `ground_truth.json` contains the actual weather for `tests`; it is used to calculate the accuracy of the model.
| Attribute | time | weather_description | precip | visibility | temperature |
|---|---|---|---|---|---|
| Data Format | Categorical | Categorical | Categorical | Categorical | Continuous |
| Naive Bayes Variant | Categorical | Categorical | Categorical | Categorical | Gaussian |

| Attribute | wind_speed | wind_degree | heatindex | wd1 | wd2 | wd3 |
|---|---|---|---|---|---|---|
| Data Format | Continuous | Continuous | Continuous | Categorical | Categorical | Categorical |
| Naive Bayes Variant | Gaussian | Gaussian | Gaussian | Categorical | Categorical | Categorical |
Figure 1: Hybrid Naive Bayes Approach
I built a single class, called `NaiveBayes`, with six class variables:

```python
self.labels = {}
# A dictionary of the possible 'weather_description' labels;
# key = label, value = frequency in the training dataset.

self.totalEntries = totalEntries
# The total number of rows in the training dataset
# (used to calculate frequency ratios).

self.attributes = {}
# A dictionary of the training attributes and their different values;
# key = attribute, value = possible values of this attribute
# (e.g. 0, 600, 1200, 1800 for 'time').

self.condFrequencies = {}
# A dictionary of the conditional frequencies for each attribute given each label;
# key = (attribute, value, label), value = conditional frequency.
# These conditional frequencies are used to calculate the conditional
# probabilities of categorical attributes.

self.condProbabilities = {}
# A dictionary of the conditional probabilities for each attribute given each label;
# key = (attribute, value, label), value = conditional probability.

self.smoothing_factor = 20
# A smoothing factor applied to the formulas for conditional frequency
# and conditional probability, e.g.:
# condCount = len(y_subset[y_subset.index.isin(x_subset.index)]) + self.smoothing_factor
```
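To make the role of the smoothing factor concrete, here is a minimal sketch of additive smoothing applied to a categorical conditional probability. The function and count names are illustrative, not the report's actual implementation; the report only shows the smoothed-count line above.

```python
def cond_probability(count_attr_and_label, count_label, n_values, smoothing_factor=20):
    """Estimate P(attribute = value | label) with additive smoothing.

    count_attr_and_label: rows where the attribute takes this value AND the label matches.
    count_label: rows with this label.
    n_values: number of possible values of the attribute (keeps the
    smoothed probabilities summing to 1 across values).
    """
    return (count_attr_and_label + smoothing_factor) / (
        count_label + smoothing_factor * n_values
    )

# Example: a precip value seen 30 times among 200 days of one label,
# with 4 possible precip values.
p = cond_probability(30, 200, 4)
```

Without smoothing, an attribute value never seen with a label would zero out that label's entire product of probabilities; the added counts keep every probability strictly positive.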
Within the `NaiveBayes` class, I defined six functions to populate and use these dictionaries:

- `getLabelFrequencies()`
- `getAttributes()`
- `getCondFrequencies()`
- `getCondProbabilities()`: fills `self.condProbabilities` for categorical attributes.
- `getCondProbabilitiesContinuous()`: fills `self.condProbabilities` for continuous attributes using the Gaussian PDF formula.
- `predict()`: iterates through the 28th day's attributes and generates keys (attribute, value, label) to look up the corresponding conditional probabilities in `self.condProbabilities`.
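The lookup-and-combine step that `predict()` performs can be sketched as follows. This is a simplified stand-in, not the competition code: the names are illustrative, and it combines probabilities in log space (the report multiplies them directly), which avoids numeric underflow with many attributes.

```python
import math

def predict(day, labels, cond_probabilities, total_entries):
    """Pick the label maximizing P(label) * prod P(attr = value | label).

    day: {attribute: value} for the day being classified.
    labels: {label: frequency in the training set}.
    cond_probabilities: {(attribute, value, label): probability}.
    """
    best_label, best_score = None, float("-inf")
    for label, freq in labels.items():
        score = math.log(freq / total_entries)  # prior P(label)
        for attribute, value in day.items():
            # Unseen (attribute, value, label) keys get a tiny floor probability.
            p = cond_probabilities.get((attribute, value, label), 1e-9)
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, with priors `{"Sunny": 60, "Rainy": 40}` out of 100 days and a single attribute whose likelihood strongly favors "Sunny", the function returns "Sunny".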
The following screenshot illustrates the attributes that I trained my model on (`wd1`, `wd2`, `wd3` are the three weather descriptions immediately preceding the weather description in question). A single instance (day) in the dataset is represented as a row in a pandas DataFrame object.

`x_categorical` is the DataFrame in which each column is one of my selected categorical attributes; `x_continuous` is the analogous DataFrame for the continuous attributes. I calculate their conditional probabilities differently, using `getCondProbabilities()` and `getCondProbabilitiesContinuous()` respectively.
To train my classifier, I wrote separate functions (see above) to calculate and pre-save the conditional probabilities depending on the kind of attribute in question (categorical or continuous), using the formulas provided in the lecture slides. In my `predict()` function, I calculated the probability of each possible weather label given the attributes in the 28th day's row by retrieving the necessary conditional probabilities from my `self.condProbabilities` dictionary and multiplying them together.
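For the continuous attributes, the Gaussian variant evaluates the day's value under a normal distribution fit per (attribute, label) pair. A minimal sketch of that PDF evaluation, with illustrative names and example statistics rather than values from the actual training data:

```python
import math

def gaussian_pdf(x, mean, std):
    """Likelihood of x under a normal distribution.

    mean and std are fit from the training rows of one label, e.g. the
    mean and standard deviation of 'temperature' among "Sunny" days.
    """
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return coeff * math.exp(exponent)

# Evaluate the 28th day's temperature under one label's distribution:
likelihood = gaussian_pdf(22.0, mean=25.0, std=5.0)
```

This replaces the categorical frequency-ratio lookup: instead of counting matching rows, the continuous value's likelihood is plugged directly into the product of probabilities.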
When looking through the `training.xlsx` file, I noticed that the presence of any precipitation ruled out any possibility of the weather description being "Sunny", "Clear", "Cloudy", "Partly cloudy", or "Overcast". After printing my incorrect predictions on the given test dataset, I realized that most of my missed predictions were false "Sunny" and "Clear" predictions for days which had precipitation. As such, I coded a check in my `predict()` function to set the probability of the prediction being "Sunny", "Clear", "Cloudy", "Partly cloudy", or "Overcast" to 0 if the 28th day's `precip` attribute was "Light precipitation", "Moderate precipitation", or "Heavy precipitation". This check bumped my accuracy from 0.669 to 0.786.
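The precipitation rule can be sketched as a post-processing step on the per-label scores before taking the argmax. This is an illustrative reconstruction; the constant names and dictionary shape are assumptions, though the label and precip values come from the report.

```python
# Labels that the training data showed never co-occur with precipitation.
RULED_OUT = {"Sunny", "Clear", "Cloudy", "Partly cloudy", "Overcast"}
PRECIP_VALUES = {"Light precipitation", "Moderate precipitation", "Heavy precipitation"}

def apply_precip_rule(scores, precip_value):
    """Zero out the scores of labels incompatible with precipitation."""
    if precip_value in PRECIP_VALUES:
        return {label: (0.0 if label in RULED_OUT else s)
                for label, s in scores.items()}
    return scores
```

Hard-coding a deterministic rule like this is effectively giving one attribute infinite weight; it works here because the training data showed the exclusion holding without exception.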
Accuracy: 0.786. Runtime: 18.37 seconds.

I attribute my improvements in accuracy to including the previous three days of weather descriptions in my training dataset and to filtering out "Sunny", "Clear", "Cloudy", "Partly cloudy", and "Overcast" predictions for days preceded by any sort of precipitation.
One of the biggest challenges I faced was selecting the optimal combination of attributes to train my model. Initially, my approach was uninformed and random—I tested arbitrary combinations of features without understanding their relationships with the target variable (weather_description). Unsurprisingly, this led to inconsistent results and limited accuracy improvements.
To address this, I focused on feature engineering to uncover patterns within the training dataset. I started by creating additional features, such as the previous three days' weather descriptions (wd1, wd2, and wd3) and the 3-day moving averages of numerical attributes like temperature and wind speed. These features helped capture temporal patterns in the data and provided context for making predictions.
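The temporal features described above can be derived with pandas along the lines of the sketch below. The function name is mine, and I am assuming the training frame has one row per day in date order with the column names from the tables above.

```python
import pandas as pd

def add_temporal_features(df):
    """Add prior-day weather descriptions and 3-day moving averages.

    Assumes df is ordered by date with 'weather_description',
    'temperature', and 'wind_speed' columns.
    """
    df = df.copy()
    # wd1..wd3: the weather descriptions of the previous three days.
    for k in (1, 2, 3):
        df[f"wd{k}"] = df["weather_description"].shift(k)
    # 3-day moving averages of the numeric attributes.
    for col in ("temperature", "wind_speed"):
        df[f"{col}_ma3"] = df[col].rolling(window=3).mean()
    return df
```

The first rows of the shifted and rolled columns are NaN by construction, so those days either need to be dropped or handled specially during training.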
Additionally, I wrote a Python script to calculate the correlation coefficients between the target variable and each attribute in the dataset. This helped me identify which features were most strongly correlated with weather outcomes. For example:
- Attributes like precipitation and humidity showed high correlations with cloudy or rainy conditions.
- Attributes like temperature and visibility were more indicative of sunny or clear days.

Using this data-driven approach, I refined my feature selection, focusing on high-impact attributes while excluding noisy or irrelevant ones. For example, wind direction (`wind_degree`) was found to have minimal correlation with the target variable and was excluded from the final training set.
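A correlation screen of this kind might look like the following sketch. The function name is mine, and note the caveat baked into it: factorizing a categorical target imposes an arbitrary ordering on the labels, so Pearson correlation against it is only a rough heuristic for ranking numeric features, not a rigorous association test.

```python
import pandas as pd

def rank_numeric_features(df, target="weather_description"):
    """Rank numeric columns by |correlation| with an integer-encoded target.

    Heuristic only: pd.factorize assigns arbitrary integer codes to the
    target categories, so the resulting correlations are a coarse screen.
    """
    encoded = pd.Series(pd.factorize(df[target])[0], index=df.index)
    numeric = df.select_dtypes("number")
    corrs = numeric.apply(lambda col: col.corr(encoded))
    return corrs.abs().sort_values(ascending=False)
```

A column that tracks the target perfectly ranks first with |correlation| 1.0, while an unrelated column falls to the bottom; in the report's case, `wind_degree` landed near the bottom of such a ranking and was dropped.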
Through these adjustments, I not only improved the relevance of the features used in training but also enhanced the model's interpretability and accuracy. This systematic approach replaced my initial guesswork and laid the foundation for the model’s success.