further model testing #8

Merged
113 changes: 53 additions & 60 deletions analysis/baselineModelTesting.ipynb
@@ -16,7 +16,7 @@
"Good: \n",
"* Is great for seeing how changes to a single input variable affect one output variable. \n",
"* It is commonly used for prediction (rather than categorization), which is exactly what we want.\n",
"* Works well for small datasets because relationships are esaily interpretable.\n",
"* Works well for small datasets because relationships are easily interpretable.\n",
"\n",
"Bad:\n",
"* If the relationships in the data are not linear, this algorithm performs very poorly.\n",
@@ -62,7 +62,7 @@
"* Requires fine-tuning, which is hard to do with small datasets. \n",
"\n",
"## Conclusion ##\n",
"Our goal is to predict distance given lap data. Seeing as Elysia's lap data contains 208 rows and 9 columns (after having dropped useless columns), **Linear Regression** is a suitable choice for the base model. As we are just aiming to display a simple model on the website for now, we want something that is lightweight, simple, and easy to interpret, and Linear Regression matches all these criteria. \n",
"Our goal is to predict distance given lap data. Seeing as Elysia's lap data contains 208 rows and 8 columns (after having dropped useless columns), **Linear Regression** is a suitable choice for the base model. As we are just aiming to display a simple model on the website for now, we want something that is lightweight, simple, and easy to interpret, and Linear Regression matches all these criteria. \n",
"\n",
"We do not want to risk running into KNN and dimensionality issues. Decision Trees, Random Forests, and Gradient Boosting either take too much computation, or we may even lack the data to fine tune these models. \n",
"\n",
@@ -77,12 +77,12 @@
"metadata": {},
"source": [
"## Import Modules and Check the Data Size ##\n",
"Should be 208x9"
"Should be 208x8"
]
},
{
"cell_type": "code",
"execution_count": 40,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -106,7 +106,6 @@
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>secondsdifference</th>\n",
" <th>totalpowerin</th>\n",
" <th>totalpowerout</th>\n",
" <th>netpowerout</th>\n",
@@ -120,7 +119,6 @@
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3264851</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
@@ -132,7 +130,6 @@
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>88892470</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
@@ -144,7 +141,6 @@
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>95990970</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
@@ -156,7 +152,6 @@
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>205000</td>\n",
" <td>793.532455</td>\n",
" <td>1803.385827</td>\n",
" <td>1009.853372</td>\n",
@@ -168,7 +163,6 @@
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3517768</td>\n",
" <td>798.704841</td>\n",
" <td>685.739611</td>\n",
" <td>-112.965230</td>\n",
@@ -188,11 +182,9 @@
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>325001</td>\n",
" <td>475.938651</td>\n",
" <td>1666.858482</td>\n",
" <td>1190.919832</td>\n",
@@ -204,7 +196,6 @@
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>2494999</td>\n",
" <td>700.176665</td>\n",
" <td>566.324558</td>\n",
" <td>-133.852107</td>\n",
@@ -216,7 +207,6 @@
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>343500</td>\n",
" <td>615.194700</td>\n",
" <td>1680.524756</td>\n",
" <td>1065.330056</td>\n",
@@ -228,7 +218,6 @@
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>330500</td>\n",
" <td>624.008188</td>\n",
" <td>1543.656620</td>\n",
" <td>919.648432</td>\n",
@@ -240,7 +229,6 @@
" </tr>\n",
" <tr>\n",
" <th>207</th>\n",
" <td>-1123290860</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
@@ -252,37 +240,37 @@
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>208 rows × 9 columns</p>\n",
"<p>208 rows × 8 columns</p>\n",
"</div>"
],
"text/plain": [
" secondsdifference totalpowerin totalpowerout netpowerout distance \\\n",
"0 3264851 0.000000 0.000000 0.000000 0.000000 \n",
"1 88892470 0.000000 0.000000 0.000000 0.000000 \n",
"2 95990970 0.000000 0.000000 0.000000 0.000000 \n",
"3 205000 793.532455 1803.385827 1009.853372 2.814824 \n",
"4 3517768 798.704841 685.739611 -112.965230 4.118535 \n",
".. ... ... ... ... ... \n",
"203 325001 475.938651 1666.858482 1190.919832 3.963359 \n",
"204 2494999 700.176665 566.324558 -133.852107 0.181906 \n",
"205 343500 615.194700 1680.524756 1065.330056 3.963687 \n",
"206 330500 624.008188 1543.656620 919.648432 3.966359 \n",
"207 -1123290860 0.000000 0.000000 0.000000 0.000000 \n",
" totalpowerin totalpowerout netpowerout distance amphours \\\n",
"0 0.000000 0.000000 0.000000 0.000000 69.800003 \n",
"1 0.000000 0.000000 0.000000 0.000000 126.800003 \n",
"2 0.000000 0.000000 0.000000 0.000000 98.699997 \n",
"3 793.532455 1803.385827 1009.853372 2.814824 97.800003 \n",
"4 798.704841 685.739611 -112.965230 4.118535 95.300003 \n",
".. ... ... ... ... ... \n",
"203 475.938651 1666.858482 1190.919832 3.963359 2.100000 \n",
"204 700.176665 566.324558 -133.852107 0.181906 4.700000 \n",
"205 615.194700 1680.524756 1065.330056 3.963687 3.000000 \n",
"206 624.008188 1543.656620 919.648432 3.966359 1.400000 \n",
"207 0.000000 0.000000 0.000000 0.000000 73.800003 \n",
"\n",
" amphours batterysecondsremaining averagespeed averagepackCurrent \n",
"0 69.800003 -1 0.000000 NaN \n",
"1 126.800003 -1 0.000000 NaN \n",
"2 98.699997 -1 0.000000 NaN \n",
"3 97.800003 22425 49.584229 15.70 \n",
"4 95.300003 56521 24.939703 6.07 \n",
".. ... ... ... ... \n",
"203 2.100000 406 43.894226 18.62 \n",
"204 4.700000 100215 1.106129 -5.78 \n",
"205 3.000000 611 41.645220 17.67 \n",
"206 1.400000 300 43.276732 16.79 \n",
"207 73.800003 -1 0.000000 NaN \n",
" batterysecondsremaining averagespeed averagepackCurrent \n",
"0 -1 0.000000 NaN \n",
"1 -1 0.000000 NaN \n",
"2 -1 0.000000 NaN \n",
"3 22425 49.584229 15.70 \n",
"4 56521 24.939703 6.07 \n",
".. ... ... ... \n",
"203 406 43.894226 18.62 \n",
"204 100215 1.106129 -5.78 \n",
"205 611 41.645220 17.67 \n",
"206 300 43.276732 16.79 \n",
"207 -1 0.000000 NaN \n",
"\n",
"[208 rows x 9 columns]"
"[208 rows x 8 columns]"
]
},
"metadata": {},
@@ -297,6 +285,10 @@
"from sklearn.model_selection import KFold\n",
"import numpy as np\n",
"\n",
"#global variables for randomness seed and test size\n",
"GLOBAL_RANDOM_STATE = 69420\n",
"GLOBAL_TEST_SIZE = 0.2\n",
"\n",
"packetTrainingDataPath=\"../training_data/Elysia.Laps.feather\"\n",
"df = pd.read_feather(packetTrainingDataPath)\n",
"df = df.drop(\n",
@@ -305,6 +297,7 @@
" \"_id.$oid\",\n",
" \"averagepackCurrent.$numberDouble\",\n",
" \"timestamp.$numberLong\",\n",
" \"secondsdifference\",\n",
" ]\n",
" )\n",
"\n",
@@ -320,15 +313,15 @@
},
{
"cell_type": "code",
"execution_count": 41,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE (root mean squared error) is: 36.22178511618355\n",
"MAE (mean absolute error) is: 6.9643536149352085\n",
"RMSE (root mean squared error) is: 36.21807130860871\n",
"MAE (mean absolute error) is: 6.9236865793943565\n",
"Minimum distance: -227.01091494750978\n",
"Maximum distance: 77.62573291015624\n",
"Average distance: 3.219106722576028\n",
@@ -346,12 +339,12 @@
"df = df.dropna(subset=['distance', 'averagepackCurrent', 'averagespeed'])\n",
"\n",
"#separate distance from the other features\n",
"X = df[['secondsdifference', 'totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', \n",
"X = df[['totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', \n",
" 'averagepackCurrent', 'batterysecondsremaining', 'averagespeed']]\n",
"y = df['distance']\n",
"\n",
"#split into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=GLOBAL_TEST_SIZE, random_state=GLOBAL_RANDOM_STATE)\n",
"\n",
"#train the baseline model with linear regression\n",
"linear_regression_model = LinearRegression()\n",
@@ -390,20 +383,20 @@
"metadata": {},
"source": [
"# Interpretation of Performance #\n",
"Given such little spread in the IQR (middle 50% of distance values), but RMSE of 36.22 and MAE of 6.96, it is clear that this baseline model is very inaccurate. There should be very few predictions of distance which differ from a value of 4. This suggests that there could be large outliers, and so lets try this again, but cleaning outliers and negative distance values (it does not make sense for distance to be negative.)"
"Given such a small spread in the IQR (the middle 50% of distance values), an RMSE of 36.22 and an MAE of 6.92 make it clear that this baseline model is very inaccurate: very few predictions of distance should differ much from a value of 4. This suggests there are large outliers, so let's try again after cleaning outliers and negative distance values (a negative distance does not make sense)."
]
},
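The outlier-cleaning step is collapsed in this diff; a minimal sketch of the standard 1.5×IQR fence it presumably applies, assuming the same `distance` column (the toy values below are illustrative only, not real lap data):

```python
import pandas as pd

# toy stand-in for the lap data; two values mimic the extreme outliers seen above
df = pd.DataFrame({"distance": [4.0, 3.9, 4.1, 3.95, -227.0, 77.6, 4.05]})

# standard 1.5*IQR fence on the target column
q1, q3 = df["distance"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df["distance"] >= lower) & (df["distance"] <= upper)]

# drop negative distances as well, matching the cell below
df = df[df["distance"] >= 0]
print(len(df))  # the two extreme values are filtered out, 5 rows remain
```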
{
"cell_type": "code",
"execution_count": 42,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE (root mean squared error) after removing outliers is: 1.711367174002224\n",
"MAE (mean absolute error) after removing outliers is: 0.9110000515477485\n",
"RMSE (root mean squared error) after removing outliers is: 1.711275200278704\n",
"MAE (mean absolute error) after removing outliers is: 0.9109172082259483\n",
"Minimum distance: 0.0162265625\n",
"Maximum distance: 8.745853515625\n",
"Average distance: 4.063722782813743\n",
@@ -423,12 +416,12 @@
"#remove negative distance values\n",
"df = df[df['distance'] >= 0]\n",
"#separate distance from the other features\n",
"X = df[['secondsdifference', 'totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', \n",
"X = df[['totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', \n",
" 'averagepackCurrent', 'batterysecondsremaining', 'averagespeed']]\n",
"y = df['distance']\n",
"\n",
"#split into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=GLOBAL_TEST_SIZE, random_state=GLOBAL_RANDOM_STATE)\n",
"\n",
"#retrain with linear regression\n",
"linear_regression_model.fit(X_train, y_train)\n",
@@ -451,20 +444,20 @@
"metadata": {},
"source": [
"# Re-evaluation of Performance #\n",
"It is clear to see that the accuracy has improved significantly, as RMSE has gone from 36.22 to 1.7, and MAE hsa gone from 6.96 to 0.9. This means the difference in predictions of distance from actual values are much smaller. However, there are other simple tests we can try to improve the model. "
"Accuracy has clearly improved significantly: RMSE has gone from 36.22 to 1.71, and MAE has gone from 6.92 to 0.91. This means the model's distance predictions are much closer to the actual values. However, there are other simple tests we can try to improve the model. "
]
},
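One such test is k-fold cross-validation. The manual fold loop in the next cell is collapsed in this diff; the same idea can be sketched more compactly with scikit-learn's `cross_val_score` (synthetic data here, since the feather file is not part of this snippet):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in for the 7 lap features and the distance target
rng = np.random.default_rng(69420)
X = rng.normal(size=(200, 7))
y = X @ rng.normal(size=7) + rng.normal(scale=0.1, size=200)

# same 5-fold shuffled split as the notebook
kf = KFold(n_splits=5, shuffle=True, random_state=69420)
neg_mae = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_absolute_error", cv=kf)
print(f"Cross-validated MAE: {-neg_mae.mean():.3f}")  # one score per fold, averaged
```

scikit-learn returns the *negated* MAE so that higher is always better; flip the sign to report the usual error.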
{
"cell_type": "code",
"execution_count": 43,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validated RMSE: 1.8882794269274583\n",
"Cross-validated MAE: 0.8029010278510377\n",
"Cross-validated RMSE: 1.0719666192936375\n",
"Cross-validated MAE: 0.5674581684709319\n",
"Minimum distance: 0.0162265625\n",
"Maximum distance: 8.745853515625\n",
"Average distance: 4.063722782813743\n",
@@ -476,7 +469,7 @@
],
"source": [
"#initialize the KFold cross validator\n",
"kf = KFold(n_splits=5, shuffle=True, random_state=69420)\n",
"kf = KFold(n_splits=5, shuffle=True, random_state=GLOBAL_RANDOM_STATE)\n",
"\n",
"#lists to store RMSE and MAE for each fold (each subset of data)\n",
"rmse_list = []\n",
@@ -517,9 +510,9 @@
"metadata": {},
"source": [
"# Final Evaluation of Performance #\n",
"Cross validation seems to worsen RMSE from 1.7 to 1.88, but improves MAE from 0.9 to 0.8. In general, low RMSE means the model is better at estimating outliers, whereas low MAE means the model is more robust and can predict around the median of data. Seeing as we will clean and remove the outliers anyway, we want to focus on predicting the majority of input data. So, we are prioritising a low MAE. Thus, this model using Cross-validation is the best so far. \n",
"Cross-validation improves both RMSE (from 1.71 to 1.07) and MAE (from 0.91 to 0.57). In general, a low RMSE means the model handles outliers better, whereas a low MAE means the model is more robust and predicts well around the median of the data. Since we clean and remove outliers anyway, we care most about predicting the majority of the input data, so we prioritise a low MAE. The distinction does not matter here because both metrics improve, but it will matter for future models if MAE improves while RMSE worsens. Thus, for this baseline, the cross-validated model is the best so far. \n",
"\n",
"Since the majority of our data is between 3.9-4.0 and MAE is 0.8, it means predictions are normally 20% off from the real distance value. That is, in the real world, this model can be used to predict the distance of a lap given input features such as power, `amphours`, `averagepackCurrent`, `batterysecondsremaining`, and `averagespeed`. With a MAE of 0.8, the model's predictions are typically within 0.8 units of the actual distance. This prediction can be useful for making data driven decisions when optimizing lap performance, planning energy consumption, and looking at improving overall efficiency. By reducing the MAE, we ensure that the model is able to predict the majority of the input data, which has many applications in optimizing different metrics of the car during race."
"Since the majority of our data lies between 3.9 and 4.0 and the MAE is 0.57, predictions are typically about 14% off from the real distance value. In practice, this model can predict the distance of a lap given input features such as power, `amphours`, `averagepackCurrent`, `batterysecondsremaining`, and `averagespeed`. With an MAE of 0.57, the model's predictions are typically within 0.57 units of the actual distance. These predictions can inform data-driven decisions when optimizing lap performance, planning energy consumption, and improving overall efficiency. By reducing the MAE, we ensure the model predicts well for the majority of the input data, which has many applications in optimizing different metrics of the car during a race."
]
}
],
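The "~14% off" figure quoted above is just the cross-validated MAE divided by a typical lap distance, taking the midpoint of the 3.9-4.0 band as the typical value; as a quick check:

```python
# cross-validated MAE reported in the notebook output above
mae = 0.5674581684709319
typical_distance = 3.95  # midpoint of the 3.9-4.0 band where most laps fall
relative_error = mae / typical_distance
print(f"{relative_error:.1%}")  # prints 14.4%
```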