further model testing #8

Merged
113 changes: 53 additions & 60 deletions analysis/baselineModelTesting.ipynb
@@ -16,7 +16,7 @@
"Good: \n",
"* Is great for seeing how changes to a single input variable affect one output variable. \n",
"* It is commonly used for prediction (rather than categorization), which is exactly what we want.\n",
"* Works well for small datasets because relationships are esaily interpretable.\n",
"* Works well for small datasets because relationships are easily interpretable.\n",
"\n",
"Bad:\n",
"* If the relationships in the data are not linear, this algorithm performs very poorly.\n",
@@ -62,7 +62,7 @@
"* Requires fine-tuning, which is hard to do with small datasets. \n",
"\n",
"## Conclusion ##\n",
"Our goal is to predict distance given lap data. Seeing as Elysia's lap data contains 208 rows and 9 columns (after having dropped useless columns), **Linear Regression** is a suitable choice for the base model. As we are just aiming to display a simple model on the website for now, we want something that is lightweight, simple, and easy to interpret, and Linear Regression matches all these criteria. \n",
"Our goal is to predict distance given lap data. Seeing as Elysia's lap data contains 208 rows and 8 columns (after having dropped useless columns), **Linear Regression** is a suitable choice for the base model. As we are just aiming to display a simple model on the website for now, we want something that is lightweight, simple, and easy to interpret, and Linear Regression matches all these criteria. \n",
"\n",
"We do not want to risk running into KNN and dimensionality issues. Decision Trees, Random Forests, and Gradient Boosting either take too much computation, or we may even lack the data to fine tune these models. \n",
"\n",
@@ -77,12 +77,12 @@
"metadata": {},
"source": [
"## Import Modules and Check the Data Size ##\n",
"Should be 208x9"
"Should be 208x8"
]
},
{
"cell_type": "code",
"execution_count": 40,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -106,7 +106,6 @@
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>secondsdifference</th>\n",
" <th>totalpowerin</th>\n",
" <th>totalpowerout</th>\n",
" <th>netpowerout</th>\n",
@@ -120,7 +119,6 @@
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3264851</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
@@ -132,7 +130,6 @@
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>88892470</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
@@ -144,7 +141,6 @@
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>95990970</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
@@ -156,7 +152,6 @@
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>205000</td>\n",
" <td>793.532455</td>\n",
" <td>1803.385827</td>\n",
" <td>1009.853372</td>\n",
@@ -168,7 +163,6 @@
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3517768</td>\n",
" <td>798.704841</td>\n",
" <td>685.739611</td>\n",
" <td>-112.965230</td>\n",
@@ -188,11 +182,9 @@
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>203</th>\n",
" <td>325001</td>\n",
" <td>475.938651</td>\n",
" <td>1666.858482</td>\n",
" <td>1190.919832</td>\n",
@@ -204,7 +196,6 @@
" </tr>\n",
" <tr>\n",
" <th>204</th>\n",
" <td>2494999</td>\n",
" <td>700.176665</td>\n",
" <td>566.324558</td>\n",
" <td>-133.852107</td>\n",
@@ -216,7 +207,6 @@
" </tr>\n",
" <tr>\n",
" <th>205</th>\n",
" <td>343500</td>\n",
" <td>615.194700</td>\n",
" <td>1680.524756</td>\n",
" <td>1065.330056</td>\n",
@@ -228,7 +218,6 @@
" </tr>\n",
" <tr>\n",
" <th>206</th>\n",
" <td>330500</td>\n",
" <td>624.008188</td>\n",
" <td>1543.656620</td>\n",
" <td>919.648432</td>\n",
@@ -240,7 +229,6 @@
" </tr>\n",
" <tr>\n",
" <th>207</th>\n",
" <td>-1123290860</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
@@ -252,37 +240,37 @@
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>208 rows × 9 columns</p>\n",
"<p>208 rows × 8 columns</p>\n",
"</div>"
],
"text/plain": [
" secondsdifference totalpowerin totalpowerout netpowerout distance \\\n",
"0 3264851 0.000000 0.000000 0.000000 0.000000 \n",
"1 88892470 0.000000 0.000000 0.000000 0.000000 \n",
"2 95990970 0.000000 0.000000 0.000000 0.000000 \n",
"3 205000 793.532455 1803.385827 1009.853372 2.814824 \n",
"4 3517768 798.704841 685.739611 -112.965230 4.118535 \n",
".. ... ... ... ... ... \n",
"203 325001 475.938651 1666.858482 1190.919832 3.963359 \n",
"204 2494999 700.176665 566.324558 -133.852107 0.181906 \n",
"205 343500 615.194700 1680.524756 1065.330056 3.963687 \n",
"206 330500 624.008188 1543.656620 919.648432 3.966359 \n",
"207 -1123290860 0.000000 0.000000 0.000000 0.000000 \n",
" totalpowerin totalpowerout netpowerout distance amphours \\\n",
"0 0.000000 0.000000 0.000000 0.000000 69.800003 \n",
"1 0.000000 0.000000 0.000000 0.000000 126.800003 \n",
"2 0.000000 0.000000 0.000000 0.000000 98.699997 \n",
"3 793.532455 1803.385827 1009.853372 2.814824 97.800003 \n",
"4 798.704841 685.739611 -112.965230 4.118535 95.300003 \n",
".. ... ... ... ... ... \n",
"203 475.938651 1666.858482 1190.919832 3.963359 2.100000 \n",
"204 700.176665 566.324558 -133.852107 0.181906 4.700000 \n",
"205 615.194700 1680.524756 1065.330056 3.963687 3.000000 \n",
"206 624.008188 1543.656620 919.648432 3.966359 1.400000 \n",
"207 0.000000 0.000000 0.000000 0.000000 73.800003 \n",
"\n",
" amphours batterysecondsremaining averagespeed averagepackCurrent \n",
"0 69.800003 -1 0.000000 NaN \n",
"1 126.800003 -1 0.000000 NaN \n",
"2 98.699997 -1 0.000000 NaN \n",
"3 97.800003 22425 49.584229 15.70 \n",
"4 95.300003 56521 24.939703 6.07 \n",
".. ... ... ... ... \n",
"203 2.100000 406 43.894226 18.62 \n",
"204 4.700000 100215 1.106129 -5.78 \n",
"205 3.000000 611 41.645220 17.67 \n",
"206 1.400000 300 43.276732 16.79 \n",
"207 73.800003 -1 0.000000 NaN \n",
" batterysecondsremaining averagespeed averagepackCurrent \n",
"0 -1 0.000000 NaN \n",
"1 -1 0.000000 NaN \n",
"2 -1 0.000000 NaN \n",
"3 22425 49.584229 15.70 \n",
"4 56521 24.939703 6.07 \n",
".. ... ... ... \n",
"203 406 43.894226 18.62 \n",
"204 100215 1.106129 -5.78 \n",
"205 611 41.645220 17.67 \n",
"206 300 43.276732 16.79 \n",
"207 -1 0.000000 NaN \n",
"\n",
"[208 rows x 9 columns]"
"[208 rows x 8 columns]"
]
},
"metadata": {},
@@ -297,6 +285,10 @@
"from sklearn.model_selection import KFold\n",
"import numpy as np\n",
"\n",
"#global variables for randomness seed and test size\n",
"GLOBAL_RANDOM_STATE = 69420\n",
"GLOBAL_TEST_SIZE = 0.2\n",
"\n",
"packetTrainingDataPath=\"../training_data/Elysia.Laps.feather\"\n",
"df = pd.read_feather(packetTrainingDataPath)\n",
"df = df.drop(\n",
@@ -305,6 +297,7 @@
" \"_id.$oid\",\n",
" \"averagepackCurrent.$numberDouble\",\n",
" \"timestamp.$numberLong\",\n",
" \"secondsdifference\",\n",
" ]\n",
" )\n",
"\n",
@@ -320,15 +313,15 @@
},
{
"cell_type": "code",
"execution_count": 41,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE (root mean squared error) is: 36.22178511618355\n",
"MAE (mean absolute error) is: 6.9643536149352085\n",
"RMSE (root mean squared error) is: 36.21807130860871\n",
"MAE (mean absolute error) is: 6.9236865793943565\n",
"Minimum distance: -227.01091494750978\n",
"Maximum distance: 77.62573291015624\n",
"Average distance: 3.219106722576028\n",
@@ -346,12 +339,12 @@
"df = df.dropna(subset=['distance', 'averagepackCurrent', 'averagespeed'])\n",
"\n",
"#separate distance from the other features\n",
"X = df[['secondsdifference', 'totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', \n",
"X = df[['totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', \n",
" 'averagepackCurrent', 'batterysecondsremaining', 'averagespeed']]\n",
"y = df['distance']\n",
"\n",
"#split into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=GLOBAL_TEST_SIZE, random_state=GLOBAL_RANDOM_STATE)\n",
"\n",
"#train the baseline model with linear regression\n",
"linear_regression_model = LinearRegression()\n",
@@ -390,20 +383,20 @@
"metadata": {},
"source": [
"# Interpretation of Performance #\n",
"Given such little spread in the IQR (middle 50% of distance values), but RMSE of 36.22 and MAE of 6.96, it is clear that this baseline model is very inaccurate. There should be very few predictions of distance which differ from a value of 4. This suggests that there could be large outliers, and so lets try this again, but cleaning outliers and negative distance values (it does not make sense for distance to be negative.)"
"Given such a small spread in the IQR (the middle 50% of distance values), an RMSE of 36.22 and an MAE of 6.92 make it clear that this baseline model is very inaccurate: very few predictions of distance should differ much from a value of 4. This suggests there are large outliers, so let's try again after cleaning outliers and negative distance values (a negative distance does not make sense)."
]
},
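The outlier-cleaning step is collapsed in this diff; a minimal sketch of the standard 1.5×IQR fence it presumably applies, assuming the same `distance` column (the toy values below are illustrative only, not real lap data):

```python
import pandas as pd

# toy stand-in for the lap data; two values mimic the extreme outliers seen above
df = pd.DataFrame({"distance": [4.0, 3.9, 4.1, 3.95, -227.0, 77.6, 4.05]})

# standard 1.5*IQR fence on the target column
q1, q3 = df["distance"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df["distance"] >= lower) & (df["distance"] <= upper)]

# drop negative distances as well, matching the cell below
df = df[df["distance"] >= 0]
print(len(df))  # the two extreme values are filtered out, 5 rows remain
```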
{
"cell_type": "code",
"execution_count": 42,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE (root mean squared error) after removing outliers is: 1.711367174002224\n",
"MAE (mean absolute error) after removing outliers is: 0.9110000515477485\n",
"RMSE (root mean squared error) after removing outliers is: 1.711275200278704\n",
"MAE (mean absolute error) after removing outliers is: 0.9109172082259483\n",
"Minimum distance: 0.0162265625\n",
"Maximum distance: 8.745853515625\n",
"Average distance: 4.063722782813743\n",
@@ -423,12 +416,12 @@
"#remove negative distance values\n",
"df = df[df['distance'] >= 0]\n",
"#separate distance from the other features\n",
"X = df[['secondsdifference', 'totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', \n",
"X = df[['totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', \n",
" 'averagepackCurrent', 'batterysecondsremaining', 'averagespeed']]\n",
"y = df['distance']\n",
"\n",
"#split into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=GLOBAL_TEST_SIZE, random_state=GLOBAL_RANDOM_STATE)\n",
"\n",
"#retrain with linear regression\n",
"linear_regression_model.fit(X_train, y_train)\n",
@@ -451,20 +444,20 @@
"metadata": {},
"source": [
"# Re-evaluation of Performance #\n",
"It is clear to see that the accuracy has improved significantly, as RMSE has gone from 36.22 to 1.7, and MAE hsa gone from 6.96 to 0.9. This means the difference in predictions of distance from actual values are much smaller. However, there are other simple tests we can try to improve the model. "
"Accuracy has clearly improved significantly: RMSE has gone from 36.22 to 1.71, and MAE has gone from 6.92 to 0.91. This means the model's distance predictions are much closer to the actual values. However, there are other simple tests we can try to improve the model. "
]
},
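One such test is k-fold cross-validation. The manual fold loop in the next cell is collapsed in this diff; the same idea can be sketched more compactly with scikit-learn's `cross_val_score` (synthetic data here, since the feather file is not part of this snippet):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in for the 7 lap features and the distance target
rng = np.random.default_rng(69420)
X = rng.normal(size=(200, 7))
y = X @ rng.normal(size=7) + rng.normal(scale=0.1, size=200)

# same 5-fold shuffled split as the notebook
kf = KFold(n_splits=5, shuffle=True, random_state=69420)
neg_mae = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_absolute_error", cv=kf)
print(f"Cross-validated MAE: {-neg_mae.mean():.3f}")  # one score per fold, averaged
```

scikit-learn returns the *negated* MAE so that higher is always better; flip the sign to report the usual error.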
{
"cell_type": "code",
"execution_count": 43,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validated RMSE: 1.8882794269274583\n",
"Cross-validated MAE: 0.8029010278510377\n",
"Cross-validated RMSE: 1.0719666192936375\n",
"Cross-validated MAE: 0.5674581684709319\n",
"Minimum distance: 0.0162265625\n",
"Maximum distance: 8.745853515625\n",
"Average distance: 4.063722782813743\n",
@@ -476,7 +469,7 @@
],
"source": [
"#initialize the KFold cross validator\n",
"kf = KFold(n_splits=5, shuffle=True, random_state=69420)\n",
"kf = KFold(n_splits=5, shuffle=True, random_state=GLOBAL_RANDOM_STATE)\n",
"\n",
"#lists to store RMSE and MAE for each fold (each subset of data)\n",
"rmse_list = []\n",
@@ -517,9 +510,9 @@
"metadata": {},
"source": [
"# Final Evaluation of Performance #\n",
"Cross validation seems to worsen RMSE from 1.7 to 1.88, but improves MAE from 0.9 to 0.8. In general, low RMSE means the model is better at estimating outliers, whereas low MAE means the model is more robust and can predict around the median of data. Seeing as we will clean and remove the outliers anyway, we want to focus on predicting the majority of input data. So, we are prioritising a low MAE. Thus, this model using Cross-validation is the best so far. \n",
"Cross-validation improves both RMSE (from 1.71 to 1.07) and MAE (from 0.91 to 0.57). In general, a low RMSE means the model handles outliers better, whereas a low MAE means the model is more robust and predicts well around the median of the data. Since we clean and remove outliers anyway, we care most about predicting the majority of the input data, so we prioritise a low MAE. The distinction does not matter here because both metrics improve, but it will matter for future models if MAE improves while RMSE worsens. Thus, for this baseline, the cross-validated model is the best so far. \n",
"\n",
"Since the majority of our data is between 3.9-4.0 and MAE is 0.8, it means predictions are normally 20% off from the real distance value. That is, in the real world, this model can be used to predict the distance of a lap given input features such as power, `amphours`, `averagepackCurrent`, `batterysecondsremaining`, and `averagespeed`. With a MAE of 0.8, the model's predictions are typically within 0.8 units of the actual distance. This prediction can be useful for making data driven decisions when optimizing lap performance, planning energy consumption, and looking at improving overall efficiency. By reducing the MAE, we ensure that the model is able to predict the majority of the input data, which has many applications in optimizing different metrics of the car during race."
"Since the majority of our data lies between 3.9 and 4.0 and the MAE is 0.57, predictions are typically about 14% off from the real distance value. In practice, this model can predict the distance of a lap given input features such as power, `amphours`, `averagepackCurrent`, `batterysecondsremaining`, and `averagespeed`. With an MAE of 0.57, the model's predictions are typically within 0.57 units of the actual distance. These predictions can inform data-driven decisions when optimizing lap performance, planning energy consumption, and improving overall efficiency. By reducing the MAE, we ensure the model predicts well for the majority of the input data, which has many applications in optimizing different metrics of the car during a race."
]
}
],
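The "~14% off" figure quoted above is just the cross-validated MAE divided by a typical lap distance, taking the midpoint of the 3.9-4.0 band as the typical value; as a quick check:

```python
# cross-validated MAE reported in the notebook output above
mae = 0.5674581684709319
typical_distance = 3.95  # midpoint of the 3.9-4.0 band where most laps fall
relative_error = mae / typical_distance
print(f"{relative_error:.1%}")  # prints 14.4%
```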