Background
In the memory profiling work conducted by WSP during Phase 8 Interim, we identified that memory usage of the ActivitySim model is high when large “choosers” tables with inefficient data types are created. For example, in the work tour scheduling step in the ARC run with a 25 percent sample, the choosers table has 540 million rows and 80 columns – a memory footprint of about 250 GB. Among the 80 columns, there are five string variables including tour purpose (e.g., “work”, “school”), tour category (e.g., “mandatory”), and time period (e.g., “AM”); each takes about 32 GB of RAM. If we change string variables into something like enums using int8 data types, the memory footprint of this choosers table could be reduced from 250 GB to 102 GB. Many of the other columns unnecessarily use memory-intensive data types like float64 and int64. A logical next step is to optimize the data types used in ActivitySim, as part of the Phase 8 development.
Methodology
String variables
We looked into two alternatives to optimizing string variables:
The table below recaps the pros and cons of the two alternatives:
In consultation with the consortium members and the bench contractors, we decided to use pandas Categorical for converting string variables, mainly because the level of effort is lower and it preserves backward compatibility. We will modify the ActivitySim source code so that when a string variable is created, it is converted to pandas categorical.
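As a rough illustration of why the categorical conversion saves memory, the sketch below compares an object (string) column with the same data stored as a pandas categorical. The row count and the column contents are made up for the example, and this is not the ActivitySim implementation itself.

```python
import numpy as np
import pandas as pd

# Illustrative data only: one million tour purposes drawn from two labels.
n_rows = 1_000_000
rng = np.random.default_rng(0)
labels = ["work", "school"]

# Object dtype stores one Python string object per row.
as_object = pd.Series(rng.choice(labels, size=n_rows), dtype=object)

# Categorical dtype stores each label once plus a small integer code per row.
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(f"object:   {object_bytes / 1e6:.1f} MB")
print(f"category: {category_bytes / 1e6:.1f} MB")
print(f"codes dtype: {as_category.cat.codes.dtype}")
```

With only a handful of distinct labels, pandas stores the codes as int8, so the per-row cost is one byte instead of a full string object.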
Numeric variables
Numeric variables are created and used in the following sources:
Input data
Pre- and post-annotation of each ActivitySim sub-model
.py source code
For the numeric variables in the input data, users can define their data types in settings.yaml (see example here). For the numeric variables created in the annotations and in the source code, we can create a function that downcasts them based on the value ranges of the variables.
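A downcasting function of this kind could look roughly like the sketch below. The function name downcast_numeric and its two flags are hypothetical names chosen for this example, not the actual ActivitySim API; the sketch relies on pandas.to_numeric, which picks the smallest dtype that can hold the observed values.

```python
import pandas as pd


def downcast_numeric(df, downcast_int=True, downcast_float=False):
    """Return a copy of df with numeric columns downcast where requested.

    Hypothetical helper for illustration; the flag names are assumptions.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col) and downcast_int:
            # Smallest signed integer dtype that fits the value range.
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col) and downcast_float:
            # float32 at the smallest; this changes numeric precision.
            out[name] = pd.to_numeric(col, downcast="float")
    return out


# Made-up example table: column names and values are illustrative only.
persons = pd.DataFrame(
    {
        "age": pd.Series([0, 35, 99], dtype="int64"),
        "home_zone": pd.Series([1, 250, 40000], dtype="int64"),
        "income_share": pd.Series([0.5, 1.25, 2.0], dtype="float64"),
    }
)

ints_only = downcast_numeric(persons)
ints_and_floats = downcast_numeric(persons, downcast_float=True)
print(ints_only.dtypes)
print(ints_and_floats.dtypes)
```

Keeping the integer and float switches separate reflects the fact that integer downcasting is lossless within the observed range, while float downcasting trades precision for memory.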
Implementation Details
Overview
The string to pandas categorical conversion happens under the hood, in the ActivitySim source code, and should require minimal work for users to implement their models with this change. The downcasting of numeric variables is implemented as an option that users can turn on and off.
String to pandas categorical
Although pandas categorical data type is a convenient solution to the memory issue, we have found the following caveats during implementation:
Assigning a value that is not already defined in the existing categories results in an error. There are places in the ActivitySim source code, as well as in model UECs, where new values are assigned to existing string columns (Example 1, Example 2, Example 3). We need to make sure those new values are pre-defined in the categories.
Pandas categorical is fragile with pandas merge(). Merging two categorical columns with different categories results in an object type (string) column and cancels the memory savings. Hence, before joining, we should make sure the two categorical columns use the same categories.
Calling pandas groupby() on a pandas categorical column will by default create groups for all pre-defined categories. This will crash the model if some pre-defined categories are not observed in the data. Example code.
We could also convert numeric variables to pandas categorical, but they would then not work with numeric operations (Example). We suggest not using pandas categorical for numeric variables in ActivitySim.
Time period variables needed special treatment in ActivitySim. Because they are used in Sharrow to look up skims, and Sharrow requires them to be ordered, we converted time period string variables to ordered pandas categoricals.
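The caveats above can be reproduced with plain pandas. The following is a minimal sketch using made-up tables and labels; the time period labels other than "AM" and all column names are assumptions for illustration, and none of this is taken from the ActivitySim source.

```python
import pandas as pd

# Caveat 1: assigning a value outside the categories raises an error.
purpose = pd.Series(["work", "school", "work"], dtype="category")
try:
    purpose.iloc[0] = "escort"
    assignment_failed = False
except (TypeError, ValueError):
    assignment_failed = True
# Declaring the new category first makes the assignment valid.
purpose = purpose.cat.add_categories(["escort"])
purpose.iloc[0] = "escort"

# Caveat 2: merging on categorical keys with different categories
# does not keep the categorical dtype.
tours = pd.DataFrame(
    {"purpose": pd.Series(["work", "school"], dtype="category"), "tour_id": [1, 2]}
)
coefs = pd.DataFrame(
    {"purpose": pd.Series(["work", "escort"], dtype="category"), "coef": [0.5, 0.1]}
)
merged_mismatch = tours.merge(coefs, on="purpose", how="left")
# Casting both keys to one shared dtype before merging keeps it categorical.
shared = pd.CategoricalDtype(["work", "school", "escort"])
merged_aligned = tours.astype({"purpose": shared}).merge(
    coefs.astype({"purpose": shared}), on="purpose", how="left"
)

# Caveat 3: groupby can create groups for categories that never occur.
trips = pd.DataFrame(
    {
        "purpose": pd.Categorical(["work", "work"], categories=["work", "school"]),
        "n": [1, 1],
    }
)
all_groups = trips.groupby("purpose", observed=False)["n"].sum()
seen_groups = trips.groupby("purpose", observed=True)["n"].sum()

# Caveat 4: numeric values stored as categorical reject arithmetic.
sizes = pd.Series([1, 2, 3], dtype="category")
try:
    sizes + 1
    arithmetic_failed = False
except TypeError:
    arithmetic_failed = True

# Caveat 5: an ordered categorical fixes the order of the integer codes.
period_dtype = pd.CategoricalDtype(["EA", "AM", "MD", "PM", "EV"], ordered=True)
periods = pd.Series(["PM", "AM", "EV"], dtype=period_dtype)
period_codes = periods.cat.codes.tolist()

print(merged_mismatch["purpose"].dtype, merged_aligned["purpose"].dtype)
print(len(all_groups), len(seen_groups), period_codes)
```

Passing observed explicitly to groupby() avoids depending on the pandas default, which differs between pandas versions.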
Downcasting numeric variables
In our tests, downcasting numeric variables further reduced the memory requirement of ActivitySim. However, changing the precision of numeric variables, especially float variables, caused the model results to change slightly. Hence, we have implemented numeric downcasting as a switch in the ActivitySim settings, turned off by default.
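To see why float downcasting can shift results, the sketch below runs the same logit-style probability calculation in float64 and float32. The utilities are random numbers and the calculation is only an illustrative stand-in, not ActivitySim's actual utility code.

```python
import numpy as np

# Made-up utilities: 5 choosers by 4 alternatives.
rng = np.random.default_rng(42)
utilities64 = rng.normal(loc=-2.0, scale=1.5, size=(5, 4))
utilities32 = utilities64.astype(np.float32)


def choice_probabilities(utilities):
    """Softmax over alternatives, computed in the dtype of the input."""
    shifted = utilities - utilities.max(axis=1, keepdims=True)
    exp_u = np.exp(shifted)
    return exp_u / exp_u.sum(axis=1, keepdims=True)


probs64 = choice_probabilities(utilities64)
probs32 = choice_probabilities(utilities32)

# The probabilities agree to roughly seven significant digits, not exactly.
max_abs_diff = np.abs(probs64 - probs32.astype(np.float64)).max()
print(f"float32 result dtype: {probs32.dtype}")
print(f"largest probability difference: {max_abs_diff:.2e}")
```

Differences of this size are tiny, but a random draw that lands very close to a cumulative probability boundary can select a different alternative, which is one plausible way small precision changes become visible in model output.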
Other notable findings
When running an ActivitySim model with Sharrow turned on, household debug tracing requires additional memory. This was presented at the September 26, 2023 meeting; see issue #754.
When running an ActivitySim model with Sharrow turned on, additional memory was held unnecessarily (a memory leak) because the Sharrow flow cache was not being released properly. This was presented at the September 26, 2023 meeting; Jeff investigated and fixed it in PR #751.
When running an ActivitySim model with Sharrow turned on, utility expressions that compare a pandas categorical variable to strings can be evaluated incorrectly. This has been documented in issue #766.
Results
prototype_arc 25% sample
The memory footprint of the work tour scheduling choosers table in the 25% sample ARC run dropped from 254 GB to 79 GB after the data type optimization. The data type optimization alone reduced the peak memory from 491 GB to 335 GB. The implementation also includes the fix for the memory leak we discovered in Sharrow, which reduced the peak memory by another 27 GB. Overall, the data type optimization work, along with the memory leak fix in Sharrow, reduced the peak memory of the 25% ARC run from 491 GB to 308 GB. The chart below shows the memory profile of the 25% ARC model before and after data type optimization.
prototype_mtc_extended 100%
In our latest test with the extended MTC model, we found that school escorting (added in Phase 7) is the new memory peak, instead of the mandatory tour scheduling model. The data type optimization has brought down the memory peak from 375 GB to 154 GB (excluding school escorting), and from 490 GB to 380 GB (including school escorting). The chart below shows the memory profile of the 100% extended MTC model before and after data type optimization.
run time implication
In addition to the memory reduction, we also observed a run time reduction (from 488 minutes to 359 minutes) for the ARC model after the data type optimization. However, we did not see a run time reduction for the extended MTC model.
Guidance for the future
Converting string variables to pandas categorical is a quick solution that reduces the memory burden created by strings, but it does not remove the use of strings in ActivitySim. Although it has brought down the memory requirement greatly, it also has a few caveats, as documented above. In future development, a more systematic way of truly getting rid of strings (such as a data type model with IntEnum) would be worth looking into.
Relevant discussions/presentations can be found at:
Project-Meeting-2023.06.27
Project-Meeting-2023.07.18
Project-Meeting-2023.08.08
Project-Meeting-2023.08.22
Project-Meeting-2023.08.29
Project-Meeting-2023.09.12
Project-Meeting-2023.09.26
Project-Meeting-2023.10.10
Project-Meeting-2023.12.12