From db9afcc5825ebda074f2e238feae523abfacedf5 Mon Sep 17 00:00:00 2001 From: Hsin Chen Date: Tue, 27 Jun 2023 18:58:53 -0700 Subject: [PATCH 1/3] add dfp model card --- models/model-cards/dfp-model-card.md | 247 +++++++++++++++++++++++++++ 1 file changed, 247 insertions(+) create mode 100644 models/model-cards/dfp-model-card.md diff --git a/models/model-cards/dfp-model-card.md b/models/model-cards/dfp-model-card.md new file mode 100644 index 0000000000..e21daceb57 --- /dev/null +++ b/models/model-cards/dfp-model-card.md @@ -0,0 +1,247 @@ +# Model Overview + +## Description: +This use case is currently implemented to detect changes in users' behavior that indicate a change from a human to a machine or a machine to a human. The model architecture consists of an Autoencoder, where the reconstruction loss of new log data is used as an anomaly score. + +## Reference(s): +- https://github.com/AlliedToasters/dfencoder/blob/master/dfencoder/autoencoder.py +- Rasheed Peng Alhajj Rokne Jon: Fourier Transform Based Spatial Outlier Mining 2009 - https://link.springer.com/chapter/10.1007/978-3-642-04394-9_39 + +## Model Architecture: +The model architecture consists of an Autoencoder, where the reconstruction loss of new log data is used as an anomaly score. + +**Architecture Type:** +* Autoencoder + +**Network Architecture:** +* The network architecture of the model includes a 2-layer encoder with dimensions [512, 500] and a 1-layer decoder with dimensions [512] + +## Input: +**Input Format:** +* AWS CloudTrail logs in JSON format + +**Input Parameters:** +* None + +**Other Properties Related to Input:** +* Not Applicable (N/A) + +## Output: +**Output Format:** +* Anomaly score and the reconstruction loss for each feature in a pandas DataFrame + +**Output Parameters:** +* None + +**Other Properties Related to Output:** +* Not Applicable + +## Software Integration: +**Runtime(s):** +* Morpheus + +**Supported Hardware Platform(s):**
+* Ampere/Turing
+ +**Supported Operating System(s):**
+* Linux
+ +## Model Version(s): +* https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/dfp-models/hammah-role-g-20211017-dill.pkl +* https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/dfp-models/hammah-user123-20211017-dill.pkl + +# Training & Evaluation: + +## Training Dataset: + +**Link:** +* https://github.com/nv-morpheus/Morpheus/tree/branch-23.07/models/datasets/training-data/cloudtrail + +**Properties (Quantity, Dataset Descriptions, Sensor(s)):** + +The training dataset consists of AWS CloudTrail logs. It contains logs from two entities, providing information about their activities within the AWS environment. +* [hammah-role-g-training-part1.json](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/datasets/training-data/cloudtrail/hammah-role-g-training-part1.json): 700 records
+* [hammah-role-g-training-part2.json](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/datasets/training-data/cloudtrail/hammah-role-g-training-part2.json): 1187 records
+* [hammah-user123-training-part2.json](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/datasets/training-data/cloudtrail/hammah-user123-training-part2.json): 1000 records
+* [hammah-user123-training-part3.json](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/datasets/training-data/cloudtrail/hammah-user123-training-part3.json): 1000 records
+* [hammah-user123-training-part4.json](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/datasets/training-data/cloudtrail/hammah-user123-training-part4.json): 387 records
+ +**Dataset License:** +* [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)
+ +## Evaluation Dataset: +**Link:** +* https://github.com/nv-morpheus/Morpheus/tree/branch-23.07/models/datasets/validation-data/cloudtrail
+ +**Properties (Quantity, Dataset Descriptions, Sensor(s)):** + +The evaluation dataset consists of AWS CloudTrail logs. It contains logs from two entities, providing information about their activities within the AWS environment. +* [hammah-role-g-validation.json](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/datasets/validation-data/cloudtrail/hammah-role-g-validation.json): 314 records +* [hammah-user123-validation-part1.json](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/datasets/validation-data/cloudtrail/hammah-user123-validation-part1.json): 300 records +* [hammah-user123-validation-part2.json](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/datasets/validation-data/cloudtrail/hammah-user123-validation-part2.json): 300 records +* [hammah-user123-validation-part3.json](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/datasets/validation-data/cloudtrail/hammah-user123-validation-part3.json): 247 records + +**Dataset License:** +* [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)
+ +## Inference: +**Engine:** +* PyTorch + +**Test Hardware:** +* Other + +# Subcards + +## Model Card ++ Bias Subcard + +### What is the gender balance of the model validation data? +* Not Applicable + +### What is the racial/ethnicity balance of the model validation data? +* Not Applicable + +### What is the age balance of the model validation data? +* Not Applicable + +### What is the language balance of the model validation data? +* English (cloudtrail logs): 100% + +### What is the geographic origin language balance of the model validation data? +* Not Applicable + +### What is the educational background balance of the model validation data? +* Not Applicable + +### What is the accent balance of the model validation data? +* Not Applicable + +### What is the face/key point balance of the model validation data? +* Not Applicable + +### What is the skin/tone balance of the model validation data? +* Not Applicable + +### What is the religion balance of the model validation data? +* Not Applicable + +### Individuals from the following adversely impacted (protected classes) groups participate in model design and testing. +* Not Applicable + +### Describe measures taken to mitigate against unwanted bias. +* Not Applicable + +## Model Card ++ Explainability Subcard + +### Name example applications and use cases for this model. +* This model is primarily designed for testing purposes and serves as a small pretrained model specifically used to evaluate and validate the DFP pipeline. Its application is focused on assessing the effectiveness of the pipeline rather than being intended for broader use cases or specific applications beyond testing. + +### Fill in the blank for the model technique. +* This model is designed for developers seeking to test the DFP pipeline with a small pretrained model trained on a synthetic dataset. + +### Name who is intended to benefit from this model. 
+* The intended beneficiaries of this model are developers who aim to test the performance and functionality of the DFP pipeline using synthetic datasets. It may not be suitable or provide significant value for real-world cloudtrail logs analysis. + +### Describe the model output. +* The model calculates an anomaly score for each input based on the reconstruction loss obtained from the trained Autoencoder. This score represents the level of anomaly detected in the input data. Higher scores indicate a higher likelihood of anomalous behavior. +* The model provides the reconstruction loss of each feature to facilitate further testing and debugging of the pipeline. + +### List the steps explaining how this model works. +* The model works by training on baseline behaviors and subsequently detecting deviations from the established baseline, triggering alerts accordingly. +* [Training notebook](https://github.com/nv-morpheus/Morpheus/blob/branch-23.07/models/training-tuning-scripts/dfp-models/hammah-20211017.ipynb) + +### Name the adversely impacted groups (protected classes) this has been tested to deliver comparable outcomes regardless of: +* Not Applicable + +### List the technical limitations of the model. +* Model expects cloudtrail logs with specific features that match the training dataset. Data lacking the required features or requiring a different feature set may not be compatible with the model. + +### What performance metrics were used to affirm the model's performance? +* The model's performance was evaluated based on its ability to correctly identify anomalous behavior in the synthetic dataset during testing. + +### What are the potential known risks to users and stakeholders? +* None + +### What training is recommended for developers working with this model? If none, please state "none." +* Familiarity with the Morpheus SDK is recommended for developers working with this model. 
+ +### Link the relevant end user license agreement +* [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) + +## Model Card ++ Safety & Security Subcard + +### Link the location of the training dataset's repository (if able to share). +* https://github.com/nv-morpheus/Morpheus/tree/branch-23.07/models/datasets/training-data/cloudtrail + +### Is the model used in an application with physical safety impact? +* No + +### Describe physical safety impact (if present). +* None + +### Was model and dataset assessed for vulnerability for potential form of attack? +* No + +### Name applications for the model. +* The primary application for this model is testing the Morpheus pipeline. + +### Name use case restrictions for the model. +* The model's use case is restricted to testing the Morpheus pipeline and may not be suitable for other applications. + +### Has this been verified to have met prescribed quality standards? +* No + +### Name target quality Key Performance Indicators (KPIs) for which this has been tested. +* None + +### Technical robustness and model security validated? +* No + +### Is the model and dataset compliant with National Classification Management Society (NCMS)? +* No + +### Are there explicit model and dataset restrictions? +* No + +### Are there access restrictions to systems, model, and data? +* No + +### Is there a digital signature? +* No + + +## Model Card ++ Privacy Subcard + + +### Generatable or reverse engineerable personally-identifiable information (PII)? +* Neither + +### Was consent obtained for any PII used? +* No PII was used in the training of this pretrained DFP model. The dataset used is synthetic and generated using the python faker package. Any resemblance to real individuals is purely coincidental. + +### Protected classes used to create this model? (The following were used in model the model's training:) +* Not applicable + +### How often is dataset reviewed?
+* The dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for any changes. + +### Is a mechanism in place to honor data subject right of access or deletion of personal data? +* No (as the dataset is fully synthetic) + +### If PII collected for the development of this AI model, was it minimized to only what was required? +* Not Applicable (no PII collected) + +### Is data in dataset traceable? +* No + +### Scanned for malware? +* No + +### Are we able to identify and trace source of dataset? +* Yes ([fully synthetic dataset](https://github.com/nv-morpheus/Morpheus/tree/branch-23.07/models/datasets/training-data/cloudtrail)) + +### Does data labeling (annotation, metadata) comply with privacy laws? +* Not applicable (as the dataset is fully synthetic) + +### Is data compliant with data subject requests for data correction or removal, if such a request was made? +* Not applicable (as the dataset is fully synthetic) From 1d4bae451f9d87ca998f17303c6f96dfb2b01b52 Mon Sep 17 00:00:00 2001 From: Hsin Chen Date: Tue, 27 Jun 2023 21:10:53 -0700 Subject: [PATCH 2/3] update information for faker --- models/model-cards/dfp-model-card.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/models/model-cards/dfp-model-card.md b/models/model-cards/dfp-model-card.md index e21daceb57..5ffb8596aa 100644 --- a/models/model-cards/dfp-model-card.md +++ b/models/model-cards/dfp-model-card.md @@ -134,7 +134,7 @@ The evaluation dataset consists of AWS CloudTrail logs. It contains logs from tw ## Model Card ++ Explainability Subcard ### Name example applications and use cases for this model. -* This model is primarily designed for testing purposes and serves as a small pretrained model specifically used to evaluate and validate the DFP pipeline. Its application is focused on assessing the effectiveness of the pipeline rather than being intended for broader use cases or specific applications beyond testing. 
+* The model is primarily designed for testing purposes and serves as a small pretrained model specifically used to evaluate and validate the DFP pipeline. Its application is focused on assessing the effectiveness of the pipeline rather than being intended for broader use cases or specific applications beyond testing. ### Fill in the blank for the model technique. * This model is designed for developers seeking to test the DFP pipeline with a small pretrained model trained on a synthetic dataset. @@ -154,7 +154,7 @@ The evaluation dataset consists of AWS CloudTrail logs. It contains logs from tw * Not Applicable ### List the technical limitations of the model. -* Model expects cloudtrail logs with specific features that match the training dataset. Data lacking the required features or requiring a different feature set may not be compatible with the model. +* The model expects cloudtrail logs with specific features that match the training dataset. Data lacking the required features or requiring a different feature set may not be compatible with the model. ### What performance metrics were used to affirm the model's performance? * The model's performance was evaluated based on its ability to correctly identify anomalous behavior in the synthetic dataset during testing. @@ -217,7 +217,7 @@ The evaluation dataset consists of AWS CloudTrail logs. It contains logs from tw * Neither ### Was consent obtained for any PII used? -* No PII was used in the training of this pretrained DFP model. The dataset used is synthetic and generated using the python faker package. Any resemblance to real individuals is purely coincidental. +* The synthetic data used in this model is generated using the [faker](https://github.com/joke2k/faker/blob/master/LICENSE.txt) python package. The user agent field is generated by faker, which pulls items from its own dataset of fictitious values (located in the linked repo). 
Similarly, the event source field is randomly chosen from a list of event names provided in the AWS documentation. There are no privacy concerns or PII involved in this synthetic data generation process. ### Protected classes used to create this model? (The following were used in model the model's training:) * Not applicable From d1a13dc6edaaaa9505d185d5554068337fe2ea00 Mon Sep 17 00:00:00 2001 From: Hsin Chen Date: Wed, 28 Jun 2023 17:34:02 -0700 Subject: [PATCH 3/3] add copyright --- models/model-cards/dfp-model-card.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/models/model-cards/dfp-model-card.md b/models/model-cards/dfp-model-card.md index 5ffb8596aa..6351c3e7e4 100644 --- a/models/model-cards/dfp-model-card.md +++ b/models/model-cards/dfp-model-card.md @@ -1,3 +1,20 @@ + + # Model Overview ## Description: