improving CM automation recipes for CUDA and improving docs (#1088)
arjunsuresh authored Feb 2, 2024
2 parents ff483c7 + c5ea901 commit a31610e
Showing 13 changed files with 340 additions and 128 deletions.
86 changes: 48 additions & 38 deletions README.md
@@ -18,45 +18,37 @@
### About

Collective Mind (CM) is a [community project](CONTRIBUTING.md) to develop
a [collection of portable and extensible automation recipes
with a human-friendly interface (aka CM scripts)](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
that can be reused in different projects to modularize, run, benchmark and optimize complex AI/ML applications
a [collection of portable, extensible, technology-agnostic and ready-to-use automation recipes
with a human-friendly interface (aka CM scripts)](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
that automate all the manual steps required to build, run, benchmark and optimize complex ML/AI applications on any platform
with any software and hardware.

CM scripts are being developed based on the feedback from [MLCommons engineers and researchers](docs/taskforce.md)
to help them assemble, run, benchmark and optimize complex AI/ML applications
across diverse and continuously changing models, data sets, software and hardware
from Nvidia, Intel, AMD, Google, Qualcomm, Amazon and other vendors.
They require Python 3.7+ with minimal dependencies and can run natively on Ubuntu, MacOS, Windows, RHEL, Debian, Amazon Linux
and any other operating system, in a cloud or inside automatically generated containers.

Some key requirements for the CM design are:
* must be non-intrusive and easy to debug, require zero changes to existing projects and must complement, reuse, wrap and interconnect all existing automation scripts and tools (such as cmake, ML workflows, python poetry and containers) rather than substituting them;
* must have a very simple and human-friendly command line with a Python API and minimal dependencies;
* must require minimal or zero learning curve by using plain Python, native scripts, environment variables and simple JSON/YAML descriptions instead of inventing new languages;
* must run in a native environment with Ubuntu, Debian, RHEL, Amazon Linux, MacOS, Windows and any other operating system while automatically generating container snapshots with CM recipes for repeatability and reproducibility.

Below you can find a few examples of this collaborative engineering effort sponsored by [MLCommons (non-profit organization with 125+ organizations)](https://mlcommons.org) -
a few of the most commonly used [automation recipes](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
that can be chained into more complex automation workflows [using simple JSON or YAML](https://github.com/mlcommons/ck/blob/master/cm-mlops/script/app-image-classification-onnx-py/_cm.yaml).
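
A chained workflow is simply a script whose dependencies name other recipes by tags; a minimal sketch in the spirit of the linked `_cm.yaml` (field names follow the CM script convention, while the concrete tags and values here are illustrative rather than copied from the repository):

```yaml
# Sketch of a CM script definition that chains other automation recipes.
# Field names follow the _cm.yaml convention linked above; the concrete
# tags and values are illustrative.
alias: app-image-classification-onnx-py
automation_alias: script
tags:
- app
- image-classification
- onnx
- python
deps:                        # each entry resolves another CM script by its tags
- tags: detect,os
- tags: get,python
  version_min: "3.7"
- tags: get,generic-python-lib,_onnxruntime
- tags: get,ml-model,image-classification,onnx
- tags: get,dataset,imagenet,preprocessed
```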

You can try them yourself (you only need Python 3.7+, PIP, git and wget installed and optionally Docker if you want to
run CM scripts via automatically-generated containers - check the [installation guide](docs/installation.md) for more details).

*Note that the Collective Mind concept is to continue improving portability and functionality
of all CM automation recipes across rapidly evolving models, data sets, software and hardware
based on collaborative testing and feedback - don't hesitate to report encountered issues
[here](https://github.com/mlcommons/ck/issues) and/or contact us via [public Discord Server](https://discord.gg/JjWNWXKxwT)
to help this community effort!*

CM was originally designed based on the following feedback and requirements
from MLCommons engineers and researchers, who asked for a common, technology-agnostic automation layer
to help them simplify and automate the development of complex MLPerf benchmarks and AI applications with diverse ML models
while making this process more repeatable and deterministic:

* [CM automations](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
should run on any platform with any operating system either natively or inside containers
in a unified and automated way;
* should require minimal learning curve and minimal software dependencies;
* should be non-intrusive and require minimal or no changes to existing projects;
* should automate all manual steps to prepare and run AI projects including detection or
installation of all dependencies (models, code and data), substituting local paths,
updating environment variables and generating command lines for a given platform;
* should be able to run native user scripts while unifying input/output to reuse all existing work;
* should avoid using complex Domain Specific Languages (DSL);
* should use plain Python with simple JSON/YAML configurations for portable automations;
* should be easily understandable and extensible even by non-specialists;
* should have a human-friendly command line with a very simple Python API.

However, the community also started using and extending
[individual CM automation recipes](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
to modularize and run other software projects and reproduce [research papers at Systems and ML conferences](https://cTuning.org/ae/micro2023.html) -
please check the [**Getting Started Guide**](docs/getting-started.md)
to understand how they work, how to reuse and extend them for your projects,
and how to share your own automations in your public or private projects.


Just to give you a flavor of the [CM automation recipes](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
that can be chained into more complex automation workflows [using simple JSON or YAML](https://github.com/mlcommons/ck/blob/master/cm-mlops/script/app-image-classification-onnx-py/_cm.yaml),
here are a few of the most commonly used automation examples from CM users
that you can try yourself on Linux, MacOS, Windows and other platforms
with any hardware (you only need Python 3.7+, git, wget and PIP installed
on your platform - check the [installation guide](docs/installation.md) for more details):

<details open>
<summary><b>CM human-friendly command line:</b></summary>
@@ -119,6 +111,9 @@ cm pull repo --url=https://zenodo.org/records/10581696/files/cm-mlops-repo-20240
cmr "install llvm prebuilt" --version=17.0.6
cmr "app image corner-detection"

cm run experiment --tags=tuning,experiment,batch_size -- echo --batch_size={{VAR1{range(1,8)}}}
cm replay experiment --tags=tuning,experiment,batch_size
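# (illustrative note: the {{VAR1{range(1,8)}}} template above asks the
#  "experiment" automation to re-run the command once per value in range(1,8),
#  i.e. batch sizes 1..7, recording every run so that "cm replay" can
#  reproduce a chosen one later)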

cmr "get conda"

cm pull repo ctuning@cm-reproduce-research-projects
@@ -149,7 +144,7 @@ if output['return']==0: print (output)
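
The one-liner above is the tail of the collapsed Python API example; for reference, a minimal self-contained version might look like this (the `cmind.access` dict interface is the documented entry point, while the specific tags and version simply mirror the CLI examples above):

```python
import cmind

# Run a CM automation recipe (script) through the Python API.
# 'action'/'automation' follow the cmind.access convention; the tags and
# version below are illustrative.
output = cmind.access({'action': 'run',
                       'automation': 'script',
                       'tags': 'install,llvm,prebuilt',
                       'version': '17.0.6'})

if output['return'] == 0:
    print(output)
else:
    print(output.get('error', ''))
```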


<details open>
<summary><b>Modular containers and GitHub actions with CM commands:</b></summary>
<summary><b>Examples of modular containers and GitHub actions with CM commands:</b></summary>

<small>

@@ -160,6 +155,21 @@

</details>

[CM scripts](https://github.com/mlcommons/ck/blob/master/docs/list_of_scripts.md)
were successfully used to [modularize MLPerf inference benchmarks](https://github.com/mlcommons/ck/blob/master/docs/mlperf/inference/README.md)
and help the community automate more than 95% of all performance and power submissions in the v3.1 round
across more than 120 system configurations (models, frameworks, hardware)
while reducing development and maintenance costs.

Besides automating MLCommons projects, the community also started using
and extending [CM scripts](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
to modularize, run and benchmark other software projects and make it
easier to rerun, reproduce and reuse [research projects from published papers
at Systems and ML conferences](https://cTuning.org/ae/micro2023.html).

Please check the [**Getting Started Guide**](docs/getting-started.md)
to understand how CM automation recipes work, how to use them to automate your own projects,
and how to implement and share new automations in your public or private projects.

### Documentation

@@ -175,7 +185,7 @@

* ACM REP'23 keynote about MLCommons CM: [slides](https://doi.org/10.5281/zenodo.8105339)
* ACM TechTalk'21 about automating research projects: [YouTube](https://www.youtube.com/watch?v=7zpeIVwICa4)
* MLPerf inference submitter orientation: [slides](https://doi.org/10.5281/zenodo.8144274)
* MLPerf inference submitter orientation: [v3.1 slides](https://doi.org/10.5281/zenodo.10605079), [v3.0 slides](https://doi.org/10.5281/zenodo.8144274)

### Get in touch

1 change: 1 addition & 0 deletions cm-mlops/automation/script/_cm.json
@@ -7,6 +7,7 @@
},
"desc": "Making native scripts more portable, interoperable and deterministic",
"developers": "[Arjun Suresh](https://www.linkedin.com/in/arjunsuresh), [Grigori Fursin](https://cKnowledge.org/gfursin)",
"actions_with_help":["run"],
"sort": 1000,
"tags": [
"automation"
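The new `actions_with_help` entry is what lets the `run` action answer `--help`; after this change, a call along these lines should print the script-specific help implemented in `module.py` below (the script tags are just an example):

```bash
cm run script "install llvm prebuilt" --help
```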
47 changes: 47 additions & 0 deletions cm-mlops/automation/script/module.py
@@ -650,6 +650,50 @@ def run(self, i):
        meta = script_artifact.meta
        path = script_artifact.path

        # If the caller passed --help, print script-specific help and stop
        if i.get('help', False):
            print('')
            print('Help for this CM script (automation recipe):')

            # Variations are selected on the command line with a "_" prefix
            variations = meta.get('variations', {})
            if len(variations) > 0:
                print('')
                print('Available variations:')
                print('')
                for v in sorted(variations):
                    print(' _' + v)

            # Flags that are mapped directly to environment variables
            input_mapping = meta.get('input_mapping', {})
            if len(input_mapping) > 0:
                print('')
                print('Available flags mapped to environment variables:')
                print('')
                for k in sorted(input_mapping):
                    v = input_mapping[k]

                    print(' --{} -> --env.{}'.format(k, v))

            # Documented input flags (keys of the Python API input dict)
            input_description = meta.get('input_description', {})
            if len(input_description) > 0:
                print('')
                print('Available flags (Python API dict keys):')
                print('')
                for k in sorted(input_description):
                    v = input_description[k]
                    n = v.get('desc', '')

                    x = ' --' + k
                    if n != '':
                        x += ' ({})'.format(n)

                    print(x)

            print('')
            input('Press Enter to see common flags for all scripts')

            return {'return': 0}

        deps = meta.get('deps', [])
        post_deps = meta.get('post_deps', [])
        prehook_deps = meta.get('prehook_deps', [])
@@ -3416,6 +3460,7 @@ def update_deps(self, i):
        * return (int): return code == 0 if no error and >0 if error
        * (error) (str): error string if return>0
        """

        deps = i['deps']
        add_deps = i['update_deps']
        update_deps(deps, add_deps, False)
@@ -4102,6 +4147,7 @@ def prepare_and_run_script_with_postprocessing(i, postprocess="postprocess"):

return rr

##############################################################################
def run_detect_version(customize_code, customize_common_input, recursion_spaces, env, state, const, const_state, meta, verbose=False):

if customize_code is not None and 'detect_version' in dir(customize_code):
@@ -4124,6 +4170,7 @@

return {'return': 0}

##############################################################################
def run_postprocess(customize_code, customize_common_input, recursion_spaces, env, state, const, const_state, meta, verbose=False, run_script_input=None):

if customize_code is not None and 'postprocess' in dir(customize_code):
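Taken together, the help path added to `run()` above prints the script's variations, flag-to-environment mappings and documented inputs before pausing for the common flags; for a hypothetical script the output would look roughly like:

```
Help for this CM script (automation recipe):

Available variations:

 _cuda
 _for-nvidia-mlperf-inference-v3.1-gptj

Available flags mapped to environment variables:

 --version -> --env.CM_VERSION

Press Enter to see common flags for all scripts
```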
1 change: 1 addition & 0 deletions cm-mlops/script/get-cudnn/_cm.json
@@ -33,6 +33,7 @@
"new_env_keys": [
"CM_CUDNN_*",
"CM_CUDA_PATH_LIB_CUDNN",
"CM_CUDA_PATH_INCLUDE_CUDNN",
"CM_CUDA_PATH_LIB_CUDNN_EXISTS",
"+PATH",
"+C_INCLUDE_PATH",
1 change: 1 addition & 0 deletions cm-mlops/script/get-cudnn/customize.py
@@ -116,6 +116,7 @@ def preprocess(i):
    cuda_inc_path = env['CM_CUDA_PATH_INCLUDE']
    cuda_lib_path = env['CM_CUDA_PATH_LIB']
    env['CM_CUDA_PATH_LIB_CUDNN'] = env['CM_CUDA_PATH_LIB']
    env['CM_CUDA_PATH_INCLUDE_CUDNN'] = env['CM_CUDA_PATH_INCLUDE']

    try:
        print("Copying cudnn include files to {} (CUDA_INCLUDE_PATH)".format(cuda_inc_path))
31 changes: 29 additions & 2 deletions cm-mlops/script/install-pytorch-from-src/_cm.json
@@ -108,7 +108,8 @@
}
},
"env": {
"CM_CONDA_ENV": "yes"
"CM_CONDA_ENV": "yes",
"CM_MLPERF_INFERENCE_INTEL": "yes"
},
"deps": [
{
@@ -215,8 +216,34 @@
"libstdcxx-ng"
],
"tags": "get,generic,conda-package,_package.libstdcxx-ng,_source.conda-forge"
}
}
]
},
"for-nvidia-mlperf-inference-v3.1-gptj": {
"base": [
"checkout.b5021ba9",
"cuda"
],
"env": {
"CM_CONDA_ENV": "yes"
},
"deps": [
{
"tags": "get,conda,_name.nvidia"
}
]
},
"cuda": {
"deps": {
"tags": "get,cuda,cudnn",
"names": [ "cuda" ]
},
"env": {
"CUDA_HOME": "<<<CM_CUDA_INSTALLED_PATH>>>",
"CUDNN_LIBRARY_PATH": "<<<CM_CUDA_PATH_LIB_CUDNN>>>",
"CUDNN_INCLUDE_PATH": "<<<CM_CUDA_PATH_INCLUDE_CUDNN>>>",
"CUDA_NVCC_EXECUTABLE": "<<<CM_NVCC_BIN_WITH_PATH>>>"
}
}
},
"versions": {}
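Assuming the usual mapping from the script name to tags, the new variations could presumably be exercised as follows (illustrative invocations; `_cuda` and `_for-nvidia-mlperf-inference-v3.1-gptj` are the variation names added above):

```bash
# Build PyTorch from source against the CM-detected CUDA + cuDNN stack
cmr "install pytorch from.src _cuda"

# Build the pinned checkout used for the Nvidia MLPerf inference v3.1 GPT-J setup
cmr "install pytorch from.src _for-nvidia-mlperf-inference-v3.1-gptj"
```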
6 changes: 4 additions & 2 deletions cm-mlops/script/install-pytorch-from-src/customize.py
@@ -10,9 +10,11 @@ def preprocess(i):

env = i['env']

run_cmd="CC=clang CXX=clang++ USE_CUDA=OFF python -m pip install -e . "
if env.get('CM_MLPERF_INFERENCE_INTEL', '') == "yes":
i['run_script_input']['script_name'] = "run-intel-mlperf-inference-v3_1"
run_cmd="CC=clang CXX=clang++ USE_CUDA=OFF python -m pip install -e . "

env['CM_RUN_CMD'] = run_cmd
env['CM_RUN_CMD'] = run_cmd

automation = i['automation']

40 changes: 40 additions & 0 deletions cm-mlops/script/install-pytorch-from-src/run-intel-mlperf-inference-v3_1.sh
@@ -0,0 +1,40 @@
#!/bin/bash

export PATH=${CM_CONDA_BIN_PATH}:$PATH

CUR_DIR=$PWD
rm -rf pytorch
cp -r ${CM_PYTORCH_SRC_REPO_PATH} pytorch
cd pytorch
rm -rf build

git submodule sync
git submodule update --init --recursive
if [ "${?}" != "0" ]; then exit 1; fi
pushd third_party/gloo
wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/gloo.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply gloo.patch
if [ "${?}" != "0" ]; then exit 1; fi
popd

pushd third_party/ideep/mkl-dnn
wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/clang_mkl_dnn.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply clang_mkl_dnn.patch
if [ "${?}" != "0" ]; then exit 1; fi
popd

wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/pytorch_official_1_12.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply pytorch_official_1_12.patch
if [ "${?}" != "0" ]; then exit 1; fi
pip install -r requirements.txt

cmd="${CM_RUN_CMD}"
echo ${cmd}
eval ${cmd}

if [ "${?}" != "0" ]; then exit 1; fi

echo "******************************************************"
36 changes: 8 additions & 28 deletions cm-mlops/script/install-pytorch-from-src/run.sh
@@ -10,31 +10,11 @@ rm -rf build

git submodule sync
git submodule update --init --recursive
if [ "${?}" != "0" ]; then exit 1; fi
pushd third_party/gloo
wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/gloo.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply gloo.patch
if [ "${?}" != "0" ]; then exit 1; fi
popd

pushd third_party/ideep/mkl-dnn
wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/clang_mkl_dnn.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply clang_mkl_dnn.patch
if [ "${?}" != "0" ]; then exit 1; fi
popd

wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/pytorch_official_1_12.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply pytorch_official_1_12.patch
if [ "${?}" != "0" ]; then exit 1; fi
pip install -r requirements.txt

cmd="${CM_RUN_CMD}"
echo ${cmd}
eval ${cmd}

if [ "${?}" != "0" ]; then exit 1; fi

echo "******************************************************"
if [ "${?}" != "0" ]; then exit $?; fi

python3 -m pip install -r requirements.txt
python setup.py bdist_wheel
if [ "${?}" != "0" ]; then exit $?; fi
cd dist
python3 -m pip install torch-2.*linux_x86_64.whl
if [ "${?}" != "0" ]; then exit $?; fi
4 changes: 4 additions & 0 deletions cm/CHANGES.md
@@ -1,3 +1,7 @@
## V1.6.0.1
- improving --help for common automations and CM scripts (automation recipes)
- fixing a few minor bugs

## V1.6.0
- added support for Python 3.12 (removed "pkg" dependency)
- added --depth to "cm pull repo" to reduce size of stable repos