improving CM automation recipes for CUDA and improving docs (#1088)
arjunsuresh authored Feb 2, 2024
2 parents ff483c7 + c5ea901 commit a31610e
Showing 13 changed files with 340 additions and 128 deletions.
86 changes: 48 additions & 38 deletions README.md
@@ -18,45 +18,37 @@
### About

Collective Mind (CM) is a [community project](CONTRIBUTING.md) to develop
a [collection of portable and extensible automation recipes
with a human-friendly interface (aka CM scripts)](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
that can be reused in different projects to modularize, run, benchmark and optimize complex AI/ML applications
a [collection of portable, extensible, technology-agnostic and ready-to-use automation recipes
with a human-friendly interface (aka CM scripts)](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
that automate all the manual steps required to build, run, benchmark and optimize complex ML/AI applications on any platform
with any software and hardware.

CM scripts are being developed based on the feedback from [MLCommons engineers and researchers](docs/taskforce.md)
to help them assemble, run, benchmark and optimize complex AI/ML applications
across diverse and continuously changing models, data sets, software and hardware
from Nvidia, Intel, AMD, Google, Qualcomm, Amazon and other vendors.
They require Python 3.7+ with minimal dependencies and can run natively on Ubuntu, MacOS, Windows, RHEL, Debian, Amazon Linux
and any other operating system, in a cloud or inside automatically generated containers.

Some key requirements for the CM design are:
* must be non-intrusive and easy to debug, require zero changes to existing projects and must complement, reuse, wrap and interconnect all existing automation scripts and tools (such as cmake, ML workflows, python poetry and containers) rather than substituting them;
* must have a very simple and human-friendly command line with a Python API and minimal dependencies;
* must require minimal or zero learning curve by using plain Python, native scripts, environment variables and simple JSON/YAML descriptions instead of inventing new languages;
* must run in a native environment with Ubuntu, Debian, RHEL, Amazon Linux, MacOS, Windows and any other operating system while automatically generating container snapshots with CM recipes for repeatability and reproducibility.

Below you can find a few examples of this collaborative engineering effort sponsored by [MLCommons (non-profit organization with 125+ organizations)](https://mlcommons.org) -
a few of the most commonly used [automation recipes](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
that can be chained into more complex automation workflows [using simple JSON or YAML](https://github.com/mlcommons/ck/blob/master/cm-mlops/script/app-image-classification-onnx-py/_cm.yaml).
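
A chained workflow is simply a script whose dependencies name other recipes by tags; a minimal sketch in the spirit of the linked `_cm.yaml` (field names follow the CM script convention, while the concrete tags and values here are illustrative rather than copied from the repository):

```yaml
# Sketch of a CM script definition that chains other automation recipes.
# Field names follow the _cm.yaml convention linked above; the concrete
# tags and values are illustrative.
alias: app-image-classification-onnx-py
automation_alias: script
tags:
- app
- image-classification
- onnx
- python
deps:                        # each entry resolves another CM script by its tags
- tags: detect,os
- tags: get,python
  version_min: "3.7"
- tags: get,generic-python-lib,_onnxruntime
- tags: get,ml-model,image-classification,onnx
- tags: get,dataset,imagenet,preprocessed
```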

You can try them yourself (you only need Python 3.7+, PIP, git and wget installed and optionally Docker if you want to
run CM scripts via automatically-generated containers - check the [installation guide](docs/installation.md) for more details).

*Note that the Collective Mind concept is to continue improving portability and functionality
of all CM automation recipes across rapidly evolving models, data sets, software and hardware
based on collaborative testing and feedback - don't hesitate to report encountered issues
[here](https://github.com/mlcommons/ck/issues) and/or contact us via [public Discord Server](https://discord.gg/JjWNWXKxwT)
to help this community effort!*

CM was originally designed based on the following feedback and requirements
from MLCommons engineers and researchers, who asked for a common, technology-agnostic automation layer
to help them simplify and automate the development of complex MLPerf benchmarks and AI applications with diverse ML models
while making this process more repeatable and deterministic:

* [CM automations](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
should run on any platform with any operating system either natively or inside containers
in a unified and automated way;
* should require minimal learning curve and minimal software dependencies;
* should be non-intrusive and require minimal or no changes to existing projects;
* should automate all manual steps to prepare and run AI projects including detection or
installation of all dependencies (models, code and data), substituting local paths,
updating environment variables and generating command lines for a given platform;
* should be able to run native user scripts while unifying input/output to reuse all existing work;
* should avoid using complex Domain Specific Languages (DSL);
* should use plain Python with simple JSON/YAML configurations for portable automations;
* should be easily understandable and extensible even by non-specialists;
* should have a human-friendly command line with a very simple Python API.

However, the community also started using and extending
[individual CM automation recipes](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
to modularize and run other software projects and reproduce [research papers at Systems and ML conferences](https://cTuning.org/ae/micro2023.html) -
please check the [**Getting Started Guide**](docs/getting-started.md)
to understand how they work, how to reuse and extend them for your projects,
and how to share your own automations in your public or private projects.


Just to give you a flavor of the [CM automation recipes](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
that can be chained into more complex automation workflows [using simple JSON or YAML](https://github.com/mlcommons/ck/blob/master/cm-mlops/script/app-image-classification-onnx-py/_cm.yaml),
here are a few of the most commonly used automation examples from CM users
that you can try yourself on Linux, MacOS, Windows and other platforms
with any hardware (you only need Python 3.7+, git, wget and PIP installed
on your platform - check the [installation guide](docs/installation.md) for more details):

<details open>
<summary><b>CM human-friendly command line:</b></summary>
@@ -119,6 +111,9 @@ cm pull repo --url=https://zenodo.org/records/10581696/files/cm-mlops-repo-20240
cmr "install llvm prebuilt" --version=17.0.6
cmr "app image corner-detection"

cm run experiment --tags=tuning,experiment,batch_size -- echo --batch_size={{VAR1{range(1,8)}}}
cm replay experiment --tags=tuning,experiment,batch_size
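# (illustrative note: the {{VAR1{range(1,8)}}} template above asks the
#  "experiment" automation to re-run the command once per value in range(1,8),
#  i.e. batch sizes 1..7, recording every run so that "cm replay" can
#  reproduce a chosen one later)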

cmr "get conda"

cm pull repo ctuning@cm-reproduce-research-projects
@@ -149,7 +144,7 @@ if output['return']==0: print (output)
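
The one-liner above is the tail of the collapsed Python API example; for reference, a minimal self-contained version might look like this (the `cmind.access` dict interface is the documented entry point, while the specific tags and version simply mirror the CLI examples above):

```python
import cmind

# Run a CM automation recipe (script) through the Python API.
# 'action'/'automation' follow the cmind.access convention; the tags and
# version below are illustrative.
output = cmind.access({'action': 'run',
                       'automation': 'script',
                       'tags': 'install,llvm,prebuilt',
                       'version': '17.0.6'})

if output['return'] == 0:
    print(output)
else:
    print(output.get('error', ''))
```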


<details open>
<summary><b>Modular containers and GitHub actions with CM commands:</b></summary>
<summary><b>Examples of modular containers and GitHub actions with CM commands:</b></summary>

<small>

@@ -160,6 +155,21 @@

</details>

[CM scripts](https://github.com/mlcommons/ck/blob/master/docs/list_of_scripts.md)
were successfully used to [modularize MLPerf inference benchmarks](https://github.com/mlcommons/ck/blob/master/docs/mlperf/inference/README.md)
and help the community automate more than 95% of all performance and power submissions in the v3.1 round
across more than 120 system configurations (models, frameworks, hardware)
while reducing development and maintenance costs.

Besides automating MLCommons projects, the community also started using
and extending [CM scripts](https://github.com/mlcommons/ck/tree/master/docs/list_of_scripts.md)
to modularize, run and benchmark other software projects and make it
easier to rerun, reproduce and reuse [research projects from published papers
at Systems and ML conferences](https://cTuning.org/ae/micro2023.html).

Please check the [**Getting Started Guide**](docs/getting-started.md)
to understand how CM automation recipes work, how to use them to automate your own projects,
and how to implement and share new automations in your public or private projects.

### Documentation

@@ -175,7 +185,7 @@

* ACM REP'23 keynote about MLCommons CM: [slides](https://doi.org/10.5281/zenodo.8105339)
* ACM TechTalk'21 about automating research projects: [YouTube](https://www.youtube.com/watch?v=7zpeIVwICa4)
* MLPerf inference submitter orientation: [slides](https://doi.org/10.5281/zenodo.8144274)
* MLPerf inference submitter orientation: [v3.1 slides](https://doi.org/10.5281/zenodo.10605079), [v3.0 slides](https://doi.org/10.5281/zenodo.8144274)

### Get in touch

1 change: 1 addition & 0 deletions cm-mlops/automation/script/_cm.json
@@ -7,6 +7,7 @@
},
"desc": "Making native scripts more portable, interoperable and deterministic",
"developers": "[Arjun Suresh](https://www.linkedin.com/in/arjunsuresh), [Grigori Fursin](https://cKnowledge.org/gfursin)",
"actions_with_help":["run"],
"sort": 1000,
"tags": [
"automation"
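The new `actions_with_help` entry is what lets the `run` action answer `--help`; after this change, a call along these lines should print the script-specific help implemented in `module.py` below (the script tags are just an example):

```bash
cm run script "install llvm prebuilt" --help
```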
47 changes: 47 additions & 0 deletions cm-mlops/automation/script/module.py
@@ -650,6 +650,50 @@ def run(self, i):
        meta = script_artifact.meta
        path = script_artifact.path

        # If the caller passed --help, print script-specific help and stop
        if i.get('help', False):
            print('')
            print('Help for this CM script (automation recipe):')

            # Variations are selected on the command line with a "_" prefix
            variations = meta.get('variations', {})
            if len(variations) > 0:
                print('')
                print('Available variations:')
                print('')
                for v in sorted(variations):
                    print(' _' + v)

            # Flags that are mapped directly to environment variables
            input_mapping = meta.get('input_mapping', {})
            if len(input_mapping) > 0:
                print('')
                print('Available flags mapped to environment variables:')
                print('')
                for k in sorted(input_mapping):
                    v = input_mapping[k]

                    print(' --{} -> --env.{}'.format(k, v))

            # Documented input flags (keys of the Python API input dict)
            input_description = meta.get('input_description', {})
            if len(input_description) > 0:
                print('')
                print('Available flags (Python API dict keys):')
                print('')
                for k in sorted(input_description):
                    v = input_description[k]
                    n = v.get('desc', '')

                    x = ' --' + k
                    if n != '':
                        x += ' ({})'.format(n)

                    print(x)

            print('')
            input('Press Enter to see common flags for all scripts')

            return {'return': 0}

        deps = meta.get('deps', [])
        post_deps = meta.get('post_deps', [])
        prehook_deps = meta.get('prehook_deps', [])
@@ -3416,6 +3460,7 @@ def update_deps(self, i):
        * return (int): return code == 0 if no error and >0 if error
        * (error) (str): error string if return>0
        """

        deps = i['deps']
        add_deps = i['update_deps']
        update_deps(deps, add_deps, False)
@@ -4102,6 +4147,7 @@ def prepare_and_run_script_with_postprocessing(i, postprocess="postprocess"):

return rr

##############################################################################
def run_detect_version(customize_code, customize_common_input, recursion_spaces, env, state, const, const_state, meta, verbose=False):

if customize_code is not None and 'detect_version' in dir(customize_code):
@@ -4124,6 +4170,7 @@

return {'return': 0}

##############################################################################
def run_postprocess(customize_code, customize_common_input, recursion_spaces, env, state, const, const_state, meta, verbose=False, run_script_input=None):

if customize_code is not None and 'postprocess' in dir(customize_code):
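Taken together, the help path added to `run()` above prints the script's variations, flag-to-environment mappings and documented inputs before pausing for the common flags; for a hypothetical script the output would look roughly like:

```
Help for this CM script (automation recipe):

Available variations:

 _cuda
 _for-nvidia-mlperf-inference-v3.1-gptj

Available flags mapped to environment variables:

 --version -> --env.CM_VERSION

Press Enter to see common flags for all scripts
```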
1 change: 1 addition & 0 deletions cm-mlops/script/get-cudnn/_cm.json
@@ -33,6 +33,7 @@
"new_env_keys": [
"CM_CUDNN_*",
"CM_CUDA_PATH_LIB_CUDNN",
"CM_CUDA_PATH_INCLUDE_CUDNN",
"CM_CUDA_PATH_LIB_CUDNN_EXISTS",
"+PATH",
"+C_INCLUDE_PATH",
1 change: 1 addition & 0 deletions cm-mlops/script/get-cudnn/customize.py
@@ -116,6 +116,7 @@ def preprocess(i):
    cuda_inc_path = env['CM_CUDA_PATH_INCLUDE']
    cuda_lib_path = env['CM_CUDA_PATH_LIB']
    env['CM_CUDA_PATH_LIB_CUDNN'] = env['CM_CUDA_PATH_LIB']
    env['CM_CUDA_PATH_INCLUDE_CUDNN'] = env['CM_CUDA_PATH_INCLUDE']

    try:
        print("Copying cudnn include files to {} (CUDA_INCLUDE_PATH)".format(cuda_inc_path))
31 changes: 29 additions & 2 deletions cm-mlops/script/install-pytorch-from-src/_cm.json
@@ -108,7 +108,8 @@
}
},
"env": {
"CM_CONDA_ENV": "yes"
"CM_CONDA_ENV": "yes",
"CM_MLPERF_INFERENCE_INTEL": "yes"
},
"deps": [
{
@@ -215,8 +216,34 @@
"libstdcxx-ng"
],
"tags": "get,generic,conda-package,_package.libstdcxx-ng,_source.conda-forge"
}
}
]
},
"for-nvidia-mlperf-inference-v3.1-gptj": {
"base": [
"checkout.b5021ba9",
"cuda"
],
"env": {
"CM_CONDA_ENV": "yes"
},
"deps": [
{
"tags": "get,conda,_name.nvidia"
}
]
},
"cuda": {
"deps": {
"tags": "get,cuda,cudnn",
"names": [ "cuda" ]
},
"env": {
"CUDA_HOME": "<<<CM_CUDA_INSTALLED_PATH>>>",
"CUDNN_LIBRARY_PATH": "<<<CM_CUDA_PATH_LIB_CUDNN>>>",
"CUDNN_INCLUDE_PATH": "<<<CM_CUDA_PATH_INCLUDE_CUDNN>>>",
"CUDA_NVCC_EXECUTABLE": "<<<CM_NVCC_BIN_WITH_PATH>>>"
}
}
},
"versions": {}
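Assuming the usual mapping from the script name to tags, the new variations could presumably be exercised as follows (illustrative invocations; `_cuda` and `_for-nvidia-mlperf-inference-v3.1-gptj` are the variation names added above):

```bash
# Build PyTorch from source against the CM-detected CUDA + cuDNN stack
cmr "install pytorch from.src _cuda"

# Build the pinned checkout used for the Nvidia MLPerf inference v3.1 GPT-J setup
cmr "install pytorch from.src _for-nvidia-mlperf-inference-v3.1-gptj"
```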
6 changes: 4 additions & 2 deletions cm-mlops/script/install-pytorch-from-src/customize.py
@@ -10,9 +10,11 @@ def preprocess(i):

env = i['env']

run_cmd="CC=clang CXX=clang++ USE_CUDA=OFF python -m pip install -e . "
if env.get('CM_MLPERF_INFERENCE_INTEL', '') == "yes":
i['run_script_input']['script_name'] = "run-intel-mlperf-inference-v3_1"
run_cmd="CC=clang CXX=clang++ USE_CUDA=OFF python -m pip install -e . "

env['CM_RUN_CMD'] = run_cmd
env['CM_RUN_CMD'] = run_cmd

automation = i['automation']

40 changes: 40 additions & 0 deletions cm-mlops/script/install-pytorch-from-src/run-intel-mlperf-inference-v3_1.sh
@@ -0,0 +1,40 @@
#!/bin/bash

export PATH=${CM_CONDA_BIN_PATH}:$PATH

CUR_DIR=$PWD
rm -rf pytorch
cp -r ${CM_PYTORCH_SRC_REPO_PATH} pytorch
cd pytorch
rm -rf build

git submodule sync
git submodule update --init --recursive
if [ "${?}" != "0" ]; then exit 1; fi
pushd third_party/gloo
wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/gloo.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply gloo.patch
if [ "${?}" != "0" ]; then exit 1; fi
popd

pushd third_party/ideep/mkl-dnn
wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/clang_mkl_dnn.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply clang_mkl_dnn.patch
if [ "${?}" != "0" ]; then exit 1; fi
popd

wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/pytorch_official_1_12.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply pytorch_official_1_12.patch
if [ "${?}" != "0" ]; then exit 1; fi
pip install -r requirements.txt

cmd="${CM_RUN_CMD}"
echo ${cmd}
eval ${cmd}

if [ "${?}" != "0" ]; then exit 1; fi

echo "******************************************************"
36 changes: 8 additions & 28 deletions cm-mlops/script/install-pytorch-from-src/run.sh
@@ -10,31 +10,11 @@ rm -rf build

git submodule sync
git submodule update --init --recursive
if [ "${?}" != "0" ]; then exit 1; fi
pushd third_party/gloo
wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/gloo.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply gloo.patch
if [ "${?}" != "0" ]; then exit 1; fi
popd

pushd third_party/ideep/mkl-dnn
wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/clang_mkl_dnn.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply clang_mkl_dnn.patch
if [ "${?}" != "0" ]; then exit 1; fi
popd

wget -nc --no-check-certificate https://raw.githubusercontent.com/mlcommons/inference_results_v3.1/main/closed/Intel/code/bert-99/pytorch-cpu/patches/pytorch_official_1_12.patch
if [ "${?}" != "0" ]; then exit 1; fi
git apply pytorch_official_1_12.patch
if [ "${?}" != "0" ]; then exit 1; fi
pip install -r requirements.txt

cmd="${CM_RUN_CMD}"
echo ${cmd}
eval ${cmd}

if [ "${?}" != "0" ]; then exit 1; fi

echo "******************************************************"
if [ "${?}" != "0" ]; then exit $?; fi

python3 -m pip install -r requirements.txt
python setup.py bdist_wheel
if [ "${?}" != "0" ]; then exit $?; fi
cd dist
python3 -m pip install torch-2.*linux_x86_64.whl
if [ "${?}" != "0" ]; then exit $?; fi
4 changes: 4 additions & 0 deletions cm/CHANGES.md
@@ -1,3 +1,7 @@
## V1.6.0.1
- improving --help for common automations and CM scripts (automation recipes)
- fixing a few minor bugs

## V1.6.0
- added support for Python 3.12 (removed "pkg" dependency)
- added --depth to "cm pull repo" to reduce size of stable repos