Document status: Working draft
Our goal is to create a survey that helps us, data science & MLOps practitioners, learn more about our community and (in the future) track how our community changes year-over-year. Said differently - this is a survey for the community, by the community. We designed this survey to gain an accurate understanding of who we, the professionals designing, building, and deploying production ML systems, individually and collectively are; the types of data and models we implement; the tools we use; and the challenges we face. Specifically, this is NOT a vendor-led survey that pushes forward a specific perspective that advantages a vendor or group of vendors.
This document provides an outline of the topics we want to understand from our community. It is intended to spark discussion and drive alignment on a set of topics and learning goals within each topic.
Specifically, our goals are:
- Develop a ~10 minute survey that provides a neutral, vendor-agnostic view of who is in our community and why and how MLOps is applied.
- Publish an annual report on the State of MLOps based on the survey results
- Survey 1,000 data science practitioners per year who are representative of the broader community
- Provide open-source access to the (anonymized) data
Below are the folks who are contributing and/or supportive of this effort. Please feel encouraged to open a PR with your name if you are supportive of the survey or want to contribute!
Name | Contact Info |
---|---|
Eric Peter | eric@ericpeter.com |
Demetrios Brinkmann | demetrios@mlops.community |
Gonçalo Ribeiro | g@ydata.ai |
Jacopo Tagliabue | jacopo.tagliabue@nyu.edu |
Duarte Carmo | me@duarteocarmo.com |
Ben Wilson | benjamin.wilson@databricks.com |
Han Lee | lee.hanchung@gmail.com |
Skylar Payne | skylar@healthrhythms.com |
Niall Murphy | niallm@gmail.com |
Jeffrey Luppes | jeffluppes@gmail.com |
Andreea Munteanu | andreeamihaelamunteanu@gmail.com |
Laszlo Sragner | laszlo@hypergolic.co.uk |
Elena Samuylova | elena.samuylova@evidentlyai.com |
Ian Schweer | ischweer@riotgames.com |
Add your name here |
We are working backwards from an April 2023 release date for the report. We are hopeful that our proposed KubeCon talk on this survey will be accepted.
- November 7 - 18: Engage with various community members to evangelize the idea, identify contributors, get feedback, etc
- November 22: Finalized list of topics & learning goals
- December 9: First draft of survey text written
- December 10 - 22: Collect feedback on the draft survey text
- December 23 - January 5: Holiday break
- January 13: Survey release candidate #1 available as a web link
- January 13: Finalize list of incentives & distribution plan
- January 16 - 20: Live trials of survey with ~10 people to confirm length, identify bugs / unclear wording
- January 23 - 27: Finalize programming
- January 27 - February 10: Buffer for delays ;-)
- February 13: Survey go-live (tbd plan for distribution)
- March 3: Survey closed
- March 17: First draft of survey analysis (e.g., annual report / blog post)
- March 17 - April 7: Refine survey analysis
- April 7 - 21: Buffer for delays ;-)
- April 21: Public release
Interested in helping out? Below are some areas we have identified, but we encourage pull requests on this repo if you have another idea. If you want to discuss, reach out to Eric Peter (eric@ericpeter.me) or on the MLOps Community Slack at @Eric Peter.
- Share the survey link to your network (once published)
- Submit a PR to add something that is missing or correct an error
- Help define & refine the topics & questions
- Take ownership of a specific topic: define & refine the lists that form the answer choices (LINK TO LISTS)
- Volunteer as a QA tester
- Volunteer to take the survey (or a portion of the survey) live with a moderator (to help us refine the clarity of the questions & answers)
- Participation incentives: define and curate incentives that will be raffled off in order to encourage participation
- Conduct data analysis once responses are collected (looking at you data scientists ;-)
- Write the report/synthesis of the results as a blog post, or ...
- Pay for a license for a survey tool with maxDiff question types + advanced piping and branching logic (Qualtrics, etc)
- Donate a participation incentive
Below is the proposed list of survey topics and learning goals for each. The survey will be at most 15 minutes long (ideally shorter), so the topics below will be prioritized for inclusion.
Beyond job title, who are we? What skills do we have? What responsibilities do we own? What are our attitudes towards and perceptions of MLOps? What types of organizations do we work for? What team structure(s) are we a part of?
Indicative topics/questions
- Current job & past titles
- Years & types of experience
- DevOps, ML Research, Software Engineering, etc
- Current responsibilities
- Experimentation, productionization, etc
- Attitudes towards ML and MLOps
- "MLOps is just DevOps", "MLOps should be an extension of DevOps", "MLOps is different from DevOps", "Software and ML should follow the same DevOps process", etc
- Familiarity with various ML and data concepts/topics
- ML concepts: Feature Store, Model Registry, etc
- Data concepts: Streaming, batch processing, etc
- Company details
- Number of employees, industry, etc
- Employer perceptions
- Team details
- Number of people by role
- Community participation
- Which communities?
- Frequency of engagement
- Reasons for engaging
Understand which ML and MLOps tools / frameworks are used by teams:
- Frequency of tool usage
- Use case for each tool
- Satisfaction with each tool
- Which tools would we (not) recommend?
- Effort for initial setup and ongoing maintenance
Understand the maturity of internal ML platforms or "paved paths"
TODO: Curate a list of tool categories and tool names.
TODO: Should we differentiate between tools (PyTorch, Notebooks, etc) and infrastructure (Kubernetes, etc)? Or is the difference too blurred?
Understand the sources/sinks of data in training and production
- Frequency of each source's use
- Difficulty of using each source
- What each source is used for (training data, production, logs, etc)?
- Desired sources that can't be used
Understand the data infra & tools
- Frequency of use
- Satisfaction
TODO: Flesh out list of data sources (REST API, production DBs, data lakes, etc)
TODO: Flesh out list of data tools (Spark, BigQuery, etc)
Understand the business-related struggles of utilizing ML / DS solutions
- Process of project proposal
- Difficulties in navigating buy-in with stakeholders
- Approval and auditing processes for project work
Understand where and how practitioners discover tools, methods, and examples
- Sources of knowledge
- References and examples to learn from
- Ideation validation processes (where and how experiments are conducted)
- Testing and vetting process (peer review, demo, etc.)
Understand the tools and methods used to validate proposed architectures and approaches
- Tools and frameworks used
- Process of running experimental validation of ideas
- Time boxing and restriction of scope creep
Understand the broad strokes of the types and details of the use cases overall. These questions will be asked as "approximately what % of your ML use cases[1] are the following?"
- Analytical (batch/offline) vs API-based (real time/online)
- Business use cases (pricing, recommendations, fraud, etc)
- Types of models (Transformers, XGBoost, etc)
- Types of data (tabular, unstructured text, images, video, time-series, etc)
- How model predictions are accessed by the end-user (within a report, BI tool, application, etc)
- Approach to releasing models and features (CI/CD, manual releases, etc)
- Feature freshness (milliseconds/realtime, hours, days, weeks, etc)
- TODO: Add others of interest
Understand the future view of these use cases - what will these look like 3 years from now?
Other topics of interest
- Models that do / don't make it to production & why
- Which segments of the use cases are hardest to implement and why?
TODO: Significant effort is required here to ensure these responses are MECE (mutually exclusive and collectively exhaustive)
[1] ML use case is defined as the set of input data, features, and model(s) required to deliver a final prediction that the end user sees.
A flaw in many existing surveys is that they ask respondents for specific details, but ask for these details as a generalization across all use cases. This is known to produce poor quality (inaccurate) data, since generalizing is a surprisingly difficult task for humans. In order to collect accurate data on the "weeds" of how we use ML, this section will ask detailed questions about a single specific use case. The results will then be aggregated and weighted according to the broad information collected in the previous section.
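To make the aggregation step concrete, here is a minimal sketch of how single-use-case answers could be reweighted by the overall use case mix. The column names, weighting scheme, and numbers are hypothetical illustrations, not the final methodology:

```python
import pandas as pd

# Hypothetical example: each respondent describes ONE use case in detail
# (deep_dive), while the previous section gives the approximate mix of
# use case types across the whole community (overview).
deep_dive = pd.DataFrame({
    "respondent": ["a", "b", "c"],
    "use_case_type": ["batch", "online", "batch"],
    "time_to_deploy_weeks": [6, 12, 3],
})

overview = pd.DataFrame({
    "use_case_type": ["batch", "online"],
    "share_of_all_use_cases": [0.7, 0.3],  # from the general overview section
})

# Weight each deep-dive answer by how common that use case type is overall,
# so over- or under-sampled types do not skew the aggregate statistics.
merged = deep_dive.merge(overview, on="use_case_type")
merged["weight"] = merged["share_of_all_use_cases"] / merged.groupby(
    "use_case_type")["respondent"].transform("count")

weighted_mean = (
    (merged["time_to_deploy_weeks"] * merged["weight"]).sum()
    / merged["weight"].sum()
)
print(f"Weighted mean time to deploy: {weighted_mean:.1f} weeks")
```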
TODO: Add more details on the methodology
TODO: Flesh out this list of topics
- Include all topics from the general use case overview
- Time to deploy
- Time taken at each lifecycle stage
- Difficulty of each lifecycle stage
- Metrics tracked
- Roles (people) involved
- Number of models used
- Number of features used
- Size of training data (GB, PB, etc)
- Volume of predictions
- Challenges faced and severity of each (TODO: need a good list here)
- Differences between the training and production environments (different code, etc)
Challenges
- Validate data is the same between training & production
- Validate feature logic is the same
- Manage dependencies in code (libraries)
- Manage dependencies between model or feature versions
- Reproducibility
- Monitor metrics
- Speed of deployment
ML lifecycle stages
TODO: Improve this draft
- Requirement definition
- Data identification
- Data cleaning
- Feature engineering / pre-processing
- Model training
- Model productionization
- Feature productionization
- Model deployment
- Feature deployment
- Monitoring
TBD - should we include this? It could be interesting to track over time. Understand data science budgets and what specifically companies are currently spending that budget on (tools, infrastructure, etc).