MLSM - Multiple Lead Score Models

Red Hat's implementation of multiple data science models against input data sets.

Introduction

Operationalizing the outputs of data science efforts is a tricky art. With so much data science relying on cutting-edge techniques, it can be difficult to balance that with the need for stable infrastructure. Running a static analysis to produce a spreadsheet or slide deck is fairly straightforward, but it is prone to human error and constrained by time and resources. If you're trying to make real-time decisions based on data science models, the engine that processes those models has to be reliable.

Our original use case for this package was deploying lead scoring models: given what we know about a person (based on marketing data), how should they be prioritized when handed over to sales? If we implement a model for scoring leads and the infrastructure breaks (due to deploying a new model, uncaught exceptions, etc.), there are immediate downstream impacts on other groups who depend on this information.

Challenges with other platforms:

  • Limited modeling options; usually only simple "if-then" statements
  • Slow processing times
  • Lack of flexibility to change model types quickly
  • Lack of version control

To address these challenges, we built the mlsm package.

Using a Python framework offers the following solutions:

  • Limited modeling options; usually only simple "if-then" statements
    • Python offers direct access to standard data science techniques, such as linear regression, random forests, and other machine learning methods
  • Slow processing times
    • By running Python on a server, we can add more resources to speed up processing as needed
  • Lack of flexibility to quickly deploy new models
    • With a standardized architecture designed for multiple concurrent models, deploying new models is easy
  • Lack of version control
    • By deploying with a Git-based system, we can easily view changes and roll back to older code; the multi-model architecture also makes it easier to compare outputs against live data

Architecture

Model class

Model is a high-level wrapper for functions that run models. This not only puts functions in a framework so they are applied more consistently, but also facilitates storage of results by model name/version, validation of input data, and simplification of any script that actually applies the models.

The inputs for executing a function, data and results, are each expected to correspond to a single record (the function RunModelsAll is used to apply models across multiple records).

Model also collects a fields dict, used to validate input values before execution (future enhancements will include more extensive unit, edge-case, and exception handling).
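
As a minimal sketch, defining a Model might look like the following. The import path and constructor arguments are assumptions; the README only describes a name, version, status, a fields dict, and a wrapped function that receives a single record's data and results.

```python
# Hypothetical sketch of wrapping a scoring function in Model.
from mlsm import Model  # assumed import path

def score_webinar(data, results):
    # Toy scoring logic for illustration only
    return {"score": 2 * data["page_views"] + 10 * data["attended_webinar"]}

webinar_model = Model(
    name="webinar",                      # results stored by model name/version
    version="1.0",
    status="live",                       # 'draft' while testing from develop
    fields={"page_views": int,           # fields dict used to validate inputs
            "attended_webinar": int},
    function=score_webinar,              # assumed argument name
)
```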

SummaryModel class

SummaryModel is an extension of Model - it expects the output of previously run Models to be passed (via results) for the purpose of running a second-level computation. For example, if three Models are run, a SummaryModel could then be used to determine which score 'wins' (that is, which score is passed back to business users).

When passed to RunModelsAll, every SummaryModel will be run after all Models have run.
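
A hedged sketch of such a second-level computation, assuming results is keyed by model name and version (as described under Data Flow below); the constructor arguments and import path are assumptions:

```python
# Hypothetical SummaryModel that decides which score 'wins'.
from mlsm import SummaryModel  # assumed import path

def pick_winner(data, results):
    # Assumes e.g. results["webinar"]["1.0"] == {"score": 42}
    all_scores = [
        output["score"]
        for versions in results.values()
        for output in versions.values()
    ]
    return {"winning_score": max(all_scores)}

winner_model = SummaryModel(
    name="winner",
    version="1.0",
    function=pick_winner,
)
```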

Data Flow

Pre-run

  • An ETL script prepares data for processing (not part of the mlsm package)
    • Each record is a dict
    • Each record has a unique identifier (which can later be passed as a parameter to RunModelsAll)
    • Data for each record is structured in sub-dictionaries
      • data top-level dict
      • Model.name sub-level
      • Model.version sub-level - data values contained here
    • When running a Model, the data passed to the underlying function is taken from the path data[Model.name][Model.version], as illustrated in the sketch after this list
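
A minimal illustration of this record shape (field names and values are made up, continuing the webinar example above):

```python
# Illustrative record, following the layout described above:
# the values a model reads live at data[Model.name][Model.version].
record = {
    "id": "lead-00042",                  # unique identifier for the record
    "data": {
        "webinar": {                     # Model.name
            "1.0": {                     # Model.version - data values live here
                "page_views": 14,
                "attended_webinar": 1,
            },
        },
    },
}
```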

RunModelsAll

  • For each record passed (records):
    • For each Model passed (models):
      • Execute model
    • For each SummaryModel passed (summaryModels):
      • Execute summary model
    • If DB parameters passed:
      • Store output results object in MongoDB
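
A hypothetical end-to-end call, continuing the sketches above. Only the records, models, and summary models parameters are described in this README; the identifier and MongoDB parameter names below are assumptions for illustration.

```python
# Hypothetical invocation of RunModelsAll on ETL-prepared records.
from mlsm import RunModelsAll  # assumed import path

RunModelsAll(
    records=[record],                    # dicts prepared by the ETL script
    models=[webinar_model],              # each Model runs first, per record
    summaryModels=[winner_model],        # SummaryModels run after all Models
    # idField="id",                      # assumed: key holding the unique identifier
    # dbHost=..., dbName=..., dbCollection=...,  # assumed: when passed, results go to MongoDB
)
```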

Setup Process

Hosting

We use Red Hat's OpenShift (Python 3.3 cartridge), with an additional install script that installs Anaconda for more advanced models.

Creating and deploying models

We follow this procedure for creating and deploying models; it also serves as the workflow for handing models off from data scientists to engineers for deployment:

Data Scientists

  1. Create a new local git repo (and a remote for backup)
  2. Use git-flow for develop/release/feature management
  3. When a new model is ready to be deployed, push to remote and inform engineering team with commit ID
  • If testing a model from the develop branch, Model.status must be set to 'draft'

Engineers

  1. Pull model repo to local
  2. Add to deployment code repo
  3. Import the model into the script that runs RunModelsAll (see the sketch after this list)
  4. Add any additional requirements to existing ETL scripts
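
A hedged sketch of step 3; the module paths and object names are assumptions:

```python
# Hypothetical imports in the deployment script that calls RunModelsAll.
from models.webinar import webinar_model   # first-level model from the data-science repo
from models.winner import winner_model     # summary model, if any

MODELS = [webinar_model]
SUMMARY_MODELS = [winner_model]
```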

Future

A standardized deployment process leveraging Jenkins, so that data scientists' latest models can be pulled in directly from their remote repositories.
