Metrics Monitoring: A minimal viable product #348

Merged: 36 commits, Jan 11, 2018

Commits
6d5f9e2
[Metrics] A working node exporter done.
simon-mo Dec 17, 2017
439d9f8
[Metric] Model Exporter Done.
simon-mo Dec 19, 2017
3166bc0
[Metric] Add FrontendExporter Docker image
simon-mo Dec 19, 2017
8fbba2f
[Metric] Add docstrings
simon-mo Dec 19, 2017
7947a5f
[Metric] Add integration test for clipper metric
simon-mo Dec 20, 2017
79f3145
[Metric] Format Code
simon-mo Dec 20, 2017
e1c7227
[Metrics] Bug Fix, update the name accoridingly
simon-mo Jan 3, 2018
fd3b078
[Metric] Small Fixes
simon-mo Jan 3, 2018
bc55071
[Metric] Format Code
simon-mo Jan 3, 2018
5efc602
[Metric] Skipped Metrics to Let Jenkins Build Images
simon-mo Jan 5, 2018
d86d79a
[Metric] Revert the files; docker imgs built before test
simon-mo Jan 5, 2018
3ffd51b
Merge branch 'metrics' of https://github.com/simon-mo/clipper into me…
simon-mo Jan 5, 2018
21b3926
Move the comments; restart the tests
simon-mo Jan 5, 2018
ceb6065
[Metric] Add version tag to the frontend-exporter
simon-mo Jan 5, 2018
0922c3d
[Metric] Add clipper_metric to run_unittests.sh
simon-mo Jan 5, 2018
ebf2df6
[Metric] Trigger another CI check
simon-mo Jan 5, 2018
ccff8ae
[Metrics] A working node exporter done.
simon-mo Dec 17, 2017
6756b18
[Metric] Model Exporter Done.
simon-mo Dec 19, 2017
92d2c01
[Metric] Add FrontendExporter Docker image
simon-mo Dec 19, 2017
4fc5c10
[Metric] Add docstrings
simon-mo Dec 19, 2017
d84be8d
[Metric] Add integration test for clipper metric
simon-mo Dec 20, 2017
9e955b3
[Metric] Format Code
simon-mo Dec 20, 2017
457f792
[Metrics] Bug Fix, update the name accoridingly
simon-mo Jan 3, 2018
808935b
[Metric] Small Fixes
simon-mo Jan 3, 2018
5816ca7
[Metric] Format Code
simon-mo Jan 3, 2018
26e0024
[Metric] Skipped Metrics to Let Jenkins Build Images
simon-mo Jan 5, 2018
a93606d
[Metric] Revert the files; docker imgs built before test
simon-mo Jan 5, 2018
3dd8298
Move the comments; restart the tests
simon-mo Jan 5, 2018
56425ae
[Metric] Add version tag to the frontend-exporter
simon-mo Jan 5, 2018
e2d8f46
[Metric] Add clipper_metric to run_unittests.sh
simon-mo Jan 5, 2018
086e305
[Metric] Trigger another CI check
simon-mo Jan 5, 2018
87a67e2
Merge branch 'develop' into metrics
dcrankshaw Jan 5, 2018
b89cbae
Address comments, fix typo
simon-mo Jan 8, 2018
aa4b7c8
Merge branch 'metrics' of https://github.com/simon-mo/clipper into me…
simon-mo Jan 8, 2018
dd9d2e5
[Metric] change the redis port for integration-test
simon-mo Jan 8, 2018
b218000
Merge branch 'develop' into metrics
simon-mo Jan 10, 2018
5 changes: 3 additions & 2 deletions bin/build_docker_images.sh
@@ -258,9 +258,10 @@ build_images () {
create_image pyspark-container PySparkContainerDockerfile $public
create_image tf_cifar_container TensorFlowCifarDockerfile $public
create_image tf-container TensorFlowDockerfile $public
}


# Build Metric Monitor image - no dependency
create_image frontend-exporter FrontendExporterDockerfile $public
}


usage () {
1 change: 1 addition & 0 deletions bin/run_unittests.sh
@@ -143,6 +143,7 @@ function run_integration_tests {
python ../integration-tests/kubernetes_integration_test.py
python ../integration-tests/deploy_tensorflow_models.py
../integration-tests/r_integration_test/rclipper_test.sh
python ../integration-tests/clipper_metric.py
}

function run_all_tests {
1 change: 1 addition & 0 deletions clipper_admin/clipper_admin/container_manager.py
@@ -5,6 +5,7 @@
CLIPPER_INTERNAL_QUERY_PORT = 1337
CLIPPER_INTERNAL_MANAGEMENT_PORT = 1338
CLIPPER_INTERNAL_RPC_PORT = 7000
CLIPPER_INTERNAL_METRIC_PORT = 1390

CLIPPER_DOCKER_LABEL = "ai.clipper.container.label"
CLIPPER_MODEL_CONTAINER_LABEL = "ai.clipper.model_container.label"
37 changes: 31 additions & 6 deletions clipper_admin/clipper_admin/docker/docker_container_manager.py
@@ -8,9 +8,11 @@
ContainerManager, CLIPPER_DOCKER_LABEL, CLIPPER_MODEL_CONTAINER_LABEL,
CLIPPER_QUERY_FRONTEND_CONTAINER_LABEL,
CLIPPER_MGMT_FRONTEND_CONTAINER_LABEL, CLIPPER_INTERNAL_RPC_PORT,
CLIPPER_INTERNAL_QUERY_PORT, CLIPPER_INTERNAL_MANAGEMENT_PORT)
CLIPPER_INTERNAL_QUERY_PORT, CLIPPER_INTERNAL_MANAGEMENT_PORT,
CLIPPER_INTERNAL_METRIC_PORT)
from ..exceptions import ClipperException
from requests.exceptions import ConnectionError
from .docker_metric_utils import *

logger = logging.getLogger(__name__)

@@ -93,7 +95,7 @@ def start_clipper(self, query_frontend_image, mgmt_frontend_image,
logger.debug(
"{nw} network already exists".format(nw=self.docker_network))
except ConnectionError:
msg = "Unable to connect to Docker. Please Check if Docker is running."
msg = "Unable to Connect to Docker. Please Check if Docker is running."
raise ClipperException(msg)

if not self.external_redis:
@@ -123,24 +125,38 @@
},
labels=mgmt_labels,
**self.extra_container_kwargs)

query_cmd = "--redis_ip={redis_ip} --redis_port={redis_port} --prediction_cache_size={cache_size}".format(
redis_ip=self.redis_ip,
redis_port=self.redis_port,
cache_size=cache_size)
query_labels = self.common_labels.copy()
query_labels[CLIPPER_QUERY_FRONTEND_CONTAINER_LABEL] = ""
query_container_id = random.randint(0, 100000)
query_name = "query_frontend-{}".format(query_container_id)
self.docker_client.containers.run(
query_frontend_image,
query_cmd,
name="query_frontend-{}".format(
random.randint(0, 100000)), # generate a random name
name=query_name,
ports={
'%s/tcp' % CLIPPER_INTERNAL_QUERY_PORT:
self.clipper_query_port,
'%s/tcp' % CLIPPER_INTERNAL_RPC_PORT: self.clipper_rpc_port
},
labels=query_labels,
**self.extra_container_kwargs)

# Metric Section
query_frontend_metric_name = "query_frontend_exporter-{}".format(
query_container_id)
run_query_frontend_metric_image(
query_frontend_metric_name, self.docker_client, query_name,
self.common_labels, self.extra_container_kwargs)
setup_metric_config(query_frontend_metric_name,
CLIPPER_INTERNAL_METRIC_PORT)
run_metric_image(self.docker_client, self.common_labels,
self.extra_container_kwargs)

self.connect()

def connect(self):
@@ -187,15 +203,24 @@ def _add_replica(self, name, version, input_type, image):
"CLIPPER_IP": query_frontend_hostname,
"CLIPPER_INPUT_TYPE": input_type,
}

model_container_label = create_model_container_label(name, version)
labels = self.common_labels.copy()
labels[CLIPPER_MODEL_CONTAINER_LABEL] = create_model_container_label(
name, version)
labels[CLIPPER_MODEL_CONTAINER_LABEL] = model_container_label

# Metric Section
model_container_name = model_container_label + '-{}'.format(
random.randint(0, 100000))
self.docker_client.containers.run(
image,
name=model_container_name,
environment=env_vars,
labels=labels,
**self.extra_container_kwargs)

update_metric_config(model_container_name,
CLIPPER_INTERNAL_METRIC_PORT)

def set_num_replicas(self, name, version, input_type, image, num_replicas):
current_replicas = self._get_replicas(name, version)
if len(current_replicas) < num_replicas:
141 changes: 141 additions & 0 deletions clipper_admin/clipper_admin/docker/docker_metric_utils.py
@@ -0,0 +1,141 @@
import yaml
import requests
import random
import os
from ..version import __version__


def ensure_clipper_tmp():
"""
Make sure /tmp/clipper directory exist. If not, make one.
:return: None
"""
try:
os.makedirs('/tmp/clipper')
except OSError as e:
# Equivalent to os.makedirs(., exist_ok=True) in py3
pass


def get_prometheus_base_config():
"""
Generate a basic configuration dictionary for prometheus
:return: dictionary
"""
conf = dict()
conf['global'] = {'evaluation_interval': '5s', 'scrape_interval': '5s'}
conf['scrape_configs'] = []
return conf


def run_query_frontend_metric_image(name, docker_client, query_name,
common_labels, extra_container_kwargs):
"""
Use docker_client to run a frontend-exporter image.
:param name: Name to pass in, need to be unique.
:param docker_client: The docker_client object.
:param query_name: The corresponding frontend name
:param common_labels: Labels to pass in.
:param extra_container_kwargs: Kwargs to pass in.
:return: None
"""

query_frontend_metric_cmd = "--query_frontend_name {}".format(query_name)
query_frontend_metric_labels = common_labels.copy()

docker_client.containers.run(
"clipper/frontend-exporter:{}".format(__version__),
query_frontend_metric_cmd,
name=name,
labels=query_frontend_metric_labels,
**extra_container_kwargs)


def setup_metric_config(query_frontend_metric_name,
CLIPPER_INTERNAL_METRIC_PORT):
"""
Write to file prometheus.yml after frontend-metric is setup.
:param query_frontend_metric_name: Corresponding image name
:param CLIPPER_INTERNAL_METRIC_PORT: Default port.
:return: None
"""

ensure_clipper_tmp()

with open('/tmp/clipper/prometheus.yml', 'w') as f:
prom_config = get_prometheus_base_config()
prom_config_query_frontend = {
'job_name':
'query',
'static_configs': [{
'targets': [
'{name}:{port}'.format(
name=query_frontend_metric_name,
port=CLIPPER_INTERNAL_METRIC_PORT)
]
}]
}
prom_config['scrape_configs'].append(prom_config_query_frontend)

yaml.dump(prom_config, f)


def run_metric_image(docker_client, common_labels, extra_container_kwargs):
"""
Run the prometheus image.
:param docker_client: The docker client object
:param common_labels: Labels to pass in
:param extra_container_kwargs: Kwargs to pass in.
:return: None
"""

metric_cmd = [
"--config.file=/etc/prometheus/prometheus.yml",
"--storage.tsdb.path=/prometheus",
"--web.console.libraries=/etc/prometheus/console_libraries",
"--web.console.templates=/etc/prometheus/consoles",
"--web.enable-lifecycle"
]
metric_labels = common_labels.copy()
docker_client.containers.run(
"prom/prometheus",
metric_cmd,
name="metric_frontend-{}".format(random.randint(0, 100000)),
ports={'9090/tcp': 9090},
volumes={
'/tmp/clipper/prometheus.yml': {
'bind': '/etc/prometheus/prometheus.yml',
'mode': 'ro'
}
},
labels=metric_labels,
**extra_container_kwargs)


def update_metric_config(model_container_name, CLIPPER_INTERNAL_METRIC_PORT):
"""
Update the prometheus.yml configuration file.
:param model_container_name: New model container_name, need to be unique.
:param CLIPPER_INTERNAL_METRIC_PORT: Default port
:return: None
"""
with open('/tmp/clipper/prometheus.yml', 'r') as f:
conf = yaml.load(f)

new_job_dict = {
'job_name':
'{}'.format(model_container_name),
'static_configs': [{
'targets': [
'{name}:{port}'.format(
name=model_container_name,
port=CLIPPER_INTERNAL_METRIC_PORT)
]
}]
}
conf['scrape_configs'].append(new_job_dict)

with open('/tmp/clipper/prometheus.yml', 'w') as f:
yaml.dump(conf, f)

requests.post('http://localhost:9090/-/reload')
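The helpers in this file build the scrape configuration as a plain dictionary before `yaml.dump` writes it to `/tmp/clipper/prometheus.yml`. A dependency-free sketch of that shape (the exporter name and the `add_scrape_job` helper are illustrative, not from the PR):

```python
# Sketch of the scrape config assembled by setup_metric_config() and
# update_metric_config() above. The helper name and target name are
# hypothetical; the port constant is from container_manager.py.

CLIPPER_INTERNAL_METRIC_PORT = 1390

def get_prometheus_base_config():
    # Mirrors the base config in the diff above.
    return {
        'global': {'evaluation_interval': '5s', 'scrape_interval': '5s'},
        'scrape_configs': [],
    }

def add_scrape_job(conf, job_name, target_name, port):
    # Both the query-frontend job and the per-model jobs follow this shape.
    conf['scrape_configs'].append({
        'job_name': job_name,
        'static_configs': [{
            'targets': ['{}:{}'.format(target_name, port)]
        }],
    })

conf = get_prometheus_base_config()
add_scrape_job(conf, 'query', 'query_frontend_exporter-42',
               CLIPPER_INTERNAL_METRIC_PORT)
print(conf['scrape_configs'][0]['static_configs'][0]['targets'])
# ['query_frontend_exporter-42:1390']
```

Each new model container appends one more job of the same shape, after which the `POST /-/reload` call asks Prometheus to pick up the rewritten file.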
1 change: 1 addition & 0 deletions clipper_admin/requirements.txt
@@ -5,3 +5,4 @@ docker==2.5.1
kubernetes==3.0.0
six==1.10.0
mock
prometheus_client
1 change: 1 addition & 0 deletions clipper_admin/setup.py
@@ -31,6 +31,7 @@
'pyyaml',
'docker',
'kubernetes',
'prometheus_client',
'six',
],
extras_require={
50 changes: 48 additions & 2 deletions containers/python/rpc.py
@@ -8,6 +8,9 @@
import socket
import sys
from collections import deque
from multiprocessing import Pipe, Process
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

INPUT_TYPE_BYTES = 0
INPUT_TYPE_INTS = 1
@@ -180,7 +183,7 @@ def get_prediction_function(self):
def get_event_history(self):
return self.event_history.get_events()

def run(self):
def run(self, metric_conn):
print("Serving predictions for {0} input type.".format(
input_type_to_string(self.model_input_type)))
connected = False
@@ -189,6 +192,9 @@ def run(self):
poller = zmq.Poller()
sys.stdout.flush()
sys.stderr.flush()

pred_metric = dict(model_pred_count=0)

while True:
socket = self.context.socket(zmq.DEALER)
poller.register(socket, zmq.POLLIN)
@@ -306,6 +312,10 @@ def run(self):

response.send(socket, self.event_history)

pred_metric['model_pred_count'] += 1

Reviewer: You're accumulating the counts here, but just sending samples of the latencies. This is why I was confused about how you were dequeuing from the pipe. In order to handle the latencies correctly you should be creating histograms. For this MVP PR, let's just accumulate the count, so delete the *_time metrics.

Author (simon-mo): Got it. I deleted the *_time metrics. If we can finalize (at least for now) all the metrics that need to be collected, I will initialize them in the Prometheus collector process just once and apply updates from each dictionary coming through the pipe, so that every metric record is updated in the metric process. (The current code initializes a new gauge every time a metric is updated; doing it once especially matters for the histogram and summary types.)

Reviewer: Yeah, that's definitely the right way to do it. Let's merge this PR, then you can start on one that tracks the timing metrics with histograms.

metric_conn.send(pred_metric)

print("recv: %f us, parse: %f us, handle: %f us" %
((t2 - t1).microseconds, (t3 - t2).microseconds,
(t4 - t3).microseconds))
@@ -488,4 +498,40 @@ def start(self, model, host, port, model_name, model_version, input_type):
self.server.model_version = model_version
self.server.model_input_type = model_input_type
self.server.model = model
self.server.run()

parent_conn, child_conn = Pipe(duplex=True)
metrics_proc = Process(target=run_metric, args=(child_conn, ))
metrics_proc.start()
self.server.run(parent_conn)


class MetricCollector:
def __init__(self, pipe_child_conn):
self.pipe_conn = pipe_child_conn

def collect(self):
latest_metric_dict = None
while self.pipe_conn.poll():
latest_metric_dict = self.pipe_conn.recv()
if latest_metric_dict:
for name, val in latest_metric_dict.items():
try:
yield GaugeMetricFamily(
name=name,
documentation=name, # Required Argument
value=val)

Reviewer: s/lastest/latest

Author (simon-mo): fixed

except ValueError:
pass


def run_metric(child_conn):
"""
This function takes a child_conn at the end of the pipe and
receives objects used to update Prometheus metrics.

It is recommended to run this in a separate process.
"""
REGISTRY.register(MetricCollector(child_conn))
start_http_server(1390)
while True:
time.sleep(1)
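The `MetricCollector.collect()` generator above drains the pipe and keeps only the newest metrics dict, which works because each `send()` carries the full accumulated counts rather than a delta. A dependency-free sketch of that pattern (`PipeCollector` and the tuple stand-in for `GaugeMetricFamily` are illustrative, not part of the PR):

```python
from multiprocessing import Pipe

# Drain-the-pipe pattern from MetricCollector.collect(): keep only the
# most recent snapshot. Plain tuples stand in for GaugeMetricFamily so
# the sketch needs no prometheus_client dependency.

class PipeCollector:
    def __init__(self, conn):
        self.conn = conn

    def collect(self):
        latest = None
        while self.conn.poll():        # drain everything queued...
            latest = self.conn.recv()  # ...keeping only the newest dict
        if latest:
            for name, val in latest.items():
                yield (name, val)      # stand-in for GaugeMetricFamily

parent, child = Pipe(duplex=True)
collector = PipeCollector(child)

# The model container sends a full snapshot after every prediction:
for count in (1, 2, 3):
    parent.send({'model_pred_count': count})

print(list(collector.collect()))
# [('model_pred_count', 3)]  -- only the latest snapshot survives
```

Because the counts are cumulative, dropping the older snapshots loses nothing; this is also why the reviewer notes that latency samples (which are not cumulative) need histograms instead.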
10 changes: 10 additions & 0 deletions dockerfiles/FrontendExporterDockerfile
@@ -0,0 +1,10 @@
# This ARG isn't used but prevents warnings in the build script
ARG CODE_VERSION
FROM python:3

WORKDIR /usr/src/app

RUN pip install requests prometheus_client flatten_json

COPY monitoring/front_end_exporter.py .
ENTRYPOINT ["python", "./front_end_exporter.py"]
3 changes: 2 additions & 1 deletion dockerfiles/RPCDockerfile
@@ -8,7 +8,8 @@ MAINTAINER Dan Crankshaw <dscrankshaw@gmail.com>
RUN mkdir -p /model \
&& apt-get update \
&& apt-get install -y libzmq3 libzmq3-dev \
&& conda install -y pyzmq
&& conda install -y pyzmq \
&& pip install prometheus_client

WORKDIR /container

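The review thread defers latency tracking to a follow-up PR using Prometheus histograms. As a rough sketch of the bookkeeping such a histogram performs, cumulative `le` buckets plus a running sum, with bucket bounds and class name chosen for illustration rather than taken from this PR:

```python
import bisect

# Prometheus histograms record observations as cumulative bucket counts
# plus a running sum; this is what a follow-up *_time metric would export
# instead of raw latency samples. Bucket bounds here are illustrative.

class LatencyHistogram:
    def __init__(self, buckets=(0.005, 0.01, 0.05, 0.1, 0.5, 1.0)):
        self.bounds = list(buckets)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot = +Inf
        self.total = 0.0
        self.observations = 0

    def observe(self, seconds):
        # First bucket whose upper bound covers this sample (le semantics:
        # a sample equal to a bound lands in that bound's bucket).
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds
        self.observations += 1

    def cumulative(self):
        # Prometheus exposes buckets cumulatively: each le="x" count
        # includes all smaller buckets.
        out, running = [], 0
        for bound, c in zip(self.bounds + [float('inf')], self.counts):
            running += c
            out.append((bound, running))
        return out

h = LatencyHistogram()
for latency in (0.004, 0.02, 0.02, 0.3, 2.0):
    h.observe(latency)
print(h.cumulative())
```

Unlike the gauge snapshots in this MVP, these cumulative counts can be aggregated across replicas and turned into quantile estimates on the Prometheus side.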