ODPi operations management spec objectives
The overarching objective of this and subsequent ODPi management specifications is to define specifications that apply to software systems managing ODPi components. We take system management to encompass installation, configuration/orchestration/provisioning, software upgrades, fault management, event management/alerting, and security. Our goal is a widely applicable management specification, with Ambari as the first reference implementation.
We divide objectives into two categories:
- Compliance - a set of specifications at some level of specificity that identify conditions used to determine whether a management system is deemed ODPi compliant. The pragmatic import is that ODPi big data management systems can be validated, so that users of these systems have higher assurance about their quality and more clarity on the scope of their capabilities and the semantics of their operations. To validate compliance we are looking to develop automation tools as part of the ODPi effort that can test systems, identifying conditions met and violations.
- Standard interfaces - specifications at some level of granularity that promote integration and interoperability. Standard interfaces can be at different levels, such as the transport/protocol used, object model terms, etc. The approach we advocate is to focus on the object model and key action terms, since this is where integration mediation is expensive compared with, for example, mediation between different transports (a minimal sketch follows). Another key concern is that, along with providing standard terms, it is critical to allow for vendor differentiation such as value-add configuration knobs or additional metrics collected or computed.
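To make the object-model focus concrete, here is a minimal Python sketch of the level of granularity being proposed. None of the class or method names below (Component, Service, ClusterManager, deploy, scale, upgrade, update_attributes) are defined by any ODPi spec; they are assumptions for illustration only.

```python
# Purely illustrative: possible object-model terms and key action terms.
# None of these names are defined by an ODPi spec; they are assumptions
# showing the proposed granularity (objects + actions), leaving transport
# (REST, RPC, ...) and vendor-specific extensions unconstrained.
from abc import ABC, abstractmethod
from typing import Dict, List


class Component:
    """A deployable unit of a service, e.g. an HDFS NameNode daemon."""
    def __init__(self, name: str, attributes: Dict[str, str]):
        self.name = name                  # standard component term
        self.attributes = attributes      # standard terms plus vendor extras


class Service:
    """A big data service such as HDFS or YARN, made up of components."""
    def __init__(self, name: str, version: str, components: List[Component]):
        self.name = name
        self.version = version
        self.components = components


class ClusterManager(ABC):
    """Key action terms a compliant management system might expose."""

    @abstractmethod
    def deploy(self, cluster_template: dict) -> str:
        """Initially deploy a cluster described by a creation template."""

    @abstractmethod
    def scale(self, service: str, component: str, delta: int) -> None:
        """Scale a service's component up or down by `delta` instances."""

    @abstractmethod
    def upgrade(self, service: str, target_version: str) -> None:
        """Upgrade one service to a target version."""

    @abstractmethod
    def update_attributes(self, component: str, overrides: Dict[str, str]) -> None:
        """Update component parameters in a running service."""
```

Standardizing at this level leaves vendors free to add configuration knobs and extra metrics while keeping the terms that integrators depend on stable.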
The pragmatic benefits of having ODPi management interfaces include:
- Plug and play with different big data management systems, allowing selection of best of breed
- Standard interface for a higher-level provisioning system that wants to delegate management of ODPi components to an ODPi big data manager, allowing this higher-level system to focus on coordinating other related services like Cassandra, Kafka, external monitoring systems, networking, and storage
- Interface for an alerting system that has purview over the services across a data center to receive ODPi-related alerts
- Interfaces for a system performing capacity management to best size or tune a big data cluster
- Interfaces for big data applications that want to dynamically scale up and down the underlying big data services
- Facilitating validation functionality by having validation logic written to the standard interfaces
- ...
The above bullets are general descriptions; we are particularly interested in collecting real examples where integration is presently difficult. For example, one use case we are looking at is integration of Cloud Foundry BOSH and Ambari, where BOSH is responsible for spinning up nodes with Ambari and BOSH agents installed and for configuring Ambari, which then brings up the Hadoop cluster; once up, both BOSH and Ambari capabilities can be used. The challenge today is that Ambari and BOSH have different views of a deployment template: BOSH ties logical nodes to the images used for spin up, while Ambari has blueprints that tie logical nodes to big data service components. This use case is a driver for a representation in ODPi that unifies these different views of a cluster creation template (a purely illustrative sketch follows).
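As a thought experiment rather than a spec proposal, the sketch below shows how one template could carry both views side by side. All field names, component names, and the stemcell value are assumptions chosen for illustration.

```python
# Hypothetical unified cluster-creation template (not from any existing
# spec): the Ambari-style view (node groups -> service components) and the
# BOSH-style view (node groups -> images / instance counts) side by side.
# Component names, the stemcell value, and all field names are illustrative.
unified_template = {
    "node_groups": {
        "master": {
            "components": ["NAMENODE", "RESOURCEMANAGER", "ZOOKEEPER_SERVER"],
            "instances": 1,
            "stemcell": "example-linux-stemcell",   # BOSH image, illustrative
        },
        "worker": {
            "components": ["DATANODE", "NODEMANAGER"],
            "instances": 10,
            "stemcell": "example-linux-stemcell",
        },
    },
}
```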
We are looking to incrementally extend the ODPi management specifications to bring in objects that can benefit from standardization. The key is doing this in small incremental steps and not "over-standardizing", i.e., selecting an appropriate granularity that is not too burdensome and provides flexibility to accommodate any applicable big data management system. Some areas that we wish to target are (using the terms 'components' and 'services' as defined in https://github.com/odpi/specs/blob/master/ODPi-Runtime.md):
- Configuration/Orchestration
  - Templates
    - Component Templates:
      - Standard terms for describing components and their attributes, which include things that refer to the settings in the component config files
      - Standard way to specify component attribute defaults
    - Cluster Creation Template - a reusable, deployment-independent description capturing which services are in the cluster, what their versions are, and how the components making up the services map to nodes; this template can be used to deploy an ODPi cluster on existing nodes as well as when a cluster is being dynamically spun up on virtual machines, cloud instances, or containers. This corresponds to a Blueprint in Ambari, but will also carry additional information (see the sketch after this list)
    - ...
  - Initial Deployment of a cluster
    - Instantiated Cluster - an extension of the Cluster Creation Template with deployment-specific information that binds the logical nodes in the template to actual existing nodes, or to images and other context like subnets needed when dynamically spinning up nodes on a virtual fabric
    - Syntax to specify deployment-specific overrides to default component attribute values
    - Action(s) to initially deploy a cluster
    - ...
  - Updates once deployed
    - Action(s) to scale up or down services in the cluster
    - Action(s) to upgrade one or more services
    - Action(s) to update component parameters in running services
    - Action(s) to add, remove, or enable features for a service
    - Action(s) to perform maintenance operations, such as taking HDFS in and out of safemode
    - ...
- Performance management
  - Standard terms for describing metrics
  - ...
- Event management
  - Standard terms for describing alerts
  - ...
- Security
  - TODO
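To illustrate the intended split between a deployment-independent Cluster Creation Template and an Instantiated Cluster, here is a small Python sketch. The field names, the HDFS property, and the hostnames are assumptions chosen for illustration, not spec terms.

```python
# Illustrative only: a deployment-independent Cluster Creation Template and
# its extension into an Instantiated Cluster.  Field names, the HDFS property,
# and hostnames are assumptions, not spec terms.
creation_template = {
    "services": {"HDFS": "2.7", "ZOOKEEPER": "3.4"},
    "node_groups": {
        "master": ["NAMENODE", "ZOOKEEPER_SERVER"],
        "worker": ["DATANODE"],
    },
    # Standard way to specify component attribute defaults.
    "attribute_defaults": {"DATANODE": {"dfs.datanode.du.reserved": "1073741824"}},
}

# Instantiated Cluster = template + deployment-specific bindings and overrides.
instantiated_cluster = {
    "template": creation_template,
    # Bind logical node groups to existing hosts (or to images/subnets when
    # spinning up nodes dynamically on a virtual fabric).
    "bindings": {
        "master": ["master-01.example.org"],
        "worker": [f"worker-{i:02d}.example.org" for i in range(1, 11)],
    },
    # Deployment-specific overrides of the default component attribute values.
    "attribute_overrides": {"DATANODE": {"dfs.datanode.du.reserved": "10737418240"}},
}
```

The same instantiated form could instead bind node groups to images and subnets when a higher-level provisioner is spinning up nodes dynamically.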
The proposed action plan is to have a first specification that focuses on the Cluster Topology and Component template and ...
The approach we advocate for compliance is to focus on service/system-level behavior as opposed to details about a component's internal structure (e.g., packaging layout on the node's file system), which is described in the ODPi runtime doc. This system-level behavior can be validated with smoke tests that exercise end-to-end scenarios. There are two different things that might be validated:
- Validation of a specific deployment
- Validation that an ODPi management system will produce validated deployments for any possible "input", i.e., any specified cluster topology and component attribute values
We will focus on the first problem, since the combinatorics associated with the second objective are prohibitive. However, we also want to partially address the second problem; for example, one way to do so is to have tests that validate that a component is working at the unit/daemon level (e.g., namenode running, datanode running), so that the number of tests is linear in the number of components (a minimal sketch of such daemon-level checks appears at the end of this section). These unit tests would also make sure that a daemon can work under different OSs/versions and on physical hardware, VMs, cloud instances, or containers. We can then focus end-to-end tests on behavior that is emergent at the system level, such as:
- Testing that the service works under the different possible topologies
- Simulating failure to test HA setups
- Testing under different network contexts, e.g., looking not just at a cluster and its clients on a flat network, but also at network topologies that can have public and private subnets, VPNs, NATing, multiple NICs with a dedicated management network, etc.
Since a cluster is comprised of multiple services, the goal is to test each service individually as much as possible, starting from the base services like HDFS and ZooKeeper.
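As a rough illustration of the daemon-level checks mentioned above, here is a minimal Python sketch. It assumes it runs on a cluster node with the JDK tool `jps` and the `hdfs` client on the PATH; a real compliance suite would more likely drive such checks through the standard management interfaces.

```python
# Minimal sketch of daemon-level checks of the kind discussed above, assuming
# it runs on a cluster node with the JDK tool `jps` and the `hdfs` client on
# the PATH.  A real compliance suite would more likely drive such checks
# through the standard management interfaces rather than shelling out.
import subprocess


def hdfs_daemons_running(expected=("NameNode", "DataNode")):
    """Check that the expected HDFS daemons appear in the JVM process list."""
    out = subprocess.run(["jps"], capture_output=True, text=True, check=True).stdout
    return {name: name in out for name in expected}


def hdfs_out_of_safemode():
    """Daemon-level check that HDFS is not stuck in safemode."""
    out = subprocess.run(
        ["hdfs", "dfsadmin", "-safemode", "get"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "Safe mode is OFF" in out


if __name__ == "__main__":
    print("Daemons:", hdfs_daemons_running())
    print("Out of safemode:", hdfs_out_of_safemode())
```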
This work is licensed under a Creative Commons Attribution 4.0 International License