SIDN Labs Hadoop Provisioning

SIDN Labs Hadoop Provisioning Manager makes it easier to deploy a Hadoop-based data analytics cluster. The analytics cluster supports the following well-known components:

  • Apache Hadoop
  • Apache Impala
  • Apache Spark
  • Apache Hive
  • Apache Ranger
  • Apache Zookeeper
  • Monitoring (Prometheus and Grafana)
  • Hue
  • Apache Superset
  • JupyterHub/JupyterLab
  • Apache Airflow
  • Apache Iceberg

The cluster is automatically configured with features such as:

  • HDFS High Availability
  • Authentication (Kerberos)
  • Authorization (Apache Ranger)
  • TLS support for all web interfaces

A management console is provided to help administrators and users navigate all the available services.

(Screenshot of the console web application)

Monitoring is automatically configured using Prometheus and Grafana. Dashboards are provided to help administrators monitor all services and receive alerts when a service becomes unavailable.

(Screenshot of the Grafana monitoring dashboard)

Requirements

  • At least 5 physical/virtual servers for testing, 8 for production usage.
  • A single TLS-certificate which is valid for every server in the cluster.
  • The management server will need 16GB of RAM to prevent issues when building Docker images.

The TLS-certificate may be a wildcard certificate or use a SAN-list that contains each server name.

Components

This project installs the following components:

Name Description Optional
common Generic packages and config No
manager Packages and config for management node No
console A web-based management interface No
monitor Prometheus and Grafana server Yes
monitor-proxy Prometheus proxies Yes
db Postgresql database Yes
docker-reg-ui Docker registry UI No
zookeeper Zookeeper packages and config No
hadoop Hadoop (HDFS/YARN) packages and config No
spark Spark packages and config No
hive Hive metastore packages and config No
impala Impala packages and config No
hue Hue packages and config No
livy Livy packages and config No
ranger Ranger packages and config No
solr Solr packages and config No
airflow Airflow packages and config Yes
tls TLS certificates and config for all components No
krb-kdc Deploy Kerberos KDC server No
krb-principal Generate Kerberos principals for all components No
krb-keytab Kerberos keytabs for deployed components No
client-config Hadoop client configuration files No

Versions

The default component versions are defined in config/vars/all.yml. These versions are known-good versions that have been tested to work together correctly. It is possible to change the version of a component, but be sure to test that everything functions correctly after deployment.

Binary packages

Most of the components are downloaded as compiled binaries from the official project download sites. Some open source projects, such as Apache Impala, do not provide compiled packages. The components listed below have been compiled by us and their binary packages are made available through the SIDN download site. Only the listed versions can be installed.

Name Version
Apache Impala 4.20
Apache Ranger 2.2.0
Apache Livy 0.8.0

OS Support

Both Red Hat Enterprise Linux (RHEL) and Ubuntu are supported, although only a RHEL deployment has been fully tested. It is possible to create a mixed cluster, for example with the data nodes running RHEL and the gateways running Ubuntu.

  • Red Hat Enterprise Linux (RHEL) 7.9
  • Ubuntu 20.04/22.04 LTS

Red Hat Enterprise Linux (RHEL)

At the moment only RHEL 7.9 is tested.

Getting RHEL

RHEL is a commercial Linux distribution and requires a paid license. However, for evaluation and testing (non-production) use cases, a free personal developer license can be obtained from Red Hat. See: https://developers.redhat.com/articles/faqs-no-cost-red-hat-enterprise-linux

Installation

After installing the OS, the subscription must be activated and the required repositories need to be enabled:

subscription-manager register
subscription-manager attach
subscription-manager repos --enable=rhel-7-server-extras-rpms --enable=rhel-7-server-optional-rpms --enable=rhel-server-rhscl-7-rpms --enable=rhel-7-server-thirdparty-oracle-java-rpms

Deployment overview

Cluster functions are coupled to roles; a role can be deployed to one or more cluster nodes. The roles are mapped to groups in the Ansible hosts file. Available roles:

Role name Description
manager Ansible deploy scripts
console Management web interface
zookeeper Zookeeper instance
hadoop Data node (HDFS data node, impalad, YARN node manager)
hdfs_nn HDFS primary name node and standby name node(s)
hdfs_journalnode HDFS journaling node
hdfs_httpfs HDFS Httpfs node
yarn_resource_mgr Yarn Resource manager
yarn_timelineserver Yarn Timeline server
spark_history Spark History server
spark_thrift Spark Thrift server
hive (metastore) Hive metastore instance
kerberos_kdc Kerberos Key Distribution Center
ranger Ranger server
database Postgresql server
impala Impala Executor instance
impala_statestore Impala Statestore instance
impala_catalog Impala Catalog instance
impala_ha_proxy Impala HaProxy loadbalancer
hue Hue server
livy Livy server
docker-reg-ui Docker registry UI
gateway Gateway host, client only
monitor Monitoring (Prometheus and Grafana)

In general it is best to map a role to the hosts best suited for that specific role. A data node host will have lots of disk storage and CPU cores, while a controller node will have limited disk storage and CPU cores. Deploy the hadoop role (HDFS data node, YARN node manager) on a data node and the Hadoop management processes such as hdfs_nn on a controller node. This ensures that heavy data processing does not negatively affect the name node process.

A logical distribution of cluster roles across cluster host types could look something like this:

Host type Roles
Management node manager, console, kerberos_kdc, ranger, solr, database, monitor, docker registry, docker registry UI
Controller node zookeeper (3x), hdfs_nn, hdfs_journalnode, hdfs_httpfs, yarn_resource_mgr, yarn_timelineserver, spark_history, spark_thrift, hive (metastore), impala_statestore, impala_catalog, impala_ha_proxy, livy
Data node hadoop, impala
Gateway node gateway, hue, superset, jupyterhub

Management host

The management host is the central node from which services and configuration are deployed to the cluster nodes. The management host must have a user account that has:

  • SSH access to all cluster nodes
  • Permission for sudo on all cluster nodes
  • Access to the internet

Installation

Replacing an existing Hadoop deployment

Replacing an existing Hadoop deployment such as Cloudera Hadoop is no problem: the components and configuration of the existing Hadoop deployment are not modified or removed. Both Hadoop deployments can coexist on a host simultaneously. It is, however, only possible to have one active Hadoop deployment running at the same time.

Before deploying the new components, make sure that the configuration settings of the new deployment match the settings of the old deployment. For example, make sure the HDFS data node configuration of the new deployment uses the same storage locations as the existing deployment.

Deploying a cluster from scratch requires the following steps to be executed.

  1. Create cluster hosts (hardware, virtual machine or combination)
  2. Configure valid DNS (forward and reverse)
  3. Create manager user
  4. Configure SSH
  5. Install SIDN-HADOOP
  6. Install Ansible
  7. Create a custom configuration
  8. Upload the TLS certificate and key
  9. Prepare hosts
  10. Create database backup
  11. Deploy components
  12. Start components
  13. Add users

Create cluster hosts

Create the virtual machines or physical servers that will be used as the deployment targets for the cluster services. The minimum requirement is 5 hosts. The management and gateway roles may also be combined on a single host, but this is not advised. It is recommended to create dedicated hosts for the deployment of management functions such as the HDFS name node and Zookeeper processes. Users of cluster services should only be allowed to access the gateway host, and it is therefore best not to deploy any services on the host that is running as a gateway.

  • 1x Management host
  • 3x data host
  • 1x gateway host

A data host can also be used as a controller host, but this is not recommended for production. It would be better to also add:

  • 3x controller host

The minimum number of hosts for a production cluster is 8.

Configure valid DNS (forward and reverse)

All cluster hosts must have a valid domain name, either in DNS or in the hosts file on every node. It is important that reverse DNS, resolving an IP address back to a domain name, also works correctly.
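
A quick way to check both directions from any cluster node (the host name and IP address below are examples):

getent hosts host1.example.nl
getent hosts 192.0.2.10

The first command should return the host's IP address, the second the fully qualified domain name; getent also covers entries in the local hosts file.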

Create manager user

On the management node, create a manager user. This is the user that will be used to execute all the Ansible scripts. The name of this user must match the value of the provision_user variable in the vars/all.yml configuration file.

Configure SSH

On the management node, create an SSH key pair for the manager user account; create the key without a passphrase. Create the keys using: ssh-keygen -t rsa -b 4096

Install SIDN-HADOOP

On the management host, as the manager user, clone the repository.

git clone git@gitlab.sidnlabs.nl:maarten/hadoop-provision.git

Install Ansible

On the management host, install the required dependencies.

  • Python3
  • Ansible

RHEL

sudo yum install python3 ansible

Ubuntu

sudo apt install python3
sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

The Ansible scripts depend on additional Ansible libraries; install these with:

cd hadoop-provision/src
ansible-galaxy install -r requirements.yml

Create a custom configuration

Services are configured using multiple configuration files.

The <repo-dir>/config directory contains a set of example configuration files.

  1. Copy the default configuration directory to a location outside of the repository; this new directory can be versioned in your own version control system, such as GitHub (see the copy example after this list).

The configuration directory contains the following:

Item Description
hosts file Contains a mapping between components and hosts
ansible.cfg file Contains Ansible configuration
vars directory This directory contains the configuration files for every component
python directory This directory contains requirements files for Python environments
  2. Edit the hosts file to specify on which host each service is deployed.
  3. Edit the configuration files in the vars directory to match your own requirements.
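
For example, a minimal copy step for step 1, assuming the repository was cloned to ~/hadoop-provision and ~/hadoop-config is the (hypothetical) location for the custom configuration:

cp -r ~/hadoop-provision/config ~/hadoop-config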

Most options have a reasonable default value that may not need to be changed, but options for passwords and resources such as directory names and memory size MUST be changed before executing the deployment script.

Sensitive variables such as passwords are not configured directly in a configuration file. These variables must be saved in an Ansible vault file; follow these steps to create a new vault file:

  1. Create new unencrypted vault file: cp <config_dir>/vars/vault_password_vars.template <config_dir>/vars/vault_password_vars.yml
  2. Encrypt the vault file: ansible-vault encrypt <config_dir>/vars/vault_password_vars.yml
  3. Edit the vault file to set the values: ansible-vault edit <config_dir>/vars/vault_password_vars.yml

The contents of the vault file can be viewed using ansible-vault view <file> and edited using ansible-vault edit <file>.

All variables in the configuration files that have a "vault_" prefix are references to a variable in the vault_password_vars.yml vault file.

Ansible must be able to read the encrypted vault file and therefore needs to know the password for this file. Create a password file that Ansible can use to read the vault; this file must be named .vault_pass and placed in the configuration directory.

  1. Create a new .vault_pass file, like this for the example password "my_password": echo "my_password" > <config_dir>/.vault_pass

MAKE SURE THE .vault_pass FILE IS NOT SAVED IN VERSION CONTROL. When using Git(Hub/Lab), add .vault_pass to .gitignore.
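
For example, when the configuration directory is tracked with Git:

cd <config_dir>
echo ".vault_pass" >> .gitignore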

Upload the TLS certificate and key

TLS is used to protect the network and web interfaces for the available services. The TLS-certificate must be valid for the domain name used for the cluster hosts.

For example, if there are 3 hosts with the following names:

host1.example.nl
host2.example.nl
host3.example.nl

The TLS certificate must either be a wildcard certificate matching *.example.nl or a multi-domain (SAN) certificate that explicitly includes all the host names.

The certificate must be PEM formatted and the PEM file must include all required intermediate certificates.
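
A quick way to inspect the certificate before uploading it (the path below is a placeholder):

openssl x509 -in /path/to/cluster-cert.pem -noout -subject
openssl x509 -in /path/to/cluster-cert.pem -noout -text | grep -A1 "Subject Alternative Name"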

  1. Upload the certificate to the management node
  2. Make sure the manager user has read permission.
  3. Configure the location of the certificate using the tls_cert option in config/vars/tls_vars.yml

Prepare hosts

Prepare all cluster hosts by creating the required manager user account and setting up SSH access on all cluster nodes.

The Ansible deployment scripts require a manager user account to be present on all cluster nodes. The default name of this account, hadoop-mgr, is configured in config/vars/all.yml. The name must match the name of the manager user created earlier on the management host. If you used another name, update the value of provision_user in config/vars/all.yml.

The prepare-hosts.sh script does the following:

  • Add the manager user account to all the hosts found in the hosts file in the configuration directory.
  • Add the SSH-key of the manager user on the management node to the authorized_keys of manager user on the other cluster hosts.
  • Add the manager user on the cluster host to the sudoers list.

The script requires that you already have another remote user account on the systems that is allowed to use sudo and can be used to create the new manager user account on the nodes. The existing account name and password must be the same on all hosts.

prepare-hosts.sh <admin-user-on-remote-hosts>

The script will ask for the SSH password for the admin user on the remote host and the new password for the manager user account that is going to be created on the remote host.

If you want to manually create the manager user on all nodes, follow the steps below.

RHEL

adduser hadoop-mgr
usermod -aG wheel hadoop-mgr
passwd hadoop-mgr

Ubuntu

adduser hadoop-mgr
usermod -aG sudo hadoop-mgr
passwd hadoop-mgr

Distribute the public key to all other nodes (including the management node itself): ssh-copy-id <hostname>

Make sure to also add the new key to the manager user on the management node. This means adding the public key of the manager user to the authorized_keys of the manager user itself. This is necessary because Ansible will create an SSH session from the management node to the management node, in effect starting an SSH connection to itself.
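
A minimal way to do this, as the manager user on the management node and assuming the default RSA key location:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys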

Create database backup

When upgrading an existing Hadoop cluster it is advisable to back up the existing database schemas. The components listed below use a database:

  • Apache Hive Metastore
  • Hue
  • Apache Ranger

Using Postgresql, a backup can be performed like this: pg_dump -d <database> -h <server_host> -U <username> -Fc > database.backup

Deploy components

Before executing the deploy script, the following environment variables need to be set:

Variable Description
ANSIBLE_BECOME_PASSWORD Sudo password for the manager user on the cluster nodes
SIDN_HADOOP_CFG_DIR Path to your custom config directory

The scripts will search for an environment file called set-env.sh. This file, or a symbolic link with this name, may be placed in the "src" directory. See below for example content for set-env.sh:

export SIDN_HADOOP_CFG_DIR=/home/user/my-custom-config
export ANSIBLE_BECOME_PASSWORD=mypassword

Create a link in the repository src/bin directory to the location of the environment file.

cd <repo>/src/bin
ln -s /path/to/custom/set-ansible-vars.sh set-env.sh

Use the deploy.sh script to deploy components to the cluster hosts; the script accepts multiple components. There is a special "all" component that installs all components sequentially in the correct order. When more control is desired, it is advised to deploy a single component at a time. The order in which components are deployed is important because of dependencies between components; use the same ordering as in deploy-all.yml.
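
For example, to deploy everything in one go, or only a few components (component names as listed in the components table; a sketch, the actual order should follow deploy-all.yml):

./deploy.sh all
./deploy.sh common zookeeper hadoop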

Start components

Use the start.sh script to start services on the cluster hosts; the script accepts multiple components. There is a special "all" component that starts all services.
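
For example, to start all services, or only HDFS and YARN:

./start.sh all
./start.sh hdfs yarn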

Add users

Use the add-user.sh script to create a new cluster user. It will create:

  • A user account on all data nodes (no home directory or login shell)
  • A user account on all gateway nodes with a home directory and login shell
  • A principal in the Kerberos database
  • A user in the Ranger database (this might take a few minutes to show up)
  • A HDFS home directory

Console interface

The console service is a web application with an overview of all services and related information. It is also possible to download the Hadoop and Kerberos client configuration files directly from the console. The console web interface is deployed using the "console" role in the hosts config file; the default listening port is 5000.

Kerberos browser configuration

All web interfaces are protected using Kerberos and can only be accessed when a Kerberos session is active. Configure your browser to support SPNEGO, for more information on how to do this, see the information in the console web application.

Create a Kerberos session using a password or keytab with kinit, or on macOS with the Ticket Viewer application (/System/Library/CoreServices/Applications/Ticket\ Viewer.app/). kinit example: kinit -kt /etc/path/to/keytab myuser@KERBEROS-REALM

Make sure to have a correct Kerberos client configuration (krb5.conf) on the client before creating a Kerberos session. A valid Kerberos client configuration can be downloaded from the console using the client configuration download button.
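
To check whether a valid Kerberos ticket is present on the client, list the ticket cache:

klist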

Additional information

Available scripts

There are multiple control scripts available to manage the cluster.

Script Description
deploy.sh Install 1 or more components on cluster nodes
start.sh Start 1 or more components
stop.sh Stop 1 or more components
add-user.sh Add a new user to Kerberos db and cluster nodes
del-user.sh Delete an existing user from Kerberos db and cluster nodes
prepare-hosts.sh Create the manager user on all cluster nodes
add-py-env.sh Add new Python virtualenv on cluster nodes
del-py-env.sh Delete Python virtualenv on cluster nodes
hdfs-upgrade.sh Prepare and finalize HDFS upgrade

Impala

Due to an incompatibility between Apache Impala 4.0 and Apache Hive, it is not possible to deploy Apache Impala 4.0; this is fixed in version 4.0.1. See: https://issues.apache.org/jira/browse/IMPALA-10756

The load balancer is used to distribute queries across all Impala nodes. It is recommended that users always connect to the load balancer host to make sure queries are evenly distributed across the Impala nodes.

Impala security is enabled by default and Apache Ranger is used to manage permissions. Use the Ranger interface to easily manage Impala user permissions.

Impala-shell

To use the impala-shell command-line interface, the following 3 steps are required:

  1. Create a Kerberos session with:

kinit -kt /path/to/keytab <principal>@<REALM_NAME>

  2. Set the correct path to the required libraries by sourcing the environment file:

source /etc/<impala-conf-dir>/impalad-env.sh

  3. Activate the Python 2 virtualenv when the default Python version is not Python 2: . /usr/local/hadoop/hadoop-venvs/py2-impala-shell/bin/activate

Use the following arguments when connecting to the impala_proxy host.

  • -i the fqdn of the <impala_proxy> host
  • --ssl Impala server uses SSL/TLS
  • -k use Kerberos session

Example:

impala-shell -i <proxy_fqdn>:<shell_port> --ssl -k

When directly connecting to an impalad node, use the following additional argument:

  • -b <proxy_fqdn> tells Impala to use the hostname of the impala_proxy when connecting

Example:

impala-shell -i <proxy_fqdn>:<shell_port> --ssl -k -b <proxy_fqdn>

HDFS Upgrade

Upgrading an existing HDFS cluster is only supported with downtime. The scripts used to start HDFS will detect if a newer version of Hadoop is deployed and will auto upgrade the HDFS metadata during name node start.

Use the following process when upgrading:

  1. Run hdfs-upgrade.sh prepare while the current older Hadoop version is still running.
  2. Stop hdfs and yarn services: stop.sh hdfs yarn
  3. Update the Hadoop version in vars/all.yml
  4. Deploy new version of Hadoop package with deploy.sh hadoop
  5. Start HDFS and YARN with: start.sh hdfs yarn. This step will upgrade the HDFS metadata during the start of the name node; before the upgrade it will create a backup of the name node metadata.
  6. Run your workload jobs and check that everything works correctly. If it does, finalize the upgrade with: hdfs-upgrade.sh finalize

When migrating from an older Hadoop distribution such as Cloudera Hadoop, it is advisable to first deploy a version of Apache Hadoop that matches the version used by your current Cloudera deployment. When deploying a newer version of Hadoop, the HDFS metadata might be upgraded, which may not be what you want.

When migrating away from CDH, only upgrade to a newer Hadoop version after the deployment of a version matching the current CDH Hadoop version was successful and you have been using it for some time without issues.

Spark

The following Spark components are installed:

  • Spark (utils and libraries)
  • Thrift server
  • History server

Spark Thrift Server

The Spark Thrift server is started with the Kerberos service account <spark_thrift_user>/_HOST@REALM. To create a connection, first make sure a Kerberos session exists (kinit) and then use <spark_thrift_user>/_HOST@REALM as the principal in the connection string.

For example, when the Spark Thrift server is deployed on testserver1.example.nl, use this Beeline test example: beeline -u "jdbc:hive2://<hostname>:10000/default;principal=spark/testserver1.example.nl@REALM_NAME"

Hive

Only the Hive metastore will be deployed.

Hive metastore upgrade

If you have an existing metastore database, then this database must be upgraded to the version required by the installed Hive version.

Upgrading from metastore version 2.1.1 (CDH 6.x)

When migrating from Cloudera Express (CDH 6.x), the deploy.sh script will most likely detect metastore version 2.1.1. It will then copy modified upgrade scripts to the Hive server. Starting the Hive metastore will kick off the upgrade process, and the metastore will be upgraded to the version required by the installed Hive version.

If the old metastore database is hosted by an existing Postgresql server, and this server is not used for the new cluster, then copy the metastore database to the new Postgresql server.

Export existing metastore database: pg_dump -d metastore -h <server_host> -U <username> -Fc > metastore.back

Import the metastore database in the new Postgresql server:

  • Create a new metastore database owned by the user hive (see the sketch after this list)
  • Import the database: pg_restore -d metastore -h <server_host> -U <username> metastore.back
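
A minimal sketch of the database creation step above, assuming psql access as the postgres superuser (connection details are placeholders):

psql -h <server_host> -U postgres -c "CREATE DATABASE metastore OWNER hive;"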

Upgrading from other unsupported metastore versions

The upgrade might fail because the Hive upgrade scripts return an error. To fix this, manually alter the Hive upgrade scripts found in <HIVE_HOME>/scripts/metastore/upgrade/

Authorization

Apache Ranger is used for authorization of HDFS requests and Impala SQL queries. The Ranger interface can be accessed via the console web page. Make sure to have a valid Kerberos session for an admin user when accessing the Ranger console; otherwise the Ranger Admin website will only allow limited access.

After a fresh install it is advisable to delete all policies for the "HDFS" and "HADOOP SQL" services. The default policies might cause access/permission denied errors.

SOLR

SOLR is used to store Ranger audit logging; the SOLR index data can grow quite large when running a larger cluster. Cleaning up older SOLR data can be done with a simple API command.

The following example command deletes all data older than 7 days.

curl --negotiate -u : "<solr_hostname>:8983/solr/ranger_audits/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>evtTime:[* TO NOW-7DAYS]</query></delete>"

A cleanup service is created during the deployment of Solr, the cleanup service will remove old data from Solr every night at 01:00. The max allowed age of the Solr data can be configured in the Solr configuration file.

Cluster management

Replacing the TLS certificate

When the existing TLS certificate needs to be replaced (because it might expire soon), do the following:

  • Copy the new TLS certificate to the correct location on the management server.
  • ./deploy.sh tls
  • ./stop.sh all
  • ./start.sh all
  • ./do.sh enable-hdfs-ranger-plugin

The last step is required because otherwise the Ranger HDFS plugin will not be able to connect to the updated Ranger Admin service.

Adding additional hosts

Adding new hosts requires careful execution of the following steps to make sure all services are correctly deployed on their assigned hosts; a sketch of these steps using the control scripts follows the list.

  • Stop all services
  • Add the new host to the desired groups in the hosts configuration file
  • Start the database, Kerberos KDC and console services
  • Deploy all services assigned to the new host
  • Deploy tls
  • Deploy Kerberos principals
  • Deploy Kerberos keytabs
  • Start all services
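
A sketch of how these steps might map onto the control scripts, using component names from the components table; the hosts file edit happens between the stop and start steps, and the exact components to deploy depend on the roles assigned to the new host:

./stop.sh all
./start.sh db krb-kdc console
./deploy.sh <components-for-the-new-host>
./deploy.sh tls
./deploy.sh krb-principal
./deploy.sh krb-keytab
./start.sh all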

Tips

Make sure the database and Kerberos KDC are running before deploying other components

Some components require the database and/or the Kerberos KDC to be up and running before the component can be deployed. Fix: start.sh db krb-kdc

Impala, no databases and/or tables visible

Make sure the "impala" policy in Apache Ranger allows users in the group "hadoop_users" to view and select from the desired databases and/or tables. Every user must have an account on each Impala node in the cluster; this is done automatically when adding users with the add-user.sh script.

Kerberos login fails after redeploying keytabs

When the keytabs are redeployed while services are running, services such as HDFS cannot authenticate using the new keytabs until they are restarted.
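
A minimal fix sketch, assuming HDFS is the affected component (substitute the component names that apply to your situation):

./stop.sh hdfs
./start.sh hdfs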

Copyright and license

Code and documentation copyright 2022 SIDN. Code released under the MIT License.
