SIDN Labs Hadoop Provisioning Manager makes it easier to deploy a Hadoop-based data analytics cluster. The analytics cluster supports the following well-known components:
- Apache Hadoop
- Apache Impala
- Apache Spark
- Apache Hive
- Apache Ranger
- Apache Zookeeper
- Monitoring (Prometheus and Grafana)
- Hue
- Apache Superset
- JupyterHub/JupyterLab
- Apache Airflow
- Apache Iceberg
The cluster is automatically configured with features such as:
- HDFS High Availability
- Authentication (Kerberos)
- Authorization (Apache Ranger)
- TLS support for all web interfaces
A management console is provided to help administrators and users navigate all the available services.
Monitoring is automatically configured using Prometheus and Grafana. Dashboards are provided to help administrators monitor all services and receive alerts when a service becomes unavailable.
- At least 5 physical/virtual servers for testing, 8 for production usage.
- A single TLS certificate that is valid for every server in the cluster.
- The management server will need 16GB of RAM to prevent issues when building Docker images.
The TLS certificate may be a wildcard certificate or use a SAN list that contains each server name.
This project installs the following components:
Name | Description | Optional |
---|---|---|
common | Generic packages and config | No |
manager | Packages and config for management node | No |
console | A web-based management interface | No |
monitor | Prometheus and Grafana server | Yes |
monitor-proxy | Prometheus proxies | Yes |
db | Postgresql database | Yes |
docker-reg-ui | Docker registry ui | No |
zookeeper | Zookeeper packages and config | No |
hadoop | Hadoop (HDFS/YARN) packages and config | No
spark | Spark packages and config | No |
hive | Hive metastore packages and config | No |
impala | Impala packages and config | No |
hue | Hue packages and config | No |
livy | Livy packages and config | No |
ranger | Ranger packages and config | No |
solr | Solr packages and config | No |
airflow | Airflow packages and config | Yes |
tls | TLS certificates and config for all components | No |
krb-kdc | Deploy Kerberos KDC server | No
krb-principal | Generate Kerberos principals for all components | No |
krb-keytab | Kerberos keytabs for deployed components | No |
client-config | Hadoop client configuration files | No |
The default component versions can be found in config/vars/all.yml. These versions are known to work correctly together. It is possible to change the version of a component, but be sure to test that everything functions correctly after deployment.
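As an illustration, a component version is typically set with a simple variable in this file; the variable name and value below are illustrative only, check the file itself for the real names:
# illustrative example, look up the actual variable name in config/vars/all.yml
spark_version: "3.3.2"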
Most components are downloaded as compiled binaries from the official project download sites. Some open source projects, such as Apache Impala, do not provide compiled packages. The components listed below have been compiled by us and the binary packages have been made available through the SIDN download site. Only the listed versions can be installed.
Name | Version |
---|---|
Apache Impala | 4.20 |
Apache Ranger | 2.2.0 |
Apache Livy | 0.8.0 |
Both Red Hat Enterprise Linux (RHEL) and Ubuntu are supported, although only a RHEL deployment has been fully tested. It is possible to create a mixed cluster, for example with the data nodes using RHEL and the gateways Ubuntu.
- Red Hat Enterprise Linux (RHEL) 7.9
- Ubuntu 20.04/22.04 LTS
At the moment only RHEL 7.9 is tested.
RHEL is a commercial Linux distribution and a paid license is required; however, for evaluation and testing (non-production) use cases a free personal developer license can be granted by Red Hat. See: https://developers.redhat.com/articles/faqs-no-cost-red-hat-enterprise-linux
After installing the OS, the subscription must be activated and the required repositories need to be enabled:
subscription-manager register
subscription-manager attach
subscription-manager repos --enable=rhel-7-server-extras-rpms --enable=rhel-7-server-optional-rpms --enable=rhel-server-rhscl-7-rpms --enable=rhel-7-server-thirdparty-oracle-java-rpms
Cluster functions are coupled to roles; a role can be deployed to one or more cluster nodes. The roles are mapped to groups in the Ansible hosts file. Available roles:
Role name | Description |
---|---|
manager | Ansible deploy scripts |
console | Management webinterface |
zookeeper | Zookeeper instance |
hadoop | Data node (HDFS data node, YARN node manager, impalad) |
hdfs_nn | HDFS primary name node and standby name node(s) |
hdfs_journalnode | HDFS journaling node |
hdfs_httpfs | HDFS Httpfs node |
yarn_resource_mgr | Yarn Resource manager |
yarn_timelineserver | Yarn Timeline server |
spark_history | Spark History server |
spark_thrift | Spark Thrift server |
hive (metastore) | Hive metastore instance |
kerberos_kdc | Kerberos Key Distribution Center |
ranger | Ranger server |
database | Postgresql server |
impala | Impala Executor instance |
impala_statestore | Impala Statestore instance |
impala_catalog | Impala Catalog instance |
impala_ha_proxy | Impala HaProxy loadbalancer |
hue | Hue server |
livy | Livy server |
docker-reg-ui | Docker registry UI |
gateway | Gateway host, client only |
monitor | Monitoring (Prometheus and Grafana) |
In general it is best to map a role to the hosts best suited for that specific role. A data node host will have lots of disk storage and CPU cores, while a controller node will have limited disk storage and CPU cores. Deploy Hadoop (HDFS data node, YARN node manager) on a data node and the Hadoop management processes such as hdfs_nn on a controller node. This ensures that heavy data processing does not negatively affect the name node process.
A logical distribution of cluster roles across cluster host types could look something like this:
Host type | Roles |
---|---|
Management node | manager, console, kerberos_kdc, ranger, solr, database, monitor, docker registry, docker registry UI |
controller node | zookeeper (3x), hdfs_nn, hdfs_journalnode, hdfs_httpfs, yarn_resource_mgr, yarn_timelineserver, spark_history, spark_thrift, hive (metastore), impala_statestore, impala_catalog, impala_ha_proxy, livy |
data node | hadoop, impala |
gateway node | gateway, hue, superset, jupyterhub |
The management host is the central node from which services and configuration are deployed to the cluster nodes. The management host must have a user account that has:
- SSH access to all cluster nodes
- Permission for sudo on all cluster nodes
- Access to the internet
Replacing an existing Hadoop deployment such as Cloudera Hadoop is no problem; the components and configuration of the existing Hadoop deployment are not modified or removed. Both Hadoop deployments can coexist on a host simultaneously. It is, however, only possible to have one active Hadoop deployment running at the same time.
Before deploying the new components, make sure that the configuration settings of the new deployment match the settings of the old deployment. For example, make sure the HDFS data node configuration of the new deployment uses the same storage locations as the existing deployment.
Deploying a cluster from scratch requires the following steps to be executed.
- Create cluster hosts (hardware, virtual machine or combination)
- Configure valid DNS (forward and reverse)
- Create manager user
- Configure SSH
- Install SIDN-HADOOP
- Install Ansible
- Create a custom configuration
- Upload the TLS certificate and key
- Prepare hosts
- Create database backup
- Deploy components
- Start components
- Add users
Create the virtual hosts or physical servers that will be used as the deploy targets for the cluster services. The minimum requirement is to create 5 hosts. The management and gateway roles may also be combined on a single host, but this is not advised. It is recommended to create dedicated hosts for the deployment of management functions such as the HDFS name node and Zookeeper processes. Users of cluster services should only be allowed to access the gateway host, and it is therefore best not to deploy any services on the host that is running as a gateway.
- 1x Management host
- 3x data host
- 1x gateway host
A data host can also be used as a controller host, but this is not recommended for production. It would be better to also add:
- 3x controller host
The minimum number of hosts for a production cluster would be 8.
All hosts of the cluster must have a valid domain name, either in DNS or in the hosts file on every node. It is important that reverse DNS, resolving an IP address to a domain name, also works correctly.
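A quick way to verify both lookups on a node (the hostname and IP address are placeholders):
getent hosts host1.example.nl    # forward lookup: name to address
getent hosts 192.0.2.11          # reverse lookup: address to name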
On the management node, create a manager user; this is the user that will be used to execute all the Ansible scripts. The name of this user must match the value of the provision_user variable in the vars/all.yml configuration file.
On the management node, create an SSH key pair for the manager user account; create the key without a passphrase.
Create keys using: ssh-keygen -t rsa -b 4096
On the management host, as the manager user, clone the repository:
git clone git@gitlab.sidnlabs.nl:maarten/hadoop-provision.git
On the management host, install the required dependencies.
- Python3
- Ansible
On RHEL:
sudo yum install python3 ansible
On Ubuntu:
sudo apt update
sudo apt install python3 software-properties-common
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install ansible
The Ansible scripts depend on additional Ansible libraries, install these with:
cd hadoop-provision/src
ansible-galaxy install -r requirements.yml
Services are configured using multiple configuration files. The <repo-dir>/config directory contains a set of example configuration files.
- Copy the default configuration directory to a location outside of the repository; this new directory can be versioned in your own version control system, such as GitHub.
The configuration directory contains the following:
Item | Description |
---|---|
hosts file | Contains a mapping between components and hosts |
ansible.cfg file | Contains Ansible configuration |
vars directory | This directory contains the configuration files for every component |
python directory | This directory contains requirements files for Python environments |
- Edit the hosts file to specify which hosts each service will be deployed on.
- Edit the configuration files in the vars directory to match your own requirements.
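As a minimal sketch, the hosts file maps the group names from the roles table above to hosts; the hostnames below are placeholders and only a few groups are shown:
[manager]
mgmt1.example.nl

[zookeeper]
ctrl1.example.nl
ctrl2.example.nl
ctrl3.example.nl

[hadoop]
data1.example.nl
data2.example.nl
data3.example.nl

[gateway]
gw1.example.nl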
Most options have a reasonable default value that may not need to be changed, but options for passwords and resources such as directory names and memory size MUST be changed before executing the deployment script.
Sensitive variables such as passwords are not configured directly in a configuration file. These variables must be saved in an Ansible vault file; follow these steps to create a new vault file.
- Create new unencrypted vault file:
cp <config_dir>/vars/vault_password_vars.template <config_dir>/vars/vault_password_vars.yml
- Encrypt the vault file:
ansible-vault encrypt <config_dir>/vars/vault_password_vars.yml
- Edit the vault file to set the values:
ansible-vault edit <config_dir>/vars/vault_password_vars.yml
The contents of the vault file can be viewed using ansible-vault view <file> and edited using ansible-vault edit <file>.
All variables in the configuration files that have a "vault_" prefix are references to a variable in the vault_password_vars.yml vault file.
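For example, a hypothetical database password could be wired up like this; the variable names are illustrative, only the vault_ prefix convention comes from the project:
# in a vars file, e.g. <config_dir>/vars/db_vars.yml (illustrative names)
db_admin_password: "{{ vault_db_admin_password }}"
# in <config_dir>/vars/vault_password_vars.yml (encrypted with ansible-vault)
vault_db_admin_password: "a-strong-secret"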
Ansible must be able to read the encrypted vault file and needs to know the password for this file. Create a password file that Ansible can use to read the vault; this file must be named .vault_pass and placed in the configuration directory.
- Create a new .vault_pass file, like this for the example password "my_password":
echo "my_password" > <config_dir>/.vault_pass
MAKE SURE THE .vault_pass FILE IS NOT SAVED IN VERSION CONTROL. When using Git(Hub/Lab), add .vault_pass to .gitignore.
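For example, when the configuration directory itself is a Git repository:
echo ".vault_pass" >> <config_dir>/.gitignore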
TLS is used to protect the network connections and web interfaces of the available services. The TLS certificate must be valid for the domain names used for the cluster hosts.
For example, if there are 3 hosts with the following names:
host1.example.nl
host2.example.nl
host3.example.nl
The TLS certificate must have a wildcard match *.example.nl
or be a multi-domain (SAN) certificate and have all the host names explicitly included in the certificate.
The certificate must be PEM formatted and the PEM file must include all required intermediate certificates.
- Upload the certificate to the management node
- Make sure the manager user has read permission.
- Configure the location of the certificate using the tls_cert option in config/vars/tls_vars.yml
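A minimal sketch of the relevant setting in config/vars/tls_vars.yml; the path is a placeholder:
tls_cert: /home/hadoop-mgr/certs/cluster-cert.pem    # PEM file, including intermediate certificates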
Prepare all cluster hosts by creating the required manager user account and setting up SSH access on all cluster nodes.
The Ansible deployment scripts require a manager user account to be present on all cluster nodes.
The default name of this account, hadoop-mgr, is configured in config/vars/all.yml. The name must match the name of the manager user created earlier on the management host. If you have used another name, update the value of provision_user in config/vars/all.yml.
The prepare-hosts.sh script does the following:
- Add the manager user account to all the hosts found in the hosts file in the configuration directory.
- Add the SSH key of the manager user on the management node to the authorized_keys of the manager user on the other cluster hosts.
- Add the manager user on the cluster host to the sudoers list.
The script requires that you already have another remote user account on the systems that is allowed to use sudo and can be used to create the new manager user account on the nodes. The existing account name and password must be the same on all hosts.
prepare-hosts.sh <admin-user-on-remote-hosts>
The script will ask for the SSH password for the admin user on the remote host and the new password for the manager user account that is going to be created on the remote host.
If you want to manually create the manager user on all nodes, follow the steps below.
On RHEL:
adduser hadoop-mgr
usermod -aG wheel hadoop-mgr
passwd hadoop-mgr
On Ubuntu:
adduser hadoop-mgr
usermod -aG sudo hadoop-mgr
passwd hadoop-mgr
Enter the new password when passwd prompts for it.
Distribute the public key to all other nodes (including the management node itself): ssh-copy-id hostname
Make sure to also add the new key to the manager user on the management node. This means adding the public key of the manager user to the authorized_keys of the manager user itself. This is necessary because Ansible will create an SSH session to the management node from the management node, in effect starting an SSH connection to itself.
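For example, assuming the manager user is hadoop-mgr and the default key location is used (the hostname is a placeholder):
ssh-copy-id hadoop-mgr@data1.example.nl    # repeat for every cluster node
ssh-copy-id hadoop-mgr@$(hostname -f)      # add the key to the management node itself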
When upgrading an existing Hadoop cluster it is advisable to back up the existing database schemas. The components listed below use a database.
- Apache Hive Metastore
- Hue
- Apache Ranger
Using Postgresql, a backup can be performed like this:
pg_dump -d <database> -h <server_host> -U <username> -Fc > database.backup
Before executing the deploy script the following environment variables need to be set.
Variable | Description |
---|---|
ANSIBLE_BECOME_PASSWORD | Sudo password for manager user on cluster node |
SIDN_HADOOP_CFG_DIR | Path to your custom config directory |
The scripts will search for an environment file called set-env.sh; this file, or a symbolic link with this name, may be placed in the "src" directory. See below for example content for set-env.sh:
export SIDN_HADOOP_CFG_DIR=/home/user/my-custom-config
export ANSIBLE_BECOME_PASSWORD=mypassword
Create a link in the repository src/bin directory to the location of the environment file.
cd <repo>/src/bin
ln -s /path/to/custom/set-ansible-vars.sh set-env.sh
Use the deploy.sh script to deploy components to the cluster hosts; the script accepts multiple components. There is a special "all" component that installs all components sequentially in the correct order.
When more control is desired, it is advised to use only a single component.
The order in which components are deployed is important because of dependencies between components. Use the ordering used in deploy-all.yml.
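For example:
./deploy.sh all       # deploy every component in the correct order
./deploy.sh hadoop    # or deploy a single component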
Use the start.sh script to start services on cluster hosts; the script accepts multiple components. There is a special "all" component to start all services.
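For example:
./start.sh all          # start all services
./start.sh hdfs yarn    # or start specific services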
Use the add-user.sh script to create a new cluster user; it will create:
- A user account on all data nodes (no home directory or login shell)
- A user account on all gateway nodes with a home directory and login shell
- A principal in the Kerberos database
- A user in the Ranger database (this might take a few minutes to show up)
- A HDFS home directory
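A typical invocation might look like this, assuming the script takes the new username as its argument (the username is a placeholder):
./add-user.sh jdoe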
The console service is a web application with an overview of all services and related information. It is also possible to download the Hadoop and Kerberos client configuration files directly from the console. The console web interface is deployed using the "console" role in the hosts config file, the default listening port is 5000.
All web interfaces are protected using Kerberos and can only be accessed when a Kerberos session is active. Configure your browser to support SPNEGO, for more information on how to do this, see the information in the console web application.
Create a Kerberos session using a password or keytab with kinit, or on macOS with the Ticket Viewer application (/System/Library/CoreServices/Applications/Ticket\ Viewer.app/).
kinit example: kinit -kt /etc/path/to/keytab myuser@KERBEROS-REALM
Make sure to have a correct Kerberos client configuration (krb5.conf) on the client before creating a Kerberos session. A valid Kerberos client configuration can be downloaded from the console using the client configuration download button.
There are multiple control scripts available to manage the cluster.
Script | Description |
---|---|
deploy.sh | Install 1 or more components on cluster nodes |
start.sh | Start 1 or more components |
stop.sh | Stop 1 or more components |
add-user.sh | Add a new user to Kerberos db and cluster nodes |
del-user.sh | Delete an existing user from Kerberos db and cluster nodes |
prepare-hosts.sh | Create the manager user on all cluster nodes |
add-py-env.sh | Add new Python virtualenv on cluster nodes |
del-py-env.sh | Delete Python virtualenv on cluster nodes |
hdfs-upgrade.sh | Prepare and finalize HDFS upgrade |
Due to an incompatibility between Apache Impala 4.0 and Apache Hive, it is not possible to deploy Apache Impala 4.0 until this is fixed in version 4.0.1. See: https://issues.apache.org/jira/browse/IMPALA-10756
The load balancer is used to distribute queries across all Impala nodes; it is recommended that users always connect to the load balancer host to make sure queries are evenly distributed across the Impala nodes.
Impala security is enabled by default and Apache Ranger is used to manage permissions. Use the Ranger interface to easily manage Impala user permissions.
To use the impala-shell commandline interface, the following 3 steps are required.
- Create a Kerberos session with:
kinit -kt /path/to/keytab <principal>@<REALM_NAME>
- Set the correct path to the required libraries by sourcing the environment variables:
source /etc/<impala-conf-dir>/impalad-env.sh
- Activate the Python 2 venv when the default Python version is not Python 2:
. /usr/local/hadoop/hadoop-venvs/py2-impala-shell/bin/activate
Use the following arguments when connecting to the impala_proxy host.
- -i the fqdn of the <impala_proxy> host
- --ssl Impala server uses SSL/TLS
- -k use Kerberos session
Example:
impala-shell -i <proxy_fqdn>:<shell_port> --ssl -k
When connecting directly to an impalad node, use the following additional argument.
- -b <proxy_fqdn> tells Impala to use the hostname of the impala_proxy when connecting
Example:
impala-shell -i <proxy_fqdn>:<shell_port> --ssl -k -b <proxy_fqdn>
Upgrading an existing HDFS cluster is only supported with downtime. The scripts used to start HDFS will detect if a newer version of Hadoop is deployed and will auto upgrade the HDFS metadata during name node start.
Use the following process when upgrading:
- Run hdfs-upgrade.sh prepare while the current, older Hadoop version is still running.
- Stop the HDFS and YARN services: stop.sh hdfs yarn
- Update the Hadoop version in vars/all/vars.yml
- Deploy the new version of the Hadoop package with deploy.sh hadoop
- Start HDFS with start.sh hdfs yarn. This step will upgrade the HDFS metadata during the start of the name node. Before the upgrade it will create a backup of the name node metadata.
- Run your workload jobs and check if everything is working OK. If everything is working OK, finalize the upgrade with hdfs-upgrade.sh finalize
When migrating from an older Hadoop distribution such as Cloudera Hadoop, it is advisable to first deploy a version of Apache Hadoop that matches the version used by your current Cloudera deployment. When deploying a newer version of Hadoop, the HDFS metadata might be upgraded, which may not be what you want.
When migrating away from CDH, only upgrade to a newer Hadoop version after the deployment of a version matching the current CDH Hadoop version was successful and you have been using it for some time without issues.
The following Spark components are installed:
- Spark (utils and libraries)
- Thrift server
- History server
The Spark thrift server is started with Kerberos service account <spark_thrift_user>/_HOST@REALM. To create a connection, first make sure a Kerberos session is created (kinit) and then use the <spark_thrift_user>/_HOST@REALM in the connection string.
For example when the Spark thrift server is deployed on testserver1.example.nl, use this Beeline test example:
beeline -u "jdbc:hive2://<hostname>:10000/default;principal=spark/testserver1.example.nl@REALM_NAME"
Only the Hive metastore will be deployed.
If you have an existing metastore database, then this database must be upgraded to the version required by the installed Hive version.
When migrating from Cloudera Express (CDH 6.x), the deploy.sh script will most likely detect metastore version 2.1.1. It will then copy modified upgrade scripts to the Hive server. Starting the Hive metastore will kick off the upgrade process; the metastore will be upgraded to the version required by the installed Hive version.
If the old metastore database is hosted by an existing Postgresql server, and this server is not used for the new cluster, then copy the metastore database to the new Postgresql server.
Export existing metastore database:
pg_dump -d metastore -h <server_host> -U <username> -Fc > metastore.back
Import the metastore database in the new Postgresql server:
- Create a new database metastore owned by the user hive (see the example after this list)
- Import the database:
pg_restore -d metastore -h <server_host> -U <username> metastore.back
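For the first step, the new database could be created like this (a sketch; the PostgreSQL admin user and server host are placeholders, adjust them to your setup):
createdb -h <server_host> -U <admin_user> -O hive metastore    # -O sets the database owner to the hive user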
The upgrade might fail because the Hive upgrade scripts return an error. To fix this, manually alter the Hive upgrade scripts found in <HIVE_HOME>/scripts/metastore/upgrade/
Apache Ranger is used for authorization of HDFS requests and Impala SQL queries. The Ranger interface can be accessed via the console web page. Make sure to have a valid Kerberos session for an admin user when accessing the Ranger console; otherwise the Ranger Admin website will only allow limited access.
After a fresh install it is advisable to delete all policies for the "HDFS" and "HADOOP SQL" services. The default policies might cause access/permission denied errors.
Solr is used to store Ranger audit logging; the Solr index data can grow quite big when running a larger cluster. Cleaning up older Solr data can be done with a simple API command.
The following example command deletes all data older than 7 days.
curl --negotiate -u : "<solr_hostname>:8983/solr/ranger_audits/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>evtTime:[* TO NOW-7DAYS]</query></delete>"
A cleanup service is created during the deployment of Solr, the cleanup service will remove old data from Solr every night at 01:00. The max allowed age of the Solr data can be configured in the Solr configuration file.
When the existing TLS certificate needs to be replaced (because it might expire soon), do the following:
- Copy the new TLS certificate to the correct location on the management server.
- ./deploy.sh tls
- ./stop.sh all
- ./start.sh all
- ./do.sh enable-hdfs-ranger-plugin
The last step is required because otherwise the Ranger HDFS plugin will not be able to connect to the updated Ranger Admin service.
Adding new hosts requires careful execution of the following steps to make sure all services are correctly deployed on their assigned hosts (see the sketch after this list).
- Stop all services
- Add the new host to the desired groups in the hosts configuration file
- Start db, kerberos and console
- Deploy all services assigned to the new host
- Deploy tls
- Deploy Kerberos principals
- Deploy Kerberos keytabs
- Start all services
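A possible sequence using the control scripts described above; the components to deploy depend on the roles assigned to the new host, so treat this as a sketch rather than an exact recipe:
./stop.sh all
# edit the hosts file in the configuration directory and add the new host to the desired groups
./start.sh db krb-kdc console
./deploy.sh hadoop impala    # example: components assigned to a new data host
./deploy.sh tls
./deploy.sh krb-principal krb-keytab
./start.sh all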
Some components require the database and/or the Kerberos KDC to be up and running before the component can be deployed.
Fix: start.sh db krb-kdc
Make sure the "impala" policy in Apache Ranger allows users in the group "hadoop_users" to view and select from the desired databases and/or tables. Every user must have an account on each impala node in the cluster, this is done automatically when adding users with the add-user.sh script.
When the keytabs are redeployed while services are running, services such as HDFS cannot authenticate using the new keytabs until they are restarted.
Code and documentation copyright 2022 SIDN. Code released under the MIT License.