SIDN Labs Hadoop Provisioning Manager makes it easier to deploy a Hadoop-based data analytics cluster. The analytics cluster supports the following well-known components:
- Apache Hadoop
- Apache Impala
- Apache Spark
- Apache Hive
- Apache Ranger
- Apache Zookeeper
- Monitoring (Prometheus and Grafana)
- Hue
- Apache Superset
- JupyterHub/JupyterLab
- Apache Airflow
- Apache Iceberg
The cluster is automatically configured with features such as:
- HDFS High Availability
- Authentication (Kerberos)
- Authorization (Apache Ranger)
- TLS support for all web interfaces
A management console is provided to help administrators and users navigate all the available services.
Monitoring is automatically configured using Prometheus and Grafana. Dashboards are provided to help administrators monitor all services and receive alerts when a service becomes unavailable.
- At least 5 physical/virtual servers for testing, 8 for production usage.
- A single TLS certificate that is valid for every server in the cluster.
- The management server will need 16GB of RAM to prevent issues when building Docker images.
The TLS certificate may be a wildcard certificate or use a SAN list that contains each server name.
This project installs the following components:
Name | Description | Optional |
---|---|---|
common | Generic packages and config | No |
manager | Packages and config for management node | No |
console | A web-based management interface | No |
monitor | Prometheus and Grafana server | Yes |
monitor-proxy | Prometheus proxies | Yes |
db | Postgresql database | Yes |
docker-reg-ui | Docker registry ui | No |
zookeeper | Zookeeper packages and config | No |
hadoop | Hadoop (HDFS/YARN) packages and config | No
spark | Spark packages and config | No |
hive | Hive metastore packages and config | No |
impala | Impala packages and config | No |
hue | Hue packages and config | No |
livy | Livy packages and config | No |
ranger | Ranger packages and config | No |
solr | Solr packages and config | No |
airflow | Airflow packages and config | Yes |
tls | TLS certificates and config for all components | No |
krb-kdc | Deploy Kerberos KDC server | No
krb-principal | Generate Kerberos principals for all components | No |
krb-keytab | Kerberos keytabs for deployed components | No |
client-config | Hadoop client configuration files | No |
The default component versions can be found in config/vars/all.yml. These versions are known to work correctly together. It is possible to change the version of a component, but be sure to test that everything functions correctly after deployment.
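As an illustration, a component version is typically set with a simple variable in this file; the variable name and value below are illustrative only, check the file itself for the real names:
# illustrative example, look up the actual variable name in config/vars/all.yml
spark_version: "3.3.2"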
Most components are downloaded as compiled binaries from the official project download sites. Some open source projects, such as Apache Impala, do not provide compiled packages. The components listed below have been compiled by us and the binary packages have been made available through the SIDN download site. Only the listed versions can be installed.
Name | Version |
---|---|
Apache Impala | 4.20 |
Apache Ranger | 2.2.0 |
Apache Livy | 0.8.0 |
Both Red Hat Enterprise Linux (RHEL) and Ubuntu are supported, although only a RHEL deployment has been fully tested. It is possible to create a mixed cluster, for example with the data nodes using RHEL and the gateways Ubuntu.
- Red Hat Enterprise Linux (RHEL) 7.9
- Ubuntu 20.04/22.04 LTS
At the moment only RHEL 7.9 is tested.
RHEL is a commercial Linux distribution and a paid license is required; however, for evaluation and testing (non-production) use cases a free personal developer license can be granted by Red Hat. See: https://developers.redhat.com/articles/faqs-no-cost-red-hat-enterprise-linux
After installing the OS, the subscription must be activated and the required repositories need to be enabled:
subscription-manager register
subscription-manager attach
subscription-manager repos --enable=rhel-7-server-extras-rpms --enable=rhel-7-server-optional-rpms --enable=rhel-server-rhscl-7-rpms --enable=rhel-7-server-thirdparty-oracle-java-rpms
Cluster functions are coupled to roles; a role can be deployed to one or more cluster nodes. The roles are mapped to groups in the Ansible hosts file. Available roles:
Role name | Description |
---|---|
manager | Ansible deploy scripts |
console | Management webinterface |
zookeeper | Zookeeper instance |
hadoop | Data node (HDFS data node, YARN node manager, impalad) |
hdfs_nn | HDFS primary name node and standby name node(s) |
hdfs_journalnode | HDFS journaling node |
hdfs_httpfs | HDFS Httpfs node |
yarn_resource_mgr | Yarn Resource manager |
yarn_timelineserver | Yarn Timeline server |
spark_history | Spark History server |
spark_thrift | Spark Thrift server |
hive (metastore) | Hive metastore instance |
kerberos_kdc | Kerberos Key Distribution Center |
ranger | Ranger server |
database | Postgresql server |
impala | Impala Executor instance |
impala_statestore | Impala Statestore instance |
impala_catalog | Impala Catalog instance |
impala_ha_proxy | Impala HaProxy loadbalancer |
hue | Hue server |
livy | Livy server |
docker-reg-ui | Docker registry UI |
gateway | Gateway host, client only |
monitor | Monitoring (Prometheus and Grafana) |
In general it is best to map a role to the hosts best suited for that specific role. A data node host will have lots of disk storage and CPU cores, while a controller node will have limited disk storage and CPU cores. Deploy Hadoop (HDFS data node, YARN node manager) on a data node and the Hadoop management processes such as hdfs_nn on a controller node. This ensures that heavy data processing does not negatively affect the name node process.
A logical distribution of cluster roles across cluster host types could look something like this:
Host type | Roles |
---|---|
Management node | manager, console, kerberos_kdc, ranger, solr, database, monitor, docker registry, docker registry UI |
controller node | zookeeper (3x), hdfs_nn, hdfs_journalnode, hdfs_httpfs, yarn_resource_mgr, yarn_timelineserver, spark_history, spark_thrift, hive (metastore), impala_statestore, impala_catalog, impala_ha_proxy, livy |
data node | hadoop, impala |
gateway node | gateway, hue, superset, jupyterhub |
The management host is the central node from which services and configuration are deployed to the cluster nodes. The management host must have a user account that has:
- SSH access to all cluster nodes
- Permission for sudo on all cluster nodes
- Access to the internet
Replacing an existing Hadoop deployment such as Cloudera Hadoop is no problem; the components and configuration of the existing Hadoop deployment are not modified or removed. Both Hadoop deployments can coexist on a host simultaneously. It is, however, only possible to have one active Hadoop deployment running at the same time.
Before deploying the new components, make sure that the configuration settings of the new deployment match the settings of the old deployment. For example, make sure the HDFS data node configuration of the new deployment uses the same storage locations as the existing deployment.
Deploying a cluster from scratch requires the following steps to be executed.
- Create cluster hosts (hardware, virtual machine or combination)
- Configure valid DNS (forward and reverse)
- Create manager user
- Configure SSH
- Install SIDN-HADOOP
- Install Ansible
- Create a custom configuration
- Upload the TLS certificate and key
- Prepare hosts
- Create database backup
- Deploy components
- Start components
- Add users
Create the virtual hosts or physical servers that will be used as the deploy targets for the cluster services. The minimum requirement is to create 5 hosts. The management and gateway roles may also be combined on a single host, but this is not advised. It is recommended to create dedicated hosts for the deployment of management functions such as the HDFS name node and Zookeeper processes. Users of cluster services should only be allowed to access the gateway host, and it is therefore best not to deploy any services on the host that is running as a gateway.
- 1x Management host
- 3x data host
- 1x gateway host
A data host can also be used as a controller host, but this is not recommended for production. It would be better to also add:
- 3x controller host
The minimum number of hosts for a production cluster would be 8.
All hosts of the cluster must have a valid domain name, either in DNS or in the hosts file on every node. It is important that reverse DNS, resolving an IP address to a domain name, also works correctly.
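A quick way to verify both lookups on a node (the hostname and IP address are placeholders):
getent hosts host1.example.nl    # forward lookup: name to address
getent hosts 192.0.2.11          # reverse lookup: address to name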
On the management node, create a manager user; this is the user that will be used to execute all the Ansible scripts. The name of this user must match the value of the provision_user variable in the vars/all.yml configuration file.
On the management node, create an SSH key pair for the manager user account; create the key without a passphrase.
Create keys using: ssh-keygen -t rsa -b 4096
On the management host, as the manager user, clone the repository:
git clone git@gitlab.sidnlabs.nl:maarten/hadoop-provision.git
On the management host, install the required dependencies.
- Python3
- Ansible
On RHEL:
sudo yum install python3 ansible
On Ubuntu:
sudo apt update
sudo apt install python3 software-properties-common
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install ansible
The Ansible scripts depend on additional Ansible libraries, install these with:
cd hadoop-provision/src
ansible-galaxy install -r requirements.yml
Services are configured using multiple configuration files. The <repo-dir>/config directory contains a set of example configuration files.
- Copy the default configuration directory to a location outside of the repository; this new directory can be versioned in your own version control system, such as GitHub.
The configuration directory contains the following:
Item | Description |
---|---|
hosts file | Contains a mapping between components and hosts |
ansible.cfg file | Contains Ansible configuration |
vars directory | This directory contains the configuration files for every component |
python directory | This directory contains requirements files for Python environments |
- Edit the hosts file to specify which hosts each service will be deployed on.
- Edit the configuration files in the vars directory to match your own requirements.
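As a minimal sketch, the hosts file maps the group names from the roles table above to hosts; the hostnames below are placeholders and only a few groups are shown:
[manager]
mgmt1.example.nl

[zookeeper]
ctrl1.example.nl
ctrl2.example.nl
ctrl3.example.nl

[hadoop]
data1.example.nl
data2.example.nl
data3.example.nl

[gateway]
gw1.example.nl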
Most options have a reasonable default value that may not need to be changed, but options for passwords and resources such as directory names and memory size MUST be changed before executing the deployment script.
Sensitive variables such as passwords are not configured directly in a configuration file. These variables must be saved in an Ansible vault file; follow these steps to create a new vault file.
- Create new unencrypted vault file:
cp <config_dir>/vars/vault_password_vars.template <config_dir>/vars/vault_password_vars.yml
- Encrypt the vault file:
ansible-vault encrypt <config_dir>/vars/vault_password_vars.yml
- Edit the vault file to set the values:
ansible-vault edit <config_dir>/vars/vault_password_vars.yml
The contents of the vault file can be viewed using ansible-vault view <file> and edited using ansible-vault edit <file>.
All variables in the configuration files that have a "vault_" prefix are references to a variable in the vault_password_vars.yml vault file.
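For example, a hypothetical database password could be wired up like this; the variable names are illustrative, only the vault_ prefix convention comes from the project:
# in a vars file, e.g. <config_dir>/vars/db_vars.yml (illustrative names)
db_admin_password: "{{ vault_db_admin_password }}"
# in <config_dir>/vars/vault_password_vars.yml (encrypted with ansible-vault)
vault_db_admin_password: "a-strong-secret"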
Ansible must be able to read the encrypted vault file and needs to know the password for this file. Create a password file that Ansible can use to read the vault; this file must be named .vault_pass and placed in the configuration directory.
- Create a new .vault_pass file, like this for the example password "my_password":
echo "my_password" > <config_dir>/.vault_pass
MAKE SURE THE .vault_pass FILE IS NOT SAVED IN VERSION CONTROL. When using Git(Hub/Lab), add .vault_pass to .gitignore.
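For example, when the configuration directory itself is a Git repository:
echo ".vault_pass" >> <config_dir>/.gitignore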
TLS is used to protect the network connections and web interfaces of the available services. The TLS certificate must be valid for the domain names used for the cluster hosts.
For example, if there are 3 hosts with the following names:
host1.example.nl
host2.example.nl
host3.example.nl
The TLS certificate must have a wildcard match *.example.nl
or be a multi-domain (SAN) certificate and have all the host names explicitly included in the certificate.
The certificate must be PEM formatted and the PEM file must include all required intermediate certificates.
- Upload the certificate to the management node
- Make sure the manager user has read permission.
- Configure the location of the certificate using the tls_cert option in config/vars/tls_vars.yml
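A minimal sketch of the relevant setting in config/vars/tls_vars.yml; the path is a placeholder:
tls_cert: /home/hadoop-mgr/certs/cluster-cert.pem    # PEM file, including intermediate certificates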
Prepare all cluster hosts by creating the required manager user account and setting up SSH access on all cluster nodes.
The Ansible deployment scripts require a manager user account to be present on all cluster nodes.
The default name of this account, hadoop-mgr, is configured in config/vars/all.yml. The name must match the name of the manager user created earlier on the management host. If you have used another name, update the value of provision_user in config/vars/all.yml.
The prepare-hosts.sh script does the following:
- Add the manager user account to all the hosts found in the hosts file in the configuration directory.
- Add the SSH key of the manager user on the management node to the authorized_keys of the manager user on the other cluster hosts.
- Add the manager user on the cluster host to the sudoers list.
The script requires that you already have another remote user account on the systems that is allowed to use sudo and can be used to create the new manager user account on the nodes. The existing account name and password must be the same on all hosts.
prepare-hosts.sh <admin-user-on-remote-hosts>
The script will ask for the SSH password for the admin user on the remote host and the new password for the manager user account that is going to be created on the remote host.
If you want to manually create the manager user on all nodes, follow the steps below.
On RHEL:
adduser hadoop-mgr
usermod -aG wheel hadoop-mgr
passwd hadoop-mgr
On Ubuntu:
adduser hadoop-mgr
usermod -aG sudo hadoop-mgr
passwd hadoop-mgr
Enter the new password when passwd prompts for it.
Distribute the public key to all other nodes (including the management node itself): ssh-copy-id hostname
Make sure to also add the new key to the manager user on the management node. This means adding the public key of the manager user to the authorized_keys of the manager user itself. This is necessary because Ansible will create an SSH session to the management node from the management node, in effect starting an SSH connection to itself.
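For example, assuming the manager user is hadoop-mgr and the default key location is used (the hostname is a placeholder):
ssh-copy-id hadoop-mgr@data1.example.nl    # repeat for every cluster node
ssh-copy-id hadoop-mgr@$(hostname -f)      # add the key to the management node itself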
When upgrading an existing Hadoop cluster it is advisable to back up the existing database schemas. The components listed below use a database.
- Apache Hive Metastore
- Hue
- Apache Ranger
Using Postgresql, a backup can be performed like this:
pg_dump -d <database> -h <server_host> -U <username> -Fc > database.backup
Before executing the deploy script the following environment variables need to be set.
Variable | Description |
---|---|
ANSIBLE_BECOME_PASSWORD | Sudo password for manager user on cluster node |
SIDN_HADOOP_CFG_DIR | Path to your custom config directory |
The scripts will search for an environment file called set-env.sh; this file, or a symbolic link with this name, may be placed in the "src" directory. See below for example content for set-env.sh:
export SIDN_HADOOP_CFG_DIR=/home/user/my-custom-config
export ANSIBLE_BECOME_PASSWORD=mypassword
Create a link in the repository src/bin directory to the location of the environment file.
cd <repo>/src/bin
ln -s /path/to/custom/set-ansible-vars.sh set-env.sh
Use the deploy.sh script to deploy components to the cluster hosts; the script accepts multiple components. There is a special "all" component that installs all components sequentially in the correct order.
When more control is desired, it is advised to use only a single component.
The order in which components are deployed is important because of dependencies between components. Use the ordering used in deploy-all.yml.
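For example:
./deploy.sh all       # deploy every component in the correct order
./deploy.sh hadoop    # or deploy a single component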
Use the start.sh script to start services on cluster hosts; the script accepts multiple components. There is a special "all" component to start all services.
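For example:
./start.sh all          # start all services
./start.sh hdfs yarn    # or start specific services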
Use the add-user.sh script to create a new cluster user; it will create:
- A user account on all data nodes (no home directory or login shell)
- A user account on all gateway nodes with a home directory and login shell
- A principal in the Kerberos database
- A user in the Ranger database (this might take a few minutes to show up)
- A HDFS home directory
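A typical invocation might look like this, assuming the script takes the new username as its argument (the username is a placeholder):
./add-user.sh jdoe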
The console service is a web application with an overview of all services and related information. It is also possible to download the Hadoop and Kerberos client configuration files directly from the console. The console web interface is deployed using the "console" role in the hosts config file, the default listening port is 5000.
All web interfaces are protected using Kerberos and can only be accessed when a Kerberos session is active. Configure your browser to support SPNEGO, for more information on how to do this, see the information in the console web application.
Create a Kerberos session using a password or keytab with kinit, or on macOS with the Ticket Viewer application (/System/Library/CoreServices/Applications/Ticket\ Viewer.app/).
kinit example: kinit -kt /etc/path/to/keytab myuser@KERBEROS-REALM
Make sure to have a correct Kerberos client configuration (krb5.conf) on the client before creating a Kerberos session. A valid Kerberos client configuration can be downloaded from the console using the client configuration download button.
There are multiple control scripts available to manage the cluster.
Script | Description |
---|---|
deploy.sh | Install 1 or more components on cluster nodes |
start.sh | Start 1 or more components |
stop.sh | Stop 1 or more components |
add-user.sh | Add a new user to Kerberos db and cluster nodes |
del-user.sh | Delete an existing user from Kerberos db and cluster nodes |
prepare-hosts.sh | Create the manager user on all cluster nodes |
add-py-env.sh | Add new Python virtualenv on cluster nodes |
del-py-env.sh | Delete Python virtualenv on cluster nodes |
hdfs-upgrade.sh | Prepare and finalize HDFS upgrade |
Due to an incompatibility between Apache Impala 4.0 and Apache Hive, it is not possible to deploy Apache Impala 4.0 until this is fixed in version 4.0.1. See: https://issues.apache.org/jira/browse/IMPALA-10756
The load balancer is used to distribute queries across all Impala nodes; it is recommended that users always connect to the load balancer host to make sure queries are evenly distributed across the Impala nodes.
Impala security is enabled by default and Apache Ranger is used to manage permissions. Use the Ranger interface to easily manage Impala user permissions.
To use the impala-shell commandline interface, the following 3 steps are required.
- Create a Kerberos session with:
kinit -kt /path/to/keytab <principal>@<REALM_NAME>
- Set the correct path to the required libraries by sourcing the environment variables:
source /etc/<impala-conf-dir>/impalad-env.sh
- Activate the Python 2 venv when the default Python version is not Python 2:
. /usr/local/hadoop/hadoop-venvs/py2-impala-shell/bin/activate
Use the following arguments when connecting to the impala_proxy host.
- -i the fqdn of the <impala_proxy> host
- --ssl Impala server uses SSL/TLS
- -k use Kerberos session
Example:
impala-shell -i <proxy_fqdn>:<shell_port> --ssl -k
When connecting directly to an impalad node, use the following additional argument.
- -b <proxy_fqdn> tells Impala to use the hostname of the impala_proxy when connecting
Example:
impala-shell -i <proxy_fqdn>:<shell_port> --ssl -k -b <proxy_fqdn>
Upgrading an existing HDFS cluster is only supported with downtime. The scripts used to start HDFS will detect if a newer version of Hadoop is deployed and will auto upgrade the HDFS metadata during name node start.
Use the following process when upgrading:
- Run hdfs-upgrade.sh prepare while the current, older Hadoop version is still running.
- Stop the HDFS and YARN services: stop.sh hdfs yarn
- Update the Hadoop version in vars/all/vars.yml
- Deploy the new version of the Hadoop package with deploy.sh hadoop
- Start HDFS with start.sh hdfs yarn. This step will upgrade the HDFS metadata during the start of the name node. Before the upgrade it will create a backup of the name node metadata.
- Run your workload jobs and check if everything is working OK. If everything is working OK, finalize the upgrade with hdfs-upgrade.sh finalize
When migrating from an older Hadoop distribution such as Cloudera Hadoop, it is advisable to first deploy a version of Apache Hadoop that matches the version used by your current Cloudera deployment. When deploying a newer version of Hadoop, the HDFS metadata might be upgraded, which may not be what you want.
When migrating away from CDH, only upgrade to a newer Hadoop version after the deployment of a version matching the current CDH Hadoop version was successful and you have been using it for some time without issues.
The following Spark components are installed:
- Spark (utils and libraries)
- Thrift server
- History server
The Spark thrift server is started with Kerberos service account <spark_thrift_user>/_HOST@REALM. To create a connection, first make sure a Kerberos session is created (kinit) and then use the <spark_thrift_user>/_HOST@REALM in the connection string.
For example when the Spark thrift server is deployed on testserver1.example.nl, use this Beeline test example:
beeline -u "jdbc:hive2://<hostname>:10000/default;principal=spark/testserver1.example.nl@REALM_NAME"
Only the Hive metastore will be deployed.
If you have an existing metastore database, then this database must be upgraded to the version required by the installed Hive version.
When migrating from Cloudera Express (CDH 6.x), the deploy.sh script will most likely detect metastore version 2.1.1. It will then copy modified upgrade scripts to the Hive server. Starting the Hive metastore will kick off the upgrade process; the metastore will be upgraded to the version required by the installed Hive version.
If the old metastore database is hosted by an existing Postgresql server, and this server is not used for the new cluster, then copy the metastore database to the new Postgresql server.
Export existing metastore database:
pg_dump -d metastore -h <server_host> -U <username> -Fc > metastore.back
Import the metastore database in the new Postgresql server:
- Create a new database metastore owned by the user hive (see the example after this list)
- Import the database:
pg_restore -d metastore -h <server_host> -U <username> metastore.back
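For the first step, the new database could be created like this (a sketch; the PostgreSQL admin user and server host are placeholders, adjust them to your setup):
createdb -h <server_host> -U <admin_user> -O hive metastore    # -O sets the database owner to the hive user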
The upgrade might fail because the Hive upgrade scripts return an error. To fix this, manually alter the Hive upgrade scripts found in <HIVE_HOME>/scripts/metastore/upgrade/
Apache Ranger is used for authorization of HDFS requests and Impala SQL queries. The Ranger interface can be accessed via the console web page. Make sure to have a valid Kerberos session for an admin user when accessing the Ranger console; otherwise the Ranger Admin website will only allow limited access.
After a fresh install it is advisable to delete all policies for the "HDFS" and "HADOOP SQL" services. The default policies might cause access/permission denied errors.
Solr is used to store Ranger audit logging; the Solr index data can grow quite big when running a larger cluster. Cleaning up older Solr data can be done with a simple API command.
The following example command deletes all data older than 7 days.
curl --negotiate -u : "<solr_hostname>:8983/solr/ranger_audits/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>evtTime:[* TO NOW-7DAYS]</query></delete>"
A cleanup service is created during the deployment of Solr, the cleanup service will remove old data from Solr every night at 01:00. The max allowed age of the Solr data can be configured in the Solr configuration file.
When the existing TLS certificate needs to be replaced (because it might expire soon), do the following:
- Copy the new TLS certificate to the correct location on the management server.
- ./deploy.sh tls
- ./stop.sh all
- ./start.sh all
- ./do.sh enable-hdfs-ranger-plugin
The last step is required because otherwise the Ranger HDFS plugin will not be able to connect to the updated Ranger Admin service.
Adding new hosts requires careful execution of the following steps to make sure all services are correctly deployed on their assigned hosts (see the sketch after this list).
- Stop all services
- Add the new host to the desired groups in the hosts configuration file
- Start db, kerberos and console
- Deploy all services assigned to the new host
- Deploy tls
- Deploy Kerberos principals
- Deploy Kerberos keytabs
- Start all services
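A possible sequence using the control scripts described above; the components to deploy depend on the roles assigned to the new host, so treat this as a sketch rather than an exact recipe:
./stop.sh all
# edit the hosts file in the configuration directory and add the new host to the desired groups
./start.sh db krb-kdc console
./deploy.sh hadoop impala    # example: components assigned to a new data host
./deploy.sh tls
./deploy.sh krb-principal krb-keytab
./start.sh all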
Some components require the database and/or the Kerberos KDC to be up and running before the component can be deployed.
Fix: start.sh db krb-kdc
Make sure the "impala" policy in Apache Ranger allows users in the group "hadoop_users" to view and select from the desired databases and/or tables. Every user must have an account on each impala node in the cluster, this is done automatically when adding users with the add-user.sh script.
When the keytabs are redeployed while services are running, services such as HDFS cannot authenticate using the new keytabs until they are restarted.
Code and documentation copyright 2022 SIDN. Code released under the MIT License.