A Terraform setup (`terraform-hdp`) for provisioning an HDP big data analytics server instance on AWS. 🔥🔥🔥
⚠️ Before running the scripts, create a remote S3 bucket to store the Terraform state.
- By default, the name of the remote state bucket is `terraform-hadoop`.
- If you want to create your own bucket with any other name, ensure that you replace the default remote bucket name mentioned in `state.tf`.
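For example, a minimal sketch of creating the bucket with the AWS CLI (assuming the default `terraform-hadoop` name and the `us-east-1` region; enabling versioning is optional but helps with state recovery):
> aws s3 mb s3://terraform-hadoop --region us-east-1
> aws s3api put-bucket-versioning --bucket terraform-hadoop --versioning-configuration Status=Enabled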
To configure the public IP address, replace the `HostIp` variable found in `env/dev.tfvars` or `env/prod.tfvars`. You can look up your current public IP with,
> curl https://checkip.amazonaws.com
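Alternatively, Terraform can read the value from the environment instead of the tfvars file; a sketch, assuming the variable is declared as `HostIp` (append `/32` yourself if the variable expects a CIDR block):
> export TF_VAR_HostIp="$(curl -s https://checkip.amazonaws.com)"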
💡 If you don't want to use your global credentials, prefix each `terraform` and `aws` command given below with `AWS_PROFILE=<username>`.
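For example (with `<username>` standing in for whichever profile you have configured in `~/.aws/credentials`):
> AWS_PROFILE=<username> terraform plan
> AWS_PROFILE=<username> aws s3 ls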
Initialize terraform
> cd terraform/private_vpc
> terraform init
Create an AWS key pair that will be used to log in to the AWS instance; the same key pair will be used when initializing the other instances too.
> cd terraform/scripts # generate keys inside scripts
> aws ec2 create-key-pair --key-name hwsndbx --query 'KeyMaterial' --output text > hwsndbx.pem
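To confirm the key pair was registered on the AWS side (the `.pem` file itself is written to `terraform/scripts/hwsndbx.pem`):
> aws ec2 describe-key-pairs --key-names hwsndbx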
> terraform workspace list # the default workspace is created at terraform init
To create two new workspaces,
> terraform workspace new dev
> terraform workspace new prod
To provision the resources in the dev workspace, we first need to select the dev workspace.
> terraform workspace select dev
> terraform apply
Apply the Terraform script,
> terraform plan
> terraform apply -auto-approve
Optional: apply the Terraform script with an environment-specific variable file,
> terraform plan -var-file=./env/dev.tfvars
> terraform apply -auto-approve -var-file=./env/dev.tfvars
Since we need a reliable way to access our server and we can't tie the server down to our local dynamic IP, which changes every time, we create a new EC2 instance running OpenVPN to act as the bastion host.
For OpenVPN setup refer to this video.
Change `openvpn_ami_id` based on your specified region,
> aws --region=us-east-1 ec2 describe-images --owners aws-marketplace --filters 'Name=name,Values=OpenVPN Access Server 2.7.5*'
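To narrow that down to just the newest matching AMI ID (a convenience query using the same filter as above):
> aws --region=us-east-1 ec2 describe-images --owners aws-marketplace --filters 'Name=name,Values=OpenVPN Access Server 2.7.5*' --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text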
> cd terraform/bastion_host_openvpn
> terraform init
> terraform plan
> terraform apply
Read through this for more setup.
Connect to the OpenVPN instance using the assigned elastic ip,
> ssh -i ./scripts/hwsndbx.pem openvpnas@<elasticip>
Use the default values for all settings, then change the password,
> sudo passwd openvpn
Then go to the OpenVPN WebUI at https://<elastic-ip>:943. Use `openvpn` as the username and the password configured in the terminal above.
- In Configuration > VPN Settings > Routing, enable "Should client Internet traffic be routed through the VPN?"
- With this configuration, the VPN client IP address is translated before being presented to resources inside the VPC. That means the client’s original IP address is remapped to one belonging to the VPC IP address space.
We can use the domain by adding the nameservers shown in the `terraform apply` output to the domain's DNS. Read more on adding an SSL cert.
Right now you should be able to access your VPN's admin GUI by going to https://<your-domain>/admin. However, your browser will show a warning as the SSL cert is not valid. You can bypass this warning to access the admin UI, but we should set up a valid SSL cert.
- Use ZeroSSL to obtain your certificate for free.
Walk through the wizard to create a new Let's Encrypt certificate. You will be required to verify your domain as part of this process.
Copy the Certificate, CA Bundle and Private Key to files.
Log in to your VPN Access Server GUI using the user `openvpn` and the password created on the server. Navigate to Settings > Web Server. From there, upload the Certificate, CA Bundle and Private Key files. Click validate, and save if there are no errors.
> ssh root@<host> "cat server.csr"|pbcopy
> ssh root@<host> "cat server.key"|pbcopy
Next, we will provision HDP as a spot instance. If you need it as a readily available (on-demand) instance instead, change directory to ``.
> cd terraform/hdp_instance
> terraform init
> terraform plan
> terraform apply
To connect using SSH, the key file needs `400` permissions, but by default it will be `644`,
> ls -la # to see the permission of the pem file
> chmod 400 ./scripts/hwsndbx.pem # same key for all
> ssh -i ./scripts/hwsndbx.pem ec2-user@<output_instance_ip>
Install HDP through Docker,
> docker info
> cd /tmp/hdp-docker-sandbox/HDP_2.6.5
> sudo bash docker-deploy-hdp265.sh
> docker ps
> docker ps -a
To restart the containers,
> cd /tmp/hdp-docker-sandbox
> sudo bash restart_docker.sh
- After it finishes, access Ambari through http://<elastic-public-ip>:8080/.
- The default Ambari credentials are `raj_ops`:`raj_ops` and `maria_dev`:`maria_dev`. The default Ambari shell login credential is `root`:`hadoop`.
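As a quick sanity check that Ambari is up, you can hit its REST API with the default credentials listed above (the `/api/v1/clusters` endpoint is part of the standard Ambari REST API):
> curl -u raj_ops:raj_ops http://<elastic-public-ip>:8080/api/v1/clusters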
> sudo docker images
> sudo service docker restart
> sudo service docker status
Read the Cloudera HDP sandbox and Apache Ambari shell commands for more information.
To peek into the docker sandbox,
> docker exec -it <docker-sandbox-image-id> /bin/bash
> ssh root@localhost -p 2222 # or you can use this with password hadoop
> ambari-agent status
> ambari-agent start # start it if stopped
> ambari-server restart
Hortonworks doesn't come with a lot of resources out-of-the-box to work with Python,
> sudo su -
> yum install python-pip -y
> pip install google-api-python-client==1.6.4
# > curl https://bootstrap.pypa.io/pip/2.7/get-pip.py | python
# > pip install --ignore-installed pyparsing
> pip install mrjob==0.5.11 #MRJob
> yum install nano -y
Example data files and scripts to play with,
> sudo su - maria_dev
> wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
> wget http://media.sundog-soft.com/hadoop/RatingsBreakdown.py
> hadoop fs -copyFromLocal u.data /user/maria_dev/ml-100k/u.data
> python RatingsBreakdown.py u.data # run locally with mrjob's default inline runner
> python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data #mrjob manually copies the file to hdfs temp location and executes it
> hostname -I | awk '{print $1}' # get the ip
> python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar hdfs://172.18.0.2:8020/user/maria_dev/ml-100k/u.data
> python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar hdfs:///user/maria_dev/ml-100k/u.data
Look into this script
Change the Ambari admin password once you create the instance,
> docker exec -it sandbox-hdp /bin/bash
> ambari-admin-password-reset
> ambari-agent restart
💡 The hosts file is located at `C:\Windows\System32\drivers\etc\hosts` on Windows or `/etc/hosts` on macOS.
In case you want a CNAME, you can add the host IP to your hosts file and use it as a domain name locally. To save and exit out of the nano editor, press `Ctrl+O`, `Enter`, then `Ctrl+X`.
> sudo nano /etc/hosts # add the ip and map it to a host name
> sudo killall -HUP mDNSResponder # flush the DNS cache (macOS)
For example, map the sandbox hostname like this:
127.0.0.1 sandbox-hdp.hortonworks.com
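To confirm the mapping resolves after flushing the cache:
> ping -c 1 sandbox-hdp.hortonworks.com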
⚠️ Keep in mind: even though a stopped instance itself isn't charged, you may still incur charges for the `EBS` storage and `Elastic IP` associated with the instances.
Once created, when you want to stop the instances just execute,
> cd /tmp/hdp-docker-sandbox
> bash pause_docker.sh # pause the instance
> cd hdp_instance
> terraform output # get the id from output for hdp instance
> aws ec2 stop-instances --instance-ids <instance_id> --profile edutf
> cd bastion_host_openvpn
> terraform output # get the id from output for openvpn instance
> aws ec2 stop-instances --instance-ids <instance_id> --profile edutf
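If you'd rather not copy the IDs by hand, the lookup and the stop can be combined; a sketch, assuming the outputs are named `hdp_instance_id` and `openvpn_instance_id` (check `terraform output` for the real names) and Terraform >= 0.14 for `-raw`:
> cd hdp_instance && aws ec2 stop-instances --instance-ids $(terraform output -raw hdp_instance_id) --profile edutf
> cd ../bastion_host_openvpn && aws ec2 stop-instances --instance-ids $(terraform output -raw openvpn_instance_id) --profile edutf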
Once created, when you later want to restart after a stop,
> cd bastion_host_openvpn
> terraform output # get the id from output for openvpn instance
> aws ec2 start-instances --instance-ids <instance_id> --profile edutf
> cd hdp_instance
> terraform output # get the id from output for hdp instance
> aws ec2 start-instances --instance-ids <instance_id> --profile edutf
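Optionally, wait for the instance to report `running` before reconnecting (same `--profile edutf` as above):
> aws ec2 wait instance-running --instance-ids <instance_id> --profile edutf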
> cd terraform/hdp_instance
> ssh -i ./scripts/hwsndbx.pem ec2-user@<instance_ip>
> cd /tmp/hdp-docker-sandbox
> bash resume_docker.sh # resume the instance
> ps -ef
> kill -HUP <PID>
> bash start_jupyter.sh spark
To destroy a Terraform-managed instance, run the following from its stack directory,
> terraform destroy -auto-approve
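Since there are several stacks, a full teardown runs `destroy` in each directory, dependents first (pass the same `-var-file` you applied with, where applicable); a sketch:
> cd terraform/hdp_instance && terraform destroy -auto-approve
> cd ../bastion_host_openvpn && terraform destroy -auto-approve
> cd ../private_vpc && terraform destroy -auto-approve -var-file=./env/dev.tfvars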
- Installation guide for a single cluster HDP installation.
- Installation guide for multiple cluster nodes.
- To increase the storage instance type.
- Maven and Java setup
- HDP + Ambari 2.7.5 on CentOS 7.
- Starting and stopping Ambari services using the cURL command
- Look into Terraform local-exec for stopping and starting server instances
- Ambari REST API to restart all services
- Ambari REST API commands
- Solve Pig Tez failure on Ambari 2.6.5
MIT © Murshid Azher.