The goal is to provide an easily accessible environment to run and develop Hive.
There are sometimes bug reports against earlier releases, but testing them can be problematic because running and switching between versions is cumbersome. I used to use a Vagrant-based box, which was useful for this.
I have been working on Hive, and sometimes on other projects, for the last couple of years. Since QA runs may only come back after 8-12 hours, I work on multiple patches simultaneously. However, working on several patches simultaneously has its own problems.
Here are all the approaches I was using earlier:
- basic approach: use a single workspace and switch the branch
  - unquestionably the simplest
  - after switching the branch, a full rebuild is necessary
- one for each: use multiple copies of hive, each with an isolated maven cache
  - pro:
    - capability to run maven commands simultaneously on multiple patches
  - con:
    - one of the patches has to be "active" for an IDE to be able to use it
    - it falls short when it comes to working on a patch in multiple projects simultaneously (hive+tez+hadoop)
    - after some time it eats up disk space
- dockerized/virtualized development environment
  - pro:
    - everything is isolated
    - because I'm no longer bound to my native environment, I may change a lot of things without interfering with anything else
    - easier to clean up after submitting the patch (just delete the container)
    - ability to have IDEs running for multiple patches at the same time
  - con:
    - isolated environment; configuration changes might get lost
    - may waste disk space
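The "one for each" approach relies on giving every copy its own Maven repository, which can be done with Maven's `maven.repo.local` property. The following is a minimal sketch only: the `PATCH_ROOT` layout, the `m2-repo` directory name and the helper function are hypothetical, and the function merely prints the command line instead of running a build.

```shell
# Hypothetical layout: one checkout per patch under $PATCH_ROOT/<patch>,
# with a private Maven repository next to it.
PATCH_ROOT="${PATCH_ROOT:-$HOME/patches}"

# Print the Maven command line for a patch (printing only; nothing is built here).
mvn_cmd_for_patch() {
    patch="$1"
    shift
    printf 'mvn -Dmaven.repo.local=%s/%s/m2-repo %s\n' "$PATCH_ROOT" "$patch" "$*"
}

mvn_cmd_for_patch HIVE-12121 install -DskipTests
```

Because each patch gets its own repository, two builds cannot overwrite each other's locally installed artifacts; the price is the extra disk space mentioned in the list.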
The aim of this project is to provide an easier way to test-drive hive releases
- running releases:
- upstream apache releases
- HDP/CDP/CDH releases
- in-development builds
- provide an environment for developing hive patches
# build and launch the hive-dev-box container
./run.bash
# after building the container you will get a prompt inside it
# initialize the metastore with
reinit_metastore
# everything should be ready to launch hive
hive_launch
# exit with CTRL+A CTRL+\ to kill all processes
- on linux based systems you are already running an xserver
- MacOSX users should follow: https://medium.com/@mreichelt/how-to-show-x11-windows-within-docker-on-mac-50759f4b65cb
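Graphical tools need a reachable X display, so a quick pre-flight check on the host can save a confusing failure later. This is a small sketch; the `have_display` helper is hypothetical and only inspects the standard `DISPLAY` variable.

```shell
# Hypothetical pre-flight check: report whether an X display is configured.
have_display() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "X display configured: $DISPLAY"
    else
        echo "DISPLAY is not set; graphical tools will not be able to open windows" >&2
        return 1
    fi
}

have_display || echo "start an X server before launching the container"
```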
Every container will be reaching out to almost the same artifacts; so employing an artifact cache "makes sense" in this case :D
# start artifactory instance
./start_artifactory.bash
You will have to configure this instance manually (once).
It will be available at http://127.0.0.1:8081/ (use admin/password to log in).
- make sure to have anonymous access enabled
  - left menu bar: Admin menu: Security / Security configuration > check that "allow anonymous access" is enabled
- add some remote repositories
  - left menu bar: Admin menu: Repositories / Remote
    - add maven central / etc
    - or some caching mirror repository if you know one
- add the wonder virtual repository
  - left menu bar: Admin menu: Repositories / Virtual
    - make sure to use the name "wonder" for it
    - add remote repos to it
This instance will be linked to the running development environment automatically
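Inside the dev box no further setup is needed. If you also want a Maven build on the host to go through the same cache, a mirror entry along the following lines could work. This is only a sketch under assumptions: the helper name and the scratch output file are hypothetical, and the URL assumes Artifactory's usual `/artifactory/<repository>` layout together with the "wonder" virtual repository.

```shell
# Hypothetical helper: write a minimal Maven settings file that routes all
# repository requests through the "wonder" virtual repository.
write_wonder_settings() {
    target="$1"
    cat > "$target" <<'EOF'
<settings>
  <mirrors>
    <mirror>
      <id>wonder</id>
      <mirrorOf>*</mirrorOf>
      <url>http://127.0.0.1:8081/artifactory/wonder</url>
    </mirror>
  </mirrors>
</settings>
EOF
}

# Written to a scratch file here; pass it to Maven with: mvn -s <file> ...
settings_file="$(mktemp)"
write_wonder_settings "$settings_file"
```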
add an export to your .bashrc or similar; like:
export HIVE_DEV_BOX_HOST_DIR=$HOME/hive-dev-box
The dev environment assumes that you are working on upstream patches, and will always open a new branch forked from master. If you skip this step, things may not work and you will be left to do these things yourself; if you are using the HIVE_SOURCES env variable you might not need to set it anyway.
# make sure to load the new env variables for bash
. ~/.bashrc
# and also create the host dir beforehand
mkdir -p "$HIVE_DEV_BOX_HOST_DIR"
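To avoid adding the export more than once, the line can be appended only when it is missing. A minimal sketch; the helper name is hypothetical, and it is demonstrated on a scratch file rather than on your real rc file.

```shell
# Hypothetical helper: append the export to an rc file unless it is already there.
add_host_dir_export() {
    rc_file="$1"
    line='export HIVE_DEV_BOX_HOST_DIR=$HOME/hive-dev-box'
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}

# Demonstrated on a scratch file; use "$HOME/.bashrc" for real.
demo_rc="$(mktemp)"
add_host_dir_export "$demo_rc"
add_host_dir_export "$demo_rc"   # second call is a no-op
```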
# invoking with an argument names the container; the same name is also preferred for the workspace and the development branch
./run.bash HIVE-12121-asd
# when the terminal comes up
# issuing the following command will clone the sources based on your srcs dsl
srcs hive
# enter hive dir ; and create a local branch based on your requirements
cd hive
git branch `hostname` apache/master
# if needed, patch the sources:
cdpd-patcher hive
# run a full rebuild
rebuild
# you may run eclipse
dev_eclipse
A shorter version exists for initializing upstream patch development
./run.bash HIVE-12121-asd
# this will clone the sources, create a branch named after the container's hostname, run a rebuild and open eclipse
hive_patch_development
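Going by the individual steps, the shortcut presumably amounts to something like the sequence below. This is a sketch of that sequence, not the actual implementation of hive_patch_development; the function name is made up, and srcs, rebuild and dev_eclipse only exist inside the container.

```shell
# Hypothetical re-creation of the shortcut from the individual steps.
# Meant to be run inside the container, from the development home.
patch_dev_sketch() {
    srcs hive &&
    cd hive &&
    git branch "$(hostname)" apache/master &&
    rebuild &&
    dev_eclipse
}
```

Chaining with `&&` stops at the first failing step, so a failed clone or rebuild does not go on to open the IDE on a broken workspace.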
Beyond the "obvious" /bin and /lib folders, there are some which might make it clearer how this works:
/work
- used to store downloaded and expanded artifacts
- if you switch to say apache hive 3.1.1 and then to some other version you shouldn't need to wait for the download and expansion of it..
- this is mounted as a docker volume; and shared between the containers
- files under /work are not changed
/active
- the /work folder may contain a number of versions of the same component; symbolic links point to the versions actually in use
- at any point, running ls -l /active gives a brief overview of the active components
/home/dev
- this is the development home
/home/dev/hive
- the Hive sources; in case HIVE_SOURCES is set at launch time, this folder will be mapped to that directory on the host
/home/dev/host
- this is a directory shared with the host; can be used to exchange files (something.patch)
- will also contain the workspace "template"
- the bin directory under this folder will be linked as /home/dev/bin so that scripts can be shared between containers and the host
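The relationship between /work and /active can be illustrated with plain symbolic links. The sketch below uses a scratch directory rather than the real folders; the `<component>-<version>` directory naming and the `activate` helper are assumptions made for the example, not a description of how the dev box implements it.

```shell
# Scratch directory standing in for the container's root
root="$(mktemp -d)"
mkdir -p "$root/work/hive-2.3.5" "$root/work/hive-3.1.1" "$root/active"

# Hypothetical helper: point active/<component> at one of the expanded versions
activate() {
    # -n: replace an existing link instead of descending into its target
    ln -sfn "$root/work/$1-$2" "$root/active/$1"
}

activate hive 2.3.5
activate hive 3.1.1   # switching only rewrites the link; nothing under work/ changes
ls -l "$root/active"
```

Since switching touches only the link, every expanded version stays available and can be shared between containers.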
- run NAME
- starts a new container with NAME - without attaching to it
- enter NAME
- enters into the container
# create a symlink to hive-dev-box/hdb from a directory on your PATH, e.g. $HOME/bin
ln -s $PWD/hdb $HOME/bin/hdb
# enable bash_completion for hdb
# add the following line to .bashrc
. <($HOME/bin/hdb bash_completion)
# use hadoop 3.1.0
sw hadoop 3.1.0
# use hive 2.3.5
sw hive 2.3.5
# use tez 0.8.4
sw tez 0.8.4
- optionally switch to a different metastore implementation
- wipe it clean
- populate schema and load sysdb
reinit_metastore [derby|postgres|mysql]
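Since the command wipes the metastore, a small guard that rejects unknown types before anything is touched may be handy in scripts. A sketch only: the wrapper name is hypothetical, and it assumes reinit_metastore accepts either no argument or one of the three types listed.

```shell
# Hypothetical wrapper: validate the metastore type before delegating.
safe_reinit_metastore() {
    case "${1:-}" in
        ""|derby|postgres|mysql)
            reinit_metastore "$@"
            ;;
        *)
            echo "unknown metastore type: $1 (expected derby, postgres or mysql)" >&2
            return 2
            ;;
    esac
}
```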