==============
OVERVIEW
Repository hosting PIH data warehousing scripts built on the Pentaho product suite. This contains:
- Kettle jobs (*.kjb): Represent a sequence of transforms that can be executed or run on a schedule to process data
- Kettle transforms (*.ktr): Represent specific data transformations (Extract, Transform, and Load) that can be executed within one or more jobs
- Other files (.sql, etc): Files used by these jobs and transforms
=====================
DEV ENVIRONMENT SETUP
For development, it is most helpful to use Pentaho Spoon, a graphical designer and editor for creating, editing, and viewing Pentaho jobs and transforms.
- Download Pentaho Kettle from here: http://community.pentaho.com/projects/data-integration/ (we are currently using version 6.1)
- Unzip and copy into your preferred executable directory (eg. /opt/pentaho/data-integration)
- Follow any additional setup instructions here: http://community.pentaho.com/projects/data-integration/
- Download the latest mysql connector jar from here: https://dev.mysql.com/downloads/file/?id=465644
- Extract the mysql connector jar out of the above zip file and copy it into the data-integration/lib folder
- Create a new mysql database for the warehouse ( eg. create database openmrs_warehouse default charset utf8; )
- Check out the pih-pentaho code (this repository) and make note of its location. This is your PIH_PENTAHO_HOME directory.
- Edit or create ~/.kettle/kettle.properties, and add the following: PIH_PENTAHO_HOME=the/folder/from/above
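This step can be scripted as follows (a sketch; /opt/pih-pentaho is an assumed checkout location, adjust it to your actual PIH_PENTAHO_HOME):

```shell
# Create the .kettle directory if needed and record where pih-pentaho
# was checked out (replace /opt/pih-pentaho with your own location)
mkdir -p "$HOME/.kettle"
echo "PIH_PENTAHO_HOME=/opt/pih-pentaho" >> "$HOME/.kettle/kettle.properties"
```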
- Create file at ~/.kettle/pih-kettle.properties with the following variables set to your preferred values:
  ** A sample of what this should look like is in pih-pentaho/config/pih-kettle-default.properties
  ** Connection settings are there for configuring the source and target databases
  ** pih.country should be set to the country of interest, and controls certain configurations within the main jobs and transforms
  ** warehouse.db.key_prefix by default will be 100 unless you override it here. This is a prefix that is prepended to all primary keys for data that is imported.
  ** Sample for Haiti below:
pih.country = haiti
openmrs.db.host = localhost
openmrs.db.port = 3306
openmrs.db.name = openmrs
openmrs.db.user = root
openmrs.db.password = rootpw
warehouse.db.host = localhost
warehouse.db.port = 3306
warehouse.db.name = openmrs_warehouse
warehouse.db.user = root
warehouse.db.password = rootpw
warehouse.db.key_prefix = 10
- Run "spoon.sh" to start
- Link your shared.xml to the shared file used by the project:
  ** Go to $HOME/.kettle/shared.xml and delete this file if it exists
  ** Create a new shared.xml that is a symbolic link to "shared/shared-connections.xml" in this project
- Test it out by trying to run a job in the pih-pentaho/jobs folder (eg. load-from-openmrs.kjb)
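The shared.xml steps above can be scripted, e.g. (a sketch; assumes PIH_PENTAHO_HOME is set in your shell, with /opt/pih-pentaho as a placeholder default):

```shell
# Replace any existing shared.xml with a symlink into this repository.
# PIH_PENTAHO_HOME must point at your pih-pentaho checkout.
PIH_PENTAHO_HOME="${PIH_PENTAHO_HOME:-/opt/pih-pentaho}"
mkdir -p "$HOME/.kettle"
rm -f "$HOME/.kettle/shared.xml"
ln -s "$PIH_PENTAHO_HOME/shared/shared-connections.xml" "$HOME/.kettle/shared.xml"
```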
Each country has its own folder with implementation-specific jobs and transforms. By convention, the main pipeline for each country is accessible under: /jobs/refresh-warehouse.kjb
=======================
RUNNING VIA PETL
=======================
RUNNING VIA DOCKER
Usage:
- Make sure docker is installed and running on the target machine: use our docker ansible playbook, or follow https://docs.docker.com/engine/installation/linux/ubuntulinux/
- Clone this repo into /home/reporting on the target machine
- Build the image locally on the target machine:
    cd /home/reporting/pih-pentaho/docker
    sudo docker build -t pih:pdi .
- Create a custom "pih-kettle.properties" file in "/home/reporting/.kettle" based on pih-kettle-default.properties, but customized for this install
- Create the warehouse database in mysql: "create database <warehouse_db_name> default charset utf8", where <warehouse_db_name> is the warehouse.db.name specified in pih-kettle.properties
- To run the "load-from-openmrs" job on the target machine using kitchen:
    sudo docker run --net="host" --rm \
      -v /home/reporting:/home/reporting \
      -v /home/reporting/.kettle/pih-kettle.properties:/opt/pentaho/.kettle/pih-kettle.properties \
      pih:pdi /opt/pentaho/data-integration/kitchen.sh \
      -file="/home/reporting/pih-pentaho/jobs/load-from-openmrs-and-star-schema.kjb"
  ** --net="host": allows the container to connect to mysql on the host machine via 127.0.0.1
  ** -v /home/reporting:/home/reporting: mounts the /home/reporting directory on the host machine to /home/reporting in the container