Skip to content

Running PDCi2b2

Jean Louis Raisaro edited this page Sep 14, 2017 · 20 revisions

Running PDCi2b2

PDCi2b2 is a client-server application that requires to be deployed on two distinct and independent servers (e.g., two instances in two distinct cloud providers). The first server (also storage provider) is responsible for storing encrypted aggregate-level i2b2 data. The second server (also proxy) is responsible for helping the storage provider during the re-encryption phase so that the query result can only be decrypted by the user. An example of a deployment (local) can be found at test_local_deployment.

Servers Setup

  1. Depending on the architecture and OS of the servers being used (default is amd64) build the PDCi2b2 executable:

./compileLinux.sh or ./compileMac.sh or ./compileWindows.sh

  1. Copy the executable to each of the two servers:

scp $GOPATH/src/github.com/JLRgithub/PDCi2b2/app/PDCi2b2 .

  1. Copy the database configuration (db.toml) to the storage provider (same path of the executable) and update the parameters' values according to your own database settings:

scp $GOPATH/src/github.com/JLRgithub/PDCi2b2/app/db.toml .

  1. For each server run a "server setup" and follow the installation guide:

./PDCi2b2 server setup

  1. Create in your client machine (the one that will act as your client) a group.toml and append the content of all the public.toml files created during each setup.

  2. Start each server:

./PDCi2b2 server -c private.toml or..

./PDCi2b2 -d 1 server -c private.toml (in debug mode)

Extraction Transformation and Loading (ETL)

  1. Extract a CSV file from an i2b2 instance or generate a new one with the following columns (e.g., testdata.csv):
  • location_cd: this column stores the code of the location X at which aggregate-level data has been collected
  • time: this column stores the time unit Y of the data collection, which can be set at any interval (i.e., month, quarter, year) as long as the unit is consistent across the dataset;
  • concept_cd: this column stores the ontology code Z, e.g., a clinical condition, a medication, a lab result, a procedure, etc.;
  • totalnum: this column stores the number of patients at location X and time unit Y who have an observation specified by the ontology concept Z.
  1. Encrypt the extracted (or generated) CSV file (input.csv) in order to obtain another csv file as output (output.csv):

./PDCi2b2 encryptCSV -f group.toml -a totalnum -i input.csv -o output.csv

  1. Load the encrypted CSV file in the PostgreSQL database at the SCP server.

Run Query

PDCi2b2 allows to run SQL aggregate queries of the form SELECT SUM(totalnum) FROM table WHERE {location_cd}* AND {time}* AND {concept_cd}* GROUP BY * on encrypted data stored in the database at the SCP server.

For example, to do so, execute on the client machine:

./PDCi2b2 -d 1 run -f group.toml -l location1,location2,.. -t time1,time2,.. -c concept1,concept2,.. -g group1,group2,... -o result.csv

What each flag stands for:

  • -d = debug level;
  • -f = servers' definition FILE (default: "group.toml");
  • -l = location codes in the SQL-WHERE clause (a comma-separated list of location codes, e.g., location1,location2, is equivalent to the Boolean expression location1 OR location2);
  • -t = time frames in the SQL-WHERE clause (a comma-separated list of time frames, e.g., time1,time2, is equivalent to the Boolean expression time1 OR time2);
  • -c = concept codes in the SQL-WHERE clause (a comma-separated list of concept codes, e.g., concept1,concept2, is equivalent to the Boolean expression concept1 OR concept2);
  • -g = attributes in the SQL-GROUPBY clause. Possible values: location_cd, concept_cd, time
  • -o = output CSV FILE containing the query result
Clone this wiki locally