Blueprint is a declarative extraction language for semi-structured documents. Users describe their documents using constraints, and Blueprint's engine solves the constraint systems to produce extractions.
For example:
extract(
text_equals("Neil Patel")('patel_label'),
is_dollar_amount('patel_revenue'),
bottom_aligned('patel_label', 'patel_revenue'),
text_equals("Annual $")('annual_revenue_label'),
right_aligned('annual_revenue_label', 'patel_revenue'))
Applied to a document containing these fields, this produces:
{
'patel_label': "Neil Patel",
'annual_revenue_label': "Annual $",
'patel_revenue': "$79,280"
}
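To make the idea of solving a constraint system concrete, here is a minimal Python sketch. It is not Blueprint's engine or API: the Word class, the toy page data (including the extra "$12,000" and "Jane Doe" words), the pixel tolerance, and the brute-force search are all assumptions made for illustration. Only the constraint names mirror the example above.

```python
from dataclasses import dataclass
from itertools import permutations
import re


@dataclass(frozen=True)
class Word:
    """A hypothetical OCR word with its bounding box (pixel coordinates)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


TOLERANCE = 2.0  # assumed alignment slack, in pixels


def bottom_aligned(a, b):
    return abs(a.bottom - b.bottom) <= TOLERANCE


def right_aligned(a, b):
    return abs(a.right - b.right) <= TOLERANCE


def is_dollar_amount(word):
    return re.fullmatch(r"\$\d{1,3}(,\d{3})*(\.\d{2})?", word.text) is not None


def solve(words):
    """Try every assignment of three distinct words to the three fields
    and return the first one that satisfies all five constraints."""
    fields = ("patel_label", "patel_revenue", "annual_revenue_label")
    for label, revenue, header in permutations(words, 3):
        if (label.text == "Neil Patel"
                and is_dollar_amount(revenue)
                and bottom_aligned(label, revenue)
                and header.text == "Annual $"
                and right_aligned(header, revenue)):
            return dict(zip(fields, (label.text, revenue.text, header.text)))
    return None


# Made-up page: two rows of a table under an "Annual $" column header.
PAGE = [
    Word("Annual $", 200, 10, 260, 22),
    Word("Neil Patel", 20, 40, 90, 52),
    Word("$79,280", 210, 40, 260, 52),
    Word("Jane Doe", 20, 70, 80, 82),
    Word("$12,000", 210, 70, 260, 82),  # other row: fails bottom_aligned
]

result = solve(PAGE)
print(result)
```

The second dollar amount is rejected because it does not share a baseline with the "Neil Patel" label, which is the kind of spatial reasoning the constraints express. A real engine would need something far more efficient than exhaustive search.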
Blueprint can be used from the command line.
We also provide Studio, an experimental, interactive GUI for labeling training data, automatically synthesizing Blueprint constraint systems from labeled data, tweaking synthesized Blueprint programs or building them from scratch interactively, running them against test data, and debugging.
Start by cloning this repo to your machine.
You will need to install several Python packages. It is recommended that you
do this in a Python virtual environment, created using the Python venv
module.
If you know what that means and how to do it, you can skip the rest of this
section.
If not, your choices are:
- Learn about venv: https://docs.python.org/3/library/venv.html
- Or, execute all of the commands below without creating/activating a virtual environment. The required packages will be installed globally on your machine (or globally to your user account on your machine). This isn't necessarily a problem, but a virtual environment keeps things tidy.
- Or, blindly follow the instructions in this paragraph. From the root of this repo, execute
python3 -m venv .venv
to create a virtual environment in a subdirectory called .venv, then execute
. .venv/bin/activate
to activate the virtual environment. Do this second step (the activate step) for every terminal you use to interact with the repo (starting the Studio server, etc.).
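If you are unsure whether a given terminal has the virtual environment activated, a small check like the following can help. It is a sketch that relies on the standard library's documented behavior: inside a venv, sys.prefix differs from sys.base_prefix. The function name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Run it with the same python3 you intend to use for Blueprint; if it reports False, the activate step has not been done in that terminal.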
Add path/to/blueprint-oss/blueprint/py and path/to/blueprint-oss/server/py to your PYTHONPATH. In most shells (Bash/Zsh), this can be accomplished by running:
export PYTHONPATH="${PYTHONPATH}:path/to/blueprint-oss/blueprint/py:path/to/blueprint-oss/server/py"
For how to make this happen every time you start a new terminal, see: https://stackoverflow.com/questions/3402168/permanently-add-a-directory-to-pythonpath
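As an illustration of what the export line does, the sketch below builds the same value in Python. The helper names blueprint_paths and extend_pythonpath are hypothetical, and path/to/blueprint-oss remains a placeholder for wherever you cloned the repo. Unlike the plain export, this version skips empty entries and avoids adding a directory twice.

```python
import os
from pathlib import Path


def blueprint_paths(repo_root):
    """Return the two directories the instructions say to add."""
    root = Path(repo_root)
    return [str(root / "blueprint" / "py"), str(root / "server" / "py")]


def extend_pythonpath(existing, repo_root):
    """Append the Blueprint directories to an existing PYTHONPATH value."""
    # Drop empty entries so an unset PYTHONPATH does not leave a stray separator.
    parts = [p for p in existing.split(os.pathsep) if p]
    for path in blueprint_paths(repo_root):
        if path not in parts:
            parts.append(path)
    return os.pathsep.join(parts)


# Placeholder repo location; substitute the real path to your clone.
print(extend_pythonpath(os.environ.get("PYTHONPATH", ""), "path/to/blueprint-oss"))
```

The printed value is what you would assign to PYTHONPATH in your shell.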
Install the Python requirements (ideally from within your venv) by running:
pip3 install -r python_requirements.txt
If you are just using Blueprint's CLI, you can skip this section.
To run Studio, you will need to have Node/npm installed. The minimum required
version has not been determined; these instructions have been tested using
Node v17.0.1.
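To compare your installed Node against the tested version, something like the following sketch may be useful. It assumes only that node --version prints a string such as v17.0.1; the helper names are made up for this example, and an older version is not known to fail, since no minimum has been established.

```python
import re
import shutil
import subprocess

TESTED_VERSION = (17, 0, 1)  # the Node version these instructions were tested with


def parse_node_version(text):
    """Turn a string like 'v17.0.1' into a tuple like (17, 0, 1)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized Node version string: {text!r}")
    return tuple(int(group) for group in match.groups())


def installed_node_version():
    """Return the installed Node version, or None if it cannot be determined."""
    node = shutil.which("node")
    if node is None:
        return None
    try:
        proc = subprocess.run([node, "--version"], capture_output=True, text=True)
        return parse_node_version(proc.stdout)
    except (OSError, ValueError):
        return None


version = installed_node_version()
if version is None:
    print("node was not found on PATH (or its version could not be read)")
elif version < TESTED_VERSION:
    print("installed Node", version, "is older than the tested", TESTED_VERSION)
else:
    print("installed Node", version, "is at least the tested", TESTED_VERSION)
```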
These instructions use npm, but yarn should also work.
Install Studio's JavaScript requirements (they are installed locally within the repo and will not touch the rest of your machine):
cd studio
npm install
Blueprint provides a command-line interface. To run on a sample paystub:
cd examples/paystubs
python3 paystubs.py run_model -v -g ocr/*
To generate OCR for your own document samples, see: https://cloud.google.com/vision/docs/drag-and-drop
Studio is an interactive GUI application for inspecting document samples and creating, running, testing, and debugging Blueprint extraction programs/models.
To use Studio, you will need to start a server hosting the frontend, and a server hosting the backend.
The Studio frontend is a single-page web application written in TypeScript/React. To start the server:
cd studio
npm start
The Studio backend is a thin wrapper around the Blueprint Python module, which allows the Studio frontend to execute Blueprint models and display the results. To start the backend, run:
python3 -m bp_server
During development, it can be useful to turn on auto-reloading for the backend server. This can be done by launching it like so:
FLASK_ENV=development python3 -m bp_server
Once you have started both the frontend and backend servers, open a browser to
http://localhost:3000. You should see the Studio welcome screen.
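If the welcome screen does not appear, it can help to confirm that something is actually listening on the frontend port. The sketch below is a generic TCP check using the standard library, not part of Blueprint; is_listening is a hypothetical helper name. Only port 3000 is checked because the backend's port is not stated in these instructions.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


print("frontend reachable on port 3000:", is_listening("localhost", 3000))
```

A False result usually means npm start is not running in the studio directory, or is still starting up.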
Click on the instructions button and read the instructions carefully. You can
make a copy of the examples/paystubs directory and use the copy as your
project directory; you should then be able to view sample paystubs in the GUI.
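Copying the example directory can be done with any file manager or shell command; as one option, here is a small Python sketch using shutil.copytree. The function name copy_project and the destination name my_project are made up for this example.

```python
import shutil
from pathlib import Path


def copy_project(src, dst):
    """Copy an example directory to a new project directory."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"source directory not found: {src}")
    if dst.exists():
        # Refuse to overwrite, so an existing project is never clobbered.
        raise FileExistsError(f"destination already exists: {dst}")
    shutil.copytree(src, dst)
    return dst


# Example call, run from the root of the repo ("my_project" is a placeholder):
# copy_project("examples/paystubs", "my_project")
```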
Studio does not use Python-based Blueprint programs (for example, the reference
extraction program paystubs.py). You will need to make a new model.
See instructions to reproduce experiments in the VLDB'22 paper.
There are two color schemes available for Studio. Toggling between them must be done manually. The default is the dark color scheme. To switch to the light color scheme, run the following command. You may need to restart the front-end development server.
cp studio/src/components/LightTheme.css studio/src/components/Theme.css