This software trains concept classifiers for use by an accompanying service. If you don't have your own data to train on, you can use the pretrained models described here. This project was written about here as part of the Federal Data Strategy Incubator Project.
By concept tagging, we mean you can supply text, for example:

> Volcanic activity, or volcanism, has played a significant role in the geologic evolution of Mars. Scientists have known since the Mariner 9 mission in 1972 that volcanic features cover large portions of the Martian surface.

and get back predicted keywords, like *volcanology*, *mars surface*, and *structural properties*, as well as topics like *space sciences* and *geosciences*, drawn from a standardized list of several thousand NASA concepts, with a probability score for each prediction.
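As a rough sketch, a tagging result pairs each predicted concept with a probability score. The shape below is hypothetical and the scores are invented for illustration; the real response format is defined by the accompanying service.

```python
# Hypothetical shape of a tagging result: predicted concepts mapped to
# probability scores. Concept names echo the example above; the scores
# are made up for illustration only.
prediction = {
    "keywords": {"volcanology": 0.91, "mars surface": 0.87, "structural properties": 0.55},
    "topics": {"space sciences": 0.95, "geosciences": 0.88},
}

# Rank the predicted keywords by score, highest first.
ranked = sorted(prediction["keywords"], key=prediction["keywords"].get, reverse=True)
```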
You can see a list of options for this project by navigating to the root of the project and executing `make` or `make help`.
This project requires:
- docker -- tested with this version
- GNU Make -- tested with 3.81 built for i386-apple-darwin11.3.0
You have several options for installing and using the pipeline.
You can just pull a stable docker image which has already been made:
docker pull storage.analytics.nasa.gov/abuonomo/concept_trainer:stable
In order to do this, you must be on the NASA network and able to connect to the https://storage.analytics.nasa.gov docker registry.

* There are several versions of the images. You can see them here. If you don't use `stable`, some or all of this guide may not work properly.
To build from source, first clone this repository and go to its root.
Then build the docker image using:
docker build -t concept_trainer:example .
Substitute `concept_trainer:example` with whatever name you would like. Keep this image name in mind; it will be used elsewhere.

* If you are actively developing this project, look at the `make build` target in the Makefile. This command automatically tags the image with the current commit and the most recent git tag, and requires that setuptools-scm is installed.
* Tested with python3.7.

First, clone this repository. Then create and activate a virtual environment. For example, using venv:
python -m venv my_env
source my_env/bin/activate
Next, while in the root of this project, run `make requirements`.
The pipeline takes input document metadata structured like this and a config file like this. The pipeline produces interim data, models, and reports.
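For reference, JSON Lines (`.jsonl`) input means one JSON object per line. The field names below are purely illustrative assumptions; see the linked example file for the actual schema.

```python
import io
import json

# One JSON object per line. These field names are hypothetical and only
# demonstrate the JSONL format, not the pipeline's real record schema.
sample = (
    '{"title": "Volcanism on Mars", "abstract": "Volcanic activity ..."}\n'
    '{"title": "Mariner 9 Imagery", "abstract": "Volcanic features ..."}\n'
)
records = [json.loads(line) for line in io.StringIO(sample)]
```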
- using docker -- if you pulled or built the image
- using python in virtual environment -- if you are running in a local virtual environment
First, make sure the `config`, `data`, `data/raw`, `data/interim`, `models`, and `reports` directories exist. If they do not, create them (`mkdir -p config data/raw data/interim models reports`). These directories will be used as docker mounted volumes. If you don't make these directories beforehand, docker will create them later on, but their permissions will be unnecessarily restrictive.
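If you prefer to create the directories from Python rather than the shell, a sketch like this is equivalent, and `exist_ok=True` makes it safe to re-run:

```python
import tempfile
from pathlib import Path

# Create the mounted-volume directories up front so docker does not later
# create them with restrictive, root-owned permissions. A temporary root
# is used here purely for demonstration; in practice, run this from the
# project root.
root = Path(tempfile.mkdtemp())
for d in ["config", "data/raw", "data/interim", "models", "reports"]:
    (root / d).mkdir(parents=True, exist_ok=True)
```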
Next, make sure you have your input data in the `data/raw/` directory. Here is an example file with the proper structure. You also need to make sure the `subj_mapping.json` file here is in the `data/interim/` directory.
Now, make sure you have a config file in the `config` directory. Here is an example config which will work with the above example file.
With these files in place, you can run the full pipeline with this command:
docker run -it \
-v $(pwd)/data:/home/data \
-v $(pwd)/models:/home/models \
-v $(pwd)/config:/home/config \
-v $(pwd)/reports:/home/reports \
concept_trainer:example pipeline \
EXPERIMENT_NAME=my_test_experiment \
IN_CORPUS=data/raw/STI_public_metadata_records_sample100.jsonl \
IN_CONFIG=config/test_config.yml
Substitute `concept_trainer:example` with the name of your docker image. You can set `EXPERIMENT_NAME` to whatever you prefer. `IN_CORPUS` and `IN_CONFIG` should be set to the paths of the corpus and the configuration file, respectively.
* Developers can also use the `container` command in the Makefile. This command requires setuptools-scm and uses the image defined by the `IMAGE_NAME` variable, with a version number equal to the most recent git tag.
Assuming you have cloned this repository, files for testing the pipeline should be in place. In particular, `data/raw/STI_public_metadata_records_sample100.jsonl` and `config/test_config.yml` should both exist. Additionally, you should add the `src` directory to your `PYTHONPATH`:
export PYTHONPATH=$PYTHONPATH:$(pwd)/src/
Then, you can run a test of the pipeline with:
make pipeline \
EXPERIMENT_NAME=test \
IN_CORPUS=data/raw/STI_public_metadata_records_sample100.jsonl \
IN_CONFIG=config/test_config.yml
If you are not using the default values, simply substitute the proper paths for `IN_CORPUS` and `IN_CONFIG`. Choose whatever name you prefer for `EXPERIMENT_NAME`.
If you have access to the `hq-ocio-ci-bigdata` moderate s3 bucket, you can sync local experiments with those in the s3 bucket.
For example, if you created a local experiment with `EXPERIMENT_NAME=my_cool_experiment`, you can upload your local results to the appropriate place in the s3 bucket with:
make sync_experiment_to_s3 EXPERIMENT_NAME=my_cool_experiment PROFILE=my_aws_profile
where `my_aws_profile` is the name of your awscli profile which has access to the given bucket.
Afterwards, you can download the experiment interim files and results with:
make sync_experiment_from_s3 EXPERIMENT_NAME=my_cool_experiment PROFILE=my_aws_profile
If you have access to the moderate bucket and want to work with the full STI metadata records, you can download them to the `data/raw` folder with:
make sync_raw_data_from_s3 PROFILE=my_aws_profile
When using these data, you will want a config file different from the test config file. You can browse previous experiments at `s3://hq-ocio-ci-bigdata/home/DataSquad/classifier_scripts/` to see example config files. You might try:
weights: # assign weights for term types specified in process section
NOUN: 1
PROPN: 1
NOUN_CHUNK: 1
ENT: 1
ACRONYM: 1
min_feature_occurrence: 100
max_feature_occurrence: 0.6
min_concept_occurrence: 500
See config/test_config.yml for details on these parameters.
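As rough intuition for the occurrence thresholds, a feature is typically kept only if it appears in at least `min_feature_occurrence` documents (an absolute count) and in no more than `max_feature_occurrence` of them (a fraction), much like `min_df`/`max_df` in scikit-learn's vectorizers. The sketch below illustrates that intent; it is not the pipeline's actual implementation.

```python
# Illustrative only: prune a vocabulary by document frequency. An integer
# lower bound is an absolute document count; the upper bound is a fraction
# of the corpus. This mimics the intent of min/max_feature_occurrence.
def prune_vocabulary(doc_freq, n_docs, min_occurrence, max_fraction):
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_occurrence and df / n_docs <= max_fraction
    }

# "the" is too common (0.98 > 0.6), "volcanism" too rare (90 < 100).
doc_freq = {"mars": 450, "the": 980, "volcanism": 90}
kept = prune_vocabulary(doc_freq, n_docs=1000, min_occurrence=100, max_fraction=0.6)
```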
For more advanced usage of the project, look at the Makefile commands and their associated scripts. You can learn more about these python scripts by running them with a help flag. For example: `python src/make_cat_models.py -h`.