Iggy

[Paper]

Implementation of the paper "How Did This Get Funded?! Automatically Identifying Quirky Scientific Achievements".

Downloading relevant files

The N-gram models:

Copy all the files from here and place them in resources/ngram-language-models/.

Finetuned GPT-2 model

We finetuned a GPT-2 model on our dataset of titles. Copy the folder from here to resources/finetuned-gpt2/.

Rudeness classifier

We trained a simple NBSVM model for detecting rude or crude language, and we use it as one of our classifiers. Copy the model from here and place it in resources/rudeness-classifier/.

Our pretrained models

You can find our pretrained models (Iggy and the BERT-based models) here. Place them in models-weights/.

Our SemanticScholar dataset

Our raw Semantic Scholar dataset of 0.6M titles can be found here.

Running the classifiers on 0.6M titles takes some time. Fortunately, you can find our Semantic Scholar dataset with this step already performed here.

Usage

Using the classifiers

To analyze a set of titles using the classifiers, cd to Iggy/ and run:

python classifiers/run_classifiers.py --titles_path TITLES_PATH --save_path SAVE_PATH
                          --classifiers_to_use CLASSIFIERS_TO_USE [--labeled]

Where:

titles_path: path to a CSV of titles to analyze. The titles should be in a column named "title".

save_path: where to save the analysis results.

classifiers_to_use: which classifiers to use to analyze the titles, given as a path to a txt file listing the names of the classifiers to run. See classifiers/classifiers_names.txt for reference.

--labeled: whether the titles to analyze are labeled, i.e., whether the titles come from the file datasets/IgDataset.csv.

After the code finishes running, SAVE_PATH will contain a file with the same structure as dataset/IggDataset_with_data_analysis_results.csv.
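
For example, a concrete invocation might look like this (the input and output paths are illustrative; the classifier list is the file shipped with the repository):

# Hypothetical paths; any CSV with a "title" column works as input.
python classifiers/run_classifiers.py \
    --titles_path datasets/my_titles.csv \
    --save_path results/my_titles_analyzed.csv \
    --classifiers_to_use classifiers/classifiers_names.txt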

Training, testing and predicting with the MLP model

Usage:

python model_training/MLP_train.py --model_save_path MODEL_SAVE_PATH [--train] [--test]
                    [--predict]
                    [--train_dataset_root_path TRAIN_DATASET_ROOT_PATH]
                    [--predict_test_dataset_path PREDICT_TEST_DATASET_PATH]
                    [--predict_output_path PREDICT_OUTPUT_PATH]
                    [--hidden_size HIDDEN_SIZE] [--alpha ALPHA]

To train the MLP classifier, cd to Iggy/ and run:

python model_training/MLP_train.py --model_save_path MODEL_SAVE_PATH [--train] 
                    [--train_dataset_root_path TRAIN_DATASET_ROOT_PATH]
                    [--hidden_size HIDDEN_SIZE] [--alpha ALPHA]

Where:

--train: flags the script to train the MLP model.

model_save_path: where to save the trained model.

train_dataset_root_path: path to a directory containing train.csv and dev.csv. The model will use those files for training.

hidden_size: int. Hidden size of the MLP. Default is 256 (this is the value we used).

alpha: float. L2 regularization parameter. Default is 2 (this is the value we used).
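
For example, a training run with the defaults from the paper might look like this (the model path and dataset directory are illustrative):

# Assumes the dataset directory contains train.csv and dev.csv.
python model_training/MLP_train.py --train \
    --model_save_path models-weights/iggy-mlp \
    --train_dataset_root_path datasets/mlp-split \
    --hidden_size 256 --alpha 2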



To evaluate the model on labeled data, run:

python model_training/MLP_train.py --model_save_path MODEL_SAVE_PATH [--test] 
                    [--predict_test_dataset_path PREDICT_TEST_DATASET_PATH]

Where:

--test: flags the script to evaluate the MLP model on predict_test_dataset_path.

model_save_path: path to the trained model.

predict_test_dataset_path: path to a labeled dataset to evaluate.
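
For example, assuming the model was saved as in the training sketch above (the test set path is illustrative):

python model_training/MLP_train.py --test \
    --model_save_path models-weights/iggy-mlp \
    --predict_test_dataset_path datasets/mlp-split/test.csv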



To predict using the model on unlabeled data, run:

python model_training/MLP_train.py --model_save_path MODEL_SAVE_PATH [--predict] 
                    [--predict_test_dataset_path PREDICT_TEST_DATASET_PATH]
                    [--predict_output_path PREDICT_OUTPUT_PATH]

Where:

--predict: flags the script to run prediction with the MLP model on predict_test_dataset_path.

model_save_path: path to the trained model.

predict_test_dataset_path: path to an unlabeled dataset to predict on.

predict_output_path: where to store the prediction results.
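
For example, to label a large set of unlabeled titles (all paths are illustrative):

python model_training/MLP_train.py --predict \
    --model_save_path models-weights/iggy-mlp \
    --predict_test_dataset_path datasets/semantic_scholar_titles.csv \
    --predict_output_path results/mlp_predictions.csv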

Training and predicting with the BERT-based models

Usage:

To predict using a BERT model on unlabeled data, run:

python model_training/bert_labeler.py [-h] --data_dir DATA_DIR --bert_model BERT_MODEL
                      --output_path OUTPUT_PATH
                      [--max_seq_length MAX_SEQ_LENGTH]
                      [--batch_size BATCH_SIZE]
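
For example, a sketch of a concrete run (the data directory, output path, and batch size are illustrative; point --bert_model at one of the downloaded BERT-based models):

# Hypothetical paths; the model folder comes from the models-weights download above.
python model_training/bert_labeler.py \
    --data_dir datasets/unlabeled-titles \
    --bert_model models-weights/bert-iggy \
    --output_path results/bert_predictions.csv \
    --batch_size 32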
