CLPsych 2022 Shared Task Structure

Computational Psychology Shared Task organized with NAACL 2022

  • Organize the files (Next 2 weeks)
  • Task : predict the moments of change in the posts made by a user. The following modules are needed to create baseline models:
    • data_reader.py : takes as input the path to a training dataset containing all the timelines
    • evaluator.py : custom functions for precision, recall, F1-score, and other relevant metrics
    • utils.py : stores the results
    • model_building.py : deep language models specific to the task; NumPy and PyTorch are acceptable
  • Each data point would be an array of:
    • Timeline ID
    • Post ID
    • User ID
    • Date
    • Label : ['IS', 'IE', 'O']
    • Text : the post made by a user at a particular instant in time
  • While assessing the baseline models, we are specifically interested in the 'IS' and 'IE' labels. A sample data point is sketched below.
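
For concreteness, a single row of a timeline CSV might look like the following (a hypothetical example; the exact field names and formatting of the released data may differ):

timeline_id,post_id,user_id,date,label,text
"t-001","p-0042","u-17","2021-03-15","IS","Lately I haven't been feeling like myself..."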

Usage

Loading dataset

You can load a dataset simply by passing the file path:

from data_reader import csv_reader

dataset = csv_reader("data/sample.csv")

Create embeddings

Select the embedding model type. In this repository, we provide three ways to define numerical embeddings of the textual data: (a) TF-IDF, (b) Sentence Transformers, and (c) GloVe. You can use any of these built-in embedding methods or introduce your own by adding another if-else block inside the call function; a hedged sketch of one such extension appears below. For instance, BERT embeddings can be computed as described in https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/ .

from model_embeddings import modelEmbeddings
embeddings_model = modelEmbeddings(model_type = 'glove')
embeddings = embeddings_model(documents)

Here model_type can take the following values: 'tfidf', 'sentence_transformer', or 'glove'.
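
The sketch below computes mean-pooled BERT embeddings with the transformers library and could back a new 'bert' branch inside the call function (bert_embed and the pooling choice are illustrative assumptions, not the repository's API):

# Sketch: mean-pooled BERT sentence embeddings (requires torch and transformers).
# bert_embed and the mean-pooling strategy are illustrative, not part of this repository.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def bert_embed(documents):
    """Return one mean-pooled embedding per document."""
    with torch.no_grad():
        batch = tokenizer(documents, padding=True, truncation=True, return_tensors='pt')
        hidden = model(**batch).last_hidden_state      # (batch, tokens, 768)
        mask = batch['attention_mask'].unsqueeze(-1)   # zero out padding positions
        return (hidden * mask).sum(1) / mask.sum(1)    # (batch, 768)

embeddings = bert_embed(["a sample post", "another post"])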

A subtle difference between TF-IDF and the embedding models lies in the engineered features: TF-IDF is a discrete, bag-of-words representation, whereas the embedding models are continuous semantic representations of words or sentences. A good way to select model_type is to compute the similarity between words, then project this similarity into a t-SNE plot or heatmap to check which model_type's word-similarity scores are intuitively sensible.
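
A minimal sketch of that sanity check, assuming embeddings_model returns one dense vector per input string (as in the examples above):

from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

words = ["sad", "unhappy", "happy", "table"]
vectors = embeddings_model(words)  # assumed: a dense 2-D array, one row per word

# Heatmap-style check: related words ('sad', 'unhappy') should score higher
print(cosine_similarity(vectors))

# 2-D t-SNE projection for visual inspection (perplexity must stay below the word count)
print(TSNE(n_components=2, perplexity=2.0, init='random').fit_transform(vectors))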

Loading pre-trained embeddings

from model_embeddings import modelEmbeddings
embeddings_model = modelEmbeddings(model_type = 'tfidf')
embeddings = embeddings_model(documents, load_path='models/tfidf_vectorizer.pkl')

Saving trained model to custom location

from model_embeddings import modelEmbeddings
embeddings_model = modelEmbeddings(model_type = 'tfidf')
embeddings = embeddings_model(documents, save_path='models/tfidf_vectorizer.pkl')

Training

Train a basic SVM classifier and save the trained model to a custom location:

from data_reader import csv_reader
from train import Classifier

dataset = csv_reader("data/sample.csv")
classifier = Classifier(dataframe=dataset, embeddings_model_type='sentence_transformer', vectorizer_path=None)
model_path = classifier.train_predict()

Evaluation

Evaluate the trained model using the evaluator:

from evaluator import evaluator

evaluation = evaluator(classifier)  # renamed from eval to avoid shadowing the Python builtin
print(evaluation.precision())
print(evaluation.recall())
print(evaluation.accuracy())
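
Since the shared task focuses on the 'IS' and 'IE' labels, per-label scores are also worth reporting. A hedged sketch with scikit-learn, assuming aligned gold and predicted label lists (e.g. the test_list and pred_list returned by classifier.predict in the next section):

from sklearn.metrics import precision_recall_fscore_support

# Per-label precision/recall/F1 restricted to the labels of interest
p, r, f1, _ = precision_recall_fscore_support(
    test_list, pred_list, labels=['IS', 'IE'], zero_division=0)
for label, pi, ri, fi in zip(['IS', 'IE'], p, r, f1):
    print(f"{label}: precision={pi:.3f} recall={ri:.3f} f1={fi:.3f}")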

Prediction

Make predictions using a pre-trained model stored at a given path:

from data_reader import csv_reader
from train import Classifier

test_dataset = csv_reader("data/test.csv")
classifier = Classifier(dataframe=test_dataset, embeddings_model_type='sentence_transformer', vectorizer_path=None)
pred_list, test_list = classifier.predict(model_path='models/svm.pkl')
print(pred_list)

Download Embeddings

This directory stores all your pre-trained or fine-tuned embedding models. We also suggest storing your trained models here (.pkl (pickle), .npy (NumPy), and .h5 (HDF5) are good formats for storing trained models); a pickle example is sketched below.
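
A generic sketch of saving and reloading a fitted model with pickle (trained_model is a placeholder for any fitted estimator, such as the SVM trained above):

import pickle

# Persist a fitted model to disk (trained_model: placeholder for any fitted estimator)
with open('models/svm.pkl', 'wb') as f:
    pickle.dump(trained_model, f)

# Restore it later for prediction
with open('models/svm.pkl', 'rb') as f:
    restored_model = pickle.load(f)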

There are two sources from which you can download the GloVe embeddings:

If you are interested in converting GloVe to word2vec format, a good resource is https://radimrehurek.com/gensim/scripts/glove2word2vec.html .
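
A minimal sketch of that conversion (this script ships with gensim 3.x; in gensim 4+ you can instead load a GloVe file directly via KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)). The file names below assume the standard glove.6B download:

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Prepend the word2vec header (vocab size, dimensionality) to the GloVe file
glove2word2vec('glove.6B.100d.txt', 'glove.6B.100d.word2vec.txt')

# Load the converted vectors and sanity-check them
vectors = KeyedVectors.load_word2vec_format('glove.6B.100d.word2vec.txt')
print(vectors.most_similar('happy'))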

Download Word2Vec Embeddings: https://radimrehurek.com/gensim/models/word2vec.html

One Stop Shops for Embeddings:

For issues, please email: mgaur@email.sc.edu
