This document explains how to reproduce the results obtained with the RDD corpus on the task of detecting relationships between rare diseases and disabilities [2].
This repository is organized as follows:
```
.
├── src
│   ├── data
│   │   ├── files
│   │   │   └── train-todo.txt (text file with the corpus)
│   │   ├── pkl (intermediate directory with the preprocessed information)
│   │   │   └── *.pkl.gz
│   │   └── configurations
│   │       └── configuration_*.json
│   ├── RDD-CNN.py (Python code for creating and training the model)
│   └── preprocess.py (script for corpus preprocessing)
├── trained-model (or build)
│   ├── RDD-CNN.h5
│   ├── predictions.txt
│   └── real.txt
├── Readme.md
└── requirements.txt
```
To carry out this experiment we used the relationships file provided with the RDD corpus. This file contains annotations of the relationships between disabilities and rare diseases that appear in the corpus sentences. Each line has the following format:
- The first column is the label, which may be:
    - `rd-disab`: the sentence expresses a relationship between a disability and a rare disease.
    - `none`: the sentence does not clearly and unambiguously express such a relationship.
- The second column is the token position at which the rare disease mention begins.
- The third column is the token position at which the disability mention begins.
- The last column is the tokenized sentence.
Two example lines are shown below:
```
rd-disab 0 11 Leber congenital amaurosis LCA is one of the main causes of childhood blindness .
rd-disab 0 12 Fragile X syndrome FXS is the most frequent inherited form of human mental retardation .
```
This file is located at `src/data/files/train-todo.txt`.
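For reference, a minimal parsing sketch for this file (it assumes whitespace-separated columns, as in the examples above; `parse_line` is our own illustrative helper, not part of the repository):

```python
def parse_line(line):
    """Split one train-todo.txt line into its four fields."""
    label, rd_pos, dis_pos, sentence = line.split(None, 3)
    return label, int(rd_pos), int(dis_pos), sentence.split()

label, rd_pos, dis_pos, tokens = parse_line(
    "rd-disab 0 11 Leber congenital amaurosis LCA is one of the main causes of childhood blindness .")
print(label, tokens[rd_pos], tokens[dis_pos])  # first token of each entity mention
```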
This experiment uses the Dependency-Based Word Embeddings of Levy and Goldberg [1]. Download and unpack them as follows:
```bash
# The relative data/ paths suggest these commands are run from src/
curl http://u.cs.biu.ac.il/~yogo/data/syntemb/deps.words.bz2 --output deps.words.bz2
bzip2 -d deps.words.bz2
rm -rf data/levy_word_emb && mkdir data/levy_word_emb
mv deps.words data/levy_word_emb
```
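The downloaded `deps.words` is a plain-text file. A minimal loading sketch, assuming each line holds a word followed by the components of its vector (the usual layout of such embedding files):

```python
import numpy as np

# Build a word -> vector dictionary from the plain-text embedding file
embeddings = {}
with open("data/levy_word_emb/deps.words") as emb_file:
    for line in emb_file:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
```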
To run the experiment:
```bash
# preprocess.py must run first: it writes the intermediate files under
# data/pkl/ that RDD-CNN.py reads
python preprocess.py
python RDD-CNN.py
```
The model shown here has the following configuration:
```python
experiment_configuration = {
    "batch_size": 128,
    "nb_filter": 100,
    "filter_length": 3,
    "hidden_dims": 100,
    "nb_epoch": 20,
    "position_dims": 50,
    "seed": 1578
}
```
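The repository tree above also lists `src/data/configurations/configuration_*.json`. Assuming those files store this dictionary as JSON (an assumption on our part; only the file pattern is given), a configuration could be loaded with a sketch like this:

```python
import glob
import json

# Sketch: pick one of the configuration_*.json files listed in the tree
# above; that they hold the experiment dictionary as JSON is our assumption
config_path = sorted(glob.glob("src/data/configurations/configuration_*.json"))[0]
with open(config_path) as config_file:
    experiment_configuration = json.load(config_file)
```

With this configuration, the trained network has the following summary: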
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
merge_10 (Merge)             (None, 88, 500)           0
_________________________________________________________________
conv1d_19 (Conv1D)           (None, 88, 100)           150100
_________________________________________________________________
global_max_pooling1d_19 (Glo (None, 100)               0
_________________________________________________________________
reshape_10 (Reshape)         (None, 10, 10)            0
_________________________________________________________________
conv1d_20 (Conv1D)           (None, 8, 100)            3100
_________________________________________________________________
global_max_pooling1d_20 (Glo (None, 100)               0
_________________________________________________________________
dropout_10 (Dropout)         (None, 100)               0
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 202
=================================================================
Total params: 160,502
Trainable params: 159,702
Non-trainable params: 800
_________________________________________________________________
```
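The same summary can be printed from the shipped weights file (assuming a Keras version compatible with the one used for training):

```python
from keras.models import load_model

# Load the trained model shipped with the repository and print its layers
model = load_model('trained-model/RDD-CNN.h5')
model.summary()
```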
The following set of instructions classifies a new sentence with the pre-trained model:
```python
from keras.preprocessing import sequence
from keras.models import load_model
import gzip
import cPickle as pkl  # Python 2; on Python 3 use "import pickle as pkl"
import numpy as np

# Load the pre-trained model
model = load_model('trained-model/RDD-CNN.h5')

# Load the dictionaries produced by preprocess.py.
# posToValue and distanceMapping must also be available from the
# preprocessing step before running this snippet.
word2Idx = pkl.load(gzip.open('src/data/pkl/utils.pkl.gz', 'rb'))

sentence = "Autism in CHARGE association may represent a neuro endocrine dysfunction ."
pos_dis = 0  # AUTISM
pos_rd = 2   # CHARGE

# Translate tokens and entity positions with the dictionaries.
# 88 is the maximum sentence length allowed by the experiment.
tokens = sentence.split(" ")
positionValues1 = [posToValue(idx, pos_dis, distanceMapping, -30) for idx in xrange(0, len(tokens))]
positionValues2 = [posToValue(idx, pos_rd, distanceMapping, -30) for idx in xrange(0, len(tokens))]
trad_tokens = [word2Idx[x] if x in word2Idx else word2Idx["UNKNOWN"] for x in tokens]
trad_tokens = sequence.pad_sequences([trad_tokens], maxlen=88, padding='post', value=word2Idx['PADDING'])
positionValues1 = sequence.pad_sequences([positionValues1], maxlen=88, padding='post', value=distanceMapping['PADDING'])
positionValues2 = sequence.pad_sequences([positionValues2], maxlen=88, padding='post', value=distanceMapping['PADDING'])

# Predict and map the argmax over the two output classes back to a label
rel = np.argmax(model.predict([trad_tokens, positionValues1, positionValues2], verbose=0), axis=1)[0]
print(sentence, "rd-disab" if rel == 1 else "none")
```
```
Out[57]:
('Autism in CHARGE association may represent a neuro endocrine dysfunction .', 'rd-disab')
```
We evaluated our model using 10-fold cross-validation.
| Evaluation 1 | Precision | Recall | F-measure |
|---|---|---|---|
| CNN-W+POS | 75.76 | 75.90 | 75.57 |
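These figures can be recomputed from the shipped output files, assuming `trained-model/predictions.txt` and `trained-model/real.txt` contain one label per line in the same order (the file layout and the macro averaging are our assumptions):

```python
from sklearn.metrics import precision_recall_fscore_support

# Assumption: one label per line, predictions and gold labels aligned
with open('trained-model/predictions.txt') as pred_file:
    predicted = [line.strip() for line in pred_file]
with open('trained-model/real.txt') as real_file:
    real = [line.strip() for line in real_file]

p, r, f = precision_recall_fscore_support(real, predicted, average='macro')[:3]
print("P=%.2f R=%.2f F=%.2f" % (100 * p, 100 * r, 100 * f))
```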
[1] Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems (pp. 2177-2185).

[2] Fabregat, H., Araujo, L., & Martinez-Romo, J. (2018). Deep neural models for extracting entities and relationships in the new RDD corpus relating disabilities and rare diseases. (In revision)