This repository contains the files and information for step 3 of the Kaphta Architecture: Indexing of Extracted Information. In this stage, the PubMed abstracts annotated in the previous step (Information Extraction) are indexed. Two kinds of indexation are performed, both in R: individual and cross. The individual indexations cover polyphenol, cancer and gene entities; the cross indexations cover polyphenol-cancer and polyphenol-gene entity associations. The files and results of this stage are listed below.
For more information about this and the other steps of the Kaphta Architecture, see the sections of the Kaphta Web Tool, available at https://portal.ifsuldeminas.edu.br/kaphtawebtool/.
- indexing-information-extracted-gh.R: R script for the individual and cross indexation of information extracted from PubMed abstracts about polyphenol anticancer activity, using an inverted index.
- functions.R: script with auxiliary functions. Save this file in the same folder as the indexing-information-extracted-gh.R script, which needs it to run.
- db_total_project.db: SQLite database needed to run all R scripts of the Kaphta Architecture steps. It contains tables with the entity dictionary, the full textual corpus of PubMed abstracts, and the PubMed abstracts classified as positive in the text classification step. Save this file in the same folder as the indexing-information-extracted-gh.R script, which needs it to run.
- entities-recognized: folder with the files produced by the NER task in the previous stage (Information Extraction step), containing the extracted information about the named entities (polyphenols, cancers and genes) recognized in PubMed abstracts. Save this folder and its files in the same folder as the indexing-information-extracted-gh.R script, which needs them for the indexation task.
- Rule_associations_recognized.rar: compressed file produced by the AR task in the previous stage (Information Extraction step), containing the PubMed abstract sentences in which at least one rule from the rules dictionary was recognized. Save this file in the same folder as the indexing-information-extracted-gh.R script, which needs it for the indexation tasks.
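The inverted-index idea behind both indexations can be illustrated with a minimal base R sketch. It is not the code in indexing-information-extracted-gh.R: the `recognized` data frame, its column names and the helper functions `individual_index` and `cross_index` are hypothetical, and co-occurrence of two entities in the same abstract is used here as a simplified stand-in for the rule-based associations produced by the AR task.

```r
# Hypothetical NER output: one row per (abstract, recognized entity) pair.
# Column names and values are illustrative, not taken from the repository.
recognized <- data.frame(
  pmid   = c("101", "101", "102", "103", "103", "103"),
  entity = c("quercetin", "breast cancer", "quercetin",
             "curcumin", "colon cancer", "breast cancer"),
  type   = c("polyphenol", "cancer", "polyphenol",
             "polyphenol", "cancer", "cancer"),
  stringsAsFactors = FALSE
)

# Individual indexation: map each entity of one type to the sorted,
# de-duplicated PubMed IDs of the abstracts that mention it.
individual_index <- function(df, entity_type) {
  rows <- df[df$type == entity_type, ]
  lapply(split(rows$pmid, rows$entity), function(ids) sort(unique(ids)))
}

# Cross indexation (simplified assumption): pair entities of two types
# that occur in the same abstract, then index abstracts by each pair.
cross_index <- function(df, type_a, type_b) {
  a <- df[df$type == type_a, c("pmid", "entity")]
  b <- df[df$type == type_b, c("pmid", "entity")]
  pairs <- merge(a, b, by = "pmid", suffixes = c("_a", "_b"))
  key <- paste(pairs$entity_a, pairs$entity_b, sep = " | ")
  lapply(split(pairs$pmid, key), function(ids) sort(unique(ids)))
}

polyphenol_index <- individual_index(recognized, "polyphenol")
polyphenol_cancer_index <- cross_index(recognized, "polyphenol", "cancer")

# Look up the abstracts indexed under one entity or one entity pair.
polyphenol_index[["quercetin"]]
polyphenol_cancer_index[["curcumin | colon cancer"]]
```

Keeping the index as a named list makes each lookup a single `[[` call; a real run would build the rows from the entities-recognized files rather than from an inline data frame.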
The files in the results folder, containing the results of the individual and cross indexation of PubMed abstracts, are listed below.
- df_polyphenol_individual_indexation.tsv: TSV file containing a data frame with the indexed PubMed abstracts for polyphenol entities.
- df_cancers_individual_indexation.tsv: TSV file containing a data frame with the indexed PubMed abstracts for cancer entities.
- df_genes_individual_indexation.tsv: TSV file containing a data frame with the indexed PubMed abstracts for gene entities.
- df_cross_indexation_polyphenol_cancer_association.tsv: TSV file containing a data frame with the indexed PubMed abstracts for polyphenol-cancer entity associations.
- df_cross_indexation_cancer_polyphenol_association.tsv: TSV file containing a data frame with the indexed PubMed abstracts for cancer-polyphenol entity associations.
- df_cross_indexation_gene_polyphenol_association.tsv: TSV file containing a data frame with the indexed PubMed abstracts for gene-polyphenol entity associations.
- df_cross_indexation_polyphenol_cancer_association_frequency.tsv: TSV file containing a data frame with the total number of cancers indexed for each polyphenol.
- df_cross_indexation_cancer_polyphenol_association_frequency.tsv: TSV file containing a data frame with the total number of polyphenols indexed for each cancer.
- df_cross_indexation_gene_polyphenol_association_frequency.tsv: TSV file containing a data frame with the total number of polyphenols indexed for each gene.
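As an illustration of how a frequency table like the ones above could be derived and stored as TSV, here is a minimal base R sketch. The `cross` data frame and the column names `polyphenol`, `cancer`, `pmid` and `total_cancers` are assumptions made for the example; the actual column layout of the result files is not documented here and may differ.

```r
# Hypothetical cross-indexation rows; column names are illustrative only.
cross <- data.frame(
  polyphenol = c("quercetin", "quercetin", "quercetin", "curcumin"),
  cancer     = c("breast cancer", "colon cancer", "breast cancer", "colon cancer"),
  pmid       = c("101", "102", "103", "103"),
  stringsAsFactors = FALSE
)

# Count the distinct cancers indexed for each polyphenol: drop repeated
# (polyphenol, cancer) pairs first, then tabulate by polyphenol.
distinct_pairs <- unique(cross[, c("polyphenol", "cancer")])
frequency <- as.data.frame(
  table(polyphenol = distinct_pairs$polyphenol),
  responseName = "total_cancers",
  stringsAsFactors = FALSE
)

# Round-trip through a tab-separated file, written to a temporary path
# so the sketch does not touch the repository's results folder.
tsv_path <- tempfile(fileext = ".tsv")
write.table(frequency, tsv_path, sep = "\t", quote = FALSE, row.names = FALSE)
reloaded <- read.delim(tsv_path, stringsAsFactors = FALSE)
reloaded
```

The same `read.delim` call, pointed at a file in the results folder, should load any of the listed TSV files as a data frame.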