eurovoc_adding_version2.Rmd

---
title: "R Notebook"
output:
  pdf_document: default
  pdf_notebook: default
---

#Prend le résultat d'un topic modeling de Mathias ajoute les eurovocs

```{r, message=FALSE}
library(readr)
library(tidyverse)
library(data.table)
source("get_fingerprint.R") #imitation du fingerprint d'Open Refine
```

#on importe les labels de Mathias
```{r}
topic_modeling_result <- file.choose()
```


```{r}

topic_labels <- fread(topic_modeling_result,
                          header= FALSE,
                          encoding = "UTF-8")
topic_labels
```
#on inverse les colonnes SI le top term se trouve à la fin, et on élimine les lignes de chiffres


```{r}

topic_labels <-
  topic_labels %>% 
  filter(!is.na(V1)) %>% 
  select(num_range("V", c(1, 12:3)))

topic_labels

```


# On transpose verticalement. (réorganiser topics sous forme de DB, afin de minimiser les matchings ?)
```{r test_1}
labels_tidy <- 
  melt(topic_labels, id.vars = c("V1"),
                    measure.vars = 2:ncol(topic_labels)) %>% 
  select(topic_id=V1, value)

labels_tidy
```

# on ajoute un numéro de tokens et de topics afin de conserver leur ordre

```{r}

labels_tidy <- 
  labels_tidy %>% 
  group_by(topic_id) %>% 
  mutate(token_id = row_number(), topic_number = stringr::str_extract(topic_id, "\\d+")) %>% 
  ungroup()
labels_tidy
```

  
#on importe le thésaurus nettoyé
```{r}
thesaurus <-
  fread("eurovoc_thesaurus_complet.csv", encoding="UTF-8")


```


#jointure des unigrammes

```{r}
labels_tidy_joined_unigrams <- labels_tidy %>%
  fuzzyjoin::stringdist_left_join(thesaurus,
                                  by = c(value = "fingerprint_cleaned"),
                                  distance_col = "dist", 
                                  max_dist = 0.01,
                                  method='jw', p=0.1) %>% 
  arrange(as.integer(topic_id), term) %>% 
  filter(!is.na(term)) %>% 
  select(topic_id, tokens = value, term, dist, french, length, profondeur, category, link, bt1, mt, domain)

head(labels_tidy_joined_unigrams)

```


#création des bigrammes, fingerprint et numérotation dans l'ordre 

```{r}
bigrammes <-
  melt(setDT(labels_tidy), 
       id.var = c("topic_id"))[, combn(value, 2, FUN = paste, collapse = ' '), topic_id]


```

```{r}
bigrammes <- bigrammes[,.(topic_id, V1, fingerprint = get_fingerprint(V1))]

setnames(bigrammes, "V1", "value")
```


#Fuzzy join sur les bigrammes (prend de 3 à 20 minutes et BEAUCOUP de RAM avec 17 000 bigrammes)
```{r}
start.time <- Sys.time()
labels_tidy_joined_bigrams <- bigrammes %>%
  fuzzyjoin::stringdist_inner_join(thesaurus,
                                   by = c(fingerprint = "fingerprint_cleaned"),
                                   distance_col = "dist",
                                   max_dist=0,
                                   ignore_case = TRUE,
                                   method = "lv") %>% 
  select(topic_id, tokens = value, term, dist, french, length, profondeur, category, link, bt1, mt, domain)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

labels_tidy_joined_bigrams

View(labels_tidy_joined_bigrams)

```


#On merge les unigrams et bigrammes matchés

```{r}
merged <- rbind(labels_tidy_joined_unigrams, 
                labels_tidy_joined_bigrams) 

merged 

View(merged)
```

#Export en csv
```{r}
fwrite(merged, file = "merged_mathias2.csv")
```


#Je me suis arrêté ici pour cette partie
-----------------------------------------------------------------------------------

#On importe la matrice de topic par documents transposée dans refine pour gagner du temps

```{r}
library(readr)
doc.topics <- read_csv("DATA_EVAL/NewEnglish_250_composition_transposed.csv")
View(doc.topics)
```


#on transpose doc.topic
```{r}
doc.topics_t <- 
  doc.topics %>% 
  select(1:16) %>% 
  tibble::rownames_to_column("fichier") %>% 
  gather(topic_id, proportion, 3:16)
```


final <- 
  doc.topics_t %>% 
  left_join(merged)

#On retente de faire la somme cumulée des proportions de chaque token
somme_cumulee <- 
  final %>% 
  group_by(fichier, term) %>% 
  summarise(somme = sum(proportion)*100) %>%
  mutate(file_id=stringr::str_extract(""))
  filter(somme >= 5)
  
fwrite(somme_cumulee, "somme_cumulee.csv")

#il faut maintenant comparer avec les eurovocs manuels
jrc_metadata <- fread("~/eurovoc_topicmodeling/jrc_acquis_metadata.csv") %>% 
  select(-body)

fwrite(jrc_metadata, "jrc_metatada_sansbody.csv")