-
Notifications
You must be signed in to change notification settings - Fork 0
/
Introduction.tex
81 lines (51 loc) · 10.9 KB
/
Introduction.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
\chapter{Introduction}
\paragraph{}
With the diffusion of Internet, the Web 2.0, and the mobile devices, the amount of generated data grows exponentially. A large amount of this data is user generated content. As shown by \href{http://www.internetlivestats.com/}{www.internetlivestats.com} each day 500 million posts on Twitter platform and 5 million blog posts are written. The main problem with these articles and posts is that they are expressed in natural language, this means that we cannot use them as information without processing.
The field of computer science that tries to generate information from unstructured data is called Information Extraction (IE). In most cases IE concerns of processing human language text using Natural Language Processing techniques, which is a computer science and computational linguistic field that uses statistical methods to process human language. IE includes many task, in this thesis we discuss about Named-Entity Recognition (NER) and Named-Entity Linking (NEL). NER is the task of identifying real world objects in a document (formally named entities), for example in the following phrase ``Paris is a beautiful city", a NER system should be able to recognize the named entity ``Paris". NEL tries to disambiguate the identity of a named entity using a link to a knowledge base, using the previous example, the entity ``Paris" should be linked to ``Paris (city)" and not "Paris Hilton" the American model. These tasks are mostly used under the term Named-Entity rEcognition and Linking (NEEL).
Named entity linking is important in many fields of Computer Science, it can be used by search engines for resolving ambiguous entities in indexed documents or in queries as in 70\% of cases they contain entities~\cite{pound2010ad}. NEL systems can also be used in combination with other Artificial Intelligence systems like Sentiment Analysis tools for the generation of statistical data that describe users liking of companies, politicians, and so on. Most of the time the source for these data are micro-blogging platforms like Twitter. This introduces various problems, tweets (messages sent using Twitter) are restricted to a maximum of 140 characters reducing the context information for an entity. Moreover, this limitation also induce the user to use abbreviations, slangs, and informal expressions increasing the difficulty compared to a well-formed document.
As catalyst for the development of NEEL systems specialized on microblog posts, each year since 2011, there is a workshop on \textit{Making Sense of Micropost} focused on English tweets. From this year, also the \textit{Italian Computational Linguistic Association} showed its interest in ht NEEL task organizing a challenge specific for the Italian language called \textit{NEEl-IT}. The challenge was par of the workshop \textit{EVALITA}, where we presented our solution.
In this work we propose a NEL system based on Word Embeddings, a neural network language model that creates continuous vector representation of word. The majority of the state-of-the-art solutions makes use of the similarity measure based on the occurence or frequencies of word (e.g. hamming distance, character Dice score, etc.) to find resource in the knowledge base related to each named entity. However, these approaches can be inaccurate, especially when dealing with microblog posts rich of misspellings, abbreviations, nicknames and other noisy forms of text. The motivation behind the use of a Word Embeddings representation is the exploitation of a high-level similarity measure between the named entities and the resources in the knowledge base. The resulting meaningful and dense representation, where words and entities are represented in a different way, will be able to capture both the syntactic and semantic connotations of a named entity, overcoming the aforementioned issues.
\paragraph{}
This work is organized through the following chapters: in Chapter 2 we will present the Information Extraction field; in Chapter 3 we take an overview of the current state-of-the-art of NEL approaches. In Chapter 4 we describe the idea of Word Embeddings and present our solution; in Chapter 5 we analyze the performance achieved by our system on English and Italian datasets. Finally in Chapter 6 are presented the conclusion of our work and some future improvements for our system.
\paragraph{}
This work was featured in a paper \cite{cecchiniunimib}: F. M. Cecchini, E. Fersini, P. Manchanda, E. Messina, D. Nozza, M. Palmonari, C. Sas. “UNIMIB@NEEL-IT: Named Entity Recognition and Linking of Italian Tweets”. In Proceedings of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2016).
\chapter{Information Extraction}
Information Extraction (IE) is a form of natural language analysis, which process unstructured or semi-structured text for the extraction of unambiguous information. IE can be separated in many sub-tasks based on the data required to process. The main sub-tasks are:
\begin{itemize}[itemsep = 0.1em]
\item Named-Entity rEcognition and Linking (NEEL): which is composed by a Named-Entity Recognition (NER) system and a Named-Entity Linking (NEL) system. A named entity is a real-world object such as persons, locations, organizations, products, etc., that can be denoted with a proper name. The NER goal is to find all the named entity contained in the text. The NEL takes the named entities and links them to the knowledge base (KB).
\item Relationship extraction: where the aim is to find the relations between entities in a phrase. More formally, the task of relationship extraction can be defined as a classification problem where given two entities \(e_1\)and \(e_2\) and a phrase \(s\) the classifier returns a label \(r\) describing the relation between the two entities (formally \((e_1, e_2, s)\rightarrow r\)). For example from the sentence ``Elon lives in California'' we can extract ``PERSON \textit{located} in LOCATION".
\item Terminology extraction: which consist in finding the relevant terms in a text or more generally in a corpus. This can be used for creating a domain knowledge based on the terms that describe the domain.
\end{itemize}
\section{Named-Entity rEcognition and Linking}
\subsection{Named-Entity Recognition}
\paragraph{}
Named-Entity Recognition is a critical IE task, as it identifies which terms in a text are mentions of entities in the real world.
As we see in the Figure \ref{fig:ner_io}, the NER takes a text (a tweet in our case) as an input and returns the entities with the corresponding type. The NER is also a pre-requisite not only for NEL but also for other IE task, including co-reference resolution, and relation extraction.
\begin{figure}[ht]
\includegraphics[width = \textwidth]{material_ner.jpg}
\caption{Example of a NER I/O}
\label{fig:ner_io}
\end{figure}
\vspace{-10pt}
\paragraph{}
As we mentioned before user generated content are one of the biggest source of data, specifically if posted on microblogging platforms like Facebook and Twitter. Early experiments have showed this particular type of contents to be extremely challenging for state-of-the-art algorithms of IE~\cite{derczynski2013microblog}. For instance, NER methods typically have around 90\% of accuracy on longer texts, but only 35-50\% over microblog post like tweets. The reasons of this difficulty can be summarized as follows:
\begin{itemize}[itemsep = 0.1em]
\item Shortness of microblogs: the max length of a tweet is 140 characters and this makes them hard to interpret as there is no context to derive from the text.
\item Capitalization of the words: in a tweet, or any other microblog post, the capitalization of words may be ignored for increasing the speed of writing. The user could also deliberately overdo them with the intent of adding more emphasis to the message. This is a problem as some words change their meaning based on the capitalization of the letters. For example the words ``trump" and ``Trump" in a tweet could both be refereed to the president of the USA, but for a musician the first one might refer to a musical instrument and the second one to the person.
\item Word Typos: in natural language, especially when writing fast and the most important thing is the idea and not the grammar, is easy to commit typos. Particulary microblogs posts have 2.5 times more typos than a well-formed text~\cite{derczynski2015analysis}. However, as for the capitalization, a typo could significantly change the meaning of a word.
\item Abbreviations: given the limit on the number of characters, users tend to use abbreviations and slang in order to write more expressive messages in the same amount of space. This makes more difficult to associate a word with a concept if there is not enough knowledge. For example in the following sentence "Donald Trump is the new POTUS" is required the knowledge of the meaning of the word ``POTUS" which stands for ``President of the United States".
\item Emotions: the meaning of a sentence can be drastically changed by an emoji. If in this sentence "I love Paris" we add the French flag emoji the NER will label it as ``Location" otherwise it could recognize it as ``Person".
\end{itemize}
\paragraph{}
To overcome these problems the researchers focused on specific NERs for microblogs posts. For example Ritter et al.~\cite{ritter2011named} proposed a NER for Twitter developing a domain specific part-of-speech tagger and trained a support vector machine to identify if the capitalization of the tweet was informative or not.
\subsection{Named-Entity Linking}
\paragraph{}
Named-Entity Linking is the task of determining the identity of entities mentioned in the text. Sometimes is also called Named-Entity Disambiguation, and also Named-Entity Normalization. It typically requires annotating a potentially ambiguous entity mention with a link to a canonical identifier, the knowledge base, describing the entity. A popular choice for the knowledge base is Wikipedia, in which each page is considered a named entity. We can see an example of a NEL at Figure \ref{fig:nel_io}
The problems we previously introduced for the NER are also valid for the entity linking problem. Also named entity mention can be highly ambiguous. For example, given the sentence ``Paris is the capital of France" and the entity ``Paris", the idea is to determine that ``Paris" is referred to the city and not to ``Paris Hilton" the American celebrity. Another NEL problem is represented by the fact that the same entity could have multiple surface forms. An example could be the named entity ``USA" also refereed as ``America", ``US", ``United State of America", and ``United States". We also need to keep in mind that some words recognized by the NER might not have a corresponding description in the KB, we refer to these words as Out of Vocabulary (OOV). These problems needs to be addressed by any NEL as they are very common and could decrease the overall performance.
\vspace{-10pt}
\begin{figure}[ht]
\includegraphics[width = \textwidth]{material_nel.jpg}
\caption{Example of a NEL I/O}
\label{fig:nel_io}
\end{figure}
\vspace{-10pt}