From e58d61dc0d47183054203dd1fd8defbf63f0f898 Mon Sep 17 00:00:00 2001 From: Hank Date: Thu, 9 Aug 2018 23:28:26 +0800 Subject: [PATCH] =?UTF-8?q?=E8=87=AA=E7=84=B6=E8=AF=AD=E8=A8=80=E5=A4=84?= =?UTF-8?q?=E7=90=86=E7=9C=9F=E6=98=AF=E6=9C=89=E8=B6=A3=EF=BC=81=20(#4238?= =?UTF-8?q?)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Update natural-language-processing-is-fun.md 自然语言处理真是有趣! * 根据校对信息进行修改 * 根据校对信息进行修改 * 根据校对信息进行修改 * 根据校对意见修改 --- TODO1/natural-language-processing-is-fun.md | 298 ++++++++++---------- 1 file changed, 149 insertions(+), 149 deletions(-) diff --git a/TODO1/natural-language-processing-is-fun.md b/TODO1/natural-language-processing-is-fun.md index 75f3f394a6b..2d0e247e311 100644 --- a/TODO1/natural-language-processing-is-fun.md +++ b/TODO1/natural-language-processing-is-fun.md @@ -2,275 +2,275 @@ > * 原文作者:[Adam Geitgey](https://medium.com/@ageitgey?source=post_header_lockup) > * 译文出自:[掘金翻译计划](https://github.com/xitu/gold-miner) > * 本文永久链接:[https://github.com/xitu/gold-miner/blob/master/TODO1/natural-language-processing-is-fun.md](https://github.com/xitu/gold-miner/blob/master/TODO1/natural-language-processing-is-fun.md) -> * 译者: -> * 校对者: +> * 译者:[lihanxiang](https://github.com/lihanxiang) +> * 校对者:[FesonX](https://github.com/FesonX)、[leviding](https://github.com/leviding)、[sakila1012](https://github.com/sakila1012) -# Natural Language Processing is Fun! +# 自然语言处理真是有趣! -## How computers understand Human Language +## 计算机如何理解人类的语言 -Computers are great at working with structured data like spreadsheets and database tables. But us humans usually communicate in words, not in tables. That’s unfortunate for computers. +计算机擅长处理结构化的数据,像电子表格和数据库表之类的。但是我们人类的日常沟通是用词汇来表达的,而不是表格,对计算机而言,这真是件棘手的事。 ![](https://cdn-images-1.medium.com/max/800/1*r54-L9t14gqTbI6IlrW2wA.png) -Unfortunately we don’t live in this alternate version of history where all data is _structured_. +遗憾的是,我们并不是生活在处处都是**结构化**数据的时代。 -A lot of information in the world is unstructured — raw text in English or another human language. How can we get a computer to understand unstructured text and extract data from it? +这个世界上的许多信息都是非结构化的 —— 不仅仅是英语或者其他人类语言的原始文本。我们该如何让一台计算机去理解这些非结构化的文本并且从中提取信息呢? ![](https://cdn-images-1.medium.com/max/1000/1*CtR2lIHDkhB9M8Jt4irSyg.gif) -_Natural Language Processing_, or _NLP_, is the sub-field of AI that is focused on enabling computers to understand and process human languages. Let’s check out how NLP works and learn how to write programs that can extract information out of raw text using Python! +**自然语言处理**,简称 **NLP**,是人工智能领域的一个子集,目的是为了让计算机理解并处理人类语言。让我们来看看 NLP 是如何工作的,并且学习一下如何用 Python 写出能够从原始文本中提取信息的程序。 -_Note: If you don’t care how NLP works and just want to cut and paste some code, skip way down to the section called “Coding the NLP Pipeline in Python”._ +**注意:如果你不关心 NLP 是如何工作的,只想剪切和粘贴一些代码,直接跳过至“用 Python 处理 NLP 管道”部分。** -### Can Computers Understand Language? +### 计算机能理解语言吗? -As long as computers have been around, programmers have been trying to write programs that understand languages like English. The reason is pretty obvious — humans have been writing things down for thousands of years and it would be really helpful if a computer could read and understand all that data. +自从计算机诞生以来,程序员们就一直尝试去写出能够理解像英语这样的语言的程序。这其中的原因显而易见 —— 几千年来,人类都是用写的方式来记录事件,如果计算机能够读取并理解这些数据将会对人类大有好处。 -Computers can’t yet truly understand English in the way that humans do — but they can already do a lot! In certain limited areas, what you can do with NLP already seems like magic. 
You might be able to save a lot of time by applying NLP techniques to your own projects. +目前,计算机还不能像人类那样完全了解英语 —— 但它们已经能做许多事了!在某些特定领域,你能用 NLP 做到的事看上去就像魔法一样。将 NLP 技术应用到你的项目上能够为你节约大量时间。 -And even better, the latest advances in NLP are easily accessible through open source Python libraries like [spaCy](https://spacy.io/), [textacy](http://textacy.readthedocs.io/en/latest/), and [neuralcoref](https://github.com/huggingface/neuralcoref). What you can do with just a few lines of python is amazing. +更好的是,在 NLP 方面取得的最新进展就是可以轻松地通过开源的 Python 库比如 [spaCy](https://spacy.io/)、[textacy](http://textacy.readthedocs.io/en/latest/) 和 [neuralcoref](https://github.com/huggingface/neuralcoref) 来进行使用。你需要做的只是写几行代码。 -### Extracting Meaning from Text is Hard +### 从文本中提取含义是很难的 -The process of reading and understanding English is very complex — and that’s not even considering that English doesn’t follow logical and consistent rules. For example, what does this news headline mean? +读取和理解英语的过程是很复杂的 —— 即使在不考虑英语中的逻辑性和一致性的情况下。比如,这个新闻的标题是什么意思呢? -> “Environmental regulators grill business owner over illegal coal fires.” +> 环境监管机构盘问了非法烧烤的业主。(“Environmental regulators grill business owner over illegal coal fires.”) -Are the regulators questioning a business owner about burning coal illegally? Or are the regulators literally cooking the business owner? As you can see, parsing English with a computer is going to be complicated. +环境监管机构就非法燃烧煤炭问题对业主进行了询问?或者按照字面意思,监管机构把业主烤了?正如你所见,用计算机来解析英语是非常复杂的一件事。 -Doing anything complicated in machine learning usually means _building a pipeline_. The idea is to break up your problem into very small pieces and then use machine learning to solve each smaller piece separately. Then by chaining together several machine learning models that feed into each other, you can do very complicated things. +在机器学习中做一件复杂的事通常意味着**建一条管道**。这个办法就是将你的问题分成细小的部分,然后用机器学习来单独解决每一个细小的部分。再将多个相互补充的机器学习模型进行链接,这样你就能搞定非常复杂的事。 -And that’s exactly the strategy we are going to use for NLP. We’ll break down the process of understanding English into small chunks and see how each one works. +而且这正是我们将要对 NLP 所使用的策略。我们将理解英语的过程分解为多个小块,并观察每个小块是如何工作的。 -### Building an NLP Pipeline, Step-by-Step +### 一步步构建 NLP 管道 -Let’s look at a piece of text from Wikipedia: +让我们看一段来自维基百科的文字: -> London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium. +> 伦敦是英格兰首都,也是英国的人口最稠密的城市。伦敦位于英国大不列颠岛东南部泰晤士河畔,两千年来一直是一个主要定居点。它是由罗马人建立的,把它命名为伦蒂尼恩。(London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.) -> (Source: [Wikipedia article “London”](https://en.wikipedia.org/wiki/London)) +> (来源:[维基百科“伦敦”](https://zh.wikipedia.org/zh-cn/%E4%BC%A6%E6%95%A6)) -This paragraph contains several useful facts. It would be great if a computer could read this text and understand that London is a city, London is located in England, London was settled by Romans and so on. But to get there, we have to first teach our computer the most basic concepts of written language and then move up from there. 
+这段文字包含了几个有用的信息。如果电脑能够阅读这段文字并且理解伦敦是一个由罗马人建立的,位于英国的城市等等,那就最好不过了。但是要达到这个要求,我们需要先将书面语言的最基本概念传授给电脑,然后再逐步深入。

-#### Step 1: Sentence Segmentation
+#### 第一步:语句分割

-The first step in the pipeline is to break the text apart into separate sentences. That gives us this:
+在管道中所要做的第一件事就是将这段文字分割成独立的句子,由此我们可以得到:

-1. “London is the capital and most populous city of England and the United Kingdom.”
-2. “Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia.”
-3. “It was founded by the Romans, who named it Londinium.”
+1. “伦敦是英格兰首都,也是英国的人口最稠密的城市。(London is the capital and most populous city of England and the United Kingdom.)”
+2. “伦敦位于英国大不列颠岛东南部泰晤士河畔,两千年来一直是一个主要定居点。(Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia.)”
+3. “它是由罗马人建立的,把它命名为伦蒂尼恩。(It was founded by the Romans, who named it Londinium.)”

-We can assume that each sentence in English is a separate thought or idea. It will be a lot easier to write a program to understand a single sentence than to understand a whole paragraph.
+我们可以假设英语中的每一个句子都表达一个独立的想法。相比编写能理解整个段落的程序,编写能理解单个句子的程序要容易得多。

-Coding a Sentence Segmentation model can be as simple as splitting apart sentences whenever you see a punctuation mark. But modern NLP pipelines often use more complex techniques that work even when a document isn’t formatted cleanly.
+编写一个语句分割模型可以简单到只要遇到标点符号就进行分割。但是现代 NLP 管道通常会使用更复杂的技术,以应对文档排版不整齐的情况。
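
想直观感受一下语句分割,可以参考下面这个补充性的最小示例(并非原文代码,假设你已经按照后文「用 Python 来构建 NLP 管道」一节安装好了 spaCy 和 en_core_web_lg 模型):

```
import spacy

nlp = spacy.load('en_core_web_lg')

text = """London is the capital and most populous city of England and the United Kingdom.
Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.
"""

# doc.sents 会给出切分好的句子
for sentence in nlp(text).sents:
    print(sentence.text.strip())
```
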
-#### Step 2: Word Tokenization
+#### 第二步:分词

-Now that we’ve split our document into sentences, we can process them one at a time. Let’s start with the first sentence from our document:
+现在我们已经把文档分割成了句子,接下来一次处理一个句子。让我们从文档中的第一个句子开始:

> “London is the capital and most populous city of England and the United Kingdom.”

-The next step in our pipeline is to break this sentence into separate words or _tokens_. This is called _tokenization_. This is the result:
+管道中的下一步就是将这个句子分割成独立的词语,即**标记(token)**。这个过程就称作**分词**。接下来看看对这个句子分词的结果:

> “London”, “is”, “ the”, “capital”, “and”, “most”, “populous”, “city”, “of”, “England”, “and”, “the”, “United”, “Kingdom”, “.”

-Tokenization is easy to do in English. We’ll just split apart words whenever there’s a space between them. And we’ll also treat punctuation marks as separate tokens since punctuation also has meaning.
+**分词**在英语中是容易完成的。我们只要在有空格的地方把词语分开即可。我们也会把标点符号当作单独的标记,因为它们同样具有含义。

-#### Step 3: Predicting Parts of Speech for Each Token
+#### 第三步:预测每个标记的词性

-Next, we’ll look at each token and try to guess its part of speech — whether it is a noun, a verb, an adjective and so on. Knowing the role of each word in the sentence will help us start to figure out what the sentence is talking about.
+接下来,我们需要猜测每一个词的词性 —— 名词、动词还是形容词等等。知道每个词在句子中所扮演的角色,能够帮助我们弄清这个句子在讲什么。

-We can do this by feeding each word (and some extra words around it for context) into a pre-trained part-of-speech classification model:
+要做到这一点,我们可以将每个词(以及它周围的一些词作为上下文)输入预先训练好的词性分类模型:

![](https://cdn-images-1.medium.com/max/800/1*u7Z1B1TIYe68V8lS2f8GNg.png)

-The part-of-speech model was originally trained by feeding it millions of English sentences with each word’s part of speech already tagged and having it learn to replicate that behavior.
+词性分类模型最初是用数百万个英语句子训练出来的,这些句子中每个词的词性都已被标注,模型以此学会复制这种标注行为。

-Keep in mind that the model is completely based on statistics — it doesn’t actually understand what the words mean in the same way that humans do. It just knows how to guess a part of speech based on similar sentences and words it has seen before.
+记住,这个模型是基于统计数据的 —— 它并不是以和人类一样的方式理解词的含义。它只是依靠之前见过的类似句子和单词,来猜测每个词的词性。

-After processing the whole sentence, we’ll have a result like this:
+在处理完整个句子之后,我们会得出这样的结果:

![](https://cdn-images-1.medium.com/max/1000/1*O0gIbvPd-weZw4IGmA5ywQ.png)

-With this information, we can already start to glean some very basic meaning. For example, we can see that the nouns in the sentence include “London” and “capital”, so the sentence is probably talking about London.
+根据这些信息,我们已经能够开始提炼一些非常基础的含义。比如,这个句子中的名词包括“伦敦”和“首都”,所以这个句子极有可能是与伦敦有关的。

-#### Step 4: Text Lemmatization
+#### 第四步:文本词形还原

-In English (and most languages), words appear in different forms. Look at these two sentences:
+在英语(以及其它大多数语言)中,词语以不同的形式出现。来看看下面这两个句子:

I had a **pony**.

I had two **ponies**.

-Both sentences talk about the noun **pony,** but they are using different inflections. When working with text in a computer, it is helpful to know the base form of each word so that you know that both sentences are talking about the same concept. Otherwise the strings “pony” and “ponies” look like two totally different words to a computer.
+两句话都讲到了名词**小马 (pony)**,但是它们有着不同的词形变化。在用计算机处理文本时,了解每个词的基本形式是很有帮助的,这样你就能知道两句话在讨论同一个概念。否则,“pony” 和 “ponies” 对于电脑来说就像两个完全不相关的词语。

-In NLP, we call finding this process _lemmatization_ — figuring out the most basic form or _lemma_ of each word in the sentence.
+在 NLP 中,我们称这个过程为**词形还原** —— 找出句子中每一个词的最基本的形式或**词元**。

-The same thing applies to verbs. We can also lemmatize verbs by finding their root, unconjugated form. So “**I had two ponies**” becomes “**I [have] two [pony].**”
+对于动词也一样。我们可以通过找出动词未变位的原形来对动词做词形还原。所以 “**I had two ponies**” 变为 “**I [have] two [pony]**”。

-Lemmatization is typically done by having a look-up table of the lemma forms of words based on their part of speech and possibly having some custom rules to handle words that you’ve never seen before.
+词形还原一般是通过一张按词性列出词元形式的查找表来完成的,可能还会加上一些自定义规则来处理之前从未见过的词语。

-Here’s what our sentence looks like after lemmatization adds in the root form of our verb:
+这就是经过词形还原、补上动词原形之后的句子:

![](https://cdn-images-1.medium.com/max/1000/1*EgYJsyjBNk074TQf87_CqA.png)

-The only change we made was turning “is” into “be”.
+唯一变化的地方就是将 “is” 变为 “be”。

-#### Step 5: Identifying Stop Words
+#### 第五步:识别终止词

-Next, we want to consider the importance of a each word in the sentence. English has a lot of filler words that appear very frequently like “and”, “the”, and “a”. When doing statistics on text, these words introduce a lot of noise since they appear way more frequently than other words. Some NLP pipelines will flag them as **stop words** —that is, words that you might want to filter out before doing any statistical analysis.
+接下来,我们需要考虑句子中每个单词的重要性。英语有很多频繁出现的填充词,比如 “and”、“the” 和 “a”。在对文本进行统计的时候,这些词会引入大量噪音,因为它们比其他词出现得频繁得多。一些 NLP 管道会将这些词语标记为“终止词(stop words)” —— 也就是在进行任何统计分析之前,你可能想要过滤掉的词语。

-Here’s how our sentence looks with the stop words grayed out:
+这就是将终止词灰显之后的句子:

![](https://cdn-images-1.medium.com/max/1000/1*Zgq1nK_71AzX1CaknB89Ww.png)

-Stop words are usually identified by just by checking a hardcoded list of known stop words. But there’s no standard list of stop words that is appropriate for all applications. The list of words to ignore can vary depending on your application.
+终止词的识别通常只需查询一个硬编码的已知终止词列表即可完成。但是不存在对所有应用都通用的标准终止词列表。需要忽略哪些词,很大程度上是由你的应用所决定的。

-For example if you are building a rock band search engine, you want to make sure you don’t ignore the word “The”. Because not only does the word “The” appear in a lot of band names, there’s a famous 1980’s rock band called _The The_!
+举个例子,如果你正在建立一个与摇滚乐队有关的搜索引擎,需要确保你没有忽略单词 “The”。不仅是因为这个单词出现在很多乐队名中,而且还有一个 80 年代的著名摇滚乐队叫做 **The The**!
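
顺带一提,上面第二步到第五步所说的分词、词性、词元和终止词信息,都可以直接从 spaCy 的解析结果中读取。下面是一个补充性示例(并非原文代码,沿用上一个示例假设的 spaCy 环境):

```
import spacy

# 加载大型英语模型(安装方式见后文)
nlp = spacy.load('en_core_web_lg')

# 解析示例句子,管道会自动完成分词、词性标注、词形还原等步骤
doc = nlp("London is the capital and most populous city of England and the United Kingdom.")

# 逐个打印:标记原文、词性、词元、是否为终止词
for token in doc:
    print(f"{token.text}\t{token.pos_}\t{token.lemma_}\t{token.is_stop}")
```

运行它,你会看到 “is” 的词元是 “be”,而 “the”、“and” 这类词会被标记为终止词。
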
-#### Step 6: Dependency Parsing
+#### 第六步:依存语法解析

-The next step is to figure out how all the words in our sentence relate to each other. This is called _dependency parsing_.
+下一步就是找出句子中的每一个词之间的依存关系,这就叫做**依存语法解析**。

-The goal is to build a tree that assigns a single **parent** word to each word in the sentence. The root of the tree will be the main verb in the sentence. Here’s what the beginning of the parse tree will look like for our sentence:
+目标就是构建一棵树,为句子中的每一个词指定一个**父级**词语。树的根是句子中的主要动词。根据这个句子构造的解析树的开头就是这个样子:

![](https://cdn-images-1.medium.com/max/800/1*nteaQRxNNSXMlAnT31iXjw.png)

-But we can go one step further. In addition to identifying the parent word of each word, we can also predict the type of relationship that exists between those two words:
+但我们还可以做得更多。除了识别每个词的父级词语之外,我们还可以预测这两个词之间存在的关系类型:

![](https://cdn-images-1.medium.com/max/800/1*onc_4Mnq2L7cetMAowYAbA.png)

-This parse tree shows us that the subject of the sentence is the noun “_London_” and it has a “_be_” relationship with “_capital_”. We finally know something useful — _London_ is a _capital_! And if we followed the complete parse tree for the sentence (beyond what is shown), we would even found out that London is the capital of the _United Kingdom_.
+这棵解析树为我们展示了这个句子的主语是名词**伦敦**,而且它和**首都**之间有着 **be** 关系。我们终于发现了一些有用的信息 —— **伦敦**是一个**首都**!如果我们顺着这个句子的整棵解析树走下去(超出图中所示的部分),甚至能够发现伦敦是**英国**的首都。

-Just like how we predicted parts of speech earlier using a machine learning model, dependency parsing also works by feeding words into a machine learning model and outputting a result. But parsing word dependencies is particularly complex task and would require an entire article to explain in any detail. If you are curious how it works, a great place to start reading is Matthew Honnibal’s excellent article _“_[_Parsing English in 500 Lines of Python_](https://explosion.ai/blog/parsing-english-in-python)_”_.
+就像我们早前使用机器学习模型来预测词性那样,依存语法解析也是将词语输入机器学习模型并输出结果。但是解析词语间的依存关系是一项特别复杂的任务,想要稍微详细地解释清楚,就需要一整篇文章的篇幅。如果你很好奇它是如何工作的,Matthew Honnibal 的优秀文章 **“**[**用 500 行 Python 代码来解析英语** (**Parsing English in 500 Lines of Python**)](https://explosion.ai/blog/parsing-english-in-python)**”** 是一个很好的起点。

-But despite a note from the author in 2015 saying that this approach is now standard, it’s actually out of date and not even used by the author anymore. In 2016, Google released a new dependency parser called _Parsey McParseface_ which outperformed previous benchmarks using a new deep learning approach which quickly spread throughout the industry. Then a year later, they released an even newer model called _ParseySaurus_ which improved things further. In other words, parsing techniques are still an active area of research and constantly changing and improving.
+不过,尽管作者在 2015 年的一条说明中称这种方法已成为标准,它实际上已经过时,甚至作者自己也不再使用了。在 2016 年,谷歌推出了一种新的依存语法解析器,称为 **Parsey McParseface**,它采用新的深度学习方法,性能超越了以往的基准,并在业界内快速流传。一年之后,他们又发布了更新的模型,称为 **ParseySaurus**,对某些方面做了进一步改善。换句话说,解析技术依旧是一个活跃的研究领域,并且在不断地变化和改进。
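
想直观地看看依存解析的输出,可以参考下面这个补充性示例(非原文代码,沿用上文假设的 spaCy 环境):它会打印每个词的依存关系标签及其父级词语,并顺带列出后面「第六步(下)」将要介绍的名词短语。

```
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("London is the capital and most populous city of England and the United Kingdom.")

# 每个标记的依存关系标签,以及它在解析树中的父级词语
for token in doc:
    print(f"{token.text} --{token.dep_}--> {token.head.text}")

# 名词短语(见「第六步(下)」)
print([chunk.text for chunk in doc.noun_chunks])
```

其中主要动词的依存标签是 ROOT,它的父级就是它自己。
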
-It’s also important to remember that many English sentences are ambiguous and just really hard to parse. In those cases, the model will make a guess based on what parsed version of the sentence seems most likely but it’s not perfect and sometimes the model will be embarrassingly wrong_._ But over time our NLP models will continue to get better at parsing text in a sensible way.
+同样需要牢记的是,很多英语语句是十分模糊且难以解析的。在这种情况下,模型会按照看起来可能性最大的那种解析结果进行猜测,但这并不完美,有时模型会犯下令人尴尬的错误。但随着时间的推移,我们的 NLP 模型将会继续以合理的方式更好地解析文本。

-Want to try out dependency parsing on your own sentence? [There’s a great interactive demo from the spaCy team here](https://explosion.ai/demos/displacy).
+想要在你自己的句子上试一试依存语法解析吗?[这是来自 spaCy 团队的一个很棒的互动演示](https://explosion.ai/demos/displacy)。

-#### Step 6b: Finding Noun Phrases
+#### 第六步(下):查找名词短语

-So far, we’ve treated every word in our sentence as a separate entity. But sometimes it makes more sense to group together the words that represent a single idea or thing. We can use the information from the dependency parse tree to automatically group together words that are all talking about the same thing.
+到现在为止,我们一直将句子中的每一个词语当作一个独立的实体来处理。但有时把表达同一个想法或事物的词语组合在一起会更加合理。我们可以利用依存关系解析树中的信息,自动地将所有谈论同一事物的词语组合在一起。

-For example, instead of this:
+举个例子,与其使用下面这种每个词都各自独立的形式:

![](https://cdn-images-1.medium.com/max/800/1*EgYJsyjBNk074TQf87_CqA.png)

-We can group the noun phrases to generate this:
+我们不如将名词短语组合在一起,生成这样的结果:

![](https://cdn-images-1.medium.com/max/800/1*5dlHkuUP3pG8ktlR-wPliw.png)

-Whether or not we do this step depends on our end goal. But it’s often a quick and easy way to simplify the sentence if we don’t need extra detail about which words are adjectives and instead care more about extracting complete ideas.
+是否要进行这一步,取决于我们的最终目标。但是,如果我们并不需要知道哪些词是形容词之类的额外细节,而是更关注提取完整的概念,那么这通常是简化句子的一个快速而简便的方法。

-#### Step 7: Named Entity Recognition (NER)
+#### 第七步:命名实体识别(NER)

-Now that we’ve done all that hard work, we can finally move beyond grade-school grammar and start actually extracting ideas.
+现在我们已经完成了所有这些困难的工作,终于可以跳出小学语法的层面,开始真正地提取信息了。

-In our sentence, we have the following nouns:
+在我们的句子中,有着以下名词:

![](https://cdn-images-1.medium.com/max/1000/1*JMXGOrdx4oQsfZC5t-Ksgw.png)

-Some of these nouns present real things in the world. For example, “_London”_, _“England”_ and _“United Kingdom”_ represent physical places on a map. It would be nice to be able to detect that! With that information, we could automatically extract a list of real-world places mentioned in a document using NLP.
+这些名词中,有一部分指代的是现实世界中真实存在的事物。比如说“**伦敦**”、“**英格兰**”和“**英国**”代表了地图上的物理位置。如果能检测到这些那真是太棒了!有了这些信息,我们就能够使用 NLP 自动地提取出一份文档中提及的真实世界地理位置列表。

-The goal of _Named Entity Recognition_, or _NER_, is to detect and label these nouns with the real-world concepts that they represent. Here’s what our sentence looks like after running each token through our NER tagging model:
+**命名实体识别**(**NER**)的目标就是检测并标记出这些代表现实世界中真实事物的名词。在使用我们的 NER 标记模型对句子中的每个词语进行处理之后,句子就变成这样:

![](https://cdn-images-1.medium.com/max/1000/1*x1kwwACli8Fcvjos_6oS-A.png)

-But NER systems aren’t just doing a simple dictionary lookup. Instead, they are using the context of how a word appears in the sentence and a statistical model to guess which type of noun a word represents. A good NER system can tell the difference between “_Brooklyn Decker_” the person and the place “_Brooklyn_” using context clues.
+但 NER 系统并不只是做这些简单的查找字典的工作。而是使用某个词语在句子中的上下文以及统计模型来猜测某个词语代表哪种类型的名词。一个优秀的 NER 系统能够根据上下文线索辨别出人名 “**Brooklyn Decker**” 和 地名 “**Brooklyn**”。 -Here are just some of the kinds of objects that a typical NER system can tag: +这些是经典的 NER 系统能够标记的事物: -* People’s names -* Company names -* Geographic locations (Both physical and political) -* Product names -* Dates and times -* Amounts of money -* Names of events +* 人名 +* 公司名 +* 地理位置(物理位置和政治位置) +* 产品名称 +* 日期和时间 +* 金额 +* 事件名称 -NER has tons of uses since it makes it so easy to grab structured data out of text. It’s one of the easiest ways to quickly get value out of an NLP pipeline. +自从 NER 能够帮助轻易地从文本中获取结构化数据,便被广泛使用。它是从 NLP 管道中获得结果的最便捷途径之一。 -Want to try out Named Entity Recognition yourself? [There’s another great interactive demo from spaCy here](https://explosion.ai/demos/displacy-ent). +想自己试一试专名识别技术吗?[这是来自 spaCy 团队的另一个很棒的互动演示](https://explosion.ai/demos/displacy-ent)。 -#### Step 8: Coreference Resolution +#### 第八步:共指解析 -At this point, we already have a useful representation of our sentence. We know the parts of speech for each word, how the words relate to each other and which words are talking about named entities. +在此刻,我们已经对句子有了充分的了解。我们了解了每个词语的词性、词语之间的依存关系以及哪些词语是代表命名实体的。 -However, we still have one big problem. English is full of pronouns — words like _he_, _she_, and _it_. These are shortcuts that we use instead of writing out names over and over in each sentence. Humans can keep track of what these words represent based on context. But our NLP model doesn’t know what pronouns mean because it only examines one sentence at a time. +可是,我们还需要解决一个大问题。英语中存在着大量的代词 —— 比如**他**、**她**和**它**。这些是我们对在句子中反复出现的名称的简化。人们能够根据上下文来得到这些词代表的内容。但是我们的 NLP 模型并不知道这些代词的含义,因为它每次只检查一个句子。 -Let’s look at the third sentence in our document: +来看看我们的文档中的第三个句子: > “It was founded by the Romans, who named it Londinium.” -If we parse this with our NLP pipeline, we’ll know that “it” was founded by Romans. But it’s a lot more useful to know that “London” was founded by Romans. +如果我们用 NLP 管道解析这个句子,我们就能知道“它”是由罗马人建立的。但如果能知道“伦敦”是由罗马人建立的那会更有用。 -As a human reading this sentence, you can easily figure out that “_it”_ means “_London”_. The goal of coreference resolution is to figure out this same mapping by tracking pronouns across sentences. We want to figure out all the words that are referring to the same entity. +当人们读这个句子时,能够很容易得出“它”代表“伦敦”。共指解析的目的是根据整个句子中的代词来找出这种相同的映射。我们是想要找出所有指向同一实体的词语。 -Here’s the result of running coreference resolution on our document for the word “London”: +这就是在我们的文档中对“伦敦”使用共指解析的结果: ![](https://cdn-images-1.medium.com/max/800/1*vGPbWiJqQA65GlwcOYtbKQ.png) -With coreference information combined with the parse tree and named entity information, we should be able to extract a lot of information out of this document! +将共指信息、解析树和命名实体信息结合在一起,我们就能够从这个文档中提取出很多信息! -Coreference resolution is one of the most difficult steps in our pipeline to implement. It’s even more difficult than sentence parsing. Recent advances in deep learning have resulted in new approaches that are more accurate, but it isn’t perfect yet. If you want to learn more about how it works, [start here](https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30). +共指解析是我们正在进行工作的管道中的最艰难步骤之一。它甚至比语句解析还要困难。深度学习的最新进展带来更精确的方法,但它还不够完美。如果你想多了解一下它是如何工作的,[从这里开始](https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30)。 -Want to play with co-reference resolution? 
[Check out this great co-reference resolution demo from Hugging Face](https://huggingface.co/coref/). +想要参与协作共指解析?[看看这个来自 Hugging Face 的协作共指解析演示](https://huggingface.co/coref/)。 -### Coding the NLP Pipeline in Python +### 用 Python 来构建 NLP 管道 -Here’s an overview of our complete NLP pipeline: +这是我们的完整 NLP 管道的概览: ![](https://cdn-images-1.medium.com/max/1000/1*zHLs87sp8R61ehUoXepWHA.png) -Coreference resolution is an optional step that isn’t always done. +共指解析是一项并不总要完成的可选步骤。 -Whew, that’s a lot of steps! +哎呀,有好多步骤啊! -_Note: Before we continue, it’s worth mentioning that these are the steps in a typical NLP pipeline, but you will skip steps or re-order steps depending on what you want to do and how your NLP library is implemented. For example, some libraries like spaCy do sentence segmentation much later in the pipeline using the results of the dependency parse._ +**注意:在我们往下看之前,值得一提的是,这些都是构建传统 NLP 管道的步骤,你可以根据你的目的以及如何实现你的 NLP 库来决定是跳过还是重复某些步骤。举个例子,一些像 spaCy 这样的库,是先使用依存语法解析,得出结果后再进行语句分割。** -So how do we code this pipeline? Thanks to amazing python libraries like spaCy, it’s already done! The steps are all coded and ready for you to use. +那么,我们该如何构建这个管道?多谢像 spaCy 这样神奇的 python 库,管道的构建工作已经完成!所有的步骤都已完成,时刻准备为你所用。 -First, assuming you have Python 3 installed already, you can install spaCy like this: +首先,假设你已经安装了 Python 3,你可以按如下步骤来安装 spaCy: ``` -# Install spaCy +# 安装 spaCy pip3 install -U spacy -# Download the large English model for spaCy +# 下载针对 spaCy 的大型英语模型 python3 -m spacy download en_core_web_lg -# Install textacy which will also be useful +# 安装同样大有用处的 textacy pip3 install -U textacy ``` -Then the code to run an NLP pipeline on a piece of text looks like this: +在一段文档中运行 NLP 管道的代码如下所示: ``` import spacy -# Load the large English NLP model +# 加载大型英语模型 nlp = spacy.load('en_core_web_lg') -# The text we want to examine +# 我们想要检验的文本 text = """London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium. """ -# Parse the text with spaCy. This runs the entire pipeline. +# 用 spaCy 解析文本. 在整个管道运行. doc = nlp(text) -# 'doc' now contains a parsed version of text. We can use it to do anything we want! -# For example, this will print out all the named entities that were detected: +# 'doc' 现在包含了解析之后的文本。我们可以用它来做我们想做的事! +# 比如,这将会打印出所有被检测到的命名实体: for entity in doc.ents: print(f"{entity.text} ({entity.label_})") ``` -If you run that, you’ll get a list of named entities and entity types detected in our document: +如果你运行了这条语句,你就会得到一个关于文档中被检测出的命名实体和实体类型的表: ``` London (GPE) @@ -284,29 +284,29 @@ Romans (NORP) Londinium (PERSON) ``` -You can look up what each of those [entity codes means here](https://spacy.io/usage/linguistic-features#entity-types). +你可以查看每一个[实体代码的含义](https://spacy.io/usage/linguistic-features#entity-types)。 -Notice that it makes a mistake on “Londinium” and thinks it is the name of a person instead of a place. This is probably because there was nothing in the training data set similar to that and it made a best guess. Named Entity Detection often requires [a little bit of model fine tuning](https://spacy.io/usage/training#section-ner) if you are parsing text that has unique or specialized terms like this. 
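
补充一个小技巧(非原文内容):除了查阅文档,也可以直接在 Python 里调用 spaCy 自带的 spacy.explain() 查看这些实体代码的英文说明,例如:

```
import spacy

# spacy.explain() 会返回标签的英文描述字符串(未知标签则返回 None)
print(spacy.explain("GPE"))    # 国家、城市、州等地缘政治实体
print(spacy.explain("NORP"))   # 民族、宗教或政治团体
```
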
+需要注意的是,它误将 “Londinium” 作为人名而不是地名。这可能是因为在训练数据中没有与之相似的内容,不过它做出了最好的猜测。如果你要解析具有专业术语的文本,命名实体的检测通常需要[做一些微调](https://spacy.io/usage/training#section-ner)。 -Let’s take the idea of detecting entities and twist it around to build a data scrubber. Let’s say you are trying to comply with the new [GDPR privacy regulations](https://medium.com/@ageitgey/understand-the-gdpr-in-10-minutes-407f4b54111f) and you’ve discovered that you have thousands of documents with personally identifiable information in them like people’s names. You’ve been given the task of removing any and all names from your documents. +让我们把这实体检测的思想转变一下,来做一个数据清理器。假设你正在尝试执行新的 [GDPR 隐私条款](https://medium.com/@ageitgey/understand-the-gdpr-in-10-minutes-407f4b54111f)并且发现你所持有的上千个文档中都有个人身份信息,例如名字。现在你的任务就是移除文档中的所有名字。 -Going through thousands of documents and trying to redact all the names by hand could take years. But with NLP, it’s a breeze. Here’s a simple scrubber that removes all the names it detects: +如果将上千个文档中的名字手动去除,需要花上好几年。但如果用 NLP,事情就简单了许多。这是一个移除检测到的名字的数据清洗器: ``` import spacy -# Load the large English NLP model +# 加载大型英语 NLP 模型 nlp = spacy.load('en_core_web_lg') -# Replace a token with "REDACTED" if it is a name +# 如果检测到名字,就用 "REDACTED" 替换 def replace_name_with_placeholder(token): if token.ent_iob != 0 and token.ent_type_ == "PERSON": return "[REDACTED] " else: return token.string -# Loop through all the entities in a document and check if they are names +# 依次解析文档中的所有实体并检测是否为名字 def scrub(text): doc = nlp(text) for ent in doc.ents: @@ -322,42 +322,42 @@ Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule print(scrub(s)) ``` -And if you run that, you’ll see that it works as expected: +如果你运行了这个,就会看到它如预期般工作: ``` In 1950, [REDACTED] published his famous article "Computing Machinery and Intelligence". In 1957, [REDACTED] Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures. ``` -### Extracting Facts +### 信息提取 -What you can do with spaCy right out of the box is pretty amazing. But you can also use the parsed output from spaCy as the input to more complex data extraction algorithms. There’s a python library called [textacy](http://textacy.readthedocs.io/en/stable/) that implements several common data extraction algorithms on top of spaCy. It’s a great starting point. +开箱即用的 spaCy 能做的事实在是太棒了。但你也可以用 spaCy 解析的输出来作为更复杂的数据提取算法的输入。这里有一个叫做 [textacy](http://textacy.readthedocs.io/en/stable/) 的 python 库,它实现了多种基于 spaCy 的通用数据提取算法。这是一个良好的开端。 -One of the algorithms it implements is called [Semi-structured Statement Extraction](https://textacy.readthedocs.io/en/stable/api_reference.html#textacy.extract.semistructured_statements). We can use it to search the parse tree for simple statements where the subject is “London” and the verb is a form of “be”. That should help us find facts about London. +它实现的算法之一叫做[半结构化语句提取](https://textacy.readthedocs.io/en/stable/api_reference.html#textacy.extract.semistructured_statements)。我们用它来搜索解析树,查找主体为“伦敦”且动词是 “be” 形式的简单语句。这将会帮助我们找到有关伦敦的信息。 -Here’s how that looks in code: +来看看代码是怎样的: ``` import spacy import textacy.extract -# Load the large English NLP model +# 加载大型英语 NLP 模型 nlp = spacy.load('en_core_web_lg') -# The text we want to examine +# 需要检测的文本 text = """London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium. 
""" -# Parse the document with spaCy +# 用 spaCy 来解析文档 doc = nlp(text) -# Extract semi-structured statements +# 提取半结构化语句 statements = textacy.extract.semistructured_statements(doc, "London") -# Print the results +# 打印结果 print("Here are the things I know about London:") for statement in statements: @@ -365,7 +365,7 @@ for statement in statements: print(f" - {fact}") ``` -And here’s what it prints: +它打印出了这些: ``` Here are the things I know about London: @@ -374,7 +374,7 @@ Here are the things I know about London: - a major settlement for two millennia. ``` -Maybe that’s not too impressive. But if you run that same code on the entire London wikipedia article text instead of just three sentences, you’ll get this more impressive result: +也许这不会太令人印象深刻。但如果你将这段代码用于维基百科上关于伦敦的整篇文章上,而不只是这三个句子,就会得到令人印象十分深刻的结果: ``` Here are the things I know about London: @@ -407,47 +407,47 @@ Here are the things I know about London: - not the capital of England, as England does not have its own government ``` -Now things are getting interesting! That’s a pretty impressive amount of information we’ve collected automatically. +现在事情变得有趣了起来!我们自动收集了大量的信息。 -For extra credit, try installing the [neuralcoref](https://github.com/huggingface/neuralcoref) library and adding Coreference Resolution to your pipeline. That will get you a few more facts since it will catch sentences that talk about “it” instead of mentioning “London” directly. +为了让事情变得更有趣,试试安装 [neuralcoref](https://github.com/huggingface/neuralcoref) 库并且添加共指解析到你的管道。这将为你提供更多的信息,因为它会捕捉含有“它”的而不是直接表示“伦敦”的句子。 -#### What else can we do? +#### 我们还能做什么? -By looking through the [spaCy docs](https://spacy.io/api/doc) and [textacy docs](http://textacy.readthedocs.io/en/latest/), you’ll see lots of examples of the ways you can work with parsed text. What we’ve seen so far is just a tiny sample. +看看这个 [spaCy 文档](https://spacy.io/api/doc)和 [textacy 文档](http://textacy.readthedocs.io/en/latest/),你会发现很多能够用于解析文本的方法示例。目前我们所看见的只是一个小示例。 -Here’s another practical example: Imagine that you were building a website that let’s the user view information for every city in the world using the information we extracted in the last example. +这里有另外一个实例:想象你正在构建一个能够向用户展示我们在上一个例子中提取出的全世界城市的信息的网站。 -If you had a search feature on the website, it might be nice to autocomplete common search queries like Google does: +如果你的网站有搜索功能,能像谷歌那样能够自动补全常规的查询就太好了: ![](https://cdn-images-1.medium.com/max/1000/1*CvaGqQ63aSKpZ57Gy0k4Nw.png) -Google’s autocomplete suggestions for “London” +谷歌对于“伦敦”的自动补全建议 -But to do this, we need a list of possible completions to suggest to the user. We can use NLP to quickly generate this data. +如果这么做,我们就需要一个可能提供给用户的建议列表。我们可以使用 NLP 来快速生成这些数据。 -Here’s one way to extract frequently-mentioned noun chunks from a document: +这是从文档中提取常用名词块的一种方式: ``` import spacy import textacy.extract -# Load the large English NLP model +# 加载大型英语 NLP 模型 nlp = spacy.load('en_core_web_lg') -# The text we want to examine +# 需要检测的文档 text = """London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium. 
""" -# Parse the document with spaCy +# 用 spaCy 解析文档 doc = nlp(text) -# Extract semi-structured statements +# 提取半结构化语句 statements = textacy.extract.semistructured_statements(doc, "London") -# Print the results +# 打印结果 print("Here are the things I know about London:") for statement in statements: @@ -455,7 +455,7 @@ for statement in statements: print(f" - {fact}") ``` -If you run that on the London Wikipedia article, you’ll get output like this: +如果你用这段代码来处理维基百科上关于伦敦的文章,就会得到如下结果: ``` westminster abbey @@ -472,11 +472,11 @@ london eye .... etc .... ``` -### Going Deeper +### 更进一步 -This is just a tiny taste of what you can do with NLP. In future posts, we’ll talk about other applications of NLP like Text Classification and how systems like Amazon Alexa parse questions. +这只是你可以用 NLP 做到的事中的一个小示例。在以后的文章中,我们将会谈论一些其他的应用,例如文本分类或亚马逊 Alexa 系统是如何解析问题的。 -But until then, install [spaCy](https://spacy.io/) and start playing around! Or if you aren’t a Python user and end up using a different NLP library, the ideas should all work roughly the same way. +但目前要做的事,就是安装 [spaCy](https://spacy.io/) 并使用它。如果你不是 Python 程序员并且使用不同的 NLP 库,这种想法应该也能奏效。 > 如果发现译文存在错误或其他需要改进的地方,欢迎到 [掘金翻译计划](https://github.com/xitu/gold-miner) 对译文进行修改并 PR,也可获得相应奖励积分。文章开头的 **本文永久链接** 即为本文在 GitHub 上的 MarkDown 链接。