Pyserini: Working with spaCy

This page describes how to take Pyserini output and apply spaCy to do some NLP basics on it.

spaCy Prep

First, install the spaCy package and download the model:

pip install spacy
python -m spacy download en_core_web_sm

In this guide, we use en_core_web_sm, a small English model trained on written web text (blogs, news, comments). Many other models are available for different languages; pick the one that best fits your application.


Use Pyserini's SimpleSearcher to fetch a document from the pre-built MS MARCO passage index msmarco-passage:

import json
from pyserini.search import SimpleSearcher

# Initialize a searcher from a pre-built index
searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')

# Fetch raw text of a document given its docid
raw = searcher.doc('1').raw()
# Get actual content from raw
content = json.loads(raw)['contents']

content should be as follows:

The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
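
The extraction step can also be tried without downloading the index. The sketch below assumes that the raw form of a passage is a JSON object with id and contents fields, as the lookup above suggests; the raw string here is a hand-written stand-in, not actual searcher output.

```python
import json

# Hand-written stand-in for searcher.doc('1').raw(); the field names
# 'id' and 'contents' are assumed from the lookup used in this guide.
raw = json.dumps({
    "id": "1",
    "contents": "The Manhattan Project and its atomic bomb helped bring an end to World War II.",
})

record = json.loads(raw)
# .get() returns an empty string if a document has no 'contents' field
content = record.get("contents", "")
print(content)
```

Using .get() with a default is one way to avoid a KeyError when a raw document lacks the expected field.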

Linguistic Features

Load spaCy's pre-trained model into a Language object called nlp, then call nlp on content to get a processed Doc object:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(content)

From the Doc object, we can access spaCy's NLP features for our document. This guide covers tokenization, POS tagging, NER, and sentence segmentation.


Tokenization

Each Doc object contains individual Token objects, and you can iterate over them:

for token in doc:
    print(token.text)

The result should be as follows:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
The Manhattan Project and its atomic bomb helped bring an end to World War II . ...
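
Tokenization alone does not need the trained model. As a minimal sketch, assuming only that the spaCy package is installed, a blank English pipeline (which contains just the tokenizer) reports each token's position and character offset; the names nlp_blank and doc_blank are illustrative.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no trained model needed)
nlp_blank = spacy.blank("en")
text = "The Manhattan Project and its atomic bomb helped bring an end to World War II."
doc_blank = nlp_blank(text)

# token.i is the token's position in the Doc; token.idx is its character offset
tokens = [(token.i, token.idx, token.text) for token in doc_blank]
for i, idx, tok_text in tokens:
    print(i, idx, tok_text)
```

The character offsets are the same kind of positions that entity spans report through start_char and end_char.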

Part-of-speech (POS) Tagging

There are many linguistic annotations contained in Token's attributes, such as

TEXT: The original word text.

LEMMA: The base form of the word.

POS: The simple UPOS part-of-speech tag.

DEP: Syntactic dependency, i.e. the relation between tokens.

SHAPE: The word shape – capitalization, punctuation, digits.

STOP: Is the token part of a stop list, i.e. the most common words of the language?

These attributes can be easily accessed by:

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.shape_, token.is_stop)

The output is shown in the following table:

TEXT LEMMA POS DEP SHAPE STOP
The the DET det Xxx True
Manhattan Manhattan PROPN compound Xxxxx False
Project Project PROPN nsubj Xxxxx False
and and CCONJ cc xxx True
its -PRON- DET poss xxx True
atomic atomic ADJ amod xxxx False
bomb bomb NOUN conj xxxx False
helped help VERB aux xxxx False
bring bring VERB ROOT xxxx False
an an DET det xx True
end end NOUN dobj xxx False
to to ADP prep xx True
World World PROPN compound Xxxxx False
War War PROPN compound Xxx False
II II PROPN pobj XX False
. . PUNCT punct . False
... ... ... ... ... ...
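
Some of these attributes are lexical and can be read without the trained model's predictions. As a hedged sketch, the snippet below uses a blank English pipeline to drop stop words and punctuation, a common preprocessing step before indexing or matching; attributes such as the POS tag and dependency label still require the trained model loaded earlier.

```python
import spacy

# Blank pipeline: tokenizer plus lexical attributes only (assumption: no model installed)
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank(
    "The Manhattan Project and its atomic bomb helped bring an end to World War II."
)

# Keep tokens that are neither stop words nor punctuation
keywords = [t.text for t in doc_blank if not t.is_stop and not t.is_punct]
print(keywords)
```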

Named Entity Recognition (NER)

spaCy can recognize various types of named entities in a document:

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

The following table shows recognized entities:

TEXT START END LABEL DESCRIPTION
The Manhattan Project 0 21 ORG Companies, agencies, institutions, etc.
World War II 65 77 EVENT Named hurricanes, battles, wars, sports events, etc.
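
The descriptions in the last column are not printed by the loop above. One way to look them up is spacy.explain, which returns the glossary text for a label; the short sketch below uses the two labels from the table.

```python
import spacy

# Labels taken from the entity table in this guide
descriptions = {label: spacy.explain(label) for label in ("ORG", "EVENT")}
for label, description in descriptions.items():
    print(label, "-", description)
```

Inside the entity loop, the same call can be applied to ent.label_.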

Sentence Segmentation

Doc also exposes the segmented sentences as Span objects, which we can iterate over:

for sent in doc.sents:
    print(sent.text)

Then we have sentences:

0 The Manhattan Project and its atomic bomb helped bring an end to World War II.
1 Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
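
If only sentence boundaries are needed, a lighter alternative is spaCy's rule-based sentencizer, which splits on punctuation instead of relying on the trained model. This is a sketch of that swapped-in technique, not the method used above, and it assumes the spaCy v3 API, where built-in components are added by string name; nlp_rules is an illustrative name.

```python
import spacy

nlp_rules = spacy.blank("en")
# spaCy v3 API (assumption): add the built-in rule-based component by name
nlp_rules.add_pipe("sentencizer")

text = (
    "The Manhattan Project and its atomic bomb helped bring an end to World War II. "
    "Its legacy of peaceful uses of atomic energy continues to have an impact on "
    "history and science."
)
doc_rules = nlp_rules(text)

sentences = [sent.text for sent in doc_rules.sents]
for i, sentence in enumerate(sentences):
    print(i, sentence)
```

Punctuation-based splitting is fast, but it can differ from the model's segmentation on text with abbreviations or missing punctuation.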