Exporting index to xml
Note:
This page describes functionality of the Thinlet-based Luke.
The current version of Luke (implemented in Swing) does not support index export.
There is an issue for this: https://github.com/DmitryKey/luke/issues/141
There are several reasons why you might want to export your Lucene / Solr index, or part of it, to an XML file for further processing.
One such goal is extracting the indexed tokens.
In this post we illustrate one particular Luke feature, which lets you dump an index into an XML file for external processing. The post has been adapted from here.
Extract indexed tokens from a field to a file for further analysis outside luke.
To extract tokens, you need to index your field with term vectors configured. Usually this also means that you need to configure positions and offsets.
If you are indexing using Apache Solr, you would configure the following on your field:
<field indexed="true" name="Contents" omitNorms="false" stored="true" termOffsets="true" termPositions="true" termVectors="true" type="text"/>
With this line you make sure your field stores its contents rather than only indexing them; it also stores the term vectors, i.e. each term together with its positions and offsets in the token stream.
One way to view the indexed tokens with Luke is to search / list documents, select the field with term vectors enabled and click the TV button (or right-click and choose "Field's Term Vector").
If you would like to extract this data into an external file, you can currently do so via the menu Tools->Export index to XML.
In this case I have selected docid 94724, which is visible when viewing a particular document in Luke (note that this is Lucene's internal doc id, not the Solr application-level document id!). The export dumps the document into the XML file, including the fields in the schema and each field's contents. In particular, it dumps a field's term vectors (if present); in my case:
<field flags="Idfp--SV-Nnum--------" name="Contents">
<val>CENTURY TEXT.</val>
<tv>
<t freq="1" offsets="0-7" positions="0" text="centuri" />
<t freq="1" offsets="0-7" positions="0" text="centuryä" />
<t freq="1" offsets="8-12" positions="1" text="text" />
<t freq="1" offsets="8-12" positions="1" text="textä" />
</tv>
</field>
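For the "further analysis outside luke" step, the exported term vector can be read with any XML parser. Below is a minimal Python sketch that uses only the fragment shown above as input. The function names `extract_terms` and `parse_span` are hypothetical, and the elements that enclose `<field>` in a full export are not shown on this page, so the code simply searches for `<field>` elements at any depth. The sample contains only terms with `freq="1"`; how positions and offsets are written for terms that occur more than once is not shown here, so anything that is not a single `start-end` pair is left unparsed.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the export above. A real export file contains more
# than this one <field>; its enclosing elements are not assumed here.
SAMPLE = """<field flags="Idfp--SV-Nnum--------" name="Contents">
<val>CENTURY TEXT.</val>
<tv>
<t freq="1" offsets="0-7" positions="0" text="centuri" />
<t freq="1" offsets="0-7" positions="0" text="centuryä" />
<t freq="1" offsets="8-12" positions="1" text="text" />
<t freq="1" offsets="8-12" positions="1" text="textä" />
</tv>
</field>"""


def parse_span(raw):
    """Parse a single 'start-end' offset pair; return None for any other shape."""
    parts = raw.split("-")
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return int(parts[0]), int(parts[1])
    return None


def extract_terms(root, field_name):
    """Collect the term vector entries of every <field> named field_name."""
    terms = []
    # iter() also visits root itself, so this works for a bare <field> element.
    for field in root.iter("field"):
        if field.get("name") != field_name:
            continue
        for t in field.iterfind("tv/t"):
            terms.append({
                "text": t.get("text"),
                "freq": int(t.get("freq")),
                "positions": t.get("positions"),  # kept as the raw string
                "span": parse_span(t.get("offsets")),
            })
    return terms


root = ET.fromstring(SAMPLE)
stored = root.findtext("val")
for term in extract_terms(root, "Contents"):
    span = term["span"]
    # Offsets index into the stored value; the sample is consistent with
    # an end-exclusive end offset.
    covered = stored[span[0]:span[1]] if span else None
    print(term["text"], term["positions"], span, covered)
```

For a file saved from the export menu, `ET.parse(path).getroot()` can take the place of `ET.fromstring(SAMPLE)`; the path is whatever you chose in the export dialog.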