Terms API: Allow to get terms for one or more field #21
Terms API: Allow to get terms for one or more field. Closed by 5d78196.
Could you please provide the docs for the usage of terms, so that I can add it to ElasticSearch.pm? Thanks, clint
The terms API accepts the following URIs:

The HTTP parameters are (`fields` or `field` must be set):
The field names support indexName-based lookup and full path lookup (with or without a type prefix). The results include a docs header, and then an object named after the field name, with the term and document frequency for each term. The only thing I am not sure about is that currently the term value is the JSON object name, and I wonder if it makes sense to create a generic JSON object with a term field holding the value. What do you think?
Regarding my previous question, I simply added another HTTP boolean parameter called termsAsArray. It defaults to true, which means you will get an array of JSON objects with term and docFreq as fields. This also maintains the order for parsers that are not order aware (since you can sort). If set to false, it will return JSON object names with the term itself.
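A minimal sketch of the two response shapes described above. The field name, terms, doc frequencies, and the exact `docs` header keys here are made up for illustration, not taken from the actual API:

```python
import json

# Hypothetical response with termsAsArray=true (the default):
# each field maps to an ordered array of {term, docFreq} objects.
response_as_array = json.loads("""
{
  "docs": {"numDocs": 100},
  "fields": {
    "name.firstName": {
      "terms": [
        {"term": "alice", "docFreq": 12},
        {"term": "bob", "docFreq": 7}
      ]
    }
  }
}
""")

# Hypothetical response with termsAsArray=false: the term itself
# becomes the JSON object name, so ordering depends on the parser.
response_as_object = json.loads("""
{
  "docs": {"numDocs": 100},
  "fields": {
    "name.firstName": {
      "terms": {
        "alice": {"docFreq": 12},
        "bob": {"docFreq": 7}
      }
    }
  }
}
""")

# The array form preserves order even for order-unaware parsers.
terms = response_as_array["fields"]["name.firstName"]["terms"]
print([t["term"] for t in terms])  # ['alice', 'bob']
print(response_as_object["fields"]["name.firstName"]["terms"]["alice"]["docFreq"])  # 12
```

The array form trades a slightly more verbose payload for a stable, sortable order.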
You mean fromInclusive defaults to TRUE. I've renamed these
What do you mean by this:
Can you give me an example of the format?
Actually, these are both incorrect. Currently Why do you have these as different values? From the naming of
The idea of fromInclusive and toInclusive is to follow the usual convention of writing a for loop, something like `for (i = 0; i < 10; i++)`: in this case, the from (0) is inclusive, and the to (10) is not. In any case, I suggest that you follow the same wording and parameters elasticsearch uses, so you won't confuse users. We can talk about whether it makes sense to change this, but for now I suggest keeping it the same. Regarding the field name, it is explained a bit here (http://www.elasticsearch.com/docs/elasticsearch/mapping/object_type/#pathType), though I should add a page that explains it explicitly. For example, if you have (person is the type of the mapping):

then the field name (that will match) will be either person.name.firstName or name.firstName. If you add explicit mapping for the name object (or person), you can control the pathType.
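The matching rule described above can be sketched as follows. This helper is ours, purely illustrative of the two accepted spellings of a field name; it is not code from elasticsearch:

```python
def field_matches(requested: str, full_path: str, type_name: str) -> bool:
    """Illustrative only: a requested field name matches either the
    full path without the type prefix (e.g. 'name.firstName') or the
    same path with the type prefix (e.g. 'person.name.firstName')."""
    typed = type_name + "." + full_path
    return requested in (full_path, typed)

# With a mapping of type 'person' containing the field name.firstName:
print(field_matches("name.firstName", "name.firstName", "person"))         # True
print(field_matches("person.name.firstName", "name.firstName", "person"))  # True
print(field_matches("firstName", "name.firstName", "person"))              # False
```

Note that, as discussed later in the thread, the type prefix here only affects name matching; it does not filter results by type.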
OK - I didn't get that. I would say then they should be called In Perl (and some other dynamic languages), loops can be written more succinctly, like:
... both of which are inclusive. To my mind, basing the default values of
OK, I have two mappings: type_1 and type_2. Both have a field 'text', but whether I ask for terms on field 'text' or 'type_1.text', I get the same results, which doesn't seem to be what I'm asking for. Is this what it is supposed to do?
No problem, makes sense, I will change toInclusive to true. Regarding the field name, yeah, it's not filtered by type when you prefix it with the type (which is different from using the typed field in search queries, for example). It can be implemented, but it's more difficult and would be much more expensive to perform, so for now I did not implement it.
Due to fix [3790](#3790) in core, upgrading an analyzer provided as a plugin now fails. See #5030 for details. The issue is in elasticsearch core code, but it can be fixed in plugins by overloading `PreBuiltAnalyzerProviderFactory`, `PreBuiltTokenFilterFactoryFactory`, `PreBuiltTokenizerFactoryFactory` or `PreBuiltCharFilterFactoryFactory` when used. Closes #21 (cherry picked from commit 3401c21)
Latest changes break tests. Closes #21. (cherry picked from commit 04c77e8)
Closes #21. (cherry picked from commit a1b37f6)
According to the [Containers naming guide](http://msdn.microsoft.com/en-us/library/dd135715.aspx):

> A container name must be a valid DNS name, conforming to the following naming rules:
>
> * Container names must start with a letter or number, and can contain only letters, numbers, and the dash (-) character.
> * Every dash (-) character must be immediately preceded and followed by a letter or number; consecutive dashes are not permitted in container names.
> * All letters in a container name must be lowercase.
> * Container names must be from 3 through 63 characters long.

We need to fix the documentation and validate this before calling the Azure API. The validation will come with issue #27. Closes #21. (cherry picked from commit 6531165)
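The naming rules quoted above can be checked client-side before calling the Azure API. A minimal sketch (the function name and regex are ours, not from the plugin):

```python
import re

# Lowercase letters and digits, with single interior dashes; a dash can
# never start the name, end it, or follow another dash.
CONTAINER_NAME_RE = re.compile(r"^[a-z0-9](?:-?[a-z0-9])*$")

def is_valid_container_name(name: str) -> bool:
    """Validate a container name against the rules in the naming guide:
    3-63 chars, lowercase letters/digits/dashes, no leading, trailing,
    or consecutive dashes."""
    return 3 <= len(name) <= 63 and CONTAINER_NAME_RE.fullmatch(name) is not None

print(is_valid_container_name("my-container"))  # True
print(is_valid_container_name("My-Container"))  # False (uppercase)
print(is_valid_container_name("a--b"))          # False (consecutive dashes)
print(is_valid_container_name("ab"))            # False (too short)
```

Rejecting bad names locally gives users a clear error instead of an opaque failure from the Azure API.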
Sometimes Tika may crash while parsing some files. In this case it may generate raw runtime errors (`Throwable`), not `TikaException`. But there is no catch clause for `Throwable` in AttachmentMapper.java:

```java
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (TikaException e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
```

As a result, tika() may "hang up" the whole application. (We have some PDF files that "hang up" the Elastic client if you try to parse them using the mapper-attachment plugin.) We propose the following fix:

```java
String parsedContent;
try {
    // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
    parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
} catch (Throwable e) {
    throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
}
```

(Just replace `TikaException` with `Throwable`; it works for our cases.) Thank you! Closes elastic#21.
Prior to this change, the `publish()` method comprises a deeply nested collection of lambdas and anonymous classes which represent the notion of a single publication attempt. In future we want to treat it as a first-class concept so we can detect when it fails etc. This change gives names to the anonymous lambdas and classes as a step towards this.
Add updated repo to Ubuntu to use a newer version of Ansible
Some revisions to the idea of using a function
With this commit we add a new command line parameter `--elasticsearch-plugins` to Night Rally. Night Rally will pass this parameter on to Rally, but will also check whether we've specified "x-pack:security" as a plugin and make the necessary adjustments (change the expected cluster health and adjust the client options). Relates elastic#21
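The check described in this commit message can be sketched as follows. The function name and the comma-separated plugin-list format are assumptions for illustration, not Night Rally's actual code:

```python
def needs_security_adjustments(plugins_param: str) -> bool:
    """Return True if 'x-pack:security' appears in a comma-separated
    plugin list, meaning cluster health expectations and client
    options need to be adjusted before handing off to Rally."""
    plugins = [p.strip() for p in plugins_param.split(",") if p.strip()]
    return "x-pack:security" in plugins

print(needs_security_adjustments("analysis-icu,x-pack:security"))  # True
print(needs_security_adjustments("analysis-icu"))                  # False
```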
Getting terms (from one or more indices) and their document frequency (the number of times those terms appear in a document) is very handy: for example, for implementing tag clouds, or providing a basic auto-suggest search box.
There should be several options for this API, including sorting by term (lexicographically) or by doc freq, bounding the size, from/to ranges (inclusive or not), min/max freq, and prefix and regexp filtering.

The REST API should be: `/{index}/_terms`
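A sketch of what a request to this endpoint might look like. Only the `/{index}/_terms` path comes from the proposal above; the base URL and the parameter names (`field`, `size`) are assumptions drawn from the discussion in this thread, not confirmed documentation:

```python
from urllib.parse import urlencode

def terms_url(base: str, index: str, **params) -> str:
    """Build a terms API URL of the form {base}/{index}/_terms?...
    Illustrative only; parameter names are not authoritative."""
    query = urlencode(params)
    return f"{base}/{index}/_terms" + (f"?{query}" if query else "")

print(terms_url("http://localhost:9200", "twitter", field="user", size=10))
# http://localhost:9200/twitter/_terms?field=user&size=10
```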