
Terms API: Allow to get terms for one or more field #21

Closed
kimchy opened this issue Feb 16, 2010 · 10 comments

kimchy (Member) commented Feb 16, 2010

Getting terms (from one or more indices) and their document frequency (the number of documents in which each term appears) is very handy. For example, implementing tag clouds, or providing a basic auto-suggest search box.

There should be several options for this API, including sorting by term (lex) or doc freq, bounding the size, from/to bounds (inclusive or not), min/max freq, and prefix and regexp filtering.

The REST API should be: /{index}/_terms

kimchy (Member, Author) commented Feb 16, 2010

Terms API: Allow to get terms for one or more field. Closed by 5d78196.

clintongormley (Contributor) commented:

Could you please provide the docs for the usage of the terms API, so that I can add it to ElasticSearch.pm?

thanks

clint

kimchy (Member, Author) commented Feb 21, 2010

The terms API accepts the following URIs:

  • GET /_terms
  • GET /{index}/_terms (where {index} can be one or more indices, with _all support)

The HTTP parameters are (either fields or field must be set; a sample request follows the list):

  • fields: The fields to return terms for, comma-separated.
  • field: The field to return terms for; can be passed multiple times as separate HTTP field parameters.
  • from: The lower bound (lex) term from which the iteration will start. Defaults to starting from the first term.
  • to: The upper bound (lex) term at which the iteration will end. Defaults to unbounded (null).
  • fromInclusive: Whether the from term (if set) is inclusive. Defaults to false.
  • toInclusive: Whether the to term (if set) is inclusive. Defaults to true.
  • prefix: An optional prefix from which the terms iteration will start (in lex order).
  • regexp: An optional regular expression to filter terms (only terms that match the regexp will be returned).
  • minFreq: An optional minimum document frequency used to filter out terms.
  • maxFreq: An optional maximum document frequency used to filter out terms.
  • size: The number of term / doc freq pairs to return per field. Defaults to 10.
  • sort: The sort order for term / doc freq pairs; can be either "term" or "freq". Defaults to term.
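
For example, a couple of hypothetical requests combining these options (the index and field names are invented for illustration):

    # hypothetical index and field names
    GET /twitter/_terms?fields=tags,user&minFreq=2&size=20&sort=freq
    GET /_all/_terms?field=title&prefix=el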

Field names support indexName-based lookup and full-path lookup (with or without a type prefix).

The results include a docs header, and then an object named after each field, containing the term and document frequency pairs.

The only thing I am not sure about: currently the term value is the JSON object name, and I wonder if it makes sense to create a generic JSON object instead, with a term field holding the value. What do you think?

kimchy (Member, Author) commented Feb 21, 2010

Regarding my previous question, I simply added another boolean HTTP parameter called termsAsArray. It defaults to true, which means you get an array of JSON objects with term and docFreq as fields. This also maintains the order for parsers that are not order-aware (since you can sort). If set to false, it returns JSON objects whose names are the terms themselves.
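
A sketch of the two shapes for a hypothetical tags field (the terms and frequencies are invented, the surrounding docs header is omitted, and the inner value shape in the false case is my assumption):

With termsAsArray=true (the default):

    { "tags" : [
        { "term" : "lucene", "docFreq" : 7 },
        { "term" : "search", "docFreq" : 12 }
    ] }

With termsAsArray=false:

    { "tags" : {
        "lucene" : { "docFreq" : 7 },
        "search" : { "docFreq" : 12 }
    } }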

clintongormley (Contributor) commented:

> fromInclusive: Whether the from term (if set) is inclusive. Defaults to false.
> toInclusive: Whether the to term (if set) is inclusive. Defaults to true.

You mean fromInclusive defaults to TRUE. I've renamed these exclude_from and exclude_to so that the default (unspecified) is false.

clintongormley (Contributor) commented:

What do you mean by this:

> Field names support indexName-based lookup and full-path lookup (with or without a type prefix).

Can you give me an example of the format?

clintongormley (Contributor) commented:

> fromInclusive: Whether the from term (if set) is inclusive. Defaults to false.
> toInclusive: Whether the to term (if set) is inclusive. Defaults to true.

Actually, these are both incorrect. Currently fromInclusive is true and toInclusive is false.

Why do you have these as different values? From the naming of from and to, I'd expect them both to be inclusive, and only excluded if specified.

kimchy (Member, Author) commented Feb 22, 2010

The idea of fromInclusive and toInclusive is to follow the usual convention for writing a for loop, something like for (i=0;i<10;i++): here the from (0) is inclusive, and the to (10) is not. In any case, I suggest that you follow the same wording and parameters elasticsearch uses, so you won't confuse users. We can talk about whether it makes sense to change this, but meanwhile I suggest keeping it the same.
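
For example (a hypothetical illustration; the index name and terms are invented), if a tag field contains the terms aaa, bbb and ccc, then under that for-loop convention:

    # hypothetical index, field and terms
    GET /my_index/_terms?field=tag&from=aaa&to=ccc

returns aaa and bbb, but not ccc.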

Regarding the field name, it is explained a bit here (http://www.elasticsearch.com/docs/elasticsearch/mapping/object_type/#pathType), though I should add a page that explains it explicitly. For example, if you have (person is the type of the mapping):

{ person : { name : { firstName : "...", lastName : "..." } } }

then the field name (that will match) will be either person.name.firstName or name.firstName. If you add an explicit mapping for the name object (or person), you can control the pathType.
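
In terms requests that means, for example (hypothetical index name), these two address the same field:

    # hypothetical index name
    GET /my_index/_terms?field=person.name.firstName
    GET /my_index/_terms?field=name.firstName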

clintongormley (Contributor) commented:

> The idea of fromInclusive and toInclusive is to follow the usual convention for writing a for loop, something like for (i=0;i<10;i++)

OK - I didn't get that. I would say they should then be called from and until, rather than to.

In Perl (and some other dynamic languages), loops can be written more succinctly:

    for (1..5) { ... }
    foreach my $name (@names) { ... }

... both of which are inclusive. To my mind, basing the default values of fromInclusive and toInclusive on a for loop exposes the implementation, rather than representing how a user might think in natural language.

> Regarding the field name....

OK, I have two mappings: type_1 and type_2. Both have a field 'text', but whether I ask for terms on field 'text' or on 'type_1.text', I get the same results, which doesn't seem to be what I'm asking for.

Is this what it is supposed to do?

kimchy (Member, Author) commented Feb 22, 2010

No problem, makes sense, I will change toInclusive to true.

Regarding the field name: yeah, it's not filtered by type even if you prefix it with the type (which is different from how typed fields behave in search queries, for example). It can be implemented, but it's more difficult and would be much more expensive to perform, so for now I did not implement it.
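
So in your example (hypothetical index name), both of these currently return the merged terms of every text field, regardless of type:

    # hypothetical index name
    GET /my_index/_terms?field=text
    GET /my_index/_terms?field=type_1.text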

dadoonet added a commit that referenced this issue Jun 5, 2015
Due to fix [3790](#3790) in core, upgrading an analyzer provided as a plugin now fails.

See #5030 for details.

Issue is in elasticsearch core code but can be fixed in plugins by overloading `PreBuiltAnalyzerProviderFactory`, `PreBuiltTokenFilterFactoryFactory`, `PreBuiltTokenizerFactoryFactory` or `PreBuiltCharFilterFactoryFactory` when used.

Closes #21
(cherry picked from commit 3401c21)
dadoonet added a commit that referenced this issue Jun 5, 2015
With #2784, we can now add the plugin version in the `es-plugin.properties` file.

It will only be used with elasticsearch 1.0.0 and above. No need to push it to the 1.x branch.

Closes #21.
dadoonet added a commit that referenced this issue Jun 5, 2015
Latest changes break tests

Closes #21.
(cherry picked from commit 04c77e8)
dadoonet added a commit that referenced this issue Jun 5, 2015
Closes #21.
(cherry picked from commit a1b37f6)
dadoonet added a commit that referenced this issue Jun 5, 2015
According to the [Containers naming guide](http://msdn.microsoft.com/en-us/library/dd135715.aspx):

> A container name must be a valid DNS name, conforming to the following naming rules:
>
> * Container names must start with a letter or number, and can contain only letters, numbers, and the dash (-) character.
> * Every dash (-) character must be immediately preceded and followed by a letter or number; consecutive dashes are not permitted in container names.
> * All letters in a container name must be lowercase.
> * Container names must be from 3 through 63 characters long.

We need to fix the documentation and check this before calling the Azure API.
The validation will come with issue #27.

Closes #21.

(cherry picked from commit 6531165)
dadoonet added a commit that referenced this issue Jun 9, 2015
Related to #21.
Closes #22.
(cherry picked from commit c3964ad)
brwe added a commit that referenced this issue Jun 9, 2015
Due to a change in elasticsearch 1.4.0, we need to apply a similar patch here.

See #6864
See #7819

Closes #16.
Closes #21.
rmuir pushed a commit to rmuir/elasticsearch that referenced this issue Nov 8, 2015
Sometimes Tika may crash while parsing some files. In such cases it may throw plain runtime errors (Throwable), not TikaException.
But there is no “catch” clause for Throwable in AttachmentMapper.java:

        String parsedContent;
        try {
            // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
            parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
        } catch (TikaException e) {
            throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
        }

As a result, tika() may “hang up” the whole application.
(We have some PDF files that “hang up” the Elasticsearch client if you try to parse them using the mapper-attachments plugin.)

We propose the following fix:

        String parsedContent;
        try {
            // Set the maximum length of strings returned by the parseToString method, -1 sets no limit
            parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata, indexedChars);
        } catch (Throwable e) {
            throw new MapperParsingException("Failed to extract [" + indexedChars + "] characters of text for [" + name + "]", e);
        }

(just replace “TikaException” with “Throwable” – it works for our cases)

Thank you!
Closes elastic#21.
ywelsch pushed a commit to ywelsch/elasticsearch that referenced this issue Apr 24, 2018
Prior to this change, the `publish()` method comprised a deeply nested
collection of lambdas and anonymous classes representing the notion of a
single publication attempt. In future we want to treat it as a first-class
concept so we can detect when it fails, etc.

This change gives names to the anonymous lambdas and classes as a step towards
this.
ClaudioMFreitas pushed a commit to ClaudioMFreitas/elasticsearch-1 that referenced this issue Nov 12, 2019
Add updated repo to Ubuntu to use a newer version of Ansible
jaymode pushed a commit to jaymode/elasticsearch that referenced this issue Oct 16, 2020
Some revisions to the idea of using a function
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Oct 2, 2023
With this commit we add a new command line parameter
--elasticsearch-plugins to Night Rally. It will pass this parameter to
Rally but also check whether we've specified "x-pack:security" as a
plugin and will make the necessary adjustments (change expected cluster
health and adjust the client options).

Relates elastic#21
This issue was closed.