mmattel committed Feb 6, 2023 (commit d207764, 1 parent 0217830) · services/search/README.md
# Search Service

The search service is responsible for metadata and content extraction; it stores the extracted data as an index and makes it searchable. The following clarifies the extraction terms _metadata_ and _content_:

* Metadata: all data that _describes_ the file, like `Name`, `Size`, `MimeType`, `Tags` and `Mtime`.
* Content: all data that _relates to the content_ of the file, like `words`, `geo data`, `exif data` etc.

## General Considerations

* To use the search service, an event system needs to be configured for all services, like NATS, which is shipped and preconfigured.
* For content extraction, [Apache Tika - a content analysis toolkit](https://tika.apache.org) can be used, but it needs to be installed separately.

Extractions are stored as an index by the search service. Note that indexing requires adequate storage capacity, and the index usually only grows. To prevent a full filesystem from rendering Infinite Scale unusable, the index should reside on its own filesystem.

If the filesystem holding the index gets close to full, you can relocate the path where the search service maintains its data: stop the service, move the data, reconfigure the path in the corresponding environment variable, and restart the service.

When using content extraction, more resources and time are needed because the content of each file has to be analyzed. This is especially true for big files and many concurrent uploads.

The search service runs out of the box with the shipped default `basic` configuration. No further configuration is needed, except when using content extraction.

Note that, as of now, the search service cannot be scaled. Consider using dedicated hardware for this service if more resources are needed.

## Search Engines

By default, the search service ships with [bleve](https://github.com/blevesearch/bleve) as its primary search engine. The available engines can be extended by implementing the [Engine](pkg/engine/engine.go) interface and making that engine available.
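To illustrate the idea, the following is a minimal, hypothetical sketch of a custom engine. The `Engine` interface shown here is simplified for illustration only; the real interface in `pkg/engine/engine.go` has different methods and signatures.

```go
package main

import (
	"fmt"
	"strings"
)

// Engine is a simplified, hypothetical stand-in for the interface a search
// engine must implement. The real interface in pkg/engine/engine.go differs.
type Engine interface {
	Upsert(id string, content string) error
	Search(term string) ([]string, error)
	Delete(id string) error
}

// memoryEngine is a toy in-memory engine used only for illustration.
type memoryEngine struct {
	docs map[string]string
}

func newMemoryEngine() *memoryEngine {
	return &memoryEngine{docs: map[string]string{}}
}

// Upsert stores or replaces the indexed content for a resource ID.
func (e *memoryEngine) Upsert(id, content string) error {
	e.docs[id] = content
	return nil
}

// Search returns the IDs of all resources whose content contains the term.
func (e *memoryEngine) Search(term string) ([]string, error) {
	var ids []string
	for id, content := range e.docs {
		if strings.Contains(content, term) {
			ids = append(ids, id)
		}
	}
	return ids, nil
}

// Delete removes a resource from the index.
func (e *memoryEngine) Delete(id string) error {
	delete(e.docs, id)
	return nil
}

func main() {
	var engine Engine = newMemoryEngine()
	engine.Upsert("file-1", "annual report 2023")
	ids, _ := engine.Search("report")
	fmt.Println(ids)
}
```

A real engine would delegate these calls to its backing index (as the shipped engine does with bleve) instead of a map.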

## Extraction Engines

The search service provides the following extraction engines; their results are used as the index for searching:

* The embedded `basic` configuration provides metadata extraction, which is always on.
* The `tika` configuration _additionally_ provides content extraction, if Tika is installed and configured.

## Content Extraction

The search service is able to manage and retrieve many types of information. For this purpose, content extractors are provided, of which the following are included:

### Basic Extractor

This extractor is the simplest one; it only uses the resource information provided by Infinite Scale and does not do any further analysis. The following fields are included in the index: `Name`, `Size`, `MimeType`, `Tags`, `Mtime`.

### Tika Extractor

This extractor is more advanced than the [Basic extractor](#basic-extractor). The main difference is that this extractor can read file contents and make them searchable.
[Apache Tika](https://tika.apache.org/) is required for this task. Read the [Getting Started with Apache Tika](https://tika.apache.org/2.6.0/gettingstarted.html) guide on how to install and run Tika, or use a ready-to-run [Tika container](https://hub.docker.com/r/apache/tika).

As soon as Tika is installed and accessible, the search service must be configured to use it. The following settings must be set:

* `SEARCH_EXTRACTOR_TYPE=tika`
* `SEARCH_EXTRACTOR_TIKA_TIKA_URL=http://YOUR-TIKA.URL`

When the search service can reach Tika, it reads out the content on demand. Note that files must be downloaded during the process, which can lead to delays with larger documents.

## Search Functionality

The search service consists of two main parts: file `indexing` and file `search`.

### Indexing

Every time a resource changes its state, a corresponding event is triggered. Based on the event, the search service processes the file and updates its index. There are a few more steps between accepting the file and updating the index.

### Search

A query via the search service will return results based on the index created.

### State Changes which Trigger Indexing

The following state changes in the life cycle of a file are taken into account for creating or updating the index:

#### Resource Trashed

The service checks its index to see whether the file already exists and marks it as deleted. From that moment on, the file no longer appears in any search request.
The index entry itself stays intact.

#### Resource Restored

This step is the counterpart of [Resource trashed](#resource-trashed): when deleting, the file is not really removed from the index; instead, the service just marks it as deleted.
Restoring undoes that, and the file can be found again.

#### Resource Moved

This comes into play whenever a file or folder is renamed or moved. The search service then updates the resource location path in the index, or starts indexing all affected items if no index entry has been created for them so far. See [Notes](#notes) for an example.

#### Folder Created

This step is always executed when a folder is created. The search extracts all necessary information and stores it in the search index

#### File Created

This step is similar to [Folder created](#folder-created), with the difference that a file can contain even more valuable information. This becomes interesting when the actual content needs to be exploited. Content extraction is part of the search service if configured.

#### File Version Restored

Since ocis is capable of storing multiple versions of the same file, the search service also needs to take care of those versions.
When this step is triggered, the service extracts all needed information, stores it in the index, and makes it discoverable.

#### Resource Tag Added

Whenever a resource gets a new tag, the service takes care of it and makes that resource discoverable by that tag.

#### Resource Tag Removed

This is the counterpart of [Resource tag added](#resource-tag-added); it ensures that the tag gets unassigned from the referenced resource.

#### File Uploaded - Synchronous

This step is triggered only if `async post processing` is disabled. In that case, the service extracts all needed file information, stores it in the index, and makes it discoverable.

#### File Uploaded - Asynchronous

This is exactly the same as [File uploaded - synchronous](#file-uploaded---synchronous), with the only difference that it is used for asynchronous uploads.

## Manually Trigger Re-Indexing a Space

The service contains a CLI to trigger re-indexing a space:

```shell
ocis search index --space $SPACE_ID --user $USER_ID
```

Note that IDs are required, not names, and that the specified user needs access to the space to be indexed.

## Notes

The indexing process tries to be self-healing in some situations.

For example, assume a file tree `foo/bar/baz`.
If the folder `bar` gets renamed to `new-bar`, the path to `baz` is no longer `foo/bar/baz` but `foo/new-bar/baz`.
The search service detects the change and either just updates the paths in the index or, if no index entries were present, creates new entries for all affected items.
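The path update described above can be sketched roughly as follows. This is a simplified illustration with a plain map standing in for the real index; the actual service operates on its search engine's index instead.

```go
package main

import (
	"fmt"
	"strings"
)

// updatePaths illustrates the move handling: every indexed path at or under
// oldPrefix is rewritten to live under newPrefix, so descendants like
// foo/bar/baz follow a rename of foo/bar to foo/new-bar.
func updatePaths(index map[string]string, oldPrefix, newPrefix string) {
	for id, path := range index {
		if path == oldPrefix || strings.HasPrefix(path, oldPrefix+"/") {
			index[id] = newPrefix + strings.TrimPrefix(path, oldPrefix)
		}
	}
}

func main() {
	index := map[string]string{
		"id-1": "foo/bar",
		"id-2": "foo/bar/baz",
		"id-3": "foo/other",
	}
	updatePaths(index, "foo/bar", "foo/new-bar")
	fmt.Println(index["id-2"]) // foo/new-bar/baz
}
```

Note how `id-3` is untouched: only the moved folder and its descendants need new paths, which is why a single rename can still fan out into many index updates.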
