Added k-nn user guide and samples. #449

Merged
merged 4 commits into from
Jul 26, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@ Inspired from [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Added support for latest OpenSearch versions 2.7.0, 2.8.0 ([#445](https://github.com/opensearch-project/opensearch-py/pull/445))
- Added samples ([#447](https://github.com/opensearch-project/opensearch-py/pull/447))
- Improved CI performance of integration with unreleased OpenSearch ([#318](https://github.com/opensearch-project/opensearch-py/pull/318))
- Added k-NN guide and samples ([#449](https://github.com/opensearch-project/opensearch-py/pull/449))
### Changed
- Moved security from `plugins` to `clients` ([#442](https://github.com/opensearch-project/opensearch-py/pull/442))
### Deprecated
12 changes: 5 additions & 7 deletions USER_GUIDE.md
@@ -26,11 +26,7 @@ Then import it like any other module:
from opensearchpy import OpenSearch
```

For better performance we recommend the async client. To add the async client to your project, install it using [pip](https://pip.pypa.io/):

```bash
pip install opensearch-py[async]
```
For better performance we recommend the async client. See [Asynchronous I/O](guides/async.md) for more information.

In general, we recommend using a package manager, such as [poetry](https://python-poetry.org/docs/), for your projects. This is the package manager used for [samples](samples).

@@ -61,7 +57,7 @@ info = client.info()
print(f"Welcome to {info['version']['distribution']} {info['version']['number']}!")
```

See [hello.py](samples/hello/hello.py) for a working sample, and [guides/ssl](guides/ssl.md) for how to setup SSL certificates.
See [hello.py](samples/hello/hello.py) for a working synchronous sample, and [guides/ssl](guides/ssl.md) for how to set up SSL certificates.

### Creating an Index

@@ -148,6 +144,7 @@ print(response)

## Advanced Features

- [Asynchronous I/O](guides/async.md)
- [Authentication (IAM, SigV4)](guides/auth.md)
- [Configuring SSL](guides/ssl.md)
- [Bulk Indexing](guides/bulk.md)
@@ -161,4 +158,5 @@ print(response)

- [Security](guides/plugins/security.md)
- [Alerting](guides/plugins/alerting.md)
- [Index Management](guides/plugins/index_management.md)
- [Index Management](guides/plugins/index_management.md)
- [k-NN](guides/plugins/knn.md)
152 changes: 152 additions & 0 deletions guides/async.md
@@ -0,0 +1,152 @@
- [Asynchronous I/O](#asynchronous-io)
  - [Setup](#setup)
  - [Async Loop](#async-loop)
  - [Connect to OpenSearch](#connect-to-opensearch)
  - [Create an Index](#create-an-index)
  - [Index Documents](#index-documents)
  - [Refresh the Index](#refresh-the-index)
  - [Search](#search)
  - [Delete Documents](#delete-documents)
  - [Delete the Index](#delete-the-index)

# Asynchronous I/O

This client supports asynchronous I/O, which can improve performance and increase throughput when many requests are in flight at once. See [hello-async.py](../samples/hello/hello-async.py) or [knn-async-basics.py](../samples/knn/knn-async-basics.py) for a working asynchronous sample.

## Setup

To add the async client to your project, install it using [pip](https://pip.pypa.io/):

```bash
pip install opensearch-py[async]
```

In general, we recommend using a package manager, such as [poetry](https://python-poetry.org/docs/), for your projects. This is the package manager used for [samples](../samples). The following example includes `opensearch-py[async]` in `pyproject.toml`.

```toml
[tool.poetry.dependencies]
opensearch-py = { path = "../", extras=["async"] }
```

## Async Loop

```python
import asyncio

from opensearchpy import AsyncOpenSearch

async def main():
    client = AsyncOpenSearch(...)
    try:
        pass  # your code here
    finally:
        # close() is a coroutine on the async client, so it must be awaited
        await client.close()

if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(main())
    loop.close()
```
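
On Python 3.7 and later, `asyncio.run()` creates the event loop, runs the coroutine, and closes the loop in one call, so it can replace the explicit loop management shown above. The following is a minimal sketch of that pattern. `StubClient` is a hypothetical stand-in for `AsyncOpenSearch`, used only so that the snippet runs without a cluster, and the version number it returns is a placeholder.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in for AsyncOpenSearch; not part of opensearch-py."""

    def __init__(self):
        self.closed = False

    async def info(self):
        # Placeholder values in the shape that `client.info()` returns.
        return {"version": {"distribution": "opensearch", "number": "0.0.0"}}

    async def close(self):
        self.closed = True


async def main(client):
    try:
        # Your code here; every client call is awaited.
        return await client.info()
    finally:
        # Release the client's connections even if a request fails.
        await client.close()


client = StubClient()
info = asyncio.run(main(client))
print(info["version"]["distribution"])
```

With the real client, `main()` would construct `AsyncOpenSearch(...)` itself, as in the block above.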

## Connect to OpenSearch

```python
host = 'localhost'
port = 9200
auth = ('admin', 'admin') # For testing only. Don't store credentials in code.

client = AsyncOpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = False,
    ssl_show_warn = False
)

info = await client.info()
print(f"Welcome to {info['version']['distribution']} {info['version']['number']}!")
```
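
The comment in the example above warns against storing credentials in code. One common alternative is to read them from environment variables. The sketch below builds the keyword arguments for `AsyncOpenSearch` that way. The `OPENSEARCH_*` variable names and the `connection_settings` helper are assumptions made for this illustration, not names defined by opensearch-py.

```python
import os


def connection_settings(env=os.environ):
    """Build keyword arguments for AsyncOpenSearch from environment variables.

    The OPENSEARCH_* variable names are assumptions made for this sketch.
    """
    host = env.get("OPENSEARCH_HOST", "localhost")
    port = int(env.get("OPENSEARCH_PORT", "9200"))
    user = env["OPENSEARCH_USER"]  # raises KeyError if the variable is missing
    password = env["OPENSEARCH_PASSWORD"]
    return {
        "hosts": [{"host": host, "port": port}],
        "http_auth": (user, password),
        "use_ssl": True,
        "verify_certs": False,  # as in the example above; for local testing only
        "ssl_show_warn": False,
    }


# A plain dict stands in for the environment here so the sketch is runnable;
# in real use, call connection_settings() with no arguments.
settings = connection_settings(
    {"OPENSEARCH_USER": "admin", "OPENSEARCH_PASSWORD": "example-password"}
)
# client = AsyncOpenSearch(**settings)
```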

## Create an Index

```python
index_name = 'test-index'

index_body = {
    'settings': {
        'index': {
            'number_of_shards': 4
        }
    }
}

if not await client.indices.exists(index=index_name):
    await client.indices.create(
        index_name,
        body=index_body
    )
```

## Index Documents

```python
await asyncio.gather(*[
    client.index(
        index = index_name,
        body = {
            'title': f"Moneyball {i}",
            'director': 'Bennett Miller',
            'year': '2011'
        },
        id = i
    ) for i in range(10)
])
```
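
`asyncio.gather` schedules all ten `client.index` calls concurrently and returns their responses in the same order as the inputs. The sketch below shows that behavior in isolation. `fake_index` is a hypothetical stand-in for `client.index`, and the response it returns is an illustrative fragment rather than real server output.

```python
import asyncio


async def fake_index(index, body, id):
    """Hypothetical stand-in for `client.index`; echoes what it would send."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    return {"_index": index, "_id": str(id), "result": "created"}


async def index_all(count):
    # One coroutine per document; gather runs them concurrently and returns
    # the responses in the same order as the inputs.
    return await asyncio.gather(*[
        fake_index(
            index="test-index",
            body={"title": f"Moneyball {i}"},
            id=i,
        )
        for i in range(count)
    ])


responses = asyncio.run(index_all(10))
print([r["_id"] for r in responses])
```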

## Refresh the Index

```python
await client.indices.refresh(index=index_name)
```

## Search

```python
q = 'miller'

query = {
    'size': 5,
    'query': {
        'multi_match': {
            'query': q,
            'fields': ['title^2', 'director']
        }
    }
}

results = await client.search(
    body = query,
    index = index_name
)

for hit in results["hits"]["hits"]:
    print(hit)
```
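
The search body and the loop over `results["hits"]["hits"]` can be wrapped in small helpers. This is a minimal sketch: `multi_match_query` and `titles` are hypothetical helper names, and `sample_results` is a hand-written fragment in the shape that `client.search` returns, not real output.

```python
def multi_match_query(text, fields, size=5):
    """Return a search body like the one above for any text and field list."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
            }
        },
    }


def titles(results):
    """Pull the `title` out of each hit's `_source`."""
    return [hit["_source"]["title"] for hit in results["hits"]["hits"]]


query = multi_match_query("miller", ["title^2", "director"])

# Hand-written response fragment; the values are illustrative only.
sample_results = {
    "hits": {
        "hits": [
            {"_id": "0", "_source": {"title": "Moneyball 0", "director": "Bennett Miller"}},
            {"_id": "1", "_source": {"title": "Moneyball 1", "director": "Bennett Miller"}},
        ]
    }
}
print(titles(sample_results))
```

With a connected client, the body would be passed as `await client.search(body=query, index=index_name)`.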

## Delete Documents

```python
await asyncio.gather(*[
    client.delete(
        index = index_name,
        id = i
    ) for i in range(10)
])
```

## Delete the Index

```python
await client.indices.delete(
    index = index_name
)
```
117 changes: 117 additions & 0 deletions guides/plugins/knn.md
@@ -0,0 +1,117 @@
- [k-NN Plugin](#k-nn-plugin)
  - [Basic Approximate k-NN](#basic-approximate-k-nn)
    - [Create an Index](#create-an-index)
    - [Index Vectors](#index-vectors)
    - [Search for Nearest Neighbors](#search-for-nearest-neighbors)
  - [Approximate k-NN with a Boolean Filter](#approximate-k-nn-with-a-boolean-filter)
  - [Approximate k-NN with an Efficient Filter](#approximate-k-nn-with-an-efficient-filter)

# k-NN Plugin

Short for k-nearest neighbors, the k-NN plugin lets you search an index of vectors for the k nearest neighbors of a query point. See the [documentation](https://opensearch.org/docs/latest/search-plugins/knn/index/) for more information.

## Basic Approximate k-NN

In the following example we create a 5-dimensional k-NN index with random data. You can find a synchronous version of this working sample in [samples/knn/knn-basics.py](../../samples/knn/knn-basics.py) and an asynchronous one in [samples/knn/knn-async-basics.py](../../samples/knn/knn-async-basics.py).

```bash
$ poetry run knn/knn-basics.py

Searching for [0.61, 0.05, 0.16, 0.75, 0.49] ...
{'_index': 'my-index', '_id': '3', '_score': 0.9252405, '_source': {'values': [0.64, 0.3, 0.27, 0.68, 0.51]}}
{'_index': 'my-index', '_id': '4', '_score': 0.802375, '_source': {'values': [0.49, 0.39, 0.21, 0.42, 0.42]}}
{'_index': 'my-index', '_id': '8', '_score': 0.7826564, '_source': {'values': [0.33, 0.33, 0.42, 0.97, 0.56]}}
```
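
The `_score` values in this output are consistent with the score the plugin reports for the `l2` space type, `1 / (1 + d²)`, where `d` is the Euclidean distance between the query vector and the stored vector. The sketch below recomputes the three scores from the printed vectors. It assumes the index uses `l2`, which is the case when the mapping sets no other space type, and `l2_score` is a hypothetical helper name.

```python
def l2_score(query, vector):
    """Score as 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


# Query vector and hits copied from the sample output above.
query = [0.61, 0.05, 0.16, 0.75, 0.49]
hits = {
    "3": [0.64, 0.3, 0.27, 0.68, 0.51],
    "4": [0.49, 0.39, 0.21, 0.42, 0.42],
    "8": [0.33, 0.33, 0.42, 0.97, 0.56],
}
for doc_id, vector in hits.items():
    print(doc_id, round(l2_score(query, vector), 4))
```

A closer vector has a smaller distance and therefore a score nearer to 1, which is why document `3` ranks first.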

### Create an Index

```python
dimensions = 5
client.indices.create(index_name,
    body={
        "settings":{
            "index.knn": True
        },
        "mappings":{
            "properties": {
                "values": {
                    "type": "knn_vector",
                    "dimension": dimensions
                },
            }
        }
    }
)
```

### Index Vectors

Create 10 random vectors and insert them using the bulk API.

```python
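import random

from opensearchpy import helpers

# Build one bulk action per vector; each vector has `dimensions` random values.
vectors = []
for i in range(10):
    vec = []
    for j in range(dimensions):
        vec.append(round(random.uniform(0, 1), 2))

    vectors.append({
        "_index": index_name,
        "_id": i,
        "values": vec,
    })

helpers.bulk(client, vectors)

# Make the newly indexed vectors visible to search.
client.indices.refresh(index=index_name)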
```
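
The same bulk actions can be produced by a generator, which avoids building the nested loops inline and makes the random data reproducible when a seed is given. This is a sketch only: `random_vector` and `bulk_actions` are hypothetical helper names, and the `seed` parameter is an addition that the sample does not have.

```python
import random


def random_vector(dimensions, rng):
    """Return `dimensions` floats in [0, 1], rounded to two decimals."""
    return [round(rng.uniform(0, 1), 2) for _ in range(dimensions)]


def bulk_actions(index_name, count, dimensions, seed=None):
    """Yield one bulk action per vector, in the shape used above."""
    rng = random.Random(seed)  # a fixed seed makes the data reproducible
    for i in range(count):
        yield {
            "_index": index_name,
            "_id": i,
            "values": random_vector(dimensions, rng),
        }


actions = list(bulk_actions("my-index", count=10, dimensions=5, seed=42))
# helpers.bulk(client, actions)  # with a connected client
```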

### Search for Nearest Neighbors

Create a random vector of the same size and search for its nearest neighbors.

```python
vec = []
for j in range(dimensions):
    vec.append(round(random.uniform(0, 1), 2))

search_query = {
    "query": {
        "knn": {
            "values": {
                "vector": vec,
                "k": 3
            }
        }
    }
}

results = client.search(index=index_name, body=search_query)

for hit in results["hits"]["hits"]:
    print(hit)
```

Review comment: Can we add that disabling the `_source` field helps reduce search latency?

Reply (Member, Author): We could, but that feels like a very narrow use case, where I don't have any additional data and only want the ID, no?

Review comment: I would say we can have a section on how to use the API in production for best performance, which is why I think we should mention it. This is very relevant for k-NN.
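
The review thread on this section suggests that omitting `_source` can reduce search latency when only document IDs and scores are needed. The sketch below shows how the request body could express that. `knn_query` is a hypothetical helper, and the latency benefit is the reviewer's claim rather than something measured here.

```python
def knn_query(field, vector, k, include_source=True):
    """Build a k-NN search body; optionally ask for hits without `_source`."""
    body = {
        "query": {
            "knn": {
                field: {
                    "vector": list(vector),
                    "k": k,
                }
            }
        }
    }
    if not include_source:
        # `_source: False` asks the search API to leave the stored document
        # out of each hit, so hits carry only metadata such as `_id` and `_score`.
        body["_source"] = False
    return body


search_query = knn_query("values", [0.61, 0.05, 0.16, 0.75, 0.49], k=3, include_source=False)
# results = client.search(index=index_name, body=search_query)
```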

## Approximate k-NN with a Boolean Filter

In [the knn-boolean-filter.py sample](../../samples/knn/knn-boolean-filter.py) we create a 5-dimensional k-NN index with random data and a `metadata` field that contains a book genre (e.g. `fiction`). The search query is a k-NN search filtered by genre. The filter clause is outside the k-NN query clause and is applied after the k-NN search.

```bash
$ poetry run knn/knn-boolean-filter.py

Searching for [0.08, 0.42, 0.04, 0.76, 0.41] with the 'romance' genre ...

{'_index': 'my-index', '_id': '445', '_score': 0.95886475, '_source': {'values': [0.2, 0.54, 0.08, 0.87, 0.43], 'metadata': {'genre': 'romance'}}}
{'_index': 'my-index', '_id': '2816', '_score': 0.95256233, '_source': {'values': [0.22, 0.36, 0.01, 0.75, 0.57], 'metadata': {'genre': 'romance'}}}
```
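
A query of this kind places the `knn` clause and the genre filter side by side inside a `bool` query. The following is a minimal sketch of such a request body, not a copy of the sample: the `boolean_filter_query` helper is hypothetical, and the `metadata.genre` field path is inferred from the output above.

```python
def boolean_filter_query(vector, genre, k=5):
    """Build a k-NN search body with a genre filter outside the knn clause."""
    return {
        "query": {
            "bool": {
                # Applied to the k-NN results, so it can only remove hits.
                "filter": {"term": {"metadata.genre": genre}},
                # Approximate k-NN search over the `values` field.
                "must": {"knn": {"values": {"vector": list(vector), "k": k}}},
            }
        }
    }


search_query = boolean_filter_query([0.08, 0.42, 0.04, 0.76, 0.41], "romance")
# results = client.search(index=index_name, body=search_query)
```

Because the filter runs after the k-NN search, a query like this can return fewer than `k` hits when some of the nearest neighbors belong to other genres.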

## Approximate k-NN with an Efficient Filter

In [the knn-efficient-filter.py sample](../../samples/knn/knn-efficient-filter.py) we implement the example in [the k-NN documentation](https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/). It creates an index that uses the Lucene engine with HNSW as the method in the mapping and that contains hotel location and parking data. It then searches for the top three hotels near the location with the coordinates `[5, 4]` that are rated between 8 and 10, inclusive, and provide parking.

```bash
$ poetry run knn/knn-efficient-filter.py

{'_index': 'hotels-index', '_id': '3', '_score': 0.72992706, '_source': {'location': [4.9, 3.4], 'parking': 'true', 'rating': 9}}
{'_index': 'hotels-index', '_id': '6', '_score': 0.3012048, '_source': {'location': [6.4, 3.4], 'parking': 'true', 'rating': 9}}
{'_index': 'hotels-index', '_id': '5', '_score': 0.24154587, '_source': {'location': [3.3, 4.5], 'parking': 'true', 'rating': 8}}
```
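
With efficient filtering, the filter sits inside the `knn` clause instead of next to it, so it is applied during the k-NN search. The sketch below builds a request body along the lines of the documentation example. The `efficient_filter_query` helper and its parameter names are hypothetical, and the field names are taken from the output above.

```python
def efficient_filter_query(location, k=3, min_rating=8, max_rating=10):
    """Build a k-NN search body with the filter inside the knn clause."""
    return {
        "size": k,
        "query": {
            "knn": {
                "location": {
                    "vector": list(location),
                    "k": k,
                    # The filter is part of the knn clause itself.
                    "filter": {
                        "bool": {
                            "must": [
                                {"range": {"rating": {"gte": min_rating, "lte": max_rating}}},
                                {"term": {"parking": "true"}},
                            ]
                        }
                    },
                }
            }
        },
    }


search_query = efficient_filter_query([5, 4])
# results = client.search(index="hotels-index", body=search_query)
```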