# Qdrant-Spark Connector πŸ’₯

[Apache Spark](https://spark.apache.org/) is a distributed computing framework designed for big data processing and analytics. This connector enables [Qdrant](https://qdrant.tech/) to be a storage destination in Spark.

## Installation πŸš€

### GitHub Releases πŸ“¦

The packaged `jar` file releases can be found [here](https://github.com/qdrant/qdrant-spark/releases).

### Building from source πŸ› οΈ

To build the `jar` from source, you need [JDK 17](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) and [Maven](https://maven.apache.org/) installed.
Once the requirements are satisfied, run the following command in the project root:

```bash
mvn package -Passembly
```

This builds the fat JAR and stores it in the `target` directory by default.

### Maven Central πŸ“š

The package will be published to Maven Central soon.

## Usage πŸ“

### Creating a Spark session (Single-node) with Qdrant support 🌟

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.config(
        "spark.jars",
        "spark-1.0-SNAPSHOT-jar-with-dependencies.jar",  # path to the downloaded JAR file
    )
    .master("local[*]")
    .appName("qdrant")
    .getOrCreate()
)
```

### Loading data πŸ“Š

To load data into Qdrant, a collection has to be created beforehand with the appropriate vector dimensions and configurations.
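
For example, a compatible collection can be created up front with the Python [qdrant-client](https://github.com/qdrant/qdrant-client) library (a minimal sketch; the URL, collection name, and vector parameters are placeholders you should adapt to your data):

```python
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="my_collection",  # must match the collection_name option below
    vectors_config=models.VectorParams(
        size=384,  # must equal the dimension of the vectors in embedding_field
        distance=models.Distance.COSINE,
    ),
)
```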

```python
<pyspark.sql.DataFrame>
    .write
    .format("io.qdrant.spark.Qdrant")
    .option("qdrant_url", <QDRANT_URL>)
    .option("collection_name", <QDRANT_COLLECTION_NAME>)
    .option("embedding_field", <EMBEDDING_FIELD_NAME>)
    .option("schema", <pyspark.sql.DataFrame>.schema.json())
    .mode("append")
    .save()
```

* By default, UUIDs are generated for each row. If you need custom IDs, set the `id_field` option.
* An API key can be set using the `api_key` option to make authenticated requests. Both options are shown in the sketch below.
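
As a sketch, a write that sets both optional fields might look like this, assuming a hypothetical dataframe `df` with `doc_id` and `embedding` columns and a locally running Qdrant instance:

```python
(
    df.write.format("io.qdrant.spark.Qdrant")
    .option("qdrant_url", "http://localhost:6333")
    .option("collection_name", "my_collection")
    .option("embedding_field", "embedding")
    .option("id_field", "doc_id")  # column supplying custom point IDs
    .option("api_key", "<YOUR_API_KEY>")  # sent in the request header
    .option("schema", df.schema.json())
    .mode("append")
    .save()
)
```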

## Datatype support πŸ“‹

The connector supports all Spark data types; each is mapped to an appropriate Qdrant type based on the provided `schema`.
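
As an illustrative sketch (column names are hypothetical), an embedding field can be a plain array of floats next to ordinary payload columns:

```python
from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType

schema = StructType([
    StructField("doc_id", StringType(), nullable=False),
    StructField("text", StringType(), nullable=True),
    StructField("embedding", ArrayType(FloatType()), nullable=False),
])

df = spark.createDataFrame(
    [("doc-1", "hello qdrant", [0.12, 0.34, 0.56])],
    schema=schema,
)
```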

## Options πŸ› οΈ

| Option | Description | Required |
| :-------- | :------- | :------------|
| `qdrant_url` | `string` REST URL of the Qdrant instance | βœ… |
| `collection_name` | `string` Name of the collection to write data into | βœ… |
| `embedding_field` | `string` Name of the field holding the embeddings | βœ… |
| `schema` | `string` JSON string of the dataframe schema | βœ… |
| `mode` | `string` Write mode of the dataframe | βœ… |
| `id_field` | `string` Name of the field holding the point IDs. Default: a random UUID per row | ❌ |
| `batch_size` | `int` Max size of the upload batch. Default: 100 | ❌ |
| `retries` | `int` Number of upload retries. Default: 3 | ❌ |
| `api_key` | `string` API key to be sent in the header. Default: null | ❌ |

## LICENSE πŸ“œ

Apache 2.0 Β© 2023 [Qdrant](https://github.com/qdrant/qdrant-spark)
