# Qdrant-Spark Connector

[Apache Spark](https://spark.apache.org/) is a distributed computing framework designed for big data processing and analytics. This connector enables [Qdrant](https://qdrant.tech/) to be a storage destination in Spark.

## Installation

### GitHub Releases

The packaged `jar` file releases can be found [here](https://github.com/qdrant/qdrant-spark/releases).

### Building from source

To build the `jar` from source, you need [JDK 17](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) and [Maven](https://maven.apache.org/) installed.
Once the requirements are satisfied, run the following command in the project root.

```bash
mvn package -Passembly
```
This will build and store the fat JAR in the `target` directory by default.

### Maven Central

The package will be available on Maven Central soon.

## Usage

### Creating a Spark session (Single-node) with Qdrant support

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.config(
        "spark.jars",
        "spark-1.0-SNAPSHOT-jar-with-dependencies.jar",  # specify the downloaded JAR file
    )
    .master("local[*]")
    .appName("qdrant")
    .getOrCreate()
)
```

### Loading data

To load data into Qdrant, a collection has to be created beforehand with the appropriate vector dimensions and configuration; a sketch of creating one follows.

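For reference, here is a minimal sketch of creating such a collection with the [qdrant-client](https://github.com/qdrant/qdrant-client) Python package (separate from this connector). The collection name, vector size, and distance metric below are placeholder assumptions; match them to your embeddings.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumed local Qdrant instance

# The vector size must match the dimension of the embeddings you will write.
client.create_collection(
    collection_name="my_collection",  # hypothetical collection name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
```
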
```python
<pyspark.sql.DataFrame>
   .write
   .format("io.qdrant.spark.Qdrant")
   .option("qdrant_url", <QDRANT_URL>)
   .option("collection_name", <QDRANT_COLLECTION_NAME>)
   .option("embedding_field", <EMBEDDING_FIELD_NAME>)
   .option("schema", <pyspark.sql.DataFrame>.schema.json())
   .mode("append")
   .save()
```

* By default, UUIDs are generated for each row. If you need custom IDs, you can set the `id_field` option, as shown in the sketch below.
* An API key can be set using the `api_key` option to make authenticated requests.

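As a worked illustration of these options, here is a hedged sketch that writes a tiny dataframe with custom IDs and an API key. The dataframe contents, collection name, URL, and key are placeholders, and the target collection is assumed to already exist with matching vector dimensions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("qdrant-example").getOrCreate()

# Hypothetical dataframe: an integer ID and a small embedding per row.
df = spark.createDataFrame(
    [(1, [0.1, 0.2, 0.3, 0.4]), (2, [0.5, 0.6, 0.7, 0.8])],
    ["doc_id", "embedding"],
)

(
    df.write.format("io.qdrant.spark.Qdrant")
    .option("qdrant_url", "http://localhost:6333")  # assumed local REST URL
    .option("collection_name", "my_collection")     # hypothetical collection
    .option("embedding_field", "embedding")
    .option("id_field", "doc_id")                   # use custom point IDs
    .option("api_key", "<YOUR_API_KEY>")            # optional; for authenticated requests
    .option("schema", df.schema.json())
    .mode("append")
    .save()
)
```
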
## Datatype support

The connector supports all Spark data types; values are mapped to appropriate Qdrant types based on the provided `schema`.

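Because the connector relies on the `schema` option for this mapping, it can be useful to inspect the JSON string it receives. A minimal sketch, with hypothetical column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

df = spark.createDataFrame([(1, [0.1, 0.2])], ["doc_id", "embedding"])

# The connector receives this JSON string via the "schema" option
# and uses it to map Spark types to Qdrant types.
print(df.schema.json())
# Roughly: {"type":"struct","fields":[{"name":"doc_id","type":"long",...},
#          {"name":"embedding","type":{"type":"array","elementType":"double",...},...}]}
```
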
## Options

| Option | Description | Required |
| :---------------- | :--------------------------------------------------------- | :------- |
| `qdrant_url` | `string` REST URL of the Qdrant instance | ✅ |
| `collection_name` | `string` Name of the collection to write data into | ✅ |
| `embedding_field` | `string` Name of the field holding the embeddings | ✅ |
| `id_field` | `string` Name of the field holding the point IDs | ❌ |
| `schema` | `string` JSON string of the dataframe schema | ✅ |
| `mode` | `string` Write mode of the dataframe | ✅ |
| `batch_size` | `int` Max size of the upload batch. Default: 100 | ❌ |
| `retries` | `int` Number of upload retries. Default: 3 | ❌ |
| `api_key` | `string` API key to be sent in the header. Default: null | ❌ |

## LICENSE

Apache 2.0 © [2023](https://github.com/qdrant/qdrant-spark)