refactor!: Qdrant Java client, named vectors (#8)
* refactor: v2 use Qdrant Java client

BREAKING

* feat: named vector support

* fix: deps bug

* ci: remove fat jar PR

* ci: try release on master only

* refactor: try java 8

* chore: remove maven-surefire-plugin

* chore: update String.join() delimiter

* refactor: Spark type parsing

* chore: formatting

* chore: improve options handling Qdrant.java

* chore: array data parsing

* chore: formatting

* fix: com.google.guava.guava incompatibility

* ci: Use Java 8

* chore: compress JAR, removed Spotify format

* fix: merge conf

* chore: ID long

* docs: Updated README.md

* chore: throw on invalid ID type
Anush008 committed Feb 28, 2024
1 parent 95e78e3 commit 0f5d903
Showing 16 changed files with 492 additions and 260 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/release.yml
@@ -23,11 +23,11 @@ jobs:
           echo "AUTHOR_EMAIL=$AUTHOR_EMAIL" >> $GITHUB_OUTPUT
         id: author_info
 
-      - name: Set up Java 17
+      - name: Set up Java 8
         uses: actions/setup-java@v3
         with:
-          distribution: 'oracle'
-          java-version: '17'
+          java-version: "8"
+          distribution: temurin
           server-id: ossrh
           server-username: OSSRH_JIRA_USERNAME
           server-password: OSSRH_JIRA_PASSWORD
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -16,9 +16,9 @@ jobs:
       - uses: actions/checkout@v4
       - uses: actions/setup-java@v3
         with:
-          java-version: "17"
+          java-version: "8"
           distribution: temurin
       - name: Run the Maven tests
         run: mvn test
       - name: Generate assembly fat JAR
-        run: mvn clean package -Passembly
+        run: mvn clean package
23 changes: 12 additions & 11 deletions README.md
@@ -5,19 +5,19 @@
 
 ## Installation 🚀
 
 > [!IMPORTANT]
-> Requires Java 17 or above.
+> Requires Java 8 or above.
 
 ### GitHub Releases 📦
 
 The packaged `jar` file releases can be found [here](https://github.com/qdrant/qdrant-spark/releases).
 
 ### Building from source 🛠️
 
-To build the `jar` from source, you need [JDK@17](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) and [Maven](https://maven.apache.org/) installed.
+To build the `jar` from source, you need [JDK@8](https://www.azul.com/downloads/#zulu) and [Maven](https://maven.apache.org/) installed.
 Once the requirements have been satisfied, run the following command in the project root. 🛠️
 
 ```bash
-mvn package -P assembly
+mvn package
 ```
 
 This will build and store the fat JAR in the `target` directory by default.
@@ -30,7 +30,7 @@ For use with Java and Scala projects, the package can be found [here](https://ce
 <dependency>
     <groupId>io.qdrant</groupId>
     <artifactId>spark</artifactId>
-    <version>1.12.1</version>
+    <version>2.0</version>
 </dependency>
 ```

@@ -43,7 +43,7 @@ from pyspark.sql import SparkSession
 
 spark = SparkSession.builder.config(
     "spark.jars",
-    "spark-1.12.1-jar-with-dependencies.jar",  # specify the downloaded JAR file
+    "spark-2.0.jar",  # specify the downloaded JAR file
 )
 .master("local[*]")
 .appName("qdrant")
@@ -58,7 +58,7 @@ To load data into Qdrant, a collection has to be created beforehand with the app
 <pyspark.sql.DataFrame>
    .write
    .format("io.qdrant.spark.Qdrant")
-   .option("qdrant_url", <QDRANT_URL>)
+   .option("qdrant_url", <QDRANT_GRPC_URL>)
    .option("collection_name", <QDRANT_COLLECTION_NAME>)
    .option("embedding_field", <EMBEDDING_FIELD_NAME>)  # Expected to be a field of type ArrayType(FloatType)
    .option("schema", <pyspark.sql.DataFrame>.schema.json())
@@ -70,31 +70,32 @@ To load data into Qdrant, a collection has to be created beforehand with the app
 - An API key can be set using the `api_key` option to make authenticated requests.
 
 ## Databricks
 
 You can use the `qdrant-spark` connector as a library in Databricks to ingest data into Qdrant.
 
 - Go to the `Libraries` section in your cluster dashboard.
 - Select `Install New` to open the library installation modal.
-- Search for `io.qdrant:spark:1.12.1` in the Maven packages and click `Install`.
+- Search for `io.qdrant:spark:2.0` in the Maven packages and click `Install`.
 
 <img width="1064" alt="Screenshot 2024-01-05 at 17 20 01 (1)" src="https://github.com/qdrant/qdrant-spark/assets/46051506/d95773e0-c5c6-4ff2-bf50-8055bb08fd1b">
 
 ## Datatype support 📋
 
-Qdrant supports all the Spark data types, and the appropriate types are mapped based on the provided `schema`.
+Qdrant supports all the Spark data types. The appropriate types are mapped based on the provided `schema`.
 
 ## Options and Spark types 🛠️
 
 | Option            | Description                                                               | DataType               | Required |
 | :---------------- | :------------------------------------------------------------------------ | :--------------------- | :------- |
-| `qdrant_url`      | REST URL of the Qdrant instance                                           | `StringType`           | ✅       |
+| `qdrant_url`      | GRPC URL of the Qdrant instance. Eg: <http://localhost:6334>              | `StringType`           | ✅       |
 | `collection_name` | Name of the collection to write data into                                 | `StringType`           | ✅       |
 | `embedding_field` | Name of the field holding the embeddings                                  | `ArrayType(FloatType)` | ✅       |
 | `schema`          | JSON string of the dataframe schema                                       | `StringType`           | ✅       |
 | `mode`            | Write mode of the dataframe. Supports "append".                           | `StringType`           | ✅       |
 | `id_field`        | Name of the field holding the point IDs. Default: Generates a random UUID | `StringType`           | ❌       |
 | `batch_size`      | Max size of the upload batch. Default: 100                                | `IntType`              | ❌       |
 | `retries`         | Number of upload retries. Default: 3                                      | `IntType`              | ❌       |
 | `api_key`         | Qdrant API key to be sent in the header. Default: null                    | `StringType`           | ❌       |
+| `vector_name`     | Name of the vector in the collection. Default: null                       | `StringType`           | ❌       |

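Before writing, the connector fails fast on missing required options and on `id_field`/`embedding_field` names that are absent from the schema (see the `Qdrant.java` diff further down). A minimal Python sketch of that validation logic — the function name, required set, and messages are illustrative assumptions, not the connector's Java API:

```python
REQUIRED_OPTIONS = ["qdrant_url", "collection_name", "schema", "embedding_field"]

def validate_options(options, schema_fields):
    """Illustrative sketch of the connector's pre-write option validation."""
    # Every required option must be present.
    for name in REQUIRED_OPTIONS:
        if name not in options:
            raise ValueError(f"{name} option is required")
    # id_field is optional, but when given it must name a schema column.
    if "id_field" in options and options["id_field"] not in schema_fields:
        raise ValueError("Specified 'id_field' is not present in the schema")
    # embedding_field must always name a schema column.
    if options["embedding_field"] not in schema_fields:
        raise ValueError("Specified 'embedding_field' is not present in the schema")

opts = {
    "qdrant_url": "http://localhost:6334",  # gRPC port
    "collection_name": "demo",
    "schema": "{}",  # placeholder; real jobs pass df.schema.json()
    "embedding_field": "embedding",
}
validate_options(opts, ["id", "embedding"])  # passes without raising
```

Failing during schema inference rather than mid-write keeps a misconfigured job from uploading partial batches.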
## LICENSE 📜

129 changes: 76 additions & 53 deletions pom.xml
@@ -6,7 +6,7 @@
 <modelVersion>4.0.0</modelVersion>
 <groupId>io.qdrant</groupId>
 <artifactId>spark</artifactId>
-<version>1.12.1</version>
+<version>2.0</version>
 <name>qdrant-spark</name>
 <url>https://github.com/qdrant/qdrant-spark</url>
 <description>An Apache Spark connector for the Qdrant vector database</description>
@@ -31,31 +31,68 @@
 </scm>
 <properties>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-  <maven.compiler.source>17</maven.compiler.source>
-  <maven.compiler.target>17</maven.compiler.target>
+  <maven.compiler.source>1.8</maven.compiler.source>
+  <maven.compiler.target>1.8</maven.compiler.target>
 </properties>
 <dependencies>
+  <!-- QDRANT CLIENT DEPENDENCIES -->
+
   <dependency>
-    <groupId>com.squareup.okhttp3</groupId>
-    <artifactId>okhttp</artifactId>
-    <version>4.9.1</version>
+    <groupId>com.google.guava</groupId>
+    <artifactId>guava</artifactId>
+    <version>30.1-jre</version>
+    <scope>compile</scope>
   </dependency>
   <dependency>
-    <groupId>com.google.code.gson</groupId>
-    <artifactId>gson</artifactId>
-    <version>2.10.1</version>
+    <groupId>io.grpc</groupId>
+    <artifactId>grpc-protobuf</artifactId>
+    <version>1.59.0</version>
+    <scope>compile</scope>
   </dependency>
   <dependency>
-    <groupId>junit</groupId>
-    <artifactId>junit</artifactId>
-    <version>4.11</version>
-    <scope>test</scope>
+    <groupId>io.qdrant</groupId>
+    <artifactId>client</artifactId>
+    <version>1.7.1</version>
+    <scope>compile</scope>
   </dependency>
+
+  <!-- SPARK DEPENDENCIES -->
+  <dependency>
+    <groupId>org.slf4j</groupId>
+    <artifactId>slf4j-api</artifactId>
+    <version>2.0.7</version>
+  </dependency>
   <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-sql_2.13</artifactId>
     <version>3.5.0</version>
     <scope>provided</scope>
+    <exclusions>
+      <exclusion>
+        <groupId>com.google.guava</groupId>
+        <artifactId>guava</artifactId>
+      </exclusion>
+    </exclusions>
   </dependency>
+
+  <!-- TEST DEPENDENCIES -->
+  <dependency>
+    <groupId>junit</groupId>
+    <artifactId>junit</artifactId>
+    <version>4.11</version>
+    <scope>test</scope>
+  </dependency>
+  <dependency>
+    <groupId>org.testcontainers</groupId>
+    <artifactId>testcontainers</artifactId>
+    <version>1.19.4</version>
+    <scope>test</scope>
+  </dependency>
+  <dependency>
+    <groupId>org.testcontainers</groupId>
+    <artifactId>junit-jupiter</artifactId>
+    <version>1.19.4</version>
+    <scope>test</scope>
+  </dependency>
 </dependencies>
 <distributionManagement>
@@ -70,13 +70,6 @@
 </distributionManagement>
 <build>
   <plugins>
-    <plugin>
-      <artifactId>maven-surefire-plugin</artifactId>
-      <version>2.22.1</version>
-      <configuration>
-        <argLine>--add-exports java.base/sun.nio.ch=ALL-UNNAMED</argLine>
-      </configuration>
-    </plugin>
     <plugin>
       <groupId>org.apache.maven.plugins</groupId>
       <artifactId>maven-source-plugin</artifactId>
@@ -135,40 +165,33 @@
       </configuration>
     </plugin>
     <plugin>
-      <groupId>com.spotify.fmt</groupId>
-      <artifactId>fmt-maven-plugin</artifactId>
-      <version>2.21.1</version>
-      <goals>
-        <goal>format</goal>
-      </goals>
+      <groupId>org.apache.maven.plugins</groupId>
+      <artifactId>maven-shade-plugin</artifactId>
+      <version>3.5.2</version>
+      <executions>
+        <execution>
+          <phase>package</phase>
+          <goals>
+            <goal>shade</goal>
+          </goals>
+          <configuration>
+            <minimizeJar>true</minimizeJar>
+            <transformers>
+              <transformer
+                implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
+            </transformers>
+            <relocations>
+              <relocation>
+                <pattern>com.google</pattern>
+                <shadedPattern>com.shaded.google</shadedPattern>
+              </relocation>
+            </relocations>
+          </configuration>
+        </execution>
+      </executions>
     </plugin>
   </plugins>
 </build>
-<profiles>
-  <!-- maven-assembly-plugin -->
-  <profile>
-    <id>assembly</id>
-    <build>
-      <plugins>
-        <plugin>
-          <artifactId>maven-assembly-plugin</artifactId>
-          <version>3.3.0</version>
-          <executions>
-            <execution>
-              <phase>package</phase>
-              <goals>
-                <goal>single</goal>
-              </goals>
-              <configuration>
-                <descriptorRefs>
-                  <descriptorRef>jar-with-dependencies</descriptorRef>
-                </descriptorRefs>
-              </configuration>
-            </execution>
-          </executions>
-        </plugin>
-      </plugins>
-    </build>
-  </profile>
-</profiles>
 
 
 </project>
25 changes: 13 additions & 12 deletions src/main/java/io/qdrant/spark/Qdrant.java
@@ -11,6 +11,7 @@
 import org.apache.spark.sql.util.CaseInsensitiveStringMap;
 
 /**
  * A class that implements the TableProvider and DataSourceRegister interfaces. Provides methods to
  * infer schema, get table, and check required options.
  */
@@ -37,9 +38,13 @@ public String shortName() {
    */
   @Override
   public StructType inferSchema(CaseInsensitiveStringMap options) {
+
+    for (String fieldName : requiredFields) {
+      if (!options.containsKey(fieldName)) {
+        throw new IllegalArgumentException(fieldName.concat(" option is required"));
+      }
+    }
     StructType schema = (StructType) StructType.fromJson(options.get("schema"));
-    checkRequiredOptions(options, schema);
+    validateOptions(options, schema);
 
     return schema;
   }
@@ -61,33 +66,29 @@ public Table getTable(
   }
 
   /**
-   * Checks if the required options are present in the provided options and if the id_field and
-   * embedding_field options are present in the provided schema.
+   * Checks if the required options are present in the provided options and checks if the specified
+   * id_field and embedding_field are present in the provided schema.
    *
    * @param options The options to check.
    * @param schema The schema to check.
    */
-  void checkRequiredOptions(CaseInsensitiveStringMap options, StructType schema) {
-    for (String fieldName : requiredFields) {
-      if (!options.containsKey(fieldName)) {
-        throw new IllegalArgumentException(fieldName + " option is required");
-      }
-    }
+  void validateOptions(CaseInsensitiveStringMap options, StructType schema) {
+
     List<String> fieldNames = Arrays.asList(schema.fieldNames());
 
     if (options.containsKey("id_field")) {
       String idField = options.get("id_field").toString();
 
       if (!fieldNames.contains(idField)) {
-        throw new IllegalArgumentException("id_field option is not present in the schema");
+        throw new IllegalArgumentException("Specified 'id_field' is not present in the schema");
       }
     }
 
     String embeddingField = options.get("embedding_field").toString();
 
     if (!fieldNames.contains(embeddingField)) {
-      throw new IllegalArgumentException("embedding_field option is not present in the schema");
+      throw new IllegalArgumentException(
+          "Specified 'embedding_field' is not present in the schema");
     }
   }
 }
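The last chore in the commit message ("throw on invalid ID type") reflects Qdrant's constraint that point IDs are either unsigned integers or UUIDs. A hedged Python sketch of that rule — the connector itself is Java, and this helper is purely illustrative:

```python
import uuid

def check_point_id(value):
    """Accept Qdrant-compatible point IDs: non-negative ints or UUID strings."""
    if isinstance(value, bool):  # bool is an int subclass; reject it explicitly
        raise TypeError(f"Invalid point ID type: {type(value).__name__}")
    if isinstance(value, int):
        if value < 0:
            raise ValueError("Integer point IDs must be non-negative")
        return value
    if isinstance(value, str):
        try:
            return str(uuid.UUID(value))  # normalizes the UUID string
        except ValueError:
            raise ValueError(f"String point IDs must be valid UUIDs: {value!r}")
    raise TypeError(f"Invalid point ID type: {type(value).__name__}")

check_point_id(42)
check_point_id("8c4d6a3e-2b6d-4f6e-9a3e-1f2b3c4d5e6f")
```

Throwing on a bad `id_field` value, rather than silently substituting a generated ID, surfaces data problems at write time.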