This project implements a fraud detection system that integrates Azure Cosmos DB and Azure OpenAI embeddings. It detects suspicious activity based on transaction patterns, geographical information, and vector similarity over embeddings generated through the Azure OpenAI API. The system stores transaction data in Cosmos DB, generates embeddings for transaction locations, and runs vector-based searches to detect anomalous transactions.
To set up and run this project, the following Python packages are required:
- python-dotenv: For loading environment variables from a .env file.
- openai: To interact with the OpenAI API for generating embeddings.
- geopy: For geocoding city names into latitude and longitude coordinates.
- azure-cosmos: For interacting with the Azure Cosmos DB service.
You can install these packages by running:
!pip install python-dotenv
!pip install openai
!pip install geopy
!pip install azure-cosmos
You need to set up a .env file that contains your connection details for Azure Cosmos DB and OpenAI. Here's a template for the environment variables that should be included:
NOSQL_URI=<your_cosmos_db_uri>
NOSQL_PRIMARY_KEY=<your_cosmos_db_primary_key>
AOAI_ENDPOINT=<your_openai_endpoint>
AOAI_KEY=<your_openai_api_key>
API_VERSION=<openai_api_version>
AOAI_EMBEDDING_DEPLOYMENT=<openai_embedding_deployment_name>
AOAI_EMBEDDING_DEPLOYMENT_MODEL=<openai_embedding_model_name>
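Before connecting to either service, it can help to confirm that every variable in the template is set. The following is a minimal sketch using only the standard library. `missing_settings` is a hypothetical helper, not part of the project, and it assumes the .env values have already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`).

```python
import os

# Variable names taken from the .env template above.
REQUIRED_SETTINGS = [
    "NOSQL_URI",
    "NOSQL_PRIMARY_KEY",
    "AOAI_ENDPOINT",
    "AOAI_KEY",
    "API_VERSION",
    "AOAI_EMBEDDING_DEPLOYMENT",
    "AOAI_EMBEDDING_DEPLOYMENT_MODEL",
]


def missing_settings(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All settings present.")
```

Failing early on a missing name tends to be easier to debug than an authentication error raised later by one of the SDKs.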
- Cosmos DB Setup:
  - A Cosmos DB database and container are created if they don't already exist.
  - The container is configured with a vector index for the locationVector field, which allows for efficient vector searches.
- Generating Location Embeddings:
  - The system uses OpenAI's embedding model to generate vector representations of geographical locations (latitude and longitude).
  - These embeddings are then stored in Cosmos DB alongside transaction data.
- Transaction Storage:
  - A pre-existing JSON file (data_with_tenants.json) containing transaction data is loaded.
  - Each transaction is updated with its corresponding location embeddings before being stored in the Cosmos DB container.
- Vector Search:
  - The system allows vector-based searches to detect anomalies by comparing the current transaction's location vector with the average vector of previous transactions.
  - Transactions are retrieved if the vector distance exceeds a certain threshold, indicating a possible anomaly.
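The comparison described in the Vector Search step can be illustrated locally, without Cosmos DB. This is a minimal sketch, assuming cosine distance and an arbitrary threshold of 0.4; the source does not state which distance function or threshold the project uses, and `average_vector`, `cosine_distance`, and `is_anomalous` are hypothetical names.

```python
import math


def average_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def is_anomalous(current, history, threshold=0.4):
    """Flag `current` if it is farther than `threshold` (an assumed value)
    from the average of the historical vectors."""
    return cosine_distance(current, average_vector(history)) > threshold


# Toy 3-dimensional vectors stand in for real location embeddings.
history = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
print(is_anomalous([1.0, 0.05, 0.0], history))  # False: close to the average
print(is_anomalous([0.0, 0.0, 1.0], history))   # True: orthogonal to the average
```

In the project itself the distance computation runs inside Cosmos DB; the local version only shows the shape of the check.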
Generates embeddings for a given latitude and longitude using OpenAI's embedding model.
Fetches the average location vector for all transactions associated with a specific tenant from the Cosmos DB container.
Performs a vector-based search to detect transactions with a large distance from the average transaction vector and current transaction vector.
perform_search is the main function that performs the entire search operation. It calculates the average vector, generates the current transaction's embeddings, and runs a vector-based search in Cosmos DB.
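To make the search step more concrete, here is a sketch of how a parameterized Cosmos DB NoSQL query for it might be assembled. The project's actual query is not shown in this README, so everything here is an assumption: `build_search_query` is a hypothetical helper, the property and alias names are borrowed from the locationVector field and the sample output columns, and the ordering and TOP limit are illustrative. `VectorDistance` is the Cosmos DB system function for comparing vectors.

```python
def build_search_query(tenant_id, current_vector, average_vector, limit=10):
    """Return (query_text, parameters) for a Cosmos DB NoSQL vector query.

    Hypothetical helper: names mirror the sample output columns, but the
    query actually used by the project may differ.
    """
    query = (
        f"SELECT TOP {int(limit)} "
        "c.TransactionID, c.Amount, c.Timestamp, c.Location, c.Merchant, c.TenantId, "
        "VectorDistance(c.locationVector, @currentVector) AS ProximityOfCurrentToLast, "
        "VectorDistance(c.locationVector, @averageVector) AS ProximityOfAverageToLast "
        "FROM c WHERE c.TenantId = @tenantId "
        "ORDER BY VectorDistance(c.locationVector, @currentVector)"
    )
    # azure-cosmos expects parameters as a list of name/value dictionaries.
    parameters = [
        {"name": "@tenantId", "value": tenant_id},
        {"name": "@currentVector", "value": list(current_vector)},
        {"name": "@averageVector", "value": list(average_vector)},
    ]
    return query, parameters


query, parameters = build_search_query("10", [0.1, 0.2], [0.3, 0.4], limit=5)
print(query)
```

With the azure-cosmos SDK, the pair would typically be passed to `container.query_items(query=query, parameters=parameters)`. Any threshold filter could then be applied to the two returned score columns.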
- Set Up Environment Variables: Ensure your .env file is correctly configured with the necessary credentials and endpoints for both Cosmos DB and OpenAI.
- Run the Search: You can run the following code snippet to perform a vector search and detect anomalies in transactions:
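import pandas as pd  # needed only to display the results; install it separately
# perform_search is defined by this project.
tenant_id = "10"
city = "Sweden"  # place name to geocode
merchant = "Walmart"
amount = 1000
results = perform_search(tenant_id, city, merchant, amount)
print(pd.DataFrame(results))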
This prints a DataFrame with the results of the vector-based search, listing transactions that deviate from the normal patterns.
The output of the perform_search function will be a DataFrame showing the transactions that were found based on the vector search:
   TransactionID  Amount            Timestamp  Location  Merchant  TenantId  ProximityOfCurrentToLast  ProximityOfAverageToLast
0          T3235  282.75  2024-09-15 14:28:38    Boston    Amazon        10                  0.428310                  0.523418
1          T7275  939.29  2024-09-15 14:24:38    Boston   Walmart        10                  0.428310                  0.523418
...
- Vector Indexing: The project uses Azure Cosmos DB's diskANN indexing for vector-based searches. The embeddings generated for location vectors are stored as 1536-dimensional float arrays.
- Azure OpenAI Integration: The project uses Azure OpenAI's embedding API to generate location embeddings.
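For reference, the vector settings noted above could be expressed as the two policy dictionaries Cosmos DB uses for vector containers. This is a sketch rather than the project's actual configuration: the path and dimension count come from this README, while the `float32` data type, the `cosine` distance function, and the excluded path are assumptions.

```python
EMBEDDING_DIMENSIONS = 1536  # dimension count stated in the notes above

# Describes the vector stored at /locationVector.
vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": "/locationVector",
            "dataType": "float32",         # assumption
            "distanceFunction": "cosine",  # assumption: not stated in the source
            "dimensions": EMBEDDING_DIMENSIONS,
        }
    ]
}

# Indexes that path with DiskANN and keeps it out of the regular index.
indexing_policy = {
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/locationVector/*"}],
    "vectorIndexes": [{"path": "/locationVector", "type": "diskANN"}],
}
```

In recent versions of azure-cosmos, dictionaries like these are passed as the `vector_embedding_policy` and `indexing_policy` keyword arguments when creating the container, for example through `create_container_if_not_exists`.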
This project is open-source and available for modification under the MIT License.