Issue #38: Incorporate Finesse Benchmarking Tool #39

Merged · 23 commits · Feb 26, 2024

Commits
- 8b3a8ab issue #38: repo api-test, DESIGN.md (ibrahim-kabir, Feb 6, 2024)
- b9cfd64 issue #38: typos (ibrahim-kabir, Feb 6, 2024)
- eff1cf6 issue #38: Ricky review on DESIGN.md (ibrahim-kabir, Feb 7, 2024)
- eab1aa9 issue #38: locust test parser (ibrahim-kabir, Feb 8, 2024)
- be672c3 issue #38: search post (ibrahim-kabir, Feb 8, 2024)
- 3f2b063 issue #38: accuracy score calculation implemented (ibrahim-kabir, Feb 9, 2024)
- bce3b93 issue #38: dashboard implementation (ibrahim-kabir, Feb 12, 2024)
- 4d5dc1a issue #38: csv and md functions added, refactoring the code (ibrahim-kabir, Feb 16, 2024)
- 64a784b issue #38: updated DESIGN.md (ibrahim-kabir, Feb 16, 2024)
- 68c9788 issue #38: Updated unittest (ibrahim-kabir, Feb 16, 2024)
- debe169 issue #38: Removed unused modules (ibrahim-kabir, Feb 16, 2024)
- b7f0f0d issue #38: new accuracy calculation (ibrahim-kabir, Feb 19, 2024)
- 2b9dca6 issue #38: top score fix (ibrahim-kabir, Feb 19, 2024)
- 90a9414 issue #38: top argument added (ibrahim-kabir, Feb 19, 2024)
- dd28838 issue #38: test reviewed (ibrahim-kabir, Feb 19, 2024)
- c05b848 issue #38: reviewed url comparison (ibrahim-kabir, Feb 19, 2024)
- 6a1f617 issue #38: round digits (ibrahim-kabir, Feb 20, 2024)
- 127cbb0 issue #38: devcontainer initial (ibrahim-kabir, Feb 23, 2024)
- 8a9c899 issue #38: Remove JSONDict (ibrahim-kabir, Feb 23, 2024)
- 2784d50 issue #38: Rename utils to host (ibrahim-kabir, Feb 23, 2024)
- 13f9c0d issue #38: standard deviation (ibrahim-kabir, Feb 23, 2024)
- 15d58f0 issue #38: DESIGN.md updated (ibrahim-kabir, Feb 23, 2024)
- 14f03ef issue #38: Update DESIGN.md (ibrahim-kabir, Feb 23, 2024)
6 changes: 6 additions & 0 deletions .gitignore
@@ -36,3 +36,9 @@ keys/

# Ignore Flask Sessions
flask_session/

# Ignore local QnA json files
QnA

# Ignore output of api-test
api-test/output
146 changes: 146 additions & 0 deletions api-test/DESIGN.md
@@ -0,0 +1,146 @@
# Design of the Finesse Benchmark Tool

## Tools available

Several tools can integrate with Python or a script to accurately calculate
API statistics. The current need is to test the tool using JSON files
containing questions and their page of origin in order to establish an
accuracy score. We also want to measure request times and generate a
statistical summary of all this data. That said, we plan to test the APIs
under different conditions in the near future, for example with multiple
simultaneous users or under other special conditions. It is therefore worth
researching whether candidate tools are scalable and integrate well with
Python.

### Decision

We've opted for Locust as our tool of choice. It integrates seamlessly with
Python, making it a natural fit. Locust is an open-source load testing
framework written in Python, designed to simulate numerous machines sending
requests to a given system. It provides detailed insights into the system's
performance and scalability. With its built-in UI and straightforward
integration with Python scripts, Locust is user-friendly and accessible. It is
popular and open source, with support from major tech companies such as
Microsoft and Google.

However, Locust's primary purpose is to conduct ongoing tests involving multiple
machines and endpoints simultaneously. Our specific requirement involves running
the accuracy test just once. Nevertheless, there's potential for future
integration, especially for stress and load testing scenarios that involve
repeated searches.
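
For reference, a minimal locustfile sketch of how such a future load test
could look (the endpoint path and payload here are illustrative assumptions,
not the actual finesse-backend routes):

```python
from locust import HttpUser, task, between

class FinesseUser(HttpUser):
    # Each simulated user pauses 1-2 seconds between requests.
    wait_time = between(1, 2)

    @task
    def search(self):
        # Hypothetical search endpoint and payload.
        self.client.post("/search/azure", json={"query": "example question"})
```

Running `locust -f locustfile.py --host <backend URL>` would drive the test
and expose Locust's built-in UI.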

### Alternatives Considered

#### Apache Bench (ab)

Apache Bench (ab) is a command-line tool for benchmarking HTTP servers. It is
included with the Apache HTTP Server package and is designed for simplicity and
ease of use.

Pros

- Simple to use.
- Good for basic testing.
- Easy integration with test scripts.

Cons

- May not be flexible enough for complex testing scenarios.
- Less performant for heavy loads or advanced testing.

#### Siege

Siege is a load testing and benchmarking tool that simulates multiple users
accessing a web server, enabling stress testing and performance evaluation.

Pros

- Supports multiple concurrent users, making it suitable for load testing.
- Allows for stress testing of web servers and applications.

Cons

- Sparse documentation; some arguments are not documented in the wiki.
- May have a steeper learning curve compared to simpler tools like Apache Bench.

## Overview

This tool simplifies the process of comparing different search engines and
assessing their accuracy. It's designed to be straightforward, making it easy
to understand and use.

## How it Works

- **Single command:**
  - Users run a single command that selects a search engine, points to a
    directory of JSON files, and specifies the backend URL.
  - Mandatory arguments:
    - `--engine [search engine]`: Pick a search engine.
      - `ai-lab`: AI-Lab search engine
      - `azure`: Azure search engine
      - `static`: Static search engine
      - `llamaindex`: LlamaIndex search engine
    - `--path [directory path]`: Point to the directory of JSON files, each
      structured with the following properties:
      - `score`: The score of the page.
      - `crawl_id`: The unique identifier associated with the crawl table.
      - `chunk_id`: The unique identifier of the chunk.
      - `title`: The title of the page.
      - `url`: The URL of the page.
      - `text_content`: The main textual content of the item.
      - `question`: The question to ask.
      - `answer`: The response to the asked question.
    - `--host [API URL]`: Point to the finesse-backend URL.
  - Optional arguments:
    - `--format [file type]`:
      - `csv`: Generate a CSV document
      - `md`: Generate a Markdown document (default)
    - `--once`: Run through all the JSON files once, without repeating.
    - `--top`: Limit the number of results returned by the search engine.
- **Many tests**
  - Test all the JSON files in the path directory.
- **Accuracy score**
  - The tool compares the expected page with the pages Finesse actually
    returns.
  - It calculates an accuracy score for each response based on the expected
    page's position in the list of returned pages relative to the total number
    of pages: 100% corresponds to being at the top of the list, and 0% means
    the page is not in the list (see the worked example after this list).
- **Round trip time**
  - Measure the round trip time of each request.
- **Summary statistical values**
  - Measure the mean, median, standard deviation, minimum, and maximum of the
    accuracy scores and round trip times.
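
The scoring rule can be illustrated with a short sketch (`position_score` is a
hypothetical helper for illustration only; the actual implementation is
`calculate_accuracy` in `api-test/accuracy_functions.py`):

```python
# Sketch of the scoring rule: a hit at 0-indexed position idx in a list
# of total results scores 1 - idx / total, rounded to two decimals.
def position_score(idx: int, total: int) -> float:
    return round(1 - idx / total, 2)

print(position_score(0, 10))  # 1.0 -> expected page is the top result
print(position_score(5, 10))  # 0.5 -> expected page is halfway down
# If the expected page never appears, the score simply stays 0.0.
```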

## Diagram

![Design diagram of the Finesse benchmark tool](diagram.png)

## Example Command

```cmd
$ locust --engine azure --path api-test/QnA/good_question --host https://finesse-guidance.ninebasetwo.xyz/api --once
Searching with Azure Search...

File: qna_2023-12-08_36.json
Question: Quelle est la zone réglementée dans la ville de Vancouver à partir du 19 mars 2022?
Expected URL: https://inspection.canada.ca/protection-des-vegetaux/especes-envahissantes/directives/date/d-96-15/fra/1323854808025/1323854941807
Accuracy Score: 50.0%
Time: 277.836ms

File: qna_2023-12-08_19.json
Question: What are the requirements for inspections of fishing vessels?
Expected URL: https://inspection.canada.ca/importing-food-plants-or-animals/food-imports/foreign-systems/audits/report-of-a-virtual-assessment-of-spain/eng/1661449231959/1661449232916
Accuracy Score: 0.0%
Time: 677.906ms

...

---
Tested on 21 files.
Time statistical summary:
Mean:429, Median:400, Standard Deviation:150, Maximum:889, Minimum:208
Accuracy statistical summary:
Mean:0.35, Median:0.0, Standard Deviation:0.25, Maximum:1.0, Minimum:0.0
---
```

This example shows the CLI output of the tool, analyzing search results from
Azure Search and providing accuracy scores for Finesse.
Empty file added: api-test/__init__.py
115 changes: 115 additions & 0 deletions api-test/accuracy_functions.py
@@ -0,0 +1,115 @@
import statistics
import datetime
import csv
import os
from collections import namedtuple

OUTPUT_FOLDER = "./api-test/output"
AccuracyResult = namedtuple("AccuracyResult", ["position", "total_pages", "score"])

def calculate_accuracy(responses_url: list[str], expected_url: str) -> AccuracyResult:
    position: int = 0
    total_pages: int = len(responses_url)
    score: float = 0.0
    # Pages are compared on the numeric identifier in the second-to-last URL
    # segment, so language or formatting variants of the same page still match.
    expected_number = int(expected_url.split('/')[-2])

    for idx, response_url in enumerate(responses_url):
        response_number = int(response_url.split('/')[-2])
        if response_number == expected_number:
            position = idx
            # Linear decay: the top result scores 1.0; a miss leaves score at 0.0.
            score = round(1 - (position / total_pages), 2)
            break

    return AccuracyResult(position, total_pages, score)

def save_to_markdown(test_data: dict, engine: str):
if not os.path.exists(OUTPUT_FOLDER):
os.makedirs(OUTPUT_FOLDER)
date_string = datetime.datetime.now().strftime("%Y-%m-%d")
file_name = f"test_{engine}_{date_string}.md"
output_file = os.path.join(OUTPUT_FOLDER, file_name)
with open(output_file, "w") as md_file:
md_file.write(f"# Test on the {engine} search engine: {date_string}\n\n")
md_file.write("## Test data table\n\n")
md_file.write("| 📄 File | 💬 Question | 📏 Accuracy Score | ⌛ Time |\n")
md_file.write("|--------------------|-------------------------------------------------------------------------------------------------------------------------|----------------|----------|\n")
for key, value in test_data.items():
md_file.write(f"| {key} | [{value.get('question')}]({value.get('expected_page').get('url')})' | {value.get('accuracy')*100:.1f}% | {value.get('time')}ms |\n")
md_file.write("\n")
md_file.write(f"Tested on {len(test_data)} files.\n\n")

time_stats, accuracy_stats = calculate_statistical_summary(test_data)
md_file.write("## Statistical summary\n\n")
md_file.write("| Statistic | Time | Accuracy score|\n")
md_file.write("|-----------------------|------------|---------|\n")
md_file.write(f"|Mean| {time_stats.get('Mean')}ms | {accuracy_stats.get('Mean')*100}% |\n")
md_file.write(f"|Median| {time_stats.get('Median')}ms | {accuracy_stats.get('Median')*100}% |\n")
md_file.write(f"|Standard Deviation| {time_stats.get('Standard Deviation')}ms | {accuracy_stats.get('Standard Deviation')*100}% |\n")
md_file.write(f"|Maximum| {time_stats.get('Maximum')}ms | {accuracy_stats.get('Maximum')*100}% |\n")
md_file.write(f"|Minimum| {time_stats.get('Minimum')}ms | {accuracy_stats.get('Minimum')*100}% |\n")

def save_to_csv(test_data: dict, engine: str):
if not os.path.exists(OUTPUT_FOLDER):
os.makedirs(OUTPUT_FOLDER)
date_string = datetime.datetime.now().strftime("%Y-%m-%d")
file_name = f"test_{engine}_{date_string}.csv"
output_file = os.path.join(OUTPUT_FOLDER, file_name)
with open(output_file, "w", newline="") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(["File", "Question", "Accuracy Score", "Time"])
for key, value in test_data.items():
writer.writerow([
key,
value.get("question"),
f"{value.get('accuracy')}",
f"{value.get('time')}"
])
writer.writerow([])

time_stats, accuracy_stats = calculate_statistical_summary(test_data)
writer.writerow(["Statistic", "Time", "Accuracy Score"])
writer.writerow(["Mean", f"{time_stats.get('Mean')}", f"{accuracy_stats.get('Mean')}"])
writer.writerow(["Median", f"{time_stats.get('Median')}", f"{accuracy_stats.get('Median')}"])
writer.writerow(["Standard Deviation", f"{time_stats.get('Standard Deviation')}", f"{accuracy_stats.get('Standard Deviation')}"])
writer.writerow(["Maximum", f"{time_stats.get('Maximum')}", f"{accuracy_stats.get('Maximum')}"])
writer.writerow(["Minimum", f"{time_stats.get('Minimum')}", f"{accuracy_stats.get('Minimum')}"])

def log_data(test_data: dict):
for key, value in test_data.items():
print("File:", key)
print("Question:", value.get("question"))
print("Expected URL:", value.get("expected_page").get("url"))
print(f'Accuracy Score: {value.get("accuracy")*100}%')
print(f'Time: {value.get("time")}ms')
print()
time_stats, accuracy_stats = calculate_statistical_summary(test_data)
print("---")
print(f"Tested on {len(test_data)} files.")
print("Time statistical summary:", end="\n ")
for key,value in time_stats.items():
print(f"{key}:{value},", end=' ')
print("\nAccuracy statistical summary:", end="\n ")
for key,value in accuracy_stats.items():
print(f"{key}:{value*100}%,", end=' ')
print("\n---")


def calculate_statistical_summary(test_data: dict) -> tuple[dict, dict]:
    # Note: statistics.stdev requires at least two samples, so a run over a
    # single JSON file will raise StatisticsError here.
    times = [result.get("time") for result in test_data.values()]
    accuracies = [result.get("accuracy") for result in test_data.values()]
time_stats = {
"Mean": round(statistics.mean(times), 3),
"Median": round(statistics.median(times), 3),
"Standard Deviation": round(statistics.stdev(times), 3),
"Maximum": round(max(times), 3),
"Minimum": round(min(times), 3),
}
accuracy_stats = {
"Mean": round(statistics.mean(accuracies), 2),
"Median": round(statistics.median(accuracies), 2),
"Standard Deviation": round(statistics.stdev(accuracies), 2),
"Maximum": round(max(accuracies), 2),
"Minimum": round(min(accuracies), 2),
}
return time_stats, accuracy_stats
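
For context, a minimal sketch of how `calculate_accuracy` is called (the URLs
are made up but follow the `…/<id>/<id>` shape the parser expects; the import
assumes the working directory is `api-test/`):

```python
from accuracy_functions import calculate_accuracy

# Hypothetical URLs; only the numeric second-to-last segment is compared.
expected = "https://inspection.canada.ca/a/b/1323854808025/1323854941807"
responses = [
    "https://inspection.canada.ca/x/y/1661449231959/1661449232916",
    "https://inspection.canada.ca/a/b/1323854808025/1323854941807",
]

result = calculate_accuracy(responses, expected)
print(result)  # AccuracyResult(position=1, total_pages=2, score=0.5)
```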
Binary file added api-test/diagram.png
9 changes: 9 additions & 0 deletions api-test/host.py
@@ -0,0 +1,9 @@
import requests

def is_host_up(host_url: str) -> bool:
    health_check_endpoint = f"{host_url}/health"
    try:
        # The timeout keeps the benchmark from hanging on an unreachable host.
        response = requests.get(health_check_endpoint, timeout=10)
        return response.status_code == 200
    except requests.RequestException:
        return False
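
A sketch of the intended pre-flight check (the host URL is taken from the
example command in DESIGN.md):

```python
from host import is_host_up

# Abort the benchmark early if the backend does not answer on /health.
if not is_host_up("https://finesse-guidance.ninebasetwo.xyz/api"):
    raise SystemExit("finesse-backend is unreachable")
```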
29 changes: 29 additions & 0 deletions api-test/jsonreader.py
@@ -0,0 +1,29 @@
import json
from typing import Iterator
import os

class JSONReader(Iterator):
"Read test data from JSON files using an iterator"

def __init__(self, directory):
self.directory = directory
self.file_list = [f for f in os.listdir(directory) if f.endswith('.json')]
if not self.file_list:
raise FileNotFoundError(f"No JSON files found in the directory '{directory}'")
self.current_file_index = 0
self.file_name = None # Initialize file_name attribute

def __iter__(self):
return self

def __next__(self):
if self.current_file_index >= len(self.file_list):
raise StopIteration

file_path = os.path.join(self.directory, self.file_list[self.current_file_index])
self.file_name = self.file_list[self.current_file_index] # Update file_name attribute

with open(file_path, 'r') as f:
data = json.load(f)
self.current_file_index += 1
return data
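
A sketch of iterating the reader (the directory is taken from the example
command in DESIGN.md; `question` is one of the documented JSON properties):

```python
from jsonreader import JSONReader

reader = JSONReader("api-test/QnA/good_question")
for qna in reader:
    # file_name tracks which file produced the current record.
    print(reader.file_name, "->", qna.get("question"))
```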