Squashed commit of the following:
commit 3bf2d19
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Thu Nov 2 09:10:15 2023 -0700

    Fix list file (Azure-Samples#897)

commit b3c55b0
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Nov 1 06:42:42 2023 -0700

    Bump pypdf from 3.16.3 to 3.17.0 in /scripts (Azure-Samples#890)

    Bumps [pypdf](https://github.com/py-pdf/pypdf) from 3.16.3 to 3.17.0.
    - [Release notes](https://github.com/py-pdf/pypdf/releases)
    - [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
    - [Commits](py-pdf/pypdf@3.16.3...3.17.0)

    ---
    updated-dependencies:
    - dependency-name: pypdf
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Pamela Fox <pamelafox@microsoft.com>

commit cfbb90b
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Tue Oct 31 21:37:55 2023 -0700

    Add more readmes/guides (Azure-Samples#889)

    * Add more readmes/guides

    * Add image

    * Diagram added

commit 4479a2c
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Mon Oct 30 21:38:26 2023 -0700

    Handle errors better especially for streaming (Azure-Samples#884)
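
A note on the pattern this commit introduces (the full change appears in the app/backend/app.py diff below): once a streamed NDJSON response has started, the HTTP status code has already been sent, so mid-stream exceptions are caught and emitted as a final JSON error line instead. A simplified sketch:

```python
import json
import logging
from typing import AsyncGenerator


async def format_as_ndjson(r: AsyncGenerator[dict, None]) -> AsyncGenerator[str, None]:
    # Once streaming starts, a 500 response can no longer be returned,
    # so failures are reported as a final JSON line inside the stream.
    try:
        async for event in r:
            yield json.dumps(event, ensure_ascii=False) + "\n"
    except Exception as error:
        logging.exception("Exception while generating response stream: %s", error)
        yield json.dumps({"error": str(error)})
```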

commit 0d54f84
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Mon Oct 30 17:58:09 2023 -0700

    Add exclude files (Azure-Samples#876)

commit 3647826
Author: Roderic Bos <github@rooc.nl>
Date:   Mon Oct 30 17:35:50 2023 +0100

    When using the option --storagekey for the prepdocs script the key might (Azure-Samples#866)

    contain `==` base64 padding at the end. Login then fails because the
    script strips all `=` signs during the split. Copied the version from
    app/start.ps1, which is better suited here.

    Co-authored-by: Pamela Fox <pamelafox@microsoft.com>
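
The pitfall is easy to reproduce (a Python sketch for illustration only; the actual scripts are PowerShell, and the key below is a made-up placeholder): splitting a `key=value` argument on every `=` destroys the base64 padding, while splitting once on the first `=` preserves it.

```python
# Illustration of the --storagekey parsing bug; the value is a fake key.
arg = "--storagekey=c2VjcmV0LWtleQ=="

# Buggy: splitting on every "=" discards the trailing "==" padding.
buggy_key = arg.split("=")[1]     # "c2VjcmV0LWtleQ" -- padding lost

# Fixed: split only once, keeping everything after the first "=".
fixed_key = arg.split("=", 1)[1]  # "c2VjcmV0LWtleQ==" -- padding intact
```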

commit 5daa934
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Oct 30 09:17:55 2023 -0700

    Bump the github-actions group with 1 update (Azure-Samples#880)

    Bumps the github-actions group with 1 update: [actions/setup-node](https://github.com/actions/setup-node).

    - [Release notes](https://github.com/actions/setup-node/releases)
    - [Commits](actions/setup-node@v3...v4)

    ---
    updated-dependencies:
    - dependency-name: actions/setup-node
      dependency-type: direct:production
      update-type: version-update:semver-major
      dependency-group: github-actions
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit bfb3ee5
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Fri Oct 27 11:04:37 2023 -0700

    Render entire PDFs instead of single pages (Azure-Samples#840)

    * Adding anchors

    * Show whole file

    * Show whole file

    * Page number support

    * More experiments with whole file

    * Revert unintentional changes

    * Add tests

    * Remove random num

    * Add retry_total=0 to avoid unnecessary network requests (see the sketch after this list)

    * Add comment to explain retry_total

    * Bring back deleted file

    * Blob manager refactor after merge

    * Update coverage amount

    * Make mypy happy with explicit check of path

    * Add debug for 3.9

    * Skip in 3.9 since it's silly

    * Reduce fail under percentage due to 3.9
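
On the `retry_total=0` bullet above: azure-storage-blob clients accept a `retry_total` keyword that caps the SDK's automatic retries. A minimal sketch, with placeholder account URL, credential, and container name:

```python
from azure.storage.blob.aio import BlobServiceClient

# retry_total=0 turns off automatic retries, so a blob lookup that 404s
# fails immediately instead of issuing several extra network requests.
service_client = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",  # placeholder
    credential="<credential>",  # placeholder
    retry_total=0,
)
container_client = service_client.get_container_client("content")  # placeholder name
```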

commit c989048
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Thu Oct 26 20:01:18 2023 -0700

    add screenshot (Azure-Samples#870)

commit a64a12e
Author: Matt <57731498+mattmsft@users.noreply.github.com>
Date:   Thu Oct 26 12:49:02 2023 -0700

    Refactor prepdocs (Azure-Samples#862)

    * setting up types

    * setting up more types...

    * working on it...

    * prepdocs refactor

    * typing fixes; updating tests

    * more fixes; updating tests

    * fixing retry for embeddings

    * fixing adls gen2 list

    * more test fixes

    * fixes from manual runs

    * more fixes

    * more fixes

    * type fixes

    * more type fixes and test fixes

    * break into modules

    * openai embedding fix

    * novectors fix

    * fix id generation

    * doc strings

    * feedback from pr

    * rename feedback

    * trying to get imports to work

    * update test workflow with pamela's suggestion

    * fix ci again

    * delete __init__

    * mypy configuration

    ---------

    Co-authored-by: Matt Gotteiner <magottei@microsoft.com>

commit 94be632
Author: MaciejLitwiniec <MaciejLitwiniec@users.noreply.github.com>
Date:   Thu Oct 26 15:23:00 2023 +0200

    Updated FAQ so that it reflects PR 835 (Azure-Samples#868)

    * Updated FAQ so that it reflects PR 835

    * Update README.md

    * Update README.md

    ---------

    Co-authored-by: Pamela Fox <pamela.fox@gmail.com>

commit d7bbf9f
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Oct 25 14:23:05 2023 -0700

    Bump werkzeug from 3.0.0 to 3.0.1 in /app/backend (Azure-Samples#863)

    Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.0 to 3.0.1.
    - [Release notes](https://github.com/pallets/werkzeug/releases)
    - [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
    - [Commits](pallets/werkzeug@3.0.0...3.0.1)

    ---
    updated-dependencies:
    - dependency-name: werkzeug
      dependency-type: indirect
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Pamela Fox <pamelafox@microsoft.com>

commit d02aa14
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Wed Oct 25 13:48:45 2023 -0700

    Message builder improvements (Azure-Samples#852)

commit 8f55988
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Tue Oct 24 19:53:07 2023 -0700

    Reorder tags to optimize for sample browser (Azure-Samples#853)

commit 16a61bf
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Mon Oct 23 17:03:59 2023 -0700

    Improve follow-up questions and pipe into context (Azure-Samples#832)

    * Add follow-up questions and parsing

    * Test breaking the e2e

    * Actually run tests

    * Fix runner

    * Add conditional

    * Fix the test

    * chat approach
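
One way the follow-up-question piping can work (a sketch built on an assumed convention, not taken from this commit: the model is prompted to wrap each suggested follow-up in `<<...>>`): parse the markers out of the answer and pass the questions to the UI separately.

```python
import re


def extract_followup_questions(content: str) -> tuple[str, list[str]]:
    # Assumed convention: the model emits follow-ups as <<Some question?>>.
    followups = re.findall(r"<<([^<>]+)>>", content)
    cleaned = re.sub(r"\s*<<[^<>]+>>", "", content)
    return cleaned, followups


answer, questions = extract_followup_questions(
    "Northwind Standard covers vision exams. <<Does it cover dental?>> <<What is the copay?>>"
)
# answer    -> "Northwind Standard covers vision exams."
# questions -> ["Does it cover dental?", "What is the copay?"]
```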

commit ca01af9
Author: Anthony Shaw <anthony.p.shaw@gmail.com>
Date:   Mon Oct 23 09:27:32 2023 +1100

    Store an MD5 hash of uploaded/indexed file and check before prepdocs (Azure-Samples#835)

    * Hash the uploaded files locally and skip them if you provision a second time and they haven't changed

    * Overwrite the hash when it changes

    * Remove open mode parameter

    * fix f-string

    * reformat changes

    * Update prepdocs.py
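
A minimal sketch of the skip-if-unchanged idea (the sibling `<name>.md5` layout is an assumption consistent with the `data/*.md5` entry added to .gitignore below; the helper names are illustrative):

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    # Hash in chunks so large PDFs are not read into memory at once.
    md5 = hashlib.md5()
    with open(path, "rb") as file:
        for chunk in iter(lambda: file.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()


def should_upload(path: Path) -> bool:
    # Skip files whose stored hash matches; otherwise record the new hash.
    hash_path = path.with_suffix(path.suffix + ".md5")
    digest = file_md5(path)
    if hash_path.exists() and hash_path.read_text().strip() == digest:
        return False
    hash_path.write_text(digest)
    return True
```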
vishalgtingre committed Nov 5, 2023
1 parent d226448 commit 8429f98
Showing 80 changed files with 2,392 additions and 873 deletions.
19 changes: 15 additions & 4 deletions .github/workflows/python-test.yaml
@@ -25,7 +25,7 @@ jobs:
           python-version: ${{ matrix.python_version }}
           architecture: x64
       - name: Setup node
-        uses: actions/setup-node@v3
+        uses: actions/setup-node@v4
         with:
           node-version: 18
       - name: Build frontend
@@ -40,14 +40,25 @@ jobs:
       - name: Lint with ruff
         run: ruff .
       - name: Check types with mypy
-        run: python3 -m mypy scripts/ app/
+        run: |
+          cd scripts/
+          python3 -m mypy .
+          cd ../app/
+          python3 -m mypy .
       - name: Check formatting with black
         run: black . --check --verbose
       - name: Run Python tests
         if: runner.os != 'Windows'
-        run: python3 -m pytest -s -vv --cov --cov-fail-under=78
+        run: python3 -m pytest -s -vv --cov --cov-fail-under=87
       - name: Run E2E tests with Playwright
+        id: e2e
         if: runner.os != 'Windows'
         run: |
           playwright install --with-deps
-          python3 -m pytest tests/e2e.py
+          python3 -m pytest tests/e2e.py --tracing=retain-on-failure
+      - name: Upload test artifacts
+        if: ${{ failure() && steps.e2e.conclusion == 'failure' }}
+        uses: actions/upload-artifact@v3
+        with:
+          name: playwright-traces
+          path: test-results
4 changes: 3 additions & 1 deletion .gitignore
@@ -144,4 +144,6 @@ cython_debug/
 # NPM
 npm-debug.log*
 node_modules
-static/
+static/
+
+data/*.md5
14 changes: 13 additions & 1 deletion .vscode/settings.json
@@ -15,8 +15,20 @@
         "editor.defaultFormatter": "esbenp.prettier-vscode",
         "editor.formatOnSave": true
     },
+    "files.exclude": {
+        "**/__pycache__": true,
+        "**/.coverage": true,
+        "**/.pytest_cache": true,
+        "**/.ruff_cache": true,
+        "**/.mypy_cache": true
+    },
     "search.exclude": {
         "**/node_modules": true,
         "static": true
-    }
+    },
+    "python.testing.pytestArgs": [
+        "tests"
+    ],
+    "python.testing.unittestEnabled": false,
+    "python.testing.pytestEnabled": true
 }
14 changes: 13 additions & 1 deletion CONTRIBUTING.md
@@ -108,9 +108,19 @@ playwright install --with-deps
 Run the tests:
 
 ```
-python3 -m pytest tests/e2e.py
+python3 -m pytest tests/e2e.py --tracing=retain-on-failure
 ```
 
+When a failure happens, the trace zip will be saved in the test-results folder.
+You can view that using the Playwright CLI:
+
+```
+playwright show-trace test-results/<trace-zip>
+```
+
+You can also use the online trace viewer at https://trace.playwright.dev/
+
+
 ## <a name="style"></a> Code Style
 
 This codebase includes several languages: TypeScript, Python, Bicep, Powershell, and Bash.
@@ -135,3 +145,5 @@ Run `black` to format a file:
 ```
 python3 -m black <path-to-file>
 ```
+
+If you followed the steps above to install the pre-commit hooks, then you can just wait for those hooks to run `ruff` and `black` for you.
80 changes: 14 additions & 66 deletions README.md
@@ -2,15 +2,15 @@
 name: ChatGPT + Enterprise data
 description: Chat with your data using OpenAI and Cognitive Search.
 languages:
+- azdeveloper
-- typescript
 - python
+- typescript
 - bicep
-- azdeveloper
 products:
-- azure
-- azure-cognitive-search
 - azure-openai
+- azure-cognitive-search
 - azure-app-service
+- azure
 page_type: sample
 urlFragment: azure-search-openai-demo
 ---
@@ -266,6 +266,10 @@ By default, the deployed Azure web app will only allow requests from the same or
 
 For the frontend code, change `BACKEND_URI` in `api.ts` to point at the deployed backend URL, so that all fetch requests will be sent to the deployed backend.
 
+For an alternate frontend that's written in Web Components and deployed to Static Web Apps, check out
+[azure-search-openai-javascript](https://github.com/Azure-Samples/azure-search-openai-javascript) and its guide
+on [using a different backend](https://github.com/Azure-Samples/azure-search-openai-javascript#using-a-different-backend).
+
 ## Running locally
 
 You can only run locally **after** having successfully run the `azd up` command. If you haven't yet, follow the steps in [Azure deployment](#azure-deployment) above.
@@ -285,50 +289,22 @@ Once in the web app:
 * Explore citations and sources
 * Click on "settings" to try different options, tweak prompts, etc.
 
+## Customizing the UI and data
+
+Once you successfully deploy the app, you can start customizing it for your needs: changing the text, tweaking the prompts, and replacing the data. Consult the [app customization guide](docs/customization.md) as well as the [data ingestion guide](docs/data_ingestion.md) for more details.
+
 ## Productionizing
 
 This sample is designed to be a starting point for your own production application,
 but you should do a thorough review of the security and performance before deploying
-to production. Here are some things to consider:
-
-* **OpenAI Capacity**: The default TPM (tokens per minute) is set to 30K. That is equivalent
-to approximately 30 conversations per minute (assuming 1K per user message/response).
-You can increase the capacity by changing the `chatGptDeploymentCapacity` and `embeddingDeploymentCapacity`
-parameters in `infra/main.bicep` to your account's maximum capacity.
-You can also view the Quotas tab in [Azure OpenAI studio](https://oai.azure.com/)
-to understand how much capacity you have.
-* **Azure Storage**: The default storage account uses the `Standard_LRS` SKU.
-To improve your resiliency, we recommend using `Standard_ZRS` for production deployments,
-which you can specify using the `sku` property under the `storage` module in `infra/main.bicep`.
-* **Azure Cognitive Search**: The default search service uses the `Standard` SKU
-with the free semantic search option, which gives you 1000 free queries a month.
-Assuming your app will experience more than 1000 questions, you should either change `semanticSearch`
-to "standard" or disable semantic search entirely in the `/app/backend/approaches` files.
-If you see errors about search service capacity being exceeded, you may find it helpful to increase
-the number of replicas by changing `replicaCount` in `infra/core/search/search-services.bicep`
-or manually scaling it from the Azure Portal.
-* **Azure App Service**: The default app service plan uses the `Basic` SKU with 1 CPU core and 1.75 GB RAM.
-We recommend using a Premium level SKU, starting with 1 CPU core.
-You can use auto-scaling rules or scheduled scaling rules,
-and scale up the maximum/minimum based on load.
-* **Authentication**: By default, the deployed app is publicly accessible.
-We recommend restricting access to authenticated users.
-See [Enabling authentication](#enabling-authentication) above for how to enable authentication.
-* **Networking**: We recommend deploying inside a Virtual Network. If the app is only for
-internal enterprise use, use a private DNS zone. Also consider using Azure API Management (APIM)
-for firewalls and other forms of protection.
-For more details, read [Azure OpenAI Landing Zone reference architecture](https://techcommunity.microsoft.com/t5/azure-architecture-blog/azure-openai-landing-zone-reference-architecture/ba-p/3882102).
-* **Loadtesting**: We recommend running a loadtest for your expected number of users.
-You can use the [locust tool](https://docs.locust.io/) with the `locustfile.py` in this sample
-or set up a loadtest with Azure Load Testing.
-
+to production. Read through our [productionizing guide](docs/productionizing.md) for more details.
 
 ## Resources
 
 * [Revolutionize your Enterprise Data with ChatGPT: Next-gen Apps w/ Azure OpenAI and Cognitive Search](https://aka.ms/entgptsearchblog)
 * [Azure Cognitive Search](https://learn.microsoft.com/azure/search/search-what-is-azure-search)
 * [Azure OpenAI Service](https://learn.microsoft.com/azure/cognitive-services/openai/overview)
-* [Comparing Azure OpenAI and OpenAI](https://learn.microsoft.com/en-gb/azure/cognitive-services/openai/overview#comparing-azure-openai-and-openai/)
+* [Comparing Azure OpenAI and OpenAI](https://learn.microsoft.com/azure/cognitive-services/openai/overview#comparing-azure-openai-and-openai/)
 
 ## Clean up
 
@@ -346,18 +322,6 @@ The resource group and all the resources will be deleted.
 ### FAQ
 
-<details><a id="ingestion-why-chunk"></a>
-<summary>Why do we need to break up the PDFs into chunks when Azure Cognitive Search supports searching large documents?</summary>
-
-Chunking allows us to limit the amount of information we send to OpenAI due to token limits. By breaking up the content, it allows us to easily find potential chunks of text that we can inject into OpenAI. The method of chunking we use leverages a sliding window of text such that sentences that end one chunk will start the next. This allows us to reduce the chance of losing the context of the text.
-</details>
-
-<details><a id="ingestion-more-pdfs"></a>
-<summary>How can we upload additional PDFs without redeploying everything?</summary>
-
-To upload more PDFs, put them in the data/ folder and run `./scripts/prepdocs.sh` or `./scripts/prepdocs.ps1`. To avoid reuploading existing docs, move them out of the data folder. You could also implement checks to see whats been uploaded before; our code doesn't yet have such checks.
-</details>
-
 <details><a id="compare-samples"></a>
 <summary>How does this sample compare to other Chat with Your Data samples?</summary>
 
@@ -397,22 +361,6 @@ Technology comparison:
 In `infra/main.bicep`, change `chatGptModelName` to 'gpt-4' instead of 'gpt-35-turbo'. You may also need to adjust the capacity above that line depending on how much TPM your account is allowed.
 </details>
 
-<details><a id="chat-ask-diff"></a>
-<summary>What is the difference between the Chat and Ask tabs?</summary>
-
-The chat tab uses the approach programmed in [chatreadretrieveread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/chatreadretrieveread.py).
-
-- It uses the ChatGPT API to turn the user question into a good search query.
-- It queries Azure Cognitive Search for search results for that query (optionally using the vector embeddings for that query).
-- It then combines the search results and original user question, and asks ChatGPT API to answer the question based on the sources. It includes the last 4K of message history as well (or however many tokens are allowed by the deployed model).
-
-The ask tab uses the approach programmed in [retrievethenread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/retrievethenread.py).
-
-- It queries Azure Cognitive Search for search results for the user question (optionally using the vector embeddings for that question).
-- It then combines the search results and user question, and asks ChatGPT API to answer the question based on the sources.
-
-</details>
-
 <details><a id="azd-up-explanation"></a>
 <summary>What does the `azd up` command do?</summary>
42 changes: 32 additions & 10 deletions app/backend/app.py
@@ -9,6 +9,7 @@
 
 import aiohttp
 import openai
+from azure.core.exceptions import ResourceNotFoundError
 from azure.identity.aio import DefaultAzureCredential
 from azure.monitor.opentelemetry import configure_azure_monitor
 from azure.search.documents.aio import SearchClient
@@ -39,6 +40,10 @@
 CONFIG_BLOB_CONTAINER_CLIENT = "blob_container_client"
 CONFIG_AUTH_CLIENT = "auth_client"
 CONFIG_SEARCH_CLIENT = "search_client"
+ERROR_MESSAGE = """The app encountered an error processing your request.
+If you are an administrator of the app, view the full error in the logs. See aka.ms/appservice-logs for more information.
+Error type: {error_type}
+"""
 
 bp = Blueprint("routes", __name__, static_folder="static")
 
@@ -69,9 +74,18 @@ async def assets(path):
 # *** NOTE *** this assumes that the content files are public, or at least that all users of the app
 # can access all the files. This is also slow and memory hungry.
 @bp.route("/content/<path>")
-async def content_file(path):
+async def content_file(path: str):
+    # Remove page number from path, filename-1.txt -> filename.txt
+    if path.find("#page=") > 0:
+        path_parts = path.rsplit("#page=", 1)
+        path = path_parts[0]
+    logging.info("Opening file %s at page %s", path)
     blob_container_client = current_app.config[CONFIG_BLOB_CONTAINER_CLIENT]
-    blob = await blob_container_client.get_blob_client(path).download_blob()
+    try:
+        blob = await blob_container_client.get_blob_client(path).download_blob()
+    except ResourceNotFoundError:
+        logging.exception("Path not found: %s", path)
+        abort(404)
     if not blob.properties or not blob.properties.has_key("content_settings"):
         abort(404)
     mime_type = blob.properties["content_settings"]["content_type"]
@@ -83,6 +97,10 @@ async def content_file(path):
     return await send_file(blob_file, mimetype=mime_type, as_attachment=False, attachment_filename=path)
 
 
+def error_dict(error: Exception) -> dict:
+    return {"error": ERROR_MESSAGE.format(error_type=type(error))}
+
+
 @bp.route("/ask", methods=["POST"])
 async def ask():
     if not request.is_json:
@@ -100,14 +118,18 @@ async def ask():
             request_json["messages"], context=context, session_state=request_json.get("session_state")
         )
         return jsonify(r)
-    except Exception as e:
-        logging.exception("Exception in /ask")
-        return jsonify({"error": str(e)}), 500
+    except Exception as error:
+        logging.exception("Exception in /ask: %s", error)
+        return jsonify(error_dict(error)), 500
 
 
 async def format_as_ndjson(r: AsyncGenerator[dict, None]) -> AsyncGenerator[str, None]:
-    async for event in r:
-        yield json.dumps(event, ensure_ascii=False) + "\n"
+    try:
+        async for event in r:
+            yield json.dumps(event, ensure_ascii=False) + "\n"
+    except Exception as e:
+        logging.exception("Exception while generating response stream: %s", e)
+        yield json.dumps(error_dict(e))
 
 
 @bp.route("/chat", methods=["POST"])
@@ -134,9 +156,9 @@ async def chat():
         response = await make_response(format_as_ndjson(result))
         response.timeout = None  # type: ignore
         return response
-    except Exception as e:
-        logging.exception("Exception in /chat")
-        return jsonify({"error": str(e)}), 500
+    except Exception as error:
+        logging.exception("Exception in /chat: %s", error)
+        return jsonify(error_dict(error)), 500
 
 
 # Send MSAL.js settings to the client UI