Merge pull request #294 from lsst-sqre/tickets/DM-46034a
DM-46034: Reorganize database documentation
rra authored Sep 3, 2024
2 parents b8ba5b7 + a3f7f76 commit 94991e0
Showing 12 changed files with 540 additions and 488 deletions.
484 changes: 0 additions & 484 deletions docs/user-guide/database.rst

This file was deleted.

61 changes: 61 additions & 0 deletions docs/user-guide/database/datetime.rst
@@ -0,0 +1,61 @@
#####################################
Handling datetimes in database tables
#####################################

When a database column is defined through the SQLAlchemy ORM with the generic `~sqlalchemy.types.DateTime` type, it cannot store a timezone.
The SQL standard type `~sqlalchemy.types.DATETIME` may include a timezone with some database backends, but it is database-specific.
It is therefore normally easier to store times in the database in UTC without timezone information.

However, `~datetime.datetime` objects in regular Python code should always be timezone-aware and use the UTC timezone.
Timezone-naive datetime objects are often interpreted as being in the local timezone, whatever that happens to be.
Keeping all datetime objects as timezone-aware in the UTC timezone will minimize surprises from unexpected timezone conversions.
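
The difference between the two kinds of object can be seen with the standard library alone.
The following is an illustrative sketch that uses no Safir code; the variable names and the fixed UTC-7 offset are arbitrary choices for the example.

```python
from datetime import datetime, timedelta, timezone

# A timezone-aware datetime in UTC: unambiguous wherever the code runs.
aware = datetime(2024, 9, 3, 12, 0, 0, tzinfo=timezone.utc)

# A timezone-naive datetime with the same wall-clock fields.
naive = datetime(2024, 9, 3, 12, 0, 0)

# Aware objects convert between timezones without changing the instant.
pacific = aware.astimezone(timezone(timedelta(hours=-7)))
same_instant = pacific == aware

# Naive and aware objects cannot be ordered against each other, which is
# one reason to keep application code consistently timezone-aware.
try:
    naive < aware
    mixed_comparison_fails = False
except TypeError:
    mixed_comparison_fails = True
```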

This unfortunately means that the code for storing and retrieving datetime objects from the database needs a conversion layer.
asyncpg_ wisely declines to convert datetime objects.
It therefore returns timezone-naive objects from the database, and raises an exception if a timezone-aware datetime object is stored in a `~sqlalchemy.types.DateTime` field.
The conversion must therefore be done in the code making SQLAlchemy calls.

Safir provides `~safir.database.datetime_to_db` and `~safir.database.datetime_from_db` helper functions to convert from a timezone-aware datetime to a timezone-naive datetime suitable for storing in a DateTime column, and vice versa.
These helper functions should be used wherever `~sqlalchemy.types.DateTime` columns are read or updated.

`~safir.database.datetime_to_db` ensures the provided datetime object is timezone-aware and in UTC and converts it to a timezone-naive UTC datetime for database storage.
It raises `ValueError` if passed a timezone-naive datetime object.

`~safir.database.datetime_from_db` ensures the provided datetime object is either timezone-naive or in UTC and returns a timezone-aware UTC datetime object.

Both raise `ValueError` if passed datetime objects in some other timezone.
Both return `None` if passed `None`.
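
The behavior described above can be sketched with the standard library.
This is not Safir's implementation: ``to_db_sketch`` and ``from_db_sketch`` are hypothetical names used only to make the documented contract concrete, and real code should call the Safir helpers.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def to_db_sketch(time: Optional[datetime]) -> Optional[datetime]:
    """Aware UTC datetime -> naive UTC datetime (illustrative only)."""
    if time is None:
        return None
    if time.utcoffset() != timedelta(0):
        # Covers both naive datetimes (utcoffset() is None) and
        # datetimes in a timezone other than UTC.
        raise ValueError(f"datetime {time} is not timezone-aware UTC")
    return time.replace(tzinfo=None)


def from_db_sketch(time: Optional[datetime]) -> Optional[datetime]:
    """Naive or UTC datetime -> aware UTC datetime (illustrative only)."""
    if time is None:
        return None
    offset = time.utcoffset()
    if offset is not None and offset != timedelta(0):
        raise ValueError(f"datetime {time} is not in UTC")
    return time.replace(tzinfo=timezone.utc)


aware = datetime(2024, 9, 3, 12, 0, tzinfo=timezone.utc)
stored = to_db_sketch(aware)  # naive, suitable for a DateTime column
restored = from_db_sketch(stored)  # aware again, equal to the original
```

The round trip loses no information because both sides agree that the naive value is in UTC.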

Examples
========

Here is an example of reading an object from the database that includes DateTime columns:

.. code-block:: python

   from safir.database import datetime_from_db
   from sqlalchemy import select

   stmt = select(SQLJob).where(SQLJob.id == job_id)
   result = (await session.execute(stmt)).scalar_one()

   # Convert the naive datetimes on the database row to timezone-aware UTC.
   job = Job(
       job_id=result.id,
       # ...
       creation_time=datetime_from_db(result.creation_time),
       start_time=datetime_from_db(result.start_time),
       end_time=datetime_from_db(result.end_time),
       destruction_time=datetime_from_db(result.destruction_time),
       # ...
   )

Here is an example of updating a DateTime field in the database:

.. code-block:: python

   from safir.database import datetime_to_db
   from sqlalchemy import select

   async with session.begin():
       stmt = select(SQLJob).where(SQLJob.id == job_id)
       job = (await session.execute(stmt)).scalar_one()
       # Strip the timezone before storing in the DateTime column.
       job.destruction_time = datetime_to_db(destruction_time)

77 changes: 77 additions & 0 deletions docs/user-guide/database/dependency.rst
@@ -0,0 +1,77 @@
############################################
Using a database session in request handlers
############################################

For FastAPI applications, Safir provides a FastAPI dependency that creates a database session for each request.
This uses the `SQLAlchemy async_scoped_session <https://docs.sqlalchemy.org/en/14/orm/extensions/asyncio.html#using-asyncio-scoped-session>`__ to transparently manage a separate session per running task.

Initialize the dependency
=========================

To use the database session dependency, it must first be initialized during application startup.
Generally this is done inside the application lifespan function.
You must also close the dependency during application shutdown.

.. code-block:: python

   from collections.abc import AsyncIterator
   from contextlib import asynccontextmanager

   from fastapi import FastAPI
   from safir.dependencies.db_session import db_session_dependency

   from .config import config


   @asynccontextmanager
   async def lifespan(app: FastAPI) -> AsyncIterator[None]:
       await db_session_dependency.initialize(
           config.database_url, config.database_password
       )
       yield
       await db_session_dependency.aclose()


   app = FastAPI(lifespan=lifespan)

This example assumes the application has a ``config`` object with the application settings, including the database URL and password.

Using the dependency
====================

Any handler that needs a database session can depend on the `~safir.dependencies.db_session.db_session_dependency`:

.. code-block:: python

   from typing import Annotated

   from fastapi import Depends
   from safir.dependencies.db_session import db_session_dependency
   from sqlalchemy.ext.asyncio import async_scoped_session


   @app.get("/")
   async def get_index(
       session: Annotated[
           async_scoped_session, Depends(db_session_dependency)
       ],
   ) -> dict[str, str]:
       async with session.begin():
           # ... do something with session here ...
           return {}

Transaction management
======================

The application must manage transactions when using the Safir database dependency.
SQLAlchemy will automatically start a transaction if you perform any database operation using a session (including read-only operations).
If that transaction is not explicitly ended, `asyncpg`_ may leave it open, which will cause database deadlocks and other problems.

Generally it's best to manage the transaction in the handler function (see the ``get_index`` example, above).
Wrap all code that may make database calls in an ``async with session.begin()`` block.
This will open a transaction, commit the transaction at the end of the block, and roll back the transaction if the block raises an exception.
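
The commit-on-success, roll-back-on-exception contract can be observed without a PostgreSQL server.
The sketch below uses the standard library's ``sqlite3`` connection context manager purely as a synchronous stand-in for ``async with session.begin()``.
The table and values are made up, and this is not how Safir or SQLAlchemy is invoked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, phase TEXT)")

# The block exits normally, so the insert is committed.
with conn:
    conn.execute("INSERT INTO job (id, phase) VALUES (1, 'PENDING')")

# The block raises, so the update is rolled back.
try:
    with conn:
        conn.execute("UPDATE job SET phase = 'EXECUTING' WHERE id = 1")
        raise RuntimeError("simulated failure inside the transaction")
except RuntimeError:
    pass

phase = conn.execute("SELECT phase FROM job WHERE id = 1").fetchone()[0]
conn.close()
```

After both blocks run, ``phase`` still holds the value written by the first, committed block.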

.. note::

   Due to an as-yet-unexplained interaction with FastAPI 0.74 and later, managing the transaction inside the database session dependency does not work.
   Calling ``await session.commit()`` there, either explicitly or implicitly via a context manager, immediately fails by raising ``asyncio.CancelledError`` and the transaction is not committed or closed.

28 changes: 28 additions & 0 deletions docs/user-guide/database/index.rst
@@ -0,0 +1,28 @@
######################
Using the database API
######################

Safir-based applications that use a SQL database can use Safir to initialize that database and acquire a database session.
Safir-based applications that use FastAPI can also use the Safir-provided FastAPI dependency to manage per-request database sessions.
The Safir database support is based on `SQLAlchemy`_ and assumes use of PostgreSQL (possibly via `Cloud SQL <https://cloud.google.com/sql>`__) as the underlying database.

Safir is an asyncio framework and requires using SQLAlchemy's asyncio API.
Safir uses the `asyncpg`_ PostgreSQL database driver.

Database support in Safir is optional.
To use it, depend on ``safir[db]`` in your pip requirements.

Also see :ref:`pydantic-dsns` for Pydantic types that help with configuring the PostgreSQL DSN.

Guides
======

.. toctree::
   :titlesonly:

   initialize
   dependency
   session
   datetime
   retry
   testing

130 changes: 130 additions & 0 deletions docs/user-guide/database/initialize.rst
@@ -0,0 +1,130 @@
#######################
Initializing a database
#######################

Safir supports simple initialization of a database with a schema provided by the application.
By default, this only adds any declared but missing tables, indices, or other objects, and thus does nothing if the database is already initialized.
The application may also request a database reset, which will drop and recreate all of the tables in the current schema.

More complex database schema upgrades are not supported by Safir.
If those are required, consider using `Alembic <https://alembic.sqlalchemy.org/en/latest/>`__.

Define the database schema
==========================

Database initialization in Safir assumes that the application has defined the database schema via the SQLAlchemy ORM.
The recommended way to do this is to add a ``schema`` directory to the application containing the table definitions.

In the file :file:`schema/base.py`, define the SQLAlchemy declarative base:

.. code-block:: python

   from sqlalchemy.orm import declarative_base

   Base = declarative_base()

In other files in that directory, define the database tables using the normal SQLAlchemy ORM syntax, one table per file.
Each database table definition must inherit from ``Base``, imported from ``.base``.

In :file:`schema/__init__.py`, import the table definitions from all of the files in the directory, as well as the ``Base`` variable, and export them using ``__all__``.

Using non-default PostgreSQL schemas
------------------------------------

Sometimes it is convenient for multiple applications to share the same database but use separate collections of tables.
PostgreSQL supports serving multiple schemas (in the sense of a namespace of tables) from the same database, and SQLAlchemy supports specifying the PostgreSQL schema for a given collection of tables.

The normal way to do this in SQLAlchemy is to modify the `~sqlalchemy.orm.DeclarativeBase` subclass used by the table definitions to specify a non-default schema.
For example:

.. code-block:: python

   from sqlalchemy import MetaData
   from sqlalchemy.orm import DeclarativeBase

   from ..config import config


   class Base(DeclarativeBase):
       metadata = MetaData(schema=config.database_schema)

If ``config.database_schema`` is `None`, the default schema will be used; otherwise, SQLAlchemy will use the specified schema instead of the default one.

Safir supports this in database initialization by creating a non-default schema if one is set.
If the ``schema`` attribute is set (via code like the above) on the SQLAlchemy metadata passed to the ``schema`` parameter of `~safir.database.initialize_database`, it will create that schema in the PostgreSQL database if it does not already exist.

Add a CLI command to initialize the database
============================================

The recommended approach to add database initialization to an application is to add an ``init`` command to the command-line interface that runs the database initialization code.
For applications using Click_ (the recommended way to implement a command-line interface), this can be done with code like:

.. code-block:: python

   import click
   import structlog
   from safir.asyncio import run_with_asyncio
   from safir.database import create_database_engine, initialize_database

   from .config import config
   from .schema import Base

   # Definition of main omitted.


   @main.command()
   @click.option(
       "--reset", is_flag=True, help="Delete all existing database data."
   )
   @run_with_asyncio
   async def init(reset: bool) -> None:
       logger = structlog.get_logger(config.logger_name)
       engine = create_database_engine(
           config.database_url, config.database_password
       )
       await initialize_database(
           engine, logger, schema=Base.metadata, reset=reset
       )
       await engine.dispose()

This code assumes that ``main`` is the Click entry point and ``.config`` provides a ``config`` object that contains the settings for the application, including the database URL and password as well as the normal Safir configuration settings.

The database URL may be a Pydantic ``Url`` type or a `str`.
The database password may be a ``pydantic.SecretStr``, a `str`, or `None` if no password is required by the database.

If it receives a connection error from the database, Safir will attempt the initialization five times, two seconds apart, to allow time for networking or a database proxy to start.

To drop and recreate all of the tables, pass the ``reset=True`` option to `~safir.database.initialize_database`.

Run database initialization on pod startup
==========================================

The recommended pattern for Safir-based applications that use a database but do not use Alembic is to initialize the database every time the pod has been restarted.

Since initialization does nothing if the schema already exists, this is safe to do.
It only wastes a bit of time during normal startup.
This allows the application to be deployed on a new cluster without any special initialization step.

The easiest way to do this is to add a script (conventionally located in ``scripts/start-frontend.sh``) that runs the ``init`` command and then starts the application with Uvicorn_:

.. code-block:: sh

   #!/bin/bash

   set -eu

   application init
   uvicorn application.main:app --host 0.0.0.0 --port 8080

Replace ``application`` with the application entry point (in the ``init`` line) and the Python module (in the ``uvicorn`` line).
(These may be different if the application name contains dashes.)

Then, use this as the default command for the Docker image:

.. code-block:: docker

   COPY scripts/start-frontend.sh /start-frontend.sh
   CMD ["/start-frontend.sh"]

As a side effect, this will test database connectivity during pod startup and wait for network or a database proxy to be ready if needed, which avoids the need for testing database connectivity during the application startup.
66 changes: 66 additions & 0 deletions docs/user-guide/database/retry.rst
@@ -0,0 +1,66 @@
##############################
Retrying database transactions
##############################

To aid in retrying transactions, Safir provides a decorator function, `safir.database.retry_async_transaction`.
Retrying transactions is often useful in conjunction with a custom transaction isolation level.

Setting an isolation level
==========================

If you have multiple simultaneous database writers and need to coordinate their writes to ensure consistent results, you may have to set a custom isolation level, such as ``REPEATABLE READ``.
In this case, transactions that attempt to modify an object that was modified by a different connection will raise an exception and can be retried.

`~safir.database.create_database_engine` and the ``initialize`` method of `~safir.dependencies.db_session.db_session_dependency` take an optional ``isolation_level`` argument that can be used to set a non-default isolation level.
If given, this parameter is passed through to the underlying SQLAlchemy engine.

See `the SQLAlchemy isolation level documentation <https://docs.sqlalchemy.org/en/20/orm/session_transaction.html#setting-transaction-isolation-levels-dbapi-autocommit>`__ for more information.
See :doc:`dependency` and :doc:`session` for more information about initializing the database engine.

Retrying transactions
=====================

To retry failed transactions in a function or method, decorate that function or method with `~safir.database.retry_async_transaction`.

If the decorated function or method raises `sqlalchemy.exc.DBAPIError` (the parent exception for exceptions raised by the underlying database API), it will be rerun with the same arguments.
This will be repeated a configurable number of times (three by default).
The decorated function or method must therefore be idempotent and safe to run repeatedly.
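
The retry loop can be sketched in plain asyncio.
This is a minimal illustration, not Safir's implementation: ``retry_sketch``, ``TransientError`` (a stand-in for `sqlalchemy.exc.DBAPIError`), and ``flaky_update`` are hypothetical names, and the sketch assumes the configured number counts total attempts.

```python
import asyncio
from functools import wraps


class TransientError(Exception):
    """Stand-in for sqlalchemy.exc.DBAPIError in this sketch."""


def retry_sketch(max_tries: int = 3, delay: float = 0.5):
    """Retry an async callable when it raises TransientError."""

    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_tries + 1):
                try:
                    return await func(*args, **kwargs)
                except TransientError:
                    if attempt == max_tries:
                        raise  # Out of attempts: surface the last error.
                    await asyncio.sleep(delay)

        return wrapper

    return decorator


attempts = 0


@retry_sketch(max_tries=3, delay=0)
async def flaky_update() -> str:
    # Fails twice, then succeeds, to exercise the retry path.
    global attempts
    attempts += 1
    if attempts < 3:
        raise TransientError("simulated serialization failure")
    return "committed"


result = asyncio.run(flaky_update())
```

Because the whole function body runs again on each attempt, any side effects outside the database transaction are repeated as well, which is why the decorated function must be idempotent.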

Here's a simplified example from the storage layer of a Safir application:

.. code-block:: python

   from datetime import datetime

   from safir.database import datetime_to_db, retry_async_transaction
   from sqlalchemy.ext.asyncio import async_scoped_session


   class Storage:
       def __init__(self, session: async_scoped_session) -> None:
           self._session = session

       @retry_async_transaction
       async def mark_start(self, job_id: str, start: datetime) -> None:
           async with self._session.begin():
               # _get_job is omitted from this simplified example.
               job = await self._get_job(job_id)
               if job.phase in ("PENDING", "QUEUED"):
                   job.phase = "EXECUTING"
                   job.start_time = datetime_to_db(start)

If this method races with other methods updating the same job, the custom isolation level will force this update to fail with an exception, and it will then be retried by the decorator.

Changing the retry delay
------------------------

By default, the decorator waits half a second between attempts and attempts the method three times.
Both can be changed with the ``delay`` and ``max_tries`` parameters to the decorator, for example:

.. code-block:: python
   :emphasize-lines: 2

   class Storage:
       @retry_async_transaction(max_tries=5, delay=2.5)
       async def mark_start(self, job_id: str, start: datetime) -> None:
           async with self._session.begin():
               ...