
Commit

…into main
vpopescu committed Jan 16, 2023
2 parents f6dc990 + 21ad59d commit 44f9571
Showing 7 changed files with 268 additions and 27 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/release.yml
@@ -149,7 +149,7 @@ jobs:
branchRegex: ^\w[\w-.]*$

- name: Build and push jupyterhub
uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
uses: docker/build-push-action@37abcedcc1da61a57767b7588cb9d03eb57e28b3
with:
context: .
platforms: linux/amd64,linux/arm64
@@ -170,7 +170,7 @@ jobs:
branchRegex: ^\w[\w-.]*$

- name: Build and push jupyterhub-onbuild
uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
uses: docker/build-push-action@37abcedcc1da61a57767b7588cb9d03eb57e28b3
with:
build-args: |
BASE_IMAGE=${{ fromJson(steps.jupyterhubtags.outputs.tags)[0] }}
@@ -191,7 +191,7 @@ jobs:
branchRegex: ^\w[\w-.]*$

- name: Build and push jupyterhub-demo
uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
uses: docker/build-push-action@37abcedcc1da61a57767b7588cb9d03eb57e28b3
with:
build-args: |
BASE_IMAGE=${{ fromJson(steps.onbuildtags.outputs.tags)[0] }}
@@ -215,7 +215,7 @@ jobs:
branchRegex: ^\w[\w-.]*$

- name: Build and push jupyterhub/singleuser
uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
uses: docker/build-push-action@37abcedcc1da61a57767b7588cb9d03eb57e28b3
with:
build-args: |
JUPYTERHUB_VERSION=${{ github.ref_type == 'tag' && github.ref_name || format('git:{0}', github.sha) }}
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -189,6 +189,7 @@ def setup(app):
"https://github.com/jupyterhub/jupyterhub/pull/", # too many PRs in changelog
"https://github.com/jupyterhub/jupyterhub/compare/", # too many comparisons in changelog
r"https?://(localhost|127.0.0.1).*", # ignore localhost references in auto-links
r"https://jupyter.chameleoncloud.org", # FIXME: ignore (presumably) short-term SSL issue
]
linkcheck_anchors_ignore = [
"/#!",
1 change: 0 additions & 1 deletion docs/source/gallery-jhub-deployments.md
@@ -190,5 +190,4 @@ easy to do with RStudio too.
- https://wrdrd.com/docs/consulting/education-technology
- https://bitbucket.org/jackhale/fenics-jupyter
- [LinuxCluster blog](https://linuxcluster.wordpress.com/category/application/jupyterhub/)
- [Network Technology](https://arnesund.com/tag/jupyterhub/)
- [Spark Cluster on OpenStack with Multi-User Jupyter Notebook](https://arnesund.com/2015/09/21/spark-cluster-on-openstack-with-multi-user-jupyter-notebook/)
115 changes: 103 additions & 12 deletions docs/source/reference/database.md
@@ -1,29 +1,120 @@
# The Hub's Database

JupyterHub uses a database to store information about users, services, and other
data needed for operating the Hub.
JupyterHub uses a database to store information about users, services, and other data needed for operating the Hub.
This is the **state** of the Hub.

## Default SQLite database
## Why does JupyterHub have a database?

The default database for JupyterHub is a [SQLite](https://sqlite.org) database.
We have chosen SQLite as JupyterHub's default for its lightweight simplicity
in certain uses such as testing, small deployments and workshops.
JupyterHub is a **stateful** application (more on that 'state' later).
Updating JupyterHub's configuration or upgrading the version of JupyterHub requires restarting the JupyterHub process to apply the changes.
We want to minimize the disruption caused by restarting the Hub process, so it can be a mundane, frequent, routine activity.
Storing state information outside the process for later retrieval is necessary for this, and is one of the main things databases are for.

A lot of the operations in JupyterHub are also **relationships**, which is exactly what SQL databases are great at.
For example:

- Given an API token, what user is making the request?
- Which users don't have running servers?
- Which servers belong to user X?
- Which users have not been active in the last 24 hours?

Finally, a database allows us to have more information stored without needing it all loaded in memory,
e.g. supporting a large number (several thousands) of inactive users.
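These are classic relational queries. A toy sketch using the standard-library `sqlite3` module (the two-table schema here is illustrative, not JupyterHub's actual one):

```python
import sqlite3

# In-memory database with a toy users/servers schema
# (illustrative only, not JupyterHub's actual tables).
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE servers (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id)
    );
    INSERT INTO users (name) VALUES ('alice'), ('bob');
    INSERT INTO servers (user_id) VALUES (1);  -- alice has one server
    """
)

# "Which servers belong to user X?"
rows = db.execute(
    "SELECT servers.id FROM servers JOIN users ON servers.user_id = users.id "
    "WHERE users.name = ?",
    ("alice",),
).fetchall()
print(len(rows))  # 1

# "Which users don't have running servers?"
idle = db.execute(
    "SELECT name FROM users WHERE id NOT IN (SELECT user_id FROM servers)"
).fetchall()
print(idle)  # [('bob',)]
```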

## What's in the database?

The short answer of what's in the JupyterHub database is "everything."
JupyterHub's **state** lives in the database.
That is, everything JupyterHub needs to be aware of to function that _doesn't_ come from the configuration files, such as:

- users, roles, and role assignments
- state and URLs of running servers
- hashed API tokens
- short-lived state related to OAuth flows
- timestamps for when users, tokens, and servers were last used

### What's _not_ in the database

Not _quite_ all of JupyterHub's state is in the database.
This mostly involves transient state, such as the 'pending' transitions of Spawners (starting, stopping, etc.).
Anything not in the database must be reconstructed on Hub restart, and the only sources of information to do that are the database and JupyterHub configuration file(s).

## How does JupyterHub use the database?

JupyterHub makes some _unusual_ choices in how it connects to the database.
These choices represent trade-offs favoring single-process simplicity and performance at the expense of horizontal scalability (multiple Hub instances).

We often say that the Hub 'owns' the database.
This ownership means that we assume the Hub is the only process that will talk to the database.
This assumption enables us to make several caching optimizations that dramatically improve JupyterHub's performance (e.g. data written recently to the database can be read from memory instead of fetched again from the database), optimizations that would not work if multiple processes could be interacting with the database at the same time.

Database operations are also synchronous, so while JupyterHub is waiting on a database operation, it cannot respond to other requests.
This allows us to avoid complex locking mechanisms, because transaction races can only occur during an `await`, so we only need to make sure we've completed any given transaction before the next `await` in a given request.
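A toy event-loop example illustrates the guarantee (a hypothetical counter, not JupyterHub code): synchronous statements between two `await` points never interleave with other coroutines, so any transaction finished before the next `await` cannot race a concurrent request.

```python
import asyncio

counter = {"value": 0}

async def handler():
    # Synchronous read-modify-write with no await in between:
    # no other coroutine can run here, so no lock is needed.
    current = counter["value"]
    counter["value"] = current + 1
    await asyncio.sleep(0)  # only now may other handlers interleave

async def main():
    # 100 concurrent "requests"
    await asyncio.gather(*(handler() for _ in range(100)))

asyncio.run(main())
print(counter["value"])  # 100 -- no increments were lost
```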

:::{note}
We are slowly working to remove these assumptions, and moving to a more traditional db session per-request pattern.
This will enable multiple Hub instances and enable scaling JupyterHub, but will significantly reduce the number of active users a single Hub instance can serve.
:::

### Database performance in a typical request

Most authenticated requests to JupyterHub involve a few database transactions:

1. look up the authenticated user (e.g. look up token by hash, then resolve owner and permissions)
2. record activity
3. perform any relevant changes involved in processing the request (e.g. create the records for a running server when starting one)

This means that the database is involved in almost every request, but only in quite small, simple queries, e.g.:

- lookup one token by hash
- lookup one user by name
- list tokens or servers for one user (typically 1-10)
- etc.
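The token lookup in step 1 follows a common pattern: only a digest of each token is stored, and an incoming token is re-hashed to find its record. A minimal sketch (a dict stands in for the token table; this is not JupyterHub's actual hashing scheme):

```python
import hashlib
import secrets

token_table = {}  # digest -> owner; stands in for a database table

def new_token(owner):
    """Issue a token, storing only its digest."""
    token = secrets.token_hex(16)
    digest = hashlib.sha256(token.encode()).hexdigest()
    token_table[digest] = owner
    return token

def owner_of(token):
    """Look up one token by hash -- one small indexed query in a real DB."""
    digest = hashlib.sha256(token.encode()).hexdigest()
    return token_table.get(digest)

t = new_token("alice")
print(owner_of(t))        # alice
print(owner_of("bogus"))  # None
```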

### The database as a limiting factor

As a result of the above transactions in most requests, database performance is the _leading_ factor in JupyterHub's baseline requests-per-second performance, but that cost does not scale significantly with the number of users, active or otherwise.
However, the database is _rarely_ a limiting factor in JupyterHub performance in a practical sense, because the main thing JupyterHub does is start, stop, and monitor whole servers, which take far more time than any small database transaction, no matter how many records you have or how slow your database is (within reason).
Additionally, there is usually _very_ little load on the database itself.

By far the most taxing activity on the database is the 'list all users' endpoint, primarily used by the [idle-culling service](https://github.com/jupyterhub/jupyterhub-idle-culler).
Database-based optimizations have been added to make even these operations feasible for large numbers of users:

1. State filtering on [GET /users](./rest-api.md) with `?state=active`,
which limits the number of results in the query to only the relevant subset (added in JupyterHub 1.3), rather than all users.
2. [Pagination](api-pagination) of all list endpoints, allowing the request of a large number of resources to be more fairly balanced with other Hub activities across multiple requests (added in 2.0).
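From the client side, a paginated list endpoint is typically consumed by looping until a short page comes back. A sketch of that loop (the `fetch` callable and the response shape are simplified assumptions; see the REST API documentation for the exact pagination format):

```python
def fetch_all(fetch, limit=50):
    """Collect every item from a paginated list endpoint."""
    items, offset = [], 0
    while True:
        page = fetch(offset=offset, limit=limit)
        items.extend(page["items"])
        if len(page["items"]) < limit:  # short page: nothing left
            return items
        offset += limit

# fake backend standing in for e.g. GET /hub/api/users
data = [{"name": f"user-{i}"} for i in range(120)]

def fake_fetch(offset, limit):
    return {"items": data[offset : offset + limit]}

print(len(fetch_all(fake_fetch)))  # 120
```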

:::{note}
It's important to note, when discussing performance and limiting factors, that all of this only applies to requests to `/hub/...`.
The Hub and its database are not involved in most requests to single-user servers (`/user/...`), which is by design, and largely motivated by the fact that the Hub itself doesn't _need_ to be fast because its operations are infrequent and large.
:::

## Database backends

JupyterHub supports a variety of database backends via [SQLAlchemy][].
The default is SQLite, which works well for many cases, but you can use any of several backends supported by SQLAlchemy.
Usually, this will mean PostgreSQL or MySQL, both of which are well tested with JupyterHub.

[sqlalchemy]: https://www.sqlalchemy.org

### Default backend: SQLite

The default database backend for JupyterHub is [SQLite](https://sqlite.org).
We have chosen SQLite as JupyterHub's default because it's simple (the 'database' is a single file) and ubiquitous (it is in the Python standard library).
It works very well for testing, small deployments, and workshops.

For production systems, SQLite has some disadvantages when used with JupyterHub:

- `upgrade-db` may not work, and you may need to start with a fresh database
- `upgrade-db` may not always work, and you may need to start with a fresh database
- `downgrade-db` **will not** work if you want to roll back to an earlier
  version, so back up the `jupyterhub.sqlite` file before upgrading

The SQLite documentation provides a helpful page about [when to use SQLite and
where traditional RDBMS may be a better choice](https://sqlite.org/whentouse.html).
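No configuration is needed to use the default backend; for reference, it corresponds to this explicit setting in `jupyterhub_config.py`:

```python
# jupyterhub_config.py -- the default SQLite backend, written out explicitly
c.JupyterHub.db_url = 'sqlite:///jupyterhub.sqlite'
```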

## Using an RDBMS (PostgreSQL, MySQL)
### Picking your database backend (PostgreSQL, MySQL)

When running a long term deployment or a production system, we recommend using
a traditional RDBMS database, such as [PostgreSQL](https://www.postgresql.org)
or [MySQL](https://www.mysql.com), that supports the SQL `ALTER TABLE`
statement.
When running a long term deployment or a production system, we recommend using a full-fledged relational database, such as [PostgreSQL](https://www.postgresql.org) or [MySQL](https://www.mysql.com), that supports the SQL `ALTER TABLE` statement.
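The backend is selected with `c.JupyterHub.db_url` in `jupyterhub_config.py`. A sketch (hosts, database names, and credentials below are placeholders, and the corresponding driver package must be installed separately):

```python
# jupyterhub_config.py -- pointing the Hub at an external RDBMS.
# All hosts, ports, names, and passwords are placeholders.

# PostgreSQL (needs a driver such as psycopg2):
c.JupyterHub.db_url = 'postgresql://jupyterhub:SECRET@db.example.com:5432/jupyterhub'

# or MySQL / MariaDB (needs a driver such as pymysql):
# c.JupyterHub.db_url = 'mysql+pymysql://jupyterhub:SECRET@db.example.com:3306/jupyterhub'
```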

## Notes and Tips

21 changes: 14 additions & 7 deletions jupyterhub/app.py
@@ -298,9 +298,9 @@ def _load_classes(self):
return classes

load_groups = Dict(
Dict(),
Union([Dict(), List()]),
help="""
Dict of `{'group': {'users':['usernames'], properties : {}}` to load at startup.
Dict of `{'group': {'users': ['usernames'], 'properties': {}}}` to load at startup.
Example::
@@ -311,7 +311,8 @@ def _load_classes(self):
},
}
This strictly *adds* groups, users and properties to groups.
This strictly *adds* groups and users to groups.
Properties, if defined, replace all existing properties.
Loading one set of groups, then starting JupyterHub again with a different
set will not remove users or groups from previous launches.
@@ -2079,12 +2080,18 @@ async def init_groups(self):
for username in contents['users']:
username = self.authenticator.normalize_username(username)
user = await self._get_or_create_user(username)
self.log.debug(f"Adding user {username} to group {name}")
group.users.append(user)
if group not in user.groups:
self.log.debug(f"Adding user {username} to group {name}")
group.users.append(user)

if 'properties' in contents:
group_properties = contents['properties']
self.log.debug(f"Adding properties {group_properties} to group {name}")
group.properties = group_properties
if group.properties != group_properties:
# add equality check to avoid no-op db transactions
self.log.debug(
f"Adding properties to group {name}: {group_properties}"
)
group.properties = group_properties

db.commit()

136 changes: 136 additions & 0 deletions jupyterhub/tests/selenium/test_browser.py
@@ -1,6 +1,7 @@
"""Tests for the Selenium WebDriver"""

import asyncio
import json
from functools import partial

import pytest
@@ -15,6 +16,7 @@
from tornado.escape import url_escape
from tornado.httputil import url_concat

from jupyterhub import scopes
from jupyterhub.tests.selenium.locators import (
BarLocators,
HomePageLocators,
@@ -24,6 +26,7 @@
)
from jupyterhub.utils import exponential_backoff

from ... import orm, roles
from ...utils import url_path_join
from ..utils import api_request, public_host, public_url, ujoin

@@ -854,3 +857,136 @@ async def test_user_logout(app, browser, url, user):
while f"/user/{user.name}/" not in browser.current_url:
await webdriver_wait(browser, EC.url_matches(f"/user/{user.name}/"))
assert f"/user/{user.name}" in browser.current_url


# OAUTH confirmation page


@pytest.mark.parametrize(
"user_scopes",
[
([]), # no scopes
( # user has just access to own resources
[
'self',
]
),
( # user has access to all groups resources
[
'read:groups',
'groups',
]
),
( # user has access to specific users/groups/services resources
[
'read:users!user=gawain',
'read:groups!group=mythos',
'read:services!service=test',
]
),
],
)
async def test_oauth_page(
app,
browser,
mockservice_url,
create_temp_role,
create_user_with_scopes,
user_scopes,
):
# create user with appropriate access permissions
service_role = create_temp_role(user_scopes)
service = mockservice_url
user = create_user_with_scopes("access:services")
roles.grant_role(app.db, user, service_role)
oauth_client = (
app.db.query(orm.OAuthClient)
.filter_by(identifier=service.oauth_client_id)
.one()
)
oauth_client.allowed_scopes = sorted(roles.roles_to_scopes([service_role]))
app.db.commit()
# open the service url in the browser
service_url = url_path_join(public_url(app, service) + 'owhoami/?arg=x')
await in_thread(browser.get, (service_url))
expected_client_id = service.name
expected_redirect_url = app.base_url + f"services/{service.name}/oauth_callback"
assert expected_client_id in browser.current_url
assert expected_redirect_url in browser.current_url

# login user
await login(browser, user.name, pass_w=str(user.name))
auth_button = browser.find_element(By.XPATH, '//input[@type="submit"]')
if not auth_button.is_displayed():
await webdriver_wait(
browser,
EC.visibility_of_element_located((By.XPATH, '//input[@type="submit"]')),
)
# verify that user can see the service name and oauth URL
text_permission = browser.find_element(
By.XPATH, './/h1[text()="Authorize access"]//following::p'
).text
assert f"JupyterHub service {service.name}" in text_permission
assert f"oauth URL: {expected_redirect_url}" in text_permission
# permissions check
oauth_form = browser.find_element(By.TAG_NAME, "form")
scopes_elements = oauth_form.find_elements(
By.XPATH, '//input[@type="hidden" and @name="scopes"]'
)
scope_list_oauth_page = []
for scopes_element in scopes_elements:
# checking that scopes are invisible on the page
assert not scopes_element.is_displayed()
scope_value = scopes_element.get_attribute("value")
scope_list_oauth_page.append(scope_value)

# checking that all scopes granted to the user are present in the POST form (scope_list)
assert all(x in scope_list_oauth_page for x in user_scopes)
assert f"access:services!service={service.name}" in scope_list_oauth_page

check_boxes = oauth_form.find_elements(
By.XPATH, '//input[@type="checkbox" and @name="raw-scopes"]'
)
for check_box in check_boxes:
# checking that user cannot uncheck the checkbox
assert not check_box.is_enabled()
assert check_box.get_attribute("disabled")
assert check_box.get_attribute("title") == "This authorization is required"

# checking that appropriate descriptions are displayed depending on scopes
descriptions = oauth_form.find_elements(By.TAG_NAME, 'span')
desc_list_form = [description.text.strip() for description in descriptions]
# getting descriptions from scopes.py to compare them with descriptions on UI
scope_descriptions = scopes.describe_raw_scopes(
user_scopes or ['(no_scope)'], user.name
)
desc_list_expected = []
for scope_description in scope_descriptions:
description = scope_description.get("description")
text_filter = scope_description.get("filter")
if text_filter:
description = f"{description} Applies to {text_filter}."
desc_list_expected.append(description)

assert sorted(desc_list_form) == sorted(desc_list_expected)

# click on the Authorize button
await click(browser, (By.XPATH, '//input[@type="submit"]'))
# check that user returned to service page
assert browser.current_url == service_url

# check the granted permissions by
# getting the scopes from the service page,
# which contains the JupyterHub user model
text = browser.find_element(By.TAG_NAME, "body").text
user_model = json.loads(text)
authorized_scopes = user_model["scopes"]

# resolve the expected expanded scopes
# authorized for the service
expected_scopes = scopes.expand_scopes(user_scopes, owner=user.orm_user)
expected_scopes |= scopes.access_scopes(oauth_client)
expected_scopes |= scopes.identify_scopes(user.orm_user)

# compare the scopes on the service page with the expected scope list
assert sorted(authorized_scopes) == sorted(expected_scopes)
