Merge branch 'main' of https://github.com/AustrianDataLAB/jupyterhub into main
AustrianDataLAB · Jan 16, 2023 · 44f9571
2 parents f6dc990 + 21ad59d
commit 44f9571

Showing 7 changed files with 268 additions and 27 deletions.
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -149,7 +149,7 @@ jobs:
           branchRegex: ^\w[\w-.]*$
 
       - name: Build and push jupyterhub
-        uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
+        uses: docker/build-push-action@37abcedcc1da61a57767b7588cb9d03eb57e28b3
         with:
           context: .
           platforms: linux/amd64,linux/arm64
@@ -170,7 +170,7 @@ jobs:
           branchRegex: ^\w[\w-.]*$
 
       - name: Build and push jupyterhub-onbuild
-        uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
+        uses: docker/build-push-action@37abcedcc1da61a57767b7588cb9d03eb57e28b3
         with:
           build-args: |
             BASE_IMAGE=${{ fromJson(steps.jupyterhubtags.outputs.tags)[0] }}
@@ -191,7 +191,7 @@ jobs:
           branchRegex: ^\w[\w-.]*$
 
       - name: Build and push jupyterhub-demo
-        uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
+        uses: docker/build-push-action@37abcedcc1da61a57767b7588cb9d03eb57e28b3
         with:
           build-args: |
             BASE_IMAGE=${{ fromJson(steps.onbuildtags.outputs.tags)[0] }}
@@ -215,7 +215,7 @@ jobs:
           branchRegex: ^\w[\w-.]*$
 
       - name: Build and push jupyterhub/singleuser
-        uses: docker/build-push-action@c56af957549030174b10d6867f20e78cfd7debc5
+        uses: docker/build-push-action@37abcedcc1da61a57767b7588cb9d03eb57e28b3
         with:
           build-args: |
             JUPYTERHUB_VERSION=${{ github.ref_type == 'tag' && github.ref_name || format('git:{0}', github.sha) }}

diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -189,6 +189,7 @@ def setup(app):
     "https://github.com/jupyterhub/jupyterhub/pull/",  # too many PRs in changelog
     "https://github.com/jupyterhub/jupyterhub/compare/",  # too many comparisons in changelog
     r"https?://(localhost|127.0.0.1).*",  # ignore localhost references in auto-links
+    r"https://jupyter.chameleoncloud.org",  # FIXME: ignore (presumably) short-term SSL issue
 ]
 linkcheck_anchors_ignore = [
     "/#!",

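The entries added to the ignore list in docs/source/conf.py are regular expressions. A minimal sketch of how such patterns can be tested against a URL, assuming they are applied with `re.match` (anchored at the start of the URL); the helper name `is_ignored` is hypothetical and not part of Sphinx or JupyterHub:

```python
import re

# Patterns copied from the ignore list shown in the docs/source/conf.py hunk
ignore_patterns = [
    r"https?://(localhost|127.0.0.1).*",
    r"https://jupyter.chameleoncloud.org",
]


def is_ignored(url, patterns=ignore_patterns):
    """Return True if any pattern matches the start of the URL (hypothetical helper)."""
    return any(re.match(pattern, url) for pattern in patterns)


print(is_ignored("http://localhost:8000/hub/login"))
print(is_ignored("https://jupyter.chameleoncloud.org/hub"))
print(is_ignored("https://jupyter.org"))
```

Because the dots in these patterns are unescaped, each one matches any character; wrapping a literal host name in `re.escape` would make the match exact.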
diff --git a/docs/source/gallery-jhub-deployments.md b/docs/source/gallery-jhub-deployments.md
@@ -190,5 +190,4 @@ easy to do with RStudio too.
 - https://wrdrd.com/docs/consulting/education-technology
 - https://bitbucket.org/jackhale/fenics-jupyter
 - [LinuxCluster blog](https://linuxcluster.wordpress.com/category/application/jupyterhub/)
-- [Network Technology](https://arnesund.com/tag/jupyterhub/)
 - [Spark Cluster on OpenStack with Multi-User Jupyter Notebook](https://arnesund.com/2015/09/21/spark-cluster-on-openstack-with-multi-user-jupyter-notebook/)
diff --git a/docs/source/reference/database.md b/docs/source/reference/database.md
@@ -1,29 +1,120 @@
 # The Hub's Database
 
-JupyterHub uses a database to store information about users, services, and other
-data needed for operating the Hub.
+JupyterHub uses a database to store information about users, services, and other data needed for operating the Hub.
+This is the **state** of the Hub.
 
-## Default SQLite database
+## Why does JupyterHub have a database?
 
-The default database for JupyterHub is a [SQLite](https://sqlite.org) database.
-We have chosen SQLite as JupyterHub's default for its lightweight simplicity
-in certain uses such as testing, small deployments and workshops.
+JupyterHub is a **stateful** application (more on that 'state' later).
+Updating JupyterHub's configuration or upgrading the version of JupyterHub requires restarting the JupyterHub process to apply the changes.
+We want to minimize the disruption caused by restarting the Hub process, so it can be a mundane, frequent, routine activity.
+Storing state information outside the process for later retrieval is necessary for this, and one of the main things databases are for.
+
+Many operations in JupyterHub also involve **relationships**, which is exactly what SQL databases are great at.
+For example:
+
+- Given an API token, what user is making the request?
+- Which users don't have running servers?
+- Which servers belong to user X?
+- Which users have not been active in the last 24 hours?
+
+Finally, a database allows us to store more information without needing it all loaded in memory,
+e.g. supporting a large number (several thousand) of inactive users.
+
+## What's in the database?
+
+The short answer to "what's in the JupyterHub database?" is "everything."
+JupyterHub's **state** lives in the database.
+That is, everything JupyterHub needs to be aware of to function that _doesn't_ come from the configuration files, such as
+
+- users, roles, role assignments
+- state and URLs of running servers
+- hashed API tokens
+- short-lived state related to the OAuth flow
+- timestamps for when users, tokens, and servers were last used
+
+### What's _not_ in the database
+
+Not _quite_ all of JupyterHub's state is in the database.
+This mostly involves transient state, such as the 'pending' transitions of Spawners (starting, stopping, etc.).
+Anything not in the database must be reconstructed on Hub restart, and the only sources of information to do that are the database and JupyterHub configuration file(s).
+
+## How does JupyterHub use the database?
+
+JupyterHub makes some _unusual_ choices in how it connects to the database.
+These choices represent trade-offs favoring single-process simplicity and performance at the expense of horizontal scalability (multiple Hub instances).
+
+We often say that the Hub 'owns' the database.
+This ownership means that we assume the Hub is the only process that will talk to the database.
+This assumption enables us to make several caching optimizations that dramatically improve JupyterHub's performance (i.e. data written recently to the database can be read from memory instead of fetched again from the database) that would not work if multiple processes could be interacting with the database at the same time.
+
+Database operations are also synchronous, so while JupyterHub is waiting on a database operation, it cannot respond to other requests.
+This allows us to avoid complex locking mechanisms, because transaction races can only occur during an `await`, so we only need to make sure we've completed any given transaction before the next `await` in a given request.
+
+:::{note}
+We are slowly working to remove these assumptions, and moving to a more traditional db session per-request pattern.
+This will enable multiple Hub instances and enable scaling JupyterHub, but will significantly reduce the number of active users a single Hub instance can serve.
+:::
+
+### Database performance in a typical request
+
+Most authenticated requests to JupyterHub involve a few database transactions:
+
+1. look up the authenticated user (e.g. look up token by hash, then resolve owner and permissions)
+2. record activity
+3. perform any relevant changes involved in processing the request (e.g. create the records for a running server when starting one)
+
+This means that the database is involved in almost every request, but only in quite small, simple queries, e.g.:
+
+- look up one token by hash
+- look up one user by name
+- list tokens or servers for one user (typically 1-10)
+- etc.
+
+### The database as a limiting factor
+
+As a result of the above transactions in most requests, database performance is the _leading_ factor in JupyterHub's baseline requests-per-second performance, but that cost does not scale significantly with the number of users, active or otherwise.
+However, the database is _rarely_ a limiting factor in JupyterHub performance in a practical sense, because the main thing JupyterHub does is start, stop, and monitor whole servers, which take far more time than any small database transaction, no matter how many records you have or how slow your database is (within reason).
+Additionally, there is usually _very_ little load on the database itself.
+
+By far the most taxing activity on the database is the 'list all users' endpoint, primarily used by the [idle-culling service](https://github.com/jupyterhub/jupyterhub-idle-culler).
+Database-based optimizations have been added to make even these operations feasible for large numbers of users:
+
+1. State filtering on [GET /users](./rest-api.md) with `?state=active`,
+   which limits the number of results in the query to only the relevant subset (added in JupyterHub 1.3), rather than all users.
+2. [Pagination](api-pagination) of all list endpoints, allowing the request of a large number of resources to be more fairly balanced with other Hub activities across multiple requests (added in 2.0).
+
+:::{note}
+When discussing performance and limiting factors, it's important to note that all of this applies only to requests to `/hub/...`.
+The Hub and its database are not involved in most requests to single-user servers (`/user/...`), which is by design, and largely motivated by the fact that the Hub itself doesn't _need_ to be fast because its operations are infrequent and large.
+:::
+
+## Database backends
+
+JupyterHub supports a variety of database backends via [SQLAlchemy][].
+The default is SQLite, which works great for many cases, but you should be able to use many backends supported by SQLAlchemy.
+Usually, this will mean PostgreSQL or MySQL, both of which are well tested with JupyterHub.
+
+[sqlalchemy]: https://www.sqlalchemy.org
+
+### Default backend: SQLite
+
+The default database backend for JupyterHub is [SQLite](https://sqlite.org).
+We have chosen SQLite as JupyterHub's default because it's simple (the 'database' is a single file) and ubiquitous (it is in the Python standard library).
+It works very well for testing, small deployments, and workshops.
 
 For production systems, SQLite has some disadvantages when used with JupyterHub:
 
-- `upgrade-db` may not work, and you may need to start with a fresh database
+- `upgrade-db` may not always work, and you may need to start with a fresh database
 - `downgrade-db` **will not** work if you want to rollback to an earlier
   version, so backup the `jupyterhub.sqlite` file before upgrading
 
 The sqlite documentation provides a helpful page about [when to use SQLite and
 where traditional RDBMS may be a better choice](https://sqlite.org/whentouse.html).
 
-## Using an RDBMS (PostgreSQL, MySQL)
+### Picking your database backend (PostgreSQL, MySQL)
 
-When running a long term deployment or a production system, we recommend using
-a traditional RDBMS database, such as [PostgreSQL](https://www.postgresql.org)
-or [MySQL](https://www.mysql.com), that supports the SQL `ALTER TABLE`
-statement.
+When running a long-term deployment or a production system, we recommend using a full-fledged relational database, such as [PostgreSQL](https://www.postgresql.org) or [MySQL](https://www.mysql.com), that supports the SQL `ALTER TABLE` statement.
 
 ## Notes and Tips
 

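The relationship questions listed in database.md ("given an API token, what user is making the request?", "which users don't have running servers?") map naturally onto SQL joins. A minimal sketch using the standard-library `sqlite3` module; the table and column names here are hypothetical and are not JupyterHub's actual schema:

```python
import sqlite3

# In-memory database with a made-up, simplified schema (not JupyterHub's real one)
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE api_tokens (hashed TEXT PRIMARY KEY, user_id INTEGER REFERENCES users(id));
    CREATE TABLE servers (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id));
    INSERT INTO users (id, name) VALUES (1, 'gawain'), (2, 'lancelot');
    INSERT INTO api_tokens (hashed, user_id) VALUES ('abc123', 1);
    INSERT INTO servers (id, user_id) VALUES (10, 1);
    """
)


def user_for_token(hashed):
    """Given a token hash, return the name of the owning user, or None."""
    row = conn.execute(
        "SELECT users.name FROM api_tokens "
        "JOIN users ON users.id = api_tokens.user_id "
        "WHERE api_tokens.hashed = ?",
        (hashed,),
    ).fetchone()
    return row[0] if row else None


def users_without_servers():
    """Return the names of users that have no row in the servers table."""
    rows = conn.execute(
        "SELECT users.name FROM users "
        "LEFT JOIN servers ON servers.user_id = users.id "
        "WHERE servers.id IS NULL ORDER BY users.name"
    ).fetchall()
    return [name for (name,) in rows]


print(user_for_token("abc123"))
print(users_without_servers())
```

Each question is a single small query, which is in line with the "quite small, simple queries" the document describes for a typical request.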
diff --git a/jupyterhub/app.py b/jupyterhub/app.py
@@ -298,9 +298,9 @@ def _load_classes(self):
         return classes
 
     load_groups = Dict(
-        Dict(),
+        Union([Dict(), List()]),
         help="""
-        Dict of `{'group': {'users':['usernames'], properties : {}}`  to load at startup.
+        Dict of `{'group': {'users': ['usernames'], 'properties': {}}}` to load at startup.
 
         Example::
 
@@ -311,7 +311,8 @@ def _load_classes(self):
                 },
             }
 
-        This strictly *adds* groups, users and properties to groups.
+        This strictly *adds* groups and users to groups.
+        Properties, if defined, replace all existing properties.
 
         Loading one set of groups, then starting JupyterHub again with a different
         set will not remove users or groups from previous launches.
@@ -2079,12 +2080,18 @@ async def init_groups(self):
                 for username in contents['users']:
                     username = self.authenticator.normalize_username(username)
                     user = await self._get_or_create_user(username)
-                    self.log.debug(f"Adding user {username} to group {name}")
-                    group.users.append(user)
+                    if group not in user.groups:
+                        self.log.debug(f"Adding user {username} to group {name}")
+                        group.users.append(user)
+
             if 'properties' in contents:
                 group_properties = contents['properties']
-                self.log.debug(f"Adding properties {group_properties} to group {name}")
-                group.properties = group_properties
+                if group.properties != group_properties:
+                    # add equality check to avoid no-op db transactions
+                    self.log.debug(
+                        f"Adding properties to group {name}: {group_properties}"
+                    )
+                    group.properties = group_properties
 
         db.commit()
 

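The `init_groups` change makes group loading idempotent: users are appended only if they are not already members, and properties are written only when they differ. A minimal sketch of that idea using plain dicts and sets instead of the ORM; the function name `apply_load_groups` and the data layout are hypothetical, and only the dict form of the configuration is handled:

```python
def apply_load_groups(groups, load_groups):
    """Apply a load_groups-style dict to `groups` and return the list of changes made.

    `groups` maps group name -> {"users": set, "properties": dict} (hypothetical layout).
    """
    changes = []
    for name, contents in load_groups.items():
        if name not in groups:
            groups[name] = {"users": set(), "properties": {}}
            changes.append(f"create group {name}")
        group = groups[name]
        for username in contents.get("users", []):
            # membership check avoids re-adding an existing member
            if username not in group["users"]:
                group["users"].add(username)
                changes.append(f"add {username} to {name}")
        # equality check avoids a no-op write when properties are unchanged
        if "properties" in contents and group["properties"] != contents["properties"]:
            group["properties"] = dict(contents["properties"])
            changes.append(f"set properties on {name}")
    return changes


groups = {}
config = {"blue": {"users": ["gawain"], "properties": {"color": "blue"}}}
first = apply_load_groups(groups, config)
second = apply_load_groups(groups, config)
print(first)
print(second)
```

Running the same configuration twice produces changes only on the first pass, which mirrors the intent of skipping no-op database transactions on Hub restart.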
diff --git a/jupyterhub/tests/selenium/test_browser.py b/jupyterhub/tests/selenium/test_browser.py
@@ -1,6 +1,7 @@
 """Tests for the Selenium WebDriver"""
 
 import asyncio
+import json
 from functools import partial
 
 import pytest
@@ -15,6 +16,7 @@
 from tornado.escape import url_escape
 from tornado.httputil import url_concat
 
+from jupyterhub import scopes
 from jupyterhub.tests.selenium.locators import (
     BarLocators,
     HomePageLocators,
@@ -24,6 +26,7 @@
 )
 from jupyterhub.utils import exponential_backoff
 
+from ... import orm, roles
 from ...utils import url_path_join
 from ..utils import api_request, public_host, public_url, ujoin
 
@@ -854,3 +857,136 @@ async def test_user_logout(app, browser, url, user):
     while f"/user/{user.name}/" not in browser.current_url:
         await webdriver_wait(browser, EC.url_matches(f"/user/{user.name}/"))
     assert f"/user/{user.name}" in browser.current_url
+
+
+# OAUTH confirmation page
+
+
+@pytest.mark.parametrize(
+    "user_scopes",
+    [
+        ([]),  # no scopes
+        (  # user has just access to own resources
+            [
+                'self',
+            ]
+        ),
+        (  # user has access to all groups resources
+            [
+                'read:groups',
+                'groups',
+            ]
+        ),
+        (  # user has access to specific users/groups/services resources
+            [
+                'read:users!user=gawain',
+                'read:groups!group=mythos',
+                'read:services!service=test',
+            ]
+        ),
+    ],
+)
+async def test_oauth_page(
+    app,
+    browser,
+    mockservice_url,
+    create_temp_role,
+    create_user_with_scopes,
+    user_scopes,
+):
+    # create user with appropriate access permissions
+    service_role = create_temp_role(user_scopes)
+    service = mockservice_url
+    user = create_user_with_scopes("access:services")
+    roles.grant_role(app.db, user, service_role)
+    oauth_client = (
+        app.db.query(orm.OAuthClient)
+        .filter_by(identifier=service.oauth_client_id)
+        .one()
+    )
+    oauth_client.allowed_scopes = sorted(roles.roles_to_scopes([service_role]))
+    app.db.commit()
+    # open the service url in the browser
+    service_url = url_path_join(public_url(app, service) + 'owhoami/?arg=x')
+    await in_thread(browser.get, (service_url))
+    expected_client_id = service.name
+    expected_redirect_url = app.base_url + f"services/{service.name}/oauth_callback"
+    assert expected_client_id in browser.current_url
+
+    # login user
+    await login(browser, user.name, pass_w=str(user.name))
+    auth_button = browser.find_element(By.XPATH, '//input[@type="submit"]')
+    if not auth_button.is_displayed():
+        await webdriver_wait(
+            browser,
+            EC.visibility_of_element_located((By.XPATH, '//input[@type="submit"]')),
+        )
+    # verify that user can see the service name and oauth URL
+    text_permission = browser.find_element(
+        By.XPATH, './/h1[text()="Authorize access"]//following::p'
+    ).text
+    # both strings must appear in the text of the authorization prompt
+    assert f"JupyterHub service {service.name}" in text_permission
+    assert f"oauth URL: {expected_redirect_url}" in text_permission
+    # permissions check
+    oauth_form = browser.find_element(By.TAG_NAME, "form")
+    scopes_elements = oauth_form.find_elements(
+        By.XPATH, '//input[@type="hidden" and @name="scopes"]'
+    )
+    scope_list_oauth_page = []
+    for scopes_element in scopes_elements:
+        # checking that scopes are invisible on the page
+        assert not scopes_element.is_displayed()
+        scope_value = scopes_element.get_attribute("value")
+        scope_list_oauth_page.append(scope_value)
+
+    # checking that all scopes granted to the user are present in the POST form (scope_list)
+    assert all(x in scope_list_oauth_page for x in user_scopes)
+    assert f"access:services!service={service.name}" in scope_list_oauth_page
+
+    check_boxes = oauth_form.find_elements(
+        By.XPATH, '//input[@type="checkbox" and @name="raw-scopes"]'
+    )
+    for check_box in check_boxes:
+        # checking that user cannot uncheck the checkbox
+        assert not check_box.is_enabled()
+        assert check_box.get_attribute("disabled")
+        assert check_box.get_attribute("title") == "This authorization is required"
+
+    # checking that appropriate descriptions are displayed depending on scopes
+    descriptions = oauth_form.find_elements(By.TAG_NAME, 'span')
+    desc_list_form = [description.text.strip() for description in descriptions]
+    # getting descriptions from scopes.py to compare them with descriptions on UI
+    scope_descriptions = scopes.describe_raw_scopes(
+        user_scopes or ['(no_scope)'], user.name
+    )
+    desc_list_expected = []
+    for scope_description in scope_descriptions:
+        description = scope_description.get("description")
+        text_filter = scope_description.get("filter")
+        if text_filter:
+            description = f"{description} Applies to {text_filter}."
+        desc_list_expected.append(description)
+
+    assert sorted(desc_list_form) == sorted(desc_list_expected)
+
+    # click on the Authorize button
+    await click(browser, (By.XPATH, '//input[@type="submit"]'))
+    # check that user returned to service page
+    assert browser.current_url == service_url
+
+    # check the granted permissions by
+    # getting the scopes from the service page,
+    # which contains the JupyterHub user model
+    text = browser.find_element(By.TAG_NAME, "body").text
+    user_model = json.loads(text)
+    authorized_scopes = user_model["scopes"]
+
+    # resolve the expected expanded scopes
+    # authorized for the service
+    expected_scopes = scopes.expand_scopes(user_scopes, owner=user.orm_user)
+    expected_scopes |= scopes.access_scopes(oauth_client)
+    expected_scopes |= scopes.identify_scopes(user.orm_user)
+
+    # compare the scopes on the service page with the expected scope list
+    assert sorted(authorized_scopes) == sorted(expected_scopes)
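
The test builds its expected description list by appending an "Applies to ..." suffix whenever a scope description carries a filter. A minimal standalone sketch of that step; the helper name `render_descriptions` and the sample dicts are made up for illustration and are not output of `scopes.describe_raw_scopes`:

```python
def render_descriptions(scope_descriptions):
    """Build display strings from dicts with 'description' and optional 'filter' keys."""
    rendered = []
    for item in scope_descriptions:
        description = item.get("description")
        text_filter = item.get("filter")
        if text_filter:
            # a filtered scope gets a suffix naming what it applies to
            description = f"{description} Applies to {text_filter}."
        rendered.append(description)
    return rendered


# Hypothetical sample data, shaped like the dicts the test reads
sample = [
    {"description": "Read users.", "filter": "user: gawain"},
    {"description": "Access services.", "filter": ""},
]
print(render_descriptions(sample))
```

Comparing sorted lists, as the test does, keeps the check independent of the order in which the page renders the descriptions.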