Implement token-based sessions. #4690

adsnaider · 2022-04-05T02:27:40Z

This PR sets up the session table (called persistent_session because diesel... diesel-rs/diesel#3055). Every time a user is logged in, a session will be created alongside the cookie that the user will use for authentication. This PR is the initial step for fixing #2630

src/util/token.rs

jsha

Thanks so much for working on this! I'm quite happy to see it happen.

I'm not a contributor on crates.io, just an interested security engineer with some experience working on sessions and databases, so take the below with an appropriate grain of salt.

Session tables can be expensive (in DB capacity), because they need to be read on every request. Even more expensive if they need to be written on every request. We can reduce the expense if we can keep all, or a significant chunk, of the sessions table in memory. Some recommendations to reduce the size of the table:

Remove the index on hashed_token (indexes can be surprisingly expensive, bytes-wise). Instead, have the cookie assert a session id alongside the token, like scio:12345:77af9bea05b66df6422ffcf1bf97b7e62e2ca4e5. Then you can look up by the primary key, which is shorter and cheaper anyhow.

Move the last_used_at, last_ip_address, and last_user_agent into their own table. If the load from the writes gets too big, crates.io could disable writes to that table (at the cost of not having updated session usage data) to regain some capacity. Alternately, we could produce the "last accessed" data by post-processing logs instead of doing live writes to the DB. I realize I proposed in the original issue that the sessions table should store this, but I'm having minor second thoughts. :-)

Session tables have the property that you typically want to delete old data to save space. Deleting data row-by-row can be expensive. It's probably a good idea to partition the session table by ranges of primary key: https://www.postgresql.org/docs/current/ddl-partitioning.html

Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. Partitioning effectively substitutes for the upper tree levels of indexes, making it more likely that the heavily-used parts of the indexes fit in memory.
Bulk loads and deletes can be accomplished by adding or removing partitions, if the usage pattern is accounted for in the partitioning design. Dropping an individual partition using DROP TABLE, or doing ALTER TABLE DETACH PARTITION, is far faster than a bulk operation.

Hope this is helpful!

src/controllers/user/session.rs

jsha · 2022-04-05T22:41:03Z

src/models/persistent_session.rs

+        // TODO: Do we want to check if the user agent or IP address don't match? What about the
+        // created_at/last_user_at times, do we want to expire the tokens?


We shouldn't disqualify a session if the user agent or IP address change. IP addressed change regularly due to DHCP leases or moving to a different network (e.g. a cafe). User agent strings change as a result of browser upgrades.

Gotcha. At what point do we revoke sessions?

A couple of basic options:

If the user explicitly logs out of any session, invalidate all sessions everywhere (simplest, probably good enough)

If the user logs of a session, invalidate that session, AND have a separate "log out other sessions" button

Another maybe interesting option (probably not for this PR):

Have a background job that revalidates the GitHub token periodically. If that token has been revoked, invalidate all sessions.

src/util/token.rs

migrations/2022-02-21-211645_create_sessions/up.sql

bors · 2022-04-06T09:59:36Z

☔ The latest upstream changes (presumably #4678) made this pull request unmergeable. Please resolve the merge conflicts.

adsnaider · 2022-04-16T00:25:12Z

Made some updates. Still need to:

Partition the table by date
Split the table to store the last_used data in a different table

adsnaider · 2022-04-16T00:55:01Z

@jsha I got a couple of questions for you :)

I agree partitioning would be useful. Do you think it would be best to have a creation_date column and partition by that. Then we can automatically clean up old partitions by date? How can we achieve something similar partitioning by the primary key?
Since you mentioned that it might be useful to store the last_used data in a different (non-essential) table in case we need to disable writes, why would this be better than just storing NULL values for those fields in the sessions table instead?

jsha · 2022-04-16T03:49:16Z

Do you think it would be best to have a creation_date column and partition by that. Then we can automatically clean up old partitions by date? How can we achieve something similar partitioning by the primary key?

I hadn't noticed that there's no creation_date column. That's probably a good idea.

Since the primary key is SERIAL, we expect that the primary key id maps linearly to creation dates. To check if a partition in pruneable, we should be able to look at the creation date of the highest id in that partition. If that's past the expiration time, the whole partition can be dropped.

The best practices for partitioning choices are documented here: https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES

One of the most critical design decisions will be the column or columns by which you partition your data. Often the best choice will be to partition by the column or set of columns which most commonly appear in WHERE clauses of queries being executed on the partitioned table. WHERE clauses that are compatible with the partition bound constraints can be used to prune unneeded partitions.

Since we're now looking up by the primary key, partitioning by that key seems like the best choice. This also saves us from needing another index.

Since you mentioned that it might be useful to store the last_used data in a different (non-essential) table in case we need to disable writes, why would this be better than just storing NULL values for those fields in the sessions table instead?

I haven't fully solidified my thinking, but in broad strokes I'm thinking about how much of the session table we can afford to fit in the DB's memory, and also about locking and update rates. If the non-essential (and potentially large) fields are in another table, we wouldn't necessarily need to keep them in memory. On the other hand, writing to them constantly might wind up keeping them in memory anyhow (to the detriment of other tables). I'm also slightly concerned that constantly writing to session rows could cause contention on reads of those session rows. In general, I'm having second thoughts on the idea of recording user-agent and IP on every access, because I suspect it will be very expensive. It's a good feature to have, but I want to maximize the chances of success for the core feature of revocable sessions. What do you think of omitting that part for now?

Also, fair warning: I am no Postgres expert. I'm taking what I've learned on MariaDB/MySQL and extrapolating/reading the docs.

jsha · 2022-04-16T03:51:34Z

Here's another way of thinking about the access IP and user-agent: For the attack we're most worried about (compromised GitHub account), the IP and user-agent of the login are just about as useful as the IP and user-agent of the last access. So we could store that info at login to help people figure out if they have any inappropriate sessions.

adsnaider · 2022-04-16T04:53:23Z

I hadn't noticed that there's no creation_date column

We don't have creation_date, but we have created_at (which is a timestamp)

I agree with starting with the basics for now. I'm thinking of keeping the created_at column so that we can still invalidate stale sessions but I'll get rid of the last_used_at, last_used_ip, and last_user_agent columns.

adsnaider · 2022-04-17T18:33:48Z

@jsha I'm having trouble figuring out how partitioning will fit alongside the rest of the infrastructure. I can define the table to be partitioned by primary key but the partitions themselves need to be setup with SQL code (not from Rust). I have some ideas how this could work, so let me know what you think.

Assuming we can have some sort of cronjob in the server, I could set one up daily that reads the latest session (not sure how yet, maybe just select all and sort descending or something), and create a partition from the highest ID in the previous partition and the current highest. Then, we can have another cronjob that looks at the partitions starting from the oldest and deletes the ones that are stale (by whatever definition of stale we decide upon). I'm not sure that we can even know what partitions currently exist with SQL.

However, I think it might be worth considering having the partitions be ranged by creation_date and here's why: we don't need an index as this will be done with a cronjob once a day so efficiency isn't necessarily a priority (do you agree with that reasoning? Maybe DB locks will be a problem). The advantage of doing the partition by creation_date is that it's very easy to create and delete the older partitions without having to do any sorting to figure out if the partition should be deleted. Additionally, it would be easy to define such cronjob without having to figure out what partitions currently exist (we can just say look at the partition with date = today - x for whatever we decide xshould be.

What do you think?

bors · 2022-05-09T07:02:31Z

☔ The latest upstream changes (presumably ca82d24) made this pull request unmergeable. Please resolve the merge conflicts.

Unfortunately, due to diesel-rs/diesel#2411, the table can't be named `sessions` as it causes patch conflicts with recent_crate_downloads.

adsnaider · 2022-05-20T21:36:45Z

@bjorn3 or @pietroalbini do either of you know who I should contact in order to setup automatic creation/deletion of the partitions as a cron job within the server?

bors · 2022-05-31T07:34:32Z

☔ The latest upstream changes (presumably 64feb48) made this pull request unmergeable. Please resolve the merge conflicts.

pietroalbini · 2022-06-04T13:48:39Z

@adsnaider there is already a daily cronjob whose source code is https://github.com/rust-lang/crates.io/blob/master/src/worker/daily_db_maintenance.rs. I guess you can probably add it there?

Also, note that I'm not on the crates.io team, so it's up to them to review the change 🙂

adsnaider changed the title ~~Sessions~~ Implement token-based sessions. Apr 5, 2022

adsnaider mentioned this pull request Apr 5, 2022

security: Recovering from a GitHub account compromise #2630

Open

pietroalbini reviewed Apr 5, 2022

View reviewed changes

src/util/token.rs Outdated Show resolved Hide resolved

jsha reviewed Apr 5, 2022

View reviewed changes

Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels Apr 6, 2022

adsnaider mentioned this pull request Apr 15, 2022

Flaky (Deadlock Detected) krate tests. #4729

Closed

adsnaider force-pushed the sessions branch from ae3ada6 to 67fd9d3 Compare April 16, 2022 00:55

adsnaider added 14 commits May 20, 2022 12:19

Created persistent sessions table.

e322fd9

Unfortunately, due to diesel-rs/diesel#2411, the table can't be named `sessions` as it causes patch conflicts with recent_crate_downloads.

models: Created persistent session model.

81ec8c6

Setup new persistent session and cookie on auth.

5c209f4

Revoke session tokens on logout.

5ef54f0

Use persistent sessions for authentication.

c51ab65

Added session tests.

04fe1d2

Added documentation and comments.

21d5d71

Fixed visiblity of token alphanumeric generator.

b9af39c

Fix lint error and make clippy happier.

2af70a2

Changed persistent sessions id to BIGSERIAL.

c774eee

Changed persistent session token prefix to scio.

81831e5

Changed persistent session cookie name to Host-auth.

b0fb29d

Clean up TODO.

1280232

Removed persisten_session hashed token index.

14ad3cc

adsnaider added 5 commits May 20, 2022 12:28

Removed last_used_* fields from persistent session table.

c215d23

Added test for persistent session after logout.

7466b63

Improved persistent session API.

2c51d08

Make clippy happy again.

26aeb88

Fixed merge conflicts.

731e2fd

adsnaider force-pushed the sessions branch from e9fa735 to 731e2fd Compare May 20, 2022 20:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement token-based sessions. #4690

Implement token-based sessions. #4690

adsnaider commented Apr 5, 2022

jsha left a comment

jsha Apr 5, 2022

adsnaider Apr 15, 2022

jsha Apr 16, 2022

bors commented Apr 6, 2022

adsnaider commented Apr 16, 2022 •

edited

Loading

adsnaider commented Apr 16, 2022

jsha commented Apr 16, 2022

jsha commented Apr 16, 2022

adsnaider commented Apr 16, 2022

adsnaider commented Apr 17, 2022

bors commented May 9, 2022

adsnaider commented May 20, 2022

bors commented May 31, 2022

pietroalbini commented Jun 4, 2022

		// TODO: Do we want to check if the user agent or IP address don't match? What about the
		// created_at/last_user_at times, do we want to expire the tokens?

Implement token-based sessions. #4690

Are you sure you want to change the base?

Implement token-based sessions. #4690

Conversation

adsnaider commented Apr 5, 2022

jsha left a comment

Choose a reason for hiding this comment

jsha Apr 5, 2022

Choose a reason for hiding this comment

adsnaider Apr 15, 2022

Choose a reason for hiding this comment

jsha Apr 16, 2022

Choose a reason for hiding this comment

bors commented Apr 6, 2022

adsnaider commented Apr 16, 2022 • edited Loading

adsnaider commented Apr 16, 2022

jsha commented Apr 16, 2022

jsha commented Apr 16, 2022

adsnaider commented Apr 16, 2022

adsnaider commented Apr 17, 2022

bors commented May 9, 2022

adsnaider commented May 20, 2022

bors commented May 31, 2022

pietroalbini commented Jun 4, 2022

adsnaider commented Apr 16, 2022 •

edited

Loading