Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement token-based sessions. #4690

Open
wants to merge 19 commits into
base: main
Choose a base branch
from
Open

Conversation

adsnaider
Copy link

This PR sets up the session table (called persistent_session because diesel... diesel-rs/diesel#3055). Every time a user is logged in, a session will be created alongside the cookie that the user will use for authentication. This PR is the initial step for fixing #2630

@adsnaider adsnaider changed the title Sessions Implement token-based sessions. Apr 5, 2022
src/util/token.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@jsha jsha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for working on this! I'm quite happy to see it happen.

I'm not a contributor on crates.io, just an interested security engineer with some experience working on sessions and databases, so take the below with an appropriate grain of salt.

Session tables can be expensive (in DB capacity), because they need to be read on every request. Even more expensive if they need to be written on every request. We can reduce the expense if we can keep all, or a significant chunk, of the sessions table in memory. Some recommendations to reduce the size of the table:

Remove the index on hashed_token (indexes can be surprisingly expensive, bytes-wise). Instead, have the cookie assert a session id alongside the token, like scio:12345:77af9bea05b66df6422ffcf1bf97b7e62e2ca4e5. Then you can look up by the primary key, which is shorter and cheaper anyhow.

Move the last_used_at, last_ip_address, and last_user_agent into their own table. If the load from the writes gets too big, crates.io could disable writes to that table (at the cost of not having updated session usage data) to regain some capacity. Alternately, we could produce the "last accessed" data by post-processing logs instead of doing live writes to the DB. I realize I proposed in the original issue that the sessions table should store this, but I'm having minor second thoughts. :-)

Session tables have the property that you typically want to delete old data to save space. Deleting data row-by-row can be expensive. It's probably a good idea to partition the session table by ranges of primary key: https://www.postgresql.org/docs/current/ddl-partitioning.html

Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. Partitioning effectively substitutes for the upper tree levels of indexes, making it more likely that the heavily-used parts of the indexes fit in memory.
Bulk loads and deletes can be accomplished by adding or removing partitions, if the usage pattern is accounted for in the partitioning design. Dropping an individual partition using DROP TABLE, or doing ALTER TABLE DETACH PARTITION, is far faster than a bulk operation.

Hope this is helpful!

src/controllers/user/session.rs Outdated Show resolved Hide resolved
Comment on lines 82 to 83
// TODO: Do we want to check if the user agent or IP address don't match? What about the
// created_at/last_user_at times, do we want to expire the tokens?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't disqualify a session if the user agent or IP address change. IP addressed change regularly due to DHCP leases or moving to a different network (e.g. a cafe). User agent strings change as a result of browser upgrades.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. At what point do we revoke sessions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of basic options:

  • If the user explicitly logs out of any session, invalidate all sessions everywhere (simplest, probably good enough)
  • If the user logs of a session, invalidate that session, AND have a separate "log out other sessions" button

Another maybe interesting option (probably not for this PR):

  • Have a background job that revalidates the GitHub token periodically. If that token has been revoked, invalidate all sessions.

src/util/token.rs Outdated Show resolved Hide resolved
migrations/2022-02-21-211645_create_sessions/up.sql Outdated Show resolved Hide resolved
@Turbo87 Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels Apr 6, 2022
@bors
Copy link
Contributor

bors commented Apr 6, 2022

☔ The latest upstream changes (presumably #4678) made this pull request unmergeable. Please resolve the merge conflicts.

@adsnaider
Copy link
Author

adsnaider commented Apr 16, 2022

Made some updates. Still need to:

  • Partition the table by date
  • Split the table to store the last_used data in a different table

@adsnaider
Copy link
Author

@jsha I got a couple of questions for you :)

  1. I agree partitioning would be useful. Do you think it would be best to have a creation_date column and partition by that. Then we can automatically clean up old partitions by date? How can we achieve something similar partitioning by the primary key?
  2. Since you mentioned that it might be useful to store the last_used data in a different (non-essential) table in case we need to disable writes, why would this be better than just storing NULL values for those fields in the sessions table instead?

@jsha
Copy link
Contributor

jsha commented Apr 16, 2022

Do you think it would be best to have a creation_date column and partition by that. Then we can automatically clean up old partitions by date? How can we achieve something similar partitioning by the primary key?

I hadn't noticed that there's no creation_date column. That's probably a good idea.

Since the primary key is SERIAL, we expect that the primary key id maps linearly to creation dates. To check if a partition in pruneable, we should be able to look at the creation date of the highest id in that partition. If that's past the expiration time, the whole partition can be dropped.

The best practices for partitioning choices are documented here: https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES

One of the most critical design decisions will be the column or columns by which you partition your data. Often the best choice will be to partition by the column or set of columns which most commonly appear in WHERE clauses of queries being executed on the partitioned table. WHERE clauses that are compatible with the partition bound constraints can be used to prune unneeded partitions.

Since we're now looking up by the primary key, partitioning by that key seems like the best choice. This also saves us from needing another index.

Since you mentioned that it might be useful to store the last_used data in a different (non-essential) table in case we need to disable writes, why would this be better than just storing NULL values for those fields in the sessions table instead?

I haven't fully solidified my thinking, but in broad strokes I'm thinking about how much of the session table we can afford to fit in the DB's memory, and also about locking and update rates. If the non-essential (and potentially large) fields are in another table, we wouldn't necessarily need to keep them in memory. On the other hand, writing to them constantly might wind up keeping them in memory anyhow (to the detriment of other tables). I'm also slightly concerned that constantly writing to session rows could cause contention on reads of those session rows. In general, I'm having second thoughts on the idea of recording user-agent and IP on every access, because I suspect it will be very expensive. It's a good feature to have, but I want to maximize the chances of success for the core feature of revocable sessions. What do you think of omitting that part for now?

Also, fair warning: I am no Postgres expert. I'm taking what I've learned on MariaDB/MySQL and extrapolating/reading the docs.

@jsha
Copy link
Contributor

jsha commented Apr 16, 2022

Here's another way of thinking about the access IP and user-agent: For the attack we're most worried about (compromised GitHub account), the IP and user-agent of the login are just about as useful as the IP and user-agent of the last access. So we could store that info at login to help people figure out if they have any inappropriate sessions.

@adsnaider
Copy link
Author

I hadn't noticed that there's no creation_date column

We don't have creation_date, but we have created_at (which is a timestamp)

I agree with starting with the basics for now. I'm thinking of keeping the created_at column so that we can still invalidate stale sessions but I'll get rid of the last_used_at, last_used_ip, and last_user_agent columns.

@adsnaider
Copy link
Author

@jsha I'm having trouble figuring out how partitioning will fit alongside the rest of the infrastructure. I can define the table to be partitioned by primary key but the partitions themselves need to be setup with SQL code (not from Rust). I have some ideas how this could work, so let me know what you think.

Assuming we can have some sort of cronjob in the server, I could set one up daily that reads the latest session (not sure how yet, maybe just select all and sort descending or something), and create a partition from the highest ID in the previous partition and the current highest. Then, we can have another cronjob that looks at the partitions starting from the oldest and deletes the ones that are stale (by whatever definition of stale we decide upon). I'm not sure that we can even know what partitions currently exist with SQL.

However, I think it might be worth considering having the partitions be ranged by creation_date and here's why: we don't need an index as this will be done with a cronjob once a day so efficiency isn't necessarily a priority (do you agree with that reasoning? Maybe DB locks will be a problem). The advantage of doing the partition by creation_date is that it's very easy to create and delete the older partitions without having to do any sorting to figure out if the partition should be deleted. Additionally, it would be easy to define such cronjob without having to figure out what partitions currently exist (we can just say look at the partition with date = today - x for whatever we decide xshould be.

What do you think?

@bors
Copy link
Contributor

bors commented May 9, 2022

☔ The latest upstream changes (presumably ca82d24) made this pull request unmergeable. Please resolve the merge conflicts.

@adsnaider
Copy link
Author

@bjorn3 or @pietroalbini do either of you know who I should contact in order to setup automatic creation/deletion of the partitions as a cron job within the server?

@bors
Copy link
Contributor

bors commented May 31, 2022

☔ The latest upstream changes (presumably 64feb48) made this pull request unmergeable. Please resolve the merge conflicts.

@pietroalbini
Copy link
Member

@adsnaider there is already a daily cronjob whose source code is https://github.com/rust-lang/crates.io/blob/master/src/worker/daily_db_maintenance.rs. I guess you can probably add it there?

Also, note that I'm not on the crates.io team, so it's up to them to review the change 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend ⚙️ C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants