Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adds a key-value store endpoint for Superset #17337

Closed

Conversation

michael-s-molina
Copy link
Member

@michael-s-molina michael-s-molina commented Nov 3, 2021

SUMMARY

This PR adds a key-value store endpoint for Superset with secure key generation and management of store entries.

Keys are generated on the server-side using the https://docs.python.org/3/library/secrets.html#secrets.token_urlsafe function and the value can be any JSON supported text.

Each value can have an associated duration. A Celery cleanup task is responsible for managing expired entries in the key-value store.

The user can also set a configuration to reset the duration upon retrieval to preserve actively accessed entries.

This endpoint can be used to solve a bunch of uses cases for Superset. Each client of this endpoint should carefully consider security, normalization, and ownership implications when deciding to store some type of information into the key-value store. Some possible use cases for this endpoint are listed below:

  • Profiles, preferences, and configurations management
  • Large objects storage (URL parameters, application state, drafts, etc)
  • Caching

We currently have an endpoint at /kv/store, but it has serial/monotonic ids which allow a user to iterate through other people’s saved states. That endpoint is also using the deprecated API system on superset/views, so it would need to be refactored anyway. Another important point is that the /kv/store endpoint does not contain any management capabilities regarding entries duration, ownership, and creation timestamps, which makes automatic sanitization harder.

The security model of the entry point is inspired by the Google Drive sharing model and it’s intended to support similar use cases. By default, any added value to the key-value store is accessible only to the user who created the value. This can be altered using the anyone_with_key, editor_*, and viewer_* configurations. The anyone_with_key option accepts the editor and viewer values indicating that anyone with the key can edit or view the store value. This is useful when storing non-sensitive information that needs to be shared among users. The editor_* and viewer_* configurations allow specific access restrictions to the key-value entries like granting read or write permissions to specific users or roles. In the case of access configuration conflicts, the more restrictive rule should always apply. An example would be anyone_with_key: ‘editor’ and editor_users: [1, 2, 3]. In this case, only users 1, 2, and 3 would have editor access.

Here's a pseudo-code of the endpoint operations and their configurations. For the complete status codes and possible responses check the Swagger documentation:

POST `api/v1/key_value_store`

BODY 
{
   // The duration of the value in the key store.
   // If no duration is specified the value won't expire.
   duration_ms?: <number>,

   // If the duration should be reset when the value is retrieved. 
   // This is useful if you wish to expire unused values but keep the ones that are actively retrieved.
   // True by default
   reset_duration_on_retrieval?: <boolean>,

   // Any type of JSON supported text.
   value: <string>,

   // Anyone with the key can edit or view the store value
   anyone_with_key?: <'editor', 'viewer'>

   // IDs of the users with editor access
   editor_users?: <[number]>

   // IDs of the roles with editor access
   editor_roles?: <[number]>

   // IDs of the users with viewer access
   viewer_users?: <[number]>

   // IDs of the roles with viewer access
   viewer_roles?: <[number]>
}

RESPONSE
{
   // The key generated on the server-side
   key: <string>
}
PUT `api/v1/key_value_store/<key>`

BODY - Same as POST body

RESPONSE
{
   // The result of the operation
   message: <string>
}
GET `api/v1/key_value_store/<key>`

RESPONSE
{
   // Any type of JSON supported text.
   value: <string>,
}
DELETE `api/v1/key_value_store/<key>`

RESPONSE
{
   // The result of the operation
   message: <string>
}

TESTING INSTRUCTIONS

1 - Execute Python tests
2 - All tests should pass

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@michael-s-molina michael-s-molina marked this pull request as ready for review November 9, 2021 14:29
@michael-s-molina michael-s-molina requested a review from a team as a code owner November 9, 2021 14:29
Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments... the most important is to return SIP-40 compliant errors, specially since this is a new API.

superset/config.py Outdated Show resolved Hide resolved
superset/key_value/api.py Outdated Show resolved Hide resolved
superset/key_value/api.py Outdated Show resolved Hide resolved
superset/key_value/commands/cleanup.py Outdated Show resolved Hide resolved
tests/integration_tests/key_value/api_tests.py Outdated Show resolved Hide resolved
Copy link
Member

@willbarrett willbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about some of the proposed usages - particularly access tokens, refresh tokens, and public key storage. These items should all be held in an encrypted system, not a plain-text field. That's a big security no-no.

I would also recommend adding a user_id to the key value store and only allowing retrieval of items by the same user. Otherwise this system potentially allows any user to read any stored key, which for most of the uses you recommend would constitute a security hole.

The original key value store was originally deprecated due to similar security concerns. Keep in mind that obscurity of a long key is not the same thing as security - even if the keys are hard to guess, we should have security controls on the individual keys.

@michael-s-molina
Copy link
Member Author

Hi @willbarrett. Thanks so much for helping with the review!

I'm concerned about some of the proposed usages - particularly access tokens, refresh tokens, and public key storage. These items should all be held in an encrypted system, not a plain-text field. That's a big security no-no.

You're right about this. There are more secure structures for this type of information. I removed them from possible use cases in the PR description.

I would also recommend adding a user_id to the key value store and only allowing retrieval of items by the same user. Otherwise this system potentially allows any user to read any stored key, which for most of the uses you recommend would constitute a security hole.
The original key value store was originally deprecated due to similar security concerns. Keep in mind that obscurity of a long key is not the same thing as security - even if the keys are hard to guess, we should have security controls on the individual keys.

The key-value table has a created_by field for that purpose. One of the main uses of this store is to share content between users like an URL state, a draft, etc. Currently, the access is being controlled by the possession of a secure key. It's the same type of access control as Google Docs when you select the "Anyone with the link" type of sharing. We could increment this and also offer the second type of control where we can select which users are allowed to access the content, similar to Google Docs restricted sharing. What do you think?

@willbarrett
Copy link
Member

Thanks for the response @michael-s-molina - I think as Superset moves into more organizations we should default to closed in all cases, so I would support the closed-sharing model. The open-sharing model is pretty dangerous - it becomes easy to create a link that's accessible to everyone in a company, which I believe wouldn't be a desired behavior by most larger organizations as a default behavior. In Superset, URL parameters, application state, and cache can contain highly sensitive information so I think we should shy away from the open-sharing model in all cases.

@michael-s-molina
Copy link
Member Author

Thanks for the response @michael-s-molina - I think as Superset moves into more organizations we should default to closed in all cases, so I would support the closed-sharing model. The open-sharing model is pretty dangerous - it becomes easy to create a link that's accessible to everyone in a company, which I believe wouldn't be a desired behavior by most larger organizations as a default behavior. In Superset, URL parameters, application state, and cache can contain highly sensitive information so I think we should shy away from the open-sharing model in all cases.

@willbarrett These are good points. I agree that changing the default to the restricted model is more appropriate. I also think we should support the "Anyone with the key" model because we have some resources like public dashboards where we can benefit from it and we don't need the whole security configuration part from the user. I'll ping @dpgaspar to discuss this and increment the PR with these requirements. Thank you so much again!

@willbarrett
Copy link
Member

Thanks @michael-s-molina - if we do implement the "anyone with a key" model, we should throw some restrictions or confirmation around it so it's very clear to the user that they're about to share very widely. Something to think about on the UI-side of the house.

@michael-s-molina
Copy link
Member Author

Thanks @michael-s-molina - if we do implement the "anyone with a key" model, we should throw some restrictions or confirmation around it so it's very clear to the user that they're about to share very widely. Something to think about on the UI-side of the house.

Definitely! We should be really clear about the security implications. I was just checking the Google Drive interface and we can follow the same pattern.

Copy link
Member

@dpgaspar dpgaspar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments, we should add more tests also

superset/key_value/api.py Outdated Show resolved Hide resolved
superset/key_value/api.py Show resolved Hide resolved
superset/key_value/api.py Show resolved Hide resolved
superset/key_value/commands/create.py Outdated Show resolved Hide resolved
superset/key_value/commands/get.py Outdated Show resolved Hide resolved
superset/key_value/utils.py Outdated Show resolved Hide resolved

__tablename__ = "key_value"
key = Column(String(256), primary_key=True)
value = Column(Text, nullable=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if value is potentially sensitive, we should encrypt this field

superset/key_value/commands/cleanup.py Outdated Show resolved Hide resolved
@dpgaspar
Copy link
Member

Thanks @michael-s-molina - if we do implement the "anyone with a key" model, we should throw some restrictions or confirmation around it so it's very clear to the user that they're about to share very widely. Something to think about on the UI-side of the house.

Would definitely fall to restrict access to the owner of the key. but the K/V store goal is not clear yet or it's just too broad. Session management and caching are sensitive, caching values could potentially defeat dataset ownership and RBAC permissions.

We can make the ownership restriction optional and on by default behind a config key. Or discuss this further on a secure sharing model

@michael-s-molina
Copy link
Member Author

left some comments, we should add more tests also

Thanks for the review @dpgaspar! I'll address all your comments

@michael-s-molina
Copy link
Member Author

@dpgaspar @willbarrett @betodealmeida I updated the PR description with the results of our last interactions about security and suggested use cases. I also added examples of the endpoint operations and their configurations.

@john-bodley I would really like your review too.

@dpgaspar even though the names in the API are editor and viewer, I’m still going to use the owner nomenclature in the database to match our existing schema 😉

You'll notice that the code is not reflecting the current description. I'm looking for more reviews on the security model before implementing it. If you can, just give a thumbs-up in this comment so I know you're ok with it or add a comment otherwise.

I'm really excited about my first Python PR so thank you all for the reviews. It's good to be back to full-stack development 🙌🏼

@github-actions
Copy link
Contributor

⚠️ @michael-s-molina Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@betodealmeida betodealmeida added the risk:db-migration PRs that require a DB migration label Nov 14, 2021
@codecov
Copy link

codecov bot commented Nov 15, 2021

Codecov Report

Merging #17337 (b4a465f) into master (e2a429b) will decrease coverage by 0.08%.
The diff coverage is 75.78%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17337      +/-   ##
==========================================
- Coverage   76.94%   76.86%   -0.09%     
==========================================
  Files        1042     1052      +10     
  Lines       56248    56533     +285     
  Branches     7784     7784              
==========================================
+ Hits        43282    43454     +172     
- Misses      12707    12820     +113     
  Partials      259      259              
Flag Coverage Δ
hive 81.48% <75.78%> (-0.06%) ⬇️
mysql 81.89% <75.78%> (-0.06%) ⬇️
postgres 81.90% <75.78%> (-0.06%) ⬇️
presto ?
python 82.25% <75.78%> (-0.22%) ⬇️
sqlite 81.58% <75.78%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/config.py 91.50% <ø> (ø)
superset/tasks/scheduler.py 62.96% <33.33%> (-5.93%) ⬇️
superset/key_value/commands/update.py 44.11% <44.11%> (ø)
superset/key_value/commands/delete.py 57.69% <57.69%> (ø)
superset/key_value/commands/cleanup.py 62.50% <62.50%> (ø)
superset/key_value/api.py 81.42% <81.42%> (ø)
superset/key_value/commands/get.py 83.33% <83.33%> (ø)
superset/key_value/dao.py 85.29% <85.29%> (ø)
superset/key_value/commands/create.py 87.87% <87.87%> (ø)
superset/initialization/__init__.py 88.07% <100.00%> (+0.08%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2a429b...b4a465f. Read the comment docs.

@geido
Copy link
Member

geido commented Nov 17, 2021

Awesome work @michael-s-molina!

@michael-s-molina
Copy link
Member Author

michael-s-molina commented Nov 17, 2021

We had a meeting about the proposed solution in this PR and decided to take a slightly different approach to resolve the use cases. Exposing a generic key-value interface had the potential to induce users to store any type of information in the store, diminishing the benefits of the current endpoints where each type of resource is represented by a specific path. We could also introduce some conflicts with the current security model in terms of access grants. One example would be to store information in the key-value store that is related to a dashboard. The dashboard already has configured access grants and re-defining those in each key-value entry had the potential for security problems. We decided to offer the same type of functionality under each currently defined endpoint to preserve the more typed nature of our endpoints and to leverage the existing security model. I'm closing this PR to preserve all the discussions and conclusions for historical reasons and I'll open a new one with the new approach.

A big thanks to all the reviewers that are helping to shape this new feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
risk:db-migration PRs that require a DB migration size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants