Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upload to existing indexd record #1147

Merged
merged 19 commits into from
Aug 7, 2024
15 changes: 10 additions & 5 deletions fence/blueprints/data/blueprint.py
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,8 @@ def upload_data_file():
authz = params.get("authz")
uploader = None

guid = params.get("guid")

if authz:
# if requesting an authz field, using new authorization method which doesn't
# rely on uploader field, so clear it out
Expand Down Expand Up @@ -165,7 +167,10 @@ def upload_data_file():
)

blank_index = BlankIndex(
file_name=params["file_name"], authz=params.get("authz"), uploader=uploader
file_name=params["file_name"],
authz=authz,
uploader=uploader,
guid=guid,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened if an invalid or non-existent GUID is specified? Should the behavior be to 400 or move forward creating a blank record? I think we need to consider things like this. If I supply a GUID to the public API, I probably don't want a blank record. So if I mistyped the GUID I'd actually expect an error, not for it to move forward and create a blank record anyway

)
default_expires_in = flask.current_app.config.get("MAX_PRESIGNED_URL_TTL", 3600)

Expand Down Expand Up @@ -199,16 +204,16 @@ def upload_data_file():
def init_multipart_upload():
"""
Initialize a multipart upload request

NOTE This endpoint does not currently accept a `bucket` parameter like
`POST /upload` and `GET /upload/<GUID>` do.
"""
params = flask.request.get_json()
if not params:
raise UserError("wrong Content-Type; expected application/json")
if "file_name" not in params:
raise UserError("missing required argument `file_name`")
blank_index = BlankIndex(file_name=params["file_name"])

guid = params.get("guid")

blank_index = BlankIndex(file_name=params["file_name"], guid=guid)

default_expires_in = flask.current_app.config.get("MAX_PRESIGNED_URL_TTL", 3600)
expires_in = get_valid_expiration(
Expand Down
28 changes: 25 additions & 3 deletions fence/blueprints/data/indexd.py
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,7 @@ def _get_client_id():

return client_id


def prepare_presigned_url_audit_log(protocol, indexed_file):
"""
Store in `flask.g.audit_data` the data needed to record an audit log.
Expand All @@ -232,7 +233,12 @@ class BlankIndex(object):
"""

def __init__(
self, uploader=None, file_name=None, logger_=None, guid=None, authz=None
self,
uploader=None,
file_name=None,
logger_=None,
guid=None,
authz=None,
):
self.logger = logger_ or logger
self.indexd = (
Expand All @@ -253,8 +259,10 @@ def __init__(
self.file_name = file_name
self.authz = authz

# if a guid is not provided, this will create a blank record for you
self.guid = guid or self.index_document["did"]
self.guid = guid
self.guid = self.index_document[
"did"
] # .index_document is a cached property with code below, it creates/retrieves the actual record and this line updates the stored GUID to the returned record
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python comments should be on the line above, try to avoid in-line unless required by static code checking exclusions etc


@cached_property
def index_document(self):
Expand All @@ -266,6 +274,20 @@ def index_document(self):
response from indexd (the contents of the record), containing ``guid``
and ``url``
"""

if self.guid:
index_url = self.indexd.rstrip("/") + "/index/" + self.guid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to percent-encode the guid b/c it's user input, so I'm wondering if there's some security concerns here (like what if I pass in slashes, ../admin/ as the guid. I don't know if requests does that by default and this might get flagged as unsanitized user input in our security scans if it hasn't already. Not sure if/how we handle that elsewhere, just a thought

indexd_response = requests.get(index_url)
if indexd_response.status_code == 200:
document = indexd_response.json()
self.logger.info(f"Record with {self.guid} id found in Indexd.")
return document
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of early returns in a longer function like this since at a glance it's unclear where things end. Maybe just breaking some of this into a helper function(s) would help. document = _create_blank_record() and put everything after 288 in there basically. This is nitpicky so you don't need to do this, I just think the readability suffers with early returns unless the function itself is pretty small

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great updates, looks much cleaner now, thanks!

else:
raise NotFound(f"No indexed document found with id {self.guid}")

return self._create_blank_record()

def _create_blank_record(self):
index_url = self.indexd.rstrip("/") + "/index/blank/"
params = {"uploader": self.uploader, "file_name": self.file_name}

Expand Down
16 changes: 15 additions & 1 deletion openapis/swagger.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -565,6 +565,8 @@ paths:
Previous authorization check requires a more general, global upload permission:
"file_upload" on "/data_file" resource. When "authz" is *not* provided, this
endpoint will check for that permission for your user.

Accepts a "guid" field in the request body. If "guid" is provided, it checks indexd for an existing record. If not found, a new blank record will be created.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like this is the behavior anymore? If it's not found, it raises a 404

security:
- OAuth2:
- user
Expand Down Expand Up @@ -666,7 +668,10 @@ paths:
flow, Fence needs to provide a list of endpoints for supporting multipart upload presigned url
This is the first step on the API side for the multipart upload presigned url. This endpoint
causes fence to make a request to indexd to create a new, blank index record, and returns
the GUID for this new record and an uploadId for multipart upload presigned url
the GUID for this new record and an uploadId for multipart upload presigned url.


Accepts a "guid" field in the request body. If "guid" is provided, it checks indexd for an existing record. If not found, a new blank record will be created.
security:
- OAuth2:
- user
Expand Down Expand Up @@ -1686,6 +1691,10 @@ components:
description: >-
the requested bucket to upload to.
If not provided, defaults to the configured DATA_UPLOAD_BUCKET.
guid:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to this new param, we also need to update the example responses to include the 404 that can now happen

type: string
required: false
description: GUID to be assigned to the object
expires_in:
type: integer
description: optional integer specifying the presigned URL lifetime
Expand All @@ -1696,6 +1705,7 @@ components:
description: requested authorization resources to be set on the
resulting indexed record. You must have proper authorization to set this
example:
guid: "123456abcd"
file_name: "my_file.bam"
bucket: "bucket1"
expires_in: 1200
Expand All @@ -1708,6 +1718,10 @@ components:
type: string
required: true
description: the file name to use for this upload
guid:
type: string
required: false
description: GUID to be assigned to the object
expires_in:
type: integer
description: optional integer specifying the presigned URL lifetime
Expand Down
136 changes: 136 additions & 0 deletions tests/data/test_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -819,6 +819,26 @@ def test_public_authz_object_upload_file(
Test `GET /data/upload/1` in which the `1` Indexd record has authz
populated with the public value.
"""
did = str(uuid.uuid4())
index_document = {
"did": did,
"baseid": "",
"rev": "",
"size": 10,
"file_name": "file1",
"urls": ["s3://bucket1/key-{}".format(did[:8])],
"acl": ["phs000789"],
"hashes": {},
"metadata": {},
"form": "",
"created_date": "",
"updated_date": "",
}
mock_index_document = mock.patch(
"fence.blueprints.data.indexd.BlankIndex.index_document", index_document
)
mock_index_document.start()

indexd_client_accepting_record(INDEXD_RECORD_WITH_PUBLIC_AUTHZ_POPULATED)
mock_arborist_requests({"arborist/auth/request": {"POST": ({"auth": True}, 200)}})
headers = {
Expand All @@ -837,6 +857,8 @@ def test_public_authz_object_upload_file(
assert response.status_code == 200
assert "url" in response.json

mock_index_document.stop()


def test_public_authz_and_acl_object_upload_file_with_failed_authz_check(
client,
Expand Down Expand Up @@ -885,6 +907,26 @@ def test_public_authz_and_acl_object_upload_file(
acl populated with public values. In this case, authz takes precedence over
acl.
"""
did = str(uuid.uuid4())
index_document = {
"did": did,
"baseid": "",
"rev": "",
"size": 10,
"file_name": "file1",
"urls": ["s3://bucket1/key-{}".format(did[:8])],
"acl": ["phs000789"],
"hashes": {},
"metadata": {},
"form": "",
"created_date": "",
"updated_date": "",
}
mock_index_document = mock.patch(
"fence.blueprints.data.indexd.BlankIndex.index_document", index_document
)
mock_index_document.start()

indexd_client_accepting_record(INDEXD_RECORD_WITH_PUBLIC_AUTHZ_AND_ACL_POPULATED)
mock_arborist_requests({"arborist/auth/request": {"POST": ({"auth": True}, 200)}})
headers = {
Expand All @@ -903,6 +945,8 @@ def test_public_authz_and_acl_object_upload_file(
assert response.status_code == 200
assert "url" in response.json

mock_index_document.stop()


def test_non_public_authz_and_public_acl_object_upload_file(
client,
Expand All @@ -916,6 +960,26 @@ def test_non_public_authz_and_public_acl_object_upload_file(
Test that a user can successfully generate an upload url for an Indexd
record with a non-public authz field and a public acl field.
"""
did = str(uuid.uuid4())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to modify these to ensure you capture these test cases:

  • When the guid is provided and it already exists
  • When the guid is provided and it doesn't exist
  • When the guid is not provided

I believe you're only capturing the second one right now

index_document = {
"did": did,
"baseid": "",
"rev": "",
"size": 10,
"file_name": "file1",
"urls": ["s3://bucket1/key-{}".format(did[:8])],
"acl": ["phs000789"],
"hashes": {},
"metadata": {},
"form": "",
"created_date": "",
"updated_date": "",
}
mock_index_document = mock.patch(
"fence.blueprints.data.indexd.BlankIndex.index_document", index_document
)
mock_index_document.start()

indexd_record_with_non_public_authz_and_public_acl_populated = {
"did": "1",
"baseid": "",
Expand Down Expand Up @@ -951,6 +1015,8 @@ def test_non_public_authz_and_public_acl_object_upload_file(
assert response.status_code == 200
assert "url" in response.json

mock_index_document.stop()


def test_anonymous_download_with_public_authz(
client,
Expand Down Expand Up @@ -1664,6 +1730,76 @@ def json(self):
assert "uploadId" in response.json


@pytest.mark.parametrize("indexd_200_response", [True, False])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be separate unit tests. One is testing when the GUID exists, one is testing when it doesn't. If you want to share some code, you can create helper functions, but I'd rather keep the positive/negative testing as separate unit tests

def test_initialize_multipart_upload_with_guid_in_request(
Avantol13 marked this conversation as resolved.
Show resolved Hide resolved
app, client, auth_client, encoded_creds_jwt, user_client, indexd_200_response
):
"""
Test /data/multipart/init with guid parameter in request data
"""

class MockResponse(object):
def __init__(self, data, status_code=200):
self.data = data
self.status_code = status_code

def json(self):
return self.data

data_requests_mocker = mock.patch(
"fence.blueprints.data.indexd.requests", new_callable=mock.Mock
)
arborist_requests_mocker = mock.patch(
"gen3authz.client.arborist.client.httpx.Client.request", new_callable=mock.Mock
)

fence.blueprints.data.indexd.BlankIndex.init_multipart_upload = MagicMock()
with data_requests_mocker as data_requests, arborist_requests_mocker as arborist_requests:
did = str(uuid.uuid4())
if indexd_200_response:
data_requests.get.return_value = MockResponse(
{
"did": did,
"baseid": "",
"rev": "",
"size": 10,
"file_name": "file1",
"urls": ["s3://bucket1/key"],
"hashes": {},
"metadata": {},
"authz": ["/open"],
"acl": ["*"],
"form": "",
"created_date": "",
"updated_date": "",
}
)
data_requests.get.return_value.status_code = 200
else:
data_requests.get.return_value = MockResponse("no record found")
data_requests.get.return_value.status_code = 404
arborist_requests.return_value = MockResponse({"auth": True})
arborist_requests.return_value.status_code = 200
fence.blueprints.data.indexd.BlankIndex.init_multipart_upload.return_value = (
"test_uploadId"
)
headers = {
"Authorization": "Bearer " + encoded_creds_jwt.jwt,
"Content-Type": "application/json",
}
file_name = "asdf"
data = json.dumps({"file_name": file_name, "guid": did})
response = client.post("/data/multipart/init", headers=headers, data=data)

if indexd_200_response:
assert response.status_code == 201, response
assert "guid" in response.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also check that the content of the guid is the same value as the one passed into the endpoint?

assert did == response.json.get("guid")
assert "uploadId" in response.json
else:
assert response.status_code == 404, response


def test_multipart_upload_presigned_url(
app, client, auth_client, encoded_creds_jwt, user_client
):
Expand Down
Loading