
Add format versions count report #98

Merged 1 commit, Dec 17, 2020

Conversation


@tw4l tw4l commented Nov 30, 2020

Connected to #93

This PR implements the "Format versions count" report in the UI and API, with support for limiting results by start and end date. This commit also removes the Bootstrap alerts from the report selection template, as there is no longer the need for any "Coming soon"-type messages, and users should be able to tell that their reports are loading in a new tab without these alerts.

NB: With reference to this conversation in #65 (comment), I suspect that some changes will need to be made around implementing the start_date and end_date parameters before merging. This might mean changing this report to use files' ingestion Event date, or it might mean modifying the "File format count" report to use AIP.create_date, as I've done here. My rationale for using AIP.create_date is that it is likely the most visible date for Archivematica users, since it is displayed in both the Archival Storage tab and the Storage Service's Package tab. But honestly, I'm not sure which is the better approach and would like some input, as I'd like to set a standard practice and use it moving forward wherever possible.

In many cases, file ingestion dates and AIP creation dates will be the same, or at least similar. But in some cases, as in the case of files that are stored in a transfer backlog for weeks or months prior to becoming part of a SIP (and, eventually, an AIP), they may end up being quite different.

So when we say "start date" and "end date" in relation to something like this report, or the "File format count" report, do we mean when the file was first ingested into an Archivematica instance, or when it became part of an AIP? Might that answer change for other reports? How might we better communicate what these dates mean to users? (I've tried in the API documentation with some success, I think, but am more at a loss for what to do in the GUI)

@tw4l tw4l force-pushed the dev/issue-93-format-version-count branch 8 times, most recently from fe51478 to ce12964 Compare December 2, 2020 22:35
@tw4l tw4l force-pushed the dev/issue-93-format-version-count branch from ce12964 to 21e78de Compare December 2, 2020 22:55
@tw4l tw4l changed the title WIP: Add format versions count report Add format versions count report Dec 2, 2020
@tw4l tw4l marked this pull request as ready for review December 2, 2020 22:58
@tw4l tw4l requested review from ross-spencer and peterVG December 2, 2020 22:58

:param date_string: Date string. Valid format is YYYY-MM-DD.

:returns: datetime.datetime object or None
Contributor

I might consider doing the work of _prepare_datetimes_for_query(...) here and returning a consistent data type in this helper. For me there are two main clues/reasons:

  1. Mixing return types might cause complexity upstream: testing for None vs. getting to work on a newly minted datetime.
  2. I appreciate the amount of thought that has gone into the date handling across the PR, but the story as I see it through the code is that we're largely only dealing with date ranges, which I believe these helpers are set up to handle going forward. I think we're always going to be working with a min or a max. So I might suggest a function signature something like:
def parse_datetime_bounds(date_string, upper=False):
    # return datetime (or datetime + 1 day if upper is True) if valid;
    # otherwise return datetime.min, or datetime.max if upper is True

I think we might see a cleaner coding style appear by decoding early, rather than doing a little bit now, and a little bit later.
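A minimal sketch of what that suggestion could look like (the function name and the min/max fallback semantics come from the comment above; the rest is an assumption about how it would be implemented):

```python
from datetime import datetime, timedelta


def parse_datetime_bounds(date_string, upper=False):
    """Parse a YYYY-MM-DD string into a datetime query bound.

    Always returns a datetime: invalid or missing input decodes to
    the widest possible bound (datetime.min or datetime.max), so
    callers never need to test for None.
    """
    try:
        parsed = datetime.strptime(date_string, "%Y-%m-%d")
    except (ValueError, TypeError):
        return datetime.max if upper else datetime.min
    # For an upper bound, advance one day so the range is inclusive
    # of the requested end date.
    return parsed + timedelta(days=1) if upper else parsed
```

With this shape, the query code can apply both bounds unconditionally rather than branching on None.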

Comment on lines 21 to 26
if isinstance(date_string, str):
    try:
        return datetime.strptime(date_string, "%Y-%m-%d")
    except ValueError:
        pass
return None
Contributor
I suspect you can get away with something like:

def parse_date(date_string):
    """Parse date parameter string into datetime object or None."""
    try:
        return datetime.strptime(date_string, "%Y-%m-%d")
    except (ValueError, TypeError):
        pass

This would simplify things a little, if the function is kept as is.
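For illustration, the simplified helper handles both malformed strings and a missing parameter, since strptime raises TypeError on None (a runnable restatement of the suggestion above):

```python
from datetime import datetime


def parse_date(date_string):
    """Parse date parameter string into datetime object or None."""
    try:
        return datetime.strptime(date_string, "%Y-%m-%d")
    except (ValueError, TypeError):
        # Fall through: the implicit return value is None.
        pass
```

Valid input parses; malformed strings and None both yield None, so the isinstance check is no longer needed.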


ross-spencer commented Dec 8, 2020

@tw4l thanks for this. Overall I can see this is a high quality PR, although I have to step into it in more detail.

I feel like AIP created date is exactly the right date to use here as I think this is specifically AIP level reporting. You might be interested in this ticket I mentioned about the archival storage tab too: archivematica/Issues#731

I have left some initial comments about the date handling which are good additions and I look forward to seeing applied to other existing reports.

What I am a little bit stuck on before I start to review in more detail, is the similarity in the shape of the data between the two data functions:

  1. curl -X GET "http://127.0.0.1:5000/api/data/aip-overview/1" -H "accept: application/json"
  2. curl -X GET "http://127.0.0.1:5000/api/data/format-version-count/1" -H "accept: application/json"

It reminds me of a conversation before, but the similarity feels too hard to ignore, and something should be done about it (my preference: consolidate, rather than duplicate or remove). So I'd just like to ask more about the rationale for 2 over 1. And I think we need to understand what happens to both now, with respect to what each of them offers. My initial instinct is that 2 should probably not have been created without acknowledgement of 1 (i.e. potentially adding to technical debt). I could be wrong; the partial duplication might be fine. Maybe this really is a case of repository-level reporting vs. starting to implement functions to support preservation watch and treatment?


tw4l commented Dec 11, 2020

Thank you for the feedback @ross-spencer! As always I appreciate your sharp mind and generosity.

I feel like AIP created date is exactly the right date to use here as I think this is specifically AIP level reporting. You might be interested in this ticket I mentioned about the archival storage tab too: archivematica/Issues#731

Fantastic, and thanks for the helpful context with that issue! Given that, I will plan to add a commit in this PR that modifies the "Format count report" view to also use the AIP created date, to bring the reports into line with each other.

I have left some initial comments about the date handling which are good additions and I look forward to seeing applied to other existing reports.

Thanks, these comments make a lot of sense! I agree with your insight that we'd be better off decoding early and always returning a datetime object, and will make changes accordingly.

What I am a little bit stuck on before I start to review in more detail, is the similarity in the shape of the data between the two data functions ... It reminds me of a conversation before, but the similarity feels too hard to ignore, and something should be done about it (my preference: consolidate, rather than duplicate or remove). So I'd just like to ask more about the rationale for 2 over 1.

This is fair, and truthfully it's just the result of an oversight on my part. I started with the UI report and worked backwards from there (i.e. deciding what the report looks like, then writing the Data endpoint accordingly, and finally adding the API endpoint to keep good coverage there). I suspect that the similarity and potential to extend what was already there would have been much more apparent to me if I had worked in the reverse order. But you're also right that we've had similar conversations before, so apologies for making us cover this ground again.

Given what you've pointed out, I see a few potential ways forward:

  1. Keep both sets of Data endpoints and API endpoints.
  2. Remove the new Data and API endpoints, and use the existing Data.aip_overview endpoint (adding file size, date limiting, and testing comparable to Data.format_versions_count) for the "Format versions count report" in the UI.
  3. Hybrid approach: Keep the new Data.format_versions_count Data endpoint and continue to use it for the "Format versions count report" in the UI, but remove the new API endpoint in favour of the existing aip_overview API endpoint. We could potentially also supplement the aip_overview API endpoint by adding aggregate file size, date limiting, and tests so that nothing's lost to consumers by dropping the format_version_count API endpoint.

I agree with you that having both API endpoints as they are doesn't add much value currently (your thought about "repository level reporting, vs. starting to implement functions to support preservation watch and treatment" is intriguing, but I'm not sure we're there yet) so that rules out approach 1 for me.

My first thought was to go with approach 2, even though it would mean throwing out a lot of the work in this PR. However, as this would require first fetching data from the Data.aip_overview endpoint and then (in another Data endpoint function or the Reporter view) significantly transforming the shape of that data and sorting it for use in the UI template, there would be costs in terms of speed and complexity vs. leaning on the database/ORM to do the heavy lifting and return the data in the shape we need. Personally, I don't see the benefits of this approach outweighing those costs.

We could also consider the inverse of approach 2 (i.e. keeping the new endpoints and removing the old), but I see value in keeping the API endpoints a bit generic so that other systems can consume them and use them for their own purposes, and I think the aip_overview API endpoint exemplifies that well.

So as I write this, I'm inclined to go with the hybrid approach 3. In essence, I see this as optimizing for speed in the reporting UI and thoroughness/generality in the API endpoint. What do you think? Does that sound reasonable to you?

@ross-spencer

Thanks @tw4l for laying these options out. It's really helpful. I'm thinking through them. I was thinking it might benefit me to share some previous context/thinking here in person, and I wonder what impact any particular approach might have on some of the decisions we make in development moving forward, so I've requested a Google Hangout with yourself and Peter, if that sounds good?

Going through some of this again doesn't feel like that big a deal to me. It's really timely. There's more that I could have done here to signpost intentions - and what I think might be a somewhat orphaned function in the data module/API endpoint certainly doesn't help! It's going to come up more for other folks as we get others contributing.


tw4l commented Dec 14, 2020

Sounds good, thanks @ross-spencer! Looking forward to discussing further this afternoon.


tw4l commented Dec 14, 2020

To sum up how we left the offline conversation:

  • I will move the format_version_count Data and API endpoints to a new module and namespace respectively so that there is a clear differentiation between endpoints that are for generic data consumption vs. endpoints that are feeding data for reports implemented in the UI.
  • I will assess the other existing endpoints for inclusion in the new module/namespace and move the ones that should be moved - e.g. the largest file report endpoints.
  • @ross-spencer will complete code review for this PR, with the understanding that I will be moving the endpoints and making the suggested changes to the helper for parsing date strings into datetime objects, but have not done so yet.

@ross-spencer ross-spencer left a comment


@tw4l lots of this is looking good. I'm submitting a first-pass code review in the hope it might aid your efforts this week. I know you're working on the changes we discussed. There are a few small style things here, something to think about in the tests, and possibly some fortification to add to the report generation, but I like the look of the report.


This fixture is used to create expected state which is then used to
test the Data.format_versions_count endpoint.
"""
Contributor

These look good. I wonder if we can think about providing some helpers to make it easier to create fixtures? Simplify the constructors a little? Recognizing time restrictions this week, I've explored simplifying this a little and come up with the following - what do you think?

diff --git a/AIPscan/Data/tests/conftest.py b/AIPscan/Data/tests/conftest.py
index 6fb1e88..816cccf 100644
--- a/AIPscan/Data/tests/conftest.py
+++ b/AIPscan/Data/tests/conftest.py
@@ -3,6 +3,8 @@
 """This module defines shared Data blueprint pytest fixtures."""
 
 import datetime
+import uuid
+
 import pytest
 
 from AIPscan import db, create_app
@@ -27,6 +29,121 @@ AIP_1_CREATION_DATE = "2020-01-01"
 AIP_2_CREATION_DATE = "2020-06-01"
 
 
+def _create_test_storage_service(**kwargs):
+    """Wrapper to slightly more easily return a sample storage service
+    for conftest fixtures.
+
+    param: ...
+    param: ...
+    param: ...
+    """
+    name = kwargs.get("name", "test storage service")
+    url = kwargs.get("url", "http://example.com")
+    user_name = kwargs.get("user_name", "test")
+    api_key = kwargs.get("api_key", "test")
+    download_limit = kwargs.get("download_limit", 20)
+    download_offset = kwargs.get("download_offset", 10)
+    default = kwargs.get("default", True)
+    storage_service = StorageService(
+        name=name,
+        url=url,
+        user_name=user_name,
+        api_key=api_key,
+        download_limit=download_limit,
+        download_offset=download_offset,
+        default=default,
+    )
+    db.session.add(storage_service)
+    db.session.commit()
+    return storage_service
+
+
+def _create_fetch_job(**kwargs):
+    """Wrapper to slightly more easily return a sample fetch job for
+    conftest fixtures.
+
+    param: ...
+    param: ...
+    param: ...
+    """
+    total_packages = kwargs.get("total_packages", 1)
+    total_aips = kwargs.get("total_aips", 1)
+    total_deleted_aips = kwargs.get("total_deleted_aips", 0)
+    download_start = kwargs.get("download_start", datetime.datetime.now())
+    download_end = kwargs.get("download_end", datetime.datetime.now())
+    download_directory = kwargs.get("download_directory", "/some/directory/")
+    storage_service_id = kwargs.get("storage_service_id", 1)
+    fetch_job = FetchJob(
+        total_packages=total_packages,
+        total_aips=total_aips,
+        total_deleted_aips=total_deleted_aips,
+        download_start=download_start,
+        download_end=download_end,
+        download_directory=download_directory,
+        storage_service_id=storage_service_id,
+    )
+    db.session.add(fetch_job)
+    db.session.commit()
+    return fetch_job
+
+def _create_aip(**kwargs):
+    """Wrapper to slightly more easily return an AIP for conftest
+    fixtures.
+
+    param: ...
+    param: ...
+    param: ...
+    """
+    uuid_ = kwargs.get("uuid", str(uuid.uuid4()))
+    transfer_name = kwargs.get("transfer_name", "Test AIP")
+    create_date = kwargs.get("create_date", datetime.datetime.now())
+    storage_service_id = kwargs.get("storage_service_id", 1)
+    fetch_job_id = kwargs.get("fetch_job_id", 1)
+    return AIP(
+        uuid=uuid_,
+        transfer_name=transfer_name,
+        create_date=create_date,
+        storage_service_id=storage_service_id,
+        fetch_job_id=fetch_job_id,
+    )
+
+
+def _create_test_file(**kwargs):
+    """Wrapper to slightly more easily return a sample file for conftest
+    fixtures.
+
+    param: ...
+    param: ...
+    param: ...
+    """
+    name = kwargs.get("name", "file_name.ext")
+    filepath = kwargs.get("filepath", "/path/to/file_name.ext")
+    uuid_ = kwargs.get("uuid", str(uuid.uuid4()))
+    file_type = kwargs.get("file_type", FileType.original)
+    size = kwargs.get("size", "0")
+    date_created = kwargs.get("date_created", datetime.datetime.now())
+    puid = kwargs.get("puid", "fmt/test-1")
+    file_format = kwargs.get("file_format", "ACME File Format")
+    format_version = kwargs.get("format_version", "0.0.0")
+    checksum_type = kwargs.get("checksum_type", "test")
+    checksum_value = kwargs.get("checksum_value", "test")
+    aip_id = kwargs.get("aip_id", 1)
+    return File(
+        name=name,
+        filepath=filepath,
+        uuid=uuid_,
+        file_type=file_type,
+        size=size,
+        date_created=date_created,
+        puid=puid,
+        file_format=file_format,
+        format_version=format_version,
+        checksum_type=checksum_type,
+        checksum_value=checksum_value,
+        aip_id=aip_id,
+    )
+
+
 @pytest.fixture
 def app_with_populated_files(scope="package"):
     """Fixture with pre-populated data.
@@ -38,66 +155,30 @@ def app_with_populated_files(scope="package"):
     with app.app_context():
         db.create_all()
 
-        storage_service = StorageService(
-            name="test storage service",
-            url="http://example.com",
-            user_name="test",
-            api_key="test",
-            download_limit=20,
-            download_offset=10,
-            default=True,
-        )
-        db.session.add(storage_service)
-        db.session.commit()
-
-        fetch_job = FetchJob(
-            total_packages=1,
-            total_aips=1,
-            total_deleted_aips=0,
-            download_start=datetime.datetime.now(),
-            download_end=datetime.datetime.now(),
-            download_directory="/some/dir",
-            storage_service_id=storage_service.id,
-        )
-        db.session.add(fetch_job)
-        db.session.commit()
+        storage_service = _create_test_storage_service()
+        fetch_job = _create_fetch_job(storage_service_id=storage_service.id)
 
-        aip = AIP(
-            uuid="111111111111-1111-1111-11111111",
-            transfer_name="test aip",
-            create_date=datetime.datetime.now(),
+        aip = _create_aip(
             storage_service_id=storage_service.id,
             fetch_job_id=fetch_job.id,
         )
         db.session.add(aip)
         db.session.commit()
 
-        original_file = File(
-            name="original.tif",
-            filepath="/path/to/original.tif",
-            uuid="111111111111-1111-1111-11111111",
+        original_file = _create_test_file(
             file_type=FileType.original,
             size=ORIGINAL_FILE_SIZE,
-            date_created=datetime.datetime.now(),
             puid=TIFF_PUID,
             file_format=TIFF_FILE_FORMAT,
-            checksum_type="test",
-            checksum_value="test",
-            aip_id=1,
         )
-        preservation_file = File(
-            name="preservation.tif",
-            filepath="/path/to/preservation.tif",
-            uuid="222222222222-2222-2222-22222222",
+
+        preservation_file = _create_test_file(
             file_type=FileType.preservation,
             size=PRESERVATION_FILE_SIZE,
-            date_created=datetime.datetime.now(),
             puid=TIFF_PUID,
             file_format=TIFF_FILE_FORMAT,
-            checksum_type="test",
-            checksum_value="test",
-            aip_id=1,
         )
+
         db.session.add(original_file)
         db.session.add(preservation_file)
         db.session.commit()
@@ -114,45 +195,22 @@ def app_with_populated_format_versions(scope="package"):
     This fixture is used to create expected state which is then used to
     test the Data.format_versions_count endpoint.
     """
+    AIP_DATE_FORMAT = "%Y-%m-%d"
+
     app = create_app("test")
     with app.app_context():
         db.create_all()
 
-        storage_service = StorageService(
-            name="test storage service",
-            url="http://example.com",
-            user_name="test",
-            api_key="test",
-            download_limit=20,
-            download_offset=10,
-            default=True,
-        )
-        db.session.add(storage_service)
-        db.session.commit()
+        storage_service = _create_test_storage_service()
+        fetch_job = _create_fetch_job(storage_service_id=storage_service.id)
 
-        fetch_job = FetchJob(
-            total_packages=1,
-            total_aips=1,
-            total_deleted_aips=0,
-            download_start=datetime.datetime.now(),
-            download_end=datetime.datetime.now(),
-            download_directory="/some/dir",
-            storage_service_id=storage_service.id,
-        )
-        db.session.add(fetch_job)
-        db.session.commit()
-
-        aip1 = AIP(
-            uuid="111111111111-1111-1111-11111111",
-            transfer_name="test aip",
-            create_date=datetime.datetime.strptime(AIP_1_CREATION_DATE, "%Y-%m-%d"),
+        aip1 = _create_aip(
+            create_date=datetime.datetime.strptime(AIP_1_CREATION_DATE, AIP_DATE_FORMAT),
             storage_service_id=storage_service.id,
             fetch_job_id=fetch_job.id,
         )
-        aip2 = AIP(
-            uuid="222222222222-2222-2222-22222222",
-            transfer_name="test aip",
-            create_date=datetime.datetime.strptime(AIP_2_CREATION_DATE, "%Y-%m-%d"),
+        aip2 = _create_aip(
+            create_date=datetime.datetime.strptime(AIP_2_CREATION_DATE, AIP_DATE_FORMAT),
             storage_service_id=storage_service.id,
             fetch_job_id=fetch_job.id,
         )
@@ -160,34 +218,22 @@ def app_with_populated_format_versions(scope="package"):
         db.session.add(aip2)
         db.session.commit()
 
-        file1 = File(
-            name="jpeg_1.01.jpg",
-            filepath="/path/to/jpeg_1.01.jpg",
-            uuid="111111111111-1111-1111-11111111",
-            file_type=FileType.original,
+        file1 = _create_test_file(
             size=ORIGINAL_FILE_SIZE,
-            date_created=datetime.datetime.now(),
             puid=JPEG_1_01_PUID,
             file_format=JPEG_FILE_FORMAT,
             format_version=JPEG_1_01_FORMAT_VERSION,
-            checksum_type="test",
-            checksum_value="test",
             aip_id=aip1.id,
         )
-        file2 = File(
-            name="jpeg_1.02.jpg",
-            filepath="/path/to/jpeg_1.02.jpg",
-            uuid="222222222222-2222-2222-22222222",
-            file_type=FileType.original,
+
+        file2 = _create_test_file(
             size=PRESERVATION_FILE_SIZE,
-            date_created=datetime.datetime.now(),
             puid=JPEG_1_02_PUID,
             file_format=JPEG_FILE_FORMAT,
             format_version=JPEG_1_02_FORMAT_VERSION,
-            checksum_type="test",
-            checksum_value="test",
             aip_id=aip2.id,
         )
+
         db.session.add(file1)
         db.session.add(file2)
         db.session.commit()

The full script:

# -*- coding: utf-8 -*-

"""This module defines shared Data blueprint pytest fixtures."""

import datetime
import uuid

import pytest

from AIPscan import db, create_app
from AIPscan.models import StorageService, FetchJob, AIP, File, FileType


TIFF_FILE_FORMAT = "Tagged Image File Format"
TIFF_PUID = "fmt/353"

JPEG_FILE_FORMAT = "JPEG"

JPEG_1_01_FORMAT_VERSION = "1.01"
JPEG_1_01_PUID = "fmt/43"

JPEG_1_02_FORMAT_VERSION = "1.02"
JPEG_1_02_PUID = "fmt/44"

ORIGINAL_FILE_SIZE = 1000
PRESERVATION_FILE_SIZE = 2000

AIP_1_CREATION_DATE = "2020-01-01"
AIP_2_CREATION_DATE = "2020-06-01"


def _create_test_storage_service(**kwargs):
    """Wrapper to slightly more easily return a sample storage service
    for conftest fixtures.

    param: ...
    param: ...
    param: ...
    """
    name = kwargs.get("name", "test storage service")
    url = kwargs.get("url", "http://example.com")
    user_name = kwargs.get("user_name", "test")
    api_key = kwargs.get("api_key", "test")
    download_limit = kwargs.get("download_limit", 20)
    download_offset = kwargs.get("download_offset", 10)
    default = kwargs.get("default", True)
    storage_service = StorageService(
        name=name,
        url=url,
        user_name=user_name,
        api_key=api_key,
        download_limit=download_limit,
        download_offset=download_offset,
        default=default,
    )
    db.session.add(storage_service)
    db.session.commit()
    return storage_service


def _create_fetch_job(**kwargs):
    """Wrapper to slightly more easily return a sample fetch job for
    conftest fixtures.

    param: ...
    param: ...
    param: ...
    """
    total_packages = kwargs.get("total_packages", 1)
    total_aips = kwargs.get("total_aips", 1)
    total_deleted_aips = kwargs.get("total_deleted_aips", 0)
    download_start = kwargs.get("download_start", datetime.datetime.now())
    download_end = kwargs.get("download_end", datetime.datetime.now())
    download_directory = kwargs.get("download_directory", "/some/directory/")
    storage_service_id = kwargs.get("storage_service_id", 1)
    fetch_job = FetchJob(
        total_packages=total_packages,
        total_aips=total_aips,
        total_deleted_aips=total_deleted_aips,
        download_start=download_start,
        download_end=download_end,
        download_directory=download_directory,
        storage_service_id=storage_service_id,
    )
    db.session.add(fetch_job)
    db.session.commit()
    return fetch_job


def _create_aip(**kwargs):
    """Wrapper to slightly more easily return an AIP for conftest
    fixtures.

    param: ...
    param: ...
    param: ...
    """
    uuid_ = kwargs.get("uuid", str(uuid.uuid4()))
    transfer_name = kwargs.get("transfer_name", "Test AIP")
    create_date = kwargs.get("create_date", datetime.datetime.now())
    storage_service_id = kwargs.get("storage_service_id", 1)
    fetch_job_id = kwargs.get("fetch_job_id", 1)
    return AIP(
        uuid=uuid_,
        transfer_name=transfer_name,
        create_date=create_date,
        storage_service_id=storage_service_id,
        fetch_job_id=fetch_job_id,
    )


def _create_test_file(**kwargs):
    """Wrapper to slightly more easily return a sample file for conftest
    fixtures.

    param: ...
    param: ...
    param: ...
    """
    name = kwargs.get("name", "file_name.ext")
    filepath = kwargs.get("filepath", "/path/to/file_name.ext")
    uuid_ = kwargs.get("uuid", str(uuid.uuid4()))
    file_type = kwargs.get("file_type", FileType.original)
    size = kwargs.get("size", "0")
    date_created = kwargs.get("date_created", datetime.datetime.now())
    puid = kwargs.get("puid", "fmt/test-1")
    file_format = kwargs.get("file_format", "ACME File Format")
    format_version = kwargs.get("format_version", "0.0.0")
    checksum_type = kwargs.get("checksum_type", "test")
    checksum_value = kwargs.get("checksum_value", "test")
    aip_id = kwargs.get("aip_id", 1)
    return File(
        name=name,
        filepath=filepath,
        uuid=uuid_,
        file_type=file_type,
        size=size,
        date_created=date_created,
        puid=puid,
        file_format=file_format,
        format_version=format_version,
        checksum_type=checksum_type,
        checksum_value=checksum_value,
        aip_id=aip_id,
    )


@pytest.fixture
def app_with_populated_files(scope="package"):
    """Fixture with pre-populated data.

    This fixture is used to create expected state which is then used to
    test the Data.aips_by_file_format and Data.aips_by_puid endpoints.
    """
    app = create_app("test")
    with app.app_context():
        db.create_all()

        storage_service = _create_test_storage_service()
        fetch_job = _create_fetch_job(storage_service_id=storage_service.id)

        aip = _create_aip(
            storage_service_id=storage_service.id,
            fetch_job_id=fetch_job.id,
        )
        db.session.add(aip)
        db.session.commit()

        original_file = _create_test_file(
            file_type=FileType.original,
            size=ORIGINAL_FILE_SIZE,
            puid=TIFF_PUID,
            file_format=TIFF_FILE_FORMAT,
        )

        preservation_file = _create_test_file(
            file_type=FileType.preservation,
            size=PRESERVATION_FILE_SIZE,
            puid=TIFF_PUID,
            file_format=TIFF_FILE_FORMAT,
        )

        db.session.add(original_file)
        db.session.add(preservation_file)
        db.session.commit()

        yield app

        db.drop_all()


@pytest.fixture
def app_with_populated_format_versions(scope="package"):
    """Fixture with pre-populated data.

    This fixture is used to create expected state which is then used to
    test the Data.format_versions_count endpoint.
    """
    AIP_DATE_FORMAT = "%Y-%m-%d"

    app = create_app("test")
    with app.app_context():
        db.create_all()

        storage_service = _create_test_storage_service()
        fetch_job = _create_fetch_job(storage_service_id=storage_service.id)

        aip1 = _create_aip(
            create_date=datetime.datetime.strptime(
                AIP_1_CREATION_DATE, AIP_DATE_FORMAT
            ),
            storage_service_id=storage_service.id,
            fetch_job_id=fetch_job.id,
        )
        aip2 = _create_aip(
            create_date=datetime.datetime.strptime(
                AIP_2_CREATION_DATE, AIP_DATE_FORMAT
            ),
            storage_service_id=storage_service.id,
            fetch_job_id=fetch_job.id,
        )
        db.session.add(aip1)
        db.session.add(aip2)
        db.session.commit()

        file1 = _create_test_file(
            size=ORIGINAL_FILE_SIZE,
            puid=JPEG_1_01_PUID,
            file_format=JPEG_FILE_FORMAT,
            format_version=JPEG_1_01_FORMAT_VERSION,
            aip_id=aip1.id,
        )

        file2 = _create_test_file(
            size=PRESERVATION_FILE_SIZE,
            puid=JPEG_1_02_PUID,
            file_format=JPEG_FILE_FORMAT,
            format_version=JPEG_1_02_FORMAT_VERSION,
            aip_id=aip2.id,
        )

        db.session.add(file1)
        db.session.add(file2)
        db.session.commit()

        yield app

        db.drop_all()

It reduces the number of lines a little and helps with sensible defaults.

There are a few gaps, in that I didn't have a position on adding the param: entries for these. Also, I think the rebase will affect it slightly, but the overall functions should remain the same.

We could also make this a different future commit if you like?

Contributor Author

Oh I like this a lot @ross-spencer. Thanks for putting in a lot of effort and good thinking here! I'll do my best to get this into my revisions but appreciate the option to kick it to a future commit/PR if needed.

Contributor Author

Helpers added in commit dbe8350.

Contributor

Awesome! The changes to conftest.py are looking good!

DATE_AFTER_AIP_2 = "2020-06-02"


class MockFormatVersionsCountQueryResult:
ross-spencer (reviewer):

I like the sophistication here.

The approach also reminds me of named tuples which I use occasionally. They're a little more lightweight. But I've no opinion on which is best here. Just sharing for reference.

I use them in my Palettes work for the painter goblin: https://github.com/ross-spencer/palettes/blob/9a422fa8aeed79995b2395c87278d8d55cf71787/palettes.py#L13

And here's a link to "module of the week" for more context: https://pymotw.com/2/collections/namedtuple.html

tw4l (author):

Oh interesting, this is new to me. Thanks for sharing!
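For reference, the mock above could be sketched with a namedtuple like so. The field names mirror the mock's constructor arguments; whether they match the attributes of the real SQLAlchemy result row is an assumption:

```python
from collections import namedtuple

# A lightweight alternative to a hand-written mock class: instances are
# immutable and support attribute access, much like a query result row.
VersionsCountResult = namedtuple(
    "VersionsCountResult",
    [
        "puid",
        "file_format",
        "format_version",
        "version_count",
        "total_file_count",
        "total_file_size",
    ],
)

row = VersionsCountResult(
    puid="fmt/44",
    file_format="JPEG",
    format_version="1.02",
    version_count=1,
    total_file_count=1,
    total_file_size=2000,
)

# Attribute access works just like on a result row:
print(row.format_version)  # prints 1.02
```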

version_count,
total_file_count,
total_file_size,
):
ross-spencer (reviewer):

All of this looks good! 👍

// this into it's own import too.
//
// TODO: Are there more idiomatic approaches? 🤷
bootstrap_alert = function(message, alertType) {
ross-spencer (reviewer):

Great to see these removed!

var startdate = $('#startdate3').val();
if (startdate != "") {
if (startdate > date) {
alert("The start date must be before the end date");
ross-spencer (reviewer):

I wonder if the two alerts here are supposed to have the same string; if so, we could remove the duplication with a const/var above, e.g. const DATE_ALERT = "...";.

My reason for asking: if they aren't supposed to match, then the end date alert might otherwise say "the end date must be after the start date". I have no preference.

tw4l (author):

I've added two new consts here - DATE_ALERT_START and DATE_ALERT_END - for the two variations of the message

@@ -54,8 +51,8 @@
<tr>
<td>Format version count</td>
<td></td>
<td></td>
<td></td>
<td><input type="text" id="startdate3" value="{{ start_date }}"></td>
ross-spencer (reviewer):

We've been using integers as suffixes a little across this page, but do you think we need to think a little further ahead and perhaps name attributes more meaningfully, e.g. id="start_date_version_report" or something along those lines? Same for end date?

tw4l (author):

Yes, good point!

columns=translate_headers(headers),
versions=versions,
total_file_count=sum(version.get(data.FIELD_COUNT, 0) for version in versions),
total_size=sum(version.get(data.FIELD_SIZE, 0) for version in versions),
ross-spencer (reviewer):

I'm getting a stack trace error here:

Traceback (most recent call last):
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask_restx/api.py", line 639, in error_router
    return original_handler(e)
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/ross-spencer/git/artefactual-labs/aipscanvenv/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/ross-spencer/git/artefactual-labs/AIPscan/AIPscan/Reporter/report_format_versions_count.py", line 37, in report_format_versions_count
    total_size=sum(version.get(data.FIELD_SIZE, 0) for version in versions),

It's not coming to me tonight why I might be getting None, but I am wondering if you have an instinct? We might also need to capture that in tests?

tw4l (author):

I will add the check for None in the Data endpoint here, but agree that it would be good to capture this in tests! I will work on including that in my revisions.
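The subtlety is that dict.get's default only applies when the key is missing, not when the key is present with a None value, so the fix needs an explicit coercion. A minimal sketch, with field names assumed from the report code in this thread:

```python
FIELD_COUNT = "Count"
FIELD_SIZE = "Size"

# The fmt/468 fallback case can produce a row whose Size is None:
versions = [
    {"Format": "JPEG", FIELD_COUNT: 2, FIELD_SIZE: 3000},
    {"Format": "ISO Disk Image File", FIELD_COUNT: 1, FIELD_SIZE: None},
]

# version.get(FIELD_SIZE, 0) still returns None when the key exists but
# holds None, which is what breaks sum(). `or 0` coerces it explicitly:
total_size = sum(version.get(FIELD_SIZE) or 0 for version in versions)
total_count = sum(version.get(FIELD_COUNT) or 0 for version in versions)
print(total_size, total_count)  # prints 3000 3
```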

<td>{{ version["Format"] }}</td>
<td>{{ version["Version"] }}</td>
<td>{{ version["Count"] }}</td>
<td>{{ version["Size"] | filesizeformat }}</td>
ross-spencer (reviewer):

This might be related to the Reporter module code, but here I have a {'PUID': 'fmt/468', 'Format': 'ISO Disk Image File', 'Version': None, 'Count': 1, 'Size': None} result, and the template expects a number to work with, not None, so I wonder how we can fortify this, or the data coming out of (or into) the DB, to avoid it? I looked at the original METS and it looks like the ISO Disk Image File was ingested as a standard transfer from one of my FPR test sets, but it's likely the image from the sample data dirs.

tw4l (author):

I think it's likely that this File object has the format and size it does because of this potentially misleading AttributeError handling:

except AttributeError:
# File error/warning to log. Obviously this format may
# be incorrect so it is our best guess.
file_info["file_format"] = "ISO Disk Image File"
file_info["puid"] = "fmt/468"
(see also issue #83). I wonder if it's actually an ISO disk image...

Regardless, I think you hit the nail on the head that we should be checking for None in the data endpoint to avoid this issue. Good spot!

tw4l (author):

Ah whoops, just re-read your comment and my "I wonder if it's actually an ISO disk image..." was off base here. Sorry about that! 😅

except AttributeError:
pass
version_info[FIELD_COUNT] = version.file_count
version_info[FIELD_SIZE] = version.total_size
ross-spencer (reviewer):

Ah, it looks like maybe we can test for none here? (see template comment)

tw4l (author):

Yes I think you're right!
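Normalising at the data layer could look like this sketch. The Row class is a hypothetical stand-in for the SQLAlchemy aggregate result row in the snippet above, not AIPscan's actual model:

```python
FIELD_COUNT = "Count"
FIELD_SIZE = "Size"


class Row:
    """Hypothetical stand-in for a SQLAlchemy aggregate result row."""

    file_count = 1
    total_size = None  # e.g. the fmt/468 fallback case with no size


version = Row()
version_info = {}
# `or 0` converts a None aggregate into a number that downstream
# consumers (like the filesizeformat template filter) can handle:
version_info[FIELD_COUNT] = version.file_count or 0
version_info[FIELD_SIZE] = version.total_size or 0
print(version_info)  # prints {'Count': 1, 'Size': 0}
```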

:start_date: Optional inclusive AIP creation start date
(datetime.datetime object).
:end_date: Optional inclusive AIP creation end date
(datetime.datetime object).
ross-spencer (reviewer):

Suggested change
(datetime.datetime object).
:param storage_service_id: Storage Service ID.
:start_date: Optional inclusive AIP creation start date
(datetime.datetime object).
:end_date: Optional inclusive AIP creation end date
(datetime.datetime object).

I am still getting used to this more complex style of docstring. Admittedly I haven't spent a lot of time looking at this, but what do you think about adding a hanging indent: https://stackoverflow.com/a/50455731 - to help readability? It would be good to do that wherever we break a docstring over two lines. Initially it's hard to pick up as part of the same string.

Is there another idiomatic approach that you happen to know about?

tw4l (author):

These detailed docstrings are a new-ish habit for me so I don't have a ready answer for other approaches but this is a good suggestion and I agree it would help a lot with readability.
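A hanging indent on the continuation lines would look like this. The function signature here is a placeholder for illustration, not the module's actual one:

```python
def format_versions_count(storage_service_id, start_date=None, end_date=None):
    """Return format version counts for a Storage Service.

    :param storage_service_id: Storage Service ID.
    :start_date: Optional inclusive AIP creation start date
        (datetime.datetime object).
    :end_date: Optional inclusive AIP creation end date
        (datetime.datetime object).
    """
```

The extra indentation makes it immediately clear that "(datetime.datetime object)." continues the line above rather than starting a new field.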

tw4l commented on Dec 17, 2020:

Hi @ross-spencer - I believe I've responded to each of your suggestions and comments and this PR is now ready for code review. Thanks again for the great suggestions here!

In addition to responding to the in-line comments, I've rebased this branch on main and moved the report-specific endpoints to a new Data.report_data module and API.namespace_report_data namespace. This required a lot of changes in the Data, API, and Reporter blueprints and updates to the tests, so the diff might be a bit large. I've made sure that all tests are passing and done a quick manual spot check of each of the API endpoints and UI reports to make sure everything looks good following the change.

Along the way I discovered a bug in the agents_transfers endpoint that sometimes causes that report and API endpoint to raise an AttributeError, which I filed an issue for here: #104.

tw4l requested a review from ross-spencer on December 17, 2020, 02:23
tw4l commented on Dec 17, 2020:

I realized right after marking this PR as ready for re-review that I forgot to update the "File format count" report to use AIP.create_date for date filtering, so I've added a follow-up commit with that change.

ross-spencer (reviewer) left a review:

One question around style/presentation for the API URL, but approving.

The way you approached moving things about made it really simple to re-review. Lots of this is looking really clean and easy to use now too. Thanks @tw4l


This fixture is used to create expected state which is then used to
test the Data.format_versions_count endpoint.
"""
ross-spencer (reviewer):

Awesome! The changes to conftest.py are looking good!

data.FIELD_STORAGE_NAME: "Storage Service Name",
data.FIELD_TRANSFER_NAME: "Transfer Name",
data.FIELD_VERSION: "Version",
fields.FIELD_AIP_NAME: "AIP Name",
ross-spencer (reviewer):

The fields module is a nice intuitive change it feels much easier to make use of now.

from AIPscan.Data import report_data

api = Namespace(
"report_data", description="Retrieve data optimized for AIPscan reports"
ross-spencer (reviewer):

I wonder which is nicer to have here:

With underscore: http://127.0.0.1:5000/api/report_data/format-version-count/1
With dash: http://127.0.0.1:5000/api/report-data/format-version-count/1

It looks like both are valid - my preference is with the dash, because we have dashes in the endpoint name too. IDK what do you think?

tw4l (author):

Ah, good catch! I agree the dash is nicer and more consistent. I'll make that change before merging.

This commit also:

- Adds a new Data.report_data module and report-data API namespace
for data endpoints optimized for reports implemented in the AIPscan UI
as opposed to more general data overview endpoints.
- Modifies the file format count report to filter dates on
AIP.create_date to bring the two reports into line with each other.
- Adds new helper functions for creating test objects in the Data.test
package's conftest module.
tw4l force-pushed the dev/issue-93-format-version-count branch from cb05db4 to e73fc58 on December 17, 2020, 19:06
tw4l merged commit e73fc58 into main on Dec 17, 2020
tw4l deleted the dev/issue-93-format-version-count branch on December 17, 2020, 19:13