Storing fixity auditing outcomes #847
Couple of simple thoughts:
+1000. As for where to put it, if this checksum info is stored in support of durability, it should be treated at least as well as other durable information: stored in multiple places, for each operation (as opposed to occasional bulk updates from one location to another). Which location is used as an authoritative source seems to me mostly to depend on pragmatic considerations (i.e. how to keep the architecture simple and performant). |
Thinking this through a bit more, if we store event outcome data as RDF in either FCREPO or the triplestore, we'd need not just one triple but two for each event, the timestamp, and the outcome (passed/failed), assuming the trusted checksum value need only be stored once, not per verification event. So a verification cycle of 100,000 resources would result in 200,000 new triples. If that's the case, maybe storing both pieces of info in one row in a database table is more efficient, replicating the info by periodically persisting the db table(s) into Fedora as a binary resource. Doing that at least ensures there are two copies, although not necessarily robustly distributed copies. But databases are pretty easy to replicate, so if we want distributed replication, that's also an option. |
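To make the triple-count math concrete, here is a minimal sketch (Python with rdflib; the event URI and predicate names are hypothetical placeholders for whatever vocabulary is eventually chosen) of the two triples a single verification event would add:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

# Hypothetical namespace; the real vocabulary (e.g. PREMIS-based) is still undecided.
EX = Namespace("http://example.org/fixity#")

g = Graph()
event = URIRef("http://example.org/fixity/event/123")  # one URI per verification event

# Two triples per event: when it ran, and whether it passed.
g.add((event, EX.timestamp, Literal("2018-07-11T13:46:00Z", datatype=XSD.dateTime)))
g.add((event, EX.outcome, Literal("passed")))

# A cycle over 100,000 resources would add 200,000 triples shaped like these.
print(g.serialize(format="turtle"))
```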
The problem seems to me to be that a fixity check is transactional information. But the pattern suggested here persists it long after the transaction is done. Why store the events at all? Why not publish them like any other events in the system, and if a given site wants to store them or do something else with them, cool, they can figure out how to do that together with other sites that share that interest. But why make everyone pay the cost of persisting fixity check events? Speaking only for SI, we certainly don't need or want to do that. Does anything other than a single checksum per non-RDF resource actually need to be stored? |
Currently, in 7.x, enabling the Checksum and Checksum Checker modules is optional, and I'm not suggesting that similar functionality in CLAW be any different. Sorry I didn't state that explicitly. Any functionality in CLAW to generate, manage, and report on fixity auditing would be implemented as Drupal contrib modules. We would want to store events so we can express the history of a resource in PREMIS (for example). In our use case, we want to be able to document that uninterrupted history across migrations between repositories, from 7.x to CLAW. |
I think what's getting wrapped up in here is the auditing functionality. If we just need to check fixity, stick it on as a Drupal field. It'll wind up in the drupal db, the triplestore, and fedora. If you want to persist audit events, I'd model that as a content type in drupal and it'll get persisted to all three as well by default. Of course, you could filter it with context and make it populate only what you want (e.g. just the triplestore and not fedora). |
@dannylamb I hadn't thought about modelling fixity events as a Drupal content type. One downside to doing that is adding all those nodes to Drupal. I'm concerned that over time, the number of events will grow very large, with dire effects on performance. |
@mjordan And after thinking about this some more, if you're worried about performance, your best bet is usually something like a finely tuned postgres. Just putting it in Drupal, and not Fedora or the triplestore may be your best bet. I'd just be sure to snapshot the db. That's a perfectly acceptable form of durability if you ask me. |
Ha, needed to refresh. |
@mjordan Yes, that's certainly a concern. That threshold of "how much is too much for Drupal" is looming out there. It'd be nice to find out where that really is. |
I agree with @ajs6f's characterization of fixity verification as transactional, which is why I'm resisting modelling the events as Drupal nodes. We should do some thorough scalability testing, for sure. Maybe we should open an issue for that now, as a placeholder? |
I see what you're saying. It's not like you're going to be scouring that history all the time, so there's no point in having it bog down everything else. If it's too hamfisted to model them as nodes, then having a Drupal module just to emit them onto a queue is super sensible. And sandboxing it to its own storage is even more so. As for what that is/should be? I guess that depends on what you're going to do with it and how you want to access it. I presume you'd want to be able to query it? That at least narrows down the choice to either SQL or the triplestore if you wanna stay with the systems already in the stack. |
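A hedged sketch of the "just emit them onto a queue" idea, approximated here in Python with the stomp.py client against the ActiveMQ broker already in the stack (queue name, credentials, and event fields are all assumptions):

```python
import json
import stomp

# Connect to the stack's ActiveMQ broker (host, port, and default credentials assumed).
conn = stomp.Connection([("localhost", 61613)])
conn.connect("admin", "admin", wait=True)

# A hypothetical fixity event payload; field names are illustrative only.
event = {
    "resource": "http://localhost:8080/fcrepo/rest/foo/bar",
    "timestamp": "2018-07-11T13:46:00Z",
    "algorithm": "SHA-1",
    "digest": "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12",
    "outcome": "passed",
}

# Publish and let whoever cares (a fixity microservice, a logger) consume it.
conn.send(destination="/queue/islandora-fixity-events", body=json.dumps(event))
conn.disconnect()
```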
...or Solr. |
Yeah, we're going to want to query it. If we store the SHA1 checksum as a field in the binary node (which sounds like a great idea), we'll want to query the events to serialize them as PREMIS, for example ("give me all the fixity verification events for resource X, sorted by timestamp" would be nice). |
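For the "all events for resource X, sorted by timestamp" case, a minimal sketch using SQLite (the table and column names are hypothetical; a production microservice might sit on MySQL or Postgres instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE fixity_event (
           resource_id TEXT,      -- URI or UUID of the binary resource
           checked_at  TEXT,      -- ISO 8601 timestamp of the verification event
           algorithm   TEXT,      -- e.g. 'SHA-1'
           digest      TEXT,      -- value computed during this event
           outcome     TEXT       -- 'passed' or 'failed'
       )"""
)
conn.execute(
    "INSERT INTO fixity_event VALUES (?, ?, ?, ?, ?)",
    ("urn:example:resource-X", "2018-07-11T13:46:00Z", "SHA-1", "2fd4e1c6...", "passed"),
)

# Everything needed to serialize a PREMIS event history, oldest first.
rows = conn.execute(
    "SELECT checked_at, algorithm, digest, outcome "
    "FROM fixity_event WHERE resource_id = ? ORDER BY checked_at",
    ("urn:example:resource-X",),
).fetchall()
print(rows)
```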
We weren't necessarily planning on using CLAW to manage fixity. I'm actually interested in what UMD proposed, which includes using a separate graph in the triplestore specifically for audit data. Even if you were using the same triplestore for both, placing them in separate graphs should preserve performance on the CLAW one. |
Can't agree enough! @seth-shaw-unlv Did you mean separate datasets? Because in most triplestores (depends a bit, but Jena is a good example) putting them in separate named graphs in one dataset isn't going to do anything for performance. (Putting one in the default graph and one in a named graph would do a little, but not anything much compared to putting them in separate datasets.) Generally, my experience has been that in non-SQL stores (be they denormalized like BigTable descendants or "hypernormalized" like RDF stores) query construction makes the biggest difference in performance, and should dictate data layout. @mjordan Sorry about the misunderstanding-- I thought you were talking about workflow to which every install would have to subscribe. Add-on/optional stuff, no problem! |
@ajs6f, yes, you are right. I was, admittedly, speaking based on an assumption that separate graphs would improve performance due to a degree of separation. I don't have experience scaling them yet . |
@seth-shaw-unlv I think we're all going to learn a bunch in the next few years about managing huge piles of RDF! |
@seth-shaw-unlv the UMD strategy looks good, but it's specific to fcrepo. I think it's important that Islandora not rely on features of a specific Fedora API implementation. Also, I'm hoping that we can implement fixity auditing in a single Drupal module, without any additional setup (which is what we have in Islandora 7.x). @ajs6f no problem, we're all so focussed on getting core functionality right that I should have made it clear I was shifting to optional functionality. |
I think the UMD plan could be simplified to:
I'd like to keep the processing of fixity off of the Drupal server if possible as this is a process that for large repositories could be always running. |
@whikloj yes, I was starting to think about abstracting the storage out so individual sites could store it where they want. About keeping the processing off the Drupal server, you're right, the process would be running constantly. But I don't see where issuing a bunch of requests for checksums, then comparing them to the previous value, then persisting the results somewhere would put a huge load on the Drupal server. It's the Fedora server that I think will get the biggest hit, since, if my understanding is correct, it needs to read the resource into memory to respond to the request for the checksum. A while back I did some tests on a Fedora 3.x server to see how long it took to verify a checksum and found that "the length of time it takes to validate a checksum is proportionate to the size of the datastream"; I assume this is also true to a certain extent with regard to RAM usage, although I didn't test for that. |
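As a rough illustration of the per-resource round trip being described (asking the repository for a digest rather than downloading the binary, then comparing it to the previously stored value), assuming an endpoint that honors the Want-Digest request header described in the Fedora API specification, and with the URL and credentials as placeholders:

```python
import requests

def verify_fixity(binary_url, stored_sha1_b64, auth=("fedoraAdmin", "fedoraAdmin")):
    """Ask the repository for a SHA-1 digest of a binary and compare it to the
    previously recorded (base64-encoded) value. URL and credentials are assumptions."""
    resp = requests.head(binary_url, headers={"Want-Digest": "sha"}, auth=auth)
    resp.raise_for_status()
    # Per RFC 3230 the response carries something like: Digest: sha=<base64 value>
    current = resp.headers.get("Digest", "")
    outcome = "passed" if stored_sha1_b64 in current else "failed"
    return outcome, current

# outcome, digest = verify_fixity("http://localhost:8080/fcrepo/rest/foo/bar", "<base64 sha1>")
```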
Following up on @whikloj's suggestion of moving the fixity checking service off the Drupal server, would implementing it as an external microservice be an option? That way, non-Islandora sites might be able to use it as well. It kind of complicates where the data is stored (maybe that could be abstracted such that Islandora stores it in the Drupal db, Samvera stores it somewhere else, etc.). Such a microservice could be containerized if desired. |
👍 Doing it as a microservice will indeed abstract away all those details. The web interface you design for it will allow individual implementors to use whatever internal storage they want. |
Sounds like a plan - anyone object to moving forward on this basis? The "Islandora" version of this would be a module that consumed data generated by the microservice to provide reports, checksum mismatch warnings, etc. |
Reading this, my gut is telling me the microservice should stuff everything into its own SQL db and we point views at it in Drupal to generate reports/dashboards. |
I totally agree with the microservice idea for doing fixity checks. Not sure if we should handle it in this issue, or in another issue, but one thing we are missing (and missing completely in 7.x) is the ability to provide a checksum on ingest, and have it verified once the object is in storage, failing the upload if the fixity check fails. This is the most common fixity feature I'm asked for in Islandora 7.x, and it covers the statistically most likely case of the file getting mangled in transit, rather than while sitting on disk. |
@jonathangreen We're halfway there on transmission fixity. We cover it on the way into Fedora from Drupal, but not from upload into Drupal. We can open a separate issue to add it to the REST API and wherever else (text field on upload form?). |
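For the upload-into-Drupal half, the comparison itself is simple; here is a sketch of the server-side check, assuming the form or REST request carries a user-supplied digest alongside the file (function and field names are hypothetical):

```python
import hashlib

def verify_upload(uploaded_path, client_supplied_sha256):
    """Recompute the digest of the file as it landed on the server and compare
    it to what the depositor claims; fail the ingest on a mismatch."""
    h = hashlib.sha256()
    with open(uploaded_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    if h.hexdigest() != client_supplied_sha256.lower():
        raise ValueError("Transmission fixity check failed; rejecting upload.")
    return h.hexdigest()
```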
@dannylamb sounds good to me. |
Just to be clear, this would be a service that produces checksums for the frontend via its own path to persistence, not a service to which the binaries are transmitted over HTTP for a checksum, right? |
Thanks @DiegoPino, I'll pass that advice on to him when I see him. |
@jonathangreen for the 7.x version ticket, I feel it would be good to note somewhere in that ticket, for whomever ends up writing it, that some chunked transmission implementations like plupload could have issues with a user-provided hash submitted via a form (e.g., where to put it and how or when to trigger the check, since assembly of the final upload happens somewhere else...). |
@DiegoPino here is the ticket for 7.x if you want to add some notes: https://jira.duraspace.org/projects/ISLANDORA/issues/ISLANDORA-2261 |
During the 2018-07-11 CLAW tech call, @rosiel asked about checksums on CLAW binaries stored external to Fedora, e.g. stored in S3, Dropbox, etc. Getting Fedora to provide a checksum on these binaries could be quite expensive, since it pulls down the content to run a fixity check on it. One idea that came up in the discussion was that if we are using an external microservice to manage/store fixity checking, we could set up rules to verify checksums on those remote binaries. The microservice would need to pull a binary down to do its check, but if the storage service provided an API to get a checksum on a binary, our microservice could query that API instead. |
Or maybe use those services' native fixity check... like on S3, where you are already paying for and getting that hash as tech metadata via the API? |
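For the S3 case specifically, a hedged sketch of pulling that "hash as tech metadata" without downloading the object, using boto3 (credentials are assumed to come from the environment, and note the ETag is only an MD5 of the content for simple, non-multipart uploads, so this is an illustration rather than a drop-in fixity source):

```python
import boto3

s3 = boto3.client("s3")  # region and credentials assumed from the environment

def remote_digest(bucket, key):
    """Fetch S3's own metadata for an object without transferring its body."""
    head = s3.head_object(Bucket=bucket, Key=key)
    # ETag is an MD5 hex digest only for simple (non-multipart) uploads.
    return head["ETag"].strip('"'), head["ContentLength"]

# etag, size = remote_digest("my-preservation-bucket", "binaries/foo.tif")
```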
@DiegoPino yes, that's what I meant. Sorry if that was not clear. |
Sorry, my fault! Re-reading and totally agree, sorry again @mjordan |
I'd like to see a way here to have a trust but verify approach, where you can pull checksums from an external API, but maybe at some lower frequency you still want to pay for the bandwidth to do some verification of the checksums. Could just be some configuration options. |
@jonathangreen I agree it could be useful to provide a preservation platform that is more compliant with what is expected, but in terms of implementation, how would you propose we avoid false positives of corruption caused by timed-out, stalled, or even failed downloads? Not something that keeps me awake at night right now, but http(s), which is what most APIs provide for downloading assets, tends to be hit and miss in that respect. As I said, I agree this is needed, I just don't know how to deal with it at an implementation level in a safe and reliable way. |
One approach to handling false mismatches is to retry the request if it fails, and see what the results are. A one-off failure can be discarded, but if they all fail, the problem is probably legit. |
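A small sketch of that retry-before-reporting behaviour (the check function, attempt count, and delay are placeholders):

```python
import time

def check_with_retries(run_fixity_check, attempts=3, delay_seconds=30):
    """Only report a failure if every attempt fails; a single bad download
    or timeout is treated as transient noise rather than corruption."""
    for attempt in range(attempts):
        if run_fixity_check():          # returns True when the digests match
            return "passed"
        time.sleep(delay_seconds)       # give a flaky network a moment to recover
    return "failed"
```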
Not to increase scope here too much, but keep the use and edge cases coming. I'll be working on this (and related preservation stuff in CLAW) pretty much full time this fall. |
This is starting to sound like an application on top of something like iRODS. I'm not seriously suggesting that; I'm wondering whether, for the MVP, it would be enough to have a simple µservice that just retrieves a checksum from the backend in use, on the assumption that such a checksum is available? I'm not at all trying to discourage people from recording use cases, and I think it's awesome that @mjordan is thinking through this; it's just that when you're facing problems like the reliability of transporting mass data across networks of unknown character... that's a pretty big scope. |
@ajs6f MVP is a good way to frame it. I don't have the cycles to propose one this week but need to prepare a poster for iPres so will need to do that soon (next couple weeks?). We can build it in a way that can expand on the MVP. |
Maybe this could be one consideration in a repo manager's choice of storage solution. Since now we have all these options, we're going to need to make educated decisions on which one to use. Just because something's cool, doesn't mean it's the right tool for your job, and if you need reliable, regular, automated, locally-performed checksumming (and maybe that's a preservation best practice?) then S3 might not be the ideal storage location for you? |
What if it was stored as a standard Islandora datastream, like any other derivative, with each outcome appended to the same versioned file? |
@bradspry so each binary resource would have an accompanying text/JSON/XML or other resource containing its validation history... Doing that would avoid having triples for each event and also store the fixity checking history in Fedora. Definitely worth exploring. |
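As a sketch of the appended-sidecar idea, each verification could add one line to a CSV kept alongside (and eventually versioned with) the binary; the file layout and column names here are just illustrative:

```python
import csv
from datetime import datetime, timezone

def append_outcome(history_csv_path, algorithm, digest, outcome):
    """Append one verification event to the resource's fixity history file."""
    with open(history_csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),  # when the check ran
            algorithm,                                # e.g. 'SHA-1'
            digest,                                   # value computed this time
            outcome,                                  # 'passed' or 'failed'
        ])

# append_outcome("foo_fixity_history.csv", "SHA-1", "2fd4e1c6...", "passed")
```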
That can be a relatively lean pattern, but I would argue that it shouldn't be a default because it creates new resources in the repository. For sites like SI which won't be relying on this layer of technology for fixity functions, that's a huge number of empty useless resources. Maybe as an optional function in the proposed µservice? |
I did not read this whole thread, but there is a schema for putting technical metadata into RDF: https://spdx.org/rdf/spdx-terms-v2.1/ Are there x-paths to the technical metadata datastream elements that we can use to map to these terms? |
@rtilla1, that spec looks very useful, but it appears to be specific to describing software packages. That said, there's no reason we couldn't use some of its properties for all types of content. The plan so far for fixity verification event data is to provide options for storing the data in several ways, e.g., in the Fedora repository as entities, in a relational database, in a CSV binary resource associated with the primary entity as @bradspry is suggesting, etc. (we'll probably start off with a relational database managed by an external microservice). One advantage of storing the data in a db or CSV file is that it can be converted to a specific schema on demand. But the same is true even if it is stored as RDF. For example, if a consumer of fixity data wanted PREMIS, we should be able to provide it to them in that vocabulary; the same goes for SPDX. So I don't think we're required to choose a specific ontology for fixity data right now. |
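To illustrate the "convert on demand" point, here is a rough mapping from one stored event row to a PREMIS-flavoured structure; the field names are indicative only and not validated against the PREMIS schema:

```python
def row_to_premis_event(resource_id, checked_at, algorithm, digest, outcome):
    """Rough, schema-unvalidated mapping of one fixity event row to
    PREMIS-style event fields; adjust to the real PREMIS serialization."""
    return {
        "eventType": "fixity check",
        "eventDateTime": checked_at,
        "eventDetail": f"{algorithm} digest computed and compared to stored value",
        "eventOutcome": outcome,              # 'passed' or 'failed'
        "linkingObjectIdentifier": resource_id,
        "computedDigest": digest,
    }
```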
I think my implementation of a fixity checking microservice is ready to get some additional eyes on: https://github.com/mjordan/riprap This is only a Minimum Viable Product, but does take the following requirements from this issue into account:
I've already started some issues... but it's fairly far along. I'd love some feedback on the direction Riprap is taking. I'll also probably need some help getting Symfony's ActiveMQ listener working at some point... |
I was just thinking about you guys @mjordan @dannylamb while doing some preservation action related work. Have you all seen this? |
Interesting, but it doesn't offer the features we want. From what I can tell it only logs digests and doesn't compare logged results with new fixity checks. It also doesn't support fetching a digest via HTTP, which is what the Fedora specification requires. |
I would like to start work on fixity auditing (checksum verification) in CLAW. In 7.x, we have the Checksum and Checksum Checker modules, plus the PREMIS module, which serializes the results of Checksum Checker into PREMIS XML and HTML. Now is a good time to start thinking about how we will carry this functionality over to CLAW so that on migration, we can move PREMIS event data from the source 7.x to CLAW.
In 7.x, we rely on FCREPO 3.x's ability to verify a checksum. In a Drupal or server-side cron job, Checksum Checker issues a query to each datastream's validateChecksum REST endpoint (/objects/{pid}/datastreams/{dsID} ? [asOfDateTime] [format] [validateChecksum]) and we store this fixity event outcome in the object's AUDIT datastream. The Fedora API Specification, on the other hand, does not require validation of a binary resource's fixity but instead requires implementations to return a binary resource's checksum to the requesting client, allowing the checksum value to "be used to infer persistence fixity by comparing it to previously-computed values." Therefore, in CLAW, to perform fixity validation, we need to store the previously-computed value ourselves. In order to ensure long-term durability and portability of the data, we should avoid managing it using implementation-specific features. Two general options for storing fixity auditing event data that should apply to all implementations of the Fedora spec are:
Fixity event data can accumulate rapidly. The 7.x Checksum Checker module's README documents the effects of adding event outcomes to an object's AUDIT datastream, but in general, each fixity verification event on a binary resource will generate one outcome, which includes the timestamp of the event and a binary value (passed/failed). For example, in a repository that contains 100,000 binary resources, each verification cycle will generate 100,000 new outcomes that need to be persisted somewhere. In our largest Islandora instance, which contains over 600,000 newspaper page objects, we have completed 14 full fixity verification cycles, resulting in approximately 8,400,000 outcome entries.
I would like to know what people think are the pros and cons of storing this data within the repository as RDF versus external to the repository in the triplestore or a database. My initial take on this question is:
One possible mitigation against the loss of an RDBMS is to periodically dump the data as a text file and persist it into the repository; that way, if the database is lost, it can be recovered easily. The same strategy could be applied to data stored in the triplestore.
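A hedged sketch of that mitigation, dumping the database on a schedule and PUTting the dump into the repository as an ordinary binary (the paths, database name, and repository URL are all assumptions):

```python
import subprocess
import requests

DUMP_PATH = "/tmp/fixity_events.sql"
REPO_URL = "http://localhost:8080/fcrepo/rest/preservation/fixity_events_dump"

# 1. Dump the fixity database (here Postgres) to a plain-text file.
subprocess.run(["pg_dump", "--dbname=fixity", "--file=" + DUMP_PATH], check=True)

# 2. Persist the dump into the repository as a binary resource.
with open(DUMP_PATH, "rb") as f:
    resp = requests.put(
        REPO_URL,
        data=f,
        headers={"Content-Type": "text/plain"},
        auth=("fedoraAdmin", "fedoraAdmin"),
    )
resp.raise_for_status()
```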
If we can come to consensus on where we should store this data, we can then move on to migration of fixity event data, implementing periodic validation ("checking"), serialization, etc.