
include OAI harvest details in Job details #374

Closed
ghukill opened this issue Jan 10, 2019 · 4 comments

@ghukill
Contributor

ghukill commented Jan 10, 2019

Per a recommendation, include some statistics / insight into an OAI harvest. Specifically -- low hanging fruit -- what sets were harvested, and distribution of records across sets.

ghukill changed the title from "OAI harvest details" to "include OAI harvest details in Job details" on Jan 10, 2019
antmoth self-assigned this on Jun 19, 2019
@antmoth
Collaborator

antmoth commented Jun 20, 2019

So, I made it do this:
[screenshot: OAI set counts displayed in the Job details page]

My main concern is that the code I'm using to do this seems inefficient in a way that may or may not matter, and, if it does matter, may or may not be improvable. See #422
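
For illustration, the per-record approach being described is roughly this shape (hypothetical sketch, not the actual code from the commit; Job.get_records() and the oai_set field are the same ones used in the ORM example later in this thread) -- every record gets pulled out of Mongo just to tally one field:

from collections import Counter

def oai_set_counts_by_loop(job):
    # tally records per OAI set by iterating the full record QuerySet
    counts = Counter()
    for record in job.get_records():
        counts[record.oai_set] += 1
    return dict(counts)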

@ghukill
Contributor Author

ghukill commented Jun 21, 2019

Oooooo, this is awesome! I think this is exactly what some people have been asking for.

I do think you're right, though: as it stands now, looping through all the records may ultimately be inefficient for large Jobs. Thankfully, I think we could lean on the Django/Mongo ORM to pull these counts pretty quickly.

Don't have a one-liner handy, but something like the following might work:

# get Job's records as QuerySet (MongoEngine, but very similar to native Django SQL ORM)
job_records = Job.objects.get(pk=224).get_records()

# get OAI sets from Job records
job_records.values_list('oai_set').distinct('oai_set')
Out[21]: 
['wayne:collectioncfai',
 'wayne:collectionhermanmiller',
 'wayne:collectionrencen',
 'wayne:collectionmim']

# count records within a single set
job_records.filter(oai_set='wayne:collectioncfai').count()
Out[22]: 2292

Then, to avoid recalculating these each time a Job is loaded, one option may be to store them in a Job's job_details, which is a JSON object that already stores exactly these kinds of things (field mapping metrics, etc.). They could be written on Job finish, or the first time the Job is loaded.
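
As a rough, untested sketch of that (the update_job_details() call here is just a stand-in for however job_details actually gets persisted):

def cache_oai_set_counts(job):
    # compute per-set counts via the ORM once, then stash them in job_details
    records = job.get_records()
    counts = {
        oai_set: records.filter(oai_set=oai_set).count()
        for oai_set in records.distinct('oai_set')
    }
    job.update_job_details({'oai_set_counts': counts})  # hypothetical persistence helper
    return counts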

But it's looking awesome. Happy to keep spitballing, but in short, I think leaning on the ORM might be a good option. And if that ends up being costly for huge Jobs, we could count with Spark and write to job_details that way.
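
That Spark fallback might look something like this (very rough sketch, assuming the Job's records are available as a DataFrame with an oai_set column):

from pyspark.sql import DataFrame

def oai_set_counts_spark(records_df: DataFrame) -> dict:
    # group the Job's records by oai_set and collect the counts back to the driver
    return {
        row['oai_set']: row['count']
        for row in records_df.groupBy('oai_set').count().collect()
    }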

@antmoth
Collaborator

antmoth commented Jun 21, 2019

It turns out that mongo totally has a function to do this: item_frequencies. New commit pushed!
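
For reference, using the same Job/record names as the earlier example, that looks roughly like this (item_frequencies() is the MongoEngine QuerySet aggregation that maps each distinct field value to its count):

job_records = Job.objects.get(pk=224).get_records()

# dict mapping each distinct oai_set value to its record count
set_counts = job_records.item_frequencies('oai_set')
# e.g. {'wayne:collectioncfai': 2292, ...}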

@ghukill
Contributor Author

ghukill commented Jun 21, 2019

Brilliant! 😎
