Replies: 13 comments 6 replies
-
One of the ideas mentioned during the IOOS code sprint was integrating the log processing (including anonymization and aggregation to avoid exposing PII of users) into ERDDAP itself and then surfacing the data as an ERDDAP dataset. Would that cover what you're looking for, or do you need a full-featured analytics dashboard? @callumrollo led that code sprint topic and may have additional thoughts. |
-
I don't think we need to know any PII about the users. The idea is to know which dataset was downloaded, the number of downloads of that dataset, and possibly the region of the download (USA, possibly with state-level info). The temporal resolution of the downloads would also help (e.g. to see whether socializing a dataset affects the number of downloads).
|
-
@flbahr @ChrisJohnNOAA @callumrollo Hi Fred: Getting dataset usage would be doable, but a lot of the rest that you ask for is close to impossible. Look at your Apache logs sometime. If they are like ours, the bulk of the requests come from things like AWS, Google Cloud, Digital Ocean, etc., so based on the IP (about the only thing you have to go on) you will not be able to refine the location further. Or, given NWave, most NOAA requests are seen as coming from NOAA-West1 in Boulder. Also, I question whether this is best done inside ERDDAP or by code that can be run outside of ERDDAP anytime and anywhere, but I am open to discussion on this. For the nonce, you might talk to Dale Robinson, who among other things is on CenCOOS DMAC.
Also, a point I have made before: have you looked at the information sent daily in the email Daily Report? Here is a sample from ours from a few days ago, right after a restart:
files browse DatasetID (since last daily report)
files browse DatasetID (since startup)
griddap DatasetID (since last daily report)
and so on for Info and tabledap and the like. Is this the type of information you are after? My guess is that, if wanted, this could be expanded to list all datasets, even those accessed just once, since I assume ERDDAP is already storing the information and it is a matter of what gets written out. |
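For anyone who wants to try the "look at your Apache logs" route for plain dataset usage, here is a minimal sketch. It assumes the web server writes the common/combined log format and that requests follow the usual ERDDAP URL layout (`/erddap/griddap/<datasetID>.<ext>`, `/erddap/tabledap/<datasetID>.<ext>`, and so on). The sample log lines, IP addresses, and the `exampleDataset` ID are made up for illustration, and `count_dataset_requests` is a hypothetical helper, not part of ERDDAP or erddaplogs.

```python
import re
from collections import Counter

# Pulls the ERDDAP protocol and dataset ID out of the quoted request string
# of an Apache common/combined-format log line (assumed layout).
REQUEST_RE = re.compile(
    r'"(?:GET|POST|HEAD) /erddap/(?P<protocol>griddap|tabledap|files|info)/'
    r'(?P<dataset_id>[A-Za-z0-9_]+)'
)

# Landing pages that share the same URL prefix but are not datasets.
NOT_DATASETS = {"index", "documentation"}


def count_dataset_requests(lines):
    """Count requests per (protocol, dataset ID); unmatched lines are skipped."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match["dataset_id"] not in NOT_DATASETS:
            counts[(match["protocol"], match["dataset_id"])] += 1
    return counts


# Made-up log lines, using documentation-reserved IP ranges.
sample_lines = [
    '203.0.113.7 - - [24/Jun/2024:11:37:01 +0000] "GET /erddap/griddap/jplMURSST41.nc?analysed_sst HTTP/1.1" 200 5120 "-" "curl/8.0"',
    '198.51.100.9 - - [24/Jun/2024:11:38:10 +0000] "GET /erddap/tabledap/exampleDataset.csv HTTP/1.1" 200 900 "-" "curl/8.0"',
    '203.0.113.7 - - [24/Jun/2024:11:39:00 +0000] "GET /erddap/griddap/jplMURSST41.json HTTP/1.1" 200 300 "-" "curl/8.0"',
    '198.51.100.9 - - [24/Jun/2024:11:40:00 +0000] "GET /erddap/griddap/index.html HTTP/1.1" 200 100 "-" "curl/8.0"',
]

counts = count_dataset_requests(sample_lines)
for (protocol, dataset_id), n in counts.most_common():
    print(protocol, dataset_id, n)
```

This only counts requests; it does nothing about geolocation, which runs into the cloud-provider and NWave problem described above.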
-
Roy,
Thanks for the info. Since Axiom does our ERDDAP for most data sets, I don't see these daily e-mail reports and didn't know they existed.
Not sure if Axiom is aware of these. The tracking of general location was just a thought and not high on the list. Given the number of datasets Axiom is serving I think we might have to filter these reports to get down to what we are primarily interested in.
Anyway, thanks again for the info and suggestions.
Cheers,
Fred
On Monday, June 24, 2024 12:31:05 PM, Roy Mendelssohn wrote:
@flbahr @ChrisJohnNOAA @callumrollo
Hi Fred:
Getting dataset usage would be doable - but a lot of the rest that you ask for is becoming nigh near impossible. Look at your Apache logs sometime. If they are like ours, a bulk of the requests are coming from things like AWS, Google Cloud, Digital Ocean etc etc, so based on the IP (about the only thing you have to go on), you will not be able to refine the location further. Or given NWave, most NOAA requests are seen as coming from NOAA-West1 in Boulder. Also I question whether this is best done inside ERDDAP or by code that can be run outside of ERDDAP any time and any where, but I am open to discussion on this. For the nonce, you might talk to Dale Robinson, who of course among other things is on CenCOOS DMAC.
Also a point I have made before - have you looked at the information sent daily in the email Daily report? Here is a sample from ours from a few days ago right after a restart:
files browse DatasetID (since last daily report)
jplMURSST41: 5 (1%)
erdMurFront41USWest: 4 (1%)
erdNavgem05DPres_LonPM180: 3 (1%)
ncdcOisst21Agg: 3 (1%)
erdSW1chla1day_Lon0360: 2 (0%)
erdVHNchlamday: 2 (0%)
.
.
files browse DatasetID (since startup)
jplMURSST41: 5 (1%)
erdMurFront41USWest: 4 (1%)
erdNavgem05DPres_LonPM180: 3 (1%)
ncdcOisst21Agg: 3 (1%)
erdSW1chla1day_Lon0360: 2 (0%)
.
.
griddap DatasetID (since last daily report)
jplMURSST41: 8 (10%)
erdEtopoSeafloorGradient: 6 (7%)
erdSrtm30plusSeafloorGradient: 5 (6%)
chirps20GlobalDailyP05: 4 (5%)
and so on for Info and tabledap and the like. Is this the type of information you are after? My guess is that if wanted that this can be expanded to list all datasets even if accessed just once since I assume ERDDAP is already storing the information and it is a matter of what gets written out.
|
-
@flbahr Axiom should be able to set it up so that the email reports are sent to you (there are options in setup.xml as to who gets what emails). And it would not be hard to write a little script to parse the emails, either as a shell script or in Python. As I said, talk to Dale, but if you don't have access to the logs he may not be much help. One reason I hesitate on this is that Chris is only working part-time on ERDDAP and there are a lot of things that need to be done, so one of our first questions about any request is how central it is to the service and whether other ways to get the information already exist. Would that we had staff to do everything, but we don't, so we have to set priorities and stick to them. |
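A rough sketch of such a parsing script in Python follows. It assumes the report sections look like the excerpt quoted earlier in this thread: a header line such as `griddap DatasetID (since last daily report)` followed by `datasetID: count (percent%)` lines. The rest of the Daily Report may be laid out differently, and `parse_daily_report` is a hypothetical helper name.

```python
import re

# Section header, e.g. "files browse DatasetID (since startup)".
HEADER_RE = re.compile(r"^(?P<kind>.+?) DatasetID \((?P<window>since [^)]+)\)$")
# Entry line, e.g. "jplMURSST41: 5 (1%)".
ENTRY_RE = re.compile(r"^(?P<dataset_id>[A-Za-z0-9_]+): (?P<count>\d+) \(\d+%\)$")


def parse_daily_report(text):
    """Return {(kind, window): {dataset_id: count}} for each usage section."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = sections.setdefault((header["kind"], header["window"]), {})
            continue
        entry = ENTRY_RE.match(line)
        if entry and current is not None:
            current[entry["dataset_id"]] = int(entry["count"])
        elif line not in ("", "."):
            # Any other text ends the current section.
            current = None
    return sections


# Excerpt copied from the sample report quoted in this thread.
report_text = """\
files browse DatasetID (since last daily report)
jplMURSST41: 5 (1%)
erdMurFront41USWest: 4 (1%)
ncdcOisst21Agg: 3 (1%)
.
.
griddap DatasetID (since last daily report)
jplMURSST41: 8 (10%)
erdEtopoSeafloorGradient: 6 (7%)
chirps20GlobalDailyP05: 4 (5%)
"""

sections = parse_daily_report(report_text)
for (kind, window), datasets in sections.items():
    print(kind, "|", window, "|", datasets)
```

With many datasets on one server, the resulting dictionaries can be filtered down to the dataset IDs of interest before anything is stored or plotted.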
-
Hi Fred, As Chris mentioned, I led a topic at the IOOS code sprint to analyse the web server logs of ERDDAP, with contributions from Chris and others. It's available as a package on PyPI, and you can find the repo here: https://github.com/callumrollo/erddaplogs The notebook weblogs-parse-demo.ipynb shows the functionality, and it ships with some anonymized example logs if you want to try it out. Currently the features are:
I hope to return to this project next month to complete the integration process that Chris mentioned, so that the anonymized results of this analysis can be made available on ERDDAP itself, easing integration and usage. |
-
@rmendels This has come up enough times now that I don't believe the currently available stats are covering users' (admins in this case) needs. Additionally, I'm lacking data that would be helpful in prioritizing tasks (for example, how widely the translations are used). My theory is that making processed stats available as an ERDDAP dataset (with the option for an admin to disable it) will be helpful for admins and for me. If @callumrollo contributes the integration, it will hopefully take minimal time on my part. |
-
It seems like a low-hanging fruit here would be to make the dataset access information that ERDDAP already tracks and sends via email available on demand and in a more useful, accessible format. If this information were available on demand in a standard format like JSON, it would be much easier to analyze further. Also, I'm not sure how temporally granular the access tracking is in the ERDDAP internals, but we should consider how this data could be rolled up through time. In other words, would it be possible to ask ERDDAP about dataset usage metrics between two specified dates? Or would metrics for the past month have to be stitched together from many separate "since startup" or "since last daily report" windows? In the latter case, it would be crucial to include ISO 8601 timestamps of the last restart and the last daily report. One last thought on adding the results of access log parsing as an ERDDAP dataset: it seems unusual for a service to serve its real-time usage data to the public. I'm not necessarily advocating against it, but there may be security or privacy concerns that we should think through. |
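To make the "stitching" case concrete, here is a small sketch of rolling up per-window counts between two dates. Everything about the JSON layout is an assumption for illustration: ERDDAP does not currently emit this structure, and the field names `window_start`, `window_end`, and `counts`, the numbers, and the `roll_up` helper are hypothetical. It does show why each window needs explicit ISO 8601 bounds.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical on-demand output: one object per "since last daily report" window.
windows_json = """[
  {"window_start": "2024-06-21T00:00:00+00:00", "window_end": "2024-06-22T00:00:00+00:00",
   "counts": {"jplMURSST41": 8, "ncdcOisst21Agg": 3}},
  {"window_start": "2024-06-22T00:00:00+00:00", "window_end": "2024-06-23T00:00:00+00:00",
   "counts": {"jplMURSST41": 5}},
  {"window_start": "2024-06-23T00:00:00+00:00", "window_end": "2024-06-24T00:00:00+00:00",
   "counts": {"jplMURSST41": 2, "ncdcOisst21Agg": 1}}
]"""


def roll_up(windows, start, end):
    """Sum counts over the windows that lie entirely inside [start, end]."""
    total = Counter()
    for window in windows:
        w_start = datetime.fromisoformat(window["window_start"])
        w_end = datetime.fromisoformat(window["window_end"])
        if w_start >= start and w_end <= end:
            total.update(window["counts"])
    return total


windows = json.loads(windows_json)
total = roll_up(
    windows,
    datetime.fromisoformat("2024-06-22T00:00:00+00:00"),
    datetime.fromisoformat("2024-06-24T00:00:00+00:00"),
)
print(dict(total))
```

Windows that only partly overlap the requested range are dropped here. A real implementation would have to decide whether to drop, include, or prorate them, which is one more reason finer-grained tracking inside ERDDAP would be easier to work with.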
-
@ChrisPJohn Email reports are written to disk each day (not saying that we should not do the other stuff). Agree with Shane about possible privacy issues if anything besides dataset usage is made into a dataset. Been lazy, haven't tried Callum's script yet. |
-
@ChrisJohnNOAA @rmendels @flbahr I think having it as a public dataset is an interesting case. Would it also report on itself? And would there be a version that isn't as anonymized for admins? |
-
I've added more comprehensive installation instructions to the repo, including XML files for integration into ERDDAP. It should be much easier to integrate now. We have a live deployment on our ERDDAP, updated by a daily cron job, if you want to see what the public stats look like: https://erddap.observations.voiceoftheocean.org/erddap/tabledap/requests.html There are also a couple of example notebooks in the repo that explore both the processed log files and the anonymised data that are published on ERDDAP: https://github.com/callumrollo/erddaplogs/tree/main/notebooks |
-
Yes, #118 suggested changes to get the information from within ERDDAP. We thought that, as this information is already being captured, why not get ERDDAP to output it in a format that is simple to ingest into other systems? That output can then power any analytics dashboard a user might have (we currently have an InfluxDB for that, and we are also looking at Grafana). #117 used a JSONL format for logging that user information. I believe that is the format used by the EDDTableFromJsonlCSVFiles class, which would allow the logs to be used directly as a feed for a metrics dataset. In parallel, we built a system that works on the inbound request information: we set up a goreplay instance to mirror traffic to our ingestor, which reduced the reliance on logs and gave us real-time information as well. This is what currently powers our metrics. Totally agree this should be something that is easily turned off. |
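As an illustration of how line-oriented logs like that could feed a metrics view, here is a minimal sketch that aggregates requests per day and dataset. It assumes the JSON Lines CSV layout as I understand it (first line a JSON array of column names, each following line a JSON array of values). The column names `time`, `protocol`, and `dataset_id`, the sample rows, and both helper functions are hypothetical and are not taken from #117.

```python
import json
from collections import Counter

# Made-up log content in an assumed JSON Lines CSV layout.
log_text = """\
["time", "protocol", "dataset_id"]
["2024-06-24T11:37:01Z", "griddap", "jplMURSST41"]
["2024-06-24T12:31:05Z", "griddap", "jplMURSST41"]
["2024-06-25T08:00:00Z", "tabledap", "exampleDataset"]
"""


def read_jsonl_csv(text):
    """Turn header-plus-rows JSON Lines text into a list of dicts."""
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def daily_counts(records):
    """Count requests per (UTC date, dataset ID)."""
    counts = Counter()
    for record in records:
        day = record["time"][:10]  # the "YYYY-MM-DD" part of the timestamp
        counts[(day, record["dataset_id"])] += 1
    return counts


records = read_jsonl_csv(log_text)
per_day = daily_counts(records)
for (day, dataset_id), n in sorted(per_day.items()):
    print(day, dataset_id, n)
```

Aggregating to one row per day and dataset before publishing also drops per-request detail, which goes some way toward the privacy concerns raised earlier in the thread.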
-
@callumrollo @thogar-computer @ChrisJohnNOAA Whatever the final resolution of this many thanks to all who have contributed. Much appreciated. |
-
Something that we've been asked for time and time again is a way to track how many times particular datasets have been downloaded. Since Google moved to the new GA4 architecture, there is no simple way to track these downloads. I know that the IOOS code sprint was working on code to get at this through the logs, but more direct access to these stats would be useful. This would help the RAs and IOOS show how often data are accessed and how important the datasets are.