Data access tracking #162

flbahr · 2024-06-24T16:24:41Z

flbahr
Jun 24, 2024

Something we've been asked for time and time again is a way to track how many times particular datasets have been downloaded. Since Google moved to the new GA4 architecture, there is no simple way to track these downloads. I know that the IOOS code sprint was working on code to get at this through the logs, but more direct access to these stats would be useful. This would help the RAs and IOOS show how often data are accessed and how important the datasets are.

ChrisJohnNOAA · 2024-06-24T18:36:40Z

ChrisJohnNOAA
Jun 24, 2024
Maintainer

One of the ideas mentioned during the IOOS code sprint was integrating the log processing (including anonymization and aggregation to avoid exposing PII of users) into ERDDAP itself and then surfacing the data as an ERDDAP dataset. Would that cover what you're looking for, or do you need a full-featured analytics dashboard?

@callumrollo led that code sprint topic and may have additional thoughts.

flbahr · 2024-06-24T18:52:31Z

flbahr
Jun 24, 2024
Author

I don't think we need to know about the PII of the users. I think the idea is to know which dataset was downloaded, the number of downloads of that dataset, and possibly the region of the download (USA, possibly with state-level info). The temporal resolution of the downloads would also be useful (e.g. to see whether socializing the dataset impacts the number of downloads).

rmendels · 2024-06-24T19:30:43Z

rmendels
Jun 24, 2024

@flbahr @ChrisJohnNOAA @callumrollo

Hi Fred:

Getting dataset usage would be doable, but a lot of the rest that you ask for is becoming nigh on impossible. Look at your Apache logs sometime. If they are like ours, the bulk of the requests come from things like AWS, Google Cloud, DigitalOcean, etc., so based on the IP (about the only thing you have to go on), you will not be able to refine the location further. Or, given NWave, most NOAA requests are seen as coming from NOAA-West1 in Boulder. Also, I question whether this is best done inside ERDDAP or by code that can be run outside of ERDDAP anytime and anywhere, but I am open to discussion on this. For the nonce, you might talk to Dale Robinson, who of course, among other things, is on CeNCOOS DMAC.

Also, a point I have made before: have you looked at the information sent daily in the Daily Report email? Here is a sample from ours from a few days ago, right after a restart:

files browse DatasetID (since last daily report)
jplMURSST41: 5 (1%)
erdMurFront41USWest: 4 (1%)
erdNavgem05DPres_LonPM180: 3 (1%)
ncdcOisst21Agg: 3 (1%)
erdSW1chla1day_Lon0360: 2 (0%)
erdVHNchlamday: 2 (0%)
.
.

files browse DatasetID (since startup)
jplMURSST41: 5 (1%)
erdMurFront41USWest: 4 (1%)
erdNavgem05DPres_LonPM180: 3 (1%)
ncdcOisst21Agg: 3 (1%)
erdSW1chla1day_Lon0360: 2 (0%)
.
.

griddap DatasetID (since last daily report)
jplMURSST41: 8 (10%)
erdEtopoSeafloorGradient: 6 (7%)
erdSrtm30plusSeafloorGradient: 5 (6%)
chirps20GlobalDailyP05: 4 (5%)

and so on for Info, tabledap, and the like. Is this the type of information you are after? My guess is that, if wanted, this could be expanded to list all datasets, even those accessed just once, since I assume ERDDAP is already storing the information and it is a matter of what gets written out.

flbahr · 2024-06-24T19:41:59Z

flbahr
Jun 24, 2024
Author

Roy, thanks for the info. Since Axiom runs our ERDDAP for most datasets, I don't see these daily email reports and didn't know they existed. I'm not sure whether Axiom is aware of them. Tracking general location was just a thought and not high on the list. Given the number of datasets Axiom is serving, I think we might have to filter these reports to get down to what we are primarily interested in. Anyway, thanks again for the info and suggestions. Cheers, Fred

rmendels · 2024-06-24T19:53:36Z

rmendels
Jun 24, 2024

@flbahr Axiom should be able to set it up so that the email reports are sent to you (there are options in setup.xml as to who gets what emails). And it would not be hard to write a little script to parse the emails, either as a shell script or in Python. As I said, talk to Dale, though if you don't have access to the logs he may not be much help. One reason I hesitate on this is that Chris is only working part-time on ERDDAP and there are a lot of things that need to be done, so one of our first questions about any request is how central it is to the service and/or whether other ways to get the information already exist. Would that we had staff to do everything, but we don't, so we have to set priorities and stick to them.
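For what it's worth, here is a minimal Python sketch of such a parsing script. It assumes the report body looks exactly like the sample quoted earlier in this thread (a `<kind> DatasetID (since ...)` header followed by `datasetID: count (percent%)` lines); the real email may be laid out differently, and the function name `parse_daily_report` is made up for illustration.

```python
import re
from collections import defaultdict

# Section headers such as "griddap DatasetID (since last daily report)".
HEADER = re.compile(r"^(?P<kind>.+?) DatasetID \((?P<window>since [^)]+)\)$")
# Count lines such as "jplMURSST41: 8 (10%)".
COUNT = re.compile(r"^(?P<dataset>[\w.-]+): (?P<count>\d+) \(\d+%\)$")


def parse_daily_report(text):
    """Return {(kind, window): {datasetID: count}} for each section found."""
    sections = defaultdict(dict)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = (header["kind"], header["window"])
            continue
        count = COUNT.match(line)
        if count and current is not None:
            sections[current][count["dataset"]] = int(count["count"])
        elif not count and line and line != ".":
            # Any other non-blank text ends the current section.
            current = None
    return dict(sections)
```

Calling `parse_daily_report(email_body)` gives one dictionary per section, which can then be filtered down to the handful of dataset IDs of interest.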

callumrollo · 2024-06-25T07:24:43Z

callumrollo
Jun 25, 2024

Hi Fred,

As Chris mentioned, I led a topic at the IOOS code sprint to analyse the web server logs of ERDDAP, with contributions from Chris and others. The result is available as a package on PyPI; you can find the repo here: https://github.com/callumrollo/erddaplogs

The notebook weblogs-parse-demo.ipynb shows the functionality and it ships with some anonymized example logs if you want to try it out. Currently the features are:

read requests from apache/nginx logs
filter out webcrawlers/spam requests (as Roy mentioned, these servers get a lot of spam/bots)
do an ip lookup to geolocate users
create plots of requests over time, most popular datasets, type of dataset requests (griddap vs tabledap vs files), a map showing where most requests come from etc.
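To give a feel for the first two steps in the list above, here is a standard-library sketch. It is not the erddaplogs code or API; it assumes Apache "combined" log lines and the usual `/erddap/<protocol>/<datasetID>` URL layout, and `BOT_HINTS` is a deliberately crude, hypothetical bot filter.

```python
import re
from collections import Counter

# Apache "combined" log format; the referer/user-agent tail is optional.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)
# /erddap/<protocol>/<datasetID> followed by ".", "/" or "?".
ERDDAP_PATH = re.compile(
    r"^/erddap/(?P<protocol>griddap|tabledap|files)/(?P<dataset>[^./?]+)"
)
BOT_HINTS = ("bot", "crawler", "spider")  # hypothetical, far from complete
NOT_DATASETS = {"index", "documentation"}  # listing/help pages, not datasets


def count_requests(lines):
    """Count successful, non-bot requests per dataset and per protocol."""
    by_dataset, by_protocol = Counter(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m["status"] != "200":
            continue
        agent = (m["agent"] or "").lower()
        if any(hint in agent for hint in BOT_HINTS):
            continue
        p = ERDDAP_PATH.match(m["path"])
        if p and p["dataset"] not in NOT_DATASETS:
            by_dataset[p["dataset"]] += 1
            by_protocol[p["protocol"]] += 1
    return by_dataset, by_protocol
```

Passing an open log file to `count_requests` returns two `Counter` objects, so `by_dataset.most_common(10)` gives a "most popular datasets" table.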

I hope to return to this project next month to complete the integration process that Chris mentioned, so that the anonymized results of this analysis can be made available on ERDDAP itself, easing integration and usage.

1 reply

thogar-computer Jul 1, 2024

@SarahSidders @roje-bodc might be worth looking at this for our usage too

ChrisJohnNOAA · 2024-06-25T15:08:40Z

ChrisJohnNOAA
Jun 25, 2024
Maintainer

@rmendels This has come up enough times now that I don't believe the currently available stats are covering users' (admins in this case) needs. Additionally, I'm lacking data that would be helpful in prioritizing tasks (for example how widely used are the translations). My theory is making processed stats available as an ERDDAP dataset (with the option for an admin to disable) will be helpful for admins and myself. If @callumrollo contributes the integration, it will hopefully take minimal time on my part.

srstsavage · 2024-06-26T05:21:02Z

srstsavage
Jun 26, 2024

It seems like the low-hanging fruit here would be to make the dataset access information tracked by ERDDAP and sent via email available on demand and in a more useful/accessible format. If this information were available on demand in a standard format like JSON, it would be much easier to analyze further.

Also, I'm not sure how temporally granular the access tracking is in the ERDDAP internals, but we should consider how this data could be rolled up through time. In other words, would it be possible to ask ERDDAP about dataset usage metrics between two specified dates? Or would metrics for the past month have to be stitched together from many separate "since startup" or "since last daily report" windows? In the latter case, it would be crucial to include ISO 8601 timestamps of the last restart and last daily report.

One last thought on adding the results of access log parsing as an ERDDAP dataset: it seems unusual for a service to serve its real time usage data to the public. I'm not necessarily advocating against it, but there may be security or privacy concerns that we should think through.

1 reply

ChrisJohnNOAA Jun 26, 2024
Maintainer

I believe that, given the way the daily email stats are tracked, several of the requests above would not be possible with them (but are already available with Callum's log-based approach), in particular the timing and location of requests (though, due to anonymization/aggregation, not together). The log processing scripts also export CSV files (currently 2 per day) to make the data easy to analyze further (or turn into a dataset).

The current daily email stats are also only stored in memory and so are lost on reboot. The daily counts are also cleared each day, so past days are not retained. To have metrics for the past month or to compare dates, we'd need to write the reports to disk and then read and aggregate them. Additionally, we'd likely want to implement storing the current stats so they aren't lost if you need to restart.
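To make the "write to disk, then read and aggregate" idea concrete, here is a rough Python sketch (ERDDAP itself is Java, so this only illustrates the shape of the data, not a proposed implementation). The record layout, the field names `window_start`, `window_end` and `counts`, and the JSON Lines file are all assumptions for illustration.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def append_snapshot(path, counts, window_start, window_end):
    """Append one reporting window, with ISO 8601 bounds, to a JSON Lines file."""
    record = {
        "window_start": window_start.isoformat(),
        "window_end": window_end.isoformat(),
        "counts": dict(counts),
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def aggregate(path, start, end):
    """Sum counts over all windows that lie entirely between start and end."""
    total = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            window_start = datetime.fromisoformat(rec["window_start"])
            window_end = datetime.fromisoformat(rec["window_end"])
            if window_start >= start and window_end <= end:
                total.update(rec["counts"])
    return total
```

Because every record carries its own window bounds, a restart simply starts a new, shorter window instead of losing or double counting requests.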

The logs are already written to disk and contain the above missing info (among other things).

Also, if we use Callum's script, it's not real-time usage data that would be available. The script anonymizes and aggregates the logs (we can do more if there are concerns that the current anonymization isn't sufficient) and would most likely run on a daily basis (so requests since the last processing time would not be available). I'd also intend to make this optional, so the processed logs dataset could be turned off for an ERDDAP server (and possibly, as another option, made a protected dataset).
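As a generic illustration of "anonymize and aggregate" (not necessarily what Callum's script does), one common approach is to zero the host part of each IP address and publish timing and location as two separate tables so they cannot be joined. The prefix lengths (/24 for IPv4, /48 for IPv6) and the function names are assumptions chosen for this sketch.

```python
import ipaddress
from collections import Counter


def coarsen_ip(ip):
    """Zero the host bits so the address identifies a network, not a machine."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48  # assumed prefix lengths
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def aggregate_requests(requests):
    """requests: iterable of (iso_date, ip, datasetID) tuples.

    Returns two independent tables: counts per (date, datasetID) with no
    address information, and counts per coarsened network with no dates.
    """
    by_day, by_network = Counter(), Counter()
    for day, ip, dataset in requests:
        by_day[(day, dataset)] += 1
        by_network[coarsen_ip(ip)] += 1
    return by_day, by_network
```

Whether this level of coarsening is sufficient depends on the applicable privacy rules, which is exactly the kind of thing worth reviewing before anything is published.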

Lastly my thought on making the results a dataset is to enable easy usage in cases besides the sole admin. For example Fred being able to access the stats on the datasets he cares about (or others in an organization besides the server admin). Also for my own understanding of how the different ERDDAP servers are used, I'd love to have more data available to help guide where I focus my efforts.

rmendels · 2024-06-26T15:11:39Z

rmendels
Jun 26, 2024

@ChrisPJohn Email reports are written to disk each day (not saying that we should not do the other stuff). Agree with Shane about possible privacy issues if anything besides dataset usage is made into a dataset. Been lazy, haven't tried Callum's script yet.

1 reply

ChrisJohnNOAA Jun 26, 2024
Maintainer

Thanks for the reminder, I forgot about the email log. We'd still want to change how it's written out (or add a structured output) if this were how we saved the data for comparing across dates.

Also, I'm happy to discuss actual privacy concerns about Callum's script. It was something we discussed during the code sprint and took measures to address. It's possible we missed something, but I do think it does a good job of providing useful information while protecting user privacy.

thogar-computer · 2024-07-01T08:21:08Z

thogar-computer
Jul 1, 2024

@ChrisJohnNOAA @rmendels @flbahr
We have done some work on this at NOC. Currently, it uses a custom script to pull information out of the web traffic going into ERDDAP (so it sits outside ERDDAP). However, we did look at how it might work from within ERDDAP; issue #118 contains that information.

I think having it as a public dataset is an interesting case. Would it also report on itself? And would there be a version that isn't as anonymized for admins?

2 replies

ChrisJohnNOAA Jul 1, 2024
Maintainer

Callum's script works by processing the Tomcat logs. If the integration happens, my understanding is that it would still process the Tomcat (or other server) logs.

I think based on #118 you were interested in collecting/logging all of the information from within ERDDAP itself, is that right?

Is there particular data lost in the anonymization process that you are interested in? Given the complex and frequently changing nature of data privacy laws, I'd be hesitant to provide tools for storing (or making more widely available, even internally to an org) less anonymized data.

If we went the route of making the processed logs a dataset, then yes it would include access information for the logs dataset (not in real-time, I believe the proposal was to process the logs once a day).

ChrisJohnNOAA Jul 1, 2024
Maintainer

I should also add that if we do the integration, I'd want it to be easy to disable (both the processing and the serving of the processed logs). I understand not all ERDDAP admins may want those features, for various reasons.

callumrollo · 2024-07-05T11:41:35Z

callumrollo
Jul 5, 2024

I've added more comprehensive installation instructions to the repo, including xml files for integration into ERDDAP. It should be much easier to integrate now.

We have a live deployment on our ERDDAP, updated by a daily cron job, if you want to see what the public stats look like:

https://erddap.observations.voiceoftheocean.org/erddap/tabledap/requests.html
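For anyone who wants to pull these stats programmatically rather than through the web form, a tabledap request is just a URL of the form `<base>/tabledap/<datasetID>.<fileType>?<variables>&<constraints>`. Here is a small Python helper that builds one. The variable names `time` and `dataset_id` in the usage line are hypothetical placeholders; check the dataset's `.html` page for the real column names.

```python
from urllib.parse import quote

BASE = "https://erddap.observations.voiceoftheocean.org/erddap"


def tabledap_url(dataset_id, variables, constraints=(), file_type="csv", base=BASE):
    """Build a tabledap request URL with percent-encoded constraints."""
    query = ",".join(variables)
    for constraint in constraints:
        # Encode characters such as ">" and ":" but keep "=" readable.
        query += "&" + quote(constraint, safe="=")
    return f"{base}/tabledap/{dataset_id}.{file_type}?{query}"


# Hypothetical variable names, for illustration only.
url = tabledap_url("requests", ["time", "dataset_id"], ["time>=2024-07-01T00:00:00Z"])
```

The resulting URL can be passed straight to `pandas.read_csv` or downloaded with any HTTP client.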

There are also a couple of example notebooks in the repo that explore both the processed log files and the anonymised data that are published on ERDDAP

https://github.com/callumrollo/erddaplogs/tree/main/notebooks

thogar-computer · 2024-07-05T11:44:15Z

thogar-computer
Jul 5, 2024

Yes, #118 suggested changes to get the information from within ERDDAP. We thought that, as this information is already being captured, why not get ERDDAP to output it in a format that is simple to ingest into other systems?

That output can then power any analytics dashboard a user might have (we currently have InfluxDB for that, but we are also looking at Grafana). #117 used a JSONL format for logging that user information. I believe that is the format used by the EDDTableFromJsonlCSVFiles class, which would allow the logs to be used directly as a feed for a metrics dataset?
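For reference, my understanding of the jsonlCSV layout that EDDTableFromJsonlCSVFiles reads is: the first line is a JSON array of column names, and every following line is a JSON array holding one row of values. That is slightly different from "one JSON object per line" JSONL, so it is worth checking which of the two #117 writes. A minimal Python sketch of a writer follows; the column names in the usage line are hypothetical.

```python
import json


def to_jsonl_csv(column_names, rows):
    """Serialize rows as jsonlCSV: a header array, then one JSON array per row."""
    lines = [json.dumps(list(column_names), separators=(",", ":"))]
    for row in rows:
        if len(row) != len(column_names):
            raise ValueError("row length does not match column_names")
        lines.append(json.dumps(list(row), separators=(",", ":")))
    return "\n".join(lines) + "\n"


# Hypothetical columns for a request log, for illustration only.
text = to_jsonl_csv(
    ["time", "datasetID", "protocol"],
    [["2024-07-05T11:44:15Z", "jplMURSST41", "griddap"]],
)
```

Since each row is a self-contained line, a logger can append to the file without rewriting it.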

In parallel, we built a system that works on the inbound request information. We set up a GoReplay instance to mirror traffic to our ingestor, which reduced the reliance on logs and gave us real-time information as well. This is what currently powers our metrics.

Totally agree this should be something easily turned off.

1 reply

ChrisJohnNOAA Jul 5, 2024
Maintainer

I do think long term, a structured log from ERDDAP itself is a good plan for a number of reasons (including being able to log information not in the Tomcat logs). I'm happy to review PRs around adding this structured logging, but I think it needs more design work first.

What benefit does it provide over current logging (the current ERDDAP logs, the daily emails, and the Tomcat logs)?
What is being logged and in what format?
What are the privacy concerns and how are they addressed? Especially with ERDDAP being used in many different countries, this is a complicated question.
What are the uses for these logs? This can heavily influence the information and format of the logs.
Where is the information stored?
Does ERDDAP do any processing (daily or otherwise) of the logs to generate reports?
How can an ERDDAP administrator opt-out (or in) to generating these logs?
How do we test the system to make sure it doesn't break?

rmendels · 2024-07-05T17:36:42Z

rmendels
Jul 5, 2024

@callumrollo @thogar-computer @ChrisJohnNOAA Whatever the final resolution of this, many thanks to all who have contributed. Much appreciated.
