ML challenge inspired from the aggregate reporting API proposal #137

BasileLeparmentier · 2021-04-28T16:19:43Z

Hi everyone,

(repost of issue csharrison/aggregate-reporting-api#21 as apparently it was not the right repo for it)

We are delighted to announce that we will be organising a challenge with adKDD inspired by the aggregate measurement API, tackling the optimisation use case.

Criteo will provide a dataset and some prize money and let researchers and data scientists from around the world compete on how to learn performing bidding models from differential private reports. Link to the challenge here.

We will be happy to work with the Chrome team to set the appropriate parameter to the differential privacy function, to be as close as possible to what real-life operations could look like (e.g. an epsilon level would be beneficial).

We hope to kickstart the challenge in early May, so if you are interested in solving the “optimisation” use case using the aggregate reporting API, please do participate!

Best,

Basile on behalf of the Criteo team

BasileLeparmentier · 2021-04-28T16:20:13Z

(I apologize for the repost)
Dear @csharrison ,

For the challenge could you please provide us for an order of magnitude of the epsilon for us to work with? If you have an idea on the noise distribution too (Gaussian? Laplacian?) it would be great!

We understand very well that this number might change (and we hope the challenge will help us inform the choice of epsilon!) but it would really help to have an order of magnitude to set up the challenge.

Many thanks!

csharrison · 2021-04-30T14:33:48Z

Thanks Basile! I'll work on trying to get this information to you soon.

ALamraniAlaouiScibids · 2021-05-03T16:01:11Z

Hello @BasileLeparmentier,

Thank you for realizing this challenge, we will surely learn a lot and prepare for the future of click / sales optimization !
Would it be possible to release a detailed note of the way the dataset was generated ?
(the query used for the aggregation, the different values of the parameters etc...)

BasileLeparmentier · 2021-05-03T16:03:10Z

Hi,

Yes all of this will be described in the challenge!

BasileLeparmentier · 2021-05-05T12:02:00Z

Dear @csharrison ,

Sorry to bother you, but if we could get the value of the epsilon today, it would really be great.

We are launching the challenge Monday and if we want to be able to do checks, if we get the epsilon later than today, it will be very hard for us.

many thanks,
Basile Leparmentier

csharrison · 2021-05-05T14:08:23Z

Hey Basile, yep planning on sending you some information by today, I just need to get a few final sign-offs. Sorry for the delay.

BasileLeparmentier · 2021-05-05T14:15:05Z

Thx!

csharrison · 2021-05-05T16:47:14Z

Hey Basile, here are some parameters for you to work with.

Total epsilon of 5, resetting every month, with an L1 “sensitivity budget” which limits the total contributions per (browser, week, publisher) across all buckets by some fixed parameter (L1).
- Note that we are also considering “pacing” this budget over time (e.g. having a 7/6 epsilon per week). You could consider divvying up the budget that way if you like.
For the purposes of this challenge you can set the L1 parameter however you like, and allow histogram contributions of any real numbers from 0 to L1. In the API we would expect reports to be dropped after the sensitivity budget is consumed, until the end of the time window.
- e.g. in the simplest setting, all clicks would contribute L1 to one bucket of the underlying histogram, but then the user would not be able to contribute anything else for the rest of the time window for that publisher.
- More complex settings are possible by allowing the user to contribute less than L1 to multiple buckets, across multiple events.
Noise is added to every aggregate bucket via Laplace(L1 / epsilon)
Publisher eTLD+1 available in the clear associated with reports for click reports, along with ~hour granularity.

Aggregation function

There are a few choices of aggregation functions you can use. Any of these could be acceptable but offer different trade-offs.

Unbounded key space + Laplace + thresholding. This is the last bullet in issue 116. Group by every key in the data, add Laplace noise, and threshold at some fixed T. See algorithm described here (set B0 = 1 and Binf = L1) and associated code for how to compute this for Laplace histograms. You can use delta = 10^-7 to start.
- Pros: Doesn’t require prior knowledge of the domain keys
- Cons: Introduces bias in the data via thresholding mechanism
Bounded key space + Laplace. Similar to the first bullet in issue 116. Same as (1) but the aggregation function takes as input a list of “possible” keys (not learned from the sensitive data itself!). Final histograms are computed by adding Laplace noise to sums per key, as long as each key was present in the input. If there are any keys present in the data that were not in the input, they should be dropped. No thresholding or delta term is necessary.
- Pros: Doesn’t introduce bias by thresholding
- Cons: Requires a priori knowledge of the domain. For sparse domains (ones where only a small fraction of the keys show up in real data), output will be similarly sparse with lots of noisy 0s.
Dense key space + Laplace + hierarchies. This matches the second bullet in issue 116, it is a “hierarchy of histograms” approach where aggregation keys are specified in some fixed domain (e.g. 32 bits) and are queryable at various bit-prefixes, with Laplace noise added to each bucket. This would entail splitting epsilon budget between different hierarchies queried, and additionally require some (dense) encoding of aggregation keys. It also removes the need for any specific thresholding in the privacy mechanism itself. This technique is more compatible with MPC.
- Pros: Doesn’t require thresholding. Allows for hierarchical queries.
- Cons: Requires structured keys in some dense encoding (e.g. 32 bits or less), a priori knowledge of the domain.

I’m happy to go into more detail on any of this. Note that we are planning on publishing a doc with more details on how this relates to the attribution API, so stay tuned. In that doc we are planning on suggesting option (3) for its flexibility and compatibility with MPC, but for the purposes of this challenge I think any of these are fine.

Note: Training models using aggregate data may be easier if you allow on-device model training / Federated Learning. We haven’t elaborated on how to do this but we should discuss it if you want to do any follow-up competitions in this space, since it may be something the Privacy Sandbox could support.

BasileLeparmentier · 2021-05-05T17:46:25Z

Hi Charlie,

Many thanks for your answer!

Is it important for you to stick with the Laplacian noise?

As it is a challenge, there are a lot of queries (around 200, which we would also expect in a real world setting, even probably more), we believe that using the gaussian mechanism with a delta of 10e-6 or 10e-8 would give better learning results with the same privacy guarantee.

Any thought?

Basile

csharrison · 2021-05-05T17:56:47Z

The Laplace noise fits better with our current design (details pending) where we wouldn't expect so many queries / one report to influence so many different outputs. However, if you already have all these aggregation dimensions fixed in your design I think it should be OK to use Gaussian noise. We aren't opposed to Gaussian noise on any principled level.

csharrison · 2021-05-07T21:15:41Z

FYI, published some more information about aggregate attribution reports here:
https://github.com/WICG/conversion-measurement-api/blob/main/AGGREGATE.md

BasileLeparmentier · 2021-05-10T16:08:40Z

Hi Charlie and others interested,

The ML challenge is now live!

For those who want to participate, please follow this link https://competitions.codalab.org/competitions/31485

BasileLeparmentier mentioned this issue May 3, 2021

MEETING: Attribution Reporting API calls #80

Open

csharrison closed this as completed Dec 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ML challenge inspired from the aggregate reporting API proposal #137

ML challenge inspired from the aggregate reporting API proposal #137

BasileLeparmentier commented Apr 28, 2021

BasileLeparmentier commented Apr 28, 2021

csharrison commented Apr 30, 2021

ALamraniAlaouiScibids commented May 3, 2021

BasileLeparmentier commented May 3, 2021

BasileLeparmentier commented May 5, 2021

csharrison commented May 5, 2021

BasileLeparmentier commented May 5, 2021

csharrison commented May 5, 2021 •

edited

Loading

BasileLeparmentier commented May 5, 2021

csharrison commented May 5, 2021

csharrison commented May 7, 2021

BasileLeparmentier commented May 10, 2021

ML challenge inspired from the aggregate reporting API proposal #137

ML challenge inspired from the aggregate reporting API proposal #137

Comments

BasileLeparmentier commented Apr 28, 2021

BasileLeparmentier commented Apr 28, 2021

csharrison commented Apr 30, 2021

ALamraniAlaouiScibids commented May 3, 2021

BasileLeparmentier commented May 3, 2021

BasileLeparmentier commented May 5, 2021

csharrison commented May 5, 2021

BasileLeparmentier commented May 5, 2021

csharrison commented May 5, 2021 • edited Loading

Aggregation function

BasileLeparmentier commented May 5, 2021

csharrison commented May 5, 2021

csharrison commented May 7, 2021

BasileLeparmentier commented May 10, 2021

csharrison commented May 5, 2021 •

edited

Loading