
Contribution Bounding: per partition and cross partition bounding (continued from #26) · Pull Request #32

Merged: 26 commits, May 27, 2021

Conversation

preethiraghavan1 (Contributor)

Description

Implements per-partition bounding of each contribution, as well as cross-partition bounding. Added a flat map operation in pipeline_operations.py to accommodate this functionality. Continuation of #26.
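For context, a minimal standalone sketch of the two bounding steps on in-memory data (names and data shapes here are illustrative, not the PR's exact API):

import random
from collections import defaultdict

def bound_contributions(rows, max_partitions_contributed,
                        max_contributions_per_partition):
    """rows: iterable of (privacy_id, partition_key, value) tuples."""
    # Per-partition bounding: keep at most max_contributions_per_partition
    # values for each (privacy_id, partition_key) pair.
    per_partition = defaultdict(list)
    for pid, pk, v in rows:
        per_partition[(pid, pk)].append(v)
    for key, values in per_partition.items():
        if len(values) > max_contributions_per_partition:
            per_partition[key] = random.sample(
                values, max_contributions_per_partition)

    # Cross-partition bounding: keep at most max_partitions_contributed
    # partitions for each privacy_id.
    per_pid = defaultdict(list)
    for (pid, pk), values in per_partition.items():
        per_pid[pid].append((pk, values))
    bounded = {}
    for pid, pk_values in per_pid.items():
        if len(pk_values) > max_partitions_contributed:
            pk_values = random.sample(pk_values, max_partitions_contributed)
        for pk, values in pk_values:
            bounded[(pid, pk)] = values
    return bounded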

How has this been tested?

  • Unit tests for bounding


col = self._ops.sample_fixed_per_key(col,
                                     max_contributions_per_partition,
                                     "Sample per (privacy_id, partition_key)")
# ((privacy_id, partition_key), [value])
col = self._ops.map(col, lambda pid_pk: (pid_pk[0], aggregator_fn(pid_pk[1])))
Collaborator:

Please apply aggregator_fn after per-partition contribution bounding; there is no need to do any aggregation after cross-partition bounding.

Clarification:
Eventually we need to aggregate per pk; we do it in two steps:
1. Aggregate per (pid, pk) (after per-partition contribution bounding; this PR).
2. Aggregate per pk (after cross-partition contribution bounding), but that's outside the scope of this task.

Aggregation in step 2 is slightly more complicated: in step 1 we can assume that the data for each key fits in memory (because we've done bounding), but in step 2 we can't assume that, so a group by key is needed.
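To make the two steps concrete, here is a small local-Python sketch (aggregator_fn and the data are illustrative):

from collections import defaultdict

aggregator_fn = sum  # illustrative stand-in for the real aggregator
bounded = [(("alice", "pk1"), [1, 2]), (("bob", "pk1"), [3])]

# Step 1: after per-partition bounding each value list is capped, so it fits
# in memory and aggregation is a plain per-element map.
per_pid_pk = [((pid, pk), aggregator_fn(values))
              for (pid, pk), values in bounded]

# Step 2 (out of scope here): one pk's aggregates are spread across many
# privacy_ids, so a group-by-key is required before aggregating per pk.
per_pk = defaultdict(list)
for (pid, pk), agg in per_pid_pk:
    per_pk[pk].append(agg)
assert dict(per_pk) == {"pk1": [3, 3]}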

Contributor Author:
That makes sense! Thanks!
Moved it to after per-partition bounding.

Collaborator:
Thanks!

Collaborator @dvadym left a comment:

Thanks, it looks great! I've left comments (mostly minor ones).

@@ -49,3 +49,45 @@ def aggregate(self, col, params: AggregateParams,
        # TODO: implement aggregate().
        # It returns the input for now, just to ensure that an example works.
        return col

def _bound_cross_partition_contributions(self, col,
Collaborator:
Could you please rename it to _bound_contributions? (Sorry, I realized that the name I provided is not correct.)

Contributor Author:
Done

col = self._ops.sample_fixed_per_key(col,
                                     max_contributions_per_partition,
                                     "Sample per (privacy_id, partition_key)")
# ((privacy_id, partition_key), [value])
col = self._ops.map(col, lambda pid_pk: (pid_pk[0], aggregator_fn(pid_pk[1])))
Collaborator:
Thanks!

pipeline_dp/dp_engine.py (resolved thread)
col = self._ops.sample_fixed_per_key(col,
                                     max_contributions_per_partition,
                                     "Sample per (privacy_id, partition_key)")
# ((privacy_id, partition_key), [value])
col = self._ops.map(col, lambda pid_pk: (pid_pk[0], aggregator_fn(pid_pk[1])))
Collaborator:
Nit (optional/style improvements): using map_values() instead of map() is simpler

Contributor Author:
Agreed! Done
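To illustrate the nit, a sketch comparing the two (map_values here is a local stand-in with the assumed semantics of applying a function to each value of a (key, value) pair):

pairs = [(("alice", "pk1"), [1, 2]), (("bob", "pk1"), [3])]
aggregator_fn = sum

# With map(), the lambda has to rebuild the whole (key, value) tuple:
with_map = [(kv[0], aggregator_fn(kv[1])) for kv in pairs]

# With map_values()-style semantics, only the value transform is written:
def map_values(col, fn):
    return ((k, fn(v)) for k, v in col)

assert list(map_values(pairs, aggregator_fn)) == with_map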

# Cross partition bounding
col = self._ops.map_tuple(
    col, lambda pid_pk, v: (pid_pk[0], (pid_pk[1], v)),
    "To (privacy_id, (partition_key, aggregator))")
Collaborator:
Please update the stage name to "Rekey to ...".

Clarification: the operation of changing keys is needed very often, and we usually use the name "Rekey" for it.

Contributor Author:
Done
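A tiny sketch of what this rekey stage does to each element (data is illustrative):

# ((privacy_id, partition_key), aggregate) -> (privacy_id, (partition_key, aggregate))
col = [(("alice", "pk1"), 3), (("alice", "pk2"), 5)]
rekeyed = [(pid, (pk, v)) for (pid, pk), v in col]
assert rekeyed == [("alice", ("pk1", 3)), ("alice", ("pk2", 5))]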

@@ -176,8 +188,19 @@ def keys(self, col, stage_name: str):
     def values(self, col, stage_name: typing.Optional[str] = None):
         return (v for k, v in col)

-    def sample_fixed_per_key(self, col, n: int, stage_name: str):
-        pass
+    def sample_fixed_per_key(self, col, n: int,
Collaborator:
Please add a test for this function

Contributor Author:
Done
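A hedged sketch of what the local sample_fixed_per_key and the requested test could look like (the PR's actual implementation and signature may differ):

import random
from collections import defaultdict

def sample_fixed_per_key(col, n):
    """Groups (key, value) pairs and keeps at most n values per key."""
    def generator():
        grouped = defaultdict(list)
        for k, v in col:
            grouped[k].append(v)
        for k, values in grouped.items():
            yield k, values if len(values) <= n else random.sample(values, n)
    return generator()

def test_sample_fixed_per_key_requires_no_discarding():
    # With fewer than n values per key, nothing may be dropped.
    data = [(1, 11), (2, 22), (1, 12)]
    assert dict(sample_fixed_per_key(data, n=3)) == {1: [11, 12], 2: [22]}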

Collaborator:
thanks!

col = self._ops.sample_fixed_per_key(col, max_partitions_contributed,
                                     "Sample per privacy_id")
# (privacy_id, [(partition_key, aggregator)])
return self._ops.flat_map(col, lambda pid: [((pid[0], pk_v[0]), pk_v[1])
                                            for pk_v in pid[1]],
                          "Unnest")
Collaborator:
Nit: the lambda in this line is pretty complicated; maybe try a local function instead of a lambda.

Also, please use lazy evaluation (there are two options: "yield" or a "(...)" generator expression).

Contributor Author:
Done
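A sketch of what the suggestion amounts to, with illustrative names (flat_map here is a local stand-in):

def flat_map(col, fn):
    # A "(...)" generator expression keeps the flat map lazy.
    return (x for item in col for x in fn(item))

def unnest_sampled_per_key(pid_pk_v):
    # "yield" keeps the helper lazy too:
    # (privacy_id, [(partition_key, aggregate)]) ->
    # ((privacy_id, partition_key), aggregate)
    for pk_v in pid_pk_v[1]:
        yield (pid_pk_v[0], pk_v[0]), pk_v[1]

col = [("alice", [("pk1", 3), ("pk2", 5)])]
assert list(flat_map(col, unnest_sampled_per_key)) == [
    (("alice", "pk1"), 3), (("alice", "pk2"), 5)]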

tests/dp_engine_test.py (resolved thread)
def _bound_contributions(self, col, max_partitions_contributed: int,
                         max_contributions_per_partition: int,
                         aggregator_fn):
    """
    Bounds the contribution by privacy_id in and cross partitions
Collaborator:
Please add "." after this phrase and then after each args description and return description

Contributor Author:
Done
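For reference, a sketch of the docstring shape being requested (shown as a free function for brevity; field descriptions are paraphrased, not the PR's exact text):

def _bound_contributions(col, max_partitions_contributed: int,
                         max_contributions_per_partition: int, aggregator_fn):
    """Bounds the contributions by privacy_id within and across partitions.

    Args:
        col: collection of (privacy_id, partition_key, value) tuples.
        max_partitions_contributed: cap on partitions per privacy_id.
        max_contributions_per_partition: cap on contributions per
            (privacy_id, partition_key) pair.
        aggregator_fn: function applied to the list of values per key.

    Returns:
        Collection of ((privacy_id, partition_key), aggregated_value).
    """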

Collaborator @dvadym left a comment:

a few minor suggestions

return self._ops.flat_map(col,
                          self._unnest_cross_partition_bound_sampled_per_key,
                          "Unnest")

def _unnest_cross_partition_bound_sampled_per_key(self, pid):
Collaborator:
I'd suggest making it an internal function of _bound_contributions; then there's no need for comments.

Contributor Author:
I agree, it's better as a nested function. The style guide suggested using nested functions only for closing over a variable, so I had added it at module level. Moved it to an inner function.

return self._ops.flat_map(col,
                          self._unnest_cross_partition_bound_sampled_per_key,
                          "Unnest")

def _unnest_cross_partition_bound_sampled_per_key(self, pid):
Collaborator:
Please rename pid to pid_pk_v (since it's not a pid, but a tuple).

Contributor Author:
Done


Returns: tuple of the form ((privacy_id, partition_key), values)

"""
Collaborator:
Please unpack the arguments for readability:

pid, pk_values = pid_pk_v

Contributor Author:
Done
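With the rename and unpacking applied (and the helper moved inside _bound_contributions per the earlier suggestion), it might look like this sketch, not the PR's exact code:

def unnest_cross_partition_bound_sampled_per_key(pid_pk_v):
    # Unpacking up front makes the element shape obvious at a glance.
    pid, pk_values = pid_pk_v
    return (((pid, pk), v) for pk, v in pk_values)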

Collaborator @dvadym left a comment:

Just a minor test improvement suggestion

@@ -176,8 +188,19 @@ def keys(self, col, stage_name: str):
     def values(self, col, stage_name: typing.Optional[str] = None):
         return (v for k, v in col)

-    def sample_fixed_per_key(self, col, n: int, stage_name: str):
-        pass
+    def sample_fixed_per_key(self, col, n: int,
Collaborator:
thanks!

@@ -133,6 +153,52 @@ def assert_laziness(operator, *args):
        assert_laziness(self.ops.values)
        assert_laziness(self.ops.count_per_element)

    def test_local_sample_fixed_per_key_requires_no_discarding(self):
Collaborator:
Please add

assert_laziness(self.ops.sample_fixed_per_key)
assert_laziness(self.ops.flat_map)

in test_laziness a few lines above.

Contributor Author:
Done!
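For the record, one way such a laziness assertion can work, sketched with illustrative names (the real test helper may differ): pass the operator a generator that raises as soon as it is consumed; a lazy operator returns without consuming it.

def assert_laziness(operator, *args):
    def raising_gen():
        raise AssertionError(f"{operator.__name__} is not lazy")
        yield  # unreachable; makes this function a generator

    operator(raising_gen(), *args)  # must not raise if nothing is consumed

# e.g. (argument values are illustrative):
# assert_laziness(ops.sample_fixed_per_key, 3)
# assert_laziness(ops.flat_map, lambda x: [x])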

Collaborator:

thanks!

Collaborator @dvadym left a comment:

Thanks a lot for contributing!

@@ -133,6 +153,52 @@ def assert_laziness(operator, *args):
        assert_laziness(self.ops.values)
        assert_laziness(self.ops.count_per_element)

    def test_local_sample_fixed_per_key_requires_no_discarding(self):
Collaborator:
thanks!

@dvadym merged commit 8db30eb into OpenMined:main on May 27, 2021.
@preethiraghavan1 linked an issue on May 27, 2021 that may be closed by this pull request.
Successfully merging this pull request may close these issues: Contribution bounding.

2 participants