-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stdlib compat set #1006
Stdlib compat set #1006
Conversation
Thanks for this contribution. |
@UnixJunkie I thought i added unit tests for everything? I will check again... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is nice work (again).
Note: for functorized modules (Set and Map) we also have non-inline tests in testsuite/test_<module>.ml
, so that the tests can be performed on all implementations (for functions that are available everywhere). I have mixed thoughts about adding inline tests for some functions (if they could be tested this way): inline tests are more convenient, but they either force us to add redundancy (copying the tests for several interfaces) or they test less. (But there probably should have been a comment in the .ml
file pointing at where the tests are.) I would have a slight preference for moving the tests to testsuite/test_set.ml
when it makes sense, but I'm interested in what you think about this.
src/batSet.ml
Outdated
then join l v r | ||
else union cmp l (add cmp v r) | ||
|
||
let rec map_stdlib cmp f = function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you decide to keep the older, less-efficient implementations of map
and filter_map
around, and name those *_stdlib
to avoid name conflicts? Would it be correctly to completely replace the previous functions with the new implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the map and filter_map functions from stdlib take a function elt -> elt, while the functions in Batteries take a function elt -> 'b and produce a 'b set (for the polymorphic maps)
So it seems we must keep the old versions or break compatibility with people that currently rely on map/filter_map to create sets of a different element type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, thanks. I would propose to call them map_endo
or endo_map
(from "endo{morphism,function}", a function from a set to itself) or homogeneous_map
rather than map_stdlib
.
The implementation of map_stdlib
and filter_map_stdlib
that you propose is more efficient than the existing map
and filter_map
functions in the case where the mapping function preserves the order, or "nearly" preserves the order. Would you possibly be interested in using the same implementation with try_join
for the existing/heterogeneous map functions? Note that the ==
trick cannot be reused, as it assumes that the input and output types are equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To give credit where it is due: the map_stdlib and filter_map_stdlib methods are copied (mostly) verbatim from the stdlib.
I rewrote the map and filter_map functions to use try_join. I'm not convinced that makes them faster though...
src/batSet.ml
Outdated
match m with | ||
| Empty -> Seq.empty | ||
| Node(l, v, r, _) -> | ||
Seq.append (to_seq l) (Seq.cons v (to_seq r)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two remarks:
- This implementation does not do what I would expect due to the use of the strict function
Seq.cons
: it will traverse the whole set before returning the first element. This is not technically wrong as there is no order-of-evaluation difference (and probably not much of a performance difference, although it is worth measuring), but I would find it more natural to have a version that lazily enumerates the list. - We have a proper datatype to represent partial enumerations of sets,
'a iter
andcons_iter
.
I would expect something looking roughly like (untested):
let rec iter_to_seq = function
| E -> Seq.empty
| C (e, r, t) -> fun () -> Seq.Cons (e, iter_to_seq (cons_iter r t))
let to_seq s = iter_to_seq (cons_iter t E)
(to_seq_from
also needs an adaptation.)
I would be curious to know what is the performance difference between your version, a version using iter
as above, and also a modified version of your code that avoids Seq.cons
to avoid forcing the tail (but otherwise traverses the set recursively in the direct way).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified to_seq and to_seq_from to be more lazy.
They will not do anything on invocation (except returning a closure).
Furthermore, once that closure is called, it will only go down the leftmost path in the tree to determine the first element; recursing into the right subtrees is again done in a closure.
I hope that satisfies your expectations of these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They do, thanks! I would still be curious to know whether the 'a iter
version is faster, but I don't have the time to look at this right now and maybe you are not so curious yourself, so the new versions are fine with me.
I started moving some testcases to the testsuite. I'm not done yet, sorry... |
…in BatSeq -> BatList -> BatSet -> BatSeq
If some tests are in another file, it would be nice to have a comment |
@UnixJunkie After checking again: I think i have unit tests for every function that i touched. Do you think that there is anything else to do? |
Let's go ahead and merge. Thanks for the nice work. |
…batteries-team#1006 Results on my machine: enumeration (89.71 us) is 77.5% faster than batseq (399.39 us) which is 3.8% faster than too strict (415.29 us) which is probably (alpha=40.06%) same speed as simple (416.74 us) Saving times to times.flat This suggests that using an enumeration indeed provides a large performance advantage over the simple approach.
…batteries-team#1006 Results on my machine: enumeration (89.71 us) is 77.5% faster than batseq (399.39 us) which is 3.8% faster than too strict (415.29 us) which is probably (alpha=40.06%) same speed as simple (416.74 us) Saving times to times.flat This suggests that using an enumeration indeed provides a large performance advantage over the simple approach.
benchmark various implementations of Bench.to_seq discussed in #1006
@jabot2 could you contribute the to_rev_seq versions? |
@jabot2 this is required in batMap, batSplay and batSet for 'make test-compat' to pass. |
I can write |
BatSet is missing several methods from Stdlib.Set.
Also, several existing methods in Stdlib.Set guarantee physical equality of resulting sets if possible.
This PR introduces the missing methods into BatSet, and adds the physical equality guarantees.
It also adds BatSet to the compattest.