Lift column ops to the dataset level (#107)

* Readme: Replace `lein test` with `lein midje`
* Add proof of concept for lifting
* Clean up
* Fix magnitude arguments
* Fix typo breaking lift operation for `magnitude`
* Save prototype working example that handles optional arguments
* Clean up
* Reorganize codegen utilities
  * moved hopefully common utilities up into 'tablecloth.utils.codegen
  * retooled those helpers in that ns to be a bit more accessible (WIP)
* Clean up
* Clean up
* Rejigger codegen for column ops to take just fn-sym arglists
* Try lifting all column ops to ds (no tests yet)
* Exclude ops that do not potentially return column
* Do not lift options that do not return columns
* Add docstrings for some codegen
  Also regenerated operators to make sure tests pass.
* Add docstring to ds col ops
* version bump and small fix
* Modify ds-level lift op to also return fn that returns column
  This is a breaking change for the column api lifting until I adapt the lift-op to the changes made in the codegen where the argument is supplied in data rather than within a fn.
* example added for replace-missing
* Add tests for ops that take inf number of cols
* Add tests for ops returning ds taking max of three cols
* Add tests for ops returning ds and taking two columns max
* Test for ops returning ds and max of one column
* Add more functions to test for ops taking one col
* Clean up
* Lifted ops taking one column and returning a scalar
* Lift functions taking two columns and returning a scalar
* Clean up
* Clean up
* bump to 7.000-beta-50
* fixes #108
* hashing in joins enabled for every case
* 7.000-beta-51
* Clean up
* Lift functions taking 1 col and returning scalar
* Adjust column api lift ops to new declarative syntax
* Adjust lift plan for tablecloth.column.api for tmd v7
* Remove mention of tech.ml.datatype
* Add missing word
* Bump tmd version to 7.006 for fix to fns that were erroring
  fns are: quartiles-1, quartiles-3 and median
* Fixing more tests
* Comment some code to keep around for a spell
* Remove special lift op for 'round
  Its arguments were fixed.
* Cleanup
* 7.007

---------

Co-authored-by: Teodor Heggelund <git@teod.eu>
Co-authored-by: genmeblog <38646601+genmeblog@users.noreply.github.com>
Co-authored-by: GenerateMe <generateme.blog@gmail.com>
Co-authored-by: adham-omran <git@adham-omran.com>
scicloj · Sep 29, 2023 · e0479aa
1 parent 1790609
Showing 25 changed files with 8,927 additions and 3,885 deletions.
diff --git a/.dir-locals.el b/.dir-locals.el
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,45 @@
 # Change Log
 
+## [7.007]
+
+### Added
+
+* Extened documentation for `dataset` (copied from TMD), [#112](https://github.com/scicloj/tablecloth/issues/112)
+
+### Changed
+
+* `rows` accepts `:nil-missing?`(default: true) and `copying?`(default: false) options.
+
+## [7.000-beta-51]
+
+Deps updated
+
+## [7.000-beta-50.2]
+
+### Added
+
+* `:hashing` is available for single column joins too
+
+## [7.000-beta-50.1]
+
+### Added
+
+* `:hashing` option determines method of creating an index for multicolumn joins (was `hash` is `identity`)
+
+### Fixed
+
+* [#108](https://github.com/scicloj/tablecloth/issues/108) - hashing replaced with packing data into the  sequence
+
+## [7.000-beta-50]
+
+Deps updated
+
+## [7.000-beta-38]
+
+### Fixed
+
+* dataset from singleton creation generated from wrong structure
+
 ## [7.000-beta-27]
 
 ### Added

diff --git a/README.Rmd b/README.Rmd
@@ -48,7 +48,7 @@ knit_engines$set(clojure = function(options) {
 
 ## Versions
 
-### tech.ml.dataset 6.x (master branch)
+### tech.ml.dataset 7.x (master branch)
 
 [![](https://img.shields.io/clojars/v/scicloj/tablecloth)](https://clojars.org/scicloj/tablecloth)
 

diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 
 ## Versions
 
-### tech.ml.dataset 6.x (master branch)
+### tech.ml.dataset 7.x (master branch)
 
 [![](https://img.shields.io/clojars/v/scicloj/tablecloth)](https://clojars.org/scicloj/tablecloth)
 

diff --git a/deps.edn b/deps.edn
@@ -1,7 +1,4 @@
 {:extra-paths ["data"]
  :deps {org.clojure/clojure              {:mvn/version "1.11.1"}
-        ;; techascent/tech.ml.dataset       {:mvn/version "6.103"}
-        techascent/tech.ml.dataset       {:mvn/version "7.000-beta-27"}
-        ;; generateme/fastmath {:mvn/version "2.1.0"}
-        }
+        techascent/tech.ml.dataset       {:mvn/version "7.007"}}
  :aliases {:dev {:extra-deps {org.scicloj/clay {:mvn/version "2-alpha12"}}}}}
diff --git a/docs/index.Rmd b/docs/index.Rmd
@@ -86,12 +86,18 @@ knit_engines$set(clojure = function(options) {
 
 ```{clojure include=FALSE}
 (def tech-ml-version (get-in (read-string (slurp "deps.edn")) [:deps 'techascent/tech.ml.dataset :mvn/version]))
+(def tablecloth-version (nth (read-string (slurp "project.clj")) 2))
 ```
 
 ```{clojure results="asis"}
 tech-ml-version
 ```
 
+```{clojure results="asis"}
+tablecloth-version
+```
+
+
 ## Introduction
 
 [tech.ml.dataset](https://github.com/techascent/tech.ml.dataset) is a great and fast library which brings columnar dataset to the Clojure. Chris Nuernberger has been working on this library for last year as a part of bigger `tech.ml` stack.
@@ -135,7 +141,7 @@ DS
 
 ### Dataset
 
-Dataset is a special type which can be considered as a map of columns implemented around `tech.ml.datatype` library. Each column can be considered as named sequence of typed data. Supported types include integers, floats, string, boolean, date/time, objects etc.
+Dataset is a special type which can be considered as a map of columns implemented around `tech.ml.dataset` library. Each column can be considered as named sequence of typed data. Supported types include integers, floats, string, boolean, date/time, objects etc.
 
 #### Dataset creation 
 
@@ -418,6 +424,8 @@ Possible result types:
 - `:as-double-arrays` - array of double arrays 
 - `:as-vecs` - sequence of vectors (rows)
 
+For `rows` setting `:nil-missing?` option to `false` will elide keys for nil values.
+
 ---
 
 Select column.
@@ -477,6 +485,26 @@ Rows as sequence of maps
 (clojure.pprint/pprint (take 2 (tc/rows ds :as-maps)))
 ```
 
+---
+
+Rows with missing values
+
+```{clojure}
+(-> {:a [1 nil 2]
+     :b [3 4 nil]}
+    (tc/dataset)
+    (tc/rows :as-maps))
+```
+
+Rows with elided missing values
+
+```{clojure}
+(-> {:a [1 nil 2]
+     :b [3 4 nil]}
+    (tc/dataset)
+    (tc/rows :as-maps {:nil-missing? false}))
+```
+
 #### Single entry
 
 Get single value from the table using `get-in` from Clojure API or `get-entry`. First argument is column name, second is row number.
@@ -526,12 +554,12 @@ Grouping is done by calling `group-by` function with arguments:
 * `ds` - dataset
 * `grouping-selector` - what to use for grouping
 * options:
-    - `:result-type` - what to return:
-        * `:as-dataset` (default) - return grouped dataset
-        * `:as-indexes` - return rows ids (row number from original dataset)
-        * `:as-map` - return map with group names as keys and subdataset as values
-        * `:as-seq` - return sequens of subdatasets
-    - `:select-keys` - list of the columns passed to a grouping selector function
+- `:result-type` - what to return:
+* `:as-dataset` (default) - return grouped dataset
+* `:as-indexes` - return rows ids (row number from original dataset)
+* `:as-map` - return map with group names as keys and subdataset as values
+* `:as-seq` - return sequens of subdatasets
+- `:select-keys` - list of the columns passed to a grouping selector function
 
 All subdatasets (groups) have set name as the group name, additionally `group-id` is in meta.
 
@@ -866,7 +894,7 @@ If you want to implement your own mapping function on grouped dataset you can ca
 
 ### Columns
 
-Column is a special `tech.ml.dataset` structure based on `tech.ml.datatype` library. For our purposes we cat treat columns as typed and named sequence bound to particular dataset. 
+Column is a special `tech.ml.dataset` structure. For our purposes we cat treat columns as typed and named sequence bound to particular dataset.
 
 Type of the data is inferred from a sequence during column creation. 
 
@@ -2080,9 +2108,10 @@ Missing values can be replaced using several strategies. `replace-missing` accep
 * column selector, default: `:all`
 * strategy, default: `:nearest`
 * value (optional)
-- single value
-- sequence of values (cycled)
-- function, applied on column(s) with stripped missings
+  - single value
+  - sequence of values (cycled)
+  - function, applied on column(s) with stripped missings
+  - map with [index,value] pairs
 
 Strategies are:
 
@@ -2148,6 +2177,14 @@ Replace missing with a function (mean)
 
 ---
 
+Replace missing some missing values with a map
+
+```{clojure results="asis"}
+(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})
+```
+
+---
+
 Using `:down` strategy, fills gaps with values from above. You can see that if missings are at the beginning, the are filled with first value
 
 ```{clojure results="asis"}
@@ -3181,6 +3218,8 @@ A column selector can be a map with `:left` and `:right` keys to specify column
 
 The difference between `tech.ml.dataset` join functions are: arguments order (first datasets) and possibility to join on multiple columns.
 
+Multiple columns joins create temporary index column from column selection. The method for creating index is based on `:hashing` option and defaults to `identity`. Prior to `7.000-beta-50` `hash` function was used, which caused hash collision for certain cases.
+
 Additionally set operations are defined: `intersect` and `difference`.
 
 To concat two datasets rowwise you can choose:
@@ -3428,6 +3467,16 @@ Return rows from ds1 not matching ds2
 (tc/anti-join ds2 ds1 {:left :e :right :a})
 ```
 
+#### Hashing
+
+When `:hashing` option is used, data from join columns are preprocessed by applying `join-columns` funtion with `:result-type` set to the value of `:hashing`. This helps to create custom joining behaviour. Function used for hashing will get vector of row values from join columns.
+
+In the following example we will join columns on value modulo 5.
+
+```{clojure results="asis"}
+(tc/left-join ds1 ds2 :b {:hashing (fn [[v]] (mod v 5))})
+```
+
 #### Cross
 
 Cross product from selected columns