Commit

Lift column ops to the dataset level (#107)
* Readme: Replace `lein test` with `lein midje`

* Add proof of concept for lifting

* Clean up

* Fix magnitude arguments

* Fix typo breaking lift operation for `magnitude`

* Save prototype working example that handles optional arguments

* Clean up

* Reorganize codegen utilities

* moved hopefully common utilities up  into 'tablecloth.utils.codegen
* retooled those helpers in that ns to be a bit more accessible (WIP)

* Clean up

* Clean up

* Rejigger codegen for column ops to take just fn-sym arglists

* Try lifting all column ops to ds (no tests yet)

* Exclude ops that do not potentially return column

* Do not lift options that do not return columns

* Add docstrings for some codegen

Also regenerated operators to make sure tests pass.

* Add docstring to ds col ops

* version bump and small fix

* Modify ds-level lift op to also return fn that returns column

This is a breaking change for the column api lifting until I adapt
the lift-op to the changes made in the codegen where the argument
is supplied in data rather than within a fn.

* example added for replace-missing

* Add tests for ops that take inf number of cols

* Add tests for ops returning ds taking max of three cols

* Add tests for ops returning ds and taking two columns max

* Test for ops returning ds and max of one column

* Add more functions to test for ops taking one col

* Clean up

* Lifted ops taking one column and returning a scalar

* Lift functions taking two columns and returning a scalar

* Clean up

* Clean up

* bump to 7.000-beta-50

* fixes #108

* hashing in joins enabled for every case

* 7.000-beta-51

* Clean up

* Lift functions taking 1 col and returning scalar

* Adjust column api lift ops to new declarative syntax

* Adjust lift plan for tablecloth.column.api for tmd v7

* Remove mention of tech.ml.datatype

* Add missing word

* Bump tmd version to 7.006 for fix to fns that were erroring

fns are: quartiles-1, quartiles-3 and median

* Fixing more tests

* Comment some code to keep around for a spell

* Remove special lift op for 'round

Its arguments were fixed.

* Cleanup

* 7.007

---------

Co-authored-by: Teodor Heggelund <git@teod.eu>
Co-authored-by: genmeblog <38646601+genmeblog@users.noreply.github.com>
Co-authored-by: GenerateMe <generateme.blog@gmail.com>
Co-authored-by: adham-omran <git@adham-omran.com>
5 people authored Sep 29, 2023
1 parent 1790609 commit e0479aa
Showing 25 changed files with 8,927 additions and 3,885 deletions.
10 changes: 0 additions & 10 deletions .dir-locals.el

This file was deleted.

40 changes: 40 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,45 @@
# Change Log

## [7.007]

### Added

* Extended documentation for `dataset` (copied from TMD), [#112](https://github.com/scicloj/tablecloth/issues/112)

### Changed

* `rows` accepts `:nil-missing?` (default: true) and `:copying?` (default: false) options.

## [7.000-beta-51]

Deps updated

## [7.000-beta-50.2]

### Added

* `:hashing` is available for single column joins too

## [7.000-beta-50.1]

### Added

* `:hashing` option determines the method of creating an index for multicolumn joins (previously `hash`, now `identity`)

### Fixed

* [#108](https://github.com/scicloj/tablecloth/issues/108) - hashing replaced with packing data into the sequence

## [7.000-beta-50]

Deps updated

## [7.000-beta-38]

### Fixed

* dataset from singleton creation generated from wrong structure

## [7.000-beta-27]

### Added
2 changes: 1 addition & 1 deletion README.Rmd
@@ -48,7 +48,7 @@ knit_engines$set(clojure = function(options) {

## Versions

### tech.ml.dataset 6.x (master branch)
### tech.ml.dataset 7.x (master branch)

[![](https://img.shields.io/clojars/v/scicloj/tablecloth)](https://clojars.org/scicloj/tablecloth)

2 changes: 1 addition & 1 deletion README.md
@@ -4,7 +4,7 @@

## Versions

### tech.ml.dataset 6.x (master branch)
### tech.ml.dataset 7.x (master branch)

[![](https://img.shields.io/clojars/v/scicloj/tablecloth)](https://clojars.org/scicloj/tablecloth)

5 changes: 1 addition & 4 deletions deps.edn
@@ -1,7 +1,4 @@
{:extra-paths ["data"]
:deps {org.clojure/clojure {:mvn/version "1.11.1"}
;; techascent/tech.ml.dataset {:mvn/version "6.103"}
techascent/tech.ml.dataset {:mvn/version "7.000-beta-27"}
;; generateme/fastmath {:mvn/version "2.1.0"}
}
techascent/tech.ml.dataset {:mvn/version "7.007"}}
:aliases {:dev {:extra-deps {org.scicloj/clay {:mvn/version "2-alpha12"}}}}}
71 changes: 60 additions & 11 deletions docs/index.Rmd
@@ -86,12 +86,18 @@ knit_engines$set(clojure = function(options) {

```{clojure include=FALSE}
(def tech-ml-version (get-in (read-string (slurp "deps.edn")) [:deps 'techascent/tech.ml.dataset :mvn/version]))
(def tablecloth-version (nth (read-string (slurp "project.clj")) 2))
```

```{clojure results="asis"}
tech-ml-version
```

```{clojure results="asis"}
tablecloth-version
```


## Introduction

[tech.ml.dataset](https://github.com/techascent/tech.ml.dataset) is a great and fast library which brings columnar datasets to Clojure. Chris Nuernberger has been working on this library for the last year as part of the bigger `tech.ml` stack.
@@ -135,7 +141,7 @@ DS

### Dataset

Dataset is a special type which can be considered as a map of columns implemented around `tech.ml.datatype` library. Each column can be considered as named sequence of typed data. Supported types include integers, floats, string, boolean, date/time, objects etc.
Dataset is a special type which can be considered as a map of columns implemented around the `tech.ml.dataset` library. Each column can be considered as a named sequence of typed data. Supported types include integers, floats, strings, booleans, date/time, objects etc.

#### Dataset creation

@@ -418,6 +424,8 @@ Possible result types:
- `:as-double-arrays` - array of double arrays
- `:as-vecs` - sequence of vectors (rows)

For `rows`, setting the `:nil-missing?` option to `false` elides keys for nil values.

---

Select column.
@@ -477,6 +485,26 @@ Rows as sequence of maps
(clojure.pprint/pprint (take 2 (tc/rows ds :as-maps)))
```

---

Rows with missing values

```{clojure}
(-> {:a [1 nil 2]
:b [3 4 nil]}
(tc/dataset)
(tc/rows :as-maps))
```

Rows with elided missing values

```{clojure}
(-> {:a [1 nil 2]
:b [3 4 nil]}
(tc/dataset)
(tc/rows :as-maps {:nil-missing? false}))
```

#### Single entry

Get single value from the table using `get-in` from Clojure API or `get-entry`. First argument is column name, second is row number.
@@ -526,12 +554,12 @@ Grouping is done by calling `group-by` function with arguments:
* `ds` - dataset
* `grouping-selector` - what to use for grouping
* options:
- `:result-type` - what to return:
* `:as-dataset` (default) - return grouped dataset
* `:as-indexes` - return rows ids (row number from original dataset)
* `:as-map` - return map with group names as keys and subdataset as values
* `:as-seq` - return sequence of subdatasets
- `:select-keys` - list of the columns passed to a grouping selector function
- `:result-type` - what to return:
* `:as-dataset` (default) - return grouped dataset
* `:as-indexes` - return rows ids (row number from original dataset)
* `:as-map` - return map with group names as keys and subdataset as values
* `:as-seq` - return sequence of subdatasets
- `:select-keys` - list of the columns passed to a grouping selector function

Each subdataset (group) has its name set to the group name; additionally, `group-id` is in its metadata.

@@ -866,7 +894,7 @@ If you want to implement your own mapping function on grouped dataset you can ca

### Columns

Column is a special `tech.ml.dataset` structure based on `tech.ml.datatype` library. For our purposes we can treat columns as typed and named sequence bound to particular dataset.
Column is a special `tech.ml.dataset` structure. For our purposes we can treat columns as typed and named sequence bound to particular dataset.

Type of the data is inferred from a sequence during column creation.

@@ -2080,9 +2108,10 @@ Missing values can be replaced using several strategies. `replace-missing` accep
* column selector, default: `:all`
* strategy, default: `:nearest`
* value (optional)
- single value
- sequence of values (cycled)
- function, applied on column(s) with stripped missings
- single value
- sequence of values (cycled)
- function, applied on column(s) with stripped missings
- map with [index,value] pairs

Strategies are:

@@ -2148,6 +2177,14 @@ Replace missing with a function (mean)

---

Replace some missing values with a map

```{clojure results="asis"}
(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})
```

---

The `:down` strategy fills gaps with values from above. You can see that if missing values are at the beginning, they are filled with the first value

```{clojure results="asis"}
@@ -3181,6 +3218,8 @@ A column selector can be a map with `:left` and `:right` keys to specify column

The differences from `tech.ml.dataset` join functions are: argument order (datasets first) and the possibility of joining on multiple columns.

Multi-column joins create a temporary index column from the column selection. The method for creating the index is determined by the `:hashing` option and defaults to `identity`. Prior to `7.000-beta-50`, the `hash` function was used, which caused hash collisions in certain cases.

Additionally set operations are defined: `intersect` and `difference`.

To concat two datasets rowwise you can choose:
@@ -3428,6 +3467,16 @@ Return rows from ds1 not matching ds2
(tc/anti-join ds2 ds1 {:left :e :right :a})
```

#### Hashing

When the `:hashing` option is used, data from the join columns are preprocessed by applying the `join-columns` function with `:result-type` set to the value of `:hashing`. This makes custom joining behaviour possible. The function used for hashing receives a vector of row values from the join columns.

In the following example we will join columns on value modulo 5.

```{clojure results="asis"}
(tc/left-join ds1 ds2 :b {:hashing (fn [[v]] (mod v 5))})
```
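
As a rough illustration of why the index key function matters, the sketch below uses plain Clojure only and makes no tablecloth calls. `index-rows` is a hypothetical helper invented for this example, not the library's implementation; it simply groups row vectors by the key a given function produces. It assumes standard Clojure hashing, under which the strings `"Aa"` and `"BB"` hash to the same value.

```clojure
;; Hypothetical helper for illustration only; not part of tablecloth.
(defn index-rows
  "Groups row vectors by the key returned from `hashing-fn`."
  [hashing-fn rows]
  (group-by hashing-fn rows))

;; Two distinct rows whose values happen to share a hash code.
(def rows [["Aa" 1] ["BB" 1]])

;; Keyed by `hash`: both rows land under one key (a collision).
(def by-hash (index-rows hash rows))

;; Keyed by `identity`: the row vector itself is the key, so rows stay apart.
(def by-identity (index-rows identity rows))

;; A custom key function receives the vector of row values,
;; here grouping single-value rows by their value modulo 5.
(def by-mod (index-rows (fn [[v]] (mod v 5)) [[1] [6] [2]]))

(println (count by-hash) (count by-identity) (sort (keys by-mod)))
```

With an `identity`-style key, rows match only when their values are equal, which is the behaviour the default is meant to give; a custom function deliberately coarsens the match, as in the modulo example.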

#### Cross

Cross product from selected columns