Data module #49

Merged
merged 29 commits into from
Aug 1, 2023
Changes from 25 commits
Commits
1e5e587
Shorthand api
cristineguadelupe Jul 19, 2023
031fc1d
Heatmap
cristineguadelupe Jul 21, 2023
68df4ab
Receives tuples instead of list
cristineguadelupe Jul 24, 2023
9b1ba9b
Initial tests for shorthand api
cristineguadelupe Jul 24, 2023
7f2f89f
Initial docs
cristineguadelupe Jul 24, 2023
c283eb2
Tests for heatmap
cristineguadelupe Jul 24, 2023
2dc3ca5
Initial guide
cristineguadelupe Jul 26, 2023
fca9084
Guide improvements
cristineguadelupe Jul 27, 2023
add3adf
Better heatmap example
cristineguadelupe Jul 27, 2023
9df069c
Applying suggestions
cristineguadelupe Jul 27, 2023
8f67de1
Pipe to and from mark
cristineguadelupe Jul 27, 2023
a5ecdbb
Update docs and the guide
cristineguadelupe Jul 27, 2023
4b08e64
Applying suggestions
cristineguadelupe Jul 28, 2023
665edb7
Heatmap - Raises when x or y is missing
cristineguadelupe Jul 28, 2023
f190f36
Specs
cristineguadelupe Jul 28, 2023
587fcb1
Update guides/data.livemd
cristineguadelupe Jul 31, 2023
8930b66
Guide improvements
cristineguadelupe Jul 31, 2023
01206a9
Update guides/data.livemd
cristineguadelupe Jul 31, 2023
c446e3f
Update guides/data.livemd
cristineguadelupe Jul 31, 2023
0ff092e
Update guides/data.livemd
cristineguadelupe Jul 31, 2023
336249a
Update guides/data.livemd
cristineguadelupe Jul 31, 2023
33210fe
Update guides/data.livemd
cristineguadelupe Jul 31, 2023
88255b3
Applying suggestions
cristineguadelupe Aug 1, 2023
6e6113f
Fixes annotated heatmap
cristineguadelupe Aug 1, 2023
a514dd3
Applying suggestions
cristineguadelupe Aug 1, 2023
afb4df9
Always show the VegaLite examples first
cristineguadelupe Aug 1, 2023
654f1b2
Update lib/vega_lite/data.ex
cristineguadelupe Aug 1, 2023
2918727
Update lib/vega_lite/data.ex
cristineguadelupe Aug 1, 2023
63b1f47
Update lib/vega_lite/data.ex
cristineguadelupe Aug 1, 2023
265 changes: 265 additions & 0 deletions guides/data.livemd
@@ -0,0 +1,265 @@
# VegaLite Data

```elixir
Mix.install([
  {:explorer, "~> 0.6.1"},
  {:kino, "~> 0.10.0"},
  {:vega_lite,
   git: "https://github.com/cristineguadelupe/vega_lite", branch: "cg-stats", override: true},
  {:kino_vega_lite, "~> 0.1.9"}
])
```

## Introduction

The `VegaLite.Data` module is designed to provide a shorthand API to plot commonly used charts and high-level abstractions for specialized plots.

The API can be combined with the main `VegaLite` module at any level and at any point, providing flexibility to achieve the same results in a more concise way without compromising expressiveness.

Throughout this guide, we will look at how to use the API alone, in combination with the `VegaLite` module, and also show some comparisons between all the possible paths to achieve the same plotting results.

**Limitations**: `VegaLite.Data` relies on internal type inference, and although all options may be overridden, only data that implements the `Table.Reader` protocol is supported.
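
As a rough illustration of what type inference means here, the sketch below maps a sample value from each column to a VegaLite field type. `InferSketch` and its rules are hypothetical and written only for this guide; the actual inference inside `VegaLite.Data` may differ.

```elixir
defmodule InferSketch do
  # Hypothetical helper, NOT part of VegaLite.Data.
  # Maps a sample value to a VegaLite field type.
  def type_of(value) when is_number(value), do: :quantitative
  def type_of(%Date{}), do: :temporal
  def type_of(%DateTime{}), do: :temporal
  def type_of(%NaiveDateTime{}), do: :temporal
  def type_of(value) when is_binary(value), do: :nominal
  def type_of(_other), do: nil

  # Infers a type for one column of row-based data,
  # using the first truthy value found for that field.
  def infer(rows, field) do
    rows
    |> Enum.find_value(fn row -> Map.get(row, field) end)
    |> type_of()
  end
end

rows = [
  %{"category" => "A", "score" => 28},
  %{"category" => "B", "score" => 50}
]

InferSketch.infer(rows, "score")
InferSketch.infer(rows, "category")
```

Under these assumed rules, numeric columns are treated as quantitative and string columns as nominal, which matches the explicit types used in the longhand examples of this guide.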

For meaningful examples, we will use the *fuels* dataset directly from `Explorer` and one additional `data` variable:

```elixir
alias Explorer.DataFrame, as: DF
alias VegaLite, as: Vl
alias VegaLite.Data

fuels = Explorer.Datasets.fossil_fuels()

data = [
  %{"category" => "A", "score" => 28},
  %{"category" => "B", "score" => 50},
  %{"category" => "C", "score" => 34},
  %{"category" => "D", "score" => 42},
  %{"category" => "E", "score" => 39}
]
```

## Chart - the shorthand API

`VegaLite.Data.chart/3` and `VegaLite.Data.chart/4` form the shorthand API. We will use these functions to get quick and concise plots. They are best suited to plots that don't require much configuration or customization.

`VegaLite.Data.chart/3` takes 3 arguments: the data, the mark and a list of fields to encode.

```elixir
# A simple bar plot using the shorthand api
Data.chart(data, :bar, x: "category", y: "score")
```

```elixir
# The same chart without the shorthand api
Vl.new()
|> Vl.data_from_values(data)
|> Vl.mark(:bar)
|> Vl.encode_field(:y, "score", type: :quantitative)
|> Vl.encode_field(:x, "category", type: :nominal)
```

Plotting a simple chart is a breeze! As the comparison above shows, the code becomes much more concise. The API also accepts a list of options in place of each argument, which allows more complex charts.

```elixir
# A line plot with point: true without the shorthand api
Vl.new()
|> Vl.data_from_values(fuels, only: ["total", "solid_fuel"])
|> Vl.mark(:line, point: true)
|> Vl.encode_field(:x, "total", type: :quantitative)
|> Vl.encode_field(:y, "solid_fuel", type: :quantitative)
```

```elixir
# A line plot with point: true using the shorthand api
Data.chart(fuels, [type: :line, point: true], x: "total", y: "solid_fuel")
```
Contributor:

Here the Data example comes last, but in the previous example the Data example comes first. Should we make it consistent?


`VegaLite.Data.chart/4` works similarly but takes a valid `VegaLite` specification as the first argument. Let's see a bit of interoperability between the Data API and the main module. We'll plot the same line chart but now with a title and a custom width.

```elixir
# Without the shorthand api
Vl.new(title: "Fuels", width: 400)
|> Vl.data_from_values(fuels, only: ["total", "solid_fuel"])
|> Vl.mark(:line, point: true)
|> Vl.encode_field(:x, "total", type: :quantitative)
|> Vl.encode_field(:y, "solid_fuel", type: :quantitative)
```

```elixir
# With the shorthand api
Vl.new(title: "Fuels", width: 400)
|> Data.chart(fuels, [type: :line, point: true], x: "total", y: "solid_fuel")
```

If a channel requires more configuration, the flexibility of the API comes into play.

```elixir
Vl.new(width: 500, height: 300, title: "Fuels")
|> Vl.data_from_values(fuels, only: ["total", "solid_fuel"])
|> Vl.mark(:point)
|> Vl.encode_field(:x, "total", type: :quantitative)
|> Vl.encode_field(:y, "solid_fuel", type: :quantitative)
|> Vl.encode_field(:color, "total", type: :quantitative, scale: [scheme: "category10"])
```

In the example above, the color channel requires more customization. While it's possible to get the exact same plot using only the shorthand API, expressiveness may suffer. It's precisely in these cases that using the API together with the main module will probably result in more readable code. Let's compare the possible combinations of the API and the `VegaLite` module.

```elixir
# Using mainly the shorthand api
Vl.new(width: 500, height: 300, title: "Combined")
|> Data.chart(fuels, :point,
  x: "total",
  y: "solid_fuel",
  color: [field: "total", type: :quantitative, scale: [scheme: "category10"]]
)
```
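
One way to read the option-list form used above: each shorthand entry is either a bare field name or a keyword list with a `:field` key plus extra options. The sketch below normalizes both forms into `{channel, field, opts}` tuples, which line up with the arguments of `Vl.encode_field/4`. `FieldsSketch` is a hypothetical helper written for illustration, not the implementation used by `VegaLite.Data`.

```elixir
defmodule FieldsSketch do
  # Hypothetical normalizer, NOT the library's implementation.
  # Turns shorthand entries into {channel, field, opts} tuples.
  def normalize(fields) do
    Enum.map(fields, fn
      # Bare field name: no extra options.
      {channel, field} when is_binary(field) ->
        {channel, field, []}

      # Option list: pull out :field, keep the rest as options.
      {channel, opts} when is_list(opts) ->
        {field, rest} = Keyword.pop!(opts, :field)
        {channel, field, rest}
    end)
  end
end

FieldsSketch.normalize(
  x: "total",
  y: "solid_fuel",
  color: [field: "total", type: :quantitative, scale: [scheme: "category10"]]
)
```

Seen this way, the shorthand entry for `color` carries the same information as the explicit `Vl.encode_field(:color, "total", ...)` call in the longhand version.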

```elixir
# Piping the shorthand api into an encode_field
Vl.new(width: 500, height: 300, title: "Fuels")
|> Data.chart(fuels, :point, x: "total", y: "solid_fuel")
|> Vl.encode_field(:color, "total", type: :quantitative, scale: [scheme: "category10"])
```

Contributor:

There is an issue with this approach. This code only works because both `x` and `color` use "total". If the color were encoded with another field, that field would not be included in `only`. Thoughts?

Contributor Author:

Good catch! We rely on `Table.Reader`, so it might be better to remove the implicit `only` and allow passing it as an option of the data argument. Or maybe make it implicit by default and use the option to override it. 🤔

Contributor:

I think we can have one option called `additional_fields` or something?

Contributor Author:

I prefer `only` for consistency, like in values_from

Contributor Author:

`Data.chart([data: fuels, only: …], :point, x: "total", y: "solid_fuel")`
And that should be the only argument accepted for data

Contributor Author:

Also, you said nothing about whether or not to have a default…

Contributor:

We should have a default, for sure. But I was wondering if we could have that as part of fields somehow:

`Data.chart(fuels, :point, x: "total", y: "solid_fuel", additional_fields: [:foo, :bar])`

Another idea is to move the logic into VegaLite itself. We could support `only: :lazy` and compute it at the time we build the JSON.

Contributor:

Let's merge this without tackling this problem and please open up an issue so we do it next. :)

Contributor Author:

Ok!

As we can see, the API is flexible enough to be piped from `VegaLite`, piped to `VegaLite`, or both! In principle, you are free to choose the code that best suits your needs, ideally aiming for a balance between conciseness and expressiveness.
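
The concern raised in review about the implicit `only` can be reproduced in plain Elixir. The sketch below mimics a shorthand call that keeps only the columns named in its fields, then checks whether a column needed by a later encoding is still present. The rows are made-up illustrative values, and treating `"liquid_fuel"` as the extra column is an assumption made for this example.

```elixir
# Made-up illustrative rows, not real values from the fuels dataset.
rows = [
  %{"total" => 10, "solid_fuel" => 4, "liquid_fuel" => 6},
  %{"total" => 20, "solid_fuel" => 9, "liquid_fuel" => 11}
]

# Fields referenced by the shorthand call: x and y only.
shorthand_fields = [x: "total", y: "solid_fuel"]
only = Keyword.values(shorthand_fields)

# Mimic an implicit `only`: keep just the referenced columns.
kept = Enum.map(rows, &Map.take(&1, only))

# A later color encoding on "liquid_fuel" would find no data for it.
missing? = Enum.all?(kept, fn row -> not Map.has_key?(row, "liquid_fuel") end)
```

Here `missing?` is `true`: piping into an encoding that uses a column outside the shorthand fields only works if that column is kept in the data.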

## Specialized plots

Specialized plots provide high-level abstractions for commonly used complex charts.

### Heatmap

Plotting heatmaps directly from VegaLite requires a lot of code.

For a more concrete example, we will use precomputed data from the correlation matrix of the wine dataset.

<!-- livebook:{"disable_formatting":true} -->

```elixir
corr_to_plot = %{
"corr_val" => [1.0, -0.02, 0.29, 0.09, 0.02, -0.05, 0.09, 0.27, -0.43, -0.02,
-0.12, -0.11, -0.02, 1.0, -0.15, 0.06, 0.07, -0.1, 0.09, 0.03, -0.03, -0.04,
0.07, -0.19, 0.29, -0.15, 1.0, 0.09, 0.11, 0.09, 0.12, 0.15, -0.16, 0.06,
-0.08, -0.01, 0.09, 0.06, 0.09, 1.0, 0.09, 0.3, 0.4, 0.84, -0.19, -0.03,
-0.45, -0.1, 0.02, 0.07, 0.11, 0.09, 1.0, 0.1, 0.2, 0.26, -0.09, 0.02, -0.36,
-0.21, -0.05, -0.1, 0.09, 0.3, 0.1, 1.0, 0.62, 0.29, 0.0, 0.06, -0.25, 0.01,
0.09, 0.09, 0.12, 0.4, 0.2, 0.62, 1.0, 0.53, 0.0, 0.13, -0.45, -0.17, 0.27,
0.03, 0.15, 0.84, 0.26, 0.29, 0.53, 1.0, -0.09, 0.07, -0.78, -0.31, -0.43,
-0.03, -0.16, -0.19, -0.09, 0.0, 0.0, -0.09, 1.0, 0.16, 0.12, 0.1, -0.02,
-0.04, 0.06, -0.03, 0.02, 0.06, 0.13, 0.07, 0.16, 1.0, -0.02, 0.05, -0.12,
0.07, -0.08, -0.45, -0.36, -0.25, -0.45, -0.78, 0.12, -0.02, 1.0, 0.44,
-0.11, -0.19, -0.01, -0.1, -0.21, 0.01, -0.17, -0.31, 0.1, 0.05, 0.44, 1.0],
"x" => ["fixed acidity", "volatile acidity", "citric acid", "residual sugar",
"chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH",
"sulphates", "alcohol", "quality", "fixed acidity", "volatile acidity",
"citric acid", "residual sugar", "chlorides", "free sulfur dioxide",
"total sulfur dioxide", "density", "pH", "sulphates", "alcohol", "quality",
"fixed acidity", "volatile acidity", "citric acid", "residual sugar",
"chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH",
"sulphates", "alcohol", "quality", "fixed acidity", "volatile acidity",
"citric acid", "residual sugar", "chlorides", "free sulfur dioxide",
"total sulfur dioxide", "density", "pH", "sulphates", "alcohol", "quality",
"fixed acidity", "volatile acidity", "citric acid", "residual sugar",
"chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH",
"sulphates", "alcohol", "quality", "fixed acidity", "volatile acidity",
"citric acid", "residual sugar", "chlorides", "free sulfur dioxide",
"total sulfur dioxide", "density", "pH", "sulphates", "alcohol", "quality",
"fixed acidity", "volatile acidity", "citric acid", "residual sugar",
"chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH",
"sulphates", "alcohol", "quality", "fixed acidity", "volatile acidity",
"citric acid", "residual sugar", "chlorides", "free sulfur dioxide",
"total sulfur dioxide", "density", "pH", "sulphates", "alcohol", "quality",
"fixed acidity", "volatile acidity", "citric acid", "residual sugar",
"chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH",
"sulphates", "alcohol", "quality", "fixed acidity", "volatile acidity",
"citric acid", "residual sugar", "chlorides", "free sulfur dioxide",
"total sulfur dioxide", "density", "pH", "sulphates", "alcohol", "quality",
"fixed acidity", "volatile acidity", "citric acid", "residual sugar",
"chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH",
"sulphates", "alcohol", "quality", "fixed acidity", "volatile acidity",
"citric acid", "residual sugar", "chlorides", "free sulfur dioxide",
"total sulfur dioxide", "density", "pH", "sulphates", "alcohol", "quality"],
"y" => ["fixed acidity", "fixed acidity", "fixed acidity", "fixed acidity",
"fixed acidity", "fixed acidity", "fixed acidity", "fixed acidity",
"fixed acidity", "fixed acidity", "fixed acidity", "fixed acidity",
"volatile acidity", "volatile acidity", "volatile acidity",
"volatile acidity", "volatile acidity", "volatile acidity",
"volatile acidity", "volatile acidity", "volatile acidity",
"volatile acidity", "volatile acidity", "volatile acidity", "citric acid",
"citric acid", "citric acid", "citric acid", "citric acid", "citric acid",
"citric acid", "citric acid", "citric acid", "citric acid", "citric acid",
"citric acid", "residual sugar", "residual sugar", "residual sugar",
"residual sugar", "residual sugar", "residual sugar", "residual sugar",
"residual sugar", "residual sugar", "residual sugar", "residual sugar",
"residual sugar", "chlorides", "chlorides", "chlorides", "chlorides",
"chlorides", "chlorides", "chlorides", "chlorides", "chlorides", "chlorides",
"chlorides", "chlorides", "free sulfur dioxide", "free sulfur dioxide",
"free sulfur dioxide", "free sulfur dioxide", "free sulfur dioxide",
"free sulfur dioxide", "free sulfur dioxide", "free sulfur dioxide",
"free sulfur dioxide", "free sulfur dioxide", "free sulfur dioxide",
"free sulfur dioxide", "total sulfur dioxide", "total sulfur dioxide",
"total sulfur dioxide", "total sulfur dioxide", "total sulfur dioxide",
"total sulfur dioxide", "total sulfur dioxide", "total sulfur dioxide",
"total sulfur dioxide", "total sulfur dioxide", "total sulfur dioxide",
"total sulfur dioxide", "density", "density", "density", "density",
"density", "density", "density", "density", "density", "density", "density",
"density", "pH", "pH", "pH", "pH", "pH", "pH", "pH", "pH", "pH", "pH", "pH",
"pH", "sulphates", "sulphates", "sulphates", "sulphates", "sulphates",
"sulphates", "sulphates", "sulphates", "sulphates", "sulphates", "sulphates",
"sulphates", "alcohol", "alcohol", "alcohol", "alcohol", "alcohol",
"alcohol", "alcohol", "alcohol", "alcohol", "alcohol", "alcohol", "alcohol",
"quality", "quality", "quality", "quality", "quality", "quality", "quality",
"quality", "quality", "quality", "quality", "quality"]
}
|> Explorer.DataFrame.new()
```
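
The long-format columns above follow a regular pattern: `"x"` cycles through the labels for each row of the matrix, while `"y"` repeats the row label. A minimal sketch of that reshaping, using a hypothetical `LongFormSketch` helper (not part of `VegaLite` or `Explorer`) and only the two-by-two alcohol/quality corner of the matrix:

```elixir
defmodule LongFormSketch do
  # Hypothetical helper, NOT part of VegaLite or Explorer.
  # Flattens a square matrix into "x", "y" and "corr_val" columns.
  def from_matrix(labels, matrix) do
    cells =
      for {y, row} <- Enum.zip(labels, matrix),
          {x, value} <- Enum.zip(labels, row) do
        {x, y, value}
      end

    %{
      "x" => Enum.map(cells, &elem(&1, 0)),
      "y" => Enum.map(cells, &elem(&1, 1)),
      "corr_val" => Enum.map(cells, &elem(&1, 2))
    }
  end
end

labels = ["alcohol", "quality"]
matrix = [[1.0, 0.44], [0.44, 1.0]]

LongFormSketch.from_matrix(labels, matrix)
```

The resulting map has the same shape as `corr_to_plot` before it is passed to `Explorer.DataFrame.new/1`.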

```elixir
Vl.new(title: "Correlation matrix", width: 600, height: 600)
|> Vl.layers([
  Vl.new()
  |> Vl.data_from_values(corr_to_plot)
  |> Vl.mark(:rect)
  |> Vl.encode_field(:x, "x", type: :nominal)
  |> Vl.encode_field(:y, "y", type: :nominal)
  |> Vl.encode_field(:color, "corr_val", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(corr_to_plot)
  |> Vl.mark(:text)
  |> Vl.encode_field(:x, "x", type: :nominal)
  |> Vl.encode_field(:y, "y", type: :nominal)
  |> Vl.encode_field(:text, "corr_val", type: :quantitative)
])
```

We can use our already explored shorthand API to simplify it.

```elixir
Vl.new(title: "Correlation matrix", width: 600, height: 600)
|> Vl.layers([
  Data.chart(corr_to_plot, :rect,
    x: [field: "x", type: :nominal],
    y: [field: "y", type: :nominal],
    color: "corr_val"
  ),
  Data.chart(corr_to_plot, :text,
    x: [field: "x", type: :nominal],
    y: [field: "y", type: :nominal],
    text: "corr_val"
  )
])
```

Or we can go even further and use the `VegaLite.Data.heatmap/2` function alone or the `VegaLite.Data.heatmap/3` function in combination with `VegaLite`.

The specialized plots follow the same principle as the shorthand API: they can be combined with the main module, and each argument can also take a list of options to override the defaults.

```elixir
Vl.new(title: "Correlation matrix", width: 600, height: 600)
|> Data.heatmap(corr_to_plot,
  x: "x",
  y: "y",
  color: "corr_val",
  text: "corr_val"
)
```