-
Notifications
You must be signed in to change notification settings - Fork 79
/
README.Rmd
278 lines (211 loc) · 9.46 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
---
output: md_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
# skimr <a href='https://docs.ropensci.org/skimr/'>
<img src='https://docs.ropensci.org/skimr/reference/figures/logo.png'
align="right" height="139" /></a>
```{r set-options, echo=FALSE, message=FALSE}
library(skimr)
options(tibble.width = Inf)
options(width = 100)
```
[![Project Status: Active – The project has reached a stable, usable
state and is being actively
developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/)
[![R build
status](https://github.com/ropensci/skimr/workflows/R-CMD-check/badge.svg)](https://github.com/ropensci/skimr/actions?workflow=R-CMD-check)
[![codecov](https://codecov.io/gh/ropensci/skimr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/ropensci/skimr)
[![This is an ROpenSci Peer reviewed
package](https://badges.ropensci.org/175_status.svg)](https://github.com/ropensci/software-review/issues/175)
[![CRAN\_Status\_Badge](https://www.r-pkg.org/badges/version/skimr)](https://cran.r-project.org/package=skimr)
[![cran
checks](https://badges.cranchecks.info/worst/skimr.svg)](https://badges.cranchecks.info/worst/skimr.svg)
`skimr` provides a frictionless approach to summary statistics which conforms
to the [principle of least
surprise](https://en.wikipedia.org/wiki/Principle_of_least_astonishment),
displaying summary statistics the user can skim quickly to understand their
data. It handles different data types and returns a `skim_df` object which can
be included in a pipeline or displayed nicely for the human reader.
**Note: `skimr` version 2 has major changes when skimr is used programmatically.
Upgraders should review this document, the release notes and vignettes
carefully.**
## Installation
The current released version of `skimr` can be installed from CRAN. If you wish
to install the current build of the next release you can do so using the
following:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("ropensci/skimr")
```
The APIs for this branch should be considered reasonably stable but still
subject to change if an issue is discovered.
To install the version with the most recent changes that have not yet been
incorporated in the main branch (and may not be):
```{r, eval = FALSE}
devtools::install_github("ropensci/skimr", ref = "develop")
```
Do not rely on APIs from the develop branch, as they are likely to change.
## Skim statistics in the console
`skimr`:
- Provides a larger set of statistics than `summary()`, including missing,
complete, n, and sd.
- reports each data types separately
- handles dates, logicals, and a variety of other types
- supports spark-bar and spark-line based on the
[pillar package](https://github.com/r-lib/pillar).
### Separates variables by class:
```{r, render = knitr::normal_print}
skim(chickwts)
```
### Presentation is in a compact horizontal format:
```{r, render = knitr::normal_print}
skim(iris)
```
### Built in support for strings, lists and other column classes
```{r, render = knitr::normal_print}
skim(dplyr::starwars)
```
### Has a useful summary function
```{r, render = knitr::normal_print}
skim(iris) %>%
summary()
```
### Individual columns can be selected using tidyverse-style selectors
```{r, render = knitr::normal_print}
skim(iris, Sepal.Length, Petal.Length)
```
### Handles grouped data
`skim()` can handle data that has been grouped using `dplyr::group_by()`.
```{r, render = knitr::normal_print}
iris %>%
dplyr::group_by(Species) %>%
skim()
```
### Behaves nicely in pipelines
```{r, render = knitr::normal_print}
iris %>%
skim() %>%
dplyr::filter(numeric.sd > 1)
```
## Knitted results
Simply skimming a data frame will produce the horizontal print
layout shown above. We provide a `knit_print` method for the types of objects
in this package so that similar results are produced in documents. To use this,
make sure the `skimmed` object is the last item in your code chunk.
```{r}
faithful %>%
skim()
```
## Customizing skimr
Although skimr provides opinionated defaults, it is highly customizable.
Users can specify their own statistics, change the formatting of results,
create statistics for new classes and develop skimmers for data structures
that are not data frames.
### Specify your own statistics and classes
Users can specify their own statistics using a list combined with the
`skim_with()` function factory. `skim_with()` returns a new `skim` function that
can be called on your data. You can use this factory to produce summaries for
any type of column within your data.
Assignment within a call to `skim_with()` relies on a helper function, `sfl` or
`skimr` function list. By default, functions in the `sfl` call are appended to
the default skimmers, and names are automatically generated as well.
```{}
my_skim <- skim_with(numeric = sfl(mad))
my_skim(iris, Sepal.Length)
```
But you can also helpers from the `tidyverse` to create new anonymous functions
that set particular function arguments. The behavior is the same as in `purrr`
or `dplyr`, with both `.` and `.x` as acceptable pronouns. Setting the
`append = FALSE` argument uses only those functions that you've provided.
```{}
my_skim <- skim_with(
numeric = sfl(
iqr = IQR,
p01 = ~ quantile(.x, probs = .01)
p99 = ~ quantile(., probs = .99)
),
append = FALSE
)
my_skim(iris, Sepal.Length)
```
And you can remove default skimmers by setting them to `NULL`.
```{}
my_skim <- skim_with(numeric = sfl(hist = NULL))
my_skim(iris, Sepal.Length)
```
### Skimming other objects
`skimr` has summary functions for the following types of data by default:
* `numeric` (which includes both `double` and `integer`)
* `character`
* `factor`
* `logical`
* `complex`
* `Date`
* `POSIXct`
* `ts`
* `AsIs`
`skimr` also provides a small API for writing packages that provide their own
default summary functions for data types not covered above. It relies on
R S3 methods for the `get_skimmers` function. This function should return
a `sfl`, similar to customization within `skim_with()`, but you should also
provide a value for the `class` argument. Here's an example.
```{r}
get_skimmers.my_data_type <- function(column) {
sfl(
.class = "my_data_type",
p99 = quantile(., probs = .99)
)
}
```
## Limitations of current version
We are aware that there are issues with rendering the inline histograms and
line charts in various contexts, some of which are described below.
### Support for spark histograms
There are known issues with printing the spark-histogram characters when
printing a data frame. For example, `"▂▅▇"` is printed as
`"<U+2582><U+2585><U+2587>"`. This longstanding problem [originates in
the low-level
code](https://stat.ethz.ch/pipermail/r-devel/2015-May/071250.html)
for printing dataframes.
While some cases have been addressed, there are, for example, reports of this
issue in Emacs ESS. While this is a deep issue, there is [ongoing
work to address it in base R](https://blog.r-project.org/2020/05/02/utf-8-support-on-windows/).
This means that while `skimr` can render the histograms to the console and in
RMarkdown documents, it cannot in other circumstances. This includes:
* converting a `skimr` data frame to a vanilla R data frame, but tibbles render
correctly
* in the context of rendering to a pdf using an engine that does not support
utf-8.
One workaround for showing these characters in Windows is to set the CTYPE part
of your locale to Chinese/Japanese/Korean with `Sys.setlocale("LC_CTYPE",
"Chinese")`. The helper function `fix_windows_histograms()` does this for you.
And last but not least, we provide `skim_without_charts()` as a fallback.
This makes it easy to still get summaries of your data, even if unicode issues
continue.
### Printing spark histograms and line graphs in knitted documents
Spark-bar and spark-line work in the console, but may not work when you knit
them to a specific document format. The same session that produces a correctly
rendered HTML document may produce an incorrectly rendered PDF, for example.
This issue can generally be addressed by changing fonts to one with good
building block (for histograms) and Braille support (for line graphs). For
example, the open font "DejaVu Sans" from the `extrafont` package supports
these. You may also want to try wrapping your results in `knitr::kable()`.
Please see the vignette on using fonts for details.
Displays in documents of different types will vary. For example, one user found
that the font "Yu Gothic UI Semilight" produced consistent results for
Microsoft Word and Libre Office Write.
## Inspirations
* [TextPlots](https://github.com/sunetos/TextPlots.jl) for use of Braille
characters
* [spark](https://github.com/holman/spark) for use of block characters.
The earliest use of unicode characters to generate sparklines appears to be [from 2009](https://blog.jonudell.net/2009/01/13/fuel-prices-and-pageviews/).
Exercising these ideas to their fullest requires a font with good support for block drawing characters. [PragamataPro](https://fsd.it/shop/fonts/pragmatapro/) is one such font.
## Contributing
We welcome issue reports and pull requests, including potentially adding
support for commonly used variable classes. However, in general, we encourage
users to take advantage of skimr's flexibility to add their own customized
classes. Please see the
[contributing](https://docs.ropensci.org/skimr/CONTRIBUTING.html) and
[conduct](https://ropensci.org/code-of-conduct/) documents.
[![ropenci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)