Add metadata macros (#377)
* begin adding metadata features

* tests

* some docs

* add to main docs

* deleting notes

* printing docs

* printing tests

* more docs
pdeffebach authored Feb 27, 2024
1 parent cdfb733 commit edf22c3
Showing 5 changed files with 529 additions and 1 deletion.
3 changes: 2 additions & 1 deletion Project.toml
@@ -8,14 +8,15 @@ DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
MacroTools = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
Reexport = "189a3867-3050-52da-a836-e630ba90ab69"
TableMetadataTools = "9ce81f87-eacc-4366-bf80-b621a3098ee2"

[compat]
Chain = "0.5"
DataFrames = "1"
MacroTools = "0.5"
OrderedCollections = "1"
Reexport = "0.2, 1"
julia = "1.6"
OrderedCollections = "1"

[extras]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
159 changes: 159 additions & 0 deletions docs/src/index.md
@@ -983,6 +983,165 @@ in the middle of a `@chain` block.
end
```

## Attaching variable labels and notes

A widely used and appreciated feature of the Stata data analysis
programming language is its tools for column-level metadata in the
form of labels and notes. Like Stata, Julia's data ecosystem implements a common
API for keeping track of information associated with columns. DataFramesMeta.jl
implements the `@label!` and `@note!` macros to attach information to columns.

DataFramesMeta.jl also provides two convenience functions
for examining metadata, `printlabels` and `printnotes`.

### `@label!`: For short column labels

Use `@label!` to attach short-but-informative labels to columns. For example,
a variable `:wage` might be given the label `"Wage (2015 USD)"`.

```julia
df = DataFrame(wage = [16, 25, 14, 23]);
@label! df :wage = "Wage (2015 USD)"
```

View the labels with `printlabels(df)`:

```
julia> printlabels(df)
┌────────┬─────────────────┐
│ Column │ Label           │
├────────┼─────────────────┤
│ wage   │ Wage (2015 USD) │
└────────┴─────────────────┘
```

You can access labels via the `label` function defined in TableMetadataTools.jl:

```
julia> label(df, :wage)
"Wage (2015 USD)"
```
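
As a minimal sketch of how the macro and function forms relate: the example below attaches one label with `@label!` and another with `label!`, then reads both back with `label`. The `label!(df, column, label)` call is assumed to be the function counterpart provided by TableMetadataTools.jl, and the `age` column is a hypothetical addition for illustration.

```julia
using DataFramesMeta  # per this commit, also reexports TableMetadataTools

# `age` is a hypothetical extra column used only for this illustration
df = DataFrame(wage = [16, 25, 14, 23], age = [31, 45, 27, 52])

# Macro form, as documented above
@label! df :wage = "Wage (2015 USD)"

# Function form (assumed signature: label!(df, column, label))
label!(df, :age, "Age (years)")

# Read the labels back with `label`
wage_label = label(df, :wage)
age_label = label(df, :age)
println(wage_label)
println(age_label)
```

The macro form is convenient inside a `@chain` block, while the function form may be easier to call programmatically, for example when looping over column names.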

### `@note!`: For longer column notes

While labels are useful for pretty printing and clarification of short variable
names, notes are used to give more in-depth information and describe the data
cleaning process. Unlike labels, notes can be stacked on top of one another.

Consider the cleaning process for wages, starting with the data frame:

```julia
julia> df = DataFrame(wage = [-99, 16, 14, 23, 5000])
5×1 DataFrame
 Row │ wage
     │ Int64
─────┼───────
   1 │   -99
   2 │    16
   3 │    14
   4 │    23
   5 │  5000

```

When cleaning data, you might want to do the following:

1. Record the source of the data

```
@note! df :wage = "Hourly wage from 2015 American Community Survey (ACS)"
```

2. Fix coded wages. In this example, `-99` corresponds to "no job"

```
@rtransform! df :wage = :wage == -99 ? 0 : :wage
@note! df :wage = "Individuals with no job are recorded as 0 wage"
```

We use `printnotes` to see the notes for columns.

```
julia> printnotes(df)
Column: wage
────────────
Hourly wage from 2015 American Community Survey (ACS)
Individuals with no job are recorded as 0 wage
```

You can access the notes via the `note` function:

```
julia> note(df, :wage)
"Hourly wage from 2015 American Community Survey (ACS)\nIndividuals with no job are recorded as 0 wage"
```

To remove all notes from a column, run:

```
note!(df, :wage, ""; append = false)
```
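
The append-versus-overwrite behavior can be sketched end to end as follows. This is an illustrative example that uses only the calls shown in this section (`@note!`, `note`, and `note!` with `append = false`); the variable names `stacked` and `cleared` are hypothetical.

```julia
using DataFramesMeta  # per this commit, also reexports TableMetadataTools

df = DataFrame(wage = [-99, 16, 14, 23, 5000])

# Each @note! call appends to the existing note for the column
@note! df :wage = "Hourly wage from 2015 American Community Survey (ACS)"
@note! df :wage = "Individuals with no job are recorded as 0 wage"
stacked = note(df, :wage)
println(stacked)

# Overwrite rather than append, which clears the accumulated notes
note!(df, :wage, ""; append = false)
cleared = note(df, :wage)
println(isempty(cleared))
```

Passing a non-empty string with `append = false` should, by the same logic, replace the accumulated notes with a single fresh note.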

### Printing metadata

#### `printlabels`: For printing labels

Use `printlabels` to print the labels of columns in a data frame. The optional
argument `cols` determines which columns to print, while the keyword
argument `unlabelled` controls whether to print columns without user-defined labels.

```julia-repl
julia> df = DataFrame(wage = [12], age = [23]);
julia> @label! df :wage = "Hourly wage (2015 USD)";
julia> printlabels(df)
┌────────┬────────────────────────┐
│ Column │ Label                  │
├────────┼────────────────────────┤
│ wage   │ Hourly wage (2015 USD) │
│ age    │ age                    │
└────────┴────────────────────────┘
julia> printlabels(df, [:wage, :age]; unlabelled = false)
┌────────┬────────────────────────┐
│ Column │ Label                  │
├────────┼────────────────────────┤
│ wage   │ Hourly wage (2015 USD) │
└────────┴────────────────────────┘
```

#### `printnotes`: For printing notes

Use `printnotes` to print the notes of columns in a data frame. The optional
argument `cols` determines which columns to print, while the keyword
argument `unnoted` controls whether to print columns without user-defined notes.

```julia-repl
julia> df = DataFrame(wage = [12], age = [23]);
julia> @label! df :age = "Age (years)";
julia> @note! df :wage = "Derived from American Community Survey";
julia> @note! df :wage = "Missing values imputed as 0 wage";
julia> @label! df :wage = "Hourly wage (2015 USD)";
julia> printnotes(df)
Column: wage
────────────
Label: Hourly wage (2015 USD)
Derived from American Community Survey
Missing values imputed as 0 wage
Column: age
───────────
Label: Age (years)
```


```@contents
Pages = ["api/api.md"]
Depth = 3
6 changes: 6 additions & 0 deletions src/DataFramesMeta.jl
@@ -6,10 +6,14 @@ using MacroTools

using OrderedCollections: OrderedCollections

@reexport using TableMetadataTools

@reexport using DataFrames

@reexport using Chain

using DataFrames.PrettyTables

# Basics:
export @with,
@subset, @subset!, @rsubset, @rsubset!,
@@ -21,6 +25,7 @@ export @with,
@distinct, @rdistinct, @distinct!, @rdistinct!,
@eachrow, @eachrow!,
@byrow, @passmissing, @astable, @kwarg,
@label!, @note!, printlabels, printnotes,
@groupby,
@based_on, @where # deprecated

@@ -31,5 +36,6 @@ include("parsing_astable.jl")
include("macros.jl")
include("linqmacro.jl")
include("eachrow.jl")
include("metadata.jl")

end # module