Continue adding Metadata to dataframes #1458

Closed
wants to merge 88 commits into from
88 commits
ec32311
Make `describe` return a DataFrame
pdeffebach Apr 7, 2018
3ac8af1
Delete REQUIRE
pdeffebach Apr 7, 2018
c0227e9
Add files via upload
pdeffebach Apr 7, 2018
a6fa29e
fix rowvectors
pdeffebach Apr 7, 2018
d9621b4
Update abstractdataframe.jl
pdeffebach Apr 7, 2018
fc6ef8c
Update abstractdataframe.jl
pdeffebach Apr 7, 2018
c477684
Update abstractdataframe.jl
pdeffebach Apr 7, 2018
215233f
Get rid of describe tests
pdeffebach Apr 7, 2018
21936e1
delete tests
pdeffebach Apr 7, 2018
ab120a4
Merge remote-tracking branch 'origin/master'
pdeffebach May 1, 2018
9f80ddf
Add improved described
pdeffebach May 12, 2018
bc3f2fa
Improve time for describe
pdeffebach May 12, 2018
e29de8b
Fix colstats vector in kw
pdeffebach May 12, 2018
935d2f8
Fix kw again
pdeffebach May 12, 2018
6561d89
Improve kwargs closure
pdeffebach May 12, 2018
f9b5e77
Fix NUnique error
pdeffebach May 12, 2018
01d9f86
Add docstring
pdeffebach May 12, 2018
01d836d
edit docstring
pdeffebach May 12, 2018
9f69461
Merge pull request #1 from pdeffebach/describe_to_dataframe
pdeffebach May 12, 2018
c891344
finall fix noNunique error
pdeffebach May 12, 2018
5ef7a16
more fixes
pdeffebach May 12, 2018
d9db833
fuck it more changes
pdeffebach May 12, 2018
4c54110
Respond to nalimilan's comments
pdeffebach May 13, 2018
ebfa988
fix stats
pdeffebach May 13, 2018
fb3ca98
fix stats again
pdeffebach May 13, 2018
5d410e6
fix test
pdeffebach May 13, 2018
e18be15
fix REQUIRE
pdeffebach May 15, 2018
fe02fbe
Respond to nalimilan's comments 2
pdeffebach May 15, 2018
707fea7
fix indentation on test
pdeffebach May 15, 2018
d163c5e
added bad stuff
pdeffebach May 25, 2018
36f18a9
fix test
pdeffebach May 25, 2018
c81bde3
Undo all the stupid things I had earlier
pdeffebach May 25, 2018
bf6b97a
Update tests and comments
pdeffebach May 25, 2018
6537afc
Fix indentation
pdeffebach May 25, 2018
820df36
Add back in description in docstring
pdeffebach May 25, 2018
451e474
fix space
pdeffebach May 25, 2018
a6e8f3b
Respond to Milan's comments 3
pdeffebach May 27, 2018
a1b1d6e
Respond to Milan 4
pdeffebach May 29, 2018
eb48ec7
Merge branch 'master' into master
nalimilan May 31, 2018
b5777a4
Add type-agnistic get_stats functions
pdeffebach Jun 6, 2018
8c2808e
Merge branch 'master' of https://github.com/JuliaData/DataFrames.jl
pdeffebach Jun 6, 2018
f61db94
Merge pull request #2 from pdeffebach/MetaData
pdeffebach Jun 6, 2018
e22d392
Add nomissing for new try...catch arguments
pdeffebach Jun 6, 2018
29d8a61
Merge branch 'MetaData'
pdeffebach Jun 6, 2018
5ae1eed
Merge remote-tracking branch 'origin/master'
pdeffebach Jun 6, 2018
22f2c3a
Add nunique to default, added optional last and first
pdeffebach Jun 6, 2018
22767a8
Add nunique to default, added optional last and first
pdeffebach Jun 6, 2018
56ece14
Merge remote-tracking branch 'origin/MetaData' into MetaData
pdeffebach Jun 6, 2018
a8c8fbb
Merge branch 'MetaData'
pdeffebach Jun 6, 2018
1dfc0a1
Add deprecation warning, change docstring
pdeffebach Jun 8, 2018
ba07f02
add deprecation warning
pdeffebach Jun 8, 2018
2f4f8e6
Add metadata without touching Index
pdeffebach Jun 12, 2018
efb5ee4
:
pdeffebach Jun 12, 2018
97b50e9
fix isequal use in tests.
pdeffebach Jun 12, 2018
53f6ad7
Respond to comments about deprecations and :all
pdeffebach Jun 13, 2018
d638068
Fix eltype call and some comments
pdeffebach Jun 13, 2018
33859c6
Make there be only one describe definition
pdeffebach Jun 13, 2018
d93e537
change :all to symbol argument
pdeffebach Jun 14, 2018
708a4fb
Fix docs for `describe`
pdeffebach Jun 14, 2018
98d082a
Small fixes
nalimilan Jun 14, 2018
cdc6e3b
trim whitespace in docstring
pdeffebach Jun 14, 2018
fc780da
Change error handling for symbol kw
pdeffebach Jun 14, 2018
da3bb6b
Merge branch 'master' of https://github.com/pdeffebach/DataFrames.jl …
pdeffebach Jun 14, 2018
3f68f09
Merge remote-tracking branch 'origin/master' into describe_changes
pdeffebach Jun 14, 2018
de73135
Progress with metadata, add test
pdeffebach Jun 15, 2018
06354fe
More fixes
pdeffebach Jun 15, 2018
4cec218
Change to isa, only generate error message on error.
pdeffebach Jun 15, 2018
e82dded
Merge branch 'master' of https://github.com/pdeffebach/DataFrames.jl …
pdeffebach Jun 15, 2018
6b73913
Addded merge operation to dataframes.jl
pdeffebach Jun 15, 2018
8dc227e
Merge remote-tracking branch 'JuliaData/master'
pdeffebach Jun 15, 2018
4f09f74
Merge remote-tracking branch 'JuliaData/master'
pdeffebach Jun 25, 2018
7478daa
respond to milan's comments 1: use Any etc.
pdeffebach Jun 25, 2018
e11ec8b
Merge remote-tracking branch 'JuliaData/master'
pdeffebach Jun 30, 2018
174b924
Respond to Milan 2
pdeffebach Jul 8, 2018
6a8f706
Merge remote-tracking branch 'JuliaData/master'
pdeffebach Jul 8, 2018
8a43666
Update docs for new `describe` (#1442)
pdeffebach Jul 8, 2018
0c15f20
make REPL printing of `nothing` an `empty string` (#1444)
pdeffebach Jul 11, 2018
c9a8a3b
Move Missing to it's own page in the Docs (#1415)
oxinabox Jul 11, 2018
12d9835
Add section headings and row-by-row construction example (#1416)
oxinabox Jul 11, 2018
8a8efd1
make nothing printing not expand size
pdeffebach Jul 14, 2018
40b0021
respond to milan 3
pdeffebach Jul 14, 2018
8ea657c
add docs...
pdeffebach Jul 14, 2018
94445ba
add docs for missing
pdeffebach Jul 14, 2018
d48132c
remove missings.md
pdeffebach Jul 14, 2018
75cdb25
respond to milan 3
pdeffebach Jul 14, 2018
a223ece
just metadata changes
pdeffebach Jul 14, 2018
962afa0
just metadat changes 2
pdeffebach Jul 14, 2018
e160375
Merge remote-tracking branch 'origin/metadata' into metadata
pdeffebach Jul 16, 2018
5 changes: 5 additions & 0 deletions src/DataFrames.jl
@@ -69,6 +69,10 @@ export AbstractDataFrame,
tail,
permutecols!,

metadata!,
metadata,
showmeta,

# Remove after deprecation period
pool,
pool!
@@ -82,6 +86,7 @@ export AbstractDataFrame,

include("other/utils.jl")
include("other/index.jl")
include("other/metadata.jl")

include("abstractdataframe/abstractdataframe.jl")
include("dataframe/dataframe.jl")
5 changes: 5 additions & 0 deletions src/abstractdataframe/join.jl
@@ -100,6 +100,11 @@ function compose_joined_table(joiner::DataFrameJoiner, kind::Symbol,
copyto!(cols[i+ncleft], view(col, all_orig_right_ixs))
permute!(cols[i+ncleft], right_perm)
end
# To do:
# 1. Make a new metadata that is append(metadata(joiner.dfl), metadata(dfr_noon))
# 2. Add a constructor so that the joined data frame can be built with that metadata
# 3. Long term, add optional arguments to choose which metadata gets put in.
# Haven't added this yet because I only want to focus on dataframe/dataframe.jl for now.
res = DataFrame(cols, vcat(names(joiner.dfl), names(dfr_noon)), makeunique=makeunique)

if length(rightonly_ixs.join) > 0
74 changes: 64 additions & 10 deletions src/dataframe/dataframe.jl
@@ -82,18 +82,19 @@ size(df1)
mutable struct DataFrame <: AbstractDataFrame
columns::Vector
colindex::Index
metadata::MetaData

- function DataFrame(columns::Vector{Any}, colindex::Index)
+ function DataFrame(columns::Vector{Any}, colindex::Index, metadata::MetaData=MetaData())
if length(columns) == length(colindex) == 0
- return new(Vector{Any}(undef, 0), Index())
+ return new(Vector{Any}(undef, 0), Index(), metadata)
elseif length(columns) != length(colindex)
throw(DimensionMismatch("Number of columns ($(length(columns))) and number of" *
" column names ($(length(colindex))) are not equal"))
end
lengths = [isa(col, AbstractArray) ? length(col) : 1 for col in columns]
minlen, maxlen = extrema(lengths)
if minlen == 0 && maxlen == 0
- return new(columns, colindex)
+ return new(columns, colindex, metadata)
elseif minlen != maxlen || minlen == maxlen == 1
# recycle scalars
for i in 1:length(columns)
@@ -116,8 +117,9 @@ mutable struct DataFrame <: AbstractDataFrame
throw(DimensionMismatch("columns must be 1-dimensional"))
end
end
- new(columns, colindex)
+ new(columns, colindex, metadata)
end

end

function DataFrame(pairs::Pair{Symbol,<:Any}...; makeunique::Bool=false)::DataFrame
@@ -223,6 +225,7 @@ end

index(df::DataFrame) = getfield(df, :colindex)
columns(df::DataFrame) = getfield(df, :columns)
metadata(df::DataFrame) = getfield(df, :metadata)

# TODO: Remove these
nrow(df::DataFrame) = ncol(df) > 0 ? length(columns(df)[1])::Int : 0
@@ -263,7 +266,7 @@ end
function Base.getindex(df::DataFrame, col_inds::AbstractVector)
selected_columns = index(df)[col_inds]
new_columns = columns(df)[selected_columns]
- return DataFrame(new_columns, Index(_names(df)[selected_columns]))
+ return DataFrame(new_columns, Index(_names(df)[selected_columns]), metadata(df)[selected_columns])
end

# df[:] => DataFrame
@@ -279,7 +282,7 @@ end
function Base.getindex(df::DataFrame, row_ind::Real, col_inds::AbstractVector)
selected_columns = index(df)[col_inds]
new_columns = Any[dv[[row_ind]] for dv in columns(df)[selected_columns]]
- return DataFrame(new_columns, Index(_names(df)[selected_columns]))
+ return DataFrame(new_columns, Index(_names(df)[selected_columns]), metadata(df)[selected_columns])
end

# df[MultiRowIndex, SingleColumnIndex] => AbstractVector
@@ -292,7 +295,7 @@ end
function Base.getindex(df::DataFrame, row_inds::AbstractVector, col_inds::AbstractVector)
selected_columns = index(df)[col_inds]
new_columns = Any[dv[row_inds] for dv in columns(df)[selected_columns]]
- return DataFrame(new_columns, Index(_names(df)[selected_columns]))
+ return DataFrame(new_columns, Index(_names(df)[selected_columns]), metadata(df)[selected_columns])
end

# df[:, SingleColumnIndex] => AbstractVector
@@ -305,7 +308,7 @@ Base.getindex(df::DataFrame, row_ind::Real, col_inds::Colon) = df[[row_ind], col
# df[MultiRowIndex, :] => DataFrame
function Base.getindex(df::DataFrame, row_inds::AbstractVector, col_inds::Colon)
new_columns = Any[dv[row_inds] for dv in columns(df)]
- return DataFrame(new_columns, copy(index(df)))
+ return DataFrame(new_columns, copy(index(df)), copy(metadata(df)))
end

# df[:, :] => DataFrame
@@ -344,10 +347,12 @@ function insert_single_column!(df::DataFrame,
if typeof(col_ind) <: Symbol
push!(index(df), col_ind)
push!(columns(df), dv)
push!(metadata(df), nothing)
else
if ncol(df) + 1 == Int(col_ind)
push!(index(df), nextcolname(df))
push!(columns(df), dv)
push!(metadata(df), nothing)
else
throw(ArgumentError("Cannot assign to non-existent column: $col_ind"))
end
@@ -606,6 +611,7 @@ function Base.setindex!(df::DataFrame,
col_inds::Colon=Colon())
setfield!(df, :columns, copy(columns(new_df)))
setfield!(df, :colindex, copy(index(new_df)))
setfield!(df, :metadata, copy(metadata(new_df)))
df
end

@@ -709,6 +715,7 @@ function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector, name::S
end
insert!(index(df), col_ind, name)
insert!(columns(df), col_ind, item)
insert!(metadata(df), col_ind, nothing)
df
end

@@ -749,6 +756,7 @@ merge!(df, df2) # column z is added, column id is overwritten
"""
function Base.merge!(df::DataFrame, others::AbstractDataFrame...)
for other in others
merge!(metadata(df), metadata(other), index(df), index(other))
for n in _names(other)
df[n] = other[n]
end
@@ -764,12 +772,12 @@ end

# A copy of a DataFrame points to the original column vectors but
# gets its own Index.
- Base.copy(df::DataFrame) = DataFrame(copy(columns(df)), copy(index(df)))
+ Base.copy(df::DataFrame) = DataFrame(copy(columns(df)), copy(index(df)), copy(metadata(df)))

# Deepcopy is recursive -- if a column is a vector of DataFrames, each of
# those DataFrames is deepcopied.
function Base.deepcopy(df::DataFrame)
- DataFrame(deepcopy(columns(df)), deepcopy(index(df)))
+ DataFrame(deepcopy(columns(df)), deepcopy(index(df)), deepcopy(metadata(df)))
end

##############################################################################
@@ -1100,9 +1108,55 @@ function permutecols!(df::DataFrame, p::AbstractVector)
throw(ArgumentError("$p is not a valid column permutation for this DataFrame"))
end
permute!(columns(df), p)
permute!(metadata(df), p)
setfield!(df, :colindex, Index(names(df)[p]))
end

function permutecols!(df::DataFrame, p::AbstractVector{Symbol})
permutecols!(df, getindex.(index(df).lookup, p))
end


##############################################################################
##
## Set and Get MetaData
##
##############################################################################

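"""
metadata!(df::DataFrame, var::Symbol, field::Symbol, info)

Set the metadata entry `field` of column `var` in `df` to `info`,
creating the field if it does not exist yet. Returns `df`.
"""
function metadata!(df::DataFrame, var::Symbol, field::Symbol, info)
# Go through the `metadata(df)` accessor rather than `df.metadata`,
# so this does not depend on property access reaching the struct field.
addmeta!(metadata(df), index(df)[var], ncol(df), field, info)
return df
end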

"""
metadata(df::DataFrame, var::Symbol, field::Symbol)

Return the metadata entry `field` for column `var` of `df`.
"""
function metadata(df::DataFrame, var::Symbol, field::Symbol)
metadata(df).dict[field][index(df)[var]]
end

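"""
showmeta(df::DataFrame, fields=collect(keys(metadata(df).dict)))

Return a DataFrame with one row per column of `df` and one column per
requested metadata field. `fields` may be a single `Symbol` or a vector of them.
"""
function showmeta(df::DataFrame, fields::Union{Symbol, Vector{Symbol}}=collect(keys(metadata(df).dict)))

if fields isa Symbol
fields = [fields]
end

d = DataFrame(variable = names(df))

for field in fields
# `getmeta` is defined on `MetaData`, not on `DataFrame`, so look up
# each column through `metadata(df, name, field)` instead.
d[field] = [metadata(df, name, field) for name in names(df)]
end

d
end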
95 changes: 95 additions & 0 deletions src/other/metadata.jl
@@ -0,0 +1,95 @@
# Defining behavior for DataFrames metadata
struct MetaData
dict::Dict{Symbol, Vector}
end

MetaData() = MetaData(Dict{Symbol,Vector}())

Base.isequal(x::MetaData, y::MetaData) = isequal(x.dict, y.dict)
Base.:(==)(x::MetaData, y::MetaData) = isequal(x, y)

Base.copy(x::MetaData) = MetaData(copy(x.dict))
Base.deepcopy(x::MetaData) = MetaData(deepcopy(x.dict)) # also copies the per-field vectors

function Base.getindex(x::MetaData, col_inds::AbstractVector)
new_dict = copy(x.dict)
for key in keys(new_dict)
new_dict[key] = new_dict[key][col_inds]
end
MetaData(new_dict)
end

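function Base.permute!(x::MetaData, p::AbstractVector)
for key in keys(x.dict)
# Index instead of calling `permute!` on the stored vector, so that
# vectors shared with a shallow copy are left untouched.
x.dict[key] = x.dict[key][p]
end
x
end

# Not-in-place version. `Base` has no `permute` function to extend,
# so this is defined as a local function.
function permute(x::MetaData, p::AbstractVector)
permute!(copy(x), p)
end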


function newfield!(x::MetaData, ncol::Int, field::Symbol, info)
x.dict[field] = Union{typeof(info), Nothing}[nothing for i in 1:ncol]
end

function addmeta!(x::MetaData, col_ind::Int, ncol::Int, field::Symbol, info)
if !haskey(x.dict, field)
newfield!(x, ncol, field, info)
end
x.dict[field][col_ind] = info
end

# For creating a new column in the dataframe
function Base.push!(x::MetaData, info)
for key in keys(x.dict)
push!(x.dict[key], info)
end
end

function Base.insert!(x::MetaData, col_ind::Int, item)
for key in keys(x.dict)
insert!(x.dict[key], col_ind, item)
end
end

function Base.merge!(leftmeta::MetaData, rightmeta::MetaData, leftindex::Index, rightindex::Index)
# Find the unique columns on the right
right_and_not_left_names = setdiff(names(rightindex), names(leftindex))
right_and_not_left_cols = rightindex[right_and_not_left_names]
# this imitates what's going on with the parent dataframes in merge!
rightmeta = rightmeta[right_and_not_left_cols]
rightindex = rightindex[right_and_not_left_names]
# Find the difference in the keys and allocate if needed
notonleft = setdiff(keys(rightmeta.dict), keys(leftmeta.dict))
notonright = setdiff(keys(leftmeta.dict), keys(rightmeta.dict))

for field in notonleft
newfield!(leftmeta, length(leftindex), field, nothing)
Member

As you noted in the issue description, this approach is not very efficient, and it doesn't work for `rightmeta` since it shouldn't be modified. Another way of doing this is, inside `for key in keys(leftmeta.dict)`, to check whether the key exists in the left and right data frames. If it exists in both, call `vcat` as you currently do. If it exists in only one of the data frames, allocate a `Vector{Union{Nothing, eltype(key_vec)}}` and call `copyto!` to fill the corresponding entries.

end

for field in notonright
newfield!(rightmeta, length(rightindex), field, nothing)
end

for key in keys(leftmeta.dict)
leftmeta.dict[key] =
vcat(leftmeta.dict[key], rightmeta.dict[key])
end
end

function append(leftmeta::MetaData, rightmeta::MetaData)
append!(copy(leftmeta), rightmeta)
end

# deleting columns is handled by get_index?
function getmeta(x::MetaData, col_ind::Int, field::Symbol)
if haskey(x.dict, field)
return x.dict[field][col_ind]
else
error("The field does not exist")
end
end
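
The review comment on `merge!` above describes an alternative that avoids allocating placeholder fields on either side. A minimal stand-alone sketch of that idea follows. It assumes plain dictionaries in place of `MetaData`, and `merge_fields` is a hypothetical helper name; neither the name nor the signature comes from the PR. It also loops over the union of keys rather than only the left keys, so fields that exist only on the right are kept.

```julia
# Hypothetical helper, not part of the PR. Each dictionary maps a metadata
# field to a vector holding one entry per column; `nleft` and `nright` are
# the column counts of the two sides.
function merge_fields(left::AbstractDict{Symbol}, nleft::Int,
                      right::AbstractDict{Symbol}, nright::Int)
    merged = Dict{Symbol,Vector}()
    for key in union(keys(left), keys(right))
        if haskey(left, key) && haskey(right, key)
            # Field present on both sides: concatenate the per-column entries.
            merged[key] = vcat(left[key], right[key])
        else
            # Field present on one side only: allocate a vector that can hold
            # `nothing`, then copy the existing entries into the matching slots.
            onleft = haskey(left, key)
            src = onleft ? left[key] : right[key]
            offset = onleft ? 1 : nleft + 1
            dest = Union{Nothing,eltype(src)}[nothing for _ in 1:(nleft + nright)]
            copyto!(dest, offset, src, 1, length(src))
            merged[key] = dest
        end
    end
    return merged
end

# Usage: two columns on each side, with a different field on each side.
left = Dict{Symbol,Vector}(:label => ["A label for variable a", nothing])
right = Dict{Symbol,Vector}(:unit => ["kg", "m"])
merged = merge_fields(left, 2, right, 2)
```

Neither input dictionary is modified, which addresses the concern about `rightmeta`. Whether the result should be wrapped back into a `MetaData` or written into `leftmeta` in place is left open here.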
29 changes: 29 additions & 0 deletions test/metadata.jl
@@ -0,0 +1,29 @@
module TestMetaData
using Compat, Compat.Test, DataFrames, StatsBase, Compat.Random
using Suppressor
using Compat: @warn

df1 = DataFrame(a = [1, 2], b = [3, 4])
df2 = DataFrame(c = [3, 4], d = [5, 6])

# Just used to add metadata easily for testing.
metadata!(df1, :a, :label, "A label for variable a")

testdata = DataFrame(variable = names(df1), label =
["A label for variable a",
nothing])

@test showmeta(df1) == testdata

mergeddata = merge!(df1, df2)
testmergeddata = DataFrame(variable = names(mergeddata),
label =
["A label for variable a",
nothing,
nothing,
nothing])

@test showmeta(mergeddata) == testmergeddata

end # module TestMetaData