Support pandas >=1.0.0 #1197

Merged on Feb 21, 2020 (70 commits)

Commits
c581e9e
update removed & changed APIs
itholic Jan 16, 2020
e891ae7
labels -> codes
itholic Jan 16, 2020
cb484ab
Merge branch 'master' of https://github.com/databricks/koalas into su…
itholic Jan 28, 2020
e69d0d7
Merge branch 'master' of https://github.com/databricks/koalas into su…
itholic Jan 29, 2020
e93a8d6
manage for supporting pandas 1.0.0
itholic Jan 29, 2020
6729c5e
using pandas 1.0.0rc0 in .travis.yml
itholic Jan 29, 2020
3c51348
fix conda forge path
itholic Jan 29, 2020
c77d11c
revert pandas1.0.0 for python3.5
itholic Jan 29, 2020
c7c7034
resolve lint
itholic Jan 29, 2020
ab8c1b9
Merge branch 'master' of https://github.com/databricks/koalas into su…
itholic Jan 30, 2020
61c03d0
recover get_dtype_counts temporarily
itholic Jan 30, 2020
6be375b
[fix] RollingGroupBy
itholic Jan 30, 2020
75351c5
[Common] get_dummies
itholic Jan 30, 2020
4d68cf6
[fix] ValueError at several functions for DataFrameGroupBy
itholic Jan 30, 2020
71b78d7
[fix] ExpandingGroupBy
itholic Jan 30, 2020
3c867dd
Fix doctest for DataFrame.info
itholic Jan 30, 2020
1bc76b3
Comment to_latex test raising ValueError in pandas >= 1.0.0
itholic Jan 30, 2020
994247c
Resolve conflicts
itholic Jan 30, 2020
371d5ea
fix conda-forge
itholic Jan 30, 2020
a3f252b
skip doctest of DataFrame.info since inconsistency depends on python …
itholic Jan 30, 2020
7ab1386
add new function for DataFrame to_markdown to missing list
itholic Jan 31, 2020
57e33d7
pandas 1.0.0rc0 -> pandas 1.0.0
itholic Jan 31, 2020
0ba7378
Merge branch 'master' of https://github.com/databricks/koalas into su…
itholic Jan 31, 2020
079721d
[fix] Expanding.count to support pandas 1.0.0
itholic Jan 31, 2020
1343204
[fix] Expanding.count rearrange
itholic Jan 31, 2020
957a4f9
[fix] ExpandingGroupby.count
itholic Jan 31, 2020
fe6fd57
[fix] doctest for ExpandingGroupby.count
itholic Jan 31, 2020
c4d6550
[requirements-dev] pandas>=0.23.2,<1.0 -> pandas>=1.0.0
itholic Jan 31, 2020
564d14c
comment python3.5 for travis
itholic Jan 31, 2020
550420a
travis test for both pandas<1.0.0 & pandas>=1.0.0
itholic Jan 31, 2020
fe5c413
fix some tests for supporting pandas>=1.0.0 and pandas<1.0.0 both
itholic Feb 1, 2020
61753e1
[TEMPORAL] fix travis for testing all version of pandas
itholic Feb 1, 2020
15cbd5a
Merge branch 'master' of https://github.com/databricks/koalas into su…
itholic Feb 1, 2020
a64de0e
Empty commit for build test
itholic Feb 1, 2020
06df259
Merge branch 'master' of https://github.com/databricks/koalas into su…
itholic Feb 4, 2020
925a5e8
Merge branch 'master' of https://github.com/databricks/koalas into su…
itholic Feb 6, 2020
65b57a2
restore previous doctest
itholic Feb 9, 2020
708a08c
Merge branch 'master' of https://github.com/databricks/koalas into su…
itholic Feb 9, 2020
062bcab
restore previous doctest
itholic Feb 9, 2020
2fc0a89
Restore removed list with `removed in pandas>=1.0.0` comment
itholic Feb 10, 2020
edaea26
Resolve conflicts
itholic Feb 10, 2020
916ef10
pandas 1.0.0 -> 10.
itholic Feb 10, 2020
b6d9fe0
pandas 1.0 -> 1.0.1
itholic Feb 11, 2020
c95fedb
fix comment for get_dtype_counts
itholic Feb 11, 2020
2b8b215
remove upper bound
itholic Feb 11, 2020
896ca83
remove upper bound
itholic Feb 11, 2020
9a55272
remove upper bound
itholic Feb 11, 2020
95892d3
Fix workflow of GitHub Actions for pandas 1.0.1
itholic Feb 11, 2020
8108d2b
Update master.yml
itholic Feb 11, 2020
b69c37c
add test pandas>=1.0.0 to matrix
itholic Feb 11, 2020
770d805
Revert changes related with monotonic~
itholic Feb 11, 2020
76ffbc3
Merge branch 'support_pandas_1.0.0' of https://github.com/itholic/koa…
itholic Feb 11, 2020
20dfa53
Revert unnecessary tests
itholic Feb 11, 2020
8be71e9
Test for all version of pandas
itholic Feb 11, 2020
da426a9
Restore get_dtype_counts deprecated
itholic Feb 11, 2020
f3e5187
Remove unnecessary space
itholic Feb 11, 2020
eda7166
update CI
itholic Feb 11, 2020
abcfbd1
travie, github actions
itholic Feb 11, 2020
1a884b0
Empty commit for build test
itholic Feb 12, 2020
2f26446
Fix doctest for ExpandingGroupby.count
itholic Feb 13, 2020
b2db508
Resolve conflicts
itholic Feb 13, 2020
cee119d
Revert unnecessary change
itholic Feb 13, 2020
e24b6e1
Revert unnecessary change
itholic Feb 13, 2020
8bd20f4
Revert unnecessary change
itholic Feb 13, 2020
43e8428
Revert unnecessary change
itholic Feb 13, 2020
eb820d2
Match the doctest of DataFrame.info to 1.0.1
itholic Feb 14, 2020
7a063a5
Match missing list to 1.0.1
itholic Feb 14, 2020
b1a491a
Merge branch 'master' of https://github.com/databricks/koalas into su…
itholic Feb 18, 2020
dba7841
Fix & Resolve conflicts
itholic Feb 20, 2020
bad607d
Fix comment
itholic Feb 20, 2020
6 changes: 3 additions & 3 deletions .github/workflows/master.yml
@@ -17,7 +17,7 @@ jobs:
# The name of the directory '.cache' is for Travis CI. Once we remove Travis CI,
# we should download Spark to a directory with a different name to prevent confusion.
SPARK_CACHE_DIR: /home/runner/.cache/spark-versions
-PANDAS_VERSION: 0.23.4
+PANDAS_VERSION: 0.24.2
PYARROW_VERSION: 0.10.0
# DISPLAY=0.0 does not work in GitHub Actions with Python 3.5. Here we work around with xvfb-run
PYTHON_EXECUTABLE: xvfb-run python
@@ -73,12 +73,12 @@ jobs:
include:
- python-version: 3.6
spark-version: 2.4.5
-pandas-version: 0.24.2
+pandas-version: 0.25.3
pyarrow-version: 0.13.0
logger: databricks.koalas.usage_logging.usage_logger
- python-version: 3.7
spark-version: 2.4.5
-pandas-version: 0.25.3
+pandas-version: 1.0.1
pyarrow-version: 0.14.1
env:
PYTHON_VERSION: ${{ matrix.python-version }}
26 changes: 13 additions & 13 deletions databricks/koalas/frame.py
@@ -4151,8 +4151,7 @@ def from_records(data: Union[np.array, List[tuple], dict, pd.DataFrame],
return DataFrame(pd.DataFrame.from_records(data, index, exclude, columns, coerce_float,
nrows))

-def to_records(self, index=True, convert_datetime64=None,
-column_dtypes=None, index_dtypes=None):
+def to_records(self, index=True, column_dtypes=None, index_dtypes=None):
"""
Convert DataFrame to a NumPy record array.

@@ -4167,9 +4166,6 @@ def to_records(self, index=True, convert_datetime64=None,
index : bool, default True
Include index in resulting record array, stored in 'index'
field or using the index label, if set.
-convert_datetime64 : bool, default None
-Whether to convert the index to datetime.datetime if it is a
-DatetimeIndex.
column_dtypes : str, type, dict, default None
If a string or type, the data type to store all columns. If
a dictionary, a mapping of column names and indices (zero-indexed)
@@ -8361,9 +8357,11 @@ def info(
<class 'databricks.koalas.frame.DataFrame'>
Index: 5 entries, 0 to 4
Data columns (total 3 columns):
-int_col 5 non-null int64
-text_col 5 non-null object
-float_col 5 non-null float64
+# Column Non-Null Count Dtype
+--- ------ -------------- -----
+0 int_col 5 non-null int64
+1 text_col 5 non-null object
+2 float_col 5 non-null float64
dtypes: float64(1), int64(1), object(1)

Prints a summary of columns count and its dtypes but not per column
@@ -8386,13 +8384,15 @@ def info(
... encoding="utf-8") as f:
... _ = f.write(s)
>>> with open('%s/info.txt' % path) as f:
-... f.readlines() # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
-[...databricks.koalas.frame.DataFrame...,
+... f.readlines() # doctest: +SKIP
+["<class 'databricks.koalas.frame.DataFrame'>\\n",
'Index: 5 entries, 0 to 4\\n',
'Data columns (total 3 columns):\\n',
-'int_col 5 non-null int64\\n',
-'text_col 5 non-null object\\n',
-'float_col 5 non-null float64\\n',
+' # Column Non-Null Count Dtype \\n',
+'--- ------ -------------- ----- \\n',
+' 0 int_col 5 non-null int64 \\n',
+' 1 text_col 5 non-null object \\n',
+' 2 float_col 5 non-null float64\\n',
'dtypes: float64(1), int64(1), object(1)']
"""
# Prevent pandas' existing config from affecting Koalas.
2 changes: 2 additions & 0 deletions databricks/koalas/generic.py
@@ -297,6 +297,8 @@ def cumprod(self, skipna: bool = True):
"""
return self._apply_series_op(lambda kser: kser._cumprod(skipna)) # type: ignore

+# TODO: Although this has been removed in pandas >= 1.0.0, we're keeping it as deprecated
+# since we use it for `DataFrame.info` internally.
def get_dtype_counts(self):
"""
Return counts of unique dtypes in this object.
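Reviewer note: `get_dtype_counts` is kept because `DataFrame.info` needs the per-dtype tally for its `dtypes: ...` footer. The counting step can be sketched in plain Python as below. This is only an illustration, not the Koalas implementation: `column_dtypes`, `dtype_counts`, and `format_dtypes_line` are hypothetical names, and the dtype names are taken from the doctest in `frame.py`.

```python
from collections import Counter


def dtype_counts(column_dtypes):
    """Count how many columns share each dtype name (hypothetical helper)."""
    return Counter(column_dtypes.values())


def format_dtypes_line(counts):
    """Render the counts as a 'dtypes: ...' footer, sorted by dtype name."""
    parts = ["{}({})".format(name, n) for name, n in sorted(counts.items())]
    return "dtypes: " + ", ".join(parts)


# Assumed input: column name -> dtype name, as in the info() doctest.
column_dtypes = {"int_col": "int64", "text_col": "object", "float_col": "float64"}
print(format_dtypes_line(dtype_counts(column_dtypes)))
```

Sorting by dtype name keeps the footer stable regardless of column order, which matters when a doctest compares it verbatim.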
6 changes: 3 additions & 3 deletions databricks/koalas/indexes.py
@@ -1924,13 +1924,13 @@ def symmetric_difference(self, other, result_name=None, sort=None):
return result

# TODO: ADD error parameter
-def drop(self, labels, level=None):
+def drop(self, codes, level=None):
"""
Make new MultiIndex with passed list of labels deleted

Parameters
----------
-labels : array-like
+codes : array-like
Must be a list of tuples
level : int or level name, default None

@@ -1962,7 +1962,7 @@ def drop(self, labels, level=None):
scol = index_scols[0]
else:
scol = index_scols[level] if isinstance(level, int) else sdf[level]
-sdf = sdf[~scol.isin(labels)]
+sdf = sdf[~scol.isin(codes)]
return MultiIndex(DataFrame(_InternalFrame(sdf=sdf,
index_map=self._kdf._internal.index_map)))

17 changes: 1 addition & 16 deletions databricks/koalas/missing/frame.py
@@ -29,12 +29,6 @@ def unsupported_property(property_name, deprecated=False, reason=""):

class _MissingPandasLikeDataFrame(object):

# Deprecated properties
blocks = unsupported_property('blocks', deprecated=True)
ftypes = unsupported_property('ftypes', deprecated=True)
is_copy = unsupported_property('is_copy', deprecated=True)
ix = unsupported_property('ix', deprecated=True)

# Functions
align = unsupported_function('align')
asfreq = unsupported_function('asfreq')
@@ -82,27 +76,18 @@ class _MissingPandasLikeDataFrame(object):
to_sql = unsupported_function('to_sql')
to_stata = unsupported_function('to_stata')
to_timestamp = unsupported_function('to_timestamp')
to_markdown = unsupported_function('to_markdown')
truncate = unsupported_function('truncate')
tshift = unsupported_function('tshift')
tz_convert = unsupported_function('tz_convert')
tz_localize = unsupported_function('tz_localize')
unstack = unsupported_function('unstack')

# Deprecated functions
as_blocks = unsupported_function('as_blocks', deprecated=True)
as_matrix = unsupported_function('as_matrix', deprecated=True)
clip_lower = unsupported_function('clip_lower', deprecated=True)
clip_upper = unsupported_function('clip_upper', deprecated=True)
convert_objects = unsupported_function('convert_objects', deprecated=True)
get_ftype_counts = unsupported_function('get_ftype_counts', deprecated=True)
get_value = unsupported_function('get_value', deprecated=True)
select = unsupported_function('select', deprecated=True)
set_value = unsupported_function('set_value', deprecated=True)
to_panel = unsupported_function('to_panel', deprecated=True)
get_values = unsupported_function('get_values', deprecated=True)
to_dense = unsupported_function('to_dense', deprecated=True)
to_sparse = unsupported_function('to_sparse', deprecated=True)
to_msgpack = unsupported_function('to_msgpack', deprecated=True)
compound = unsupported_function('compound', deprecated=True)
reindex_axis = unsupported_function('reindex_axis', deprecated=True)

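Reviewer note: for readers unfamiliar with the "missing" lists, the pattern is a factory that returns a placeholder which raises when called. The sketch below is only a guess at the general shape. The call signature mirrors the `unsupported_property(property_name, deprecated=False, reason="")` header visible in this diff, but the error type, the message wording, and the `MissingFrameMethods` class are assumptions rather than the actual Koalas code.

```python
def unsupported_function(method_name, deprecated=False, reason=""):
    """Return a placeholder that raises as soon as it is called (illustrative only)."""

    def placeholder(*args, **kwargs):
        if deprecated:
            message = "`{}()` is deprecated in pandas and is not supported.".format(method_name)
        else:
            message = "`{}()` is not implemented yet.".format(method_name)
        if reason:
            message += " " + reason
        # Assumption: NotImplementedError stands in for whatever the real helper raises.
        raise NotImplementedError(message)

    placeholder.__name__ = method_name
    return placeholder


class MissingFrameMethods(object):
    # Hypothetical registry mirroring two entries from the list above.
    to_markdown = unsupported_function('to_markdown')
    reindex_axis = unsupported_function('reindex_axis', deprecated=True)
```

With this shape, dropping an entry from the list means the attribute is simply absent, so callers get a plain `AttributeError` instead of the tailored message.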
21 changes: 2 additions & 19 deletions databricks/koalas/missing/indexes.py
@@ -32,13 +32,6 @@ class _MissingPandasLikeIndex(object):
# Properties
nbytes = unsupported_property('nbytes')

# Deprecated properties
strides = unsupported_property('strides', deprecated=True)
data = unsupported_property('data', deprecated=True)
itemsize = unsupported_property('itemsize', deprecated=True)
base = unsupported_property('base', deprecated=True)
flags = unsupported_property('flags', deprecated=True)

# Functions
argsort = unsupported_function('argsort')
asof = unsupported_function('asof')
@@ -70,7 +63,6 @@ class _MissingPandasLikeIndex(object):
reindex = unsupported_function('reindex')
repeat = unsupported_function('repeat')
searchsorted = unsupported_function('searchsorted')
set_value = unsupported_function('set_value')
slice_indexer = unsupported_function('slice_indexer')
slice_locs = unsupported_function('slice_locs')
sortlevel = unsupported_function('sortlevel')
@@ -82,11 +74,9 @@ class _MissingPandasLikeIndex(object):
where = unsupported_function('where')

# Deprecated functions
get_duplicates = unsupported_function('get_duplicates', deprecated=True)
summary = unsupported_function('summary', deprecated=True)
get_values = unsupported_function('get_values', deprecated=True)
item = unsupported_function('item', deprecated=True)
contains = unsupported_function('contains', deprecated=True)
set_value = unsupported_function('set_value')

# Properties we won't support.
values = common.values(unsupported_property)
@@ -105,10 +95,7 @@ class _MissingPandasLikeMultiIndex(object):
# Deprecated properties
strides = unsupported_property('strides', deprecated=True)
data = unsupported_property('data', deprecated=True)
base = unsupported_property('base', deprecated=True)
itemsize = unsupported_property('itemsize', deprecated=True)
labels = unsupported_property('labels', deprecated=True)
flags = unsupported_property('flags', deprecated=True)

# Functions
argsort = unsupported_function('argsort')
@@ -148,9 +135,7 @@ class _MissingPandasLikeMultiIndex(object):
repeat = unsupported_function('repeat')
searchsorted = unsupported_function('searchsorted')
set_codes = unsupported_function('set_codes')
set_labels = unsupported_function('set_labels')
set_levels = unsupported_function('set_levels')
set_value = unsupported_function('set_value')
slice_indexer = unsupported_function('slice_indexer')
slice_locs = unsupported_function('slice_locs')
sortlevel = unsupported_function('sortlevel')
@@ -164,11 +149,9 @@ class _MissingPandasLikeMultiIndex(object):

# Deprecated functions
get_duplicates = unsupported_function('get_duplicates', deprecated=True)
summary = unsupported_function('summary', deprecated=True)
to_hierarchical = unsupported_function('to_hierarchical', deprecated=True)
get_values = unsupported_function('get_values', deprecated=True)
contains = unsupported_function('contains', deprecated=True)
item = unsupported_function('item', deprecated=True)
set_value = unsupported_function('set_value', deprecated=True)

# Functions we won't support.
values = common.values(unsupported_property)
36 changes: 1 addition & 35 deletions databricks/koalas/missing/series.py
@@ -29,20 +29,6 @@ def unsupported_property(property_name, deprecated=False, reason=""):

class _MissingPandasLikeSeries(object):

# Deprecated properties
blocks = unsupported_property('blocks', deprecated=True)
ftypes = unsupported_property('ftypes', deprecated=True)
ftype = unsupported_property('ftype', deprecated=True)
is_copy = unsupported_property('is_copy', deprecated=True)
ix = unsupported_property('ix', deprecated=True)
asobject = unsupported_property('asobject', deprecated=True)
strides = unsupported_property('strides', deprecated=True)
imag = unsupported_property('imag', deprecated=True)
itemsize = unsupported_property('itemsize', deprecated=True)
data = unsupported_property('data', deprecated=True)
base = unsupported_property('base', deprecated=True)
flags = unsupported_property('flags', deprecated=True)

# Functions
align = unsupported_function('align')
argsort = unsupported_function('argsort')
@@ -65,6 +51,7 @@ class _MissingPandasLikeSeries(object):
first = unsupported_function('first')
infer_objects = unsupported_function('infer_objects')
interpolate = unsupported_function('interpolate')
item = unsupported_function('item')
items = unsupported_function('items')
iteritems = unsupported_function('iteritems')
last = unsupported_function('last')
@@ -99,37 +86,16 @@ class _MissingPandasLikeSeries(object):
view = unsupported_function('view')

# Deprecated functions
as_blocks = unsupported_function('as_blocks', deprecated=True)
as_matrix = unsupported_function('as_matrix', deprecated=True)
clip_lower = unsupported_function('clip_lower', deprecated=True)
clip_upper = unsupported_function('clip_upper', deprecated=True)
compress = unsupported_function('compress', deprecated=True)
convert_objects = unsupported_function('convert_objects', deprecated=True)
get_ftype_counts = unsupported_function('get_ftype_counts', deprecated=True)
get_value = unsupported_function('get_value', deprecated=True)
nonzero = unsupported_function('nonzero', deprecated=True)
reindex_axis = unsupported_function('reindex_axis', deprecated=True)
select = unsupported_function('select', deprecated=True)
set_value = unsupported_function('set_value', deprecated=True)
valid = unsupported_function('valid', deprecated=True)
get_values = unsupported_function('get_values', deprecated=True)
to_dense = unsupported_function('to_dense', deprecated=True)
to_sparse = unsupported_function('to_sparse', deprecated=True)
to_msgpack = unsupported_function('to_msgpack', deprecated=True)
compound = unsupported_function('compound', deprecated=True)
put = unsupported_function('put', deprecated=True)
item = unsupported_function('item', deprecated=True)
ptp = unsupported_function('ptp', deprecated=True)
argmax = unsupported_function('argmax', deprecated=True)
argmin = unsupported_function('argmin', deprecated=True)

# Properties we won't support.
values = common.values(unsupported_property)
array = common.array(unsupported_property)
duplicated = common.duplicated(unsupported_property)
real = unsupported_property(
'real',
reason="If you want to collect your data as an NumPy array, use 'to_numpy()' instead.")
nbytes = unsupported_property(
'nbytes',
reason="'nbytes' requires to compute whole dataset. You can calculate manually it, "
10 changes: 8 additions & 2 deletions databricks/koalas/namespace.py
@@ -26,7 +26,7 @@

import numpy as np
import pandas as pd
-
+from pandas.api.types import is_list_like
from pyspark import sql as spark
from pyspark.sql import functions as F
from pyspark.sql.types import ByteType, ShortType, IntegerType, LongType, FloatType, \
@@ -1266,6 +1266,10 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None,
if sparse is not False:
raise NotImplementedError("get_dummies currently does not support sparse")

+if columns is not None:
+if not is_list_like(columns):
+raise TypeError("Input must be a list-like for parameter `columns`")
+
if dtype is None:
dtype = 'byte'

@@ -1307,7 +1311,9 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None,
for label in kdf._internal.column_labels
if label == key or label[0] == key]
if len(column_labels) == 0:
-return kdf
+if columns is None:
+return kdf
+raise KeyError("{} not in index".format(columns))

if prefix is None:
prefix = [str(label) if len(label) > 1 else label[0] for label in column_labels]
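Reviewer note: the two `get_dummies` hunks together change how `columns` is validated. A non-list-like value now raises `TypeError`, and an explicit `columns` that matches nothing raises `KeyError` instead of silently returning the frame. The control flow can be sketched without pandas or Spark as below. Everything here is a stand-in: `is_list_like` only approximates `pandas.api.types.is_list_like`, and `resolve_columns`, `available`, and `default` are hypothetical names.

```python
from collections.abc import Iterable


def is_list_like(obj):
    """Rough stand-in for pandas.api.types.is_list_like: iterable, but not str/bytes."""
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))


def resolve_columns(available, columns=None, default=()):
    """Pick the columns to encode, mirroring the checks added in this diff."""
    if columns is not None and not is_list_like(columns):
        raise TypeError("Input must be a list-like for parameter `columns`")

    # With no explicit request, fall back to the caller's default candidates.
    keys = list(default if columns is None else columns)
    matched = [name for name in available if name in keys]

    if len(matched) == 0:
        if columns is None:
            return []  # nothing to encode; the caller returns the frame unchanged
        raise KeyError("{} not in index".format(columns))
    return matched
```

Keeping the silent early return only for `columns is None` preserves the old behaviour for frames with nothing to encode, while an explicit request that cannot be satisfied now fails loudly.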
3 changes: 2 additions & 1 deletion databricks/koalas/tests/test_dataframe_conversion.py
@@ -181,8 +181,9 @@ def test_to_latex(self):
self.assert_eq(kdf.to_latex(sparsify=False), pdf.to_latex(sparsify=False))
self.assert_eq(kdf.to_latex(index_names=False), pdf.to_latex(index_names=False))
self.assert_eq(kdf.to_latex(bold_rows=True), pdf.to_latex(bold_rows=True))
-self.assert_eq(kdf.to_latex(encoding='ascii'), pdf.to_latex(encoding='ascii'))
self.assert_eq(kdf.to_latex(decimal=','), pdf.to_latex(decimal=','))
+if LooseVersion(pd.__version__) < LooseVersion("1.0.0"):
+self.assert_eq(kdf.to_latex(encoding='ascii'), pdf.to_latex(encoding='ascii'))

def test_to_records(self):
if LooseVersion(pd.__version__) >= LooseVersion("0.24.0"):
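Reviewer note: the test above gates one assertion on the installed pandas version with `LooseVersion`. `LooseVersion` lives in `distutils`, which is deprecated in recent Python releases, so a dependency-free alternative may be worth considering. The sketch below is one possible approach and not what this PR does: `version_tuple` and `runs_legacy_check` are hypothetical helpers, and pre-release suffixes such as `rc0` are deliberately ignored.

```python
import re


def version_tuple(version):
    """Parse the leading numeric release, e.g. '1.0.0rc0' -> (1, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("no numeric release in {!r}".format(version))
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad to three components so that '1.0' compares equal to '1.0.0'.
    return parts + (0,) * (3 - len(parts))


def runs_legacy_check(pandas_version):
    """True when a check should run only on pandas older than 1.0.0."""
    return version_tuple(pandas_version) < (1, 0, 0)


for candidate in ("0.24.2", "0.25.3", "1.0.0rc0", "1.0.1"):
    print(candidate, runs_legacy_check(candidate))
```

Because the suffix is dropped, `1.0.0rc0` is treated like `1.0.0`. That suits a gate of this kind, but it would be wrong wherever release candidates must sort before the final release.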
6 changes: 2 additions & 4 deletions databricks/koalas/tests/test_expanding.py
@@ -96,16 +96,14 @@ def _test_groupby_expanding_func(self, f):
repr(getattr(kser.groupby(kser).expanding(2), f)().sort_index()),
repr(getattr(pser.groupby(pser).expanding(2), f)().sort_index()))

-kdf = ks.DataFrame({'a': [1, 2, 3, 2], 'b': [4.0, 2.0, 3.0, 1.0]},
-index=np.random.rand(4))
+kdf = ks.DataFrame({'a': [1, 2, 3, 2], 'b': [4.0, 2.0, 3.0, 1.0]})
pdf = kdf.to_pandas()
self.assert_eq(
repr(getattr(kdf.groupby(kdf.a).expanding(2), f)().sort_index()),
repr(getattr(pdf.groupby(pdf.a).expanding(2), f)().sort_index()))

# Multiindex column
-kdf = ks.DataFrame({'a': [1, 2, 3, 2], 'b': [4.0, 2.0, 3.0, 1.0]},
-index=np.random.rand(4))
+kdf = ks.DataFrame({'a': [1, 2, 3, 2], 'b': [4.0, 2.0, 3.0, 1.0]})
kdf.columns = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y')])
pdf = kdf.to_pandas()
self.assert_eq(