Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add datafusion-python #69

Merged
merged 11 commits into from
May 4, 2021
Merged

Add datafusion-python #69

merged 11 commits into from
May 4, 2021

Conversation

jorgecarleitao
Copy link
Member

This is a PR with the source code of python-datafusion, currently available at https://github.com/jorgecarleitao/datafusion-python and released in pypi as datafusion.

The goal of this PR is to gauge interest of moving that code base closer to datafusion and to within ASF.

Some notes:

  • The documentation is lacking, but I would hope to have the docs published somewhere via readthedocs.
  • The release builds wheels for windows, mac, and manylinux2010, which cover some of the user-base

@codecov-commenter
Copy link

codecov-commenter commented Apr 25, 2021

Codecov Report

Merging #69 (3d9e1a3) into master (2423ff0) will decrease coverage by 0.66%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
- Coverage   76.43%   75.77%   -0.67%     
==========================================
  Files         135      142       +7     
  Lines       23264    23467     +203     
==========================================
  Hits        17782    17782              
- Misses       5482     5685     +203     
Impacted Files Coverage Δ
python/src/context.rs 0.00% <0.00%> (ø)
python/src/dataframe.rs 0.00% <0.00%> (ø)
python/src/errors.rs 0.00% <0.00%> (ø)
python/src/expression.rs 0.00% <0.00%> (ø)
python/src/scalar.rs 0.00% <0.00%> (ø)
python/src/types.rs 0.00% <0.00%> (ø)
python/src/udaf.rs 0.00% <0.00%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2423ff0...3d9e1a3. Read the comment docs.

python/Cargo.toml Outdated Show resolved Hide resolved
python/LICENSE Outdated Show resolved Hide resolved
@andygrove
Copy link
Member

Thank you @jorgecarleitao I am really excited to see this and would love to see this merged into arrow-datafusion.

@jorgecarleitao
Copy link
Member Author

Some notes:

  • this points to a rather old commit in DataFusion. We need to work on that
  • the CI did not run for some reason; I need to fix that
  • I need to push the Apache license and headers to flag that this is being licensed to ASF via Apache 2 and not MIT

@@ -0,0 +1,72 @@
name: Build
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tag release probably won't work in the context of an ASF repo anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, we will need to work out a packaging; the build of the wheels is imo still relevant, as it is not so easy in Rust (afai understood support for this is still a bit WIP). Building the manylinux was a feat.

.github/workflows/python_test.yaml Outdated Show resolved Hide resolved

```bash
pip install datafusion
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding here as a suggestion but I'll take a look at packaging it as a conda package. I'll cc you on the PR once I got a bit working.

Suggested change
```

or via conda/mamba:

conda install -c conda-forge datafusion
mamba install -c conda-forge datafusion

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy
If you want you can ping me on the staged-recipes PR, once you create it. I was just reading up on the state of arrow vs. rust, and was surprised that datafusion isn't yet in conda-forge. ;-)

@@ -0,0 +1,98 @@
import unittest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: Why not pytest?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it comes with python, so no need to install other stuff. But no feelings here; we can refactor this whole thing. =)

python/Cargo.toml Show resolved Hide resolved
@jorgecarleitao
Copy link
Member Author

Pushed the license and also hopefully fixed the CI.

jorgecarleitao and others added 7 commits April 30, 2021 08:23
Co-authored-by: Andy Grove <andygrove@users.noreply.github.com>
Co-authored-by: Uwe L. Korn <xhochy@users.noreply.github.com>
@jorgecarleitao
Copy link
Member Author

Ok, I have now fixed the CI, pushed the license headers, and bumped to latest datafusion.

There was a regression, documented in #226.

Once we fix the regression, this can be released in pypi as 0.2.2 since there were no backward incompatible changes on it 🎉

@andygrove
Copy link
Member

@alamb @Dandandan Any objection to merging this PR?

@Dandandan
Copy link
Contributor

@andygrove please go ahead 🚀

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objections to merging from me. I skimmed it quickly and all seems good.

I think a significant investment in documentation will be needed for this code, but it seems like a good start to me

@andygrove andygrove merged commit 46bde0b into apache:master May 4, 2021
@wesm
Copy link
Member

wesm commented May 4, 2021

I hate to be a nuisance, but didn't this need to go through IP clearance?

@andygrove
Copy link
Member

I hate to be a nuisance, but didn't this need to go through IP clearance?

We can revert if this is the case, but because Jorge was the only contributor (except for one contribution fixing a typo in a README) this didn't seem to be required in this case?

@wesm
Copy link
Member

wesm commented May 4, 2021

Probably best to check with general@incubator to determine the preferred protocol in this situation. I don't want to subject you to unneeded process, but would be good to go by the book

@jorgecarleitao
Copy link
Member Author

@wesm thank you. Not a nuisance at all, it is important to have this done correctly.

The rational here:

I hold the copyright over the whole code base, except for a 1 word typo fix on the README. The code was MIT licensed on jorgecarleitao/python-datafusion.

As part of this PR, I pushed a commit that added the license headers to every file in the source code. As copyright holder, I thereby licensed all this code to ASF under the ICA.

andygrove referenced this pull request in andygrove/datafusion May 4, 2021
@jorgecarleitao jorgecarleitao deleted the python branch May 4, 2021 14:10
@andygrove
Copy link
Member

PR to revert: #257

@wesm
Copy link
Member

wesm commented May 4, 2021

Thanks, I'm not enough of an expert to know what is the correct protocol, a vote may not be needed at all but let's double check

nevi-me pushed a commit that referenced this pull request May 4, 2021
@alamb
Copy link
Contributor

alamb commented May 4, 2021

Yeah, my interpretation was that since @jorgecarleitao authored this code, I was treating this as "just a normal PR" (it happens to have lived somewhere else for a while but from an IP provenance perspective it seemed no different to a normal PR to me).

However, I am not an expert in such matters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants