Add Julia example [WIP] #29

Merged: 4 commits, May 1, 2024


simsurace
Contributor

@simsurace simsurace commented Apr 24, 2024

This is a basic Julia example. I will update below when I complete tests locally. There is no automated testing in this repo it seems.

Julia client tested with

  • Python server
  • Go server ---> @ianmcook tested successfully
  • C-sharp server ---> @ianmcook tested successfully
  • Java server ---> @ianmcook tested successfully
  • Rust server ---> @ianmcook tested successfully
  • Ruby server ---> @ianmcook tested successfully

Julia server tested with

  • Python client
  • Go client
  • C client ---> @ianmcook tested successfully
  • C++ client ---> @ianmcook tested successfully
  • C-sharp client
  • Java client ---> @ianmcook tested successfully
  • Javascript client
  • Julia client
  • R client ---> @ianmcook tested successfully
  • Rust client ---> @ianmcook tested successfully
  • Ruby client ---> @ianmcook tested successfully

Closes apache/arrow-julia#502

@ianmcook
Member

Thanks @simsurace!

> This is a basic Julia example. I will update below when I complete tests locally. There is no automated testing in this repo it seems.

That's correct. I can help run all the tests. We intentionally didn't build out any automated tests here because the time to build and maintain them would probably exceed the time they save. Ultimately the examples from here will find their way to a permanent place and we will add proper integration tests there.

@ianmcook ianmcook self-requested a review April 29, 2024 13:09
@ianmcook
Member

ianmcook commented Apr 29, 2024

@simsurace I tested the Julia client against all the servers, and they all ran without error (except Ruby server which I'm having trouble with for unrelated reasons).

I also added some temporary code to the Julia client to write the resulting data to an Arrow IPC file, then examined the file. I noticed a problem:

I would expect the schema to look like this:

a: int64
b: int64
c: int64
d: int64

But it looks like this:

a: list<: int64> not null
  child 0, : int64
b: list<: int64> not null
  child 0, : int64
c: list<: int64> not null
  child 0, : int64
d: list<: int64> not null
  child 0, : int64

It seems that the Julia library is wrapping the int64 columns in lists. Do you know if this is a known issue with the Arrow Julia IPC stream reader?

@simsurace
Contributor Author

simsurace commented Apr 29, 2024

Hi, thanks for testing! Hmm, this may be a misunderstanding on my part. I was assuming that the record batches are tables with 4096 rows, so the columns a,b,c,d would be represented as Vector{Int}. You seem to be wanting to iterate over individual rows, right?

EDIT: As the clients all run without error, can you share the code you used to write the file?
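
To make the point about batches concrete, here is a minimal sketch (not the code from this PR) of what iterating an Arrow IPC stream looks like in Arrow.jl. The `make_batch` helper and the three in-memory batches are hypothetical stand-ins for the server's data, and the bytes come from an `IOBuffer` rather than an HTTP response. Each item yielded by `Arrow.Stream` should be an `Arrow.Table` holding one record batch, with plain `Int64` columns.

```julia
using Arrow, Tables

# Hypothetical stand-in for the server's data: batches of Int64 columns a, b, c, d.
make_batch(n) = (a = rand(Int64, n), b = rand(Int64, n),
                 c = rand(Int64, n), d = rand(Int64, n))
batches_in = [make_batch(4096) for _ in 1:3]

# Write all batches as one IPC stream; each partition becomes one record batch.
io = IOBuffer()
Arrow.write(io, Tables.partitioner(batches_in))
bytes = take!(io)

# Iterate the stream batch by batch and record what each batch looks like.
row_counts = Int[]
col_eltypes = Type[]
for batch in Arrow.Stream(bytes)
    push!(row_counts, length(batch.a))   # rows in this record batch
    push!(col_eltypes, eltype(batch.a))  # element type of column a
end
```

In a real client the `bytes` would come from the HTTP response body instead of a local buffer.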

@ianmcook
Member

Oops, please disregard my message above about the schema. I was writing the file incorrectly.

I had added Arrow.write("output.arrow", batches) inside get_batches(). That caused the extra level of nesting to be added.

The right way to do it is like this:

open(Arrow.Writer, "output.arrow") do writer
  for batch in stream
    Arrow.write(writer, batch)
  end
end

When I do it that way, the nesting problem goes away.
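
For anyone who already has the batches collected in a vector, an alternative that I believe also avoids the nesting is to wrap the vector in `Tables.partitioner`, so that `Arrow.write` treats each element as one record batch. This is a sketch under that assumption; the small `batches` vector and the temporary output path are made up for illustration.

```julia
using Arrow, Tables

# Hypothetical stand-in for the `batches` vector collected by the client.
batches = [(a = collect(Int64, 1:4), b = collect(Int64, 5:8)) for _ in 1:3]

# Illustrative output location; any writable path would do.
path = tempname() * ".arrow"

# Each element of `batches` is written as its own record batch,
# instead of the whole vector being treated as a single table-like object.
Arrow.write(path, Tables.partitioner(batches))

# Read the file back; columns from all batches are presented as one table.
tbl = Arrow.Table(path)
column_names = collect(Tables.schema(tbl).names)
```

Reading the file back this way should give flat `Int64` columns whose length is the sum of the batch lengths.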

@ianmcook
Member

ianmcook commented Apr 29, 2024

What we're calling batches here is called an "Arrow table" in the Arrow.jl API. So you might want to replace instances of batches with table.
Actually batches is a Vector{Arrow.Table}

@simsurace
Contributor Author

Hmm ok, I think I'm still confused about the nomenclature. Looking at the other implementations (e.g. Python), record batches seem to be small tables (i.e. 4096 rows), so wouldn't you expect the same schema/format as the full table, which has 100 million rows?

@ianmcook
Member

On closer inspection: actually batches is not an Arrow.Table — it's a Vector{Arrow.Table}. Considering that, the added layer of nesting in the output schema makes more sense. Sorry for any confusion here — the Julia implementation of Arrow is very new to me.

@ianmcook
Member

@simsurace I'm not able to get the server example working on macOS. It starts successfully, but when a client connects to it (any client), it throws an error:

% julia --project=.. server.jl
Serving on localhost:8008...
[ Info: Listening on: 127.0.0.1:8008, thread id: 1
┌ Error: handle_connection handler error. 

│ ===========================
│ HTTP Error message:

│ ERROR: IOError: write: invalid argument (EINVAL)
│ Stacktrace:
│   [1] uv_write(s::Sockets.TCPSocket, p::Ptr{UInt8}, n::UInt64)
│     @ Base ./stream.jl:1066
│   [2] unsafe_write(s::Sockets.TCPSocket, p::Ptr{UInt8}, n::UInt64)
│     @ Base ./stream.jl:1120
│   [3] unsafe_write
│     @ ~/.julia/packages/HTTP/PnoHb/src/Connections.jl:129 [inlined]
│   [4] unsafe_write(http::HTTP.Streams.Stream{HTTP.Messages.Request, HTTP.Connections.Connection{Sockets.TCPSocket}}, p::Ptr{UInt8}, n::UInt64)
│     @ HTTP.Streams ~/.julia/packages/HTTP/PnoHb/src/Streams.jl:95
│   [5] unsafe_write
│     @ ./io.jl:698 [inlined]
│   [6] write(s::HTTP.Streams.Stream{HTTP.Messages.Request, HTTP.Connections.Connection{Sockets.TCPSocket}}, a::Vector{UInt8})
│     @ Base ./io.jl:721
│   [7] (::HTTP.Handlers.var"#1#2"{typeof(get_stream)})(stream::HTTP.Streams.Stream{HTTP.Messages.Request, HTTP.Connections.Connection{Sockets.TCPSocket}})
│     @ HTTP.Handlers ~/.julia/packages/HTTP/PnoHb/src/Handlers.jl:61
│   [8] #invokelatest#2
│     @ ./essentials.jl:892 [inlined]
│   [9] invokelatest
│     @ ./essentials.jl:889 [inlined]
│  [10] handle_connection(f::Function, c::HTTP.Connections.Connection{Sockets.TCPSocket}, listener::HTTP.Servers.Listener{Nothing, Sockets.TCPServer}, readtimeout::Int64, access_log::Nothing)
│     @ HTTP.Servers ~/.julia/packages/HTTP/PnoHb/src/Servers.jl:469
│  [11] (::HTTP.Servers.var"#16#17"{HTTP.Handlers.var"#1#2"{typeof(get_stream)}, HTTP.Servers.Listener{Nothing, Sockets.TCPServer}, Set{HTTP.Connections.Connection}, Int64, Nothing, ReentrantLock, Base.Semaphore, HTTP.Connections.Connection{Sockets.TCPSocket}})()
│     @ HTTP.Servers ~/.julia/packages/HTTP/PnoHb/src/Servers.jl:401
│   request =
│    HTTP.Messages.Request:
│    """
│    GET / HTTP/1.1
│    Accept-Encoding: identity
│    Host: localhost:8008
│    User-Agent: Python-urllib/3.12
│    Connection: close

│    """
└ @ HTTP.Servers ~/.julia/packages/HTTP/PnoHb/src/Servers.jl:483

@simsurace
Contributor Author

Yes, on macOS there is a known issue that I think will be fixed in an upcoming release: JuliaLang/julia#54225

@ianmcook
Member

ianmcook commented May 1, 2024

Great, thanks. It works fine for me if I reduce total_records to 10_000_000

@ianmcook
Member

ianmcook commented May 1, 2024

I successfully tested this Julia server with all the other client examples 🎉

Just one small thing: The int64 columns created in the server example are non-nullable (they have no validity bitmap). All the other server examples create nullable int64 columns (with a validity bitmap). Is it possible to make them nullable here for consistency?
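
For reference, one way this could be done in Julia (a sketch only, not necessarily what the PR ends up doing) is to widen the column element type to `Union{Missing,Int64}`. As far as I understand Arrow.jl, that marks the field as nullable in the schema even when no value is actually missing. The column names and the tiny row count here are illustrative.

```julia
using Arrow

n = 8
plain = rand(Int64, n)

# Same values, but the element type now admits `missing`,
# which Arrow.jl should translate into a nullable field.
nullable = convert(Vector{Union{Missing,Int64}}, plain)

# Write one non-nullable and one nullable column, then read them back.
io = IOBuffer()
Arrow.write(io, (a = plain, b = nullable))
tbl = Arrow.Table(take!(io))

eltype_a = eltype(tbl.a)  # element type of the non-nullable column
eltype_b = eltype(tbl.b)  # element type of the nullable column
```

Comparing `eltype_a` and `eltype_b` after the round trip is a quick way to check which columns came back as nullable.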

@simsurace
Contributor Author

simsurace commented May 1, 2024

Sure! I will do that. EDIT: done in 5f0a1a0

@simsurace
Contributor Author

simsurace commented May 1, 2024

I would like to ask for feedback on the Julia community channels for how to make this more performant, but we can also do this in a follow-up PR. EDIT: opened a thread on Discourse.

@simsurace simsurace marked this pull request as ready for review May 1, 2024 10:08
@simsurace
Contributor Author

I think this is ready to merge as a functional first version. I would propose to put any possible performance enhancement in a new PR.

@ianmcook ianmcook merged commit 188c4e5 into apache:main May 1, 2024
@ianmcook
Member

ianmcook commented May 1, 2024

Thank you @simsurace!

@simsurace simsurace deleted the add-julia-example branch May 1, 2024 17:27
Successfully merging this pull request may close these issues.

Arrow-over-HTTP client and server examples in Julia
2 participants