CSV.read memory leak #264
Comments
It'd be nice to test out the new |
Wow, the new CSV.File with Tables is amazingly stable. No memory leaks. When will that be released? What is the thinking behind creating the new Tables package when DataFrames already exists? I guess Tables is much lighter than DataFrames? Since I don't have much need for things like sorting, I believe Tables (with NamedTuple) would satisfy most of my needs. Should I prepare my code to use Tables now? |
I think it is related. I have a file with 40 columns and 4 rows and when I try to do CSV.Source or CSV.read on it, Julia will freeze and the memory will start to grow up until
|
I confirm the problem on Linux on file CSVFiles.jl reads it cleanly. The problem is both on CC @pszufe |
My previous test with CSV.File("file.csv") |> columntable was OK, but today when I use CSV.File("file.csv") |> rowtable, Julia freezes, with memory and CPU usage going way up. I have to kill it with -9. Test code:
I am using a nightly build of Julia 1.1.0
EDIT: It appears that with an earlier nightly build of 1.1.0, the problem doesn't exist:
|
@bkamins can you share your file and other details such as julia version, CSV version, etc.? It's hard to debug without knowing exactly what you're running. Happy to privately receive a file if necessary. @LeoK987, the new Tables.jl package is already released and registered officially. I'm hoping to do a new release of CSV soon (within the next week or two), along with DataFrames support for Tables as well. I can't reproduce any memory leaks on CSV master using CSV.File in Julia 1.0; nor any hangs while reading or anything. I'll see if I can get a source build or nightly to do additional testing. |
Here is a test case. In my experience, if you let it run on Linux for more than 10 minutes you will need to reboot. I use Amazon AWS c5.xlarge, AMI:
julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
(v1.0) pkg> add Tables#master
(v1.0) pkg> add CSV#master
julia> include("load.jl")
┌ Warning: CSV.read(file) will return a CSV.File object in the future; to return a DataFrame, use `df = CSV.read(file) |> DataFrame`
│ caller = ip:0x0
└ @ Core :-1
[goes into infinite loop and freezes, crashes my Linux if run for too long (OS runs out of memory)]
The file load.jl:
using CSV
using DataFrames
function g()
    CSV.read("df_bad.csv")
end
function f()
    describe(g())
end
show(f())
The file df_bad.csv contains
|
@quinnj this is the test case I was talking about above. |
Here is the df_bad.csv file that I use (I can see that when copying from the console trailing spaces have been appended in the post above) - so just download this one |
@pszufe , thanks for the detailed repro notes. I can confirm I see the same problem using |
I reran this on a new nightly version. It still freezes, with memory going way up. The only nightly version I tested that has no problem is the one from 8/27. Every version after that, up to today, freezes.
|
On Julia 1.0 with current CSV/DataFrames master, I don't see any memory leaks or process hanging when doing |
This is extracted from issue: CPU & memory go berserk when reading a 2-row file #236, as it's pretty much a separate issue.
This is a clean test case for the memory leak.
Edit the file: copy the two rows and paste them repeatedly until there are 10000 rows, so that the file is large enough to make the memory leak easy to notice.
Save the following julia script as "testCSV.jl"
include("testCSV.jl")
Then watch Julia's memory usage go up.
The julia version I used to test this:
Though this test case is with the 1.1.0 nightly, I have been noticing the memory leak with many previous versions, starting perhaps with 0.6, when readtable was used in place of CSV.read.
I thought the Plots package had to be combined with CSV to trigger the memory leak, but that turns out not to be the case. With this test, I don't see a difference with or without 'using Plots'.
And it's about the same when I put the code in a function:
Putting GC.gc() in the loop makes the growth rate significantly lower, but not zero. With enough runs, Julia's memory usage still goes up gradually.
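For anyone who wants to quantify the growth rather than watch it in a process monitor, here is a minimal sketch of a measurement loop. It is not the script from this issue: track_memory and the file name "test.csv" are hypothetical names chosen for illustration, and Base.gc_live_bytes is an unexported Base function, so treat its numbers as indicative only. The stand-in workload keeps the example runnable without any packages.

```julia
# Hypothetical helper (not from this issue): call `work` repeatedly and
# record memory figures after each pass.
function track_memory(work; iterations = 5, force_gc = true)
    live = Int[]   # bytes the GC considers live after each pass
    peak = Int[]   # peak resident set size of the process so far, in bytes
    for _ in 1:iterations
        work()
        force_gc && GC.gc()                 # optional full collection, as tried above
        push!(live, Base.gc_live_bytes())   # unexported Base function
        push!(peak, Sys.maxrss())
    end
    return (live_bytes = live, peak_rss = peak)
end

# Stand-in workload so the sketch runs without CSV installed.
# To test the issue, pass something like () -> CSV.read("test.csv") instead
# ("test.csv" is a placeholder file name).
result = track_memory(() -> sum(rand(10^5)); iterations = 3)
for i in eachindex(result.live_bytes)
    println("pass ", i, ": live = ", result.live_bytes[i],
            " bytes, peak RSS = ", result.peak_rss[i], " bytes")
end
```

If live bytes keep climbing across passes even with force_gc = true, that would be consistent with the slow growth described above. Peak RSS alone only shows the high-water mark, so it never decreases.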