use polars for seaexplorer data file load #120

Merged: 15 commits, Nov 4, 2022

Conversation

@callumrollo (Collaborator) commented Oct 6, 2022

This is a big PR. More work is needed to check that it's not changing the data, as the polars methods are not identical to the pandas ones.

Testing on our datasets has shown a ~10x speedup when processing large delayed-mode datasets, and a ~5x speedup with nrt data. I'll write some example code comparing the two methods, along with more tests. This is almost certainly a sub-optimal implementation of polars, as I'm directly mirroring the existing pandas flow.

Using polars also decreases disk usage of intermediate products by using parquet rather than .nc. It also has substantially lower memory usage than pandas, so should decrease overheads when processing large datasets.

This is designed to resolve #36
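
For illustration (not code from this PR), the core swap is reading the ';'-separated SeaExplorer ASCII files with polars instead of pandas and writing the intermediate product as parquet instead of netCDF; the file name below is made up:

import pandas as pd
import polars as pl

fname = "sea063.0035.pld1.sub.0142"  # hypothetical raw SeaExplorer payload file

# existing flow: pandas reads the ';'-separated ASCII file
df_pd = pd.read_csv(fname, sep=";")

# proposed flow: polars reads the same file, and the intermediate
# product is written as parquet rather than netCDF
df_pl = pl.read_csv(fname, sep=";")
df_pl.write_parquet(fname + ".parquet")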

@callumrollo (Collaborator Author)

This is failing some tests at the moment. Some of those failures are expected (they use intermediate ncs, which are now parquet files); others look like errors with timestamp parsing and a few other issues. I'll work on these.

@callumrollo (Collaborator Author)

Just got to resolve the optional sensor-specific coarsening now.

@jklymak (Member) commented Oct 6, 2022

I wonder if this needs to be its own method? I'd need to be convinced that parquet files are a better intermediate format than netcdf. We need to be able to look at the intermediate files to see what is in the raw data - what is the workflow for parquet files? Just polars? I don't think xarray handles them. Does the efficiency go away if you use netcdf and polars?

We recently had this issue with the raw Alseamar files and solved it by subsampling the redundant information, which reduced the raw files from ~10 MB each down to ~40 kB.

@callumrollo (Collaborator Author)

These are good points, thanks Jody. We've tried subsampling the raw alseamar files, but with the 16 Hz legato running for 3 week missions, we still end up with huge datasets and the load/merge step is taking several hours per dataset.

I'm starting to profile the code now. polars makes time savings over pandas when loading and merging the data. parquet files are quicker and more storage efficient than ncs to write and read, but they do need to be readable as intermediate products. This is achieved in a very similar way to pandas:

>>> df = pl.read_parquet("sea63_35_nrt_rawnc/sea063.0035.gli.sub.0142.parquet")
>>> df
shape: (107, 23)
┌─────────────────────┬──────────┬───────────────┬─────────┬─────┬────────┬─────────┬──────────┬──────┐
│ time                ┆ NavState ┆ SecurityLevel ┆ Heading ┆ ... ┆ AngPos ┆ Voltage ┆ Altitude ┆ fnum │
│ ---                 ┆ ---      ┆ ---           ┆ ---     ┆     ┆ ---    ┆ ---     ┆ ---      ┆ ---  │
│ datetime[μs]        ┆ i64      ┆ i64           ┆ f64     ┆     ┆ f64    ┆ f64     ┆ f64      ┆ i64  │
╞═════════════════════╪══════════╪═══════════════╪═════════╪═════╪════════╪═════════╪══════════╪══════╡
│ 2022-03-01 23:50:06 ┆ 117      ┆ 0             ┆ 152.82  ┆ ... ┆ -5.0   ┆ 28.5    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:11 ┆ 110      ┆ 0             ┆ 154.58  ┆ ... ┆ 0.5    ┆ 28.5    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:21 ┆ 110      ┆ 0             ┆ 151.92  ┆ ... ┆ 0.5    ┆ 28.4    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:31 ┆ 110      ┆ 0             ┆ 156.13  ┆ ... ┆ 0.5    ┆ 28.4    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...                 ┆ ...      ┆ ...           ┆ ...     ┆ ... ┆ ...    ┆ ...     ┆ ...      ┆ ...  │

To me this makes more sense: at the load/merge stage there is no metadata that would make the bulkier netCDFs necessary for these intermediate table-like files.

I'm working on a full demo at the moment. I'd like to make use of this speedup, but if it's incompatible with existing workflows I'll make it a separate function or just keep it for internal use at VOTO

@callumrollo (Collaborator Author)

I've put together a rough notebook profiling the polars implementation against the current main branch. It shows the performance differences using both sub and raw data, as well as what the intermediate parquet products look like when loaded in polars.

https://github.com/callumrollo/pyglider_profile/blob/main/profile_pyglider.ipynb

It's pretty ugly, but it shows the main points. I can work to make it portable/repeatable if desired.

@jklymak (Member) commented Oct 7, 2022

OK 5-10x is a big deal if you have 24 Hz data. If you do this, can you update the docs to explain what a parquet file is and a quick how-to on opening them? I think we can assume pandas and xarray knowledge (we are writing to netcdf), but parquet probably needs a paragraph of explanation.

Final question would be if parquet and pandas have any packaging issues? Does parquet work with a conda install pandas on the big three OS's? (Can't be worse than netcdf ;-)

@callumrollo (Collaborator Author)

I've had a look and it turns out pandas is able to read and write parquet files, so users shouldn't ever need to interact with polars. I'll add a paragraph to the docs explaining it.

I've not encountered problems adding polars to the build so far. I'll test it out on a windows machine today to be sure though. Polars has mature builds on PyPI and conda-forge
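
For reference, opening one of the intermediate parquet products without touching polars only needs pandas plus a parquet engine such as pyarrow; the path below is the one from the earlier example:

import pandas as pd

# pandas reads parquet directly (requires pyarrow or fastparquet to be installed)
df = pd.read_parquet("sea63_35_nrt_rawnc/sea063.0035.gli.sub.0142.parquet")
print(df.head())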

@callumrollo (Collaborator Author)

No issues making conda environments with polars. I'm testing it on a range of our SeaExplorer datasets now. I'll mark this ready for review once I'm satisfied it's performing well in production.

@callumrollo marked this pull request as ready for review October 12, 2022 13:19
@callumrollo (Collaborator Author)

I've tested this pretty extensively on our datasets now. I think it's good to go. @jklymak do you have any other requested changes?

@jklymak (Member) commented Oct 12, 2022

Give me a day or two to look it over. Probably is all good.

@jklymak (Member) left a review:

This mostly looks good, just a couple of style things, and we probably need to change a top-level name.

I think we need to decide if we are used enough yet to need a deprecation cycle for removing things. I think we are probably fine to do this now, but at some point we can't change a bunch of underlying code that changes the output file type(!) without a deprecation cycle. In which case, code like this would be better as a new API and the old API kept around for users who need it, with a deprecation warning.

try:
    out = pl.read_csv(f, sep=';')
except:
    _log.warning(f'Could not read {f}')
@jklymak (Member):
Do we need to add the badfiles here?

@callumrollo (Collaborator Author):

badfiles added
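
Not the PR's exact diff, but a sketch of the pattern discussed here: files that fail to parse are collected in a badfiles list and reported at the end, rather than only being logged one by one. The `files` variable is assumed to be the list of raw SeaExplorer files.

import logging
import polars as pl

_log = logging.getLogger(__name__)

badfiles = []
for f in files:  # `files` assumed: the list of raw SeaExplorer files to merge
    try:
        out = pl.read_csv(f, sep=';')
    except Exception:
        _log.warning(f'Could not read {f}')
        badfiles.append(f)
        continue
    # ... parse timestamps and append `out` to the merged output ...

if badfiles:
    _log.warning('Some files could not be read: %s', badfiles)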

except:
    _log.warning(f'Could not read {f}')
    continue
if "Timestamp" in out.columns:
@jklymak (Member):

Can this get a comment? Why do we need this check?

@callumrollo (Collaborator Author):

This check is for corrupted files. I encountered this once in ~ 50 missions I've processed so far. I've added a comment
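
A sketch of the guard being described (the column name comes from the diff above, but the timestamp format string and the toy data are assumptions): a corrupted file can still parse as CSV while lacking the expected time column, so it is skipped rather than breaking the merge later on.

import logging
import polars as pl

_log = logging.getLogger(__name__)

# toy frame standing in for one parsed glider file; a corrupted file
# might be missing the Timestamp column entirely
out = pl.DataFrame({"Timestamp": ["01/03/2022 23:50:06"], "NavState": [117]})

if "Timestamp" in out.columns:
    # normal file: parse the time column (the format string is an assumption)
    out = out.with_column(
        pl.col("Timestamp").str.strptime(pl.Datetime, "%d/%m/%Y %H:%M:%S")
    )
else:
    _log.warning('file is missing its Timestamp column, likely corrupted; skipping')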

if ftype == 'gli':
    outx.to_netcdf(fnout[:-3] + '.nc', 'w')
if rawsub == 'raw' and dropna_subset is not None:
    out = out.with_column(out.select(pl.col(dropna_subset).is_null().cast(pl.Int64))
@jklymak (Member):

This needs a comment, and if it doesn't cause a slowdown or extra writes it would maybe benefit from unpacking into separate calls. Looks like you are dropping repeats, but it's too many steps in one line for me to follow without spending too much time ;-)

@callumrollo (Collaborator Author):

I've added a comment on this. It's a bit convoluted looking, but it's just the polars equivalent of pandas.dropna. I didn't want to change the functionality of the dropna option in this PR. We can factor it out as a separate PR though?
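
For anyone reading along who knows pandas but not polars, a rough equivalent of pandas.DataFrame.dropna(subset=..., how="all") can be built like this (a sketch with made-up column names, not the PR's exact expression): count nulls across the subset per row and keep rows with at least one value present.

import polars as pl

# made-up science columns; drop rows where *all* of them are null
dropna_subset = ["GPCTD_TEMPERATURE", "GPCTD_CONDUCTIVITY"]

out = pl.DataFrame({
    "time": [1, 2, 3],
    "GPCTD_TEMPERATURE": [10.2, None, None],
    "GPCTD_CONDUCTIVITY": [3.5, None, 3.4],
})

# per-row count of nulls within the subset
null_count = pl.lit(0)
for col in dropna_subset:
    null_count = null_count + pl.col(col).is_null().cast(pl.Int64)

# keep rows where at least one of the subset columns has a value;
# row 2 (all nulls in the subset) is dropped, rows 1 and 3 are kept
out = out.filter(null_count < len(dropna_subset))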

    post_1971 = df.filter(pl.col("time") > dt_1971)
    if len(post_1971) == len(df):
        return post_1971
    return df.filter(pl.col("time") > dt_1971)


def merge_rawnc(indir, outdir, deploymentyaml, incremental=False, kind='raw'):
@jklymak (Member):

We should probably rename this?

@callumrollo (Collaborator Author):

I've renamed this to the more descriptive drop_pre_1971_samples
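
For context, the renamed helper reduces to a time filter like the one sketched below: timestamps before 1971 (i.e. near the Unix epoch, presumably from an unset glider clock) are discarded. This is a sketch of the idea, not the verbatim function.

import datetime
import polars as pl

def drop_pre_1971_samples(df: pl.DataFrame) -> pl.DataFrame:
    # samples stamped before 1971 indicate an unset or bad realtime clock
    dt_1971 = datetime.datetime(1971, 1, 1)
    return df.filter(pl.col("time") > dt_1971)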

@jklymak (Member):

I meant rename merge_rawnc... Though that screws with older scripts, I'd rename this to merge_parquet or whatever, and then alias merge_rawnc so old scripts work.
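
A sketch of the alias idea being suggested, using the signature quoted above; wrapping the old name so it emits a DeprecationWarning is only one option if a deprecation cycle is wanted, a plain assignment works too.

import warnings

def merge_parquet(indir, outdir, deploymentyaml, incremental=False, kind='raw'):
    """Merge the per-file parquet intermediates (formerly merge_rawnc)."""
    ...

def merge_rawnc(*args, **kwargs):
    # backwards-compatible alias so older scripts keep working
    warnings.warn('merge_rawnc is deprecated, use merge_parquet instead',
                  DeprecationWarning, stacklevel=2)
    return merge_parquet(*args, **kwargs)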

@callumrollo (Collaborator Author)

I've added some more comments and used badfiles if a file fails the load. I have no idea how these changes have caused a couple of the slocum tests to start failing

@callumrollo (Collaborator Author)

OK I think we're good to go now. @jklymak has this met your requested changes? In future I'll make these kind of changes as part of a deprecation cycle as you suggest

@jklymak (Member) left a review:

Looks close, just a couple more places to clarify the code.

    post_1971 = df.filter(pl.col("time") > dt_1971)
    if len(post_1971) == len(df):
        return post_1971
    return df.filter(pl.col("time") > dt_1971)


def merge_rawnc(indir, outdir, deploymentyaml, incremental=False, kind='raw'):
@jklymak (Member):

I meant rename merge_rawnc... Though that screws with older scripts, I'd rename this to merge_parquet or whatever, and then alias merge_rawnc so old scripts work.

val = _interp_gli_to_pld(sensor_sub, sensor, val2, indctd)
coarse_ints = np.arange(0, len(sensor)/coarsen_time, 1/coarsen_time).astype(int)
sensor_sub = sensor.with_columns(pl.lit(coarse_ints).alias("coarse_ints"))
sensor_sub_grouped = sensor_sub.with_column(
@jklymak (Member):

This block needs a comment as well... It says "smooth" oxygen above, but I'm not following this dance with coarse_ints etc.

@callumrollo (Collaborator Author):

Ah that makes more sense! I've renamed it to merge_parquet and added merge_rawnc = merge_parquet, so hopefully older scripts will still work. I've also added more comments to the variable-coarsening code.
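
To unpack the coarsening for anyone following along: each raw sample gets an integer bin index such that every coarsen_time consecutive samples share the same index, and grouping by that index then averages (coarsens) the high-rate sensor. A sketch with made-up data and column name, written against the PR-era polars API:

import numpy as np
import polars as pl

coarsen_time = 8  # e.g. reduce 16 Hz legato samples by a factor of 8

# made-up high-rate sensor frame standing in for the real payload data
sensor = pl.DataFrame({"legato_temperature": np.random.rand(96)})

# one integer label per sample; every `coarsen_time` consecutive samples
# share a label, so grouping by it averages them together
coarse_ints = np.arange(0, len(sensor) / coarsen_time, 1 / coarsen_time).astype(int)
coarsened = (
    sensor.with_columns(pl.lit(coarse_ints).alias("coarse_ints"))
          .groupby("coarse_ints", maintain_order=True)
          .mean()
)
# `coarsened` now has len(sensor) / coarsen_time rows (12 here)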

@jklymak (Member) commented Nov 4, 2022

@callumrollo feel free to self-merge, perhaps squash commits that you don't need, or just squash on merge. Thanks! We are about to try this out on some datasets that have been giving us problems. Ping @hvdosser.

@richardsc (Collaborator)

We have at least one dataset that has 16Hz legato data (plus 1Hz GPCTD), which would probably benefit from this change. Pinging @clayton to try this out on that mission.

@callumrollo merged commit 0869839 into c-proof:main Nov 4, 2022
@jklymak (Member) commented Nov 4, 2022

@richardsc @clayton This is now merged and on main. It's not yet released, so you will need to install from the development branch. Let us know if that is problematic. If it gets through a few more glider setups, we should make a release that also includes the changes to the Slocum processing.

@hvdosser linked an issue Nov 16, 2022 that may be closed by this pull request
@callumrollo deleted the callum-patch-33 branch November 29, 2024 08:49
Successfully merging this pull request may close these issues: SeaExplorer delayed mode time series data loss; Slow processing of large datasets.