Parse dates to datetime objects #1
I also have a LAS file, version 2.0, that has datetime channels inside, so I hope you will have something similar in Python.
It's about time to support datetime and/or timestamps in the data section for LAS <= 2 files. At the moment the data array is a numpy.ndarray with a common data type, so it won't work. Here are the options that I see:
Option 1
We would need to read the datetimes and timestamps as floats (lines 353 to 366 in 692bc59).
Then, after the array is reshaped back in LASFile.read(), create the structured ndarray (lines 225 to 239 in 692bc59).
We would have to keep track of which columns are datetimes/timestamps and which are not. Update: more realistic would be re-doing the first function above so that it knows what dtype to expect in which column. Somehow we also have to support wrapped files. To do that, the read_file_contents function would have to fully parse the Curves section(s) before tackling the data section(s) (line 224 in f369cc0).
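Option 1 above could be sketched roughly as follows. This is a hypothetical illustration, not lasio's code: the row format, curve names, and the choice of POSIX-timestamp floats as the intermediate representation are all assumptions.

```python
# Sketch of Option 1 (hypothetical, not lasio's implementation):
# read datetime columns as floats first, then rebuild a structured
# ndarray with per-column dtypes once the shape is known.
import datetime
import numpy as np

def parse_row(tokens, datetime_cols, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert one row of string tokens to floats; datetime columns
    become seconds since the epoch so everything fits in float64."""
    epoch = datetime.datetime(1970, 1, 1)
    out = []
    for i, tok in enumerate(tokens):
        if i in datetime_cols:
            dt = datetime.datetime.strptime(tok, fmt)
            out.append((dt - epoch).total_seconds())
        else:
            out.append(float(tok))
    return out

rows = [
    ["2020-01-01 00:00:00", "1.5"],
    ["2020-01-01 00:01:00", "2.5"],
]
flat = np.array([parse_row(r, datetime_cols={0}) for r in rows])

# After reshaping, rebuild a structured array so the datetime
# column gets its own dtype (curve names here are made up):
structured = np.empty(len(flat), dtype=[("TIME", "datetime64[s]"), ("GR", "f8")])
structured["TIME"] = flat[:, 0].astype("datetime64[s]")
structured["GR"] = flat[:, 1]
```

The two-pass design keeps the first pass a plain float read, matching the existing homogeneous-dtype code path, and defers the dtype bookkeeping to the end.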
Option 2
Update: pandas would struggle, I suspect, with wrapped files, which I'd prefer to support with the same code as unwrapped.
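For the unwrapped case, Option 2 (delegating to pandas) might look like the sketch below. The data section, curve names, and null value are illustrative assumptions, not taken from any real file:

```python
# Sketch of Option 2: hand an unwrapped ~A section to pandas.read_csv,
# which handles per-column dtypes (including datetimes) natively.
import io
import pandas as pd

# Hypothetical unwrapped data section: depth, a datetime curve, a
# numeric curve (names and values are made up for illustration).
data_section = """\
1670.0 2020-01-01T00:00:00 123.4
1670.5 2020-01-01T00:01:00 125.1
"""

df = pd.read_csv(
    io.StringIO(data_section),
    sep=r"\s+",
    header=None,
    names=["DEPT", "TIME", "GR"],
    parse_dates=["TIME"],
    na_values=[-999.25],  # a commonly used LAS null value
)
```

The point of the approach is that `parse_dates` and `na_values` replace the custom substitution/conversion code for the unwrapped case.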
I like the idea of leveraging pandas for the LAS v2+ support. But it appears that you are right about pandas struggling with wrapped files: there doesn't seem to be any pandas built-in solution for reading rows that span multiple newlines (see this S.O. post for example). Just spitballing here, but would it make any sense to have some simple heuristic based on the first few lines of the ~A section to determine whether it's wrapped or not? Then the logic might be: if it's not wrapped, it goes directly to pandas via read_csv; if it's wrapped, the data section gets "unwrapped" and then sent to pandas. Maybe that's more of a rewrite than you'd want to do, and it's unclear whether the benefits (e.g. pandas handles datatypes) would outweigh the issues that arise (for example, I'm not sure how pandas could do what you do with the READ_SUBS to handle malformed data sections).
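The suggested heuristic could be as simple as comparing token counts on the first few data lines against the number of curves declared in the ~C section. A minimal sketch, assuming whitespace-delimited data (the function name and threshold are made up):

```python
def looks_wrapped(data_lines, n_curves):
    """Crude heuristic (an assumption, not lasio's actual logic):
    in an unwrapped file every data line carries one value per curve,
    so if any of the first few lines carry fewer tokens than there
    are curves, the section is probably wrapped."""
    sample = [line.split() for line in data_lines[:5] if line.strip()]
    return any(len(tokens) < n_curves for tokens in sample)

# A 4-curve file whose rows span two physical lines looks wrapped:
wrapped = looks_wrapped(["1670.0 123.4", "45.2 0.8"], n_curves=4)
# A file with one full row per line does not:
unwrapped = looks_wrapped(["1670.0 123.4 45.2 0.8"], n_curves=4)
```

A real implementation would also have to cope with comment lines and files shorter than the sample window, but the branch "unwrapped goes straight to read_csv, wrapped gets flattened first" only needs a boolean like this.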
Yeah... I think either way it's a biggish job. I'm warming to using a record array. It's a good chance to do some of the LAS 3 stuff, like reading multiple data sections and dealing with comma-delimited data sections too. Plus it might allow solving #227. The tricky part is keeping the memory/speed usage as it is now.
This may not be the best spot to put this comment, but just to follow up on the "to pandas or not to pandas" question: this weekend I played around a bit with adding a pandas engine for parsing the data section. Here are some benchmarks on a 28 MB (unwrapped) LAS file, comparing the default parser and the pandas one I kluged in:

default lasio parsing: [benchmark output not captured]

pandas parsing: [benchmark output not captured]

Admittedly this code is not production ready, wouldn't pass all tests, and doesn't deal with wrapped files! But this was a basic test on my unoptimized code to read the data section with pandas.
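The benchmark outputs themselves did not survive in this thread, but the comparison can be reproduced in miniature with a harness like the one below. Everything here is illustrative: a small synthetic data section stands in for the 28 MB file, and the two functions stand in for lasio's default reader and the pandas engine.

```python
# Toy benchmark harness (not the code behind the numbers above):
# compare a numpy-based text read against pandas.read_csv on the
# same whitespace-delimited data.
import io
import timeit

import numpy as np
import pandas as pd

# Synthetic 3-column data section; the real benchmark used a
# 28 MB LAS file on disk.
rows = "\n".join(f"{i * 0.5:.1f} {100 + i} {200 + i}" for i in range(1000))

def with_numpy():
    return np.loadtxt(io.StringIO(rows))

def with_pandas():
    return pd.read_csv(io.StringIO(rows), sep=r"\s+", header=None).to_numpy()

t_np = timeit.timeit(with_numpy, number=10)
t_pd = timeit.timeit(with_pandas, number=10)
# Relative speed depends on file size and machine; the important
# check is that both parsers agree on the parsed values.
```

On large files pandas' C parser typically wins, which is the motivation for the engine experiment above, but the exact ratio varies.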
Thanks. That is attractive. lasio is already much too slow. I am not sure that the benefits of all the substitution code outweigh the performance gains, given that so many files are unwrapped. @ahjulstad submitted a great PR (#149) ages ago which went down this route, but I did not merge it because it came before a major refactor of how the reader &c. was set up, and I was being precious about not requiring …

An issue that needs doing before we start: rework the overall reading function so that all the header sections are fully parsed before even touching any of the data sections. That way we know whether they are wrapped or not, whether any columns can be expected to be non-numeric, and so on, when parsing the data section. (TLDR: fix my horrendous …)

And, if we have separate code for parsing wrapped and unwrapped data: all tests featuring the data section need to be duplicated for both wrapped and unwrapped.
+1 for pandas.read_csv as the default engine for non-wrapped files. Reasons:
If needed for harmonization, I think it is possible to write an "unwrapper" for wrapped LAS files and then use the same read_csv function.
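Such an "unwrapper" could be very small if we accept a naive token-flattening approach. This is a sketch under that assumption (real LAS wrapping rules, e.g. the depth value starting each logical row, would need more care), and the function name is made up:

```python
def unwrap(data_lines, n_curves):
    """Naive unwrapper sketch (assumed behaviour, not lasio's
    implementation): flatten all tokens across physical lines, then
    regroup them so each logical row has one value per curve."""
    tokens = " ".join(data_lines).split()
    rows = [tokens[i:i + n_curves] for i in range(0, len(tokens), n_curves)]
    return [" ".join(row) for row in rows]

# Two logical 3-curve rows, each wrapped over two physical lines:
wrapped = ["1670.0 123.4", "45.2", "1670.5 125.1", "46.0"]
unwrapped = unwrap(wrapped, n_curves=3)
# unwrapped → ["1670.0 123.4 45.2", "1670.5 125.1 46.0"]
```

The attraction of this route is exactly the harmonization mentioned above: after unwrapping, wrapped and unwrapped files go through the same read_csv code path.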
This has basically been implemented now in v0.30.
The 3.0 specification describes a datetime format code (p. 24). I need to implement it.
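Implementing that would mean translating a LAS 3.0-style date format code into something Python's strptime understands. The mapping below is an illustrative assumption, not the spec's authoritative code table (see p. 24 of the LAS 3.0 specification for that):

```python
# Sketch: translate an assumed LAS 3.0-style date format code
# (e.g. "YYYY/MM/DD hh:mm:ss") into a strptime pattern. The token
# mapping is hypothetical; the real codes are defined in the spec.
import datetime

_MAP = [
    ("YYYY", "%Y"),  # 4-digit year
    ("MM", "%m"),    # 2-digit month (replaced before "mm")
    ("DD", "%d"),    # 2-digit day
    ("hh", "%H"),    # 24-hour clock
    ("mm", "%M"),    # minutes
    ("ss", "%S"),    # seconds
]

def format_code_to_strptime(code):
    for las_token, py_token in _MAP:
        code = code.replace(las_token, py_token)
    return code

fmt = format_code_to_strptime("YYYY/MM/DD hh:mm:ss")
value = datetime.datetime.strptime("2020/01/02 03:04:05", fmt)
```

Note the replacement order matters: the uppercase month token must be rewritten before the lowercase minutes token, since both are a doubled "m" character in one case.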