
Parse dates to datetime objects #1

Closed

kinverarity1 opened this issue Dec 24, 2013 · 8 comments
Labels: enhancement, las3 (stuff relating to LAS 3.0)

@kinverarity1 (Owner)

The 3.0 specification describes a datetime format code (p. 24). I need to implement it.
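
For illustration, implementing it would come down to mapping the spec's format codes onto something like Python's strptime directives (a hedged sketch; the sample string and pattern below are assumptions, not the spec's exact codes):

from datetime import datetime

# Assumption: a LAS datetime format code translated to a strptime pattern
value = datetime.strptime("22:39:06/27-Mar-2014", "%H:%M:%S/%d-%b-%Y")
print(value)  # 2014-03-27 22:39:06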

@VelizarVESSELINOV (Contributor)

I also have a LAS 2.0 file with datetime channels inside, so I hope you will provide something like Python's from __future__ import mechanism, making the functionality defined for LAS 3.0 available in LAS 2.0 files too.

~VERSION INFORMATION 
VERS  .    2.0                                     :CWLS Log ASCII Standard - VERSION 2.0
...
~CURVE INFORMATION
#MNEM           .UNIT                  API CODE            :DESCRIPTION
#----            ------          --------------            -----------------------------
TIME_1900       .d                                         :                                                        Time Index(OLE Automation date)
TIME            .s                                         :                                (1s)                    Time(hh mm ss/dd-MMM-yyyy)
...
41725.9438268634 22:39:06/27-Mar-2014 ...
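
For reference, an OLE Automation date counts fractional days since 1899-12-30, so the TIME_1900 value in the data row above lines up with the TIME column. A minimal conversion sketch (not lasio code):

from datetime import datetime, timedelta

OLE_EPOCH = datetime(1899, 12, 30)  # day zero of the OLE Automation date system

def ole_to_datetime(ole_days: float) -> datetime:
    """Convert an OLE Automation date (fractional days since 1899-12-30)."""
    return OLE_EPOCH + timedelta(days=ole_days)

print(ole_to_datetime(41725.9438268634))  # 2014-03-27 22:39:06.64...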

@kinverarity1 (Owner) commented Jun 30, 2019

It's about time to support datetimes and/or timestamps in the data section for LAS <= 2 files. At the moment the data array is a numpy.ndarray with a single common data type, so that won't work. Here are the options that I see:

  1. Use a structured ndarray with per-column dtypes specified, or a record array with curve mnemonics as keys.
  2. Require pandas and use a DataFrame.

Option 1

We would need to read the datetimes and timestamps, rather than parsing everything as a float as the current reader does:

lasio/lasio/reader.py, lines 353 to 366 at commit 692bc59:

def items(f):
    for line in f:
        for pattern, sub_str in regexp_subs:
            line = re.sub(pattern, sub_str, line)
        for item in line.split():
            try:
                yield np.float64(item)
            except ValueError:
                yield item

array = np.array([i for i in items(file_obj)])
for value in value_null_subs:
    array[array == value] = np.nan
return array

Then after the array is reshaped back in LASFile.read(), create the structured ndarray:

lasio/lasio/las.py, lines 225 to 239 at commit 692bc59:

arr = s["array"]
logger.debug('~A data.shape {}'.format(arr.shape))
if version_NULL:
arr[arr == null] = np.nan
logger.debug('~A after NULL replacement data.shape {}'.format(arr.shape))
n_curves = len(self.curves)
n_arr_cols = len(self.curves) # provisional pending below check
logger.debug("n_curves=%d ncols=%d" % (n_curves, s["ncols"]))
if wrap == "NO":
if s["ncols"] > n_curves:
n_arr_cols = s["ncols"]
data = np.reshape(arr, (-1, n_arr_cols))
self.set_data(data, truncate=False)

We would have to keep track of which columns are datetime/timestamp columns and which are not.
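
For illustration, a minimal sketch of such a structured ndarray (curve names and values are made up; assumes the per-column dtypes were already established from the ~Curves section):

import numpy as np

# Assumption: dtypes inferred from the ~Curves section before reading ~ASCII
dtype = np.dtype([
    ("DEPT", np.float64),
    ("TIME", "datetime64[s]"),  # datetime channel gets its own dtype
    ("GR", np.float64),
])

rows = [
    (1500.0, np.datetime64("2014-03-27T22:39:06"), 85.2),
    (1500.5, np.datetime64("2014-03-27T22:39:07"), 86.1),
]
arr = np.array(rows, dtype=dtype)  # structured ndarray, one dtype per column
print(arr["TIME"])                 # columns addressed by curve mnemonic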

Update: more realistic would be redoing the first function above so that it knows which dtype to expect in each column. We also somehow have to support wrapped files. To do that, the read_file_contents function would have to fully parse the ~Curves section(s) before tackling the data section(s):

def read_file_contents(file_obj, regexp_subs, value_null_subs,
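
For illustration, a dtype-aware version of the items generator might look something like this (a hypothetical sketch, not lasio code; it assumes one parser per column has been derived from the ~Curves section, and cycles through the columns so that wrapped lines still parse):

import numpy as np

def typed_items(f, column_parsers):
    """Yield one parsed value per token, cycling through the per-column parsers."""
    i = 0
    for line in f:
        for token in line.split():
            # Wrapped files still work: tokens are assigned to columns in order,
            # regardless of how many tokens sit on each physical line.
            yield column_parsers[i % len(column_parsers)](token)
            i += 1

# Hypothetical usage: typed_items(file_obj, [np.float64, parse_datetime, np.float64])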

Option 2

  • Drop all the custom data array reading code.
  • Drop the reshaping.
  • Let pandas handle everything (see the sketch below).
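
A minimal sketch of that, assuming an unwrapped, whitespace-delimited data section (the mnemonics, NULL value, and datetime format below are stand-ins):

import io
import pandas as pd

# Stand-in for an unwrapped ~ASCII data section
data_section = io.StringIO(
    "1500.0 22:39:06/27-Mar-2014 85.2\n"
    "1500.5 22:39:07/27-Mar-2014 -999.25\n"
)
df = pd.read_csv(
    data_section,
    sep=r"\s+",                    # whitespace-delimited LAS data
    header=None,
    names=["DEPT", "TIME", "GR"],  # mnemonics parsed earlier from ~Curves
    na_values=[-999.25],           # NULL from the ~Well section
)
# pandas keeps a dtype per column, so text/datetime curves survive
df["TIME"] = pd.to_datetime(df["TIME"], format="%H:%M:%S/%d-%b-%Y")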

Obviously my preference is for option 2 😄

Update: I suspect pandas would struggle with wrapped files, which I'd prefer to support with the same code as unwrapped files.

@dagrha (Collaborator) commented Jul 2, 2019

I like the idea of leveraging pandas for the LAS v2+ support. But it appears that you are right about pandas struggling with wrapped files. There doesn't seem to be any built-in pandas solution for reading records that span multiple lines (see this S.O. post for example).

Just spitballing here, but would it make any sense to have some simple heuristic based on the first few lines of the ~A section to determine if it's wrapped or not? Then the logic might be: if it's not wrapped it could go directly to pandas via read_csv. If it's wrapped, the data section could be "unwrapped" then sent to pandas.
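
For what it's worth, the unwrapping step could be as simple as regrouping the token stream (a rough sketch with a hypothetical helper; assumes the curve count is already known from the ~Curves section):

def unwrap_data_section(lines, n_curves):
    """Regroup wrapped ~A lines so each output line is one full data row."""
    values = []
    for line in lines:
        values.extend(line.split())
    return [
        " ".join(values[i:i + n_curves])
        for i in range(0, len(values), n_curves)
    ]

# e.g. unwrap_data_section(["1500.0 85.2", "43.1", "1500.5 86.1", "44.0"], 3)
# -> ["1500.0 85.2 43.1", "1500.5 86.1 44.0"]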

Maybe that's more of a rewrite than you'd want to do, and it's unclear if the benefits (e.g. pandas handles datatypes) would outweigh the issues that arise (for example I'm not sure how pandas could do what you do with the READ_SUBS to handle malformed data sections).

@kinverarity1 (Owner) commented Jul 3, 2019

Yeah... I think either way it's a biggish job. I'm warming to using a record array. It's a good chance to do some of the LAS 3 work, like reading multiple data sections and dealing with comma-delimited data sections too.

Plus it might allow solving #227

The tricky part is keeping the memory/speed usage as it is now.

@dagrha (Collaborator) commented Aug 5, 2019

This may not be the best spot to put this comment, but just to follow up on the "to pandas or not to pandas" question, this weekend I played around a bit with adding a pandas engine for parsing the data section.

Here are some benchmarks on a 28 MB (unwrapped) LAS file, comparing the default parser and this pandas one I kluged in:

default lasio parsing

%timeit las = lasio.read('example_28MB_file.las')
4.93 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%memit las = lasio.read('example_28MB_file.las')
peak memory: 182.38 MiB, increment: 98.73 MiB

pandas parsing

%timeit las = lasio.read('example_28MB_file.las', engine='pandas')
347 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%memit las = lasio.read('example_28MB_file.las', engine='pandas')
peak memory: 112.80 MiB, increment: 28.46 MiB

Admittedly this code is not production-ready, wouldn't pass all the tests, and doesn't deal with wrapped files!

But this basic test of my unoptimized code, which reads the data section with pandas read_table and converts it to a 1-D array (as the default parser does), shows some promising gains in speed (>10x) and memory usage.
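
For context, the core of that engine might look something like the following (a guess at the approach, not the actual code from the experiment):

import pandas as pd

def read_data_with_pandas(file_obj):
    """Read an unwrapped ~A section and flatten it to a 1-D array,
    matching what the default parser produces."""
    df = pd.read_table(file_obj, sep=r"\s+", header=None)
    return df.to_numpy().flatten()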

@kinverarity1 (Owner)

Thanks, that is attractive. lasio is already much too slow. I am not sure the benefits of all the substitution code outweigh its performance cost, given that so many files are unwrapped.

@ahjulstad submitted a great PR (#149) ages ago which went down this route, but I did not merge it because it came before a major refactor of how the reader &c was set up. And I was being precious about not requiring pandas. Perhaps we should get that PR up to date - or use your engine - and then implement the unwrapped reader last.

Something that needs doing before we start: rework the overall reading function so that all the header sections are fully parsed before even touching any of the data sections. That way, when parsing the data section, we know whether the data are wrapped or not, whether any columns can be expected to be non-numeric, and so on. (TL;DR: fix my horrendous LASFile.read() method.)

And, if we have separate code for parsing wrapped and unwrapped data: all tests featuring the data section need to be duplicated for both wrapped and unwrapped.

kinverarity1 pushed a commit that referenced this issue Feb 16, 2020
merging recent changes into justins branch
kinverarity1 pushed a commit that referenced this issue May 8, 2020
Rename header_only.py to header_only.las
kinverarity1 added the las3 (stuff relating to LAS 3.0) and enhancement labels and removed the enhancement label on May 11, 2020
@VelizarVESSELINOV (Contributor)

+1 for pandas.read_csv as the default engine for unwrapped files.

Reasons:

  1. I like the speed performance.
  2. Wrapped LAS files are rare and often small, so there is no big performance issue.
  3. I like the date/time and string management of pandas.
  4. Plus, no more bugs like NULL not working after a DateTime column (NULL substitutions do not work for non-numeric data sections, #261).

If needed for harmonization, I think it is possible to write an "unwrapper" for wrapped LAS files and then use the same read_csv function.

@kinverarity1 (Owner)

This has basically been implemented now in v0.30:

https://lasio.readthedocs.io/en/latest/data-section.html#handling-text-dates-timestamps-or-any-non-numeric-characters
