stdlib_experimental_io(open): support for unformatted sequential files #86

Closed

jvdp1
Member

@jvdp1 jvdp1 commented Jan 5, 2020

Adds support for opening unformatted sequential files.

@jvdp1
Member Author

jvdp1 commented Jan 5, 2020

Possibilities for:

  • access= "direct | sequential | stream"
  • form= "formatted | unformatted"

Now supported by open in stdlib_experimental_io

  • sequential + formatted (text files; t)
  • sequential + unformatted ("traditional" Fortran binary files?; u)
  • stream + unformatted (stream files; b or s)

What about direct access? How should it be supported (if needed)?

@milancurcic
Member

Couldn't we cover all reading and writing with only stream access?

Do we need sequential and direct access? I thought they are edge cases covered under the more general stream access.

@jvdp1
Member Author

jvdp1 commented Jan 5, 2020

> Couldn't we cover all reading and writing with only stream access?
>
> Do we need sequential and direct access? I thought they are edge cases covered under the more general stream access.

Personally I don't use "sequential + unformatted" and "direct + unformatted". But they seem to be used by some people: https://github.com/fortran-lang/stdlib/wiki/Usage-of-%22open%22

So I think it would be good to at least support "sequential+unformatted" (this PR). With these 3 options (t, s|b, and u) we would cover, let's say, 95(?)% of the (simple) open cases.

Note: a sequential unformatted file can be read as a stream unformatted file, provided the reader accounts for the specifics of the sequential unformatted layout (the record markers). Not sure about a direct unformatted file.
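
A minimal sketch of the note above: one record is written with sequential unformatted access and then read back through stream access. It assumes the compiler wraps each record in a 4-byte length marker before and after the payload (the gfortran and ifort defaults); the file name `record.bin` is made up for the illustration, and other compilers or flags may use a different layout.

```fortran
program seq_as_stream
  use, intrinsic :: iso_fortran_env, only: int32
  implicit none
  integer :: u
  integer(int32) :: a(3), b(3), head, tail

  a = [1, 2, 3]

  ! Write one record with sequential unformatted access.
  open(newunit=u, file='record.bin', status='replace', action='write', &
       access='sequential', form='unformatted')
  write(u) a
  close(u)

  ! Re-open the same file as a raw byte stream.
  ! ASSUMPTION: each record is framed by 4-byte length markers.
  open(newunit=u, file='record.bin', status='old', action='read', &
       access='stream', form='unformatted')
  read(u) head   ! leading marker: payload length in bytes
  read(u) b      ! the payload itself
  read(u) tail   ! trailing marker: repeats the length
  close(u)

  print *, 'markers:', head, tail
  print *, 'payload matches:', all(a == b)
end program seq_as_stream
```

If the marker assumption holds, both markers should equal the payload size in bytes and the payload comparison should print T.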

@certik
Member

certik commented Jan 5, 2020

I use unformatted sometimes. The advantage is that it lets you quickly save large arrays from a simulation, which can later be post-processed by another Fortran code (compiled with the same compiler, of course). Stream might be similarly fast (I don't know if it's as fast as unformatted on all platforms).

I agree we should support text (t), binary (b) and unformatted (u).

I personally would not designate both b and s for binary stream. I would only use b, as in Python. @jvdp1 in your opinion, what is the advantage of allowing two characters s and b to do exactly the same?

Member

@certik certik left a comment

I think it looks great. Thanks!

+1 to merge this PR.

@jvdp1
Member Author

jvdp1 commented Jan 5, 2020

> I personally would not designate both b and s for binary stream. I would only use b, as in Python. @jvdp1 in your opinion, what is the advantage of allowing two characters s and b to do exactly the same?

There exists a non-standard form=binary. So using b for stream may be confusing. Mentioning both may clarify that b is used for unformatted stream files.
If people disagree with that, I can remove the s. Or we can keep it and not advertise it. I won't be difficult about it.

@certik
Member

certik commented Jan 5, 2020 via email

@jvdp1
Member Author

jvdp1 commented Jan 5, 2020

@milancurcic Is it fine to keep (and merge) it as it is now implemented?

@milancurcic
Member

I think this API is problematic. Will write in more detail tonight.

@milancurcic
Member

milancurcic commented Jan 5, 2020

Here's the problem in my view: This PR mixes up form (text/formatted or binary/unformatted) and access (sequential, direct, or stream).

form is important for the user and should be part of the API. When you want text, you use t in mode (or leave it out because it's default). When you want binary, you use b in mode.

Read/write/readwrite is also important for the user and should be part of the API. When you want to read- or write-only, you use r or w in the mode, respectively. r+ for read+write. a for append. We have this in master and so far we're conveniently mapping to Python's API.

access, however, merely specifies how you're reading or writing under the hood. Part of the reason why Fortran's I/O is so complicated is that the user also has to choose access. And the access that you choose changes how read and write statements work:

  • formatted+sequential is mostly okay because each read is one record (but still not my favorite -- I think we can do better by defaulting to formatted+stream)
  • unformatted+sequential is problematic because it's not portable (record separator is compiler-dependent);
  • unformatted+direct is useful (I use it also) for scenarios that @certik mentioned above, but you need to specify record length recl, and this is also compiler dependent (gfortran and ifort, one counts it as # of elements, the other one # of bytes, I always forget which one, it's a mess);

> I agree we should support text (t), binary (b) and unformatted (u).

Unformatted == binary. We don't need u!! :)

My main point being, sequential or direct access modes, while useful, are specific ways of reading and writing that can more generally be done by stream. I don't think we should expose them in the API.

I understand and agree that these are useful and there are projects using them. However, I don't think anybody's gonna take somebody else's binary files written in sequential mode and try to read them using stdlib (and if they do, they can read them as long as they know what the records mean).

We should aim to design a clean API with one recommended way of reading and writing. I suggest that we take this to the drawing board in #14 and sketch out the API that we want, and beyond just open -- we should sketch out what should read and write look like.

@milancurcic
Member

Designing the API for read and write functions will guide whether open('somefile.txt', 'rt') should open in formatted+sequential or formatted+stream. In the former we'd be working with records and in the latter we'd be working with bytes. It's quite possible that we'll find that the latter will lead to a cleaner API for read and write.

@certik
Member

certik commented Jan 6, 2020

@milancurcic if I understood what you wrote, you are proposing to keep the rwa modes and to also keep the two tb modes for text / binary. But beyond that you would not expose anything else, or perhaps expose it as other arguments to open, but not in the mode.

To be honest, the various combinations are so complicated that I only use the t mode, and then I use both stream, if I want to write binary files that are compiler independent (for example the PPM binary reader/writer uses stream), and "unformatted", which is compiler dependent.

As long as we naturally capture 95% of all use cases, then I think that's good enough.

However, when you use b in Python, that means binary that is compiler independent. So the only thing that corresponds to it in Fortran is the "stream" approach, even though it is technically "access". So I don't think it would make sense to use b for "unformatted" non-stream (sequential), because that would not be compiler independent.

In other words, this Python like API does not directly map to the "form" / "access" fields in Fortran. Rather, the idea that I had was to pick such combinations of "form", "access" and other parameters, so that the result is pretty much what you would expect when coming from Python. So there would always be combinations of open statement arguments that one cannot do in Python. But by exposing what Python does together with u for unformatted sequential, this would cover pretty much all practical use cases. And if you wanted some other combination, you can still use the original open statement.

@certik
Member

certik commented Jan 6, 2020

Let's discuss some particular example. Using the current master:

character(:), allocatable :: filename
integer :: u, a(3)

! Test mode "w"
u = open(filename, "w")
write(u, *) 1, 2, 3
close(u)

! Test mode "r"
u = open(filename, "r")
read(u, *) a
call assert(all(a == [1, 2, 3]))
close(u)

The second open call uses the rt mode, which currently means formatted and sequential.

If you instead opened in formatted and stream, what would have to change in the above code to read the array "a" properly? What exactly is the difference between formatted/sequential and formatted/stream?

@milancurcic
Member

milancurcic commented Jan 6, 2020

@certik Exactly!

In the API, expose what maps to Python's API, which is what we already have. rwa+ and tb modes map to action and form, respectively. access is an internal, Fortran-specific thing, which I don't think should be part of the user interface. At least we have an opportunity here to not expose it.

Regarding the internal implementation, I suggest that we always open with access=stream, for both formatted (t) and unformatted (b) modes, and design read and write around that. Then we're working directly with bytes rather than records which are a historical Fortran artifact. I think this will lead to the simplest (and closest to Python) API for read and write functions.
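
To make the suggestion concrete, here is a rough sketch of how a mode string could map to action and form while access stays fixed at stream. The helper `mode_to_specifiers` and the file name `mode_demo.txt` are hypothetical and are not the code in stdlib_experimental_io; status handling and the positioning needed for append mode are left out.

```fortran
program mode_mapping
  implicit none
  character(:), allocatable :: act, frm
  integer :: u, a(3)

  ! "wt": write a text file; access is always stream in this sketch.
  call mode_to_specifiers('wt', act, frm)
  open(newunit=u, file='mode_demo.txt', status='replace', &
       action=act, form=frm, access='stream')
  write(u, *) 1, 2, 3
  close(u)

  ! "r": t is the default, so this reads the same text file.
  call mode_to_specifiers('r', act, frm)
  open(newunit=u, file='mode_demo.txt', status='old', &
       action=act, form=frm, access='stream')
  read(u, *) a
  close(u)
  print *, a

contains

  ! Hypothetical helper: rwa+ selects action, t/b selects form.
  ! Append mode would also need position='append', omitted here.
  subroutine mode_to_specifiers(mode, action_, form_)
    character(*), intent(in) :: mode
    character(:), allocatable, intent(out) :: action_, form_

    if (index(mode, '+') > 0) then
      action_ = 'readwrite'
    else if (index(mode, 'w') > 0 .or. index(mode, 'a') > 0) then
      action_ = 'write'
    else
      action_ = 'read'
    end if

    if (index(mode, 'b') > 0) then
      form_ = 'unformatted'
    else
      form_ = 'formatted'
    end if
  end subroutine mode_to_specifiers

end program mode_mapping
```

The point of the sketch is only that access never appears in the caller's code; the user picks action and form through the mode string, as in Python.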

@milancurcic
Member

> If you instead opened in formatted and stream, what would have to change in the above code to read the array "a" properly? What exactly is the difference between formatted/sequential and formatted/stream?

I don't know the answer and I'll need to play with it.

@milancurcic
Member

For text (form='formatted') sequential and stream access seem to be completely interchangeable, at least in this case (default formatting and integer array):

Sequential version:

integer :: u
integer :: a(3) = [1, 2, 3]
integer :: b(3) = 0

open(newunit=u, file='somefile.txt', status='unknown', &
     action='write', access='sequential', form='formatted')
write(u, *) a
close(u)

open(newunit=u, file='somefile.txt', status='old', &
     action='read', access='sequential', form='formatted')
read(u, *) b
close(u)

print *, all(a == b)

end

Stream version:

integer :: u
integer :: a(3) = [1, 2, 3]
integer :: b(3) = 0

open(newunit=u, file='somefile.txt', status='unknown', &
     action='write', access='stream', form='formatted')
write(u, *) a
close(u)

open(newunit=u, file='somefile.txt', status='old', &
     action='read', access='stream', form='formatted')
read(u, *) b
close(u)

print *, all(a == b)

end

@certik
Member

certik commented Jan 6, 2020

> If you instead opened in formatted and stream, what would have to change in the above code to read the array "a" properly? What exactly is the difference between formatted/sequential and formatted/stream?
>
> I don't know the answer and I'll need to play with it.

I tried this patch:

diff --git a/src/stdlib_experimental_io.f90 b/src/stdlib_experimental_io.f90
index f6e4a50..b3a115c 100644
--- a/src/stdlib_experimental_io.f90
+++ b/src/stdlib_experimental_io.f90
@@ -332,7 +332,7 @@ end select
 
 select case (mode_(3:3))
 case('t')
-    access_='sequential'
+    access_='stream'
     form_='formatted'
 case('b', 's')
     access_='stream'

and I can't see any difference... Tests still pass, etc.

So maybe we can just use stream everywhere for access and that's it. In which case the current master is already what is needed.

Then codes that use another "access", such as sequential, will continue using the built-in open statement. And it might be that switching to "stream" has essentially no downsides, in which case they can use stdlib's open.

@milancurcic
Member

I'll experiment some more. This is a simple case. I'm curious if sequential and stream treat new lines in the same way.

In sequential mode, each read or write statement is one record and newline is inserted at the end. So if you do 3 reads, you read 3 lines. I'm not sure if stream works the same way. At the same time, I don't think we're necessarily looking for something to behave exactly like sequential. What matters is that we understand how stream works, and that we know that it can do what we need.
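
One way to check the newline question is a small experiment along these lines. It is a sketch, with the file name `lines.txt` chosen for the illustration: three write statements on a formatted stream unit, followed by three read statements, each expected to consume one line.

```fortran
program stream_lines
  implicit none
  integer :: u, i, v(3)

  open(newunit=u, file='lines.txt', status='replace', action='write', &
       access='stream', form='formatted')
  do i = 1, 3
    write(u, '(i0)') 10*i   ! each write ends its record with a newline
  end do
  close(u)

  open(newunit=u, file='lines.txt', status='old', action='read', &
       access='stream', form='formatted')
  do i = 1, 3
    read(u, *) v(i)         ! each read consumes one line
  end do
  close(u)

  print *, v
  if (any(v /= [10, 20, 30])) error stop 'unexpected record handling'
end program stream_lines
```

Formatted stream files are still organized in records, so the program should finish without hitting the error stop, which matches the sequential behaviour described above.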

@certik
Member

certik commented Jan 6, 2020

Well, I was really hoping that our open would work great with the built-in read and write. That would be the best scenario. Given how close we got, it seems it would be worth it.

For the OO interface, there you don't have to be compatible with any built-ins. So one can indeed design it in any way you like. Then we can provide all the necessary functions in the low level API. This open was the first one that I could think of. There might be more.

@milancurcic
Member

You're still compatible with built-in read and write statements. They'd just be reading/writing in stream mode rather than sequential.

I don't argue for stream because I love it, but rather because I think sequential access makes for more awkward behavior and API.

@certik
Member

certik commented Jan 6, 2020

Let's gain some experience with this, I need to see the details.

@jvdp1
Member Author

jvdp1 commented Jan 6, 2020

I see @milancurcic 's point.

Regarding unformatted files, I don't see what the advantages of unformatted sequential and unformatted direct are over unformatted stream. Indeed, all files are "binary" files. unformatted sequential and unformatted direct need 8 additional bytes per record compared to unformatted stream. And if "direct access" is needed, the pos= specifier can be used with unformatted stream (maybe with some restrictions for write). An additional advantage of unformatted stream is its interoperability with C binary streams (I mainly use unformatted stream for this specific advantage).
The only reason to support unformatted sequential and unformatted direct in stdlib open is backward compatibility, since stream access was only introduced in the Fortran 2003 standard.
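
As an illustration of the direct-access-like use of unformatted stream mentioned above, here is a minimal sketch that reads a single array element through the pos= specifier. The file name `values.bin` and the choice of the seventh element are arbitrary.

```fortran
program stream_pos
  use, intrinsic :: iso_fortran_env, only: sp => real32, file_storage_size
  implicit none
  integer, parameter :: n = 10
  integer :: u, i, unit_per_elem
  real(sp) :: x(n), seventh

  x = [(real(i, sp), i = 1, n)]
  ! Size of one element in file storage units (usually bytes).
  unit_per_elem = storage_size(x) / file_storage_size

  open(newunit=u, file='values.bin', status='replace', action='write', &
       access='stream', form='unformatted')
  write(u) x
  close(u)

  open(newunit=u, file='values.bin', status='old', action='read', &
       access='stream', form='unformatted')
  ! pos= is 1-based, so element k starts at 1 + (k-1)*unit_per_elem.
  read(u, pos=1 + 6*unit_per_elem) seventh
  close(u)

  print *, seventh
  if (seventh /= x(7)) error stop 'wrong element read'
end program stream_pos
```

Unlike direct access, no recl is needed; the caller computes the offset from the element size.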

Regarding formatted files, I will only consider sequential and stream. From a simple example (see below; same results with gfortran and ifort), writing with formatted sequential and formatted stream produces bit-wise identical files. So it seems that sequential and stream treat newlines in the same way.
A nice feature of formatted stream is that data-driven record termination in the style of C text streams is allowed (following MRC). However, I couldn't get a file bit-wise identical to those from the two other options.

Here are some explanations by @zbeekman.

So my proposal is to close this PR, and possibly to change access = sequential to access = stream. Both text and binary files will be supported by our open (and it will cover 100% of my needs ;) ).

Then we can open a new issue (or use #14) to discuss whether we want to support sequential and direct access with our open function.
If we want to support all access modes, an easy approach would be to add a new optional argument to our open function:

... function open(filename, mode, iostat, access)

where mode = r|w|a|x|t|b|+ (or any combination) and access = stream,seq,dir, with stream being the default access.
With such an approach, it is still a Python-like API. IMHO this would still be a clean API with one recommended way of reading and writing.

program iofortran
 use, intrinsic:: iso_fortran_env, only: sp => real32
 implicit none
 integer::i,n,un,length
 real(sp)::r(3),rs(3)
 character(:),allocatable :: filename,cdummy

 n = 4
 rs = [ 1.1, 1.2, 1.3 ]

 filename='test.fseq'
 print*,'Formatted sequential: '//trim(filename)
 open(newunit=un,file=filename,status='replace',action='write'&
      ,form='formatted',access='sequential')
 r=rs
 do i=1,n
  write(un,'(*(f0.5,1x))')r
  r=r+1.
 enddo
 close(un)

 filename='test.fstr'
 print*,'Formatted stream: '//trim(filename)
 open(newunit=un,file=filename,status='replace',action='write'&
      ,form='formatted',access='stream')
 r=rs
 do i=1,n
  write(un,'(*(f0.5,1x))')r
  r=r+1.
 enddo
 close(un)

 filename='test.fstr.newline'
 print*,'Formatted stream: '//trim(filename)
 open(newunit=un,file=filename,status='replace',action='write'&
      ,form='formatted',access='stream')
 r=rs
 do i=1,n,2
  print*,i
  write(un,'(3(f0.5,1x),a,3(f0.5,1x))')r,new_line(cdummy),r+1.
  r=r+2.
 enddo
 close(un)

end program

$ md5sum test.fs*
202f479b02a8ecbab6ed2b775efd055f  test.fseq
202f479b02a8ecbab6ed2b775efd055f  test.fstr
a7328de08c49c61393f897a4bbafcf4c  test.fstr.newline

@certik
Member

certik commented Jan 6, 2020

I agree. Let's use stream. Also function open(filename, mode, iostat, access) is a good idea --- it will allow to port pretty much any code out there, and it still simplifies the API.

@certik certik closed this Jan 6, 2020
@jvdp1
Member Author

jvdp1 commented Jan 6, 2020

> Let's use stream.

I will open a PR to modify sequential to stream. So it will be fixed.

> Also function open(filename, mode, iostat, access) is a good idea --- it will allow to port pretty much any code out there, and it still simplifies the API.

Should we discuss this API in #14? Or implement it and open a PR? What would be the best strategy so that many people can discuss it?

@certik
Member

certik commented Jan 6, 2020

> I will open a PR to modify sequential to stream.

I just did in #90, sorry about that.

> Should we discuss this API in #14? Or implementing it and opening a PR? What would be the best strategy such that many people can discuss it?

I would send a PR, so that we can discuss the actual code and an API, and we can comment at #14 to discuss this at the PR.

@jvdp1
Member Author

jvdp1 commented Jan 6, 2020

> I would send a PR, so that we can discuss the actual code and an API, and we can comment at #14 to discuss this at the PR.

I will start on it, if ok for you

@certik
Member

certik commented Jan 6, 2020

> I will start on it, if ok for you

Yes, thank you!

@milancurcic
Member

> Also function open(filename, mode, iostat, access) is a good idea --- it will allow to port pretty much any code out there, and it still simplifies the API.

I think this is okay. I'm still skeptical that access will be useful, but at least it will be an optional parameter. It doesn't hurt for now.

@jvdp1
Member Author

jvdp1 commented Jan 6, 2020

See #91 for discussion and implementation of access as an optional argument in open.
