RFC: Modular Filesystems C API #101

Merged (16 commits into tensorflow:master), Jun 7, 2019

Conversation

@mihaimaruseac (Contributor) commented May 6, 2019

Feedback period will be open until 2019-05-27

Modular Filesystems C API

| Status        | Accepted                                   |
| :------------ | :----------------------------------------- |
| **Author(s)** | Mihai Maruseac (mihaimaruseac@google.com)  |
| **Sponsor**   | Gunhan Gulsoy (gunan@google.com)           |
| **Updated**   | 2019-05-06                                 |

Summary

Define the API for the plugin-based implementation of filesystems under modular TensorFlow (as proposed in #77).

A major module of modular TensorFlow covers all operations regarding files and the filesystem. Saving models, checkpointing, and reading files as input to models (e.g., images, audio, etc.), among many other file manipulations, are all tasks that TensorFlow has to support.

While users can rely on the underlying operating system's API (such as calling fopen and friends in POSIX environments), a cross-platform, cross-environment API that works everywhere is clearly preferable. TensorFlow provides such an API at the moment, but it compiles support for all known filesystems into the final binary. We propose a new design where all filesystems are implemented as plugins and only the required ones are loaded at runtime. To do so, we first need to design a C API for the filesystems and then integrate it with the rest of the modularization effort.
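
To make the plugin idea concrete, here is a minimal sketch of what runtime registration could look like; the names (`TF_FilesystemOps`, `TF_RegisterFilesystemPlugin`) are hypothetical placeholders, not the API defined by this RFC.

```cpp
// Hypothetical sketch only; names are illustrative placeholders.
// A filesystem plugin is a shared object that, when loaded at runtime,
// hands core TensorFlow a table of function pointers for one URI scheme.
typedef struct TF_FilesystemOps TF_FilesystemOps;  // table of operations

// Exported by core; called by each plugin after it is dlopen()-ed, so only
// the filesystems actually needed end up registered in the process.
void TF_RegisterFilesystemPlugin(const char* scheme,
                                 const TF_FilesystemOps* ops);
```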

@mihaimaruseac mihaimaruseac changed the title Propose an RFC for filesystem plugins under modular TensorFlow Filesystem plugins under modular TensorFlow May 6, 2019
@alextp (Contributor) left a comment:

Thanks for sending the RFC!

Overall I think two things are confusing:

  1. it's not clear which parts of the existing system users of the APIs proposed here need to understand, and you go into a lot of detail about the file APIs in TF. I think you should fork that documentation off into a separate doc.
  2. the key detail of how to handle versioning and upgrading is buried, and you show more than one implementation of the core structs. I think it'd be simpler if you only showed the intended final implementation of each struct (assuming you go with making the structs public, which I advise you to avoid; instead, expose them only through functions that allocate/free and set/get the individual fields).

@ewilderj ewilderj changed the title Filesystem plugins under modular TensorFlow RFC: Filesystem plugins under modular TensorFlow May 6, 2019
@ewilderj ewilderj changed the title RFC: Filesystem plugins under modular TensorFlow RFC: Modular Filesystems C API May 6, 2019
@ewilderj ewilderj added the RFC: Proposed RFC Design Document label May 6, 2019
mihaimaruseac and others added 2 commits May 6, 2019 13:38
The RFC files got moved to a new directory, but the text was still using the old path.
```cpp
// Operations on a TF_RandomAccessFile
typedef struct TF_RandomAccessFileOps {
  size_t (*const Read)(const TF_RandomAccessFile*, uint64, size_t, char*,
                       TF_Status*);
  // ... (diff excerpt truncated; the comments below anchor on Read)
} TF_RandomAccessFileOps;
```
A member commented:

@mihaimaruseac One very useful method for a random access file would be Size or GetSize, returning the size of the file. At the moment, the file size is not visible from RandomAccessFile, which is really inconvenient in many situations.

The reason is that many file formats actually keep their metadata at the end of the file (not at the beginning). One good (and important) example is the ZIP file: its metadata is stored at the end of the file (the central directory), so any parser opening a ZIP file has to find the file size before processing.

With a GetSize (or Size) API, RandomAccessFile could be greatly optimized.

Note that technically it is possible to find the file size by probing with Read until EOF is reached, but this is really not efficient.

Of course, some filesystems may not be able to provide a size beforehand (e.g., streaming ones). I think by default GetSize could simply return NotImplemented, so any file-size-based optimization would only be "best effort".

For optimization, if a GetSize/Size API is exposed, reading files could try GetSize first. If NotImplemented is returned, fall back to the normal approach of probing through to the end. Otherwise the size of the file can be used directly.

Note: ZIP is the format mentioned here, but a lot of common files are actually ZIP archives with a [Content_Types].xml or META-INF entry inside.

For example, Microsoft Office files (pptx, docx, etc.) are all ZIP files, Java jar files are ZIP files, and Python whl files are ZIP files as well.

You can actually just run unzip on a pptx, jar, or whl and see the individual content files.
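
A minimal sketch of the "best effort" fallback described in this comment; `GetSize` is the proposed (hypothetical) addition, and the helper below otherwise assumes the existing `RandomAccessFile::Read` signature.

```cpp
// Sketch: try the (hypothetical) GetSize fast path, else probe to EOF.
size_t FileSize(RandomAccessFile* file) {
  size_t size = 0;
  if (file->GetSize(&size).ok()) return size;  // filesystem knows the size
  // NotImplemented (or any error): fall back to counting bytes until EOF.
  char buf[4096];
  StringPiece chunk;
  Status read_status;
  do {
    read_status = file->Read(size, sizeof(buf), &chunk, buf);
    size += chunk.size();  // count even a partial final chunk
  } while (read_status.ok() && !chunk.empty());
  return size;
}
```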

@mihaimaruseac (Contributor, Author) replied:

I agree with this request. I only included the API calls that are present in the existing classes, but I guess we can expand these. I'm going to add a new section on proposed additions to the op tables.

@mihaimaruseac (Contributor, Author) commented May 7, 2019:

Actually, one of the arguments of Read is the offset from where you want to read and another is the number of bytes to read. So, if reading a zip file, you can go directly to the end of the file and read the size section from there.

Sure, we can add it as a convenience function, but we only identify the filesystem plugin based on the scheme part of the URI, not the extension (hdfs://path/to/file.zip and /path/to/file.zip will use different implementations, but /path/to/file.zip and /path/to/another/file.txt will use the same ops).
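
For illustration, a small sketch of the scheme-based dispatch described in this reply; the helper name is hypothetical.

```cpp
#include <string>

// Hypothetical helper: dispatch is keyed on the URI scheme only, never on
// the file extension.
std::string SchemeOf(const std::string& uri) {
  auto pos = uri.find("://");
  return pos == std::string::npos ? "" /* local filesystem */
                                  : uri.substr(0, pos);
}

// SchemeOf("hdfs://path/to/file.zip")   == "hdfs" -> HDFS plugin ops
// SchemeOf("/path/to/file.zip")         == ""     -> local filesystem ops
// SchemeOf("/path/to/another/file.txt") == ""     -> same local ops
```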

The member replied:

@mihaimaruseac while it is possible to read the content with an offset, the Read API does not give you the option to specify offset = -10 to read from 10 bytes before the end of the file (EOF). In order to achieve that, we need to find out the length of the file (say, 100) and subtract 10 (100 - 10 = 90) to read from offset 90 (vs. -10).

The filesystem plugin will be consumed by a client calling it (a function inside TensorFlow, or another plugin/custom op outside the TensorFlow repo).

For example, inside TensorFlow one function might want to parse a file (not necessarily a ZIP file) named hdfs://path/to/special/file. Outside of TensorFlow, another plugin may want to access TensorFlow's filesystem scheme URI gcs://path/to/another/file.

Eventually the function (inside or outside TensorFlow) will need to access Read (or the proposed GetSize/GetFileSize) through the plugin.

If GetSize/GetFileSize is not exposed, the size is unknown to this function (whether inside TensorFlow or in another plugin using TensorFlow), and it will have to probe through to find the EOF.

@mihaimaruseac (Contributor, Author) commented May 7, 2019:

I wasn't clear enough, sorry. Let me try with an example.

Suppose we live in a world where there are only 3 types of files: *.start, which have the size metadata at the beginning; *.end, which have it at the end; and *.none, which have no size metadata.

Suppose we also have only two filesystems: A (a:///path/to/file) and B (b:///path/to/file). Let's assume that A allows rewinds (so you can read the file and then come back to its start) while B is only streaming (once you read some part of the file, you can't go back, unless you open the file again).

Suppose we have a piece of user code which takes a filepath as input and returns the size of the file; let's call this ExternalFileSize(const string& path).

There are 4 possibilities:

  1. We keep the API as it is. In this case, ExternalFileSize(path) will call Env::Default()->NewRandomAccessFile(...) to get a new RandomAccessFile object and will have to call Read on it until the end, incrementing a size variable, and then return it. This, I agree with you, is not that good.

    // This code is outside TensorFlow
    size_t ExternalFileSize(const string& path) {
      std::unique_ptr<RandomAccessFile> file;
      Env::Default()->NewRandomAccessFile(path, &file);
      return ReadFromStartToEndAndGetSize(file.get());
    }
    
    // Extracting this as a separate function as I refer to it below
    // This is still outside TensorFlow
    size_t ReadFromStartToEndAndGetSize(RandomAccessFile* file) {
      size_t size = 0;
      // ignore casts from size_t to uint64 and from Status to bool
      while (file->Read(size, amount_to_read, output, result)) {
        // assume no partial reads for now as it doesn't matter here
        size += amount_to_read;
      }
      return size;
    }
  2. We add a GetSize method in the plugin API and in the RandomAccessFile class. To implement ExternalFileSize(path), we now call NewRandomAccessFile and GetSize on the returned object. However, NewRandomAccessFile dispatches on the scheme, so both plugins need to handle *.start, *.none, and *.end files inside the GetSize equivalent, resulting in code duplication as well as behavior that is not common across plugins. This is what I was mentioning above, in the second paragraph.

    // This code is outside TensorFlow
    size_t ExternalFileSize(const string& path) {
      std::unique_ptr<RandomAccessFile> file;
      Env::Default()->NewRandomAccessFile(path, &file);
      return file->GetSize();
    }
    
    // This code is in plugin (both A and B)
    size_t GetFileSize(const TF_RandomAccessFile *file) {
      const char *filename = file->filename;
    
      // SizeAtStart, SizeAtEnd, ReadFirstBlock, ReadLastBlock, ReadSizeFromMetadata, ReadFromStartToEndAndGetSize
      // are all functions implemented in each plugin, code duplication
      if (SizeAtStart(filename)) {
        char* metadata = ReadFirstBlock(file);
        return ReadSizeFromMetadata(metadata);
      } else if (SizeAtEnd(filename)) {
        char* metadata = ReadLastBlock(file);
        return ReadSizeFromMetadata(metadata);
      } else {
        // This calls the plugin function, not the external one!
        return ReadFromStartToEndAndGetSize(file);
      }
    }
  3. We allow Read to receive a negative argument for the offset. In this case, ExternalFileSize(path) first looks at the path extension. If it is .none, it calls Read and increments a size variable until end of file, returning this size. If it is .end, it calls Read with the proper negative offset, reads until EOF, and returns the size from the metadata. If it is .start, it reads only the metadata from the start and returns the size field. Since the extension handling now happens outside the plugins, each plugin only needs to implement Read with support for negative offsets. If negative offsets are not supported (i.e., streaming filesystems), Read with a negative offset fails and the code falls back to the same case as the .none extension above.

    // This code is outside TensorFlow
    size_t ExternalFileSize(const string& path) {
      std::unique_ptr<RandomAccessFile> file;
      Env::Default()->NewRandomAccessFile(path, &file);
    
      // SizeAtStart, SizeAtEnd, ReadSizeFromMetadata, ReadFromStartToEndAndGetSize
      // implemented only once in the external code
      if (SizeAtStart(path)) {
        file->Read(0, amount_to_read, output, metadata);
        return ReadSizeFromMetadata(metadata);
      } else if (SizeAtEnd(path)) {
        if (file->Read(-amount_to_read, amount_to_read, output, metadata)) {
          return ReadSizeFromMetadata(metadata);
        } else {
          // filesystem doesn't support negative offsets (filesystem B)
          // so fall back to the default
          return ReadFromStartToEndAndGetSize(file.get());
        }
      } else {
        return ReadFromStartToEndAndGetSize(file.get());
      }
    }
    
    // This code is in plugin A
    size_t Read(const TF_RandomAccessFile *file, int64 offset, ...) {
      if (offset < 0) {
        // position from the end and read
        ...
      } else {
        // read directly
        ...
      }
    }
    
    // This code is in plugin B
    size_t Read(const TF_RandomAccessFile *file, int64 offset, ...) {
      if (offset < 0) {
        // error, as seeking from the end is not supported
      } else {
        // read directly
        ...
      }
    }
  4. We change Read to receive only the number of bytes to read and add a Seek method to specify where the read should start. Seek takes an argument for the relative position (negative or positive offset) and an enum argument specifying whether the position change is relative to the current position, the start of the file, or the end of the file.

    // This code is outside TensorFlow
    size_t ExternalFileSize(const string& path) {
      std::unique_ptr<RandomAccessFile> file;
      Env::Default()->NewRandomAccessFile(path, &file);
    
      // SizeAtStart, SizeAtEnd, ReadSizeFromMetadata, ReadFromStartToEndAndGetSize
      // implemented only once in the external code
      if (SizeAtStart(path)) {
        file->Read(amount_to_read, output, metadata);
        return ReadSizeFromMetadata(metadata);
      } else if (SizeAtEnd(path) && file->Seek(-amount_to_read, FILE_END)) {
        file->Read(amount_to_read, output, metadata);
        return ReadSizeFromMetadata(metadata);
      } else {
        return ReadFromStartToEndAndGetSize(file.get());
      }
    }

    This is similar to 3. above, but has a clearer API.

I would go with 4 or 3 instead of 2 as you propose.
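
For concreteness, one hypothetical shape for the option-4 entry points; the enum name and signatures below are illustrative, not part of the proposal.

```cpp
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TF_RandomAccessFile TF_RandomAccessFile;
typedef struct TF_Status TF_Status;

typedef enum TF_Whence { TF_SEEK_SET, TF_SEEK_CUR, TF_SEEK_END } TF_Whence;

// Option-4 variant of the ops table: Read loses its offset parameter and a
// Seek entry positions the file instead. Seek returns false (filling the
// status) on streaming filesystems, letting callers fall back to a probe.
typedef struct TF_RandomAccessFileOps {
  bool (*const Seek)(TF_RandomAccessFile*, int64_t offset, TF_Whence whence,
                     TF_Status*);
  size_t (*const Read)(const TF_RandomAccessFile*, size_t bytes_to_read,
                       char* buffer, TF_Status*);
} TF_RandomAccessFileOps;
```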

The member replied:

@mihaimaruseac Thanks. I think 3 or 4 makes sense 👍. For 3, maybe the offset should be int64 instead of uint64 (to allow negative numbers)?

@mihaimaruseac (Contributor, Author) replied:

It has to be int64, indeed.

Another reviewer commented:

Option (4) would prevent concurrent reads from the RandomAccessFile (and require additional locking); while this functionality is currently underused, I expect that to change.

I would really like to see a GetSize() method (option 2) for RandomAccessFile; currently GetFileSize() is used as a substitute, either before or after NewRandomAccessFile().
On most filesystems, GetSize() could ensure that both operations actually refer to the same file object, even in the presence of rename/delete operations.
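
A short sketch of the race this comment is pointing at; `RandomAccessFile::GetSize()` is the hypothetical addition, and the rest mirrors the existing `Env` API.

```cpp
Status SizeOfOpenFile(Env* env, const string& path, uint64* size) {
  std::unique_ptr<RandomAccessFile> file;
  TF_RETURN_IF_ERROR(env->NewRandomAccessFile(path, &file));
  // Racy: `path` can be renamed or replaced between the two calls, so the
  // size below may describe a different file than the handle we hold.
  return env->GetFileSize(path, size);
  // With the proposed method, the open handle answers for itself:
  //   return file->GetSize(size);
}
```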

@mihaimaruseac (Contributor, Author) replied:

So that leaves option 3 to read the size after decompression.

Indeed, GetFileSize() returns the size of the file on disk. We can probably also implement GetSize(), which does the same thing, returning the size of the file at the moment it was read (and filesystems can decide not to implement it, in which case we'll default to calling GetFileSize() from the filesystem object itself).

@mihaimaruseac (Contributor, Author) commented:

Thank you @alextp for all the review. I incorporated all of your changes, either by inline editing, extracting to an appendix, or deferring to an open questions section (including which way I'm leaning) at the end of the document.

@ewilderj ewilderj merged commit f8212cf into tensorflow:master Jun 7, 2019
@ewilderj ewilderj added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Jun 7, 2019
@mihaimaruseac mihaimaruseac deleted the filesystem-rfc branch June 9, 2019 22:27
or the core TensorFlow. As such, here are a few requirements that we think are
needed:

1. Once a plugin is registered, the function pointer to plugin provided
A reviewer commented:

Wording: either "to a plugin provided implementation" or "to plugin-provided implementations"?


1. Furthermore, it is ideal to have compilation failures when trying to compile
plugins against an old API interface which will surely fail the version
metadata tests. For example, if a method's signature changes this will break
A reviewer commented:
Generally, if you want ABI compatibility, you cannot change function signatures, right? You can only ever add new functions. This is what we see in a lot of ABI-compat systems that create foo(), foo2(), foo3(), etc.

@mihaimaruseac (Contributor, Author) replied:

Yes, that's exactly the case.
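
A generic illustration of that pattern (the names are made up):

```cpp
extern "C" {
// v1: shipped, so its signature is frozen forever.
int foo(int x);
// v2 needed an extra parameter; instead of changing foo() and breaking the
// ABI, a new symbol is added alongside it (foo() often becomes a shim).
int foo2(int x, int flags);
}
```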

integration tests but will also be caught during compilation of the plugin.
However, something that might not be caught at compile time is when an
optional method becomes required or a new method is added. We can maintain
_source compatibility_ by always increasing version numbers when a method
A reviewer commented:
A different solution is to make newer entry points always optional, in the sense that if foo3() is not implemented, you get an easy-to-detect error that you can propagate to the caller.
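
Sketched below under the assumption that the ops table gains an optional `GetSize` slot (hypothetical): core checks the pointer before calling, so plugins built against an older table keep working.

```cpp
Status ModularRandomAccessFile::GetSize(uint64* size) const {
  // A plugin compiled against an older ops table leaves this slot null.
  if (ops_->GetSize == nullptr)
    return errors::Unimplemented("GetSize not provided by this plugin");
  // ... otherwise call ops_->GetSize(file_, size, plugin_status) and
  // convert the plugin-filled status into a tensorflow::Status.
  return Status::OK();
}
```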


## Existing implementation

The implementation of a filesystem support in TensorFlow is hairy, due to the
A reviewer commented:
Wording: s/a //

1. Although this design is mostly about C++ implementation, we should ensure
Python code and other language bindings will be able to continue working with
the modular design.

A reviewer commented:
Have you thought about debugging, and specifically performance debugging? Given that input pipelines are often at the core of ML bottlenecks, how can we ensure that folks will understand how time is spent within a plugin?

class PluginFilesystemIOTest : public ::testing::Test {
 protected:
  void SetUp() override {
    filesystem_ = new TF_Filesystem;
A reviewer commented:
In C++ code, use a unique_ptr instead?
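
The reviewer's suggestion, sketched (assuming `TF_Filesystem` is default-constructible here):

```cpp
class PluginFilesystemIOTest : public ::testing::Test {
 protected:
  void SetUp() override {
    // unique_ptr owns the filesystem, so it cannot leak on test failure.
    filesystem_ = std::make_unique<TF_Filesystem>();
  }
  std::unique_ptr<TF_Filesystem> filesystem_;
};
```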

```cpp
  // ... (diff excerpt: end of the TF_FilesystemOps table)
} TF_FilesystemOps;
```

We also provide some convenience calls that plugins can use to fill in these
A reviewer commented:
What part of the code is "we" here? A common library folks will depend on, or is this going to be calling back into the main TF code? If so, the complexity of creating a .lib file for Windows to make this work is going to play into the decision.

@mihaimaruseac (Contributor, Author) replied:

"we" here is more like "we, the authors of the plugin interface".

I'll have to reconsider the Windows issue; thanks for pointing it out.

      : filename_(filename), file_(file), ops_(ops) {}
  virtual ~ModularRandomAccessFile() {
    ops_->Cleanup(const_cast<TF_RandomAccessFile*>(file_));
    delete file_;
A reviewer commented:
Why does this take ownership of file? Generally that seems like an anti-pattern; if you have a reason, I'd call that out in the comments.

};

Status ModularRandomAccessFile::Read(uint64 offset, size_t n,
                                     StringPiece* result, char* scratch) const {
  TF_Status* plugin_status = TF_NewStatus();
A reviewer commented:
If TF_Status is just a struct, why not put it on the stack?

@mihaimaruseac (Contributor, Author) replied:

I guess it can also go on the stack. Will have to see how it is used elsewhere and also will have to learn modern C++ :)

@mihaimaruseac (Contributor, Author) added:

Actually, it cannot go on the stack as it needs to be private to plugins for ABI compatibility reasons.
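
The usage pattern that follows from keeping `TF_Status` opaque; the four calls below exist today in the TensorFlow C API (`tensorflow/c/tf_status.h`):

```cpp
TF_Status* s = TF_NewStatus();  // only core knows sizeof(TF_Status)
// ... hand `s` across the C boundary to a plugin operation ...
if (TF_GetCode(s) != TF_OK) {
  fprintf(stderr, "plugin call failed: %s\n", TF_Message(s));
}
TF_DeleteStatus(s);
```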

1. **Should we pass plugin structs by value during registration?** Doing so
makes ownership clear and prevents memory allocation bugs. Passing them by
value results in a copy but we need a copy for security reasons anyway. We're
leaning towards **Yes**.
A reviewer commented:
Is this about the structs that just wrap the void*? In that case, compilers will handle them exactly like they would a void* (so no extra copy).

@mihaimaruseac (Contributor, Author) replied:

This is about the structs containing the function pointers.
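
A hypothetical signature for the by-value option under discussion:

```cpp
// Core receives its own copy of the function-pointer table at
// registration, so ownership is unambiguous and the plugin cannot
// mutate core's copy afterwards. (Signature is illustrative only.)
void TF_RegisterFilesystem(const char* scheme, TF_FilesystemOps ops);
```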

@mihaimaruseac (Contributor, Author) commented:

Thanks @r4nt for the extra review and suggestions. Will consider these too while implementing.

@ziliangpeng commented:
Is there a place to track the implementation status of this RFC?

@byronyi (Contributor) commented Jan 21, 2020:

> Is there a place to track the implementation status of this RFC?

https://github.com/tensorflow/tensorflow/search?q=modular+filesystem&type=Commits

@mihaimaruseac (Contributor, Author) commented:

For now all the implementation will reside in tensorflow/c/experimental/filesystem.

POSIX support should be complete. I'm working on Windows now; then I will update this and send an email with instructions on how to test the experimental filesystems instead of the non-modular ones. At that point, we will start converting all other filesystems, and then we will switch TF to use the modular ones (by moving the code to the proper locations).
