Numbers: data fs filesystem #348

rcoreilly · 2024-08-30T19:57:08Z

rcoreilly
Aug 30, 2024
Maintainer

A key component for exploratory data analysis (e.g., as in a Jupyter notebook) is a shared namespace to grab and store data. In the existing emer simulations, the estats package provides this functionality, to allow different methods to access float and tensor variables in a global shared namespace.

As far as I can tell, the guts of Jupyter are powered by IPython, which has various "magic" commands prefixed by % that manage the global namespace and provide the standard shell-like functionality (as in cosh): https://ipython.readthedocs.io/en/stable/interactive/magics.html The %who and %whos magic commands list the global variables.

I couldn't find much info about people experiencing conflicts in this global namespace, but it seems inevitable that having just a single monolithic space is going to be bad. Here's something from another related project: https://discourse.julialang.org/t/notebooks-need-modules-i-e-multiple-separate-global-namespaces/68541

Thus, a potentially intuitive and flexible solution is to adopt the filesystem metaphor for organizing the global data variable space, by creating an fs implementation that allows you to read / write data variables as "files" in different directories, to better organize and avoid conflicts.

Another issue with the global data space is dealing with type safety properly. In python this is not an issue, but in Go, we want to maintain type safety.

One solution is to use automatic filename extensions, e.g., .float32 .tensor32 etc to label the data, and, somehow, when you access it, you get back a variable of the proper type. I'm not exactly sure if this can be pulled off properly with generics. In estats we just have a bunch of different maps for each different type, and separate named accessors.

We could at least have the equivalent of map[string]any to back the storage of data elements, and have explicit generic type args for reading and writing, e.g., myf := ds.Get[float32]("/epoch/sse/mean"). Set at least would not require using the type parameter.
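A minimal sketch of that idea, under stated assumptions: `DataFS`, `NewDataFS`, `Set` and `Get` are hypothetical names, and the store is a flat map rather than a real directory tree. Because Go methods cannot declare their own type parameters, the typed read is written here as a package-level function `Get[T](ds, path)` instead of `ds.Get[T](path)`.

```go
package main

import (
	"fmt"
)

// DataFS is a hypothetical flat store keyed by path; values are held as any.
type DataFS struct {
	items map[string]any
}

// NewDataFS returns an empty store.
func NewDataFS() *DataFS {
	return &DataFS{items: make(map[string]any)}
}

// Set stores a value at path. No type parameter is needed on the write side.
func (ds *DataFS) Set(path string, val any) {
	ds.items[path] = val
}

// Get returns the value at path as type T, or an error if the path is
// missing or the stored value has a different dynamic type.
func Get[T any](ds *DataFS, path string) (T, error) {
	var zero T
	v, ok := ds.items[path]
	if !ok {
		return zero, fmt.Errorf("datafs: no item at %q", path)
	}
	t, ok := v.(T)
	if !ok {
		return zero, fmt.Errorf("datafs: item at %q is %T, not %T", path, v, zero)
	}
	return t, nil
}

func main() {
	ds := NewDataFS()
	ds.Set("/epoch/sse/mean", float32(0.25))

	mean, err := Get[float32](ds, "/epoch/sse/mean")
	fmt.Println(mean, err)

	// Asking for the wrong type is reported at runtime instead of panicking.
	_, err = Get[int](ds, "/epoch/sse/mean")
	fmt.Println(err)
}
```

The type check happens at runtime on each read, so this gives type-checked access rather than compile-time guarantees about what lives at a given path.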

Per above, we would want to add simple accessors for variables -- definitely don't want to have to deal with io Read / Write at this level. But supporting the fs interface in general would allow things like the filetree and FilePicker (properly updated to use the fs instead of os package) to work in browsing, accessing the variables.

The generic Read / Write interface for all variables would be handy for directly storing data to actual os files.

metadata

Each data "file" can have metadata (map[string]any) that would be key for various use-cases as illustrated below. The fs.FileInfo interface specifies a Sys() any method that could potentially be used to return this.
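As a rough sketch of the `Sys()` route: the standard `fs.FileInfo` interface can be implemented by a small struct whose `Sys()` returns the metadata map. The `itemInfo` type, the `MetadataOf` helper and the metadata keys below are all hypothetical, chosen only for illustration.

```go
package main

import (
	"fmt"
	"io/fs"
	"time"
)

// Metadata is the per-item key/value map described in the text.
type Metadata map[string]any

// itemInfo is a hypothetical fs.FileInfo for a single data item.
type itemInfo struct {
	name    string
	size    int64
	modTime time.Time
	meta    Metadata
}

func (fi *itemInfo) Name() string       { return fi.name }
func (fi *itemInfo) Size() int64        { return fi.size }
func (fi *itemInfo) Mode() fs.FileMode  { return 0o444 }
func (fi *itemInfo) ModTime() time.Time { return fi.modTime }
func (fi *itemInfo) IsDir() bool        { return false }

// Sys returns the metadata map as the underlying data source.
func (fi *itemInfo) Sys() any { return fi.meta }

// MetadataOf extracts the metadata from any fs.FileInfo.
// It returns nil when Sys() does not hold a Metadata value.
func MetadataOf(info fs.FileInfo) Metadata {
	md, _ := info.Sys().(Metadata)
	return md
}

func main() {
	var info fs.FileInfo = &itemInfo{
		name: "sse.tensor32",
		meta: Metadata{"hidden": false, "plot-min": 0.0}, // example keys only
	}
	md := MetadataOf(info)
	fmt.Println(md["hidden"], md["plot-min"])
}
```

Code that only knows about `fs.FS` (a file tree view, for example) could then reach the metadata through `Stat` without importing the data fs package's concrete types.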

emer sim logging example

One interesting test case for this is to replace the existing elog functionality in emergent with a corresponding file structure at the different time scales, modes:

/log/train/trial/sse.tensor32
/log/train/epoch/sse/mean.tensor32 // sd etc too
/log/test/* etc

This is more flexible. In particular, systematically representing the different stats like mean, sd etc has always been a problem, and this solves it more elegantly. In the current emer sims, we are very often redundantly storing estat values in stats and then reading those into the log -- here we just have the one canonical location for every value, and the logging part is automatic based on these values and their metadata (e.g., can "hide" intermediate values that don't need to be saved).

We could also use metadata to deal with the ever-present ordering problem, for example by having an automatic OrderAdded int counter that tracks the order you add items to a directory. (alternatively could try to use the fs.FileInfo.ModTime() for this, tracking the exact time when a variable is added, and sorting order by that).

The metadata would however be critical for setting the plotting hints in terms of fixed or floating min / max ranges and other variables.
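The OrderAdded idea mentioned above might look something like the following sketch. `Dir`, `Item`, `Add` and `ItemsInOrder` are hypothetical names, and keeping the original position when an existing name is overwritten is an assumption, not something the proposal specifies.

```go
package main

import (
	"fmt"
	"sort"
)

// Item is one named value in a directory, tagged with its insertion order.
type Item struct {
	Name       string
	Value      any
	OrderAdded int
}

// Dir is a hypothetical directory that counts items as they are added.
type Dir struct {
	items map[string]*Item
	next  int
}

// NewDir returns an empty directory.
func NewDir() *Dir { return &Dir{items: make(map[string]*Item)} }

// Add stores val under name. A new name gets the next OrderAdded value;
// overwriting an existing name keeps its original position.
func (d *Dir) Add(name string, val any) {
	if it, ok := d.items[name]; ok {
		it.Value = val
		return
	}
	d.items[name] = &Item{Name: name, Value: val, OrderAdded: d.next}
	d.next++
}

// ItemsInOrder returns all items sorted by the order they were added.
func (d *Dir) ItemsInOrder() []*Item {
	out := make([]*Item, 0, len(d.items))
	for _, it := range d.items {
		out = append(out, it)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].OrderAdded < out[j].OrderAdded })
	return out
}

func main() {
	d := NewDir()
	d.Add("sse", 0.5)
	d.Add("err", 1.0)
	d.Add("cor", 0.9)
	for _, it := range d.ItemsInOrder() {
		fmt.Println(it.OrderAdded, it.Name)
	}
}
```

An integer counter avoids the ties that a ModTime-based ordering could produce when several items are added within the clock's resolution.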

It would be easy to write standard aggregation algorithms that just iterate over variables in a given "directory" and populate corresponding variables at the next level up. One could automatically generate all standard aggregates (mean, sd, sem etc) and use Func filter calls to specify which values actually get saved to actual log files and / or plotted.

The plot and log tables can be efficiently plan-updated from the current data fs state, so adding and removing items is automatic and efficient, and there is just one function that turns a given fs directory path into a corresponding table with appropriate metadata filter etc.

A simple cp command could be used to duplicate and save a given snapshot of log data for subsequent comparison, etc.

Also want to ensure that it is very easy to use String() method on enums to make path names (e.g., a version of path that does this automatically, if the native version does not), so you can e.g., use the etime enums to avoid limitations of dealing with string path names: path.Join(root, etime.Train, etime.Trial) etc.
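One way such a path helper could work is sketched below. `Join` is a hypothetical wrapper around the standard `path.Join`, and the `Mode` and `Level` enums are stand-ins for the etime enums, not their actual definitions.

```go
package main

import (
	"fmt"
	"path"
)

// Mode and Level are hypothetical stand-ins for the etime enums.
type Mode int

const (
	Train Mode = iota
	Test
)

func (m Mode) String() string { return [...]string{"Train", "Test"}[m] }

type Level int

const (
	Trial Level = iota
	Epoch
)

func (l Level) String() string { return [...]string{"Trial", "Epoch"}[l] }

// Join converts every element to a string (using its String method when
// it has one) and joins the results with path.Join.
func Join(elems ...any) string {
	parts := make([]string, len(elems))
	for i, e := range elems {
		parts[i] = fmt.Sprint(e)
	}
	return path.Join(parts...)
}

func main() {
	fmt.Println(Join("/log", Train, Trial)) // prints "/log/Train/Trial"
}
```

The native `path.Join` only accepts strings, so a wrapper like this is what makes the enum values usable directly.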

Implementation

All of this would be implemented in underlying Go, presumably in core as a datafs package similar to and building on the tensor package, and should be very compact and simple to write. The as-yet-unnamed numbers scripting language #324 can provide a syntactically simplified interface to it, tbd.

At least all the standard shell cd ls cp etc commands must work directly to navigate the data fs -- this will be a good "opportunity" to implement all of those in underlying fs functionality instead of calling out to the /bin/ls methods etc. Hopefully can find existing Go package that does that?

rcoreilly · 2024-08-30T20:19:45Z

rcoreilly
Aug 30, 2024
Maintainer Author

The existing databrowser functionality in numbers also needs to be updated in the context of the above -- will make it much more general.

In general the data browser provides a filetree + tabbed viewer interface to an fs root, with special affordances for numerical data types, and the datafs version of fs (e.g., direct access to the metadata etc).

Databrowser also manages scripts in toolbars to automate repeated tasks. Need to continue to rethink that in context of evolving overall numbers design, relative to the "notebook" concept, vs. our enhanced "terminal" design etc.

A key principle is that you should be able to directly cp anything between the actual os filesystem and the datafs virtual filesystem space, using appropriate default data formats like tsv etc, so basically when you run a simulation it could just dump the internal data fs log space to a corresponding subdir in os filesystem on remote machine, and then you grab that, and then your browser just operates directly on that fs image. No need to stuff everything through an intervening table middleman.

A critical efficiency issue is that managing many small individual files in actual fs space is painful, so using zip automatically would be very useful. golang/go#61232 explains why not tar, and here is the fs version of zip: zipfs

So zipfs should be builtin and probably the default way of saving / loading datafs images.

vfs is used for zipfs and seems like the key package as a basis for datafs.

0 replies

rcoreilly · 2024-09-04T02:29:18Z

rcoreilly
Sep 4, 2024
Maintainer Author

GPU integration via `gosl`

A key constraint for GPU computation is that the compute kernels operate on global variables that need to be explicitly declared in the shader, using special syntax, etc. These are the things you index over to compute.

The datafs can serve as a way to organize this global dataspace in a flexible way. You can "cd" into a directory of the datafs and that effectively sets the global variables you're operating on. Need to work on this but this is a key affordance that should allow transparent GPU / CPU coding, along with gosl...

1 reply

rcoreilly Sep 4, 2024
Maintainer Author

some other points here:

GPU data is organized into group / binding -- natural 2 level directory hierarchy.
in axon, everything is a big bag of floats that you index into according to variables as enums -- need a general-purpose version of this, where you can specify the index logic and there is a nice slang `[x,y,z]` syntax transpiler that lets you directly refer to a particular value in a simple clean way.
again, this works on both the gpu and cpu, transparently -- I will then rewrite axon in this language!
the slang transpiler is key for making this more seamless between cpu and gpu -- critically, the gosl transpiler is super fast by itself -- all the time is in goimports -- need an optimized version of goimports that uses heuristics and better caching -- it is a major bottleneck for cogent code too.
also need automatic editor support for slang as the primary representation for a codebase, where it automatically continuously transpiles into go and translates go errors back into the slang original -- in effect, we need a full wrapper for the go compiler that hides the intermediate .go files and you are just working on .sl files.
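The "big bag of floats indexed by enums" pattern from the points above could be sketched as follows. The `Bag` type, the variable names and the unit-major layout are assumptions made for illustration, not axon's actual layout.

```go
package main

import "fmt"

// Var enumerates per-unit variables. The names are illustrative only.
type Var int

const (
	Act Var = iota
	Ge
	Gi
	NVars // number of variables; must be last
)

// Bag is one flat slice of floats holding all variables for all units.
type Bag struct {
	Data   []float32
	NUnits int
}

// NewBag allocates storage for nUnits units with NVars variables each.
func NewBag(nUnits int) *Bag {
	return &Bag{Data: make([]float32, nUnits*int(NVars)), NUnits: nUnits}
}

// Index holds the index logic in one place: unit-major, variable-minor.
// A different layout would only need to change this function.
func (b *Bag) Index(unit int, v Var) int { return unit*int(NVars) + int(v) }

// At returns the value of variable v for the given unit.
func (b *Bag) At(unit int, v Var) float32 { return b.Data[b.Index(unit, v)] }

// Set stores val as variable v for the given unit.
func (b *Bag) Set(unit int, v Var, val float32) { b.Data[b.Index(unit, v)] = val }

func main() {
	b := NewBag(4)
	b.Set(2, Ge, 0.5)
	fmt.Println(b.At(2, Ge), b.Index(2, Ge)) // prints "0.5 7"
}
```

A bracket syntax like `[x,y,z]` would presumably transpile into calls of this kind, with the same index function usable on both CPU and GPU.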

rcoreilly · 2024-09-04T20:10:04Z

rcoreilly
Sep 4, 2024
Maintainer Author

Jax notes

Some key points:

Jax is fully functional -- each function cannot have any side effects, and can only output values from purely const inputs. This is a huge restriction, which also enables a lot of things to happen in that ecosystem. From a practical perspective for something like axon or large state spaces e.g., waves, having 2 (or more!) copies of everything is just not viable or efficient. There is nothing in gosl that requires such a constraint. So, it makes sense instead to support (someday) the functional subset as an additional level on top of a more general base framework that has no such constraints.
jit runs the python code and records the results into an intermediate jaxpr expression language, which is then the source input to the xla compiler that runs on CPU, GPU, TPU etc. This is presumably where the functional constraints enter as implied here: https://jax.readthedocs.io/en/latest/jit-compilation.html By contrast, using gosl, we have a much more general programming model, and presumably we could have different backend targets, although that is a bit of a heavy lift. Btw, it makes sense to exclusively support WGSL instead of keeping HLSL support, because WebGPU supersedes Vulkan (runs in basically the same way on a larger set of platforms, more simply and hopefully just as efficiently), vs. a possible CUDA or TPU target which would have additional benefits presumably.
vmap is really the key thing we want to implement, and it seems relatively simple (except for the composability with jit presumably): https://jax.readthedocs.io/en/latest/automatic-vectorization.html It assumes the first dimension is the one you want to iterate over, and otherwise you can specify which dimension by axis indexes. In our case, we just want to be able to specify the function to call for each index of a parallel for loop, using the basic GPU workgroup indices (x,y,z)... Very minimal and transparent. The CPU version just does our basic greedy threading calls.
pmap is mpi and probably we can just keep that as-is for now. vmap is really the urgent factor.
There are special control flow functions that replace standard if, for etc constructs. We can potentially support these as add-ons to improve performance in specific cases, but gosl supports arbitrary if and switch statements so again we have a more general, basic layer with performance-enhancing overlays.
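The CPU side of the vmap-like call described above could be as small as the following sketch. `Vectorize` is a hypothetical name, and the static chunking across `runtime.NumCPU()` goroutines is an assumption; it is not the existing greedy threading code.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Vectorize calls fn(i) once for every i in [0, n), splitting the index
// range into contiguous chunks that run on separate goroutines.
// fn must be safe to call concurrently for different i.
func Vectorize(n int, fn func(i int)) {
	nThreads := runtime.NumCPU()
	if nThreads > n {
		nThreads = n
	}
	if nThreads <= 1 {
		for i := 0; i < n; i++ {
			fn(i)
		}
		return
	}
	chunk := (n + nThreads - 1) / nThreads // ceiling division
	var wg sync.WaitGroup
	for start := 0; start < n; start += chunk {
		end := start + chunk
		if end > n {
			end = n
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				fn(i)
			}
		}(start, end)
	}
	wg.Wait()
}

func main() {
	in := []float32{1, 2, 3, 4}
	out := make([]float32, len(in))
	// Each index writes only its own element, so no locking is needed.
	Vectorize(len(in), func(i int) { out[i] = in[i] * in[i] })
	fmt.Println(out) // prints "[1 4 9 16]"
}
```

The function receives only an index, so it can have side effects on shared state, which matches the non-functional model discussed above. On the GPU the same per-index function would correspond to the kernel body.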

4 replies

rcoreilly Sep 9, 2024
Maintainer Author

most languages call the relevant "apply function to list of data" map: https://en.wikipedia.org/wiki/Map_(higher-order_function)

however, map has a specific connotation of map(x) -> y elementwise, whereas we are envisioning a much more general "vectorize" kind of function that just applies a function to elements and can do all manner of things.

Also, in Go, map is taken. How about just "vectorize" as a more general name?

kkoreilly Sep 9, 2024
Maintainer

vectorize doesn't really seem like a stellar name to me, nor does it seem particularly general.

rcoreilly Sep 9, 2024
Maintainer Author

see https://en.wikipedia.org/wiki/Array_programming

"In these languages, an operation that operates on entire arrays can be called a vectorized operation"

do you have a better suggestion?

rcoreilly Sep 9, 2024
Maintainer Author

btw I'm really mixing these two discussions badly. #324 (reply in thread)

rcoreilly · 2024-09-08T05:54:05Z

rcoreilly
Sep 8, 2024
Maintainer Author

can start using the tensor/examples/datafs-sim code to think how we want cosl to work in this context.

go:    errors.Log1(datafs.New[int](stats, "Run", "Epoch", "Trial"))
cosl:  stats.New[int]("Run", "Epoch", "Trial") // we could fix the no generic methods issue by transpiling this into a function call

go:    sitems := ss.Stats.ItemsByTimeFunc(nil)
cosl:  sitems := ss.Stats[return true]  // [func] filter applies to everything

go:    v := stats.StatTensor(src, ag)
cosl:  v := stats.Mean(src) // stats defines simple funcs that operate over data of any sort -- need to figure out this crux issue in terms of what the common representation of a filtered tensor might look like?

0 replies

rcoreilly · 2024-09-10T02:01:08Z

rcoreilly
Sep 10, 2024
Maintainer Author

Indexes on demand

new idea: Indexed indexes are nil by default, and Index accessor function does pass-through for nil. This way you know automatically if you have a sequential view or not, which is key for various uses of the data etc.
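A minimal sketch of the nil pass-through idea. This `Indexed` type is a simplified stand-in using a plain `[]float64`, not the actual `tensor.Indexed` API, and the method names are hypothetical.

```go
package main

import "fmt"

// Indexed is a simplified stand-in for an indexed view onto some values.
type Indexed struct {
	Values  []float64
	Indexes []int // nil means a sequential 1-to-1 view of Values
}

// Index maps a view position to a position in Values.
// With nil Indexes it passes i straight through.
func (ix *Indexed) Index(i int) int {
	if ix.Indexes == nil {
		return i
	}
	return ix.Indexes[i]
}

// Len returns the number of elements visible through the view.
func (ix *Indexed) Len() int {
	if ix.Indexes == nil {
		return len(ix.Values)
	}
	return len(ix.Indexes)
}

// At returns the value at view position i.
func (ix *Indexed) At(i int) float64 { return ix.Values[ix.Index(i)] }

// IsSequential reports whether the view is the plain sequential one.
func (ix *Indexed) IsSequential() bool { return ix.Indexes == nil }

func main() {
	x := &Indexed{Values: []float64{10, 20, 30, 40}}
	fmt.Println(x.IsSequential(), x.Len(), x.At(1)) // prints "true 4 20"

	// Setting indexes turns it into a filtered, reordered view.
	x.Indexes = []int{3, 1}
	fmt.Println(x.IsSequential(), x.Len(), x.At(0)) // prints "false 2 40"
}
```

With this convention the common sequential case costs no extra allocation, and a nil check is enough to tell whether the data can be treated as contiguous.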

Including: datafs new design:

Data Dir has:

DirMap -- same
DirTable -- initially nil but made and updated on demand.

Data leaf has:

Tensor tensor.Indexed -- the only possible data rep, used for everything. So the leaves of every dir are automatically a table, and any cosl expression can operate directly on them.

Tables will now have Indexed columns instead of bare, so that they can literally point to the Data items, and this allows easy integration of heterogeneous lengths and orders of data types.

The nil index in tensor.Indexed means that they can automatically use the dir table index if they don't have one themselves, when passed into cosl operations or any other view context (and if dir table indexes are nil, all the better).

When combining Dir tensors into the table, we can automatically set indexes to make them all compatible?

Add row needs to work properly for all this -- should be doable.

Scalar is 1d with rows = 1, so easy to grow that.

might need some special access methods for Scalars?

1 reply

rcoreilly Sep 10, 2024
Maintainer Author

indexes of DirTable are those of longest element, and others just repeat themselves to align?
