Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal representation of atoms #908

Closed
triska opened this issue Apr 16, 2021 · 11 comments
Closed

Internal representation of atoms #908

triska opened this issue Apr 16, 2021 · 11 comments

Comments

@triska
Copy link
Contributor

triska commented Apr 16, 2021

In the WAM, an atom is a pointer (index) to an entry in the atom table.

However, in machine_indices.rs, an atom seems to be represented as Atom(ClauseName, Option<SharedOpDesc>) on the heap.

My question is: Why is SharedOpDesc involved here? The operators are only used when parsing and emitting a term, and can also be changed dynamically. They should play no role when internally representing an atom.

With great interest, I see from Robbepop/modular-bitfield#64 that a new heap representation is now being considered, where we have:

pub struct AtomCell {
    name: B45,
    prec: B11,
    spec: B3,
    m: bool,
    tag: B4,
}

Also here, the question arises: Why are prec and spec stored together with the atom? What does m indicate?

Ideally, the name of an atom is internally stored in the same way as strings (#24), i.e., using raw bytes, in the atom table. If this is the case, then atom_chars/2 can be implemented in O(1) to obtain the atom name as a list of characters.

@mthom
Copy link
Owner

mthom commented Apr 16, 2021

However, in machine_indices.rs, an atom seems to be represented as Atom(ClauseName, Option) on the heap.

Originally, I introduced the optional value to pass operator information along to the printer, so that operator terms could be printed correctly.

As unlikely as this may sound, the same problem (that of supporting dynamic operator changes) occurred to me a few hours after I opened the modular-bitfield issue. I decided AtomCell should have a definition like:

#[bitfield]
#[repr(u64)]
#[derive(Copy, Clone, Debug)]
pub struct AtomCell {
    name: B45,
    op_offset_or_arity: B10,
    is_op: bool,
    is_functor: bool,
    m: bool,
    tag: B6,
}

I also expanded tag to 6-bits to address my staged tagging problem, but that's unrelated to this issue. m is the mark bit.

If is_functor is true (a bool in the bitfield context is a single bit, BTW), the AtomCell points to the name of a functor term, and its arity is stored in the 10-bit op_offset_or_arity field if is_op is false (if is_op is true, op_offset_or_arity points to an entry describing the operator in a side buffer, as I explain shortly). There are 1024 registers in Scryer's WAM, so 10 bits is just enough space to store the arity. The arity is actually interpreted as op_offset_or_arity + 1.

If is_functor is false, the arity is 0, making AtomCell just an atom. The entries of the atom table now assume this format:

<the size of the string beginning in the following word, as a usize>
<the string>
<a Vec containing OpDesc's, each OpDesc is a 16-bit bitfield describing precedence and associativity>

The values of the OpDir map would then be offsets into the atom's Vec entry.

name is the offset of the atom into the atom table buffer. Because it's only 45 bits in length, the atom table is limited to an indexable size of 2^48 bits if each entry is aligned to an 8-byte boundary, which is still plenty of space, I think.

This new design allows operator definitions to change dynamically without our needing to update the value of each AtomCell that might have the old op definition. Can you propose an alternative that doesn't require operator precedence information to be stored in each atom? If each atom cell only pointed to an atom table entry, wouldn't that require a hash table lookup whenever operator information is needed?

@triska
Copy link
Contributor Author

triska commented Apr 16, 2021

If each atom cell only pointed to an atom table entry, wouldn't that require a hash table lookup whenever operator information is needed?

I think it would take only a single hash lookup to find out the configured precedence of the operator. This will likely be completely negligible compared to the time it takes to actually write a term on the terminal or to a file, and it seems better to keep operators separate from the internal data representation: Operators are only for syntax, needed only to read and write terms, and should not change the internal representation in any way. If the operator state is encoded together with atoms, there will be a lot of duplication when operators change even though the terms are completely the same semantically, and should be merged to allow faster tests with (==)/2 etc.

Internally, identical terms should always be shared to enable fast checks and to use memory efficiently. Ideally, as much as possible can internally then be reduced to comparing pointers (to cells).

In general, I think it would be important to aim for a design that conceptually allows compound terms with unbounded arity, or which can at least be feasibly extended later in this way. The reason is that such terms allow O(1) access to individual arguments (using arg/3), and that is a nice representation for applications that require fast access to one of many elements by a given index. Atom names of arbitrary length should also be supported, ideally without (significantly) affecting the number of atoms that can be represented in total.

@mthom
Copy link
Owner

mthom commented Apr 16, 2021

I'm not sure how we'd go about supporting compound terms with unbounded arity. Term arity is closely connected to the (necessarily finite) number of registers in the WAM design. We would have to go beyond the WAM.

If the atom of a 1-arity term is defined as both a prefix and postfix operator, how do we know which was originally intended when printing back the term? Is it okay to get that wrong? (I always assumed the answer to be no, but perhaps it's yes?) I think this was another of the original considerations that motivated SharedOpDesc.

EDIT: I've reviewed the standard and convinced myself that it's ok to print a principal functor using whatever operator information is current. I'm going to go ahead with your proposed alternative. Operator information will not be encoded in the atom. The 10-bit arity encoding for functors will be retained for now, though. We can consider how to implement unbounded functor arities in the future.

@UWN
Copy link

UWN commented Apr 17, 2021

As to the prefix/postfix question see conformity_testing#201. It makes sense to be consistent and stick to one or the other, but if you insist, also f 0 f would do.

It is good to remove operator information from the atom table. It so rarely is used - there are only a couple of operators but many more atoms whose entry would be needlessly bloated. Further, it might be a future consideration to make operators module local. Some implementations do this like SWI, YAP, Eclipse.

As for functors, max_arity would be ideally unbounded - from the viewpoint of the programmer. For actual WAM-predicates an arity of 16 or 32 seems to be sufficient. If you insist on supporting also larger predicates, simply fake them by putting into the last WAM-argument a corresponding functor containing the remaining arguments and access them with arg/3. No need to redesign the WAM!

For the representation of functors, a small one (for 1..4 or so), and a very large one might be the best, both with O(1) access, indeed. (This would not complicate unification)

@mthom
Copy link
Owner

mthom commented Apr 28, 2021

As for functors, max_arity would be ideally unbounded - from the viewpoint of the programmer. For actual WAM-predicates an arity of 16 or 32 seems to be sufficient. If you insist on supporting also larger predicates, simply fake them by putting into the last WAM-argument a corresponding functor containing the remaining arguments and access them with arg/3. No need to redesign the WAM!

@UWN @triska I assume the functor cell would be one of the component cells of the functor it extends? Since a functor cell never occurs as a component of a functor elsewhere in the WAM, it would be obvious that it extends the arguments.

@triska
Copy link
Contributor Author

triska commented May 1, 2021

Since a functor cell never occurs as a component of a functor elsewhere in the WAM

I do not understand this statement: A functor occurs as a component for example in nested terms such as f(X, Y, g(Z)), were g(Z) occurs as the third (and next) cell following the linearized representation of f(X,Y).

Also, I think it is important to distinguish between arity of terms on the heap, and arity of predicates which must be compiled. I think the arity of a term on the heap can easily be much higher than 1024, can it not? This is because a cell that stores a term needs to be tagged as a compound term (using a few bits), and the rest of the cell is simply a pointer to the functor table that stores the functor name and the arity of the term, and that arity can be as high as you want.

So, it should be absolutely no problem to create a huge term on the heap for example with functor(T, f, 100_000).

Regarding the number of predicate arguments: As I understand it, @UWN outlines essentially a source tranformation that does not need any changes of the WAM. Instead of invoking a predicate with a huge number of arguments, we invoke an auxiliary predicate with a smaller number of arguments where the final argument is a compound term (with high arity) that stores the arguments for which we do not have enough registries. This can always be implemented at a later stage.

Note that the maximum arity of term on the heap is the much more important factor in practice, because a term with many arguments can be used as an efficient array (using arg/3 to access individual arguments in O(1) time). The number of predicate arguments is usually at most 7, anything more becomes quickly unreadable.

@mthom
Copy link
Owner

mthom commented May 1, 2021

I do not understand this statement: A functor occurs as a component for example in nested terms such as f(X, Y, g(Z)), were g(Z) occurs as the third (and next) cell following the linearized representation of f(X,Y).

I'll clarify by stepping through the heap representation of f(X, Y, g(Z)). In Scryer, it looks like:

h    |   Functor(f/3)
h+1  |   Var(h+1)
h+2  |   Var(h+2)
h+3  |   Str(h+4)
h+4  |   Functor(g/1)
h+5  |   Var(h+5)

The functor cell Functor(g/1) doesn't appear in the component list of the functor cell Functor(f/3) at location h. It's referenced by Str(h+4) as its third component. Functor cells and their components are always stored in the heap/global stack, by the way.

I took Ulrich's answer to mean that we could limit the arity of functor cells to 4 or 5 bits, and extend the arity if necessary by appending a specialized extender cell specifying the number of remaining cells. The remaining arity would be encoded as a 57 bit unsigned integer (deducting 6 bits for the cell tag and 1 bit for marking). Indexing would still be O(1), of course.

@UWN
Copy link

UWN commented May 1, 2021

Yes, something of that kind. What is in any case important that these big structures do not need to be supported by WAM instructions at all, functor/3 and arg/3 are good enough for them ( and univ). So very few places need to know them in detail.

@triska
Copy link
Contributor Author

triska commented Jun 13, 2021

@UWN: Could you please briefly expand on how the heap should be organized to enable sharing of terms: From what I remember, a good organization would start with the "permanent" atoms at the bottom of the heap so that strings can share the atom names by simply pointing to the name of the atom.

Is there anything else to note about the heap organization that should be taken into account? For example, []/0 could be recognized by its (permanent) address in the heap, is this correct? It seems not necessary to introduce a dedicated tag for it.

@UWN
Copy link

UWN commented Jun 22, 2021

These are two different issues.

1mo, the handling of atom indices. Currently #992 suggests that there are still two different kinds of [] in the system. Now atoms are usually represented by an atom index which is more or less a hash value. The first entries can already be determined at the source code compile time. And thus there is no need for a special tag for nil (or other frequently used atoms like true or false), it's an atom with just a very specific index. Maybe there are some specialized instructions for nil. But for the data representation, there is nothing to optimize.

2do, the actual overall heap organization. Here, some (gradual) improvements are possible. Those improvements can be realized without much consequence for most of the code. So it might be of interest to consider the following organization. Starting with the "oldest".

  1. Strings not containing zero. These would also serve as the strings associated with atoms.
  2. Finite ground terms. Typically referred to by facts.
  3. Rational ground terms
  4. Non-ground terms (for global state or the like)
  5. Actual heap terms.

Such an organization would also help to speed up many operations. For example testing for chars is constant for terms in 1. Also many classical string-processing operations might be reused this way.

@triska
Copy link
Contributor Author

triska commented Jan 10, 2022

This representation is most excellently improved in the rebis-dev branch, so I am closing this issue! Thank you a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants