IPFS Repo Spec update #43

Merged
merged 2 commits into from
Feb 13, 2017

88 changes: 52 additions & 36 deletions repo/README.md
@@ -17,7 +17,6 @@ A `repo` is the storage repository of an IPFS node. It is the subsystem that
actually stores the data ipfs nodes use. All IPFS objects are stored in
a repo (similar to git).


There are many possible repo implementations, depending on the storage media
used. Most commonly, ipfs nodes use an [fs-repo](fs-repo).

@@ -32,46 +31,35 @@ Repo Implementations:

Review comment (Member):

It would be pretty sweet to upgrade https://github.com/maxogden/abstract-blob-store to have a parallel interface for Go, so that we can have all of those s3/fs/mem repos for 'free'.

Review comment (Member Author):

In Go, we use https://github.com/jbenet/go-datastore/ -- which is similar and already has a bunch.

i've been making these abstractions for a long time: https://github.com/datastore/datastore and https://github.com/datastore

## Repo Contents

The Repo stores:
- version - the repo version, required for safe migrations
The Repo stores a collection of [IPLD](../merkledag/ipld.md) objects that represent:

- keys - cryptographic keys, including node's identity
- config - node configuration and settings
- datastore - locally stored ipfs objects and indexing data
- datastore - content stored locally, and indexing data
- logs - debugging and usage event logs
- locks - process semaphores
- hooks - scripts to run at predefined times (not yet implemented)

![](ipfs-repo-contents.png?)
Note that the IPLD objects a repo stores are divided into:
- **state** (system, control plane) used for the node's internal state
- **content** (userland, data plane) which represent the user's cached and pinned data.

### version
Additionally, the repo state must determine the following. These need not be IPLD objects, though it is of course encouraged:

Repo implementations may change over time, thus they must all be recognizable.
For example, the `fs-repo` simply includes a `version` file with the contents.

### keys
- version - the repo version, required for safe migrations
- locks - process semaphores for correct concurrent access

A Repo holds the keys a node has access to, for signing xor encryption.
This includes:

- a special (private, public) key pair that defines the node's identity
- (private, public) key pairs
- symmetric keys

TODO: perhaps support ssh-agent style delegation.
![](ipfs-repo-contents.png?)

### config
### version

The node's config is a tree of variables, used to configure various aspects
of operation. For example:
- the set of bootstrap peers IPFS uses to connect to the network
- the Swarm, API, and Gateway network listen addresses
Repo implementations may change over time, thus they MUST include a `version` that is recognizable across versions. That is, a tool MUST be able to read the `version` of any repo of a given type, regardless of which version wrote it.

For example, the `fs-repo` simply includes a `version` file with the version number. This way, the repo contents can evolve over time but the version remains readable the same way across versions.

### datastore

IPFS nodes stores some merkledag objects locally. These are either pinned
(stored until they are unpinned) or cached (stored until the next repo garbage
collection).
IPFS nodes store some IPLD objects locally. These are either (a) **state objects** required for local operation -- such as the `config` and `keys` -- or (b) **content objects** used to represent data locally available. **Content objects** are either _pinned_ (stored until they are unpinned) or _cached_ (stored until the next repo garbage collection).
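
As a rough illustration of the distinction above, the sketch below models a datastore as an in-memory map whose garbage collection drops only cached content. The `ToyDatastore` class and its method names are hypothetical and are not the go-datastore API; links between objects are not modelled, and state objects are simply treated as always retained.

```python
class ToyDatastore:
    """In-memory stand-in for a repo datastore (hypothetical, for illustration)."""

    def __init__(self):
        self.objects = {}    # key -> bytes
        self.state = set()   # keys of state objects (e.g. config, keys)
        self.pinned = set()  # keys of pinned content objects

    def put(self, key, data, *, state=False):
        """Store an object; mark it as a state object if requested."""
        self.objects[key] = data
        if state:
            self.state.add(key)

    def pin(self, key):
        """Protect a stored content object from garbage collection."""
        if key not in self.objects:
            raise KeyError(key)
        self.pinned.add(key)

    def unpin(self, key):
        """Make a content object eligible for garbage collection again."""
        self.pinned.discard(key)

    def gc(self):
        """Remove cached content; keep state objects and pinned content."""
        keep = self.state | self.pinned
        removed = [k for k in self.objects if k not in keep]
        for k in removed:
            del self.objects[k]
        return sorted(removed)
```

In this model an unpinned object survives only until the next `gc()` call, which is the "cached" behaviour described above.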

The name "datastore" comes from
[go-datastore](https://github.com/jbenet/go-datastore), a library for
@@ -86,26 +74,53 @@ feature swappable datastores, for example:
This makes it easy to change properties or performance characteristics of
a repo without an entirely new implementation.


### keys (state)

A Repo typically holds the keys a node has access to, for signing and for encryption. This includes:

- a special (private, public) key pair that defines the node's identity
- (private, public) key pairs
- symmetric keys

Some repos MAY support key-agent delegation, instead of storing the keys directly.

Keys are structured using the [multikey](https://github.com/jbenet/multikey) format, and are part of the [keychain](../keychain) datastructure. This means all keys are IPLD objects, and that they link to all the data needed to make sense of them, including parent keys, identities, and certificates.

### config (state)

The node's `config` (configuration) is a tree of variables, used to configure various aspects of operation. For example:
- the set of bootstrap peers IPFS uses to connect to the network
- the Swarm, API, and Gateway network listen addresses

It is recommended that `config` files avoid identifying information, so that they may be re-shared across multiple nodes.

**CHANGES**: today, implementations like go-ipfs store the peer-id and private key directly in the config. These will be moved out of the config.
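
To make the sharing recommendation concrete, here is a minimal sketch that strips identifying values from a config tree before it is re-shared. It assumes the config is a JSON-like nested dict and that identifying values live under a top-level `Identity` subtree; the key names, the placeholder values, and the `shareable_config` helper are assumptions for illustration, not a schema defined by this spec.

```python
import copy

# Assumed location of identifying values; not defined by this spec.
IDENTIFYING_KEYS = ("Identity",)


def shareable_config(config):
    """Return a deep copy of `config` without identifying top-level subtrees."""
    shared = copy.deepcopy(config)
    for key in IDENTIFYING_KEYS:
        shared.pop(key, None)
    return shared


# Placeholder values only; addresses and keys here are made up.
example = {
    "Identity": {"PeerID": "<peer-id>", "PrivKey": "<private-key>"},
    "Bootstrap": ["/ip4/1.2.3.4/tcp/4001"],
    "Addresses": {
        "Swarm": ["/ip4/0.0.0.0/tcp/4001"],
        "API": "/ip4/127.0.0.1/tcp/5001",
        "Gateway": "/ip4/127.0.0.1/tcp/8080",
    },
}
```

Because only the listed subtrees are dropped, any extra values a user has added to the tree pass through unchanged.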

Review comment:

I think it would be good to specify the schema. Ideally any implementation should be compatible. Also, mention that it's currently stored in JSON.

Review comment (Member Author):

that sounds good-- though i do want to keep config files flexible, in that people may want to add more values into it than we give -- similar to how people use .git/config


### logs

A full IPFS node is complex. Many events can happen, and thus ipfs
A full IPFS node is complex. Many events can happen, and thus some ipfs
implementations capture event logs and (optionally) store them for user review
or debugging.

Logs MAY be stored directly as IPLD objects along with everything else, but this may be a problem if the logs

Review comment:

we will never know how this sentence ends :) Suspense !

Review comment (Member):

choose your own adventure! :D

Review comment (Member Author):

it was never logged! 😲


**NOTE**: go-ipfs no longer stores logs. It only emits them at a given route. This section is kept here in case other implementations may wish to store logs, though it may be removed in the future.

### locks

IPFS implementations may use multiple processes, or may disallow multiple
processes from running simultaneously on the same repo. This synchronization
is accomplished via locks on the repo itself.
processes from using the same repo simultaneously. Others may disallow using
the same repo but may allow sharing _datastores_ simultaneously. This
synchronization is accomplished via _locks_.

All repos contain the following standard locks:
- `repo.lock` - prevents concurrent access to the repo.
Must be held to read or write.
- `repo.lock` - prevents concurrent access to the repo. Must be held to _read_ or _write_.

### hooks (TODO)

Like git, IPFS will have `hooks`, a set of user configurable scripts that
can be run at predefined moments in ipfs operations. This makes it easy
Like git, IPFS nodes will allow `hooks`, a set of user configurable scripts
to run at predefined moments in ipfs operations. This makes it easy
to customize the behavior of ipfs nodes without changing the implementations
themselves.

@@ -114,14 +129,15 @@ themselves.
#### A Repo uniquely identifies an IPFS Node

A repository uniquely identifies a node. Running two different ipfs programs
with identical repositories -- and thus identical identities -- will cause
with identical repositories -- and thus identical identities -- WILL cause
problems.

Datastores MAY be shared -- with proper synchronization -- though note that sharing datastore access MAY erode privacy.

#### Repo implementation changes MUST include migrations

DO NOT BREAK USERS' DATA. It is critical. Thus, any changes to a repo's
implementation must be accompanied by a migration tool.
**DO NOT BREAK USERS' DATA.** This is critical. Thus, any changes to a repo's implementation **MUST** be accompanied by a **SAFE** migration tool.

See https://github.com/jbenet/go-ipfs/issues/537 and
https://github.com/jbenet/random-ideas/issues/33
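
One way to picture such a migration tool, purely as a sketch under stated assumptions: the repo is modelled as a plain dict with an integer `version` key, and each migration step upgrades exactly one version. The `migrate` function and the shape of the `migrations` mapping are hypothetical; a real tool works on disk and needs backups and crash safety, none of which is shown here.

```python
import copy


def migrate(repo, target, migrations):
    """Apply single-step migrations until the repo reaches `target`.

    `migrations` maps a source version n to a function that takes a repo
    dict at version n and returns a dict at version n + 1. The input dict
    is never mutated, so a failed step leaves the original untouched.
    """
    current = copy.deepcopy(repo)
    while current["version"] < target:
        step = migrations.get(current["version"])
        if step is None:
            raise LookupError(f"no migration from version {current['version']}")
        upgraded = step(copy.deepcopy(current))
        if upgraded["version"] != current["version"] + 1:
            raise ValueError("a migration must advance exactly one version")
        current = upgraded
    return current
```

Working on copies and refusing to skip versions are the two properties this sketch uses to stand in for "safe"; they are design choices of the example, not requirements stated by the spec.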

81 changes: 68 additions & 13 deletions repo/fs-repo/README.md
@@ -42,10 +42,12 @@ The repo interface is defined [here](../).

### api

`api` is a file that exists only if there is currently a live api listening
for requests. This is used when the `repo.lock` prevents access. Clients may
opt to use the api service, or wait untill the process holding `repo.lock`
exits. The file's content is the api multiaddr
`./api` is a file that exists to denote an API endpoint to listen to.
- It MAY exist even if the endpoint is no longer live (i.e. it is a _stale_ or left-over `./api` file).

In the presence of an `./api` file, ipfs tools (e.g. go-ipfs `ipfs daemon`) MUST attempt to delegate to the endpoint, and MAY remove the file if reasonably certain it is stale (e.g. the endpoint is local, but no process is live).

The `./api` file is used in conjunction with the `repo.lock`. Clients may opt to use the api service, or wait until the process holding `repo.lock` exits. The file's content is the api endpoint as a [multiaddr](https://github.com/jbenet/multiaddr).

```
> cat .ipfs/api
```
@@ -57,6 +59,31 @@ Notes:
- It is not enough to use the `config` file, as the API addr of a daemon may
have been overridden via ENV or flag.

#### api file for remote control

One use case of the `api` file is to have a repo directory like:

```
> tree $IPFS_PATH
/Users/jbenet/.ipfs
└── api

0 directories, 1 files

> cat $IPFS_PATH/api
/ip4/1.2.3.4/tcp/5001
```

In go-ipfs, this has the same effect as:

```
ipfs --api /ip4/1.2.3.4/tcp/5001 <cmd>
```

Meaning that it makes ipfs tools use an ipfs node at the given endpoint, instead of the local directory as a repo.

In this use case, the rest of the `$IPFS_PATH` may be completely empty, and no other information is necessary. Such a directory cannot really be called a _repo_ per se. (TODO: come up with a good name for this).
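
The lookup described in this section can be sketched as follows. `parse_api_multiaddr` and `read_api_endpoint` are hypothetical helper names; only the `/ip4/<host>/tcp/<port>` shape from the example above is handled, which is far from full multiaddr support, and no liveness or staleness check is attempted.

```python
import os


def parse_api_multiaddr(text):
    """Parse an `/ip4/<host>/tcp/<port>` multiaddr into (host, port).

    Only this one shape is handled; anything else raises ValueError.
    """
    parts = text.strip().split("/")
    # The leading "/" yields an empty first element.
    if len(parts) != 5 or parts[0] != "" or parts[1] != "ip4" or parts[3] != "tcp":
        raise ValueError(f"unsupported api multiaddr: {text!r}")
    return parts[2], int(parts[4])


def read_api_endpoint(repo_path):
    """Return (host, port) from `<repo_path>/api`, or None if the file is absent."""
    try:
        with open(os.path.join(repo_path, "api"), encoding="utf-8") as f:
            return parse_api_multiaddr(f.read())
    except FileNotFoundError:
        return None
```

A tool built this way would delegate to the returned endpoint when one is present and fall back to opening the local repo when `read_api_endpoint` returns `None`.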

### blocks/

The `blocks/` component contains the raw data representing all IPFS objects
@@ -119,9 +146,9 @@ timestamp of their creation. For example:

### repo.lock

`repo.lock` prevents concurrent access to the repo. Its content is the PID
of the process currently holding the lock. This allows clients to detect
a failed lock cleanup.
`repo.lock` prevents concurrent access to the repo. Its content SHOULD be the
PID of the process currently holding the lock. This allows clients to detect
a failed lock and clean up.

```
> cat .ipfs/repo.lock
@@ -130,17 +157,32 @@ a failed lock cleanup.
42 ttys000 79:05.83 ipfs daemon
```

**TODO, ADDRESS DISCREPANCY:** the go-ipfs implementation does not currently store the PID in the file, which on some systems causes errors after a failure or a teardown. Recovery SHOULD NOT require any manual intervention -- a present lock should give new processes enough information to recover. Doing this correctly in a portable, safe way, with good UX, is very tricky. We must be careful with TOCTTOU bugs and with multiple concurrent processes capable of running at any moment. The goal is for all processes to operate safely, to avoid bothering the user, and for the repo to always remain in a correct, consistent state.
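
A minimal sketch of a PID-carrying lock file, assuming a POSIX system, is shown below. All four function names are hypothetical and this is not how go-ipfs implements its lock. The sketch deliberately stops short of removing a stale lock automatically, because the check-then-remove sequence is exactly where the TOCTTOU problems mentioned above arise.

```python
import os


def acquire_lock(repo_path):
    """Create `<repo_path>/repo.lock` atomically and record our PID in it.

    Raises FileExistsError if the lock file already exists.
    """
    path = os.path.join(repo_path, "repo.lock")
    # O_CREAT | O_EXCL makes creation fail if the file already exists, so
    # two processes cannot both believe they created the lock.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return path


def lock_holder(repo_path):
    """Return the PID recorded in the lock file, or None if absent or unreadable."""
    try:
        with open(os.path.join(repo_path, "repo.lock"), encoding="utf-8") as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def holder_is_alive(pid):
    """Best-effort liveness check (POSIX only): signal 0 probes without delivering."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def release_lock(repo_path):
    """Remove the lock file; the caller must be the process that holds it."""
    os.remove(os.path.join(repo_path, "repo.lock"))
```

Two caveats apply even to this small sketch: there is a brief window after creation in which the file exists but is still empty (here `lock_holder` returns `None`), and a recorded PID may have been reused by an unrelated process, so `holder_is_alive` can only ever be a hint.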

### version

The `version` file contains the repo implementation name and version
The `version` file contains the repo implementation name and version. This format has changed over time:

```
> cat version
fs-repo: 1
# in version 0
> cat $repo-at-version-0/version
cat: /Users/jbenet/.ipfs/version: No such file or directory

# in versions 1 and 2
> cat $repo-at-version-1/version
1
> cat $repo-at-version-2/version
2

# in versions >=3
> cat $repo-at-version-3/version
fs-repo/3
```

_Any_ fs-repo implementation of _any_ versions MUST be able to read the
`version` file. It MUST NOT change between versions.
_Any_ fs-repo implementation of _any_ version `>0` MUST be able to read the
`version` file. It MUST NOT change format between versions. The sole exception is version 0, which had no file.

**TODO: ADDRESS DISCREPANCY:** versions 1 and 2 of the go-ipfs implementation use just the integer number. It SHOULD have used `fs-repo/<version-number>`. We could either change the spec and always just use the int, or change go-ipfs in version `>=3`. We will have to be backwards compatible.
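
Until the discrepancy is resolved, a backwards-compatible reader has to accept both forms shown above. The following sketch does that; `parse_version` and `read_version` are hypothetical names, and treating a missing file as version 0 is an assumption drawn from the version 0 example.

```python
import re


def parse_version(text):
    """Parse the contents of a `version` file into an integer.

    Accepts the bare integer used by versions 1 and 2 ("1", "2") and the
    `fs-repo/<n>` form shown for version 3.
    """
    match = re.fullmatch(r"(?:fs-repo/)?(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version file contents: {text!r}")
    return int(match.group(1))


def read_version(path):
    """Return the repo version; a missing file is treated as version 0."""
    try:
        with open(path, encoding="utf-8") as f:
            return parse_version(f.read())
    except FileNotFoundError:
        return 0
```

Anything that matches neither form raises an error rather than being guessed at, so an unknown future format is not silently misread.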

## Datastore

@@ -188,8 +230,21 @@ For example:
filesystems are case insensitive.
- the multihash prefix is two bytes, which would waste two directory levels,
thus these are combined into one.
- the git `idx` and `pack` file could be used to coalesce objects
- the git `idx` and `pack` file formats could be used to coalesce objects

**TODO: ADDRESS DISCREPANCY:**

The go-ipfs fs-repo in version 2 uses a different `blocks/` dir layout:

```
/Users/jbenet/.ipfs/blocks
├── 12200007
│   └── 12200007d4e3a319cd8c7c9979280e150fc5dbaae1ce54e790f84ae5fd3c3c1a0475.data
├── 1220000f
│   └── 1220000fadd95a98f3a47c1ba54a26c77e15c1a175a975d88cf198cc505a06295b12.data
```

We MUST address whether we should change the fs-repo spec to match go-ipfs in version 2, or we should change go-ipfs to match the fs-repo spec (more tiers). We MUST also address whether the levels are a repo version parameter or a config parameter. There are filesystems in which a different fanout will have wildly different performance. These are mostly networked and legacy filesystems.
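
The single-level layout in the listing above can be expressed as a small path function. This is a sketch inferred only from the two example paths: the directory name is taken to be the first 8 hex characters of the key, and the `block_path` helper with its `prefix_len` parameter is hypothetical, shown to illustrate what a fanout parameter might look like.

```python
import os


def block_path(blocks_dir, key_hex, prefix_len=8):
    """Return the path of a block file in a single-level sharded layout.

    `key_hex` is the hex-encoded multihash. The shard directory is its
    first `prefix_len` hex characters and the file is `<key_hex>.data`.
    """
    if len(key_hex) <= prefix_len:
        raise ValueError("key is not longer than the shard prefix")
    return os.path.join(blocks_dir, key_hex[:prefix_len], key_hex + ".data")
```

Note that both keys in the listing begin with `1220`, so a `prefix_len` of 4 would place them in the same directory; this is the wasted-level concern raised earlier about the two-byte multihash prefix.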

### Reading without the `repo.lock`
