\chapter{Implementation Notes}
\cvmfs\ has a modular structure and relies on several open source libraries.
Figure~\ref{fig:cvmfsblocks} shows the internal building blocks of \cvmfs.
Most of these libraries are shipped with the \cvmfs\ sources and are linked statically in order to facilitate debugging and to keep the system dependencies minimal.
\begin{figure}
\begin{center}
\input{figures/cvmfs-blocks}
\end{center}
\caption{\cvmfs\ block diagram.}
\label{fig:cvmfsblocks}
\end{figure}
\section{File Catalog}
A \cvmfs\ repository is defined by its \emph{file catalog}.
The file catalog is an \sqlite\footnote{\url{https://www.sqlite.org}} database~\cite{sqlite10} having a single table that lists files and directories together with their metadata.
The table layout is shown in Table~\ref{tab:catalog}.
\begin{table}
\begin{center}
\begin{tabular}[b]{ccc}
\begin{tabular}{ll}
\toprule
\bf Field & \bf Type \\
\midrule
\bf Path MD5 & 128\,Bit Integer \\
Parent Path MD5 & 128\,Bit Integer \\
Hardlinks & Integer \\
SHA1 Content Hash & 160\,Bit Integer \\
Size & Integer\\
Mode & Integer \\
Last Modified & Timestamp \\
Flags & Integer \\
Name & String \\
Symlink & String \\
uid & Integer \\
gid & Integer \\
\bottomrule
\end{tabular} &
\qquad &
\begin{tabular}{ll}
\toprule
\bf Flags & \bf Meaning\\
\midrule
1 & Directory \\
2 & Transition point to a nested catalog \\
33 & Root directory of a nested catalog \\
3 & Regular file \\
4 & Symbolic link \\
\bottomrule
\end{tabular}
\end{tabular}
\end{center}
\caption{Metadata information stored per directory entry.
The inode is dynamically issued by \cvmfs\ at runtime.}
\label{tab:catalog}
\end{table}
In order to save space we do not store absolute paths.
Instead we store MD5~\cite{rfc1321,rfc6151} hash values of the absolute path names.
Symbolic links are kept in the catalog.
Symbolic links may contain environment variables in the form \texttt{\$(VAR\_NAME)} that will be dynamically resolved by \cvmfs\ on access.
Hardlinks are emulated by \cvmfs.
The hardlink count is stored in the lower 32 bits of the hardlinks field; the \emph{hardlink group} is stored in the upper 32 bits.
If the hardlink group is greater than zero, all files with the same hardlink group will get the same inode issued by the \cvmfs\ Fuse client.
Emulated hardlinks work only within the same directory.
The SHA-1~\cite{rfc3174} content hash refers to the zlib-compressed~\cite{rfc1950} version of the file.
Flags indicate the type of a directory entry (see Table~\ref{tab:catalog}).
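For illustration, the following sketch derives the 128\,bit lookup key from an absolute path and decodes the hardlinks field; it assumes OpenSSL's \texttt{MD5()} and uses made-up example values.
\begin{verbatim}
// Sketch: catalog lookup key from an absolute path, plus decoding of
// the hardlinks field. Example path and values are illustrative.
#include <openssl/md5.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  const char *path = "/software/x86_64/v1.0/setup.sh";
  unsigned char digest[MD5_DIGEST_LENGTH];  // 16 bytes = 128 bit
  MD5(reinterpret_cast<const unsigned char *>(path), strlen(path), digest);
  for (int i = 0; i < MD5_DIGEST_LENGTH; ++i) printf("%02x", digest[i]);
  printf("  <- 128 bit lookup key for %s\n", path);

  // Hardlinks field: count in the lower 32 bits, group in the upper 32
  uint64_t hardlinks = (UINT64_C(7) << 32) | 2;  // group 7, count 2
  uint32_t group = static_cast<uint32_t>(hardlinks >> 32);
  uint32_t count = static_cast<uint32_t>(hardlinks & 0xFFFFFFFF);
  printf("hardlink group %u, link count %u\n", group, count);
  return 0;
}
\end{verbatim}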
A file catalog contains a \emph{time to live} (TTL), stored in seconds.
The catalog TTL advises clients to check for a new version of the catalog, when expired.
Checking for a new catalog version takes place with the first file system operation on a \cvmfs\ volume after the TTL has expired.
The default TTL is 15 minutes.
If a new catalog is available, \cvmfs\ delays its loading for the period of the \cvmfs\ kernel cache lifetime (default: 1 minute).
During this drain-out period, kernel caching is turned off.
The first file system operation on a \cvmfs\ volume after that additional delay applies the new file catalog, and kernel caching is turned back on.
\subsection{Nested Catalogs}
In order to keep catalog sizes reasonable\footnote{As a rule of thumb, file catalogs up to \SI{25}{\mega\byte} (compressed) are reasonably small.}, repository subtrees may be cut and stored as separate \emph{nested catalogs}.
There is no limit on the level of nesting.
A reasonable approach is to store separate software versions as separate nested catalogs.
Figure~\ref{fig:nested} shows the directory structure which we use for the ATLAS repository.
\begin{figure}
\begin{center}
\framebox{\input{figures/nestedcatalogs}}
\end{center}
\caption{Directory structure used for the ATLAS repository (simplified).}
\label{fig:nested}
\end{figure}
When a subtree is moved into a nested catalog, its entry directory serves as \emph{transition point} for nested catalogs.
This directory appears as an empty directory in the parent catalog with flags set to 2.
The same path appears as the root directory in the nested catalog with flags set to 33.
Because the MD5 hash values refer to full absolute paths, nested catalogs store the root path prefix.
This prefix is prepended transparently by \cvmfs.
The SHA-1 key of nested catalogs is stored in the parent catalog.
Therefore, the root catalog fully defines an entire repository.
Loading of nested catalogs happens on demand by \cvmfs\ on the first attempt to access anything inside, \ie a user does not see the difference between a single large catalog and several nested catalogs.
While this usually avoids loading unnecessary catalogs, recursive operations like \texttt{find} can easily bypass this optimization.
\subsection{Catalog Statistics}
A \cvmfs\ file catalog maintains several counters about its contents and the contents of all of its nested catalogs.
The idea is that the catalogs know how many entries there are in their sub-catalogs even without opening them.
This way, one can immediately tell how many entries, for instance, the entire ATLAS repository has.
The numbers are exposed as inode counts through \texttt{statvfs}.
So \texttt{df -i} shows the overall number of entries in the repository and (as number of used inodes) the number of entries of currently loaded catalogs.
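The counters can be read with a few lines of code; the following sketch uses the \texttt{statvfs()} system call (the mount point is just an example).
\begin{verbatim}
#include <sys/statvfs.h>
#include <cstdio>

int main() {
  struct statvfs info;
  if (statvfs("/cvmfs/atlas.cern.ch", &info) != 0) {
    perror("statvfs");
    return 1;
  }
  // f_files: overall number of entries in the repository
  // f_files - f_ffree: entries covered by currently loaded catalogs
  printf("total entries:  %llu\n", (unsigned long long)info.f_files);
  printf("loaded entries: %llu\n",
         (unsigned long long)(info.f_files - info.f_ffree));
  return 0;
}
\end{verbatim}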
Nested catalogs create an additional entry (the transition directory is stored in both the parent and the child catalog).
File hardlinks are still individual entries (inodes) in the cvmfs catalogs.
The following counters are maintained for both a catalog itself and for the subtree this catalog is root of:
\begin{itemize}
\item Number of regular files
\item Number of symbolic links
\item Number of directories
\item Number of nested catalogs
\item Number of chunked files
\item Number of individual file chunks
\item Overall file content size
\item File content size stored in chunked files
\end{itemize}
%\subsection{$\mu$-Catalogs}
%The $\mu$-catalogs can be constructed in addition to the normal file catalogs.
%The $\mu$-catalogs can be considered as a special case of nested catalogs, in which each directory is a nested catalog.
%Essentially they are a precalculated \texttt{ls -la} for all directories.
%They are implemented as SQlite catalogs as well having the same structure as normal catalogs.
%The SHA-1 content hashes of the $\mu$-catalogs are stored in the hash field of the directory entries in the catalog table.
\pagebreak
\section{Repository Manifest (.cvmfspublished)}
\label{sct:cvmfspublished}
Every \cvmfs\ repository contains a repository manifest file that serves as entry point into the repository's catalog structure.
The repository manifest is the first file accessed by the \cvmfs\ client at mount time and therefore must be accessible via HTTP on the repository root URL.
It is always called \textbf{.cvmfspublished} and contains fundamental repository metadata like the root catalog's SHA-1 hash and the repository revision number as a key-value list.
\subsection{Internal Manifest Structure}
Below is an example of a typical manifest file.
Each line starts with a capital letter specifying the metadata field, followed by the actual data string.
The list of metadata is terminated by a separator line (\texttt{-{}-}) followed by signature information further described in Section~\ref{sct:cvmfspublished:signature}.
Table~\ref{tbl:manifestfields} describes each of the metadata fields in detail.
\begin{table}
\begin{center}
\begin{tabularx}{\linewidth}{lX}
\toprule
{\bf\centering Field} & {\bf\centering Meta Data Description} \\
\midrule
\texttt{C} & SHA-1 hash of the repository's current root catalog \\
\texttt{R} & \begin{tabular}[t]{@{}l@{}}MD5 hash of the repository's root path\\(usually \texttt{d41d8cd98f00b204e9800998ecf8427e})\end{tabular} \\
\texttt{B} & File size of the root catalog in bytes \\
\texttt{X} & SHA-1 hash of the signing certificate \\
\texttt{H} & SHA-1 hash of the repository's named tag history database \\
\texttt{T} & Unix timestamp of this particular revision \\
\texttt{D} & Time To Live (TTL) of the root catalog \\
\texttt{S} & Revision number of this published revision \\
\texttt{N} & The full name of the manifested repository \\
\texttt{L} & Currently unused (reserved for micro catalogs) \\
\bottomrule
\end{tabularx}
\end{center}
\caption{Meaning of Repository Manifest Metadata Fields}
\label{tbl:manifestfields}
\end{table}
\begin{verbatim}
C64551dccfbe0a48de7618dd7deb290200b474759
B1442336
Rd41d8cd98f00b204e9800998ecf8427e
D900
S42
Nexample.cern.ch
X731cca9476eb882f5a3f24aaa38001105a0e35eb
T1390301299
--
edde5308e502dd5e8fe405c56f5700f7477dc319
[...]
\end{verbatim}
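A minimal parser sketch for the key-value part of the manifest might look as follows (field meanings as in Table~\ref{tbl:manifestfields}).
\begin{verbatim}
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main() {
  std::ifstream manifest(".cvmfspublished");
  std::map<char, std::string> fields;
  std::string line;
  while (std::getline(manifest, line)) {
    if (line == "--") break;           // separator: signature follows
    if (line.empty()) continue;
    fields[line[0]] = line.substr(1);  // capital letter + data string
  }
  std::cout << "root catalog: " << fields['C'] << "\n"
            << "revision:     " << fields['S'] << "\n"
            << "repository:   " << fields['N'] << "\n";
  return 0;
}
\end{verbatim}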
\subsection{Repository Signature}
\label{sct:cvmfspublished:signature}
In order to provide authoritative information about a repository publisher, the repository manifest is signed with the private key of an X.509 certificate.
\subsubsection{Signing a Repository}
It is important to note that signing just the manifest file itself is sufficient to secure the entire repository.
The manifest refers to the SHA-1 content hash of the root catalog which in turn recursively references all sub-catalogs with their SHA-1 content hashes.
Each catalog lists its files along with their SHA-1 content hashes.
This concept is called a Merkle tree and it eventually provides a single hash that depends on the \textit{complete} content of the repository.
The top level hash used for the repository signature can be found in the repository manifest right below the separator line (\texttt{-{}-} / see above).
It is the SHA-1 hash of the manifest's meta data lines excluding the separator line.
Following the top level hash is the actual signature produced by the X.509 certificate signing procedure in binary form.
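The following sketch computes such a top level hash with OpenSSL's \texttt{SHA1()}; the exact treatment of line terminators is glossed over here.
\begin{verbatim}
#include <openssl/sha.h>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

int main() {
  std::ifstream manifest(".cvmfspublished");
  std::ostringstream meta;
  std::string line;
  while (std::getline(manifest, line)) {
    if (line == "--") break;  // hash covers lines before the separator
    meta << line << "\n";
  }
  std::string data = meta.str();
  unsigned char digest[SHA_DIGEST_LENGTH];
  SHA1(reinterpret_cast<const unsigned char *>(data.data()),
       data.size(), digest);
  for (int i = 0; i < SHA_DIGEST_LENGTH; ++i) printf("%02x", digest[i]);
  printf("\n");  // compare with the line right below the separator
  return 0;
}
\end{verbatim}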
\subsubsection{Signature Validation}
In order to validate repository manifest signatures, \cvmfs\ uses a white-list of valid publisher certificates.
The white-list contains the SHA-1 fingerprints of known publisher certificates and a timestamp.
A white-list is valid for 30 days.
It is signed by a private RSA key, which we refer to as \emph{master key}.
%For the cern.ch repositories, the master key is kept offline on a secure smart card at \cern.
%So the security of the master key is ensured by 2-factor authentication (possession of the smart card and knowledge of the pin).
%The smart card is used with a smart card reader with pin pad.
%Neither the key nor the pin are ever exposed outside the scope of those secure devices.
The public RSA key that corresponds to the master key is distributed with the \cvmfs\ sources and the \texttt{cvmfs-keys} RPM as well as with every instance of \cernvm.
In addition, \cvmfs\ checks certificate fingerprints against the local blacklist \texttt{/etc/cvmfs/blacklist}.
The blacklisted fingerprints have to be in the same format as the fingerprints on the white-list.
The blacklist has precedence over the white-list.
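The precedence rule can be sketched as follows; the file handling is simplified and the fingerprint is hypothetical.
\begin{verbatim}
#include <fstream>
#include <iostream>
#include <string>

// Returns true if the given certificate fingerprint is blacklisted.
// The blacklist has precedence over the white-list.
bool IsBlacklisted(const std::string &fingerprint) {
  std::ifstream blacklist("/etc/cvmfs/blacklist");
  std::string line;
  while (std::getline(blacklist, line)) {
    if (line == fingerprint) return true;
  }
  return false;
}

int main() {
  std::string fp = "73:1C:CA:94:76:EB:88:2F";  // hypothetical fingerprint
  std::cout << (IsBlacklisted(fp) ? "reject" : "check white-list") << "\n";
  return 0;
}
\end{verbatim}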
As crypto engine, \cvmfs\ uses \product{libcrypto} from the \product{OpenSSL} project~\cite{openssl}.
Figure~\ref{fig:security} shows the trust chain with a signed repository.
\begin{figure}
\begin{center}
\input{figures/security}
\end{center}
\caption{Trust chain with a signed repository.}
\label{fig:security}
\end{figure}
%\section{Repository Layout}
%\label{sct:repository}
\pagebreak
\section{Use of HTTP}
The particular way of using the \indexed{HTTP} protocol has significant impact on the performance and usability of \cvmfs.
If possible, \cvmfs\ tries to benefit from the HTTP/1.1 features keep-alive and cache-control.
Internally, \cvmfs\ uses the \product{libcurl} library~\cite{libcurl}.
The HTTP behaviour affects only a system with cold caches.
As soon as all necessary files are cached, there is only network traffic when a catalog TTL expires.
%Usually we'll see network traffic right after booting a \cernvm\ for the first time, after switching to another experiment environment, and after a new software version has been published.
The \cvmfs\ download manager runs as a separate thread that handles download requests asynchronously in parallel.
Concurrent download requests for the same URL are collapsed into a single request.
\subsection{DoS Protection}
A subtle denial of service attack (DoS) can occur when \cvmfs\ is successfully able to download a file but fails to store it in the local cache.
This situation escalates into a DoS when the application using \cvmfs\ remains in an endless loop and tries to open a file over and over again.
\cvmfs\ prevents such a situation by re-trying with an exponential backoff.
%That prevents request storms to web servers from applications trying to open a file in an endless loop.
The backoff is triggered by consecutive failures to cache a downloaded file within 10 seconds.
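The following sketch illustrates the retry logic; the delay constants are illustrative, not the actual \cvmfs\ values.
\begin{verbatim}
#include <unistd.h>
#include <algorithm>

// Sketch: retry a failing fetch operation with exponential backoff.
bool FetchWithBackoff(bool (*try_fetch)()) {
  unsigned delay_sec = 1;
  const unsigned max_delay_sec = 64;
  for (int attempt = 0; attempt < 10; ++attempt) {
    if (try_fetch()) return true;
    sleep(delay_sec);                   // back off before retrying
    delay_sec = std::min(delay_sec * 2, max_delay_sec);  // double it
  }
  return false;  // give up; the application sees an I/O error
}

static int calls = 0;
bool TryFetch() { return ++calls >= 3; }  // succeeds on the 3rd attempt
int main() { return FetchWithBackoff(TryFetch) ? 0 : 1; }
\end{verbatim}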
\subsection{Keep-Alive}
Although the HTTP protocol overhead is small in terms of data volume, in high latency networks we suffer from the bare number of requests: Each request-response cycle has a penalty of at least the network round trip time.
Using plain HTTP/1.0, this results in at least $3\cdot\text{round trip time}$ additional running time per file download for TCP handshake, HTTP GET, and TCP connection finalisation.
By including the \texttt{Connection:~Keep-Alive} header into HTTP requests, we advise the HTTP server end to keep the underlying TCP connection opened.
This way, overhead ideally drops to just round trip time for a single HTTP GET.
The impact of the keep-alive feature is shown in Figure~\ref{fig:keepalive}.
\begin{figure}
\begin{center}
\resizebox{0.5\linewidth}{!}{\input{figures/cvmfs-keepalive}}
\end{center}
\caption{Impact of keep-alive header on multiple file downloads.}
\label{fig:keepalive}
\end{figure}
This feature, of course, somewhat sabotages server-side load balancing.
However, exploiting the HTTP keep-alive feature does not affect scalability per se.
The servers and proxies may safely close idle connections anytime, in particular if they run out of resources.
In practice, the maximum connection duration has to be set carefully for the HTTP daemon.
\subsection{Cache Control}
\label{sct:cachecontrol}
In a limited way, \cvmfs\ advises intermediate web caches how to handle its requests.
For this purpose it uses the \texttt{Pragma:~no-cache} and the \texttt{Cache-Control:~no-cache} headers in certain cases.
These cache control headers apply to both forward and reverse proxies.
However, this is by no means a guarantee that intermediate proxies fetch a fresh original copy (though they should).
By including these headers, \cvmfs\ tries to avoid fetching outdated cached copies.
This has to be handled with care, of course, in order not to overload the repository source server.
Only if \cvmfs\ downloads a corrupted file from a proxy server does it retry with the HTTP \texttt{no-cache} header set.
This way, the corrupted file gets replaced in the proxy server by a fresh copy from the backend.
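The following \product{libcurl} sketch illustrates both techniques: connection reuse through a single easy handle (libcurl keeps HTTP/1.1 connections alive by default) and a retry with the no-cache headers set; the URL is a placeholder.
\begin{verbatim}
#include <curl/curl.h>

int main() {
  curl_global_init(CURL_GLOBAL_ALL);
  CURL *handle = curl_easy_init();

  // Reusing the same handle lets libcurl keep the TCP connection open,
  // so subsequent requests avoid the TCP handshake round trips.
  curl_easy_setopt(handle, CURLOPT_URL,
                   "http://example.cern.ch/opt/repo/.cvmfspublished");
  CURLcode result = curl_easy_perform(handle);

  if (result != CURLE_OK /* or the downloaded data is corrupted */) {
    // Retry with no-cache headers so proxies replace their stale copy.
    struct curl_slist *headers = NULL;
    headers = curl_slist_append(headers, "Pragma: no-cache");
    headers = curl_slist_append(headers, "Cache-Control: no-cache");
    curl_easy_setopt(handle, CURLOPT_HTTPHEADER, headers);
    result = curl_easy_perform(handle);  // reuses the open connection
    curl_slist_free_all(headers);
  }

  curl_easy_cleanup(handle);
  curl_global_cleanup();
  return result == CURLE_OK ? 0 : 1;
}
\end{verbatim}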
\subsection{Identification Header}
\cvmfs\ sends a custom header (\lstinline{X-CVMFS2}) to be identified by the web server.
If you have set the \cernvm\ GUID, this GUID is also transmitted.
%In \cernvm\ we use this header to rewrite URL's to the respective repository format for \cvmfs\ version 1 or \cvmfs\ version 2.
\section{Disk Cache}
\label{sct:mamangedcache}
Each running \cvmfs\ instance requires a local cache directory.
Data are first downloaded into temporary files.
Only at the very last moment are they atomically renamed to their content-addressable SHA-1 names by \texttt{rename()}.
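The download-then-rename pattern can be sketched as follows; the cache paths are illustrative.
\begin{verbatim}
#include <cstdio>
#include <unistd.h>

int main() {
  // Download into a temporary file inside the cache directory first ...
  char tmp_path[] = "/var/lib/cvmfs/shared/txn/fetchXXXXXX";
  int fd = mkstemp(tmp_path);
  if (fd < 0) { perror("mkstemp"); return 1; }
  const char data[] = "...downloaded and verified content...";
  if (write(fd, data, sizeof(data) - 1) < 0) { perror("write"); return 1; }
  close(fd);

  // ... then atomically rename it to its content-addressable SHA-1 name.
  // Readers either see the complete file or no file at all.
  const char *final_path =
      "/var/lib/cvmfs/shared/64551dccfbe0a48de7618dd7deb290200b474759";
  if (rename(tmp_path, final_path) != 0) { perror("rename"); return 1; }
  return 0;
}
\end{verbatim}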
The hard disk cache is managed, \ie \cvmfs\ maintains cache size restrictions and replaces files according to the \indexed{least recently used}\index{LRU|see{least recently used}} (LRU) strategy~\cite{lru06}.
In order to keep track of file sizes and relative file access times, \cvmfs\ sets up another \sqlite\ database in the cache directory, the \emph{cache catalog}.
The cache catalog contains a single table; its structure is shown in Table~\ref{tab:cachecatalog}.
\begin{table}
\begin{center}
\begin{tabular}{ll}
\toprule
\bf Field & \bf Type \\
\midrule
\bf SHA-1 & String (hex notation) \\
Size & Integer \\
Access Sequence & Integer \\
Pinned & Integer \\
File type (chunk or file catalog) & Integer\\
\bottomrule
\end{tabular}
\end{center}
\caption{Cache catalog table structure.}
\label{tab:cachecatalog}
\end{table}
\cvmfs\ does not strictly enforce the cache limit.
Instead \cvmfs\ works with two customizable soft limits, the \emph{cache quota} and the \emph{cache threshold}.
When exceeding the cache quota, files are deleted until the overall cache size is less than or equal to the cache threshold.
The cache threshold is currently hard-wired to half of the cache quota.
The cache quota is for data files as well as file catalogs.
Currently loaded catalogs are pinned in the cache, \ie they will not be deleted until unmount or until a new repository revision is applied.
On unmount, pinned file catalogs are updated with the highest sequence number.
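The cleanup step can be sketched with the cache catalog from Table~\ref{tab:cachecatalog}; the table and column names in the SQL statement are hypothetical.
\begin{verbatim}
#include <sqlite3.h>
#include <cstdint>
#include <cstdio>

// Sketch: delete unpinned files in least-recently-used order until the
// overall cache size drops below the cache threshold (half the quota).
// Table/column names (cache_catalog, sha1, size, acseq, pinned) are
// hypothetical; ordering by the signed sequence number automatically
// prefers volatile files, whose sequence numbers are negative.
void Cleanup(sqlite3 *db, uint64_t *cache_size, uint64_t threshold) {
  sqlite3_stmt *stmt;
  sqlite3_prepare_v2(db,
      "SELECT sha1, size FROM cache_catalog "
      "WHERE pinned = 0 ORDER BY acseq ASC;", -1, &stmt, NULL);
  while (*cache_size > threshold && sqlite3_step(stmt) == SQLITE_ROW) {
    const char *sha1 =
        reinterpret_cast<const char *>(sqlite3_column_text(stmt, 0));
    uint64_t size = sqlite3_column_int64(stmt, 1);
    printf("would unlink %s (%llu bytes)\n", sha1,
           (unsigned long long)size);
    *cache_size -= size;  // here: also unlink() the file, update table
  }
  sqlite3_finalize(stmt);
}

int main() {
  sqlite3 *db;
  sqlite3_open(":memory:", &db);
  sqlite3_exec(db,
      "CREATE TABLE cache_catalog"
      " (sha1 TEXT, size INTEGER, acseq INTEGER, pinned INTEGER);"
      "INSERT INTO cache_catalog VALUES ('aa07', 600, 1, 0);"
      "INSERT INTO cache_catalog VALUES ('bb3f', 400, 2, 0);",
      NULL, NULL, NULL);
  uint64_t cache_size = 1000;
  Cleanup(db, &cache_size, 500);  // threshold: half of a 1000 byte quota
  sqlite3_close(db);
  return 0;
}
\end{verbatim}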
The cache catalog can be re-constructed from scratch on mount.
Re-constructing the cache catalog is necessary when the managed cache is used for the first time and every time when ``unmanaged'' changes occurred to the cache directory, \eg when \cvmfs\ was terminated unexpectedly.
Re-construction has to be triggered manually.
In case of an exclusive cache, the cache manager runs as a separate thread of the \texttt{cvmfs2} process.
This thread gets notified by the Fuse module whenever a file is opened or inserted.
Notification is done through a pipe.
The shared cache uses the very same code, except that the thread becomes a separate process (see Figure~\ref{fig:sharedcache}).
The cache manager is not a separate binary; instead, \texttt{cvmfs2} forks itself with special arguments indicating that it is supposed to run as a cache manager.
The cache manager does not need to be started as a service.
The first \cvmfs\ instance that uses a shared cache will automatically spawn the cache manager process.
Subsequent \cvmfs\ instances will connect to the pipe of this cache manager.
Once the last \cvmfs\ instance that uses the shared cache is unmounted, the communication pipe is left without any writers and the cache manager automatically quits.
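The fork-to-self pattern can be sketched as follows; the special \texttt{\_\_cachemgr\_\_} argument is hypothetical.
\begin{verbatim}
#include <string>
#include <unistd.h>
#include <cstdio>

// Sketch: the binary forks itself to run as cache manager; clients
// talk to it through a pipe.
int main(int argc, char **argv) {
  if (argc > 1 && std::string(argv[1]) == "__cachemgr__") {
    printf("running as cache manager, serving requests on a pipe\n");
    // ... read open/insert notifications from the pipe until all
    // writers are gone, then quit automatically ...
    return 0;
  }
  if (fork() == 0) {
    // Child: re-execute the same binary with the special argument.
    execl("/proc/self/exe", argv[0], "__cachemgr__", (char *)NULL);
    _exit(1);  // only reached if execl fails
  }
  printf("mounting; cache manager spawned\n");
  return 0;
}
\end{verbatim}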
\begin{figure}
\centering
\input{figures/sharedcache}
\caption{The \cvmfs\ shared local hard disk cache.}
\label{fig:sharedcache}
\end{figure}
The \cvmfs\ cache supports two classes of files with respect to the cache replacement strategy: \emph{normal} files and \emph{volatile} files.
The sequence numbers of volatile files have bit 63 set.
Hence they are interpreted as negative numbers and have precedence over normal files when it comes to cache cleanup.
On automatic rebuild the volatile property of entries in the cache database is lost.
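The bit trick can be demonstrated in a few lines.
\begin{verbatim}
#include <cstdint>
#include <cstdio>

int main() {
  uint64_t seq = 42;                        // normal file
  uint64_t vol = 42 | (UINT64_C(1) << 63);  // volatile file: bit 63 set

  // Reinterpreted as signed integers, volatile sequence numbers are
  // negative and therefore sort before all normal files on cleanup.
  printf("normal:   %lld\n", (long long)(int64_t)seq);
  printf("volatile: %lld\n", (long long)(int64_t)vol);
  return 0;
}
\end{verbatim}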
\section{File System Traces}\index{traces}\index{file system traces|see{traces}}
\cvmfs\ has an optional file system operations tracer.
The tracer creates logs of usage, which can---for instance---be used as profiling information for pre-fetching.
The trace file is created in \indexed{CSV} format (see Figure~\ref{fig:traces}).
\begin{figure}
\centering
\begin{verbatim}
"1481074921.015","1","/root/i686-pc-linux-gnu/include/TQObject.h","OPEN"
"1481074931.030","1","/root/i686-pc-linux-gnu/include/KeySymbols.h","OPEN"
"1481074931.220","3","/root/i686-pc-linux-gnu/include/KeySymbols.h","READ (TRY)"
"1481074964.565","3","/root/i686-pc-linux-gnu/include/KeySymbols.h","READ 7407@0"
"1481074965.005","1","/root/i686-pc-linux-gnu/include/TRootCanvas.h","OPEN"
\end{verbatim}
\caption{Example snippet of a trace log. The first column stores time stamps as number of microseconds starting from the Unix epoch. The second column stores the event type. Negative event types are reserved for \cvmfs\ internal events.}
\label{fig:traces}
\end{figure}
The tracing runs in a separate thread and adopts the \emph{thread-safe trace buffer}, a technique used for multi-thread debugging~\cite[Chapter 8]{multicore06}.
Since traces are kept in a memory ring buffer\footnote{Usually, each trace record requires two atomic \texttt{fetch-and-add} operations.} and written to disk in blocks of thousands of lines, the performance overhead for tracing is low.
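A minimal sketch of such a trace buffer, with the two fetch-and-add operations per record and the flush thread omitted:
\begin{verbatim}
#include <atomic>
#include <cstdio>
#include <string>

// Sketch: lock-free ring buffer for trace records. Each trace costs
// two atomic fetch-and-add operations: one to claim a slot, one to
// mark it committed. A flush thread (omitted) writes blocks to disk.
const int kBufferSize = 8192;

struct TraceRecord { double time; int event; std::string path; };

TraceRecord ring[kBufferSize];
std::atomic<uint64_t> seq_claim(0);   // next free slot
std::atomic<uint64_t> seq_commit(0);  // number of completed records

void Trace(int event, const std::string &path) {
  uint64_t slot = seq_claim.fetch_add(1);  // atomic #1: claim a slot
  ring[slot % kBufferSize] = TraceRecord{0.0, event, path};
  seq_commit.fetch_add(1);                 // atomic #2: commit record
}

int main() {
  Trace(1, "/root/include/TQObject.h");  // OPEN event, as in the log
  printf("records committed: %llu\n",
         (unsigned long long)seq_commit.load());
  return 0;
}
\end{verbatim}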
\section{NFS Maps}
In normal mode, \cvmfs\ issues inodes based on the row number of an entry in the file catalog.
When exported via NFS, this scheme can result in inconsistencies because \cvmfs\ does not control the cache lifetime of NFS clients.
A once issued inode can be asked for anytime later by a client.
To be able to reply to such client queries even after reloading catalogs or remounts of \cvmfs, the \cvmfs\ \emph{NFS maps} implement a persistent store of the path names $\mapsto$ inode mappings.
Storing them on hard disk allows for control of the \cvmfs\ memory consumption (currently $\approx\SI{45}{\mega\byte}$ extra) and ensures consistency between remounts of \cvmfs.
The performance penalty for doing so is small.
\cvmfs\ uses Google's \leveldb~\cite{leveldb}, a fast, local key-value store.
Reads and writes are only performed when meta-data are looked up in \sqlite, in which case the \sqlite\ query supposedly dominates the running time.
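The path $\mapsto$ inode store can be sketched with the \leveldb\ API; paths and inode encoding are illustrative.
\begin{verbatim}
#include <leveldb/db.h>
#include <cstdio>
#include <string>

int main() {
  leveldb::DB *db;
  leveldb::Options options;
  options.create_if_missing = true;
  // The database directory on hard disk is an illustrative path.
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/nfs_maps", &db);
  if (!s.ok()) { fprintf(stderr, "%s\n", s.ToString().c_str()); return 1; }

  // Persist the path -> inode mapping ...
  db->Put(leveldb::WriteOptions(), "/software/v1.0/setup.sh", "4711");

  // ... so that the same inode can be served after a remount.
  std::string inode;
  db->Get(leveldb::ReadOptions(), "/software/v1.0/setup.sh", &inode);
  printf("inode: %s\n", inode.c_str());

  delete db;
  return 0;
}
\end{verbatim}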
A drawback of the NFS maps is that there is no easy way to account for them by the cache quota.
They sum up to some 150-200 Bytes per path name that has been accessed.
A recursive \texttt{find} on /cvmfs/atlas.cern.ch with 25 million entries, for instance, would add up to \SI{5}{\giga\byte} in the cache directory.
This is mitigated by the fact that the NFS mode will only be used on a few servers that can be given large enough spare space on hard disk.
\section{Loader}
The \cvmfs\ \fuse\ module comprises a minimal \emph{loader} process (the \texttt{cvmfs2} binary) and a shared library containing the actual \fuse\ module (\texttt{libcvmfs\_fuse.so}).
This structure makes it possible to reload \cvmfs\ code and parameters without unmounting the file system.
Loader and library don't share any symbols except for two global structs \texttt{cvmfs\_exports} and \texttt{loader\_exports} used to call each other's functions.
The loader process opens the Fuse channel and implements stub \fuse\ callbacks that redirect all calls to the \cvmfs\ shared library.
Hotpatch is implemented as unloading and reloading of the shared library, while the loader temporarily queues all file system calls in-between.
Between file system calls, the Fuse module has to keep very little state.
The kernel caches are drained out before reloading.
Open file handles are just file descriptors that are held open by the process.
Open directory listings are stored in a Google dense\_hash that is saved and restored.
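The reload mechanism can be sketched with \texttt{dlopen()}; the symbol handling is heavily simplified.
\begin{verbatim}
#include <dlfcn.h>
#include <cstdio>

// Sketch: hotpatching by unloading and reloading the Fuse module
// library. The loader queues file system calls while no library is
// loaded. Real cvmfs exchanges the cvmfs_exports/loader_exports
// structs; here we only resolve a single symbol.
int main() {
  void *library = dlopen("libcvmfs_fuse.so", RTLD_NOW | RTLD_LOCAL);
  if (!library) { fprintf(stderr, "%s\n", dlerror()); return 1; }
  void *exports = dlsym(library, "cvmfs_exports");
  printf("module loaded, exports at %p\n", exports);

  // Hotpatch: drain kernel caches, queue incoming calls, then ...
  dlclose(library);                                  // unload old code
  library = dlopen("libcvmfs_fuse.so", RTLD_NOW | RTLD_LOCAL);  // reload
  // ... restore saved state and replay the queued calls.
  dlclose(library);
  return 0;
}
\end{verbatim}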
%\section{Pre-fetching}\index{pre-fetching}
%\label{sct:prefetching}
\section{File System Interface}
\label{sct:interface}
Since \cvmfs\ is a read-only file system, there are only a few non-trivial call-back functions to implement.
These call-back functions provide the system interface.
\subsection{\tt mount}
On mount, the file catalog has to be loaded.
First, the file catalog \emph{manifest} \texttt{.cvmfspublished} is loaded.
The manifest is only accepted on successful validation of the signature.
In order to validate the signature, the certificate and the white-list are downloaded in addition if not found in cache.
If the download fails for whatever reason, \cvmfs\ tries to load a local file catalog copy.
As long as all requested files are in the disk cache as well, \cvmfs\ continues to operate even without network access (\emph{offline mode}).
If there is no local copy of the manifest or the downloaded manifest and the cache copy differ, \cvmfs\ downloads a fresh copy of the file catalog.
\subsection{{\tt getattr} and {\tt lookup}}
Requests for file attributes are entirely served from the mounted catalogs, \ie there is no network traffic involved.
This function is called as a prerequisite to other file system operations and is therefore the most frequently called \fuse\ callback.
In order to minimize relatively expensive \sqlite\ queries, \cvmfs\ uses a hash table to store negative and positive query results.
The default size of \SI{16}{\mega\byte} for this memory cache is determined according to compilation benchmarks.
Additionally, the callback takes care of the catalog TTL.
If the TTL is expired, the catalog is re-mounted on the fly.
Note that a re-mount might possibly break running programs.
We rely on careful repository publishers that produce more or less immutable directory trees, \ie new repository versions just add files.
If a directory with a nested catalog is accessed for the first time, the respective catalog is mounted in addition to the already mounted catalogs.
Loading nested catalogs is transparent to the user.
\subsection{\tt readlink}
A symbolic link is served from the file catalog.
As a special extension, \cvmfs\ detects environment variables in symlink strings written as \texttt{\$(VARIABLE)}.
These variables are expanded by \cvmfs\ dynamically on access (in the context of the \texttt{cvmfs2} process).
This way, a single symlink can point to different locations depending on the environment.
This is helpful, for instance, to dynamically select software package versions residing in different directories.
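The expansion itself is plain string processing, sketched below without any catalog access.
\begin{verbatim}
#include <cstdlib>
#include <iostream>
#include <string>

// Sketch: expand $(VARIABLE) occurrences in a symlink target using
// the environment of the current process.
std::string ExpandSymlink(const std::string &target) {
  std::string result = target;
  size_t pos;
  while ((pos = result.find("$(")) != std::string::npos) {
    size_t end = result.find(')', pos);
    if (end == std::string::npos) break;   // unterminated: leave as-is
    std::string name = result.substr(pos + 2, end - pos - 2);
    const char *value = getenv(name.c_str());
    result.replace(pos, end - pos + 1, value ? value : "");
  }
  return result;
}

int main() {
  setenv("FLAVOUR", "x86_64-slc6", 1);
  std::cout << ExpandSymlink("sw/$(FLAVOUR)/v1.0") << "\n";
  return 0;
}
\end{verbatim}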
\subsection{\tt readdir}
A directory listing is served by a query on the file catalog.
Although the ``parent''-column is indexed (\cf Table~\ref{tab:catalog}), this is a relatively slow function.
We expect directory listings to happen rather seldom.
\subsection{\tt open / read}
The \texttt{open()} call has to provide a file descriptor for a given path name.
In \cvmfs\ file requests are always served from the disk cache.
The \fuse\ file handle is a file descriptor valid in the context of the \cvmfs\ process.
It points into the disk cache directory.
Read requests are translated into the \texttt{pread()} system call.
\subsection{\tt getxattr}
\cvmfs\ uses extended attributes to display additional repository information.
The following attributes are supported:
\begin{description}
\item[expires]
Shows the remaining life time of the mounted root file catalog in minutes.
\item[fqrn]
Shows the fully qualified repository name of the mounted repository.
\item[inode\_max]
Shows the highest possible inode with the current set of loaded catalogs.
\item[hash]
Shows the SHA-1 hash of a regular file as listed in the file catalog.
\item[host]
Shows the currently active HTTP server.
\item[lhash]
Shows the SHA-1 hash of a regular file as stored in the local cache, if available.
\item[maxfd]
Shows the maximum number of file descriptors available to file system clients.
\item[nclg]
Shows the number of currently loaded nested catalogs.
\item[ndiropen]
Shows the overall number of opened directories.
\item[ndownload]
Shows the overall number of downloaded files since mounting.
\item[nioerr]
Shows the total number of I/O errors encountered since mounting.
\item[nopen]
Shows the overall number of \texttt{open()} calls since mounting.
\item[pid]
Shows the process id of the \cvmfs\ \fuse\ process.
\item[proxy]
Shows the currently active HTTP proxy.
\item[rawlink]
Shows unresolved variant symbolic links; only accessible as root.
\item[revision]
Shows the file catalog revision of the mounted root catalog, an auto-increment counter increased on every repository publish.
\item[root\_hash]
Shows the SHA-1 hash of the root file catalog.
\item[rx]
Shows the overall amount of downloaded kilobytes.
\item[speed]
Shows the average download speed.
\item[timeout]
Shows the timeout for proxied connections in seconds.
\item[timeout\_direct]
Shows the timeout for direct connections in seconds.
\item[uptime]
Shows the time passed since mounting in minutes.
\item[usedfd]
Shows the number of file descriptors currently issued to file system clients.
\item[version]
Shows the version of the loaded \cvmfs\ binary.
\end{description}
Extended attributes can be queried using the \texttt{attr} command.
For instance, \texttt{attr -g hash /cvmfs/atlas.cern.ch/ChangeLog} returns the SHA-1 key of the file at hand.
The extended attributes are used by the \texttt{cvmfs\_config stat} command in order to show a current overview of health and performance numbers.
\section{Repository Publishing}
Repositories are not immutable; every now and then they get updated.
This might be installation of a new release or a patch for an existing release.
But, of course, each time only a small portion of the repository is touched, say \SI{2}{\giga\byte} out of \SI{100}{\giga\byte}.
In order not to re-process an entire repository on every update, we create a read-write file system interface to a \cvmfs\ repository where all changes are written into a distinct scratch area.
\subsection{Read-write Interface using a Union File System}
Union file systems combine several directories into one virtual file system that provides the view of merging these directories.
These underlying directories are often called \emph{branches}.
Branches are ordered; in the case of operations on paths that exist in multiple branches, the branch selection is well-defined.
By stacking a read-write branch on top of a read-only branch, union file systems can provide the illusion of a read-write file system for a read-only file system.
All changes are in fact written to the read-write branch.
Preserving POSIX semantics in union file systems is non-trivial; the first fully functional implementation has been presented by Wright et al.~\cite{unionfs04}.
By now, union file systems are well established for ``Live CD'' builders, which use a RAM disk overlay on top of the read-only system partition in order to provide the illusion of a fully read-writable system.
\cvmfs\ uses the AUFS union file system.
Another union file system with similar semantics can be plugged in if necessary.
Union file systems can be used to track changes on CernVM-FS repositories (Figure~\ref{fig:overlay}).
In this case, the read-only file system interface of CernVM-FS is used in conjunction with a writable scratch area for changes.
\begin{figure}
\begin{center}
\resizebox{0.5\textwidth}{!}{\input{figures/overlay}}
\end{center}
\caption{A union file system combines a CernVM-FS read-only mount point and a writable scratch area.
It provides the illusion of a writable CernVM-FS mount point, tracking changes on the scratch area.}
\label{fig:overlay}
\end{figure}
Based on the read-write interface to CernVM-FS, we create a feed-back loop that represents the addition of new software releases to a CernVM-FS repository.
A repository in base revision $r$ is mounted in read-write mode on the publisher's end.
Changes are written to the scratch area and, once published, are re-mounted as repository revision $r+1$.
In this way, CernVM-FS provides snapshots.
In case of errors, one can safely resume from a previously committed revision.