This page is designed to take users through a brief overview of the
source code associated with the ZFS filesystem. It is not intended as an
introduction to ZFS – it is assumed that you already have some
familiarity with common terms and definitions, as well as a general
sense of filesystem architecture.
Traditionally, we describe ZFS
as being made up of three components: the ZPL (ZFS POSIX Layer), the DMU
(Data Management Unit), and the SPA (Storage Pool Allocator). While this
serves as a useful example for slideware, the real story is a little
more complex. The following image gives a more detailed overview;
clicking on one of the areas will take you to a more detailed
description and links to source code.
In this picture, you can
still see the three basic layers, though there are quite a few more
elements in each. In addition, we show zvol consumers, as well as the
management path, namely zfs(1M) and zpool(1M). You'll find a brief
description of all these subsystems below. This is not intended to be an
exhaustive overview of exactly how everything works. In the end, the
source code is the final documentation. We hope that it is easy to
follow and well commented. If not, feel free to post to the
ZFS discussion forum.
Filesystem Consumers
These
are your basic applications that interact with ZFS solely through the
POSIX filesystem APIs. Virtually every application falls into this
category. The system calls are passed through the generic OpenSolaris
VFS layer to the ZPL.
Device Consumers
ZFS
provides a means to create 'emulated volumes'. These volumes are
backed by storage from a storage pool, but appear as a normal device
under
/dev.
This is not a typical use case, but there is a small set of cases
where this capability is useful. A small number of
applications interact directly with these devices, but the most
common consumer is a kernel filesystem or target driver layered on top
of the device.
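As a rough sketch of what a userland device consumer looks like, the example below opens a zvol and reads its first sector, just as it would with any other raw device. The pool name 'tank' and volume name 'vol1' are hypothetical, and the /dev/zvol path shown follows the usual Solaris naming convention for emulated volumes.

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        char buf[512];
        /* Hypothetical zvol, e.g. created with: zfs create -V 1g tank/vol1 */
        int fd = open("/dev/zvol/rdsk/tank/vol1", O_RDONLY);

        if (fd == -1) {
            perror("open");
            return (1);
        }
        /* Read the first sector, exactly as with any raw device. */
        if (read(fd, buf, sizeof (buf)) == -1)
            perror("read");
        (void) close(fd);
        return (0);
    }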
Management GUI
Solaris
will ship with a web-based ZFS GUI in build 28. While not part of
OpenSolaris (yet), it is an example of a Java-based GUI layered on top
of the JNI library (libzfs_jni) described below.
Management Consumers
These
applications are those which manipulate ZFS filesystems or storage
pools, including examining properties and the dataset hierarchy. While there
are some scattered exceptions (zoneadm, zoneadmd, fstyp), the two main
applications are zpool(1M) and zfs(1M).
zpool(1M)
This
command is responsible for creating and managing ZFS storage pools. Its
primary purpose is to parse command line input and translate it into
libzfs calls, handling any errors along the way. The source for this
command can be found in
usr/src/cmd/zpool. It covers the following files:
zpool_main.c | The bulk of the command, responsible for processing all arguments and subcommands |
zpool_vdev.c | Code responsible for converting a series of vdevs into an nvlist representation for libzfs |
zpool_iter.c | Code to iterate over some or all the pools in the system easily |
zpool_util.c | Miscellaneous utility functions |
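To give a feel for the structure, the sketch below shows the general shape of a table-driven subcommand dispatcher, which is the kind of processing zpool_main.c performs before handing work off to libzfs. All names in the sketch are invented for illustration; they are not the actual symbols in the command.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative only; these are not the real zpool_main.c names. */
    typedef struct cmd_entry {
        const char *name;
        int (*handler)(int argc, char **argv);
    } cmd_entry_t;

    /* A real handler would parse options and call into libzfs. */
    static int
    do_create(int argc, char **argv)
    {
        (void) printf("would create pool %s\n", argc > 0 ? argv[0] : "?");
        return (0);
    }

    static cmd_entry_t cmd_table[] = {
        { "create", do_create },
    };

    static int
    dispatch(int argc, char **argv)
    {
        for (size_t i = 0; i < sizeof (cmd_table) / sizeof (cmd_table[0]); i++) {
            if (strcmp(argv[0], cmd_table[i].name) == 0)
                return (cmd_table[i].handler(argc - 1, argv + 1));
        }
        (void) fprintf(stderr, "unknown subcommand: %s\n", argv[0]);
        return (2);
    }

    int
    main(int argc, char **argv)
    {
        if (argc < 2) {
            (void) fprintf(stderr, "usage: example <subcommand> ...\n");
            return (2);
        }
        return (dispatch(argc - 1, argv + 1));
    }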
zfs(1M)
This
command is responsible for creating and managing ZFS filesystems.
Similar to zpool(1M), its purpose is really just to parse command line
arguments and pass the handling off to libzfs. The source for this
command can be found in
usr/src/cmd/zfs. It covers the following files:
zfs_main.c | The bulk of the command, responsible for processing all arguments and subcommands |
zfs_iter.c | Code to iterate over some or all of the datasets in the system |
JNI
This
library provides a Java interface to libzfs. It is currently a private
interface, and is tailored specifically for the GUI. As such, it is
geared primarily toward observability, as the GUI performs most actions
through the CLI. The source for this library can be found in
usr/src/lib/libzfs_jni.
libzfs
This
is the primary interface for management apps to interact with the ZFS
kernel module. The library presents a unified, object-based mechanism
for accessing and manipulating both storage pools and filesystems. The
underlying mechanism used to communicate with the kernel is ioctl(2)
calls through /dev/zfs. The source for this library can be found in
usr/src/lib/libzfs.
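As a rough sketch of how a management consumer uses the library, a caller obtains a library handle, opens a dataset by name, and queries it. Treat the exact calls as approximate; libzfs is a private interface and has changed over time, and the dataset name 'tank/home' is hypothetical.

    #include <stdio.h>
    #include <libzfs.h>

    /* Compile with -lzfs; the dataset name is hypothetical. */
    int
    main(void)
    {
        libzfs_handle_t *hdl = libzfs_init();
        zfs_handle_t *zhp;

        if (hdl == NULL)
            return (1);
        zhp = zfs_open(hdl, "tank/home", ZFS_TYPE_FILESYSTEM);
        if (zhp != NULL) {
            (void) printf("opened dataset %s\n", zfs_get_name(zhp));
            zfs_close(zhp);
        }
        libzfs_fini(hdl);
        return (0);
    }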
ZPL (ZFS POSIX Layer)
The
ZPL is the primary interface for interacting with ZFS as a filesystem.
It is a (relatively) thin layer that sits atop the DMU and presents a
filesystem abstraction of files and directories. It is responsible for
bridging the gap between the OpenSolaris VFS interfaces and the
underlying DMU interfaces. It is also responsible for enforcing ACL
(Access Control List) rules as well as synchronous (O_DSYNC) semantics.
ZVOL (ZFS Emulated Volume)
ZFS
includes the ability to present raw devices backed by space from a
storage pool. These are known as 'zvols' within the source code, and the
functionality is implemented by a single file in the ZFS source.
zvol.c | Volume emulation interface |
/dev/zfs
This
device is the primary point of control for libzfs. While consumers
could use the ioctl(2) interface directly, it is closely entwined
with libzfs, and not a public interface (not that libzfs is, either). It
consists of a single file, which does some validation on the ioctl()
parameters and then vectors the request to the appropriate place within
ZFS.
zfs_ioctl.c | /dev/zfs ioctl() interface |
zfs.h | Interfaces shared between kernel and userland |
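The sketch below shows the general shape of this path, assuming the zfs_cmd_t structure and ZFS_IOC_* command numbers shared between libzfs and the kernel. It is a heavy simplification of what libzfs actually does, and real consumers should go through libzfs rather than this interface; the dataset name is hypothetical.

    #include <fcntl.h>
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <stropts.h>
    #include <sys/zfs_ioctl.h>      /* zfs_cmd_t, ZFS_IOC_* */

    int
    main(void)
    {
        zfs_cmd_t zc;
        int fd = open("/dev/zfs", O_RDWR);

        if (fd == -1) {
            perror("open /dev/zfs");
            return (1);
        }
        /* Ask the kernel for statistics on a (hypothetical) objset. */
        (void) memset(&zc, 0, sizeof (zc));
        (void) strlcpy(zc.zc_name, "tank/home", sizeof (zc.zc_name));
        if (ioctl(fd, ZFS_IOC_OBJSET_STATS, &zc) == -1)
            perror("ZFS_IOC_OBJSET_STATS");
        (void) close(fd);
        return (0);
    }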
DMU (Data Management Unit)
The
DMU is responsible for presenting a transactional object model, built
atop the flat address space presented by the SPA. Consumers interact
with the DMU via objsets, objects, and transactions. An objset is a
collection of objects, where each object is an arbitrary piece of
storage from the SPA. Each transaction is a series of operations that
must be committed to disk as a group; transactions are central to
maintaining on-disk consistency in ZFS. A sketch of the basic
transaction pattern follows the file list below.
dmu.c | Primary external DMU interfaces |
dmu_objset.c | Objset open/close/manipulate external interfaces |
dmu_object.c | Object allocate/free interfaces |
txg.c | Transaction model control threads |
dmu_tx.c | Transaction creation/manipulation interfaces |
dnode.c | Open context object-level manipulation functions |
dnode_sync.c | Syncing context object-level manipulation functions |
dbuf.c | Buffer management functions |
dmu_zfetch.c | Data stream prefetch logic |
refcount.c | Generic reference counting interfaces |
dmu_send.c | Send and receive functions for snapshots |
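The following kernel-context sketch shows the typical transaction pattern a DMU consumer follows, based on the interfaces in dmu.h. Error handling and the surrounding context are omitted, and the objset, object number, and buffer are assumed to come from the caller.

    #include <sys/dmu.h>

    /*
     * Kernel-context sketch of a DMU consumer writing to an object.
     * The objset and object number come from the surrounding code.
     */
    static int
    example_dmu_write(objset_t *os, uint64_t object, uint64_t off,
        int len, const void *buf)
    {
        dmu_tx_t *tx = dmu_tx_create(os);
        int err;

        /* Declare what the transaction intends to modify. */
        dmu_tx_hold_write(tx, object, off, len);

        /* Assign the transaction to an open transaction group. */
        err = dmu_tx_assign(tx, TXG_WAIT);
        if (err != 0) {
            dmu_tx_abort(tx);
            return (err);
        }

        /* Perform the write under the transaction, then commit. */
        dmu_write(os, object, off, len, buf, tx);
        dmu_tx_commit(tx);
        return (0);
    }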
DSL (Dataset and Snapshot Layer)
The
DSL aggregates DMU objsets into a hierarchical namespace, with
inherited properties, as well as quota and reservation enforcement. It
is also responsible for managing snapshots and clones of objsets.
For more information on how snapshots are implemented, see
Matt's blog entry.
ZAP (ZFS Attribute Processor)
The
ZAP is built atop the DMU, and uses scalable hash algorithms to create
arbitrary (name, object) associations within an objset. It is most
commonly used to implement directories within the ZPL, but is also used
extensively throughout the DSL, as well as to store pool-wide
properties. There are two very different ZAP algorithms, designed for
different types of directories. The "micro zap" is used when the number
of entries is relatively small and each entry is reasonably short. The
"fat zap" is used for larger directories, or those with extremely long
names.
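As a kernel-context sketch of the interface, the example below stores and retrieves a (name, value) pair in a ZAP object using the zap_add() and zap_lookup() entry points, the same shape of operation the ZPL uses for directory entries. The ZAP object number, name, and value are hypothetical, and error handling is trimmed.

    #include <sys/zap.h>

    /* Sketch: store and look up one 8-byte integer under a name. */
    static void
    example_zap_usage(objset_t *os, uint64_t zapobj, dmu_tx_t *tx)
    {
        uint64_t value = 1234;      /* e.g. the object number of a child */
        uint64_t found;

        /* Add an entry: one 8-byte integer stored under "example". */
        (void) zap_add(os, zapobj, "example", 8, 1, &value, tx);

        /* Look it back up by name; 'found' should now equal 'value'. */
        (void) zap_lookup(os, zapobj, "example", 8, 1, &found);
    }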
ZIL (ZFS Intent Log)
While
ZFS provides always-consistent data on disk, it follows traditional
filesystem semantics where the majority of data is not written to disk
immediately; otherwise performance would be pathologically slow. But
there are applications that require more stringent semantics where the
data is guaranteed to be on disk by the time the read(2) or write(2)
call returns. For those applications requiring this behavior (specified
with O_DSYNC), the ZIL provides the necessary semantics using an
efficient per-dataset transaction log that can be replayed in the event
of a crash.
For a more detailed look at the ZIL implementation, see Neil's
blog entry.
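From an application's point of view, requesting these semantics is just a flag on open(2); a minimal example with a hypothetical path is shown below. By the time write(2) returns, ZFS has committed the data through the ZIL so that it survives a crash.

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* O_DSYNC requests synchronous data semantics; path is hypothetical. */
        int fd = open("/tank/db/journal", O_WRONLY | O_CREAT | O_DSYNC, 0600);

        if (fd == -1) {
            perror("open");
            return (1);
        }
        /* When this returns, the data is on stable storage via the ZIL. */
        if (write(fd, "record\n", 7) == -1)
            perror("write");
        (void) close(fd);
        return (0);
    }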
Traversal
Traversal
provides a safe, efficient, restartable method of walking all data
within a live pool. It forms the basis of resilvering and scrubbing. It
walks all metadata looking for blocks modified within a certain period
of time. Thanks to the copy-on-write nature of ZFS, this has the benefit
of quickly excluding large subtrees that have not been touched during
an outage period. It is fundamentally a SPA facility, but has intimate
knowledge of some DMU structures in order to handle snapshots, clones,
and certain other characteristics of the on-disk format.
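The key idea can be sketched as follows. Because every block pointer records the transaction group in which it was born, and copy-on-write guarantees a parent block is at least as new as any modified child, a walker can prune an entire subtree as soon as it sees a block born before the txg of interest. The types and helpers below are invented for this sketch and are not the real traversal interfaces.

    #include <stdint.h>
    #include <stddef.h>

    /* Invented, simplified block-pointer type for illustration only. */
    typedef struct blkptr_ex {
        uint64_t birth_txg;             /* txg in which this block was written */
        int nchildren;
        struct blkptr_ex **children;    /* indirect blocks / child metadata */
    } blkptr_ex_t;

    static void
    visit(blkptr_ex_t *bp, uint64_t min_txg, void (*cb)(blkptr_ex_t *))
    {
        /*
         * An old parent means the whole subtree is unchanged since
         * min_txg and can be skipped without reading it.
         */
        if (bp == NULL || bp->birth_txg < min_txg)
            return;

        cb(bp);
        for (int i = 0; i < bp->nchildren; i++)
            visit(bp->children[i], min_txg, cb);
    }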
ARC (Adaptive Replacement Cache)
ZFS
uses a modified version of an Adaptive Replacement Cache to provide its
primary caching needs. This cache is layered between the DMU and the
SPA and so acts at the virtual block level. This allows filesystems to
share their cached data with their snapshots and clones.
arc.c | Adaptive Replacement Cache implementation |
Pool Configuration (SPA)
While
the entire pool layer is often referred to as the SPA (Storage Pool
Allocator), the configuration portion is really the public interface. It
is responsible for gluing together the ZIO and vdev layers into a
consistent pool object. It includes routines to create and destroy pools
from their configuration information, as well as to sync data out to
the vdevs at regular intervals.
spa.c | Routines for opening, importing, exporting, and destroying pools |
spa_misc.c | Miscellaneous SPA routines, including locking |
spa_config.c | Parse and update pool configuration data |
spa_errlog.c | Log of persistent pool-wide data errors. |
spa_history.c | Persistent ring buffer of successful commands that modified the pool's state. |
zfs_fm.c | Routines to post ereports for FMA consumption. |
ZIO (ZFS I/O Pipeline)
The
ZIO pipeline is where all data must pass when going to or from the
disk. It is responsible for translating DVAs (Device Virtual Addresses)
into logical locations on a vdev, as well as checksumming and
compressing data as necessary. It is implemented as a multi-stage
pipeline, with a bit mask to control which stages get executed for each
I/O. The pipeline itself is quite complex, but can be summed up by the
following diagram; a small sketch of the stage-mask idea follows the
column descriptions below:
I/O types | ZIO state | Compression | Checksum | Gang Blocks | DVA management | Vdev I/O |
RWFCI | open | | | | | |
RWFCI | wait for children ready | | | | | |
-W--- | | write compress | | | | |
-W--- | | | checksum generate | | | |
-WFC- | | | | gang pipeline | | |
-WFC- | | | | get gang header | | |
-W--- | | | | rewrite gang header | | |
--F-- | | | | free gang members | | |
---C- | | | | claim gang members | | |
-W--- | | | | | DVA allocate | |
--F-- | | | | | DVA free | |
---C- | | | | | DVA claim | |
-W--- | | | gang checksum generate | | | |
RWFCI | ready | | | | | |
RW--I | | | | | | I/O start |
RW--I | | | | | | I/O done |
RW--I | | | | | | I/O assess |
RWFCI | wait for children done | | | | | |
R---- | | | checksum verify | | | |
R---- | | | | read gang members | | |
R---- | | read decompress | | | | |
RWFCI | done | | | | | |
- I/O types
- Each
phase of the I/O pipeline applies to a certain type of I/O. The letters
stand for (R)ead, (W)rite, (F)ree, (C)laim, and (I)octl.
- ZIO state
- These
are internal states used to synchronize I/Os. For example, an I/O with
child I/Os must wait for all children to be ready before allocating a
BP (block pointer), and must wait for all children to be done before returning.
- Compression
- This phase applies any compression algorithms, if applicable.
- Checksum
- This phase applies any checksum algorithms, if applicable.
- Gang blocks
- When
there is not enough contiguous space to write a complete block, the ZIO
pipeline will break the I/O up into smaller 'gang blocks' which can
later be assembled transparently to appear as complete blocks.
- DVA management
- Each
I/O must be allocated a DVA (Device Virtual Address) to correspond to a
particular portion of a vdev in the pool. These phases perform the
necessary allocation of these addresses, using metaslabs and the space
map.
- Vdev I/O
- These phases are the ones which actually issue I/O to the vdevs contained within the pool.
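The stage-mask idea mentioned above can be sketched like this. The stage bits, structure, and function below are invented and greatly simplified; the real stage definitions and pipeline engine live in the ZIO code.

    #include <stdint.h>

    /* Invented, simplified stage bits; the real set is defined by the ZIO code. */
    enum {
        EX_STAGE_OPEN         = 1 << 0,
        EX_STAGE_COMPRESS     = 1 << 1,
        EX_STAGE_CHECKSUM     = 1 << 2,
        EX_STAGE_DVA_ALLOCATE = 1 << 3,
        EX_STAGE_VDEV_IO      = 1 << 4,
        EX_STAGE_DONE         = 1 << 5
    };

    typedef struct ex_zio {
        uint32_t pipeline;      /* which stages apply to this I/O */
        uint32_t stage;         /* the stage currently executing */
    } ex_zio_t;

    /* Advance to the next stage whose bit is set in the I/O's pipeline mask. */
    static void
    ex_zio_next_stage(ex_zio_t *zio)
    {
        uint32_t next = (zio->stage == 0) ? 1 : (zio->stage << 1);

        while (next != 0 && (zio->pipeline & next) == 0)
            next <<= 1;
        zio->stage = next;      /* 0 means the pipeline has finished */
    }

A write I/O, for example, would have a pipeline mask with the compression, checksum, DVA allocation, and vdev I/O bits set, while a read would skip the allocation stages entirely.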
VDEV (Virtual Devices)
The
virtual device subsystem provides a unified method of arranging and
accessing devices. Virtual devices form a tree, with a single root vdev
and multiple interior (mirror and RAID-Z) and leaf (disk and file)
vdevs. Each vdev is responsible for representing the available space, as
well as laying out blocks on the physical disk.
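A simplified view of the tree shape might look like the sketch below. The type and field names are invented for illustration; the real vdev_t carries far more state.

    #include <stdint.h>

    /* Simplified, invented representation of the vdev tree shape. */
    typedef enum { EX_VDEV_ROOT, EX_VDEV_MIRROR, EX_VDEV_RAIDZ,
        EX_VDEV_DISK, EX_VDEV_FILE } ex_vdev_type_t;

    typedef struct ex_vdev {
        ex_vdev_type_t type;
        struct ex_vdev **children;      /* interior vdevs have children */
        int nchildren;                  /* leaf vdevs have none */
    } ex_vdev_t;

    /* Count the leaf (disk or file) vdevs beneath a given vdev. */
    static int
    ex_vdev_count_leaves(const ex_vdev_t *vd)
    {
        if (vd->nchildren == 0)
            return (1);

        int n = 0;
        for (int i = 0; i < vd->nchildren; i++)
            n += ex_vdev_count_leaves(vd->children[i]);
        return (n);
    }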
LDI (Layered Driver Interface)
At
the bottom of the stack, ZFS interacts with the underlying physical
devices through LDI, the Layered Driver Interface, as well as the VFS
interfaces (when dealing with files).