Wednesday, January 29, 2014

ZFS Source Tour

This page is designed to take users through a brief overview of the source code associated with the ZFS filesystem. It is not intended as an introduction to ZFS – it is assumed that you already have some familiarity with common terms and definitions, as well as a general sense of filesystem architecture.
Traditionally, we describe ZFS as being made up of three components: the ZPL (ZFS POSIX Layer), the DMU (Data Management Unit), and the SPA (Storage Pool Allocator). While this serves as a useful example for slideware, the real story is a little more complex. The following image gives a more detailed overview; each area is described below, along with links to the source code.
In this picture, you can still see the three basic layers, though there are quite a few more elements in each. In addition, we show zvol consumers, as well as the management path, namely zfs(1M) and zpool(1M). You'll find a brief description of all of these subsystems below. This is not intended to be an exhaustive overview of exactly how everything works; in the end, the source code is the final documentation. We hope that it is easy to follow and well commented. If not, feel free to post to the ZFS discussion forum.

Filesystem Consumers

These are your basic applications that interact with ZFS solely through the POSIX filesystem APIs. Virtually every application falls into this category. The system calls are passed through the generic OpenSolaris VFS layer to the ZPL.
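
For example, the program below (a minimal sketch, using a hypothetical file on a ZFS filesystem mounted at /tank/home) is a perfectly ordinary POSIX consumer; the write(2) call is routed through the generic VFS layer into the ZPL without the application ever knowing ZFS is involved:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            /* Hypothetical path on a ZFS filesystem mounted at /tank/home. */
            int fd = open("/tank/home/example.txt", O_CREAT | O_WRONLY, 0644);
            const char *msg = "hello, zfs\n";

            if (fd == -1)
                    return (1);
            /* Serviced by the ZPL (zfs_vnops.c) once the VFS layer dispatches it. */
            (void) write(fd, msg, strlen(msg));
            (void) close(fd);
            return (0);
    }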

Device Consumers

ZFS provides a means to create 'emulated volumes'. These volumes are backed by storage from a storage pool, but appear as a normal device under /dev. This is not a typical use case, but there is a small set of cases where this capability is useful. A small number of applications interact directly with these devices, but the most common consumer is a kernel filesystem or target driver layered on top of the device.
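
As a sketch (assuming a hypothetical zvol named tank/vol, which shows up under the standard /dev/zvol path), a device consumer treats the emulated volume like any other raw device:

    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            char buf[512];
            /* Hypothetical emulated volume 'tank/vol'. */
            int fd = open("/dev/zvol/rdsk/tank/vol", O_RDONLY);

            if (fd == -1)
                    return (1);
            /* Reads are serviced by zvol.c, backed by pool storage. */
            (void) read(fd, buf, sizeof (buf));
            (void) close(fd);
            return (0);
    }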

Management GUI

Solaris will ship with a web-based ZFS GUI in build 28. While not part of OpenSolaris (yet), it is an example of a Java-based GUI layered on top of the JNI library described below.

Management Consumers

These are the applications that manipulate ZFS filesystems or storage pools, including examining properties and the dataset hierarchy. While there are some scattered exceptions (zoneadm, zoneadmd, fstyp), the two main applications are zpool(1M) and zfs(1M).

zpool(1M)

This command is responsible for creating and managing ZFS storage pools. Its primary purpose is to parse command line input and translate it into libzfs calls, handling any errors along the way. The source for this command can be found in usr/src/cmd/zpool. It covers the following files:
zpool_main.c: The bulk of the command, responsible for processing all arguments and subcommands
zpool_vdev.c: Code responsible for converting a series of vdevs into an nvlist representation for libzfs
zpool_iter.c: Code to easily iterate over some or all of the pools in the system
zpool_util.c: Miscellaneous utility functions
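
To give a feel for what zpool_vdev.c produces, the fragment below is a rough sketch of building the nvlist description of a single disk vdev using libnvpair; the authoritative key names and tree layout are defined in the ZFS headers, so treat the strings here as illustrative only:

    #include <libnvpair.h>

    /*
     * Sketch only: build an nvlist describing one disk vdev.  The real
     * zpool_vdev.c assembles a full tree (root, mirrors, RAID-Z groups)
     * before handing it to libzfs.
     */
    nvlist_t *
    make_disk_vdev(const char *path)
    {
            nvlist_t *nv = NULL;

            if (nvlist_alloc(&nv, NV_UNIQUE_NAME, 0) != 0)
                    return (NULL);
            (void) nvlist_add_string(nv, "type", "disk");
            (void) nvlist_add_string(nv, "path", path);
            return (nv);
    }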

zfs(1M)

This command is responsible for creating and managing ZFS filesystems. Similar to zpool(1M), its purpose is really just to parse command line arguments and pass the handling off to libzfs. The source for this command can be found in usr/src/cmd/zfs. It covers the following files:
zfs_main.c: The bulk of the command, responsible for processing all arguments and subcommands
zfs_iter.c: Code to iterate over some or all of the datasets in the system

JNI

This library provides a Java interface to libzfs. It is currently a private interface, and is tailored specifically for the GUI. As such, it is geared primarily toward observability, as the GUI performs most actions through the CLI. The source for this library can be found in usr/src/lib/libzfs_jni.

libzfs

This is the primary interface for management apps to interact with the ZFS kernel module. The library presents a unified, object-based mechanism for accessing and manipulating both storage pools and filesystems. The underlying mechanism used to communicate with the kernel is ioctl(2) calls through /dev/zfs. The source for this library can be found in usr/src/lib/libzfs. It covers the following files:
libzfs_dataset.c: Primary interfaces for manipulating datasets
libzfs_pool.c: Primary interfaces for manipulating pools
libzfs_changelist.c: Utility routines for propagating property changes across children
libzfs_config.c: Read and manipulate pool configuration information
libzfs_graph.c: Construct dependent lists for datasets
libzfs_import.c: Discover and import pools
libzfs_mount.c: Mount, unmount, and share datasets
libzfs_status.c: Link to FMA knowledge articles based on pool state
libzfs_util.c: Miscellaneous routines
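
Because libzfs is a private interface, its entry points have shifted over time, but the sketch below gives a flavor of how a management consumer uses it (assuming the libzfs_init()/zfs_iter_root() style of interface): it opens the library, walks the root dataset of each pool, and prints its name.

    #include <stdio.h>
    #include <libzfs.h>

    /* ARGSUSED */
    /* Callback invoked for the root dataset of each pool. */
    static int
    print_name(zfs_handle_t *zhp, void *arg)
    {
            (void) printf("%s\n", zfs_get_name(zhp));
            zfs_close(zhp);
            return (0);
    }

    int
    main(void)
    {
            libzfs_handle_t *hdl = libzfs_init();

            if (hdl == NULL)
                    return (1);
            /* Under the covers these calls become ioctl(2)s on /dev/zfs. */
            (void) zfs_iter_root(hdl, print_name, NULL);
            libzfs_fini(hdl);
            return (0);
    }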

ZPL (ZFS POSIX Layer)

The ZPL is the primary interface for interacting with ZFS as a filesystem. It is a (relatively) thin layer that sits atop the DMU and presents a filesystem abstraction of files and directories. It is responsible for bridging the gap between the OpenSolaris VFS interfaces and the underlying DMU interfaces. It is also responsible for enforcing ACL (Access Control List) rules as well as synchronous (O_DSYNC) semantics.
zfs_vfsops.c: OpenSolaris VFS interfaces
zfs_vnops.c: OpenSolaris vnode interfaces
zfs_znode.c: Each vnode corresponds to an underlying znode
zfs_dir.c: Directory operations
zfs_acl.c: ACL implementation
zfs_ctldir.c: Pseudo-filesystem implementation of .zfs/snapshot
zfs_log.c: ZPL interface to record intent log entries
zfs_replay.c: ZPL interface to replay intent log entries
zfs_byteswap.c: Byteswap routines for ZPL data structures

ZVOL (ZFS Emulated Volume)

ZFS includes the ability to present raw devices backed by space from a storage pool. These are known as 'zvols' within the source code and are implemented by a single file in the ZFS source.
zvol.c: Volume emulation interface

/dev/zfs

This device is the primary point of control for libzfs. While consumers could use the ioctl(2) interface directly, it is closely entwined with libzfs and is not a public interface (not that libzfs is, either). It consists of a single file, which does some validation on the ioctl() parameters and then vectors the request to the appropriate place within ZFS.
zfs_ioctl.c: /dev/zfs ioctl() interface
zfs.h: Interfaces shared between kernel and userland

DMU (Data Management Unit)

The DMU is responsible for presenting a transactional object model, built atop the flat address space presented by the SPA. Consumers interact with the DMU via objsets, objects, and transactions. An objset is a collection of objects, where each object is an arbitrary piece of storage from the SPA. Each transaction is a series of operations that must be committed to disk as a group; this transactional model is central to the on-disk consistency of ZFS.
dmu.c: Primary external DMU interfaces
dmu_objset.c: Objset open/close/manipulate external interfaces
dmu_object.c: Object allocate/free interfaces
txg.c: Transaction model control threads
dmu_tx.c: Transaction creation/manipulation interfaces
dnode.c: Open context object-level manipulation functions
dnode_sync.c: Syncing context object-level manipulation functions
dbuf.c: Buffer management functions
dmu_zfetch.c: Data stream prefetch logic
refcount.c: Generic reference counting interfaces
dmu_send.c: Send and receive functions for snapshots
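
To make the transaction model concrete, here is a rough in-kernel sketch (not a standalone program, with signatures approximated from dmu.h) of how a consumer such as the ZPL writes to an object: declare what the transaction will touch, assign it to a transaction group, perform the write, and commit.

    #include <sys/dmu.h>

    /*
     * Sketch of a transactional write through the DMU; error handling is
     * abbreviated and the interfaces are approximations.
     */
    static int
    example_dmu_write(objset_t *os, uint64_t object, uint64_t off,
        uint64_t len, const void *buf)
    {
            dmu_tx_t *tx = dmu_tx_create(os);
            int err;

            dmu_tx_hold_write(tx, object, off, len);
            err = dmu_tx_assign(tx, TXG_WAIT);
            if (err != 0) {
                    dmu_tx_abort(tx);
                    return (err);
            }
            dmu_write(os, object, off, len, buf, tx);
            dmu_tx_commit(tx);
            return (0);
    }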

DSL (Dataset and Snapshot Layer)

The DSL aggregates DMU objsets into a hierarchical namespace, with inherited properties, as well as quota and reservation enforcement. It is also responsible for managing snapshots and clones of objsets.
For more information on how snapshots are implemented, see Matt's blog entry.
dsl_dir.c: Namespace management functions
dsl_dataset.c: Snapshot/rollback/clone support interfaces
dsl_pool.c: Pool-level support interfaces
dsl_prop.c: Property manipulation functions
unique.c: Support functions for unique objset IDs

ZAP (ZFS Attribute Processor)

The ZAP is built atop the DMU, and uses scalable hash algorithms to create arbitrary (name, object) associations within an objset. It is most commonly used to implement directories within the ZPL, but is also used extensively throughout the DSL and as a method of storing pool-wide properties. There are two very different ZAP algorithms, designed for different types of directories. The "micro zap" is used when the number of entries is relatively small and each entry is reasonably short. The "fat zap" is used for larger directories, or those with extremely long names.
zap.c: Fat ZAP interfaces
zap_leaf.c: Low-level support functions
zap_micro.c: Micro ZAP interfaces
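
As a concrete (and approximate) illustration of the (name, object) association: a ZPL directory is simply a ZAP object, so a name lookup boils down to a zap_lookup() that returns the child's object number. The in-kernel sketch below is illustrative only; the real zfs_dir.c also encodes type information in the value.

    #include <sys/zap.h>

    /* Sketch: resolve a directory entry name to an object number. */
    static int
    example_dirent_lookup(objset_t *os, uint64_t dir_zapobj,
        const char *name, uint64_t *objnump)
    {
            /* One 8-byte integer per entry: the target object number. */
            return (zap_lookup(os, dir_zapobj, name, 8, 1, objnump));
    }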

ZIL (ZFS Intent Log)

While ZFS provides always-consistent data on disk, it follows traditional filesystem semantics in which the majority of data is not written to disk immediately; otherwise performance would be pathologically slow. But there are applications that require more stringent semantics, where the data is guaranteed to be on disk by the time the read(2) or write(2) call returns. For applications requiring this behavior (specified with O_DSYNC), the ZIL provides the necessary semantics using an efficient per-dataset transaction log that can be replayed in the event of a crash.
For a more detailed look at the ZIL implementation, see Neil's blog entry.
zil.c: Intent log
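
From an application's perspective, the ZIL is engaged simply by asking for synchronous semantics. A minimal sketch (again using a hypothetical path on a ZFS filesystem):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            /*
             * O_DSYNC asks that the data be stable on disk before write(2)
             * returns; on ZFS this is satisfied by committing a ZIL record.
             */
            int fd = open("/tank/home/journal.log",
                O_WRONLY | O_CREAT | O_DSYNC, 0644);
            const char *rec = "transaction record\n";

            if (fd == -1)
                    return (1);
            (void) write(fd, rec, strlen(rec));
            (void) close(fd);
            return (0);
    }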

Traversal

Traversal provides a safe, efficient, restartable method of walking all data within a live pool. It forms the basis of resilvering and scrubbing. It walks all metadata looking for blocks modified within a certain period of time. Thanks to the copy-on-write nature of ZFS, this has the benefit of quickly excluding large subtrees that have not been touched during an outage period. It is fundamentally a SPA facility, but has intimate knowledge of some DMU structures in order to handle snapshots, clones, and certain other characteristics of the on-disk format.
dmu_traverse.c: Traversal code

ARC (Adaptive Replacement Cache)

ZFS uses a modified version of an Adaptive Replacement Cache to provide its primary caching needs. This cache is layered between the DMU and the SPA and so acts at the virtual block level. This allows filesystems to share their cached data with their snapshots and clones.
arc.c: Adaptive Replacement Cache implementation

Pool Configuration (SPA)

While the entire pool layer is often referred to as the SPA (Storage Pool Allocator), the configuration portion is really the public interface. It is responsible for gluing together the ZIO and vdev layers into a consistent pool object. It includes routines to create and destroy pools from their configuration information, as well as to sync the data out to the vdevs at regular intervals.
spa.c: Routines for opening, importing, exporting, and destroying pools
spa_misc.c: Miscellaneous SPA routines, including locking
spa_config.c: Parse and update pool configuration data
spa_errlog.c: Log of persistent pool-wide data errors
spa_history.c: Persistent ring buffer of successful commands that modified the pool's state
zfs_fm.c: Routines to post ereports for FMA consumption

ZIO (ZFS I/O Pipeline)

The ZIO pipeline is where all data must pass when going to or from the disk. It is responsible for translating DVAs (Device Virtual Addresses) into logical locations on a vdev, as well as checksumming and compressing data as necessary. It is implemented as a multi-stage pipeline, with a bit mask controlling which stages are executed for each I/O. The pipeline itself is quite complex, but can be summed up by the following table:
I/O types   Stage                      Category
RWFCI       open                       ZIO state
RWFCI       wait for children ready    ZIO state
-W---       write compress             Compression
-W---       checksum generate          Checksum
-WFC-       gang pipeline              Gang blocks
-WFC-       get gang header            Gang blocks
-W---       rewrite gang header        Gang blocks
--F--       free gang members          Gang blocks
---C-       claim gang members         Gang blocks
-W---       DVA allocate               DVA management
--F--       DVA free                   DVA management
---C-       DVA claim                  DVA management
-W---       gang checksum generate     Checksum
RWFCI       ready                      ZIO state
RW--I       I/O start                  Vdev I/O
RW--I       I/O done                   Vdev I/O
RW--I       I/O assess                 Vdev I/O
RWFCI       wait for children done     ZIO state
R----       checksum verify            Checksum
R----       read gang members          Gang blocks
R----       read decompress            Compression
RWFCI       done                       ZIO state

I/O types
Each phase of the I/O pipeline applies to a certain type of I/O. The letters stand for (R)ead, (W)rite, (F)ree, (C)laim, and (I)octl.
ZIO state
These are internal states used to synchronize I/Os. For example, an I/O with child I/Os must wait for all of its children to be ready before allocating a BP, and must wait for all of its children to be done before returning.
Compression
This phase applies any compression algorithms, if applicable.
Checksum
This phase applies any checksum algorithms, if applicable.
Gang blocks
When there is not enough contiguous space to write a complete block, the ZIO pipeline will break the I/O up into smaller 'gang blocks' which can later be assembled transparently to appear as complete blocks.
DVA management
Each I/O must be allocated a DVA (Device Virtual Address) to correspond to a particular portion of a vdev in the pool. These phases perform the necessary allocation of these addresses, using metaslabs and the space map.
Vdev I/O
These phases are the ones that actually issue I/O to the vdevs contained within the pool.
zio.c: The main ZIO stages, including gang block translation
zio_checksum.c: Generic checksum interface
fletcher.c: Fletcher2 and Fletcher4 checksum algorithms
sha256.c: SHA-256 checksum algorithm
zio_compress.c: Generic compression interface
lzjb.c: LZJB compression algorithm
uberblock.c: Basic uberblock (root block) routines
bplist.c: Keeps track of lists of block pointers
metaslab.c: The bulk of DVA translation
space_map.c: Keeps track of free space, used during DVA translation
zio_inject.c: Framework for injecting persistent errors for data and devices
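
To give a sense of what fletcher.c implements, here is a minimal user-space sketch of a Fletcher4-style checksum: four running 64-bit sums over the data viewed as 32-bit words. The in-tree version fills in a 256-bit zio checksum structure and deals with byte order, so this is an illustration rather than the real interface.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative Fletcher4-style checksum over a buffer of 32-bit words. */
    static void
    fletcher_4_sketch(const void *buf, size_t size, uint64_t cksum[4])
    {
            const uint32_t *ip = buf;
            const uint32_t *ipend = ip + (size / sizeof (uint32_t));
            uint64_t a = 0, b = 0, c = 0, d = 0;

            for (; ip < ipend; ip++) {
                    a += *ip;
                    b += a;
                    c += b;
                    d += c;
            }
            cksum[0] = a;
            cksum[1] = b;
            cksum[2] = c;
            cksum[3] = d;
    }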

VDEV (Virtual Devices)

The virtual device subsystem provides a unified method of arranging and accessing devices. Virtual devices form a tree, with a single root vdev and multiple interior (mirror and RAID-Z) and leaf (disk and file) vdevs. Each vdev is responsible for representing the available space, as well as laying out blocks on the physical disk.
vdev.c: Generic vdev routines
vdev_disk.c: Disk virtual device
vdev_file.c: File virtual device
vdev_mirror.c: N-way mirroring
vdev_raidz.c: RAID-Z grouping (see Jeff's blog)
vdev_root.c: Top-level pseudo vdev
vdev_missing.c: Special device for import
vdev_label.c: Routines for reading and writing identifying labels
vdev_cache.c: Simple device-level caching for reads
vdev_queue.c: I/O scheduling algorithm for vdevs

LDI (Layered Driver Interface)

At the bottom of the stack, ZFS interacts with the underlying physical devices through LDI, the Layered Driver Interface, as well as the VFS interfaces (when dealing with files).