Configuring ZFS to gracefully deal with pool failures
If you are running ZFS in production, you may have experienced a situation where your server panicked and rebooted when a ZFS file system was corrupted. With George Wilson’s recent putback of CR #6322646, this is no longer the case. George’s putback allows the file system administrator to set the “failmode” property to control what happens when a pool incurs a fault. Here is a description of the new property from the zpool(1m) manual page:
failmode=wait | continue | panic

    Controls the system behavior in the event of catastrophic pool
    failure. This condition is typically a result of a loss of
    connectivity to the underlying storage device(s) or a failure of
    all devices within the pool. The behavior of such an event is
    determined as follows:

    wait        Blocks all I/O access until the device connectivity is
                recovered and the errors are cleared. This is the
                default behavior.

    continue    Returns EIO to any new write I/O requests but allows
                reads to any of the remaining healthy devices. Any
                write requests that have yet to be committed to disk
                would be blocked.

    panic       Prints out a message to the console and generates a
                system crash dump.
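Since failmode is a standard pool property, it can be changed with “zpool set” and inspected with “zpool get”. A quick sketch, where the pool name tank and the sample output are illustrative:

$ zpool set failmode=continue tank
$ zpool get failmode tank
NAME  PROPERTY  VALUE     SOURCE
tank  failmode  continue  local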
To see just how well this feature worked, I decided to test out the new failmode property. To begin my tests, I created a new ZFS pool from two files:
$ cd / && mkfile 1g file1 file2
$ zpool create p1 /file1 /file2
$ zpool status
  pool: p1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        p1          ONLINE       0     0     0
          /file1    ONLINE       0     0     0
          /file2    ONLINE       0     0     0
After the pool was created, I checked the failmode property:
$ zpool get failmode p1
NAME  PROPERTY  VALUE     SOURCE
p1    failmode  wait      default
I then began writing garbage to one of the files to see what would happen:
$ dd if=/dev/zero of=/file1 bs=512 count=1024
$ zpool scrub p1
I was overjoyed to find that the box was still running, even though the pool showed up as faulted:
$ zpool status
  pool: p1
 state: FAULTED
status: One or more devices could not be used because the label is
        missing or invalid. Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 0h0m with 0 errors on Tue Feb 19 13:57:41 2008
config:

        NAME        STATE     READ WRITE CKSUM
        p1          FAULTED      0     0     0  insufficient replicas
          /file1    UNAVAIL      0     0     0  corrupted data
          /file2    ONLINE       0     0     0

errors: No known data errors
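Based on the action line, recovery should have been a matter of swapping in a healthy vdev with “zpool replace”. A rough sketch of what that would look like (the replacement file /file3 is hypothetical):

$ mkfile 1g /file3
$ zpool replace p1 /file1 /file3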
But my joy didn’t last long, since the box became unresponsive after a few minutes and panicked with the following string:
Feb 19 13:57:47 nevadadev genunix: [ID 603766 kern.notice] assertion failed: vdev_config_sync(rvd->vdev_child, rvd->vdev_children, txg) == 0 (0x5 == 0x0), file: ../../common/fs/zfs/spa.c, line: 4130
Feb 19 13:57:47 nevadadev unix: [ID 100000 kern.notice]
Feb 19 13:57:47 nevadadev genunix: [ID 655072 kern.notice] ffffff0001feab30 genunix:assfail3+b9 ()
Feb 19 13:57:47 nevadadev genunix: [ID 655072 kern.notice] ffffff0001feabd0 zfs:spa_sync+5d2 ()
Feb 19 13:57:47 nevadadev genunix: [ID 655072 kern.notice] ffffff0001feac60 zfs:txg_sync_thread+19a ()
Feb 19 13:57:47 nevadadev genunix: [ID 655072 kern.notice] ffffff0001feac70 unix:thread_start+8 ()
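For anyone who wants to dig into a panic like this one, the crash dump can be examined with mdb once savecore writes it out (the dump directory and instance number below are examples, not taken from my system):

$ cd /var/crash/nevadadev
$ mdb unix.0 vmcore.0
> ::status
> ::stack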
Since the manual page states that the failmode property “controls the system behavior in the event of catastrophic pool failure,” it appears the box should have stayed up and operational when the pool became unusable. I filed a bug on the opensolaris website, so hopefully the ZFS team will get this issue addressed in the future.
On a related note, ZFS also provides a per-dataset “sync” property that controls how synchronous requests are handled. Here is a description of the supported values:

sync=standard
    This is the default option. Synchronous file system transactions
    (fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log),
    and then all devices written are flushed to ensure the data is
    stable (not cached by device controllers).

sync=always
    For the ultra-cautious, every file system transaction is written
    and flushed to stable storage by a system call return. This
    obviously has a big performance penalty.

sync=disabled
    Synchronous requests are disabled. File system transactions only
    commit to stable storage on the next DMU transaction group commit,
    which can be many seconds. This option gives the highest
    performance. However, it is very dangerous as ZFS is ignoring the
    synchronous transaction demands of applications such as databases
    or NFS. Setting sync=disabled on the currently active root or /var
    file system may result in out-of-spec behavior, application data
    loss and increased vulnerability to replay attacks. This option
    does *NOT* affect ZFS on-disk consistency. Administrators should
    only use this when these risks are understood.
The property can be set when the dataset is created, or dynamically, and will take effect immediately. To change the property, an administrator can use the standard 'zfs' command. For example:
# zfs create -o sync=disabled whirlpool/milek
# zfs set sync=always whirlpool/perrin
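The current value can then be verified with “zfs get” (sample output):

# zfs get sync whirlpool/milek
NAME             PROPERTY  VALUE     SOURCE
whirlpool/milek  sync      disabled  local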