Wednesday, February 22, 2012

Creating and manipulating zpools (zfs)




Zpools are the underlying device layer for ZFS filesystems. Mirrors, RAID sets and concatenated/striped storage are all defined at this level.



For pooling devices, a zpool can be set up as one of the following (the corresponding creation commands are sketched right after this list):

- a mirror

- a RAIDz with single or double parity

- concatenated/striped storage



This worksheet was done on Solaris 10 running in a Parallels virtual machine. The disks are not real; they are virtualized by Parallels, with 8 GB each. Not much, but enough to play with.




First we will try to look up the disks accessible by our system:




# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0d0
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c1d0
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
Specify disk (enter its number): ^C




Type CTRL-C to quit "format".



If your disks do not show up, use devfsadm:




# devfsadm
# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0d0
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c0d1
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@1,0
       2. c1d0
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
       3. c1d1
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@1,0
Specify disk (enter its number): ^C




You'll notice that the virtual disks are mapped as IDE/ATA drives, so the disk device names don't have a target specification "t".




Let's create our first pool by simply putting together all three free disks (c0d0 holds our root partition and is the boot disk, so it cannot be used for this example):


# zpool create zfstest c0d1 c1d0 c1d1


That's it. You have just created a zpool named "zfstest" containing all three disks. Your available space is simply the sum of their sizes:


# zpool list
NAME      SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
zfstest  23.8G    91K   23.8G    0%  ONLINE  -


Use "zpool status" to get detailed status information of the components of your zpool:


# zpool status
  pool: zfstest
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c0d1      ONLINE       0     0     0
          c1d0      ONLINE       0     0     0
          c1d1      ONLINE       0     0     0

errors: No known data errors
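For a quick health check across all pools there is also a short form. A small sketch; with everything fine, the -x option should answer with a one-line summary:

# zpool status -x
all pools are healthy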


To destroy a pool, use "zpool destroy":


# zpool destroy zfstest


and your pool is gone.
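If you destroy a pool by mistake, it may still be recoverable as long as its disks have not been reused. A hedged sketch; "zpool import -D" lists destroyed pools and can re-import them:

# zpool import -D
# zpool import -D zfstest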


Let's try a mirror now:


# zpool create zfstest mirror c1d0 c1d1


You just created a mirror of disks c1d0 and c1d1. The available storage is the same as if you used only one of these disks; if the disk sizes differ, the smaller disk determines your storage size. Data is replicated between the disks.


"zpool status" now reads:


# zpool status
  pool: zfstest
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: No known data errors


So now we have a simple mirror. But how do we put data on it?


When you create a zpool, a ZFS filesystem is automatically created in it. The mountpoint defaults to the pool name, so your pool "zfstest" is mounted as a ZFS filesystem at /zfstest:


# df -k
Filesystem            kbytes    used    avail capacity  Mounted on
/dev/dsk/c0d0s0     14951508 5725085  9076908    39%    /
/devices                   0       0        0     0%    /devices
ctfs                       0       0        0     0%    /system/contract
proc                       0       0        0     0%    /proc
mnttab                     0       0        0     0%    /etc/mnttab
swap                 2104456     836  2103620     1%    /etc/svc/volatile
objfs                      0       0        0     0%    /system/object
/usr/lib/libc/libc_hwcap1.so.1
                    14951508 5725085  9076908    39%    /lib/libc.so.1
fd                         0       0        0     0%    /dev/fd
swap                 2103624       4  2103620     1%    /tmp
swap                 2103644      24  2103620     1%    /var/run
zfstest              8193024      24  8192938     1%    /zfstest
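If you do not like the default mountpoint, it can be overridden. A hedged sketch, with /export/ztest as a made-up alternative path; the first form sets the mountpoint at creation time, the second changes it afterwards:

# zpool create -m /export/ztest zfstest mirror c1d0 c1d1
# zfs set mountpoint=/export/ztest zfstest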


We will create a big file on it:


# dd if=/dev/zero bs=128k count=40000 of=/zfstest/bigfile
40000+0 records in
40000+0 records out


It is really there now:


# ls -la /zfstest
total 10241344
drwxr-xr-x   2 root     sys            3 Apr 21 11:15 .
drwxr-xr-x  39 root     root        1024 Apr 21 11:13 ..
-rw-r--r--   1 root     root  5242880000 Apr 21 11:30 bigfile


Now the differences from classical volume managers begin. The underlying zpool "zfstest" actually KNOWS that approximately 5 GB have been allocated by ZFS:


# zpool list
NAME      SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
zfstest  7.94G  4.88G   3.05G   61%  ONLINE  -


This has enormous advantages: when replacing a mirrored disk, ZFS copies only the allocated blocks to the new disk, not every block of the pool. The same is true for RAID devices; only allocated data blocks are reconstructed.
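Related to this, ZFS can verify all allocated blocks against their checksums on demand. A brief sketch; "zpool scrub" walks only the data that is actually in use, and "zpool status" reports its progress in the scrub: line:

# zpool scrub zfstest
# zpool status zfstest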



Now let's break up the mirror. You do that by detaching one drive from it:



# zpool detach zfstest c1d0


This command removed disk c1d0 from the pool. Your mirror is not a mirror any more:


# zpool status
  pool: zfstest
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c1d1      ONLINE       0     0     0

errors: No known data errors


However, you did not lose a single bit of data! Your zpool is still available just as it was before (as long as disk c1d1 does not fail).
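If you only want to take a mirror member out of service temporarily, for maintenance for example, offlining is usually a better choice than detaching, because the disk keeps its place in the mirror. A hedged sketch:

# zpool offline zfstest c1d0
# zpool online zfstest c1d0

After the online command, ZFS should only have to resilver the blocks that changed while the disk was offline.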


You may attach another disk to your pool to create a new mirror:


# zpool attach zfstest c1d1 c0d1


Now your mirror consists of disks c1d1 and c0d1. Solaris will immediately begin to duplicate any block that's used by zfs from drive c1d1 to drive c0d1:


# zpool status
  pool: zfstest
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 55.53% done, 0h7m to go
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0

errors: No known data errors


The process of replicating data onto new or outdated disks is called "resilvering".


A mirror is not limited to two disks. If you are seriously worried about losing your valuable data, just attach another disk to your mirror:


# zpool attach zfstest c0d1 c1d0


Your mirror now has three members. Note that your storage size does not grow by attaching new mirror components. But now two drives may fail completely and you will still have all of your data:


# zpool status
  pool: zfstest
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0

errors: No known data errors


Let's detach two disks now:


# zpool detach zfstest c1d1


# zpool detach zfstest c1d0


Your mirror is gone once again. To set up concatenated or striped storage (writes on ZFS are spread across ALL pool members, so it behaves more like a striped disk set), you may add these disks back to your pool (never confuse "add" with "attach": the former ADDs storage, the latter attaches disks to mirrors):


# zpool add zfstest c1d0 c1d1


Your pool has become the same as the one we created at the beginning of our exercise:


# zpool list
NAME      SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
zfstest  23.8G  4.88G   18.9G   20%  ONLINE  -


# zpool status
  pool: zfstest
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c0d1      ONLINE       0     0     0
          c1d0      ONLINE       0     0     0
          c1d1      ONLINE       0     0     0

errors: No known data errors


The only difference is our file "bigfile", which is still there because we never destroyed the pool. You can see it in the output of "zpool list" above: 4.88G are still used.


Now we are stuck: it is not possible to remove disks that have been added to a zpool. Because writes are spread across all members, newly written data ends up on every disk, so there is no way to throw one away. Mirror component disks, on the other hand, can be detached at any time, as long as it is not the last disk of the mirror.
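Since an add cannot be undone, it may be worth previewing it first. A small sketch; the -n flag to "zpool add" should only print the configuration that would result, without changing the pool. The add we did above could have been checked like this beforehand:

# zpool add -n zfstest c1d0 c1d1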


Let's destroy the pool and set up RAID storage instead. ZFS offers two RAID types: raidz1 (single parity) and raidz2 (double parity).


# zpool destroy zfstest


# zpool create zfstest raidz1 c0d1 c1d0 c1d1


We have now created a RAID group:


# zpool status
  pool: zfstest
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: No known data errors


Be aware that "zpool list" shows the raw capacity of your RAID set, not the usable capacity:


# zpool list
NAME      SIZE   USED   AVAIL   CAP  HEALTH  ALTROOT
zfstest  23.9G   157K   23.9G    0%  ONLINE  -


To see how much space we can actually allocate, use a zfs command (zfs commands are explained in the zfs tutorial text):


# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
zfstest   101K  15.7G  32.6K  /zfstest


One disk may fail in this scenario without any data loss.
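If a member disk does fail, it is replaced rather than detached. A hedged sketch, assuming the hypothetical case that c1d0 has died and a spare disk c1d2 is available; ZFS then resilvers only the allocated blocks onto the new disk:

# zpool replace zfstest c1d0 c1d2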


To set up the same pool with double parity:


# zpool destroy zfstest


# zpool create zfstest raidz2 c0d1 c1d0 c1d1


# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
zfstest  86.7K  7.80G  24.4K  /zfstest


Only 7.8 GB left - compared to 15.7 GB with a single parity RAID device. Two drives may fail now.
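The arithmetic works out roughly as expected: each of the three virtual disks holds about 8 GB. raidz1 reserves about one disk's worth of space for parity, leaving roughly 2 x 8 GB = 16 GB usable (the 15.7 GB shown above), while raidz2 reserves about two disks' worth, leaving roughly 1 x 8 GB = 8 GB (the 7.8 GB shown).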


We have achieved the same data security as with a three-way mirror, and with only three disks in the pool the usable storage happens to be the same as well.


As a practical example, here is the output of "zpool status" and "zpool list" from a mail server. The zpool "mail" consists of two mirror pairs combined in one pool.


The creation command was:



# zpool create mail \
    mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
    mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0




As you can see, it is perfectly legal and possible to combine the storage of two mirror pairs in one pool.


# zpool status
  pool: mail
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        mail                                       ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600D0230006C1C4C0C50BE5BC9D49100d0  ONLINE       0     0     0
            c6t600D0230006B66680C50AB7821F0E900d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600D0230006B66680C50AB0187D75000d0  ONLINE       0     0     0
            c6t600D0230006C1C4C0C50BE27386C4900d0  ONLINE       0     0     0


# zpool list
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
mail   6.81T  3.08T  3.73T  45%  ONLINE  -




As you can see, you can also use Sun MPxIO devices, which have LONG device names. You may also use FDISK partitions on x86 machines (...p0, ...p1, ...) and Solaris slices (...s0, ...s1, ...) to set up zpools. Neither is recommended for production use, but both are fine for playing with zpool commands.


The MPxIO names are usable because they show up just like normal block disk devices in /dev/dsk:


# ls -la /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0
lrwxrwxrwx 1 root root 65 Dec 11 06:22 /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0 -> ../../devices/scsi_vhci/disk@g600d0230006c1c4c0c50be5bc9d49100:wd
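If you just want something disposable to practice on, a pool on a single Solaris slice might look like the following hedged sketch ("playpool" and the slice are made-up examples, and zpool may ask for the -f flag if the slice already contains data):

# zpool create playpool c1d1s0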