Monday, December 17, 2012

ZFS Cheatsheet


This is a quick and dirty cheatsheet on Sun's ZFS
Directories and Files
error messages: /var/adm/messages, console
States
DEGRADED: One or more top-level devices are in the degraded state because they have become offline. Sufficient replicas exist to keep functioning.
FAULTED: One or more top-level devices are in the faulted state because they have become offline. Insufficient replicas exist to keep functioning.
OFFLINE: The device was explicitly taken offline by the "zpool offline" command.
ONLINE: The device is online and functioning.
REMOVED: The device was physically removed while the system was running.
UNAVAIL: The device could not be opened.
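A quick way to check for any of the problem states is "zpool status -x", which reports only pools with issues (on a healthy system it simply prints "all pools are healthy"):

## report only pools that have problems
zpool status -x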
Storage Pools
displaying:
zpool list
zpool list -o name,size,altroot

Note: there are a number of properties that you can select; the default set is: name, size, used, available, capacity, health, altroot
status:
zpool status

## Show only errored pools, with more verbosity
zpool status -xv
statistics:
zpool iostat -v 5 5

Note: use this command like you would iostat
history:
zpool history -il
creating:
## perform a dry run but don't actually perform the creation
zpool create -n data01 c1t0d0s0

# you can presume that I created two files called /zfs1/disk01 and /zfs1/disk02 using mkfile
zpool create data01 /zfs1/disk01 /zfs1/disk02

# using a standard disk slice
zpool create data01 c1t0d0s0

## using a different mountpoint than the default /<pool name>
zpool create -m /zfspool data01 c1t0d0s0

# mirror and hot spare disks examples
zpool create data01 mirror c1t0d0 c2t0d0 mirror c1t0d1 c2t0d1
zpool create data01 mirror c1t0d0 c2t0d0 spare c3t0d0

## setting up a log device and mirroring it
zpool create data01 mirror c1t0d0 c2t0d0 log mirror c3t0d0 c4t0d0

## setting up a cache device
zpool create data01 mirror c1t0d0 c2t0d0 cache c3t0d0 c3t1d0
destroying:
zpool destroy data01

## in the event of a disaster you can re-import a destroyed pool
zpool import -f -D -d /zfs1 data03
adding:
zpool add data01 c2t0d0

Note: make sure that you get this right, as zpool only supports the removal of hot spares and cache disks
removing:
zpool remove data01 c2t0d0

Note: zpool only supports the removal of hot spares and cache disks
clearing faults:
zpool clear data01

## Clearing a specific disk fault
zpool clear data01 c2t0d0
attaching:
## c2t0d0 is an existing disk that is not mirrored; by attaching c3t0d0 both disks will become a mirror pair
zpool attach data01 c2t0d0 c3t0d0
detaching:
zpool detach data01 c2t0d0

Note: see the notes above on attaching
onlining:
zpool online data01 c2t0d0
offlining:
zpool offline data01 c2t0d0

## Temporary offlining (will revert back after a reboot)
zpool offline -t data01 c2t0d0
replacing:
## replacing like for like
zpool replace data03 c2t0d0

## replacing with another disk
zpool replace data03 c2t0d0 c3t0d0
scrubbing:
zpool scrub data01

## stop a scrub in progress; check the scrub line using "zpool status data01" to see any errors
zpool scrub -s data01

Note: only one of scrubbing or resilvering can be running at the same time

scrubbing - examines all data to discover hardware faults or disk failures
resilvering - examines only data that ZFS knows to be out of date
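Because a scrub reads all data in the pool, it is usually scheduled rather than run by hand. A hedged example crontab entry (assuming the Solaris default path for zpool):

## scrub data01 every Sunday at 03:00
0 3 * * 0 /usr/sbin/zpool scrub data01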
exporting:
zpool export data01
importing:
## when using standard disk devices, i.e. c2t0d0
zpool import data01

## if using files in say the /zfs filesystem
zpool import -d /zfs
## importing a destroyed pool
zpool import -f -D -d /zfs1 data03
getting parameters:
zpool get all data01

Note: the source column denotes whether the value has been changed from its default; a dash in this column means it is a read-only value
setting parameters:
zpool set autoreplace=on data01

Note: use the command "zpool get all <pool>" to obtain a list of the current settings
upgrade:
## List upgrade paths
zpool upgrade -v

## upgrade all pools
zpool upgrade -a

## upgrade specific pool, use "zpool get all <pool>" to obtain version number of a pool
zpool upgrade data01

## upgrade to a specific version
zpool upgrade -V 10 data01
Filesystem
displaying:
zfs list

## list different types
zfs list -t filesystem
zfs list -t snapshot
zfs list -t volume

## recursive display
zfs list -r data01/oracle

## complex listing
zfs list -o name,sharenfs,mountpoint

Note: there are a number of attributes that you can use in a complex listing, so use the man page to see them all
creating:
## presuming I have a pool called data01, create a /data01/apache filesystem
zfs create data01/apache

## using a different mountpoint
zfs create -o mountpoint=/oracle data01/oracle

## create a volume - the device can be accessed via /dev/zvol/[rdsk|dsk]/data03/swap
zfs create -V 50mb data01/swap
swap -a /dev/zvol/dsk/data01/swap
Note: don't use a zfs volume as a dump device; it is not supported
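To confirm that the volume is actually in use as a swap device, list the configured swap areas (Solaris swap(1M)):

## the zvol should appear alongside any other swap devices
swap -l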
destroying:
zfs destroy data01/oracle

## using the recursive options: -r = all children, -R = all dependents
zfs destroy -r data01/oracle
zfs destroy -R data01/oracle
mounting:
zfs mount data01

Note: all the normal mount options can be applied, e.g. ro/rw, setuid
unmounting:
zfs umount data01
share:
zfs share data01

## Persist over reboots
zfs set sharenfs=on data01
unshare:
zfs unshare data01

## persist over reboots
zfs set sharenfs=off data01
snapshotting:
## creating a snapshot
zfs snapshot data01@10022010

## destroying a snapshot
zfs destroy data01@10022010
rollback:
zfs rollback data01@10022010
cloning/promoting:
zfs clone data01@10022010 data01/clone

## promoting a clone
zfs promote data01/clone

Note: the clone must reside in the same pool
renaming:
## the dataset must be kept within the same pool
zfs rename data03/ora_disk01 data03/ora_d01

Note: you have two options:
-p creates all the non-existing parent datasets
-r recursively renames the snapshots of all descendant datasets (used with snapshots only)
getting parameters:
## List all the properties
zfs get all data03/oracle

## get a specific property
zfs get setuid data03/oracle

## get a specific property for all datasets
zfs get compression
Note: the source column denotes whether the value has been changed from its default; a dash in this column means it is a read-only value
setting parameters:
## set and unset a quota
zfs set quota=50M data03/oracle
zfs set quota=none data03/oracle
Note: use the command "zfs get all <dataset> " to obtain list of current settings
inherit:
## set back to the default value
zfs inherit compression data03/oracle
upgrade:
## List the upgrade paths
zfs upgrade -v
## List all the datasets that are not at the current level
zfs upgrade

## upgrade a specific dataset
zfs upgrade -V <version> data03/oracle
send/receive:
## here is a complete example of a send and receive with an incremental update
## create some test files
mkfile -v 100m /zfs/master
mkfile -v 100m /zfs/slave

## create mountpoints
mkdir /master
mkdir /slave

## Create the pools
zpool create master /zfs/master
zpool create slave /zfs/slave
## create the data filesystem
zfs create master/data

## create a test file
echo "created: 09:58" > /master/data/test.txt

## create a snapshot and send it to the slave, you could use SSH or tape to transfer to another server (see below)
zfs snapshot master/data@1
zfs send master/data@1 | zfs receive slave/data
## set the slave to read-only to avoid data corruption; make sure you do this before accessing anything in the
## slave/data directory
zfs set readonly=on slave/data

## update the original test.txt file
echo "`date`" >> /master/data/text.txt

## create a second snapshot and send the differences; you may get an error message saying that the destination has been
## modified, which means you did not set slave/data to read-only (see above)
zfs snapshot master/data@2
zfs send -i master/data@1 master/data@2 | zfs receive slave/data
-----------
## using SSH
zfs send master/data@1 | ssh backup_server zfs receive backups/data@1
## using a tape drive, you can also use cpio
zfs send master/data@1 > /dev/rmt/0
zfs receive slave/data2@1 < /dev/rmt/0
zfs rename slave/data slave/data.old
zfs rename slave/data2 slave/data

## you can also save incremental data
zfs send master/data@12022010 > /dev/rmt/0
zfs send -i master/data@12022010 master/data@13022010 > /dev/rmt/0
## Using gzip to compress the snapshot
zfs send master/fs@snap | gzip > /dev/rmt/0
allow/unallow:
## display the permissions set
zfs allow master

## create permission set
zfs allow -s @permset1 create,mount,snapshot,clone,promote master

## grant a user permissions
zfs allow vallep @permset1 master

## revoke a user permissions
zfs unallow vallep @permset1 master

Note: there are many permissions that you can set so see the man page or just use the "zfs allow" command

Monday, August 27, 2012

Strings in C

Manipulation of strings in C


A string in C is an array of char values terminated by a special null character value '\0'. For example, here is a statically declared string that is initialized to "hi":

 char str[3]; 
 // need space for chars in str, plus for terminating '\0' char
 str[0] = 'h';
 str[1] = 'i';
 str[2] = '\0';
 printf("%s\n", str);   
// prints hi to stdout 

C library functions for strings (string.h)

C provides a library for strings. C string library functions do not allocate space for the strings they manipulate, nor do they check that you pass in valid strings; it is up to your program to allocate space for the strings that the C string library will use. Calling string library functions with bad address values will cause a segfault or "strange" memory access errors. Here is a description of some of the functions in the standard C string library (string.h) for manipulating strings:
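For example, here is a minimal sketch of the caller-allocates rule (the variable names are only for illustration):

#include <stdlib.h>
#include <string.h>

int main() {
 char *src = "hello";
 char *dst;                      /* uninitialized: strcpy(dst, src) here would crash */
 dst = malloc(strlen(src) + 1);  /* the caller allocates, +1 for the '\0' */
 if (dst != NULL) {
  strcpy(dst, src);              /* now the copy is safe */
  free(dst);
 }
 return 0;
}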

#include <string.h>

-----------

size_t strlen(const char *s); 
  /* returns the length of the string, 
  not including the null character */

-----------

char *strcpy(char *dst, const char *src);
/* copies string src to string dst up 
 until the first '\0' character in src;
 returns a ptr to the destination string;
 it adds '\0' to the end of the copy */

-----------

char *strncpy(char *dst, const char *src, size_t size);  

/* copies up to the first '\0' or size characters 
 from src to dst; does not add '\0' if src holds 
 size or more characters */

-----------

int strcmp(const char *s1, const char *s2); 
/* returns 0 if s1 and s2 are the same string,
 a value < 0 if s1 is less than s2, 
 a value > 0 if s1 is greater than s2 */

-----------


char *strcat(char *dst, const char *src);  

/* appends chars from src to the end of dst; 
returns ptr to dst, and adds '\0' to the end */


-----------

char *strncat(char *dst, const char *src, size_t size);

/* appends at most size chars from src to dst,
 then adds a terminating '\0' */

------------

char *strstr(const char *string, const char *substr);  

/* locates a substring inside a string;
returns a pointer to the beginning of substr in string;
returns NULL if substr is not in string */

--------------

char *strchr(const char *s, int c);        
//locate a character in a string


--------------

char *strtok(char *s, const char *delim); 
// extract tokens from strings


--------------
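strtok has an unusual calling convention: the first call is passed the string itself (which strtok modifies in place), and each subsequent call is passed NULL to continue tokenizing the same string. A minimal sketch:

#include <stdio.h>
#include <string.h>

int main() {
 char line[] = "one,two,three";  /* must be writable: strtok modifies it */
 char *tok = strtok(line, ",");  /* first call: pass the string */
 while (tok != NULL) {
  printf("%s\n", tok);           /* prints "one", "two", "three" */
  tok = strtok(NULL, ",");       /* subsequent calls: pass NULL */
 }
 return 0;
}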



In addition there are some functions in stdlib.h for converting between strings and other C types:
#include <stdlib.h>

int atoi(const char *nptr);  
// convert a string to an integer  
// "1234" to int value 1234
double atof(const char *nptr);
// convert a string to a double
// "3.14" to double value 3.14

-------------
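Note that atoi and atof give no way to detect a bad input. strtol (also in stdlib.h) reports where parsing stopped, so errors can be caught; a minimal sketch:

#include <stdio.h>
#include <stdlib.h>

int main() {
 const char *input = "1234abc";
 char *end;
 long val = strtol(input, &end, 10);  /* val == 1234; end points at "abc" */
 if (end == input)
  printf("no digits were converted\n");
 else
  printf("parsed %ld, leftover \"%s\"\n", val, end);
 return 0;
}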

To get on-line documentation of C functions, use Unix's man utility:
        % man strcpy

Using strings in C programs

One thing to keep in mind when allocating space for strings is that you need to allocate one more space than the number of characters in the string so that you can store the terminating '\0' character. Here are some examples of using these functions:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
 int size;
 char *static_str = "hello there";   
 char *new_str = NULL;
 char *ptr = NULL;

 printf("%s\n", static_str);  // prints "hello there"
 size = strlen(static_str);
 new_str = malloc(sizeof(char)*(size+1)); // need space for '\0'
 if(new_str == NULL) { 
  fprintf(stderr, "malloc failed\n"); 
  exit(1);
 }
 strncpy(new_str, static_str, 6);
 new_str[6] = '\0';       // strncpy copied only 6 chars, so terminate by hand
 strcat(new_str, "yo");
 printf("%s\n", new_str); // prints "hello yo"
 if((ptr = strstr(new_str, "yo")) != NULL) {
  printf("%s\n", ptr); // prints "yo"
 }
 free(new_str);
 return 0;
}

Tuesday, August 21, 2012

Highly Available Storage (HAST) Simple Configuration


As HAST provides synchronous block-level replication of any storage media to several machines, it requires at least two nodes (physical machines) -- the primary (also known as master) node, and the secondary (slave) node. These two machines together will be called a cluster.
Note: HAST is currently limited to two cluster nodes in total.
Since HAST works in a primary-secondary configuration, it allows only one of the cluster nodes to be active at any given time. The primary node, also called active, is the one which will handle all the I/O requests to HAST-managed devices. The secondary node is then automatically synchronized from the primary node.
The physical components of the HAST system are:
  • local disk (on primary node)
  • disk on remote machine (secondary node)
HAST operates synchronously on the block level, making it transparent to file systems and applications. HAST provides regular GEOM providers in the /dev/hast/ directory for use by other tools or applications; thus there is no difference between using HAST-provided devices and raw disks, partitions, etc.
Each write, delete or flush operation is sent to the local disk and to the remote disk over TCP/IP. Each read operation is served from the local disk, unless the local disk is not up-to-date or an I/O error occurs. In such a case, the read operation is sent to the secondary node.

Synchronization and Replication Modes

HAST tries to provide fast failure recovery. For this reason, it is very important to reduce synchronization time after a node's outage. To provide fast synchronization, HAST manages an on-disk bitmap of dirty extents and only synchronizes those during a regular synchronization (with the exception of the initial sync).
There are many ways to handle synchronization. HAST implements several replication modes to handle different synchronization methods:
  • memsync: report write operation as completed when the local write operation is finished and when the remote node acknowledges data arrival, but before actually storing the data. The data on the remote node will be stored directly after sending the acknowledgement. This mode is intended to reduce latency, but still provides very good reliability. The memsync replication mode is currently not implemented.
  • fullsync: report write operation as completed when local write completes and when remote write completes. This is the safest and the slowest replication mode. This mode is the default.
  • async: report write operation as completed when local write completes. This is the fastest and the most dangerous replication mode. It should be used when replicating to a distant node where latency is too high for other modes. The async replication mode is currently not implemented.
Warning: Only the fullsync replication mode is currently supported.

 HAST Configuration

HAST requires GEOM_GATE support in order to function. The GENERIC kernel does not include GEOM_GATE by default, however the geom_gate.ko loadable module is available in the default FreeBSD installation. For stripped-down systems, make sure this module is available. Alternatively, it is possible to build GEOM_GATE support into the kernel statically, by adding this line to the custom kernel configuration file:
options GEOM_GATE
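Alternatively, the module can be loaded by hand, or at every boot via /boot/loader.conf (standard FreeBSD module-loading conventions):

# kldload geom_gate

or, in /boot/loader.conf:

geom_gate_load="YES"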
The HAST framework consists of several parts from the operating system's point of view:
  • the hastd(8) daemon responsible for the data synchronization,
  • the hastctl(8) userland management utility,
  • the hast.conf(5) configuration file.
The following example describes how to configure two nodes in master-slave / primary-secondary operation using HAST to replicate the data between the two. The nodes will be called hasta with an IP address 172.16.0.1 and hastb with an IP address 172.16.0.2. Both of these nodes will have a dedicated hard drive /dev/ad6 of the same size for HAST operation. The HAST pool (sometimes also referred to as a resource, i.e., the GEOM provider in /dev/hast/) will be called test.
Configuration of HAST is done in the /etc/hast.conf file. This file should be the same on both nodes. The simplest configuration possible is:
resource test {
 on hasta {
  local /dev/ad6
  remote 172.16.0.2
 }
 on hastb {
  local /dev/ad6
  remote 172.16.0.1
 }
}
For more advanced configuration, please consult the hast.conf(5) manual page.
Tip: It is also possible to use host names in the remote statements. In such a case, make sure that these hosts are resolvable, e.g., they are defined in the /etc/hosts file, or alternatively in the local DNS.
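For this example, a minimal /etc/hosts on both nodes might look like:

172.16.0.1   hasta
172.16.0.2   hastb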
Now that the configuration exists on both nodes, the HAST pool can be created. Run these commands on both nodes to place the initial metadata onto the local disk, and start the hastd(8) daemon:
# hastctl create test
# /etc/rc.d/hastd onestart
Note: It is not possible to use GEOM providers with an existing file system (i.e., convert an existing storage to HAST-managed pool), because this procedure needs to store some metadata onto the provider and there will not be enough required space available.
A HAST node's role (primary or secondary) is selected by an administrator or other software like Heartbeat using the hastctl(8) utility. Move to the primary node (hasta) and issue this command:
# hastctl role primary test
Similarly, run this command on the secondary node (hastb):
# hastctl role secondary test
Caution: When the nodes are unable to communicate with each other, and both are configured as primary nodes, the condition is called split-brain. To troubleshoot this situation, follow the steps described in Section 19.18.5.2.
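In short, recovering from split-brain means deciding which node's changes will be discarded; that node's resource is then reinitialized and demoted, and it resynchronizes from the surviving primary. A sketch, assuming the changes on hastb are the ones being thrown away:

# hastctl role init test
# hastctl create test
# hastctl role secondary test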
Verify the result with the hastctl(8) utility on each node:
# hastctl status test
The important text is the status line, which should say complete on each of the nodes. If it says degraded, something went wrong. At this point, the synchronization between the nodes has already started. The synchronization completes when hastctl status reports 0 bytes of dirty extents.
The next step is to create a filesystem on the /dev/hast/test GEOM provider and mount it. This must be done on the primary node, as /dev/hast/test appears only on the primary node. Creating the filesystem can take a few minutes, depending on the size of the hard drive:
# newfs -U /dev/hast/test
# mkdir /hast/test
# mount /dev/hast/test /hast/test
Once the HAST framework is configured properly, the final step is to make sure that HAST is started automatically during the system boot. Add this line to /etc/rc.conf:
hastd_enable="YES"

 Failover Configuration

The goal of this example is to build a robust storage system which is resistant to the failure of any given node. The scenario is that a primary node of the cluster fails. If this happens, the secondary node is there to take over seamlessly, check and mount the file system, and continue to work without missing a single bit of data.
To accomplish this task, another FreeBSD feature provides for automatic failover on the IP layer -- CARP. CARP (Common Address Redundancy Protocol) allows multiple hosts on the same network segment to share an IP address. Set up CARP on both nodes of the cluster according to the documentation available in Section 32.14. After setup, each node will have its own carp0 interface with a shared IP address 172.16.0.254. The primary HAST node of the cluster must be the master CARP node.
The HAST pool created in the previous section is now ready to be exported to the other hosts on the network. This can be accomplished by exporting it through NFS, Samba, etc., using the shared IP address 172.16.0.254. The only problem which remains unresolved is an automatic failover should the primary node fail.
In the event of CARP interfaces going up or down, the FreeBSD operating system generates a devd(8) event, making it possible to watch for state changes on the CARP interfaces. A state change on the CARP interface is an indication that one of the nodes failed or came back online. These state change events make it possible to run a script which will automatically handle the HAST failover.
To be able to catch state changes on the CARP interfaces, add this configuration to /etc/devd.conf on each node:
notify 30 {
 match "system" "IFNET";
 match "subsystem" "carp0";
 match "type" "LINK_UP";
 action "/usr/local/sbin/carp-hast-switch master";
};

notify 30 {
 match "system" "IFNET";
 match "subsystem" "carp0";
 match "type" "LINK_DOWN";
 action "/usr/local/sbin/carp-hast-switch slave";
};
Restart devd(8) on both nodes to put the new configuration into effect:
# /etc/rc.d/devd restart
When the carp0 interface goes up or down (i.e., the interface state changes), the system generates a notification, allowing the devd(8) subsystem to run an arbitrary script, in this case /usr/local/sbin/carp-hast-switch. This is the script which will handle the automatic failover. For further clarification about the above devd(8) configuration, please consult the devd.conf(5) manual page.
An example of such a script could be:
#!/bin/sh

# Original script by Freddie Cash <fjwcash@gmail.com>
# Modified by Michael W. Lucas <mwlucas@BlackHelicopters.org>
# and Viktor Petersson <vpetersson@wireload.net>

# The names of the HAST resources, as listed in /etc/hast.conf
resources="test"

# delay in mounting HAST resource after becoming master
# make your best guess
delay=3

# logging
log="local0.debug"
name="carp-hast"

# end of user configurable stuff

case "$1" in
 master)
  logger -p $log -t $name "Switching to primary provider for ${resources}."
  sleep ${delay}

  # Wait for any "hastd secondary" processes to stop
  for disk in ${resources}; do
   while $( pgrep -lf "hastd: ${disk} \(secondary\)" > /dev/null 2>&1 ); do
    sleep 1
   done

   # Switch role for each disk
   hastctl role primary ${disk}
   if [ $? -ne 0 ]; then
    logger -p $log -t $name "Unable to change role to primary for resource ${disk}."
    exit 1
   fi
  done

  # Wait for the /dev/hast/* devices to appear
  for disk in ${resources}; do
   for I in $( jot 60 ); do
    [ -c "/dev/hast/${disk}" ] && break
    sleep 0.5
   done

   if [ ! -c "/dev/hast/${disk}" ]; then
    logger -p $log -t $name "GEOM provider /dev/hast/${disk} did not appear."
    exit 1
   fi
  done

  logger -p $log -t $name "Role for HAST resources ${resources} switched to primary."


  logger -p $log -t $name "Mounting disks."
  for disk in ${resources}; do
   mkdir -p /hast/${disk}
   fsck -p -y -t ufs /dev/hast/${disk}
   mount /dev/hast/${disk} /hast/${disk}
  done

 ;;

 slave)
  logger -p $log -t $name "Switching to secondary provider for ${resources}."

  # Switch roles for the HAST resources
  for disk in ${resources}; do
   if mount | grep -q "^/dev/hast/${disk} on "
   then
    umount -f /hast/${disk}
   fi
   sleep $delay
   hastctl role secondary ${disk} 2>&1
   if [ $? -ne 0 ]; then
    logger -p $log -t $name "Unable to switch role to secondary for resource ${disk}."
    exit 1
   fi
   logger -p $log -t $name "Role switched to secondary for resource ${disk}."
  done
 ;;
esac
In a nutshell, the script takes these actions when a node becomes master / primary:
  • Promotes the HAST pools to primary on a given node.
  • Checks the file system under the HAST pool.
  • Mounts the pools at an appropriate place.
When a node becomes backup / secondary:
  • Unmounts the HAST pools.
  • Degrades the HAST pools to secondary.
Caution: Keep in mind that this is just an example script which should serve as a proof of concept. It does not handle all the possible scenarios and can be extended or altered in any way, for example it can start/stop required services, etc.
Tip: For this example, we used a standard UFS file system. To reduce the time needed for recovery, a journal-enabled UFS or ZFS file system can be used.
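For instance, on a FreeBSD release whose newfs(8) supports soft updates journaling, the file system from the earlier step could have been created with journaling enabled (this flag is an assumption, not part of the original setup):

# newfs -j /dev/hast/test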
More detailed information with additional examples can be found in the HAST Wiki page.


HAST FREEBSD ZFS WITH CARP FAILOVER


HAST (Highly Available Storage) is a new concept for FreeBSD and it is under constant development. HAST allows you to transparently store data on two physically separated machines connected over a TCP/IP network. HAST operates on the block level, making it transparent to file systems and providing disk-like devices in the /dev/hast directory.
In this article we will create two identical HAST nodes, hast1 and hast2. Both nodes will use one NIC connected to a vlan for data synchronization, and another NIC will be configured via CARP in order to share the same IP address across the network. The first node will be called "storage1.hast.test", the second "storage2.hast.test", and they will both listen on a common IP address which we will bind to "storage.hast.test".
HAST binds its resource names according to the machine's hostname. Therefore, we will use "hast1.freebsd.loc" and "hast2.freebsd.loc" as the machines' hostnames so that HAST can operate without complaining.
In order for CARP to work we don't have to compile a new kernel. We can just load it as a module by adding the following to /boot/loader.conf:
 if_carp_load="YES"
Both our nodes are now set up, so it is time to make some adjustments. First, a decent /etc/rc.conf for the first node:
zfs_enable="YES"
 
###Primary Interface##
ifconfig_re0="inet 10.10.10.181  netmask 255.255.255.0"
 
###Secondary Interface for HAST###
ifconfig_re1="inet 192.168.100.100  netmask 255.255.255.0"
 
defaultrouter="10.10.10.1"
sshd_enable="YES"
hostname="hast1.freebsd.loc"
 
##CARP INTERFACE SETUP##
cloned_interfaces="carp0"
ifconfig_carp0="inet 10.10.10.180 netmask 255.255.255.0 vhid 1 pass mypassword advskew 0"
 
hastd_enable="YES"
The second node will mostly match the first, except for the IP addressing:
zfs_enable="YES"
 
###Primary Interface##
ifconfig_re0="inet 10.10.10.182  netmask 255.255.255.0"
 
###Secondary Interface for HAST###
ifconfig_re1="inet 192.168.100.101  netmask 255.255.255.0"
 
defaultrouter="10.10.10.1"
sshd_enable="YES"
hostname="hast2.freebsd.loc"
 
##CARP INTERFACE SETUP##
cloned_interfaces="carp0"
ifconfig_carp0="inet 10.10.10.180 netmask 255.255.255.0 vhid 1 pass mypassword advskew 100"
 
hastd_enable="YES"
At this point we have assigned re1 an IP on each node for HAST synchronization. We have also assigned IPs to re0 on both nodes, which in turn share a third common IP assigned to carp0.
As a result, re1 is used for HAST synchronization on its own vlan, while carp0, which is cloned from re0, is used on the same vlan as the rest of our clients.
In order for HAST to function correctly we have to resolve the correct IPs on every node. We don't want to rely on DNS for this, because DNS can fail. Instead we will use an /etc/hosts file that is the same on every node.
::1   localhost localhost.freebsd.loc
127.0.0.1  localhost localhost.freebsd.loc
192.168.100.100  hast1.freebsd.loc hast1
192.168.100.101  hast2.freebsd.loc hast2
 
10.10.10.181           storage1.hast.test storage1
10.10.10.182           storage2.hast.test storage2
10.10.10.180        storage.hast.test  storage
Next, we have to create the /etc/hast.conf file. Here we will declare the resources that we want to create. All resources will eventually create devices located under /dev/hast on the primary node. Every resource specifies a physical device along with the local and remote addresses. The /etc/hast.conf file must be exactly the same on every node.
resource disk1 {
        on hast1 {
                local /dev/ad1
                remote hast2
        }
        on  hast2 {
                local /dev/ad1
                remote hast1
        }
}
 
resource disk2 {
        on  hast1 {
                local /dev/ad2
                remote hast2
        }
        on  hast2 {
                local /dev/ad2
                remote hast1
        }
}
 
resource disk3 {
        on  hast1 {
                local /dev/ad3
                remote hast2
        }
        on  hast2 {
                local /dev/ad3
                remote hast1
        }
}
In this example we are sharing three resources: disk1, disk2 and disk3. Each resource specifies a device plus the local and remote addresses. With this configuration in place, we are ready to begin setting up our HAST devices.
Let's start hastd on both nodes first:
hast1#/etc/rc.d/hastd start
hast2#/etc/rc.d/hastd start
Now on the primary node we will initialize our resources, create them and finally assign a primary role:
hast1#hastctl role init disk1
hast1#hastctl role init disk2
hast1#hastctl role init disk3
hast1#hastctl create disk1
hast1#hastctl create disk2
hast1#hastctl create disk3
hast1#hastctl role primary disk1
hast1#hastctl role primary disk2
hast1#hastctl role primary disk3
Next, on the secondary node we will initialize our resources, create them and finally assign a secondary role:
hast2#hastctl role init disk1
hast2#hastctl role init disk2
hast2#hastctl role init disk3
hast2#hastctl create disk1
hast2#hastctl create disk2
hast2#hastctl create disk3
hast2#hastctl role secondary disk1
hast2#hastctl role secondary disk2
hast2#hastctl role secondary disk3
There are other ways of creating and assigning roles to each resource. Having repeated this procedure a few times, I found that this approach usually works.
Now check the status on both nodes:
hast1# hastctl status
disk1:
  role: primary
  provname: disk1
  localpath: /dev/ada1
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk2:
  role: primary
  provname: disk2
  localpath: /dev/ada2
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk3:
  role: primary
  provname: disk3
  localpath: /dev/ada3
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
The first node looks good. Status is complete.
hast2# hastctl status
disk1:
  role: secondary
  provname: disk1
  localpath: /dev/ada1
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk2:
  role: secondary
  provname: disk2
  localpath: /dev/ada2
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk3:
  role: secondary
  provname: disk3
  localpath: /dev/ada3
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
So does the second. As I mentioned earlier, there are different ways of doing this the first time. You have to look for "status: complete". If you get a degraded status you can always repeat the procedure.
Now it is time to create our ZFS pool. The primary node should have a /dev/hast directory containing our resources. This directory appears only on the active node.
hast1# zpool create zhast raidz1 /dev/hast/disk1 /dev/hast/disk2 /dev/hast/disk3
hast1# zpool status zhast
 pool: zhast
 state: ONLINE
 scan: none requested
 config:
 
 NAME            STATE     READ WRITE CKSUM
 zhast           ONLINE       0     0     0
   raidz1-0      ONLINE       0     0     0
     hast/disk1  ONLINE       0     0     0
     hast/disk2  ONLINE       0     0     0
     hast/disk3  ONLINE       0     0     0
We can now use hastctl status on each node to see if everything looks ok. The magic word we are looking for here is: replication: fullsync
At this point both of our nodes should be available for failover. We have storage1 running as primary and sharing a pool called zhast. storage2 is currently in standby mode. If we have set up DNS properly we can ssh to storage.hast.test, or use its CARP IP, 10.10.10.180.
In order to perform a failover we first have to export the pool on the first node and change the role of each resource to secondary. Then we change the role of each resource to primary on the standby node and import the pool. This procedure will be done manually to test whether failover really works, but for a real HA solution we will eventually create a script that takes care of it.
First, let's export our pool and change our resources' roles:
hast1# zpool export zhast
hast1# hastctl role secondary disk1
hast1# hastctl role secondary disk2
hast1# hastctl role secondary disk3
Now, let's reverse the procedure on the standby node:
hast2# hastctl role primary disk1
hast2# hastctl role primary disk2
hast2# hastctl role primary disk3
hast2# zpool import zhast
The roles have successfully changed; let's look at our pool status:
hast2# zpool status zhast
 pool: zhast
 state: ONLINE
 scan: none requested
 config:
 
 NAME            STATE     READ WRITE CKSUM
 zhast           ONLINE       0     0     0
   raidz1-0      ONLINE       0     0     0
     hast/disk1  ONLINE       0     0     0
     hast/disk2  ONLINE       0     0     0
     hast/disk3  ONLINE       0     0     0
 
errors: No known data errors
Again, by using hastctl status on each node we can verify that the roles have indeed changed and that the status is complete. This is a sample output from the second node now in charge:
hast2# hastctl status
disk1:
  role: primary
  provname: disk1
  localpath: /dev/ad1
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
disk2:
  role: primary
  provname: disk2
  localpath: /dev/ad2
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
disk3:
  role: primary
  provname: disk3
  localpath: /dev/ad3
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
It is now time to automate this procedure. When do we want our servers to fail over automatically?
One reason would be if the primary node stops responding on the external network and is thus unable to serve its clients. Using a devd event we can catch a carp interface going up or down, i.e., a state change.
Add the following lines to /etc/devd.conf on both nodes:
notify 30 {
 match "system" "IFNET";
 match "subsystem" "carp0";
 match "type" "LINK_UP";
 action "/usr/local/bin/failover master";
};
 
notify 30 {
 match "system" "IFNET";
 match "subsystem" "carp0";
 match "type" "LINK_DOWN";
 action "/usr/local/bin/failover slave";
};
And now let's create the failover script, which will be responsible for automatically doing what we did manually before:
#!/bin/sh
 
# Original script by Freddie Cash <fjwcash@gmail.com>
# Modified by Michael W. Lucas <mwlucas@BlackHelicopters.org>
# and Viktor Petersson <vpetersson@wireload.net>
# Modified by George Kontostanos <gkontos.mail@gmail.com>
 
# The names of the HAST resources, as listed in /etc/hast.conf
resources="disk1 disk2 disk3"
 
# delay in mounting HAST resource after becoming master
# make your best guess
delay=3
 
# logging
log="local0.debug"
name="failover"
pool="zhast"
 
# end of user configurable stuff
 
case "$1" in
 master)
  logger -p $log -t $name "Switching to primary provider for ${resources}."
  sleep ${delay}
 
  # Wait for any "hastd secondary" processes to stop
  for disk in ${resources}; do
   while $( pgrep -lf "hastd: ${disk} \(secondary\)" > /dev/null 2>&1 ); do
    sleep 1
   done
 
   # Switch role for each disk
   hastctl role primary ${disk}
   if [ $? -ne 0 ]; then
    logger -p $log -t $name "Unable to change role to primary for resource ${disk}."
    exit 1
   fi
  done
 
  # Wait for the /dev/hast/* devices to appear
  for disk in ${resources}; do
   for I in $( jot 60 ); do
    [ -c "/dev/hast/${disk}" ] && break
    sleep 0.5
   done
 
   if [ ! -c "/dev/hast/${disk}" ]; then
    logger -p $log -t $name "GEOM provider /dev/hast/${disk} did not appear."
    exit 1
   fi
  done
 
  logger -p $log -t $name "Role for HAST resources ${resources} switched to primary."
 
 
  logger -p $log -t $name "Importing Pool"
  # Import ZFS pool. Do it forcibly as it remembers hostid of
                # the other cluster node.
                out=`zpool import -f "${pool}" 2>&1`
                if [ $? -ne 0 ]; then
                    logger -p local0.error -t hast "ZFS pool import for resource ${resource} failed: ${out}."
                    exit 1
                fi
                logger -p local0.debug -t hast "ZFS pool for resource ${resource} imported."
 
 ;;
 
 slave)
  logger -p $log -t $name "Switching to secondary provider for ${resources}."
 
  # Export the ZFS pool before demoting the HAST resources
  zpool list | egrep -q "^${pool} "
  if [ $? -eq 0 ]; then
   # Forcibly export the pool.
   out=`zpool export -f "${pool}" 2>&1`
   if [ $? -ne 0 ]; then
    logger -p local0.error -t hast "Unable to export pool ${pool}: ${out}."
    exit 1
   fi
   logger -p local0.debug -t hast "ZFS pool ${pool} exported."
  fi
  # Switch roles for the HAST resources
  for disk in ${resources}; do
   sleep $delay
   hastctl role secondary ${disk} 2>&1
   if [ $? -ne 0 ]; then
    logger -p $log -t $name "Unable to switch role to secondary for resource ${disk}."
    exit 1
   fi
   logger -p $log -t $name "Role switched to secondary for resource ${disk}."
  done
 ;;
esac
Let's try it and see if it works. Log into both the currently active node and the standby node. Make sure that you are on the active node by issuing a hastctl status command. Then force a failover by bringing down the interface associated with carp0:
hast1# ifconfig re0 down
Watch the generated messages:
hast1# tail -f /var/log/debug.log
 
Feb  6 15:01:41 hast1 failover: Switching to secondary provider for disk1 disk2 disk3.
Feb  6 15:01:49 hast1 hast: ZFS pool for resource  exported.
Feb  6 15:01:52 hast1 failover: Role switched to secondary for resource disk1.
Feb  6 15:01:55 hast1 failover: Role switched to secondary for resource disk2.
Feb  6 15:01:58 hast1 failover: Role switched to secondary for resource disk3.
hast2# tail -f /var/log/debug.log
 
Feb  6 15:02:15 hast2 failover: Switching to primary provider for disk1 disk2 disk3.
Feb  6 15:02:19 hast2 failover: Role for HAST resources disk1 disk2 disk3 switched to primary.
Feb  6 15:02:19 hast2 failover: Importing Pool
Feb  6 15:02:52 hast2 hast: ZFS pool for resource  imported.
Voila! The failover worked like a charm and now hast2 has assumed the primary role.