Tuesday, August 21, 2012

HAST FREEBSD ZFS WITH CARP FAILOVER


HAST (Highly Available Storage) is a fairly new concept for FreeBSD and is still under constant development. HAST allows you to transparently store data on two physically separated machines connected over a TCP/IP network. HAST operates at the block level, making it transparent to file systems and providing disk-like devices in the /dev/hast directory.
In this article we will create two identical HAST nodes, hast1 and hast2. Both nodes will use one NIC connected to a VLAN for data synchronization, while another NIC will be configured via CARP so that the two nodes share the same IP address on the network. The first node will be called “storage1.hast.test”, the second “storage2.hast.test”, and they will both answer on a common IP address which we will bind to “storage.hast.test”.
HAST binds its resource names according to the machine’s hostname. Therefore, we will use “hast1.freebsd.loc” and “hast2.freebsd.loc” as the machines’ hostnames so that HAST can operate without complaining.
In order for CARP to work we don’t have to compile a new kernel. We can just load it as a module by adding the following line to /boot/loader.conf:
 if_carp_load="YES"
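After loading the module (or after a reboot), a quick sanity check confirms that CARP is available. The exact module name can differ between FreeBSD releases, so the loose grep below is just an illustration:

hast1# kldstat | grep -i carp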
Both of our nodes are now set up, so it is time to make some adjustments. First, a decent /etc/rc.conf for the first node:
zfs_enable="YES"
 
###Primary Interface###
ifconfig_re0="inet 10.10.10.181  netmask 255.255.255.0"
 
###Secondary Interface for HAST###
ifconfig_re1="inet 192.168.100.100  netmask 255.255.255.0"
 
defaultrouter="10.10.10.1"
sshd_enable="YES"
hostname="hast1.freebsd.loc"
 
##CARP INTERFACE SETUP##
cloned_interfaces="carp0"
ifconfig_carp0="inet 10.10.10.180 netmask 255.255.255.0 vhid 1 pass mypassword advskew 0"
 
hastd_enable="YES"
The second node’s rc.conf matches the first, except for the IP addresses, the hostname and the CARP advskew:
zfs_enable="YES"
 
###Primary Interface###
ifconfig_re0="inet 10.10.10.182  netmask 255.255.255.0"
 
###Secondary Interface for HAST###
ifconfig_re1="inet 192.168.100.101  netmask 255.255.255.0"
 
defaultrouter="10.10.10.1"
sshd_enable="YES"
hostname="hast2.freebsd.loc"
 
##CARP INTERFACE SETUP##
cloned_interfaces="carp0"
ifconfig_carp0="inet 10.10.10.180 netmask 255.255.255.0 vhid 1 pass mypassword advskew 100"
 
hastd_enable="YES"
At this point each node has an IP on re1 for HAST synchronization and an IP on re0, and both nodes share a third, common IP assigned to carp0.
As a result, re1 is used for HAST synchronization on its own VLAN, while carp0, which rides on top of re0, serves the shared address on the same VLAN as the rest of our clients.
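Once the interfaces are configured, you can confirm which node currently holds the shared address by inspecting the CARP interface; the active node should report MASTER and the standby BACKUP (the exact output format varies slightly between FreeBSD releases):

hast1# ifconfig carp0
hast2# ifconfig carp0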
In order for HAST to function correctly we have to resolve the correct IPs on every node. We don’t want to rely on DNS for this, because DNS can fail. Instead we will use an /etc/hosts file that is identical on every node:
::1              localhost localhost.freebsd.loc
127.0.0.1        localhost localhost.freebsd.loc
192.168.100.100  hast1.freebsd.loc hast1
192.168.100.101  hast2.freebsd.loc hast2

10.10.10.181     storage1.hast.test storage1
10.10.10.182     storage2.hast.test storage2
10.10.10.180     storage.hast.test storage
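Before moving on, it is worth verifying that each node can reach the other over the dedicated sync interface using these names, for example:

hast1# ping -c 3 hast2
hast2# ping -c 3 hast1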
Next, we have to create the /etc/hast.conf file. Here we declare the resources that we want to create. All resources will eventually show up as devices under /dev/hast on the primary node. Every resource specifies a local physical device and the remote node to synchronize with. The /etc/hast.conf file must be exactly the same on every node.
resource disk1 {
        on hast1 {
                local /dev/ad1
                remote hast2
        }
        on hast2 {
                local /dev/ad1
                remote hast1
        }
}

resource disk2 {
        on hast1 {
                local /dev/ad2
                remote hast2
        }
        on hast2 {
                local /dev/ad2
                remote hast1
        }
}

resource disk3 {
        on hast1 {
                local /dev/ad3
                remote hast2
        }
        on hast2 {
                local /dev/ad3
                remote hast1
        }
}
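Note that hast.conf also accepts a replication keyword. The default mode, fullsync, is what we want here (and on the FreeBSD releases of this era it is the only fully supported mode), so we do not set it, but if you prefer to be explicit it can go at the top of the file:

replication fullsync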
In this example we are sharing three resources: disk1, disk2 and disk3. Each resource names a local device and the remote node it replicates to. With this configuration in place, we are ready to begin setting up our HAST devices.
Let’s start hastd on both nodes first:
hast1# /etc/rc.d/hastd start
hast2# /etc/rc.d/hastd start
Now on the primary node we will initialize our resources, create them and finally assign a primary role:
hast1#hastctl role init disk1
hast1#hastctl role init disk2
hast1#hastctl role init disk3
hast1#hastctl create disk1
hast1#hastctl create disk2
hast1#hastctl create disk3
hast1#hastctl role primary disk1
hast1#hastctl role primary disk2
hast1#hastctl role primary disk3
Next, on the secondary node we will initialize our resources, create them and finally assign a secondary role:
hast2#hastctl role init disk1
hast2#hastctl role init disk2
hast2#hastctl role init disk3
hast2#hastctl create disk1
hast2#hastctl create disk2
hast2#hastctl create disk3
hast2#hastctl role secondary disk1
hast2#hastctl role secondary disk2
hast2#hastctl role secondary disk3
There are other ways of creating and assigning roles to each resource, but having repeated this procedure a few times, I have found that this one works reliably.
Now check the status on both nodes:
hast1# hastctl status
disk1:
  role: primary
  provname: disk1
  localpath: /dev/ad1
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk2:
  role: primary
  provname: disk2
  localpath: /dev/ad2
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk3:
  role: primary
  provname: disk3
  localpath: /dev/ad3
  ...
  remoteaddr: hast2
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
The first node looks good. Status is complete.
hast2# hastctl status
disk1:
  role: secondary
  provname: disk1
  localpath: /dev/ad1
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk2:
  role: secondary
  provname: disk2
  localpath: /dev/ad2
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
disk3:
  role: secondary
  provname: disk3
  localpath: /dev/ad3
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  dirty: 0 (0B)
  ...
So does the second. As I mentioned earlier, there are different ways of doing this the first time; what you have to look for is "status: complete". If you get a degraded status you can always repeat the procedure.
Now it is time to create our ZFS pool. The primary node should have a /dev/hast directory containing our resources; this directory appears only on the currently active node.
hast1# zpool create zhast raidz1 /dev/hast/disk1 /dev/hast/disk2 /dev/hast/disk3
hast1# zpool status zhast
 pool: zhast
 state: ONLINE
 scan: none requested
 config:
 
 NAME            STATE     READ WRITE CKSUM
 zhast           ONLINE       0     0     0
   raidz1-0      ONLINE       0     0     0
     hast/disk1  ONLINE       0     0     0
     hast/disk2  ONLINE       0     0     0
     hast/disk3  ONLINE       0     0     0
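From here the pool behaves like any other ZFS pool, so datasets can be created on top of it as usual. For example (the dataset name and mountpoint below are only illustrations):

hast1# zfs create zhast/data
hast1# zfs set mountpoint=/storage zhast/data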
We can now use hastctl status on each node to see if everything looks ok. The magic word we are looking for here is: replication: fullsync
At this point both of our nodes should be ready for failover. We have storage1 running as the primary and sharing a pool called zhast, while storage2 is currently in standby mode. If we have set up DNS properly we can ssh to storage.hast.test, or simply use its CARP IP, 10.10.10.180.
In order to perform a failover we first have to export the pool on the active node and change the role of each resource to secondary; then we change the role of each resource to primary on the standby node and import the pool there. We will do this manually to verify that failover really works, but for a real HA solution we will eventually create a script that takes care of it.
First, let’s export our pool and change our resources’ role:
hast1# zpool export zhast
hast1# hastctl role secondary disk1
hast1# hastctl role secondary disk2
hast1# hastctl role secondary disk3
Now, let’s reverse the procedure on the standby node:
hast2# hastctl role primary disk1
hast2# hastctl role primary disk2
hast2# hastctl role primary disk3
hast2# zpool import zhast
The roles have changed successfully; let’s see our pool status:
hast2# zpool status zhast
 pool: zhast
 state: ONLINE
 scan: none requested
 config:
 
 NAME            STATE     READ WRITE CKSUM
 zhast           ONLINE       0     0     0
   raidz1-0      ONLINE       0     0     0
     hast/disk1  ONLINE       0     0     0
     hast/disk2  ONLINE       0     0     0
     hast/disk3  ONLINE       0     0     0
 
errors: No known data errors
Again, by using hastctl status on each node we can verify that the roles have indeed changed and that the status is complete. This is a sample output from the second node now in charge:
hast2# hastctl status
disk1:
  role: primary
  provname: disk1
  localpath: /dev/ad1
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
disk2:
  role: primary
  provname: disk2
  localpath: /dev/ad2
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
disk3:
  role: primary
  provname: disk3
  localpath: /dev/ad3
  ...
  remoteaddr: hast1
  replication: fullsync
  status: complete
  ...
It is now time to automate this procedure. When do we want our servers to fail over automatically?
One reason would be the primary node no longer responding on the external network, and therefore no longer being able to serve its clients. Using devd we can catch the carp0 interface link going up or down and react to the state change.
Add the following lines to /etc/devd.conf on both nodes:
notify 30 {
 match "system" "IFNET";
 match "subsystem" "carp0";
 match "type" "LINK_UP";
 action "/usr/local/bin/failover master";
};
 
notify 30 {
 match "system" "IFNET";
 match "subsystem" "carp0";
 match "type" "LINK_DOWN";
 action "/usr/local/bin/failover slave";
};
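devd only reads its configuration at startup, so restart it on both nodes for the new rules to take effect:

hast1# /etc/rc.d/devd restart
hast2# /etc/rc.d/devd restart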
And now let’s create the failover script, which will be responsible for automatically doing what we did manually before:
#!/bin/sh

# Original script by Freddie Cash <fjwcash@gmail.com>
# Modified by Michael W. Lucas <mwlucas@BlackHelicopters.org>
# and Viktor Petersson <vpetersson@wireload.net>
# Modified by George Kontostanos <gkontos.mail@gmail.com>

# The names of the HAST resources, as listed in /etc/hast.conf
resources="disk1 disk2 disk3"

# Delay (in seconds) before switching the HAST resources after a state change
# make your best guess
delay=3

# Logging
log="local0.debug"
name="failover"
pool="zhast"

# End of user configurable stuff

case "$1" in
master)
    logger -p $log -t $name "Switching to primary provider for ${resources}."
    sleep ${delay}

    # Wait for any "hastd secondary" processes to stop
    for disk in ${resources}; do
        while pgrep -lf "hastd: ${disk} \(secondary\)" > /dev/null 2>&1; do
            sleep 1
        done

        # Switch role for each disk
        hastctl role primary ${disk}
        if [ $? -ne 0 ]; then
            logger -p $log -t $name "Unable to change role to primary for resource ${disk}."
            exit 1
        fi
    done

    # Wait for the /dev/hast/* devices to appear
    for disk in ${resources}; do
        for I in $( jot 60 ); do
            [ -c "/dev/hast/${disk}" ] && break
            sleep 0.5
        done

        if [ ! -c "/dev/hast/${disk}" ]; then
            logger -p $log -t $name "GEOM provider /dev/hast/${disk} did not appear."
            exit 1
        fi
    done

    logger -p $log -t $name "Role for HAST resources ${resources} switched to primary."

    logger -p $log -t $name "Importing Pool"
    # Import the ZFS pool. Do it forcibly as it remembers the hostid
    # of the other cluster node.
    out=`zpool import -f "${pool}" 2>&1`
    if [ $? -ne 0 ]; then
        logger -p local0.error -t hast "ZFS pool ${pool} import failed: ${out}."
        exit 1
    fi
    logger -p local0.debug -t hast "ZFS pool ${pool} imported."
    ;;

slave)
    logger -p $log -t $name "Switching to secondary provider for ${resources}."

    # Forcibly export the ZFS pool if it is imported on this node
    zpool list | egrep -q "^${pool} "
    if [ $? -eq 0 ]; then
        out=`zpool export -f "${pool}" 2>&1`
        if [ $? -ne 0 ]; then
            logger -p local0.error -t hast "Unable to export pool ${pool}: ${out}."
            exit 1
        fi
        logger -p local0.debug -t hast "ZFS pool ${pool} exported."
    fi

    # Switch roles for the HAST resources
    for disk in ${resources}; do
        sleep $delay
        hastctl role secondary ${disk} 2>&1
        if [ $? -ne 0 ]; then
            logger -p $log -t $name "Unable to switch role to secondary for resource ${disk}."
            exit 1
        fi
        logger -p $log -t $name "Role switched to secondary for resource ${disk}."
    done
    ;;
esac
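The script has to be installed as /usr/local/bin/failover on both nodes, since that is the path devd will call, and it must be executable:

hast1# chmod 755 /usr/local/bin/failover
hast2# chmod 755 /usr/local/bin/failover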
Let’s try it and see if it works. Log into both the currently active node and the standby node. Make sure that you are on the active node by issuing a hastctl status command. Then force a failover by bringing down the interface associated with carp0:
hast1# ifconfig re0 down
Watch the generated messages:
hast1# tail -f /var/log/debug.log
 
Feb  6 15:01:41 hast1 failover: Switching to secondary provider for disk1 disk2 disk3.
Feb  6 15:01:49 hast1 hast: ZFS pool zhast exported.
Feb  6 15:01:52 hast1 failover: Role switched to secondary for resource disk1.
Feb  6 15:01:55 hast1 failover: Role switched to secondary for resource disk2.
Feb  6 15:01:58 hast1 failover: Role switched to secondary for resource disk3.
hast2# tail -f /var/log/debug.log
 
Feb  6 15:02:15 hast2 failover: Switching to primary provider for disk1 disk2 disk3.
Feb  6 15:02:19 hast2 failover: Role for HAST resources disk1 disk2 disk3 switched to primary.
Feb  6 15:02:19 hast2 failover: Importing Pool
Feb  6 15:02:52 hast2 hast: ZFS pool zhast imported.
Voila! The failover worked like a charm and now hast2 has assumed the primary role.
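To bring hast1 back into the cluster as the standby, simply re-enable the interface we brought down for the test. Whether it immediately reclaims the MASTER role or stays as backup depends on whether CARP preemption (the net.inet.carp.preempt sysctl) is enabled:

hast1# ifconfig re0 up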
