XEN Cluster

From SemarkIT



Xen is one of the most advanced open source virtualization technologies. Virtualization makes server deployment easier and, to begin with, improves application availability. Thanks to live migration, admins can easily empty a host server (a.k.a. Dom0) to fix a hardware issue or perform updates without shutting down the virtual machines. At first glance, though, all of this has to be done manually. Since human error is the root cause of many outages, it would be great to make it automatic. Hard to do?

Not so. Follow this guide and you are on your way...

Cluster basics

According to Wikipedia, a cluster is a set of techniques for grouping a pool of independent physical servers and making them work together for:

  • availability improvement
  • scalability facilitation
  • easier load balancing
  • resource pooling (processor, memory, storage, network bandwidth)

Cluster size varies a lot: a cluster can range from 2 servers up to thousands of them.

Xen cluster requirements

In our example, we'll use 2 physical servers as dom0. Virtual machines will be spread over both dom0. So we have to:

  • be able to balance domU between dom0
  • make sure that each dom0 can host all domU at the same time
  • make domU file system reachable on each dom0
  • be able to "live migrate" domU from one dom0 to the other
  • share domU configuration file between dom0
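The second requirement (each dom0 can host all domU at the same time) comes down to simple arithmetic: the memory of all domU added together must fit into what one dom0 can hand out. A minimal sketch, assuming each domU has its own xm-style configuration file with a `memory = <MB>` line; the function names, the `*.cfg` suffix and the directory used in the usage line are illustrative assumptions, not part of this setup:

```shell
#!/bin/sh
# Sum the "memory = N" values (in MB) of all *.cfg files in a directory.
# Assumption: one domU per file, memory given as an integer, optionally quoted.
sum_domu_memory() {
    dir=$1
    total=0
    for cfg in "$dir"/*.cfg; do
        [ -f "$cfg" ] || continue
        mem=$(sed -n "s/^[[:space:]]*memory[[:space:]]*=[[:space:]]*[\"']\{0,1\}\([0-9][0-9]*\).*/\1/p" "$cfg" | head -n 1)
        total=$((total + ${mem:-0}))
    done
    echo "$total"
}

# Succeed if all domU together fit into the memory (MB) one dom0 can offer.
fits_on_one_dom0() {
    [ "$(sum_domu_memory "$1")" -le "$2" ]
}
```

Usage could look like `fits_on_one_dom0 /XEN/config 8192 || echo "domU do not fit on a single dom0"`, where 8192 is the memory in MB you are willing to give to guests on one host.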

Xen cluster constraints

Our cluster must comply with the following requirements:

  • Xen configuration is centralized on a dedicated cluster FS
  • The cluster FS has to be automatically mounted in each dom0
  • DomU state saving has to be deactivated since live migration will be handled by cluster stack
  • DomU must not be automatically started (cluster stack will take care of that)
  • DomU can be started only if the associated DRBD resource is in master state on the dom0
  • DomU cannot run if cluster FS is not mounted on dom0
  • DomU must run only on 1 dom0 at a time
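Two of these constraints (no state saving, no automatic start) are usually controlled by the xendomains init script on Debian. The sketch below checks its defaults file; it assumes the stock variables `XENDOMAINS_SAVE` and `XENDOMAINS_RESTORE` and the usual auto-start directory `/etc/xen/auto`, and the function names are made up for this example:

```shell
#!/bin/sh
# Succeed if the xendomains defaults file neither saves nor restores domU state.
# Assumption: the file is a plain shell fragment, as shipped by Debian.
xendomains_cluster_safe() {
    defaults=${1:-/etc/default/xendomains}
    (
        unset XENDOMAINS_SAVE XENDOMAINS_RESTORE
        . "$defaults" || exit 2
        [ -z "$XENDOMAINS_SAVE" ] && [ "$XENDOMAINS_RESTORE" != "true" ]
    )
}

# Succeed if no domU configuration is linked into the auto-start directory.
no_autostart_domu() {
    dir=${1:-/etc/xen/auto}
    [ -z "$(ls -A "$dir" 2>/dev/null)" ]
}
```

Running `xendomains_cluster_safe && no_autostart_domu` on each dom0 gives a quick yes/no answer before the cluster stack takes over.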

The idea

The idea behind the whole set-up is to get a high-availability cluster with redundant data. In this example I have two identical servers as cluster nodes, installed with the Xen hypervisor 4.0 and almost the same configuration. The configuration and image files of the Xen virtual machines are stored on a drbd device for redundancy. Drbd8 and OCFS2 allow simultaneous mounting on both nodes, which is required for live migration of Xen virtual machines.

This article describes a Heartbeat Xen cluster using Debian Squeeze, drbd8 and the OCFS2 file system.

Cluster architecture

                       |    |
                _______|    |_______
               |                    |
           ____|____            ____|____
   LAN1---| master1 |          | master2 |---LAN1
          | Dom0 A  |          | Dom0 B  |
   LAN2---|_________|          |_________|---LAN2
               |                    |
               |                    |

Both dom0 have four network interfaces. The first one is dedicated to WAN access, the second one is used for DRBD replication, cluster management and live migration, and the third and fourth are your LAN.

Cluster installation on GNU/Linux Debian

OS Installation

Install two computers with a standard minimal Debian Squeeze and an SSH server. Once the standard installation is done, go ahead and install the required packages.

Disc Partition

On both computers we split the first disk into three partitions and keep /dev/sdb1 for drbd, used as follows:

  • /dev/sda1 as swap
  • /dev/sda2 as boot
  • /dev/sda3 as root
  • /dev/sdb1 for drbd8 (just leave it as it is at installation time)

Network Configuration

Node   Hostname  IP-Address  Interface
Node1  master1               eth0
Node2  master2               eth0

XEN system

We start by installing the Xen hypervisor and booting the Xen kernel (see http://en.wikipedia.org/wiki/Xen).

apt-get install xen-hypervisor-4.0-amd64 xen-linux-system-2.6-xen-amd64 linux-headers-2.6-xen-amd64 \
xen-utils-4.0 xen-utils-common xenstore-utils xenwatch xen-tools bridge-utils xen-docs-4.0 xen-qemu-dm-4.0 \
python-xmltv vlan ifenslave-2.6

Answer yes for additional software.

Update GRUB2

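Since Debian GNU/Linux changed from GRUB to GRUB2, we first have to move the plain Linux entries behind the Xen entries, so that the Xen hypervisor becomes the default boot entry, and then regenerate the GRUB configuration:

mv -i /etc/grub.d/10_linux /etc/grub.d/50_linux
update-grub

We can now reboot the system into the Xen hypervisor kernel.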

Check your system

First let's see if our kernel is right

uname -r

If the release string contains "xen" (the xen-linux-system package installed above provides such a kernel), you are now running a Xen-capable kernel.


Let's check if the hypervisor is running

xm dmesg

If no errors are displayed, you are up and running.
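Both checks can be wrapped into two small helpers so they can be repeated after every reboot. This is only a sketch: it assumes that a Xen dom0 kernel carries "xen" in its release string, and the function names are invented for the example:

```shell
#!/bin/sh
# Succeed if a kernel release string looks like a Xen dom0 kernel.
# Assumption: Debian's Xen kernels carry "xen" in the release string.
is_xen_kernel() {
    case "$1" in
        *xen*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Succeed if the hypervisor answers; "xm dmesg" fails when Xen is not running.
hypervisor_running() {
    xm dmesg >/dev/null 2>&1
}
```

For example: `is_xen_kernel "$(uname -r)" && hypervisor_running && echo "Xen is up"`.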


OCFS2

OCFS2 (http://oss.oracle.com/projects/ocfs2/) is a cluster file system which allows simultaneous access from many nodes. We will set it up on our drbd device to access it from both nodes simultaneously. While configuring OCFS2 we provide information about the nodes that will access the file system later. Every node that has an OCFS2 file system mounted must regularly write into the file system's metadata, letting the other nodes know that it is still alive.


apt-get install ocfs2-tools ocfs2console


nano /etc/ocfs2/cluster.conf

node:
       ip_port = 7777
       ip_address =
       number = 0
       name = master1
       cluster = ocfs2

node:
       ip_port = 7777
       ip_address =
       number = 1
       name = master2
       cluster = ocfs2

cluster:
       node_count = 2
       name = ocfs2

/etc/init.d/o2cb restart
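A common slip in this file is a node_count that no longer matches the number of node entries. The sketch below compares the two; it recognises a node entry by its `number = N` line, and the function name is an assumption made for this example:

```shell
#!/bin/sh
# Compare the number of node entries in an OCFS2 cluster.conf with node_count.
# Each node entry is recognised by its "number = N" line.
ocfs2_node_count_ok() {
    conf=${1:-/etc/ocfs2/cluster.conf}
    awk -F'=' '
        /^[ \t]*number[ \t]*=/     { nodes++ }
        /^[ \t]*node_count[ \t]*=/ { declared = $2 + 0 }
        END { exit (nodes > 0 && nodes == declared) ? 0 : 1 }
    ' "$conf"
}
```

Run `ocfs2_node_count_ok || echo "node_count does not match"` on both nodes before restarting o2cb.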





DRBD

The advantage of drbd8 over drbd7 is that it allows the drbd resource to be “master” on both nodes, so it can be mounted read-write on both.

apt-get install drbd8-utils drbdlinks


nano /etc/drbd.d/global_common.conf

See DRBD.org for more info. Here is my sample file:

global {
       usage-count yes;
       # minor-count dialog-refresh disable-ip-verification
}

common {
       protocol C;

       handlers {
               pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
               pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
               local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
               fence-peer "/usr/lib/drbd/crm-fence-peer.sh";

               # Notify;
               # initial-split-brain "/usr/lib/drbd/notify-split-brain.sh root";
               split-brain "/usr/lib/drbd/notify-split-brain.sh root";
               out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";

               # Sync-handlers
               before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
               after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
               outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
       }

       startup {
               # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb;
       }

       disk {
               # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
               # no-disk-drain no-md-flushes max-bio-bvecs
               on-io-error detach;
       }

       net {
               # snd-buf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
               # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
               # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
               after-sb-0pri discard-zero-changes;
               after-sb-1pri discard-secondary;
               after-sb-2pri disconnect;
               rr-conflict disconnect;
               # NB. You don't need this if your servers are not running over the network
               cram-hmac-alg sha1;
               shared-secret "shared-secret";
       }

       syncer {
               # rate after al-extents use-rle cpu-mask verify-alg csums-alg
               rate 1G;
               al-extents 127;
       }
}

"allow-two-primaries" option in net section of drbd.conf allows the resource to be mounted as "master" on both nodes (primary).

nano /etc/drbd.d/master1-2.res

resource master1-2 {
       device    /dev/drbd1;
       disk      /dev/sdb1;
       meta-disk internal;

       on master1 {
               # address <replication IP of master1>:<port>;
       }
       on master2 {
               # address <replication IP of master2>:<port>;
       }
}

Copy /etc/drbd.d/global_common.conf and /etc/drbd.d/master1-2.res to node2 and restart drbd on both nodes with the following command.

/etc/init.d/drbd restart
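Since the whole set-up depends on both nodes being primary at once, it is worth confirming that allow-two-primaries is really active and not just mentioned in a comment. A small sketch that inspects a configuration dump read from stdin; the function name is an assumption, while `drbdadm dump` is the regular way to print the parsed configuration:

```shell
#!/bin/sh
# Succeed if a DRBD configuration dump (read from stdin) enables
# allow-two-primaries as an active, uncommented setting.
has_two_primaries() {
    grep -Eq '^[[:space:]]*allow-two-primaries([[:space:]]+yes)?[[:space:]]*;'
}
```

Typical use: `drbdadm dump master1-2 | has_two_primaries || echo "allow-two-primaries is not set"`.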

Check the status; it should look something like this:

/etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
m:res        cs         ro                   ds                         p  mounted
1:master1-2  Connected  Secondary/Secondary  Inconsistent/Inconsistent  C  r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:200768

The Inconsistent/Inconsistent disk state is expected at this point. By now, DRBD has successfully allocated both disk and network resources and is ready for operation. What it does not know yet is which of your nodes should be used as the source of the initial device synchronization.

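You can now make master1 the source of the initial synchronization by promoting the resource to primary there. Because both disks are still Inconsistent, this first promotion has to be forced:

drbdadm -- --overwrite-data-of-peer primary master1-2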

This would be the perfect time to go grab a drink or hit the washroom. You need to be patient. Once the sync has finished, promote the resource on master2 as well (drbdadm primary master1-2). You can check the sync status with the following command

/etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
m:res        cs         ro               ds                 p  mounted
1:master1-2  Connected  Primary/Primary  UpToDate/UpToDate  C  r---

As you can see, the resource is "master" on both nodes and the drbd device is now accessible under /dev/drbd1.
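The status line can also be checked by a script, for example before creating the file system or starting any domU. The following sketch parses the output format shown above (one line per resource: id:name, connection state, roles, disk states); the function name is invented for the example:

```shell
#!/bin/sh
# Read "/etc/init.d/drbd status" output from stdin and succeed only if the
# given resource is Connected, Primary/Primary and UpToDate/UpToDate.
drbd_dual_primary() {
    res=$1
    awk -v res="$res" '
        $1 ~ /^[0-9]+:/ {
            split($1, id, ":")
            if (id[2] == res && $2 == "Connected" &&
                $3 == "Primary/Primary" && $4 == "UpToDate/UpToDate")
                ok = 1
        }
        END { exit ok ? 0 : 1 }
    '
}
```

Example: `/etc/init.d/drbd status | drbd_dual_primary master1-2 && echo "ready"`.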

File system

We can now create a file system on /dev/drbd1 with the following command

mkfs.ocfs2 /dev/drbd1

This can be mounted on both nodes simultaneously with

mkdir /XEN
mount.ocfs2 /dev/drbd1 /XEN

Now we have a common storage which will be synchronized with drbd on both nodes
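Because domU must not run when the cluster FS is missing, a quick test for the mount is handy. This sketch looks the mount point up in a mounts table in the `/proc/mounts` format; the function name and the optional second argument are assumptions made for the example:

```shell
#!/bin/sh
# Succeed if the given mount point is mounted with file system type ocfs2.
# The mounts table defaults to /proc/mounts (device, mount point, type, ...).
is_ocfs2_mounted() {
    mountpoint=$1
    mounts=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp && $3 == "ocfs2" { found = 1 }
        END { exit found ? 0 : 1 }
    ' "$mounts"
}
```

Example: `is_ocfs2_mounted /XEN || echo "/XEN is not mounted"`.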

Init script

We have to make sure that after a reboot the system sets the drbd resource to “master” again and mounts it on "/XEN" before Heartbeat and the Xen machines start.

nano /etc/init.d/mountdrbd.sh

#! /bin/sh
# /etc/init.d/mountdrbd.sh
#
### BEGIN INIT INFO
# Provides:       mountdrbd
# Required-Start: $local_fs $network $syslog drbd
# Required-Stop:  $local_fs $network $syslog drbd
# Should-Start:   sshd multipathd
# Should-Stop:    sshd multipathd
# Default-Start:  2 3 4 5
# Default-Stop:   0 1 6
# X-Start-Before: heartbeat corosync
# X-Stop-After:   heartbeat corosync
# Short-Description:    Mount drbd filesystem.
### END INIT INFO

# Carry out specific functions when asked to by the system
case "$1" in
  start)
   # Promote the resource on this node, then mount the cluster FS
   echo "Mounting master1-2 in /XEN"
   drbdadm primary master1-2
   mount.ocfs2 /dev/drbd1 /XEN
   ;;
  stop)
   echo "Umounting /XEN"
   umount /XEN
   ;;
  *)
   echo "Usage: /etc/init.d/mountdrbd.sh {start|stop}"
   exit 1
   ;;
esac

exit 0

Make it executable and register it with insserv, which creates the runlevel symlinks from the header of the script:

chmod 755 /etc/init.d/mountdrbd.sh
insserv mountdrbd.sh

Actually, this step can also be integrated into Heartbeat by adding the appropriate resources to the configuration. But for the time being we will do it with the script.





Heartbeat

Now we can install and set up Heartbeat

apt-get install heartbeat ocfs2-tools-pacemaker pacemaker cluster-glue

Answer yes for additional software.

nano /etc/ha.d/ha.cf

crm on
bcast eth0
node master1 master2

and restart heartbeat

/etc/init.d/heartbeat restart
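Heartbeat matches the names on the node line against the output of `uname -n`, so a mismatch there is a frequent reason for a node not joining. A small sketch that verifies a ha.cf file lists the expected names; the function name is an assumption for this example:

```shell
#!/bin/sh
# Succeed if every given host name appears on a "node" line of the ha.cf file.
hacf_lists_nodes() {
    conf=$1
    shift
    for n in "$@"; do
        awk -v want="$n" '
            $1 == "node" { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$conf" || return 1
    done
}
```

On each node, `hacf_lists_nodes /etc/ha.d/ha.cf "$(uname -n)"` should succeed.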


In Heartbeat the configuration and status information of resources are stored in XML format in the file "/var/lib/heartbeat/crm/cib.xml". The syntax is very well explained by Alan Robertson in his tutorial at linux.conf.au 2007, which can be found at http://linux-ha.org/HeartbeatTutorials

This file can either be edited directly as a whole or manipulated in pieces using the cibadmin tool. We will use this tool, as it makes it much easier to manage the cluster. We will save the required components in XML files under /XEN/cluster.


mkdir /XEN/cluster
cibadmin --query > /XEN/cluster/orginal.xml
cp /XEN/cluster/orginal.xml /XEN/cluster/bootstrap.xml
nano /XEN/cluster/bootstrap.xml

In the cluster_property_set add the following after the last parameter

<cluster_property_set id="cib-bootstrap-options">
  [ ... ]
       <nvpair id="cib-bootstrap-options01" name="transition-idle-timeout" value="60"/>
       <nvpair id="cib-bootstrap-options02" name="default-resource-stickiness" value="INFINITY"/>
       <nvpair id="cib-bootstrap-options03" name="default-resource-failure-stickiness" value="-500"/>
       <nvpair id="cib-bootstrap-options04" name="stonith-enabled" value="true"/>
       <nvpair id="cib-bootstrap-options05" name="stonith-action" value="reboot"/>
       <nvpair id="cib-bootstrap-options06" name="symmetric-cluster" value="true"/>
       <nvpair id="cib-bootstrap-options07" name="no-quorum-policy" value="stop"/>
       <nvpair id="cib-bootstrap-options08" name="stop-orphan-resources" value="true"/>
       <nvpair id="cib-bootstrap-options09" name="stop-orphan-actions" value="true"/>
       <nvpair id="cib-bootstrap-options10" name="is-managed-default" value="true"/>
  [ ... ]

Load this file with the following command

cibadmin --replace --xml-file /XEN/cluster/bootstrap.xml

This will initialize the Cluster with values set in xml file.
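Before (or after) loading the file it helps to see the options as plain name=value pairs, which makes a typo in an nvpair easy to spot. A minimal sketch, assuming one nvpair per line with `name` before `value` as in the excerpt above; the function name is invented for the example:

```shell
#!/bin/sh
# Print every <nvpair ... name="X" ... value="Y"/> of an XML file as "X=Y".
# Assumption: one nvpair per line, with name preceding value.
list_cluster_options() {
    sed -n 's/.*<nvpair[^>]* name="\([^"]*\)"[^>]* value="\([^"]*\)".*/\1=\2/p' "$1"
}
```

For example, `list_cluster_options /XEN/cluster/bootstrap.xml | grep stonith` shows the two STONITH settings.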

Setting up STONITH device

STONITH prevents a "split-brain situation" (i.e. a resource unintentionally running on both nodes at the same time) by fencing the other node. Details can be found at http://www.linux-ha.org/STONITH. We will use "stonith" over ssh to reboot the faulty machine.

Follow "http://sial.org/howto/openssh/publickey-auth/" to set up public key authentication. In short, just do the following on both nodes.

On master1:

ssh-keygen -t rsa
--> save key under /root/.ssh/*
--> don't give any passphrase
scp /root/.ssh/id_rsa.pub master2:/root/.ssh/authorized_keys

On master2:

ssh-keygen -t rsa
--> save key under /root/.ssh/*
--> don't give any passphrase
scp /root/.ssh/id_rsa.pub master1:/root/.ssh/authorized_keys

Now check that you can log on from node1 to node2 via ssh without being asked for a password, and vice versa. This is what stonith over ssh relies on, so check it from master1 first:

ssh -q -x -n -l root "master2" "ls -la"

you should get a file list from master2.

ssh -q -x -n -l root "master1" "ls -la"

you should get a file list from master1.
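The same check can be scripted so that it fails instead of hanging at a password prompt. The sketch below uses ssh's BatchMode option for that; the function name and the overridable `SSH` variable are assumptions made for this example:

```shell
#!/bin/sh
# SSH may be overridden (e.g. for a dry run); by default the real client is used.
SSH=${SSH:-ssh}

# Try a passwordless root login on every given host and report the result.
check_passwordless_ssh() {
    failed=0
    for host in "$@"; do
        # BatchMode makes ssh fail instead of prompting for a password.
        if "$SSH" -q -x -n -o BatchMode=yes -l root "$host" true; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Run `check_passwordless_ssh master2` on master1 and `check_passwordless_ssh master1` on master2.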

Now we configure the "stonith" device as a cluster resource. It will be a special "clone" resource, which runs simultaneously on all nodes.

crm configure
primitive stonithclone stonith::external/ssh params hostlist="master1 master2" op monitor interval="60s"

Now check your config (verify at the crm configure prompt).

If no error is reported, you can commit your new config (commit at the same prompt).
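The text calls the STONITH resource a clone, but only the primitive is shown. As a hedged sketch of how the full sequence might look, the helper below prints the crm shell commands for a given host list. The clone name "fencing" is made up, the primitive name is taken from above, and the single-colon spelling `stonith:external/ssh` is the form used in the crm shell documentation rather than the one written above. Keep in mind that the external/ssh plugin is generally meant for testing rather than production fencing.

```shell
#!/bin/sh
# Print the crm shell commands for an ssh-based STONITH clone.
# The clone name "fencing" is illustrative; all arguments form the host list.
stonith_crm_commands() {
    hostlist=$*
    cat <<EOF
primitive stonithclone stonith:external/ssh params hostlist="$hostlist" op monitor interval="60s"
clone fencing stonithclone
verify
commit
EOF
}
```

`stonith_crm_commands master1 master2` prints four lines that can be entered one by one at the `crm configure` prompt.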


XEN as cluster resource