OCFS2 Interview Questions

How do I get started?

Download and install the module and tools rpms.
Create cluster.conf and propagate to all nodes.
Configure and start the O2CB cluster service.
Format the volume.
Mount the volume.

How do I know the version number running?

# cat /proc/fs/ocfs2/version
OCFS2 1.2.8 Tue Feb 12 20:22:48 EST 2008 (build 9c7ae8bb50ef6d8791df2912775adcc5)

How do I configure my system to auto-reboot after a panic?

To auto-reboot the system 60 secs after a panic, do:

# echo 60 > /proc/sys/kernel/panic

To enable the above on every reboot, add the following to /etc/sysctl.conf:

kernel.panic = 60

How do I know which package to install on my box?

After one identifies the package name and version to install, one still needs to determine the kernel version, flavor and architecture.

To know the kernel version and flavor, do:

# uname -r
2.6.9-22.0.1.ELsmp

To know the architecture, do:

# rpm -qf /boot/vmlinuz-`uname -r` --queryformat "%{ARCH}\n"
i686

Why can’t I use uname -p to determine the kernel architecture?

uname -p does not always provide the exact kernel architecture. Case in point: the RHEL3 kernels on x86_64. Even though Red Hat has two different kernel architectures available for this port, ia32e and x86_64, uname -p identifies both as the generic x86_64.

How do I install the rpms?

First install the tools and console packages:

# rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm

Then install the appropriate kernel module package:

# rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm
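The module package name can be assembled from the values gathered earlier. The following is a minimal sketch, assuming the naming pattern seen in the example above; the helper name ocfs2_module_rpm is hypothetical and the "-1" package release is an assumption taken from that example.

```shell
# Hypothetical helper: build the kernel module package file name.
# Pattern inferred from the example above; the "-1" release is an assumption.
ocfs2_module_rpm() {
    kernel_release=$1   # e.g. the output of: uname -r
    ocfs2_version=$2    # e.g. 1.2.1
    arch=$3             # e.g. the output of the rpm -qf query shown earlier
    printf 'ocfs2-%s-%s-1.%s.rpm\n' "$kernel_release" "$ocfs2_version" "$arch"
}

# Prints: ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm
ocfs2_module_rpm 2.6.9-22.0.1.ELsmp 1.2.1 i686
```

Always confirm the resulting name against the list of packages actually available for download.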

Do I need to install the console?

No, the console is not required but recommended for ease-of-use.

What are the dependencies for installing ocfs2console?

ocfs2console requires e2fsprogs, glib2 2.2.3 or later, vte 0.11.10 or later, pygtk2 (RHEL4) or python-gtk (SLES9) 1.99.16 or later, python 2.3 or later and ocfs2-tools.

What modules are installed with the OCFS2 1.2 package?

configfs.ko
ocfs2.ko
ocfs2_dlm.ko
ocfs2_dlmfs.ko
ocfs2_nodemanager.ko
debugfs.ko

The kernel shipped along with Enterprise Linux 5 includes configfs.ko and debugfs.ko.

What tools are installed with the ocfs2-tools 1.2 package?

mkfs.ocfs2
fsck.ocfs2
tunefs.ocfs2
debugfs.ocfs2
mount.ocfs2
mounted.ocfs2
ocfs2cdsl
ocfs2_hb_ctl
o2cb_ctl
o2cb - init service to start/stop the cluster
ocfs2 - init service to mount/umount ocfs2 volumes
ocfs2console - installed with the console package

debugfs is an in-memory filesystem developed by Greg Kroah-Hartman. It is useful for debugging as it allows kernel space to easily export data to userspace. It is currently being used by OCFS2 to dump the list of filesystem locks and could be used for more in the future. It is bundled with OCFS2 as the various distributions are currently not bundling it. While debugfs and debugfs.ocfs2 are unrelated in general, the latter is used as the front-end for the debugging info provided by the former. For example, refer to the troubleshooting section.

How do I populate /etc/ocfs2/cluster.conf?

If you have installed the console, use it to create this configuration file. For details, refer to the user’s guide. If you do not have the console installed, check the Appendix in the User’s guide for a sample cluster.conf and the details of all the components. Do not forget to copy this file to all the nodes in the cluster. If you ever edit this file on any node, ensure the other nodes are updated as well.
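For orientation, the sketch below prints stanzas in the layout used by the sample in the User's guide, as far as we can reproduce it here: a stanza header at the start of the line, followed by tab-indented "name = value" lines. The helper names, node names, IP addresses and port 7777 are illustrative assumptions; compare the result against the sample in the guide before using it.

```shell
# Hypothetical helpers: print cluster.conf stanzas to stdout.
# The layout is an assumption based on the User's guide sample.
cluster_stanza() {
    # args: cluster_name node_count
    printf 'cluster:\n'
    printf '\tnode_count = %s\n' "$2"
    printf '\tname = %s\n' "$1"
    printf '\n'
}

node_stanza() {
    # args: node_name number ip_address ip_port cluster_name
    printf 'node:\n'
    printf '\tip_port = %s\n' "$4"
    printf '\tip_address = %s\n' "$3"
    printf '\tnumber = %s\n' "$2"
    printf '\tname = %s\n' "$1"
    printf '\tcluster = %s\n' "$5"
    printf '\n'
}

# Example: a two-node cluster named ocfs2 (all values are placeholders).
cluster_stanza ocfs2 2
node_stanza node1 0 192.168.0.101 7777 ocfs2
node_stanza node2 1 192.168.0.102 7777 ocfs2
```

Redirect the output to a scratch file, review it, and only then copy it to /etc/ocfs2/cluster.conf on every node.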

Should the IP interconnect be public or private?

Using a private interconnect is recommended. While OCFS2 does not take much bandwidth, it does require the nodes to be alive on the network and sends regular keepalive packets to ensure that they are. To avoid a network delay being interpreted as a node disappearing on the net which could lead to a node-self-fencing, a private interconnect is recommended. One could use the same interconnect for Oracle RAC and OCFS2.

The node name needs to match the hostname. The IP address need not be the one associated with that hostname. As in, any valid IP address on that node can be used. OCFS2 will not attempt to match the node name (hostname) with the specified IP address.

How do I modify the IP address, port or any other information specified in cluster.conf?

While one can use ocfs2console to add nodes dynamically to a running cluster, any other modifications require the cluster to be offlined. Stop the cluster on all nodes, edit /etc/ocfs2/cluster.conf on one and copy to the rest, and restart the cluster on all nodes. Always ensure that cluster.conf is the same on all the nodes in the cluster.

How do I add a new node to an online cluster?

You can use the console to add a new node. However, you will need to explicitly add the new node on all the online nodes. That is, adding on one node and propagating to the other nodes is not sufficient. If the operation fails, it will most likely be due to bug#741. In that case, you can use the o2cb_ctl utility on all online nodes as follows:

# o2cb_ctl -C -i -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME

How do I add a new node to an offline cluster?

You can either use the console or use o2cb_ctl or simply hand edit cluster.conf. Then either use the console to propagate it to all nodes or hand copy using scp or any other tool. The o2cb_ctl command to do the same is:

# o2cb_ctl -C -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME

Notice the “-i” argument is not required as the cluster is not online.

How do I configure the cluster service?

# /etc/init.d/o2cb configure

Enter ‘y’ if you want the service to load on boot and the name of the cluster (as listed in /etc/ocfs2/cluster.conf) and the cluster timeouts.

How do I start the cluster service?

To load the modules, do:

# /etc/init.d/o2cb load

To online it, do:

# /etc/init.d/o2cb online [cluster_name]

If you have configured the cluster to load on boot, you could combine the two as follows:

# /etc/init.d/o2cb start [cluster_name]

The cluster name is not required if you have specified the name during configuration.

How do I stop the cluster service?

To offline it, do:

# /etc/init.d/o2cb offline [cluster_name]

To unload the modules, do:

# /etc/init.d/o2cb unload

If you have configured the cluster to load on boot, you could combine the two as follows:

# /etc/init.d/o2cb stop [cluster_name]

The cluster name is not required if you have specified the name during configuration.

How can I learn the status of the cluster?

To learn the status of the cluster, do:

# /etc/init.d/o2cb status

I am unable to get the cluster online. What could be wrong?

Check whether the node name in cluster.conf exactly matches the hostname. One of the nodes in cluster.conf needs to be in the cluster for the cluster to be online.
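The hostname check can be scripted. This is a minimal sketch, assuming cluster.conf uses "node:" and "cluster:" stanza headers followed by "name = value" lines; the function name node_in_conf is hypothetical.

```shell
# Hypothetical helper: succeed if some node: stanza in the given file
# has a name that is exactly the given host name.
node_in_conf() {
    # $1 = host name to look for, $2 = path to cluster.conf
    awk -v host="$1" '
        /^node:/    { in_node = 1; next }
        /^cluster:/ { in_node = 0; next }
        in_node && $1 == "name" && $2 == "=" && $3 == host { found = 1 }
        END { exit !found }
    ' "$2"
}

conf=/etc/ocfs2/cluster.conf
if [ -r "$conf" ] && node_in_conf "$(hostname)" "$conf"; then
    echo "node name matches hostname"
else
    echo "no node stanza named $(hostname) found in $conf"
fi
```

Restricting the match to node: stanzas avoids a false positive when the cluster name happens to equal the hostname.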

Should I partition a disk before formatting?

Yes, partitioning is recommended even if one is planning to use the entire disk for ocfs2. Apart from the fact that partitioned disks are less likely to be “reused” by mistake, some features like mount-by-label only work with partitioned volumes. Use fdisk or parted or any other tool for the task.

How do I format a volume?

You could either use the console or use mkfs.ocfs2 directly to format the volume. For console, refer to the user’s guide.

# mkfs.ocfs2 -L "oracle_home" /dev/sdX

The above formats the volume with default block and cluster sizes, which are computed based upon the size of the volume.

# mkfs.ocfs2 -b 4k -C 32K -L "oracle_home" -N 4 /dev/sdX

The above formats the volume for 4 nodes with a 4K block size and a 32K cluster size.

What does the number of node slots during format refer to?

The number of node slots specifies the number of nodes that can concurrently mount the volume. This number is specified during format and can be increased using tunefs.ocfs2. This number cannot be decreased.

What should I consider when determining the number of node slots?

OCFS2 allocates system files, like the journal, for each node slot. So as not to waste space, one should specify a number within the ballpark of the actual number of nodes. Also, as this number can be increased later, there is no need to specify a much larger number than one plans for mounting the volume.

Does the number of node slots have to be the same for all volumes?

No. This number can be different for each volume.

What block size should I use?

The block size is the smallest unit of space addressable by the file system. OCFS2 supports block sizes of 512 bytes, 1K, 2K and 4K. The block size cannot be changed after the format. For most volume sizes, a 4K block size is recommended. On the other hand, a 512-byte block size is never recommended.

What cluster size should I use?

A cluster size is the smallest unit of space allocated to a file to hold the data. OCFS2 supports cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M. For database volumes, a cluster size of 128K or larger is recommended. For Oracle home, 32K to 64K.

Any advantage of labelling the volumes?

In a shared disk environment, the disk name (/dev/sdX) for a particular device may be different on different nodes, so labelling becomes a must for easy identification. You could also use labels to identify volumes during mount.

# mount -L "label" /dir

The volume label is changeable using the tunefs.ocfs2 utility.

Can OCFS2 file systems be grown in size?

Yes, you can grow an OCFS2 file system using tunefs.ocfs2. It should be noted that the tool will only resize the file system and not the underlying partition. You can use fdisk(8) (or any appropriate tool for your disk array) to resize the partition.

What do I need to know to use fdisk(8) to resize the partition?

To grow a partition using fdisk(8), you will have to delete it and recreate it with a larger size. When recreating it, ensure you specify the same starting disk cylinder as before and an ending disk cylinder that is greater than the existing one. Otherwise, not only will the resize operation fail, but you may lose your entire file system. Back up your data before performing this task.

Short of reboot, how do I get the other nodes in the cluster to see the resized partition?

Use blockdev(8) to rescan the partition table of the device on the other nodes in the cluster.

# blockdev --rereadpt /dev/sdX

What is the tunefs.ocfs2 syntax for resizing the file system?

To grow a file system to the end of the resized partition, do:

# tunefs.ocfs2 -S /dev/sdX

For more, refer to the tunefs.ocfs2 manpage.

Can the OCFS2 file system be grown while the file system is in use?

No. tunefs.ocfs2 1.2.2 only allows offline resize, that is, the file system must not be mounted on any node in the cluster. The online resize capability will be added later.

Can the OCFS2 file system be shrunk in size?

No. We have no current plans to provide this functionality. However, if you would find this feature useful, file an enhancement request on bugzilla listing your reasons.

How do I mount the volume?

You could either use the console or use mount directly.

# mount -t ocfs2 /dev/sdX /dir

The above command will mount device /dev/sdX on directory /dir.

How do I mount by label?

To mount by label do:

# mount -L "label" /dir

What entry do I add to /etc/fstab to mount an ocfs2 volume?

Add the following:

/dev/sdX /dir ocfs2 _netdev 0 0

The _netdev option indicates that the device needs to be mounted after the network is up.

What do I need to do to mount OCFS2 volumes on boot?

Enable o2cb service using:

# chkconfig --add o2cb

Enable ocfs2 service using:

# chkconfig --add ocfs2

Configure o2cb to load on boot using:

# /etc/init.d/o2cb configure

Add entries into /etc/fstab as follows:

/dev/sdX /dir ocfs2 _netdev 0 0

How do I know my volume is mounted?

Enter mount without arguments, or,

# mount

List /etc/mtab, or,

# cat /etc/mtab

List /proc/mounts, or,

# cat /proc/mounts

Run ocfs2 service.

# /etc/init.d/ocfs2 status

The mount command reads /etc/mtab to show this information.

What are the /config and /dlm mountpoints for?

OCFS2 comes bundled with two in-memory filesystems, configfs and ocfs2_dlmfs. configfs is used by the ocfs2 tools to communicate to the in-kernel node manager the list of nodes in the cluster and to the in-kernel heartbeat thread the resource to heartbeat on. ocfs2_dlmfs is used by the ocfs2 tools to communicate with the in-kernel dlm to take and release clusterwide locks on resources.

Why does it take so much time to mount the volume?

It takes around 5 secs for a volume to mount. The delay lets the heartbeat thread stabilize. In a later release, we plan to add support for a global heartbeat, which will make most mounts instant.

Why does it take so much time to umount the volume?

During umount, the dlm has to migrate all the mastered lockres’ to another node in the cluster. In 1.2, the lockres migration is a synchronous operation. We are looking into making it asynchronous so as to reduce the time it takes to migrate the lockres’. (While we have improved this performance in 1.2.5, the task of asynchronously migrating lockres’ has been pushed to the 1.4 time frame.) To find the number of lockres in all dlm domains, do:

# cat /proc/fs/ocfs2_dlm/*/stat
local=60624, remote=1, unknown=0, key=0x8619a8da

local refers to locally mastered lockres’.
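With several volumes mounted there is one stat file per dlm domain. A minimal sketch for adding up the locally mastered lockres’ follows; it assumes every stat line has the format shown in the sample above, and the function name sum_local_lockres is hypothetical.

```shell
# Hypothetical helper: read stat lines on stdin and print the sum of
# the local= fields. Assumes the line format shown in the sample above.
sum_local_lockres() {
    sed -n 's/^local=\([0-9][0-9]*\),.*/\1/p' |
        awk '{ total += $1 } END { print total + 0 }'
}

# Example with the sample line; prints 60624.
echo 'local=60624, remote=1, unknown=0, key=0x8619a8da' | sum_local_lockres
```

On a live node, feed it the real files: cat /proc/fs/ocfs2_dlm/*/stat | sum_local_lockres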

Any special flags to run Oracle RAC?

OCFS2 volumes containing the Voting diskfile (CRS), Cluster registry (OCR), Data files, Redo logs, Archive logs and Control files must be mounted with the datavolume and nointr mount options. The datavolume option ensures that the Oracle processes open these files with the O_DIRECT flag. The nointr option ensures that the I/Os are not interrupted by signals.

# mount -o datavolume,nointr -t ocfs2 /dev/sda1 /u01/db

What about the volume containing Oracle home?

Oracle home volume should be mounted normally, that is, without the datavolume and nointr mount options. These mount options are only relevant for Oracle files listed above.

# mount -t ocfs2 /dev/sdb1 /software/orahome

Also, as OCFS2 does not currently support shared writeable mmap, the health check (GIMH) file $ORACLE_HOME/dbs/hc_ORACLESID.dat and the ASM file $ASM_HOME/dbs/ab_ORACLESID.dat should be symlinked to a local filesystem. We expect to support shared writeable mmap in the OCFS2 1.4 release.

Does that mean I cannot have my data file and Oracle home on the same volume?

Yes. The Oracle data files, redo logs, etc. should never be on the same volume as the distribution (including trace logs like alert.log).

Any other information I should be aware of?

The 1.2.3 release of OCFS2 does not update the modification time on the inode across the cluster for non-extending writes. However, the time will be locally updated in the cached inodes. This leads to one observing different times (ls -l) for the same file on different nodes on the cluster.

While this does not affect most uses of the filesystem, as one invariably changes the file size during write, the one usage where this is most commonly experienced is with Oracle datafiles and redologs. This is because Oracle rarely resizes these files and thus almost all writes are non-extending.

In OCFS2 1.4, we intend to fix this by updating modification times for all writes while providing an opt-out mount option (nocmtime) for users who would prefer to avoid the performance overhead associated with this feature.

Can I mount OCFS volumes as OCFS2?

No. OCFS and OCFS2 are not on-disk compatible. We had to break the compatibility in order to add many of the new features. At the same time, we have added enough flexibility in the new disk layout so as to maintain backward compatibility in the future.

Can OCFS volumes and OCFS2 volumes be mounted on the same machine simultaneously?

No. OCFS only works on 2.4 linux kernels (Red Hat’s AS2.1/EL3 and SuSE’s SLES8). OCFS2, on the other hand, only works on the 2.6 kernels (RHEL4, SLES9 and SLES10).

Can I access my OCFS volume on 2.6 kernels (SLES9/SLES10/RHEL4)?

Yes, you can access the OCFS volume on 2.6 kernels using FSCat tools, fsls and fscp. These tools can access the OCFS volumes at the device layer, to list and copy the files to another filesystem. FSCat tools are available on oss.oracle.com.

Can I in-place convert my OCFS volume to OCFS2?

No. The on-disk layouts of OCFS and OCFS2 are sufficiently different that it would require a third disk (as a temporary buffer) in order to in-place upgrade the volume. With that in mind, it was decided not to develop such a tool but instead provide tools to copy data from OCFS without one having to mount it.

What is the quickest way to move data from OCFS to OCFS2?

Quickest would mean having to perform the minimal number of copies. If you have a current backup on a non-OCFS volume accessible from the 2.6 kernel install, then all you would need to do is to restore the backup on the OCFS2 volume(s). If you do not have a backup but have a setup in which the system containing the OCFS2 volumes can access the disks containing the OCFS volume, you can use the FSCat tools to extract data from the OCFS volume and copy it onto OCFS2.

Can I export an OCFS2 file system via NFS?

Yes, you can export files on OCFS2 via the standard Linux NFS server. Please note that only NFS version 3 and above will work. In practice, this means clients need to be running a 2.4.x kernel or above.

Is there no solution for the NFS v2 clients?

NFS v2 clients can work if the server exports the volumes with the no_subtree_check option. However, this has some security implications that are documented in the exports manpage.

How do I enable and disable filesystem tracing?

To list all the debug bits along with their statuses, do:

# debugfs.ocfs2 -l

To enable tracing the bit SUPER, do:

# debugfs.ocfs2 -l SUPER allow

To disable tracing the bit SUPER, do:

# debugfs.ocfs2 -l SUPER off

To totally turn off tracing the SUPER bit, as in, turn off tracing even if some other bit is enabled for the same, do:

# debugfs.ocfs2 -l SUPER deny

To enable heartbeat tracing, do:

# debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow

To disable heartbeat tracing, do:

# debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny

How do I get a list of filesystem locks and their statuses?

OCFS2 1.0.9+ has this feature. To get this list, do:

Mount debugfs at /debug (EL4) or /sys/kernel/debug (EL5).

# mount -t debugfs debugfs /debug

- OR -

# mount -t debugfs debugfs /sys/kernel/debug

Dump the locks.

# echo "fs_locks" | debugfs.ocfs2 /dev/sdX >/tmp/fslocks

How do I read the fs_locks output?

Let’s look at a sample output:

Lockres: M000000000000000006672078b84822 Mode: Protected Read
Flags: Initialized Attached
RO Holders: 0 EX Holders: 0
Pending Action: None Pending Unlock Action: None
Requested Mode: Protected Read Blocking Mode: Invalid

First thing to note is the Lockres, which is the lockname. The dlm identifies resources using locknames. A lockname is a combination of a lock type (S superblock, M metadata, D filedata, R rename, W readwrite), inode number and generation.

To get the inode number and generation from lockname, do:

# echo "stat <M000000000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdX
Inode: 419616 Mode: 0666 Generation: 2025343010 (0x78b84822)
....
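The same two numbers can be read straight off the lockname without touching the device. The sketch below assumes the layout suggested by the sample: the first character is the lock type, the last 8 hex digits are the generation, and the 16 hex digits before them are the inode number. The function name decode_lockname is hypothetical.

```shell
# Hypothetical helper: split a lockname into type, inode number and
# generation. The field widths are inferred from the sample lockname.
decode_lockname() {
    name=$1
    len=${#name}
    lock_type=$(printf '%s' "$name" | cut -c1)
    inode_hex=$(printf '%s' "$name" | cut -c"$((len - 23))-$((len - 8))")
    gen_hex=$(printf '%s' "$name" | cut -c"$((len - 7))-$len")
    # Shell arithmetic converts the hex fields to decimal.
    echo "$lock_type $((0x$inode_hex)) $((0x$gen_hex))"
}

# Prints: M 419616 2025343010
decode_lockname M000000000000000006672078b84822
```

The result agrees with the stat output above (inode 419616, generation 2025343010).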

To map the lockname to a directory entry, do:

# echo "locate <M000000000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdX
419616 /linux-2.6.15/arch/i386/kernel/semaphore.c

One could also provide the inode number instead of the lockname.

# echo "locate <419616>" | debugfs.ocfs2 -n /dev/sdX
419616 /linux-2.6.15/arch/i386/kernel/semaphore.c

To get a lockname from a directory entry, do:

# echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | debugfs.ocfs2 -n /dev/sdX
M000000000000000006672078b84822 D000000000000000006672078b84822 W000000000000000006672078b84822

The first is the Metadata lock, then Data lock and last ReadWrite lock for the same resource. The DLM supports 3 lock modes: NL no lock, PR protected read and EX exclusive. If you have a dlm hang, the resource to look for would be one with the “Busy” flag set.

The next step would be to query the dlm for the lock resource.

Note: The dlm debugging is still a work in progress.

To do dlm debugging, first one needs to know the dlm domain, which matches the volume UUID.

# echo "stats" | debugfs.ocfs2 -n /dev/sdX | grep UUID: | while read a b ; do echo $b ; done
82DA8137A49A47E4B187F74E09FBBB4B

Then do:

# echo R dlm_domain lockname > /proc/fs/ocfs2_dlm/debug

For example:

# echo R 82DA8137A49A47E4B187F74E09FBBB4B M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug
# dmesg | tail
struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=79, key=965960985
lockres: M000000000000000006672078b84822, owner=75, state=0 last used: 0, on purge list: no
granted queue:
type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
converting queue:
blocked queue:

It shows that the lock is mastered by node 75 and that node 79 has been granted a PR lock on the resource. This is just to give a flavor of dlm debugging.

Is there a limit to the number of subdirectories in a directory?

Yes. OCFS2 currently allows up to 32000 subdirectories. While this limit could be increased, we will not be doing it till we implement some kind of efficient name lookup (htree, etc.).

Is there a limit to the size of an ocfs2 file system?

Yes, current software addresses block numbers with 32 bits. So the file system device is limited to (2 ^ 32) * blocksize (see mkfs -b). With a 4KB block size this amounts to a 16TB file system. This block addressing limit will be relaxed in future software. At that point the limit becomes addressing clusters of 1MB each with 32 bits which leads to a 4PB file system.
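The arithmetic behind both limits can be checked with shell arithmetic. This is only a sketch of the calculation described above; the function name max_fs_tb is hypothetical.

```shell
# Hypothetical helper: largest file system, in TB, that 32-bit unit
# numbers can address for a given unit size in bytes.
max_fs_tb() {
    # $1 = addressing unit in bytes (the block size today, the cluster
    #      size once the block addressing limit is relaxed)
    echo $(( (1 << 32) * $1 / (1 << 40) ))
}

max_fs_tb 4096      # prints 16   (16TB with a 4KB block size)
max_fs_tb 1048576   # prints 4096 (4096TB, that is 4PB, with 1MB clusters)
```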

What are system files?

System files are used to store standard filesystem metadata like bitmaps, journals, etc. Storing this information in files in a directory allows OCFS2 to be extensible. These system files can be accessed using debugfs.ocfs2. To list the system files, do:

# echo "ls -l //" | debugfs.ocfs2 -n /dev/sdX
18 16 1 2 .
18 16 2 2 ..
19 24 10 1 bad_blocks
20 32 18 1 global_inode_alloc
21 20 8 1 slot_map
22 24 9 1 heartbeat
23 28 13 1 global_bitmap
24 28 15 2 orphan_dir:0000
25 32 17 1 extent_alloc:0000
26 28 16 1 inode_alloc:0000
27 24 12 1 journal:0000
28 28 16 1 local_alloc:0000
29 3796 17 1 truncate_log:0000

The first column lists the block number.

Why do some files have numbers at the end?

There are two types of files, global and local. Global files are for all the nodes, while local files, like journal:0000, are node specific. The set of local files used by a node is determined by the slot mapping of that node. The number at the end of the system file name is the slot#. To list the slot maps, do:

# echo "slotmap" | debugfs.ocfs2 -n /dev/sdX
Slot# Node#
0 39
1 40
2 41
3 42
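To find which journal belongs to a given node, combine the slot map with the naming seen in the system file listing. The sketch below assumes the slotmap output format shown above and the four-digit suffix seen in names like journal:0000; the function name journal_for_node is hypothetical.

```shell
# Hypothetical helper: read slotmap output on stdin and print the name
# of the journal used by the given node number.
journal_for_node() {
    # $1 = node number
    awk -v node="$1" '
        $1 ~ /^[0-9]+$/ && $2 == node { printf "journal:%04d\n", $1; found = 1 }
        END { exit !found }
    '
}

# Example with the sample slot map; prints journal:0002.
printf 'Slot# Node#\n0 39\n1 40\n2 41\n3 42\n' | journal_for_node 41
```

On a live system, pipe the output of the slotmap command shown above into the function.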

How does the disk heartbeat work?

Every node writes every two secs to its block in the heartbeat system file. The block offset is equal to its global node number. So node 0 writes to the first block, node 1 to the second, etc. All the nodes also read the heartbeat sysfile every two secs. As long as the timestamp is changing, that node is deemed alive.

When is a node deemed dead?

An active node is deemed dead if it does not update its timestamp for O2CB_HEARTBEAT_THRESHOLD (default=31) loops. Once a node is deemed dead, the surviving node which manages to cluster lock the dead node’s journal, recovers it by replaying the journal.

What about self fencing?

A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-xx] kernel thread, after every timestamp write, sets a timer to panic the system after that duration. If the next timestamp is written within that duration, as it should be, it first cancels that timer before setting up a new one. This way it ensures the system will self-fence if for some reason the [o2hb-xx] kernel thread is unable to update the timestamp and is thus deemed dead by the other nodes in the cluster.
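As a quick check of the formula above, the following sketch computes the self-fence window for a given threshold. The function name self_fence_secs is hypothetical.

```shell
# Hypothetical helper: seconds without a timestamp update after which a
# node self-fences, per ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2).
self_fence_secs() {
    echo $(( ($1 - 1) * 2 ))
}

self_fence_secs 31   # prints 60
self_fence_secs 7    # prints 12
```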

How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?

This parameter value could be changed by adding it to /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. This value should be the SAME on ALL the nodes in the cluster.

What should one set O2CB_HEARTBEAT_THRESHOLD to?

It should be set to the timeout value of the io layer. Most multipath solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs, set it to 31. For 120 secs, set it to 61.

O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)
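The formula can be applied as follows. This is a minimal sketch; the function name hb_threshold is hypothetical, and the VAR=value form of the printed line is an assumption about /etc/sysconfig/o2cb.

```shell
# Hypothetical helper: O2CB_HEARTBEAT_THRESHOLD for a given io layer
# timeout in secs, per (((timeout in secs) / 2) + 1).
hb_threshold() {
    echo $(( $1 / 2 + 1 ))
}

hb_threshold 60    # prints 31
hb_threshold 120   # prints 61

# Line to place in /etc/sysconfig/o2cb for a 120 sec timeout
# (assumes the usual VAR=value form of that file).
echo "O2CB_HEARTBEAT_THRESHOLD=$(hb_threshold 120)"
```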

How does one check the current active O2CB_HEARTBEAT_THRESHOLD value?

# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
7

What if a node umounts a volume?

During umount, the node will broadcast to all the nodes that have mounted that volume to drop that node from its node maps. As the journal is shutdown before this broadcast, any node crash after this point is ignored as there is no need for recovery.

I encounter “Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing” whenever I run a heavy io load?

We have encountered a bug with the default CFQ io scheduler which causes a process doing heavy io to temporarily starve out other processes. While this is not fatal for most environments, it is for OCFS2 as we expect the hb thread to be reading from and writing to the hb area at least once every 12 secs (default). This bug has been addressed by Red Hat in RHEL4 U4 (2.6.9-42.EL) and Novell in SLES9 SP3 (2.6.5-7.257). If you wish to use the DEADLINE io scheduler, you could do so by appending “elevator=deadline” to the kernel command line as follows:

For SLES9, edit the command line in /boot/grub/menu.lst.

title Linux 2.6.5-7.244-bigsmp (with deadline)
kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5
vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off
initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp

For RHEL4, edit the command line in /boot/grub/grub.conf:

title Red Hat Enterprise Linux AS (2.6.9-22.EL) (with deadline)
root (hd0,0)
kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off
initrd /initrd-2.6.9-22.EL.img

To see the current kernel command line, do:

# cat /proc/cmdline

What is a quorum?

A quorum is a designation given to a group of nodes in a cluster which are still allowed to operate on shared storage. It comes up when there is a failure in the cluster which breaks the nodes up into groups which can communicate in their groups and with the shared storage but not between groups.

How does OCFS2’s cluster services define a quorum?

The quorum decision is made by a single node based on the number of other nodes that are considered alive by heartbeating and the number of other nodes that are reachable via the network.

A node has quorum when:

- it sees an odd number of heartbeating nodes and has network connectivity to more than half of them.

OR,

- it sees an even number of heartbeating nodes and has network connectivity to at least half of them *and* has connectivity to the heartbeating node with the lowest node number.
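The two rules can be written down as a small decision function. This is a sketch of the rules exactly as worded above, not of the in-kernel implementation; the function name has_quorum and its yes/no third argument are hypothetical.

```shell
# Hypothetical helper: apply the two quorum rules as worded above.
#   $1 = number of heartbeating nodes seen
#   $2 = number of those nodes reachable over the network
#   $3 = "yes" if the lowest-numbered heartbeating node is reachable
has_quorum() {
    heartbeating=$1; connected=$2; lowest_reachable=$3
    if [ $((heartbeating % 2)) -eq 1 ]; then
        # Odd count: connectivity to more than half is required.
        [ $((connected * 2)) -gt "$heartbeating" ]
    else
        # Even count: at least half, plus the lowest-numbered node.
        [ $((connected * 2)) -ge "$heartbeating" ] && [ "$lowest_reachable" = yes ]
    fi
}

for args in "3 2 no" "3 1 yes" "2 1 yes" "2 1 no"; do
    set -- $args
    if has_quorum "$1" "$2" "$3"; then
        echo "$args: quorum"
    else
        echo "$args: no quorum"
    fi
done
```

The lowest-node rule is what breaks the tie when an even-sized cluster splits into equal halves: only the half holding the lowest-numbered node keeps quorum.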

What is fencing?

Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it doesn’t have quorum in a degraded cluster. It does this so that other nodes won’t get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described above, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test.

Due to user reports of nodes hanging during fencing, OCFS2 1.2.5 no longer uses “panic” for fencing. Instead, by default, it uses “machine restart”. This should not only prevent nodes from hanging during fencing but also allow for nodes to quickly restart and rejoin the cluster. While this change is internal in nature, we are documenting this so as to make users aware that they are no longer going to see the familiar panic stack trace during fencing. Instead they will see the message “*** ocfs2 is very sorry to be fencing this system by restarting ***” and that too probably only as part of the messages captured on the netdump/netconsole server.

If perchance the user wishes to use panic to fence (maybe to see the familiar oops stack trace or on the advice of customer support to diagnose frequent reboots), one can do so by issuing the following command after the O2CB cluster is online.

# echo 1 > /proc/fs/ocfs2_nodemanager/fence_method

Please note that this change is local to a node.

How does a node decide that it has connectivity with another?

When a node sees another come to life via heartbeating it will try and establish a TCP connection to that newly live node. It considers that other node connected as long as the TCP connection persists and the connection is not idle for O2CB_IDLE_TIMEOUT_MS. Once that TCP connection is closed or idle it will not be reestablished until heartbeat thinks the other node has died and come back alive.

How long does the quorum process take?

First a node will realize that it doesn’t have connectivity with another node. This can happen immediately if the connection is closed but can take a maximum of O2CB_IDLE_TIMEOUT_MS idle time. Then the node must wait long enough to give heartbeating a chance to declare the node dead. It does this by waiting two iterations longer than the number of iterations needed to consider a node dead. The current default of 31 iterations of 2 seconds results in waiting for 33 iterations or 66 seconds. By default, then, a maximum of 96 seconds can pass from the time a network fault occurs until a node fences itself.
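The worst case described above is the idle timeout plus the heartbeat wait. The sketch below reproduces that sum; the 30000 ms idle timeout is an assumption inferred from the stated figures (96 secs less 66 secs), and the function name max_fence_delay_secs is hypothetical.

```shell
# Hypothetical helper: maximum secs from a network fault to self-fence.
#   $1 = O2CB_HEARTBEAT_THRESHOLD (iterations of 2 secs each)
#   $2 = O2CB_IDLE_TIMEOUT_MS
max_fence_delay_secs() {
    threshold=$1; idle_ms=$2
    # Idle timeout, then two iterations more than the dead threshold.
    echo $(( idle_ms / 1000 + (threshold + 2) * 2 ))
}

# 31 iterations and an assumed 30000 ms idle timeout; prints 96.
max_fence_delay_secs 31 30000
```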

How can one prevent a node from panicking when one shuts down the other node in a 2-node cluster?

This typically means that the network is shutting down before all the OCFS2 volumes have been umounted. Ensure the ocfs2 init script is enabled. This script ensures that the OCFS2 volumes are umounted before the network is shut down. To check whether the service is enabled, do:

# chkconfig --list ocfs2
ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off

To list the startup order for runlevel 3 on RHEL4, do:

# cd /etc/rc3.d
# ls S*ocfs2* S*o2cb* S*network*
S10network S24o2cb S25ocfs2

To list the shutdown order on RHEL4, do:

# cd /etc/rc6.d
# ls K*ocfs2* K*o2cb* K*network*
K19ocfs2 K20o2cb K90network

To list the startup order for runlevel 3 on SLES9/SLES10, do:

# cd /etc/init.d/rc3.d
# ls S*ocfs2* S*o2cb* S*network*
S05network S07o2cb S08ocfs2

To list the shutdown order on SLES9/SLES10, do:

# cd /etc/init.d/rc3.d
# ls K*ocfs2* K*o2cb* K*network*
K14ocfs2 K15o2cb K17network

Please note that the default ordering in the ocfs2 scripts only includes the network service and not any shared-device specific service, like iscsi. If one is using iscsi or any shared device requiring a service to be started and shut down, please ensure that that service starts before and shuts down after the ocfs2 init service.
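The shutdown listings above can be checked mechanically. This is a minimal sketch, assuming the two-digit link naming shown in the listings; the function names link_prio and stops_in_order are hypothetical. Add further services (iscsi, for instance) in the same manner if needed.

```shell
# Hypothetical helper: print the two-digit priority of the S/K link for
# service $1, searching the link names given as the remaining arguments.
link_prio() {
    svc=$1; shift
    for link in "$@"; do
        case $link in
            [SK][0-9][0-9]"$svc") printf '%s\n' "$link" | cut -c2-3; return 0 ;;
        esac
    done
    return 1
}

# Hypothetical helper: succeed if ocfs2 stops before o2cb and o2cb
# stops before network.
stops_in_order() {
    fs=$(link_prio ocfs2 "$@") && cl=$(link_prio o2cb "$@") &&
        net=$(link_prio network "$@") || return 1
    # Strip one leading zero so the values compare as plain decimals.
    [ "${fs#0}" -lt "${cl#0}" ] && [ "${cl#0}" -lt "${net#0}" ]
}

if stops_in_order K19ocfs2 K20o2cb K90network; then
    echo "shutdown order looks right"
else
    echo "shutdown order is wrong or a link is missing"
fi
```

On a live system, pass the output of the ls command shown above, for example: stops_in_order $(ls K*ocfs2* K*o2cb* K*network*)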