Wednesday, March 28, 2012

Data Redundancy by DRBD


Has your database (or mail or file) server crashed? Is your entire department waiting for you to restore service? Are your most recent backups a month old? Are those backups off-site? Is this a frighteningly real scenario? Oh, yeah. Can it be avoided? Oh, yeah. The Distributed Replicated Block Device (DRBD) system can save the day, your data, and your job. DRBD provides data redundancy at a fraction of the cost of other solutions.
Almost every service depends on data. So, to offer a service, its data must be available. And if you want to make that service highly-available, you must first make the data it depends on highly-available.
The most natural way to do this (and hopefully, it’s something you already do on a regular basis) is to back up your data. In case you lose your active data, you just restore from the most recent backup, and the data is available again. Or, if the host your service runs on is (temporarily) unusable, you can replace it with another host configured to provide the identical service, and restore the data there.
To reduce possible downtime, you can have a second machine ready to take over.
Whenever you change the data on one machine, you back it up on the other. You can have the secondary machine switched off, and just turn it on if the primary host goes down. This is typically referred to as cold standby. Or you can have the backup machine up and running, a configuration known as a hot standby.
However, whether your standby is hot or cold, one problem remains: if the active node fails, you lose changes to the data made after the most recent backup. But even that can be addressed… if you have the bucks.
One solution is to use some kind of shared storage device. With media shared between machines, both nodes have access to the most recent data when they need it. Storage can be simple SCSI sharing, dual controller RAID arrangements like IBM’s ServeRAID, shared fiber-channel disks, or high-end storage like IBM Shark or the various EMC solutions.
While effective, these systems are relatively costly, ranging from five thousand dollars to millions. And unless you purchase the most expensive of these systems, shared storage typically has single points of failure (SPOFs) associated with it — whether they’re obvious or not. For example, some provide separate paths to a single shared bus, but have a single, internal electrical path to access the bus.
Another solution — and one that’s as good as the most expensive hardware — is live replication.

Real Time Backup with Replication
DRBD provides live replication of data. It provides a mass storage device (a block device) and distributes it over two machines. Whenever one node writes to the distributed device, the changes are replicated to the other in real time.
DRBD layers transparently over any standard block device (the “lower level device”), and uses TCP/IP over standard network interfaces for data replication. Though you can use raw devices for special purposes, the typical direct client to a block device is a filesystem, and it’s recommended that you use one of the journaling filesystems, such as Ext3 or Reiserfs. (XFS is not yet usable with DRBD.) You can think of DRBD as RAID1 over the network.
No special hardware is required, though it’s best to have a dedicated (crossover) network link for the data replication. And if you need high write throughput, you should eliminate the bottleneck of 10/100 megabit Ethernet and use Gigabit Ethernet instead. (To tune it further, you can increase the MTU to something greater than the typical file system block size, say, 5000 bytes.) Thus, for the cost of a single, proprietary shared storage solution, you can set up several DRBD clusters.
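For example, on a dedicated replication link you might raise the MTU like this (a sketch only: eth1 is a placeholder for your crossover interface, and jumbo frames must be supported by the NICs at both ends):
paul#  ifconfig eth1 mtu 5000
silas# ifconfig eth1 mtu 5000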

Installing from Binary Packages
When there are (official or unofficial) packages available for your favorite distribution, just install from those, and you’re done. For example, SuSE officially includes DRBD and Heartbeat in its standard distribution, as well as in SuSE Linux Enterprise Server 8 (SLES8).
The most recent “unofficial” SuSE packages can be found in Lars Marowsky-Bree’s subtree at ftp.suse.com/pub/people/lmb/drbd and its mirrors. For Debian users, David Krovich provides prebuilt packages (via the apt updater) at http://fsrc.csee.wvu.edu/debian/apt-repository/binary/ and source packages at http://fsrc.csee.wvu.edu/debian/apt-repository/source/.
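If you use apt, the repository can be wired into /etc/apt/sources.list with entries along these lines (a sketch only; the exact suite and component layout of the repository may differ):
deb     http://fsrc.csee.wvu.edu/debian/apt-repository/binary/ ./
deb-src http://fsrc.csee.wvu.edu/debian/apt-repository/source/ ./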
If you need to compile DRBD from source, get a DRBD source package or source tarball from the download section of http://www.drbd.org, or check it out from CVS. Be sure to have the kernel sources for your running kernel, and make sure that the kernel source tree configuration matches the configuration of the running kernel. For reference, these are the steps for SuSE:
# cd /usr/src/linux
# make cloneconfig; make dep
# cd /wherever/drbd
# make; make install
If you are working from the source tarball, back up the drbd/documentation/ subdirectory first. Since the sgml/docbook toolchain is difficult to get right, the tarball contains “precompiled” man pages and documentation, which might be corrupted by an almost, but not quite, matching SGML environment.

DRBD Configuration
Once installed, you need to tell DRBD about its environment. You should be able to find a sample configuration file in /etc/drbd.conf; if not, there is a well-commented one in the drbd/scripts/ subdirectory.
drbd.conf is divided into at most one global {} section and an “arbitrary” number of resource resource-id {} sections, where resource-id is typically something like drbd2.
In the global section, you can use minor-count to specify how many drbds (here, in lower case, drbd refers to the block devices) you want to be able to configure, in case you want to define more resources later without reloading the module (which would interrupt services).
Each resource {} section further splits into resource settings, grouped into disk {}, net {}, and node-specific settings, where the latter are grouped in on hostname {} subsections. Parameters you need to change are the hostname, device, physical disk, virtual disk-size, and the Internet address and port number.
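To give you an idea of the overall shape, here is a minimal sketch of a drbd.conf. Every name, size, and address below is a placeholder, and the exact parameter syntax differs between DRBD releases, so treat the commented sample file shipped with your packages as the authority:
global {
    minor-count = 2             # how many drbds you may want to configure
}

resource drbd0 {
    protocol = C                # replication protocol

    disk {
        disk-size = 4194304     # size of the replicated area, in KB
    }

    net {
        sync-rate = 8M          # bandwidth to spend on resynchronization
    }

    on paul {
        device  = /dev/nb0      # the drbd block device
        disk    = /dev/sda1     # the lower-level (physical) device
        address = 192.168.1.1   # IP of the dedicated replication link
        port    = 7788
    }

    on silas {
        device  = /dev/nb0
        disk    = /dev/sda1
        address = 192.168.1.2
        port    = 7788
    }
}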

Testing Your System
Once you’ve configured drbd.conf, start DRBD. Assuming that the names of the nodes are paul and silas, choose one node to start with, say, paul. Run the command:
paul# /etc/init.d/drbd start
When prompted, make paul primary, then create a file system on the drbd with the command:
paul# mke2fs -j /dev/nb0
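If the init script did not prompt you, or you later need to change roles by hand, you can promote the device explicitly with drbdsetup (assuming /dev/nb0 is your first drbd) before creating the filesystem:
paul# drbdsetup /dev/nb0 primary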
Make an entry into /etc/fstab (on both nodes!), like this:
/dev/nb0    /www     auto    defaults,noauto    0 0
/dev/nb1    /mail    auto    defaults,noauto    0 0
On the other node, silas, run:
silas# /etc/init.d/drbd start
When DRBD starts on the second node, it connects with the first node and starts to synchronize. Synchronization typically takes quite a while, especially if you use 100 megabit Ethernet and large devices: at roughly 10 MB/s, a 40 GB device needs more than an hour.
The device that’s the synch target (here, the device on silas) typically blocks in the script until the synchronization is finished. However, the synch source (the primary, here paul) is fully operational during a synch. So back on the first node, let the script mount the device:
paul# /etc/init.d/datadisk start
Start working with this file system, put some large files there, copy your CVS repository, or something.
When synch is finished, try a manual failover. Unmount the drbd devices on paul, and mount them on silas:
paul# /etc/init.d/datadisk stop
silas# /etc/init.d/datadisk start
You should now find the devices mounted on silas, and all of the files and changes you made should be there, too. In fact, the first disk-size blocks of the underlying physical devices should be bit-for-bit identical. If you want, you can verify this with an MD5SUM over the complete device.
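A sketch of such a check, assuming /dev/sda1 is the lower-level device on both nodes, disk-size is 4194304 KB, and DRBD is stopped on both nodes so that nothing changes underneath:
paul#  dd if=/dev/sda1 bs=1k count=4194304 | md5sum
silas# dd if=/dev/sda1 bs=1k count=4194304 | md5sum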
Next, start DRBD again on both nodes. This time there should be no synch. This is the normal situation after an intentional reboot: if both nodes are in a “secondary” state before the cluster loses its connection, there is no need for a synch. (See the sidebar “How DRBD Works” for more information about when DRBD syncs, and why.)
Finally, you can automate the assignment of the primary and secondary roles to implement failover.
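With Heartbeat, for instance, this typically boils down to a one-line entry in its haresources file. A sketch only: the service IP, the service name, and the exact arguments to the datadisk resource script are assumptions you must check against your Heartbeat and DRBD documentation:
paul IPaddr::192.168.1.100 datadisk::drbd0 apache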

Some Do’s and Don’ts
Here are some things you should do and some things you should avoid when running DRBD.
Never mount a drbd in secondary state. Though it’s possible to mount a secondary device as read-only, changes made to the primary are mirrored to it underneath the filesystem and buffer cache of the secondary, so you won’t see changes on the secondary. And changing metadata underneath a filesystem is a risky habit, since it may confuse your kernel to death.
Once you setup DRBD, never — as in never!! — bypass it, or access the underlying device directly, unless it’s the last chance to recover data after a catastrophic failure.
If your primary node fails and you rebuild it, make sure that the first synch is in the direction you want. Specifically, make sure that the synch does not overwrite the good data on the then-current primary (the node that didn’t fail). To ensure this happens correctly, remove all of the metadata found in /var/lib/drbd/drbd/ from the freshly-rebuilt node.
Running DRBD on top of a loopback device, or vice versa, is expected to deadlock, so don’t do that.
You can run DRBD on top of the Linux Volume Manager (LVM), but you have to be very careful. Otherwise, snapshots (for example) won’t know how to notify the filesystem (possibly on the remote node) to flush its journal to disk to make the snapshot consistent. However, DRBD and LVM might be convenient for test setups, since you can easily create or destroy new drbds.
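For such a test setup, you might carve a small logical volume out of an existing volume group and point a resource at it. A sketch, assuming a volume group named vg0 already exists:
paul# lvcreate -L 1G -n drbdtest vg0
You would then use /dev/vg0/drbdtest as the disk parameter of a test resource.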

Tele-DRBD
The typical use of DRBD and HA clustering is probably two machines connected with a LAN and one or more crossover cables, separated by just a couple of meters, probably within one server room, or at least within the same building.
But you can use DRBD over long distance links, too. When you have the replica several hundred kilometers away in some other data center (a good plan for disaster recovery), your data will survive even a major earthquake at your primary location.
When running DRBD, the complete disk content goes over the wire, so consider the privacy implications: if the machines are interconnected through the (supposedly) hostile Internet, you should route DRBD traffic through some virtual private network, or even a full-blown IPSec solution. For a more lightweight solution for this specific task, have a look at the CIPE project.
Finally, make sure no other node can access the DRBD ports, or someone might provoke a connection loss and then race for the first reconnect, to get a full sync of your disk’s content.
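One way to do that is a simple packet filter rule pair on each node. This sketch assumes iptables, a replication port of 7788, and 192.168.1.2 as the peer’s address (all placeholders for whatever you configured in drbd.conf):
paul# iptables -A INPUT -p tcp --dport 7788 -s 192.168.1.2 -j ACCEPT
paul# iptables -A INPUT -p tcp --dport 7788 -j DROP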

More to Come…
If you have any troubles setting up DRBD, check the FAQ at http://faq.drbd.org. If that doesn’t help, feel free to subscribe and ask questions on drbd-devel@lists.sourceforge.net (there’s no drbd-users alias yet).
Development of DRBD continues. Work is already underway to eliminate the most displeasing limitations of drbd-0.6.x. drbd-0.7.x will be made more robust against block size changes to support XFS, and will avoid certain nasty side effects. Future versions will permit the primary node to be a target of an ongoing synchronization, which makes graceful failover/failback possible, and increases interoperability with Heartbeat. Combined with OpenGFS, future versions of DRBD will likely be able to support true active/active configurations.
Unfortunately, these improvements are still in early alpha. But with your ongoing support, the pace of development should increase.
How DRBD Works
Whenever a higher-level application, typically a journaled file system, issues an I/O request, the kernel dispatches the request based on the target device’s major and minor numbers.
If the request is a “read,” and DRBD is registered as the major number, the kernel passes the request down the stack to the lower-level device locally. However, “write” requests are passed down the stack and sent over to the partner node.
Every time something changes on the local disk, the same changes are made at the same offset on the partner node’s device. If a “write” request finishes locally, a “write barrier” is sent to the partner to make sure that it is finished there before another request comes in. Since later write requests might depend on successfully finished previous ones, this is needed to assure strict write ordering on both nodes.
The two most important decisions for DRBD to make are when to synchronize and what to synchronize — is a full synchronization required, or just an incremental one? To make these decisions, DRBD keeps several event and generation counters in metadata files located in /var/lib/drbd/drbd#/.
Let’s look at the failure cases. Say paul is our primary server, and silas is standby. In the normal state, paul and silas are up and running. If one of them is down, the cluster is degraded. Typical state changes are degraded to normal and normal to degraded.
Case One: The secondary fails. If silas was standby and leaves the cluster (for whatever reason: network, power, hardware failure), this isn’t a real problem, as long as paul keeps on running. In degraded mode, paul simply flags all of the blocks that incur write operations as dirty. Then, after silas is repaired and joins the cluster again, paul can do an incremental synchronization (/proc/drbd says SyncQuick). If paul fails while alone, the dirty flags are lost, since they are held in RAM only. So unfortunately, the next time both nodes see each other, they perform a full synch (“SyncAll”) from paul to silas.
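You can watch the connection state and which kind of synchronization DRBD has chosen at any time by reading /proc/drbd:
paul# cat /proc/drbd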
Case Two: The primary fails. When paul is the active primary and fails, the situation is a bit different. If silas remains standby (which is unlikely), and paul returns, paul becomes primary again. At that moment, it’s unknown which blocks were modified on paul that hadn’t reached silas. Therefore, a full synch from paul to silas is needed just to make sure that everything is identical again. In the more likely case that silas assumed the role of primary, paul becomes standby and synch target when it returns, receiving a full synch from silas. Why? It’s not known which blocks were modified on paul immediately before the crash.
Case Three: Both the primary and secondary fail. If both nodes go down (due to a main power failure or something catastrophic), when the cluster reboots, paul provides a full synch to silas.
While it seems like a full synch is needed whenever paul becomes unavailable, that’s not exactly accurate. You can stop the services on paul, unmount the drbd, and make paul secondary. In this case, both nodes are on standby, and you can shut off both nodes cleanly. When both nodes reboot (from previously being on standby), no synch is required.
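A sketch of such a clean shutdown on paul, assuming your services are already stopped, the datadisk script handles the unmount, and your drbdsetup accepts a secondary subcommand analogous to the primary one shown below:
paul# /etc/init.d/datadisk stop     # unmount the drbd
paul# drbdsetup /dev/nb0 secondary  # demote paul to secondary
paul# /etc/init.d/drbd stop         # now the node can be powered off cleanly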
Alternatively, you can make silas primary, mount drbd there, and start the services. This configuration allows you to bring paul down for maintenance. When paul reboots, silas can provide an incremental synch to paul.
Case Four: Double failure. If one of the nodes (or the network) fails during a synchronization, this is a double failure, since the first failure caused the synch to happen. Assuming that paul was primary, paul has the good data; silas was receiving the synch. If silas became unavailable during the synch, it has inconsistent, only partially up-to-date data. So, when silas returns, the synch has to be restarted.
If the synch was incremental, it can be restarted at the place it was interrupted. If the synch was supposed to be complete, it must be restarted from the very beginning. (This is a scenario that needs to be improved upon.)
If paul (the synch source) fails during the process, the cluster is rendered non-operational. silas cannot assume the role of the primary because it has inconsistent data.
However, if you really need availability, and don’t care about possibly inconsistent, out-of-date data, you can force silas to become primary. Use the explicit operator override…
silas# drbdsetup /dev/nb0 primary --do-what-I-say
But remember: if you use brute force, you take the blame.