Sunday, May 27, 2012

ITIL: Study Notes for “ISEB BH0-012 – The Foundation ITIL” Exam

Source :-  http://cosonok.blogspot.in/2012/03/

ITIL was built around Deming's plan-do-check-act cycle.

Definitions
+ Definitive media library (DML) = a secure library where definitive authorized versions of all media configuration items (CIs) are stored and protected
+ Governance = concerned with policy and direction.
+ ITIL = Information Technology Infrastructure Library.
Implementation of ITIL service management requires preparing and planning the effective and efficient use of the four Ps = People, Processes, Products, Partners
+ Service Management = a set of specialized organizational capabilities for providing value to customers in the form of services.
+ Service request = a request from a user for information, advice, or for a standard change.
+ SLA = Service Level Agreement: an agreement between the service provider and their customer

ITIL Service Lifecycle
1. Service Strategy
2. Service Design – Design the Processes
3. Service Transition – Plan and Prepare for Deployment
4. Service Operation – IT Operations Management
5. Continual Service Improvement
1. Service Strategy
1.1 Strategy Management for IT services
1.2 Service Portfolio Management
1.3 Financial Management of IT Services
… ensuring that the IT infrastructure is obtained at the most effective price (which does not necessarily mean cheapest) and calculating the cost of providing IT services so that an organisation can understand the costs of its IT services.
1.4 Demand Management
1.5 Business Relationship Management

2. Service Design
A service should always deliver value to customers.
Resources and capabilities create value for customers.
The Service Design Stage is MOST concerned with defining policies and objectives, and includes:
Producing quality, secure, and resilient designs for new or improved services
Taking service strategies and ensuring they are reflected in the service design processes and the service designs that are produced
Measuring the effectiveness and efficiency of service design and the supporting processes
The service design package (SDP) contains information that is passed to service transition to enable the implementation of a new service.
2.1 Design Coordination
2.2 Service Catalogue Management
The responsibilities of service catalogue management:
+ Ensuring that information in the service catalogue is accurate
+ Ensuring that information in the service catalogue is consistent with information in the service portfolio
+ Ensuring that all operational services are recorded in the service catalogue
2.3 Service Level Management
The purpose of service level management = to ensure an agreed level of IT service is provided for all current IT services.
Service level management process is responsible for discussing reports with customers showing whether services have met their targets.
2.4 Availability Management
Reliability: Ability of an IT component to perform at an agreed level at described conditions.
Maintainability: The ability of an IT component to remain in, or be restored to an operational state.
Serviceability: The ability for an external supplier to maintain the availability of component or function under a third-party contract.
Resilience: A measure of freedom from operational failure and a method of keeping services reliable (e.g. redundancy)
Security: The confidentiality, integrity, and availability of data.
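The standard ITIL availability formula ties these concepts together: availability (%) = (agreed service time - downtime) / agreed service time x 100. A quick sketch, with made-up figures:

```shell
# Availability % = (AST - downtime) / AST * 100
# AST (agreed service time) and downtime below are hypothetical figures.
ast=5000      # minutes of agreed service time in the period
downtime=50   # minutes of unplanned downtime
awk -v a="$ast" -v d="$downtime" 'BEGIN { printf "%.1f%%\n", (a - d) / a * 100 }'
# prints 99.0%
```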
2.5 Capacity Management
The capacity management process includes business, service, and component sub-processes. The high-level activities include:
+ Application sizing
+ Workload management
+ Demand management
+ Modelling
+ Capacity planning
+ Resource management
+ Performance management
2.6 IT Service Continuity Management (ITSCM)
Involves the following basic steps:
+ prioritising the activities to be recovered by conducting a business impact analysis (BIA)
+ performing a risk assessment (risk analysis) for each of the IT services to identify the assets, threats, vulnerabilities and countermeasures for each service
+ evaluating the options for recovery
+ producing the contingency plan / business continuity strategy
+ testing, reviewing, and revising the plan on a regular basis
2.7 Information Security Management System
2.8 Supplier Management
Third-party contracts are the responsibility of supplier management to negotiate and agree.

3. Service Transition
Service transition stage responsibilities:
+ To ensure that a service can be managed and operated in accordance with constraints specified during design
+ To provide good-quality knowledge and information about services
+ To plan the resources required to manage a release
3.1 Transition planning and support
3.2 Change management
The RACI Matrix – Who's Responsible, Accountable, Consulted... and kept Informed
R(esponsible) – Who is responsible for actually doing it?
A(ccountable) – Who has authority to approve or disapprove it?
C(onsulted) – Who has needed input about the task?
I(nformed) – Who needs to be kept informed about the task?
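As a sketch (the change and the roles here are invented for illustration), a RACI matrix for a routine change might look like this; note that each activity has exactly one A:

```
Activity             Change Mgr   Sysadmin   Service Desk   Users
Raise the RFC        A            R          C              I
Build and test       A            R          -              -
Deploy the change    A            R          I              I
```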
The main aims of change management include:
+ Minimal disruption of services
+ Reduction in back-out activities
+ Economic use of resources involved in the change
Emergency change advisory board (ECAB): the group that reviews changes that must be implemented faster than the normal change process allows
3.3 Service asset and configuration management
The configuration management system is part of the service knowledge management system.
The configuration management system (CMS) can help determine the level of impact of a problem.
The relationship in service asset and configuration management describes how the configuration items (CIs) work together to deliver services.
Includes the following key process areas:
+ Identification
+ Planning
+ Change Control
+ Change Management
+ Release Management
+ Maintenance
3.4 Release and deployment management
Objectives of release and deployment management: to define and agree release and deployment plans with customers and stakeholders. The goals of release management include:
+ Planning the roll-out of software
+ Designing and implementing procedures for the distribution and installation of changes to IT systems
+ Effectively communicating and managing expectations of the customer during the planning and roll-out of new releases
+ Controlling the distribution and installation of changes to IT systems
3.5 Service Validation and testing
3.6 Change evaluation
3.7 Knowledge management

4. Service Operation
The service operation stage of the service lifecycle delivers and manages IT services at agreed levels to business users and customers. Service operation's contribution to the business is adding value, and that service value is visible to customers.
The following areas of service management can benefit from automation:
+ Design and modelling
+ Reporting
+ Pattern recognition and analysis
+ Detection and monitoring
List of processes:
4.1 Event management
The event management process is involved in monitoring an IT service and detecting when its performance drops below acceptable limits.
4.2 Incident management
*Major incidents require separate procedures.
The objectives of incident management:
+ To restore normal service operation as quickly as possible
+ To minimize adverse impacts on business operations
4.3 Request fulfilment
4.4 Problem management
Problem = a condition often identified as a result of multiple incidents that exhibit common symptoms.
Objectives of problem management:
+ Minimizing the impact of incidents that cannot be prevented
+ Preventing problems and resulting incidents from happening
+ Eliminating recurring incidents
4.5 Access management
Access management process is responsible for providing rights to use an IT service.

5. Continual Service Improvement (CSI)
Where do we want to be?
Define measurable targets (Service metrics measure: The end-to-end service.)
Improvement initiatives typically follow a seven-step process:
5.1 Identify the strategy for improvement
5.2 Define what you will measure
5.3 Gather the data
5.4 Process the data
5.5 Analyse the information and data
5.6 Present and use the information
5.7 Implement improvement

Appendix A: ITIL Function – Service Desk (function of Service Operation)
Service desk features include:
+ single point of contact (SPOC)
+ single point of entry
+ single point of exit
+ make life easier for customers
+ data integrity
+ incident control: life-cycle management of all service requests
+ communication: keeping a customer informed of progress and advising on workarounds
Types of service desk structure:
+ Local service desk: to meet local business needs
+ Central service desk: for organisations having multiple locations
+ Virtual service desk: for organisations having multi-country locations
+ Follow the Sun

Appendix B: ITIL Function - Software Asset Management (function of Service Operation)
Software asset management (SAM) practices include:
+ maintaining software license compliance
+ tracking inventory and software asset use
+ maintaining standard policies and procedures surrounding definition, deployment, configuration, use, and retirement of software assets and the DML

Saturday, May 5, 2012

Google Chrome Browser Tips and Tricks

These tips will help you to get most out of the Chrome browser.

1. Pin Tab

When you pin a tab, Chrome shrinks the tab to display only its icon. When you have several tabs open, this feature is very helpful, as each pinned tab shows only its icon and takes up very little screen real estate.
Before pinning:
After pinning 1st two tabs:

2. Display Home Button

By default, Chrome doesn’t display the ‘Home’ button in the toolbar.
Click on the ‘Wrench Icon’ on the right-hand corner of the browser to get to the “Customize” option for Chrome browser -> Preferences (or Options) -> Select the check-box for “Show home button in toolbar”
This will now display the ‘Home’ button in front of the URL field. Click this button to go to your home page quickly.
3. Omnibox
The URL address bar (also called the Omnibox) in the Chrome browser is not just for entering URLs. Type a keyword you want to search for and press Enter, and Chrome will perform a Google search.
You can also perform calculations or conversions directly in the Omnibox. Try typing any one of the following in the URL address bar and press enter to see the results yourself.
7 + 200
7 * 200
1 lb in kg
2 miles in km
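You can verify the Omnibox's conversions by hand; the factors below are the standard ones (1 lb = 0.45359237 kg, 1 mile = 1.609344 km):

```shell
# The same conversions the Omnibox performs, done by hand with awk
awk 'BEGIN { printf "1 lb = %.4f kg\n", 0.45359237 }'      # prints 1 lb = 0.4536 kg
awk 'BEGIN { printf "2 miles = %.4f km\n", 2 * 1.609344 }' # prints 2 miles = 3.2187 km
```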

4. Incognito – Secret Mode

Incognito mode is for private browsing: Chrome doesn’t record your browsing history or download history, and any cookies stored during an incognito session are deleted when you close the browser.
You can launch incognito by pressing Ctrl + Shift + N, (or) Settings -> New incognito Window, (or) right mouse-click on a link from a regular Chrome session, and select “open link in incognito window”.
One practical use for this mode (apart from private browsing) is to log in to the same site using two accounts.
For example, if you have two gmail accounts, login using the first account in your regular Chrome browser, and login using the second account in your Incognito mode on your Chrome browser. This way, you can be logged in to two gmail accounts at the same time on Chrome browser.

5. Reopen Recently Closed tab

If you’ve closed a tab by mistake, you can open it by pressing Ctrl + Shift + T, (or) right mouse-click on the empty area in the title-bar -> and select “Reopen closed tab” as shown below.

6. Chrome:// commands

There are various chrome:// commands that you can type in the address bar.
chrome://histograms
chrome://memory
chrome://cache
chrome://dns
etc.

7. Task Manager

Task manager displays the memory and CPU usage of the Chrome browser, broken down by each and every Tab. If you have multiple Tabs open, and when your system is slow, you can use task manager to identify which Tab is causing the issue.
Right click on empty space in the title bar and select “Task manager”, (or) Press Shift + escape key to launch it as shown below.

8. Change Search Engine in the Omnibar

Type “amazon” in the address bar and press “Tab”; the address bar changes to “Search amazon.com:”, and any keyword you type after this will be searched on amazon.com, taking you to the amazon.com website.
You can also change the default search engine. Right click on the address bar -> and select “Edit search engine”. From here you can choose other search engines.

9. Open a link at a specific Tab Location

As you already know, when you right click on a link, and say “Open link in a new tab”, it opens it as a new tab (next to the current open tab).
However, if you want the link to open at a specific tab location, you can hold the link, drag it, and drop it at that tab location. You’ll see a small arrow as you drag. In the following example, I dropped the link at the 2nd tab location.

10. Carry your Chrome Settings with You

If you are using multiple computers (at home, at work, etc.), you don’t need to worry about trying to setup the Chrome browser in the same way on all the computers you use. Instead, setup your bookmarks, extensions, themes, settings, etc, on your Chrome browser on one computer, and select “Sign in to Chrome” from the settings menu as shown below.

This will ask you to enter your Google username and password, and will save all your Chrome settings to your Google account. The next time you sign in from another machine, all your Chrome settings will appear on the new system. If you make any changes to your Chrome settings on the new system, they will be available on your other computers too. Use this feature only on systems you trust, and never on public computers.

11. Drag and Drop Downloaded file

Once a file is downloaded, you can just drag and drop the file (from the Chrome download window) to Windows Explorer, or any other file browser that you are using on your system.

12. History

Press Ctrl-H, or go to Customize -> History, to launch the history window. From here you can search for a specific website from your history, delete all your history, or delete only selected items from your history.

13. Create a Shortcut of the Current Tab

If you like to create a shortcut to the website that you are currently viewing, go to Customize -> Tools -> Create application shortcuts. This will ask you where you like to create the shortcut for this website, as shown below.

Once you create an application shortcut, the next time you click on it, it will open the site in a Chrome window without any tabs, URL bar, etc.

14. Navigate Between Tabs Quickly

  • Use Ctrl+Tab to navigate Tabs one by one
  • Press Ctrl-1 to go to 1st Tab
  • Press Ctrl-2 to go to 2nd Tab
  • ..
  • Press Ctrl-9 to switch to the last Tab

15. Resize TextArea

You can also resize a textarea that you see on any website. Please note that you can resize only the textarea and not a textbox. At the bottom right corner of the textarea, you’ll see two slanted lines, use your mouse, hold this, and drag it to resize the textarea on the screen.
You can try this yourself on the comment box (which is a textarea) located at the bottom of this page.
16. Detach a tab
If you have multiple Tabs opened, and like to detach a single tab as a separate Chrome instance, just drag the tab anywhere outside the browser, which will detach the tab and run it in a separate Chrome browser window.


17. Highlight a Text and Search
When you are browsing a website, if you come across a word that you are not familiar with and would like to perform a Google search on it, just double-click on the word to highlight it, right-click, and select “Search Google for”, which will open a new tab and search for the selected text. This saves some time.
18. Autofill
You can use the autofill option to enter one or more addresses that you can use to fill-up any web forms. You can also use this feature to store one or more credit card information that can be pre-populated on web forms. Don’t use this feature on a computer that you don’t trust.
Go to Settings -> Options -> Personal Stuff -> Select the check-box “Enable Autofill to fill out web forms in a single click” -> Click on ‘Manage autofill settings’ -> Click on ‘Add new Address’ and enter the information.

19. Google Cloud Print

Go to Settings -> Options -> Under the Hood -> click on sign into “Google Cloud Print”.
Once you set up your printer using Google Cloud Print, you can print to it from anywhere, i.e. from your mobile, another PC at work, or any other system that is connected to the internet.

20. Google Chrome Browser Shortcuts

The following are some useful shortcuts:
  • Alt+F – Open the wrench menu (i.e chrome settings menu)
  • Ctrl+J – Go to downloads window
  • Ctrl+H – Go to history window
  • Ctrl+Tab – Navigate Tabs
  • Alt+Home – Go to home page
  • Ctrl+U – View source code of the current page
  • Ctrl+K – To search quickly in the address bar
  • Ctrl+L – Highlights the URL in the address bar (use this to copy/paste the URL quickly)
  • Ctrl+N – Open a new Chrome browser window
  • Ctrl+Shift+N – Open a new incognito window (for private browsing)
  • Ctrl+Shift+B – Toggle bookmark display
  • Ctrl+W – Close the current Tab
  • Alt+Left Arrow – Go to the previous page from your history
  • Alt+Right Arrow – Go to the next page from your history
  • Space bar – Scroll down the current web page
Following are the 12 most helpful chrome:// commands that you should know.

1. chrome://flags

From here you can enable some of the experimental features that are hidden in the Google Chrome browser. Please note that, as mentioned on this page, since these are experimental, they might not work as expected and might cause issues. Enable these features and use them at your own risk.

2. chrome://dns

This displays the list of hostnames for which the browser will prefetch the DNS records.

3. chrome://downloads

This is also available from the Menu -> Downloads. The shortcut key is Ctrl+J.

4. chrome://extensions

This is also available from the Menu -> Tools -> Extensions

5. chrome://bookmarks

This is also available from the Menu -> Bookmarks -> Bookmark Manager. The shortcut key is Ctrl+Shift+O.

6. chrome://history

This is also available from the Menu -> History. The shortcut key is Ctrl+H.

7. chrome://memory

This will redirect to “chrome://memory-redirect/”, which displays the memory used by the Google Chrome browser, and by all other browsers running on the system (including Firefox).
It also displays all browser-related processes with their PID, process name, and memory usage.

8. chrome://net-internals

This displays all networking related information. Use this to capture network events generated by the browser. You can also export this data. You can view DNS host resolver cache.
One important feature here is “Tests”. If a URL failed to load, go to “chrome://net-internals” -> click on the “Tests” tab -> type the URL that failed, and click on “Start Test”; Chrome runs some tests and reports why the URL failed.

9. chrome://quota-internals

This gives information about the disk-space quota used by the browser, including a breakdown of how much space individual websites use under temporary files.

10. chrome://sessions

This displays the sessions and magic lists that are currently running.

11. chrome://settings

This is also available from the Menu -> Options (on Windows), and Menu -> Preferences (on Linux). From here you can control various browser related settings.

12. chrome://sync-internals

This gives information about the chrome sync feature, including the Sync URL used by google, and sync statistics.
Finally, to view all the available chrome:// commands, type chrome://about/ in your Chrome browser's address bar.
Also, note that all of the commands mentioned above can also be invoked using the about: prefix, which redirects to chrome://.
For example, both of the following are exactly the same.
about:dns
chrome://dns

Wednesday, March 28, 2012

Data Redundancy by DRBD


Has your database (or mail or file) server crashed? Is your entire department waiting for you to restore service? Are your most recent backups a month old? Are those backups off-site? Is this a frighteningly real scenario? Oh, yeah. Can it be avoided? Oh, yeah. The Distributed Replicated Block Device (DRBD) system can save the day, your data, and your job. DRBD provides data redundancy at a fraction of the cost of other solutions.
Almost every service depends on data. So, to offer a service, its data must be available. And if you want to make that service highly-available, you must first make the data it depends on highly-available.
The most natural way to do this (and hopefully, it’s something you already do on a regular basis) is to backup your data. In case you lose your active data, you just restore from the most recent backup, and the data is available again. Or, if the host your service runs on is (temporarily) unusable, you can replace it with another host configured to provide the identical service, and restore the data there.
To reduce possible downtime, you can have a second machine ready to take over.
Whenever you change the data on one machine, you back it up on the other. You can have the secondary machine switched off, and just turn it on if the primary host goes down. This is typically referred to as cold standby. Or you can have the backup machine up and running, a configuration known as a hot standby.
However, whether your standby is hot or cold, one problem remains: if the active node fails, you lose changes to the data made after the most recent backup. But even that can be addressed… if you have the bucks.
One solution is to use some kind of shared storage device. With media shared between machines, both nodes have access to the most recent data when they need it. Storage can be simple SCSI sharing, dual controller RAID arrangements like IBM’s ServeRAID, shared fiber-channel disks, or high-end storage like IBM Shark or the various EMC solutions.
While effective, these systems are relatively costly, ranging from five thousand to millions of dollars. And unless you purchase the most expensive of these systems, shared storage typically has single points of failure (SPOFs) associated with it — whether obvious or not. For example, some provide separate paths to a single shared bus, but have a single, internal electrical path to access the bus.
Another solution — and one that’s as good as the most expensive hardware — is live replication.

Real Time Backup with Replication
DRBD provides live replication of data: it presents a mass storage device, such as a block device, and distributes the device over two machines. Whenever one node writes to the distributed device, the changes are replicated to the other in real time.
DRBD layers transparently over any standard block device (the “lower level device”), and uses TCP/IP over standard network interfaces for data replication. Though you can use raw devices for special purposes, the typical direct client to a block device is a filesystem, and it’s recommended that you use one of the journaling filesystems, such as Ext3 or Reiserfs. (XFS is not yet usable with DRBD.) You can think of DRBD as RAID1 over the network.
No special hardware is required, though it’s best to have a dedicated (crossover) network link for the data replication. And if you need high write throughput, you should eliminate the bottleneck of 10/100 megabit Ethernet and use Gigabit Ethernet instead. (To tune it further, you can increase the MTU to something greater than the typical file system block size, say, 5000 bytes.) Thus, for the cost of a single, proprietary shared-storage solution, you can set up several DRBD clusters.

Installing from Binary Packages
When there are (official or unofficial) packages available for your favorite distribution, just install from those, and you’re done. For example, SuSE officially includes DRBD and Heartbeat in its standard distribution, as well as in SuSE Linux Enterprise Server 8 (SLES8).
The most recent “unofficial” SuSE packages can be found in Lars Marowsky-Bree’s subtree at ftp.suse.com/pub/people/lmb/drbd and its mirrors. For Debian users, David Krovich provides prebuilt packages (via the apt updater) at deb http://fsrc.csee.wvu.edu/debian/apt-repository/binary/ and source packages at deb-src http://fsrc.csee.wvu.edu/debian/apt-repository/source/.
If you need to compile DRBD from source, get a DRBD source package or source tarball from the download section of http://www.drbd.org, or check it out from CVS. Be sure to have the kernel sources for your running kernel, and make sure that the kernel source tree configuration matches the configuration of the running kernel. For reference, these are the steps for SuSE:
# cd /usr/src/linux
# make cloneconfig; make dep
# cd /wherever/drbd
# make; make install
In case you got the source tarball, you should backup the drbd/documentation/ subdirectory first. Since the sgml/docbook/ stuff is difficult to get right, the tarball contains “precompiled” man pages and documentation, which might be corrupted by an almost, but not quite, matching SGML environment.

DRBD Configuration
Once installed, you need to tell DRBD about its environment. You should be able to find a sample configuration file at /etc/drbd.conf; if not, there is a well-commented one in the drbd/scripts/ subdirectory.
drbd.conf is divided into at most one global{} section, and an “arbitrary” number of resource <resource-id> {} sections, where <resource-id> is typically something like drbd2.
In the global section, you can use minor-count to specify how many drbds (here, in lower case, drbd refers to the block devices) you want to be able to configure, in case you want to define more resources later without reloading the module (which would interrupt services).
Each resource{} section further splits into resource settings, grouped as disk{}, net{}, and node-specific settings, where the latter are grouped in on <hostname> {} subsections. Parameters you need to change are the hostname, device, physical disk and virtual disk-size, and the Internet address and port number.
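Putting those pieces together, a minimal drbd.conf might look roughly like this (host names, devices, addresses, and sizes are invented, and exact keywords vary between DRBD versions, so treat it as a sketch of the shape rather than a drop-in file):

```
global {
    minor-count = 2            # allow defining more resources later
}

resource drbd0 {
    disk {
        disk-size = 4194304    # virtual size of the device, in KB
    }
    net {
        # replication-link settings (timeouts, sync rate) go here
    }
    on paul {
        device  = /dev/nb0     # the drbd block device
        disk    = /dev/sda7    # physical lower-level device
        address = 10.0.0.1     # dedicated replication link
        port    = 7788
    }
    on silas {
        device  = /dev/nb0
        disk    = /dev/sda7
        address = 10.0.0.2
        port    = 7788
    }
}
```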

Testing Your System
Once you’ve configured drbd.conf, start DRBD. Assuming that the names of the nodes are paul and silas, choose one node to start with, say, paul. Run the command:
paul# /etc/init.d/drbd start
When prompted, make paul primary, then create a file system on the drbd with the command:
paul# mke2fs -j /dev/nb0
Make an entry into /etc/fstab (on both nodes!), like this:
/dev/nb0         /www           auto
  defaults,noauto     0 0
/dev/nb1         /mail          auto
  defaults,noauto     0 0
On the other node, silas, run:
silas# /etc/init.d/drbd start
When DRBD starts on the second node, it connects with the first node and starts to synchronize. Synchronization typically takes quite a while, especially if you use 100 megabit Ethernet and large devices.
The device that’s the synch target (here, the device on silas) typically blocks in the script until the synchronization is finished. However, the synch source (the primary or paul) is fully operational during a synch. So back on the first node, let the script mount the device:
paul# /etc/init.d/datadisk start
Start working with this file system, put some large files there, copy your CVS repository, or something.
When synch is finished, try a manual failover. Unmount the drbd devices on paul, and mount them on silas:
paul# datadisk stop
silas# datadisk start
You should now find the devices mounted on silas, and all of the files and changes you made should be there, too. In fact, the first disk-size blocks of the underlying physical devices should be bit-for-bit identical. If you want, you can verify this with an MD5SUM over the complete device.
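That bit-for-bit check is just a digest over the same byte range on both nodes; the device name below is hypothetical, and the idea is demonstrated on plain files so it can be tried safely anywhere:

```shell
# On the real nodes you would digest the lower-level device, e.g.:
#   dd if=/dev/sda7 bs=1M | md5sum        # hypothetical device name
# Demonstrated here on ordinary files: identical bytes, identical digests.
printf 'identical content' > /tmp/paul.img
printf 'identical content' > /tmp/silas.img
md5sum /tmp/paul.img /tmp/silas.img       # the two digests match
```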
Next, start DRBD again on both nodes. This time there should be no synch. This is the normal situation after an intentional reboot: if both nodes are in a “secondary” state before the cluster loses its connection, there is no need for a synch. (See the sidebar “How DRBD Works” for more information about when DRBD syncs, and why.)
Finally, you can automate the assignment of the primary and secondary roles to implement failover.

Some Do’s and Don’ts
Here are some things you should do and some things you should avoid when running DRBD.
Never mount a drbd in secondary state. Though it’s possible to mount a secondary device as read-only, changes made to the primary are mirrored to it underneath the filesystem and buffercache of the secondary, so you won’t see changes on the secondary. And changing metadata underneath a filesystem is a risky habit, since it may confuse your kernel to death.
Once you setup DRBD, never — as in never!! — bypass it, or access the underlying device directly, unless it’s the last chance to recover data after a catastrophic failure.
If your primary node fails and you rebuild it, make sure that the first synch is in the direction you want. Specifically, make sure that the synch does not overwrite the good data on the then-current primary (the node that didn’t fail). To ensure this happens correctly, remove all of the metadata found in /var/lib/drbd/drbd/ from the freshly-rebuilt node.
Running DRBD on top of a loopback device, or vice versa, is expected to deadlock, so don’t do that.
You can run DRBD on top of the Linux Volume Manager (LVM), but you have to be very careful. Otherwise, snapshots (for example) won’t know how to notify the filesystem (possibly on the remote node) to flush its journal to disk to make the snapshot consistent. However, DRBD and LVM might be convenient for test setups, since you can easily create or destroy new drbds.

Tele-DRBD
The typical use of DRBD and HA clustering is probably two machines connected with a LAN and one or more crossover cables, separated by just a couple of meters, probably within one server room, or at least within the same building.
But you can use DRBD over long distance links, too. When you have the replica several hundred kilometers away in some other data center (a good plan for disaster recovery), your data will survive even a major earthquake at your primary location.
Running DRBD, the complete disk content goes over the wire, so consider privacy: if the machines are interconnected through the (supposedly) hostile Internet, you should route DRBD traffic through some virtual private network, or even a full-blown IPSec solution. For a more lightweight solution for this specific task, have a look at the CIPE project.
Finally, make sure no other node can access the DRBD ports, or someone might provoke a connection loss and then race for the first reconnect, to get a full sync of your disk’s content.

More to Come…
If you have any troubles setting up DRBD, check the FAQ at http://faq.drbd.org. If that doesn’t help, feel free to subscribe and ask questions on drbd-devel@lists.sourceforge.net (there’s no drbd-users alias yet).
Development of DRBD continues. Work is already underway to eliminate the most displeasing limitations of drbd-0.6.x. drbd-0.7.x will be made more robust against block size changes to support XFS, and will avoid certain nasty side effects. Future versions will permit the primary node to be a target of an ongoing synchronization, which makes graceful failover/failback possible, and increases interoperability with Heartbeat. Combined with OpenGFS, future versions of DRBD will likely be able to support true active/active configurations.
Unfortunately, these improvements are still in early alpha. But with your ongoing support, the pace of development should increase.
How DRBD Works
Whenever a higher-level application, typically a journaled file system, issues an I/O request, the kernel dispatches the request based on the target device’s major and minor numbers.
If the request is a “read,” and DRBD is registered as the major number, the kernel passes the request down the stack to the lower-level device locally. However, “write” requests are passed down the stack and sent over to the partner node.
Every time something changes on the local disk, the same changes are made at the same offset on the partner node’s device. If a “write” request finishes locally, a “write barrier” is sent to the partner to make sure that it is finished before another request comes in. Since later write requests might depend on successfully finished previous ones, this is needed to assure strict write ordering on both nodes.
The two most important decisions for DRBD to make are when to synchronize and what to synchronize — is a full synchronization required, or just an incremental one? To make these decisions, DRBD keeps several event and generation counters in metadata files located in /var/lib/drbd/drbd#/.
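The essence of that decision can be sketched as a tiny shell function. This is an illustration only, not DRBD's actual metadata logic: the one input here is whether the surviving node still has its in-RAM dirty-block flags, which (as the failure cases below show) is what separates an incremental resync from a full one.

```shell
#!/bin/sh
# Illustrative sketch of DRBD's resync decision (not the real
# drbd-0.6 metadata comparison, which uses several generation counters).
decide_sync() {
    # $1 = 1 if the sync source still knows which blocks are dirty
    if [ "$1" -eq 1 ]; then
        echo SyncQuick    # incremental: copy only the flagged blocks
    else
        echo SyncAll      # full: copy the entire device
    fi
}

decide_sync 1    # -> SyncQuick
decide_sync 0    # -> SyncAll
```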
Let’s look at the failure cases. Say paul is our primary server, and silas is standby. In the normal state, paul and silas are up and running. If one of them is down, the cluster is degraded. Typical state changes are degraded to normal and normal to degraded.
Case One: The secondary fails. If silas was standby and leaves the cluster (for whatever reason: network, power, hardware failure), this isn’t a real problem, as long as paul keeps on running. In degraded mode, paul simply flags all of the blocks that incur write operations as dirty. Then, after silas is repaired and joins the cluster again, paul can do an incremental synchronization (/proc/drbd says SyncQuick). If paul fails while alone, the dirty flags are lost, since they are held in RAM only. So unfortunately, the next time both nodes see each other, they perform a full synch (“SyncAll”) from paul to silas.
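You can see which kind of resync is running by pulling the connection-state (cs:) field out of /proc/drbd. The snippet below parses a sample line instead of the live file, and the line format is illustrative of drbd-0.6; check your own /proc/drbd for the exact layout.

```shell
#!/bin/sh
# Extract the cs: field from a (sample) /proc/drbd line.
sample="0: cs:SyncQuick st:Secondary/Primary"
state=$(echo "$sample" | sed 's/.*cs:\([A-Za-z]*\).*/\1/')
echo "$state"    # -> SyncQuick
```

On a real cluster you would read /proc/drbd itself, e.g. from a monitoring cron job.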
Case Two: The primary fails. When paul is the active primary and fails, the situation is a bit different. If silas remains standby (which is unlikely), and paul returns, paul becomes primary again. At that moment, it’s unknown which blocks were modified on paul that hadn’t reached silas. Therefore, a full synch from paul to silas is needed just to make sure that everything is identical again. In the more likely case that silas assumed the role of primary, paul becomes standby and synch target when it returns, receiving a full synch from silas. Why? It’s not known which blocks were modified on paul immediately before the crash.
Case Three: Both the primary and secondary fail. If both nodes go down (due to a main power failure or something catastrophic), when the cluster reboots, paul provides a full synch to silas.
While it seems like a full synch is needed whenever paul becomes unavailable, that’s not exactly accurate. You can stop the services on paul, unmount the drbd, and make paul secondary. In this case, both nodes are on standby, and you can shut off both nodes cleanly. When both nodes reboot (from previously being on standby), no synch is required.
Or you can make silas primary, mount drbd there, and start the services. This configuration allows you to bring paul down for maintenance. When paul reboots, silas can provide an incremental synch to paul.
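A planned switchover of this kind might look like the following transcript. The device name follows this article; the mount point (/mnt/drbd0) and the init script (httpd) are placeholders for whatever service you run on top of DRBD:

```shell
paul#  /etc/init.d/httpd stop          # stop the service using the device
paul#  umount /dev/nb0                 # release the file system
paul#  drbdsetup /dev/nb0 secondary    # demote paul to standby
silas# drbdsetup /dev/nb0 primary      # promote silas
silas# mount /dev/nb0 /mnt/drbd0       # mount the replicated data
silas# /etc/init.d/httpd start         # resume service on silas
```

Because paul was demoted cleanly before going down, only an incremental synch is needed when it rejoins.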
Case Four: Double failure. If one of the nodes (or the network) fails during a synchronization, this is a double failure, since the first failure caused the synch to happen. Assuming that paul was primary, paul has the good data; silas was receiving the synch. If silas became unavailable during the synch, it has inconsistent, only partially up-to-date data. So, when silas returns, the synch has to be restarted.
If the synch was incremental, it can be restarted at the place it was interrupted. If the synch was supposed to be complete, it must be restarted from the very beginning. (This is a scenario that needs to be improved upon.)
If paul (the synch source) fails during the process, the cluster is rendered non-operational. silas cannot assume the role of the primary because it has inconsistent data.
However, if you really need availability, and don’t care about possibly inconsistent, out-of-date data, you can force silas to become primary. Use the explicit operator override…
silas# drbdsetup /dev/nb0 primary --do-what-I-say
But remember: if you use brute force, you take the blame.