Avoid a Backup Nightmare by Monitoring Your Backups

Posted by R1Soft on Feb 10, 2010 3:06:00 PM

Time and time again at R1Soft we see disasters that could easily be avoided with some basic reporting and attention to the CDP System after it is deployed.

How To Monitor your CDP Enterprise Installation

If Nothing Else, Visit the CDP Task History Screen Periodically

If you do nothing else, at least periodically visit the global Task History screen on your CDP Server.

Schedule Email Reporting

You can create an email report for your entire CDP Server or for just your agents.  You can choose to include all log messages of level WARN or higher in the report.

Configure Remote Syslog on Larger Deployments – http://wiki.r1soft.com/display/R1D/Remote+Syslog+Settings

Use a syslog server to receive all log messages from the CDP Server.  Once in syslog, the CDP Server messages can be treated like any other log file you probably already monitor.  You can use your method of choice for monitoring: scripts, Splunk, a database, a Nagios script, etc.
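
If you go the script route, a minimal sketch might look like the one below.  It assumes the CDP messages end up in a plain text file and that each line carries a severity word.  WARN is the level mentioned above; ERROR and FATAL are assumed names, and the real message format on your server may differ, so adjust the pattern to match what you actually see.

```python
import re

# Assumed severity words; only WARN is taken from the reporting options above.
LEVELS = {"WARN": 1, "ERROR": 2, "FATAL": 3}
LEVEL_RE = re.compile(r"\b(WARN|ERROR|FATAL)\b")


def lines_needing_attention(lines, minimum="WARN"):
    """Return the lines whose severity word is at or above `minimum`."""
    threshold = LEVELS[minimum]
    flagged = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if match and LEVELS[match.group(1)] >= threshold:
            flagged.append(line.rstrip("\n"))
    return flagged
```

To use it, open whatever file your syslog server writes the CDP messages to and pass the file object to lines_needing_attention().  Anything it returns deserves a look, and a cron job could mail you the result.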

Windows syslog Server

Linux Remote syslog Setup

Windows Event Log

Examples of why you Need to Monitor your CDP Enterprise Installation

We Do See Hardware Go Bad Sometimes

A few months ago we had a customer with failed restores.  The data on disk was so corrupt that, on one agent, even the 4-byte magic number at the beginning of the Disk Safe .data file, which is written only once when the Disk Safe is created, was corrupt.  In this case every task on the CDP Server had warnings or errors, and most were failing with nasty errors.  They went unnoticed for one month, until a restore was needed.  So did other warnings from the operating system about a serious problem with disk storage on their CDP Server.

All could have been fixed well before a restore was needed.

Like All Software, R1Soft CDP Has Errors, and Often There Are Warning Signs

As reported in our release notes, Linux Agents with version numbers lower than 1.67.0 (CDP Release 2.19) can in some cases fail to properly identify deltas and read blocks correctly on disks or volumes larger than 2 TB.  The issue was fixed in Linux Agent 1.67 in November.  The symptoms include complete failure to add the device/partition to the CDP Server OR literally thousands of warnings during the backup about corrupt or missing blocks.

The thousands of errors this issue can trigger during each backup come from a validation done during backup to make sure none of the blocks allocated by the file system are missing from CDP replication.  If blocks are missing, it can be an issue with the software, the hardware, or a corrupt file system.

The File System on Your Server (where CDP Agent runs) May be Corrupt – In Some Cases CDP Can Detect and Warn about This

It’s actually not that uncommon for Linux and Windows file systems to become corrupt to one degree or another, especially on multi-tenant systems that rarely if ever get taken down for a periodic fsck or checkdisk of the file system (cuz we all love a 24-hour fsck… right?).

While rare, what can happen is that some file system data structures on disk become corrupted.  It could be a hardware issue, or it could simply be caused by a hard reset of the server.  In some cases these data structures are also resident in memory in the operating system.  This allows the server to function mostly correctly, though there are often clues from the kernel in /var/log/messages.  The real surprise comes when you reboot the server and find out the file system is corrupted beyond repair.
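
Those kernel clues are easy to look for with a few lines of Python.  The sketch below is only an illustration: the three patterns are examples of messages Linux kernels commonly log for ext3 and block I/O problems, they are an assumption on my part rather than a complete list, and your kernel and file system may word things differently.

```python
import re

# Example kernel message patterns; an assumed, non-exhaustive list.
KERNEL_CLUES = [
    re.compile(r"EXT3-fs error"),
    re.compile(r"Buffer I/O error on device"),
    re.compile(r"end_request: I/O error"),
]


def find_kernel_clues(lines):
    """Return (line_number, text) pairs for lines matching any clue pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in KERNEL_CLUES):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Open /var/log/messages and pass the file object to find_kernel_clues().  An empty list is not proof of a healthy file system, but a non-empty one is worth investigating before the next reboot.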

In some cases (not all) the CDP Server will give warning signs.  "Unable to read block XXXXX I/O error" means you have bad sectors or some other storage failure.  A warning that blocks are corrupt means they are allocated by the file system but the CDP software never detected that they were written to.  This typically means file system corruption: the file system's block allocation maps have issues.

Other good reasons to monitor, if these are not enough:

  • A firewall rule may have changed, blocking access to the agent from the CDP Server.
  • There could be damage to a Disk Safe from a CDP Server crash or power loss.
  • You could have run out of disk space on your CDP Server.
  • Backups may be working, but MySQL may not be locked and flushed during the backup because the agent is unable to connect to MySQL.
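
The firewall case is simple to test for yourself from the CDP Server.  Here is a minimal sketch, assuming you know the host name and TCP port your agent listens on; both are parameters here because they depend on your installation.  It only proves that a TCP connection can be opened, not that the agent is healthy.

```python
import socket


def agent_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run it on a schedule for each agent and alert when it returns False.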

Improvements in Reporting Based On Customer Feedback

CDP 2.20 – Tasks Set to Warn State If MySQL Lock and Flush Fails

Task Alerts – The next major release of CDP has alerts.  An alert is added to a task if it needs attention for any reason.  Each alert contains a very simple message saying basically what went wrong, and inside the alert are all the relevant details needed to troubleshoot the issue.

Dashboard – Included in the next major CDP release is a dashboard.  At a glance, it can give you indicators of things that may require further investigation.  It’s very cool but no silver bullet.  The dashboard must be looked at, and in a large CDP Enterprise installation it’s a challenge to convey everything important on a dashboard.

Protection from Full Disks – CDP 3, with a Beta standard edition available for download right now, has built-in soft and hard disk quotas.  By default you will receive warnings if the disk/volume where a Disk Safe is stored has less than 20% free space.  With less than 10% free space, backups will fail.
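
If you are on an older version, or just want an independent check, the same default thresholds can be approximated with a short script.  This is a sketch of the idea, not how CDP 3 implements its quotas; the function names are made up for illustration.

```python
import shutil

SOFT_QUOTA_FREE = 0.20  # warn when less than 20% of the volume is free
HARD_QUOTA_FREE = 0.10  # treat as failing when less than 10% is free


def classify_free_space(free_bytes, total_bytes):
    """Map free space to 'ok', 'warn' (soft quota), or 'fail' (hard quota)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = free_bytes / total_bytes
    if fraction < HARD_QUOTA_FREE:
        return "fail"
    if fraction < SOFT_QUOTA_FREE:
        return "warn"
    return "ok"


def check_volume(path):
    """Classify the volume that contains `path` using current disk usage."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.free, usage.total)
```

Call check_volume() with the directory where your Disk Safes are stored and alert on anything other than "ok".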

Localization – We have localized everything in the next major CDP version, including most error messages.  What good are errors if you can’t understand them?  I agree.

Topics: Continuous Data Protection, windows, CDP Server 3.0, Backups, linux, NTFS, file systems, ext3
