A recent post on Reddit's Sysadmin subreddit asked users to share their most notable tale of data loss, security breach, or IT SNAFU from this past year. With these types of incidents becoming increasingly prevalent over the last 18 months - including making some of the top news headlines of 2014 - it's interesting to hear lesser-known stories from normal, everyday system administrators.
While you may never actually have to worry about a highly publicized breach the likes of Sony Pictures or Target, the following stories (although somewhat bizarre) could very well happen to any of us if we're not careful.
Testing the Data Center's Fire Suppression System
Our first data loss story comes from /u/Ashmedai, who explained that an employee at a data center thought it would be a good idea to install a gaseous fire suppression system. Upon implementation, the employee rightfully decided to test the new "safety" system, but there was a small problem - they didn't notify the tenants of the data center.
According to /u/Ashmedai, the system's nozzles were installed with two minor flaws: they lacked a pressure deflector and, in at least one location, were aimed directly at an occupied rack. We'll let him explain what happened next:
Nozzle discharges on a sustained basis for 60+ seconds with enough force to create a volume loud enough to chase people out of the first floor even through a closed door. Racks in question cooled to subzero temperatures well below the ability to conduct thermal measurement. All RAID sets lost. Company's self-insurance program invoked.
According to the source of the story, this oversight led to nearly $1 million in equipment destruction - all because one guy didn't understand data center infrastructure and failed to notify tenants about tests of the new fire suppression system.
In the words of /u/Ashmedai, "Don't be that guy."
All Your Eggs in One RAID 5 Basket
We'll let /u/myairblaster set this scenario up, and you can guess what happens next:
My boss insisted on provisioning all LUNs as RAID 5. We were making 25-30TB LUNs in RAID 5 and storing a few hundred (500+) customer VMs on a single SAN with questionable backups. A lot of the VMs ran critical apps like Exchange, SQL, CRM and even a few ERP systems.
Any guesses as to where this is going?
According to /u/myairblaster, a month after he left the organization, a third of the hard drives in the SAN failed at the same time, destroying all the LUNs. That led to a week of 20-hour days for his former co-workers, including the manual reconstruction of many of the operating systems and databases, because the organization was relying on an open source backup solution that - apparently - didn't suit their needs.
In the end, the failure cost the company a week and a half of downtime, a ton of man-hours throughout the recovery process, and an undisclosed amount in small-claims court cases brought by customers.
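The risk this story illustrates can be roughed out with back-of-the-envelope math: during a RAID 5 rebuild, every bit on every surviving disk must be read, and a single unrecoverable read error (URE) can kill the rebuild. A minimal sketch, assuming a consumer-class URE rate of one error per 10^14 bits read (real drives, and the arrays in the story, may differ):

```python
# Rough estimate of the chance a RAID 5 rebuild hits an unrecoverable
# read error (URE) after one disk fails. Assumes an illustrative URE
# rate of 1 per 1e14 bits; actual drive specs vary.

def rebuild_failure_probability(num_disks, disk_tb, ure_rate=1e-14):
    """Probability of at least one URE while reading all surviving disks."""
    bits_read = (num_disks - 1) * disk_tb * 1e12 * 8  # every surviving bit
    p_clean = (1 - ure_rate) ** bits_read             # every read succeeds
    return 1 - p_clean

# e.g. a hypothetical 8 x 4 TB RAID 5 group
print(f"{rebuild_failure_probability(8, 4):.0%}")  # prints "89%"
```

The bigger the LUN, the more bits a rebuild has to read cleanly, which is why multi-terabyte RAID 5 sets like the ones described above are widely considered risky.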
Ransomware with a Twist
Some of the biggest headlines of 2014 were driven by the global rise of ransomware. While it's not something any company wants to experience, one small upside of a ransomware attack is that your data isn't necessarily lost - you can usually regain access to it if you're willing to meet the demands of whoever infected your system.
At least, this is true most of the time.
According to /u/Hippogrifld, they knew of a site that had roughly 10 years of server data stored on it. Unfortunately, a workstation on this domain ended up getting infected by the CryptoLocker ransomware. Even more unfortunate was the fact that this data was not backed up...
Finally, and this one takes the cake: the timing of the infection aligned exactly with the shutdown of CryptoLocker's Tor site. That was great news for most people, but it meant the unlock key was never sent, and the organization could no longer pay the ransom to retrieve its data.
While it's important to do what you can to prevent any type of malware from getting into your system, it's equally as important to have up-to-date backups of all your business-critical data.
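"Up-to-date" is the operative phrase, and it's cheap to verify automatically. A minimal sketch of a backup-freshness check, assuming backups land as files in a local directory - the directory layout and 24-hour window are illustrative, not details from the story:

```python
# Minimal backup-freshness check: flag backup files older than an
# allowed window so a stale (or silently failing) backup job gets
# noticed before you need the data. Paths and thresholds are examples.
import os
import time

def stale_backups(backup_dir, max_age_hours=24):
    """Return names of files in backup_dir older than max_age_hours."""
    cutoff = time.time() - max_age_hours * 3600
    return [
        name for name in os.listdir(backup_dir)
        if os.path.getmtime(os.path.join(backup_dir, name)) < cutoff
    ]
```

A check like this only proves a backup is recent, not that it restores; periodic test restores are the other half of the equation.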
Hose Bib Breach
/u/punkwalrus is a senior Linux administrator at a non-profit outside of Washington, D.C. Back in January 2014, as temperatures in the region dipped below five degrees Fahrenheit, a faucet on the fourth-floor balcony of his building structurally failed late at night, releasing a cataclysmic amount of water.
After hours of warning alarms from the organization's HVAC system, /u/punkwalrus's co-worker showed up to the complex to find water surging down all the way to the first floor entrance of the building. The ceiling of the third floor, which housed a brand new server room, had collapsed, creating a waterfall and dousing the racks for hours.
According to /u/punkwalrus, the loss amounted to:
...our entire two SANs representing dozens of terabytes of over 360 virtual machines, the networking core, all ESX servers, the load balancers, and pretty much all we had that we hadn't moved from the old server room in a building next door. I would say a good twenty racks worth of stuff, including several UPS systems, and our server room power grid.
Even with all of that physical damage, there was a glimmer of hope: an offsite backup had completed just before the disaster started. However, the backup infrastructure was designed for case-by-case restores and built on a 100mb pipe. That's better than nothing, but without an efficient way to fully restore your systems, the recovery process turns into a days-long marathon of sleepless nights for your entire team.
Note: If you want to see more of the excruciating details, /u/punkwalrus provided this link to a forum post from the time of the incident.
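The pipe problem is easy to quantify. A quick sketch of the arithmetic, assuming "dozens of terabytes" means roughly 50 TB and that the 100mb pipe is a 100 Mbps link running at ideal, sustained throughput (real-world transfers do worse):

```python
# Back-of-the-envelope restore time over a slow WAN link.
# 50 TB and 100 Mbps are illustrative assumptions, not figures
# confirmed in the original story.

def restore_days(data_tb, link_mbps=100):
    """Ideal-case days to move data_tb terabytes over a link_mbps link."""
    bits = data_tb * 1e12 * 8          # terabytes -> bits
    seconds = bits / (link_mbps * 1e6)  # ideal sustained throughput
    return seconds / 86400

print(f"{restore_days(50):.1f} days")  # prints "46.3 days"
```

Over a month and a half for a full restore, best case - which is why restore bandwidth deserves as much planning as the backups themselves.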
Do you have your own equally terrifying story from this past year? Tell us about it in the comment section below!