Thread: Backup and Recovery Horror Story!

  1. #1
    Administrator Samantha Morris
    Join Date
    Nov 2010
    Location
    Toronto, ON
    Posts
    105

    Backup and Recovery Horror Story!

    Let's try something new.

    Tell us of a time when you had to recover data, for either yourself or a customer, that was particularly extraordinary. Tell us what caused the data loss, what was at stake and how you were able (or unable) to get things back on track again.

    How about this? I'll throw some loot in. Whoever posts the gnarliest story will win a prize. I won't tell you what the prize is now, but what I'll tell you is that you're definitely going to want to win.

    We'll need multiple tales of woe and triumph: at least 7 responses must be posted to this thread before someone can win the prize. Alright Backup Heroes, let's do this!

    I'll reveal the prize in about a week's time! :P

  2. #2
    Founding Member
    Join Date
    Jan 2011
    Posts
    2

    Welcome to Russia

    Six years ago in Moscow, Russia, my brother asked me to take a quick look at his Russian friend's backup server room.
    One of the backup servers could not be pinged from time to time - lol
    Modern office with security/face control, secure parking, video security, armed guards with AKMs and Saigas.
    Got a temporary pass with photo and barcode to access the server room.
    Server room - inside a refinished utility/cleaning/washroom area... no server racks,
    everything set up on the floor.
    Mould on the walls, water on the floor from a leaking pipe.
    Any time someone flushed the toilet, water from the pipe got on the UPS - the UPS got knocked out by the surge protection - the backup server shut down - could not be pinged. After a few hours the air cond./humid. dried the area and the UPS reset itself - the backup server restarted - became visible on the network - could be pinged.
    Until someone went to the washroom...
    Repeat over and over.
    The server room was managed by a 17-year-old kid - son of one of the company bosses.

    lol lol
    Welcome to Russia

  3. #3
    At the last company I worked at, I was the storage administrator among many other things. Since management pinched pennies everywhere except where it counted, acquiring additional disk space meant buying an Iomega Network Attached Storage array that barely qualified to hold the title of a NAS.

    When it was finally up, someone discovered that it had the capability of block storage through iSCSI. At about the same time, more space was needed for the Oracle database that supported one of the data warehouses.

    The NAS did indeed have the ability to serve block storage through iSCSI, but barely. The one important thing that was missing was any kind of filtering or security controlling which servers could connect to those LUNs. This fact will reveal its importance in a moment.

    I created an iSCSI LUN and connected it to the Oracle database server, creating the room needed for another datafile for the tablespace that had run short. Performance was fine, and it was decided that this would be an acceptable method of supporting the data warehouse.

    It did not stay a secret for long that there was more disk space available. When it was discovered that the storage could also be block, the envy was palpable and the requests started coming in fast.

    One of the first requests was for a 500GB volume for the new SUSE Linux environment coming up to replace the Novell 6.5 that plagued the data center. Being that Novell didn't bother to change their volume sharing system other than to recompile it to run under Linux, file storage was not an option. It had to be block and it had to be local.

    Before I had a chance to create the LUN, the requester emailed me a thank you for getting his request done so quickly. My face went white and my heart dropped. The volume had not been created yet and there was only one on the array: the LUN supporting the datafiles for the data warehouse. Guess what size it was? 500GB...which happened to be the same size requested. Naturally, when the Novell administrator scanned the array, he found the LUN he was looking for and connected to it.

    Sure as the sun rising in the east, my email started to fill with errors from three different monitoring solutions and a plethora of angry users without an application. The array was decommissioned and sent out for recovery, but it was an exercise in futility. The Novell application had low-level formatted the entire LUN and destroyed any data that may have existed.

    No problem, right? All we have to do is restore the backup from last night, since it is pretty much a read-only database anyway.

    It was discovered through recovery attempts that the previous night's backup had failed...as had the one the night before, and the week before, and the month before. In four years, the database had never been successfully backed up. There were NO backups!

    I spent over 100 hours getting what was left of the database restored (sort of) to a working copy, enough that I could export the data and import it into a fresh database installation. Unfortunately, the installation we were using was way out of date and a newer copy of the software had to be installed...which required a newer version of the database...which required a newer version of the OS...which required new hardware.

    When it was finally over, I had spent over 200 hours on the issue. We paid $10,000 to try to get the data back, unsuccessfully. We spent $30,000 on new hardware and software for the new installation. For another $5,000 upfront, we could have gotten an actual SAN with minimal WWID filtering that would have prevented this issue.

    I'll never forget what I felt when I saw that seemingly benign email exclaiming, "Thanks for getting this done for me so quick, bro!". I'm not talking about the "Oh no, this is going to cause work for me over the weekend" feeling. I'm talking about the full body pucker and the loss of extremity functionality, followed by the raging anger threatening to do things that, if caught on camera and posted on YouTube, would pass a million views overnight.
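
    The four years of silently failed backups in this story are the kind of thing a small scheduled check can surface early. Below is a minimal Python sketch, not what was actually run at that company: the function names, the flat directory of backup files, and the 26-hour and 1-byte thresholds are all assumptions made for illustration.

```python
import os
import time


def newest_backup(backup_dir):
    """Return (path, mtime, size) of the most recently modified regular file, or None."""
    newest = None
    for entry in os.scandir(backup_dir):
        if not entry.is_file():
            continue
        info = entry.stat()
        if newest is None or info.st_mtime > newest[1]:
            newest = (entry.path, info.st_mtime, info.st_size)
    return newest


def backup_problems(backup_dir, max_age_hours=26, min_size_bytes=1, now=None):
    """Return a list of problems found; an empty list means these checks passed.

    All thresholds are hypothetical defaults, not values from the story.
    """
    now = time.time() if now is None else now
    found = newest_backup(backup_dir)
    if found is None:
        return ["no backup files found"]
    path, mtime, size = found
    problems = []
    age_hours = (now - mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"newest backup is {age_hours:.0f} hours old: {path}")
    if size < min_size_bytes:
        problems.append(f"newest backup is only {size} bytes: {path}")
    return problems
```

    A check like this only proves that a recent, non-empty file exists. It says nothing about whether the file can actually be restored, so it complements a periodic test restore rather than replacing one.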

  4. #4
    Founding Member clayramsey
    Join Date
    Feb 2011
    Location
    D/FW Texas
    Posts
    7
    I think repetition plays a role in gnarliness. I have many customers from one region of the US that shall remain nameless. Each of them is responsible for picking files for backups. Most of these folks are dentists, mind you. Supposedly intelligent people who can work a mouse. Anyhow, I have lost count of the times that I have recovered these accounts only to see a blank / empty file list. The invariable question: "Did you pick your files for protection?" The answers range from "What do you mean?" to "I had to PICK files to back them up?" to bilious blame-the-nerd-a-thons.

    I used to get bugged by this, but now I've developed something of a callus. We can't and don't select files by policy. We do, however, get blamed for not being psychic enough to know what each of 300 million people wants to back up and how often.
    Clay Ramsey
    Data Backups
    Dallas / Ft. Worth, Texas

  5. #5
    How about a customer I had a LONG time ago, who backed up to QIC tape? At the end of the working day, he used to load the tape, enter a tar command to run the backup, listen for the tape to start moving, and leave. I was called in after their hard disk failed. I installed a new drive, reinstalled Unix, and went and got the backup tapes. Tried to read from the first - nothing but a header. Tried to read from the second - the same. All the tapes were blank. After more interrogation of the customer, he produced a tattered piece of paper with the command he used on it: "tar cvf /dev/rmt ." Note the "." on the end, which would back up everything from root downwards. His piece of paper had been around so long that the "." was no longer visible, so he was writing a header to the tape and nothing more. Oh, dear.
    I have another story about a large communications company that had complex issues that resulted in someone flying from one European city to another every day to carry backup tapes in case a recovery was needed, but telling THAT story might result in repercussions!
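
    A header-only tape like this can be caught by reading the archive back and counting what is in it. Here is a minimal Python sketch using the standard tarfile module. The function names and the one-entry minimum are assumptions for illustration, and it reads an archive file on disk rather than a tape device like /dev/rmt.

```python
import tarfile


def count_archive_members(archive_path):
    """Return the number of entries stored in a tar archive."""
    with tarfile.open(archive_path, "r") as archive:
        return sum(1 for _ in archive)


def backup_looks_usable(archive_path, min_members=1):
    """True if the archive can be read and holds at least min_members entries.

    min_members is a hypothetical threshold; pick one that fits the data set.
    """
    try:
        return count_archive_members(archive_path) >= min_members
    except (tarfile.TarError, OSError):
        # Unreadable, corrupt, or missing archives count as unusable.
        return False
```

    Running a check like this right after each backup, instead of just listening for the tape to start moving, would flag an archive with no entries the first night it happens.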

  6. #6
    Nice post. I like it. Thanks for sharing this information. Keep it up.
    alton
