Friday, January 28, 2011

Is there a good way to back up a petabyte of data and store it?

I'm starting to see clients with hundreds of terabytes of data (in SQL Server installations). As the total volume of data in some enterprises approaches meaningful fractions of a petabyte, I'd like to canvass the collective knowledge base out there to see what people dealing with that magnitude of data are doing to safeguard it.

The obvious issue is that storing multiple backups of that much data is prohibitively expensive using enterprise-class storage - heck, even using just RAID-5.

Options I see are as follows:

  1. Create a mirror copy of the data in another data-center, and continually ship differences to it (using whatever mechanism is available for your data source - e.g. log-shipping or database mirroring with SQL Server)
  2. Take regular backups using a hefty compression algorithm (probably only suitable if the data lends itself well to being heavily compressed)
  3. Take piecemeal backups of the critical/changing parts of the data.
  4. Don't back up the data and trust the corruption-gods.

I'm seeing option #4 being adopted as the default, and as an HA/DR expert I find that really scary - but what do I advise as an alternative? I think #1 is the best approach (a rough sketch of its log-shipping flavor follows below), but "I don't think so" is the usual answer when any alternative apart from #4 and possibly #3 is suggested.

Now, of course it depends on the change rate and criticality of the data. No need to answer with that: I used to be responsible for all the HA features of SQL Server while I worked at Microsoft, so I'm well-versed in the 'it depends' arguments - that's my catch-phrase :-)

I'd be very interested to hear of any alternatives I've missed, or to hear that everyone else is in the same boat and there's no realistic alternative to spending lots of money on more storage.

Thanks in advance - due credit will be given to all well thought-out and expressed answers.
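
To make option #1 concrete, here's a minimal T-SQL sketch of its log-shipping flavor - the database name, share, and file names are hypothetical, and a real setup would wrap this in scheduled jobs:

    -- On the primary: take a transaction log backup (hypothetical names/paths).
    BACKUP LOG BigDB
        TO DISK = N'\\BackupShare\BigDB\BigDB_20110128_1200.trn';

    -- On the secondary copy (previously restored WITH NORECOVERY or STANDBY):
    -- apply the shipped log backup, staying ready to apply the next one.
    RESTORE LOG BigDB
        FROM DISK = N'\\BackupShare\BigDB\BigDB_20110128_1200.trn'
        WITH NORECOVERY;

The point is that only log records cross the wire, so the second copy costs bandwidth proportional to the change rate, not the total data volume - the petabyte only has to be copied in full once.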

  • Option 1 is mirroring, which is almost as bad as #4: any bug that corrupts data, and isn't discovered immediately, will corrupt both copies.

    If the data is critical, consider dedicated solutions; read about IBM's Shark products, for example, or competing products from EMS, etc. They have features like FlashCopy that allow you to instantly create a logical copy of a volume without doubling disk requirements; you can then back up this copy to (e.g.) tape. Look into robotic tape backup as well.

    Paul Randal : Database mirroring in SQL Server ships log records, not physical pages, so most corruptions don't get copied to the mirror. Yup, anything that allows a split-mirror + backup to be taken works, but you're still left with the problem of where to put the damn thing if it's a PB. But anything that is diffs-only-from-the-original (e.g. database snapshots in SQL Server) is heavily susceptible to corruption of the underlying source data, making the diffs useless too. Have you tried storing a PB on tape and restoring it during disaster recovery? *Days* of downtime :-( Although still better than total data loss. Thanks for the answer!
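
    For reference, this is what a SQL Server database snapshot looks like (all names and paths are hypothetical, and the source is assumed to have a single data file) - it's a sparse, copy-on-write file, which is exactly why it's useless once the underlying source database is damaged:

        -- Hypothetical names/paths: a sparse, copy-on-write snapshot of BigDB.
        -- It stores only pages that change in the source after this point.
        CREATE DATABASE BigDB_Snapshot_1200
        ON ( NAME = BigDB_Data,                         -- logical data file name in BigDB
             FILENAME = N'S:\Snapshots\BigDB_1200.ss' )
        AS SNAPSHOT OF BigDB;
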
  • Print it out and re-key it on failure.

    Paul Randal : Terribly un-useful reply
    hayalci : instead, it is a very funny reply :)
    Nick Kavadias : +1 for humor. 1000 monkeys at 1000 terminals. I think it could work!
    Brent Ozar : Hahaha. I made the mistake of adding a humorous answer a couple of weeks ago and I was so pissed when somebody downvoted me right away. Funny reply, made me laugh, but next time think about adding this as a comment to the question instead of an answer. That'll protect you from the downvotes.
    Paul Randal : When I was at MS there was a customer who had to do this - with 200 staff members spending 3 weeks inputting the data. Funny until you're involved in a situation where it actually happens.
    : Not such a stupid reply. In China, many things are done this way since it's cheaper than equipment or automation...
    From Tony Bain
  • Point out to those that want to store a Petabyte of data that storage ain't cheap.

    I get so fed up with people moaning about not having an extra Terabyte of online storage because disc is cheap - disc may be, but managed storage sure as hell isn't.

    If it's prohibitively expensive to store the backups then it's prohibitively expensive to store the data in a safe manner, so the proposed solution isn't viable.

    One of the most important reasons for having backups is protection from user error (most hardware failure problems can be dealt with by hardware solutions) but even database mirroring is no protection against a dropped table (OK, you can protect against that, but it's still possible to get unremovable guff into your DB - unless the reason the DB is so big is that it only ever issues inserts).

    As I see it tape is no longer a viable solution - it is now cheaper to just work with disc arrays (though physical storage can be awkward). So I think your only option is some method of splitting the data into chunks small enough to be restored in a sensible timeframe and then getting them onto disc storage on a regular basis (and here EMS type solutions can help, if you've got the cash).

    Paul Randal : Yup - I'm proposing option #3 more and more: use date-based partitioning of the data if you can and only back up the most recent data frequently (a rough sketch follows after these comments) - but you'd be surprised at the number of people who want to support VLDBs with archaic schemas and still expect to be able to efficiently back up, manage, and maintain the data. I'd have to agree with you about tape; for VLDBs you may as well go with disk and pay the cost as a trade-off against fast recovery time. Thanks for the answer!
    Farseeker : I agree. If you can't afford a backup solution, you can't afford the storage. Too many people see storage as just the price of the disks.
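
    As a concrete illustration of the partial-backup idea above, here's a hedged T-SQL sketch (database and path names are hypothetical): with older data parked on read-only filegroups, the frequent backups only have to cover the read-write portion.

        -- Hypothetical names/paths: back up only the primary filegroup and any
        -- read-write filegroups; read-only archive filegroups are skipped and
        -- only need one backup taken after they're marked read-only.
        BACKUP DATABASE BigDB
            READ_WRITE_FILEGROUPS
            TO DISK = N'X:\Backups\BigDB_ReadWrite.bak'
            WITH INIT, CHECKSUM;
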
  • Technically, storage is cheap, but at the petabyte level, not so much. It really depends on the application, but I'd say some combination of strategies #2 and #3 is going to be the answer - with #2 a given, and #3 depending on how much investment you can make in storage, and on the kind of storage and IO/computational power that will let you get away with as little incrementalism and as much discrete, full backup as possible.

    Alternatively, something like Amazon S3 may also come into play depending on your bandwidth and how much change there is in the data -- at this volume, putting at least some of it on someone else's servers and letting them worry about redundancy gets more and more cost effective.

    Matt Simmons : I've got to agree with the person who asked the question. Storage is cheap. /Managed/ storage is expensive as hell.
    From nedm
  • Off-the-wall idea - is all of the stored information needed or even useful?

    How much is the information actually worth? It seems obviously ridiculous to spend more in upkeep and management than the data is worth.

    Is the data in the database appropriate for storage in a database? For example, does keeping compressed multi-gigabyte core files in the support organization's database really provide any actual benefit?

    Is there a lot of duplicated data in the database? For example, are a thousand people keeping ten copies each of a weekly 10MB newsletter?

    Does some of the data have an "expiration date" after which it does not provide any value? Returning to the support organization example, for various reasons there is virtually no benefit in keeping around customer core files more than a few months after a fix has been delivered.

    Another thought - is keeping that much data opening the company up to liabilities? Some data one must, by law, keep. Some data, however, should be "shredded" because of the risks posed if it is accidentally, or maliciously, released to inappropriate parties.

    From EricJLN
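
    If some of the data really does expire, here's a hedged sketch of the kind of batched purge that keeps each transaction (and its log usage) small - the table, column, and retention period are hypothetical:

        -- Hypothetical table/column/retention: delete expired rows in small
        -- batches so no single transaction bloats the log or blocks for long.
        DECLARE @rows int;
        SET @rows = 1;
        WHILE @rows > 0
        BEGIN
            DELETE TOP (10000) FROM dbo.SupportCoreFiles
            WHERE UploadedDate < DATEADD(MONTH, -6, GETDATE());
            SET @rows = @@ROWCOUNT;
        END;
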
  • Interesting video detailing myspace.com's architecture (SQL2005 backend). Not sure whether they have individual petabyte DBs, as they scale out across multiple DBs. They use SAN snapshot backups.

    http://wtv.watchtechvideos.com/topic70.html

  • Yeah, another option is storage virtualization: a device that sits between your servers and the SAN, like IBM SVC. SVC manages SAN-to-SAN copies, and can do remote replication (although that's obviously pretty painful at the petabyte level unless you have really low data change rates and really high bandwidth.)

    The slick part is that the whole process is invisible to the servers involved. If you're using SQL Server, you design your filegroups to keep things with a low change rate together (like sales archives from >3 years ago), and things with a high change rate (like current sales) on a separate filegroup (a minimal filegroup sketch follows at the end of this answer). They don't even have to be completely read-only - you just want to design it so that you can use different replication methods for each filegroup. The SAN gear can sync LUNs via the network, via tape, or SAN-to-SAN - meaning you can ship parts of the SAN back and forth. This is more effective with gear like LeftHand's, where the SAN is made up of a pool of participating units.

    Then you can sync the low change rate stuff over the wire automatically, and sync the high change rate with sneakernet. (Sounds like I've got that backwards, but it's true - you can't sync the high change rate stuff over the wire due to volume.) Even some of the low-end gear accommodates this now: LeftHand lets you replicate to other LeftHand units in your datacenter, and then ship them to your offsite datacenter. Plug 'em in, join them to the remote side by changing IPs and groups, and now they're part of your remote backup SAN. The LeftHand sales pitch on this is just brilliant: set up your two SANs side-by-side in your primary datacenter, get them in sync, then you can ship parts of them over to the remote datacenter while some of them stay in your current datacenter to keep in sync. Gradually move 'em over without getting way out of sync.

    I haven't done this at the petabyte level, though. You know what they say - in theory, in theory and in practice are the same. In practice...

    SuperCoolMoss : Hi Brent, is there hardware available which compresses data at the SAN level?
    Paul Randal : Cool stuff - thanks Brent
    Brent Ozar : SuperCoolMoss - yep, absolutely. NetApp bundles dedupe into its SANs for free now, for example. Check with your SAN vendor and ask what dedupe solutions they offer.
    Brent Ozar : And you're welcome, Paul. :-D
    Sam : We were running the incipient virtualization software for a while. Ended up uninstalling from the switches due to some issues. Sounded great, but didn't work out for us.
    From Brent Ozar
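
    To make the filegroup idea above concrete, here's a minimal T-SQL sketch (all names, paths, and sizes are hypothetical) that separates low-change-rate archive data from high-change-rate current data so each can be backed up or replicated on its own schedule:

        -- Hypothetical names: cold archive data and hot current data go on
        -- separate filegroups (and separate LUNs/volumes on the SAN).
        ALTER DATABASE BigDB ADD FILEGROUP ArchiveFG;
        ALTER DATABASE BigDB ADD FILE
            ( NAME = ArchiveData1,
              FILENAME = N'S:\Archive\BigDB_Archive1.ndf',
              SIZE = 100GB )
            TO FILEGROUP ArchiveFG;

        ALTER DATABASE BigDB ADD FILEGROUP CurrentFG;
        ALTER DATABASE BigDB ADD FILE
            ( NAME = CurrentData1,
              FILENAME = N'F:\Current\BigDB_Current1.ndf',
              SIZE = 100GB )
            TO FILEGROUP CurrentFG;

        -- Tables (or partitions) are then created on the appropriate filegroup.
        CREATE TABLE dbo.SalesArchive
            ( SaleID bigint NOT NULL, SaleDate date NOT NULL ) ON ArchiveFG;  -- rarely changes
        CREATE TABLE dbo.SalesCurrent
            ( SaleID bigint NOT NULL, SaleDate date NOT NULL ) ON CurrentFG;  -- changes constantly
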
  • IMO, if you have that much data and don't have some kind of godzilla-level hardware, you should be using a backup compression technology. I'm most familiar with LiteSpeed, but there are similar products from other vendors and (of course) a similar feature is built into SQL2008 (sketched at the end of this answer). You might not get 10-to-1 compression, but it does cut the storage requirements for the backup, and can also shrink your backup window. If your goal is to keep multiple backup sets (yesterday plus the day before that, plus one from last week and one from last month, or a series of differentials plus fulls - which can get plenty big if you change a lot of the data in the database), it becomes a simple matter of storage space.

    Filegroup-based backup (IOW, put non-volatile data onto certain FGs and then back them up infrequently) never seems to fly because devs or users won't or can't decide what data is volatile and what isn't, and in brownfield scenarios you often can't take the risk.

    If a failover site is a requirement, in addition to thinking about Database Mirroring you might want to talk to your clients' storage vendor to see if they offer something like SRDF, which is a hardware-based data replication technology. Naturally, replication (of any sort, but particularly realtime or near-realtime replication) is not a substitute for backups.

    Matt Simmons : I'm really looking forward to the time when I can get a data dedup storage solution. It's not going to be any time soon, but the nature of my data would probably lead to a cut in size-on-disk of like 75%
    Paul Randal : Yup - backup compression is my option #2, but often another DC is required. I like the idea of having a remote SAN with different ways of syncing LUNs. Thanks
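
    A hedged sketch of the native SQL Server 2008 feature mentioned above (database name and path are hypothetical); third-party tools like LiteSpeed expose broadly similar options:

        -- Hypothetical names/paths: native backup compression (SQL Server 2008+).
        BACKUP DATABASE BigDB
            TO DISK = N'X:\Backups\BigDB_Full_Compressed.bak'
            WITH COMPRESSION, CHECKSUM, STATS = 5;

    How much it helps depends entirely on how compressible the data is, but even 2-to-1 or 3-to-1 cuts the storage and I/O needed per backup set substantially.
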
  • I don't think you have much of a choice here on tape v. disk. Tape won't likely cut it in a regular backup window unless you stripe it, and I'm not sure the reliability is there.

    So you are down to disk backups. Are you versioning? Meaning do you worry about going back to backup 2 (current db minus 2 backups)? Or backup 3? In that case, you might have issues, but likely what you have to handle is log backups, not so much data backups.

    If you can split off some of the data as read-only/non changing, then perhaps you have manageable backup sizes/windows. Or at least you are hoping that backup technology and bandwidth is catching up with data growth.

    I don't think you're backing up so much as keeping a second copy around in order to recover from issues with your primary. That means hardware failures, corruption, etc., and you are praying daily that errors aren't being shipped to the second copy. The copies are most likely being made SAN-to-SAN with some snapshotting technology, although the original copy might travel via FedEx rather than across the wire. Bandwidth to move 100TB is not easy to come by for anyone.

    I think you need a combination of 1, 2, and 3 (not 4), with excellent log backup management.

    Actually, I think that at any point in time you are really looking at three copies of your data: run CHECKDB on one of the copies while the second copy is being used to actually receive changes, then snapshot that second copy over to the first and continue. With this much data, I'd imagine you would need some diligence here. Paul, how does CHECKDB work on a multi-user, 100TB db that is online?

    As mentioned, aren't log backups, and probably a log reader, critical? Don't you need to recover drop tables/user error from the logs rather than a backup? You can potentially shortcut this by sending SAN copies through some delay, but I haven't seen that technology. A Log Shipping SAN that can delay changes by 4 hours (or some interval) to allow you to recover from issues before overwriting data. Or some log-reader-of-SAN-block-changes tool? Without that, you need to manage those transaction logs, which might be a whole other level of tracking those backups on various file systems for some xxx hours to allow you to potentially recover from non-fatal errors.

    Paul Randal : Hey Steve - some customers need versions, some don't. Depends on how advanced their HA/DR thinking is and how much money they have. CHECKDB on a 100TB database? No idea - I never tested it above several TB and AFAIK it hasn't been tested > 10 TB. I'd love to hear how it does in 2005/2008. Thanks
    Steve Jones : Hey, you're the guy that should ask for a test. Maybe Mr. Cox at SQLCAT can run one. The HA/DR situation matters. Amazon might not care about versions. Others might depend on legal/regulatory issues. It's something to think about.
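
    Picking up the point above about transaction log management and user error, here's a hedged T-SQL sketch of the usual recovery path - restore the last full backup elsewhere, then roll the logs forward and stop just before the bad statement (names, paths, and the timestamp are hypothetical):

        -- Hypothetical names/paths: frequent log backups during the day...
        BACKUP LOG BigDB
            TO DISK = N'X:\Backups\BigDB_1300.trn' WITH CHECKSUM;

        -- ...make it possible to recover a dropped table by restoring the last
        -- full backup to a copy and stopping log replay just before the mistake.
        RESTORE DATABASE BigDB_Copy
            FROM DISK = N'X:\Backups\BigDB_Full.bak'
            WITH MOVE N'BigDB_Data' TO N'R:\Restore\BigDB_Copy.mdf',
                 MOVE N'BigDB_Log'  TO N'R:\Restore\BigDB_Copy_log.ldf',
                 NORECOVERY;

        RESTORE LOG BigDB_Copy
            FROM DISK = N'X:\Backups\BigDB_1300.trn'
            WITH STOPAT = '2011-01-28 12:59:00', RECOVERY;

    Of course, at this scale "restore the last full backup elsewhere" is exactly the part that takes days, which is the original problem.
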
  • Speak to your storage vendor; they'll have a deduplication product that they've used before, and combined with regular compression you can often reduce your data footprint by 70%. Of course, anyone with the money to spend on a petabyte of storage is also likely to have the budget to buy a decent backup solution too - and if they haven't, you just need to ask them what losing that petabyte would cost their business.

    Paul Randal : Yup - had the compression as option 2, and most of these customers don't have a lot of duplication in their data. Disagree about the extra money - sometimes (and often) the data volume growth outstrips budget for redundant storage. Several Fortune-100 companies I work with are in that state for some of their applications.
    Paul Randal : But thanks for the comment!
    From Chopper3
  • ZFS. Sure, it's still just getting started, but there are a number of areas where ZFS is designed to handle just this sort of thing. First off, there's its ability to handle a large amount of data, as well as a multitude of different storage devices (local, SAN, fiber, etc.), all while keeping data safe with checksums and "layer violating" awareness of device health and failures. How, though, does this help with backing up this much data?

    One method is to use snapshots. Take a snapshot and send it to tape/disk/net for transfer to the remote site. Subsequent snapshots send only the data that has changed since the previous one, and you can keep live data on both ends if need be.

    The other is to use Solaris Cluster software, where (so long as you have sufficient network bandwidth) you can have live mirroring between two servers, and if one goes down, the second can take over. It's more for use where high availability (HA) is important, but I would guess that most places with that much data want HA.

    And yes, ZFS isn't supported on Windows, the usual place you'd find SQL Server, but maybe you run Sun/ZFS on the back end and connect via iSCSI. Maybe that's a horrid idea too, but it's at least worth giving some thought so you know what not to do.

    Paul Randal : Interesting idea - I wish I had some more hardware to play around with ideas like this.
    From jasonrm
  • In a large enterprise data warehouse, much of the data comes from sources that are already backed up. I've worked on Teradata and ODW installations where they've taken option #4, knowing that they could restore a day or two of transactional data and transform it again from the source systems.

    At one retail client (at the time they had one of the top 5 largest DWs in the world, at about 200TB ... which gives you an idea of how long ago this was), they went with option #1 after purchasing a new Petabyte-class Teradata server. The old nodes would be used for a snapshot of the prior day's system, while the new one maintained the current data. This was also nice from a failover perspective - every once in a while they'd take the whole thing down for maintenance and we'd just switch over to using the old slow server with day-old data.

    Honestly though, it seemed like a big waste of processing/storage/etc to keep the thing going ... particularly when the biggest advantage was that their admins and NCR techs had to work fewer evenings to perform irregular maintenance.

    From Jess
