Tuesday, January 25, 2011

Global high availability setup question

I own and operate visualwebsiteoptimizer.com/. The app provides a code snippet which my customers insert into their websites to track certain metrics. Since the code snippet is external JavaScript (placed at the top of the site code), a visitor's browser contacts our app server before the customer's website renders. If our app server goes down, the browser will keep trying to establish the connection until it times out (typically 60 seconds). As you can imagine, we cannot afford to have our app server down in any scenario, because that would hurt the experience of not just our own website visitors but our customers' website visitors too!

We are currently using a DNS failover mechanism with one backup server located in a different data center (actually on a different continent). That is, we monitor our app server from 3 separate locations, and as soon as it is detected to be down, we change the A record to point to the backup server's IP. This works fine for most browsers (as our TTL is 2 minutes), but IE caches DNS entries for 30 minutes, which might be a deal killer. See this recent post of ours: visualwebsiteoptimizer.com/split-testing-blog/maximum-theoretical-downtime-for-a-website-30-minutes/
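The decision logic behind that monitoring is simple; here is a minimal sketch in Python of how such a monitor might pick the IP to publish. All the names here are hypothetical (the IPs are documentation addresses, and the actual A-record change would go through your DNS provider's API, which is not shown):

```python
import socket

PRIMARY_IP = "203.0.113.10"   # hypothetical app server IP
BACKUP_IP = "198.51.100.20"   # hypothetical backup in the other data center

def is_up(ip, port=80, timeout=5):
    """Health check: can we open a TCP connection to the app server?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_a_record(primary_up, backup_up):
    """Pick which IP the monitoring script should publish in the A record."""
    if primary_up:
        return PRIMARY_IP
    if backup_up:
        return BACKUP_IP
    return PRIMARY_IP  # nothing is up; leave the record pointing at primary
```

In practice you would run `is_up` from each of the 3 monitoring locations and only fail over when a majority agree the primary is down, to avoid flapping on a single location's network glitch.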

So, what kind of setup can we use to ensure an almost instant failover in case the app data center suffers a major outage? I read here www.tenereillo.com/GSLBPageOfShame.htm that having multiple A records is a solution, but we can't afford session synchronization (yet). Another strategy we are exploring is having two A records: one pointing to the app server, and the second to a reverse proxy (located in a different data center) which forwards to the main app server while it is up, and to the backup server otherwise. Do you think this strategy is reasonable?
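For what it's worth, the reverse-proxy half of that strategy is easy to sketch in nginx (hostnames and timeouts below are placeholder assumptions, not our actual setup): passive health checks mark the primary down after repeated failures, and the `backup` marker sends traffic to the second server only then.

```nginx
# Hypothetical upstream: primary in the app data center, backup elsewhere.
upstream app_servers {
    server app-primary.example.com:80 max_fails=2 fail_timeout=30s;
    server app-backup.example.com:80  backup;  # used only when primary is down
}

server {
    listen 80;
    location / {
        proxy_pass http://app_servers;
        proxy_connect_timeout 2s;              # give up fast, don't hang visitors
        proxy_next_upstream error timeout;     # retry the request on the backup
    }
}
```

Note that the proxy itself then becomes a single point of failure for the visitors whose browsers cached its A record, which is why it lives in a different data center from the app server.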

Just to be sure of our priorities: we can afford to keep our own website or app down, but we can't let our customers' websites slow down because of our downtime. So, in case our app servers are down, we don't need to return the normal application response. Even a blank response will suffice; we just need the browser to complete that HTTP connection (and nothing else).
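Since any empty response suffices, the fallback can be a trivial static responder rather than a full copy of the app. A minimal sketch in Python (the port, header choices, and `start_fallback` helper are assumptions for illustration): it answers every request with an empty 200 so the browser finishes the HTTP request immediately instead of hanging until timeout.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class BlankHandler(BaseHTTPRequestHandler):
    """Answer every request with an immediate empty 200 so the visitor's
    browser completes the connection and the page keeps loading."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/javascript")
        self.send_header("Content-Length", "0")
        self.end_headers()  # empty body: the tracking snippet simply does nothing

    def log_message(self, fmt, *args):
        pass  # keep the fallback box quiet under load

def start_fallback(host="0.0.0.0", port=8080):
    """Start the blank responder in a background thread; returns the server."""
    server = HTTPServer((host, port), BlankHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A box doing nothing but this can absorb a lot of traffic, so it is a cheap thing to run in the second data center as the failover target.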

Reference: I read this thread, which was useful: serverfault.com/questions/69870/multiple-data-centers-and-http-traffic-dns-round-robin-is-the-only-way-to-assure

  • Your situation is fairly similar to ours. We want split datacentres and network-layer failover.

    If you've got the budget to do it, then what you want is two datacentres, multiple IP transits to each, and a pair of edge routers running BGP sessions with your transit providers, advertising your IP addresses to the global internet.

    This is the only way to do true failover. When the routers notice that the route to your servers is no longer valid (which you can detect in a number of ways), they stop advertising that route, and traffic goes to the other site.

    The problem is that, for a pair of edge routers, you're looking at a fairly high initial cost to get this set up.
    Then you need to set up the networking behind all this, and you might want to consider some kind of Layer 2 connectivity between your sites as a point-to-point link, so that you'd have the ability to route traffic incoming to one datacentre directly to the other in the event of a partial failure of your primary site.

    http://serverfault.com/questions/110622/bgp-multihomed-multi-location-best-practice and http://serverfault.com/questions/86736/best-way-to-improve-resilience are questions that I asked about similar issues.

    The GSLB page of shame does raise some important points, which is why, personally, I'd never willingly choose a GSLB to do the job of BGP routing.

    You should also look at the other points of failure in your network. Make sure all servers have 2 NICs (connected to 2 separate switches) and 2 PSUs, and that your service is composed of multiple backend servers, as redundant pairs or load-balanced clusters.

    Basically, DNS "load balancing" via multiple A records is just load-sharing, as the DNS server has no concept of how much load is on each server. This is cheap (free).

    A GSLB service has some concept of how loaded the servers are, and of their availability, and provides somewhat greater resistance to failure, but it is still plagued by the problems of DNS caching and client pinning. This is less cheap, but slightly better.

    A BGP-routed network, backed by a solid infrastructure, is, IMHO, the only way to truly guarantee good uptime. You could save some money by using commodity servers running routing software instead of Cisco/Juniper/etc routers, but at the end of the day you need to manage these servers very carefully indeed. This is by no means a cheap option, or something to be undertaken lightly, but it is a very rewarding solution, and it brings you into the internet as a provider, rather than just a consumer.

    Paras Chopra : Thanks, I wanted to upvote your answer but couldn't because I am new. Well, yes, a BGP-routed network seems to be the way to go, but it can be fairly hard to set up and manage for a startup (in terms of both cost and manpower). I wish there were a cheaper solution for this, but probably there isn't.
    Tom O'Connor : I'm going to write this up as an essay on my blog tonight, I think. The cheapest solution for the edge routers, for you, would be a pair of Dell R200s, each with a couple of extra NICs and a stack of RAM (4-6GB should be sufficient), running something like FreeBSD with Quagga, or BIRD.
    Paras Chopra : Fantastic! I will be sure to check it out. Please update this thread with the link so that I don't miss it.
    voretaq7 : +1 on the El-Cheapo router solution - we're actually running FreeBSD routers at my company with great results. If you want something a bit more commercial (but still way cheaper than comparable Cisco gear), Juniper Networks gear (www.juniper.net) might also be a good choice.
  • Actually, the setup you want could also be used to help your split-testing activities, if you combine GeoDNS and DNS failover.

    Sending group A to IP 1 and group B to IP 2, even if they were on the same server, would let you separate your testing groups. Group A and group B are from different geographical regions. To be fair, the next day/week/month you flip the groups, to make sure that you allow for geographic differences. Just to be rigorous in your methodology.

    The GeoDNS/failover DNS service at http://edgedirector.com can do this.

    Disclosure: I am associated with the above link; I stumbled in here while researching an article on applying stupid DNS tricks to split testing.

    From spenser
