Taking a Web 2.0 application into production: hosting thoughts

Faster, cheaper, better: choose any two. Hosting and data-center strategy is largely an optimization problem; every decision you make involves trade-offs, so knowing your choices becomes very important.

In planning for capacity you are limited by your slowest component. First make an informed guess whether it is CPU-, memory-, disk-I/O- or bandwidth-bound, based on measurements in your load-testing lab, which can hint at what your slower components might be.
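As a trivial illustration of that first guess, the resource closest to saturation at peak lab load is the likely bottleneck. A minimal sketch, with all names and utilization figures invented for illustration:

```python
# Hypothetical sketch: given peak utilization figures from a load-testing
# lab (all numbers below are made up), report the resource closest to
# saturation, i.e. the likely bottleneck.

def likely_bottleneck(utilization):
    """utilization maps resource name -> fraction of capacity used at peak."""
    return max(utilization, key=utilization.get)

measured = {
    "cpu": 0.45,       # 45% of CPU used at peak load
    "memory": 0.60,
    "disk_io": 0.85,   # disk is closest to saturation here
    "bandwidth": 0.30,
}

print(likely_bottleneck(measured))  # disk_io
```

In practice you would feed this from real counters (sar, iostat, your load balancer's stats) rather than a hand-written dict, but the decision rule is the same.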

In my view, a site typically needs:

1. Raw bit-pushing capability: how fast you can render content to the browser. That is what your users care about at the end of the day.
a) Host your small static content (Flash, JavaScript, CSS, images) as close as possible to end users, since for these the request-response time is nearly equal to your site's latency as seen by the end user. (Hint: buy the services of a CDN with servers that are geographically closer.)
b) Larger blobs of content, like progressive video downloads and the like, can and should be hosted wherever bandwidth is cheapest. Amazon S3 is a good starting point, as it requires no minimum commitment.
c) Ajax requests are typically designed to hide latency from the user, so ideally it shouldn't matter where in the world your application is hosted.
d) HTML rendering: are your pages cached? How many caching servers you need can be determined by estimating the data your application caches in memory for each user.
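The estimate in 1(d) is just arithmetic: per-user cache footprint times concurrent users, divided by usable RAM per server, rounded up. A back-of-the-envelope sketch where every figure is an assumption, not a recommendation:

```python
# Back-of-the-envelope sketch for 1(d): caching servers needed, from
# per-user cached data and concurrent users. All figures are illustrative.

def caching_servers_needed(concurrent_users, cached_bytes_per_user,
                           usable_ram_per_server):
    total = concurrent_users * cached_bytes_per_user
    # round up: you can't run a fraction of a server
    return -(-total // usable_ram_per_server)

GB = 1024 ** 3
# e.g. 200k concurrent users, ~50 KB of cached pages/session data each,
# servers with 4 GB of RAM usable for the cache
print(caching_servers_needed(200_000, 50 * 1024, 4 * GB))  # 3
```

Remember to leave headroom for peaks and for losing a cache node, so in practice you would provision above this floor.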
2. Number crunching/backend processing capability, including your database: your actual web application, middleware and database. This is where the real difference lies between the hardware capacity requirements of different applications. Run benchmarks by replaying synthesized traffic from a typical user session concurrently against your load-testing servers (hint: Perl WWW::Mechanize or JMeter). However, it is impossible to figure out in advance how your end users will actually use the site; they might stress the 5% of the code that is not optimized for performance and bring your site down anyway. Load testing alone doesn't yield the full picture, simply because it is nearly impossible to recreate real-world situations in a lab (including abuse and creative uses of your web application). Estimate how much processing you are doing on the stats/data collected on your site and how you feed the results of that processing back to your frontend application. Work out which parts are synchronous/real-time, which are near-real-time (batch processing approaching real time, hidden behind Ajax/Flash animations and the like), and which are truly batch-oriented.
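The session-replay benchmark hinted at above (WWW::Mechanize/JMeter) can be sketched in a few lines of Python: replay one recorded click path per simulated user, many users at once. The URL and paths below are placeholders, and the demo uses a stub fetch so the sketch is self-contained; swap in the real urllib-based default against your staging environment.

```python
# Minimal sketch of replaying a recorded user session concurrently against
# a load-testing target, in the spirit of the WWW::Mechanize / JMeter hint.
# The base URL and session paths are placeholders, not real endpoints.

import concurrent.futures
import time
import urllib.request

def replay_session(base_url, paths, fetch=None):
    """Fetch each path of one user session in order; return elapsed seconds."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url, timeout=10).read()
    start = time.monotonic()
    for path in paths:
        fetch(base_url + path)
    return time.monotonic() - start

def replay_concurrently(base_url, paths, users, fetch=None):
    """Replay the same session for `users` simulated concurrent users."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(replay_session, base_url, paths, fetch)
                   for _ in range(users)]
        return [f.result() for f in futures]

# Dry run with a stub fetch (just records the URLs) so this runs anywhere.
hits = []
timings = replay_concurrently("http://staging.example.com",
                              ["/", "/login", "/dashboard"], users=5,
                              fetch=hits.append)
print(len(hits))  # 15 requests: 3 pages x 5 simulated users
```

The interesting number under real load is the spread of per-session timings (worst case, not average) as you ramp the user count up.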

3. Setting up a new site, then, is more about starting with reasonably sized capacity and being able to react to capex calls by monitoring bandwidth, CPU, memory and disk-I/O usage for each component of the application, separated out by its class (bit-pushing/caching or number crunching). If you have a reasonable budget for capacity, start with 4-20 servers (real, or in the cloud on Amazon EC2 or another VPS-based cloud solution) running 2-4 instances of each component of your web application (outsource the things you don't want to worry about, like e-mail/DNS/CDN), and get a good-quality hardware load balancer (or buy shared access to one). Make sure you don't constrain your flexibility to add machines and switching capacity without requiring major physical layout changes. (Hint: buy larger switches than you need.)
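That "react to capex calls" loop reduces to comparing sampled utilization per component class against a headroom threshold. A hypothetical sketch, with the threshold and all sample numbers invented for illustration:

```python
# Hypothetical sketch of the capex-call check: flag which component classes
# have a resource past a headroom threshold. Threshold and samples are
# illustrative, not recommendations.

HEADROOM_THRESHOLD = 0.75  # flag a class once sustained usage passes 75%

def capex_flags(samples, threshold=HEADROOM_THRESHOLD):
    """samples: {component_class: {resource: peak fraction of capacity used}}."""
    flags = {}
    for component, resources in samples.items():
        hot = [r for r, used in resources.items() if used >= threshold]
        if hot:
            flags[component] = hot
    return flags

observed = {
    "bit-pushing/caching": {"bandwidth": 0.82, "cpu": 0.30},
    "number-crunching":    {"cpu": 0.55, "disk_io": 0.40, "memory": 0.91},
}
print(capex_flags(observed))
# {'bit-pushing/caching': ['bandwidth'], 'number-crunching': ['memory']}
```

In a real deployment the samples would come from your monitoring system (sustained peaks, not instantaneous spikes), and a flag would open a purchase discussion rather than auto-order hardware.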

4. The long-term goals for operating a web application are:
a) Bandwidth costs should decline as you use more and more of it, tending towards a very low (nearly zero) per-megabit cost.
b) The cost (setup plus rental, or amortization) of adding physical machines (of the standard chosen configuration) and switching/load-balancing capacity should increase no worse than linearly.
c) Geographical scale-up, by being able to replicate your first data-center node across the globe.
d) No single points of failure: at least two geographical sites, and redundancy in access links for bandwidth at each data-center node, load balancing, network switching, storage (multi-pathing) and your application components.
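Goal 4(a) follows from how bandwidth is typically priced in tiers. An illustrative model (the tier boundaries and prices below are invented, not quotes) showing the blended per-Mbit cost falling as committed usage grows:

```python
# Illustrative model of goal 4(a): with tiered bandwidth pricing, the
# blended per-Mbit cost declines as committed capacity grows. Tier caps
# and prices below are invented for illustration only.

TIERS = [  # (capacity cap in Mbit/s, price per Mbit/s per month in USD)
    (100, 50.0),
    (1000, 20.0),
    (float("inf"), 8.0),
]

def blended_cost_per_mbit(commit_mbps):
    """Average monthly cost per Mbit/s at a given committed rate."""
    total, prev_cap = 0.0, 0
    for cap, price in TIERS:
        band = min(commit_mbps, cap) - prev_cap  # Mbit/s bought in this tier
        if band <= 0:
            break
        total += band * price
        prev_cap = cap
    return total / commit_mbps

for mbps in (100, 1000, 5000):
    print(mbps, blended_cost_per_mbit(mbps))  # 50.0, 23.0, 11.0
```

The same shape holds for goal 4(b) inverted: you want per-machine cost flat (linear total cost) rather than rising as the fleet grows.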

5. Start small and choose wisely, and tend towards flexibility (aim for lower capex with no lock-in, even if it means higher opex initially), because you will live with the limitations created by your initial production-hosting decisions for a long time to come, or else face a painful and costly migration to another production environment.