Downtime Release

LinkedIn downtime screenshot


The CEO of a company running a large Web 2.0 site once asked me why we can’t have a zero-downtime release of our Web 2.0 portal when the Facebooks, YouTubes and LinkedIns of the world appear to have 100% uptime. While it is possible to give an illusion of zero downtime, it is seldom possible to serve 100% of users all the time for every type of application. For applications that do not depend much on state, such as search, it may be possible to redirect a portion of users to another version of the application without visible downtime for any of them. Other applications, such as webmail, can be upgraded for a small percentage of users at a time, where the upgrade window is short enough that asking a user to come back in a few minutes to access the mailbox is not a big deal. Yet another set of applications can rely on a write cache that holds database writes in a format independent of the database schema, so that the writes can wait while the current version of the database is locked, backed up and upgraded to the latest schema. Many Web 2.0 sites rely on batch processing and caching to give an illusion of near-real-time updates; operations seldom happen synchronously within an HTTP request, so their processing can be delayed until after an upgrade has finished.
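As a rough illustration of serving a portion of users from another version of the application, here is a minimal Python sketch. It assumes users can be identified by a string ID and that a router consults a rollout percentage; the names `bucket_for`, `version_for` and `rollout_percent` are hypothetical and not taken from any particular site or product.

```python
import hashlib


def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in the range [0, buckets)."""
    # A cryptographic hash is used only to spread IDs evenly; the same
    # user always lands in the same bucket across requests and servers.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def version_for(user_id: str, rollout_percent: int) -> str:
    """Pick the application version ("old" or "new") for this user.

    rollout_percent is the share of users (0 to 100) sent to the new version.
    """
    return "new" if bucket_for(user_id) < rollout_percent else "old"


if __name__ == "__main__":
    # Raising the percentage only ever moves users from "old" to "new".
    for percent in (0, 10, 50, 100):
        print(percent, version_for("user-42", percent))
```

Because the bucket depends only on the user ID, a given user does not flip back and forth between versions as the rollout percentage is raised.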

The cost of 99.99% uptime is prohibitive: at a minimum it involves real-time data replication between different versions of the database schema and running multiple versions of the web application simultaneously. While most applications can’t have 100 percent uptime for all users, smarter operational processes coupled with database sharding can create the illusion of it, with different sets of users experiencing the upgrade at different times. But every once in a while the system needs a downtime release, when a new application version makes major changes to the information architecture.
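The shard-by-shard idea can be sketched roughly as follows. This is a toy Python model, assuming users are assigned to shards by a simple modulo on a numeric user ID and that a schema upgrade takes one shard offline at a time; `Shard`, `shard_for` and `rolling_upgrade` are hypothetical names, and the actual lock, backup and migration steps are reduced to a comment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Shard:
    name: str
    schema_version: int = 1
    online: bool = True


def shard_for(user_id: int, shards: List[Shard]) -> Shard:
    """Assumed placement rule: users are spread across shards by modulo."""
    return shards[user_id % len(shards)]


def rolling_upgrade(
    shards: List[Shard],
    target_version: int,
    on_step: Optional[Callable[[List[Shard]], None]] = None,
) -> None:
    """Upgrade shards one at a time so only one set of users is affected."""
    for shard in shards:
        shard.online = False  # users on this shard see a short downtime
        if on_step is not None:
            on_step(shards)  # hook for observing the state mid-upgrade
        # A real system would lock, back up and migrate the shard here.
        shard.schema_version = target_version
        shard.online = True


if __name__ == "__main__":
    shards = [Shard("shard-0"), Shard("shard-1"), Shard("shard-2")]
    rolling_upgrade(
        shards,
        target_version=2,
        on_step=lambda all_shards: print(
            [s.name for s in all_shards if not s.online]
        ),
    )
```

At any moment in this model at most one shard is offline, so each user sees the upgrade at a different time. The application still has to work against both schema versions while the rollout is in progress.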