If at First you Don’t Succeed: Using the Windows Azure Model

With the recent movement from preview to release of features such as Virtual Machines and Networking (along with a promise to match any price drops from Amazon’s AWS offering), Microsoft’s Windows Azure has been getting a lot of attention lately. One of the keys to the product’s success has been the cost at which Microsoft can deliver on a 99.95% monthly SLA. This level of uptime is achieved not by preventing hardware failures, but by ensuring that failures are planned for and can be recovered from quickly and automatically.
The Microsoft data center is filled with commodity-grade servers that are expected to fail at some point. A live copy of each customer’s virtual machines and data is kept as a “warm standby” within the data center, so that when a hardware failure occurs the application fabric can reroute traffic almost instantly. In addition to the copies kept in the primary data center, the data is mirrored to a geographically separate data center (such as Eastern US being backed up by Western US) to mitigate the risk of catastrophic failure.
Microsoft’s planning for failure in the data center demonstrates an important concept that anyone designing or implementing connected systems should take to heart. Any time an application crosses a boundary (whether it’s a process boundary on the same machine or a maze of routers leading to a service on the other side of the world), there is great potential for failure, and that failure will eventually occur. A resilient application accounts for failures and, when appropriate, takes steps to recover from them, such as waiting a short time and retrying a database request when a deadlock is encountered.
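The wait-and-retry idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TransientError` and `call_with_retry` are hypothetical names, not part of any Azure library, and a real application would need to decide for itself which exceptions (such as a deadlock error from its database driver) count as transient.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a deadlock)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation(); on TransientError, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Double the wait on each attempt and add jitter so that
            # competing callers do not all retry at the same moment.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Any other exception passes straight through, which is deliberate: retrying a failure that is not transient (a bad query, a permissions error) only delays the inevitable.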
One “out of the box” method for dealing with the potential for failure is built into the Windows Azure queues (both Queue Storage and Service Bus queues): a message that is retrieved from the queue but not subsequently deleted by the process that retrieved it will reappear on the queue after a set timeout. This relieves developers from having to implement their own fault-tolerant behavior, but it also means a message can be delivered more than once. That must be taken into account if the processing includes actions that should be performed exactly once (such as updating account balances in a financial system).
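To make the redelivery behavior and the exactly-once concern concrete, here is a small in-memory simulation. It is a toy model, not the Azure SDK: `InMemoryQueue`, `apply_deposit`, and the message shape are all assumptions made up for this example, and time is passed in explicitly so the timeout can be observed without waiting.

```python
class InMemoryQueue:
    """Toy stand-in for a queue with a visibility timeout (not the Azure SDK)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> (body, time at which it becomes visible)

    def put(self, msg_id, body):
        self._messages[msg_id] = (body, 0.0)

    def get(self, now):
        """Return one visible message and hide it until now + visibility_timeout."""
        for msg_id, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)


balances = {"acct-1": 100}
processed_ids = set()


def apply_deposit(msg_id, body):
    """Idempotent handler: a redelivered message does not change the balance twice."""
    if msg_id in processed_ids:
        return
    balances[body["account"]] += body["amount"]
    processed_ids.add(msg_id)


queue = InMemoryQueue(visibility_timeout=30)
queue.put("m1", {"account": "acct-1", "amount": 50})

msg_id, body = queue.get(now=0)
apply_deposit(msg_id, body)      # the worker "crashes" here, before delete()

redelivered = queue.get(now=30)  # the timeout has elapsed, so the message is back
apply_deposit(*redelivered)      # recognized as a duplicate and skipped
queue.delete(redelivered[0])
```

In a real system the record of processed message ids would have to be stored durably and committed together with the balance update; otherwise a crash between the two steps reintroduces the double-processing problem.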
In addition to the support built into the queuing products, the Microsoft Patterns and Practices team has produced the Transient Fault Handling Application Block (or “Topaz”), which is available at CodePlex or as a NuGet package. This block provides functionality to consistently implement retry strategies when consuming services that may experience transient faults. You can learn more about this application block on MSDN.
With this post, I hope to have encouraged you to think about how you can design and implement systems that accept and are resilient to failure. The next time you are getting ready to make a service call, ask yourself whether something other than logging an error and terminating should happen if the call times out.
