

Build 2016: Azure Lessons Learned from Microsoft Azure CTO

Mark Russinovich, CTO of Microsoft Azure, hosted a session yesterday at Build titled “Building Resilient Services: Learning Lessons from Azure”. It was a great session in which he detailed real-world examples of how Microsoft has failed while deploying new features and functionality to Azure.
Building, upgrading, and maintaining a platform where enterprises deploy mission-critical applications is no easy task. Can you imagine how many different ways a developer could deploy a bug that takes out services worldwide for millions of users? It’s a daunting thought.
This session was all about those failures, what went wrong, how they resolved it, and best practices for everyone else to avoid making these same mistakes in the future. Here are my best practice notes from the session:

  • If an application fails, it is likely to flood the error logs with data. Limit logs with a quota.

This example has probably happened to all of us as developers. In Microsoft’s case, the flooding log caused severe disk space and memory issues that spread and took the service down. The best practice here makes sense: cap the log so it can never grow large enough to take down the entire service.
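As a rough sketch, Python’s standard logging module can enforce this kind of cap with a size-based rotating handler; the file name, sizes, and logger name below are illustrative, not anything Azure-specific.

```python
import logging
from logging.handlers import RotatingFileHandler

# Cap the log at ~10 MB with a few backups so a misbehaving component
# cannot fill the disk with repeated error messages.
handler = RotatingFileHandler(
    "service.log", maxBytes=10 * 1024 * 1024, backupCount=3
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("my_service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```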

  • Don’t ignore or suppress warnings.

This goes along with the point above. If your application is emitting a warning, you should pay attention to it and get it resolved. You may be filling up your error logs with unnecessary data or, even worse, ignoring a critical failure that simply hasn’t surfaced yet. Don’t ignore the warnings.
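As a small illustration (not something from the session), Python can route warnings into the same log as your errors and escalate the categories you care about, so they stay visible instead of being silently suppressed.

```python
import logging
import warnings

# Send Python warnings through the logging system instead of stderr,
# so they land in the same place as errors and are not lost.
logging.captureWarnings(True)

# Optionally escalate a warning category you care about into an error
# during testing, so it cannot be quietly ignored.
warnings.simplefilter("error", DeprecationWarning)
```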

  • Log like everybody is watching.

On a large development team like Microsoft’s, the engineer who wrote the code is almost certainly not the engineer who will fix the problem. The first step in fixing a problem is identifying its cause, and you can only do that when the error log has enough detail to lead you to that root cause. The examples Mark described cost support engineers countless hours as they tried to track down vague errors that didn’t point to a specific function or method call. A little extra detail can go a long way.
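Here is a minimal sketch of what that extra detail might look like in Python; the component, function, and exception names are hypothetical, but the idea is to log the operation, its inputs, and the full stack trace.

```python
import logging

logger = logging.getLogger("billing")  # hypothetical component name

class PaymentError(Exception):
    """Hypothetical error raised by a downstream payment service."""

def submit_payment(customer_id: str, amount: int) -> None:
    # Stand-in for a real downstream call that may fail.
    raise PaymentError("card declined")

def charge_customer(customer_id: str, amount: int) -> None:
    try:
        submit_payment(customer_id, amount)
    except PaymentError as exc:
        # Log the operation, its inputs, and the full stack trace so the
        # engineer reading this later is not left guessing at the cause.
        logger.error(
            "charge_customer failed for customer_id=%s amount=%d: %s",
            customer_id, amount, exc, exc_info=True,
        )
        raise
```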

In the next section, Mark discussed exceptional coding – or handling exceptions. It’s very difficult to have universal exception coverage, and third-party code can throw an error whose cause you may not be able to determine. One part of the debate: do you fail fast or catch all? If you let your code fail fast, it can bring down your entire application. Sometimes that is good, because you identify an error quickly and resolve it; other times it can be disastrous. Conversely, if you catch all exceptions, you may never notice a persistent error condition. Visual Studio’s code analysis warns against catching general exception types, and Mark went into more detail on which exceptions you should catch and which you should not. The best practice here:
  • Log once per hour per exception per machine

This relates to the first best practice as well. Limiting how you log exceptions will greatly reduce your log file quota usage.
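A rough sketch of that throttling in Python, assuming a simple in-process cache keyed by exception type; the interval, logger name, and helper are illustrative, not the mechanism Azure uses.

```python
import logging
import time

logger = logging.getLogger("my_service")  # hypothetical service name

# Track when each exception type was last logged on this machine.
_last_logged = {}
LOG_INTERVAL_SECONDS = 3600  # at most once per hour per exception type

def log_exception_throttled(exc: Exception) -> None:
    """Log an exception at most once per hour per exception type."""
    key = type(exc).__name__
    now = time.monotonic()
    last = _last_logged.get(key)
    if last is None or now - last >= LOG_INTERVAL_SECONDS:
        _last_logged[key] = now
        logger.error("Unhandled %s: %s", key, exc, exc_info=exc)

# Usage: catch the specific exceptions you expect, then throttle the noise.
try:
    1 / 0
except ZeroDivisionError as exc:
    log_exception_throttled(exc)
```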
When investigating errors, an engineer will always examine the error log, and the log is timestamped. Often one error causes another, and you usually start at the bottom of the error chain; to see the full chain, you need to match errors sequentially by time. What happens when your machines aren’t logging in the same time zone? Mark showed one example where the first error was logged in Pacific Time while the other errors were logged in UTC. Best practice:

  • Always use UTC when logging errors
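A minimal sketch with Python’s standard logging module: point the formatter’s time converter at UTC so every machine stamps errors on the same timeline. The format string is illustrative.

```python
import logging
import time

# Force log timestamps to UTC so errors from machines in different
# regions line up on a single timeline.
formatter = logging.Formatter(
    "%(asctime)sZ %(levelname)s %(name)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
formatter.converter = time.gmtime  # convert timestamps to UTC, not local time

handler = logging.StreamHandler()
handler.setFormatter(formatter)

logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)
```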

This post is getting pretty long, so I won’t go into full detail on the rest. You can watch the replay of Mark’s session on Channel 9 – https://channel9.msdn.com/Events/Build/2016/B863
Other best practices:

  • Store secrets and keys in Azure Key Vault (see the sketch after this list)
  • Avoid global services, partition instead
  • Test in a canary environment with production load
  • Bake in production for 24 hours
  • Isolate environments and keys, don’t let them see or talk to each other
  • Isolate endpoints and prevent cascading failures
  • Ensure graceful degradation

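For the Key Vault item above, here is a minimal sketch of reading a secret with the azure-identity and azure-keyvault-secrets packages for Python; these SDKs postdate the session, and the vault URL and secret name are placeholders, so treat this as illustrative only.

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholder vault URL and secret name; authentication comes from the
# environment (managed identity, CLI login, etc.) rather than from code.
vault_url = "https://my-vault.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

connection_string = client.get_secret("storage-connection-string").value
```
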
I hope this post is helpful for all you Azure Architects out there. Stay tuned to this blog for more great content from Build.