Mark Russinovich, CTO of Microsoft Azure, hosted a session yesterday at Build titled “Building Resilient Services: Learning Lessons from Azure”. This was a great session in which he detailed real-world examples of how Microsoft has failed while deploying new features and functionality to Azure.
Building, upgrading, and maintaining a platform where enterprises deploy mission-critical applications is no easy task. Can you imagine how many different ways a developer could deploy a bug that would take out services worldwide for millions of users? It’s kind of a daunting thought.
This session was all about those failures, what went wrong, how they resolved it, and best practices for everyone else to avoid making these same mistakes in the future. Here are my best practice notes from the session:
- If an application fails, it is likely to flood the error logs with data. Limit logs with a quota.
This example has probably happened to all of us as developers. In Microsoft’s case, a flooding log caused severe disk space and memory issues that spread and took the service down. The best practice here makes sense: limit the log from growing so large that it takes down the entire service.
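One way to sketch this quota idea in Python is with the standard library’s `RotatingFileHandler`, which caps the log file at a fixed size and keeps a bounded number of backups. The file name and size limits here are illustrative, not from the session.

```python
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical quota: cap the log at 5 MB per file with 3 backups
# (~20 MB worst case), so a runaway error loop cannot fill the disk.
handler = RotatingFileHandler(
    "service.log",            # assumed file name for illustration
    maxBytes=5 * 1024 * 1024, # roll over once the file hits 5 MB
    backupCount=3,            # files beyond this are deleted, oldest first
)
logger = logging.getLogger("service")
logger.addHandler(handler)
logger.setLevel(logging.ERROR)

# Even if this line runs in a tight failure loop,
# total disk usage stays bounded by the quota above.
logger.error("connection to backend failed")
```

The key design point is that the quota is enforced by the logging layer itself, so no individual code path has to remember to throttle its own output.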
- Don’t ignore or suppress warnings.
This goes along with the above. If your application is emitting warnings, you should pay attention to them and get them resolved. You may be filling up your error logs with unnecessary data or, even worse, ignoring a critical problem that hasn’t been reported yet. Don’t ignore the warnings.
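In Python, one concrete way to make warnings impossible to ignore is to route them into the same logging pipeline you already monitor, using the standard `logging.captureWarnings`. A minimal sketch:

```python
import logging
import warnings

logging.basicConfig(level=logging.WARNING)

# Redirect Python warnings from stderr into the "py.warnings" logger,
# so they land in the same log your alerting already watches.
logging.captureWarnings(True)

# This now appears in the log stream instead of scrolling by on stderr.
warnings.warn("config value 'timeout' is deprecated")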
- Log like everybody is watching.
On a development team as large as Microsoft’s, the engineer who wrote the code is almost certainly not the engineer who will fix the problem. The first step in fixing a problem is identifying its cause, and you can only do that when the error log has enough detail to lead you to that root cause. The examples Mark described cost support engineers countless hours spent tracking down vague errors that didn’t point to a specific function or method call. A little extra detail can go a long way.
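To make the contrast concrete, here is a sketch in Python: the function name and identifiers are hypothetical, but the point is that `logger.exception` records the full stack trace alongside the relevant IDs, so the engineer reading the log a month later can find the failing call.

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("orders")

def charge_customer(order_id):
    # Hypothetical operation, used only to produce a realistic failure.
    raise ValueError(f"no payment method on order {order_id}")

try:
    charge_customer(42)
except ValueError:
    # Vague:   logger.error("something went wrong")
    # Better:  name the operation, include the key identifiers, and let
    # .exception() attach the full stack trace for whoever debugs it later.
    logger.exception("charge failed for order_id=42 during payment capture")
```

The difference in effort between the two lines is one method call; the difference for the on-call engineer is hours.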
- Log once per hour per exception per machine.
This relates to the first best practice as well. Limiting how you log exceptions will greatly reduce your log file quota usage.
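A minimal sketch of this dedup idea in Python is a `logging.Filter` that remembers when each exception type was last logged and drops repeats within the hour. Keying on the exception class name is an assumption for illustration; a real service might key on exception type plus message or a stack hash.

```python
import logging
import time

class OncePerHourFilter(logging.Filter):
    """Suppress repeats of the same exception type within a one-hour window.

    Sketch of the 'once per hour per exception per machine' rule; the
    dedup key (exception class name) is an illustrative choice.
    """
    WINDOW = 3600  # seconds

    def __init__(self):
        super().__init__()
        self._last_seen = {}

    def filter(self, record):
        if record.exc_info is None:
            return True  # not an exception record; always log
        key = record.exc_info[0].__name__
        now = time.monotonic()
        last = self._last_seen.get(key)
        if last is not None and now - last < self.WINDOW:
            return False  # same exception type seen within the hour: drop it
        self._last_seen[key] = now
        return True

logger = logging.getLogger("service")
logger.addFilter(OncePerHourFilter())
```

Because it is "per machine" by construction (the filter lives in each process), every machine still reports each distinct exception at least once per hour, while the aggregate log volume stays flat.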
When investigating errors, an engineer will always examine the time-stamped error log. One error often causes another, so you usually start at the bottom of the error chain. To reconstruct the full chain, you need to match errors sequentially by time stamp. What happens when your machines aren’t logging in the same time zone? Mark showed one example where the first error was logged in Pacific Time while the subsequent errors were logged in UTC. Best practice:
- Always use UTC when logging errors
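In Python’s `logging` module this is a one-line fix: point the formatter’s time converter at `time.gmtime` so every timestamp is rendered in UTC regardless of the machine’s local zone. A minimal sketch (the format string is an assumption):

```python
import logging
import time

handler = logging.StreamHandler()
formatter = logging.Formatter(
    fmt="%(asctime)sZ %(levelname)s %(name)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
# The one line that matters: render timestamps in UTC, not local time,
# so logs from machines in different zones can be merged by time stamp.
formatter.converter = time.gmtime
handler.setFormatter(formatter)

logging.getLogger("service").addHandler(handler)
```

With every machine emitting UTC, the "match errors sequentially by time" step becomes a simple sort instead of a time-zone puzzle.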
Well, this post is getting pretty long, so I won’t go into full detail on the rest. You can watch the replay of Mark’s session on Channel 9 – https://channel9.msdn.com/Events/Build/2016/B863
Other best practices:
- Store secrets and keys in Azure Key Vault
- Avoid global services, partition instead
- Test in a canary environment with production load
- Bake in production for 24 hours
- Isolate environments and keys, don’t let them see or talk to each other
- Isolate endpoints and prevent cascading failures
- Ensure graceful degradation
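The "isolate endpoints and prevent cascading failures" bullet is commonly implemented with a circuit breaker: after enough consecutive failures, stop calling the sick dependency for a cooldown period instead of letting every request pile up behind it. A minimal sketch (thresholds are illustrative, not from the session):

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency looks down, so one slow endpoint
    doesn't drag the whole service with it."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cooldown before retrying, seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Circuit is open: refuse immediately instead of queueing
                # another doomed request behind the failing dependency.
                raise RuntimeError("circuit open: dependency is failing fast")
            self.opened_at = None  # cooldown elapsed; allow a retry
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Failing fast like this is also what makes graceful degradation possible: the caller gets an immediate, cheap error it can handle (serve a cached page, hide a widget) rather than a timeout that ties up threads.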
I hope this post is helpful for all you Azure Architects out there. Stay tuned to this blog for more great content from Build.