A few years ago Adrian Cockcroft, then cloud architect at Netflix, published a blog post that caused quite a stir in the IT community. It described how Netflix had almost done away with DevOps (or even plain Ops) by using the cloud (AWS, in this case), coining yet another IT buzzword in the process: NoOps.
Many in the DevOps community took strong issue with this, arguing that Ops by any name, whether NoOps or DevOps, is still Ops. Meanwhile, a lot of Platform-as-a-Service (PaaS) vendors jumped on the NoOps bandwagon, even declaring the following year to be the definitive year of NoOps.
Vendors like Heroku, AWS Elastic Beanstalk and AppFog tout their PaaS platforms as purely development-based, with no need for operations support. I witnessed this in person during a Heroku workshop (Heroku itself, by the way, is hosted on AWS). It is frighteningly simple to create a website or web service in any of the supported languages and connect it to a set of standard database backends and tools; it scales efficiently, and the setup is a breeze if you have ever worked on any kind of multi-stack project.
I think a key drawback of PaaS today is that unless the project is self-contained, or all of your company's data and services already live in the cloud or are externally accessible, it is difficult to justify punching enough holes through the corporate firewall to make the move, especially if the data is sensitive. Organizations are still uncomfortable with the idea of highly sensitive data being hosted on systems outside their control. And being locked into a limited toolset or a particular database might not appeal to every project owner, given the proliferation of specialized software, especially in the Big Data landscape.
Going back to Cockcroft's NoOps blog, it seems what he was describing is not PaaS but rather IaaS (Infrastructure-as-a-Service), and specifically infrastructure as code. Infrastructure as code stipulates that the "scripts" used to provision, configure, install, deploy and monitor IT environments should be treated exactly like code. The implication is that the lessons learned in software development over the years (agile methodologies, test-driven development, XP principles) can be applied to these operational "scripts" just as effectively. The impetus has come from the rise of coding languages (especially Ruby) and specialized automation tools like Chef and Puppet to manage and automate operational scripts on both bare metal and cloud systems. This has huge implications for the project team: the lines between traditional software developers and operations staff blur, and the operational aspects of the project are essentially co-opted into the project team as part of the software.
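To make the idea concrete, here is a minimal sketch of what "treating operations like code" looks like in practice. The host group, package and file paths below are hypothetical, but the shape is typical: an Ansible playbook that declares the desired state of an environment, and that can be versioned, reviewed and tested like any other source file.

```yaml
# Hypothetical playbook: provision a web tier declaratively.
- name: Provision web tier
  hosts: web
  become: yes
  tasks:
    - name: Install Apache
      yum:
        name: httpd
        state: present

    - name: Render the site configuration from a template
      template:
        src: templates/site.conf.j2
        dest: /etc/httpd/conf.d/site.conf
      notify: restart apache

    - name: Ensure Apache is running and enabled on boot
      service:
        name: httpd
        state: started
        enabled: yes

  handlers:
    - name: restart apache
      service:
        name: httpd
        state: restarted
```

Because this is plain text under version control, a change to it goes through the same pull request and review cycle as any application change.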
The stars seem to be aligning for the infrastructure-as-code paradigm to be widely adopted, and two developments strike me as significant. The first is the maturity of OpenStack, an open source cloud computing platform for infrastructure services, together with smarter and simpler automation frameworks like Ansible, which seems to have been built from the ground up to make infrastructure coding easier. The second is that leading cloud vendors like Amazon have recognized this trend and are creating tools specifically for it on their platforms. If you want to know exactly how AWS CloudFormation fits into this, you can go through the presentation here. Basically, OpenStack addresses the internally hosted IaaS cloud while vendors like Amazon fill the public cloud space.
I recently wrapped up a project that used the former (OpenStack) in both a privately hosted IaaS and a publicly hosted vendor cloud. There were no DevOps or Ops members on the team; we used Ansible to automate all the provisioning, installation, configuration, deployment and monitoring of the various environments. All the engineers came from a software development background with some DevOps exposure. The Ansible code itself lived in a Git repository and went through the same code management processes (cloning, branching, forking, pull requests, reviews and merging) as well as coding standards, refactoring and the like. We used Vagrant and VirtualBox to test installations locally on our laptops, and stood up and tore down OpenStack development environments to test our code. The QA team stood up and tore down test environments using the same Ansible code, with the difference that they also verified that the environments and the automation code were functionally correct.
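Part of what made this workflow practical is that the same playbooks ran unchanged against every environment; only the inventory differed. A hypothetical sketch (hostnames and usernames invented for illustration), showing a local Vagrant VM and a QA environment described as two Ansible inventory files:

```yaml
# inventories/dev.yml (hypothetical): a local Vagrant VM on a laptop
all:
  children:
    web:
      hosts:
        192.168.33.10:
          ansible_user: vagrant
---
# inventories/qa.yml (hypothetical): OpenStack instances stood up by QA
all:
  children:
    web:
      hosts:
        qa-web-01.internal:
        qa-web-02.internal:
```

Running the identical playbook against each inventory is what let developers, QA and production all exercise the same automation code.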
I would be lying if I said this was universally accepted by the entire team, myself included. In the beginning it was a long, drawn-out process of learning OpenStack, with unfamiliar components like Nova, Neutron, Cinder and Heat, and the Ansible concepts of modules, roles and playbooks. The hardest part, though, was the mindset shift this new paradigm demanded: software engineers had to focus not just on the code but on how that code would play out in the automation of the IT environment. For example, a simple addition of a new property in the application required changes to the Ansible deployment code, which might trigger further changes in other components. It eventually dawned on me that we were struggling mainly because we still treated the Ansible code as an entirely separate animal from our main software code. If we viewed it all simply as code, regardless of whether it sat on the software side or the operational side, it made more sense: just as a change to an interface will undoubtedly require changes to its implementing classes and clients. The only difference was that our IDE did not flag those inconsistencies (it would be nice if someone developed an IDE that encompassed this as well). But software engineers are infinitely nimble (I would like to think), and success breeds adoption.
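The "new property" case above is worth sketching, since it shows exactly the kind of cross-cutting change an IDE would flag for ordinary code but not for automation code. The variable and file names here are hypothetical: the property is added to a group variables file, and a template task threads it into the deployed configuration.

```yaml
---
# group_vars/all.yml (hypothetical): the new application property
# lives alongside the rest of the environment configuration.
app_retry_limit: 5
---
# roles/app/tasks/main.yml (hypothetical): the deployment task that
# consumes it. Forgetting to update the .j2 template, or the handler
# that restarts the service, is the inconsistency no IDE caught for us.
- name: Render application properties
  template:
    src: application.properties.j2
    dest: /opt/app/conf/application.properties
  notify: restart app
```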
The main surprise for me was how agile and liberating the process was. Because we were no longer bound by existing infrastructure and processes imposed by an IT team, we were free to try new solutions and engage more creatively in coming up with alternatives. A case in point was monitoring. Initially we decided to leverage the organization's existing log aggregation and monitoring tool, but the on-boarding process was painfully slow due to large backlogs and licensing costs. There was no way to get to production in time with monitoring, and we considered monitoring and alerting imperative given how complex the application stack was. Since we already had Elasticsearch as a document store on the project, we quickly decided to expand it to cover log aggregation and metrics reporting as well, using the popular ELK stack (Elasticsearch, Logstash, Kibana). Bringing a brand-new monitoring stack into the mix fairly late in the game may seem daunting, but with the whole team working on it as just another piece of code it was less formidable than we thought. The crucial point was that we were free to log and report everything the application was doing: all software, system and application logs, as well as business and operational metrics, were captured and tracked.
This proved invaluable on our first day in production. Although we had performance-tested the application, messages were not getting processed. We spotted this while monitoring the Apache Kafka topic message consumption lags, which were climbing on the dashboard at an alarming rate. The Apache Storm logs showed workers dying and being restarted, which led to our natural inclination that something in the Storm configuration was wrong. Redeploying the Storm topologies did not help, and we were stumped. It looked like some kind of environmental issue, and we had no idea what was going on. Was there anyone in IT who could help us? We had never engaged IT, so no one outside the team knew our infrastructure; it was clear we had to resolve this ourselves.
Searching other logs along similar timelines, we came across warnings from Apache ZooKeeper complaining about how long it was taking to synchronize its data logs to disk, along with similar warnings from an in-memory database application. Since Storm depends on ZooKeeper to maintain state, we quickly concluded that disk latencies in the block storage devices OpenStack was using were causing the problem. We turned off disk synchronization in ZooKeeper, did a rolling restart of the ZooKeeper cluster, made the corresponding changes to the Ansible code, and instantly everything started working again without any loss of data.
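The fix itself amounted to a few lines of Ansible. The paths, group name and service name below are illustrative rather than copied from the project, but the mechanics are standard: flip ZooKeeper's forceSync setting in zoo.cfg and roll the restart through the ensemble one node at a time so quorum is never lost.

```yaml
# Sketch of the remediation playbook (names and paths are assumptions).
- name: Disable forced disk sync and roll-restart the ZooKeeper ensemble
  hosts: zookeeper
  become: yes
  serial: 1        # one node at a time, so the ensemble keeps quorum
  tasks:
    - name: Turn off forced disk synchronization
      lineinfile:
        path: /opt/zookeeper/conf/zoo.cfg
        regexp: '^forceSync='
        line: 'forceSync=no'

    - name: Restart this ZooKeeper node
      service:
        name: zookeeper
        state: restarted
```

Because the change lived in the playbook rather than in a hand-edited config file, every environment we stood up afterwards carried the fix automatically.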
The point, taking this a step further, is that diagnosing the problem required a software debugging mentality and resolving it required an operational mentality, but because we could write the solution into Ansible, it became a software fix. Taking it another step further, we could very possibly have written the diagnosis into the alerting system as a set of declarative rules: when Kafka messages are lagging, Storm workers are restarting, and ZooKeeper disk warnings are appearing, send out an alert, since we are already measuring and monitoring everything.
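Such a rule could be written down declaratively. The rule format, field names and thresholds below are invented for illustration (they belong to no particular tool), but they show how the correlation we worked out by hand could be captured once and reused:

```yaml
# Hypothetical alert rule: correlate the three symptoms we debugged manually.
- alert: storm_starved_by_zookeeper_disk
  when_all:
    - metric: kafka.consumer.lag
      condition: increasing_for
      duration: 10m
    - log: storm.worker
      pattern: "Worker died"
    - log: zookeeper
      pattern: "fsync-ing the write ahead log"
  action: notify_oncall
```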
Taking it even a step further, Perficient has an error-handling system called Veracity which executes actions or workflows based on error conditions. It would not take a great leap to imagine executing an Ansible playbook based on the monitoring tool's diagnosis and taking the prescribed actions, because this error will likely happen again and again: disk latencies, especially on network storage systems, can be unpredictable. Now imagine a knowledge base of such error conditions and possible prescriptions; as long as the actions are not destructive, we could execute those corrections automatically and perhaps even monitor for optimized paths to execute. In our case the fix was temporary: if disk performance returned to healthy levels we would want to reverse the solution, so it would be nice if the system detected the storage performance levels and made the necessary reversal. This, again, is very possible given the amount of metrics and data we collect.
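A knowledge base of such prescriptions could be as simple as a mapping from diagnoses to playbooks, including the reversal condition so that temporary fixes are undone when the cause clears. Every name here is hypothetical; this is a sketch of the data structure, not a real tool:

```yaml
# Hypothetical knowledge base: diagnosis -> remediation, with a
# reversal condition so temporary fixes are rolled back automatically.
remediations:
  - diagnosis: zookeeper_disk_latency
    playbook: playbooks/zk-disable-forcesync.yml
    reverse_playbook: playbooks/zk-enable-forcesync.yml
    reverse_when:
      metric: disk.fsync.latency_ms
      below: 20
      for: 1h
    destructive: false   # only non-destructive actions may run unattended
```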
Such autonomous, self-healing and self-correcting systems are, I think, possible given predictive analytics, machine learning capabilities, and the way infrastructure has essentially become code. It could spell the end of IT.