Matthieu Rethers, Author at Perficient Blogs
https://blogs.perficient.com/author/mrethers/

Openshift Essentials and Modern App Dev on Kubernetes
https://blogs.perficient.com/2023/03/13/openshift-essentials-and-modern-app-dev/ (March 13, 2023)

Introduction

Whether you have already adopted Openshift or are considering it, this article will help you increase your ROI and productivity by listing the 12 essential features included with any Openshift subscription. This is where Openshift shines as a platform when compared to pure Kubernetes engine distributions like EKS, AKS, etc., which are more barebones and require quite a bit of setup to be production- and/or enterprise-ready. When you consider the total value of Openshift, and factor in the total cost of ownership of the alternatives, Openshift is a very competitive option not only for cost-conscious buyers but also for organizations that like to get things done, and get things done the right way. Here we go:

  1. Managed Openshift in the cloud
  2. Operators
  3. GitOps
  4. Cluster Monitoring
  5. Cluster Logging
  6. Distributed Tracing
  7. Pipelines
  8. Autoscaling
  9. Service Mesh
  10. Serverless
  11. External Secrets
  12. Hyperscaler Operators

Special Bonus: API Management

ROSA, ARO, ROKS: Managed Openshift in the cloud

If you want an easy way to manage your Openshift cloud infrastructure, these managed Openshift solutions are an excellent value and a great way to get ROI fast. They run pay-as-you-go on the hyperscaler's infrastructure, and you can save a ton of money by using reserved instances with a one-year commitment. RedHat manages the control plane (master and infra nodes) and you pay a small fee per worker. We like the seamless integration with native hyperscaler services like storage and node pools for easy autoscaling, zone awareness for HA, networking, and RBAC security with IAM or AAD. Definitely worth considering over the more barebones EKS/AKS-type offerings.

Check out our Openshift Spring Boot Accelerator for ROSA, which leverages most of the tools I’m introducing down below…

Operators

Available by default on Openshift, the OperatorHub is pretty much the app store for Kubernetes. Operators manage the installation, upgrade and lifecycle of complex Kubernetes-based solutions like the tools we're going to present in this list. They are also based on the controller pattern, which is at the core of the Kubernetes architecture, and enable declarative configuration through the use of Custom Resource Definitions (CRDs). Operators are a very common way to distribute 3rd party software nowadays, and the Operator Framework makes it easy to create custom controllers to automate common Kubernetes operations tasks in your organization.

The OperatorHub included with Openshift out-of-the-box allows you to install said 3rd party tools with the click of a button, so you can set up a full-featured cluster in just minutes, instead of spending days, weeks, or months gathering installation packages from all over. The Operator Framework supports Helm, Ansible and plain Go-based controllers to manage your own CRDs and extend the Kubernetes APIs. At Perficient, we leverage custom operators to codify operations of high-level resources like a SpringBootApp. To me, Operators represent the pinnacle of devsecops automation, or at least a giant leap forward.
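
To make the declarative angle concrete, installing an operator from the OperatorHub boils down to committing a Subscription manifest; a minimal sketch looks like this (the operator name and channel are illustrative and vary by catalog version):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-gitops-operator
  namespace: openshift-operators
spec:
  channel: latest                      # update channel to follow (illustrative)
  name: openshift-gitops-operator      # operator package name in the catalog
  source: redhat-operators             # catalog source shipped with Openshift
  sourceNamespace: openshift-marketplace
```

Because the Subscription is just another manifest, the operator installation itself can be version-controlled and automated like everything else.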

Openshift GitOps (AKA ArgoCD)

The first thing you should install on your clusters, to centralize the management of your cluster configuration with Git, is Openshift GitOps. Openshift GitOps is Red Hat's distribution of ArgoCD, delivered as an Operator, and it integrates seamlessly with Openshift RBAC and single sign-on authentication. Instead of relying on a CI/CD pipeline and the oc (kubectl) CLI to implement changes in your clusters, ArgoCD works as an agent running on your cluster which automatically pulls your configuration manifests from a Git repository. This is the single most important tool in my opinion, for so many reasons, the main ones being:

  1. Central management and synchronization of multi-cluster configuration (think multi-region active/active setups at the minimum)
  2. Ability to version control cluster states (auditing, rollback, git flow for change management)
  3. Reduction of the learning curve for development teams (no new tools required, just Git and simple YAML files)
  4. Governance and security (quickly propagate policy changes, no need to give non-admin users access to clusters' APIs)
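
As a rough illustration of the pull model, this is the kind of ArgoCD Application manifest you would commit so a cluster continuously syncs a path from a Git repository (the repository URL, path and namespaces are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-baseline
  namespace: openshift-gitops              # namespace where Openshift GitOps runs
spec:
  project: default
  source:
    repoURL: https://git.example.com/org/cluster-config.git   # placeholder repository
    targetRevision: main
    path: baseline
  destination:
    server: https://kubernetes.default.svc  # the local cluster
    namespace: cluster-baseline
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual drift back to the Git state
```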

I have a very detailed series on GitOps on the Perficient blog; it's a must-read whether you're new to Openshift or not.

Cluster Monitoring

Openshift comes with a pre-configured monitoring stack powered by Prometheus and Grafana. Openshift Monitoring manages the collection and visualization of internal metrics like resource utilization, which can be leveraged to create alerts and used as the source of data for autoscaling. This is generally a cheaper and more powerful alternative to the native monitoring systems provided by the hyperscalers, like CloudWatch and Azure Monitor. Like other Red Hat managed operators, it comes already integrated with Openshift RBAC and authentication. The best part is it can be managed through GitOps by using the provided, super simple CRDs.

A less-known feature is the ability to leverage Cluster Monitoring to collect your own application metrics. This is called user-workload monitoring and can be enabled with one line in a manifest file. You can then create ServiceMonitor resources to indicate where Prometheus can scrape your application custom metrics, which can then be used to build custom alerts, framework-aware dashboards, and best of all, used as a source for autoscaling (beyond CPU/memory). All with a declarative approach which you can manage across clusters with GitOps!
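
For reference, this is roughly what that looks like: the one-line ConfigMap flag that turns on user-workload monitoring, and a ServiceMonitor pointing Prometheus at an application's metrics endpoint (the app labels, port name and namespace are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true            # the one line that enables user-workload monitoring
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: echo-service
  namespace: my-app
spec:
  selector:
    matchLabels:
      app: echo-service                 # must match the labels on the application's Service
  endpoints:
    - port: http                        # named port on the Service
      path: /actuator/prometheus        # Spring Boot Actuator metrics endpoint
      interval: 30s
```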

Cluster Logging

Based on a Fluentd-Elasticsearch stack, cluster logging can be deployed through the OperatorHub and comes with production-ready configuration to collect logs from the Kubernetes engine as well as all your custom workloads in one place. Like Cluster Monitoring, Cluster Logging is generally a much cheaper and more powerful alternative to the hyperscalers' native services. Again, the integration with Openshift RBAC and single sign-on makes it very easy to secure on day one. The built-in Kibana deployment allows you to visualize all your logs through a web browser without requiring access to the Kubernetes API or CLI. The ability to visualize logs from multiple pods simultaneously, sort and filter messages based on specific fields, and create custom analytics dashboards makes Cluster Logging a must-have.

Another feature of Cluster Logging is log forwarding. Through a simple LogForwarder CRD, you can easily (and through GitOps too!) forward logs to external systems for additional processing such as real-time notifications, anomaly detection, or simply integrate with the rest of your organization’s logging systems. A great use case of log forwarding is to selectively send log messages to a central location which is invaluable when managing multiple clusters in active-active configuration for example.

Last but not least is the addition of custom Elasticsearch index schema in recent versions, which allows developers to output structured log messages (JSON) and build application-aware dashboards and analytics. This feature is invaluable when it comes to filtering log messages based on custom fields like log levels, or a trace ID, to track logs across distributed transactions (think Kafka messages transiting through multiple topics and consumers). Bonus points for being able to use Elasticsearch as a metrics source for autoscaling with KEDA for example.

Openshift Distributed Tracing

Based on Jaeger and Opentracing, Distributed Tracing can again be quickly installed through the OperatorHub and makes implementing Opentracing for your applications ridiculously easy. Just deploy a Jaeger instance in your namespace and annotate any Deployment resource in that namespace with one single line to start collecting your traces. Opentelemetry is invaluable for pinpointing performance bottlenecks in distributed systems. Alongside Cluster Logging with structured logs as mentioned above, it makes up a complete solution for troubleshooting transactions across multiple services if you just log your Opentracing trace IDs.
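
A minimal sketch of that setup, assuming the Jaeger operator is installed (the names and image are placeholders):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger
  namespace: my-app
spec:
  strategy: allInOne                    # in-memory all-in-one instance, fine for dev/test
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-service
  namespace: my-app
  annotations:
    sidecar.jaegertracing.io/inject: "true"   # the single line that injects the tracing sidecar
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-service
  template:
    metadata:
      labels:
        app: echo-service
    spec:
      containers:
        - name: echo-service
          image: quay.io/example/echo-service:latest   # placeholder image
```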

Openshift Distributed Tracing also integrates with Service Mesh, which we’ll introduce further down, to monitor and troubleshoot traffic between services inside a mesh, even for applications which are not configured with Opentelemetry to begin with.

Openshift Pipelines

Based on Tekton, Openshift Pipelines allows you to create declarative pipelines for all kinds of purposes. Pipelines are the recommended way to create CI/CD workflows and replace the original Jenkins integration. The granular, declarative nature of Tekton makes creating re-usable pipeline steps, tasks and entire pipelines a breeze, and again can be managed through GitOps (!) and custom operators. Openshift Pipelines can be deployed through the OperatorHub in one click and comes with a very intuitive (Jenkins-like) UI and pre-defined tasks like S2I to containerize applications easily. Creating custom tasks is a breeze as tasks are simply containers, which allows you to leverage the massive ecosystem of 3rd party containers without having to install anything additional.

You can use Openshift Pipelines for any kind of workflow, from standard CI/CD for application deployments, to on-demand integration tests, to operations and maintenance tasks, or even step functions. Being Openshift-native, Pipelines are very scalable: they leverage the Openshift infrastructure to execute tasks on pods, which can be very finely tuned for maximum performance and high availability, and they integrate with Openshift RBAC and storage.
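
To make the "tasks are simply containers" point concrete, here is a hedged sketch of a one-step Task and a Pipeline that references it (names, image and script are illustrative):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: hello-task
spec:
  params:
    - name: message
      type: string
      default: "building the app"
  steps:
    - name: run
      image: registry.access.redhat.com/ubi9/ubi-minimal   # any container image can be a step
      script: |
        #!/bin/sh
        echo "$(params.message)"
---
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: sample-pipeline
spec:
  tasks:
    - name: say-hello
      taskRef:
        name: hello-task
      params:
        - name: message
          value: "step one of our CI workflow"
```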

Autoscaling

Openshift supports the three types of autoscalers: horizontal pod autoscaler, vertical pod autoscaler, and cluster autoscaler. The horizontal pod autoscaler is included OOTB alongside the cluster autoscaler, and the vertical pod autoscaler can be installed through the OperatorHub.

The horizontal pod autoscaler is a controller which increases and decreases the number of pod replicas for a deployment based on CPU and memory metric thresholds. It leverages Cluster Monitoring to source the Kubernetes pod metrics from the included Prometheus server and can be extended to use custom application metrics. The HPA is great for scaling stateless REST services up and down to maximize utilization and increase responsiveness during traffic increases.

The vertical pod autoscaler is another controller which analyzes utilization patterns to optimize pod resource configuration. It automatically tweaks your deployment's CPU and memory requests to reduce waste or under-provisioning and ensure maximum performance. It's worth noting that a drawback of VPA is that pods have to be shut down and replaced during scaling operations. Use with caution.

Finally, the cluster autoscaler is used to increase or decrease the number of nodes (machines) in the cluster to adapt to the number of pods and requested resources. The cluster autoscaler paired with the hyperscaler integration with machine pools can automatically create new nodes when additional space is required and remove the nodes when the load decreases. There are a lot of considerations to account for before turning on cluster autoscaling related to cost, stateful workloads requiring local storage, multi-zone setups, etc.  Use with caution too.

Special Mention

Special mention for KEDA, which is not commercially supported by RedHat (yet), although it is actually a RedHat-Microsoft led project. KEDA is an event-driven scaler which sits on top of the built-in HPA and provides extensions to integrate with 3rd party metrics systems like Prometheus, Datadog, Azure App Insights, and many more. It's most well-known for autoscaling serverless or event-driven applications backed by tools like Kafka, AMQ, Azure Event Hub, etc., but it's very useful for autoscaling REST services as well. Really cool tech if you want to move your existing AWS Lambda or Azure Functions over to Kubernetes.

Service Mesh

Service Mesh is supported by default and can also be installed through the OperatorHub. It leverages Istio and integrates nicely with other Openshift operators such as Distributed Tracing, Monitoring & Logging, as well as SSO. Service Mesh serves many different functions that you might be managing inside your application today (for example, if you're using Netflix OSS components like Eureka, Hystrix, Ribbon, etc.):

  1. Blue/green deployments
  2. Canary deployments (weighted traffic)
  3. A/B testing
  4. Chaos testing
  5. Traffic encryption
  6. OAuth and OpenID authentication
  7. Distributed tracing
  8. APM

You don’t even need to use microservices to take advantage of Service Mesh, a lot of these features apply to re-platformed monoliths as well.

Finally, you can leverage Service Mesh as a simple API management tool thanks to the Ingress Gateway components, in order to expose APIs outside of the cluster behind a single pane of glass.
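
As an illustration of the weighted-traffic (canary) item in the list above, this is roughly what a canary route looks like with an Istio VirtualService, assuming two subsets are defined in a DestinationRule (the host and subset names are placeholders):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: echo-service
spec:
  hosts:
    - echo-service              # the Kubernetes Service name inside the mesh
  http:
    - route:
        - destination:
            host: echo-service
            subset: v1          # subsets are defined in a DestinationRule (not shown)
          weight: 90
        - destination:
            host: echo-service
            subset: v2
          weight: 10            # send 10% of the traffic to the canary version
```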

Serverless

Now we’re getting into real modern application development and deployment. If you want peak performance and maximize your compute resources and/or bring down your cost, serverless is the way to go for APIs. Openshift Serverless is based on KNative and provides 2 main components: serving and eventing. Serving is for HTTP APIs containers autoscaling and basic routing, while eventing is for event-driven architecture with CloudEvents.

If you’re familiar with AWS Lambda or Azure Functions, Serverless is the equivalent in the Kubernetes world, and there are ways to migrate from one to the other if you want to leverage more Kubernetes in your infrastructure.

We can build a similar solution with some of the tools we already discussed like KEDA and Service Mesh, but KNative is a more opinionated model for HTTP-based serverless applications. You will get better results with KNative if you’re starting from scratch.
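
To give a sense of the Serving model, a minimal Knative Service looks something like this (the image and autoscaling target are illustrative); Knative takes care of routing, revisions and scale-to-zero:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: echo-service
  namespace: my-app
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "50"   # target concurrent requests per pod (illustrative)
    spec:
      containers:
        - image: quay.io/example/echo-service:latest   # placeholder image
          ports:
            - containerPort: 8080
```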

The big new thing is Eventing, which promotes a message-based approach to service-to-service communication (as opposed to point-to-point). If you've used that kind of decoupling before, you might have used Kafka, AWS SQS or other types of queues to decouple your applications, and maybe Mulesoft, Spring Integration or Camel (Fuse) to produce and consume messages. Knative Eventing provides a unified message format with CloudEvents and abstracts the transport layer with a concept called the event mesh. Check it out: https://knative.dev/docs/eventing/event-mesh/#knative-event-mesh.

External Secrets Add-On

One of the first things to address when deploying applications to Kubernetes is the management of sensitive configuration variables like passwords to external systems. Though Openshift doesn't officially support loading secrets from external vaults, there are widely used solutions which are easily set up on Openshift clusters:

  • Sealed Secrets: if you just want to manage your secrets in Git, you cannot store them in clear text, even if you're using GitHub or other Git providers. SealedSecrets allows you to encrypt secrets in Git which can only be read by your Openshift cluster. This requires an extra encryption step before committing, using the provided client certificate, but doesn't require a 3rd party store.
  • External Secrets: this operator allows you to map secrets stored in external vaults like HashiCorp Vault, Azure Key Vault and AWS Secrets Manager to internal Openshift secrets. Very similar to the CSI driver below, it essentially creates a Secret resource automatically, but doesn't require an application deployment manifest to be modified in order to be leveraged.
  • Secrets Store CSI Driver: another operator which syncs an external secrets store to an internal secret in Openshift but works differently than the External Secrets operator above. Secrets managed by the CSI driver only exist as long as a pod using it is running, and the application’s deployment manifest has to explicitly “invoke” it. It’s not usable for 3rd party containers which are not built with CSI driver support out-of-the-box.

Each has its pros and cons depending on whether you're in the cloud, whether you use GitOps, your organization's policies, existing secrets management processes, etc. If you're starting from scratch and are not sure which one to use, I recommend starting with External Secrets and your cloud provider's secret store, like AWS Secrets Manager or Azure Key Vault.
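
To show the mapping idea behind the External Secrets operator, here is a hedged sketch that pulls one key from AWS Secrets Manager into a regular Kubernetes Secret (the store, secret and key names are placeholders, and the SecretStore itself is configured separately):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: my-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # a ClusterSecretStore pointing at AWS Secrets Manager
  target:
    name: db-credentials           # the Kubernetes Secret that gets created and kept in sync
  data:
    - secretKey: password          # key inside the generated Kubernetes Secret
      remoteRef:
        key: prod/db               # secret name in AWS Secrets Manager
        property: password         # field within that secret
```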

Special Mention: Hyperscalers Operators

If you’re running on AWS or Azure, each cloud provider has released their own operators to manage cloud infrastructure components through GitOps (think vaults, databases, disks, etc), allowing you to consolidate all your cloud configuration in one place, instead of using additional tools like Terraform and CI/CD. This is particularly useful when automating integration or end-to-end tests with ephemeral Helm charts to setup various components of an application.

API Management Add-On

If you're a Mulesoft, Boomi or Cloud Pak for Integration customer, this one is an add-on, but it's well worth considering if you want to reduce your APIM costs: Red Hat Application Foundations and Red Hat Integration. These suites include a bunch of cool tech like Kafka (with a registry) and AMQ, SSO (SAML, OIDC, OAuth), runtimes like Quarkus, Spring and Camel, 3scale for API management (usage plans, keys, etc.), CDC, caching and more.

Again because it’s all packaged as an operator, you can install and start using all these things in just a few minutes, with the declarative configuration goodness that enables GitOps and custom operators.

6 Steps to successful autoscaling on Kubernetes
https://blogs.perficient.com/2023/01/06/6-steps-to-successful-autoscaling-on-kubernetes/ (January 6, 2023)

Introduction

One of the big drivers of adopting containers to deploy microservices is the elasticity provided by platforms like Kubernetes. The ability to quickly scale applications up and down according to current demand can cut your spending by more than half, and add a few 9s to your SLAs. Because it's so easy to set up nowadays, there's really no good reason for autoscaling not to be one of your top priorities for a successful adoption of Kubernetes. In this post I'm going to give you the 6 easy steps to establish a solid autoscaling foundation using KEDA, and trust me, you'll go a long way with just these basic principles.

TL;DR

  1. Rightsize your deployment container
  2. Get a performance baseline for your application
  3. Use the baseline measurement as the KEDA ScaledObject target
  4. Test your KEDA configuration with realistic load
  5. Refine the metric to minimize the number of pods running
  6. Iterate

Understand these principles

Before you jump into autoscaling, please consider the following:

  • Autoscaling is not a silver bullet to solve performance problems
  • “Enabling HPA is not the same as having a working autoscaling solution” (credit: Sasidhar Sekar)
  • It’s a powerful tool that needs to be used with caution, bad configuration can lead to large cost overruns
  • Autoscaling is better suited for non-spiky load patterns
  • Autoscaling tuning can be different for each application
  • Tuning requires a solid understanding of traffic patterns, application performance bottlenecks
  • Sometimes it’s good to not auto-scale (you might want backpressure)
  • Careful with async workloads
  • Think about the whole system, external dependencies, tracing is invaluable
  • Tuning autoscaling is a process, to be refined over time

Now that we got that out of the way, let's get started…

Autoscaling options

Let’s super quickly review the different types of autoscaling available for Kubernetes:

Vertical Autoscaling: resizes individual pods to increase the load capacity. Great for rightsizing applications that don't scale horizontally easily, such as stateful services (databases for example) or applications that are CPU or memory bound in general. Scaling a pod vertically requires replacing the pod, which might cause downtime. Note that for certain types of services, resizing a pod might have no effect at all on its capacity to process more requests. That's because Spring Boot services, for example, have a set number of threads per instance, so you would need to explicitly increase the number of threads to leverage the additional CPU.

Horizontal Autoscaling: creates additional identical pods to increase the overall load capacity. Best option to use whenever possible in order to optimize pod density on a node. Supports CPU- and memory-based scaling out-of-the-box, as well as custom metrics. Well-suited for stateless services and event-driven consumers.

Node Autoscaling: creates additional identical nodes (machines) in order to run more pods when existing nodes are at capacity. This is a great companion for horizontal autoscaling but… there are many considerations to take into account before turning it on. The two main concerns are waste – new nodes might get provisioned for only a minor capacity increase – and scaling down – when nodes run stateful pods which might be tied to specific zones.

The rest of this article will be focused on horizontal pods autoscaling.

Understanding the Horizontal Pod Autoscaler

The HPA ships with Kubernetes and consists of a controller that manages the scaling up and down of the number of pods in a deployment.

[Figure: HPA flow]

In a nutshell:

  1. You create a manifest to configure autoscaling for one of your deployments
  2. The manifest specifies what metric and threshold to use to make a scaling decision
  3. The controller constantly monitors the K8s metrics or some metrics API
  4. When a threshold is breached, the controller updates the number of replicas for your deployment

HPA is limited in terms of what metrics you can use by default though: CPU & memory. So this is fine if your service is CPU or memory bound but if you want to use anything else, you’ll need to provide HPA with a custom API to serve other types of metrics.

This is the basic formula that the HPA uses to calculate the desired number of pods to schedule:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

This is calculated on every "tick" of the HPA, which can be configured per deployment but defaults to 30 seconds.

Example:

An HPA configured with a target CPU usage of 60% will try to maintain an average usage of 60% CPU across all of the deployment's pods.

If the current deployment is running 8 pods averaging 70% usage, desiredReplicas = ceil[8*(70/60)] = ceil(9.33) = 10. The HPA will add 2 pods.
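
Expressed as a plain Kubernetes manifest, that 60% CPU example would look roughly like this (the deployment name and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: echo-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # the desiredMetricValue in the formula above
```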

Introducing KEDA

According to the KEDA website:

KEDA is a Kubernetes-based Event Driven Autoscaler. With KEDA, you can drive the scaling of any container in Kubernetes based on the number of events needing to be processed.

That’s actually a bit misleading and reducing. The common misconception is that KEDA can only be used when doing event-driven architecture like MQ or Kafka. In reality KEDA provides that API I mentioned earlier for serving custom metrics to the HPA. Any type of metrics, like response time or requests/second, etc

[Figure: KEDA flow]

So say you want to use Prometheus metrics, or CloudWatch metrics, etc. KEDA has a lot of scalers to integrate with all these services. This is a very easy way to augment the default HPA and not write custom metrics APIs.

KEDA Workflow

  1. A ScaledObject Kubernetes manifest tells KEDA about your deployment and desired scaling configuration
  2. KEDA initially scales down the deployment’s pod to 0
  3. When pod activity is first detected, KEDA scales the deployment to the min number of pods specified in the config file
  4. KEDA also creates a native Kubernetes HorizontalPodAutoscaler (HPA) resource
  5. The HPA monitors the targeted metric for autoscaling by querying a KEDA metric server
  6. The KEDA metric server acts as a broker for the actual metric server (Azure Monitor, App Insight, Prometheus, etc)
  7. When the metric threshold is breached, the HPA adds more pods according to the formula above
  8. When no more traffic is detected, the HPA scales back the pods down to the min number of pods
  9. Eventually KEDA will scale back down to 0 and de-activate the HPA

A note about HTTP services

One of the interesting features of KEDA is the ability to scale down to 0 when there's nothing to do. KEDA will just query the metric system until activity is detected. This is pretty easy to understand when you're looking at things like queue size or Kafka record age, etc. The underlying service (i.e. Kafka) still runs and is able to receive messages, even if there aren't any consumers doing work. No message will be lost.

When you consider HTTP services though, it doesn’t work quite the same. You need at least one instance of the service to process the first incoming HTTP request so KEDA cannot scale that type of deployment to 0.

(There is an add-on to handle HttpScaledObjects that creates a sort of HTTP proxy, but if you really need to scale down services to 0, I recommend looking at KNative instead)

You can still leverage KEDA as the HPA backend to scale on things like requests/seconds and this is what we’re going to do next.

Rightsizing your application pods

What we call rightsizing in Kubernetes is determining the ideal CPU and Memory requirements for your pod to maximize utilization while preserving performance.

Rightsizing serves 2 main purposes:

  • Optimize the density of pods on a node to minimize waste
  • Understand your application capacity for a single pod so we know when to scale

Optimizing density

This is more related to cost control and utilization of compute resources. If you picture the node as a box, and the pods as little balls, the smaller the balls, the less wasted space between the balls.

Also you’re sharing the node with other application pods, so the less you use, the more resources you leave to other applications.

Let’s walk through an example. Say your node pools are made of machine with 16 vCPUs and your pods are configured to request 2 vCPUs, you can put 8 pods on that node. If your pod actually only uses 1 vCPU, then you’re wasting 50% capacity on that node.

If you request a big vCPU number, also keep in mind that every time a new pod comes up, you might only use a fraction of that pod's capacity while usage goes up. Say your pod needs 4 vCPUs to handle 1000 concurrent requests. At 1250 requests, for example, a new pod would be created, but only ¼ of its requested vCPU would be used. So you're blocking resources that another application might need to use.

You get the idea… smaller pods = smaller scaling increment

Understanding performance

This is to give us a baseline for the metric to scale on. The idea is to establish a relationship between the pod's resources and its capacity to reach a target, so multiply by 2 and you get twice the capacity, multiply by 3 and you get 3 times the capacity, etc.

I recommend using a performance-based metric for autoscaling as opposed to a utilization metric. That's because a lot of HTTP services don't necessarily use more resources to process more requests. Check out the following load test of a simple Spring Boot application.

[Figure: Spring Boot load test]

In this test I’m doubling the number of concurrent requests at each peak. You can see that the max CPU utilization doesn’t change.

So what’s the right size? In a nutshell, the minimum CPU and memory size to insure a quick startup of the service and provide enough capacity to handle the first few requests.

Typical steps for a microservice with a single container per pod (not counting sidecar containers which should be negligible):

  1. To determine the initial CPU and memory requests, the easiest approach is to deploy a single pod and run a few typical requests against it. Defaults depend on the language and framework used by your application. In general, the CPU request is tied to the response time of your application, so if you’re expecting ~250ms response time, 250m CPU and 500Mi memory is a good start
  2. Observe your pod metrics and adjust the memory request to be around the memory used +- 10%.
  3. Observe your application’s startup time. In some cases, requests impact how fast an application pod will start so increase/decrease CPU requests until the startup time is stable

Avoid specifying CPU limits, at least at this point, to avoid throttling.
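
As a concrete starting point matching the numbers above (all values illustrative), the Deployment would carry requests only, with no CPU limit:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-service
  template:
    metadata:
      labels:
        app: echo-service
    spec:
      containers:
        - name: echo-service
          image: quay.io/example/echo-service:latest   # placeholder image
          resources:
            requests:
              cpu: 250m        # tied to the expected ~250ms response time; tune against startup time
              memory: 500Mi    # adjust to observed usage +/- 10%
            # no CPU limit at this stage, to avoid throttling
```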

To really optimize costs, this will need to be refined over time by observing the utilization trends in production.

Getting a performance baseline

This step measures how much load a single pod is able to handle. The right measure depends on the type of services that you’re running. For typical APIs, requests/seconds is the preferred metric. For event-driven consumers, throughput or queue-size is best.

A good article about calculating a performance baseline can be found here: https://blog.avenuecode.com/how-to-determine-a-performance-baseline-for-a-web-api

The general idea is to find the maximum load the pod can sustain without degradation of service, which is indicated by a drop in response time.

Don’t feel bad if your application cannot serve 1000s of RPS, that’s what HPA is for, and this is highly dependent on your application response time to begin with.

  1. Start a load test with a “low” number of threads, without a timer, to approximate your application response time
  2. Now double the number of threads and add a timer according to the formula in the above article
  3. Observe your application average response time
  4. Repeat until the response time goes up
  5. Iterate until the response time is stable
  6. You now have your maximum RPS for a rightsized pod

Keep an eye on your pod CPU utilization and load. A sharp increase might indicate an incorrect CPU request setting on your pod or a problem inside your application (async processes, web server threading configuration, etc)

Example: REST Service expecting 350ms response time

We rightsized our Spring Boot application and chose 500m CPU and 600Mi memory requests for our pod. We’ve also created a deployment in our Kubernetes cluster with a single replica. Using JMeter and Azure Load Testing we were able to get the following results. The graphs show number of concurrent threads (users) on the top left, response time on the top right, and requests/seconds (RPS) on the bottom left.

1 POD (500m CPU) – 200 users


1 POD (500m CPU) – 400 users


1 POD (500m CPU) – 500 users


1 POD (500m CPU) – 600 users


Observe the response time degrading at 600 users (460ms vs 355ms before). So our pod performance baseline is 355ms @ 578 rps (500 users).

Interestingly, the CPU load plateaued at around 580 RPS. That's because Spring Boot REST services are typically not CPU bound. The requests are still accepted but land in the thread queue until capacity is available to process them. That's why you see an increase in response time despite the CPU load staying the same. This is a perfect example of why using CPU for autoscaling doesn't always work, since in this case you would just never reach high CPU utilization. We still want the CPU request to be higher because of the startup time of Spring Boot apps.

Now let’s scale our deployment to 2 replicas and run the tests again.

2 PODS (500m CPU) – 1000 users


2 PODS (500m CPU) – 1200 users


This confirms our baseline, so we know we can double the number of pods to double the capacity (353.11ms @ 1.17k rps).

Configuring KEDA

I’ve previously explained that the HPA only supports CPU and memory metrics for autoscaling out-of-the-box. Since we’ll be using RPS instead, we need to provide the HPA an API to access the metric. This is where KEDA comes in handy.

KEDA provides access to 3rd party metrics monitoring systems through the concept of Scalers. Available scalers include Azure Monitor, Kafka, App Insights, Prometheus, etc. For our use case, the RPS metric is exposed by our Spring Boot application through the Actuator, then scraped by Prometheus. So we’ll be using the Prometheus scaler.

The ScaledObject resource

In order to register a deployment with KEDA, you will need to create a ScaledObject resource, similar to a deployment or service manifest. Here’s an example:

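A minimal sketch of such a manifest, using the Prometheus scaler (the namespace, labels, PromQL query and threshold below are illustrative, and in our actual setup the scale target is an Openshift DeploymentConfig rather than a Deployment):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: echo-service-scaler
  namespace: my-app
spec:
  minReplicaCount: 1                 # always keep at least one pod for an HTTP service
  maxReplicaCount: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-service-spring-boot
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.example.svc:9090   # placeholder Prometheus endpoint
        metricName: http_requests_per_second
        threshold: "400"             # target average RPS per pod
        query: sum(rate(http_server_requests_seconds_count{app="echo-service-spring-boot"}[2m]))
```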

Let’s discuss the main fields:

  • minReplicaCount is the number of replicas we want to maintain. Remember, in the case of an HTTP service, we always want at least one at all times (see discussion above)
  • scaleTargetRef is a reference to your deployment resource (here we’re using Openshift DeploymentConfig, but normally you’d target a Deployment)
  • metadata.type indicates that we want to use the Prometheus scaler
  • metadata.query specifies the PromQL query to calculate the average RPS across all pods tagged with “echo-service-spring-boot”
  • metadata.threshold is the target for the HPA. Remember "desiredMetricValue" from the formula at the beginning of the post? This is it
  • metadata.metricName is whatever you want and has to be unique across scalers

That’s pretty much it. Apply that resource to the namespace where your deployment is running and you can start testing

Tuning Autoscaling

Let’s first look at the basic steps and we’ll discuss details down below:

  1. Start with 1 minReplicaCount
  2. Start your test
  3. Observe the response time graph
  4. If things are configured properly, we expect the response time to remain constant as the number of RPS increases
  5. Increase the threshold until you start seeing spikes in response time, which would indicate that the autoscaler is scaling too late
  6. If the ramp-up time is too short, you will see a spike in response time and possibly errors coming up
  7. Change minReplicaCount to at least 2 for HA but match real-world normal traffic expectations

Understanding timing

Pay attention, this part is very important: always test for realistic load. Testing with a ramp-up of 10k users/s is probably not realistic and most likely will not work. Understanding your traffic patterns is critical.

Remember that the various components in the autoscaling system are not real-time. Prometheus has a scraping interval, the HPA has a query interval, KEDA has a scaling interval, and then you have your pod startup time, etc. This can add up to a few minutes in the worst-case scenario.

During load increase, only the current number of pods will be able to handle the incoming traffic, until KEDA detects the breach of threshold and triggers a scaling event. So you might experience more or less serious degradation of service until your new pods come up. Can your users tolerate a few seconds of latency? Up to you to decide what’s acceptable.

Example:

Let me try to illustrate what's going on. Imagine an application which can serve 5 RPM, we set our autoscaling threshold to 4 RPM, and we configure our test with 10 threads and a ramp-up time of 150 seconds; this means we have a ramp-up rate of 4 threads per minute. We calculated that it would take 1.5 min for KEDA to trigger a scale-up, and for a new pod to be ready to receive requests. We can trace the following graph:

[Figure: ramp-up rate vs. single-pod capacity and autoscaling threshold]

In blue we show the number of users/min simulated by our load test, in orange, the capacity of a single pod and in purple, the threshold set in the autoscaler.

At the 1 minute mark, the threshold will be breached (blue line crossing), so 1.5 minutes after that – in the worst case – our second pod will be ready at the 2.5 minutes mark.

The vertical black line shows that the number of users at the 2.5 min mark would have already reached 10, so the first pod will have to deal with up to 2x its RPM capacity until the second pod comes up.

We know our application can handle up to 5 RPM without service degradation, so we want to configure our tests so the ramp-up rate falls under the orange line. That's a 2 threads/min ramp-up, hence we need to increase our ramp-up time in JMeter to 300 seconds and make sure our overall test duration is at least 300 seconds.

Tuning the threshold

In our previous example, what if your actual ramp-up in production is just that high? Before messing with the threshold, try this first:

  • Decrease your pod startup time
  • Decrease the autoscaler timers (not super recommended)
  • Improve your app performance so the RPS goes up
  • Be OK with slower response times for a short time

If none of that helps you achieve your goal, you can try lowering the threshold BUT you need to understand the tradeoffs. Let’s go back to our formula:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

You can see that the number of pods is directly related to the ratio between the threshold and the actual load. Now let’s say you want to handle a load of 1000 RPS.

If you set the threshold to 100 RPS, the HPA will scale to 10 pods. Now, change the threshold to 50 RPS and the HPA will scale to 20 pods – i.e. twice the amount of pods – for the same load and same pod capacity!

A lower threshold will result in more pods for the same load, which will increase cost, waste resources (under-utilized pods) and potentially impact overall cluster performance. At the same time a lower threshold will result in less risk of degraded service.

Example of autoscaling based on our previously tested REST API

Autoscaled – 1400 users – 120 seconds ramp-up – 500 rps threshold


Ramp-up time is too short and threshold is too high, resulting in a serious increase in response time for the first couple pods

Autoscaled – 1400 users – 240 seconds ramp-up – 500 rps threshold


Double ramp-up time, still small spike in response time but better overall

Autoscaled – 1400 users – 240 seconds ramp-up – 400 rps threshold


Decreasing the threshold reduces the response time degradation

Autoscaled – 1400 users – 240 seconds ramp-up – 300 rps threshold


Lower threshold improved response time even more BUT…


HPA scales to 4 pods @ 400 RPS threshold


HPA scales to 6 pods @ 300 RPS threshold

In this case, we determined that 400 RPS was the correct threshold to avoid overly degraded response time during initial scale-up while maximizing resource utilization.

Impact of application performance problems in production

Autoscaling a part of a system means making sure the other parts can scale too

If an application response time starts increasing significantly, autoscaling can become a big problem if you’re not using the right metric.

A misconfigured autoscaler can result in much higher costs without benefit and negatively impact other systems.

For example, if an application becomes really slow because of a downstream problem with a database, adding more pods will not improve the situation. In some cases, that would actually aggravate the problem by putting more pressure on the downstream system.

An increase in response time means a drop in RPS. By using RPS as the scaling metric, in that case, we would actually decrease the number of pods to match what the system is actually capable of serving. If you instead scaled on response time, the number of pods would increase but the throughput would remain exactly the same. You'd just have stuck requests spread out across more pods.

Monitoring key metrics is critical to avoid runaway costs

Monitor HPA, understand how often pods come up and down and detect anomalies like unusually long response times. Sometimes autoscaling will mask critical problems and waste a lot of resources.

Improve your application’s resilience first

Sometimes it is actually better to not autoscale when you want back-pressure to avoid overwhelming downstream systems and provide feedback to users. It’s a good idea to implement circuit breakers, application firewalls, etc to guard against these problems

Continuous improvement

Autoscaling tuning CI/CD

All the steps above can be automated as part of your CI/CD pipeline. JMeter and Azure Load Tests can be scripted with ADO and ARM or Terraform templates.

This is to proactively track changes in application baseline performance which would result in changing the target value for the autoscaling metric.

You can easily deploy a temporary complete application stack in Kubernetes by using Helm. Run your scripted load tests, compare with previous results, and automatically update your ScaledObject manifest.

Reactive optimization

Monitoring the right platform and application metrics will surface optimization opportunities (and anticipate problems). Following are some of the metrics you want to keep an eye on:

Application response time: if the response time generally goes up, it might be time to re-evaluate your baseline performance and adjust your target RPS accordingly

Number of active pods: changes in active pod patterns usually indicate a sub-optimal autoscaling configuration. Spikes in the number of pods can be an indication that the target is too low

Pod CPU & memory utilization %: monitor your pods utilization to adjust your rightsizing settings

Request per seconds per pod: if the RPS of single pods is much below the configured target, the target is too low which results in underutilized pods

This process can also be automated to a certain extent. An alerting mechanism that provides recommendations is best; in most cases you want a human looking at the metrics and deciding on the appropriate action.

Conclusion

I’ll repeat what I’ve said at the very beginning of this article: autoscaling is not the solution to poor application performance problems. That being said if your application is optimized and you’re able to predictably scale horizontally, KEDA is the easiest way to get started with autoscaling. Just remember that KEDA is just a tool and in my experience, the number one impediment to a successful autoscaling implementation is a lack of understanding of testing procedures or lack of tests altogether. If you don’t want to end up with a huge bill at the end of the month, reach out to Perficient for help!

 

Kubernetes Multi-Cluster Management – Part 3
https://blogs.perficient.com/2022/12/14/kubernetes-multi-cluster-management-part-3-2/ (December 14, 2022)

Introduction

In part I of our Kubernetes multi-cluster management series, we talked about the basics of GitOps and explained why you should really consider GitOps as a central tenet of your Kubernetes management strategy. In part II, we looked at a reference implementation of GitOps using ArgoCD, and how to organize your GitOps repositories for security and release management.

In this final installment of our Kubernetes multi-cluster management series, we're taking our approach to the next level and looking at enterprise scale, operational efficiency, compliance for regulated industries, and much more. To help alleviate some of the shortcomings of the GitOps-only approach, we're going to introduce a new tool which integrates really nicely with ArgoCD: Red Hat Advanced Cluster Management (ACM) and the upstream community project Open Cluster Management.

I should mention right off the bat that ACM is not limited to Red Hat Openshift clusters. If your hub cluster is Openshift, you can use ACM and import your other clusters (AKS, EKS, etc.) into it, but you can also use other clusters as hubs with Open Cluster Management.

Motivation

When I first heard about Advanced Cluster Management, I thought, why do I need this? After all, our GitOps approach discussed in Part II works and scales really well, up to thousands of clusters. But I realized there are a few shortcomings when you're deploying in the enterprise:

Visibility: because we use a pull model, where each cluster is independent of the others and manages its own ArgoCD instance, identifying what is actually deployed on a whole fleet of clusters is not straightforward. You can only look at one ArgoCD instance at a time.

Compliance: while it's a good idea to have standard configuration for all the clusters in your organization, it's not always possible, in practice, to force every team to be in sync all at the same time. You can certainly configure GitOps to sync manually and let individual teams pick and choose what/when they want deployed in their clusters, but how do you keep track of gaps at a global level?

Management: GitOps only deals with configuration, not cluster lifecycle, and by that I mean provisioning, starting, pausing clusters. Depending on the organization, we’ve been relying on the cloud provider console or CLIs, and/or external automation tools like Terraform to do this. But that’s a lot more complicated in hybrid environments.

Use Case

Let’s focus on a specific use case and see how to address the above concerns by introducing advanced cluster management. The organization is a large retailer with thousands of locations all over the US. Each individual store runs an application which collects Bluetooth data and computes the information locally to customize the customer’s experience. The organization also run applications in the cloud like their website, pipelines to aggregate collected store data, etc. all on Kubernetes.

Requirements

  1. All the stores run the same version of the application
  2. Application updates need to happen at scale over-the-air
  3. All the clusters (on location and in the cloud) must adhere to specific standards like PCI
  4. Development teams need to be able to simulate store clusters in the cloud
  5. The infrastructure team needs to quickly provision new clusters
  6. The compliance and security teams regularly produce reports to show auditors

Solution

GitOps-only

Requirements 1-4 can mostly be addressed with our GitOps-only approach. This is the overall architecture, following what we presented in part II:

[Figure: retail edge architecture]

  • Baseline configuration for all our clusters is in a shared repository managed by the infrastructure team
  • The baseline contains dependencies and tools like operators, monitoring and logging manifests, etc. And some of the PCI-related configuration.
  • Various teams can contribute to the baseline repository by submitting pull-requests
  • Store clusters are connected to a second repository which contains the BT application manifests
  • Stores clusters can be simulated in the cloud by connecting them to the same repository
  • Dev teams have dedicated repositories which they can use to deploy applications in the cloud clusters
  • For simplicity's sake, we only have two environments: dev and prod. Each environment is connected to the repository branch of the same name
  • New releases are deployed in the dev branches, and promoted to production with Git merge

So GitOps-only goes a long way here. From a pure operational perspective, we covered most of the requirements, and the devops flow is pretty straightforward:

  1. Dev teams can simulate a store cluster in the cloud
  2. They can deploy and test changes by committing to the dev branches of the repos
  3. When they’re ready to go, changes are merge to prod and automatically deployed to all the edge clusters
  4. All the clusters share the same baseline configuration with the necessary PCI requirements

ACM and OCM overview

Red Hat Advanced Cluster Management and the upstream community project Open Cluster Management form an orchestration platform for Kubernetes clusters. I'm not going to go into too many details in this article. Feel free to check out the documentation on the OCM website for a more complete explanation of the architecture and concepts.

For the moment all you need to know is:

  • ACM uses a hub-and-spoke model, where the actual ACM platform is installed on the hub
  • The hub is where global configuration is stored
  • Agents run on the spokes to pull configuration from the hub

[Figure: hub-and-spoke architecture]

  • ACM provides 4 main features:
    • Clusters lifecycle management (create, start/stop/pause, group, destroy clusters)
    • Workload management to place application onto spokes using cluster groups
    • Governance to manage compliance of the clusters through policies
    • Monitoring data aggregation

Both tools can be installed in a few minutes through the operators' marketplace, with ACM being more tightly integrated with Openshift specifically and backed by Red Hat enterprise support.

Cluster Provisioning

GitOps cannot be used to provision clusters so you need to rely on something else to create the cluster infrastructure and bootstrap the cluster with GitOps. You have 2 options:

  • If you’re deploying non-Openshift clusters, you will need to use some kind of automation tool like Ansible or Terraform or provider specific services like Cloudformation or ARM. I recommend wrapping those inside a CI/CD pipeline and manage the infrastructure state in Git so you can easily add a cluster by committing a change to your central infrastructure repo.
  • If you’re deploying Opensihft clusters, then you can leverage ACM directly, which integrates with the main cloud providers and even virtualized datacenters. A cool feature of ACM is cluster pools, which allows you to pre-configure Openshift clusters and assign them to teams with one click. More on that later…

Regardless of the approach, you need to register the cluster with the hub. This is a very straightforward operation, which can be included as a step in your CI/CD pipeline, and it is handled by ACM automatically if you're using it to create the clusters.

Cluster bootstrapping

Once the blank cluster is available, we need to start installing dependencies, basic tooling, security, etc. Normally we would add a step in our provisioning CI/CD, but this is where Advanced Cluster Management comes in handy. You don't need an additional automation tool to handle it: you can create policies in ACM, which are just standard Kubernetes manifests, to automatically install ArgoCD on newly registered clusters and create an ArgoCD application which bootstraps the clusters using the baseline repository.

[Figure: the 3 bootstrapping policies installed on our hub]

ACM governance is how we address points 5 and 6 in the requirements. You can pick existing policies and/or create new ones to apply specific configuration to your clusters, tied to a regulatory requirement like PCI. Security and compliance teams, as well as auditors, can quickly identify gaps through the ACM interface.

[Figure: ACM Governance dashboard]

You can choose whether you want policies to be enforced or just monitored. This gives your teams flexibility to decide when they’re ready to upgrade their cluster configuration to get into compliance.

In our case, our policy enforces the installation and initialization of GitOps on every single cluster, as a requirement for disaster recovery. This approach allows us to quickly provision new edge clusters and keep them all in sync:

  • Provision the new cluster hardware
  • Install Kubernetes
  • Register with the hub
  • As soon as the cluster is online, it bootstraps itself with GitOps
  • Each cluster then keeps itself up-to-date by syncing with the edge repo prod branch

Workload Management

You have 2 ways to manage workloads with Advanced Cluster Management:

Subscriptions: this is essentially ACM-native GitOps if you’re not using ArgoCD. You create channels to monitor Git repositories and use cluster sets to target groups of clusters to deploy the applications into. The main difference with GitOps is the ability to deploy to multiple clusters at the same time

ArgoCD ApplicationSets: this is a fairly new addition to ArgoCD which addresses the multi-cluster deployment scenario. You can use ApplicationSets without ACM, but ACM auto-configures the sets to leverage your existing cluster groups, so you don’t have to maintain that in two places

ApplicationSets are the recommended way, but Subscriptions have some cool features which are not available with ArgoCD, like scheduled deployments.
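
For reference, a hedged sketch of an ApplicationSet using the cluster generator, which creates one Application per cluster registered with ArgoCD (the repository URL, path and namespaces are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: bt-app
  namespace: openshift-gitops
spec:
  generators:
    - clusters: {}                    # one Application per cluster known to ArgoCD
  template:
    metadata:
      name: 'bt-app-{{name}}'         # {{name}} is the cluster name from the generator
    spec:
      project: default
      source:
        repoURL: https://git.example.com/org/bt-app-config.git   # placeholder repository
        targetRevision: prod
        path: manifests
      destination:
        server: '{{server}}'          # the target cluster's API server
        namespace: bt-app
      syncPolicy:
        automated: {}
```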

I have one main concern with ApplicationSets, and it's the fact that they rely on a push model. The ApplicationSet controller generates one ArgoCD Application per target cluster on the hub and syncs the changes from the one ArgoCD server on the hub. When you're talking 100s or 1000s of clusters, that puts a lot of load on the hub.

Another potential issue is ingress networking for the retail locations to allow the hub to push things out to the spoke. It’s usually easier to configure a NAT on the local network, going out to the internet to pull the manifests from Git.

So… in this case, we're just not going to use any of the ACM-provided methods and stick to our approach where each cluster manages its own ArgoCD server 😀

But… we can still use ACM for visibility. The ACM applications page can show you all the workloads deployed on all your clusters in one place, along with deployment status and basic information, and it makes it very easy to drill down into logs, for example, or navigate to each individual cluster's ArgoCD UI.

[Figure: ACM application dashboard]

Note on cluster pools

I want to go back to cluster pools for a little bit because it’s a really neat feature if you are using Openshift.

This is an example of a cluster pool on AWS:

[Figure: AWS cluster pool]

This cluster pool is configured to always have one cluster ready to be assigned. Notice the “claim cluster” link on the right. As soon as a team claims this cluster, the pool will automatically provision a new one, if so configured.

[Figure: cluster pool configuration]

In this case the pool is configured with 10 clusters, with one warm cluster (not bootstrapped) available to be claimed immediately at all times.

[Figure: pool dashboard]

Here we see 2 clusters provisioned through the pool, which we previously claimed. Notice the auto-generated cluster name indicating the origin of the cluster as rhug-.

Note that clusters deployed on AWS that way are not managed by AWS and Red Hat, unlike AWS ROSA clusters, although they are provisioned using the same installation method. Something to keep in mind. If unmanaged clusters are a deal breaker, you can always provision a cluster on ROSA and register it with the hub explicitly.

There are two great use cases for cluster pools & cluster lifecycle management:

  • Quickly provisioning turnkey new clusters for development teams. If you maintain your shared config repositories correctly, you can build and bootstrap brand new, fully compliant and ready-to-use clusters in a matter of minutes. You can terminate a cluster at the end of the work day and re-create it identically the next morning to save on cost, reset clusters to the base configuration if you mess them up, etc.
  • Active-passive disaster recovery across regions or even cloud providers. If you want to save on your redundant environment infrastructure cost, and you don't have stateful applications (no storage), create a second cluster pool in the other region/cloud and hibernate your clusters. When you need to fail over, resume the clusters in the standby region, and they will very quickly start and get up-to-date with the last-known good config.

Conclusion

There is so much to discuss on the topics of GitOps and ACM; this barely scratches the surface, but hopefully it gave you enough information to get started on the right foot and to get excited about the possibilities. Even if you only have one cluster, GitOps is very easy to set up, and you'll reap the benefits on day 1. I strongly recommend you check out part II of this series for a reference. As a reminder, I also shared a CloudFormation template which deploys an implementation of that architecture using AWS ROSA (minus ACM as of today). When you're ready to scale, also consider Advanced Cluster Management for the ultimate multi-cluster experience. I haven't had a chance to include it in my reference implementation yet, but check back later.

Stay tuned for my next GitOps articles on ArgoCD security and custom operators…

Announcing AWS ROSA Accelerator “Devops-In-A-Box” https://blogs.perficient.com/2022/12/07/announcing-aws-rosa-accelerator-devops-in-a-box/ https://blogs.perficient.com/2022/12/07/announcing-aws-rosa-accelerator-devops-in-a-box/#respond Wed, 07 Dec 2022 21:06:11 +0000 https://blogs.perficient.com/?p=323074

Today we're making our Openshift on AWS (ROSA) accelerator available for everybody to use *free of charge. This solution is the product of years of experience on the ground delivering application modernization on containers to some of the biggest companies in the world. With the release of ROSA (Red Hat OpenShift on AWS) at the beginning of this year, we are able to deliver a one-click installation of managed Openshift for the ultimate developer experience.

Launch on AWS

“DevOps-in-a-box” represents the pinnacle of DevOps automation and development efficiency, bundling the best tools and processes in a single operator, backed by the power of Red Hat Openshift and ArgoCD. Deploying this accelerator at the very start of our Openshift engagements has proven to significantly speed up the adoption of Kubernetes and shorten the time-to-market for Java/Spring Boot APIs. Not to mention the quality of deliverables. It also makes it a breeze to migrate existing Spring Boot applications out of legacy platforms like PCF and Websphere and take full advantage of containers in just a few days.

This is a tool for developers by developers. With our guiding principle “how can we help developers do more of what they’re good at?”, we wanted to build something that Java developers could leverage to create sophisticated container-native applications with zero prior knowledge of Kubernetes on day 1. This is an instrument for businesses to innovate faster, save on development cost, increase resilience, boost productivity, streamline operations, and it goes without saying but I’ll say it anyways… ROI.

Installing the AWS ROSA Accelerator:

You need to be an administrator in the targeted AWS account, as the installer will create several IAM roles. We recommend creating a separate account in your organization to host the cluster, if possible.

  1. Create a Red Hat account (select “Managed services”) if you don’t already have one
  2. Enable ROSA
  3. Launch the CloudFormation template

To enable ROSA, look for “Red Hat OpenShift Service on AWS” in your AWS account’s console menu and find the “Enable Red Hat OpenShift” button on the service landing page

You can also download the CF template first and launch it from your AWS console.

On the CloudFormation stack page:

  • Enter a (short) name for your cluster
  • Paste your ROSA token

Provisioning can take up to 30 minutes. You can follow the installation progress in your AWS account’s Cloudwatch logs:

In the AWS console, go to Cloudwatch > Log Groups > perficient-[your-cluster-name] > launcher and wait until you see "dev cluster ready!". Check the "View as text" box at the top for more visibility. You might need to click the "Load older events" link to see the entire output.

Get started by downloading our user guide and learn more about our reference architecture in our GitOps blog series.

 

* You will be charged for the AWS infrastructure resources created by the installer (ROSA, EC2, CodeCommit, etc)

 

]]>
https://blogs.perficient.com/2022/12/07/announcing-aws-rosa-accelerator-devops-in-a-box/feed/ 0 323074
Kubernetes Multi-Cluster Management – Part 2 https://blogs.perficient.com/2022/12/05/kubernetes-multi-cluster-management-part-2/ https://blogs.perficient.com/2022/12/05/kubernetes-multi-cluster-management-part-2/#respond Mon, 05 Dec 2022 21:48:43 +0000 https://blogs.perficient.com/?p=322932

Introduction

In Part I of our multi-cluster management series, we introduced GitOps and went over some of the reasons why you should adopt GitOps for the management of Kubernetes clusters. GitOps is always the #1 thing we set up when starting an engagement, and we've spent a lot of time perfecting our best practices. Today we're diving deep inside our reference architecture using Openshift and ArgoCD. Note that the same approach can be used with other Kubernetes distributions, on-prem or in the cloud, and with FluxCD.

TLDR: If you just want to see this reference implementation in action, feel free to try our Red Hat Openshift on AWS accelerator in your own environment. Simply click the button at the top to launch the CloudFormation template.

Check out part 3 on Advanced Cluster Management

Our GitOps Architecture

Now that you're familiar with all the reasons why you should use GitOps, let's talk about how to get started on the right track. The way of organizing GitOps repositories I'm going to describe is our baseline for all our greenfield Kubernetes engagements. I'll explain why we made some of these decisions, but as always, this might not apply to your particular use case.

High-level repository layout

Image010

We split our cluster’s configuration into 3 or more repositories:

  • baseline: contains the manifests common to all the clusters. That repo is split into two directories, each mapping to a separate ArgoCD App:
    • dependencies: contains mostly operator subscriptions, default namespaces, some RBAC, service accounts, etc.
    • resources: contains the manifests for the custom resources (CR) that actually use the operators, and other base manifests
  • custom A, B, …: same directory structure as baseline but used for specific clusters or families of clusters (for example clusters for ML, clusters for microservices, clusters for particular teams, etc.)
  • teams: these repositories are the only ones app dev teams have write access to. This is where workload manifests are managed. Since ArgoCD can process repositories recursively, we like to create subfolders to group application manifests which go together (typically a deployment, service, route and autoscaler). Each team repository is tied to a single ArgoCD App which maps to a single namespace in Kubernetes (see the example manifest right after this list).
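For illustration, the manifest an admin creates to bind a team repository to its namespace is a plain ArgoCD Application; the repo URL and names below are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: dreamteam
  namespace: openshift-gitops                # the namespace where ArgoCD / Openshift GitOps runs
spec:
  project: default
  source:
    repoURL: https://git.example.com/teams/dreamteam-config.git   # hypothetical team repo
    path: .
    targetRevision: main
    directory:
      recurse: true                          # pick up manifests in subfolders
  destination:
    server: https://kubernetes.default.svc
    namespace: dreamteam                     # the single namespace this team deploys to
  syncPolicy:
    automated:
      prune: true
      selfHeal: true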

This is what this looks like in ArgoCD on any given cluster:

Image012

Remember that an ArgoCD Application (ArgoCD App) does not necessarily map to an actual "application" as in a single program. ArgoCD Application is an ArgoCD concept that represents a connection to a repository, which contains multiple manifests for one or more programs, or just pure configuration resources. This is a common misconception; "application" could easily be renamed "configset" or something like that, but the point is that an ArgoCD Application is just a collection of arbitrarily organized manifests.

Here we see the top two ArgoCD applications, baseline-deps and baseline-resources, connected to the shared baseline repo's dependencies and resources directories. The next two are connected to the shared type A cluster repo's dependencies and resources directories. The last one is owned by the "dreamteam" team and connected to their own repo. Notice that the last application targets the dreamteam namespace while the others don't specify any.

Example 1:

The baseline repository contains the manifests responsible for the installation of the cluster logging operator, but a particular cluster needs to forward its logs to a 3rd party. In that case, the custom repository contains the log forwarding CR for that particular cluster.
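The CR the cluster logging operator uses for this is ClusterLogForwarder. A hedged sketch of what the custom repository might contain (the output type and endpoint are purely illustrative):

apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: third-party
      type: syslog                           # could also be elasticsearch, kafka, etc.
      url: tls://logs.example.com:6514       # hypothetical 3rd party endpoint
  pipelines:
    - name: forward-app-logs
      inputRefs:
        - application
      outputRefs:
        - third-party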

Example 2:

Custom repository “kubernetes-data-pipeline” contains the manifests to configure clusters that run data pipeline applications. These clusters require Kafka, Service Mesh, and KNative, so this is the repository where you will place your manifests to install the required operators, and configure a Kafka cluster, the service mesh control plane, etc. Then each team can configure their own topics and service mesh virtual services, etc from their own repo.

App of Apps Design

We almost always use the "App of Apps" technique to build our cluster configuration, where we only create one ArgoCD application initially: the top-level type-a-deps app in this example. Its directory contains the Application manifests for type-a-res and baseline-deps. The hierarchy is as follows:

  • type-a-deps (kind: argoproj.io/v1alpha1 Application, type-a/deps repo)
    • type-a-res (kind: argoproj.io/v1alpha1 Application, type-a/res repo)
    • baseline-deps (kind: argoproj.io/v1alpha1 Application, baseline/deps repo)
      • baseline-res (kind: argoproj.io/v1alpha1 Application, baseline/res repo)

You can create as many layers as you want, each one referencing the layer below. For example, if you wanted to bootstrap a unique cluster built on top of the type-a config, you would create a single ArgoCD application called my-cluster-deps, connected to a my-cluster GitOps repo, containing a manifest for the type-a-deps application, etc.:

  • my-cluster-deps
    • my-cluster-res
    • type-a-deps
      • type-a-res
      • baseline-deps
        • baseline-res
      • my-team-config
        • app-1
        • app-2
        • etc

The main point of this is that you can build increasingly complex cluster configurations by assembling and stacking up ArgoCD Apps, but you only ever need one single ArgoCD App to bootstrap any cluster. Think of it as a Maven library which references other Maven libraries, which themselves reference other libraries, building a complex dependency tree.
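Concretely, a "parent" app is nothing more than an Application whose source directory contains other Application manifests. A sketch of the my-cluster-deps layer from the list above, with a placeholder repo URL:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-cluster-deps
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://git.example.com/gitops/my-cluster.git   # hypothetical cluster config repo
    path: deps            # this directory holds the Application manifests for my-cluster-res, type-a-deps, etc.
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: openshift-gitops               # child Application CRs live in the ArgoCD namespace
  syncPolicy:
    automated:
      selfHeal: true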

ArgoCD unfortunately only shows a single layer of dependencies, so you cannot visualize the hierarchy in the UI.

Image014

Note: why do we separate dependencies and resources? In Openshift, we leverage operators to install extensions such as Prometheus, Elasticsearch, etc. Operators register new object types with the Kubernetes API called CRDs (Custom Resource Definitions). Once a CRD is declared, you can create instances of these objects (like a Prometheus cluster, for example) using a CR (Custom Resource), which is a manifest with a kind: NewResource.

An ArgoCD App will not be able to deploy CRs, and sync will fail when the corresponding CRD is missing (i.e. the operator hasn't been installed yet). Even if you have the operator subscription and the CR in the same ArgoCD App, you will run into a race condition. To work around that problem, we split the baseline into two directories and map them to 2 different ArgoCD Apps. Both apps will try to sync initially, and the second one will immediately fail while the first one is installing the operator. No worries though, ArgoCD will keep retrying and eventually sync once the operator is finally installed.
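To make the split concrete, the dependencies app typically contains operator Subscription manifests like the sketch below, while the matching CR (a ClusterLogging instance, a Prometheus, etc.) lives in the resources app. The channel and namespace are illustrative:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging        # an OperatorGroup is usually needed in this namespace as well
spec:
  channel: stable                     # illustrative channel name
  name: cluster-logging
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic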

Image016

Basic GitOps Security

In theory, you could very well use a single repository that contains all the manifests for a cluster and use pull requests to review and control what actually gets deployed. We prefer separating our manifests in different repos to align with Kubernetes user personas and facilitate reusability instead.

Admins: have write access to the baseline and custom repositories, which contain cluster-wide configuration manifests such as operators, security aspects and shared resources like monitoring and logging. These are resources which, when modified, will impact all the tenants on the cluster, so special care needs to be taken during the PR process.

Dev Teams: have write access to team’s repos. These repos are tied to specific namespaces in Kubernetes and changes to manifests in these namespaces only affect that particular team.

The basic principle here is that only admins are allowed to create ArgoCD Application instances so you can use Git permissions as a high-level access control mechanism. Dev teams can only deploy manifests in the Git repo that is assigned to them, which is tied to a specific namespace on the cluster. They cannot create resources in global namespaces because they don’t have write access to those repos.
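ArgoCD projects can enforce the same boundaries on the cluster side, so even a mis-scoped Application cannot escape its namespace. A hedged sketch (names and repo URL are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: dreamteam
  namespace: openshift-gitops
spec:
  sourceRepos:
    - https://git.example.com/teams/dreamteam-config.git   # only this repo may be synced
  destinations:
    - server: https://kubernetes.default.svc
      namespace: dreamteam                                 # only this namespace may be targeted
  clusterResourceWhitelist: []                             # no cluster-scoped resources allowed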

More on GitOps security and granular ArgoCD ACLs coming up in a separate post.

Provisioning Clusters

Using this shared-repositories model, provisioning a new cluster is a simple, repeatable process:

  • Create a new blank cluster using the appropriate installer
  • Install ArgoCD
  • Create an ArgoCD Application which points to the top-level repo (see above)
  • The cluster will bootstrap itself from Git

You can repeat that process as many times as you want; there is no technical limit to how many clusters can sync from the same repo. Existing clusters using the shared repositories will keep themselves up-to-date.

Once bootstrapped, an admin can onboard a development team by:

  • Creating a dedicated team repo
  • Creating a namespace on the cluster for that team
  • Creating an ArgoCD App which binds the team repo to the new namespace

And the new team can start deploying workloads by adding application manifests to their new repo.

Image018

Releasing Applications

Classic Deployment Pattern

This is a common workflow to deploy artifacts to classic application servers, using CI/CD:

Image020

  1. Somebody commits some changes
  2. A new application artifact is built
  3. (Sometimes) the artifact is pushed to an artifact repository
  4. The artifact is deployed to the application servers (either copied directly or indirectly)

If you're using IaC tools like Chef, Ansible or similar, you might update your web server by running some kind of recipe, which is good practice for deploying at scale and recovering servers.

The GitOps Way

We've shown how each application team gets its own Git repository and that Git repository is tied to a specific namespace in the Kubernetes cluster. So for a team to deploy a new application, all they need to do is commit the application manifests (usually deployment, service, route, autoscaler) and ArgoCD will sync the state automatically. Of course, before that happens, a container image has to be built, and ArgoCD can't do that for you, so you still need a CI pipeline for that purpose.

Image022

The main difference with a classic CI/CD pipeline is that we always push the artifact (the container image) to the container registry, and the end of the flow is an update to the deployment manifest containing the new image tag. Notice that the Kubernetes cluster is not in the CI/CD process at all. That's because the actual CD is handled by the GitOps tool.
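In practice, that "update to the deployment manifest" is usually the CI pipeline committing a one-line change to the image tag in the team's config repo. A minimal sketch of the manifest being edited (application name, registry and tag are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api                 # hypothetical application
  namespace: dreamteam
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/dreamteam/orders-api:1.4.2   # the CI pipeline bumps this tag
          ports:
            - containerPort: 8080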

Managing Multiple Environments

Branching Strategy

When you first start with Kubernetes, you will likely have a single Kubernetes cluster for dev. Soon though you will need to promote your applications to QA for testing, then staging and prod.

We tend to treat our lower environment clusters as production, because a broken cluster might prevent dev and QA teams from doing their job and cause delays and additional costs. Typically the only differences between our environments are related to scale and security. I’m not talking about an experimental cluster that you use to try different operators and configurations and is essentially a throw-away cluster.

So with that in mind, we like to use long-lived branches for environments. This makes it really easy to move things around by just merging branches. It also helps with permissions, as people allowed to commit code to dev can be blocked from deploying changes to production.

Image024
application version promotion process

What about sub-directories for environments?

Another approach that I commonly see out there to manage multiple environments is to use a single-branch repository with one sub-directory per environment. In some cases, teams use Kustomize to create a shared base and overlays to customize each environment.

This approach has a few disadvantages:

  • More prone to error. You might inadvertently make a change in a higher environment and let it go through if the PR review is not thorough. It's harder to tell which environment a file change impacts if you have a lot of changes in the commit. Using branches, you can immediately tell that the target for the change is the production branch and pay extra attention. You can even configure Git to require more approvals when targeting specific branches
  • There is no way to manage write permissions for different environments; it's all or nothing, since a person with write access to the repository will have write access to all the sub-directories
  • It's not as straightforward to compare differences between environments. You can pull the repository locally and run a diff, but with branches, you can use the Git UI to compare entire branches
  • It's not as straightforward to look at the change history for a given environment. Your commit log will show all changes across all environments, and you would need to filter by sub-directory to only see a particular environment's log
  • The process to promote things between environments is manual. Adding an application, for example, requires copying the manifests into the right sub-directory. It's not always obvious which files are affected; you might forget to copy one or more files, or copy the wrong file. Branch merging will immediately highlight additions, deletions, etc.

A case where branching would not be a good option is when environments are significantly different and too many conflicts would arise when trying to merge changes. For example, regional clusters with specific networking requirements, or edge clusters:

Image026
Example of a multi-site gitops repo structure with Kustomize
(Thanks to Ryan Etten at Red Hat)
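For completeness, this is roughly what a per-site overlay looks like with Kustomize; the directory layout, file names and image reference below are illustrative, not taken from the diagram:

# overlays/site-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                    # shared manifests common to all sites
patches:
  - path: networking-patch.yaml   # site-specific networking tweaks
images:
  - name: registry.example.com/dreamteam/orders-api
    newTag: 1.4.2                 # pin the version deployed at this site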

Limitations

As with any shared resources, the main problems are:

  • Any change to a manifest in the baseline repository will be propagated to all clusters connected to it. This means an untested change could negatively impact everybody
  • Everyone is going to receive the changes whether they want them or not. That can be desirable, to make sure everybody is running the latest versions of operators or following basic security practices, but it can also be inconvenient when there is a breaking change
  • A team might want to customize something that's set in the baseline, which is incompatible with other teams
  • A team might request a change that requires everyone to be in agreement and has to be reviewed by a lot of people, which might cause delays in getting things changed

A solution to this problem is to let individual teams manage their own cluster and use the shared repositories as helpers instead. In that scenario, a team can fork the baseline and custom repositories they need and make modifications as they see fit.

Consider the trade-offs though:

  • Individual teams need to know what they’re doing with regard to Kubernetes
  • Individual teams need to be more disciplined and regularly check the shared repositories for updates, then merge the changes to their own repositories
  • IT teams need to trust that individual teams will configure their own clusters according to organization security or regulatory standards

Personally, I like product teams to be responsible for themselves as long as the skills exist and there is accountability. It really depends on your organization’s culture but if you are new to Kubernetes my recommendation is to start with the shared model, and slowly increase autonomy as you gain more knowledge about what works and doesn’t work for you.

Trying It Out

If you want to see this process in action, Perficient has developed an accelerator for AWS using ROSA (Red Hat Openshift on AWS). The included CloudFormation template creates a cluster and all the necessary AWS resources to run GitOps as described in this article.

Review of Industrial IoT Solutions – Part I https://blogs.perficient.com/2022/12/02/review-of-industrial-iot-solutions-part-i/ https://blogs.perficient.com/2022/12/02/review-of-industrial-iot-solutions-part-i/#respond Fri, 02 Dec 2022 15:00:03 +0000 https://blogs.perficient.com/?p=322728

Introduction

Edge computing and, more generally, the rise of Industry 4.0 deliver tremendous value for your business. Having the right data strategy is critical to get access to the right information at the right time and place. Processing data on-site allows you to react to events in near real-time, and propagating that data to every part of your organization will open up a whole new world of capabilities and boost innovation.

Today we want the edge to be an extension of the cloud, we’re looking for consistency across environments and containers are at the center of this revolution. It would take way too long to do a comprehensive review of all available solutions, so in this first part, I’m just going to focus on AWS, Azure – as the leading cloud providers – as well as hybrid-cloud approaches using Kubernetes.

Solution Overview

At the core of Industry 4.0 is the collection of equipment and device data and the dissemination of said data to local sites as well as the rest of the organization. The basic flow of data can be summarized like so:

  1. Events are emitted by IoT devices over OPC-UA or MQTT to a local broker
  2. Part of the raw data is consumed directly for real-time processing (anomaly detection, RT dashboards, etc)
  3. Part of the data is (selectively) copied to a message broker for event-driven services, streaming analytics
  4. Messages are also (selectively) transferred to the cloud for analytics and global integration
  5. Configuration can also be pushed back to the sites post-analysis

Industrial IoT (IIoT) solution overview diagram

Data Flow

Data Collection

Looking at the edge-most side of our diagram (top, in light green), the first step in the solution is to collect the data from the devices. OPC-UA is very typical for industrial equipment, and is the historical way we've been collecting events on the factory floor. The second, more modern option is MQTT, now available on most IoT devices and some industrial equipment. Both are IoT-optimized transport protocols, analogous to HTTP. OPC-UA also specifies a standard data structure for equipment information, while Sparkplug is the standard data structure specification for MQTT.

Most plants will have a mix of both, with older equipment connecting to an OPC-UA server that is bridged to an MQTT broker

OPC-UA MQTT Bridge – Source: Inductive Automation

AWS

AWS IoT Sitewise Edge provides an MQTT gateway and has out-of-the-box support for OPC-UA to MQTT transformation. AWS Sitewise Edge is great for monitoring devices on location and running/visualizing simple analytics. The Sitewise Edge gateway then copies the data to the cloud, where it can be processed by AWS IoT Core and bridged to other systems. Downstream systems can be AWS IoT services, other AWS services like Kinesis, S3, Quicksight, etc., or non-AWS products like Kafka, Spark, EMR, Elasticsearch, and so on. Sitewise Edge is a software product which can be deployed on AWS Outposts (which we're going to discuss later) or on existing infrastructure, bare-metal or virtualized.

AWS Sitewise Edge – Source: AWS

Azure

Azure IoT Edge is Microsoft's equivalent of AWS Sitewise Edge. IoT Edge ties directly into Azure IoT Hub to make the data available in the Azure cloud, where it can then be processed using the whole gamut of Azure services like Stream Analytics, Functions, Event Hubs, etc.

Azure IoT Edge – Source: Azure

 

Self-Managed

You can also run your own OPC-UA and MQTT broker at the edge, on one of the compute options we'll discuss below, but you will be responsible for transferring that data to the cloud. Red Hat AMQ, Eclipse Mosquitto and HiveMQ are common MQTT brokers with various degrees of sophistication. Red Hat Fuse, part of the Red Hat Integration suite, can also be used to set up an OPC-UA server if you're not already using an off-the-shelf solution like Inductive Automation's Ignition.

This is a great option if you're just getting started and want to quickly prototype, since they all have trial versions and/or are plain free, open-source solutions. HiveMQ and Mosquitto are pure MQTT, while Red Hat Integration's MQTT implementation is more raw but offers a much broader range of capabilities beyond MQTT. You can't go wrong with that option: it will scale really well, you'll have a lot more flexibility and no vendor lock-in, but there is some assembly required.

Self-managed MQTT/Kafka Stack – Source: RedHat

Streaming Data

In order to process the data for analytics, error detection, ML in general, etc we need to move it into a more suitable messaging system. The most common solution for that kind of workload is Kafka, and most analytics tools already know how to consume data from it. Kafka also supports replication between sites so it’s a great option to move data between sites and to the cloud.

If you're using Sitewise Edge or Azure IoT Edge and you don't need Kafka at the edge, the MQTT data will automatically be transmitted to the cloud for you (to AWS IoT Core or Azure IoT Hub). You can then send that data to your cloud-hosted Kafka cluster if so desired (and you definitely should).

If you're running your own MQTT broker, you can use MQTT replication to the cloud in some instances (Mosquitto mirroring, HiveMQ replication), but a better option is to use Kafka and leverage its mirroring feature.
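If you run Strimzi (or AMQ Streams), that mirroring is declared with a KafkaMirrorMaker2 resource. A rough sketch, assuming a local edge cluster and a cloud cluster whose bootstrap addresses are placeholders:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: edge-to-cloud
  namespace: kafka
spec:
  version: 3.5.0                                   # illustrative Kafka version
  replicas: 1
  connectCluster: cloud                            # alias of the target cluster below
  clusters:
    - alias: edge
      bootstrapServers: edge-kafka-bootstrap:9092                  # hypothetical local cluster
    - alias: cloud
      bootstrapServers: kafka-bootstrap.cloud.example.com:9093     # hypothetical cloud cluster
  mirrors:
    - sourceCluster: edge
      targetCluster: cloud
      sourceConnector:
        config:
          replication.factor: 3
      topicsPattern: "sensors.*"                   # only replicate the sensor topics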

There are plenty of Kafka implementations; Red Hat AMQ Streams (part of the Red Hat Integration suite), Strimzi for Kubernetes, and Confluent Kafka are the most recognized. Neither AWS nor Azure has a managed Kafka product at the edge, but both have other, native ways to run analytics (Kinesis, Event Hubs, etc.), more on that later.

Compute Options

Next we need to run workloads at the edge to process the data. Again we’ll have 3 options for the infrastructure setup, and many options to deploy applications to that infrastructure.

AWS

If you don't have existing compute infrastructure at the edge, or if you're trying to extend your existing AWS footprint to edge locations, the easiest way is definitely AWS Outposts. Outposts are physical AWS servers that you run at your location, which come pre-configured with AWS services similar to the ones running in the AWS cloud and seamlessly integrate with your existing AWS cloud. Think of it as a private region with some limitations.

Edge Deployment with AWS Outpost – Source: AWS

Sitewise Edge can be deployed directly on Outpost servers as a turnkey solution for basic analytics on location. You can pair it with Wavelength for sites without existing internet connection, going over 5G networks.

AWS Snowball Edge is another hardware option, more suitable for rough environments and remote sites without a connection, when you want to process the data locally and eventually move it physically into the cloud (and I mean physically, as in sending the device back to AWS so they can copy the storage).

Outposts can locally run AWS services like S3, ECS, EKS, RDS, ElastiCache and EMR, provide EBS volumes, and more.

You can also run AWS IoT Greengrass on both devices and use it to run Lambda functions and Kinesis Firehose.

This is a cool article about running Confluent Kafka on Snowball devices: https://www.confluent.io/blog/deploy-kafka-on-edge-with-confluent-on-aws-snowball/

Azure

Azure IoT Edge - Source: Azure

Azure Stack Edge is the Azure counterpart to AWS Outposts, with the Mini series covering use cases similar to AWS Snowball. Again, seamless integration with your existing Azure cloud. Azure Stack Hub can run Azure services locally, essentially creating an Azure region at your site.

IoT Edge is the runtime, similar to AWS Greengrass, with IoT specific features. Azure services like Azure Functions, Azure Stream Analytics, and Azure Machine Learning can all be run on-premises via Azure IoT Edge.

Kubernetes (on bare-metal or vms)

Of course, both options above are great if you're already familiar with and/or committed to a cloud provider and its native services. If you're not, the ramp-up time can be a little intimidating; even as a seasoned AWS architect, I find that making sense of how to set up all the components at the edge is not straightforward.

An easier first step at the edge, whether you have existing compute infrastructure or not, is to use your own servers and deploy Kubernetes. If you're already familiar with Kubernetes, it's of course a lot easier. If you're starting from scratch, with no prior knowledge of Kubernetes or the cloud providers' edge solutions, I'm honestly not sure which one would be quicker to implement.

For Kubernetes, Red Hat Openshift is a great option. Red Hat's suite of products has pretty much everything you need to build a solid edge solution in just a few clicks, through the OperatorHub. Red Hat Integration provides MQTT, Kafka and runtimes for your message processing microservices, and Open Data Hub covers AI/ML. Red Hat published a blueprint for the industrial edge to get you started: https://redhat-gitops-patterns.io/industrial-edge/

Edge IIoT on Kubernetes

All 3 options have the ability to run Kubernetes at the edge: AWS EKS, Azure AKS, plain Kubernetes (which I don't recommend), or Openshift (best). A big advantage of using Openshift is the consistent experience and processes across all environments, including multi-cloud. GitOps is a very easy and powerful way to manage large numbers of clusters at scale, and Red Hat Advanced Cluster Management can be used to further enhance that experience.

Consider the complexity of keeping sites in sync when using AWS or Azure native services at the edge, in terms of application deployment, architecture changes, etc. at scale. ARM and CloudFormation are available at the edge, but you're not likely to be able to re-use your existing CI/CD process. This is not the case for Kubernetes though: edge and cloud clusters are treated exactly the same way thanks to GitOps.

If you go with Openshift, the integration with the cloud is not seamless. AWS ROSA and Azure ARO are not currently supported on Outposts and Stack Edge, so you're responsible for setting up the cluster, securing the connection to the cloud (VPN) and replicating the data to the cloud (Kafka mirroring)… so, as always, trade-offs have to be considered carefully.

Conclusion

There is no shortage of options when it comes to implementing edge solutions. All the major cloud providers cover most standard needs with end-to-end data pipelines built from native services, and even offer pre-configured server racks that you can just drop into your existing infrastructure. These allow you to easily extend your existing cloud architecture into physical locations. You can also bring your own hardware and software: Red Hat Integration running on Openshift, or the recently released MicroShift running on edge-optimized Linux, is a very powerful and cost-effective solution. There is no right or wrong choice here, you just have to consider the effort, cost & timing and find the solution with the best ROI.

Stay tuned for the next part in this series in which we’ll run through the Red Hat IIoT demo and look at what else we can build on top of that foundation.

Kubernetes Multi-cluster Management – Part 1 https://blogs.perficient.com/2022/11/16/kubernetes-multi-cluster-management-series-part-1/ https://blogs.perficient.com/2022/11/16/kubernetes-multi-cluster-management-series-part-1/#respond Wed, 16 Nov 2022 20:28:26 +0000 https://blogs.perficient.com/?p=321954

Introduction

With more and more organizations adopting Kubernetes across multiple teams, the need for IT to provide a way for these teams to quickly provision and configure clusters is growing fast. This is true not only in cloud environments but also at the edge, and from a practical standpoint, adding more clusters becomes exponentially more difficult to manage if each one is handled individually.

Kubernetes has given development teams a lot of freedom and autonomy to innovate quickly. But the breakneck speed of adoption also caused operations, security, compliance, and infrastructure teams to fall behind. In the last 6 months, we’ve seen a trend in large organizations to try and catch up and put up some guardrails as well as look for ways to improve governance around Kubernetes.

As a developer advocate and startup enthusiast, I have to admit that when I hear governance, I immediately think of red tape, roadblocks, and unnecessary speed bumps. This is actually not the case when we engage with central IT today. They seem to be embracing the DevOps concept more and more, and instead of trying to rein in and "regain control", they are more interested in enabling adopters and increasing visibility into existing and future environments.

The key to making everybody happy is more automation, and introducing new tools to help organize Kubernetes clusters at scale. GitOps and ACM are here to save the day!

You can also skip ahead to the good stuff and check out part 2 & part 3

GitOps Basics

This is going to be a quick intro to what GitOps is. You can already find a lot of articles about this topic online, but I'll save you the trouble and just summarize what you need to know.

Infrastructure-as-code

If you've done any work in the cloud, chances are you are already familiar with infrastructure-as-code (IaC). Tools like Ansible, Terraform, Chef, CloudFormation, etc. have been around for years to provision and manage servers, deploy applications, and more generally organize operational knowledge. If you've been doing things the right way, you keep the code for said tools in a Git repository so it can be shared and evolved easily. Typically, you would also add some CI/CD pipeline tool in order to run those scripts in an automated fashion. If you're doing all that, you're doing things right.

K8s declarative configuration

Enter Kubernetes. You may or may not know that Kubernetes relies on declarative configuration; in other words, every aspect of a cluster is described in state files called manifests, typically in YAML format, although you can also use JSON. So instead of making changes to Kubernetes clusters by executing a series of scripts, you just submit your state files through the Kubernetes API, and various controllers on the cluster react to differences between the current and desired state.

Picture1

Example of a deployment manifest which creates 3 nginx pods
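For reference, a deployment manifest that creates 3 nginx pods looks roughly like this (the image tag is illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3                    # desired state: 3 nginx pods
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25      # illustrative image tag
          ports:
            - containerPort: 80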

Kubernetes has a simple command-line tool which wraps the CRUD API to manage state files, and if you're just playing around, that's a great way to get started.

kubectl create -f deployment.yml

kubectl apply -f deployment.yml

kubectl delete -f deployment.yml

If you’ve ever used Kubernetes, no doubt you have used these commands before. If you save and commit all your manifests to git, you can then pull your repository and create/apply any number of manifests recursively to make changes to the cluster.

You can then throw a CI/CD pipeline on top of that and trigger updates when your repository is updated. This is what most organizations do at first: it's familiar, it's been working, and it's what they do for their other IaC tools. And this is fine for a single cluster, but it gets a little bit more complicated when you have to propagate changes to multiple clusters and have multiple teams using the cluster(s), and your CI/CD tool requires admin access to your clusters, which is certainly a security risk.

Picture2

Classic release process with ci/cd

GitOps to the rescue

Simply put, GitOps is IaC specifically for Kubernetes, plus a CD process. I say process because you have a choice of a couple of tools to actually implement it, depending on preferences, use cases, security requirements, Kubernetes distribution, etc. But they essentially all do the same thing: sync your desired cluster state from Git.

Picture3

How is it different from Terraform? First of all, Terraform's workflow is push-based rather than continuously reconciled: you need to execute your Terraform runs yourself, and it's not immediately apparent what the state of your cluster looks like just by looking at some Terraform scripts. In the cloud, it does a great job of maintaining an internal state of your infrastructure, but it's not built to understand what's going on inside Kubernetes. Teams that use Terraform to manage Kubernetes deployments typically end up invoking the kubectl CLI from Terraform to push manifest changes to the clusters. I feel that shelling out to a CLI from Terraform defeats the purpose of using a specialized tool; you might as well run the scripts from your CI/CD directly.

GitOps tools, on the other hand, are built for Kubernetes. Just by looking at the manifests in git, you can immediately know what your cluster state is at any given time. As you can see from the diagram above, GitOps uses a pull model to bring the manifests to the cluster. We’ll see how that helps when dealing with multiple clusters in a little bit.

GitOps Options

The two main GitOps vendors at the moment are FluxCD and ArgoCD. Both are CNCF projects, both implement the same basic GitOps process, but they have slight variations in how they organize manifests and implement security. We’ve successfully used both in our engagements, regardless of the environment and Kubernetes vendor.

For this particular approach, we’ve selected ArgoCD, because we’re using Openshift as the Kubernetes platform and Red Hat Openshift GitOps is based on ArgoCD. Red Hat ACM also has out-of-the-box integration with ArgoCD which will make some aspects easier, but you could use FluxCD and another Kubernetes vendor and get very similar results.

Why/how we use GitOps

Let's get to the good part. I'm a very pragmatic person. I typically don't adopt a new tool or process unless I have an existing limitation or I see a clear potential for improvement. In other words, no "shiny ball syndrome" here. I've been using IaC for a long time, and GitOps solves problems with Kubernetes that are not easily addressed by existing tools. This applies to all milestones in your Kubernetes journey: regardless of where you are, if you're not using GitOps, you're seriously missing out! Let's look at the benefits of adopting GitOps at different stages of your Kubernetes journey.

Early Stage

Developers friendly

A lot of our focus with Kubernetes is on enabling development teams. We want to start an engagement with Kubernetes and allow developers to be productive on day one. That means, without having to install and configure new tools, assuming they use git (but who doesn’t?), they can just start typing some yaml code in their favorite text editor and commit. They’re already familiar with that process.

No learning curve

With GitOps, you can start deploying workloads on Kubernetes without really understanding anything about the underlying architecture of container orchestration. Just copy/paste some deployment, service, and route manifests from examples, replace with your values and commit. Most already know about YAML format, and the manifests are usually very easy to understand, most of the complexity has been abstracted out in Kubernetes controllers. They also don’t have to learn about Terraform or Ansible, etc or the kubectl command itself.

Growth Phase

Mistake-proof

As more and more people start using the cluster and making changes, things break and people don't remember what they did. With GitOps, a simple revert commit can restore a cluster to the last known good configuration. And because it's all declarative, you don't risk having lingering resources; the cluster will be restored to the exact same state it was in at a given commit.

Organize operational knowledge

Going beyond basic application deployments, you start adding new features to your cluster, installing operators, creating security rules, etc., and you need to keep things organized. Making those changes directly through the interactive CLI and web console will result in lost knowledge when it comes time to create more clusters. GitOps is easy enough that channeling cluster changes through Git doesn't add more time: kubectl apply -> git commit, same effort.

Multi-cluster configuration management

Further down the road, everybody in your organization wants to use Kubernetes, but you want to ensure some consistency between clusters and you don't want every single team to re-invent the wheel. Because GitOps uses a pull model, multiple clusters can share the same Git repository, which provides a basic level of governance and allows multiple teams to collaborate on a cluster configuration baseline.

Picture4

ArgoCD Pull Model

Steady State

Security

GitOps provides a control layer by not allowing people to make changes to clusters directly. In fact, we typically don't provide write access to anybody except super-admins and redirect development teams to the Git repositories to deploy applications. This not only prevents users from making untracked changes to clusters, it also prevents external agents from gaining access to clusters through regular developer accounts. You can leverage your existing Git access policy to regulate changes to the clusters as well as restrict what ArgoCD is allowed to do on behalf of the users. This is a whole blog article on its own.

Governance

By having multiple clusters share a common baseline GitOps repository, you can define what a cluster configuration should look like in general. That doesn't prevent individual teams from augmenting the baseline with custom changes, but you probably want to make sure that everyone is using things like cluster logging, monitoring, ingress, etc. While GitOps alone cannot address all aspects of governance, it's still a good place to start.

Release management

Once you have clusters in production, you need to manage how workloads get promoted between different environments. In typical CI/CD fashion, you would end up with multiple stages in your pipeline targeting dev/qa/prod, etc., with approval gates in most cases. If you're following 12-factor app recommendations, you already architected your application to read configuration variables from the environment. With Kubernetes, we want to make sure this is the case and use immutable images, so we know that the application we're deploying to prod is the same one that was tested in QA, for example. Moving images between environments is then just a matter of referencing that image in the correct deployment manifest in Git, for that environment. If you're using Git branches for your Kubernetes environments, you'd simply merge the qa branch into the prod branch to promote a build.

Disaster Recovery

Whether you're using multi-region active/active or active/passive deployment models for your Kubernetes clusters, you can leverage GitOps to keep multiple clusters in sync, as we've discussed before. If your two clusters share the exact same repositories, then their state will be completely identical. That way you don't have to worry about a region going down. You can also do that between hyperscalers, as long as the clusters have access to the same Git repository. Now, if your cluster goes down for some unexpected reason or you are the victim of a ransomware attack, having your cluster state in Git allows you to re-create an identical cluster within just a few minutes, in the same cloud or a completely different environment. You can also use that approach to save money: delete development clusters during off-business hours and re-create them brand new in the morning.

Overall GitOps Flow

Picture5

This is a pretty typical GitOps flow for the management of applications and cluster configuration.

For application developers:

  1. App code and deployment manifests are separated in two different repositories
  2. App code repositories are not affected by gitops, they just contain application code as usual
  3. Changes to the app code trigger a CI/CD pipeline which results in the creation of a container image
  4. The container image is uploaded to a container registry
  5. The pipeline updates the deployment manifest in the config repo with the new container image version
  6. ArgoCD automatically deploys the updated manifest to the cluster(s) causing an application rollout

For admins:

  1. Cluster manifests are stored in a config repo(s)
  2. Admins add/update manifests in the repo(s)
  3. ArgoCD automatically applies the changes to the cluster(s)

Conclusion

In this first part of our multi-cluster management series, we've covered the basics of GitOps and why you should seriously consider adopting it if you're doing any type of work on Kubernetes. My opinion is that GitOps should be the first thing you install on any cluster, even if you're at the very beginning of your journey (especially then, actually). You really don't want to be 3 months down the road, lose your cluster and have to start from scratch; I've seen it happen.

Some good resources to understand more about the specifics of GitOps:

https://www.weave.works/technologies/gitops/

https://blog.argoproj.io/introducing-argo-cd-declarative-continuous-delivery-for-kubernetes-da2a73a780cd

In our next post in this series, we’ll look at a reference architecture and implementation to help you get started on the right foot. Stay tuned!

 

Introducing Our Container-Native Modernization Suite for ROSA https://blogs.perficient.com/2021/04/19/introducing-our-container-native-modernization-suite-for-rosa/ https://blogs.perficient.com/2021/04/19/introducing-our-container-native-modernization-suite-for-rosa/#respond Mon, 19 Apr 2021 15:50:44 +0000 https://blogs.perficient.com/?p=291118

As a launch partner for Red Hat OpenShift Service on AWS (ROSA), we've developed a DevOps-in-a-box reference architecture and Kubernetes-based implementation for rapid application modernization. Kubernetes is an open-source container orchestration platform that automates many of the manual processes involved in deploying, managing, and scaling containerized applications. This means you can cluster together groups of containers, and Kubernetes helps you easily and effectively manage those clusters on-premises or in public, private, or hybrid cloud environments.

Leveraging Red Hat OpenShift and AWS, our turnkey environment gets you from zero to production within days. This approach to DevSecOps will save you months of time, allowing you to focus on innovation.

ROSA delivers the production-ready Kubernetes that many enterprises already use on-premises today and simplifies the ability to shift workloads to the AWS public cloud as business dictates. With ROSA, you’re able to realize faster time to market and increased ROI, among other benefits.

DevSecOps on Kubernetes Pilot

Our pilot program aims to migrate existing Java applications to Kubernetes in under 12 weeks. It leverages the operator to establish the foundation for modern application development that will drive your organization’s success for years to come.

The Kubernetes operator manages the entire lifecycle of your applications using simple abstractions to describe them (for example, SpringBootApp). Your development teams can benefit from the many capabilities of containers on Day 1 without the steep learning curve, using the tools and languages they already know.

This pilot enables us to introduce best practices and customize a solution to best fit your needs in the following steps:

  1. GitOps/Infrastructure-as-code
  2. End-to-end cloud-native CI/CD (Tekton)
  3. External application configuration and secrets
  4. Kubernetes API extensions for Spring Boot and JS applications (CRDs)
  5. Modern release flow
  6. Custom JVM/Spring Boot metrics monitoring and alerts
  7. Advanced logs management
  8. APIM and tracing
  9. Vertical and horizontal auto-scaling
  10. Service mesh (security, fault tolerance, blue/green deployments, etc.)

More About Kubernetes

The primary advantage of using Kubernetes is that it gives you the platform to schedule and run containers on clusters of physical or virtual machines (VMs). Kubernetes enables you to automate operational tasks, which means you can do many of the same things other application platforms or management systems let you do, but with containers.

With Kubernetes, you can:

  • Orchestrate containers across different environments
  • Use hardware to maximize resources to run your enterprise applications
  • Control and automate application deployments and updates
  • Mount and add storage to run applications
  • Scale containerized applications and their resources
  • Manage services to ensure applications run the way they need to run
  • Self-heal your applications with health checks

Contact us today to learn more about this offering and how you can implement Kubernetes for your enterprise.

 

An Inside Look at Managed Clusters https://blogs.perficient.com/2020/10/06/an-inside-look-at-managed-clusters/ https://blogs.perficient.com/2020/10/06/an-inside-look-at-managed-clusters/#respond Tue, 06 Oct 2020 15:00:26 +0000 https://blogs.perficient.com/?p=281891

Spend a few minutes with one of our Red Hat technical experts, Matthieu Rethers, as he discusses the advantages and disadvantages of managed clusters, as well as differences between them on various cloud platforms, when you should use them, alternatives to managed clusters, and how Red Hat OpenShift fits into the picture.

When should developers use managed clusters?

That’s a moot question for developers because, from their point of view, it’s always managed whether it’s by your organization or an IaaS vendor. But, I’d say that a developer’s main goal is delivery, so anything that helps them do that faster is golden. It’s the same reason why we don’t program in assembly anymore – we’ve built these abstraction layers so we can focus on differentiators and business value. Plus, I don’t think many developers under pressure want to spend weeks learning how to deploy a Kubernetes cluster, so for them, managed is definitely the way to go.

For the infrastructure folks, many have already learned to trust IaaS vendors with their hardware, so software is the next logical step. A managed service: 1) gives them the ability to get familiar with the technology and form their own opinion over time, maybe to the point where they would feel comfortable taking over; 2) reduces their capital expenditures; and 3) allows them, almost immediately, to focus on other critical things like automation, security, reliability, etc.

As far as I’m concerned, I’ll always recommend the managed approach first and then evaluate it. But I’ll admit, there are times when having complete control is a requirement – but remember that managed doesn’t mean sealed either. You can still configure a lot of things on your own.

What are some pros and cons to leveraging a managed cluster on Amazon EKS, Google Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS)?

Like with any managed service, you get many benefits right out of the box. An organization that’s new to Kubernetes can be productive immediately. If all you want to do is run a few Java services on your cluster, and you’re told it will take two weeks before you can actually publish them in a production environment, you might reconsider using containers for the time being. To be fair, installing Kubernetes has become much easier over the past couple of years, but any production workload will have requirements beyond basics, and you’ll lose that initial momentum.

If you’re new to the cloud, these managed services will allow you to skip a few steps and focus mainly on Kubernetes. If, on the other hand, you’re already using a cloud vendor, it will feel like using any other service. Now, if you’re wondering about a self-managed cluster on-prem, it’s going to be a bumpy ride – I wouldn’t even consider it an option if you’re trying to get into containers quickly.

Miscellaneous advantages of managed services include:

  • Cluster upgrades
  • Prescriptive approach
  • Easy provisioning (EKS now integrates with Fargate but watch for limitations like storage)
  • Integration with other vendor services

Now the big concern should be vendor lock-in. Because Kubernetes is very flexible, there are several choices that vendors have to make to deliver a production-grade service, so if you don’t like those choices in the future, it might not be easy to go back to a self-managed cluster or transfer to a different vendor. You’re also at the mercy of the vendor for upgrades and features availability – most of them will be behind and provide only the most common denominator, and that’s understandable. For non-greenfield situations, it might not even be a good fit, to begin with.

Are there major differences between how managed clusters work on these platforms?

There are some differences, as vendors will try and set up Kubernetes clusters in a way that can integrate with their other services, and that’s a good thing – that way, it’s a one-stop-shop for all infrastructure needs. They run different versions of the engine and features and bundle features differently depending on the point of view and best practices. Make sure you’re comfortable with those choices because, as I said before, once you’re committed to a vendor, it might be difficult to change. Of course, if you’re currently running IBM or Microsoft products, they make it a lot easier to transition, so that’s certainly something that will weigh heavily in the balance. Most IaaS have free tiers – I strongly recommend trying before you buy, and sometimes, it just comes down to feelings.

What alternatives are available to managed clusters?

Kubernetes is becoming easier and easier to install and supports a wide array of deployment options from bare-metal to virtualized environments to the cloud, so self-managed is always an option.

For greenfield applications, you won't have to deal with all the constraints of a pre-container era, and you should be able to get started quickly. If you have a lot of legacy to deal with, things might get a little dicey.

Now, if you've already invested a lot in your bare metal infrastructure, self-managed is a good way to leverage it and squeeze the last drop out of your machines. If you're lucky and already have a solid virtualized environment, you'll have a good head start.

Where does OpenShift, a container platform, fit into these other options?

OpenShift is an enterprise platform from Red Hat that runs Kubernetes at its core. I like to think of Kubernetes as Tony Stark and OpenShift as Iron Man. You have the awesome brain, but suit up, and now you can go and blow up giant alien ships in distant galaxies.

As I explained before, Kubernetes alone isn’t enough – you need security, networking, logging, monitoring, integration with your cloud infrastructure, etc. Most of the time, you get some form of that through the managed service in IaaS, but you’re confined to the vendor’s boundaries.

OpenShift delivers all of that and more but in a cloud-agnostic fashion. You can also run it on your own infrastructure on-prem or in the cloud. Because all of those features are encapsulated inside the cluster itself, you can easily move your entire infrastructure from one cloud to another, and you can do multi-cloud. The main advantage is you only have to learn it once and apply the same principles everywhere you go.

Remember that cloud providers are ultimately not in the software business. They typically use tools built by other people and package them in a way that helps them sell more infrastructure services. That’s great, but Red Hat’s focus, on the other hand, is to deliver a great tool regardless of your environment, and their long-term commitment to Open means no lock-in. As a matter of fact, many OpenShift enhancements make it back to the upstream Kubernetes project and eventually to EKS, AKS, and others. It’s very telling that all the major cloud providers now offer a managed OpenShift service as part of their catalog.

By the way, the reasons to go managed are the same as the ones listed above, but OpenShift comes with a lot of options right out of the box, so the learning curve isn’t as important of a factor if you decide to go self-managed. In the cloud, upfront costs will be higher than EKS, GKE, or AKS, but OpenShift licensing is probably a drop in the bucket for most organizations, starting at about $20k when you bring your own infrastructure. When you consider what you get in return, this should be a no brainer (one study shows ROI over 500%) – this is the enterprise platform.

Any final thoughts?

Just know that running any Kubernetes cluster at scale on your own is no walk in the park – don’t be fooled by the apparent simplicity of that “hello world” video on YouTube. Sure, you can get a cluster running anywhere quickly, but there’s a lot of stuff to consider to go from that first service to running anything in a production environment.

There’s a reason why vendors can charge a premium for these services. Unless you already have a team of experts in Kubernetes (and I’m assuming you wouldn’t be reading this if you did), count on spending a lot of time and money getting up to speed. It’s sure to be way more than a managed service subscription, and if things go wrong, you’re on your own.

So, if there is any chance you can go managed, please do. Our team of experts is here to help you.

Keep Your Fish Warm: Event-Driven Architecture on OpenShift https://blogs.perficient.com/2020/06/22/keep-your-fish-warm-event-driven-architecture-on-openshift/ https://blogs.perficient.com/2020/06/22/keep-your-fish-warm-event-driven-architecture-on-openshift/#respond Mon, 22 Jun 2020 21:06:52 +0000 https://blogs.perficient.com/?p=275699

Typically, the idea of container orchestration is associated with DevOps, and while IT departments have quickly adopted the tech for Ops, the development part has been misunderstood and underrated. With the array of OpenShift tools at developers' disposal, it's time to deliver on the promise of DevOps and bring power back to the developers.

Coders like to code, but most would rather spend more time building cool stuff than wasting it on boilerplate. That’s why we use frameworks. For example, when we start a project, we want to jump right into it instead of learning how to set up an Elasticsearch cluster.

Large organizations require you to create IT tickets and wait for provisioning, which can be a major roadblock. Sometimes it’s because IT is very busy. Sometimes it’s because they don’t know how to provision these things, and sometimes it’s because they are concerned about security. OpenShift has a powerful way to address all of that.

The other important aspect of container adoption is that programmers might be under the impression that they have to learn all about Docker or use a different language than the one they’re familiar with.

For this exercise, we're not going to build a simple hello-world, run-a-microservice-on-OpenShift type of scenario, because it wouldn't illustrate these points. We're going to build a sophisticated application with a lot of moving parts that delivers true value, and we'll demonstrate how easy it is to do with OpenShift – on Day 1.

Use Case

For this demo, we’re using a fish farm with growing tanks. The water temperature in the tanks needs to remain within a safe range, especially in the winter. If the temperature drops below a certain point, it will cause damage to the fish.

The tanks are equipped with internet-connected thermostats that control the heaters. We’ll use the thermostat readings to ensure maximum efficiency and identify early signs of failure.

Efficiency is calculated by the rate of temperature increase when a heater turns on. For demo purposes, we’ll assume that a one-degree increase per second is our standard. (In the real world, that would not be true, as the tanks would boil in a matter of seconds, but this is so we can see the changes quickly in the demo.)

Figure 1 – Infrastructure

Data Flow

  1. The gateway will feed the temperature readings to our application through a REST service
  2. The service will dump the readings into a message queue
  3. The processor service will consume the messages and calculate the current efficiency for each device
  4. It will then publish the efficiency records to a second queue, as well as expose them as a metric that will be monitored in real-time
  5. We’ll also monitor the raw temperature readings and present them in a dashboard
  6. Finally, alerts will be generated when appropriate and published to another message queue to be dispatched appropriately

Figure 2 – Architecture

And of course, everything will be running on OpenShift.

Environment

We need an OpenShift cluster on version 4 or newer. This is because we’re going to use Operators to facilitate the provisioning of our middleware. A vanilla installation is good enough, as you can run it on a single node, and now you can even use CodeReady Containers, which allow you to run OpenShift on your development machine.

You can also use evaluation licenses and deploy a test cluster on AWS, which will give you access to storage, on-demand scaling, etc., all out-of-the-box.

Kafka

Kafka is stream processing software. For those coming from classic event-driven systems, you can think of it as a message queue, but it does a lot more than that. What's of interest to us here is real-time streaming analytics and sliding-window calculations.

To calculate our efficiency, we need to compare two consecutive temperature readings. Without Kafka, we would need to maintain that state in some sort of cache, which would force us to write more code – and that's precisely what we're trying to avoid. Kafka is also future-proof. Check out this article for an introduction to Kafka Streams.
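To make that trade-off concrete, here's roughly what the hand-rolled version of that state would look like without Kafka Streams – a plain consumer keeping the previous reading per device in memory. This is only an illustrative sketch (the class and method names are made up, not code from the demo):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: the "state in some sort of cache" we'd have to manage ourselves without Kafka Streams.
public class NaiveEfficiencyTracker {

    // deviceId -> [timestampMillis, temperature] of the last reading we saw
    private final Map<String, double[]> lastReadings = new ConcurrentHashMap<>();

    // Returns efficiency against a one-degree-per-second standard, or NaN until we have two readings.
    public double onReading(String deviceId, long timestampMillis, double temperature) {
        double[] previous = lastReadings.put(deviceId, new double[] { timestampMillis, temperature });
        if (previous == null) {
            return Double.NaN;
        }
        double seconds = (timestampMillis - previous[0]) / 1000.0;
        double degrees = temperature - previous[1];
        return degrees / seconds; // 1.0 degree per second == 100% efficiency
    }
}

This works for a single instance, but the state disappears on every restart and doesn't scale across replicas – which is exactly the kind of plumbing Kafka's state stores take off our hands.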

The enterprise version of Kafka is provided by Red Hat under the name AMQ Streams, a downstream project of the open-source initiative Strimzi. It comes with seamless integration with the OpenShift ecosystem.

Prometheus

Figure 3 – Grafana Dashboard Example

Prometheus will be used as our metric monitoring system. It connects to your services at a given interval, scrapes the data that you want to watch, and stores it in a time-series database. You can then give it rules and it will generate alerts and publish them to an endpoint of your choosing or send an email or other notifications. We’ll also install Grafana, which is the standard visualization tool for Prometheus, and you can import community dashboards for commonly used tools like Spring Boot and Kafka. OpenShift uses Prometheus internally to monitor the cluster.

Spring Boot and Spring Cloud Streams

For our microservices, we’re going to start with a plain Spring Boot application and add a few modules that will make things easier:

  • Actuator: Provides a health check endpoint for OpenShift to watch; we pair it with the Prometheus module to expose all of our internal service metrics like response time, JVM details, etc., as well as some custom metrics to monitor our temperature and efficiency.

Figure 4 – Spring Boot Actuator Prometheus Output

  • Spring Cloud Stream: Provides an abstraction for Kafka Streams. It’s similar to Spring Data for relational databases in that it manages the connection to Kafka, serializes and de-serializes messages, etc. It also integrates with the actuator module so we can watch our integration metrics in Grafana and set up rules in Prometheus. Check out this video for an introduction to Spring Cloud Streams with Kafka.

Operators

At this point, we need to install our three pieces of middleware on OpenShift. In version 4 and newer, the entire platform is constructed around the concept of Operators. At a basic level, Operators are regular applications that manage other applications' lifecycles. They ensure that all the components of an application are set up properly and watch for configuration changes so they can take appropriate actions in OpenShift. For our immediate purpose, you can just think of them as installers for complex application stacks.

Figure 5 – OpenShift Operators Catalog

OpenShift's Operator Catalog is accessible directly from the UI, and you can choose from a list of curated software, install it, and create instances of those applications. This requires zero knowledge of the underlying infrastructure, and the Operators come with production defaults so you don't have to customize anything at first.

Figure 6 – Grafana Custom Resource

The way you do that is by creating a Custom Resource, which is the abstract representation of an application’s infrastructure, in a YAML format.

Code Overview

Thermostat Service

I wanted a low-code solution, and if possible, I didn’t want to learn too much about Kafka. All I want to do is produce messages and dump them into a topic, then consume them on the other end and produce another message as a result.

Figure 7 – Intake Service

This standard Spring Boot REST service takes in a message from the IoT gateway, wraps it in a standard Spring Integration message, and sends it to the output, which is bound to our temperatures topic. We then add the config to tell Spring where our Kafka cluster is and which topic to send the message to, and let it do all the heavy lifting. We didn't write any Kafka-specific code – Spring will make that topic available to us through a Source bean that we can auto-wire.

Figure 8 – Spring Cloud Stream Config
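For reference, the intake controller boils down to something like this. It's a minimal sketch with hypothetical class and field names (the real code is in the git repository), and it assumes the annotation-based Source binding Spring Cloud Stream provided at the time; newer releases would use a Supplier bean or StreamBridge instead:

import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Source;
import org.springframework.http.ResponseEntity;
import org.springframework.messaging.support.MessageBuilder;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@EnableBinding(Source.class)
@RestController
public class ThermostatController {

    private final Source source; // bound to the temperatures topic via the configuration above

    public ThermostatController(Source source) {
        this.source = source;
    }

    @PostMapping("/temperature-records")
    public ResponseEntity<Void> record(@RequestBody TemperatureReading reading) {
        // Wrap the reading in a Spring message and let the binder publish it to Kafka
        source.output().send(MessageBuilder.withPayload(reading).build());
        return ResponseEntity.accepted().build();
    }

    // Simple payload type; field names are assumptions for the sketch
    public static class TemperatureReading {
        public String deviceId;
        public double temperature;
        public long timestampMillis;
    }
}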

Processor Service

Figure 9 – Processor Function

Here we leverage the Kafka abstraction for streams, and we create a simple Function that takes in messages from the temperatures topic as an input, augments the record with the calculated efficiency, and sends it to our efficiency topic.

Figure 10 – Spring Cloud Stream Processor Config

Here we declare that we have a processor function, and configure Spring to bind its input and output to specific Kafka topics. By convention, functionName-in-x for inputs and functionName-out-x for outputs. In this case, we used a few Kafka-specific APIs with KStream because we want to leverage Kafka’s ability to do time-based calculations.
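Sketched out in code, such a processor could look like the following (class names and the state-store details are assumptions for illustration, not the exact demo code). The bean name processor is what the in/out binding names above refer to, and the previous reading for each tank lives in a Kafka state store rather than in a cache we have to manage ourselves:

import java.util.function.Function;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.support.serializer.JsonSerde;

@Configuration
public class EfficiencyProcessorConfig {

    // Demo convention: a one-degree increase per second is 100% efficiency.
    private static final double EXPECTED_DEGREES_PER_SECOND = 1.0;

    public static class TemperatureReading {
        public double temperature;
        public long timestampMillis;
    }

    public static class EfficiencyRecord {
        public double lastTemperature;
        public long lastTimestampMillis;
        public double efficiency; // 1.0 == 100%

        EfficiencyRecord update(TemperatureReading reading) {
            if (lastTimestampMillis > 0) {
                double degrees = reading.temperature - lastTemperature;
                double seconds = (reading.timestampMillis - lastTimestampMillis) / 1000.0;
                efficiency = (degrees / seconds) / EXPECTED_DEGREES_PER_SECOND;
            }
            lastTemperature = reading.temperature;
            lastTimestampMillis = reading.timestampMillis;
            return this;
        }
    }

    // Bound by convention to processor-in-0 (temperatures) and processor-out-0 (efficiencies).
    // Assumes readings are keyed by device id and that serdes are configured in the binder config.
    @Bean
    public Function<KStream<String, TemperatureReading>, KStream<String, EfficiencyRecord>> processor() {
        return temperatures -> temperatures
            .groupByKey() // state is kept per tank
            .aggregate(
                EfficiencyRecord::new,
                (deviceId, reading, previous) -> previous.update(reading),
                Materialized.<String, EfficiencyRecord, KeyValueStore<Bytes, byte[]>>as("efficiency-store")
                    .withKeySerde(Serdes.String())
                    .withValueSerde(new JsonSerde<>(EfficiencyRecord.class)))
            .toStream();
    }
}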

We also want to track the temperature and efficiency data and expose them as metrics to Prometheus, so we add our two consumers accordingly:

  • tempGauge: Consumes raw temperature messages from the temperatures topic
  • efficiencyGauge: Consumes augmented messages from the efficiencies topic

Then all we need to do is auto-wire the MeterRegistry that’s made available to us by Spring Actuator, and update the metric values with the reading from the topics for each device.
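In code, the metric part can be as small as this (a sketch with assumed metric and class names; the demo may use different ones). Micrometer's Gauge simply reads the latest value we stored for each device every time Prometheus scrapes the endpoint:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

// Holds the latest temperature per device and exposes it as a Prometheus gauge.
// The same pattern applies to the efficiency metric.
@Component
public class TemperatureMetrics {

    private final MeterRegistry registry;
    private final Map<String, AtomicReference<Double>> latest = new ConcurrentHashMap<>();

    public TemperatureMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // Called by the consumer for every reading pulled from the temperatures topic.
    public void record(String deviceId, double temperature) {
        latest.computeIfAbsent(deviceId, id -> {
            AtomicReference<Double> holder = new AtomicReference<>(temperature);
            // This gauge shows up on /actuator/prometheus with a device label, e.g. device="pond-1"
            Gauge.builder("fishfarm_temperature", holder, AtomicReference::get)
                 .tag("device", id)
                 .register(registry);
            return holder;
        }).set(temperature);
    }
}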

Keep in mind that at this point, we haven't added any Docker or OpenShift-related code at all.

The Demo

For convenience, we created a Helm Chart that will install all the application components for you automatically, assuming you installed the Operators as described in the previous steps.

Instructions to install the demo are available in the git repository README file.

If you are only interested in running the demo, you can skip the following two sections and click here to watch the demo, which begins at 21:50 of our Modernize Applications with Red Hat OpenShift on AWS webinar.

After the installation is complete, you can access the Prometheus, Grafana, and Alertmanager UIs by clicking the links in the Networking > Routes page of the OpenShift console.

Figure 11 – Openshift Routes Page

Prometheus should show all three services up and running on the Targets page.

Figure 12 – Prometheus Targets Post-Install

Manual Deployments with Source-2-Image (S2I)

We used Helm to install everything to save time, but we could have deployed the application manually instead. Applications on Kubernetes run inside Pods, which manage one or more containers, and containers run Docker images. OpenShift provides an easy way to package and run applications without any code change if you use one of the standard runtimes.

Open the developer perspective in the OpenShift web console, click add, point to where your code is on Git, pick a runtime, and that’s it. OpenShift will set up a number of things for you automatically:

  • OpenShift will leverage a feature called S2I, or Source-to-Image, to create a Docker image directly from your source code. S2I clones your code from a git repo – public or private – figures out how to compile and package your app, and wraps it inside a Docker image that already knows how to run your code. For example, if it's a Java project, it will detect a Maven or Gradle configuration and package your app in a jar file; in the case of Spring Boot, it will use a Java 11 base image to start the web server. If you're using NodeJS, it will run npm install and npm start, and a runtime for React apps is now available as well.

You can code your application the way you already know how – just make sure you follow industry standards and conventions, which you probably already do.

Figure 13 – Deploy Service

Figure 14 – Deploy From Git

  1. OpenShift creates a DeploymentConfig, which describes how we want our application deployed – in this case, just one instance with one container and standard CPU and memory allocations. It will also configure it so the application is automatically re-deployed when we push a new build.
  2. It then creates a Service, which exposes our application to the rest of the cluster so other services can talk to it using the OpenShift internal DNS.
  3. Finally, it creates a Route, which exposes the application outside of the cluster and gives it a public URL. In this case, it also mapped a standard HTTP port to the default 8080 port Spring uses for the webserver. When it’s all said and done, we have a public URL that we can open in a browser. You can find the link generated for your service by going to the OpenShift web console Networking > Routes page.

To install this demo manually, add the three services present in the git repository that way. You have to make sure you go into the advanced git options and enter the name of the sub-directory containing the service in the Context Dir field.

Manual deployment of Kafka and Grafana is very straightforward using the Operator > Installed Operator feature in the OpenShift console, but Prometheus is a bit more involved. I’ll dedicate a blog post to that subject later on.

Running the simulation

Complete instructions to install and run the simulator are available in the git repository README file.

The simulator uses a Spring Boot REST client to send temperature records to the thermostat service using the /temperature-records endpoint. By default, it generates a one-degree increase message every second for pond-1, which represents 100% efficiency for demo purposes.
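If you're curious what's inside, the simulator is little more than a scheduled REST call – something along these lines. The simulator.id, simulator.url, and simulator.rate properties match the commands below; everything else is a simplified sketch rather than the exact source:

import java.util.HashMap;
import java.util.Map;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@SpringBootApplication
@EnableScheduling
public class ThermostatSimulatorApplication {

    public static void main(String[] args) {
        SpringApplication.run(ThermostatSimulatorApplication.class, args);
    }
}

@Component
class Simulator {

    private final RestTemplate rest = new RestTemplate();
    private double temperature = 10.0;

    @Value("${simulator.id:pond-1}")
    private String deviceId;

    @Value("${simulator.url}")
    private String url; // e.g. http://<thermostat route>/temperature-records

    // simulator.rate is the delay between readings; 1000 ms with a one-degree step means 100% efficiency
    @Scheduled(fixedRateString = "${simulator.rate:1000}")
    void sendReading() {
        temperature += 1.0;
        Map<String, Object> reading = new HashMap<>();
        reading.put("deviceId", deviceId);
        reading.put("temperature", temperature);
        reading.put("timestampMillis", System.currentTimeMillis());
        rest.postForLocation(url, reading);
        System.out.printf("%s -> %.1f%n", deviceId, temperature);
    }
}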

  • Clone the git repository on your local machine
  • Go into the thermostat-simulator directory
  • Run the simulator with Maven: mvn spring-boot:run -Dspring-boot.run.jvmArguments="-Dsimulator.url=[thermostat service url in openshift]/temperature-records"

You should see the temperature records that are printed to the console. You can verify that the records are being consumed by going to the pod logs in the OpenShift console, for both the thermostat service and temperature processor service.

Figure 15 – REST Service Logs

Figure 16 – Processor Service Logs

You can also look at the Temperatures dashboard in the Grafana UI, which should show ~100% efficiency for pond-1.

Figure 17 – Grafana Dashboard 100%

Now we can run a second simulator at 100%. In a separate terminal:

  1. Run the simulator with Maven: mvn spring-boot:run -Dspring-boot.run.jvmArguments="-Dsimulator.id=pond-2 -Dsimulator.url=[thermostat service url in openshift]/temperature-records"
  2. Go to Grafana and select pond-2 in the Temperatures dashboard at the top
  3. Notice the efficiency is at ~100%

At this point, it’s worth noting that we didn’t set up anything in Kafka. Spring Cloud Stream automatically created and configured the topics for us. If you need to tune those settings, you can use the Operator custom resources to create the topics beforehand and specify sharing and permissions in the YAML file. 

Now let’s simulate a failure on pond-2. To do that, we simply change the temperature increase rate to one degree every 1.5 seconds by setting the simulator.rate property to 1500 in our simulator.

  1. Stop the simulator on the second terminal
  2. Run the simulator with Maven: mvn spring-boot:run -Dspring-boot.run.jvmArguments="-Dsimulator.id=pond-2 -Dsimulator.rate=1500 -Dsimulator.url=[thermostat service url in openshift]/temperature-records"

The water is now heating slower than before, which reduces the efficiency to ~66% (1/1.5). This will be reflected almost instantly on the dashboard.

Figure 18 – Grafana Dashboard 66%

You should see a new alert on the Prometheus Alerts page (keep refreshing the page, as Prometheus doesn't show alerts in real time). The alert status will initially say PENDING (yellow), then change to FIRING (red) after a minute. This is because we set up the alert to fire only after the condition has persisted for a minute, to avoid false alarms. When the alert is firing, notifications are sent to the chosen destination at the configured interval.

Figure 19 – Prometheus Alert Pending

Figure 20 – Prometheus Alert Firing

Finally, let’s restore the efficiency by stopping the simulator and restarting it with the normal rate:

  1. Stop the simulator on the second terminal
  2. Run the simulator with Maven: mvn spring-boot:run -Dspring-boot.run.jvmArguments="-Dsimulator.id=pond-2 -Dsimulator.url=[thermostat service url in openshift]/temperature-records"

The alert status should go back to green on Prometheus after a few seconds.

Production Considerations

What do you need to do if you want to deploy your solution in a production environment?

  • Ensure the production environment is the same as the test environment to avoid the classic “it works in my environment” problem. Because everything in OpenShift is a YAML config file, you can copy the files into your project to effectively source control your application infrastructure. Move these files to your prod cluster and deploy your app. You can have one branch per environment with different configurations and you don’t have to explain to IT how to run your stuff.

Figure 21 – DeploymentConfig Thermostat Service

  • You want to make sure all these components will scale. For things that are managed by Operators, like Kafka, the default configurations are usually a solid starting point, but you can tinker with them to find what's right for you.
  • For your own services, OpenShift will keep an eye on your pods and replace the ones that die for one reason or another. It’s a good idea to have at least two replicas in prod so you’ll always have one running while OpenShift bootstraps a new one.
  • Spring Boot also comes with production-grade defaults, and because we didn’t write much code at all, and the services are stateless, you can scale them up and down very easily and safely. If you went with the extra step of creating your own Operator, you can configure it so your services and your Kafka cluster scale in sync. Or, re-balance data, execute backups, etc. (See the replicas: 1 field in the previous screenshot. Change it to 2 or more.)
  • On most clusters, OpenShift monitoring and logging are installed out-of-the-box, but if they are not, there is an Operator for that. Cluster Monitoring gives you access to the internal Prometheus/Grafana cluster and allows your NOC to monitor your app on Day 2.
  • Cluster Logging will set up an Elasticsearch cluster and a Kibana GUI and automatically stream your application logs to an index where they can be organized, searched, parsed, etc. That's useful when you want to troubleshoot particular conditions or do post-mortems if a pod dies.
  • If you want to do Blue/Green deployments, versioning, authentication, A/B testing, etc., there is a product called Service Mesh in OpenShift that allows you to configure these easily without touching your code. There is an Operator for that, of course – it's a bit more advanced, but not out of reach for the everyday developer either.

Figure 22 – Service Mesh

  • Remember that when you build your applications around Kafka, you can easily add new features down the road. One of AMQ Streams' companions is Fuse Online, a Red Hat product that lets you create integrations around Kafka without having to write a single line of code.

Figure 23 – Fuse Online

For example, we can use it to consume our alerts topic and send AWS SNS notifications. Or we can install an Elasticsearch cluster using the Operator and log our metrics there to create executive reports and even do machine learning.

Conclusion

We created a fairly complex application with very little code or knowledge of the underlying technologies. We used a familiar Java framework, and thanks to its tight integration with OpenShift and the use of other OpenShift-native technologies, we were able to wire everything together.

As a developer, I can now be given my project space in the OpenShift cluster and create my ideal development environment. I can set up clusters and tear them down in an instant – I'm not required to change the way I code or learn about Docker or Kubernetes, and I only coded the part that actually provides business value. Plus, I'm able to describe my application infrastructure in YAML files, so I don't have to explain it to IT or deal with conflicting configurations from different groups. I'm in control.

And thanks to the numerous OpenShift tools available to deal with production concerns, I don’t have to bake all these things in my code, so it’s a lot easier to maintain, leaner, and more to the point.
