Introduction
Because enterprises rely on Infrastructure-as-Code (IaC) to manage their infrastructure and platform resources, provisioning those resources is tied to a source control management (SCM) system. That same SCM usually manages their application source code as well. The main trigger for an engineering team to provision, modify or manage a resource lifecycle is a request from another team within the organization. That request is usually made at the direction of the requestor’s development team, in conjunction with their vendor or user specification.
This paper examines models that significantly alter the ‘request’ and ‘deliver’ methodology by removing many, or almost all, of the steps between identifying the need for a platform resource addition or change and delivering it. Adoption of the agile method has helped with the management and ‘politics’ of converting from the ‘old’ request/deliver model to newer ones; indeed, the agile method must be adopted to implement these newer models. This write-up also examines a very specific use-case that Technology Leadership Corporation (TLC) provides to enable a drastic reduction in the ‘request’ to ‘deliver’ timeline and expense for development teams, using a strategy that is mature and ubiquitous, yet differs significantly from some of the costlier and less efficient strategies promoted by many cloud hosting providers today.
Background
Historically, the ‘request’ and ‘deliver’ method for provisioning IT resources has been extremely complex and frequently fraught with scheduling delays and conflicts. Delivery is very expensive, as it involves numerous teams, meetings, testers, managers, technical writers and documentation, and engineers to complete. Great strides have been made over the years to improve this situation. One of them is the adoption of the ‘agile’ method, which is more a philosophy than a method: make smaller, easier changes that require less testing, less documentation, less engineering and less management oversight (i.e. fewer approvals) to implement.
But agile alone does not really change the ‘request’ and ‘deliver’ model. Some red tape can be cut, which marginally improves delivery schedules. But the increased number of changes agile requires, and the detailed planning needed for agile to be ‘safe to implement’, do not significantly alter the complexity and expense of the delivery model. The ‘trigger’ for a resource change is the same in agile as it ever was: a human enters a request into a ticketing system, and the organization’s change management process takes over. Agile manages that change management process well. However, much of today’s modern automation seeks to automate these triggers. This requires a renewed and more aggressive adoption of an agile philosophy that reduces the need for as many of the steps between request and delivery as possible, for as many different types of request as possible.
Automation Triggers
The use of automation triggers is not new, and can be traced back to applications like Unix cron (1975) and (always late to the party) Microsoft’s Windows Scheduler (1995). These are known as time-based triggers. The other type is the event-based trigger, which fires on a change of state, not a change in time. Event-based triggers can be found in SNMP-enabled products like telecom switches and their related enterprise monitoring systems; Cisco’s Embedded Event Manager (EEM) and HP’s Network Node Manager (part of the OpenView suite, later integrated with ITO) have been around for decades. SNMP relies on a Management Information Base (MIB) and ‘agents’ that utilize the MIBs. Agents are programmed to alert when certain thresholds are reached, and can also be programmed to take ‘actions’ dynamically, based on changes in state. As with all great technologies, the SNMP MIB is still in use today, but it is slowly being replaced with streaming telemetry, which we will come back to later.
Repository Hook
An early SCM repository hook came from Perforce, whose central server supported “check-in hooks” used for style enforcement and continuous build actions. I was introduced to Perforce in the early 2000s (and used it until Git came out in mid-2005). Perforce (P4) could trigger scripts installed on the P4 server at check-in. Those scripts could run programs like ‘cc’, ‘make’ or ‘make install’, if they were also installed on the Perforce server. This was how programs could be compiled ‘automatically’ after a commit. The way a check-in hook works is that the SCM application ‘knows’ a check-in happened because it is the process that was invoked to do the check-in. Part of the modern SCM’s job, then, is to ascertain whether there are any ‘pre’ or ‘post’ check-in scripts to run at check-in. This means there is no separate agent checking periodically to see whether there was a commit. The SCM itself knows there was an action and triggers the hook, if one is configured.
Static Infrastructure Provisioning with Terraform
Long before modern ‘cloud computing’, IaC was already a thing. There was CFEngine (1993), Puppet (2005), Chef (2008), and Ansible (2012). I have used them all, and there are many others I have not used. The latecomer to the party is Terraform (2014), which was developed primarily to do what other IaC tools do, but with a specific focus on the hosting provider platforms (i.e. the Cloud). The process of computer and application resource provisioning became a lot more automated with the advent of cloud computing, due to the providers’ ubiquitous use of APIs for management and control. With the advent of these powerful API systems, the nature of what IaC automation tools do expanded. Legacy tools implemented provider APIs via plugins (e.g. Chef implemented the knife-ec2 plugin) to allow the tool to interact with the cloud provider’s API system. Terraform implemented the API systems not as a pile of plugins and extensions, but as the core of the product configuration, via a ‘provider’ section in the configuration.
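As a minimal sketch (not taken from the tf-files plan), this is what that ‘provider’ section looks like; the region and CIDR below are illustrative only:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# The provider block is where the hosting platform's API access is configured;
# every resource declared below it is created and managed through that API.
provider "aws" {
  region = "us-east-1"
}

resource "aws_vpc" "example" {
  cidr_block = "10.0.0.0/16"
}
```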
Using Terraform, small and large scale cloud provisioning requests can be stored in HCL (HashiCorp Configuration Language) files, which are specially formatted plain-text files analogous to YAML or JSON. HCL has its own formatting, structure and usage, and is not interchangeable with other formats such as YAML or JSON. Infrastructure requests like “please create an S3 bucket with my initials in it”; “please install a database system with PostgreSQL and ensure replication to another (remote) location”; and “please create a Windows jump box with a public IP by which I can connect to the backend database” would have been (at least) three separate service tickets. Each ticket needed lots of information for the staff member to implement it; it needed a budget; it needed management approval and scheduling within the team responsible for infrastructure provisioning (historically, this group was called operations). Terraform allows these requests to be bundled into one ‘plan’, and therefore (possibly) one agile ticket, and implemented with one run of the ‘terraform’ command. Similarly, updates, additions and deletions for resources can be made to the same plan, ensuring an extremely consistent environment across the organization with minimal re-testing needed.
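A hedged sketch of how those three tickets could collapse into one plan follows; the names, sizes and AMI filter are illustrative, and a real plan would also handle networking, security groups, and (for a truly remote replica) a second provider region:

```hcl
variable "db_password" {
  type      = string
  sensitive = true
}

# Request 1: an S3 bucket with the requestor's initials in it.
resource "aws_s3_bucket" "as_artifacts" {
  bucket = "as-project-artifacts" # illustrative; bucket names are globally unique
}

# Request 2: a PostgreSQL database with a replica.
resource "aws_db_instance" "primary" {
  identifier          = "app-postgres"
  engine              = "postgres"
  instance_class      = "db.t3.medium"
  allocated_storage   = 50
  username            = "appadmin"
  password            = var.db_password
  skip_final_snapshot = true
}

resource "aws_db_instance" "replica" {
  identifier          = "app-postgres-replica"
  replicate_source_db = aws_db_instance.primary.identifier # a cross-region replica would use the source ARN
  instance_class      = "db.t3.medium"
  skip_final_snapshot = true
}

# Request 3: a Windows jump box with a public IP.
data "aws_ami" "windows" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["Windows_Server-2022-English-Full-Base-*"]
  }
}

resource "aws_instance" "jump_box" {
  ami                         = data.aws_ami.windows.id
  instance_type               = "t3.large"
  associate_public_ip_address = true
}
```

One ‘terraform apply’ now delivers all three requests, and later edits to the same files become updates to the same environment.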
Automation tools like Terraform address some of the traditional ‘request’ and ‘deliver’ shortcomings by potentially reducing the number of requests and the amount of detail needed for a request, and by always reducing the number of tasks needed to deliver the resource(s). But they do not fundamentally alter the two-step hand-off where the requestor is waiting for the provider to deliver, and is blocked until then. Terraform still envisions static provisioning where the requestor is blocked until the IaC is written and tested; Terraform plans are then carefully applied, often in a maintenance window. Before I examine our TLC use-case, we will dive a bit deeper into dynamic provisioning methods that Terraform can help configure, but does not trigger without a human and/or complicated customization.
Development Operations (aka DevOps)
Ambitious developers could coordinate with administrators and do more than just run a compile via a repository check-in hook. For instance, the script that runs the compile could also check whether the build has enough disk space to complete before running. If there was not enough space, a good script might attempt to expand the filesystem (if the script had access to administrator commands), or move the build to another disk volume to complete. Likewise, the script could run post-build commands, like an sftp job to send the compiled object to a dedicated storage volume or (by the early 2000s) an object store system like NetApp. These were steps that an administrator would normally have to fulfill via the legacy ‘request’ and ‘deliver’ model before and/or after a build, including approvals and scheduling. This repository check-in hook is the beginning of what is now called Continuous Integration/Continuous Deployment (CI/CD), or simply the ‘pipeline’.
The ownership of static IaC management and control, along with creation and/or management of the dynamic repository check-in hooks, forms the basis of an entire career called ‘DevOps’ today. Like all good technologies, the check-in hook has been adopted by all modern repositories, just as Terraform (or Ansible, Puppet or Chef) has been adopted for an organization’s IaC management and control. Perforce calls it a ‘check-in hook’. Git calls it a ‘commit hook’. GitHub calls it a ‘Webhook’. The GitHub Webhook is key to the TLC IaC automation use-case programmed into the TLC tf-files Terraform plan, but the approach is applicable to any SCM repository hook.
SNMP vs Streaming Telemetry
SNMP is an example of an event-based triggering protocol that relies on a local agent polling periodically and sending data to a central server (monitor) when a threshold is met. Like streaming telemetry, there is no set time to run it, as it is not (usually) controlled by a scheduler. Unlike the SNMP MIB, streaming telemetry does not require a centralized database of conditions (like filesystem, CPU, and/or memory usage thresholds), nor a local agent constantly monitoring for changes against those thresholds before sending data. Instead, each application (including firmware) or service is constantly sending data, often for every possible condition (that is logged) on the given server, application or system. The sending protocol could be SNMP, but there are more modern protocols for streaming telemetry data. The server receiving the telemetry data has its own set of scripts and binaries that can parse the (log) data coming in, and establish its own set of alerts and actions, as desired.
One of the modern concepts behind streaming telemetry comes from the check-in hook. Modern applications use that method to ‘know’ when they are invoked, log it, and perform any additional ‘pre’ and ‘post’ invocation steps, as able. Unlike streaming telemetry, the check-in hook behaves more like an agent. Streaming telemetry data is essentially ‘spamming’ a central server with endless data, to be parsed by the server (only) in order to derive some meaningful functionality or diagnostics from that data (we hope). In stark contrast, the check-in hook (the GitHub Webhook we detail later) only runs when the git server receives a very specific event, usually either the ‘push’ (the check-in of code to the SCM) or the ‘merge’, but possibly a ‘pull’ or other event: whatever the particular SCM you are using can/will support.
Streaming telemetry, on the other hand, requires the collection, transfer, and analysis of large amounts of data. It also requires the server to ‘discover’ the event in a mountain of data and respond accordingly. Overall, the performance of streaming telemetry is only marginally better than SNMP agent polling and data transfer, and only if you throw a ton of resources at it. Comparatively, the check-in hook is an extremely efficient method to discover a state change, and it operates in near real-time. That efficiency lowers costs through lower data traffic, reduced disk I/O, and greatly reduced memory and CPU requirements to discover and process changes in state.
Dynamic Infrastructure Provisioning and the Self-Service Model
As described, whether via cron, an SNMP agent, streaming telemetry, or an SCM (check-in) hook, there are ways to trigger actions based on schedules or events. Those actions can, in theory, provision platform infrastructure, provided the proper planning, approvals and testing are performed before implementation. All of these methods require an ‘agile’ environment where rigorous planning and testing go into a dynamic system of resource allocation. Meanwhile, the vast majority of (cloud and on-premise) infrastructure is still delivered via the older ‘request’ and ‘deliver’ model. Teams meet, planners plan, budgets are analyzed, maintenance windows are allocated, and eventually the resource is provisioned.
These requests often provide ‘static’ resources that are stood up and kept running, forever. Sometimes the provisioning is ‘ephemeral’, but based on a prearranged or estimated timeline for the resource to exist. There is also the ‘recurring’ request, which predates agile itself but is foundational to it: routine or recurring requests can bypass much of the planning and approvals and go straight from request to delivery. But even in that scenario, there is still a two-step hand-off where the requestor is waiting for the provider to deliver, and is ‘blocked’ until then. These requests largely result in the delivery of static resources that must exist before the requestor can proceed.
Another modern and ever-growing ‘request’ and ‘deliver’ methodology is the ‘self-service’ model. Self-service is now more-or-less ubiquitous when it comes to user access and management. Much of the approval and verification work is fully automated. It is still a very ‘manual’ process, as anyone who has ever signed up for an online account knows, but the manual data-entry burden is shifted onto the customer. Only if the customer has problems does the organization’s staff have to get involved. The trigger is still initiated by a person, the user, when the ‘submit’ button is clicked. From there, the organization’s automation kicks in. User account creation, verification and, most importantly for this use-case, dynamic platform resource allocation is the kind of automation where a huge amount of an organization’s staffing and budget goes today; particularly for newer organizations that have not yet matured that process.
A simple example of dynamic resource provisioning in AWS that may occur (beyond adding a user to a user database) is creating a dedicated storage resource for the user at account creation. This can be an AWS S3 bucket linked to the user by a row in a database table. Not long ago, and still in many shops today, that S3 bucket request would be a ticket in an agile system, blocked until approved, scheduled and delivered. The self-service model is crucial to adopting platform and infrastructure provisioning that bypasses all the agile steps between request and delivery.
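In a fully self-service flow the bucket would typically be created by the signup backend calling the provider API directly, but as a hedged, Terraform-flavored sketch of the same per-user pattern (the user IDs and naming scheme are illustrative):

```hcl
variable "user_ids" {
  type    = set(string)
  default = ["u1001", "u1002"] # illustrative user identifiers
}

# One dedicated bucket per user; adding an ID and re-applying provisions the next
# bucket without a ticket, an approval meeting, or a maintenance window.
resource "aws_s3_bucket" "user_storage" {
  for_each = var.user_ids
  bucket   = "myorg-user-${each.key}" # illustrative naming scheme
  tags = {
    owner = each.key
  }
}
```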
Auto-Scaling and Kubernetes
One of the newer types of dynamic resource provisioning tools is the ‘auto-scaling’ group. These are groups of virtual servers that, when ‘scaled’ via some trigger, add to, or subtract from, a baseline number of running virtual servers. Auto-scaling groups can use a schedule to trigger ‘scale up’ (add) or ‘scale down’ (remove) actions that change the number of running virtual servers. Auto-scaling can also use SNMP and/or streaming telemetry (depending on what component is being measured) and, based on thresholds set for the measured component, trigger scale up or scale down actions. The auto-scaling group itself relies on the provider’s API system, and on the (SNMP and/or telemetry) data flowing between the monitor and the auto-scaling group, to take the appropriate scale up or scale down action.
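A hedged sketch of both trigger styles attached to one AWS auto-scaling group follows; in AWS the metrics come from CloudWatch rather than SNMP, and the AMI, subnet and sizes are illustrative:

```hcl
resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = "ami-0123456789abcdef0" # illustrative AMI ID
  instance_type = "t3.micro"
}

resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  min_size            = 1
  max_size            = 4
  desired_capacity    = 2
  vpc_zone_identifier = ["subnet-0123456789abcdef0"] # illustrative subnet ID

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }
}

# Time-based trigger: scale up every weekday morning.
resource "aws_autoscaling_schedule" "weekday_morning" {
  scheduled_action_name  = "weekday-morning"
  autoscaling_group_name = aws_autoscaling_group.web.name
  recurrence             = "0 8 * * 1-5"
  min_size               = 2
  max_size               = 6
  desired_capacity       = 4
}

# Event/metric-based trigger: tracks *average* CPU across the group --
# one hot instance may never move the aggregate past the target.
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60
  }
}
```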
There are many problems with auto-scaling groups. The major problem, for event-based triggering, is the need for streaming telemetry from the group and fast parsing of it to identify an event that requires a scale up or scale down action. The threshold for that trigger is the heart of the problem, because auto-scaling groups aggregate the data. If you have a group of servers but only one instance has high CPU utilization, there is no trigger at all; the one server continues to run slowly compared to the others. A weak workaround is to keep the auto-scaling groups very small, maybe a one-server minimum with a two-server maximum; then you’ll need a lot of auto-scaling groups. Another is to somehow guarantee each server in the group gets more-or-less the same load across all instances. That is much easier said than done.
The other major problem with auto-scaling groups is that the event trigger must be one of the data components streamed from within the group. This largely limits triggers to basic OS and network level events, such as low or high levels of CPU, memory, network interface, and disk storage and I/O utilization. If you need to trigger on something else, the task becomes exponentially more complicated in AWS. A typical way to do event-based scale up or scale down without relying on auto-scaling triggers is an EventBridge rule that inspects and triggers on CloudWatch events, either inside or outside of EC2 (but still limited to AWS events!). AWS is both unprepared to, and disinterested in, triggering on events from outside of AWS. But we are not done, because EventBridge doesn’t actually control anything. It can then pass the event to another AWS service, usually AWS Step Functions or AWS Lambda. From there, you are still not done! Unless what you want to trigger is a Python script, Lambda did not do it for you; an AWS Step Function similarly does not do much for you on its own. One of those two (Lambda or Step Functions) must then invoke another service, usually AWS CloudFormation templates, or possibly your own Lambda implementation of the boto3 API via Python. And, of course, the CloudFormation templates and Python scripts are more IaC, and you are now officially a dog chasing its tail! I will address this with the use-case I lay out later in this write-up.
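As a hedged illustration of that chain, the EventBridge half alone looks roughly like the following; the rule only matches an AWS-originated event and hands it off, while the Lambda that actually does the work (aws_lambda_function.scaler here, a hypothetical name defined elsewhere) is yet another layer of code:

```hcl
# EventBridge rule: match an EC2 state-change event (still limited to AWS-originated events).
resource "aws_cloudwatch_event_rule" "instance_stopped" {
  name = "ec2-instance-stopped"
  event_pattern = jsonencode({
    source        = ["aws.ec2"]
    "detail-type" = ["EC2 Instance State-change Notification"]
    detail        = { state = ["stopped"] }
  })
}

# The rule only matches; it controls nothing. Hand it to a Lambda target,
# where the Python/boto3 (or Step Functions) layer would live.
resource "aws_cloudwatch_event_target" "to_lambda" {
  rule = aws_cloudwatch_event_rule.instance_stopped.name
  arn  = aws_lambda_function.scaler.arn # hypothetical Lambda, defined elsewhere
}

# And EventBridge needs explicit permission to invoke that Lambda.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scaler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.instance_stopped.arn
}
```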
Going into all the pros and cons of Kubernetes is outside the scope of this paper. However, the inherent ‘problem’ with Kubernetes is that it is what I call a ‘platform-in-a-platform’. Whether on-premise or ‘cloud’ hosted, Kubernetes is a container hosting platform that requires dedicated servers running 24/7 and tons of Kubernetes-specific resources to effectively manage and maintain. The Kubernetes control plane cannot be easily shut down and restarted. Its etcd database, as well as various operational requirements, precludes the control plane from being completely shut down. Removing, reducing, or increasing the number of control plane nodes needs to be done carefully, and is a process apart from growing or shrinking the worker nodes. Achieving ‘stability’ in the control plane after changes takes time, and thus does not lend itself to dynamic provisioning at all. This is why cloud-hosted Kubernetes always has the control plane as the ‘serverless’ component of the cluster: you cannot see it, and interaction with the control plane is limited.
The Kubernetes ‘worker’ nodes can be shut down, but the process requires ‘draining’ the nodes before shutdown. Removing worker nodes entirely requires joining new nodes while removing or replacing old ones, which is additional database work and overhead, and it also requires the control plane to be running already. Changing the availability of worker nodes is also time consuming, but less ‘dangerous’ to control plane operations. AWS Elastic Kubernetes Service (EKS) uses auto-scaling groups as the underlying API management and control layer for the worker nodes.
The advantage of the Kubernetes platform involves the use of containers: smaller implementations of a standard computer operating system, plus the application that runs on it. These containers run in what is called a service mesh, as each container runs a separate service (e.g. the web server on port 443 is one set of containers, the log scraping on port 6514 for streaming telemetry over TLS is another set, the ephemeral data store like memcached on port 11211 is another, and so on). This mesh is then networked together with a virtual network via the Container Network Interface (CNI). The CNI allows for easier detection of services within the mesh, which allows containers to be stopped, removed, and added with minimal disruption or reconfiguration to the service. Containers also start and stop quicker than a full-blown server and application, though that time depends on the size of the container and the application running on it. Kubernetes’ ability to distribute load evenly across nodes a little more easily, usually via significant manual configuration of Helm charts (which is just more IaC), is a plus. This makes the auto-scaling triggers a bit more useful, and worker nodes can scale up or down within 10 minutes or so per event, a bit more reliably.
GitHub Actions Runners the Hard Way
GitHub’s recommendation for dynamically created self-hosted Actions runners is to use ephemeral runners on containers, which creates a management and provisioning architecture that looks something like this:
* Use a new container registry to store and manage GH runner containers (e.g. Docker Hub or AWS ECR).
* Then, use an IaC tool (Terraform) to provision another platform inside AWS, called Kubernetes (EKS).
* Only then, abandon Terraform completely for a completely separate set of IaC: Helm Charts!
* Even if you are CIS compliant, you’ll still need security tools, like Kyverno, that you don’t need anywhere else.
* To secure your service mesh traffic, you have to provision many new SSL certificates for each endpoint.
* Let’s go ahead and instrument EKS with streaming telemetry, because of all the fun new metrics to track.
* But, to add insult to injury, who needs Terraform or Helm Charts for Kubernetes deployments? Why not go ahead and use yet another IaC deployment tool for your platform-in-a-platform solution? Let’s use FluxCD (or ArgoCD) to deploy the endless Helm chart updates you’ll need to maintain your platform-in-a-platform.
* Finally, to help manage the nightmare that is Helm Charts, we’ll set up another repo for the FluxCD override file, to override all of your Helm Chart IaC settings.
* With any luck, we can finally deploy our GitHub self-hosted Actions runners now :-)
What’s the problem? Not a thing, if you have duffle bags full of money and endless staff, time, and skilled human resources to manage multiple IaC code sets, multiple vendor product versions, and the entire platform-in-a-platform with all of its dependencies. Oh, and let’s not forget licenses for all that new software you need. Please don’t misunderstand: I actually embrace and support containerization, Docker and Kubernetes. It is a truly revolutionary solution for the application deployment lifecycle. But the platform-in-a-platform solution is both expensive and ill-advised in many implementation scenarios. To borrow a description from an old NYSE VP of mine about the over-population and subsequent under-utilization of expensive tech resources, Kubernetes clusters ‘multiply like mice’. A cluster for each environment. Dedicated clusters for various products. Multiple clusters in each account, in each region. You start with one cluster and end up with dozens in the blink of an eye.
GitHub’s recommendation for dynamically provisioned (aka ephemeral) Actions runners makes little sense if you are not already a Kubernetes shop. But even if you are a Kubernetes shop, knowing that the mission of an Actions runner is to do software builds and pipelines for development, you already know that other departments, like the finance department or the BI team, loathe sharing their development or production Kubernetes cluster(s) with other high-utilization, ‘non-production’ workloads. So you are standing up (at least) one entire platform-in-a-platform Kubernetes cluster, complete with its requisite 24/7 control plane; plus streaming telemetry (via auto-scaling groups) to right-size the worker nodes, with a bare minimum 10-minute lag (usually closer to 20 minutes) in adding or removing worker nodes from the 24/7 control plane that hosts your build pipeline. Does that sound ‘ephemeral’ to you? I will answer very gently and say ‘kind of….?’
GitHub Actions Runners the Easy Way – Terraform provisioning with GitHub Webhook and AWS Lambda
The Terraform plan is referred to below as tf-files. It has a very good README and can be downloaded from: https://github.com/AndrewSimon/tf-files/tree/workflow

[Architecture diagram: the ‘tf-files’ Terraform plan. Copyright Technology-Leadership, LLC 2026]
Above is a conceptual architecture diagram depicting the ‘tf-files’ Terraform plan, which creates all of the static resources that combine to provide a dynamic solution to AWS infrastructure provisioning. The plan also creates the Lambda function used for dynamic provisioning. It does not rely on auto-scaling groups, streaming telemetry, EventBridge, or Kubernetes clusters. It relies on:
1. A repository hook (GitHub’s Webhook, for demonstration purposes).
2. A Python script maintained within the Terraform code base (not pulled from a separate Python package repository, for simpler demonstration purposes).
3. Lots of Terraform code, including AWS policy documents, to make it all work.
4. AWS Marketplace: Oracle Linux 9 with Github Actions Runner by TLC
The plan creates a new VPC and all new resources inside it. This is the easiest, safest and fastest way to test and use it, if desired. The idea is that you have the code now, and it is a simple enough task to change which repo the hook will be applied to. Once the hook is assigned to a repo, the AWS portion is more-or-less identical for all of your GitHub repos. Currently, each repo gets its own hook and endpoint, but again, some small modification and/or hard-coding of some variables can ensure all of your repos hit the same Lambda endpoint, if preferred (see the sketch below).
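A hedged sketch of that ‘same endpoint’ variation, assuming the integrations/github provider is configured and the plan exposes its Lambda function URL; the repository names are illustrative:

```hcl
variable "runner_repos" {
  type    = set(string)
  default = ["app-frontend", "app-backend"] # illustrative repository names
}

variable "lambda_endpoint" {
  type        = string
  description = "The Lambda function URL created elsewhere in the plan"
}

# One push webhook per repository, all pointing at the same Lambda URL.
resource "github_repository_webhook" "runner_trigger" {
  for_each   = var.runner_repos
  repository = each.key
  events     = ["push"]
  active     = true

  configuration {
    url          = var.lambda_endpoint
    content_type = "json"
  }
}
```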
Terraform generally does not destroy resources that were not created with Terraform; it will not destroy resources it does not know about. But if you use ‘terraform import’ to adopt an existing VPC and/or existing resources, you then have the potential to delete or replace those resources if certain attributes of the current infrastructure do not match the plan. When importing resources into Terraform state, ensure the ‘important’ attributes in the plan (e.g. the AMI ID for an instance, the database engine version for RDS) are changed to match exactly what is currently deployed in your environment; otherwise the imported resource can end up deleted and replaced.
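As a hedged sketch (using the Terraform 1.5+ import block; the VPC ID and attributes are illustrative):

```hcl
# Bring an existing, hand-built VPC under Terraform management.
import {
  to = aws_vpc.shared
  id = "vpc-0abc123456789def0" # illustrative VPC ID
}

# The resource arguments must mirror the deployed VPC exactly;
# a mismatched attribute here is how an import turns into a replacement.
resource "aws_vpc" "shared" {
  cidr_block           = "10.20.0.0/16" # must match the real VPC
  enable_dns_hostnames = true           # and so must settings like this
}
```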
In the tf-files design, Terraform packages the Python script and deploys it to Lambda. Terraform also builds the webhook that, when a commit occurs to the repo, triggers the Lambda via its Lambda URL. The Python script uses boto3 to stand up a new GH runner dynamically. To complete the lifecycle, tf-files provides the GitHub Action that installs the dependencies, which combines with the supplied user-data and runner hook to destroy the runner at job completion. When the hook is implemented, it is recommended to set the maximum number of runners high enough that each job gets an assigned runner. I am still working on the code to prevent instance termination if there are jobs queued with unassigned runners, and on better support for multiple regions. Maybe another week or two of coding and testing for those enhancements.
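The wiring that paragraph describes looks roughly like the sketch below. The resource names, handler file and runtime are illustrative assumptions, not the literal tf-files code, and the IAM policies that let the function launch EC2 instances are omitted:

```hcl
# Package the Python/boto3 script that is maintained inside the Terraform code base.
data "archive_file" "runner_launcher" {
  type        = "zip"
  source_file = "${path.module}/runner_launcher.py" # illustrative file name
  output_path = "${path.module}/runner_launcher.zip"
}

# Execution role for the Lambda (EC2 run/terminate permissions attached separately).
resource "aws_iam_role" "runner_launcher" {
  name = "gh-runner-launcher"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_lambda_function" "runner_launcher" {
  function_name    = "gh-runner-launcher"
  role             = aws_iam_role.runner_launcher.arn
  handler          = "runner_launcher.handler"
  runtime          = "python3.12"
  filename         = data.archive_file.runner_launcher.output_path
  source_code_hash = data.archive_file.runner_launcher.output_base64sha256
}

# The endpoint the GitHub Webhook calls on every push.
resource "aws_lambda_function_url" "runner_launcher" {
  function_name      = aws_lambda_function.runner_launcher.function_name
  authorization_type = "NONE" # validate the webhook secret inside the handler in real use
}

output "lambda_endpoint" {
  value = aws_lambda_function_url.runner_launcher.function_url
}
```

The webhook from the earlier sketch points at this lambda_endpoint output, so every push becomes a call into the boto3 script that launches the runner.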
The TLC Amazon AMI is a lean, custom-built virtual machine image that is an excellent choice for any workload, not just as a GitHub hosted runner, and I encourage it as the base image for all of your instance-based deployments. As for acting as a runner, I have timed it, and it will go from webhook request to registered runner in 90 to 120 seconds. If your job queue exceeds the number of Kubernetes runner pods you are already running (i.e. pre-provisioned static pods), the wait time for my solution is usually far less than the containerized implementation’s, because GitHub ephemeral containers have to complete their run and die before the next container runs and connects. For Kubernetes, the trigger is a dead pod; for tf-files, the trigger is the push itself. Write me at asimon@technology-leadership.com with any questions, or if you would like to engage TLC in a short, medium or long term contract to enhance and improve your real-time provisioning, with or without Kubernetes.