
How we planned to terraform our environment for growth and scale

Part 1 of a series about how we use Terraform to help automate our Okta environment at scale

Introduction

I was introduced to Terraform in June of 2017, as my employer was using Terraform to manage their AWS resources and infrastructure for our product while we moved from an on-prem colo datacenter to an AWS-hosted environment. However, Terraform use within Technical Operations infrastructure was still in its early stages, specifically for services like Okta, Meraki, Workspace One, Jamf, Microsoft Entra ID, etc. Then Okta released their Terraform provider and initial how-to information in 2019. This was a great start, but how could we take advantage of it?

Some historic context

I originally planned to use Terraform at my first job, but shortly after 2019, we got acquired. Then, when I moved to a new company in 2021, there was some clean-up and organization to do: our team rotated through four different managers and was in a bit of a chaotic, turbulent workforce situation. I finally had a manager who was sold on my push for Terraform, but I ended up moving to another opportunity. Subsequently, I have been pushing to use Terraform at $currentJob to help reduce administration overhead, prevent unwanted changes, declare what we would like in our configuration, and, among other things, make it easy to revert and allow us to scale out.

So, how can we plan for our Okta environment's horizontal and vertical scale? We also want to prevent drift in resources explicitly managed by Terraform code, avoid scalability issues, and more. Part of this is done through scalable, automated code and logically siloed infrastructure (for example, through Terraform Workspaces).
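As a simple illustration of that siloing, the built-in terraform.workspace value can be used to branch configuration per environment. The workspace names and base URL mapping below are assumptions for illustration, not our actual layout:

```hcl
# Minimal sketch: silo configuration per Terraform workspace.
# Workspace names ("preview", "production") are hypothetical.
locals {
  environment = terraform.workspace

  # Point the Okta provider at the preview org unless we are in production.
  okta_base_url = local.environment == "production" ? "okta.com" : "oktapreview.com"
}
```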

How do we plan better?

When working with my previous employer’s code, there were some massive drawbacks:

  1. Their code was highly opinionated and built solely for their product infrastructure, and nothing else in terms of IT infra, meaning it wasn’t flexible enough for use within other teams.
  2. They had a scaling problem, which was eventually outlined in a blog post and a talk at SREcon, where they described a wrapper/processor script they created to fix some of their issues.
  3. When I left, they had a single Terraform Pipeline, which was used to create all state items and resources.
  4. The code was incredibly confusing and complex for a new learner of Terraform to use.

Using Architecture Decision Records to determine our Requirements and Use-cases

This took some substantial discussion, but we finally agreed, and since writing an Architecture Decision Record (ADR) we have been rolling the ball down the hill, configuring all we can with Terraform. Some examples of ADRs can be found here:

Below, we outline challenges, various code samples, and use cases within Terraform that assist our scalability.

Workspace Segmentation

This topic covers multiple workspaces and segmentation facilities; it isn’t limited to Terraform but also extends to GitHub, Okta itself (through DEV, Preview, and Prod environments), and users’ capabilities.

In doing this, we will segment a “core” stack and a “standard” stack. To explain what this means (a minimal layout sketch follows the list):

  • Core: Pertains to critical pieces of automation, groups, applications, routines, policies, infrastructure, or frameworks.
  • Standard: Covers code contributed by other teams, non-critical frameworks, groups, applications, branding, etc.
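
As a rough sketch of what that split might look like, each stack can live in its own root module with its own remote state. The bucket name and state key below are hypothetical:

```hcl
# core/backend.tf - remote state for the "core" stack (names are hypothetical)
terraform {
  backend "s3" {
    bucket = "example-okta-terraform-state"
    key    = "okta/core/terraform.tfstate"
    region = "us-east-1"
  }
}
```

The standard stack mirrors this with its own state key (e.g., okta/standard/terraform.tfstate), so a bad apply in one stack cannot touch the other’s state.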

Additionally, we will have branch protection on our main (e.g., Prod) branches for both segments. This means we will use the non-main branches for development testing and QA validation in our preview environment. However, this could mean that drift or unexpected states occur in our preview sandbox; this is fine and expected, as long as that drift happens in the non-core environment.
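For reference, the branch protection itself can be declared with the GitHub Terraform provider. This is a hedged sketch; the repository name and reviewer count are assumptions:

```hcl
# Protect the main branch of the (hypothetical) core repository.
resource "github_branch_protection" "core_main" {
  repository_id  = "okta-terraform-core" # hypothetical repository
  pattern        = "main"
  enforce_admins = true

  required_pull_request_reviews {
    required_approving_review_count = 2
  }
}
```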

Our last mechanism will be multiple GitHub repos and Terraform workspaces that are self-contained within our core stack, with Terraform auto-running for the necessary bits.
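If you run Terraform Cloud/Enterprise, a self-contained workspace with auto-apply can be expressed with the tfe provider. A minimal sketch, with hypothetical names and OAuth token reference:

```hcl
# One self-contained workspace per repo; runs apply automatically on merge.
resource "tfe_workspace" "okta_core" {
  name         = "okta-core"
  organization = "example-org" # hypothetical organization
  auto_apply   = true

  vcs_repo {
    identifier     = "example-org/okta-terraform-core" # hypothetical repo
    oauth_token_id = var.github_oauth_token_id
  }
}
```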

Everyone on our team will also have a personal tenant used for local Terraform development testing. This way, we can work on things in a development cycle that follows:

Develop (Individual Local Environment) > Feature / Change Validation / QA (Preview Environment) > Release (Production)
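
To make the first stage of that cycle work, the Okta provider can be parameterized so each engineer points their local runs at a personal development tenant. A sketch, with assumed variable names (set via TF_VAR_* environment variables or a local tfvars file):

```hcl
variable "okta_org_name" {
  type = string # e.g. "dev-123456" for a personal developer tenant
}

variable "okta_base_url" {
  type    = string
  default = "oktapreview.com" # switch to "okta.com" for production
}

variable "okta_api_token" {
  type      = string
  sensitive = true
}

provider "okta" {
  org_name  = var.okta_org_name
  base_url  = var.okta_base_url
  api_token = var.okta_api_token
}
```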

This cycle allows our development to be unimpeded in our preview environment while allowing our team to validate the changes we have made in a feature, change, or fix pull request before they hit our production environment. Login to QA/Preview is also specifically limited to team members in our environment.

So, how will this look in staging and production? Read the Workspace Segmentation in Github, Terraform, Okta section for more information.

Scalable Code Structure

Things we acknowledge:

  • The code that we generate should be dynamic and flexible. It should adjust to new features, code, services, or functionality wherever possible. We expect drift within the Okta environment; it is naive to think otherwise. Some groups are not and will not be managed by Terraform, as Tech Support can and will do work in the environment. Where possible, Terraform should correct drift when it is unexpected and leave it alone otherwise.
  • Any team should be able to understand and read the code with minimal knowledge of Okta. It should be commented heavily, thoroughly, and succinctly. This will not replace proper service documentation.
  • Any code outside of Core should be easy to replace and rewrite.
  • Any modules should be dynamic enough for use across the org without being hardcoded for a specific team. If a module is hardcoded, it should be named something team-specific (see the module sketch after this list).
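
As an example of that last point, here is a hedged sketch of a generic group module that any team could call without modification. The module path, variable names, and default description are hypothetical:

```hcl
# modules/okta-group/main.tf - a generic, team-agnostic group module
variable "group_name" {
  type = string
}

variable "group_description" {
  type    = string
  default = "Managed by Terraform"
}

resource "okta_group" "this" {
  name        = var.group_name
  description = var.group_description
}

output "group_id" {
  value = okta_group.this.id
}
```

A team would then call it with their own values (for example, module "helpdesk_group" { source = "../modules/okta-group"; group_name = "helpdesk-staff" }) rather than forking a hardcoded copy.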

Monitoring, Alerts, and Automated Resolution

We need alerting and monitoring to compensate for potential issues, missed things, and other situations. But why not add some spice of excellence to it, rather than keeping it super lame?

The following involves Logging, Alerting, Monitoring, and Excellence (not necessarily in that order).

Things we need to consider:

  • How many changes are being made during each run?
  • What is being flagged for changes in each run?
  • Consequently, what has drifted in each run?
  • What actions should occur when drifts are discovered?
  • Depending on the drifted item, should it alert?
  • Etc, etc.

Honestly, we could generate a lot of questions around hypothetical situations here. I would suggest spending several good weeks on this in parallel with writing out your ADRs, adding to your list, whether that is a Google Doc, Notion page, GitHub Markdown file, etc. Something that your Ops team, Security team, Compliance team, or anyone else involved has access to.

So, how do we take these questions and turn them into actions?

  1. Create graphical dashboards (e.g., Grafana or otherwise) that plot items which can be drilled down into numerical representations for each situational context. For example:
    1. Count and graph the number of objects in all states across the Terraform workspaces and resources, then break them down by sub-categories (e.g., Users, Groups, Security Policies, etc.)
    2. Plot the time taken for each run on a multi-sourced time graph (e.g., core-run-1 takes 55 seconds, standard-run-1 takes 60 seconds, core-run-2 takes 30 seconds, etc.)
    3. Determine key log contexts and monitor how often they occur (e.g., security.request.blocked is a good example: if something is misconfigured, this could potentially increase tenfold)
    4. Graph the number of login attempts per geographic region
  2. Create a log dashboard that surfaces the specific key contexts that matter to you, so that at a quick glance you only ever see the logs that are important rather than digging through the full dashboard
  3. When drift is registered or found, depending on the context of the drifted event, workspace, or resource, automatically re-apply the Terraform run in the designated Terraform workspace

Datadog has some good examples of detection rules to begin this process, and these examples can be used in other services if Datadog is not your primary provider.
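
As a concrete (and hedged) example, one of those detection rules could be expressed as a Datadog log monitor managed by Terraform, assuming Okta System Log events are forwarded to Datadog. The query, threshold, and notification target below are assumptions for illustration:

```hcl
# Alert when blocked Okta requests spike, which may indicate a misconfiguration
# pushed by a recent Terraform run. Query, threshold, and handle are hypothetical.
resource "datadog_monitor" "okta_blocked_requests" {
  name    = "Okta security.request.blocked spike"
  type    = "log alert"
  message = "Blocked Okta requests are spiking; check recent Terraform runs. @slack-okta-alerts"

  query = "logs(\"source:okta security.request.blocked\").index(\"*\").rollup(\"count\").last(\"15m\") > 100"

  monitor_thresholds {
    critical = 100
  }
}
```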

The Implementation

This will be broken into a seven-part series, with more content posted each week over the next seven weeks.

And that is it

A lot will be covered over the next several parts, which sums up how we have terraformed certain pieces of our Okta environment. If you have questions and are looking for a community resource, I would heavily recommend reaching out to #okta-terraform on MacAdmins, as I would say at least 30% (note, I made this statistic up) of the organizations using Terraform hang out in this channel. Otherwise, you can always find an alternative unofficial community for assistance or ideas.

Licensed under CC BY-NC-SA 4.0
Last updated on January 07, 2025 at 11:30 CET
 