Why I like Kolide for Device Remediation and Assurance

A Brief Summary
My Expectations for This Project
What Does Our Security Model Look Like Today?
Why Device Assurance Is Essential for Large Businesses
So, What Do We Do?
Going Forward
Closing Out

A Brief Summary

After several months of refactoring our macOS configuration, hearing Apple declare that “Declarative Device Management is the new hotness!”, and with our Okta and Windows infrastructure next on the list, I realized I needed something that does more than validate that our configurations are applied successfully and working as expected. I also wanted something that reduces ticket friction and shows our employees how to validate and fix those problems themselves.

We need a third-party Device Assurance and Posture Checking tool that can integrate with multiple systems.

Info

November 2024 Update:

The company Kolide has been purchased by 1Password, and going forward in any new blog posts (as long as I can remember it), will be referred to as 1Password (formerly Kolide).

The product Kolide, in any new blog posts (again, as long as I can remember it), will be referred to as 1Password’s Device Trust solution (formerly Kolide).

My Expectations for This Project

When I joined $currentJob, we had two understandings (check the blog post time compared to my work history to figure out where / when that is).

Our Windows Devices are up to standard.
After several refactoring improvements and months of work after I joined, our macOS devices were now up to standard.

However, there were several fallacies to this understanding.

We used an MSP for our Windows Management that did little work or hardening of our devices — at least not more than the baseline security compliance that any non-technical user could do. They were anything but proactive.
On macOS, if a configuration (like a password) was set prior to the configuration profile being deployed, it wouldn’t necessarily require a change of the password to bring it into compliance. This applies to various situations that could prove problematic (see the note below).
Apple’s services are only sometimes as reliable as Microsoft’s.
Microsoft’s services are only sometimes as reliable as Apple’s.

NOTE: In Sonoma, this behavior is supposed to be fixed. However, I have first-hand experience of pushing a password profile down, expiring and waiting for the password policy to finish applying, and nothing happens. I can still use my insecure old password on my device. Great job Apple!

This made me realize that we need ways to evaluate on-device policy that cannot be determined via Intune, Jamf, CrowdStrike, etc. Sure, Jamf has extension attributes, but they don’t do as much as we need in the grand scheme of things. We need a secondary compliance processor and broker that doesn’t merely set the configuration but also verifies that it is correctly configured.

While osquery is an incredible tool, it can be difficult to convince a large business, especially one in a regulated industry, to adopt open source software for what I had in mind. I wanted to use it to improve several of our processes, such as:

Device Problem Ownership - placing the responsibility back on the end user
Reducing Technical Support Ticket requests - reducing the volume of inbound support tickets
Improve Logging, Alerting, and Monitoring Excellence (AKA: LAME) around end-user devices
Improve our Linux attestation from 0% to anything higher

What Does Our Security Model Look Like Today?

$currentJob currently follows a Zero Trust Security model. However, what does that actually mean?

There are various implementations of this model and interpretations of what it means, but fundamentally:

Zero Trust Policy is a cybersecurity model that assumes no entity, whether inside or outside the network, can be trusted by default. Verification is required from everyone attempting to gain access to resources within the network, regardless of their location or whether they have been previously verified.

In practice, this means:

Verify Explicitly: Always verify access requests using all available data points, such as user identity, location, device health, service or workload, and data classification.
Use Least Privileged Access: Limit user and device access to only the resources necessary to perform their functions. This minimizes the potential damage from compromised credentials.
Assume Breach: Operate with the assumption that the network is already compromised and design security measures accordingly. This includes minimizing the impact of breaches and enhancing the ability to detect malicious activity.

The key components are identity verification (of people, devices, machines, services, etc.), continuous monitoring of user, device, and device security state in real time, endpoint security such as an EDR, network segmentation across offices, cloud infrastructure, and any other network sections, and finally, adaptive policies for service access.

We do all of this well enough, but how do we improve? How do we take this to the next level? Let’s up our game!

Why Device Assurance Is Essential for Large Businesses

Nowadays, large businesses depend on various devices to keep operations smooth, enhance productivity, and safeguard data. Device Assurance is crucial for ensuring every network device is secure, compliant, and working well. Here are a few reasons why:

Better Security: With cybersecurity threats constantly changing, protecting devices from hacks and malware is vital. Device Assurance offers solid security measures like antivirus software, firewalls, and regular updates, which cut down the risk of data breaches and cyber-attacks.
Staying Compliant: Large businesses need to follow various industry rules/standards, global compliance policies, and laws. Device Assurance helps keep everything in check with regular audits and by sticking to regulatory standards so companies can avoid fines and legal trouble.
Smooth Performance: Constantly monitoring performance ensures devices work efficiently, reducing downtime and boosting productivity. Businesses can keep things running smoothly by tackling performance issues before they become problems.
Saving Money: Regular maintenance and updates prevent expensive repairs and help devices last longer. This saves money and ensures resources are used wisely, which is good for the company’s bottom line.
Happier Employees: Reliable and well-performing devices are key to keeping employees productive and satisfied. Device Assurance ensures everyone has the tools they need to do their jobs effectively, leading to a more efficient and happy workforce.

In summary, Device Assurance is not just a technical requirement but a strategic necessity for large businesses. Device Assurance helps enterprises maintain operational excellence and gain a competitive edge in the market by ensuring security, compliance, performance, and cost efficiency.

So, What Do We Do?

We implement this in a way that does three things:

Doesn’t impede employee work or capability
Is flexible, which means expanding to other systems and seamless integration in our environment
Doesn’t have a narrow or single scope or use case

I was inspired by Pinterest’s AuthN & Compliance Blog Post back in 2023. We were at the beginning of our Okta Device Trust Roll Out, and I wanted to implement something like it during that project. However, I was too early in my career at $currentJob to onboard another tool while also learning the infrastructure and doing the rollout at the same time.

Investigating Options

In my view, there were only a few good options, plus one new service, announced after we started looking into this, that could have covered some of the functionality we needed.

Kolide

I had first heard of Kolide back in 2018. It was excellent, but not what we needed at my $oldJob given what we were doing at the time. Then, shortly after the start of our rollout, Kolide announced some major overhauls. This looked perfect!

Okta’s Device Assurance as Part of Okta Verify

Okta also has Device Assurance. We tried it, but there are some proper downsides to this, which I explained in my Okta Device Trust Roll Out blog post. It works, but it isn’t ideal, and everything it does, we can do with Kolide, plus a lot more.

For example, it doesn’t tell the user what is wrong with the device or why they are being blocked, which is a terrible experience.

CrowdStrike Falcon Foundry

CrowdStrike, unfortunately, released their advertising about this a bit too late, but they have recently announced “Foundry”. However, this introduces problems and complexity: our team doesn’t own CrowdStrike, so any future Falcon implementation would need to be coordinated, and any time we ran into a permissions problem, we would need to go through our CrowdStrike team to sort it out. That could complicate both the work itself and our use cases.

FleetDM

FleetDM is a multi-approach MDM tool with functionality similar to Kolide. In fact, around the time that Kolide retired its “Fleet” tool, FleetDM forked or started reworking Fleet. However, their approach is to schedule events in employees’ calendars so that they resolve issues themselves. It is an interesting approach and one I favor, but it still puts the onus on the device. Additionally, as far as I can tell from my research, they do not have inline SAML hooks or an experience similar to Pinterest’s or Kolide’s. I also thought the UI needed to be more friendly (read that however you want about end users) to be proactively helpful.

Our management has historically told me that Kolide’s version of Fleet, which FleetDM more or less forked, was not sufficient for our needs.

So, Kolide Is the Winner for Our Use Case

So Kolide (K2), with its ease of use, inline SAML hooking, and user experience, seemed like the best approach for our needs.

So, Testing… Right?

So we began testing this in February - right after I got back from the US - so yay, jet lag. We had two stages of testing this on a trial basis to validate that the behavior was what we wanted and would work for our needs. To do this, we attempted using Kolide with three Operating Systems:

Windows
macOS
Linux
- Ubuntu 22.04 LTS

From an individual perspective, this worked great. Devices were checked or blocked depending on their configurations.

We did see some slight issues. For example, enabling a check called “Remote Access—Daemons should not be installed or running” seems to have some drawbacks; everything on Linux runs a Daemon—GUI? Daemon.

So, how do we resolve that? We set up specific configuration overrides. Thankfully, Kolide allows us to set up exclusions to the check, meaning we can specify which daemons we would like to allow to run. Even so, we still come out with a win: we now have more insight into Linux devices before they hit our SaaS applications.
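The exclusion idea can be sketched in a few lines of Python. This is only an illustration of allowlist logic, not Kolide's implementation or configuration format: the daemon names, `ALLOWED_DAEMONS`, and both function names are made up for the example.

```python
# Hypothetical sketch: none of these names come from Kolide's product or API.
ALLOWED_DAEMONS = {"gdm3"}  # assumed allowlist, for illustration only


def failing_daemons(running, allowed=ALLOWED_DAEMONS):
    """Return daemons that are running but not on the allowlist, sorted by name."""
    return sorted(set(running) - set(allowed))


def check_passes(running, allowed=ALLOWED_DAEMONS):
    """The check passes only when every running daemon is on the allowlist."""
    return not failing_daemons(running, allowed)


if __name__ == "__main__":
    observed = ["gdm3", "xrdp"]  # placeholder process names
    print("failing:", failing_daemons(observed))
    print("passes:", check_passes(observed))
```

The point of keeping the allowlist explicit is that anything not named on it still fails the check, so new remote-access daemons show up as failures rather than slipping through.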

Roll Out to a More Extensive Testing Audience

So, as soon as we were about to sign, or maybe even the day of or after, 1Password announced their purchase of Kolide. Well… great? This could provide great things for Kolide but hamper our ability to roll out an excellent experience to our employees. Thankfully, this fear has been unfounded.

We began to roll this out to a larger audience, our entire Digital Experience team (which includes a subset of Tech Support, Tech Operations, and Digital Transformation), and this is where things went south - the experience we were expecting could have been better. We hit multiple random issues along the way. The main concerns, though, were:

Devices previously registered falling out of registration due to edge case situations within the registration flow
The Kolide agent disappearing after an agent update on Windows devices
Device Registration flow issues for end users

In all cases, the Kolide team was super responsive to feedback and incredibly transparent and supportive about why the issues were occurring and how they would be resolved. I would say they were refreshingly transparent, which I personally at least appreciate.

Initially, Jason Meller, the CEO / Founder, and Joseph, a principal engineer at Kolide, hopped in and provided some updates.

Kolide Agent Disappearing After an Agent Update

It seemed that either a configuration we had set or a race condition was causing Windows to interfere. If the device was asleep and woke up briefly, for network connectivity or some other reason, the Kolide Agent would attempt to update at the same time (and/or the device would go back to sleep during the update). This was a very weird edge case, which was ultimately solved in a follow-up version of v1.6.2.

Devices Previously Registered Falling Out of Registration

So, this was a more exciting edge case. Apparently, some of our users went through the “Register Device…” menu bar item and successfully completed the flow. However, instead of clicking the “Okay” button on the screen after finishing the flow, they just closed the Chrome tab. As a result, the final completion trigger never fired, so the device was never properly registered in Kolide and was eventually de-registered after a period. There were 3 contributing causes, most of which were end-user problems on the device: for example, not clicking “I’ll fix this later - Continue to Sign In”, or the device being in “Will Block” mode because it had not passed blocking checks.

Ultimately, Kolide resolved this matter, and it worked fantastically afterward.

(Multiple) Device Registration Flow for End Users

This was a more impactful problem, given the devices falling out of registration and the Trust On First Use (TOFU) model that Kolide uses for its deployment process. It created some unnecessary headwinds for rolling out the software properly across all devices and users. While we were able to register a user’s first device, any subsequent device seemed to cause problems.

Within 2 days, however, Kolide had a functional workaround for us and a new feature for other users of the Kolide service. They announced this as a new device registration flow.

With this new registration flow, the experience improved by 200%.

(obtained from Kolide’s blog post)

Within 10 days, all of our problems, including some of the issues outlined above, were resolved.

What We Learned About Our Devices During the Initial Rollout

I already had the opinion that our Windows configurations could be improved. Now, using just the checks provided by Kolide, I had firm data to show our director-level (and higher) colleagues where we needed to improve things. Without going into specific checks, things that should have been configured by the MSP had not been, even simple things like UAC prompts.

Global Roll Out

So, how do we roll this out globally now that we have tested with the testing team?

Install the software on all devices before any rollout begins, as:
  1. Users won’t have admin access to install it themselves during the registration flow
  2. It helps make the registration flow smoother
Roll out to 10 people in Tech, on a random selection of devices.
  1. Validate that their experience is as expected.
  2. Confirm that expectations and the new registration flow are as described above.
Set up automation to gradually roll out the software over the next month
  1. Make sure that this skips non-human accounts and only impacts “human” accounts
  2. Make sure that the group in question can’t be confused or managed by Tech Support
Turn on checks 2 months after rollout
Re-assess checks after 6 months
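As a rough illustration of the gradual rollout step, here is a minimal Python sketch that splits “human” accounts into weekly waves. It is not our actual automation: the account shape (`email` and `type` fields), the four-wave default, and the function names are all assumptions made for the example.

```python
from datetime import date, timedelta


def is_human(account):
    """Assumption: each account record carries a 'type' field we can filter on."""
    return account.get("type") == "human"


def plan_rollout(accounts, start, waves=4, days_between=7):
    """Spread human accounts round-robin across waves, one wave per interval."""
    humans = sorted(a["email"] for a in accounts if is_human(a))
    plan = {}
    for i, email in enumerate(humans):
        wave_day = start + timedelta(days=(i % waves) * days_between)
        plan.setdefault(wave_day, []).append(email)
    return plan


if __name__ == "__main__":
    sample = [
        {"email": "a@example.com", "type": "human"},
        {"email": "svc-backup@example.com", "type": "service"},
        {"email": "b@example.com", "type": "human"},
    ]
    for day, emails in sorted(plan_rollout(sample, date(2024, 6, 3)).items()):
        print(day.isoformat(), emails)
```

Filtering before batching means service accounts never appear in any wave, which matches the intent of only impacting “human” accounts.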

Going Forward

Better Automation of Device Groups

Currently, managing Device Groups is a manual process unless you use the API, and this is fine for the most part. However, with Virtual Desktop Infrastructure, where machines may spin up or down or, potentially even worse, generate new agent IDs or install instances, it would be ideal to automate this where possible. As far as I am aware, this is currently not possible natively inside of Kolide.
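To make the wish concrete, here is a minimal Python sketch of rule-based group assignment. It deliberately makes no API calls, because I am not going to guess at Kolide's endpoints here. The hostname patterns, group names, and `group_for` function are all hypothetical.

```python
import fnmatch

# Hypothetical rules: hostname glob patterns mapped to made-up group names.
GROUP_RULES = [
    ("vdi-*", "Virtual Desktops"),
    ("*-linux", "Linux Workstations"),
]


def group_for(hostname, rules=GROUP_RULES, default="Unassigned"):
    """Return the group of the first pattern that matches the hostname."""
    name = hostname.lower()
    for pattern, group in rules:
        if fnmatch.fnmatchcase(name, pattern):
            return group
    return default


if __name__ == "__main__":
    for host in ["VDI-0042", "dev-linux", "macbook-1"]:
        print(host, "->", group_for(host))
```

Matching on a stable attribute like the hostname, rather than the agent ID, is what would let a freshly spun-up virtual desktop land in the right group even when its agent instance is new.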

Writing Checks

The possibilities for check writing are pretty much limitless, bound only by our knowledge of SQL lookups. I am hoping to hand over some basic check writing to our Tech Support staff so that they understand how Kolide works, can work with new technology, and can help write documentation or support material for the checks, which in turn should decrease their own ticket count.
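Since checks boil down to SQL lookups, here is a minimal sketch of the shape of one, run against an in-memory SQLite database from Python. The `device_settings` table, the setting name, and the 300-second threshold are invented for the example. They are not a real osquery or Kolide table or a real check.

```python
import sqlite3

# Hypothetical check: flag devices whose screen lock timeout exceeds 5 minutes.
CHECK_SQL = """
SELECT name, value
FROM device_settings
WHERE name = 'screen_lock_timeout_seconds'
  AND CAST(value AS INTEGER) > 300
"""


def run_check(rows):
    """Load observed settings into a throwaway table and return the failing rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE device_settings (name TEXT, value TEXT)")
        conn.executemany("INSERT INTO device_settings VALUES (?, ?)", rows)
        return conn.execute(CHECK_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    failures = run_check([("screen_lock_timeout_seconds", "900")])
    print("FAIL" if failures else "PASS", failures)
```

In this sketch, a query that returns rows means the device fails. That is a convention chosen for the example, so treat it as an assumption rather than how any particular product defines a failing check.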

Management of Kolide via Terraform / APIs

I would love to see Kolide add Terraform to its infrastructure management methods as environments transition to change management via GitOps rather than ClickOps.

Remote Workers / External Parties

I would like, at some point, to incorporate the use case of external parties downloading this tool to validate that their system is compliant with our requirements before they can access any of our services.

Closing Out

I’ll give an update later in the year on how effective this project was and whether it fulfilled all the goals and wishes we had when incorporating it.