Warden — Self-Healing Kubernetes Security Agent

A self-healing Kubernetes security agent deployed on Azure Kubernetes Service. Warden is a Python FastAPI webhook server that receives Falco runtime security alerts and OPA Gatekeeper admission violations, triages each incident with Claude Sonnet 4.6, auto-patches low-severity threats immediately, and drafts remediation runbooks for high-severity incidents requiring human approval. Two-layer security coverage proven end-to-end on a live AKS cluster for ~$2 in total cloud spend.

Core Technologies

Python, FastAPI, Kubernetes / AKS, Falco, OPA Gatekeeper, Claude Sonnet 4.6, Azure Key Vault, Prometheus, Terraform, Azure DevOps

Architecture Components

  • Python FastAPI webhook server receiving alerts from Falco and OPA Gatekeeper
  • OPA Gatekeeper ConstraintTemplates (Rego) rejecting privileged containers, images from unverified registries, and containers that run as root at admission
  • Falco DaemonSet using eBPF kernel probes for real-time runtime detection of shell spawns, privilege escalation, and suspicious syscalls
  • Falcosidekick forwarding Falco events to the Warden webhook endpoint inside the cluster
  • Claude Sonnet 4.6 for structured alert triage — severity classification and remediation recommendation
  • Azure Key Vault storing the Claude API key, injected at runtime via managed identity — never in source control
  • Prometheus metrics tracking alert volume by severity, auto-patch outcomes, and Claude API latency
  • Terraform provisioning the AKS cluster and Azure supporting resources
  • Azure DevOps pipeline for infrastructure deployment

Problem

Kubernetes clusters generate a constant stream of security events — admission violations, runtime anomalies, syscall alerts — that require expert triage, manual response, and accurate severity judgment. Without automation, security teams are buried in noise, slow to respond to real threats, and unable to scale coverage.

  • Runtime security events require immediate triage, but manual review is slow and error-prone at volume.
  • Admission control and runtime detection are often siloed with no unified response layer connecting them.
  • Low-severity incidents that could be auto-remediated still consume engineer time, reducing capacity for real threats.

Solution

A Python FastAPI webhook server that sits between Kubernetes security tooling and on-call engineers. Warden receives Falco and OPA alerts, classifies severity with Claude Sonnet 4.6, and acts — auto-patching low-severity incidents immediately and surfacing AI-drafted runbooks for high-severity threats that require a human decision.

  • Layer 1 (OPA Gatekeeper): Rego policies reject non-compliant workloads at admission, including privileged containers, images from unverified registries, and containers that run as root.
  • Layer 2 (Falco): eBPF kernel probes detect threats at runtime, such as shell spawns, privilege escalation, and suspicious syscall patterns.
  • Claude Sonnet 4.6 triage: structured severity classification — low severity triggers auto-patch, high severity generates a runbook for human review.
  • Azure Key Vault for secret management: the Claude API key is injected at runtime, never stored in source control or container images.

OPA Gatekeeper blocking a privileged container at admission — Layer 1 enforcement working on AKS.

Outcome

End-to-end Kubernetes security automation proven on a live AKS cluster: Falco detected a shell spawn, Falcosidekick forwarded the alert to Warden, Claude triaged it as low severity, Warden auto-patched and returned HTTP 200 — all within 3 seconds. Total cloud spend: ~$2.

  • Full AKS pipeline proven: shell spawn → Falcosidekick → Claude triage (severity=low) → auto-patch → HTTP 200.
  • Two-layer security coverage: OPA admission control and Falco runtime detection operating independently on the same cluster.
  • Total cloud spend across all sessions: ~$2 — production-grade Kubernetes security at near-zero cost.
  • All AI triage decisions logged with full reasoning chain, original alert payload, and action taken — complete audit trail.

Falco detecting a shell spawn inside a running container — Layer 2 runtime detection triggering the full triage pipeline.

Key Learnings & Decisions

Kubernetes Security

  • OPA Gatekeeper and Falco address different threat surfaces — admission control stops misconfigurations; runtime detection catches adversarial behaviour that only appears after a workload is running.
  • Namespace-scoped exclusions are the right pattern for security tooling that needs elevated permissions — Falco's DaemonSet is privileged, so the falco namespace is excluded from BlockPrivilegedContainers constraints.
  • Parameterised OPA ConstraintTemplates add schema validation complexity that can fail across Kubernetes versions; for a small known-registry list, hardcoded Rego is simpler and more reliable.

AI-Driven Automation

  • Claude Sonnet 4.6 works as a reliable triage component when given structured inputs and constrained output formats — it classifies alerts consistently with defined severity rubrics.
  • Human approval gates for high-severity actions are non-negotiable; auto-remediation without bounds is a liability, not a feature.
  • Full logging of AI reasoning with the original alert payload enables post-incident review and builds trust in automated decisions over time.
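
To illustrate what "constrained output formats" can mean in practice, here is a minimal sketch of validating the model's reply before acting on it. It assumes the model is asked to return a JSON object with `severity`, `reasoning`, and `recommendation` keys; those key names, the `TriageResult` type, and the `parse_triage` helper are hypothetical and not taken from the Warden code. Anything that cannot be parsed is treated as high severity so that it reaches a human.

```python
import json
from dataclasses import dataclass

# Assumed rubric: the write-up only names "low" and "high".
ALLOWED_SEVERITIES = {"low", "high"}


@dataclass(frozen=True)
class TriageResult:
    severity: str
    reasoning: str
    recommendation: str


def parse_triage(raw: str) -> TriageResult:
    """Parse the model's reply; fall back to 'high' if it is malformed."""
    try:
        data = json.loads(raw)
        severity = str(data["severity"]).strip().lower()
        if severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unexpected severity: {severity!r}")
        return TriageResult(
            severity=severity,
            reasoning=str(data.get("reasoning", "")),
            recommendation=str(data.get("recommendation", "")),
        )
    except (KeyError, TypeError, ValueError) as exc:
        # Fail closed: an unreadable answer must never trigger an auto-patch.
        # json.JSONDecodeError is a subclass of ValueError, so it lands here too.
        return TriageResult(
            severity="high",
            reasoning=f"unparseable triage output: {exc}",
            recommendation="manual review",
        )
```

Defaulting to the human-approval path keeps a formatting mistake by the model from being interpreted as permission to remediate automatically.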

Secrets & Operations

  • Secrets injected from external vaults can be silently corrupted — validate API key format and length at startup before accepting traffic, not reactively after the first 401.
  • The ConfigMap-mounted code pattern enables rapid iteration on cluster-internal tooling without custom image builds — acceptable for development, not for production.
  • WSL2 eBPF limitations mean local Falco detection is not possible; design the dev loop around real cluster validation for eBPF-dependent tooling from the start.
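
The startup check on injected secrets could look roughly like the sketch below. The prefix, the minimum length, and the `validate_api_key` name are assumptions chosen for illustration; the real thresholds should come from the key format actually issued. The check never echoes the key itself in an error message.

```python
from typing import Optional

# Both values are illustrative assumptions, not documented constants.
EXPECTED_PREFIX = "sk-ant-"
MIN_KEY_LENGTH = 40


def validate_api_key(key: Optional[str]) -> str:
    """Return the key unchanged if it looks intact, otherwise raise."""
    if not key:
        raise RuntimeError("API key is missing or empty")
    if any(ch.isspace() for ch in key):
        # A trailing newline is a common symptom of a mangled secret.
        raise RuntimeError("API key contains whitespace")
    if not key.startswith(EXPECTED_PREFIX):
        raise RuntimeError("API key does not have the expected prefix")
    if len(key) < MIN_KEY_LENGTH:
        raise RuntimeError(f"API key is too short ({len(key)} characters)")
    return key


# At startup, before the server begins accepting traffic
# (the environment variable name here is hypothetical):
#   api_key = validate_api_key(os.environ.get("CLAUDE_API_KEY"))
```

Raising during startup means the pod fails its first health check with a clear message instead of serving requests that all end in a 401.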

Implementation Milestones

A breakdown of the key tasks and milestones that brought this project to life.

AKS Cluster Provisioning

Complete

AKS cluster and supporting Azure resources provisioned with Terraform. Azure Key Vault configured for runtime secret injection.

Key Tasks Completed

  • Terraform AKS + Azure Resources

    Cluster provisioned, Key Vault wired up, and Azure DevOps pipeline configured. Infrastructure ready for Warden deployment.

  • Azure Key Vault Integration

    Claude API key stored in Key Vault and injected at runtime. Startup validation added to catch corrupted secrets before traffic is accepted.

OPA Gatekeeper Policies

Complete

Admission control layer implemented with Rego ConstraintTemplates. Namespace exclusions configured for Falco's privileged DaemonSet.

Key Tasks Completed

  • BlockPrivilegedContainers ConstraintTemplate

    Rego policy blocks privileged containers cluster-wide. Falco namespace excluded to allow Falco's own DaemonSet to run.

  • AllowedImageRegistries ConstraintTemplate

    Registries hardcoded in Rego after parameterised schema validation errors on AKS. Simpler and more reliable.

Falco Runtime Detection

Complete

Falco DaemonSet deployed on AKS with eBPF probes active. Falcosidekick configured to forward events to the Warden webhook endpoint.

Key Tasks Completed

  • Falco DaemonSet on AKS

    Falco running on every node with eBPF probes. WSL2 eBPF limitations documented — full detection validated on-cluster only.

  • Falcosidekick Webhook Forwarding

    Falcosidekick configured to POST structured Falco events to the Warden webhook endpoint inside the cluster.
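
As a sketch of what the receiving side might do with such an event, the snippet below flattens a Falco-style JSON payload into a small record for triage. The field names (`rule`, `priority`, `output`, `output_fields`, `k8s.ns.name`, `k8s.pod.name`) follow Falco's JSON output as I understand it and should be checked against the Falco and Falcosidekick versions in use; the `Alert` type and `alert_from_falco` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Alert:
    source: str
    rule: str
    priority: str
    namespace: str
    pod: str
    message: str


def alert_from_falco(event: Mapping[str, Any]) -> Alert:
    """Normalise a Falco-style event, tolerating missing fields."""
    fields = event.get("output_fields") or {}
    return Alert(
        source="falco",
        rule=str(event.get("rule", "unknown")),
        priority=str(event.get("priority", "unknown")).lower(),
        namespace=str(fields.get("k8s.ns.name") or "unknown"),
        pod=str(fields.get("k8s.pod.name") or "unknown"),
        message=str(event.get("output", "")),
    )


# Example event, hand-written for illustration rather than captured from a cluster.
sample = {
    "rule": "Terminal shell in container",
    "priority": "Notice",
    "output": "A shell was spawned in a container",
    "output_fields": {"k8s.ns.name": "default", "k8s.pod.name": "demo-pod"},
}
alert = alert_from_falco(sample)
```

Normalising early means the triage prompt and the audit log both see one consistent shape, regardless of which tool raised the alert.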

Warden Webhook Server

Complete

FastAPI webhook server deployed via ConfigMap-mounted code. Claude Sonnet 4.6 triage integrated with auto-patch and runbook generation.

Key Tasks Completed

  • FastAPI Webhook Endpoint

    Webhook server receives Falco and OPA alert payloads and routes to the triage pipeline.

  • Claude Sonnet Triage Integration

    Structured triage with severity classification. Low severity triggers auto-patch; high severity generates a runbook for human review.

  • ConfigMap-Mounted Agent Pattern

    Agent code mounted as a Kubernetes ConfigMap and executed via a base Python image — no custom image build required for development iterations.

End-to-End Pipeline Proven

Complete

Full AKS pipeline validated: Falco detected a shell spawn, Falcosidekick forwarded to Warden, Claude triaged severity=low, auto-patch applied, HTTP 200 returned.

Key Tasks Completed

  • AKS End-to-End Demo

    Shell spawn in a container triggered Falco. Alert forwarded via Falcosidekick. Claude returned severity=low. Warden auto-patched and returned HTTP 200. Pipeline complete.

Case Study Published

In Progress

Portfolio case study written; publication to adventuringghost.com is in progress.

Key Tasks Completed

  • Portfolio Entry

    Case study content complete. Publishing to portfolio site now.

Monitoring & Analysis

Prometheus Metrics

Custom metrics track alert volume by severity, auto-patch success and failure rate, Claude API response latency, and runbook generation count. Scraped by the cluster's Prometheus instance and queryable for post-incident analysis.
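
To make the shape of a per-severity counter concrete, here is a dependency-free stand-in that renders the Prometheus text exposition format. In the real service a library such as `prometheus_client` (its `Counter` with labels) would normally do this; the `LabelledCounter` class and the metric name `warden_alerts_total` are hypothetical.

```python
from collections import Counter


class LabelledCounter:
    """Tiny single-label counter that renders Prometheus text format."""

    def __init__(self, name: str, help_text: str, label: str) -> None:
        self.name = name
        self.help_text = help_text
        self.label = label
        self._values: Counter = Counter()

    def inc(self, label_value: str, amount: int = 1) -> None:
        self._values[label_value] += amount

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        # Sort label values so the output is stable between scrapes.
        for value, count in sorted(self._values.items()):
            lines.append(f'{self.name}{{{self.label}="{value}"}} {count}')
        return "\n".join(lines) + "\n"


alerts_total = LabelledCounter(
    "warden_alerts_total", "Alerts received by severity", "severity"
)
alerts_total.inc("low")
alerts_total.inc("low")
alerts_total.inc("high")
exposition = alerts_total.render()
```

With a counter like this, a PromQL query such as `sum by (severity) (rate(warden_alerts_total[5m]))` would give alert volume per severity over time.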

Triage Audit Log

Every AI triage decision is logged with the original alert payload, Claude's severity classification and reasoning, the action taken, and a timestamp, giving a complete audit trail for post-incident review and for building trust in automated decisions.
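
One simple way to produce such an entry is a single JSON line per decision, sketched below. The field names and the `audit_record` helper are assumptions for illustration, not Warden's actual log schema.

```python
import json
from datetime import datetime, timezone
from typing import Any, Mapping


def audit_record(
    alert: Mapping[str, Any],
    severity: str,
    action: str,
    reasoning: str = "",
) -> str:
    """Serialise one triage decision as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "action": action,
        "reasoning": reasoning,
        "alert": dict(alert),  # the original payload, kept verbatim
    }
    return json.dumps(record, sort_keys=True)


line = audit_record(
    {"rule": "Terminal shell in container"},
    severity="low",
    action="auto_patched",
    reasoning="interactive shell in a non-production pod",
)
```

One JSON object per line keeps the log easy to ship to any collector and easy to filter by severity or action afterwards.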

Warden webhook handler — receive alert, triage with Claude, act on severity

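
A minimal, framework-free sketch of the flow named in the caption above, not the project's actual handler. The `triage`, `auto_patch`, and `draft_runbook` callables are hypothetical and injected so the routing logic can be read and tested without FastAPI, the Claude API, or a cluster.

```python
from typing import Any, Callable, Dict, Mapping

Payload = Mapping[str, Any]


def handle_alert(
    payload: Payload,
    triage: Callable[[Payload], str],
    auto_patch: Callable[[Payload], None],
    draft_runbook: Callable[[Payload], str],
) -> Dict[str, Any]:
    """Receive an alert, classify it, and act on the severity."""
    severity = triage(payload)
    if severity == "low":
        auto_patch(payload)
        return {"severity": "low", "action": "auto_patched"}
    # Anything that is not explicitly low goes to a human with a runbook.
    runbook = draft_runbook(payload)
    return {
        "severity": severity,
        "action": "awaiting_approval",
        "runbook": runbook,
    }


# Stub collaborators, for illustration only.
patched = []
low_result = handle_alert(
    {"rule": "Terminal shell in container"},
    triage=lambda p: "low",
    auto_patch=patched.append,
    draft_runbook=lambda p: "unused",
)
high_result = handle_alert(
    {"rule": "Privilege escalation"},
    triage=lambda p: "high",
    auto_patch=patched.append,
    draft_runbook=lambda p: "1. Isolate the pod. 2. Page on-call.",
)
```

In a FastAPI service, a POST route would parse the request body and pass it to a function like this, returning the resulting dictionary as the JSON response.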