GitLab as an Operating System for AI-Assisted Infrastructure

Over the last few months, I’ve been experimenting with a different way of operating infrastructure.

Not simply using AI for code generation or autocomplete, but embedding AI agents directly into a controlled infrastructure delivery workflow: GitLab issues, merge requests, CI pipelines, SSH diagnostics, validation gates and real production systems.

The interesting part is not the AI itself.

The interesting part is what changes once agents become operational participants in real infrastructure work.

The Core Idea

Most current discussions around “AI agents” focus on writing code faster.

But operational infrastructure engineering is fundamentally different from isolated code generation.

Infrastructure has state. It has dependencies. It has drift. It has partial failures. It has hidden assumptions. And most importantly: it has consequences.

DNS failures break authentication. DHCP inconsistencies break provisioning. Unsafe Terraform plans destroy VMs. Bad network changes lock you out of switches.

The real challenge is therefore not: “How do I make AI generate YAML?”

The real challenge is: “How do I create a delivery model in which AI agents can safely participate in operational infrastructure work?”

That question became the foundation of my infrastructure repository.

The Environment

The setup started as a private RKE2-based platform environment, but evolved into something much larger.

Today the repository manages substantial parts of my real infrastructure stack:

  • RKE2 Kubernetes clusters
  • libvirt/KVM virtual machines
  • Terraform-based VM provisioning
  • Ansible-based OS and service convergence
  • NetBox as infrastructure source of truth
  • FreeIPA and Samba AD
  • Kea DHCP and DNS integration
  • OPNsense and Juniper network devices
  • GitLab CI/CD pipelines
  • Backup orchestration
  • Nextcloud, PostgreSQL, Redis, Pi-hole and other services

This is not a demo environment.

The repository represents the operational state of a live system with real dependencies, real failure modes and real operational constraints.

That distinction matters.

Because operational engineering is where many simplistic AI automation ideas begin to collapse.

AI Agents as Operational Engineering Partners

The most important insight from this project is:

AI agents are not particularly interesting as autocomplete systems.

They become interesting once they can participate in operational workflows with constraints, verification and accountability.

The workflow in my setup typically looks like this:

[Diagram 1]

The process does not start with files.

It starts with intent.

GitLab issues define:

  • the operational context
  • target state
  • constraints
  • dependencies
  • risk boundaries
  • acceptance criteria

The AI agent then operates within those boundaries.
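
To make the idea of "intent before files" more tangible, here is a minimal sketch of how the sections of such an issue could be modelled and checked before an agent starts work. It is an illustration only: `ChangeIntent`, `missing_sections` and `ready_for_agent` are hypothetical names, not part of GitLab or of the repository described here, and the rule that only the dependencies section may be empty is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeIntent:
    """Hypothetical structured view of one issue (not a GitLab API type)."""

    operational_context: str
    target_state: str
    constraints: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)
    risk_boundaries: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of all sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


def ready_for_agent(intent: ChangeIntent) -> bool:
    """Assumed rule: every section except 'dependencies' must be filled in."""
    return not (set(intent.missing_sections()) - {"dependencies"})


intent = ChangeIntent(
    operational_context="DMZ Nextcloud host",
    target_state="Updates leave the application usable",
    constraints=["No downtime outside the maintenance window"],
    risk_boundaries=["No changes to the database host"],
    acceptance_criteria=["Scheduled update pipeline is green"],
)
print(ready_for_agent(intent))  # dependencies are empty, which is allowed here
```

The point of the sketch is the ordering: an issue with no stated risk boundaries or acceptance criteria is rejected before any file is touched.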

This changes the role of the engineer significantly.

My work shifted away from manually writing individual YAML or Terraform lines and toward:

  • defining target states
  • designing operational workflows
  • identifying risk boundaries
  • reviewing implementation strategies
  • validating production behavior
  • and building systems in which agents can operate safely

That shift feels far more important than code generation itself.

The Merge Request as an Engineering Artifact

One of the biggest mistakes in current AI tooling discussions is reducing engineering work to “generated code”.

Operational engineering is not the generated artifact.

It is the decision process around change.

In my setup, the merge request became the primary operational artifact.

A merge request contains:

  • implementation changes
  • validation output
  • pipeline evidence
  • rollback considerations
  • verification commands
  • operational notes
  • and failure analysis

The important part is not that AI wrote something.

The important part is whether the change became understandable, reviewable and operationally trustworthy.

That distinction is critical.

GitLab as the Control Plane

GitLab gradually evolved into far more than source control.

It effectively became the operating system for infrastructure change.

  • Issues define work
  • Branches isolate work
  • Merge requests document work
  • Pipelines validate work
  • Registries distribute work
  • Terraform state lives inside GitLab
  • Operational history becomes queryable

This creates a continuous operational audit trail:

  • Why was something changed?
  • What problem triggered it?
  • Which validations ran?
  • What failed?
  • How was the issue corrected?
  • What permanent guardrails were added afterward?

That last point turned out to be especially important.

When a pipeline fails, the failure is not merely an incident.

It becomes input for the system itself.

The agent investigates the issue, reproduces it, validates a fix and then permanently encodes the correction into the repository or pipeline logic.

That feedback loop is where the system starts becoming genuinely interesting.

Source of Truth Before Automation

One of the most important architectural decisions was strict layering.

[Diagram 2]

NetBox owns infrastructure identity:

  • hosts
  • interfaces
  • IP addresses
  • networks
  • gateways
  • device metadata

Terraform owns infrastructure provisioning and VM lifecycle management.

Ansible owns operating system and service convergence.

GitLab pipelines enforce the ordering between these layers.

This prevents one of the most common infrastructure failure patterns: configuration truth becoming fragmented across notes, DNS entries, inventories, Terraform variables and manually modified systems.

The pipeline enforces consistency boundaries early.

If the source of truth is inconsistent, the workflow stops before destructive actions occur.
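
A consistency gate of this kind can be quite small. The following sketch assumes that host records have already been exported from NetBox into plain dictionaries; the record shape, the function name `find_inconsistencies` and the two checks (address outside every known prefix, address assigned twice) are assumptions chosen for illustration, not the checks the actual pipeline runs.

```python
import ipaddress


def find_inconsistencies(hosts, prefixes):
    """Return human-readable problems; an empty list means the check passes.

    hosts:    list of {"name": str, "ip": str} records (assumed export format)
    prefixes: list of CIDR strings that are allowed to contain hosts
    """
    networks = [ipaddress.ip_network(p) for p in prefixes]
    problems = []
    seen = {}  # ip -> name of the first host that claimed it
    for host in hosts:
        ip = ipaddress.ip_address(host["ip"])
        if not any(ip in net for net in networks):
            problems.append(f"{host['name']}: {ip} is outside all known prefixes")
        if ip in seen:
            problems.append(f"{host['name']}: {ip} already assigned to {seen[ip]}")
        else:
            seen[ip] = host["name"]
    return problems


hosts = [
    {"name": "web1", "ip": "10.0.0.5"},
    {"name": "web2", "ip": "10.0.0.5"},
    {"name": "db1", "ip": "192.168.1.1"},
]
for problem in find_inconsistencies(hosts, ["10.0.0.0/24"]):
    print(problem)
```

In a CI job, a non-empty result would translate into a non-zero exit code, so that provisioning and convergence stages never start from an inconsistent source of truth.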

Controlled Agentic Workflows

One important lesson is that operational agents require friction.

Fully autonomous infrastructure agents are dangerous.

The workflow itself must constrain them.

For example:

  • destructive Terraform operations are gated
  • network devices use separate apply paths
  • dry-run stages exist before production apply
  • merge approvals are mandatory
  • validation pipelines execute before deployment
  • rollback paths are documented
  • agents are expected to test commands manually before codifying them
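
As one example of such friction, a gate for destructive Terraform operations could inspect the machine-readable plan produced by `terraform show -json` and refuse to continue when any resource would be deleted or replaced. This is a sketch under assumptions: the function names are hypothetical, and how approval is signalled (here a plain boolean) depends on the pipeline and is not taken from the original setup.

```python
import json


def destructive_changes(plan_json: str) -> list[str]:
    """Return addresses of resources the plan would delete or replace.

    In Terraform's JSON plan representation, each entry in
    'resource_changes' carries a list of actions; a replacement shows up
    as a combination of "delete" and "create".
    """
    plan = json.loads(plan_json)
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc["change"]["actions"]
    ]


def gate(plan_json: str, approved: bool) -> bool:
    """Allow apply only if the plan is non-destructive or explicitly approved."""
    return approved or not destructive_changes(plan_json)


example_plan = json.dumps(
    {
        "resource_changes": [
            {"address": "libvirt_domain.vm1", "change": {"actions": ["delete", "create"]}},
            {"address": "libvirt_domain.vm2", "change": {"actions": ["update"]}},
        ]
    }
)
print(destructive_changes(example_plan))
print(gate(example_plan, approved=False))
```

The design choice is that the agent can always produce a plan, but the decision to destroy something stays with a human.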

The goal is not removing humans.

The goal is creating operationally reliable human-agent collaboration.

That distinction is essential.

A Concrete Example: Nextcloud Updates

One concrete example was a failed scheduled update on my DMZ Nextcloud host.

The recurring operational problem was not the package update itself. The problem was what happened afterward: whenever Arch updated Nextcloud itself or one of the packaged Nextcloud apps, the application could be left in a state where it required an explicit occ upgrade before it was fully usable again.

The agent reviewed the GitLab pipeline artifacts, connected to the host over SSH, inspected the live Nextcloud state and verified the required recovery path directly on the machine. The important detail was that the fix was not simply “run this command once”. The operational lesson was that the update workflow itself needed to know about Nextcloud’s post-update contract.

The follow-up merge request encoded that lesson into the repository. Nextcloud post-update handling now invokes OCC through the correct Arch/Nextcloud PHP configuration, prepares the required services before OCC runs and installs a local pacman hook for future Nextcloud core and packaged app upgrades.

That is the feedback loop I care about: a failed scheduled update became a permanent operational guardrail. The next update should not leave the application waiting for a manual recovery step; the recovery behavior became part of the managed infrastructure model.
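
The core of such a post-update step can be sketched in a few lines. The sketch below is not the hook from the repository. It assumes that `occ status --output=json` reports a `needsDbUpgrade` field, and it leaves the actual OCC invocation (PHP binary, configuration, user) to the caller as the hypothetical `occ` argument, because those details are specific to the host.

```python
import json
import subprocess


def needs_upgrade(status_json: str) -> bool:
    """Decide from the JSON status output whether an upgrade is pending."""
    status = json.loads(status_json)
    return bool(status.get("needsDbUpgrade", False))


def post_update(occ: list[str]) -> bool:
    """Run the upgrade only when the status says it is required.

    occ: command prefix used to call OCC on this host (supplied by the
         caller; deliberately not hard-coded here).
    Returns True if an upgrade was executed.
    """
    result = subprocess.run(
        [*occ, "status", "--output=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    if not needs_upgrade(result.stdout):
        return False
    subprocess.run([*occ, "upgrade"], check=True)
    return True
```

Keeping the decision (`needs_upgrade`) separate from the execution (`post_update`) makes the logic testable without a live Nextcloud instance, which matters when an agent is expected to verify a fix before encoding it.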

Operational Reality Beats Clean Architecture

One thing I increasingly realized is that infrastructure engineering rarely happens on a clean greenfield.

Most real environments are messy:

  • partially documented
  • historically inconsistent
  • operationally sensitive
  • full of exceptions
  • and constrained by uptime requirements

A large part of the project therefore became:

  • bringing existing systems under management without introducing drift
  • incrementally improving consistency
  • introducing automation gradually
  • and preserving operational safety while increasing control

That is also where AI agents become genuinely useful.

Not because they magically solve infrastructure problems.

But because they can investigate, test, iterate and document changes far faster once they operate inside a structured engineering process.

Why This Matters for Forward-Deployed Engineering

Forward-deployed engineering is rarely about implementing idealized systems from scratch.

More often it is about:

  • understanding incomplete environments
  • navigating operational constraints
  • translating vague requirements into reliable systems
  • making risks visible
  • and building workflows that survive real production behavior

Those are exactly the skills this project continuously trains.

The most important outcome was not the infrastructure itself.

The most important outcome was discovering that AI agents become far more capable once they are embedded into operational systems with:

  • governance
  • verification
  • auditability
  • rollback boundaries
  • and clear engineering ownership

The future of infrastructure engineering is probably not fully autonomous systems.

It is operational environments in which humans define intent, constraints and risk boundaries, while agents execute, investigate and continuously improve the surrounding delivery process.

That is the model I increasingly believe in.

Note: I also wrote a more detailed German companion article that goes deeper into the concrete GitLab, Terraform, Ansible and pipeline setup behind this project: GitLab als Betriebssystem meiner Infrastruktur.
