GitLab as an Operating System for AI-Assisted Infrastructure

Over the last few months, I’ve been experimenting with a different way of operating infrastructure.

Not simply using AI for code generation or autocomplete, but embedding AI agents directly into a controlled infrastructure delivery workflow: GitLab issues, merge requests, CI pipelines, SSH diagnostics, validation gates and real production systems.

The interesting part is not the AI itself.

The interesting part is what changes once agents become operational participants in real infrastructure work.

The Core Idea

Most current discussions around “AI agents” focus on writing code faster.

But operational infrastructure engineering is fundamentally different from isolated code generation.

Infrastructure has state. It has dependencies. It has drift. It has partial failures. It has hidden assumptions. And most importantly: it has consequences.

DNS failures break authentication. DHCP inconsistencies break provisioning. Unsafe Terraform plans destroy VMs. Bad network changes lock you out of switches.

The real challenge is therefore not: “How do I make AI generate YAML?”

The real challenge is: “How do I create a delivery model in which AI agents can safely participate in operational infrastructure work?”

That question became the foundation of my infrastructure repository.

The Environment

The setup started as a private RKE2-based platform environment, but evolved into something much larger.

Today the repository manages substantial parts of my real infrastructure stack:

RKE2 Kubernetes clusters
libvirt/KVM virtual machines
Terraform-based VM provisioning
Ansible-based OS and service convergence
NetBox as infrastructure source of truth
FreeIPA and Samba AD
Kea DHCP and DNS integration
OPNsense and Juniper network devices
GitLab CI/CD pipelines
Backup orchestration
Nextcloud, PostgreSQL, Redis, Pi-hole and other services

This is not a demo environment.

The repository represents the operational state of a live system with real dependencies, real failure modes and real operational constraints.

That distinction matters.

Because operational engineering is where many simplistic AI automation ideas begin to collapse.

AI Agents as Operational Engineering Partners

The most important insight from this project is:

AI agents are not particularly interesting as autocomplete systems.

They become interesting once they can participate in operational workflows with constraints, verification and accountability.

The workflow in my setup typically looks like this:

[Diagram 1: workflow overview]

The process does not start with files.

It starts with intent.

GitLab issues define:

the operational context
target state
constraints
dependencies
risk boundaries
acceptance criteria

The AI agent then operates within those boundaries.
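To make the idea of intent-first work concrete, the issue fields listed above can be modeled as a small data structure that an agent checks before it touches anything. This is a minimal sketch, not the repository's actual format: the class name `ChangeIntent`, its field names and the choice of which fields are mandatory are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeIntent:
    """Hypothetical structured view of what a GitLab issue defines."""

    context: str
    target_state: str
    constraints: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)
    risk_boundaries: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the fields that are still empty.

        Assumption: context, target state, risk boundaries and
        acceptance criteria are treated as mandatory; constraints and
        dependencies may legitimately be empty.
        """
        missing = []
        if not self.context.strip():
            missing.append("context")
        if not self.target_state.strip():
            missing.append("target_state")
        if not self.risk_boundaries:
            missing.append("risk_boundaries")
        if not self.acceptance_criteria:
            missing.append("acceptance_criteria")
        return missing
```

An agent following this sketch would refuse to start work while `missing_fields()` returns anything, and would ask for the gaps to be filled in the issue instead.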

This changes the role of the engineer significantly.

My work shifted away from manually writing individual YAML or Terraform lines and toward:

defining target states
designing operational workflows
identifying risk boundaries
reviewing implementation strategies
validating production behavior
and building systems in which agents can operate safely

That shift feels far more important than code generation itself.

The Merge Request as an Engineering Artifact

One of the biggest mistakes in current AI tooling discussions is reducing engineering work to “generated code”.

Operational engineering is not the generated artifact.

It is the decision process around change.

In my setup, the merge request became the primary operational artifact.

A merge request contains:

implementation changes
validation output
pipeline evidence
rollback considerations
verification commands
operational notes
and failure analysis
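The list above can also be checked mechanically, for example in a pipeline job that refuses to proceed while a merge request description is incomplete. The following is a minimal sketch under stated assumptions: the section titles are taken from the list, but matching them as plain case-insensitive text is an illustrative choice, and the function name `missing_sections` is hypothetical.

```python
# Section titles an operational merge request is expected to contain.
# Assumption: each section appears verbatim somewhere in the description.
REQUIRED_SECTIONS = (
    "Implementation changes",
    "Validation output",
    "Pipeline evidence",
    "Rollback considerations",
    "Verification commands",
    "Operational notes",
    "Failure analysis",
)


def missing_sections(description: str) -> list[str]:
    """Return the required section titles not found in the description."""
    lowered = description.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]
```

A check like this only proves that the sections exist, not that their content is meaningful; judging that remains part of the review.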

The important part is not that AI wrote something.

The important part is whether the change became understandable, reviewable and operationally trustworthy.

That distinction is critical.

GitLab as the Control Plane

GitLab gradually evolved into far more than source control.

It effectively became the operating system for infrastructure change.

Issues define work
Branches isolate work
Merge requests document work
Pipelines validate work
Registries distribute work
Terraform state lives inside GitLab
Operational history becomes queryable

This creates a continuous operational audit trail:

Why was something changed?
What problem triggered it?
Which validations ran?
What failed?
How was the issue corrected?
What permanent guardrails were added afterward?

That last point turned out to be especially important.

When a pipeline fails, the failure is not merely an incident.

It becomes input for the system itself.

The agent investigates the issue, reproduces it, validates a fix and then permanently encodes the correction into the repository or pipeline logic.

That feedback loop is where the system starts becoming genuinely interesting.

Source of Truth Before Automation

One of the most important architectural decisions was strict layering.

[Diagram 2: layering overview]

NetBox owns infrastructure identity:

hosts
interfaces
IP addresses
networks
gateways
device metadata

Terraform owns infrastructure provisioning and VM lifecycle management.

Ansible owns operating system and service convergence.

GitLab pipelines enforce the ordering between these layers.

This prevents one of the most common infrastructure failure patterns: configuration truth becoming fragmented across notes, DNS entries, inventories, Terraform variables and manually modified systems.

The pipeline enforces consistency boundaries early.

If the source of truth is inconsistent, the workflow stops before destructive actions occur.
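A consistency gate of this kind can be sketched as a small check that runs before any provisioning stage. The example below assumes host records have already been exported from the source of truth as plain dictionaries; the keys `name` and `ip` and both function names are hypothetical and do not reflect the NetBox API or the actual pipeline.

```python
from collections import Counter


def find_inconsistencies(hosts: list[dict]) -> list[str]:
    """Report hosts without an IP and IPs assigned more than once."""
    problems = []
    ip_counts = Counter(h["ip"] for h in hosts if h.get("ip"))
    for host in hosts:
        name = host.get("name", "<unnamed>")
        ip = host.get("ip")
        if not ip:
            problems.append(f"{name}: no IP address assigned")
        elif ip_counts[ip] > 1:
            problems.append(f"{name}: IP {ip} is assigned more than once")
    return problems


def gate_exit_code(hosts: list[dict]) -> int:
    """Return 1 if the data is inconsistent, 0 otherwise.

    A CI job exiting with a non-zero code stops the pipeline, so later
    provisioning stages never run against inconsistent data.
    """
    problems = find_inconsistencies(hosts)
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

The point of the design is ordering: the check is cheap and read-only, so it belongs in front of every stage that can change or destroy something.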

Controlled Agentic Workflows

One important lesson is that operational agents require friction.

Fully autonomous infrastructure agents are dangerous.

The workflow itself must constrain them.

For example:

destructive Terraform operations are gated
network devices use separate apply paths
dry-run stages exist before production apply
merge approvals are mandatory
validation pipelines execute before deployment
rollback paths are documented
agents are expected to test commands manually before codifying them
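As an illustration of the first item, a gate for destructive Terraform operations can inspect the machine-readable plan before anything is applied. This sketch assumes the plan has been exported with `terraform show -json`, whose output lists each planned change under `resource_changes` with an `actions` array. The allow-list parameter `approved_destroys` and both function names are hypothetical, not the mechanism used in the repository.

```python
import json


def destructive_changes(plan_json: str) -> list[str]:
    """Return addresses of resources the plan would delete or replace.

    A replacement shows up as ["delete", "create"] or
    ["create", "delete"], so checking for "delete" covers both cases.
    """
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(change.get("address", "<unknown>"))
    return flagged


def plan_is_allowed(plan_json: str, approved_destroys=frozenset()) -> bool:
    """Allow the plan only if every destructive change was approved."""
    return all(
        address in approved_destroys
        for address in destructive_changes(plan_json)
    )
```

In a setup like this, the approved addresses would come from a human decision recorded outside the agent's control, for example in the merge request, so that the agent cannot approve its own destructive change.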

The goal is not removing humans.

The goal is creating operationally reliable human-agent collaboration.

That distinction is essential.

A Concrete Example: Nextcloud Updates

One concrete example was a failed scheduled update on my DMZ Nextcloud host.

The recurring operational problem was not the package update itself. The problem was what happened afterward: whenever Arch updated Nextcloud itself or one of the packaged Nextcloud apps, the application could be left in a state where it required an explicit occ upgrade before it was fully usable again.

The agent reviewed the GitLab pipeline artifacts, connected to the host over SSH, inspected the live Nextcloud state and verified the required recovery path directly on the machine. The important detail was that the fix was not simply “run this command once”. The operational lesson was that the update workflow itself needed to know about Nextcloud’s post-update contract.

The follow-up merge request encoded that lesson into the repository. Nextcloud post-update handling now invokes OCC through the correct Arch/Nextcloud PHP configuration, prepares the required services before OCC runs and installs a local pacman hook for future Nextcloud core and packaged app upgrades.
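The decision at the core of such post-update handling can be sketched in a few lines. The example assumes the status has been captured beforehand with `occ status --output=json` and relies on the `installed` and `needsDbUpgrade` fields of that output. The function names, the `php` default and the way the occ path is passed in are hypothetical; the actual implementation, including the Arch-specific PHP configuration and the pacman hook, is not shown in this article.

```python
import json


def needs_occ_upgrade(status_json: str) -> bool:
    """Return True if an installed instance reports a pending upgrade."""
    status = json.loads(status_json)
    return bool(status.get("installed")) and bool(status.get("needsDbUpgrade"))


def occ_command(occ_path: str, *args: str, php: str = "php") -> list[str]:
    """Build an occ invocation as an argument list (nothing is executed)."""
    return [php, occ_path, *args]


def post_update_commands(status_json: str, occ_path: str) -> list[list[str]]:
    """Return the commands to run after a package update, if any."""
    if needs_occ_upgrade(status_json):
        return [occ_command(occ_path, "upgrade")]
    return []
```

Returning argument lists instead of executing them keeps the decision testable in a pipeline; the caller decides under which user and PHP configuration the commands actually run.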

That is the feedback loop I care about: a failed scheduled update became a permanent operational guardrail. The next update should not leave the application waiting for a manual recovery step; the recovery behavior became part of the managed infrastructure model.

Operational Reality Beats Clean Architecture

I increasingly realized that infrastructure engineering rarely happens on a clean greenfield.

Most real environments are messy:

partially documented
historically inconsistent
operationally sensitive
full of exceptions
and constrained by uptime requirements

A large part of the project therefore became:

adopting existing systems without drift
incrementally improving consistency
introducing automation gradually
and preserving operational safety while increasing control

That is also where AI agents become genuinely useful.

Not because they magically solve infrastructure problems.

But because they can investigate, test, iterate and document changes far faster once they operate inside a structured engineering process.

Why This Matters for Forward-Deployed Engineering

Forward-deployed engineering is rarely about implementing idealized systems from scratch.

More often it is about:

understanding incomplete environments
navigating operational constraints
translating vague requirements into reliable systems
making risks visible
and building workflows that survive real production behavior

Those are exactly the skills this project continuously trains.

The most important outcome was not the infrastructure itself.

The most important outcome was discovering that AI agents become far more capable once they are embedded into operational systems with:

governance
verification
auditability
rollback boundaries
and clear engineering ownership

The future of infrastructure engineering is probably not fully autonomous systems.

It is operational environments in which humans define intent, constraints and risk boundaries — while agents execute, investigate and continuously improve the surrounding delivery process.

That is the model I increasingly believe in.

Note: I also wrote a more detailed German companion article that goes deeper into the concrete GitLab, Terraform, Ansible and pipeline setup behind this project: GitLab als Betriebssystem meiner Infrastruktur.