When I built the AI-powered Regression Analyzer at Tietoevry, the goal was simple: stop engineers from spending 15–45 minutes manually triaging each test failure in our 5G Core regression cycles.

The result was a multi-agent pipeline that does it automatically. Here’s what I learned.

The Problem with Single-Agent Approaches

A single LLM call asking “why did this test fail?” works for toy examples. In production, it falls apart because:

  • Context windows fill up fast when you have hundreds of test cases
  • Sequential processing is too slow for regression cycles
  • A single agent can’t hold the full picture needed to cross-correlate failures

The answer is parallel specialised agents.

Parallel Agents > One Big Prompt

Instead of one agent trying to do everything, I designed a pipeline where:

  1. A coordinator ingests the failing test list and spawns agents
  2. Each analysis agent gets exactly one failure — its logs, config, and relevant spec context
  3. Agents run in parallel, each producing a structured triage report
  4. An aggregation agent cross-correlates findings and identifies patterns

This kept each agent’s context focused, made the pipeline horizontally scalable, and reduced total wall-clock time dramatically.
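The fan-out/fan-in shape above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: `analyze_failure` is a stub standing in for the per-failure LLM agent, and the test data is invented.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def analyze_failure(failure):
    """Stand-in for one analysis agent. In the real pipeline this is an
    LLM call scoped to a single failure's logs, config, and spec context."""
    # Toy heuristic so the sketch runs without an LLM backend.
    category = "configuration_mismatch" if "config" in failure["log"] else "unknown"
    return {"test_id": failure["test_id"], "failure_category": category}

def run_pipeline(failures, max_workers=8):
    # Coordinator: fan out one agent per failure, run them in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        reports = list(pool.map(analyze_failure, failures))
    # Aggregation step: cross-correlate the structured reports.
    patterns = Counter(r["failure_category"] for r in reports)
    return reports, patterns

failures = [
    {"test_id": "TC_001", "log": "config value mismatch in UPF profile"},
    {"test_id": "TC_002", "log": "timeout waiting for PFCP response"},
]
reports, patterns = run_pipeline(failures)
```

Because each agent sees only its own failure, the worker pool can grow with the failure count without any single context window growing with it.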

Structured Output Is Non-Negotiable

In engineering automation, you can’t parse free-text agent output reliably. Every agent in the pipeline outputs structured JSON:

{
  "test_id": "TC_5GC_UPF_042",
  "failure_category": "configuration_mismatch",
  "root_cause": "...",
  "confidence": 0.87,
  "relevant_log_lines": [...]
}

This makes aggregation deterministic and lets the pipeline feed results into dashboards, tickets, or downstream systems without fragile string parsing.
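A small validation layer enforces that contract before anything downstream consumes a report. The sketch below checks the fields from the example above; the concrete values (root cause text, log line) are invented for illustration.

```python
import json

# Expected shape of one agent report (field names from the example above).
REQUIRED_FIELDS = {
    "test_id": str,
    "failure_category": str,
    "root_cause": str,
    "confidence": float,
    "relevant_log_lines": list,
}

def parse_agent_report(raw: str) -> dict:
    """Parse and validate one agent's JSON report. Malformed output fails
    loudly here instead of silently corrupting aggregation."""
    report = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(report.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field}")
    if not 0.0 <= report["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return report

raw = json.dumps({
    "test_id": "TC_5GC_UPF_042",
    "failure_category": "configuration_mismatch",
    "root_cause": "UPF profile missing N4 interface address",  # illustrative
    "confidence": 0.87,
    "relevant_log_lines": ["ERR: n4_addr unset"],  # illustrative
})
report = parse_agent_report(raw)
```

In practice a schema library (e.g. Pydantic or JSON Schema) does this job, but the principle is the same: reject anything that doesn’t match the contract at the boundary.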

Trust But Verify

AI agents make mistakes. In a regression analysis context, a wrong root-cause hypothesis is annoying but not catastrophic — an engineer reviews and confirms. Design your human-in-the-loop accordingly:

  • Flag low-confidence findings for mandatory human review
  • Log all agent reasoning, not just conclusions
  • Track false positive/negative rates over time and tune prompts accordingly
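The first two points reduce to a routing rule: gate on confidence and always persist the reasoning trace. A minimal sketch, with an assumed threshold and invented report data:

```python
REVIEW_THRESHOLD = 0.7  # assumed cutoff; tune against observed FP/FN rates

def route_finding(report):
    """Route one triage report: low confidence goes to a mandatory human
    review queue; high confidence is auto-triaged. Either way the full
    reasoning trace is kept, not just the conclusion."""
    record = {
        "test_id": report["test_id"],
        "conclusion": report["failure_category"],
        "reasoning": report.get("reasoning", ""),  # log reasoning, not just conclusions
    }
    if report["confidence"] < REVIEW_THRESHOLD:
        return ("human_review", record)
    return ("auto_triage", record)

queue, record = route_finding({
    "test_id": "TC_007",            # illustrative
    "failure_category": "timing",
    "confidence": 0.55,
    "reasoning": "retry spike after node restart",
})
```

With confidence 0.55 this lands in the human review queue; the same record with confidence above the threshold would be auto-triaged.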

The Real Win

The biggest benefit wasn’t the time saving per se — it was consistency. Human engineers triage differently on Monday morning vs. Friday afternoon. The agents are consistent every time. That consistency makes it possible to track trends across regression cycles in ways that were impossible before.
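Consistent labels are what make trend tracking mechanical: once every failure carries the same category vocabulary, comparing cycles is just counting. A minimal sketch with invented cycle data:

```python
from collections import Counter

def cycle_trends(reports_by_cycle):
    """Count failure categories per regression cycle. Consistent agent
    labeling makes these counts directly comparable across cycles."""
    return {
        cycle: Counter(r["failure_category"] for r in reports)
        for cycle, reports in reports_by_cycle.items()
    }

trends = cycle_trends({
    "cycle_41": [{"failure_category": "configuration_mismatch"}] * 3,
    "cycle_42": [{"failure_category": "configuration_mismatch"}] * 1
                + [{"failure_category": "timing"}] * 2,
})
```

Here the counts show configuration mismatches falling and timing failures rising between the two cycles — exactly the kind of signal inconsistent human labeling would bury.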


Interested in AI agent architecture or applying this to your engineering workflow? Reach out at bartlomiej@paszkiewicz.tech.