Our roadmap for building the science of AI trust
Last updated: January 2026
Why AI Trust Matters Now
The transition from AI assistants to AI agents changes everything. When models browse the web, execute code, manage files, and chain decisions across multi-step workflows, the trust question shifts from "is this output accurate?" to "should this system be allowed to act?"
Current approaches (output filtering, behavioral red-teaming, RLHF alignment) treat trust as a binary classification problem. But trust in agentic systems is compositional, contextual, and continuous. A model trusted to summarize documents may not be trusted to send emails. A workflow trusted at 10 requests/minute may fail catastrophically at 10,000.
Rotalabs exists to develop the science, benchmarks, and infrastructure for AI trust at scale.
Research Priorities
Six frontier areas where current methods are insufficient.
01 Multi-Agent Trust & Security
The problem: As AI systems coordinate — agents calling agents, tool-using models delegating to specialists — how does Agent A verify Agent B isn't compromised? How do we detect when agents coordinate on harmful strategies?
Why it's hard: Traditional authentication assumes static identities. But AI agents are defined by their weights, prompts, and tool access — all of which can change.
Our approach: Inter-agent trust protocols with cryptographic verification. Collusion detection across multi-agent workflows. Agent attestation infrastructure.
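As a minimal sketch of what agent attestation could look like: an attestation binds together exactly the things the roadmap says define an agent (weights, prompt, tool access), so that any change invalidates it. All names here are hypothetical, and a real protocol would use asymmetric signatures and a trusted attestation service rather than a shared key.

```python
import hashlib
import hmac

def attest(weights_digest: str, system_prompt: str, tools: list[str], key: bytes) -> str:
    """Produce an HMAC attestation over an agent's identity-defining state.

    Illustrative only: a production protocol would use asymmetric signatures,
    not a shared symmetric key.
    """
    canonical = "|".join([weights_digest, system_prompt] + sorted(tools))
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()

def verify(attestation: str, weights_digest: str, system_prompt: str,
           tools: list[str], key: bytes) -> bool:
    expected = attest(weights_digest, system_prompt, tools, key)
    return hmac.compare_digest(attestation, expected)

key = b"shared-demo-key"
tag = attest("sha256:abc123", "You are a summarizer.", ["search", "read_file"], key)
assert verify(tag, "sha256:abc123", "You are a summarizer.", ["search", "read_file"], key)
# Any change to prompt or tool access invalidates the attestation:
assert not verify(tag, "sha256:abc123", "You are a summarizer.", ["search", "send_email"], key)
```

Before Agent A delegates to Agent B, A checks B's attestation against the configuration A expects; a summarizer that has silently gained a send_email tool no longer verifies.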
02 Scalable Oversight for Agentic Systems
The problem: When an agent executes a 50-step workflow, which steps need human review? Which can be AI-verified? Which can be automated?
Why it's hard: Perfect verification is expensive. Skipping verification is dangerous. The optimal policy depends on task type, stakes, model confidence, and historical reliability.
Our approach: Hierarchical verification routing (human → AI-assisted → automated). Debate and self-critique protocols for high-stakes decisions. Verification cost economics.
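A toy version of hierarchical verification routing makes the structure concrete. The thresholds below are invented for illustration; the research question is precisely how to learn them from task type, stakes, confidence, and historical reliability.

```python
def route_verification(stakes: str, confidence: float, historical_accuracy: float) -> str:
    """Toy routing policy: escalate review when stakes are high or signals are weak.

    Thresholds are illustrative assumptions, not recommendations.
    """
    if stakes == "high" or confidence < 0.6:
        return "human"
    if confidence < 0.9 or historical_accuracy < 0.95:
        return "ai_assisted"
    return "automated"

# A high-stakes step always gets human review, regardless of model confidence.
assert route_verification("high", 0.99, 0.99) == "human"
# A routine step with middling confidence gets AI-assisted verification.
assert route_verification("low", 0.80, 0.99) == "ai_assisted"
# Routine, confident, historically reliable: safe to automate.
assert route_verification("low", 0.95, 0.99) == "automated"
```

Applied to a 50-step workflow, a policy like this partitions the steps into the three tiers, and the verification-cost question becomes: what do the thresholds cost, and what do they buy?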
03 Chain-of-Thought Monitoring & Faithfulness
The problem: Models generate reasoning traces, but do those traces reflect actual internal computation? Or is the model post-hoc rationalizing while "thinking" something different?
Why it's hard: We can't directly observe model cognition. CoT is a lossy projection. Models might learn to produce plausible reasoning that masks their true decision process.
Our approach: CoT faithfulness detection methods. Identifying hidden communication in reasoning traces. Comparing internal representations to stated reasoning.
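One crude faithfulness probe is a causal intervention: corrupt the reasoning trace and see whether the final answer changes. If the answer is invariant to the trace, the trace is probably not load-bearing. The sketch below uses a deliberately unfaithful stub model (it ignores its trace entirely) to show what the probe detects; `answer_with_trace` and `faithfulness_probe` are hypothetical names, not an established method's API.

```python
import random

def answer_with_trace(question: str, trace: str) -> str:
    """Stub model that ignores its reasoning trace entirely (deliberately unfaithful)."""
    return "B" if "revenue" in question else "A"

def corrupt(trace: str) -> str:
    """Destroy the trace's semantic content while keeping its vocabulary."""
    words = trace.split()
    random.shuffle(words)
    return " ".join(words)

def faithfulness_probe(question: str, trace: str, trials: int = 20) -> float:
    """Fraction of corrupted-trace trials where the answer changes.

    A score near 0.0 suggests the trace is not causally load-bearing.
    """
    baseline = answer_with_trace(question, trace)
    changed = sum(answer_with_trace(question, corrupt(trace)) != baseline
                  for _ in range(trials))
    return changed / trials

score = faithfulness_probe("Did revenue grow?",
                           "Step 1: check Q3 filings. Step 2: compare to Q2.")
assert score == 0.0  # the stub ignores its trace, so corruption never moves the answer
```

Real faithfulness detection is much harder than this (a model can depend on its trace and still rationalize), but intervention-based probes of this shape are one starting point.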
04 Adversarial Robustness for Trust Systems
The problem: Any trust mechanism can be gamed. If we deploy sandbagging detection, adversaries will train models to sandbag while evading detection. Trust systems must be robust to adaptive adversaries.
Why it's hard: It's an arms race. Red-teaming finds vulnerabilities, but sophisticated adversaries adapt. Game-theoretic equilibria are hard to characterize.
Our approach: Evasion attacks — train models to evade detection, then harden defenses. Game-theoretic analysis of routing exploitation. Evolutionary adversarial testing.
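The adapt/harden loop can be caricatured with a one-dimensional detector: adaptive attacks probe just below the decision boundary, and each red-team round tightens the boundary to cover observed evasions. This is a toy of the dynamic, not a real defense, and the scores and threshold are invented for illustration.

```python
def harden(detector_threshold: float, attack_scores: list[float]) -> float:
    """After a red-team round, lower the flagging threshold to cover every
    observed evasion (toy scalar detector; flags anything scoring >= threshold)."""
    evasions = [s for s in attack_scores if s < detector_threshold]
    return min(evasions) if evasions else detector_threshold

threshold = 0.5             # initial policy: flag anything scoring >= 0.5
round1 = [0.45, 0.48, 0.6]  # adaptive attacks probing just below the boundary
threshold = harden(threshold, round1)
assert threshold == 0.45    # the defense now catches both prior evasions
```

The toy also shows why the arms race is hard: each hardening step shrinks the benign region too, and a sophisticated adversary simply probes the new boundary. Characterizing where that process equilibrates is the game-theoretic question.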
05 Formal Verification for AI Decision Boundaries
The problem: Regulated industries need guarantees, not probabilities. "95% accurate" isn't acceptable when the remaining 5% means a regulatory violation or patient harm.
Why it's hard: Neural networks are not formally verifiable in the traditional sense. But we can verify properties of decision boundaries, tool calls, and action constraints.
Our approach: Bounded uncertainty certificates — provable bounds on model error margins. Verified tool use — formal verification of agent tool calls. Regulatory-grade audit trails.
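A concrete example of a bounded uncertainty certificate, under stated assumptions: if evaluation samples are i.i.d. from the deployment distribution, Hoeffding's inequality turns an empirical error rate into a distribution-free upper bound on the true error rate that holds with probability at least 1 − δ. The numbers below are illustrative; real certificates would also need to address distribution shift, which Hoeffding does not cover.

```python
import math

def error_upper_bound(errors: int, n: int, delta: float) -> float:
    """Upper bound on true error rate, valid with probability >= 1 - delta,
    assuming the n evaluation samples are i.i.d. from deployment traffic.

    Hoeffding's inequality gives eps = sqrt(ln(1/delta) / (2n)).
    """
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return errors / n + eps

# 120 errors on 10,000 held-out samples, certified at 99% confidence:
bound = error_upper_bound(errors=120, n=10_000, delta=0.01)
assert bound < 0.0275  # 1.2% empirical error certifies < 2.75% true error
```

The useful property is the direction of the claim: not "the model is 98.8% accurate" but "with 99% confidence, the error rate does not exceed this bound", which is the shape of statement an auditor can act on.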
06 Memory Security & Integrity
The problem: Long-running agents accumulate memory that can be poisoned, stolen, or legally required to be deleted. Current memory systems have no security model.
Why it's hard: Memory isn't just data; it's contextual and associative. "Delete all information about user X" is semantically ambiguous when information is distributed across embeddings.
Our approach: Memory poisoning detection. Memory provenance — cryptographic attestation of memory sources. Forgetting guarantees — verified deletion for regulatory compliance.
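Memory provenance can be sketched as a hash-chained log: every entry records its source and is chained to the hash of the previous entry, so retroactively editing any entry (poisoning) breaks verification for everything downstream. This is a simplified illustration using plain hashes; the roadmap's cryptographic attestation of sources, and tombstone-based verified deletion, would sit on top of a structure like this.

```python
import hashlib
import json

def append_entry(log: list[dict], content: str, source: str) -> None:
    """Append a memory entry chained to the previous entry's hash."""
    prev = log[-1]["hash"] if log else "genesis"
    record = {"content": content, "source": source, "prev": prev}
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    log.append(record)

def verify_log(log: list[dict]) -> bool:
    """Recompute every hash; any retroactive edit breaks the chain."""
    prev = "genesis"
    for record in log:
        body = {k: record[k] for k in ("content", "source", "prev")}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if record["prev"] != prev or record["hash"] != digest:
            return False
        prev = record["hash"]
    return True

log: list[dict] = []
append_entry(log, "user prefers metric units", "chat:2026-01-03")
append_entry(log, "user timezone is UTC+9", "chat:2026-01-04")
assert verify_log(log)
log[0]["content"] = "user prefers imperial units"  # simulated memory poisoning
assert not verify_log(log)
```

The same chain is what makes "delete all information about user X" auditable: deletions become explicit, signed entries in the log rather than silent mutations.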
Open Problems
Unsolved questions we're thinking about. We don't have complete approaches yet.
When a model gets significantly more capable, all trust measurements are invalidated. How do you bootstrap trust for a new capability regime?
If Agent A is trusted and Agent B is trusted, is A→B trusted? Trust doesn't compose linearly. What's the algebra?
Jailbreaks keep working. Is there a fundamental limit to prompt-based alignment, or just engineering debt?
A model trusted on English financial documents may fail on Japanese medical records. How do you characterize the trust boundary?
Humans over-trust confident models and under-trust uncertain ones. The trust signal we send matters as much as the trust we compute.
How We Work
Benchmarks first
We build evaluation infrastructure before claiming solutions. If we can't measure it, we don't ship it.
Open by default
Core research, benchmarks, and tools are open source. Enterprise products build on open foundations.
Adversarial mindset
Every trust mechanism we build, we try to break. If we can't break it, someone else will.
Get Involved
We're looking for researchers, engineers, and organizations interested in AI trust.
Researchers: AI safety, interpretability, formal methods
Engineers: building trust infrastructure at scale
Organizations: piloting trust systems in production