Our roadmap for building the science of AI trust
Last updated: January 2026
Why AI Trust Matters Now
The transition from AI assistants to AI agents changes everything. When models browse the web, execute code, manage files, and chain decisions across multi-step workflows, the trust question shifts from "is this output accurate?" to "should this system be allowed to act?"
Current approaches (output filtering, behavioral red-teaming, RLHF alignment) treat trust as a binary classification problem. But trust in agentic systems is compositional, contextual, and continuous. A model trusted to summarize documents may not be trusted to send emails. A workflow trusted at 10 requests/minute may fail catastrophically at 10,000.
Rotalabs exists to develop the science, benchmarks, and infrastructure for AI trust at scale.
Research Priorities
Six frontier areas where current methods are insufficient.
01 Multi-Agent Trust & Security
The problem: As AI systems coordinate — agents calling agents, tool-using models delegating to specialists — how does Agent A verify Agent B isn't compromised? How do we detect when agents coordinate on harmful strategies?
Why it's hard: Traditional authentication assumes static identities. But AI agents are defined by their weights, prompts, and tool access — all of which can change.
Our approach: Inter-agent trust protocols with cryptographic verification. Collusion detection across multi-agent workflows. Agent attestation infrastructure.
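As a minimal sketch of what agent attestation could look like: an attestation binds together exactly the things the roadmap says define an agent (weights, prompt, tool access), so that any change invalidates it. All names here are hypothetical, and a real protocol would use asymmetric signatures and a trusted attestation service rather than a shared key.

```python
import hashlib
import hmac

def attest(weights_digest: str, system_prompt: str, tools: list[str], key: bytes) -> str:
    """Produce an HMAC attestation over an agent's identity-defining state.

    Illustrative only: a production protocol would use asymmetric signatures,
    not a shared symmetric key.
    """
    canonical = "|".join([weights_digest, system_prompt] + sorted(tools))
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()

def verify(attestation: str, weights_digest: str, system_prompt: str,
           tools: list[str], key: bytes) -> bool:
    expected = attest(weights_digest, system_prompt, tools, key)
    return hmac.compare_digest(attestation, expected)

key = b"shared-demo-key"
tag = attest("sha256:abc123", "You are a summarizer.", ["search", "read_file"], key)
assert verify(tag, "sha256:abc123", "You are a summarizer.", ["search", "read_file"], key)
# Any change to prompt or tool access invalidates the attestation:
assert not verify(tag, "sha256:abc123", "You are a summarizer.", ["search", "send_email"], key)
```

Before Agent A delegates to Agent B, A checks B's attestation against the configuration A expects; a summarizer that has silently gained a send_email tool no longer verifies.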
02 Scalable Oversight for Agentic Systems
The problem: When an agent executes a 50-step workflow, which steps need human review? Which can be AI-verified? Which can be automated?
Why it's hard: Perfect verification is expensive. Skipping verification is dangerous. The optimal policy depends on task type, stakes, model confidence, and historical reliability.
Our approach: Hierarchical verification routing (human → AI-assisted → automated). Debate and self-critique protocols for high-stakes decisions. Verification cost economics.
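A toy version of hierarchical verification routing makes the structure concrete. The thresholds below are invented for illustration; the research question is precisely how to learn them from task type, stakes, confidence, and historical reliability.

```python
def route_verification(stakes: str, confidence: float, historical_accuracy: float) -> str:
    """Toy routing policy: escalate review when stakes are high or signals are weak.

    Thresholds are illustrative assumptions, not recommendations.
    """
    if stakes == "high" or confidence < 0.6:
        return "human"
    if confidence < 0.9 or historical_accuracy < 0.95:
        return "ai_assisted"
    return "automated"

# A high-stakes step always gets human review, regardless of model confidence.
assert route_verification("high", 0.99, 0.99) == "human"
# A routine step with middling confidence gets AI-assisted verification.
assert route_verification("low", 0.80, 0.99) == "ai_assisted"
# Routine, confident, historically reliable: safe to automate.
assert route_verification("low", 0.95, 0.99) == "automated"
```

Applied to a 50-step workflow, a policy like this partitions the steps into the three tiers, and the verification-cost question becomes: what do the thresholds cost, and what do they buy?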
03 Chain-of-Thought Monitoring & Faithfulness
The problem: Models generate reasoning traces, but do those traces reflect actual internal computation? Or is the model post-hoc rationalizing while "thinking" something different?
Why it's hard: We can't directly observe model cognition. CoT is a lossy projection. Models might learn to produce plausible reasoning that masks their true decision process.
Our approach: CoT faithfulness detection methods. Identifying hidden communication in reasoning traces. Comparing internal representations to stated reasoning.
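One crude faithfulness probe is a causal intervention: corrupt the reasoning trace and see whether the final answer changes. If the answer is invariant to the trace, the trace is probably not load-bearing. The sketch below uses a deliberately unfaithful stub model (it ignores its trace entirely) to show what the probe detects; `answer_with_trace` and `faithfulness_probe` are hypothetical names, not an established method's API.

```python
import random

def answer_with_trace(question: str, trace: str) -> str:
    """Stub model that ignores its reasoning trace entirely (deliberately unfaithful)."""
    return "B" if "revenue" in question else "A"

def corrupt(trace: str) -> str:
    """Destroy the trace's semantic content while keeping its vocabulary."""
    words = trace.split()
    random.shuffle(words)
    return " ".join(words)

def faithfulness_probe(question: str, trace: str, trials: int = 20) -> float:
    """Fraction of corrupted-trace trials where the answer changes.

    A score near 0.0 suggests the trace is not causally load-bearing.
    """
    baseline = answer_with_trace(question, trace)
    changed = sum(answer_with_trace(question, corrupt(trace)) != baseline
                  for _ in range(trials))
    return changed / trials

score = faithfulness_probe("Did revenue grow?",
                           "Step 1: check Q3 filings. Step 2: compare to Q2.")
assert score == 0.0  # the stub ignores its trace, so corruption never moves the answer
```

Real faithfulness detection is much harder than this (a model can depend on its trace and still rationalize), but intervention-based probes of this shape are one starting point.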
04 Adversarial Robustness for Trust Systems
The problem: Any trust mechanism can be gamed. If we deploy sandbagging detection, adversaries will train models to sandbag while evading detection. Trust systems must be robust to adaptive adversaries.
Why it's hard: It's an arms race. Red-teaming finds vulnerabilities, but sophisticated adversaries adapt. Game-theoretic equilibria are hard to characterize.
Our approach: Evasion attacks — train models to evade detection, then harden defenses. Game-theoretic analysis of routing exploitation. Evolutionary adversarial testing.
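The adapt/harden loop can be caricatured with a one-dimensional detector: adaptive attacks probe just below the decision boundary, and each red-team round tightens the boundary to cover observed evasions. This is a toy of the dynamic, not a real defense, and the scores and threshold are invented for illustration.

```python
def harden(detector_threshold: float, attack_scores: list[float]) -> float:
    """After a red-team round, lower the flagging threshold to cover every
    observed evasion (toy scalar detector; flags anything scoring >= threshold)."""
    evasions = [s for s in attack_scores if s < detector_threshold]
    return min(evasions) if evasions else detector_threshold

threshold = 0.5             # initial policy: flag anything scoring >= 0.5
round1 = [0.45, 0.48, 0.6]  # adaptive attacks probing just below the boundary
threshold = harden(threshold, round1)
assert threshold == 0.45    # the defense now catches both prior evasions
```

The toy also shows why the arms race is hard: each hardening step shrinks the benign region too, and a sophisticated adversary simply probes the new boundary. Characterizing where that process equilibrates is the game-theoretic question.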
05 Formal Verification for AI Decision Boundaries
The problem: Regulated industries need guarantees, not probabilities. "95% accurate" isn't acceptable when the remaining 5% means a regulatory violation or patient harm.
Why it's hard: Neural networks are not formally verifiable in the traditional sense. But we can verify properties of decision boundaries, tool calls, and action constraints.
Our approach: Bounded uncertainty certificates — provable bounds on model error margins. Verified tool use — formal verification of agent tool calls. Regulatory-grade audit trails.
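A concrete example of a bounded uncertainty certificate, under stated assumptions: if evaluation samples are i.i.d. from the deployment distribution, Hoeffding's inequality turns an empirical error rate into a distribution-free upper bound on the true error rate that holds with probability at least 1 − δ. The numbers below are illustrative; real certificates would also need to address distribution shift, which Hoeffding does not cover.

```python
import math

def error_upper_bound(errors: int, n: int, delta: float) -> float:
    """Upper bound on true error rate, valid with probability >= 1 - delta,
    assuming the n evaluation samples are i.i.d. from deployment traffic.

    Hoeffding's inequality gives eps = sqrt(ln(1/delta) / (2n)).
    """
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return errors / n + eps

# 120 errors on 10,000 held-out samples, certified at 99% confidence:
bound = error_upper_bound(errors=120, n=10_000, delta=0.01)
assert bound < 0.0275  # 1.2% empirical error certifies < 2.75% true error
```

The useful property is the direction of the claim: not "the model is 98.8% accurate" but "with 99% confidence, the error rate does not exceed this bound", which is the shape of statement an auditor can act on.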
06 Memory Security & Integrity
The problem: Long-running agents accumulate memory that can be poisoned, stolen, or legally required to be deleted. Current memory systems have no security model.
Why it's hard: Memory isn't just data; it's contextual and associative. "Delete all information about user X" is semantically ambiguous when information is distributed across embeddings.
Our approach: Memory poisoning detection. Memory provenance — cryptographic attestation of memory sources. Forgetting guarantees — verified deletion for regulatory compliance.
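Memory provenance can be sketched as a hash-chained log: every entry records its source and is chained to the hash of the previous entry, so retroactively editing any entry (poisoning) breaks verification for everything downstream. This is a simplified illustration using plain hashes; the roadmap's cryptographic attestation of sources, and tombstone-based verified deletion, would sit on top of a structure like this.

```python
import hashlib
import json

def append_entry(log: list[dict], content: str, source: str) -> None:
    """Append a memory entry chained to the previous entry's hash."""
    prev = log[-1]["hash"] if log else "genesis"
    record = {"content": content, "source": source, "prev": prev}
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    log.append(record)

def verify_log(log: list[dict]) -> bool:
    """Recompute every hash; any retroactive edit breaks the chain."""
    prev = "genesis"
    for record in log:
        body = {k: record[k] for k in ("content", "source", "prev")}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if record["prev"] != prev or record["hash"] != digest:
            return False
        prev = record["hash"]
    return True

log: list[dict] = []
append_entry(log, "user prefers metric units", "chat:2026-01-03")
append_entry(log, "user timezone is UTC+9", "chat:2026-01-04")
assert verify_log(log)
log[0]["content"] = "user prefers imperial units"  # simulated memory poisoning
assert not verify_log(log)
```

The same chain is what makes "delete all information about user X" auditable: deletions become explicit, signed entries in the log rather than silent mutations.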
Open Problems
Unsolved questions we're thinking about. We don't have complete approaches yet.
When a model gets significantly more capable, all trust measurements are invalidated. How do you bootstrap trust for a new capability regime?
If Agent A is trusted and Agent B is trusted, is A→B trusted? Trust doesn't compose linearly. What's the algebra?
Jailbreaks keep working. Is there a fundamental limit to prompt-based alignment, or just engineering debt?
A model trusted on English financial documents may fail on Japanese medical records. How do you characterize the trust boundary?
Humans over-trust confident models and under-trust uncertain ones. The trust signal we send matters as much as the trust we compute.
How We Work
Benchmarks first
We build evaluation infrastructure before claiming solutions. If we can't measure it, we don't ship it.
Open by default
Core research, benchmarks, and tools are open source. Enterprise products build on open foundations.
Adversarial mindset
Every trust mechanism we build, we try to break. If we can't break it, someone else will.
Get Involved
We're looking for researchers, engineers, and organizations interested in AI trust.
Researchers: AI safety, interpretability, formal methods
Engineers: building trust infrastructure at scale
Organizations: piloting trust systems in production