Whitepaper · April 2026 · Draft

Constraints on agent execution environments

Three constraints bind the design of any sandbox intended for autonomous code-executing agents. They bind simultaneously, not in sequence. This whitepaper enumerates them, derives the seven concrete properties they imply, and discusses why a general-purpose execution environment cannot satisfy all seven without being designed for them from the start.

Emir Beganović · Isorun BV · v1.0
Cold-start latency: 9 ms (measured, p50, snapshot restore to running shell)
Per-sandbox cost: $0.041/h (1 vCPU + 1 GiB, billed per second)
Isolation depth: 4 layers (CPU virtualization · network policy · credential proxy · audit chain)

Abstract

Autonomous agents now perform a growing share of software-engineering work — generating code, executing tests, manipulating files, calling external APIs, modifying repositories, and (with appropriate gates) deploying changes. Each of these actions requires an execution environment that can run untrusted code at low cost, with low latency, and with strong isolation. We argue that these three constraints bind simultaneously for agent workloads in a way that they do not for adjacent workloads (web hosting, batch compute, CI/CD), and that this simultaneity rules out the standard solutions inherited from the previous decade. We then enumerate seven concrete properties that follow from the three constraints, observe that the properties are mutually reinforcing rather than independent, and describe two integration patterns — embedded in a customer-facing product, or driven from an internal agent loop — that use the same underlying primitive.

1. Problem definition

An agent workload, for the purposes of this document, is any execution of code that satisfies all three of the following:

  1. The code was produced or selected by an autonomous decision-making system rather than written by a human directly. This includes code generated by a language model, code fetched dynamically based on a model's tool selection, and existing code whose execution path was chosen by a model.
  2. The execution carries access to secrets or to a network. In practice, almost all useful agent code does both: it needs an API key or token to call something, and it needs network reachability to call that something.
  3. The execution is short-lived and frequent. A typical agent task spawns between 5 and 50 distinct execution contexts and tears them all down within minutes.

Each of the three properties has consequences for the execution environment.

Property (1) implies that the code inside the environment is, by construction, untrusted. This is true even when the agent is owned and operated by the same party that owns the environment, because the agent's behavior is influenced by inputs (user prompts, fetched documents, web pages, tool outputs) over which the operator has no integrity guarantee. Prompt-injection attacks, jailbreaks, and unintended completions are all cases where an agent that the operator trusts produces code that the operator should not.

Property (2) implies that a security failure inside the environment is not a self-contained event. A leaked OpenAI API key compromises an account that may be billed in dollars per minute. A leaked GitHub token may enable code commits to private repositories. A leaked cloud credential may grant lateral access to the operator's production infrastructure. The cost of a single isolation failure is therefore a function not of the failed sandbox alone, but of the credentials reachable from it.

Property (3) implies that any per-invocation overhead is multiplied by a large constant. An environment that takes 200 ms to start adds 1–10 seconds of unrecoverable latency to a typical agent task; an environment that costs $0.10 per sandbox-hour adds $1–10 per task at typical concurrency. Both numbers are tolerable in research and demos and intolerable in production.
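The latency multiplication can be made explicit. A quick sketch using only the figures from this section (5–50 contexts per task, 200 ms per cold start):

```python
# Figures from this section: a typical agent task spawns 5-50 execution
# contexts, and each one pays the cold start once.
contexts = (5, 50)
start_ms = 200
added_latency_s = tuple(n * start_ms / 1000 for n in contexts)
print(added_latency_s)  # (1.0, 10.0): the "1-10 seconds" quoted above
```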

The defining feature of an agent workload is that it is simultaneously price-sensitive, latency-sensitive, and security-sensitive. Adjacent workloads typically stress at most two of the three.

2. Three simultaneous constraints

From the three properties above, we derive three constraints on any execution environment that intends to host agent workloads:

Constraint A — Low marginal cost. The cost per sandbox-hour at the typical operating point (1 vCPU, 1 GiB RAM, sub-minute lifetime) must be low enough that an agent platform can run continuously without each invocation requiring a deliberate budget decision. Empirically, this threshold is in the low single-digit US cents per sandbox-hour. Above $0.05/h the unit economics break for any deployment that is not enterprise-funded; below $0.05/h they admit individual developers, side projects, and research use.

Constraint B — Low cold-start latency. The wall-clock time from a create call returning to the first executable user instruction inside the environment must be small relative to the model's inference latency for a single tool call. With current frontier models in the 200–800 ms range for a small tool call, an environment that takes 200 ms to start contributes meaningfully to user-perceived latency on every operation. An environment that takes under 20 ms is invisible. The threshold is therefore approximately one order of magnitude below the inference cost of the call that uses it.

Constraint C — Isolation against hostile code. The execution environment must assume the code inside is hostile and structurally prevent it from reaching any resource not explicitly granted to it. "Structurally" here means the prevention does not rely on the correctness of the code being executed; it relies on a boundary outside of and below that code's reach. This rules out any isolation strategy implemented inside the same kernel as the untrusted code.

The three constraints are not independent variables that an operator can trade off against each other for a given workload. They bind at the same time and for the same task. A cheap-and-fast sandbox that leaks credentials is not "two out of three" — it is a sandbox that, on first use, will eventually be the source of an incident. A safe-and-fast sandbox that costs ten times the budget is not deployable continuously, so its safety is measured against demos rather than production. A safe-and-cheap sandbox with multi-second cold starts is equivalent to having no sandbox at all for any agent that issues more than a handful of tool calls per task.

Most existing general-purpose execution environments satisfy two of the three constraints comfortably and the third only partially. This is not a defect of those environments; it is a consequence of having been designed for adjacent workloads (web hosting, ML inference, CI/CD, batch compute) where two of the three constraints do not bind. Agent workloads are the case where they all do.

3. Seven required properties

We claim that the three constraints, applied together, produce seven concrete required properties of the execution environment. Each property contributes to one or more constraints; none is optional if all three constraints are to hold simultaneously.

P1 — Hardware-enforced isolation (Constraint C)

Each sandbox executes inside its own guest operating system kernel under hardware-assisted virtualization (KVM or equivalent). The boundary between the untrusted code and the host is enforced by the CPU's virtualization extensions and a small VMM, not by namespace and cgroup mechanisms inside a shared host kernel. This eliminates an entire class of escape vectors that target the kernel syscall surface, which historically has been the dominant source of container escape vulnerabilities. The trusted computing base shrinks from a multi-million-line kernel to a small VMM that exposes a limited paravirtualized interface.

P2 — Out-of-sandbox credential injection (Constraint C)

API keys, tokens, and other secrets are never present inside the guest's address space, environment, or process table. The execution environment exposes a credential map at the host boundary; inside the guest, the standard environment variables (e.g. OPENAI_API_KEY) hold non-functional placeholder values, and the corresponding base URL variables point to a host-side proxy. The proxy injects the real credential at the network layer when a matching outbound request is observed. A complete memory dump of the guest, an /proc/*/environ walk, and a process listing all yield no information about the underlying secret. The blast radius of an isolation failure inside the guest is therefore bounded by the in-flight request, not by the set of credentials the agent holds.
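The substitution can be sketched in a few lines. This is an illustrative model, not the product's actual proxy: the placeholder string, the host-side key map, and the function names are all hypothetical stand-ins for the real implementation.

```python
PLACEHOLDER = "sk-placeholder-not-a-real-key"                 # all the guest ever holds
REAL_KEYS = {"api.openai.com": "sk-real-key-host-side-only"}  # lives host-side only

def proxy_outbound(host: str, headers: dict) -> dict:
    """Host-side proxy: swap the placeholder for the real credential
    when a matching outbound request is observed."""
    out = dict(headers)
    real = REAL_KEYS.get(host)
    if real and out.get("Authorization") == f"Bearer {PLACEHOLDER}":
        out["Authorization"] = f"Bearer {real}"  # injected at the network layer
    return out

# Inside the guest, a full dump of the environment yields no secret:
guest_env = {"OPENAI_API_KEY": PLACEHOLDER,
             "OPENAI_BASE_URL": "http://proxy.internal"}  # routes via the host proxy
sent = proxy_outbound("api.openai.com",
                      {"Authorization": f"Bearer {guest_env['OPENAI_API_KEY']}"})
```

The guest-visible state contains only the placeholder; the real key appears only in the rewritten request after it has left the guest.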

P3 — Default-deny network egress (Constraint C)

The default network policy is denial of all outbound traffic. Permitted traffic is added by an explicit allow-list, expressed either as a named profile or as a structured policy of domains, IP ranges, and (where applicable) HTTP method and path constraints. Direct-IP traffic, including to literal addresses such as 1.1.1.1 or to private network ranges, is subject to the same allow-list as DNS-resolved traffic. The enforcement runs in the host kernel for established connections and is therefore zero-overhead on the data path; userspace involvement is limited to per-connection decisions for new TLS or HTTP flows.
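A minimal sketch of the allow-list semantics, assuming an illustrative policy shape (the domain set, CIDR range, and function name here are not the product's actual schema):

```python
import ipaddress

# Default-deny egress policy: everything not listed is refused.
ALLOW = {"domains": {"api.github.com", "pypi.org"},
         "cidrs": [ipaddress.ip_network("140.82.112.0/20")]}  # illustrative range

def egress_allowed(target: str) -> bool:
    """Permit only explicitly listed domains or IP ranges.
    Literal IPs, including private ranges, face the same list."""
    try:
        ip = ipaddress.ip_address(target)
        return any(ip in net for net in ALLOW["cidrs"])
    except ValueError:                       # not a literal IP: treat as a domain
        return target in ALLOW["domains"]

egress_allowed("api.github.com")   # allowed: listed domain
egress_allowed("1.1.1.1")          # denied: literal IP outside listed ranges
egress_allowed("10.0.0.5")         # denied: private ranges get no exemption
```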

P4 — Tamper-evident audit log (Constraint C)

Every meaningful event inside the sandbox — command execution, file operation, network request, credential-proxied call — is recorded as a structured entry in an append-only log. Each entry is signed with an HMAC chained to the previous entry's signature. Modification of any entry breaks the chain at every subsequent entry, providing a verifiable property that can be checked by any party with the per-sandbox key. The audit log is the basis for compliance review, incident response, and post-hoc behavioral analysis, and is meaningful only to the extent that the other properties (in particular P1 and P2) are also satisfied — otherwise an attacker who compromises the sandbox can rewrite the log before it is flushed.
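The chaining scheme can be sketched directly. The entry format and key handling below are illustrative, but the verifiable property is the one described: editing any entry breaks verification at every subsequent entry.

```python
import hashlib
import hmac
import json

KEY = b"per-sandbox-audit-key"  # hypothetical per-sandbox key

def append(log: list, event: dict) -> None:
    """Append an event, tagging it with an HMAC over the event plus
    the previous entry's tag."""
    prev_tag = log[-1]["tag"] if log else "genesis"
    payload = json.dumps(event, sort_keys=True) + prev_tag
    tag = hmac.new(KEY, payload.encode(), hashlib.sha256).hexdigest()
    log.append({"event": event, "tag": tag})

def verify(log: list) -> bool:
    """Walk the chain; any modified entry invalidates every later tag."""
    prev_tag = "genesis"
    for entry in log:
        payload = json.dumps(entry["event"], sort_keys=True) + prev_tag
        expect = hmac.new(KEY, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expect, entry["tag"]):
            return False
        prev_tag = entry["tag"]
    return True

log = []
append(log, {"type": "exec", "cmd": "pytest -q"})
append(log, {"type": "net", "host": "pypi.org"})
assert verify(log)
log[0]["event"]["cmd"] = "rm -rf /"   # tamper with the first entry
assert not verify(log)                # the chain breaks from that point on
```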

P5 — Single-digit-millisecond create-to-execute latency (Constraint B)

The interval between a successful create response and the first executable instruction inside the sandbox is less than 10 milliseconds at the median, on representative hardware, for representative images. This is achieved by a combination of design decisions across the host's storage layer, the guest kernel's boot path, and the network setup. The implementation details are out of scope for this document; what is in scope is the consequence: an operating regime in which sandbox creation is a routine primitive operation rather than an expensive setup step. Agent retries are cheap. Exploration of multiple parallel paths is cheap. Discarding state and starting again is cheap.

P6 — Snapshot and restore as a primitive operation (Constraint B)

The execution environment exposes the full memory and filesystem state of a running sandbox as a checkpoint, retrievable later as a new sandbox in the same state. The cost of a checkpoint is bounded by the dirty page set since boot, not by total memory; the cost of a restore is bounded by the snapshot's resident set, not by total filesystem size. Both operations complete in time comparable to the cold-start latency in P5. This makes speculative execution patterns viable: an agent checkpoints before a risky operation, executes the operation, and either commits to the resulting state or discards it and restores from the checkpoint. Without P6, exploratory agents are forced to cold-start every retry.
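The speculative pattern this enables looks like the following. The "sandbox" here is a toy in-memory stand-in (a dict of files); a real agent would call the platform's checkpoint and restore operations, but the control flow is the point.

```python
from copy import deepcopy

# Toy stand-in for a running sandbox: a dict of files plus a test result.
fs = {"app.py": "import old_api\nold_api.run()", "tests_pass": True}

def risky_refactor(state: dict) -> None:
    """Attempt a migration; `tests_pass` stands in for running the tests."""
    state["app.py"] = state["app.py"].replace("old_api", "new_api")
    state["tests_pass"] = "old_api" not in state["app.py"]

checkpoint = deepcopy(fs)   # in the real system: a snapshot whose cost is
risky_refactor(fs)          # bounded by dirty pages, not total memory
if not fs["tests_pass"]:
    fs = checkpoint         # discard the failed attempt and restore; cheap retry
```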

P7 — Ephemeral lifecycle (Constraints A and C)

When a sandbox terminates, all of its state — guest memory, scratch filesystem, network namespace, allocated host resources — is reclaimed immediately. There is no residual disk artifact to scrub, no zombie process to reap, and no accumulated billing for storage of state that is no longer in use. The default sandbox lifecycle is bounded by an explicit timeout; sandboxes that outlive the controlling process are terminated by the host without operator intervention. Ephemeral lifecycle simultaneously reduces the cost surface (no storage to bill) and the security surface (no persistent state to leak), which is why it is a load-bearing element of both Constraints A and C rather than only one.
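As control flow, the ephemeral lifecycle behaves like a scoped resource: the sandbox exists only inside the block, and teardown reclaims everything however the block exits. A toy sketch (the function and field names are illustrative, not the SDK's):

```python
from contextlib import contextmanager

@contextmanager
def ephemeral_sandbox(timeout_s: int = 60):
    """Toy ephemeral sandbox. `timeout_s` stands in for the explicit
    lifecycle bound described in P7 (not enforced in this sketch)."""
    state = {"alive": True, "scratch": {}}  # stand-in for guest memory, fs, netns
    try:
        yield state
    finally:
        # Teardown runs on every exit path: all state is reclaimed,
        # leaving nothing to scrub and nothing to bill.
        state["alive"] = False
        state["scratch"].clear()

with ephemeral_sandbox() as sb:
    sb["scratch"]["out.txt"] = "result"
# After exit, nothing survives:
assert sb["alive"] is False and sb["scratch"] == {}
```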

4. The properties are mutually reinforcing

The seven properties are not a checklist from which an operator can drop the ones that do not seem necessary. Each property depends on the others to function as advertised.

Without P7 (ephemeral lifecycle), P6 (snapshot/restore) becomes a liability rather than an asset: snapshots accumulate as long-lived storage that may contain residual credentials, intermediate state, or data subject to retention policies. The same operation that enables fast retries becomes the source of state that should not exist.

Without P2 (credential injection), P4 (audit log) does not change the threat model. By the time an event reaches the log, the credential is already in the guest's address space; an attacker who escapes the sandbox has the credential, and an attacker who has the credential does not need to escape the sandbox.

Without P1 (hardware isolation), P4 is similarly weakened: hostile code in the guest can write to the audit log with arbitrary content before flushing, since both run inside the same kernel address space.

Without P5 (sub-10 ms boot), P6 becomes operationally clunky and agents revert to avoiding retries to save wall-clock time. The exploratory pattern that P6 was meant to enable does not appear in practice.

Without P3 (default-deny network), P1 and P2 do not prevent data exfiltration. The hostile code inside the guest cannot reach the host kernel and cannot read the credential, but it can still send arbitrary data to an attacker-controlled endpoint over an outbound TCP connection.

Without P4, none of the above is observable after the fact: incident response becomes conjecture rather than reconstruction.

An execution environment that satisfies six of the seven properties is therefore not 86% as good as one that satisfies all seven; it has a structural failure mode in the dimension it omitted, and the dimension it omitted determines the failure mode that will eventually be discovered.

5. Two integration patterns

The properties above are the same regardless of how the execution environment is consumed. Two integration patterns are common:

Pattern A — Embedded in a customer-facing product

An organization that ships a developer-facing product (a coding tool, an application builder, a code-review platform, an autonomous research system, an in-product agent) needs an execution environment that can run code on behalf of its end users. The end users do not see the sandbox directly; they see the product surface. The sandbox is part of the substrate underneath that surface, in the same way a database or an inference endpoint is part of the substrate.

At this scale, the sandbox is invoked via the SDK from the product's backend. Volume is high — thousands to millions of sandboxes per day — and the cost per sandbox is multiplied by the number of end users. All seven properties matter, and the cost property in particular is amplified by the multi-tenant economics: a reduction in per-sandbox cost flows directly into the gross margin of every customer on the platform.

For this pattern, the sandbox is a load-bearing piece of infrastructure that should not be re-implemented in-house. It is the same decision as the choice not to implement a database or an inference engine in-house: the work is specialist, the failure modes are subtle, and the cost of getting it right is dominated by experience that is hard to acquire from a standing start.

Pattern B — Driven from an internal agent loop

An organization that builds agents for its own internal use — code review, refactoring at scale, incident triage, test generation, documentation maintenance, automated investigation — needs the same execution environment, but at smaller scale and with different access patterns. The agent loop runs in the organization's own backend (or in a developer's terminal, or in a CI runner) and calls the sandbox SDK directly from the agent's tool layer.

Volume is lower than Pattern A — typically tens to hundreds of sandboxes per task, multiplied by the number of agent invocations per day. The cost constraint is still binding because the operator absorbs the full cost rather than passing it through to end users; the latency constraint is still binding because the agent's wall-clock time is the developer's wall-clock time; and the isolation constraint is binding because the agent is operating inside the organization's network with access to the organization's credentials.

For this pattern, the sandbox appears as a tool in the agent's toolset, alongside model calls and external API integrations. The integration is a few lines of SDK code; the operational concern is the same as for any external dependency.

Both patterns use the same product. The properties are the same. The difference is volume, and what follows from volume in unit economics.

6. Cost model

A common question is how the per-sandbox-hour price can be set at $0.041 when the median market rate is several times higher. The answer is operational rather than algorithmic.

The execution environment runs on dedicated bare-metal hardware leased monthly from specialist providers, not on virtual machines from hyperscale cloud providers. The cost per CPU-hour of bare-metal capacity is approximately one-third of the equivalent on-demand cloud rate, and the difference compounds further when virtualization overhead is considered: a sandbox that runs on a hyperscaler's VM is being virtualized twice (once by the cloud provider, once by the sandbox runtime), which both costs CPU and constrains the available isolation primitives.

Beyond the choice of substrate, the price reflects an absence of secondary metering: there is no separate charge for inter-sandbox network transfer, no separate charge for memory beyond the sandbox specification, no separate platform fee on top of the per-sandbox rate. The advertised price is the price.

The trade made for this cost discipline is operational complexity: the operator runs its own fleet on bare metal, including hardware lifecycle, network management, and incident response. This complexity is invisible to consumers of the API but is the reason most adjacent services choose a higher-level substrate.

7. Out of scope

This document describes a primitive. It does not describe a complete agent platform, and the primitive does not solve problems outside its scope.

It does not replace human code review. The most reliable defense against incorrect or unintended agent output is a human reading the diff. The execution environment makes it safer to run the agent while review is in progress; it does not remove the need for the review.

It does not defend against prompt injection at the model level. A successfully injected prompt will produce hostile code regardless of where that code runs. The execution environment limits the consequences of a successful injection (by Constraint C) but does not prevent the injection itself.

It is not suitable as a hosting target for stateful long-running services. The sandbox is ephemeral by design (P7); workloads that require persistence should keep that state in a system designed for persistence and treat the sandbox strictly as disposable compute.

It is not the cheapest possible compute. The cheapest compute is a spot instance running a container without isolation, and that is the right answer for trusted workloads. The execution environment is for the case where the workload is untrusted, and the relevant comparison is therefore against other isolation primitives, not against unisolated compute.

8. Origin

The author has spent the last decade in site reliability engineering, most recently building infrastructure that executed approximately six million CI jobs per month on bare-metal hardware, and previously published a survey of the state of microVM isolation. The motivation for the present work is the observation that no execution environment satisfied all seven properties enumerated in Section 3 simultaneously, and that the lack of such an environment was a structural rather than incidental gap in the available toolkit. Isorun is the result.

Try it.

Five lines of Python. 9 ms cold boot. $5 free credit, no card required.

Get started →