Azure Foundry Code Interpreter Changes the Agent Risk Floor

Sandboxed Python for AI Agents: The Practical Case for Safer Code Execution

Sandboxing is not an AI safety add-on. It is the minimum bar for any agent that can execute Python.

If an AI agent can write and execute Python, the real architecture question is no longer whether it is useful. It is whether you have treated code execution as a governed production control surface instead of a clever feature.

The industry still wastes time debating whether models are “trustworthy enough” to run code. That is the wrong frame. Once an agent can generate Python, read files, transform data, and produce outputs, the issue stops being model alignment theater and becomes runtime engineering. Simon Willison has been blunt on this point for a while: the right assumption is not that the model is safe, but that the execution environment must be designed as if it is not.

My opinion is simple: sandboxed Python should be treated as a baseline control for enterprise AI agents, not as an optional hardening step for later. Microsoft’s stack increasingly reflects that reality. Azure AI Foundry Agent Service exposes a Code Interpreter tool as sandboxed Python execution for data analysis, math, and chart generation, not as unrestricted host execution. That is product behavior Microsoft documents directly. Architecturally, it suggests code execution belongs behind a bounded service interface rather than inside the app host. But documentation alone is not a security guarantee, so buyers still need to verify isolation, identity, network, and logging details for their own deployment model. Source: https://learn.microsoft.com/en-us/azure/foundry/agents/concepts/tool-catalog

The debate is over once agents can run code

Python is not just another “tool.” It is a general execution substrate. It can parse files, reshape data, run iterative analysis, generate charts, and chain outputs into downstream systems. Give an agent Python and you have given it a programmable actuator.

That is why enterprise buyers should stop asking “Can the model do code execution?” and start asking “What exactly contains that execution?” A safe-looking demo means nothing if the runtime can touch sensitive files, inherit developer credentials, or call arbitrary endpoints.

In Q1, I reviewed a 14-agent internal operations pilot where one analytics assistant inherited a CI runner’s environment variables and could reach three internal REST endpoints that no one intended it to query. The model was not the problem. The runtime was. The root cause was a shared runner configuration that passed its environment variables through to the agent, and the fix was to move execution into an isolated runtime with explicit file mounts, a stripped environment, and a tighter network policy.
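The "stripped environment" part of that fix can be sketched in a few lines. This is an illustration, not the pilot's actual remediation: the function name `run_with_stripped_env`, the `ALLOWED_ENV` allowlist, and the fake `CI_DEPLOY_TOKEN` variable are all hypothetical. A child process with a reduced environment is also still not a sandbox, because it shares the host filesystem and network.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical allowlist: only the variables the interpreter may need to start.
ALLOWED_ENV = ("PATH", "SYSTEMROOT", "LD_LIBRARY_PATH")


def run_with_stripped_env(code: str, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run code in a child interpreter that sees only allowlisted variables."""
    child_env = {k: os.environ[k] for k in ALLOWED_ENV if k in os.environ}
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* variables
            env=child_env,        # replaces the parent's environment instead of inheriting it
            cwd=workdir,          # throwaway working directory
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )


# Simulate a secret sitting in the parent environment, as on a shared CI runner.
os.environ["CI_DEPLOY_TOKEN"] = "not-a-real-token"
probe = "import os; print('CI_DEPLOY_TOKEN' in os.environ)"
result = run_with_stripped_env(probe)
print(result.stdout.strip())  # the child does not see the token
```

The design choice worth noting is the allowlist: the child gets named variables only, so a new secret added to the runner later is excluded by default.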

A sane architecture looks more like a governed tool invocation path than a magical autonomous loop:

[Diagram 1: governed tool invocation path]

What to observe: the code path isolates Python behind an explicit sandbox request, mounted inputs, bounded outputs, and no direct access to host secrets or a shell. That is the right mental model for production.

Why sandboxed Python matters

Microsoft’s wording here is useful, but it should be read carefully. In Azure AI Foundry Agent Service, Code Interpreter is described as a tool that lets agents write and run Python code in a sandboxed environment for analysis and charting. That is the documented product behavior. The architectural implication is that Python execution is treated as a bounded capability with operational constraints. What you should not assume from that wording alone is a complete set of guarantees for every threat model, tenancy model, or integration pattern. Those still require validation.

The same pattern shows up in Microsoft Agent Framework and workflow guidance: once agent behavior is composed from tools, memory, and workflows, governance has to move into the execution layer because orchestration turns isolated tool calls into business process automation. Sources: https://learn.microsoft.com/en-us/agent-framework/ and https://learn.microsoft.com/en-us/agent-framework/workflows/

A minimal pattern for that control surface is an explicit sandbox contract:

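# Sandbox contract: explicit file inputs, bounded outputs, and no secret access.
from pathlib import Path

ALLOWED_SUFFIXES = {".csv", ".json", ".txt"}
MAX_OUTPUT_BYTES = 8192
# Illustrative deny-list only; substring checks are trivial to bypass.
FORBIDDEN_TOKENS = ("os.environ", "subprocess", "socket", "requests.get('http")


def validate_request(code: str, files: list[str], max_output_bytes: int) -> None:
    """Raise ValueError if the request falls outside the declared contract."""
    # Raise instead of assert: assertions are stripped when Python runs with -O.
    if max_output_bytes > MAX_OUTPUT_BYTES:
        raise ValueError("Output cap too high")
    for name in files:
        if Path(name).suffix not in ALLOWED_SUFFIXES:
            raise ValueError(f"Disallowed file: {name}")
    if any(token in code for token in FORBIDDEN_TOKENS):
        raise ValueError("Code requests forbidden capability")


code = "print(open('/workspace/input.txt').read())"
files = ["input.txt"]
validate_request(code, files, max_output_bytes=4096)
print("sandbox request accepted")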

Important caveat: the string matching above is only a teaching device. It is not robust sandbox enforcement. Real enforcement lives in policy layers, restricted runtimes, container or OS isolation, identity boundaries, filesystem mounts, and network controls. The point of the example is the contract shape: declared inputs, bounded outputs, and mediated execution.

The real enterprise decision criteria

If you are evaluating agent platforms, these questions matter more than benchmark charts:

  • What are the isolation boundaries?
  • Is execution per session, per task, or shared?
  • Is storage ephemeral?
  • What filesystem paths are mounted?
  • Is outbound internet disabled by default?
  • Can generated code access internal endpoints?
  • How are secrets supplied, if at all?
  • Are identities short-lived and scoped?
  • What logs exist for prompts, tool calls, artifacts, and policy decisions?
  • What quotas stop runaway loops, oversized outputs, or repeated retries?

These are not details to “work out later.” They are platform selection criteria.

Microsoft’s hosted agent model is relevant here too, but again the distinction matters. The documented product positioning is that hosted agents are a first-class deployment option. The architectural implication is that agent execution is expected to live inside managed operational boundaries. What readers should not infer automatically is that any hosted option, by itself, satisfies their isolation or compliance requirements without further evidence. Source: https://learn.microsoft.com/en-us/azure/foundry/agents/quickstarts/quickstart-hosted-agent

Isolation, least privilege, and egress are what make the sandbox real

Good isolation is concrete:

  • Per-task or per-session execution separation
  • Ephemeral working directories
  • Access only to mounted input files
  • No ambient trust in the host environment
  • No shared writable state across unrelated runs
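The "ephemeral working directories" and "mounted input files" items above can be sketched as a context manager. This is a minimal illustration under stated assumptions: `ephemeral_workspace` and `ALLOWED_SUFFIXES` are hypothetical names, and copying files into a temporary directory shows the contract shape only. It does not stop code from reading other paths; that takes container or OS-level isolation.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

ALLOWED_SUFFIXES = {".csv", ".json", ".txt"}  # assumed input types for this sketch


@contextmanager
def ephemeral_workspace(input_files: list[Path]):
    """Yield a throwaway directory holding copies of the declared inputs only."""
    workdir = Path(tempfile.mkdtemp(prefix="agent-run-"))
    try:
        for src in input_files:
            if src.suffix not in ALLOWED_SUFFIXES:
                raise ValueError(f"Disallowed input type: {src.name}")
            shutil.copy2(src, workdir / src.name)  # copy, never link to the original
        yield workdir
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # nothing survives the run


# Usage: one declared input goes in, and the workspace is gone afterwards.
with tempfile.TemporaryDirectory() as source_dir:
    report = Path(source_dir) / "report.csv"
    report.write_text("region,total\nwest,42\n")
    with ephemeral_workspace([report]) as ws:
        seen = sorted(p.name for p in ws.iterdir())
    workspace_removed = not ws.exists()

print(seen, workspace_removed)
```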

Least privilege is just as concrete:

  • Only the files required for the task
  • Only the APIs explicitly approved
  • Only short-lived scoped identity
  • Only the packages and dependencies you intend
  • No outbound network unless there is a specific reason

If generated code can freely call the public internet or roam across internal endpoints, your “sandbox” is just a nicer shell with branding.

This is exactly why the agent's tool wrapper should refuse local execution and route code through the sandbox every time:

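# Agent-side tool wrapper that refuses local execution and always uses the sandbox.
def submit_to_sandbox(request: dict) -> dict:
    # Hypothetical stand-in for the platform's sandbox client.
    # It returns a canned response so the example runs without a real service.
    return {"stdout": "analysis complete", "stderr": "", "exit_code": 0}


def execute_python_tool(user_code: str, input_files: list[str]) -> str:
    request = {
        "runtime": "python-sandbox",
        "code": user_code,
        "files": input_files,
        "timeout_s": 5,
        "max_output_bytes": 2048,
    }
    # The code is never passed to exec(), eval(), or a local subprocess.
    response = submit_to_sandbox(request)
    # Treat a non-zero exit code as a failure even when stderr is empty.
    if response["exit_code"] != 0 or response["stderr"]:
        return f"Sandbox error: {response['stderr']}"
    return response["stdout"]


print(execute_python_tool("print('analysis complete')", ["report.csv"]))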

A simple policy gate makes the prerequisites visible before enablement:

flowchart TD
    A[Platform Review] --> B{Secrets in env?}
    B -- Yes --> X[Block enablement]
    B -- No --> C{Outbound internet disabled?}
    C -- No --> X
    C -- Yes --> D{Managed identity least-privilege?}
    D -- No --> X
    D -- Yes --> E[Enable sandboxed code execution]

# Fail-fast policy gate before enabling hosted agent code execution.
param(
    [bool]$OutboundInternetAllowed = $false,
    [string[]]$ManagedIdentityScopes = @("https://storage.azure.com/.default")
)

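# @() guarantees an array, so .Count is reliable with zero or one match, including under Set-StrictMode.
$forbidden = @(
    @("OPENAI_API_KEY", "AZURE_CLIENT_SECRET", "GITHUB_TOKEN") |
        Where-Object { Test-Path "Env:$_" }
)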

if ($forbidden.Count -gt 0) { throw "Forbidden secrets exposed: $($forbidden -join ', ')" }
if ($OutboundInternetAllowed) { throw "Outbound internet must be disabled for sandboxed execution" }
if ($ManagedIdentityScopes.Count -gt 1) { throw "Managed identity scope is broader than required" }

"Policy gate passed: sandbox prerequisites satisfied."

What to observe: if secrets are exposed, outbound internet is open, or identity scope is too broad, code execution should not turn on.

Secrets and auditability are where prototypes usually break

The most common hidden mistake I see is convenience inheritance: broad environment variables, developer tokens, or long-lived credentials passed into a code-capable runtime because it made prototyping easier.

That convenience becomes exposure the moment the agent can generate code.

Do not hand raw secrets to a runtime that writes its own instructions. Use short-lived scoped credentials. Prefer brokered access patterns. Expose approved tools that perform narrow actions instead of letting generated code discover and use arbitrary credentials.
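A brokered, short-lived credential might look like the sketch below. Everything here is hypothetical: `CredentialBroker`, `ScopedToken`, and the scope strings are invented for illustration and do not correspond to any Azure or Microsoft API. The point is only the shape: a token is bound to one scope and one expiry, and the check happens outside the generated code.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScopedToken:
    value: str
    scope: str
    expires_at: float

    def allows(self, scope: str, now: Optional[float] = None) -> bool:
        """True only for the exact scope it was issued for, and only before expiry."""
        current = time.time() if now is None else now
        return scope == self.scope and current < self.expires_at


class CredentialBroker:
    """Hypothetical broker that issues short-lived tokens bound to a single scope."""

    def __init__(self, ttl_s: float = 60.0) -> None:
        self._ttl_s = ttl_s

    def issue(self, scope: str, now: Optional[float] = None) -> ScopedToken:
        issued_at = time.time() if now is None else now
        return ScopedToken(secrets.token_urlsafe(16), scope, issued_at + self._ttl_s)


# Usage with a fixed clock so the behavior is deterministic.
broker = CredentialBroker(ttl_s=60.0)
token = broker.issue("storage:read:reports", now=1000.0)
in_scope = token.allows("storage:read:reports", now=1030.0)
wrong_scope = token.allows("storage:write:reports", now=1030.0)
after_expiry = token.allows("storage:read:reports", now=1100.0)
print(in_scope, wrong_scope, after_expiry)
```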

And if an agent can execute Python, your team needs to be able to answer basic questions after the fact:

  • What prompt triggered execution?
  • What code was generated?
  • Which files were mounted and touched?
  • What outputs and artifacts were produced?
  • Were there blocked network attempts?
  • How long did execution run?
  • Which policy checks passed or failed?
  • Did the workflow continue, retry, or escalate to a human?

Without that, every incident becomes guesswork.
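One way to make those questions answerable is to write a structured record per execution. The sketch below is an assumption about shape, not a documented logging schema: the `ExecutionAuditRecord` class and its field names are hypothetical and simply mirror the questions in the list above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExecutionAuditRecord:
    prompt: str                       # what triggered execution
    generated_code: str               # what code was generated
    mounted_files: list[str]          # which files were mounted
    artifacts: list[str]              # what outputs were produced
    blocked_network_attempts: int     # egress attempts the policy denied
    duration_ms: int                  # how long execution ran
    policy_results: dict[str, bool]   # which policy checks passed or failed
    workflow_outcome: str             # "continued", "retried", or "escalated"

    def to_json(self) -> str:
        """Serialize to one JSON line suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = ExecutionAuditRecord(
    prompt="Summarize totals by region",
    generated_code="print('analysis complete')",
    mounted_files=["report.csv"],
    artifacts=["summary.txt"],
    blocked_network_attempts=0,
    duration_ms=412,
    policy_results={"no_env_secrets": True, "egress_denied": True},
    workflow_outcome="continued",
)
line = record.to_json()
print(line)
```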

A clean execution flow should make policy and storage interactions explicit:

[Diagram 6: execution flow with explicit policy and storage interactions]

What sandboxing does not solve

Sandboxing is necessary, not sufficient.

It does not solve:

  • Data exfiltration through channels you intentionally allow
  • Excessive cost from runaway loops or repeated retries
  • Risky or malicious dependencies
  • Prompt injection through retrieved documents or tool outputs
  • Weak human approval design for sensitive actions

That is why the right pattern is layered control: sandboxed execution, tool allowlists, policy enforcement, runtime quotas, approval checkpoints for sensitive operations, and monitoring.

But that does not weaken the argument for sandboxing. It clarifies it. The sandbox should at least solve what it is supposed to solve: isolation, least privilege, auditability, and blast-radius reduction.
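The runtime-quota layer is small enough to sketch directly. This is a minimal illustration, assuming a hypothetical `RunQuota` helper with made-up default limits. It covers only a retry budget and an output cap; cost and wall-clock limits would need their own enforcement in the platform.

```python
from dataclasses import dataclass


@dataclass
class RunQuota:
    max_retries: int = 2
    max_output_bytes: int = 2048
    attempts: int = 0

    def start_attempt(self) -> None:
        """Allow one initial attempt plus max_retries retries, then refuse."""
        if self.attempts >= 1 + self.max_retries:
            raise RuntimeError("Retry budget exhausted")
        self.attempts += 1

    def cap_output(self, stdout: str) -> str:
        """Truncate output to the byte budget and mark the truncation."""
        data = stdout.encode("utf-8")
        if len(data) <= self.max_output_bytes:
            return stdout
        clipped = data[: self.max_output_bytes].decode("utf-8", errors="ignore")
        return clipped + "\n[output truncated]"


quota = RunQuota(max_retries=1, max_output_bytes=8)
quota.start_attempt()  # first attempt
quota.start_attempt()  # the single permitted retry
try:
    quota.start_attempt()
    budget_exhausted = False
except RuntimeError:
    budget_exhausted = True

capped = quota.cap_output("x" * 20)
print(budget_exhausted, capped)
```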

The new baseline for platform selection

Across Azure AI Foundry, Agent Framework, Semantic Kernel, Microsoft 365 Copilot governance, Power Platform, and Microsoft’s training and certification path, the direction is clear: agents are becoming standard enterprise software, and standard enterprise software needs operational controls. That is broader than any single product page, and it still does not remove the need for buyer verification.

So here is the scorecard I would use when evaluating any code-capable agent platform:

  1. Strong sandbox isolation
  2. Ephemeral execution and storage
  3. Default-deny outbound network policy
  4. No ambient secrets in the runtime
  5. Short-lived scoped identity
  6. Dependency controls
  7. Full execution and policy audit logs
  8. Time, output, and cost quotas
  9. Workflow approvals for sensitive actions
  10. Hosted deployment options inside managed boundaries
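The scorecard can be kept as data so that gaps are listed rather than argued about. The sketch below is illustrative only: the `evaluate` function and the sample `evidence` values are hypothetical, and a criterion with no recorded evidence counts as unmet.

```python
SCORECARD = [
    "Strong sandbox isolation",
    "Ephemeral execution and storage",
    "Default-deny outbound network policy",
    "No ambient secrets in the runtime",
    "Short-lived scoped identity",
    "Dependency controls",
    "Full execution and policy audit logs",
    "Time, output, and cost quotas",
    "Workflow approvals for sensitive actions",
    "Hosted deployment options inside managed boundaries",
]


def evaluate(evidence: dict[str, bool]) -> tuple[int, list[str]]:
    """Return the number of criteria met and the list of criteria still missing."""
    missing = [item for item in SCORECARD if not evidence.get(item, False)]
    return len(SCORECARD) - len(missing), missing


# Hypothetical vendor review: one criterion failed, one has no evidence at all.
evidence = {item: True for item in SCORECARD}
evidence["Dependency controls"] = False
del evidence["Time, output, and cost quotas"]

score, missing = evaluate(evidence)
print(f"{score}/{len(SCORECARD)} met; missing: {missing}")
```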

Ask vendors for evidence, not roadmap language. Ask where the code runs, what it can reach, what it inherits, and what gets logged. If they cannot answer concretely, the platform is not ready for enterprise code execution.

The safest enterprise choice is not the agent that can do the most. It is the one that can fail safely under pressure.

If your agent can execute Python today, what is your current default: local runtime, container, or managed sandbox?

And if you are further along, what was the first control you added that materially changed your risk posture?

#AzureAI #EnterpriseAI #AIAgents


Sources & References

  • Azure AI Foundry Agent Service tool catalog: https://learn.microsoft.com/en-us/azure/foundry/agents/concepts/tool-catalog
  • Hosted agent quickstart: https://learn.microsoft.com/en-us/azure/foundry/agents/quickstarts/quickstart-hosted-agent
  • Microsoft Agent Framework: https://learn.microsoft.com/en-us/agent-framework/
  • Agent Framework workflows: https://learn.microsoft.com/en-us/agent-framework/workflows/
  • Semantic Kernel overview: https://learn.microsoft.com/en-us/semantic-kernel/overview/
  • Microsoft 365 Copilot hub: https://learn.microsoft.com/en-us/microsoft-365/copilot/
  • Power Platform: https://learn.microsoft.com/en-us/power-platform/

