Build an Enterprise-Ready Second Brain on Azure Foundry + Cosmos DB

A useful enterprise second brain is not just a chatbot over documents. It is a governed, retrieval-first knowledge system with durable memory, secure APIs, and predictable cost, built on services your platform team can actually operate.

That distinction matters.

Most internal AI prototypes stop at “upload files, ask questions.” But production systems need more:

  • durable organizational memory
  • grounded retrieval with citations
  • secure access controls
  • API governance across apps and agents
  • observability and auditability
  • cost controls that survive real usage

In this hands-on tutorial, I’ll show you a practical baseline architecture using:

  • Microsoft Foundry as the AI control plane for chat and embeddings
  • Azure Cosmos DB as the durable knowledge and memory layer
  • Azure API Management as the governed enterprise entry point
  • Azure-native identity, networking, and monitoring for production boundaries

What you will have by the end

  • a baseline Azure architecture for a production-oriented second brain
  • a chunking and ingestion pattern with deployment-based embeddings
  • Cosmos DB containers for knowledge, memory, and preferences
  • a retrieval flow with tenant and ACL filtering
  • APIM in front of the system with a realistic JWT validation baseline
  • a clear path to harden security, cost, and evaluation

This is not a toy demo. It is a production-oriented starting point you can evolve into a reusable platform pattern.


Why an enterprise second brain needs more than RAG

Retrieval-augmented generation is necessary, but it is not sufficient.

An enterprise second brain should act as organizational memory across:

  • documents and policies
  • decisions and rationale
  • conversations and task context
  • user preferences and working style
  • feedback signals that improve future answers

To build that well, senior teams should separate concerns instead of collapsing everything into one app:

  • Model orchestration: which chat and embedding deployments you use
  • Retrieval: how you index, filter, and rank knowledge
  • Memory: what short-term and long-term state you persist
  • Governance: how you secure, meter, and observe access
  • Integration: how other apps, workflows, and agents consume the system

That separation is what makes the architecture operable.

My thesis is simple:

  • Foundry provides the intelligence layer
  • Cosmos DB provides durable knowledge and memory
  • APIM provides governance, security, and reuse

Reference architecture on Azure Foundry plus Cosmos DB

Here is the end-to-end shape we’re building:

  1. Raw files land in Blob Storage or another enterprise content source.
  2. An ingestion job extracts and normalizes text.
  3. Text is chunked into retrieval-friendly units.
  4. An embeddings deployment in Foundry creates vectors.
  5. Chunk records and metadata are stored in Cosmos DB.
  6. Users call a stable API exposed through API Management.
  7. The application loads user memory and retrieves relevant chunks.
  8. A chat deployment in Foundry produces a grounded answer with citations.
  9. Feedback, traces, and audit events are persisted for improvement and governance.

A compact architecture diagram makes that flow easier to reason about:

[Diagram: end-to-end architecture. APIM entry point, application logic for retrieval and memory, Foundry for model inference, Cosmos DB for operational state]

What to notice: APIM sits in front, your application logic mediates retrieval and memory, Foundry handles model inference, and Cosmos DB holds operational state. That separation is intentional.


Where each kind of state lives

A common design mistake is storing every kind of data together. Instead, separate these clearly:

  • Raw files: Blob Storage
  • Chunked knowledge records: Cosmos DB
  • Short-term conversation state: Cosmos DB with TTL
  • Long-term user memory and preferences: Cosmos DB with stricter write rules
  • Audit trails and feedback events: separate Cosmos DB containers or a downstream analytics path

Enterprise boundaries to define early

Before writing code, decide your baseline controls:

  • Microsoft Entra ID for identity
  • Managed identity for service-to-service access
  • RBAC for least privilege
  • Private endpoints and network isolation where required
  • Policy enforcement at APIM
  • Monitoring via Azure Monitor / Application Insights

If you are in a regulated or sovereignty-sensitive environment, make that decision at the start, not after the prototype works.


Prerequisites

Before Step 1, make sure you have:

  • an Azure subscription with permission to create Cosmos DB, APIM, Storage, and networking resources
  • a Foundry project with chat and embedding deployments already created
  • deployment names for:
      - your-chat-deployment
      - your-embedding-deployment

  • Azure CLI and PowerShell installed
  • Python 3.10+ for the application examples
  • Entra-backed access to Azure resources

In Azure, the model= value used in the SDK examples below should be your deployment name from Foundry, not a raw model family name.


Data model the second brain before writing code

The fastest way to create a slow, expensive system is to skip the data model.

For this tutorial, design Cosmos DB around distinct workloads.

Use separate containers for:

  • documents: source-level metadata
  • chunks: retrieval units with embeddings and ACL metadata
  • conversations: chat sessions and turn summaries
  • memory: durable or semi-durable user/task memory
  • preferences: stable user preferences
  • feedback or citations: evaluation and audit artifacts

Partition key choices

Partitioning is not cosmetic in Cosmos DB. It directly affects RU efficiency, scale, and tenant isolation.

A practical baseline:

  • chunks: partition by /tenantId
  • documents: partition by /tenantId
  • memory: partition by /userId
  • preferences: partition by /userId
  • conversations: partition by /conversationId
  • feedback: partition by /tenantId or /conversationId depending on query patterns

Why not one giant container? Because your access patterns differ:

  • retrieval is usually tenant-scoped and metadata-filtered
  • memory lookups are usually user-scoped
  • conversation playback is conversation-scoped

Metadata that improves retrieval quality

For chunk records, include at least:

  • tenantId
  • docId
  • chunkId
  • text
  • embedding
  • sourceUri or source file name
  • documentType
  • businessDomain
  • createdAt / updatedAt
  • aclTags
  • embeddingModel
  • embeddingVersion

That metadata is what lets you:

  • filter by tenant and permissions
  • restrict results to a business domain
  • re-embed safely when models change
  • explain answers with citations

Sample chunk schema

A concrete chunk document helps make the design real:

{
  "id": "contoso|handbook-001|00012",
  "tenantId": "contoso",
  "docId": "handbook-001",
  "chunkId": "00012",
  "text": "Azure Foundry helps teams build governed copilots with deployment-based model access.",
  "embedding": [0.0123, -0.0456, 0.0789],
  "sourceUri": "https://storageaccount.blob.core.windows.net/docs/handbook.pdf",
  "source": "handbook.pdf",
  "documentType": "policy",
  "businessDomain": "it",
  "aclTags": ["group:it-admins", "region:us"],
  "classification": "internal",
  "embeddingModel": "your-embedding-deployment",
  "embeddingVersion": "2025-01",
  "contentHash": "sha256:abc123",
  "createdAt": "2025-05-01T12:00:00Z",
  "updatedAt": "2025-05-01T12:00:00Z"
}

Retention strategy

Not all memory deserves to live forever.

A useful pattern:

  • Conversation turns: TTL of days to weeks
  • Episodic memory: TTL of hours to days
  • User preferences: no TTL unless business rules require it
  • Knowledge chunks: no TTL, but versioned and archived when superseded
  • Audit events: retained according to compliance policy

This hot-versus-archival distinction is important for both cost and quality. Unbounded memory growth increases RU usage, storage costs, and prompt pollution.


Step 1: Create the Azure foundation

Start with a minimal but realistic Azure footprint:

  • Resource group
  • Virtual network baseline
  • Cosmos DB account and SQL database
  • API Management instance
  • Storage account for raw files
  • Monitoring resources
  • Foundry project/workspace and model access

The following PowerShell example provisions a resource group, VNet, Cosmos DB, a SQL database, a couple of baseline containers, and APIM. It is intentionally compact for tutorial purposes.

# Provision a resource group, Cosmos DB, APIM, and baseline networking for the 2nd Brain tutorial
$location = "eastus"
$rg = "rg-2ndbrain-demo"
$cosmos = "cosmos2ndbrain$((Get-Random -Maximum 9999))"
$apim = "apim-2ndbrain-demo"
$vnet = "vnet-2ndbrain"
$subnet = "snet-app"

az group create -n $rg -l $location | Out-Null
az network vnet create -g $rg -n $vnet --address-prefix 10.10.0.0/16 --subnet-name $subnet --subnet-prefix 10.10.1.0/24 | Out-Null
az cosmosdb create -g $rg -n $cosmos --kind GlobalDocumentDB --default-consistency-level Session --enable-free-tier true | Out-Null
az cosmosdb sql database create -g $rg -a $cosmos -n brain | Out-Null
az cosmosdb sql container create -g $rg -a $cosmos -d brain -n chunks --partition-key-path "/tenantId" --ttl -1 | Out-Null
az cosmosdb sql container create -g $rg -a $cosmos -d brain -n memory --partition-key-path "/userId" --ttl 2592000 | Out-Null
az apim create -g $rg -n $apim --publisher-name "Contoso" --publisher-email "admin@contoso.com" --sku-name Consumption | Out-Null

What to observe: the example creates a brain database and starts with separate chunks and memory containers, with TTL enabled on memory. In a real implementation, you would also add Storage, Key Vault, diagnostics settings, and likely private connectivity.

Identity and least privilege

After the resources exist, wire up access deliberately. The next example assigns Cosmos DB data-plane access and creates additional containers for preferences, conversations, and citations.

# Configure managed identity access and create Cosmos DB containers used by ingestion and chat APIs
$rg = "rg-2ndbrain-demo"
$cosmos = (az cosmosdb list -g $rg --query "[0].name" -o tsv)
$principalId = az ad signed-in-user show --query id -o tsv

az cosmosdb sql role assignment create `
  -g $rg -a $cosmos `
  --role-definition-name "Cosmos DB Built-in Data Contributor" `
  --scope "/" --principal-id $principalId | Out-Null

az cosmosdb sql container create -g $rg -a $cosmos -d brain -n preferences --partition-key-path "/userId" --ttl -1 | Out-Null
az cosmosdb sql container create -g $rg -a $cosmos -d brain -n conversations --partition-key-path "/conversationId" --ttl 604800 | Out-Null
az cosmosdb sql container create -g $rg -a $cosmos -d brain -n citations --partition-key-path "/tenantId" --ttl -1 | Out-Null

What to do next: replace the signed-in-user role assignment with managed identities for your app and ingestion workers. For enterprise deployments, avoid long-lived keys and prefer Entra-backed auth paths end to end.


Step 2: Deploy models in Foundry for chat and embeddings

Once the foundation exists, prepare the AI layer.

For this architecture, you need two model capabilities:

  1. a chat deployment for grounded answer generation
  2. an embeddings deployment for indexing and query vectorization

Model selection guidance

Choose based on workload (a simple routing sketch follows this list):

  • If you need lower latency and lower cost for high-volume enterprise Q&A, a smaller chat deployment is often the right default.
  • If you need more complex reasoning, tool use, or longer synthesis tasks, route selectively to a stronger deployment instead of sending every request there.
  • For embeddings, use one deployment consistently across indexing and query time until you intentionally version and re-embed.
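
To make that routing rule concrete, here is a minimal sketch. The heuristic and the stronger deployment name (your-strong-chat-deployment) are placeholders, not defaults from this architecture:

# Route high-volume Q&A to a low-cost deployment; escalate selectively
def pick_chat_deployment(question: str, needs_tools: bool) -> str:
    long_synthesis = len(question.split()) > 300  # crude proxy for a complex task
    if needs_tools or long_synthesis:
        return "your-strong-chat-deployment"  # hypothetical stronger deployment
    return "your-chat-deployment"             # default low-latency deployment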

Capture operational assumptions

Document these early:

  • deployment names
  • tokens per minute or throughput assumptions
  • concurrency expectations
  • retry policy
  • fallback behavior when rate-limited

This is where many pilots fail in production: they know the prompt, but not the throughput envelope.


Step 3: Create Cosmos DB containers for knowledge and memory

Now formalize the operational data layer.

You already created some containers above, but here is the design intent:

  • chunks stores retrieval records
  • documents stores source metadata and ingestion status
  • conversations stores session state and summaries
  • memory stores durable or semi-durable episodic facts
  • preferences stores user-level stable settings
  • citations or feedback stores answer evidence and user reactions

Indexing strategy

Be deliberate with indexing in Cosmos DB:

  • include the metadata fields you filter on often
  • exclude fields you never query to reduce write cost
  • treat large embedding arrays carefully because they increase item size and write RU cost

If Cosmos DB is your operational retrieval store, keep embeddings and retrieval metadata alongside each chunk record. Also plan for re-embedding by storing embeddingVersion and possibly supersededBy fields.
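
Re-embedding is worth sketching because it is a routine operation, not an exception. Here is a minimal sweep, assuming the aoai and container clients from the ingestion examples in Step 4 and a hypothetical new version label:

# Minimal re-embedding sweep: refresh chunks still on an old embedding version
NEW_VERSION = "2025-06"  # hypothetical new embedding version label

def reembed_stale_chunks(container, aoai, tenant_id):
    stale = container.query_items(
        query="SELECT * FROM c WHERE c.tenantId = @t AND c.embeddingVersion != @v",
        parameters=[{"name": "@t", "value": tenant_id},
                    {"name": "@v", "value": NEW_VERSION}],
        partition_key=tenant_id,
    )
    for doc in stale:
        doc["embedding"] = aoai.embeddings.create(
            model="your-embedding-deployment",  # your Foundry deployment name
            input=doc["text"],
        ).data[0].embedding
        doc["embeddingVersion"] = NEW_VERSION
        container.upsert_item(doc)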

One practical indexing policy decision

For the chunks container, a common baseline is:

  • index tenantId, docId, businessDomain, aclTags, createdAt, and source
  • keep the default index for text only if you need keyword fallback
  • exclude /embedding/* from the standard indexing policy to reduce write RU if vector search is handled separately by the vector index capability rather than normal property indexing

The exact policy depends on your retrieval design, but the principle is simple: index what you filter on, not every large field by default.
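
As a sketch of that baseline, the policy below indexes the filterable metadata and excludes everything else, including the embedding array. The exact paths are assumptions based on the chunk schema above; verify them against your retrieval design before applying:

# Baseline indexing policy for chunks: index filter fields, exclude large arrays
$idxPolicy = @"
{
  "indexingMode": "consistent",
  "includedPaths": [
    { "path": "/tenantId/?" },
    { "path": "/docId/?" },
    { "path": "/businessDomain/?" },
    { "path": "/aclTags/[]/?" },
    { "path": "/createdAt/?" },
    { "path": "/source/?" }
  ],
  "excludedPaths": [
    { "path": "/embedding/*" },
    { "path": "/*" }
  ]
}
"@
$idxPolicy | Out-File -FilePath idx.json -Encoding utf8
az cosmosdb sql container update -g $rg -a $cosmos -d brain -n chunks --idx '@idx.json' | Out-Null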

TTL strategy

A good default:

  • memory: TTL enabled
  • conversations: TTL enabled
  • preferences: no TTL
  • chunks: no TTL
  • documents: no TTL unless source lifecycle requires it

The point is to make short-lived context expire automatically while durable knowledge remains stable.


Step 4: Build the ingestion and chunking pipeline

This is where the second brain becomes useful.

Your ingestion pipeline should do more than just split text every N characters. In production, it should:

  • read files from Blob Storage or enterprise connectors
  • normalize encodings and whitespace
  • remove boilerplate where possible
  • chunk text into semantically meaningful segments
  • attach source and ACL metadata
  • generate embeddings
  • upsert records idempotently
  • surface poison documents for manual review
  • support reprocessing when chunking or embedding strategy changes

A practical chunking baseline

A simple starting point that works better than fixed-width slicing (a chunking sketch follows this list):

  • chunk by section or heading when available
  • target roughly 300–800 tokens per chunk
  • use 10–20% overlap for narrative documents
  • keep tables, lists, and policy clauses intact where possible
  • store a stable contentHash so re-ingestion can detect unchanged chunks
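
A minimal sketch of that baseline, using a crude heading heuristic and word counts as a rough token proxy (the tutorial's core loop below keeps naive fixed-width slicing for brevity):

# Heading-aware chunking with overlap; word counts stand in for tokens
import re

def chunk_by_section(text, max_words=400, overlap_words=60):
    sections = re.split(r"\n(?=[A-Z#].{0,80}\n)", text)  # crude heading heuristic
    chunks = []
    step = max_words - overlap_words
    for section in sections:
        words = section.split()
        for start in range(0, len(words), step):
            piece = " ".join(words[start:start + max_words])
            if piece.strip():
                chunks.append(piece)
    return chunks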

The following Python example shows the core loop: chunk text, call an embeddings deployment, and store chunk records in Cosmos DB. It uses Entra-backed auth for the embeddings call and, for brevity, an account key for Cosmos DB.

# Ingest documents, chunk text, generate embeddings from Foundry, and store chunk records in Cosmos DB
import os, uuid
from azure.identity import DefaultAzureCredential
from azure.cosmos import CosmosClient, PartitionKey
from openai import AzureOpenAI

text = "Azure Foundry helps build enterprise copilots. Cosmos DB stores durable memory and chunks."
chunks = [text[i:i+60] for i in range(0, len(text), 60)]

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
    azure_ad_token_provider=lambda: DefaultAzureCredential().get_token("https://cognitiveservices.azure.com/.default").token,
)
cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=os.environ["COSMOS_KEY"])
container = cosmos.get_database_client("brain").get_container_client("chunks")

for i, chunk in enumerate(chunks):
    emb = client.embeddings.create(model="text-embedding-3-large", input=chunk).data[0].embedding
    doc = {"id": str(uuid.uuid4()), "tenantId": "contoso", "docId": "handbook-001", "chunkId": i,
           "text": chunk, "embedding": emb, "source": "handbook.pdf", "category": "policy"}
    container.upsert_item(doc)

Use the same pattern, but switch the deployment name and Cosmos auth to your production baseline:

  • set model="your-embedding-deployment" because Azure expects the Foundry deployment name
  • use managed identity or another Entra-backed credential for Cosmos DB instead of account keys in production

A production-oriented version of the same flow looks like this:

import os, uuid, hashlib
from azure.identity import DefaultAzureCredential
from azure.cosmos import CosmosClient
from openai import AzureOpenAI

credential = DefaultAzureCredential()

aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
    azure_ad_token_provider=lambda: credential.get_token(
        "https://cognitiveservices.azure.com/.default"
    ).token,
)

cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=credential)
container = cosmos.get_database_client("brain").get_container_client("chunks")

text = "Azure Foundry helps build enterprise copilots. Cosmos DB stores durable memory and chunks."
chunks = [text[i:i+60] for i in range(0, len(text), 60)]

for i, chunk in enumerate(chunks):
    emb = aoai.embeddings.create(
        model="your-embedding-deployment",
        input=chunk
    ).data[0].embedding

    chunk_id = f"contoso|handbook-001|{i:05d}"
    doc = {
        "id": chunk_id,
        "tenantId": "contoso",
        "docId": "handbook-001",
        "chunkId": f"{i:05d}",
        "text": chunk,
        "embedding": emb,
        "source": "handbook.pdf",
        "sourceUri": "https://storageaccount.blob.core.windows.net/docs/handbook.pdf",
        "documentType": "policy",
        "businessDomain": "it",
        "aclTags": ["group:it-admins"],
        "embeddingModel": "your-embedding-deployment",
        "embeddingVersion": "2025-01",
        "contentHash": hashlib.sha256(chunk.encode("utf-8")).hexdigest(),
    }
    container.upsert_item(doc)

What to observe: each chunk is stored with tenant, document, source, and ACL metadata. The IDs are stable, the embedding deployment is explicit, and the auth pattern is consistent with managed identity guidance.

Production notes for ingestion

A few hard-earned lessons (an idempotency sketch follows this list):

  • Use deterministic document IDs so repeated ingestion updates instead of duplicating.
  • Preserve source URIs and timestamps for traceability.
  • Carry ACL or classification tags into chunk metadata so retrieval can enforce authorization.
  • Keep a reprocessing hook because embedding deployment changes are inevitable.
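
A minimal sketch of the first two lessons, reusing the deterministic IDs and contentHash fields from the production ingestion example above:

# Skip re-ingesting a chunk whose content hash is unchanged
import hashlib
from azure.cosmos.exceptions import CosmosResourceNotFoundError

def needs_reingest(container, chunk_id, tenant_id, new_text):
    new_hash = hashlib.sha256(new_text.encode("utf-8")).hexdigest()
    try:
        existing = container.read_item(item=chunk_id, partition_key=tenant_id)
    except CosmosResourceNotFoundError:
        return True  # never ingested before
    return existing.get("contentHash") != new_hash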

Step 5: Implement retrieval and answer generation

A second brain should be retrieval-first, not model-first.

That means the answer path should:

  1. vectorize the user’s question
  2. retrieve relevant chunks with tenant and ACL filtering
  3. assemble a grounded prompt
  4. call the chat deployment
  5. return citations and confidence hints
  6. degrade gracefully when retrieval is weak

The next example shows a simplified retrieval-plus-answer flow. It computes an embedding for the question, fetches tenant-filtered chunks from Cosmos DB, builds a context window, and asks the chat deployment to answer only from that context. Note that it does not yet use the query vector or apply ACL filtering; the production version below does both.

# Retrieve relevant chunks from Cosmos DB and generate a grounded answer with inline citations
import os
from azure.cosmos import CosmosClient
from openai import AzureOpenAI

question = "What does our handbook say about enterprise copilots?"
aoai = AzureOpenAI(api_key=os.environ["AZURE_OPENAI_KEY"], azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], api_version="2024-02-01")
cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=os.environ["COSMOS_KEY"])
container = cosmos.get_database_client("brain").get_container_client("chunks")

qvec = aoai.embeddings.create(model="text-embedding-3-large", input=question).data[0].embedding
query = "SELECT TOP 3 c.text, c.source FROM c WHERE c.tenantId = @tenant"
items = list(container.query_items(query=query, parameters=[{"name":"@tenant","value":"contoso"}], enable_cross_partition_query=True))

context = "\n".join([f"[{i+1}] {x['text']} (source: {x['source']})" for i, x in enumerate(items)])
messages = [
    {"role": "system", "content": "Answer only from the provided context and cite sources like [1]."},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
]
resp = aoai.chat.completions.create(model="gpt-4o-mini", messages=messages, temperature=0)
print(resp.choices[0].message.content)

For a production baseline, make the retrieval actually use the query vector and keep the auth pattern consistent:

import os
from azure.identity import DefaultAzureCredential
from azure.cosmos import CosmosClient
from openai import AzureOpenAI

credential = DefaultAzureCredential()

question = "What does our handbook say about enterprise copilots?"
tenant_id = "contoso"
allowed_acl_tags = ["group:it-admins", "region:us"]

aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
    azure_ad_token_provider=lambda: credential.get_token(
        "https://cognitiveservices.azure.com/.default"
    ).token,
)

cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=credential)
container = cosmos.get_database_client("brain").get_container_client("chunks")

qvec = aoai.embeddings.create(
    model="your-embedding-deployment",
    input=question
).data[0].embedding

query = """
SELECT TOP 5
    c.text,
    c.source,
    c.sourceUri,
    VectorDistance(c.embedding, @qvec) AS score
FROM c
WHERE c.tenantId = @tenantId
  AND ARRAY_CONTAINS(@aclTags, c.aclTags[0], true)
ORDER BY VectorDistance(c.embedding, @qvec)
"""

items = list(container.query_items(
    query=query,
    parameters=[
        {"name": "@qvec", "value": qvec},
        {"name": "@tenantId", "value": tenant_id},
        {"name": "@aclTags", "value": allowed_acl_tags},
    ],
    partition_key=tenant_id
))

context = "\n".join(
    f"[{i+1}] {x['text']} (source: {x['source']})"
    for i, x in enumerate(items)
)

messages = [
    {
        "role": "system",
        "content": "Answer only from the provided context. Cite sources like [1]. If evidence is insufficient, say so."
    },
    {
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {question}"
    }
]

resp = aoai.chat.completions.create(
    model="your-chat-deployment",
    messages=messages,
    temperature=0
)

print(resp.choices[0].message.content)

If your ACL model stores multiple tags per chunk, tighten the filter to match your exact schema rather than relying on a single array position. The key point is that authorization filtering happens before the model sees the text.

ACL-filtered query example

Here is the retrieval rule in plain terms:

  • restrict to the caller’s tenant
  • restrict to documents whose ACL tags intersect the caller’s allowed tags
  • only then rank by vector similarity

That is the minimum safe pattern for enterprise retrieval.

Grounding rules that matter

Use prompt instructions like:

  • answer only from provided context
  • cite sources inline
  • say when the answer is not present
  • do not infer policy beyond the evidence

Those are not cosmetic. They reduce hallucination risk by narrowing the model’s allowed behavior.

When retrieval is weak

Do not force a confident answer. Return something like:

  • “I couldn’t find enough grounded evidence”
  • top matching citations for manual review
  • a suggestion to broaden the search scope or rephrase the question

That is better than a fluent but unsupported answer.
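
A minimal guard sketch, assuming the score field returned by the production retrieval query above. The thresholds, and whether lower or higher scores are better, depend on your configured distance metric; treat these values as assumptions to tune:

# Fall back instead of forcing an answer when retrieval evidence is weak
MIN_STRONG_MATCHES = 2
MAX_DISTANCE = 0.45  # assumed cutoff; direction depends on the metric

def fallback_if_weak(items):
    strong = [x for x in items if x.get("score", 1.0) <= MAX_DISTANCE]
    if len(strong) < MIN_STRONG_MATCHES:
        sources = ", ".join(x.get("source", "unknown") for x in items)
        return ("I couldn't find enough grounded evidence to answer confidently. "
                "Closest matches for manual review: " + sources)
    return None  # strong enough: proceed to grounded answer generation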


Step 6: Add durable memory and agent behaviors

Knowledge retrieval answers “what do our sources say?” Memory answers “what should this system remember over time?”

Those are different jobs.

Separate ephemeral context from durable memory

A practical memory design has at least two layers:

  • ephemeral session context: recent turns, short-lived, often TTL-based
  • durable user memory: stable preferences, recurring tasks, approved facts worth keeping

Do not dump every chat turn into long-term memory. That pollutes the system and increases cost.

The next example persists user preferences and a memory record in Cosmos DB, with a simple rule that avoids storing low-value or obviously sensitive content.

# Persist durable conversation memory and user preferences in Cosmos DB with TTL-aware records
import os, time, uuid
from azure.cosmos import CosmosClient

cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=os.environ["COSMOS_KEY"])
db = cosmos.get_database_client("brain")
memory = db.get_container_client("memory")
prefs = db.get_container_client("preferences")

user_id = "u-123"
conversation_id = "conv-456"
prefs.upsert_item({"id": user_id, "userId": user_id, "tone": "concise", "topics": ["azure", "cosmosdb"]})

message = {
    "id": str(uuid.uuid4()), "userId": user_id, "conversationId": conversation_id,
    "role": "assistant", "content": "Use grounded answers with citations.",
    "memoryType": "episodic", "ttl": 86400, "createdAt": int(time.time())
}
if len(message["content"]) > 20 and "password" not in message["content"].lower():
    memory.upsert_item(message)

To align with the rest of the architecture, use the same Entra-backed pattern here as well:

import os, time, uuid
from azure.identity import DefaultAzureCredential
from azure.cosmos import CosmosClient

credential = DefaultAzureCredential()
cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=credential)
db = cosmos.get_database_client("brain")
memory = db.get_container_client("memory")
prefs = db.get_container_client("preferences")

user_id = "u-123"
conversation_id = "conv-456"

prefs.upsert_item({
    "id": user_id,
    "userId": user_id,
    "tone": "concise",
    "topics": ["azure", "cosmosdb"]
})

message = {
    "id": str(uuid.uuid4()),
    "userId": user_id,
    "conversationId": conversation_id,
    "role": "assistant",
    "content": "Use grounded answers with citations.",
    "memoryType": "episodic",
    "ttl": 86400,
    "createdAt": int(time.time())
}

if len(message["content"]) > 20 and "password" not in message["content"].lower():
    memory.upsert_item(message)

Then, when serving a new request, load the recent memory plus stable preferences to build a more personalized system prompt:

# Load recent memory and preferences to build a personalized chat prompt
import os
from azure.cosmos import CosmosClient

cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=os.environ["COSMOS_KEY"])
db = cosmos.get_database_client("brain")
memory = db.get_container_client("memory")
prefs = db.get_container_client("preferences")

user_id = "u-123"
pref = prefs.read_item(item=user_id, partition_key=user_id)
recent = list(memory.query_items(
    query="SELECT TOP 5 c.role, c.content FROM c WHERE c.userId=@u ORDER BY c.createdAt DESC",
    parameters=[{"name":"@u","value":user_id}], enable_cross_partition_query=True))

system_prompt = f"User prefers a {pref['tone']} tone and cares about {', '.join(pref['topics'])}."
history = "\n".join([f"{m['role']}: {m['content']}" for m in recent])
print(system_prompt + "\n" + history)

And the managed identity version:

import os
from azure.identity import DefaultAzureCredential
from azure.cosmos import CosmosClient

credential = DefaultAzureCredential()
cosmos = CosmosClient(os.environ["COSMOS_URI"], credential=credential)
db = cosmos.get_database_client("brain")
memory = db.get_container_client("memory")
prefs = db.get_container_client("preferences")

user_id = "u-123"
pref = prefs.read_item(item=user_id, partition_key=user_id)

recent = list(memory.query_items(
    query="SELECT TOP 5 c.role, c.content FROM c WHERE c.userId=@u ORDER BY c.createdAt DESC",
    parameters=[{"name": "@u", "value": user_id}],
    partition_key=user_id
))

system_prompt = f"User prefers a {pref['tone']} tone and cares about {', '.join(pref['topics'])}."
history = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
print(system_prompt + "\n" + history)

What to observe: the system prompt is assembled from durable preference data and recent memory. In a real app, you would summarize older turns rather than replaying them raw.

Memory write rules

Good enterprise defaults (a gated-write sketch follows this list):

  • store preferences only after repeated confirmation or explicit user action
  • store task state only when it affects future work
  • never store secrets, credentials, or regulated data unless explicitly designed and approved
  • expire episodic memory automatically
  • log why a memory write happened
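
A minimal sketch of those defaults as a gated write helper. The gates, blocklist, and reason codes are illustrative assumptions, not a complete data-protection policy:

# Gate memory writes and record why each write happened
import re, time, uuid

BLOCKLIST = re.compile(r"password|secret|api[_-]?key", re.IGNORECASE)

def write_memory(memory, user_id, content, reason, ttl=86400):
    if len(content) < 20 or BLOCKLIST.search(content):
        return None  # low-value or potentially sensitive: do not persist
    return memory.upsert_item({
        "id": str(uuid.uuid4()),
        "userId": user_id,
        "content": content,
        "memoryType": "episodic",
        "writeReason": reason,  # audit trail for why this memory exists
        "ttl": ttl,
        "createdAt": int(time.time()),
    })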

Step 7: Put Azure API Management in front of the second brain

A second brain becomes enterprise-ready when it is a governed service, not just an app endpoint.

Azure API Management is a strong fit here because it gives you one Azure-native platform to govern:

  • chat APIs
  • ingestion APIs
  • retrieval APIs
  • admin APIs
  • AI-related endpoints and agent-facing interfaces

That matters once multiple internal apps, copilots, or agents start using the same knowledge system.

The sequence below shows the runtime path from user request through APIM to your backend, Cosmos DB, and Foundry.

[Diagram: runtime request path from user through APIM policy checks to the backend application, Cosmos DB, and Foundry]

What to observe: APIM is the front door. It enforces policy before your application spends tokens or RU on a request.

Now publish the backend through APIM with JWT validation, rate limiting, and backend routing:

# Publish an API through APIM with JWT validation, rate limiting, and backend routing
$rg = "rg-2ndbrain-demo"
$apim = "apim-2ndbrain-demo"
$apiId = "brain-api"
$backendUrl = "https://2ndbrain-api.azurewebsites.net"

az apim api create -g $rg --service-name $apim `
  --api-id $apiId --path "brain" --display-name "2nd Brain API" `
  --protocols https --service-url $backendUrl | Out-Null

$policy = @"
<policies>
  <inbound>
    <validate-jwt header-name="Authorization" require-scheme="Bearer" />
    <rate-limit-by-key calls="30" renewal-period="60" counter-key="@(context.Request.IpAddress)" />
    <set-backend-service base-url="$backendUrl" />
  </inbound>
  <backend /><outbound /><on-error />
</policies>
"@
# Note: core Azure CLI has no "az apim api policy" command; apply the policy
# through the management REST API instead.
$subId = az account show --query id -o tsv
@{ properties = @{ format = "rawxml"; value = $policy } } |
  ConvertTo-Json -Depth 5 | Out-File -FilePath policy-body.json -Encoding utf8
az rest --method put `
  --uri "https://management.azure.com/subscriptions/$subId/resourceGroups/$rg/providers/Microsoft.ApiManagement/service/$apim/apis/$apiId/policies/policy?api-version=2022-08-01" `
  --body '@policy-body.json' | Out-Null

A more realistic baseline policy includes issuer metadata and audience validation:

<policies>
  <inbound>
    <base />
    <validate-jwt header-name="Authorization" require-scheme="Bearer" failed-validation-httpcode="401" failed-validation-error-message="Unauthorized">
      <openid-config url="https://login.microsoftonline.com/<tenant-id>/v2.0/.well-known/openid-configuration" />
      <audiences>
        <audience>api://second-brain-api</audience>
      </audiences>
      <issuers>
        <issuer>https://login.microsoftonline.com/<tenant-id>/v2.0</issuer>
      </issuers>
      <required-claims>
        <claim name="scp" match="any">
          <value>SecondBrain.Read</value>
          <value>SecondBrain.Write</value>
        </claim>
      </required-claims>
    </validate-jwt>
    <set-variable name="tenantId" value="@(context.Principal?.Claims.GetValueOrDefault("tid",""))" />
    <rate-limit-by-key calls="30" renewal-period="60" counter-key="@(context.Principal?.Claims.GetValueOrDefault("oid", context.Request.IpAddress))" />
    <set-header name="x-correlation-id" exists-action="override">
      <value>@(context.RequestId.ToString())</value>
    </set-header>
    <set-backend-service base-url="https://secondbrain-api.azurewebsites.net" />
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>

What to do next: add tenant-based quotas, header normalization, correlation IDs, and request/response logging with redaction.
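
For example, a tenant-scoped quota can key on the tid claim. This is a sketch; tune the limits and the fallback key to your consumers:

<quota-by-key calls="10000" renewal-period="86400"
    counter-key="@(context.Request.Headers.GetValueOrDefault("Authorization","").Split(' ').Last().AsJwt()?.Claims.GetValueOrDefault("tid", "unknown") ?? "unknown")" />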

Why APIM belongs in this architecture

Without APIM, every team reinvents:

  • auth checks
  • rate limiting
  • quotas
  • request shaping
  • observability
  • versioning

With APIM, you centralize those controls and make the second brain reusable as a platform capability.


Step 8: Secure for enterprise requirements

Security is not an add-on to retrieval. It is part of retrieval.

Core controls

At minimum, implement:

  • Entra ID for user and service authentication
  • Managed identity for service-to-service access
  • RBAC for least privilege
  • Private endpoints where required
  • Network isolation for sensitive deployments
  • Key Vault for secret material that cannot yet be eliminated
  • Audit trails for prompts, citations, and admin actions

ACL-aware retrieval

The single most important security rule in enterprise RAG is this:

Only retrieve chunks the caller is authorized to see.

That means document permissions must be propagated into chunk metadata, usually via:

  • ACL tags
  • tenant and business-unit labels
  • sensitivity classification
  • source system permission IDs

Then every retrieval query must filter on those attributes before the model sees the text.

If you skip this, you do not have enterprise retrieval. You have a data leakage path.

One strong governance reminder

Keep governance and data-residency decisions close to the architecture and security layers, not repeated across every section. Decide early where data can live, how identities are issued, and which boundaries APIM and networking must enforce.


Step 9: Optimize cost, throughput, and reliability

A second brain that works for 20 users but collapses at 2,000 is still a prototype.

Control token spend

The biggest AI cost drivers are usually:

  • oversized prompts
  • retrieving too many chunks
  • using a high-cost deployment for every request
  • re-embedding too often
  • storing too much irrelevant memory

Practical controls (a prompt-budget sketch follows this list):

  • keep chunks semantically coherent but not oversized
  • cap the number of retrieved chunks
  • use prompt budgets
  • cache embeddings for repeated queries where appropriate
  • route simple Q&A to a lower-cost chat deployment
  • summarize long histories instead of replaying them
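
A minimal sketch of the chunk cap and prompt budget together, using word counts as a rough token proxy (an assumption; swap in a real tokenizer for production):

# Cap retrieved chunks, then enforce a word-based prompt budget
def build_context(items, max_chunks=5, budget_words=1200):
    lines, used = [], 0
    for i, x in enumerate(items[:max_chunks]):   # cap retrieved chunks first
        words = len(x["text"].split())
        if used + words > budget_words:          # then enforce the budget
            break
        lines.append(f"[{i+1}] {x['text']} (source: {x['source']})")
        used += words
    return "\n".join(lines)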

Tune Cosmos DB RU consumption

Cosmos DB cost and performance are dominated by:

  • partition key choice
  • item size
  • indexing policy
  • query pattern
  • cross-partition fan-out

Practical tuning levers:

  • choose partition keys aligned to your dominant access pattern
  • avoid one giant “everything” container
  • exclude non-queryable fields from indexing where appropriate
  • keep chunk documents compact
  • precompute metadata needed for retrieval filters
  • use TTL to auto-delete short-lived memory

Reliability targets

Define service-level objectives early:

  • ingestion latency
  • retrieval latency
  • answer latency
  • answer groundedness
  • cost per query
  • ingestion failure rate

If you do not define those, you cannot make rational trade-offs.


Step 10: Evaluate quality and productionize

This is where many teams stop too early.

A second brain is only enterprise-ready when you can measure whether it is actually helping.

Evaluate the retrieval layer

Track:

  • retrieval precision for known questions
  • citation coverage
  • ACL correctness
  • freshness of indexed content
  • failure rate on ingestion

Evaluate answer quality

Track:

  • groundedness
  • citation correctness
  • refusal quality when evidence is missing
  • user satisfaction
  • task completion rate for target workflows

Build replayable evaluation sets

Do not rely only on synthetic benchmarks. Build a test set from real enterprise questions (a scoring sketch follows the list below), then replay it when you change:

  • chunking strategy
  • embedding deployment
  • prompt template
  • retrieval ranking
  • memory rules
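
A minimal scoring sketch for one of those metrics, citation coverage. The eval-set format and the answer_path function are assumptions standing in for your real answer pipeline:

# Score citation coverage for a replayable eval set
eval_set = [
    {"question": "What does our handbook say about enterprise copilots?",
     "expected_sources": ["handbook.pdf"]},
]

def citation_coverage(cited_sources, expected_sources):
    if not expected_sources:
        return 0.0
    hits = sum(1 for s in expected_sources if s in cited_sources)
    return hits / len(expected_sources)

# Hypothetical replay loop: answer_path() is your retrieval-plus-chat pipeline
# scores = [citation_coverage(answer_path(q["question"]), q["expected_sources"])
#           for q in eval_set]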

Rollout plan

A practical rollout path:

  1. start with one business domain
  2. instrument everything
  3. validate quality and cost
  4. expand to adjacent domains
  5. standardize the platform for wider internal reuse

That staged approach is far safer than trying to AI-enable the whole intranet in one move.


Common pitfalls and design trade-offs

A few mistakes show up again and again.

1) One container for everything

This usually leads to poor partition efficiency, messy indexing, and rising RU costs. Different data types have different access patterns. Model them that way.

2) Treating chat history as knowledge

Conversation text is not automatically durable truth. Knowledge should be curated, sourced, and versioned. Memory should be selective.

3) Ignoring API governance until later

Once multiple teams and agents appear, lack of API governance becomes a delivery bottleneck. Put APIM in front early.

4) Underestimating re-embedding cost

Embedding migrations are real. Store version metadata and make reprocessing deliberate.

5) Allowing unbounded memory growth

If everything is remembered, nothing is useful. Use TTL, summarization, and explicit memory write rules.

6) Skipping ACL propagation

If permissions do not flow from source to chunk to retrieval query, your system is not safe for enterprise use.


What makes this architecture enterprise-ready

To summarize the pattern:

  • Foundry handles model orchestration for chat and embeddings
  • Cosmos DB stores durable knowledge, memory, and operational state
  • APIM governs access, quotas, policies, and reuse
  • Azure-native identity and networking provide enforceable security boundaries
  • Monitoring and evaluation make quality and cost measurable

This is what moves you from “RAG demo” to “enterprise second brain.”

Not because it is more complicated for its own sake.

Because real enterprise systems need:

  • security
  • durability
  • observability
  • governance
  • predictable cost

And they need to be operable by platform teams, not only by the team that built the first demo.


Final takeaway

If you are building internal AI systems this year, my recommendation is simple:

Start with one domain where grounded knowledge actually matters. Build retrieval first. Add durable memory carefully. Put APIM in front early. Measure quality and cost before you scale.

That gives you a second brain your organization can trust.

If you want, I can turn this into a follow-up post with:

  • a full reference repo structure
  • a FastAPI implementation
  • APIM policy examples for multi-tenant governance
  • a production evaluation checklist for Azure Foundry + Cosmos DB

If that would be useful, comment “part 2” and tell me which piece you want next.



Try it yourself

Run this tutorial as a Jupyter notebook: Download runbook.ipynb (31 cells, 33 KB).
