{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.13.0"
    },
    "blog_metadata": {
      "topic": "How Microsoft Is Bringing PII Testing Into Azure AI Language and Foundry",
      "slug": "how-microsoft-is-bringing-pii-testing-into-azure-ai-language",
      "generated_by": "LinkedIn Post Generator + Azure OpenAI",
      "generated_at": "2026-07-02T12:39:29.449Z"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# How Microsoft Is Bringing PII Testing Into Azure AI Language and Foundry\n",
        "\n",
        "This notebook turns the blog post into a hands-on validation workflow. It shows how to build a small PII regression dataset, test Azure AI Language detection and redaction behavior, simulate Foundry-adjacent leakage checks for generative outputs, and enforce simple release-gate logic with Python."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "%pip install -q requests pandas"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import os\n",
        "import json\n",
        "import re\n",
        "from statistics import mean\n",
        "\n",
        "import requests\n",
        "import pandas as pd"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Lifecycle view of PII testing\n",
        "\n",
        "The blog distinguishes two complementary surfaces:\n",
        "\n",
        "- **Azure AI Language** for targeted PII detection and redaction validation across text, conversations, and documents.\n",
        "- **Azure AI Foundry** for testing whether generative workflows produce outputs that leak sensitive information.\n",
        "\n",
        "The practical governance idea is simple: detection quality, redaction quality, and leakage rate should become measurable controls rather than policy statements."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Workflow diagram\n",
        "\n",
        "This cell captures the release-gate flow described in the post as notebook-friendly text."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "workflow = {\n",
        "    \"language_path\": [\n",
        "        \"Curated PII test dataset\",\n",
        "        \"Azure AI Language PII detection\",\n",
        "        \"Compare detected entities to expected labels\",\n",
        "        \"Score precision / recall / redaction accuracy\",\n",
        "        \"Threshold decision\",\n",
        "        \"Publish report or fail build\"\n",
        "    ],\n",
        "    \"foundry_path\": [\n",
        "        \"Generative app output\",\n",
        "        \"PII leakage evaluator\",\n",
        "        \"Threshold decision\",\n",
        "        \"Pass or fail deployment\"\n",
        "    ]\n",
        "}\n",
        "\n",
        "print(json.dumps(workflow, indent=2))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Build a tiny curated regression dataset\n",
        "\n",
        "Start with a compact benchmark that includes source text, expected entities, and expected redacted output. This becomes the seed for repeatable privacy regression testing."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "dataset = [\n",
        "    {\n",
        "        \"id\": \"1\",\n",
        "        \"text\": \"Call John Doe at 555-123-4567.\",\n",
        "        \"expected_entities\": [\n",
        "            {\"text\": \"John Doe\", \"category\": \"Person\"},\n",
        "            {\"text\": \"555-123-4567\", \"category\": \"PhoneNumber\"}\n",
        "        ],\n",
        "        \"expected_redacted\": \"Call ******** at ************.\"\n",
        "    },\n",
        "    {\n",
        "        \"id\": \"2\",\n",
        "        \"text\": \"SSN 123-45-6789 belongs to Alice.\",\n",
        "        \"expected_entities\": [\n",
        "            {\"text\": \"123-45-6789\", \"category\": \"USSocialSecurityNumber\"},\n",
        "            {\"text\": \"Alice\", \"category\": \"Person\"}\n",
        "        ],\n",
        "        \"expected_redacted\": \"SSN *********** belongs to *****.\"\n",
        "    }\n",
        "]\n",
        "\n",
        "print(json.dumps(dataset, indent=2))\n",
        "pd.DataFrame([\n",
        "    {\n",
        "        \"id\": row[\"id\"],\n",
        "        \"text\": row[\"text\"],\n",
        "        \"expected_redacted\": row[\"expected_redacted\"],\n",
        "        \"expected_entity_count\": len(row[\"expected_entities\"])\n",
        "    }\n",
        "    for row in dataset\n",
        "])"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Required environment variables for Azure AI Language API\n",
        "\n",
        "To run the next live service call, set these variables in your notebook environment:\n",
        "\n",
        "- `AZURE_LANGUAGE_ENDPOINT`\n",
        "- `AZURE_LANGUAGE_KEY`\n",
        "\n",
        "Example endpoint format: `https://<your-resource>.cognitiveservices.azure.com`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Call Azure AI Language PII detection\n",
        "\n",
        "This example sends a small batch to Azure AI Language using the request shape shown in the blog. It validates both detected entities and the returned `redactedText`. If credentials are not present, the cell will skip the live call gracefully."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "endpoint = os.getenv(\"AZURE_LANGUAGE_ENDPOINT\", \"\").rstrip(\"/\")\n",
        "key = os.getenv(\"AZURE_LANGUAGE_KEY\", \"\")\n",
        "\n",
        "if not endpoint or not key:\n",
        "    print(\"Skipping live Azure AI Language call. Set AZURE_LANGUAGE_ENDPOINT and AZURE_LANGUAGE_KEY to run this cell.\")\n",
        "else:\n",
        "    url = f\"{endpoint}/language/:analyze-text?api-version=2023-04-01\"\n",
        "    payload = {\n",
        "        \"kind\": \"PiiEntityRecognition\",\n",
        "        \"parameters\": {\"modelVersion\": \"latest\", \"domain\": \"none\"},\n",
        "        \"analysisInput\": {\n",
        "            \"documents\": [\n",
        "                {\"id\": \"1\", \"language\": \"en\", \"text\": \"Call John Doe at 555-123-4567.\"}\n",
        "            ]\n",
        "        }\n",
        "    }\n",
        "    headers = {\n",
        "        \"Ocp-Apim-Subscription-Key\": key,\n",
        "        \"Content-Type\": \"application/json\"\n",
        "    }\n",
        "    response = requests.post(url, headers=headers, json=payload, timeout=30)\n",
        "    response.raise_for_status()\n",
        "    result = response.json()[\"results\"][\"documents\"][0]\n",
        "    print(\"Redacted text:\", result.get(\"redactedText\"))\n",
        "    print(\"Entities:\")\n",
        "    for entity in result.get(\"entities\", []):\n",
        "        print(entity.get(\"text\"), entity.get(\"category\"), round(entity.get(\"confidenceScore\", 0.0), 3))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Score Azure AI Language results against expectations\n",
        "\n",
        "This scoring function computes precision, recall, and redaction correctness. These are the minimum viable metrics for turning privacy checks into a release control."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "def score_case(expected_entities, actual_entities, expected_redacted, actual_redacted):\n",
        "    expected = {(e[\"text\"], e[\"category\"]) for e in expected_entities}\n",
        "    actual = {(e[\"text\"], e[\"category\"]) for e in actual_entities}\n",
        "    tp = len(expected & actual)\n",
        "    fp = len(actual - expected)\n",
        "    fn = len(expected - actual)\n",
        "    precision = tp / (tp + fp) if (tp + fp) else 1.0\n",
        "    recall = tp / (tp + fn) if (tp + fn) else 1.0\n",
        "    redaction_ok = expected_redacted == actual_redacted\n",
        "    return {\"precision\": precision, \"recall\": recall, \"redaction_ok\": redaction_ok}\n",
        "\n",
        "expected = [\n",
        "    {\"text\": \"John Doe\", \"category\": \"Person\"},\n",
        "    {\"text\": \"555-123-4567\", \"category\": \"PhoneNumber\"}\n",
        "]\n",
        "actual = [\n",
        "    {\"text\": \"John Doe\", \"category\": \"Person\"},\n",
        "    {\"text\": \"555-123-4567\", \"category\": \"PhoneNumber\"}\n",
        "]\n",
        "\n",
        "scores = score_case(\n",
        "    expected,\n",
        "    actual,\n",
        "    \"Call ******** at ************.\",\n",
        "    \"Call ******** at ************.\"\n",
        ")\n",
        "print(scores)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Run a compact regression harness for CI-style metrics\n",
        "\n",
        "This example aggregates multiple test cases into summary metrics such as average precision, average recall, and redaction pass rate. These outputs are suitable for CI/CD artifacts and release-gate decisions."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "tests = [\n",
        "    {\n",
        "        \"id\": \"1\",\n",
        "        \"expected_entities\": [{\"text\": \"John Doe\", \"category\": \"Person\"}],\n",
        "        \"actual_entities\": [{\"text\": \"John Doe\", \"category\": \"Person\"}],\n",
        "        \"expected_redacted\": \"Hi ****\",\n",
        "        \"actual_redacted\": \"Hi ****\"\n",
        "    },\n",
        "    {\n",
        "        \"id\": \"2\",\n",
        "        \"expected_entities\": [{\"text\": \"123-45-6789\", \"category\": \"USSocialSecurityNumber\"}],\n",
        "        \"actual_entities\": [],\n",
        "        \"expected_redacted\": \"***\",\n",
        "        \"actual_redacted\": \"123-45-6789\"\n",
        "    }\n",
        "]\n",
        "\n",
        "def score_test_case(t):\n",
        "    exp = {(e[\"text\"], e[\"category\"]) for e in t[\"expected_entities\"]}\n",
        "    act = {(e[\"text\"], e[\"category\"]) for e in t[\"actual_entities\"]}\n",
        "    tp = len(exp & act)\n",
        "    fp = len(act - exp)\n",
        "    fn = len(exp - act)\n",
        "    return {\n",
        "        \"precision\": tp / (tp + fp) if (tp + fp) else 1.0,\n",
        "        \"recall\": tp / (tp + fn) if (tp + fn) else 1.0,\n",
        "        \"redaction_ok\": t[\"expected_redacted\"] == t[\"actual_redacted\"]\n",
        "    }\n",
        "\n",
        "results = [score_test_case(t) for t in tests]\n",
        "summary = {\n",
        "    \"avg_precision\": mean(r[\"precision\"] for r in results),\n",
        "    \"avg_recall\": mean(r[\"recall\"] for r in results),\n",
        "    \"redaction_pass_rate\": sum(r[\"redaction_ok\"] for r in results) / len(results)\n",
        "}\n",
        "\n",
        "print(\"Per-test results:\")\n",
        "print(json.dumps(results, indent=2))\n",
        "print(\"\\nSummary:\")\n",
        "print(json.dumps(summary, indent=2))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Generative workflow sequence as structured data\n",
        "\n",
        "The blog also describes a generative evaluation path: prompt the model, inspect generated output, score leakage, and use the result as a deployment gate."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sequence = [\n",
        "    {\"from\": \"GenAI App\", \"to\": \"Foundry Model\", \"message\": \"Prompt / test scenario\"},\n",
        "    {\"from\": \"Foundry Model\", \"to\": \"GenAI App\", \"message\": \"Generated output\"},\n",
        "    {\"from\": \"GenAI App\", \"to\": \"PII Evaluator\", \"message\": \"Submit output for PII checks\"},\n",
        "    {\"from\": \"PII Evaluator\", \"to\": \"Release Gate\", \"message\": \"Leakage metrics + violations\"},\n",
        "    {\"from\": \"Release Gate\", \"to\": \"GenAI App\", \"message\": \"Pass or fail deployment\"}\n",
        "]\n",
        "\n",
        "print(json.dumps(sequence, indent=2))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Evaluate generated outputs for likely PII leakage patterns\n",
        "\n",
        "This notebook-safe example uses regex rules to flag common PII patterns such as email, phone number, and SSN. It is not a replacement for Azure-native controls, but it is useful for illustrating application-side leakage evaluation."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "outputs = [\n",
        "    \"Customer email is jane.doe@contoso.com and phone is 555-222-1212.\",\n",
        "    \"Your order is confirmed. Reference number: A12345.\"\n",
        "]\n",
        "\n",
        "patterns = {\n",
        "    \"email\": r\"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b\",\n",
        "    \"phone\": r\"\\b\\d{3}[-.\\s]?\\d{3}[-.\\s]?\\d{4}\\b\",\n",
        "    \"ssn\": r\"\\b\\d{3}-\\d{2}-\\d{4}\\b\"\n",
        "}\n",
        "\n",
        "for text in outputs:\n",
        "    hits = {name: re.findall(pattern, text) for name, pattern in patterns.items()}\n",
        "    violations = {k: v for k, v in hits.items() if v}\n",
        "    print({\"text\": text, \"violations\": violations, \"passed\": not violations})"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Foundry-adjacent leakage-rate harness across prompts\n",
        "\n",
        "This example scores a small set of prompt/output pairs and computes leakage rate. The key idea from the blog is that leakage is an application metric: prompt changes, retrieval changes, and grounding changes can all affect privacy outcomes."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "samples = [\n",
        "    {\"prompt\": \"Summarize the support case.\", \"output\": \"User john@contoso.com requested a refund.\"},\n",
        "    {\"prompt\": \"Draft a generic response.\", \"output\": \"Thanks for contacting support.\"}\n",
        "]\n",
        "\n",
        "def has_pii(text):\n",
        "    rules = [\n",
        "        r\"\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b\",\n",
        "        r\"\\b\\d{3}-\\d{2}-\\d{4}\\b\"\n",
        "    ]\n",
        "    return any(re.search(rule, text) for rule in rules)\n",
        "\n",
        "evaluated = [\n",
        "    {\"prompt\": s[\"prompt\"], \"output\": s[\"output\"], \"passed\": not has_pii(s[\"output\"])}\n",
        "    for s in samples\n",
        "]\n",
        "summary = {\n",
        "    \"total\": len(evaluated),\n",
        "    \"passed\": sum(x[\"passed\"] for x in evaluated),\n",
        "    \"leakage_rate\": 1 - (sum(x[\"passed\"] for x in evaluated) / len(evaluated))\n",
        "}\n",
        "\n",
        "print(json.dumps({\"results\": evaluated, \"summary\": summary}, indent=2))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Enforce a release gate in Python\n",
        "\n",
        "The original post showed PowerShell for deployment gating. This Python version applies the same logic so the notebook remains fully executable in the requested primary language."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "metrics = {\n",
        "    \"avg_precision\": 0.96,\n",
        "    \"avg_recall\": 0.91,\n",
        "    \"redaction_pass_rate\": 0.98\n",
        "}\n",
        "\n",
        "thresholds = {\n",
        "    \"avg_precision\": 0.95,\n",
        "    \"avg_recall\": 0.95,\n",
        "    \"redaction_pass_rate\": 0.99\n",
        "}\n",
        "\n",
        "failed_checks = {\n",
        "    name: {\"metric\": metrics[name], \"threshold\": thresholds[name]}\n",
        "    for name in thresholds\n",
        "    if metrics[name] < thresholds[name]\n",
        "}\n",
        "\n",
        "if failed_checks:\n",
        "    print(\"PII quality gate failed\")\n",
        "    print(json.dumps(failed_checks, indent=2))\n",
        "else:\n",
        "    print(\"PII quality gate passed\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Load metrics from a JSON artifact and enforce recall threshold\n",
        "\n",
        "This mirrors the CI example from the blog, but uses Python to parse a JSON artifact and fail or pass based on a minimum recall requirement."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "json_artifact = '''\n",
        "{\n",
        "  \"avg_precision\": 0.97,\n",
        "  \"avg_recall\": 0.94,\n",
        "  \"redaction_pass_rate\": 1.0\n",
        "}\n",
        "'''\n",
        "\n",
        "metrics_from_artifact = json.loads(json_artifact)\n",
        "min_recall = 0.95\n",
        "\n",
        "if metrics_from_artifact[\"avg_recall\"] < min_recall:\n",
        "    print(f\"PII recall below threshold: {metrics_from_artifact['avg_recall']}\")\n",
        "else:\n",
        "    print(\"Metrics accepted:\")\n",
        "    print(json.dumps(metrics_from_artifact, indent=2))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Failure modes to validate explicitly\n",
        "\n",
        "As you adapt these examples, test for the failure modes highlighted in the blog:\n",
        "\n",
        "- false positives and over-redaction\n",
        "- domain-specific identifier misses\n",
        "- multilingual edge cases\n",
        "- indirect leakage in generative outputs\n",
        "- procedural failure, where demos exist but no release gate is enforced"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Summary\n",
        "\n",
        "This notebook translated the article into a practical validation flow: create a curated PII dataset, test Azure AI Language for detection and redaction, estimate leakage in generative outputs, and turn the resulting metrics into a release decision. The central idea is that privacy testing becomes governance only when it can block a build, stop a deployment, or trigger follow-up action.\n",
        "\n",
        "## Next Steps\n",
        "\n",
        "1. Replace the toy dataset with domain-specific identifiers from your enterprise.\n",
        "2. Run the live Azure AI Language call against a larger multilingual benchmark.\n",
        "3. Connect generative-output checks to your real prompt and retrieval test sets.\n",
        "4. Persist metrics as CI artifacts and enforce thresholds in your deployment pipeline.\n",
        "5. Add runtime controls through API gateway, logging, and incident-response workflows."
      ]
    }
  ]
}