{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.13.0"
    },
    "blog_metadata": {
      "topic": "From Data Movement to Decision Velocity: Why Faster Python Reads Matter for Copilot and AI Analytics",
      "slug": "from-data-movement-to-decision-velocity-why-faster-python-re",
      "generated_by": "LinkedIn Post Generator + Azure OpenAI",
      "generated_at": "2026-05-04T14:02:03.635Z"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# From Data Movement to Decision Velocity: Why Faster Python Reads Matter for Copilot and AI Analytics\n",
        "\n",
        "This notebook turns the blog post into a hands-on validation workflow. It focuses on the practical bottlenecks between governed SQL data and Python analytics, including query pushdown, dataframe materialization, chunked reads, and prompt-ready summaries for Copilot-style workflows.\n",
        "\n",
        "The goal is not benchmark theater. It is to help you identify whether your real bottleneck is query design, network path, driver overhead, dataframe conversion, memory pressure, or governance constraints."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "%pip install --upgrade pip\n",
        "%pip install pandas polars sqlalchemy pyodbc azure-identity"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import os\n",
        "import time\n",
        "import socket\n",
        "from textwrap import dedent\n",
        "\n",
        "import pandas as pd\n",
        "import polars as pl\n",
        "from sqlalchemy import create_engine, text\n",
        "\n",
        "pd.set_option(\"display.max_columns\", 20)\n",
        "pd.set_option(\"display.width\", 120)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Workflow map: where read latency enters the loop\n",
        "\n",
        "This cell captures the blog post's core idea as a simple Python representation of the SQL-to-Python decision path. It helps frame the rest of the notebook around the bridge between governed data and fast iteration in notebooks, feature engineering, and Copilot-assisted analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "workflow_steps = [\n",
        "    \"SQL Server / Azure SQL\",\n",
        "    \"Secure connection + query pushdown\",\n",
        "    \"Result transfer to Python\",\n",
        "    \"pandas DataFrame path\",\n",
        "    \"Arrow-friendly interchange\",\n",
        "    \"Polars DataFrame path\",\n",
        "    \"Feature engineering / BI / notebooks\",\n",
        "    \"Copilot prompts, summaries, anomaly checks\",\n",
        "    \"Faster decision loops\",\n",
        "]\n",
        "\n",
        "for i, step in enumerate(workflow_steps, start=1):\n",
        "    print(f\"{i}. {step}\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Required environment variables for SQL connectivity\n",
        "\n",
        "The next examples use SQL Server or Azure SQL connection settings. Set these variables before running live database reads:\n",
        "\n",
        "- `SQL_SERVER`\n",
        "- `SQL_DATABASE`\n",
        "- `SQL_USERNAME` and `SQL_PASSWORD` for SQL auth, or\n",
        "- Azure identity settings such as `AZURE_TENANT_ID` and `AZURE_CLIENT_ID` if your environment uses Entra-based auth\n",
        "\n",
        "You may also need the ODBC Driver 18 for SQL Server installed on the machine."
      ]
    },
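    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Optional: seed session configuration in-notebook\n",
        "\n",
        "If you prefer to configure these values inside the notebook, the optional cell below seeds placeholders with `os.environ.setdefault`. The endpoint names shown are assumptions, not real servers; replace them with your own, and avoid hard-coding secrets in cells."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Optional session configuration. The values below are placeholders\n",
        "# (assumed endpoint names, not real servers); replace before live reads.\n",
        "# Avoid putting real secrets in notebook cells.\n",
        "# os.environ.setdefault(\"SQL_SERVER\", \"myserver.database.windows.net\")\n",
        "# os.environ.setdefault(\"SQL_DATABASE\", \"mydb\")\n",
        "\n",
        "configured = {\n",
        "    name: bool(os.getenv(name))\n",
        "    for name in [\"SQL_SERVER\", \"SQL_DATABASE\", \"SQL_USERNAME\", \"SQL_PASSWORD\"]\n",
        "}\n",
        "print(configured)"
      ],
      "execution_count": null,
      "outputs": []
    },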
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Compare pandas and Polars read paths\n",
        "\n",
        "This example mirrors the blog's mixed workflow: read from SQL into pandas, then convert to Polars. That is useful in many teams, but it is not the same as a fully Arrow-native path end to end. The timing output helps you see where transfer and conversion overhead may appear."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "server = os.getenv(\"SQL_SERVER\", \"myserver.database.windows.net\")\n",
        "database = os.getenv(\"SQL_DATABASE\", \"mydb\")\n",
        "username = os.getenv(\"SQL_USERNAME\")\n",
        "password = os.getenv(\"SQL_PASSWORD\")\n",
        "\n",
        "if username and password:\n",
        "    conn_str = (\n",
        "        f\"mssql+pyodbc://{username}:{password}@{server}/{database}\"\n",
        "        \"?driver=ODBC+Driver+18+for+SQL+Server\"\n",
        "        \"&Encrypt=yes&TrustServerCertificate=no\"\n",
        "    )\n",
        "else:\n",
        "    conn_str = (\n",
        "        f\"mssql+pyodbc://@{server}/{database}\"\n",
        "        \"?driver=ODBC+Driver+18+for+SQL+Server\"\n",
        "        \"&Authentication=ActiveDirectoryIntegrated\"\n",
        "        \"&Encrypt=yes&TrustServerCertificate=no\"\n",
        "    )\n",
        "\n",
        "sql = dedent(\"\"\"\n",
        "SELECT TOP (10000) order_id, customer_id, order_total, order_date\n",
        "FROM dbo.fact_orders\n",
        "WHERE order_date >= DATEADD(day, -30, SYSUTCDATETIME())\n",
        "\"\"\")\n",
        "\n",
        "try:\n",
        "    engine = create_engine(conn_str)\n",
        "    t0 = time.perf_counter()\n",
        "    pdf = pd.read_sql(sql, engine)\n",
        "    t1 = time.perf_counter()\n",
        "    pldf = pl.from_pandas(pdf)\n",
        "    t2 = time.perf_counter()\n",
        "\n",
        "    print(\"pandas rows:\", len(pdf))\n",
        "    print(\"pandas read seconds:\", round(t1 - t0, 4))\n",
        "    print(\"pandas->polars conversion seconds:\", round(t2 - t1, 4))\n",
        "    print(pdf.head(2))\n",
        "    print(pldf.head(2))\n",
        "except Exception as e:\n",
        "    print(\"Live SQL read skipped or failed:\", e)\n",
        "    sample_pdf = pd.DataFrame(\n",
        "        {\n",
        "            \"order_id\": [101, 102, 103],\n",
        "            \"customer_id\": [1, 2, 1],\n",
        "            \"order_total\": [120.5, 89.0, 42.25],\n",
        "            \"order_date\": pd.to_datetime([\"2025-04-01\", \"2025-04-02\", \"2025-04-03\"]),\n",
        "        }\n",
        "    )\n",
        "    t1 = time.perf_counter()\n",
        "    sample_pldf = pl.from_pandas(sample_pdf)\n",
        "    t2 = time.perf_counter()\n",
        "    print(\"Using sample data instead.\")\n",
        "    print(sample_pdf.head(2))\n",
        "    print(sample_pldf.head(2))\n",
        "    print(\"sample pandas->polars conversion seconds:\", round(t2 - t1, 6))"
      ],
      "execution_count": null,
      "outputs": []
    },
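    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Optional: Arrow-backed pandas reads\n",
        "\n",
        "If you are on pandas 2.x with `pyarrow` installed, you can materialize the same result with Arrow-backed dtypes instead of NumPy-backed ones, which often makes the later hand-off to Polars cheaper. This is a sketch, not a guaranteed win: it reuses `conn_str` and `sql` from the previous cell and falls back to sample data the same way."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Sketch of an Arrow-backed read path, assuming pandas 2.x and pyarrow.\n",
        "try:\n",
        "    engine = create_engine(conn_str)\n",
        "    t0 = time.perf_counter()\n",
        "    pdf_arrow = pd.read_sql(sql, engine, dtype_backend=\"pyarrow\")\n",
        "    t1 = time.perf_counter()\n",
        "    pldf_arrow = pl.from_pandas(pdf_arrow)\n",
        "    t2 = time.perf_counter()\n",
        "    print(\"arrow-backed read seconds:\", round(t1 - t0, 4))\n",
        "    print(\"arrow-backed pandas->polars seconds:\", round(t2 - t1, 4))\n",
        "    print(pdf_arrow.dtypes)\n",
        "except Exception as e:\n",
        "    print(\"Arrow-backed read skipped or failed:\", e)\n",
        "    sample = pd.DataFrame({\"order_total\": [120.5, 89.0]}).convert_dtypes(dtype_backend=\"pyarrow\")\n",
        "    print(\"Sample Arrow-backed dtypes:\")\n",
        "    print(sample.dtypes)"
      ],
      "execution_count": null,
      "outputs": []
    },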
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Push filters and projections into SQL\n",
        "\n",
        "Before optimizing dataframe transfer, tighten the query itself. This example keeps reads lean by filtering on date and region in SQL and only materializing the grouped result needed for analysis."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "server = os.getenv(\"SQL_SERVER\", \"myserver.database.windows.net\")\n",
        "database = os.getenv(\"SQL_DATABASE\", \"mydb\")\n",
        "username = os.getenv(\"SQL_USERNAME\")\n",
        "password = os.getenv(\"SQL_PASSWORD\")\n",
        "\n",
        "if username and password:\n",
        "    conn_str = (\n",
        "        f\"mssql+pyodbc://{username}:{password}@{server}/{database}\"\n",
        "        \"?driver=ODBC+Driver+18+for+SQL+Server\"\n",
        "        \"&Encrypt=yes&TrustServerCertificate=no\"\n",
        "    )\n",
        "else:\n",
        "    conn_str = (\n",
        "        f\"mssql+pyodbc://@{server}/{database}\"\n",
        "        \"?driver=ODBC+Driver+18+for+SQL+Server\"\n",
        "        \"&Authentication=ActiveDirectoryIntegrated&Encrypt=yes\"\n",
        "    )\n",
        "\n",
        "query = text(dedent(\"\"\"\n",
        "SELECT customer_id, SUM(order_total) AS revenue\n",
        "FROM dbo.fact_orders\n",
        "WHERE order_date >= :start_date AND region = :region\n",
        "GROUP BY customer_id\n",
        "\"\"\"))\n",
        "\n",
        "params = {\"start_date\": \"2025-01-01\", \"region\": \"West\"}\n",
        "\n",
        "try:\n",
        "    engine = create_engine(conn_str)\n",
        "    df = pd.read_sql(query, engine, params=params)\n",
        "    print(df.sort_values(\"revenue\", ascending=False).head(5))\n",
        "except Exception as e:\n",
        "    print(\"Live SQL read skipped or failed:\", e)\n",
        "    df = pd.DataFrame(\n",
        "        {\n",
        "            \"customer_id\": [1, 2, 3, 4],\n",
        "            \"revenue\": [820.0, 1200.0, 450.0, 990.0],\n",
        "        }\n",
        "    )\n",
        "    print(\"Using sample aggregated data instead.\")\n",
        "    print(df.sort_values(\"revenue\", ascending=False).head(5))"
      ],
      "execution_count": null,
      "outputs": []
    },
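    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Sanity-check the materialized payload\n",
        "\n",
        "One quick way to confirm pushdown is paying off is to look at the in-memory size of what you actually materialized. The check below runs pandas' `memory_usage(deep=True)` on the frame from the previous cell; if the aggregated result is not much smaller than a raw row-level pull would be, your filters and projections are probably not doing their job."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Rough payload check on whichever frame the previous cell produced\n",
        "# (live aggregated result or the sample fallback).\n",
        "payload_bytes = int(df.memory_usage(deep=True).sum())\n",
        "print({\n",
        "    \"rows\": len(df),\n",
        "    \"columns\": list(df.columns),\n",
        "    \"approx_in_memory_bytes\": payload_bytes,\n",
        "})"
      ],
      "execution_count": null,
      "outputs": []
    },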
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Validate environment prerequisites in Python\n",
        "\n",
        "The original post used PowerShell to validate environment setup. This Python version checks the same idea inside the notebook so you can quickly rule out missing configuration before diagnosing performance."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "required = [\"AZURE_TENANT_ID\", \"AZURE_CLIENT_ID\", \"SQL_SERVER\", \"SQL_DATABASE\"]\n",
        "status = []\n",
        "for name in required:\n",
        "    value = os.getenv(name)\n",
        "    status.append({\"variable\": name, \"configured\": bool(value and str(value).strip())})\n",
        "\n",
        "env_df = pd.DataFrame(status)\n",
        "print(env_df)\n",
        "print(\"python executable available in notebook kernel\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Confirm network connectivity before blaming the stack\n",
        "\n",
        "If the network path to SQL is unstable, dataframe optimizations will not help. This Python version performs a lightweight TCP connectivity check to the SQL endpoint on port 1433."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "server = os.getenv(\"SQL_SERVER\", \"\")\n",
        "port = 1433\n",
        "\n",
        "if not server:\n",
        "    print(\"SQL_SERVER environment variable is not set; skipping live connectivity test.\")\n",
        "else:\n",
        "    try:\n",
        "        with socket.create_connection((server, port), timeout=5):\n",
        "            print({\"server\": server, \"port\": port, \"tcp_test_succeeded\": True})\n",
        "    except Exception as e:\n",
        "        print({\"server\": server, \"port\": port, \"tcp_test_succeeded\": False, \"error\": str(e)})"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Stream large reads with chunking\n",
        "\n",
        "When full materialization is unnecessary, chunked reads can be the right business answer. This example computes a running total incrementally instead of loading the entire result set into memory at once."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "server = os.getenv(\"SQL_SERVER\", \"myserver.database.windows.net\")\n",
        "database = os.getenv(\"SQL_DATABASE\", \"mydb\")\n",
        "username = os.getenv(\"SQL_USERNAME\")\n",
        "password = os.getenv(\"SQL_PASSWORD\")\n",
        "\n",
        "if username and password:\n",
        "    conn_str = (\n",
        "        f\"mssql+pyodbc://{username}:{password}@{server}/{database}\"\n",
        "        \"?driver=ODBC+Driver+18+for+SQL+Server\"\n",
        "        \"&Encrypt=yes&TrustServerCertificate=no\"\n",
        "    )\n",
        "else:\n",
        "    conn_str = (\n",
        "        f\"mssql+pyodbc://@{server}/{database}\"\n",
        "        \"?driver=ODBC+Driver+18+for+SQL+Server\"\n",
        "        \"&Authentication=ActiveDirectoryIntegrated&Encrypt=yes\"\n",
        "    )\n",
        "\n",
        "sql = \"SELECT order_id, order_total FROM dbo.fact_orders\"\n",
        "running_total = 0.0\n",
        "\n",
        "try:\n",
        "    engine = create_engine(conn_str)\n",
        "    for chunk in pd.read_sql(sql, engine, chunksize=50000):\n",
        "        running_total += chunk[\"order_total\"].sum()\n",
        "    print({\"running_total\": round(float(running_total), 2)})\n",
        "except Exception as e:\n",
        "    print(\"Live chunked read skipped or failed:\", e)\n",
        "    sample = pd.DataFrame({\"order_id\": [1, 2, 3, 4], \"order_total\": [10.0, 20.5, 30.25, 5.25]})\n",
        "    for start in range(0, len(sample), 2):\n",
        "        chunk = sample.iloc[start:start+2]\n",
        "        running_total += chunk[\"order_total\"].sum()\n",
        "    print(\"Using sample chunks instead.\")\n",
        "    print({\"running_total\": round(float(running_total), 2)})"
      ],
      "execution_count": null,
      "outputs": []
    },
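    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Optional: pair chunking with server-side streaming\n",
        "\n",
        "Chunking controls how much pandas materializes at once, but depending on the dialect and driver, results may still be buffered. SQLAlchemy's `stream_results` execution option requests a streaming cursor where the driver supports it. This is a sketch that reuses `conn_str` and `sql` from the previous cell; with `pyodbc` the benefit varies by driver settings, so measure in your own environment."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Hedged variant: ask SQLAlchemy for a streaming cursor where supported,\n",
        "# then keep the same incremental aggregation pattern as above.\n",
        "running_total = 0.0\n",
        "try:\n",
        "    engine = create_engine(conn_str)\n",
        "    with engine.connect().execution_options(stream_results=True) as conn:\n",
        "        for chunk in pd.read_sql(sql, conn, chunksize=50000):\n",
        "            running_total += float(chunk[\"order_total\"].sum())\n",
        "    print({\"streamed_running_total\": round(running_total, 2)})\n",
        "except Exception as e:\n",
        "    print(\"Streaming read skipped or failed:\", e)"
      ],
      "execution_count": null,
      "outputs": []
    },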
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Convert pandas results into Polars for downstream analytics\n",
        "\n",
        "This example shows a common mixed workflow: keep the SQL read step unchanged, then move into Polars for expression-style analytics. It is useful for testing whether downstream transformation speed improves even if the read path itself is unchanged."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "pdf = pd.DataFrame(\n",
        "    {\"customer_id\": [1, 1, 2], \"order_total\": [120.0, 80.0, 50.0], \"region\": [\"West\", \"West\", \"East\"]}\n",
        ")\n",
        "\n",
        "pldf = pl.from_pandas(pdf)\n",
        "summary = (\n",
        "    pldf.group_by(\"customer_id\")\n",
        "    .agg(pl.col(\"order_total\").sum().alias(\"revenue\"))\n",
        "    .sort(\"revenue\", descending=True)\n",
        ")\n",
        "\n",
        "print(summary)"
      ],
      "execution_count": null,
      "outputs": []
    },
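    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Go lazy for larger Polars transformations\n",
        "\n",
        "Once data is in Polars, the lazy API lets the engine see the whole transformation before executing it, so it can prune columns and reorder work. A minimal sketch on the frame from the previous cell:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Build the plan lazily, inspect it, then execute.\n",
        "lazy_summary = (\n",
        "    pldf.lazy()\n",
        "    .filter(pl.col(\"region\") == \"West\")\n",
        "    .group_by(\"customer_id\")\n",
        "    .agg(pl.col(\"order_total\").sum().alias(\"west_revenue\"))\n",
        "    .sort(\"west_revenue\", descending=True)\n",
        ")\n",
        "print(lazy_summary.explain())  # optimized query plan, before any execution\n",
        "print(lazy_summary.collect())"
      ],
      "execution_count": null,
      "outputs": []
    },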
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Turn compact summaries into Copilot-ready context\n",
        "\n",
        "This example demonstrates the business-facing payoff: once data is summarized, it can be turned into concise prompt context for Copilot-style narratives, anomaly checks, or follow-up SQL suggestions."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "df = pd.DataFrame(\n",
        "    {\"region\": [\"West\", \"East\", \"Central\"], \"revenue\": [125000, 98000, 101500], \"growth_pct\": [8.2, -1.4, 3.1]}\n",
        ")\n",
        "\n",
        "top = df.sort_values(\"revenue\", ascending=False).iloc[0]\n",
        "lagging = df.sort_values(\"growth_pct\").iloc[0]\n",
        "\n",
        "prompt_context = (\n",
        "    f\"Top region by revenue: {top.region} (${top.revenue:,.0f}). \"\n",
        "    f\"Lowest growth region: {lagging.region} ({lagging.growth_pct:.1f}%). \"\n",
        "    \"Explain likely drivers and suggest two follow-up SQL checks.\"\n",
        ")\n",
        "\n",
        "print(prompt_context)"
      ],
      "execution_count": null,
      "outputs": []
    },
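    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Optional: structured context instead of free text\n",
        "\n",
        "For anything beyond a sentence or two, a compact structured payload is often easier for a Copilot-style assistant to consume reliably than prose. This sketch serializes the same regional summary as JSON records; keep the record count small so the context stays token-efficient."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import json\n",
        "\n",
        "# Compact structured context: a few records plus an explicit instruction.\n",
        "records = df.sort_values(\"revenue\", ascending=False).to_dict(orient=\"records\")\n",
        "structured_context = json.dumps(\n",
        "    {\n",
        "        \"summary\": records,\n",
        "        \"instruction\": \"Explain likely drivers and suggest two follow-up SQL checks.\",\n",
        "    },\n",
        "    indent=2,\n",
        ")\n",
        "print(structured_context)"
      ],
      "execution_count": null,
      "outputs": []
    },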
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Governance-aware production path\n",
        "\n",
        "The blog emphasizes that speed without governance is just faster risk. This cell expresses the production-ready sequence in notebook-friendly form: identity, encrypted connection, filtered result set, dataframe materialization, and prompt-ready analytics."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sequence = [\n",
        "    (\"Python Analytics App\", \"Microsoft Entra ID\", \"Acquire identity / integrated auth\"),\n",
        "    (\"Python Analytics App\", \"Azure SQL / SQL Server\", \"Open encrypted connection\"),\n",
        "    (\"Azure SQL / SQL Server\", \"Python Analytics App\", \"Return filtered result set\"),\n",
        "    (\"Python Analytics App\", \"DataFrame Engine\", \"Materialize dataframe\"),\n",
        "    (\"DataFrame Engine\", \"Python Analytics App\", \"Aggregations, joins, prompt-ready context\"),\n",
        "]\n",
        "\n",
        "for src, dst, action in sequence:\n",
        "    print(f\"{src} -> {dst}: {action}\")"
      ],
      "execution_count": null,
      "outputs": []
    },
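    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Optional: token-based connection with azure-identity\n",
        "\n",
        "As a concrete version of the identity step, the sketch below acquires an Entra access token with `DefaultAzureCredential` and passes it to the ODBC driver as a pre-connection attribute. The attribute code 1256 (`SQL_COPT_SS_ACCESS_TOKEN`) and the length-prefixed UTF-16-LE encoding follow the Microsoft ODBC driver's documented token format; the server and database defaults are placeholders."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import struct\n",
        "\n",
        "from azure.identity import DefaultAzureCredential\n",
        "\n",
        "# Pre-connection attribute defined by the Microsoft ODBC driver for\n",
        "# passing an Entra access token.\n",
        "SQL_COPT_SS_ACCESS_TOKEN = 1256\n",
        "\n",
        "server = os.getenv(\"SQL_SERVER\", \"myserver.database.windows.net\")\n",
        "database = os.getenv(\"SQL_DATABASE\", \"mydb\")\n",
        "\n",
        "try:\n",
        "    credential = DefaultAzureCredential()\n",
        "    token = credential.get_token(\"https://database.windows.net/.default\")\n",
        "    # The driver expects the token as length-prefixed UTF-16-LE bytes.\n",
        "    token_bytes = token.token.encode(\"utf-16-le\")\n",
        "    token_struct = struct.pack(f\"<I{len(token_bytes)}s\", len(token_bytes), token_bytes)\n",
        "\n",
        "    engine = create_engine(\n",
        "        f\"mssql+pyodbc://@{server}/{database}\"\n",
        "        \"?driver=ODBC+Driver+18+for+SQL+Server&Encrypt=yes\",\n",
        "        connect_args={\"attrs_before\": {SQL_COPT_SS_ACCESS_TOKEN: token_struct}},\n",
        "    )\n",
        "    with engine.connect() as conn:\n",
        "        print(conn.execute(text(\"SELECT SYSUTCDATETIME()\")).scalar())\n",
        "except Exception as e:\n",
        "    print(\"Token-based connection skipped or failed:\", e)"
      ],
      "execution_count": null,
      "outputs": []
    },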
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Measure workflow impact, not benchmark vanity\n",
        "\n",
        "Use this simple KPI template to evaluate whether faster SQL-to-Python reads actually improve business throughput. The point is to measure decision velocity, not just isolated transfer speed."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "kpis = pd.DataFrame(\n",
        "    {\n",
        "        \"kpi\": [\n",
        "            \"Time to first dataframe\",\n",
        "            \"Notebook rerun latency on common datasets\",\n",
        "            \"Feature or scenario iterations per week\",\n",
        "            \"Time to produce prompt-ready context\",\n",
        "        ],\n",
        "        \"baseline\": [None, None, None, None],\n",
        "        \"current_measurement\": [None, None, None, None],\n",
        "        \"notes\": [\n",
        "            \"Measure from query start to dataframe available\",\n",
        "            \"Use representative notebook cells\",\n",
        "            \"Count completed analytical iterations\",\n",
        "            \"Measure from query to compact narrative\",\n",
        "        ],\n",
        "    }\n",
        ")\n",
        "\n",
        "print(kpis)"
      ],
      "execution_count": null,
      "outputs": []
    },
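    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## A small timing helper for the KPI table\n",
        "\n",
        "To fill in the baseline and current columns consistently across reruns, a tiny context manager works well. This is a minimal sketch; wrap whichever cells correspond to each KPI and copy the numbers into the table."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "from contextlib import contextmanager\n",
        "\n",
        "measurements = {}\n",
        "\n",
        "@contextmanager\n",
        "def kpi_timer(label):\n",
        "    \"\"\"Record elapsed wall-clock seconds for a labeled workflow step.\"\"\"\n",
        "    t0 = time.perf_counter()\n",
        "    try:\n",
        "        yield\n",
        "    finally:\n",
        "        measurements[label] = round(time.perf_counter() - t0, 4)\n",
        "\n",
        "# Example usage with a trivial stand-in for a real read:\n",
        "with kpi_timer(\"Time to first dataframe\"):\n",
        "    _ = pd.DataFrame({\"x\": range(100000)}).describe()\n",
        "\n",
        "print(measurements)"
      ],
      "execution_count": null,
      "outputs": []
    },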
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Next Steps\n",
        "\n",
        "This notebook showed how to validate the main claims from the post in a practical way: inspect the SQL-to-Python handoff, push filters into SQL, test connectivity, stream large reads, compare pandas and Polars workflows, and generate prompt-ready summaries.\n",
        "\n",
        "Next steps:\n",
        "\n",
        "1. Replace sample queries with one painful production workflow.\n",
        "2. Record time to first dataframe and notebook rerun latency.\n",
        "3. Compare full materialization versus chunked processing.\n",
        "4. Test whether pandas-first or pandas-to-Polars paths are the real bottleneck.\n",
        "5. Evaluate Arrow-capable and emerging drivers carefully in your own environment.\n",
        "6. Review performance together with identity, encryption, access controls, and cost.\n",
        "\n",
        "The key question is not whether faster reads are elegant. It is whether they compress the time between data retrieval, interpretation, and action in your actual AI and analytics workflows."
      ]
    }
  ]
}