{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.13.0"
    },
    "blog_metadata": {
      "topic": "How Apache Arrow Changes SQL Server-to-Python Analytics in Microsoft Data Stacks",
      "slug": "how-apache-arrow-changes-sql-server-to-python-analytics-in-m",
      "generated_by": "LinkedIn Post Generator + Azure OpenAI",
      "generated_at": "2026-05-04T20:00:52.530Z"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# How Apache Arrow Changes SQL Server-to-Python Analytics in Microsoft Data Stacks\n",
        "\n",
        "Apache Arrow is more than a driver feature toggle for SQL Server-to-Python workflows. It changes how result sets move into Python by reducing row-wise Python object materialization and enabling more direct handoff into columnar tools like PyArrow, Polars, and Arrow-aware pandas paths.\n",
        "\n",
        "This notebook turns the article into hands-on validation steps so you can compare traditional row-fetch patterns with Arrow-oriented workflows, inspect timing and memory behavior, and test an end-to-end analytical pattern."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "%pip install -q pyodbc pandas pyarrow polars adbc-driver-manager matplotlib"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import os\n",
        "import time\n",
        "import tracemalloc\n",
        "import importlib.util\n",
        "\n",
        "import pandas as pd\n",
        "import polars as pl\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "try:\n",
        "    import pyodbc\n",
        "except Exception as e:\n",
        "    pyodbc = None\n",
        "    print(f\"pyodbc import unavailable: {e}\")\n",
        "\n",
        "try:\n",
        "    import adbc_driver_manager.dbapi as adbc_dbapi\n",
        "except Exception as e:\n",
        "    adbc_dbapi = None\n",
        "    print(f\"ADBC import unavailable: {e}\")\n",
        "\n",
        "try:\n",
        "    import pyarrow as pa\n",
        "except Exception as e:\n",
        "    pa = None\n",
        "    print(f\"pyarrow import unavailable: {e}\")\n",
        "\n",
        "pd.set_option(\"display.max_columns\", 50)\n",
        "print({\n",
        "    \"python_packages\": {\n",
        "        \"pyodbc\": pyodbc is not None,\n",
        "        \"pandas\": True,\n",
        "        \"polars\": True,\n",
        "        \"pyarrow\": pa is not None,\n",
        "        \"adbc_driver_manager\": adbc_dbapi is not None,\n",
        "    }\n",
        "})"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Conceptual flow: traditional row path vs Arrow columnar path\n",
        "\n",
        "This example recreates the article's architecture diagram in Python so it can run inside the notebook. The key idea is that the traditional path materializes Python row objects first, while the Arrow path keeps data in typed columnar buffers longer and hands them off more efficiently to analytical engines."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import matplotlib.pyplot as plt\n",
        "\n",
        "fig, ax = plt.subplots(figsize=(14, 4))\n",
        "ax.axis(\"off\")\n",
        "\n",
        "nodes = {\n",
        "    \"sql\": (0.08, 0.5, \"SQL Server / Azure SQL\"),\n",
        "    \"client\": (0.28, 0.5, \"ODBC / SQL client\"),\n",
        "    \"trad\": (0.50, 0.72, \"Traditional path:\\nrow-by-row fetch\"),\n",
        "    \"pandas\": (0.72, 0.72, \"pandas DataFrame\\nfrom Python objects\"),\n",
        "    \"overhead\": (0.92, 0.72, \"Higher CPU and\\nmemory overhead\"),\n",
        "    \"arrow\": (0.50, 0.28, \"Arrow path:\\ncolumnar buffers\"),\n",
        "    \"table\": (0.72, 0.28, \"PyArrow Table\"),\n",
        "    \"polars\": (0.86, 0.28, \"Polars DataFrame\"),\n",
        "}\n",
        "\n",
        "for x, y, label in nodes.values():\n",
        "    ax.text(x, y, label, ha=\"center\", va=\"center\", fontsize=10,\n",
        "            bbox=dict(boxstyle=\"round,pad=0.35\", fc=\"#eef5ff\", ec=\"#4a6fa5\"))\n",
        "\n",
        "def arrow_line(a, b):\n",
        "    ax.annotate(\"\", xy=b, xytext=a, arrowprops=dict(arrowstyle=\"->\", lw=1.8, color=\"#333\"))\n",
        "\n",
        "arrow_line((0.14, 0.5), (0.22, 0.5))\n",
        "arrow_line((0.34, 0.55), (0.44, 0.68))\n",
        "arrow_line((0.56, 0.72), (0.66, 0.72))\n",
        "arrow_line((0.78, 0.72), (0.88, 0.72))\n",
        "arrow_line((0.34, 0.45), (0.44, 0.32))\n",
        "arrow_line((0.56, 0.28), (0.66, 0.28))\n",
        "arrow_line((0.78, 0.28), (0.82, 0.28))\n",
        "\n",
        "ax.set_title(\"Traditional SQL-to-Python transfer vs Arrow-oriented transfer\", fontsize=13)\n",
        "plt.show()"
      ],
      "execution_count": null,
      "outputs": []
    },
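    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the diagram concrete without a database connection, the next cell simulates both paths in memory: the row path materializes Python tuples and builds a pandas DataFrame from those objects, while the columnar path builds typed PyArrow arrays and hands the table to Polars. The synthetic data is a stand-in for a SQL result set, so absolute timings are illustrative only."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import time\n",
        "\n",
        "import pandas as pd\n",
        "import polars as pl\n",
        "import pyarrow as pa\n",
        "\n",
        "N = 200_000\n",
        "\n",
        "# Row path: materialize N Python tuples, then build a DataFrame from those objects.\n",
        "t0 = time.perf_counter()\n",
        "rows = [(i, i % 1000, float(i % 100)) for i in range(N)]\n",
        "row_df = pd.DataFrame.from_records(rows, columns=[\"order_id\", \"customer_id\", \"amount\"])\n",
        "t_row = time.perf_counter() - t0\n",
        "\n",
        "# Columnar path: build typed Arrow arrays, then hand the table to Polars.\n",
        "t0 = time.perf_counter()\n",
        "tbl = pa.table({\n",
        "    \"order_id\": pa.array(range(N), type=pa.int64()),\n",
        "    \"customer_id\": pa.array([i % 1000 for i in range(N)], type=pa.int64()),\n",
        "    \"amount\": pa.array([float(i % 100) for i in range(N)], type=pa.float64()),\n",
        "})\n",
        "col_df = pl.from_arrow(tbl)\n",
        "t_col = time.perf_counter() - t0\n",
        "\n",
        "print({\"row_path_s\": round(t_row, 3), \"columnar_path_s\": round(t_col, 3)})\n",
        "print(row_df.shape, col_df.shape)"
      ],
      "execution_count": null,
      "outputs": []
    },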
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Required connection variables\n",
        "\n",
        "The next examples connect to SQL Server or Azure SQL. Set these environment variables before running the benchmark cells:\n",
        "\n",
        "- `SQLSERVER_CONN`: ODBC connection string for `pyodbc`\n",
        "- `SQLSERVER_ADBC`: ADBC connection string/URI for a SQL Server-capable ADBC driver (driver availability varies by environment; substitute whichever ADBC driver you have installed)\n",
        "\n",
        "Example values:\n",
        "\n",
        "- `Driver={ODBC Driver 18 for SQL Server};Server=localhost;Database=WideWorldImporters;Trusted_Connection=yes;Encrypt=no`\n",
        "- `Server=localhost;Database=WideWorldImporters;Trusted_Connection=yes;Encrypt=no`\n",
        "\n",
        "If your environment uses SQL authentication or Azure SQL, adjust server, database, encryption, and credentials accordingly."
      ]
    },
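    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As an optional sanity check before the benchmark cells, the next cell parses the configured ODBC connection string into key/value pairs so obvious mistakes (a missing `Driver=` entry, a stray semicolon) surface early. It only inspects the string, never opens a connection, and assumes the common `Key=Value;Key=Value` ODBC layout."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import os\n",
        "\n",
        "conn_str = os.getenv(\n",
        "    \"SQLSERVER_CONN\",\n",
        "    \"Driver={ODBC Driver 18 for SQL Server};Server=localhost;Database=WideWorldImporters;Trusted_Connection=yes;Encrypt=no\"\n",
        ")\n",
        "\n",
        "# Split on semicolons and keep Key=Value pairs. This simple split assumes no\n",
        "# semicolons inside braced values like Driver={...}, which holds for typical strings.\n",
        "pairs = {}\n",
        "for part in conn_str.split(\";\"):\n",
        "    key, sep, value = part.strip().partition(\"=\")\n",
        "    if sep and key:\n",
        "        pairs[key.strip()] = value.strip()\n",
        "\n",
        "missing = [k for k in (\"Driver\", \"Server\", \"Database\") if k not in pairs]\n",
        "print(\"Parsed keys:\", sorted(pairs))\n",
        "if missing:\n",
        "    print(\"Missing required keys:\", missing)\n",
        "else:\n",
        "    print(\"Connection string looks structurally complete.\")"
      ],
      "execution_count": null,
      "outputs": []
    },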
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Traditional row-fetch into pandas\n",
        "\n",
        "This is the familiar pattern described in the article: execute SQL through an ODBC-style client, fetch rows into Python objects, then build a pandas DataFrame from those objects. It is simple and widely used, but it pays the Python object materialization cost before analysis begins."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import os\n",
        "import time\n",
        "import tracemalloc\n",
        "import pandas as pd\n",
        "\n",
        "if pyodbc is None:\n",
        "    raise ImportError(\"pyodbc is not available. Install it and ensure the SQL Server ODBC driver is installed.\")\n",
        "\n",
        "conn_str = os.getenv(\n",
        "    \"SQLSERVER_CONN\",\n",
        "    \"Driver={ODBC Driver 18 for SQL Server};Server=localhost;Database=WideWorldImporters;Trusted_Connection=yes;Encrypt=no\"\n",
        ")\n",
        "sql = \"SELECT TOP (50000) OrderID, CustomerID, OrderDate, PickingCompletedWhen FROM Sales.Orders\"\n",
        "\n",
        "conn = pyodbc.connect(conn_str)\n",
        "tracemalloc.start()\n",
        "t0 = time.perf_counter()\n",
        "rows = conn.cursor().execute(sql).fetchall()\n",
        "t1 = time.perf_counter()\n",
        "df = pd.DataFrame.from_records(\n",
        "    rows,\n",
        "    columns=[\"OrderID\", \"CustomerID\", \"OrderDate\", \"PickingCompletedWhen\"],\n",
        ")\n",
        "t2 = time.perf_counter()\n",
        "_, peak = tracemalloc.get_traced_memory()\n",
        "tracemalloc.stop()\n",
        "conn.close()\n",
        "\n",
        "print({\n",
        "    \"fetch_s\": round(t1 - t0, 3),\n",
        "    \"dataframe_s\": round(t2 - t1, 3),\n",
        "    \"peak_mb\": round(peak / 1024 / 1024, 2),\n",
        "})\n",
        "print(df.dtypes.astype(str).to_dict())\n",
        "print(df.head(3).to_dict(orient=\"records\"))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Arrow-based fetch into PyArrow and Polars\n",
        "\n",
        "This example follows the Arrow-oriented shape from the article. Instead of fetching Python row objects first, it requests an Arrow table and then loads that table into Polars, which is designed for columnar execution."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import os\n",
        "import time\n",
        "import tracemalloc\n",
        "import polars as pl\n",
        "\n",
        "if adbc_dbapi is None:\n",
        "    raise ImportError(\"adbc_driver_manager is not available. Install adbc-driver-manager.\")\n",
        "\n",
        "adbc_uri = os.getenv(\n",
        "    \"SQLSERVER_ADBC\",\n",
        "    \"Server=localhost;Database=WideWorldImporters;Trusted_Connection=yes;Encrypt=no\"\n",
        ")\n",
        "sql = \"SELECT TOP (50000) OrderID, CustomerID, OrderDate, PickingCompletedWhen FROM Sales.Orders\"\n",
        "\n",
        "# The driver name below assumes a SQL Server-capable ADBC driver is installed; adjust it to your environment.\n",
        "conn = adbc_dbapi.connect(driver=\"adbc_driver_sqlserver\", db_kwargs={\"uri\": adbc_uri})\n",
        "tracemalloc.start()\n",
        "t0 = time.perf_counter()\n",
        "cur = conn.cursor()\n",
        "cur.execute(sql)\n",
        "tbl = cur.fetch_arrow_table()\n",
        "t1 = time.perf_counter()\n",
        "df = pl.from_arrow(tbl)\n",
        "t2 = time.perf_counter()\n",
        "_, peak = tracemalloc.get_traced_memory()\n",
        "tracemalloc.stop()\n",
        "conn.close()\n",
        "\n",
        "print({\n",
        "    \"arrow_fetch_s\": round(t1 - t0, 3),\n",
        "    \"polars_build_s\": round(t2 - t1, 3),\n",
        "    \"peak_mb\": round(peak / 1024 / 1024, 2),\n",
        "})\n",
        "print(df.schema)\n",
        "print(df.head(3))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Benchmark harness: old path vs Arrow path\n",
        "\n",
        "This benchmark compares the two approaches side by side. It separates query execution, fetch time, DataFrame construction time, and a lightweight peak-memory estimate using `tracemalloc`.\n",
        "\n",
        "Important caveat: `tracemalloc` tracks Python allocations and may underreport native memory used by Arrow, Polars, and underlying C/C++ libraries. Treat it as directional rather than definitive."
      ]
    },
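    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a cross-check on the `tracemalloc` numbers, the next cell sketches a process-level peak-RSS reading with the standard-library `resource` module. It is Unix-only, the `ru_maxrss` units differ by platform (KiB on Linux, bytes on macOS), it covers the whole process rather than one code path, and the value never decreases, so treat it as another directional signal."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import sys\n",
        "\n",
        "try:\n",
        "    import resource  # Unix-only standard-library module\n",
        "except ImportError:\n",
        "    resource = None\n",
        "    print(\"resource module unavailable (likely Windows); skipping RSS check.\")\n",
        "\n",
        "def peak_rss_mb():\n",
        "    \"\"\"Return the process peak RSS in MB, or None if unsupported.\"\"\"\n",
        "    if resource is None:\n",
        "        return None\n",
        "    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss\n",
        "    # Linux reports KiB, macOS reports bytes.\n",
        "    divisor = 1024 if sys.platform.startswith(\"linux\") else 1024 * 1024\n",
        "    return round(rss / divisor, 2)\n",
        "\n",
        "before = peak_rss_mb()\n",
        "payload = [float(i) for i in range(2_000_000)]  # allocate tens of MB of Python floats\n",
        "after = peak_rss_mb()\n",
        "del payload\n",
        "print({\"peak_rss_before_mb\": before, \"peak_rss_after_mb\": after})"
      ],
      "execution_count": null,
      "outputs": []
    },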
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import os\n",
        "import time\n",
        "import tracemalloc\n",
        "import pandas as pd\n",
        "import polars as pl\n",
        "\n",
        "if pyodbc is None:\n",
        "    raise ImportError(\"pyodbc is not available.\")\n",
        "if adbc_dbapi is None:\n",
        "    raise ImportError(\"adbc_driver_manager is not available.\")\n",
        "\n",
        "SQL = \"SELECT TOP (100000) OrderID, CustomerID, OrderDate FROM Sales.Orders\"\n",
        "ODBC = os.getenv(\n",
        "    \"SQLSERVER_CONN\",\n",
        "    \"Driver={ODBC Driver 18 for SQL Server};Server=localhost;Database=WideWorldImporters;Trusted_Connection=yes;Encrypt=no\"\n",
        ")\n",
        "ADBC = os.getenv(\n",
        "    \"SQLSERVER_ADBC\",\n",
        "    \"Server=localhost;Database=WideWorldImporters;Trusted_Connection=yes;Encrypt=no\"\n",
        ")\n",
        "\n",
        "results = []\n",
        "\n",
        "def measure(name, fn):\n",
        "    tracemalloc.start()\n",
        "    t0 = time.perf_counter()\n",
        "    out = fn()\n",
        "    t1 = time.perf_counter()\n",
        "    _, peak = tracemalloc.get_traced_memory()\n",
        "    tracemalloc.stop()\n",
        "    summary = {\n",
        "        \"path\": name,\n",
        "        \"elapsed_s\": round(t1 - t0, 3),\n",
        "        \"peak_mb\": round(peak / 1024 / 1024, 2),\n",
        "    }\n",
        "    summary.update(out)\n",
        "    print(summary)\n",
        "    results.append(summary)\n",
        "    return summary\n",
        "\n",
        "def old_path():\n",
        "    conn = pyodbc.connect(ODBC)\n",
        "    cur = conn.cursor()\n",
        "    q0 = time.perf_counter()\n",
        "    cur.execute(SQL)\n",
        "    q1 = time.perf_counter()\n",
        "    rows = cur.fetchall()\n",
        "    q2 = time.perf_counter()\n",
        "    _df = pd.DataFrame.from_records(rows, columns=[\"OrderID\", \"CustomerID\", \"OrderDate\"])\n",
        "    q3 = time.perf_counter()\n",
        "    conn.close()\n",
        "    return {\n",
        "        \"query_s\": round(q1 - q0, 3),\n",
        "        \"fetch_s\": round(q2 - q1, 3),\n",
        "        \"df_build_s\": round(q3 - q2, 3),\n",
        "    }\n",
        "\n",
        "def arrow_path():\n",
        "    # Assumes a SQL Server-capable ADBC driver is installed; adjust the name to your environment.\n",
        "    conn = adbc_dbapi.connect(driver=\"adbc_driver_sqlserver\", db_kwargs={\"uri\": ADBC})\n",
        "    cur = conn.cursor()\n",
        "    q0 = time.perf_counter()\n",
        "    cur.execute(SQL)\n",
        "    q1 = time.perf_counter()\n",
        "    tbl = cur.fetch_arrow_table()\n",
        "    q2 = time.perf_counter()\n",
        "    _df = pl.from_arrow(tbl)\n",
        "    q3 = time.perf_counter()\n",
        "    conn.close()\n",
        "    return {\n",
        "        \"query_s\": round(q1 - q0, 3),\n",
        "        \"fetch_s\": round(q2 - q1, 3),\n",
        "        \"df_build_s\": round(q3 - q2, 3),\n",
        "    }\n",
        "\n",
        "measure(\"pyodbc+pandas\", old_path)\n",
        "measure(\"adbc+arrow+polars\", arrow_path)\n",
        "\n",
        "results_df = pd.DataFrame(results)\n",
        "results_df"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Side-by-side summary formatter\n",
        "\n",
        "This cell formats benchmark output into a compact table suitable for internal reporting. You can use either the live `results` from the prior benchmark or the sample values from the article."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "sample_results = [\n",
        "    {\"path\": \"pyodbc+pandas\", \"query_s\": 0.041, \"fetch_s\": 1.882, \"df_build_s\": 0.317, \"peak_mb\": 96.4},\n",
        "    {\"path\": \"adbc+arrow+polars\", \"query_s\": 0.039, \"fetch_s\": 0.611, \"df_build_s\": 0.041, \"peak_mb\": 28.7},\n",
        "]\n",
        "\n",
        "rows_to_print = results if \"results\" in globals() and len(results) > 0 else sample_results\n",
        "headers = [\"path\", \"query_s\", \"fetch_s\", \"df_build_s\", \"peak_mb\"]\n",
        "\n",
        "print(\"| \" + \" | \".join(headers) + \" |\")\n",
        "print(\"|\" + \"|\".join([\"---\"] * len(headers)) + \"|\")\n",
        "for row in rows_to_print:\n",
        "    print(\"| \" + \" | \".join(str(row.get(h, \"\")) for h in headers) + \" |\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Visual comparison of benchmark results\n",
        "\n",
        "This optional chart helps make the article's point visible: query time may stay similar while fetch time and DataFrame construction time can shift materially when the transfer path changes."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "plot_rows = results if \"results\" in globals() and len(results) > 0 else [\n",
        "    {\"path\": \"pyodbc+pandas\", \"query_s\": 0.041, \"fetch_s\": 1.882, \"df_build_s\": 0.317, \"peak_mb\": 96.4},\n",
        "    {\"path\": \"adbc+arrow+polars\", \"query_s\": 0.039, \"fetch_s\": 0.611, \"df_build_s\": 0.041, \"peak_mb\": 28.7},\n",
        "]\n",
        "\n",
        "rdf = pd.DataFrame(plot_rows)\n",
        "metrics = [\"query_s\", \"fetch_s\", \"df_build_s\"]\n",
        "ax = rdf.set_index(\"path\")[metrics].plot(kind=\"bar\", figsize=(10, 5))\n",
        "ax.set_title(\"Benchmark comparison by path\")\n",
        "ax.set_ylabel(\"Seconds\")\n",
        "ax.set_xlabel(\"\")\n",
        "plt.xticks(rotation=0)\n",
        "plt.tight_layout()\n",
        "plt.show()\n",
        "\n",
        "ax = rdf.set_index(\"path\")[[\"peak_mb\"]].plot(kind=\"bar\", figsize=(8, 4), color=[\"#7aa6c2\", \"#5cb85c\"])\n",
        "ax.set_title(\"Peak memory comparison (directional via tracemalloc)\")\n",
        "ax.set_ylabel(\"MB\")\n",
        "ax.set_xlabel(\"\")\n",
        "plt.xticks(rotation=0)\n",
        "plt.tight_layout()\n",
        "plt.show()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Required connection variables for the end-to-end analytical example\n",
        "\n",
        "This example uses an ADBC connection to SQL Server and assumes a SQL Server-capable ADBC driver is installed. Set:\n",
        "\n",
        "- `SQLSERVER_ADBC`\n",
        "\n",
        "Example local value:\n",
        "\n",
        "- `Server=localhost;Database=WideWorldImporters;Trusted_Connection=yes;Encrypt=no`\n",
        "\n",
        "Example Azure SQL style value:\n",
        "\n",
        "- `Server=tcp:myserver.database.windows.net,1433;Database=mydb;Uid=myuser;Pwd=mypassword;Encrypt=yes;TrustServerCertificate=no`\n",
        "\n",
        "Use a secure secret-management approach in real environments rather than hardcoding credentials."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## End-to-end analytical query into Polars through Arrow-native buffers\n",
        "\n",
        "This example demonstrates the article's broader point: the value is not only faster fetches, but staying in a columnar, vectorized path for downstream analytics. The query lands as Arrow, Polars ingests it directly, and the aggregation remains vectorized."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import os\n",
        "import polars as pl\n",
        "\n",
        "if adbc_dbapi is None:\n",
        "    raise ImportError(\"adbc_driver_manager is not available.\")\n",
        "\n",
        "adbc_uri = os.getenv(\n",
        "    \"SQLSERVER_ADBC\",\n",
        "    \"Server=localhost;Database=WideWorldImporters;Trusted_Connection=yes;Encrypt=no\"\n",
        ")\n",
        "\n",
        "sql = \"\"\"\n",
        "SELECT TOP (100000)\n",
        "    CAST(OrderDate AS date) AS order_date,\n",
        "    CustomerID,\n",
        "    SalespersonPersonID AS salesperson_id,\n",
        "    CAST(OrderID % 7 AS int) AS channel_id\n",
        "FROM Sales.Orders\n",
        "\"\"\"\n",
        "\n",
        "# Assumes a SQL Server-capable ADBC driver is installed; adjust the name to your environment.\n",
        "conn = adbc_dbapi.connect(driver=\"adbc_driver_sqlserver\", db_kwargs={\"uri\": adbc_uri})\n",
        "cur = conn.cursor()\n",
        "cur.execute(sql)  # DB-API execute() does not return the cursor, so avoid chaining\n",
        "tbl = cur.fetch_arrow_table()\n",
        "conn.close()\n",
        "\n",
        "df = pl.from_arrow(tbl)\n",
        "\n",
        "result = (\n",
        "    df.with_columns(pl.col(\"order_date\").dt.month().alias(\"order_month\"))\n",
        "      .group_by([\"order_month\", \"channel_id\"])\n",
        "      .agg(\n",
        "          pl.len().alias(\"orders\"),\n",
        "          pl.n_unique(\"CustomerID\").alias(\"unique_customers\")\n",
        "      )\n",
        "      .sort([\"order_month\", \"channel_id\"])\n",
        ")\n",
        "\n",
        "print(result)"
      ],
      "execution_count": null,
      "outputs": []
    },
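    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One failure mode worth guarding against is converting columnar data back into Python objects mid-pipeline, which quietly re-pays the cost the Arrow path avoided. The next cell contrasts a vectorized Polars aggregation with the same aggregation computed over `to_dicts()` row objects; it uses synthetic data so it runs without a database."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import time\n",
        "from collections import Counter\n",
        "\n",
        "import polars as pl\n",
        "\n",
        "N = 300_000\n",
        "df = pl.DataFrame({\n",
        "    \"channel_id\": [i % 7 for i in range(N)],\n",
        "    \"amount\": [float(i % 100) for i in range(N)],\n",
        "})\n",
        "\n",
        "# Vectorized path: the aggregation stays inside Polars' columnar engine.\n",
        "t0 = time.perf_counter()\n",
        "vec = df.group_by(\"channel_id\").agg(pl.col(\"amount\").sum().alias(\"total\")).sort(\"channel_id\")\n",
        "t_vec = time.perf_counter() - t0\n",
        "\n",
        "# Anti-pattern: round-trip every row through a Python dict before aggregating.\n",
        "t0 = time.perf_counter()\n",
        "totals = Counter()\n",
        "for row in df.to_dicts():\n",
        "    totals[row[\"channel_id\"]] += row[\"amount\"]\n",
        "t_obj = time.perf_counter() - t0\n",
        "\n",
        "print({\"vectorized_s\": round(t_vec, 4), \"python_objects_s\": round(t_obj, 4)})"
      ],
      "execution_count": null,
      "outputs": []
    },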
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Environment validation inside Python\n",
        "\n",
        "The article included PowerShell checks for package and driver readiness. This Python version validates package availability, shows configured connection variables, and attempts a lightweight ODBC connectivity test if `pyodbc` is available."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "import os\n",
        "import importlib.util\n",
        "\n",
        "modules = [\"pyodbc\", \"pandas\", \"pyarrow\", \"polars\", \"adbc_driver_manager\"]\n",
        "status = {m: importlib.util.find_spec(m) is not None for m in modules}\n",
        "print(\"Module availability:\", status)\n",
        "print(\"SQLSERVER_CONN set:\", bool(os.getenv(\"SQLSERVER_CONN\")))\n",
        "print(\"SQLSERVER_ADBC set:\", bool(os.getenv(\"SQLSERVER_ADBC\")))\n",
        "\n",
        "if pyodbc is not None:\n",
        "    try:\n",
        "        conn_str = os.getenv(\n",
        "            \"SQLSERVER_CONN\",\n",
        "            \"Driver={ODBC Driver 18 for SQL Server};Server=localhost;Database=master;Trusted_Connection=yes;Encrypt=no\"\n",
        "        )\n",
        "        conn = pyodbc.connect(conn_str, timeout=5)\n",
        "        version_info = conn.cursor().execute(\"SELECT @@VERSION AS version_info\").fetchone()[0]\n",
        "        conn.close()\n",
        "        print(\"Connectivity test succeeded:\")\n",
        "        print(version_info[:200])\n",
        "    except Exception as e:\n",
        "        print(\"Connectivity test failed:\", repr(e))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Next Steps\n",
        "\n",
        "Arrow does not make bad SQL fast, remove network latency, or guarantee gains for every workload. But for large analytical result sets moving from SQL Server or Azure SQL into Python, it can reduce the cost of transfer and DataFrame construction by avoiding the traditional row-object detour.\n",
        "\n",
        "To validate this in your environment:\n",
        "\n",
        "1. Benchmark representative queries, not toy examples.\n",
        "2. Compare query time, fetch time, DataFrame build time, and end-to-end notebook completion time.\n",
        "3. Test realistic downstream operations in Polars, PyArrow, or Arrow-aware pandas paths.\n",
        "4. Watch for cases where code converts data back into Python objects and erases the benefit.\n",
        "5. Decide based on your actual bottleneck: query execution, transfer, or DataFrame construction.\n",
        "\n",
        "If your team still treats SQL-to-Python transfer as solved plumbing, this notebook is a good starting point for revisiting that assumption."
      ]
    }
  ]
}