Production MCP server, skeleton to deployed

Model Context Protocol is the open spec for letting an AI assistant call your tools, read your resources, and use your prompts. Most MCP servers people show you online are toy demos. This is the full path from skeleton to a hosted production MCP server you can plug into Claude Desktop, ChatGPT, or any compatible client.

What MCP actually is

MCP is a transport-agnostic JSON-RPC contract between an AI client and your server. The client (Claude Desktop, a ChatGPT custom connector, Zed, Cursor, anything that speaks the spec) discovers three kinds of capability you expose: tools the model can call, resources the model can read, and prompts the user can invoke as templates. Underneath, the wire format is JSON-RPC 2.0. The transport can be stdio (the server is a subprocess of the client), streamable HTTP (the server runs anywhere and the client connects over the network), or the older HTTP+SSE transport that streamable HTTP replaced. Everything else (schemas, authentication, capability negotiation) sits on top of those primitives.
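Here is what that wire format looks like in practice. On the stdio transport, each JSON-RPC message is one newline-terminated line of JSON with no embedded newlines. The sketch below shows that framing for a `tools/call` request. The helpers `make_request`, `encode_message` and `decode_messages` are hypothetical, since the SDKs handle framing for you.

```python
import json
from typing import Any, Iterator


def make_request(req_id: int, method: str, params: dict[str, Any]) -> dict[str, Any]:
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}


def encode_message(msg: dict[str, Any]) -> bytes:
    """Serialise one message as a single newline-terminated line.

    json.dumps escapes newlines inside strings as \\n, so the payload never
    contains a raw newline that could break the line-based framing.
    """
    return (json.dumps(msg, separators=(",", ":")) + "\n").encode("utf-8")


def decode_messages(buffer: bytes) -> Iterator[dict[str, Any]]:
    """Split a buffer of newline-delimited JSON back into messages."""
    for line in buffer.splitlines():
        if line.strip():
            yield json.loads(line)


call = make_request(1, "tools/call", {"name": "server_time", "arguments": {}})
wire = encode_message(call)  # bytes ending in exactly one b"\n"
```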

The reason MCP matters is not the spec itself. It is that you write the server once and any compliant client gets your capabilities for free. Before MCP, every assistant had its own plugin format and you wrote the same Postgres query tool four times.

Python or TypeScript

Of the SDKs available, two are the obvious picks. FastMCP in Python builds on the official Python SDK with a FastAPI-style decorator API; it is the fastest way to ship a server on your own. @modelcontextprotocol/sdk in TypeScript is the official Node implementation; it slots naturally into an existing Express or Fastify app.

My default for solo work is Python with FastMCP. The decorator syntax is tight, the type hints become tool schemas automatically, and the deploy story (a single uv project, one Dockerfile) is boring in a good way. I reach for TypeScript when the MCP server lives inside a Node service that already exists, because shoving a Python sidecar next to it is more pain than just using the TS SDK.

An MCP server is not a microservice. It is the typed surface a model uses to act on your world, and the failure mode of a bad one is the model lying convincingly.

The skeleton

New folder, fresh uv project, one dependency, one file. This is the smallest valid MCP server: a single tool that returns the server time.

bash
mkdir mcp-acme && cd mcp-acme
uv init --package
uv add fastmcp
mkdir -p src/mcp_acme
touch src/mcp_acme/server.py

python
# src/mcp_acme/server.py
from datetime import datetime, timezone
from fastmcp import FastMCP

mcp = FastMCP(name="acme", version="0.1.0")

@mcp.tool
def server_time() -> dict:
    """Return the server's current time in ISO 8601 UTC."""
    now = datetime.now(timezone.utc)
    return {"iso": now.isoformat(), "epoch": int(now.timestamp())}

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport

Run it with uv run python -m mcp_acme.server and the process sits on stdin/stdout waiting for a client. That is the entire contract for local use. The next sections add what makes it useful and what makes it production grade.
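Once a client launches the process, the client speaks first. The sketch below shows the opening lines it writes to the server's stdin, following the MCP lifecycle. It sends an `initialize` request, then an `initialized` notification (no `id`, so no reply), then ordinary requests such as `tools/list` and `tools/call`. The protocol version string and `clientInfo` values are illustrative assumptions; a real client negotiates the version with the server.

```python
import json

# Illustrative value; the server answers with the version it actually supports.
PROTOCOL_VERSION = "2025-06-18"

handshake = [
    {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.0.1"},
        },
    },
    # A notification: it carries no "id", so the server sends no response.
    {"jsonrpc": "2.0", "method": "notifications/initialized"},
    {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    {
        "jsonrpc": "2.0",
        "id": 3,
        "method": "tools/call",
        "params": {"name": "server_time", "arguments": {}},
    },
]

# One JSON object per line, in order, on the server's stdin.
stdin_payload = "".join(json.dumps(m) + "\n" for m in handshake)
```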

A real tool

Toy tools are a trap because they let you skip the parts that bite later: parameter validation, structured responses, error shape. Here is a Postgres query tool with a Pydantic input model, a constrained result shape, and explicit error handling. The model passes you a SQL string; you decide what to allow.

python
# src/mcp_acme/tools/db.py
import os
from typing import Literal

import asyncpg
from fastmcp import FastMCP
from pydantic import BaseModel, Field

mcp = FastMCP(name="acme-db", version="0.1.0")

class QueryArgs(BaseModel):
    sql: str = Field(..., description="A single SELECT statement.")
    limit: int = Field(100, ge=1, le=1000)

class Row(BaseModel):
    data: dict

class QueryResult(BaseModel):
    status: Literal["ok", "error"]
    rows: list[Row] = Field(default_factory=list)
    error: str | None = None

POOL: asyncpg.Pool | None = None

async def get_pool() -> asyncpg.Pool:
    # Create the pool lazily on first use, then reuse it.
    global POOL
    if POOL is None:
        POOL = await asyncpg.create_pool(dsn=os.environ["DATABASE_URL"], min_size=1, max_size=5)
    return POOL

@mcp.tool
async def db_query(args: QueryArgs) -> QueryResult:
    """Run a read-only SQL query. SELECT only; anything else is rejected."""
    stmt = args.sql.strip().rstrip(";")
    # First line of defence: reject anything that does not start with SELECT.
    if not stmt.lower().startswith("select"):
        return QueryResult(status="error", error="Only SELECT statements are permitted.")
    pool = await get_pool()
    try:
        async with pool.acquire() as conn:
            # Second line of defence: a read-only transaction. A bare
            # "SET TRANSACTION READ ONLY" outside a transaction block has no
            # effect, so open the transaction explicitly.
            async with conn.transaction(readonly=True):
                # Wrapping in a subquery keeps queries that already have a LIMIT
                # valid; the cap is bound as $1 rather than formatted into SQL.
                records = await conn.fetch(f"SELECT * FROM ({stmt}) AS q LIMIT $1", args.limit)
        return QueryResult(status="ok", rows=[Row(data=dict(r)) for r in records])
    except Exception as e:
        # Return failures as data so the model sees the error text.
        return QueryResult(status="error", error=str(e))

Two things to notice. First, the schema the client sees is generated from QueryArgs; the model gets useful hints (the docstring, the field descriptions, the bounds on limit) without you writing JSON schema by hand. Second, the dangerous verbs (anything that mutates) are rejected at the application layer, and the database transaction is opened read-only as a second line of defence. You want both layers: belt and braces.
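The application-layer check can be pulled out into a pure function that is easy to unit test. The sketch below assumes you keep the SELECT-only rule, apply the row cap by wrapping the statement in a subquery, and pass the cap as a bind parameter. `build_limited_query` is a hypothetical helper, and it is a coarse first filter rather than a SQL parser; the read-only transaction is still the real guarantee.

```python
def build_limited_query(sql: str, limit: int, max_limit: int = 1000) -> tuple[str, int]:
    """Validate a model-supplied SELECT; return (query, limit) for a $1 bind.

    Raises ValueError for anything that is not a single SELECT statement.
    """
    stmt = sql.strip().rstrip(";").strip()
    if not stmt.lower().startswith("select"):
        raise ValueError("Only SELECT statements are permitted.")
    if ";" in stmt:
        # Deliberately conservative: rejects stacked statements, and also
        # any semicolon inside a string literal.
        raise ValueError("Only a single statement is permitted.")
    if not 1 <= limit <= max_limit:
        raise ValueError(f"limit must be between 1 and {max_limit}")
    return f"SELECT * FROM ({stmt}) AS q LIMIT $1", limit
```

Inside the tool you would then run `await conn.fetch(query, limit)` within the read-only transaction.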

Resources and subscriptions

Resources are the read-only side. They are addressed by URI and returned as text or binary, and clients can subscribe to a URI to be notified when its content changes. A clean use case is exposing a folder of markdown notes as resource://notes/<slug>; the assistant can list them, read them, and react when one is edited.

python
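# src/mcp_acme/resources/notes.py
import asyncio
from contextlib import asynccontextmanager
from pathlib import Path

from fastmcp import FastMCP
from watchfiles import awatch

NOTES_DIR = Path("/var/data/notes")

async def notify_resource_updated(uri: str) -> None:
    # Placeholder: check whether your FastMCP version ships a broadcast helper.
    # The low-level SDK sends resources/updated per session with
    # ServerSession.send_resource_updated(uri); wire this to your subscribers.
    ...

async def watch_notes() -> None:
    # awatch yields a set of (change, path) pairs each time files change.
    async for changes in awatch(NOTES_DIR):
        for _, path_str in changes:
            slug = Path(path_str).stem
            await notify_resource_updated(f"resource://notes/{slug}")

@asynccontextmanager
async def lifespan(server: FastMCP):
    # Start the watcher with the server and cancel it on shutdown.
    task = asyncio.create_task(watch_notes())
    try:
        yield {}
    finally:
        task.cancel()

mcp = FastMCP(name="acme-notes", version="0.1.0", lifespan=lifespan)

@mcp.resource("resource://notes/{slug}")
async def read_note(slug: str) -> str:
    # The {slug} placeholder makes this a resource template.
    path = (NOTES_DIR / f"{slug}.md").resolve()
    # Refuse anything that resolves outside NOTES_DIR.
    if path.parent != NOTES_DIR.resolve() or not path.is_file():
        raise FileNotFoundError(slug)
    return path.read_text(encoding="utf-8")

@mcp.resource("resource://notes")
async def list_notes() -> list[dict]:
    # A static index resource; the returned list is serialised to JSON.
    return [
        {"uri": f"resource://notes/{p.stem}", "name": p.stem, "mimeType": "text/markdown"}
        for p in sorted(NOTES_DIR.glob("*.md"))
    ]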

The notify_resource_updated call is the subscription pivot. A client that subscribed to resource://notes/foo will be told that resource changed; it then re-reads at its leisure. This is how you build live context surfaces (a Jira board, a Linear cycle, a build queue) without inventing your own pub/sub.
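Under the hood, a subscription is bookkeeping. The server remembers which sessions sent `resources/subscribe` for which URI, then fans each change out to exactly those sessions as a `notifications/resources/updated` message. Below is a stdlib sketch of that registry. It assumes each session is represented by an `asyncio.Queue` of outgoing messages, and `SubscriptionRegistry` is hypothetical, not an API from FastMCP or the MCP SDK.

```python
import asyncio
from collections import defaultdict


class SubscriptionRegistry:
    """Track which sessions subscribed to which resource URIs."""

    def __init__(self) -> None:
        self._subs: dict[str, set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, uri: str, session: asyncio.Queue) -> None:
        self._subs[uri].add(session)

    def unsubscribe(self, uri: str, session: asyncio.Queue) -> None:
        self._subs[uri].discard(session)

    async def notify_updated(self, uri: str) -> int:
        """Queue a resources/updated notification for each subscriber; return the count."""
        message = {
            "jsonrpc": "2.0",
            "method": "notifications/resources/updated",
            "params": {"uri": uri},
        }
        sessions = list(self._subs.get(uri, ()))
        for queue in sessions:
            await queue.put(message)
        return len(sessions)
```

The notification carries only the URI. The client follows up with its own `resources/read` when it wants the new content.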

Auth across transports

Auth is the part where toy servers die. There are three realistic modes and you should know which you are in.

Stdio for local Claude Desktop. The server runs as a subprocess of the client, under the user's own account. There is no auth; the OS is the perimeter. Do not expose anything via stdio that the local user should not already be able to do.

HTTP with a bearer token for single-tenant remote. You run the server somewhere, generate a long random token for yourself, and the server checks it on every request. It takes a few lines of middleware. Suitable for personal MCP servers you host for yourself.

python
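# src/mcp_acme/http.py
import hmac
import os

from fastmcp import FastMCP
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse

mcp = FastMCP(name="acme", version="0.1.0")
EXPECTED = os.environ["MCP_BEARER"]  # injected as a secret, never baked into the image

class BearerAuth(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Let the platform's health checks through unauthenticated.
        if request.url.path == "/healthz":
            return await call_next(request)
        header = request.headers.get("authorization", "")
        # Constant-time comparison, so response timing does not leak the token.
        if not hmac.compare_digest(header.encode(), f"Bearer {EXPECTED}".encode()):
            return JSONResponse({"error": "unauthorised"}, status_code=401)
        return await call_next(request)

async def healthz(_request):
    return JSONResponse({"ok": True})

# Starlette app that serves the MCP endpoint.
app = mcp.http_app()
app.add_middleware(BearerAuth)
# add_route rather than the deprecated @app.route decorator.
app.add_route("/healthz", healthz, methods=["GET"])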

OAuth for hosted multi-tenant. When the server is something many users connect to from many clients, you need real OAuth. The spec defines a discovery document, a dynamic client registration endpoint, and the usual authorisation code flow with PKCE. FastMCP exposes hooks for this; in practice you delegate to an identity provider (Auth0, WorkOS, your own Keycloak) and let it mint the tokens. Do not roll your own OAuth server.
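Even when an identity provider does the heavy lifting, it helps to know what PKCE adds to the code flow. The client invents a random `code_verifier` and sends only its SHA-256 hash (the `code_challenge`, method `S256`) with the authorisation request. When it exchanges the code for a token, it sends the verifier itself to prove it started the flow. Below is a stdlib sketch of the RFC 7636 computation; the function names are illustrative.

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier() -> str:
    """43-char URL-safe random string (32 bytes), the minimum length RFC 7636 allows."""
    return secrets.token_urlsafe(32)


def s256_challenge(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), padding stripped."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify(verifier: str, stored_challenge: str) -> bool:
    """The check the authorisation server runs when the client redeems the code."""
    return hmac.compare_digest(s256_challenge(verifier), stored_challenge)
```

In production the identity provider runs `verify`. The point is that an intercepted authorisation code is useless without the verifier, which never travelled with the authorisation request.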

Deploying it

One container, one machine, one reverse proxy. The same recipe works on Fly.io, a Hetzner CX22, a Render service, or a tiny DigitalOcean Droplet.

dockerfile
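# Dockerfile
FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir uv
# Install dependencies first so this layer stays cached across code changes.
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev --no-install-project
# Then copy the package (and the README its pyproject references) and install it.
COPY README.md ./
COPY src ./src
RUN uv sync --frozen --no-dev
ENV PYTHONUNBUFFERED=1 MCP_TRANSPORT=http PORT=8080 PATH="/app/.venv/bin:$PATH"
EXPOSE 8080
# Run the venv's uvicorn directly; plain `uv run` re-syncs the environment
# (dev dependencies included) before starting.
CMD ["uvicorn", "mcp_acme.http:app", "--host", "0.0.0.0", "--port", "8080"]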

On Fly.io, fly launch reads the Dockerfile, picks a region, and gives you a TLS-terminated public URL. On a VPS, put Caddy in front and it handles certificates for you.

caddy
# /etc/caddy/Caddyfile
mcp.example.com {
  encode zstd gzip
  reverse_proxy localhost:8080
  header /healthz Cache-Control "no-store"
}

Add a health endpoint (the example above exposes /healthz) so your platform can restart the container when it goes sideways. Wire the bearer token in as a secret, never in the image.

Wiring clients

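Claude Desktop is configured by a JSON file at ~/Library/Application Support/Claude/claude_desktop_config.json on macOS, or %APPDATA%\Claude\claude_desktop_config.json on Windows. Servers are listed under mcpServers, and a stdio entry is a command with args (plus an optional env block). The file describes processes for Claude Desktop to launch. A remote HTTP server is therefore either added through the app's connector settings or bridged into a local stdio process with a proxy such as the mcp-remote npm package, as the second entry does.

json
{
  "mcpServers": {
    "acme-local": {
      "command": "uv",
      "args": ["--directory", "/Users/sarma/code/mcp-acme", "run", "python", "-m", "mcp_acme.server"]
    },
    "acme-remote": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://mcp.example.com/mcp", "--header", "Authorization:${AUTH_HEADER}"],
      "env": {
        "AUTH_HEADER": "Bearer <your-token>"
      }
    }
  }
}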

ChatGPT custom connectors take the public URL and walk you through OAuth or bearer setup in the UI; there is no config file to edit. Cursor, Zed, and Continue each read MCP server lists from their own settings, in a broadly similar shape: command plus args for stdio, or a URL (with headers, where the client supports them) for remote servers. Restart the client after editing, because not every client reliably picks up changes to its server list without one.

Observability

An MCP server is a tool API the model can call hundreds of times per session. You want to see what was called, with what arguments, how long it took, and whether it errored. Structured JSON logs to stdout are the floor; OpenTelemetry traces are the ceiling that pays for itself the first time a tool starts misbehaving.

Wrap every tool with a logging decorator: log the tool name, a hash of the arguments (never the raw arguments, they may contain secrets), the duration, and the outcome. Push traces to a free Honeycomb or Grafana Cloud Tempo endpoint via the OTel exporter; the FastMCP-as-Starlette app instruments cleanly with opentelemetry-instrumentation-starlette.
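A minimal sketch of that decorator for async tools follows. It assumes JSON lines written to stderr through the stdlib `logging` module, and `logged_tool` and `args_digest` are hypothetical names. You would stack it beneath `@mcp.tool` so FastMCP registers the wrapped function. `functools.wraps` keeps the original signature visible to `inspect.signature`, which is what schema generation typically reads, but check that against your FastMCP version.

```python
import functools
import hashlib
import json
import logging
import sys
import time
from typing import Any, Awaitable, Callable

log = logging.getLogger("mcp.tools")
_handler = logging.StreamHandler(sys.stderr)  # stderr: on stdio, stdout is the wire
_handler.setFormatter(logging.Formatter("%(message)s"))
log.addHandler(_handler)
log.setLevel(logging.INFO)


def args_digest(payload: dict[str, Any]) -> str:
    """Stable short hash of the arguments; raw values never reach the log."""
    canonical = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def logged_tool(fn: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
    @functools.wraps(fn)
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        outcome = "ok"
        try:
            return await fn(*args, **kwargs)
        except Exception:
            outcome = "error"
            raise
        finally:
            # One JSON line per call: name, argument hash, duration, outcome.
            log.info(json.dumps({
                "tool": fn.__name__,
                "args_sha256": args_digest({"args": args, "kwargs": kwargs}),
                "duration_ms": round((time.perf_counter() - start) * 1000, 2),
                "outcome": outcome,
            }))
    return wrapper
```

An unkeyed SHA-256 of a low-entropy argument (a short PIN, say) can be brute-forced. If that matters, swap in `hmac.new(key, data, "sha256")` with a server-side key.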

Alert on three signals. Tool error rate above 1% over five minutes (something is misconfigured). P95 tool latency above 3 seconds (the model will keep retrying and your bill will balloon). Auth failure rate spike (someone is probing your bearer token). Everything else is nice to know; those three will save your weekend.

Pitfalls

Stdout pollution killing stdio transport

Stdio MCP uses stdout as the wire. A stray print(), a logging handler pointed at stdout, or a library that writes warnings to stdout will corrupt the message stream, and the client will silently disconnect. Route every log to stderr, period: call logging.basicConfig(stream=sys.stderr) at the entrypoint, before anything else logs.
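A minimal entrypoint sketch, assuming you configure logging before the server starts. `configure_logging` is a hypothetical helper. `force=True` replaces any root handler a library installed earlier, and `logging.captureWarnings(True)` routes `warnings.warn()` output through the same stderr handler.

```python
import logging
import sys


def configure_logging(level: int = logging.INFO) -> None:
    """Send every log record to stderr; on stdio, stdout belongs to the protocol."""
    logging.basicConfig(
        stream=sys.stderr,
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,  # remove root handlers installed earlier (Python 3.8+)
    )
    # Route warnings.warn() output through logging as well.
    logging.captureWarnings(True)


if __name__ == "__main__":
    configure_logging()
    # ...then start the server, e.g. mcp.run()
```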

Exposing destructive tools by accident

Any tool that writes, deletes, or sends is a tool the model will eventually call at the wrong moment. Default destructive verbs to a dry-run mode, require an explicit confirm=true argument, and log every invocation with full arguments to an append-only audit table you own.
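Here is a stdlib sketch of that pattern. The tool defaults to a dry run, acts only when `confirm=True`, and records every call with full arguments in an audit table. `delete_customer`, the table schemas and the sample rows are hypothetical. An in-memory SQLite database stands in for whatever store you own, and two triggers enforce append-only at the database level.

```python
import json
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE audit_log (
        ts REAL NOT NULL, tool TEXT NOT NULL, args TEXT NOT NULL, mode TEXT NOT NULL
    );
    -- Append-only: reject updates and deletes of existing audit rows.
    CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
    CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
    -- Demo data for the example tool.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    """
)


def audit(tool: str, args: dict, mode: str) -> None:
    with db:  # commit the audit row immediately
        db.execute(
            "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
            (time.time(), tool, json.dumps(args, sort_keys=True), mode),
        )


def delete_customer(customer_id: int, confirm: bool = False) -> dict:
    """Destructive tool: a dry run unless confirm is explicitly True."""
    args = {"customer_id": customer_id, "confirm": confirm}
    row = db.execute("SELECT id, name FROM customers WHERE id = ?", (customer_id,)).fetchone()
    if confirm is not True:
        audit("delete_customer", args, "dry_run")
        return {"status": "dry_run", "would_delete": row, "hint": "re-call with confirm=true"}
    # Log the attempt before acting on it.
    audit("delete_customer", args, "executed")
    with db:
        db.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
    return {"status": "deleted", "deleted": row}
```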

No schema validation on tool inputs

If you accept dict or Any, the model will pass you arguments of the wrong shape: numbers as strings, missing fields, nested objects you never planned for. Then you will be debugging at 2am. Use Pydantic models (or zod in TS) on every tool input, with bounds and enums where they apply. The model also gets a better schema as a side effect.

Missing rate limits and timeouts

A confused model can call your search tool a hundred times in thirty seconds. Put a per-token rate limit (slowapi in Python, express-rate-limit in TS) at the HTTP layer, and set a hard timeout on every external call your tools make. Without both, one bad session can drain a downstream API quota.

Shipping without a /healthz

Platforms reach for a health endpoint to decide when to restart your container. No endpoint means the platform never restarts a hung process and your clients keep getting 502s. Add /healthz that returns 200 only when the database pool is alive and the worker loop is running.

Wrap up

The leap from a toy MCP server to a production one is mostly about the boring half: typed inputs, real auth, a deploy pipeline, observability, and the discipline to assume the model will call your tools in ways you did not anticipate. The spec gives you the wire format for free; the engineering gives you a server you can leave running for months without flinching. Build the skeleton in an hour, harden it over an afternoon, and you have an asset that every new AI client gets to use without you writing another integration.

Want this done for you?

If you would rather skip the yak shave and have someone who has done this fifty times set it up properly, that is what I do for a living.
