Running inference for an open-source model on your own machine is fun for a weekend. Doing it for paying users is a different problem: you need GPUs that wake on demand, scale to zero when nobody is calling, and bill you per second rather than per month. The traditional answer is to rent a Kubernetes cluster, install Triton, configure autoscaling, and lose a quarter of engineering time to ops. The 2025 answer that most AI startups settled on is Modal Labs.

Modal lets you decorate a Python function, point it at a GPU, and ship it. There is no Dockerfile to maintain, no cluster to operate, no cold-start gymnastics to write. You write Python, you modal deploy, and a globally distributed Firecracker fleet runs your code on an H100 in under five seconds. This tutorial walks through the entire workflow end-to-end and finishes with a working Whisper transcription API you can call from any client.

What You'll Build

A serverless Whisper transcription service that:

Accepts an audio URL over HTTP
Loads the large-v3 Whisper model on an A10G GPU
Caches the model weights so subsequent calls hit a warm container
Returns timestamped transcripts as JSON
Scales automatically from zero to many concurrent containers
Costs roughly two cents per minute of audio processed

By the end you will understand Modal's mental model: images, functions, volumes, secrets, and web endpoints — and you will have shipped a real GPU API you can hand to a frontend.

Prerequisites

Python 3.11 or higher
A Modal account from modal.com — the free tier includes thirty dollars of monthly credit, which is more than enough for this tutorial
Basic familiarity with Python decorators and type hints
A terminal and a code editor

Modal ships as a single PyPI package. Install it in a fresh virtual environment.

python -m venv .venv
source .venv/bin/activate
pip install modal

Authenticate against your account. The command opens a browser window and writes a token to ~/.modal.toml.

modal setup

Verify the install with a built-in command.

modal profile current

You should see your workspace name printed. If you do, the SDK is ready to deploy.

Create a file called hello.py. The mental model is simple: every Modal application starts with an App object, and every function you want to run remotely is decorated with @app.function.

import modal
 
app = modal.App("hello-modal")
 
@app.function()
def square(x: int) -> int:
    return x * x
 
@app.local_entrypoint()
def main():
    print(square.remote(7))

Run it.

modal run hello.py

The first invocation takes a few seconds because Modal builds a container image, pushes it to its registry, schedules it on a worker, and streams logs back to your terminal. Subsequent runs reuse the image and warm container, taking well under a second. The output is 49.

The important detail is square.remote(7). .local() would run the function in your process. .remote() ships the call to Modal's infrastructure. Everything in this tutorial relies on that single distinction.

Step 3: Defining a Custom Image

Whisper needs PyTorch, the openai-whisper package, and ffmpeg. Modal lets you describe the image declaratively with a fluent builder. No Dockerfile required.

Create whisper_api.py.

import modal
 
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("ffmpeg")
    .pip_install(
        "openai-whisper==20240930",
        "torch==2.5.1",
        "requests==2.32.3",
    )
)
 
app = modal.App("whisper-api", image=image)

Modal hashes the image specification and caches the result. The first deploy builds and pushes the layers; later deploys reuse them. You can override the image per-function if a small endpoint does not need the heavy ML stack.

Step 4: Picking a GPU and Mounting a Volume

Whisper large-v3 is roughly three gigabytes. Downloading it on every cold start would be wasteful. Modal solves this with Volumes — persistent block storage that any function can mount at a path.

Append the following to whisper_api.py.

model_cache = modal.Volume.from_name("whisper-cache", create_if_missing=True)
 
@app.function(
    gpu="A10G",
    volumes={"/cache": model_cache},
    timeout=600,
)
def transcribe(audio_url: str) -> dict:
    import whisper
    import requests
    import tempfile
    import os
 
    os.environ["XDG_CACHE_HOME"] = "/cache"
    model = whisper.load_model("large-v3", download_root="/cache/whisper")
 
    response = requests.get(audio_url, timeout=120)
    response.raise_for_status()
 
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(response.content)
        audio_path = f.name
 
    result = model.transcribe(audio_path, verbose=False)
    return {
        "text": result["text"],
        "segments": [
            {"start": s["start"], "end": s["end"], "text": s["text"]}
            for s in result["segments"]
        ],
        "language": result["language"],
    }

Several things deserve attention. The gpu="A10G" argument picks a twenty-four-gigabyte Ampere card — cheap, fast, and oversized for large-v3. You can also pick T4, L4, A100, H100, or even H100:8 for multi-GPU. The volumes mapping mounts the persistent volume at /cache inside the container, and XDG_CACHE_HOME redirects PyTorch and Whisper to write weights there. The first call downloads three gigabytes; every subsequent call reads from cache in milliseconds.

The timeout=600 raises the per-call ceiling to ten minutes because long audio files take time to transcribe.

Step 5: Exposing a Web Endpoint

Modal functions can be called from other Python code, but the more common pattern is to expose them as HTTPS endpoints. The @modal.fastapi_endpoint decorator turns a function into a public URL.

Add this block to whisper_api.py.

@app.function(image=image)
@modal.fastapi_endpoint(method="POST", docs=True)
def transcribe_endpoint(audio_url: str):
    return transcribe.remote(audio_url)

Deploy the app.

modal deploy whisper_api.py

Modal prints a URL that looks like https://your-workspace--whisper-api-transcribe-endpoint.modal.run. The docs=True flag means appending /docs to that URL gives you an interactive Swagger UI for free.

Test it from your terminal.

curl -X POST "https://your-workspace--whisper-api-transcribe-endpoint.modal.run?audio_url=https://example.com/sample.mp3"

The first call takes around twenty seconds while Modal provisions a GPU and downloads the model. Subsequent calls within the keep-warm window finish in two to four seconds for a one-minute clip.

Step 6: Concurrency, Containers, and Scaling

By default each container handles one request at a time. Whisper is GPU-bound and benefits from batching, but for clarity we will keep one-per-container and let Modal scale horizontally.

Modify the decorator.

@app.function(
    gpu="A10G",
    volumes={"/cache": model_cache},
    timeout=600,
    min_containers=1,
    max_containers=20,
    scaledown_window=300,
)
def transcribe(audio_url: str) -> dict:
    ...

The three knobs that matter:

min_containers=1 keeps one container always warm. Cold starts disappear at the cost of running one GPU continuously. Drop to zero if you want pure pay-per-use.
max_containers=20 caps the autoscaler. Useful guardrail against runaway traffic and runaway bills.
scaledown_window=300 keeps containers alive for five minutes after their last request, so bursty traffic does not pay cold-start tax repeatedly.

For batching multiple requests inside a single container, use @modal.concurrent(max_inputs=4) on the function. Whisper handles this well because GPU memory dwarfs the audio batch.

Step 7: Adding Secrets

Real APIs need authentication. Modal stores secrets in a managed vault and injects them as environment variables at runtime. Create a secret in the dashboard or via CLI.

modal secret create whisper-api API_KEY=your-long-random-token-here

Attach it to the endpoint and validate inside.

from fastapi import HTTPException, Header
import os
 
@app.function(
    image=image,
    secrets=[modal.Secret.from_name("whisper-api")],
)
@modal.fastapi_endpoint(method="POST")
def transcribe_endpoint(
    audio_url: str,
    authorization: str = Header(None),
):
    expected = f"Bearer {os.environ['API_KEY']}"
    if authorization != expected:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return transcribe.remote(audio_url)

Redeploy. Calls without the correct Authorization header now return 401.

Step 8: Scheduled Jobs

A common second use case for Modal is cron-style batch work. The @app.function(schedule=...) decorator handles it without external infrastructure.

@app.function(schedule=modal.Cron("0 3 * * *"))
def nightly_cleanup():
    import datetime
    print(f"Running cleanup at {datetime.datetime.now()}")

After deploy, Modal runs this function at 03:00 UTC every day. The function inherits the app image; you can override per function with a separate image= argument. There is no calendar to configure, no Airflow DAG to write — the decorator is the schedule.

Step 9: Local Development Loop

Iterating against a remote GPU sounds slow. Modal solves the loop with modal serve.

modal serve whisper_api.py

This deploys an ephemeral version that reloads on every file save, prints logs locally, and gives you a temporary URL. You edit, save, and the next request hits the new code in roughly two seconds. It is the fastest serverless GPU development experience available in 2026.

Use modal serve while building, then modal deploy once you are ready for production.

Step 10: Observability and Logs

Every function invocation produces structured logs viewable in three places.

The CLI stream during development.

modal app logs whisper-api

The web dashboard at modal.com/apps, with per-function tabs for invocations, errors, and GPU utilization.

Programmatic access via the modal Python SDK if you want to ship logs to your own observability stack like Datadog or Grafana.

For deeper traces, wrap critical sections in with modal.Status("doing thing"): blocks; the status text shows up in the dashboard timeline so you can see which step of a long function is currently running.

Testing Your Implementation

Run the full workflow against a real audio sample.

modal deploy whisper_api.py
curl -X POST \
  -H "Authorization: Bearer your-long-random-token-here" \
  "https://your-workspace--whisper-api-transcribe-endpoint.modal.run?audio_url=https://github.com/openai/whisper/raw/main/tests/jfk.flac"

You should receive JSON with the full transcript, a list of segments with timestamps, and the detected language. Cold path: about twenty seconds. Warm path: about three seconds for the fifteen-second JFK clip.

Check the dashboard. You should see the invocation, GPU utilization graph spiking during transcription, and a per-call cost of around half a cent.

Troubleshooting

The first deploy hangs at "Building image". Modal is building the PyTorch image, which is large. Expect three to five minutes on first build. Subsequent deploys reuse the cached layers and finish in seconds.

CUDA out of memory errors. Switch the GPU to a larger card such as A100 or use whisper.load_model("medium") instead of large-v3 while debugging.

Cold starts feel slow even with cache. The volume mount is fast, but PyTorch still imports CUDA libraries which takes several seconds. Use min_containers=1 to keep a warm container around for latency-sensitive endpoints.

modal deploy fails with import errors locally. Modal does not import your file the same way Python does. Imports inside the function body — like import whisper inside transcribe — run in the container, not locally, which is why we put heavy imports there.

Next Steps

Add a webhook for asynchronous transcription of long files: return a job ID immediately and have a second function POST the result to a callback URL when done.
Stream partial transcripts by switching to faster-whisper and using Modal's yield support inside generators.
Layer a Next.js or Streamlit frontend on top — Modal endpoints are plain HTTPS, so any client can call them.
Explore Mastra AI Agents and Workflows for orchestrating Modal endpoints from a TypeScript agent.

Conclusion

Modal Labs collapses the GPU deployment story from "spin up Kubernetes, configure autoscaling, fight container registries" into "decorate a Python function and run modal deploy." For AI inference, scheduled batch jobs, and any workload that needs occasional GPU bursts, it removes more friction than any other platform on the market in 2026.

You now have a production-shaped pattern: declarative images, persistent volumes for model caches, GPU selection per function, web endpoints with secrets, autoscaling controls, and a tight local development loop. Apply it to your next AI feature and skip the cluster.