Stop emailing CSVs at midnight. This tutorial walks through a complete KoboToolbox dashboard pipeline for NGOs: Python ETL, Postgres warehouse, an AI layer for anomaly detection and multilingual transcript summarisation, and donor PDFs generated in English, French and Arabic from the same dataset.
The KoboToolbox Export Problem
If you are an M&E officer at a Tunisian, Moroccan or Lebanese NGO running a donor-funded programme, your week probably looks like this:
- Monday morning, you download a CSV export from KoboToolbox.
- The field team in Sidi Bouzid uploads thirty more submissions Monday afternoon. Your CSV is already stale.
- The USAID reporting deadline is Friday. The donor wants a PDF in English, the steering committee wants the same numbers in French, and the local partner needs them in Arabic.
- You spend Friday night in Excel reconciling three exports, then in PowerPoint copy-pasting numbers into a donor template.
This is not a tooling problem — KoboToolbox is excellent at what it does. It is a pipeline problem. Kobo is a form runtime, not a reporting platform. The fix is to treat each Kobo submission as a fact in a proper data warehouse, then layer reporting and AI on top.
This tutorial is the practical companion to our pillar piece on AI M&E dashboards for MENA NGOs. Read that for the architectural reasoning; come here when you want to actually build it.
What You Will Build
By the end of this tutorial, you will have:
- A Python ETL job that pulls KoboToolbox submissions through the v2 REST API on a nightly schedule
- A Postgres warehouse with three tables: raw submissions, denormalised indicators, and agent flags
- An AI agent layer that flags anomalies and summarises Arabic/French voice notes into English for donor reports
- A PDF reporting engine using WeasyPrint that produces the same dataset in EN, FR and AR from a single Jinja2 template
- A total operating cost under $30/month for a programme of up to 50,000 submissions
Everything is open source. The only paid component is the LLM API calls (a few dollars a month at NGO volumes).
Prerequisites
Before you start:
- Python 3.11+ and pip
- A KoboToolbox account with at least one deployed form. The community server (kf.kobotoolbox.org) is fine.
- A Postgres 15+ database. For testing, a local Docker container is enough (see the snippet below). For production, a $6/month VPS instance from Hetzner or DigitalOcean works.
- An API key from an LLM provider (we use Claude via Anthropic, but the code is provider-agnostic)
- Familiarity with requests, pandas, and basic SQL
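If you do not have a test database handy, a throwaway Docker container works. The credentials below are placeholders — match them to whatever you put in DATABASE_URL:
docker run -d --name ngo-warehouse \
  -e POSTGRES_PASSWORD=devpassword -e POSTGRES_DB=ngo \
  -p 5432:5432 postgres:15
export DATABASE_URL="postgresql://postgres:devpassword@localhost:5432/ngo"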
Pipeline Architecture
Here is the end-to-end pipeline you will build. Each box is independently replaceable.
flowchart LR
A[Field Teams<br/>Sidi Bouzid, Tataouine,<br/>Mafraq, Casablanca] -->|XLSForm offline| B[KoboToolbox<br/>Server]
B -->|v2 REST API| C[Python ETL<br/>nightly cron]
C -->|append-only| D[(Postgres<br/>warehouse)]
D --> E[AI Agent Layer<br/>Anomaly + Summary]
E -->|writes flags| D
D --> F[WeasyPrint<br/>PDF Renderer]
F --> G[Donor PDF<br/>EN/FR/AR]
D --> H[Web Dashboard<br/>internal team]
The two design rules that matter:
- The warehouse is the source of truth. Every downstream consumer (PDFs, dashboards, agents) reads from Postgres. Nothing reads from Kobo directly except the ETL.
- AI writes flags, never overwrites data. Agents can mark a submission as suspicious or attach a summary. They cannot mutate the underlying indicator values.
Step 1: KoboToolbox API Basics
KoboToolbox exposes a REST API at https://kf.kobotoolbox.org/api/v2/. You authenticate with a long-lived token, which you generate from your account settings page.
Three endpoints matter for our pipeline:
- GET /api/v2/assets/{asset_uid}/ — form metadata, schema, deployment URL
- GET /api/v2/assets/{asset_uid}/data/ — paginated submission data as JSON
- GET /api/v2/assets/{asset_uid}/data/{submission_id}/attachments/{attachment_id}/ — file attachments (images, audio)
The asset_uid is the short ID in your form URL — looks like aJk8Zq3wY7nF2cQv9pXt. Rate limit is roughly 60 requests/minute. Submissions are paginated with ?start=N&limit=M query parameters.
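Once you have a token (next snippet), pulling the first page of 100 submissions looks like this — the token and asset UID below are placeholders:
curl -H "Authorization: Token YOUR_TOKEN" \
  "https://kf.kobotoolbox.org/api/v2/assets/YOUR_ASSET_UID/data/?format=json&start=0&limit=100"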
Get your token (do this once, store it as an environment variable):
curl -X POST https://kf.kobotoolbox.org/token/ \
-d "username=YOUR_USER&password=YOUR_PASS"
# returns: {"token":"a1b2c3..."}
export KOBO_TOKEN="a1b2c3..."
export KOBO_ASSET="aJk8Zq3wY7nF2cQv9pXt"
Step 2: Python ETL — Pulling Submissions
Here is the core ETL. It is intentionally boring: pull JSON, upsert into Postgres, log what you did. Save as etl_kobo.py.
import os
import json
import logging
from datetime import datetime, timezone
import requests
import pandas as pd
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger(__name__)
KOBO_BASE = "https://kf.kobotoolbox.org/api/v2"
KOBO_TOKEN = os.environ["KOBO_TOKEN"]
KOBO_ASSET = os.environ["KOBO_ASSET"]
DB_URL = os.environ["DATABASE_URL"] # postgresql://user:pass@host/db
HEADERS = {"Authorization": f"Token {KOBO_TOKEN}"}
engine = create_engine(DB_URL, pool_pre_ping=True)
def fetch_submissions(asset_uid: str, page_size: int = 30000) -> list[dict]:
"""Pull all submissions for a deployed form, paginated."""
url = f"{KOBO_BASE}/assets/{asset_uid}/data/"
out, start = [], 0
while True:
r = requests.get(
url,
headers=HEADERS,
params={"start": start, "limit": page_size, "format": "json"},
timeout=60,
)
r.raise_for_status()
batch = r.json().get("results", [])
if not batch:
break
out.extend(batch)
log.info("fetched %d (total %d)", len(batch), len(out))
if len(batch) < page_size:
break
start += page_size
return out
def upsert_raw(submissions: list[dict]) -> int:
"""Append raw payloads to submissions table. Idempotent on _uuid."""
if not submissions:
return 0
rows = [
{
"uuid": s["_uuid"],
"submitted_at": s["_submission_time"],
"asset_uid": KOBO_ASSET,
"payload": json.dumps(s, ensure_ascii=False),
"ingested_at": datetime.now(timezone.utc).isoformat(),
}
for s in submissions
]
with engine.begin() as conn:
conn.execute(
text("""
INSERT INTO submissions (uuid, submitted_at, asset_uid, payload, ingested_at)
VALUES (:uuid, :submitted_at, :asset_uid, :payload, :ingested_at)
ON CONFLICT (uuid) DO NOTHING
"""),
rows,
)
return len(rows)
if __name__ == "__main__":
subs = fetch_submissions(KOBO_ASSET)
n = upsert_raw(subs)
log.info("ingested %d submissions (duplicates skipped)", n)The corresponding schema migration:
CREATE TABLE IF NOT EXISTS submissions (
uuid TEXT PRIMARY KEY,
submitted_at TIMESTAMPTZ NOT NULL,
asset_uid TEXT NOT NULL,
payload JSONB NOT NULL,
ingested_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX IF NOT EXISTS submissions_submitted_at_idx ON submissions(submitted_at);
CREATE INDEX IF NOT EXISTS submissions_asset_uid_idx ON submissions(asset_uid);
Run it nightly with cron:
30 2 * * * cd /opt/ngo-etl && /usr/bin/python3 etl_kobo.py >> /var/log/kobo-etl.log 2>&1
That is the whole ingestion layer. About sixty lines of Python, no Zapier subscription, fully reproducible.
Step 3: Denormalise into Indicators
The raw payload JSONB is your audit trail. For dashboards and donor PDFs you want a flat indicator table. Define indicators in a YAML file (one per donor template) and run a denormalisation pass after ingestion.
# indicators.py — runs after etl_kobo.py
from sqlalchemy import text
from etl_kobo import engine  # reuse the ETL's SQLAlchemy engine
INDICATOR_SQL = """
INSERT INTO indicators (uuid, region, beneficiary_count, indicator_code, value, captured_at)
SELECT
s.uuid,
s.payload->>'region' AS region,
(s.payload->>'num_beneficiaries')::int AS beneficiary_count,
'OUT-1.2-cash-distribution' AS indicator_code,
(s.payload->>'amount_tnd')::numeric AS value,
s.submitted_at
FROM submissions s
WHERE s.payload->>'form_type' = 'cash_distribution'
ON CONFLICT (uuid, indicator_code) DO UPDATE
SET value = EXCLUDED.value, captured_at = EXCLUDED.captured_at;
"""
with engine.begin() as conn:
    conn.execute(text(INDICATOR_SQL))
Why store indicators denormalised? Because donors ask "show me the calculation" in audits, and a single SQL view that everyone can read is a better answer than a Jupyter notebook nobody can find.
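The ON CONFLICT clause above assumes an indicators table with a composite key on (uuid, indicator_code) — otherwise there is nothing for it to match on. A minimal schema sketch, with column types inferred from the casts in the query:
CREATE TABLE IF NOT EXISTS indicators (
    uuid              TEXT NOT NULL REFERENCES submissions(uuid),
    region            TEXT,
    beneficiary_count INTEGER,
    indicator_code    TEXT NOT NULL,
    value             NUMERIC,
    captured_at       TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (uuid, indicator_code)
);
CREATE INDEX IF NOT EXISTS indicators_captured_at_idx ON indicators(captured_at);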
Step 4: The AI Layer — Anomalies + Voice Note Summaries
Two agents earn their keep at NGO scale: an anomaly flagger that runs after each ETL and a transcript summariser that turns Arabic/French voice notes into English bullet points for donor narratives.
# agents.py
import os
import anthropic
import pandas as pd
from openai import OpenAI  # only used for Whisper transcription below
from sqlalchemy import text
from etl_kobo import engine  # reuse the ETL's SQLAlchemy engine
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
ANOMALY_PROMPT = """You are an M&E data quality assistant for a humanitarian NGO.
Below are this week's indicator values vs the rolling 8-week baseline.
Flag any value that is more than 2 standard deviations from the baseline.
Return a JSON list of {{indicator, region, value, baseline_mean, severity}}.
Data:
{rows}
"""
def flag_anomalies():
with engine.begin() as conn:
df = pd.read_sql(
"SELECT region, indicator_code, value, captured_at "
"FROM indicators WHERE captured_at > now() - interval '8 weeks'",
conn,
)
msg = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": ANOMALY_PROMPT.format(rows=df.to_csv(index=False))}],
)
flags = msg.content[0].text
with engine.begin() as conn:
conn.execute(text("INSERT INTO agent_flags(kind, payload, created_at) "
"VALUES ('anomaly', :p, now())"), {"p": flags})
def summarise_voice_note(audio_path: str, lang_hint: str = "ar") -> str:
    """Whisper transcription + LLM summary in English.

    Anthropic's API does not transcribe audio, so this step assumes an
    OpenAI Whisper endpoint is available (set OPENAI_API_KEY).
    """
    whisper = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as f:
        transcript = whisper.audio.transcriptions.create(
            model="whisper-1", file=f, language=lang_hint
        ).text
summary = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=300,
messages=[{
"role": "user",
"content": f"Summarise this field interview in 3 English bullets for a donor report:\n\n{transcript}",
}],
).content[0].text
    return summary
The agent never writes to indicators. It writes to agent_flags, a separate table that the dashboard renders as a sidebar. Humans decide what to do with the flags.
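The agent_flags table is not part of the migration shown earlier, so here is a minimal sketch of it — the reviewed column is an assumption; add whatever triage fields your team needs:
CREATE TABLE IF NOT EXISTS agent_flags (
    id         BIGSERIAL PRIMARY KEY,
    kind       TEXT NOT NULL,                 -- 'anomaly', 'summary', ...
    payload    TEXT NOT NULL,                 -- raw LLM output, kept verbatim for review
    reviewed   BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);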
For multilingual transcript summarisation, this is where the magic happens for francophone Maghreb NGOs. A field team in Tataouine records a 90-second voice note in Tunisian darija explaining why a distribution was delayed. The pipeline transcribes it (Whisper handles Arabic and French well), then asks the LLM to produce three English bullets the M&E officer can paste into the USAID narrative. End-to-end cost: about $0.02 per voice note.
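Wiring that up end to end is mostly plumbing: pull the audio through the attachments endpoint from Step 1, write it to a temporary file, and hand it to summarise_voice_note. A sketch, assuming you already have the submission and attachment IDs from the stored payload (the filename and .mp3 extension are illustrative):
# fetch_and_summarise.py — glue between Kobo attachments and agents.summarise_voice_note
import requests
from etl_kobo import KOBO_BASE, KOBO_ASSET, HEADERS
from agents import summarise_voice_note

def summarise_attachment(submission_id: int, attachment_id: int, lang_hint: str = "ar") -> str:
    url = f"{KOBO_BASE}/assets/{KOBO_ASSET}/data/{submission_id}/attachments/{attachment_id}/"
    r = requests.get(url, headers=HEADERS, timeout=120)
    r.raise_for_status()
    path = f"/tmp/voice_note_{submission_id}_{attachment_id}.mp3"
    with open(path, "wb") as f:
        f.write(r.content)  # save the audio locally for transcription
    return summarise_voice_note(path, lang_hint=lang_hint)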
For more on building agent layers like this, see our companion piece on GitLab AI PM/QA dashboards — same pattern, different domain.
Step 5: Donor PDFs in EN / FR / AR
The donor does not log in to your dashboard. The donor wants a PDF in their template, in their language, on the deadline. WeasyPrint turns HTML + CSS into print-quality PDFs and handles Arabic RTL natively.
# render_pdf.py
from jinja2 import Environment, FileSystemLoader
from weasyprint import HTML
import pandas as pd
from sqlalchemy import text
from etl_kobo import engine  # reuse the ETL's SQLAlchemy engine
env = Environment(loader=FileSystemLoader("templates"))
def render_report(lang: str, period: str, output: str):
    tmpl = env.get_template("donor_report.html.j2")  # one template; lang drives direction and fonts
with engine.begin() as conn:
df = pd.read_sql(
text("SELECT indicator_code, region, sum(value) AS total "
"FROM indicators WHERE captured_at >= :start "
"GROUP BY indicator_code, region"),
conn, params={"start": period},
)
html = tmpl.render(rows=df.to_dict(orient="records"), period=period, lang=lang)
HTML(string=html).write_pdf(output)
# One template, three renders:
for lang in ("en", "fr", "ar"):
    render_report(lang, "2026-04-01", f"reports/Q2-2026-{lang}.pdf")
The Jinja2 template uses dir="rtl" when lang == "ar", a different font stack (Noto Naskh Arabic) for the AR version, and the donor's actual logframe table structure. Build one template per donor, reuse it forever.
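A trimmed skeleton of that template — the table structure and styling here are illustrative, not any particular donor's layout:
<!-- templates/donor_report.html.j2 (trimmed) -->
<html lang="{{ lang }}" dir="{{ 'rtl' if lang == 'ar' else 'ltr' }}">
<head>
  <style>
    body { font-family: {% if lang == 'ar' %}'Noto Naskh Arabic', serif{% else %}'Noto Sans', sans-serif{% endif %}; }
    table { width: 100%; border-collapse: collapse; }
    th, td { border: 1px solid #888; padding: 4px 8px; }
  </style>
</head>
<body>
  <h1>{{ period }}</h1>
  <table>
    <tr><th>Indicator</th><th>Region</th><th>Total</th></tr>
    {% for row in rows %}
    <tr><td>{{ row.indicator_code }}</td><td>{{ row.region }}</td><td>{{ row.total }}</td></tr>
    {% endfor %}
  </table>
</body>
</html>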
Cost Comparison
For a typical INGO programme with one M&E officer, three field staff, two desk officers and two donor reps, here is the monthly bill:
| Component | Open-source pipeline | Power BI Pro stack |
|---|---|---|
| VPS (Hetzner CX22) | $6 | — |
| Postgres (on the VPS) | $0 | — |
| LLM API (Claude/OpenAI) | ~$15 | — |
| WeasyPrint / Python | $0 | — |
| Power BI Pro × 8 users | — | $80 |
| Power BI Premium per user | — | from $200 |
| Consultant retainer | — | from $1,500 |
| Monthly total | ~$30 | $280 – $1,800+ |
The open-source pipeline also wins on portability. When the donor programme ends in eighteen months and a new donor wants the same data on a different schedule, your Postgres database moves with you. Your Power BI semantic model does not.
Production Checklist
Before you point this at a real donor programme:
- Service account for the Kobo token, not a personal account
- Backups of Postgres to S3-compatible storage (Hetzner Storage Box is $4/month for 1 TB)
- Cron job heartbeat — we use the healthchecks.io free tier
- Row-level security in Postgres so the dashboard role cannot read PII columns (see the sketch after this list)
- A pyproject.toml pinning every dependency (requests, pandas, sqlalchemy, weasyprint, anthropic)
- A dry-run flag on agents.py so you can review LLM outputs before they hit the database
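For the PII point, a sketch of what that can look like in Postgres — the role name, password, and column list are assumptions to adapt to your own schema:
-- read-only role for the dashboard, limited to non-PII columns
CREATE ROLE dashboard_ro LOGIN PASSWORD 'change-me';
GRANT SELECT (uuid, region, indicator_code, value, captured_at)
    ON indicators TO dashboard_ro;
-- submissions.payload can hold names and phone numbers; with RLS enabled
-- and no policy defined, non-owner roles get zero rows even if someone
-- later grants them SELECT
ALTER TABLE submissions ENABLE ROW LEVEL SECURITY;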
Where to Go Next
Once the pipeline is live, the obvious extensions are:
- An MCP server on top of Postgres so the country director can ask "how many beneficiaries did we reach in Tataouine last week?" from Claude Desktop in Arabic
- A web dashboard for the M&E team (we ship a Next.js + Tremor setup; SvelteKit + ECharts works equally well)
- Programmatic donor template ingestion — map a new donor's logframe into your indicator codes via YAML
Want Us to Build This for You?
If you are running a donor-funded programme and the pipeline above is what you wish you had, tell us about your programme. We typically deliver a working KoboToolbox dashboard with multilingual donor PDFs in two weeks, from kickoff to live data. From around $4,500 for a single-donor programme; less if you already have your XLSForms stable.
Read the strategic context in our pillar piece: AI M&E dashboards for MENA NGOs.