The Premise: What If AI Could Do Its Own ML Research?
In early March 2026, Andrej Karpathy released autoresearch — a framework where an AI agent autonomously runs machine learning experiments on your local GPU. The concept is elegant: the agent proposes a hypothesis, writes the code, runs the experiment, evaluates the results, and decides whether to keep or discard the changes. Rinse, repeat. About 12 experiments per hour, each in a tight 5-minute loop.
Karpathy's original tweet captured the idea perfectly: let AI do the grunt work of ML research while you sleep.
We thought: what if we pointed this at a real Kaggle competition?
The result is our fork of autoresearch (branch: kaggle/rna-3d-folding), adapted to compete in Stanford RNA 3D Folding 2 — a $75,000 prize pool competition to predict RNA 3D structures, with a deadline of March 25, 2026.
This tutorial walks through exactly how we adapted the system, the technical constraints we hit, and what we've learned so far.
The Challenge: Local GPU vs. Kaggle's Constraints
Karpathy's autoresearch was designed for local execution:
| Feature | Original (Local) | Our Adaptation (Kaggle) |
|---|---|---|
| GPU access | Always available | Max 2 concurrent sessions |
| Experiment time | ~5 minutes | 30-60 minutes per run |
| Feedback loop | Immediate | Delayed (submission + scoring) |
| Iteration speed | ~12/hour | ~1-2/hour |
| Cost | Your electricity | Free Kaggle quota (limited) |
The core insight: Kaggle competitions require a fundamentally different loop rhythm. You can't burn through 12 experiments an hour when each submission takes 30-60 minutes and you have limited GPU time.
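The iteration rates in the table follow from simple arithmetic. Here is a minimal sketch, assuming runs execute one at a time and ignoring queueing and scoring delays. The function name is ours, chosen for illustration.

```python
def experiments_per_hour(run_minutes: float) -> float:
    """Upper bound on sequential experiments per hour for a given run length."""
    if run_minutes <= 0:
        raise ValueError("run_minutes must be positive")
    return 60.0 / run_minutes


print(experiments_per_hour(5))   # local 5-minute loop from the table
print(experiments_per_hour(30))  # shortest Kaggle run from the table
print(experiments_per_hour(60))  # longest Kaggle run from the table
```

These are ceilings: any time spent waiting for a submission to be scored lowers the real rate further.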
Architecture Overview
Here's how our adapted system works:
```
┌────────────────────────────────────────────────┐
│            OpenClaw (Agent Runtime)            │
│          github.com/openclaw/openclaw          │
│                                                │
│ ┌──────────┐    ┌──────────┐    ┌────────────┐ │
│ │ Cron Job │───▶│ loop.py  │───▶│   Notify   │ │
│ │ (hourly) │    │          │    │ (WhatsApp) │ │
│ └──────────┘    └─────┬────┘    └────────────┘ │
│                       │                        │
│             ┌─────────▼─────────┐              │
│             │ Experiment Logic  │              │
│             │                   │              │
│             │ 1. Check scores   │              │
│             │ 2. Log results    │              │
│             │ 3. Push next exp  │              │
│             │ 4. Track in JSON  │              │
│             └─────────┬─────────┘              │
│                       │                        │
│             ┌─────────▼─────────┐              │
│             │ simulate.py       │              │
│             │ (local pre-screen │              │
│             │  ~40 seconds)     │              │
│             └─────────┬─────────┘              │
│                       │                        │
│             ┌─────────▼─────────┐              │
│             │ Kaggle API        │              │
│             │ (submit + score)  │              │
│             └───────────────────┘              │
└────────────────────────────────────────────────┘
```
OpenClaw serves as the agent runtime — it manages the cron scheduling, tool execution, and notifications that keep the autonomous loop running 24/7.
Step 1: Understanding the Original Autoresearch Loop
The original autoresearch works like this:
- AI proposes a hypothesis and writes experiment code
- Code runs on local GPU (~5 minutes)
- Results evaluated: did the metric improve?
- Keep or discard, like git commit vs git reset
- Repeat: the agent uses history to inform the next hypothesis
The genius is the tight feedback loop. The agent learns from every experiment, accumulating a history of what works and what doesn't.
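The loop described above can be sketched as a greedy keep/discard search. This is not Karpathy's code: `run_loop`, `toy_evaluate`, and `toy_propose` are hypothetical names, and a toy objective stands in for a real training run.

```python
import random


def run_loop(evaluate, propose, initial, n_experiments, seed=0):
    """Greedy keep/discard loop: keep a candidate only if its metric improves."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    history = []  # every attempt is recorded, whether kept or not
    for step in range(n_experiments):
        candidate = propose(best, history, rng)
        score = evaluate(candidate)
        kept = score > best_score
        history.append({"step": step, "candidate": candidate,
                        "score": score, "kept": kept})
        if kept:  # analogous to `git commit`
            best, best_score = candidate, score
        # otherwise the candidate is dropped, analogous to `git reset`
    return best, best_score, history


# Toy stand-ins: maximize -(x - 3)^2 by perturbing the current best value.
def toy_evaluate(x):
    return -(x - 3.0) ** 2


def toy_propose(best, history, rng):
    return best + rng.uniform(-1.0, 1.0)


best, best_score, history = run_loop(toy_evaluate, toy_propose, 0.0, 50)
print(f"best x = {best:.3f}, kept {sum(h['kept'] for h in history)} of {len(history)}")
```

In the real system the proposal step is an AI agent reading the history rather than a random perturbation, but the keep/discard bookkeeping has the same shape.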
Step 2: Adapting for Kaggle
The Key Changes
From 5-minute loops to cron-driven hourly loops:
Instead of a continuous local loop, we use OpenClaw's cron system to trigger experiments on a schedule. Each cycle:
- Checks if the previous Kaggle submission has been scored
- Retrieves the score via Kaggle's submissions API (the publicScoreNullable field)
- Logs the result to experiments.json
- Decides: keep the changes (score improved) or discard (score dropped)
- Generates the next experiment parameters
- Pushes the updated notebook to Kaggle
- Sends a WhatsApp notification with the results
Score retrieval:
```python
# Simplified from our loop.py
import kaggle


def get_latest_score(competition, notebook_slug):
    """Return the newest submission's public score, or None while it is pending."""
    # notebook_slug is unused in this simplified version.
    submissions = kaggle.api.competition_submissions(competition)
    if not submissions:
        return None
    # Assumes the API lists submissions newest first.
    latest = submissions[0].publicScoreNullable
    return float(latest) if latest is not None else None
```

The publicScoreNullable field is key: it's None while the submission is still being evaluated, which tells our loop to wait and check again on the next cron cycle.
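To exercise the pending-versus-scored check without calling Kaggle, the decision can be isolated in a pure function. This is a sketch under two assumptions: that submission objects expose publicScoreNullable as in the snippet above, and that the list is ordered newest first. `score_of_latest` is a hypothetical helper, and the SimpleNamespace objects with made-up scores stand in for real API results.

```python
from types import SimpleNamespace


def score_of_latest(submissions):
    """Return the newest submission's public score, or None if it is pending.

    Assumes `submissions` is ordered newest first and that each item has a
    `publicScoreNullable` attribute that is None (or empty) while scoring runs.
    """
    if not submissions:
        return None
    raw = submissions[0].publicScoreNullable
    if raw is None or raw == "":
        return None
    return float(raw)


# Stand-in data: the newest submission is pending in one list, scored in the other.
pending = [SimpleNamespace(publicScoreNullable=None),
           SimpleNamespace(publicScoreNullable="0.366")]
scored = [SimpleNamespace(publicScoreNullable="0.378"),
          SimpleNamespace(publicScoreNullable="0.366")]

print(score_of_latest(pending))
print(score_of_latest(scored))
```

Checking only the newest entry matters: falling back to an older scored submission would make the loop log a stale result instead of waiting.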
Keep/discard logic with experiments.json:
Inspired by autoresearch's git-like approach, we track experiments in a JSON file:
```json
{
  "experiments": [
    {
      "id": 14,
      "timestamp": "2026-03-08T14:30:00Z",
      "parameters": {
        "method": "protenix",
        "confidence_threshold": 0.72,
        "template_weight": 0.3
      },
      "score": 0.378,
      "delta": "+0.012",
      "status": "kept",
      "notes": "Increasing protenix weight improved folding accuracy on longer sequences"
    }
  ],
  "best_score": 0.378,
  "total_experiments": 14,
  "total_kept": 6,
  "total_discarded": 8
}
```

This gives the AI agent a complete history to reason over when proposing the next experiment.
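One way to maintain a log of this shape is sketched below. `record_experiment` is a hypothetical helper, not taken from our loop.py. The timestamp field is omitted to keep the example deterministic, and the scores are illustrative values, not real results.

```python
import json


def record_experiment(log, parameters, score, notes=""):
    """Append one result to the log dict and update the summary counters."""
    best = log.get("best_score")
    improved = best is None or score > best
    delta = 0.0 if best is None else score - best
    log["experiments"].append({
        "id": len(log["experiments"]) + 1,
        "parameters": parameters,
        "score": score,
        "delta": f"{delta:+.3f}",
        "status": "kept" if improved else "discarded",
        "notes": notes,
    })
    log["total_experiments"] = len(log["experiments"])
    key = "total_kept" if improved else "total_discarded"
    log[key] = log.get(key, 0) + 1
    if improved:
        log["best_score"] = score
    return improved


log = {"experiments": [], "best_score": None,
       "total_experiments": 0, "total_kept": 0, "total_discarded": 0}
record_experiment(log, {"confidence_threshold": 0.70}, 0.366)
record_experiment(log, {"confidence_threshold": 0.72}, 0.378)
record_experiment(log, {"confidence_threshold": 0.80}, 0.351)
print(json.dumps(log, indent=2))
```

Discarded experiments stay in the log on purpose: the agent needs to see what failed as well as what worked.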
Local simulation for pre-screening:
Kaggle GPU time is precious. We added simulate.py — a lightweight local simulation that pre-screens parameters in ~40 seconds before committing to a full Kaggle run:
```python
# simulate.py - Pre-screen parameters locally
def quick_validate(params):
    """Run a fast local check (~40s) before burning Kaggle GPU time."""
    # Load a small subset of test sequences
    # Run the pipeline with proposed parameters
    # Check for obvious failures (NaN scores, crashes, regressions)
    # Return: go/no-go decision
    pass
```

This saves roughly 30-60 minutes of Kaggle GPU time for each experiment that would obviously fail.
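The stub above leaves the checks unspecified. Here is one possible shape for the go/no-go decision, as a sketch only: the extra arguments (`run_pipeline`, `sequences`, `baseline_score`, `tolerance`) and the `fake_pipeline` stand-in are our assumptions for illustration, not the contents of the real simulate.py.

```python
import math


def quick_validate(params, run_pipeline, sequences, baseline_score, tolerance=0.02):
    """Return True (go) if params survive a fast local check, else False (no-go)."""
    scores = []
    for seq in sequences:
        try:
            score = run_pipeline(seq, params)
        except Exception:
            return False  # a local crash would also waste a Kaggle run
        if score is None or math.isnan(score):
            return False  # NaN scores are an obvious failure
        scores.append(score)
    if not scores:
        return False
    mean_score = sum(scores) / len(scores)
    # Reject clear regressions against the current baseline.
    return mean_score >= baseline_score - tolerance


# Hypothetical stand-in pipeline, for illustration only.
def fake_pipeline(seq, params):
    if params.get("template_weight", 0.0) > 1.0:
        return float("nan")
    return 0.4


ok = quick_validate({"template_weight": 0.3}, fake_pipeline, ["GGGAAACCC"], 0.378)
bad = quick_validate({"template_weight": 1.5}, fake_pipeline, ["GGGAAACCC"], 0.378)
print(ok, bad)
```

The tolerance keeps the pre-screen from rejecting candidates over noise on a small subset. It is a filter for obvious failures, not a substitute for the leaderboard score.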
Step 3: The Competition — Stanford RNA 3D Folding 2
Stanford RNA 3D Folding 2 challenges competitors to predict the 3D structure of RNA molecules. The evaluation metric is TM-score (Template Modeling score), where higher is better.
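For intuition, the core of TM-score is an average of per-residue terms 1 / (1 + (d / d0)^2), where d is the distance between aligned positions and d0 is a length-dependent scale. The sketch below scores a single fixed superposition. The full metric also maximizes over superpositions, and the competition's exact d0 definition is not covered here, so d0 is passed in by the caller.

```python
def tm_score(distances, reference_length, d0):
    """TM-score for one fixed superposition.

    distances: distances between aligned predicted and reference residues.
    reference_length: number of residues in the reference structure.
    d0: distance scale, in the same units as `distances`.
    """
    if reference_length <= 0 or d0 <= 0:
        raise ValueError("reference_length and d0 must be positive")
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / reference_length


print(tm_score([0.0, 0.0, 0.0, 0.0], 4, 5.0))  # perfect overlap
print(tm_score([5.0, 5.0, 5.0, 5.0], 4, 5.0))  # every residue off by exactly d0
```

Residues that are missing from the prediction contribute nothing to the sum but still count in the denominator, so incomplete predictions are penalized.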
Our Approach: Hybrid Pipeline
We use two complementary methods:
- Template-Based Modeling (TBM): finds similar known RNA structures and adapts them. Fast and reliable for sequences with known homologs.
- Protenix Deep Learning: neural network-based structure prediction. Better for novel sequences without known templates.
The key insight we've discovered through autonomous experimentation: smart routing between these methods matters more than tuning either one individually.
Our agent learned this on its own — after ~14 experiments, the biggest score jumps came not from tweaking model hyperparameters, but from improving the confidence-based routing that decides which method handles which RNA sequence.
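Confidence-based routing can be sketched as a simple threshold rule. This is an assumption about the general shape, not our actual routing code: `route` and `route_batch` are hypothetical names, the sequence IDs and confidences are made up, and the default of 0.72 is borrowed from the confidence_threshold value in the experiment log shown earlier.

```python
def route(template_confidence, confidence_threshold=0.72):
    """Pick a method for one sequence based on template-match confidence."""
    if template_confidence >= confidence_threshold:
        return "tbm"       # a good template exists: adapt it
    return "protenix"      # no reliable template: use the neural model


def route_batch(confidences, confidence_threshold=0.72):
    """Map each sequence ID to the method that should handle it."""
    return {seq_id: route(conf, confidence_threshold)
            for seq_id, conf in confidences.items()}


plan = route_batch({"rna_a": 0.91, "rna_b": 0.40})
print(plan)
```

Under this framing, the threshold becomes one more parameter the agent can tune, and it affects every sequence at once, which may be why routing changes move the score more than per-model tweaks.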
Current best score: 0.378 TM-score (actively improving with each experiment cycle).
You can see our notebook here: Stanford RNA 3D Folding 2 Baseline v1.
Step 4: Setting Up the Autonomous Loop
Here's how to set up a similar system for your own Kaggle competition:
Prerequisites
- OpenClaw installed and configured
- Kaggle API credentials (~/.kaggle/kaggle.json)
- A baseline notebook that scores on the competition leaderboard
The loop.py Script
The heart of the system. Each cron trigger executes this flow:
```python
def main():
    # Helpers (notify, log_experiment, simulate, ...) and state such as
    # best_score and experiments_history are assumed to be defined
    # elsewhere in loop.py.

    # 1. Check if previous submission is scored
    score = get_latest_score(COMPETITION, NOTEBOOK_SLUG)
    if score is None:
        notify("⏳ Previous submission still scoring. Will check next cycle.")
        return

    # 2. Log the result
    experiment = log_experiment(score, current_params)

    # 3. Keep or discard
    if score > best_score:
        # Compute the gain before keep_changes has a chance to update the best.
        delta = score - best_score
        keep_changes(experiment)
        notify(f"✅ New best! {score} (+{delta:.4f})")
    else:
        discard_changes(experiment)
        notify(f"❌ Score dropped: {score} (best: {best_score})")

    # 4. Generate next experiment
    next_params = ai_propose_next(experiments_history)

    # 5. Pre-screen locally
    if not simulate(next_params):
        notify("⚠️ Simulation failed. Trying alternative params.")
        next_params = ai_propose_alternative(experiments_history)

    # 6. Push to Kaggle
    update_notebook(next_params)
    submit_to_kaggle()
    notify(f"🧪 Experiment #{len(experiments) + 1} submitted. Params: {next_params}")
```

Cron Configuration
In OpenClaw, the cron job runs hourly:
```yaml
# OpenClaw cron configuration
schedule:
  kind: cron
  expr: "0 * * * *"  # Every hour
payload:
  kind: agentTurn
  message: "Run the next autoresearch Kaggle experiment cycle"
```

What We've Learned So Far
After running this system for several days on the RNA folding competition:
- Patience beats speed: The original autoresearch thrives on rapid iteration. On Kaggle, patience and smart experiment selection matter more. Our agent had to learn that 1-2 well-chosen experiments per hour beat 12 random ones.
- Local simulation is essential: Without simulate.py, we'd burn through Kaggle GPU quota on experiments that crash in the first 30 seconds. The 40-second pre-screen saves hours.
- Method routing > hyperparameter tuning: The biggest score improvements came from better routing logic between Template-Based Modeling and Protenix, not from tuning individual model parameters.
- Experiment history is gold: The JSON log of all experiments gives the agent increasingly good context for proposing the next experiment. By experiment #10, proposals were consistently more targeted.
- Notifications keep you sane: WhatsApp notifications after each experiment cycle mean you can monitor progress without staring at a dashboard. Wake up to "3 experiments ran overnight, best score improved by 0.015."
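A summary line like the one quoted in the last point can be derived from the experiment log. This is a minimal sketch with a hypothetical `summarize` helper and illustrative scores. Sending the message is left to whatever notification tool you use.

```python
def summarize(experiments, best_before):
    """Build a one-line summary from a list of experiment records."""
    if not experiments:
        return "No experiments ran."
    best_after = max([best_before] + [e["score"] for e in experiments])
    gain = best_after - best_before
    noun = "experiment" if len(experiments) == 1 else "experiments"
    if gain > 0:
        return f"{len(experiments)} {noun} ran, best score improved by {gain:.3f}."
    return f"{len(experiments)} {noun} ran, no improvement (best: {best_before:.3f})."


overnight = [{"score": 0.370}, {"score": 0.393}, {"score": 0.360}]
print(summarize(overnight, best_before=0.378))
```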
Reproduce This Yourself
- Fork our repo: github.com/anis-marrouchi/autoresearch-kaggle (branch: kaggle/rna-3d-folding)
- Study the original: github.com/karpathy/autoresearch
- Set up OpenClaw: github.com/openclaw/openclaw for the cron-driven agent runtime
- Pick your competition and adapt loop.py for its specific metric and submission format
- Start with simulation: always pre-screen locally before burning GPU time
What's Next
The competition deadline is March 25, 2026. We're continuing to run experiments autonomously, and we'll publish a follow-up with final results and a deeper analysis of what the agent discovered.
The broader implication is exciting: autonomous AI research isn't just for labs with clusters of GPUs. With the right adaptation, you can run it on Kaggle's free infrastructure, competing against thousands of data scientists — with an AI agent doing the heavy lifting.
Follow our progress on Kaggle and noqta.tn.
FAQ
What is autoresearch?
Autoresearch is a framework by Andrej Karpathy (github.com/karpathy/autoresearch) that lets AI agents autonomously run machine learning experiments. The agent proposes hypotheses, writes code, runs experiments, and learns from results — approximately 12 experiments per hour on a local GPU.
Can I use this approach on any Kaggle competition?
Yes, the framework is competition-agnostic. You need to adapt loop.py for the specific evaluation metric and submission format of your competition. The core loop (propose → simulate → submit → score → keep/discard) stays the same.
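One detail to adapt is the direction of the metric: TM-score is higher-is-better, while error metrics such as RMSE are lower-is-better. Below is a sketch of how the keep/discard comparison could be parameterized. `CompetitionConfig`, `is_improvement`, and the slugs are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompetitionConfig:
    slug: str                      # hypothetical identifier, not a real Kaggle slug
    higher_is_better: bool = True  # False for error metrics such as RMSE


def is_improvement(config: CompetitionConfig, score: float,
                   best_score: Optional[float]) -> bool:
    """Decide keep vs discard in the direction the metric requires."""
    if best_score is None:
        return True  # the first scored run always becomes the baseline
    if config.higher_is_better:
        return score > best_score
    return score < best_score


tm_comp = CompetitionConfig("example-tm-score-competition")
rmse_comp = CompetitionConfig("example-rmse-competition", higher_is_better=False)

print(is_improvement(tm_comp, 0.390, 0.378))
print(is_improvement(rmse_comp, 1.10, 1.00))
```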
How much Kaggle GPU time does this use?
Each experiment uses one GPU session (30-60 minutes depending on the competition). With local simulation pre-screening, we reject ~40% of experiments before they hit Kaggle, saving significant GPU quota.
What is OpenClaw and why use it?
OpenClaw is an open-source AI agent runtime. We use it because it provides cron scheduling, tool execution, and notifications (WhatsApp/Telegram) out of the box — exactly what an autonomous experiment loop needs to run 24/7.
What's a good TM-score for RNA structure prediction?
TM-scores range from 0 to 1. Scores above 0.5 generally indicate correct fold topology. Our current best of 0.378 sits below that mark, leaving room to improve through continued autonomous experimentation.