How We Adapted Karpathy's Autoresearch for Kaggle Competitions

The Premise: What If AI Could Do Its Own ML Research?
In early March 2026, Andrej Karpathy released autoresearch — a framework where an AI agent autonomously runs machine learning experiments on your local GPU. The concept is elegant: the agent proposes a hypothesis, writes the code, runs the experiment, evaluates the results, and decides whether to keep or discard the changes. Rinse, repeat. About 12 experiments per hour, each in a tight 5-minute loop.
Karpathy's original tweet captured the idea perfectly: let AI do the grunt work of ML research while you sleep.
We thought: what if we pointed this at a real Kaggle competition?
The result is our fork of autoresearch (branch: kaggle/rna-3d-folding), adapted to compete in Stanford RNA 3D Folding 2 — a $75,000 prize pool competition to predict RNA 3D structures, with a deadline of March 25, 2026.
This tutorial walks through exactly how we adapted the system, the technical constraints we hit, and what we've learned so far.
The Challenge: Local GPU vs. Kaggle's Constraints
Karpathy's autoresearch was designed for local execution:
| Feature | Original (Local) | Our Adaptation (Kaggle) |
|---|---|---|
| GPU access | Always available | Max 2 concurrent sessions |
| Experiment time | ~5 minutes | 30-60 minutes per run |
| Feedback loop | Immediate | Delayed (submission + scoring) |
| Iteration speed | ~12/hour | ~1-2/hour |
| Cost | Your electricity | Free Kaggle quota (limited) |
The core insight: Kaggle competitions require a fundamentally different loop rhythm. You can't burn through 12 experiments an hour when each submission takes 30-60 minutes and you have limited GPU time.
Architecture Overview
Here's how our adapted system works:
```
┌─────────────────────────────────────────────┐
│           OpenClaw (Agent Runtime)          │
│         github.com/openclaw/openclaw        │
│                                             │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐ │
│  │ Cron Job │──▶│ loop.py  │──▶│  Notify  │ │
│  │ (hourly) │   │          │   │(WhatsApp)│ │
│  └──────────┘   └─────┬────┘   └──────────┘ │
│                       │                     │
│             ┌─────────▼─────────┐           │
│             │  Experiment Logic │           │
│             │                   │           │
│             │  1. Check scores  │           │
│             │  2. Log results   │           │
│             │  3. Push next exp │           │
│             │  4. Track in JSON │           │
│             └─────────┬─────────┘           │
│                       │                     │
│             ┌─────────▼─────────┐           │
│             │    simulate.py    │           │
│             │ (local pre-screen │           │
│             │   ~40 seconds)    │           │
│             └─────────┬─────────┘           │
│                       │                     │
│             ┌─────────▼─────────┐           │
│             │    Kaggle API     │           │
│             │ (submit + score)  │           │
│             └───────────────────┘           │
└─────────────────────────────────────────────┘
```
OpenClaw serves as the agent runtime — it manages the cron scheduling, tool execution, and notifications that keep the autonomous loop running 24/7.
Step 1: Understanding the Original Autoresearch Loop
The original autoresearch works like this:
- AI proposes a hypothesis and writes experiment code
- Code runs on local GPU (~5 minutes)
- Results evaluated — did the metric improve?
- Keep or discard — like `git commit` vs `git reset`
- Repeat — agent uses history to inform next hypothesis
The genius is the tight feedback loop. The agent learns from every experiment, accumulating a history of what works and what doesn't.
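The rhythm above can be sketched as a minimal loop. This is an illustrative sketch, not Karpathy's actual code — `propose` and `run_experiment` are hypothetical stand-ins for the agent's real tools:

```python
# Minimal sketch of the autoresearch-style keep/discard loop.
# `propose` and `run_experiment` are hypothetical stand-ins.

def research_loop(propose, run_experiment, n_iters=12):
    history = []                   # every experiment, kept or not
    best_metric = float("-inf")
    for _ in range(n_iters):
        hypothesis = propose(history)        # agent writes the next experiment
        metric = run_experiment(hypothesis)  # ~5 minutes on a local GPU
        kept = metric > best_metric          # keep ~ git commit, discard ~ git reset
        if kept:
            best_metric = metric
        history.append({"hypothesis": hypothesis, "metric": metric, "kept": kept})
    return best_metric, history
```

Passing the full `history` into `propose` is what lets the agent condition each new hypothesis on everything it has already tried.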
Step 2: Adapting for Kaggle
The Key Changes
From 5-minute loops to cron-driven hourly loops:
Instead of a continuous local loop, we use OpenClaw's cron system to trigger experiments on a schedule. Each cycle:
- Checks if the previous Kaggle submission has been scored
- Retrieves the score via Kaggle's submissions API (the `publicScoreNullable` field)
- Logs the result to `experiments.json`
- Decides: keep the changes (score improved) or discard (score dropped)
- Generates the next experiment parameters
- Pushes the updated notebook to Kaggle
- Sends a WhatsApp notification with the results
Score retrieval:
```python
# Simplified from our loop.py
import kaggle

def get_latest_score(competition, notebook_slug):
    submissions = kaggle.api.competition_submissions(competition)
    for sub in submissions:
        if sub.publicScoreNullable is not None:
            return float(sub.publicScoreNullable)
    return None
```

The `publicScoreNullable` field is key — it's `None` while the submission is still being evaluated, which tells our loop to wait and check again on the next cron cycle.
Keep/discard logic with experiments.json:
Inspired by autoresearch's git-like approach, we track experiments in a JSON file:
```json
{
  "experiments": [
    {
      "id": 14,
      "timestamp": "2026-03-08T14:30:00Z",
      "parameters": {
        "method": "protenix",
        "confidence_threshold": 0.72,
        "template_weight": 0.3
      },
      "score": 0.378,
      "delta": "+0.012",
      "status": "kept",
      "notes": "Increasing protenix weight improved folding accuracy on longer sequences"
    }
  ],
  "best_score": 0.378,
  "total_experiments": 14,
  "total_kept": 6,
  "total_discarded": 8
}
```

This gives the AI agent a complete history to reason over when proposing the next experiment.
Local simulation for pre-screening:
Kaggle GPU time is precious. We added simulate.py — a lightweight local simulation that pre-screens parameters in ~40 seconds before committing to a full Kaggle run:
```python
# simulate.py - Pre-screen parameters locally
def quick_validate(params):
    """Run a fast local check (~40s) before burning Kaggle GPU time."""
    # Load a small subset of test sequences
    # Run the pipeline with proposed parameters
    # Check for obvious failures (NaN scores, crashes, regressions)
    # Return: go/no-go decision
    pass
```

This saves roughly 30-60 minutes of Kaggle GPU time for experiments that would obviously fail.
Step 3: The Competition — Stanford RNA 3D Folding 2
Stanford RNA 3D Folding 2 challenges competitors to predict the 3D structure of RNA molecules. The evaluation metric is TM-score (Template Modeling score), where higher is better.
Our Approach: Hybrid Pipeline
We use two complementary methods:
- Template-Based Modeling (TBM) — Finds similar known RNA structures and adapts them. Fast, reliable for sequences with known homologs.
- Protenix Deep Learning — Neural network-based structure prediction. Better for novel sequences without known templates.
The key insight we've discovered through autonomous experimentation: smart routing between these methods matters more than tuning either one individually.
Our agent learned this on its own — after ~14 experiments, the biggest score jumps came not from tweaking model hyperparameters, but from improving the confidence-based routing that decides which method handles which RNA sequence.
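That routing logic can be sketched as follows. This is an illustrative sketch, not the competition code: `tbm_predict` and `protenix_predict` are hypothetical stand-ins for the two pipelines, and the default threshold simply reuses the value from the experiment log above:

```python
# Hypothetical confidence-based router between the two methods.
# tbm_predict / protenix_predict stand in for the real pipelines.

def route_sequence(seq, tbm_predict, protenix_predict, confidence_threshold=0.72):
    """Use TBM when a good template exists; fall back to Protenix otherwise."""
    structure, template_confidence = tbm_predict(seq)
    if template_confidence >= confidence_threshold:
        return structure, "tbm"               # known homolog: template wins
    return protenix_predict(seq), "protenix"  # novel sequence: deep learning
```

The design choice is that TBM runs first because it is cheap; Protenix is only invoked when no sufficiently confident template is found.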
Current best score: 0.378 TM-score (actively improving with each experiment cycle).
You can see our notebook here: Stanford RNA 3D Folding 2 Baseline v1.
Step 4: Setting Up the Autonomous Loop
Here's how to set up a similar system for your own Kaggle competition:
Prerequisites
- OpenClaw installed and configured
- Kaggle API credentials (`~/.kaggle/kaggle.json`)
- A baseline notebook that scores on the competition leaderboard
The loop.py Script
The heart of the system. Each cron trigger executes this flow:
```python
def main():
    # 1. Check if previous submission is scored
    score = get_latest_score(COMPETITION, NOTEBOOK_SLUG)
    if score is None:
        notify("⏳ Previous submission still scoring. Will check next cycle.")
        return

    # 2. Log the result
    experiment = log_experiment(score, current_params)

    # 3. Keep or discard
    if score > best_score:
        keep_changes(experiment)
        notify(f"✅ New best! {score} (+{score - best_score:.4f})")
    else:
        discard_changes(experiment)
        notify(f"❌ Score dropped: {score} (best: {best_score})")

    # 4. Generate next experiment
    next_params = ai_propose_next(experiments_history)

    # 5. Pre-screen locally
    if not simulate(next_params):
        notify("⚠️ Simulation failed. Trying alternative params.")
        next_params = ai_propose_alternative(experiments_history)

    # 6. Push to Kaggle
    update_notebook(next_params)
    submit_to_kaggle()
    notify(f"🧪 Experiment #{len(experiments) + 1} submitted. Params: {next_params}")
```

Cron Configuration
In OpenClaw, the cron job runs hourly:
```yaml
# OpenClaw cron configuration
schedule:
  kind: cron
  expr: "0 * * * *"   # Every hour
payload:
  kind: agentTurn
  message: "Run the next autoresearch Kaggle experiment cycle"
```

What We've Learned So Far
After running this system for several days on the RNA folding competition:
- Patience beats speed — The original autoresearch thrives on rapid iteration. On Kaggle, patience and smart experiment selection matter more. Our agent had to learn that 1-2 well-chosen experiments per hour beat 12 random ones.
- Local simulation is essential — Without `simulate.py`, we'd burn through Kaggle GPU quota on experiments that crash in the first 30 seconds. The 40-second pre-screen saves hours.
- Method routing > hyperparameter tuning — The biggest score improvements came from better routing logic between Template-Based Modeling and Protenix, not from tuning individual model parameters.
- Experiment history is gold — The JSON log of all experiments gives the agent increasingly good context for proposing the next experiment. By experiment #10, proposals were consistently more targeted.
- Notifications keep you sane — WhatsApp notifications after each experiment cycle mean you can monitor progress without staring at a dashboard. Wake up to "3 experiments ran overnight, best score improved by 0.015."
Reproduce This Yourself
- Fork our repo: github.com/anis-marrouchi/autoresearch-kaggle (branch: `kaggle/rna-3d-folding`)
- Study the original: github.com/karpathy/autoresearch
- Set up OpenClaw: github.com/openclaw/openclaw for the cron-driven agent runtime
- Pick your competition and adapt `loop.py` for its specific metric and submission format
- Start with simulation — always pre-screen locally before burning GPU time
What's Next
The competition deadline is March 25, 2026. We're continuing to run experiments autonomously, and we'll publish a follow-up with final results and a deeper analysis of what the agent discovered.
The broader implication is exciting: autonomous AI research isn't just for labs with clusters of GPUs. With the right adaptation, you can run it on Kaggle's free infrastructure, competing against thousands of data scientists — with an AI agent doing the heavy lifting.
Follow our progress on Kaggle and noqta.tn.
FAQ
What is autoresearch?
Autoresearch is a framework by Andrej Karpathy (github.com/karpathy/autoresearch) that lets AI agents autonomously run machine learning experiments. The agent proposes hypotheses, writes code, runs experiments, and learns from results — approximately 12 experiments per hour on a local GPU.
Can I use this approach on any Kaggle competition?
Yes, the framework is competition-agnostic. You need to adapt loop.py for the specific evaluation metric and submission format of your competition. The core loop (propose → simulate → submit → score → keep/discard) stays the same.
How much Kaggle GPU time does this use?
Each experiment uses one GPU session (30-60 minutes depending on the competition). With local simulation pre-screening, we reject ~40% of experiments before they hit Kaggle, saving significant GPU quota.
What is OpenClaw and why use it?
OpenClaw is an open-source AI agent runtime. We use it because it provides cron scheduling, tool execution, and notifications (WhatsApp/Telegram) out of the box — exactly what an autonomous experiment loop needs to run 24/7.
What's a good TM-score for RNA structure prediction?
TM-scores range from 0 to 1. Scores above 0.5 generally indicate correct fold topology. Our current best of 0.378 is still below that threshold, with room to improve through continued autonomous experimentation.