The Premise: What If AI Could Do Its Own ML Research?

In early March 2026, Andrej Karpathy released autoresearch — a framework where an AI agent autonomously runs machine learning experiments on your local GPU. The concept is elegant: the agent proposes a hypothesis, writes the code, runs the experiment, evaluates the results, and decides whether to keep or discard the changes. Rinse, repeat. About 12 experiments per hour, each in a tight 5-minute loop.

Karpathy's original tweet captured the idea perfectly: let AI do the grunt work of ML research while you sleep.

We thought: what if we pointed this at a real Kaggle competition?

The result is our fork of autoresearch (branch: kaggle/rna-3d-folding), adapted to compete in Stanford RNA 3D Folding 2 — a $75,000 prize pool competition to predict RNA 3D structures, with a deadline of March 25, 2026.

This tutorial walks through exactly how we adapted the system, the technical constraints we hit, and what we've learned so far.

The Challenge: Local GPU vs. Kaggle's Constraints

Karpathy's autoresearch was designed for local execution:

Feature	Original (Local)	Our Adaptation (Kaggle)
GPU access	Always available	Max 2 concurrent sessions
Experiment time	~5 minutes	30-60 minutes per run
Feedback loop	Immediate	Delayed (submission + scoring)
Iteration speed	~12/hour	~1-2/hour
Cost	Your electricity	Free Kaggle quota (limited)

The core insight: Kaggle competitions require a fundamentally different loop rhythm. You can't burn through 12 experiments an hour when each submission takes 30-60 minutes and you have limited GPU time.
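To make the difference in rhythm concrete, here is a quick back-of-the-envelope calculation using only the run times quoted above. The 8-hour overnight window, and the assumption that runs happen strictly one after another, are ours and purely illustrative.

```python
def experiments_per_window(window_hours: float, minutes_per_run: float) -> int:
    """Number of back-to-back runs that fit fully inside a time window."""
    return int(window_hours * 60 // minutes_per_run)


NIGHT_HOURS = 8  # illustrative assumption, not a figure from either project

local = experiments_per_window(NIGHT_HOURS, 5)         # original 5-minute loop
kaggle_fast = experiments_per_window(NIGHT_HOURS, 30)  # best case on Kaggle
kaggle_slow = experiments_per_window(NIGHT_HOURS, 60)  # worst case on Kaggle

print(f"local: {local}, Kaggle: {kaggle_slow}-{kaggle_fast}")
# local: 96, Kaggle: 8-16
```

With roughly a tenth of the attempts available, each Kaggle experiment has to be chosen far more carefully than a local one.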

Architecture Overview

Here's how our adapted system works:

┌──────────────────────────────────────────────┐
│           OpenClaw (Agent Runtime)           │
│         github.com/openclaw/openclaw         │
│                                              │
│ ┌──────────┐    ┌──────────┐    ┌──────────┐ │
│ │ Cron Job │───▶│ loop.py  │───▶│  Notify  │ │
│ │ (hourly) │    │          │    │(WhatsApp)│ │
│ └──────────┘    └─────┬────┘    └──────────┘ │
│                       │                      │
│             ┌─────────▼─────────┐            │
│             │ Experiment Logic  │            │
│             │                   │            │
│             │  1. Check scores  │            │
│             │  2. Log results   │            │
│             │  3. Push next exp │            │
│             │  4. Track in JSON │            │
│             └─────────┬─────────┘            │
│                       │                      │
│             ┌─────────▼─────────┐            │
│             │    simulate.py    │            │
│             │ (local pre-screen │            │
│             │   ~40 seconds)    │            │
│             └─────────┬─────────┘            │
│                       │                      │
│             ┌─────────▼─────────┐            │
│             │    Kaggle API     │            │
│             │ (submit + score)  │            │
│             └───────────────────┘            │
└──────────────────────────────────────────────┘

OpenClaw serves as the agent runtime — it manages the cron scheduling, tool execution, and notifications that keep the autonomous loop running 24/7.

Step 1: Understanding the Original Autoresearch Loop

The original autoresearch works like this:

1. AI proposes a hypothesis and writes experiment code
2. Code runs on local GPU (~5 minutes)
3. Results evaluated: did the metric improve?
4. Keep or discard, like git commit vs git reset
5. Repeat: agent uses history to inform next hypothesis

The genius is the tight feedback loop. The agent learns from every experiment, accumulating a history of what works and what doesn't.
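The keep/discard cycle can be sketched as a greedy loop. This is a toy illustration, not Karpathy's implementation: in the real framework the propose step is an AI agent editing training code and the evaluate step is an actual training run. The names `run_loop`, `toy_propose`, and `toy_evaluate` are hypothetical, and the "experiment" here just nudges a single number toward a target.

```python
import random


def run_loop(propose, evaluate, baseline, n_experiments):
    """Greedy keep/discard loop: adopt a candidate only if it beats the current best."""
    best_config = baseline
    best_score = evaluate(baseline)
    history = []  # every attempt is recorded, including discarded ones
    for _ in range(n_experiments):
        candidate = propose(best_config, history)
        score = evaluate(candidate)
        kept = score > best_score
        history.append({"config": candidate, "score": score, "kept": kept})
        if kept:  # "git commit": the candidate becomes the new baseline
            best_config, best_score = candidate, score
        # otherwise "git reset": the candidate is dropped and the baseline stays
    return best_config, best_score, history


# Toy stand-ins: the "experiment" tunes one number toward 3.0.
rng = random.Random(0)


def toy_propose(config, history):
    return config + rng.uniform(-1.0, 1.0)


def toy_evaluate(config):
    return -(config - 3.0) ** 2  # higher is better, maximum at 3.0


best_config, best_score, history = run_loop(toy_propose, toy_evaluate, 0.0, 50)
print(f"best config {best_config:.2f} after {len(history)} experiments")
```

Note that `history` keeps the discarded attempts too. That record of failures is what lets the agent avoid repeating them.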

Step 2: Adapting for Kaggle

The Key Changes

From 5-minute loops to cron-driven hourly loops:

Instead of a continuous local loop, we use OpenClaw's cron system to trigger experiments on a schedule. Each cycle:

1. Checks if the previous Kaggle submission has been scored
2. Retrieves the score via Kaggle's submissions API (the publicScoreNullable field)
3. Logs the result to experiments.json
4. Decides: keep the changes (score improved) or discard them (score did not improve)
5. Generates the next experiment parameters
6. Pushes the updated notebook to Kaggle
7. Sends a WhatsApp notification with the results

Score retrieval:

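# Simplified from our loop.py
import kaggle  # the package authenticates from ~/.kaggle/kaggle.json on import


def get_latest_score(competition, notebook_slug=None):
    """Return the newest submission's public score, or None while it is pending.

    notebook_slug is accepted for call compatibility but unused in this
    simplified version.
    """
    submissions = kaggle.api.competition_submissions(competition)
    if not submissions:
        return None
    # Look only at the newest submission (assumes the API lists newest first),
    # so a pending run is never mistaken for an older, already-scored one.
    score = submissions[0].publicScoreNullable
    return float(score) if score is not None else None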

The publicScoreNullable field is key — it's None while the submission is still being evaluated, which tells our loop to wait and check again on the next cron cycle.
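The wait-or-proceed decision is easy to check offline with stand-in objects instead of live API calls. The sketch below is illustrative only: `cycle_action` is a hypothetical helper, the fake submissions carry just the publicScoreNullable attribute, the list is assumed to be ordered newest first, and the "submit" branch for an empty list is our own addition for the very first run.

```python
from types import SimpleNamespace


def cycle_action(submissions):
    """Decide what one cron cycle should do, given submissions ordered newest first."""
    if not submissions:
        return ("submit", None)  # nothing submitted yet: send the baseline
    score = submissions[0].publicScoreNullable
    if score is None:
        return ("wait", None)  # newest submission is still being evaluated
    return ("log", float(score))  # scored: log it and move on


# Fake submission objects standing in for what the Kaggle client returns.
pending = SimpleNamespace(publicScoreNullable=None)
scored_old = SimpleNamespace(publicScoreNullable="0.366")
scored_new = SimpleNamespace(publicScoreNullable="0.378")

print(cycle_action([pending, scored_old]))     # ('wait', None)
print(cycle_action([scored_new, scored_old]))  # ('log', 0.378)
print(cycle_action([]))                        # ('submit', None)
```

The first case is the one that matters: an older scored submission sits right behind the pending one, and the cycle must not pick it up by accident.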


Keep/discard logic with experiments.json:

Inspired by autoresearch's git-like approach, we track experiments in a JSON file:

{
  "experiments": [
    {
      "id": 14,
      "timestamp": "2026-03-08T14:30:00Z",
      "parameters": {
        "method": "protenix",
        "confidence_threshold": 0.72,
        "template_weight": 0.3
      },
      "score": 0.378,
      "delta": "+0.012",
      "status": "kept",
      "notes": "Increasing protenix weight improved folding accuracy on longer sequences"
    }
  ],
  "best_score": 0.378,
  "total_experiments": 14,
  "total_kept": 6,
  "total_discarded": 8
}

This gives the AI agent a complete history to reason over when proposing the next experiment.
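A minimal sketch of how a log in this shape could be maintained, assuming the field names from the JSON above. The helper name `record_experiment` is hypothetical, as is the choice to treat the first scored run as "kept" with a delta of +0.000. Our actual loop.py may differ.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_experiment(path, parameters, score, notes=""):
    """Append one scored experiment to the log and update the running totals."""
    path = Path(path)
    if path.exists():
        log = json.loads(path.read_text())
    else:
        log = {"experiments": [], "best_score": None, "total_experiments": 0,
               "total_kept": 0, "total_discarded": 0}

    best = log["best_score"]
    kept = best is None or score > best  # higher TM-score is better
    delta = 0.0 if best is None else score - best

    log["experiments"].append({
        "id": log["total_experiments"] + 1,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "parameters": parameters,
        "score": score,
        "delta": f"{delta:+.3f}",
        "status": "kept" if kept else "discarded",
        "notes": notes,
    })
    log["total_experiments"] += 1
    log["total_kept" if kept else "total_discarded"] += 1
    if kept:
        log["best_score"] = score

    path.write_text(json.dumps(log, indent=2))
    return kept


# Usage: two runs written to a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "experiments.json"
    record_experiment(log_path, {"template_weight": 0.2}, 0.366)
    record_experiment(log_path, {"template_weight": 0.3}, 0.378)
    print(json.loads(log_path.read_text())["best_score"])  # 0.378
```

Discarded runs stay in the file with their parameters and notes, so the agent can see what has already been tried and failed.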

Local simulation for pre-screening:

Kaggle GPU time is precious. We added simulate.py — a lightweight local simulation that pre-screens parameters in ~40 seconds before committing to a full Kaggle run:

# simulate.py - Pre-screen parameters locally
def quick_validate(params):
    """Run a fast local check (~40s) before burning Kaggle GPU time."""
    # Load a small subset of test sequences
    # Run the pipeline with proposed parameters
    # Check for obvious failures (NaN scores, crashes, regressions)
    # Return: go/no-go decision
    pass

Every obviously failing experiment caught here saves a Kaggle GPU session of roughly 30-60 minutes.
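The stub above only lists the checks, so here is one way they could be wired together. Treat it as a sketch under stated assumptions: `score_fn` is a hypothetical callable that runs the pipeline on a small subset of sequences and returns a local score, the `tolerance` value is arbitrary, and `toy_score` merely stands in for a real pipeline.

```python
import math


def quick_validate(params, score_fn, baseline_score, tolerance=0.02):
    """Go/no-go check: run a cheap local scoring pass and reject obvious failures.

    score_fn(params) is a hypothetical callable returning a local score
    (higher is better) computed on a small subset of sequences.
    """
    try:
        score = score_fn(params)
    except Exception as exc:  # a crash locally would also crash on Kaggle
        return False, f"crashed: {exc}"
    if score is None or math.isnan(score):
        return False, "score is NaN"
    if score < baseline_score - tolerance:
        return False, f"regression: {score:.3f} vs baseline {baseline_score:.3f}"
    return True, f"ok: {score:.3f}"


# Toy stand-in for the real pipeline.
def toy_score(params):
    if params["template_weight"] > 1.0:
        raise ValueError("template_weight must be <= 1.0")
    return 0.35 + 0.1 * params["template_weight"]


print(quick_validate({"template_weight": 0.3}, toy_score, baseline_score=0.37))
# (True, 'ok: 0.380')
print(quick_validate({"template_weight": 2.0}, toy_score, baseline_score=0.37))
# (False, 'crashed: template_weight must be <= 1.0')
```

Returning a reason alongside the boolean gives the agent something to learn from when a proposal is rejected.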

Step 3: The Competition — Stanford RNA 3D Folding 2

Stanford RNA 3D Folding 2 challenges competitors to predict the 3D structure of RNA molecules. The evaluation metric is TM-score (Template Modeling score), where higher is better.
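For intuition, the core of the TM-score is an average of per-residue terms of the form 1 / (1 + (d / d0)^2), where d is the distance between a predicted residue and its reference counterpart after superposition. The sketch below evaluates that sum for one fixed superposition. The full metric also searches for the superposition that maximizes it, and the competition's exact choice of d0 and alignment tool is not reproduced here, so d0 is simply passed in.

```python
def tm_score(distances, l_ref, d0):
    """TM-score for one fixed superposition.

    distances: per-residue distances (in angstroms) between aligned residue pairs
    l_ref:     number of residues in the reference structure
    d0:        length-dependent distance scale, passed in rather than derived here
    """
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_ref


print(tm_score([0.0, 0.0, 0.0, 0.0], l_ref=4, d0=1.0))  # 1.0  (perfect match)
print(tm_score([1.0, 1.0], l_ref=2, d0=1.0))            # 0.5
print(tm_score([1.0, 1.0], l_ref=4, d0=1.0))            # 0.25 (half the residues unaligned)
```

Because the sum is divided by the reference length, residues that are missing or far off contribute close to zero, which is why a partially correct fold still earns partial credit.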

Our Approach: Hybrid Pipeline

We use two complementary methods:

Template-Based Modeling (TBM) — Finds similar known RNA structures and adapts them. Fast, reliable for sequences with known homologs.
Protenix Deep Learning — Neural network-based structure prediction. Better for novel sequences without known templates.

The key insight we've discovered through autonomous experimentation: smart routing between these methods matters more than tuning either one individually.

Our agent learned this on its own — after ~14 experiments, the biggest score jumps came not from tweaking model hyperparameters, but from improving the confidence-based routing that decides which method handles which RNA sequence.
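The simplest form of confidence-based routing is a hard threshold, sketched below. Everything here is illustrative: `TemplateHit`, `route`, and the way confidence is represented are hypothetical, and the default of 0.72 is borrowed from the confidence_threshold in the example log above. The real pipeline may blend the two methods rather than switch between them.

```python
from dataclasses import dataclass


@dataclass
class TemplateHit:
    template_id: str
    confidence: float  # hypothetical 0-1 score for how well the template matches


def route(best_hit, confidence_threshold=0.72):
    """Pick a method for one RNA sequence based on template confidence."""
    if best_hit is not None and best_hit.confidence >= confidence_threshold:
        return "tbm"       # a trustworthy homolog exists: adapt its structure
    return "protenix"      # no reliable template: fall back to deep learning


print(route(TemplateHit("hit_A", 0.91)))  # tbm
print(route(TemplateHit("hit_B", 0.40)))  # protenix
print(route(None))                        # protenix
```

A single number like this threshold changes which method handles many sequences at once, which may be why routing changes moved the score more than per-model tuning did.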

Current best score: 0.378 TM-score, and the loop is still running experiments to improve it.

You can see our notebook here: Stanford RNA 3D Folding 2 Baseline v1.

Step 4: Setting Up the Autonomous Loop

Here's how to set up a similar system for your own Kaggle competition:

Prerequisites

OpenClaw installed and configured
Kaggle API credentials (~/.kaggle/kaggle.json)
A baseline notebook that scores on the competition leaderboard

The loop.py Script

The heart of the system. Each cron trigger executes this flow:

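# Helpers (notify, log_experiment, keep_changes, ...) and state (best_score,
# current_params, experiments, experiments_history) are defined elsewhere
# in loop.py.
MAX_PROPOSALS = 3  # illustrative cap on pre-screen attempts per cycle


def main():
    # 1. Check if previous submission is scored
    score = get_latest_score(COMPETITION, NOTEBOOK_SLUG)
    if score is None:
        notify("⏳ Previous submission still scoring. Will check next cycle.")
        return

    # 2. Log the result
    experiment = log_experiment(score, current_params)

    # 3. Keep or discard (higher TM-score is better; a tie is not an improvement)
    if score > best_score:
        keep_changes(experiment)
        notify(f"✅ New best! {score} (+{score - best_score:.4f})")
    else:
        discard_changes(experiment)
        notify(f"❌ No improvement: {score} (best: {best_score})")

    # 4. Generate next experiment
    next_params = ai_propose_next(experiments_history)

    # 5. Pre-screen locally; every proposal is simulated, so unscreened
    #    parameters never reach Kaggle
    attempts = 1
    while not simulate(next_params):
        if attempts >= MAX_PROPOSALS:
            notify("🛑 No proposal passed simulation. Skipping this cycle.")
            return
        notify("⚠️ Simulation failed. Trying alternative params.")
        next_params = ai_propose_alternative(experiments_history)
        attempts += 1

    # 6. Push to Kaggle
    update_notebook(next_params)
    submit_to_kaggle()

    notify(f"🧪 Experiment #{len(experiments) + 1} submitted. Params: {next_params}")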

Cron Configuration

In OpenClaw, the cron job runs hourly:

# OpenClaw cron configuration
schedule:
  kind: cron
  expr: "0 * * * *"  # Every hour
payload:
  kind: agentTurn
  message: "Run the next autoresearch Kaggle experiment cycle"

What We've Learned So Far

After running this system for several days on the RNA folding competition:

Patience beats speed: The original autoresearch thrives on rapid iteration. On Kaggle, patience and smart experiment selection matter more. Our agent had to learn that 1-2 well-chosen experiments per hour beat 12 random ones.
Local simulation is essential: Without simulate.py, we'd burn through Kaggle GPU quota on experiments that crash in the first 30 seconds. The 40-second pre-screen saves hours.
Method routing > hyperparameter tuning: The biggest score improvements came from better routing logic between Template-Based Modeling and Protenix, not from tuning individual model parameters.
Experiment history is gold: The JSON log of all experiments gives the agent increasingly good context for proposing the next experiment. By experiment #10, proposals were consistently more targeted.
Notifications keep you sane: WhatsApp notifications after each experiment cycle mean you can monitor progress without staring at a dashboard. Wake up to "3 experiments ran overnight, best score improved by 0.015."
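A wake-up message like that one can be built straight from the experiment history. The helper below is a hypothetical sketch that only formats the text. Delivering it over WhatsApp is left to the agent runtime, and the sample scores are made up for illustration.

```python
def summarize(experiments, best_before):
    """Build a one-line notification from the experiments run since the last summary."""
    if not experiments:
        return "No experiments ran."
    best_after = max([best_before] + [e["score"] for e in experiments])
    gain = best_after - best_before
    noun = "experiment" if len(experiments) == 1 else "experiments"
    if gain > 0:
        return f"{len(experiments)} {noun} ran, best score improved by {gain:.3f}."
    return f"{len(experiments)} {noun} ran, best score unchanged at {best_before:.3f}."


# Made-up overnight scores, for illustration only.
overnight = [{"score": 0.371}, {"score": 0.378}, {"score": 0.360}]
print(summarize(overnight, best_before=0.363))
# 3 experiments ran, best score improved by 0.015.
```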


Reproduce This Yourself

Fork our repo: github.com/anis-marrouchi/autoresearch-kaggle (branch: kaggle/rna-3d-folding)
Study the original: github.com/karpathy/autoresearch
Set up OpenClaw: github.com/openclaw/openclaw for the cron-driven agent runtime
Pick your competition and adapt the loop.py for its specific metric and submission format
Start with simulation — always pre-screen locally before burning GPU time

What's Next

The competition deadline is March 25, 2026. We're continuing to run experiments autonomously, and we'll publish a follow-up with final results and a deeper analysis of what the agent discovered.

The broader implication is exciting: autonomous AI research isn't just for labs with clusters of GPUs. With the right adaptation, you can run it on Kaggle's free infrastructure, competing against thousands of data scientists — with an AI agent doing the heavy lifting.

Follow our progress on Kaggle and noqta.tn.

FAQ

What is autoresearch?

Autoresearch is a framework by Andrej Karpathy (github.com/karpathy/autoresearch) that lets AI agents autonomously run machine learning experiments. The agent proposes hypotheses, writes code, runs experiments, and learns from results — approximately 12 experiments per hour on a local GPU.

Can I use this approach on any Kaggle competition?

Yes, the framework is competition-agnostic. You need to adapt loop.py for the specific evaluation metric and submission format of your competition. The core loop (propose → simulate → submit → score → keep/discard) stays the same.
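The most common adaptation is the direction of the metric: TM-score rewards higher values, while many competitions score an error where lower is better. One way to isolate that difference is a small comparison helper. The name `is_improvement` and the `min_delta` parameter are hypothetical and not part of our loop.py.

```python
def is_improvement(score, best, higher_is_better=True, min_delta=0.0):
    """True if `score` beats `best` by more than `min_delta` in the metric's direction."""
    if best is None:  # the first scored run becomes the baseline
        return True
    gain = score - best if higher_is_better else best - score
    return gain > min_delta


print(is_improvement(0.378, 0.366))                        # True  (TM-score: higher wins)
print(is_improvement(0.41, 0.45, higher_is_better=False))  # True  (error metric: lower wins)
print(is_improvement(0.378, 0.378))                        # False (a tie is not an improvement)
```

Setting `min_delta` above zero is a cheap guard against keeping changes whose gain is within leaderboard noise.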

How much Kaggle GPU time does this use?

Each experiment uses one GPU session (30-60 minutes depending on the competition). With local simulation pre-screening, we reject ~40% of experiments before they hit Kaggle, saving significant GPU quota.
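As a rough estimate of what that rejection rate is worth, the calculation below multiplies rejected proposals by the run time they would have consumed. The count of 50 proposals is an illustrative assumption. The result is also an upper bound, since it assumes every rejected proposal would have used a full session, whereas a run that crashes early uses far less.

```python
def gpu_hours_saved(proposals, reject_rate, minutes_per_run):
    """GPU hours not spent because rejected proposals never reached Kaggle."""
    rejected = proposals * reject_rate
    return rejected * minutes_per_run / 60


# 50 proposals is an illustrative number; the 40% rate and the 30-60 minute
# run times are the figures quoted above.
low = gpu_hours_saved(50, 0.40, 30)
high = gpu_hours_saved(50, 0.40, 60)
print(f"saved between {low:.0f} and {high:.0f} GPU hours")
# saved between 10 and 20 GPU hours
```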

What is OpenClaw and why use it?

OpenClaw is an open-source AI agent runtime. We use it because it provides cron scheduling, tool execution, and notifications (WhatsApp/Telegram) out of the box — exactly what an autonomous experiment loop needs to run 24/7.

What's a good TM-score for RNA structure prediction?

TM-scores range from 0 to 1, and scores above 0.5 generally indicate correct fold topology. Our current best of 0.378 is still below that threshold, so there is clear room to improve through continued autonomous experimentation.