How We Adapted Karpathy's Autoresearch for Kaggle Competitions

By Noqta Team


The Premise: What If AI Could Do Its Own ML Research?

In early March 2026, Andrej Karpathy released autoresearch — a framework where an AI agent autonomously runs machine learning experiments on your local GPU. The concept is elegant: the agent proposes a hypothesis, writes the code, runs the experiment, evaluates the results, and decides whether to keep or discard the changes. Rinse, repeat. About 12 experiments per hour, each in a tight 5-minute loop.

Karpathy's original tweet captured the idea perfectly: let AI do the grunt work of ML research while you sleep.

We thought: what if we pointed this at a real Kaggle competition?

The result is our fork of autoresearch (branch: kaggle/rna-3d-folding), adapted to compete in Stanford RNA 3D Folding 2 — a $75,000 prize pool competition to predict RNA 3D structures, with a deadline of March 25, 2026.

This tutorial walks through exactly how we adapted the system, the technical constraints we hit, and what we've learned so far.

The Challenge: Local GPU vs. Kaggle's Constraints

Karpathy's autoresearch was designed for local execution:

| Feature | Original (Local) | Our Adaptation (Kaggle) |
| --- | --- | --- |
| GPU access | Always available | Max 2 concurrent sessions |
| Experiment time | ~5 minutes | 30-60 minutes per run |
| Feedback loop | Immediate | Delayed (submission + scoring) |
| Iteration speed | ~12/hour | ~1-2/hour |
| Cost | Your electricity | Free Kaggle quota (limited) |

The core insight: Kaggle competitions require a fundamentally different loop rhythm. You can't burn through 12 experiments an hour when each submission takes 30-60 minutes and you have limited GPU time.

Architecture Overview

Here's how our adapted system works:

┌─────────────────────────────────────────────┐
│              OpenClaw (Agent Runtime)         │
│  github.com/openclaw/openclaw                │
│                                              │
│  ┌──────────┐    ┌──────────┐    ┌────────┐ │
│  │ Cron Job │───▶│ loop.py  │───▶│ Notify │ │
│  │ (hourly) │    │          │    │(WhatsApp)│ │
│  └──────────┘    └─────┬────┘    └────────┘ │
│                        │                     │
│              ┌─────────▼─────────┐           │
│              │  Experiment Logic  │           │
│              │                    │           │
│              │  1. Check scores   │           │
│              │  2. Log results    │           │
│              │  3. Push next exp  │           │
│              │  4. Track in JSON  │           │
│              └─────────┬─────────┘           │
│                        │                     │
│              ┌─────────▼─────────┐           │
│              │   simulate.py     │           │
│              │  (local pre-screen │           │
│              │   ~40 seconds)     │           │
│              └─────────┬─────────┘           │
│                        │                     │
│              ┌─────────▼─────────┐           │
│              │   Kaggle API      │           │
│              │  (submit + score)  │           │
│              └───────────────────┘           │
└─────────────────────────────────────────────┘

OpenClaw serves as the agent runtime — it manages the cron scheduling, tool execution, and notifications that keep the autonomous loop running 24/7.

Step 1: Understanding the Original Autoresearch Loop

The original autoresearch works like this:

  1. AI proposes a hypothesis and writes experiment code
  2. Code runs on local GPU (~5 minutes)
  3. Results evaluated — did the metric improve?
  4. Keep or discard — like git commit vs git reset
  5. Repeat — agent uses history to inform next hypothesis

The genius is the tight feedback loop. The agent learns from every experiment, accumulating a history of what works and what doesn't.
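The five steps above can be sketched as a short Python loop. This is an illustrative reconstruction, not Karpathy's actual code; `propose`, `run_experiment`, and `evaluate` stand in for the agent's real tooling:

```python
# Minimal sketch of the autoresearch keep/discard loop.
# propose(), run_experiment(), and evaluate() are hypothetical
# placeholders for the agent's actual tooling.

def research_loop(propose, run_experiment, evaluate, n_rounds=12):
    history = []                    # accumulated record of every experiment
    best_metric = float("-inf")
    for _ in range(n_rounds):
        hypothesis = propose(history)        # 1. propose + write code
        result = run_experiment(hypothesis)  # 2. run on local GPU
        metric = evaluate(result)            # 3. did the metric improve?
        kept = metric > best_metric          # 4. keep or discard
        if kept:
            best_metric = metric
        history.append({"hypothesis": hypothesis,
                        "metric": metric, "kept": kept})
    return best_metric, history              # 5. history informs next round
```

The `history` list is what makes the loop compound: every proposal sees the full record of what was tried and whether it helped.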

Step 2: Adapting for Kaggle

The Key Changes

From 5-minute loops to cron-driven hourly loops:

Instead of a continuous local loop, we use OpenClaw's cron system to trigger experiments on a schedule. Each cycle:

  1. Checks if the previous Kaggle submission has been scored
  2. Retrieves the score via Kaggle's submissions API (the publicScoreNullable field)
  3. Logs the result to experiments.json
  4. Decides: keep the changes (score improved) or discard (score dropped)
  5. Generates the next experiment parameters
  6. Pushes the updated notebook to Kaggle
  7. Sends a WhatsApp notification with the results

Score retrieval:

# Simplified from our loop.py
import kaggle  # authenticates from ~/.kaggle/kaggle.json on import

def get_latest_score(competition, notebook_slug):
    """Return the latest submission's public score, or None while it
    is still being evaluated. (notebook_slug is unused in this
    simplified version.)"""
    submissions = kaggle.api.competition_submissions(competition)
    if not submissions:
        return None
    latest = submissions[0]  # Kaggle returns submissions newest-first
    if latest.publicScoreNullable is None:
        return None  # still scoring; wait for the next cron cycle
    return float(latest.publicScoreNullable)

The publicScoreNullable field is key — it's None while the submission is still being evaluated, which tells our loop to wait and check again on the next cron cycle.

🚀 Building autonomous AI systems for your team? Noqta designs and implements AI automation solutions that run production workflows — not just demos.

Keep/discard logic with experiments.json:

Inspired by autoresearch's git-like approach, we track experiments in a JSON file:

{
  "experiments": [
    {
      "id": 14,
      "timestamp": "2026-03-08T14:30:00Z",
      "parameters": {
        "method": "protenix",
        "confidence_threshold": 0.72,
        "template_weight": 0.3
      },
      "score": 0.378,
      "delta": "+0.012",
      "status": "kept",
      "notes": "Increasing protenix weight improved folding accuracy on longer sequences"
    }
  ],
  "best_score": 0.378,
  "total_experiments": 14,
  "total_kept": 6,
  "total_discarded": 8
}

This gives the AI agent a complete history to reason over when proposing the next experiment.
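A logging helper in this spirit might look like the following. This is a hypothetical sketch (our actual `loop.py` differs in details), assuming the `experiments.json` schema shown above:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_experiment(path, params, score, notes=""):
    """Append one run to experiments.json and update the summary
    counters (hypothetical helper, assuming the schema shown above)."""
    p = Path(path)
    data = json.loads(p.read_text()) if p.exists() else {
        "experiments": [], "best_score": 0.0,
        "total_experiments": 0, "total_kept": 0, "total_discarded": 0}
    kept = score > data["best_score"]
    entry = {
        "id": data["total_experiments"] + 1,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "parameters": params,
        "score": score,
        "delta": f"{score - data['best_score']:+.3f}",
        "status": "kept" if kept else "discarded",
        "notes": notes,
    }
    data["experiments"].append(entry)
    data["total_experiments"] += 1
    data["total_kept" if kept else "total_discarded"] += 1
    if kept:
        data["best_score"] = score
    p.write_text(json.dumps(data, indent=2))
    return entry
```

Keeping the summary counters (`best_score`, `total_kept`) denormalized in the same file means the agent can read one JSON blob and immediately know where it stands.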

Local simulation for pre-screening:

Kaggle GPU time is precious. We added simulate.py — a lightweight local simulation that pre-screens parameters in ~40 seconds before committing to a full Kaggle run:

# simulate.py - Pre-screen parameters locally
import math

def quick_validate(params):
    """Run a fast local check (~40s) before burning Kaggle GPU time."""
    try:
        # Run the pipeline on a small subset of test sequences
        # (run_pipeline_subset is a project-specific helper)
        scores = run_pipeline_subset(params)
    except Exception:
        return False  # crash: obvious no-go
    # Reject obvious failures: empty output or NaN scores
    if not scores or any(math.isnan(s) for s in scores):
        return False
    return True  # go: worth a full Kaggle run

This saves roughly 30-60 minutes of Kaggle GPU time for experiments that would obviously fail.

Step 3: The Competition — Stanford RNA 3D Folding 2

Stanford RNA 3D Folding 2 challenges competitors to predict the 3D structure of RNA molecules. The evaluation metric is TM-score (Template Modeling score), where higher is better.

Our Approach: Hybrid Pipeline

We use two complementary methods:

  1. Template-Based Modeling (TBM) — Finds similar known RNA structures and adapts them. Fast, reliable for sequences with known homologs.

  2. Protenix Deep Learning — Neural network-based structure prediction. Better for novel sequences without known templates.

The key insight we've discovered through autonomous experimentation: smart routing between these methods matters more than tuning either one individually.

Our agent learned this on its own — after ~14 experiments, the biggest score jumps came not from tweaking model hyperparameters, but from improving the confidence-based routing that decides which method handles which RNA sequence.
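In spirit, the routing the agent converged on looks like the sketch below. The helper and the `template_hits` input are illustrative assumptions (our notebook's actual routing is more involved); the 0.72 threshold is one of the parameters the loop tunes:

```python
def route_sequence(seq, template_hits, confidence_threshold=0.72):
    """Decide which method predicts this RNA sequence.

    Illustrative sketch: `template_hits` is a hypothetical list of
    (template_id, confidence) matches from a homology search.
    """
    if not template_hits:
        return "protenix"        # novel sequence, no known templates
    best_conf = max(conf for _, conf in template_hits)
    if best_conf >= confidence_threshold:
        return "tbm"             # strong homolog: template-based modeling
    return "protenix"            # weak templates: trust the neural model
```

Because the threshold is a single scalar, it is cheap for the loop to sweep, which is part of why routing experiments paid off faster than hyperparameter tuning.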

Current best score: 0.378 TM-score (actively improving with each experiment cycle).

You can see our notebook here: Stanford RNA 3D Folding 2 Baseline v1.

Step 4: Setting Up the Autonomous Loop

Here's how to set up a similar system for your own Kaggle competition:

Prerequisites

  • OpenClaw installed and configured
  • Kaggle API credentials (~/.kaggle/kaggle.json)
  • A baseline notebook that scores on the competition leaderboard
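Before wiring up the cron loop, it is worth sanity-checking the credentials file, since a missing or malformed `kaggle.json` only fails at submission time. A small hypothetical helper (Kaggle expects a JSON file with `username` and `key` fields):

```python
import json
from pathlib import Path

def check_kaggle_creds(path=None):
    """Return True if a usable Kaggle credentials file exists.

    Sketch: Kaggle's API expects ~/.kaggle/kaggle.json containing
    `username` and `key`.
    """
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.exists():
        return False
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    return bool(creds.get("username")) and bool(creds.get("key"))
```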

The loop.py Script

The heart of the system. Each cron trigger executes this flow:

def main():
    # 1. Check if previous submission is scored
    score = get_latest_score(COMPETITION, NOTEBOOK_SLUG)
    if score is None:
        notify("⏳ Previous submission still scoring. Will check next cycle.")
        return
    
    # 2. Log the result
    experiment = log_experiment(score, current_params)
    
    # 3. Keep or discard
    if score > best_score:
        keep_changes(experiment)
        notify(f"✅ New best! {score} (+{score - best_score:.4f})")
    else:
        discard_changes(experiment)
        notify(f"❌ Score dropped: {score} (best: {best_score})")
    
    # 4. Generate next experiment
    next_params = ai_propose_next(experiments_history)
    
    # 5. Pre-screen locally
    if not simulate(next_params):
        notify("⚠️ Simulation failed. Trying alternative params.")
        next_params = ai_propose_alternative(experiments_history)
    
    # 6. Push to Kaggle
    update_notebook(next_params)
    submit_to_kaggle()
    
    notify(f"🧪 Experiment #{len(experiments_history) + 1} submitted. Params: {next_params}")

Cron Configuration

In OpenClaw, the cron job runs hourly:

# OpenClaw cron configuration
schedule:
  kind: cron
  expr: "0 * * * *"  # Every hour
payload:
  kind: agentTurn
  message: "Run the next autoresearch Kaggle experiment cycle"

What We've Learned So Far

After running this system for several days on the RNA folding competition:

  1. Patience beats speed — The original autoresearch thrives on rapid iteration. On Kaggle, patience and smart experiment selection matter more. Our agent had to learn that 1-2 well-chosen experiments per hour beats 12 random ones.

  2. Local simulation is essential — Without simulate.py, we'd burn through Kaggle GPU quota on experiments that crash in the first 30 seconds. The 40-second pre-screen saves hours.

  3. Method routing > hyperparameter tuning — The biggest score improvements came from better routing logic between Template-Based Modeling and Protenix, not from tuning individual model parameters.

  4. Experiment history is gold — The JSON log of all experiments gives the agent increasingly good context for proposing the next experiment. By experiment #10, proposals were consistently more targeted.

  5. Notifications keep you sane — WhatsApp notifications after each experiment cycle mean you can monitor progress without staring at a dashboard. Wake up to "3 experiments ran overnight, best score improved by 0.015."

💡 Want to build autonomous AI workflows for your projects? Talk to our team about implementing AI agent systems that run while you sleep.

Reproduce This Yourself

  1. Fork our repo: github.com/anis-marrouchi/autoresearch-kaggle (branch: kaggle/rna-3d-folding)
  2. Study the original: github.com/karpathy/autoresearch
  3. Set up OpenClaw: github.com/openclaw/openclaw for the cron-driven agent runtime
  4. Pick your competition and adapt the loop.py for its specific metric and submission format
  5. Start with simulation — always pre-screen locally before burning GPU time

What's Next

The competition deadline is March 25, 2026. We're continuing to run experiments autonomously, and we'll publish a follow-up with final results and a deeper analysis of what the agent discovered.

The broader implication is exciting: autonomous AI research isn't just for labs with clusters of GPUs. With the right adaptation, you can run it on Kaggle's free infrastructure, competing against thousands of data scientists — with an AI agent doing the heavy lifting.

Follow our progress on Kaggle and noqta.tn.

FAQ

What is autoresearch?

Autoresearch is a framework by Andrej Karpathy (github.com/karpathy/autoresearch) that lets AI agents autonomously run machine learning experiments. The agent proposes hypotheses, writes code, runs experiments, and learns from results — approximately 12 experiments per hour on a local GPU.

Can I use this approach on any Kaggle competition?

Yes, the framework is competition-agnostic. You need to adapt loop.py for the specific evaluation metric and submission format of your competition. The core loop (propose → simulate → submit → score → keep/discard) stays the same.

How much Kaggle GPU time does this use?

Each experiment uses one GPU session (30-60 minutes depending on the competition). With local simulation pre-screening, we reject ~40% of experiments before they hit Kaggle, saving significant GPU quota.

What is OpenClaw and why use it?

OpenClaw is an open-source AI agent runtime. We use it because it provides cron scheduling, tool execution, and notifications (WhatsApp/Telegram) out of the box — exactly what an autonomous experiment loop needs to run 24/7.

What's a good TM-score for RNA structure prediction?

TM-scores range from 0 to 1. Scores above 0.5 generally indicate correct fold topology. Our current best of 0.378 puts us in the actively competitive range, with room to improve through continued autonomous experimentation.
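For intuition, TM-score averages a per-residue closeness term over the target length. The sketch below assumes the structures are already aligned and uses the classic protein d0 normalization; RNA scoring tools use a slightly different d0, so treat this as illustrative only:

```python
def tm_score(distances, l_target):
    """TM-score sketch for an already-aligned structure pair.

    `distances` are per-aligned-residue deviations in Angstroms and
    `l_target` is the target length (assumed > 15 here). The d0 term
    is the classic protein normalization; RNA tools define d0
    differently, so this is illustrative, not the competition metric.
    """
    d0 = 1.24 * (l_target - 15) ** (1 / 3) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target
```

A perfect superposition (all distances zero) scores 1.0, and the length-dependent d0 is what makes scores comparable across RNAs of different sizes.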



