Writing Production-Grade Agent Skills: A 2026 Guide

When Addy Osmani — the Google Chrome engineering lead who taught a generation of developers JavaScript design patterns — open-sourced a library of two dozen production engineering skills this month, it crossed 14,000 GitHub stars in days. The reason it resonated is simple: most agent skills floating around GitHub are someone's half-finished experiment. His were curated by someone who has shipped software at scale. That gap — between a skill that works and a skill that wastes a 200,000-token context window — is what this guide is about.

If you have been hoping that your pile of scattered .md files is quietly helping your AI coding agent, the uncomfortable truth is that it probably is not. This is how to write skills that actually move the needle.

What a skill really is

An agent skill is a folder with a SKILL.md file at its center. The file has two parts: YAML frontmatter (a name and a description) and a markdown body containing step-by-step instructions. Optionally, the folder also holds reference documents and executable scripts that load only when a task needs them.

The format started at Anthropic and has become a de facto open standard. The same files now plug into Claude Code, Cursor, GitHub Copilot, Codex CLI, Gemini CLI, Windsurf, and OpenCode — any agent that reads markdown instructions. Write once, run everywhere. That portability is exactly why getting the authoring right pays off across every tool your team uses.

Progressive disclosure: the core idea

The single most important concept is progressive disclosure. At session startup, your agent reads only the name and description of every installed skill — roughly 100 tokens each. It reads the full SKILL.md body only when the skill becomes relevant, and it reads bundled reference files only when a specific task demands them.
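That startup pass can be sketched in a few lines of Python. This is a minimal illustration, assuming each skill lives in its own folder under a common root and that the frontmatter is flat `key: value` pairs; a real loader would use a proper YAML parser. The names `read_frontmatter` and `startup_index` are hypothetical, not part of any agent's API.

```python
from pathlib import Path


def read_frontmatter(skill_md: Path) -> dict[str, str]:
    """Return the key: value pairs from the frontmatter, reading no further than its closing ---."""
    meta: dict[str, str] = {}
    with skill_md.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            return meta  # no frontmatter block at the top of the file
        for line in fh:
            if line.strip() == "---":
                break  # stop here: the body is never read at startup
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip()
    return meta


def startup_index(skills_root: Path) -> list[dict[str, str]]:
    """Collect only the metadata of every installed skill (assumed layout: <root>/<skill>/SKILL.md)."""
    return [read_frontmatter(p) for p in sorted(skills_root.glob("*/SKILL.md"))]
```

The point of the sketch is the early `break`: the body and any bundled files stay on disk until something asks for them.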

This is what lets you install 100 skills without drowning the context window. A February 2026 study from Bosch Research and Carnegie Mellon that analyzed over 40,000 publicly listed skills found the median skill body is around 1,400 tokens — small, focused, and loaded on demand. The architecture is filesystem-based: scripts can be executed via bash without ever loading their source into context, so only the output costs tokens.
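A back-of-the-envelope calculation shows the difference, using only the figures quoted in this guide. Treating the median body size as typical for every skill is an assumption made for illustration.

```python
CONTEXT_WINDOW = 200_000     # tokens, the window size mentioned in the introduction
METADATA_TOKENS = 100        # rough cost of one skill's name and description
MEDIAN_BODY_TOKENS = 1_400   # median SKILL.md body; assumed typical for this estimate
INSTALLED_SKILLS = 100

# Progressive disclosure: only metadata is loaded at startup.
metadata_cost = INSTALLED_SKILLS * METADATA_TOKENS
# Hypothetical eager loading: every body is loaded up front.
eager_cost = INSTALLED_SKILLS * MEDIAN_BODY_TOKENS

print(f"metadata only: {metadata_cost:,} tokens ({metadata_cost / CONTEXT_WINDOW:.0%} of the window)")
print(f"all bodies:    {eager_cost:,} tokens ({eager_cost / CONTEXT_WINDOW:.0%} of the window)")
```

Under these assumptions the metadata costs 10,000 tokens, about 5% of the window, while loading every body up front would cost 140,000 tokens, about 70%.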

The practical rules that fall out of this:

Keep the SKILL.md body under 500 lines. If it grows past that, split it.
Keep reference links one level deep from SKILL.md. When a file is reachable only through another reference file, agents may preview just part of it and end up with incomplete information.
For reference files longer than 100 lines, add a table of contents at the top so the agent sees the full scope even on a partial read.
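The three rules above are mechanical enough to lint. The following is a rough sketch, not an official tool: `check_skill` is a hypothetical helper, it only follows markdown links to `.md` files, and it detects a table of contents by looking for the word "contents" in the first ten lines, which is a heuristic.

```python
import re
from pathlib import Path

MAX_BODY_LINES = 500       # the SKILL.md body should stay under this
TOC_THRESHOLD_LINES = 100  # longer reference files should open with a table of contents
# Markdown links to .md files, with an optional #anchor after the path
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+\.md)(?:#[^)\s]*)?\)")


def body_lines(text: str) -> list[str]:
    """Return the lines after the frontmatter block (all lines if there is none)."""
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    if stripped and stripped[0] == "---" and "---" in stripped[1:]:
        return lines[stripped.index("---", 1) + 1:]
    return lines


def check_skill(skill_dir: Path) -> list[str]:
    """Return a list of rule violations for one skill folder."""
    problems: list[str] = []
    body = body_lines((skill_dir / "SKILL.md").read_text(encoding="utf-8"))
    if len(body) >= MAX_BODY_LINES:
        problems.append(f"SKILL.md body has {len(body)} lines; split it")
    for ref in sorted(set(MD_LINK.findall("\n".join(body)))):
        ref_path = skill_dir / ref
        if not ref_path.is_file():
            problems.append(f"{ref}: referenced file is missing")
            continue
        ref_text = ref_path.read_text(encoding="utf-8")
        if MD_LINK.search(ref_text):
            problems.append(f"{ref}: links to another file (more than one level deep)")
        ref_lines = ref_text.splitlines()
        head = "\n".join(ref_lines[:10]).lower()
        if len(ref_lines) > TOC_THRESHOLD_LINES and "contents" not in head:
            problems.append(f"{ref}: over {TOC_THRESHOLD_LINES} lines with no table of contents")
    return problems
```

Running `check_skill(Path("skills/processing-pdfs"))` (a hypothetical path) returns an empty list when the folder follows all three rules.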

Anatomy of a skill that works

Here is the minimal frontmatter every SKILL.md needs:

---
name: processing-pdfs
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDFs or when the user mentions forms or document extraction.
---

Two fields do an enormous amount of work. The name must contain only lowercase letters, numbers, and hyphens, and run to at most 64 characters; gerund forms like processing-pdfs or testing-code read best. The description is capped at 1,024 characters and is the single most important line in the whole file, because the agent uses it to choose between potentially dozens of skills.

Write descriptions in the third person and pack them with both what the skill does and when to use it:

# Good — specific, includes triggers
description: Analyze Excel spreadsheets, create pivot tables, generate charts. Use when analyzing spreadsheets, tabular data, or .xlsx files.
 
# Bad — vague, no triggers
description: Helps with documents

A description like "Helps with documents" gives the agent nothing to discriminate on. It will either never fire or fire at the wrong time.
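The naming and description rules above can be checked before a skill is ever loaded. This sketch reads both length limits as inclusive maxima, which is an assumption, and its trigger check is a simple phrase heuristic rather than anything defined by the format. `check_frontmatter` and `TRIGGER_HINTS` are illustrative names.

```python
import re

NAME_PATTERN = re.compile(r"[a-z0-9-]+")  # lowercase letters, numbers, hyphens
MAX_NAME_CHARS = 64
MAX_DESCRIPTION_CHARS = 1024
# Heuristic only: phrases that usually signal a "when to use it" clause.
TRIGGER_HINTS = ("use when", "use for", "use this")


def check_frontmatter(name: str, description: str) -> list[str]:
    """Return human-readable problems with a skill's name and description."""
    problems: list[str] = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name: use lowercase letters, numbers, and hyphens only")
    if len(name) > MAX_NAME_CHARS:
        problems.append(f"name: longer than {MAX_NAME_CHARS} characters")
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(f"description: longer than {MAX_DESCRIPTION_CHARS} characters")
    if not any(hint in description.lower() for hint in TRIGGER_HINTS):
        problems.append("description: no trigger phrase such as 'Use when ...'")
    return problems
```

Applied to the examples above, the spreadsheet description passes, while "Helps with documents" is flagged for having no trigger phrase.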

Match freedom to fragility

The body of a great skill calibrates how much latitude it gives the agent to the fragility of the task. Think of the agent as a robot walking a path:

Open field, many safe routes — use high freedom. Give general direction in prose and trust the model. Code review is a good example: the right approach depends on context.

## Code review process
1. Analyze the code structure and organization
2. Check for potential bugs or edge cases
3. Suggest improvements for readability and maintainability
4. Verify adherence to project conventions

Narrow bridge with cliffs on both sides — use low freedom. Give an exact command and forbid deviation. Database migrations are the canonical case:

## Database migration
Run exactly this script:
`python scripts/migrate.py --verify --backup`
Do not modify the command or add additional flags.

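Getting this calibration wrong is the most common reason skills misbehave: too much freedom on a fragile task invites improvisation where you needed precision, and too little on an open-ended task locks the agent into one rigid route where judgment was needed.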
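For the narrow-bridge case, one option is to back the written instruction with a wrapper that cannot be varied. The sketch below is an assumption about how a team might do this, not something the skill format requires; `build_migration_command` and `run_migration` are hypothetical names, and the command itself is the one from the example above.

```python
import subprocess

# The exact command from the skill, kept in one place so nothing else can vary it.
MIGRATION_COMMAND = ["python", "scripts/migrate.py", "--verify", "--backup"]


def build_migration_command(extra_args: list[str]) -> list[str]:
    """Return the fixed command, refusing any attempt to add or change flags."""
    if extra_args:
        raise ValueError(f"refusing extra arguments: {extra_args}")
    return list(MIGRATION_COMMAND)


def run_migration(extra_args: list[str]) -> int:
    """Run the migration and return its exit code; raises ValueError on extra arguments."""
    command = build_migration_command(extra_args)
    return subprocess.run(command, check=False).returncode
```

With a wrapper like this, the skill body can point at a single entry point, and an improvised flag fails loudly instead of reaching the database.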

Verification loops are non-negotiable

The Osmani skills add a layer most homegrown skills skip entirely: each one ends with evidence requirements and an anti-rationalization table — common excuses an agent might give for skipping a step, paired with a documented rebuttal. The philosophy is "process, not prose": a workflow with checkpoints and exit criteria, not a wall of reference text.

The pattern that delivers the biggest quality jump is the validate-fix-repeat loop. Give the agent a checklist it copies into its response and ticks off, and a validator it must pass before proceeding:

## Document editing process
1. Make your edits to word/document.xml
2. Validate immediately: python ooxml/scripts/validate.py unpacked_dir/
3. If validation fails:
   - Review the error message
   - Fix the issues
   - Run validation again
4. Only proceed when validation passes
5. Rebuild and test the output
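The same loop can be written as a small driver, which makes the exit criteria explicit. This is a generic sketch: `validate` and `fix` are placeholders you would supply (for example, a function that runs the validator script and collects its errors), and the cap on rounds is an added safeguard that the steps above do not mention.

```python
from typing import Callable


def validate_fix_repeat(
    validate: Callable[[], list[str]],
    fix: Callable[[list[str]], None],
    max_rounds: int = 5,
) -> int:
    """Alternate validation and fixing until validation passes.

    Returns the number of fix rounds that were needed. Raises RuntimeError
    if errors remain after max_rounds fixes, so the loop cannot spin forever.
    """
    rounds = 0
    errors = validate()
    while errors:
        if rounds == max_rounds:
            raise RuntimeError(f"still failing after {max_rounds} rounds: {errors}")
        fix(errors)
        rounds += 1
        errors = validate()  # always re-validate after a fix before proceeding
    return rounds
```

The function returns only once validation has passed, which mirrors step 4: nothing downstream runs on an unvalidated document.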

For batch or destructive operations, go further with the plan-validate-execute pattern: have the agent write a structured plan file, validate it with a script, and only then apply changes. Without it, an agent updating 50 form fields from a spreadsheet can reference fields that do not exist; with it, those errors surface before anything is touched.
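A minimal sketch of plan-validate-execute for the form-field case, assuming the plan is a mapping from field name to new value and the form is a mapping of existing fields. `validate_plan` and `apply_plan` are illustrative names, and a real plan would typically be read from a file such as JSON.

```python
def validate_plan(plan: dict[str, str], known_fields: set[str]) -> list[str]:
    """Return the plan's field names that do not exist in the form, sorted."""
    return sorted(field for field in plan if field not in known_fields)


def apply_plan(form: dict[str, str], plan: dict[str, str]) -> dict[str, str]:
    """Apply the plan only if every field it names exists; otherwise change nothing."""
    unknown = validate_plan(plan, set(form))
    if unknown:
        raise ValueError(f"plan references unknown fields: {unknown}")
    updated = dict(form)  # work on a copy so the original is untouched on any failure
    updated.update(plan)
    return updated
```

Because validation runs over the whole plan before the first write, a single bad field name rejects the batch instead of leaving the form half-edited.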

Build evaluations before documentation

The biggest mindset shift in Anthropic's skill authoring guidance: write your evaluations first. Run the agent on real tasks without a skill, document where it fails, then build three test scenarios that capture those failures. Establish a baseline, write the minimal instructions needed to pass, and iterate. This keeps your skill aimed at a problem you have observed rather than one you imagined.
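An evaluation set can be as simple as a list of scenarios and a pass-rate function. The sketch below is one possible shape, not a prescribed harness: `Scenario`, `pass_rate`, and the `run_agent` callable are all hypothetical, and how you actually invoke an agent with or without a skill depends on your tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    prompt: str
    passed: Callable[[str], bool]  # inspects the agent's output and returns True on success


def pass_rate(run_agent: Callable[[str], str], scenarios: list[Scenario]) -> float:
    """Run every scenario through the agent and return the fraction that passed."""
    results = [scenario.passed(run_agent(scenario.prompt)) for scenario in scenarios]
    return sum(results) / len(results)
```

Call `pass_rate` once with the agent running without the skill to record the baseline, then again with the skill loaded; the skill has earned its place only if the second number is higher.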

The most effective authoring workflow uses two instances of the model. "Claude A" helps you draft and refine the skill; "Claude B" — a fresh instance with the skill loaded — runs real tasks. You watch where B struggles, bring the specifics back to A, and refine. The models understand the skill format natively, so you do not need a special prompt to get good structure — you need real observation of how the skill gets used.

A pre-flight checklist

Before you commit a skill, verify:

The description is specific and names concrete triggers.
The SKILL.md body is under 500 lines; detail lives in separate files.
No time-sensitive information leaks in, such as instructions tied to a particular date; deprecated patterns go in an "old patterns" section instead.
Terminology is consistent (pick "API endpoint" and never drift to "URL" or "route").
File references are one level deep and examples are concrete, not abstract.
Scripts handle errors explicitly instead of punting back to the agent, and constants are documented rather than magic numbers.
Verification or feedback loops exist for anything quality-critical.
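The item about scripts is the easiest to get wrong, so here is a small illustration of what "handle errors explicitly" and "documented constants" can look like. The file, the defaults, and the size limit are all invented for the example.

```python
import sys
from pathlib import Path

# Used when the settings file is absent, so a fresh checkout still runs.
DEFAULT_SETTINGS_TEXT = "retries=3\n"
# Settings files are small; anything above 1 MiB is almost certainly the wrong file.
MAX_SETTINGS_BYTES = 1_048_576


def load_settings_text(path: Path) -> str:
    """Read the settings file, falling back to defaults instead of failing on a missing file."""
    try:
        if path.stat().st_size > MAX_SETTINGS_BYTES:
            raise ValueError(f"{path} is larger than {MAX_SETTINGS_BYTES} bytes")
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # Resolve the problem here rather than handing the agent a bare traceback.
        print(f"warning: {path} not found, using built-in defaults", file=sys.stderr)
        return DEFAULT_SETTINGS_TEXT
```

The missing-file case is resolved inside the script with a clear message, and each constant carries a comment explaining why it has that value.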

Why this matters for MENA teams

Because skills are plain markdown that work across every major agent, they are a hedge against vendor lock-in — a real concern for teams in Tunisia, Saudi Arabia, and the wider region navigating shifting model access and pricing. A well-authored skill library is portable institutional knowledge: the senior-engineer workflows your team relies on, encoded once and runnable on whatever agent is fastest, cheapest, or most available this quarter. That is leverage a scattered pile of .md files will never give you.

The takeaway is the one that made Osmani's release land: a skill is not a prompt snippet you toss into a folder. It is a small, tested, single-purpose unit of engineering judgment. Treat it like production code — concise, calibrated, and verified — and your agents start behaving like the senior engineers you wanted them to be.

Sources: addyosmani/agent-skills on GitHub · Skill authoring best practices — Claude Docs · Agent Skills Work but Most Teams Are Building Them Wrong — O'Reilly Radar