Swarm Skills: Multi-Agent Extension of the Agent Skills Standard

Swarm Skills is a 2026 proposal extending Agent Skills with multi-agent roles, a workflow layout, and a self-evolution lifecycle for portable coordination protocols.

What the spec adds to Agent Skills

The base Agent Skills standard packages single-agent task knowledge in a SKILL.md. Swarm Skills adds multi-agent semantics without replacing that contract (Swarm Skills paper):

| Field added | Purpose |
| --- | --- |
| `kind: swarm-skill` | Discriminator that lets a single-agent host ignore the swarm-specific fields cleanly via `additionalProperties: true` |
| `teammate_mode` | Interaction style, for example `build_mode` for autonomous execution or `plan_mode` for approval workflows |
| `roles[]` | Participants; each entry carries an `id`, required skills and tools, and an optional target model |

A host that does not know the swarm-skill kind keeps loading the file as a regular skill and skips the unknown fields.
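As a rough illustration of that fallback, the sketch below treats the frontmatter as an already-parsed dictionary. The field names `kind`, `teammate_mode`, and `roles` (with an `id` per entry) come from the proposal as summarized here, and `name` and `description` are the base Agent Skills keys. The nested key names `skills`, `tools`, and `model`, all example values, and the helper functions are hypothetical.

```python
# Hypothetical frontmatter, shown as an already-parsed dict.
# The nested keys "skills", "tools", "model" and every value are assumptions.
FRONTMATTER = {
    "name": "literature-review",
    "description": "Research a topic and produce a reviewed summary.",
    "kind": "swarm-skill",
    "teammate_mode": "build_mode",
    "roles": [
        {"id": "leader", "skills": [], "tools": ["read_file"]},
        {"id": "researcher", "skills": ["web-search"], "tools": ["read_file"],
         "model": "example-model"},
    ],
}

# Keys a single-agent host is assumed to understand.
BASE_KEYS = {"name", "description"}


def load_as_single_agent(frontmatter: dict) -> dict:
    """Keep the base keys and silently skip everything else."""
    return {key: value for key, value in frontmatter.items() if key in BASE_KEYS}


def is_swarm_skill(frontmatter: dict) -> bool:
    """A swarm-aware host checks the discriminator before reading roles."""
    return frontmatter.get("kind") == "swarm-skill"


def role_ids(frontmatter: dict) -> list[str]:
    """List the participant ids declared in roles[]."""
    return [role["id"] for role in frontmatter.get("roles", [])]
```

A swarm-aware host would branch on `is_swarm_skill` first, while an older host only ever sees the result of `load_as_single_agent`.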

File layout

A Swarm Skill is a directory, not a single file (Swarm Skills paper):

```
my-swarm-skill/
  SKILL.md          # frontmatter + natural-language body
  roles/            # one persona file per teammate id
    leader.md
    researcher.md
    reviewer.md
  workflow.md       # task dependency graph (sequential, parallel, fan-out/fan-in)
  bind.md           # operational boundaries: turns, token budgets, quality gates
  dependencies      # other Swarm Skills this one depends on
  evolutions.json   # runtime artifact, appended by the host, not hand-edited
```
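To make the fan-out/fan-in idea in `workflow.md` concrete, the sketch below groups tasks into waves that could run in parallel. It assumes the dependency graph has already been extracted into a plain dictionary, which the proposal does not prescribe; the task names and the `execution_waves` helper are hypothetical. It relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it waits for.
DEPS = {
    "plan": set(),
    "research_a": {"plan"},                  # fan-out from plan
    "research_b": {"plan"},                  # fan-out from plan
    "review": {"research_a", "research_b"},  # fan-in
}


def execution_waves(deps: dict) -> list[list[str]]:
    """Group tasks into waves; tasks within one wave have no mutual dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

With the graph above, `plan` runs alone, the two research tasks share a wave, and `review` runs last.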

The Host Agent loads only SKILL.md frontmatter to route a task. It pulls role files, workflow.md, and bind.md on demand through the host's native read_file — no new DSL or runtime plugin. This extends Anthropic Skills' progressive-disclosure pattern to multi-agent metadata.
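The loading order described above can be sketched as a small lazy loader. This is an illustration under assumptions, not the reference host's code: the `LazySwarmSkill` class and its method names are hypothetical, and the frontmatter is assumed to sit between two `---` delimiters at the top of SKILL.md.

```python
from pathlib import Path


class LazySwarmSkill:
    """Hypothetical loader: frontmatter first, everything else on demand."""

    def __init__(self, root):
        self.root = Path(root)
        self.reads = []  # relative paths of files actually opened, in order

    def _read(self, relative: str) -> str:
        self.reads.append(relative)
        return (self.root / relative).read_text(encoding="utf-8")

    def frontmatter_text(self) -> str:
        """Return the raw frontmatter, assumed to be delimited by '---' lines."""
        parts = self._read("SKILL.md").split("---", 2)
        if len(parts) == 3 and parts[0].strip() == "":
            return parts[1].strip()
        return ""

    def role(self, role_id: str) -> str:
        """Load one persona file only when that role is needed."""
        return self._read(f"roles/{role_id}.md")

    def workflow(self) -> str:
        return self._read("workflow.md")

    def bounds(self) -> str:
        return self._read("bind.md")
```

Routing touches only `frontmatter_text()`; the `reads` list makes it easy to confirm that role and workflow files are opened only after a skill has been selected.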

The CREATE / USE / PATCH lifecycle

```mermaid
graph TD
    A[Multi-agent run] --> B{>=2 distinct roles<br/>and cross-agent deps?}
    B -- yes --> C[CREATE: distill trace<br/>into candidate Swarm Skill]
    B -- no --> A
    C --> D[USE: Host injects skill<br/>description into Leader prompt]
    D --> E[Run produces new trace]
    E --> F[PATCH: scan for friction<br/>append Evolution Record]
    F -->|>=10 records| G[Governance:<br/>SIMPLIFY / REBUILD / ROLLBACK]
    F --> D
    G --> D
```

CREATE: trajectory distillation. An LLM distills a trace that contains two or more distinct sub-agent roles and cross-agent dependencies into a candidate Swarm Skill.
USE: progressive disclosure. The Host injects the description field from the frontmatter into the Leader prompt, then loads the full role and workflow content only after the skill is selected. It appends existing Evolution Records to the relevant instructions.
PATCH: friction-driven optimization. Post-execution analysis scans traces for circular dependencies, redundant loops, and premature termination, then appends a new Evolution Record (Context, Change Directive, Scoring Metrics) to evolutions.json.
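Two mechanical pieces of this lifecycle can be sketched without an LLM: the CREATE eligibility check and the PATCH append. The trace shape (steps with `id`, `role`, `depends_on`), the JSON-array layout of evolutions.json, the snake_case key names, and both function names are assumptions made for illustration; the proposal as summarized here names only the three record parts.

```python
import json
from pathlib import Path


def qualifies_for_create(steps: list[dict]) -> bool:
    """True when a trace has >= 2 distinct roles and a cross-role dependency.

    Each step is assumed to look like
    {"id": "t2", "role": "reviewer", "depends_on": ["t1"]}.
    """
    role_of = {step["id"]: step["role"] for step in steps}
    distinct_roles = set(role_of.values())
    cross_agent = any(
        role_of[dep] != step["role"]
        for step in steps
        for dep in step.get("depends_on", [])
        if dep in role_of
    )
    return len(distinct_roles) >= 2 and cross_agent


def append_evolution_record(path, context, change_directive, scoring_metrics) -> int:
    """Append one record to evolutions.json (assumed to hold a JSON array).

    Returns the number of records after the append.
    """
    path = Path(path)
    records = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    records.append({
        "context": context,
        "change_directive": change_directive,
        "scoring_metrics": scoring_metrics,
    })
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)
```

The returned count is what a host would compare against the governance threshold after each PATCH.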

Scoring formula

Each Evolution Record carries a composite score that decides whether it stays, gets pruned, or triggers a rewrite (Swarm Skills paper):

S_i = w_E * E + w_U * U + w_F * F

| Component | What it measures |
| --- | --- |
| Effectiveness (E in [0,1]) | Effect of the patch, stabilized with Bayesian smoothing using a Beta(1,1) prior so early evaluations do not dominate |
| Utilization (U in [0,1]) | Adoption rate; if the Leader consistently ignores an appended instruction, U decays |
| Freshness (F in [0,1]) | Time-decay factor with an exponential half-life (for example, 90 days) so stale optimizations gradually drop out |
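One plausible reading of these three components is sketched below. It is not taken from the paper: E is computed as the posterior mean under a Beta(1,1) prior, U as a simple followed/offered ratio, and F as a half-life decay. The equal weights, the success/trial counting, and all function names are assumptions.

```python
def effectiveness(successes: int, trials: int) -> float:
    """Posterior mean under a Beta(1,1) prior: (successes + 1) / (trials + 2)."""
    return (successes + 1) / (trials + 2)


def utilization(times_followed: int, times_offered: int) -> float:
    """Adoption rate, assumed here to be a plain ratio (0.0 if never offered)."""
    return times_followed / times_offered if times_offered else 0.0


def freshness(age_days: float, half_life_days: float = 90.0) -> float:
    """Exponential decay that halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def composite_score(e: float, u: float, f: float,
                    w_e: float = 1 / 3, w_u: float = 1 / 3, w_f: float = 1 / 3) -> float:
    """S_i = w_E * E + w_U * U + w_F * F, with hypothetical equal weights."""
    return w_e * e + w_u * u + w_f * f
```

Under this reading a record with no evaluations starts at E = 0.5 rather than at an extreme, which is the stabilizing effect the prior is meant to provide.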

At 10 or more records, three governance actions become available. SIMPLIFY prunes through LLM categorization (delete, merge, refine, retain). REBUILD rewrites the spec, archives the prior version, and clears evolutions.json. ROLLBACK reverts to any archived state.
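The threshold check and the REBUILD and ROLLBACK bookkeeping can be sketched in memory as follows. SIMPLIFY is left out because it depends on LLM categorization. The `SkillState` class, the choice to archive records together with the spec, and restoring both on rollback are assumptions; the text above only says that REBUILD archives the prior version and clears evolutions.json.

```python
from dataclasses import dataclass, field

GOVERNANCE_THRESHOLD = 10  # "10 or more records"


@dataclass
class SkillState:
    """Hypothetical in-memory stand-in for a skill directory and its history."""
    spec: str
    records: list = field(default_factory=list)
    archive: list = field(default_factory=list)  # (spec, records) snapshots

    def governance_available(self) -> bool:
        """Callers are expected to check this before invoking an action."""
        return len(self.records) >= GOVERNANCE_THRESHOLD

    def rebuild(self, new_spec: str) -> None:
        """Archive the current version, install the rewrite, clear the records."""
        self.archive.append((self.spec, list(self.records)))
        self.spec = new_spec
        self.records = []

    def rollback(self, index: int) -> None:
        """Restore an archived snapshot (assumed to include its records)."""
        spec, records = self.archive[index]
        self.spec = spec
        self.records = list(records)
```

Keeping the archive append-only means a rollback never destroys history, so a later rollback to a different snapshot remains possible.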

Reference implementation

JiuwenSwarm is the reference Host Agent on openJiuwen.com; community swarm skills are intended to live at swarmskills.openjiuwen.com. The algorithm "strictly interacts with the schema defined by the Swarm Skills specification" — the authors claim this makes it portable to other multi-agent runtimes without framework-specific plugins (Swarm Skills paper).

What is not yet proven

The paper is a specification proposal with measurement support, not a benchmarked system. The authors call out the gaps directly (Swarm Skills paper):

No quantitative benchmarks of the self-evolution algorithm. The empirical work is a measurement study of 33 queries across 9 Anthropic Skills repositories crawled in April 2026. It shows that authors are writing multi-role skills (for example, engineering-team with 14 roles), but does not measure whether the scoring mechanism improves outcomes.
No conformance testing across varied Host Agents. Effectiveness on hosts that lack native recursive read_file or dynamic sub-agent spawning "remains to be empirically evaluated."
First-run lock-in. A poor initial workflow tends to accumulate patches rather than trigger REBUILD, which entrenches bad decisions in an opaque chain of Evolution Records.
Adoption beyond the authors is unverified. The community Swarm Skills hub is currently inaccessible to anonymous fetch.

For projects already invested in framework-native multi-agent code, the portability benefit is theoretical until a second Host Agent ships conformant support. A related critique applies: some "LLM swarm" framings overstate genuine swarm behavior (LLM-Powered Swarms: A New Frontier or a Conceptual Stretch?).

Key Takeaways

Swarm Skills extends Agent Skills with three frontmatter fields (kind, teammate_mode, roles[]) plus a roles/ + workflow.md + bind.md + evolutions.json layout — no new DSL.
A CREATE / USE / PATCH lifecycle distills multi-agent traces into reusable skills and appends Evolution Records on friction signals.
Scoring is S = w_E*E + w_U*U + w_F*F with Bayesian-smoothed effectiveness, decaying utilization, and exponential-half-life freshness; the SIMPLIFY, REBUILD, and ROLLBACK governance actions become available at 10 or more records.
The spec is a May 2026 proposal — no independent benchmarks of the self-evolution loop, no conformance tests on hosts other than JiuwenSwarm, and the authors flag "first-run lock-in" as an open failure mode.