Part 3 of 5 · Evaluator · 14 min

The Stack: Chat, Skills, ICM, Frameworks

Chat, skills, ICM, or framework: the principle that helps you decide which fits.

A lot of AI work gets wasted at the wrong layer: picking ICM for a task a skill could do in three lines, or reaching for LangChain for a task ICM could handle as a folder of markdown. The instinct is usually right ("this needs more structure"), but the chosen layer is wrong. The thing gets built, rewritten, and rewritten again as the requirements clarify. The build isn't the cost. The layer-mismatch tax is. It gets paid every time requirements shift or a tool you depended on changes underneath you.

The frame this part offers is a stack: seven tiers of structure, from a bare chat to a full framework. Each tier solves a real problem the tier below can't, and each tier carries cost the tier below avoids. Knowing which tier your work belongs to is most of the discipline; picking the right tier the first time is most of the savings.

Pick the simplest layer that works

Rich Sutton's 2019 essay The Bitter Lesson is the philosophical anchor for this stack. Sutton's argument goes like this: across seventy years of AI research, every time someone built clever, hand-crafted, domain-specific structure to make a system perform better, general methods using more compute eventually swept the structure aside. Speech recognition, computer vision, game-playing, language modeling. The careful symbolic systems that worked well in the 1980s lost to statistical methods in the 1990s. The hand-tuned features that won in the 2000s lost to learned features in the 2010s. Narrow expert systems lose to general models that get smarter with more compute and more data.

The lesson Sutton names (the reason he calls it bitter) is that researchers keep falling for the appeal of building specialized structure, and they keep losing to general methods. The bitter version: don't over-build the bespoke layer, because the bespoke layer gets eaten. The constructive version: build the structure that survives the next general advance, the structure that makes your work more portable rather than more specialized.

Applied to AI work, that argument produces a stack with a clear principle: pick the simplest layer that works, and only escalate when you're forced to. The shape of the work tells you which tier. Here's the full progression:

| Layer | What it does | Examples | Fits when |
|---|---|---|---|
| 1. Chat | One-off exploration | Bare ChatGPT, Claude.ai | "I'm thinking out loud" |
| 2. Projects / Gems / Custom GPTs | Persistent context | Claude Projects, Gemini Gems, Custom GPTs | "I want the same context every chat" |
| 3. Skills | Atomic capabilities, 1 to 3 steps | Claude Skills, AGENTS.md skills | "I want it to do this one thing my way" |
| 4. Projects + Skills | Context with capabilities | (combined) | The natural ad-hoc combo |
| 5. ICM | Multi-step workflows with handoffs | Folders + CLAUDE.md / AGENTS.md | "I want a sequenced process with checkpoints" |
| 6. ICM + durable execution | Production-grade reliability | ICM + DBOS / LangGraph / Temporal | "I want it reliable without babysitting" |
| 7. Frameworks | Real-time, concurrent, dynamic | LangChain, LangGraph, Temporal | "Files-as-source-of-truth breaks at scale" |

Tier 1: bare chat. Exploration. You're thinking out loud, sketching, asking questions that might not have answers yet. The work isn't repeatable yet, so it doesn't need structure. Any persistent structure at this stage is premature optimization for a workflow you haven't validated.

Tier 2: projects, gems, custom GPTs. The first hint of repeatability. You've found yourself pasting the same context into every chat: the same client background, the same brand voice, the same product description. A project (or its equivalent in Gemini Gems or OpenAI's Custom GPTs) lets you stash that context once and have it available in every conversation inside the project. The output is whatever the conversation produces; the project is a smarter chat surface. The Projects vs. Skills guide covers the distinction in more depth. The short version is that projects hold context, skills hold capability, and they compose.

Tier 3: skills. The first time you're encoding a capability rather than just context. A skill says "when this comes up, here's exactly how I want it handled." It runs three steps or fewer, with no real handoffs between steps and no human review in the middle. The skill is one atomic move: "Generate a meeting summary in this format." "Send a Slack message in this voice." "Convert this transcript to a clean reading guide." Skills are good for small repeated tasks that don't justify a full workspace.

Tier 4: projects + skills. The natural ad-hoc combination. Persistent context from the project, atomic capabilities from skills. The two layers compose: when you're in the project's context and you ask for something the skill knows how to do, the agent applies the skill against that context. A lot of work lives here, and it works well. ICM is overkill below this line.

Tier 5: ICM. This is where things change. The work has multiple steps. The output of step one is the input of step two. A human needs to look at the work somewhere in the middle and decide whether to proceed or redirect. The contracts between stages need to be stable enough that the workflow is reproducible. A skill can't carry that, since it's one move with no handoffs. A project plus a stack of skills can't carry it either, because there's no structure for the handoff. ICM's folder-of-markdown structure fits: stages in numbered folders, each stage with a contract that declares what it reads and what it produces, the workspace as a whole moving through the stages.
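To make the idea of a stage contract concrete, here is a minimal Python sketch. It is not ICM's actual format (in ICM the contracts live in markdown files inside the stage folders); the `StageContract` class, the folder names, and the artifact names are all hypothetical, chosen only to show how "declares what it reads and what it produces" makes handoffs checkable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageContract:
    """Hypothetical contract: what a stage reads and what it produces."""
    folder: str                     # numbered stage folder (illustrative name)
    reads: frozenset                # artifact names the stage expects to exist
    produces: frozenset             # artifact names the stage writes
    human_checkpoint: bool = False  # pause for review after this stage?


def unmet_inputs(stages, initial):
    """Walk stages in folder order; report inputs that nothing upstream produces."""
    available = set(initial)
    problems = []
    for stage in sorted(stages, key=lambda s: s.folder):
        for name in sorted(stage.reads - available):
            problems.append(
                f"{stage.folder} reads '{name}' but no earlier stage produces it"
            )
        available |= stage.produces
    return problems


# Illustrative three-stage workspace; names are assumptions, not real ICM files.
workspace = [
    StageContract("01-intake", frozenset({"transcript.md"}),
                  frozenset({"clean-transcript.md"})),
    StageContract("02-notes", frozenset({"clean-transcript.md"}),
                  frozenset({"notes.md"}), human_checkpoint=True),
    StageContract("03-actions", frozenset({"notes.md"}),
                  frozenset({"action-items.md"})),
]

print(unmet_inputs(workspace, initial={"transcript.md"}))  # []
```

Dropping the middle stage from the list makes the check report that `03-actions` reads `notes.md` with no producer, which is the kind of broken handoff a skill or a loose stack of skills has no way to express.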

Tier 6: ICM with durable execution. Adds reliability semantics on top of ICM. Most workspaces don't need this. The ones that do are the workspaces with side effects: stages that send emails, post to channels, write to databases, call external APIs. When a stage fails after a side effect, plain ICM has no rollback story. Durable execution closes that gap with workflow.yaml declarations, idempotency keys, and compensations. Part 5 covers this in detail.
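The two mechanisms named above, idempotency keys and compensations, can be sketched in a few lines of Python. This is an illustration of the general technique only: it does not reproduce the workflow.yaml format (which this part doesn't show) or the API of DBOS, LangGraph, or Temporal. `SideEffectLog`, `send_email`, `retract_email`, and the stage name are hypothetical, and the in-memory dictionary stands in for what would need to be durable storage.

```python
import hashlib


class SideEffectLog:
    """In-memory stand-in for a durable store (a real one must survive restarts)."""

    def __init__(self):
        self.done = {}           # idempotency key -> result of the completed effect
        self.compensations = []  # undo callables, in the order effects completed

    def run_once(self, stage, payload, effect, compensate):
        # Same stage + same payload -> same key, so a retry is recognized.
        key = hashlib.sha256(f"{stage}:{payload}".encode()).hexdigest()
        if key in self.done:
            return self.done[key]  # already ran: skip the side effect
        result = effect(payload)
        self.done[key] = result
        self.compensations.append(lambda: compensate(result))
        return result

    def roll_back(self):
        # Undo completed effects in reverse order.
        while self.compensations:
            self.compensations.pop()()


sent = []
log = SideEffectLog()


def send_email(body):
    sent.append(body)
    return len(sent) - 1       # list index acts as a message id


def retract_email(message_id):
    sent[message_id] = None    # hypothetical "unsend"


log.run_once("05-delivery", "notes for Monday sync", send_email, retract_email)
log.run_once("05-delivery", "notes for Monday sync", send_email, retract_email)  # retry
print(len(sent))  # 1

log.roll_back()
print(sent)       # [None]
```

The retry is a no-op because the key already exists, and the rollback walks completed effects backwards. Those two properties are what plain ICM lacks when a stage fails after it has already sent something.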

Tier 7: frameworks. Where files-as-source-of-truth breaks down. Real-time concurrency: multiple agents working on the same problem simultaneously, sharing state, reacting to each other in tight loops. Dynamic routing where the next step depends on a complex runtime computation, not just a checkpoint approval. Scale at thousands of workflows per day with proper queueing, rate limiting, and failure isolation. ICM's folder-based handoffs are too slow for these shapes. LangChain, LangGraph, and Temporal earn their keep here. The mistake is reaching for them when you don't have these shapes.
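One way to read the seven tiers is as a decision procedure: check for the shapes that force the highest tier first, then fall through to the simplest layer that still fits. The Python sketch below is one reading of the tier descriptions, not an official rubric. The `WorkShape` fields and the cut-offs (such as treating more than three steps as a signal for ICM) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkShape:
    """Hypothetical description of a piece of work; field names are illustrative."""
    needs_persistent_context: bool = False  # same background in every chat
    needs_capability: bool = False          # a specific "do it my way" move
    steps: int = 1
    has_handoffs: bool = False              # artifacts pass between steps
    human_checkpoint: bool = False          # someone reviews mid-workflow
    side_effects: bool = False              # emails, DB writes, external API calls
    realtime_or_concurrent: bool = False
    dynamic_routing: bool = False
    high_volume: bool = False               # thousands of workflows per day


def pick_tier(w: WorkShape) -> int:
    """Return the lowest tier (1 to 7) that fits the shape of the work."""
    if w.realtime_or_concurrent or w.dynamic_routing or w.high_volume:
        return 7
    if w.has_handoffs or w.human_checkpoint or w.steps > 3:
        return 6 if w.side_effects else 5
    if w.needs_capability and w.needs_persistent_context:
        return 4
    if w.needs_capability:
        return 3
    if w.needs_persistent_context:
        return 2
    return 1


print(pick_tier(WorkShape()))                                    # 1
print(pick_tier(WorkShape(needs_capability=True, steps=2)))      # 3
print(pick_tier(WorkShape(steps=5, has_handoffs=True,
                          human_checkpoint=True)))               # 5
```

The ordering of the checks carries the principle: nothing escalates unless a specific property of the work forces it.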

Two nuances are worth pulling out. First, ICM sits above projects and skills rather than replacing them. A team using projects-plus-skills for ad-hoc work and ICM for sequenced workflows is using the stack correctly, not double-paying. Skills can also live inside ICM workspaces as plug-ins for specific stages: an ICM workspace's skills/ folder bundles capabilities the workflow needs, and stages load them only when relevant. Second, frameworks belong at the top of the stack. When you genuinely have real-time concurrency, dynamic routing, or scale that breaks file-based handoffs, they earn their keep. When you don't, they're the bespoke layer the Bitter Lesson warns about: the clever structure that the next general advance will subsume.

For technical readers curious about what sits between ICM and a heavyweight framework, the lightweight Agent SDKs (OpenAI Agents SDK, Pydantic AI, Google ADK, Letta, smolagents, Mastra) are moving fast. That's a different post for a different audience and not in scope for this series.

How meeting-summarizer grew up

The clearest example of the stack in action is one I walked myself, recent enough to remember the exact moments of escalation.

The meeting-summarizer started as a Claude skill: paste a transcript, get a summary back. Two steps, no review. It worked. For a few weeks I used it as a skill, dropping transcripts into Claude and getting back clean meeting notes. The skill was small, the output was good enough, and the work didn't need anything more.

Then I wanted action items pulled out separately. Three steps now: read the transcript, write the notes, extract the action items as a distinct artifact. Still inside the skill, just longer. The skill grew a "now do this" section. Output got messier. Sometimes the action items came as a list inside the notes, sometimes as a separate section, sometimes the model decided they weren't important enough to surface.

Then I wanted key insights as a distinct section: the discoveries beyond what was technically said, the implications worth flagging. Four things now. The skill kept growing. The conditional branches multiplied: if the meeting was about X, emphasize Y; if it was a working session, look for Z. Output quality started to degrade unpredictably.

Then I added optional extended notes, only on explicit request, and that was the moment the skill had outgrown the layer. Five things happening, one of them gated. The skill markdown was tangled, with different output formats for different situations. There was no clean way to checkpoint, and no way for a different agent to run the same skill and produce structurally similar work. I was constantly tweaking the prompt to coax the model through the conditional logic.

So I promoted it. Same job, restructured as an ICM workspace with five stage folders, shared references for cross-stage rules, and a delivery stage that pauses for the folder confirm. The workspace's CLAUDE.md opens with one line: "Promoted from the meeting-summarizer skill on 2026-05-04." Each stage has its own contract for inputs, process, audit, and outputs. The conditional logic that was tangling the skill became explicit gating in the workspace's routing table. The "extended notes only on request" became a stage that exists but is gated. The model doesn't have to make a judgment call every run; the gate decides.
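The shift from "the model makes a judgment call" to "the gate decides" can be shown with a small Python sketch. The real routing table is markdown inside the workspace and isn't reproduced here; the stage names, the `extended_notes` request flag, and the `plan` function are hypothetical stand-ins.

```python
def always(request):
    """Gate that is open for every run."""
    return True


def extended_requested(request):
    """Gate that opens only when the run explicitly asks for extended notes."""
    return bool(request.get("extended_notes", False))


# Hypothetical routing table: (stage name, gate). Order is execution order.
ROUTING_TABLE = [
    ("notes", always),
    ("action-items", always),
    ("key-insights", always),
    ("extended-notes", extended_requested),
    ("delivery", always),
]


def plan(request):
    """Return the stages to run, in order, for this request."""
    return [stage for stage, gate in ROUTING_TABLE if gate(request)]


print(plan({}))
# ['notes', 'action-items', 'key-insights', 'delivery']
print(plan({"extended_notes": True}))
# ['notes', 'action-items', 'key-insights', 'extended-notes', 'delivery']
```

The gated stage always exists in the table; whether it runs is a property of the request rather than something the prompt has to coax out of the model on every run.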

The content-to-guide workspace grew the same way. It started as a skill for turning long PDFs and YouTube videos into reader's guides, and got promoted when it hit seven steps with two checkpoints and bundled skills for fetching transcripts and rendering EPUBs.

That's the pattern: skill until it isn't. Promotion isn't a failure of the skill; it's the work outgrowing the layer. There's a downloadable skill that does the conversion mechanically (skill-to-icm-converter, available from the landing page), but recognizing the moment when promotion is the right move matters more than running the conversion. Three steps with real artifacts moving between them, especially with a human checkpoint somewhere in the middle, is usually the threshold.

The next question is concrete: what does an ICM workspace actually look like? Not the abstract structure, the real thing, with stage folders and outputs and the routing tables that make it run. Part 4 walks through meeting-summarizer and content-to-guide end to end.

Three takeaways

  1. Skills for 1 to 3 atomic steps. If your work fits in a single capability with no real handoffs, a skill is the right home. ICM is overkill below that threshold; the structure costs more than the work justifies.
  2. ICM for sequential, reviewed workflows. Multi-step work with artifacts moving between stages, especially with a human checkpoint somewhere in the middle, is where ICM earns its keep. The stage contract is what makes the steps composable.
  3. Frameworks belong at the top. Most agent work isn't real-time, concurrent, or dynamically routed. For the work that is, frameworks belong. For the work that isn't, they're the bespoke layer that gets eaten.

Sources

  • Sutton, The Bitter Lesson (2019): the canonical statement of the lesson and the source of the Stack's philosophical anchor
  • The skill-to-ICM promotion story, from my meeting-summarizer and content-to-guide workspaces, dated 2026-05-04.