#Vibe Working #Office Skills #AgentPuter #Productivity #AI Agent

Vibe Working: When "Just Tell the Agent" Actually Works

Enterprise AI saves analysts 1.5 hours a day — yet the best Agent still fails 53% of multi-app office tasks. The gap between single-app time savings and end-to-end automation is where the real opportunity lives.

@ AgentPuter Lab
~ 12 min read

In our previous three posts, we traced a single thread: from OpenClaw as a product → to its Brain-Body-Soul architecture → to the Skills + Gateway + MCP capability stack underneath.

We kept saying “Skills will transform daily work.” Time to show what that actually looks like.


I. Microsoft Called It “Vibe Working”

On September 29, 2025, Microsoft shipped two features in Microsoft 365 Copilot and gave them a name: Vibe Working.

Agent Mode landed in Excel and Word. You type a prompt — “Build me a loan amortization calculator with monthly payment breakdowns” — and the Agent doesn’t just spit out a formula. It creates sheets, writes formulas, generates charts, validates results, spots errors, fixes them, and iterates until the output checks out. Multi-step. Self-correcting.
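
To make “validates results” concrete: a loan amortization calculator is one closed-form formula applied over a loop, and the Agent has to get both right. Here is the underlying math as a minimal Python sketch (our illustration, not Microsoft’s implementation):

```python
def amortization_schedule(principal: float, annual_rate: float, months: int):
    """Yield (month, payment, interest, principal_paid, balance) per period."""
    r = annual_rate / 12                                # monthly interest rate
    payment = principal * r / (1 - (1 + r) ** -months)  # standard annuity formula
    balance = principal
    for month in range(1, months + 1):
        interest = balance * r
        principal_paid = payment - interest
        balance -= principal_paid
        yield month, payment, interest, principal_paid, max(balance, 0.0)

# A $300,000 loan at 6% APR over 30 years: payment comes to about $1,798.65/month.
for m, pay, intr, prin, bal in list(amortization_schedule(300_000, 0.06, 360))[:2]:
    print(f"month {m}: payment={pay:.2f} interest={intr:.2f} balance={bal:.2f}")
```

The formula is trivial. The failure surface is everything around it: wrong compounding period, off-by-one month counts, a chart bound to the wrong column. That is what “iterates until the output checks out” has to cover.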

Office Agent landed in the Copilot chat sidebar. You say “Make a board-ready presentation from this quarterly data” and it produces a polished PowerPoint. Not a template with placeholder text — an actual deck with your numbers, formatted, ready to present.

The name traces back to Andrej Karpathy. On February 2, 2025, the OpenAI founding member tweeted: “There’s a new kind of coding I call ‘vibe coding’, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.” Seven months later, Microsoft took that idea from code and applied it to spreadsheets, documents, and slides: you provide the intent, the Agent delivers the artifact.

No more wrestling with VLOOKUP syntax. No more manually formatting 47 slides. No more copying numbers between three spreadsheets and a Word doc.

At least, that’s the promise. Microsoft’s own reported numbers put Agent Mode in Excel at 57.2% accuracy on SpreadsheetBench’s complex tasks. Better than manual for some users, but a long way from reliable.


II. The Promise vs the Reality

Here’s what the research actually says.

Benchmarks for office automation — like SpreadsheetBench — tested top models on realistic workflows: filtering datasets, cross-referencing tables, and producing summary analysis. Tasks that a competent office worker handles daily without thinking twice.

Even the best systems fail nearly half the time. The conclusion from researchers is blunt: performance is still “far below the human accuracy standards required by real-world office workflows.”

The failure modes are instructive:

  • Operation redundancy — the Agent repeats the same action three times in a row, wasting tokens and sometimes corrupting its own output.
  • Hallucinated references — it confidently edits cell B14 in a spreadsheet that only has 10 rows (a simple bounds check catches this; see the sketch after this list).
  • App-switching failures — moving data from Excel to Word to Email breaks context more often than not.
  • Long-horizon drift — on tasks with 10+ steps, the Agent gradually forgets what it was trying to accomplish.
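
The second failure mode is also the easiest to catch mechanically. A minimal sketch of such a guard (ours, not from any shipping product) that rejects edits to cells outside the sheet’s real dimensions:

```python
import re

def validate_cell_ref(ref: str, n_rows: int, n_cols: int) -> bool:
    """Reject agent edits that target cells outside the sheet's actual bounds."""
    m = re.fullmatch(r"([A-Z]+)([1-9]\d*)", ref.upper())
    if not m:
        return False                       # malformed reference, e.g. "14B"
    letters, row = m.group(1), int(m.group(2))
    col = 0
    for ch in letters:                     # A=1 ... Z=26, AA=27, ...
        col = col * 26 + ord(ch) - ord("A") + 1
    return row <= n_rows and col <= n_cols

assert validate_cell_ref("B5", n_rows=10, n_cols=4)       # inside the sheet
assert not validate_cell_ref("B14", n_rows=10, n_cols=4)  # the failure above, caught
```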

But here’s what most people miss about these failures. Microsoft’s own AI Red Team published a taxonomy of failure modes in agentic systems, and the scariest finding isn’t hallucination — it’s the erosion of human oversight.

When the Agent generates a spreadsheet that looks right, users stop checking the formulas. When it drafts an email that sounds right, users hit send without reading. The real risk isn’t that the Agent gets it wrong. It’s that the human stops noticing.

This is the central tension in Vibe Working: the more capable the Agent becomes, the more dangerous it is to trust it without guardrails.


III. Four Scenarios: Before and After

Before diving into our own work, some context on what’s already been measured in the wild.

An NBER field study (conditionally accepted at American Economic Review: Insights) tracked 7,137 knowledge workers across 66 firms over six months. Workers using integrated AI tools spent 25–31% less time on email — roughly two to three fewer hours per week.

  • Morgan Stanley’s financial analysts saved 1.5 hours per day on research and report prep.
  • Repsol ran a Copilot pilot and found employees saved 121 minutes per week on average, with output quality improving by 16.2%.
  • World Wide Technology deployed Copilot to 941 users and measured 446 hours saved per week — mostly on meeting summaries, email drafts, and report generation.

Those numbers are real. But the NBER study’s buried finding is just as important: despite saving hours on email, there was no significant change in the quantity or composition of workers’ overall tasks. Workers could speed up things they controlled individually — but they couldn’t change workflows that required coordination with others. AI sped up the cells; it didn’t rewire the organism.

That’s the key insight. Current tools save time on individual tasks within one application. The hard part — the part where accuracy drops to ~50% — is when the Agent needs to chain tasks across multiple apps and deliver a complete artifact.

That’s where Skills-based orchestration comes in. Here’s what we’ve been building and testing.

Scenario 1: Quarterly Sales Report

Before: You open three CSV exports from the CRM. You paste them into Excel. You spend 40 minutes building pivot tables, writing SUMIFS formulas, formatting conditional colors, and creating charts. Then you copy the charts into a Word doc, write commentary around them, and email it to your manager. Total: ~2 hours.

After: You tell the Agent: “Pull Q4 sales data, break it down by region and product line, flag anything that dropped more than 15% quarter-over-quarter, and give me a report with charts.”

What happens under the hood (a code sketch follows this list):

  • A Sales Reporting Skill activates — it knows the standard report structure, which metrics matter, and how to flag anomalies.
  • The Skill orchestrates MCP tools: one connects to the CRM database, another writes to Excel, another generates the Word doc.
  • The Gateway manages the session — if the CRM query takes 30 seconds, it doesn’t time out; if the Excel write fails, it retries.
  • You get back a formatted Excel workbook and a Word summary. Total: ~3 minutes of your time.
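
In code, the division of labor might look like the sketch below. Every name is hypothetical; it illustrates the pattern, not AgentPuter’s actual API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str     # which MCP tool the Gateway should route this step to
    params: dict  # arguments the Skill supplies for that tool

# The Skill is the recipe: ordered steps plus the domain defaults it encodes.
SALES_REPORT_SKILL = [
    Step("crm.query",   {"period": "Q4", "group_by": ["region", "product_line"]}),
    Step("excel.write", {"workbook": "q4_report.xlsx", "charts": True}),
    Step("word.render", {"doc": "q4_summary.docx", "flag_drop_pct": 15}),
]

class Gateway:
    """Stand-in Gateway: route tool calls, retry instead of improvising."""
    def __init__(self, tools: dict):
        self.tools = tools  # tool name -> callable (an MCP client in practice)

    def run(self, skill: list[Step], retries: int = 3) -> dict:
        results = {}
        for step in skill:
            for attempt in range(retries):
                try:
                    results[step.tool] = self.tools[step.tool](**step.params)
                    break
                except TimeoutError:
                    if attempt == retries - 1:
                        raise  # surface the failure; don't let the Agent guess

        return results

# Usage: Gateway({"crm.query": crm.query, ...}).run(SALES_REPORT_SKILL)
```

The structure is the point: the Skill fixes the step order and parameters, the Gateway owns retries and failure handling, and the MCP tools do the I/O. The model never has to improvise any of it.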

The Agent didn’t improvise. It followed a recipe — one that encodes how your company structures its quarterly reports.

Scenario 2: Meeting Notes

Before: You sit through a 45-minute meeting. You scribble notes. Afterward, you spend 20 minutes typing them up, organizing by topic, identifying action items, and sending them to attendees. Half the time you miss something and have to check the recording.

After: You say: “Transcribe yesterday’s product sync, organize by topic, extract action items with owners and deadlines, and send the summary to everyone who attended.”

Under the hood:

  • A Meeting Notes Skill activates — it knows the difference between a decision, an action item, and background discussion (see the schema sketch after this list).
  • MCP tools handle the transcription (Whisper API), calendar lookup (who attended), and email dispatch.
  • The Skill applies your team’s preferred format — not a generic template, but the actual structure your team uses.
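
“Knows the difference between a decision, an action item, and background discussion” cashes out as a typed schema the Skill validates before anything goes out. A hypothetical sketch:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Kind(Enum):
    DECISION = "decision"      # something the group agreed on
    ACTION = "action"          # needs an owner, ideally a deadline
    DISCUSSION = "discussion"  # background context, no follow-up

@dataclass
class MeetingItem:
    kind: Kind
    text: str
    owner: str | None = None
    due: date | None = None

def validate(item: MeetingItem) -> MeetingItem:
    """Refuse to dispatch an action item nobody owns."""
    if item.kind is Kind.ACTION and item.owner is None:
        raise ValueError(f"action item without owner: {item.text!r}")
    return item

validate(MeetingItem(Kind.ACTION, "Ship pricing copy", owner="Dana",
                     due=date(2025, 10, 17)))
```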

The output is a document that looks like a human wrote it, because the Skill encodes how your team actually writes meeting notes.

Scenario 3: Contract Risk Review

Before: Legal sends you a 30-page vendor contract. You read it. You highlight clauses that look unusual. You cross-reference against your company’s standard terms. You write up a risk summary. This takes most of an afternoon.

After: You say: “Review this vendor contract against our standard terms. Flag deviations, rate each by risk level, and give me a summary I can send to legal.”

Under the hood:

  • A Contract Review Skill activates — it knows your company’s standard terms, common risk patterns, and how your legal team prefers risk ratings.
  • MCP tools handle PDF parsing, text extraction, and structured comparison (a minimal version of that comparison is sketched after this list).
  • The Gateway enforces access controls — the contract data stays within the secure runtime, never leaves the sandbox.
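
At its simplest, “structured comparison” means scoring each clause against your standard text and flagging drift. A toy sketch using only Python’s standard library (the clauses and threshold are invented for illustration):

```python
from difflib import SequenceMatcher

STANDARD_TERMS = {
    "termination": "Either party may terminate with 30 days written notice.",
    "liability":   "Liability is capped at fees paid in the prior 12 months.",
}

def flag_deviations(contract_clauses: dict, threshold: float = 0.8) -> list:
    """Return (clause, similarity) pairs that fall below the threshold."""
    flags = []
    for name, standard in STANDARD_TERMS.items():
        clause = contract_clauses.get(name, "")
        score = SequenceMatcher(None, standard.lower(), clause.lower()).ratio()
        if score < threshold:
            flags.append((name, round(score, 2)))  # low score = big deviation
    return flags

vendor = {
    "termination": "Vendor may terminate at any time without notice.",
    "liability":   "Liability is capped at fees paid in the prior 12 months.",
}
print(flag_deviations(vendor))  # only the termination clause gets flagged
```

A production Skill would use embeddings or an LLM judge rather than character-level similarity, but the output contract is the same: a ranked list of deviations, not a free-form essay.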

You get a structured risk report in 4 minutes. Legal still does the final review — the Agent doesn’t replace lawyers, it replaces the 3 hours of reading and highlighting that precede the actual legal judgment.

Scenario 4: Email Triage

Before: Monday morning. 127 unread emails. You spend 45 minutes scanning subject lines, opening messages, mentally categorizing (urgent / FYI / needs reply / spam), and drafting responses. By the time you’re done, three new urgent emails have arrived.

After: You say: “Triage my inbox. Flag anything urgent from direct reports or clients. Draft replies for anything that just needs acknowledgment. Summarize the rest in three bullet points.”

Under the hood:

  • An Email Triage Skill activates — it knows who your direct reports are, which clients are priority, and what “urgent” means in your context (a rule sketch follows this list).
  • MCP tools connect to your email provider, pull messages, and draft replies.
  • The Gateway ensures no email content is stored beyond the session — when the task is done, the data is gone.
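
The rules such a Skill encodes might look like the following; every address and rule below is invented for illustration:

```python
DIRECT_REPORTS = {"maria@corp.com", "li@corp.com"}
PRIORITY_CLIENT_DOMAINS = {"acme.com", "globex.com"}

def triage(sender: str, subject: str) -> str:
    """Classify one message: urgent / acknowledge / summarize."""
    domain = sender.split("@")[-1].lower()
    if sender.lower() in DIRECT_REPORTS or domain in PRIORITY_CLIENT_DOMAINS:
        return "urgent"       # surface immediately, never auto-reply
    if subject.lower().startswith(("thanks", "re: thanks", "received")):
        return "acknowledge"  # draft a short reply for human approval
    return "summarize"        # roll into the three-bullet digest

assert triage("maria@corp.com", "Q4 numbers") == "urgent"
assert triage("sales@acme.com", "Renewal question") == "urgent"
assert triage("noreply@news.io", "Your weekly digest") == "summarize"
```

In practice these sets would come from your org chart and CRM rather than being hard-coded; the shape of the logic is what matters.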

You review 127 emails in 6 minutes. You edit two drafted replies, approve the rest, and move on.


IV. What Makes This Work (and What Doesn’t)

All four scenarios share a pattern. Let’s make it explicit.

What makes it work:

  1. A Skill that encodes domain knowledge. Not a generic prompt — a structured instruction set that knows your company’s report format, your team’s meeting note style, your legal team’s risk rating scale. This is why a Skill-based approach outperforms raw prompting.
  2. MCP tools that handle the mechanics. The Agent doesn’t need to “figure out” how to connect to your CRM or parse a PDF. MCP provides pre-built, tested integrations. The Skill just says “use this tool” and MCP handles the protocol.
  3. A Gateway that keeps everything running. Session state doesn’t vanish mid-task. If a step fails, the Gateway retries or rolls back. Permissions are enforced — the contract review Skill can’t access your email, and the email Skill can’t access the contract. A minimal sketch of that permission check follows this list.
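
That last guarantee is simple to sketch: a per-Skill tool allowlist, checked by the Gateway before every MCP call. Hypothetical names throughout:

```python
SKILL_PERMISSIONS = {
    "contract-review": {"pdf.parse", "docs.write"},
    "email-triage":    {"mail.read", "mail.draft"},
}

def authorize(skill: str, tool: str) -> None:
    """Gateway-side check, run before every MCP tool invocation."""
    if tool not in SKILL_PERMISSIONS.get(skill, set()):
        raise PermissionError(f"{skill} may not call {tool}")

authorize("email-triage", "mail.read")         # allowed
try:
    authorize("contract-review", "mail.read")  # cross-Skill access refused
except PermissionError as err:
    print(err)
```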

What doesn’t work (yet):

  1. Cross-app workflows with many steps. The pass rate drops significantly when tasks span 4+ applications. Context fragmentation is the biggest unsolved problem.
  2. Ambiguous intent. “Make this report better” isn’t enough. The Agent needs specific intent — “flag drops over 15%” is actionable, “make it look nice” is not. Vibe Working requires users to be clear about what “done” looks like.
  3. First-time setup. A Skill needs to learn your company’s conventions before it can replicate them. The first quarterly report takes effort to configure. The 20th one takes 3 minutes.


V. Why Current Solutions Fall Short

Microsoft’s Vibe Working features are impressive demos. But there are structural limitations in the current approach.

Copilot is locked to Microsoft’s ecosystem. Agent Mode works in Excel and Word. What if your data is in Google Sheets, your CRM is Salesforce, and your meeting recordings are in Otter.ai? You need something that orchestrates across vendors, not within one.

No persistent memory across sessions. Copilot doesn’t remember that last month’s report used a specific chart style, or that your legal team prefers a 3-tier risk scale. Every session starts from scratch. Skills solve this — the knowledge is in the Skill file, not in the session.

No security isolation. When Copilot processes your vendor contract, where does that data go? Through OpenAI’s API? Anthropic’s? Microsoft uses both — and here’s a detail buried in their own documentation: Anthropic models within Microsoft 365 Copilot experiences are explicitly out of scope for the EU Data Boundary. If you’re a European enterprise running Agent Mode, some of your data may be processed outside EU datacenters (specifically on AWS US). For sensitive documents, you need a runtime with clear data boundaries — a Gateway with sandboxing, not a chat window with cloud APIs.

The accuracy numbers are brutal. 57.2% on SpreadsheetBench for Excel-only tasks, and that is Microsoft’s own reported figure for its own Agent Mode. Academic work on spreadsheet reasoning (like SheetBrain, SheetAgent) shows that even purpose-built neuro-symbolic systems need explicit validation modules to avoid corrupting data. Raw model intelligence, no matter how impressive, is not production-ready for office automation without infrastructure.


VI. The Approach We’re Taking

AgentPuter’s Vibe Working stack has three layers — the same three we described in our previous post:

Skills define the playbook for each scenario. A Sales Reporting Skill is different from a Meeting Notes Skill is different from a Contract Review Skill. Each one encodes specific domain knowledge, step sequences, tool requirements, and output formats.

The Agent Gateway orchestrates execution. It loads the right Skill, routes MCP tool calls, manages session state, enforces permissions, and handles failures. The Gateway is the reason the system doesn’t fall apart on step 7 of a 12-step workflow.

MCP tools handle the actual connections — database queries, file I/O, email APIs, calendar lookups, PDF parsing. Standardized, tested, containerized.

What makes this different from Copilot? Three things:

  1. Vendor-neutral. Our Gateway orchestrates across Google Workspace, Microsoft 365, Salesforce, Slack, Notion — wherever your data actually lives. Not locked to one ecosystem.
  2. Persistent knowledge. Skills remember your conventions across sessions. The 20th quarterly report is as fast as the 2nd, because the Skill already knows your format, your metrics, your audience.
  3. Security-first runtime. Every Skill executes in a sandboxed environment. Contract data doesn’t touch the email Skill’s context. Session data is ephemeral unless explicitly persisted. Audit logs for every step.


Closing Thoughts

“Vibe Working” is a good name for what’s coming. The idea that you describe what you want and an Agent delivers the finished artifact — that’s the end state everyone is building toward.

But the honest truth is: we’re not there yet. The gap between the demo and the daily driver is real. ~50% pass rates on office workflows tell you that raw model intelligence isn’t enough.

What closes the gap isn’t a better model. It’s the infrastructure around the model:

  • Skills that constrain the Agent to proven workflows instead of letting it improvise
  • A Gateway that keeps multi-step tasks on track, with retries, rollbacks, and access control
  • MCP tools that provide tested, reliable integrations instead of asking the Agent to figure out APIs on its own

Over the past four posts, we went from dissecting one viral open-source project to building up a full picture of what Agent infrastructure actually requires.

Here’s the part that should bother everyone building in this space: Morgan Stanley’s analysts save 1.5 hours a day with AI, yet the best general-purpose Agent still fails half of all multi-app office tasks. The ROI is already real — inside single apps, with human supervision. The moment you remove the human or cross app boundaries, things break.

The punchline is simple: the Agent that does your quarterly report isn’t smarter than ChatGPT. It just has better instructions, a reliable runtime, and the right tools plugged in. The 7,137 workers in that NBER study didn’t need a smarter model. They needed better infrastructure around the model they already had.

That’s Vibe Working. Not vibes. Infrastructure.


This is the fourth post in our series on Agent infrastructure. We’ve gone from OpenClaw → architecture → the Skills + Gateway + MCP capability stack → and now what it looks like in practice. Next, we’ll turn to the business model: how do you actually monetize an Agent platform? If you have an office workflow you’ve tried — and failed — to automate with AI, we’d love to hear about it.