
Working Efficiently with AI Coding Assistants: A Practical Playbook

A tactical playbook for getting real leverage out of AI coding assistants — where they help, where they quietly hurt, and how to ship code you'd actually sign your name to.

Pangaea Engineering · 10 min read

A junior engineer can now produce four hundred lines of plausible-looking code in ninety seconds. That is the whole problem in one sentence. The constraint on shipping software was never typing speed; it was understanding, judgment, and the slow accumulation of correctness. AI assistants are spectacularly good at removing the one bottleneck that didn’t matter much and leaving every other one exactly where it was or, if you’re careless, making the rest worse.

Most teams we talk to are using these tools at maybe a quarter of their potential. They either treat the assistant as a magic oracle and paste its output straight into production, or they distrust it so thoroughly that they use it as a fancier autocomplete and miss the genuine leverage. Both camps are leaving enormous value on the table. This is a field guide for the middle path: how to extract real work from these tools without quietly degrading your codebase one accepted suggestion at a time.

Where they help, and where they quietly hurt

The honest version of the value proposition is narrower and more useful than the marketing. Assistants are excellent at work that is tedious but well-specified: writing the third CRUD endpoint that looks like the first two, translating a data structure from one shape to another, scaffolding tests for a pure function, wiring up a config parser, doing the mechanical 80% of a refactor once you’ve decided on the target. They are very good at things you already know how to do but don’t want to type. They are a genuine accelerant for working in an unfamiliar but well-documented framework, where they collapse an hour of doc-spelunking into a few minutes.

The danger is that they are equally fluent at the things they’re bad at. The model does not get visibly nervous when it’s wrong. It generates a confident, well-formatted, plausible solution to a concurrency problem with a race condition you won’t see until it’s 2 a.m. and production is melting. It will invent an API method that should exist but doesn’t. It will write security-adjacent code — auth checks, input validation, crypto usage — that looks textbook and contains a subtle, exploitable gap.

The failure mode of a bad engineer is code that looks wrong. The failure mode of an AI assistant is code that looks right. The second is far more expensive.

The mental model that works: treat the assistant like an extremely fast, widely read mid-level engineer with no memory of your system, no skin in the game, and a pathological inability to say “I’m not sure.” That framing tells you exactly when to lean in and when to keep your hands on the wheel.

Context is the entire game

The single biggest difference between teams getting 4x and teams getting 1.2x is not prompt cleverness. It’s context. The model can only reason about what you put in front of it, and its default assumption — generic, average-of-the-internet code — is almost never what your codebase wants.

Before you ask for anything non-trivial, give the model the same things you’d give a new hire on their first day:

  • The actual files it will touch and the ones it will call into — not your description of them.
  • The existing pattern to imitate. “We do error handling like this; here’s a representative module.”
  • The constraints that aren’t in the code: performance budgets, the library you’re standardizing on, the thing you tried last sprint that didn’t work.
  • The interface and the test, when you have them. A failing test is the clearest spec you can hand a machine.

A request like “add caching to the user service” yields generic slop. The same request with the service file, your existing Redis wrapper, the cache-key convention you use elsewhere, and a note that keys must be tenant-scoped yields something you can actually review and merge. The work you do assembling context is not overhead — it is the work. It is also, conveniently, the same thinking you’d have to do anyway to specify the task properly for yourself.
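To make the tenant-scoping constraint concrete, here is a minimal sketch of the kind of output the fuller request should steer toward. Everything in it is hypothetical: the `tenant:entity:id` key convention, the `UserService` shape, and the in-memory `Map` standing in for a Redis wrapper are our assumptions for illustration, not any real codebase's API.

```typescript
interface User {
  id: string;
  name: string;
}

// Hypothetical key convention: "<tenant>:<entity>:<id>".
// Each part is URI-encoded so an id containing ":" cannot collide with another key.
function cacheKey(tenantId: string, entity: string, id: string): string {
  return [tenantId, entity, id].map(encodeURIComponent).join(":");
}

class UserService {
  // Stand-in for a Redis wrapper; a real one would be async and set TTLs.
  private cache = new Map<string, User>();
  private loadUser: (tenantId: string, id: string) => User;
  loads = 0; // counts cache misses, only so the behavior is observable

  constructor(loadUser: (tenantId: string, id: string) => User) {
    this.loadUser = loadUser;
  }

  getUser(tenantId: string, id: string): User {
    const key = cacheKey(tenantId, "user", id);
    const hit = this.cache.get(key);
    if (hit !== undefined) return hit;

    this.loads += 1;
    const user = this.loadUser(tenantId, id);
    this.cache.set(key, user);
    return user;
  }
}
```

A version keyed on `id` alone would pass a single-tenant test and then serve one tenant's user to another in production. That is the kind of constraint the model cannot infer unless you state it.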

Specify interfaces and tests, not vibes

The quality of what comes back is bounded by the precision of what goes in. “Make this faster” is a wish. A good task specification reads like a small contract: here is the function signature, here are the inputs and their shapes, here is what correct output looks like, here are the edge cases that matter, here is what you may not change.

Compare these two ways of asking for the same thing. The first:

Write a function to parse the date range from the query params.

The second:

Implement parseDateRange(params: URLSearchParams): { start: Date; end: Date } | Error

Rules:
- Reads "from" and "to" params, both ISO-8601 date strings.
- If either is missing, default to the last 30 days (end = now).
- If "from" is after "to", return an Error, don't throw.
- Reject ranges longer than 366 days with an Error.
- No side effects; the only outside input is the current time, used for the default. No timezone library: assume UTC.

Here are three existing parsers in this file to match style: [paste]

The second version is barely more typing once it’s a habit, and it removes essentially all the ambiguity the model would otherwise resolve by guessing. Even better: write the test cases first and hand those over. Tests are an unambiguous specification that doubles as verification. When you can express the task as “make this test suite pass without changing the tests,” you’ve turned a fuzzy generative problem into a closed-loop one the model is genuinely good at.
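For reference, here is one way the second request could come back, written as a sketch rather than a definitive implementation. Three choices are ours and not part of the spec above: `now` is passed as an optional second argument so the default window is deterministic in tests, a missing `from` or `to` replaces the whole range with the 30-day default, and unparseable dates return an Error.

```typescript
type DateRange = { start: Date; end: Date };

const DAY_MS = 24 * 60 * 60 * 1000;

function parseDateRange(
  params: URLSearchParams,
  now: Date = new Date(), // assumption: injected clock, not in the original signature
): DateRange | Error {
  const from = params.get("from");
  const to = params.get("to");

  // Our reading of the spec: if either bound is missing, use the full default window.
  if (from === null || to === null) {
    return { start: new Date(now.getTime() - 30 * DAY_MS), end: new Date(now.getTime()) };
  }

  // Date-only ISO strings ("2024-01-31") parse as UTC midnight. Date-time strings
  // without an offset parse as local time, so callers should send "Z" or an offset.
  const start = new Date(from);
  const end = new Date(to);
  if (Number.isNaN(start.getTime()) || Number.isNaN(end.getTime())) {
    return new Error('"from" and "to" must be valid ISO-8601 dates');
  }
  if (start.getTime() > end.getTime()) {
    return new Error('"from" must not be after "to"');
  }
  if (end.getTime() - start.getTime() > 366 * DAY_MS) {
    return new Error("range must not exceed 366 days");
  }
  return { start, end };
}
```

Each branch maps to one rule in the request, which is what makes the result quick to review: you check the code against the list instead of guessing what the model assumed.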

The review loop, and why you read every line

Here is the rule that does not bend: you are responsible for every line you merge, regardless of who or what wrote it. The author field on the commit might as well be yours, because the pager certainly will be.

This sounds obvious and is routinely violated, because reviewing generated code feels different from reviewing a colleague’s PR. A colleague’s code carries an implicit guarantee that a thinking human believed it was correct. Generated code carries no such guarantee, but it’s so fluent that it borrows the feeling of one. You have to consciously override the instinct to skim. Read it as if a clever intern wrote it under deadline pressure and may have faked a part they didn’t understand — because functionally, that’s the situation.

Practical tactics that keep the loop tight:

  • Keep the diffs small. Ask for one logical change at a time. A 60-line diff gets read carefully; a 600-line diff gets skimmed and rubber-stamped, which is where bugs enter.
  • Make it explain the non-obvious parts. “Why did you use a lock here?” Either you learn something or the model reveals it was cargo-culting. Both are useful.
  • Run it before you trust it. Plausible and correct are different axes. The terminal is the tiebreaker.
  • Distrust confidence around boundaries. Null handling, empty collections, off-by-one, timezone math, integer overflow, the failure path of an external call. This is where fluent-but-wrong lives.
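As a small illustration of the boundary point, consider a page-count helper. The function names and the table of cases are ours, chosen for the example. The first version reads fine and is correct whenever the total is not an exact multiple of the page size, which is why it survives a skim.

```typescript
// Fluent but wrong: off by one at zero and at every exact multiple of pageSize.
function pageCountNaive(totalItems: number, pageSize: number): number {
  return Math.floor(totalItems / pageSize) + 1;
}

function pageCount(totalItems: number, pageSize: number): number {
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError("pageSize must be a positive integer");
  }
  if (!Number.isInteger(totalItems) || totalItems < 0) {
    throw new RangeError("totalItems must be a non-negative integer");
  }
  return Math.ceil(totalItems / pageSize);
}

// [totalItems, expected page count] for pageSize = 10, chosen to sit on the edges.
const boundaryCases: Array<[number, number]> = [
  [0, 0],
  [1, 1],
  [9, 1],
  [10, 1],
  [11, 2],
  [20, 2],
];

// Returns the totals for which an implementation disagrees with the table.
function failingInputs(fn: (total: number, size: number) => number): number[] {
  return boundaryCases
    .filter(([total, expected]) => fn(total, 10) !== expected)
    .map(([total]) => total);
}
```

Running `failingInputs(pageCountNaive)` flags exactly the edge values, while a test that only tried 1, 9, and 11 would have passed both versions.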

If reviewing the output carefully would take longer than writing it yourself, that’s a real signal — sometimes the right call is to write it yourself and stop fighting the tool.

Tests as the forcing function

Tests are where the AI workflow stops being a vibe and starts being engineering. They serve three roles at once: a precise specification you hand the model, an automated verifier that catches the confident-but-wrong output, and a regression net for when you ask for the next change and the model “helpfully” refactors something it shouldn’t have touched.

The discipline that pays off: have the model propose tests, but you own the assertions. Models love to write tests that pass by construction — they’ll assert that the function returns whatever the function happens to return, which verifies nothing. Read the test for whether it would actually fail if the code were wrong. A test that can’t fail is worse than no test, because it manufactures false confidence. When a generated test looks too green, delete a line of the implementation and confirm the test goes red. If it doesn’t, the test is theater.
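Here is a sketch of the difference, with no test framework so it stays self-contained. `applyDiscount` and the deliberately broken variant are made-up examples; the broken variant plays the role of "delete a line of the implementation."

```typescript
type DiscountFn = (priceCents: number, percent: number) => number;

const applyDiscount: DiscountFn = (priceCents, percent) =>
  Math.round(priceCents * (1 - percent / 100));

// Deliberately broken: ignores the discount entirely.
const brokenApplyDiscount: DiscountFn = (priceCents, _percent) => priceCents;

// Passes by construction: the "expected" value comes from the code under test,
// so this can never go red no matter what the implementation does.
function tautologicalTest(impl: DiscountFn): boolean {
  const expected = impl(1000, 10);
  return impl(1000, 10) === expected;
}

// Asserts against values worked out by hand, independent of the implementation.
function realTest(impl: DiscountFn): boolean {
  return (
    impl(1000, 10) === 900 &&
    impl(999, 50) === 500 && // 499.5 rounds up
    impl(1000, 0) === 1000
  );
}
```

Both tests are green against `applyDiscount`. Against `brokenApplyDiscount`, only `realTest` goes red, which is the whole check: a test earns its place by failing when the code is wrong.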

Knowing when to turn it off

Senior judgment shows up as much in declining the tool as in using it. Reach for the off switch in these areas:

  • Concurrency, ordering, and distributed state. Race conditions, lock ordering, idempotency, exactly-once delivery. The model pattern-matches to code that looks correct in the common case and silently omits the interleaving that bites you. Reason about these yourself.
  • Security-critical paths. Authentication, authorization, session handling, crypto, anything parsing untrusted input. Plausible-looking is not a standard you can accept here.
  • Genuinely novel domains. If you’re doing something the training data has little of — a new protocol, a bespoke numerical method, an unusual hardware constraint — the model’s confident average is confidently wrong.
  • Code where being subtly wrong is catastrophic and hard to detect. Money, medical, anything where the bug ships silently and surfaces as a lawsuit.

None of this means “don’t use it at all” in these areas — it can still draft, explain, or rubber-duck. It means the bar for what you accept goes way up, and the assistant moves from author to research aide.
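The concurrency case is easiest to see in miniature. In the sketch below, the account classes, `sleep`, and the promise-chain queue are illustrative names of ours, and `sleep` stands in for any awaited I/O between the check and the write. The first class is the version that looks correct when calls arrive one at a time; the second serializes withdrawals so the check and the update cannot interleave.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

class NaiveAccount {
  balance = 100;

  async withdraw(amount: number): Promise<boolean> {
    if (this.balance < amount) return false; // check
    await sleep(1); // other calls can run while this one waits
    this.balance -= amount; // act on a check that may now be stale
    return true;
  }
}

class SerializedAccount {
  balance = 100;
  private tail: Promise<unknown> = Promise.resolve();

  withdraw(amount: number): Promise<boolean> {
    // Each withdrawal starts only after the previous one has fully finished.
    const result = this.tail.then(async () => {
      if (this.balance < amount) return false;
      await sleep(1);
      this.balance -= amount;
      return true;
    });
    // Keep the chain usable even if one withdrawal rejects.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

async function demo(): Promise<{ naive: number; serialized: number }> {
  const naive = new NaiveAccount();
  await Promise.all([naive.withdraw(80), naive.withdraw(80)]);

  const serialized = new SerializedAccount();
  await Promise.all([serialized.withdraw(80), serialized.withdraw(80)]);

  return { naive: naive.balance, serialized: serialized.balance };
}
```

With two concurrent withdrawals of 80, the naive account ends at -60 because both calls pass the check before either writes; the serialized one ends at 20. The queue only helps inside a single process. Across processes you would need a database transaction or an atomic operation, and that is the kind of reasoning to do yourself.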

Measuring leverage, not activity

The metric that matters is not lines generated, prompts sent, or suggestions accepted. Those measure activity, and activity is cheap. A team can quadruple its keystroke output and ship slower, because the bottleneck moved downstream into review, debugging, and the cleanup of subtle defects.

Measure the things you cared about before AI existed: cycle time from idea to merged-and-deployed, defect escape rate, the share of PRs that sail through review versus the ones that bounce. If your generated-code PRs bounce more, or your incident rate ticks up, you are converting typing speed into rework — negative leverage dressed up as productivity. Real leverage looks like the same small team shipping more correct software per week, with review staying boring. If review is getting scarier, you’ve optimized the wrong variable.
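A rough sketch of that comparison follows. The `PullRequest` record and its field names are hypothetical; in practice the data would come from whatever your version control and deploy tooling already record, and how a PR gets tagged as AI-assisted is a process decision this sketch does not settle.

```typescript
interface PullRequest {
  openedAt: Date;
  deployedAt: Date;
  reviewRounds: number; // 1 means approved on the first pass
  aiAssisted: boolean;
  causedIncident: boolean;
}

function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function summarize(prs: PullRequest[]) {
  const cycleHours = prs.map(
    (pr) => (pr.deployedAt.getTime() - pr.openedAt.getTime()) / 3_600_000,
  );
  const bounced = prs.filter((pr) => pr.reviewRounds > 1).length;
  const escaped = prs.filter((pr) => pr.causedIncident).length;
  return {
    count: prs.length,
    medianCycleHours: median(cycleHours),
    bounceRate: prs.length > 0 ? bounced / prs.length : NaN,
    defectEscapeRate: prs.length > 0 ? escaped / prs.length : NaN,
  };
}

// Side-by-side view: the same outcome metrics for both groups of PRs.
function compareByOrigin(prs: PullRequest[]) {
  return {
    aiAssisted: summarize(prs.filter((pr) => pr.aiAssisted)),
    other: summarize(prs.filter((pr) => !pr.aiAssisted)),
  };
}
```

Nothing here counts lines or prompts. If the AI-assisted group shows a shorter median cycle time but a higher bounce or escape rate, that is the rework signal described above.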

What this means for you

The teams winning with these tools aren’t the ones with the cleverest prompts. They’re the ones who kept their engineering standards exactly where they were and used the assistant to hit those standards faster. The discipline is the moat.

A checklist we’d actually use:

  • Specify before you generate. Interface, inputs, edge cases, constraints, a pattern to match. If you can’t write the spec, you’re not ready to ask.
  • Front-load context. The real files, the existing patterns, the unwritten constraints. The model knows nothing you didn’t tell it.
  • Prefer tests as the spec. A failing test you own is the best prompt there is.
  • Read every line like an intern wrote it on a deadline. Keep diffs small enough that you actually do.
  • Run it, then trust it. Plausible and correct are different things.
  • Turn it off for concurrency, security, and novel domains. Author becomes aide.
  • Measure cycle time and defects, not keystrokes. If review is getting scarier, stop.

Used this way, an AI assistant is one of the largest leverage increases available to a small senior team — which is precisely why we care about using it well. Used the other way, it’s a very efficient machine for generating technical debt that looks like progress. The tool doesn’t decide which one you get. You do.

Tags: #ai #productivity #workflow #tooling
