May 9, 2026 · 10 min read

LangGraph in Production: What Nobody Tells You

LangGraph is excellent for building agentic workflows — until you hit the edge cases the documentation skips over. After running 12,400 production executions across three months, here's what actually matters.

State Size Is a Silent Killer

State is serialised and written to the checkpointer at every checkpoint. Large objects in state make checkpoints slow, replay painful, and memory usage unpredictable.

Wrong:

from typing import TypedDict

class AgentState(TypedDict):
    documents: list[dict]   # 50 full document objects, each 10KB
    embeddings: list[list[float]]  # 50 × 1536 floats

Right:

from typing import TypedDict

class AgentState(TypedDict):
    document_ids: list[str]   # keys only
    embedding_ids: list[str]  # references to vector store

Store large objects in S3/blob storage. Keep only their IDs in state. Checkpoint size drops from megabytes to kilobytes. Replay goes from 30s to under 1s.
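A minimal sketch of the reference pattern, under loud assumptions: `BLOB_STORE` is a hypothetical in-memory dict standing in for S3 or another blob store, and `put_document` / `get_document` are illustrative helper names, not LangGraph APIs. It only shows how much smaller the serialised state gets when it carries IDs instead of documents.

```python
import hashlib
import json
from typing import TypedDict

# Hypothetical stand-in for S3/blob storage; swap in a real client in production.
BLOB_STORE: dict[str, dict] = {}


class AgentState(TypedDict):
    document_ids: list[str]  # references only, never the documents themselves


def put_document(doc: dict) -> str:
    """Store a document outside the graph state and return a content-derived ID."""
    payload = json.dumps(doc, sort_keys=True).encode("utf-8")
    doc_id = hashlib.sha256(payload).hexdigest()[:16]
    BLOB_STORE[doc_id] = doc
    return doc_id


def get_document(doc_id: str) -> dict:
    """Fetch a document inside a node, only when that node needs it."""
    return BLOB_STORE[doc_id]


# 50 documents of roughly 10 KB each, as in the example above.
documents = [{"title": f"doc-{i}", "body": "x" * 10_000} for i in range(50)]

state: AgentState = {"document_ids": [put_document(d) for d in documents]}

heavy_size = len(json.dumps({"documents": documents}))
light_size = len(json.dumps(state))
print(f"documents in state: {heavy_size} bytes; IDs in state: {light_size} bytes")
```

Deriving the ID from the content is one possible choice: storing the same document twice yields the same key, so a replayed node does not create duplicates.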

Retry Logic Belongs in the Graph

When a node function swallows exceptions and retries internally, the graph can't see the failure, can't checkpoint progress between retries, and can't route differently after N failures.

Wrong — retry hidden inside a node:

import time

def call_llm(state):
    for attempt in range(3):
        try:
            return {"response": llm.invoke(state["prompt"])}
        except Exception:
            time.sleep(2 ** attempt)
    raise RuntimeError("LLM failed after 3 attempts")

Right — retry as a graph edge:

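def call_llm(state):
    attempts = state.get("attempts", 0) + 1
    try:
        result = llm.invoke(state["prompt"])
    except Exception as exc:
        # Record the failure in state so the router can see it; no sleep, no loop
        return {"error": str(exc), "attempts": attempts}
    return {"response": result, "error": None, "attempts": attempts}

def should_retry(state):
    if state.get("error") and state.get("attempts", 0) < 3:
        return "retry"
    return "continue"   # success, or retry budget exhausted

graph.add_conditional_edges("call_llm", should_retry, {
    "retry": "call_llm",
    "continue": "next_node",
})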

Now the graph checkpoints before each retry. After a crash mid-retry, execution resumes from the last checkpoint — not from the beginning.
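Routing differently after N failures can be sketched as a three-way router. This is an illustration, not a LangGraph convention: it assumes the node records `error` and `attempts` in state, and the branch labels `retry`, `fallback`, and `continue`, the `RetryState` type, and `MAX_ATTEMPTS` are all names chosen for this example.

```python
from typing import Optional, TypedDict

MAX_ATTEMPTS = 3  # assumed budget, matching the three attempts used above


class RetryState(TypedDict, total=False):
    response: Optional[str]
    error: Optional[str]
    attempts: int


def route_after_llm(state: RetryState) -> str:
    """Pick the next branch from what the node recorded in state."""
    if not state.get("error"):
        return "continue"   # the call succeeded
    if state.get("attempts", 0) < MAX_ATTEMPTS:
        return "retry"      # failed, but budget remains
    return "fallback"       # budget exhausted: hand off to a recovery node
```

Passed to add_conditional_edges, each label would map to a node, for example `fallback` to a hypothetical `handle_failure` node that logs the error or queues the run for review instead of continuing with an empty response.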

Checkpointing Is Non-Negotiable for Long Runs

Fig 1. Agent success rate before and after checkpointing. Success rate went from 67% to 94.2% after adding checkpointing + graph-level retry routing.
Fig 2. Retry distribution across 12,400 runs. 76% of runs complete on first attempt. 18% need 1 retry. 1% need 3+.

SqliteSaver or PostgresSaver turns your agent from all-or-nothing into resumable. After a crash mid-execution, the run resumes from the last successful node when you invoke the graph again with the same thread_id.

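import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver

# Recent langgraph-checkpoint-sqlite releases make from_conn_string a context
# manager, so build the saver from a connection directly.
conn = sqlite3.connect("./checkpoints.db", check_same_thread=False)
memory = SqliteSaver(conn)
graph = workflow.compile(checkpointer=memory)

config = {"configurable": {"thread_id": "run-abc123"}}

# First attempt: start the run under a known thread_id
result = graph.invoke(input_state, config=config)

# After a crash: pass None as input to resume this thread from its last checkpoint
result = graph.invoke(None, config=config)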

For long-running document processing agents this is not optional. Without checkpointing, a transient API error at step 8 of 10 restarts from step 1. With it, it resumes from step 8.
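Resuming only works if the restarted process can find the thread_id of the run that crashed. One way to guarantee that, sketched here as an assumption rather than anything LangGraph prescribes, is to derive the thread_id deterministically from a job identifier. `job_id`, `thread_config`, and the namespace UUID are hypothetical names for this example.

```python
import uuid

# Arbitrary fixed namespace for this sketch; any constant UUID works.
THREAD_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def thread_config(job_id: str) -> dict:
    """Build a config whose thread_id is a deterministic function of job_id."""
    thread_id = str(uuid.uuid5(THREAD_NAMESPACE, job_id))
    return {"configurable": {"thread_id": thread_id}}
```

A worker that picks up the same job after a restart then builds the same config without needing to persist the thread_id anywhere else.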

Production Metrics

After 12,400 runs with checkpointing and graph-level retry routing:

  • 76% complete without retrying
  • 18% need 1 retry
  • 5% need 2 retries
  • 1% need 3 or more

Overall success rate: 94.2%, up from 67% before adding checkpointing and graph-level retry routing. The 5.8% that fail are genuine failures (bad inputs, downstream outages), not transient errors eating runs.
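As a quick sanity check on the retry distribution, the percentages can be turned into run counts and a lower bound on retries per run. The only assumption is that the "3 or more" bucket is counted as exactly 3, which is why the result is a minimum rather than an exact figure.

```python
TOTAL_RUNS = 12_400

# Percentage of runs by number of retries; the last bucket is "3 or more",
# so counting it as exactly 3 gives a lower bound.
retry_share_pct = {0: 76, 1: 18, 2: 5, 3: 1}

runs_per_bucket = {r: TOTAL_RUNS * pct // 100 for r, pct in retry_share_pct.items()}
min_total_retries = sum(r * n for r, n in runs_per_bucket.items())
min_retries_per_run = min_total_retries / TOTAL_RUNS

print(runs_per_bucket)      # {0: 9424, 1: 2232, 2: 620, 3: 124}
print(min_total_retries)    # 3844
```

That works out to at least 0.31 retries per run on average, so retry routing adds roughly a third of an extra LLM call per execution.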

The pattern that works: small state, retries as edges, checkpoints always on.

Tanmay Bohra
Full Stack Engineer at Grant Thornton Bharat. Building high-concurrency systems in Go and TypeScript.