The gap between demo and production
The demo version of every LLM tool is easy. You call the API, you get text back, you display it. Four lines of Python, a Streamlit interface, and you've got something impressive to show in a meeting.
The production version is where things get real. Users will give it inputs you never anticipated. The model will occasionally return malformed JSON when you asked for JSON. The API will rate-limit you at the worst possible moment. And every call costs money, so a runaway prompt or a user who refreshes 200 times in an hour becomes a billing event that hurts.
Good LLM tooling is mostly not about the LLM. It's about the infrastructure around it — the input validation, the output parsing, the retry logic, the cost controls, the logging. The model is a component. Building the system around it is the actual work.
Setting up your environment properly
Start with a clean virtual environment. I use uv now for dependency management —
it's dramatically faster than pip and handles lockfiles properly.
If you're still using bare pip with a requirements.txt, this is a good time to upgrade.
```bash
# Create environment and install dependencies
uv venv .venv
source .venv/bin/activate
uv pip install anthropic openai pydantic tenacity python-dotenv structlog
```
Those packages matter. pydantic for validating and parsing structured output.
tenacity for retry logic with exponential backoff.
structlog for structured logging that makes debugging production issues tractable.
Don't skip any of them.
Store your API keys in a .env file and load them with python-dotenv.
Never hardcode credentials, never commit them, never pass them as environment variables in a way that gets logged somewhere.
This is not optional.
```python
from dotenv import load_dotenv
import os

load_dotenv()

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

if not ANTHROPIC_API_KEY:
    raise ValueError("ANTHROPIC_API_KEY not set in environment")
```
Structured output — the most important thing to get right
If your tool does anything beyond raw text generation, you need structured output. "Parse the JSON the model returns" sounds simple until the model returns a JSON block wrapped in markdown fences, or adds a friendly explanation before the JSON, or uses slightly different field names than you specified, or decides to nest things differently.
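A defensive first step is to extract the JSON from whatever the model actually returned before you try to validate it. Here is a minimal sketch — the `extract_json` helper is illustrative, not part of any SDK — that strips markdown fences and surrounding prose:

```python
import json
import re

def extract_json(text: str) -> dict:
    """Parse a JSON object from model output, tolerating markdown
    fences and leading/trailing commentary."""
    # Strip markdown code fences if the model added them anyway
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Fall back to the outermost braces if there's surrounding prose
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start : end + 1]
    return json.loads(text)
```

This still raises `json.JSONDecodeError` when no valid JSON is present, which is what you want — a loud failure you can retry or log, not silent garbage.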
The right approach is to define a Pydantic model for every output type and validate every response against it. If validation fails, you retry. If it fails again, you log the failure and surface it — you don't silently return garbage.
```python
from pydantic import BaseModel, Field
from typing import List, Optional

class SEOAnalysis(BaseModel):
    title_tag: str = Field(description="The page title tag")
    title_length: int = Field(description="Character count of title")
    issues: List[str] = Field(description="List of identified issues")
    recommendations: List[str] = Field(description="Actionable recommendations")
    priority_score: int = Field(ge=1, le=10, description="Priority from 1-10")
    summary: str = Field(description="One-paragraph summary")

class LLMResponse(BaseModel):
    success: bool
    data: Optional[SEOAnalysis] = None
    error: Optional[str] = None
    tokens_used: int = 0
    cost_usd: float = 0.0
```
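The `ge`/`le` constraints mean out-of-range values raise immediately instead of slipping through. A quick illustration with a cut-down, hypothetical `Score` model:

```python
from pydantic import BaseModel, Field, ValidationError

class Score(BaseModel):
    priority_score: int = Field(ge=1, le=10)

print(Score(priority_score=7).priority_score)  # 7 -- within range, accepted

try:
    Score(priority_score=15)  # out of range -- rejected, not silently accepted
except ValidationError as e:
    print("validation failed:", e.errors()[0]["loc"])
```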
Building a robust LLM client
Don't call the API directly everywhere in your codebase. Wrap it in a client class that handles retries, logging, cost tracking, and output parsing in one place. Every other part of your application calls this client — which means when you need to change models, add a new provider, or change your retry strategy, you change it in one place.
```python
import json

import structlog
from anthropic import Anthropic
from pydantic import ValidationError
from tenacity import RetryError, retry, stop_after_attempt, wait_exponential

from config import ANTHROPIC_API_KEY
from models import LLMResponse, SEOAnalysis

log = structlog.get_logger()

class LLMClient:
    def __init__(self):
        self.client = Anthropic(api_key=ANTHROPIC_API_KEY)
        self.model = "claude-3-5-sonnet-20241022"
        # Cost per million tokens (update as pricing changes)
        self.input_cost_per_mtok = 3.0
        self.output_cost_per_mtok = 15.0

    def _calculate_cost(self, input_tokens: int, output_tokens: int) -> float:
        return (
            (input_tokens / 1_000_000) * self.input_cost_per_mtok
            + (output_tokens / 1_000_000) * self.output_cost_per_mtok
        )

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
    )
    def _call_api(self, system: str, user: str) -> dict:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        return {
            "content": response.content[0].text,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        }

    def analyse_seo(self, page_content: str) -> LLMResponse:
        system = """You are a technical SEO expert. Analyse the provided page content
and return a JSON object with exactly these fields:
- title_tag (string)
- title_length (integer)
- issues (array of strings)
- recommendations (array of strings)
- priority_score (integer 1-10)
- summary (string)

Return ONLY the JSON object. No explanation, no markdown fences."""
        user = f"Analyse this page content for SEO issues:\n\n{page_content}"

        try:
            result = self._call_api(system, user)
            parsed = json.loads(result["content"])
            analysis = SEOAnalysis(**parsed)
            cost = self._calculate_cost(
                result["input_tokens"],
                result["output_tokens"],
            )
            log.info("seo_analysis_complete", cost_usd=cost)
            return LLMResponse(
                success=True,
                data=analysis,
                tokens_used=result["input_tokens"] + result["output_tokens"],
                cost_usd=cost,
            )
        except RetryError as e:
            # All retry attempts exhausted -- surface the API failure
            log.error("api_error", error=str(e))
            return LLMResponse(success=False, error=str(e))
        except (json.JSONDecodeError, ValidationError) as e:
            log.error("parse_error", error=str(e))
            return LLMResponse(success=False, error=str(e))
```
The retry decorator from tenacity handles rate limits and transient API errors automatically. The exponential backoff (roughly 2s before the second attempt and 4s before the third, capped at 10s) prevents hammering the API when it's struggling. Note that once all attempts are exhausted, tenacity raises a `RetryError`, so catch it rather than letting it propagate. Always use this pattern in production.
Writing prompts that produce consistent output
Inconsistent prompts are the number one cause of production LLM tools failing quietly. The model produces something different depending on subtle phrasing changes, input length, or whether you're using a slightly different model version. You need prompts that are deterministic enough to be testable.
The four things every production system prompt needs
- A clear role definition. Tell the model exactly what it is and what kind of reasoning it should apply. "You are a technical SEO expert" is better than "You are a helpful assistant." Specificity reduces variance.
- Explicit output format instructions. Don't say "return JSON". Say "return a JSON object with exactly these fields: [field list]. Return ONLY the JSON object. No explanation. No markdown code fences." The more explicit, the more consistent.
- Constraint boundaries. If priority_score should be 1–10, say so in the prompt. If summary should be under 200 words, say so. Unconstrained fields produce unpredictably long or short output.
- A failure mode instruction. Tell the model what to do if the input is invalid, unclear, or insufficient: "If you cannot complete the analysis due to insufficient content, return the JSON with an empty issues array and a summary explaining what was missing." This prevents creative responses to bad input.
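Put together, a system prompt covering all four elements might look like this. The wording is illustrative — adapt it to your own schema:

```python
# The four elements, in order: role, output format, constraints, failure mode
SYSTEM_PROMPT = (
    "You are a technical SEO expert.\n\n"                                    # role
    "Analyse the provided page content and return a JSON object with "
    "exactly these fields: title_tag, title_length, issues, "
    "recommendations, priority_score, summary. "
    "Return ONLY the JSON object. No explanation, no markdown fences.\n\n"   # format
    "priority_score must be an integer from 1 to 10. "
    "summary must be under 200 words.\n\n"                                   # constraints
    "If you cannot complete the analysis due to insufficient content, "
    "return the JSON with an empty issues array and a summary explaining "
    "what was missing."                                                      # failure mode
)
```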
Cost management in production
A tool that costs $0.001 per use sounds cheap until you have 10,000 users and someone discovers that pasting a 50,000-word document into your input field doesn't throw an error. Cost management is not optional — it's infrastructure.
Always truncate inputs before they reach the API. Set a maximum character count that matches your expected use case, and slice the input before building your message. Add a middleware layer that tracks spend per user per day if you're building a multi-user tool. Log every token count on every call so you can actually see your cost breakdown later.
```python
MAX_INPUT_CHARS = 8000  # roughly 2000 tokens
DAILY_COST_LIMIT_USD = 5.0

def safe_truncate(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n\n[Content truncated for analysis]"

def check_daily_budget(user_id: str, projected_cost: float) -> bool:
    todays_spend = get_user_daily_spend(user_id)  # your tracking logic
    return (todays_spend + projected_cost) < DAILY_COST_LIMIT_USD
```
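The `get_user_daily_spend` call above is left as "your tracking logic"; an in-memory sketch of it might look like this. (A real multi-user tool would back this with a database — a dict resets on every deploy.)

```python
from collections import defaultdict
from datetime import date

# (user_id, day) -> USD spent; in-memory stand-in for a persistent store
_daily_spend: defaultdict = defaultdict(float)

def record_spend(user_id: str, cost_usd: float) -> None:
    _daily_spend[(user_id, date.today())] += cost_usd

def get_user_daily_spend(user_id: str) -> float:
    return _daily_spend[(user_id, date.today())]

record_spend("user-1", 0.03)
record_spend("user-1", 0.02)
print(round(get_user_daily_spend("user-1"), 2))  # 0.05
```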
Testing LLM tools properly
You can't unit test a stochastic system the same way you test deterministic code. But you can test the parts around the model — and you should.
Test your Pydantic validation logic with fixed JSON inputs. Test your cost calculation function with known token counts. Test your retry logic with a mocked client that raises exceptions. Test your truncation function. Test your error handling paths.
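The truncation and cost functions are pure, so they test like any ordinary code. These example tests redefine the functions so the snippet is self-contained (in a real project you'd import them):

```python
def safe_truncate(text: str, max_chars: int = 8000) -> str:
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n\n[Content truncated for analysis]"

def calculate_cost(input_tokens: int, output_tokens: int,
                   in_per_mtok: float = 3.0, out_per_mtok: float = 15.0) -> float:
    return ((input_tokens / 1_000_000) * in_per_mtok
            + (output_tokens / 1_000_000) * out_per_mtok)

def test_truncate_is_noop_under_limit():
    assert safe_truncate("short") == "short"

def test_truncate_appends_marker():
    out = safe_truncate("x" * 10_000)
    assert out.endswith("[Content truncated for analysis]")

def test_cost_with_known_counts():
    # 1M input + 1M output tokens at $3/$15 per MTok = $18 exactly
    assert calculate_cost(1_000_000, 1_000_000) == 18.0

# Under pytest these run automatically; called directly here for illustration
test_truncate_is_noop_under_limit()
test_truncate_appends_marker()
test_cost_with_known_counts()
```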
For the model output itself, build an evaluation set: a collection of known inputs paired with expected output structures (not exact outputs — structures). Run this eval set after every prompt change. If your pass rate drops below your baseline, your prompt change made things worse. This is a lightweight but effective way to catch prompt regressions before they hit production.
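At its most minimal, an eval harness is just a pass-rate loop over structural checks. In the sketch below, `fake_generate` stands in for a real client call during development:

```python
import json

def run_eval(cases, generate) -> float:
    """Return the fraction of eval cases whose output parses as JSON
    and contains the expected fields."""
    passed = 0
    for case in cases:
        try:
            output = json.loads(generate(case["input"]))
            if all(field in output for field in case["required_fields"]):
                passed += 1
        except (json.JSONDecodeError, TypeError):
            pass  # unparseable output counts as a failure
    return passed / len(cases)

# Stand-in for a real model call during development
def fake_generate(_: str) -> str:
    return '{"issues": [], "summary": "ok"}'

cases = [{"input": "page text", "required_fields": ["issues", "summary"]}]
print(run_eval(cases, fake_generate))  # 1.0
```

Compare this number against your baseline after every prompt change; a drop means the change regressed something.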
Use pytest-recording or VCR.py to record real API responses during development and replay them in CI tests. This means your test suite doesn't make real API calls, but it tests against real response formats.
Deploying it
The simplest production deployment for a Python LLM tool in 2025 is FastAPI on a Fly.io machine or a Render service. If you're building something that needs to scale, Railway or a serverless function on Vercel (with the Python runtime) works well. If you're building an internal tool, don't over-engineer it — a Streamlit app behind a simple auth layer deployed to a small VPS handles more traffic than most internal tools will ever see.
What you always need in production, regardless of hosting: a health check endpoint, structured logging to somewhere you can query (Logtail, Axiom, or even a simple SQLite table for small tools), and an error alerting mechanism so you know when things break before your users tell you.
Build the infrastructure, then build the features
The temptation is always to get the interesting AI part working first and add the infrastructure later. Every time I've done that, "later" becomes "when something breaks in production and I'm debugging it at midnight."
Get your retry logic, your structured output parsing, your cost tracking, and your logging in place before you start building features. The boring scaffolding is what makes the interesting part reliable.
If you want to see this pattern in practice, check out my free AI prompt generator tool — it's built on exactly this architecture. And if you need a custom AI tool built for your team or product, take a look at my AI automation service. I build this kind of infrastructure professionally, and the result is tools that stay working rather than tools that looked good in the demo.