
How to Build a Production-Ready LLM Tool in Python

There's a version of this tutorial that gets you to a working ChatGPT wrapper in 15 minutes. This isn't that tutorial. What I want to show you is the version you'd actually ship to users — the one that handles errors gracefully, produces consistent structured output, manages costs, and doesn't silently return nonsense when the model has a bad day. I've built more than a dozen LLM-powered tools in production. Here's what I actually use.

The gap between demo and production

The demo version of every LLM tool is easy. You call the API, you get text back, you display it. Four lines of Python, a Streamlit interface, and you've got something impressive to show in a meeting.

The production version is where things get real. Users will give it inputs you never anticipated. The model will occasionally return malformed JSON when you asked for JSON. The API will rate-limit you at the worst possible moment. And every call costs money, so a runaway prompt or a user who refreshes 200 times in an hour becomes a billing event that hurts.

Good LLM tooling is mostly not about the LLM. It's about the infrastructure around it — the input validation, the output parsing, the retry logic, the cost controls, the logging. The model is a component. Building the system around it is the actual work.

Setting up your environment properly

Start with a clean virtual environment. I use uv now for dependency management — it's dramatically faster than pip and handles lockfiles properly. If you're still using bare pip with a requirements.txt, this is a good time to upgrade.

bash
# Create environment and install dependencies
uv venv .venv
source .venv/bin/activate

uv pip install anthropic openai pydantic tenacity python-dotenv structlog

Those packages matter. pydantic for validating and parsing structured output. tenacity for retry logic with exponential backoff. structlog for structured logging that makes debugging production issues tractable. Don't skip any of them.

Store your API keys in a .env file and load them with python-dotenv. Never hardcode credentials, never commit them, and never let them leak into logs, shell history, or error messages. This is not optional.

python — config.py
from dotenv import load_dotenv
import os

load_dotenv()

ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

if not ANTHROPIC_API_KEY:
    raise ValueError("ANTHROPIC_API_KEY not set in environment")

Structured output — the most important thing to get right

If your tool does anything beyond raw text generation, you need structured output. "Parse the JSON the model returns" sounds simple until the model returns a JSON block wrapped in markdown fences, or adds a friendly explanation before the JSON, or uses slightly different field names than you specified, or decides to nest things differently.

The right approach is to define a Pydantic model for every output type and validate every response against it. If validation fails, you retry. If it fails again, you log the failure and surface it — you don't silently return garbage.

python — models.py
from pydantic import BaseModel, Field
from typing import List, Optional

class SEOAnalysis(BaseModel):
    title_tag: str = Field(description="The page title tag")
    title_length: int = Field(description="Character count of title")
    issues: List[str] = Field(description="List of identified issues")
    recommendations: List[str] = Field(description="Actionable recommendations")
    priority_score: int = Field(ge=1, le=10, description="Priority from 1-10")
    summary: str = Field(description="One-paragraph summary")

class LLMResponse(BaseModel):
    success: bool
    data: Optional[SEOAnalysis] = None
    error: Optional[str] = None
    tokens_used: int = 0
    cost_usd: float = 0.0
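Even with "Return ONLY the JSON" in the prompt, the model will occasionally wrap its answer in markdown fences or prepend an explanation. A small extraction helper (a defensive sketch, not part of any SDK) makes parsing tolerant of both before Pydantic validation runs:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Strip markdown fences and surrounding chatter, then parse JSON.

    Defensive parsing: the model sometimes wraps JSON in ```json fences
    or prepends a friendly explanation despite instructions not to.
    """
    # Prefer a fenced block if one exists
    fence = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    if fence:
        return json.loads(fence.group(1))
    # Otherwise parse from the first opening brace to the last closing one
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise json.JSONDecodeError("no JSON object found", raw, 0)
    return json.loads(raw[start:end + 1])
```

Swap `extract_json(result["content"])` in for the bare `json.loads` call and the two most common failure shapes stop being failures at all.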

Building a robust LLM client

Don't call the API directly everywhere in your codebase. Wrap it in a client class that handles retries, logging, cost tracking, and output parsing in one place. Every other part of your application calls this client — which means when you need to change models, add a new provider, or change your retry strategy, you change it in one place.

python — llm_client.py
import json
import structlog
from anthropic import Anthropic
from tenacity import retry, stop_after_attempt, wait_exponential
from pydantic import ValidationError
from models import LLMResponse, SEOAnalysis
from config import ANTHROPIC_API_KEY

log = structlog.get_logger()

class LLMClient:
    def __init__(self):
        self.client = Anthropic(api_key=ANTHROPIC_API_KEY)
        self.model = "claude-3-5-sonnet-20241022"
        # Cost per million tokens (update as pricing changes)
        self.input_cost_per_mtok = 3.0
        self.output_cost_per_mtok = 15.0

    def _calculate_cost(self, input_tokens: int, output_tokens: int) -> float:
        return (
            (input_tokens / 1_000_000) * self.input_cost_per_mtok +
            (output_tokens / 1_000_000) * self.output_cost_per_mtok
        )

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10)
    )
    def _call_api(self, system: str, user: str) -> dict:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": user}]
        )
        return {
            "content": response.content[0].text,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens
        }

    def analyse_seo(self, page_content: str) -> LLMResponse:
        system = """You are a technical SEO expert. Analyse the provided page content
and return a JSON object with exactly these fields:
- title_tag (string)
- title_length (integer)
- issues (array of strings)
- recommendations (array of strings)
- priority_score (integer 1-10)
- summary (string)

Return ONLY the JSON object. No explanation, no markdown fences."""

        user = f"Analyse this page content for SEO issues:\n\n{page_content}"

        try:
            result = self._call_api(system, user)
            parsed = json.loads(result["content"])
            analysis = SEOAnalysis(**parsed)
            cost = self._calculate_cost(
                result["input_tokens"],
                result["output_tokens"]
            )
            log.info("seo_analysis_complete", cost_usd=cost)
            return LLMResponse(
                success=True,
                data=analysis,
                tokens_used=result["input_tokens"] + result["output_tokens"],
                cost_usd=cost
            )
        except (json.JSONDecodeError, ValidationError) as e:
            log.error("parse_error", error=str(e))
            return LLMResponse(success=False, error=str(e))
        except Exception as e:
            # Catches exhausted retries (tenacity raises RetryError) and other
            # API failures, so callers always get an LLMResponse back
            log.error("api_error", error=str(e))
            return LLMResponse(success=False, error=str(e))

The retry decorator from tenacity handles rate limits and transient API errors automatically. The exponential backoff (roughly 2s before the second attempt, then 4s before the third) prevents hammering the API when it's struggling. Always use this pattern in production.

Writing prompts that produce consistent output

Underspecified prompts are the number one cause of production LLM tools failing quietly. The model produces something different depending on subtle phrasing changes, input length, or a slightly different model version. You need prompts that constrain the output tightly enough to be testable.

The four things every production system prompt needs

  • A clear role definition. Tell the model exactly what it is and what kind of reasoning it should apply. "You are a technical SEO expert" is better than "You are a helpful assistant." Specificity reduces variance.
  • Explicit output format instructions. Don't say "return JSON". Say "return a JSON object with exactly these fields: [field list]. Return ONLY the JSON object. No explanation. No markdown code fences." The more explicit, the more consistent.
  • Constraint boundaries. If priority_score should be 1–10, say so in the prompt. If summary should be under 200 words, say so. Unconstrained fields produce unpredictably long or short output.
  • A failure mode instruction. Tell the model what to do if the input is invalid, unclear, or insufficient: "If you cannot complete the analysis due to insufficient content, return the JSON with an empty issues array and a summary explaining what was missing." This prevents creative responses to bad input.
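Put together, the four elements make a system prompt you can assemble from labeled parts and reuse. A sketch (the exact wording is illustrative; tune it against your eval set):

```python
# Assembling the four elements into one system prompt.
ROLE = "You are a technical SEO expert."  # 1. clear role definition

OUTPUT_FORMAT = (  # 2. explicit output format instructions
    "Return a JSON object with exactly these fields: title_tag (string), "
    "title_length (integer), issues (array of strings), recommendations "
    "(array of strings), priority_score (integer), summary (string). "
    "Return ONLY the JSON object. No explanation. No markdown code fences."
)

CONSTRAINTS = (  # 3. constraint boundaries
    "priority_score must be an integer from 1 to 10. "
    "summary must be under 200 words."
)

FAILURE_MODE = (  # 4. failure mode instruction
    "If you cannot complete the analysis due to insufficient content, "
    "return the JSON with an empty issues array and a summary explaining "
    "what was missing."
)

SYSTEM_PROMPT = "\n\n".join([ROLE, OUTPUT_FORMAT, CONSTRAINTS, FAILURE_MODE])
```

Keeping the parts separate also means a prompt change in one element shows up as a clean diff in version control.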

Cost management in production

A tool that costs $0.001 per use sounds cheap until you have 10,000 users and someone discovers that pasting a 50,000-word document into your input field doesn't throw an error. Cost management is not optional — it's infrastructure.

Always truncate inputs before they reach the API. Set a maximum character count that matches your expected use case, and slice the input before building your message. Add a middleware layer that tracks spend per user per day if you're building a multi-user tool. Log every token count on every call so you can actually see your cost breakdown later.

python — cost guard example
MAX_INPUT_CHARS = 8000  # roughly 2000 tokens
DAILY_COST_LIMIT_USD = 5.0

def safe_truncate(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n\n[Content truncated for analysis]"

def check_daily_budget(user_id: str, projected_cost: float) -> bool:
    todays_spend = get_user_daily_spend(user_id)  # your tracking logic
    return (todays_spend + projected_cost) < DAILY_COST_LIMIT_USD

Testing LLM tools properly

You can't unit test a stochastic system the same way you test deterministic code. But you can test the parts around the model — and you should.

Test your Pydantic validation logic with fixed JSON inputs. Test your cost calculation function with known token counts. Test your retry logic with a mocked client that raises exceptions. Test your truncation function. Test your error handling paths.
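For instance, the Pydantic constraints are cheap to test with fixed payloads. A sketch (SEOAnalysis is redefined inline here so the example is self-contained; in your test suite you'd import it from models.py):

```python
# test_models.py — validation tests against fixed payloads.
from typing import List
from pydantic import BaseModel, Field, ValidationError

class SEOAnalysis(BaseModel):  # inline copy of the model from models.py
    title_tag: str
    title_length: int
    issues: List[str]
    recommendations: List[str]
    priority_score: int = Field(ge=1, le=10)
    summary: str

def test_rejects_out_of_range_priority():
    try:
        SEOAnalysis(
            title_tag="t", title_length=1, issues=[], recommendations=[],
            priority_score=99, summary="s",  # 99 violates le=10
        )
        assert False, "expected ValidationError"
    except ValidationError:
        pass

def test_accepts_valid_payload():
    payload = {
        "title_tag": "Home", "title_length": 4, "issues": [],
        "recommendations": ["add meta description"],
        "priority_score": 3, "summary": "Mostly fine.",
    }
    assert SEOAnalysis(**payload).priority_score == 3

test_rejects_out_of_range_priority()
test_accepts_valid_payload()
```

None of these tests touch the network, so they run in milliseconds on every commit.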

For the model output itself, build an evaluation set: a collection of known inputs paired with expected output structures (not exact outputs — structures). Run this eval set after every prompt change. If your pass rate drops below your baseline, your prompt change made things worse. This is a lightweight but effective way to catch prompt regressions before they hit production.
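The harness itself can be a few lines. A sketch, with a stubbed analyse function standing in for your real call-parse-validate pipeline (which would return True only when the output validates):

```python
# Minimal structural-eval harness: fraction of inputs whose output
# parsed and validated. `analyse` is a stand-in for your real pipeline.
from typing import Callable

EVAL_SET = [
    "<html><title>Short title</title>...</html>",
    "<html><title></title>...</html>",
]

def run_eval(analyse: Callable[[str], bool]) -> float:
    """Return the pass rate across the eval set."""
    passes = sum(1 for page in EVAL_SET if analyse(page))
    return passes / len(EVAL_SET)

# Stub that "validates" everything, for illustration only
baseline = run_eval(lambda page: True)
print(f"pass rate: {baseline:.0%}")  # compare against your recorded baseline
```

Record the pass rate alongside each prompt version and you have a regression gate that costs a handful of API calls to run.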

Use pytest-recording or VCR.py to record real API responses during development and replay them in CI tests. This means your test suite doesn't make real API calls, but it tests against real response formats.

Deploying it

The simplest production deployment for a Python LLM tool in 2025 is FastAPI on a Fly.io machine or a Render service. If you're building something that needs to scale, Railway or a serverless function on Vercel (with the Python runtime) works well. If you're building an internal tool, don't over-engineer it — a Streamlit app behind a simple auth layer deployed to a small VPS handles more traffic than most internal tools will ever see.

What you always need in production, regardless of hosting: a health check endpoint, structured logging to somewhere you can query (Logtail, Axiom, or even a simple SQLite table for small tools), and an error alerting mechanism so you know when things break before your users tell you.

Build the infrastructure, then build the features

The temptation is always to get the interesting AI part working first and add the infrastructure later. Every time I've done that, "later" becomes "when something breaks in production and I'm debugging it at midnight."

Get your retry logic, your structured output parsing, your cost tracking, and your logging in place before you start building features. The boring scaffolding is what makes the interesting part reliable.

If you want to see this pattern in practice, check out my free AI prompt generator tool — it's built on exactly this architecture. And if you need a custom AI tool built for your team or product, take a look at my AI automation service. I build this kind of infrastructure professionally, and the result is tools that stay working rather than tools that looked good in the demo.

Need a production LLM tool built properly?

I build custom AI tools that actually work in production — with proper error handling, structured output, cost management, and the infrastructure to support real users.

See my AI service →

FAQs

What AI topics do you write about?

Production LLM use: prompts, evaluation, RAG architecture, cost/latency trade-offs, and tooling that helps teams ship safely—not slide-deck AI hype.

Do you recommend a specific model vendor?

Recommendations are task-dependent. Posts discuss interfaces and guardrails that port across providers; your contract, privacy, and latency needs should drive vendor choice.

How do you think about hallucinations?

Grounding, citations, retrieval quality, structured outputs, and human review for high-risk domains. No single trick eliminates risk; systems need measurement.

Are the free AI-related tools connected to the blog?

Yes. Prompt structuring, readability, and content brief helpers complement several articles in this category.

Can you help build an internal AI assistant?

That is a common engagement. Expect discovery on documents, access control, evaluation sets, and rollout—not a two-day chatbot demo without criteria.