Contents
- Production AI Application Architecture
- Compute Infrastructure Layer
- Model and API Layer
- Application Framework
- Deployment and Orchestration
- Observability Layer
- Security and Compliance
- Cost Breakdown: Small Production System
- Cost Optimization Strategies
- FAQ
- Related Resources
- Sources
Production AI Application Architecture
Five layers: compute, models, application logic, deployment, and observability. Choose each one deliberately.
Two paths: API-first (less ops) or self-hosted (more control). The best approach for most teams is hybrid: APIs for core workloads, self-hosting for fine-tuning and cost control.
Compute Infrastructure Layer
Development and Testing: RunPod RTX 4090 at $0.34/hour. This provides sufficient compute for prototyping most applications. Spin up on-demand instances during development, tear down when finished.
Monthly cost estimate (40 development hours): $13.60
Production Inference: CoreWeave H100 with reserved capacity. For sub-5M monthly API calls, single H100 provides sufficient throughput. H100 generates approximately 200 tokens/second, handling 2,000+ queries per hour.
H100 reserved (1-year commitment): approximately $1,500/month
Utilization-based cost: approximately $0.05 per 1,000 tokens
Fine-Tuning Operations: Lambda at $2.86/hour (PCIe) or $3.78/hour (SXM). Reserve full hours only when actively training. A typical 70B fine-tuning run (using parameter-efficient methods such as LoRA) completes in 4-6 hours.
Fine-tuning cost estimate (8 fine-tuning runs monthly at $2.86/hr × 6 hours): $137/month
See RunPod GPU pricing, CoreWeave GPU pricing, and Lambda GPU pricing.
Model and API Layer
Primary LLM API: Together AI Llama 3.1 70B for cost-conscious deployment. Pricing: $0.88/1M input tokens, $1.06/1M output tokens.
For a chatbot processing 1M tokens daily (30M monthly):
- Input tokens (70% of total): 21M × $0.88/1M = $18.48
- Output tokens (30% of total): 9M × $1.06/1M = $9.54
- Monthly cost: $28.02
See Together AI pricing.
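The arithmetic above generalizes into a small helper. A sketch, assuming the Together AI rates quoted above and the same 70/30 input/output split:

```python
def monthly_api_cost(
    tokens_per_month: int,
    input_rate_per_m: float = 0.88,   # USD per 1M input tokens (Together AI, Llama 3.1 70B)
    output_rate_per_m: float = 1.06,  # USD per 1M output tokens
    input_share: float = 0.70,        # typical chatbot input/output split
) -> float:
    """Estimate monthly LLM API spend from total token volume."""
    input_tokens = tokens_per_month * input_share
    output_tokens = tokens_per_month * (1 - input_share)
    cost = (input_tokens * input_rate_per_m + output_tokens * output_rate_per_m) / 1_000_000
    return round(cost, 2)

# 30M tokens/month at the rates above
print(monthly_api_cost(30_000_000))  # → 28.02
```

Re-running this with your provider's rates is the fastest way to sanity-check a budget before committing.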
Specialized Tasks: OpenAI GPT-3.5 Turbo for requests that need stronger reasoning than the primary model. Use it sparingly, routing only complex requests through capability detection.
Estimated usage: 5M input and 5M output tokens monthly
- Input cost (at $0.50/1M): $2.50
- Output cost (at $1.50/1M): $7.50
- Monthly cost: $10
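The capability detection mentioned above can be sketched as a simple heuristic router. The `needs_strong_reasoning` heuristic, keyword list, and model names are illustrative assumptions, not a fixed API:

```python
# Hypothetical router: send only complex requests to the pricier model.
STRONG_MODEL = "gpt-3.5-turbo"               # specialized tasks (assumption)
DEFAULT_MODEL = "meta-llama/Llama-3.1-70B"   # primary model (assumption)

REASONING_HINTS = ("prove", "step by step", "compare", "plan", "derive")

def needs_strong_reasoning(prompt: str) -> bool:
    """Crude capability detection: keyword hints plus prompt length."""
    lowered = prompt.lower()
    return len(prompt) > 2000 or any(h in lowered for h in REASONING_HINTS)

def route(prompt: str) -> str:
    return STRONG_MODEL if needs_strong_reasoning(prompt) else DEFAULT_MODEL

print(route("What's the weather like?"))                   # → meta-llama/Llama-3.1-70B
print(route("Compare these two contracts step by step"))   # → gpt-3.5-turbo
```

In practice teams often replace the keyword heuristic with a small classifier, but the routing shape stays the same.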
Embeddings: Open-source embeddings through Hugging Face or embed locally. Avoid API embeddings. MTEB benchmarks show open-source BGE-large-en matching commercial embeddings at zero cost.
Hosting embeddings: Pinecone serverless (free tier for <1M vectors) or Weaviate self-hosted (free, open-source).
Application Framework
Backend Framework: FastAPI for Python; it offers excellent performance and developer experience. AsyncIO support enables high-concurrency applications without the complexity of multiprocessing.
Sample architecture:
- FastAPI application server (2-4 instances)
- Request queue (Redis)
- Worker processes (Ray for distributed execution)
- Cache layer (Redis for response caching)
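The queue-plus-workers pattern above can be sketched in-process with asyncio; in production the queue would be Redis and the workers Ray actors, so treat this as a minimal stand-in for the shape of the design:

```python
import asyncio

async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Pull requests off the queue and process them (stand-in for model inference)."""
    while True:
        prompt = await queue.get()
        results.append(f"{name} handled: {prompt}")
        queue.task_done()

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    # Two workers drain the queue concurrently, like Ray workers behind Redis.
    workers = [asyncio.create_task(worker(f"w{i}", queue, results)) for i in range(2)]
    for prompt in ["summarize A", "classify B", "translate C"]:
        queue.put_nowait(prompt)
    await queue.join()          # wait until every queued request is processed
    for w in workers:
        w.cancel()
    return results

print(len(asyncio.run(main())))  # → 3
```

The payoff of the pattern is that the API server only enqueues; slow inference never blocks request handling.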
Deployment container image: 500MB typical. Storage cost negligible for small startups.
Frontend Framework: React for web interfaces, React Native for mobile. Both build on JavaScript and React patterns familiar to most engineering teams. Deployment cost: static assets on CloudFlare Pages (free tier sufficient for prototyping).
Database: PostgreSQL for relational data, MongoDB for document storage. Both are available on cloud providers' free tiers. The PostGIS extension adds geospatial capabilities if needed.
Start with PostgreSQL. A 10GB free tier (Heroku Postgres or similar) handles 1M+ rows comfortably.
Deployment and Orchestration
Container Orchestration: Kubernetes (k3s lightweight variant) or Docker Compose for smaller applications. Kubernetes adds operational overhead but scales to multi-region deployment.
For sub-1M monthly API calls, Docker Compose provides sufficient orchestration; migrate to Kubernetes when you outgrow it.
CI/CD Pipeline: GitHub Actions (free for open-source, $0.008/minute compute). Link directly to repository. Typical build + test + deploy pipeline: 2-5 minutes per commit.
Monthly cost estimate (50 deployments): $4-20
Monitoring and Observability:
- Application metrics: Prometheus (self-hosted, free)
- Log aggregation: Loki for small-scale (free), Datadog for production ($15+/month)
- APM (Application Performance Monitoring): Self-hosted Jaeger (free) or Datadog (commercial)
Recommended stack for startups:
- Prometheus + Grafana for metrics ($0/month self-hosted)
- Loki for logs ($0/month self-hosted)
- Sentry for error tracking ($0/month for hobby tier, $29/month production)
Monthly cost: $0-29
See AWS GPU pricing for EC2 instance costs if self-hosting infrastructure.
Observability Layer
Request Tracing: Implement request ID propagation across services. Log request ID with every event. This enables debugging production issues without accessing live systems.
Sample implementation: OpenTelemetry libraries (free, open-source) emit traces to Jaeger (self-hosted) or commercial provider.
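Request ID propagation can be sketched with Python's `contextvars`, the same mechanism OpenTelemetry builds its context propagation on; a framework middleware would set the ID once per request. A minimal stand-in:

```python
import contextvars
import uuid

# One context variable per process; each async task/request sees its own value.
request_id: contextvars.ContextVar[str] = contextvars.ContextVar("request_id", default="-")

def start_request() -> str:
    """Middleware would call this at the top of every incoming request."""
    rid = uuid.uuid4().hex[:8]
    request_id.set(rid)
    return rid

def log(event: str) -> str:
    """Every log line carries the current request ID automatically."""
    return f"[req={request_id.get()}] {event}"

rid = start_request()
print(rid in log("model inference started"))  # → True
```

Because the ID rides along in context rather than in function arguments, downstream code never has to thread it through manually.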
Error Tracking: Sentry integrates with FastAPI in two lines of code, capturing stack traces, request context, and user information automatically.
Model Monitoring: Track inference latency, token per second throughput, and model quality metrics. Log model input/output samples to detect data drift.
Data drift detection requires holdout test set evaluation. Monthly evaluation of 1000 test cases ensures model quality remains stable.
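The monthly holdout evaluation can be a plain script: score the model on a fixed test set, compare against the previous run, and alert on a drop. The metric and tolerance below are illustrative assumptions:

```python
def evaluate(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy over the holdout set (swap in your real metric)."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def quality_regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Flag drift if the score drops more than `tolerance` below baseline."""
    return (baseline - current) > tolerance

baseline = 0.91   # score from last month's 1,000-case run (hypothetical)
current = evaluate(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"])  # 0.75
print(quality_regressed(current, baseline))  # → True
```

Keeping the test set fixed between runs is what makes month-over-month comparisons meaningful.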
Security and Compliance
API Key Management: Use environment variables for development, secrets management for production. HashiCorp Vault (self-hosted, free) or AWS Secrets Manager ($0.40/secret/month).
Rotate API keys quarterly. Implement automatic key rotation for OAuth tokens.
Data Privacy: Implement field-level encryption for PII. PostgreSQL native encryption (pgcrypto extension) handles sensitive fields.
For GDPR compliance, implement data deletion workflows. Document data retention policies explicitly.
See Best GPU Cloud in Europe: GDPR-Compliant Providers for compliance considerations.
Rate Limiting: Implement API rate limiting (10 requests/second per user typical). Use Redis for distributed rate limiting across multiple servers.
Cost Breakdown: Small Production System
Monthly Operating Costs
Compute:
- Development: $20
- Production H100 reserved: $1,500
- Fine-tuning: $120
- Subtotal: $1,640
APIs:
- Together AI: $28
- OpenAI: $10
- Subtotal: $38
Infrastructure:
- Database (Heroku): $50
- Cache (Redis Cloud): $15
- Object storage (S3): $5
- CDN (CloudFlare): $0 (free tier)
- Subtotal: $70
Deployment and Monitoring:
- CI/CD: $10
- Error tracking (Sentry): $29
- Subtotal: $39
Total: $1,787/month
This supports:
- 30M+ monthly inference tokens
- 10,000 monthly API requests
- 7-8 fine-tuning runs
- Full-featured production application
- 99%+ uptime
Cost Optimization Strategies
Caching: Implement aggressive response caching. A RAG system processing similar queries repeatedly benefits from query-result caching.
- Cache hit rate 50%: cuts API costs 50%
- Cache hit rate 70%: cuts API costs 70%
Redis cost: $15/month for 5GB. Caching investment pays back within days on high-volume applications.
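The savings above are linear in hit rate, since every cached response removes one API call. A sketch of the effective-cost arithmetic, using the ~$28/month Together AI figure from this section:

```python
def effective_api_cost(base_monthly_cost: float, hit_rate: float) -> float:
    """Only cache misses hit the API, so spend scales with (1 - hit_rate)."""
    return round(base_monthly_cost * (1 - hit_rate), 2)

print(effective_api_cost(28.0, 0.50))  # → 14.0
print(effective_api_cost(28.0, 0.70))  # → 8.4
```

Measuring your real hit rate for a week before sizing the cache keeps this estimate honest.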
Batch Processing: Process multiple requests in batches rather than individually. A batch size of 32 amortizes per-request overhead and improves GPU utilization, lowering per-token cost.
Expected savings: 10-20% on inference costs.
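Micro-batching can be as simple as chunking a request backlog before handing it to the inference server; the batch size of 32 follows the section above, the rest is a sketch:

```python
from typing import Iterator

def batches(items: list[str], size: int = 32) -> Iterator[list[str]]:
    """Yield fixed-size chunks; the inference server handles each chunk in one pass."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

backlog = [f"prompt-{i}" for i in range(100)]
sizes = [len(b) for b in batches(backlog)]
print(sizes)  # → [32, 32, 32, 4]
```

Production batchers usually add a small time window (flush every N milliseconds) so latency-sensitive requests are not stuck waiting for a full batch.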
Model Selection: Start with Llama 3.1 8B instead of 70B. The 8B model runs on an RTX 4090, roughly an order of magnitude cheaper per hour than an H100. Upgrade to 70B only when capability gaps appear.
Measured approach:
- Week 1-2: Prototype on local RTX 4090
- Week 3-4: Deploy 8B model on API
- Month 2+: Upgrade to 70B only if benchmarks justify
Spot Instance Usage: Use spot instances for batch processing and fine-tuning. H100 spot instances cost 50% less than on-demand.
Batch processing with 2-minute interruption tolerance: use spot instances.
Production inference with an SLA: use reserved capacity.
FAQ
How do I choose between self-hosted and API?
- API for 0-5M monthly tokens: lower total cost, less operational overhead.
- Self-hosted for 5M+ monthly tokens: cost advantages justify the operational complexity.
- Hybrid: APIs for general workloads, self-hosted for specialized/recurring tasks.
Should I use Kubernetes from day one? No. Start with Docker Compose. Kubernetes adds 40-60 hours of learning curve and operational overhead. Migrate to Kubernetes only when you have >10K daily active users or multi-region requirements.
How do I handle model fine-tuning costs? Fine-tune when specific domain data justifies cost. Measure baseline model performance first. A fine-tuning run costing $150 only makes sense if it improves metrics by 5%+.
What's the minimum viable production stack?
- 1 application server (FastAPI on VM)
- 1 database (PostgreSQL)
- 1 API provider (Together AI)
- 1 monitoring tool (Sentry)
- Total monthly cost: $100-200
This handles up to 100K monthly API calls with acceptable latency.
How do I avoid runaway API costs? Implement per-user rate limits. Monitor token usage daily. Set up alerts when daily cost exceeds threshold. Cap API spending programmatically if needed.
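The programmatic spending cap mentioned above can be a guard checked before each API call; the threshold, rate, and in-memory counter here are assumptions (in production the running total would live in Redis or a metrics store):

```python
class BudgetGuard:
    """Reject API calls once the day's estimated spend crosses a cap."""
    def __init__(self, daily_cap_usd: float = 25.0):
        self.daily_cap_usd = daily_cap_usd
        self.spent_today = 0.0

    def charge(self, tokens: int, rate_per_m: float = 1.06) -> bool:
        """Record a call's estimated cost; return False (block) if over budget."""
        cost = tokens * rate_per_m / 1_000_000
        if self.spent_today + cost > self.daily_cap_usd:
            return False
        self.spent_today += cost
        return True

guard = BudgetGuard(daily_cap_usd=1.0)
ok = [guard.charge(200_000) for _ in range(6)]
print(ok)  # → [True, True, True, True, False, False]
```

Pairing the hard cap with a softer alert at, say, 80% of budget gives you time to react before requests start failing.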
Can I run this stack entirely on free tiers? For hobby projects, yes. Supabase (PostgreSQL free), Vercel (frontend free), Together AI (1M free tokens monthly), Sentry (free tier). Limitation: No self-hosted GPU compute (requires paid tier).
Related Resources
- Together AI Pricing
- OpenAI API Pricing
- RunPod GPU Pricing
- Lambda GPU Pricing
- CoreWeave GPU Pricing
- AWS GPU Pricing
Sources
- FastAPI official documentation
- Kubernetes best practices documentation
- HashiCorp infrastructure tools documentation
- Cloud provider pricing (March 2026)
- Industry benchmarks for model inference latency