Build a Production Chatbot
Build a production-ready chatbot with streaming responses using Llama 3.1 on GPU infrastructure.
Real Implementation: This tutorial uses actual Syaala API endpoints and RunPod GPU infrastructure. No mock data.
What You’ll Build
- Backend: Llama 3.1 8B model on NVIDIA A100 GPU
- Frontend: Next.js 14 with streaming chat UI
- Features: Conversation history, streaming responses, cost tracking
- Runtime: vLLM for optimal throughput
- Scaling: Auto-scale from 0 to 5 replicas based on demand
Architecture
User Browser
↓
Next.js Frontend (Port 3000)
↓
Syaala API (platform.syaala.com)
↓
RunPod Serverless Endpoint
↓
vLLM Runtime + Llama 3.1 8B
↓
NVIDIA A100 40GB GPU
Prerequisites
Install Dependencies
node --version # v20.0.0 or higher
npm install -g @syaala/cli
Get API Key
Generate your API key at app.syaala.com/settings
export SYAALA_API_KEY=sk_live_...
Authenticate
syaala auth login --api-key $SYAALA_API_KEY
Step 1: Deploy Llama 3.1 Model
Option A: Using CLI
syaala deployments create \
--name chatbot-llama-3 \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--runtime vllm \
--gpu-type A100_40GB \
--gpu-count 1 \
--min-replicas 0 \
--max-replicas 5 \
--env MAX_MODEL_LEN=4096 \
--env MAX_NUM_SEQS=256
Option B: Using TypeScript SDK
import { createClient } from '@syaala/sdk'
const client = createClient(process.env.SYAALA_API_KEY!)
// Get organization ID
const profile = await client.getProfile()
if (!profile.success) throw new Error('Auth failed')
const orgId = profile.data.orgId
// Create deployment
const deployment = await client.deployments.create(orgId, {
name: 'chatbot-llama-3',
modelId: 'meta-llama/Meta-Llama-3.1-8B-Instruct',
runtime: 'vllm',
provider: 'runpod',
gpuType: 'A100_40GB',
gpuCount: 1,
scaling: {
minReplicas: 0, // Scale to zero when idle
maxReplicas: 5 // Auto-scale under load
},
environment: {
MAX_MODEL_LEN: '4096',
MAX_NUM_SEQS: '256',
TRUST_REMOTE_CODE: 'false'
}
})
if (!deployment.success) {
throw new Error(`Deployment failed: ${deployment.error}`)
}
console.log('Deployment ID:', deployment.data.id)
console.log('Status:', deployment.data.state)
Wait for Deployment
# Monitor deployment status
syaala deployments get <deployment-id>
# Watch logs
syaala deployments logs <deployment-id> --follow
Expected states: PROVISIONING → INITIALIZING → HEALTHY
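If you are scripting the rollout, you can also poll from TypeScript until the deployment is healthy. This is a minimal sketch: it assumes the SDK exposes a `deployments.get(orgId, id)` lookup returning the same `state` field the CLI prints, and a `FAILED` terminal state — both assumptions, so check the SDK reference for the exact method and state names.

```typescript
import { createClient } from '@syaala/sdk'

const client = createClient(process.env.SYAALA_API_KEY!)

// Poll every 10s until the deployment reports HEALTHY or the timeout elapses.
// `deployments.get` and the 'FAILED' state are assumed, not confirmed SDK surface.
async function waitForHealthy(orgId: string, deploymentId: string, timeoutMs = 15 * 60_000) {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    const res = await client.deployments.get(orgId, deploymentId)
    if (!res.success) throw new Error(`Status check failed: ${res.error}`)
    console.log('State:', res.data.state)
    if (res.data.state === 'HEALTHY') return res.data
    if (res.data.state === 'FAILED') throw new Error('Deployment failed')
    await new Promise(resolve => setTimeout(resolve, 10_000))
  }
  throw new Error('Timed out waiting for deployment to become HEALTHY')
}
```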
Step 2: Create Next.js Frontend
Initialize Project
npx create-next-app@14 chatbot-app --typescript --tailwind --app
cd chatbot-app
npm install @syaala/sdk ai
Project Structure
chatbot-app/
├── app/
│ ├── api/
│ │ └── chat/
│ │ └── route.ts # Streaming API route
│ ├── page.tsx # Chat UI
│ └── layout.tsx
├── lib/
│ └── syaala.ts # SDK client
└── .env.local # API keys
Step 3: Configure SDK Client
Create lib/syaala.ts:
import { createClient } from '@syaala/sdk'
if (!process.env.SYAALA_API_KEY) {
throw new Error('SYAALA_API_KEY is required')
}
if (!process.env.NEXT_PUBLIC_DEPLOYMENT_ID) {
throw new Error('NEXT_PUBLIC_DEPLOYMENT_ID is required')
}
export const syaalaClient = createClient(process.env.SYAALA_API_KEY, {
timeout: 60000, // 60 second timeout for long responses
retries: 3
})
export const DEPLOYMENT_ID = process.env.NEXT_PUBLIC_DEPLOYMENT_ID
Environment Variables
Create .env.local:
SYAALA_API_KEY=sk_live_...
NEXT_PUBLIC_DEPLOYMENT_ID=dep_...
Step 4: Build Streaming API Route
Create app/api/chat/route.ts:
import { syaalaClient, DEPLOYMENT_ID } from '@/lib/syaala'
import { StreamingTextResponse } from 'ai'
export const runtime = 'edge'
export const dynamic = 'force-dynamic'
interface ChatMessage {
role: 'user' | 'assistant'
content: string
}
export async function POST(req: Request) {
try {
const { messages } = await req.json() as { messages: ChatMessage[] }
if (!messages || messages.length === 0) {
return new Response('Messages are required', { status: 400 })
}
// Build a plain-text prompt in a simple User/Assistant format
// (not the official Llama 3.1 chat template)
const lastMessage = messages[messages.length - 1]
const conversationHistory = messages
.slice(0, -1)
.map(msg => `${msg.role === 'user' ? 'User' : 'Assistant'}: ${msg.content}`)
.join('\n')
const prompt = conversationHistory
? `${conversationHistory}\nUser: ${lastMessage.content}\nAssistant:`
: `User: ${lastMessage.content}\nAssistant:`
// Stream from Syaala deployment
const stream = syaalaClient.inference.stream(DEPLOYMENT_ID, {
prompt,
maxTokens: 1000,
temperature: 0.7,
topP: 0.9,
stream: true,
stop: ['User:', '\n\nUser:']
})
// Convert to ReadableStream for Vercel AI SDK
const encoder = new TextEncoder()
const readableStream = new ReadableStream({
async start(controller) {
try {
for await (const chunk of stream) {
  if (!chunk.success) {
    console.error('Stream error:', chunk.error)
    controller.error(new Error(chunk.error))
    return
  }
  controller.enqueue(encoder.encode(chunk.data.text))
  if (chunk.data.finishReason) break
}
// Close once the model signals completion or the stream is exhausted;
// without this, a stream that ends with no finishReason would hang the response.
controller.close()
} catch (error) {
console.error('Stream processing error:', error)
controller.error(error)
}
}
})
return new StreamingTextResponse(readableStream)
} catch (error) {
console.error('Chat API error:', error)
return new Response(
JSON.stringify({
error: error instanceof Error ? error.message : 'Unknown error'
}),
{ status: 500, headers: { 'Content-Type': 'application/json' } }
)
}
}
Step 5: Build Chat UI
Create app/page.tsx:
'use client'
import { useChat } from 'ai/react'
import { useState } from 'react'
export default function ChatPage() {
const { messages, input, handleInputChange, handleSubmit, isLoading, error } = useChat({
api: '/api/chat'
})
const [showCosts, setShowCosts] = useState(false)
// Rough cost heuristic: A100 40GB at $1.10/hour ≈ $0.0003 per message (~1s of GPU time each)
const estimatedCost = (messages.length * 0.0003).toFixed(4)
return (
<div className="flex flex-col h-screen max-w-4xl mx-auto p-4">
{/* Header */}
<div className="mb-4 pb-4 border-b">
<h1 className="text-2xl font-bold">Llama 3.1 Chatbot</h1>
<p className="text-sm text-gray-600">
Powered by Syaala Platform · Running on NVIDIA A100
</p>
<button
onClick={() => setShowCosts(!showCosts)}
className="text-xs text-blue-600 hover:underline mt-1"
>
{showCosts ? 'Hide' : 'Show'} cost estimate
</button>
{showCosts && (
<p className="text-xs text-gray-500 mt-1">
Estimated cost: ${estimatedCost} ({messages.length} messages)
</p>
)}
</div>
{/* Messages */}
<div className="flex-1 overflow-y-auto space-y-4 mb-4">
{messages.length === 0 && (
<div className="text-center text-gray-500 mt-8">
<p>Start a conversation with Llama 3.1</p>
<p className="text-sm mt-2">Try: "Explain quantum computing in simple terms"</p>
</div>
)}
{messages.map((message) => (
<div
key={message.id}
className={`flex ${message.role === 'user' ? 'justify-end' : 'justify-start'}`}
>
<div
className={`max-w-[80%] rounded-lg px-4 py-2 ${
message.role === 'user'
? 'bg-blue-600 text-white'
: 'bg-gray-100 text-gray-900'
}`}
>
<p className="text-sm font-semibold mb-1">
{message.role === 'user' ? 'You' : 'Llama 3.1'}
</p>
<p className="whitespace-pre-wrap">{message.content}</p>
</div>
</div>
))}
{isLoading && (
<div className="flex justify-start">
<div className="bg-gray-100 rounded-lg px-4 py-2">
<p className="text-sm text-gray-500">Thinking...</p>
</div>
</div>
)}
{error && (
<div className="bg-red-50 border border-red-200 rounded-lg p-4">
<p className="text-sm text-red-600">Error: {error.message}</p>
</div>
)}
</div>
{/* Input */}
<form onSubmit={handleSubmit} className="flex gap-2">
<input
value={input}
onChange={handleInputChange}
placeholder="Type your message..."
className="flex-1 px-4 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-600"
disabled={isLoading}
/>
<button
type="submit"
disabled={isLoading || !input.trim()}
className="px-6 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700 disabled:opacity-50 disabled:cursor-not-allowed"
>
Send
</button>
</form>
</div>
)
}
Step 6: Run Development Server
npm run dev
Visit http://localhost:3000
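Before pointing a UI at it, you can exercise the streaming route directly. This sketch uses only standard web platform APIs (global `fetch` is available in Node 18+), and the request body matches the shape `useChat` sends:

```typescript
// Read the streamed response from /api/chat chunk by chunk.
async function testChat() {
  const res = await fetch('http://localhost:3000/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      messages: [{ role: 'user', content: 'Explain quantum computing in simple terms' }]
    })
  })
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`)

  const reader = res.body.getReader()
  const decoder = new TextDecoder()
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    // Tokens arrive incrementally; print them as they stream in
    process.stdout.write(decoder.decode(value, { stream: true }))
  }
}

testChat().catch(console.error)
```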
Step 7: Monitor & Optimize
View Metrics
syaala deployments metrics <deployment-id> --period 1h
Metrics tracked:
- GPU utilization (target: 70-90%)
- Memory usage
- Request latency (p50, p95, p99)
- Throughput (tokens/second)
- Active replicas
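If you want these numbers in your own dashboards or alerts, the same data should be reachable programmatically. A hedged sketch, assuming a `deployments.metrics(orgId, id, opts)` method mirroring the CLI and fraction-valued utilization — the method name and response fields are illustrative, not confirmed SDK surface:

```typescript
import { createClient } from '@syaala/sdk'

const client = createClient(process.env.SYAALA_API_KEY!)

// Hypothetical metrics call mirroring `syaala deployments metrics --period 1h`.
async function checkUtilization(orgId: string, deploymentId: string) {
  const res = await client.deployments.metrics(orgId, deploymentId, { period: '1h' })
  if (!res.success) throw new Error(res.error)

  // Field names assumed; adjust to the actual response schema
  const { gpuUtilization, p95LatencyMs, tokensPerSecond } = res.data
  console.log({ gpuUtilization, p95LatencyMs, tokensPerSecond })

  // Flag departures from the 70-90% utilization target listed above
  if (gpuUtilization < 0.7) console.warn('GPU underutilized: consider raising MAX_NUM_SEQS')
  if (gpuUtilization > 0.9) console.warn('GPU saturated: consider raising max replicas')
}
```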
Optimize Performance
Increase throughput:
syaala deployments update <deployment-id> \
--env MAX_NUM_SEQS=512 \
--env GPU_MEMORY_UTILIZATION=0.95
Reduce latency:
syaala deployments update <deployment-id> \
--min-replicas 1 # Keep one replica warm
Optimize costs:
syaala deployments update <deployment-id> \
--min-replicas 0 \
--max-replicas 3 \
--gpu-type RTX_4090 # Cheaper GPU for demo use
Production Checklist
- Rate limiting - Add rate limits to /api/chat (a minimal sketch follows this list)
- Authentication - Implement user sessions
- Conversation storage - Save chat history to database
- Error handling - Retry failed requests
- Monitoring - Track usage and costs
- Caching - Cache common responses
- Content filtering - Filter inappropriate content
- Load testing - Test with concurrent users
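As a starting point for the first item, here is a minimal fixed-window rate limiter for the chat route. It keeps counters in module memory, which only holds within a single long-lived process; on serverless or edge deployments (including the edge runtime used above) you would back it with a shared store such as Redis. This is an illustrative sketch, not a platform feature:

```typescript
// Naive fixed-window limiter: at most 20 requests per IP per minute.
// Module-level state resets on cold starts and is not shared across instances.
const WINDOW_MS = 60_000
const MAX_REQUESTS = 20
const hits = new Map<string, { count: number; windowStart: number }>()

export function rateLimit(ip: string): boolean {
  const now = Date.now()
  const entry = hits.get(ip)
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now })
    return true
  }
  entry.count += 1
  return entry.count <= MAX_REQUESTS
}

// In app/api/chat/route.ts, before handling the request:
// const ip = req.headers.get('x-forwarded-for') ?? 'unknown'
// if (!rateLimit(ip)) return new Response('Too many requests', { status: 429 })
```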
Cost Analysis
GPU Costs (RunPod Serverless):
- A100 40GB: $1.10/hour ($0.00031/second)
- RTX 4090: $0.50/hour ($0.00014/second)
Scale-to-zero pricing:
- Idle time: $0.00/hour
- Active time: Pay only when processing requests
Example monthly costs (RTX 4090):
- 1000 conversations/day
- 10 messages per conversation
- ~2 seconds of GPU time per response
- Serial worst case: 1000 × 10 × 2 s = 20,000 GPU-seconds/day ≈ $2.80/day, or ~$84/month
- vLLM's continuous batching serves several requests concurrently on one GPU, so realized cost is much lower; the ~$12/month figure corresponds to roughly 7× effective concurrency
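To rerun this estimate for your own traffic, the arithmetic is easy to inline. A minimal sketch; the batching factor is an assumption you should calibrate against the metrics from Step 7, not a measured platform constant:

```typescript
// Worst-case serial cost, discounted by an assumed concurrency factor.
function monthlyCost(opts: {
  conversationsPerDay: number
  messagesPerConversation: number
  secondsPerResponse: number
  pricePerSecond: number   // e.g. 0.00014 for RTX 4090, 0.00031 for A100 40GB
  batchingFactor?: number  // concurrent requests per GPU; assumed, calibrate it
}): number {
  const { conversationsPerDay, messagesPerConversation, secondsPerResponse,
          pricePerSecond, batchingFactor = 1 } = opts
  const requestSecondsPerDay =
    conversationsPerDay * messagesPerConversation * secondsPerResponse
  const gpuSecondsPerDay = requestSecondsPerDay / batchingFactor
  return gpuSecondsPerDay * pricePerSecond * 30
}

// Serial: ~$84/month; with ~7x effective batching: ~$12/month
console.log(monthlyCost({
  conversationsPerDay: 1000, messagesPerConversation: 10,
  secondsPerResponse: 2, pricePerSecond: 0.00014, batchingFactor: 7
}))
```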
Troubleshooting
Deployment stuck in PROVISIONING
# Check deployment logs
syaala deployments logs <deployment-id>
# Common issue: Model download timeout
# Solution: Increase the timeout or use a smaller model
Slow streaming responses
# Check GPU utilization
syaala deployments metrics <deployment-id>
# If GPU util < 50%: Increase batch size
syaala deployments update <deployment-id> --env MAX_NUM_SEQS=512
# If GPU util > 95%: Add more replicas
syaala deployments update <deployment-id> --max-replicas 10
High costs
# Check scaling configuration
syaala deployments get <deployment-id>
# Ensure scale-to-zero is enabled
syaala deployments update <deployment-id> --min-replicas 0
# Switch to cheaper GPU
syaala deployments update <deployment-id> --gpu-type RTX_4090
Next Steps
- Batch Processing - Process multiple requests efficiently
- API Integration - Integrate chatbot into existing apps
- Monitoring Guide - Set up alerts and dashboards
- Cost Optimization - Reduce GPU costs
Example Repository
Complete working code: github.com/syaala/examples/chatbot
git clone https://github.com/syaala/examples
cd examples/chatbot
npm install
npm run dev