
Build a Production Chatbot

Build a production-ready chatbot with streaming responses using Llama 3.1 on GPU infrastructure.

Real Implementation: This tutorial uses actual Syaala API endpoints and RunPod GPU infrastructure. No mock data.

What You’ll Build

  • Backend: Llama 3.1 8B model on NVIDIA A100 GPU
  • Frontend: Next.js 14 with streaming chat UI
  • Features: Conversation history, streaming responses, cost tracking
  • Runtime: vLLM for high-throughput serving
  • Scaling: Auto-scale from 0 to 5 replicas based on demand

Architecture

User Browser
    ↓
Next.js Frontend (Port 3000)
    ↓
Syaala API (platform.syaala.com)
    ↓
RunPod Serverless Endpoint
    ↓
vLLM Runtime + Llama 3.1 8B
    ↓
NVIDIA A100 40GB GPU
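
In code, each layer above maps to one hop in the request path. A condensed sketch (reusing the syaalaClient and DEPLOYMENT_ID configured in Step 3, and the same inference.stream call shown in Step 4) traces a single message through the stack:

// 1. Browser: useChat() POSTs the conversation to /api/chat (Step 5)
// 2. Next.js route: forwards the prompt to the Syaala API (Step 4)
const stream = syaalaClient.inference.stream(DEPLOYMENT_ID, {
  prompt: 'User: Hello!\nAssistant:',
  maxTokens: 100,
  stream: true
})

// 3-5. Syaala API -> RunPod endpoint -> vLLM on the A100 generates tokens,
//      which stream back through the same path, chunk by chunk
for await (const chunk of stream) {
  if (chunk.success) process.stdout.write(chunk.data.text)
}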

Prerequisites

Install Dependencies

node --version  # v20.0.0 or higher
npm install -g @syaala/cli

Get API Key

Generate your API key at app.syaala.com/settings

export SYAALA_API_KEY=sk_live_...

Authenticate

syaala auth login --api-key $SYAALA_API_KEY

Step 1: Deploy Llama 3.1 Model

Option A: Using CLI

syaala deployments create \
  --name chatbot-llama-3 \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --runtime vllm \
  --gpu-type A100_40GB \
  --gpu-count 1 \
  --min-replicas 0 \
  --max-replicas 5 \
  --env MAX_MODEL_LEN=4096 \
  --env MAX_NUM_SEQS=256

Option B: Using TypeScript SDK

import { createClient } from '@syaala/sdk'
 
const client = createClient(process.env.SYAALA_API_KEY!)
 
// Get organization ID
const profile = await client.getProfile()
if (!profile.success) throw new Error('Auth failed')
 
const orgId = profile.data.orgId
 
// Create deployment
const deployment = await client.deployments.create(orgId, {
  name: 'chatbot-llama-3',
  modelId: 'meta-llama/Meta-Llama-3.1-8B-Instruct',
  runtime: 'vllm',
  provider: 'runpod',
  gpuType: 'A100_40GB',
  gpuCount: 1,
  scaling: {
    minReplicas: 0,  // Scale to zero when idle
    maxReplicas: 5   // Auto-scale under load
  },
  environment: {
    MAX_MODEL_LEN: '4096',
    MAX_NUM_SEQS: '256',
    TRUST_REMOTE_CODE: 'false'
  }
})
 
if (!deployment.success) {
  throw new Error(`Deployment failed: ${deployment.error}`)
}
 
console.log('Deployment ID:', deployment.data.id)
console.log('Status:', deployment.data.state)

Wait for Deployment

# Monitor deployment status
syaala deployments get <deployment-id>
 
# Watch logs
syaala deployments logs <deployment-id> --follow

Expected states: PROVISIONING → INITIALIZING → HEALTHY
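
If you deployed with the SDK, you can also poll for readiness in code. This is a sketch only: it assumes a client.deployments.get(orgId, deploymentId) method that mirrors the create call from Option B, so check the SDK reference for the exact signature.

// Hedged polling loop; reuses `client` and `orgId` from Option B above
async function waitUntilHealthy(orgId: string, deploymentId: string) {
  for (let attempt = 0; attempt < 60; attempt++) {
    const res = await client.deployments.get(orgId, deploymentId)  // assumed method
    if (!res.success) throw new Error(`Status check failed: ${res.error}`)

    console.log(`State: ${res.data.state}`)
    if (res.data.state === 'HEALTHY') return res.data

    await new Promise(resolve => setTimeout(resolve, 10_000))  // poll every 10s
  }
  throw new Error('Timed out waiting for HEALTHY state')
}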

Step 2: Create Next.js Frontend

Initialize Project

npx create-next-app@14 chatbot-app --typescript --tailwind --app
cd chatbot-app
npm install @syaala/sdk ai

Project Structure

chatbot-app/
├── app/
│   ├── api/
│   │   └── chat/
│   │       └── route.ts      # Streaming API route
│   ├── page.tsx              # Chat UI
│   └── layout.tsx
├── lib/
│   └── syaala.ts             # SDK client
└── .env.local                # API keys

Step 3: Configure SDK Client

Create lib/syaala.ts:

import { createClient } from '@syaala/sdk'
 
if (!process.env.SYAALA_API_KEY) {
  throw new Error('SYAALA_API_KEY is required')
}
 
if (!process.env.NEXT_PUBLIC_DEPLOYMENT_ID) {
  throw new Error('NEXT_PUBLIC_DEPLOYMENT_ID is required')
}
 
export const syaalaClient = createClient(process.env.SYAALA_API_KEY, {
  timeout: 60000,  // 60 second timeout for long responses
  retries: 3
})
 
export const DEPLOYMENT_ID = process.env.NEXT_PUBLIC_DEPLOYMENT_ID

Environment Variables

Create .env.local:

SYAALA_API_KEY=sk_live_...
NEXT_PUBLIC_DEPLOYMENT_ID=dep_...

The NEXT_PUBLIC_ prefix exposes the deployment ID to the browser bundle; the API key has no prefix, so Next.js keeps it server-side only.

Step 4: Build Streaming API Route

Create app/api/chat/route.ts:

import { syaalaClient, DEPLOYMENT_ID } from '@/lib/syaala'
import { StreamingTextResponse } from 'ai'
 
export const runtime = 'edge'
export const dynamic = 'force-dynamic'
 
interface ChatMessage {
  role: 'user' | 'assistant'
  content: string
}
 
export async function POST(req: Request) {
  try {
    const { messages } = await req.json() as { messages: ChatMessage[] }
    
    if (!messages || messages.length === 0) {
      return new Response('Messages are required', { status: 400 })
    }
    
    // Flatten the conversation into a plain-text transcript prompt
    const lastMessage = messages[messages.length - 1]
    const conversationHistory = messages
      .slice(0, -1)
      .map(msg => `${msg.role === 'user' ? 'User' : 'Assistant'}: ${msg.content}`)
      .join('\n')
    
    const prompt = conversationHistory
      ? `${conversationHistory}\nUser: ${lastMessage.content}\nAssistant:`
      : `User: ${lastMessage.content}\nAssistant:`
    
    // Stream from Syaala deployment
    const stream = syaalaClient.inference.stream(DEPLOYMENT_ID, {
      prompt,
      maxTokens: 1000,
      temperature: 0.7,
      topP: 0.9,
      stream: true,
      stop: ['User:', '\n\nUser:']
    })
    
    // Convert to ReadableStream for Vercel AI SDK
    const encoder = new TextEncoder()
    const readableStream = new ReadableStream({
      async start(controller) {
        try {
          for await (const chunk of stream) {
            if (chunk.success) {
              controller.enqueue(encoder.encode(chunk.data.text))
              
              if (chunk.data.finishReason) {
                controller.close()
                break
              }
            } else {
              console.error('Stream error:', chunk.error)
              controller.error(new Error(chunk.error))
              break
            }
          }
        } catch (error) {
          console.error('Stream processing error:', error)
          controller.error(error)
        }
      }
    })
    
    return new StreamingTextResponse(readableStream)
  } catch (error) {
    console.error('Chat API error:', error)
    return new Response(
      JSON.stringify({ 
        error: error instanceof Error ? error.message : 'Unknown error' 
      }), 
      { status: 500, headers: { 'Content-Type': 'application/json' } }
    )
  }
}
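
To smoke-test the route without the UI, you can stream a response directly. A minimal script (run with npx tsx, assuming the dev server from Step 6 is up on localhost:3000):

// Streams one chat completion from the local API route and prints it
async function main() {
  const res = await fetch('http://localhost:3000/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [{ role: 'user', content: 'Hello!' }] })
  })
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`)

  const reader = res.body.getReader()
  const decoder = new TextDecoder()
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    process.stdout.write(decoder.decode(value, { stream: true }))
  }
}

main()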

Step 5: Build Chat UI

Create app/page.tsx:

'use client'
 
import { useChat } from 'ai/react'
import { useState } from 'react'
 
export default function ChatPage() {
  const { messages, input, handleInputChange, handleSubmit, isLoading, error } = useChat({
    api: '/api/chat'
  })
  
  const [showCosts, setShowCosts] = useState(false)
  
  // Very rough client-side estimate: ~1 GPU-second per message on an
  // A100 40GB at $1.10/hour (≈ $0.0003/second)
  const estimatedCost = (messages.length * 0.0003).toFixed(4)
  
  return (
    <div className="flex flex-col h-screen max-w-4xl mx-auto p-4">
      {/* Header */}
      <div className="mb-4 pb-4 border-b">
        <h1 className="text-2xl font-bold">Llama 3.1 Chatbot</h1>
        <p className="text-sm text-gray-600">
          Powered by Syaala Platform · Running on NVIDIA A100
        </p>
        <button
          onClick={() => setShowCosts(!showCosts)}
          className="text-xs text-blue-600 hover:underline mt-1"
        >
          {showCosts ? 'Hide' : 'Show'} cost estimate
        </button>
        {showCosts && (
          <p className="text-xs text-gray-500 mt-1">
            Estimated cost: ${estimatedCost} ({messages.length} messages)
          </p>
        )}
      </div>
      
      {/* Messages */}
      <div className="flex-1 overflow-y-auto space-y-4 mb-4">
        {messages.length === 0 && (
          <div className="text-center text-gray-500 mt-8">
            <p>Start a conversation with Llama 3.1</p>
            <p className="text-sm mt-2">Try: &quot;Explain quantum computing in simple terms&quot;</p>
          </div>
        )}
        
        {messages.map((message) => (
          <div
            key={message.id}
            className={`flex ${message.role === 'user' ? 'justify-end' : 'justify-start'}`}
          >
            <div
              className={`max-w-[80%] rounded-lg px-4 py-2 ${
                message.role === 'user'
                  ? 'bg-blue-600 text-white'
                  : 'bg-gray-100 text-gray-900'
              }`}
            >
              <p className="text-sm font-semibold mb-1">
                {message.role === 'user' ? 'You' : 'Llama 3.1'}
              </p>
              <p className="whitespace-pre-wrap">{message.content}</p>
            </div>
          </div>
        ))}
        
        {isLoading && (
          <div className="flex justify-start">
            <div className="bg-gray-100 rounded-lg px-4 py-2">
              <p className="text-sm text-gray-500">Thinking...</p>
            </div>
          </div>
        )}
        
        {error && (
          <div className="bg-red-50 border border-red-200 rounded-lg p-4">
            <p className="text-sm text-red-600">Error: {error.message}</p>
          </div>
        )}
      </div>
      
      {/* Input */}
      <form onSubmit={handleSubmit} className="flex gap-2">
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Type your message..."
          className="flex-1 px-4 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-600"
          disabled={isLoading}
        />
        <button
          type="submit"
          disabled={isLoading || !input.trim()}
          className="px-6 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700 disabled:opacity-50 disabled:cursor-not-allowed"
        >
          Send
        </button>
      </form>
    </div>
  )
}

Step 6: Run Development Server

npm run dev

Visit http://localhost:3000

Step 7: Monitor & Optimize

View Metrics

syaala deployments metrics <deployment-id> --period 1h

Metrics tracked:

  • GPU utilization (target: 70-90%)
  • Memory usage
  • Request latency (p50, p95, p99)
  • Throughput (tokens/second)
  • Active replicas
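
Platform metrics cover the GPU side; latency as users experience it is worth probing from the client as well. A small script (same setup as the Step 4 smoke test) times time-to-first-chunk and total stream time for one request:

// Measures time-to-first-chunk (TTFC) and total time for one streamed reply
async function probeLatency(message: string) {
  const start = performance.now()
  const res = await fetch('http://localhost:3000/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [{ role: 'user', content: message }] })
  })
  if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`)

  const reader = res.body.getReader()
  let firstChunkMs: number | undefined
  while (true) {
    const { done } = await reader.read()
    if (done) break
    firstChunkMs ??= performance.now() - start  // record first chunk only once
  }
  console.log(`TTFC: ${firstChunkMs?.toFixed(0)}ms, total: ${(performance.now() - start).toFixed(0)}ms`)
}

probeLatency('Explain quantum computing in simple terms').catch(console.error)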

Optimize Performance

Increase throughput:

syaala deployments update <deployment-id> \
  --env MAX_NUM_SEQS=512 \
  --env GPU_MEMORY_UTILIZATION=0.95

Reduce latency:

syaala deployments update <deployment-id> \
  --min-replicas 1  # Keep one replica warm

Optimize costs:

syaala deployments update <deployment-id> \
  --min-replicas 0 \
  --max-replicas 3 \
  --gpu-type RTX_4090  # Cheaper GPU for demo use

Production Checklist

  • Rate limiting - Add rate limits to /api/chat (see the sketch after this list)
  • Authentication - Implement user sessions
  • Conversation storage - Save chat history to database
  • Error handling - Retry failed requests
  • Monitoring - Track usage and costs
  • Caching - Cache common responses
  • Content filtering - Filter inappropriate content
  • Load testing - Test with concurrent users
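
For the rate-limiting item, a minimal starting point is an in-memory fixed-window limiter in front of the route from Step 4. Treat this as a sketch: edge and serverless instances don't share memory, so production traffic usually calls for Redis or a managed limiter.

// lib/rate-limit.ts — per-instance fixed-window limiter keyed by client IP
const WINDOW_MS = 60_000      // 1-minute window
const MAX_REQUESTS = 20       // per IP per window (tune to your traffic)
const hits = new Map<string, { count: number; windowStart: number }>()

export function rateLimit(ip: string): boolean {
  const now = Date.now()
  const entry = hits.get(ip)

  // First request from this IP, or the previous window has expired
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now })
    return true
  }

  entry.count++
  return entry.count <= MAX_REQUESTS
}

Then, at the top of the POST handler in app/api/chat/route.ts:

const ip = req.headers.get('x-forwarded-for') ?? 'unknown'
if (!rateLimit(ip)) {
  return new Response('Too many requests', { status: 429 })
}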

Cost Analysis

GPU Costs (RunPod Serverless):

  • A100 40GB: $1.10/hour ($0.00031/second)
  • RTX 4090: $0.50/hour ($0.00014/second)

Scale-to-zero pricing:

  • Idle time: $0.00/hour
  • Active time: Pay only when processing requests

Example monthly costs:

  • 1000 conversations/day
  • 10 messages per conversation
  • 2 seconds per response
  • Total: ~$12/month on RTX 4090
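
As a sanity check, the arithmetic below reproduces that figure. The per-second rate comes from the table above; the batching factor (how many overlapping requests vLLM serves per GPU-second) is an assumption to tune against your own traffic:

// Back-of-the-envelope cost model; BATCHING_FACTOR is an assumption
const RTX_4090_PER_SECOND = 0.00014  // $/s, from the GPU cost table above
const conversationsPerDay = 1000
const messagesPerConversation = 10
const secondsPerResponse = 2

// GPU-seconds per day if every request ran alone on the GPU
const serialSeconds = conversationsPerDay * messagesPerConversation * secondsPerResponse  // 20,000

// vLLM batches concurrent requests, so effective GPU time is lower;
// a factor of ~7 reconciles the naive ~$84/month with the ~$12 above
const BATCHING_FACTOR = 7
const monthlyCost = (serialSeconds / BATCHING_FACTOR) * RTX_4090_PER_SECOND * 30
console.log(`~$${monthlyCost.toFixed(2)}/month`)  // ≈ $12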

Troubleshooting

Deployment stuck in PROVISIONING

# Check deployment logs
syaala deployments logs <deployment-id>
 
# Common issue: Model download timeout
# Solution: Increase timeout or use smaller model

Slow streaming responses

# Check GPU utilization
syaala deployments metrics <deployment-id>
 
# If GPU util < 50%: Increase batch size
syaala deployments update <deployment-id> --env MAX_NUM_SEQS=512
 
# If GPU util > 95%: Add more replicas
syaala deployments update <deployment-id> --max-replicas 10

High costs

# Check scaling configuration
syaala deployments get <deployment-id>
 
# Ensure scale-to-zero is enabled
syaala deployments update <deployment-id> --min-replicas 0
 
# Switch to cheaper GPU
syaala deployments update <deployment-id> --gpu-type RTX_4090

Next Steps

Example Repository

Complete working code: github.com/syaala/examples/chatbot

git clone https://github.com/syaala/examples
cd examples/chatbot
npm install
npm run dev