
Your First Deployment

This comprehensive guide walks you through every step of deploying an AI model on Syaala Platform, from model selection to production monitoring.

Overview

Deploying a model on Syaala involves:

  1. Choosing a model from HuggingFace or uploading a custom model
  2. Selecting GPU hardware based on model requirements
  3. Configuring runtime (vLLM, Triton, FastAPI)
  4. Setting scaling parameters for auto-scaling
  5. Deploying via CLI, SDK, or Dashboard
  6. Testing the OpenAI-compatible endpoint
  7. Monitoring GPU metrics and performance

Estimated Time: 15-20 minutes for first deployment


Step 1: Choose Your Model

Syaala supports models from multiple sources:

Browse HuggingFace Models and select a model:

Popular Choices:

  • LLMs: meta-llama/Meta-Llama-3.1-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.3
  • Vision: openai/clip-vit-large-patch14, Salesforce/blip-image-captioning-base
  • Audio: openai/whisper-large-v3, facebook/wav2vec2-large-960h

Model License Agreement

Gated Models: Some models (like Llama 3.1) require license acceptance on HuggingFace. Syaala handles authentication for you - simply ensure you’ve accepted the model license on the HuggingFace website before deploying.
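If you want to confirm access before deploying, a quick check with the huggingface_hub client from your own machine will fail fast if the license has not been accepted yet (assumes you have authenticated locally with huggingface-cli login):

from huggingface_hub import model_info

# Raises an error if your account cannot access the gated repo;
# otherwise prints the repo id and its gating status.
info = model_info("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(info.id, "- gated:", info.gated)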

Custom Models

Upload your own model weights to Syaala Storage:

syaala models upload \
  --name "my-custom-model" \
  --path ./model-weights/ \
  --framework pytorch

Step 2: Calculate GPU Requirements

Different models require different GPU memory:

Model Size   | Minimum GPU   | Recommended GPU | Memory Required
3B params    | RTX 4090      | A100 40GB       | 12GB
7B params    | RTX 4090      | A100 40GB       | 16GB
13B params   | A100 40GB     | A100 80GB       | 32GB
30B params   | A100 80GB     | H100 80GB       | 64GB
70B params   | 2× A100 80GB  | 4× H100 80GB    | 160GB

Memory Calculation

Required Memory = Model Size × Precision Multiplier × Safety Factor

- FP32: 4 bytes per parameter (32 bits)
- FP16: 2 bytes per parameter (16 bits)
- INT8: 1 byte per parameter (8 bits)
- Safety Factor: 1.2× (for activations, KV cache)

Example (Llama 3.1 8B in FP16):
8B × 2 bytes × 1.2 = 19.2 GB
→ Requires RTX 4090 (24GB) or A100 40GB
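If you would rather script this than do it by hand, here is a minimal sketch of the same arithmetic in Python. The GPU VRAM figures match the cards in the table above; the helper names are illustrative, not part of any Syaala SDK:

def required_gpu_memory_gb(params_billions, bytes_per_param=2.0, safety_factor=1.2):
    """Model size x precision multiplier x safety factor (FP16 by default)."""
    return params_billions * bytes_per_param * safety_factor

# Smallest single GPU (by VRAM) that fits the estimate
GPU_VRAM_GB = {"RTX 4090": 24, "A100 40GB": 40, "A100 80GB": 80, "H100 80GB": 80}

def smallest_gpu(required_gb):
    candidates = [(vram, name) for name, vram in GPU_VRAM_GB.items() if vram >= required_gb]
    return min(candidates)[1] if candidates else "multi-GPU required"

need = required_gpu_memory_gb(8)                             # Llama 3.1 8B in FP16 -> 19.2 GB
print(need, smallest_gpu(need))                              # 19.2 RTX 4090

need_int8 = required_gpu_memory_gb(8, bytes_per_param=1.0)   # INT8 -> 9.6 GB
print(need_int8, smallest_gpu(need_int8))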

💡 Pro Tip: Use the GPU Calculator in the dashboard for automatic recommendations.


Step 3: Select Runtime

Syaala supports three inference runtimes:

vLLM

Best For: Large language models, chat, text generation

Features:

  • PagedAttention for efficient memory usage
  • Continuous batching for high throughput
  • OpenAI-compatible API
  • Tensor parallelism for multi-GPU

Example Configuration:

syaala deployments create \
  --runtime vllm \
  --env MAX_MODEL_LEN=4096 \
  --env TENSOR_PARALLEL_SIZE=1 \
  --env GPU_MEMORY_UTILIZATION=0.9
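The same knobs map directly onto vLLM's own Python API, which can be handy for sanity-checking a configuration locally before deploying (a sketch, assuming a local GPU with enough VRAM and vllm installed):

from vllm import LLM, SamplingParams

# Mirrors the --env values above: MAX_MODEL_LEN, TENSOR_PARALLEL_SIZE, GPU_MEMORY_UTILIZATION
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_model_len=4096,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64, temperature=0.7))
print(outputs[0].outputs[0].text)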

Triton Inference Server

Best For: Multi-model serving, vision models, ensemble pipelines

Features:

  • Dynamic batching
  • Model ensemble pipelines
  • Multiple framework support (PyTorch, TensorFlow, ONNX)
  • Concurrent model execution

Example Configuration:

syaala deployments create \
  --runtime triton \
  --env MAX_BATCH_SIZE=32 \
  --env DYNAMIC_BATCHING=true
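Triton exposes standard readiness endpoints, so the stock tritonclient package can be used to verify that a model is loaded. The sketch below targets a local Triton server; how a Syaala deployment exposes Triton's native HTTP port isn't covered here, so the host and model name are placeholders:

import tritonclient.http as httpclient

# Point the client at the server's host:port (no scheme)
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("my-custom-model"))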

FastAPI (Custom)

Best For: Custom inference logic, preprocessing, non-standard models

Features:

  • Full Python control
  • Custom preprocessing/postprocessing
  • Integration with external services
  • REST + WebSocket support

Example:

# main.py
from fastapi import FastAPI
import torch

app = FastAPI()

# Load the model once at startup and switch it to inference mode
model = torch.load('/models/custom-model.pt')
model.eval()

@app.post('/predict')
async def predict(request: dict):
    # Convert the JSON payload to a tensor and run inference without tracking gradients
    inputs = torch.tensor(request['input'])
    with torch.no_grad():
        result = model(inputs)
    # Tensors are not JSON-serializable, so return a plain Python list
    return {'output': result.tolist()}
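Before handing main.py to Syaala, you might smoke-test it locally, for example by running uvicorn main:app and calling the route with any HTTP client. The payload shape below is only an example and depends on what your model expects:

import requests

# Assumes `uvicorn main:app --port 8000` is running locally
resp = requests.post("http://localhost:8000/predict",
                     json={"input": [[0.1, 0.2, 0.3]]})
print(resp.json())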

Step 4: Configure Scaling

Syaala auto-scales deployments based on load:

Scaling Parameters

{
  minReplicas: 1,      // Always keep 1 instance running
  maxReplicas: 10,     // Scale up to 10 instances max
  targetUtilization: 0.8, // Scale when GPU hits 80%
  scaleDownDelay: 300 // Wait 5min before scaling down
}

Scaling Strategies

Cost-Optimized (Development/Testing):

--min-replicas 0 --max-replicas 3 --scale-down-delay 60
  • Scales to zero when idle
  • Low cost, higher cold-start latency

Performance-Optimized (Production):

--min-replicas 2 --max-replicas 10 --scale-down-delay 600
  • Always-on replicas for instant responses
  • Higher cost, zero cold-starts

Balanced (Most Common):

--min-replicas 1 --max-replicas 5 --scale-down-delay 300
  • One instance always ready
  • Scales during traffic spikes

Step 5: Deploy

syaala deployments create \
  --name "llama-31-8b-production" \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --runtime vllm \
  --gpu "NVIDIA-RTX-4090" \
  --min-replicas 1 \
  --max-replicas 5 \
  --env MAX_MODEL_LEN=4096 \
  --env TEMPERATURE=0.7 \
  --env TOP_P=0.9

Deployment Process

Watch the deployment progress through these stages:

  1. Validating (10-30s): Checking model access, GPU availability
  2. Provisioning (1-2m): Allocating GPU hardware
  3. Downloading (2-5m): Pulling model weights from HuggingFace
  4. Loading (1-3m): Loading model into GPU memory
  5. Running (∞): Deployment ready, accepting requests

Total Time: Typically 5-10 minutes for first deployment (model needs to download). Subsequent deployments are faster due to caching.


Step 6: Test Your Endpoint

Once status shows running, test the deployment:

OpenAI-Compatible API

curl -X POST https://dep-a1b2c3d4.syaala.run/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "What are the benefits of GPU acceleration for AI?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
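Because the endpoint speaks the OpenAI API, the official openai Python client also works once base_url points at your deployment (same example deployment URL and SYAALA_API_KEY variable as elsewhere in this guide):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://dep-a1b2c3d4.syaala.run/v1",
    api_key=os.environ["SYAALA_API_KEY"],
)
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the benefits of GPU acceleration for AI?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(resp.choices[0].message.content)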

Streaming Responses

For real-time streaming (like ChatGPT):

const response = await fetch('https://dep-a1b2c3d4.syaala.run/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.SYAALA_API_KEY}`
  },
  body: JSON.stringify({
    model: 'meta-llama/Meta-Llama-3.1-8B-Instruct',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true // Enable streaming
  })
})
 
const reader = response.body!.getReader()
const decoder = new TextDecoder()
 
while (true) {
  const { done, value } = await reader.read()
  if (done) break
 
  const chunk = decoder.decode(value)
  const lines = chunk.split('\n').filter(line => line.trim())
 
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = line.slice(6)
      if (data === '[DONE]') break
 
      const json = JSON.parse(data)
      process.stdout.write(json.choices[0].delta.content || '')
    }
  }
}
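The same stream flag works from Python; with the openai client you iterate over chunks instead of parsing SSE lines by hand (a sketch, using the same placeholder deployment URL):

import os
from openai import OpenAI

client = OpenAI(base_url="https://dep-a1b2c3d4.syaala.run/v1",
                api_key=os.environ["SYAALA_API_KEY"])

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # Enable streaming
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)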

Step 7: Monitor Performance

Real-Time Metrics

View live metrics for your deployment:

syaala deployments metrics dep_a1b2c3d4e5f6

Metrics Tracked:

  • GPU Utilization: 0-100% (target: 60-80% for optimal cost)
  • Memory Usage: GB used / GB total
  • Throughput: Requests per second
  • Latency: p50, p95, p99 response times (see the client-side sketch after this list)
  • Active Replicas: Current replica count
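For a quick client-side view of those latency percentiles, you can time a handful of requests yourself. This is only an illustrative sketch against the OpenAI-compatible endpoint; the CLI and dashboard metrics remain the authoritative numbers:

import os, time, statistics
from openai import OpenAI

client = OpenAI(base_url="https://dep-a1b2c3d4.syaala.run/v1",
                api_key=os.environ["SYAALA_API_KEY"])

latencies = []
for _ in range(20):
    start = time.perf_counter()
    client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=8,
    )
    latencies.append(time.perf_counter() - start)

q = statistics.quantiles(latencies, n=100)   # 99 cut points
print(f"p50={q[49]:.2f}s  p95={q[94]:.2f}s  p99={q[98]:.2f}s")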

Metrics Dashboard

View comprehensive metrics in the Dashboard → Monitoring:

  • GPU Utilization Graph: 24-hour time series
  • Request Rate: Requests per minute
  • Error Rate: 4xx/5xx errors
  • Cost Breakdown: Hourly spend by resource

Setting Up Alerts

Configure alerts for critical events:

syaala alerts create \
  --name "High GPU Utilization" \
  --metric gpu_utilization \
  --condition ">" \
  --threshold 90 \
  --duration 300 \
  --notification slack

Production Checklist

Before going to production, ensure:

  • Model tested with representative workload
  • Scaling configured for expected traffic
  • Alerts set up for GPU utilization, errors, costs
  • Monitoring enabled with proper dashboards
  • API keys rotated from test to production keys
  • Rate limits understood for your plan tier
  • Cost estimates validated against budget
  • Backup deployment in different region (for critical services)

Troubleshooting

Deployment Fails to Start

Symptoms: Status stuck in provisioning or failed

Solutions:

  1. Check System Status for GPU availability
  2. Ensure sufficient GPU memory for model size
  3. Review deployment logs: syaala deployments logs dep_xxx
  4. Contact support if issue persists

High Latency

Symptoms: p95 latency > 2 seconds

Solutions:

  1. Increase min-replicas to reduce cold starts
  2. Enable GPU memory optimization: GPU_MEMORY_UTILIZATION=0.95
  3. Use tensor parallelism for large models: TENSOR_PARALLEL_SIZE=2
  4. Consider a faster GPU (RTX 4090 → A100)

Out of Memory Errors

Symptoms: CUDA out of memory in logs

Solutions:

  1. Reduce MAX_MODEL_LEN (context window)
  2. Reduce MAX_BATCH_SIZE
  3. Enable quantization: QUANTIZATION=int8
  4. Upgrade to larger GPU (24GB → 40GB → 80GB)

Next Steps

Congratulations! You’ve successfully deployed your first AI model. Now explore: