
Your First Deployment

This comprehensive guide walks you through every step of deploying an AI model on Syaala Platform, from model selection to production monitoring.

Overview

Deploying a model on Syaala involves:

  1. Choosing a model from HuggingFace or uploading a custom model
  2. Selecting GPU hardware based on model requirements
  3. Configuring runtime (vLLM, Triton, FastAPI)
  4. Setting scaling parameters for auto-scaling
  5. Deploying via CLI, SDK, or Dashboard
  6. Testing the OpenAI-compatible endpoint
  7. Monitoring GPU metrics and performance

Estimated Time: 15-20 minutes for first deployment


Step 1: Choose Your Model

Syaala supports models from multiple sources:

Browse HuggingFace Models and select a model:

Popular Choices:

  • LLMs: meta-llama/Meta-Llama-3.1-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.3
  • Vision: openai/clip-vit-large-patch14, Salesforce/blip-image-captioning-base
  • Audio: openai/whisper-large-v3, facebook/wav2vec2-large-960h

Model License Agreement

Gated Models: Some models (like Llama 3.1) require license acceptance on HuggingFace. Syaala handles authentication for you - simply ensure you’ve accepted the model license on the HuggingFace website before deploying.
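If you want to confirm access before deploying, a quick check with the huggingface_hub client from your own machine will fail fast if the license has not been accepted yet (assumes you have authenticated locally with huggingface-cli login):

from huggingface_hub import model_info

# Raises an error if your account cannot access the gated repo;
# otherwise prints the repo id and its gating status.
info = model_info("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(info.id, "- gated:", info.gated)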

Custom Models

Upload your own model weights to Syaala Storage:

syaala models upload \
  --name "my-custom-model" \
  --path ./model-weights/ \
  --framework pytorch

Step 2: Calculate GPU Requirements

Different models require different GPU memory:

Model Size   | Minimum GPU   | Recommended GPU | Memory Required
3B params    | RTX 4090      | A100 40GB       | 12GB
7B params    | RTX 4090      | A100 40GB       | 16GB
13B params   | A100 40GB     | A100 80GB       | 32GB
30B params   | A100 80GB     | H100 80GB       | 64GB
70B params   | 2× A100 80GB  | 4× H100 80GB    | 160GB

Memory Calculation

Required Memory = Model Size × Precision Multiplier × Safety Factor

- FP32: 4 bytes per parameter (32 bits)
- FP16: 2 bytes per parameter (16 bits)
- INT8: 1 byte per parameter (8 bits)
- Safety Factor: 1.2× (for activations, KV cache)

Example (Llama 3.1 8B in FP16):
8B × 2 bytes × 1.2 = 19.2 GB
→ Requires RTX 4090 (24GB) or A100 40GB
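If you would rather script this than do it by hand, here is a minimal sketch of the same arithmetic in Python. The GPU VRAM figures match the cards in the table above; the helper names are illustrative, not part of any Syaala SDK:

def required_gpu_memory_gb(params_billions, bytes_per_param=2.0, safety_factor=1.2):
    """Model size x precision multiplier x safety factor (FP16 by default)."""
    return params_billions * bytes_per_param * safety_factor

# Smallest single GPU (by VRAM) that fits the estimate
GPU_VRAM_GB = {"RTX 4090": 24, "A100 40GB": 40, "A100 80GB": 80, "H100 80GB": 80}

def smallest_gpu(required_gb):
    candidates = [(vram, name) for name, vram in GPU_VRAM_GB.items() if vram >= required_gb]
    return min(candidates)[1] if candidates else "multi-GPU required"

need = required_gpu_memory_gb(8)                             # Llama 3.1 8B in FP16 -> 19.2 GB
print(need, smallest_gpu(need))                              # 19.2 RTX 4090

need_int8 = required_gpu_memory_gb(8, bytes_per_param=1.0)   # INT8 -> 9.6 GB
print(need_int8, smallest_gpu(need_int8))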

💡 Pro Tip: Use the GPU Calculator in the dashboard for automatic recommendations.


Step 3: Select Runtime

Syaala supports three inference runtimes:

vLLM

Best For: Large language models, chat, text generation

Features:

  • PagedAttention for efficient memory usage
  • Continuous batching for high throughput
  • OpenAI-compatible API
  • Tensor parallelism for multi-GPU

Example Configuration:

syaala deployments create \
  --runtime vllm \
  --env MAX_MODEL_LEN=4096 \
  --env TENSOR_PARALLEL_SIZE=1 \
  --env GPU_MEMORY_UTILIZATION=0.9
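The same knobs map directly onto vLLM's own Python API, which can be handy for sanity-checking a configuration locally before deploying (a sketch, assuming a local GPU with enough VRAM and vllm installed):

from vllm import LLM, SamplingParams

# Mirrors the --env values above: MAX_MODEL_LEN, TENSOR_PARALLEL_SIZE, GPU_MEMORY_UTILIZATION
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_model_len=4096,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64, temperature=0.7))
print(outputs[0].outputs[0].text)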

Triton Inference Server

Best For: Multi-model serving, vision models, ensemble pipelines

Features:

  • Dynamic batching
  • Model ensemble pipelines
  • Multiple framework support (PyTorch, TensorFlow, ONNX)
  • Concurrent model execution

Example Configuration:

syaala deployments create \
  --runtime triton \
  --env MAX_BATCH_SIZE=32 \
  --env DYNAMIC_BATCHING=true
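Triton exposes standard readiness endpoints, so the stock tritonclient package can be used to verify that a model is loaded. The sketch below targets a local Triton server; how a Syaala deployment exposes Triton's native HTTP port isn't covered here, so the host and model name are placeholders:

import tritonclient.http as httpclient

# Point the client at the server's host:port (no scheme)
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("my-custom-model"))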

FastAPI (Custom)

Best For: Custom inference logic, preprocessing, non-standard models

Features:

  • Full Python control
  • Custom preprocessing/postprocessing
  • Integration with external services
  • REST + WebSocket support

Example:

# main.py
from fastapi import FastAPI
import torch

app = FastAPI()

# Load the model once at startup and switch it to inference mode
model = torch.load('/models/custom-model.pt')
model.eval()

@app.post('/predict')
async def predict(request: dict):
    # Convert the JSON payload to a tensor and run inference without tracking gradients
    inputs = torch.tensor(request['input'])
    with torch.no_grad():
        result = model(inputs)
    # Tensors are not JSON-serializable, so return a plain Python list
    return {'output': result.tolist()}
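Before handing main.py to Syaala, you might smoke-test it locally, for example by running uvicorn main:app and calling the route with any HTTP client. The payload shape below is only an example and depends on what your model expects:

import requests

# Assumes `uvicorn main:app --port 8000` is running locally
resp = requests.post("http://localhost:8000/predict",
                     json={"input": [[0.1, 0.2, 0.3]]})
print(resp.json())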

Step 4: Configure Scaling

Syaala auto-scales deployments based on load:

Scaling Parameters

{
  minReplicas: 1,      // Always keep 1 instance running
  maxReplicas: 10,     // Scale up to 10 instances max
  targetUtilization: 0.8, // Scale when GPU hits 80%
  scaleDownDelay: 300 // Wait 5min before scaling down
}

Scaling Strategies

Cost-Optimized (Development/Testing):

--min-replicas 0 --max-replicas 3 --scale-down-delay 60
  • Scales to zero when idle
  • Low cost, higher cold-start latency

Performance-Optimized (Production):

--min-replicas 2 --max-replicas 10 --scale-down-delay 600
  • Always-on replicas for instant responses
  • Higher cost, zero cold-starts

Balanced (Most Common):

--min-replicas 1 --max-replicas 5 --scale-down-delay 300
  • One instance always ready
  • Scales during traffic spikes

Step 5: Deploy

syaala deployments create \
  --name "llama-31-8b-production" \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --runtime vllm \
  --gpu "NVIDIA-RTX-4090" \
  --min-replicas 1 \
  --max-replicas 5 \
  --env MAX_MODEL_LEN=4096 \
  --env TEMPERATURE=0.7 \
  --env TOP_P=0.9

Deployment Process

Watch the deployment progress through these stages:

  1. Validating (10-30s): Checking model access, GPU availability
  2. Provisioning (1-2m): Allocating GPU hardware
  3. Downloading (2-5m): Pulling model weights from HuggingFace
  4. Loading (1-3m): Loading model into GPU memory
  5. Running (∞): Deployment ready, accepting requests

Total Time: Typically 5-10 minutes for first deployment (model needs to download). Subsequent deployments are faster due to caching.


Step 6: Test Your Endpoint

Once status shows running, test the deployment:

OpenAI-Compatible API

curl -X POST https://dep-a1b2c3d4.syaala.run/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "What are the benefits of GPU acceleration for AI?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
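Because the endpoint speaks the OpenAI API, the official openai Python client also works once base_url points at your deployment (same example deployment URL and SYAALA_API_KEY variable as elsewhere in this guide):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://dep-a1b2c3d4.syaala.run/v1",
    api_key=os.environ["SYAALA_API_KEY"],
)
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the benefits of GPU acceleration for AI?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(resp.choices[0].message.content)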

Streaming Responses

For real-time streaming (like ChatGPT):

const response = await fetch('https://dep-a1b2c3d4.syaala.run/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.SYAALA_API_KEY}`
  },
  body: JSON.stringify({
    model: 'meta-llama/Meta-Llama-3.1-8B-Instruct',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true // Enable streaming
  })
})
 
const reader = response.body!.getReader()
const decoder = new TextDecoder()
 
while (true) {
  const { done, value } = await reader.read()
  if (done) break
 
  const chunk = decoder.decode(value)
  const lines = chunk.split('\n').filter(line => line.trim())
 
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = line.slice(6)
      if (data === '[DONE]') break
 
      const json = JSON.parse(data)
      process.stdout.write(json.choices[0].delta.content || '')
    }
  }
}
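The same stream flag works from Python; with the openai client you iterate over chunks instead of parsing SSE lines by hand (a sketch, using the same placeholder deployment URL):

import os
from openai import OpenAI

client = OpenAI(base_url="https://dep-a1b2c3d4.syaala.run/v1",
                api_key=os.environ["SYAALA_API_KEY"])

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # Enable streaming
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)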

Step 7: Monitor Performance

Real-Time Metrics

View live metrics for your deployment:

syaala deployments metrics dep_a1b2c3d4e5f6

Metrics Tracked:

  • GPU Utilization: 0-100% (target: 60-80% for optimal cost)
  • Memory Usage: GB used / GB total
  • Throughput: Requests per second
  • Latency: p50, p95, p99 response times (see the client-side sketch after this list)
  • Active Replicas: Current replica count
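For a quick client-side view of those latency percentiles, you can time a handful of requests yourself. This is only an illustrative sketch against the OpenAI-compatible endpoint; the CLI and dashboard metrics remain the authoritative numbers:

import os, time, statistics
from openai import OpenAI

client = OpenAI(base_url="https://dep-a1b2c3d4.syaala.run/v1",
                api_key=os.environ["SYAALA_API_KEY"])

latencies = []
for _ in range(20):
    start = time.perf_counter()
    client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=8,
    )
    latencies.append(time.perf_counter() - start)

q = statistics.quantiles(latencies, n=100)   # 99 cut points
print(f"p50={q[49]:.2f}s  p95={q[94]:.2f}s  p99={q[98]:.2f}s")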

Metrics Dashboard

View comprehensive metrics in the Dashboard → Monitoring:

  • GPU Utilization Graph: 24-hour time series
  • Request Rate: Requests per minute
  • Error Rate: 4xx/5xx errors
  • Cost Breakdown: Hourly spend by resource

Setting Up Alerts

Configure alerts for critical events:

syaala alerts create \
  --name "High GPU Utilization" \
  --metric gpu_utilization \
  --condition ">" \
  --threshold 90 \
  --duration 300 \
  --notification slack

Production Checklist

Before going to production, ensure:

  • Model tested with representative workload
  • Scaling configured for expected traffic
  • Alerts set up for GPU utilization, errors, costs
  • Monitoring enabled with proper dashboards
  • API keys rotated from test to production keys
  • Rate limits understood for your plan tier
  • Cost estimates validated against budget
  • Backup deployment in different region (for critical services)

Troubleshooting

Deployment Fails to Start

Symptoms: Status stuck in provisioning or failed

Solutions:

  1. Check System Status for GPU availability
  2. Ensure sufficient GPU memory for model size
  3. Review deployment logs: syaala deployments logs dep_xxx
  4. Contact support if issue persists

High Latency

Symptoms: p95 latency > 2 seconds

Solutions:

  1. Increase min-replicas to reduce cold starts
  2. Enable GPU memory optimization: GPU_MEMORY_UTILIZATION=0.95
  3. Use tensor parallelism for large models: TENSOR_PARALLEL_SIZE=2
  4. Consider a faster GPU (RTX 4090 → A100)

Out of Memory Errors

Symptoms: CUDA out of memory in logs

Solutions:

  1. Reduce MAX_MODEL_LEN (context window)
  2. Reduce MAX_BATCH_SIZE
  3. Enable quantization: QUANTIZATION=int8
  4. Upgrade to larger GPU (24GB → 40GB → 80GB)

Next Steps

Congratulations! You’ve successfully deployed your first AI model. Now explore: