Your First Deployment
This comprehensive guide walks you through every step of deploying an AI model on Syaala Platform, from model selection to production monitoring.
Overview
Deploying a model on Syaala involves:
- Choosing a model from HuggingFace or uploading a custom model
- Selecting GPU hardware based on model requirements
- Configuring runtime (vLLM, Triton, FastAPI)
- Setting scaling parameters for auto-scaling
- Deploying via CLI, SDK, or Dashboard
- Testing the OpenAI-compatible endpoint
- Monitoring GPU metrics and performance
Estimated Time: 15-20 minutes for first deployment
Step 1: Choose Your Model
Syaala supports models from multiple sources:
HuggingFace Hub (Recommended)
Browse HuggingFace Models and select a model:
Popular Choices:
- LLMs: `meta-llama/Meta-Llama-3.1-8B-Instruct`, `mistralai/Mistral-7B-Instruct-v0.3`
- Vision: `openai/clip-vit-large-patch14`, `Salesforce/blip-image-captioning-base`
- Audio: `openai/whisper-large-v3`, `facebook/wav2vec2-large-960h`
Model License Agreement
Gated Models: Some models (like Llama 3.1) require license acceptance on HuggingFace. Syaala handles authentication for you - simply ensure you’ve accepted the model license on the HuggingFace website before deploying.
Custom Models
Upload your own model weights to Syaala Storage:
```bash
syaala models upload \
  --name "my-custom-model" \
  --path ./model-weights/ \
  --framework pytorch
```
Step 2: Calculate GPU Requirements
Different models require different GPU memory:
| Model Size | Minimum GPU | Recommended GPU | Memory Required |
|---|---|---|---|
| 3B params | RTX 4090 | A100 40GB | 12GB |
| 7B params | RTX 4090 | A100 40GB | 16GB |
| 13B params | A100 40GB | A100 80GB | 32GB |
| 30B params | A100 80GB | H100 80GB | 64GB |
| 70B params | 2× A100 80GB | 4× H100 80GB | 160GB |
Memory Calculation
Required Memory = Model Size × Precision Multiplier × Safety Factor
- FP32: 4 bytes per parameter (32 bits)
- FP16: 2 bytes per parameter (16 bits)
- INT8: 1 byte per parameter (8 bits)
- Safety Factor: 1.2× (for activations, KV cache)
Example (Llama 3.1 8B in FP16):
8B × 2 bytes × 1.2 = 19.2 GB
→ Requires RTX 4090 (24GB) or A100 40GB
💡 Pro Tip: Use the GPU Calculator in the dashboard for automatic recommendations.
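If you prefer to script the estimate, here is a minimal Python sketch of the same formula. The byte sizes and 1.2× safety factor mirror the values above; the function name is just for illustration.

```python
# Rough GPU memory estimate: params × bytes-per-param × 1.2 safety factor
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def required_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Estimate the GPU memory (GB) needed to serve a model."""
    return params_billions * BYTES_PER_PARAM[precision] * 1.2

# Llama 3.1 8B in FP16 → 19.2 GB, so a 24GB RTX 4090 or an A100 40GB fits
print(required_memory_gb(8, "fp16"))  # 19.2
print(required_memory_gb(8, "int8"))  # 9.6
```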
Step 3: Select Runtime
Syaala supports three inference runtimes:
vLLM (Recommended for LLMs)
Best For: Large language models, chat, text generation
Features:
- PagedAttention for efficient memory usage
- Continuous batching for high throughput
- OpenAI-compatible API
- Tensor parallelism for multi-GPU
Example Configuration:
```bash
syaala deployments create \
  --runtime vllm \
  --env MAX_MODEL_LEN=4096 \
  --env TENSOR_PARALLEL_SIZE=1 \
  --env GPU_MEMORY_UTILIZATION=0.9
```
Triton Inference Server
Best For: Multi-model serving, vision models, ensemble pipelines
Features:
- Dynamic batching
- Model ensemble pipelines
- Multiple framework support (PyTorch, TensorFlow, ONNX)
- Concurrent model execution
Example Configuration:
```bash
syaala deployments create \
  --runtime triton \
  --env MAX_BATCH_SIZE=32 \
  --env DYNAMIC_BATCHING=true
```
FastAPI (Custom)
Best For: Custom inference logic, preprocessing, non-standard models
Features:
- Full Python control
- Custom preprocessing/postprocessing
- Integration with external services
- REST + WebSocket support
Example:
```python
# main.py
from fastapi import FastAPI
import torch

app = FastAPI()

# Load the model once at startup so every request reuses the same weights
model = torch.load('/models/custom-model.pt')
model.eval()

@app.post('/predict')
async def predict(request: dict):
    # Convert the JSON payload to a tensor and run inference without gradients
    inputs = torch.tensor(request['input'])
    with torch.no_grad():
        result = model(inputs)
    return {'output': result.tolist()}
```
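Once a custom FastAPI deployment is running, you call the route you defined rather than the OpenAI-compatible path. A minimal client sketch, assuming the placeholder `dep-a1b2c3d4.syaala.run` hostname used elsewhere in this guide and the same bearer-token authentication as the OpenAI-compatible endpoints:

```python
import os
import requests

# Placeholder deployment URL; substitute your own deployment's hostname
url = "https://dep-a1b2c3d4.syaala.run/predict"

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['SYAALA_API_KEY']}"},
    json={"input": [1.0, 2.0, 3.0]},  # shape this to whatever your model expects
    timeout=30,
)
response.raise_for_status()
print(response.json())  # {'output': [...]}
```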
Step 4: Configure Scaling
Syaala auto-scales deployments based on load:
Scaling Parameters
```jsonc
{
  minReplicas: 1,          // Always keep 1 instance running
  maxReplicas: 10,         // Scale up to 10 instances max
  targetUtilization: 0.8,  // Scale when GPU hits 80%
  scaleDownDelay: 300      // Wait 5min before scaling down
}
```
Scaling Strategies
Cost-Optimized (Development/Testing):
`--min-replicas 0 --max-replicas 3 --scale-down-delay 60`
- Scales to zero when idle
- Low cost, higher cold-start latency
Performance-Optimized (Production):
`--min-replicas 2 --max-replicas 10 --scale-down-delay 600`
- Always-on replicas for instant responses
- Higher cost, zero cold-starts
Balanced (Most Common):
`--min-replicas 1 --max-replicas 5 --scale-down-delay 300`
- One instance always ready
- Scales during traffic spikes
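The main difference between these strategies is how much always-on capacity you pay for versus how often users hit a cold start. A rough baseline-cost sketch, using a purely hypothetical hourly GPU rate (substitute your plan's actual pricing):

```python
# Hypothetical hourly GPU rate; replace with your plan's actual pricing
HOURLY_RATE_USD = 2.00
HOURS_PER_MONTH = 730

def baseline_monthly_cost(min_replicas: int, rate: float = HOURLY_RATE_USD) -> float:
    """Cost of keeping min_replicas running 24/7, before any autoscaled burst capacity."""
    return min_replicas * rate * HOURS_PER_MONTH

print(baseline_monthly_cost(0))  # cost-optimized: no baseline cost, but cold starts
print(baseline_monthly_cost(1))  # balanced: ~1,460 at the assumed rate
print(baseline_monthly_cost(2))  # performance-optimized: ~2,920 at the assumed rate
```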
Step 5: Deploy
```bash
syaala deployments create \
  --name "llama-31-8b-production" \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --runtime vllm \
  --gpu "NVIDIA-RTX-4090" \
  --min-replicas 1 \
  --max-replicas 5 \
  --env MAX_MODEL_LEN=4096 \
  --env TEMPERATURE=0.7 \
  --env TOP_P=0.9
```
Deployment Process
Watch the deployment progress through these stages:
- Validating (10-30s): Checking model access, GPU availability
- Provisioning (1-2m): Allocating GPU hardware
- Downloading (2-5m): Pulling model weights from HuggingFace
- Loading (1-3m): Loading model into GPU memory
- Running (∞): Deployment ready, accepting requests
Total Time: Typically 5-10 minutes for a first deployment, since the model weights need to download. Subsequent deployments are faster due to caching.
Step 6: Test Your Endpoint
Once the status shows `running`, test the deployment:
OpenAI-Compatible API
```bash
curl -X POST https://dep-a1b2c3d4.syaala.run/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "What are the benefits of GPU acceleration for AI?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
```
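Because the endpoint is OpenAI-compatible, you can also point the official `openai` Python client at it. A minimal sketch, assuming the same placeholder deployment URL and an API key stored in `SYAALA_API_KEY`:

```python
import os
from openai import OpenAI

# Point the OpenAI client at your deployment's base URL
client = OpenAI(
    base_url="https://dep-a1b2c3d4.syaala.run/v1",
    api_key=os.environ["SYAALA_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the benefits of GPU acceleration for AI?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)
```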
Streaming Responses
For real-time streaming (like ChatGPT):
```typescript
const response = await fetch('https://dep-a1b2c3d4.syaala.run/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.SYAALA_API_KEY}`
  },
  body: JSON.stringify({
    model: 'meta-llama/Meta-Llama-3.1-8B-Instruct',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true // Enable streaming
  })
})

const reader = response.body!.getReader()
const decoder = new TextDecoder()

while (true) {
  const { done, value } = await reader.read()
  if (done) break

  const chunk = decoder.decode(value)
  const lines = chunk.split('\n').filter(line => line.trim())

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = line.slice(6)
      if (data === '[DONE]') break
      const json = JSON.parse(data)
      process.stdout.write(json.choices[0].delta.content || '')
    }
  }
}
```
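The same streaming behaviour is available from Python by passing `stream=True` to the `openai` client. A minimal sketch, again assuming the placeholder deployment URL and an API key in `SYAALA_API_KEY`:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://dep-a1b2c3d4.syaala.run/v1",
    api_key=os.environ["SYAALA_API_KEY"],
)

# stream=True yields chunks carrying incremental token deltas
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```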
Step 7: Monitor Performance
Real-Time Metrics
View live metrics for your deployment:
```bash
syaala deployments metrics dep_a1b2c3d4e5f6
```
Metrics Tracked:
- GPU Utilization: 0-100% (target: 60-80% for optimal cost)
- Memory Usage: GB used / GB total
- Throughput: Requests per second
- Latency: p50, p95, p99 response times
- Active Replicas: Current replica count
Metrics Dashboard
View comprehensive metrics in the Dashboard → Monitoring:
- GPU Utilization Graph: 24-hour time series
- Request Rate: Requests per minute
- Error Rate: 4xx/5xx errors
- Cost Breakdown: Hourly spend by resource
Setting Up Alerts
Configure alerts for critical events:
```bash
syaala alerts create \
  --name "High GPU Utilization" \
  --metric gpu_utilization \
  --condition ">" \
  --threshold 90 \
  --duration 300 \
  --notification slack
```
Production Checklist
Before going to production, ensure:
- Model tested with a representative workload (see the smoke-test sketch after this checklist)
- Scaling configured for expected traffic
- Alerts set up for GPU utilization, errors, costs
- Monitoring enabled with proper dashboards
- API keys rotated from test to production keys
- Rate limits understood for your plan tier
- Cost estimates validated against budget
- Backup deployment in different region (for critical services)
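To cover the first checklist item, a small latency smoke test can exercise the endpoint before you send real traffic. A minimal sketch, assuming the placeholder deployment URL used earlier and an API key in `SYAALA_API_KEY`; tune the request count and prompt to match your workload:

```python
import os
import time
import requests

URL = "https://dep-a1b2c3d4.syaala.run/v1/chat/completions"  # placeholder deployment URL
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['SYAALA_API_KEY']}",
}
PAYLOAD = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Summarize GPU acceleration in one sentence."}],
    "max_tokens": 100,
}

# Send a small batch of requests and record end-to-end latency for each
latencies = []
for _ in range(20):  # small sample; increase for a realistic test
    start = time.perf_counter()
    r = requests.post(URL, headers=HEADERS, json=PAYLOAD, timeout=60)
    r.raise_for_status()
    latencies.append(time.perf_counter() - start)

latencies.sort()

def pct(p: float) -> float:
    """Nearest-rank percentile over the collected latencies."""
    return latencies[min(len(latencies) - 1, int(p / 100 * len(latencies)))]

print(f"p50={pct(50):.2f}s  p95={pct(95):.2f}s  p99={pct(99):.2f}s")
```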
Troubleshooting
Deployment Fails to Start
Symptoms: Status stuck in provisioning or failed
Solutions:
- Check System Status for GPU availability
- Ensure sufficient GPU memory for model size
- Review deployment logs: `syaala deployments logs dep_xxx`
- Contact support if the issue persists
High Latency
Symptoms: p95 latency > 2 seconds
Solutions:
- Increase `min-replicas` to reduce cold starts
- Enable GPU memory optimization: `GPU_MEMORY_UTILIZATION=0.95`
- Use tensor parallelism for large models: `TENSOR_PARALLEL_SIZE=2`
- Consider a faster GPU (RTX 4090 → A100)
Out of Memory Errors
Symptoms: CUDA out of memory in logs
Solutions:
- Reduce `MAX_MODEL_LEN` (context window)
- Reduce `MAX_BATCH_SIZE`
- Enable quantization: `QUANTIZATION=int8`
- Upgrade to a larger GPU (24GB → 40GB → 80GB)
Next Steps
Congratulations! You’ve successfully deployed your first AI model. Now explore:
- API Reference: Complete deployment API documentation
- SDK Integration: Integrate into your application
- Monitoring Guide: Set up comprehensive monitoring
- Cost Optimization: Reduce GPU costs