Service

Dedicated Deployment.

On-prem, VPC, or dedicated cloud tenancy. Your data, your weights, your audit trail. We own the operational complexity; you keep the keys. The default deployment shape for every Agrus engagement.


Three shapes

VPC. Dedicated cloud. On-prem.

VPC deployment

The most common shape. The AI system runs inside your existing AWS, GCP, or Azure VPC. Inference traffic stays inside your private network. Audit logs go to your existing SIEM. Collecting SOC 2 and ISO 27001 evidence is easier here because the controls extend your existing compliance program.

Dedicated cloud tenancy

A separate cloud account or tenancy dedicated to the AI workload, outside your production VPC but still in a cloud you control. This is the right shape when you want distinct cost visibility for AI, different egress requirements, or multiple AI workloads at different sensitivity levels.

On-prem deployment

The model runs on hardware you own, in a data centre you control. Used mostly by defense contractors, healthcare systems with strict data-residency rules, certain investigation firms, and customers in jurisdictions where regulators effectively require on-prem.

The detailed framework, with cost models and selection criteria, is in our Private LLM Deployment Guide.

What we own, what you own

Clean separation of operational responsibility.

Agrus owns

• Reference architecture and deployment patterns
• IaC modules (Terraform, CDK, or Pulumi)
• Inference layer wrapper + observability
• Eval harness and drift detection
• Runbook authorship and 24/7 on-call (with a Managed SLA)
• Model upgrade testing and roll-out

You own

• The cloud account / on-prem hardware
• Model weights (after deployment)
• All prompt and completion data
• The audit log destination (your SIEM)
• IAM and human-access control
• The codebase — full source, no vendor lock-in

Hardware & GPU sizing

We pick GPUs based on workload, not vendor preference.

Sizing happens after model selection. A 70B-class model serving moderate concurrency typically needs 2-4 H100s, or equivalent capacity on AMD or B200 hardware. 7B-14B models for high-volume routing workloads run well on a single H100 or a pair of L40s. We size for the production workload plus 30-40% headroom.
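The headroom rule can be sketched as a small calculation. This is an illustrative sketch only, not our sizing tool: the function name `gpus_with_headroom` is hypothetical, and it assumes headroom is applied as a percentage on top of a baseline GPU count that has already been measured for the production workload, then rounded up to whole GPUs.

```python
def gpus_with_headroom(baseline_gpus: int, headroom_pct: int) -> int:
    """Return the GPU count after adding percentage headroom, rounded up.

    baseline_gpus: GPUs needed to serve the measured production workload
                   (an input you supply; this sketch does not estimate it).
    headroom_pct:  extra capacity in percent, e.g. 30 or 40.
    """
    if baseline_gpus < 1:
        raise ValueError("baseline_gpus must be at least 1")
    if headroom_pct < 0:
        raise ValueError("headroom_pct must not be negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    scaled = baseline_gpus * (100 + headroom_pct)
    return (scaled + 99) // 100


if __name__ == "__main__":
    for baseline in (2, 4):
        low = gpus_with_headroom(baseline, 30)
        high = gpus_with_headroom(baseline, 40)
        print(f"baseline {baseline} GPUs -> {low} to {high} GPUs with headroom")
```

Because GPUs come in whole units, small clusters round up noticeably: under these assumptions a 2-GPU baseline becomes 3 GPUs at either end of the 30-40% range.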

For on-prem deployments we work with your hardware vendor of choice (Supermicro, Dell, HPE, Lenovo). We have reference designs with each. Getting from hardware procurement to production deployment typically takes 6-12 weeks for the first cluster, then 1-3 weeks for each subsequent replica.

Pricing

Discovery Sprint includes a deployment architecture recommendation.

The sprint costs $15-30K. The deployment build is then billed on a time-and-materials (T&M) basis at a blended rate of $200-280/hr.
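For rough budgeting, the figures above combine into a simple range. The sketch below is illustrative only: the function name and the `build_hours` input are hypothetical, and the number of build hours for a real engagement is not stated on this page and has to come from scoping.

```python
# Published ranges from this page, in US dollars.
SPRINT_USD = (15_000, 30_000)      # Discovery Sprint, low and high
RATE_USD_PER_HOUR = (200, 280)     # blended T&M rate, low and high


def engagement_cost_range(build_hours: int) -> tuple[int, int]:
    """Return (low, high) total cost for the sprint plus a T&M build.

    build_hours is an assumed estimate supplied by the caller.
    """
    if build_hours < 0:
        raise ValueError("build_hours must not be negative")
    low = SPRINT_USD[0] + build_hours * RATE_USD_PER_HOUR[0]
    high = SPRINT_USD[1] + build_hours * RATE_USD_PER_HOUR[1]
    return low, high


if __name__ == "__main__":
    # 400 hours is an arbitrary example value, not a typical engagement size.
    low, high = engagement_cost_range(400)
    print(f"Estimated total: ${low:,} to ${high:,}")
```

The range widens with the hour count, so the estimate is only as good as the scoping behind `build_hours`.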
