Raju Ghosh
SRE Lead · AI Platform Architect
I build reliable AI platforms for regulated SaaS.
SRE Lead with 14+ years across network engineering, DevOps, and site reliability. I design and operate multi-cloud platforms, Kubernetes infrastructure, and production AI systems that are observable, secure, and resilient across AWS, GCP, and Azure.
// I build AI infrastructure that doesn't page at 3am.
Experience: 14.3 yrs
Clusters: 50+
Clouds: 3 providers
Products: 5 SaaS
Trusted across multi-cloud production systems
AWS, GCP, Azure, Kubernetes, Terraform, FluxCD, Grafana, Istio, Temporal, Keycloak, Helm, Karpenter, GitHub Actions

About.

I'm Raju Ghosh, an SRE Lead focused on platform engineering and AI reliability for regulated SaaS. I lead reliability engineering at CoverGo, where I work across AWS, GCP, and Azure to improve resilience, observability, delivery safety, and operational scalability for production systems.

My current work sits at the boundary of SRE and AI platform engineering: inference-path reliability, token and cost visibility, production debugging for AI workflows, and the infrastructure patterns needed to run AI features safely in real products. I care about building the tooling, guardrails, and feedback loops that make AI systems operable after launch, not just demoable before it.

I lead reliability and platform operations for five SaaS and enterprise products across three cloud providers, spanning 50+ clusters and customers in Singapore, Hong Kong, Canada, the Middle East, and Europe, with cross-cloud disaster recovery for critical workloads.

Before this, I spent years in network engineering, systems administration, and DevOps, which gave me a strong bias toward practical architecture: simple where possible, automated where repeatable, observable by default, and designed for failure.

What I work on.

01 / AI reliability engineering for production systems
02 / Observability for latency, token cost, and failure modes in LLM and AI workflows
03 / Platform engineering for AI workloads on Kubernetes
04 / Multi-cloud infrastructure across AWS, GCP, and Azure
05 / GitOps, Terraform, and delivery guardrails for safer change management
06 / Resilience engineering, incident response, SLOs, and disaster recovery

In production.

production-ap-southeast: live event stream
raju@production-ap-southeast ~ $
03:14:15Z INFO flux-system » HelmRelease synced nginx-ingress revision=v4.10.1
03:14:22Z WARN ai-inference » worker-7d9f latency_p99=847ms threshold=500ms scaling=true
03:14:30Z INFO kube-system » node/ip-10-0-1-42 transitioned to Ready
03:14:35Z INFO monitoring » PrometheusRule applied rule=slo-alerts/api-99.9
03:14:41Z INFO production » deployment/api-gateway rollout complete replicas=3/3 strategy=RollingUpdate
03:14:48Z INFO keycloak » token exchange ok realm=production latency=12ms
03:14:55Z INFO flux-system » Kustomization applied name=cluster-addons revision=v1.9.2
03:15:01Z WARN ai-inference » ocr-worker retry=2 model=gemini-1.5-pro timeout=30s
03:15:08Z INFO karpenter » provisioned type=m5.2xlarge zone=ap-east-1a cost=+$0.38/h
03:15:14Z INFO istio » xDS push clusters=142 listeners=89 routes=67
03:15:20Z INFO flux-system » app/saas-platform Synced Healthy generation=48
03:15:27Z INFO cert-manager » certificate renewed expires=2026-07-05 host=*.prod.internal
03:15:33Z WARN ai-inference » ocr-worker queue_depth=847 workers=12 autoscaling to 18
03:15:40Z INFO alertmanager » alert resolved name=HighErrorRate duration=3m48s
03:15:46Z INFO kube-system » coredns health check passed latency=0.8ms replicas=2/2

Selected work.

A few themes from the systems I design and operate:

01 / Multi-cloud platform operations
Built and operated shared platform capabilities across AWS, GCP, and Azure for multiple SaaS and enterprise products, balancing standardization with regional and product-specific requirements.
02 / AI systems in production
Supported real-world AI workloads including OCR and product intelligence workflows, with a focus on reliability, observability, operational debugging, and cost-aware inference paths.
03 / Resilience by design
Designed platform patterns for high availability, safer rollouts, and cross-cloud disaster recovery for critical workloads running across 50+ clusters.
04 / Delivery safety and platform guardrails
Used Kubernetes, Terraform, FluxCD, Helm, and GitHub Actions to improve repeatability, reduce operational risk, and scale engineering delivery without losing control.

Notes.

I write about SRE, platform engineering, AI reliability, observability, and the operational realities of running modern systems in production. My notes are where I turn incidents, architecture tradeoffs, and field experience into reusable patterns.
→ read my notes

Contact.

I'm most interested in Staff, Principal, and AI Platform Architect conversations focused on reliability, platform design, AI infrastructure, and regulated SaaS systems.