Raju Ghosh — SRE Lead & AI Platform Engineer

Trusted across multi-cloud production systems

AWS · GCP · Azure · Kubernetes · Terraform · FluxCD · Grafana · Istio · Temporal · Keycloak · Helm · Karpenter · GitHub Actions

about / me

About.

I'm Raju Ghosh, an SRE Lead focused on platform engineering and AI reliability for regulated SaaS. I lead reliability engineering at CoverGo, where I work across AWS, GCP, and Azure to improve resilience, observability, delivery safety, and operational scalability for production systems.

My current work sits at the boundary of SRE and AI platform engineering: inference-path reliability, token and cost visibility, production debugging for AI workflows, and the infrastructure patterns needed to run AI features safely in real products. I care about building the tooling, guardrails, and feedback loops that make AI systems operable after launch — not just demoable before launch.

I lead reliability and platform operations for five SaaS and enterprise products across three cloud providers, spanning 50+ clusters and customers in Singapore, Hong Kong, Canada, the Middle East, and Europe, with cross-cloud disaster recovery for critical workloads.

Before this, I spent years in network engineering, systems administration, and DevOps, which gave me a strong bias toward practical architecture: simple where possible, automated where repeatable, observable by default, and designed for failure.

focus / areas

What I work on.

01 /

AI reliability engineering for production systems

02 /

Observability for latency, token cost, and failure modes in LLM and AI workflows

03 /

Platform engineering for AI workloads on Kubernetes

04 /

Multi-cloud infrastructure across AWS, GCP, and Azure

05 /

GitOps, Terraform, and delivery guardrails for safer change management

06 /

Resilience engineering, incident response, SLOs, and disaster recovery

stdout / live

In production.

production-ap-southeast — live event stream raju@production-ap-southeast ~ $

03:14:15Z INFO flux-system » HelmRelease synced nginx-ingress revision=v4.10.1

03:14:22Z WARN ai-inference » worker-7d9f latency_p99=847ms threshold=500ms scaling=true

03:14:30Z INFO kube-system » node/ip-10-0-1-42 transitioned to Ready

03:14:35Z INFO monitoring » PrometheusRule applied rule=slo-alerts/api-99.9

03:14:41Z INFO production » deployment/api-gateway rollout complete replicas=3/3 strategy=RollingUpdate

03:14:48Z INFO keycloak » token exchange ok realm=production latency=12ms

03:14:55Z INFO flux-system » Kustomization applied name=cluster-addons revision=v1.9.2

03:15:01Z WARN ai-inference » ocr-worker retry=2 model=gemini-1.5-pro timeout=30s

03:15:08Z INFO karpenter » provisioned type=m5.2xlarge zone=ap-east-1a cost=+$0.38/h

03:15:14Z INFO istio » xDS push clusters=142 listeners=89 routes=67

03:15:20Z INFO flux-system » app/saas-platform Synced Healthy generation=48

03:15:27Z INFO cert-manager » certificate renewed expires=2026-07-05 host=*.prod.internal

03:15:33Z WARN ai-inference » ocr-worker queue_depth=847 workers=12 autoscaling to 18

03:15:40Z INFO alertmanager » alert resolved name=HighErrorRate duration=3m48s

03:15:46Z INFO kube-system » coredns health check passed latency=0.8ms replicas=2/2