I'm Raju Ghosh, an SRE Lead focused on platform engineering and AI reliability for regulated SaaS. I lead reliability engineering at CoverGo, where I work across AWS, GCP, and Azure to improve resilience, observability, delivery safety, and operational scalability for production systems.
My current work sits at the boundary of SRE and AI platform engineering: inference-path reliability, token and cost visibility, production debugging for AI workflows, and the infrastructure patterns needed to run AI features safely in real products. I care about building the tooling, guardrails, and feedback loops that make AI systems operable after launch, not just demoable before it.
I lead reliability and platform operations for five SaaS and enterprise products across three cloud providers, spanning 50+ clusters and serving customers in Singapore, Hong Kong, Canada, the Middle East, and Europe, with cross-cloud disaster recovery for critical workloads.
Before this, I spent years in network engineering, systems administration, and DevOps, which gave me a strong bias toward practical architecture: simple where possible, automated where repeatable, observable by default, and designed for failure.
A few themes from the systems I design and operate: