PROJECTS:
With 12+ years in IT, I specialize in Site Reliability Engineering (SRE) and DevOps, building reliable, scalable, and secure cloud-native systems. I design and operate resilient production platforms using AWS, Azure, Kubernetes, CI/CD, and observability stacks such as Prometheus, Grafana, OpenTelemetry, and Datadog. I am passionate about automation, AI-driven monitoring, and optimizing infrastructure for high availability and efficiency. I have supported mission-critical systems across banking, telecom, and SaaS domains, enabling thousands of seamless production releases. I thrive on solving complex operational challenges and implementing intelligent automation that reduces downtime and manual effort.
Key Engineering Achievements:
★ Built multi-tenant observability platforms with Prometheus, Grafana, OpenTelemetry, and Datadog, improving incident detection and system reliability.
★ Automated CI/CD pipelines and production deployments across AWS and Azure, ensuring zero-downtime releases and operational efficiency.
★ Optimized cloud infrastructure costs using intelligent autoscaling, Kubernetes orchestration, and resource management.
★ Implemented AI/ML-driven anomaly detection and predictive monitoring to proactively prevent system failures.
★ Led incident response, root cause analysis, and automation of operational workflows for mission-critical systems across multiple domains and countries.
TraceLink
OPUS (Orchestrated Platform for Unified Supply):
TraceLink is a SaaS-based supply chain network platform focused on pharmaceutical track-and-trace compliance under DSCSA regulations. It enables life sciences companies to monitor, verify, and manage the movement of medicines across the supply chain to ensure safety, authenticity, and regulatory compliance.
OPUS is TraceLink’s PaaS solution that connects companies, systems, and stakeholders through real-time supply chain orchestration. It enables seamless system-to-system and business process integration, improving visibility, operational efficiency, and supply chain resilience.
Responsibilities:
★ Designed and deployed CI/CD pipelines using GitLab, Jenkins, and FluxCD, improving delivery speed by 40% and reducing deployment errors through automated quality gates and rollback strategies.
★ Architected a multi-tenant observability framework using Prometheus, Grafana, OpenTelemetry, and YACE, reducing troubleshooting time by 50% and improving MTTR.
★ Automated scaling of Kubernetes workloads using KEDA and Karpenter, decreasing cloud spend by 25% and improving response times under high load.
★ Developed infrastructure using Terraform and AWS CloudFormation, enabling consistent, resilient, and version-controlled infrastructure as code (IaC).
★ Secured Kubernetes clusters by enforcing pod security policies (PSP/OPA) and following SRE practices, reducing container-related vulnerabilities by 70% and improving overall platform reliability.
★ Built and supported Kafka messaging solutions for high-throughput data streaming, and integrated secure email ingestion workflows using AWS SES.
★ Integrated Python and Prometheus-based quality gates into CI/CD pipelines to enforce pre- and post-deployment checks and enhance release reliability.
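As an illustration of the Python and Prometheus-based quality gates mentioned above, the following is a minimal sketch, not the actual pipeline code. It assumes the gate receives the JSON body of a Prometheus instant query (an instant vector) and passes only when every returned series is at or below a threshold. The function name `evaluate_gate`, the `max_error_ratio` parameter, and the sample metric labels are hypothetical.

```python
import json


def evaluate_gate(prom_response: str, max_error_ratio: float) -> bool:
    """Return True only if every series in the query result is within the threshold.

    `prom_response` is assumed to be the JSON body of a Prometheus instant
    query whose result type is an instant vector.
    """
    payload = json.loads(prom_response)
    if payload.get("status") != "success":
        return False  # fail closed when the query itself failed
    results = payload.get("data", {}).get("result", [])
    if not results:
        return False  # fail closed when there is no data to judge
    # Each instant-vector sample looks like {"metric": {...}, "value": [ts, "0.002"]}
    return all(float(sample["value"][1]) <= max_error_ratio for sample in results)


if __name__ == "__main__":
    # Hypothetical post-deployment check: error ratio per service must stay under 1%.
    body = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"service": "api"}, "value": [1700000000, "0.002"]},
            ],
        },
    })
    print("gate passed" if evaluate_gate(body, 0.01) else "gate failed")
```

In a pipeline, a failed gate would typically translate into a non-zero exit code so the CI/CD stage stops or triggers a rollback; failing closed on missing data is one possible design choice, not necessarily the one used in the project.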
Singtel
Open Platform:
Open Platform is a payment gateway project. Through it, Singtel handles financial transactions and charging between content providers such as Google Play and service providers such as TSEL and Optus. Open Platform supports two types of billing: carrier billing and non-carrier billing. With carrier billing, a customer can buy apps and game credits and pay for them along with the phone bill (i.e., the customer must be a post-paid customer of the service provider), which removes the hassle of entering credit card details for each payment. With non-carrier billing, the customer enters debit or credit card details to buy a product from a vendor or content from a content provider. Open Platform also provides an identity service API that uses a one-time PIN to verify the user before allowing a purchase.
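The billing split and the one-time PIN check described above can be sketched roughly as follows. This is an illustrative sketch only, not Open Platform's implementation: the six-digit PIN, the 300-second validity window, the preference for carrier billing when the customer is post-paid, and all names (`issue_pin`, `verify_pin`, `choose_billing`, `PinChallenge`) are assumptions.

```python
import hmac
import secrets
import time
from dataclasses import dataclass
from typing import Optional

PIN_TTL_SECONDS = 300  # assumed validity window for a one-time PIN


@dataclass
class PinChallenge:
    pin: str
    issued_at: float


def issue_pin(now: Optional[float] = None) -> PinChallenge:
    """Create a random six-digit one-time PIN (the length is an assumption)."""
    pin = f"{secrets.randbelow(10**6):06d}"
    return PinChallenge(pin=pin, issued_at=time.time() if now is None else now)


def verify_pin(challenge: PinChallenge, submitted: str, now: Optional[float] = None) -> bool:
    """Accept the PIN only if it matches and has not expired."""
    current = time.time() if now is None else now
    if current - challenge.issued_at > PIN_TTL_SECONDS:
        return False
    # Constant-time comparison avoids leaking how many digits matched.
    return hmac.compare_digest(challenge.pin.encode(), submitted.encode())


def choose_billing(is_postpaid: bool, has_card: bool) -> str:
    """Pick a billing type: carrier billing needs a post-paid customer,
    non-carrier billing needs card details."""
    if is_postpaid:
        return "carrier"
    if has_card:
        return "non-carrier"
    raise ValueError("no eligible billing method")


if __name__ == "__main__":
    challenge = issue_pin()
    if verify_pin(challenge, challenge.pin):
        print(choose_billing(is_postpaid=True, has_card=False))
```

A real service would also invalidate the PIN after one successful use and limit retry attempts; those steps are omitted here for brevity.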
★ Architected and built secure, scalable AWS cloud infrastructure from scratch using Terraform and CloudFormation (Infrastructure as Code), designing multi-environment (SAND/UAT/PROD) architectures with VPC segmentation, Load Balancers, Auto Scaling, and high availability (Multi-AZ) deployment strategies.
★ Led end-to-end cloud migration from on-premises to AWS, modernizing legacy workloads into containerized microservices using Docker and orchestrating deployments on Kubernetes (EKS) and AWS Fargate, enabling improved scalability, resiliency, and infrastructure portability.
★ Established robust CI/CD pipelines using Jenkins, Maven, Nexus, Git (Bitbucket), and RunDeck, implementing automated build, test, artifact management, branching strategies, and zero-downtime release processes aligned with SRE reliability principles.
★ Owned production reliability and operational excellence by implementing proactive monitoring, alerting, and observability using New Relic and Zabbix, performing deep log analysis, performance tuning, incident response, RCA reporting, and reducing MTTR through automation and alert optimization.
★ Designed and enforced cloud governance, release controls, and audit-compliant change management processes, integrating configuration management (Puppet), automated deployments, SQL validations, and structured release orchestration while driving Agile-based delivery and cross-functional stakeholder collaboration.
DBS
DBS (Development Bank of Singapore) Mobile App
DBS Mobile App is a mobile banking application developed using the Mobeix product. It contains modules such as Account Summary, Fund Transfer, Payments, Investment Services, and Cards. To lead the mobile banking sector, DBS introduced new concepts to the market, such as PayLah!. I worked as a DevOps & Production Support Engineer to stabilize the release environment of this application.
★ Delivered 24x7 L2 Production Support as the primary point of contact for critical incidents, troubleshooting application and infrastructure issues on Linux/JBoss servers, ensuring high availability and minimal downtime in banking production environments.
★ Executed production deployments on banking servers using IBM Rational Rose and IBM Rational Quest, managing controlled releases in compliance-driven environments with strict audit and governance standards.
★ Conducted detailed Root Cause Analysis (RCA) for production incidents and build failures, analyzing server logs and application behavior to implement preventive fixes and improve release stability.
★ Automated recurring operational tasks using Shell scripting and Cron jobs, proactively monitoring server health and reducing repetitive production issues.
★ Managed Change, Release, and Disaster Recovery (DR) processes across UAT and Production environments, ensuring audit compliance, documentation accuracy, and seamless cross-team coordination during high-severity incidents.