sharan@sre:~$ whoami

Sharan Chenna

Site Reliability Engineer

sharan@sre:~$ curl -s -O https://www.sharanch.dev/assets/Sharan_Cloud_SRE_DevOps_Resume.pdf

sharan@sre:~$ cat summary.txt

Results-driven Site Reliability Engineer with 4+ years of experience managing large-scale distributed systems. Specializes in improving system reliability, implementing robust monitoring and alerting, and automating operational tasks. Skilled in incident response and performance optimization, and in fostering a culture of blameless postmortems.

sharan@sre:~$ ls -l work-experience

Cloud Operations Engineer | OCI Compute (Control & Data Plane)

Oracle (Feb 2024 - Sep 2025)

> Hypervisor Fleet Management: Managed the lifecycle of a fleet of 8,000+ KVM/libvirt hypervisors, using automated health checks to ensure 99.9% availability for critical OCI compute services.
> Infrastructure Automation: Automated provisioning for thousands of nodes using Terraform and Ansible, reducing manual configuration time by 40% and ensuring consistent state enforcement across global regions.
> Observability & Monitoring: Engineered comprehensive Grafana dashboards using custom telemetry, reducing Mean Time to Detection (MTTD) for critical incidents by 25% through proactive alerting.
> Incident Management: Led troubleshooting for high-severity incidents, improving SLA compliance to 99.9%+ by standardizing root cause analysis (RCA) and reducing recurring issues by 20%.
> Process Improvement: Developed automated JIRA dashboards and runbooks, reducing on-call administrative toil by 30% and streamlining the incident tracking lifecycle.
> CI/CD Pipeline Optimization: Implemented robust CI/CD pipelines via OCIbuild, increasing deployment frequency by 2x while maintaining a 0% failure rate in production during peak traffic windows.
> Pre-Production Validation: Executed rigorous region-based testing strategies to validate new features, catching 15+ critical bugs per quarter before production promotion.
> Security & Compliance: Enforced security best practices across infrastructure by implementing granular access controls, achieving 100% compliance with internal security audits and regulatory standards.
> Collaboration & Knowledge Sharing: Worked closely with development, operations, and support teams to share insights, document best practices, and improve incident response processes, fostering a culture of reliability and continuous learning.

Associate Engineer | Linux COE

CtrlS Datacenters (Apr 2021 - Dec 2023)

> Multi-Cloud Infrastructure: Provisioned and managed a hybrid fleet of 2,500+ servers (RHEL, CentOS, SUSE, AIX, Ubuntu) across on-premise and multi-cloud environments, maintaining 99.9% system availability.
> Virtualization Management: Administered enterprise virtualization clusters using VMware, Hyper-V, and Nutanix, optimizing resource allocation for virtual instances and Network Attached Storage (NAS) to reduce hardware overhead by 15%.
> High Availability & Web Serving: Configured SUSE HA clusters and high-traffic web servers (Apache, Nginx) with NIC Bonding and automated SSL renewal, ensuring zero downtime and robust network resilience.
> System Automation: Developed advanced Bash and Python scripts to automate routine administrative tasks and user provisioning, reducing manual operational toil by 40%.
> Identity & Access Management: Integrated Active Directory for centralized user management and automated sudo privilege auditing, enforcing strict least-privilege access and improving audit readiness.
> Security & Compliance: Led critical Patch Management cycles and server hardening initiatives, achieving 100% compliance with security frameworks and resolving high-severity vulnerabilities ahead of SLA deadlines.
> Monitoring & Diagnostics: Architected a centralized Zabbix monitoring solution, automating agent deployment across the fleet to achieve 100% infrastructure visibility and reduce incident detection time by 30%.
> Database Administration: Deployed and optimized MySQL and MongoDB primary-replica (master-slave) architectures, tuning queries to improve database performance and reliability for business-critical applications.
> Technical Account Management (TAM): Served as the technical single point of contact (SPOC) for key enterprise accounts, driving resolution of complex technical issues and improving Customer Satisfaction (CSAT) scores by 20% through proactive service management.
> ITIL Process Management: Championed ITIL best practices for Incident, Change, and Problem Management, ensuring 95%+ SLA adherence for all critical service requests.

sharan@sre:~$ cat skills.txt

Infrastructure Automation

> Terraform, Ansible, Chef
> Docker, Kubernetes, OCI Build
> AWS CLI, OCI, GitLab CI/CD
> TeamCity, Jenkins (CI/CD pipelines)

Observability & Monitoring

> Prometheus, Grafana, Alertmanager
> OpenTelemetry, Blackbox Exporter
> Zabbix, Node Exporter
> Log analysis, RCA tooling

Scripting & Ops Engineering

> Python, Bash, SaltStack
> Disk usage analysis, batch ops
> Custom CLI tooling for on-call
> Incident coordination & postmortems

sharan@sre:~$ ls -l projects/

Monitoring Stack Deployment

Proof-of-concept monitoring stack: Terraform (AWS provider) for infrastructure provisioning, Ansible for configuration automation, Docker for portability, Prometheus and Node Exporter for metrics collection and PromQL queries, and Grafana for dashboards.

view →

Jenkins CI/CD

Proof-of-concept Jenkins CI/CD pipeline that deploys a Flask web app, with runs triggered by GitHub webhook payloads.

view →

Blogposts

Anecdotes from my work and studies. Coming soooooooooon!

view →

sharan@sre:~$ ls social

I'm currently open to new opportunities. Whether you have a question or just want to say hi, my inbox is always open.

Say Hello

linkedin github browser-toys