Job Title: Site Reliability Engineer – Observability

Overview:
We are seeking a Site Reliability Engineer III to develop and maintain our observability platform. This role focuses on ensuring the reliability, performance, and scalability of microservices, Kubernetes clusters, and cloud infrastructure. You'll collaborate with cross-functional teams to deliver metrics, logs, and traces for system health and performance, enabling proactive monitoring and troubleshooting.

Responsibilities:
- Develop and maintain a resilient observability stack using tools like Prometheus, Grafana, Loki, InfluxDB, Telegraf, and OpenTelemetry.
- Partner with teams to identify monitoring needs and provide data-driven insights.
- Implement monitoring solutions across diverse environments, including Kubernetes, cloud, and on-premises setups.
- Aggregate logs, metrics, and traces for end-to-end system visibility.
- Set up alerts and thresholds for proactive performance monitoring.
- Create dashboards to track system health and resource utilization.
- Support incident response efforts and perform post-incident analyses for continuous improvement.
- Document observability best practices, setups, and troubleshooting techniques.
- Stay current on observability technologies and trends.

Preferred Qualifications:
- Bachelor's degree in a relevant field or equivalent experience.
- 3–5 years of experience in observability, SRE, DevOps, or platform engineering.
- Experience with observability solutions for complex infrastructure (e.g., Kubernetes, AWS, Azure, on-prem vSphere).
- Proficiency in Git and CI/CD workflows; familiarity with cloud platforms and containerized environments.
- Relevant certifications are a plus.

Skills:
- Deep knowledge of observability principles, monitoring tools, and cloud-native technologies.
- Strong scripting and automation skills (Python, Bash, or Go).
- Proficient in data visualization (Grafana, Kibana).
- Effective troubleshooting using logs, metrics, and traces.
- Collaborative and adaptable with a continuous improvement mindset.

This role is perfect for those passionate about reliable, seamless systems and proactive monitoring. Join us to drive innovation and resilience in our observability practices!