SHOULD YOU ACCEPT THIS CHALLENGE…
Our team is dedicated to maintaining the reliability, performance, and operational excellence of our product and platforms. On the Fleet Reliability team, you will be on the front line of ensuring seamless customer experiences, especially during incidents and escalations. We work closely with engineering and support teams to proactively prevent issues, resolve customer escalations, and improve our monitoring and response processes. The work is cross-disciplinary, and each team member develops understanding of and expertise in many functional areas of our products and technologies.
As an Observability and SRE Engineer, you'll be responsible for managing and enhancing the observability of our systems, troubleshooting complex issues, and leading post-incident reviews. Your work will directly impact our ability to respond swiftly to incidents, minimize downtime, and improve customer satisfaction. You'll focus on building a resilient infrastructure while also owning the process and tooling to resolve escalations effectively.
Key Responsibilities:
Customer Escalation Management
- Act as the primary technical resource for high-impact customer escalations, working to diagnose, troubleshoot, and resolve incidents.
- Coordinate with customer support and engineering teams to ensure issues are resolved quickly and accurately.
- Serve as a technical point of contact during incidents, communicating status and resolution plans to relevant stakeholders.
Observability and Monitoring
- Develop and maintain dashboards, alerts, and logging systems to track product performance.
- Improve the observability and visibility of features through enhancements to monitoring, logging, and alerting.
- Establish SLAs, SLIs, and SLOs to measure and ensure the reliability of the product and to proactively prevent escalations and Sev-1 incidents.
- Identify trends in features that cause reliability issues.
Collaboration and Communication
- Work cross-functionally with development, product, and support teams to enhance system reliability and customer experience.
- Provide feedback to development teams on areas of improvement for code stability and reliability.
- Mentor other engineers on best practices in observability and reliability engineering.
WHAT YOU'LL NEED TO BRING TO THIS ROLE...
- Experience: 7+ years in SRE or a related field, with a strong focus on observability and customer-facing incident response.
- Technical Skills: Proficiency in monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, Datadog, New Relic, Splunk).
- Programming and Scripting: Solid knowledge of languages like Python, Go, and SQL, and experience with shell scripting for automation.
- Cloud Infrastructure: Experience with cloud platforms (e.g., AWS, GCP, Azure) and container orchestration tools (e.g., Kubernetes).
- Problem-Solving: Strong analytical skills to diagnose and troubleshoot complex systems and identify root causes quickly.
- Communication: Excellent verbal and written communication skills, with experience in handling high-stakes customer interactions.
- Incident Management: Familiarity with incident management frameworks and tools (e.g., PagerDuty, Opsgenie, or similar) is a plus.
- Certification in cloud platforms (e.g., AWS Certified Solutions Architect, Google Cloud SRE Professional).
We are primarily an in-office environment. You will therefore be expected to work from the Santa Clara, CA office in compliance with Pure's policies, unless you are on PTO, work travel, or other approved leave.
The annual base salary range is: $207,000 – $312,000. Salary ranges are determined based on role, level and location. For positions open to candidates in multiple geographical locations, the base salary range is reflective of the labor market across the applicable locations. This role may be eligible for incentive pay and/or equity. And because we understand the value of bringing your full and best self to work, we offer a variety of perks to help you maintain a healthy balance, including flexible time off, wellness resources, and company-sponsored team events. Check out purebenefits.com for more information.