Observability And Site Reliability Engineer

Details of the offer

SHOULD YOU ACCEPT THIS CHALLENGE…
Our team is dedicated to maintaining the reliability, performance, and operational excellence of our product and platforms. The Fleet Reliability team is where you will be at the frontline of ensuring seamless customer experiences, especially during incidents or escalations. We work closely with engineering and support teams to proactively prevent issues, resolve customer escalations, and improve our monitoring and response processes. The work is cross-discipline and each team member develops an understanding and expertise in many functional areas of our products and technologies.
As an Observability and SRE Engineer, you'll be responsible for managing and enhancing the observability of our systems, troubleshooting complex issues, and leading post-incident reviews. Your work will directly impact our ability to respond swiftly to incidents, minimize downtime, and improve customer satisfaction. You'll focus on building a resilient infrastructure while also owning the process and tooling to resolve escalations effectively.
Key Responsibilities:
Customer Escalation Management
Act as the primary technical resource for high-impact customer escalations, working to diagnose, troubleshoot, and resolve incidents.
Coordinate with customer support and engineering teams to ensure issues are resolved quickly and accurately.
Serve as a technical point of contact during incidents, communicating status and resolution plans to relevant stakeholders.

Observability and Monitoring
Develop and maintain dashboards, alerts, and logging systems to track product performance.
Improve the observability and visibility of features through enhancements to monitoring, logging, and alerting.
Establish SLAs, SLIs, and SLOs to measure and ensure product reliability and to proactively prevent escalations and Sev-1 incidents.
Identify trends in features that cause reliability issues.

Collaboration and Communication
Work cross-functionally with development, product, and support teams to enhance system reliability and customer experience.
Provide feedback to development teams on areas of improvement for code stability and reliability.
Mentor other engineers on best practices in observability and reliability engineering.

WHAT YOU'LL NEED TO BRING TO THIS ROLE...
Experience: 7+ years in SRE or a related field, with a strong focus on observability and customer-facing incident response.
Technical Skills: Proficiency in monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, Datadog, New Relic, Splunk).
Programming and Scripting: Solid knowledge of languages like Python, Go, SQL and experience with shell scripting for automation.
Cloud Infrastructure: Experience with cloud platforms (e.g., AWS, GCP, Azure) and container orchestration tools (e.g., Kubernetes).
Problem-Solving: Strong analytical skills to diagnose and troubleshoot complex systems and identify root causes quickly.
Communication: Excellent verbal and written communication skills, with experience in handling high-stakes customer interactions.
Incident Management: Familiarity with incident management frameworks and tools (e.g., PagerDuty, Opsgenie, or similar) is a plus.

Certification in cloud platforms (e.g., AWS Certified Solutions Architect, Google Cloud SRE Professional).

We are primarily an in-office environment, and therefore you will be expected to work from the Santa Clara, CA office in compliance with Pure's policies, unless you are on PTO, work travel, or other approved leave.

The annual base salary range is: $207,000 – $312,000. Salary ranges are determined based on role, level and location. For positions open to candidates in multiple geographical locations, the base salary range is reflective of the labor market across the applicable locations. 
This role may be eligible for incentive pay and/or equity. 
And because we understand the value of bringing your full and best self to work, we offer a variety of perks to manage a healthy balance, including flexible time off, wellness resources, and company-sponsored team events - check out purebenefits.com for more information. 


Nominal Salary: To be agreed

Source: Greenhouse

Built at: 2024-11-22T06:07:35.309Z