Credit Karma is seeking a highly motivated and experienced Site Reliability Engineer (SRE) to join our Site Operations team as a Technical Duty Officer (TDO). The TDO will play a critical role in strengthening our Incident Response and Problem Management capabilities and in ensuring the reliability and stability of Credit Karma's systems and services. As the TDO, you will be responsible for safeguarding changes, managing incidents, driving postmortems, and establishing processes to identify and prevent recurring issues. You will collaborate closely with engineering teams to promote best practices and contribute to a culture of continuous improvement.
What you'll do:
Incident Management: Lead incident response efforts, ensuring swift resolution and minimal impact to users
Change Management: Review and advise on high-risk platform changes to mitigate potential issues
Problem Management: Conduct blameless postmortems, identify root causes, and drive the implementation of preventative measures
Reliability Engineering: Develop tools and automation to improve system reliability and reduce manual effort
Collaboration: Work closely with engineering teams to advocate for SRE best practices and contribute to technical design discussions
Monitoring & Observability: Leverage monitoring tools to identify and address potential issues proactively
What's great about the role:
You will have the opportunity to contribute to an engineering-first organization.
Your contributions will have a noticeable impact on Credit Karma's members and your fellow Karmanauts (that's what we call ourselves).
You will take part in organization-wide continuous improvement efforts to increase the reliability of Credit Karma.
You will get broad exposure to our full stack, consisting of forward-looking technologies such as GenAI/LLM, Incident Automation, Automated Observability at Scale, etc.
You will grow and learn and have fun doing it – it's part of our culture.
And, of course, all those awesome company perks that you have probably already read about.
Minimum Basic Requirements:
5+ years of experience in Site Reliability Engineering or a related field
Strong understanding of cloud-native architectures, containerization, and orchestration
Proven experience leading fire drills and managing production incidents
Proficiency in at least one scripting language (e.g., Python, Go)
Familiarity with monitoring and observability tools (e.g., New Relic, Datadog, Prometheus)
Preferred Qualifications:
Experience working with public cloud platforms (e.g., GCP, AWS, Azure)
Experience with additional programming languages (e.g., Scala, TypeScript, Java)
Experience developing production-quality tooling
Experience with postmortems and follow-through on action items
Benefits at Credit Karma include:
Medical and Dental Coverage
Retirement Plan
Commuter Benefits
Wellness perks
Paid Time Off (Vacation, Sick, Baby Bonding, Cultural Observance, & More)
Education Perks
Paid Gift Week in December