About Company
The core of all AI, business intelligence and applications is data — various bits and bytes that come in all different formats. Only when we sift through this data, reason with it and build on top of it in real time does it give way to vast amounts of information and knowledge. Real-time insights are key to the way we live our lives today; the way we entertain ourselves; the way we listen to music; the way we order groceries. Real-time insights keep your BI tools fresh; they keep your ride-sharing app with the most current price; and they ensure you never miss a fraudulent payment. SingleStoreDB is the world’s only database that empowers users to transact, analyze and search data in real time. It empowers the world’s makers to build, deploy and scale modern, intelligent applications — backed by streaming data ingestion, a unique table type that supports both transactional (OLTP) and analytical (OLAP) workloads, limitless point-in-time recovery and a distributed (shared-nothing), MySQL-compatible architecture.
Job Description
Summary
SingleStore is seeking a Site Reliability Engineer to help optimize and scale our managed service offering across all three major cloud providers. In this role, you will be at the intersection of leading technology trends: a highly performant distributed database, managed by Kubernetes, running in the cloud. This is a great opportunity to push the boundaries with a cloud-focused SRE role.
This is a development role, requiring an engineering mindset to solve operational challenges. You will be part of a globally distributed team of engineers, helping to drive SRE practices across the company. Through infrastructure automation, you will help us grow our service across multiple cloud platforms. This requires a relentless focus on eliminating manual processes. You will also leverage our monitoring platform to improve the overall customer experience by systematically identifying and fixing any issues impacting our customers. As an SRE, you will also help diagnose issues on the platform, leveraging a deep understanding of the SingleStore query engine along with the backend infrastructure.
Roles and Responsibilities
- Develop an automation platform to manage infrastructure rollouts across cloud providers
- Optimize the telemetry platform to identify customer-impacting events while providing relevant data to drive debugging
- Partner with the engineering team to optimize performance of services for cloud architecture
- Debug Live Site events and conduct follow-up postmortems and root cause analysis (RCA)
- Participate in an SLA-driven on-call rotation, which will include after-hours, weekend, and rotating holiday participation
Required Skills and Experience
- 5 years of demonstrated experience working as a Site Reliability Engineer
- Infrastructure automation experience. Scripting experience (Python, Bash) a plus.
- Experience with the Prometheus monitoring stack. Experience with Grafana, Mimir and Loki is a plus.
- Knowledge of Kubernetes and the container ecosystem
- Strong cross-group collaboration and communication skills
- Familiarity with at least one of AWS, Azure, or Google Cloud
- Experience debugging, diagnosing and troubleshooting complex production software
- B.S. degree in Computer Science or related field
Skills
- Effective communication
- Programming languages (R, Python, Scala, MATLAB)
- Teamwork
