The AdGear Site Reliability Engineer is a software engineer responsible for ensuring that the AdGear platform's services are designed, implemented, deployed, and operated so that they are highly available, high-performing, and scalable.
The ideal candidate for the Site Reliability Engineer position typically self-identifies as a "hacker": a "jack of all trades" who also has deep knowledge in several of the following areas: software development, Linux/Unix systems administration, networking, internet protocols, databases, and distributed systems. The ideal candidate has a mix of software development and infrastructure operations skills, and approaches infrastructure operations from the perspective of a software engineer.
You'll be working in a small team of SREs, supporting a medium-sized team of software engineers working on building the next generation of AdGear's administrative interfaces, ad decisioning, delivery, data processing and analytics systems.
Relevant industry experience is important, but ultimately less so than your demonstrated abilities and attitude.
Responsibilities
- Co-architecting new services, including fault tolerance and self-healing by design, and establishing clear scale-out paths
- Evaluating and benchmarking new solutions, establishing capacity and growth plans
- Implementing deployment and configuration strategies for new services, including provisioning resources, and go-live
- Administration of services, whether built in-house or from external vendors
- Continual optimization of services on all layers (hardware, software) for high performance
- Continual improvement of internal services for ease of packaging, configuration and deployment
- Monitoring of all critical services, sharing pager duty, troubleshooting and addressing problems as they arise (including any needed changes in code, topology, resources, or configuration)
- Backups/DR implementation, plans, documentation and exercises
- Co-owning technical relationships with several service providers and vendors
Qualifications and Skills
- Full competency in at least one software development language (Java, Erlang, C/C++, Ruby)
- Full competency in at least one supporting language (Bash, SQL)
- Strong Linux system administration and troubleshooting skills, including strong knowledge of how the various components work (kernel, CPU, memory, disk, network)
- Thorough understanding of networking protocols that make the internet work (Ethernet, IP, DNS, TCP and UDP, HTTP, TLS)
- Experience with a source control system, ideally Git
- Experience with database systems and data pipelines (batch, real-time & hybrid)
- Familiarity with configuration management systems, containers, VMs
- Familiarity with distributed multi-datacenter 24/7 web systems
- You have a track record of making things better and leading solutions that remove technical pain points and facilitate growth
- You enjoy working with others who are smart and passionate about building useful, reliable, performant products
- You can balance moving fast against the risk of breaking things, and you make sure you know how to fix them when they do break
Benefits
- 3 weeks annual leave
- 50% paid medical coverage
- Yoga in the office, fitness discounts, soccer and cycling groups, cards and board games in the office
- Company outings, company roasts and many more cool things