Senior Staff Site Reliability Engineer
This job is being posted to the TAO job board because it is potentially open to remote candidates. Feel free to contact me if you’d like to learn about Optimizely before applying. I am a PDX-based remote Optimizely employee. Josh.Schoonmaker@Optimizely.com
Optimizely is focused on unlocking digital potential. We are the recognized category leader in Digital Experience Platform (DXP) and created the category for A/B testing and experimentation software. We have incredible customers – isn’t that one of the most important aspects of looking for your next job? Optimizely serves over 9,000 brands, from global organizations such as Visa, Sky, Yamaha, and the Wall Street Journal to tech innovators like Atlassian, DocuSign, Fitbit, and Zillow.
Not only are we financially sound and growing, but we have unicorn status: we exceeded $300M in revenue in 2020, are already profitable, and have all strategic options ahead of us. Optimizely continues to invest and addresses a market opportunity north of $30 billion, providing significant personal career growth opportunities.
We have an inclusive culture and a global team of 1,200+ people across the US, Europe, Australia, and Vietnam. We blend European and American business culture with an emphasis on teamwork, inclusion, and moving fast. People make the difference!
If you are looking to work on the next generation of digital technologies in a fast-paced, hyper-growth environment, apply! We’re just getting started...
Site Reliability Engineers at Optimizely are focused on making Optimizely the most reliable, performant, and trustworthy Digital Experience Optimization platform ever! Our engineering teams have built a collection system and data pipeline that process billions of events per month and deliver a customer profile to hundreds of marketers daily. This is a unique opportunity to help drive the engineering organization toward state-of-the-art observability, service-oriented architectural excellence, and forward-looking planning and execution of large technical projects.
As a Sr. Site Reliability Engineer you will:
- Assist with defining a roadmap for all engineering teams to utilize fully automated, self-service, highly scalable, cost-efficient, observable, auditable and reliable infrastructure services as standard practice
- Work on the execution of this roadmap across the engineering organization, collaborating with SREs and senior engineers across engineering while also performing hands-on work on the most critical challenges
- Build a metrics-driven operational culture standardizing our practices, working closely with product teams for SLO definition, understanding how to derive application service level indicators
- Make iterative improvements to blameless incident management processes, root cause analyses, outage prevention, and service recovery strategies across the engineering organization
- Propose and drive medium to large improvements to production systems to achieve significant impact to our business and engineering teams
- Mentor and coach engineers to be curious and effective at discovering and solving technical challenges
- Participate in SRE 24/7/365 on-call rotation
You’ll be successful if you:
- Have proven experience (5-7 years) demonstrating hands-on technical leadership and business impact in combining software engineering skills with systems engineering skills to solve complex automation and reliability challenges
- Demonstrate clear decision making and good trade-offs in complex situations comprising multiple opinions, needs, teams, technologies, cloud providers, and architectural settings
- Are a team player with an analytical mindset and strong communication skills
- Have the ability to proactively look at all systems, tools, processes, and architectures with an open mind and make recommendations on scale, reliability, availability, and automation
- Have a strong familiarity with microservice orchestration clusters such as Kubernetes, ECS, or Docker Swarm
- Have proven experience with cloud platforms such as AWS, Azure or GCP.
- Understand the tools and methodologies for monitoring, logging, and alerting, with a background managing or implementing tools like an ELK stack, Loki log aggregation, Prometheus metrics, Grafana dashboards, HashiCorp Consul, Alertmanager, and Datadog
- Are proficient in more than one programming language, such as Ruby, Java, Bash, or Go
- Have a strong understanding of Infrastructure as Code methodologies with tools like CloudFormation, Terraform, Chef, Packer, Helm etc.
- Understand and have worked with CI/CD build pipelines such as Jenkins, TeamCity, Waypoint, AWS CodeBuild, or CircleCI