Site Reliability Engineer (AWS)

Spark Infrastructure · Remote/Frisco, Texas
Department: Spark Infrastructure
Employment Type: Full Time
Minimum Experience: Mid-level

Site Reliability Engineer (SRE) Team

Spark is the Gearbox Software team behind SHiFT, our online services platform that serves millions of users every month across multiple gaming franchises. SHiFT is our one-stop-shop gaming services platform responsible for dozens of features gamers around the world depend on every day, from cross-play to friend presence, citizen science, dedicated server hosting, matchmaking, and much more. Spark is passionate about delivering features for our gaming partners that are relevant, dependable, and secure. We take pride in the stability of our platform and are always looking for ways to take that stability to new levels. Our team is agile with a commitment to seeing features go from desktop to production in minutes, not days.



To further drive our vision of premier stability and rapid feature delivery, we are looking for a mid-level Site Reliability Engineer to join our team. As an SRE on Spark, you will be responsible for assisting in the design and implementation of flexible cloud architectures with an automation-first emphasis. You will be challenged along the way to adopt the shared mentality that observability is everything and to push for that philosophy to be realized throughout the platform. As an SRE, you should be comfortable integrating multiple technologies to form a single, coherent view of platform health. You should have expertise in cloud and microservice security best practices. When challenged with designing and implementing a new feature in the infrastructure, you are confident in both your design and your implementation, and ready to defend them in a room with other technical minds. You also recognize that the best designs come from collaboration, not dictation, and are willing to bring implementations to the table with an open mind.


Typical Day

TL;DR: You will be deeply immersed in AWS and Terraform, with plenty of Go and/or Python development sprinkled in as well.

Your days will be filled with building solutions to technical challenges in the observability and availability of our SHiFT online services. You will evangelize security best practices, call out gaps in observability, and be obsessed with user experience as it relates to the services you support. You will help manage and orchestrate each of these by leaning heavily on technologies like Terraform, Docker, Bash, Python, and Go. On any given day, you should expect to spend at least 75% of your time actively engineering solutions; the rest will be a mixture of reviewing code from your colleagues, defining SLIs and SLOs, participating in design meetings, documentation, and self-development.

This position will require you to carry a company-paid mobile device and participate in 24/7 on-call rotations alongside your engineering colleagues. Don't worry though, our on-call experience doesn't suck.


Core Responsibilities:

  • Be a trusted voice in the evangelism of reliability engineering throughout the team
  • Champion discussions that define appropriate SLIs and SLOs of services
  • Design and engineer best practices for ensuring observability and reliability at every layer of the stack
  • Participate in after-hours on-call support rotations


Must Have (the non-negotiable parts):

  • Proficiency in AWS container management, orchestration, and observability features (ECS, Fargate, Aurora, AppConfig, CloudWatch, etc.)
  • Proficiency in Terraform and/or CloudFormation
  • Professional experience managing AWS access and security services (IAM, KMS, Secrets Manager, WAFv2, etc.)
  • Minimum of 3 years of experience in a wide variety of AWS technologies in a professional setting
  • Minimum of 2 years of experience with containers in a professional setting, preferably Docker
  • Professional development experience with at least one of: Go, Python
  • Experience defining SLIs and SLOs for highly available cloud-based applications
  • Understanding of observability stack management (monitoring, alerting, structured logging, APM, etc.)
  • Comfortable communicator, able to clearly detail designs and implementations on an individual level and in large group settings


Should Have (some wiggle room):

  • Hands-on experience developing and maintaining CI/CD pipelines, preferably in Git/GitLab
  • Understanding of RESTful and WebSocket-based APIs
  • Bachelor's degree in computer science, related field, or equivalent training and professional experience


Now you're just showing off:

  • Any verifiable security credential (ISC2, AWS Certified Security - Specialty, ethical hacking, Security+, etc.)
  • Experience working in retail/eCommerce programs
  • Familiarity with OpenTelemetry / OpenSLO
  • Familiarity with Datadog / Honeycomb
  • Familiarity with Atlassian products (Opsgenie, Jira, Confluence)
  • Experience working with developers in an agile environment
  • Experience in the games industry, preferably launching multiple online-enabled AAAs
  • Knowledge about Gearbox-owned IPs

Gearbox Entertainment believes that all team members should be able to enjoy a work environment free from all forms of discrimination and harassment. We are committed to reflecting the diversity of the world we strive to entertain. As an Equal Opportunity Employer, we provide fair and equal treatment to all team members and applicants. We do not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity or expression, national origin, disability, genetic information, pregnancy or maternity, veteran status, or any other status protected by applicable national, federal, state or local law.
