Birkirkara · Hybrid
Senior Site Reliability Engineer
Job Overview
We are seeking a highly skilled Senior Site Reliability Engineer with a strong focus on Incident Management to join our team of 8 SREs. In this role, you will serve as the SRE Manager’s key partner, working closely with the Lead SRE and Lead TSE (ServiceDesk) to lead and manage incidents throughout their entire lifecycle, ensuring minimal downtime and maintaining system stability.
You will leverage your expertise in ITSM/ITIL frameworks and Jira Service Management (JSM) to streamline processes and ensure all interactions related to incidents, escalations, and team collaboration are effectively captured and managed.
Main Responsibilities
- Lead and manage incidents through their entire lifecycle, from detection to post-incident review, ensuring timely resolution and minimal impact on system performance.
- Utilize Jira Service Management (JSM) to track, manage, and report on incidents, ensuring all relevant information is captured and processes align with ITSM/ITIL best practices.
- Collaborate with the SRE Manager, Lead SRE, and Lead TSE to develop and implement streamlined incident management processes across teams.
- Act as a subject matter expert on ITSM/ITIL frameworks, driving their application in incident management practices.
- Mentor and guide junior SREs and Technical Support Engineers in incident management best practices and JSM usage.
- Enhance and maintain monitoring systems to proactively identify potential issues and implement preventive measures.
- Participate in the on-call rotation, shared across the 8-member SRE team, with each member covering one week at a time.
- Communicate effectively with internal stakeholders, including DevOps, studio techs, Corporate IT, and customer account management.
- Collaborate with other IT departments to ensure seamless integration of new systems and services.
- Participate in the evaluation and adoption of new SRE tools and technologies.
Requirements
- 5+ years of experience in SRE or a similar role, with a strong emphasis on incident management.
- Strong leadership and mentoring skills, with the ability to guide and support team members.
- Proven track record of leading incidents through their entire lifecycle, including detection, response, resolution, and post-incident review.
- In-depth knowledge of ITSM/ITIL frameworks and their practical application in incident management.
- Extensive experience as a Jira and JSM administrator, with demonstrated ability to configure and manage JSM to support incident management processes.
- Extensive experience in designing Jira workflows and supporting services.
- Excellent problem-solving and troubleshooting skills.
- Strong understanding of Linux/Unix operating systems.
- Strong proficiency in scripting languages such as Python.
- Experience with automation tools such as Ansible, Chef, or Puppet.
- Experience with GitOps and tools like ArgoCD.
- Experience with Kubernetes (EKS).
- Familiarity with CI/CD concepts and tools such as Jenkins or GitLab CI/CD.
- Excellent communication and teamwork skills.
- Eagerness to learn and adapt to new technologies and approaches.
- Passion for the iGaming industry and understanding of its unique challenges and opportunities.
Nice to Haves
- Experience with On-prem Kubernetes (Kubespray or kubeadm).
- Experience with eBPF agents and the groundcover observability stack.
- Certifications in ITSM/ITIL or related areas.
Locations: Birkirkara
Remote status: Hybrid