Company: Old Mutual
Industry: Banking / Financial Services
Deadline: Jan 30, 2026
Job Type: Full Time
Experience: 7 years
Location: Cape Town, Johannesburg
Province: Western Cape, Gauteng
Field: ICT / Computer
Job Description
- OM Bank is currently looking for a site reliability engineer to join the OM Bank platform team. The candidate will be responsible for maintaining the OM Bank platform, including first-line support for the platform’s technical services and managing service outages through the incident management process.
KEY RESULT AREAS
- First-line support for all services that comprise the platform
- Managing the incident management process for production incidents, including detection, triage, resolution and driving continuous improvement
- Maintaining the production readiness scorecard defined in Terraform to ensure checks are working as expected, and adding new checks to the scorecard workflow
- Creating and maintaining monitors in Datadog that improve observability across the platform
- Engaging with the wider OM Bank product and build team to ensure alignment with the observability standards defined by the platform team
- Designing and implementing enhancements to the platform that contribute towards reducing MTTR (mean time to recovery)
- Designing and implementing automation initiatives including self-service capabilities
- Implementing Service Level Indicators & Objectives for the platform
- Implementing and maintaining Datadog dashboards for the platform
- Defining and maintaining baseline monitors to be used by product teams
- Maintaining the observability repository that contains all service definitions and observability related configurations
- Maintaining the feature flagging repository containing all feature flag definitions for product teams
- Maintaining PagerDuty definitions and overall administration
- Fine-tuning monitors to ensure alerts are triggered appropriately
- Leading an action center during a production incident, fostering collaboration across the bank to resolve the outage
- Advising product and platform teams on engineering best practices to ensure services are built with observability and scalability from the start
- Maintaining overall platform health by monitoring key metrics
- Maintaining and extending the SRE API, written in Python and deployed to Kubernetes
ROLE REQUIREMENTS
- Bachelor’s degree in Computer Science, Electrical or Electronic Engineering, Information Technology, or a relevant field
- 7+ years of software and platform engineering experience building and supporting scalable services
- 3-5 years of experience writing infrastructure as code (Terraform, AWS CDK, CloudFormation)
- Solid experience using observability platforms like Datadog
- Experience with microservices architecture and RESTful APIs
- Solid Kubernetes experience, demonstrating end-to-end deployment and maintenance of clusters, including designing and building the infrastructure as code required to deploy the cluster and the cloud resources that support it
- Experience with Kubernetes custom resource management and deployment
- Solid experience deploying Kubernetes resources using Helm charts
- Experience in fine-tuning Kubernetes HPA (Horizontal Pod Autoscaler) configs
- Moderate experience using the Go or Python programming languages
- Solid experience using GitOps and general Git-based operations
- Solid infrastructure as code background, demonstrating experience in designing, implementing and maintaining IaC design patterns that manage large-scale cloud environments
- Solid AWS experience, displaying advanced understanding of cloud architecture and maintaining distributed systems
- Experience maintaining queuing systems like AWS SQS and event streaming platforms like Kafka
- Experience supporting mobile applications
Skills
- Action Planning, Application Development, Business Process Design, Computer Literacy, Data Management, Data Modeling, Evaluating Information, Identifying Customer Needs, Information Technology (IT) Support, Market Analysis, Oral Communications, Product Development, Technical Support, Technical Troubleshooting, Test Case Management, User Requirements Documentation, Web Development
Competencies
- Business Insight
- Collaborates
- Courage
- Cultivates Innovation
- Decision Quality
- Drives Results
- Ensures Accountability
- Manages Complexity
Closing Date
- 01 November 2025