Data Reliability Engineer

United Kingdom | Full-time | Posted 2 weeks ago

Be part of something great!

Maximus is a global organisation that specialises in providing health and employment services to millions of people every year. Here in the UK we employ around 5,000 people across the country to deliver services that have a profound impact on people’s lives. From assessments and health services to employability programmes and specialist support, we do work that matters with people who care.

Benefits

£56,500 - £65,000 per annum
25 days' holiday + bank holidays
9% combined pension contribution
Healthcare cash plan
Retail discounts

About the role:

We have an exciting opportunity for you to join Maximus as a Data Reliability Engineer and help transform our clients' services through cutting-edge digital solutions.

Join our forward-thinking digital organisation, helping to build and run a trusted, resilient, and high-performing data platform that underpins operational excellence across Maximus UK. This role is responsible for improving the reliability, observability, quality, and operational supportability of our data estate, ensuring that critical data products, pipelines, and platforms are dependable, well-governed, and fit to support business, contractual, and service objectives.

Working across engineering, architecture, delivery, product, and operational teams, you will focus on making data platforms and pipelines easier to monitor, support, recover, and continuously improve, with a primary focus on our Azure Databricks-based data platform and the wider Azure data ecosystem, including Purview. You will help establish engineering standards, reliability controls, and proactive monitoring approaches that reduce incidents, improve trust in data, and enable teams to detect and resolve issues before they impact users, operations, or client outcomes. The core platform is Azure, with growing expansion into AWS, so the role will contribute to patterns and practices that support a secure and scalable multi-cloud data environment.

You will bring a strong engineering mindset to data operations, combining data platform knowledge with an understanding of resilience, automation, service health, and supportability. You will work closely with data engineers, platform teams, security, and service stakeholders to strengthen the operational maturity of the data estate, embedding a culture of quality, transparency, accountability, and continuous improvement across the end-to-end data lifecycle.

Key Responsibilities:

Own and improve the reliability, availability, observability, and operational supportability of Maximus UK’s Azure Databricks platform, Azure data services, and associated data pipelines and data products
Design and implement monitoring, alerting, health checks, and diagnostics across Azure Databricks, Azure data services, orchestration layers, storage, and downstream consumption, extending these patterns into AWS as the estate grows
Define and maintain reliability standards, controls, operational runbooks, and support models that improve the resilience, predictability, and supportability of data services
Work closely with data engineering teams to identify, prioritise, and remediate reliability, performance, and data quality issues across Databricks notebooks, jobs, workflows, and other Azure data workloads
Establish proactive incident detection, triage, and root cause analysis practices, reducing mean time to detect and mean time to recover for data-related issues
Design and implement robust data quality controls, validation frameworks, reconciliation processes, and anomaly detection approaches across the end-to-end data lifecycle
Configure and use Azure Purview to provide effective data cataloguing, lineage, ownership, and governance, ensuring reliability and quality controls are visible and auditable
Collaborate with platform, cloud, architecture, and security teams to ensure the data estate is secure, resilient, cost-effective, and aligned to enterprise standards and patterns
Contribute to the reliability engineering approach for an Azure-first data platform while supporting reusable patterns and operational readiness for data services in AWS
Partner with architects and engineers so that new pipelines, data products, and platform services are designed with operability, recoverability, scalability, and observability built in from the start
Automate repetitive operational tasks, environment checks, dependency verification, failure handling, and recovery processes to increase efficiency and reduce manual intervention and risk
Capture lessons learned, codify reliability patterns and standards, and share best practice to continuously improve reliability, transparency, and engineering discipline across the data function

What we are looking for

Proven experience in data engineering, platform engineering, site reliability engineering, DataOps, or a closely related role focused on data platform reliability and operations
Strong hands-on experience with Azure-based data platforms, particularly Azure Databricks and core Azure data services such as Data Lake Storage, Data Factory/Synapse, and analytical stores, and familiarity with equivalent services in AWS
Strong understanding of modern data platform architectures, including data lakes, warehouses or lakehouses, orchestration frameworks, transformation pipelines, streaming services, and analytical consumption layers
Experience designing and implementing monitoring, observability, logging, alerting, and incident management approaches for Databricks workloads and Azure data services, using tools such as Azure Monitor, Log Analytics, or similar
Strong understanding of data quality, reconciliation, validation, and lineage concepts, and practical experience implementing control frameworks that protect critical data flows and products
Hands-on experience with Azure Purview or comparable data catalogue and lineage tooling, including configuration of collections, classification, ownership, and lineage for key datasets
Good understanding of reliability engineering principles such as availability targets, resilience patterns, recoverability, service health indicators, and operational readiness assessments
Experience using scripting and automation (for example, Python, PowerShell, or similar) to remove operational toil, improve repeatability, and strengthen recovery and deployment processes
Ability to diagnose and resolve complex issues that span data pipelines, integrations, cloud infrastructure, configuration, and source or downstream systems, and to drive pragmatic remediation
Strong collaboration and communication skills, with the ability to work effectively across data engineering, architecture, cloud platform, security, delivery, and operational teams, and to explain issues and trade-offs clearly to technical and non-technical stakeholders
Experience with Azure-native observability tooling and patterns for Databricks and the wider Azure data stack, and exposure to equivalent AWS monitoring approaches
Experience tuning and optimising Databricks jobs, clusters, and workflows for performance, cost efficiency, and reliability in production environments is desirable
Familiarity with CI/CD practices, infrastructure as code, and automated testing for data platforms, supporting safe, repeatable, and low-risk changes to data services, is desirable
Experience working in regulated or contractually governed environments such as public sector, health, employment services, or similarly controlled industries, where service stability and compliance are critical, is desirable
Experience contributing to platform standards, engineering communities of practice, or cross-team reliability improvement initiatives, helping to raise engineering and operational maturity across multiple squads or domains, is desirable

EEO Statement

Maximus is committed to developing, maintaining and supporting a culture of diversity, equity and inclusion throughout the recruitment process. We know that feeling included has a dramatic impact on personal well-being and are working to ensure that no job applicant receives less favourable treatment due to any personal characteristic. Advertisements for posts will include sufficiently clear and accurate information to enable potential applicants to assess their own suitability for the post.

We are a Disability Confident Leader, thanks to our commitment to the recruitment, retention and career development of people with disabilities and long-term conditions. The Disability Confident scheme includes a guaranteed interview for any applicant with a disability who meets the minimum requirements for a job. When you complete your job application you will find a question asking you if you would like to apply under the Disability Confident Guaranteed Interview Scheme. If you feel that you have a disability and apply under this scheme, providing that you meet the essential criteria for the job, you will then be invited for an interview. Your Guaranteed Interview application will only be shared with the hiring manager and the local resourcing team. Where reasonable, Maximus will review and consider adjustments for those applicants who express a requirement for them during the recruitment process.
