
Machine Learning Operations – MLOps

Artificial intelligence (AI) and machine learning (ML) have become key enablers for business, helping companies plan and personalise their products and improve business operations across the globe.

Early concept made possible with technological advancement

The origin of artificial intelligence (AI) as a computer science field that aims to simulate every aspect of human intelligence with computing machines dates back to 1956, when John McCarthy et al. organised the Dartmouth Summer Research Project on Artificial Intelligence [1]. That workshop was the seed of a unified identity for the field and of a dedicated research community that produced numerous breakthroughs during the subsequent decades [2]. Despite rapid growth across different branches of artificial intelligence, the field could still claim few significant practical successes: nearly every computer program built to interact with humans was codified with a defined set of rules, allowing only rudimentary displays of intelligence in specific contexts and with capabilities limited to specific tasks. Such rule-based systems, also known as expert systems, cannot address complex problems for which programming the set of rules that defines the body of knowledge is impractical or impossible.

It was not until the nineties that the focus of AI started drifting from expert systems toward a new paradigm built upon the idea that programs should be able to learn, so that AI could start displaying its potential [3]. It is in this new paradigm that modern AI has its roots, and here the field of machine learning (ML) was born: in Arthur Samuel’s words, the “field of study that gives computers the ability to learn without being explicitly programmed.” Machine learning algorithms learn through exposure to data, usually via an iterative process or an ensemble of statistical samplings of the data. A key feature of ML is that its predictions improve with experience and with the use of more relevant data, at least up to a certain point. Thus, by learning through practice instead of following defined sets of rules, machine learning systems deliver better solutions than expert systems in numerous cases, and more complex problems become accessible.

However, the advent of machine learning would not have been possible without two key developments. First, progress in computing hardware, such as the continuous increase in the number of transistors on microchips following Moore’s law and, more recently, the use of graphics processing units (GPUs). Second, the availability of vast amounts of data, primarily driven by the invention of the World Wide Web and mobile technology. By the end of 2020, 44 zettabytes were stored in the cloud, and this is estimated to grow to more than 200 zettabytes by 2025 [4]. These factors have led to unprecedented progress in statistical models, algorithms and applications that have brought AI and ML into the limelight. AI solutions are already being applied in virtually every industry with excellent results; notable examples are automated medical diagnosis, voice input for human-computer interaction, intelligent assistants, AI-based cybersecurity and self-driving cars.


Catalyst for digital business

An estimated 70% of global GDP will have gone through some form of digitisation by 2022, and by 2023 investments in direct digital transformation are expected to amount to $6.8 trillion [5]. As companies undergo their digitalisation journey today, ML is a key enabler for automating, predicting, planning and personalising their products. However, integrating ML into the business chain comes with new challenges that have a tremendous impact on business. Many of the questions companies now face are not about how to build ML models, but about which models are in use, what they are doing and whether the data they rely on still reflects the state of the world. Although these questions might seem simple compared with the complexity of the ML algorithms, they are usually overlooked, with negative effects on business [6]. To align models with business needs and to generate business value, it is therefore essential not only to build the ML model but also to manage datasets, monitor and deploy models and build processes that are shareable and repeatable throughout an organisation. To address these issues, a set of best practices has been codified into a new field: Machine Learning Operations, or MLOps.

MLOps is thus a set of practices for the operationalisation of ML models that aims to build, deploy and monitor ML applications quickly and reliably, and that facilitates collaboration and communication between data scientists and operations professionals. Some of the MLOps capabilities are [7]:

  • MLOps allows the unification of the release cycle for machine learning and software application releases.
  • MLOps enables automated testing of machine learning artefacts, e.g., data validation, ML model testing and ML model integration testing.
  • MLOps enables the application of agile principles to machine learning projects.
  • MLOps supports machine learning models, and the datasets used to build them, as first-class citizens within CI/CD systems.
  • MLOps reduces technical debt across machine learning models.
  • MLOps is a language-, framework-, platform- and infrastructure-agnostic practice.


Seven core principles

According to Microsoft’s Machine learning DevOps guide [8], seven core principles should be considered when adopting MLOps for any ML-based project:

  1. Version control code, data, and experimentation output – to ensure reproducibility of experiments and inference results.
  2. Use multiple environments – to segregate development and testing from production work, as shown in the figure below.

Figure: MLOps development, test and production environment set-up

  3. Manage infrastructure and configuration with infrastructure-as-code – for consistency between environments.
  4. Track and manage machine learning experiments – for quantitative analysis of experimentation success and to enable agility.
  5. Test code, validate data integrity and model quality – to test the experimentation code base.
  6. Apply machine learning continuous integration and delivery – to ensure that only models of sufficient quality land in production.
  7. Monitor services, models, and data – to serve machine learning models in an operationalised environment.

It is worth mentioning that these Microsoft principles should be interpreted flexibly, i.e., they are not a set of rules that must all be adopted when designing an ML project. In particular, the second core principle, using multiple environments, can often be omitted during development, even though it is relevant for functional testing of applications and APIs. As an illustration, our solution adopts only a reduced set of those principles.

In this work, a proposed MLOps solution is presented in terms of infrastructure configuration, a data preprocessing workflow and an end-to-end model development workflow. The aim is to showcase how MLOps principles can be brought into practice using cloud computing resources from Azure. The proposed MLOps solution is based on the above-mentioned core principles, with the focus in this paper on principles 1, 3, 4 and 6.

Proposed MLOps solution 

The core components of the proposed MLOps solution can be summarised in terms of infrastructure configuration, data preprocessing workflow and end-to-end model development workflow, which will be presented below.

Infrastructure configuration
  • Infrastructure-as-code: terraform 
  • Code repository: Azure DevOps repo 
  • Data repository: Azure data lake with a Delta Lake storage layer 
  • Model repository: MLflow 
  • Model hosting server: Azure Pipeline Docker 

Data preprocessing workflow

Figure: MLOps data preprocessing workflow

Validation of new data

New data is fetched by ETL pipelines based on time scheduling, which is a story of its own. Before the new data enters the data preprocessing workflow, its quality needs to be validated. In this solution, data quality is measured against knowledge obtained from previously stored data and validated by:

  • ensuring that the new data is not identical to the latest stored data
  • ensuring that the new data has the same features/columns as the latest stored data
  • ensuring that the new data is statistically consistent with the stored data through hypothesis testing, with the stored data defining the null hypothesis, using Great Expectations [9].

If any validation step fails, a webhook built using Pysteams [10] notifies a Teams channel with information about the validation failure, and the data preprocessing workflow is terminated.
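
A minimal sketch of this validation gate is given below. It uses pandas and a SciPy two-sample test in place of the full Great Expectations suite; the column name and significance level are illustrative assumptions, not part of the original solution.

import pandas as pd
from scipy import stats

def validate_new_data(new_df: pd.DataFrame, stored_df: pd.DataFrame,
                      numeric_col: str = "value", alpha: float = 0.05) -> list:
    """Return a list of failure messages; an empty list means the data passed validation."""
    failures = []

    # 1. The new data must not be identical to the latest stored data.
    if new_df.equals(stored_df):
        failures.append("New data is identical to the latest stored data.")

    # 2. The new data must have the same features/columns as the stored data.
    if set(new_df.columns) != set(stored_df.columns):
        failures.append("Column mismatch between new and stored data.")

    # 3. Hypothesis test with the stored data as the null hypothesis (illustrative column).
    if numeric_col in new_df.columns and numeric_col in stored_df.columns:
        _, p_value = stats.ks_2samp(new_df[numeric_col], stored_df[numeric_col])
        if p_value < alpha:
            failures.append(f"Distribution shift detected in '{numeric_col}' (p={p_value:.3f}).")

    return failures

If the returned list is non-empty, the workflow would post the messages to the Teams webhook and terminate, as described above.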

Writing data to the Delta Lake RAW layer

When new data has passed validation, it is ready to be stored in the Delta Lake RAW layer, i.e. stored as a delta table in the data lake RAW-data folder. The aim of the RAW layer is to have version control of the source data without any processing. 
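
A minimal sketch of this step, assuming a Spark session with Delta Lake support (e.g. on Azure Databricks) and an illustrative mount path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
raw_path = "/mnt/data_lake_file_system/RAW_data_folder/delta_table"  # illustrative path

# Append the validated source data (new_df from the validation step) to the RAW layer
# as a Delta table, without any processing.
spark.createDataFrame(new_df).write.format("delta").mode("append").save(raw_path)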

Writing data to the Delta Lake CLEAN layer

Data from the Delta Lake RAW layer is cleaned by casting column names and types and by dropping sensitive information, such as personal data. The cleaned data is then written to the Delta Lake CLEAN layer, i.e. stored as a delta table in the data lake CLEAN-data folder. The aim of the CLEAN layer is to have version control of data that is ready to be used for any purpose while staying as close as possible to the original source data. No use-case-specific processing is performed.
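
As an illustration, continuing the sketch above (the column names are assumptions, not from the original solution), the cleaning step could look like this:

from pyspark.sql import functions as F

clean_path = "/mnt/data_lake_file_system/CLEAN_data_folder/delta_table"  # illustrative path

clean_df = (
    spark.read.format("delta").load(raw_path)                # RAW layer from the previous step
    .withColumnRenamed("CustomerId", "customer_id")          # cast column names to a common convention
    .withColumn("amount", F.col("amount").cast("double"))    # cast column types
    .drop("personal_number")                                 # drop sensitive/personal data
)
clean_df.write.format("delta").mode("overwrite").save(clean_path)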

Writing data to the Delta Lake CURATED layer

Cleaned data from multiple Delta Lake CLEAN layers can be joined or merged for a specific purpose at this step and written to the Delta Lake CURATED layer, i.e. stored as a delta table in the data lake CURATED-data folder. The aim of the CURATED layer is to have version control of joined or merged data for a specific purpose.  
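
A hedged sketch of this step, continuing the example above (the table names, join key and purpose are illustrative assumptions):

curated_path = "/mnt/data_lake_file_system/CURATED_data_folder/delta_table"  # illustrative path

# Join two CLEAN tables for a specific purpose, e.g. forecasting demand per customer segment.
sales_df = spark.read.format("delta").load("/mnt/data_lake_file_system/CLEAN_data_folder/sales")
customers_df = spark.read.format("delta").load("/mnt/data_lake_file_system/CLEAN_data_folder/customers")

curated_df = sales_df.join(customers_df, on="customer_id", how="inner")
curated_df.write.format("delta").mode("overwrite").save(curated_path)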

Data lake storage structure

data_lake_storage_account
|__ data_lake_file_system
    |__ source_data_folder
    |   |__ source_file (source data)
    |__ RAW_data_folder (RAW layer)
    |   |__ delta_table (parquet format)
    |__ CLEAN_data_folder (CLEAN layer)
    |   |__ delta_table (parquet format)
    |__ CURATED_data_folder (CURATED layer)
        |__ delta_table (parquet format)

The term “layer” is commonly used in MLOps to indicate data versioning checkpoints for data at different processing levels. The proposed solution suggests having RAW, CLEAN and CURATED layers of data with the definitions given above. Another popular convention is the BRONZE, SILVER and GOLD layer definition [11]. Which one to pick is up to the developers, as long as the definitions are clear to the whole team.

End-to-end model development workflow

Figure: MLOps end-to-end model development workflow

Run model (re-)training

When data has been curated to the Delta Lake CURATED layer for a specific purpose, it is time for model (re-)training. At this step a model pipeline is constructed, consisting of several data preprocessing steps and a final predictor step. In this solution, the following actions are carried out for model (re-)training (a condensed sketch follows the list):

  • Define data preprocessor steps using sklearn pipelines [12], for example selecting columns, removing outliers and performing feature engineering.

NOTE: These preprocessor steps are performed in the model pipeline (and not in the CURATED layer stage) because of the explorative nature of ML. For the same purpose, for example time series forecasting, the preprocessor steps of different models can look very different. Therefore, the preprocessor steps should be bound to a model rather than to a purpose – data in the CURATED layer are curated for a purpose.

  • Specify train, validation and test datasets by splitting the data in the Delta Lake CURATED layer, and fit the preprocessor pipeline to the training data.
  • Define the parameters to optimise and the objective function, and perform hyperparameter optimisation (using Hyperopt [13]) on the preprocessed training data.
  • Add the model with the best parameters to the model pipeline as the final predictor step.
  • Run the validation data through the model pipeline.
  • Log the best model and all necessary parameters, metrics and data versions in MLflow tracking for experiment version control.
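
The condensed sketch below combines these actions using an sklearn pipeline, Hyperopt and MLflow tracking. The synthetic data, feature names, model choice and search space are illustrative assumptions standing in for the curated Delta table and the actual use case.

import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from hyperopt import Trials, fmin, hp, space_eval, tpe
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the curated Delta table, split into train and validation sets.
rng = np.random.default_rng(0)
data = pd.DataFrame({"feature_1": rng.normal(size=500), "feature_2": rng.normal(size=500)})
data["target"] = 2 * data["feature_1"] + rng.normal(scale=0.1, size=500)
train, val = data.iloc[:400], data.iloc[400:]
X_train, y_train = train[["feature_1", "feature_2"]], train["target"]
X_val, y_val = val[["feature_1", "feature_2"]], val["target"]

# Preprocessor steps are bound to the model pipeline, not to the CURATED layer.
preprocessor = ColumnTransformer([("scale", StandardScaler(), ["feature_1", "feature_2"])])

def objective(params):
    """Validation error of the model pipeline for a given hyperparameter set."""
    pipeline = Pipeline([("preprocessor", preprocessor),
                         ("predictor", GradientBoostingRegressor(**params))])
    pipeline.fit(X_train, y_train)
    return mean_absolute_error(y_val, pipeline.predict(X_val))

search_space = {"n_estimators": hp.choice("n_estimators", [100, 200, 400]),
                "learning_rate": hp.uniform("learning_rate", 0.01, 0.3)}

with mlflow.start_run() as run:
    best = space_eval(search_space, fmin(fn=objective, space=search_space,
                                         algo=tpe.suggest, max_evals=20, trials=Trials()))
    # Refit the model pipeline with the best parameters as the final predictor step.
    model_pipeline = Pipeline([("preprocessor", preprocessor),
                               ("predictor", GradientBoostingRegressor(**best))])
    model_pipeline.fit(X_train, y_train)
    # Log parameters, validation metrics, the data version and the model itself.
    mlflow.log_params(best)
    mlflow.log_metric("val_mae", mean_absolute_error(y_val, model_pipeline.predict(X_val)))
    mlflow.log_param("curated_data_version", "0")  # illustrative Delta table version
    mlflow.sklearn.log_model(model_pipeline, "model")
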
Put model in staging

When a new model has been (re-)trained, it should be registered in the MLflow model repository. The aim of the model registry is to have version control of model development experiments. Based on the metric logs from the corresponding experiment, it is decided whether the new model should be pushed into staging. If the new model passes the defined performance tests, for example metric thresholds, the previously staged model is archived and the new model is transitioned into staging. A notification is sent to the developer team that a new model has been put into staging.

The aim of having models in the archived, staging and production stages is to clarify a model’s role in the model repository. A production model is the best model in the current environment, a staging model is the challenger that will later challenge the production model, and an archived model is a model that has been replaced.
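
A hedged sketch of the registration and staging transition, continuing the training sketch above; the registered-model name is an illustrative assumption, and archiving of the previously staged model is handled by the archive_existing_versions flag:

import mlflow
from mlflow.tracking import MlflowClient

model_name = "demand_forecaster"  # illustrative registered-model name
client = MlflowClient()

# Register the model logged in the training run above.
model_version = mlflow.register_model(f"runs:/{run.info.run_id}/model", model_name)

# If the performance tests (e.g. metric thresholds) pass, move the new version to Staging
# and archive the previously staged model.
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Staging",
    archive_existing_versions=True,
)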

Comparison of staging model and production model

When a new model has been pushed into staging, model performance comparison needs to be carried out between the staging model and the production model, to decide whether the staging model should replace the production model. The model performance is validated on the test data, and the resulting metrics are summarised for the data scientist to make a decision.   
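
An illustrative comparison on the held-out test data could look as follows, assuming both a Staging and a Production version exist in the registry and reusing the illustrative model name and metric from above; X_test and y_test denote the test split described earlier:

import mlflow.pyfunc
from sklearn.metrics import mean_absolute_error

staging_model = mlflow.pyfunc.load_model(f"models:/{model_name}/Staging")
production_model = mlflow.pyfunc.load_model(f"models:/{model_name}/Production")

# Summarise test-set performance of both models for the data scientist to review.
summary = {
    "staging_mae": mean_absolute_error(y_test, staging_model.predict(X_test)),
    "production_mae": mean_absolute_error(y_test, production_model.predict(X_test)),
}
print(summary)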

Move staging model to production

If the data scientist decides that the staging model outperforms the production model, the production model is archived and the staging model is pushed to production. A notification is sent to the developer team that a new model has been put into production.
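
Once approved, the promotion itself is a small registry operation (same illustrative model name as above; the previous production model is archived via the same flag):

from mlflow.tracking import MlflowClient

client = MlflowClient()
staging_version = client.get_latest_versions("demand_forecaster", stages=["Staging"])[0]

client.transition_model_version_stage(
    name="demand_forecaster",
    version=staging_version.version,
    stage="Production",
    archive_existing_versions=True,  # archives the previous Production model
)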

Model run-time deployment

When a new model has been pushed to production, the build environment is notified that a new production model is available. The production model is fetched from the MLflow model registry into a model API, and a test suite is run against the API to ensure that the application behaves as expected. A Docker container is then built using Azure Pipelines, and the model API is exposed on a port and deployed to the cloud.
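
A minimal sketch of such a model API, assuming FastAPI for the web layer and the illustrative registered-model name used above (the actual solution may use a different framework):

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
# Fetch the Production model from the MLflow model registry at start-up.
model = mlflow.pyfunc.load_model("models:/demand_forecaster/Production")

@app.post("/predict")
def predict(records: list[dict]):
    """Score a batch of JSON feature records with the production model."""
    predictions = model.predict(pd.DataFrame(records))
    return {"predictions": predictions.tolist()}

The API would then be served with an ASGI server inside the Docker image built by the Azure pipeline, for example uvicorn main:app --host 0.0.0.0 --port 8000, assuming the file above is named main.py.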

 

Verdict

This work proposes an MLOps solution for scalable and repeatable end-to-end ML implementation. The solution is set up using infrastructure-as-code and contains semi-automated steps for the data preprocessing and end-to-end machine learning development workflows. As observed throughout development, MLOps increases quality, simplifies the management process and automates the deployment of ML models in a large-scale production environment. Thus, it becomes easier to align models with business needs.

MLOps creates a wide array of benefits, namely:

  • MLOps orchestrates the entire development process
  • MLOps monitors data drift
  • MLOps leverages agile methods
  • MLOps promotes truly reusable components
  • MLOps versions both data and models
  • MLOps facilitates comparison between models and artefacts

In contrast, there are also some limitations worth mentioning, such as the lack of a common definition of data within the data pipeline and a dependence on certain tech stacks that limits the generalisation of certain procedures.

All in all, MLOps is a set of principles for establishing a common working framework when implementing ML solutions to meet business operationalisation needs. However, the practical implementation of MLOps depends on each specific case. Experiment with it and pick the solution that best fits your business needs.

Footnotes