martinboeckling

Introduction

This blog post is part of a series on using MLflow with HANA Machine Learning, co-authored by @stojanm and @martinboeckling. Here we take a more technical deep dive into setting up an MLflow instance and give a general introduction to how machine learning models trained with HANA ML can be logged with MLflow. The first post of the series, Tracking HANA Machine Learning experiments with MLflow: A Conceptual Guide for MLOps, introduces the topic of MLOps with MLflow.

Starting with version 2.13 of the Python HANA ML package, HANA Machine Learning supports experiment tracking with the MLflow package, which makes it easy to incorporate models developed with HANA Machine Learning into a comprehensive MLOps pipeline.

In this blog post we give an overview of how MLflow can be used together with HANA ML. MLflow, which handles experiment tracking and artefact management, can run as a managed service on a hyperscaler platform, or be deployed locally or on remote infrastructure. In the following, we describe how to deploy MLflow on SAP Business Technology Platform and how to track your HANA machine learning experiments with MLflow. In addition, we present which methods and algorithms in the hana-ml package currently support the experiment tracking feature. Finally, we touch on how models logged in MLflow can be used for prediction.

Prerequisites

In this blog post we focus solely on the technical integration of HANA ML with MLflow as a logging platform. We assume that Python is already installed and that a development environment is in place. Furthermore, we do not explain Docker and Cloud Foundry in full detail, but focus on the parts that are essential for HANA ML and MLflow.

Set up MLflow on BTP

MLflow is integrated into a number of solutions. For example, Databricks as well as Machine Learning in Microsoft Fabric integrate or provide a managed MLflow instance. In case MLflow is not yet available to you, this section outlines one way to deploy MLflow on SAP BTP. For simplicity, we focus on the SQLite-based deployment of MLflow. For productive environments, however, it is recommended to separate the storage from the runtime of the MLflow instance. A detailed explanation of setting up your own MLflow server with alternatives to SQLite can be found under the following link: https://mlflow.org/docs/latest/tracking/server.html. In the following paragraphs, we give a step-by-step overview of setting up your own MLflow instance using SQLite on BTP.

As a first step, we create a local Dockerfile from which we build the image that is later deployed to our BTP environment. The following snippet contains the instructions used to build the Docker image locally. Paste it into a file called Dockerfile in your desired local folder.

# Use a slim Python image as the base runtime for our Docker container
FROM python:3.10-slim
# Install the mlflow package with pip
RUN pip install mlflow
# Create a temporary folder inside the container for the SQLite database and the artifacts
RUN mkdir -p /mlflow
# Expose port 7000 so that the application running inside the container is accessible on that port
EXPOSE 7000
# Define the environment variables BACKEND_URI and ARTIFACT_ROOT for the backend store URI and the
# artifact root (four slashes after "sqlite:" denote the absolute path /mlflow/mlflow.db)
ENV BACKEND_URI=sqlite:////mlflow/mlflow.db
ENV ARTIFACT_ROOT=/mlflow/artifacts
# Start the MLflow server inside the container; artifacts uploaded through the server are stored under ARTIFACT_ROOT
CMD mlflow server --backend-store-uri ${BACKEND_URI} --artifacts-destination ${ARTIFACT_ROOT} --host 0.0.0.0 --port 7000
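
The backend store URI in the Dockerfile follows the SQLite URI convention that MLflow takes over from SQLAlchemy: the prefix sqlite:/// is followed by the database path, so a relative path ends up with three slashes and an absolute path with four. The following minimal sketch only illustrates that convention; the helper name sqlite_backend_uri is hypothetical and not part of MLflow.

```python
from pathlib import PurePosixPath


def sqlite_backend_uri(db_path: str) -> str:
    """Return a SQLite URI for a POSIX path (hypothetical helper for illustration)."""
    # PurePosixPath normalises repeated separators such as "/mlflow//mlflow.db".
    return "sqlite:///" + str(PurePosixPath(db_path))


# A relative path is resolved against the working directory of the server process.
print(sqlite_backend_uri("mlflow.db"))
# An absolute path leads to four slashes in total.
print(sqlite_backend_uri("/mlflow/mlflow.db"))
```

Using an absolute path keeps the location of the database independent of the working directory in which the server is started.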

After creating the Dockerfile, run docker build -t {tagname} . (note the trailing dot, which sets the build context to the current folder) to build your Docker image locally. To make the image available, we push it to a Docker registry. We assume that you already have a registry set up to which you can push your image. For that step, run the following commands: docker tag {tagname} {dockerhub repository tag} and docker push {dockerhub repository tag}. After the commands have completed successfully, the newly published image, which contains MLflow and all its dependencies, appears in your private Docker Hub repository.

After publishing your Docker image to your registry, run the following command to create a BTP app based on the published image:

cf push APP-NAME --docker-image REPO/IMAGE:TAG

After the Docker image has been deployed to our BTP Cloud Foundry environment, we can find the app in our BTP account and access it under the published URL.
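
Before pointing any client at the new app, it can be useful to confirm that the server answers. The following is a minimal sketch using only the Python standard library. It assumes that the MLflow tracking server exposes its /health endpoint under the app's route; the function names and the URL in the comment are hypothetical placeholders.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health-check URL for an MLflow tracking server."""
    return base_url.rstrip("/") + "/health"


def is_server_up(base_url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> bool:
    """Return True if the server answers the health check with HTTP 200.

    `opener` is injectable so that the function can be exercised without a network.
    """
    try:
        with opener(health_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError.
        return False


# Usage with a hypothetical route; replace it with the URL shown for your app in BTP:
# is_server_up("https://my-mlflow-app.cfapps.example.com")
```

If the check fails, the application logs in Cloud Foundry are usually the first place to look.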

[Figure: MLflow initial UI]

Set up tracking for MLflow

With MLflow, users can track their trained HANA ML models. In this section, we introduce what is needed to log HANA ML models to MLflow.

To use MLflow together with HANA ML, we first need to install the MLflow package in addition to the HANA ML package. Run the following command in your virtual environment so that the scripts below can be executed.

pip install mlflow hana-ml

As a general setup, we first need to point MLflow to our available MLflow instance by setting our personal MLflow tracking URI and, optionally, a custom experiment; the corresponding calls are shown in the training section of this post. If you do not create a separate experiment, the runs and the MLflow model are stored under the default experiment.

The method that allows us to track HANA ML models is implemented in the HANA ML package and is called enable_mlflow_autologging(schema=None, meta=None, is_exported=False, registered_model_name=None). This method can be called on initialised HANA ML models of the following classes:

The method enable_mlflow_autologging accepts several keyword arguments that influence the behaviour of the MLflow autologging in HANA ML.

schema: Defines the HANA database schema in which the MLflow logging table is stored
meta: Defines the name of the model storage table in the HANA database
is_exported: Determines whether the HANA model binaries are exported to MLflow
registered_model_name: Name of the model stored in MLflow

In the following section, we show how MLflow logging can be used with the Unified Interface.

Run HANA ML Algorithms with MLflow

In the sections above, we created an MLflow instance and introduced the syntax needed to log HANA ML models in MLflow. In the following sections, we use an example to show how HANA ML models are logged to MLflow.

Model training of HANA ML with MLflow

For training HANA ML models in combination with MLflow, this blog post focuses on the Unified Method. We apply a classification to the sample bank dataset, which can be found in the HANA ML sample dataset folder on GitHub.

The dataset can either be uploaded directly to the SAP HANA database, or you can use SAP Datasphere as your starting point. Generally, to use HANA ML directly, the dataset needs to be stored in a HANA database. However, HANA ML also provides methods to integrate third-party files and data structures, including Pandas, Spark and shapefiles. In addition, HANA Data Lake file tables can be integrated with HANA ML functionality. An overview of the different methods can be found under the following page. In the following paragraphs, we go through the sample code we created to combine HANA ML and MLflow.

Connect to HANA database (Deployed under SAP Datasphere)

To connect to the HANA database instance, we first need to establish a connection. In our example, we load the data from the data samples provided by HANA ML. At the time of writing, the OpenSQL schema of Datasphere only supports basic authentication. Therefore, this blog post only covers connecting via basic authentication. SAP HANA standalone, however, also supports other authentication methods, and the HANA ML package supports them as well, for example certificate-based connections to the SAP HANA instance.

To establish the connection to the HANA database, we use the ConnectionContext class from the hana_ml.dataframe module and store the connection instance in the variable conn. To reference a table or view in the HANA database, we would then call the table method on the connection. The beautiful aspect is that the dataset is not loaded into the Python runtime, but is only represented by a proxy to the actual table in the HANA database. All transformations done through HANA ML methods are pushed down to the database and executed there when the training runs. In our case, we load the sample dataset into our database using the methods provided by the HANA ML package.

"""
This script provides a short example of how HANA ML and MLflow can be integrated.
The database credentials follow the authentication method (Basic) currently supported by the
SAP Datasphere OpenSQL schema. HANA Cloud standalone supports several other authentication
methods as well. We use a helper Python file (constants) from which we retrieve the securely stored
authentication properties. For more details about the exact method signature, please
have a look at the documentation:
https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.07/en-US/hana_ml.dataframe.html#hana_ml.dataframe.ConnectionContext
"""
from hana_ml import dataframe
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
from hana_ml.algorithms.pal.utility import DataSets
import mlflow
# schema_name (target schema for the sample data) is assumed to be defined in constants as well
from constants import db_url, db_user, db_password, schema_name
# dataset retrieval
conn = dataframe.ConnectionContext(address=db_url, port=443, user=db_user, password=db_password)
dataset_data, training_data, _, test_data = DataSets.load_bank_data(connection=conn, schema=schema_name, train_percentage=0.7, valid_percentage=0, test_percentage=0.3, seed=43)
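
The constants file itself is not shown in this post. One possible way to write such a helper, sketched here under the assumption that the credentials are provided as environment variables, is the following. The variable names HANA_DB_URL, HANA_DB_USER and HANA_DB_PASSWORD as well as both function names are hypothetical.

```python
import os


def read_setting(name: str) -> str:
    """Return a required setting from the environment or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def load_connection_settings() -> dict:
    """Collect the values that the scripts import from constants.

    The environment variable names are assumptions made for this sketch.
    """
    return {
        "db_url": read_setting("HANA_DB_URL"),
        "db_user": read_setting("HANA_DB_USER"),
        "db_password": read_setting("HANA_DB_PASSWORD"),
    }
```

In constants.py, the returned dictionary could then be unpacked into the module-level names db_url, db_user and db_password that the scripts import. Failing early on a missing variable avoids a less obvious connection error later on.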

After connecting to the database, the user can apply the preprocessing methods implemented in HANA ML. These operations are pushed down to the HANA database and are not executed within the Python runtime. In our use case we do not need any preprocessing, as we retrieve a sample dataset that can be used directly for our ML training.

After finishing any required transformations, we can set up the tracking of our HANA ML runs with MLflow. As in the normal usage of MLflow, we first set the tracking URI under which we want to store our HANA ML runs and models. In your case, replace the variable mlflow_tracking_uri with the URL of your MLflow tracking server. We can then specify the experiment name under which the runs are tracked. If we do not specify an experiment, the runs are tracked under the Default experiment.
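
To make the idea of a proxy with pushdown more tangible, here is a deliberately simplified sketch. It is not how HANA ML is implemented: the class LazyTable is hypothetical, and sqlite3 from the standard library stands in for the HANA database. The point is only that select and filter merely assemble a SQL statement, and that data is transferred when collect is called.

```python
import sqlite3


class LazyTable:
    """Toy proxy for a database table: it records operations and builds SQL lazily."""

    def __init__(self, conn, table, columns="*", conditions=()):
        self._conn = conn
        self._table = table
        self._columns = columns
        self._conditions = tuple(conditions)

    def select(self, *columns):
        # Returns a new proxy; no data is read yet.
        return LazyTable(self._conn, self._table, ", ".join(columns), self._conditions)

    def filter(self, condition):
        # Conditions are only collected here and combined with AND later.
        return LazyTable(self._conn, self._table, self._columns, self._conditions + (condition,))

    def sql(self):
        statement = f"SELECT {self._columns} FROM {self._table}"
        if self._conditions:
            statement += " WHERE " + " AND ".join(f"({c})" for c in self._conditions)
        return statement

    def collect(self):
        # Only now is the statement sent to the database and the result fetched.
        return self._conn.execute(self.sql()).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (id INTEGER, age INTEGER, label TEXT)")
conn.executemany("INSERT INTO bank VALUES (?, ?, ?)", [(1, 25, "no"), (2, 41, "yes"), (3, 58, "yes")])

older = LazyTable(conn, "bank").filter("age > 30").select("id", "label")
print(older.sql())      # SELECT id, label FROM bank WHERE (age > 30)
print(older.collect())  # [(2, 'yes'), (3, 'yes')]
```

The toy class concatenates SQL strings without any escaping, which is acceptable for an illustration but not for production code.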

# set up MLFlow
mlflow.set_tracking_uri(mlflow_tracking_uri)
mlflow.set_experiment("HANA ML Experiment")

In the following sections, we outline how the training is performed and which components are logged to MLflow.

Unified Method

For the example, we use the Hybrid Gradient Boosting Tree as our classification algorithm and run it through the Unified Classification interface. On the resulting object, we then call the enable_mlflow_autologging method, which lets us log the model directly using the implemented autologging behaviour.

uc = UnifiedClassification(func="HybridGradientBoostingTree")
uc.enable_mlflow_autologging()
uc.fit(training_data, key="ID", label="LABEL")

We call the fit method once we have initialised the HANA ML model variable and enabled autologging for MLflow. For the fit method, there are two options. First, we can train on the unpartitioned training dataset. Second, we can let fit partition the training dataset, which creates a validation dataset on which metrics are logged automatically during training.

If we do not define a partitioning for the fit call, no metrics are logged in MLflow. The following image shows what a tracked HANA ML run looks like in MLflow, together with the stored HANA ML model.

[Figure: MLflow initial model]

If we decide to partition our dataset, here for instance along the defined primary key, evaluation metrics relevant for the classification are logged directly. These are the following metrics: AUC, Recall, Precision, F1 Score, Accuracy, Kappa coefficient and the Matthews Correlation Coefficient (MCC). This allows us to compare multiple runs within our MLflow project with one another and to measure their performance.

[Figure: MLflow metric model]
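
For readers less familiar with these metrics, the following sketch shows how most of them follow from the four counts of a binary confusion matrix. It is purely illustrative: the metrics logged by HANA ML are computed in the database, the counts used here are made up, and the function name is hypothetical. AUC is left out because it requires the predicted scores rather than the counts.

```python
import math


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa compares the observed agreement with the agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    kappa = (accuracy - expected) / (1 - expected)
    # The Matthews correlation coefficient uses all four cells of the matrix.
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Made-up counts: 40 true positives, 10 false positives, 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch does not guard against empty classes, where some of the denominators become zero.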

In addition to the general run, HANA ML also logs the model to MLflow. What is logged to MLflow depends on the parameters set for the method enable_mlflow_autologging. If, for instance, all parameters are left at their defaults, the following YAML file is logged to MLflow.

[Figure: MLflow model]

If the parameter is_exported of the method enable_mlflow_autologging is set to True, the model binaries stored in the model storage on HANA are exported to MLflow. This setting allows us to retrieve the trained model from MLflow and use it in a different HANA database for prediction purposes. In addition to the YAML file containing the metadata, we can now see a subfolder called models, which contains the model artefacts that are normally stored only in the HANA database.

[Figure: MLflow exported model]

After the training is finished, we can track further artefacts in MLflow in addition to what the autologging of HANA ML records. In the following section, we outline a few of the possibilities that this additional tracking offers.

Additional logging possibilities

Besides the autologging capabilities outlined above, we can use MLflow to track additional artefacts for the respective run. In the following sections, we outline selected possibilities to further enrich the autologging for HANA ML runs tracked in MLflow.

Adding run and experiment description

The description in the experiment section comes in handy once the number of experiments in your repository grows. In addition, MLflow also allows you to add an individual description to each run of an experiment. Using the following methods, you can set up both:

from mlflow.tracking import MlflowClient

client = MlflowClient()
# look up the experiment created earlier to obtain its id
experiment = mlflow.get_experiment_by_name("HANA ML Experiment")
# mlflow.active_run() returns None when no run is currently active
run = mlflow.active_run()
# the tag "mlflow.note.content" holds the description shown in the MLflow UI
client.set_experiment_tag(experiment.experiment_id, "mlflow.note.content",
                          "This experiment shows the automated methods of HANA machine learning and how to track them with MLflow")
client.set_tag(run.info.run_id, "mlflow.note.content", "This is a run tracked with Unified Classification from HANA Machine Learning")

Logging input datasets

Sometimes it is important to keep the input dataset as part of the tracking with MLflow as well. Since HANA machine learning datasets reside in HANA, they need to be collected into pandas DataFrames before they can be tracked, as shown in the following code:

# Store training dataset in MLFlow itself
pandas_training_dataset = training_data.collect()
mlflow_dataset = mlflow.data.from_pandas(pandas_training_dataset, name="Customer data", targets="LABEL")
mlflow.log_input(mlflow_dataset, context='training')

As a result, the current state of the training data is logged to the current run. The logged dataset can be found in the associated MLflow run, where the schema of the dataset is shown together with metadata such as the number of rows and the number of elements. In addition, the provided context is shown in the MLflow UI.

Logging a model report

In addition to logging the dataset, it can also be important to add a model report to MLflow. HANA ML provides different interactive visualisations for the trained model, which can be stored as an HTML file. After saving the model report locally, we can log it as an artefact of our current run. This makes the model report generated by HANA ML accessible in MLflow, where it can be explored interactively. To log the HANA ML model report, you can use the following code snippet.

# create additional model report in MLFlow
UnifiedReport(uc).display(save_html="UnifiedReport")
mlflow.log_artifact("UnifiedReport_unified_classification_model_report.html")

After the model report has been stored successfully under the current run, the interactive model report is available in the artefact tab in MLflow.

The complete script used for this section can be found in the following code snippet:

from hana_ml import dataframe
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
from hana_ml.visualizers.unified_report import UnifiedReport
import mlflow
from hana_ml.algorithms.pal.utility import DataSets
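# schema_name and tracking_uri are assumed to be defined in constants alongside the credentials
from constants import db_url, db_user, db_password, schema_name, tracking_uri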

# dataset retrieval
conn = dataframe.ConnectionContext(address=db_url,
                                   port=443,
                                   user=db_user,
                                   password=db_password)

dataset_data, training_data, _, test_data = DataSets.load_bank_data(connection=conn, schema=schema_name, train_percentage=0.7, valid_percentage=0, test_percentage=0.3, seed=43)

# set up MLflow
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment("HANA ML Experiment")
# set up classification
uc = UnifiedClassification(func="HybridGradientBoostingTree")
uc.enable_mlflow_autologging(is_exported=True)
# train model
uc.fit(training_data, key="ID", label="LABEL", partition_method="stratified", stratified_column="ID", partition_random_state=43, build_report=True)
# create additional model report in MLFlow
UnifiedReport(uc).display(save_html="UnifiedReport")
mlflow.log_artifact("UnifiedReport_unified_classification_model_report.html")
# Store training dataset in MLFlow itself
pandas_training_dataset = training_data.collect()
mlflow_dataset = mlflow.data.from_pandas(pandas_training_dataset, name="Customer data", targets="LABEL")
mlflow.log_input(mlflow_dataset, context='training')

Applying the trained model

After the training has finished, we can use HANA ML to retrieve the model from MLflow and use it for prediction. For this purpose, we create a separate Python script in which we show how to retrieve the trained MLflow model.

As in our training script, we first set up our connection to the HANA database and obtain a reference to our table. In our case, we simply use the sample dataset provided by HANA ML.

from hana_ml import dataframe
from hana_ml.model_storage import ModelStorage
import mlflow
from hana_ml.algorithms.pal.utility import DataSets
# schema_name and tracking_uri are assumed to be defined in constants alongside the credentials
from constants import db_url, db_user, db_password, schema_name, tracking_uri

# dataset retrieval
conn = dataframe.ConnectionContext(address=db_url,
                                   port=443,
                                   user=db_user,
                                   password=db_password)

dataset_data, training_data, _, test_data = DataSets.load_bank_data(connection=conn, schema=schema_name, train_percentage=0.7, valid_percentage=0, test_percentage=0.3, seed=43)

As in our training script, we need to set the tracking URI for MLflow and initialise the model storage of HANA. If we decided not to export the HANA ML model to MLflow, we need to specify the same schema for the model storage in which our HANA ML model was stored after the successful run. If we exported our model, we can specify a different schema. The following script retrieves the logged HANA ML model from MLflow.

# set up MLflow and model storage
mlflow.set_tracking_uri(tracking_uri)
model_storage = ModelStorage(connection_context=conn, schema=schema_name)

After the model storage has been initialised, we can retrieve the stored HANA ML model from MLflow. To select the correct model, you need the run id associated with the model you would like to apply to your prediction dataset. In our case, this is the test dataset we received from the sample dataset method. The model_uri needed for the model retrieval follows the pattern 'runs:/{run id}/model', in which you replace the run id with that of your respective run. For the actual retrieval, we call the method load_mlflow_model on the initialised model storage, in our case called model_storage, which loads the MLflow model into our HANA database and assigns the respective proxy to our variable mymodel. The variable mymodel is then used to call the predict method, which applies our model to our dataset. Finally, we collect the prediction dataset into a Pandas DataFrame to look at its content. Normally, we could instead persist the created temporary table directly with the save method and thereby make the dataset available for further processing.

# load logged run from MLflow to HANA ML
logged_model = 'runs:/d8a763b7b81940598633605e447cd880/model'
mymodel = model_storage.load_mlflow_model(connection_context=conn, model_uri=logged_model)
dataset_data_predict = mymodel.predict(data=test_data, key="ID")
# collect the predicted dataset to see content in dataframe
print(dataset_data_predict.collect())

After running the script, you should see the following terminal output, which shows the download of the artefact stored in MLflow and the resulting prediction dataset. In our case, it consists of 4 columns: ID (primary key), SCORE (predicted label), CONFIDENCE (prediction confidence for the respective row) and REASON_CODE (influence of the individual variables on the prediction output).

[Figure: Terminal output MLflow HANA model]

If we have exported our model, the terminal output looks slightly different, indicating that the model artefacts stored in addition to the YAML file are downloaded as well. The complete script used for applying the model to a new dataset is shown below.

from hana_ml import dataframe
from hana_ml.model_storage import ModelStorage
from hana_ml.algorithms.pal.utility import DataSets
import mlflow
# schema_name and tracking_uri are assumed to be defined in constants alongside the credentials
from constants import db_url, db_user, db_password, schema_name, tracking_uri

# dataset retrieval
conn = dataframe.ConnectionContext(address=db_url,
                                   port=443,
                                   user=db_user,
                                   password=db_password)

dataset_data, training_data, _, test_data = DataSets.load_bank_data(connection=conn, schema=schema_name, train_percentage=0.7, valid_percentage=0, test_percentage=0.3, seed=43)

# set up MLFlow and model storage
mlflow.set_tracking_uri(tracking_uri)
model_storage = ModelStorage(connection_context=conn, schema=schema_name)
# load logged run from MLflow to HANA ML
logged_model = 'runs:/ed7b8d4734cb42ca90c417f932957b40/model'
mymodel = model_storage.load_mlflow_model(connection_context=conn, model_uri=logged_model)
dataset_data_predict = mymodel.predict(data=test_data, key="ID")
# collect the predicted dataset to see content in dataframe
print(dataset_data_predict.collect())
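
Since the run id has to be copied by hand into the 'runs:/{run id}/model' pattern, a small check can catch copy-and-paste mistakes early. The helper below is a hypothetical sketch and not part of MLflow or HANA ML. It assumes that run ids are 32-character lowercase hexadecimal strings, as is the case for the ids shown in this post.

```python
import re

# Assumption: run ids look like the ones in this post (32 lowercase hex characters).
_RUN_ID = re.compile(r"[0-9a-f]{32}")


def model_uri_for_run(run_id: str, artifact_path: str = "model") -> str:
    """Build a 'runs:/<run id>/<artifact path>' URI after a basic sanity check."""
    run_id = run_id.strip()
    if not _RUN_ID.fullmatch(run_id):
        raise ValueError(f"Not a 32-character hexadecimal run id: {run_id!r}")
    return f"runs:/{run_id}/{artifact_path}"


print(model_uri_for_run("d8a763b7b81940598633605e447cd880"))
# runs:/d8a763b7b81940598633605e447cd880/model
```

The returned string can be passed as model_uri to load_mlflow_model in the script above.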

Key takeaways

In this blog post we have shown an end-to-end example of how MLflow can be integrated into your HANA ML workload, making it possible to share and compare multiple tracked runs in MLflow. If the data is already stored in HANA, you can interact with MLflow directly while running your machine learning algorithms on the data in the HANA database, without having to transfer your data between multiple systems. This post covered an essential part of the automated logging capabilities for HANA ML models in MLflow.

We highly appreciate your thoughts, comments and questions under this blog post. If you have general questions about HANA, or specifically HANA ML, don't hesitate to use the Q&A tool with the tags that describe your question.

Tracking HANA Machine Learning experiments with MLflow: A technical Deep Dive