
Building a Cloud-Based ML Pipeline for Diabetes Prediction

Azure Data Science Associate DP-100

Introduction

This project aims to develop a machine-learning pipeline for predicting diabetes onset in individuals. By leveraging the cloud-based environment of Azure Machine Learning Studio (classic), this pipeline will streamline the process of data preparation, model training, evaluation, and deployment.

The core objective is to construct a robust and scalable model capable of identifying individuals at risk of developing diabetes. This model will be trained on a relevant dataset containing patient information known to influence diabetes development.

The project will utilize Azure ML Studio's intuitive drag-and-drop interface to create the pipeline. This visual approach will enable efficient experimentation with different machine-learning algorithms and hyperparameters to optimize the model's performance.

Following successful training, the model will be deployed as a web service on Azure. This deployment will allow for real-world application of the model's predictions, potentially aiding in early diabetes detection and preventative healthcare measures.


To create a student account in Azure:

Navigate to the Azure for Students page. Sign up or log in using a Microsoft account linked to your school email or a personal account.

For school email verification, if your institution is part of Azure for Students, use your school email. For personal email, additional verification steps will be required.

Follow the prompts to verify your student status, which may include a code sent to your phone or details about your school.


Once verified, you will have an Azure student account with $100 of credit for the first year and access to various free services.


Then create an Azure Machine Learning workspace, which provides a free-of-charge development environment.


Data

The dataset contains information about each patient and whether or not they are diabetic.

Columns:

Pregnancies

Glucose

BloodPressure

SkinThickness

Insulin

BMI

DiabetesPedigreeFunction

Age

Outcome

This data can be loaded into Azure ML Studio by going to Assets -> Datasets -> Create New Dataset, then selecting "From web files" from the drop-down.

Then enter the dataset's web URL, a name (e.g., Diabetes.csv), the type (Tabular), and a description if needed.

In Settings and preview, confirm the encoding (UTF-8), the delimiter (Comma), and that the column header option is set to "Use headers from the first file".

Then validate the schema and other details.
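The settings above can also be sanity-checked locally before uploading. This is a minimal pandas sketch (not part of the Azure workflow) that parses a two-row stand-in for the file using the same encoding, delimiter, and header settings; the sample values are illustrative:

```python
import io
import pandas as pd

# A two-row stand-in for Diabetes.csv, matching the columns listed above
csv_text = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# Same settings as the Azure ML preview: UTF-8 encoding, comma delimiter,
# headers taken from the first row
df = pd.read_csv(io.StringIO(csv_text), encoding="utf-8", sep=",")

print(list(df.columns))
print(len(df))  # 2 rows
```

If the column list or row count does not match expectations here, the same mismatch will show up in the Azure ML schema-validation step.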




Creating an ML Workspace and Uploading the Diabetes Dataset:
  1. Access the Azure portal: Log in to the Azure portal using your student account credentials.

  2. Create a resource group: A resource group helps organize Azure resources logically. Click on "Create a resource" and search for "Resource group." Give it a descriptive name, like "ML-Workspace-RG," and choose a location.

  3. Create the ML workspace: In the Azure portal search bar, look for "Machine Learning" and select "Machine learning service." Under "Create," choose "Workspace." Provide a name (e.g., "DiabetesMLWorkspace"), subscription (your student account), resource group (the one you created), location, and pricing tier (select "Free" for basic experimentation). Click "Create" to provision the workspace.

  4. Access the workspace: Once created, navigate to your workspace in the Azure portal.

  5. Upload the Diabetes dataset to Blob storage:

  • Create a storage account: In the Azure portal search bar, type "Storage account" and select "Storage account." Choose a name (e.g., "DiabetesDataStorage"), resource group, location, performance tier (e.g., "Standard"), replication option (locally redundant storage is common), and account tier (select "Standard" for general-purpose storage). Click "Create" to provision the storage account.

  • Get connection string: Navigate to your storage account in the Azure portal. Under "Settings," open "Access keys" and copy the primary connection string for Blob storage.

  • Azure portal: Go to your storage account in the Azure portal, and under "Blob service," click "Blobs." Create a container (e.g., "diabetes-data") to hold the dataset. Upload the Diabetes dataset files (usually .csv or .tsv format) by dragging and dropping them into the container.

Why Blob Storage for Data Lakes and SQL Database for Structured Data?

  • Blob storage is ideal for data lakes due to its:

      • Scalability: Handles massive datasets efficiently.

      • Cost-effectiveness: Stores various data formats (structured, unstructured, semi-structured) at a lower cost than managed databases like SQL Database.

      • Flexibility: Supports various data access patterns (streaming, batch processing, etc.).

  • SQL Database:

      • Suitable for structured, relational data requiring fast queries and ACID (Atomicity, Consistency, Isolation, Durability) transactions.

      • Provides a familiar SQL interface for querying and managing data.

      • Offers built-in security features for data access control.


Accessing Azure ML Studio and Creating an ML Pipeline with Two-Class Logistic Regression for Diabetes Data


1. Access Azure ML Studio:

  • Navigate: In the Azure portal, locate "Machine Learning" and select "Azure Machine Learning Studio" (classic experience).

2. Create a New Experiment:

  • Click "Experiments" on the left menu.

  • Select "New."

  • Choose "Blank Experiment" and provide a name (e.g., "DiabetesPrediction").

3. Import the Diabetes Dataset:

  • Drag and drop the "From Blob Storage" module onto the canvas.

  • Connect it to the "Experiment" start.

  • Configure the module:

  • Subscription: Select your Azure subscription.

  • Storage Account: Choose the storage account containing the Diabetes dataset.

  • Container: Specify the container where the dataset resides (e.g., "diabetes-data").

  • File Path: Enter the path to the specific data file (e.g., "diabetes.csv").

4. Data Cleaning:

  • Drag and drop the "Select Columns" module.

  • Connect it to the "From Blob Storage" module.

  • Select only the relevant features for prediction (e.g., Glucose, BMI, etc.).

5. Data Splitting:

  • Drag and drop the "Split Data" module.

  • Connect it to the "Select Columns" module.

  • Configure "Split Data":

  • Ratio: Set the desired split ratio for training and testing data (e.g., 80%/20%).

  • Random Seed: Optionally, use a seed for reproducibility.
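Outside the designer, the same 80%/20% split can be sketched with scikit-learn's train_test_split as a local stand-in for the "Split Data" module; the toy data below is random and purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (100 rows, 8 features) and binary Outcome labels
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)

# 80%/20% split with a fixed random seed for reproducibility,
# mirroring the "Split Data" module settings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 80 20
```

Fixing random_state plays the same role as the module's "Random Seed" option: the same rows land in the same partition on every run.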

6. Data Normalization:

  • Consider normalization (e.g., Min-Max Scaling) if features have different scales. You can use the "Normalize" module after "Select Columns" if needed.
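As a local illustration of what Min-Max Scaling does to features on very different scales, here is a small scikit-learn sketch; the feature values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales
# (e.g., Glucose vs. DiabetesPedigreeFunction)
X = np.array([[148.0, 0.627],
              [85.0, 0.351],
              [183.0, 0.672]])

# Min-Max scaling maps each column independently to the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```

After scaling, no single feature dominates distance or gradient computations purely because of its units.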

7. Two-Class Logistic Regression:

  • Drag and drop the "Two-Class Logistic Regression" module.

  • Connect it to the "Split Data" (training data output).

  • Configure "Two-Class Logistic Regression":

  • Target Variable: Specify the Outcome column as the target variable.

  • Maximum Iterations: Set the maximum number of iterations the optimizer runs during training.
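For intuition, the module's behavior can be approximated locally with scikit-learn's LogisticRegression; the toy data and max_iter value below are illustrative, not taken from the pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature (e.g., Glucose) and a binary Outcome label
X_train = np.array([[85], [90], [100], [140], [150], [180]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# max_iter plays the same role as the module's "Maximum Iterations" setting:
# it caps how long the optimizer is allowed to run
model = LogisticRegression(max_iter=100)
model.fit(X_train, y_train)

print(model.predict([[95], [160]]))  # predicted Outcome for two new patients
```

The fitted model learns a decision boundary between the low- and high-glucose groups, so a value of 95 falls on the non-diabetic side and 160 on the diabetic side.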

8. Model Evaluation:

  • Drag and drop the "Evaluate Model" module.

  • Connect it to the "Split Data" (testing data output).

  • Connect "Two-Class Logistic Regression" to "Evaluate Model" as well.

  • Configure "Evaluate Model":

  • Metrics: Review relevant classification metrics, such as Accuracy, Precision, Recall, and AUC.
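Since the trained model is a two-class classifier, the usual metrics are accuracy, precision, recall, and AUC. They can be computed locally with scikit-learn on hypothetical labels and scored probabilities, as a reference for what "Evaluate Model" reports:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical test-set labels and model outputs (illustrative values)
y_true = np.array([0, 0, 1, 1, 1, 0])   # actual Outcome
y_pred = np.array([0, 0, 1, 1, 0, 0])   # predicted labels
y_prob = np.array([0.1, 0.2, 0.9, 0.8, 0.4, 0.3])  # scored probabilities

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", round(recall_score(y_true, y_pred), 3))
print("AUC:", roc_auc_score(y_true, y_prob))
```

Note that AUC is computed from the scored probabilities, not the thresholded labels, which is why a model can have a perfect AUC while still misclassifying some points at the default 0.5 threshold.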

9. Model Training and Evaluation:

  • Click "Run" on the top menu.

  • Monitor the run: View the progress and logs for insights into data processing, training, and evaluation metrics.

10. Analyze Results:

  • Once the run finishes, explore the output:

  • Review the scored labels (the model's predicted outcomes) for the test data.

  • Analyze evaluation metrics to assess the model's performance.

  • Visualize the results if applicable (e.g., using an ROC curve or a confusion matrix).



Creating an Inference Pipeline for Real-Time Predictions

Now let's create an inference pipeline in Azure ML Studio to make real-time predictions on new data using your trained model:



1. Copying the Original Pipeline:

  • In Azure ML Studio, open your existing diabetes prediction experiment.

  • Right-click on the experiment canvas and select "Copy."

  • Right-click anywhere on the canvas and choose "Paste."

  • Rename the copied experiment (e.g., "DiabetesPrediction_Inference").

2. Deleting the Original Dataset:

  • In the copied pipeline, locate the "From Blob Storage" module that reads the training data.

  • Right-click on the module and select "Delete."

3. Adding "Enter Data Manually" Function:

  • Drag and drop the "Enter Data Manually" module onto the canvas.

  • Connect it to the starting point of the pipeline (the experiment input).

  • Configure "Enter Data Manually":

  • Columns: Define the columns (features) expected for prediction.

  • Data Types: Specify the data types for each column (e.g., integer, float).
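For reference, the text pasted into "Enter Data Manually" (CSV format) might look like the snippet below, matching the dataset's feature columns. The values are illustrative, and the Outcome column is omitted here because it is what the model predicts; adjust to match whatever schema your pipeline actually expects:

```
Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
3,148,82,31,168,34.9,0.242,27
```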

4. Python Script for User-Desired Output:

Explanation:

This Python script will sit between the "Score Model" module (where predictions are made) and the "Web Service Output" module (which sends results to the user interface). The script allows you to filter and present only the relevant information to the user.

import pandas as pd

# The "Execute Python Script" module calls this entry-point function;
# the "Score Model" output arrives as dataframe1.
def azureml_main(dataframe1=None, dataframe2=None):
    # Score Model places the predicted class in the "Scored Labels" column
    predicted_outcome = dataframe1["Scored Labels"].iloc[0]
    # Select and format only the desired output (customize this as needed)
    output_df = pd.DataFrame({"Predicted Outcome": [predicted_outcome]})
    # The returned DataFrame is serialized (e.g., to JSON) by the
    # "Web Service Output" module
    return output_df,
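A quick way to sanity-check this logic outside Azure is to call an azureml_main-style entry point locally with a toy scored DataFrame. The "Scored Labels" and "Scored Probabilities" column names used here are the ones "Score Model" normally emits, but verify them against your own scored output:

```python
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Keep only the predicted class from the scored output
    predicted_outcome = dataframe1["Scored Labels"].iloc[0]
    return pd.DataFrame({"Predicted Outcome": [predicted_outcome]}),

# Simulate the "Score Model" output with one scored row
scored = pd.DataFrame({
    "Glucose": [148],
    "BMI": [34.9],
    "Scored Labels": [1],
    "Scored Probabilities": [0.81],
})

(result,) = azureml_main(scored)
print(result["Predicted Outcome"].iloc[0])  # prints 1
```

Running this locally catches column-name typos before they surface as opaque module failures inside the pipeline.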


Implementation Steps:

  • Drag and drop the "Execute Python Script" module.

  • Connect it to the "Score Model" module.

  • Configure "Execute Python Script":

  • Paste the Python code snippet into the script editor.

  • Set the "Script Input" to the output from the "Score Model" module (replace with the actual output name in the code).

  • Set the "Script Output" to a new name (e.g., "Filtered Output").

5. Web Service Output:

  • Connect the "Filtered Output" from the Python script to the "Web Service Output" module.

  • Configure "Web Service Output":

  • Choose the type of output (e.g., JSON).

6. Deploying the Inference Pipeline and Creating an Endpoint:

  • Click "Run" on the top menu to test the pipeline locally.

  • Once satisfied, click the ellipsis (...) on the experiment toolbar.

  • Select "Create web service" -> "Real-time inference pipeline."

  • Follow the deployment wizard:

  • Name: Choose a descriptive name for your endpoint (e.g., "DiabetesPrediction_Inference").

  • Compute resource: Select an Azure Machine Learning compute instance to host the web service.

  • Environment: Choose the environment used by your pipeline (it should already be configured).

  • Other configurations: Follow the wizard's guidance to set any additional parameters (e.g., model registration if needed).

  • Click "Create" to deploy the pipeline as a real-time web service endpoint.

After deployment is complete, copy the following details into a notepad or somewhere safe:
  • Request URL: This is the URL of your deployed model endpoint.

  • Headers: If your endpoint requires authentication, copy the header name and value (e.g., "Authorization" and your API key).

  • Request Body: Copy the exact format and content of the data you sent for prediction.

  • Response Code: Note the HTTP status code returned by the endpoint.

  • Response Body: Copy the relevant parts of the response body that contain the model's prediction output.


Checking Deployment with Postman

1. Prerequisites:

  • Postman application: Download and install Postman from https://www.postman.com/downloads/.

  • API Key or Access Token: Depending on your deployment configuration, you might need an authentication key to access the endpoint. You can usually find this information in the deployment details within the Azure Machine Learning portal.

2. Create a Post Request:

  • Open Postman.

  • Click "New" to create a new request.

  • Enter the URL of your deployed model endpoint. You can find this URL in the deployment details within the Azure Machine Learning portal.

3. Set Request Method and Headers (Optional):

  • Method: Select "POST" as the HTTP method for real-time predictions.

  • Headers: If your endpoint requires authentication, add an "Authorization" header with your API key or access token. Refer to your deployment documentation for specific header requirements.

4. Prepare Request Body:

  • Go to the "Body" tab.

  • Choose the appropriate format for your input data:

  • x-www-form-urlencoded: Use this if your model expects key-value pairs for each feature. Example: Pregnancies=3&Glucose=148&BloodPressure=82&SkinThickness=31&Insulin=168&BMI=34.9&DiabetesPedigreeFunction=0.242&Age=27 (replace the values with your specific data).

  • raw: If your model accepts JSON, create a JSON object representing your data: { "Pregnancies": 3, "Glucose": 148, "BloodPressure": 82, "SkinThickness": 31, "Insulin": 168, "BMI": 34.9, "DiabetesPedigreeFunction": 0.242, "Age": 27 } (replace the values with your specific data).

5. Send the Request:

  • Click "Send" to send the request to your deployed model endpoint.

6. Verify Response:

  • Review the response code: A successful response will typically have a code like 200 (OK).

  • Examine the response body: This should contain the model's prediction output. The specific format (JSON, plain text, etc.) will depend on your model implementation.
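The same check can be scripted in Python. The sketch below only builds the request, so it runs without a live endpoint; the URL and key are placeholders for the values copied after deployment, and the actual POST (via the requests package) is left commented out:

```python
import json

# Placeholders for the details copied from the deployment page
scoring_url = "https://<your-endpoint>/score"  # Request URL from the portal
api_key = "<your-api-key>"                     # authentication key, if required

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

# Same JSON body as the Postman "raw" example
body = json.dumps({
    "Pregnancies": 3, "Glucose": 148, "BloodPressure": 82,
    "SkinThickness": 31, "Insulin": 168, "BMI": 34.9,
    "DiabetesPedigreeFunction": 0.242, "Age": 27,
})

# To send the request for real (requires the requests package):
# import requests
# response = requests.post(scoring_url, headers=headers, data=body)
# print(response.status_code, response.json())

print(json.loads(body)["Glucose"])  # prints 148
```

Keeping the payload construction separate from the send call makes it easy to unit-test the body format before touching the live endpoint.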


Comprehensive Project Conclusion:

This project successfully demonstrates the creation and deployment of a machine learning pipeline for diabetes prediction with two-class logistic regression on the Diabetes dataset, leveraging the capabilities of Azure Machine Learning Studio. The project addressed various aspects, resulting in a robust and user-centric solution:

1. Cost-Effective Development Environment:

  • Established an Azure student account, providing access to free resources for experimentation, promoting cost-conscious development.

2. Streamlined Data Management:

  • Employed Azure Blob storage, a highly scalable and cost-effective data lake option, to store the Diabetes dataset.

3. User-Friendly Machine Learning Pipeline Construction:

  • Utilized Azure Machine Learning Studio, a user-friendly platform, to create a modular pipeline for data processing, logistic regression model training, and model evaluation.

4. Robust Data Preprocessing:

  • Implemented data cleaning, splitting, and (optional) normalization steps, ensuring data quality and optimal model performance.

5. Supervised Learning for Diabetes Prediction:

  • Leveraged two-class logistic regression to predict diabetes onset from blood sugar levels and other relevant features, supporting early detection efforts.

6. Data-Driven Evaluation:

  • Employed suitable classification metrics to assess the logistic regression model's performance, providing objective feedback on its effectiveness.

7. Real-Time Prediction and User Interactivity:

  • Developed a separate inference pipeline in Azure ML Studio, enabling real-time predictions on new data points.

  • Incorporated a Python script within the pipeline to filter and format the model's output for user consumption, enhancing clarity and user experience.

8. Model Deployment and Accessibility:

  • Deployed the inference pipeline as a web service with an accessible endpoint, allowing users to interact with the model and obtain predictions.

9. Effective Model Testing:

  • Utilized Postman, a popular API testing tool, to verify the functionality of the deployed model endpoint, ensuring its reliability.

This project highlights the potential of Azure Machine Learning Studio for building and deploying user-friendly ML pipelines on a cost-effective platform. It showcases the effectiveness of two-class logistic regression for diabetes classification and the importance of data preprocessing and model evaluation for robust solutions. This foundation can be further expanded upon to develop more sophisticated machine learning models for complex tasks in the healthcare domain or beyond.



