ML.NET: update the clustering tutorial (#9620)
pkulikov authored and JRAlexander committed Dec 19, 2018
1 parent d5b9302 commit 3f5e4d0
Showing 2 changed files with 38 additions and 50 deletions.
86 changes: 37 additions & 49 deletions docs/machine-learning/tutorials/iris-clustering.md
@@ -3,7 +3,7 @@ title: Cluster iris flowers using a clustering learner - ML.NET
description: Learn how to use ML.NET in a clustering scenario
author: pkulikov
ms.author: johalex
ms.date: 07/02/2018
ms.date: 12/17/2018
ms.topic: tutorial
ms.custom: mvc, seodec18
#Customer intent: As a developer, I want to use ML.NET so that I can build a model to cluster iris flowers based on its parameters.
@@ -39,7 +39,7 @@ As you don't know to which group each flower belongs to, you choose the [unsuper

## Create a console application

1. Open Visual Studio 2017. Select **File** > **New** > **Project** from the menu bar. In the **New Project** dialog, select the **Visual C#** node followed by the **.NET Core** node. Then select the **Console App (.NET Core)** project template. In the **Name** text box, type "IrisClustering" and then select the **OK** button.
1. Open Visual Studio 2017. Select **File** > **New** > **Project** from the menu bar. In the **New Project** dialog, select the **Visual C#** node followed by the **.NET Core** node. Then select the **Console App (.NET Core)** project template. In the **Name** text box, type "IrisFlowerClustering" and then select the **OK** button.

1. Create a directory named *Data* in your project to store the data set and model files:

@@ -73,11 +73,11 @@ Create classes for the input data and the predictions:
1. In the **Add New Item** dialog box, select **Class** and change the **Name** field to *IrisData.cs*. Then, select the **Add** button.
1. Add the following `using` directive to the new file:

[!code-csharp[Add necessary usings](../../../samples/machine-learning/tutorials/IrisClustering/IrisData.cs#1)]
[!code-csharp[Add necessary usings](~/samples/machine-learning/tutorials/IrisFlowerClustering/IrisData.cs#Usings)]

Remove the existing class definition and add the following code, which defines the classes `IrisData` and `ClusterPrediction`, to the *IrisData.cs* file:

[!code-csharp[Define data classes](../../../samples/machine-learning/tutorials/IrisClustering/IrisData.cs#2)]
[!code-csharp[Define data classes](~/samples/machine-learning/tutorials/IrisFlowerClustering/IrisData.cs#ClassDefinitions)]

`IrisData` is the input data class and has definitions for each feature from the data set. Use the [Column](xref:Microsoft.ML.Runtime.Api.ColumnAttribute) attribute to specify the indices of the source columns in the data set file.

@@ -98,103 +98,91 @@ Go back to the *Program.cs* file and add two fields to hold the paths to the dat

Add the following code right above the `Main` method to specify those paths:

[!code-csharp[Initialize paths](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#1)]
[!code-csharp[Initialize paths](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#Paths)]

To make the preceding code compile, add the following `using` directives at the top of the *Program.cs* file:

[!code-csharp[Add usings for paths](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#2)]
[!code-csharp[Add usings for paths](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#UsingsForPaths)]

## Create a learning pipeline
## Create ML context

Add the following additional `using` directives to the top of the *Program.cs* file:

[!code-csharp[Add Microsoft.ML usings](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#3)]

In the `Main` method, replace the `Console.WriteLine("Hello World!")` with the following code:
[!code-csharp[Add Microsoft.ML usings](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#MLUsings)]

[!code-csharp[Call the Train method](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#4)]
In the `Main` method, replace the `Console.WriteLine("Hello World!");` line with the following code:

The `Train` method trains the model. Create that method just below the `Main` method, using the following code:
[!code-csharp[Create ML context](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#CreateContext)]

```csharp
private static PredictionModel<IrisData, ClusterPrediction> Train()
{
The <xref:Microsoft.ML.MLContext?displayProperty=nameWithType> class represents the machine learning environment and provides mechanisms for logging and entry points for data loading, model training, prediction, and other tasks. This is comparable conceptually to using `DbContext` in Entity Framework.

}
```
## Set up data loading

The learning pipeline loads all of the data and algorithms necessary to train the model. Add the following code into the `Train` method:
Add the following code to the `Main` method to set up data loading:

[!code-csharp[Initialize pipeline](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#5)]
[!code-csharp[Create text loader](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#SetupTextLoader)]

## Load and transform data
Note that the column names and indices match the schema defined by the `IrisData` class. The <xref:Microsoft.ML.Runtime.Data.DataKind.R4?displayProperty=nameWithType> value specifies the `float` type.

The first step to perform is to load the training data set. In our case, the training data set is stored in the text file with a path defined by the `_dataPath` field. Columns in the file are separated by the comma (","). Add the following code into the `Train` method:
Use the instantiated <xref:Microsoft.ML.Runtime.Data.TextLoader> instance to create an <xref:Microsoft.ML.Runtime.Data.IDataView> instance, which represents the data source for the training data set:

[!code-csharp[Add step to load data](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#6)]
[!code-csharp[Create IDataView](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#CreateDataView)]

The next step is to combine all of the feature columns into the **Features** column using the <xref:Microsoft.ML.Legacy.Transforms.ColumnConcatenator> transformation class. By default, a learning algorithm processes only features from the **Features** column. Add the following code:

[!code-csharp[Add step to concatenate columns](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#7)]
## Create a learning pipeline

## Choose a learning algorithm
For this tutorial, the learning pipeline of the clustering task comprises the following two steps:

After adding the data to the pipeline and transforming it into the correct input format, you select a learning algorithm (**learner**). The learner trains the model. ML.NET provides a <xref:Microsoft.ML.Legacy.Trainers.KMeansPlusPlusClusterer> learner that implements [k-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering) with an improved method for choosing the initial cluster centroids.
- concatenate loaded columns into one **Features** column, which is used by a clustering trainer;
- use a <xref:Microsoft.ML.Trainers.KMeans.KMeansPlusPlusTrainer> trainer to train the model using the k-means++ clustering algorithm.

Add the following code into the `Train` method following the data processing code added in the previous step:
Add the following code to the `Main` method:

[!code-csharp[Add a learner step](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#8)]
[!code-csharp[Create pipeline](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#CreatePipeline)]

Use the <xref:Microsoft.ML.Legacy.Trainers.KMeansPlusPlusClusterer.K?displayProperty=nameWithType> property to specify number of clusters. The code above specifies that the data set should be split in three clusters.
The code specifies that the data set should be split into three clusters.
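
For intuition about what "k-means++" refers to, the classic seeding step can be sketched in plain C#. This is an illustrative sketch only and is not the ML.NET implementation: the names `KMeansPlusPlusSketch`, `SquaredDistance`, and `ChooseInitialCentroids` are hypothetical, and the trainer used in the tutorial may initialize centroids differently or with optimizations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, written for illustration; not part of ML.NET.
public static class KMeansPlusPlusSketch
{
    // Squared Euclidean distance between two points of equal length.
    public static double SquaredDistance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Picks k initial centroids: the first uniformly at random, each later one
    // with probability proportional to its squared distance from the nearest
    // centroid chosen so far.
    public static List<double[]> ChooseInitialCentroids(
        IReadOnlyList<double[]> points, int k, Random random)
    {
        var centroids = new List<double[]> { points[random.Next(points.Count)] };
        while (centroids.Count < k)
        {
            // Weight of each point = squared distance to its nearest centroid.
            double[] weights = points
                .Select(p => centroids.Min(c => SquaredDistance(p, c)))
                .ToArray();

            // Sample an index in proportion to the weights.
            double target = random.NextDouble() * weights.Sum();
            int chosen = points.Count - 1; // fallback if all weights are zero
            double cumulative = 0;
            for (int i = 0; i < weights.Length; i++)
            {
                cumulative += weights[i];
                if (cumulative > target)
                {
                    chosen = i;
                    break;
                }
            }
            centroids.Add(points[chosen]);
        }
        return centroids;
    }
}
```

Because points that are far from every chosen centroid get larger weights, the initial centroids tend to be spread out, which is the motivation usually given for this seeding over purely random initialization.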

## Train the model

The steps added in the preceding sections prepared the pipeline for training, however, none have been executed. The `pipeline.Train<TInput, TOutput>` method produces the model that takes in an instance of the `TInput` type and outputs an instance of the `TOutput` type. Add the following code into the `Train` method:
The steps added in the preceding sections prepared the pipeline for training; however, none have been executed. Add the following line to the `Main` method to perform data loading and model training:

[!code-csharp[Train the model and return](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#9)]
[!code-csharp[Train the model](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#TrainModel)]

### Save the model

At this point, you have a model that can be integrated into any of your existing or new .NET applications. To save your model to a .zip file, add the following code to the `Main` method below the call to the `Train` method:
At this point, you have a model that can be integrated into any of your existing or new .NET applications. To save your model to a .zip file, add the following code to the `Main` method:

[!code-csharp[Save the model](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#10)]
[!code-csharp[Save the model](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#SaveModel)]

Using `await` in the `Main` method means the `Main` method must have the `async` modifier and return a `Task`:

[!code-csharp[Make the Main method async](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#11)]

You also need to add the following `using` directive at the top of the *Program.cs* file:

[!code-csharp[Add System.Threading.Tasks using](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#12)]
## Use the model for predictions

Because the `async Main` method is the feature added in C# 7.1 and the default language version of the project is C# 7.0, you need to change the language version to C# 7.1 or higher. To do that, right-click the project node in **Solution Explorer** and select **Properties**. Select the **Build** tab and select the **Advanced** button. In the dropdown, select **C# 7.1** (or a higher version). Select the **OK** button.
To make predictions, use the <xref:Microsoft.ML.Runtime.Data.PredictionFunction%602> class that takes instances of the input type through the transformer pipeline and produces instances of the output type. Add the following line to the `Main` method to create an instance of that class:

## Use the model for predictions
[!code-csharp[Create predictor](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#Predictor)]

Create the `TestIrisData` class to house test data instances:

1. In **Solution Explorer**, right-click the project, and then select **Add** > **New Item**.
1. In the **Add New Item** dialog box, select **Class** and change the **Name** field to *TestIrisData.cs*. Then, select the **Add** button.
1. Modify the class to be static like in the following example:

[!code-csharp[Make class static](../../../samples/machine-learning/tutorials/IrisClustering/TestIrisData.cs#1)]
[!code-csharp[Make class static](~/samples/machine-learning/tutorials/IrisFlowerClustering/TestIrisData.cs#Static)]

This tutorial introduces one iris data instance within this class. You can add other scenarios to experiment with the model. Add the following code into the `TestIrisData` class:

[!code-csharp[Test data](../../../samples/machine-learning/tutorials/IrisClustering/TestIrisData.cs#2)]
[!code-csharp[Test data](~/samples/machine-learning/tutorials/IrisFlowerClustering/TestIrisData.cs#TestData)]

To find out which cluster the specified item belongs to, go back to the *Program.cs* file and add the following code into the `Main` method:

[!code-csharp[Predict and output results](../../../samples/machine-learning/tutorials/IrisClustering/Program.cs#13)]
[!code-csharp[Predict and output results](~/samples/machine-learning/tutorials/IrisFlowerClustering/Program.cs#PredictionExample)]

Run the program to see which cluster contains the specified data instance and squared distances from that instance to the cluster centroids. Your results should be similar to the following. As the pipeline processes, it might display warnings or processing messages. These have been removed from the following output for clarity.
Run the program to see which cluster contains the specified data instance and squared distances from that instance to the cluster centroids. Your results should be similar to the following:

```text
Cluster: 2
Distances: 0.4192338 0.0008847713 0.9660053
Distances: 11.69127 0.02159119 25.59896
```
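
The relationship between the `Distances` values and the reported cluster can be sketched in plain C#. This is a minimal sketch under stated assumptions, not code from the sample: `NearestCentroidSketch`, `SquaredDistances`, and `NearestClusterId` are hypothetical names, and the 1-based cluster numbering is an assumption inferred from the output above, where cluster 2 corresponds to the second (smallest) distance.

```csharp
// Hypothetical helper, written for illustration; not part of ML.NET.
public static class NearestCentroidSketch
{
    // Squared Euclidean distance from a point to each centroid.
    public static float[] SquaredDistances(float[] point, float[][] centroids)
    {
        var distances = new float[centroids.Length];
        for (int c = 0; c < centroids.Length; c++)
        {
            float sum = 0f;
            for (int i = 0; i < point.Length; i++)
            {
                float diff = point[i] - centroids[c][i];
                sum += diff * diff;
            }
            distances[c] = sum;
        }
        return distances;
    }

    // Returns the 1-based ID of the cluster with the smallest distance
    // (1-based numbering is assumed here to match the sample output).
    public static int NearestClusterId(float[] distances)
    {
        int best = 0;
        for (int c = 1; c < distances.Length; c++)
        {
            if (distances[c] < distances[best])
            {
                best = c;
            }
        }
        return best + 1;
    }
}
```

Applied to the distances printed above, `NearestCentroidSketch.NearestClusterId(new[] { 11.69127f, 0.02159119f, 25.59896f })` returns `2`, which agrees with the reported cluster.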

Congratulations! You've now successfully built a machine learning model for iris clustering and used it to make predictions. You can find the source code for this tutorial at the [dotnet/samples](https://github.com/dotnet/samples/tree/master/machine-learning/tutorials/IrisClustering) GitHub repository.
Congratulations! You've now successfully built a machine learning model for iris clustering and used it to make predictions. You can find the source code for this tutorial at the [dotnet/samples](https://github.com/dotnet/samples/tree/master/machine-learning/tutorials/IrisFlowerClustering) GitHub repository.

## Next steps

2 changes: 1 addition & 1 deletion docs/toc.md
@@ -1177,7 +1177,7 @@
## [Tutorials](machine-learning/tutorials/index.md)
### [Sentiment analysis (binary classification)](machine-learning/tutorials/sentiment-analysis.md)
### [Taxi fare predictor (regression)](machine-learning/tutorials/taxi-fare.md)
### [Iris petals (clustering)](machine-learning/tutorials/iris-clustering.md)
### [Iris flowers (clustering)](machine-learning/tutorials/iris-clustering.md)
## [How-to guides](machine-learning/how-to-guides/index.md)
### [Apply categorical feature engineering ](machine-learning/how-to-guides/train-model-categorical-ml-net.md)
### [Apply textual feature engineering ](machine-learning/how-to-guides/train-model-textual-ml-net.md)
