Add more info on working with categorical data #16881
Merged
13 commits
106f98b Add feature label article (natke)
a87478d Add one hot encoding (natke)
57b3e24 Added one hot encoding and hashing to data prep how-to (natke)
b47c2ae Acrolinx and tidy up (natke)
65105ad Fix xref (natke)
d1245e5 Fix typo (natke)
61e9eb4 Remove one-hot encoding example, as the example is in the API docs (natke)
647091c Update docs/machine-learning/how-to-guides/prepare-data-ml-net.md (natke)
600d8c3 Update docs/machine-learning/how-to-guides/prepare-data-ml-net.md (natke)
1b42d38 Update docs/machine-learning/how-to-guides/prepare-data-ml-net.md (natke)
e64cc4c Update docs/machine-learning/how-to-guides/prepare-data-ml-net.md (natke)
781166b Made tenses consistent for loading enmerable data (natke)
64c0352 Update after review (natke)
Acrolinx and tidy up
commit b47c2ae0fa7f43ddfb7ee2c6e2c7d84ac80e4d63
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -3,7 +3,7 @@ title: Prepare data for building a model | |||
description: Learn how to use transforms in ML.NET to manipulate and prepare data for additional processing or model building. | ||||
author: luisquintanilla | ||||
ms.author: luquinta | ||||
ms.date: 09/11/2019 | ||||
ms.date: 01/29/2020 | ||||
ms.custom: mvc, how-to, title-hack-0625 | ||||
#Customer intent: As a developer, I want to know how I can transform and prepare data with ML.NET | ||||
--- | ||||
|
@@ -18,7 +18,7 @@ Data is often unclean and sparse. ML.NET machine learning algorithms expect inpu | |||
|
||||
Sometimes, not all data in a dataset is relevant for analysis. An approach to remove irrelevant data is filtering. The [`DataOperationsCatalog`](xref:Microsoft.ML.DataOperationsCatalog) contains a set of filter operations that take in an [`IDataView`](xref:Microsoft.ML.IDataView) containing all of the data and return an [IDataView](xref:Microsoft.ML.IDataView) containing only the data points of interest. It's important to note that because filter operations are not an [`IEstimator`](xref:Microsoft.ML.IEstimator%601) or [`ITransformer`](xref:Microsoft.ML.ITransformer) like those in the [`TransformsCatalog`](xref:Microsoft.ML.TransformsCatalog), they cannot be included as part of an [`EstimatorChain`](xref:Microsoft.ML.Data.EstimatorChain%601) or [`TransformerChain`](xref:Microsoft.ML.Data.TransformerChain%601) data preparation pipeline. | ||||
|
||||
Using the following input data which is loaded into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
Using the following input data and load it into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
|
||||
```csharp | ||||
HomeData[] homeDataList = new HomeData[] | ||||
|
@@ -54,7 +54,7 @@ The sample above takes rows in the dataset with a price between 200000 and 10000 | |||
|
||||
Missing values are a common occurrence in datasets. One approach to dealing with missing values is to replace them with the default value for the given type if any or another meaningful value such as the mean value in the data. | ||||
|
||||
Using the following input data which is loaded into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
Using the following input data and load it into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
|
||||
```csharp | ||||
HomeData[] homeDataList = new HomeData[] | ||||
|
@@ -94,16 +94,16 @@ ITransformer replacementTransformer = replacementEstimator.Fit(data); | |||
IDataView transformedData = replacementTransformer.Transform(data); | ||||
``` | ||||
|
||||
ML.NET supports various [replacement modes](xref:Microsoft.ML.Transforms.MissingValueReplacingEstimator.ReplacementMode). The sample above uses the `Mean` replacement mode which will fill in the missing value with that column's average value. The replacement's result fills in the `Price` property for the last element in our data with 200,000 since it's the average of 100,000 and 300,000. | ||||
ML.NET supports various [replacement modes](xref:Microsoft.ML.Transforms.MissingValueReplacingEstimator.ReplacementMode). The sample above uses the `Mean` replacement mode, which fills in the missing value with that column's average value. The replacement's result fills in the `Price` property for the last element in our data with 200,000 since it's the average of 100,000 and 300,000. | ||||
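As a quick check of the figure quoted above, the `Mean` replacement value can be worked out by hand. This sketch assumes, as the text states, that the only non-missing `Price` values are 100,000 and 300,000; the symbol $\bar{x}$ is introduced here purely for illustration.

```latex
\bar{x}_{\text{Price}} = \frac{100000 + 300000}{2} = 200000
```

The missing `Price` entry is then filled with this value.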
|
||||
## Use normalizers | ||||
|
||||
[Normalization](https://en.wikipedia.org/wiki/Feature_scaling) is a data pre-processing technique used to standardize features that are not on the same scale which helps algorithms converge faster. For example, the ranges for values like age and income vary significantly with age generally being in the range of 0-100 and income generally being in the range of zero to thousands. Visit the [transforms page](../resources/transforms.md) for a more detailed list and description of normalization transforms. | ||||
[Normalization](https://en.wikipedia.org/wiki/Feature_scaling) is a data pre-processing technique used to scale features to be in the same range, usually between 0 and 1, so that they can be more accurately processed by a machine learning algorithm. For example, the ranges for age and income vary significantly with age generally being in the range of 0-100 and income generally being in the range of zero to thousands. Visit the [transforms page](../resources/transforms.md) for a more detailed list and description of normalization transforms. | ||||
|
||||
### Min-Max normalization | ||||
|
||||
Using the following input data which is loaded into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
Using the following input data and load it into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
|
||||
```csharp | ||||
HomeData[] homeDataList = new HomeData[] | ||||
|
@@ -121,7 +121,7 @@ HomeData[] homeDataList = new HomeData[] | |||
}; | ||||
``` | ||||
|
||||
Normalization can be applied to columns with single numerical values as well as vectors. Normalize the data in the `Price` column using min-max normalization with the [`NormalizeMinMax`](xref:Microsoft.ML.NormalizationCatalog.NormalizeMinMax*) method. | ||||
Normalization can be applied to columns with single numerical values as well as vectors. Normalize the data in the `Price` column using min-max normalization with the [`NormalizeMinMax`](xref:Microsoft.ML.NormalizationCatalog.NormalizeMinMax%2A) method. | ||||
|
||||
```csharp | ||||
// Define min-max estimator | ||||
|
@@ -135,13 +135,13 @@ ITransformer minMaxTransformer = minMaxEstimator.Fit(data); | |||
IDataView transformedData = minMaxTransformer.Transform(data); | ||||
``` | ||||
|
||||
The original price values `[200000,100000]` are converted to `[ 1, 0.5 ]` using the `MinMax` normalization formula which generates output values in the range of 0-1. | ||||
The original price values `[200000,100000]` are converted to `[ 1, 0.5 ]` using the `MinMax` normalization formula that generates output values in the range of 0-1. | ||||
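The quoted result `[ 1, 0.5 ]` can be reproduced by hand. The sketch below assumes the transform runs with its default zero-preserving behavior (the `fixZero` option), where each value is divided by the largest absolute value in the column rather than shifted by the minimum; the symbols $x$ and $x'$ are illustrative and do not come from the article.

```latex
x' = \frac{x}{\max_i |x_i|}
\qquad\Longrightarrow\qquad
\frac{200000}{200000} = 1, \quad \frac{100000}{200000} = 0.5
```

Under the textbook formula $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, the same inputs would instead map to $1$ and $0$, which is why the zero-preserving variant is assumed here.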
|
||||
### Binning | ||||
|
||||
[Binning](https://en.wikipedia.org/wiki/Data_binning) converts continuous values into a discrete representation of the input. For example, suppose one of your features is age. Instead of using the actual age value, binning creates ranges for that value. 0-18 could be one bin, another could be 19-35 and so on. | ||||
|
||||
Using the following input data which is loaded into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
Using the following input data that is loaded into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
|
||||
```csharp | ||||
HomeData[] homeDataList = new HomeData[] | ||||
|
@@ -182,7 +182,7 @@ The result of binning creates bin bounds of `[0,200000,Infinity]`. Therefore the | |||
|
||||
## Work with categorical data | ||||
|
||||
One of the most common types of data is categorical data. Categorical data is that which has a finite number of categories. For example, the states of the USA, or a list of the types of animals found in a set of pictures. Whether these are features or labels, they must be mapped onto a numerical value in order to be used to generate a machine learning model. There are a number of ways of doing this in ML.NET, depending on the problem you are solving. | ||||
One of the most common types of data is categorical data. Categorical data has a finite number of categories. For example, the states of the USA, or a list of the types of animals found in a set of pictures. Whether the categorical data are features or labels, they must be mapped onto a numerical value so that they can be used to generate a machine learning model. There are a number of ways of working with categorical data in ML.NET, depending on the problem you are solving. | ||||
natke marked this conversation as resolved.
|
||||
|
||||
### Key value mapping | ||||
|
||||
|
@@ -203,7 +203,7 @@ One hot encoding takes a finite set of values and maps them onto integers whose | |||
||| | ||||
Suggested change
|
||||
|98109|10...00| | ||||
|
||||
Using the following input data which is loaded into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
Using the following input data and load it into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
|
||||
```csharp | ||||
CarData[] cars = new CarData[] | ||||
|
@@ -258,7 +258,7 @@ ML.NET provides [Hash](xref:Microsoft.ML.ConversionsExtensionsCatalog.Hash%2A) t | |||
|
||||
## Work with text data | ||||
|
||||
Text data needs to be transformed into numbers before using it to build a machine learning model. Visit the [transforms page](../resources/transforms.md) for a more detailed list and description of text transforms. | ||||
Like categorical data, text data needs to be transformed into numerical features before using it to build a machine learning model. Visit the [transforms page](../resources/transforms.md) for a more detailed list and description of text transforms. | ||||
|
||||
Using data like the data below that has been loaded into an [`IDataView`](xref:Microsoft.ML.IDataView): | ||||
|
||||
|
@@ -278,7 +278,7 @@ ReviewData[] reviews = new ReviewData[] | |||
}; | ||||
``` | ||||
|
||||
The minimum step to convert text to a numerical vector representation is to use the [`FeaturizeText`](xref:Microsoft.ML.TextCatalog.FeaturizeText%2A) method. By using the [`FeaturizeText`](xref:Microsoft.ML.TextCatalog.FeaturizeText%2A) transform, a series of transformations is applied to the input text column resulting in a numerical vector representing the lp-normalized word and character ngrams. | ||||
ML.NET provides the [`FeaturizeText`](xref:Microsoft.ML.TextCatalog.FeaturizeText%2A) transform that takes a text string and creates a set of features from the text, by applying a series of individual transforms. | ||||
natke marked this conversation as resolved.
|
||||
|
||||
```csharp | ||||
// Define text transform estimator | ||||
|
@@ -298,7 +298,7 @@ The resulting transform converts the text values in the `Description` column to | |||
[ 0.2041241, 0.2041241, 0.2041241, 0.4082483, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0, 0, 0, 0, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0 ] | ||||
``` | ||||
|
||||
Combine complex text processing steps into an [`EstimatorChain`](xref:Microsoft.ML.Data.EstimatorChain%601) to remove noise and potentially reduce the amount of required processing resources as needed. | ||||
The transforms that make up `FeaturizeText` can also be applied individually for finer-grained control over feature generation. | ||||
|
||||
```csharp | ||||
// Define text transform estimator | ||||
|
I'd fix the tenses to match here. Using / load. We got feedback on this before though if we said "Use the following input data and load it into an IDataView" confused users because the code to load the data into an IDataView is not there. So it was worded as it currently is to make it sound like..."This is what the data looks like and we assume it's been loaded into an IDataView". The reason for not including the load code is we don't want to focus on loading code one way or another (file / enumerable). We want to focus more on the input data / transforms and resulting output.
Ok sure that makes sense. Let me have another look at the wording
I've made them consistent and explicitly instructed to load into a variable called `data`. Let me know what you think.
Looks good