Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more info on working with categorical data #16881

Merged
merged 13 commits into from
Feb 12, 2020
Merged

Conversation

natke
Copy link
Contributor

@natke natke commented Jan 29, 2020

@dotnet-bot dotnet-bot added this to the January 2020 milestone Jan 29, 2020
@natke natke closed this Jan 29, 2020
@natke natke reopened this Jan 29, 2020
@natke natke closed this Jan 29, 2020
@natke natke reopened this Jan 29, 2020
@natke natke marked this pull request as ready for review February 6, 2020 19:42
Copy link
Contributor

@luisquintanilla luisquintanilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done. Few comments and suggestions.

docs/machine-learning/how-to-guides/prepare-data-ml-net.md Outdated Show resolved Hide resolved
docs/machine-learning/how-to-guides/prepare-data-ml-net.md Outdated Show resolved Hide resolved
|---------|---------------------|
|98052|00...01|
|98100|00...10|
|||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|||

docs/machine-learning/how-to-guides/prepare-data-ml-net.md Outdated Show resolved Hide resolved
docs/machine-learning/how-to-guides/prepare-data-ml-net.md Outdated Show resolved Hide resolved
@@ -18,7 +18,7 @@ Data is often unclean and sparse. ML.NET machine learning algorithms expect inpu

Sometimes, not all data in a dataset is relevant for analysis. An approach to remove irrelevant data is filtering. The [`DataOperationsCatalog`](xref:Microsoft.ML.DataOperationsCatalog) contains a set of filter operations that take in an [`IDataView`](xref:Microsoft.ML.IDataView) containing all of the data and return an [IDataView](xref:Microsoft.ML.IDataView) containing only the data points of interest. It's important to note that because filter operations are not an [`IEstimator`](xref:Microsoft.ML.IEstimator%601) or [`ITransformer`](xref:Microsoft.ML.ITransformer) like those in the [`TransformsCatalog`](xref:Microsoft.ML.TransformsCatalog), they cannot be included as part of an [`EstimatorChain`](xref:Microsoft.ML.Data.EstimatorChain%601) or [`TransformerChain`](xref:Microsoft.ML.Data.TransformerChain%601) data preparation pipeline.

Using the following input data which is loaded into an [`IDataView`](xref:Microsoft.ML.IDataView):
Using the following input data and load it into an [`IDataView`](xref:Microsoft.ML.IDataView):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd fix the tenses to match here. Using / load. We got feedback on this before though if we said "Use the following input data and load it into an IDataView" confused users because the code to load the data into an IDataView is not there. So it was worded as it currently is to make it sound like..."This is what the data looks like and we assume it's been loaded into an IDataView". The reason for not including the load code is we don't want to focus on loading code one way or another (file / enumerable). We want to focus more on the input data / transforms and resulting output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sure that makes sense. Let me have another look at the wording

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made them consistent and explicitly instructed to load into a variable called data. Let me know what you think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Copy link
Contributor

@luisquintanilla luisquintanilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes. Looks good to me. The only thing I'd change is the unresolved suggestion in the One Hot Encoding table. There seems to be an extra blank row. If the intent is to show that there are other values in between I'd replace the blank column values with "...". Otherwise, I'd remove the extra row.

@natke
Copy link
Contributor Author

natke commented Feb 12, 2020

Thanks! I missed that. Will update.

@luisquintanilla
Copy link
Contributor

:shipit:

@natke natke merged commit e4c228a into dotnet:master Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need clarity on concept of key type, including info on using Hashing to create the key type
4 participants