Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 13 changed files with 6,027 additions and 0 deletions.
26 changes: 26 additions & 0 deletions
Машинное обучение и анализ данных/Обучение на размеченных данных/week_2/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# Week 2: Combating Overfitting and Evaluating Quality | ||
|
||
## The Overfitting Problem and How to Combat It | ||
|
||
* _Notes_ [The overfitting problem](materials/) | ||
* _Slides_ [The overfitting problem](materials/) | ||
|
||
## Quality Metrics | ||
|
||
* _Notes_ [Quality metrics](materials/) | ||
* _Slides_ [Quality metrics](materials/) | ||
|
||
## The scikit-learn Library. Introduction | ||
|
||
* _Notebooks_ [Introduction to scikit-learn](notebooks/) | ||
|
||
### Assignment 1 | ||
[Linear regression: overfitting and regularization](assigments/assigment_1.ipynb) | ||
|
||
In this assignment you will fit a linear model that predicts the number of bike rentals from the calendar characteristics of a day and the weather conditions. The feature weights must be chosen so that the model captures all the linear dependencies in the data without picking up superfluous features; then the model will not overfit and will make reasonably accurate predictions on new data. The linear dependencies you find will need to be interpreted, that is, you must judge whether each discovered pattern agrees with common sense. The main goal of the assignment is to show and explain, by example, what causes overfitting and how it can be combated. | ||
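
As a rough illustration of the idea behind this assignment, the sketch below fits L1- and L2-regularized linear models to synthetic data. Everything in it is an assumption made for the example: the data is randomly generated rather than taken from `bikes_rent.csv`, and the `alpha` values are arbitrary, not the ones the assignment asks you to tune.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data (an assumption; the assignment itself uses bikes_rent.csv).
rng = np.random.RandomState(0)
n_samples = 200
X = rng.normal(size=(n_samples, 5))
# Only the first two features influence the target; the other three are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# L1 regularization (Lasso) tends to set the weights of irrelevant features to exactly zero.
lasso = Lasso(alpha=0.5).fit(X, y)
# L2 regularization (Ridge) shrinks the weights but typically leaves them non-zero.
ridge = Ridge(alpha=0.5).fit(X, y)

print('Lasso weights:', np.round(lasso.coef_, 3))
print('Ridge weights:', np.round(ridge.coef_, 3))
```

Comparing the two weight vectors shows the trade-off the assignment explores: Lasso performs a form of feature selection at the cost of shrinking the useful weights, while Ridge keeps every feature in the model.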
|
||
### Assignment 2 | ||
|
||
[Metrics in sklearn](assigments/assigment_2.ipynb) | ||
|
||
While working through it, you will see how the various quality metrics differ and practice computing them. |
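
To give a feel for how the metrics can disagree, here is a minimal sketch using functions from `sklearn.metrics`. The labels are hand-made for illustration and are not taken from the assignment notebook.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made binary labels (an assumption for illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Confusion counts for the positive class: TP = 2, FN = 2, FP = 1, TN = 5.
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print('accuracy: ', accuracy)
print('precision:', precision)
print('recall:   ', recall)
print('f1:       ', f1)
```

On these labels accuracy is 0.7 while recall is only 0.5, so the same classifier can look acceptable or poor depending on which metric is reported.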
1,003 changes: 1,003 additions & 0 deletions
...учение и анализ данных/Обучение на размеченных данных/week_2/assigments/assigment_1.ipynb
Large diffs are not rendered by default.
1,108 changes: 1,108 additions & 0 deletions
...учение и анализ данных/Обучение на размеченных данных/week_2/assigments/assigment_2.ipynb
Large diffs are not rendered by default.
732 changes: 732 additions & 0 deletions
... обучение и анализ данных/Обучение на размеченных данных/week_2/assigments/bikes_rent.csv
Large diffs are not rendered by default.
Binary file added
BIN
+374 KB
...из данных/Обучение на размеченных данных/week_2/materials/[Конспект] Метрики качества.pdf
Binary file not shown.
Binary file added
BIN
+132 KB
...нных/Обучение на размеченных данных/week_2/materials/[Конспект] Проблема переобучения.pdf
Binary file not shown.
Binary file added
BIN
+6.86 MB
...ализ данных/Обучение на размеченных данных/week_2/materials/[Слайды] Метрики качества.pdf
Binary file not shown.
Binary file added
BIN
+4.64 MB
...данных/Обучение на размеченных данных/week_2/materials/[Слайды] Проблема переобучения.pdf
Binary file not shown.
293 changes: 293 additions & 0 deletions
...ие и анализ данных/Обучение на размеченных данных/week_2/notebooks/cross_validation.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,293 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"**Verified on Python 3.6:**\n", | ||
"+ numpy 1.15.4\n", | ||
"+ sklearn 0.20.2" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Sklearn" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## sklearn.model_selection" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Documentation: http://scikit-learn.org/stable/modules/cross_validation.html" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from sklearn import model_selection, datasets\n", | ||
"\n", | ||
"import numpy as np" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### One-time split of the data into train and test sets with train_test_split" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"iris = datasets.load_iris()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"train_data, test_data, train_labels, test_labels = model_selection.train_test_split(iris.data, iris.target, \n", | ||
" test_size = 0.3)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# check that the test set really makes up 0.3 of all the data\n", | ||
"len(test_labels) / len(iris.data)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"print('Training set size: {} objects \\nTest set size: {} objects'.format(len(train_data),\n", | ||
" len(test_data)))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"print('Training set:\\n', train_data[:5])\n", | ||
"print('\\n')\n", | ||
"print('Test set:\\n', test_data[:5])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"print('Class labels in the training set:\\n', train_labels)\n", | ||
"print('\\n')\n", | ||
"print('Class labels in the test set:\\n', test_labels)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Cross-validation strategies" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# generate a tiny stand-in dataset whose elements equal their own indices\n", | ||
"X = range(0,10)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### KFold" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"kf = model_selection.KFold(n_splits = 5)\n", | ||
"for train_indices, test_indices in kf.split(X):\n", | ||
" print(train_indices, test_indices)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"kf = model_selection.KFold(n_splits = 2, shuffle = True)\n", | ||
"for train_indices, test_indices in kf.split(X):\n", | ||
" print(train_indices, test_indices)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"kf = model_selection.KFold(n_splits = 2, shuffle = True, random_state = 1)\n", | ||
"for train_indices, test_indices in kf.split(X):\n", | ||
" print(train_indices, test_indices)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### StratifiedKFold" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"y = np.array([0] * 5 + [1] * 5)\n", | ||
"print(y)\n", | ||
"\n", | ||
"skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True, random_state = 0)\n", | ||
"for train_indices, test_indices in skf.split(X, y):\n", | ||
" print(train_indices, test_indices)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"target = np.array([0, 1] * 5)\n", | ||
"print(target)\n", | ||
"\n", | ||
"skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True)\n", | ||
"for train_indices, test_indices in skf.split(X, target):\n", | ||
" print(train_indices, test_indices)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### ShuffleSplit" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ss = model_selection.ShuffleSplit(n_splits = 10, test_size = 0.2)\n", | ||
"\n", | ||
"for train_indices, test_indices in ss.split(X):\n", | ||
" print(train_indices, test_indices)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### StratifiedShuffleSplit" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"target = np.array([0] * 5 + [1] * 5)\n", | ||
"print(target)\n", | ||
"\n", | ||
"sss = model_selection.StratifiedShuffleSplit(n_splits = 4, test_size = 0.2)\n", | ||
"for train_indices, test_indices in sss.split(X, target):\n", | ||
" print(train_indices, test_indices)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Leave-One-Out" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"loo = model_selection.LeaveOneOut()\n", | ||
"\n", | ||
"for train_indices, test_index in loo.split(X):\n", | ||
" print(train_indices, test_index)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"More cross-validation strategies are described here: http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"anaconda-cloud": {}, | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 1 | ||
} |
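
The notebook above only prints the index splits each strategy produces. As a hedged sketch of how such a splitter is typically put to use, the example below passes a `StratifiedKFold` object to `model_selection.cross_val_score`. The choice of `LogisticRegression`, its `max_iter` value, and the number of folds are assumptions made for illustration; they do not appear in the notebook.

```python
from sklearn import datasets, model_selection
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

# The same kind of splitter the notebook demonstrates, with a fixed seed for reproducibility.
skf = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Assumed estimator: any scikit-learn classifier could be substituted here.
clf = LogisticRegression(max_iter=1000)

# Fit and score the model once per fold; the default score for a classifier is accuracy.
scores = model_selection.cross_val_score(clf, iris.data, iris.target, cv=skf)

print('Accuracy per fold:', scores)
print('Mean accuracy:', scores.mean())
```

Passing the splitter object through `cv` rather than an integer makes the shuffling and stratification explicit, so the evaluation protocol is visible in the code.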