- Python 3.8
- All the modules in `requirements.txt`
Before NLTK can be used for tokenization, a few setup steps need to be completed. Open a new Python session and run:

```python
import nltk
nltk.download('punkt')
```
Additionally, download the Wikipedia Stats.
Ensure that the project folder has the following directory structure:

- datasets/
  - AmazonCat-13K/
    - tst.json
    - trn.json
  - GoogleNews-vectors-negative.bin.gz
- AmazonCat-13K/
  - 09 - Preprocess the AmazonCat-13k Dataset.py
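The layout can also be checked programmatically before running the preprocessing task. A minimal sketch (the `missing_files` helper and the hard-coded path list are illustrative, not part of the project):

```python
from pathlib import Path

# Files the preprocessing task expects to find (taken from the tree above).
EXPECTED = [
    'datasets/AmazonCat-13K/tst.json',
    'datasets/AmazonCat-13K/trn.json',
    'datasets/GoogleNews-vectors-negative.bin.gz',
]

def missing_files(root='.'):
    """Return the expected paths that do not exist under *root*."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

print(missing_files())  # empty list means the layout is complete
```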
The `datasets` folder contains the extracted archive files. Run the task "09 - Preprocess the AmazonCat-13k Dataset". This should create the following files in the `datasets/AmazonCat-13K` folder:
- `X.trn.raw.npy` (an ndarray)
- `Y.trn.raw.npz` (a CSC sparse matrix)
- `X.trn.processed.npy` (an ndarray)
- `Y.trn.processed.npz` (a CSC sparse matrix)
- `X.tst.npy` (an ndarray)
- `Y.tst.npz` (a CSC sparse matrix)
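The `.npy` files can be loaded with NumPy and the `.npz` sparse matrices with SciPy. A round-trip sketch using small stand-in arrays (the data shown is illustrative, not from the dataset):

```python
import numpy as np
import scipy.sparse as sp

# Stand-ins for the preprocessed features (dense) and labels (CSC sparse).
X = np.arange(6).reshape(2, 3)
Y = sp.csc_matrix(np.array([[1, 0], [0, 1]]))

np.save('X.trn.processed.npy', X)      # ndarray -> .npy
sp.save_npz('Y.trn.processed.npz', Y)  # CSC sparse matrix -> .npz

X_loaded = np.load('X.trn.processed.npy')
Y_loaded = sp.load_npz('Y.trn.processed.npz')
print(X_loaded.shape, Y_loaded.format)
```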
Ensure that the project folder has the following directory structure:

- datasets/
  - AmazonCat-13K/
    - X.trn.processed.npy
    - Y.trn.processed.npz
    - X.tst.npy
    - Y.tst.npz
  - GoogleNews-vectors-negative.bin.gz
- AmazonCat-13K/
  - results/
    - history/
    - predict/
    - weights/
  - 10 - Training the AmazonCat-13k Dataset.py
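The empty `results` subfolders can be created up front. A short sketch (folder names taken from the tree above; that these are the training task's output folders is an assumption):

```python
from pathlib import Path

# Create the results folders the training task writes into.
for sub in ('history', 'predict', 'weights'):
    Path('results', sub).mkdir(parents=True, exist_ok=True)
```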
Before running the training, the dataset needs to be preprocessed (task 09 above). To train the models, run the task "10 - Training the AmazonCat-13k Dataset".
The top 10 most frequent labels are:

Label id | Label | # of occurrences |
---|---|---|
1471 | books | 355,211 |
7961 | music | 194,561 |
7892 | movies & tv | 128,026 |
9237 | pop | 120,090 |
7083 | literature & fiction | 97,803 |
7891 | movies | 88,967 |
4038 | education & reference | 76,277 |
10063 | rock | 75,035 |
12630 | used & rental textbooks | 71,667 |
8108 | new | 71,667 |
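Counts like these can be recomputed by summing each column of the sparse label matrix (rows are samples, columns are labels). A sketch with a tiny stand-in matrix (the `top_labels` helper is illustrative, not part of the project):

```python
import numpy as np
import scipy.sparse as sp

def top_labels(Y, k):
    """Return (label id, count) pairs for the k most frequent labels."""
    counts = np.asarray(Y.sum(axis=0)).ravel()
    order = np.argsort(counts)[::-1][:k]
    return [(int(i), int(counts[i])) for i in order]

# Stand-in label matrix: 3 samples, 3 labels.
Y = sp.csc_matrix(np.array([[1, 0, 1],
                            [1, 1, 0],
                            [1, 0, 0]]))
print(top_labels(Y, 2))  # label 0 occurs 3 times
```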
For several thresholds, the most frequent label that occurs at most that many times:

Label id | Label | Threshold | # of occurrences |
---|---|---|---|
6554 | john | 50 | 50 |
4949 | fountains | 100 | 100 |
7393 | marriage | 1,000 | 996 |
84 | accessories & supplies | 10,000 | 9,976 |
9202 | politics & social sciences | 50,000 | 48,521 |
7083 | literature & fiction | 100,000 | 96,012 |
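This selection can be reproduced by picking, for each threshold, the most frequent label whose count does not exceed it. A sketch (the helper name and the sample counts are illustrative):

```python
def best_label_under(counts, threshold):
    """Return (label id, count) of the most frequent label occurring
    at most `threshold` times, or None if no label qualifies."""
    eligible = [(c, i) for i, c in enumerate(counts) if c <= threshold]
    if not eligible:
        return None
    c, i = max(eligible)
    return (i, c)

# Stand-in per-label occurrence counts.
counts = [50, 996, 120, 48_521]
print(best_label_under(counts, 1_000))  # -> (1, 996)
print(best_label_under(counts, 100))    # -> (0, 50)
```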