- Python 3.8
- All the modules in `requirements.txt`
Before NLTK can be used for tokenization, a few setup steps need to be completed. Open a new Python session and run:

```python
import nltk
nltk.download('punkt')
```
Additionally, download the Wikipedia Stats.
Ensure that the project folder has the following directory structure:

- datasets/
  - AmazonCat-13K/
    - tst.json
    - trn.json
  - GoogleNews-vectors-negative.bin.gz
- AmazonCat-13K/
  - 09 - Preprocess the AmazonCat-13k Dataset.py
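The layout can also be checked programmatically before running the preprocessing task. A minimal sketch (the `missing_files` helper and the hard-coded path list are illustrative, not part of the project):

```python
from pathlib import Path

# Files the preprocessing task expects to find (taken from the tree above).
EXPECTED = [
    'datasets/AmazonCat-13K/tst.json',
    'datasets/AmazonCat-13K/trn.json',
    'datasets/GoogleNews-vectors-negative.bin.gz',
]

def missing_files(root='.'):
    """Return the expected paths that do not exist under *root*."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

print(missing_files())  # empty list means the layout is complete
```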
The `datasets` folder contains the extracted archive files. Run the task "09 - Preprocess the AmazonCat-13k Dataset". This should create the following files in the `datasets/AmazonCat-13K` folder:
- `X.trn.raw.npy` (an ndarray)
- `Y.trn.raw.npz` (a CSC sparse matrix)
- `X.trn.processed.npy` (an ndarray)
- `Y.trn.processed.npz` (a CSC sparse matrix)
- `X.tst.npy` (an ndarray)
- `Y.tst.npz` (a CSC sparse matrix)
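The `.npy` files can be loaded with NumPy and the `.npz` sparse matrices with SciPy. A round-trip sketch using small stand-in arrays (the data shown is illustrative, not from the dataset):

```python
import numpy as np
import scipy.sparse as sp

# Stand-ins for the preprocessed features (dense) and labels (CSC sparse).
X = np.arange(6).reshape(2, 3)
Y = sp.csc_matrix(np.array([[1, 0], [0, 1]]))

np.save('X.trn.processed.npy', X)      # ndarray -> .npy
sp.save_npz('Y.trn.processed.npz', Y)  # CSC sparse matrix -> .npz

X_loaded = np.load('X.trn.processed.npy')
Y_loaded = sp.load_npz('Y.trn.processed.npz')
print(X_loaded.shape, Y_loaded.format)
```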
Ensure that the project folder has the following directory structure:

- datasets/
  - AmazonCat-13K/
    - X.trn.processed.npy
    - Y.trn.processed.npz
    - X.tst.npy
    - Y.tst.npz
  - GoogleNews-vectors-negative.bin.gz
- AmazonCat-13K/
  - results/
    - history/
    - predict/
    - weights/
  - 10 - Training the AmazonCat-13k Dataset.py
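The empty `results` subfolders can be created up front. A short sketch (folder names taken from the tree above; that these are the training task's output folders is an assumption):

```python
from pathlib import Path

# Create the results folders the training task writes into.
for sub in ('history', 'predict', 'weights'):
    Path('results', sub).mkdir(parents=True, exist_ok=True)
```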
Before running the training, the dataset needs to be preprocessed (task 09 above). To train the models, run the task "10 - Training the AmazonCat-13k Dataset".
The top 10 most frequent labels are:

Label id | Label | # of occurrences |
---|---|---|
1471 | books | 355,211 |
7961 | music | 194,561 |
7892 | movies & tv | 128,026 |
9237 | pop | 120,090 |
7083 | literature & fiction | 97,803 |
7891 | movies | 88,967 |
4038 | education & reference | 76,277 |
10063 | rock | 75,035 |
12630 | used & rental textbooks | 71,667 |
8108 | new | 71,667 |
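Counts like these can be recomputed by summing each column of the sparse label matrix (rows are samples, columns are labels). A sketch with a tiny stand-in matrix (the `top_labels` helper is illustrative, not part of the project):

```python
import numpy as np
import scipy.sparse as sp

def top_labels(Y, k):
    """Return (label id, count) pairs for the k most frequent labels."""
    counts = np.asarray(Y.sum(axis=0)).ravel()
    order = np.argsort(counts)[::-1][:k]
    return [(int(i), int(counts[i])) for i in order]

# Stand-in label matrix: 3 samples, 3 labels.
Y = sp.csc_matrix(np.array([[1, 0, 1],
                            [1, 1, 0],
                            [1, 0, 0]]))
print(top_labels(Y, 2))  # label 0 occurs 3 times
```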
For several thresholds, the most frequent label that occurs at most that many times:

Label id | Label | Threshold | # of occurrences |
---|---|---|---|
6554 | john | 50 | 50 |
4949 | fountains | 100 | 100 |
7393 | marriage | 1,000 | 996 |
84 | accessories & supplies | 10,000 | 9,976 |
9202 | politics & social sciences | 50,000 | 48,521 |
7083 | literature & fiction | 100,000 | 96,012 |
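This selection can be reproduced by picking, for each threshold, the most frequent label whose count does not exceed it. A sketch (the helper name and the sample counts are illustrative):

```python
def best_label_under(counts, threshold):
    """Return (label id, count) of the most frequent label occurring
    at most `threshold` times, or None if no label qualifies."""
    eligible = [(c, i) for i, c in enumerate(counts) if c <= threshold]
    if not eligible:
        return None
    c, i = max(eligible)
    return (i, c)

# Stand-in per-label occurrence counts.
counts = [50, 996, 120, 48_521]
print(best_label_under(counts, 1_000))  # -> (1, 996)
print(best_label_under(counts, 100))    # -> (0, 50)
```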