Comment-Toxicity-Classification

About 37% of young people between the ages of 12 and 17 have been bullied online. 30% have had it happen more than once.

Photo by Adrian Swancar on Unsplash

Cyberbullying is a priority for every social media company at the moment, from Google to Twitter. Beyond social media, any organization whose website allows users to comment, from schools to company sites, has to take cyberbullying seriously. The safety of the people who work on and use these websites needs to be protected.

Cyberbullying occurs on every platform and in every country in the world. As a company, it is your duty to ensure that nasty comments are flagged and taken off the platform. One way to do that is with a deep learning model that detects when a comment is toxic and identifies its class(es) of toxicity.

That is exactly what my web app does: you submit a comment and it tells you whether it is clean or toxic and, if it is toxic, which class(es) of toxicity apply: toxic, severe_toxic, obscene, threat, insult, and identity_hate.

Check it out here: https://comment-toxicity-classifier.onrender.com/

Project Process

Data Cleaning → Text Vectorization → Model Architecture → Model Performance → Deployment

The data was sourced from Kaggle: https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge

Data Cleaning

I cleaned the data by removing hyperlinks, special characters and numbers.
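
The cleaning step described above could be sketched roughly as follows. This is only an illustration: the regular expressions and the `clean_comment` name are assumptions, and the notebook may use different patterns or extra steps.

```python
import re

# Hypothetical patterns; the original notebook may define these differently.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-zA-Z\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Remove hyperlinks, special characters and numbers from a comment."""
    text = URL_PATTERN.sub(" ", text)         # drop links first, while "://" is still intact
    text = NON_LETTER_PATTERN.sub(" ", text)  # drop digits and special characters
    text = WHITESPACE_PATTERN.sub(" ", text)  # collapse the gaps left behind
    return text.strip()


print(clean_comment("Check http://example.com NOW!!! You are #1 idiot..."))
# Check NOW You are idiot
```

Note that this pattern also strips non-ASCII letters, which may or may not be what you want for multilingual comments.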

Text Vectorization

I used a TF-IDF text vectorizer, which converts each comment into a fixed-length numeric vector (the model expects 6,000 features per comment). I chose it because it gives more weight to words that are distinctive for a comment while down-weighting words that occur commonly across the dataset.

To gain a better understanding of how the TF-IDF vectorizer works:

https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
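
To make the weighting concrete, here is a minimal pure-Python sketch of the textbook TF-IDF formula, tf(t, d) * log(N / df(t)). It is not the vectorizer used in this project; library implementations typically add smoothing and normalization, and the toy sentences are made up.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "you are a kind person".split(),
    "you are a horrible person".split(),
    "you are a star".split(),
]
scores = tf_idf(docs)
print(scores[1]["you"])       # 0.0, because "you" appears in every document
print(scores[1]["horrible"])  # largest weight in this comment: the word appears in one document only
```

Words shared by every comment contribute nothing, while the word that sets a comment apart gets the highest weight, which is the property that motivated the choice above.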

Model Architecture

3-layer neural network:

2 hidden layers with the ReLU activation function.
Output layer with a Sigmoid activation function.

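# Assumes the Keras that ships with TensorFlow
from tensorflow import keras
from tensorflow.keras import layers

# Input: 6000 TF-IDF features per comment; output: one sigmoid unit per toxicity class
model = keras.Sequential([
    layers.Dense(100, input_shape=(6000,), activation="relu"),
    layers.Dense(50, activation="relu"),
    layers.Dense(6, activation="sigmoid"),
])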
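
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand from the layer sizes above. The helper name below is made up for illustration; it assumes each Dense layer keeps its default bias term.

```python
# Layer sizes taken from the model definition: 6000 inputs -> 100 -> 50 -> 6 outputs.
layer_sizes = [6000, 100, 50, 6]


def dense_param_count(n_inputs: int, n_units: int) -> int:
    """A Dense layer has one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


per_layer = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [600100, 5050, 306]
print(sum(per_layer))  # 605456
```

Almost all of the parameters sit in the first layer, which is a direct consequence of the 6,000-dimensional input.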

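Because a comment can belong to several toxicity classes at once, this is a multi-label classification problem. To handle it, I compiled the model with the binary_crossentropy loss, which treats each of the six sigmoid outputs as an independent yes/no prediction.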

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["binary_accuracy"],
)
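
The sketch below illustrates, in plain Python, what this loss computes for a single comment and how six sigmoid outputs can be turned into a set of labels. The probabilities are made up, and the 0.5 threshold is an assumption; the web app may use a different cut-off.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def binary_crossentropy(y_true: list[int], y_prob: list[float], eps: float = 1e-7) -> float:
    """Mean of the per-label binary cross-entropy terms for one comment."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def predicted_labels(y_prob: list[float], threshold: float = 0.5) -> list[str]:
    """Every label whose probability reaches the threshold; an empty list means clean."""
    return [label for label, p in zip(LABELS, y_prob) if p >= threshold]


y_true = [1, 0, 1, 0, 1, 0]                    # toxic, obscene and insult at the same time
y_prob = [0.92, 0.10, 0.81, 0.03, 0.67, 0.05]  # made-up sigmoid outputs
print(predicted_labels(y_prob))  # ['toxic', 'obscene', 'insult']
print(binary_crossentropy(y_true, y_prob))
```

Each label is scored independently, so one comment can receive several labels or none, which a softmax output with categorical cross-entropy would not allow.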

Model Performance

The model performed well, reaching the following accuracy:

Training Accuracy: 99.9%
Testing Accuracy: 93.2%
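
Since the model was compiled with the binary_accuracy metric, these figures are presumably the share of individual label decisions that were correct, not the share of comments with all six labels right. A minimal sketch of that calculation, with made-up data and an assumed 0.5 threshold:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of individual label decisions that match the ground truth."""
    correct = 0
    total = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            correct += int((p > threshold) == bool(t))
            total += 1
    return correct / total


# Two comments, six labels each: 12 decisions in total.
y_true = [[1, 0, 1, 0, 1, 0],
          [0, 0, 0, 0, 0, 0]]
y_prob = [[0.9, 0.1, 0.8, 0.0, 0.4, 0.1],   # misses the fifth label (0.4 is below 0.5)
          [0.2, 0.0, 0.1, 0.0, 0.1, 0.0]]
print(binary_accuracy(y_true, y_prob))  # 11 of 12 decisions correct
```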

Deployment

The model was deployed via a Flask app and hosted using Render.

References

https://www.slicktext.com/blog/2020/05/cyberbullying-statistics-facts/#:~:text=About%2037%25%20of%20teens%20between%20the%20ages%20of,and%20perpetrators%20of%20cyberbullying%20in%202019%20and%202020.
