posi-olomo/Comment-Toxicity-Classification

Comment-Toxicity-Classification

About 37% of young people between the ages of 12 and 17 have been bullied online. 30% have had it happen more than once.

Photo by Adrian Swancar on Unsplash

Cyberbullying is currently a priority for every social media company, from Google to Twitter. Beyond social media, every organization whose website allows user comments, from schools to businesses, takes cyberbullying very seriously. The safety of the site's workers and users needs to be protected.

Cyberbullying occurs on every platform and in every country in the world. As a company, it is your duty to ensure that nasty comments are flagged and taken off the platform. To do that, you need a deep learning model that can detect when a comment is toxic and identify its class(es) of toxicity.

That is exactly what my web app does: you submit a comment and it tells you whether the comment is clean or toxic and, if toxic, which class(es) of toxicity it falls under: toxic, severe_toxic, obscene, threat, insult, and identity_hate.

Check it out here: https://comment-toxicity-classifier.onrender.com/

Project Process

Data Cleaning → Text Vectorization → Model Architecture → Model Performance → Deployment

The data was sourced from Kaggle: https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge


Data Cleaning

I cleaned the data by removing hyperlinks, special characters and numbers.
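The cleaning step described above can be sketched with a few regular expressions. This is a minimal illustration, not the project's actual code: the patterns, the function name `clean_comment`, and the final lowercasing are all assumptions.

```python
import re

# Hypothetical patterns; the project's actual cleaning code is not shown.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-zA-Z\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Remove hyperlinks, special characters and numbers from a comment."""
    text = URL_PATTERN.sub(" ", text)          # drop hyperlinks first
    text = NON_LETTER_PATTERN.sub(" ", text)   # drop digits and punctuation
    text = WHITESPACE_PATTERN.sub(" ", text)   # collapse leftover spaces
    return text.strip().lower()                # lowercasing is an assumption


print(clean_comment("Check http://spam.example.com NOW!!! You are #1 idiot 4ever"))
# prints: check now you are idiot ever
```

Removing hyperlinks before the other characters matters here: otherwise the URL's punctuation would be stripped and its letters would survive as noise tokens.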


Text Vectorization

I used the Tf-Idf text vectorizer, which converts each comment into a fixed-length numeric vector with one weight per vocabulary term (the model's input layer expects 6,000 features). I chose it because it gives more weight to distinctive words while penalizing commonly occurring ones.

To gain a better understanding of how the Tf-Idf Vectorizer works:

https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
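To make the weighting idea concrete, here is a small hand-rolled sketch in plain Python. It is not the vectorizer used in the project: the function name `tfidf` is hypothetical, the smoothed IDF formula is one common variant, and the sketch skips vector normalization and any vocabulary-size cap.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency * smoothed inverse document frequency
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


docs = ["you are kind", "you are rude", "you are welcome"]
scores = tfidf(docs)
# "you" occurs in every comment, "rude" in only one, so "rude" scores higher.
print(scores[1]["you"], scores[1]["rude"])
```

Words that appear in every comment end up with the lowest possible weight, while a word that appears in a single comment stands out in that comment's vector.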


Model Architecture

3-layer neural network:

  • 2 hidden layers with the ReLU activation function.
  • Output layer with a Sigmoid activation function.
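
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Input: 6000 Tf-Idf features per comment
    layers.Dense(100, input_shape=(6000,), activation="relu"),
    layers.Dense(50, activation="relu"),
    # One sigmoid output per toxicity class
    layers.Dense(6, activation="sigmoid"),
])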

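A comment can belong to several classes at once, which makes this a multi-label classification problem. I therefore compiled the model with a binary_crossentropy loss, which treats each of the six sigmoid outputs as an independent yes/no prediction.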

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["binary_accuracy"]
)
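As a rough illustration of why binary cross-entropy suits this setup, the sketch below computes the loss by hand for one comment that carries three labels at once. It mirrors the idea rather than the Keras implementation; the function is a hypothetical helper and the probabilities are made up.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the labels of one comment."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Label order: toxic, severe_toxic, obscene, threat, insult, identity_hate
y_true = [1, 0, 1, 0, 1, 0]            # toxic, obscene and insult all at once
good = [0.9, 0.1, 0.8, 0.1, 0.7, 0.1]  # made-up sigmoid outputs
bad = [0.9, 0.1, 0.2, 0.1, 0.7, 0.1]   # misses "obscene"

print(binary_crossentropy(y_true, good))
print(binary_crossentropy(y_true, bad))
```

Each label contributes its own term to the loss, so missing one class is penalized without forcing the outputs to sum to one, as a softmax output would.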

Model Performance

The model performed well, reaching:

  • Training Accuracy: 99.9%
  • Testing Accuracy: 93.2%

The gap between the two figures suggests some overfitting to the training data.
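Assuming these figures come from the binary_accuracy metric set at compile time, each of the six label decisions per comment counts separately. The sketch below shows that calculation by hand; the function is a hypothetical helper, the 0.5 threshold is an assumption, and the data is made up.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual label decisions that match the ground truth."""
    correct = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            correct += int((p > threshold) == bool(t))
            total += 1
    return correct / total


y_true = [[1, 0, 1, 0, 1, 0],
          [0, 0, 0, 0, 0, 0]]
y_pred = [[0.9, 0.1, 0.2, 0.1, 0.7, 0.1],   # one of six labels wrong
          [0.1, 0.0, 0.2, 0.0, 0.6, 0.0]]   # one of six labels wrong

print(binary_accuracy(y_true, y_pred))  # 10 of 12 decisions correct
```

Neither comment is labeled perfectly here, yet the score is about 0.83, because the metric counts label decisions rather than whole comments.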

Deployment

The model was deployed via a Flask app and hosted using Render.
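Whatever the web framework, the app has to turn the model's six sigmoid outputs into the answer shown to the user. The following is a minimal sketch of that step, not the deployed code: the function name, the 0.5 threshold, and the label order (taken from the class list earlier in this README) are assumptions.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def describe_prediction(probabilities, threshold=0.5):
    """Turn six sigmoid outputs into 'clean' or a list of toxicity classes."""
    flagged = [label for label, p in zip(LABELS, probabilities) if p > threshold]
    return {"clean": not flagged, "classes": flagged}


print(describe_prediction([0.92, 0.03, 0.81, 0.01, 0.64, 0.02]))
# {'clean': False, 'classes': ['toxic', 'obscene', 'insult']}
print(describe_prediction([0.04, 0.00, 0.01, 0.00, 0.02, 0.00]))
# {'clean': True, 'classes': []}
```

A request handler could call a function like this on the model's prediction for the submitted comment and render the resulting dictionary.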

References

https://www.slicktext.com/blog/2020/05/cyberbullying-statistics-facts/#:~:text=About%2037%25%20of%20teens%20between%20the%20ages%20of,and%20perpetrators%20of%20cyberbullying%20in%202019%20and%202020.
