andyharless/twit_demog

Learning Demographics from Text: Twitter Gender

Broadly, the intent of this project is to infer, from written text, information about the demographic profile of the writer. The project as it stands seeks specifically to infer gender from Twitter statuses (tweets).

For purposes of model training and evaluation, I've defined gender operationally as the gender typically associated with the first word of someone's display name (as ascertained by the gender-guesser Python package). In a sense, what I'm calling "gender" is really a projection of "name" onto a binary variable. It will not correspond in all cases to either biological sex or personal gender identification, but nonetheless it seems like a meaningful and useful projection. (For example, it seems more meaningful and useful than projecting "name" onto "first letter.")
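
As a rough illustration of this labeling rule, the sketch below takes the first word of a display name and looks it up in a name table. The table `TOY_NAME_TABLE` and the helpers `first_word` and `label_gender` are hypothetical stand-ins written for this example; the project itself relies on the gender-guesser package, whose name list is far larger and whose output includes intermediate categories that this toy version ignores.

```python
# Hypothetical stand-in for the gender-guesser lookup (assumption: the real
# package covers many more names and returns more categories than these two).
TOY_NAME_TABLE = {
    "alice": "female",
    "maria": "female",
    "bob": "male",
    "james": "male",
}


def first_word(display_name):
    """Return the first whitespace-delimited word of a display name, or ''."""
    words = display_name.strip().split()
    return words[0] if words else ""


def label_gender(display_name, lookup=TOY_NAME_TABLE):
    """Project a display name onto a binary label; None means 'no label'."""
    return lookup.get(first_word(display_name).lower())


# Names with no usable first word, or with an unlisted one, get no label
# and would simply be left out of training and evaluation.
labels = [label_gender(n) for n in ["Alice in Tech", "bob", "   ", "Zyx Q"]]
```

Returning `None` for unrecognized names keeps the operational definition explicit: only accounts whose first display-name word maps to a label contribute examples.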

The basic approach is to take principal components of sentence-level embeddings and use them as inputs to a quadratic-logistic model that predicts gender. The embeddings (or, more generally, activations, if some don't meet the strict definition of embeddings) come from three separate models:

  1. A fine-tuned version of Google's Universal Sentence Encoder (Large), which uses a transformer-based approach to embed sentences.

  2. An LSTM network built on word-level embeddings initialized with GloVe vectors pre-trained on Twitter.

  3. A simple max-pooling network built on similarly initialized word-level embeddings. (Fine-tuning of the word-level embeddings in the max-pooling model is meant to capture word-level differences between male and female usage, which may be present even when the presumably more substantive differences captured by the other models are not.)
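
The overall pipeline (principal components followed by a quadratic-logistic fit) can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's code: the random matrix `X` stands in for sentence-level activations, the toy labels `y` are arbitrary, the number of components `k=4` is an assumption, and reading "quadratic-logistic" as logistic regression on linear terms plus squares and pairwise products is also an assumption. All function names are hypothetical.

```python
import numpy as np


def pca_fit(X, k):
    """Return the column means and the top-k principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]


def quadratic_features(Z):
    """Append all squares and pairwise products of the columns of Z."""
    i, j = np.triu_indices(Z.shape[1])
    return np.hstack([Z, Z[:, i] * Z[:, j]])


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def log_loss(w, F, y, eps=1e-12):
    """Mean binary cross-entropy of a logistic model with weights w."""
    p = sigmoid(F @ w)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))


def fit_logistic(F, y, lr=0.1, steps=500):
    """Fit logistic regression by plain batch gradient descent."""
    w = np.zeros(F.shape[1])
    for _ in range(steps):
        w -= lr * F.T @ (sigmoid(F @ w) - y) / len(y)
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))             # stand-in for model activations
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # arbitrary toy labels

mean, components = pca_fit(X, k=4)
Z = (X - mean) @ components.T              # principal-component scores
Q = quadratic_features(Z)                  # 4 linear + 10 quadratic terms
Q = (Q - Q.mean(axis=0)) / Q.std(axis=0)   # standardize for stable descent
F = np.hstack([np.ones((len(Q), 1)), Q])   # prepend an intercept column

w = fit_logistic(F, y)
probs = sigmoid(F @ w)                     # predicted probabilities
```

Reducing to a few principal components before expanding matters because the number of quadratic terms grows with the square of the input dimension; expanding raw embeddings with hundreds of dimensions would produce an unwieldy feature set.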

Files

Code

Demo

Initial data processing

Baseline model using tuned USE-Large sentence embeddings

Second model adding activations from LSTM network with tuned GloVe embeddings

Complete model adding activations from max-pooling network

Processing data for online learning evaluation

Full model with online learning

Relevant Kaggle Kernels
