(for code, please go to the master branch)
In recent years, researchers are more actively seen to discuss on research articles in social media, like Facebook, Twitter, LinkedIn, etc. due to their quick and powerful reach and impact to sheer amount of audience. Therefore, social sites are becoming crucial data repository to understand the current trends of research, analyzing the reactions of people to research, getting insights on which types of topics are getting user attention more, etc. [1-3]. This type of analysis is important as it may reveal interesting implicit information about past and current trends of research topics to new researchers and give them an idea how people is going to discuss on their works in future which, at first glance, might not be comprehendible from the voluminous and versatile social data.
The aim of this spotlight is to investigate the topics that are being discussed frequently in Facebook posts on scholarly articles using one of the most interesting text mining algorithms in natural language processing: Topic Modeling which aims in extracting the subject of the document being discussed. Two popular topic modeling algorithms, LDA (Latent Dirichlet Allocation) and GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture) will be used for this purpose and visualized to understand their relative outputs. The core assumption of LDA is each document may have multiple topics associated with it, whereas GSDMM considers each document has only one underlying topic. So, it is considered by many researchers that LDA works better when document size is larger (>50), for example, news articles in newspapers, scientific articles in magazines, etc. and GSDMM works better in short-text documents, like posts in Twitter and Facebook, product reviews, etc. [4][7]. The text that will be used in this analysis will be of varying length ranging from 8 to about 72,000 with a mean of 658. Also, they are originally categorized in multiple topics. For this reason, analyzing the dataset with both LDA and GSDMM will help in getting a comparative picture of the performance in this corpus.
The dataset that will be used in this spotlight is on scholarly posts in Facebook and was originally collected from altmetric.com [8] and used in [5].
Here I worked in three steps:
- Task 1 : Setup
- Task 2 : Data Loading, Exploration, and Preprocessing
- Task 3 : Topic Modeling and visualization (i) GSDMM, (ii) LDA
- Zheng, H, et al. (2018) “Social Media Presence of Scholarly Journals”. Journal of the Association for Information Science And Technology, 70(3):256-270.
- Pulido, CM, et al. (2018): “Social impact in social media: A new method to evaluate the social impact of research”. PLoS ONE, 13(8).
- “Social media for scientists”. Nature Cell Biology 20, 1329 (2018).
- Albalawi, R, et al. (2020). “Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis”. Frontiers in Artificial Intelligence, 3(42).
- Freeman, C, et al. (2019). “Shared Feelings: Understanding Facebook Reactions to Scholarly Articles”. JCDL’19, 301-304.
- Yin, J, et al. 2014. “A dirichlet multinomial mixture model-based approach for short text clustering”. KDD’14, Association for Computing Machinery, NY, USA, 233–242.
- https://towardsdatascience.com/short-text-topic-modelling-lda-vs-gsdmm-20f1db742e14
- https://www.altmetric.com
- https://medium.com/analytics-vidhya/topic-modeling-using-lda-and-gibbs-sampling-explained-49d49b3d1045
- https://techblog.assignar.com/topic_modelling%20-%20assignar_froms_classification/
- https://www.kaggle.com/ptfrwrd/topic-modeling-guide-gsdm-lda-lsi#LDA-model
- https://towardsdatascience.com/gsdmm-topic-modeling-for-social-media-posts-and-reviews-8726489dc52f
- https://www.kaggle.com/ptfrwrd/topic-modeling-guide-gsdm-lda-lsi?scriptVersionId=44304210
- https://github.com/rwalk/gsdmm