Mrinal Kanti Baowaly, Yi-Pei Tu and Kuan-Ta Chen, Ph.D.
Institute of Information Science, Academia Sinica, Taiwan.
The goal of this research is to construct a model that can accurately predict the helpfulness of game reviews. We define our problem as follows:
Given a set of reviews, determine whether each review is helpful or not helpful on the basis of the review data and the reviewer's expertise.
- feature_ranking.R: Models with a large number of features may cause issues: irrelevant features may confuse the learner, and unimportant features increase the execution cost. Thus, we needed to select a proper subset of features, which is why we rank the features in our dataset by their importance. In this step, we employed the Random Forest algorithm with the Recursive Feature Elimination (RFE) technique to output a list of features sorted by importance. To rank the features, we used 10-fold cross-validation on a random sample of 25% of the data. We repeated the whole process 10 times and finally calculated the mean weight of each feature to rank them (a sketch of this idea follows the steps below).
- gbm_eval.R: In our experiment, we employed a GBM (Gradient Boosting Machine) classifier for model training and evaluation. Before applying GBM to train the model, we investigated our dataset and found it imbalanced. To handle the imbalance, we applied a number of options, discussed step by step in later sections.
- Download the repository from https://github.com/baowaly/PredictingReviewHelpfulness
- The ranked features are already saved in the 'Features' directory. You may also execute the script feature_ranking.R to check how it works.
- Execute the script: gbm_eval.R
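For reference, here is a minimal, hypothetical sketch of the ranking idea behind feature_ranking.R, using caret's rfe() with a Random Forest ranker. The actual script, its helper functions, and its output format may differ.

```r
# Hypothetical sketch of RFE-based feature ranking (not the actual script)
library(caret)

rank_features <- function(dataset, target = "helpful", sample_frac = 0.25) {
  # Rank on a 25% random sample to keep the computation affordable
  idx <- sample(nrow(dataset), ceiling(nrow(dataset) * sample_frac))
  sampled <- as.data.frame(dataset)[idx, ]

  ctrl <- rfeControl(functions = rfFuncs,   # Random Forest-based ranking
                     method = "cv",
                     number = 10)           # 10-fold cross-validation
  x <- sampled[, setdiff(names(sampled), target)]
  y <- sampled[[target]]
  rfe(x, y, sizes = c(100, 500, 1000), rfeControl = ctrl)
}

# Repeating this 10 times and averaging each feature's weight gives the
# stable ranking that is saved in the 'Features' directory.
```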
The Steam store hosts many game genres, e.g. action, racing, survival, RPG, etc. We analysed the top 10 game genres, which are discussed and explained in the paper. Here, we investigate the helpfulness of the Racing genre and show how to reproduce its results. We considered only reviews with at least 50 votes. The following step-by-step explanation follows the script gbm_eval.R.
#Define some variables
genre <- "Racing"
vote_num <- 50
#Load dataset
dataFile <- paste0("Dataset/Reviews_", genre, "_50.Rdata")
if(!file.exists(dataFile) || file.size(dataFile) == 0){
  stop("Data file doesn't exist")
}
load(dataFile)
original_dataset <- ds.reviews
Now it is time to take a look at the data:
# dimensions of dataset
dim(original_dataset)
## [1] 2251 2205
You can see 2,251 reviews and 2,205 attributes in the original dataset of the Racing genre.
The Steam helpfulness of a review (our target variable) is defined by its rating score (here, ws.score).
Based on a score threshold of 0.90, we define the target variable helpful as follows:
#Add a helpful column
score_thrsld <- 0.90
original_dataset$helpful <- ifelse(original_dataset$ws.score > score_thrsld, "Yes", "No")
original_dataset$helpful <- as.factor(original_dataset$helpful)
dim.y <- "helpful"
After feature ranking (Rscript feature_ranking.R), we have a total of 1,789 features, as shown in the following table. Some insignificant features were removed during the ranking process.
Feature Type | Examples | Explanation |
---|---|---|
Metadata (39) | e.g. recommend, days.interval, #words, #sentences | Text, reviewer, and game metadata: counting, aggregating, and ratio variables |
LIWC (93) | liwc.* (e.g. liwc.Segment, liwc.WC, liwc.Analytic, liwc.Tone) | Linguistic Inquiry and Word Count (LIWC) reads a given text and counts the percentage of words that reflect different emotions, thinking styles, social concerns, and parts of speech |
TF-IDF (629) | tfidf.* (e.g. tfidf.die, tfidf.easy, tfidf.better, tfidf.bad) | Unigram term-weighting method used to extract features based on word frequencies in the review and the review corpus |
LDA (28) | topic.* (e.g. topic.1, topic.2, topic.3) | Topic-based features induced from the review corpus |
Word2Vec (1000) | wv.* (e.g. wv.1, wv.2, wv.3, wv.4) | Word2vec, a group of related models used to produce word-embedding features from the text |
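The table only names the feature families. As an illustration, here is a hedged sketch of how the TF-IDF and LDA topic features could be derived with the text2vec package; this is not the repository's actual extraction pipeline, and the column `review_text` is hypothetical.

```r
# Hypothetical feature-extraction sketch using text2vec;
# the repository's real pipeline may differ.
library(text2vec)

tokens <- itoken(reviews$review_text,      # 'review_text' is an assumed column
                 preprocessor = tolower,
                 tokenizer = word_tokenizer)
vocab <- create_vocabulary(tokens)
dtm   <- create_dtm(tokens, vocab_vectorizer(vocab))

# TF-IDF features (tfidf.*): unigram term weights
tfidf     <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)

# LDA topic features (topic.*): 28 topics, as in the table above
lda        <- LDA$new(n_topics = 28)
doc_topics <- lda$fit_transform(dtm, n_iter = 1000)
```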
#Load feature file
featureFile <- paste0("Features/FT_", genre, "_V", vote_num, "_R", score_thrsld,"_S25.Rdata")
if(!file.exists(featureFile)){
  stop("Feature file does not exist")
}
load(featureFile)
#Total number of features
max.dims <- NROW(feature.rank)
print(max.dims)
## [1] 1789
#Total feature list
dim.x <- feature.rank[order(feature.rank$total)[1:max.dims],]$name
#Exclude tfidf features; we found they are not important at all
dim.x <- grep("^tfidf.*?", dim.x, value = TRUE, invert = TRUE)
dataset <- original_dataset[, c(dim.x, dim.y), with=F]
#Peek at the Data
head(dataset[,1:5], 5)
## recommend topic.17 wv.613 wv.367 wv.638
## 1: 0 0.04912281 -0.0196526 0.00585691 0.00564988
## 2: 0 0.04204204 -0.0181740 0.00550578 0.00636852
## 3: 0 0.04301075 -0.0237652 0.00789483 0.01222250
## 4: 1 0.02873563 -0.0156660 0.00388676 0.01468580
## 5: 1 0.02487562 -0.0175165 0.00381155 0.00345860
We ran an investigation on a smaller sample (50% of the dataset), capping the number of features at 1,000. The results are displayed in the following figure.
From the above plot of F-score against the number of features, we can clearly see that the best F-score was obtained when the number of features was 840.
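The figure itself comes from a separate experiment. A minimal, hypothetical sketch of such a sweep (assuming a simple 80/20 split rather than the paper's exact protocol, and purely for illustration) could look like:

```r
# Hypothetical feature-count sweep (not the actual experiment script)
library(caret)
df  <- as.data.frame(dataset)
idx <- createDataPartition(df$helpful, p = 0.8, list = FALSE)

f_scores <- sapply(seq(40, 1000, by = 40), function(k) {
  cols <- dim.x[1:k]                        # top-k ranked features
  fit  <- train(df[idx, cols], df$helpful[idx],
                method = "gbm", verbose = FALSE)
  pred <- predict(fit, df[-idx, cols])
  confusionMatrix(pred, df$helpful[-idx], positive = "Yes")$byClass["F1"]
})
```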
#Select number of features
gbm.dim <- 840
#Select the best subset of the total features
gbm.dim.x <- dim.x[1:gbm.dim]
Let’s now take a look at the number of instances (rows) that belong to each class and balance the dataset.
#Total number of rows in the dataset
n.rows <- NROW(dataset)
#Class Distribution
print(table(dataset$helpful))
##
## No Yes
## 2187 64
The ovun.sample function from the ROSE package creates a balanced sample by randomly over-sampling minority examples, under-sampling majority examples, or a combination of over- and under-sampling. We use the combined method here:
#Balance data set with both over and under sampling
n.samplePercent <- 100
n.sampleSize <- ceiling(n.rows * n.samplePercent/100)
balanced_data <- ovun.sample(helpful ~ ., data = dataset, method="both", p=0.5, N=n.sampleSize)$data
print(table(balanced_data$helpful))
##
## No Yes
## 1105 1146
Partition the dataset into training (80%) and test (20%) sets.
#Split dataset
split <- 0.80
trainIndex <- as.vector(createDataPartition(y=balanced_data$helpful, p=split, list=FALSE))
#Get train data
gbm.dataTrain <- balanced_data[trainIndex, c(gbm.dim.x,dim.y), ]
dim(gbm.dataTrain)
## [1] 1801 841
#Get test data
gbm.dataTest <- balanced_data[-trainIndex, c(gbm.dim.x,dim.y), ]
dim(gbm.dataTest)
## [1] 450 841
#Split train data
gbm.trainX <- gbm.dataTrain[, gbm.dim.x, ]
gbm.trainY <- as.factor(gbm.dataTrain[, dim.y, ])
#Split test data
gbm.testX <- gbm.dataTest[, gbm.dim.x, ]
gbm.testY <- as.factor(gbm.dataTest[, dim.y, ])
#Remove columns with near zero variance
nearZV <- nearZeroVar(gbm.trainX)
if(length(nearZV) > 0){
  gbm.trainX <- gbm.trainX[, -nearZV]
  gbm.testX <- gbm.testX[, -nearZV]
}
#Preprocess training data
preObj <- preProcess(gbm.trainX, method = c("center", "scale"))
gbm.trainX <- predict(preObj, gbm.trainX)
#Preprocess test data with the same training-derived transform (avoids information leakage)
gbm.testX <- predict(preObj, gbm.testX)
We apply 10-fold cross-validation to build the model. In our experiment, we used F-score and AUC, which are more reliable evaluation metrics for imbalanced data. We employed a GBM (Gradient Boosting Machine) classifier.
#Control parameters
gbm.fitControl = trainControl(method="repeatedcv", #repeated CV suits the small sample size
number=10, #10-fold cv
repeats=3,
returnResamp = "final",
selectionFunction = "best",
classProbs=TRUE,
summaryFunction=twoClassSummary,
allowParallel = TRUE)
#Train the model
gbmFit <- train(gbm.trainX, gbm.trainY, method="gbm", metric="ROC", trControl=gbm.fitControl, verbose=F)
#Get evaluation score
eval_score <- get_eval_score(gbmFit, gbm.trainX, gbm.trainY, gbm.testX, gbm.testY, score_thrsld, "GBM")
#print(eval_score)
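get_eval_score() is a helper defined in the repository. A minimal sketch of the kind of metrics it reports (a hypothetical stand-in, assuming the pROC package for AUC) might be:

```r
# Hypothetical evaluation helper (the repository's get_eval_score may differ)
library(caret)
library(pROC)

evaluate <- function(fit, testX, testY) {
  pred  <- predict(fit, testX)                          # class predictions
  probs <- predict(fit, testX, type = "prob")[, "Yes"]  # P(helpful = Yes)
  cm    <- confusionMatrix(pred, testY, positive = "Yes")
  c(Accuracy  = unname(cm$overall["Accuracy"]),
    Precision = unname(cm$byClass["Precision"]),
    Recall    = unname(cm$byClass["Recall"]),
    F1        = unname(cm$byClass["F1"]),
    AUC       = as.numeric(auc(roc(testY, probs, levels = c("No", "Yes")))))
}
```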
We repeated steps 5-8 (model training and testing) 10 times to make the results more reliable. The evaluation results are saved in a file in the Evaluation directory. The mean evaluation metrics are shown in the following table.
scoreFile <- paste0("Evaluation/SCORE_", genre,"_V50_R0.9.Rdata")
load(scoreFile)
final_eval_score <- sapply(Filter(is.numeric, comb.score), mean)
Evaluation Metric | Training score | Test score |
---|---|---|
Accuracy | 1 | 0.989 |
Precision | 1 | 0.988 |
Recall | 1 | 0.991 |
F1-score | 1 | 0.989 |
From the above evaluation results, we found that our model performs very well at classifying whether reviews are helpful. All evaluation metrics are about 99% on the test dataset.
#Important features
gbmImp <- varImp(gbmFit, scale = TRUE)
impFeatures <- gbmImp$importance
#Sort by overall weight
impFeatures <- impFeatures[order(-impFeatures$Overall), , drop = FALSE]
head(impFeatures, 10)
## Overall
## wv.698 100.00000
## recommend 79.68782
## topic.17 68.56388
## wv.903 47.82878
## wv.819 41.77297
## wv.226 37.71184
## wv.701 36.39785
## wv.330 35.35896
## wv.512 32.78374
## n.content.remove 32.10252
As with the evaluation metrics, the feature importance is computed by averaging the weights over the 10 executions and saved in a file in the ImpFeatures directory.
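The averaging step itself happens in the repository's scripts. A hypothetical sketch (assuming `imp_runs` is a list of per-run varImp data frames with columns `feature` and `Overall`) is:

```r
# Hypothetical averaging of feature importance over 10 runs
all_runs <- do.call(rbind, imp_runs)                    # stack the 10 runs
feature.imp <- aggregate(Overall ~ feature, all_runs, mean)
names(feature.imp)[2] <- "mean"                         # column name used below
feature.imp <- feature.imp[order(-feature.imp$mean), ]  # sort by mean weight
```

The saved averages are then loaded and plotted as follows.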
#Load feature importance file
load(paste0("ImpFeatures/Imp_FT_", genre,"_V50_R0.9.Rdata"))
#Select top 20
top_features <- feature.imp[1:20, ]
top_features$feature <- factor(top_features$feature, levels=unique(top_features$feature))
featurePlot <- ggplot(data=top_features, aes(x=feature, y=mean)) +
geom_bar(stat="identity", position=position_dodge(0.6), fill="steelblue", width=0.6) +
scale_x_discrete(limits = rev(levels(top_features$feature))) +
scale_y_continuous(breaks=c(seq(0,100,10))) +
xlab("Features") + ylab("Feature importance (Relative weight)") + # Set axis labels
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
#Horizontal bar plot
featurePlot <- featurePlot + coord_flip()
print(featurePlot)
We found that review helpfulness mostly depends on:
- Metadata features
- The hidden word-embedding (Word2Vec) features, and
- LDA topic-based features.