This project adapts a Kaggle competition to serve as a course project for Advanced Business Analytics, DSBA 6211, at the University of North Carolina at Charlotte for the Fall 2021 semester.
Team Members:
Dave Collins
Kristie Soliman
Jaime Cassell
Taylor Ferguson
Dustin Ballentine
CareerVillage.org is a nonprofit that crowdsources career advice for underserved youth. Founded in 2011 in four classrooms in New York City, the platform has now served career advice from 25,000 volunteer professionals to over 3.5M online learners. The platform uses a Q&A style similar to StackOverflow or Quora to provide students with answers to any question about any career.
- Explore User Profiles
- Explore Activities Time Series
- Explore Questions and Answers
- What factors motivate user participation?
- What factors affect a question’s likelihood to be answered?
The data was retrieved from the Kaggle competition page on 16 September 2021. CareerVillage.org provided five years' worth of data in 15 CSV files, one file per entity. The Enhanced Entity Relationship Diagram below approximates the relationships among the original 15 tables.
We began by preprocessing each table individually, including data type conversions, formatting changes, transformations, and feature engineering. A summary of the major steps performed for each table is provided below:
- answers
- comments
- emails
- group_memberships
- groups
- matches
- professionals
  - Created variable professionals_loc_div by binning professionals' locations into U.S. Geographic Division
  - Created variable professionals_country by binning professionals' locations into country
  - Transformed professionals_date_joined into datetime and removed hh:mm:ss
  - Imputed "Not Specified" for NA fields
- questions
- school_memberships
- students
  - Created variable students_loc_div by binning students' locations into U.S. Geographic Division
  - Created variable students_country by binning students' locations into country
  - Transformed students_date_joined into datetime and removed hh:mm:ss
  - Imputed "Not Specified" for NA fields
- tag_questions
- tag_users
- tags
- question_scores
- answer_scores
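The professionals and students steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the `LOCATION_LOOKUP` table, the `preprocess_users` helper, the `{prefix}_location` and `professionals_industry` column names, and the choice to label non-U.S. divisions "Not Specified" are all assumptions made for the example.

```python
import pandas as pd

NOT_SPECIFIED = "Not Specified"

# Hypothetical lookup from a free-text location to (country, U.S. division).
# The project's real binning rules are not shown in this README.
LOCATION_LOOKUP = {
    "Charlotte, North Carolina": ("United States", "South Atlantic"),
    "New York, New York": ("United States", "Middle Atlantic"),
    "Bengaluru, Karnataka, India": ("India", NOT_SPECIFIED),
}


def preprocess_users(df: pd.DataFrame, prefix: str, text_cols: list) -> pd.DataFrame:
    """Apply the listed steps to a professionals or students table."""
    out = df.copy()
    loc_col = f"{prefix}_location"      # assumed column name
    date_col = f"{prefix}_date_joined"

    # Convert to datetime, then drop the hh:mm:ss component.
    out[date_col] = pd.to_datetime(out[date_col], errors="coerce").dt.normalize()

    def lookup(loc, position):
        # Missing or unrecognized locations fall back to "Not Specified".
        return LOCATION_LOOKUP.get(loc, (NOT_SPECIFIED, NOT_SPECIFIED))[position]

    # Bin the location into country and U.S. division.
    out[f"{prefix}_country"] = out[loc_col].map(lambda loc: lookup(loc, 0))
    out[f"{prefix}_loc_div"] = out[loc_col].map(lambda loc: lookup(loc, 1))

    # Impute "Not Specified" for missing values in the chosen text columns.
    for col in text_cols:
        out[col] = out[col].fillna(NOT_SPECIFIED)
    return out


# Toy rows for illustration only.
raw = pd.DataFrame({
    "professionals_location": ["Charlotte, North Carolina", None, "Bengaluru, Karnataka, India"],
    "professionals_industry": ["Analytics", None, "Software"],
    "professionals_date_joined": ["2016-04-29 10:08:14", "2018-01-05 23:59:59", "2019-07-01 00:00:01"],
})
clean = preprocess_users(
    raw, "professionals",
    text_cols=["professionals_location", "professionals_industry"],
)
```

The same helper can be reused for the students table by passing `prefix="students"`, which keeps the two populations on identical category labels for later comparison.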
More than 90% of each population is from the United States. Another significant category consists of users who chose not to enter any location information or who entered clearly erroneous information; we rolled both into a single category, "Not Specified". Since leaving the field blank is the user's choice, these records may provide valuable information about user behavior.
Although the majority of users are based in the United States, international users as a group make up a significant subset of the total population, as shown in the graphs above. The plots below break out the top 7 other countries of origin for both professionals and students. As the charts highlight, a majority of both international professionals and international students come from India, but international professionals have a stronger presence in China and Europe, while more international students are in Africa.
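One way the "top 7 other countries" breakdown might be computed is sketched below. It assumes a country column such as `professionals_country` already exists, and it assumes that both "United States" and "Not Specified" are excluded before ranking; the `top_other_countries` helper and the sample values are hypothetical.

```python
import pandas as pd


def top_other_countries(countries: pd.Series, n: int = 7) -> pd.Series:
    """Count users per country, excluding the U.S. and unspecified locations."""
    excluded = ["United States", "Not Specified"]  # assumed exclusions
    international = countries[~countries.isin(excluded)]
    # value_counts sorts by frequency in descending order.
    return international.value_counts().head(n)


# Toy data for illustration only; these are not the project's counts.
sample = pd.Series(
    ["United States"] * 5 + ["India"] * 3 + ["Canada"] * 2 + ["Not Specified", "Kenya"]
)
top_two = top_other_countries(sample, n=2)
```

Running the same function on the students' country column would give a directly comparable ranking for the second chart.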
Topic modeling was performed on the questions asked by students to identify recurring themes. Questions were grouped into the ten topics below:
Topic 1: Finding an internship or job
Topic 2: High-school-aged students inquiring about college and careers
Topic 3: Finding work in people-oriented occupations (e.g. working with children, customer-service, social work)
Topic 4: Seeking help coping with college stressors (e.g. lagging grades, time commitments, anxiety, uncertainty)
Topic 5: Options for paying for college, such as applying for scholarships/financial aid
Topic 6: Careers and outlook for STEM majors
Topic 7: Careers and outlook for the medical field
Topic 8: Careers and outlook for business-related majors (e.g. accounting, finance, economics)
Topic 9: Careers and outlook for artistic-related majors (e.g. fashion, culinary-arts, architecture, creative-writing)
The top ten terms for each topic are shown below along with the frequency of questions associated with each topic. The topics were further plotted by year to show the fluctuations over time.
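The tabulations described above could look roughly like the sketch below. The README does not name the topic-modeling method, so this example starts from its outputs: a topic-by-term weight table and one assigned topic per question. The `top_terms` helper, the `topic` column, the `questions_date_added` column name, and every number here are illustrative assumptions rather than project results.

```python
import pandas as pd

# Toy topic-by-term weights (rows are topics, columns are terms).
topic_term_weights = pd.DataFrame(
    [[0.50, 0.30, 0.15, 0.05],
     [0.05, 0.10, 0.35, 0.50]],
    index=[1, 6],
    columns=["internship", "job", "engineering", "science"],
)


def top_terms(weights: pd.DataFrame, topic: int, n: int = 10) -> list:
    """Return the n highest-weighted terms for one topic."""
    return weights.loc[topic].nlargest(n).index.tolist()


# Toy questions table with an assigned topic per question.
questions = pd.DataFrame({
    "questions_date_added": pd.to_datetime([
        "2016-03-01", "2016-09-15", "2016-11-02",
        "2017-02-20", "2017-05-05", "2018-01-10",
    ]),
    "topic": [1, 6, 1, 7, 6, 1],
})

# Number of questions associated with each topic.
topic_counts = questions["topic"].value_counts().sort_index()

# Questions per topic per year (rows are years, columns are topics).
by_year = pd.crosstab(questions["questions_date_added"].dt.year, questions["topic"])
```

Calling `by_year.plot()` (which requires matplotlib) would draw one line per topic across years, which is one simple way to show how topic volume fluctuates over time.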
- Future work