This project is an adaptation of a Kaggle competition to serve as a course project for Advanced Business Analytics, DSBA 6211, at the University of North Carolina at Charlotte for the Fall 2021 semester.

DABallentine/Career-Village

Career Village - Matching Users with Professionals

Team Members:
Dave Collins
Kristie Soliman
Jaime Cassell
Taylor Ferguson
Dustin Ballentine

Project Summary

CareerVillage.org is a nonprofit that crowdsources career advice for underserved youth. Founded in 2011 in four classrooms in New York City, the platform has now served career advice from 25,000 volunteer professionals to over 3.5M online learners. The platform uses a Q&A style similar to StackOverflow or Quora to provide students with answers to any question about any career.

Research Objectives

  1. Explore User Profiles
  2. Explore Activities Time Series
  3. Explore Questions and Answers
  4. What factors motivate user participation?
  5. What factors affect a question’s likelihood to be answered?

Data Resources

Data was retrieved from the Kaggle competition page on 16 September 2021. CareerVillage.org provided five years' worth of data in 15 CSV files, one file per entity. The Enhanced Entity Relationship Diagram below approximates the relationships among the original 15 tables.
[image: Enhanced Entity Relationship Diagram]
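The one-file-per-entity layout described above lends itself to loading every table into a dictionary keyed by entity name. The following is a minimal sketch using only the Python standard library; the function name `load_entities` and the folder passed to it are assumptions for illustration, not part of the project's code.

```python
import csv
from pathlib import Path


def load_entities(data_dir):
    """Read every *.csv file in data_dir into a list of row dicts.

    The result is keyed by file stem, so "answers.csv" becomes "answers".
    """
    tables = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # DictReader uses the header row as the keys of each record.
            tables[path.stem] = list(csv.DictReader(handle))
    return tables
```

Calling `load_entities("data")` on a hypothetical `data` folder holding the competition files would return one entry per table, with every value still a string until the per-table type conversions are applied.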

Data Preprocessing

Initial preprocessing by entity

We began by preprocessing each table individually, including data type conversions, formatting changes, transformations, and feature engineering. A summary of the major steps performed for each table is provided below:

  1. answers
  2. comments
  3. emails
  4. group_memberships
  5. groups
  6. matches
  7. professionals
    - Created variable professionals_loc_div by binning professionals' locations into U.S. Geographic Division
    - Created variable professionals_country by binning professionals' locations into country
    - Transformed professionals_date_joined into datetime and removed the hh:mm:ss time component
    - Imputed "Not Specified" for NA fields
  8. questions
  9. school_memberships
  10. students
    - Created variable students_loc_div by binning students' locations into U.S. Geographic Division
    - Created variable students_country by binning students' locations into country
    - Transformed students_date_joined into datetime and removed the hh:mm:ss time component
    - Imputed "Not Specified" for NA fields
  11. tag_questions
  12. tag_users
  13. tags
  14. question_scores
  15. answer_scores
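The steps listed for the professionals and students tables can be sketched for a single row as follows. This is an illustration under several assumptions, not the project's actual code: the input column `<prefix>_location` is assumed to hold comma-separated text ending in a U.S. state or a country, timestamps are assumed to begin with an ISO date (YYYY-MM-DD), "U.S. Geographic Division" is assumed to mean the Census Bureau divisions, and the helper `clean_user`, the partial `STATE_TO_DIVISION` mapping, and the "International" label are hypothetical.

```python
from datetime import date

NOT_SPECIFIED = "Not Specified"

# Illustrative subset only (assumption): a full mapping would cover every state.
STATE_TO_DIVISION = {
    "New York": "Middle Atlantic",
    "North Carolina": "South Atlantic",
    "Illinois": "East North Central",
    "Texas": "West South Central",
    "California": "Pacific",
}


def clean_user(row, prefix):
    """Return a cleaned copy of one professionals or students row."""
    # Impute "Not Specified" for missing or blank fields.
    cleaned = {
        key: ((value or "").strip() or NOT_SPECIFIED) for key, value in row.items()
    }

    # Bin the free-text location into a country and a U.S. division.
    location = cleaned.get(f"{prefix}_location", NOT_SPECIFIED)
    last_part = location.split(",")[-1].strip()
    if location == NOT_SPECIFIED:
        country = division = NOT_SPECIFIED
    elif last_part in STATE_TO_DIVISION:
        country, division = "United States", STATE_TO_DIVISION[last_part]
    else:
        # Hypothetical label; states missing from the subset above also land here.
        country, division = last_part, "International"
    cleaned[f"{prefix}_country"] = country
    cleaned[f"{prefix}_loc_div"] = division

    # Keep only the calendar date, dropping hh:mm:ss.
    joined = cleaned.get(f"{prefix}_date_joined", NOT_SPECIFIED)
    if joined != NOT_SPECIFIED:
        cleaned[f"{prefix}_date_joined"] = date.fromisoformat(joined[:10])
    return cleaned
```

The same function serves both tables by switching `prefix` between `"professionals"` and `"students"`, which mirrors the parallel steps in the list above.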

Subsequent preprocessing by data subset

Data Understanding and Exploration

1. Exploring User Profiles

Locations

U.S.-based Users

More than 90% of each population is from the United States. Another significant category consists of users who chose not to enter any location information or who entered clearly erroneous information; we have rolled both into one category, "Not Specified". Since the choice to leave the field blank is up to the user, these records may provide valuable information regarding user behavior.
professionals_Map
students_Map

International Users

Although the majority of users are based in the United States, international users as a group make up a significant subset of the total population, as shown in the graphs above. The plots below break out the top 7 other countries of origin for both professionals and students. Whereas a majority of both international professionals and international students come from India, international professionals appear to have a higher presence in China and Europe, while more international students are in Africa.

image
image
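The top-7 breakdown described above amounts to counting countries after setting aside the United States and the "Not Specified" category. A minimal sketch, assuming the country values have already been derived for each user (the function name `top_international` is hypothetical):

```python
from collections import Counter


def top_international(countries, n=7, exclude=("United States", "Not Specified")):
    """Return the n most common countries, ignoring the excluded labels."""
    counts = Counter(country for country in countries if country not in exclude)
    return counts.most_common(n)
```

Running it once on the professionals' countries and once on the students' countries gives the two rankings that the plots compare.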

Industries

Professionals' Industries
Industries Providing Answers

2. Exploring Activity Time Series

3. Exploring Questions and Answers

Student Questions

Topic modeling was performed on the questions asked by students to identify recurring themes. Questions were defined by the ten topics below:

Topic 1: Finding an internship or job
Topic 2: High-school-aged students inquiring about college and careers
Topic 3: Finding work in people-oriented occupations (e.g. working with children, customer service, social work)
Topic 4: Seeking help coping with college stressors (e.g. lagging grades, time commitments, anxiety, uncertainty)
Topic 5: Options for paying for college, such as applying for scholarships/financial aid
Topic 6: Careers and outlook for STEM majors
Topic 7: Careers and outlook for the medical field
Topic 8: Careers and outlook for business-related majors (e.g. accounting, finance, economics)
Topic 9: Careers and outlook for artistic-related majors (e.g. fashion, culinary arts, architecture, creative writing)

The top ten terms for each topic are shown below along with the frequency of questions associated with each topic. The topics were further plotted by year to show the fluctuations over time.

Question Topics LDA
Question Topics Frequency
Question Topics by Year
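The figure names above suggest that the topics came from Latent Dirichlet Allocation (LDA). The sketch below shows how LDA can be fit with collapsed Gibbs sampling, using only the Python standard library. It is a teaching example under stated assumptions, not the project's implementation: the function names, hyperparameters (`alpha`, `beta`, `n_iter`), and the expectation that questions are already tokenized are all hypothetical, and a real analysis would more likely rely on an established library.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (vocab, doc_topic, topic_word),
    where the last two hold raw topic-assignment counts.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(n_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return vocab, doc_topic, topic_word


def top_terms(vocab, topic_word, n=10):
    """List the n most frequently assigned terms for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]
```

`top_terms` corresponds to the per-topic term lists, and taking the largest entry in each row of `doc_topic` gives a dominant topic per question, which can then be tallied overall or by year.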

Modeling

4. What factors motivate users' participation?

Defining participation / activity

a. Factors motivating professionals

b. Factors motivating students

5. What factors affect a question's likelihood to be answered?

Results Summary

Future Work

Possible future work may include:

  1. Future work
