Exploring Free Code Camp’s “2016 New Coder Survey”

Structure of Dataset

The original “2016 New Coder Survey” dataset consists of 113 variables. Most of these variables are answers to survey questions, though a few are computer-generated (e.g. respondent ID and survey start/end times). Over 15,000 observations (i.e. respondents) exist.

The output of the str function is long and messy, so I won’t print it here; please consult Free Code Camp’s survey data dictionary instead. Most variables are boolean, numeric, or categorical.

New Variables

I created six new variables from existing variables:

  • ContinentCitizen and ContinentLive from CountryCitizen and CountryLive using Vincent Arel-Bundock’s countrycode R package
  • PodcastPartiallyDerivative, PodcastBecomingDataSci, and PodcastTalkingMachines from PodcastOther using ifelse statements
  • HoursLearningBucket using the cut function on HoursLearning
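
A minimal sketch of how the podcast flags and the hours bucket might be derived with ifelse and cut in base R. The toy data frame, the search patterns, the flag_podcast helper, and the bucket breakpoints are all assumptions for illustration, since the exact values used in the analysis are not listed here.

```r
# Toy stand-in for the survey data; the values are invented for illustration.
survey <- data.frame(
  PodcastOther  = c("Partially Derivative", "talking machines, other", NA, "None"),
  HoursLearning = c(2, 10, 25, 60),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 if the free-text field mentions the podcast, else 0.
# Missing free text is treated as "not mentioned".
flag_podcast <- function(text, pattern) {
  ifelse(!is.na(text) & grepl(pattern, text, ignore.case = TRUE), 1, 0)
}
survey$PodcastPartiallyDerivative <- flag_podcast(survey$PodcastOther, "partially derivative")
survey$PodcastTalkingMachines     <- flag_podcast(survey$PodcastOther, "talking machines")

# Bucket weekly learning hours; these breakpoints are assumed, not the original ones.
survey$HoursLearningBucket <- cut(
  survey$HoursLearning,
  breaks = c(0, 5, 10, 20, 40, Inf),
  include.lowest = TRUE
)
```

Setting include.lowest = TRUE keeps respondents who reported zero hours in the first bucket instead of turning them into NA.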

These new variables bring our total to 119 variables.

## [1] 15620   119

Data Science/Engineering Subset

646 respondents answered “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”

## [1] 646 119
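
The subset can be sketched roughly as follows. The column name JobRoleInterest and the toy rows are assumptions. Because the two subset sizes reported in this analysis add up to the full 15,620 rows, the sketch places respondents who skipped the question in the complement.

```r
# Toy stand-in; JobRoleInterest is an assumed column name.
survey <- data.frame(
  JobRoleInterest = c("Data Scientist/Data Engineer", "Full-Stack Web Developer",
                      NA, "Data Scientist/Data Engineer"),
  Age = c(26, 31, 22, 40),
  stringsAsFactors = FALSE
)

# %in% returns FALSE (not NA) for missing answers, so every row
# lands in exactly one of the two subsets.
is_data  <- survey$JobRoleInterest %in% "Data Scientist/Data Engineer"
data_sci <- survey[is_data, ]
others   <- survey[!is_data, ]

dim(data_sci)
```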

The following analysis first explores the characteristics of these developing data scientists/engineers, which complements Free Code Camp’s univariate exploration of new coders in general.

Free Code Camp’s article structure is intentionally mimicked for the purpose of direct comparison. Additional comments are included where the results significantly differ. A few bonus plots are included too!

We’ll then dive deeper into the characteristics of new coders in general via bivariate and multivariate exploration.


Univariate Plots

Who Participated

CodeNewbie and Free Code Camp designed the survey, and dozens of coding-related organizations publicized it to their members.

Of the 646 developing data scientists and data engineers who responded to the survey:

A quarter are women.

Data science and engineering appear to draw slightly more women, as 21% of new coders in general are women.

##    female 
## 0.2447917

Their median age is 26.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   22.00   26.00   27.72   31.25   65.00      74

The median of 26 years is clearer once the long-tail data is log transformed.
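
As a rough illustration of why the log scale helps, the sketch below bins simulated right-skewed ages on both scales. The data are simulated, not the survey responses, and the plotting code used for the figures in this article is not shown, so base R's hist stands in for it.

```r
set.seed(42)
# Simulated right-skewed ages; these are not the survey responses.
age <- round(14 + rlnorm(501, meanlog = 2.5, sdlog = 0.5))

# Bin on the raw scale and on the log10 scale (plot = FALSE returns the bins only).
h_raw <- hist(age, breaks = 30, plot = FALSE)
h_log <- hist(log10(age), breaks = 30, plot = FALSE)

# A monotonic transform preserves order, so with an odd number of
# observations the median maps straight across to the log scale.
median_raw <- median(age)
median_log <- median(log10(age))

# plot(h_log); abline(v = median_log) would draw the transformed histogram
# with the median marked.
```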

They started programming an average of 16 months ago.

This average is 5 months longer than that of the full survey dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.00    8.00   16.17   20.00  360.00      31

Like the age plot, the median programming experience of 8 months is much clearer once logarithmically transformed.

Learner Goals and Approaches

The average respondent dedicates 14 hours per week to learning.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.41   20.00   80.00      30

Again, log transformation makes the right-skewed data’s distribution clearer. The first quartile, median, and third quartile of 5, 10, and 20, respectively, are easily detectable.

No respondents want to freelance or start their own business.*

Compared to 40% for the full new coder survey, this is a bit shocking. I understand the demand for data scientists and engineers in industry, but I have a hunch these zero counts are caused by the survey’s design. Every respondent that answered the job role of interest question has zero counts for “start your own business” and “freelance.”

52% are already applying for jobs, or will start applying within the next year.

The data-related subset has a longer time horizon than the full survey dataset, where 65% are applying within the next year.

Most of them want to work in an office, as opposed to remotely.

And a majority are willing to relocate.

Most of them have not yet attended any in-person coding events.

On average, they use at least three different resources for learning.

Those interested in data science and/or engineering use Coursera, edX, and Udacity more frequently than new coders in general. These companies cover a wider range of subject areas than some of the coding-specific resources listed.

64% of developing data scientists and engineers have used at least one of Coursera, edX, or Udacity.

## [1] 0.6393189

Only 46% of new coders in general have used at least one of these resources.

## [1] 0.4591549
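
A proportion like the two above could be computed along these lines. The column names and the encoding (1 for a ticked box, NA otherwise) are assumptions about the dataset, and the toy rows are invented.

```r
# Toy stand-in: 1 = used the resource, NA = box left unticked (assumed encoding).
survey <- data.frame(
  ResourceCoursera = c(1, NA, NA, 1, NA),
  ResourceEdX      = c(NA, NA, 1, 1, NA),
  ResourceUdacity  = c(NA, NA, 1, NA, NA)
)

mooc_cols <- c("ResourceCoursera", "ResourceEdX", "ResourceUdacity")

# A respondent counts once, no matter how many of the three they used.
used_any  <- rowSums(survey[, mooc_cols], na.rm = TRUE) > 0
share_any <- mean(used_any)
share_any
```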

Only 1% have attended a bootcamp.

6% of new coders from the full survey dataset have attended a bootcamp.

Demographics and Socioeconomics

Data-focused respondents represent 166 countries.

More than 90% are from North America, Europe, and Asia.

The dominating percentage of North Americans should be expected because Free Code Camp is based in the United States.

Their cities span a wide range of urbanization levels.

Just under a quarter of respondents are ethnic minorities in their country.

And nearly half are non-native English speakers. They grew up speaking one of 148 languages.

67% have earned at least a bachelor’s degree.

Compared to 58% for the full new coder survey, the data-focused subset is more skewed towards post-secondary studies.

Just over one-half are currently working.

Two-thirds of the new coder population are currently working.

A quarter work in the tech industry.

There is a higher variety of employment fields compared to the full dataset, where 50% of respondents work in software development and IT.

Median current salary is $44k.

The median current salary for the full dataset is $37k.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   25000   43600   48420   60000  200000     390

And they expect to earn a median of $60k with their new data science/engineering skills.

The median for the full survey dataset is $50k. With data science/engineering being notoriously lucrative in 2016, some respondents might be seeking higher wages.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   61110   80000  200000      65

7% have served in their country’s military.

## has served in military 
##             0.06501548

13% have children, and another 3% financially support an elderly or disabled relative. And one-fifth are doing this without the help of a spouse.

## has children 
##    0.1346749
## financially supporting 
##             0.03250774
## no spouse 
## 0.2137405

47% consider themselves underemployed (working a job that is below their education level).

This is 5 percentage points higher than for new coders in general.

## is underemployed 
##        0.4705882

If they have a home mortgage, they owe an average of $194k.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   76000  150000  194400  240000 1000000     591

If they have student loans, they owe an average of $37k.

This average is $3k more than that of the full survey dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   10000   20000   36880   45000 1000000     485

With the million-dollar outlier removed, the distribution is much clearer, with the majority of debt under $75k. I hope that outlier is a joke.

14% don’t yet have high-speed internet at home.

## has high-speed internet 
##               0.8573913

And 3% are currently receiving disability benefits from their government.

## is receiving disability benefits 
##                       0.02608696

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

There isn’t really a singular main feature of interest in the “2016 New Coder Survey” dataset. There are several smaller features, but nothing stands out like diamond price and its relationship to carat weight, cut, colour, etc. in the R diamonds dataset, for example. The diamonds dataset covers two time periods (the existence of the diamond pre-sale and post-sale), whereas the survey dataset only covers a single period (the early stages of an individual’s coding career).

If we could fast-forward several years and survey the same respondents, the main feature of interest might be career earnings (adjusted for cost of living, preferably) and/or self-reported career satisfaction. A predictive model using a combination of variables from the 2016 survey could then be built to estimate career success.

If the survey asked “Are you already working as a data scientist/engineer?” instead of “Are you already working as a software developer?”, the current income variable might be a main feature of interest. Unfortunately, the answer to that question cannot be extracted from the existing variables.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Though there isn’t a main feature of interest, we can separate the respondents who answered something other than “Data Scientist/Data Engineer” to the job role interest question and compare the subsets using bivariate and multivariate plots.

I will also explore several of the smaller features, six of them numerical and four of them categorical.

The numerical:

  1. Age
  2. Programming experience
  3. Hours dedicated to learning weekly
  4. Current salary
  5. Expected next salary
  6. Student debt remaining

The categorical:

  1. Gender
  2. Citizenship by continent
  3. School degree
  4. Ethnic majority vs. minority

Of the features you investigated, were there any unusual distributions?

There is a lot of long-tail data that requires transformation to view the details of the distribution. Programming experience, for example, is strongly positively skewed. Some respondents have coded for one month, others for 20+ years.

That no respondents want to freelance or start their own business seems strange. Perhaps a survey design choice caused these zero counts. Every respondent that answered the job role of interest question has zero counts for “start your own business” and “freelance.”

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The following operations were performed to tidy, adjust, or change the form of the data:

  • Each code event, resource, and podcast is represented by a boolean variable. I summed the number of yeses for each, which created a single row of sums. I used tidyr’s gather() to transform the data from a wide format to a long format. Then I transformed the long data into factor format, using the replicate function with the number of yeses as the multiplier. This data is used to create the code event, resource, and podcast bar charts.
  • After subselecting all code event, resource, and podcast columns separately, I created a new boolean variable named answered, where 1 represents using at least one event/resource/podcast and 0 represents using none. The answered sum total is used in the “x out of 646 developing data scientists/engineers answered” label at the bottom of each bar chart.
  • I separated data-specific podcasts in the user-inputted PodcastOther category into their own boolean variables.
  • I changed “NA” in the EmploymentStatus variable to “other” if the respondent provided the user-inputted EmploymentStatusOther variable.
  • I changed “NA” in the EmploymentField variable to “other” if the respondent provided the user-inputted EmploymentFieldOther variable.
  • I split the “Americas” continent returned by countrycode() into North and South America.
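
The wide-to-long-to-factor reshaping and the answered flag from the first two bullets can be sketched in base R as below. This is a stand-in under stated assumptions: it uses colSums and rep where the analysis used tidyr's gather() and a replicate step, and the toy rows and the 1/NA encoding are invented.

```r
# Toy stand-in for three boolean podcast columns (1 = yes, NA = not ticked).
podcasts <- data.frame(
  PodcastPartiallyDerivative = c(1, 1, NA, NA),
  PodcastBecomingDataSci     = c(NA, 1, NA, NA),
  PodcastTalkingMachines     = c(1, 1, 1, NA)
)

# Wide: a single row of yes-counts, one entry per podcast.
yes_counts <- colSums(podcasts, na.rm = TRUE)

# Long: one row per podcast (base R stand-in for tidyr's gather()).
long <- data.frame(
  podcast = names(yes_counts),
  n       = as.vector(yes_counts),
  stringsAsFactors = FALSE
)

# Factor: repeat each podcast name once per yes, ready for a bar chart.
responses <- factor(rep(long$podcast, times = long$n))

# answered: 1 if the respondent ticked at least one podcast, else 0.
answered <- as.integer(rowSums(podcasts, na.rm = TRUE) > 0)
sum(answered)
```

Here sum(answered) plays the role of the “x” in the label at the bottom of each bar chart.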

The first five operations were performed so bar charts could be created, which wasn’t possible with the original data format. The Americas separation was performed for additional insight.
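
The “NA” to “other” recoding for the employment variables might look like the sketch below. The toy rows are invented, and the assumption is that both columns are stored as character vectors with true NA values.

```r
# Toy stand-in; values are invented for illustration.
survey <- data.frame(
  EmploymentField      = c("software development", NA, NA),
  EmploymentFieldOther = c(NA, "beekeeping", NA),
  stringsAsFactors = FALSE
)

# Recode only rows where the main field is missing but free text was supplied.
fill_other <- is.na(survey$EmploymentField) & !is.na(survey$EmploymentFieldOther)
survey$EmploymentField[fill_other] <- "other"

survey$EmploymentField
```

Rows with neither value stay NA, so genuine non-responses are not counted as “other” in the bar charts.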


Bivariate Plots

14,974 respondents did not answer “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”

## [1] 14974   119

SPLOMs

The next two plots are created using pairs.panels() from the psych package. Each displays a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and Pearson correlations above the diagonal.

For the data science subset of the survey, all correlations are below 0.4, which supports my statement that no main feature exists. The strongest of the correlations are:

  • Age and Income (0.30)
  • Income and ExpectedEarning (0.36)
  • Income and StudentDebtOwe (0.34)

The phenomena revealed are intuitive, but not groundbreaking: you tend to make more money when you are older, you tend to expect your next job to have a high salary if your current one does, and expensive schooling can lead to higher income levels.

For the non-data science subset of the survey, all correlations are again below 0.4. Most of the correlations are within 0.1 of the corresponding values for the data science subset, except for three:

  • Age and StudentDebtOwe (0.24 - 0.10 = 0.14)
  • MonthsProgramming and StudentDebtOwe (-0.07 - 0.09 = -0.16)
  • Income and StudentDebtOwe (0.34 - 0.08 = 0.26)
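
The comparison behind this list can be sketched with base R's cor. The data below are simulated, so the numbers will not match the survey, and the make_subset helper is hypothetical. Only the technique carries over: subtract one Pearson correlation matrix from the other and flag the pairs that differ by more than 0.1.

```r
set.seed(1)
# Hypothetical helper that simulates one subset; not the survey data.
make_subset <- function(n) {
  age <- rnorm(n, mean = 28, sd = 7)
  data.frame(
    Age            = age,
    Income         = 20000 + 800 * age + rnorm(n, 0, 15000),
    StudentDebtOwe = abs(rnorm(n, 30000, 20000))
  )
}
data_sci <- make_subset(200)
others   <- make_subset(2000)
others$Income[1:50] <- NA  # mimic the many missing answers in survey columns

# Pearson correlations from every pair of rows where both values are present.
cor_data   <- cor(data_sci, use = "pairwise.complete.obs")
cor_others <- cor(others,   use = "pairwise.complete.obs")

# Flag variable pairs whose correlations differ by more than 0.1.
cor_diff <- cor_data - cor_others
flagged  <- which(abs(cor_diff) > 0.1 & upper.tri(cor_diff), arr.ind = TRUE)
round(cor_diff, 2)
```

Restricting the check to upper.tri avoids listing each pair twice, since the matrix is symmetric.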

Interesting. Student debt levels are involved in all three correlations. I bet the aforementioned skew towards post-secondary studies for the data science subset plays a role here, where higher levels of student debt are expected. Expensive schooling has not led to higher salaries as frequently for those not interested in data science and engineering.

Let’s zoom in on the comparatively strong age-income correlation, this time for the full survey dataset. Note that the correlation holds even though the majority of $200k salaries belong to respondents under 40.