The original “2016 New Coder Survey” dataset consists of 113 variables. Most of these variables are answers to survey questions, though a few are computer-generated (e.g. respondent ID and survey start/end times). The dataset contains over 15,000 observations (i.e. respondents).
The output of the str function is long and messy, so I won’t print it here; please consult Free Code Camp’s survey data dictionary instead. Most variables are Boolean, numeric, or categorical.
I created six new variables from existing variables, using ifelse statements and the cut function on HoursLearning.
These new variables bring our total to 119 variables.
## [1] 15620 119
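As a minimal sketch of how derived variables like these can be built, the block below applies ifelse and cut to a toy data frame. Only HoursLearning is named in the text; the Gender column, the new column names, and the break points are assumptions made for illustration, not the definitions used in this analysis.

```r
# Toy data: only HoursLearning comes from the text, the rest is hypothetical.
survey <- data.frame(
  HoursLearning = c(2, 10, 25, 60, NA),
  Gender = c("female", "male", "female", NA, "male"),
  stringsAsFactors = FALSE
)

# ifelse() recodes a column into a 0/1 flag; a missing answer stays NA.
survey$IsFemale <- ifelse(survey$Gender == "female", 1, 0)

# cut() bins a numeric column into labelled intervals (illustrative breaks).
survey$HoursBucket <- cut(
  survey$HoursLearning,
  breaks = c(0, 5, 20, 40, Inf),
  labels = c("0-5", "6-20", "21-40", "40+"),
  include.lowest = TRUE
)

dim(survey)  # number of rows and columns after adding the new variables
```

Each new column adds one to the second number reported by dim, which is how six derived variables take the total from 113 to 119.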
646 respondents answered “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”
## [1] 646 119
Free Code Camp’s article structure is intentionally mimicked for the purpose of direct comparison. Additional comments are included where the results significantly differ. A few bonus plots are included too!
CodeNewbie and Free Code Camp designed the survey, and dozens of coding-related organizations publicized it to their members.
Of the 646 developing data scientists and data engineers who responded to the survey:
Data science and engineering appear to draw slightly more women than coding in general, where 21% of new coders are women.
## female
## 0.2447917
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.00 22.00 26.00 27.72 31.25 65.00 74
The median of 26 years is clearer once the long-tail data is log transformed.
This average is 5 months longer than that of the full survey dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 8.00 16.17 20.00 360.00 31
Like the age plot, the median programming experience of 8 months is much clearer once logarithmically transformed.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.00 10.00 14.41 20.00 80.00 30
Again, log transformation makes the right-skewed data’s distribution clearer. The first quartile, median, and third quartile of 5, 10, and 20, respectively, are easily detectable.
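The effect of the log transformation can be sketched with a small synthetic vector. The values below are made up to be right-skewed and are not survey data; the zero handling is an assumption worth checking, since several of the summaries above have a minimum of 0.

```r
# Synthetic right-skewed values standing in for a column such as weekly hours.
hours <- c(1, 2, 5, 5, 8, 10, 10, 15, 20, 20, 40, 80)

summary(hours)  # on the raw scale the mean sits well above the median

# log10(0) is -Inf, so zeros must be dropped or shifted before transforming.
log_hours <- log10(hours[hours > 0])

summary(log_hours)  # on the log scale the mean and median nearly coincide
```

Passing log_hours to a histogram function, or using a log-scaled axis in the plotting library of choice, spreads out the crowded low values so that the quartiles become visible.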
Compared to 40% for the full new coder survey, this is a bit shocking. I understand the demand for data scientists and engineers in industry, but I have a hunch these zero counts are caused by the survey’s design. Every respondent who answered the job role of interest question has zero counts for “start your own business” and “freelance.”
The data-related subset has a longer time horizon than the full survey dataset, where 65% are applying within the next year.
Those interested in data science and/or engineering use Coursera, edX, and Udacity more frequently than new coders in general. These companies cover a wider range of subject areas than some of the coding-specific resources listed.
64% of developing data scientists and engineers have used at least one of Coursera, edX, or Udacity.
## [1] 0.6393189
Only 46% of new coders in general have used at least one of these resources.
## [1] 0.4591549
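A proportion like the two above can be computed by checking, row by row, whether any of the three indicator columns is set. This is only a sketch: the column names and the convention that 1 means "used" and NA means "not selected" are assumptions about how the data is coded.

```r
# Toy indicator columns; names and the 1/NA coding are assumptions.
resources <- data.frame(
  ResourceCoursera = c(1, NA, NA, 1, NA),
  ResourceEdX      = c(NA, NA, 1, 1, NA),
  ResourceUdacity  = c(NA, NA, NA, NA, 1)
)

# A respondent counts if at least one of the three columns equals 1.
used_any <- rowSums(resources == 1, na.rm = TRUE) > 0

mean(used_any)  # share of respondents who used at least one of the three
```

Taking the mean of a logical vector gives the fraction of TRUE values, so the same two lines work for the data-focused subset and for the full dataset.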
6% of new coders from the full survey dataset have attended a bootcamp.
The dominating percentage of North Americans should be expected because Free Code Camp is based in the United States.
Compared to 58% for the full new coder survey, the data-focused subset is more skewed towards post-secondary studies.
Diversity amongst majors is greater compared to the full survey, where Computer Science and Information Technology checked in at #1 and #2 with 17% and 5%, respectively.
Two-thirds of the new coder population are currently working.
There is a greater variety of employment fields compared to the full dataset, where 50% of respondents work in software development and IT.
The median current salary for the full dataset is $37k.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 25000 43600 48420 60000 200000 390
The median for the full survey dataset is $50k. With data science/engineering being notoriously lucrative in 2016, some respondents might be seeking higher wages.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 40000 60000 61110 80000 200000 65
## has served in military
## 0.06501548
## has children
## 0.1346749
## financially supporting
## 0.03250774
## no spouse
## 0.2137405
This is 5% higher than new coders in general.
## is underemployed
## 0.4705882
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 76000 150000 194400 240000 1000000 591
This average is $3k more than that of the full survey dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 10000 20000 36880 45000 1000000 485
With the million-dollar outlier removed, the distribution is much clearer, with the majority of debt under $75k. I hope that outlier is a joke.
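Dropping a single extreme value before plotting might look like the sketch below. The vector is synthetic, and the cutoff of one million is an assumption chosen to match the outlier mentioned above rather than a rule taken from the analysis.

```r
# Synthetic debt values with one extreme entry and one missing answer.
debt <- c(0, 10000, 20000, 45000, 1000000, NA)

# Keep answered values strictly below the assumed cutoff.
debt_trimmed <- debt[!is.na(debt) & debt < 1000000]

summary(debt_trimmed)  # the maximum now reflects the bulk of the data
```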
## has high-speed internet
## 0.8573913
## is receiving disability benefits
## 0.02608696
There isn’t really a singular main feature of interest in the “2016 New Coder Survey” dataset. There are several smaller features, but nothing stands out like diamond price and its relationship to carat weight, cut, colour, etc. in the R diamonds dataset, for example. The diamonds dataset covers two time periods (the existence of the diamond pre-sale and post-sale), whereas the survey dataset only covers a single period (the early stages of an individual’s coding career).
If we could fast-forward several years and survey the same respondents, the main feature of interest might be career earnings (adjusted for cost of living, preferably) and/or self-reported career satisfaction. A predictive model using a combination of variables from the 2016 survey could then be built to estimate career success.
If the survey asked “Are you already working as a data scientist/engineer?” instead of “Are you already working as a software developer?”, the current income variable might be a main feature of interest. Unfortunately, the answer to that question cannot be extracted from the existing variables.
Though there isn’t a main feature of interest, we can separate the respondents who answered something other than “Data Scientist/Data Engineer” to the job role interest question and compare the subsets using bivariate and multivariate plots.
I will also explore several of the smaller features, six of them numerical and four of them categorical.
The numerical:
The categorical:
There is a lot of long-tail data that requires transformation to view the details of the distribution. Programming experience, for example, is strongly positively skewed. Some respondents have coded for one month, others for 20+ years.
That no respondents want to freelance or start their own business seems strange. Perhaps a survey design choice caused these zero counts. Every respondent who answered the job role of interest question has zero counts for “start your own business” and “freelance.”
The following operations were performed to tidy, adjust, or change the form of the data:
gather() was used to transform the data from a wide format to a long format. Then I transformed the long data into factor format, using the replicate function with the number of yeses as the multiplier. This data is used to create the code event, resource, and podcast bar charts.
The first five operations were performed so bar charts could be created, which wasn’t possible with the original data format. The Americas separation was performed for additional insight.
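The wide-to-long step and the replication by yes counts can be sketched in base R. This is not the gather() call used here: it swaps in colSums and rep so the example runs without extra packages, and it assumes that rep is what "replicate" refers to. The column names and the 1/NA coding are also assumptions.

```r
# Toy wide data: one indicator column per resource (hypothetical names).
wide <- data.frame(
  ResourceCoursera = c(1, NA, 1, 1),
  ResourceEdX      = c(NA, NA, 1, NA),
  ResourceUdacity  = c(1, NA, NA, NA)
)

# Wide to long: one row per resource with its number of "yes" answers.
yes_counts <- colSums(wide == 1, na.rm = TRUE)
long <- data.frame(resource = names(yes_counts), yes = as.integer(yes_counts))

# Repeat each resource name by its yes count and store the result as a factor.
resource_factor <- factor(rep(long$resource, times = long$yes))

table(resource_factor)  # one count per level, ready for a bar chart
```

The factor has one element per "yes" answer, which is the shape a bar chart function expects when it counts occurrences itself.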
14974 respondents did not answer “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”
## [1] 14974 119
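The two subset sizes add up to the full dataset (646 + 14974 = 15620), which suggests the second group also contains respondents who skipped the question. A sketch of a split that behaves this way is shown below; the column name JobRoleInterest and the toy values are assumptions.

```r
# Toy role column; NA means the question was skipped (assumed coding).
survey <- data.frame(
  JobRoleInterest = c("Data Scientist/Data Engineer", "Full-Stack Web Developer",
                      NA, "Data Scientist/Data Engineer", NA),
  stringsAsFactors = FALSE
)

# TRUE only for an explicit data science/engineering answer, never NA.
is_data <- !is.na(survey$JobRoleInterest) &
  survey$JobRoleInterest == "Data Scientist/Data Engineer"

data_subset  <- survey[is_data, , drop = FALSE]
other_subset <- survey[!is_data, , drop = FALSE]  # includes skipped answers

nrow(data_subset) + nrow(other_subset) == nrow(survey)
```

Comparing with != alone would produce NA for skipped answers, so building the logical mask first keeps the two groups complementary.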
The next two plots are created using pairs.panels() from the psych package. They display a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.
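The numbers shown above the diagonal can be approximated without the psych package by computing a Pearson correlation matrix in base R. The sketch below uses made-up values and illustrative column names, and the pairwise handling of missing values is an assumption rather than a statement about how the plots were produced.

```r
# Toy numeric columns; names and values are illustrative only.
df <- data.frame(
  Age             = c(22, 25, 31, 40, 28, NA),
  Income          = c(20000, 30000, 45000, 70000, NA, 50000),
  ExpectedEarning = c(40000, 50000, 60000, 90000, 55000, 65000)
)

# Pearson correlations over every pair of rows where both values are present.
r <- cor(df, use = "pairwise.complete.obs", method = "pearson")

round(r, 2)  # symmetric matrix with ones on the diagonal
```

Calling pairs(df) gives a plain base R scatter plot matrix of the same columns, without the histograms and correlation panels.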
For the data science subset of the survey, all correlations are below 0.4, which supports my statement that no main feature exists. The strongest of the correlations are:
The phenomena revealed are intuitive, but not groundbreaking: you tend to make more money when you are older, you tend to expect your next job to have a high salary if your current one does, and expensive schooling can lead to higher income levels.
For the non-data science subset of the survey, all correlations are again below 0.4. Most of the correlations are within 0.1 of the data science subset, except for three:
Interesting. Student debt levels are involved in all three correlations. I bet the aforementioned skew towards post-secondary studies for the data science subset plays a role here, where higher levels of student debt are expected. Expensive schooling has not led to higher salaries as frequently for those not interested in data science and engineering.
Let’s zoom in on the strong age-income correlation, this time for the full survey dataset. Note that the strength exists despite the majority of $200k salaries belonging to respondents under 40.