Exploring Free Code Camp’s “2016 New Coder Survey”

By David Venturi

Structure of Dataset

The original “2016 New Coder Survey” dataset consists of 113 variables. Most of these variables are answers to survey questions, though a few are computer-generated (e.g. respondent ID and survey start/end times). Over 15,000 observations (i.e. respondents) exist.

The str function output is long and messy, so I won’t print it here; please consult Free Code Camp’s survey data dictionary instead. Most of the variables are Boolean, numeric, or categorical.

New Variables

I created six new variables from existing variables:

ContinentCitizen and ContinentLive from CountryCitizen and CountryLive using Vincent Arel-Bundock’s countrycode R package
PodcastPartiallyDerivative, PodcastBecomingDataSci, and PodcastTalkingMachines from PodcastOther using ifelse statements
HoursLearningBucket using the cut function on HoursLearning

These new variables bring our total to 119 variables.

## [1] 15620   119
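For readers who want to reproduce the bucketing and podcast flags, here is a minimal sketch on toy data. The break points are inferred from the bucket labels that appear later in this analysis, and the case-insensitive text match on PodcastOther is an assumption rather than the exact rule used.

```r
# Toy data standing in for the survey; column names follow the text above.
survey <- data.frame(
  HoursLearning = c(2, 10, 15, 40, 60),
  PodcastOther  = c("Talking Machines", NA, "partially derivative", "Serial", NA),
  stringsAsFactors = FALSE
)

# Bucket weekly hours into right-closed intervals.
# Breaks are inferred from the labels (0,10], (10,20], (20,40], (40,100].
survey$HoursLearningBucket <- cut(survey$HoursLearning,
                                  breaks = c(0, 10, 20, 40, 100))

# Flag one podcast from the free-text field (matching rule is an assumption).
# grepl() returns FALSE for NA input, so missing answers become 0.
survey$PodcastTalkingMachines <- ifelse(
  grepl("talking machines", survey$PodcastOther, ignore.case = TRUE), 1, 0)

print(survey)
```

Note that with these breaks a respondent reporting 0 hours falls outside the first interval and is left as NA, because (0,10] is open on the left.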

Data Science/Engineering Subset

646 respondents answered “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”

## [1] 646 119
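A subset like this can be sketched as follows. The toy data frame and the exact role label are assumptions for illustration; the point is that %in% treats missing answers as non-matches, so the two subsets always add up to the full dataset.

```r
# Toy stand-in for the survey; only two columns are needed for the sketch.
survey <- data.frame(
  JobRoleInterest = c("Data Scientist / Data Engineer", NA,
                      "Mobile Developer", "Data Scientist / Data Engineer"),
  Age = c(26, 30, 24, NA),
  stringsAsFactors = FALSE
)

role <- "Data Scientist / Data Engineer"

# %in% returns FALSE (not NA) for missing answers.
is_data <- survey$JobRoleInterest %in% role

data_subset  <- survey[is_data, ]
other_subset <- survey[!is_data, ]

dim(data_subset)
dim(other_subset)
```

This is consistent with the row counts reported in this analysis: 646 and 14974 sum to the full 15620.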

The following analysis first explores the characteristics of these developing data scientists/engineers, which complements Free Code Camp’s univariate exploration of new coders in general.

Free Code Camp’s article structure is intentionally mimicked for the purpose of direct comparison. Additional comments are included where the results significantly differ. A few bonus plots are included too!

We’ll then dive deeper into the characteristics of new coders in general via bivariate and multivariate exploration.

Univariate Plots

Who Participated

CodeNewbie and Free Code Camp designed the survey, and dozens of coding-related organizations publicized it to their members.

Of the 646 developing data scientists and data engineers who responded to the survey:

A quarter are women.

Data science and engineering appear to draw a slightly higher share of women, as 21% of new coders in general are women.

##    female 
## 0.2447917

Their median age is 26.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   22.00   26.00   27.72   31.25   65.00      74

The median of 26 years is clearer once the long-tail data is log transformed.

They started programming an average of 16 months ago.

This average is 5 months longer than that of the full survey dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.00    8.00   16.17   20.00  360.00      31

Like the age plot, the median programming experience of 8 months is much clearer once logarithmically transformed.

Learner Goals and Approaches

The average respondent dedicates 14 hours per week to learning.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.41   20.00   80.00      30

Again, log transformation makes the right-skewed data’s distribution clearer. The first quartile, median, and third quartile of 5, 10, and 20, respectively, are easily detectable.
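To illustrate why the log transformation helps, here is a small sketch on made-up hours (not survey values). It compares how far the mean sits from the median, relative to the range, before and after a base-10 log. Zeros are dropped first because log10(0) is -Inf; how the plots here handled zeros is not stated, so that step is an assumption.

```r
# Made-up right-skewed weekly hours, including a zero.
hours <- c(0, 1, 2, 5, 5, 10, 10, 10, 20, 20, 40, 80)

# log10(0) is -Inf, so keep strictly positive values only (assumption).
positive  <- hours[hours > 0]
log_hours <- log10(positive)

# Crude skew indicator: (mean - median) scaled by the range.
skew_gap <- function(x) (mean(x) - median(x)) / diff(range(x))

skew_gap(positive)   # clearly positive: the long right tail pulls the mean up
skew_gap(log_hours)  # much closer to zero on the log scale
```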

No respondents want to freelance or start their own business.*

Compared to 40% for the full new coder survey, this is a bit shocking. I understand the demand for data scientists and engineers in industry, but I have a hunch these zero counts are caused by the survey’s design. Every respondent that answered the job role of interest question has zero counts for “start your own business” and “freelance.”

52% are already applying for jobs, or will start applying within the next year.

The data-related subset has a longer time horizon than the full survey dataset, where 65% are applying within the next year.

Most of them want to work in an office, as opposed to remotely.

And a majority are willing to relocate.

Most of them have not yet attended any in-person coding events.

On average, they use at least three different resources for learning.

Those interested in data science and/or engineering use Coursera, edX, and Udacity more frequently than new coders in general. These companies cover a wider range of subject areas than some of the coding-specific resources listed.

64% of developing data scientists and engineers have used at least one of Coursera, edX, or Udacity.

## [1] 0.6393189

Only 46% of new coders in general have used at least one of these resources.

## [1] 0.4591549
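A proportion like the two above can be computed from the boolean resource columns. The sketch below uses toy data, and the column names are hypothetical stand-ins for the survey's resource variables.

```r
# Toy boolean resource columns: 1 means "used it", NA means not selected.
resources <- data.frame(
  ResourceCoursera = c(1, NA, NA, 1),
  ResourceEdX      = c(NA, NA, 1, 1),
  ResourceUdacity  = c(NA, NA, NA, NA)
)

# A respondent counts if at least one of the three columns is a yes.
used_any <- rowSums(resources, na.rm = TRUE) > 0

# Share of respondents who used at least one of the three.
mean(used_any)
```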

Only 1% have attended a bootcamp.

6% of new coders from the full survey dataset have attended a bootcamp.

Demographics and Socioeconomics

Data-focused respondents represent 166 countries.

More than 90% are from North America, Europe, and Asia.

The dominating percentage of North Americans should be expected because Free Code Camp is based in the United States.

Their cities span a wide range of urbanization levels.

Just under a quarter of respondents are ethnic minorities in their country.

And nearly half are non-native English speakers. They grew up speaking one of 148 languages.

67% have earned at least a bachelor’s degree.

Compared to 58% for the full new coder survey, the data-focused subset is more skewed towards post-secondary studies.

They studied 425 different majors. Computer Science and Mathematics were the two most popular majors, and an additional 16% studied some form of engineering.

Diversity amongst majors is greater compared to the full survey, where Computer Science and Information Technology checked in at #1 and #2 with 17% and 5%, respectively.

Just over one-half are currently working.

Two-thirds of the new coder population are currently working.

A quarter work in the tech industry.

There is a higher variety of employment fields compared to the full dataset, where 50% of respondents work in software development and IT.

Median current salary is $44k.

The median current salary for the full dataset is $37k.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   25000   43600   48420   60000  200000     390

And they expect to earn a median of $60k with their new data science/engineering skills.

The median for the full survey dataset is $50k. With data science/engineering being notoriously lucrative in 2016, some respondents might be seeking higher wages.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   61110   80000  200000      65

7% have served in their country’s military.

## has served in military 
##             0.06501548

13% have children, and another 3% financially support an elderly or disabled relative. And one-fifth are doing this without the help of a spouse.

## has children 
##    0.1346749

## financially supporting 
##             0.03250774

## no spouse 
## 0.2137405

47% consider themselves underemployed (working a job that is below their education level).

This is 5 percentage points higher than for new coders in general.

## is underemployed 
##        0.4705882

If they have a home mortgage, they owe an average of $194k.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   76000  150000  194400  240000 1000000     591

If they have student loans, they owe an average of $37k.

This average is $3k more than that of the full survey dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   10000   20000   36880   45000 1000000     485

Removing the million dollar outlier, the distribution is much clearer with the majority of debt under $75k. I hope that outlier is a joke.

14% don’t yet have high-speed internet at home.

## has high-speed internet 
##               0.8573913

And 3% are currently receiving disability benefits from their government.

## is receiving disability benefits 
##                       0.02608696

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

There isn’t really a singular main feature of interest in the “2016 New Coder Survey” dataset. There are several smaller features, but nothing stands out like diamond price and its relationship to carat weight, cut, colour, etc. in the R diamonds dataset, for example. The diamonds dataset covers two time periods (the existence of the diamond pre-sale and post-sale), whereas the survey dataset only covers a single period (the early stages of an individual’s coding career).

If we could fast-forward several years and survey the same respondents, the main feature of interest might be career earnings (adjusted for cost of living, preferably) and/or self-reported career satisfaction. A predictive model using a combination of variables from the 2016 survey could then be built to estimate career success.

If the survey asked “Are you already working as a data scientist/engineer?” instead of “Are you already working as a software developer?”, the current income variable might be a main feature of interest. Unfortunately, the answer to that question cannot be extracted from the existing variables.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Though there isn’t a main feature of interest, we can separate the respondents who answered something other than “Data Scientist/Data Engineer” to the job role interest question and compare the subsets using bivariate and multivariate plots.

I will also explore several of the smaller features, six of them numerical and four of them categorical.

The numerical:

Age
Programming experience
Hours dedicated to learning weekly
Current salary
Expected next salary
Student debt remaining

The categorical:

Gender
Citizenship by continent
School degree
Ethnic majority vs. minority

Of the features you investigated, were there any unusual distributions?

There is a lot of long-tail data that requires transformation to view the details of the distribution. Programming experience, for example, is really positively skewed. Some respondents have coded for one month, others for 20+ years.

That no respondents want to freelance or start their own business seems strange. Perhaps a survey design choice caused these zero counts. Every respondent that answered the job role of interest question has zero counts for “start your own business” and “freelance.”

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The following operations were performed to tidy, adjust, or change the form of the data:

Each code event, resource, and podcast is represented by a boolean variable. I summed the number of yeses for each, which created a single row of sums. I used tidyr’s gather() to transform the data from a wide format to a long format. Then I transformed the long data into factor format, using the replicate function with the number of yeses as the multiplier. This data is used to create the code event, resource, and podcast bar charts.
After subselecting all code event, resource, and podcast columns separately, I created a new boolean variable named answered, where 1 represents using at least one event/resource/podcast and 0 represents using none. The answered sum total is used in the “x out of 646 developing data scientists/engineers answered” label at the bottom of each bar chart.
I separated data-specific podcasts in the user-inputted PodcastOther category into their own boolean variables.
I changed “NA” in the EmploymentStatus variable to “other” if the respondent provided the user-inputted EmploymentStatusOther variable.
I changed “NA” in the EmploymentField variable to “other” if the respondent provided the user-inputted EmploymentFieldOther variable.
I separated the “Americas” continents outputted by countrycode() into North and South America.

The first five operations were performed so bar charts could be created, which wasn’t possible with the original data format. The Americas separation was performed for additional insight.
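The wide-to-long reshaping and the answered flag can be sketched on toy data as follows. The original reshaping used tidyr’s gather(); this sketch swaps in base R so it runs without extra packages, and the column names are hypothetical.

```r
# Toy boolean event columns: 1 means "attended", NA means not selected.
events <- data.frame(
  CodeEventHackathons = c(1, NA, 1, NA),
  CodeEventMeetup     = c(1, 1, 1, NA),
  CodeEventNone       = c(NA, NA, NA, NA)
)

# One sum of yeses per column (the "single row of sums").
yes_counts <- colSums(events, na.rm = TRUE)

# Long format: one row per event with its count.
long <- data.frame(event = names(yes_counts),
                   n = as.vector(yes_counts),
                   stringsAsFactors = FALSE)

# Repeat each event name once per yes, so a bar chart can count rows.
event_factor <- factor(rep(long$event, times = long$n))
table(event_factor)

# answered: 1 if the respondent ticked at least one column, else 0.
answered <- as.integer(rowSums(events, na.rm = TRUE) > 0)
sum(answered)
```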

Bivariate Plots

14974 respondents did not answer “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”

## [1] 14974   119

SPLOMs

The next two plots are created using pairs.panels() from the psych package. They display a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.

For the data science subset of the survey, all correlations are below 0.4, which supports my statement that no main feature exists. The strongest of the correlations are:

Age and Income (0.30)
Income and ExpectedEarning (0.36)
Income and StudentDebtOwe (0.34)

The phenomena revealed are intuitive, but not groundbreaking: you tend to make more money when you are older, you tend to expect your next job to have a high salary if your current one does, and expensive schooling can lead to higher income levels.

For the non-data science subset of the survey, all correlations are again below 0.4. Most of the correlations are within 0.1 of the data science subset, except for three:

Age and StudentDebtOwe (0.24 - 0.10 = 0.14)
MonthsProgramming and StudentDebtOwe (-0.07 - 0.09 = -0.16)
Income and StudentDebtOwe (0.34 - 0.08 = 0.26)

Interesting. Student debt levels are involved in all three correlations. I bet the aforementioned skew towards post-secondary studies for the data science subset plays a role here, where higher levels of student debt are expected. Expensive schooling has not led to higher salaries as frequently for those not interested in data science and engineering.
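The correlations quoted in this section are Pearson coefficients. A minimal sketch of computing such a matrix with base R, on made-up values, is below. Using pairwise-complete observations to cope with missing answers is an assumption about how the gaps were handled.

```r
# Made-up values with some missing answers, as in survey data.
toy <- data.frame(
  Age            = c(22, 25, 31, 40, 28, NA),
  Income         = c(20000, 30000, 45000, 60000, NA, 52000),
  StudentDebtOwe = c(5000, NA, 20000, 30000, 15000, 10000)
)

# Pearson correlation for every pair of columns, using whichever rows
# are complete for that particular pair.
r <- cor(toy, use = "pairwise.complete.obs", method = "pearson")
round(r, 2)
```

Computing one such matrix per subset and subtracting them gives differences like the three listed above.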

Let’s zoom in on the strong age-income correlation, this time for the full survey dataset. Note that the strength exists despite the majority of $200k salaries belonging to respondents under 40.

The earnings vs. age trend, however, isn’t maintained as these individuals prepare to transition to their new job of choice. Younger individuals seem willing to capitalize on lucrative tech salaries and older individuals seem willing to take a pay cut.

Let’s use the full new coder survey for the rest of the analysis.

We’ll ditch the data science-only focus that complemented Free Code Camp’s univariate exploration of new coders, and switch to profiling new coders in general.

Gender and Citizenship

First, we’ll explore hours dedicated to learning per week and expected next salary across gender and continent citizenship. These two variables plausibly depend on the quality of the coding resources, whereas the other numerical ones (e.g. age, income, and programming experience) are already fixed.

For the following logarithmic boxplots, the horizontal line is the median and the “x” is the mean. The top of the box is the third quartile and the bottom is the first quartile. Each whisker extends to the most extreme observation within 1.5 times the interquartile range of the box.
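As a quick illustration of the box and whisker conventions, base R’s boxplot.stats() reports the five values a boxplot draws. The numbers below are made up. Base R uses hinges, which can differ slightly from the quartiles a plotting package computes, so treat this as a sketch of the whisker rule rather than a reproduction of the plots.

```r
# Made-up values with one obvious outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

s <- boxplot.stats(x, coef = 1.5)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
s$stats

# Points beyond 1.5 times the box height from the box are flagged separately.
s$out
```

Here the upper whisker stops at 9, the largest value within 1.5 box heights of the upper hinge, and 30 is reported as an outlier.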

Statistical tests and inferences follow the majority of the bivariate plots. An alpha level of 0.05 is used for the pairwise t-tests, which assume that the variables analyzed are approximately normally distributed.

Hours dedicated to learning are nearly identical across genders.

## 
##        male      female genderqueer     agender       trans 
##       10766        2840          66          38          36

Do transgender new coders actually spend more time learning? A pairwise t-test says the difference is not significant, as all trans p-values are greater than the aforementioned alpha value of 0.05.

##                     male    female genderqueer   agender
## female      0.0001183222        NA          NA        NA
## genderqueer 0.7009298469 0.7911540          NA        NA
## agender     0.3421146251 0.6519387   0.6006503        NA
## trans       0.3485157466 0.1554507   0.3260318 0.1812888
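A p-value matrix with this lower-triangular shape is what base R’s pairwise.t.test() returns. The sketch below uses made-up hours for three groups. Leaving the p-values unadjusted (p.adjust.method = "none") is an assumption; the function’s default applies a Holm correction for multiple comparisons.

```r
# Made-up weekly hours for three groups of four respondents each.
hours  <- c(10, 12, 9, 11,  14, 15, 13, 16,  30, 28, 33, 31)
gender <- factor(rep(c("male", "female", "trans"), each = 4),
                 levels = c("male", "female", "trans"))

# Pairwise t-tests with a pooled standard deviation (the default).
# p.adjust.method = "none" is an assumption, not a confirmed setting.
p <- pairwise.t.test(hours, gender, p.adjust.method = "none")$p.value
p

# Which pairs differ at the 0.05 alpha level?
p < 0.05
```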

There is not much differentiation in weekly hours dedicated to learning across continents either. All have a median of 10 hours. Asian and African students have the highest means, at 16.4 and 16.8 hours, respectively.

## 
## North America        Europe          Asia South America        Africa 
##          6744          3358          2178           567           506 
##       Oceania 
##           301

The higher Asian and African means both differ significantly from every other continent’s mean at a significance level of 0.05, though not from each other.

##               North America       Europe        Asia South America
## Europe         0.7366938549           NA          NA            NA
## Asia           0.0006050985 0.0008652786          NA            NA
## South America  0.3219580345 0.4255435379 0.006543667            NA
## Africa         0.0136500683 0.0113202845 0.553219616    0.01015108
## Oceania        0.0958745963 0.1305238713 0.002926231    0.44185526
##                    Africa
## Europe                 NA
## Asia                   NA
## South America          NA
## Africa                 NA
## Oceania       0.003587357

Females actually expect higher salaries than males, with a $9k gap in medians and a $4k gap in means. There is a huge gap in first quartiles, where the 25th percentile female expects $14k more than her male equivalent. As with hours dedicated to learning, transgender new coders have relatively higher expected salaries. Did a particularly ambitious set of trans individuals respond to the survey or are these their true traits?

## Gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   30000   50000   52620   70000  200000    6763 
## -------------------------------------------------------- 
## Gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   43650   59000   56620   70000  200000    1532 
## -------------------------------------------------------- 
## Gender: genderqueer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    7000   50000   60000   66970   70000  200000      37 
## -------------------------------------------------------- 
## Gender: agender
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   24000   36000   46500   58220   67500  200000      20 
## -------------------------------------------------------- 
## Gender: trans
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20000   44250   67500   67230   76250  200000      20
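Gaps between groups, like the ones quoted above, can be computed directly with tapply(). The values below are made up for illustration and are not survey figures.

```r
# Made-up expected salaries for two groups, with one missing answer.
toy <- data.frame(
  Gender          = c("male", "male", "male", "female", "female", "female"),
  ExpectedEarning = c(30000, 50000, NA, 45000, 60000, 70000),
  stringsAsFactors = FALSE
)

# Median per group, ignoring missing answers.
medians <- tapply(toy$ExpectedEarning, toy$Gender, median, na.rm = TRUE)
medians

# Gap in medians between the two groups.
medians[["female"]] - medians[["male"]]
```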

The gap between females and males is significant, according to a pairwise t-test, as are the gaps between genderqueer respondents and males and trans respondents and males. I wonder why minority-gendered respondents expect higher salaries for their next job.

##                     male     female genderqueer  agender
## female      2.061149e-05         NA          NA       NA
## genderqueer 8.998896e-03 0.06141337          NA       NA
## agender     4.205845e-01 0.81824355   0.3227977       NA
## trans       4.773836e-02 0.15208424   0.9768457 0.373584

Whoa. Expected next salary by continent varies way more than in the above three boxplots. Given the previously listed sample sizes for each continent, I would assume nearly all of these gaps are statistically significant. North Americans expect the highest range of salaries, with their interquartile range spanning from $50k to $70k. Europe’s 75th percentile is North America’s 25th percentile (I wonder if some European respondents forgot to convert from pounds or euros to US dollars). Expectations in Asia are all over the map.

A lot of these individuals are using similar, if not the same, online educational resources. Labour market economics are cruel.

The median new coder that dedicates 40+ hours per week expects $10k more than the median new coders from the other brackets.

## 
##   (0,10]  (10,20]  (20,40] (40,100] 
##     8175     3564     2394      694

This expectations gap, however, might be due to random chance, as the (0,10] and (40,100] means comparison failed to show significance.

##              (0,10]     (10,20]     (20,40]
## (10,20]  0.01094413          NA          NA
## (20,40]  0.05523308 0.710986652          NA
## (40,100] 0.10608015 0.003341339 0.008889624
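The bracket counts listed earlier in this section can be turned into shares with prop.table(), which makes the relative size of each bracket easier to see.

```r
# Respondent counts per weekly-hours bracket, copied from the table above.
bucket_counts <- c("(0,10]" = 8175, "(10,20]" = 3564,
                   "(20,40]" = 2394, "(40,100]" = 694)

# Share of bucketed respondents in each bracket.
shares <- prop.table(bucket_counts)
round(shares, 3)
```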

Let’s dig into that 40-100 hour bracket. Fewer than 5% of new coders are dedicating 40+ hours to learning each week. Below are the most common ages and educational backgrounds for this bracket. The bottom row is number of respondents.

## 
## 25 21 26 23 22 20 24 32 30 28 
## 43 42 39 37 34 33 33 29 28 27

## 
##                       bachelor's degree 
##                                     270 
##          some college credit, no degree 
##                                     102 
## high school diploma or equivalent (GED) 
##                                      71 
##      master's degree (non-professional) 
##                                      45 
##                        some high school 
##                                      35 
## professional degree (MBA, MD, JD, etc.) 
##                                      24

The most common respondents in this bracket are in their early to mid-twenties, and a bachelor’s degree is by far the most common educational background. It appears that they are forgoing traditional forms of higher education (like master’s and professional degrees) and using those 40+ hour weeks to learn to code.

This is the exact situation I’m in with my personalized data science master’s degree. The quality and affordability of online education in 2016 is incredible, though many still aren’t aware of the existence of resources like Free Code Camp, Udacity, and Coursera. If this survey were repeated in a few years, I would expect more respondents to be in the higher brackets.

Job Roles of Interest

Let’s now explore job roles of interest, first categorically, then numerically. Again, the most common roles are:

## 
##         Full-Stack Web Developer          Front-End Web Developer 
##                             2571                             1379 
##           Back-End Web Developer   Data Scientist / Data Engineer 
##                              704                              646 
##                 Mobile Developer         User Experience Designer 
##                              414                              275 
##                DevOps / SysAdmin                  Product Manager 
##                              219                              191 
##       Quality Assurance Engineer 
##                              104

User experience designer is by far the most diverse discipline in terms of gender, with 52% males, 46% females, and the highest percentage of agender, genderqueer, and trans respondents (2%). Mobile development is the most male-dominated discipline at 81%, though full-stack and back-end development are close.

The highest relative popularity for North America (read: biggest blue bar segment) is user experience design. Europe’s is back-end development. Asia’s, South America’s, and Africa’s is mobile development. Oceania’s is data science/engineering. Mobile developer is the most diverse discipline in terms of citizenship.

The skew towards post-secondary studies for data science and data engineering is much clearer here. Mobile development has the highest percentage of respondents with no, some, or only a high school education, though back-end development is a close second. I wonder if these skews will reflect themselves in the subsequent age boxplot.

Mobile developers are indeed the youngest, with a first quartile two years younger than the next youngest role, but back-end developers are not second-youngest. Mobile being a relatively new discipline likely has something to do with this phenomenon. Front-end development is the oldest discipline, with a mean age of 29 years.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   11.00   23.00   27.00   28.94   33.00   70.00     294 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.00   24.00   27.00   29.08   33.00   64.00     193 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.00   22.00   27.00   28.03   32.00   59.00     103 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.00   22.00   26.00   27.72   31.25   65.00      74 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    12.0    20.0    24.0    26.2    31.0    54.0      77 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.00   22.00   26.00   28.74   32.00   73.00      38

Based on the results of a pairwise t-test, we are inclined to conclude that mobile developers are the youngest, but front-end developers being the oldest might be caused by random chance.

##                           Full-Stack Developer Front-End Developer
## Front-End Developer               6.362326e-01                  NA
## Back-End Developer                1.733113e-02        1.181725e-02
## Data Scientist / Engineer         1.815968e-03        1.384457e-03
## Mobile Developer                  2.017238e-08        2.421440e-08
## UX Designer                       7.266237e-01        5.663622e-01
##                           Back-End Developer Data Scientist / Engineer
## Front-End Developer                       NA                        NA
## Back-End Developer                        NA                        NA
## Data Scientist / Engineer        0.528965428                        NA
## Mobile Developer                 0.001311138               0.008054328
## UX Designer                      0.266165761               0.114106993
##                           Mobile Developer
## Front-End Developer                     NA
## Back-End Developer                      NA
## Data Scientist / Engineer               NA
## Mobile Developer                        NA
## UX Designer                   0.0003372604

Data scientists-, data engineers-, and back-end developers-in-training have programmed the longest, with a median experience of eight months (the summaries below are in years). UX designers have the lowest first quartile, at two months, a full month below the next lowest, but front-end developers have the lowest average, at 9.5 months. Programming experience is so positively skewed that some of the means are above their third quartile.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.250   0.500   1.043   1.000  40.830      88 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.2500  0.5000  0.7917  1.0000 15.0000      43 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.3333  0.6667  1.2680  1.6250 20.0000      33 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.2500  0.6667  1.3470  1.6670 30.0000      31 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.250   0.500   1.049   1.250  13.330      15 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1667  0.5000  1.0610  1.0000 36.0000      20
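The summaries above are reported in years, so multiplying by 12 converts them to months. The sketch below does this for the first quartiles, copied from the output above; the short role names are just labels for this example.

```r
# First quartiles of programming experience in years, from the summaries above.
q1_years <- c(FullStack = 0.25, FrontEnd = 0.25, BackEnd = 0.3333,
              DataSci = 0.25, Mobile = 0.25, UX = 0.1667)

# Convert to whole months.
q1_months <- round(q1_years * 12)
sort(q1_months)

# The front-end mean of 0.7917 years, expressed in months.
round(0.7917 * 12, 1)
```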

For all disciplines except back-end development, the results of the t-tests suggest that those interested in data science and engineering have indeed programmed for longer. Front-end developers have programmed the least, unanimously, according to p-values.

##                           Full-Stack Developer Front-End Developer
## Front-End Developer               7.616578e-05                  NA
## Back-End Developer                5.808766e-03        7.899438e-08
## Data Scientist / Engineer         3.202567e-04        1.242137e-09
## Mobile Developer                  9.513062e-01        1.588332e-02
## UX Designer                       8.831383e-01        3.515075e-02
##                           Back-End Developer Data Scientist / Engineer
## Front-End Developer                       NA                        NA
## Back-End Developer                        NA                        NA
## Data Scientist / Engineer         0.45064220                        NA
## Mobile Developer                  0.06479446                0.01348069
## UX Designer                       0.13350983                0.04064788
##                           Mobile Developer
## Front-End Developer                     NA
## Back-End Developer                      NA
## Data Scientist / Engineer               NA
## Mobile Developer                        NA
## UX Designer                      0.9366481

Full-stack developers dedicate the most time to learning each week, with 25% of respondents dedicating 30+ hours weekly. UX designers spend the least amount of time learning per week with a mean of 12 hours per week.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   10.00   15.00   19.94   30.00  100.00     108 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     6.0    12.0    16.7    20.0   100.0      48 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    8.00   15.00   18.77   25.00  100.00      40 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   14.41   20.00   80.00      30 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   12.00   17.76   25.00  100.00      21 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   10.00   12.04   15.00   63.00      19

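The t-tests support UX designers spending the least time learning against every other role. Full-stack developers spending the most is supported against every role except back-end development, where the p-value of 0.07 exceeds the 0.05 alpha level.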

##                           Full-Stack Developer Front-End Developer
## Front-End Developer               2.120848e-10                  NA
## Back-End Developer                7.321617e-02        3.656253e-03
## Data Scientist / Engineer         2.701196e-16        1.656064e-03
## Mobile Developer                  7.364865e-03        2.170544e-01
## UX Designer                       1.038213e-15        5.003092e-06
##                           Back-End Developer Data Scientist / Engineer
## Front-End Developer                       NA                        NA
## Back-End Developer                        NA                        NA
## Data Scientist / Engineer       1.942677e-07                        NA
## Mobile Developer                2.905685e-01              0.0005175278
## UX Designer                     1.021364e-09              0.0331516928
##                           Mobile Developer
## Front-End Developer                     NA
## Back-End Developer                      NA
## Data Scientist / Engineer               NA
## Mobile Developer                        NA
## UX Designer                   1.937195e-06
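Reading a matrix like this by eye is error-prone, so here is a small sketch that checks one column against the 0.05 alpha level. The p-values are the full-stack comparisons copied from the table above; the short role names are just labels for this example.

```r
# Full-stack developers versus each other role, copied from the table above.
p_fullstack <- c(FrontEnd = 2.120848e-10, BackEnd = 7.321617e-02,
                 DataSci  = 2.701196e-16, Mobile  = 7.364865e-03,
                 UX       = 1.038213e-15)

alpha <- 0.05

# TRUE where the difference in mean hours is significant at alpha.
p_fullstack < alpha

# Roles whose difference from full-stack is not significant.
names(p_fullstack)[p_fullstack >= alpha]
```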

Respondents interested in data science and/or engineering clearly have the highest current salaries. I would be shocked if a pairwise t-test suggested otherwise. Their third quartile of $60k per year is $8k higher than the next highest discipline. There isn’t much income differentiation between the remaining job roles of interest, though all are above the 2014 US median income of $28.9k.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   35000   41010   52000  200000    1508 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   35000   37020   48000  200000     806 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   17750   32000   36990   49250  200000     436 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   25000   43600   48420   60000  200000     390 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   33800   36420   46500  155000     286 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   20000   31500   35730   50000   90000     175
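Grouped summaries in this layout can be produced with base R’s by function. This is a small sketch on made-up numbers, assuming a data frame called survey with Income and JobRoleInterest columns; it is not the original code.

```r
# Tiny synthetic sample; the NA values mimic respondents who skipped
# the salary question.
survey <- data.frame(
  JobRoleInterest = rep(c("Data Scientist / Engineer", "UX Designer"),
                        each = 5),
  Income = c(25000, 43600, 48000, 60000, NA,
             20000, 31500, 35000, 50000, NA)
)

# One six-number summary (plus an NA count) per job role of interest.
income_by_role <- with(survey, by(Income, JobRoleInterest, summary))
print(income_by_role)

# Medians alone, ignoring missing salaries, for a quick comparison.
medians <- tapply(survey$Income, survey$JobRoleInterest,
                  median, na.rm = TRUE)
print(medians)
```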

As expected, the pairwise t-tests support that developing data scientists and engineers have the highest current salaries. UX designers having the lowest average might be due to random chance, however.

##                           Full-Stack Developer Front-End Developer
## Front-End Developer               0.0057234012                  NA
## Back-End Developer                0.0345620393        9.869959e-01
## Data Scientist / Engineer         0.0001364637        5.710562e-08
## Mobile Developer                  0.0782325486        8.259973e-01
## UX Designer                       0.0699231222        6.691043e-01
##                           Back-End Developer Data Scientist / Engineer
## Front-End Developer                       NA                        NA
## Back-End Developer                        NA                        NA
## Data Scientist / Engineer       2.781493e-06                        NA
## Mobile Developer                8.502432e-01              7.102127e-05
## UX Designer                     7.002592e-01              1.145311e-04
##                           Mobile Developer
## Front-End Developer                     NA
## Back-End Developer                      NA
## Data Scientist / Engineer               NA
## Mobile Developer                        NA
## UX Designer                      0.8524358

Those interested in data science/engineering expect to earn the most at their next job. Given the aforementioned correlation between current salaries and expected salaries, this is not a surprise. Front-end developers appear to be the least optimistic in terms of next salary, though it is pretty close. Note that expected salaries are higher than current salaries across the board.

## JobRoleInterest: Full-Stack Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   55000   54670   70000  200000     225 
## -------------------------------------------------------- 
## JobRoleInterest: Front-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   30000   50000   48070   60000  200000     118 
## -------------------------------------------------------- 
## JobRoleInterest: Back-End Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   30000   50000   50060   65000  200000      73 
## -------------------------------------------------------- 
## JobRoleInterest: Data Scientist / Engineer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   60000   61110   80000  200000      65 
## -------------------------------------------------------- 
## JobRoleInterest: Mobile Developer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   30000   50000   52740   70000  200000      48 
## -------------------------------------------------------- 
## JobRoleInterest: UX Designer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6000   40000   51000   55100   70000  200000      40

Interesting. At a significance level of 0.05, future front-end developers expect lower next salaries than every role except back-end developers (p ≈ 0.16). Data-focused respondents retain their place atop the salary-related kingdom.

##                           Full-Stack Developer Front-End Developer
## Front-End Developer               8.252229e-11                  NA
## Back-End Developer                4.106259e-04        1.587079e-01
## Data Scientist / Engineer         1.731529e-06        4.490202e-19
## Mobile Developer                  2.377850e-01        6.742147e-03
## UX Designer                       8.283726e-01        6.602333e-04
##                           Back-End Developer Data Scientist / Engineer
## Front-End Developer                       NA                        NA
## Back-End Developer                        NA                        NA
## Data Scientist / Engineer       4.043780e-11                        NA
## Mobile Developer                1.606523e-01              1.595757e-05
## UX Designer                     2.332536e-02              7.428145e-03
##                           Mobile Developer
## Front-End Developer                     NA
## Back-End Developer                      NA
## Data Scientist / Engineer               NA
## Mobile Developer                        NA
## UX Designer                      0.3314679

Bivariate Analysis

How did the feature(s) of interest vary with other features in the dataset?

The data science/engineering subset of the survey is largely similar to the non-data science/engineering subset, except for three correlations involving student debt owed. The skew towards post-secondary studies for the data-focused subset is the likely culprit.

The correlation between current salary and age is stronger than the correlation between expected next salary and age for new coders in general, and I expect that would be true for the data-focused subset as well.

Hours dedicated to learning don’t appear to vary much with gender or continent, with consistent medians of ten hours weekly.

Expected salary for a new coder’s next job varies strongly by continent. Females appear to have a much higher bottom line than males. Those who dedicate more than 40 hours a week to learning might expect higher next salaries, but sample size issues prevent this statement from being definitive.

The majority of new coders for all job roles of interest are male, North American, and have bachelor’s degrees. Age, programming experience, hours dedicated to learning, current salary, and expected next salary all vary depending on job role of interest. One or two of the roles stand out from the pack for each of the five quantitative variables.

What was the strongest relationship you found?

No exceedingly strong relationship exists. All correlations are below 0.4.

Current salary and expected next salary have the strongest relationship in both subsets, with correlations of 0.36 and 0.38.
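For reference, a Pearson correlation between two survey columns with missing answers can be computed along these lines. This is a sketch on invented values; the column names Income and ExpectedEarning are assumptions about the dataset.

```r
# Invented salaries; NA marks a skipped question.
survey <- data.frame(
  Income          = c(20000, 35000, NA, 50000, 80000, 30000),
  ExpectedEarning = c(40000, 50000, 60000, NA, 90000, 35000)
)

# use = "complete.obs" drops any respondent missing either value.
r <- cor(survey$Income, survey$ExpectedEarning,
         use = "complete.obs", method = "pearson")
print(r)

# The same result, computed after filtering the rows explicitly.
complete <- survey[complete.cases(survey), ]
r_check <- cor(complete$Income, complete$ExpectedEarning)
```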

Of the features you investigated, were there any unusual distributions?

Europe’s 75th percentile for expected next salary is North America’s 25th percentile ($50k USD). Perhaps some European respondents forgot to convert from pounds or euros to US dollars.

Multivariate Plots

Let’s dig deeper into the two salary variables: current salary and expected next salary. Again, the latter salary is for the first new job where the respondent, presumably, will advertise their new coding skills.

For both males vs. females and ethnic majorities vs. minorities, three faceted scatter plots, in succession, follow:

Current salary vs. age
Expected next salary vs. age
Expected next salary vs. current salary

Since respondents who are 65+ years old are outliers, I removed them to tighten up the linear model.
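The effect of that filter can be sketched as follows. The numbers are invented to exaggerate the point, and the names survey, Age, and Income are assumptions; this is not the original code.

```r
# Invented data: salary rises with age, except for two 65+ respondents.
survey <- data.frame(
  Age    = c(22, 27, 31, 40, 52, 68, 71),
  Income = c(25000, 36000, 42000, 55000, 60000, 20000, 15000)
)

# Keep only respondents younger than 65.
under_65 <- subset(survey, Age < 65)

# Linear model of current salary on age for the filtered data.
fit <- lm(Income ~ Age, data = under_65)
print(coef(fit))

# Compare the correlation before and after removing the outliers.
r_all      <- cor(survey$Age, survey$Income)
r_under_65 <- cor(under_65$Age, under_65$Income)
print(c(all = r_all, under_65 = r_under_65))
```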

Gender

Female new coders have a higher median current salary ($38k vs. $36k) than males. They are also slightly older (28 vs. 27 years old). Pearson’s r correlations, used here on the assumption that income and age are approximately normally distributed, tell us that male salaries tend to increase with age more than female salaries do in this new coder dataset.

Even after accounting for the abundance of male data points (79% of survey respondents are male), male respondents are clearly overrepresented among $150k+ current salaries. The split, below, is 89/11.

## 
##     male   female 
## 0.893617 0.106383
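A proportion table like this one can be built with table and prop.table. Below is a sketch on a handful of invented rows; the column names and the inclusive $150k threshold are assumptions.

```r
# Invented respondents.
survey <- data.frame(
  Gender = c("male", "male", "female", "male", "male",
             "female", "male", "male", "male", "male"),
  Income = c(160000, 155000, 150000, 40000, 30000,
             38000, 200000, 170000, 36000, 180000)
)

# Isolate the elite salaries, then convert counts to proportions.
high_earners <- subset(survey, Income >= 150000)
high_split   <- prop.table(table(high_earners$Gender))
print(high_split)

# Baseline gender split across all respondents, for comparison.
overall_split <- prop.table(table(survey$Gender))
print(overall_split)
```

Comparing the filtered split against the baseline is what shows whether a group is overrepresented, since a raw 89/11 split means little without the 79/21 starting point.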

As with current salary, females have a higher median expected next salary. Correlations are similar across genders this time, however. Both are low, indicating that there isn’t much of a relationship between expected salary and age. Young new coders, both male and female, expect salaries similar to those of older new coders.

Male new coders do again have an above average proportion of expected next salaries above $150k. The 82/18 split isn’t as extreme as the previous 89/11 split, however.

## 
##      male    female 
## 0.8181818 0.1818182

Plotting the two salary-related variables against each other, we are left with the impression that the gender wage gap does not exist in this dataset. Females have both a higher median current salary and expected next salary. Male new coders do dominate the elite ($150k+) salary lines but also notice the cluster of blue circles and the absence of red circles near the origin. Pearson’s r correlations tell us that a higher percentage of a female new coder’s expected salary can be explained by her current salary. Both correlations are still relatively high.

Ethnicity

Plotting current salary vs. age for ethnicity tells a much different story than the plot for gender. The correlations for males and females were noticeably different, whereas the correlations for ethnic majorities and minorities are both near 0.25. For new coders of all races, the degree to which age contributes to current salary is similar. Median ages are both 27, while the 50th percentile minority actually has a higher current salary than their majority equivalent by $4k.

As with gender, there is an abundance of data points for the majority demographic (76/24 is the survey’s ethnic majority/minority split). The split is a bit more extreme as we isolate those with current salaries above $150k.

## 
##        No       Yes 
## 0.8105263 0.1894737

Again, the correlations with age are lower for expected next salary compared to current salary. There is a bit of a gap between groups this time, indicating that minorities in today’s workplace might expect their salary to increase at a slower rate as they age. Ethnic minorities definitively have a higher median by $10k. They appear optimistic about the changing diversity landscape in the workplace.

Unlike gender, the proportion of ethnic majorities vs. minorities remains constant near 76/24 when we isolate those with expected next salaries above $150k.

## 
##        No       Yes 
## 0.7692308 0.2307692

We are again left with the impression that the wage gap, this time the racial one, does not exist in this dataset. Minorities have both a higher median current salary and expected next salary. The majority demographic owns only a slightly higher than average percentage of the $150k+ salaries. The minority’s correlation below is higher as well, suggesting that these new coders expect to convert their current salary to their next salary at a higher rate.

Job Roles of Interest

Let’s combine all of the job roles of interest boxplots (the blue ones) into one radar chart. The mean for each numerical variable normalized between 0 and 1 is plotted.
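One way to read “normalized between 0 and 1” is min-max scaling of the per-role means, which is an assumption on my part rather than something the text spells out. The sketch below applies it to the current and expected salary means reported in the summaries above; the helper name normalize is hypothetical.

```r
# Per-role means taken from the salary summaries earlier in this write-up.
role_means <- data.frame(
  role = c("Full-Stack Developer", "Front-End Developer",
           "Back-End Developer", "Data Scientist / Engineer",
           "Mobile Developer", "UX Designer"),
  current_salary  = c(41010, 37020, 36990, 48420, 36420, 35730),
  expected_salary = c(54670, 48070, 50060, 61110, 52740, 55100)
)

# Hypothetical helper: rescale a vector so its min is 0 and its max is 1.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply the rescaling to every numeric column, keeping the role labels.
normalized <- data.frame(
  role = role_means$role,
  lapply(role_means[-1], normalize)
)
print(normalized)
```

Under this scaling the lowest role on each axis sits at the center of the radar and the highest at the edge, so small absolute differences between roles can look dramatic.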

One thing jumps out immediately: developing data scientists/engineers lead the pack for programming experience, current salary, and expected next salary. Beyond that, however, overplotting is an issue, which makes it difficult to internalize other patterns in the data. Let’s fix that by faceting the plot next.

Ah, that’s better. Full-stack developers have high normalized age and hours dedicated to learning means. They also have middle-of-the-road means for all other variables, which contribute to their notable polygon area. Front-end and mobile developers have the smallest areas, thanks to the lowest programming experience and expected next salary means for the former, and the lowest and second-lowest age and current salary means for the latter.

Perception of strength based on overall area is a common misinterpretation of radar plots. Note that we are strictly using this plot to efficiently compare roles across several numerical variables, and not to determine which role is better if such a determination even exists.

Multivariate Analysis

Were there any interesting or surprising interactions between features?

The wage gaps, both gender and racial, do not present themselves in this new coder dataset via current and expected next salary medians. Maybe new coders aren’t an accurate representation of the working population in general. Data suggests that both wage gaps still exist in 2016.

It is also interesting that the female correlation between income and age (0.192) is lower than the male one (0.267), while the ethnic minority correlation (0.243) is nearly identical to the ethnic majority one (0.253). I wonder why, all else equal, the minority demographic for race performs better salary-wise as they age compared to the minority demographic for gender.

Though this was previously noted in the bivariate section, the relatively stronger current salary vs. age relationship isn’t maintained as new coders across all genders and ethnic representations transition to their next job. Through expected next salary results, younger individuals seem willing to capitalize on lucrative coding-related salaries and older individuals seem willing to take a pay cut.

Final Plots

Plot One

Description One

This segmented bar chart conveys gender representation across job roles of interest.

Overall, the majority of new coder survey respondents are males. The vast majority are either male or female.

Mobile development leads the way with the highest percentage of males at 81%, though full-stack and back-end development are close, at 79% and 78%, respectively.

User experience design is the most diverse discipline, with 52% males and 46% females, and the highest proportion of agender, genderqueer, and trans respondents (2%). Front-end development has a notable percentage of females as well with 35%, which is 14 percentage points higher than the full survey dataset.

Plot Two

Description Two

This faceted radar chart, where the normalized mean (between 0 and 1) for each numerical variable is plotted for each job role of interest, clarifies the differences between disciplines. A common misinterpretation of radar plots is the perception of strength based on overall area. This plot should strictly be used to efficiently compare roles across numerical variables, not to determine which role is better.

Developing data scientists/engineers make the most money, expect the most money for their next job, and have the most programming experience. They have the largest amount of area within their polygon.

Full-stack developers are relatively older and dedicate the most time to learning weekly. They also have a large polygon area.

Front-end developers are most green in terms of programming experience and have the lowest salary expectations for their first job where they advertise their new web development skills. They also have relatively low current salaries. These three factors contribute to the smallest polygon area.

Mobile developers are the youngest and currently do not make much money. These characteristics are expected of the discipline with the highest proportion of respondents with no, some, or only a high school education. They have the second smallest polygon area.

Plot Three

Description Three

This expected next salary vs. current salary scatter plot, faceted by gender, has a best-fit line labeled with Pearson’s r correlation, as well as dashed lines representing the median for each axis.

The dashed median lines inform us that the gender wage gap does not exist in this dataset. Females have a $2k lead in current salary and a $9k lead in expected salary for their next job, post-coding skills acquisition.

Though male new coders do have the highest proportion of elite ($150k+) current salaries and expected next salaries, they also have a notable cluster near the origin that females do not.

The correlation between expected and current salary is stronger for female new coders. This gap suggests that females expect to convert their current salary to a similar salary for their next job at a higher rate than males. Both are strong relative to the rest of the dataset, however, as the correlation between these two salary variables is the strongest in the “2016 New Coder Survey” dataset.

Summary

Developing data scientists and engineers are slightly different from new coders in general.

They have a higher proportion of females.
They have programmed for longer.
They want to work for developed companies, rather than freelance or create their own.
They have a longer job search time horizon.
They use Coursera, edX, and Udacity more frequently.
They use bootcamps less frequently.
They have completed higher levels of education.
They come from a wider subject area background.
Fewer are currently working.
Fewer work in the tech industry.
They have higher current salaries and expected next salaries.
They have more student debt.

The two subsets do share plenty of common trends. Most are willing to relocate. Most don’t use podcasts or attend events yet. Similar proportions are ethnic minorities in their country.

Older new coders are willing to take a pay cut when transitioning to a job where they advertise their new coding skills. Younger new coders intend to increase their earning potential by capitalizing on demand for coding.

Weekly hours dedicated to learning don’t differ much across genders and citizenships by continent. Next expected salary does, however. Most people aren’t replacing the traditional college/university route with full-time online education…yet. Those who are seem to expect higher salaries, though we can’t be sure because of sample size issues.

Gender and continent distributions across job roles of interest vary. Females appear drawn to user experience design. Asians, South Americans, and Africans appear drawn to mobile development. School degree obtained does not vary much by discipline overall, though data science/engineering and mobile development stick out as the most and least seasoned in terms of education, respectively.

Developing data scientists/engineers have the highest current salaries, expect the highest next salaries, and have the most programming experience. Front-end developers are the oldest, but not significantly. Full-stack developers dedicate the most time to learning per week.

Front-end developers are the least experienced coders and expect the lowest next salaries. UX designers spend the fewest hours learning weekly and have the lowest current salaries, but not significantly for the latter. Mobile developers are the youngest.

The gender and racial wage gaps do not present themselves in this dataset. Perhaps new coders aren’t reflective of the working population in general, where data suggests that both wage gaps still exist in 2016.

Reflection

The successes of this exploration are largely due to the detailed design of the Free Code Camp survey.

The main struggle I encountered in this exploration was the lack of a main feature of interest, like the diamond dataset’s price variable. It would be awesome if we could survey the same respondents in a decade or so. We could combine career earnings and career satisfaction with the 2016 survey’s results to build a predictive model to estimate career success.

These are the people who are learning to code. Free, self-paced learning resources are definitely important.
