It is no understatement to say that the impacts of the COVID-19 pandemic have been far-reaching, affecting both the physical and mental health of the global population and profoundly changing the way the world functions, with domestic and international lockdown protocols bringing travel, logistics, and interpersonal social interaction to a grinding halt. Against this backdrop, the survey designed by course coordinator and lecturer Mr. Garth Tarr, with some input from students, sought to capture a snapshot of the lives of students currently enrolled in DATA2002 (“mainstream”) and DATA2902 (“advanced”) at the University of Sydney amidst the unfortunate spread of the Delta strain of COVID-19 in Sydney.
As mentioned, the people surveyed were students enrolled in either DATA2002 or DATA2902 at the University of Sydney. In effect, the entire population was invited to respond; because participation was voluntary, however, the responses form a self-selected group and should not be considered a random sample.
There are a few potential biases that could have impacted the quality of our data. Since completing the survey was not mandatory, self-selection (non-response) bias is likely the most important factor. In particular, people tend not to respond to questions that paint them in a negative light or that require admitting traits that are less socially desirable. In our case, this might affect any one of our variables, including but not limited to vaccination status, exercise, height, opinions of the subject, and self-rated loneliness or stress levels. Overconfidence (Tversky & Kahneman, 1974) is another factor to consider: more than 50% of people believe they are better than average, which might distort self-evaluation questions such as those on mathematics and coding ability.
As discussed in class, both the way in which survey questions are asked and the choice of interviewer can influence the results obtained.
The design of some of the questions was quite problematic. Firstly, questions involving self-evaluation on scales from 1 to 10 were ambiguous: some respondents may interpret the scale as linear, others as nonlinear. It is typically considered best practice to use a qualitative scale (e.g. “not at all”, “neutral”, “extremely”) rather than a quantitative one for self-evaluation questions. This issue is more pronounced for questions about one’s psychological wellbeing, such as loneliness or stress, than for questions about one’s perception of their own skills, such as coding or mathematics ability.
Additionally, some questions were left as free-form responses rather than multiple-choice or drop-down. As a result, there was a considerable number of text-based responses that should have been quantitative. Free-form input also produced major inconsistencies in the responses. For example, height saw a mix of centimetres (cm), metres (m), and feet and inches (e.g. 5ft10in, 5’10", etc.). Salary saw a mix of numeric-only answers, numbers with currency symbols (e.g. $80000), commas separating thousands, and a wide variety of pay periods, such as annual, monthly, weekly, and hourly. These issues could easily have been prevented by allowing only numerical input or, better still, by using a drop-down menu or slider.
These design issues both created additional work to clean the dataset before use and caused a loss of data points, since it is unreasonable to expect every response to be cleaned correctly.
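To illustrate how such free-form responses might be salvaged, here is a minimal sketch of a height-standardising helper. The function name and the set of formats handled are hypothetical, covering only the formats mentioned above, and this is not part of the survey pipeline used later in this report.
# hypothetical helper: standardise mixed height formats to centimetres
# (handles e.g. "172", "1.72", "5ft10in", 5'10")
height_to_cm <- function(x) {
  x <- stringr::str_trim(x)
  # feet-and-inches forms, e.g. 5ft10in or 5'10"
  ft_in <- stringr::str_match(x, "^(\\d+)\\s*(?:ft|')\\s*(\\d+)?")
  if (!is.na(ft_in[1, 1])) {
    feet   <- as.numeric(ft_in[1, 2])
    inches <- ifelse(is.na(ft_in[1, 3]), 0, as.numeric(ft_in[1, 3]))
    return(feet * 30.48 + inches * 2.54)
  }
  num <- suppressWarnings(as.numeric(stringr::str_extract(x, "[0-9.]+")))
  if (is.na(num)) return(NA_real_)
  if (num < 3) num * 100 else num  # treat small values as metres
}
sapply(c("172", "1.72", "5ft10in"), height_to_cm)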
In light of the current situation in Sydney, the first question we are interested in concerns the number of COVID-19 tests that any given student has taken over the past two months, which respondents were asked to report.
We would like to see whether or not the distribution of this number, given our survey responses, follows a Poisson distribution. This will be analysed as the first question.
It is not unreasonable to expect that prolonged periods of lockdowns can have severe impacts on one’s mental wellbeing, with increased levels of loneliness being the obvious effect. Loneliness in adults is already one of the fastest growing issues in modern, developed nations (Jeste et al., 2020). Ironically, in the current digital age, more and more people are reporting heightened feelings of loneliness despite being more interconnected than ever before through social media (Pittman & Reich, 2016; Davey, 2016). More worryingly, loneliness is often part of a self-fulfilling negative feedback loop; those who are more lonely are less likely to reach out to others for social contact, exacerbating the issue (Sheldon, 2008; Wu et al., 2020).
We identify two questions in the survey that are interesting in this respect: a self-rating of one’s current feeling of loneliness on a scale from 1 to 10, and how often one’s camera is on during Zoom classes, with the latter being multiple choice (“None of the time”, “Some of the time”, “Most of the time”, “All the time”) rather than free response. Given the current research, it would be interesting to see whether or not DATA2x02 students who self-identify as more lonely are less likely to turn on their cameras during Zoom class. This is discussed as the second question.
A well-known sign of poor psychological wellbeing is a low sense of self-esteem (Paradise & Kernis, 2002). It has been documented that those who are more confident in their own abilities tend to attempt more difficult tasks, as they are more likely to believe that they will succeed; for the same reason, those who are less confident attempt fewer difficult tasks, typically out of a fear of failure (Krueger & Dickson, 1994; Sekścińska & Rudzinska-Wojciechowska, 2021). This survey asked three questions of interest in this respect: a self-rating of mathematics ability (1 to 10), a self-rating of coding ability (1 to 10), and which unit (DATA2002 or DATA2902) the respondent is taking.
It is easily accepted that the advanced version of the course is more difficult than the mainstream version, mainly due to the increased workload and a focus on learning the material at a deeper level. Thus, it would be interesting to see whether or not those studying the advanced unit are more confident in their abilities than those studying the mainstream unit. This is discussed as the third question.
It has been found that expected earnings after graduation are one of the most important considerations for students when choosing their major at university (Montmarquette et al., 2002). Data science is an area of study that has rapidly gained popularity in the past few years, a trend typically associated with higher potential income for new graduates (Zita, 2021). Given that this is a compulsory unit for a major in data science, presumably many if not most students have their future employment prospects in mind. It is then not unreasonable to assume that a sizable number of students study the major not only for a love of statistics and programming, but also because they could not resist the allure of a high-paying job after graduation (not necessarily a bad thing). There is a directly relevant question in the survey: respondents’ estimates of their expected annual salary upon graduation.
This raises the question: have DATA2x02 students been able to correctly estimate their future expected earnings, or have they overestimated? This is discussed as the fourth question.
For this report, we rely on a range of R packages. We load them below, read in the survey data, and rename the columns to make them easier to work with.
library(tidyverse)
library(ggplot2)
library(expss)
library(readr)
library(stringr)
library(knitr)       # provides kable() for tables
library(data.table)  # provides copy() for duplicating data frames
survey_data <- read.csv("survey_data.csv")
colnames(survey_data) <- c(
"timestamp","covid_tests","living_arrangements","height",
"grammar_test", # question about Wednesday event moved forward 2 days
"is_in_aus","math_ability","coding_ability","opinion_on_data2002",
"uni_year","zoom_camera_frequency","covid_vaccination_status",
"fav_social_media","gender","steak_pref","dominant_hand","stress",
"loneliness","spam","email_signoff","salary_est","which_unit",
"which_major", # allocated to which major, or taken as elective
"exercise_hrs_per_week"
)
Recall that we are interested in modelling the number of COVID-19 tests taken by a student over the past two months. Let’s denote this by the random variable \(Y\). We suspect that this number follows a Poisson distribution, that is, \(Y \sim \text{Pois}(\lambda)\) for some \(\lambda > 0\), because it satisfies the general assumptions of a Poisson distribution (Blitzstein and Hwang, 2015):
- \(Y\) is a count of events (tests) occurring over a fixed interval of time (two months);
- the events can reasonably be assumed to occur independently of one another; and
- the events occur at some roughly constant average rate \(\lambda\).
We begin by loading a copy of the survey data so that we do not affect later questions, and cleaning out missing values. Since the number of missing values is low, we simply drop them. We then form a frequency table using the covid_tests column.
survey_data1 <- copy(survey_data)
kable(matrix(sum(is.na(survey_data1$covid_tests)), nrow=1, ncol=1),
col.names = c("Missing Values"))
| Missing Values |
|---|
| 3 |
freq_tab <- survey_data1 %>%
tidyr::drop_na(covid_tests) %>%
count(covid_tests)
colnames(freq_tab) <- c("covid_tests", "obs_freqs")
kable(freq_tab,
col.names = c("Number of COVID-19 Tests in Past 2 Months",
"Observed Frequency"))
| Number of COVID-19 Tests in Past 2 Months | Observed Frequency |
|---|---|
| 0 | 126 |
| 1 | 40 |
| 2 | 16 |
| 3 | 4 |
| 4 | 5 |
| 5 | 9 |
| 6 | 1 |
| 7 | 1 |
| 8 | 4 |
| 10 | 2 |
The first goal is to obtain a visual comparison of the observed frequency distribution against the expected frequency distribution. To do this, we need a table of expected frequencies.
First, note that the sample size after cleaning is:
samp_size <- sum(freq_tab$obs_freqs)
kable(matrix(samp_size, nrow=1,ncol=1),
col.names = c("Sample Size"))
| Sample Size |
|---|
| 208 |
Then let \(e_i\) represent the expected frequency of \(i\), the number of COVID-19 tests taken in the past two months. Under the Poisson assumption, \(e_i\) is given by: \[ e_i = N \cdot P_\lambda(i) \quad\forall i = 0,1,\dots,10 \] where \(P_\lambda(i) = \frac{e^{-\lambda}\lambda^i}{i!}\) is the probability mass function of the Poisson distribution and \(N = 208\) is the sample size.
In order to use this formula, we need a value for \(\lambda > 0\), which we do not know. Fortunately, from class we know that we can obtain an unbiased estimate for \(\lambda\) using the empirical mean, which we compute using R.
# vector of observed unique numbers of COVID-19 tests: 0,1,...,8 and 10 (no one reported 9)
y_vals <- freq_tab$covid_tests
# vector of observed frequencies
freqs <- freq_tab$obs_freqs
# estimate lambda
est_lam <- sum( y_vals * freqs ) / samp_size
kable(matrix(est_lam, nrow=1, ncol=1),
col.names = c("Estimated $\\lambda$"))
| Estimated \(\lambda\) |
|---|
| 1.028846 |
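As a quick supplementary check, note that a Poisson distribution has variance equal to its mean, so a sample variance far exceeding the sample mean (overdispersion) already hints at a poor fit. A short sketch computing the sample variance from the frequency table:
# supplementary check: for a Poisson distribution, variance = mean
samp_var <- sum(freqs * (y_vals - est_lam)^2) / (samp_size - 1)
kable(matrix(c(est_lam, samp_var), nrow = 1, ncol = 2),
      col.names = c("Sample Mean", "Sample Variance"))
Here the sample variance (about 3.70) is well above the mean (about 1.03), an early sign of overdispersion relative to the Poisson model.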
Now we have all the information we need to form a table of expected frequencies:
exp_freqs_tab <- data.frame(y_vals,
round(samp_size * dpois(y_vals, lambda = est_lam), 2))
colnames(exp_freqs_tab) <- c("covid_tests", "exp_freqs")
kable(exp_freqs_tab,
col.names = c("Number of COVID-19 Tests in Past 2 Months",
"Expected Frequency (Poisson)"))
| Number of COVID-19 Tests in Past 2 Months | Expected Frequency (Poisson) |
|---|---|
| 0 | 74.34 |
| 1 | 76.49 |
| 2 | 39.35 |
| 3 | 13.49 |
| 4 | 3.47 |
| 5 | 0.71 |
| 6 | 0.12 |
| 7 | 0.02 |
| 8 | 0.00 |
| 10 | 0.00 |
We can finally visually compare the observed and expected frequency (assuming Poisson) of the number of COVID-19 tests taken by a student over the past 2 months:
par(mfrow=c(1,2))
barplot(t(as.matrix(freqs)),
names.arg = y_vals,
main = "Observed Frequency",
col="grey")
barplot(t(as.matrix(exp_freqs_tab$exp_freqs)),
names.arg = y_vals,
main = "Expected Frequency (Poisson)",
col = "grey")
The fit of the Poisson model does not seem to be good at all. Let’s now perform a hypothesis test to quantify our intuition.
Let \(Y_i\) denote the number of times that respondent \(i\) took a COVID-19 test over the past two months, where \(i = 1,\dots,208\). Further suppose \(\lambda > 0\) is some constant. We form the following hypotheses: \[ H_0: Y_i \sim \text{Pois}(\lambda) \text{ i.i.d.} \quad\text{vs.}\quad H_1: Y_i \text{ does not follow a Poisson distribution.} \]
We will perform a \(\chi^2\)-test for goodness of fit to test our hypotheses.
We need to assume that each expected cell count \(e_i\) is sufficiently large. From lectures, we use \(e_i \geq 5\) as the threshold. From the expected frequency table we saw earlier, we can clearly see that under the current categorisation we do not have \(e_i \geq 5\) for all \(i\). Thus, we need to recategorise:
# new categories
y_vals <- c("0", "1", "2", ">=3")
# recategorise sample frequencies
freqs <- matrix(c(freqs[1:3], samp_size - sum(freqs[1:3])),
nrow = 1,
ncol = 4)
colnames(freqs) <- y_vals
rownames(freqs) <- c("Observed Frequencies")
kable(freqs)
| | 0 | 1 | 2 | >=3 |
|---|---|---|---|---|
| Observed Frequencies | 126 | 40 | 16 | 26 |
# recategorise expected frequencies
exp_freqs <- matrix(c(exp_freqs_tab$exp_freqs[1:3], samp_size - sum(exp_freqs_tab$exp_freqs[1:3])),
nrow = 1,
ncol = 4)
colnames(exp_freqs) <- y_vals
rownames(exp_freqs) <- c("Expected Frequency (Poisson)")
kable(exp_freqs)
| | 0 | 1 | 2 | >=3 |
|---|---|---|---|---|
| Expected Frequency (Poisson) | 74.34 | 76.49 | 39.35 | 17.82 |
The assumption of sufficiently large expected counts is now satisfied, and we refer to these updated observed and expected cell counts as \(n_i\) and \(e_i\) respectively for \(i = 1,2,3,4\).
From class, we know that for a \(\chi^2\) test for goodness of fit, we use the test statistic \(T\) defined by the sum of the squared differences between the observed and expected frequencies, normalised by the expected frequencies: \[ T = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} \] where \(k = 4\) is the number of categories, \(n_1,\,\dots,\,n_4\) are our observed frequencies, and \(e_1,\,\dots,\,e_4\) are our expected frequencies as specified earlier.
We then compute the test statistic on our sample, which we denote using \(t_0\).
# compute the observed test statistic
t0 <- sum(
(freqs - exp_freqs)^2 / exp_freqs
)
kable(matrix(t0, nrow = 1, ncol = 1),
col.names = c("Observed Test Statistic ($t_0$)"))
| Observed Test Statistic (\(t_0\)) |
|---|
| 70.91771 |
For a \(\chi^2\)-test for goodness of fit, the p-value is defined as the probability of obtaining a test statistic \(T\) as or more extreme than \(t_0\), that is, \[ \text{p-value} = P(T \geq t_0) \]
As the name of the \(\chi^2\)-test suggests, it can be shown that under the null hypothesis the test statistic \(T\) approximately follows a \(\chi^2_{k - q - 1}\) distribution with \(k - q - 1\) degrees of freedom, where \(k = 4\) is the number of categories (i.e. the length of y_vals) and \(q = 1\) is the number of estimated parameters (just \(\lambda\)). Hence, we arrive at the following p-value:
p_val <- 1 - pchisq(t0, df = length(y_vals) - 2)
kable(matrix(p_val, nrow=1, ncol=1),
col.names = c("p-Value"),
digits = 32)
| p-Value |
|---|
| 4.440892e-16 |
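As an aside, the reported value 4.440892e-16 is exactly twice the machine epsilon, which suggests that the subtraction in 1 - pchisq(t0, df) has saturated in floating point. Computing the upper tail directly is numerically safer; a minimal sketch:
# numerically stable upper-tail probability (avoids the 1 - p subtraction)
pchisq(t0, df = length(y_vals) - 2, lower.tail = FALSE)
This yields a value of the same vanishingly small order, so the conclusion is unaffected.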
Our p-value is extremely small. Hence, at the \(\alpha = 0.05\) level of significance, we have very strong evidence to reject the null hypothesis. In other words, our observed data strongly suggest that the number of COVID-19 tests taken by any given DATA2x02 student in the past 2 months does not follow a Poisson distribution. This is consistent with the poor visual agreement between the barplots of the observed and expected Poisson frequencies.
Recall that we would like to investigate whether or not those who think of themselves as lonely are less likely to have their cameras on during Zoom classes. Since we are testing for association between two categorical variables, we require a test for independence.
As always, we begin by loading in a copy of the data and performing some data cleaning. Again, since the number of missing values in the loneliness column is low, and there were no missing values in zoom_camera_frequency, we can simply remove any entries with missing values.
# create copy for this question
survey_data2 <- copy(survey_data)
missing_vals_q2 <- c(sum(is.na(survey_data2$loneliness)),
sum(is.na(survey_data2$zoom_camera_frequency)))
kable(matrix(missing_vals_q2, nrow=1, ncol=2),
col.names = c("Self-Reported Loneliness", "How Often is Zoom Camera On?"),
caption = "Total Missing Values")
| Self-Reported Loneliness | How Often is Zoom Camera On? |
|---|---|
| 2 | 0 |
survey_data2 <- tidyr::drop_na(survey_data2, loneliness)
As explained, since self-evaluated loneliness is better suited to a qualitative scale, we re-group the column into levels according to the following rules:
- 1-3: "Not Lonely"
- 4-6: "Somewhat Lonely"
- 7-10: "Quite Lonely"
so we can treat it as a categorical variable.
survey_data2 <- survey_data2 %>%
mutate(
loneliness_grouped = loneliness,
loneliness_grouped = case_when(
(loneliness_grouped <= 3) ~ "Not Lonely",
(loneliness_grouped > 3 & loneliness_grouped <= 6) ~ "Somewhat Lonely",
loneliness_grouped > 6 ~ "Quite Lonely"
),
loneliness_grouped = factor(loneliness_grouped,
levels = c("Not Lonely",
"Somewhat Lonely",
"Quite Lonely"))
)
kable(count(survey_data2, loneliness_grouped),
col.names = c("Please Rate Your Current Feeling of Loneliness", "Frequency"))
| Please Rate Your Current Feeling of Loneliness | Frequency |
|---|---|
| Not Lonely | 53 |
| Somewhat Lonely | 72 |
| Quite Lonely | 84 |
Note that we also need to reorder the levels of zoom_camera_frequency into their natural order (R sorts factor levels alphabetically by default) before forming a contingency table of observed counts.
survey_data2 <- survey_data2 %>%
mutate(
zoom_camera_frequency = factor(zoom_camera_frequency,
levels = c("None of the time",
"Some of the time",
"Most of the time",
"All the time"))
)
# add some more legible variable names
survey_data2 <- apply_labels(survey_data2,
zoom_camera_frequency = "How Often is Your Zoom Camera On?",
loneliness_grouped = "Current Loneliness")
# construct contingency table
lonely_vs_zoom_tab <- table(survey_data2$loneliness_grouped,
survey_data2$zoom_camera_frequency)
kable(lonely_vs_zoom_tab,
caption = "Table 1: Self-Reported Loneliness vs. How Often Do You Turn on Camera in Zoom")
| | None of the time | Some of the time | Most of the time | All the time |
|---|---|---|---|---|
| Not Lonely | 13 | 29 | 8 | 3 |
| Somewhat Lonely | 19 | 36 | 12 | 5 |
| Quite Lonely | 26 | 41 | 15 | 2 |
As usual, we first obtain a visual depiction of the relationship between self-perceived loneliness and willingness to turn on Zoom cameras.
survey_data2 %>%
ggplot() +
aes(x = zoom_camera_frequency,
fill = loneliness_grouped) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Reds") +
xlab("How Often is Your Camera on During Zoom Class") +
ylab("Proportion") +
labs(fill = "Current Level of Loneliness") +
theme()
Indeed, it seems that the proportion of people who describe themselves as quite lonely and turn on the cameras all the time is very low in comparison to the proportion of people who describe themselves as quite lonely and do not turn their cameras on all the time, which is consistent with our suspicions. We will now perform a \(\chi^2\)-test for independence to verify our intuition.
We form the following hypotheses: \[ H_0: \text{loneliness and Zoom camera frequency are independent} \quad\text{vs.}\quad H_1: \text{they are not independent.} \]
First we calculate the sample size for reference.
samp_size_q2 <- length(survey_data2$loneliness_grouped)
kable(matrix(samp_size_q2, nrow = 1, ncol = 1),
col.names = c("Sample Size"))
| Sample Size |
|---|
| 209 |
To perform a \(\chi^2\)-test for independence, we need to assume that each response is independent of the others (for example, that no two respondents live in the same household). While this assumption may not hold strictly, any such dependence between respondents is likely to be minor, so it is reasonable to proceed.
We also need to assume that the expected cell counts under the null hypothesis are sufficiently large, i.e. \(e_{ij} \geq 5\), where \(e_{ij}\) denotes the expected frequency count in row \(i\), column \(j\). To check this assumption, we need to form the table of expected counts. From class, we know that \(e_{ij}\) can be estimated using: \[ \hat{e}_{ij} = \frac{y_{i\bullet}y_{\bullet j}}{n} \] where \(n = 209\) denotes the sample size, \(y_{i\bullet}\) denotes the total of row \(i\), and \(y_{\bullet j}\) denotes the total of column \(j\) in Table 1.
Now we can construct the table of expected counts.
# compute row and column totals
row_totals_q2 <- apply(lonely_vs_zoom_tab, 1, sum)
col_totals_q2 <- apply(lonely_vs_zoom_tab, 2, sum)
# form the table of estimated expected counts:
# e_ij = (row total i) * (column total j) / n
# (outer() keeps the row and column names from the totals)
exp_lonely_vs_zoom_tab <- outer(row_totals_q2, col_totals_q2) / samp_size_q2
kable(exp_lonely_vs_zoom_tab,
      caption = "Table 2: Expected Zoom Camera Frequency vs. Self-Reported Loneliness")
| | None of the time | Some of the time | Most of the time | All the time |
|---|---|---|---|---|
| Not Lonely | 14.70813 | 26.880383 | 8.875598 | 2.535885 |
| Somewhat Lonely | 19.980861 | 36.51675 | 12.057416 | 3.444976 |
| Quite Lonely | 23.311005 | 42.602871 | 14.06699 | 4.019139 |
Notice that three cells (those in the "All the time" column) have expected counts below 5, so the large-sample assumption is not satisfied. Hence, it is best to perform a Monte Carlo simulation of the distribution of the test statistic \(T\) in order to estimate the p-value.
We will be using the test statistic: \[ T = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(y_{ij} - e_{ij})^2}{e_{ij}} \] where \(y_{ij}\) denotes the observed count in row \(i\) and column \(j\), \(e_{ij}\) denotes the corresponding expected count under the null hypothesis, \(r = 3\) is the number of rows (Not Lonely, Somewhat Lonely, Quite Lonely), and \(c = 4\) is the number of columns (None of the Time, Some of the Time, Most of the Time, All the Time).
Because we will be simulating the p-value, we do not need to discuss the distribution of \(T\). We perform the test using the in-built R function.
# for reproducibility
set.seed(2902)
test_q2 <- chisq.test(lonely_vs_zoom_tab,
simulate.p.value = TRUE,
B = 10000)
kable(matrix(c(test_q2$statistic, test_q2$p.value), nrow = 1, ncol = 2, byrow = TRUE),
col.names = c("Observed Test Statistic", "p-Value"))
| Observed Test Statistic | p-Value |
|---|---|
| 2.741243 | 0.8517148 |
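For transparency, the resampling scheme behind simulate.p.value can be sketched by hand. Conditioning on the observed row and column totals, random tables with those margins are drawn using r2dtable(), the test statistic is recomputed on each, and the p-value is the proportion of simulated statistics at least as large as the observed one. A minimal sketch under the assumption (as in chisq.test) that the margins are held fixed:
# manual Monte Carlo sketch of the simulated p-value (margins held fixed)
set.seed(2902)
B <- 10000
sim_tables <- r2dtable(B,
                       rowSums(lonely_vs_zoom_tab),
                       colSums(lonely_vs_zoom_tab))
sim_stats <- sapply(sim_tables, function(tab) {
  e <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - e)^2 / e)
})
# proportion of simulated statistics at least as extreme as the observed one
(1 + sum(sim_stats >= test_q2$statistic)) / (B + 1)
This should land close to the p-value reported above.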
We obtained a p-value of 0.8517148, so we do not reject the null hypothesis. That is, we do not have sufficient evidence to conclude that a person’s self-reported loneliness is associated with how often they turn on their camera during Zoom class.
While disappointing, this outcome is not unexpected. The assumption that how often a person turns on their camera during Zoom classes is indicative of being less willing to reach out to others is plausible, but certainly not concrete. Indeed, it is very possible that not turning on cameras could be attributed to a multitude of other reasons, such as lack of an appropriate space, slow internet speeds, or lack of access to webcam technology.
Moreover, the quality of our data is also impacted by the fact that there were only a total of 10 students who responded that they turn their cameras on all the time.
Other limitations include self-response bias, in particular the well-known phenomenon that respondents, when quantifying their feelings or emotions, will tend to choose middle-of-the-road values rather than the extremes. This can be seen in the concentration of responses in the middle ranges (3-7 and “Some of the Time”):
survey_data2 %>%
ggplot() +
aes(x = zoom_camera_frequency,
y = loneliness,
col = zoom_camera_frequency) +
geom_jitter(width=0.15, height=0.3) +
xlab("") +
ylab("Current Level of Loneliness") +
labs(col="How Often is Zoom Camera On?")
Finally, although our choice in how we grouped loneliness from a scale from 1 to 10 into Not Lonely (1-3), Somewhat lonely (4-6), and Quite Lonely (7-10) seems rather arbitrary, it does not have a noticeable impact on the p-value. We recomputed the p-values after varying the cut-off points.
tests_q2 <- numeric(5 * 4)  # stores a p-value for each of the 20 cut-off combinations
counter <- 1
for (i in 1:5) {
for (j in 6:9) {
survey_data2 <- survey_data2 %>%
mutate(
loneliness_grouped = loneliness,
loneliness_grouped = case_when(
(loneliness_grouped <= i) ~ "Not Lonely",
(loneliness_grouped > i & loneliness_grouped <= j) ~ "Somewhat Lonely",
loneliness_grouped > j ~ "Quite Lonely"
),
zoom_camera_frequency = factor(zoom_camera_frequency,
levels = c("None of the time",
"Some of the time",
"Most of the time",
"All the time")),
loneliness_grouped = factor(loneliness_grouped,
levels = c("Not Lonely",
"Somewhat Lonely",
"Quite Lonely"))
)
zoom_vs_lonely_tab_iter <- table(
survey_data2$loneliness_grouped,
survey_data2$zoom_camera_frequency
)
tests_q2[counter] <- chisq.test(zoom_vs_lonely_tab_iter,
simulate.p.value = TRUE, B = 10000)$p.value
counter = counter + 1
}
}
summary(tests_q2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05819 0.19633 0.51290 0.45486 0.60996 0.85441
Indeed, no choice of cut-off points produced a statistically significant association: even the smallest p-value (0.058) lies outside the rejection region.
Recall that we are interested in whether or not the students studying the advanced unit are more confident in their own abilities than those studying the mainstream unit.
Again, we form a copy of the survey data and check for missing values. Since there are very few missing values, we simply remove them.
survey_data3 <- copy(survey_data)
missing_vals_q3 <- c(sum(is.na(survey_data3$math_ability)),
sum(is.na(survey_data3$coding_ability)),
sum(is.na(survey_data3$which_unit)))
kable(matrix(missing_vals_q3, nrow=1, ncol=3, byrow=TRUE),
col.names = c("Self-Reported Maths Ability", "Self-Reported Coding Ability", "Which Unit Are You Taking?"),
caption = "Total Missing Values")
| Self-Reported Maths Ability | Self-Reported Coding Ability | Which Unit Are You Taking? |
|---|---|---|
| 1 | 1 | 0 |
survey_data3 <- survey_data3 %>%
tidyr::drop_na(math_ability, coding_ability)
Now note that for this survey, respondents were asked to evaluate their own capabilities in two respects: mathematics and coding. Since both skills are critical to this subject, it seems reasonable to take a simple average of these scores to form a combined “general capability” score and add it to survey_data3: \[
\text{general capability} = 0.5 \times\text{mathematics capability} + 0.5\times\text{coding capability}
\]
survey_data3 <- survey_data3 %>%
mutate(
general_ability = 0.5 * math_ability + 0.5 * coding_ability
)
As always, our first step is to visualise the two samples. Consider the frequency histograms of the self-reported general capabilities of the DATA2902 and DATA2002 cohorts:
par(mfrow=c(1, 2))
general_ability_adv <- survey_data3 %>%
filter(which_unit == "DATA2902 (Advanced)") %>%
select(general_ability)
general_ability_main <- survey_data3 %>%
filter(which_unit == "DATA2002") %>%
select(general_ability)
hist(general_ability_adv$general_ability,
xlab = "Self-Reported General Capability",
main = "DATA2902 (Adv)")
hist(general_ability_main$general_ability,
xlab = "Self-Reported General Capability",
main = "DATA2002 (Mainstream)")
We can immediately see that the self-reported general capability of the DATA2902 cohort is negatively skewed. This is within expectations, as we theorised that DATA2902 is made up of students who rate themselves as more competent. Let’s perform a two-sample test for means to confirm our suspicions in a statistically robust way.
Let \(\mu_1\) denote the mean self-reported general capability score of students studying DATA2902, and \(\mu_2\) denote the mean self-reported general capability score of students studying DATA2002. We form the hypotheses: \[ H_0: \mu_1 = \mu_2 \quad\text{vs.}\quad H_1: \mu_1 > \mu_2. \]
First, we need to assume that our two samples (DATA2902 students and DATA2002 students) are independent. This is reasonable, since a student can be enrolled in either DATA2902 or DATA2002, but not both.
We have several options for which statistical test to perform. No candidate test has all of its assumptions satisfied here, so we argue that the \(t\)-test is the most appropriate choice: all else being equal, it has the greatest statistical power (it is the most likely to correctly reject a false null hypothesis).
Although the \(t\)-test assumes that the two samples come from normal distributions, and is somewhat sensitive to outliers, the fact that we have large enough sample sizes (36 for DATA2902, and 174 for DATA2002) and a bounded range of values (1 to 10, so ‘true outliers’ are impossible) means that the \(t\)-test remains robust enough.
The sign test is almost never preferable due to its low statistical power. The Wilcoxon signed-rank test drops the normality assumption but still requires symmetry, which our data do not satisfy (and it is designed for paired samples in any case). The Wilcoxon rank-sum test relaxes both normality and symmetry, but assumes the two samples come from the same distribution up to a location shift; this is also questionable here, since two distributions on a bounded 1-10 scale cannot really shift in location while keeping an identical shape. Hence, with no test having all of its assumptions satisfied, we use the two-sample \(t\)-test on the grounds of its comparatively greater statistical power.
Finally, to perform a two sample \(t\)-test, we also need to decide whether or not to assume equal variance. To make this decision we can use sample variances to estimate population variance:
kable(matrix(c(var(general_ability_adv), var(general_ability_main)),
nrow = 1, ncol = 2, byrow = TRUE),
col.names = c("DATA2902 (Advanced)",
"DATA2002"),
caption = "Variance of Self-Reported General Ability")
| DATA2902 (Advanced) | DATA2002 |
|---|---|
| 1.740278 | 2.084886 |
The sample variances are not equal. Moreover, given that the DATA2902 sample is negatively skewed while the DATA2002 sample appears roughly symmetric, the two underlying distributions plausibly differ in shape and spread, so it is safer to assume that the variances are not equal.
We are performing the two sample \(t\)-test assuming unequal variances. Hence, we use the Welch test statistic: \[ T_W = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s^2_x}{n} + \frac{s^2_y}{m}}} \] where \(\bar{x},\bar{y}\) refer to the sample means, \(s^2_x, s^2_y\) refer to the sample variances, and \(n, m\) refer to the sample sizes.
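To connect this formula with the output below, the statistic can also be computed by hand; a quick sketch:
# manual computation of the Welch statistic, mirroring the formula above
x <- general_ability_adv$general_ability
y <- general_ability_main$general_ability
(mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
This should reproduce the observed test statistic reported by t.test() below.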
We use the in-built R function to compute the test statistic and corresponding p-value:
t_test_q3 <- t.test(general_ability_adv$general_ability,
                    general_ability_main$general_ability,
                    alternative = "greater", var.equal = FALSE)
kable(matrix(c(t_test_q3$statistic, t_test_q3$p.value),
nrow = 1, ncol = 2, byrow = TRUE),
col.names = c("Observed Test Statistic", "p-Value"))
| Observed Test Statistic | p-Value |
|---|---|
| 4.307479 | 3.52e-05 |
We obtained an extremely low p-value. Hence, at the \(\alpha = 0.05\) level of significance, we have very strong evidence to reject the null hypothesis. That is, the data strongly suggest that the mean self-reported general capability of DATA2902 students is greater than that of DATA2002 students, i.e. that those taking the advanced unit are more confident in their mathematics and coding abilities than their mainstream peers.
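Given that the normality assumption was only partially satisfied, a natural robustness check is the Wilcoxon rank-sum test discussed earlier; a sketch (a broadly similar conclusion would reinforce the t-test result):
# robustness check: rank-sum test, which does not assume normality
wilcox.test(general_ability_adv$general_ability,
            general_ability_main$general_ability,
            alternative = "greater")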
Recall that we are investigating whether or not DATA2x02 students overestimated their expected salary in Australia upon graduation. For our purposes, we will set the ‘true’ entry-level salary to be \(\mu^* = \$75000\) (Payscale.com, 2021).
Unfortunately, as mentioned, the data for this question is of the poorest quality in the entire survey. Let’s begin cleaning it. As usual, we load a copy of the survey data and, for reference, compute the raw number of observations.
survey_data4 <- copy(survey_data)
total_responses <- length(survey_data$salary_est)
kable(matrix(c(total_responses), nrow = 1, ncol = 1),
col.names = c("Total Responses"))
| Total Responses |
|---|
| 211 |
Fortunately, there are no missing values.
kable(matrix(c(sum(is.na(survey_data4$salary_est))), nrow = 1, ncol = 1),
col.names = c("Missing Values"))
| Missing Values |
|---|
| 0 |
To clean this dataset, we set up some broad rules that should be able to preserve the majority of data points:
survey_data4 <- survey_data4 %>%
mutate(
salary_est_cleaned = salary_est %>%
# remove '$', whitespace, and commas
str_replace_all('\\$|\\s|,', '') %>%
# replace "k" with "000"
str_replace_all('k|K', "000") %>%
# extract the number
parse_number()
)
Next we create a set of rules that both standardise the salary figures (hourly, weekly, monthly) to annual amounts and remove outliers, under the assumption that any estimate equivalent to an annual salary outside the range \([\$40k, \$200k]\) is a very poor estimate. The logic is that the minimum full-time wage in Australia is approximately \(\$40k\), forming the lower bound; and from personal experience (mine and my peers’), entry-level jobs for undergraduates paying more than \(\$150k\) are unheard of in any industry, so \(\$200k\) is a very reasonable hard upper limit. Thus, our rules (implemented in the case_when below) are:
- values at most 105 are treated as hourly rates and annualised (\(\times\) 8 hours \(\times\) 5 days \(\times\) 48 weeks);
- values between 768 and 3847 are treated as weekly pay and multiplied by 52;
- values between 3332 and 16667 are treated as monthly pay and multiplied by 12 (where the weekly and monthly ranges overlap, the weekly rule takes precedence);
- values between 40000 and 200000 are treated as annual salaries and kept as-is; and
- anything else is set to missing and dropped.
Consider the results of our data cleaning:
survey_data4 <- survey_data4 %>%
mutate(
salary_est_cleaned = case_when(
(salary_est_cleaned <= 105) ~ salary_est_cleaned * 8 * 5 * 48,
(salary_est_cleaned >= 768 & salary_est_cleaned <= 3847) ~ salary_est_cleaned * 52,
(salary_est_cleaned >= 3332 & salary_est_cleaned <= 16667) ~ salary_est_cleaned * 12,
(salary_est_cleaned >= 40000 & salary_est_cleaned <= 200000) ~ salary_est_cleaned
)
)
# remove missing values
survey_data4 <- tidyr::drop_na(survey_data4, salary_est_cleaned)
# remove 0
survey_data4 <- survey_data4[survey_data4$salary_est_cleaned != 0, ]
salary_est_cleaned_len <- length(survey_data4$salary_est_cleaned)
kable(matrix(c(total_responses, salary_est_cleaned_len), nrow = 1, ncol = 2),
col.names = c("Total Responses (Raw)", "Total Responses (Cleaned)"))
| Total Responses (Raw) | Total Responses (Cleaned) |
|---|---|
| 211 | 171 |
It seems we lost a total of \(211 - 171 = 40\) responses (about 19%), which is acceptable.
We can now take a look at the binned frequency histogram of the estimates to get an idea of the distribution. Note that the red line represents \(\mu^* = \$75000\) and the blue line represents the mean estimate by DATA2x02 students.
survey_data4 %>% ggplot() +
aes(x = salary_est_cleaned) +
geom_histogram(aes(y = ..density..),
binwidth=10000,
# alpha=0.5,
fill="pink1") +
geom_density(alpha=0.5) +
geom_vline(aes(xintercept = mean(salary_est_cleaned)),
color="blue",
linetype="dashed",
size=1) +
geom_vline(aes(xintercept = 75000),
color="red",
linetype="dashed",
size=1) +
xlab("Salary Estimated by DATA2x02 Students (Cleaned)") +
ylab("Density")
kable(matrix(c(mean(survey_data4$salary_est_cleaned), 75000), nrow = 1, ncol = 2, byrow = TRUE),
col.names = c("Average Salary Estimated by DATA2x02 Students", "'True' Average Salary (Payscale.com)"))
| Average Salary Estimated by DATA2x02 Students | ‘True’ Average Salary (Payscale.com) |
|---|---|
| 77148.28 | 75000 |
The estimate by DATA2x02 students is shockingly accurate!
Let \(\mu\) denote the mean of DATA2x02 students’ estimates of the entry-level data science salary. We form the hypotheses: \[ H_0: \mu = 75000 \quad\text{vs.}\quad H_1: \mu > 75000. \]
In this case it is most appropriate to perform a one sample \(t\)-test for the mean, which requires assuming that the distribution of the estimates made by DATA2x02 students is roughly normal. In the density histogram and curve above, we can see that the distribution of estimates is slightly positively skewed with a heavy right tail. However, this is likely attributable to our data cleaning methodology (recall we only kept values in the range \([\$40k, \$200k]\)), which most likely removed much of the left tail. Moreover, we expect some similarity in the thought processes of the students when estimating, given that our population is DATA2x02 students in Sydney pursuing data science, who are likely to have researched the prospects of a career in data science in Australia. Hence, it is fairly reasonable to treat each student’s estimate as i.i.d. normal.
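A normal quantile-quantile plot offers a quick visual check of this assumption; a minimal sketch:
# visual check of normality: points near the line support the assumption
qqnorm(survey_data4$salary_est_cleaned,
       main = "Normal QQ-Plot of Cleaned Salary Estimates")
qqline(survey_data4$salary_est_cleaned, col = "red")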
We perform the one sample \(t\)-test against \(\mu^* = \$75000\), with the hypotheses specified above, and obtain the following test statistic and p-value:
t_test_q4 <- t.test(survey_data4$salary_est_cleaned,
                    alternative = "greater",
                    mu = 75000)
t0_q4 <- t_test_q4$statistic
p_val_q4 <- t_test_q4$p.value
kable(matrix(c(t0_q4, p_val_q4), nrow = 1, ncol = 2, byrow=TRUE),
col.names = c("Observed Test Statistic", "p-Value"))
| Observed Test Statistic | p-Value |
|---|---|
| 1.121212 | 0.1318895 |
Hence, at the \(\alpha = 0.05\) level of significance, we do not have sufficient evidence to reject the null hypothesis. That is, our data is consistent with the proposition that DATA2x02 students correctly estimated the average entry-level salary for data scientists with an undergraduate degree in Australia. This result is unsurprising, considering the striking accuracy of the mean estimate.
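To complement the test, a two-sided 95% confidence interval for the mean estimate gives a sense of its precision; a sketch (the one-sided test object above stores a one-sided interval instead):
# two-sided 95% confidence interval for the mean salary estimate
t.test(survey_data4$salary_est_cleaned, mu = 75000)$conf.int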
While we take the figure supplied by Payscale.com to be the ‘true’ population average salary, it should be noted that this number is itself derived from sampling done by Payscale.com. However, given the amount of information available to the organisation and its multiple sources of income data (self-reported, employer, job listings, etc.), it is not unreasonable to accept \(\$75000\) as the true population mean for our purposes.
We investigated our survey data using four different angles, which together form an interesting snapshot of the current lives of students enrolled in DATA2002 and DATA2902.
Interestingly, we found that there is sufficient evidence to suggest that the number of COVID-19 tests taken by students over the past 2 months does not follow a Poisson distribution.
We also did not find sufficient evidence of an association between self-reported loneliness and how often students turn their cameras on during Zoom class. Moreover, the number of possible confounding factors identified (for example, cameras kept off for technical reasons or for lack of an appropriate space) means that any such analysis is highly susceptible to confounding effects such as Simpson’s paradox. Thus, this result should be interpreted with caution, and further investigation is required.
Next, we found sufficient evidence to suggest that DATA2902 students rate their capabilities higher than DATA2002 students do. However, this result should also be interpreted with caution. In particular, we did not investigate whether it is sensitive to our method of weighting self-reported maths and coding ability. Further research should try varying the weightings to see whether this affects the final result.
Finally, we found evidence consistent with the proposition that DATA2x02 students correctly estimated their future expected earnings in Australia. This result is not as surprising as it might initially seem. For example, the ‘true’ \(\mu^* = \$75000\) is rather close to the average Australian salary for all full-time workers, making it plausible that a cohort of almost 200 university students is well-informed enough to make a roughly accurate estimate. Additionally, our result may be partly attributable to our data cleaning, which kept only values in an interval bounded by the minimum wage and a hard limit on implausible graduate salaries, possibly packing the data more tightly around the true \(\mu^*\). Thus, further research should attempt to rectify these factors in their methodology.
Blitzstein, J. K., & Hwang, J. (2015). Introduction to Probability. Chapman and Hall/CRC.
Davey, G. C. (2016). Social media, loneliness, and anxiety in young people. Psychology Today, 15.
Dowle, M., & Srinivasan, A. (n.d.). data.table: Extension of 'data.frame'. [online] Available at: https://rdatatable.gitlab.io/data.table/
gdemin. (n.d.). expss: Tables with Labels in R. [online] Available at: https://gdemin.github.io/expss/
Jeste, D. V., Lee, E. E., & Cacioppo, S. (2020). Battling the modern behavioral epidemic of loneliness: suggestions for research and interventions. JAMA Psychiatry, 77(6), 553.
Krueger Jr, N., & Dickson, P. R. (1994). How believing in ourselves increases risk taking: Perceived self-efficacy and opportunity recognition. Decision Sciences, 25(3), 385-400.
Montmarquette, C., Cannings, K., & Mahseredjian, S. (2002). How do young people choose college majors? Economics of Education Review, 21(6), 543-556.
Paradise, A. W., & Kernis, M. H. (2002). Self-esteem and psychological well-being: Implications of fragile self-esteem. Journal of Social and Clinical Psychology, 21(4), 345-361.
Payscale.com. (2021). Average Entry-Level Data Scientist Salary in Australia. [online] Available at: https://www.payscale.com/research/AU/Job=Data_Scientist/Salary/df41c264/Entry-Level
Pittman, M., & Reich, B. (2016). Social media and loneliness: Why an Instagram picture may be worth more than a thousand Twitter words. Computers in Human Behavior, 62, 155-167.
Sekścińska, K., Jaworska, D., & Rudzinska-Wojciechowska, J. (2021). Self-esteem and financial risk-taking. Personality and Individual Differences, 172, 110576.
Sheldon, P. (2008). The relationship between unwillingness-to-communicate and students' Facebook use. Journal of Media Psychology, 20(2), 67-75.
Tidyverse.org. (n.d.). Read Rectangular Text Data. readr. [online] Available at: https://readr.tidyverse.org/
Tidyverse.org. (n.d.). Simple, Consistent Wrappers for Common String Operations. stringr. [online] Available at: https://stringr.tidyverse.org/
Tidyverse.org. (n.d.). Tidyverse. [online] Available at: https://www.tidyverse.org/
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124-1131.
Wickham, H. (n.d.). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wu, A. F. W., Chou, T. L., Catmur, C., & Lau, J. Y. (2020). Loneliness and social disconnectedness in pathological social withdrawal. Personality and Individual Differences, 163, 110092.
Xie, Y. (n.d.). knitr: Elegant, flexible, and fast dynamic report generation with R. [online] Available at: https://yihui.org/knitr/
Zita, C. (2021). Is Data Science Still a Rising Career in 2021. [online] Available at: https://towardsdatascience.com/is-data-science-still-a-rising-career-in-2021-722281f7074c