Exploring Relationships {-}
Attribution: This lab is an adaptation of Chapter 4 of Answering questions with data Lab Manual by Matt Crump and his team.
Log into your Posit/RStudio Cloud account.
Open the Lab 4 Relationships project.
Open the Lab4-Remix-Student-Name.Rmd worksheet.
IMPORTANT!
Remember to rename this file to include your name in place of “Student-Name”. Example: Lab-4-Remix-Sue-Smith.Rmd
Also, put your name in the Author area and change the Date to the current date.
Other than entering your name to replace Student Name and changing the date to the current date, do not change anything in the head space.
Load the libraries! The code is in the report template, so just run it.
CC1
library(ggpubr)
library(tidyverse)
library(moderndive)
library(scales)
library(broom)
This generalization exercise explores the idea that correlations between two measures can arise by chance alone. There are two questions to answer. For each question, you will sample random numbers from a uniform distribution. To make each estimate, you will run a simulation 100 times. The questions are:
Warm-up Part 1. Estimate the range (minimum and maximum number) of correlations (using Pearson’s r) that could occur by chance between two variables with n = 10. Also create a histogram of the possible values.
Warm-up Part 2. Estimate the range (minimum and maximum number) of correlations (using Pearson’s r) that could occur by chance between two variables with n = 100.
Use these tips to answer the question.
Tip 1: You can use the runif() function to sample random numbers between a minimum value and a maximum value. The example below samples 10 (n=10) random numbers in the range from 0 (min=0) to 10 (max=10). Every time you run this code, the 10 values in x are re-sampled, giving 10 new random numbers.
CC2
x <- runif(n=10, min=0, max=10)
Tip 2: You can compute the correlation between two sets of random numbers by first sampling random numbers into each variable and then running the cor() function.
CC3
x <- runif(n=10, min=0, max=10)
y <- runif(n=10, min=0, max=10)
cor(x,y)
Running the above code will give different values for the correlation each time, because the numbers in x and y are always randomly different. We might expect that because x and y are chosen randomly that there should be a 0 correlation. However, what we see is that random sampling can produce “fake” correlations just by chance alone. We want to estimate the range of correlations that chance can produce.
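If you want to get the same "random" result twice (for example, to check your work or to make a knitted report repeatable), one option is to fix the random seed before sampling. This is a minimal sketch and is not part of the lab template; the seed value 42 and the names x1, y1, r1, x2, y2, and r2 are arbitrary choices made for illustration.

```r
# Fix the random seed so the "random" draws are the same on every run.
# The seed value 42 is arbitrary.
set.seed(42)
x1 <- runif(n = 10, min = 0, max = 10)
y1 <- runif(n = 10, min = 0, max = 10)
r1 <- cor(x1, y1)

# Resetting the same seed reproduces exactly the same draws and correlation
set.seed(42)
x2 <- runif(n = 10, min = 0, max = 10)
y2 <- runif(n = 10, min = 0, max = 10)
r2 <- cor(x2, y2)

identical(r1, r2)  # TRUE
```

Leave the seed out while you explore, so that you can see how much the correlation changes from run to run.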
Tip 3: One way to estimate the range of correlations that chance can produce is to repeat the above code many times. For example, if you ran the above code 100 times, you could save the correlations each time, then look at the smallest and largest correlation. This would be an estimate of the range of correlations that can be produced by chance. How can you repeat the above code many times to solve this problem?
We can do this using a for loop.
The code below shows how to repeat everything inside the for loop 100 times. The variable i is an index that goes from 1 to 100. The saved_value variable starts out as an empty variable, and then we put a value into it (at index position i, from 1 to 100).
In this code chunk, we put sum(x+y), the total of all the values in x and y, into the saved_value variable. At the end of the simulation, the saved_value variable contains 100 numbers.
The min() and max() functions are used to find the minimum and maximum values across the 100 simulations.
You should be able to modify this code by replacing sum(x+y) with cor(x,y). Doing this will allow you to run the simulation 100 times and find the minimum and maximum correlations that arise by chance.
This will be your estimate for Warm-up Part 1.
CC4
# Make an empty variable to hold the result of each simulation
saved_value <- c()
# Repeat the sampling 100 times
for (i in 1:100){
  x <- runif(n=10, min=0, max=10)
  y <- runif(n=10, min=0, max=10)
  # Store this run's result at index position i
  saved_value[i] <- sum(x+y)
}
# Find min and max
min(saved_value)
max(saved_value)
# Create histogram
hist(saved_value)
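As an aside, the same repeat-and-save pattern can be written without an explicit index by using base R's replicate() function. The sketch below is one possible alternative, not the method the lab requires; the helper name one_run is hypothetical, and it keeps the same sum(x + y) placeholder as the for loop.

```r
# Hypothetical helper: draw two samples of size n and return one statistic.
# sum(x + y) is the same placeholder used in the for loop.
one_run <- function(n) {
  x <- runif(n = n, min = 0, max = 10)
  y <- runif(n = n, min = 0, max = 10)
  sum(x + y)
}

# replicate() evaluates one_run() 100 times and collects the results in a vector
saved_value <- replicate(100, one_run(n = 10))

length(saved_value)  # 100
range(saved_value)   # minimum and maximum in a single call
```

Because the sample size is an argument of the helper, a different sample size only needs a different value of n in the call.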
To provide an estimate for Warm-up Part 2, you will need to change n=10 to n=100.
Part 2 question: Estimate the range (minimum and maximum number) of correlations (using Pearson’s r) that could occur by chance between two variables with n = 100.
Your answer to Part 2 question here:
Now that you are warmed up, let’s get on with the Remix!
Answer all parts of the three questions below.
For Question 1, we are working again with the World Happiness Report. The data file is in your Data folder in your RStudio workspace for Lab 4.
CC5
# Read in the World Happiness Report data and create a new data frame
whr_data <- read_csv("./data/WHR2018v2.csv")
# Use summary function to find summary statistics for variables in whr_data
summary(whr_data)
For the year 2005 ONLY, find the correlation between Perceptions_of_corruption and Positive_affect. Create a scatter plot to visualize this relationship.
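The general pattern here is to keep only the rows for one year and then correlate two columns. The sketch below shows that pattern on a small invented data frame in base R; the names toy, var_a, var_b, toy_2005, and r_2005 are hypothetical, and none of the values come from the World Happiness Report. It also assumes the real data may contain missing values, which is why use = "complete.obs" appears.

```r
# Toy data frame; the values are invented for illustration only
toy <- data.frame(
  year  = c(2005, 2005, 2005, 2005, 2006, 2006),
  var_a = c(1, 2, 3, 4, 10, 20),
  var_b = c(2, 4, NA, 8, 5, 1)
)

# Keep only the rows for one year
toy_2005 <- subset(toy, year == 2005)

# use = "complete.obs" drops rows where either variable is NA;
# without it, cor() returns NA when any value is missing
r_2005 <- cor(toy_2005$var_a, toy_2005$var_b, use = "complete.obs")
r_2005
```

In the tidyverse, filter() from dplyr does the same job as subset() here.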
1a. What are your conclusions about the relationship between Positive_affect and Perceptions_of_corruption?
Your answer here:
1b. Is this surprising to you?
Your answer here:
Hint: See code chunk 11 in Lab 4 Rehearse 1, part 4. Copy CC11 and paste it in the empty code chunk below and then edit it as required to answer Question 1.
CC6
# Empty code chunk
For Question 2, we are still working with the World Happiness Report:
What has happened to Log_GDP_per_capita (consider this a measure of GDP) in the United States ONLY with time (as the year has increased)?
To do this, find the correlation coefficient r between Log_GDP_per_capita and year and provide a scatter plot with best fit line.
Hint: The code chunk above in Q1 might work for a starting point. Copy it here and then edit it appropriately for the new variables of concern in Q2.
Important!
Put the variable ‘year’ on the x-axis of the plot and ‘Log_GDP_per_capita’ on the y-axis. Do this also for the annotation for displaying the correlation coefficient r.
CC7
# Empty code chunk
Repeating Q2: What has happened to Log_GDP_per_capita (consider this a measure of GDP) in the United States ONLY with time (as the year has increased)?
Your answer to Q2 here:
Q3 is on Regression
This question is based on Rehearse 2 which is about linear regression. There are 5 parts to Q3.
Use this code chunk to load the real data object we created in Rehearse 2.
CC8
# Read in the wrangled version of Dr. Deveaux's real estate data
# from Rehearse 2, which is in the data folder.
real <- read_csv("./data/real.csv")
Q3a: Is there a linear relationship between the variables Price and Land.Value in the houses in the real data object?
Q3b: What is the equation of the best fit line for the linear relationship between Price and Land.Value?
Price is the response variable and Land.Value is the explanatory variable.
To answer parts a and b, edit and run the code below.
CC9
# Use the linear model function, lm(variable1 ~ variable2, data = data object)
# and create a new data object
# Variable1 ~ variable2 can be interpreted as "variable 1 is explained by variable2"
model <- lm(sales ~ youtube, data = marketing)
# Display the linear model
model
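To turn a fitted model into the equation of a line, you need the intercept and the slope. The sketch below shows one way to pull them out with coef(), using invented toy data in which y is exactly 3 + 2 * x; the names toy, toy_model, and b are hypothetical and are not part of the lab data.

```r
# Toy data, invented for illustration: y is exactly 3 + 2 * x
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 3 + 2 * toy$x

# Response on the left of ~, explanatory variable on the right
toy_model <- lm(y ~ x, data = toy)

# coef() returns a named vector: "(Intercept)" and the slope for x
b <- coef(toy_model)
b

# Assemble the best fit line: y = intercept + slope * x
cat("y =", round(b[["(Intercept)"]], 2), "+", round(b[["x"]], 2), "* x\n")
```

The same reading applies to any simple linear model: the "(Intercept)" value is the constant term and the other coefficient multiplies the explanatory variable.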
Your answer to Q3a here:
Your answer to Q3b here:
Q3c: Find the correlation coefficient r and create a scatter plot of Price against Land.Value using code chunk 10 below, which was developed in the Regression Rehearse.
You will need to edit it.
CC10
# Find correlation coefficient r
cor(real$`Living.Area`,
real$`Price`)
# Plot the scatter chart and add the best fit line
ggplot(real, aes(x=`Living.Area`,
y=`Price`))+
geom_point(col="#328da8")+
geom_smooth(method=lm, se=FALSE)+
theme_classic()
Q3c: What is the correlation coefficient r between Price and Land.Value?
Your answer to Q3c here:
Q3d: Is the correlation strong?
Your answer to Q3d here:
Q3e: Do you think that some extreme data points are substantially influencing the slope of the line? If so, they are called influential points.
Your answer to part Q3e here:
In practice, in the “real world,” one would remove influential points and re-run the code to see whether the slope of the line changes substantially.
You do not have to remove them and rerun the code.
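For the curious, here is a minimal sketch of what that check could look like. It uses invented toy data (nine points on a straight line plus one extreme point) and Cook's distance, one common measure of influence; this is an illustration under those assumptions, not a required step, and the names toy, fit_all, fit_trim, d, and worst are hypothetical.

```r
# Toy data, invented for illustration: nine points on the line y = 2 * x
# plus one extreme point at x = 30
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30),
  y = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 10)
)

fit_all <- lm(y ~ x, data = toy)

# Cook's distance measures how much each point changes the fit;
# the largest value flags the most influential row
d <- cooks.distance(fit_all)
worst <- unname(which.max(d))

# Refit without that row and compare the two slopes
fit_trim <- lm(y ~ x, data = toy[-worst, ])
coef(fit_all)[["x"]]   # slope with the extreme point included
coef(fit_trim)[["x"]]  # slope after removing it
```

In this toy example the single extreme point flattens the slope considerably, which is the kind of change the paragraph above describes.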
Answer the following three questions with complete sentences.
Imagine a researcher found a positive correlation between two variables, and reported that the r value was +0.3. One possibility is that there is a true correlation between these two variables.
Discuss one alternative possibility that would also explain the observation of +0.3 value between the variables.
Your answer here:
Explain the difference between a correlation of r = +0.3 and r = +0.7. What does a larger value of r represent?
Your answer here:
Explain the difference between a negative correlation of r = -0.5 and a positive correlation of r = +0.5.
Your answer here:
Important
When you are ready to create your final lab report, save the Lab4-Remix-your-name.Rmd lab file and then Knit it to PDF or Word to make a reproducible file.
Note that if you have difficulty getting the documents to Knit to either Word or PDF, and you cannot fix it, just save the completed worksheet and submit your .Rmd file for partial credit.
Important
Remember to rename your file to include your name, e.g. Lab-4-Remix-Susan-Smith.pdf.
Submit your file in the M4.5 Lab 4 Remix and Report: Exploring Relationships assignment area.
The Lab 4 Exploring Relationships Grading Rubric will be used.
Congrats
You have completed Lab 4 Exploring Relationships
This work was created by Dawn Wright and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
V1.4, Date 4/3/24
Last Compiled 2024-07-12