Exploring Relationships {-}
Attribution: This lab is an adaptation of Chapter 4 of Answering questions with data Lab Manual by Matt Crump and his team.
Log into your Posit Cloud account.
Open the Lab 4 Relationships project.
Open the Lab4-Remix-Student-Name.Rmd worksheet.
IMPORTANT!
Remember to rename this file to include your name in place of “Student-Name”. Example: Lab4-Remix-Sue-Smith.Rmd
Also, put your name in the Author area and change the Date to the current date.
Other than entering your name to replace Student Name and changing the date to the current date, do not change anything in the head space.
Load the libraries! The code is in the report template, so just run it.
CC1
library(ggpubr)
library(tidyverse)
library(moderndive)
library(scales)
library(broom)
This generalization exercise will explore the idea that correlations between two measures can arise by chance alone. There are two questions to answer. For each question, you will be sampling random numbers from a uniform distribution. To make each estimate, you will run a simulation 100 times. The questions are:
Warm-up Part 1. Estimate the range (minimum and maximum number) of correlations (using Pearson’s r) that could occur by chance between two variables with n=10. Also create a histogram of the possible values.
Use these tips to answer the question for Warm-up Part 1.

Tip 1: You can use the runif() function to sample random
numbers between a minimum value and a maximum value. The code in CC2
below samples n = 10 random numbers between 0 (min = 0) and 10
(max = 10) and stores them in a variable named x. Every time you run this
code, the 10 values in x are re-sampled, giving 10 new random
numbers.
CC2
x <- runif(n=10, min=0, max=10)

Tip 2: The code chunk below, CC3, creates random variables x and y
using the runif function twice. We can then compute the
correlation between x and y by running the cor()
function.
CC3
x <- runif(n=10, min=0, max=10)
y <- runif(n=10, min=0, max=10)
cor(x,y)
Running the above code will give different values for the correlation each time, because the numbers in x and y are always randomly different. We might expect that because x and y are chosen randomly that there should be a 0 correlation. However, what we see is that random sampling can produce “fake” correlations just by chance alone.
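If you ever need the same "random" values to come back each time (for example, so that a knitted report matches what you saw in the console), one option is to set a seed before sampling. The sketch below is an illustration and is not part of the lab template; the seed value 123 is an arbitrary choice.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration
x <- runif(n = 10, min = 0, max = 10)
y <- runif(n = 10, min = 0, max = 10)
r_first <- cor(x, y)

set.seed(123)  # same seed, so the same numbers are drawn again
x <- runif(n = 10, min = 0, max = 10)
y <- runif(n = 10, min = 0, max = 10)
r_second <- cor(x, y)

identical(r_first, r_second)  # TRUE: both runs give the same correlation
```

Without the set.seed() lines, the two correlations would almost always differ, which is the behavior this warm-up relies on.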

Tip 3: One way to estimate the range of correlations that chance can produce is to repeat the above code many times. For example, if you ran the above code 100 times, you could save the correlations each time, then look at the smallest and largest correlation. This would be an estimate of the range of correlations that can be produced by chance. How can you repeat the above code many times to solve this problem?
We can do this using a for loop.
The code below in CC4 shows how to repeat everything inside the for
loop 100 times. The variable i is an index that goes from
1 to 100. The saved_value variable starts out as an empty
variable, and then we put a value into it (at index position i, from 1
to 100).
In this code chunk, we put the sum of all the values in
x and y, sum(x+y), into the saved_value variable. At the
end of the simulation, the saved_value variable contains 100
numbers, one per repetition. The min() and max() functions are
then used to find the smallest and largest of those 100
values.
Your job now is to modify the code in CC4 by
replacing sum(x+y) with cor(x,y). Doing this
will allow you to run the simulation 100 times, and find the minimum
correlation and maximum correlation that arises by chance.
CC4
# Make an empty variable
saved_value <- c()
for (i in 1:100){
x <- runif(n=10, min=0, max=10)
y <- runif(n=10, min=0, max=10)
saved_value[i] <- sum(x+y)
}
# Find min and max
min(saved_value)
max(saved_value)
# Create histogram
hist(saved_value)
This will be your estimate for the Warm-up Part 1 question.
Min Correlation = ? Your answer here:
Max Correlation = ? Your answer here:
Warm-up Part 2 question: Estimate the range (minimum and maximum number) of correlations (using Pearson’s r) that could occur by chance between two variables with n = 100.
To provide an estimate for Warm-up Part 2, you will
need to change n=10 to n=100 in both runif() calls in CC4a, replace
sum(x+y) with cor(x,y) as you did in CC4, and rerun
the code.
CC4a
# Make an empty variable
saved_value <- c()
for (i in 1:100){
x <- runif(n=10, min=0, max=10)
y <- runif(n=10, min=0, max=10)
saved_value[i] <- sum(x+y)
}
# Find min and max
min(saved_value)
max(saved_value)
# Create histogram
hist(saved_value)
This will be your estimate for the Warm-up Part 2 question.
Min Correlation = ? Your answer here:
Max Correlation = ? Your answer here:
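Once both warm-up parts are done, the two simulations can be compared side by side. The sketch below is one possible alternative to the for loop, using replicate(); the helper name chance_cor_range and the seed value are hypothetical and are not part of the lab template.

```r
# Hypothetical helper: simulate `reps` chance correlations between two
# uniform random samples of size `n`, and return the smallest and largest.
chance_cor_range <- function(n, reps = 100) {
  r_values <- replicate(
    reps,
    cor(runif(n, min = 0, max = 10), runif(n, min = 0, max = 10))
  )
  range(r_values)  # c(minimum, maximum)
}

set.seed(4)  # arbitrary seed so the sketch is repeatable
range_n10  <- chance_cor_range(n = 10)
range_n100 <- chance_cor_range(n = 100)

range_n10
range_n100
```

Comparing the two results shows how the spread of chance correlations depends on the sample size.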
Now that you are warmed up, let’s get on with the Remix!
Answer all parts of the three questions below.
For Question 1, we are working again with the World Happiness Report. The data file is in your Data folder in your RStudio workspace for Lab 4.
CC5
# Read in the World Happiness Report data and create a new data frame
whr_data <- read_csv("./data/WHR2018v2.csv")
# Use summary function to find summary statistics for variables in whr_data
summary(whr_data)
For the year 2005 ONLY, find the correlation between Perceptions_of_corruption and Positive_affect. Create a scatter plot to visualize this relationship.
1a. What are your conclusions about the relationship between
Positive_affect and
Perceptions_of_corruption?
Your answer here:
1b. Is this surprising to you?
Your answer here:

Hint: See code chunk 11 in Lab 4 Rehearse 1, part 4. Copy CC11 and paste it in the empty code chunk below and then edit it as required to answer Question 1.
CC6
# Empty code chunk
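The two steps in Question 1 (keep a single year, then correlate two columns) can be sketched on a small made-up data frame. The values in toy are invented for illustration only and say nothing about the real data; only the column names are taken from the question. The use = "complete.obs" argument is an assumption that some rows may have missing values (NA), in which case cor() returns NA by default.

```r
# Made-up illustrative data, NOT taken from whr_data
toy <- data.frame(
  year = c(2005, 2005, 2005, 2005, 2006, 2006),
  Perceptions_of_corruption = c(0.9, 0.7, NA, 0.4, 0.8, 0.5),
  Positive_affect = c(0.6, 0.7, 0.8, 0.85, 0.65, 0.75)
)

# Step 1: keep only the rows for one year
toy_2005 <- subset(toy, year == 2005)

# Step 2: correlate the two columns
# With a missing value present, the default call returns NA ...
r_default <- cor(toy_2005$Perceptions_of_corruption,
                 toy_2005$Positive_affect)

# ... while use = "complete.obs" drops incomplete rows first
r_complete <- cor(toy_2005$Perceptions_of_corruption,
                  toy_2005$Positive_affect,
                  use = "complete.obs")
r_default
r_complete
```

With the tidyverse loaded, filter(toy, year == 2005) would do the same job as subset() here.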
For Question 2, we are still working with the World Happiness
Report:
What has happened to Log_GDP_per_capita
(consider this a measure of GDP) in the United States
ONLY with time (as the year has
increased)?
To do this, find the correlation coefficient r between Log_GDP_per_capita and year and provide a scatter plot with best fit line.
Hint: The code chunk above in Q1 might work as a starting point. Copy it here and then edit it appropriately for the new variables of concern in Q2.
Important!
Put the variable ‘year’ on the x-axis of the plot and ‘Log_GDP_per_capita’ on the y-axis. Do this also for the annotation for displaying the correlation coefficient r.
CC7
# Empty code chunk
Repeating Q2: What has happened to Log_GDP_per_capita (consider this a measure of GDP) in the United States ONLY with time (as the year has increased)?
Your answer to Q2 here:
Q3 is on Regression
This question is based on Rehearse 2 which is about linear regression. There are 5 parts to Q3.
Use this code chunk to load the real data object we created in Rehearse 2.
CC8
# Read in the wrangled version of Dr. Deveaux' real estate data
# from Rehearse 2, which is in the data folder.
real <- read_csv("./data/real.csv")
Q3a: Is there a linear relationship between the variables Price and Land.Value in the houses in the real data object?
Q3b: What is the equation of the best fit line for the linear relationship between Price and Land.Value?
Price is the response variable and Land.Value is the explanatory variable.
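To answer parts a and b, edit the code below so that it uses the real data object, with Price as the response variable and Land.Value as the explanatory variable.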
CC9
# Use the linear model function, lm(variable1 ~ variable2, data = data object)
# and create a new data object
# Variable1 ~ variable2 can be interpreted as "variable 1 is explained by variable2"
model <- lm(sales ~ youtube, data = marketing)
# Display the linear model
model
Your answer to Q3a here:
Your answer to Q3b here:
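To see how the output of lm() turns into the equation of a line, here is a minimal sketch using cars, a data set built into R, as a stand-in for the lab data. The names fit, intercept, and slope are chosen for this illustration only.

```r
# `cars` ships with R: stopping distance (dist) versus speed
fit <- lm(dist ~ speed, data = cars)

# coef() returns the intercept and the slope of the best fit line
b <- coef(fit)
intercept <- b[["(Intercept)"]]
slope <- b[["speed"]]

# Write the line as: response = intercept + slope * explanatory
cat(sprintf("dist = %.2f + %.2f * speed\n", intercept, slope))
# prints: dist = -17.58 + 3.93 * speed
```

The same pattern applies to any lm() model: the "(Intercept)" value is where the line crosses the y-axis, and the value listed under the explanatory variable is the slope.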
Q3c: Find the correlation coefficient r and create a scatter plot between Price and Land.Value using code chunk CC10 below, which was developed in the Regression Rehearse.
You will need to edit it.
CC10
# Find correlation coefficient r
cor(real$`Living.Area`,
real$`Price`)
# Plot the scatter chart and add the best fit line
ggplot(real, aes(x=`Living.Area`,
y=`Price`))+
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
geom_point(col="#328da8")+
geom_smooth(method=lm, se=FALSE)+
theme_classic()
Q3c: What is the correlation coefficient r between
Price and Land.Value?
Your answer to Q3c here:
Q3d: Is the correlation strong?
Your answer to Q3d here:
Q3e: Do you think that some extreme data points are
substantially influencing the slope of the line? If so,
they are called influential points.
Your answer to part Q3e here:
In practice, in the “real world”, one would remove influential points and re-run the code to see if the slope of the line appears to change substantially.
You do not have to remove them and rerun the code.
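For the curious, the remove-and-refit check described above could look something like the sketch below. It uses the built-in cars data set as a stand-in for the lab data, and it assumes Cook's distance with a 4 / n cutoff as the way to flag points; that cutoff is a common rule of thumb, not a strict rule and not something this lab requires.

```r
# `cars` ships with R and stands in for the lab data here
fit_all <- lm(dist ~ speed, data = cars)

# Cook's distance: how much each point changes the fitted model
cooks_d <- cooks.distance(fit_all)

# Assumed cutoff (rule of thumb): keep points at or below 4 / n
keep <- cooks_d <= 4 / nrow(cars)

# Refit without the flagged points and compare the two slopes
fit_trimmed <- lm(dist ~ speed, data = cars[keep, ])
slope_all <- coef(fit_all)[["speed"]]
slope_trimmed <- coef(fit_trimmed)[["speed"]]

sum(!keep)      # number of points flagged as influential
slope_all
slope_trimmed
```

If the two slopes are close, the flagged points were not changing the line much; a large difference suggests they were.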
Answer the following three questions with complete sentences.
Imagine a researcher found a positive correlation between two variables, and reported that the r value was +0.3. One possibility is that there is a true correlation between these two variables.
Discuss one alternative possibility that would also explain the observation of +0.3 value between the variables.
Your answer here:
Explain the difference between a correlation of r = +0.3 and r = +0.7. What does a larger value of r represent?
Your answer here:
Explain the difference between a negative correlation of r = - 0.5, and a positive correlation of r = + 0.5.
Your answer here:
Important
When you are ready to create your final lab report, save the Lab4-Remix-your-name.Rmd lab file and then Knit it to Word or PDF to make a reproducible file.
Note that if you have difficulty getting the documents to Knit to either Word or PDF, and you cannot fix it, just save the completed worksheet and submit your .Rmd file for partial credit.
Important
Remember to rename your file to include your name, e.g. Lab4-Remix-Susan-Smith.pdf.
Submit your file in the M4.5 Lab 4 Remix and Report: Exploring Relationships assignment area.
The Lab 4 Exploring Relationships Grading Rubric will be used.

Congrats
You have completed Lab 4 Exploring Relationships

This work was created by Dawn Wright and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
V2.0, Date 11/19/25
Last Compiled 2025-11-19