R functions for simulation, sampling & visualization
Oct 16, 2019
Christopher Yee
8 minute read

In my previous article about simulating page speed data, I broke one of the cardinal rules in programming: don’t repeat yourself.

There was a reason for this: I wanted to show what is going on under the hood and the theoretical concepts associated with them before using other functions in R.

For this follow-up, I’ll highlight a few #rstats shortcuts that will make your life easier when generating and exploring simulated data.

Background

We’ll use the same dataset from last time with Best Buy and Home Depot as our initial test subjects.

library(tidyverse)

df <- read_csv("https://raw.githubusercontent.com/Eeysirhc/random_datasets/master/page_speed_benchmark.csv") %>% 
  filter(page_type != 'Home')
  
df %>% 
  filter(website %in% c("BestBuy", "Home Depot")) %>% 
  group_by(website) %>% 
  summarize(avg_time_interactive = mean(time_to_interactive),
            sd_time_interactive = sd(time_to_interactive))
## # A tibble: 2 x 3
##   website    avg_time_interactive sd_time_interactive
##   <chr>                     <dbl>               <dbl>
## 1 BestBuy                    31.2                2.60
## 2 Home Depot                 25.2                1.41

Simulating data

The code from the previous article is prone to error in the long run when we have multiple datasets to manage.

bestbuy <- rnorm(1e4, 31.2, 2.60) %>% 
  as_tibble() %>% 
  mutate(website = paste0("bestbuy"))

homedepot <- rnorm(1e4, 25.2, 1.41) %>% 
  as_tibble() %>% 
  mutate(website = paste0("homedepot"))

Instead we can simplify things by writing a custom function to do all of the above.

# FUNCTION TO SIMULATE 10K
sim_10k <- function(mean, sd, website){
  
  values <- rnorm(1e4, mean, sd) %>% 
    as_tibble() %>% 
    mutate(website = paste0(website))
}

Now with a single call we can simulate 10K data points for our normal distribution based on a set of parameters:

  • Average mean
  • Standard deviation
  • Name of the website
bestbuy <- sim_10k(31.2, 2.60, "bestbuy")
homedepot <- sim_10k(25.2, 1.41, "homedepot")

# VALIDATE OUR DATA
rbind(bestbuy, homedepot) %>% 
  group_by(website) %>% 
  summarize(mean = mean(value),
            sd = sd(value)) 
## # A tibble: 2 x 3
##   website    mean    sd
##   <chr>     <dbl> <dbl>
## 1 bestbuy    31.2  2.58
## 2 homedepot  25.2  1.39

Plotting simulated data

There may be cases where we need to juggle around numerous datasets and quickly visualize the distribution.

Rather than writing (and rewriting) our data frames or plots we can create a custom function to do just that.

# FUNCTION FOR HISTOGRAM
hist_plot <- function(df){
  
  multihist <- ggplot(data = df, aes(value, fill = website)) +
    geom_histogram(binwidth = 0.05, position = 'identity', alpha = 0.8) +
    scale_fill_brewer(palette = 'Dark2') +
    labs(x = "Time to Interactive (seconds)") +
    theme_light()
  
  return(multihist)
}

# PLOT BEST BUY AND HOME DEPOT DATA
hist_plot(rbind(bestbuy, homedepot))

Let’s add Walmart to the mix to illustrate how easy it can be to visualize new datasets.

# WALMART SUMMARY STATISTICS
df %>% 
  group_by(website) %>% 
  filter(website == 'Walmart') %>% 
  summarize(avg_time_interactive = mean(time_to_interactive),
            sd_time_interactive = sd(time_to_interactive))
## # A tibble: 1 x 3
##   website avg_time_interactive sd_time_interactive
##   <chr>                  <dbl>               <dbl>
## 1 Walmart                 16.6               0.830

And let’s plot everything together with our new functions:

# SIMULATE WALMART DATA
walmart <- sim_10k(16.6, 0.830, "walmart")

# PLOT WALMART DATA
hist_plot(rbind(bestbuy, homedepot, walmart))

Or we can focus on a single group:

hist_plot(bestbuy)

Just give me the answers!

Here’s a non-secret: based on their mathematical properties, we can skip the entire data simulation process and spit out the results.

For the following section we’ll use Target as our example.

# TARGET SUMMARY STATISTICS
df %>% 
  filter(website == 'Target') %>% 
  group_by(website) %>% 
  summarize(mean_time_interactive = mean(time_to_interactive),
            sd_time_interactive = sd(time_to_interactive))
## # A tibble: 1 x 3
##   website mean_time_interactive sd_time_interactive
##   <chr>                   <dbl>               <dbl>
## 1 Target                   21.4                2.02

Calculating probabilities

dnorm

This function returns the probability density for a specific value given its shape.

For Target, we can use it to answer the question: what is the probability we will measure a time_to_interactive page speed score of exactly 20s?

In this case it would be as simple as this single line of code:

dnorm(20, 21.4, 2.02) * 100
## [1] 15.53292

And to validate that:

# SIMULATE TARGET DATA
target <- sim_10k(21.4, 2.02, "target")

target %>% 
  ggplot(aes(value)) +
  geom_density() +
  geom_vline(xintercept = 20, lty = 2, color = 'salmon') +
  geom_hline(yintercept = 0.1553292, lty = 2, color = 'salmon') +
  geom_point(aes(x = 20, y = 0.1553292), size = 3, color = 'grey40') +  
  labs(x = "Time to Interative (seconds)", y = "Probability Density") +
  scale_y_continuous(labels = scales::percent_format(round(1))) +
  theme_light() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

pnorm

We can use this function if we want to know what the probability distribution is for certain values that fall below a certain threshold - we have already been doing this!

For example: what is the probability a random page on Target will be less than 20 seconds?

sum(target$value < 20) / length(target$value) * 100
## [1] 23.92

And if we use the pnorm function without generating any data:

pnorm(20, 21.4, 2.02) * 100
## [1] 24.4133

Visualizing our data can also confirm this output.

target %>% 
  mutate(segment = ifelse(value < 20, "below", "above")) %>% 
  ggplot(aes(value, fill = segment)) +
  geom_histogram(binwidth = 0.05, alpha = 0.8) +
  scale_fill_brewer(palette = 'Set1') +
  labs(x = "Time to Interactive (seconds)") +
  theme_light()

This same function can also be used to find the probability distribution above a given threshold by adding a minor change with lower.tail = FALSE (we’ll use 25 seconds):

pnorm(25, 21.4, 2.02, lower.tail = FALSE) * 100
## [1] 3.736009

Which is the same as:

sum(target$value > 25) / length(target$value) * 100
## [1] 3.5

Note: we are generating 10K random numbers from a given parameter space so the values will match 1:1 as our n gets closer to infinity. To test this, try increasing your simulation from 10K (1e4) to 100K (1e5) or even 1M (1e6).

qnorm

We can leverage this function, given a desired quantile, to ask a question like: above what time_to_interactive threshold is considered the slowest 10th percentile on Target?

qnorm(.90, 21.4, 2.02) 
## [1] 23.98873

This is essentially the inverse of our pnorm function:

pnorm(23.98873, 21.4, 2.02, lower.tail = FALSE) * 100
## [1] 10.00004

Thus, everything above ~24 seconds is considered the slowest 10% for Target.

target %>% 
  mutate(percentile = ifelse(value > 23.98873, "10th", "90th")) %>% 
  ggplot(aes(value, fill = percentile)) +
  geom_histogram(binwidth = 0.05) +
  scale_fill_brewer(palette = 'Greens', direction = -1) +
  labs(x = "Time to Interactive (seconds)") +
  theme_light()

And to get our answer in the opposite direction for the fastest 2-percentile we make a slight adjustment again with lower.tail = FALSE:

qnorm(.98, 21.4, 2.02, lower.tail = FALSE)
## [1] 17.25143

Now we know that ~17.25 seconds in time_to_interactive is considered the maximum value to be considered the fastest 2% on Target’s website.

We can then validate our assumptions with the pnorm function:

pnorm(17.25143, 21.4, 2.02) * 100
## [1] 2.000007

And visualize our results as needed:

target %>% 
  mutate(percentile = ifelse(value < 17.25143, "2nd", "98th")) %>% 
  ggplot(aes(value, fill = percentile)) +
  geom_histogram(binwidth = 0.05) +
  scale_fill_brewer(palette = 'Blues') +
  labs(x = "Time to Interactive (seconds)") +
  theme_light()

Visualizing probabilities

Since we don’t necessarily need to simulate our data we also don’t need it to plot our distributions!

Taking our example from earlier, if we wanted to know Target’s probability density at exactly 20 seconds we can run this command:

dnorm(20, 21.4, 2.02) * 100
## [1] 15.53292

Which can also be visualized with the stat_function:

ggplot() +
  stat_function(aes(c(0, 40)), 
                color = 'grey30', 
                fun = dnorm, 
                n = 1e4,
    args = c(mean = 21.4, sd = 2.02)) +
  geom_vline(xintercept = 20, lty = 2, color = 'red') +
  geom_hline(yintercept = 0.1553292, lty = 2, color = 'red') +
  labs(x = "Time to Interactive (seconds)", y = "Probability Density") +
  scale_y_continuous(labels = scales::percent_format(round(1))) +
  theme_light() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

For the aforementioned pnorm function, we can calculate the probability a random Target page will be under 20 seconds with the following:

pnorm(20, 21.4, 2.02)
## [1] 0.244133

But also visualize the probability distribution by changing the fun argument like so:

ggplot() +
  stat_function(aes(c(0, 40)), 
                color = 'grey30', 
                fun = pnorm, 
                n = 1e4,
    args = c(mean = 21.4, sd = 2.02)) +
  geom_vline(xintercept = 20, lty = 2, color = 'red') +
  geom_hline(yintercept = 0.244133, lty = 2, color = 'red') +
  labs(x = "Time to Interactive (seconds)", y = "Probability Distribution") +
  scale_y_continuous(labels = scales::percent_format(round(1))) +
  theme_light() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

Last but not least, we can visualize our qnorm data as well to find the 90th percentfile cutoff point:

ggplot() +
  stat_function(aes(c(0, 1)), 
                color = 'grey30', 
                fun = qnorm,
                n = 1e4,
    args = c(mean = 21.4, sd = 2.02)) +
  geom_vline(xintercept = 0.90, lty = 2, color = 'red') +
  geom_hline(yintercept = 23.98873, lty = 2, color = 'red') +
  labs(y = "Time to Interactive (seconds)", x = "Percent Quantile") +
  theme_light() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

And to confirm everything we run the respective function:

qnorm(.90, 21.4, 2.02) 
## [1] 23.98873

Bonus

Charting multiple page speed probability densities in one go:

ggplot() +
  
  # HOME DEPOT
    stat_function(aes(c(0, 40), color = 'Home Depot'), 
                fun = dnorm, 
                n = 1e4,
                size = 1,
    args = c(mean = 25.2, sd = 1.41)) +  
  
  # TARGET
  stat_function(aes(c(0, 40), color = 'Target'), 
                fun = dnorm, 
                n = 1e4,
                size = 1,
    args = c(mean = 21.4, sd = 2.02)) +
   
  # WALMART
    stat_function(aes(c(0, 40), color = 'Walmart'), 
                fun = dnorm, 
                n = 1e4,
                size = 1,
    args = c(mean = 16.6, sd = 0.830)) +
  
  labs(x = "Time to Interactive (seconds)", y = "Probability Density", color = NULL) +
  scale_color_manual(values = c("#D55E00", "#009E73", "#0072B2")) +
  scale_y_continuous(labels = scales::percent_format(round(1))) +
  theme_light() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())