Mining Google Trends data with R
Jun 28, 2019
Christopher Yee

Google Trends is great for understanding relative search popularity for a given keyword or phrase. However, if we want to explore a topic further, retrieving that data through the web interface is quite clunky.

Enter the gtrendsR package for #rstats. What better way to demonstrate how it works than by pulling search popularity for ramen, pho, and spaghetti (hot on the heels of my last article about ramen ratings)!

Load packages

library(tidyverse)
library(lubridate)
library(gtrendsR)

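Pull the data

# ARGUMENTS INFERRED FROM THE geo AND time COLUMNS IN THE OUTPUT BELOW
food <- gtrends(c("ramen", "pho", "spaghetti"), geo = "US", time = "today+5-y")

Clean up our dataframe

food_timeseries <- as_tibble(food$interest_over_time) %>% 
  mutate(date = ymd(date)) %>% # CONVERT DATE FORMAT
  filter(date < Sys.Date() - 7) # REMOVE "NOISY" DATA FROM LAST SEVEN DAYS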
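
One caveat for other keywords: when a term has very low volume, Google Trends reports "<1" and the hits column can arrive as character instead of integer. That does not happen in the data used here, so the following is only a minimal sketch of how one might guard against it. The helper name clean_hits and the 0.5 stand-in value are assumptions for illustration, not part of gtrendsR.

```r
# Hypothetical helper (not part of gtrendsR): coerce a hits column to numeric,
# replacing the "<1" marker with a small stand-in value chosen by the caller.
clean_hits <- function(hits, below_one = 0.5) {
  hits <- as.character(hits)
  hits[hits %in% "<1"] <- as.character(below_one)
  as.numeric(hits)
}

# Example with placeholder values
clean_hits(c("12", "<1", "100"))
```

It could be applied inside the cleanup step with mutate(hits = clean_hits(hits)) before any plotting or modeling.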

Quick peek at data

food_timeseries
## # A tibble: 780 x 7
##    date        hits geo   time      keyword gprop category
##    <date>     <int> <chr> <chr>     <chr>   <chr>    <int>
##  1 2014-06-29    12 US    today+5-y ramen   web          0
##  2 2014-07-06    11 US    today+5-y ramen   web          0
##  3 2014-07-13    15 US    today+5-y ramen   web          0
##  4 2014-07-20    14 US    today+5-y ramen   web          0
##  5 2014-07-27    13 US    today+5-y ramen   web          0
##  6 2014-08-03    13 US    today+5-y ramen   web          0
##  7 2014-08-10    13 US    today+5-y ramen   web          0
##  8 2014-08-17    14 US    today+5-y ramen   web          0
##  9 2014-08-24    12 US    today+5-y ramen   web          0
## 10 2014-08-31    13 US    today+5-y ramen   web          0
## # … with 770 more rows

Graph interest over time

food_timeseries %>% 
  ggplot() +
  geom_line(aes(date, hits, color = keyword), size = 1) +
  scale_y_continuous(limits = c(0, 100)) +
  scale_color_brewer(palette = 'Set2') +
  theme_bw() +
  labs(x = NULL,
       y = "Relative Search Interest",
       color = NULL,
       title = "Google Trends: interest over time (US)") 

It looks like ramen has picked up traction over the last five years and even surpassed spaghetti in popularity earlier this year.

I wonder what that will look like a year from now? We’ll use the prophet package from Facebook to forecast future popularity.

Forecasting relative search popularity

Load packages

library(prophet)

Prepare the data

Let’s see how we do for ramen search popularity.

ramen_timeseries <- food_timeseries %>% 
  filter(keyword == 'ramen') %>% 
  select(date, hits) %>% 
  mutate(date = ymd(date)) %>% 
  rename(ds = date, y = hits) %>% # CONVERT COLUMN HEADERS FOR PROPHET
  arrange(ds) # ARRANGE BY DATE

Build the model

ramen_m <- prophet(ramen_timeseries)
ramen_future <- make_future_dataframe(ramen_m, periods = 365) # PREDICT 365 DAYS
ramen_ftdata <- as_tibble(predict(ramen_m, ramen_future))
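
The Google Trends series above is weekly, while make_future_dataframe() defaults to daily steps. The daily horizon works, but if a forecast on the same weekly cadence as the history is preferred, the freq argument can be set to "week". The sketch below uses a synthetic weekly series (placeholder values, not Google Trends data) so it can run on its own.

```r
library(prophet)

# Synthetic weekly series: placeholder values for illustration only
set.seed(1)
n_weeks <- 156
weekly <- data.frame(
  ds = seq(as.Date("2016-07-03"), by = "week", length.out = n_weeks),
  y  = 50 + 10 * sin(seq_len(n_weeks) * 2 * pi / 52) + rnorm(n_weeks)
)

m <- prophet(weekly)

# freq = "week" adds 52 weekly steps instead of 365 daily ones
future_weekly <- make_future_dataframe(m, periods = 52, freq = "week")
forecast_weekly <- predict(m, future_weekly)

# The future frame holds the historical dates plus the 52 new weeks
nrow(future_weekly)
```

Either choice covers roughly one year. The weekly version simply produces one predicted point per week rather than one per day.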

Combine forecast with actuals

ramen_forecast <- ramen_ftdata %>% 
  mutate(ds = ymd(ds),
         segment = case_when(ds > Sys.Date()-7 ~ 'forecast',
                             TRUE ~ 'actual'), # SEGMENT ACTUAL VS FORECAST DATA
         keyword = "ramen") %>% 
  select(ds, segment, yhat_lower, yhat, yhat_upper, keyword) %>% 
  left_join(ramen_timeseries, by = "ds") # JOIN ACTUAL DATA ON DATE
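
To see why the join leaves the actuals empty over the forecast period, here is a small sketch with placeholder values (not real prophet output). A left join keeps every forecast row and fills y with NA wherever no actual observation exists, which is why the actuals line in the plot stops where the forecast begins.

```r
library(dplyr)

# Placeholder values standing in for prophet predictions on three dates
forecast_toy <- tibble(
  ds   = as.Date(c("2019-06-09", "2019-06-16", "2019-06-23")),
  yhat = c(10, 11, 12)
)

# Placeholder actuals that cover only the first date
actual_toy <- tibble(
  ds = as.Date("2019-06-09"),
  y  = 9
)

# Every forecast row is kept; y is NA where there is no matching actual
joined <- left_join(forecast_toy, actual_toy, by = "ds")
joined
```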

Plot forecasting results

ramen_forecast %>% 
  rename(date = ds,
         actual = y) %>% 
  ggplot() +
  geom_line(aes(date, actual)) + # PLOT ACTUALS DATA
  geom_point(data = subset(ramen_forecast, segment == 'forecast'),
            aes(ds, yhat), color = 'salmon', size = 0.1) + # PLOT PREDICTION DATA
  geom_ribbon(data = subset(ramen_forecast, segment == 'forecast'),
            aes(ds, ymin = yhat_lower, ymax = yhat_upper), 
            fill = 'salmon', alpha = 0.3) + # SHADE PREDICTION DATA REGION
  scale_y_continuous(limits = c(0,100)) +
  theme_bw() +
  labs(x = NULL, y = "Relative Search Interest",
       title = "Google Trends: interest over time for \"ramen\" (US)")

The chart above doesn’t look too bad. However, this is relative search popularity, so we need to compare the prediction with pho and spaghetti as well.

# FUTURE NOTE: REFACTOR FOR DRY PRINCIPLES

# BUILD FORECASTING MODEL FOR PHO
pho_timeseries <- food_timeseries %>% 
  filter(keyword == 'pho') %>% 
  select(date, hits) %>% 
  mutate(date = ymd(date)) %>% 
  rename(ds = date, y = hits) %>% 
  arrange(ds)

pho_m <- prophet(pho_timeseries)
pho_future <- make_future_dataframe(pho_m, periods = 365)
pho_ftdata <- as_tibble(predict(pho_m, pho_future))

pho_forecast <- pho_ftdata %>% 
  mutate(ds = ymd(ds),
         segment = case_when(ds > Sys.Date()-7 ~ 'forecast',
                             TRUE ~ 'actual'),
         keyword = "pho") %>% 
  select(ds, segment, yhat_lower, yhat, yhat_upper, keyword) %>% 
  left_join(pho_timeseries, by = "ds")

# BUILD FORECASTING MODEL FOR SPAGHETTI
spaghetti_timeseries <- food_timeseries %>% 
  filter(keyword == 'spaghetti') %>% 
  select(date, hits) %>% 
  mutate(date = ymd(date)) %>% 
  rename(ds = date, y = hits) %>% 
  arrange(ds)

spaghetti_m <- prophet(spaghetti_timeseries)
spaghetti_future <- make_future_dataframe(spaghetti_m, periods = 365)
spaghetti_ftdata <- as_tibble(predict(spaghetti_m, spaghetti_future))

spaghetti_forecast <- spaghetti_ftdata %>% 
  mutate(ds = ymd(ds),
         segment = case_when(ds > Sys.Date()-7 ~ 'forecast',
                             TRUE ~ 'actual'),
         keyword = "spaghetti") %>% 
  select(ds, segment, yhat_lower, yhat, yhat_upper, keyword) %>% 
  left_join(spaghetti_timeseries, by = "ds")

# COMBINE ALL MODELS
keyword_forecast <- rbind(ramen_forecast, pho_forecast, spaghetti_forecast) %>% 
  rename(date = ds, actual = y)
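
The refactoring note at the top of this block could be addressed with a single function that runs the same steps for any keyword. This is only a sketch: forecast_keyword and its arguments are hypothetical names, and it assumes a data frame shaped like food_timeseries with date, hits, and keyword columns.

```r
library(dplyr)

# Hypothetical helper (not from the original post): prepare the data, fit
# prophet, and join the forecast back to the actuals for one keyword.
forecast_keyword <- function(data, kw, horizon = 365, cutoff = Sys.Date() - 7) {
  timeseries <- data %>%
    filter(keyword == kw) %>%
    select(ds = date, y = hits) %>% # PROPHET EXPECTS COLUMNS NAMED ds AND y
    arrange(ds)

  m <- prophet::prophet(timeseries)
  future <- prophet::make_future_dataframe(m, periods = horizon)

  as_tibble(predict(m, future)) %>%
    mutate(ds = as.Date(ds),
           segment = if_else(ds > cutoff, "forecast", "actual"),
           keyword = kw) %>%
    select(ds, segment, yhat_lower, yhat, yhat_upper, keyword) %>%
    left_join(timeseries, by = "ds")
}

# Usage, assuming food_timeseries from earlier in the post:
# keyword_forecast <- c("ramen", "pho", "spaghetti") %>%
#   lapply(forecast_keyword, data = food_timeseries) %>%
#   bind_rows() %>%
#   rename(date = ds, actual = y)
```

With this in place, adding a fourth keyword would mean extending the character vector rather than copying another block.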

Final plot

keyword_forecast %>% 
  ggplot() +
  geom_line(aes(date, actual, color = keyword), size = 1) +
  geom_ribbon(data = subset(keyword_forecast, segment == 'forecast'),
              aes(date, ymin = yhat_lower, ymax = yhat_upper, fill = keyword), 
            alpha = 0.3) +
  geom_point(data = subset(keyword_forecast, segment == 'forecast'),
             aes(date, yhat, color = keyword), size = 0.1) +
  scale_y_continuous(limits = c(0,100)) +
  scale_color_brewer(palette = 'Set2') +
  scale_fill_brewer(palette = 'Set2') +
  theme_bw() +
  labs(x = NULL, y = "Relative Search Interest",
       title = "Google Trends: interest over time (US)") 

One year from now, we should expect to see ramen at the top, with pho and spaghetti fighting for a close second in terms of relative search interest.

Wrapping up

There are many ways to download Google Trends data.

This is just one way to do it in R: we pulled the data, plotted historical trends, and forecasted future search popularity.