Using R & GSC data to identify stale content

My friend John-Henry Scherck recently tweeted his process on how to refresh stale content:

Put together a quick video on how to refresh stale content using nothing more than Google Search Console and a word doc.

Check out the full video here: https://t.co/Vva4Zm4mNn pic.twitter.com/74Fm2oIz4c
— John-Henry Scherck (@JHTScherck) January 21, 2020

I imagine this can be broken down into five distinct parts:

Stale content selection
Understanding keyword intent
Actually refreshing the content
Internal link optimization
Publish

This short guide will focus on the first part, where we’ll use #rstats to remove the manual work of selecting stale content candidates.

That's it! Fairly manual, but hopefully straightforward. Let me know what you think or if you have any questions.
— John-Henry Scherck (@JHTScherck) January 21, 2020

Load packages

library(tidyverse)
library(searchConsoleR)

scr_auth()

Download data

The code below will grab up to 100K rows covering roughly the last five weeks of data (from 35 days ago through 3 days ago), but feel free to revise as you see fit.

df <- as_tibble(search_analytics("https://www.christopheryee.org/",
                                 Sys.Date() - 35, # START DATE
                                 Sys.Date() - 3, # END DATE
                                 c("page", "query"),
                                 searchType = "web",
                                 rowLimit = 1e5))
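Note that the window above spans 33 days, counting both endpoints. If you want exactly five full weeks, here is a minimal sketch that anchors on the end date and counts back 35 days inclusive. The variable names are hypothetical, and the three-day lag simply mirrors the call above.

```r
# Hypothetical helper variables; not part of the original script
end_date   <- Sys.Date() - 3          # same lag as the call above
start_date <- end_date - (5 * 7 - 1)  # 35 days inclusive = five full weeks

# Number of days in the window, counting both endpoints
n_days <- as.integer(end_date - start_date) + 1
n_days
```

You could then pass start_date and end_date to search_analytics() in place of the two Sys.Date() expressions.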

Identify keywords

This is where we’ll exclude brand terms and keep only keywords with at least 2K impressions and an average position from 5 up to (but not including) 15.

keywords <- df %>% 
  group_by(query) %>% 
  summarize(impressions = sum(impressions),
            position = mean(position)) %>% 
  filter(!grepl("brand_term", query)) %>% # EXCLUDE BRAND TERMS HERE
  arrange(desc(impressions)) %>% 
  filter(impressions >= 2000,
         position >= 5 & position < 15) %>% 
  select(query)
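If you have more than one brand term, one option is to collapse them into a single regular expression before filtering. This is only a sketch in base R: the brand terms and sample queries below are made up, so swap in your own.

```r
# Hypothetical brand terms; replace with your own
brand_terms <- c("acme", "acme corp", "acmecorp")
brand_regex <- paste(brand_terms, collapse = "|")

# Made-up queries for illustration
queries <- c("acme login", "how to refresh stale content", "AcmeCorp pricing")

# ignore.case = TRUE also catches capitalised variants
is_brand  <- grepl(brand_regex, queries, ignore.case = TRUE)
non_brand <- queries[!is_brand]
non_brand
```

In the pipeline above, the same idea would look like filter(!grepl(brand_regex, query, ignore.case = TRUE)).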

Dedupe landing pages

There may be instances where a page will have multiple keywords.

We can remove duplicates here by keeping, for each keyword, only the page with the most clicks.

pages <- df %>% 
  inner_join(keywords) %>% # JOIN OUR KEYWORDS DATASET
  group_by(query) %>% 
  arrange(desc(clicks)) %>% 
  mutate(candidate = row_number()) %>% 
  ungroup() %>% 
  filter(candidate == 1) %>% 
  select(page)

Fun fact: I often use candidate = row_number() as a quick hack to pull the “top” or “bottom” row of each group in a dataset.
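To see the row_number() trick in isolation, here is a small sketch on made-up data (the toy tibble and its values are purely illustrative). It keeps the highest-click page for each query, using the same steps as the pipeline above.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  query  = c("a", "a", "b", "b", "b"),
  page   = c("/p1", "/p2", "/p1", "/p3", "/p4"),
  clicks = c(10, 25, 5, 40, 12)
)

top_pages <- toy %>%
  group_by(query) %>%
  arrange(desc(clicks)) %>%             # highest clicks first
  mutate(candidate = row_number()) %>%  # rank rows within each query
  ungroup() %>%
  filter(candidate == 1)                # keep the top row per query

top_pages
```

If you are on dplyr 1.0.0 or later, slice_max(clicks, n = 1, with_ties = FALSE) after group_by(query) should give a similar result with fewer steps.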

Final candidates

df %>% 
  inner_join(pages) %>% 
  mutate(ctr = (clicks / impressions) * 100) %>% # STANDARDIZE CTR
  arrange(page, desc(impressions)) %>% # SORT BY PAGE, THEN HIGHEST IMPRESSIONS
  distinct(.)

From here you can take the keywords and move on to the “understanding keyword intent” phase.

Resources

Full script can be found on GitHub
If you enjoyed this post, you may be interested in my getting started with R guide using Google Search Console data