Using R & GSC data to identify stale content
Jan 21, 2020
Christopher Yee

My friend John-Henry Scherck recently tweeted his process for refreshing stale content.

I imagine this can be broken down into five distinct parts:

  1. Stale content selection
  2. Understanding keyword intent
  3. Actually refreshing the content
  4. Internal link optimization
  5. Publish

This short guide focuses on the first part, where we’ll use #rstats to remove the manual work of selecting stale content candidates.

Load packages

library(tidyverse)
library(searchConsoleR)

scr_auth()

Download data

The code below requests up to 100K rows covering roughly the last five weeks (from 35 days ago to 3 days ago), but feel free to revise as you see fit.

df <- as_tibble(search_analytics("https://www.christopheryee.org/",
                                 Sys.Date() - 35, # START DATE
                                 Sys.Date() - 3, # END DATE
                                 c("page", "query"),
                                 searchType = "web",
                                 rowLimit = 1e5))
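Before spending API quota, it can help to check exactly how many days the window covers. A minimal sketch in base R follows. The names start_date, end_date, and n_days are illustrative and not part of searchConsoleR. The three-day offset is assumed to be there because Search Console data usually lags by a few days.

```r
# Illustrative names only; these are not searchConsoleR arguments
today      <- Sys.Date()
start_date <- today - 35   # same START DATE as in the call above
end_date   <- today - 3    # same END DATE as in the call above

# Calendar days covered, counting both endpoints
n_days <- as.integer(end_date - start_date) + 1L
n_days   # 33
```

If you want exactly 35 days ending three days ago, a start date of Sys.Date() - 37 would do it.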

Identify keywords

This is where we’ll exclude brand terms and keep only keywords with at least 2K impressions and an average position of 5 or higher but below 15.

keywords <- df %>% 
  group_by(query) %>% 
  summarize(impressions = sum(impressions), # TOTAL IMPRESSIONS ACROSS ALL PAGES
            position = mean(position)) %>%  # SIMPLE (UNWEIGHTED) AVERAGE POSITION
  filter(!grepl("brand_term", query)) %>% # EXCLUDE BRAND TERMS HERE
  arrange(desc(impressions)) %>% 
  filter(impressions >= 2000,
         position >= 5 & position < 15) %>% 
  select(query)
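To see what these filters keep and drop, here is a small sketch that runs the same logic on made-up rows. The data, the brand term "acme", and the name toy_keywords are all invented for illustration and are not real Search Console output.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  query       = c("acme login", "stale content", "stale content",
                  "r tutorial", "seo audit"),
  page        = c("/a", "/b", "/c", "/d", "/e"),
  impressions = c(9000, 1500, 900, 5000, 2500),
  position    = c(1.2, 8, 12, 3.1, 14.9)
)

toy_keywords <- toy %>%
  group_by(query) %>%
  summarize(impressions = sum(impressions),
            position = mean(position)) %>%
  filter(!grepl("acme", query),           # "acme" stands in for a brand term
         impressions >= 2000,
         position >= 5 & position < 15) %>%
  arrange(desc(impressions)) %>%
  select(query)

toy_keywords$query   # "seo audit" "stale content"
```

Note that "stale content" only clears the 2K threshold after its impressions are summed across both pages, while "r tutorial" is dropped for already ranking above position 5.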

Dedupe landing pages

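There may be instances where a page ranks for multiple keywords, or where a keyword surfaces more than one page.

For each keyword we keep only the page with the most clicks. Any page that still shows up more than once is collapsed by distinct() in the final step.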

pages <- df %>% 
  inner_join(keywords, by = "query") %>% # JOIN OUR KEYWORDS DATASET
  group_by(query) %>% 
  arrange(desc(clicks)) %>% 
  mutate(candidate = row_number()) %>% # RANK PAGES BY CLICKS WITHIN EACH KEYWORD
  ungroup() %>% 
  filter(candidate == 1) %>% # KEEP THE TOP PAGE PER KEYWORD
  select(page)

Fun fact: I often use candidate = row_number() as a quick hack to keep the “top” or “bottom” rows of each group in a dataset.
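As a sketch of how the row_number() trick behaves, the example below applies it to made-up rows and compares it with slice_max(), which is available in dplyr 1.0.0 and later. The data and the names top_by_rank and top_by_slice are invented for illustration.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  query  = c("stale content", "stale content", "seo audit", "seo audit"),
  page   = c("/b", "/c", "/e", "/f"),
  clicks = c(10, 40, 7, 3)
)

# The row_number() approach: sort, number rows within each group, keep row 1
top_by_rank <- toy %>%
  group_by(query) %>%
  arrange(desc(clicks)) %>%
  mutate(candidate = row_number()) %>%
  ungroup() %>%
  filter(candidate == 1) %>%
  select(query, page)

# Same idea with slice_max(); with_ties = FALSE keeps exactly one row per group
top_by_slice <- toy %>%
  group_by(query) %>%
  slice_max(clicks, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(query, page)
```

Both keep "/c" for "stale content" and "/e" for "seo audit", although the row order of the two results can differ. Swapping desc(clicks) for clicks (or slice_max() for slice_min()) gives the “bottom” row instead.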

Final candidates

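df %>% 
  inner_join(pages, by = "page") %>% # KEEP ONLY ROWS FOR OUR CANDIDATE PAGES
  mutate(ctr = (clicks / impressions) * 100) %>% # STANDARDIZE CTR
  arrange(page, desc(impressions)) %>% # SORT BY PAGE, THEN HIGHEST IMPRESSIONS FIRST
  distinct() # DROP ROWS DUPLICATED BY REPEATED PAGES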
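The distinct() call matters because the same page can be the top result for more than one keyword, so it can appear several times in the pages dataset. The sketch below shows the effect on made-up rows. The data and the names toy_df, toy_pages, and joined are invented for illustration.

```r
library(dplyr)

# Made-up rows for illustration only
toy_df <- tibble(
  page        = c("/b", "/b", "/e"),
  query       = c("stale content", "refresh content", "seo audit"),
  clicks      = c(40, 10, 7),
  impressions = c(2000, 500, 700)
)

# "/b" appears twice because it was the top page for two keywords
toy_pages <- tibble(page = c("/b", "/b", "/e"))

# Newer dplyr versions may warn about a many-to-many match here
joined <- toy_df %>%
  inner_join(toy_pages, by = "page")

nrow(joined)             # 5: each "/b" row is matched twice
nrow(distinct(joined))   # 3: duplicates removed

# Alternative: dedupe the pages before joining
deduped <- toy_df %>%
  inner_join(distinct(toy_pages), by = "page")

nrow(deduped)            # 3
```

Either route ends with one row per page and keyword pair. Deduping before the join simply avoids creating the extra rows in the first place.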

From here you can take these keywords and move on to the next phase: understanding keyword intent.

Resources