Classifying keywords with the fuzzyjoin R package

A few months ago I tweeted a complex (and tedious) Excel formula on how to classify keywords:

For the #seo who insists on completing their keyword/intent research in excel

Philosophy: keyword intent is not absolute so it won't fall neatly into an assigned bucket. For this reason a keyword can live under multiple conversion funnels since we can't be 100% certain. pic.twitter.com/JcTl9P11mC
— Christopher Yee (@Eeysirhc) April 24, 2019

I then ended it with:

Notes
1) This will probably crap out after 30k keywords (solve: copy+paste values only)
2) You can supplement with GSC CTR data or SEM conversion rates (separate workflow)
3) Consider ditching excel for R/Python in the future (https://t.co/GoiyfAiFYN)
— Christopher Yee (@Eeysirhc) April 24, 2019

In hindsight the third comment could be interpreted as “gud luk lulz!”

For this walkthrough we’ll use the {fuzzyjoin} #rstats package to replicate the aforementioned Excel method. In a future post we’ll build a neural net instead to achieve this for 1M+ keywords.

Load packages

library(tidyverse)

Download data

df <- read_csv("https://raw.githubusercontent.com/Eeysirhc/random_datasets/master/keywords_digital_marketing.csv")

Inspect the data

df

## # A tibble: 29 x 1
##    Keyword                            
##    <chr>                              
##  1 digital marketing agency           
##  2 what is digital marketing          
##  3 digital marketing jobs             
##  4 digital marketing salary           
##  5 digital marketing manager          
##  6 digital marketing course           
##  7 digital marketing strategy         
##  8 digital marketing services         
##  9 digital marketing specialist salary
## 10 digital marketing certificate      
## # … with 19 more rows

Introducing {fuzzyjoin}

Fuzzyjoin is an amazing package “that allows matching not just on values that match between columns, but on inexact matching.”

This is exactly what we need to identify and match keyword strings together so let’s first load the package.

# install.packages("fuzzyjoin") 
library(fuzzyjoin)

To understand how this works we’ll start small by building out the consideration identifier segment.

consideration <- as_tibble(c("what is", "blog", "specialist", 
                             "near me", "agency", "compan", 
                             "service", "example")) %>% 
  rename(consideration = value)

Then we (fuzzy) match the two dataframes with the regex_left_join function to produce the following result:

regex_left_join(df, consideration, by = c("Keyword" = "consideration")) %>% 
  replace(., is.na(.), "-") # REPLACE NA VALUES

## # A tibble: 29 x 2
##    Keyword                             consideration
##    <chr>                               <chr>        
##  1 digital marketing agency            agency       
##  2 what is digital marketing           what is      
##  3 digital marketing jobs              -            
##  4 digital marketing salary            -            
##  5 digital marketing manager           -            
##  6 digital marketing course            -            
##  7 digital marketing strategy          -            
##  8 digital marketing services          service      
##  9 digital marketing specialist salary specialist   
## 10 digital marketing certificate       -            
## # … with 19 more rows

Not bad - we were able to assign each keyword to the respective consideration segment while ignoring those that do not apply.

Now let’s combine everything for the other categories…

# TRANSACTIONAL
transactional <- as_tibble(c("agency", "compan", "consult", 
                             "service")) %>%
  rename(transactional = value)

# EVALUATION
evaluation <- as_tibble(c("blog", "consult", "agency", 
                          "compan", "service", "example")) %>% 
  rename(evaluation = value)

# CONSIDERATION
consideration <- as_tibble(c("what is", "blog", "specialist", 
                             "near me", "agency", "compan", 
                             "service", "example", "strategy")) %>% 
  rename(consideration = value)

# AWARENESS
awareness <- as_tibble(c("what is", "tool", "definition", 
                         "channel", "near me", "blog", 
                         "course", "new", "trend", "tip",
                         "strategy")) %>% 
  rename(awareness = value)

# OUT OF FUNNEL
out_of_funnel <- as_tibble(c("degree", "institute", "course", 
                             "certif", "skill", "for dummi", 
                             "training", "quote", "job", 
                             "salary", "intern", "manager", 
                             "resume", "analyst", "strategist", 
                             "director", "specialist")) %>% 
  rename(out_of_funnel = value)

And finally match them with the keywords dataset…

df_parsed <- regex_left_join(df, transactional, by = c("Keyword" = "transactional")) %>% 
  regex_left_join(evaluation, by = c("Keyword" = "evaluation")) %>% 
  regex_left_join(consideration, by = c("Keyword" = "consideration")) %>% 
  regex_left_join(awareness, by = c("Keyword" = "awareness")) %>% 
  regex_left_join(out_of_funnel, by = c("Keyword" = "out_of_funnel")) %>% 
  replace(., is.na(.), "-") %>% # REPLACE NA VALUES
  head(10) # FILTER ON FIRST 10

Step 5) Win.

Keyword	transactional	evaluation	consideration	awareness	out_of_funnel
digital marketing agency	agency	agency	agency
what is digital marketing			what is	what is
digital marketing jobs					job
digital marketing salary					salary
digital marketing manager					manager
digital marketing course				course	course
digital marketing strategy			strategy	strategy
digital marketing services	service	service	service
digital marketing specialist salary			specialist		salary
digital marketing specialist salary			specialist		specialist

Wrapping Up

What I love about this is the reproducibility and speed.

Although Excel did the job for me in the past when I needed it, the most frustrating thing was the cognitive load to get the lengthy array formula correct. It was (and still is) prone to error every time I had to 1) add new keyword identifiers or 2) wanted to expand the number of categories.

Transitioning to R has saved me countless headaches where I can just write, reuse or edit a few lines of code.

I hope this example using the {fuzzyjoin} package will help your keyword research workflow as much as it has helped mine!