Classifying keywords with the fuzzyjoin R package
Jul 19, 2019
Christopher Yee
4 minute read

A few months ago I tweeted a complex (and tedious) Excel formula on how to classify keywords:

I then ended it with:

In hindsight the third comment could be interpreted as “gud luk lulz!”

For this walkthrough we’ll use the {fuzzyjoin} #rstats package to replicate the aforementioned Excel method. In a future post we’ll build a neural net instead to achieve this for 1M+ keywords.

Load packages

library(tidyverse)

Download data

df <- read_csv("https://raw.githubusercontent.com/Eeysirhc/random_datasets/master/keywords_digital_marketing.csv")

Inspect the data

df
## # A tibble: 29 x 1
##    Keyword                            
##    <chr>                              
##  1 digital marketing agency           
##  2 what is digital marketing          
##  3 digital marketing jobs             
##  4 digital marketing salary           
##  5 digital marketing manager          
##  6 digital marketing course           
##  7 digital marketing strategy         
##  8 digital marketing services         
##  9 digital marketing specialist salary
## 10 digital marketing certificate      
## # … with 19 more rows

Introducing {fuzzyjoin}

Fuzzyjoin is an amazing package “that allows matching not just on values that match between columns, but on inexact matching.”

This is exactly what we need to identify and match keyword strings together so let’s first load the package.

# install.packages("fuzzyjoin") 
library(fuzzyjoin)

To understand how this works we’ll start small by building out the consideration identifier segment.

consideration <- as_tibble(c("what is", "blog", "specialist", 
                             "near me", "agency", "compan", 
                             "service", "example")) %>% 
  rename(consideration = value)

Then we (fuzzy) match the two dataframes with the regex_left_join function to produce the following result:

regex_left_join(df, consideration, by = c("Keyword" = "consideration")) %>% 
  replace(., is.na(.), "-") # REPLACE NA VALUES
## # A tibble: 29 x 2
##    Keyword                             consideration
##    <chr>                               <chr>        
##  1 digital marketing agency            agency       
##  2 what is digital marketing           what is      
##  3 digital marketing jobs              -            
##  4 digital marketing salary            -            
##  5 digital marketing manager           -            
##  6 digital marketing course            -            
##  7 digital marketing strategy          -            
##  8 digital marketing services          service      
##  9 digital marketing specialist salary specialist   
## 10 digital marketing certificate       -            
## # … with 19 more rows

Not bad - we were able to assign each keyword to the respective consideration segment while ignoring those that do not apply.

Now let’s combine everything for the other categories…

# TRANSACTIONAL
transactional <- as_tibble(c("agency", "compan", "consult", 
                             "service")) %>%
  rename(transactional = value)

# EVALUATION
evaluation <- as_tibble(c("blog", "consult", "agency", 
                          "compan", "service", "example")) %>% 
  rename(evaluation = value)

# CONSIDERATION
consideration <- as_tibble(c("what is", "blog", "specialist", 
                             "near me", "agency", "compan", 
                             "service", "example", "strategy")) %>% 
  rename(consideration = value)

# AWARENESS
awareness <- as_tibble(c("what is", "tool", "definition", 
                         "channel", "near me", "blog", 
                         "course", "new", "trend", "tip",
                         "strategy")) %>% 
  rename(awareness = value)

# OUT OF FUNNEL
out_of_funnel <- as_tibble(c("degree", "institute", "course", 
                             "certif", "skill", "for dummi", 
                             "training", "quote", "job", 
                             "salary", "intern", "manager", 
                             "resume", "analyst", "strategist", 
                             "director", "specialist")) %>% 
  rename(out_of_funnel = value)

And finally match them with the keywords dataset…

df_parsed <- regex_left_join(df, transactional, by = c("Keyword" = "transactional")) %>% 
  regex_left_join(evaluation, by = c("Keyword" = "evaluation")) %>% 
  regex_left_join(consideration, by = c("Keyword" = "consideration")) %>% 
  regex_left_join(awareness, by = c("Keyword" = "awareness")) %>% 
  regex_left_join(out_of_funnel, by = c("Keyword" = "out_of_funnel")) %>% 
  replace(., is.na(.), "-") %>% # REPLACE NA VALUES
  head(10) # FILTER ON FIRST 10

Step 5) Win.

Keyword transactional evaluation consideration awareness out_of_funnel
digital marketing agency agency agency agency
what is digital marketing
what is what is
digital marketing jobs
job
digital marketing salary
salary
digital marketing manager
manager
digital marketing course
course course
digital marketing strategy
strategy strategy
digital marketing services service service service
digital marketing specialist salary
specialist
salary
digital marketing specialist salary
specialist
specialist

Wrapping Up

What I love about this is the reproducibility and speed.

Although Excel did the job for me in the past when I needed it, the most frustrating thing was the cognitive load to get the lengthy array formula correct. It was (and still is) prone to error every time I had to 1) add new keyword identifiers or 2) wanted to expand the number of categories.

Transitioning to R has saved me countless headaches where I can just write, reuse or edit a few lines of code.

I hope this example using the {fuzzyjoin} package will help your keyword research workflow as much as it has helped mine!