TidyTuesday: Steam Games
Jul 30, 2019
Christopher Yee
3 minute read

Data from #tidytuesday week of 2019-07-30 (source)

Load R packages

library(tidyverse)
library(RColorBrewer)
library(scales)

Download data

steam_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-30/video_games.csv") 

Parse data

steam_games <- steam_raw %>% 
  
  # VARIABLE FOR AGE OF GAME
  mutate(release_year = substring(release_date, 8, 12), # EXTRACT YEAR
         release_year = as.numeric(str_trim(release_year)), 
         release_year = case_when(release_year == 5 ~ 2015, # INCORRECT DATA POINT
                          TRUE ~ release_year),
         age = 2019 - release_year) %>% 
  
  # VARIABLE FOR MIN/MAX NUMBER OF OWNERS
  mutate(max_owners = str_trim(word(owners, 2, sep = "\\..")),
         max_owners = as.numeric(str_replace_all(max_owners, ",", "")),
         min_owners = str_trim(word(owners, 1, sep = "\\..")),
         min_owners = as.numeric(str_replace_all(min_owners, ",", ""))) %>% 
  
  # REMOVE VALUES WITH INCONSISTENT RELEASE_DATE FORMAT (n=37)
  filter(age < 15) %>% 
  
  # FILTER OUT STUDIO SOFTWARE
  filter(price < 150)

Visualize data

Question: how many people still play games that are X years old (on Steam) ?

steam_games %>% 
  select(age) %>% 
  ggplot() +
  geom_histogram(aes(as.factor(age)), 
                 binwidth = 1, 
                 stat = 'count',
                 fill = 'steelblue',
                 alpha = 0.8) +
  scale_y_continuous(labels = comma_format()) +
  labs(x = "Years Since Release",
       y = NULL,
       title = "Count of active games on Steam since release",
       caption = "by: @eeysirhc\nsource: Steam Spy") +
  theme_light()

Question: What is the distrubtion of game ownership for the top publishers ?

steam_games %>% 
  select(-number, -release_date) %>% 
  mutate(max_owners = as.factor(max_owners / 1000000)) %>% 
  group_by(publisher) %>% 
  mutate(n = n()) %>% 
  ungroup() %>% 
  filter(n >= 80, publisher != 'NA') %>%
  mutate(publisher = paste0(publisher, " (", n, ")"),
         publisher = reorder(publisher, desc(n))) %>% 
  ggplot(aes(publisher, max_owners, color = publisher)) +
  geom_jitter(show.legend = FALSE, size = 3, alpha = 0.5) +
  scale_color_brewer(palette = 'RdBu') +
  theme_light() +
  labs(x = NULL,
       y = "Estimated Game Ownership per Million",
       title = "Distribution of ownership for top game publishers",
       subtitle = "(n) indicates total titles released",
       caption = "by: @eeysirhc\nsource: Steam Spy") +
  theme(panel.background = element_rect(fill = 'grey95'),
        plot.background = element_rect(fill = 'grey95'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1)) 

Interesting takeaways

  • Mass Market: whether it is their distribution network or brand name, publishers such as Ubisoft (117) and Square Enix (85) manage to consistently break into the 1M+ user base categories.
  • Some Hits Mostly Miss: majority of the top publishers will fall in this category like Strategy First (126) or Degica (81) where a few game titles will break 50K or 100K users but majority will fall in the 20K bucket.
  • Niche Market: publishers like Big Fish Games (n=265) and Choice of Games (n=92) have a lot of titles under their belts where each game acquires no more than 20K users. It is possible they do not want to compete against major publishers like Ubisoft or SEGA but instead choose to churn out games which cater to a super niche audience.