Christopher Yee

R functions for simulation, sampling & visualization

In my previous article about simulating page speed data, I broke one of the cardinal rules of programming: don’t repeat yourself. There was a reason for this: I wanted to show what is going on under the hood, and the theoretical concepts behind it, before using other functions in R. For this follow-up, I’ll highlight a few #rstats shortcuts that will make your life easier when generating and exploring simulated data.

Background

We’ll use the same dataset from last time with Best Buy and Home Depot as our initial test subjects. ...
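As a rough illustration of the "don't repeat yourself" idea, here is a minimal base R sketch that wraps one simulation in a helper and applies it to each site. The helper name `simulate_speed`, the normal distribution, and every mean and standard deviation below are assumptions made up for this example; they are not the article's actual data or method.

```r
# Hypothetical page speed parameters in seconds (NOT values from the article).
site_params <- data.frame(
  site = c("bestbuy", "homedepot"),
  mean = c(3.0, 3.5),
  sd   = c(0.5, 0.7),
  stringsAsFactors = FALSE
)

# Draw n simulated page speeds for one site (normal distribution assumed).
simulate_speed <- function(site, mean, sd, n = 1000) {
  data.frame(site = site, speed = rnorm(n, mean = mean, sd = sd))
}

set.seed(42)  # make the random draws reproducible

# Apply the helper to every site instead of copy-pasting one block per site.
simulated <- do.call(
  rbind,
  Map(simulate_speed, site_params$site, site_params$mean, site_params$sd)
)

# Average simulated speed per site.
aggregate(speed ~ site, data = simulated, FUN = mean)
```

Adding a third site then means adding one row to `site_params` rather than another copy of the simulation code.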

Simulating data to explore page speed performance

We may be inundated with data but sometimes collecting it can be a challenge in and of itself. A few reasons off the top of my head: sparsity; difficult to measure; impractical to devote company resources to it; lack of technical expertise to actually build or acquire it; lazy (yours truly, except for that one time). Through simulation we can generate our own dataset with the added benefit of fully understanding what features we choose to put in our models (or leave out). ...

Find your favorite Twitter user with the rtweet package

Do you know who your favorite person on Twitter is? Probably! Did you ever want to quantify that statement? Probably not! Are you curious to find out who someone else’s favorite Twitter user is? Now you can with R! The code below is brought to you by Namita and her hilarious tweet:

"face some possibly uncomfortable truths about yourself and others with 4 easy lines of code using #rtweet and the #tidyverse pic.twitter.com/JtRnzk0xu7" ...

TidyTuesday: Steam Games

Data from #tidytuesday week of 2019-07-30 (source)

Load R packages

    library(tidyverse)
    library(RColorBrewer)
    library(scales)

Download data

    steam_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-30/video_games.csv")

Parse data

    steam_games <- steam_raw %>%
      # VARIABLE FOR AGE OF GAME
      mutate(release_year = substring(release_date, 8, 12), # EXTRACT YEAR
             release_year = as.numeric(str_trim(release_year)),
             release_year = case_when(release_year == 5 ~ 2015, # INCORRECT DATA POINT
                                      TRUE ~ release_year),
             age = 2019 - release_year) %>%
      # VARIABLE FOR MIN/MAX NUMBER OF OWNERS
      mutate(max_owners = str_trim(word(owners, 2, sep = "\\..")),
             max_owners = as.numeric(str_replace_all(max_owners, ",", "")),
             min_owners = str_trim(word(owners, 1, sep = "\\..")),
             min_owners = as.numeric(str_replace_all(min_owners, ",", ""))) %>%
      # REMOVE VALUES WITH INCONSISTENT RELEASE_DATE FORMAT (n=37)
      filter(age < 15) %>%
      # FILTER OUT STUDIO SOFTWARE
      filter(price < 150)

Visualize data

Question: how many people still play games that are X years old (on Steam)? ...

Classifying keywords with the fuzzyjoin R package

A few months ago I tweeted a complex (and tedious) Excel formula on how to classify keywords:

"For the #seo who insists on completing their keyword/intent research in Excel. Philosophy: keyword intent is not absolute so it won't fall neatly into an assigned bucket. For this reason a keyword can live under multiple conversion funnels since we can't be 100% certain. pic.twitter.com/JcTl9P11mC"
Christopher Yee (@Eeysirhc), April 24, 2019

I then ended it with: ...
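To make the "a keyword can live under multiple funnels" idea concrete, here is a small sketch using `regex_inner_join()` from the fuzzyjoin package. The keywords, regex patterns, and intent labels are all invented for illustration, and this is not necessarily how the full post structures its lookup table.

```r
library(fuzzyjoin)

# Hypothetical keywords to classify (made up for this example).
keywords <- data.frame(
  keyword = c("buy running shoes",
              "best running shoes review",
              "how to clean running shoes",
              "buy best trail shoes"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one regex pattern per intent bucket.
intents <- data.frame(
  pattern = c("buy|price|cheap", "best|review|vs", "how|what|why"),
  intent  = c("transactional", "commercial", "informational"),
  stringsAsFactors = FALSE
)

# Keep every (keyword, intent) pair where the keyword matches the pattern.
# A keyword that matches several patterns appears once per matching intent.
classified <- regex_inner_join(keywords, intents,
                               by = c("keyword" = "pattern"))

classified[, c("keyword", "intent")]
```

Here "buy best trail shoes" should land in both the transactional and commercial buckets, which is the behavior the tweet argues for.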

Visualizing Netflix viewing activity

If you are like me then it’s very likely you share your Netflix account with multiple users. If you are also like me then it’s very likely you would be curious about how your Netflix viewing activity compares and contrasts with that of all the parasites on your account! In this post we’ll leverage #rstats to visualize what that looks like.

Load packages

Let’s fire up our favorite packages.

    library(tidyverse)
    library(lubridate)
    library(igraph)
    library(ggraph)
    library(tidygraph)
    library(influenceR)

Download data

With the exception of my own viewing activity (I’m not ashamed!), I have provided anonymized Netflix viewing data from a few family and friends for you to follow along. ...

Mining Google Trends data with R

Google Trends is great for understanding relative search popularity for a given keyword or phrase. However, if we want to explore the topics further, retrieving that data through the web interface is quite clunky. Enter the gtrendsR package for #rstats, and what better way to demonstrate how it works than by pulling search popularity for ramen, pho, and spaghetti (hot on the heels of my last article about ramen ratings)! ...

TidyTuesday: Ramen Ratings

Data from #tidytuesday week of 2019-06-04 (source)

Load R packages

    library(tidyverse)
    library(plotly)

Download and parse data frame

    ramen_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-04/ramen_ratings.csv")

    ramen <- ramen_raw %>%
      group_by(brand, country) %>%
      summarize(avg_rating = round(mean(stars), 2),
                total_reviews = n())

Build plotly chart

    plot_ly(data = ramen,
            x = ~total_reviews,
            y = ~avg_rating,
            size = 15,
            color = ~country,
            colors = 'Paired',
            text = ~paste("Brand: ", brand, "<br>Average Rating: ", avg_rating),
            showlegend = FALSE) %>%
      layout(xaxis = list(title = "Total Reviews"),
             yaxis = list(title = "Average Ratings"))