TidyTuesday: Cetaceans Dataset

Analyzing data for #tidytuesday week of 12/18/2018 (source)

    # LOAD PACKAGES AND PARSE DATA
    library(tidyverse)
    library(scales)
    library(RColorBrewer)
    library(forcats)
    library(lubridate)
    library(tidytext)

    cetaceans_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-12-18/allCetaceanData.csv")

    cetaceans <- cetaceans_raw

Most notable cause of death between Male vs Female?

    cetaceans %>%
      select(sex, COD) %>%
      filter(sex != "U") %>%
      na.omit() %>%
      mutate(sex = replace(sex, str_detect(sex, "F"), "Female"),
             sex = replace(sex, str_detect(sex, "M"), "Male")) %>%
      unnest_tokens(bigram, COD, token = "ngrams", n = 2) %>%
      count(sex, bigram) %>%
      bind_tf_idf(bigram, sex, n) %>%
      arrange(desc(tf_idf)) %>%
      filter(tf_idf > 0.0011) %>%
      ggplot() +
      geom_col(aes(reorder(bigram, tf_idf), tf_idf, fill = sex)) +
      coord_flip() +
      scale_fill_brewer(palette = 'Set2', name = "") +
      labs(x = "", y = "",
           title = "Bigrams with highest TF-IDF for cause of death \n between Cetacean genders",
           caption = "Source: The Pudding") +
      theme_bw()

...

December 18, 2018 · Christopher Yee

TidyTuesday: NYC Restaurant Inspections

Analyzing data for #tidytuesday week of 12/11/2018 (source)

    # LOAD PACKAGES AND PARSE DATA
    library(tidyverse)
    library(scales)
    library(RColorBrewer)
    library(forcats)
    library(lubridate)
    library(ebbr)

    nyc_restaurants_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-12-11/nyc_restaurants.csv")

    nyc_restaurants <- nyc_restaurants_raw %>%
      filter(inspection_date != '01/01/1900')

What is the rate of “A” inspection grades by cuisine type?

First step is to compute the relevant statistics

    cuisine_grades <- nyc_restaurants %>%
      select(cuisine_description, grade) %>%
      na.omit() %>%
      group_by(cuisine_description) %>%
      count(grade) %>%
      mutate(total = sum(n),
             pct_total = n/total) %>%
      ungroup()

Next we apply empirical Bayesian estimation and filter the top 20 results

...
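A minimal sketch of what an empirical Bayesian step of this kind can look like, assuming a beta prior fitted by the method of moments. The counts and cuisine labels below are invented for illustration, and this is plain base R instead of the `ebbr` package the post loads, so treat it as a stand-in and not the post's actual code.

```r
# Toy counts, invented for illustration:
# a = number of "A" grades, n = number of graded inspections
grades <- data.frame(
  cuisine = c("c1", "c2", "c3", "c4", "c5", "c6"),
  a       = c(9, 45, 160, 400, 3, 700),
  n       = c(10, 60, 200, 520, 3, 850)
)

# Raw rate of "A" grades per cuisine
grades$raw <- grades$a / grades$n

# Fit a Beta(alpha0, beta0) prior to the raw rates by the method of moments
m <- mean(grades$raw)
v <- var(grades$raw)
k <- m * (1 - m) / v - 1   # implied prior "sample size"; must be positive
alpha0 <- m * k
beta0  <- (1 - m) * k

# Posterior mean: each raw rate is pulled toward the prior mean,
# and cuisines with few inspections are pulled the most
grades$eb <- (grades$a + alpha0) / (grades$n + alpha0 + beta0)

# Rank cuisines by the shrunken estimate
grades[order(-grades$eb), ]
```

The point of the shrinkage is that a cuisine with 3 "A" grades out of 3 inspections no longer outranks one with hundreds of inspections on the strength of a tiny sample.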

December 11, 2018 · Christopher Yee

TidyTuesday: Medium Article Metadata

Analyzing data for #tidytuesday week of 12/4/2018 (source)

    # LOAD PACKAGES AND PARSE DATA
    library(tidyverse)
    library(scales)
    library(RColorBrewer)
    library(forcats)
    library(tidytext)
    library(stringr)

    articles_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-12-04/medium_datasci.csv")

    articles <- articles_raw

Who are the top 10 authors in terms of total articles published?

    top_authors <- articles %>%
      select(author) %>%
      group_by(author) %>%
      count() %>%
      arrange(desc(n)) %>%
      na.omit() %>%
      head(10)

    top_authors %>%
      ggplot() +
      geom_col(aes(reorder(author, n), n), fill = "darkslategray4", alpha = 0.8) +
      coord_flip() +
      theme_bw() +
      labs(x = "", y = "",
           title = "Top 10 authors on Medium in terms of total articles published")

...

December 4, 2018 · Christopher Yee

TidyTuesday: Baltimore Bridges

Analyzing data for #tidytuesday week of 11/27/2018 (source)

    # LOAD PACKAGES AND PARSE DATA
    library(tidyverse)
    library(scales)
    library(RColorBrewer)
    library(forcats)

    bridges_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-27/baltimore_bridges.csv")

    bridges <- bridges_raw

Do bridge conditions get better over time?

    # REORDER BRIDGE_CONDITION FACTORS
    x <- bridges
    x$bridge_condition <- as.factor(x$bridge_condition)
    x$bridge_condition <- factor(x$bridge_condition, levels = c("Poor", "Fair", "Good"))

    x %>%
      filter(yr_built >= 1900) %>% # removing 2017 due to outlier
      select(lat, long, yr_built, bridge_condition, avg_daily_traffic) %>%
      group_by(yr_built, bridge_condition) %>%
      summarize(avg_daily_traffic = mean(avg_daily_traffic)) %>%
      ggplot() +
      geom_col(aes(yr_built, avg_daily_traffic, fill = bridge_condition), alpha = 0.3) +
      scale_y_continuous(label = comma_format(), limits = c(0, 223000)) +
      scale_fill_brewer(palette = 'Set1') +
      scale_color_brewer(palette = 'Set1') +
      geom_smooth(aes(yr_built, avg_daily_traffic, color = bridge_condition), se = FALSE) +
      theme_bw() +
      labs(x = "", y = "",
           title = "Baltimore bridges: average daily traffic by year built",
           subtitle = "Applied smoothing to highlight differences in bridge conditions and dampen outliers",
           fill = "Bridge Condition",
           color = "Bridge Condition")

...

November 27, 2018 · Christopher Yee

TidyTuesday: Thanksgiving Dinner

Analyzing data for #tidytuesday week of 11/20/2018 (source)

    # LOAD PACKAGES AND PARSE DATA
    library(tidyverse)
    library(scales)
    library(RColorBrewer)
    library(forcats)

    thanksgiving_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-20/thanksgiving_meals.csv")

    thanksgiving <- thanksgiving_raw %>%
      filter(celebrate != 'No')

What are the most popular pies for Thanksgiving?

    thanksgiving %>%
      select(pie1:pie13) %>%
      pivot_longer(pie1:pie13, names_to = "pie_type") %>%
      filter(value != 'None') %>%
      select(value) %>%
      group_by(value) %>%
      count() %>%
      filter(n > 10) %>%
      ungroup() %>%
      ggplot(aes(reorder(value, n), n, label = n)) +
      geom_bar(aes(fill = value), alpha = 0.9, stat='identity') +
      coord_flip() +
      theme_classic() +
      theme(legend.position = 'none') +
      labs(title = "Most Popular Pies for Thanksgiving (n=980)",
           subtitle = "Question: Which type of pie is typically served at your Thanksgiving dinner? \n Please select all that apply",
           x = "", y = "")

...

November 20, 2018 · Christopher Yee

For the Love of Data, Segment!

Aggregated data is misleading. Let’s read that again: aggregated data is misleading. Why? Because the homogenized set buries the meaningful insights. For example, I recently came across a competitive SEO analysis that examined the relationship between the number of ranking organic keywords and the estimated traffic from organic search for a handful of websites. In my opinion, this is a great start for understanding the opportunity size of a market and how a given business stacks up against its competitors. ...
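To make the point concrete, here is a small sketch with invented numbers (the segment names, keyword counts, and traffic figures are all hypothetical and do not come from the analysis mentioned above). Within each segment, more ranking keywords go with more traffic, yet pooling the segments flips the sign of the relationship.

```r
# Hypothetical data, invented for illustration only
sites <- data.frame(
  segment  = rep(c("brand", "non_brand"), each = 4),
  keywords = c(100, 200, 300, 400, 1000, 2000, 3000, 4000),
  traffic  = c(5000, 6000, 7000, 8000, 500, 600, 700, 800)
)

# Correlation on the aggregated data
pooled <- cor(sites$keywords, sites$traffic)

# Correlation within each segment
by_segment <- sapply(
  split(sites, sites$segment),
  function(d) cor(d$keywords, d$traffic)
)

pooled      # negative: the aggregate suggests more keywords means less traffic
by_segment  # positive in both segments
```

This is the extreme case, but it shows why looking only at the aggregate can point you in the wrong direction.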

August 29, 2018 · Christopher Yee

Data Viz: Top Marketing Words in LinkedIn Job Titles

I abhor tabulated data for a number of reasons:

- Quite difficult on the human eye to spot trends
- Puts a burden on the end user to spend extra time digesting the information
- True insights get lost because the devil is in the details

In fact, individuals who join Square’s SEO team (I’m hiring by the way) are required to read this book on how to visualize data before making any presentations - an SEO bible, if you will. ...

March 10, 2018 · Christopher Yee

On Innovation (Mini-Rant)

TL;DR: implementing a system to “drive” (e.g. creating a team or process) and “celebrate” (e.g. recognition awards, because let’s be real, those are just popularity contests) innovation is counter-productive to progress. I firmly believe innovation can only be fostered via company culture, environment, public forum, mind share, etc.

What is innovation? I define it as the application of an idea in a useful and novel way to ensure an entity remains relevant while fundamentally changing the way it is perceived. ...

February 20, 2018 · Christopher Yee