Blogs

Exploratory data analysis on COVID-19 search queries

The team at Bing were generous enough to release search query data with COVID-19 intent. The files are broken down by country and state level granularity so we can understand how the world is coping with the pandemic through search. What follows is an exploratory analysis on how US Bing users are searching for COVID-19 (a.k.a. coronavirus) information.

tl;dr COVID-19 search queries generally fall into five distinct categories:

1. Awareness
2. Consideration
3. Management
4. Unease
5. Advocacy (?)

...

TidyTuesday: Beer Production

Data from #tidytuesday week of 2020-03-31 (source)

Load packages

    library(tidyverse)
    library(gganimate)
    library(gifski)

Download data

    beer_states_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_states.csv")

Clean data

    beer_total <- beer_states_raw %>%
      # FILL NULL VALUES WITH 0
      replace(., is.na(.), 0) %>%
      # REMOVE LINE ITEM FOR 'TOTAL'
      filter(state != 'total') %>%
      # COMPUTE TOTAL BARRELS PER YEAR BY STATE
      group_by(year, state) %>%
      summarize(total_barrels = sum(barrels)) %>%
      ungroup()

Create rankings

    beer_final <- beer_total %>%
      group_by(year) %>%
      mutate(
        # CALCULATE RANKINGS BY TOTAL BARRELS PRODUCED EACH YEAR
        rank = min_rank(-total_barrels) * 1.0,
        # STATE TOTAL DIVIDE BY STATE RANKED #1 PER YEAR
        produced = total_barrels / total_barrels[rank == 1],
        # CLEANED TEXT LABEL
        produced_label = paste0(" ", round(total_barrels / 1e6, 2), " M")) %>%
      group_by(state) %>%
      # SELECT TOP 20
      filter(rank <= 20) %>%
      ungroup()

Animate bar chart

    p <- beer_final %>%
      ggplot(aes(rank, produced, fill = state)) +
      geom_col(show.legend = FALSE) +
      geom_text(aes(rank, y = 0, label = state, hjust = 1.5)) +
      geom_text(aes(rank, y = produced, label = produced_label, hjust = 0)) +
      coord_flip() +
      scale_x_reverse() +
      theme_minimal(base_size = 15) +
      theme(axis.text.x = element_blank(),
            axis.text.y = element_blank()) +
      transition_time(year) +
      labs(title = "US Beer Production by State",
           subtitle = "Barrels produced each year: {round(frame_time)}",
           caption = "by: @eeysirhc\nsource: Alcohol and Tobacco Tax and Trade Bureau",
           x = NULL, y = NULL)

    animate(p, nframes = 300, fps = 12, width = 1000, height = 800, renderer = gifski_renderer())

...

Script to track COVID-19 cases in the US

A couple weeks ago I shared an #rstats script to track global coronavirus cases by country. The New York Times also released COVID-19 data for new cases in the United States, both at the state and county level. You can run the code below on a daily basis to get the most up to date figures. Feel free to modify for your own needs:

    library(scales)
    library(tidyverse)
    library(gghighlight)

    state <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
    county <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")

State

    state %>%
      group_by(date, state) %>%
      mutate(total_cases = cumsum(cases)) %>%
      ungroup() %>%
      filter(total_cases >= 100) %>% # MINIMUM 100 CASES
      group_by(state) %>%
      mutate(day_index = row_number(),
             n = n()) %>%
      ungroup() %>%
      filter(n >= 12) %>% # MINIMUM 12 DAYS
      ggplot(aes(day_index, total_cases, color = state, fill = state)) +
      geom_point() +
      geom_smooth() +
      gghighlight() +
      scale_y_log10(labels = comma_format()) +
      facet_wrap(~state, ncol = 4) +
      labs(title = "COVID-19: cumulative daily new cases by US states (log scale)",
           x = "Days since 100th reported case",
           y = NULL, fill = NULL, color = NULL,
           caption = "by: @eeysirhc\nSource: New York Times") +
      theme_minimal() +
      theme(legend.position = 'none') +
      expand_limits(x = 30)

...

TardyThursday: College Tuition, Diversity & Pay

The differences between this unsanctioned #tardythursday and the official #tidytuesday:

- These will publish on Thursday (obviously)
- The dataset will come from a completely different week of TidyTuesday
- For a surprise, I’ll code with either #rstats or python (similar to #makeovermonday)

Load modules

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

Download and parse data

    df_raw=pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/salary_potential.csv")

    df=df_raw[['state_name', 'early_career_pay', 'mid_career_pay']].groupby('state_name').mean().reset_index()

Visualize dataset

    sns.set(style="darkgrid")
    plt.figure(figsize=(20,15))

    g=sns.regplot(x="early_career_pay", y="mid_career_pay", data=df)

    for line in range(0,df.shape[0]):
        g.text(df.early_career_pay[line]+0.01, df.mid_career_pay[line], df.state_name[line],
               horizontalalignment='left', size='medium', color='black')

    plt.xlabel("Early Career Pay")
    plt.ylabel("Mid Career Pay")
    plt.title("Average Salary Potential by State: Early vs Mid Career", x=0.01, horizontalalignment="left", fontsize=16)
    plt.figtext(0.9, 0.09, "by: @eeysirhc", horizontalalignment="right")
    plt.figtext(0.9, 0.08, "Source: TuitionTracker.org", horizontalalignment="right")

    plt.show()

...

Script to track global Coronavirus pandemic cases

The coronavirus (a.k.a. COVID-19) is taking the world by storm, with the World Health Organization officially characterizing the situation as a pandemic. I’m not an infectious disease expert but I couldn’t resist writing a quick #rstats script to visualize the total number of cases by country. Feel free to use and modify for your own needs:

    # LOAD PACKAGES
    library(tidyverse)
    library(scales)
    library(gghighlight)

    # DOWNLOAD DATA
    df <- read_csv("https://covid.ourworldindata.org/data/ecdc/full_data.csv")

    # PARSE DATA
    df_parsed <- df %>%
      filter(total_cases >= 100) %>% # MINIMUM 100 CASES
      group_by(location) %>%
      mutate(n = n(),
             day_index = row_number()) %>%
      ungroup() %>%
      filter(n >= 25, # MINIMUM 25 DAYS
             !location %in% c('World', 'International')) # EXCLUDE

    # GRAPH
    df_parsed %>%
      ggplot(aes(day_index, total_cases, color = location, fill = location)) +
      geom_point() +
      geom_smooth() +
      gghighlight() +
      scale_y_log10(labels = comma_format()) +
      labs(title = "COVID-19: cumulative daily new cases by country (log scale)",
           x = "Days since 100th reported case",
           y = NULL, fill = NULL, color = NULL,
           caption = "by: @eeysirhc\nSource: Our World in Data") +
      facet_wrap(~location, ncol = 4) +
      expand_limits(x = 70) +
      theme_minimal() +
      theme(legend.position = 'none')

...

Using R to calculate car lease payments

Purchasing a car is a significant time and financial commitment. There is so much at stake that the required song and dance with the sales manager doesn’t alleviate any fears about overpaying. Thus, it is difficult to determine the equilibrium point at which the dealer will accept your offer versus how much you are willing to pay.

I decided to write this for a few reasons:

- I am in the market for a new car
- Rather than doing actual car shopping I thought it would be more fun to procrastinate
- Online calculators are quite clunky when you want to compare and contrast monthly payments
- Hopefully, this will help others make more informed car buying decisions (TBD on Shiny app)

Note: this guide will focus only on leasing and not the auto financing aspect of it ...

How to interact with Slack from R

I think my tweet speaks for itself:

"Words can not express how excited I am to use this :D" (Christopher Yee, @Eeysirhc, March 10, 2020)

The goal of this article is to document how to send #rstats code and plots directly to Slack.

Load packages

    library(slackr)
    library(slackteams)
    library(slackreprex)

Slack credentials

Member ID

You can easily grab that from this guide here.

Slack Key ID

To retrieve your Slack key ID, log in here and then follow the prompts. ...

MakeoverMonday: Women in the Workforce

The goal of #makeovermonday is to transform some of my #rstats articles and visualizations to their python equivalent. The original plot for this #tidytuesday dataset can be found here.

Load modules

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

Download and parse data

    # on_bad_lines='skip' REPLACES error_bad_lines=False, WHICH WAS REMOVED IN PANDAS 2.0 (NEEDS PANDAS >= 1.3)
    df_raw = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv",
                         sep=',', on_bad_lines='skip', index_col=False, dtype='unicode')

    # FILTER ONLY FOR 2016
    df_raw = df_raw[df_raw['year']=='2016']

    df_raw = df_raw[['major_category', 'total_earnings_male', 'total_earnings_female', 'total_earnings', 'total_workers', 'workers_male', 'workers_female']]

    # REMOVE NULL VALUES
    df_raw = df_raw.dropna()

Clean data

Need to transform our data from objects to numerical values. ...
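The excerpt cuts off before the conversion code, so what follows is only a minimal sketch of one common way to turn text columns into numbers with pandas, not the post's actual code. The toy values and the `df_toy` and `numeric_cols` names are assumptions made up for illustration; only the column names come from the excerpt.

```python
import pandas as pd

# Toy stand-in for the filtered data: every column is text, as happens
# when a CSV is read with dtype='unicode'. Values are invented.
df_toy = pd.DataFrame({
    "major_category": ["Sales and Office", "Service"],
    "total_earnings": ["50000", "32000"],
    "workers_female": ["1200", "3400"],
})

# Hypothetical choice of which columns should become numeric
numeric_cols = ["total_earnings", "workers_female"]

# pd.to_numeric parses strings into numbers; errors="coerce" turns
# anything unparseable into NaN instead of raising an error
df_toy[numeric_cols] = df_toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(df_toy.dtypes)
```

Converting an explicit list of columns leaves the text column `major_category` untouched, which matters if it is later used for grouping.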