TidyTuesday

TidyTuesday: Cocktails pt.2

This is part 2 of TidyTuesday: Cocktails. The code below uses #rstats to build a cocktail recommendation system that takes in a drink and returns a few other cocktails with similarly mixed ingredients.

Load libraries

library(tidyverse)
library(recommenderlab)

Download and parse data

Note: please check out part 1 for details on the processing steps.

bc_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv')

bc <- bc_raw %>%
  mutate(ingredient = str_to_lower(ingredient)) %>%
  distinct() %>%
  select(name, ingredient)

bc_tidy <- bc %>%
  filter(!str_detect(ingredient, ","))

bc_untidy <- bc %>%
  filter(str_detect(ingredient, ",")) %>%
  mutate(ingredient = str_split(ingredient, ", ")) %>%
  unnest(ingredient)

bc_clean <- rbind(bc_tidy, bc_untidy) %>%
  distinct()

df <- bc_clean %>%
  mutate(ingredient = str_replace_all(ingredient, "-", "_"),
         ingredient = str_replace_all(ingredient, " ", "_"),
         ingredient = str_replace_all(ingredient, "old_mr._boston_", ""),
         ingredient = str_replace_all(ingredient, "old_thompson_", ""))

df_processed <- df %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = name) %>%
  replace(is.na(.), 0)

Recommendation algorithm

Transform data to binary rating matrix

cocktails_matrix <- df_processed %>%
  select(-ingredient) %>%
  as.matrix() %>%
  as("binaryRatingMatrix")

Create evaluation scheme

scheme <- cocktails_matrix %>%
  evaluationScheme(method = "cross", k = 5, train = 0.8, given = -1)

Input customer cocktail preference

Let’s check the ingredients for a very simple cocktail: ...
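To make the idea of recommending drinks by "similarly mixed ingredients" concrete, here is a minimal Python sketch. It is not the recommenderlab pipeline from the post: it swaps in plain Jaccard similarity between ingredient sets, and the cocktail names, the ingredients, and the `jaccard` and `recommend` helpers are all made up for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data (not from the Boston cocktails file):
# cocktail name -> set of cleaned ingredient names.
COCKTAILS: Dict[str, Set[str]] = {
    "gin_and_tonic": {"gin", "tonic_water", "lime"},
    "gin_rickey": {"gin", "club_soda", "lime"},
    "vodka_tonic": {"vodka", "tonic_water", "lime"},
    "old_fashioned": {"bourbon", "bitters", "sugar"},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of ingredients two cocktails have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(name: str, cocktails: Dict[str, Set[str]], n: int = 2) -> List[Tuple[str, float]]:
    """Return the n cocktails most similar to `name`, best match first."""
    target = cocktails[name]
    scored = [
        (other, jaccard(target, ingredients))
        for other, ingredients in cocktails.items()
        if other != name
    ]
    # Sort by similarity (descending), then by name so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


if __name__ == "__main__":
    for other, score in recommend("gin_and_tonic", COCKTAILS):
        print(other, score)
```

A binary ingredient matrix like the one built above carries the same information as these sets, so a set-based similarity is one simple way to reason about what a recommender has to work with.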

TidyTuesday: Cocktails

Data from #tidytuesday week of 2020-05-26 (source)

If you are looking for the R script then you can find it here

Load packages

library(tidyverse)
library(ggrepel)
library(FactoMineR)

Download data

bc_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv')

Data processing

Standardize cases

bc_raw %>%
  count(ingredient, sort = TRUE) %>%
  filter(str_detect(ingredient, "red pepper sauce"))

## # A tibble: 2 x 2
##   ingredient               n
##   <chr>                <int>
## 1 Hot red pepper sauce     4
## 2 hot red pepper sauce     1

Let’s fix that by converting all ingredient values to lower case: ...
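As a quick illustration of why the case fix matters, the sketch below reproduces the two counts shown above (4 and 1) with Python's standard library and shows them collapsing into a single entry once the values are lower-cased. The list is a hand-built stand-in for the ingredient column, not the real data.

```python
from collections import Counter

# Stand-in for the ingredient column: the two spellings and their
# counts are taken from the tibble above, nothing else is real data.
ingredients = ["Hot red pepper sauce"] * 4 + ["hot red pepper sauce"]

# Counting raw values treats the two spellings as different ingredients.
raw_counts = Counter(ingredients)

# Lower-casing first merges them into one ingredient.
clean_counts = Counter(value.lower() for value in ingredients)

print(raw_counts)
print(clean_counts)
```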

TidyTuesday: Volcano Eruptions (python)

Data from #tidytuesday week of 2020-05-12 (source) but plotting in python.

Load modules

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Download and parse data

volcano_raw = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv")

volcano = volcano_raw[['primary_volcano_type', 'elevation']].sort_values(by='elevation', ascending=False)

Visualize dataset

sns.set(style="darkgrid")
plt.figure(figsize=(20,15))

p = sns.boxplot(x=volcano.elevation, y=volcano.primary_volcano_type)
p = sns.swarmplot(x=volcano.elevation, y=volcano.primary_volcano_type, color=".35")

plt.xlabel("Elevation")
plt.ylabel("")
plt.title("What is the average elevation by volcano type?", x=0.01, horizontalalignment="left", fontsize=20)
plt.figtext(0.9, 0.08, "by: @eeysirhc", horizontalalignment="right")
plt.figtext(0.9, 0.07, "Source: The Smithsonian Institute", horizontalalignment="right")

plt.show()

...

TidyTuesday: Animal Crossing

Data from #tidytuesday week of 2020-05-05 (source)

Load packages

library(tidyverse)
library(ggfortify)

Download data

villagers_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv')

Process data

villagers <- villagers_raw %>%
  select(gender, species, personality) %>%
  mutate(species = str_to_title(species)) %>%
  group_by(gender, species, personality) %>%
  summarize(n = n()) %>%
  mutate(pct_total = n / sum(n)) %>%
  ungroup()

Visualize data

villagers %>%
  ggplot(aes(personality, pct_total, fill = gender, color = gender, group = gender)) +
  geom_polygon(alpha = 0.5) +
  geom_point() +
  coord_polar() +
  facet_wrap(~species) +
  labs(x = NULL, y = NULL, color = NULL, fill = NULL,
       title = "Animal Crossing: villager personality traits by species & gender",
       caption = "by: @eeysirhc\nsource:VillagerDB") +
  theme_bw() +
  theme(legend.position = 'top',
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

...
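One detail in the processing step is easy to miss: after summarize() the data is still grouped by gender and species (dplyr drops only the last grouping variable by default), so pct_total is each personality's share within a gender and species pair, not a share of all villagers. The Python sketch below shows the same calculation on a few made-up rows; the rows and the `personality_shares` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical rows in the shape (gender, species, personality).
ROWS: List[Tuple[str, str, str]] = [
    ("female", "Cat", "normal"),
    ("female", "Cat", "normal"),
    ("female", "Cat", "peppy"),
    ("male", "Cat", "lazy"),
]


def personality_shares(rows: List[Tuple[str, str, str]]) -> Dict[Tuple[str, str, str], float]:
    """Share of each personality within its (gender, species) group."""
    counts: Dict[Tuple[str, str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for gender, species, personality in rows:
        counts[(gender, species, personality)] += 1  # n per full combination
        totals[(gender, species)] += 1               # sum(n) per remaining group
    return {key: n / totals[key[:2]] for key, n in counts.items()}


if __name__ == "__main__":
    for key, share in personality_shares(ROWS).items():
        print(key, round(share, 3))
```

Within each gender and species pair the shares add up to 1, which is what lets the polar plot compare personality mixes across species of very different sizes.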

TidyTuesday: Beer Production

Data from #tidytuesday week of 2020-03-31 (source)

Load packages

library(tidyverse)
library(gganimate)
library(gifski)

Download data

beer_states_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_states.csv")

Clean data

beer_total <- beer_states_raw %>%
  # FILL NULL VALUES WITH 0
  replace(., is.na(.), 0) %>%
  # REMOVE LINE ITEM FOR 'TOTAL'
  filter(state != 'total') %>%
  # COMPUTE TOTAL BARRELS PER YEAR BY STATE
  group_by(year, state) %>%
  summarize(total_barrels = sum(barrels)) %>%
  ungroup()

Create rankings

beer_final <- beer_total %>%
  group_by(year) %>%
  mutate(
    # CALCULATE RANKINGS BY TOTAL BARRELS PRODUCED EACH YEAR
    rank = min_rank(-total_barrels) * 1.0,
    # STATE TOTAL DIVIDE BY STATE RANKED #1 PER YEAR
    produced = total_barrels / total_barrels[rank == 1],
    # CLEANED TEXT LABEL
    produced_label = paste0(" ", round(total_barrels / 1e6, 2), " M")) %>%
  group_by(state) %>%
  # SELECT TOP 20
  filter(rank <= 20) %>%
  ungroup()

Animate bar chart

p <- beer_final %>%
  ggplot(aes(rank, produced, fill = state)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(rank, y = 0, label = state, hjust = 1.5)) +
  geom_text(aes(rank, y = produced, label = produced_label, hjust = 0)) +
  coord_flip() +
  scale_x_reverse() +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  transition_time(year) +
  labs(title = "US Beer Production by State",
       subtitle = "Barrels produced each year: {round(frame_time)}",
       caption = "by: @eeysirhc\nsource: Alcohol and Tobacco Tax and Trade Bureau",
       x = NULL, y = NULL)

animate(p, nframes = 300, fps = 12, width = 1000, height = 800, renderer = gifski_renderer())

...
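The ranking step does two things for each year: it ranks states by total barrels with ties sharing the lowest rank (what min_rank does), and it rescales every total by the year's top producer so bar lengths stay comparable from frame to frame. Here is a rough Python sketch of that logic for a single year. The state totals are invented, and `min_rank_desc` and `rank_and_scale` are illustrative helpers rather than part of any library.

```python
from typing import Dict, List

# Hypothetical totals for one year (barrels), chosen to include a tie.
TOTALS: Dict[str, int] = {
    "CA": 3_000_000,
    "CO": 1_500_000,
    "PA": 1_500_000,
    "VT": 300_000,
}


def min_rank_desc(values: List[float]) -> List[int]:
    """Rank from largest to smallest; ties share the smallest rank (1, 2, 2, 4)."""
    return [1 + sum(other > value for other in values) for value in values]


def rank_and_scale(totals: Dict[str, int], top_n: int = 20) -> List[dict]:
    """Rank states by total, scale by the leader, and keep the top_n ranks."""
    states = list(totals)
    values = [totals[state] for state in states]
    ranks = min_rank_desc(values)
    leader = max(values)  # assumes at least one positive total
    rows = [
        {"state": state, "rank": rank, "produced": value / leader}
        for state, value, rank in zip(states, values, ranks)
        if rank <= top_n
    ]
    rows.sort(key=lambda row: (row["rank"], row["state"]))
    return rows


if __name__ == "__main__":
    for row in rank_and_scale(TOTALS, top_n=3):
        print(row)
```

Because tied states share a rank, a rank filter can keep more rows than the cutoff suggests, and tied bars would be drawn at the same position in the chart.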

TardyThursday: College Tuition, Diversity & Pay

The differences between this unsanctioned #tardythursday and the official #tidytuesday:

- These will publish on Thursday (obviously)
- The dataset will come from a completely different week of TidyTuesday
- For a surprise, I’ll code with either #rstats or python (similar to #makeovermonday)

Load modules

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Download and parse data

df_raw = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/salary_potential.csv")

df = df_raw[['state_name', 'early_career_pay', 'mid_career_pay']].groupby('state_name').mean().reset_index()

Visualize dataset

sns.set(style="darkgrid")
plt.figure(figsize=(20,15))

g = sns.regplot(x="early_career_pay", y="mid_career_pay", data=df)

for line in range(0, df.shape[0]):
    g.text(df.early_career_pay[line]+0.01, df.mid_career_pay[line], df.state_name[line],
           horizontalalignment='left', size='medium', color='black')

plt.xlabel("Early Career Pay")
plt.ylabel("Mid Career Pay")
plt.title("Average Salary Potential by State: Early vs Mid Career", x=0.01, horizontalalignment="left", fontsize=16)
plt.figtext(0.9, 0.09, "by: @eeysirhc", horizontalalignment="right")
plt.figtext(0.9, 0.08, "Source: TuitionTracker.org", horizontalalignment="right")

plt.show()

...

MakeoverMonday: Women in the Workforce

The goal of #makeovermonday is to transform some of my #rstats articles and visualizations to their python equivalent. The original plot for this #tidytuesday dataset can be found here.

Load modules

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Download and parse data

# SKIP MALFORMED ROWS WITH A WARNING (on_bad_lines REQUIRES pandas >= 1.3)
df_raw = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv",
                     sep=',', on_bad_lines='warn', index_col=False, dtype='unicode')

# FILTER ONLY FOR 2016
df_raw = df_raw[df_raw['year']=='2016']

df_raw = df_raw[['major_category', 'total_earnings_male', 'total_earnings_female', 'total_earnings',
                 'total_workers', 'workers_male', 'workers_female']]

# REMOVE NULL VALUES
df_raw = df_raw.dropna()

Clean data

We need to transform our data from objects to numerical values. ...

TidyTuesday: Adoptable Dogs

Data from #tidytuesday week of 2019-12-17 (source)

Quick post to showcase the amazing {reticulate} package which has made my life so much easier! Who said you had to choose between R vs Python?

Load packages

library(tidyverse)
library(reticulate)

R then Python

Grab and parse data

df_rdata <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-17/dog_moves.csv")

df_rdata <- df_rdata %>%
  filter(inUS == 'TRUE') %>%
  select(location, total)

df_rdata %>% head()

## # A tibble: 6 x 2
##   location       total
##   <chr>          <dbl>
## 1 Texas            566
## 2 Alabama         1428
## 3 North Carolina  2627
## 4 South Carolina  1618
## 5 Georgia         3479
## 6 California      1664

Plot data

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# note the r. before the df_rdata value
fig = sns.barplot(x="total", y="location", data=r.df_rdata, orient="h")

plt.xlabel("Adoptable Dogs Available")
plt.ylabel("")
plt.figtext(0.9, 0.03, "by: @eeysirhc", horizontalalignment="right")
plt.figtext(0.9, 0.01, "source: The Pudding", horizontalalignment="right")

# plt.show() takes no figure argument; it displays the current figure
plt.show()

...