One of my favorite online marketers, (the) Glen Allsopp, tweeted the following:

The public spreadsheet contains four fields:

  • Inc.com URL
  • URLs (company website)
  • Revenue
  • 3-Year-Growth

Although helpful, I thought it would be interesting to explore the additional variables found on each company’s Inc. profile page.

Thus, I fired up R and scraped the list of URLs to answer two questions: which industries were surveyed the most, and how much revenue did they generate in 2019?

tl;dr

Load packages

library(tidyverse)
library(rvest) # SIMPLE WEB SCRAPING

Get URLs from CSV

inc5000 <- read_csv("inc5000_fastest_growing_companies.csv") %>% 
  rename(urls = `Inc.com URL`,
         website = URLs)

Base R format

We temporarily need to move out of the tidyverse and leverage base R for the next step.

company_urls <- inc5000$urls 
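The `$` extraction above is base R. If you would rather stay inside the tidyverse, `dplyr::pull()` should give the same plain vector. A minimal sketch on a made-up tibble (`toy_inc5000` and its URLs are placeholders, not real profile pages):

```r
library(dplyr)

# Toy stand-in for inc5000; the URLs are placeholders
toy_inc5000 <- tibble(
  urls = c("https://example.com/profile/a", "https://example.com/profile/b"),
  website = c("a.example.com", "b.example.com")
)

# Base R extraction, same idea as the step above
urls_base <- toy_inc5000$urls

# Equivalent extraction with dplyr
urls_tidy <- toy_inc5000 %>% pull(urls)
```

Either form yields a character vector that a for loop can iterate over.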

Loop function

A for loop is required to gather our data with the following order of operations:

  • Take URL from list
  • Crawl the page
  • Extract the page elements we want
  • Store into data frame
  • Rinse and repeat from step 1

# INITIALIZE DATA FRAME
company_raw <- data.frame()

for (page_url in company_urls){
  print(page_url)
  
  # RETRIEVE INC5000 PROFILE PAGE
  page <- read_html(page_url)
  
  # PARSE REVENUE
  revenue_millions <- page %>% 
    html_nodes(xpath = '//*/section[1]/div[3]/dl[1]/dd') %>% 
    html_text() %>% 
    str_replace(" Million", "") %>% # STRIP ' MILLION' FROM DATA VALUE
    str_replace("\\$", "") # STRIP $ SIGN FROM DATA VALUE

  # PARSE INDUSTRY
  industry <- page %>% 
    html_nodes(xpath = '//*/section[1]/div[3]/dl[3]/dd') %>% 
    html_text()

  # PARSE YEAR FOUNDED
  year_founded <- page %>% 
    html_nodes(xpath = '//*/section[1]/div[3]/dl[5]/dd') %>% 
    html_text()
  
  # PARSE EMPLOYEE COUNT
  employees <- page %>% 
    html_nodes(xpath = '//*/section[1]/div[3]/dl[6]/dd') %>% 
    html_text()

  # TEMP TO STORE LOOP DATA
  temp_df <- data.frame(page_url, revenue_millions, industry, year_founded, employees)
  
  # COMBINE TEMP WITH ALL DATA
  company_raw <- rbind(company_raw, temp_df) 
}
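One fragile spot in the loop: `data.frame()` errors when its columns have different lengths, which happens if an XPath matches nothing on a given profile. A minimal sketch of one way to guard against that, assuming a hypothetical helper `extract_field()` and a toy HTML snippet rather than the real Inc. markup:

```r
library(rvest)

# Hypothetical helper: text of the first node matching an XPath,
# or NA when nothing matches (html_node() returns a missing node)
extract_field <- function(page, xpath) {
  page %>%
    html_node(xpath = xpath) %>%
    html_text(trim = TRUE)
}

# Toy page standing in for a profile; not the real Inc. page structure
toy_page <- read_html(
  "<html><body><dl><dt>Industry</dt><dd> Software </dd></dl></body></html>"
)

industry  <- extract_field(toy_page, "//dl[1]/dd")  # node exists
employees <- extract_field(toy_page, "//dl[2]/dd")  # node is absent

# Both values have length 1, so the row can always be built
temp_df <- data.frame(industry, employees)
```

Because every field comes back as a single value, a profile with a missing element contributes a row with `NA` instead of stopping the loop.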

Data cleaning

Ideally, data collection and data processing would be kept separate, but I did not anticipate “Billion” values showing up in the revenue parsing step of our loop.

We will leave the loop as is, though, to illustrate how easily data can be cleaned in R, specifically with the tidyverse.

# BRING BACK TO TIDYVERSE
company_data <- company_raw %>% 
  as_tibble() 

# EXTRACT 'BILLION' VALUES AND CONVERT TO MILLIONS
billions <- company_data %>% 
  filter(grepl(" Billion", revenue_millions)) %>% 
  mutate(revenue_millions = str_replace(revenue_millions, " Billion", ""),
         revenue_millions = as.numeric(as.character(revenue_millions)),
         revenue_millions = revenue_millions*1000)

# CONVERT 'MILLION' VALUES TO NUMERICAL FORMAT
company_data <- company_data %>% 
  filter(!grepl(" Billion", revenue_millions)) %>% 
  mutate(revenue_millions = as.numeric(as.character(revenue_millions)))

# JOIN OUR SANITIZED DATASET
company_data <- rbind(company_data, billions)
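The split, convert, and re-join steps above could also be collapsed into a single vectorized conversion. A rough sketch, assuming a hypothetical helper named `to_millions()` and made-up revenue strings (not values from the scraped data):

```r
# Hypothetical helper: convert revenue strings to numeric millions
to_millions <- function(x) {
  x <- as.character(x)
  # Billion values are scaled by 1,000 to express them in millions
  multiplier <- ifelse(grepl("Billion", x), 1000, 1)
  # Drop everything except digits and the decimal point
  value <- as.numeric(gsub("[^0-9.]", "", x))
  value * multiplier
}

# Made-up inputs covering the formats discussed above
revenue_raw <- c("3.2", "1.5 Billion", "$12.4 Million")
revenue_clean <- to_millions(revenue_raw)
```

Inside a pipeline this would be used as `mutate(revenue_millions = to_millions(revenue_millions))`, which keeps the rows in their original order since nothing is filtered out and bound back.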

Summarize data

With our cleaned dataset we can finally answer the question: how much revenue did each industry generate in 2019 and how many companies were surveyed?

company_parsed <- company_data %>% 
  group_by(industry) %>% 
  summarize(revenue_millions = sum(revenue_millions),
            count = n()) %>% 
  ungroup() %>% 
  filter(!is.na(revenue_millions)) %>% 
  mutate(pct_count = count / sum(count),
         pct_revenue = revenue_millions / sum(revenue_millions),
         revenue_billions = revenue_millions / 1000)

| industry | revenue_billions | count | pct_revenue | pct_count |
|---|---|---|---|---|
| Health | 38.9890 | 360 | 0.1650677 | 0.0718276 |
| Consumer Products & Services | 23.2248 | 323 | 0.0983268 | 0.0644453 |
| Construction | 20.4304 | 354 | 0.0864962 | 0.0706305 |
| Logistics & Transportation | 18.7815 | 184 | 0.0795152 | 0.0367119 |
| Government Services | 14.0165 | 236 | 0.0593417 | 0.0470870 |
| Business Products & Services | 13.9789 | 490 | 0.0591825 | 0.0977654 |
| Human Resources | 11.4535 | 156 | 0.0484907 | 0.0311253 |
| Retail | 10.8481 | 163 | 0.0459276 | 0.0325219 |
| Financial Services | 9.5650 | 239 | 0.0404953 | 0.0476856 |
| Software | 9.2791 | 457 | 0.0392849 | 0.0911812 |
| Advertising & Marketing | 9.0369 | 489 | 0.0382595 | 0.0975658 |
| Real Estate | 6.4951 | 195 | 0.0274983 | 0.0389066 |
| IT Management | 6.2604 | 276 | 0.0265047 | 0.0550678 |
| Energy | 6.2551 | 77 | 0.0264822 | 0.0153631 |
| Manufacturing | 5.9942 | 179 | 0.0253776 | 0.0357143 |
| Food & Beverage | 5.0617 | 127 | 0.0214297 | 0.0253392 |
| Telecommunications | 3.3042 | 79 | 0.0139890 | 0.0157622 |
| Insurance | 3.1245 | 69 | 0.0132282 | 0.0137670 |
| Engineering | 2.6693 | 81 | 0.0113010 | 0.0161612 |
| IT System Development | 2.5352 | 121 | 0.0107333 | 0.0241421 |
| Education | 1.4515 | 69 | 0.0061452 | 0.0137670 |
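One detail of the summary step worth spelling out: `sum()` returns `NA` for any group containing a missing value, so the `filter(!is.na(revenue_millions))` that follows drops an entire industry if even one of its companies has no parsed revenue. A small sketch on made-up data (the `toy` tibble and its values are invented for illustration) comparing that behavior with `na.rm = TRUE`:

```r
library(dplyr)

# Invented data: industry "A" has one company with missing revenue
toy <- tibble(
  industry = c("A", "A", "B", "B"),
  revenue_millions = c(10, NA, 5, 7)
)

# Same pattern as the summary step: the NA propagates, then "A" is filtered out
strict <- toy %>%
  group_by(industry) %>%
  summarize(revenue_millions = sum(revenue_millions),
            count = n()) %>%
  ungroup() %>%
  filter(!is.na(revenue_millions))

# Alternative: ignore missing values so every industry is kept
lenient <- toy %>%
  group_by(industry) %>%
  summarize(revenue_millions = sum(revenue_millions, na.rm = TRUE),
            count = n()) %>%
  ungroup()
```

Which variant is appropriate depends on whether a partial industry total is more useful than no total at all; the original code takes the stricter route.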

Unsurprisingly, the Health industry comes out on top with a whopping $39Bn, nearly 1.7x the revenue of the runner-up, Consumer Products & Services.

In terms of companies surveyed, Business Products & Services ranks #1 with 490 companies, with Advertising & Marketing a very close second at 489.

Keeping it real

Visualize data

The tabulated data above is a little difficult to interpret, so we’ll plot our results instead.

library(ggrepel)

company_parsed %>% 
  ggplot(aes(pct_count, pct_revenue, label = industry)) +
  geom_point() +
  geom_label_repel() +
  geom_abline(color = 'salmon', lty = 'dashed') + 
  scale_x_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, .1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, .2)) +
  labs(x = "% of Total Companies", y = "% of Total Revenue",
       title = "Inc.5000 Fastest Growing Private Companies of 2019",
       subtitle = "Compares the number of companies surveyed to total revenue generated per industry",
       caption = "by:@eeysirhc\nsource:@viperchill") +
  theme_minimal()

We are pretty much done, but there may be instances where we want readers to be able to see the underlying data values. Luckily, the {plotly} package is our answer.

Chart interactivity

library(plotly)

plot_ly(data = company_parsed, 
        x = ~pct_count, 
        y = ~pct_revenue, 
        mode = "markers",
        type = "scatter",
        marker = list(size = 10),
        color = ~industry,
        colors = 'Set1', 
        hoverinfo = "text",
        text = ~paste("<b>Industry:</b> ",industry, 
                     "<br><b>Total Companies:</b> ", count,
                      "<br><b>Total Revenue ($Bn):</b> ", revenue_billions),
        showlegend = FALSE) %>% 
  layout(xaxis = list(title = "% of Total Companies",
                      tickformat = "%"),
         yaxis = list(title = "% of Total Revenue",
                      tickformat = "%"))

Wrapping up

In my next article, we’ll take this a step further by building our own Shiny app. If you’re feeling adventurous, you can use some starter code here or check out a barebones version for the US housing price index.

As always, if you enjoyed this or found it helpful please share over your favorite internet medium!