Harris County COVID-19 data

I have a very nice (I hope) dataset consisting of number of positive COVID-19 cases per day in Harris county by zipcode. In this blog entry I would like to study this dataset and look at comparisons with various other data.

Initial look

First off, let’s explore the data for issues, and for ideas about what might be interesting.

# How is the data distributed? Let's look at the most recent day

Harris %>% 
  group_by(Zip) %>% 
    summarize(Cases_today=last(Cases)) %>% 
  ggplot(aes(x=Cases_today)) +
  geom_histogram()

So over 20 zipcodes have no cases (but they may also have no people), and it looks like most zipcodes are in the 250-750 range. No obvious outliers or busts, so the data is looking pretty good so far.

Let’s look at the time behavior - this may get messy.

Harris %>% 
  mutate(Zip=as.factor(Zip)) %>% 
  ggplot(aes(x=Date, y=Cases, color=Zip)) +
  geom_line() + 
  theme(legend.position = "none")

Obviously almost zipcodes are up, but some much more than others. But again, most importantly at this stage, no obvious data busts.

Let’s load up some more datasets. I need Population, area, median age, race, blueness, and median income.

path <- "/home/ajackson/Dropbox/Rprojects/Datasets/"

#   SF file of zipcode outlines and areas

Zip_outlines <- readRDS(paste0(path, "ZipCodes_sf.rds"))

#   Census data for 2016

Zip_census16 <- readRDS(paste0(path, "TexasZipcode_16.rds"))
Zip_census16 <- Zip_census16 %>% 
  mutate(ZCTA=as.character(ZCTA))

#   Median family income and number of families
Income <- readRDS("/home/ajackson/Dropbox/Rprojects/Datasets/IncomeByZip.rds")

#   Many vs 2 generational households
House <- readRDS("/home/ajackson/Dropbox/Rprojects/Datasets/HouseholdByZip.rds")

#   Blueness

Blueness <- readRDS(paste0(path,"HarrisBlueness.rds")) 

Construct master dataframe

I want to build one data frame for all the ancilliary data by ZCTA, including the polygons, just to make things easier. Let’s make 2, one with geometries, and one without.

DF_values <- Zip_census16 %>% 
  select(-MalePop) %>% 
  mutate(MedianAge=as.numeric(MedianAge))

DF_values <- left_join(DF_values, Income, by="ZCTA")

DF_values <- left_join(DF_values, House, by="ZCTA")

# Drop NA will restrict to Harris county

DF_values <- left_join(DF_values, Blueness, by="ZCTA") %>% 
  drop_na()

left_join.sf =
  function(x,y,by=NULL,copy=FALSE,suffix=c(".x",".y"),...){
  ret = NextMethod("left_join")
  sf::st_as_sf(ret)
}

Zip_outlines <- sf::st_as_sf(Zip_outlines)

DF_polys <- left_join(DF_values, Zip_outlines, by=c("ZCTA"="Zip_Code"))

DF_polys <- sf::st_as_sf(DF_polys)

#   Check it out
      pal <- leaflet::colorNumeric(palette = colorspace::diverge_hsv(8),
                          na.color = "transparent",
                          reverse=TRUE,
                          domain = DF_values$blueness)
      
leaflet::leaflet(DF_polys) %>% 
        leaflet::setView(lng = -95.3103, lat = 29.7752, zoom = 7 ) %>%   
        leaflet::addTiles() %>%
        leaflet::addPolygons(data = DF_polys, 
                    weight = 1,
                    smoothFactor = 0.2, 
                    fillOpacity = 0.7,
                    fillColor =  ~pal(DF_values$blueness))