Biostat 200C Homework 2

To submit homework, please upload both Rmd and html files to Bruinlearn by the deadline.

Q1. CFR of COVID-19

Of primary interest to the public is the risk of dying from COVID-19. A commonly used measure is the case fatality rate/ratio/risk (CFR), which is defined as \[ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}. \] CFR is not a fixed constant; it changes with time, location, and other factors. CFR also differs from the infection fatality rate (IFR), the probability that someone infected with COVID-19 dies from it.
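As a quick worked instance of the definition, the sketch below computes CFR for two counties whose counts appear in the sample output later in this document (Ada, Idaho and Adams, Colorado on 2020-04-04). The vector names are chosen here for illustration only.

```r
# Counts copied from the sample output in this document (2020-04-04).
confirmed <- c(ada_idaho = 360, adams_colorado = 294)
deaths    <- c(ada_idaho = 3,   adams_colorado = 9)

# CFR = deaths / diagnosed cases, computed element-wise.
cfr <- deaths / confirmed

# Express as a percentage, rounded to two decimals.
round(100 * cfr, 2)
```

This gives roughly 0.83% for Ada and 3.06% for Adams, which illustrates how much CFR can vary by location even on the same date.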

In this exercise, we use logistic regression to study how US county-level CFR varies with demographic information and selected health, education, and economic indicators.
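One common way to set this up, shown here only as a minimal sketch, is a binomial GLM in which each county contributes `deaths` successes out of `confirmed` trials. The ten rows are copied from the joined-table preview later in this document; the choice of `percent_65_and_over` as the single predictor and the name `toy` are assumptions made for illustration, not the model the homework asks for.

```r
# Ten counties copied from the joined-table preview in this document.
toy <- data.frame(
  confirmed = c(6, 65, 8, 360, 10, 14, 294, 16, 8, 21),
  deaths    = c(0, 2, 0, 3, 0, 0, 9, 0, 0, 0),
  percent_65_and_over = c(21.8, 15.3, 23.6, 14.4, 14.8,
                          15.9, 10.5, 18.8, 18.2, 20.4)
)

# Two-column response: (deaths, non-deaths) among confirmed cases.
# The predictor is an illustrative assumption, not a recommended model.
fit <- glm(cbind(deaths, confirmed - deaths) ~ percent_65_and_over,
           family = binomial, data = toy)

coef(fit)          # coefficients on the log-odds scale
head(fitted(fit))  # fitted county-level CFR
```

With only ten counties the estimates are not meaningful; the point is the form of the response, which weights each county by its number of confirmed cases.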

Data sources

04-04-2020.csv.gz: The data on COVID-19 confirmed cases and deaths on 2020-04-04 were retrieved from the Johns Hopkins COVID-19 data repository. The file was downloaded from this link (commit 0174f38). The repository was archived by its owner on Mar 10, 2023 and is now read-only. You can download the data from Box: https://ucla.box.com/s/brb3vz4nwoq8pjkcutxncymqw583d39l
us-county-health-rankings-2020.csv.gz: The 2020 County Health Rankings data were released by County Health Rankings. The file was downloaded from the Kaggle Uncover COVID-19 Challenge (version 1). You can download the data from Box: https://ucla.box.com/s/brb3vz4nwoq8pjkcutxncymqw583d39l

Sample code for data preparation

Load the tidyverse package for data manipulation and visualization.

# tidyverse for data manipulation and visualization
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Read in the data of COVID-19 cases reported on 2020-04-04.

county_count <- read_csv("./datasets/04-04-2020.csv.gz") %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(FIPS = as.numeric(FIPS)) %>%
  filter(Country_Region == "US") %>%
  print(width = Inf)

## Rows: 2679 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): FIPS, Admin2, Province_State, Country_Region, Combined_Key
## dbl  (6): Lat, Long_, Confirmed, Deaths, Recovered, Active
## dttm (1): Last_Update
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2,421 × 12
##     FIPS Admin2    Province_State Country_Region Last_Update           Lat
##    <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
##  1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
##  2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
##  3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
##  4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
##  5 19001 Adair     Iowa           US             2020-04-04 23:34:21  41.3
##  6 21001 Adair     Kentucky       US             2020-04-04 23:34:21  37.1
##  7 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
##  8 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
##  9  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
## 10 16003 Adams     Idaho          US             2020-04-04 23:34:21  44.9
##     Long_ Confirmed Deaths Recovered Active Combined_Key                 
##     <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
##  1  -82.5         6      0         0      0 Abbeville, South Carolina, US
##  2  -92.4        65      2         0      0 Acadia, Louisiana, US        
##  3  -75.6         8      0         0      0 Accomack, Virginia, US       
##  4 -116.        360      3         0      0 Ada, Idaho, US               
##  5  -94.5         1      0         0      0 Adair, Iowa, US              
##  6  -85.3         3      0         0      0 Adair, Kentucky, US          
##  7  -92.6        10      0         0      0 Adair, Missouri, US          
##  8  -94.7        14      0         0      0 Adair, Oklahoma, US          
##  9 -104.        294      9         0      0 Adams, Colorado, US          
## 10 -116.          1      0         0      0 Adams, Idaho, US             
## # … with 2,411 more rows
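Casting `FIPS` to a number drops leading zeros, which is why Adams, Colorado appears as 8001 in the preview above. That is harmless for joining as long as both tables are cast the same way. If the five-character county code is needed as text, a minimal sketch (the variable names here are illustrative) is:

```r
# FIPS values as they appear after as.numeric() in the preview above.
fips_num <- c(45001, 8001, 16003)

# Zero-pad to width 5 to recover the five-digit county code as text.
fips_chr <- sprintf("%05d", as.integer(fips_num))
fips_chr
```

Here 8001 becomes "08001", while codes that already have five digits are unchanged.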

Standardize the variable names by changing them to lower case.

names(county_count) <- str_to_lower(names(county_count))

Sanity check by displaying the unique US states and territories:

county_count %>%
  select(province_state) %>%
  distinct() %>%
  arrange(province_state) %>%
  print(n = Inf)

## # A tibble: 58 × 1
##    province_state          
##    <chr>                   
##  1 Alabama                 
##  2 Alaska                  
##  3 Arizona                 
##  4 Arkansas                
##  5 California              
##  6 Colorado                
##  7 Connecticut             
##  8 Delaware                
##  9 Diamond Princess        
## 10 District of Columbia    
## 11 Florida                 
## 12 Georgia                 
## 13 Grand Princess          
## 14 Guam                    
## 15 Hawaii                  
## 16 Idaho                   
## 17 Illinois                
## 18 Indiana                 
## 19 Iowa                    
## 20 Kansas                  
## 21 Kentucky                
## 22 Louisiana               
## 23 Maine                   
## 24 Maryland                
## 25 Massachusetts           
## 26 Michigan                
## 27 Minnesota               
## 28 Mississippi             
## 29 Missouri                
## 30 Montana                 
## 31 Nebraska                
## 32 Nevada                  
## 33 New Hampshire           
## 34 New Jersey              
## 35 New Mexico              
## 36 New York                
## 37 North Carolina          
## 38 North Dakota            
## 39 Northern Mariana Islands
## 40 Ohio                    
## 41 Oklahoma                
## 42 Oregon                  
## 43 Pennsylvania            
## 44 Puerto Rico             
## 45 Recovered               
## 46 Rhode Island            
## 47 South Carolina          
## 48 South Dakota            
## 49 Tennessee               
## 50 Texas                   
## 51 Utah                    
## 52 Vermont                 
## 53 Virgin Islands          
## 54 Virginia                
## 55 Washington              
## 56 West Virginia           
## 57 Wisconsin               
## 58 Wyoming

We exclude the entries for Diamond Princess, Grand Princess, Guam, Northern Mariana Islands, Puerto Rico, Recovered, and Virgin Islands, and consider only counties from the 50 states and DC.

county_count <- county_count %>%
  filter(!(province_state %in% c("Diamond Princess", "Grand Princess", 
                                 "Recovered", "Guam", "Northern Mariana Islands", 
                                 "Puerto Rico", "Virgin Islands"))) %>%
  print(width = Inf)

## # A tibble: 2,413 × 12
##     fips admin2    province_state country_region last_update           lat
##    <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
##  1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
##  2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
##  3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
##  4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
##  5 19001 Adair     Iowa           US             2020-04-04 23:34:21  41.3
##  6 21001 Adair     Kentucky       US             2020-04-04 23:34:21  37.1
##  7 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
##  8 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
##  9  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
## 10 16003 Adams     Idaho          US             2020-04-04 23:34:21  44.9
##     long_ confirmed deaths recovered active combined_key                 
##     <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
##  1  -82.5         6      0         0      0 Abbeville, South Carolina, US
##  2  -92.4        65      2         0      0 Acadia, Louisiana, US        
##  3  -75.6         8      0         0      0 Accomack, Virginia, US       
##  4 -116.        360      3         0      0 Ada, Idaho, US               
##  5  -94.5         1      0         0      0 Adair, Iowa, US              
##  6  -85.3         3      0         0      0 Adair, Kentucky, US          
##  7  -92.6        10      0         0      0 Adair, Missouri, US          
##  8  -94.7        14      0         0      0 Adair, Oklahoma, US          
##  9 -104.        294      9         0      0 Adams, Colorado, US          
## 10 -116.          1      0         0      0 Adams, Idaho, US             
## # … with 2,403 more rows
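The exclusion list used in the filter above can also be derived rather than typed by hand, by comparing the observed values against base R's built-in `state.name` vector plus DC. The sketch below uses a small illustrative subset of the 58 values; `seen` and `keep` are names introduced here.

```r
# A few of the 58 province_state values listed above (illustrative subset).
seen <- c("Alabama", "Diamond Princess", "District of Columbia",
          "Guam", "Puerto Rico", "Recovered", "Wyoming")

# state.name is a built-in base R vector of the 50 state names.
keep <- c(state.name, "District of Columbia")

# Entries that are neither a state nor DC.
setdiff(seen, keep)
```

On this subset the result is Diamond Princess, Guam, Puerto Rico, and Recovered. Filtering with `province_state %in% keep` would be an equivalent way to express the same restriction.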

Graphically summarize the COVID-19 confirmed cases and deaths on 2020-04-04 by state.

county_count %>%
  # turn into long format for easy plotting
  pivot_longer(confirmed:recovered, 
               names_to = "case", 
               values_to = "count") %>%
  group_by(province_state) %>%
  ggplot() + 
  geom_col(mapping = aes(x = province_state, y = `count`, fill = `case`)) + 
  # scale_y_log10() + 
  labs(title = "US COVID-19 Situation on 2020-04-04", x = "State") + 
  theme(axis.text.x = element_text(angle = 90))
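Because `geom_col()` stacks the county rows within each state, the bar heights are implicitly state totals. To see the numbers behind the bars, the counts can be aggregated explicitly. The following is a sketch on four counties copied from the preview above; `toy` and `state_totals` are illustrative names.

```r
library(tidyverse)

# Four counties copied from the preview above.
toy <- tibble(
  province_state = c("Idaho", "Idaho", "Colorado", "Louisiana"),
  confirmed      = c(360, 1, 294, 65),
  deaths         = c(3, 0, 9, 2)
)

# Sum counts within each state, then compute the state-level CFR.
state_totals <- toy %>%
  group_by(province_state) %>%
  summarise(confirmed = sum(confirmed),
            deaths    = sum(deaths),
            .groups   = "drop") %>%
  mutate(cfr = deaths / confirmed) %>%
  arrange(desc(confirmed))

state_totals
```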

Read in the 2020 county-level health ranking data.

county_info <- read_csv("./datasets/us-county-health-rankings-2020.csv.gz") %>%
  filter(!is.na(county)) %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(fips = as.numeric(fips)) %>%
  select(fips, 
         state,
         county,
         percent_fair_or_poor_health, 
         percent_smokers, 
         percent_adults_with_obesity, 
         # food_environment_index,
         percent_with_access_to_exercise_opportunities, 
         percent_excessive_drinking,
         # teen_birth_rate, 
         percent_uninsured,
         # primary_care_physicians_rate,
         # preventable_hospitalization_rate,
         # high_school_graduation_rate,
         percent_some_college,
         percent_unemployed,
         percent_children_in_poverty,
         # `80th_percentile_income`,
         # `20th_percentile_income`,
         percent_single_parent_households,
         # violent_crime_rate,
         percent_severe_housing_problems,
         overcrowding,
         # life_expectancy,
         # age_adjusted_death_rate,
         percent_adults_with_diabetes,
         # hiv_prevalence_rate,
         percent_food_insecure,
         # percent_limited_access_to_healthy_foods,
         percent_insufficient_sleep,
         percent_uninsured_2,
         median_household_income,
         average_traffic_volume_per_meter_of_major_roadways,
         percent_homeowners,
         # percent_severe_housing_cost_burden,
         population_2,
         percent_less_than_18_years_of_age,
         percent_65_and_over,
         percent_black,
         percent_asian,
         percent_hispanic,
         percent_female,
         percent_rural) %>%
  print(width = Inf)

## Rows: 3193 Columns: 507
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): state, county, unreliable, primary_care_physicians_ratio, dentist...
## dbl (497): fips, num_deaths, years_of_potential_life_lost_rate, 95percent_ci...
## lgl   (3): presence_of_water_violation, non_petitioned_cases, petitioned_cases
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 3,142 × 30
##     fips state   county   percent_fair_or_poor_health percent_smokers
##    <dbl> <chr>   <chr>                          <dbl>           <dbl>
##  1  1001 Alabama Autauga                         20.9            18.1
##  2  1003 Alabama Baldwin                         17.5            17.5
##  3  1005 Alabama Barbour                         29.6            22.0
##  4  1007 Alabama Bibb                            19.4            19.1
##  5  1009 Alabama Blount                          21.7            19.2
##  6  1011 Alabama Bullock                         31.0            22.9
##  7  1013 Alabama Butler                          27.9            21.8
##  8  1015 Alabama Calhoun                         23.1            20.6
##  9  1017 Alabama Chambers                        24.0            19.4
## 10  1019 Alabama Cherokee                        20.7            17.5
##    percent_adults_with_obesity percent_with_access_to_exercise_opportunities
##                          <dbl>                                         <dbl>
##  1                        33.3                                         69.1 
##  2                        31                                           73.7 
##  3                        41.7                                         53.2 
##  4                        37.6                                         16.3 
##  5                        33.8                                         15.6 
##  6                        37.2                                          2.50
##  7                        43.3                                         48.6 
##  8                        38.5                                         47.7 
##  9                        40.1                                         61.9 
## 10                        35                                           33.4 
##    percent_excessive_drinking percent_uninsured percent_some_college
##                         <dbl>             <dbl>                <dbl>
##  1                       15.0              8.72                 62.0
##  2                       18.0             11.3                  67.4
##  3                       12.8             12.2                  34.9
##  4                       15.6             10.2                  44.1
##  5                       14.2             13.4                  53.4
##  6                       12.1             11.4                  35.0
##  7                       11.9             11.2                  41.7
##  8                       13.8             11.9                  59.2
##  9                       12.7             11.9                  48.5
## 10                       14.1             11.2                  51.8
##    percent_unemployed percent_children_in_poverty
##                 <dbl>                       <dbl>
##  1               3.63                        19.3
##  2               3.62                        13.9
##  3               5.17                        43.9
##  4               3.97                        27.8
##  5               3.51                        18  
##  6               4.69                        68.3
##  7               4.79                        36.3
##  8               4.65                        26.5
##  9               3.91                        30.7
## 10               3.57                        24.7
##    percent_single_parent_households percent_severe_housing_problems overcrowding
##                               <dbl>                           <dbl>        <dbl>
##  1                             26.2                            14.7        1.20 
##  2                             24.1                            13.6        1.27 
##  3                             56.6                            14.6        1.69 
##  4                             28.7                            10.5        0.255
##  5                             28.6                            10.5        1.89 
##  6                             74.8                            18.1        0.113
##  7                             52.7                            13.2        1.69 
##  8                             40.2                            13.7        1.54 
##  9                             46.6                            16.0        4.04 
## 10                             23.8                            13          1.5  
##    percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
##                           <dbl>                 <dbl>                      <dbl>
##  1                         11.1                  13.2                       35.9
##  2                         10.7                  11.6                       33.3
##  3                         17.6                  22                         38.6
##  4                         14.5                  14.3                       38.1
##  5                         17                    10.7                       35.9
##  6                         23.7                  24.8                       45.0
##  7                         19.2                  20.6                       41.9
##  8                         17.5                  15.7                       41.3
##  9                         19.9                  17.9                       37.3
## 10                         15.2                  12.5                       35.4
##    percent_uninsured_2 median_household_income
##                  <dbl>                   <dbl>
##  1                11.1                   59338
##  2                14.3                   57588
##  3                16.1                   34382
##  4                13                     46064
##  5                17.1                   50412
##  6                15.2                   29267
##  7                14.5                   37365
##  8                15.4                   45400
##  9                15.2                   39917
## 10                13.9                   42132
##    average_traffic_volume_per_meter_of_major_roadways percent_homeowners
##                                                 <dbl>              <dbl>
##  1                                              88.5                74.9
##  2                                              87.0                73.6
##  3                                             102.                 61.4
##  4                                              29.3                75.1
##  5                                              33.4                78.6
##  6                                               4.07               75.5
##  7                                              19.3                69.9
##  8                                             110.                 69.5
##  9                                              20.3                67.8
## 10                                              25.9                79.0
##    population_2 percent_less_than_18_years_of_age percent_65_and_over
##           <dbl>                             <dbl>               <dbl>
##  1        55601                              23.7                15.6
##  2       218022                              21.6                20.4
##  3        24881                              20.9                19.4
##  4        22400                              20.5                16.5
##  5        57840                              23.2                18.2
##  6        10138                              21.1                16.4
##  7        19680                              22.2                20.3
##  8       114277                              21.6                17.7
##  9        33615                              20.8                19.5
## 10        26032                              19.2                23.0
##    percent_black percent_asian percent_hispanic percent_female percent_rural
##            <dbl>         <dbl>            <dbl>          <dbl>         <dbl>
##  1         19.3          1.22              2.97           51.4          42.0
##  2          8.78         1.15              4.65           51.5          42.3
##  3         48.0          0.454             4.28           47.2          67.8
##  4         21.1          0.237             2.62           46.8          68.4
##  5          1.46         0.320             9.57           50.7          90.0
##  6         69.5          0.187             7.96           45.5          51.4
##  7         44.6          1.32              1.51           53.4          71.2
##  8         20.9          0.964             3.91           51.9          33.7
##  9         39.6          1.33              2.56           52.1          49.1
## 10          4.24         0.338             1.62           50.5          85.7
## # … with 3,132 more rows
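Before using these variables as predictors, it can help to check how many values are missing in each column, since `glm()` drops incomplete rows under R's default `na.action`. The sketch below runs on a tiny made-up table; the `NA` values in `toy_info` are inserted for illustration and do not reflect the real data.

```r
library(tidyverse)

# Tiny stand-in for county_info; the NAs are inserted for illustration.
toy_info <- tibble(
  fips            = c(1001, 1003, 1005),
  percent_smokers = c(18.1, NA, 22.0),
  percent_rural   = c(42.0, 42.3, NA)
)

# Count missing values per column and keep only columns with any.
na_counts <- toy_info %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(),
               names_to = "variable", values_to = "n_missing") %>%
  filter(n_missing > 0)

na_counts
```

Applied to the real table, the same pipeline would list every selected variable that has at least one missing county.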

For stability in estimating CFR, we restrict to counties with \(\ge 5\) confirmed cases.

county_count <- county_count %>%
  filter(confirmed >= 5)
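# Why a minimum case count helps: under a binomial model the standard
# error of an estimated proportion is sqrt(p * (1 - p) / n), so it is
# large when the number of confirmed cases n is small. A minimal sketch,
# where p = 0.03 is a hypothetical CFR chosen only for illustration:
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

p <- 0.03                 # hypothetical CFR, not an estimate from the data
n <- c(1, 5, 50, 360)     # illustrative confirmed-case counts
se_table <- data.frame(n = n, se = se_prop(p, n))
se_table                  # the standard error shrinks as n grows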

We join the COVID-19 count data and the county-level information using the FIPS (Federal Information Processing Standards) code as the key.

county_data <- county_count %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)

## # A tibble: 1,466 × 41
##     fips admin2    province_state country_region last_update           lat
##    <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
##  1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
##  2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
##  3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
##  4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
##  5 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
##  6 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
##  7  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
##  8 28001 Adams     Mississippi    US             2020-04-04 23:34:21  31.5
##  9 31001 Adams     Nebraska       US             2020-04-04 23:34:21  40.5
## 10 42001 Adams     Pennsylvania   US             2020-04-04 23:34:21  39.9
##     long_ confirmed deaths recovered active combined_key                 
##     <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
##  1  -82.5         6      0         0      0 Abbeville, South Carolina, US
##  2  -92.4        65      2         0      0 Acadia, Louisiana, US        
##  3  -75.6         8      0         0      0 Accomack, Virginia, US       
##  4 -116.        360      3         0      0 Ada, Idaho, US               
##  5  -92.6        10      0         0      0 Adair, Missouri, US          
##  6  -94.7        14      0         0      0 Adair, Oklahoma, US          
##  7 -104.        294      9         0      0 Adams, Colorado, US          
##  8  -91.4        16      0         0      0 Adams, Mississippi, US       
##  9  -98.5         8      0         0      0 Adams, Nebraska, US          
## 10  -77.2        21      0         0      0 Adams, Pennsylvania, US      
##    state          county    percent_fair_or_poor_health percent_smokers
##    <chr>          <chr>                           <dbl>           <dbl>
##  1 South Carolina Abbeville                        19.9            17.3
##  2 Louisiana      Acadia                           20.9            21.5
##  3 Virginia       Accomack                         20.1            18.3
##  4 Idaho          Ada                              11.5            12.0
##  5 Missouri       Adair                            21.4            20.5
##  6 Oklahoma       Adair                            28.5            27.7
##  7 Colorado       Adams                            16.6            16.3
##  8 Mississippi    Adams                            27.3            22.2
##  9 Nebraska       Adams                            15.8            14.6
## 10 Pennsylvania   Adams                            15.3            16.2
##    percent_adults_with_obesity percent_with_access_to_exercise_opportunities
##                          <dbl>                                         <dbl>
##  1                        36.7                                          59.0
##  2                        38.4                                          42.5
##  3                        36.3                                          37.4
##  4                        25.6                                          89.5
##  5                        27.9                                          78.3
##  6                        47.7                                          28.5
##  7                        27.8                                          93.1
##  8                        35.3                                          69.1
##  9                        36.7                                          81.6
## 10                        35.6                                          60.6
##    percent_excessive_drinking percent_uninsured percent_some_college
##                         <dbl>             <dbl>                <dbl>
##  1                       15.9             12.9                  52.5
##  2                       19.8             10.7                  43.6
##  3                       15.5             16.6                  45.1
##  4                       17.9              8.74                 73.8
##  5                       18.9             10.6                  65.3
##  6                       11.8             24.5                  35.1
##  7                       18.9             11.0                  57.0
##  8                       12.3             15.0                  41.7
##  9                       18.5              8.76                 70.8
## 10                       19.2              7.49                 57.3
##    percent_unemployed percent_children_in_poverty
##                 <dbl>                       <dbl>
##  1               3.98                        30.8
##  2               5.37                        35.4
##  3               3.81                        27  
##  4               2.46                        10.2
##  5               3.51                        19.9
##  6               4.17                        34.9
##  7               3.47                        12.6
##  8               6.21                        40.4
##  9               2.87                        14.4
## 10               3.27                        11.2
##    percent_single_parent_households percent_severe_housing_problems overcrowding
##                               <dbl>                           <dbl>        <dbl>
##  1                             37.1                            14.3        0.463
##  2                             33.4                            12.3        3.51 
##  3                             45.9                            15.1        2.10 
##  4                             23.8                            14.0        1.46 
##  5                             29.5                            18.0        0.740
##  6                             38.3                            15.4        5.65 
##  7                             31.0                            18.1        5.37 
##  8                             66.4                            12.8        2.37 
##  9                             26.2                            10.5        0.904
## 10                             26.7                            12.3        1.88 
##    percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
##                           <dbl>                 <dbl>                      <dbl>
##  1                         15.8                  15.2                       36.1
##  2                         11.4                  15.1                       32.4
##  3                         15.9                  14.1                       36.8
##  4                          7.9                  12                         26.3
##  5                          8.4                  17.5                       31.9
##  6                         24.3                  19.1                       39.5
##  7                          7.7                   8                         31.0
##  8                         13.2                  24.7                       41.1
##  9                         11                    11.7                       30.1
## 10                          8.5                   8.3                       34.7
##    percent_uninsured_2 median_household_income
##                  <dbl>                   <dbl>
##  1               15.9                    42412
##  2               14.0                    40484
##  3               19.4                    42879
##  4               11.1                    66827
##  5               12.3                    40395
##  6               29.6                    35156
##  7               13.8                    70199
##  8               18.7                    33392
##  9               10.7                    55167
## 10                8.46                   62877
##    average_traffic_volume_per_meter_of_major_roadways percent_homeowners
##                                                 <dbl>              <dbl>
##  1                                               11.6               76.3
##  2                                               63.7               70.8
##  3                                               60.0               67.9
##  4                                              277.                68.4
##  5                                               45.8               60.0
##  6                                               16.7               68.6
##  7                                              490.                65.2
##  8                                              150.                61.7
##  9                                               53.4               68.2
## 10                                              113.                77.2
##    population_2 percent_less_than_18_years_of_age percent_65_and_over
##           <dbl>                             <dbl>               <dbl>
##  1        24541                              20.1                21.8
##  2        62190                              25.8                15.3
##  3        32412                              20.5                23.6
##  4       469966                              23.8                14.4
##  5        25339                              18.4                14.8
##  6        22082                              26.6                15.9
##  7       511868                              26.5                10.5
##  8        31192                              20.1                18.8
##  9        31511                              23.7                18.2
## 10       102811                              20.0                20.4
##    percent_black percent_asian percent_hispanic percent_female percent_rural
##            <dbl>         <dbl>            <dbl>          <dbl>         <dbl>
##  1        27.5           0.412             1.54           51.6         78.6 
##  2        17.9           0.320             2.73           51.2         51.7 
##  3        28.0           0.781             9.34           51.2        100   
##  4         1.24          2.81              8.31           49.9          5.47
##  5         2.85          2.28              2.57           51.9         37.9 
##  6         0.534         0.802             6.82           50.1         83.3 
##  7         3.19          4.37             40.4            49.5          3.62
##  8        52.4           0.513            11.3            47.9         37.2 
##  9         0.996         1.33             10.9            50.2         22.5 
## 10         1.60          0.875             7.11           50.8         53.7 
## # … with 1,456 more rows
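A left join keeps every row of `county_count`, so rows with no match in `county_info` (including rows whose `fips` is missing) end up with `NA` in all of the health-ranking columns. One way to list such rows is `anti_join()`. The sketch below uses toy tables; the row with a missing `fips` and its count of 12 are hypothetical.

```r
library(tidyverse)

# Toy stand-ins for the two tables; the NA row is hypothetical.
toy_count <- tibble(fips = c(45001, 22001, NA),
                    confirmed = c(6, 65, 12))
toy_info  <- tibble(fips = c(45001, 22001),
                    percent_rural = c(78.6, 51.7))

# Rows of toy_count that have no matching fips in toy_info.
unmatched <- toy_count %>%
  anti_join(toy_info, by = "fips")

unmatched
```

Running the same check on `county_count` and `county_info` would show which entries carry no county-level covariates and may need to be dropped before modeling.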

Numerical summaries of each variable:

summary(county_data)

##       fips          admin2          province_state     country_region    
##  Min.   : 1001   Length:1466        Length:1466        Length:1466       
##  1st Qu.:18003   Class :character   Class :character   Class :character  
##  Median :29029   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :30076                                                           
##  3rd Qu.:42077                                                           
##  Max.   :90053                                                           
##  NA's   :13                                                              
##   last_update                       lat            long_        
##  Min.   :2020-04-04 23:34:21   Min.   :19.60   Min.   :-159.60  
##  1st Qu.:2020-04-04 23:34:21   1st Qu.:33.96   1st Qu.: -94.56  
##  Median :2020-04-04 23:34:21   Median :38.02   Median : -86.48  
##  Mean   :2020-04-04 23:34:21   Mean   :37.71   Mean   : -89.73  
##  3rd Qu.:2020-04-04 23:34:21   3rd Qu.:41.38   3rd Qu.: -81.22  
##  Max.   :2020-04-04 23:34:21   Max.   :64.81   Max.   : -68.65  
##                                NA's   :19      NA's   :19       
##    confirmed           deaths           recovered     active 
##  Min.   :    5.0   Min.   :   0.000   Min.   :0   Min.   :0  
##  1st Qu.:    9.0   1st Qu.:   0.000   1st Qu.:0   1st Qu.:0  
##  Median :   20.0   Median :   0.000   Median :0   Median :0  
##  Mean   :  208.8   Mean   :   4.842   Mean   :0   Mean   :0  
##  3rd Qu.:   68.0   3rd Qu.:   2.000   3rd Qu.:0   3rd Qu.:0  
##  Max.   :63306.0   Max.   :1905.000   Max.   :0   Max.   :0  
##                                                              
##  combined_key          state              county         
##  Length:1466        Length:1466        Length:1466       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  percent_fair_or_poor_health percent_smokers  percent_adults_with_obesity
##  Min.   : 8.121              Min.   : 5.909   Min.   :12.40              
##  1st Qu.:14.390              1st Qu.:14.899   1st Qu.:29.10              
##  Median :17.010              Median :17.147   Median :32.95              
##  Mean   :17.594              Mean   :17.153   Mean   :32.41              
##  3rd Qu.:20.377              3rd Qu.:19.365   3rd Qu.:36.20              
##  Max.   :38.887              Max.   :27.775   Max.   :51.00              
##  NA's   :28                  NA's   :28       NA's   :28                 
##  percent_with_access_to_exercise_opportunities percent_excessive_drinking
##  Min.   :  0.00                                Min.   : 7.81             
##  1st Qu.: 59.95                                1st Qu.:15.70             
##  Median : 74.71                                Median :18.03             
##  Mean   : 71.14                                Mean   :17.92             
##  3rd Qu.: 85.94                                3rd Qu.:20.00             
##  Max.   :100.00                                Max.   :28.62             
##  NA's   :28                                    NA's   :28                
##  percent_uninsured percent_some_college percent_unemployed
##  Min.   : 2.263    Min.   :30.06        Min.   : 1.582    
##  1st Qu.: 6.754    1st Qu.:53.24        1st Qu.: 3.252    
##  Median : 9.925    Median :61.21        Median : 3.870    
##  Mean   :10.583    Mean   :60.87        Mean   : 4.071    
##  3rd Qu.:13.519    3rd Qu.:68.74        3rd Qu.: 4.690    
##  Max.   :31.208    Max.   :90.34        Max.   :18.092    
##  NA's   :28        NA's   :28           NA's   :28        
##  percent_children_in_poverty percent_single_parent_households
##  Min.   : 2.50               Min.   : 9.43                   
##  1st Qu.:12.82               1st Qu.:27.09                   
##  Median :18.40               Median :32.96                   
##  Mean   :19.46               Mean   :33.84                   
##  3rd Qu.:24.50               3rd Qu.:38.94                   
##  Max.   :55.00               Max.   :80.00                   
##  NA's   :28                  NA's   :28                      
##  percent_severe_housing_problems  overcrowding    percent_adults_with_diabetes
##  Min.   : 6.562                  Min.   : 0.000   Min.   : 1.800              
##  1st Qu.:12.267                  1st Qu.: 1.378   1st Qu.: 9.125              
##  Median :14.439                  Median : 1.962   Median :11.300              
##  Mean   :15.079                  Mean   : 2.429   Mean   :11.749              
##  3rd Qu.:16.976                  3rd Qu.: 2.882   3rd Qu.:13.900              
##  Max.   :33.391                  Max.   :14.489   Max.   :34.100              
##  NA's   :28                      NA's   :28       NA's   :28                  
##  percent_food_insecure percent_insufficient_sleep percent_uninsured_2
##  Min.   : 3.40         Min.   :23.03              Min.   : 2.683     
##  1st Qu.:10.70         1st Qu.:31.42              1st Qu.: 7.865     
##  Median :12.70         Median :34.02              Median :12.027     
##  Mean   :13.26         Mean   :33.88              Mean   :12.776     
##  3rd Qu.:15.20         3rd Qu.:36.56              3rd Qu.:16.541     
##  Max.   :33.50         Max.   :46.71              Max.   :42.397     
##  NA's   :28            NA's   :28                 NA's   :28         
##  median_household_income average_traffic_volume_per_meter_of_major_roadways
##  Min.   : 25385          Min.   :   0.00                                   
##  1st Qu.: 46994          1st Qu.:  53.05                                   
##  Median : 54317          Median : 105.00                                   
##  Mean   : 57584          Mean   : 201.39                                   
##  3rd Qu.: 64754          3rd Qu.: 206.92                                   
##  Max.   :140382          Max.   :4444.12                                   
##  NA's   :28              NA's   :28                                        
##  percent_homeowners  population_2      percent_less_than_18_years_of_age
##  Min.   :24.13      Min.   :    2887   Min.   : 7.069                   
##  1st Qu.:64.34      1st Qu.:   36502   1st Qu.:20.321                   
##  Median :69.98      Median :   75478   Median :22.182                   
##  Mean   :68.98      Mean   :  202450   Mean   :22.197                   
##  3rd Qu.:74.78      3rd Qu.:  180031   3rd Qu.:24.002                   
##  Max.   :89.76      Max.   :10105518   Max.   :35.447                   
##  NA's   :28         NA's   :28         NA's   :28                       
##  percent_65_and_over percent_black     percent_asian      percent_hispanic 
##  Min.   : 7.722      Min.   : 0.1286   Min.   : 0.06245   Min.   : 0.7952  
##  1st Qu.:14.927      1st Qu.: 1.6175   1st Qu.: 0.68248   1st Qu.: 2.9419  
##  Median :17.222      Median : 5.6397   Median : 1.23421   Median : 5.5939  
##  Mean   :17.516      Mean   :12.4178   Mean   : 2.40412   Mean   :10.0010  
##  3rd Qu.:19.598      3rd Qu.:17.5931   3rd Qu.: 2.67550   3rd Qu.:11.0564  
##  Max.   :57.587      Max.   :81.9544   Max.   :42.95231   Max.   :96.3595  
##  NA's   :28          NA's   :28        NA's   :28         NA's   :28       
##  percent_female  percent_rural   
##  Min.   :34.63   Min.   :  0.00  
##  1st Qu.:50.00   1st Qu.: 17.11  
##  Median :50.66   Median : 36.97  
##  Mean   :50.46   Mean   : 40.11  
##  3rd Qu.:51.35   3rd Qu.: 60.00  
##  Max.   :56.87   Max.   :100.00  
##  NA's   :28      NA's   :28

List the rows in county_data that have no match in county_info (their state and county are both NA):

county_data %>%
  filter(is.na(state) & is.na(county)) %>%
  print(n = Inf)

## # A tibble: 28 × 41
##     fips admin2  provi…¹ count…² last_update           lat  long_ confi…³ deaths
##    <dbl> <chr>   <chr>   <chr>   <dttm>              <dbl>  <dbl>   <dbl>  <dbl>
##  1    NA DeKalb  Tennes… US      2020-04-04 23:34:21  36.0  -85.8       5      0
##  2    NA DeSoto  Florida US      2020-04-04 23:34:21  27.2  -81.8      11      1
##  3    NA Dukes … Massac… US      2020-04-04 23:34:21  41.4  -70.7      16      0
##  4    NA Fillmo… Minnes… US      2020-04-04 23:34:21  43.7  -92.1       9      0
##  5    NA Kansas… Missou… US      2020-04-04 23:34:21  39.1  -94.6     172      2
##  6    NA LaSalle Illino… US      2020-04-04 23:34:21  41.3  -88.9       7      1
##  7    NA Manass… Virgin… US      2020-04-04 23:34:21  38.7  -77.5      14      0
##  8    NA McDuff… Georgia US      2020-04-04 23:34:21  33.5  -82.5       5      1
##  9    NA Out of… Michig… US      2020-04-04 23:34:21  NA     NA        83      1
## 10    NA Out of… Tennes… US      2020-04-04 23:34:21  NA     NA       218      1
## 11 90005 Unassi… Arkans… US      2020-04-04 23:34:21  NA     NA        53      8
## 12 90008 Unassi… Colora… US      2020-04-04 23:34:21  NA     NA       158      0
## 13 90009 Unassi… Connec… US      2020-04-04 23:34:21  NA     NA       241      1
## 14 90013 Unassi… Georgia US      2020-04-04 23:34:21  NA     NA       245      4
## 15 90015 Unassi… Hawaii  US      2020-04-04 23:34:21  NA     NA         8      0
## 16 90017 Unassi… Illino… US      2020-04-04 23:34:21  NA     NA        58      1
## 17 90021 Unassi… Kentuc… US      2020-04-04 23:34:21  NA     NA        29      6
## 18    NA Unassi… Louisi… US      2020-04-04 23:34:21  NA     NA        31      0
## 19 90023 Unassi… Maine   US      2020-04-04 23:34:21  NA     NA        12      3
## 20 90025 Unassi… Massac… US      2020-04-04 23:34:21  NA     NA       274      9
## 21    NA Unassi… Michig… US      2020-04-04 23:34:21  NA     NA       252      1
## 22 90032 Unassi… Nevada  US      2020-04-04 23:34:21  NA     NA        34      0
## 23 90034 Unassi… New Je… US      2020-04-04 23:34:21  NA     NA      3935     14
## 24 90044 Unassi… Rhode … US      2020-04-04 23:34:21  NA     NA       241     14
## 25 90047 Unassi… Tennes… US      2020-04-04 23:34:21  NA     NA        63      0
## 26 90050 Unassi… Vermont US      2020-04-04 23:34:21  NA     NA        11     15
## 27 90053 Unassi… Washin… US      2020-04-04 23:34:21  NA     NA       483      0
## 28    NA Weber   Utah    US      2020-04-04 23:34:21  41.3 -112.       63      1
## # … with 32 more variables: recovered <dbl>, active <dbl>, combined_key <chr>,
## #   state <chr>, county <chr>, percent_fair_or_poor_health <dbl>,
## #   percent_smokers <dbl>, percent_adults_with_obesity <dbl>,
## #   percent_with_access_to_exercise_opportunities <dbl>,
## #   percent_excessive_drinking <dbl>, percent_uninsured <dbl>,
## #   percent_some_college <dbl>, percent_unemployed <dbl>,
## #   percent_children_in_poverty <dbl>, …
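
As a side note, unmatched rows can also be found without relying on NA patterns in the joined columns. The sketch below uses `dplyr::anti_join()` on two toy tibbles; `toy_count` and `toy_info` are hypothetical stand-ins for county_count and county_info, and the case counts in them are made up.

```r
library(dplyr)

# Toy stand-ins for the case counts and the county covariates (counts are made up)
toy_count <- tibble(
  fips      = c(1001, 1003, NA, 90005),
  admin2    = c("Autauga", "Baldwin", "Weber", "Unassigned"),
  confirmed = c(12, 29, 63, 53)
)
toy_info <- tibble(
  fips   = c(1001, 1003, 49057),
  county = c("Autauga", "Baldwin", "Weber")
)

# Rows of toy_count whose fips has no counterpart in toy_info
unmatched <- anti_join(toy_count, toy_info, by = "fips")
print(unmatched)
```

Here the row with a missing fips and the row with a fips that does not appear in `toy_info` are both returned, which mirrors the two kinds of unmatched rows listed above.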

We found that some rows are missing fips.

county_count %>%
  filter(is.na(fips)) %>%
  select(fips, admin2, province_state) %>%
  print(n = Inf)

## # A tibble: 13 × 3
##     fips admin2              province_state
##    <dbl> <chr>               <chr>         
##  1    NA DeKalb              Tennessee     
##  2    NA DeSoto              Florida       
##  3    NA Dukes and Nantucket Massachusetts 
##  4    NA Fillmore            Minnesota     
##  5    NA Kansas City         Missouri      
##  6    NA LaSalle             Illinois      
##  7    NA Manassas            Virginia      
##  8    NA McDuffie            Georgia       
##  9    NA Out of MI           Michigan      
## 10    NA Out of TN           Tennessee     
## 11    NA Unassigned          Louisiana     
## 12    NA Unassigned          Michigan      
## 13    NA Weber               Utah

We need to (1) manually set fips for some counties, (2) discard the rows labeled Unassigned or Out of, and (3) join with county_info again. Note that Kansas City, Missouri is not assigned a fips below, so the is.na(fips) filter drops it as well.

county_data <- county_count %>%
  # manually set FIPS for some counties
  mutate(fips = ifelse(admin2 == "DeKalb" & province_state == "Tennessee", 47041, fips)) %>%
  mutate(fips = ifelse(admin2 == "DeSoto" & province_state == "Florida", 12027, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Dona Ana" & province_state == "New Mexico", 35013, fips)) %>% 
  mutate(fips = ifelse(admin2 == "Dukes and Nantucket" & province_state == "Massachusetts", 25019, fips)) %>% 
  mutate(fips = ifelse(admin2 == "Fillmore" & province_state == "Minnesota", 27045, fips)) %>%  
  #mutate(fips = ifelse(admin2 == "Harris" & province_state == "Texas", 48201, fips)) %>%  
  #mutate(fips = ifelse(admin2 == "Kenai Peninsula" & province_state == "Alaska", 2122, fips)) %>%  
  mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Illinois", 17099, fips)) %>%
  #mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Louisiana", 22059, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Lac qui Parle" & province_state == "Minnesota", 27073, fips)) %>%  
  mutate(fips = ifelse(admin2 == "Manassas" & province_state == "Virginia", 51683, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Matanuska-Susitna" & province_state == "Alaska", 2170, fips)) %>%
  mutate(fips = ifelse(admin2 == "McDuffie" & province_state == "Georgia", 13189, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McIntosh" & province_state == "Georgia", 13191, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McKean" & province_state == "Pennsylvania", 42083, fips)) %>%
  mutate(fips = ifelse(admin2 == "Weber" & province_state == "Utah", 49057, fips)) %>%
  filter(!(is.na(fips) | str_detect(admin2, "Out of") | str_detect(admin2, "Unassigned"))) %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)

## # A tibble: 1,446 × 41
##     fips admin2    province_state country_region last_update           lat
##    <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
##  1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
##  2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
##  3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
##  4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
##  5 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
##  6 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
##  7  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
##  8 28001 Adams     Mississippi    US             2020-04-04 23:34:21  31.5
##  9 31001 Adams     Nebraska       US             2020-04-04 23:34:21  40.5
## 10 42001 Adams     Pennsylvania   US             2020-04-04 23:34:21  39.9
##     long_ confirmed deaths recovered active combined_key                 
##     <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
##  1  -82.5         6      0         0      0 Abbeville, South Carolina, US
##  2  -92.4        65      2         0      0 Acadia, Louisiana, US        
##  3  -75.6         8      0         0      0 Accomack, Virginia, US       
##  4 -116.        360      3         0      0 Ada, Idaho, US               
##  5  -92.6        10      0         0      0 Adair, Missouri, US          
##  6  -94.7        14      0         0      0 Adair, Oklahoma, US          
##  7 -104.        294      9         0      0 Adams, Colorado, US          
##  8  -91.4        16      0         0      0 Adams, Mississippi, US       
##  9  -98.5         8      0         0      0 Adams, Nebraska, US          
## 10  -77.2        21      0         0      0 Adams, Pennsylvania, US      
##    state          county    percent_fair_or_poor_health percent_smokers
##    <chr>          <chr>                           <dbl>           <dbl>
##  1 South Carolina Abbeville                        19.9            17.3
##  2 Louisiana      Acadia                           20.9            21.5
##  3 Virginia       Accomack                         20.1            18.3
##  4 Idaho          Ada                              11.5            12.0
##  5 Missouri       Adair                            21.4            20.5
##  6 Oklahoma       Adair                            28.5            27.7
##  7 Colorado       Adams                            16.6            16.3
##  8 Mississippi    Adams                            27.3            22.2
##  9 Nebraska       Adams                            15.8            14.6
## 10 Pennsylvania   Adams                            15.3            16.2
##    percent_adults_with_obesity percent_with_access_to_exercise_opportunities
##                          <dbl>                                         <dbl>
##  1                        36.7                                          59.0
##  2                        38.4                                          42.5
##  3                        36.3                                          37.4
##  4                        25.6                                          89.5
##  5                        27.9                                          78.3
##  6                        47.7                                          28.5
##  7                        27.8                                          93.1
##  8                        35.3                                          69.1
##  9                        36.7                                          81.6
## 10                        35.6                                          60.6
##    percent_excessive_drinking percent_uninsured percent_some_college
##                         <dbl>             <dbl>                <dbl>
##  1                       15.9             12.9                  52.5
##  2                       19.8             10.7                  43.6
##  3                       15.5             16.6                  45.1
##  4                       17.9              8.74                 73.8
##  5                       18.9             10.6                  65.3
##  6                       11.8             24.5                  35.1
##  7                       18.9             11.0                  57.0
##  8                       12.3             15.0                  41.7
##  9                       18.5              8.76                 70.8
## 10                       19.2              7.49                 57.3
##    percent_unemployed percent_children_in_poverty
##                 <dbl>                       <dbl>
##  1               3.98                        30.8
##  2               5.37                        35.4
##  3               3.81                        27  
##  4               2.46                        10.2
##  5               3.51                        19.9
##  6               4.17                        34.9
##  7               3.47                        12.6
##  8               6.21                        40.4
##  9               2.87                        14.4
## 10               3.27                        11.2
##    percent_single_parent_households percent_severe_housing_problems overcrowding
##                               <dbl>                           <dbl>        <dbl>
##  1                             37.1                            14.3        0.463
##  2                             33.4                            12.3        3.51 
##  3                             45.9                            15.1        2.10 
##  4                             23.8                            14.0        1.46 
##  5                             29.5                            18.0        0.740
##  6                             38.3                            15.4        5.65 
##  7                             31.0                            18.1        5.37 
##  8                             66.4                            12.8        2.37 
##  9                             26.2                            10.5        0.904
## 10                             26.7                            12.3        1.88 
##    percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
##                           <dbl>                 <dbl>                      <dbl>
##  1                         15.8                  15.2                       36.1
##  2                         11.4                  15.1                       32.4
##  3                         15.9                  14.1                       36.8
##  4                          7.9                  12                         26.3
##  5                          8.4                  17.5                       31.9
##  6                         24.3                  19.1                       39.5
##  7                          7.7                   8                         31.0
##  8                         13.2                  24.7                       41.1
##  9                         11                    11.7                       30.1
## 10                          8.5                   8.3                       34.7
##    percent_uninsured_2 median_household_income
##                  <dbl>                   <dbl>
##  1               15.9                    42412
##  2               14.0                    40484
##  3               19.4                    42879
##  4               11.1                    66827
##  5               12.3                    40395
##  6               29.6                    35156
##  7               13.8                    70199
##  8               18.7                    33392
##  9               10.7                    55167
## 10                8.46                   62877
##    average_traffic_volume_per_meter_of_major_roadways percent_homeowners
##                                                 <dbl>              <dbl>
##  1                                               11.6               76.3
##  2                                               63.7               70.8
##  3                                               60.0               67.9
##  4                                              277.                68.4
##  5                                               45.8               60.0
##  6                                               16.7               68.6
##  7                                              490.                65.2
##  8                                              150.                61.7
##  9                                               53.4               68.2
## 10                                              113.                77.2
##    population_2 percent_less_than_18_years_of_age percent_65_and_over
##           <dbl>                             <dbl>               <dbl>
##  1        24541                              20.1                21.8
##  2        62190                              25.8                15.3
##  3        32412                              20.5                23.6
##  4       469966                              23.8                14.4
##  5        25339                              18.4                14.8
##  6        22082                              26.6                15.9
##  7       511868                              26.5                10.5
##  8        31192                              20.1                18.8
##  9        31511                              23.7                18.2
## 10       102811                              20.0                20.4
##    percent_black percent_asian percent_hispanic percent_female percent_rural
##            <dbl>         <dbl>            <dbl>          <dbl>         <dbl>
##  1        27.5           0.412             1.54           51.6         78.6 
##  2        17.9           0.320             2.73           51.2         51.7 
##  3        28.0           0.781             9.34           51.2        100   
##  4         1.24          2.81              8.31           49.9          5.47
##  5         2.85          2.28              2.57           51.9         37.9 
##  6         0.534         0.802             6.82           50.1         83.3 
##  7         3.19          4.37             40.4            49.5          3.62
##  8        52.4           0.513            11.3            47.9         37.2 
##  9         0.996         1.33             10.9            50.2         22.5 
## 10         1.60          0.875             7.11           50.8         53.7 
## # … with 1,436 more rows
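
The chain of `mutate(ifelse(...))` calls above could alternatively be written with a small lookup table. The following is a minimal sketch on toy data, not a replacement for the code above: `fips_fix`, `fips_manual`, and `toy_count` are hypothetical names, and only three of the manual codes are included. Unlike the original, it fills in fips only where it is missing.

```r
library(dplyr)

# Manual FIPS fixes kept in one lookup table (three of the codes used above)
fips_fix <- tribble(
  ~admin2,  ~province_state, ~fips_manual,
  "DeKalb", "Tennessee",     47041,
  "DeSoto", "Florida",       12027,
  "Weber",  "Utah",          49057
)

# Toy version of county_count; only the key columns are shown
toy_count <- tibble(
  fips           = c(NA, 45001, NA, NA),
  admin2         = c("DeKalb", "Abbeville", "Unassigned", "Weber"),
  province_state = c("Tennessee", "South Carolina", "Michigan", "Utah")
)

patched <- toy_count %>%
  left_join(fips_fix, by = c("admin2", "province_state")) %>%
  # keep an existing fips, otherwise fall back to the manual value
  mutate(fips = coalesce(fips, fips_manual)) %>%
  select(-fips_manual) %>%
  # rows still lacking a fips (e.g. "Unassigned") are dropped
  filter(!is.na(fips))
print(patched)
```

Keeping the fixes in a table makes it easier to see at a glance which counties were patched by hand.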

Summarize again:

summary(county_data)

##       fips          admin2          province_state     country_region    
##  Min.   : 1001   Length:1446        Length:1446        Length:1446       
##  1st Qu.:17186   Class :character   Class :character   Class :character  
##  Median :28156   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :29455                                                           
##  3rd Qu.:42048                                                           
##  Max.   :56039                                                           
##   last_update                       lat            long_        
##  Min.   :2020-04-04 23:34:21   Min.   :19.60   Min.   :-159.60  
##  1st Qu.:2020-04-04 23:34:21   1st Qu.:33.96   1st Qu.: -94.52  
##  Median :2020-04-04 23:34:21   Median :38.02   Median : -86.48  
##  Mean   :2020-04-04 23:34:21   Mean   :37.71   Mean   : -89.73  
##  3rd Qu.:2020-04-04 23:34:21   3rd Qu.:41.39   3rd Qu.: -81.21  
##  Max.   :2020-04-04 23:34:21   Max.   :64.81   Max.   : -68.65  
##    confirmed           deaths           recovered     active 
##  Min.   :    5.0   Min.   :   0.000   Min.   :0   Min.   :0  
##  1st Qu.:    9.0   1st Qu.:   0.000   1st Qu.:0   1st Qu.:0  
##  Median :   20.0   Median :   0.000   Median :0   Median :0  
##  Mean   :  207.2   Mean   :   4.854   Mean   :0   Mean   :0  
##  3rd Qu.:   66.0   3rd Qu.:   2.000   3rd Qu.:0   3rd Qu.:0  
##  Max.   :63306.0   Max.   :1905.000   Max.   :0   Max.   :0  
##  combined_key          state              county         
##  Length:1446        Length:1446        Length:1446       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  percent_fair_or_poor_health percent_smokers  percent_adults_with_obesity
##  Min.   : 8.121              Min.   : 5.909   Min.   :12.40              
##  1st Qu.:14.390              1st Qu.:14.899   1st Qu.:29.10              
##  Median :17.010              Median :17.143   Median :32.90              
##  Mean   :17.594              Mean   :17.151   Mean   :32.39              
##  3rd Qu.:20.398              3rd Qu.:19.365   3rd Qu.:36.20              
##  Max.   :38.887              Max.   :27.775   Max.   :51.00              
##  percent_with_access_to_exercise_opportunities percent_excessive_drinking
##  Min.   :  0.00                                Min.   : 7.81             
##  1st Qu.: 59.95                                1st Qu.:15.68             
##  Median : 74.71                                Median :18.03             
##  Mean   : 71.15                                Mean   :17.91             
##  3rd Qu.: 85.97                                3rd Qu.:20.01             
##  Max.   :100.00                                Max.   :28.62             
##  percent_uninsured percent_some_college percent_unemployed
##  Min.   : 2.263    Min.   :21.14        Min.   : 1.582    
##  1st Qu.: 6.754    1st Qu.:53.21        1st Qu.: 3.252    
##  Median : 9.937    Median :61.19        Median : 3.870    
##  Mean   :10.592    Mean   :60.83        Mean   : 4.071    
##  3rd Qu.:13.527    3rd Qu.:68.72        3rd Qu.: 4.690    
##  Max.   :31.208    Max.   :90.34        Max.   :18.092    
##  percent_children_in_poverty percent_single_parent_households
##  Min.   : 2.50               Min.   : 9.43                   
##  1st Qu.:12.82               1st Qu.:27.07                   
##  Median :18.40               Median :32.96                   
##  Mean   :19.46               Mean   :33.83                   
##  3rd Qu.:24.50               3rd Qu.:38.93                   
##  Max.   :55.00               Max.   :80.00                   
##  percent_severe_housing_problems  overcrowding    percent_adults_with_diabetes
##  Min.   : 6.562                  Min.   : 0.000   Min.   : 1.80               
##  1st Qu.:12.267                  1st Qu.: 1.379   1st Qu.: 9.10               
##  Median :14.439                  Median : 1.971   Median :11.30               
##  Mean   :15.082                  Mean   : 2.437   Mean   :11.75               
##  3rd Qu.:16.992                  3rd Qu.: 2.887   3rd Qu.:13.90               
##  Max.   :33.391                  Max.   :14.489   Max.   :34.10               
##  percent_food_insecure percent_insufficient_sleep percent_uninsured_2
##  Min.   : 3.40         Min.   :23.03              Min.   : 2.683     
##  1st Qu.:10.70         1st Qu.:31.42              1st Qu.: 7.865     
##  Median :12.70         Median :34.02              Median :12.027     
##  Mean   :13.25         Mean   :33.88              Mean   :12.786     
##  3rd Qu.:15.20         3rd Qu.:36.54              3rd Qu.:16.572     
##  Max.   :33.50         Max.   :46.71              Max.   :42.397     
##  median_household_income average_traffic_volume_per_meter_of_major_roadways
##  Min.   : 25385          Min.   :   0.00                                   
##  1st Qu.: 46994          1st Qu.:  53.09                                   
##  Median : 54317          Median : 104.63                                   
##  Mean   : 57600          Mean   : 200.72                                   
##  3rd Qu.: 64775          3rd Qu.: 206.78                                   
##  Max.   :140382          Max.   :4444.12                                   
##  percent_homeowners  population_2      percent_less_than_18_years_of_age
##  Min.   :24.13      Min.   :    2887   Min.   : 7.069                   
##  1st Qu.:64.36      1st Qu.:   36275   1st Qu.:20.326                   
##  Median :69.96      Median :   75382   Median :22.182                   
##  Mean   :68.99      Mean   :  201689   Mean   :22.204                   
##  3rd Qu.:74.77      3rd Qu.:  179982   3rd Qu.:24.019                   
##  Max.   :89.76      Max.   :10105518   Max.   :35.447                   
##  percent_65_and_over percent_black     percent_asian      percent_hispanic 
##  Min.   : 7.722      Min.   : 0.1286   Min.   : 0.06245   Min.   : 0.7952  
##  1st Qu.:14.913      1st Qu.: 1.6168   1st Qu.: 0.68228   1st Qu.: 2.9451  
##  Median :17.225      Median : 5.6397   Median : 1.22863   Median : 5.6100  
##  Mean   :17.512      Mean   :12.4056   Mean   : 2.40009   Mean   :10.0338  
##  3rd Qu.:19.598      3rd Qu.:17.4904   3rd Qu.: 2.66813   3rd Qu.:11.1199  
##  Max.   :57.587      Max.   :81.9544   Max.   :42.95231   Max.   :96.3595  
##  percent_female  percent_rural   
##  Min.   :34.63   Min.   :  0.00  
##  1st Qu.:50.00   1st Qu.: 17.11  
##  Median :50.65   Median : 36.97  
##  Mean   :50.46   Mean   : 40.12  
##  3rd Qu.:51.35   3rd Qu.: 60.04  
##  Max.   :56.87   Max.   :100.00

If any variables have missing values for many counties, we go back and remove them from consideration.
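
One way to check this is to tabulate the share of missing values per column. The sketch below uses base R on a made-up data frame; `toy` and the 50% cutoff are assumptions chosen for illustration only.

```r
# Toy data frame with missing values (values are made up)
toy <- data.frame(
  deaths          = c(0, 2, 1, 0),
  percent_smokers = c(17.3, NA, 18.3, 12.0),
  percent_rural   = c(NA, NA, NA, 5.47)
)

# Number and share of missing values per column
na_count <- colSums(is.na(toy))
na_share <- na_count / nrow(toy)
print(na_share)

# Keep only columns with at most 50% missing (the threshold is an arbitrary choice)
keep <- names(na_share)[na_share <= 0.5]
toy_kept <- toy[, keep, drop = FALSE]
print(names(toy_kept))
```

Applied to county_data, the same `colSums(is.na(...))` call would give the per-variable NA counts that `summary()` reports.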

Let’s create a final data frame for analysis.

county_data <- county_data %>%
  mutate(state = as.factor(state)) %>%
  select(county, confirmed, deaths, state, percent_fair_or_poor_health:percent_rural)
summary(county_data)

##     county            confirmed           deaths                    state     
##  Length:1446        Min.   :    5.0   Min.   :   0.000   Georgia       :  96  
##  Class :character   1st Qu.:    9.0   1st Qu.:   0.000   Texas         :  80  
##  Mode  :character   Median :   20.0   Median :   0.000   North Carolina:  63  
##                     Mean   :  207.2   Mean   :   4.854   Mississippi   :  61  
##                     3rd Qu.:   66.0   3rd Qu.:   2.000   Indiana       :  58  
##                     Max.   :63306.0   Max.   :1905.000   Ohio          :  57  
##                                                          (Other)       :1031  
##  percent_fair_or_poor_health percent_smokers  percent_adults_with_obesity
##  Min.   : 8.121              Min.   : 5.909   Min.   :12.40              
##  1st Qu.:14.390              1st Qu.:14.899   1st Qu.:29.10              
##  Median :17.010              Median :17.143   Median :32.90              
##  Mean   :17.594              Mean   :17.151   Mean   :32.39              
##  3rd Qu.:20.398              3rd Qu.:19.365   3rd Qu.:36.20              
##  Max.   :38.887              Max.   :27.775   Max.   :51.00              
##                                                                          
##  percent_with_access_to_exercise_opportunities percent_excessive_drinking
##  Min.   :  0.00                                Min.   : 7.81             
##  1st Qu.: 59.95                                1st Qu.:15.68             
##  Median : 74.71                                Median :18.03             
##  Mean   : 71.15                                Mean   :17.91             
##  3rd Qu.: 85.97                                3rd Qu.:20.01             
##  Max.   :100.00                                Max.   :28.62             
##                                                                          
##  percent_uninsured percent_some_college percent_unemployed
##  Min.   : 2.263    Min.   :21.14        Min.   : 1.582    
##  1st Qu.: 6.754    1st Qu.:53.21        1st Qu.: 3.252    
##  Median : 9.937    Median :61.19        Median : 3.870    
##  Mean   :10.592    Mean   :60.83        Mean   : 4.071    
##  3rd Qu.:13.527    3rd Qu.:68.72        3rd Qu.: 4.690    
##  Max.   :31.208    Max.   :90.34        Max.   :18.092    
##                                                           
##  percent_children_in_poverty percent_single_parent_households
##  Min.   : 2.50               Min.   : 9.43                   
##  1st Qu.:12.82               1st Qu.:27.07                   
##  Median :18.40               Median :32.96                   
##  Mean   :19.46               Mean   :33.83                   
##  3rd Qu.:24.50               3rd Qu.:38.93                   
##  Max.   :55.00               Max.   :80.00                   
##                                                              
##  percent_severe_housing_problems  overcrowding    percent_adults_with_diabetes
##  Min.   : 6.562                  Min.   : 0.000   Min.   : 1.80               
##  1st Qu.:12.267                  1st Qu.: 1.379   1st Qu.: 9.10               
##  Median :14.439                  Median : 1.971   Median :11.30               
##  Mean   :15.082                  Mean   : 2.437   Mean   :11.75               
##  3rd Qu.:16.992                  3rd Qu.: 2.887   3rd Qu.:13.90               
##  Max.   :33.391                  Max.   :14.489   Max.   :34.10               
##                                                                               
##  percent_food_insecure percent_insufficient_sleep percent_uninsured_2
##  Min.   : 3.40         Min.   :23.03              Min.   : 2.683     
##  1st Qu.:10.70         1st Qu.:31.42              1st Qu.: 7.865     
##  Median :12.70         Median :34.02              Median :12.027     
##  Mean   :13.25         Mean   :33.88              Mean   :12.786     
##  3rd Qu.:15.20         3rd Qu.:36.54              3rd Qu.:16.572     
##  Max.   :33.50         Max.   :46.71              Max.   :42.397     
##                                                                      
##  median_household_income average_traffic_volume_per_meter_of_major_roadways
##  Min.   : 25385          Min.   :   0.00                                   
##  1st Qu.: 46994          1st Qu.:  53.09                                   
##  Median : 54317          Median : 104.63                                   
##  Mean   : 57600          Mean   : 200.72                                   
##  3rd Qu.: 64775          3rd Qu.: 206.78                                   
##  Max.   :140382          Max.   :4444.12                                   
##                                                                            
##  percent_homeowners  population_2      percent_less_than_18_years_of_age
##  Min.   :24.13      Min.   :    2887   Min.   : 7.069                   
##  1st Qu.:64.36      1st Qu.:   36275   1st Qu.:20.326                   
##  Median :69.96      Median :   75382   Median :22.182                   
##  Mean   :68.99      Mean   :  201689   Mean   :22.204                   
##  3rd Qu.:74.77      3rd Qu.:  179982   3rd Qu.:24.019                   
##  Max.   :89.76      Max.   :10105518   Max.   :35.447                   
##                                                                         
##  percent_65_and_over percent_black     percent_asian      percent_hispanic 
##  Min.   : 7.722      Min.   : 0.1286   Min.   : 0.06245   Min.   : 0.7952  
##  1st Qu.:14.913      1st Qu.: 1.6168   1st Qu.: 0.68228   1st Qu.: 2.9451  
##  Median :17.225      Median : 5.6397   Median : 1.22863   Median : 5.6100  
##  Mean   :17.512      Mean   :12.4056   Mean   : 2.40009   Mean   :10.0338  
##  3rd Qu.:19.598      3rd Qu.:17.4904   3rd Qu.: 2.66813   3rd Qu.:11.1199  
##  Max.   :57.587      Max.   :81.9544   Max.   :42.95231   Max.   :96.3595  
##                                                                            
##  percent_female  percent_rural   
##  Min.   :34.63   Min.   :  0.00  
##  1st Qu.:50.00   1st Qu.: 17.11  
##  Median :50.65   Median : 36.97  
##  Mean   :50.46   Mean   : 40.12  
##  3rd Qu.:51.35   3rd Qu.: 60.04  
##  Max.   :56.87   Max.   :100.00  
##

Display the 10 counties with the highest CFR. Because top_n() keeps ties at the cutoff, more than 10 rows are returned.

county_data %>%
  mutate(cfr = deaths / confirmed) %>%
  select(county, state, confirmed, deaths, cfr) %>%
  arrange(desc(cfr)) %>%
  top_n(10)

## Selecting by cfr

## # A tibble: 18 × 5
##    county         state          confirmed deaths   cfr
##    <chr>          <fct>              <dbl>  <dbl> <dbl>
##  1 Emmet          Michigan               7      2 0.286
##  2 Grand Traverse Michigan              12      3 0.25 
##  3 Toole          Montana               12      3 0.25 
##  4 Fayette        Indiana               14      3 0.214
##  5 Concordia      Louisiana              5      1 0.2  
##  6 Harrison       Texas                  5      1 0.2  
##  7 Huntington     Indiana                5      1 0.2  
##  8 Isabella       Michigan              10      2 0.2  
##  9 McDuffie       Georgia                5      1 0.2  
## 10 Navarro        Texas                  5      1 0.2  
## 11 Orange         Indiana                5      1 0.2  
## 12 Perry          Pennsylvania           5      1 0.2  
## 13 Randolph       Indiana                5      1 0.2  
## 14 Rockingham     North Carolina         5      1 0.2  
## 15 Seneca         Ohio                   5      1 0.2  
## 16 Toombs         Georgia                5      1 0.2  
## 17 Vigo           Indiana               10      2 0.2  
## 18 Washington     Alabama                5      1 0.2
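
The output above has 18 rows because top_n() keeps every row tied at the cutoff value. If exactly 10 rows are wanted, one option is slice_max() with with_ties = FALSE (available in dplyr 1.0.0 and later). The sketch below uses a small hypothetical data frame, toy, rather than county_data, and asks for the top 2 rows to keep the example short.

```r
library(dplyr)

# Hypothetical toy data with tied CFR values (not the homework data)
toy <- tibble(
  county    = c("A", "B", "C", "D", "E"),
  confirmed = c(10, 5, 5, 20, 5),
  deaths    = c(3, 1, 1, 1, 1)
)

top2 <- toy %>%
  mutate(cfr = deaths / confirmed) %>%
  # with_ties = FALSE returns exactly n rows even when values tie at the cutoff
  slice_max(cfr, n = 2, with_ties = FALSE)

top2
```

Which of the tied rows is kept depends on row order, so the default (keeping ties) is arguably the more honest display for this data.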

Write the final data into a csv file for future use.

write_csv(county_data, "./datasets/covid19-county-data-20200404.csv.gz")

Note:

Given that the data sets were collected in the middle of the pandemic, what assumptions of CFR might be violated by defining CFR as deaths/confirmed from this data set?

Because the COVID-19 pandemic was still ongoing in 2020, we should recognize that some critical assumptions for defining CFR are not met by this data set.

The number of confirmed cases does not reflect the true number of diagnosed people. It is mainly limited by the availability of testing.
Some people counted as confirmed cases may die later, so deaths are undercounted relative to cases.

Acknowledging these severe limitations, we continue to use deaths/confirmed as a very rough proxy of CFR.

Q1.1

Read and run the above code to generate a data frame county_data that includes county-level COVID-19 confirmed cases and deaths, as well as demographic and health-related information.

Q1.2

What assumptions of logistic regression may be violated by this data set?

Q1.3

Run a logistic regression, using variables state, …, percent_rural as predictors.
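
As a hedged illustration of the mechanics only (not the homework solution), grouped binomial data such as deaths out of confirmed cases can be passed to glm() as a two-column response of successes and failures. The data frame sim and the predictors x1 and x2 below are simulated and hypothetical.

```r
# Simulated stand-in for county-level data (hypothetical, not county_data)
set.seed(200)
n <- 200
sim <- data.frame(confirmed = rpois(n, 50) + 5, x1 = rnorm(n), x2 = rnorm(n))
sim$deaths <- rbinom(n, size = sim$confirmed, prob = plogis(-3 + 0.5 * sim$x1))

# Two-column response: (number of deaths, number of survivors)
fit <- glm(cbind(deaths, confirmed - deaths) ~ x1 + x2,
           family = binomial, data = sim)
summary(fit)
```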

Q1.4

Interpret the regression coefficients of 3 significant predictors with p-value <0.01.
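
As a reminder of how logistic regression coefficients are read, write \(p_i\) for the death probability of a confirmed case in county \(i\) and \(x_{ij}\) for its \(j\)-th predictor (this notation is assumed here, not taken from the assignment). The model is \[ \log \frac{p_i}{1 - p_i} = \beta_0 + \sum_j \beta_j x_{ij}, \] so increasing \(x_{ij}\) by one unit while holding the other predictors fixed changes the odds by the factor \[ \frac{\text{odds}(x_{ij} + 1)}{\text{odds}(x_{ij})} = e^{\beta_j}. \] For a factor such as state, \(e^{\beta_j}\) compares the odds in that level with the odds in the reference level.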

Q1.5

Apply analysis of deviance to (1) evaluate the goodness of fit of the model and (2) compare the model to the intercept-only model.
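
A minimal sketch of both deviance tests, on simulated data with hypothetical predictors x1 and x2: the residual deviance is compared with a chi-square distribution on the residual degrees of freedom, and the drop in deviance from the intercept-only model is tested with anova(). The chi-square approximation for the goodness-of-fit test is only reasonable when the group sizes (here, confirmed) are not too small.

```r
# Simulated stand-in for county-level data (hypothetical, not county_data)
set.seed(200)
n <- 200
sim <- data.frame(confirmed = rpois(n, 50) + 5, x1 = rnorm(n), x2 = rnorm(n))
sim$deaths <- rbinom(n, size = sim$confirmed, prob = plogis(-3 + 0.5 * sim$x1))
fit <- glm(cbind(deaths, confirmed - deaths) ~ x1 + x2,
           family = binomial, data = sim)

# (1) Goodness of fit: a small p-value suggests lack of fit
p_gof <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
p_gof

# (2) Likelihood ratio test against the intercept-only model
null_fit <- update(fit, . ~ 1)
lrt <- anova(null_fit, fit, test = "Chisq")
lrt
```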

Q1.6

Perform analysis of deviance to evaluate the significance of each predictor. Display the 10 most significant predictors.
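
One way to test each predictor by analysis of deviance is drop1(), which refits the model with each term removed in turn. The sketch below uses simulated data with hypothetical predictors x1 and x2, then sorts the table by p-value.

```r
# Simulated stand-in for county-level data (hypothetical, not county_data)
set.seed(200)
n <- 200
sim <- data.frame(confirmed = rpois(n, 50) + 5, x1 = rnorm(n), x2 = rnorm(n))
sim$deaths <- rbinom(n, size = sim$confirmed, prob = plogis(-3 + 0.5 * sim$x1))
fit <- glm(cbind(deaths, confirmed - deaths) ~ x1 + x2,
           family = binomial, data = sim)

# Likelihood ratio test for dropping each term; the first row is "<none>"
tab <- drop1(fit, test = "Chisq")
tab <- tab[-1, ]

# Most significant terms first; head() keeps at most 10 rows
top_terms <- head(tab[order(tab$`Pr(>Chi)`), ], 10)
top_terms
```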

Q1.7

Construct confidence intervals of regression coefficients.
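
Two common constructions are Wald intervals (estimate plus or minus a normal quantile times the standard error) and profile likelihood intervals. The sketch below shows both on simulated data with hypothetical predictors x1 and x2. In older versions of R the profile method for glm objects is supplied by the MASS package.

```r
# Simulated stand-in for county-level data (hypothetical, not county_data)
set.seed(200)
n <- 200
sim <- data.frame(confirmed = rpois(n, 50) + 5, x1 = rnorm(n), x2 = rnorm(n))
sim$deaths <- rbinom(n, size = sim$confirmed, prob = plogis(-3 + 0.5 * sim$x1))
fit <- glm(cbind(deaths, confirmed - deaths) ~ x1 + x2,
           family = binomial, data = sim)

# Wald intervals, based on asymptotic normality of the estimates
ci_wald <- confint.default(fit, level = 0.95)

# Profile likelihood intervals
ci_prof <- confint(fit, level = 0.95)

ci_wald
ci_prof
```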

Q1.8

Plot the deviance residuals against the fitted values. Are there potential outliers?

Q1.9

Plot the half-normal plot. Are there potential outliers in predictor space?
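
A half-normal plot sorts a non-negative diagnostic and plots it against half-normal quantiles. Points far above the trend of the rest are candidates for closer inspection. The base R sketch below assumes the diagnostic of interest is the leverage (hat values), which measures unusualness in predictor space. The data are simulated and x1, x2 are hypothetical.

```r
# Simulated stand-in for county-level data (hypothetical, not county_data)
set.seed(200)
n <- 200
sim <- data.frame(confirmed = rpois(n, 50) + 5, x1 = rnorm(n), x2 = rnorm(n))
sim$deaths <- rbinom(n, size = sim$confirmed, prob = plogis(-3 + 0.5 * sim$x1))
fit <- glm(cbind(deaths, confirmed - deaths) ~ x1 + x2,
           family = binomial, data = sim)

# Leverages and the matching half-normal quantiles
h <- hatvalues(fit)
m <- length(h)
q <- qnorm((m + seq_len(m)) / (2 * m + 1))

plot(q, sort(h),
     xlab = "Half-normal quantiles", ylab = "Sorted leverages")
```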

Q1.10

Find the best sub-model using the AIC criterion.
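
One common approach is stepwise search with step(), which adds or drops terms to reduce AIC. It is a greedy search, so it is not guaranteed to find the sub-model with the smallest AIC overall. The sketch below runs backward elimination on simulated data with hypothetical predictors x1 and x2.

```r
# Simulated stand-in for county-level data (hypothetical, not county_data)
set.seed(200)
n <- 200
sim <- data.frame(confirmed = rpois(n, 50) + 5, x1 = rnorm(n), x2 = rnorm(n))
sim$deaths <- rbinom(n, size = sim$confirmed, prob = plogis(-3 + 0.5 * sim$x1))
fit <- glm(cbind(deaths, confirmed - deaths) ~ x1 + x2,
           family = binomial, data = sim)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
best <- step(fit, direction = "backward", trace = 0)
formula(best)
AIC(best)
```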

Q1.11

Find the best sub-model using the lasso with cross validation.
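
A minimal sketch of a cross-validated lasso, assuming the glmnet package is installed. For family = "binomial", glmnet accepts a two-column matrix of counts as the response, with the second column treated as the target class, so deaths go in the second column. The data are simulated and x1, x2, x3 are hypothetical predictors.

```r
library(glmnet)

# Simulated stand-in for county-level data (hypothetical, not county_data)
set.seed(200)
n <- 200
sim <- data.frame(confirmed = rpois(n, 50) + 5,
                  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$deaths <- rbinom(n, size = sim$confirmed, prob = plogis(-3 + 0.5 * sim$x1))

# Design matrix without the intercept column; factors would be dummy-coded here
x <- model.matrix(~ x1 + x2 + x3, data = sim)[, -1]
# Counts response: first column survivors, second column deaths
y <- cbind(alive = sim$confirmed - sim$deaths, dead = sim$deaths)

# alpha = 1 is the lasso penalty; folds are random, hence the seed above
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the largest lambda within one standard error of the minimum
coef(cvfit, s = "lambda.1se")
```

Predictors with nonzero coefficients at the chosen lambda form the selected sub-model. Using s = "lambda.min" instead typically keeps more predictors.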

	Diseased	Not diseased
Exposed	\(\pi_1\)	\(1 - \pi_1\)
Not exposed	\(\pi_2\)	\(1 - \pi_2\)