Biostat 200C Homework 2

Due Apr 25 @ 11:59PM

Published

April 14, 2026

To submit homework, please upload both Rmd and html files to Bruinlearn by the deadline.

1 Q1. CFR of COVID-19 (90pts)

Of primary interest to the public is the risk of dying from COVID-19. A commonly used measure is the case fatality rate/ratio/risk (CFR), which is defined as \[ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}. \] CFR is not a fixed constant; it changes with time, location, and other factors. CFR also differs from the infection fatality rate (IFR), the probability that someone infected with COVID-19 dies from it, whether or not the infection was ever diagnosed.
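As a quick numerical illustration of the definition, the sketch below computes the CFR for three counties whose counts appear in the 2020-04-04 table printed later in this document. The data frame `toy` and the names `cfr` and `pooled_cfr` are introduced here for illustration only.

```r
# Counts copied from the 2020-04-04 county table shown later in this document
toy <- data.frame(
  county    = c("Acadia, LA", "Ada, ID", "Adams, CO"),
  confirmed = c(65, 360, 294),
  deaths    = c(2, 3, 9)
)

# CFR = deaths / diagnosed (confirmed) cases, one value per county
toy$cfr <- toy$deaths / toy$confirmed
print(toy)

# Pooled CFR over the three counties: total deaths / total confirmed cases
pooled_cfr <- sum(toy$deaths) / sum(toy$confirmed)
print(pooled_cfr)
```

The pooled value weights each county by its number of cases, so it generally differs from the plain average of the county-level CFRs.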

In this exercise, we use logistic regression to study how US county-level CFR varies with demographic information and selected health, education, and economic indicators.
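One possible way to set up such a model, sketched here under stated assumptions, is to treat each county's deaths as binomial out of its confirmed cases and fit `glm()` with the default logit link. The ten rows are hand-copied from the joined table printed later in this document, `percent_65_and_over` is an arbitrary choice of predictor, and the names `toy` and `fit` are hypothetical. This only illustrates the syntax; it is not the intended solution or a meaningful fit.

```r
# Ten counties hand-copied from the joined table (confirmed, deaths, % aged 65+)
toy <- data.frame(
  confirmed = c(6, 65, 8, 360, 10, 14, 294, 16, 8, 21),
  deaths    = c(0, 2, 0, 3, 0, 0, 9, 0, 0, 0),
  percent_65_and_over = c(21.8, 15.3, 23.6, 14.4, 14.8,
                          15.9, 10.5, 18.8, 18.2, 20.4)
)

# Grouped binomial response: cbind(events, non-events) = (deaths, cases without death)
fit <- glm(cbind(deaths, confirmed - deaths) ~ percent_65_and_over,
           family = binomial, data = toy)

# Coefficients are on the log-odds scale; fitted values are estimated CFRs
print(coef(fit))
print(fitted(fit))
```

Using the two-column `cbind()` response lets counties with more confirmed cases carry more weight, which a regression on the raw ratio `deaths / confirmed` would not do by default.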

1.1 Data sources

1.2 Sample code for data preparation

Load the tidyverse package for data manipulation and visualization.

# tidyverse for data manipulation and visualization
library(tidyverse)

Read in the data of COVID-19 cases reported on 2020-04-04.

county_count <- read_csv("./datasets/04-04-2020.csv.gz") %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(FIPS = as.numeric(FIPS)) %>%
  filter(Country_Region == "US") %>%
  print(width = Inf)
# A tibble: 2,421 × 12
    FIPS Admin2    Province_State Country_Region Last_Update           Lat
   <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
 1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
 2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
 3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
 4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
 5 19001 Adair     Iowa           US             2020-04-04 23:34:21  41.3
 6 21001 Adair     Kentucky       US             2020-04-04 23:34:21  37.1
 7 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
 8 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
 9  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
10 16003 Adams     Idaho          US             2020-04-04 23:34:21  44.9
    Long_ Confirmed Deaths Recovered Active Combined_Key                 
    <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
 1  -82.5         6      0         0      0 Abbeville, South Carolina, US
 2  -92.4        65      2         0      0 Acadia, Louisiana, US        
 3  -75.6         8      0         0      0 Accomack, Virginia, US       
 4 -116.        360      3         0      0 Ada, Idaho, US               
 5  -94.5         1      0         0      0 Adair, Iowa, US              
 6  -85.3         3      0         0      0 Adair, Kentucky, US          
 7  -92.6        10      0         0      0 Adair, Missouri, US          
 8  -94.7        14      0         0      0 Adair, Oklahoma, US          
 9 -104.        294      9         0      0 Adams, Colorado, US          
10 -116.          1      0         0      0 Adams, Idaho, US             
# ℹ 2,411 more rows

Standardize the variable names by changing them to lower case.

names(county_count) <- str_to_lower(names(county_count))

Sanity check by displaying the unique US states and territories:

county_count %>%
  select(province_state) %>%
  distinct() %>%
  arrange(province_state) %>%
  print(n = Inf)
# A tibble: 58 × 1
   province_state          
   <chr>                   
 1 Alabama                 
 2 Alaska                  
 3 Arizona                 
 4 Arkansas                
 5 California              
 6 Colorado                
 7 Connecticut             
 8 Delaware                
 9 Diamond Princess        
10 District of Columbia    
11 Florida                 
12 Georgia                 
13 Grand Princess          
14 Guam                    
15 Hawaii                  
16 Idaho                   
17 Illinois                
18 Indiana                 
19 Iowa                    
20 Kansas                  
21 Kentucky                
22 Louisiana               
23 Maine                   
24 Maryland                
25 Massachusetts           
26 Michigan                
27 Minnesota               
28 Mississippi             
29 Missouri                
30 Montana                 
31 Nebraska                
32 Nevada                  
33 New Hampshire           
34 New Jersey              
35 New Mexico              
36 New York                
37 North Carolina          
38 North Dakota            
39 Northern Mariana Islands
40 Ohio                    
41 Oklahoma                
42 Oregon                  
43 Pennsylvania            
44 Puerto Rico             
45 Recovered               
46 Rhode Island            
47 South Carolina          
48 South Dakota            
49 Tennessee               
50 Texas                   
51 Utah                    
52 Vermont                 
53 Virgin Islands          
54 Virginia                
55 Washington              
56 West Virginia           
57 Wisconsin               
58 Wyoming                 

We want to exclude entries from Diamond Princess, Grand Princess, Guam, Northern Mariana Islands, Puerto Rico, Recovered, and Virgin Islands, and only consider counties from the 50 states and DC.

county_count <- county_count %>%
  filter(!(province_state %in% c("Diamond Princess", "Grand Princess", 
                                 "Recovered", "Guam", "Northern Mariana Islands", 
                                 "Puerto Rico", "Virgin Islands"))) %>%
  print(width = Inf)
# A tibble: 2,413 × 12
    fips admin2    province_state country_region last_update           lat
   <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
 1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
 2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
 3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
 4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
 5 19001 Adair     Iowa           US             2020-04-04 23:34:21  41.3
 6 21001 Adair     Kentucky       US             2020-04-04 23:34:21  37.1
 7 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
 8 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
 9  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
10 16003 Adams     Idaho          US             2020-04-04 23:34:21  44.9
    long_ confirmed deaths recovered active combined_key                 
    <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
 1  -82.5         6      0         0      0 Abbeville, South Carolina, US
 2  -92.4        65      2         0      0 Acadia, Louisiana, US        
 3  -75.6         8      0         0      0 Accomack, Virginia, US       
 4 -116.        360      3         0      0 Ada, Idaho, US               
 5  -94.5         1      0         0      0 Adair, Iowa, US              
 6  -85.3         3      0         0      0 Adair, Kentucky, US          
 7  -92.6        10      0         0      0 Adair, Missouri, US          
 8  -94.7        14      0         0      0 Adair, Oklahoma, US          
 9 -104.        294      9         0      0 Adams, Colorado, US          
10 -116.          1      0         0      0 Adams, Idaho, US             
# ℹ 2,403 more rows
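An alternative that avoids typing the exclusions by hand is to keep only the 50 states plus DC. The sketch below assumes that base R's built-in `state.name` vector (the 50 state names) uses the same spelling as `province_state`. The vectors `seen`, `keep`, and `dropped` are illustrative names, and `seen` holds only a handful of the values from the listing above.

```r
# A few of the province_state values from the listing above
seen <- c("Alabama", "Diamond Princess", "District of Columbia", "Grand Princess",
          "Guam", "Hawaii", "Northern Mariana Islands", "Puerto Rico",
          "Recovered", "Virgin Islands", "Wyoming")

# Keep-list: the 50 states (built-in state.name) plus DC
keep <- c(state.name, "District of Columbia")

# Values that a filter on `province_state %in% keep` would drop
dropped <- setdiff(seen, keep)
print(dropped)
```

In the pipeline above, the corresponding step would be `filter(province_state %in% keep)`.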

Graphically summarize the COVID-19 confirmed cases and deaths on 2020-04-04 by state.

county_count %>%
  # turn into long format for easy plotting (recovered is all zeros, so omit it)
  pivot_longer(c(confirmed, deaths), 
               names_to = "case", 
               values_to = "count") %>%
  ggplot() + 
  # geom_col stacks the county-level counts within each state
  geom_col(mapping = aes(x = province_state, y = count, fill = case)) + 
  # scale_y_log10() + 
  labs(title = "US COVID-19 Situation on 2020-04-04", x = "State") + 
  theme(axis.text.x = element_text(angle = 90))

Read in the 2020 county-level health ranking data.

county_info <- read_csv("./datasets/us-county-health-rankings-2020.csv.gz") %>%
  filter(!is.na(county)) %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(fips = as.numeric(fips)) %>%
  select(fips, 
         state,
         county,
         percent_fair_or_poor_health, 
         percent_smokers, 
         percent_adults_with_obesity, 
         # food_environment_index,
         percent_with_access_to_exercise_opportunities, 
         percent_excessive_drinking,
         # teen_birth_rate, 
         percent_uninsured,
         # primary_care_physicians_rate,
         # preventable_hospitalization_rate,
         # high_school_graduation_rate,
         percent_some_college,
         percent_unemployed,
         percent_children_in_poverty,
         # `80th_percentile_income`,
         # `20th_percentile_income`,
         percent_single_parent_households,
         # violent_crime_rate,
         percent_severe_housing_problems,
         overcrowding,
         # life_expectancy,
         # age_adjusted_death_rate,
         percent_adults_with_diabetes,
         # hiv_prevalence_rate,
         percent_food_insecure,
         # percent_limited_access_to_healthy_foods,
         percent_insufficient_sleep,
         percent_uninsured_2,
         median_household_income,
         average_traffic_volume_per_meter_of_major_roadways,
         percent_homeowners,
         # percent_severe_housing_cost_burden,
         population_2,
         percent_less_than_18_years_of_age,
         percent_65_and_over,
         percent_black,
         percent_asian,
         percent_hispanic,
         percent_female,
         percent_rural) %>%
  print(width = Inf)
# A tibble: 3,142 × 30
    fips state   county   percent_fair_or_poor_health percent_smokers
   <dbl> <chr>   <chr>                          <dbl>           <dbl>
 1  1001 Alabama Autauga                         20.9            18.1
 2  1003 Alabama Baldwin                         17.5            17.5
 3  1005 Alabama Barbour                         29.6            22.0
 4  1007 Alabama Bibb                            19.4            19.1
 5  1009 Alabama Blount                          21.7            19.2
 6  1011 Alabama Bullock                         31.0            22.9
 7  1013 Alabama Butler                          27.9            21.8
 8  1015 Alabama Calhoun                         23.1            20.6
 9  1017 Alabama Chambers                        24.0            19.4
10  1019 Alabama Cherokee                        20.7            17.5
   percent_adults_with_obesity percent_with_access_to_exercise_opportunities
                         <dbl>                                         <dbl>
 1                        33.3                                         69.1 
 2                        31                                           73.7 
 3                        41.7                                         53.2 
 4                        37.6                                         16.3 
 5                        33.8                                         15.6 
 6                        37.2                                          2.50
 7                        43.3                                         48.6 
 8                        38.5                                         47.7 
 9                        40.1                                         61.9 
10                        35                                           33.4 
   percent_excessive_drinking percent_uninsured percent_some_college
                        <dbl>             <dbl>                <dbl>
 1                       15.0              8.72                 62.0
 2                       18.0             11.3                  67.4
 3                       12.8             12.2                  34.9
 4                       15.6             10.2                  44.1
 5                       14.2             13.4                  53.4
 6                       12.1             11.4                  35.0
 7                       11.9             11.2                  41.7
 8                       13.8             11.9                  59.2
 9                       12.7             11.9                  48.5
10                       14.1             11.2                  51.8
   percent_unemployed percent_children_in_poverty
                <dbl>                       <dbl>
 1               3.63                        19.3
 2               3.62                        13.9
 3               5.17                        43.9
 4               3.97                        27.8
 5               3.51                        18  
 6               4.69                        68.3
 7               4.79                        36.3
 8               4.65                        26.5
 9               3.91                        30.7
10               3.57                        24.7
   percent_single_parent_households percent_severe_housing_problems overcrowding
                              <dbl>                           <dbl>        <dbl>
 1                             26.2                            14.7        1.20 
 2                             24.1                            13.6        1.27 
 3                             56.6                            14.6        1.69 
 4                             28.7                            10.5        0.255
 5                             28.6                            10.5        1.89 
 6                             74.8                            18.1        0.113
 7                             52.7                            13.2        1.69 
 8                             40.2                            13.7        1.54 
 9                             46.6                            16.0        4.04 
10                             23.8                            13          1.5  
   percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
                          <dbl>                 <dbl>                      <dbl>
 1                         11.1                  13.2                       35.9
 2                         10.7                  11.6                       33.3
 3                         17.6                  22                         38.6
 4                         14.5                  14.3                       38.1
 5                         17                    10.7                       35.9
 6                         23.7                  24.8                       45.0
 7                         19.2                  20.6                       41.9
 8                         17.5                  15.7                       41.3
 9                         19.9                  17.9                       37.3
10                         15.2                  12.5                       35.4
   percent_uninsured_2 median_household_income
                 <dbl>                   <dbl>
 1                11.1                   59338
 2                14.3                   57588
 3                16.1                   34382
 4                13                     46064
 5                17.1                   50412
 6                15.2                   29267
 7                14.5                   37365
 8                15.4                   45400
 9                15.2                   39917
10                13.9                   42132
   average_traffic_volume_per_meter_of_major_roadways percent_homeowners
                                                <dbl>              <dbl>
 1                                              88.5                74.9
 2                                              87.0                73.6
 3                                             102.                 61.4
 4                                              29.3                75.1
 5                                              33.4                78.6
 6                                               4.07               75.5
 7                                              19.3                69.9
 8                                             110.                 69.5
 9                                              20.3                67.8
10                                              25.9                79.0
   population_2 percent_less_than_18_years_of_age percent_65_and_over
          <dbl>                             <dbl>               <dbl>
 1        55601                              23.7                15.6
 2       218022                              21.6                20.4
 3        24881                              20.9                19.4
 4        22400                              20.5                16.5
 5        57840                              23.2                18.2
 6        10138                              21.1                16.4
 7        19680                              22.2                20.3
 8       114277                              21.6                17.7
 9        33615                              20.8                19.5
10        26032                              19.2                23.0
   percent_black percent_asian percent_hispanic percent_female percent_rural
           <dbl>         <dbl>            <dbl>          <dbl>         <dbl>
 1         19.3          1.22              2.97           51.4          42.0
 2          8.78         1.15              4.65           51.5          42.3
 3         48.0          0.454             4.28           47.2          67.8
 4         21.1          0.237             2.62           46.8          68.4
 5          1.46         0.320             9.57           50.7          90.0
 6         69.5          0.187             7.96           45.5          51.4
 7         44.6          1.32              1.51           53.4          71.2
 8         20.9          0.964             3.91           51.9          33.7
 9         39.6          1.33              2.56           52.1          49.1
10          4.24         0.338             1.62           50.5          85.7
# ℹ 3,132 more rows

For stability in estimating CFR, we restrict to counties with \(\ge 5\) confirmed cases.

county_count <- county_count %>%
  filter(confirmed >= 5)
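To see why very small case counts make the raw CFR unstable, the sketch below evaluates the binomial standard error \(\sqrt{p(1-p)/n}\) of an estimated CFR at several case counts. The true CFR of 2% is an assumed value chosen only for illustration, and `se_cfr`, `p_assumed`, and `n_cases` are hypothetical names.

```r
# Standard error of a binomial proportion estimated from n cases with true rate p
se_cfr <- function(p, n) sqrt(p * (1 - p) / n)

p_assumed <- 0.02              # assumed true CFR, for illustration only
n_cases   <- c(1, 5, 20, 200)  # hypothetical confirmed-case counts

# The standard error shrinks like 1 / sqrt(n)
print(data.frame(n = n_cases, se = se_cfr(p_assumed, n_cases)))

# With a single confirmed case the observed CFR can only be 0 or 1
print(c(0, 1) / 1)
```

The cutoff of 5 is a judgment call: it removes the most extreme small-sample ratios while keeping most counties with reported cases.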

We join the COVID-19 count data and county-level information using the FIPS (Federal Information Processing Standards) county code as the key.

county_data <- county_count %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
# A tibble: 1,466 × 41
    fips admin2    province_state country_region last_update           lat
   <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
 1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
 2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
 3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
 4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
 5 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
 6 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
 7  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
 8 28001 Adams     Mississippi    US             2020-04-04 23:34:21  31.5
 9 31001 Adams     Nebraska       US             2020-04-04 23:34:21  40.5
10 42001 Adams     Pennsylvania   US             2020-04-04 23:34:21  39.9
    long_ confirmed deaths recovered active combined_key                 
    <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
 1  -82.5         6      0         0      0 Abbeville, South Carolina, US
 2  -92.4        65      2         0      0 Acadia, Louisiana, US        
 3  -75.6         8      0         0      0 Accomack, Virginia, US       
 4 -116.        360      3         0      0 Ada, Idaho, US               
 5  -92.6        10      0         0      0 Adair, Missouri, US          
 6  -94.7        14      0         0      0 Adair, Oklahoma, US          
 7 -104.        294      9         0      0 Adams, Colorado, US          
 8  -91.4        16      0         0      0 Adams, Mississippi, US       
 9  -98.5         8      0         0      0 Adams, Nebraska, US          
10  -77.2        21      0         0      0 Adams, Pennsylvania, US      
   state          county    percent_fair_or_poor_health percent_smokers
   <chr>          <chr>                           <dbl>           <dbl>
 1 South Carolina Abbeville                        19.9            17.3
 2 Louisiana      Acadia                           20.9            21.5
 3 Virginia       Accomack                         20.1            18.3
 4 Idaho          Ada                              11.5            12.0
 5 Missouri       Adair                            21.4            20.5
 6 Oklahoma       Adair                            28.5            27.7
 7 Colorado       Adams                            16.6            16.3
 8 Mississippi    Adams                            27.3            22.2
 9 Nebraska       Adams                            15.8            14.6
10 Pennsylvania   Adams                            15.3            16.2
   percent_adults_with_obesity percent_with_access_to_exercise_opportunities
                         <dbl>                                         <dbl>
 1                        36.7                                          59.0
 2                        38.4                                          42.5
 3                        36.3                                          37.4
 4                        25.6                                          89.5
 5                        27.9                                          78.3
 6                        47.7                                          28.5
 7                        27.8                                          93.1
 8                        35.3                                          69.1
 9                        36.7                                          81.6
10                        35.6                                          60.6
   percent_excessive_drinking percent_uninsured percent_some_college
                        <dbl>             <dbl>                <dbl>
 1                       15.9             12.9                  52.5
 2                       19.8             10.7                  43.6
 3                       15.5             16.6                  45.1
 4                       17.9              8.74                 73.8
 5                       18.9             10.6                  65.3
 6                       11.8             24.5                  35.1
 7                       18.9             11.0                  57.0
 8                       12.3             15.0                  41.7
 9                       18.5              8.76                 70.8
10                       19.2              7.49                 57.3
   percent_unemployed percent_children_in_poverty
                <dbl>                       <dbl>
 1               3.98                        30.8
 2               5.37                        35.4
 3               3.81                        27  
 4               2.46                        10.2
 5               3.51                        19.9
 6               4.17                        34.9
 7               3.47                        12.6
 8               6.21                        40.4
 9               2.87                        14.4
10               3.27                        11.2
   percent_single_parent_households percent_severe_housing_problems overcrowding
                              <dbl>                           <dbl>        <dbl>
 1                             37.1                            14.3        0.463
 2                             33.4                            12.3        3.51 
 3                             45.9                            15.1        2.10 
 4                             23.8                            14.0        1.46 
 5                             29.5                            18.0        0.740
 6                             38.3                            15.4        5.65 
 7                             31.0                            18.1        5.37 
 8                             66.4                            12.8        2.37 
 9                             26.2                            10.5        0.904
10                             26.7                            12.3        1.88 
   percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
                          <dbl>                 <dbl>                      <dbl>
 1                         15.8                  15.2                       36.1
 2                         11.4                  15.1                       32.4
 3                         15.9                  14.1                       36.8
 4                          7.9                  12                         26.3
 5                          8.4                  17.5                       31.9
 6                         24.3                  19.1                       39.5
 7                          7.7                   8                         31.0
 8                         13.2                  24.7                       41.1
 9                         11                    11.7                       30.1
10                          8.5                   8.3                       34.7
   percent_uninsured_2 median_household_income
                 <dbl>                   <dbl>
 1               15.9                    42412
 2               14.0                    40484
 3               19.4                    42879
 4               11.1                    66827
 5               12.3                    40395
 6               29.6                    35156
 7               13.8                    70199
 8               18.7                    33392
 9               10.7                    55167
10                8.46                   62877
   average_traffic_volume_per_meter_of_major_roadways percent_homeowners
                                                <dbl>              <dbl>
 1                                               11.6               76.3
 2                                               63.7               70.8
 3                                               60.0               67.9
 4                                              277.                68.4
 5                                               45.8               60.0
 6                                               16.7               68.6
 7                                              490.                65.2
 8                                              150.                61.7
 9                                               53.4               68.2
10                                              113.                77.2
   population_2 percent_less_than_18_years_of_age percent_65_and_over
          <dbl>                             <dbl>               <dbl>
 1        24541                              20.1                21.8
 2        62190                              25.8                15.3
 3        32412                              20.5                23.6
 4       469966                              23.8                14.4
 5        25339                              18.4                14.8
 6        22082                              26.6                15.9
 7       511868                              26.5                10.5
 8        31192                              20.1                18.8
 9        31511                              23.7                18.2
10       102811                              20.0                20.4
   percent_black percent_asian percent_hispanic percent_female percent_rural
           <dbl>         <dbl>            <dbl>          <dbl>         <dbl>
 1        27.5           0.412             1.54           51.6         78.6 
 2        17.9           0.320             2.73           51.2         51.7 
 3        28.0           0.781             9.34           51.2        100   
 4         1.24          2.81              8.31           49.9          5.47
 5         2.85          2.28              2.57           51.9         37.9 
 6         0.534         0.802             6.82           50.1         83.3 
 7         3.19          4.37             40.4            49.5          3.62
 8        52.4           0.513            11.3            47.9         37.2 
 9         0.996         1.33             10.9            50.2         22.5 
10         1.60          0.875             7.11           50.8         53.7 
# ℹ 1,456 more rows
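Because `left_join()` keeps every row of the left table, counties whose `fips` is missing or has no match in `county_info` end up with `NA` in all joined columns, which is consistent with the summary below reporting 13 missing `fips` values and 28 missing values in each health-ranking column. A minimal sketch of that behavior, and of listing the unmatched rows with `anti_join()`, using two toy tibbles (`counts`, `info`) invented for illustration:

```r
library(dplyr)

# Toy stand-ins for county_count and county_info (case counts are made up)
counts <- tibble(fips = c(1001, 1003, NA), confirmed = c(12, 29, 7))
info   <- tibble(fips = c(1001, 1005), percent_rural = c(42.0, 67.8))

# left_join keeps all rows of `counts`; keys without a match get NA covariates
joined <- left_join(counts, info, by = "fips")
print(joined)

# Rows of `counts` that have no partner in `info`
unmatched <- anti_join(counts, info, by = "fips")
print(unmatched)
```

Running the same `anti_join()` on `county_count` and `county_info` would list the counties that need attention before modeling.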

Numerical summaries of each variable:

summary(county_data)
      fips          admin2          province_state     country_region    
 Min.   : 1001   Length:1466        Length:1466        Length:1466       
 1st Qu.:18003   Class :character   Class :character   Class :character  
 Median :29029   Mode  :character   Mode  :character   Mode  :character  
 Mean   :30076                                                           
 3rd Qu.:42077                                                           
 Max.   :90053                                                           
 NA's   :13                                                              
  last_update                       lat            long_        
 Min.   :2020-04-04 23:34:21   Min.   :19.60   Min.   :-159.60  
 1st Qu.:2020-04-04 23:34:21   1st Qu.:33.96   1st Qu.: -94.56  
 Median :2020-04-04 23:34:21   Median :38.02   Median : -86.48  
 Mean   :2020-04-04 23:34:21   Mean   :37.71   Mean   : -89.73  
 3rd Qu.:2020-04-04 23:34:21   3rd Qu.:41.38   3rd Qu.: -81.22  
 Max.   :2020-04-04 23:34:21   Max.   :64.81   Max.   : -68.65  
                               NA's   :19      NA's   :19       
   confirmed           deaths           recovered     active 
 Min.   :    5.0   Min.   :   0.000   Min.   :0   Min.   :0  
 1st Qu.:    9.0   1st Qu.:   0.000   1st Qu.:0   1st Qu.:0  
 Median :   20.0   Median :   0.000   Median :0   Median :0  
 Mean   :  208.8   Mean   :   4.842   Mean   :0   Mean   :0  
 3rd Qu.:   68.0   3rd Qu.:   2.000   3rd Qu.:0   3rd Qu.:0  
 Max.   :63306.0   Max.   :1905.000   Max.   :0   Max.   :0  
                                                             
 combined_key          state              county         
 Length:1466        Length:1466        Length:1466       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
 percent_fair_or_poor_health percent_smokers  percent_adults_with_obesity
 Min.   : 8.121              Min.   : 5.909   Min.   :12.40              
 1st Qu.:14.390              1st Qu.:14.899   1st Qu.:29.10              
 Median :17.010              Median :17.147   Median :32.95              
 Mean   :17.594              Mean   :17.153   Mean   :32.41              
 3rd Qu.:20.377              3rd Qu.:19.365   3rd Qu.:36.20              
 Max.   :38.887              Max.   :27.775   Max.   :51.00              
 NA's   :28                  NA's   :28       NA's   :28                 
 percent_with_access_to_exercise_opportunities percent_excessive_drinking
 Min.   :  0.00                                Min.   : 7.81             
 1st Qu.: 59.95                                1st Qu.:15.70             
 Median : 74.71                                Median :18.03             
 Mean   : 71.14                                Mean   :17.92             
 3rd Qu.: 85.94                                3rd Qu.:20.00             
 Max.   :100.00                                Max.   :28.62             
 NA's   :28                                    NA's   :28                
 percent_uninsured percent_some_college percent_unemployed
 Min.   : 2.263    Min.   :30.06        Min.   : 1.582    
 1st Qu.: 6.754    1st Qu.:53.24        1st Qu.: 3.252    
 Median : 9.925    Median :61.21        Median : 3.870    
 Mean   :10.583    Mean   :60.87        Mean   : 4.071    
 3rd Qu.:13.519    3rd Qu.:68.74        3rd Qu.: 4.690    
 Max.   :31.208    Max.   :90.34        Max.   :18.092    
 NA's   :28        NA's   :28           NA's   :28        
 percent_children_in_poverty percent_single_parent_households
 Min.   : 2.50               Min.   : 9.43                   
 1st Qu.:12.82               1st Qu.:27.09                   
 Median :18.40               Median :32.96                   
 Mean   :19.46               Mean   :33.84                   
 3rd Qu.:24.50               3rd Qu.:38.94                   
 Max.   :55.00               Max.   :80.00                   
 NA's   :28                  NA's   :28                      
 percent_severe_housing_problems  overcrowding    percent_adults_with_diabetes
 Min.   : 6.562                  Min.   : 0.000   Min.   : 1.800              
 1st Qu.:12.267                  1st Qu.: 1.378   1st Qu.: 9.125              
 Median :14.439                  Median : 1.962   Median :11.300              
 Mean   :15.079                  Mean   : 2.429   Mean   :11.749              
 3rd Qu.:16.976                  3rd Qu.: 2.882   3rd Qu.:13.900              
 Max.   :33.391                  Max.   :14.489   Max.   :34.100              
 NA's   :28                      NA's   :28       NA's   :28                  
 percent_food_insecure percent_insufficient_sleep percent_uninsured_2
 Min.   : 3.40         Min.   :23.03              Min.   : 2.683     
 1st Qu.:10.70         1st Qu.:31.42              1st Qu.: 7.865     
 Median :12.70         Median :34.02              Median :12.027     
 Mean   :13.26         Mean   :33.88              Mean   :12.776     
 3rd Qu.:15.20         3rd Qu.:36.56              3rd Qu.:16.541     
 Max.   :33.50         Max.   :46.71              Max.   :42.397     
 NA's   :28            NA's   :28                 NA's   :28         
 median_household_income average_traffic_volume_per_meter_of_major_roadways
 Min.   : 25385          Min.   :   0.00                                   
 1st Qu.: 46994          1st Qu.:  53.05                                   
 Median : 54317          Median : 105.00                                   
 Mean   : 57584          Mean   : 201.39                                   
 3rd Qu.: 64754          3rd Qu.: 206.92                                   
 Max.   :140382          Max.   :4444.12                                   
 NA's   :28              NA's   :28                                        
 percent_homeowners  population_2      percent_less_than_18_years_of_age
 Min.   :24.13      Min.   :    2887   Min.   : 7.069                   
 1st Qu.:64.34      1st Qu.:   36502   1st Qu.:20.321                   
 Median :69.98      Median :   75478   Median :22.182                   
 Mean   :68.98      Mean   :  202450   Mean   :22.197                   
 3rd Qu.:74.78      3rd Qu.:  180031   3rd Qu.:24.002                   
 Max.   :89.76      Max.   :10105518   Max.   :35.447                   
 NA's   :28         NA's   :28         NA's   :28                       
 percent_65_and_over percent_black     percent_asian      percent_hispanic 
 Min.   : 7.722      Min.   : 0.1286   Min.   : 0.06245   Min.   : 0.7952  
 1st Qu.:14.927      1st Qu.: 1.6174   1st Qu.: 0.68249   1st Qu.: 2.9419  
 Median :17.222      Median : 5.6397   Median : 1.23421   Median : 5.5939  
 Mean   :17.516      Mean   :12.4178   Mean   : 2.40412   Mean   :10.0011  
 3rd Qu.:19.598      3rd Qu.:17.5931   3rd Qu.: 2.67550   3rd Qu.:11.0564  
 Max.   :57.587      Max.   :81.9544   Max.   :42.95231   Max.   :96.3596  
 NA's   :28          NA's   :28        NA's   :28         NA's   :28       
 percent_female  percent_rural   
 Min.   :34.63   Min.   :  0.00  
 1st Qu.:50.00   1st Qu.: 17.11  
 Median :50.66   Median : 36.97  
 Mean   :50.46   Mean   : 40.11  
 3rd Qu.:51.35   3rd Qu.: 60.00  
 Max.   :56.87   Max.   :100.00  
 NA's   :28      NA's   :28      

List rows in county_count that don’t have a match in county_info:

county_data %>%
  filter(is.na(state) & is.na(county)) %>%
  print(n = Inf)
# A tibble: 28 × 41
    fips admin2   province_state country_region last_update           lat  long_
   <dbl> <chr>    <chr>          <chr>          <dttm>              <dbl>  <dbl>
 1    NA DeKalb   Tennessee      US             2020-04-04 23:34:21  36.0  -85.8
 2    NA DeSoto   Florida        US             2020-04-04 23:34:21  27.2  -81.8
 3    NA Dukes a… Massachusetts  US             2020-04-04 23:34:21  41.4  -70.7
 4    NA Fillmore Minnesota      US             2020-04-04 23:34:21  43.7  -92.1
 5    NA Kansas … Missouri       US             2020-04-04 23:34:21  39.1  -94.6
 6    NA LaSalle  Illinois       US             2020-04-04 23:34:21  41.3  -88.9
 7    NA Manassas Virginia       US             2020-04-04 23:34:21  38.7  -77.5
 8    NA McDuffie Georgia        US             2020-04-04 23:34:21  33.5  -82.5
 9    NA Out of … Michigan       US             2020-04-04 23:34:21  NA     NA  
10    NA Out of … Tennessee      US             2020-04-04 23:34:21  NA     NA  
11 90005 Unassig… Arkansas       US             2020-04-04 23:34:21  NA     NA  
12 90008 Unassig… Colorado       US             2020-04-04 23:34:21  NA     NA  
13 90009 Unassig… Connecticut    US             2020-04-04 23:34:21  NA     NA  
14 90013 Unassig… Georgia        US             2020-04-04 23:34:21  NA     NA  
15 90015 Unassig… Hawaii         US             2020-04-04 23:34:21  NA     NA  
16 90017 Unassig… Illinois       US             2020-04-04 23:34:21  NA     NA  
17 90021 Unassig… Kentucky       US             2020-04-04 23:34:21  NA     NA  
18    NA Unassig… Louisiana      US             2020-04-04 23:34:21  NA     NA  
19 90023 Unassig… Maine          US             2020-04-04 23:34:21  NA     NA  
20 90025 Unassig… Massachusetts  US             2020-04-04 23:34:21  NA     NA  
21    NA Unassig… Michigan       US             2020-04-04 23:34:21  NA     NA  
22 90032 Unassig… Nevada         US             2020-04-04 23:34:21  NA     NA  
23 90034 Unassig… New Jersey     US             2020-04-04 23:34:21  NA     NA  
24 90044 Unassig… Rhode Island   US             2020-04-04 23:34:21  NA     NA  
25 90047 Unassig… Tennessee      US             2020-04-04 23:34:21  NA     NA  
26 90050 Unassig… Vermont        US             2020-04-04 23:34:21  NA     NA  
27 90053 Unassig… Washington     US             2020-04-04 23:34:21  NA     NA  
28    NA Weber    Utah           US             2020-04-04 23:34:21  41.3 -112. 
# ℹ 34 more variables: confirmed <dbl>, deaths <dbl>, recovered <dbl>,
#   active <dbl>, combined_key <chr>, state <chr>, county <chr>,
#   percent_fair_or_poor_health <dbl>, percent_smokers <dbl>,
#   percent_adults_with_obesity <dbl>,
#   percent_with_access_to_exercise_opportunities <dbl>,
#   percent_excessive_drinking <dbl>, percent_uninsured <dbl>,
#   percent_some_college <dbl>, percent_unemployed <dbl>, …
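
The same check can be made directly on the two input tables with an anti-join, which keeps the rows of the first table that have no partner in the second. The following is a minimal sketch on made-up toy tibbles; `counts` and `info` are hypothetical stand-ins for county_count and county_info, not the real data.

```r
library(dplyr)

# Toy stand-ins for county_count and county_info (values are illustrative only)
counts <- tibble(
  fips   = c(45001, NA, 90005),
  admin2 = c("Abbeville", "DeKalb", "Unassigned")
)
info <- tibble(
  fips   = c(45001, 47041),
  county = c("Abbeville", "DeKalb")
)

# Rows of counts whose fips has no partner in info
unmatched <- anti_join(counts, info, by = "fips")
unmatched
```

Unlike filtering on is.na(state) after a left join, this does not rely on state and county being non-missing in the lookup table.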

Some rows in county_count are missing fips:

county_count %>%
  filter(is.na(fips)) %>%
  select(fips, admin2, province_state) %>%
  print(n = Inf)
# A tibble: 13 × 3
    fips admin2              province_state
   <dbl> <chr>               <chr>         
 1    NA DeKalb              Tennessee     
 2    NA DeSoto              Florida       
 3    NA Dukes and Nantucket Massachusetts 
 4    NA Fillmore            Minnesota     
 5    NA Kansas City         Missouri      
 6    NA LaSalle             Illinois      
 7    NA Manassas            Virginia      
 8    NA McDuffie            Georgia       
 9    NA Out of MI           Michigan      
10    NA Out of TN           Tennessee     
11    NA Unassigned          Louisiana     
12    NA Unassigned          Michigan      
13    NA Weber               Utah          

We need to (1) manually set the fips for some counties, (2) discard rows that still lack a fips or are labeled Unassigned or Out of, and (3) join with county_info again.

county_data <- county_count %>%
  # manually set FIPS for some counties
  mutate(fips = ifelse(admin2 == "DeKalb" & province_state == "Tennessee", 47041, fips)) %>%
  mutate(fips = ifelse(admin2 == "DeSoto" & province_state == "Florida", 12027, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Dona Ana" & province_state == "New Mexico", 35013, fips)) %>% 
  mutate(fips = ifelse(admin2 == "Dukes and Nantucket" & province_state == "Massachusetts", 25019, fips)) %>% 
  mutate(fips = ifelse(admin2 == "Fillmore" & province_state == "Minnesota", 27045, fips)) %>%  
  #mutate(fips = ifelse(admin2 == "Harris" & province_state == "Texas", 48201, fips)) %>%  
  #mutate(fips = ifelse(admin2 == "Kenai Peninsula" & province_state == "Alaska", 2122, fips)) %>%  
  mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Illinois", 17099, fips)) %>%
  #mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Louisiana", 22059, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Lac qui Parle" & province_state == "Minnesota", 27073, fips)) %>%  
  mutate(fips = ifelse(admin2 == "Manassas" & province_state == "Virginia", 51683, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Matanuska-Susitna" & province_state == "Alaska", 2170, fips)) %>%
  mutate(fips = ifelse(admin2 == "McDuffie" & province_state == "Georgia", 13189, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McIntosh" & province_state == "Georgia", 13191, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McKean" & province_state == "Pennsylvania", 42083, fips)) %>%
  mutate(fips = ifelse(admin2 == "Weber" & province_state == "Utah", 49057, fips)) %>%
  filter(!(is.na(fips) | str_detect(admin2, "Out of") | str_detect(admin2, "Unassigned"))) %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
# A tibble: 1,446 × 41
    fips admin2    province_state country_region last_update           lat
   <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
 1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
 2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
 3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
 4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
 5 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
 6 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
 7  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
 8 28001 Adams     Mississippi    US             2020-04-04 23:34:21  31.5
 9 31001 Adams     Nebraska       US             2020-04-04 23:34:21  40.5
10 42001 Adams     Pennsylvania   US             2020-04-04 23:34:21  39.9
    long_ confirmed deaths recovered active combined_key                 
    <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
 1  -82.5         6      0         0      0 Abbeville, South Carolina, US
 2  -92.4        65      2         0      0 Acadia, Louisiana, US        
 3  -75.6         8      0         0      0 Accomack, Virginia, US       
 4 -116.        360      3         0      0 Ada, Idaho, US               
 5  -92.6        10      0         0      0 Adair, Missouri, US          
 6  -94.7        14      0         0      0 Adair, Oklahoma, US          
 7 -104.        294      9         0      0 Adams, Colorado, US          
 8  -91.4        16      0         0      0 Adams, Mississippi, US       
 9  -98.5         8      0         0      0 Adams, Nebraska, US          
10  -77.2        21      0         0      0 Adams, Pennsylvania, US      
   state          county    percent_fair_or_poor_health percent_smokers
   <chr>          <chr>                           <dbl>           <dbl>
 1 South Carolina Abbeville                        19.9            17.3
 2 Louisiana      Acadia                           20.9            21.5
 3 Virginia       Accomack                         20.1            18.3
 4 Idaho          Ada                              11.5            12.0
 5 Missouri       Adair                            21.4            20.5
 6 Oklahoma       Adair                            28.5            27.7
 7 Colorado       Adams                            16.6            16.3
 8 Mississippi    Adams                            27.3            22.2
 9 Nebraska       Adams                            15.8            14.6
10 Pennsylvania   Adams                            15.3            16.2
   percent_adults_with_obesity percent_with_access_to_exercise_opportunities
                         <dbl>                                         <dbl>
 1                        36.7                                          59.0
 2                        38.4                                          42.5
 3                        36.3                                          37.4
 4                        25.6                                          89.5
 5                        27.9                                          78.3
 6                        47.7                                          28.5
 7                        27.8                                          93.1
 8                        35.3                                          69.1
 9                        36.7                                          81.6
10                        35.6                                          60.6
   percent_excessive_drinking percent_uninsured percent_some_college
                        <dbl>             <dbl>                <dbl>
 1                       15.9             12.9                  52.5
 2                       19.8             10.7                  43.6
 3                       15.5             16.6                  45.1
 4                       17.9              8.74                 73.8
 5                       18.9             10.6                  65.3
 6                       11.8             24.5                  35.1
 7                       18.9             11.0                  57.0
 8                       12.3             15.0                  41.7
 9                       18.5              8.76                 70.8
10                       19.2              7.49                 57.3
   percent_unemployed percent_children_in_poverty
                <dbl>                       <dbl>
 1               3.98                        30.8
 2               5.37                        35.4
 3               3.81                        27  
 4               2.46                        10.2
 5               3.51                        19.9
 6               4.17                        34.9
 7               3.47                        12.6
 8               6.21                        40.4
 9               2.87                        14.4
10               3.27                        11.2
   percent_single_parent_households percent_severe_housing_problems overcrowding
                              <dbl>                           <dbl>        <dbl>
 1                             37.1                            14.3        0.463
 2                             33.4                            12.3        3.51 
 3                             45.9                            15.1        2.10 
 4                             23.8                            14.0        1.46 
 5                             29.5                            18.0        0.740
 6                             38.3                            15.4        5.65 
 7                             31.0                            18.1        5.37 
 8                             66.4                            12.8        2.37 
 9                             26.2                            10.5        0.904
10                             26.7                            12.3        1.88 
   percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
                          <dbl>                 <dbl>                      <dbl>
 1                         15.8                  15.2                       36.1
 2                         11.4                  15.1                       32.4
 3                         15.9                  14.1                       36.8
 4                          7.9                  12                         26.3
 5                          8.4                  17.5                       31.9
 6                         24.3                  19.1                       39.5
 7                          7.7                   8                         31.0
 8                         13.2                  24.7                       41.1
 9                         11                    11.7                       30.1
10                          8.5                   8.3                       34.7
   percent_uninsured_2 median_household_income
                 <dbl>                   <dbl>
 1               15.9                    42412
 2               14.0                    40484
 3               19.4                    42879
 4               11.1                    66827
 5               12.3                    40395
 6               29.6                    35156
 7               13.8                    70199
 8               18.7                    33392
 9               10.7                    55167
10                8.46                   62877
   average_traffic_volume_per_meter_of_major_roadways percent_homeowners
                                                <dbl>              <dbl>
 1                                               11.6               76.3
 2                                               63.7               70.8
 3                                               60.0               67.9
 4                                              277.                68.4
 5                                               45.8               60.0
 6                                               16.7               68.6
 7                                              490.                65.2
 8                                              150.                61.7
 9                                               53.4               68.2
10                                              113.                77.2
   population_2 percent_less_than_18_years_of_age percent_65_and_over
          <dbl>                             <dbl>               <dbl>
 1        24541                              20.1                21.8
 2        62190                              25.8                15.3
 3        32412                              20.5                23.6
 4       469966                              23.8                14.4
 5        25339                              18.4                14.8
 6        22082                              26.6                15.9
 7       511868                              26.5                10.5
 8        31192                              20.1                18.8
 9        31511                              23.7                18.2
10       102811                              20.0                20.4
   percent_black percent_asian percent_hispanic percent_female percent_rural
           <dbl>         <dbl>            <dbl>          <dbl>         <dbl>
 1        27.5           0.412             1.54           51.6         78.6 
 2        17.9           0.320             2.73           51.2         51.7 
 3        28.0           0.781             9.34           51.2        100   
 4         1.24          2.81              8.31           49.9          5.47
 5         2.85          2.28              2.57           51.9         37.9 
 6         0.534         0.802             6.82           50.1         83.3 
 7         3.19          4.37             40.4            49.5          3.62
 8        52.4           0.513            11.3            47.9         37.2 
 9         0.996         1.33             10.9            50.2         22.5 
10         1.60          0.875             7.11           50.8         53.7 
# ℹ 1,436 more rows
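
The chain of mutate() calls above could also be written as a small patch table that is joined in and applied with coalesce(). This is only a sketch of that alternative on toy data: `fips_patch`, `fips_fix`, and `counts` are hypothetical names, and only three of the hand-set codes are repeated here.

```r
library(dplyr)

# Hypothetical patch table; the codes are three of those set by hand above
fips_patch <- tribble(
  ~admin2,  ~province_state, ~fips_fix,
  "DeKalb", "Tennessee",     47041,
  "DeSoto", "Florida",       12027,
  "Weber",  "Utah",          49057
)

# Toy stand-in for county_count (values are illustrative only)
counts <- tibble(
  fips           = c(45001, NA, NA, NA),
  admin2         = c("Abbeville", "DeKalb", "Weber", "Unassigned"),
  province_state = c("South Carolina", "Tennessee", "Utah", "Michigan")
)

patched <- counts %>%
  left_join(fips_patch, by = c("admin2", "province_state")) %>%
  # keep the reported fips when present, otherwise fall back to the patch
  mutate(fips = coalesce(fips, fips_fix)) %>%
  select(-fips_fix) %>%
  # drop rows that still lack a fips or are not real counties
  filter(!is.na(fips), !grepl("Out of|Unassigned", admin2))
patched
```

One difference from the ifelse() version: coalesce() only fills in missing fips and never overwrites a reported one.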

Summarize again to confirm that no missing values remain:

summary(county_data)
      fips          admin2          province_state     country_region    
 Min.   : 1001   Length:1446        Length:1446        Length:1446       
 1st Qu.:17186   Class :character   Class :character   Class :character  
 Median :28156   Mode  :character   Mode  :character   Mode  :character  
 Mean   :29455                                                           
 3rd Qu.:42048                                                           
 Max.   :56039                                                           
  last_update                       lat            long_        
 Min.   :2020-04-04 23:34:21   Min.   :19.60   Min.   :-159.60  
 1st Qu.:2020-04-04 23:34:21   1st Qu.:33.96   1st Qu.: -94.52  
 Median :2020-04-04 23:34:21   Median :38.02   Median : -86.48  
 Mean   :2020-04-04 23:34:21   Mean   :37.71   Mean   : -89.73  
 3rd Qu.:2020-04-04 23:34:21   3rd Qu.:41.39   3rd Qu.: -81.21  
 Max.   :2020-04-04 23:34:21   Max.   :64.81   Max.   : -68.65  
   confirmed           deaths           recovered     active 
 Min.   :    5.0   Min.   :   0.000   Min.   :0   Min.   :0  
 1st Qu.:    9.0   1st Qu.:   0.000   1st Qu.:0   1st Qu.:0  
 Median :   20.0   Median :   0.000   Median :0   Median :0  
 Mean   :  207.2   Mean   :   4.854   Mean   :0   Mean   :0  
 3rd Qu.:   66.0   3rd Qu.:   2.000   3rd Qu.:0   3rd Qu.:0  
 Max.   :63306.0   Max.   :1905.000   Max.   :0   Max.   :0  
 combined_key          state              county         
 Length:1446        Length:1446        Length:1446       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
 percent_fair_or_poor_health percent_smokers  percent_adults_with_obesity
 Min.   : 8.121              Min.   : 5.909   Min.   :12.40              
 1st Qu.:14.390              1st Qu.:14.899   1st Qu.:29.10              
 Median :17.010              Median :17.143   Median :32.90              
 Mean   :17.594              Mean   :17.151   Mean   :32.39              
 3rd Qu.:20.398              3rd Qu.:19.365   3rd Qu.:36.20              
 Max.   :38.887              Max.   :27.775   Max.   :51.00              
 percent_with_access_to_exercise_opportunities percent_excessive_drinking
 Min.   :  0.00                                Min.   : 7.81             
 1st Qu.: 59.95                                1st Qu.:15.68             
 Median : 74.71                                Median :18.03             
 Mean   : 71.15                                Mean   :17.91             
 3rd Qu.: 85.97                                3rd Qu.:20.01             
 Max.   :100.00                                Max.   :28.62             
 percent_uninsured percent_some_college percent_unemployed
 Min.   : 2.263    Min.   :21.14        Min.   : 1.582    
 1st Qu.: 6.754    1st Qu.:53.21        1st Qu.: 3.252    
 Median : 9.937    Median :61.19        Median : 3.870    
 Mean   :10.592    Mean   :60.83        Mean   : 4.071    
 3rd Qu.:13.527    3rd Qu.:68.72        3rd Qu.: 4.690    
 Max.   :31.208    Max.   :90.34        Max.   :18.092    
 percent_children_in_poverty percent_single_parent_households
 Min.   : 2.50               Min.   : 9.43                   
 1st Qu.:12.82               1st Qu.:27.07                   
 Median :18.40               Median :32.96                   
 Mean   :19.46               Mean   :33.83                   
 3rd Qu.:24.50               3rd Qu.:38.93                   
 Max.   :55.00               Max.   :80.00                   
 percent_severe_housing_problems  overcrowding    percent_adults_with_diabetes
 Min.   : 6.562                  Min.   : 0.000   Min.   : 1.80               
 1st Qu.:12.267                  1st Qu.: 1.379   1st Qu.: 9.10               
 Median :14.439                  Median : 1.971   Median :11.30               
 Mean   :15.082                  Mean   : 2.437   Mean   :11.75               
 3rd Qu.:16.992                  3rd Qu.: 2.887   3rd Qu.:13.90               
 Max.   :33.391                  Max.   :14.489   Max.   :34.10               
 percent_food_insecure percent_insufficient_sleep percent_uninsured_2
 Min.   : 3.40         Min.   :23.03              Min.   : 2.683     
 1st Qu.:10.70         1st Qu.:31.42              1st Qu.: 7.865     
 Median :12.70         Median :34.02              Median :12.027     
 Mean   :13.25         Mean   :33.88              Mean   :12.786     
 3rd Qu.:15.20         3rd Qu.:36.54              3rd Qu.:16.572     
 Max.   :33.50         Max.   :46.71              Max.   :42.397     
 median_household_income average_traffic_volume_per_meter_of_major_roadways
 Min.   : 25385          Min.   :   0.00                                   
 1st Qu.: 46994          1st Qu.:  53.09                                   
 Median : 54317          Median : 104.63                                   
 Mean   : 57600          Mean   : 200.72                                   
 3rd Qu.: 64775          3rd Qu.: 206.78                                   
 Max.   :140382          Max.   :4444.12                                   
 percent_homeowners  population_2      percent_less_than_18_years_of_age
 Min.   :24.13      Min.   :    2887   Min.   : 7.069                   
 1st Qu.:64.36      1st Qu.:   36275   1st Qu.:20.326                   
 Median :69.96      Median :   75382   Median :22.182                   
 Mean   :68.99      Mean   :  201689   Mean   :22.204                   
 3rd Qu.:74.77      3rd Qu.:  179982   3rd Qu.:24.019                   
 Max.   :89.76      Max.   :10105518   Max.   :35.447                   
 percent_65_and_over percent_black     percent_asian      percent_hispanic 
 Min.   : 7.722      Min.   : 0.1286   Min.   : 0.06245   Min.   : 0.7952  
 1st Qu.:14.913      1st Qu.: 1.6168   1st Qu.: 0.68228   1st Qu.: 2.9451  
 Median :17.225      Median : 5.6397   Median : 1.22863   Median : 5.6100  
 Mean   :17.512      Mean   :12.4056   Mean   : 2.40009   Mean   :10.0338  
 3rd Qu.:19.598      3rd Qu.:17.4904   3rd Qu.: 2.66813   3rd Qu.:11.1199  
 Max.   :57.587      Max.   :81.9544   Max.   :42.95231   Max.   :96.3596  
 percent_female  percent_rural   
 Min.   :34.63   Min.   :  0.00  
 1st Qu.:50.00   1st Qu.: 17.11  
 Median :50.65   Median : 36.97  
 Mean   :50.46   Mean   : 40.12  
 3rd Qu.:51.35   3rd Qu.: 60.04  
 Max.   :56.87   Max.   :100.00  

If there are variables with missing values for many counties, we go back and remove those variables from consideration.
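
One way to make this check explicit is to count missing values per column and keep only the columns at or below a chosen threshold. The sketch below uses a made-up toy tibble; `toy` and the cutoff `max_missing` are assumptions for illustration, not part of the assignment.

```r
library(dplyr)

# Toy data with a few missing values (illustrative only)
toy <- tibble(
  confirmed       = c(6, 65, 8, 360),
  percent_smokers = c(17.3, NA, 18.3, NA),
  percent_rural   = c(78.6, 51.7, NA, 5.47)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Hypothetical cutoff: keep columns with at most this many missing values
max_missing <- 1
keep <- names(na_counts)[na_counts <= max_missing]
trimmed <- select(toy, all_of(keep))
names(trimmed)
```

In the summary above no column reports NA's, so nothing would be dropped from county_data at this step.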

Let’s create a final data frame for analysis.

county_data <- county_data %>%
  mutate(state = as.factor(state)) %>%
  select(county, confirmed, deaths, state, percent_fair_or_poor_health:percent_rural)
summary(county_data)
    county            confirmed           deaths                    state     
 Length:1446        Min.   :    5.0   Min.   :   0.000   Georgia       :  96  
 Class :character   1st Qu.:    9.0   1st Qu.:   0.000   Texas         :  80  
 Mode  :character   Median :   20.0   Median :   0.000   North Carolina:  63  
                    Mean   :  207.2   Mean   :   4.854   Mississippi   :  61  
                    3rd Qu.:   66.0   3rd Qu.:   2.000   Indiana       :  58  
                    Max.   :63306.0   Max.   :1905.000   Ohio          :  57  
                                                         (Other)       :1031  
 percent_fair_or_poor_health percent_smokers  percent_adults_with_obesity
 Min.   : 8.121              Min.   : 5.909   Min.   :12.40              
 1st Qu.:14.390              1st Qu.:14.899   1st Qu.:29.10              
 Median :17.010              Median :17.143   Median :32.90              
 Mean   :17.594              Mean   :17.151   Mean   :32.39              
 3rd Qu.:20.398              3rd Qu.:19.365   3rd Qu.:36.20              
 Max.   :38.887              Max.   :27.775   Max.   :51.00              
                                                                         
 percent_with_access_to_exercise_opportunities percent_excessive_drinking
 Min.   :  0.00                                Min.   : 7.81             
 1st Qu.: 59.95                                1st Qu.:15.68             
 Median : 74.71                                Median :18.03             
 Mean   : 71.15                                Mean   :17.91             
 3rd Qu.: 85.97                                3rd Qu.:20.01             
 Max.   :100.00                                Max.   :28.62             
                                                                         
 percent_uninsured percent_some_college percent_unemployed
 Min.   : 2.263    Min.   :21.14        Min.   : 1.582    
 1st Qu.: 6.754    1st Qu.:53.21        1st Qu.: 3.252    
 Median : 9.937    Median :61.19        Median : 3.870    
 Mean   :10.592    Mean   :60.83        Mean   : 4.071    
 3rd Qu.:13.527    3rd Qu.:68.72        3rd Qu.: 4.690    
 Max.   :31.208    Max.   :90.34        Max.   :18.092    
                                                          
 percent_children_in_poverty percent_single_parent_households
 Min.   : 2.50               Min.   : 9.43                   
 1st Qu.:12.82               1st Qu.:27.07                   
 Median :18.40               Median :32.96                   
 Mean   :19.46               Mean   :33.83                   
 3rd Qu.:24.50               3rd Qu.:38.93                   
 Max.   :55.00               Max.   :80.00                   
                                                             
 percent_severe_housing_problems  overcrowding    percent_adults_with_diabetes
 Min.   : 6.562                  Min.   : 0.000   Min.   : 1.80               
 1st Qu.:12.267                  1st Qu.: 1.379   1st Qu.: 9.10               
 Median :14.439                  Median : 1.971   Median :11.30               
 Mean   :15.082                  Mean   : 2.437   Mean   :11.75               
 3rd Qu.:16.992                  3rd Qu.: 2.887   3rd Qu.:13.90               
 Max.   :33.391                  Max.   :14.489   Max.   :34.10               
                                                                              
 percent_food_insecure percent_insufficient_sleep percent_uninsured_2
 Min.   : 3.40         Min.   :23.03              Min.   : 2.683     
 1st Qu.:10.70         1st Qu.:31.42              1st Qu.: 7.865     
 Median :12.70         Median :34.02              Median :12.027     
 Mean   :13.25         Mean   :33.88              Mean   :12.786     
 3rd Qu.:15.20         3rd Qu.:36.54              3rd Qu.:16.572     
 Max.   :33.50         Max.   :46.71              Max.   :42.397     
                                                                     
 median_household_income average_traffic_volume_per_meter_of_major_roadways
 Min.   : 25385          Min.   :   0.00                                   
 1st Qu.: 46994          1st Qu.:  53.09                                   
 Median : 54317          Median : 104.63                                   
 Mean   : 57600          Mean   : 200.72                                   
 3rd Qu.: 64775          3rd Qu.: 206.78                                   
 Max.   :140382          Max.   :4444.12                                   
                                                                           
 percent_homeowners  population_2      percent_less_than_18_years_of_age
 Min.   :24.13      Min.   :    2887   Min.   : 7.069                   
 1st Qu.:64.36      1st Qu.:   36275   1st Qu.:20.326                   
 Median :69.96      Median :   75382   Median :22.182                   
 Mean   :68.99      Mean   :  201689   Mean   :22.204                   
 3rd Qu.:74.77      3rd Qu.:  179982   3rd Qu.:24.019                   
 Max.   :89.76      Max.   :10105518   Max.   :35.447                   
                                                                        
 percent_65_and_over percent_black     percent_asian      percent_hispanic 
 Min.   : 7.722      Min.   : 0.1286   Min.   : 0.06245   Min.   : 0.7952  
 1st Qu.:14.913      1st Qu.: 1.6168   1st Qu.: 0.68228   1st Qu.: 2.9451  
 Median :17.225      Median : 5.6397   Median : 1.22863   Median : 5.6100  
 Mean   :17.512      Mean   :12.4056   Mean   : 2.40009   Mean   :10.0338  
 3rd Qu.:19.598      3rd Qu.:17.4904   3rd Qu.: 2.66813   3rd Qu.:11.1199  
 Max.   :57.587      Max.   :81.9544   Max.   :42.95231   Max.   :96.3596  
                                                                           
 percent_female  percent_rural   
 Min.   :34.63   Min.   :  0.00  
 1st Qu.:50.00   1st Qu.: 17.11  
 Median :50.65   Median : 36.97  
 Mean   :50.46   Mean   : 40.12  
 3rd Qu.:51.35   3rd Qu.: 60.04  
 Max.   :56.87   Max.   :100.00  
                                 

Display the 10 counties with the highest CFR. Because top_n() keeps ties at the cutoff, more than 10 rows are returned.

county_data %>%
  mutate(cfr = deaths / confirmed) %>%
  select(county, state, confirmed, deaths, cfr) %>%
  arrange(desc(cfr)) %>%
  top_n(10)
# A tibble: 18 × 5
   county         state          confirmed deaths   cfr
   <chr>          <fct>              <dbl>  <dbl> <dbl>
 1 Emmet          Michigan               7      2 0.286
 2 Grand Traverse Michigan              12      3 0.25 
 3 Toole          Montana               12      3 0.25 
 4 Fayette        Indiana               14      3 0.214
 5 Concordia      Louisiana              5      1 0.2  
 6 Harrison       Texas                  5      1 0.2  
 7 Huntington     Indiana                5      1 0.2  
 8 Isabella       Michigan              10      2 0.2  
 9 McDuffie       Georgia                5      1 0.2  
10 Navarro        Texas                  5      1 0.2  
11 Orange         Indiana                5      1 0.2  
12 Perry          Pennsylvania           5      1 0.2  
13 Randolph       Indiana                5      1 0.2  
14 Rockingham     North Carolina         5      1 0.2  
15 Seneca         Ohio                   5      1 0.2  
16 Toombs         Georgia                5      1 0.2  
17 Vigo           Indiana               10      2 0.2  
18 Washington     Alabama                5      1 0.2  
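
The listing above has 18 rows rather than 10 because top_n() keeps every row tied at the cutoff value. As a minimal sketch on a made-up tibble (the `toy` data and its values are hypothetical, not taken from county_data), dplyr's slice_max() with with_ties = FALSE returns exactly the requested number of rows, breaking ties by row order:

```r
library(dplyr)

# Hypothetical toy data: three rows tie at cfr = 0.2
toy <- tibble(
  county    = c("A", "B", "C", "D", "E"),
  confirmed = c(7, 5, 5, 10, 100),
  deaths    = c(2, 1, 1, 2, 1)
) %>%
  mutate(cfr = deaths / confirmed)

# top_n() keeps all rows tied with the 2nd-largest value, so 4 rows come back
with_ties <- toy %>% top_n(2, cfr)

# slice_max() with with_ties = FALSE returns exactly 2 rows
no_ties <- toy %>% slice_max(cfr, n = 2, with_ties = FALSE)

nrow(with_ties)
nrow(no_ties)
```

Whether to keep or drop ties is a judgment call; with many counties at 1 death out of 5 cases, dropping ties makes the cutoff depend on row order.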

Write final data into a csv file for future use.

write_csv(county_data, "./datasets/covid19-county-data-20200404.csv.gz")

1.3 Note:

Given that the data were collected in the middle of the pandemic, which assumptions behind the CFR might be violated when it is defined as deaths/confirmed in this data set?

Because the COVID-19 pandemic was still ongoing in 2020, some critical assumptions for defining CFR are not met by this data set.

  1. The number of confirmed cases does not reflect the true number of diagnosed people, mainly because of limited testing availability.

  2. Some confirmed cases may die later, so the deaths counted on this date understate the eventual deaths among these cases.

Acknowledging these severe limitations, we continue to use deaths/confirmed as a very rough proxy for CFR.

1.4 Q1.1 (5pts)

Read and run the above code to generate a data frame county_data that includes county-level COVID-19 confirmed cases and deaths, along with demographic and health-related information.

1.5 Q1.2 (5pts)

What assumptions of logistic regression may be violated by this data set?

1.6 Q1.3 (10pts)

Run a logistic regression, using variables state, …, percent_rural as predictors.
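
One way to specify a logistic regression for grouped counts in R is a two-column response of successes and failures. The following is only a syntax sketch on a made-up data frame (`toy`, `x`, and `g` are hypothetical names and values); applying the same pattern to county_data with deaths and confirmed is an assumption about the intended approach, not a statement of the required solution.

```r
# Hypothetical toy data: deaths out of confirmed cases in six groups
toy <- data.frame(
  confirmed = c(50, 80, 120, 60, 200, 90),
  deaths    = c(2, 5, 4, 6, 9, 3),
  x         = c(10.2, 14.1, 12.5, 18.3, 13.0, 11.7),
  g         = factor(c("a", "a", "b", "b", "c", "c"))
)

# Binomial GLM with the default logit link;
# the response is cbind(successes, failures)
fit <- glm(cbind(deaths, confirmed - deaths) ~ g + x,
           family = binomial, data = toy)

summary(fit)
```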

1.7 Q1.4 (10pts)

Interpret the regression coefficients of 3 significant predictors with p-value <0.05.

1.8 Q1.5 (10pts)

Apply analysis of deviance to (1) evaluate the goodness of fit of the model and (2) compare the model to the intercept-only model.

1.9 Q1.6 (10pts)

Perform analysis of deviance to evaluate the significance of each predictor. Display the 10 most significant predictors.

1.10 Q1.7 (5pts)

Construct confidence intervals of regression coefficients.

1.11 Q1.8 (5pts)

Plot the deviance residuals against the fitted values. Are there potential outliers?

1.12 Q1.9 (5pts)

Plot the half-normal plot. Are there potential outliers in predictor space?

1.13 Q1.10 (10pts)

Find the best sub-model using the AIC criterion.

1.14 Q1.11 (15pts)

Find the best sub-model using the lasso with cross validation.

2 Q2. Odds ratios (20pts)

Consider a \(2 \times 2\) contingency table from a prospective study in which people who were or were not exposed to some pollutant are followed up and, after several years, categorized according to the presence or absence of a disease. The following table shows the probabilities for each cell. The odds of disease for either exposure group is \(O_i = \pi_i / (1 - \pi_i)\), for \(i = 1,2\), and so the odds ratio \[ \phi = \frac{O_1}{O_2} = \frac{\pi_1(1 - \pi_2)}{\pi_2 (1 - \pi_1)} \] is a measure of the relative likelihood of disease for the exposed and not exposed groups.

              Diseased     Not diseased
Exposed       \(\pi_1\)    \(1 - \pi_1\)
Not exposed   \(\pi_2\)    \(1 - \pi_2\)
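
As a worked illustration with made-up values (the probabilities 0.2 and 0.1 are assumptions chosen for the arithmetic, not data from any study), the odds and the odds ratio can be computed in R as follows:

```r
# Hypothetical disease probabilities for the exposed and not exposed groups
pi1 <- 0.2
pi2 <- 0.1

# Odds of disease in each group
odds1 <- pi1 / (1 - pi1)   # 0.25
odds2 <- pi2 / (1 - pi2)   # 0.1111...

# Odds ratio, computed two equivalent ways
phi     <- odds1 / odds2
phi_alt <- pi1 * (1 - pi2) / (pi2 * (1 - pi1))

phi       # 2.25
```

With these values the odds of disease are 2.25 times higher in the exposed group; \(\phi = 1\) would correspond to equal odds in the two groups.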

2.1 Q2.1 (10pts)

For the simple logistic model \[ \pi_i = \frac{e^{\beta_i}}{1 + e^{\beta_i}}, \] show that if there is no difference between the exposed and not exposed groups (i.e., \(\beta_1 = \beta_2\)), then \(\phi = 1\).

2.2 Q2.2 (10pts)

Consider \(J\) \(2 \times 2\) tables, one for each level \(x_j\) of a factor, such as age group, with \(j=1,\ldots, J\). For the logistic model \[ \pi_{ij} = \frac{e^{\alpha_i + \beta_i x_j}}{1 + e^{\alpha_i + \beta_i x_j}}, \quad i = 1,2, \quad j= 1,\ldots, J. \] Show that \(\log \phi\) is constant over all tables if \(\beta_1 = \beta_2\).

3 Q3. ELMR Chapter 4 Exercise 3 (30pts)