# tidyverse for data manipulation and visualization
library(tidyverse)
Biostat 200C Homework 2
Due Apr 25 @ 11:59PM
To submit homework, please upload both Rmd and html files to Bruinlearn by the deadline.
1 Q1. CFR of COVID-19 (90pts)
Of primary interest to the public is the risk of dying from COVID-19. A commonly used measure is the case fatality rate/ratio/risk (CFR), which is defined as \[ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}. \] CFR is not a fixed constant; it changes with time, location, and other factors. CFR also differs from the infection fatality rate (IFR), the probability that someone infected with COVID-19 dies from it.
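As a quick worked example of the definition, the sketch below computes the CFR for three counties, using the confirmed and death counts that appear in the 2020-04-04 output printed in the data preparation section. The vector names are illustrative only.

```r
# Counts copied from the 2020-04-04 county table printed in this document
confirmed <- c(acadia_la = 65, ada_id = 360, adams_co = 294)
deaths    <- c(acadia_la = 2,  ada_id = 3,   adams_co = 9)

# CFR = number of deaths / number of diagnosed cases
cfr <- deaths / confirmed
round(cfr, 4)
```

Counties with few confirmed cases give very coarse CFR estimates, which is one reason the data preparation applies a minimum case count.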
In this exercise, we use logistic regression to study how US county-level CFR changes with demographic information and with selected health, education, and economic indicators.
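A minimal sketch of how a logistic regression for grouped counts can be set up in R is given below. It assumes the response is supplied as a two-column matrix of (deaths, confirmed minus deaths). The ten rows are copied from the joined table printed later in this document, and the single predictor is an arbitrary choice for showing the syntax, not the model the homework asks for.

```r
# Ten counties copied from the joined table printed in this document;
# far too few rows for real inference, shown only to illustrate the syntax
toy <- data.frame(
  confirmed           = c(6, 65, 8, 360, 10, 14, 294, 16, 8, 21),
  deaths              = c(0, 2, 0, 3, 0, 0, 9, 0, 0, 0),
  percent_65_and_over = c(21.8, 15.3, 23.6, 14.4, 14.8, 15.9, 10.5, 18.8, 18.2, 20.4)
)

# Binomial GLM with logit link: each county contributes
# `deaths` successes and `confirmed - deaths` failures
fit <- glm(cbind(deaths, confirmed - deaths) ~ percent_65_and_over,
           family = binomial, data = toy)

coef(fit)
```

Using the two-column response weights each county by its number of confirmed cases, so a county with 360 cases carries more information than one with 6.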
1.1 Data sources
04-04-2020.csv.gz: The data on COVID-19 confirmed cases and deaths on 2020-04-04 is retrieved from the Johns Hopkins COVID-19 data repository. It was downloaded from this link (commit 0174f38). This repository has been archived by the owner on Mar 10, 2023. It is now read-only. You can download data from box: https://ucla.box.com/s/brb3vz4nwoq8pjkcutxncymqw583d39l
us-county-health-rankings-2020.csv.gz: The 2020 County Health Ranking Data was released by County Health Rankings. The data was downloaded from the Kaggle Uncover COVID-19 Challenge (version 1). You can download data from box: https://ucla.box.com/s/brb3vz4nwoq8pjkcutxncymqw583d39l
1.2 Sample code for data preparation
Load the tidyverse package for data manipulation and visualization.
Read in the data of COVID-19 cases reported on 2020-04-04.
county_count <- read_csv("./datasets/04-04-2020.csv.gz") %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(FIPS = as.numeric(FIPS)) %>%
  filter(Country_Region == "US") %>%
  print(width = Inf)
# A tibble: 2,421 × 12
FIPS Admin2 Province_State Country_Region Last_Update Lat
<dbl> <chr> <chr> <chr> <dttm> <dbl>
1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
5 19001 Adair Iowa US 2020-04-04 23:34:21 41.3
6 21001 Adair Kentucky US 2020-04-04 23:34:21 37.1
7 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
8 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
9 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
10 16003 Adams Idaho US 2020-04-04 23:34:21 44.9
Long_ Confirmed Deaths Recovered Active Combined_Key
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 -82.5 6 0 0 0 Abbeville, South Carolina, US
2 -92.4 65 2 0 0 Acadia, Louisiana, US
3 -75.6 8 0 0 0 Accomack, Virginia, US
4 -116. 360 3 0 0 Ada, Idaho, US
5 -94.5 1 0 0 0 Adair, Iowa, US
6 -85.3 3 0 0 0 Adair, Kentucky, US
7 -92.6 10 0 0 0 Adair, Missouri, US
8 -94.7 14 0 0 0 Adair, Oklahoma, US
9 -104. 294 9 0 0 Adams, Colorado, US
10 -116. 1 0 0 0 Adams, Idaho, US
# ℹ 2,411 more rows
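The cast of FIPS to a number matters because the join key must have the same type and value in both tables. The sketch below, which assumes FIPS codes may arrive as zero-padded strings, shows that the numeric cast drops the leading zero (consistent with the value 8001 shown for Adams, Colorado above) and how a padded string could be recovered if needed.

```r
# Hypothetical zero-padded FIPS strings
fips_chr <- c("08001", "45001")

# Casting to numeric drops the leading zero
fips_num <- as.numeric(fips_chr)
fips_num

# Recover a five-character, zero-padded code from the numeric value
fips_padded <- sprintf("%05.0f", fips_num)
fips_padded
```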
Standardize the variable names by changing them to lower case.
names(county_count) <- str_to_lower(names(county_count))
Sanity check by displaying the unique US states and territories:
county_count %>%
  select(province_state) %>%
  distinct() %>%
  arrange(province_state) %>%
  print(n = Inf)
# A tibble: 58 × 1
province_state
<chr>
1 Alabama
2 Alaska
3 Arizona
4 Arkansas
5 California
6 Colorado
7 Connecticut
8 Delaware
9 Diamond Princess
10 District of Columbia
11 Florida
12 Georgia
13 Grand Princess
14 Guam
15 Hawaii
16 Idaho
17 Illinois
18 Indiana
19 Iowa
20 Kansas
21 Kentucky
22 Louisiana
23 Maine
24 Maryland
25 Massachusetts
26 Michigan
27 Minnesota
28 Mississippi
29 Missouri
30 Montana
31 Nebraska
32 Nevada
33 New Hampshire
34 New Jersey
35 New Mexico
36 New York
37 North Carolina
38 North Dakota
39 Northern Mariana Islands
40 Ohio
41 Oklahoma
42 Oregon
43 Pennsylvania
44 Puerto Rico
45 Recovered
46 Rhode Island
47 South Carolina
48 South Dakota
49 Tennessee
50 Texas
51 Utah
52 Vermont
53 Virgin Islands
54 Virginia
55 Washington
56 West Virginia
57 Wisconsin
58 Wyoming
We want to exclude entries from Diamond Princess, Grand Princess, Guam, Northern Mariana Islands, Puerto Rico, Recovered, and Virgin Islands, and only consider counties from the 50 states and DC.
county_count <- county_count %>%
  filter(!(province_state %in% c("Diamond Princess", "Grand Princess",
                                 "Recovered", "Guam", "Northern Mariana Islands",
                                 "Puerto Rico", "Virgin Islands"))) %>%
  print(width = Inf)
# A tibble: 2,413 × 12
fips admin2 province_state country_region last_update lat
<dbl> <chr> <chr> <chr> <dttm> <dbl>
1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
5 19001 Adair Iowa US 2020-04-04 23:34:21 41.3
6 21001 Adair Kentucky US 2020-04-04 23:34:21 37.1
7 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
8 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
9 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
10 16003 Adams Idaho US 2020-04-04 23:34:21 44.9
long_ confirmed deaths recovered active combined_key
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 -82.5 6 0 0 0 Abbeville, South Carolina, US
2 -92.4 65 2 0 0 Acadia, Louisiana, US
3 -75.6 8 0 0 0 Accomack, Virginia, US
4 -116. 360 3 0 0 Ada, Idaho, US
5 -94.5 1 0 0 0 Adair, Iowa, US
6 -85.3 3 0 0 0 Adair, Kentucky, US
7 -92.6 10 0 0 0 Adair, Missouri, US
8 -94.7 14 0 0 0 Adair, Oklahoma, US
9 -104. 294 9 0 0 Adams, Colorado, US
10 -116. 1 0 0 0 Adams, Idaho, US
# ℹ 2,403 more rows
Graphically summarize the COVID-19 confirmed cases and deaths on 2020-04-04 by state.
county_count %>%
  # turn into long format for easy plotting
  pivot_longer(confirmed:recovered,
               names_to = "case",
               values_to = "count") %>%
  group_by(province_state) %>%
  ggplot() +
  geom_col(mapping = aes(x = province_state, y = `count`, fill = `case`)) +
  # scale_y_log10() +
  labs(title = "US COVID-19 Situation on 2020-04-04", x = "State") +
  theme(axis.text.x = element_text(angle = 90))
Read in the 2020 county-level health ranking data.
county_info <- read_csv("./datasets/us-county-health-rankings-2020.csv.gz") %>%
  filter(!is.na(county)) %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(fips = as.numeric(fips)) %>%
  select(fips,
         state,
         county,
         percent_fair_or_poor_health,
         percent_smokers,
         percent_adults_with_obesity,
         # food_environment_index,
         percent_with_access_to_exercise_opportunities,
         percent_excessive_drinking,
         # teen_birth_rate,
         percent_uninsured,
         # primary_care_physicians_rate,
         # preventable_hospitalization_rate,
         # high_school_graduation_rate,
         percent_some_college,
         percent_unemployed,
         percent_children_in_poverty,
         # `80th_percentile_income`,
         # `20th_percentile_income`,
         percent_single_parent_households,
         # violent_crime_rate,
         percent_severe_housing_problems,
         overcrowding,
         # life_expectancy,
         # age_adjusted_death_rate,
         percent_adults_with_diabetes,
         # hiv_prevalence_rate,
         percent_food_insecure,
         # percent_limited_access_to_healthy_foods,
         percent_insufficient_sleep,
         percent_uninsured_2,
         median_household_income,
         average_traffic_volume_per_meter_of_major_roadways,
         percent_homeowners,
         # percent_severe_housing_cost_burden,
         population_2,
         percent_less_than_18_years_of_age,
         percent_65_and_over,
         percent_black,
         percent_asian,
         percent_hispanic,
         percent_female,
         percent_rural) %>%
  print(width = Inf)
# A tibble: 3,142 × 30
fips state county percent_fair_or_poor_health percent_smokers
<dbl> <chr> <chr> <dbl> <dbl>
1 1001 Alabama Autauga 20.9 18.1
2 1003 Alabama Baldwin 17.5 17.5
3 1005 Alabama Barbour 29.6 22.0
4 1007 Alabama Bibb 19.4 19.1
5 1009 Alabama Blount 21.7 19.2
6 1011 Alabama Bullock 31.0 22.9
7 1013 Alabama Butler 27.9 21.8
8 1015 Alabama Calhoun 23.1 20.6
9 1017 Alabama Chambers 24.0 19.4
10 1019 Alabama Cherokee 20.7 17.5
percent_adults_with_obesity percent_with_access_to_exercise_opportunities
<dbl> <dbl>
1 33.3 69.1
2 31 73.7
3 41.7 53.2
4 37.6 16.3
5 33.8 15.6
6 37.2 2.50
7 43.3 48.6
8 38.5 47.7
9 40.1 61.9
10 35 33.4
percent_excessive_drinking percent_uninsured percent_some_college
<dbl> <dbl> <dbl>
1 15.0 8.72 62.0
2 18.0 11.3 67.4
3 12.8 12.2 34.9
4 15.6 10.2 44.1
5 14.2 13.4 53.4
6 12.1 11.4 35.0
7 11.9 11.2 41.7
8 13.8 11.9 59.2
9 12.7 11.9 48.5
10 14.1 11.2 51.8
percent_unemployed percent_children_in_poverty
<dbl> <dbl>
1 3.63 19.3
2 3.62 13.9
3 5.17 43.9
4 3.97 27.8
5 3.51 18
6 4.69 68.3
7 4.79 36.3
8 4.65 26.5
9 3.91 30.7
10 3.57 24.7
percent_single_parent_households percent_severe_housing_problems overcrowding
<dbl> <dbl> <dbl>
1 26.2 14.7 1.20
2 24.1 13.6 1.27
3 56.6 14.6 1.69
4 28.7 10.5 0.255
5 28.6 10.5 1.89
6 74.8 18.1 0.113
7 52.7 13.2 1.69
8 40.2 13.7 1.54
9 46.6 16.0 4.04
10 23.8 13 1.5
percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
<dbl> <dbl> <dbl>
1 11.1 13.2 35.9
2 10.7 11.6 33.3
3 17.6 22 38.6
4 14.5 14.3 38.1
5 17 10.7 35.9
6 23.7 24.8 45.0
7 19.2 20.6 41.9
8 17.5 15.7 41.3
9 19.9 17.9 37.3
10 15.2 12.5 35.4
percent_uninsured_2 median_household_income
<dbl> <dbl>
1 11.1 59338
2 14.3 57588
3 16.1 34382
4 13 46064
5 17.1 50412
6 15.2 29267
7 14.5 37365
8 15.4 45400
9 15.2 39917
10 13.9 42132
average_traffic_volume_per_meter_of_major_roadways percent_homeowners
<dbl> <dbl>
1 88.5 74.9
2 87.0 73.6
3 102. 61.4
4 29.3 75.1
5 33.4 78.6
6 4.07 75.5
7 19.3 69.9
8 110. 69.5
9 20.3 67.8
10 25.9 79.0
population_2 percent_less_than_18_years_of_age percent_65_and_over
<dbl> <dbl> <dbl>
1 55601 23.7 15.6
2 218022 21.6 20.4
3 24881 20.9 19.4
4 22400 20.5 16.5
5 57840 23.2 18.2
6 10138 21.1 16.4
7 19680 22.2 20.3
8 114277 21.6 17.7
9 33615 20.8 19.5
10 26032 19.2 23.0
percent_black percent_asian percent_hispanic percent_female percent_rural
<dbl> <dbl> <dbl> <dbl> <dbl>
1 19.3 1.22 2.97 51.4 42.0
2 8.78 1.15 4.65 51.5 42.3
3 48.0 0.454 4.28 47.2 67.8
4 21.1 0.237 2.62 46.8 68.4
5 1.46 0.320 9.57 50.7 90.0
6 69.5 0.187 7.96 45.5 51.4
7 44.6 1.32 1.51 53.4 71.2
8 20.9 0.964 3.91 51.9 33.7
9 39.6 1.33 2.56 52.1 49.1
10 4.24 0.338 1.62 50.5 85.7
# ℹ 3,132 more rows
For stability in estimating CFR, we restrict to counties with \(\ge 5\) confirmed cases.
county_count <- county_count %>%
  filter(confirmed >= 5)
We join the COVID-19 count data and county-level information using FIPS (Federal Information Processing Standards) codes as the key.
county_data <- county_count %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
# A tibble: 1,466 × 41
fips admin2 province_state country_region last_update lat
<dbl> <chr> <chr> <chr> <dttm> <dbl>
1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
5 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
6 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
7 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
8 28001 Adams Mississippi US 2020-04-04 23:34:21 31.5
9 31001 Adams Nebraska US 2020-04-04 23:34:21 40.5
10 42001 Adams Pennsylvania US 2020-04-04 23:34:21 39.9
long_ confirmed deaths recovered active combined_key
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 -82.5 6 0 0 0 Abbeville, South Carolina, US
2 -92.4 65 2 0 0 Acadia, Louisiana, US
3 -75.6 8 0 0 0 Accomack, Virginia, US
4 -116. 360 3 0 0 Ada, Idaho, US
5 -92.6 10 0 0 0 Adair, Missouri, US
6 -94.7 14 0 0 0 Adair, Oklahoma, US
7 -104. 294 9 0 0 Adams, Colorado, US
8 -91.4 16 0 0 0 Adams, Mississippi, US
9 -98.5 8 0 0 0 Adams, Nebraska, US
10 -77.2 21 0 0 0 Adams, Pennsylvania, US
state county percent_fair_or_poor_health percent_smokers
<chr> <chr> <dbl> <dbl>
1 South Carolina Abbeville 19.9 17.3
2 Louisiana Acadia 20.9 21.5
3 Virginia Accomack 20.1 18.3
4 Idaho Ada 11.5 12.0
5 Missouri Adair 21.4 20.5
6 Oklahoma Adair 28.5 27.7
7 Colorado Adams 16.6 16.3
8 Mississippi Adams 27.3 22.2
9 Nebraska Adams 15.8 14.6
10 Pennsylvania Adams 15.3 16.2
percent_adults_with_obesity percent_with_access_to_exercise_opportunities
<dbl> <dbl>
1 36.7 59.0
2 38.4 42.5
3 36.3 37.4
4 25.6 89.5
5 27.9 78.3
6 47.7 28.5
7 27.8 93.1
8 35.3 69.1
9 36.7 81.6
10 35.6 60.6
percent_excessive_drinking percent_uninsured percent_some_college
<dbl> <dbl> <dbl>
1 15.9 12.9 52.5
2 19.8 10.7 43.6
3 15.5 16.6 45.1
4 17.9 8.74 73.8
5 18.9 10.6 65.3
6 11.8 24.5 35.1
7 18.9 11.0 57.0
8 12.3 15.0 41.7
9 18.5 8.76 70.8
10 19.2 7.49 57.3
percent_unemployed percent_children_in_poverty
<dbl> <dbl>
1 3.98 30.8
2 5.37 35.4
3 3.81 27
4 2.46 10.2
5 3.51 19.9
6 4.17 34.9
7 3.47 12.6
8 6.21 40.4
9 2.87 14.4
10 3.27 11.2
percent_single_parent_households percent_severe_housing_problems overcrowding
<dbl> <dbl> <dbl>
1 37.1 14.3 0.463
2 33.4 12.3 3.51
3 45.9 15.1 2.10
4 23.8 14.0 1.46
5 29.5 18.0 0.740
6 38.3 15.4 5.65
7 31.0 18.1 5.37
8 66.4 12.8 2.37
9 26.2 10.5 0.904
10 26.7 12.3 1.88
percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
<dbl> <dbl> <dbl>
1 15.8 15.2 36.1
2 11.4 15.1 32.4
3 15.9 14.1 36.8
4 7.9 12 26.3
5 8.4 17.5 31.9
6 24.3 19.1 39.5
7 7.7 8 31.0
8 13.2 24.7 41.1
9 11 11.7 30.1
10 8.5 8.3 34.7
percent_uninsured_2 median_household_income
<dbl> <dbl>
1 15.9 42412
2 14.0 40484
3 19.4 42879
4 11.1 66827
5 12.3 40395
6 29.6 35156
7 13.8 70199
8 18.7 33392
9 10.7 55167
10 8.46 62877
average_traffic_volume_per_meter_of_major_roadways percent_homeowners
<dbl> <dbl>
1 11.6 76.3
2 63.7 70.8
3 60.0 67.9
4 277. 68.4
5 45.8 60.0
6 16.7 68.6
7 490. 65.2
8 150. 61.7
9 53.4 68.2
10 113. 77.2
population_2 percent_less_than_18_years_of_age percent_65_and_over
<dbl> <dbl> <dbl>
1 24541 20.1 21.8
2 62190 25.8 15.3
3 32412 20.5 23.6
4 469966 23.8 14.4
5 25339 18.4 14.8
6 22082 26.6 15.9
7 511868 26.5 10.5
8 31192 20.1 18.8
9 31511 23.7 18.2
10 102811 20.0 20.4
percent_black percent_asian percent_hispanic percent_female percent_rural
<dbl> <dbl> <dbl> <dbl> <dbl>
1 27.5 0.412 1.54 51.6 78.6
2 17.9 0.320 2.73 51.2 51.7
3 28.0 0.781 9.34 51.2 100
4 1.24 2.81 8.31 49.9 5.47
5 2.85 2.28 2.57 51.9 37.9
6 0.534 0.802 6.82 50.1 83.3
7 3.19 4.37 40.4 49.5 3.62
8 52.4 0.513 11.3 47.9 37.2
9 0.996 1.33 10.9 50.2 22.5
10 1.60 0.875 7.11 50.8 53.7
# ℹ 1,456 more rows
Numerical summaries of each variable:
summary(county_data)
fips admin2 province_state country_region
Min. : 1001 Length:1466 Length:1466 Length:1466
1st Qu.:18003 Class :character Class :character Class :character
Median :29029 Mode :character Mode :character Mode :character
Mean :30076
3rd Qu.:42077
Max. :90053
NA's :13
last_update lat long_
Min. :2020-04-04 23:34:21 Min. :19.60 Min. :-159.60
1st Qu.:2020-04-04 23:34:21 1st Qu.:33.96 1st Qu.: -94.56
Median :2020-04-04 23:34:21 Median :38.02 Median : -86.48
Mean :2020-04-04 23:34:21 Mean :37.71 Mean : -89.73
3rd Qu.:2020-04-04 23:34:21 3rd Qu.:41.38 3rd Qu.: -81.22
Max. :2020-04-04 23:34:21 Max. :64.81 Max. : -68.65
NA's :19 NA's :19
confirmed deaths recovered active
Min. : 5.0 Min. : 0.000 Min. :0 Min. :0
1st Qu.: 9.0 1st Qu.: 0.000 1st Qu.:0 1st Qu.:0
Median : 20.0 Median : 0.000 Median :0 Median :0
Mean : 208.8 Mean : 4.842 Mean :0 Mean :0
3rd Qu.: 68.0 3rd Qu.: 2.000 3rd Qu.:0 3rd Qu.:0
Max. :63306.0 Max. :1905.000 Max. :0 Max. :0
combined_key state county
Length:1466 Length:1466 Length:1466
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
Min. : 8.121 Min. : 5.909 Min. :12.40
1st Qu.:14.390 1st Qu.:14.899 1st Qu.:29.10
Median :17.010 Median :17.147 Median :32.95
Mean :17.594 Mean :17.153 Mean :32.41
3rd Qu.:20.377 3rd Qu.:19.365 3rd Qu.:36.20
Max. :38.887 Max. :27.775 Max. :51.00
NA's :28 NA's :28 NA's :28
percent_with_access_to_exercise_opportunities percent_excessive_drinking
Min. : 0.00 Min. : 7.81
1st Qu.: 59.95 1st Qu.:15.70
Median : 74.71 Median :18.03
Mean : 71.14 Mean :17.92
3rd Qu.: 85.94 3rd Qu.:20.00
Max. :100.00 Max. :28.62
NA's :28 NA's :28
percent_uninsured percent_some_college percent_unemployed
Min. : 2.263 Min. :30.06 Min. : 1.582
1st Qu.: 6.754 1st Qu.:53.24 1st Qu.: 3.252
Median : 9.925 Median :61.21 Median : 3.870
Mean :10.583 Mean :60.87 Mean : 4.071
3rd Qu.:13.519 3rd Qu.:68.74 3rd Qu.: 4.690
Max. :31.208 Max. :90.34 Max. :18.092
NA's :28 NA's :28 NA's :28
percent_children_in_poverty percent_single_parent_households
Min. : 2.50 Min. : 9.43
1st Qu.:12.82 1st Qu.:27.09
Median :18.40 Median :32.96
Mean :19.46 Mean :33.84
3rd Qu.:24.50 3rd Qu.:38.94
Max. :55.00 Max. :80.00
NA's :28 NA's :28
percent_severe_housing_problems overcrowding percent_adults_with_diabetes
Min. : 6.562 Min. : 0.000 Min. : 1.800
1st Qu.:12.267 1st Qu.: 1.378 1st Qu.: 9.125
Median :14.439 Median : 1.962 Median :11.300
Mean :15.079 Mean : 2.429 Mean :11.749
3rd Qu.:16.976 3rd Qu.: 2.882 3rd Qu.:13.900
Max. :33.391 Max. :14.489 Max. :34.100
NA's :28 NA's :28 NA's :28
percent_food_insecure percent_insufficient_sleep percent_uninsured_2
Min. : 3.40 Min. :23.03 Min. : 2.683
1st Qu.:10.70 1st Qu.:31.42 1st Qu.: 7.865
Median :12.70 Median :34.02 Median :12.027
Mean :13.26 Mean :33.88 Mean :12.776
3rd Qu.:15.20 3rd Qu.:36.56 3rd Qu.:16.541
Max. :33.50 Max. :46.71 Max. :42.397
NA's :28 NA's :28 NA's :28
median_household_income average_traffic_volume_per_meter_of_major_roadways
Min. : 25385 Min. : 0.00
1st Qu.: 46994 1st Qu.: 53.05
Median : 54317 Median : 105.00
Mean : 57584 Mean : 201.39
3rd Qu.: 64754 3rd Qu.: 206.92
Max. :140382 Max. :4444.12
NA's :28 NA's :28
percent_homeowners population_2 percent_less_than_18_years_of_age
Min. :24.13 Min. : 2887 Min. : 7.069
1st Qu.:64.34 1st Qu.: 36502 1st Qu.:20.321
Median :69.98 Median : 75478 Median :22.182
Mean :68.98 Mean : 202450 Mean :22.197
3rd Qu.:74.78 3rd Qu.: 180031 3rd Qu.:24.002
Max. :89.76 Max. :10105518 Max. :35.447
NA's :28 NA's :28 NA's :28
percent_65_and_over percent_black percent_asian percent_hispanic
Min. : 7.722 Min. : 0.1286 Min. : 0.06245 Min. : 0.7952
1st Qu.:14.927 1st Qu.: 1.6174 1st Qu.: 0.68249 1st Qu.: 2.9419
Median :17.222 Median : 5.6397 Median : 1.23421 Median : 5.5939
Mean :17.516 Mean :12.4178 Mean : 2.40412 Mean :10.0011
3rd Qu.:19.598 3rd Qu.:17.5931 3rd Qu.: 2.67550 3rd Qu.:11.0564
Max. :57.587 Max. :81.9544 Max. :42.95231 Max. :96.3596
NA's :28 NA's :28 NA's :28 NA's :28
percent_female percent_rural
Min. :34.63 Min. : 0.00
1st Qu.:50.00 1st Qu.: 17.11
Median :50.66 Median : 36.97
Mean :50.46 Mean : 40.11
3rd Qu.:51.35 3rd Qu.: 60.00
Max. :56.87 Max. :100.00
NA's :28 NA's :28
List rows in county_data that don’t have a match in county_info:
county_data %>%
  filter(is.na(state) & is.na(county)) %>%
  print(n = Inf)
# A tibble: 28 × 41
fips admin2 province_state country_region last_update lat long_
<dbl> <chr> <chr> <chr> <dttm> <dbl> <dbl>
1 NA DeKalb Tennessee US 2020-04-04 23:34:21 36.0 -85.8
2 NA DeSoto Florida US 2020-04-04 23:34:21 27.2 -81.8
3 NA Dukes a… Massachusetts US 2020-04-04 23:34:21 41.4 -70.7
4 NA Fillmore Minnesota US 2020-04-04 23:34:21 43.7 -92.1
5 NA Kansas … Missouri US 2020-04-04 23:34:21 39.1 -94.6
6 NA LaSalle Illinois US 2020-04-04 23:34:21 41.3 -88.9
7 NA Manassas Virginia US 2020-04-04 23:34:21 38.7 -77.5
8 NA McDuffie Georgia US 2020-04-04 23:34:21 33.5 -82.5
9 NA Out of … Michigan US 2020-04-04 23:34:21 NA NA
10 NA Out of … Tennessee US 2020-04-04 23:34:21 NA NA
11 90005 Unassig… Arkansas US 2020-04-04 23:34:21 NA NA
12 90008 Unassig… Colorado US 2020-04-04 23:34:21 NA NA
13 90009 Unassig… Connecticut US 2020-04-04 23:34:21 NA NA
14 90013 Unassig… Georgia US 2020-04-04 23:34:21 NA NA
15 90015 Unassig… Hawaii US 2020-04-04 23:34:21 NA NA
16 90017 Unassig… Illinois US 2020-04-04 23:34:21 NA NA
17 90021 Unassig… Kentucky US 2020-04-04 23:34:21 NA NA
18 NA Unassig… Louisiana US 2020-04-04 23:34:21 NA NA
19 90023 Unassig… Maine US 2020-04-04 23:34:21 NA NA
20 90025 Unassig… Massachusetts US 2020-04-04 23:34:21 NA NA
21 NA Unassig… Michigan US 2020-04-04 23:34:21 NA NA
22 90032 Unassig… Nevada US 2020-04-04 23:34:21 NA NA
23 90034 Unassig… New Jersey US 2020-04-04 23:34:21 NA NA
24 90044 Unassig… Rhode Island US 2020-04-04 23:34:21 NA NA
25 90047 Unassig… Tennessee US 2020-04-04 23:34:21 NA NA
26 90050 Unassig… Vermont US 2020-04-04 23:34:21 NA NA
27 90053 Unassig… Washington US 2020-04-04 23:34:21 NA NA
28 NA Weber Utah US 2020-04-04 23:34:21 41.3 -112.
# ℹ 34 more variables: confirmed <dbl>, deaths <dbl>, recovered <dbl>,
# active <dbl>, combined_key <chr>, state <chr>, county <chr>,
# percent_fair_or_poor_health <dbl>, percent_smokers <dbl>,
# percent_adults_with_obesity <dbl>,
# percent_with_access_to_exercise_opportunities <dbl>,
# percent_excessive_drinking <dbl>, percent_uninsured <dbl>,
# percent_some_college <dbl>, percent_unemployed <dbl>, …
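An alternative way to list unmatched rows, sketched here on small hypothetical tables rather than the homework data, is `anti_join()`, which returns the rows of the first table whose key has no match in the second. The table names `counts` and `info` are placeholders.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for county_count and county_info
counts <- tibble(
  fips   = c(45001, 90005, NA),
  admin2 = c("Abbeville", "Unassigned", "DeKalb")
)
info <- tibble(
  fips   = c(45001, 22001),
  county = c("Abbeville", "Acadia")
)

# Rows of `counts` whose fips does not appear in `info`
unmatched <- anti_join(counts, info, by = "fips")
unmatched
```

This avoids relying on `NA` values in the joined columns as the signal of a failed match, which could be ambiguous if those columns can be missing for other reasons.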
Some of these rows are missing fips in county_count:
county_count %>%
  filter(is.na(fips)) %>%
  select(fips, admin2, province_state) %>%
  print(n = Inf)
# A tibble: 13 × 3
fips admin2 province_state
<dbl> <chr> <chr>
1 NA DeKalb Tennessee
2 NA DeSoto Florida
3 NA Dukes and Nantucket Massachusetts
4 NA Fillmore Minnesota
5 NA Kansas City Missouri
6 NA LaSalle Illinois
7 NA Manassas Virginia
8 NA McDuffie Georgia
9 NA Out of MI Michigan
10 NA Out of TN Tennessee
11 NA Unassigned Louisiana
12 NA Unassigned Michigan
13 NA Weber Utah
We need to (1) manually set the fips for some counties, (2) discard rows labeled Unassigned or Out of, along with any rows still missing fips, and (3) join with county_info again.
county_data <- county_count %>%
  # manually set FIPS for some counties
  mutate(fips = ifelse(admin2 == "DeKalb" & province_state == "Tennessee", 47041, fips)) %>%
  mutate(fips = ifelse(admin2 == "DeSoto" & province_state == "Florida", 12027, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Dona Ana" & province_state == "New Mexico", 35013, fips)) %>%
  mutate(fips = ifelse(admin2 == "Dukes and Nantucket" & province_state == "Massachusetts", 25019, fips)) %>%
  mutate(fips = ifelse(admin2 == "Fillmore" & province_state == "Minnesota", 27045, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Harris" & province_state == "Texas", 48201, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Kenai Peninsula" & province_state == "Alaska", 2122, fips)) %>%
  mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Illinois", 17099, fips)) %>%
  #mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Louisiana", 22059, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Lac qui Parle" & province_state == "Minnesota", 27073, fips)) %>%
  mutate(fips = ifelse(admin2 == "Manassas" & province_state == "Virginia", 51683, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Matanuska-Susitna" & province_state == "Alaska", 2170, fips)) %>%
  mutate(fips = ifelse(admin2 == "McDuffie" & province_state == "Georgia", 13189, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McIntosh" & province_state == "Georgia", 13191, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McKean" & province_state == "Pennsylvania", 42083, fips)) %>%
  mutate(fips = ifelse(admin2 == "Weber" & province_state == "Utah", 49057, fips)) %>%
  filter(!(is.na(fips) | str_detect(admin2, "Out of") | str_detect(admin2, "Unassigned"))) %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
# A tibble: 1,446 × 41
fips admin2 province_state country_region last_update lat
<dbl> <chr> <chr> <chr> <dttm> <dbl>
1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
5 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
6 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
7 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
8 28001 Adams Mississippi US 2020-04-04 23:34:21 31.5
9 31001 Adams Nebraska US 2020-04-04 23:34:21 40.5
10 42001 Adams Pennsylvania US 2020-04-04 23:34:21 39.9
long_ confirmed deaths recovered active combined_key
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 -82.5 6 0 0 0 Abbeville, South Carolina, US
2 -92.4 65 2 0 0 Acadia, Louisiana, US
3 -75.6 8 0 0 0 Accomack, Virginia, US
4 -116. 360 3 0 0 Ada, Idaho, US
5 -92.6 10 0 0 0 Adair, Missouri, US
6 -94.7 14 0 0 0 Adair, Oklahoma, US
7 -104. 294 9 0 0 Adams, Colorado, US
8 -91.4 16 0 0 0 Adams, Mississippi, US
9 -98.5 8 0 0 0 Adams, Nebraska, US
10 -77.2 21 0 0 0 Adams, Pennsylvania, US
state county percent_fair_or_poor_health percent_smokers
<chr> <chr> <dbl> <dbl>
1 South Carolina Abbeville 19.9 17.3
2 Louisiana Acadia 20.9 21.5
3 Virginia Accomack 20.1 18.3
4 Idaho Ada 11.5 12.0
5 Missouri Adair 21.4 20.5
6 Oklahoma Adair 28.5 27.7
7 Colorado Adams 16.6 16.3
8 Mississippi Adams 27.3 22.2
9 Nebraska Adams 15.8 14.6
10 Pennsylvania Adams 15.3 16.2
percent_adults_with_obesity percent_with_access_to_exercise_opportunities
<dbl> <dbl>
1 36.7 59.0
2 38.4 42.5
3 36.3 37.4
4 25.6 89.5
5 27.9 78.3
6 47.7 28.5
7 27.8 93.1
8 35.3 69.1
9 36.7 81.6
10 35.6 60.6
percent_excessive_drinking percent_uninsured percent_some_college
<dbl> <dbl> <dbl>
1 15.9 12.9 52.5
2 19.8 10.7 43.6
3 15.5 16.6 45.1
4 17.9 8.74 73.8
5 18.9 10.6 65.3
6 11.8 24.5 35.1
7 18.9 11.0 57.0
8 12.3 15.0 41.7
9 18.5 8.76 70.8
10 19.2 7.49 57.3
percent_unemployed percent_children_in_poverty
<dbl> <dbl>
1 3.98 30.8
2 5.37 35.4
3 3.81 27
4 2.46 10.2
5 3.51 19.9
6 4.17 34.9
7 3.47 12.6
8 6.21 40.4
9 2.87 14.4
10 3.27 11.2
percent_single_parent_households percent_severe_housing_problems overcrowding
<dbl> <dbl> <dbl>
1 37.1 14.3 0.463
2 33.4 12.3 3.51
3 45.9 15.1 2.10
4 23.8 14.0 1.46
5 29.5 18.0 0.740
6 38.3 15.4 5.65
7 31.0 18.1 5.37
8 66.4 12.8 2.37
9 26.2 10.5 0.904
10 26.7 12.3 1.88
percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
<dbl> <dbl> <dbl>
1 15.8 15.2 36.1
2 11.4 15.1 32.4
3 15.9 14.1 36.8
4 7.9 12 26.3
5 8.4 17.5 31.9
6 24.3 19.1 39.5
7 7.7 8 31.0
8 13.2 24.7 41.1
9 11 11.7 30.1
10 8.5 8.3 34.7
percent_uninsured_2 median_household_income
<dbl> <dbl>
1 15.9 42412
2 14.0 40484
3 19.4 42879
4 11.1 66827
5 12.3 40395
6 29.6 35156
7 13.8 70199
8 18.7 33392
9 10.7 55167
10 8.46 62877
average_traffic_volume_per_meter_of_major_roadways percent_homeowners
<dbl> <dbl>
1 11.6 76.3
2 63.7 70.8
3 60.0 67.9
4 277. 68.4
5 45.8 60.0
6 16.7 68.6
7 490. 65.2
8 150. 61.7
9 53.4 68.2
10 113. 77.2
population_2 percent_less_than_18_years_of_age percent_65_and_over
<dbl> <dbl> <dbl>
1 24541 20.1 21.8
2 62190 25.8 15.3
3 32412 20.5 23.6
4 469966 23.8 14.4
5 25339 18.4 14.8
6 22082 26.6 15.9
7 511868 26.5 10.5
8 31192 20.1 18.8
9 31511 23.7 18.2
10 102811 20.0 20.4
percent_black percent_asian percent_hispanic percent_female percent_rural
<dbl> <dbl> <dbl> <dbl> <dbl>
1 27.5 0.412 1.54 51.6 78.6
2 17.9 0.320 2.73 51.2 51.7
3 28.0 0.781 9.34 51.2 100
4 1.24 2.81 8.31 49.9 5.47
5 2.85 2.28 2.57 51.9 37.9
6 0.534 0.802 6.82 50.1 83.3
7 3.19 4.37 40.4 49.5 3.62
8 52.4 0.513 11.3 47.9 37.2
9 0.996 1.33 10.9 50.2 22.5
10 1.60 0.875 7.11 50.8 53.7
# ℹ 1,436 more rows
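The chain of `ifelse()` calls above could also be written as a small lookup table that is joined in and used to fill missing keys. The sketch below is one possible alternative, not the approach used in this homework. The names `fips_fix`, `fips_manual`, and `counts` are hypothetical, and only three of the manual corrections are reproduced.

```r
library(dplyr)
library(tibble)

# Lookup table of manual corrections (three codes taken from the code above)
fips_fix <- tribble(
  ~admin2,  ~province_state, ~fips_manual,
  "DeKalb", "Tennessee",     47041,
  "DeSoto", "Florida",       12027,
  "Weber",  "Utah",          49057
)

# Hypothetical stand-in for county_count
counts <- tibble(
  fips           = c(NA, 45001, NA),
  admin2         = c("DeKalb", "Abbeville", "Weber"),
  province_state = c("Tennessee", "South Carolina", "Utah")
)

patched <- counts %>%
  left_join(fips_fix, by = c("admin2", "province_state")) %>%
  # coalesce() keeps an existing fips and fills in only where it is missing
  mutate(fips = coalesce(fips, fips_manual)) %>%
  select(-fips_manual)

patched
```

One behavioral difference is worth noting: `coalesce()` only fills missing values, whereas the `ifelse()` version overwrites fips for the named county whether or not it was missing.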
Summarize again:
summary(county_data)
fips admin2 province_state country_region
Min. : 1001 Length:1446 Length:1446 Length:1446
1st Qu.:17186 Class :character Class :character Class :character
Median :28156 Mode :character Mode :character Mode :character
Mean :29455
3rd Qu.:42048
Max. :56039
last_update lat long_
Min. :2020-04-04 23:34:21 Min. :19.60 Min. :-159.60
1st Qu.:2020-04-04 23:34:21 1st Qu.:33.96 1st Qu.: -94.52
Median :2020-04-04 23:34:21 Median :38.02 Median : -86.48
Mean :2020-04-04 23:34:21 Mean :37.71 Mean : -89.73
3rd Qu.:2020-04-04 23:34:21 3rd Qu.:41.39 3rd Qu.: -81.21
Max. :2020-04-04 23:34:21 Max. :64.81 Max. : -68.65
confirmed deaths recovered active
Min. : 5.0 Min. : 0.000 Min. :0 Min. :0
1st Qu.: 9.0 1st Qu.: 0.000 1st Qu.:0 1st Qu.:0
Median : 20.0 Median : 0.000 Median :0 Median :0
Mean : 207.2 Mean : 4.854 Mean :0 Mean :0
3rd Qu.: 66.0 3rd Qu.: 2.000 3rd Qu.:0 3rd Qu.:0
Max. :63306.0 Max. :1905.000 Max. :0 Max. :0
combined_key state county
Length:1446 Length:1446 Length:1446
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
Min. : 8.121 Min. : 5.909 Min. :12.40
1st Qu.:14.390 1st Qu.:14.899 1st Qu.:29.10
Median :17.010 Median :17.143 Median :32.90
Mean :17.594 Mean :17.151 Mean :32.39
3rd Qu.:20.398 3rd Qu.:19.365 3rd Qu.:36.20
Max. :38.887 Max. :27.775 Max. :51.00
percent_with_access_to_exercise_opportunities percent_excessive_drinking
Min. : 0.00 Min. : 7.81
1st Qu.: 59.95 1st Qu.:15.68
Median : 74.71 Median :18.03
Mean : 71.15 Mean :17.91
3rd Qu.: 85.97 3rd Qu.:20.01
Max. :100.00 Max. :28.62
percent_uninsured percent_some_college percent_unemployed
Min. : 2.263 Min. :21.14 Min. : 1.582
1st Qu.: 6.754 1st Qu.:53.21 1st Qu.: 3.252
Median : 9.937 Median :61.19 Median : 3.870
Mean :10.592 Mean :60.83 Mean : 4.071
3rd Qu.:13.527 3rd Qu.:68.72 3rd Qu.: 4.690
Max. :31.208 Max. :90.34 Max. :18.092
percent_children_in_poverty percent_single_parent_households
Min. : 2.50 Min. : 9.43
1st Qu.:12.82 1st Qu.:27.07
Median :18.40 Median :32.96
Mean :19.46 Mean :33.83
3rd Qu.:24.50 3rd Qu.:38.93
Max. :55.00 Max. :80.00
percent_severe_housing_problems overcrowding percent_adults_with_diabetes
Min. : 6.562 Min. : 0.000 Min. : 1.80
1st Qu.:12.267 1st Qu.: 1.379 1st Qu.: 9.10
Median :14.439 Median : 1.971 Median :11.30
Mean :15.082 Mean : 2.437 Mean :11.75
3rd Qu.:16.992 3rd Qu.: 2.887 3rd Qu.:13.90
Max. :33.391 Max. :14.489 Max. :34.10
percent_food_insecure percent_insufficient_sleep percent_uninsured_2
Min. : 3.40 Min. :23.03 Min. : 2.683
1st Qu.:10.70 1st Qu.:31.42 1st Qu.: 7.865
Median :12.70 Median :34.02 Median :12.027
Mean :13.25 Mean :33.88 Mean :12.786
3rd Qu.:15.20 3rd Qu.:36.54 3rd Qu.:16.572
Max. :33.50 Max. :46.71 Max. :42.397
median_household_income average_traffic_volume_per_meter_of_major_roadways
Min. : 25385 Min. : 0.00
1st Qu.: 46994 1st Qu.: 53.09
Median : 54317 Median : 104.63
Mean : 57600 Mean : 200.72
3rd Qu.: 64775 3rd Qu.: 206.78
Max. :140382 Max. :4444.12
percent_homeowners population_2 percent_less_than_18_years_of_age
Min. :24.13 Min. : 2887 Min. : 7.069
1st Qu.:64.36 1st Qu.: 36275 1st Qu.:20.326
Median :69.96 Median : 75382 Median :22.182
Mean :68.99 Mean : 201689 Mean :22.204
3rd Qu.:74.77 3rd Qu.: 179982 3rd Qu.:24.019
Max. :89.76 Max. :10105518 Max. :35.447
percent_65_and_over percent_black percent_asian percent_hispanic
Min. : 7.722 Min. : 0.1286 Min. : 0.06245 Min. : 0.7952
1st Qu.:14.913 1st Qu.: 1.6168 1st Qu.: 0.68228 1st Qu.: 2.9451
Median :17.225 Median : 5.6397 Median : 1.22863 Median : 5.6100
Mean :17.512 Mean :12.4056 Mean : 2.40009 Mean :10.0338
3rd Qu.:19.598 3rd Qu.:17.4904 3rd Qu.: 2.66813 3rd Qu.:11.1199
Max. :57.587 Max. :81.9544 Max. :42.95231 Max. :96.3596
percent_female percent_rural
Min. :34.63 Min. : 0.00
1st Qu.:50.00 1st Qu.: 17.11
Median :50.65 Median : 36.97
Mean :50.46 Mean : 40.12
3rd Qu.:51.35 3rd Qu.: 60.04
Max. :56.87 Max. :100.00
If any variables have missing values for many counties, we go back and remove those variables from consideration.
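One way to check this, sketched on a hypothetical toy data frame, is to compute the share of missing values per column and keep only the columns below a chosen cutoff. The 0.5 cutoff is an arbitrary assumption for illustration.

```r
# Hypothetical toy data frame with missing values
toy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 1),
  c = c(1, 2, 3)
)

# Share of missing values in each column, largest first
na_share <- colMeans(is.na(toy))
sort(na_share, decreasing = TRUE)

# Keep columns whose missing share is at most the (assumed) cutoff
keep <- names(na_share)[na_share <= 0.5]
toy_reduced <- toy[, keep, drop = FALSE]
names(toy_reduced)
```

In the summary above no NA counts remain after the second join, so no variable needs to be dropped for this reason.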
Let’s create a final data frame for analysis.
county_data <- county_data %>%
  mutate(state = as.factor(state)) %>%
  select(county, confirmed, deaths, state, percent_fair_or_poor_health:percent_rural)
summary(county_data)
county confirmed deaths state
Length:1446 Min. : 5.0 Min. : 0.000 Georgia : 96
Class :character 1st Qu.: 9.0 1st Qu.: 0.000 Texas : 80
Mode :character Median : 20.0 Median : 0.000 North Carolina: 63
Mean : 207.2 Mean : 4.854 Mississippi : 61
3rd Qu.: 66.0 3rd Qu.: 2.000 Indiana : 58
Max. :63306.0 Max. :1905.000 Ohio : 57
(Other) :1031
percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
Min. : 8.121 Min. : 5.909 Min. :12.40
1st Qu.:14.390 1st Qu.:14.899 1st Qu.:29.10
Median :17.010 Median :17.143 Median :32.90
Mean :17.594 Mean :17.151 Mean :32.39
3rd Qu.:20.398 3rd Qu.:19.365 3rd Qu.:36.20
Max. :38.887 Max. :27.775 Max. :51.00
percent_with_access_to_exercise_opportunities percent_excessive_drinking
Min. : 0.00 Min. : 7.81
1st Qu.: 59.95 1st Qu.:15.68
Median : 74.71 Median :18.03
Mean : 71.15 Mean :17.91
3rd Qu.: 85.97 3rd Qu.:20.01
Max. :100.00 Max. :28.62
percent_uninsured percent_some_college percent_unemployed
Min. : 2.263 Min. :21.14 Min. : 1.582
1st Qu.: 6.754 1st Qu.:53.21 1st Qu.: 3.252
Median : 9.937 Median :61.19 Median : 3.870
Mean :10.592 Mean :60.83 Mean : 4.071
3rd Qu.:13.527 3rd Qu.:68.72 3rd Qu.: 4.690
Max. :31.208 Max. :90.34 Max. :18.092
percent_children_in_poverty percent_single_parent_households
Min. : 2.50 Min. : 9.43
1st Qu.:12.82 1st Qu.:27.07
Median :18.40 Median :32.96
Mean :19.46 Mean :33.83
3rd Qu.:24.50 3rd Qu.:38.93
Max. :55.00 Max. :80.00
percent_severe_housing_problems overcrowding percent_adults_with_diabetes
Min. : 6.562 Min. : 0.000 Min. : 1.80
1st Qu.:12.267 1st Qu.: 1.379 1st Qu.: 9.10
Median :14.439 Median : 1.971 Median :11.30
Mean :15.082 Mean : 2.437 Mean :11.75
3rd Qu.:16.992 3rd Qu.: 2.887 3rd Qu.:13.90
Max. :33.391 Max. :14.489 Max. :34.10
percent_food_insecure percent_insufficient_sleep percent_uninsured_2
Min. : 3.40 Min. :23.03 Min. : 2.683
1st Qu.:10.70 1st Qu.:31.42 1st Qu.: 7.865
Median :12.70 Median :34.02 Median :12.027
Mean :13.25 Mean :33.88 Mean :12.786
3rd Qu.:15.20 3rd Qu.:36.54 3rd Qu.:16.572
Max. :33.50 Max. :46.71 Max. :42.397
median_household_income average_traffic_volume_per_meter_of_major_roadways
Min. : 25385 Min. : 0.00
1st Qu.: 46994 1st Qu.: 53.09
Median : 54317 Median : 104.63
Mean : 57600 Mean : 200.72
3rd Qu.: 64775 3rd Qu.: 206.78
Max. :140382 Max. :4444.12
percent_homeowners population_2 percent_less_than_18_years_of_age
Min. :24.13 Min. : 2887 Min. : 7.069
1st Qu.:64.36 1st Qu.: 36275 1st Qu.:20.326
Median :69.96 Median : 75382 Median :22.182
Mean :68.99 Mean : 201689 Mean :22.204
3rd Qu.:74.77 3rd Qu.: 179982 3rd Qu.:24.019
Max. :89.76 Max. :10105518 Max. :35.447
percent_65_and_over percent_black percent_asian percent_hispanic
Min. : 7.722 Min. : 0.1286 Min. : 0.06245 Min. : 0.7952
1st Qu.:14.913 1st Qu.: 1.6168 1st Qu.: 0.68228 1st Qu.: 2.9451
Median :17.225 Median : 5.6397 Median : 1.22863 Median : 5.6100
Mean :17.512 Mean :12.4056 Mean : 2.40009 Mean :10.0338
3rd Qu.:19.598 3rd Qu.:17.4904 3rd Qu.: 2.66813 3rd Qu.:11.1199
Max. :57.587 Max. :81.9544 Max. :42.95231 Max. :96.3596
percent_female percent_rural
Min. :34.63 Min. : 0.00
1st Qu.:50.00 1st Qu.: 17.11
Median :50.65 Median : 36.97
Mean :50.46 Mean : 40.12
3rd Qu.:51.35 3rd Qu.: 60.04
Max. :56.87 Max. :100.00
Display the 10 counties with the highest CFR. Counties tied at the cutoff are all kept, so more than 10 rows may be shown.
county_data %>%
mutate(cfr = deaths / confirmed) %>%
select(county, state, confirmed, deaths, cfr) %>%
arrange(desc(cfr)) %>%
top_n(10)
# A tibble: 18 × 5
county state confirmed deaths cfr
<chr> <fct> <dbl> <dbl> <dbl>
1 Emmet Michigan 7 2 0.286
2 Grand Traverse Michigan 12 3 0.25
3 Toole Montana 12 3 0.25
4 Fayette Indiana 14 3 0.214
5 Concordia Louisiana 5 1 0.2
6 Harrison Texas 5 1 0.2
7 Huntington Indiana 5 1 0.2
8 Isabella Michigan 10 2 0.2
9 McDuffie Georgia 5 1 0.2
10 Navarro Texas 5 1 0.2
11 Orange Indiana 5 1 0.2
12 Perry Pennsylvania 5 1 0.2
13 Randolph Indiana 5 1 0.2
14 Rockingham North Carolina 5 1 0.2
15 Seneca Ohio 5 1 0.2
16 Toombs Georgia 5 1 0.2
17 Vigo Indiana 10 2 0.2
18 Washington Alabama 5 1 0.2
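The table above has 18 rows rather than 10 because `top_n()` keeps every row tied at the cutoff value. The following is a minimal sketch of that behavior on a small made-up data frame (the `toy` values are hypothetical, not taken from the homework data). It also shows `slice_max()` with `with_ties = FALSE` as one possible way to get exactly `n` rows, if that is what is wanted.

```r
library(dplyr)

# Hypothetical toy data with three counties tied at cfr = 0.2
toy <- tibble(
  county    = c("A", "B", "C", "D", "E"),
  confirmed = c(7, 5, 5, 5, 100),
  deaths    = c(2, 1, 1, 1, 1)
) %>%
  mutate(cfr = deaths / confirmed)

# top_n() keeps all rows tied at the cutoff, so n = 2 returns 4 rows here
with_ties <- toy %>% top_n(2, cfr)

# slice_max() with with_ties = FALSE returns exactly n rows
exactly_two <- toy %>% slice_max(cfr, n = 2, with_ties = FALSE)

nrow(with_ties)
nrow(exactly_two)
```

Which rows survive among the tied ones under `with_ties = FALSE` depends on row order, so the tie-keeping behavior may be the more honest display for this data.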
Write final data into a csv file for future use.
write_csv(county_data, "./datasets/covid19-county-data-20200404.csv.gz")
1.3 Note:
Given that the datasets were collected in the middle of the pandemic, what assumptions of CFR might be violated by defining CFR as deaths/confirmed from this data set?
Because the COVID-19 pandemic was still ongoing in 2020, some critical assumptions for defining CFR are not met by this data set.
The number of confirmed cases does not reflect the number of diagnosed people, mainly because of the limited availability of testing.
Some people counted as confirmed cases may die later, so their deaths are not yet recorded.
With acknowledgement of these severe limitations, we continue to use deaths/confirmed as a very rough proxy of CFR.
1.4 Q1.1 (5pts)
Read and run the above code to generate a data frame county_data that includes county-level COVID-19 confirmed cases and deaths, demographic, and health related information.
1.5 Q1.2 (5pts)
What assumptions of logistic regression may be violated by this data set?
1.6 Q1.3 (10pts)
Run a logistic regression, using variables state, …, percent_rural as predictors.
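As a reminder of the syntax, the sketch below fits a logistic regression to grouped (binomial) counts with `glm()`, using a two-column response of successes and failures. The data frame `toy` and its values are simulated and hypothetical; only the column names `confirmed`, `deaths`, and `percent_rural` mirror the homework data. This is an illustration of the model form under those assumptions, not the homework solution.

```r
# Simulated toy data (hypothetical values, not the homework data)
set.seed(200)
toy <- data.frame(
  confirmed     = c(50, 120, 30, 400, 80, 60, 250, 90),
  percent_rural = c(10, 40, 70, 5, 55, 80, 20, 35)
)
# Simulate deaths as binomial counts out of the confirmed cases
toy$deaths <- rbinom(
  nrow(toy),
  size = toy$confirmed,
  prob = plogis(-3 + 0.01 * toy$percent_rural)
)

# Grouped binomial response: cbind(successes, failures)
fit <- glm(
  cbind(deaths, confirmed - deaths) ~ percent_rural,
  family = binomial,
  data   = toy
)
summary(fit)
```

With the two-column response, each county contributes `confirmed` Bernoulli trials, so counties with more cases carry more weight in the fit.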
1.7 Q1.4 (10pts)
Interpret the regression coefficients of 3 significant predictors with p-value <0.05.
1.8 Q1.5 (10pts)
Apply analysis of deviance to (1) evaluate the goodness of fit of the model and (2) compare the model to the intercept-only model.
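The two deviance-based checks can be sketched as follows on simulated data. The data frame `toy` and the objects `fit`, `null`, `gof_p`, and `lrt` are hypothetical names introduced for illustration. The goodness-of-fit check compares the residual deviance to a chi-squared distribution, which is only a rough approximation when group sizes are small.

```r
# Simulated toy data (hypothetical values, not the homework data)
set.seed(200)
toy <- data.frame(
  confirmed     = c(50, 120, 30, 400, 80, 60, 250, 90),
  percent_rural = c(10, 40, 70, 5, 55, 80, 20, 35)
)
toy$deaths <- rbinom(
  nrow(toy),
  size = toy$confirmed,
  prob = plogis(-3 + 0.01 * toy$percent_rural)
)

fit  <- glm(cbind(deaths, confirmed - deaths) ~ percent_rural,
            family = binomial, data = toy)
null <- glm(cbind(deaths, confirmed - deaths) ~ 1,
            family = binomial, data = toy)

# (1) Goodness of fit: residual deviance against chi-squared on residual df
gof_p <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)

# (2) Compare with the intercept-only model via a likelihood ratio test
lrt <- anova(null, fit, test = "Chisq")

gof_p
lrt
```

A small `gof_p` suggests lack of fit (or overdispersion), while a small p-value in `lrt` suggests the predictors improve on the intercept-only model.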
1.9 Q1.6 (10pts)
Perform analysis of deviance to evaluate the significance of each predictor. Display the 10 most significant predictors.
1.10 Q1.7 (5pts)
Construct confidence intervals of regression coefficients.
1.11 Q1.8 (5pts)
Plot the deviance residuals against the fitted values. Are there potential outliers?
1.12 Q1.9 (5pts)
Plot the half-normal plot. Are there potential outliers in predictor space?
1.13 Q1.10 (10pts)
Find the best sub-model using the AIC criterion.
1.14 Q1.11 (15pts)
Find the best sub-model using the lasso with cross validation.
2 Q2. Odds ratios (20pts)
Consider a \(2 \times 2\) contingency table from a prospective study in which people who were or were not exposed to some pollutant are followed up and, after several years, categorized according to the presence or absence of a disease. The following table shows the probabilities for each cell. The odds of disease for either exposure group is \(O_i = \pi_i / (1 - \pi_i)\), for \(i = 1,2\), and so the odds ratio \[ \phi = \frac{O_1}{O_2} = \frac{\pi_1(1 - \pi_2)}{\pi_2 (1 - \pi_1)} \] is a measure of the relative likelihood of disease for the exposed and not exposed groups.
| | Diseased | Not diseased |
|---|---|---|
| Exposed | \(\pi_1\) | \(1 - \pi_1\) |
| Not exposed | \(\pi_2\) | \(1 - \pi_2\) |
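As a quick numerical illustration of the odds ratio, the sketch below uses made-up probabilities \(\pi_1 = 0.2\) and \(\pi_2 = 0.1\); these values and the helper `odds()` are hypothetical and chosen only for the example.

```r
# Hypothetical disease probabilities for the two exposure groups
pi1 <- 0.2  # exposed
pi2 <- 0.1  # not exposed

# Odds of disease for a given probability
odds <- function(p) p / (1 - p)

# Odds ratio phi = O_1 / O_2 = (0.2 * 0.9) / (0.1 * 0.8) = 2.25
phi <- odds(pi1) / odds(pi2)
phi
```

Here the odds of disease in the exposed group are 2.25 times those in the not exposed group, even though the probability itself only doubles.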
2.1 Q2.1 (10pts)
For the simple logistic model \[ \pi_i = \frac{e^{\beta_i}}{1 + e^{\beta_i}}, \] show that if there is no difference between the exposed and not exposed groups (i.e., \(\beta_1 = \beta_2\)), then \(\phi = 1\).
2.2 Q2.2 (10pts)
Consider \(J\) \(2 \times 2\) tables, one for each level \(x_j\) of a factor, such as age group, with \(j=1,\ldots, J\). For the logistic model \[ \pi_{ij} = \frac{e^{\alpha_i + \beta_i x_j}}{1 + e^{\alpha_i + \beta_i x_j}}, \quad i = 1,2, \quad j= 1,\ldots, J. \] Show that \(\log \phi\) is constant over all tables if \(\beta_1 = \beta_2\).