To submit homework, please upload both Rmd and html files to Bruinlearn by the deadline.
Of primary interest to the public is the risk of dying from COVID-19. A commonly used measure is the case fatality rate/ratio/risk (CFR), which is defined as \[ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}. \] CFR is clearly not a fixed constant; it changes with time, location, and other factors. CFR is also different from the infection fatality rate (IFR), the probability that someone infected with COVID-19 dies from it.
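As a quick illustration of the definition, the sketch below computes CFR for three counties, using confirmed and death counts that appear in the 2020-04-04 county listing later in this document. The data frame name `cfr_examples` is made up for this example.

```r
# Counts copied from the 2020-04-04 county listing in this document
cfr_examples <- data.frame(
  county    = c("Ada, Idaho", "Adams, Colorado", "Acadia, Louisiana"),
  confirmed = c(360, 294, 65),
  deaths    = c(3, 9, 2)
)

# CFR = deaths / diagnosed (confirmed) cases
cfr_examples$cfr <- cfr_examples$deaths / cfr_examples$confirmed
print(cfr_examples)
```

The three CFRs come out to roughly 0.8%, 3.1%, and 3.1%, which already shows how much the ratio can vary between counties on the same day.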
In this exercise, we use logistic regression to study how US county-level CFR varies with demographic information and selected health, education, and economic indicators.
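For readers who want to see the shape of such a model, here is a minimal sketch of a binomial logistic regression in base R. The data frame `toy` and the predictor `pct_65_and_over` are invented for illustration and are not taken from the datasets below; the exercise itself may call for a different specification.

```r
# Hypothetical toy data, for illustration only (not real county values)
toy <- data.frame(
  confirmed       = c(50, 120, 300, 80, 200),
  deaths          = c(1, 4, 15, 2, 8),
  pct_65_and_over = c(12, 15, 22, 14, 18)
)

# Binomial response: (number of deaths, number of non-deaths) per county
fit <- glm(cbind(deaths, confirmed - deaths) ~ pct_65_and_over,
           family = binomial, data = toy)

# Coefficients are on the log-odds scale
print(coef(fit))

# Fitted CFR for each county, on the probability scale
fitted_cfr <- predict(fit, type = "response")
print(fitted_cfr)
```

The two-column response `cbind(deaths, confirmed - deaths)` tells `glm()` how many deaths and non-deaths each row contributes, so counties with more cases carry more weight in the fit.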
04-04-2020.csv.gz: The data on COVID-19 confirmed cases and deaths on 2020-04-04 was retrieved from the Johns Hopkins COVID-19 data repository (commit 0174f38). That repository was archived by its owner on Mar 10, 2023 and is now read-only. You can download the data from Box: https://ucla.box.com/s/brb3vz4nwoq8pjkcutxncymqw583d39l
us-county-health-rankings-2020.csv.gz: The 2020 County Health Rankings data was released by County Health Rankings. The data was downloaded from the Kaggle Uncover COVID-19 Challenge (version 1). You can download the data from Box: https://ucla.box.com/s/brb3vz4nwoq8pjkcutxncymqw583d39l
Load the tidyverse package for data manipulation and visualization.
# tidyverse for data manipulation and visualization
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Read in the data of COVID-19 cases reported on 2020-04-04.
county_count <- read_csv("./datasets/04-04-2020.csv.gz") %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(FIPS = as.numeric(FIPS)) %>%
  filter(Country_Region == "US") %>%
  print(width = Inf)
## Rows: 2679 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): FIPS, Admin2, Province_State, Country_Region, Combined_Key
## dbl (6): Lat, Long_, Confirmed, Deaths, Recovered, Active
## dttm (1): Last_Update
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2,421 × 12
## FIPS Admin2 Province_State Country_Region Last_Update Lat
## <dbl> <chr> <chr> <chr> <dttm> <dbl>
## 1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
## 2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
## 3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
## 4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
## 5 19001 Adair Iowa US 2020-04-04 23:34:21 41.3
## 6 21001 Adair Kentucky US 2020-04-04 23:34:21 37.1
## 7 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
## 8 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
## 9 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
## 10 16003 Adams Idaho US 2020-04-04 23:34:21 44.9
## Long_ Confirmed Deaths Recovered Active Combined_Key
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -82.5 6 0 0 0 Abbeville, South Carolina, US
## 2 -92.4 65 2 0 0 Acadia, Louisiana, US
## 3 -75.6 8 0 0 0 Accomack, Virginia, US
## 4 -116. 360 3 0 0 Ada, Idaho, US
## 5 -94.5 1 0 0 0 Adair, Iowa, US
## 6 -85.3 3 0 0 0 Adair, Kentucky, US
## 7 -92.6 10 0 0 0 Adair, Missouri, US
## 8 -94.7 14 0 0 0 Adair, Oklahoma, US
## 9 -104. 294 9 0 0 Adams, Colorado, US
## 10 -116. 1 0 0 0 Adams, Idaho, US
## # … with 2,411 more rows
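A note on the `as.numeric(FIPS)` cast in the code above: FIPS county codes are conventionally written as five-digit strings, so a numeric cast drops any leading zero (Adams, Colorado appears as 8001 in the output rather than 08001). That is harmless as long as both tables are cast the same way before joining. The sketch below uses made-up vector names and assumes the codes arrive as zero-padded strings; it shows the cast and how the padded form could be recovered if needed.

```r
# Assumed input: zero-padded FIPS strings, with one missing value
fips_chr <- c("08001", "45001", NA)

# The numeric cast drops the leading zero: 8001, 45001, NA
fips_num <- as.numeric(fips_chr)
print(fips_num)

# Pad back to five characters, keeping missing values as NA
fips_pad <- ifelse(is.na(fips_num),
                   NA_character_,
                   sprintf("%05d", as.integer(fips_num)))
print(fips_pad)
```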
Standardize the variable names by changing them to lower case.
names(county_count) <- str_to_lower(names(county_count))
Sanity check by displaying the unique US states and territories:
county_count %>%
  select(province_state) %>%
  distinct() %>%
  arrange(province_state) %>%
  print(n = Inf)
## # A tibble: 58 × 1
## province_state
## <chr>
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
## 7 Connecticut
## 8 Delaware
## 9 Diamond Princess
## 10 District of Columbia
## 11 Florida
## 12 Georgia
## 13 Grand Princess
## 14 Guam
## 15 Hawaii
## 16 Idaho
## 17 Illinois
## 18 Indiana
## 19 Iowa
## 20 Kansas
## 21 Kentucky
## 22 Louisiana
## 23 Maine
## 24 Maryland
## 25 Massachusetts
## 26 Michigan
## 27 Minnesota
## 28 Mississippi
## 29 Missouri
## 30 Montana
## 31 Nebraska
## 32 Nevada
## 33 New Hampshire
## 34 New Jersey
## 35 New Mexico
## 36 New York
## 37 North Carolina
## 38 North Dakota
## 39 Northern Mariana Islands
## 40 Ohio
## 41 Oklahoma
## 42 Oregon
## 43 Pennsylvania
## 44 Puerto Rico
## 45 Recovered
## 46 Rhode Island
## 47 South Carolina
## 48 South Dakota
## 49 Tennessee
## 50 Texas
## 51 Utah
## 52 Vermont
## 53 Virgin Islands
## 54 Virginia
## 55 Washington
## 56 West Virginia
## 57 Wisconsin
## 58 Wyoming
We want to exclude entries from Diamond Princess, Grand Princess, Guam, Northern Mariana Islands, Puerto Rico, Recovered, and Virgin Islands, and only consider counties from the 50 states and DC.
county_count <- county_count %>%
  filter(!(province_state %in% c("Diamond Princess", "Grand Princess",
                                 "Recovered", "Guam", "Northern Mariana Islands",
                                 "Puerto Rico", "Virgin Islands"))) %>%
  print(width = Inf)
## # A tibble: 2,413 × 12
## fips admin2 province_state country_region last_update lat
## <dbl> <chr> <chr> <chr> <dttm> <dbl>
## 1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
## 2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
## 3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
## 4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
## 5 19001 Adair Iowa US 2020-04-04 23:34:21 41.3
## 6 21001 Adair Kentucky US 2020-04-04 23:34:21 37.1
## 7 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
## 8 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
## 9 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
## 10 16003 Adams Idaho US 2020-04-04 23:34:21 44.9
## long_ confirmed deaths recovered active combined_key
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -82.5 6 0 0 0 Abbeville, South Carolina, US
## 2 -92.4 65 2 0 0 Acadia, Louisiana, US
## 3 -75.6 8 0 0 0 Accomack, Virginia, US
## 4 -116. 360 3 0 0 Ada, Idaho, US
## 5 -94.5 1 0 0 0 Adair, Iowa, US
## 6 -85.3 3 0 0 0 Adair, Kentucky, US
## 7 -92.6 10 0 0 0 Adair, Missouri, US
## 8 -94.7 14 0 0 0 Adair, Oklahoma, US
## 9 -104. 294 9 0 0 Adams, Colorado, US
## 10 -116. 1 0 0 0 Adams, Idaho, US
## # … with 2,403 more rows
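The entries to exclude can also be found programmatically rather than by eye. A minimal sketch, assuming only base R's built-in `state.name` vector (the 50 state names) and a shortened, illustrative sample of the `province_state` values listed above:

```r
# Illustrative subset of the province_state values printed above
provinces <- c("Alabama", "Diamond Princess", "District of Columbia",
               "Guam", "Recovered", "Wyoming")

# state.name ships with base R and holds the 50 state names
keep <- c(state.name, "District of Columbia")

# Anything that is neither a state nor DC is a candidate for exclusion
to_drop <- setdiff(provinces, keep)
print(to_drop)
```

Applied to the full column, the same `setdiff()` call should return the seven entries excluded above, which gives a quick cross-check on the hand-written list.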
Graphically summarize the COVID-19 confirmed cases and deaths on 2020-04-04 by state.
county_count %>%
  # turn into long format for easy plotting
  pivot_longer(confirmed:recovered,
               names_to = "case",
               values_to = "count") %>%
  group_by(province_state) %>%
  ggplot() +
  geom_col(mapping = aes(x = province_state, y = `count`, fill = `case`)) +
  # scale_y_log10() +
  labs(title = "US COVID-19 Situation on 2020-04-04", x = "State") +
  theme(axis.text.x = element_text(angle = 90))
Read in the 2020 county-level health ranking data.
county_info <- read_csv("./datasets/us-county-health-rankings-2020.csv.gz") %>%
filter(!is.na(county)) %>%
# cast fips into dbl for use as a key for joining tables
mutate(fips = as.numeric(fips)) %>%
select(fips,
state,
county,
percent_fair_or_poor_health,
percent_smokers,
percent_adults_with_obesity,
# food_environment_index,
percent_with_access_to_exercise_opportunities,
percent_excessive_drinking,
# teen_birth_rate,
percent_uninsured,
# primary_care_physicians_rate,
# preventable_hospitalization_rate,
# high_school_graduation_rate,
percent_some_college,
percent_unemployed,
percent_children_in_poverty,
# `80th_percentile_income`,
# `20th_percentile_income`,
percent_single_parent_households,
# violent_crime_rate,
percent_severe_housing_problems,
overcrowding,
# life_expectancy,
# age_adjusted_death_rate,
percent_adults_with_diabetes,
# hiv_prevalence_rate,
percent_food_insecure,
# percent_limited_access_to_healthy_foods,
percent_insufficient_sleep,
percent_uninsured_2,
median_household_income,
average_traffic_volume_per_meter_of_major_roadways,
percent_homeowners,
# percent_severe_housing_cost_burden,
population_2,
percent_less_than_18_years_of_age,
percent_65_and_over,
percent_black,
percent_asian,
percent_hispanic,
percent_female,
percent_rural) %>%
print(width = Inf)
## Rows: 3193 Columns: 507
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): state, county, unreliable, primary_care_physicians_ratio, dentist...
## dbl (497): fips, num_deaths, years_of_potential_life_lost_rate, 95percent_ci...
## lgl (3): presence_of_water_violation, non_petitioned_cases, petitioned_cases
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3,142 × 30
## fips state county percent_fair_or_poor_health percent_smokers
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1001 Alabama Autauga 20.9 18.1
## 2 1003 Alabama Baldwin 17.5 17.5
## 3 1005 Alabama Barbour 29.6 22.0
## 4 1007 Alabama Bibb 19.4 19.1
## 5 1009 Alabama Blount 21.7 19.2
## 6 1011 Alabama Bullock 31.0 22.9
## 7 1013 Alabama Butler 27.9 21.8
## 8 1015 Alabama Calhoun 23.1 20.6
## 9 1017 Alabama Chambers 24.0 19.4
## 10 1019 Alabama Cherokee 20.7 17.5
## percent_adults_with_obesity percent_with_access_to_exercise_opportunities
## <dbl> <dbl>
## 1 33.3 69.1
## 2 31 73.7
## 3 41.7 53.2
## 4 37.6 16.3
## 5 33.8 15.6
## 6 37.2 2.50
## 7 43.3 48.6
## 8 38.5 47.7
## 9 40.1 61.9
## 10 35 33.4
## percent_excessive_drinking percent_uninsured percent_some_college
## <dbl> <dbl> <dbl>
## 1 15.0 8.72 62.0
## 2 18.0 11.3 67.4
## 3 12.8 12.2 34.9
## 4 15.6 10.2 44.1
## 5 14.2 13.4 53.4
## 6 12.1 11.4 35.0
## 7 11.9 11.2 41.7
## 8 13.8 11.9 59.2
## 9 12.7 11.9 48.5
## 10 14.1 11.2 51.8
## percent_unemployed percent_children_in_poverty
## <dbl> <dbl>
## 1 3.63 19.3
## 2 3.62 13.9
## 3 5.17 43.9
## 4 3.97 27.8
## 5 3.51 18
## 6 4.69 68.3
## 7 4.79 36.3
## 8 4.65 26.5
## 9 3.91 30.7
## 10 3.57 24.7
## percent_single_parent_households percent_severe_housing_problems overcrowding
## <dbl> <dbl> <dbl>
## 1 26.2 14.7 1.20
## 2 24.1 13.6 1.27
## 3 56.6 14.6 1.69
## 4 28.7 10.5 0.255
## 5 28.6 10.5 1.89
## 6 74.8 18.1 0.113
## 7 52.7 13.2 1.69
## 8 40.2 13.7 1.54
## 9 46.6 16.0 4.04
## 10 23.8 13 1.5
## percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
## <dbl> <dbl> <dbl>
## 1 11.1 13.2 35.9
## 2 10.7 11.6 33.3
## 3 17.6 22 38.6
## 4 14.5 14.3 38.1
## 5 17 10.7 35.9
## 6 23.7 24.8 45.0
## 7 19.2 20.6 41.9
## 8 17.5 15.7 41.3
## 9 19.9 17.9 37.3
## 10 15.2 12.5 35.4
## percent_uninsured_2 median_household_income
## <dbl> <dbl>
## 1 11.1 59338
## 2 14.3 57588
## 3 16.1 34382
## 4 13 46064
## 5 17.1 50412
## 6 15.2 29267
## 7 14.5 37365
## 8 15.4 45400
## 9 15.2 39917
## 10 13.9 42132
## average_traffic_volume_per_meter_of_major_roadways percent_homeowners
## <dbl> <dbl>
## 1 88.5 74.9
## 2 87.0 73.6
## 3 102. 61.4
## 4 29.3 75.1
## 5 33.4 78.6
## 6 4.07 75.5
## 7 19.3 69.9
## 8 110. 69.5
## 9 20.3 67.8
## 10 25.9 79.0
## population_2 percent_less_than_18_years_of_age percent_65_and_over
## <dbl> <dbl> <dbl>
## 1 55601 23.7 15.6
## 2 218022 21.6 20.4
## 3 24881 20.9 19.4
## 4 22400 20.5 16.5
## 5 57840 23.2 18.2
## 6 10138 21.1 16.4
## 7 19680 22.2 20.3
## 8 114277 21.6 17.7
## 9 33615 20.8 19.5
## 10 26032 19.2 23.0
## percent_black percent_asian percent_hispanic percent_female percent_rural
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.3 1.22 2.97 51.4 42.0
## 2 8.78 1.15 4.65 51.5 42.3
## 3 48.0 0.454 4.28 47.2 67.8
## 4 21.1 0.237 2.62 46.8 68.4
## 5 1.46 0.320 9.57 50.7 90.0
## 6 69.5 0.187 7.96 45.5 51.4
## 7 44.6 1.32 1.51 53.4 71.2
## 8 20.9 0.964 3.91 51.9 33.7
## 9 39.6 1.33 2.56 52.1 49.1
## 10 4.24 0.338 1.62 50.5 85.7
## # … with 3,132 more rows
For stability in estimating CFR, we restrict to counties with \(\ge 5\) confirmed cases.
county_count <- county_count %>%
  filter(confirmed >= 5)
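To see why very small case counts make CFR estimates unstable, the sketch below compares exact binomial confidence intervals for two counties from the listing above: Adair, Iowa (0 deaths among 1 confirmed case) and Ada, Idaho (3 deaths among 360). It uses `binom.test()` from base R's stats package; treating deaths as binomial given the confirmed cases is an assumption made for illustration.

```r
# Exact (Clopper-Pearson) 95% intervals for the death proportion
ci_small <- binom.test(x = 0, n = 1)$conf.int    # Adair, Iowa
ci_large <- binom.test(x = 3, n = 360)$conf.int  # Ada, Idaho

# Interval widths: the single-case county tells us almost nothing
width_small <- as.numeric(diff(ci_small))
width_large <- as.numeric(diff(ci_large))
print(c(small = width_small, large = width_large))
```

With a single confirmed case the interval covers nearly the whole range from 0 to 1, while the interval for 360 cases is only a few percentage points wide.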
We join the COVID-19 count data and the county-level information using the FIPS (Federal Information Processing Standards) code as the key.
county_data <- county_count %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
## # A tibble: 1,466 × 41
## fips admin2 province_state country_region last_update lat
## <dbl> <chr> <chr> <chr> <dttm> <dbl>
## 1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
## 2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
## 3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
## 4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
## 5 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
## 6 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
## 7 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
## 8 28001 Adams Mississippi US 2020-04-04 23:34:21 31.5
## 9 31001 Adams Nebraska US 2020-04-04 23:34:21 40.5
## 10 42001 Adams Pennsylvania US 2020-04-04 23:34:21 39.9
## long_ confirmed deaths recovered active combined_key
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -82.5 6 0 0 0 Abbeville, South Carolina, US
## 2 -92.4 65 2 0 0 Acadia, Louisiana, US
## 3 -75.6 8 0 0 0 Accomack, Virginia, US
## 4 -116. 360 3 0 0 Ada, Idaho, US
## 5 -92.6 10 0 0 0 Adair, Missouri, US
## 6 -94.7 14 0 0 0 Adair, Oklahoma, US
## 7 -104. 294 9 0 0 Adams, Colorado, US
## 8 -91.4 16 0 0 0 Adams, Mississippi, US
## 9 -98.5 8 0 0 0 Adams, Nebraska, US
## 10 -77.2 21 0 0 0 Adams, Pennsylvania, US
## state county percent_fair_or_poor_health percent_smokers
## <chr> <chr> <dbl> <dbl>
## 1 South Carolina Abbeville 19.9 17.3
## 2 Louisiana Acadia 20.9 21.5
## 3 Virginia Accomack 20.1 18.3
## 4 Idaho Ada 11.5 12.0
## 5 Missouri Adair 21.4 20.5
## 6 Oklahoma Adair 28.5 27.7
## 7 Colorado Adams 16.6 16.3
## 8 Mississippi Adams 27.3 22.2
## 9 Nebraska Adams 15.8 14.6
## 10 Pennsylvania Adams 15.3 16.2
## percent_adults_with_obesity percent_with_access_to_exercise_opportunities
## <dbl> <dbl>
## 1 36.7 59.0
## 2 38.4 42.5
## 3 36.3 37.4
## 4 25.6 89.5
## 5 27.9 78.3
## 6 47.7 28.5
## 7 27.8 93.1
## 8 35.3 69.1
## 9 36.7 81.6
## 10 35.6 60.6
## percent_excessive_drinking percent_uninsured percent_some_college
## <dbl> <dbl> <dbl>
## 1 15.9 12.9 52.5
## 2 19.8 10.7 43.6
## 3 15.5 16.6 45.1
## 4 17.9 8.74 73.8
## 5 18.9 10.6 65.3
## 6 11.8 24.5 35.1
## 7 18.9 11.0 57.0
## 8 12.3 15.0 41.7
## 9 18.5 8.76 70.8
## 10 19.2 7.49 57.3
## percent_unemployed percent_children_in_poverty
## <dbl> <dbl>
## 1 3.98 30.8
## 2 5.37 35.4
## 3 3.81 27
## 4 2.46 10.2
## 5 3.51 19.9
## 6 4.17 34.9
## 7 3.47 12.6
## 8 6.21 40.4
## 9 2.87 14.4
## 10 3.27 11.2
## percent_single_parent_households percent_severe_housing_problems overcrowding
## <dbl> <dbl> <dbl>
## 1 37.1 14.3 0.463
## 2 33.4 12.3 3.51
## 3 45.9 15.1 2.10
## 4 23.8 14.0 1.46
## 5 29.5 18.0 0.740
## 6 38.3 15.4 5.65
## 7 31.0 18.1 5.37
## 8 66.4 12.8 2.37
## 9 26.2 10.5 0.904
## 10 26.7 12.3 1.88
## percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
## <dbl> <dbl> <dbl>
## 1 15.8 15.2 36.1
## 2 11.4 15.1 32.4
## 3 15.9 14.1 36.8
## 4 7.9 12 26.3
## 5 8.4 17.5 31.9
## 6 24.3 19.1 39.5
## 7 7.7 8 31.0
## 8 13.2 24.7 41.1
## 9 11 11.7 30.1
## 10 8.5 8.3 34.7
## percent_uninsured_2 median_household_income
## <dbl> <dbl>
## 1 15.9 42412
## 2 14.0 40484
## 3 19.4 42879
## 4 11.1 66827
## 5 12.3 40395
## 6 29.6 35156
## 7 13.8 70199
## 8 18.7 33392
## 9 10.7 55167
## 10 8.46 62877
## average_traffic_volume_per_meter_of_major_roadways percent_homeowners
## <dbl> <dbl>
## 1 11.6 76.3
## 2 63.7 70.8
## 3 60.0 67.9
## 4 277. 68.4
## 5 45.8 60.0
## 6 16.7 68.6
## 7 490. 65.2
## 8 150. 61.7
## 9 53.4 68.2
## 10 113. 77.2
## population_2 percent_less_than_18_years_of_age percent_65_and_over
## <dbl> <dbl> <dbl>
## 1 24541 20.1 21.8
## 2 62190 25.8 15.3
## 3 32412 20.5 23.6
## 4 469966 23.8 14.4
## 5 25339 18.4 14.8
## 6 22082 26.6 15.9
## 7 511868 26.5 10.5
## 8 31192 20.1 18.8
## 9 31511 23.7 18.2
## 10 102811 20.0 20.4
## percent_black percent_asian percent_hispanic percent_female percent_rural
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 27.5 0.412 1.54 51.6 78.6
## 2 17.9 0.320 2.73 51.2 51.7
## 3 28.0 0.781 9.34 51.2 100
## 4 1.24 2.81 8.31 49.9 5.47
## 5 2.85 2.28 2.57 51.9 37.9
## 6 0.534 0.802 6.82 50.1 83.3
## 7 3.19 4.37 40.4 49.5 3.62
## 8 52.4 0.513 11.3 47.9 37.2
## 9 0.996 1.33 10.9 50.2 22.5
## 10 1.60 0.875 7.11 50.8 53.7
## # … with 1,456 more rows
Numerical summaries of each variable:
summary(county_data)
## fips admin2 province_state country_region
## Min. : 1001 Length:1466 Length:1466 Length:1466
## 1st Qu.:18003 Class :character Class :character Class :character
## Median :29029 Mode :character Mode :character Mode :character
## Mean :30076
## 3rd Qu.:42077
## Max. :90053
## NA's :13
## last_update lat long_
## Min. :2020-04-04 23:34:21 Min. :19.60 Min. :-159.60
## 1st Qu.:2020-04-04 23:34:21 1st Qu.:33.96 1st Qu.: -94.56
## Median :2020-04-04 23:34:21 Median :38.02 Median : -86.48
## Mean :2020-04-04 23:34:21 Mean :37.71 Mean : -89.73
## 3rd Qu.:2020-04-04 23:34:21 3rd Qu.:41.38 3rd Qu.: -81.22
## Max. :2020-04-04 23:34:21 Max. :64.81 Max. : -68.65
## NA's :19 NA's :19
## confirmed deaths recovered active
## Min. : 5.0 Min. : 0.000 Min. :0 Min. :0
## 1st Qu.: 9.0 1st Qu.: 0.000 1st Qu.:0 1st Qu.:0
## Median : 20.0 Median : 0.000 Median :0 Median :0
## Mean : 208.8 Mean : 4.842 Mean :0 Mean :0
## 3rd Qu.: 68.0 3rd Qu.: 2.000 3rd Qu.:0 3rd Qu.:0
## Max. :63306.0 Max. :1905.000 Max. :0 Max. :0
##
## combined_key state county
## Length:1466 Length:1466 Length:1466
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
## Min. : 8.121 Min. : 5.909 Min. :12.40
## 1st Qu.:14.390 1st Qu.:14.899 1st Qu.:29.10
## Median :17.010 Median :17.147 Median :32.95
## Mean :17.594 Mean :17.153 Mean :32.41
## 3rd Qu.:20.377 3rd Qu.:19.365 3rd Qu.:36.20
## Max. :38.887 Max. :27.775 Max. :51.00
## NA's :28 NA's :28 NA's :28
## percent_with_access_to_exercise_opportunities percent_excessive_drinking
## Min. : 0.00 Min. : 7.81
## 1st Qu.: 59.95 1st Qu.:15.70
## Median : 74.71 Median :18.03
## Mean : 71.14 Mean :17.92
## 3rd Qu.: 85.94 3rd Qu.:20.00
## Max. :100.00 Max. :28.62
## NA's :28 NA's :28
## percent_uninsured percent_some_college percent_unemployed
## Min. : 2.263 Min. :30.06 Min. : 1.582
## 1st Qu.: 6.754 1st Qu.:53.24 1st Qu.: 3.252
## Median : 9.925 Median :61.21 Median : 3.870
## Mean :10.583 Mean :60.87 Mean : 4.071
## 3rd Qu.:13.519 3rd Qu.:68.74 3rd Qu.: 4.690
## Max. :31.208 Max. :90.34 Max. :18.092
## NA's :28 NA's :28 NA's :28
## percent_children_in_poverty percent_single_parent_households
## Min. : 2.50 Min. : 9.43
## 1st Qu.:12.82 1st Qu.:27.09
## Median :18.40 Median :32.96
## Mean :19.46 Mean :33.84
## 3rd Qu.:24.50 3rd Qu.:38.94
## Max. :55.00 Max. :80.00
## NA's :28 NA's :28
## percent_severe_housing_problems overcrowding percent_adults_with_diabetes
## Min. : 6.562 Min. : 0.000 Min. : 1.800
## 1st Qu.:12.267 1st Qu.: 1.378 1st Qu.: 9.125
## Median :14.439 Median : 1.962 Median :11.300
## Mean :15.079 Mean : 2.429 Mean :11.749
## 3rd Qu.:16.976 3rd Qu.: 2.882 3rd Qu.:13.900
## Max. :33.391 Max. :14.489 Max. :34.100
## NA's :28 NA's :28 NA's :28
## percent_food_insecure percent_insufficient_sleep percent_uninsured_2
## Min. : 3.40 Min. :23.03 Min. : 2.683
## 1st Qu.:10.70 1st Qu.:31.42 1st Qu.: 7.865
## Median :12.70 Median :34.02 Median :12.027
## Mean :13.26 Mean :33.88 Mean :12.776
## 3rd Qu.:15.20 3rd Qu.:36.56 3rd Qu.:16.541
## Max. :33.50 Max. :46.71 Max. :42.397
## NA's :28 NA's :28 NA's :28
## median_household_income average_traffic_volume_per_meter_of_major_roadways
## Min. : 25385 Min. : 0.00
## 1st Qu.: 46994 1st Qu.: 53.05
## Median : 54317 Median : 105.00
## Mean : 57584 Mean : 201.39
## 3rd Qu.: 64754 3rd Qu.: 206.92
## Max. :140382 Max. :4444.12
## NA's :28 NA's :28
## percent_homeowners population_2 percent_less_than_18_years_of_age
## Min. :24.13 Min. : 2887 Min. : 7.069
## 1st Qu.:64.34 1st Qu.: 36502 1st Qu.:20.321
## Median :69.98 Median : 75478 Median :22.182
## Mean :68.98 Mean : 202450 Mean :22.197
## 3rd Qu.:74.78 3rd Qu.: 180031 3rd Qu.:24.002
## Max. :89.76 Max. :10105518 Max. :35.447
## NA's :28 NA's :28 NA's :28
## percent_65_and_over percent_black percent_asian percent_hispanic
## Min. : 7.722 Min. : 0.1286 Min. : 0.06245 Min. : 0.7952
## 1st Qu.:14.927 1st Qu.: 1.6175 1st Qu.: 0.68248 1st Qu.: 2.9419
## Median :17.222 Median : 5.6397 Median : 1.23421 Median : 5.5939
## Mean :17.516 Mean :12.4178 Mean : 2.40412 Mean :10.0010
## 3rd Qu.:19.598 3rd Qu.:17.5931 3rd Qu.: 2.67550 3rd Qu.:11.0564
## Max. :57.587 Max. :81.9544 Max. :42.95231 Max. :96.3595
## NA's :28 NA's :28 NA's :28 NA's :28
## percent_female percent_rural
## Min. :34.63 Min. : 0.00
## 1st Qu.:50.00 1st Qu.: 17.11
## Median :50.66 Median : 36.97
## Mean :50.46 Mean : 40.11
## 3rd Qu.:51.35 3rd Qu.: 60.00
## Max. :56.87 Max. :100.00
## NA's :28 NA's :28
List the rows in county_data that have no match in county_info:
county_data %>%
  filter(is.na(state) & is.na(county)) %>%
  print(n = Inf)
## # A tibble: 28 × 41
## fips admin2 provi…¹ count…² last_update lat long_ confi…³ deaths
## <dbl> <chr> <chr> <chr> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 NA DeKalb Tennes… US 2020-04-04 23:34:21 36.0 -85.8 5 0
## 2 NA DeSoto Florida US 2020-04-04 23:34:21 27.2 -81.8 11 1
## 3 NA Dukes … Massac… US 2020-04-04 23:34:21 41.4 -70.7 16 0
## 4 NA Fillmo… Minnes… US 2020-04-04 23:34:21 43.7 -92.1 9 0
## 5 NA Kansas… Missou… US 2020-04-04 23:34:21 39.1 -94.6 172 2
## 6 NA LaSalle Illino… US 2020-04-04 23:34:21 41.3 -88.9 7 1
## 7 NA Manass… Virgin… US 2020-04-04 23:34:21 38.7 -77.5 14 0
## 8 NA McDuff… Georgia US 2020-04-04 23:34:21 33.5 -82.5 5 1
## 9 NA Out of… Michig… US 2020-04-04 23:34:21 NA NA 83 1
## 10 NA Out of… Tennes… US 2020-04-04 23:34:21 NA NA 218 1
## 11 90005 Unassi… Arkans… US 2020-04-04 23:34:21 NA NA 53 8
## 12 90008 Unassi… Colora… US 2020-04-04 23:34:21 NA NA 158 0
## 13 90009 Unassi… Connec… US 2020-04-04 23:34:21 NA NA 241 1
## 14 90013 Unassi… Georgia US 2020-04-04 23:34:21 NA NA 245 4
## 15 90015 Unassi… Hawaii US 2020-04-04 23:34:21 NA NA 8 0
## 16 90017 Unassi… Illino… US 2020-04-04 23:34:21 NA NA 58 1
## 17 90021 Unassi… Kentuc… US 2020-04-04 23:34:21 NA NA 29 6
## 18 NA Unassi… Louisi… US 2020-04-04 23:34:21 NA NA 31 0
## 19 90023 Unassi… Maine US 2020-04-04 23:34:21 NA NA 12 3
## 20 90025 Unassi… Massac… US 2020-04-04 23:34:21 NA NA 274 9
## 21 NA Unassi… Michig… US 2020-04-04 23:34:21 NA NA 252 1
## 22 90032 Unassi… Nevada US 2020-04-04 23:34:21 NA NA 34 0
## 23 90034 Unassi… New Je… US 2020-04-04 23:34:21 NA NA 3935 14
## 24 90044 Unassi… Rhode … US 2020-04-04 23:34:21 NA NA 241 14
## 25 90047 Unassi… Tennes… US 2020-04-04 23:34:21 NA NA 63 0
## 26 90050 Unassi… Vermont US 2020-04-04 23:34:21 NA NA 11 15
## 27 90053 Unassi… Washin… US 2020-04-04 23:34:21 NA NA 483 0
## 28 NA Weber Utah US 2020-04-04 23:34:21 41.3 -112. 63 1
## # … with 32 more variables: recovered <dbl>, active <dbl>, combined_key <chr>,
## # state <chr>, county <chr>, percent_fair_or_poor_health <dbl>,
## # percent_smokers <dbl>, percent_adults_with_obesity <dbl>,
## # percent_with_access_to_exercise_opportunities <dbl>,
## # percent_excessive_drinking <dbl>, percent_uninsured <dbl>,
## # percent_some_college <dbl>, percent_unemployed <dbl>,
## # percent_children_in_poverty <dbl>, …
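An alternative way to list unmatched rows is a filtering join. The sketch below uses `dplyr::anti_join()` on two small stand-in data frames (`counts` and `info` are made-up names; the rows are a few of the counties shown above) and is meant as an illustration, not as a replacement for the check used here.

```r
library(dplyr)

# Small stand-ins for county_count and county_info
counts <- data.frame(fips   = c(45001, 90005, NA),
                     admin2 = c("Abbeville", "Unassigned", "Weber"))
info   <- data.frame(fips   = c(45001, 49057),
                     county = c("Abbeville", "Weber"))

# Keep only the rows of counts whose fips has no partner in info
unmatched <- anti_join(counts, info, by = "fips")
print(unmatched)
```

Both the row with an out-of-range code (90005) and the row with a missing code are returned, which mirrors the two kinds of unmatched rows in the listing above.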
We find that some rows are missing fips:
county_count %>%
  filter(is.na(fips)) %>%
  select(fips, admin2, province_state) %>%
  print(n = Inf)
## # A tibble: 13 × 3
## fips admin2 province_state
## <dbl> <chr> <chr>
## 1 NA DeKalb Tennessee
## 2 NA DeSoto Florida
## 3 NA Dukes and Nantucket Massachusetts
## 4 NA Fillmore Minnesota
## 5 NA Kansas City Missouri
## 6 NA LaSalle Illinois
## 7 NA Manassas Virginia
## 8 NA McDuffie Georgia
## 9 NA Out of MI Michigan
## 10 NA Out of TN Tennessee
## 11 NA Unassigned Louisiana
## 12 NA Unassigned Michigan
## 13 NA Weber Utah
We need to (1) manually set the fips for some counties, (2) discard the rows labeled Unassigned or Out of (along with any rows whose fips is still missing), and (3) join with county_info again.
county_data <- county_count %>%
# manually set FIPS for some counties
mutate(fips = ifelse(admin2 == "DeKalb" & province_state == "Tennessee", 47041, fips)) %>%
mutate(fips = ifelse(admin2 == "DeSoto" & province_state == "Florida", 12027, fips)) %>%
#mutate(fips = ifelse(admin2 == "Dona Ana" & province_state == "New Mexico", 35013, fips)) %>%
mutate(fips = ifelse(admin2 == "Dukes and Nantucket" & province_state == "Massachusetts", 25019, fips)) %>%
mutate(fips = ifelse(admin2 == "Fillmore" & province_state == "Minnesota", 27045, fips)) %>%
#mutate(fips = ifelse(admin2 == "Harris" & province_state == "Texas", 48201, fips)) %>%
#mutate(fips = ifelse(admin2 == "Kenai Peninsula" & province_state == "Alaska", 2122, fips)) %>%
mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Illinois", 17099, fips)) %>%
#mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Louisiana", 22059, fips)) %>%
#mutate(fips = ifelse(admin2 == "Lac qui Parle" & province_state == "Minnesota", 27073, fips)) %>%
mutate(fips = ifelse(admin2 == "Manassas" & province_state == "Virginia", 51683, fips)) %>%
#mutate(fips = ifelse(admin2 == "Matanuska-Susitna" & province_state == "Alaska", 2170, fips)) %>%
mutate(fips = ifelse(admin2 == "McDuffie" & province_state == "Georgia", 13189, fips)) %>%
#mutate(fips = ifelse(admin2 == "McIntosh" & province_state == "Georgia", 13191, fips)) %>%
#mutate(fips = ifelse(admin2 == "McKean" & province_state == "Pennsylvania", 42083, fips)) %>%
mutate(fips = ifelse(admin2 == "Weber" & province_state == "Utah", 49057, fips)) %>%
filter(!(is.na(fips) | str_detect(admin2, "Out of") | str_detect(admin2, "Unassigned"))) %>%
left_join(county_info, by = "fips") %>%
print(width = Inf)
## # A tibble: 1,446 × 41
## fips admin2 province_state country_region last_update lat
## <dbl> <chr> <chr> <chr> <dttm> <dbl>
## 1 45001 Abbeville South Carolina US 2020-04-04 23:34:21 34.2
## 2 22001 Acadia Louisiana US 2020-04-04 23:34:21 30.3
## 3 51001 Accomack Virginia US 2020-04-04 23:34:21 37.8
## 4 16001 Ada Idaho US 2020-04-04 23:34:21 43.5
## 5 29001 Adair Missouri US 2020-04-04 23:34:21 40.2
## 6 40001 Adair Oklahoma US 2020-04-04 23:34:21 35.9
## 7 8001 Adams Colorado US 2020-04-04 23:34:21 39.9
## 8 28001 Adams Mississippi US 2020-04-04 23:34:21 31.5
## 9 31001 Adams Nebraska US 2020-04-04 23:34:21 40.5
## 10 42001 Adams Pennsylvania US 2020-04-04 23:34:21 39.9
## long_ confirmed deaths recovered active combined_key
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -82.5 6 0 0 0 Abbeville, South Carolina, US
## 2 -92.4 65 2 0 0 Acadia, Louisiana, US
## 3 -75.6 8 0 0 0 Accomack, Virginia, US
## 4 -116. 360 3 0 0 Ada, Idaho, US
## 5 -92.6 10 0 0 0 Adair, Missouri, US
## 6 -94.7 14 0 0 0 Adair, Oklahoma, US
## 7 -104. 294 9 0 0 Adams, Colorado, US
## 8 -91.4 16 0 0 0 Adams, Mississippi, US
## 9 -98.5 8 0 0 0 Adams, Nebraska, US
## 10 -77.2 21 0 0 0 Adams, Pennsylvania, US
## state county percent_fair_or_poor_health percent_smokers
## <chr> <chr> <dbl> <dbl>
## 1 South Carolina Abbeville 19.9 17.3
## 2 Louisiana Acadia 20.9 21.5
## 3 Virginia Accomack 20.1 18.3
## 4 Idaho Ada 11.5 12.0
## 5 Missouri Adair 21.4 20.5
## 6 Oklahoma Adair 28.5 27.7
## 7 Colorado Adams 16.6 16.3
## 8 Mississippi Adams 27.3 22.2
## 9 Nebraska Adams 15.8 14.6
## 10 Pennsylvania Adams 15.3 16.2
## percent_adults_with_obesity percent_with_access_to_exercise_opportunities
## <dbl> <dbl>
## 1 36.7 59.0
## 2 38.4 42.5
## 3 36.3 37.4
## 4 25.6 89.5
## 5 27.9 78.3
## 6 47.7 28.5
## 7 27.8 93.1
## 8 35.3 69.1
## 9 36.7 81.6
## 10 35.6 60.6
## percent_excessive_drinking percent_uninsured percent_some_college
## <dbl> <dbl> <dbl>
## 1 15.9 12.9 52.5
## 2 19.8 10.7 43.6
## 3 15.5 16.6 45.1
## 4 17.9 8.74 73.8
## 5 18.9 10.6 65.3
## 6 11.8 24.5 35.1
## 7 18.9 11.0 57.0
## 8 12.3 15.0 41.7
## 9 18.5 8.76 70.8
## 10 19.2 7.49 57.3
## percent_unemployed percent_children_in_poverty
## <dbl> <dbl>
## 1 3.98 30.8
## 2 5.37 35.4
## 3 3.81 27
## 4 2.46 10.2
## 5 3.51 19.9
## 6 4.17 34.9
## 7 3.47 12.6
## 8 6.21 40.4
## 9 2.87 14.4
## 10 3.27 11.2
## percent_single_parent_households percent_severe_housing_problems overcrowding
## <dbl> <dbl> <dbl>
## 1 37.1 14.3 0.463
## 2 33.4 12.3 3.51
## 3 45.9 15.1 2.10
## 4 23.8 14.0 1.46
## 5 29.5 18.0 0.740
## 6 38.3 15.4 5.65
## 7 31.0 18.1 5.37
## 8 66.4 12.8 2.37
## 9 26.2 10.5 0.904
## 10 26.7 12.3 1.88
## percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
## <dbl> <dbl> <dbl>
## 1 15.8 15.2 36.1
## 2 11.4 15.1 32.4
## 3 15.9 14.1 36.8
## 4 7.9 12 26.3
## 5 8.4 17.5 31.9
## 6 24.3 19.1 39.5
## 7 7.7 8 31.0
## 8 13.2 24.7 41.1
## 9 11 11.7 30.1
## 10 8.5 8.3 34.7
## percent_uninsured_2 median_household_income
## <dbl> <dbl>
## 1 15.9 42412
## 2 14.0 40484
## 3 19.4 42879
## 4 11.1 66827
## 5 12.3 40395
## 6 29.6 35156
## 7 13.8 70199
## 8 18.7 33392
## 9 10.7 55167
## 10 8.46 62877
## average_traffic_volume_per_meter_of_major_roadways percent_homeowners
## <dbl> <dbl>
## 1 11.6 76.3
## 2 63.7 70.8
## 3 60.0 67.9
## 4 277. 68.4
## 5 45.8 60.0
## 6 16.7 68.6
## 7 490. 65.2
## 8 150. 61.7
## 9 53.4 68.2
## 10 113. 77.2
## population_2 percent_less_than_18_years_of_age percent_65_and_over
## <dbl> <dbl> <dbl>
## 1 24541 20.1 21.8
## 2 62190 25.8 15.3
## 3 32412 20.5 23.6
## 4 469966 23.8 14.4
## 5 25339 18.4 14.8
## 6 22082 26.6 15.9
## 7 511868 26.5 10.5
## 8 31192 20.1 18.8
## 9 31511 23.7 18.2
## 10 102811 20.0 20.4
## percent_black percent_asian percent_hispanic percent_female percent_rural
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 27.5 0.412 1.54 51.6 78.6
## 2 17.9 0.320 2.73 51.2 51.7
## 3 28.0 0.781 9.34 51.2 100
## 4 1.24 2.81 8.31 49.9 5.47
## 5 2.85 2.28 2.57 51.9 37.9
## 6 0.534 0.802 6.82 50.1 83.3
## 7 3.19 4.37 40.4 49.5 3.62
## 8 52.4 0.513 11.3 47.9 37.2
## 9 0.996 1.33 10.9 50.2 22.5
## 10 1.60 0.875 7.11 50.8 53.7
## # … with 1,436 more rows
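The chain of `ifelse()` calls above can also be written as a small lookup table that is joined in and used to fill the missing codes. This is a sketch of that alternative, not the method used in this document; `counts`, `fips_fix`, and `patched` are made-up names, and only two of the corrections are included to keep it short.

```r
library(dplyr)

# Small stand-in for county_count
counts <- data.frame(
  fips           = c(NA, NA, 45001),
  admin2         = c("DeKalb", "Weber", "Abbeville"),
  province_state = c("Tennessee", "Utah", "South Carolina")
)

# Manual corrections, using the same codes as the pipeline above
fips_fix <- data.frame(
  admin2         = c("DeKalb", "Weber"),
  province_state = c("Tennessee", "Utah"),
  fips_fix       = c(47041, 49057)
)

patched <- counts %>%
  left_join(fips_fix, by = c("admin2", "province_state")) %>%
  # fill fips from the lookup only where it is missing
  mutate(fips = coalesce(fips, fips_fix)) %>%
  select(-fips_fix)
print(patched)
```

Keeping the corrections in one table makes them easier to review and extend than a long sequence of `mutate()` calls.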
Summarize again after the corrected join:
summary(county_data)
## fips admin2 province_state country_region
## Min. : 1001 Length:1446 Length:1446 Length:1446
## 1st Qu.:17186 Class :character Class :character Class :character
## Median :28156 Mode :character Mode :character Mode :character
## Mean :29455
## 3rd Qu.:42048
## Max. :56039
## last_update lat long_
## Min. :2020-04-04 23:34:21 Min. :19.60 Min. :-159.60
## 1st Qu.:2020-04-04 23:34:21 1st Qu.:33.96 1st Qu.: -94.52
## Median :2020-04-04 23:34:21 Median :38.02 Median : -86.48
## Mean :2020-04-04 23:34:21 Mean :37.71 Mean : -89.73
## 3rd Qu.:2020-04-04 23:34:21 3rd Qu.:41.39 3rd Qu.: -81.21
## Max. :2020-04-04 23:34:21 Max. :64.81 Max. : -68.65
## confirmed deaths recovered active
## Min. : 5.0 Min. : 0.000 Min. :0 Min. :0
## 1st Qu.: 9.0 1st Qu.: 0.000 1st Qu.:0 1st Qu.:0
## Median : 20.0 Median : 0.000 Median :0 Median :0
## Mean : 207.2 Mean : 4.854 Mean :0 Mean :0
## 3rd Qu.: 66.0 3rd Qu.: 2.000 3rd Qu.:0 3rd Qu.:0
## Max. :63306.0 Max. :1905.000 Max. :0 Max. :0
## combined_key state county
## Length:1446 Length:1446 Length:1446
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
## Min. : 8.121 Min. : 5.909 Min. :12.40
## 1st Qu.:14.390 1st Qu.:14.899 1st Qu.:29.10
## Median :17.010 Median :17.143 Median :32.90
## Mean :17.594 Mean :17.151 Mean :32.39
## 3rd Qu.:20.398 3rd Qu.:19.365 3rd Qu.:36.20
## Max. :38.887 Max. :27.775 Max. :51.00
## percent_with_access_to_exercise_opportunities percent_excessive_drinking
## Min. : 0.00 Min. : 7.81
## 1st Qu.: 59.95 1st Qu.:15.68
## Median : 74.71 Median :18.03
## Mean : 71.15 Mean :17.91
## 3rd Qu.: 85.97 3rd Qu.:20.01
## Max. :100.00 Max. :28.62
## percent_uninsured percent_some_college percent_unemployed
## Min. : 2.263 Min. :21.14 Min. : 1.582
## 1st Qu.: 6.754 1st Qu.:53.21 1st Qu.: 3.252
## Median : 9.937 Median :61.19 Median : 3.870
## Mean :10.592 Mean :60.83 Mean : 4.071
## 3rd Qu.:13.527 3rd Qu.:68.72 3rd Qu.: 4.690
## Max. :31.208 Max. :90.34 Max. :18.092
## percent_children_in_poverty percent_single_parent_households
## Min. : 2.50 Min. : 9.43
## 1st Qu.:12.82 1st Qu.:27.07
## Median :18.40 Median :32.96
## Mean :19.46 Mean :33.83
## 3rd Qu.:24.50 3rd Qu.:38.93
## Max. :55.00 Max. :80.00
## percent_severe_housing_problems overcrowding percent_adults_with_diabetes
## Min. : 6.562 Min. : 0.000 Min. : 1.80
## 1st Qu.:12.267 1st Qu.: 1.379 1st Qu.: 9.10
## Median :14.439 Median : 1.971 Median :11.30
## Mean :15.082 Mean : 2.437 Mean :11.75
## 3rd Qu.:16.992 3rd Qu.: 2.887 3rd Qu.:13.90
## Max. :33.391 Max. :14.489 Max. :34.10
## percent_food_insecure percent_insufficient_sleep percent_uninsured_2
## Min. : 3.40 Min. :23.03 Min. : 2.683
## 1st Qu.:10.70 1st Qu.:31.42 1st Qu.: 7.865
## Median :12.70 Median :34.02 Median :12.027
## Mean :13.25 Mean :33.88 Mean :12.786
## 3rd Qu.:15.20 3rd Qu.:36.54 3rd Qu.:16.572
## Max. :33.50 Max. :46.71 Max. :42.397
## median_household_income average_traffic_volume_per_meter_of_major_roadways
## Min. : 25385 Min. : 0.00
## 1st Qu.: 46994 1st Qu.: 53.09
## Median : 54317 Median : 104.63
## Mean : 57600 Mean : 200.72
## 3rd Qu.: 64775 3rd Qu.: 206.78
## Max. :140382 Max. :4444.12
## percent_homeowners population_2 percent_less_than_18_years_of_age
## Min. :24.13 Min. : 2887 Min. : 7.069
## 1st Qu.:64.36 1st Qu.: 36275 1st Qu.:20.326
## Median :69.96 Median : 75382 Median :22.182
## Mean :68.99 Mean : 201689 Mean :22.204
## 3rd Qu.:74.77 3rd Qu.: 179982 3rd Qu.:24.019
## Max. :89.76 Max. :10105518 Max. :35.447
## percent_65_and_over percent_black percent_asian percent_hispanic
## Min. : 7.722 Min. : 0.1286 Min. : 0.06245 Min. : 0.7952
## 1st Qu.:14.913 1st Qu.: 1.6168 1st Qu.: 0.68228 1st Qu.: 2.9451
## Median :17.225 Median : 5.6397 Median : 1.22863 Median : 5.6100
## Mean :17.512 Mean :12.4056 Mean : 2.40009 Mean :10.0338
## 3rd Qu.:19.598 3rd Qu.:17.4904 3rd Qu.: 2.66813 3rd Qu.:11.1199
## Max. :57.587 Max. :81.9544 Max. :42.95231 Max. :96.3595
## percent_female percent_rural
## Min. :34.63 Min. : 0.00
## 1st Qu.:50.00 1st Qu.: 17.11
## Median :50.65 Median : 36.97
## Mean :50.46 Mean : 40.12
## 3rd Qu.:51.35 3rd Qu.: 60.04
## Max. :56.87 Max. :100.00
If any variables have missing values for many counties, we go back and remove those variables from consideration.
Let’s create a final data frame for analysis.
county_data <- county_data %>%
mutate(state = as.factor(state)) %>%
select(county, confirmed, deaths, state, percent_fair_or_poor_health:percent_rural)
summary(county_data)
## county confirmed deaths state
## Length:1446 Min. : 5.0 Min. : 0.000 Georgia : 96
## Class :character 1st Qu.: 9.0 1st Qu.: 0.000 Texas : 80
## Mode :character Median : 20.0 Median : 0.000 North Carolina: 63
## Mean : 207.2 Mean : 4.854 Mississippi : 61
## 3rd Qu.: 66.0 3rd Qu.: 2.000 Indiana : 58
## Max. :63306.0 Max. :1905.000 Ohio : 57
## (Other) :1031
## percent_fair_or_poor_health percent_smokers percent_adults_with_obesity
## Min. : 8.121 Min. : 5.909 Min. :12.40
## 1st Qu.:14.390 1st Qu.:14.899 1st Qu.:29.10
## Median :17.010 Median :17.143 Median :32.90
## Mean :17.594 Mean :17.151 Mean :32.39
## 3rd Qu.:20.398 3rd Qu.:19.365 3rd Qu.:36.20
## Max. :38.887 Max. :27.775 Max. :51.00
##
## percent_with_access_to_exercise_opportunities percent_excessive_drinking
## Min. : 0.00 Min. : 7.81
## 1st Qu.: 59.95 1st Qu.:15.68
## Median : 74.71 Median :18.03
## Mean : 71.15 Mean :17.91
## 3rd Qu.: 85.97 3rd Qu.:20.01
## Max. :100.00 Max. :28.62
##
## percent_uninsured percent_some_college percent_unemployed
## Min. : 2.263 Min. :21.14 Min. : 1.582
## 1st Qu.: 6.754 1st Qu.:53.21 1st Qu.: 3.252
## Median : 9.937 Median :61.19 Median : 3.870
## Mean :10.592 Mean :60.83 Mean : 4.071
## 3rd Qu.:13.527 3rd Qu.:68.72 3rd Qu.: 4.690
## Max. :31.208 Max. :90.34 Max. :18.092
##
## percent_children_in_poverty percent_single_parent_households
## Min. : 2.50 Min. : 9.43
## 1st Qu.:12.82 1st Qu.:27.07
## Median :18.40 Median :32.96
## Mean :19.46 Mean :33.83
## 3rd Qu.:24.50 3rd Qu.:38.93
## Max. :55.00 Max. :80.00
##
## percent_severe_housing_problems overcrowding percent_adults_with_diabetes
## Min. : 6.562 Min. : 0.000 Min. : 1.80
## 1st Qu.:12.267 1st Qu.: 1.379 1st Qu.: 9.10
## Median :14.439 Median : 1.971 Median :11.30
## Mean :15.082 Mean : 2.437 Mean :11.75
## 3rd Qu.:16.992 3rd Qu.: 2.887 3rd Qu.:13.90
## Max. :33.391 Max. :14.489 Max. :34.10
##
## percent_food_insecure percent_insufficient_sleep percent_uninsured_2
## Min. : 3.40 Min. :23.03 Min. : 2.683
## 1st Qu.:10.70 1st Qu.:31.42 1st Qu.: 7.865
## Median :12.70 Median :34.02 Median :12.027
## Mean :13.25 Mean :33.88 Mean :12.786
## 3rd Qu.:15.20 3rd Qu.:36.54 3rd Qu.:16.572
## Max. :33.50 Max. :46.71 Max. :42.397
##
## median_household_income average_traffic_volume_per_meter_of_major_roadways
## Min. : 25385 Min. : 0.00
## 1st Qu.: 46994 1st Qu.: 53.09
## Median : 54317 Median : 104.63
## Mean : 57600 Mean : 200.72
## 3rd Qu.: 64775 3rd Qu.: 206.78
## Max. :140382 Max. :4444.12
##
## percent_homeowners population_2 percent_less_than_18_years_of_age
## Min. :24.13 Min. : 2887 Min. : 7.069
## 1st Qu.:64.36 1st Qu.: 36275 1st Qu.:20.326
## Median :69.96 Median : 75382 Median :22.182
## Mean :68.99 Mean : 201689 Mean :22.204
## 3rd Qu.:74.77 3rd Qu.: 179982 3rd Qu.:24.019
## Max. :89.76 Max. :10105518 Max. :35.447
##
## percent_65_and_over percent_black percent_asian percent_hispanic
## Min. : 7.722 Min. : 0.1286 Min. : 0.06245 Min. : 0.7952
## 1st Qu.:14.913 1st Qu.: 1.6168 1st Qu.: 0.68228 1st Qu.: 2.9451
## Median :17.225 Median : 5.6397 Median : 1.22863 Median : 5.6100
## Mean :17.512 Mean :12.4056 Mean : 2.40009 Mean :10.0338
## 3rd Qu.:19.598 3rd Qu.:17.4904 3rd Qu.: 2.66813 3rd Qu.:11.1199
## Max. :57.587 Max. :81.9544 Max. :42.95231 Max. :96.3595
##
## percent_female percent_rural
## Min. :34.63 Min. : 0.00
## 1st Qu.:50.00 1st Qu.: 17.11
## Median :50.65 Median : 36.97
## Mean :50.46 Mean : 40.12
## 3rd Qu.:51.35 3rd Qu.: 60.04
## Max. :56.87 Max. :100.00
##
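Display the 10 counties with the highest CFR. Because many counties are tied at a CFR of 0.2, top_n(10) keeps all of the tied rows and returns 18 rows rather than 10.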
county_data %>%
mutate(cfr = deaths / confirmed) %>%
select(county, state, confirmed, deaths, cfr) %>%
arrange(desc(cfr)) %>%
top_n(10)
## Selecting by cfr
## # A tibble: 18 × 5
## county state confirmed deaths cfr
## <chr> <fct> <dbl> <dbl> <dbl>
## 1 Emmet Michigan 7 2 0.286
## 2 Grand Traverse Michigan 12 3 0.25
## 3 Toole Montana 12 3 0.25
## 4 Fayette Indiana 14 3 0.214
## 5 Concordia Louisiana 5 1 0.2
## 6 Harrison Texas 5 1 0.2
## 7 Huntington Indiana 5 1 0.2
## 8 Isabella Michigan 10 2 0.2
## 9 McDuffie Georgia 5 1 0.2
## 10 Navarro Texas 5 1 0.2
## 11 Orange Indiana 5 1 0.2
## 12 Perry Pennsylvania 5 1 0.2
## 13 Randolph Indiana 5 1 0.2
## 14 Rockingham North Carolina 5 1 0.2
## 15 Seneca Ohio 5 1 0.2
## 16 Toombs Georgia 5 1 0.2
## 17 Vigo Indiana 10 2 0.2
## 18 Washington Alabama 5 1 0.2
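The listing above has more than 10 rows because `top_n()` keeps every row tied at the cutoff value. The following is a minimal sketch on a made-up tibble (`toy` and its values are hypothetical, not taken from the county data), assuming dplyr 1.0.0 or later, where `slice_max()` supersedes `top_n()` and exposes the tie handling through its `with_ties` argument.

```r
library(dplyr)

# Hypothetical toy data: five counties, three of them tied at cfr = 0.2
toy <- tibble(
  county    = c("A", "B", "C", "D", "E"),
  confirmed = c(10, 5, 5, 10, 20),
  deaths    = c(3, 1, 1, 2, 1)
) %>%
  mutate(cfr = deaths / confirmed)

# Default with_ties = TRUE keeps all rows tied at the cutoff, like top_n()
top_with_ties <- toy %>% slice_max(cfr, n = 2)

# with_ties = FALSE returns exactly n rows
top_exact <- toy %>% slice_max(cfr, n = 2, with_ties = FALSE)

nrow(top_with_ties)
nrow(top_exact)
```

Whether to keep or drop ties is a judgment call; with counts as small as 5 confirmed cases, many counties share the same ratio, so keeping ties at least makes that visible.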
Write the final data to a CSV file for future use.
write_csv(county_data, "./datasets/covid19-county-data-20200404.csv.gz")
Given that the datasets were collected in the middle of the pandemic, what assumptions of CFR might be violated by defining CFR as deaths/confirmed from this data set?
Because the COVID-19 pandemic was still ongoing in 2020, some critical assumptions behind the definition of CFR are not met by this data set.
The number of confirmed cases does not reflect the number of people who would have been diagnosed had testing been widely available; it is limited mainly by the availability of testing.
Some confirmed cases may die later, so the deaths recorded by this date undercount the eventual deaths among these cases.
With acknowledgement of these severe limitations, we continue to use deaths/confirmed as a very rough proxy of CFR.
Read and run the above code to generate a data frame county_data that includes county-level COVID-19 confirmed cases and deaths, together with demographic and health-related information.
What assumptions of logistic regression may be violated by this data set?
Run a logistic regression, using variables state, …, percent_rural as predictors.
Interpret the regression coefficients of 3 significant predictors with p-value < 0.01.
Apply analysis of deviance to (1) evaluate the goodness of fit of the model and (2) compare the model to the intercept-only model.
Perform analysis of deviance to evaluate the significance of each predictor. Display the 10 most significant predictors.
Construct confidence intervals of regression coefficients.
Plot the deviance residuals against the fitted values. Are there potential outliers?
Plot the half-normal plot. Are there potential outliers in predictor space?
Find the best sub-model using the AIC criterion.
Find the best sub-model using the lasso with cross validation.
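The steps listed above can be sketched with base R on simulated data. This is only an illustration under stated assumptions, not a solution for `county_data`: the data frame `toy`, the predictors `x1` and `x2`, and the simulation parameters are all hypothetical. It assumes a grouped binomial response, with deaths as successes out of confirmed cases, and uses `drop1()` and `step()` as one possible way to carry out the analysis of deviance and the AIC search. The lasso step is left out because it needs an extra package (for example `glmnet::cv.glmnet`).

```r
set.seed(1)

# Hypothetical toy data standing in for county_data: two made-up predictors
n <- 200
toy <- data.frame(
  confirmed = rpois(n, 50) + 5,
  x1 = rnorm(n),
  x2 = rnorm(n)
)
# Simulate deaths from a binomial model in which only x1 matters
toy$deaths <- rbinom(n, size = toy$confirmed, prob = plogis(-3 + 0.5 * toy$x1))

# Grouped binomial response: cbind(successes, failures)
fit <- glm(cbind(deaths, confirmed - deaths) ~ x1 + x2,
           family = binomial, data = toy)
fit0 <- update(fit, . ~ 1)  # intercept-only model

# (1) Goodness of fit: residual deviance against chi-square on residual df
gof_p <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)

# (2) Full model versus intercept-only model
lrt <- anova(fit0, fit, test = "Chisq")

# Analysis of deviance for each predictor (drop one term at a time)
d1 <- drop1(fit, test = "Chisq")

# Wald confidence intervals; confint(fit) would give profile-likelihood ones
ci <- confint.default(fit)

# Deviance residuals against the linear predictor
plot(predict(fit, type = "link"), residuals(fit, type = "deviance"),
     xlab = "Linear predictor", ylab = "Deviance residual")

# Half-normal plot of the leverages (hat values)
h <- sort(hatvalues(fit))
plot(qnorm((n + seq_len(n)) / (2 * n + 1)), h,
     xlab = "Half-normal quantiles", ylab = "Sorted leverages")

# AIC-based backward selection starting from the full model
best <- step(fit, trace = 0)
```

On `county_data`, the right-hand side of the formula would need to exclude `county`, `confirmed`, and `deaths`, and the factor `state` contributes one coefficient per level beyond the baseline.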
Consider a \(2 \times 2\) contingency table from a prospective study in which people who were or were not exposed to some pollutant are followed up and, after several years, categorized according to the presence or absence of a disease. The following table shows the probabilities for each cell. The odds of disease for either exposure group are \(O_i = \pi_i / (1 - \pi_i)\), for \(i = 1,2\), and so the odds ratio \[ \phi = \frac{O_1}{O_2} = \frac{\pi_1(1 - \pi_2)}{\pi_2 (1 - \pi_1)} \] is a measure of the relative likelihood of disease for the exposed and not exposed groups.
| | Diseased | Not diseased |
|---|---|---|
| Exposed | \(\pi_1\) | \(1 - \pi_1\) |
| Not exposed | \(\pi_2\) | \(1 - \pi_2\) |
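As a small numeric illustration of the odds ratio defined above, the sketch below plugs in made-up cell probabilities (`pi1 = 0.2` and `pi2 = 0.1` are hypothetical values, not from any study) and also checks that \(\log \phi\) equals the difference of the two logits, which is what connects the odds ratio to the logistic model.

```r
# Hypothetical disease probabilities for the exposed and not exposed groups
pi1 <- 0.2
pi2 <- 0.1

# Odds of disease for a given probability
odds <- function(p) p / (1 - p)

# Odds ratio phi = O_1 / O_2
phi <- odds(pi1) / odds(pi2)

# log(phi) is the difference of the logits of pi1 and pi2
logit_diff <- qlogis(pi1) - qlogis(pi2)

phi
log(phi)
logit_diff
```

With these values the odds are 0.25 and 1/9, so the odds ratio is 2.25.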
For the simple logistic model \[ \pi_i = \frac{e^{\beta_i}}{1 + e^{\beta_i}}, \] show that if there is no difference between the exposed and not exposed groups (i.e., \(\beta_1 = \beta_2\)), then \(\phi = 1\).
Consider \(J\) \(2 \times 2\) tables, one for each level \(x_j\) of a factor, such as age group, with \(j=1,\ldots, J\). For the logistic model \[ \pi_{ij} = \frac{e^{\alpha_i + \beta_i x_j}}{1 + e^{\alpha_i + \beta_i x_j}}, \quad i = 1,2, \quad j= 1,\ldots, J. \] Show that \(\log \phi\) is constant over all tables if \(\beta_1 = \beta_2\).
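A numeric check, which is not a proof, can make the second claim concrete. The sketch below evaluates \(\log \phi\) at several levels \(x_j\) under the logistic model above, once with equal slopes and once with unequal slopes. All parameter values and the levels in `x` are hypothetical.

```r
# Hypothetical parameter values and factor levels
alpha <- c(-1.0, -2.0)      # alpha_1, alpha_2
x     <- c(20, 30, 40, 50)  # levels x_j, e.g. age-group midpoints

# Log odds ratio at every level x_j for slopes beta = c(beta_1, beta_2)
log_phi <- function(beta) {
  p1 <- plogis(alpha[1] + beta[1] * x)  # pi_1j, exposed group
  p2 <- plogis(alpha[2] + beta[2] * x)  # pi_2j, not exposed group
  log((p1 * (1 - p2)) / (p2 * (1 - p1)))
}

equal_slopes   <- log_phi(c(0.03, 0.03))  # beta_1 = beta_2
unequal_slopes <- log_phi(c(0.03, 0.05))  # beta_1 != beta_2

equal_slopes
unequal_slopes
```

With equal slopes every entry of `equal_slopes` agrees up to rounding, while `unequal_slopes` changes with \(x_j\).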