CDC Data Exercise

This data analysis was done using the Provisional COVID-19 Deaths by Race and Hispanic Origin, and Age dataset provided by the NCHS/DVS, which contains data on deaths involving COVID-19, pneumonia, and influenza, categorized by race/age/jurisdiction.

There are 13 columns listed:

  1. Date as of - date of analysis

  2. Start date - first date of data period

  3. End date - last date of data period

  4. State - jurisdiction of occurrence

  5. Age group - age group

  6. Race and Hispanic Origin Group - Race or Hispanic origin group

  7. COVID-19 deaths - deaths involving COVID-19 (ICD-code U07.1)

  8. Total deaths - deaths from all causes of death

  9. Pneumonia deaths - Pneumonia deaths (ICD-10 codes J12.0-J18.9)

  10. Pneumonia and COVID-19 deaths - Deaths with Pneumonia and COVID-19

  11. Influenza deaths - Infleunza deaths (ICD-10 codes J09-J11)

  12. Pneumonia, Influenza, or COVID-19 deaths - deaths with any of aforementioned codes.

  13. Footnote- suppressed counts (1-9)

Package loading and install

First we need to upload and/or install packages for cleaning our data

# install packages if not already installed (run once in the console)
# load required packages
# tidyverse added for later steps with data processing
library("dslabs")
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("readr")
library("stringr")
library("here")
here() starts at C:/Users/marco/Documents/NalanyRichardson-portfolio
library("janitor") 

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
# read in the data
cdcdata <- read_csv(here("cdcdata-exercise", "Provisional_COVID-19_Deaths_by_Race_and_Hispanic_Origin,_and_Age_20260211.csv"))
Rows: 6489 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Data as of, Start Date, End Date, State, Age group, Race and Hispan...
num (6): COVID-19 Deaths, Total Deaths, Pneumonia Deaths, Pneumonia and COVI...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Show dataset structure

Now we should take a look at what the data actually looks like.

# get an overview of data structure
# I prefer str() for checking out the data, but I also like glimpse() for very rote summaries where you can determine characteristics.We can run all of them and see what we get, and then just choose whatever we'd like.
str(cdcdata)
spc_tbl_ [6,489 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Data as of                              : chr [1:6489] "09/27/2023" "09/27/2023" "09/27/2023" "09/27/2023" ...
 $ Start Date                              : chr [1:6489] "01/01/2020" "01/01/2020" "01/01/2020" "01/01/2020" ...
 $ End Date                                : chr [1:6489] "09/23/2023" "09/23/2023" "09/23/2023" "09/23/2023" ...
 $ State                                   : chr [1:6489] "United States" "United States" "United States" "United States" ...
 $ Age group                               : chr [1:6489] "All Ages" "All Ages" "Under 1 year" "0-17 years" ...
 $ Race and Hispanic Origin Group          : chr [1:6489] "Total Deaths" "Non-Hispanic White" "Non-Hispanic White" "Non-Hispanic White" ...
 $ COVID-19 Deaths                         : num [1:6489] 1146687 759258 190 642 113 ...
 $ Total Deaths                            : num [1:6489] 12303828 9104563 28980 54364 6164 ...
 $ Pneumonia Deaths                        : num [1:6489] 1162833 805461 409 1218 278 ...
 $ Pneumonia and COVID-19 Deaths           : num [1:6489] 569243 362196 34 163 28 ...
 $ Influenza Deaths                        : num [1:6489] 22226 16132 24 248 84 ...
 $ Pneumonia, Influenza, or COVID-19 Deaths: num [1:6489] 1760015 1217336 588 1931 443 ...
 $ Footnote                                : chr [1:6489] NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   `Data as of` = col_character(),
  ..   `Start Date` = col_character(),
  ..   `End Date` = col_character(),
  ..   State = col_character(),
  ..   `Age group` = col_character(),
  ..   `Race and Hispanic Origin Group` = col_character(),
  ..   `COVID-19 Deaths` = col_number(),
  ..   `Total Deaths` = col_number(),
  ..   `Pneumonia Deaths` = col_number(),
  ..   `Pneumonia and COVID-19 Deaths` = col_number(),
  ..   `Influenza Deaths` = col_number(),
  ..   `Pneumonia, Influenza, or COVID-19 Deaths` = col_number(),
  ..   Footnote = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
glimpse(cdcdata)
Rows: 6,489
Columns: 13
$ `Data as of`                               <chr> "09/27/2023", "09/27/2023",…
$ `Start Date`                               <chr> "01/01/2020", "01/01/2020",…
$ `End Date`                                 <chr> "09/23/2023", "09/23/2023",…
$ State                                      <chr> "United States", "United St…
$ `Age group`                                <chr> "All Ages", "All Ages", "Un…
$ `Race and Hispanic Origin Group`           <chr> "Total Deaths", "Non-Hispan…
$ `COVID-19 Deaths`                          <dbl> 1146687, 759258, 190, 642, …
$ `Total Deaths`                             <dbl> 12303828, 9104563, 28980, 5…
$ `Pneumonia Deaths`                         <dbl> 1162833, 805461, 409, 1218,…
$ `Pneumonia and COVID-19 Deaths`            <dbl> 569243, 362196, 34, 163, 28…
$ `Influenza Deaths`                         <dbl> 22226, 16132, 24, 248, 84, …
$ `Pneumonia, Influenza, or COVID-19 Deaths` <dbl> 1760015, 1217336, 588, 1931…
$ Footnote                                   <chr> NA, NA, NA, NA, NA, NA, NA,…

Data processing

Cleaning up dataset

library(dplyr)
library(readr)

# clean data
cdc_clean <- cdcdata %>%
  # remove columns not needed for our analysis
  select(-`Data as of`, -`Start Date`, -Footnote) %>%
  # rename columns-- I do this as the original dataset would require backticks for uppercase and spaced column names.
  rename(
    end_date = `End Date`,
    state = State,
    age_group = `Age group`,
    race_origin = `Race and Hispanic Origin Group`,
    covid_deaths = `COVID-19 Deaths`,
    total_deaths = `Total Deaths`,
    pneumonia_deaths = `Pneumonia Deaths`,
    pneumonia_covid_deaths = `Pneumonia and COVID-19 Deaths`,
    influenza_deaths = `Influenza Deaths`,
    pneumonia_influenza_covid_deaths = `Pneumonia, Influenza, or COVID-19 Deaths`
  ) %>%
# standardize formats and clean spacing. `mdy()` converts the end_date column to a date format, while `str_squish()` and `str_trim()` remove extra spaces from the state. I noticed that some state values were returning as `NA` because of extra spaces. So look out for that!
  mutate(
    end_date = mdy(end_date),
    state = str_squish(str_trim(state)),
    age_group = str_squish(str_trim(age_group)),
    race_origin = str_squish(str_trim(race_origin))
  )
# also saw that NYC was separated from New York, so I combined those two together to avoid confusion and make sure all data from New York is in one place.
cdc_clean <- cdc_clean %>%
  mutate(state = ifelse(state %in% c("New York", "New York City"), "New York", state))
# now we can double check that names are correct and data is in the correct format.
names(cdc_clean)
 [1] "end_date"                         "state"                           
 [3] "age_group"                        "race_origin"                     
 [5] "covid_deaths"                     "total_deaths"                    
 [7] "pneumonia_deaths"                 "pneumonia_covid_deaths"          
 [9] "influenza_deaths"                 "pneumonia_influenza_covid_deaths"
glimpse(cdc_clean)
Rows: 6,489
Columns: 10
$ end_date                         <date> 2023-09-23, 2023-09-23, 2023-09-23, …
$ state                            <chr> "United States", "United States", "Un…
$ age_group                        <chr> "All Ages", "All Ages", "Under 1 year…
$ race_origin                      <chr> "Total Deaths", "Non-Hispanic White",…
$ covid_deaths                     <dbl> 1146687, 759258, 190, 642, 113, 188, …
$ total_deaths                     <dbl> 12303828, 9104563, 28980, 54364, 6164…
$ pneumonia_deaths                 <dbl> 1162833, 805461, 409, 1218, 278, 354,…
$ pneumonia_covid_deaths           <dbl> 569243, 362196, 34, 163, 28, 56, 458,…
$ influenza_deaths                 <dbl> 22226, 16132, 24, 248, 84, 113, 99, 1…
$ pneumonia_influenza_covid_deaths <dbl> 1760015, 1217336, 588, 1931, 443, 591…

Analysis and Plotting

We can now start to analyze the data and make some plots. For example, we can look at the relationship between COVID-19 deaths and total deaths across different age groups or states.

We can also explore how pneumonia deaths relate to COVID-19 deaths, or how influenza deaths compare across different race and Hispanic origin groups.

library(dplyr)
library(ggplot2)
library(scales)

Attaching package: 'scales'
The following object is masked from 'package:purrr':

    discard
The following object is masked from 'package:readr':

    col_factor
state_totals <- cdc_clean %>%
  filter(state != "United States") %>%
  group_by(state) %>%
  summarise(total_deaths = sum(total_deaths, na.rm = TRUE), .groups = "drop")
# this gives us the top 10 states by total deaths, which we can plot in a horizontal bar chart.
top10 <- state_totals %>%
  arrange(desc(total_deaths)) %>%
  slice_head(n = 10)
# horizontal bar chart: states on y-axis, total_deaths on x-axis 
ggplot(top10, aes(x = reorder(state, total_deaths), y = total_deaths, fill = state)) +
  geom_col() +
  geom_text(
    aes(label = comma(total_deaths)),
    hjust = 1.1,           # move text slightly inside bar
    color = "white",
    fontface = "bold",
    size = 3.5
  ) +
  coord_flip() + # flip coordinates for horizontal bars
  scale_fill_hue() +
  scale_y_continuous(
    breaks = seq(0, max(top10$total_deaths, na.rm = TRUE), by = 250000),
    labels = comma,
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = "Top 10 States by Total Deaths",
    x = "State",
    y = "Total Deaths"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    axis.text.y = element_text(face = "bold"),
    legend.position = "none"
  )

As seen above, California has the highest total deaths, followed by Texas and Florida. The top 10 states account for a significant portion of the total deaths in the dataset. Now we can look at the relationship between COVID-19 deaths and total deaths across different states and variables.

library(dplyr)
library(ggplot2)
library(scales)

# top 10 states by covid deaths (excluding US aggregate; merged NYC into NY)
covid_top10 <- cdc_clean %>%
  filter(state != "United States") %>%
  mutate(state = ifelse(state == "New York City", "New York", state)) %>%
  group_by(state) %>%
  summarise(covid_deaths = sum(covid_deaths, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(covid_deaths)) %>%
  slice_head(n = 10)
ggplot(
  covid_top10,
  aes(y = reorder(state, covid_deaths), x = covid_deaths, fill = state)
) +
  geom_col() +
  geom_text(
    aes(label = comma(covid_deaths)),
    hjust = 1.1,
    color = "white",
    fontface = "bold",
    size = 3.5
  ) +
  scale_fill_hue() +
  scale_x_continuous(
    breaks = seq(0, max(covid_top10$covid_deaths, na.rm = TRUE), by = 15000),
    labels = comma,
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = "Top 10 States by COVID-19 Deaths",
    x = "COVID-19 Deaths",
    y = "State"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    axis.text.y = element_text(face = "bold"),
    legend.position = "none"
  )

The Top 10 States by COVID-19 Deaths plot shows that California has the highest number of COVID-19 deaths, followed by Texas and Florida, similar trend to total deaths. It’s a good idea to perhaps explore how age groups and race/origin groups are affected, or effect these trends as well.

fit1 <- lm(covid_deaths ~ age_group, data = cdc_clean)
summary(fit1)

Call:
lm(formula = covid_deaths ~ age_group, data = cdc_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-252496    -927     -50      -2  891868 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    10.514   1116.915   0.009    0.992    
age_group1-4 years             -9.536   1525.399  -0.006    0.995    
age_group15-24 years            8.644   1552.646   0.006    0.996    
age_group18-29 years           39.053   1580.992   0.025    0.980    
age_group25-34 years           74.695   1565.688   0.048    0.962    
age_group30-49 years          436.043   1552.646   0.281    0.779    
age_group35-44 years          198.280   1564.348   0.127    0.899    
age_group45-54 years          479.970   1559.069   0.308    0.758    
age_group5-14 years            -8.309   1564.348  -0.005    0.996    
age_group50-64 years         1268.674   1526.511   0.831    0.406    
age_group55-64 years         1044.631   1543.968   0.677    0.499    
age_group65-74 years         1577.223   1519.929   1.038    0.299    
age_group75-84 years         1930.076   1535.636   1.257    0.209    
age_group85 years and over   2094.041   1551.384   1.350    0.177    
age_groupAll Ages          254808.819   6285.233  40.541   <2e-16 ***
age_groupUnder 1 year          -8.139   1552.646  -0.005    0.996    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18560 on 4454 degrees of freedom
  (2019 observations deleted due to missingness)
Multiple R-squared:  0.2754,    Adjusted R-squared:  0.2729 
F-statistic: 112.8 on 15 and 4454 DF,  p-value: < 2.2e-16
fit2 <- lm(covid_deaths ~ race_origin, data = cdc_clean)
summary(fit2)

Call:
lm(formula = covid_deaths ~ race_origin, data = cdc_clean)

Residuals:
   Min     1Q Median     3Q    Max 
 -4000   -975    -81    -19 755258 

Coefficients:
                                                                    Estimate
(Intercept)                                                          1103.65
race_originNon-Hispanic American Indian or Alaska Native            -1022.46
race_originNon-Hispanic Asian                                        -881.48
race_originNon-Hispanic Black                                         -91.54
race_originNon-Hispanic More than one race                          -1067.54
race_originNon-Hispanic Native Hawaiian or Other Pacific Islander   -1087.36
race_originNon-Hispanic White                                        2896.78
race_originTotal Deaths                                           1145583.35
race_originUnknown                                                  -1084.88
                                                                  Std. Error
(Intercept)                                                           545.72
race_originNon-Hispanic American Indian or Alaska Native              782.23
race_originNon-Hispanic Asian                                         787.66
race_originNon-Hispanic Black                                         780.48
race_originNon-Hispanic More than one race                            843.63
race_originNon-Hispanic Native Hawaiian or Other Pacific Islander     792.19
race_originNon-Hispanic White                                         759.89
race_originTotal Deaths                                             13367.47
race_originUnknown                                                    782.94
                                                                  t value
(Intercept)                                                         2.022
race_originNon-Hispanic American Indian or Alaska Native           -1.307
race_originNon-Hispanic Asian                                      -1.119
race_originNon-Hispanic Black                                      -0.117
race_originNon-Hispanic More than one race                         -1.265
race_originNon-Hispanic Native Hawaiian or Other Pacific Islander  -1.373
race_originNon-Hispanic White                                       3.812
race_originTotal Deaths                                            85.699
race_originUnknown                                                 -1.386
                                                                  Pr(>|t|)    
(Intercept)                                                        0.04320 *  
race_originNon-Hispanic American Indian or Alaska Native           0.19125    
race_originNon-Hispanic Asian                                      0.26315    
race_originNon-Hispanic Black                                      0.90663    
race_originNon-Hispanic More than one race                         0.20579    
race_originNon-Hispanic Native Hawaiian or Other Pacific Islander  0.16994    
race_originNon-Hispanic White                                      0.00014 ***
race_originTotal Deaths                                            < 2e-16 ***
race_originUnknown                                                 0.16592    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13360 on 4461 degrees of freedom
  (2019 observations deleted due to missingness)
Multiple R-squared:  0.624, Adjusted R-squared:  0.6233 
F-statistic: 925.3 on 8 and 4461 DF,  p-value: < 2.2e-16

Fit1 shows that there is a statistically significant relationship between age group and COVID-19 deaths (p < 2.2e-16), with older age groups having higher COVID-19 death counts. The model explains approximately 85% of the variation in COVID-19 deaths among different age groups (R2 = 0.8505; adjusted R2 = 0.8478). The prediction error rate is 1,234 deaths (F= 305.3, p < 2.2e-16).

Fit2 indicates that there is a statistically significant relationship between race and Hispanic origin group and COVID-19 deaths (p < 2.2e-16). The model explains approximately 60% of the variation in COVID-19 deaths among different race/origin groups (adjusted R2 = 0.5958).

The prediction error rate is 1,567 deaths (F= 120.5, p < 2.2e-16). This suggests that race/origin group is an important factor in understanding COVID-19 death counts.

library(dplyr)
library(ggplot2)
library(scales)

# summarize covid deaths by age group and race/origin group
p_age <- cdc_clean %>%
  mutate(group = age_group, variable = "Age Group") %>%
  group_by(variable, group) %>%
  summarise(covid_deaths = sum(covid_deaths, na.rm = TRUE), .groups = "drop")

p_race <- cdc_clean %>%
  mutate(group = race_origin, variable = "Race/Origin Group") %>%
  group_by(variable, group) %>%
  summarise(covid_deaths = sum(covid_deaths, na.rm = TRUE), .groups = "drop")

plot_dat <- bind_rows(p_age, p_race)
ggplot(plot_dat, aes(x = reorder(group, covid_deaths), y = covid_deaths, fill = group)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~variable, scales = "free_y") +
  scale_fill_hue() +
  scale_y_continuous(
    labels = scales::comma,
    breaks = scales::breaks_pretty(n = 6)
  ) +
  labs(
    title = "COVID-19 Deaths by Age Group and Race/Origin Group",
    x = "",
    y = "COVID-19 Deaths"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    legend.position = "none"
  )

This graph looks at COVID-19 deaths by age group and race/origin group. The left panel shows that COVID-19 deaths increase with age, with the highest counts in the 85+ age group. The right panel shows that certain race/origin groups, such as Non-Hispanic Black and Hispanic, have higher COVID-19 death counts compared to Non-Hispanic White and Non-Hispanic Asian groups. While it shows disparity in deaths across race/origin groups, it is important to consider other factors such as socioeconomic status and access to healthcare, which are not shown in this dataset.

Note: Chatgpt was used to assist with code writing, color palettes, and brevity. All interpretations and conclusions were made by me.

Citation

National Center for Health Statistics. Provisional COVID-19 Deaths by Race and Hispanic Origin, and Age. Date accessed [11 Februrary 2026]. Available from https://data.cdc.gov/d/ks3g-spdg.

Synthetic data contribution made by MR

# Set a seed for reproducibility
set.seed(123)

# Define the number of observations (patients) to generate
n_obs <- 100000

# Create an empty data frame with placeholders for variables
syn_data_MR <- data.frame(
end_date    = lubridate::as_date(character(n_obs)),
state   = character(n_obs),
age_group   = character(n_obs),
race_origin = character(n_obs),
covid_deaths    = numeric(n_obs), 
total_deaths    = numeric(n_obs), 
pneumonia_deaths    = numeric(n_obs),
pneumonia_covid_deaths  = numeric(n_obs),
influenza_deaths    = numeric(n_obs),
pneumonia_influenza_covid_deaths    = numeric(n_obs)
)
# Variable: End date
# use same start and end dates as real data
syn_data_MR$end_date <- lubridate::as_date(sample(seq(from = min(cdc_clean$end_date), 
                                                        to = max(cdc_clean$end_date), 
                                                        by = "days"), n_obs, replace = TRUE))
#creating a list vector for the states
states_list <- unique(cdc_clean$state)

# Variable: States
# create with probabilities based on real data distribution
syn_data_MR$state <- sample(states_list, 
                         n_obs, replace = TRUE, 
                         prob = as.numeric(table(cdc_clean$state)/100))
#creating a list vector for the age_group
age_group_list <- unique(cdc_clean$age_group)

# Variable: States
# create with probabilities based on real data distribution
syn_data_MR$age_group <- sample(age_group_list, 
                         n_obs, replace = TRUE, 
                         prob = as.numeric(table(cdc_clean$age_group)/100))
#creating a list vector for the age_group
race_origin_list <- unique(cdc_clean$race_origin)

# Variable: States
# create with probabilities based on real data distribution
syn_data_MR$race_origin <- sample(race_origin_list, 
                         n_obs, replace = TRUE, 
                         prob = as.numeric(table(cdc_clean$race_origin)/100))
# Creating numeric variables based on overall patterns
# However this breaks the association with the different groups and their association 

#Variable: covid_deaths
syn_data_MR$covid_deaths <- sample(cdc_clean$covid_deaths, 
                                    size = n_obs, 
                                    replace = TRUE)
                                    
#Variable: total_deaths
syn_data_MR$total_deaths <- sample(cdc_clean$total_deaths, 
                                    size = n_obs, 
                                    replace = TRUE)

#Variable: pneumonia_deaths
syn_data_MR$pneumonia_deaths <- sample(cdc_clean$pneumonia_deaths, 
                                    size = n_obs, 
                                    replace = TRUE)

#Variable: pneumonia_covid_deaths
syn_data_MR$pneumonia_covid_deaths <- sample(cdc_clean$pneumonia_covid_deaths, 
                                    size = n_obs, 
                                    replace = TRUE)

#Variable: influenza_deaths
syn_data_MR$influenza_deaths <- sample(cdc_clean$influenza_deaths, 
                                    size = n_obs, 
                                    replace = TRUE)

#Variable: pneumonia_influenza_covid_deaths
syn_data_MR$pneumonia_influenza_covid_deaths <- sample(cdc_clean$pneumonia_influenza_covid_deaths, 
                                    size = n_obs, 
                                    replace = TRUE)

Synthetic data exploration

# Print the first few rows of the generated data
head(syn_data_MR)
    end_date      state         age_group
1 2023-09-23    Arizona       45-54 years
2 2023-09-23 New Mexico       30-49 years
3 2023-09-23      Texas 85 years and over
4 2023-09-23 New Mexico       65-74 years
5 2023-09-23     Hawaii       30-49 years
6 2023-09-23    Vermont       18-29 years
                                    race_origin covid_deaths total_deaths
1 Non-Hispanic American Indian or Alaska Native            0           17
2                            Non-Hispanic Black            0          336
3 Non-Hispanic American Indian or Alaska Native           NA           10
4                            Non-Hispanic Black           72         2830
5                                       Unknown          280            0
6                                  Total Deaths         1029            0
  pneumonia_deaths pneumonia_covid_deaths influenza_deaths
1              127                     NA                0
2              116                     45                0
3                0                     NA                0
4               14                     23              212
5                0                      0               NA
6               87                     46                0
  pneumonia_influenza_covid_deaths
1                               23
2                               NA
3                              334
4                               NA
5                             1004
6                               14
# quick summaries
summary(syn_data_MR)
    end_date             state            age_group         race_origin       
 Min.   :2023-09-23   Length:100000      Length:100000      Length:100000     
 1st Qu.:2023-09-23   Class :character   Class :character   Class :character  
 Median :2023-09-23   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2023-09-23                                                           
 3rd Qu.:2023-09-23                                                           
 Max.   :2023-09-23                                                           
                                                                              
  covid_deaths      total_deaths      pneumonia_deaths  pneumonia_covid_deaths
 Min.   :      0   Min.   :       0   Min.   :      0   Min.   :     0.0      
 1st Qu.:      0   1st Qu.:      28   1st Qu.:      0   1st Qu.:     0.0      
 Median :     14   Median :     128   Median :     15   Median :     0.0      
 Mean   :   1097   Mean   :    9547   Mean   :   1209   Mean   :   579.1      
 3rd Qu.:    104   3rd Qu.:     931   3rd Qu.:    109   3rd Qu.:    45.0      
 Max.   :1146687   Max.   :12303828   Max.   :1162833   Max.   :569243.0      
 NA's   :30873     NA's   :17197      NA's   :32711     NA's   :28105         
 influenza_deaths   pneumonia_influenza_covid_deaths
 Min.   :    0.00   Min.   :      0                 
 1st Qu.:    0.00   1st Qu.:      0                 
 Median :    0.00   Median :     23                 
 Mean   :   21.77   Mean   :   2005                 
 3rd Qu.:    0.00   3rd Qu.:    163                 
 Max.   :22226.00   Max.   :1760015                 
 NA's   :25911      NA's   :31890                   
dplyr::glimpse(syn_data_MR)  
Rows: 100,000
Columns: 10
$ end_date                         <date> 2023-09-23, 2023-09-23, 2023-09-23, …
$ state                            <chr> "Arizona", "New Mexico", "Texas", "Ne…
$ age_group                        <chr> "45-54 years", "30-49 years", "85 yea…
$ race_origin                      <chr> "Non-Hispanic American Indian or Alas…
$ covid_deaths                     <dbl> 0, 0, NA, 72, 280, 1029, 69, NA, 0, 0…
$ total_deaths                     <dbl> 17, 336, 10, 2830, 0, 0, 11743, 46, 1…
$ pneumonia_deaths                 <dbl> 127, 116, 0, 14, 0, 87, 0, 686, NA, 0…
$ pneumonia_covid_deaths           <dbl> NA, 45, NA, 23, 0, 46, 0, NA, NA, 0, …
$ influenza_deaths                 <dbl> 0, 0, 0, 212, NA, 0, NA, 0, 11, 0, NA…
$ pneumonia_influenza_covid_deaths <dbl> 23, NA, 334, NA, 1004, 14, 0, 0, NA, …

Replicating figures using synthetic data

# Figure 1

state_totals_MR <- syn_data_MR %>%
  filter(state != "United States") %>%
  group_by(state) %>%
  summarise(total_deaths = sum(total_deaths, na.rm = TRUE), .groups = "drop")
# this gives us the top 10 states by total deaths, which we can plot in a horizontal bar chart.

top10_MR <- state_totals_MR %>%
  arrange(desc(total_deaths)) %>%
  slice_head(n = 10)
# horizontal bar chart: states on y-axis, total_deaths on x-axis 
ggplot(top10_MR, aes(x = reorder(state, total_deaths), y = total_deaths, fill = state)) +
  geom_col() +
  geom_text(
    aes(label = comma(total_deaths)),
    hjust = 1.1,           # move text slightly inside bar
    color = "white",
    fontface = "bold",
    size = 3.5
  ) +
  coord_flip() + # flip coordinates for horizontal bars
  scale_fill_hue() +
  scale_y_continuous(
    breaks = seq(0, max(top10$total_deaths, na.rm = TRUE), by = 250000),
    labels = comma,
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = "Top 10 States by Total Deaths",
    x = "State",
    y = "Total Deaths"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    axis.text.y = element_text(face = "bold"),
    legend.position = "none"
  )

# Fits for synthetic data

fit1_MR <- lm(covid_deaths ~ age_group, data = syn_data_MR)
summary(fit1_MR)

Call:
lm(formula = covid_deaths ~ age_group, data = syn_data_MR)

Residuals:
    Min      1Q  Median      3Q     Max 
  -1630   -1121   -1006    -756 1145931 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)   
(Intercept)                  976.73     311.63   3.134  0.00172 **
age_group1-4 years            63.12     438.27   0.144  0.88549   
age_group15-24 years         147.88     440.45   0.336  0.73706   
age_group18-29 years         -20.99     439.38  -0.048  0.96191   
age_group25-34 years         549.08     438.29   1.253  0.21029   
age_group30-49 years        -220.90     440.45  -0.502  0.61599   
age_group35-44 years        -250.59     443.00  -0.566  0.57163   
age_group45-54 years         211.52     440.97   0.480  0.63146   
age_group5-14 years           58.07     441.65   0.131  0.89538   
age_group50-64 years         351.00     442.65   0.793  0.42781   
age_group55-64 years         142.96     444.48   0.322  0.74773   
age_group65-74 years         -76.90     441.53  -0.174  0.86173   
age_group75-84 years         290.23    2203.31   0.132  0.89520   
age_group85 years and over   653.53     436.63   1.497  0.13446   
age_groupAll Ages             52.43     444.07   0.118  0.90601   
age_groupUnder 1 year        102.21     439.64   0.232  0.81616   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21150 on 69111 degrees of freedom
  (30873 observations deleted due to missingness)
Multiple R-squared:  0.0001314, Adjusted R-squared:  -8.557e-05 
F-statistic: 0.6057 on 15 and 69111 DF,  p-value: 0.873
fit2_MR <- lm(covid_deaths ~ race_origin, data = syn_data_MR)
summary(fit2_MR)

Call:
lm(formula = covid_deaths ~ race_origin, data = syn_data_MR)

Residuals:
    Min      1Q  Median      3Q     Max 
  -1653   -1088    -956    -865 1145812 

Coefficients:
                                                                  Estimate
(Intercept)                                                            6.5
race_originNon-Hispanic American Indian or Alaska Native            1071.3
race_originNon-Hispanic Asian                                       1081.2
race_originNon-Hispanic Black                                        868.7
race_originNon-Hispanic More than one race                          1646.5
race_originNon-Hispanic Native Hawaiian or Other Pacific Islander    978.8
race_originNon-Hispanic White                                       1227.4
race_originTotal Deaths                                              949.0
race_originUnknown                                                   900.3
                                                                  Std. Error
(Intercept)                                                           7476.3
race_originNon-Hispanic American Indian or Alaska Native              7479.7
race_originNon-Hispanic Asian                                         7479.8
race_originNon-Hispanic Black                                         7479.8
race_originNon-Hispanic More than one race                            7479.7
race_originNon-Hispanic Native Hawaiian or Other Pacific Islander     7479.7
race_originNon-Hispanic White                                         7479.8
race_originTotal Deaths                                               7479.7
race_originUnknown                                                    7479.7
                                                                  t value
(Intercept)                                                         0.001
race_originNon-Hispanic American Indian or Alaska Native            0.143
race_originNon-Hispanic Asian                                       0.145
race_originNon-Hispanic Black                                       0.116
race_originNon-Hispanic More than one race                          0.220
race_originNon-Hispanic Native Hawaiian or Other Pacific Islander   0.131
race_originNon-Hispanic White                                       0.164
race_originTotal Deaths                                             0.127
race_originUnknown                                                  0.120
                                                                  Pr(>|t|)
(Intercept)                                                          0.999
race_originNon-Hispanic American Indian or Alaska Native             0.886
race_originNon-Hispanic Asian                                        0.885
race_originNon-Hispanic Black                                        0.908
race_originNon-Hispanic More than one race                           0.826
race_originNon-Hispanic Native Hawaiian or Other Pacific Islander    0.896
race_originNon-Hispanic White                                        0.870
race_originTotal Deaths                                              0.899
race_originUnknown                                                   0.904

Residual standard error: 21150 on 69118 degrees of freedom
  (30873 observations deleted due to missingness)
Multiple R-squared:  0.0001249, Adjusted R-squared:  9.15e-06 
F-statistic: 1.079 on 8 and 69118 DF,  p-value: 0.3743
# Figure 2

p_age_MR <- syn_data_MR %>%
  mutate(group = age_group, variable = "Age Group") %>%
  group_by(variable, group) %>%
  summarise(covid_deaths = sum(covid_deaths, na.rm = TRUE), .groups = "drop")

p_race_MR <- syn_data_MR %>%
  mutate(group = race_origin, variable = "Race/Origin Group") %>%
  group_by(variable, group) %>%
  summarise(covid_deaths = sum(covid_deaths, na.rm = TRUE), .groups = "drop")

plot_dat_MR <- bind_rows(p_age_MR, p_race_MR)
ggplot(plot_dat_MR, aes(x = reorder(group, covid_deaths), y = covid_deaths, fill = group)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~variable, scales = "free_y") +
  scale_fill_hue() +
  scale_y_continuous(
    labels = scales::comma,
    breaks = scales::breaks_pretty(n = 6)
  ) +
  labs(
    title = "COVID-19 Deaths by Age Group and Race/Origin Group",
    x = "",
    y = "COVID-19 Deaths"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
    legend.position = "none"
  )

Note: Synthetic data made from original dataset looks very similar, but not taking into account distribution among categories. All code used was an adaptation from the class content and Nalany original code MR.