COVID-19 Data Analysis: A Series of Visualizations Using RStudio and ggplot2

By Adam J. Albert, M.D.

Jun 6, 2021

The following is a series of plots I made for a Data Visualization and Storytelling course at Northwestern University. I used RStudio and the ggplot2 package to produce the visuals.

Plot 1A

Plot 1B

Plot 1: Combined Plots

Data Set Description

Click here to view the data set in Excel csv format: Data.gov COVID-19 Dataset

This data set contains the cases and deaths reported daily by each state throughout the pandemic. Data collection starts in January 2020 and the set is updated daily (the most recent data in this plot is from June 2021).

According to the CDC website, this data set is based on the data reported by each state. It is important to note that many cases of COVID-19 go undetected, so this data set does not include unreported cases or deaths. Therefore, it is a sample of the overall population in each state and thus the United States.

It is also important to note that this data is not randomly obtained. Because it is reported by each state, there could be sampling errors. For example, a state’s Department of Health may be underfunded or understaffed compared to another state’s, and it may underreport simply because its data collection is incomplete.

The variables of interest in this data set are:

  1. Submission date of all variables
  2. The state reporting the variables
  3. The total cases of COVID in the state as of the date in that row
  4. The total deaths in the state as of the date in that row

The second set of data can be viewed here: State Population Data

This is U.S. State population data and the variables include

  1. The State
  2. The population count in that state (either over age 18 or total)
  3. The year the data was generated (2012)
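Combining the two data sets to get population-adjusted rates can be sketched roughly as follows. This is a minimal illustration in base R, not the code used for the plots: the state codes, the numbers, and the column names (`tot_cases`, `tot_death`, `pop`) are all made-up assumptions for the example.

```r
# Hypothetical toy data: placeholder state codes and made-up numbers.
covid <- data.frame(
  state     = c("AA", "BB", "CC"),
  tot_cases = c(50000, 120000, 8000),
  tot_death = c(1000, 1800, 200)
)
population <- data.frame(
  state = c("AA", "BB", "CC"),
  pop   = c(1000000, 4000000, 500000)
)

# Join the case data to the population data on the state identifier.
combined <- merge(covid, population, by = "state")

# Cumulative cases per 100,000 residents.
combined$cases_per_100k <- combined$tot_cases / combined$pop * 1e5

# Case fatality rate: deaths as a percentage of reported cases.
combined$case_fatality_pct <- combined$tot_death / combined$tot_cases * 100

print(combined)
```

The same two derived columns could be created with dplyr's `mutate()`; base R is used here only so the sketch runs without extra packages.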

One last comment concerns the data about high and low restrictions. This did not come from a data set, but rather from the following source: States Ranked by COVID-19 Restrictions.

Using that source, I created a vector and subsequently a new variable, “Restrictions.” This listed the 50 states ranked from lowest to highest restrictions. The data to create the rankings was obtained from “the U.S. Census Bureau, the U.S. Bureau of Labor Statistics, the Kaiser Family Foundation, Ballotpedia, Editorial Projects in Education, Centers for Disease Control and Prevention, National Restaurant Association, Littler Mendelson, Husch Blackwell and Ogletree Deakins” (McCann 2021).

It is noted that 13 metrics for restrictions were weighted and scored across each state, then subsequently ranked. How weights were determined is unclear. Also, restrictions do change over time. The rankings used for these plots were based on April 5th, 2021 data. A more complete set of plots would chart rankings of restrictions and cases/fatality at 4 different points in time. While many states generally fell into the same category (high vs low restrictions), it should be noted that several states did change position from high to low and vice versa throughout the pandemic. This is a limitation to the interpretation of plot 1.
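One way the ranked vector and the “Restrictions” variable might be built is sketched below. This is an assumption-laden illustration: the placeholder names stand in for the 50 states and are not the published ranking, the helper column `restriction_rank` is hypothetical, and splitting the ranking at its midpoint into “Low” and “High” is an assumed cutoff rather than the one used for the plots.

```r
# Placeholder names stand in for the 50 states, ordered from fewest to most
# restrictions. This is NOT the actual ranking.
ranked_states <- c("State A", "State B", "State C",
                   "State D", "State E", "State F")

restrictions <- data.frame(
  state            = ranked_states,
  restriction_rank = seq_along(ranked_states)  # 1 = fewest restrictions
)

# Assumed cutoff: the lower half of the ranking is "Low", the upper half "High".
midpoint <- length(ranked_states) / 2
restrictions$Restrictions <- factor(
  ifelse(restrictions$restriction_rank <= midpoint, "Low", "High"),
  levels = c("Low", "High")
)

print(restrictions)
```

Storing the category as a factor with an explicit level order keeps “Low” ahead of “High” in plot legends and facets.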

Plot Purpose

The purpose of this plot is to show what effect restrictions might have on case fatality rates and COVID-19 case rates. Seeing a relationship between the level of restrictions and cases can help state officials make important decisions moving forward through the pandemic.

Plot 2

Data Set Description

Click here to view the dataset in Excel csv format: Data.gov COVID-19 Dataset

This is the same data set used for plot 1; for plot 2, however, I focused specifically on the state of Pennsylvania.

The variables of interest in this data set remain the same:
1. Submission date of all variables
2. The state reporting the variables
3. The total cases of COVID in the state as of the date in that row
4. The total deaths in the state as of the date in that row
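Narrowing the data to one state and turning cumulative totals into a trend can be sketched as below. The column names (`submission_date`, `state`, `tot_cases`, `tot_death`), the `mm/dd/yyyy` date format, and the numbers are assumptions made for illustration and should be checked against the downloaded file; `new_cases_derived` is a hypothetical helper column.

```r
# Made-up rows for illustration only; "XX" is a placeholder for another state.
daily <- data.frame(
  submission_date = c("03/08/2020", "03/06/2020", "03/07/2020", "03/06/2020"),
  state           = c("PA", "PA", "PA", "XX"),
  tot_cases       = c(6, 2, 4, 30),
  tot_death       = c(0, 0, 0, 1)
)

# Keep only the Pennsylvania rows.
pa <- subset(daily, state == "PA")

# Parse the date strings (assumed mm/dd/yyyy) and sort chronologically.
pa$submission_date <- as.Date(pa$submission_date, format = "%m/%d/%Y")
pa <- pa[order(pa$submission_date), ]

# Day-over-day change in the cumulative total; the first day has no prior value.
pa$new_cases_derived <- c(NA, diff(pa$tot_cases))

print(pa)
```

Sorting before calling `diff()` matters, because the differences are only meaningful when the rows are in date order.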

Audience

The audience for this plot would be state and local officials (state representatives, county commissioners, school board members, etc.) within the state of Pennsylvania.

Plot Purpose

The purpose is to help state policy makers see trends in cases and deaths throughout the pandemic. It is also intended to show how vaccination affected new cases and deaths. As the pandemic continues, if vaccine-resistant variants become prevalent or if there is another phase (such as if immunity begins to wane in certain populations), historical state data would help state officials plan mitigation efforts and medical resources.

Plot 3A

Plot 3B

Plot 3 Combined Plots

Plot 3

Data Set Description

Click here to view the data set: COVID-19 Data Repository by CSSE at Johns Hopkins University

This data set is maintained by Johns Hopkins University and is available on GitHub. The sources include, but are not limited to, the WHO, the U.S. CDC, the LA Times, the COVID Tracking Project, and numerous U.S. state departments of health (Dong, Du, and Gardner 2020).

Each source collects its data separately. As an example, NY State describes its data set here.

The data has been collected since March 2020 and is updated daily to the present day. In NY state, SARS-CoV-2 lab testing results are mandated to be reported electronically. If a person has multiple tests in one day, they are counted once. An individual will only be counted as positive once. Numerous states and sources are used in this data set and I have not individually evaluated each of them; however, collection policies are similar across many sources.

Since this is based on reported tests, people who weren’t tested are not included. There are likely people who had COVID-19, and probably some who died, without that data being reported. Therefore this data set represents a sample of the population of each state and the U.S. as a whole. Presumably case counts and deaths are higher than reported. Conversely, if errors occurred and states counted individuals twice, then counts may appear higher than they were. Therefore, the data may differ in precision from state to state.

The Variables of interest in this data set are

  1. Province_State: the U.S. state
  2. Incident_Rate: cases per 100,000. (Of note, I used mutate to create my own column and didn’t realize this variable existed until this write-up. My final numbers differ from it by as much as ±500 per 100,000 in some instances. I hypothesize this is because the population data I used is from 2012. Further work could replot the U.S. maps with the variable Incident_Rate, which would be more accurate. The lesson here is “know thy data.”)
  3. Lat and Long data for mapping
  4. Case_Fatality_Ratio: number of recorded deaths × 100 / number of confirmed cases
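Written out, the two rates in this list take the following form. The worked numbers underneath are hypothetical (they describe no real state) and are only meant to show how an older, smaller population denominator can shift a per-100,000 rate by a few hundred.

```latex
\[
\text{cases per 100{,}000} = \frac{\text{confirmed cases}}{\text{population}} \times 100{,}000,
\qquad
\text{Case\_Fatality\_Ratio} = \frac{\text{recorded deaths}}{\text{confirmed cases}} \times 100
\]

% Hypothetical example: 500,000 confirmed cases in one state.
\[
\frac{500{,}000}{10{,}000{,}000} \times 100{,}000 = 5{,}000
\quad \text{(older population estimate)},
\qquad
\frac{500{,}000}{10{,}500{,}000} \times 100{,}000 \approx 4{,}762
\quad \text{(more recent estimate)}
\]
```

In this made-up case, a 5% difference in the population estimate alone moves the rate by roughly 238 per 100,000, which is the same order of magnitude as the discrepancy noted above.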

The second set of data (the same as Plot 1) can be viewed here: State Population Data

This is U.S. State population data and the variables include

  1. The State
  2. The population count in that state (either over age 18 or total)
  3. The year the data was generated (2012)

Audience

The audience here includes the following: state policy making officials, general public (for perhaps an online newspaper). I also feel a map such as this would be helpful to use as a demonstration in a textbook on history, biostatistics or public health (re: studying pandemics). My primary target audience is state policy officials and the general public.

Plot Purpose

The purpose of the plot is simply to show the variation across states in case fatality rates and cases per 100,000. Many factors can cause these differences, and my intent is to prompt my audience to ask more questions, to understand what those factors are, and to develop policy that improves the overall numbers for each state. If we are approaching the end of the pandemic, this could serve as a case study for preparing for and managing future pandemics.

Plot 4

Data Set Description

Click here to view the data set in Excel csv format: Data.gov COVID-19 Dataset

This data set is the same one used in Plot 1. It contains the cases and deaths reported daily by each state throughout the pandemic. Data collection starts in January 2020 and the set is updated daily (the most recent data in this plot is from June 2021). Please refer to the plot 1 data description for more information.

The variables of interest in this data set for plot 4 are:

  1. Submission date of all variables
  2. The state reporting the variables
  3. The new cases of COVID in the state on the date in that row
  4. The new deaths in the state on the date in that row
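Rolling the per-state daily counts up into a national time series might look like the sketch below. The column names (`submission_date`, `state`, `new_case`, `new_death`) and the toy numbers are assumptions for illustration. The trailing 7-day average is an optional smoothing step that I have added for the example; it is not described in this write-up.

```r
# Made-up daily counts for two placeholder states over nine days.
daily <- data.frame(
  submission_date = rep(as.Date("2020-04-01") + 0:8, times = 2),
  state           = rep(c("AA", "BB"), each = 9),
  new_case        = c(1:9, rep(10, 9)),
  new_death       = rep(1, 18)
)

# Sum new cases and new deaths across all states for each date.
national <- aggregate(cbind(new_case, new_death) ~ submission_date,
                      data = daily, FUN = sum)
national <- national[order(national$submission_date), ]

# Optional smoothing: trailing 7-day mean of new cases.
# The first six days are NA because a full week is not yet available.
national$new_case_7day <- as.numeric(
  stats::filter(national$new_case, rep(1 / 7, 7), sides = 1)
)

print(national)
```

Dropping `state` from the right-hand side of the formula is what collapses the data to one row per date; keeping it (`~ submission_date + state`) would preserve the per-state series for comparison.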

Audience

Primary: Federal Health officials.
Secondary: students in public health classes

Plot Purpose

The purpose is to present a time series of data throughout the pandemic to compare trends in cases and deaths across the entire United States. Understanding how this virus affected different states over time can help in managing future outbreaks of viruses with characteristics similar to those of SARS-CoV-2.

Plot 5

Data Set Description

Click to view the CDC vaccination data set in Excel csv format: COVID-19 Vaccinations in the U.S.

This data set was created May 24, 2021 and is updated daily as of the writing of this article. It is a public data set owned by the U.S. government. There can be a delay between when a jurisdiction delivers a vaccine and when it is reported to the CDC. This could affect the plot, but the effect is unlikely to be large, since health care providers are expected to report within 72 hours of administration.

Several vaccine administration data systems are used, including the Immunization Information Systems (IISs) and the Vaccine Administration Management System (VAMS). The COVID-19 Data Clearinghouse is a cloud-hosted repository used to populate the Immunization Data Lake (IZ Data Lake) (CDC 2021).

Detailed information on how vaccine data are reported can be viewed here.

The variables of interest in this dataset are:

  1. Administered_Dose1_Recip_18PlusPop_Pct
  2. Administered_Dose1_Recip_65PlusPop_Pct
  3. Date
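Lining up the vaccination percentages with daily case counts by date could be sketched as follows. The two `Administered_Dose1_*` columns and `Date` come from the list above; the `Location` column, the `mm/dd/yyyy` date format, the `pa_cases` table, and every number here are assumptions made for illustration.

```r
# Made-up Pennsylvania daily case counts.
pa_cases <- data.frame(
  date     = as.Date("2021-05-24") + 0:2,
  new_case = c(900, 850, 800)
)

# Made-up vaccination rows; "XX" is a placeholder for another jurisdiction.
vacc <- data.frame(
  Date     = c("05/24/2021", "05/25/2021", "05/26/2021", "05/24/2021"),
  Location = c("PA", "PA", "PA", "XX"),
  Administered_Dose1_Recip_18PlusPop_Pct = c(60.0, 60.5, 61.0, 62.0),
  Administered_Dose1_Recip_65PlusPop_Pct = c(85.0, 85.2, 85.4, 84.0)
)

# Keep Pennsylvania only and parse the date strings (assumed mm/dd/yyyy).
pa_vacc <- subset(vacc, Location == "PA")
pa_vacc$date <- as.Date(pa_vacc$Date, format = "%m/%d/%Y")

# Join cases and vaccination percentages on the shared date.
keep <- c("date",
          "Administered_Dose1_Recip_18PlusPop_Pct",
          "Administered_Dose1_Recip_65PlusPop_Pct")
pa_combined <- merge(pa_cases, pa_vacc[, keep], by = "date")

print(pa_combined)
```

Because `merge()` performs an inner join by default, dates before the vaccination data begins drop out; passing `all.x = TRUE` would keep the earlier case rows with `NA` percentages instead.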

As vaccine distribution is tightly controlled and regulated, this data set is likely representative of the entire U.S. and state populations.

Audience

Primary: Pennsylvania health officials and health care providers.
Secondary: the general public, for understanding the effect of vaccination on cases and deaths.

Plot Purpose

The purpose is to show the trend of new cases and deaths in Pennsylvania with respect to the percentage of people vaccinated in two populations (age over 18 and age over 65). This can create a compelling argument for the importance of vaccination.

References

Centers for Disease Control and Prevention (CDC). About COVID-19 Vaccine Delivered and Administration Data. May 14, 2021. https://www.cdc.gov/coronavirus/2019-ncov/vaccines/distributing/about-vaccine-data.html

Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 2020;20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1

Ellison, Alya. States Ranked by COVID-19 Restrictions. Becker’s Hospital Review. March 2, 2021. https://www.beckershospitalreview.com/rankings-and-ratings/states-ranked-by-covid-19-restrictions.html

McCann, Adam. States with the Fewest Coronavirus Restrictions. Wallethub.com. April 6, 2021. https://wallethub.com/edu/states-coronavirus-restrictions/73818

Contact or Feedback

Please contact me at adamjalbertmd@gmail.com for further information or for the code used to develop the plots. This was my final for MSHA 455 at Northwestern University. Dr. Christine Maimone, the course instructor, provided valuable feedback to aid in the production of these visualizations. I would welcome any feedback or corrections. Any errors or inaccuracies are solely mine, as this work has not been formally reviewed or edited by anyone other than myself as of this posting.


Adam J. Albert, M.D.

Dr. Albert is a physician board certified in Internal Medicine. He is pursuing graduate degrees in Health Data Analytics and Health Systems Engineering.