I will briefly discuss properly interpreting data you might see in the mainstream or on social media. The takeaway: if recent data for some measure (e.g. pneumonia deaths) from this year looks to be different than prior years, make sure to check that it is not an artifact of data collection or compilation.
One of the many analysis mindsets you learn as a scientist, or after having analyzed enough data, is to be immediately suspicious of unusual data (especially if it appears to confirm a hypothesis) and check that it has not come about due to some quirk/artifact in the data analysis, compilation, or collection process. This is a brief post about interpreting trends in recent CDC data in light of current world events.
Readers can find code and data for the plots in this post at end or at
- https://github.com/bahanonu/cdc_statistics.
- The code automatically downloads the data.
- Raw data from CDC can be found at https://www.cdc.gov/flu/weekly/#S2.
In experimental science we often have the luxury of setting up experiments to control for variables that are causing trends unrelated to the variable(s) we think are causative. For example, as a systems neuroscientist, this often comes about by checking that animals are not doing something that is not immediately being measured after an experimental manipulation. We then either create devices to quantitatively measure the variable (e.g. accelerometers for movement) so it can be integrated into downstream analysis or by designing control experiments. For example, see our Science paper from last year (Corder*, Ahanonu*, 2019) where we quantitatively measured general locomotion (video tracking) and stimulus-induced movement (with an accelerometer) to make sure that general locomotion was not the dominate contributor to neural activity in the amygdala after delivery of noxious (painful) stimuli—as opposed to the prominent movement-related activity in striatum, see our Nature paper ( Parker*, Marshall*, 2018). However, this luxury is not often the case when analyzing epidemiological or similar data outside clinical or other studies.
The many lockdowns in place throughout the United States (and the world), owing to SARS-CoV-2 continuing to spread and many people developing COVID-19, has led to hypotheses about whether having less overall civilian activity would also reduce overall or flu-related mortality (e.g. less car crashes since people are driving less, reduced spread of seasonal influenza due to physical distancing ("social distancing" is a terrible and inaccurate phrase), etc.). To get a quick sense of whether this was true, I downloaded the weekly CDC mortality data that also includes influenza- and pneumonia-related morality. As can be seen in Fig. 2, there looks to be a dramatic drop in deaths reported around when many states and local governments began implementing lockdown (or shelter-at-home) orders. However, the dip was so dramatic that I became immediately suspicious...
To test my suspicions, I quickly wrote a R script that would download all the archived weekly releases (going back to the beginning of January 2020) from the CDC for the current year (2020). When I then plot the Fig. 2 measures (see Fig. 1), one immediately notices that the CDC continually revises their data (as they state on their website) and that the trend almost always appears to be in an upward direction (and can be quite significant). In fact, if one was to download data on any given week, one would see the dip seen in Fig. 2 occur. We can see this quite dramatically if we use the same 2020 weekly CDC data series for this year but plot 2019 instead of 2020 (see Fig. 3). Once we do this, we see that "All Deaths" near the end of 2019 are revised upward quite a bit with each successive weekly release of CDC data in 2020.
This only gives a very brief introduction on one aspect of data analysis: how to quickly check whether data that appears to confirm a hypothesis is in fact coming from an artifact in the data collection, compilation, or processing. For a similar look at how to interpret statistics one might see in the media, see my old post Sochi Olympic stats: medal count. Look out for future post related to statistics and data analysis. I hope everyone is doing well and staying safe during these tumultuous times.
Note: None of these graphs force the ordinate to start at zero as it is not relevant for, nor does it dramatically skew, the interpretation of the data as it relates to the conclusions being drawn. See Edward Tufte's take.
Below is the data used and script run to do the analysis.
- # Biafra Ahanonu
- # started:2020.04.06
- # Script to download CDC total mortality and pneumonia/influenza statistics and plot the total per year along with showing the increased deaths as a function of reporting week
- # Run by opening R or RStudio
- # changelog
- #
- # TODO
- #
- # =====================
- # CONSTANTS, change these as needed
- # Path on computer where data will be saved
- savePath = '_downloads\\';
- # URL to download CDC data. NOTE that %s near the end indicates where a dynamic string will be place to download each data series
- baseUrl = 'https://www.cdc.gov/flu/weekly/weeklyarchives2019-2020/data/NCHSData%s.csv';
- # Current week
- # currentWeek = 13;
- # Current year
- currentYear = 2020;
- # Size of text for ggplot
- axisTextSize = 20
- defaultTextSize = 15
- # =====================
- # Load dependencies
- # =====================
- # Format of imported CDC table data
- # [1] "Year"
- # [2] "Week"
- # [3] "Percent.of.Deaths.Due.to.Pneumonia.and.Influenza"
- # [4] "Expected"
- # [5] "Threshold"
- # [6] "All.Deaths"
- # [7] "Pneumonia.Deaths"
- # [8] "Influenza.Deaths"
- # =====================
- # Download and load the CDC data
- # Weeks to download
- # Check download directory exists
- }
- strNo = paste0('0',fileNo)
- }else{
- strNo = fileNo
- }
- # Download CDC data
- # Check file exists, if not then download
- }
- # Load downloaded data into RAM
- filePath = destfile
- tableData$CollectionNo = fileNo
- }else{
- tableDataAll = tableData
- }
- tableDataLatest = tableData
- }
- }
- # =====================
- # Convert table into long data format to allow easy faceting, ONLY use the current year
- tableDataAllMelt = melt(tableDataAll[tableDataAll$Year==currentYear,],id.vars=c("Year","Week","CollectionSeries","CollectionNo"))
- # Remove rows not informative to most users
- tableDataAllMelt = tableDataAllMelt[!(tableDataAllMelt$variable %in% columnsToRemove),]
- # Adjust the default week by what is actually in each table
- # =====================
- # Plot CDC data as a function of week data reported
- newplot = ggplot(tableDataAllMelt,aes(Week,value,color=CollectionNo,group=CollectionNo))+
- geom_line(size=2)+
- geom_point(size=2)+
- xlab('Week (1 = start of year)')+
- ylab('')+
- labs(color='Week data reported')+
- # ggCustomColor(colourCount=14)+
- # ggCustomColorContinuous(lowColor="gray",midColor="blue",midpointH=currentWeek/2)+
- facet_wrap(~variable,scales="free_y")+
- ggtitle(sprintf('CDC total mortality by week data reported in %s | Gray lines indicate every 4 weeks',currentYear))+
- ggThemeBlank(axisTextSize=axisTextSize,defaultTextSize=defaultTextSize)
- # =====================
- # PLOT STATISTICS FOR ALL YEARS
- # Convert most recent table into long format
- tableData = tableDataLatest
- # Remove rows not informative to most users
- tableDataMelt = tableDataMelt[!(tableDataMelt$variable %in% columnsToRemove),]
- # Adjust the default week by what is actually in each table
- # Plot all years in CDC dataset by various statistics
- newplot = ggplot(tableDataMelt,aes(Week,value,color=Year,group=Year))+
- geom_line(size=2)+
- xlab('Week (1 = start of year)')+
- ylab('')+
- facet_wrap(~variable,scales="free_y")+
- ggThemeBlank(axisTextSize=axisTextSize,defaultTextSize=defaultTextSize)
- # Biafra Ahanonu
- # Themes to add to ggplot
- # started 2014.03.19
- ggThemeBlank <- function(stripTextSize=20,stripYAngle = 270, axisTextSize = 25, defaultTextSize = 20, axisXAngle = 0, gridMajorColor = "transparent", gridMinorColor = "transparent", backgroundColor="white",plotBackgroundColor='white', borderColor = "transparent",xAxisAdj = 0.5, tickColor="black", tickSize = 0.5, tickWidth = 1, titleTextSize = 15, legendPosition="right", legendDirection="vertical",stripTextColor="black",axisTextColor="black", backgroundColorLine=NA,backgroundColorSize=1, axisLineColor="transparent", axisLineSize = 2){
- # font_import(pattern="[F/f]utura")
- # theme(text=element_text(size=16, family="Comic Sans MS"))
- # gridMajorColor = "#F0F0F0"
- theme(panel.background = element_rect(fill = backgroundColor, colour = NA),
- plot.background = element_rect(size=backgroundColorSize,linetype="solid",color=backgroundColorLine, fill = plotBackgroundColor),
- legend.text=element_text(size=defaultTextSize),
- legend.title=element_text(size=defaultTextSize),
- legend.key = element_blank(),
- legend.key.height=unit(1.5,"line"),
- legend.key.width=unit(1.5,"line"),
- legend.position=legendPosition,
- legend.direction = legendDirection,
- # strip.background = element_rect(fill = '#005FAD'),
- # strip.text.x = element_text(colour = 'white', angle = 0, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
- # strip.text.y = element_text(colour = 'white', angle = stripYAngle, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
- strip.background = element_rect(fill = 'white'),
- strip.text.x = element_text(colour = stripTextColor, angle = 0, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
- strip.text.y = element_text(colour = stripTextColor, angle = stripYAngle, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
- axis.text.x = element_text(colour=axisTextColor, size = axisTextSize, angle = axisXAngle, vjust = xAxisAdj,hjust = xAxisAdj),
- axis.text.y = element_text(colour=axisTextColor, size = axisTextSize),
- axis.title.y=element_text(vjust=1.6, size = axisTextSize,colour=axisTextColor),
- axis.title.x=element_text(vjust=0.2, size = axisTextSize,colour=axisTextColor),
- # axis.line = element_line(colour = axisLineColor),
- axis.line.x = element_line(color = axisLineColor, size = axisLineSize),
- axis.line.y = element_line(color = axisLineColor, size = axisLineSize),
- plot.title = element_text(vjust=1.4, size = titleTextSize),
- # axis.ticks.x = element_blank(),
- # axis.ticks.y = element_blank(),
- axis.ticks.x = element_line(color = tickColor, size = tickWidth),
- axis.ticks.y = element_line(color = tickColor, size = tickWidth),
- axis.ticks.length=unit(tickSize,"cm"),
- panel.grid.major = element_line(color = gridMajorColor),
- panel.grid.minor = element_line(color = gridMinorColor),
- panel.border = element_rect(fill = NA,colour = borderColor),
- panel.spacing=unit(1 , "lines"))
- }
- # Pastel1 also option
- colorValues = getPalette(colourCount)
- }
- }
- ggCustomColorContinuous <- function(midpointH=0,lowColor="white",midColor="gray",highColor="red",...){
- }