Wary Statistics #1: The tale of CDC mortality

Summary

I will briefly discuss properly interpreting data you might see in the mainstream or on social media. The takeaway: if recent data for some measure (e.g. pneumonia deaths) from this year looks to be different than prior years, make sure to check that it is not an artifact of data collection or compilation.

Figure 1. CDC statistics by what week they were reported. Shows a continuous revision of each statistic upward and reveals why people should be extremely cautious about interpreting recent CDC (or other) data that is subject to change.

One of the many analysis mindsets you learn as a scientist, or after having analyzed enough data, is to be immediately suspicious of unusual data (especially if it appears to confirm a hypothesis) and check that it has not come about due to some quirk/artifact in the data analysis, compilation, or collection process. This is a brief post about interpreting trends in recent CDC data in light of current world events.

Readers can find code and data for the plots in this post at end or at

In experimental science we often have the luxury of setting up experiments to control for variables that are causing trends unrelated to the variable(s) we think are causative. For example, as a systems neuroscientist, this often comes about by checking that animals are not doing something that is not immediately being measured after an experimental manipulation. We then either create devices to quantitatively measure the variable (e.g. accelerometers for movement) so it can be integrated into downstream analysis or by designing control experiments. For example, see our Science paper from last year (Corder*, Ahanonu*, 2019) where we quantitatively measured general locomotion (video tracking) and stimulus-induced movement (with an accelerometer) to make sure that general locomotion was not the dominate contributor to neural activity in the amygdala after delivery of noxious (painful) stimuli—as opposed to the prominent movement-related activity in striatum, see our Nature paper ( Parker*, Marshall*, 2018). However, this luxury is not often the case when analyzing epidemiological or similar data outside clinical or other studies.

Figure 2. 2013-2020 CDC reporting for various morality statistics. There is a prominent decline in the current year (red box) that one should interpret with caution, as it is likely an artifact of CDC data collection and reporting.

The many lockdowns in place throughout the United States (and the world), owing to SARS-CoV-2 continuing to spread and many people developing COVID-19, has led to hypotheses about whether having less overall civilian activity would also reduce overall or flu-related mortality (e.g. less car crashes since people are driving less, reduced spread of seasonal influenza due to physical distancing ("social distancing" is a terrible and inaccurate phrase), etc.). To get a quick sense of whether this was true, I downloaded the weekly CDC mortality data that also includes influenza- and pneumonia-related morality. As can be seen in Fig. 2, there looks to be a dramatic drop in deaths reported around when many states and local governments began implementing lockdown (or shelter-at-home) orders. However, the dip was so dramatic that I became immediately suspicious...

Figure 3. 2019 CDC mortality statistics by 2020 reporting week. Using the 2020 weekly CDC archive mortality releases to plot 2019 mortality statistics also shows dramatic revising of death statistics upwards for prior weeks (see end of the year near yellow bracket).

To test my suspicions, I quickly wrote a R script that would download all the archived weekly releases (going back to the beginning of January 2020) from the CDC for the current year (2020). When I then plot the Fig. 2 measures (see Fig. 1), one immediately notices that the CDC continually revises their data (as they state on their website) and that the trend almost always appears to be in an upward direction (and can be quite significant). In fact, if one was to download data on any given week, one would see the dip seen in Fig. 2 occur. We can see this quite dramatically if we use the same 2020 weekly CDC data series for this year but plot 2019 instead of 2020 (see Fig. 3). Once we do this, we see that "All Deaths" near the end of 2019 are revised upward quite a bit with each successive weekly release of CDC data in 2020.

This only gives a very brief introduction on one aspect of data analysis: how to quickly check whether data that appears to confirm a hypothesis is in fact coming from an artifact in the data collection, compilation, or processing. For a similar look at how to interpret statistics one might see in the media, see my old post Sochi Olympic stats: medal count. Look out for future post related to statistics and data analysis. I hope everyone is doing well and staying safe during these tumultuous times.

Note: None of these graphs force the ordinate to start at zero as it is not relevant for, nor does it dramatically skew, the interpretation of the data as it relates to the conclusions being drawn. See Edward Tufte's take.

Below is the data used and script run to do the analysis.

download cdc_mortality.R

R / S+
  1. # Biafra Ahanonu
  2. # started:2020.04.06
  3. # Script to download CDC total mortality and pneumonia/influenza statistics and plot the total per year along with showing the increased deaths as a function of reporting week
  4. # Run by opening R or RStudio
  5. # changelog
  6.         #
  7. # TODO
  8.         #
  9.  
  10. # =====================
  11. # CONSTANTS, change these as needed
  12. # Path on computer where data will be saved
  13. savePath = '_downloads\\';
  14.  
  15. # URL to download CDC data. NOTE that %s near the end indicates where a dynamic string will be place to download each data series
  16. baseUrl = 'https://www.cdc.gov/flu/weekly/weeklyarchives2019-2020/data/NCHSData%s.csv';
  17.  
  18. # Current week
  19. currentWeek = as.numeric(strftime(c(Sys.Date()), format = "%V"))-2;
  20.         # currentWeek = 13;
  21. # Current year
  22. currentYear = 2020;
  23. columnsToRemove = c("Expected","Threshold")
  24.  
  25. # Size of text for ggplot
  26. axisTextSize = 20
  27. defaultTextSize = 15
  28.  
  29. # =====================
  30. # Load dependencies
  31. srcFileList = c('helper.ggplot_themes.R')
  32. lapply(srcFileList,FUN=function(file){source(file)})
  33.  
  34. packagesFileList = c("reshape2", "ggplot2","grid","gridExtra","ggthemes","RColorBrewer")
  35. lapply(packagesFileList,FUN=function(file){if(!require(file,character.only = TRUE)){install.packages(file,dep=TRUE)}})
  36.  
  37. # =====================
  38. # Format of imported CDC table data
  39. # [1] "Year"
  40. # [2] "Week"
  41. # [3] "Percent.of.Deaths.Due.to.Pneumonia.and.Influenza"
  42. # [4] "Expected"
  43. # [5] "Threshold"
  44. # [6] "All.Deaths"
  45. # [7] "Pneumonia.Deaths"
  46. # [8] "Influenza.Deaths"
  47.  
  48. # =====================
  49. # Download and load the CDC data
  50. # Weeks to download
  51. urlList = c(1:currentWeek)
  52. nLinks = length(urlList)
  53.  
  54. # Check download directory exists
  55. if(dir.exists(savePath)==FALSE){
  56.         dir.create(savePath)
  57. }
  58.  
  59. for (fileNo in c(1:nLinks)) {
  60.         if(fileNo<10){
  61.                 strNo = paste0('0',fileNo)
  62.         }else{
  63.                 strNo = fileNo
  64.         }
  65.  
  66.         # Download CDC data
  67.         fileURL = sprintf(baseUrl,strNo)
  68.         destfile = paste0(savePath,sprintf('NCHSData%s.csv',strNo))
  69.         # Check file exists, if not then download
  70.         if(!file.exists(destfile)){
  71.                 print(paste0('Download: ',fileURL,' | ',destfile))
  72.                 download.file(fileURL,destfile=destfile,quiet=TRUE,cacheOK = FALSE,method="curl")
  73.         }
  74.  
  75.         # Load downloaded data into RAM
  76.         filePath = destfile
  77.         print(paste0('Loading: ',filePath))
  78.         tableData = read.table(filePath,sep=',',header=T);
  79.         tableData$CollectionSeries = sprintf('NCHSData%s',strNo)
  80.         tableData$CollectionNo = fileNo
  81.         if(fileNo>1){
  82.                 tableDataAll = rbind(tableDataAll,tableData)
  83.         }else{
  84.                 tableDataAll = tableData
  85.         }
  86.         if(fileNo==nLinks){
  87.                 tableDataLatest = tableData
  88.         }
  89. }
  90.  
  91. # =====================
  92. # Convert table into long data format to allow easy faceting, ONLY use the current year
  93. tableDataAll$Year = as.character(tableDataAll$Year)
  94.         tableDataAllMelt = melt(tableDataAll[tableDataAll$Year==currentYear,],id.vars=c("Year","Week","CollectionSeries","CollectionNo"))
  95.         tableDataAllMelt$variable = gsub('\\.',' ',tableDataAllMelt$variable)
  96.         tableDataAllMelt$variable = gsub(' to ',' to\n',tableDataAllMelt$variable)
  97.         # Remove rows not informative to most users
  98.         tableDataAllMelt = tableDataAllMelt[!(tableDataAllMelt$variable %in% columnsToRemove),]
  99.         # Adjust the default week by what is actually in each table
  100.         currentWeekAdj = max(tableDataAllMelt$Week)
  101.  
  102. # =====================
  103. # Plot CDC data as a function of week data reported
  104. newplot = ggplot(tableDataAllMelt,aes(Week,value,color=CollectionNo,group=CollectionNo))+
  105.         geom_vline(xintercept=seq(4,currentWeekAdj,4),color='gray')+
  106.         geom_line(size=2)+
  107.         geom_point(size=2)+
  108.         xlab('Week (1 = start of year)')+
  109.         ylab('')+
  110.         labs(color='Week data reported')+
  111.         scale_x_continuous(breaks=seq(1,currentWeekAdj,1))+
  112.         # ggCustomColor(colourCount=14)+
  113.         # ggCustomColorContinuous(lowColor="gray",midColor="blue",midpointH=currentWeek/2)+
  114.         scale_colour_gradient2(low="gray", mid="blue", high="red",midpoint=currentWeek/2,limits = c(1,currentWeek),breaks=seq(1,currentWeek,1))+
  115.         facet_wrap(~variable,scales="free_y")+
  116.         ggtitle(sprintf('CDC total mortality by week data reported in %s | Gray lines indicate every 4 weeks',currentYear))+
  117.         ggThemeBlank(axisTextSize=axisTextSize,defaultTextSize=defaultTextSize)
  118.  
  119. dev.new(width=20,height=5)
  120. print(newplot)
  121.  
  122. # =====================
  123. # PLOT STATISTICS FOR ALL YEARS
  124.  
  125. # Convert most recent table into long format
  126. tableData = tableDataLatest
  127.         tableData$Year = as.character(tableData$Year)
  128.         tableDataMelt = melt(tableData,id.vars=c("Year","Week","CollectionSeries","CollectionNo"))
  129.         tableDataMelt$variable = gsub('\\.',' ',tableDataMelt$variable)
  130.         tableDataMelt$variable = gsub(' to ',' to\n',tableDataMelt$variable)
  131.         # Remove rows not informative to most users
  132.         tableDataMelt = tableDataMelt[!(tableDataMelt$variable %in% columnsToRemove),]
  133.         # Adjust the default week by what is actually in each table
  134.         currentWeekAdj = max(tableDataMelt$Week)
  135.  
  136. # Plot all years in CDC dataset by various statistics
  137. newplot = ggplot(tableDataMelt,aes(Week,value,color=Year,group=Year))+
  138.         geom_vline(xintercept=seq(4,currentWeekAdj,4),color='gray')+
  139.         geom_line(size=2)+
  140.         xlab('Week (1 = start of year)')+
  141.         ylab('')+
  142.         scale_x_continuous(breaks=seq(1,currentWeekAdj,6))+
  143.         facet_wrap(~variable,scales="free_y")+
  144.         ggCustomColor(colourCount=length(unique(tableDataLatest$Year))+1)+
  145.         ggThemeBlank(axisTextSize=axisTextSize,defaultTextSize=defaultTextSize)
  146.  
  147. dev.new(width=20,height=5)
  148. print(newplot)
download helper.ggplot_themes.R

R / S+
  1. # Biafra Ahanonu
  2. # Themes to add to ggplot
  3. # started 2014.03.19
  4.  
  5. ggThemeBlank <- function(stripTextSize=20,stripYAngle = 270, axisTextSize = 25, defaultTextSize = 20, axisXAngle = 0, gridMajorColor = "transparent", gridMinorColor = "transparent", backgroundColor="white",plotBackgroundColor='white', borderColor = "transparent",xAxisAdj = 0.5, tickColor="black", tickSize = 0.5, tickWidth = 1, titleTextSize = 15, legendPosition="right", legendDirection="vertical",stripTextColor="black",axisTextColor="black", backgroundColorLine=NA,backgroundColorSize=1, axisLineColor="transparent", axisLineSize = 2){
  6.         # font_import(pattern="[F/f]utura")
  7.         # theme(text=element_text(size=16, family="Comic Sans MS"))
  8.         # gridMajorColor = "#F0F0F0"
  9.         theme(panel.background = element_rect(fill = backgroundColor, colour = NA),
  10.                 plot.background = element_rect(size=backgroundColorSize,linetype="solid",color=backgroundColorLine, fill = plotBackgroundColor),
  11.                 text = element_text(size=defaultTextSize),
  12.                 legend.text=element_text(size=defaultTextSize),
  13.                 legend.title=element_text(size=defaultTextSize),
  14.                 legend.key = element_blank(),
  15.                 legend.key.height=unit(1.5,"line"),
  16.                 legend.key.width=unit(1.5,"line"),
  17.                 legend.position=legendPosition,
  18.                 legend.direction = legendDirection,
  19.                 # strip.background = element_rect(fill = '#005FAD'),
  20.                 # strip.text.x = element_text(colour = 'white', angle = 0, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
  21.                 # strip.text.y = element_text(colour = 'white', angle = stripYAngle, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
  22.                 strip.background = element_rect(fill = 'white'),
  23.                 strip.text.x = element_text(colour = stripTextColor, angle = 0, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
  24.                 strip.text.y = element_text(colour = stripTextColor, angle = stripYAngle, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
  25.                 axis.text.x = element_text(colour=axisTextColor, size = axisTextSize, angle = axisXAngle, vjust = xAxisAdj,hjust = xAxisAdj),
  26.                 axis.text.y = element_text(colour=axisTextColor, size = axisTextSize),
  27.                 axis.title.y=element_text(vjust=1.6, size = axisTextSize,colour=axisTextColor),
  28.                 axis.title.x=element_text(vjust=0.2, size = axisTextSize,colour=axisTextColor),
  29.                 # axis.line = element_line(colour = axisLineColor),
  30.                 axis.line.x = element_line(color = axisLineColor, size = axisLineSize),
  31.                 axis.line.y = element_line(color = axisLineColor, size = axisLineSize),
  32.                 plot.title = element_text(vjust=1.4, size = titleTextSize),
  33.                 # axis.ticks.x = element_blank(),
  34.                 # axis.ticks.y = element_blank(),
  35.                 axis.ticks.x = element_line(color = tickColor, size = tickWidth),
  36.                 axis.ticks.y = element_line(color = tickColor, size = tickWidth),
  37.                 axis.ticks.length=unit(tickSize,"cm"),
  38.                 panel.grid.major = element_line(color = gridMajorColor),
  39.                 panel.grid.minor = element_line(color = gridMinorColor),
  40.                 panel.border = element_rect(fill = NA,colour = borderColor),
  41.                 panel.spacing=unit(1 , "lines"))
  42. }
  43. ggCustomColor <- function(palette="Set1",colourCount = 9,revList=FALSE,...){
  44.         # Pastel1 also option
  45.         if(colourCount<3){colourCount=3;}
  46.         getPalette = colorRampPalette(brewer.pal(colourCount, palette))
  47.         colorValues = getPalette(colourCount)
  48.         if(revList==TRUE){
  49.                 colorValues = rev(colorValues)
  50.         }
  51.         return(scale_colour_manual(values = colorValues))
  52. }
  53. ggCustomColorContinuous <- function(midpointH=0,lowColor="white",midColor="gray",highColor="red",...){
  54.         return(scale_colour_gradient2(low=lowColor, mid=midColor, high=highColor,midpoint=midpointH,limits = c(1,13),breaks=seq(1,13,1)))
  55. }

-biafra
bahanonu [at] alum.mit.edu

©2006-2024 | Site created & coded by Biafra Ahanonu | Updated 17 April 2024
biafra ahanonu