Wary Statistics #1: The tale of CDC mortality

Summary

I will briefly discuss how to properly interpret data you might see in the mainstream media or on social media. The takeaway: if recent data for some measure (e.g. pneumonia deaths) from this year looks different from prior years, check that the difference is not an artifact of data collection or compilation.

Figure 1. CDC statistics by the week they were reported. Shows a continuous upward revision of each statistic and reveals why people should be extremely cautious when interpreting recent CDC (or other) data that is subject to change.

One of the many analysis mindsets you learn as a scientist, or after having analyzed enough data, is to be immediately suspicious of unusual data (especially if it appears to confirm a hypothesis) and to check that it has not come about due to some quirk or artifact in the data collection, compilation, or analysis process. This is a brief post about interpreting trends in recent CDC data in light of current world events.

Readers can find the code and data for the plots at the end of this post.

In experimental science we often have the luxury of designing experiments to control for variables that cause trends unrelated to the variable(s) we think are causative. As a systems neuroscientist, this often means checking that animals are not doing something after an experimental manipulation that is not being directly measured. We then either build devices to quantitatively measure the variable (e.g. accelerometers for movement) so it can be integrated into downstream analysis, or we design control experiments. For example, in our Science paper from last year (Corder*, Ahanonu*, 2019) we quantitatively measured general locomotion (video tracking) and stimulus-induced movement (with an accelerometer) to make sure that general locomotion was not the dominant contributor to neural activity in the amygdala after delivery of noxious (painful) stimuli, in contrast to the prominent movement-related activity in striatum (see our Nature paper, Parker*, Marshall*, 2018). However, we rarely have this luxury when analyzing epidemiological or similar data outside clinical or other controlled studies.

Figure 2. 2013-2020 CDC reporting for various mortality statistics. There is a prominent decline in the current year (red box) that should be interpreted with caution, as it is likely an artifact of CDC data collection and reporting.

The many lockdowns in place throughout the United States (and the world), owing to SARS-CoV-2 continuing to spread and many people developing COVID-19, have led to hypotheses about whether reduced overall civilian activity would also reduce overall or flu-related mortality (e.g. fewer car crashes since people are driving less, reduced spread of seasonal influenza due to physical distancing ("social distancing" is a terrible and inaccurate phrase), etc.). To get a quick sense of whether this was true, I downloaded the weekly CDC mortality data, which also includes influenza- and pneumonia-related mortality. As can be seen in Fig. 2, there appears to be a dramatic drop in reported deaths around the time many state and local governments began implementing lockdown (or shelter-at-home) orders. However, the dip was so dramatic that I became immediately suspicious...
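To reproduce a quick first look, a single archived weekly release can be pulled down and inspected. This is a minimal sketch, assuming the URL pattern given in the full script at the end of the post (the week number 14 here is just an example release; the download is guarded so the sketch degrades gracefully offline):

```r
# Minimal sketch: download one archived weekly CDC release and inspect it.
# The URL pattern matches baseUrl in the full script at the end of the post.
baseUrl = 'https://www.cdc.gov/flu/weekly/weeklyarchives2019-2020/data/NCHSData%s.csv'
weekNo = 14
fileURL = sprintf(baseUrl, formatC(weekNo, width = 2, flag = '0'))
destfile = sprintf('NCHSData%02d.csv', weekNo)

# Guarded download: skip silently if the file exists or the network is down
ok = tryCatch({
	if (!file.exists(destfile)) download.file(fileURL, destfile = destfile, quiet = TRUE)
	TRUE
}, error = function(e) FALSE)

if (ok && file.exists(destfile)) {
	tableData = read.csv(destfile)
	# The final weeks of 2020 show the apparent (artifactual) drop in All.Deaths
	print(tail(tableData[tableData$Year == 2020, c('Week', 'All.Deaths')]))
}
```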

Figure 3. 2019 CDC mortality statistics by 2020 reporting week. Using the 2020 weekly CDC archive mortality releases to plot 2019 mortality statistics also shows a dramatic upward revision of death statistics for prior weeks (see end of the year near the yellow bracket).

To test my suspicions, I quickly wrote an R script to download all the archived weekly releases (going back to the beginning of January 2020) from the CDC for the current year (2020). When I then plot the Fig. 2 measures by release (see Fig. 1), one immediately notices that the CDC continually revises its data (as it states on its website) and that the revisions are almost always upward (and can be quite significant). In fact, if one were to download data on any given week, one would see the dip from Fig. 2 appear. We can see this quite dramatically if we take the same 2020 weekly CDC data series but plot 2019 instead of 2020 (see Fig. 3). Once we do this, we see that "All Deaths" near the end of 2019 is revised upward quite a bit with each successive weekly release of CDC data in 2020.
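The upward-revision pattern in Fig. 1 can also be quantified directly by tracking a single week's count across successive releases. A minimal sketch with hypothetical numbers (NOT real CDC values; the real ones come from the tableDataAll data frame built in the script below):

```r
# Hypothetical illustration: "All Deaths" reported for one fixed week
# (e.g. week 12 of 2020) as it appears in four successive weekly releases.
allDeathsByRelease = c(52000, 54500, 55800, 56200)  # hypothetical, NOT real CDC values

# Change between consecutive releases; positive = revised upward
revisionDelta = diff(allDeathsByRelease)
print(revisionDelta)  # 2500 1300 400

# Total percent revision from the first to the latest release
pctRevised = 100 * (tail(allDeathsByRelease, 1) - allDeathsByRelease[1]) / allDeathsByRelease[1]
print(round(pctRevised, 1))  # 8.1
```

With real data, consistently positive deltas for recent weeks are the signature of the reporting-lag artifact seen in Fig. 1.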

This is only a very brief introduction to one aspect of data analysis: how to quickly check whether data that appears to confirm a hypothesis is in fact the result of an artifact in the data collection, compilation, or processing. For a similar look at how to interpret statistics one might see in the media, see my old post Sochi Olympic stats: medal count. Look out for future posts related to statistics and data analysis. I hope everyone is doing well and staying safe during these tumultuous times.

Note: None of these graphs force the ordinate to start at zero, as doing so is not relevant for, nor does it dramatically skew, the interpretation of the data as it relates to the conclusions being drawn. See Edward Tufte's take.
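For readers who nevertheless prefer zero-anchored axes, ggplot2 can include zero in the scale without clipping the data. A minimal sketch with hypothetical numbers (the `df` data frame is invented for illustration):

```r
# Hypothetical weekly counts, purely for illustration
df = data.frame(Week = 1:4, AllDeaths = c(55000, 56500, 55800, 57200))

if (requireNamespace("ggplot2", quietly = TRUE)) {
	library(ggplot2)
	p = ggplot(df, aes(Week, AllDeaths)) + geom_line()
	# expand_limits() adds zero to the ordinate; the data are unchanged
	pZero = p + expand_limits(y = 0)
}
```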

Below are the data used and the scripts run to do the analysis.

download cdc_mortality.R

R / S+
# Biafra Ahanonu
# started: 2020.04.06
# Script to download CDC total mortality and pneumonia/influenza statistics and plot the totals per year along with showing the increase in reported deaths as a function of reporting week
# Run by opening R or RStudio
# changelog
	#
# TODO
	#

# =====================
# CONSTANTS, change these as needed
# Path on computer where data will be saved
savePath = '_downloads\\';

# URL to download CDC data. NOTE that %s near the end indicates where a dynamic string will be placed to download each data series
baseUrl = 'https://www.cdc.gov/flu/weekly/weeklyarchives2019-2020/data/NCHSData%s.csv';

# Current week
currentWeek = as.numeric(strftime(c(Sys.Date()), format = "%V"))-2;
	# currentWeek = 13;
# Current year
currentYear = 2020;
columnsToRemove = c("Expected","Threshold")

# Size of text for ggplot
axisTextSize = 20
defaultTextSize = 15

# =====================
# Load dependencies
srcFileList = c('helper.ggplot_themes.R')
lapply(srcFileList,FUN=function(file){source(file)})

# Install missing packages, then load them
packagesFileList = c("reshape2","ggplot2","grid","gridExtra","ggthemes","RColorBrewer")
lapply(packagesFileList,FUN=function(file){if(!require(file,character.only = TRUE)){install.packages(file,dep=TRUE);library(file,character.only = TRUE)}})

# =====================
# Format of imported CDC table data
# [1] "Year"
# [2] "Week"
# [3] "Percent.of.Deaths.Due.to.Pneumonia.and.Influenza"
# [4] "Expected"
# [5] "Threshold"
# [6] "All.Deaths"
# [7] "Pneumonia.Deaths"
# [8] "Influenza.Deaths"

# =====================
# Download and load the CDC data
# Weeks to download
urlList = c(1:currentWeek)
nLinks = length(urlList)

# Check that the download directory exists, create if not
if(dir.exists(savePath)==FALSE){
	dir.create(savePath)
}

for (fileNo in c(1:nLinks)) {
	# Zero-pad single-digit release numbers to match the CDC file names
	if(fileNo<10){
		strNo = paste0('0',fileNo)
	}else{
		strNo = fileNo
	}

	# Download CDC data
	fileURL = sprintf(baseUrl,strNo)
	destfile = paste0(savePath,sprintf('NCHSData%s.csv',strNo))
	# Check whether the file exists, if not then download
	if(!file.exists(destfile)){
		print(paste0('Download: ',fileURL,' | ',destfile))
		download.file(fileURL,destfile=destfile,quiet=TRUE,cacheOK=FALSE,method="curl")
	}

	# Load downloaded data into RAM
	filePath = destfile
	print(paste0('Loading: ',filePath))
	tableData = read.table(filePath,sep=',',header=TRUE);
	tableData$CollectionSeries = sprintf('NCHSData%s',strNo)
	tableData$CollectionNo = fileNo
	if(fileNo>1){
		tableDataAll = rbind(tableDataAll,tableData)
	}else{
		tableDataAll = tableData
	}
	if(fileNo==nLinks){
		tableDataLatest = tableData
	}
}

# =====================
# Convert table into long data format to allow easy faceting, ONLY use the current year
tableDataAll$Year = as.character(tableDataAll$Year)
tableDataAllMelt = melt(tableDataAll[tableDataAll$Year==currentYear,],id.vars=c("Year","Week","CollectionSeries","CollectionNo"))
tableDataAllMelt$variable = gsub('\\.',' ',tableDataAllMelt$variable)
tableDataAllMelt$variable = gsub(' to ',' to\n',tableDataAllMelt$variable)
# Remove rows not informative to most users
tableDataAllMelt = tableDataAllMelt[!(tableDataAllMelt$variable %in% columnsToRemove),]
# Adjust the default week by what is actually in each table
currentWeekAdj = max(tableDataAllMelt$Week)

# =====================
# Plot CDC data as a function of the week the data were reported
newplot = ggplot(tableDataAllMelt,aes(Week,value,color=CollectionNo,group=CollectionNo))+
	geom_vline(xintercept=seq(4,currentWeekAdj,4),color='gray')+
	geom_line(size=2)+
	geom_point(size=2)+
	xlab('Week (1 = start of year)')+
	ylab('')+
	labs(color='Week data reported')+
	scale_x_continuous(breaks=seq(1,currentWeekAdj,1))+
	# ggCustomColor(colourCount=14)+
	# ggCustomColorContinuous(lowColor="gray",midColor="blue",midpointH=currentWeek/2)+
	scale_colour_gradient2(low="gray",mid="blue",high="red",midpoint=currentWeek/2,limits=c(1,currentWeek),breaks=seq(1,currentWeek,1))+
	facet_wrap(~variable,scales="free_y")+
	ggtitle(sprintf('CDC total mortality by week data reported in %s | Gray lines indicate every 4 weeks',currentYear))+
	ggThemeBlank(axisTextSize=axisTextSize,defaultTextSize=defaultTextSize)

dev.new(width=20,height=5)
print(newplot)

# =====================
# PLOT STATISTICS FOR ALL YEARS

# Convert the most recent table into long format
tableData = tableDataLatest
tableData$Year = as.character(tableData$Year)
tableDataMelt = melt(tableData,id.vars=c("Year","Week","CollectionSeries","CollectionNo"))
tableDataMelt$variable = gsub('\\.',' ',tableDataMelt$variable)
tableDataMelt$variable = gsub(' to ',' to\n',tableDataMelt$variable)
# Remove rows not informative to most users
tableDataMelt = tableDataMelt[!(tableDataMelt$variable %in% columnsToRemove),]
# Adjust the default week by what is actually in each table
currentWeekAdj = max(tableDataMelt$Week)

# Plot all years in the CDC dataset by various statistics
newplot = ggplot(tableDataMelt,aes(Week,value,color=Year,group=Year))+
	geom_vline(xintercept=seq(4,currentWeekAdj,4),color='gray')+
	geom_line(size=2)+
	xlab('Week (1 = start of year)')+
	ylab('')+
	scale_x_continuous(breaks=seq(1,currentWeekAdj,6))+
	facet_wrap(~variable,scales="free_y")+
	ggCustomColor(colourCount=length(unique(tableDataLatest$Year))+1)+
	ggThemeBlank(axisTextSize=axisTextSize,defaultTextSize=defaultTextSize)

dev.new(width=20,height=5)
print(newplot)
download helper.ggplot_themes.R

R / S+
# Biafra Ahanonu
# Themes to add to ggplot
# started 2014.03.19

ggThemeBlank <- function(stripTextSize=20,stripYAngle = 270, axisTextSize = 25, defaultTextSize = 20, axisXAngle = 0, gridMajorColor = "transparent", gridMinorColor = "transparent", backgroundColor="white",plotBackgroundColor='white', borderColor = "transparent",xAxisAdj = 0.5, tickColor="black", tickSize = 0.5, tickWidth = 1, titleTextSize = 15, legendPosition="right", legendDirection="vertical",stripTextColor="black",axisTextColor="black", backgroundColorLine=NA,backgroundColorSize=1, axisLineColor="transparent", axisLineSize = 2){
	# font_import(pattern="[F/f]utura")
	# theme(text=element_text(size=16, family="Comic Sans MS"))
	# gridMajorColor = "#F0F0F0"
	theme(panel.background = element_rect(fill = backgroundColor, colour = NA),
		plot.background = element_rect(size=backgroundColorSize,linetype="solid",color=backgroundColorLine, fill = plotBackgroundColor),
		text = element_text(size=defaultTextSize),
		legend.text=element_text(size=defaultTextSize),
		legend.title=element_text(size=defaultTextSize),
		legend.key = element_blank(),
		legend.key.height=unit(1.5,"line"),
		legend.key.width=unit(1.5,"line"),
		legend.position=legendPosition,
		legend.direction = legendDirection,
		# strip.background = element_rect(fill = '#005FAD'),
		# strip.text.x = element_text(colour = 'white', angle = 0, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
		# strip.text.y = element_text(colour = 'white', angle = stripYAngle, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
		strip.background = element_rect(fill = 'white'),
		strip.text.x = element_text(colour = stripTextColor, angle = 0, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
		strip.text.y = element_text(colour = stripTextColor, angle = stripYAngle, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
		axis.text.x = element_text(colour=axisTextColor, size = axisTextSize, angle = axisXAngle, vjust = xAxisAdj,hjust = xAxisAdj),
		axis.text.y = element_text(colour=axisTextColor, size = axisTextSize),
		axis.title.y=element_text(vjust=1.6, size = axisTextSize,colour=axisTextColor),
		axis.title.x=element_text(vjust=0.2, size = axisTextSize,colour=axisTextColor),
		# axis.line = element_line(colour = axisLineColor),
		axis.line.x = element_line(color = axisLineColor, size = axisLineSize),
		axis.line.y = element_line(color = axisLineColor, size = axisLineSize),
		plot.title = element_text(vjust=1.4, size = titleTextSize),
		# axis.ticks.x = element_blank(),
		# axis.ticks.y = element_blank(),
		axis.ticks.x = element_line(color = tickColor, size = tickWidth),
		axis.ticks.y = element_line(color = tickColor, size = tickWidth),
		axis.ticks.length=unit(tickSize,"cm"),
		panel.grid.major = element_line(color = gridMajorColor),
		panel.grid.minor = element_line(color = gridMinorColor),
		panel.border = element_rect(fill = NA,colour = borderColor),
		panel.spacing=unit(1,"lines"))
}
ggCustomColor <- function(palette="Set1",colourCount = 9,revList=FALSE,...){
	# Pastel1 is also an option
	if(colourCount<3){colourCount=3;}
	getPalette = colorRampPalette(brewer.pal(colourCount, palette))
	colorValues = getPalette(colourCount)
	if(revList==TRUE){
		colorValues = rev(colorValues)
	}
	return(scale_colour_manual(values = colorValues))
}
ggCustomColorContinuous <- function(midpointH=0,lowColor="white",midColor="gray",highColor="red",...){
	return(scale_colour_gradient2(low=lowColor, mid=midColor, high=highColor,midpoint=midpointH,limits = c(1,13),breaks=seq(1,13,1)))
}

-biafra
bahanonu [at] alum.mit.edu
