Wary Statistics #1: The tale of CDC mortality

in statistics on 06 April 2020

Summary

I will briefly discuss properly interpreting data you might see in the mainstream or on social media. The takeaway: if recent data for some measure (e.g. pneumonia deaths) from this year looks to be different than prior years, make sure to check that it is not an artifact of data collection or compilation.

Figure 1. CDC statistics by what week they were reported. Shows a continuous revision of each statistic upward and reveals why people should be extremely cautious about interpreting recent CDC (or other) data that is subject to change.

One of the many analysis mindsets you learn as a scientist, or after having analyzed enough data, is to be immediately suspicious of unusual data (especially if it appears to confirm a hypothesis) and check that it has not come about due to some quirk/artifact in the data analysis, compilation, or collection process. This is a brief post about interpreting trends in recent CDC data in light of current world events.

Readers can find code and data for the plots in this post at end or at

https://github.com/bahanonu/cdc_statistics.

The code automatically downloads the data.

Raw data from CDC can be found at https://www.cdc.gov/flu/weekly/#S2.

In experimental science we often have the luxury of setting up experiments to control for variables that are causing trends unrelated to the variable(s) we think are causative. For example, as a systems neuroscientist, this often comes about by checking that animals are not doing something that is not immediately being measured after an experimental manipulation. We then either create devices to quantitatively measure the variable (e.g. accelerometers for movement) so it can be integrated into downstream analysis or by designing control experiments. For example, see our Science paper from last year (Corder*, Ahanonu*, 2019) where we quantitatively measured general locomotion (video tracking) and stimulus-induced movement (with an accelerometer) to make sure that general locomotion was not the dominate contributor to neural activity in the amygdala after delivery of noxious (painful) stimuli—as opposed to the prominent movement-related activity in striatum, see our Nature paper ( Parker*, Marshall*, 2018). However, this luxury is not often the case when analyzing epidemiological or similar data outside clinical or other studies.

Figure 2. 2013-2020 CDC reporting for various morality statistics. There is a prominent decline in the current year (red box) that one should interpret with caution, as it is likely an artifact of CDC data collection and reporting.

The many lockdowns in place throughout the United States (and the world), owing to SARS-CoV-2 continuing to spread and many people developing COVID-19, has led to hypotheses about whether having less overall civilian activity would also reduce overall or flu-related mortality (e.g. less car crashes since people are driving less, reduced spread of seasonal influenza due to physical distancing ("social distancing" is a terrible and inaccurate phrase), etc.). To get a quick sense of whether this was true, I downloaded the weekly CDC mortality data that also includes influenza- and pneumonia-related morality. As can be seen in Fig. 2, there looks to be a dramatic drop in deaths reported around when many states and local governments began implementing lockdown (or shelter-at-home) orders. However, the dip was so dramatic that I became immediately suspicious...

Figure 3. 2019 CDC mortality statistics by 2020 reporting week. Using the 2020 weekly CDC archive mortality releases to plot 2019 mortality statistics also shows dramatic revising of death statistics upwards for prior weeks (see end of the year near yellow bracket).

To test my suspicions, I quickly wrote a R script that would download all the archived weekly releases (going back to the beginning of January 2020) from the CDC for the current year (2020). When I then plot the Fig. 2 measures (see Fig. 1), one immediately notices that the CDC continually revises their data (as they state on their website) and that the trend almost always appears to be in an upward direction (and can be quite significant). In fact, if one was to download data on any given week, one would see the dip seen in Fig. 2 occur. We can see this quite dramatically if we use the same 2020 weekly CDC data series for this year but plot 2019 instead of 2020 (see Fig. 3). Once we do this, we see that "All Deaths" near the end of 2019 are revised upward quite a bit with each successive weekly release of CDC data in 2020.

This only gives a very brief introduction on one aspect of data analysis: how to quickly check whether data that appears to confirm a hypothesis is in fact coming from an artifact in the data collection, compilation, or processing. For a similar look at how to interpret statistics one might see in the media, see my old post Sochi Olympic stats: medal count. Look out for future post related to statistics and data analysis. I hope everyone is doing well and staying safe during these tumultuous times.

Note: None of these graphs force the ordinate to start at zero as it is not relevant for, nor does it dramatically skew, the interpretation of the data as it relates to the conclusions being drawn. See Edward Tufte's take.

Below is the data used and script run to do the analysis.

download cdc_mortality.R

R / S+
# Biafra Ahanonu
# started:2020.04.06
# Script to download CDC total mortality and pneumonia/influenza statistics and plot the total per year along with showing the increased deaths as a function of reporting week
# Run by opening R or RStudio
# changelog
        #
# TODO
        #
 
# =====================
# CONSTANTS, change these as needed
# Path on computer where data will be saved
savePath = '_downloads\\';
 
# URL to download CDC data. NOTE that %s near the end indicates where a dynamic string will be place to download each data series
baseUrl = 'https://www.cdc.gov/flu/weekly/weeklyarchives2019-2020/data/NCHSData%s.csv';
 
# Current week
currentWeek = as.numeric(strftime(c(Sys.Date()), format = "%V"))-2;
        # currentWeek = 13;
# Current year
currentYear = 2020;
columnsToRemove = c("Expected","Threshold")
 
# Size of text for ggplot
axisTextSize = 20
defaultTextSize = 15
 
# =====================
# Load dependencies
srcFileList = c('helper.ggplot_themes.R')
lapply(srcFileList,FUN=function(file){source(file)})
 
packagesFileList = c("reshape2", "ggplot2","grid","gridExtra","ggthemes","RColorBrewer")
lapply(packagesFileList,FUN=function(file){if(!require(file,character.only = TRUE)){install.packages(file,dep=TRUE)}})
 
# =====================
# Format of imported CDC table data
# [1] "Year"
# [2] "Week"
# [3] "Percent.of.Deaths.Due.to.Pneumonia.and.Influenza"
# [4] "Expected"
# [5] "Threshold"
# [6] "All.Deaths"
# [7] "Pneumonia.Deaths"
# [8] "Influenza.Deaths"
 
# =====================
# Download and load the CDC data
# Weeks to download
urlList = c(1:currentWeek)
nLinks = length(urlList)
 
# Check download directory exists
if(dir.exists(savePath)==FALSE){
        dir.create(savePath)
}
 
for (fileNo in c(1:nLinks)) {
        if(fileNo<10){
                strNo = paste0('0',fileNo)
        }else{
                strNo = fileNo
        }
 
        # Download CDC data
        fileURL = sprintf(baseUrl,strNo)
        destfile = paste0(savePath,sprintf('NCHSData%s.csv',strNo))
        # Check file exists, if not then download
        if(!file.exists(destfile)){
                print(paste0('Download: ',fileURL,' | ',destfile))
                download.file(fileURL,destfile=destfile,quiet=TRUE,cacheOK = FALSE,method="curl")
        }
 
        # Load downloaded data into RAM
        filePath = destfile
        print(paste0('Loading: ',filePath))
        tableData = read.table(filePath,sep=',',header=T);
        tableData$CollectionSeries = sprintf('NCHSData%s',strNo)
        tableData$CollectionNo = fileNo
        if(fileNo>1){
                tableDataAll = rbind(tableDataAll,tableData)
        }else{
                tableDataAll = tableData
        }
        if(fileNo==nLinks){
                tableDataLatest = tableData
        }
}
 
# =====================
# Convert table into long data format to allow easy faceting, ONLY use the current year
tableDataAll$Year = as.character(tableDataAll$Year)
        tableDataAllMelt = melt(tableDataAll[tableDataAll$Year==currentYear,],id.vars=c("Year","Week","CollectionSeries","CollectionNo"))
        tableDataAllMelt$variable = gsub('\\.',' ',tableDataAllMelt$variable)
        tableDataAllMelt$variable = gsub(' to ',' to\n',tableDataAllMelt$variable)
        # Remove rows not informative to most users
        tableDataAllMelt = tableDataAllMelt[!(tableDataAllMelt$variable %in% columnsToRemove),]
        # Adjust the default week by what is actually in each table
        currentWeekAdj = max(tableDataAllMelt$Week)
 
# =====================
# Plot CDC data as a function of week data reported
newplot = ggplot(tableDataAllMelt,aes(Week,value,color=CollectionNo,group=CollectionNo))+
        geom_vline(xintercept=seq(4,currentWeekAdj,4),color='gray')+
        geom_line(size=2)+
        geom_point(size=2)+
        xlab('Week (1 = start of year)')+
        ylab('')+
        labs(color='Week data reported')+
        scale_x_continuous(breaks=seq(1,currentWeekAdj,1))+
        # ggCustomColor(colourCount=14)+
        # ggCustomColorContinuous(lowColor="gray",midColor="blue",midpointH=currentWeek/2)+
        scale_colour_gradient2(low="gray", mid="blue", high="red",midpoint=currentWeek/2,limits = c(1,currentWeek),breaks=seq(1,currentWeek,1))+
        facet_wrap(~variable,scales="free_y")+
        ggtitle(sprintf('CDC total mortality by week data reported in %s | Gray lines indicate every 4 weeks',currentYear))+
        ggThemeBlank(axisTextSize=axisTextSize,defaultTextSize=defaultTextSize)
 
dev.new(width=20,height=5)
print(newplot)
 
# =====================
# PLOT STATISTICS FOR ALL YEARS
 
# Convert most recent table into long format
tableData = tableDataLatest
        tableData$Year = as.character(tableData$Year)
        tableDataMelt = melt(tableData,id.vars=c("Year","Week","CollectionSeries","CollectionNo"))
        tableDataMelt$variable = gsub('\\.',' ',tableDataMelt$variable)
        tableDataMelt$variable = gsub(' to ',' to\n',tableDataMelt$variable)
        # Remove rows not informative to most users
        tableDataMelt = tableDataMelt[!(tableDataMelt$variable %in% columnsToRemove),]
        # Adjust the default week by what is actually in each table
        currentWeekAdj = max(tableDataMelt$Week)
 
# Plot all years in CDC dataset by various statistics
newplot = ggplot(tableDataMelt,aes(Week,value,color=Year,group=Year))+
        geom_vline(xintercept=seq(4,currentWeekAdj,4),color='gray')+
        geom_line(size=2)+
        xlab('Week (1 = start of year)')+
        ylab('')+
        scale_x_continuous(breaks=seq(1,currentWeekAdj,6))+
        facet_wrap(~variable,scales="free_y")+
        ggCustomColor(colourCount=length(unique(tableDataLatest$Year))+1)+
        ggThemeBlank(axisTextSize=axisTextSize,defaultTextSize=defaultTextSize)
 
dev.new(width=20,height=5)
print(newplot)

download helper.ggplot_themes.R

R / S+
# Biafra Ahanonu
# Themes to add to ggplot
# started 2014.03.19
 
ggThemeBlank <- function(stripTextSize=20,stripYAngle = 270, axisTextSize = 25, defaultTextSize = 20, axisXAngle = 0, gridMajorColor = "transparent", gridMinorColor = "transparent", backgroundColor="white",plotBackgroundColor='white', borderColor = "transparent",xAxisAdj = 0.5, tickColor="black", tickSize = 0.5, tickWidth = 1, titleTextSize = 15, legendPosition="right", legendDirection="vertical",stripTextColor="black",axisTextColor="black", backgroundColorLine=NA,backgroundColorSize=1, axisLineColor="transparent", axisLineSize = 2){
        # font_import(pattern="[F/f]utura")
        # theme(text=element_text(size=16, family="Comic Sans MS"))
        # gridMajorColor = "#F0F0F0"
        theme(panel.background = element_rect(fill = backgroundColor, colour = NA),
                plot.background = element_rect(size=backgroundColorSize,linetype="solid",color=backgroundColorLine, fill = plotBackgroundColor),
                text = element_text(size=defaultTextSize),
                legend.text=element_text(size=defaultTextSize),
                legend.title=element_text(size=defaultTextSize),
                legend.key = element_blank(),
                legend.key.height=unit(1.5,"line"),
                legend.key.width=unit(1.5,"line"),
                legend.position=legendPosition,
                legend.direction = legendDirection,
                # strip.background = element_rect(fill = '#005FAD'),
                # strip.text.x = element_text(colour = 'white', angle = 0, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
                # strip.text.y = element_text(colour = 'white', angle = stripYAngle, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
                strip.background = element_rect(fill = 'white'),
                strip.text.x = element_text(colour = stripTextColor, angle = 0, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
                strip.text.y = element_text(colour = stripTextColor, angle = stripYAngle, size = stripTextSize, hjust = 0.5, vjust = 0.5, face = 'bold'),
                axis.text.x = element_text(colour=axisTextColor, size = axisTextSize, angle = axisXAngle, vjust = xAxisAdj,hjust = xAxisAdj),
                axis.text.y = element_text(colour=axisTextColor, size = axisTextSize),
                axis.title.y=element_text(vjust=1.6, size = axisTextSize,colour=axisTextColor),
                axis.title.x=element_text(vjust=0.2, size = axisTextSize,colour=axisTextColor),
                # axis.line = element_line(colour = axisLineColor),
                axis.line.x = element_line(color = axisLineColor, size = axisLineSize),
                axis.line.y = element_line(color = axisLineColor, size = axisLineSize),
                plot.title = element_text(vjust=1.4, size = titleTextSize),
                # axis.ticks.x = element_blank(),
                # axis.ticks.y = element_blank(),
                axis.ticks.x = element_line(color = tickColor, size = tickWidth),
                axis.ticks.y = element_line(color = tickColor, size = tickWidth),
                axis.ticks.length=unit(tickSize,"cm"),
                panel.grid.major = element_line(color = gridMajorColor),
                panel.grid.minor = element_line(color = gridMinorColor),
                panel.border = element_rect(fill = NA,colour = borderColor),
                panel.spacing=unit(1 , "lines"))
}
ggCustomColor <- function(palette="Set1",colourCount = 9,revList=FALSE,...){
        # Pastel1 also option
        if(colourCount<3){colourCount=3;}
        getPalette = colorRampPalette(brewer.pal(colourCount, palette))
        colorValues = getPalette(colourCount)
        if(revList==TRUE){
                colorValues = rev(colorValues)
        }
        return(scale_colour_manual(values = colorValues))
}
ggCustomColorContinuous <- function(midpointH=0,lowColor="white",midColor="gray",highColor="red",...){
        return(scale_colour_gradient2(low=lowColor, mid=midColor, high=highColor,midpoint=midpointH,limits = c(1,13),breaks=seq(1,13,1)))
}

-biafra

bahanonu [at] alum.mit.edu

global history of architecture
08 june 2013 | architecture

One of my favorite classes at MIT was 4.605 (Global History of Architecture), which explored various styles and themes found throughout dif[...]ferent buildings and other types of architecture from the beginning of civilization to the present. In the spirit of sharing my enthusiasm for the course i'll discuss some takeaways from, and have included a couple papers i wrote for, the class.

week 9-12 | scooting along
17 october 2012 | singapore

And here it seemed I'd forgotten to write about the last exhilarating weeks in Singapore. Fear not! Beautiful buildings, a Hong Kong speed-[...]run, and more food await the eager reader.

comp. neuro
30 november 2010 | science

Presentation I gave in 9.29, the computational neuroscience class at MIT. They attempted to build a model that could use the information ob[...]tained from recordings in the prefrontal cortex and other areas of a monkey brain to predict their behavior based on the 'state' their neuron groups were in.

the origin and evolution of life
01 january 2020 | origin and evolution of life

I am starting a new blog focused on the Origin and Evolution of Life. Here is the first post.[...]

Biafra Ahanonu, PhD

home

about

contact [at] bahanonu.com

stanford

linkden

github

goodreads

medium

twitter

publications

talks

ciatah

articles

graduate school resources

Stanford Biosciences Student Association

list of post tags

all articles - with pictures

all articles - text form

favorite posts

favorite short stories

short stories

spanish short stories

singapore

teaching

reading

current reading + ratings

full reviews

designs

neuroscience

blog

resources

technologies

abiogenesis

search

feeds

main website

brain initiative notes

next»

«previous

random!