Methodology

Updated: Sept 5, 2020 1200 MDT


This page explains some of the methodology and reasoning behind how the other Covid pages are composed.

Positive tests vs deaths

Most reporting seen on the web or on TV shows positive tests. Unfortunately, positive tests are probably the worst metric to use, because testing varies so much (when it is done at all). Deaths, while not perfect, seem to be a better indicator. Deaths are, however, a lagging indicator: a death probably occurs 14 to 21 days after a positive test would have shown up, if testing were in fact done and evaluated immediately. The RT-PCR tests are taking up to 8 days to be evaluated. It is also unclear how they are then reported: by the date the positive is confirmed, or by the date the test was taken?
Deaths have their own issues as well, as some good reporting by the New York Times has shown. Early on, some deaths were not reported as Covid when in fact they were. Deaths outside the hospitals continue to produce what can be called "false negatives". Supposedly NY is now (as of April 9) requiring deaths outside the hospitals to be reported as Covid if that is what was suspected. If this is true, NY should see a bump, similar to when China changed its reporting criteria for positive tests at the beginning of the pandemic.

Data Repositories

The data has been compiled by either the New York Times (for the US) or Johns Hopkins (US and all countries). It is stored on GitHub and can be pulled out as a "csv" file, which can then be parsed in a Linux environment. Originally I started with just the Johns Hopkins data, but for the US I have moved to the NYT. This is because, at least according to their logs, the NYT will update prior dates with corrections if needed. The Johns Hopkins data for states is a new data set each day, just for that day, so an error found for, say, 3 days ago won't get corrected; at best the current day will carry a more accurate cumulative total. The JH data for countries is different: each country has its own row, and the last entry in the row is the current cumulative total. It may be that they do correct errors in that data.
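
As a rough illustration of parsing such a file, here is a minimal Python sketch. The column names (date, state, deaths) and the function names are assumptions made for this example, not a description of the scripts behind these pages. The two sample rows reuse the NYT Colorado totals quoted further down this page.

```python
import csv
import io

# Hypothetical sample in an assumed layout: one row per state per day,
# with a cumulative death count. The values are the NYT Colorado
# totals for 4/11 and 4/12 quoted on this page.
SAMPLE_CSV = """date,state,deaths
2020-04-11,Colorado,274
2020-04-12,Colorado,290
"""

def cumulative_deaths(csv_text, state):
    """Return [(date, cumulative_deaths)] for one state, sorted by date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["date"], int(r["deaths"])) for r in reader if r["state"] == state]
    return sorted(rows)

def daily_from_cumulative(series):
    """Turn cumulative totals into per-day increments (first day is dropped)."""
    daily = []
    previous = None
    for day, total in series:
        if previous is not None:
            daily.append((day, total - previous))
        previous = total
    return daily

series = cumulative_deaths(SAMPLE_CSV, "Colorado")
daily = daily_from_cumulative(series)
print(daily)
```

Note that a correction applied to an earlier date changes the increments on both sides of it, which is one reason a dataset that is revised after the fact can produce a different daily curve than one that is not.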

That being said, the two datasets disagree with each other and, in the case of Colorado, disagree with the state's "official" web site.
In addition, the Colorado website updates without a log file explaining the updates. Compare the two Colorado columns. There was a major update on April 10 for data through April 9.

The picture below shows a comparison (cells are highlighted when there is a delta from the previous day):



This table shows, for each date since the first death in Colorado, the cumulative number of deaths. Early on, JH and NYT had significant differences. More importantly, they both differ from the state itself. At around 1600 each afternoon, the state reports the previous day. The shaded values in the Colorado.gov columns indicate a change from the previous day. Each day someone is going back and scrubbing the numbers.

Starting at April 11, however, if you compare NYT 4/11 (274) to Colorado.gov 4/10 (274), then NYT 4/12 (290) to Colorado.gov 4/11 (290), and so on up, you see that NYT is in fact grabbing the latest number but just putting it one day later.
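
This kind of one-day offset can be checked mechanically. The sketch below is only an illustration: it uses the four values quoted above, and the function name and lag parameter are made up for this example.

```python
from datetime import date, timedelta

# Cumulative Colorado deaths as quoted in the text above.
nyt = {date(2020, 4, 11): 274, date(2020, 4, 12): 290}
colorado_gov = {date(2020, 4, 10): 274, date(2020, 4, 11): 290}

def matches_with_lag(later, earlier, lag_days=1):
    """Return the dates in `later` whose value equals the value
    that `earlier` reports `lag_days` before that date."""
    lag = timedelta(days=lag_days)
    return [d for d in sorted(later) if earlier.get(d - lag) == later[d]]

matched = matches_with_lag(nyt, colorado_gov)
print(matched)
```

If every NYT date from April 11 onward appears in the result, that supports the reading that the NYT figure is the state's figure shifted by one day.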

When you look at NYT vs the last column of the Colorado.gov data, you see quite a story:




Because the NYT keeps a cumulative count, the rolling 6 day average does not really look (as of April 15) like it has peaked. But the data from the Colorado.gov web site, which is revised daily, suggests that there was a peak around April 5 to 6 and that deaths have been declining since then. This is a danger of grabbing the datasets from either NYT or JH: they don't have the capability to go back in time (automatically), as is being done in Colorado.

Here is the Colorado data in animation:

Visualization

There are two types of visualization used on these pages. The original type simply reports the total number of deaths vs date, with the y axis on a log scale. Eventually, when the number of deaths stops increasing, the slope of the curve will go to zero (a horizontal line). As the slope starts to approach zero, this type of chart becomes less useful.
A better chart then is the daily death rate. This is what the second type of chart shows. For those, the death rate is per million inhabitants, with the population taken from Google. This data can be very noisy, so each day is plotted as the average of the current day and the five previous days. It is a straight average. That is what is done on the roll6 page.
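
The per-million rate and the straight six-day average can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the roll6 page: the function names are invented here, the input numbers are made up purely to show the arithmetic, and it assumes that no average is produced until six days of data exist.

```python
def per_million(daily_deaths, population):
    """Scale daily death counts to deaths per million inhabitants."""
    return [d * 1_000_000 / population for d in daily_deaths]

def rolling_average(values, window=6):
    """Straight (unweighted) average of the current day and the
    previous window-1 days. Assumption: the first window-1 days
    are skipped because they lack a full window."""
    averages = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        averages.append(sum(chunk) / window)
    return averages

# Made-up daily counts, only to illustrate the calculation.
daily = [6, 12, 18, 24, 30, 36, 42]
smoothed = rolling_average(daily)              # two values for seven days
rate = per_million(smoothed, population=1_000_000)
print(smoothed, rate)
```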


For New York State, however, I want to show the daily numbers vs a 2 day avg, 3 day ... through 6 day. You can see that the 6 day average does a good job of smoothing but also lags the daily.

What can be seen is twofold. First, the average is much less noisy than the daily. Second, because it averages over 6 days, the average will lag the daily by a few days.
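
The size of the lag can be estimated for an idealized case. The derivation below assumes the daily value follows a straight-line trend x_t = a + b t (the symbols a, b, and N are introduced here for illustration); real data is not linear, so this is only a rule of thumb.

```latex
\begin{align*}
\bar{x}_t &= \frac{1}{N}\sum_{k=0}^{N-1} x_{t-k}
  && \text{(trailing $N$-day straight average)} \\
          &= a + b\,t - \frac{b}{N}\sum_{k=0}^{N-1} k
  && \text{(assuming $x_t = a + b\,t$)} \\
          &= a + b\left(t - \frac{N-1}{2}\right)
           = x_{\,t-(N-1)/2}
\end{align*}
% For N = 6 the average equals the trend value from (6-1)/2 = 2.5 days earlier.
```

Under that assumption, a 6 day average trails the daily curve by about 2.5 days, and a 2 day average by about half a day.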

There are other ways to smooth the curve:
1) change the number of days averaged over
2) give the more current days a higher weight than the earlier days
3) fairly sophisticated Holt-Winters moving average filter. Mark Handley uses this method.
4) others ??
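
Option 2 could look something like the following Python sketch. The function name and the example weights are hypothetical, chosen only to show the idea; they are not taken from any of the sources mentioned on this page.

```python
def weighted_rolling_average(values, weights):
    """Trailing weighted average over len(weights) days.
    weights[-1] applies to the most recent day, weights[0] to the oldest."""
    window = len(weights)
    total_weight = sum(weights)
    averages = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        weighted_sum = sum(w * v for w, v in zip(weights, chunk))
        averages.append(weighted_sum / total_weight)
    return averages

# Hypothetical weights that favor recent days (oldest first).
recent_heavy = [1, 2, 3]
example = weighted_rolling_average([0, 0, 6], recent_heavy)

# Equal weights reduce to the straight average.
straight = weighted_rolling_average([3, 6, 9], [1, 1, 1])
print(example, straight)
```

Weighting recent days more heavily reduces the lag but also lets more of the daily noise through, so the choice is a trade-off.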

Overall, for the purposes of this visualization (not modelling), the 6 day rolling average originally proposed by Kevin Drum of Mother Jones does pretty well, I think.



Links back to other Covid pages maintained by Don

Main (Roll6) Covid Page
Cumulative Covid Page
Eagle vs Summit Covid Page