Retrieving and Visualizing COVID-19 Statistics from Local Health Authority Website

Motivation of this Work:

The Ohio Department of Health provides daily updates on COVID-19 data such as total confirmed cases, total hospitalizations, and total deaths, but getting the daily new case count (newly confirmed, hospitalized, and dead) is a bit tricky. So, I created my own version of daily case count tracking.

Also, in some cases, the department updates the daily case count data a few days later. This has made me hesitate over how much trust I should place in the “latest” update. Therefore, to track the discrepancies, I decided to add a function that saves the data each time I run the code. This way, I can make comparisons later to see if there are any patterns in the “late updates”.

Primary Outcome: Visualized Daily New Case Count

Comparison of Daily Confirmed Cases among Four Major Cities in Ohio, USA

A detailed breakdown of what my code can do:

  • Retrieving daily update data from the Ohio Department of Health Website
  • Producing daily confirmed case data from the raw data
  • Charting the daily confirmed case data graphically
  • Visualizing the trend (lowess smoothing)
  • Saving the daily confirmed case data into a .csv file for later comparison.
  • Run the code any time to get the results!

Code Block:

clear all

//retrieving the data
import delimited ""

//capturing the date and the time of the data
local c_date = c(current_date)
local c_time = c(current_time)

drop if _n == 1

//manipulating variable names and columns to obtain the daily count
rename v1 county
label variable county "county"

rename v2 sex
label variable sex "sex"
encode(sex), gen(gender)

rename v3 ageRange
label variable ageRange "Age range"

gen onSetDate = date(v4, "MDY")
format onSetDate %tdnn/dd
label variable onSetDate "Onset Date"

gen dateDeath = date(v5, "MDY")
format dateDeath %tdnn/dd
label variable dateDeath "Date of Death"

gen admissionDate = date(v6, "MDY")
format admissionDate %tdnn/dd
label variable admissionDate "Admission Date"

rename v7 caseCount
destring(caseCount), replace force
label variable caseCount "Case Count"

rename v8 deathCount
destring(deathCount), replace force
label variable deathCount "Death Count"

rename v9 hospitalizeCount
destring(hospitalizeCount), replace force
label variable hospitalizeCount "Hospitalize Count"

//Generating daily confirmed case count
collapse (sum)caseCount (sum)deathCount (sum)hospitalizeCount, by(onSetDate county)

//Charting four major counties 

twoway (scatter caseCount onSetDate if county == "Cuyahoga", msize(0.5)) (lowess caseCount onSetDate if county == "Cuyahoga"), name(Cuyahoga, replace) legend(off) xtitle("") ytitle("") yline(5, lcolor(green)) title("Cuyahoga County (Cleveland)") note("Population: 1.24 million")

twoway (scatter caseCount onSetDate if county == "Franklin" & caseCount <= 100, msize(0.5)) (lowess caseCount onSetDate if county == "Franklin"), name(Franklin, replace) legend(off) xtitle("") ytitle("") yline(5, lcolor(green)) title("Franklin County (Columbus)") note("Population: 0.89 million; Some dates in late April to early May where the case count " "was higher than 100 are omitted to keep the graph scale consistent.")

twoway (scatter caseCount onSetDate if county == "Hamilton", msize(0.5)) (lowess caseCount onSetDate if county == "Hamilton"), name(Hamilton, replace) legend(off) xtitle("") ytitle("") yline(5, lcolor(green)) title("Hamilton County (Cincinnati)") note("Population: 0.82 million")

twoway (scatter caseCount onSetDate if county == "Lucas", msize(0.5)) (lowess caseCount onSetDate if county == "Lucas"), name(Lucas, replace) legend(off) xtitle("") ytitle("") yline(5, lcolor(green)) title("Lucas County (Toledo)") note("Population: 0.43 million")

graph combine Cuyahoga Franklin Hamilton Lucas, col(1) xcommon ycommon ysize(2) xsize(1) note("Data Source: Ohio Department of Health; Green Line = 5 Cases") title("COVID-19 Daily New Case Count in OH") subtitle("Updated on $S_DATE" )

//Saving the data using the date and time of the data
local c_date = c(current_date)
local c_time = c(current_time)

local c_time_date = "`c_date'"+"_" +"`c_time'"

local time_string = subinstr("`c_time_date'", ":", "_", .)
local time_string = subinstr("`time_string'", " ", "_", .)
display "`time_string'"

local folderPath = "D:/OHIO_COVID_DAILY_CONFIRMED_"
local fileName = "`folderPath'" + "`time_string'"
display "`fileName'"

export delimited using `fileName'.csv, replace
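
For readers who do not use Stata, here is a minimal Python sketch of two steps from the code above: summing the counts per onset date and county (the `collapse (sum)` step) and building a file name from the current date and time with `:` and spaces replaced by `_`. The function names and the sample records are hypothetical and only illustrate the idea; this is not a port of the full script.

```python
from collections import defaultdict
from datetime import datetime


def collapse_daily(rows):
    """Sum case counts per (onset date, county), similar to collapse (sum)."""
    totals = defaultdict(int)
    for onset_date, county, case_count in rows:
        totals[(onset_date, county)] += case_count
    return dict(totals)


def timestamped_name(prefix, now):
    """Build a .csv file name from a prefix plus a date-time stamp."""
    stamp = now.strftime("%d %b %Y_%H:%M:%S")
    # Same cleanup as the subinstr() calls: ':' and ' ' become '_'
    stamp = stamp.replace(":", "_").replace(" ", "_")
    return f"{prefix}{stamp}.csv"


# Hypothetical line-level records: (onset date, county, case count)
rows = [
    ("2020-05-01", "Lucas", 1),
    ("2020-05-01", "Lucas", 2),
    ("2020-05-01", "Franklin", 4),
    ("2020-05-02", "Lucas", 1),
]
daily = collapse_daily(rows)
name = timestamped_name(
    "OHIO_COVID_DAILY_CONFIRMED_", datetime(2020, 5, 14, 9, 30, 0)
)
print(daily[("2020-05-01", "Lucas")])
print(name)
```

Keeping the time stamp in the file name means every run produces a separate snapshot, which is what makes the later comparison of "late updates" possible.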

Installing the latest R and RStudio on a Chromebook

Note 1) This instruction is based on plus my own trial and error experience. His/her instruction was awesome other than a couple of minor typos that would cause some confusion during the installation. 2) I assume that you are planning to use the Linux environment that came with your Chromebook.

Installation environment

  • Chromebook: HP Chromebook 15.6 (4GB RAM)
  • ChromeOS: Version 80.0.3987.128 (Official Build) (64-bit)
  • Linux Distribution: Debian 9
    • To turn on the Linux environment in your Chromebook, please refer to this link.

Linux Code – Installing R

sudo apt search r-base | grep ^r-base
sudo apt install -y gnupg2
sudo apt-key adv --keyserver --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'
sudo vi /etc/apt/sources.list

The command above opens a file; add the line shown below at the end of it (the original instruction omitted the ‘/’ at the end; I am not sure whether that is a typo or because he/she is using a different system). When I left out the trailing ‘/’, apt reported an error and did not recognize the source in subsequent use.

This line adds the CRAN repository to the existing pool of repositories. In case you have never used the vi text editor:
1) ‘shift+g’ will take you to the last line;
2) ‘o’ will add a new line and put the editor into insert mode;
3) ‘esc’ will exit insert mode;
4) ‘:wq’ will save and exit.

deb stretch-cran35/
sudo apt update
sudo apt upgrade

Finally, install R base.

sudo apt install -y r-base r-base-dev

Linux Code – Installing RStudio

Download the latest RStudio package from and save it as rstudio.deb. You can replace the URL with the latest one from the website.

curl -o rstudio.deb

Then, install the package.

sudo dpkg -i rstudio.deb

I encountered a problem after the installation. Specifically, when I clicked on the icon, RStudio would not start and a loading circle kept spinning over the RStudio icon. When I tried to launch it from the terminal, the terminal showed the following message.

#This part is an error message, not code. 
rstudio: error while loading shared libraries: cannot open shared object file: No such file or directory

The problem is fixed by the following line of code.

sudo apt install libnss3


Screenshot: RStudio on my Chromebook

Sampling Distribution and Testing Hypothesis

I developed this lecture note in spring 2020, using R Markdown for the first time. It supports compiling R, Markdown, and LaTeX code at the same time! I was really impressed.

Markdown Code:

Chapter 6. Statistical Techniques in Quality Management
author: Z. Wen (OSCM 3340, Spring 2020)
date: 02/12/2020, Thursday
autosize: true
font-family: 'Fira Sans', sans-serif
width: 1440
height: 900

Learning Objectives

- Review of Sampling Distribution
- Confidence Interval
- Testing Hypothesis
- Various Distributions
- Sample Size Determination

Conceptually, the following relationship holds in any type of estimation. 
$$ \theta = \hat{\theta} + M.E.$$

In quality management, we are often interested in the process mean. (e.g., Do our machines need alignment?)
$$ \mu = \bar{x} + M.E. $$

We are also interested in the process standard deviation. (e.g., Do our machines need calibration?)
$$ \sigma = s + M.E. $$

Margin of Error

**Since a lot of times we have information about $\bar{x}$ and $s$, we need to develop our knowledge on $M.E.$**

There are three components in developing the confidence interval 
- Your level of confidence  ($1-\alpha$)
- Sample size  ($n$)
- Best estimate of the population s.d.  ($\sigma$ or $s$, whichever available)

Here is the formal relationship of these three in forming the margin of error: 
$$M.E. = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

Confidence Interval 

**With the knowledge of M.E., we can construct an interval where the true parameter is located with $1-\alpha$ level of confidence.**

$$C.I. = \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

There are many different types of variations. For example, 
$$C.I. = \bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}$$
$$C.I. = \bar{p} \pm z_{\alpha/2}\sqrt{\frac{\bar{p}*(1-\bar{p})}{n}}$$

And many more... $C.I.$ for F distribution, $\chi^2$ distribution, etc. **As long as it is an estimation result, you will always see the reporting of $C.I.$**

Confidence Interval (Example)

See the following calculated examples: 

| Level of $\alpha$ | $n$ | $z$ | $\sigma$ | M.E. | $\bar{x}$ | C.I. | Interval Length |
|---|---|---|---|---|---|---|---|
| 0.01 | 100 | 2.58 | 4 | 1.03 | 34 | [32.97, 35.03] | 2.06 |
| 0.05 | 100 | 1.96 | 4 | 0.78 | 34 | [33.22, 34.78] | 1.57 |
| 0.1 | 100 | 1.64 | 4 | 0.66 | 34 | [33.34, 34.66] | 1.32 |
| 0.01 | 64 | 2.58 | 4 | 1.29 | 34 | [32.71, 35.29] | 2.58 |
| 0.05 | 64 | 1.96 | 4 | 0.98 | 34 | [33.02, 34.98] | 1.96 |
| 0.1 | 64 | 1.64 | 4 | 0.82 | 34 | [33.18, 34.82] | 1.64 |

<small>Although the name could be confusing, Excel formula **=CONFIDENCE.NORM(alpha, standard_dev, size)** and **=CONFIDENCE.T(alpha, standard_dev, size)** will give you the **M.E.** value. </small>
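
As a cross-check outside Excel, the following Python sketch recomputes the margin of error and the interval for each row of the table using only the standard library. The function names are my own, and the critical value comes from the exact normal quantile rather than the rounded $z$ shown in the table, so the last digit can differ slightly.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(alpha, sigma, n):
    """Return z_{alpha/2} * sigma / sqrt(n) for a known population s.d."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma / sqrt(n)


def confidence_interval(x_bar, alpha, sigma, n):
    """Return (lower, upper) bounds of the 1 - alpha confidence interval."""
    me = margin_of_error(alpha, sigma, n)
    return (x_bar - me, x_bar + me)


# Same inputs as the table: x_bar = 34, sigma = 4
for n in (100, 64):
    for alpha in (0.01, 0.05, 0.1):
        me = margin_of_error(alpha, 4, n)
        low, high = confidence_interval(34, alpha, 4, n)
        print(f"alpha={alpha}, n={n}: M.E.={me:.2f}, C.I.=[{low:.2f}, {high:.2f}]")
```

The output shows the two patterns in the table: a smaller $\alpha$ widens the interval, and a larger $n$ narrows it.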

One Very Important Application (1/4) - Confirming Doubts
**How do you find out whether someone who had your total trust betrayed you?** 

*I am almost certain that he/she won't do that...* 

But what if he/she got caught doing that thing? Is he/she still trustworthy?

**You wholeheartedly believed the mean is 7. But what if your 95% confidence interval does not contain 7? Will you still believe the mean is 7?**

In this case, you either have to update your belief on the mean, or you must have encountered a rare chance event.

One Very Important Application (2/4)

**Example:** <br>
A cylinder manufacturer claims that their process mean is 12.5 mm. Historically, their process standard deviation was .08 mm and there is no reason to think that the s.d. has changed. Upon drawing a random sample of 25, the sample average was 12.22 mm. Please test the claim at the 5% significance level. 

1. First, please use a visual aid to determine the answer. <br>
2. Please use a formal approach to determine the probability of obtaining such sample average, given the true mean is 12.5 mm. <br>
3. What is the conclusion? 

One Very Important Application (3/4)
If what the company claims is true, this will be the distribution of the population. 
- Population mean $\mu = 12.5$
- Population standard deviation $\sigma = 0.08$

![plot of chunk unnamed-chunk-1](Confidence Interval PowerPoint-figure/unnamed-chunk-1-1.png)

If what the company claims is true, for $n=25$, we have... 
- Hypothesized mean of $\mu_0 = 12.5$
- Standard Error of $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{.08}{5} = 0.016$
- 95% C.I. $\bar{x} \pm ME = [12.19, 12.25]$

![plot of chunk unnamed-chunk-2](Confidence Interval PowerPoint-figure/unnamed-chunk-2-1.png)

Formal Hypothesis Testing (4/4)

**Hypothesis testing** re-defined: 
- It is a process of determining a likelihood of surprise for a given sample result.
- How likely is it that we would obtain a sample mean like this one, given that the claim is true?

Formal hypothesis: <br>
- $H_0: \mu = 12.5$ and $H_a: \mu \neq 12.5$ 

Determination of the Likelihood (Excel Formula): 
- $p-value = norm.dist(12.22, 12.5, 0.016, TRUE) = 7.16E-69 \approx 0$ (this is the lower-tail probability; doubling it for the two-sided test still gives $\approx 0$)

**Verdict: It is very unlikely that the true mean is 12.5**

Any idea about the true mean: 
- All we know is that it is very unlikely to be 12.5. With 95% confidence, we can say it is within [12.19, 12.25]
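
The same calculation can be sketched in Python with the standard library. The variable names are my own; the numbers are the ones used in this example ($\mu_0 = 12.5$, $\sigma = 0.08$, $n = 25$, $\bar{x} = 12.22$). The sketch uses `math.erfc` because the tail probability is so small that a naive `1 - cdf` style calculation may round it to zero.

```python
from math import erfc, sqrt

mu_0 = 12.5      # claimed process mean (mm)
sigma = 0.08     # historical process standard deviation (mm)
n = 25           # sample size used for the standard error
x_bar = 12.22    # observed sample average (mm)

standard_error = sigma / sqrt(n)
z = (x_bar - mu_0) / standard_error

# Lower-tail probability P(Z <= z); erfc keeps precision for extreme z values
p_lower = 0.5 * erfc(-z / sqrt(2))
# x_bar falls below mu_0, so the two-sided p-value doubles the lower tail
p_two_sided = 2 * p_lower

# 95% confidence interval around the sample average
margin = 1.96 * standard_error
interval = (x_bar - margin, x_bar + margin)

print(f"z = {z:.2f}")
print(f"lower-tail p = {p_lower:.3g}, two-sided p = {p_two_sided:.3g}")
print(f"95% C.I. = [{interval[0]:.2f}, {interval[1]:.2f}]")
```

A sample average 17.5 standard errors below the claimed mean is far beyond anything a 5% tolerance would accept, which is why the verdict does not depend on whether one tail or two tails are counted.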

Another Example - Two Group Mean Testing (Exercise)

**Case**: Please determine whether the following two hospitals have the same quality rating.
- Data URL: [Download Data]( <small>(*UTAD ID/PW Needed for Access*) </small>

<img src="Confidence Interval PowerPoint-figure/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" width="1000" height="600" />

Formalizing Two Group Mean Test
A visual inspection of the confidence interval seems to be arguing that the means are not the same. Now, we formalize the test. 
- The mean difference $d = \bar{x}_A - \bar{x}_B$ is known to follow a **T distribution** when both $\sigma_A$ and $\sigma_B$ are unknown.
- T distribution gives a wider interval estimate than Z distribution as long as the degrees of freedom (*formula omitted*) are less than 120.
- A **formal hypothesis**: $H_0: \mu_A - \mu_B = 0$ and $H_a: \mu_A - \mu_B \neq 0$
- Construct a hypothesized distribution with the mean of $0$ and the standard error of $\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}$
- Then draw samples from each group and calculate the mean difference $d$ to determine how likely (through $p-value$) it is that one can obtain such a result when no difference is assumed.
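
The steps above can be sketched in Python as follows. The ratings are made-up numbers for illustration only (they are not the hospital data from the exercise), and the degrees of freedom use the Welch-Satterthwaite approximation, which is my assumption since the slides omit the formula.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical quality ratings, made up for illustration only
hospital_a = [4.1, 3.8, 4.4, 4.0, 4.3, 3.9, 4.2, 4.5]
hospital_b = [3.5, 3.9, 3.2, 3.6, 3.8, 3.4, 3.7, 3.3]


def welch_statistic(a, b):
    """Return (d, standard error, t, Welch-Satterthwaite degrees of freedom)."""
    var_a = variance(a) / len(a)   # s_A^2 / n_A
    var_b = variance(b) / len(b)   # s_B^2 / n_B
    d = mean(a) - mean(b)
    se = sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return d, se, d / se, df


d, se, t, df = welch_statistic(hospital_a, hospital_b)
print(f"d = {d:.3f}, SE = {se:.3f}, t = {t:.2f}, df = {df:.1f}")
```

The resulting t and df are the two inputs a T distribution lookup needs to turn the mean difference into a $p-value$.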

[Tool from ArtofStat (Two Mean Test)]( 

Excel Solution for Two Group T-Test

**Excel Data Analysis ToolPak**

![Analysis ToolPak in Excel](

Also as Excel Formulae

$=T.DIST.2T(X, DF)$


**Side of the Test**

![Which side are you testing?](

How would you define a surprise? 
- Too big is a surprise (Right-Tail)
- Too small is a surprise (Left-Tail)
- Too big or too small both (Two Tail)

Overview of Distributions and Their Usages

Different test statistics assume different statistical distributions. Here is what is relevant to our current context. 


|                              | Testing for One Group | Comparing Two Groups |
|------------------------------|-----------------------|----------------------|
| Mean                         | Z distribution        | Z distribution       |
| Mean (When $\sigma$ unknown) | T distribution        | T distribution       |
| Variance                     | $\chi^2$ distribution | F distribution       |


In the quality context, we wish to know
- Whether the machine needs alignment? (Parts are even, but missing the target)
- Whether the machine needs re-calibration? (Generating uneven parts, but meets the target)

Overview of Distributions and Their Usages (cont.)

Other types of distributions and their potential usage: 


| Use Case                              | Distributions   | Variable Type |
|---------------------------------------|-----------------|---------------|
| Number of defective units per x units | Poisson         | Discrete      |
| Time factor                           | Exponential     | Continuous    |
| Number of successes per x trials      | Binomial        | Discrete      |
| Sampling without replacement          | Hyper-geometric | Discrete      |


Many times, statistical distribution can be used to establish important baselines for the estimation. 

**Now, back to the test of variance**

Test of Variance

Detecting whether the variance (or $\sigma^2$) of the process has changed provides important information
- About the machine condition
- About the accuracy of the mean estimate



<img src="Confidence Interval PowerPoint-figure/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" height="700" />


Test of Variance - Formalization (1/2) 

**One group variance testing statistics** 

$$ \chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$

It follows a $\chi^2$ distribution with $n-1$ degrees of freedom.
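
A minimal Python sketch of this statistic is shown below. The sample diameters are hypothetical, and $\sigma_0 = 0.08$ is borrowed from the cylinder example only as an illustrative hypothesized value.

```python
from statistics import variance


def chi_square_statistic(sample, sigma_0):
    """Return (n - 1) * s^2 / sigma_0^2 for a one-group variance test."""
    n = len(sample)
    return (n - 1) * variance(sample) / sigma_0 ** 2


# Hypothetical diameters (mm), made up for illustration only
sample = [12.41, 12.55, 12.48, 12.62, 12.39, 12.51, 12.58, 12.44, 12.53]
chi2 = chi_square_statistic(sample, sigma_0=0.08)
df = len(sample) - 1
print(f"chi-square = {chi2:.2f} with {df} degrees of freedom")
```

A statistic close to the degrees of freedom is what one would expect when the variance has not changed; a much larger value suggests the process has become more variable.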



Shape of $\chi^2$ Distribution


$\chi^2$ distribution is useful when determining whether an observed pattern follows the expected pattern. 


Test of Variance - Formalization (2/2) 

**Two group variance testing statistics**

$$F = \frac{s_1^2}{s_2^2}$$

It follows an $F$ distribution with $n_1 - 1$ degrees of freedom for the numerator and $n_2 - 1$ degrees of freedom for the denominator.

$F$ distribution is a distribution of ratio. 
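
As a small illustration, the ratio of two sample variances and its two degrees of freedom can be computed in Python like this. The two samples are hypothetical measurements, made up for this sketch.

```python
from statistics import variance


def f_statistic(sample_1, sample_2):
    """Return (F, numerator df, denominator df) for a two-group variance test."""
    f = variance(sample_1) / variance(sample_2)
    return f, len(sample_1) - 1, len(sample_2) - 1


# Hypothetical measurements from two machines, made up for illustration only
machine_1 = [10.2, 9.8, 10.5, 9.6, 10.4, 9.9, 10.1]
machine_2 = [10.0, 10.1, 9.9, 10.2, 10.0, 9.8, 10.1, 10.0]

f, df1, df2 = f_statistic(machine_1, machine_2)
print(f"F = {f:.2f} with ({df1}, {df2}) degrees of freedom")
```

Because $F$ is a ratio, swapping the two groups gives the reciprocal value with the degrees of freedom swapped as well, so it matters which group goes in the numerator.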


Shape of $F$ Distribution



Test of Variance - Excel Solutions


Excel Formulae:

$=F.Dist(F, DF1, DF2, TRUE)$
$=F.Dist.RT(F, DF1, DF2)$
$=F.Test(Range1, Range2)$

One More Application of Confidence Interval
Sometimes, for budgetary reasons, we need to calculate the sample size before we conduct the sampling. See if you can answer these two questions. 

**Example:** <br>
A manager wants to ensure that whenever he rejects a shipment, the chance of making a mistake is no more than 5%. Historically, the supplier's process had a very stable standard deviation of .5 mm. He believes 0.02 mm could serve as a meaningful margin of error. What sample size should he choose? That is, 

$$ 0.02 = 1.96 * \frac{.5}{\sqrt{n}}$$
Q: What is this n? 
**Solution:** To get the sample size: 
$$ 0.02 = 1.96 * \frac{.5}{\sqrt{n}}$$
$$ n = \left(\frac{1.96 * .5}{.02}\right)^2 = 2401$$
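
The rearrangement $n = (z_{\alpha/2}\,\sigma / M.E.)^2$ can be checked with a short Python sketch. The function name is my own; it uses the exact normal quantile instead of the rounded 1.96 and rounds up, since a sample size must be a whole number.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(alpha, sigma, margin):
    """Smallest n whose margin of error does not exceed the target margin."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z * sigma / margin) ** 2)


n = required_sample_size(alpha=0.05, sigma=0.5, margin=0.02)
print(n)  # 2401
```

Note how quickly the requirement grows: halving the margin of error roughly quadruples the sample size, because $n$ depends on the square of $1/M.E.$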

Sample Size Solution