Logo

Modelling Statistics

1. Throw a fair coin

In [127]:
# simulate a fair coin - 2 dimensions
import numpy as np

reps = 20    # repetitions (i.e. coin flips)
res = []

for r in range(reps):
    res.append(np.random.randint(2))

print('The results of %d coin flips : ' %(len(res)) + str(res))
heads = sum(res)    # ones are heads
tails = reps - sum(res)
print('Heads: %0.f' %heads)
print('Tails: %0.f' %tails)
sample_mean = (sum(res)/float(reps))
print('The mean of the results is %0.2f' %sample_mean)
The results of 20 coin flips : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1]
Heads: 6
Tails: 14
The mean of the results is 0.30

1.1 Null-hypothesis

The null-hypothesis $H_0$ is an assumption about a population, which is put to the test. Often, it is an assumption in opposition to the actual working hypothesis (the argumentation of a devil's advocate).

Here, the null-hypothesis is: the coin is fair and thus the probability to fall on head is exactly 0.5.

1.2 Significance level

The significance level (often denoted by the Greek letter $\alpha$) is a probability threshold that determines when you reject the null hypothesis. Often $\alpha = 0.05$ (5%) is used to delimit significance.

1.3 $p$-value

The $p$-value is the probability of getting a certain result from an experiment if the null hypothesis is true. For example getting a result of 14 heads and 6 tails from 20 coin flips if the coin is fair yields a (two-sided) $p$-value of 0.115 (An alternative would be to consider a one-sided $p$-value, meaning that only heads (not heads and tails) are considered. In this case the $p$-value would be 0.058 (= 0.115/2)).

The $p$-value depends on the number of repetitions (reps) in the experiment.

If the $p$-value is lower than the significance level, the null hypothesis should be rejected in favor of an alternative hypothesis.

In [4]:
# p-value of the fair coin result
from scipy.special import binom

# throwing a fair coin yields a binomial distribution (it is a Bernoulli-process)
# the one-sided p-value thus is calculated with the sum of binomial coefficients 
# multiplied by the expected value to the power of the number of trials. two-sided its just doubled (head + tail)

sl = 0.05   # 5% significance level

if tails <= heads:
    x = heads
else:
    x = tails

P = []
for k in range(x, reps + 1):
    P.append(binom(reps, k))
    
p = 2 * sum(P) * 1 / (2**reps)
print('The two-sided p-value of getting a result of %d ones in %d coin flips is %0.3f' %(tails, reps, p))

# alternatively use the binom_test function from the scipy stats modul
#import scipy.stats as stats
#bintest = stats.binom_test(tails, reps, 0.5) # (two-sided test)
#print bintest

if p < sl:
    print('%0.3f is smaller than the significance level of %0.3f. The null hypothesis should be rejected.' %(p, sl))
else:
    print('%0.3f is greater than the significance level of %0.3f. Therefore the null hypothesis will not to be rejected.' %(p, sl))
The two-sided p-value of getting a result of 7 ones in 20 coin flips is 0.263
0.263 is greater than the significance level of 0.050. Therefore the null hypothesis will not to be rejected.
In [5]:
# plot a distribution of the results and indicate the significance level
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
# show plots in notebook
% matplotlib inline

# prepare plots
fig = plt.figure(figsize=(12,8))
fig.add_subplot(1,1,1)

sl = 0.05   # 5% significance level
reps = 20
runs = 1000
Ru = range(runs)

runres = []
P = []

for ru in Ru:
    res = []
    for r in range(reps):
        res.append(np.random.randint(2))
    # binomial test
    p = stats.binom_test(sum(res), reps, 0.5)        # (two-sided test)
    if p < sl:
        P.append(0)
    else:
        P.append(1)
    runres.append(np.mean(res))

c = 0   # counter
for n,x in enumerate(P):
    if x == 0:
        plt.plot(n, runres[n], 'ro')
        c += 1
    else:    
        plt.plot(n, runres[n], 'bo')

print('%d samples out of %d trials are outside of the significance level' %(c, runs))
plt.plot(Ru, [0.5]*runs, 'r--', linewidth=3, label = '0.5-line')
plt.title('Results of fair coin flips, red dots are outside of %0.f%% significance level' %(sl * 100))
plt.ylim(0, 1)
plt.grid()
plt.legend(loc = 'lower right')
39 samples out of 1000 trials are outside of the significance level
Out[5]:
<matplotlib.legend.Legend at 0x17d7b668>

2. Evaluating a model

Compare observed and simulated data. How well is simulated data fitting observed data? How good is your model?

In [116]:
# generate some data (from a model, but lets pretend this is observed empirical data)

# import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
# show plots in notebook
% matplotlib inline

# prepare drawing canvas and plot
fig = plt.figure(figsize=(15,5))
fig.add_subplot(1,2,1)

# growth rates
g = 1.2    # observed
sg = 1.5   # simulated

T = range(100)
dt = 0.1
K = 10.
sigma = 0.1   # strength of noise

# define list to store results, include initial value
OD = [1.]
SD = [1.]
# loop over time range
for t in T:
    odiff = (g * OD[t] * (1 - OD[t]/K)) * dt
    sdiff = (sg * SD[t] * (1 - SD[t]/K)) * dt
    noise = sigma * np.random.normal(0, 1) 
    OD.append(OD[t] + odiff + noise)
    SD.append(SD[t] + sdiff)

plt.plot(OD, 'bo', label = 'observed data')
plt.plot(SD, 'r.', label = 'simulated data')

plt.title('Observed and simulated population growth in time')
plt.xlabel('time')
plt.grid()
plt.ylim(1, 11)
plt.legend(loc = 'best')

fig.add_subplot(1,2,2)
plt.plot(OD, SD, 'go')
plt.plot(range(12), range(12), 'r-')
plt.xlim(1, 11)
plt.ylim(1, 11)
plt.xlabel('observed data')
plt.ylabel('simulated data')
plt.title('Observed versus simulated population growth')
plt.grid()

2a. Root mean squared deviation $ = \sqrt{\frac{\sum_{i=1}^n (O_i - S_i)^2}{n}}$

i.e. the square root of the average square of the difference between $O$ = observed and $S$ = simulated data.

RMSD expresses the total difference in the same units as used for the measurment

In [117]:
d = []
for n, od in enumerate(OD):
    d.append((od - SD[n])**2)
RMSD = np.sqrt(sum(d)/float(len(OD)))
print('The root mean squared deviation of observed and simulated data is %0.4f ' %RMSD)
The root mean squared deviation of observed and simulated data is 0.8576 

Note that the RMSD is not identical with the average difference $ = \frac{\sum_{i=1}^n \sqrt{(O_i - S_i)^2}}{n}$

In [118]:
d = []
for n, od in enumerate(OD):
    d.append(np.sqrt((od - SD[n])**2))
AD = sum(d)/float(len(OD))
print('The average difference of observed and simulated data is %0.4f ' %AD)
The average difference of observed and simulated data is 0.5737 

2b. Root mean squared error $ = \frac{100}{\overline{O}} * \sqrt{\frac{\sum_{i=1}^n (O_i - S_i)^2}{n}}$

where $\overline{O}$ is the average of observed data

The RMSE expresses the total difference as a percentage. An RMSE of 10% may be considered to be acceptable. Any RMSE of less than 10% thus would indicate a sufficiently accurate model.

In [119]:
d = []
for n, od in enumerate(OD):
    d.append((od - SD[n])**2)
RMSE = (100/(np.mean(OD))) * np.sqrt(sum(d)/float(len(OD)))
print('The root mean squared error of observed and simulated data is %0.2f%%' %RMSE)
The root mean squared error of observed and simulated data is 10.75%

3. Confidence interval

The 95% confidence interval is the range around the mean value of repeatedly observed data within which 95% of the simulated data would be expected to be found, if the model is sufficiently accurate.

In [120]:
# generate d (= 3) replicates of observed data and take the mean from them

# prepare drawing canvas and plot
fig = plt.figure(figsize=(8,5))
fig.add_subplot(1,1,1)

Col = ['wo', 'co', 'yo']

sigma = 0.2
m = 3
MD = []
for c, n in enumerate(range(m)):
    nOD = [1.]  # newly observed data
    # loop over time range
    for t in T:
        odiff = (g * nOD[t] * (1 - nOD[t]/K)) * dt
        noise = sigma * np.random.normal(0, 1) 
        nOD.append(nOD[t] + odiff + noise)
    plt.plot(nOD, Col[c], label = 'observed data %0i' %(c+1))
    MD.append(nOD)    
MOD = [np.mean(x) for x in zip(MD[0], MD[1], MD[2])]

plt.plot(MOD, 'k--', linewidth = 3, label = 'mean of all observed data')
plt.plot(SD, 'r--', linewidth = 3, label = 'simulated data')
plt.title('Data of three independent observations, and their mean')
plt.grid()
plt.legend(loc = 'best')
Out[120]:
<matplotlib.legend.Legend at 0x273cdc88>

The 95%-confidence interval can be obtained from the standard error multiplied by Student's $t$ at 95% probability.

The Student's $t$ is a continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population's standard deviation is unknown. Whereas a normal distribution describes a full population, t-distributions describe samples drawn from a full population. Accordingly, the t-distribution for each sample size is different, and the larger the sample, the more the distribution resembles a normal distribution.

The standard error, $SE_i = \frac{\sum_{ij=1}^m (O_i - O_{ij})^2}{\sqrt m * (m-1)}$, where $O_i$ is the mean of the $i$th measurement, $O_{ij}$ is the $j$th replicate of the $i$th measurement, and $m$ is the total number of replicates of the $i$th measurement. The standard error is the standard deviation of all possible replicate means.

In [121]:
fig = plt.figure(figsize=(15,5))
fig.add_subplot(1,2,1)

# standard error
SE = []
for n, item in enumerate(MD[0]):
    S = []
    for x in range(m):
        S.append((MOD[n] - MD[x][n])**2)
    SE.append(sum(S) / float(np.sqrt(m)*(m-1)))

plt.plot(SE)#[10]
plt.grid()
plt.ylim(0,1)
plt.title('Standard error of all observed data')

fig.add_subplot(1,2,2)
# ci = 1.96 * standard error

ci_l = [MOD[n] - 1.96 * x for n,x in enumerate(SE)]
ci_u = [MOD[n] + 1.96 * x for n,x in enumerate(SE)]

# show only most important part
plt.plot(T[20: 40], MOD[20: 40], 'bo', label = 'mean of all observed data')
plt.plot(T[20: 40], SD[20: 40], 'r.', label = 'simulated data')
plt.plot(T[20: 40], ci_l[20: 40], 'g--', linewidth=3, label = '95%-confidence interval')
plt.plot(T[20: 40], ci_u[20: 40], 'g--', linewidth=3)

plt.title('95%-confidence interval around observed and simulated data')
plt.xlabel('time')
plt.grid()
plt.ylim(1, 11)
plt.legend(loc = 'best')
Out[121]:
<matplotlib.legend.Legend at 0x2a470be0>

4. LOFIT - lack of fit

singles out errors due to variations in the measurements (i.e. in observations) and considers only errors due to the difference between simulated and measured data. However, actual values of replicated measurements are needed to calculate LOFIT.

$LOFIT = \sum_{i=1}^n m_i*(O_i - S_i)^2$

where $m_i$ is the number of replicates of the $i$-th measurement, $n$ is the number of simulated and measured pairs being compared, $O_i$ is the mean of the replicates of the $i$-th measurement and $S_i$ is the simulated value.

In [122]:
# LOFIT of the above replicates

m = 3    # number of replicates
L = 0
for n,i in enumerate(SD):
    L += m * (MOD[n] - i)**2
    
print('The LOFIT is %0.2f' %L)
The LOFIT is 99.60

4.1. The significance of the LOFIT, i.e. the difference between measured (observed) and simulated values

excluding errors assiciated with variations in measurements.

$F = \frac{\sum_{i=1}^n (m_i - 1) * LOFIT}{n*\sum_{i=1}^n \sum_{j=1}^{m_i} ((O_{ij} - S_i) - (O_i - S_i))^2}$

In [123]:
# significance of LOFIT

num = []    
for n,i in enumerate(SD):
    num.append((MOD[n] - 1) * L)
Num = sum(num)
    
denum = []  # denumerator
for n,i in enumerate(SD):
    s1 = []
    for x in range(m):
        s1.append(((MD[x][n] - SD[n]) - (MOD[n] - SD[n]))**2)
    denum.append(sum(s1))
Denum = len(SD) * sum(denum)

F = Num / float(Denum)
print('The significance of the LOFIT is %0.2f' %F)
The significance of the LOFIT is 7.33

5. The sample correlation coefficient

determines the degree of association between measured and simulated data.

$SCC = \frac{\sum_{i=1}^{n_i} (O_i - \overline{O}) * (S_i - \overline{S})}{\sqrt{\sum_{i=1}^{n_i} (O_i - \overline{O})^2} * \sqrt{\sum_{i=1}^{n_i} (S_i - \overline{S})^2}}$

The SCC can have any value between -1 and 1. A correlation coefficient of 1 or close to one indicates good positive correlation. The simulated values are strongly associated with the measured values, suggesting that the model is performing well.

A correlation coefficient of -1 or close to it indicates a negative correlation. The simulated values are strongly associated with the measured values, but there is something wrong with the model. Results are coming out the wrong way.

A correlation coefficient of 0 indicates no correlation. The simulated values are not associated with the measured values. The model is performing badly.

In [124]:
Om = np.mean(MOD)  # (OD)
Sm = np.mean(SD)

s1 = []
for n,x in enumerate(OD):
    s1.append((x - Om) * (SD[n] - Sm))
s2 = []
for n,x in enumerate(OD):
    s2.append((x - Om)**2)
s3 = []
for n,x in enumerate(OD):
    s3.append((SD[n] - Sm)**2)

r = sum(s1) / float(np.sqrt(sum(s2)) * np.sqrt(sum(s3)))

print('The sample correlation coefficient is %0.2f' %r)
The sample correlation coefficient is 0.97