Simple statistics

The weather on a summer's day (Graz August 1st 2011)

The data set for the following example can be found at http://systems-sciences.uni-graz.at/etextbook/data-metUG.csv. You can view and analyze the data in a common spreadsheet program or a text editor. The data is arranged in columns: the first row is the header (no data), and the second row contains the data description. From the third row onwards, the data lists various meteorological variables: 10-minute averages for temperature and relative humidity, and 10-minute totals for precipitation and sunshine duration.

In the following, the programming language Python (https://www.python.org/) and its open source data analysis library Pandas (http://pandas.pydata.org/) are used to access this data, to visualize it and to perform basic statistical analysis on it.

In [4]:
# import required library
import pandas as pd

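# import and read data
# (assumed call: the original read statement is not shown, and options such as sep or skiprows may be required)
da = pd.read_csv('http://systems-sciences.uni-graz.at/etextbook/data-metUG.csv')
# convert data into Pandas DataFrame
df = pd.DataFrame(da)
# show first 5 instances of data
df.head()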

         date   time  temperature (C)  rel. humidity (%)  sunshine duration (min)  precipitation (mm)
0  01.08.2011  00:00             16.1               90.6                        0                   0
1  01.08.2011  00:10             16.0               90.8                        0                   0
2  01.08.2011  00:20             16.0               91.0                        0                   0
3  01.08.2011  00:30             16.1               90.9                        0                   0
4  01.08.2011  00:40             16.0               90.8                        0                   0

5 rows × 6 columns

In [5]:
# show 43rd to 47th instance of data
df[43:48]

          date   time  temperature (C)  rel. humidity (%)  sunshine duration (min)  precipitation (mm)
43  01.08.2011  07:10             16.6               89.3                      0.0                 0.1
44  01.08.2011  07:20             16.6               89.4                      0.0                 0.0
45  01.08.2011  07:30             16.7               88.6                      0.9                 0.0
46  01.08.2011  07:40             17.0               87.0                     10.0                 0.0
47  01.08.2011  07:50             17.4               84.2                     10.0                 0.0

5 rows × 6 columns

In [6]:
# show last 3 instances of data
df.tail(3)

           date   time  temperature (C)  rel. humidity (%)  sunshine duration (min)  precipitation (mm)
141  01.08.2011  23:30             17.2               94.5                        0                   0
142  01.08.2011  23:40             17.2               94.5                        0                   0
143  01.08.2011  23:50             17.2               94.6                        0                   0

3 rows × 6 columns

Visual data analysis

First, we use plots to get an impression of which meteorological variables are strongly correlated.

To gain a first overview of the weather on that day, we plot the variables against time. For plotting we use another open source Python library called matplotlib (http://matplotlib.org/).

In [7]:
# import required library
import matplotlib.pyplot as plt
# include plots into notebook
%matplotlib inline

# define frame for plots, and size
fig = plt.figure(figsize=(12,10))
# define subplots, and distance between them
fig.subplots_adjust(wspace = 0.5, hspace = 0.3)
# define plots
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

#define timeticks
timeticks = ('00:00','03:24','06:48','10:12','13:36','17:00','20:24','23:48')
# corresponds roughly to
# timeticks = df['time'].values[0:143:20]
# print timeticks

# data for temperature
t = df['temperature (C)']
# plot it
ax1.plot(t)
# define axes and tick labels
ax1.set_xticklabels(timeticks, rotation=30, fontsize='small')
ax1.set_ylabel('temperature in C')
# data for humidity in %
h = df['rel. humidity (%)']
ax2.plot(h, color='r')
ax2.set_xticklabels(timeticks, rotation=30, fontsize='small')
ax2.set_ylabel('rel. humidity in %')
# data for sunshine duration in minutes
s = df['sunshine duration (min)']
ax3.plot(s, color='y')
ax3.set_xticklabels(timeticks, rotation=30, fontsize='small')
ax3.set_ylabel('sunshine duration in min')
# data for precipitation in mm
p = df['precipitation (mm)']
ax4.plot(p, color='g')
ax4.set_xticklabels(timeticks, rotation=30, fontsize='small')
ax4.set_ylabel('precipitation in mm')
[Figure: four time series panels showing temperature, rel. humidity, sunshine duration and precipitation over the day]
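
The tick labels in the cell above are hard-coded and only roughly match the tick positions. As a minimal sketch, assuming that row i of the data set corresponds to the time i × 10 minutes after midnight (as the listings above suggest), positions and matching labels could be derived as follows. The helper name tick_labels and its parameters are hypothetical and not part of the original notebook.

```python
def tick_labels(n_rows=144, step=20, minutes_per_row=10):
    """Return tick positions (row indices) and matching HH:MM labels."""
    positions = list(range(0, n_rows, step))
    labels = []
    for i in positions:
        hours, minutes = divmod(i * minutes_per_row, 60)
        labels.append(f"{hours:02d}:{minutes:02d}")
    return positions, labels


positions, labels = tick_labels()
print(positions)
print(labels)
# With matplotlib, both lists could be applied to an axis via
# ax.set_xticks(positions) followed by ax.set_xticklabels(labels, rotation=30)
```

Setting the positions explicitly keeps each label attached to the row it was computed for, independent of the default tick locations matplotlib chooses.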

Correlations as concluded from visual inspection of the plots above

  1. High temperature occurs together with low relative humidity and vice versa (= correlation). This seems reasonable since rel. humidity is the ratio of specific humidity (= mass of water vapor per mass of air) to saturation humidity. According to the Clausius-Clapeyron equation, saturation humidity increases with increasing temperature, so for a roughly constant specific humidity the ratio falls as the air warms.
  2. When the sun is shining, it is not raining (= correlation), and when it rains, the sun does not shine (= correlation). This seems reasonable too, since rain falls from clouds and clouds block the sun. But: if it is not raining, the sun may or may not shine (no correlation), and when the sun is not shining, it may or may not rain (no correlation). Sunshine duration and total precipitation can therefore be only partially correlated.
  3. Minor variations in temperature seem to be correlated with sunshine. This is understandable because a large amount of solar energy is converted into heat at the earth's surface.
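
The temperature dependence described in point 1 can be illustrated numerically. The sketch below uses the Magnus formula, a common empirical approximation to the Clausius-Clapeyron relation, and expresses relative humidity as the ratio of vapor pressure to saturation vapor pressure. The coefficients (6.112 hPa, 17.62, 243.12 °C) and the constant vapor pressure of 16.5 hPa are assumptions chosen for illustration; they are not taken from the data set.

```python
import math


def saturation_vapor_pressure(temp_c):
    """Saturation vapor pressure over water in hPa (Magnus approximation)."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def relative_humidity(vapor_pressure_hpa, temp_c):
    """Relative humidity in % for a given vapor pressure and temperature."""
    return 100.0 * vapor_pressure_hpa / saturation_vapor_pressure(temp_c)


vapor_pressure = 16.5  # hPa, assumed constant moisture content of the air
for temp in (16.0, 20.0, 24.0):
    print(temp, round(relative_humidity(vapor_pressure, temp), 1))
```

With the moisture content held fixed, the printed relative humidity drops as the temperature rises, which is the qualitative pattern seen in the time series.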

Now let's analyze the relationship between the variables (starting with the first 3 pairs) with the help of scatter plots (x-axis: variable 1, y-axis: variable 2). Which variables are strongly interdependent, which are not? For which variable pairs would it be reasonable to set up a linear regression model?

In [8]:
# see above
fig1 = plt.figure(figsize=(15,10))
fig1.subplots_adjust(wspace = 0.5, hspace = 0.3)
ax1 = fig1.add_subplot(2,2,1)
ax2 = fig1.add_subplot(2,2,2)
ax3 = fig1.add_subplot(2,2,3)

# scatterplot
ax1.scatter(t, h, s = 50, color="red", marker="+")
ax1.set_xlabel("temperature in C")
ax1.set_ylabel("rel. humidity in %")

ax2.scatter(t, s, s = 50, color="red", marker="+")
ax2.set_xlabel("temperature in C")
ax2.set_ylabel("sunshine duration (min)")

ax3.scatter(t, p, s = 50, color="red", marker="+")
ax3.set_xlabel("temperature in C")
ax3.set_ylabel("precipitation in mm")
[Figure: scatter plots of temperature against rel. humidity (top left), sunshine duration (top right) and precipitation (bottom left)]
  1. top-left plot: temperature / rel. humidity: strong correlation which is close to (though not perfectly) linear in the observed range. The correlation is negative, i.e., higher temperatures go along with lower rel. humidity. Evidently, another factor also influences rel. humidity: the high values between 17 and 20°C cannot be explained by temperature. (These values occur in the evening / night after rainfall. Precipitation increases specific humidity, and thus relative humidity too.) The strong, close to linear relationship suggests a linear model.
  2. top-right plot: temperature / sunshine duration: no clear correlation; each temperature can occur with any sunshine duration. Sunshine duration is not suitable as the only explanatory variable for temperature (and vice versa). However, we know that sunshine occurs simultaneously with smaller temperature fluctuations. Once the dominant explanatory variable is found (a hot tip: time of day), sunshine duration can serve as an additional explanatory variable, for example in a multiple linear regression model.
  3. bottom-left plot: temperature / precipitation: no reliable correlation. The number of data points with precipitation > 0 is too small for a statistically meaningful statement. Also, no simple direct causal relationship can be found. Linear regression does not make sense!
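
For the temperature / rel. humidity pair, a linear model can be sketched directly from summary statistics. The snippet assumes ordinary least squares, where the slope equals cov(x, y) / var(x) and the fitted line passes through the two means. The numbers are the mean, variance and covariance values that Pandas reports in the statistics section of this document; the variable and function names are chosen for illustration only.

```python
# summary values as reported by Pandas for August 1st 2011
mean_t = 18.868750     # mean temperature in C
mean_h = 80.927083     # mean rel. humidity in %
var_t = 7.706499       # variance of temperature
cov_th = -37.982994    # covariance of temperature and rel. humidity

# ordinary least squares line: h = intercept + slope * t
slope = cov_th / var_t
intercept = mean_h - slope * mean_t


def predict_humidity(temp_c):
    """Rel. humidity in % predicted by the fitted line."""
    return intercept + slope * temp_c


print(round(slope, 2), round(intercept, 1))
print(round(predict_humidity(20.0), 1))
```

The slope of roughly -4.9 means that, within the observed range of about 15 to 25 °C, each additional degree goes along with about 4.9 percentage points less relative humidity. Outside that range the line should not be trusted.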

The three remaining pairs of variables:

In [9]:
fig2 = plt.figure(figsize=(15,10))
fig2.subplots_adjust(wspace = 0.5, hspace = 0.3)
ax1 = fig2.add_subplot(2,2,1)
ax2 = fig2.add_subplot(2,2,2)
ax3 = fig2.add_subplot(2,2,3)

ax1.scatter(p, h, s = 50, color="blue", marker="+")
ax1.set_xlabel("precipitation in mm")
ax1.set_ylabel("rel. humidity in %")

ax2.scatter(p, s, s = 50, color="blue", marker="+")
ax2.set_xlabel("precipitation in mm")
ax2.set_ylabel("sunshine duration (min)")

ax3.scatter(h, s, s = 50, color="red", marker="+")
ax3.set_xlabel("rel. humidity in %")
ax3.set_ylabel("sunshine duration (min)")
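[Figure: scatter plots of precipitation against rel. humidity (top left), precipitation against sunshine duration (top right) and rel. humidity against sunshine duration (bottom left)]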
  1. top-left plot: rel. humidity / precipitation: there is a clear relationship, since precipitation occurs only at > 70% rel. humidity. However, > 70% is obviously not a sufficient condition for precipitation. In addition, the relationship is strongly non-linear. Linear regression would not make sense!
  2. top-right plot: sunshine duration / precipitation: strong relationship! It rains only when there is no sun. Nevertheless, a linear regression would not be useful: the relationship is extremely non-linear, and a sunshine duration > 0 has no explanatory power for the amount of precipitation.
  3. bottom-left plot: sunshine duration / rel. humidity: no clear correlation evident. Linear regression would not make sense.
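
A strong dependence does not guarantee a large correlation coefficient, because the coefficient only measures the linear part of a relationship. The following sketch uses made-up values (not taken from the data set) in which y is completely determined by x, yet the Pearson correlation coefficient vanishes. The helper pearson is written out here for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)


x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]          # y depends on x exactly, but not linearly

r_nonlinear = pearson(x, y)
r_linear = pearson(x, [3 * v + 1 for v in x])
print(r_nonlinear, r_linear)
```

This is consistent with the correlation table further below, where sunshine duration and precipitation have a coefficient of only about -0.10 although it rains only when the sun is not shining.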

Basic statistics

The Python library Pandas offers a convenient way to calculate basic statistics:

-> Mean is defined as \(\bar x = \frac{1}{n}\sum_{i=1}^{n}x_i\)

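-> Standard deviation (\(std\)) is defined as the square root of the variance, \(std(x) = \sqrt{var(x)}\), with \(var(x)\) being defined as \(var(x)= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2\). Note that Pandas by default divides by \(n - 1\) instead of \(n\) (sample variance); for the 144 values used here the difference is below 1%.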

Variance and standard deviation are measures of the "spread" of a variable. The greater the variance, the more the values of \(x\) in this case deviate from \(\bar x\) on average. \(Std\) has the same information content as \(var (x)\) but is more intuitive, because it has the same units as the variable \(x\) itself.

Additionally, the Pandas command \(describe()\) reports the number of values (\(count\)), the minimum (\(min\)), the maximum (\(max\)) and the 25%, 50% and 75% percentiles (quartiles) of the data set; the 50% percentile is the median.

In [10]:
# calculate basic statistics for t, h, s and p
t.describe(), h.describe(), s.describe(), p.describe()
(count    144.000000
mean      18.868750
std        2.776058
min       15.400000
25%       16.250000
50%       18.100000
75%       21.825000
max       24.500000
Name: temperature (C), dtype: float64,
 count    144.000000
mean      80.927083
std       14.836881
min       49.600000
25%       65.950000
50%       89.400000
75%       92.325000
max       94.900000
Name: rel. humidity (%), dtype: float64,
 count    144.000000
mean       1.828472
std        3.381121
min        0.000000
25%        0.000000
50%        0.000000
75%        1.100000
max       10.000000
Name: sunshine duration (min), dtype: float64,
 count    144.000000
mean       0.040278
std        0.220082
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        2.300000
Name: precipitation (mm), dtype: float64)
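
The definitions of mean, variance and standard deviation can be checked by hand on a few values. The sketch below uses the five temperature readings from 07:10 to 07:50 listed earlier and Python's standard statistics module. It shows both conventions: dividing by n, as in the formula given above, and dividing by n - 1, which is the default in Pandas' std() and describe().

```python
import statistics

temps = [16.6, 16.6, 16.7, 17.0, 17.4]  # temperature in C, rows 43 to 47

n = len(temps)
mean = sum(temps) / n
squared_deviations = [(x - mean) ** 2 for x in temps]
var_population = sum(squared_deviations) / n        # divides by n
var_sample = sum(squared_deviations) / (n - 1)      # divides by n - 1

print(mean, var_population ** 0.5, var_sample ** 0.5)

# the standard library implements both conventions
print(statistics.pvariance(temps), statistics.variance(temps))
```

For only five values the two results differ noticeably; for the full day with 144 values the difference is small.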

Covariance indicates the joint variability of the variables \(x\) and \(y\). If positive deviations of \(x\) from its mean occur together with positive deviations of \(y\), \(cov (x, y)\) is positive. If positive deviations of \(x\) from its mean occur along with negative deviations of \(y\), \(cov (x, y)\) is negative. In both cases a correlation exists between \(x\) and \(y\).

If positive deviations of \(x\) from its mean occur about equally often together with positive and with negative deviations of \(y\), \(cov (x, y)\) is about zero. In this case there is no linear correlation between \(x\) and \(y\).

Covariance is defined as \(cov (x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)\). By default, Pandas divides by \(n - 1\) instead of \(n\), which changes the values below by less than 1%.

In [11]:
# calculate covariance (pandas 2.0 or later may need numeric_only=True to skip the date and time columns)
df.cov()

                         temperature (C)  rel. humidity (%)  sunshine duration (min)  precipitation (mm)
temperature (C)                 7.706499         -37.982994                 5.126351           -0.010271
rel. humidity (%)             -37.982994         220.133038               -29.944343            0.362748
sunshine duration (min)         5.126351         -29.944343                11.431981           -0.074162
precipitation (mm)             -0.010271           0.362748                -0.074162            0.048436

4 rows × 4 columns
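
As a small check of the covariance definition, the sketch below computes cov(x, y) for the five rows 43 to 47 listed earlier (temperature and rel. humidity). It shows both the 1/n normalization from the formula above and the 1/(n - 1) normalization that Pandas' cov() uses by default. The function name covariance and its ddof parameter are illustrative choices.

```python
def covariance(xs, ys, ddof=0):
    """Covariance of two equally long sequences, dividing by n - ddof."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - ddof)


temps = [16.6, 16.6, 16.7, 17.0, 17.4]      # temperature in C
humidity = [89.3, 89.4, 88.6, 87.0, 84.2]   # rel. humidity in %

cov_n = covariance(temps, humidity)              # divides by n
cov_n1 = covariance(temps, humidity, ddof=1)     # divides by n - 1
print(cov_n, cov_n1)
```

Both values are negative: in these rows temperature rises while relative humidity falls.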

The correlation coefficient \(corr (x, y)\) is the normalized covariance. Its values range from \(-1\) to \(+1\). The absolute value of \(corr (x, y)\) indicates how well the pairs of values fit a straight line. If \(|corr (x, y)| = 1\), all values lie on a straight line; if it is \(0\), there is no linear correlation between \(x\) and \(y\). If \(corr (x, y) < 0\), the correlation is negative; if \(corr (x, y) > 0\), it is positive.

The correlation coefficient is defined as \(corr (x, y) = \frac{cov (x, y)}{\sqrt {var(x)var(y)}}\)

In [12]:
# calculate correlation coefficient (pandas 2.0 or later may need numeric_only=True to skip the date and time columns)
df.corr()

                         temperature (C)  rel. humidity (%)  sunshine duration (min)  precipitation (mm)
temperature (C)                 1.000000          -0.922185                 0.546159           -0.016811
rel. humidity (%)              -0.922185           1.000000                -0.596914            0.111090
sunshine duration (min)         0.546159          -0.596914                 1.000000           -0.099663
precipitation (mm)             -0.016811           0.111090                -0.099663            1.000000

4 rows × 4 columns
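
The entries of the correlation table can be reproduced from the covariance table with the definition above. The sketch below does this for two pairs, using the variances (diagonal entries) and covariances reported by Pandas; the variable and function names are illustrative.

```python
import math

# values taken from the covariance table
var_t = 7.706499      # temperature
var_h = 220.133038    # rel. humidity
var_s = 11.431981     # sunshine duration
cov_th = -37.982994   # temperature / rel. humidity
cov_ts = 5.126351     # temperature / sunshine duration


def corr(cov_xy, var_x, var_y):
    """Correlation coefficient as covariance normalized by both variances."""
    return cov_xy / math.sqrt(var_x * var_y)


corr_th = corr(cov_th, var_t, var_h)
corr_ts = corr(cov_ts, var_t, var_s)
print(round(corr_th, 4), round(corr_ts, 4))
```

Because the same normalization (n or n - 1) appears in the numerator and the denominator, the correlation coefficient does not depend on which of the two conventions is used.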


The average temperature on August 1st 2011 in Graz was 18.9 °C, the average relative humidity was 80.9% and the average sunshine duration per 10-minute interval was 1.8 minutes.

The respective standard deviations (variances) are 2.8 °C (7.7), 14.8% (220.1) and 3.4 minutes (11.4). Note the large standard deviation of sunshine duration compared to its average.

\(cov (temp, rh) = -38.0; corr (temp, rh) = -0.92\), hence there is a strong negative correlation. About 85% of the variance of rel. humidity is explained by temperature.

\(cov (temp, sunshine) = 5.1; corr (temp, sunshine) = 0.55\), hence a positive correlation, which was not clearly visible in the visual analysis above. The correlation is relatively weak: only about 30% of the variance of the sunshine duration is explained by temperature.
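
The share of explained variance quoted above is the square of the correlation coefficient (the coefficient of determination of a simple linear regression). A short check with the values from the correlation table, using illustrative variable names:

```python
corr_th = -0.922185   # temperature / rel. humidity
corr_ts = 0.546159    # temperature / sunshine duration

r2_th = corr_th ** 2
r2_ts = corr_ts ** 2

# explained variance in percent, rounded to whole numbers
print(round(100 * r2_th), round(100 * r2_ts))   # prints: 85 30
```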