The data set for the following example can be found at http://systems-sciences.uni-graz.at/etextbook/data-metUG.csv. You can view and analyze the data in a common spreadsheet program or a text editor. The data is arranged in columns: the first row is the header (no data), and the second row is the data description. From the third row onwards, the data lists various meteorological variables: 10-minute averages for temperature and relative humidity, and 10-minute totals for precipitation and sunshine duration.

In the following, the programming language Python (https://www.python.org/) and its open source data analytics library Pandas (http://pandas.pydata.org/) are used to access this data, to visualize it and to perform basic statistical analysis on it.
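
Because the file carries a description row directly below the header, it can help to see how such a layout is read. The following is a minimal sketch on a made-up excerpt: the values and the wording of the description row are hypothetical and not taken from the real file. Whether the real file needs the skiprows argument depends on its exact contents.

```python
import io
import pandas as pd

# Hypothetical excerpt: header row, description row, then data rows
csv_text = (
    "time,temperature (C),rel. humidity (%)\n"
    "hh:mm,10-min average,10-min average\n"
    "00:00,17.1,95\n"
    "00:10,17.0,96\n"
)

# skiprows=[1] drops the description row (second line of the file),
# so the remaining columns can be parsed as numbers
sample = pd.read_csv(io.StringIO(csv_text), skiprows=[1])
print(sample.dtypes)
```

Without skiprows, the text in the description row would force every column to be read as strings, and plotting or statistics would fail later on.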

In [4]:

```
# import required library
import pandas as pd
# import and read data (URLs need forward slashes, not backslashes)
da = pd.read_csv('http://systems-sciences.uni-graz.at/etextbook/data-metUG.csv')
# read_csv already returns a Pandas DataFrame; this call only makes that explicit
df = pd.DataFrame(da)
# show first 5 instances of data
df.head()
```

Out[4]:

In [5]:

```
# show the rows with index 43 to 47 (the index starts at 0, the slice end is excluded)
df[43:48]
```

Out[5]:

In [6]:

```
# show last 3 instances of data
df.tail(3)
```

Out[6]:
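
The row selections above rely on zero-based, end-exclusive slicing. The following sketch illustrates this on a hypothetical stand-in frame (144 rows, one per 10-minute interval of a day; the column name value is made up), so it runs without downloading the data.

```python
import pandas as pd

# Hypothetical stand-in for the real data: 144 rows, one per 10-minute interval
demo = pd.DataFrame({"value": range(144)})

# Positional slices are zero-based and exclude the end position,
# so 43:48 returns the rows at positions 43, 44, 45, 46 and 47
print(demo.iloc[43:48].index.tolist())

# head(n) and tail(n) are shorthand for the first and last n rows
print(demo.head(5).index.tolist())
print(demo.tail(3).index.tolist())
```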

First, we use plots to determine which meteorological variables are strongly correlated.

To gain a first overview of the weather on that day, we plot the variables against time. For plotting we use another open source Python library, called matplotlib (http://matplotlib.org/).

In [7]:

```
# import required library
import matplotlib.pyplot as plt
# include plots into notebook
%matplotlib inline
# define frame for plots, and size
fig = plt.figure(figsize=(12,10))
# define subplots, and distance between them
fig.subplots_adjust(wspace = 0.5, hspace = 0.3)
# define plots
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)
# define time ticks: one tick every 20 rows, labels are approximate
tickpos = list(range(0, 143, 20))
timeticks = ('00:00','03:24','06:48','10:12','13:36','17:00','20:24','23:48')
# corresponds roughly to
# timeticks = df['time'].values[0:143:20]
# print(timeticks)
# data for temperature
t = df['temperature (C)']
# plot it
ax1.plot(t)
# define tick positions, tick labels and axis labels
ax1.set_xticks(tickpos)
ax1.set_xticklabels(timeticks, rotation=30, fontsize='small')
ax1.set_xlabel('time')
ax1.set_ylabel('temperature in C')
# data for humidity in %
h = df['rel. humidity (%)']
ax2.plot(h, color='r')
ax2.set_xticks(tickpos)
ax2.set_xticklabels(timeticks, rotation=30, fontsize='small')
ax2.set_xlabel('time')
ax2.set_ylabel('rel. humidity in %')
# data for sunshine duration in minutes
s = df['sunshine duration (min)']
ax3.plot(s, color='y')
ax3.set_xticks(tickpos)
ax3.set_xticklabels(timeticks, rotation=30, fontsize='small')
ax3.set_xlabel('time')
ax3.set_ylabel('sunshine duration in min')
# data for precipitation in mm
p = df['precipitation (mm)']
ax4.plot(p, color='g')
ax4.set_xticks(tickpos)
ax4.set_xticklabels(timeticks, rotation=30, fontsize='small')
ax4.set_xlabel('time')
ax4.set_ylabel('precipitation in mm')
```

Out[7]:

- High temperature occurs together with low relative humidity and vice versa (= correlation). This seems reasonable, since rel. humidity is the ratio of specific humidity (= mass of water vapor per mass of air) to saturation humidity. According to the Clausius-Clapeyron equation, saturation humidity increases with increasing temperature, so at constant water content rel. humidity drops as the air warms.
- When the sun is shining, it is not raining (= correlation), and when it rains, the sun does not shine (= correlation). This seems reasonable too, since rain falls from clouds and clouds block the sun. But: if it is not raining, the sun may or may not shine (no correlation), and when the sun is not shining, it may or may not rain (no correlation). Sunshine duration and total precipitation can therefore be only partially correlated.
- Minor variations in temperature seem to be correlated with sunshine. This is plausible because a large amount of solar energy is converted into heat at the earth's surface.
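
The temperature dependence of saturation humidity can be made concrete with a small calculation. The sketch below uses the Magnus formula, a common empirical approximation of the Clausius-Clapeyron relation that is not part of the original analysis, and it works with vapor pressure as a stand-in for humidity. The fixed vapor pressure of 15 hPa is a hypothetical value chosen for illustration.

```python
import math

def saturation_vapor_pressure_hpa(temp_c):
    """Magnus approximation of saturation vapor pressure over water in hPa."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))

# Assumed, fixed amount of water vapor in the air (hypothetical value)
vapor_pressure_hpa = 15.0

# With constant water vapor content, rel. humidity falls as temperature rises
for temp_c in (14, 18, 22, 26):
    rel_humidity = 100 * vapor_pressure_hpa / saturation_vapor_pressure_hpa(temp_c)
    print(f"{temp_c} C: {rel_humidity:.0f} % rel. humidity")
```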

Now let's analyze the relationship between the variables (starting with the first 3 pairs) with the help of scatter plots (x-axis: variable 1, y-axis: variable 2). Which variables are strongly interdependent, which are not? For which variable pairs would it be reasonable to set up a linear regression model?

In [8]:

```
# see above
fig1 = plt.figure(figsize=(15,10))
fig1.subplots_adjust(wspace = 0.5, hspace = 0.3)
ax1 = fig1.add_subplot(2,2,1)
ax2 = fig1.add_subplot(2,2,2)
ax3 = fig1.add_subplot(2,2,3)
# scatterplot
ax1.scatter(t, h, s = 50, color="red", marker="+")
ax1.set_xlabel("temperature in C")
ax1.set_ylabel("rel. humidity in %")
ax2.scatter(t, s, s = 50, color="red", marker="+")
ax2.set_xlabel("temperature in C")
ax2.set_ylabel("sunshine duration (min)")
ax3.scatter(t, p, s = 50, color="red", marker="+")
ax3.set_xlabel("temperature in C")
ax3.set_ylabel("precipitation in mm")
```

Out[8]:

- top-left plot: temperature / rel. humidity: strong correlation, which is close to (but not perfectly) linear in the observed range. The correlation is negative, i.e., higher temperatures go along with lower rel. humidity. Evidently, another factor also influences rel. humidity: the high values between 17 and 20°C cannot be explained by temperature. (These values occur in the evening / night after rainfall. Precipitation increases specific humidity, and thus relative humidity too.) The strong, close to linear relationship suggests a linear model.
- top-right plot: temperature / sunshine duration: no clear correlation; any temperature can occur with any sunshine duration. Sunshine duration is not suitable as the only explanatory variable for temperature (and vice versa). However, we know that sunshine coincides with the smaller temperature fluctuations. Once the dominant explanatory variable is found (a hot tip: time of day), sunshine duration can serve as an additional explanatory variable, for example in a multiple linear regression model.
- bottom-left plot: temperature / precipitation: no reliable correlation. The number of data points with precipitation > 0 is too small for a statistically meaningful statement. Also, no simple direct causal relationship can be found. Linear regression does not make sense!
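
For a pair such as temperature / rel. humidity, setting up a linear model amounts to fitting a straight line by least squares. The following is a minimal sketch on synthetic data (not the Graz measurements); the slope of -3.6 and the noise level are arbitrary assumptions, and numpy.polyfit is only one of several ways to do the fit.

```python
import numpy as np

# Synthetic stand-in data (NOT the Graz measurements): humidity falls
# linearly with temperature, plus random scatter
rng = np.random.default_rng(0)
temp = rng.uniform(14, 25, size=144)
hum = 150 - 3.6 * temp + rng.normal(0, 3, size=144)

# Least-squares fit of a straight line hum = slope * temp + intercept
slope, intercept = np.polyfit(temp, hum, deg=1)
print(f"slope = {slope:.2f} % per C, intercept = {intercept:.1f} %")
```

With the real data, the series t and h from the plots above would take the place of temp and hum.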

The three remaining pairs of variables:

In [9]:

```
fig2 = plt.figure(figsize=(15,10))
fig2.subplots_adjust(wspace = 0.5, hspace = 0.3)
ax1 = fig2.add_subplot(2,2,1)
ax2 = fig2.add_subplot(2,2,2)
ax3 = fig2.add_subplot(2,2,3)
ax1.scatter(p, h, s = 50, color="blue", marker="+")
ax1.set_xlabel("precipitation in mm")
ax1.set_ylabel("rel. humidity in %")
ax2.scatter(p, s, s = 50, color="blue", marker="+")
ax2.set_xlabel("precipitation in mm")
ax2.set_ylabel("sunshine duration (min)")
ax3.scatter(h, s, s = 50, color="red", marker="+")
ax3.set_xlabel("rel. humidity in %")
ax3.set_ylabel("sunshine duration (min)")
```

Out[9]:

- top-left plot: rel. humidity / precipitation: there is a clear correlation; precipitation occurs only at > 70% rel. humidity. However, > 70% is obviously not a sufficient condition for precipitation. In addition, the relation is strongly non-linear. Linear regression would not make sense!
- top-right plot: sunshine duration / precipitation: strong correlation! It rains only when there is no sun. Nevertheless, a linear regression would not be useful! The relationship is extremely non-linear, and the actual value of a sunshine duration > 0 has no explanatory power for the total precipitation.
- bottom-left plot: sunshine duration / rel. humidity: no clear correlation evident. Linear regression would not make sense.
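
A relationship of the kind "it rains only when there is no sun" is better checked with a contingency table than with a regression line. The sketch below uses a handful of hypothetical 10-minute values; with the real data, the series s and p would be used instead.

```python
import pandas as pd

# Hypothetical 10-minute values, not the real measurements
demo = pd.DataFrame({
    "sunshine duration (min)": [0, 0, 0, 4, 10, 10, 0, 0],
    "precipitation (mm)": [0.0, 0.3, 1.2, 0.0, 0.0, 0.0, 0.0, 0.1],
})
sun = demo["sunshine duration (min)"] > 0
rain = demo["precipitation (mm)"] > 0

# Count how often each combination of "sun yes/no" and "rain yes/no" occurs
table = pd.crosstab(sun, rain, rownames=["sun"], colnames=["rain"])
print(table)

# An empty "sun and rain" cell reflects the statement "it rains only without sun"
print("intervals with both sun and rain:", int((sun & rain).sum()))
```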

The Python library Pandas offers a convenient way to calculate basic statistics:

-> Mean is defined as \(\bar x = \frac{1}{n}\sum_{i=1}^{n}x_i\)

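-> Standard deviation (\(std\)) is defined as the square root of the variance, \(std(x) = \sqrt{var(x)}\), with \(var(x)\) being defined as \(var(x)= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2\). Note that Pandas by default divides by \(n-1\) instead of \(n\) (sample variance), which makes only a small difference for large \(n\).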

Variance and standard deviation are measures of the "spread" of a variable. The greater the variance, the more the values of \(x\) deviate from \(\bar x\) on average. \(Std\) carries the same information as \(var (x)\) but is more intuitive, because it has the same units as the variable \(x\) itself.

Additionally, the Pandas command \(describe()\) reports the minimum (\(min\)), the maximum (\(max\)) and the 25%, 50% and 75% percentiles of the data set (the 50% percentile is the median).
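
The definitions of mean, variance and standard deviation can be checked by hand on a small made-up sample. This sketch also shows the ddof argument of the Pandas method std: ddof=0 divides by \(n\), while the default ddof=1 divides by \(n-1\).

```python
import pandas as pd

# Small made-up sample
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

mean = x.sum() / n
var_population = ((x - mean) ** 2).sum() / n  # divides by n
std_population = var_population ** 0.5

print(mean, var_population, std_population)
print(x.std(ddof=0))  # Pandas with ddof=0 also divides by n
print(x.std())        # Pandas default (ddof=1) divides by n - 1
```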

In [10]:

```
# calculate basic statistics for t, h, s and p
t.describe(), h.describe(), s.describe(), p.describe()
```

Out[10]:

Covariance indicates the joint variability of the variables \(x\) and \(y\). If positive deviations of \(x\) from its mean occur together with positive deviations of \(y\) from its mean, \(cov (x, y)\) is positive. If positive deviations of \(x\) occur along with negative deviations of \(y\), \(cov (x, y)\) is negative. In both cases a correlation exists between \(x\) and \(y\).

If positive deviations of \(x\) occur about equally often with positive and with negative deviations of \(y\), the products cancel and \(cov (x, y)\) is about zero. In this case there is no linear correlation between \(x\) and \(y\).

Covariance is defined as \(cov (x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)\)
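
The definition can be applied directly to a few made-up values in which \(y\) tends to fall when \(x\) rises. Note that the Pandas method cov divides by \(n-1\) rather than \(n\), so the sketch rescales its result for comparison.

```python
import pandas as pd

# Made-up values: y tends to fall when x rises
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([9.0, 8.0, 8.0, 5.0, 5.0])
n = len(x)

# Covariance following the 1/n definition
cov_population = ((x - x.mean()) * (y - y.mean())).sum() / n

# Pandas divides by n - 1; rescaling recovers the 1/n value
cov_pandas = x.cov(y)
print(cov_population, cov_pandas * (n - 1) / n)
```

The result is negative, as expected for deviations of opposite sign.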

In [11]:

```
# calculate covariance of the numeric columns (the time column is left out)
df.select_dtypes(include='number').cov()
```

Out[11]:

The correlation coefficient \(corr (x, y)\) is the normalized covariance. Its values range from \(-1\) to \(+1\). The absolute value of \(corr (x, y)\) indicates how well the pairs of values fit a straight line. If \(|corr (x, y)| = 1\), all values lie on a line; if it is \(0\), there is no linear correlation between \(x\) and \(y\). If \(corr (x, y) < 0\), the correlation is negative; if \(corr (x, y) > 0\), it is positive.

The correlation coefficient is defined as \(corr (x, y) = \frac{cov (x, y)}{\sqrt {var(x)var(y)}}\)
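
As a small check of this definition, the sketch below computes the coefficient by hand for made-up values and compares it with the Pandas method corr. The helper name corr_from_definition is introduced here for illustration only.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([9.0, 8.0, 8.0, 5.0, 5.0])

def corr_from_definition(a, b):
    """Correlation coefficient as covariance divided by sqrt(var(a) * var(b))."""
    cov = ((a - a.mean()) * (b - b.mean())).mean()
    return cov / (a.var(ddof=0) * b.var(ddof=0)) ** 0.5

print(corr_from_definition(x, y), x.corr(y))
# Points exactly on a falling straight line give -1
print(corr_from_definition(x, 10 - 2 * x))
```

Because the same factor (\(1/n\) or \(1/(n-1)\)) appears in the numerator and the denominator, it cancels, and both conventions give the same coefficient.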

In [12]:

```
# calculate correlation coefficients of the numeric columns (the time column is left out)
df.select_dtypes(include='number').corr()
```

Out[12]:

The average temperature in Graz on August 1st 2011 was 18.9°C, the average relative humidity was 80.9% and the average sunshine duration per 10-minute interval was 1.8 minutes.

The standard deviations (variances) of these values are, respectively: 2.8°C (7.7); 14.8% (220.1); 3.4 minutes (11.4). Note the large standard deviation of sunshine duration compared to its average.

\(cov (temp, rh) = -37.7\); \(corr (temp, rh) = -0.92\): hence, there is a strong negative correlation. About 85% of the variance of rel. humidity is explained by temperature (\(corr^2 \approx 0.85\)).

\(cov (temp, sunshine) = 5.1\); \(corr (temp, sunshine) = 0.52\): hence a positive correlation, which was not clearly visible in the visual analysis above. The correlation is relatively weak: only about 27% of the variance of the sunshine duration is explained by temperature (\(corr^2 \approx 0.27\)).
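
The share of explained variance follows from squaring the correlation coefficient. The short calculation below uses only the rounded values reported in the text, so small deviations from the exact Pandas output are to be expected.

```python
# Values as reported in the text
corr_temp_rh = -0.92
corr_temp_sun = 0.52

# Share of variance explained by a linear relationship = corr squared
print(round(corr_temp_rh ** 2, 2))   # 0.85
print(round(corr_temp_sun ** 2, 2))  # 0.27

# Consistency check: covariance divided by the product of the standard deviations
print(round(-37.7 / (2.8 * 14.8), 2))  # -0.91, close to the reported -0.92
print(round(5.1 / (2.8 * 3.4), 2))     # 0.54, close to the reported 0.52
```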