Linear data discrimination

Linear discrimination (or segmentation) of data refers to a set of methods in machine learning which are able to distinguish - and to some extent predict membership in - different classes by way of a linear combination - a weighted sum - of the attributes of the data. This discrimination is usually given in the form of a straight separation line - the linear discriminant. The maybe most basic method in this set is linear regression.

Linear regression

In statistics, linear regression is commonly used to deduce some sort of average in data or to estimate the dynamics of a time series. In data science however, regression has a slightly different meaning. It refers to a sort of "value estimation". Other than classification, where data is classified into two or more distinct sets, regression methods attempt to estimate or predict numerical values of some variable.

Lets first look at some simple statistical problem of deriving dynamics from data. Consider the following table, which shows US-population numbers from 1790 to 1950 in thousands.

1790 1800 1810 1820 1830 1840 1850 1860 1870 1880 1890 1900 1910 1920 1930 1940 1950
3929 5308 7240 9638 12866 17069 23192 31443 38558 50156 62948 75995 91972 105711 122775 131669 150697

Plotted these data gives the following curve:

The most common form of linear regression looks for values which approximate the dynamics as implied by this curve. For this, it uses the methodology of least squares defining a line $$y=k*x+d$$ drawn through the plot so that the squared distance of the data-points from the line is minimized. The formula for $$k = \frac{\frac{1}{n}\sum_{i=1}^n x_i y_i - \bar x\bar y}{\frac{1}{n}\sum_{i=1}^n x_i^2 - \bar x^2} (= 937.17)$$ and for $$d=\bar y - k \bar x (= -1697154.09)$$. ($$\bar x$$ and $$\bar y$$ indicate the mean of $$x$$ and $$y$$-values respectively, that is, the mean of the values for the years and for the population.)

Plotted, the regression line looks like follows and indicates a direction into which the dynamics of population growth might further develop (on the base of the data given. Note however, that population growth rarely proceeds linearly):

Linear classification

The least square method can also be used to separate - i.e. to classify - different classes of data in respect to its distance from the separation line. The following example regards two features of a data set from a query of people deciding for or against the adoption of photo-voltaic technology. The two features are "importance of financial aspects" and "importance of independence from mainstream energy providers" found to be most informative with the method of information gain.The blue data points indicate adopters, the red ones non-adopters. As you can see, the data is not completely unambiguous in respect to the target value (i.e. the question whether people adopt or not). There is one red point on, and another one beyond the separation line.

The line separating adopters from non-adopters is called the discriminant and can be expressed by way of a function - in this case the function of a straight line $$y = k*x + d$$ - the discriminant function. The values above and to the right of the discriminant, that is, the values where $$k*x + d - y > 0$$ should indicate persons with a higher probability to adopt a PV-installation than the values where $$k*x + d- y \le 0$$ (that is, below and to the left of the discriminant).

Predictive modeling

Separating data in this way can be interpreted as providing information on the probability of data points belonging to one or the other class. Intuitively, one would say that the further a data point is from the discriminant line, the more likely its class membership (the likelihood of belonging to a class) can be derived. In this sense, the data point in the right upper corner for instance obviously has the highest likelihood of belonging to the class of adopters.

An essential feature of this kind of machine learning is it to provide a possibility for training the algorithm on one set of data in order to learn to predict the class membership of an other so far unclassified set of data. For this, data sets often are deliberately divided into sets of training-data and test-data. The training-data - consisting of feature-values and target-values - is used to train the algorithm. And the test-data - also consisting of feature- and target-values - then is used to test the predictive power of the algorithm. What the algorithm generates are likelihoods for data points belonging to one or the other class. Think of a machine learning algorithm trained on empirically collected data about PV-adoption of inhabitants of one particular region predicting the likelihood of PV-adoption of inhabitants in an other (so far un-surveyed) region.

A problem of separating data in this way is that often there are several ways to draw linear discriminants through data. The least square method is just one way of doing this.

Support Vector Machine

An alternative method offers the so called Support Vector Machine (SVM). Whereas linear regression separates data in respect to the least square distances of data points from the separation line (the discriminant), the Support Vector Machine (SVM) tries to find a broad as possible corridor through the data of different classes. This corridor is defined by the so called margin between data points of different classes. In short, one principle of this algorithm is it, to try to maximize the margin between data points of different classes. It thereby often finds different - and sometimes better - results than linear regression. The following plot shows the same data set as above, this time classified with an SVM.

As in the case of linear regression, not all points in the SVM-classification are classified correctly. Again there is one red point beyond the line. A second important principle of SVMs therefore penalizes data points which are on the wrong side of the discriminant. On the base of this principle, the SVM-algorithm iteratively looks for the discriminant which reduces the total penalty for incorrectly classified data points to the minimum. Its solution suggests a balance between minimal error penalty and maximal margin.

In SVMs, the penalty is computed proportionally to the distance from the discriminant line. With the general term for error penalty being loss, the loss-function used in SVMs is called a hinge loss indicating that the penalty for an incorrectly classified data-point increases linearly with the distance from the separation line. A simpler form of loss is zero-one-loss assigning a penalty of zero to correctly classified points and a loss of one to incorrectly classified points.

Logistic regression

A third alternative for linear data discrimination offers logistic regression. This algorithm too bases its working principle on penalizing incorrectly classified data points. In this case however, the size of the penalty for an incorrectly classified data point is squeezed into the interval between 0 and 1 by way of the logistic function $$p_+(x)=\frac{1}{1+e^{-f(x)} }$$ with $$p_+(x)$$ indicating the estimated probability of $$x$$ belonging to a $$+$$-class and $$f(x)$$ indicating the distance of $$x$$ to the discriminant. This function - generating a characteristic sigmoid curve (see here with $$\lambda=10$$ and $$h=5$$) - has the advantage of penalizing only points in a certain distance from the discriminant. Similarly to the zero-one-loss it does not consider higher penalties than 1, and similarly to the hinge-loss it accounts for differences in the distances of data, but only then when it is close enough to the discriminant to be seized by the logistic function. A plot of a classification done with logistic regression (left) and a comparison of the three linear discriminant functions as discussed on this site (right) are shown below (black line = linear regression, green line = SVM, brown line = logistic regression):