Latent Semantic Analysis

Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), is a mathematical method that tries to bring out latent relationships within a collection of documents. It is based on the assumption that words close in meaning will occur in similar pieces of text. Rather than looking at each document by itself, LSA looks at a corpus of documents as a whole and analyses the correlation and context of terms within this corpus in order to identify relationships. A typical example is a search engine query for the term "sand" that, among others, also returns documents which do not contain the term "sand" but do contain terms like "beach". In this case LSA has determined that the term "sand" is semantically close to "beach".

LSA starts from the fact that the words of our language do not simply map one-to-one onto concepts of meaning (as in the left image below): several words can share the same meaning, and a single word can carry multiple meanings (as in the right image below).

LSA 1
LSA 2

Usually, these ambiguities are resolved by the context in which words are used. The word "bank", for instance, probably refers to a financial institution when used in the context of words like "mortgage", "loans" or "rates", but could refer to a river bank when used together with "lures", "casting" and "fish". In order to find the meanings or concepts behind the words, LSA therefore attempts to map both words and their contexts into a "concept space" in which different meanings can be compared. For this, it has to filter out some of the noise that arises from different authors using different words to express the same meaning. LSA is thus essentially a statistical method.

LSA builds on the following simplifications:

  • Word contexts (which usually are called documents, with the size of a document possibly ranging from a sentence to a paragraph to whole articles) are represented as "bags of words", in which the order of the words is not important. What counts, however, is how many times each word occurs.
  • Concepts are represented as patterns of words that usually appear together in a document, such as “leash”, “treat”, and “obey” appearing together in a document about dog training.
  • Words are treated as having just one meaning, although this is clearly not the case (as mentioned above).
  • A set of words, called "stop words", is usually excluded from the analysis. These words, like “and”, “or”, “for”, “in”, “of”, “the”, “to”, etc. do not contribute much (if any) meaning to a context.
  • Words are stemmed, that is, reduced to their word stem, like 'measur' in 'measurement', 'measuring' or 'measure' etc.
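
These simplifications can be illustrated with a short preprocessing sketch. The stop-word list below contains only the examples mentioned above plus a few obvious additions, and NLTK's Porter stemmer is just one possible choice; both are assumptions made for illustration rather than part of the method itself.

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer  # pip install nltk

# Only the stop words mentioned above, plus a few obvious ones;
# a real analysis would use a much longer list.
STOP_WORDS = {"a", "an", "and", "or", "for", "in", "of", "the", "to", "is"}
stemmer = PorterStemmer()

def bag_of_words(document: str) -> Counter:
    """Lower-case, split into words, drop stop words, stem, and count."""
    tokens = [t for t in re.split(r"[^a-z]+", document.lower()) if t]
    stems = [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)

print(bag_of_words("A measure of the efficiency of a person, machine, "
                   "factory, system, etc., in converting inputs into useful outputs"))
# Counter({'measur': 1, 'effici': 1, 'person': 1, 'machin': 1, ...})
```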

The term-document matrix

The first step of an LSA consists of creating a term-document matrix, where each row represents a word and each column a document, and the respective cells of the matrix contain the frequencies with which the term occurs in the document. Terms are reduced to their stem and stop words are excluded.


Assume the following nine sentences (documents D1 - D9) containing definitions of productivity:


  • D1: "A measure of the efficiency of a person, machine, factory, system, etc., in converting inputs into useful outputs"
  • D2: "Productivity is computed by dividing average output per period by the total costs incurred or resources consumed in that period."
  • D3: "Productivity is a critical determinant of cost efficiency"
  • D4: "An economic measure of output per unit of input. Inputs include labor and capital, while output is typically measured in revenues and other GDP components."
  • D5: "Productivity is measured and tracked by many economists as a clue for predicting future levels of GDP growth."
  • D6: "Productivity gains are vital to the economy because they allow us to accomplish more with less."
  • D7: "Productivity is the ratio of output to inputs in production; it is an average measure of the efficiency of production."
  • D8: "The rate at which radiant energy is used by producers to form organic substances as food for consumers"
  • D9: "Productivity is commonly defined as a ratio between the output volume and the volume of inputs."

The following analysis does not consider all words in these documents, but focuses on the following list of already stemmed terms:

['measur', 'effici', 'machin', 'factori', 'system', 'input', 'output', 'averag', 'cost', 'resourc', 'consum', 'econom', 'labor', 'revenu', 'gdp', 'predict', 'futur', 'growth', 'gain', 'accomplish', 'energi', 'produc', 'food']

The table to the right shows the respective term-document matrix.

LSA 3
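
As a sketch of how such a matrix could be assembled, the snippet below counts the occurrences of the stemmed terms in D1–D9. For brevity it matches each word against the stem list by simple prefix comparison (plus a trailing-y-to-i rule) rather than running a full stemmer, and it skips the word "productivity" itself, which is not in the term list above; both are simplifications for illustration, so the resulting counts may differ slightly from the table.

```python
import re
import numpy as np

terms = ['measur', 'effici', 'machin', 'factori', 'system', 'input', 'output',
         'averag', 'cost', 'resourc', 'consum', 'econom', 'labor', 'revenu',
         'gdp', 'predict', 'futur', 'growth', 'gain', 'accomplish', 'energi',
         'produc', 'food']

documents = [
    "A measure of the efficiency of a person, machine, factory, system, etc., "
    "in converting inputs into useful outputs",
    "Productivity is computed by dividing average output per period by the "
    "total costs incurred or resources consumed in that period.",
    "Productivity is a critical determinant of cost efficiency",
    "An economic measure of output per unit of input. Inputs include labor and "
    "capital, while output is typically measured in revenues and other GDP components.",
    "Productivity is measured and tracked by many economists as a clue for "
    "predicting future levels of GDP growth.",
    "Productivity gains are vital to the economy because they allow us to "
    "accomplish more with less.",
    "Productivity is the ratio of output to inputs in production; it is an "
    "average measure of the efficiency of production.",
    "The rate at which radiant energy is used by producers to form organic "
    "substances as food for consumers",
    "Productivity is commonly defined as a ratio between the output volume and "
    "the volume of inputs.",
]

def normalise(token: str) -> str:
    # crude stand-in for stemming: a trailing 'y' becomes 'i' (factory -> factori)
    return token[:-1] + "i" if token.endswith("y") else token

# A[i, j] = number of times term i occurs in document j.
A = np.zeros((len(terms), len(documents)))
for j, doc in enumerate(documents):
    for token in re.split(r"[^a-z]+", doc.lower()):
        if token == "productivity":      # not one of the counted terms
            continue
        token = normalise(token)
        for i, stem in enumerate(terms):
            if token.startswith(stem):
                A[i, j] += 1
                break
```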

The TFIDF matrix

In the next step, the raw counts in the matrix are reweighted so that rare words count more heavily than frequently used words. In this way, a word that occurs in only a small number of documents is weighted more heavily than a word that occurs in most of the documents. A common weighting method is called TFIDF (Term Frequency – Inverse Document Frequency), which replaces the count in each cell according to the following formula:


\[TFIDF_{i,j}=\frac{N_{i,j}}{N_{*,j}}*\log\frac{D}{D_i}\]

where


  • \(N_{i,j}\) = the number of times word \(i\) appears in document \(j\) (the original cell count).

  • \(N_{*,j}\) = the total number of words in document \(j\) (the sum of the counts in column \(j\)).

  • \(D\) = the number of documents (the number of columns).

  • \(D_i\) = the number of documents in which word \(i\) appears (number of non-zero columns in row \(i\)).


In this way, both words that are concentrated in particular documents and words that appear in only a few documents are emphasized: the former by the \(\frac{N_{i,j}}{N_{*,j}}\) ratio, the latter by the \(\log \frac{D}{D_i}\) term.
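
As an illustration with made-up numbers (the base of the logarithm is not fixed by the method; the natural logarithm is assumed here): if a word occurs twice in a document that contains ten counted words in total, and appears in three of the nine documents, its weight becomes

\[TFIDF_{i,j}=\frac{2}{10}\cdot\log\frac{9}{3}\approx 0.2\cdot 1.10\approx 0.22\]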



The table to the right shows the TFIDF matrix.

LSA 4
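
Continuing the sketch from above, the count matrix A can be turned into the TFIDF matrix in a few lines; the use of the natural logarithm is, again, an assumption.

```python
import numpy as np

def tfidf(A: np.ndarray) -> np.ndarray:
    """Reweight a term-document count matrix (terms in rows, documents in columns).

    Assumes every word appears in at least one document and every document
    contains at least one counted word, as is the case here.
    """
    words_per_doc = A.sum(axis=0)                  # N_{*,j}: counted words in document j
    docs_per_word = np.count_nonzero(A, axis=1)    # D_i: documents containing word i
    n_docs = A.shape[1]                            # D: number of documents
    tf = A / words_per_doc                         # N_{i,j} / N_{*,j}
    idf = np.log(n_docs / docs_per_word)           # log(D / D_i)
    return tf * idf[:, np.newaxis]                 # weight each row by its idf

W = tfidf(A)   # the TFIDF matrix used in the next step
```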

Singular value decomposition

In the third step, an algorithm called Singular Value Decomposition (SVD) is used to generate a reduced-dimensional representation of the TFIDF matrix which emphasizes the strongest relationships between words and documents and discards the unneeded noise. In other words, it makes the best possible reconstruction of the matrix with the least possible information: values that do not add information are thrown out, while strong patterns and trends, which do, are emphasized. SVD does, however, require a decision about how many dimensions or "concepts" to use when approximating the matrix, and there is no exact method for finding the right number. If the corpus of documents is large, typical numbers of dimensions range from 100 to 500; in small cases, like this example, just a few dimensions will do.

SVD decomposes the matrix into the product of three other matrices, \(T * S * P^T\), where the diagonal matrix \(S\) contains the singular values of the matrix. Keeping only the \(k\) largest singular values and their associated vectors yields the reduced representation \(T_k * S_k * P_k^T\). The rows of the \(T_k\) matrix represent the term vectors and the rows of the \(P_k\) matrix represent the document vectors.
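
A minimal sketch of this decomposition with NumPy, keeping \(k = 3\) dimensions as in the example below; the variable names mirror the notation above.

```python
import numpy as np

k = 3
T, s, Pt = np.linalg.svd(W, full_matrices=False)   # W = T @ diag(s) @ Pt, s sorted descending

T_k = T[:, :k]           # term vectors: one row per word
S_k = np.diag(s[:k])     # the k largest singular values on the diagonal
P_k = Pt[:k, :].T        # document vectors: one row per document

W_k = T_k @ S_k @ P_k.T  # best rank-k reconstruction of the TFIDF matrix
```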

The \(S_k\) matrix of singular values can provide information about how many dimensions or "concepts" should be considered. One way to do this is to plot the squared singular values, as shown on the right; in this example, the first value appears to add significantly more information to the analysis than the others. However, this first dimension mainly reflects absolute magnitude: for documents, it corresponds to the length of the document, and for words, to the number of times the word is used across all documents. To get more meaningful information, it can therefore make sense to ignore the first dimension and consider some of the following ones.

In this case, the focus is on the second and third dimensions. With the SVD computed for \(k = 3\) and the first value in each vector ignored, the second and third entries of each row of the \(T_k\) matrix provide the coordinates of the corresponding word in the concept space, and the second and third entries of each row of the \(P_k\) matrix provide the coordinates of the corresponding document. Plotting this concept space, as shown below, reveals which words are clustered in the vicinity of which documents (sentences), and vice versa, which documents are indicative of which words.
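
A sketch of how such a plot could be produced with matplotlib, continuing from the variables above and using the second and third columns of \(T_k\) and \(P_k\) as coordinates. (Scaling the coordinates by the singular values is another common convention; plain coordinates are used here for simplicity.)

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Words: dimensions 2 and 3 of the term vectors (the first dimension is ignored).
ax.scatter(T_k[:, 1], T_k[:, 2], color="tab:blue", label="terms")
for i, term in enumerate(terms):
    ax.annotate(term, (T_k[i, 1], T_k[i, 2]))

# Documents: dimensions 2 and 3 of the document vectors.
ax.scatter(P_k[:, 1], P_k[:, 2], color="tab:red", label="documents")
for j in range(P_k.shape[0]):
    ax.annotate(f"D{j + 1}", (P_k[j, 1], P_k[j, 2]))

ax.set_xlabel("dimension 2")
ax.set_ylabel("dimension 3")
ax.legend()
plt.show()
```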



LSA 6
LSA 5