This article was published as a part of the Data Science Blogathon
In the previous article, we started with understanding the basic terminologies of text in Natural Language Processing (NLP), what topic modeling is, its applications, the types of models, and the different topic modeling techniques available.
Let’s continue from there and explore Latent Dirichlet Allocation (LDA), how LDA works, and its similarity to another very popular dimensionality reduction technique called Principal Component Analysis (PCA).
Table of Contents
- A Little Background about LDA
- Latent Dirichlet Allocation (LDA) and its Process
- How does LDA work and how will it derive the particular distributions?
- Vector Space of LDA
- How will LDA optimize the distributions?
- LDA is an Iterative Process
- The Similarity between LDA and PCA
A Little Background about LDA
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique to extract topics from a given corpus. The term latent conveys something that exists but is not yet developed. In other words, latent means hidden or concealed.
Now, the topics that we want to extract from the data are also “hidden topics”. They are yet to be discovered. Hence the term “latent” in LDA. The “Dirichlet allocation” part is named after the Dirichlet distribution and the Dirichlet process.
Named after the German mathematician, Peter Gustav Lejeune Dirichlet, Dirichlet processes in probability theory are “a family of stochastic processes whose realizations are probability distributions.”
This process is a distribution over distributions, meaning that each draw from a Dirichlet process is itself a distribution. What this implies is that a Dirichlet process is a probability distribution wherein the range of this distribution is itself a set of probability distributions!
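To make this concrete, here is a minimal sketch of my own (using numpy and the plain Dirichlet distribution rather than the full Dirichlet process): every draw is itself a probability distribution, i.e. a vector of non-negative values that sums to 1.

```python
import numpy as np

rng = np.random.default_rng(42)

# Each draw from a Dirichlet distribution is itself a probability distribution:
# a vector of non-negative values that sums to 1. Here we take three draws
# over 6 hypothetical "topics" with a small symmetric concentration parameter,
# so each draw concentrates its mass on only a few topics.
alpha = [0.1] * 6
for draw in rng.dirichlet(alpha, size=3):
    print(np.round(draw, 3), "sums to", round(draw.sum(), 3))
```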
Okay, that’s great! But how is the Dirichlet process helpful for us in retrieving the topics from the documents? Remember from part 1 of the blog, topics or themes are a group of statistically significant words within a corpus.
So, without going into the technicalities of the process, in our context the Dirichlet model describes the pattern of words that repeat together, occur frequently, and are similar to each other.
This stochastic process uses Bayesian inference to express prior knowledge about the distribution of random variables. In the case of topic modeling, the process helps in estimating the chances that the words spread over a document will occur again. This enables the model to build data points and estimate probabilities, which is why LDA is a breed of generative probabilistic model.
LDA generates probabilities for the words, from which the topics are formed, and eventually the topics are assigned to the documents.
Latent Dirichlet Allocation (LDA) and its Process
A tool and technique for Topic Modeling, Latent Dirichlet Allocation (LDA) classifies or categorizes the text into topics per document and words per topic; both are modeled based on Dirichlet distributions and processes.
The LDA makes two key assumptions:
- Documents are a mixture of topics, and
- Topics are a mixture of tokens (or words)
These topics, in turn, generate the words according to their probability distributions. In statistical language, the documents are probability densities (or distributions) of topics and the topics are probability densities (or distributions) of words.
Now, how does LDA work and how will it derive the particular distributions?
Firstly, LDA applies the above two important assumptions to the given corpus. Let’s say we have the corpus with the following five documents:
- Document 1: I want to watch a movie this weekend.
- Document 2: I went shopping yesterday. New Zealand won the World Test Championship by beating India by eight wickets at Southampton.
- Document 3: I don’t watch cricket. Netflix and Amazon Prime have very good movies to watch.
- Document 4: Movies are a nice way to chill however, this time I would like to paint and read some good books. It’s been so long!
- Document 5: This blueberry milkshake is so good! Try reading Dr. Joe Dispenza’s books. His work is such a game-changer! His books helped to learn so much about how our thoughts impact our biology and how we can all rewire our brains.
Any corpus, which is the collection of documents, can be represented as a document-word matrix (or document-term matrix), also known as a DTM.
We know the first step with the text data is to clean, preprocess and tokenize the text to words. After preprocessing the documents, we get the following document word matrix where:
- D1, D2, D3, D4, and D5 are the five documents, and
- the words are represented by Ws; say there are 8 unique words, from W1 to W8.
Hence, the shape of the matrix is 5 * 8 (five rows and eight columns):
So, the corpus is now essentially the above preprocessed document-word matrix, in which every row is a document and every column is a token (or word).
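As an illustrative sketch (assuming sklearn's CountVectorizer handles the cleaning and tokenization; the resulting vocabulary will be larger than the 8 words used in the illustration above), the document-word matrix could be built like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I want to watch a movie this weekend.",
    "I went shopping yesterday. New Zealand won the World Test Championship "
    "by beating India by eight wickets at Southampton.",
    "I don't watch cricket. Netflix and Amazon Prime have very good movies to watch.",
    "Movies are a nice way to chill however, this time I would like to paint "
    "and read some good books. It's been so long!",
    "This blueberry milkshake is so good! Try reading Dr. Joe Dispenza's books.",
]

# Build the document-word (document-term) matrix: one row per document,
# one column per token; English stop words are dropped during preprocessing.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

print(dtm.shape)                               # (5, vocabulary_size)
print(vectorizer.get_feature_names_out()[:8])  # a peek at the token columns
```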
LDA converts this document-word matrix into two other matrices: the Document-Topic matrix and the Topic-Word matrix, as shown below:
These two matrices are:
- The Document-Topic matrix, which contains the possible topics (represented by K above) that the documents can contain. Here, suppose we have 6 topics and 5 documents, so the matrix has a dimension of 5*6.
- The Topic-Word matrix, which contains the words (or terms) that those topics can contain. With 6 topics and 8 unique tokens in the vocabulary, this matrix has a shape of 6*8.
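Continuing the sketch above (this is only an illustration, assuming the `dtm` from the previous snippet and 6 topics as in the example), sklearn's LatentDirichletAllocation produces exactly these two matrices:

```python
from sklearn.decomposition import LatentDirichletAllocation

# Factorize the document-word matrix into the two smaller matrices
lda = LatentDirichletAllocation(n_components=6, random_state=42)

doc_topic = lda.fit_transform(dtm)   # Document-Topic matrix: (n_documents, 6)
topic_word = lda.components_         # Topic-Word matrix:     (6, vocabulary_size)

print(doc_topic.shape, topic_word.shape)
```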
Vector Space of LDA
The entire LDA space and its dataset are represented by the diagram below:
Source: researchgate.net
- The yellow box refers to all the documents in the corpus (represented by M). In our case, M = 5 as we have 5 documents.
- Next, the peach color box is the number of words in a document, given by N
- Inside this peach box, there can be many words. One of those words is w, which is in the blue color circle.
According to LDA, every word is associated (or related) with a latent (or hidden) topic, which here is denoted by Z. The assignment of a topic Z to every word in a document follows that document's topic distribution, which is represented by theta (𝛳).
The LDA model has two parameters that control the distributions:
- Alpha (ɑ) controls the per-document topic distribution, and
- Beta (ꞵ) controls the per-topic word distribution (see the code sketch after the summary below)
To summarize:
- M: is the total documents in the corpus
- N: is the number of words in the document
- w: is the Word in a document
- z: is the latent topic assigned to a word
- theta (𝛳): is the per-document topic distribution
- LDA model's parameters: Alpha (ɑ) and Beta (ꞵ)
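For instance, in gensim's LdaModel these two priors are exposed as the `alpha` and `eta` arguments (eta corresponds to the Beta parameter above). Here is a minimal sketch, assuming a small, already-tokenized toy corpus:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical preprocessed, tokenized documents
texts = [
    ["watch", "movie", "weekend"],
    ["zealand", "won", "test", "championship", "india", "wickets"],
    ["watch", "cricket", "netflix", "amazon", "prime", "movies"],
    ["movies", "paint", "read", "books"],
    ["blueberry", "milkshake", "reading", "books", "brains"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=6,
    alpha="auto",   # per-document topic distribution prior (Alpha above)
    eta="auto",     # per-topic word distribution prior (Beta above)
    random_state=42,
    passes=10,
)

print(lda.print_topics(num_words=4))
```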
How will LDA optimize the distributions?
The end goal of LDA is to find the most optimal representation of the Document-Topic matrix and the Topic-Word matrix, i.e., the most optimized document-topic and topic-word distributions.
As LDA assumes that documents are a mixture of topics and topics are a mixture of words, it backtracks from the document level to identify which topics would have generated these documents and which words would have generated those topics.
Now, our corpus has 5 documents (D1 to D5), each with its respective words:
- D1 = (w1, w2, w3, w4, w5, w6, w7, w8)
- D2 = (w'1, w'2, w'3, w'4, w'5, w'6, w'7, w'8, w'9, w'10)
- D3 = (w''1, w''2, w''3, w''4, w''5, w''6, w''7, w''8, w''9, w''10, w''11, w''12, w''13, w''14, w''15)
- D4 = (w'''1, w'''2, w'''3, w'''4, w'''5, w'''6, w'''7, w'''8, w'''9, w'''10, w'''11, w'''12)
- D5 = (w''''1, w''''2, w''''3, w''''4, w''''5, w''''6, w''''7, w''''8, w''''9, w''''10, …, w''''32, w''''33, w''''34)
LDA is an iterative process
The first iteration of LDA:
In the first iteration, LDA randomly assigns a topic to each word in each document. The topics are represented by the letter k. So, in our corpus, the words in the documents will be associated with some random topics like below:
- D1 = (w1 (k5), w2 (k3), w3 (k1), w4 (k2), w5 (k5), w6 (k4), w7 (k7), w8 (k1))
- D2 = (w'1 (k2), w'2 (k4), w'3 (k2), w'4 (k1), w'5 (k2), w'6 (k1), w'7 (k5), w'8 (k3), w'9 (k7), w'10 (k1))
- D3 = (w''1 (k3), w''2 (k1), w''3 (k5), w''4 (k3), w''5 (k4), w''6 (k1), …, w''13 (k1), w''14 (k3), w''15 (k2))
- D4 = (w'''1 (k4), w'''2 (k5), w'''3 (k3), w'''4 (k6), w'''5 (k5), w'''6 (k3), …, w'''10 (k3), w'''11 (k7), w'''12 (k1))
- D5 = (w''''1 (k1), w''''2 (k7), w''''3 (k2), w''''4 (k8), w''''5 (k1), w''''6 (k8), …, w''''32 (k3), w''''33 (k6), w''''34 (k5))
This gives, as output, documents composed of topics and topics composed of words:
The documents are the mixture of the topics:
- D1 = k5 + k3 + k1 + k2 + k5 + k4 + k7 + k1
- D2 = k2 + k4 + k2 + k1 + k2 + k1 + k5 + k3 + k7 + k1
- D3 = k3 + k1 + k5 + k3 + k4 + k1 + … + k1 + k3 + k2
- D4 = k4 + k5 + k3 + k6 + k5 + k3 + … + k3 + k7 + k1
- D5 = k1 + k7 + k2 + k8 + k1 + k8 + … + k3 + k6 + k5
The topics are the mixture of the words:
- K1 = w3 + w8 + w'4 + w'6 + w'10 + w''2 + w''6 + … + w''13 + w'''12 + w''''1 + w''''5
- K2 = w4 + w'1 + w'3 + w''15 + … + w''''3 + …
- K3 = w2 + w'8 + w''1 + w''4 + w''14 + w'''3 + w'''6 + … + w'''10 + w''''32 + …
Similarly, LDA will give the word combinations for other topics.
Post the first iteration of LDA:
After the first iteration, LDA provides the initial document-topic and topic-word matrices. The task at hand is to optimize these results, which LDA does by iterating over all the documents and all the words.
LDA then makes another assumption: all the topic assignments made so far are correct, except for the current word. So, based on those already-made topic-word assignments, LDA tries to correct and adjust the topic assignment of the current word with a new assignment, for which:
LDA will iterate over: each document ‘D’ and each word ‘w’
How would it do that? It does so by computing two probabilities: p1 and p2 for every topic (k) where:
P1: proportion of words in the document (D) that are currently assigned to the topic (k)
P2: proportion of assignments to the topic (k), over all the documents, that come from this word (w). In other words, p2 is the proportion of those documents in which the word (w) is also assigned to the topic (k)
The formulas for p1 and p2 are:
- p1 = proportion(topic k / document D), and
- p2 = proportion(word w / topic k)
Now, using these probabilities p1 and p2, LDA computes a new probability, the product p1*p2, and through this product probability it identifies the new, most relevant topic for the current word.
Reassignment of word ‘w’ of the document ‘D’ to a new topic ‘k’ via the product probability of p1 * p2
The LDA then performs this step of choosing a new topic ‘k’ over a large number of iterations, until a steady state is obtained. Convergence is reached when LDA gives the most optimized representation of the document-topic and topic-word matrices.
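The snippet below is a toy, simplified sketch of this iterative reassignment (a collapsed-Gibbs-style update), not a faithful reproduction of any library's implementation; the corpus, priors, and number of topics are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus: each document is a list of word ids (after preprocessing)
docs = [[0, 1, 2, 2], [1, 3, 4], [4, 5, 5, 0]]
V, K = 6, 2                  # vocabulary size and number of topics (assumed)
alpha, beta = 0.1, 0.01      # document-topic and topic-word priors

# First iteration: randomly assign a topic to every word
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]

# Build the count matrices implied by the random assignments
doc_topic = np.zeros((len(docs), K))
topic_word = np.zeros((K, V))
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        doc_topic[d, z[d][i]] += 1
        topic_word[z[d][i], w] += 1

# Iterate: for each word, remove its assignment, compute p1 * p2, and resample
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            doc_topic[d, k_old] -= 1
            topic_word[k_old, w] -= 1

            p1 = doc_topic[d] + alpha                                          # proportion(topic k / document D)
            p2 = (topic_word[:, w] + beta) / (topic_word.sum(axis=1) + V * beta)  # proportion(word w / topic k)
            p = p1 * p2
            k_new = rng.choice(K, p=p / p.sum())   # pick the new topic via p1 * p2

            z[d][i] = k_new
            doc_topic[d, k_new] += 1
            topic_word[k_new, w] += 1

print(doc_topic)    # document-topic counts after the iterations
print(topic_word)   # topic-word counts after the iterations
```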
This completes the working and the process of Latent Dirichlet Allocation.
Now, before moving on to the implementation of LDA, here’s one last thing …
The Similarity between LDA and PCA
Topic Modeling is similar to Principal Component Analysis (PCA). You may be wondering how is that? Allow me to explain.
So, PCA is a dimensionality reduction technique, right? And it is used for data having numerical values. It derives components as linear combinations of the original variables, and these components are then used to build the model. PCA works by decomposing a larger matrix into smaller ones (for instance, via singular value decomposition) to reduce the dimensions.
LDA operates in much the same way as PCA does, but is applied to text data. It works by decomposing the corpus document-word matrix (the larger matrix) into two smaller matrices: the Document-Topic matrix and the Topic-Word matrix. Therefore, LDA, like PCA, is a matrix factorization technique.
Let’s say we have the following corpus of documents having the tokenized words freekick, dunk, rebound, foul, shoot, NBA, Liverpool as below:
Source: miro.medium.com
If we do not apply LDA or any other topic modeling technique, then we are left with the tokenized words (after the text preprocessing steps) shown on the left panel of the above image. These words would then be passed as features to build a model, and there may be even more features than shown in the image.
Now, instead of using the tokenized words obtained after applying the Bag of Words vectorizer, we transform the initial document-word matrix into two parts: one for the words per topic and the other for the topics per document.
See, in the above image, rather than using the words: freekick, dunk, rebound, foul, shoot, NBA, Liverpool, we have combined similar words into two groups: Soccer and Basketball.
This gave us two different topics: one on football where Soccer consists of the words freekick, shoot, and Liverpool, and the other topic Basketball consisting of the words dunk, rebound, foul, and NBA.
This breaking up of the initial, larger word-feature matrix not only enables us to group the text into topics but also substantially reduces the number of features used to build the model.
LDA breaks the corpus document-word matrix into lower-dimensional matrices. Therefore, topic modeling and its techniques can also be used for dimensionality reduction.
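As a sketch of this dimensionality-reduction view (with a made-up corpus mirroring the soccer/basketball example and an assumed 2 topics), the document-topic matrix produced by LDA can replace the original word columns as the feature set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "freekick shoot Liverpool freekick",
    "dunk rebound foul NBA",
    "Liverpool shoot freekick foul",
    "NBA dunk rebound dunk",
]

# Original features: one column per word (7 columns here)
dtm = CountVectorizer().fit_transform(docs)

# Reduced features: one column per topic (2 columns here)
doc_topic = LatentDirichletAllocation(
    n_components=2, random_state=0
).fit_transform(dtm)

print(dtm.shape, "->", doc_topic.shape)   # (4, 7) -> (4, 2)
```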
Endnotes
Latent Dirichlet Allocation (LDA) does two tasks: it finds the topics from the corpus, and at the same time, assigns these topics to the document present within the same corpus. The below schematic diagram summarizes the process of LDA well:
Source: researchgate.net
We will apply all the theory we have learned and implement it in Python using the gensim and sklearn packages in the next and last part of the Topic Modeling series.
References:
- https://en.wikipedia.org/wiki/Dirichlet_process
- Dirichlet Process by Yee Whye Teh, University College London
Thank You for reading and Happy Learning! 🙂
About me
Hi there! I am Neha Seth, a technical writer for AnalytixLabs. I hold a Postgraduate Program in Data Science & Engineering from the Great Lakes Institute of Management and a Bachelors in Statistics. I have been featured as Top 10 Most Popular Guest Authors in 2020 on Analytics Vidhya (AV).
My area of interest lies in NLP and Deep Learning. I have also passed the CFA Program. You can reach out to me on LinkedIn and can read my other blogs for ALabs and AV.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.