Latent Dirichlet allocation (LDA)
Topic modeling is a method for the unsupervised classification of text documents, similar to clustering on numeric data: it finds natural groups of items even when we are not sure what we are looking for. In clustering, an entity can belong to only one group, whereas in topic modeling a word can belong to multiple groups/clusters with varying levels of probability. The input to the model is a text document or a set of documents. The output of the model splits the documents into K groups, and a topic is then determined for each group based on the association of the most important words in that group. The number of topics K, which is equivalent to the number of clusters in cluster analysis, has to be selected using various heuristics about how many topics might plausibly be extracted from the document(s). LDA treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” with each other in terms of content, rather than being separated into discrete groups.
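For concreteness, here is a minimal sketch of fitting such a model with scikit-learn's LatentDirichletAllocation; the toy documents and the choice of K = 4 are illustrative assumptions, not part of any particular dataset.

```python
# Minimal LDA sketch (assumes scikit-learn is installed);
# the documents and K = 4 are illustrative choices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the stock market fell on fears of rising interest rates",
    "the team won the championship after a dramatic final game",
    "scientists discovered a new species in the rain forest",
    "investors moved money into bonds as equities slid",
]

# LDA works on raw word counts (a document-term matrix), not tf-idf.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

# K must be chosen up front, just like the number of clusters in clustering.
K = 4
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic = lda.fit_transform(dtm)

# Each document is a mixture of topics: each row of doc_topic sums to 1,
# so documents can "overlap" across topics instead of falling into one group.
print(doc_topic[0])
```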
As output of the LDA model, if we decide to find K topics, the set of documents is segregated into K groups. Each key word (token) in a group receives a beta value describing how strongly that word is associated with the group; in LDA terms, beta is the per-topic word probability. The larger the beta value, the more important the word is within that group. The top 6-10 words with the largest beta values are chosen to decide the topic depicted by that group of words. The topic label itself is assigned by human judgment, by interpreting the meaning of those words in the underlying context of the collected documents.
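Continuing the scikit-learn sketch above, the per-topic beta values can be recovered by normalizing the fitted word weights; printing the top 8 words per topic (the cutoff is an illustrative choice) gives the word lists a human would inspect to label each topic.

```python
# Continuation of the sketch above; assumes `vectorizer` and `lda`
# from the previous block are still in scope.
import numpy as np

words = vectorizer.get_feature_names_out()

# lda.components_ holds per-topic word weights; normalizing each row
# yields the per-topic word probabilities (the "beta" values).
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

for k, topic in enumerate(beta):
    top = np.argsort(topic)[::-1][:8]  # indices of the 8 largest betas
    print(f"Topic {k}:", ", ".join(words[i] for i in top))
```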
How to determine the number of topics from a set of documents