This post follows up on the series of posts in Topic Modeling for text analytics. Previously, we looked at the LDA (Latent Dirichlet Allocation) topic modeling library available within MLlib in PySpark. While LDA is a very capable tool, here we look at a more scalable and state-of-the-art technique called BigARTM. LDA is based on a two-level Bayesian generative model that assumes a Dirichlet distribution for the topic and word distributions. BigARTM (available on GitHub and at https://bigartm.org) is an open source project based on Additive Regularization of Topic Models (ARTM), a non-Bayesian regularized model that aims to simplify the topic inference problem. BigARTM is motivated by the premise that the Dirichlet prior assumptions conflict with the notion of sparsity in our document topics, and that trying to account for this sparsity leads to overly complex models. Here, we will illustrate the basic principles behind BigARTM and how to apply it to the Daily Kos dataset.
Why BigARTM over LDA?
As mentioned above, BigARTM is a probabilistic non-Bayesian approach, as opposed to the Bayesian LDA approach. According to Konstantin Vorontsov’s and Anna Potapenko’s paper on additive regularization, the assumption of a Dirichlet prior in LDA does not align with the real-life sparsity of topic distributions in a document. Unlike LDA, BigARTM does not attempt to build a fully generative model of text; instead, it chooses to optimize certain criteria using regularizers. These regularizers do not require any probabilistic interpretation. As a result, multi-objective topic models are easier to formulate with BigARTM.
Overview of BigARTM
Problem statement
We are trying to learn a set of topics from a corpus of documents. The topics would consist of a set of words that make semantic sense. The goal here is that the topics would summarize the set of documents. In this regard, let us summarize the terminology used in the BigARTM paper:
D = collection of texts; each document ‘d’ is an element of D and is a sequence of ‘nd’ words (w1, …, wnd)
W = vocabulary, i.e., the collection of distinct words
T = set of topics; a document ‘d’ is supposed to be made up of a number of topics ‘t’ from T
We sample from the probability space spanned by words (W), documents (D) and topics (T). The words and documents are observed, but the topics are latent variables.
The term ‘ndw’ refers to the number of times the word ‘w’ appears in the document ‘d’.
There is an assumption of conditional independence that each topic generates the words independent of the document. This gives us
p(w|t) = p(w|t,d)
The problem can be summarized by the following equation, in which the sum runs over all topics t in T:
p(w|d) = Σt p(w|t) p(t|d)
What we are really trying to infer are the probabilities within the summation term, i.e., the mixture of topics in a document, p(t|d), and the mixture of words in a topic, p(w|t). Each document can be considered to be a mixture of domain-specific topics and background topics. Background topics are those that show up in every document and have a rather uniform per-document distribution of words. Domain-specific topics, however, tend to be sparse.
Stochastic factorization
Through stochastic matrix factorization, we infer the probability product terms in the equation above. The product terms are now represented as matrices. Keep in mind that this process results in non-unique solutions as a result of the factorization; hence, the learned topics would vary depending on the initialization used for the solutions.
We create a data matrix F = [fwd] of dimension WxD, where each element fwd is the count of word ‘w’ in document ‘d’ divided by the number of words in the document ‘d’ (i.e., fwd = ndw / nd). The matrix F can be stochastically decomposed into two matrices ∅ and θ so that:
F ≈ [∅] [θ]
[∅] corresponds to the matrix of word probabilities for topics, WxT
[θ] corresponds to the matrix of topic probabilities for the documents, TxD
All three matrices are stochastic and the columns are given by:
[∅]t which represents the words in a topic and,
[θ]d which represents the topics in a document respectively.
The number of topics is usually far smaller than the number of documents or the number of words.
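As a rough illustration of the factorization above, the following pure-Python sketch builds F from a toy count matrix and checks that the product of two column-stochastic matrices is itself column-stochastic. The counts, the matrix sizes and the helper names are all made up for this example; the ∅ and θ used here are random, not fitted.

```python
import random

def normalize(values):
    """Scale non-negative numbers so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def random_stochastic_columns(n_rows, n_cols, rng):
    """Return an n_rows x n_cols row-major matrix whose columns each sum to 1."""
    columns = [normalize([rng.random() + 1e-9 for _ in range(n_rows)])
               for _ in range(n_cols)]
    return [[columns[c][r] for c in range(n_cols)] for r in range(n_rows)]

def matmul(a, b):
    """Plain-Python product of row-major matrices a and b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_sums(m):
    return [sum(row[j] for row in m) for j in range(len(m[0]))]

# Toy counts n_dw: rows are words, columns are documents (W=4, D=3).
counts = [[2, 0, 1],
          [1, 3, 0],
          [0, 1, 4],
          [1, 0, 1]]
n_d = column_sums(counts)  # number of words in each document
F = [[counts[w][d] / n_d[d] for d in range(3)] for w in range(4)]

rng = random.Random(0)
phi = random_stochastic_columns(4, 2, rng)    # W x T, columns are p(w|t)
theta = random_stochastic_columns(2, 3, rng)  # T x D, columns are p(t|d)
approx = matmul(phi, theta)                   # W x D, entries are sum_t p(w|t) p(t|d)

print(column_sums(F))
print(column_sums(approx))
```

Fitting a topic model amounts to choosing ∅ and θ so that the product is close to F; this sketch only demonstrates the shapes and the stochasticity constraints.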
LDA
In LDA, the matrices ∅ and θ have columns [∅]t and [θ]d that are assumed to be drawn from Dirichlet distributions with hyperparameters given by β and α respectively.
β = [βw], which is a hyperparameter vector with one entry per word
α = [αt], which is a hyperparameter vector with one entry per topic
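To get a feel for how the Dirichlet hyperparameters shape these columns, here is a small sketch that samples from a symmetric Dirichlet using only the standard library (a Dirichlet draw can be obtained by normalizing independent Gamma draws). The concentration values 0.1 and 10.0 and the sample count are arbitrary choices for illustration, not values from the post.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def mean_largest_component(concentration, num_topics, num_samples, rng):
    """Average, over samples, of the largest entry in a symmetric Dirichlet draw."""
    alpha = [concentration] * num_topics
    return sum(max(sample_dirichlet(alpha, rng))
               for _ in range(num_samples)) / num_samples

rng = random.Random(42)
theta_d = sample_dirichlet([0.1] * 15, rng)  # one column of theta for 15 topics

# Small concentration values give peaky columns, large values give flat ones.
peaky = mean_largest_component(0.1, 15, 500, rng)
flat = mean_largest_component(10.0, 15, 500, rng)
print(round(sum(theta_d), 6), peaky > flat)
```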
Likelihood and additive regularization
The log-likelihood we would like to maximize to obtain the solution is given by the equation below. This is the same as the objective function in Probabilistic Latent Semantic Analysis (PLSA) and will be the starting point for BigARTM.
L(∅,θ) = Σd Σw ndw ln p(w|d) = Σd Σw ndw ln Σt ∅wt θtd
We are maximizing the log of the product of the probabilities of every word in each document here. Expanding p(w|d) as a sum over topics, using the conditional independence assumption, results in the summation terms seen on the right side of the equation above. Now for BigARTM, we add ‘r’ regularizer terms, which are the regularizer coefficients τi multiplied by a function of ∅ and θ:
R(∅,θ) = Σi τi Ri(∅,θ), for i = 1, …, r
where Ri is a regularizer function that can take a few different forms depending on the type of regularization we seek to incorporate. The two common types are:
- Smoothing regularization
- Sparsing regularization
In both cases, we use the KL Divergence as a function for the regularizer. We can combine these two regularizers to meet a variety of objectives. Some of the other types of regularization techniques are decorrelation regularization and coherence regularization (see eq. 34 and eq. 40 in http://machinelearning.ru/wiki/images/4/47/Voron14mlj.pdf). The final objective function then becomes the following:
L(∅,θ) + Regularizer
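The objective can be sketched in a few lines of plain Python. This is only an illustration of the formula under stated assumptions, not how BigARTM computes it internally: the toy matrices are made up, and `toy_regularizer` is a hypothetical stand-in for a real regularizer function.

```python
import math

def log_likelihood(counts, phi, theta):
    """PLSA log-likelihood: sum over d, w of n_dw * ln(sum_t phi[w][t] * theta[t][d]).

    counts[w][d] = n_dw, phi[w][t] = p(w|t), theta[t][d] = p(t|d).
    """
    total = 0.0
    for w, row in enumerate(counts):
        for d, n_dw in enumerate(row):
            if n_dw == 0:
                continue  # words that do not occur contribute nothing
            p_wd = sum(phi[w][t] * theta[t][d] for t in range(len(theta)))
            total += n_dw * math.log(p_wd)
    return total

def regularized_objective(counts, phi, theta, regularizers):
    """Log-likelihood plus the weighted sum of regularizers.

    regularizers is a list of (tau_i, R_i) pairs, where R_i(phi, theta) returns a float.
    """
    penalty = sum(tau * reg(phi, theta) for tau, reg in regularizers)
    return log_likelihood(counts, phi, theta) + penalty

def toy_regularizer(phi, theta):
    """Hypothetical R_i for illustration only: sum of ln(theta_td)."""
    return sum(math.log(p) for row in theta for p in row)

# Toy problem: W=2 words, T=2 topics, D=1 document.
counts = [[3], [1]]
phi = [[0.5, 0.9], [0.5, 0.1]]
theta = [[0.5], [0.5]]

base = log_likelihood(counts, phi, theta)
regularized = regularized_objective(counts, phi, theta, [(0.1, toy_regularizer)])
print(base, regularized)
```

Setting every coefficient to zero recovers the plain PLSA objective, which is why PLSA is the natural starting point.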
Smoothing regularization
Smoothing regularization is applied to smooth out background topics so that they have a uniform distribution relative to the domain-specific topics. For smoothing regularization, we
- Minimize the KL Divergence between terms [∅]t and a fixed distribution β
- Minimize the KL Divergence between terms [θ]d and a fixed distribution α
- Sum the two terms above to get the regularizer term
We want to minimize the KL Divergence here to make our topic and word distributions as close as possible to the desired α and β distributions respectively.
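The smoothing steps above can be sketched as follows. The direction of the divergence, KL(β || [∅]t) and KL(α || [θ]d), is an assumption made for this example, as are the toy distributions and the function names; the sketch only shows that columns close to the target distributions yield a smaller penalty.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * ln(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def smoothing_penalty(phi_columns, theta_columns, beta, alpha):
    """Sum of KL(beta || phi_t) over topics plus KL(alpha || theta_d) over documents."""
    phi_term = sum(kl_divergence(beta, col) for col in phi_columns)
    theta_term = sum(kl_divergence(alpha, col) for col in theta_columns)
    return phi_term + theta_term

beta = [0.25, 0.25, 0.25, 0.25]   # target word distribution (4 words)
alpha = [0.5, 0.5]                # target topic distribution (2 topics)

smooth_phi = [[0.25, 0.25, 0.25, 0.25], [0.3, 0.2, 0.3, 0.2]]
peaky_phi = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.01, 0.01, 0.97]]
theta_cols = [[0.5, 0.5], [0.6, 0.4]]

print(smoothing_penalty(smooth_phi, theta_cols, beta, alpha))
print(smoothing_penalty(peaky_phi, theta_cols, beta, alpha))
```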
Sparsing strategy for fewer topics
To get fewer topics we employ the sparsing strategy. This helps us to pick out domain-specific topic words as opposed to the background topic words. For sparsing regularization, we want to:
- Maximize the KL Divergence between the term [∅]t and a uniform distribution
- Maximize the KL Divergence between the term [θ]d and a uniform distribution
- Sum the two terms above to get the regularizer term
We are seeking to obtain word and topic distributions with minimum entropy (or less uncertainty) by maximizing the KL divergence from a uniform distribution, which has the highest entropy possible (highest uncertainty). This gives us ‘peakier’ distributions for our topic and word distributions.
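The link between entropy and the divergence from a uniform distribution can be checked numerically. For a distribution p over n outcomes and the uniform distribution u, KL(p || u) = ln(n) - H(p), so maximizing this divergence is the same as minimizing the entropy. The sketch below assumes this direction of the divergence and uses two made-up distributions.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_from_uniform(p):
    """KL(p || u) for the uniform distribution u over len(p) outcomes."""
    n = len(p)
    return sum(x * math.log(x * n) for x in p if x > 0)

flat = [0.25, 0.25, 0.25, 0.25]
peaky = [0.85, 0.05, 0.05, 0.05]

for name, dist in (('flat', flat), ('peaky', peaky)):
    # The two printed numbers always add up to ln(4) for these 4-outcome distributions.
    print(name, round(kl_from_uniform(dist), 4), round(entropy(dist), 4))
```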
Model quality
The ARTM model quality is assessed using the following measures:
- Perplexity: This decreases as the likelihood of the data given the model increases. The smaller the perplexity the better the model; however, a perplexity value of around 10 has been experimentally shown to give realistic documents.
- Sparsity: This measures the percentage of elements that are zero in the ∅ and θ matrices.
- Ratio of background words: A high ratio of background words indicates model degradation and is a good stopping criterion. This could be due to too much sparsing or elimination of topics.
- Coherence: This is used to measure the interpretability of a model. A topic is considered coherent if the most frequent words in the topic tend to appear together in the documents. Coherence is calculated using the Pointwise Mutual Information (PMI). The coherence of a topic is measured as follows:
- Get the ‘k’ most probable words for a topic (usually set to 10)
- Compute the Pointwise Mutual Information (PMI) for all pairs of words from the word list in the previous step
- Compute the average of all the PMIs
- Kernel size, purity and contrast: A kernel is defined as the subset of words in a topic that separates a topic from the others (i.e., Wt = {w: p(t|w) > δ}, where δ is set to about 0.25). The kernel size is set to be between 20 and 200. The terms purity and contrast are defined as:
purity(t) = Σw∈Wt p(w|t), which is the sum of the probabilities of all the words in the kernel for a topic
contrast(t) = (1/|Wt|) Σw∈Wt p(t|w), which is the average probability of the topic given each word in its kernel
For a topic model, higher values are better for both purity and contrast.
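Two of the measures above, coherence and the kernel-based scores, can be sketched in plain Python. This is an illustrative sketch and not BigARTM's implementation: probabilities for the PMI are estimated from document-level co-occurrence, the small `eps` that avoids ln(0) is an assumption, p(t|w) is obtained from p(w|t) with Bayes' rule under assumed topic probabilities, and the toy documents and matrix are made up.

```python
import math
from itertools import combinations

def pmi(word_a, word_b, documents, eps=1e-12):
    """PMI from document co-occurrence; both words are assumed to occur in the corpus."""
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    return math.log((p_ab + eps) / (p_a * p_b))

def topic_coherence(top_words, documents):
    """Average PMI over all pairs of a topic's most probable words."""
    pairs = list(combinations(top_words, 2))
    return sum(pmi(a, b, documents) for a, b in pairs) / len(pairs)

def topic_given_word(phi, topic_probs):
    """p(t|w) via Bayes' rule from phi[w][t] = p(w|t) and topic_probs[t] = p(t)."""
    result = []
    for row in phi:
        joint = [p_wt * p_t for p_wt, p_t in zip(row, topic_probs)]
        total = sum(joint)
        result.append([j / total for j in joint])
    return result

def kernel_scores(phi, topic_probs, topic, delta=0.25):
    """Return (kernel word indices, purity, contrast) for one topic."""
    p_tw = topic_given_word(phi, topic_probs)
    kernel = [w for w in range(len(phi)) if p_tw[w][topic] > delta]
    purity = sum(phi[w][topic] for w in kernel)
    contrast = sum(p_tw[w][topic] for w in kernel) / len(kernel) if kernel else 0.0
    return kernel, purity, contrast

# Toy corpus: each document is represented as a set of words.
docs = [{'iraq', 'war', 'troops'}, {'iraq', 'war'},
        {'poll', 'voters'}, {'poll', 'voters', 'kerry'}]
print(topic_coherence(['iraq', 'war'], docs), topic_coherence(['iraq', 'poll'], docs))

# Toy phi with W=3 words and T=2 topics, and uniform topic probabilities.
phi = [[0.8, 0.1], [0.1, 0.8], [0.1, 0.1]]
kernel, purity, contrast = kernel_scores(phi, [0.5, 0.5], topic=0)
print(kernel, purity, contrast)
```

In the library itself, these quantities are reported by scores such as the TopicKernelScore used later in this post, so a hand-rolled version like this is mainly useful for building intuition.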
Using the BigARTM library
Data files
The BigARTM library is available from the BigARTM website and the package can be installed via pip. Download the example data files and unzip them as shown below. The dataset we are going to use here is the Daily Kos dataset.
wget https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt.gz
wget https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt
gunzip docword.kos.txt.gz
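Before handing the files to BigARTM, it can help to know what they contain. The UCI bag-of-words format stores the number of documents, the vocabulary size and the number of non-zero counts on the first three lines, followed by one "docID wordID count" triple per line (IDs start at 1). The parser below is a minimal sketch run against a tiny made-up sample rather than the real file; `parse_docword` is a hypothetical helper name.

```python
from collections import defaultdict

def parse_docword(lines):
    """Parse UCI bag-of-words lines into (num_docs, vocab_size, num_nonzero, counts).

    counts[doc_id][word_id] holds n_dw, the count of word w in document d.
    """
    it = iter(lines)
    num_docs = int(next(it))
    vocab_size = int(next(it))
    num_nonzero = int(next(it))
    counts = defaultdict(dict)
    for line in it:
        if not line.strip():
            continue  # skip blank lines
        doc_id, word_id, count = map(int, line.split())
        counts[doc_id][word_id] = count
    return num_docs, vocab_size, num_nonzero, counts

# Tiny made-up sample in the same layout as docword.kos.txt.
sample = """2
3
4
1 1 2
1 3 1
2 2 5
2 3 1
""".splitlines()

num_docs, vocab_size, num_nonzero, counts = parse_docword(sample)
doc_lengths = {d: sum(words.values()) for d, words in counts.items()}
print(num_docs, vocab_size, num_nonzero, doc_lengths)
```

The same function should work on the downloaded file by passing an open file handle instead of the sample list, assuming the file follows the layout described above.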
LDA
We will start off by looking at their implementation of LDA, which requires fewer parameters and hence acts as a good baseline. Use the ‘fit_offline’ method for smaller datasets and ‘fit_online’ for larger datasets. You can set the number of passes through the collection or the number of passes through a single document.
import artm
# Convert the UCI bag-of-words files in the current directory into BigARTM batches
batch_vectorizer = artm.BatchVectorizer(data_path='.', data_format='bow_uci', collection_name='kos', target_folder='kos_batches')
# LDA baseline with 15 topics, using the dictionary built by the batch vectorizer
lda = artm.LDA(num_topics=15, alpha=0.01, beta=0.001, cache_theta=True, num_document_passes=5, dictionary=batch_vectorizer.dictionary)
lda.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
# Print the 10 most probable tokens for each topic
top_tokens = lda.get_top_tokens(num_tokens=10)
for i, token_list in enumerate(top_tokens):
    print('Topic #{0}: {1}'.format(i, token_list))
Topic #0: ['bush', 'party', 'tax', 'president', 'campaign', 'political', 'state', 'court', 'republican', 'states']
Topic #1: ['iraq', 'war', 'military', 'troops', 'iraqi', 'killed', 'soldiers', 'people', 'forces', 'general']
Topic #2: ['november', 'poll', 'governor', 'house', 'electoral', 'account', 'senate', 'republicans', 'polls', 'contact']
Topic #3: ['senate', 'republican', 'campaign', 'republicans', 'race', 'carson', 'gop', 'democratic', 'debate', 'oklahoma']
Topic #4: ['election', 'bush', 'specter', 'general', 'toomey', 'time', 'vote', 'campaign', 'people', 'john']
Topic #5: ['kerry', 'dean', 'edwards', 'clark', 'primary', 'democratic', 'lieberman', 'gephardt', 'john', 'iowa']
Topic #6: ['race', 'state', 'democrats', 'democratic', 'party', 'candidates', 'ballot', 'nader', 'candidate', 'district']
Topic #7: ['administration', 'bush', 'president', 'house', 'years', 'commission', 'republicans', 'jobs', 'white', 'bill']
Topic #8: ['dean', 'campaign', 'democratic', 'media', 'iowa', 'states', 'union', 'national', 'unions', 'party']
Topic #9: ['house', 'republican', 'million', 'delay', 'money', 'elections', 'committee', 'gop', 'democrats', 'republicans']
Topic #10: ['november', 'vote', 'voting', 'kerry', 'senate', 'republicans', 'house', 'polls', 'poll', 'account']
Topic #11: ['iraq', 'bush', 'war', 'administration', 'president', 'american', 'saddam', 'iraqi', 'intelligence', 'united']
Topic #12: ['bush', 'kerry', 'poll', 'polls', 'percent', 'voters', 'general', 'results', 'numbers', 'polling']
Topic #13: ['time', 'house', 'bush', 'media', 'herseth', 'people', 'john', 'political', 'white', 'election']
Topic #14: ['bush', 'kerry', 'general', 'state', 'percent', 'john', 'states', 'george', 'bushs', 'voters']
You can extract and inspect the ∅ and θ matrices, as shown below.
phi = lda.phi_ # size is number of words in vocab x number of topics
theta = lda.get_theta() # number of rows correspond to the number of topics
print(phi)
topic_0 topic_1 ... topic_13 topic_14
sawyer 3.505303e-08 3.119175e-08 ... 4.008706e-08 3.906855e-08
harts 3.315658e-08 3.104253e-08 ... 3.624531e-08 8.052595e-06
amdt 3.238032e-08 3.085947e-08 ... 4.258088e-08 3.873533e-08
zimbabwe 3.627813e-08 2.476152e-04 ... 3.621078e-08 4.420800e-08
lindauer 3.455608e-08 4.200092e-08 ... 3.988175e-08 3.874783e-08
... ... ... ... ... ...
history 1.298618e-03 4.766201e-04 ... 1.258537e-04 5.760234e-04
figures 3.393254e-05 4.901363e-04 ... 2.569120e-04 2.455046e-04
consistently 4.986248e-08 1.593209e-05 ... 2.500701e-05 2.794474e-04
section 7.890978e-05 3.725445e-05 ... 2.141521e-05 4.838135e-05
loan 2.032371e-06 9.697820e-06 ... 6.084746e-06 4.030099e-08
print(theta)
1001 1002 1003 ... 2998 2999 3000
topic_0 0.000319 0.060401 0.002734 ... 0.000268 0.034590 0.000489
topic_1 0.001116 0.000816 0.142522 ... 0.179341 0.000151 0.000695
topic_2 0.000156 0.406933 0.023827 ... 0.000146 0.000069 0.000234
topic_3 0.015035 0.002509 0.016867 ... 0.000654 0.000404 0.000501
topic_4 0.001536 0.000192 0.021191 ... 0.001168 0.000120 0.001811
topic_5 0.000767 0.016542 0.000229 ... 0.000913 0.000219 0.000681
topic_6 0.000237 0.004138 0.000271 ... 0.012912 0.027950 0.001180
topic_7 0.015031 0.071737 0.001280 ... 0.153725 0.000137 0.000306
topic_8 0.009610 0.000498 0.020969 ... 0.000346 0.000183 0.000508
topic_9 0.009874 0.000374 0.000575 ... 0.297471 0.073094 0.000716
topic_10 0.000188 0.157790 0.000665 ... 0.000184 0.000067 0.000317
topic_11 0.720288 0.108728 0.687716 ... 0.193028 0.000128 0.000472
topic_12 0.216338 0.000635 0.003797 ... 0.049071 0.392064 0.382058
topic_13 0.008848 0.158345 0.007836 ... 0.000502 0.000988 0.002460
topic_14 0.000655 0.010362 0.069522 ... 0.110271 0.469837 0.607572
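One simple way to read the θ matrix is to look up the dominant topic of each document. The sketch below does this in plain Python for the first three document columns printed above (values copied from that output). `dominant_topic` is a hypothetical helper name; with the actual object returned by get_theta(), the same lookup would be done on that object directly.

```python
# Columns 1001, 1002 and 1003 of the theta output above, topic_0 through topic_14.
theta_sample = {
    '1001': [0.000319, 0.001116, 0.000156, 0.015035, 0.001536, 0.000767, 0.000237,
             0.015031, 0.009610, 0.009874, 0.000188, 0.720288, 0.216338, 0.008848,
             0.000655],
    '1002': [0.060401, 0.000816, 0.406933, 0.002509, 0.000192, 0.016542, 0.004138,
             0.071737, 0.000498, 0.000374, 0.157790, 0.108728, 0.000635, 0.158345,
             0.010362],
    '1003': [0.002734, 0.142522, 0.023827, 0.016867, 0.021191, 0.000229, 0.000271,
             0.001280, 0.020969, 0.000575, 0.000665, 0.687716, 0.003797, 0.007836,
             0.069522],
}

def dominant_topic(weights):
    """Return the name and weight of the topic with the largest p(t|d)."""
    best = max(range(len(weights)), key=lambda t: weights[t])
    return 'topic_{0}'.format(best), weights[best]

dominant = {doc: dominant_topic(weights) for doc, weights in theta_sample.items()}
for doc, (topic, weight) in dominant.items():
    print(doc, topic, weight)
```

Documents 1001 and 1003 are both dominated by topic_11, whose top tokens in the LDA output above relate to the Iraq war.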
ARTM
This API provides the full functionality of ARTM; however, with this flexibility comes the need to manually specify metrics and parameters.
dictionary = batch_vectorizer.dictionary  # dictionary built when the batches were created
model_artm = artm.ARTM(num_topics=15, cache_theta=True, scores=[artm.PerplexityScore(name='PerplexityScore', dictionary=dictionary)], regularizers=[artm.SmoothSparseThetaRegularizer(name='SparseTheta', tau=-0.15)])
# Additional scores to track during training
model_artm.scores.add(artm.SparsityPhiScore(name='SparsityPhiScore'))
model_artm.scores.add(artm.TopicKernelScore(name='TopicKernelScore', probability_mass_threshold=0.3))
model_artm.scores.add(artm.TopTokensScore(name='TopTokensScore', num_tokens=6))
# Negative tau sparsifies phi; the decorrelator pushes topics apart
model_artm.regularizers.add(artm.SmoothSparsePhiRegularizer(name='SparsePhi', tau=-0.1))
model_artm.regularizers.add(artm.DecorrelatorPhiRegularizer(name='DecorrelatorPhi', tau=1.5e+5))
model_artm.num_document_passes = 1
model_artm.initialize(dictionary=dictionary)
model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15)
There are a number of metrics available, depending on what was specified during the initialization phase. You can extract any of the metrics using the following syntax.
model_artm.scores
[PerplexityScore, SparsityPhiScore, TopicKernelScore, TopTokensScore]
model_artm.score_tracker['PerplexityScore'].value
[6873.0439453125,
2589.998779296875,
2684.09814453125,
2577.944580078125,
2601.897216796875,
2550.20263671875,
2531.996826171875,
2475.255126953125,
2410.30078125,
2319.930908203125,
2221.423583984375,
2126.115478515625,
2051.827880859375,
1995.424560546875,
1950.71484375]
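A quick way to read this trace is to look at the relative change between consecutive passes. The sketch below uses the values printed above; the helper name and the idea of flagging passes where perplexity went up are illustrative choices, not part of the BigARTM API.

```python
# Perplexity after each of the 15 collection passes, copied from the output above.
perplexity = [6873.0439453125, 2589.998779296875, 2684.09814453125,
              2577.944580078125, 2601.897216796875, 2550.20263671875,
              2531.996826171875, 2475.255126953125, 2410.30078125,
              2319.930908203125, 2221.423583984375, 2126.115478515625,
              2051.827880859375, 1995.424560546875, 1950.71484375]

def relative_changes(values):
    """Fractional decrease from each pass to the next (negative means an increase)."""
    return [(prev - cur) / prev for prev, cur in zip(values, values[1:])]

changes = relative_changes(perplexity)

# changes[i] compares pass i+1 with pass i+2, so report the later pass number.
increased_at = [i + 2 for i, change in enumerate(changes) if change < 0]

print('passes where perplexity went up:', increased_at)
print('relative improvement in the last pass: {0:.4f}'.format(changes[-1]))
```

Perplexity is still dropping by a couple of percent in the final pass, which suggests that more collection passes might improve the fit further.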
You can use the model_artm.get_phi() and model_artm.get_theta() methods to get the ∅ and θ matrices respectively. You can also extract the top terms of each topic for the corpus of documents.
for topic_name in model_artm.topic_names:
    print(topic_name + ': ', model_artm.score_tracker['TopTokensScore'].last_tokens[topic_name])
topic_0: ['party', 'state', 'campaign', 'tax', 'political', 'republican']
topic_1: ['war', 'troops', 'military', 'iraq', 'people', 'officials']
topic_2: ['governor', 'polls', 'electoral', 'labor', 'november', 'ticket']
topic_3: ['democratic', 'race', 'republican', 'gop', 'campaign', 'money']
topic_4: ['election', 'general', 'john', 'running', 'country', 'national']
topic_5: ['edwards', 'dean', 'john', 'clark', 'iowa', 'lieberman']
topic_6: ['percent', 'race', 'ballot', 'nader', 'state', 'party']
topic_7: ['house', 'bill', 'administration', 'republicans', 'years', 'senate']
topic_8: ['dean', 'campaign', 'states', 'national', 'clark', 'union']
topic_9: ['delay', 'committee', 'republican', 'million', 'district', 'gop']
topic_10: ['november', 'poll', 'vote', 'kerry', 'republicans', 'senate']
topic_11: ['iraq', 'war', 'american', 'administration', 'iraqi', 'security']
topic_12: ['bush', 'kerry', 'bushs', 'voters', 'president', 'poll']
topic_13: ['war', 'time', 'house', 'political', 'democrats', 'herseth']
topic_14: ['state', 'percent', 'democrats', 'people', 'candidates', 'general']
Conclusion
LDA tends to be the starting point for topic modeling for many use cases. In this post, BigARTM was introduced as a state-of-the-art alternative. The basic principles behind BigARTM were illustrated along with the usage of the library. I would encourage you to try out BigARTM and see if it is a good fit for your needs!
Please try the attached notebook.
The post Beyond LDA: State-of-the-art Topic Models With BigARTM appeared first on Databricks.