One of my academic projects over the past year has focused on using language models for topic modeling within the metabolomics field. During peer review, one of the reviewers rightly suggested evaluating the topic model, specifically its topic coherence. This sent me down the rabbit hole of topic coherence metrics; what follows are my learning notes.
I. Preamble
Suppose you’ve built a topic model to classify newspaper articles into distinct categories. You’ve generated embeddings using an encoder model, reduced their dimensionality with t-SNE, clustered the reduced embeddings using HDBSCAN, and finally applied c-TF-IDF to represent the topics. Naturally, this representation process produces keywords for each cluster, leading to identifiable topics.
Now, the question arises: How coherent is your topic model? This is where topic coherence metrics become essential. Topic coherence metrics score each topic by measuring the degree of semantic similarity between its highest-ranking words.
A high coherence score generally means the words within the topic are semantically related, making sense together and forming a coherent concept.
Here, we’ll focus on two widely-used coherence metrics: C_npmi and C_v.
II. Normalized Pointwise Mutual Information Coherence
C_npmi stands for coherence calculated using Normalized Pointwise Mutual Information.
To simplify how it works: Consider a topic identified by its top words. C_npmi examines pairs of these top words and measures how frequently each pair appears together in the same documents (e.g., newspaper articles), compared to how often they would appear together purely by chance, given their individual occurrence across all documents.
If two words frequently appear together more often than chance would suggest, they contribute positively to the coherence score.
The “normalized” aspect ensures the score typically ranges from -1 (indicating the words never appear together) to +1 (indicating the words always co-occur). A score of 0 means the words appear together exactly as often as random chance would dictate.
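Written out, for a word pair (w_i, w_j), with probabilities estimated from (co-)occurrence counts, this is:

```latex
\mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i)\, P(w_j)}
\qquad
\mathrm{NPMI}(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log\!\left(P(w_i, w_j) + \epsilon\right)}
```

where a small smoothing constant ε avoids taking the log of zero for pairs that never co-occur.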
The final coherence score for the topic is generally the average of these pairwise scores for its top words.
I think an example is in order. Imagine your topic is “sports team,” represented by the keywords “quarterback,” “touchdown,” and “helmet.” The C_npmi coherence score checks your newspaper articles to see how often “quarterback” and “touchdown” co-occur. If these words appear together significantly more often than would be expected if sports-related words were scattered randomly across articles, this suggests they form a coherent concept within the “sports team” topic.
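To make this concrete, here is a minimal, self-contained sketch of a C_npmi calculation. The function name and toy articles are mine, and for simplicity it uses whole documents as the co-occurrence context (implementations often use a sliding window instead):

```python
import math
from itertools import combinations

def c_npmi(top_words, documents, eps=1e-12):
    """Average pairwise NPMI of a topic's top words.

    P(w) is the fraction of documents containing w; P(wi, wj) is the
    fraction containing both. The small eps avoids log(0) for pairs
    that never co-occur.
    """
    docs = [set(doc.lower().split()) for doc in documents]
    n = len(docs)
    p = {w: sum(w in d for d in docs) / n for w in top_words}

    scores = []
    for wi, wj in combinations(top_words, 2):
        p_joint = sum(wi in d and wj in d for d in docs) / n
        pmi = math.log((p_joint + eps) / (p[wi] * p[wj] + eps))
        scores.append(pmi / -math.log(p_joint + eps))
    return sum(scores) / len(scores)

articles = [
    "the quarterback wore a new helmet and threw a touchdown",
    "fans cheered the touchdown as the quarterback celebrated",
    "the senate passed the budget bill after a long debate",
    "a new helmet rule protects the quarterback from injury",
]
print(c_npmi(["quarterback", "touchdown", "helmet"], articles))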
III. Composite Coherence Measure (C_v)
C_v is a composite coherence measure from Röder et al. (2015). While the “C” denotes coherence, the “v” doesn’t have an officially stated expansion, though it is often read as referring to the context vectors the measure is built on.
C_v is slightly more complex than C_npmi. Like C_npmi, it starts from NPMI, though it estimates word probabilities with a boolean sliding window over the documents rather than from whole documents.
However, instead of scoring word pairs directly, it represents each top word as a context vector of its NPMI values with the other top words. Words that keep similar company end up with similar vectors. (The effect resembles word embeddings such as Word2Vec, where semantically similar words sit close together, but note that C_v builds its vectors from your corpus’s own co-occurrence statistics rather than from pretrained embeddings.)
Thus, C_v checks not only direct word co-occurrence within your documents but also whether the words co-occur with similar sets of other words, even if they don’t appear in every document together.
Essentially, C_v combines:
Direct co-occurrence evidence (from NPMI).
Indirect semantic similarity (measured via cosine similarity between the NPMI-based context vectors).
The final coherence score averages the cosine similarities between each top word’s context vector and the vector representing the full set of top words.
Returning to the “sports team” example: just as with C_npmi, C_v first evaluates how often “quarterback” and “touchdown” appear near each other in the newspaper articles. Additionally, C_v builds a co-occurrence “profile” (context vector) for each keyword and checks whether “quarterback,” “touchdown,” and “helmet” keep similar company across the corpus, indicating they’re related within the broader context of sports, even if some articles don’t explicitly mention all three terms simultaneously.
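In Röder et al.’s formulation, this indirect step compares context vectors built from NPMI statistics. Below is a minimal sketch of that idea; the function names and toy articles are mine, and it simplifies the real measure, which uses a boolean sliding window and a richer segmentation:

```python
import math

def npmi(wi, wj, docs, eps=1e-12):
    # Document-frequency NPMI between two words.
    n = len(docs)
    p_i = sum(wi in d for d in docs) / n
    p_j = sum(wj in d for d in docs) / n
    p_ij = sum(wi in d and wj in d for d in docs) / n
    pmi = math.log((p_ij + eps) / (p_i * p_j + eps))
    return pmi / -math.log(p_ij + eps)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def c_v_sketch(top_words, documents):
    """Each top word gets a context vector of NPMI values against the
    topic's top words; coherence is the mean cosine similarity between
    each word's vector and the summed vector of the whole set."""
    docs = [set(doc.lower().split()) for doc in documents]
    vectors = {w: [npmi(w, u, docs) for u in top_words] for w in top_words}
    topic_vec = [sum(col) for col in zip(*vectors.values())]
    sims = [cosine(vectors[w], topic_vec) for w in top_words]
    return sum(sims) / len(sims)

articles = [
    "the quarterback wore a new helmet and threw a touchdown",
    "fans cheered the touchdown as the quarterback celebrated",
    "the senate passed the budget bill after a long debate",
    "a new helmet rule protects the quarterback from injury",
]
print(c_v_sketch(["quarterback", "touchdown", "helmet"], articles))
```

Notice that “touchdown” and “helmet” rarely share an article here, yet both co-occur with “quarterback,” so their context vectors still point in similar directions.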
By combining both direct co-occurrence from the articles and general semantic closeness, C_v provides a more comprehensive coherence assessment.
In summary, both C_npmi and C_v quantify how semantically coherent a topic’s top words are. C_v is often considered more robust, integrating direct co-occurrence statistics with indirect, context-vector-based similarity. In contrast, C_npmi relies purely on direct co-occurrence patterns within your specific set of documents.
References
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 399–408). Association for Computing Machinery. https://doi.org/10.1145/2684822.2685324
Pedro, J. (2022, January 10). Understanding topic coherence measures. Towards Data Science. https://towardsdatascience.com/understanding-topic-coherence-measures-4aa41339634c