
v0.9 #176

Merged: 21 commits, Aug 7, 2021

Conversation

MaartenGr (Owner) commented Jul 10, 2021

Highlights

  • Get the most representative documents per topic: topic_model.get_representative_docs(topic=1)
    • This allows users to see which documents are good representations of a topic and better understand the topics that were created
  • Added normalize_frequency parameter to visualize_topics_per_class and visualize_topics_over_time in order to better compare the relative topic frequencies between topics
  • Return flat probabilities by default; only calculate the probabilities of all topics per document if calculate_probabilities is True
  • Implemented a guided BERTopic by defining seed topics:
# NOTE: The Reuters dataset was used
seed_topic_list = [["company", "billion", "quarter", "shrs", "earnings"],
                   ["acquisition", "procurement", "merge"],
                   ["exchange", "currency", "trading", "rate", "euro"],
                   ["grain", "wheat", "corn"],
                   ["coffee", "cocoa"],
                   ["natural", "gas", "oil", "fuel", "products", "petrol"]]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, _ = topic_model.fit_transform(docs)

Guided BERTopic works in two ways.

First, we create an embedding for each seeded topic by joining its keywords and passing them through the document embedder. These embeddings are compared with the existing document embeddings through cosine similarity, and each document is assigned a label. If a document is most similar to a seeded topic, it gets that topic's label. If it is most similar to the average document embedding, it gets the -1 label. These labels are then passed to UMAP to create a semi-supervised approach that should nudge topic creation toward the seeded topics.
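As a rough illustration of this labeling step, a plain-Python sketch (toy 2-dimensional embeddings, not BERTopic's internal code) could look like this:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def label_documents(doc_embeddings, seed_embeddings):
    """Assign each document the label of its most similar seed topic,
    or -1 if it is more similar to the average document embedding."""
    average = [sum(col) / len(doc_embeddings) for col in zip(*doc_embeddings)]
    labels = []
    for doc in doc_embeddings:
        sims = [cosine(doc, seed) for seed in seed_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(best if sims[best] >= cosine(doc, average) else -1)
    return labels

docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
seeds = [[1.0, 0.1], [0.1, 1.0]]
print(label_documents(docs, seeds))  # [0, 1, -1]
```

These labels would then serve as the (partial) targets for UMAP's semi-supervised embedding.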

Second, we take all words in seed_topic_list and assign them a multiplier larger than 1. These multipliers increase the IDF values of those words across all topics, thereby increasing the likelihood that a seeded word appears in a topic's representation. This does, however, also increase the chance of an irrelevant topic containing unrelated seeded words. In practice, this should not be an issue, since the IDF value of such a word is likely to remain low regardless of the multiplier. The multiplier is currently a fixed value but may change to something more elegant, such as taking the distribution of IDF values and a word's position in it into account.
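A minimal sketch of this boosting step (the boost_seed_words helper and the 1.2 multiplier are illustrative assumptions, not the library's internals):

```python
def boost_seed_words(idf, seed_topic_list, multiplier=1.2):
    """Multiply the IDF value of every seeded word so it is more
    likely to surface in a topic's representation."""
    seed_words = {word for topic in seed_topic_list for word in topic}
    return {word: value * multiplier if word in seed_words else value
            for word, value in idf.items()}

idf = {"coffee": 2.0, "cocoa": 1.5, "market": 3.0}
boosted = boost_seed_words(idf, [["coffee", "cocoa"]])
# "coffee" and "cocoa" are boosted; "market" is untouched
```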

Fixes

  • Fix loading pre-trained BERTopic model
  • Fix mapping of probabilities


Akhamis01 left a comment


Hello!

Just wanted to note that this update to improve the probabilities and fix the mappings has introduced a new bug. When using BERTopic with the nr_topics parameter, it returns an error.

To reproduce this error, run:
topic_model = BERTopic(nr_topics=10)

This will raise KeyError: 10.

@MaartenGr
Owner Author

@Akhamis01 Nice catch! Indeed, there are still some bugs regarding this to figure out. There is a fix coming though that should remedy this issue.

@Akhamis01

Akhamis01 commented Jul 20, 2021

Sounds good! Also, just a side note, have you considered creating a function that would return the optimal number of topics to be used within a topic model? I feel that would be extremely useful and a lot of users may benefit from it :)

@MaartenGr
Owner Author

@Akhamis01 You can already get the optimal number of topics by setting nr_topics = "auto" when instantiating BERTopic.

@Akhamis01

Thank you! Just wondering, how is the optimal number of topics calculated for a topic model? Is it through Coherence scores?

@MaartenGr
Owner Author

@Akhamis01 The reduction is based on the similarity between topics. If two topics are very similar to each other, they will be merged. Coherence scores have a bunch of issues that make it difficult to properly evaluate a topic model using only one method.
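To illustrate the idea, a threshold-based merge over topic vectors might look like this (a simplified sketch; the actual auto-reduction clusters the topic embeddings with HDBSCAN rather than using a fixed cutoff):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def merge_similar_topics(topic_vectors, threshold=0.9):
    """Greedily map each topic onto an earlier topic whose vector
    exceeds the similarity threshold; otherwise keep it as-is."""
    mapping = {}
    kept = []  # (topic_id, vector) pairs of surviving topics
    for tid, vec in enumerate(topic_vectors):
        for kid, kvec in kept:
            if cosine(vec, kvec) >= threshold:
                mapping[tid] = kid
                break
        else:
            kept.append((tid, vec))
            mapping[tid] = tid
    return mapping

vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(merge_similar_topics(vectors))  # {0: 0, 1: 0, 2: 2}
```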

@Akhamis01

Alright, and how is the similarity between the topics calculated in general? What metrics are the similarities based on?

@MaartenGr
Owner Author

@Akhamis01 HDBSCAN is used to cluster the topics even further based on their topic vectors. You can find more about that here:

def _auto_reduce_topics(self, documents: pd.DataFrame) -> pd.DataFrame:

If you have any further questions, could you create a separate issue for them? It helps keep discussions relevant to this specific pull request.

@Akhamis01

Makes sense. Thank you for your help, appreciate it! I'll create a new issue for any other questions :)

@farshi

farshi commented Jul 23, 2021

@MaartenGr The Guided BERTopic idea and implementation look great! I will give it a try and let you know my feedback. Thank you!

@FedePacio97


Hi, thanks for the update.
Can we decide how many documents we would like to get from this method? For instance, instead of 3 documents, could we get X documents, through a method like topic_model.get_representative_docs(topic=1, nr_of_documents_returned=20)?

@MaartenGr
Owner Author

@FedePacio97 That is currently not possible, since only three documents per topic are saved in order to prevent the topic model from storing too many documents. Moreover, this value would also depend on min_topic_size as well as the underlying algorithm, since this method is only used for HDBSCAN.

Instead, you can use the topics output to see which documents belong to which topic. You could also use .visualize_documents to get an understanding of the documents that occupy each topic and which might be most representative.
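For example, grouping documents by their assigned topic from the topics output takes only a few lines of plain Python (the documents and topic ids below are placeholders):

```python
from collections import defaultdict

def docs_per_topic(docs, topics):
    """Group documents by the topic id assigned to each of them."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)

docs = ["oil prices rise", "wheat harvest strong", "crude supply tight"]
topics = [5, 3, 5]
grouped = docs_per_topic(docs, topics)
# {5: ['oil prices rise', 'crude supply tight'], 3: ['wheat harvest strong']}
```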

@FedePacio97

Thank you very much for the fast reply.
This is definitely what I was looking for.

@MaartenGr MaartenGr deleted the v0.9 branch May 4, 2023 07:17