v0.9 #176
Conversation
Hello!
Just wanted to note that this update to improve the probabilities and fix the mappings has introduced a new bug. When using `BERTopic()` with the `nr_topics` parameter, it returns an error.
To reproduce this error, run:

```python
from bertopic import BERTopic

topic_model = BERTopic(nr_topics=10)
```

This will raise `KeyError: 10`.
@Akhamis01 Nice catch! Indeed, there are still some bugs here to figure out. A fix is coming, though, that should remedy this issue.
Sounds good! Also, just a side note: have you considered creating a function that returns the optimal number of topics for a topic model? I feel that would be extremely useful, and a lot of users may benefit from it :)
@Akhamis01 You can already get the optimal number of topics by setting `nr_topics` to `"auto"`.
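A minimal sketch of that usage; the 20 newsgroups dataset and the printed summary are just illustrative choices, not part of the original comment:

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# nr_topics="auto" lets BERTopic merge topics that are very similar to each other
topic_model = BERTopic(nr_topics="auto")
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())
```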
Thank you! Just wondering, how is the optimal number of topics calculated for a topic model? Is it through coherence scores?
@Akhamis01 The reduction is based on the similarity between topics: if two topics are very similar to each other, they will be merged. Coherence scores have a number of issues that make it difficult to properly evaluate a topic model using only one method.
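A toy sketch of that idea; this is not BERTopic's internal code, and the topic vectors below are made up:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up topic vectors, e.g. rows of a c-TF-IDF matrix
topic_vectors = np.array([
    [0.90, 0.10, 0.00, 0.20],
    [0.85, 0.15, 0.05, 0.10],  # nearly identical to the first topic
    [0.00, 0.10, 0.90, 0.30],
])

sims = cosine_similarity(topic_vectors)
np.fill_diagonal(sims, -1)  # ignore self-similarity
i, j = np.unravel_index(sims.argmax(), sims.shape)
print(f"Merge candidates: topics {i} and {j} (cosine similarity {sims[i, j]:.2f})")
```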
Alright, and how is the similarity between topics calculated in general? What metrics are the similarities based on?
@Akhamis01 HDBSCAN is used to cluster the topics even further based on their topic vectors. You can find more about that here: BERTopic/bertopic/_bertopic.py, line 1568 at commit 687d846.
If you have any further questions, could you create a separate issue for them? It helps keep discussions relevant to this specific pull request.
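A rough sketch of clustering topic vectors with HDBSCAN; the vectors below are synthetic, and this is not the code at the linked line:

```python
import numpy as np
import hdbscan

# Synthetic topic vectors: two tight groups of similar topics
rng = np.random.default_rng(42)
topic_vectors = np.vstack([
    rng.normal(0.0, 0.05, size=(5, 8)),
    rng.normal(3.0, 0.05, size=(5, 8)),
])

clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric="euclidean")
labels = clusterer.fit_predict(topic_vectors)
print(labels)  # topics that share a cluster label are candidates for merging
```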
Makes sense. Thank you for your help, I appreciate it! I'll create a new issue for any other questions :)
@MaartenGr The Guided BERTopic idea and implementation look great! I will give it a try and let you know my feedback. Thank you.
Hi, thanks for the update.
@FedePacio97 That is currently not possible, since only three documents per topic are saved in order to prevent the topic model from saving too many documents. Moreover, this value would then depend on the […]. Instead, you can use the […].
Thank you very much for the fast reply.
Highlights

- Implemented Guided BERTopic, which nudges topic creation towards user-defined seeded topics (described below)
- Get the most representative documents per topic: `topic_model.get_representative_docs(topic=1)` (see the sketch after this list)
- Added the `normalize_frequency` parameter to `visualize_topics_per_class` and `visualize_topics_over_time` in order to better compare the relative topic frequencies between topics
- Return flat probabilities as default; the probabilities of all topics per document are only calculated if `calculate_probabilities` is True
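A minimal sketch of the representative-documents call, assuming `topic_model` is an already fitted BERTopic instance and that topic `1` exists:

```python
# `topic_model` is assumed to be a fitted BERTopic model (see the usage
# example earlier in this thread); topic 1 is just an example topic id
representative_docs = topic_model.get_representative_docs(topic=1)
for doc in representative_docs:
    print(doc[:100])  # show the start of each representative document
```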
Guided BERTopic works in two ways.

First, we create embeddings for each seeded topic by joining its keywords and passing them through the document embedder. These embeddings are compared with the existing document embeddings through cosine similarity, and each document is assigned a label: if a document is most similar to a seeded topic, it gets that topic's label; if it is most similar to the average document embedding, it gets the -1 label. These labels are then passed to UMAP to create a semi-supervised approach that should nudge topic creation towards the seeded topics.
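A rough sketch of this first step; the embeddings below are random placeholders, and the labeling logic is a simplification of the description above:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder embeddings; in practice these come from the document embedder
doc_embeddings = np.random.rand(100, 384)
seed_topic_embeddings = np.random.rand(3, 384)  # one embedding per seeded topic
avg_doc_embedding = doc_embeddings.mean(axis=0, keepdims=True)

# Compare each document with the seeded topics and the average document embedding
candidates = np.vstack([seed_topic_embeddings, avg_doc_embedding])
sims = cosine_similarity(doc_embeddings, candidates)
best = sims.argmax(axis=1)

# Documents closest to a seeded topic get that topic's label, otherwise -1
labels = np.where(best < len(seed_topic_embeddings), best, -1)
# `labels` could then be passed as the `y` argument of UMAP's fit for a
# semi-supervised embedding
```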
Second, we take all words in `seed_topic_list` and assign them a multiplier larger than 1. Those multipliers are used to increase the IDF values of these words across all topics, thereby increasing the likelihood that a seeded topic word will appear in a topic. This does, however, also increase the chance of an irrelevant topic containing unrelated words. In practice, this should not be an issue, since the IDF value is likely to remain low regardless of the multiplier. The multiplier is currently a fixed value, but it may change to something more elegant, such as taking the distribution of IDF values and a word's position in it into account when defining the multiplier.
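A toy sketch of this IDF boosting, where the vocabulary, the IDF values, and the multiplier of 1.2 are all made up for illustration:

```python
import numpy as np

# Made-up vocabulary and IDF values; real values come from the c-TF-IDF step
vocab = ["economy", "inflation", "sports", "football"]
idf = np.array([0.40, 0.30, 0.50, 0.20])

# Words from seed_topic_list receive a fixed multiplier larger than 1
seed_words = {"economy", "inflation"}
multiplier = 1.2
boosted_idf = np.array([
    value * multiplier if word in seed_words else value
    for word, value in zip(vocab, idf)
])
print(boosted_idf)  # seeded words are weighted up across all topics
```

Fixes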