
v0.9 #176

Merged: 21 commits, Aug 7, 2021

Conversation

MaartenGr (Owner) commented Jul 10, 2021

Highlights

  • Get the most representative documents per topic: topic_model.get_representative_docs(topic=1)
    • This allows users to see which documents are good representations of a topic and better understand the topics that were created
  • Added normalize_frequency parameter to visualize_topics_per_class and visualize_topics_over_time in order to better compare the relative topic frequencies between topics
  • Return flat probabilities by default; only calculate the probabilities of all topics per document if calculate_probabilities is True
  • Implemented a guided BERTopic by defining seed topics:
# NOTE: The Reuters dataset was used
seed_topic_list = [["company", "billion", "quarter", "shrs", "earnings"],
                   ["acquisition", "procurement", "merge"],
                   ["exchange", "currency", "trading", "rate", "euro"],
                   ["grain", "wheat", "corn"],
                   ["coffee", "cocoa"],
                   ["natural", "gas", "oil", "fuel", "products", "petrol"]]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, _ = topic_model.fit_transform(docs)

Guided BERTopic works in two ways.

First, we create an embedding for each seeded topic by joining its keywords and passing them through the document embedder. These embeddings are compared with the existing document embeddings through cosine similarity, and each document is assigned a label. If a document is most similar to a seeded topic, it gets that topic's label. If it is most similar to the average document embedding, it gets the -1 label. These labels are then passed to UMAP to create a semi-supervised approach that should nudge topic creation toward the seeded topics.
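As a rough illustration of this labeling step, a plain-Python sketch (toy 2-dimensional embeddings, not BERTopic's internal code) could look like this:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def label_documents(doc_embeddings, seed_embeddings):
    """Assign each document the label of its most similar seed topic,
    or -1 if it is more similar to the average document embedding."""
    average = [sum(col) / len(doc_embeddings) for col in zip(*doc_embeddings)]
    labels = []
    for doc in doc_embeddings:
        sims = [cosine(doc, seed) for seed in seed_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(best if sims[best] >= cosine(doc, average) else -1)
    return labels

docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
seeds = [[1.0, 0.1], [0.1, 1.0]]
print(label_documents(docs, seeds))  # [0, 1, -1]
```

These labels would then serve as the (partial) targets for UMAP's semi-supervised embedding.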

Second, we take all words in seed_topic_list and assign them a multiplier larger than 1. These multipliers increase the IDF values of those words across all topics, thereby increasing the likelihood that a seeded word appears in a topic's representation. This does, however, also increase the chance of an irrelevant topic containing unrelated seeded words. In practice, this should not be an issue, since the IDF value of such a word is likely to remain low regardless of the multiplier. The multiplier is currently a fixed value but may change to something more elegant, such as taking the distribution of IDF values and a word's position in it into account.
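A minimal sketch of this boosting step (the boost_seed_words helper and the 1.2 multiplier are illustrative assumptions, not the library's internals):

```python
def boost_seed_words(idf, seed_topic_list, multiplier=1.2):
    """Multiply the IDF value of every seeded word so it is more
    likely to surface in a topic's representation."""
    seed_words = {word for topic in seed_topic_list for word in topic}
    return {word: value * multiplier if word in seed_words else value
            for word, value in idf.items()}

idf = {"coffee": 2.0, "cocoa": 1.5, "market": 3.0}
boosted = boost_seed_words(idf, [["coffee", "cocoa"]])
# "coffee" and "cocoa" are boosted; "market" is untouched
```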

Fixes

  • Fix loading pre-trained BERTopic model
  • Fix mapping of probabilities


Akhamis01 left a comment


Hello!

Just wanted to note that this update to improve the probabilities and fix the mappings has introduced a new bug. When using BERTopic with the nr_topics parameter, it returns an error.

To reproduce this error, run:
topic_model = BERTopic(nr_topics=10)

This will raise KeyError: 10.

@MaartenGr
Owner Author

@Akhamis01 Nice catch! Indeed, there are still some bugs regarding this to figure out. There is a fix coming though that should remedy this issue.

@Akhamis01

Akhamis01 commented Jul 20, 2021

Sounds good! Also, just a side note, have you considered creating a function that would return the optimal number of topics to be used within a topic model? I feel that would be extremely useful and a lot of users may benefit from it :)

@MaartenGr
Owner Author

@Akhamis01 You can already get the optimal number of topics by setting nr_topics = "auto" when instantiating BERTopic.

@Akhamis01

Thank you! Just wondering, how is the optimal number of topics calculated for a topic model? Is it through Coherence scores?

@MaartenGr
Owner Author

@Akhamis01 The reduction is based on the similarity between topics. If two topics are very similar to each other, they will be merged. Coherence scores have a bunch of issues that make it difficult to properly evaluate a topic model using only one method.
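To illustrate the idea, a threshold-based merge over topic vectors might look like this (a simplified sketch; the actual auto-reduction clusters the topic embeddings with HDBSCAN rather than using a fixed cutoff):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def merge_similar_topics(topic_vectors, threshold=0.9):
    """Greedily map each topic onto an earlier topic whose vector
    exceeds the similarity threshold; otherwise keep it as-is."""
    mapping = {}
    kept = []  # (topic_id, vector) pairs of surviving topics
    for tid, vec in enumerate(topic_vectors):
        for kid, kvec in kept:
            if cosine(vec, kvec) >= threshold:
                mapping[tid] = kid
                break
        else:
            kept.append((tid, vec))
            mapping[tid] = tid
    return mapping

vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(merge_similar_topics(vectors))  # {0: 0, 1: 0, 2: 2}
```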

@Akhamis01

Alright, and how is the similarity between the topics calculated in general? What metrics are the similarities based on?

@MaartenGr
Owner Author

@Akhamis01 HDBSCAN is used to cluster the topics even further based on their topic vectors. You can find more about that here:

def _auto_reduce_topics(self, documents: pd.DataFrame) -> pd.DataFrame:

If you have any further questions, could you create a separate issue for them? It helps keep discussions relevant to this specific pull request.

@Akhamis01

Makes sense. Thank you for your help, appreciate it! I'll create a new issue for any other questions :)

@farshi

farshi commented Jul 23, 2021

@MaartenGr The Guided BERTopic idea and implementation look great! I will give it a try and let you know my feedback. Thank you!

@FedePacio97


Hi, thanks for the update.
Can we decide how many documents we would like to get from this method? For instance, instead of 3 documents, could we get X documents, through a method like topic_model.get_representative_docs(topic=1, nr_of_documents_returned=20)?

@MaartenGr
Owner Author

@FedePacio97 That is currently not possible, since only three documents per topic are saved in order to prevent the topic model from storing too many documents. Moreover, this value would also depend on min_topic_size as well as the underlying algorithm, since this method is only used for HDBSCAN.

Instead, you can use the topics output to see which documents belong to which topic. You could also use .visualize_documents to get an understanding of the documents that occupy each topic and which might be most representative.
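For example, grouping documents by their assigned topic from the topics output takes only a few lines of plain Python (the documents and topic ids below are placeholders):

```python
from collections import defaultdict

def docs_per_topic(docs, topics):
    """Group documents by the topic id assigned to each of them."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)

docs = ["oil prices rise", "wheat harvest strong", "crude supply tight"]
topics = [5, 3, 5]
grouped = docs_per_topic(docs, topics)
# {5: ['oil prices rise', 'crude supply tight'], 3: ['wheat harvest strong']}
```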

@FedePacio97

Thank you very much for the fast reply.
This is definitely what I was looking for.

@MaartenGr MaartenGr deleted the v0.9 branch May 4, 2023 07:17