v0.5 #46

Merged
merged 28 commits into from
Feb 8, 2021

Owner

@MaartenGr MaartenGr commented Jan 21, 2021

Several features and fixes will be added to this version (#44, #43, #49):

Features

  • Option to use custom UMAP
  • Option to use custom HDBSCAN
  • Added low_memory parameter to reduce memory during computation
  • Improved verbosity (shows progress bar)
  • Improved testing
  • Use the newest version of sentence-transformers, as it speeds up encoding significantly
  • Add Flair to allow for more (custom) token/document embeddings
  • Return the figure of visualize_topics()
  • Expose all parameters with a single function: get_params()
  • Option to disable the saving of embedding_model, should reduce BERTopic size significantly
  • Add FAQ page

Fixes

  • To simplify the API, the parameters stop_words and n_neighbors were removed. These can still be used when a custom UMAP or CountVectorizer is used.
  • Set calculate_probabilities to False by default. Calculating probabilities with HDBSCAN significantly increases computation time and memory usage, so it is better to skip it unless it is turned on manually.
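
As a rough illustration of the two fixes above, the sketch below moves n_neighbors and stop_words into custom UMAP and CountVectorizer objects and turns probability calculation back on explicitly. The keyword names umap_model, hdbscan_model and vectorizer_model are assumptions about the v0.5 constructor (only calculate_probabilities is named in this PR), the parameter values are arbitrary, and build_topic_model is a hypothetical helper, so treat it as a starting point rather than the definitive API.

```python
# Illustrative settings only; none of these values come from the PR.
UMAP_PARAMS = dict(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
HDBSCAN_PARAMS = dict(min_cluster_size=10, metric="euclidean", prediction_data=True)
VECTORIZER_PARAMS = dict(stop_words="english")


def build_topic_model(calculate_probabilities=True):
    """Build a BERTopic instance from custom sub-models (hypothetical helper).

    Imports are done lazily so the module can be loaded even when
    bertopic, umap-learn, hdbscan or scikit-learn are not installed.
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    # n_neighbors is no longer a BERTopic argument, so it is set on UMAP.
    umap_model = UMAP(**UMAP_PARAMS)
    hdbscan_model = HDBSCAN(**HDBSCAN_PARAMS)
    # stop_words is no longer a BERTopic argument, so it is set on CountVectorizer.
    vectorizer_model = CountVectorizer(**VECTORIZER_PARAMS)

    # Keyword names below are assumed, not confirmed by this PR description.
    return BERTopic(
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        calculate_probabilities=calculate_probabilities,
    )
```

Calling build_topic_model() requires the four packages to be installed; leaving calculate_probabilities at the new default of False avoids the extra time and memory cost mentioned above.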

Roadmap

  • Update notebook

Issues

  • Flair currently depends on transformers v3.5.0; it is expected to update in the near future

In the process of developing this new version, more features might be added than you see now. They will be added to this message as I work on them.

@MaartenGr MaartenGr mentioned this pull request Jan 22, 2021
Owner Author

@MaartenGr MaartenGr left a comment

It seems that there is an issue with HDBSCAN not working using NumPy 1.19.3 and 1.19.5. Trying out 1.18.5 to see if that helps.

EDIT: It seems that UMAP is having similar issues with NumPy.
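
A quick way to see whether an environment has one of the NumPy versions mentioned above is sketched here. The helper name numpy_version_status is hypothetical, and the version set contains only the two versions reported in this comment, so a "not-reported" result does not guarantee that HDBSCAN or UMAP will work.

```python
from importlib import metadata

# Versions reported as problematic in this thread (not an exhaustive list).
REPORTED_PROBLEM_VERSIONS = {"1.19.3", "1.19.5"}


def numpy_version_status(installed=None):
    """Classify a NumPy version string against the versions reported above.

    If no version is passed, the installed distribution is looked up.
    Returns "missing", "reported-problematic" or "not-reported".
    """
    if installed is None:
        try:
            installed = metadata.version("numpy")
        except metadata.PackageNotFoundError:
            return "missing"
    if installed in REPORTED_PROBLEM_VERSIONS:
        return "reported-problematic"
    return "not-reported"


if __name__ == "__main__":
    print(numpy_version_status())
```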

@bhavul

bhavul commented Feb 4, 2021

Hi, I noticed you are facing the same NumPy- and pip-related issues with hdbscan and umap.

I was able to resolve this (posted here), but note that I've been running BERTopic from a Dockerfile. I'm posting it here in case it helps; you could update the setup.py file accordingly, or perhaps add this or a similar Dockerfile to the repo along with this PR.

Do note that the Python version, the install order, and the extra pip parameters used for hdbscan all seem to matter. Without them it fails.

Also, future patches for hdbscan/umap will most likely fix these pip and version cross-compatibility issues.

FROM python:3.8.7

# change shell to bash
SHELL ["/bin/bash", "-c"]

# Install some important libraries
RUN pip install --upgrade pip
RUN pip install --upgrade numpy umap-learn
RUN pip install --upgrade hdbscan --no-cache-dir --no-binary :all:
RUN pip install bertopic[visualization]

# Do whatever you wish to
WORKDIR "/src/"
CMD ["python", "test.py"]

Build it via:
docker build -f Dockerfile -t bertopic:v1 .

Run it via:
docker run -it -v $PWD:/src/ --name bertopic bertopic:v1

@MaartenGr
Owner Author

@bhavul Thanks! I had been following the issues with pypi for the last few days and was waiting for a fix to this issue. It's a shame there aren't perfect workarounds. However, it seems to be working for now with the updated requirements. Thanks again for the help.

@MaartenGr MaartenGr merged commit e84d7d1 into master Feb 8, 2021
@ahmed1

ahmed1 commented Feb 8, 2021

Running the installs mentioned above with python 3.7.7 worked for me:

pip install --upgrade pip
pip install --upgrade numpy umap-learn
pip install --upgrade hdbscan --no-cache-dir --no-binary :all:
pip install bertopic[visualization]

@MaartenGr MaartenGr mentioned this pull request Mar 14, 2021
@MaartenGr MaartenGr deleted the feature-memory branch March 29, 2021 09:35
3 participants