
🙏 How to Convert a FLUX-Dev Checkpoint to an NF4 Model? #1224

Answered by lllyasviel
sashaok123 asked this question in Q&A

I will probably share some conversion code later ...

Also, people need to be aware that GGUF is a pure compression technique: the file is smaller, but it is also slower, because there are extra steps to decompress tensors and the computation is still plain PyTorch (unless someone is crazy enough to port the llama.cpp kernels).
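A minimal sketch of that decompress-then-compute pattern, using a made-up symmetric int4 block scheme (not the real GGUF Q4_0 layout) just to show where the extra dequantization work goes:

```python
import torch

def quantize_blocks(w, block=32):
    # Pure-compression style: store int4 codes plus one scale per block.
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / 7.0      # per-block scale
    codes = (flat / scale).round().clamp(-8, 7).to(torch.int8)
    return codes, scale

def dequantize_blocks(codes, scale, shape):
    # Extra step at inference time: rebuild fp16 weights before the matmul.
    return (codes.to(torch.float16) * scale).reshape(shape)

w = torch.randn(4096, 4096, dtype=torch.float16)
codes, scale = quantize_blocks(w)                  # small on disk / in RAM
x = torch.randn(1, 4096, dtype=torch.float16)
w_fp16 = dequantize_blocks(codes, scale, w.shape)  # decompression overhead
y = x @ w_fp16.T                                   # compute is still plain PyTorch fp16
```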

BNB (NF4), on the other hand, is a computational acceleration library: it replaces PyTorch ops with native low-bit CUDA kernels, so the computation itself is faster.
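For reference, a rough sketch of what the bitsandbytes NF4 path looks like (this assumes bitsandbytes is installed with CUDA support; the calls below are the `quantize_4bit` / `dequantize_4bit` helpers from `bitsandbytes.functional`, not Forge's own conversion code):

```python
import torch
import bitsandbytes.functional as bnbF

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize to NF4: packed 4-bit codes plus per-block statistics
# (blocksize=64 is the bitsandbytes default).
w_nf4, quant_state = bnbF.quantize_4bit(w, blocksize=64, quant_type="nf4")

# At inference, bitsandbytes' CUDA kernels operate on the 4-bit data directly;
# dequantize_4bit is only needed if you want the fp16 weights back.
w_restored = bnbF.dequantize_4bit(w_nf4, quant_state)
print(w_nf4.dtype, w_nf4.shape)       # packed uint8 storage
print((w - w_restored).abs().mean())  # quantization error
```

In an actual conversion you would typically swap `nn.Linear` modules for `bitsandbytes.nn.Linear4bit` or save the quantized state dict, but the exact layout depends on what the downstream loader expects from an NF4 checkpoint.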

NF4 and Q4_0 should be very similar; the difference is that Q4_0 has a smaller chunk size and NF4 has more Gaussian-distributed quants. I do not recommend trusting comparisons based on one or two images. And I also want to have a smaller chunk …
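To make the "more Gaussian-distributed quants" point concrete, here is an illustrative comparison of the two kinds of 4-bit level grids (the NF4 codebook below is approximated from normal quantiles rather than copied from bitsandbytes, and the Q4_0 grid is simplified):

```python
import torch

# Q4_0-style levels: 16 uniformly spaced integers, scaled per block.
q4_levels = torch.arange(-8, 8, dtype=torch.float32) / 8.0

# NF4-style levels: 16 values placed at quantiles of a standard normal,
# so they are denser near zero where most weights live (approximation only).
normal = torch.distributions.Normal(0.0, 1.0)
probs = torch.linspace(0.02, 0.98, 16)
nf4_levels = normal.icdf(probs)
nf4_levels = nf4_levels / nf4_levels.abs().max()   # normalize to [-1, 1]

print("uniform (Q4_0-like): ", q4_levels.tolist())
print("gaussian (NF4-like):", nf4_levels.tolist())
```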
