<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: gameindie</title>
    <description>The latest articles on Forem by gameindie (@codesmart_1).</description>
    <link>https://forem.com/codesmart_1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1541407%2Fbc8a52fa-843e-4f36-8e43-e31192199d4f.png</url>
      <title>Forem: gameindie</title>
      <link>https://forem.com/codesmart_1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/codesmart_1"/>
    <language>en</language>
    <item>
      <title>Improve Inference Speed for Stable Diffusion Pipelines by 30%</title>
      <dc:creator>gameindie</dc:creator>
      <pubDate>Fri, 05 Jul 2024 17:36:54 +0000</pubDate>
      <link>https://forem.com/codesmart_1/improve-your-inference-speed-for-stable-diffusion-pipelines-2h4</link>
      <guid>https://forem.com/codesmart_1/improve-your-inference-speed-for-stable-diffusion-pipelines-2h4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I've been generating a lot of &lt;a href="https://nailarts.pro" rel="noopener noreferrer"&gt;nail art images&lt;/a&gt; for my image site lately. Using &lt;a href="https://github.com/siliconflow/onediff" rel="noopener noreferrer"&gt;OneDiff&lt;/a&gt;, I got a 30% speedup, and along the way I found several other techniques that improve Stable Diffusion inference speed, summarized below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Config
&lt;/h2&gt;

&lt;p&gt;Here are some key ways to optimize inference speed for Stable Diffusion pipelines:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Use half-precision (FP16) instead of full precision (FP32)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Load the model with &lt;code&gt;torch_dtype=torch.float16&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;This can provide up to a 60% speedup with minimal quality loss&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Enable TensorFloat-32 (TF32) on NVIDIA GPUs[1]:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
   &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backends&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allow_tf32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Use a distilled model[1]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller distilled models like "nota-ai/bk-sdm-small" can be 1.5-1.6x faster&lt;/li&gt;
&lt;li&gt;They maintain comparable quality to full models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Enable memory-efficient attention implementations[1]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use xFormers or PyTorch 2.0's scaled dot product attention&lt;/li&gt;
&lt;/ul&gt;
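Recent diffusers versions use PyTorch 2.0's scaled dot product attention by default, but the fused kernel can be exercised standalone (tensor shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

# Query/key/value tensors shaped (batch, heads, sequence, head_dim).
q = torch.randn(1, 8, 64, 40)
k = torch.randn(1, 8, 64, 40)
v = torch.randn(1, 8, 64, 40)

# Dispatches to a fused, memory-efficient kernel when one is available.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 64, 40])
```

On older PyTorch, the equivalent is `pipe.enable_xformers_memory_efficient_attention()` after installing xFormers.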

&lt;h3&gt;
  
  
  5. Use CUDA graphs to reduce CPU overhead[3]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Capture UNet, VAE and TextEncoder into CUDA graph format&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Apply DeepSpeed-Inference optimizations[2][4]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can provide 1.7x speedup with minimal code changes&lt;/li&gt;
&lt;li&gt;Fuses operations and uses optimized CUDA kernels&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7. Use torch.inference_mode() or torch.no_grad()[4]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Disables gradient computation for slight speedup&lt;/li&gt;
&lt;/ul&gt;
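For example, wrapping work in inference mode guarantees no autograd state is recorded (the pipeline call in the comment is a placeholder; the dtype-free behavior shown is plain PyTorch):

```python
import torch

x = torch.ones(3, requires_grad=True)

# Inside inference_mode, no graph is built and outputs carry no grad state.
with torch.inference_mode():
    y = x * 2
    print(y.requires_grad)  # False

# Diffusers pipelines already run under no_grad internally, but an explicit
# guard is harmless and documents intent, e.g.:
#     with torch.inference_mode():
#         image = pipe(prompt).images[0]
```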

&lt;h3&gt;
  
  
  8. Consider specialized libraries like stable-fast[3]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Provides CUDNN fusion, low precision ops, fused attention, etc.&lt;/li&gt;
&lt;li&gt;Claims significant speedups over other methods&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  9. Reduce the number of inference steps if quality allows
&lt;/h3&gt;

&lt;h3&gt;
  
  
  10. Use a larger batch size if memory permits
&lt;/h3&gt;

&lt;p&gt;By combining multiple optimizations, you can potentially reduce inference time from over 5 seconds to around 2-3 seconds for a single 512x512 image generation on high-end GPUs[1][2][4]. The exact speedup will depend on your specific hardware and model configuration.&lt;/p&gt;

&lt;p&gt;Citations:&lt;br&gt;
[1] &lt;a href="https://huggingface.co/docs/diffusers/en/optimization/fp16" rel="noopener noreferrer"&gt;https://huggingface.co/docs/diffusers/en/optimization/fp16&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://www.philschmid.de/stable-diffusion-deepspeed-inference" rel="noopener noreferrer"&gt;https://www.philschmid.de/stable-diffusion-deepspeed-inference&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://github.com/chengzeyi/stable-fast" rel="noopener noreferrer"&gt;https://github.com/chengzeyi/stable-fast&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://blog.cerebrium.ai/how-to-speed-up-stable-diffusion-to-a-2-second-inference-time-500x-improvement-d561c79a8952?gi=94a7e93c17f1" rel="noopener noreferrer"&gt;https://blog.cerebrium.ai/how-to-speed-up-stable-diffusion-to-a-2-second-inference-time-500x-improvement-d561c79a8952?gi=94a7e93c17f1&lt;/a&gt;&lt;br&gt;
[5] &lt;a href="https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl" rel="noopener noreferrer"&gt;https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Try Other Inference Runtime
&lt;/h2&gt;

&lt;p&gt;There are several compile backends that can improve inference speed for Stable Diffusion pipelines. Here are some key options:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. torch.compile:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Available in PyTorch 2.0+&lt;/li&gt;
&lt;li&gt;Can provide significant speedups with minimal code changes&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Example usage:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Compilation takes some time initially, but subsequent runs are faster[1]&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. OneDiff:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can provide a 30% speedup with minimal code changes for Diffusers pipelines&lt;/li&gt;
&lt;li&gt;Easy to integrate with Hugging Face Diffusers[2]&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. DeepSpeed-Inference:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can provide around 1.7x speedup with minimal code changes&lt;/li&gt;
&lt;li&gt;Optimizes operations and uses custom CUDA kernels&lt;/li&gt;
&lt;li&gt;Easy to integrate with Hugging Face Diffusers[2]&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. stable-fast:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Specialized optimization framework for Hugging Face Diffusers&lt;/li&gt;
&lt;li&gt;Implements techniques like CUDNN convolution fusion, low precision ops, fused attention, etc.&lt;/li&gt;
&lt;li&gt;Claims significant speedups over other methods&lt;/li&gt;
&lt;li&gt;Provides fast compilation within seconds, much quicker than torch.compile or TensorRT[4]&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. TensorRT:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA's deep learning inference optimizer and runtime&lt;/li&gt;
&lt;li&gt;Can provide substantial speedups but requires more setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. ONNX Runtime:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cross-platform inference acceleration&lt;/li&gt;
&lt;li&gt;Supports various hardware accelerators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When choosing a compile backend, consider factors like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ease of integration&lt;/li&gt;
&lt;li&gt;Compilation time&lt;/li&gt;
&lt;li&gt;Compatibility with your specific model and hardware&lt;/li&gt;
&lt;li&gt;Performance gains for your particular use case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Stable Diffusion specifically, stable-fast seems promising as it's optimized for Diffusers and claims fast compilation times[4]. However, torch.compile is also a solid choice for its ease of use and good performance gains[1]. DeepSpeed-Inference is another strong contender, especially if you're already using the Hugging Face ecosystem[2].&lt;/p&gt;

&lt;p&gt;Remember that the effectiveness of these optimizations can vary depending on your specific hardware, model, and inference settings. It's often worth benchmarking multiple options to find the best fit for your particular use case.&lt;/p&gt;

&lt;p&gt;Citations:&lt;br&gt;
[1] &lt;a href="https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl" rel="noopener noreferrer"&gt;https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions/examples/sd3" rel="noopener noreferrer"&gt;https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions/examples/sd3&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://www.philschmid.de/stable-diffusion-deepspeed-inference" rel="noopener noreferrer"&gt;https://www.philschmid.de/stable-diffusion-deepspeed-inference&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://www.youtube.com/watch?v=AKBelBkPHYk" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=AKBelBkPHYk&lt;/a&gt;&lt;br&gt;
[5] &lt;a href="https://github.com/chengzeyi/stable-fast" rel="noopener noreferrer"&gt;https://github.com/chengzeyi/stable-fast&lt;/a&gt;&lt;br&gt;
[6] &lt;a href="https://www.reddit.com/r/StableDiffusion/comments/18lvwja/stablefast_v1_2x_speedup_for_svd_stable_video/" rel="noopener noreferrer"&gt;https://www.reddit.com/r/StableDiffusion/comments/18lvwja/stablefast_v1_2x_speedup_for_svd_stable_video/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Create PalWorld server on Linux with docker</title>
      <dc:creator>gameindie</dc:creator>
      <pubDate>Fri, 31 May 2024 11:26:21 +0000</pubDate>
      <link>https://forem.com/codesmart_1/create-palworld-server-on-linux-with-docker-5db6</link>
      <guid>https://forem.com/codesmart_1/create-palworld-server-on-linux-with-docker-5db6</guid>
      <description>&lt;p&gt;Reference to &lt;a href="https://www.palworldtravel.com/creating-a-cheap-palworld-dedicated-server"&gt;creating-a-cheap-palworld-dedicated-server&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, I will guide you through the steps to set up a PalWorld dedicated server with Docker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Linux-based system&lt;/li&gt;
&lt;li&gt;Docker installed on your system&lt;/li&gt;
&lt;li&gt;Basic knowledge of command-line operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Download the Docker Image
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull thijsvanloef/palworld-server-docker:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Configure and Run the Docker Container
&lt;/h2&gt;

&lt;p&gt;Run the PalWorld server container using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; palworld-server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 8211:8211/udp &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 27015:27015/udp &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; ./palworld:/palworld/ &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PUID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PGID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8211 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PLAYERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;16 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MULTITHREADING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;RCON_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;RCON_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;25575 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;TZ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;UTC &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ADMIN_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"adminPasswordHere"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SERVER_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"worldofpals"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;COMMUNITY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SERVER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"palworld-server-docker by Thijs van Loef"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SERVER_DESCRIPTION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"palworld-server-docker by Thijs van Loef"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--stop-timeout&lt;/span&gt; 30 &lt;span class="se"&gt;\&lt;/span&gt;
    thijsvanloef/palworld-server-docker:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parameter explanations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;•    ﻿-d runs the server in detached mode, freeing up your terminal.
•    ﻿--name palworld-server assigns a distinct name to the container.
•    ﻿-p 8211:8211/udp and ﻿-p 27015:27015/udp map the necessary UDP ports from the container to your host machine.
•    ﻿-v ./palworld:/palworld/ sets a volume linking a host system directory to a corresponding directory within the container.
•    The ﻿--restart unless-stopped flag ensures the server resumes operation after unexpected shutdowns or reboots.
•    ﻿--stop-timeout 30 is a grace period for the server to shut down cleanly before Docker forces it to stop.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
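The same flags can also be kept in a `docker-compose.yml` instead of a long `docker run` command; a sketch mirroring the options above (adjust passwords, ports, and player count to taste):

```yaml
services:
  palworld:
    image: thijsvanloef/palworld-server-docker:latest
    container_name: palworld-server
    restart: unless-stopped
    stop_grace_period: 30s        # mirrors --stop-timeout 30
    ports:
      - "8211:8211/udp"
      - "27015:27015/udp"
    volumes:
      - ./palworld:/palworld/
    environment:
      PUID: "1000"
      PGID: "1000"
      PORT: "8211"
      PLAYERS: "16"
      MULTITHREADING: "true"
      RCON_ENABLED: "true"
      RCON_PORT: "25575"
      TZ: "UTC"
      ADMIN_PASSWORD: "adminPasswordHere"
      SERVER_PASSWORD: "worldofpals"
      COMMUNITY: "false"
      SERVER_NAME: "palworld-server-docker by Thijs van Loef"
      SERVER_DESCRIPTION: "palworld-server-docker by Thijs van Loef"
```

Start it with `docker compose up -d` from the directory containing the file.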
&lt;h2&gt;
  
  
  Step 3: Check the Running Server
&lt;/h2&gt;

&lt;p&gt;Ensure everything is working as expected by checking the container’s logs:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &lt;span class="nt"&gt;-f&lt;/span&gt; palworld-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>gamedev</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
