<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: gameindie</title>
    <description>The latest articles on Forem by gameindie (@codesmart_1).</description>
    <link>https://forem.com/codesmart_1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1541407%2Fbc8a52fa-843e-4f36-8e43-e31192199d4f.png</url>
      <title>Forem: gameindie</title>
      <link>https://forem.com/codesmart_1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/codesmart_1"/>
    <language>en</language>
    <item>
      <title>Improve Inference Speed for Stable Diffusion Pipelines by 30%</title>
      <dc:creator>gameindie</dc:creator>
      <pubDate>Fri, 05 Jul 2024 17:36:54 +0000</pubDate>
      <link>https://forem.com/codesmart_1/improve-your-inference-speed-for-stable-diffusion-pipelines-2h4</link>
      <guid>https://forem.com/codesmart_1/improve-your-inference-speed-for-stable-diffusion-pipelines-2h4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I've been generating a lot of &lt;a href="https://nailarts.pro" rel="noopener noreferrer"&gt;nail art images&lt;/a&gt; for my image site lately. Using &lt;a href="https://github.com/siliconflow/onediff" rel="noopener noreferrer"&gt;OneDiff&lt;/a&gt;, I got a 30% speedup, and along the way I found several other techniques that improve Stable Diffusion inference speed, summarized below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Config
&lt;/h2&gt;

&lt;p&gt;Here are some key ways to optimize inference speed for Stable Diffusion pipelines:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Use half-precision (FP16) instead of full precision (FP32)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Load the model with &lt;code&gt;torch_dtype=torch.float16&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;This can provide up to a 60% speedup with minimal quality loss&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Enable TensorFloat-32 (TF32) on NVIDIA GPUs[1]:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
   &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backends&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matmul&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allow_tf32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Use a distilled model[1]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller distilled models like "nota-ai/bk-sdm-small" can be 1.5-1.6x faster&lt;/li&gt;
&lt;li&gt;They maintain comparable quality to full models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Enable memory-efficient attention implementations[1]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use xFormers or PyTorch 2.0's scaled dot product attention&lt;/li&gt;
&lt;/ul&gt;
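Recent diffusers versions use PyTorch 2.0's scaled dot product attention by default, but the fused kernel can be exercised standalone (tensor shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

# Query/key/value tensors shaped (batch, heads, sequence, head_dim).
q = torch.randn(1, 8, 64, 40)
k = torch.randn(1, 8, 64, 40)
v = torch.randn(1, 8, 64, 40)

# Dispatches to a fused, memory-efficient kernel when one is available.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 64, 40])
```

On older PyTorch, the equivalent is `pipe.enable_xformers_memory_efficient_attention()` after installing xFormers.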

&lt;h3&gt;
  
  
  5. Use CUDA graphs to reduce CPU overhead[3]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Capture UNet, VAE and TextEncoder into CUDA graph format&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Apply DeepSpeed-Inference optimizations[2][4]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can provide 1.7x speedup with minimal code changes&lt;/li&gt;
&lt;li&gt;Fuses operations and uses optimized CUDA kernels&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7. Use torch.inference_mode() or torch.no_grad()[4]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Disables gradient computation for slight speedup&lt;/li&gt;
&lt;/ul&gt;
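For example, wrapping work in inference mode guarantees no autograd state is recorded (the pipeline call in the comment is a placeholder; the dtype-free behavior shown is plain PyTorch):

```python
import torch

x = torch.ones(3, requires_grad=True)

# Inside inference_mode, no graph is built and outputs carry no grad state.
with torch.inference_mode():
    y = x * 2
    print(y.requires_grad)  # False

# Diffusers pipelines already run under no_grad internally, but an explicit
# guard is harmless and documents intent, e.g.:
#     with torch.inference_mode():
#         image = pipe(prompt).images[0]
```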

&lt;h3&gt;
  
  
  8. Consider specialized libraries like stable-fast[3]:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Provides CUDNN fusion, low precision ops, fused attention, etc.&lt;/li&gt;
&lt;li&gt;Claims significant speedups over other methods&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  9. Reduce the number of inference steps if quality allows
&lt;/h3&gt;

&lt;h3&gt;
  
  
  10. Use a larger batch size if memory permits
&lt;/h3&gt;

&lt;p&gt;By combining multiple optimizations, you can potentially reduce inference time from over 5 seconds to around 2-3 seconds for a single 512x512 image generation on high-end GPUs[1][2][4]. The exact speedup will depend on your specific hardware and model configuration.&lt;/p&gt;

&lt;p&gt;Citations:&lt;br&gt;
[1] &lt;a href="https://huggingface.co/docs/diffusers/en/optimization/fp16" rel="noopener noreferrer"&gt;https://huggingface.co/docs/diffusers/en/optimization/fp16&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://www.philschmid.de/stable-diffusion-deepspeed-inference" rel="noopener noreferrer"&gt;https://www.philschmid.de/stable-diffusion-deepspeed-inference&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://github.com/chengzeyi/stable-fast" rel="noopener noreferrer"&gt;https://github.com/chengzeyi/stable-fast&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://blog.cerebrium.ai/how-to-speed-up-stable-diffusion-to-a-2-second-inference-time-500x-improvement-d561c79a8952?gi=94a7e93c17f1" rel="noopener noreferrer"&gt;https://blog.cerebrium.ai/how-to-speed-up-stable-diffusion-to-a-2-second-inference-time-500x-improvement-d561c79a8952?gi=94a7e93c17f1&lt;/a&gt;&lt;br&gt;
[5] &lt;a href="https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl" rel="noopener noreferrer"&gt;https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Try Other Inference Runtime
&lt;/h2&gt;

&lt;p&gt;There are several compile backends that can improve inference speed for Stable Diffusion pipelines. Here are some key options:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. torch.compile:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Available in PyTorch 2.0+&lt;/li&gt;
&lt;li&gt;Can provide significant speedups with minimal code changes&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Example usage:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reduce-overhead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Compilation takes some time initially, but subsequent runs are faster[1]&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. OneDiff:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can provide a 30% speedup with minimal code changes for Diffusers pipelines&lt;/li&gt;
&lt;li&gt;Easy to integrate with Hugging Face Diffusers[2]&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. DeepSpeed-Inference:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can provide around 1.7x speedup with minimal code changes&lt;/li&gt;
&lt;li&gt;Optimizes operations and uses custom CUDA kernels&lt;/li&gt;
&lt;li&gt;Easy to integrate with Hugging Face Diffusers[2]&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. stable-fast:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Specialized optimization framework for Hugging Face Diffusers&lt;/li&gt;
&lt;li&gt;Implements techniques like CUDNN convolution fusion, low precision ops, fused attention, etc.&lt;/li&gt;
&lt;li&gt;Claims significant speedups over other methods&lt;/li&gt;
&lt;li&gt;Provides fast compilation within seconds, much quicker than torch.compile or TensorRT[4]&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. TensorRT:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA's deep learning inference optimizer and runtime&lt;/li&gt;
&lt;li&gt;Can provide substantial speedups but requires more setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. ONNX Runtime:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cross-platform inference acceleration&lt;/li&gt;
&lt;li&gt;Supports various hardware accelerators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When choosing a compile backend, consider factors like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ease of integration&lt;/li&gt;
&lt;li&gt;Compilation time&lt;/li&gt;
&lt;li&gt;Compatibility with your specific model and hardware&lt;/li&gt;
&lt;li&gt;Performance gains for your particular use case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Stable Diffusion specifically, stable-fast seems promising as it's optimized for Diffusers and claims fast compilation times[4]. However, torch.compile is also a solid choice for its ease of use and good performance gains[1]. DeepSpeed-Inference is another strong contender, especially if you're already using the Hugging Face ecosystem[2].&lt;/p&gt;

&lt;p&gt;Remember that the effectiveness of these optimizations can vary depending on your specific hardware, model, and inference settings. It's often worth benchmarking multiple options to find the best fit for your particular use case.&lt;/p&gt;

&lt;p&gt;Citations:&lt;br&gt;
[1] &lt;a href="https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl" rel="noopener noreferrer"&gt;https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions/examples/sd3" rel="noopener noreferrer"&gt;https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions/examples/sd3&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://www.philschmid.de/stable-diffusion-deepspeed-inference" rel="noopener noreferrer"&gt;https://www.philschmid.de/stable-diffusion-deepspeed-inference&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://www.youtube.com/watch?v=AKBelBkPHYk" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=AKBelBkPHYk&lt;/a&gt;&lt;br&gt;
[5] &lt;a href="https://github.com/chengzeyi/stable-fast" rel="noopener noreferrer"&gt;https://github.com/chengzeyi/stable-fast&lt;/a&gt;&lt;br&gt;
[6] &lt;a href="https://www.reddit.com/r/StableDiffusion/comments/18lvwja/stablefast_v1_2x_speedup_for_svd_stable_video/" rel="noopener noreferrer"&gt;https://www.reddit.com/r/StableDiffusion/comments/18lvwja/stablefast_v1_2x_speedup_for_svd_stable_video/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Create PalWorld server on Linux with docker</title>
      <dc:creator>gameindie</dc:creator>
      <pubDate>Fri, 31 May 2024 11:26:21 +0000</pubDate>
      <link>https://forem.com/codesmart_1/create-palworld-server-on-linux-with-docker-5db6</link>
      <guid>https://forem.com/codesmart_1/create-palworld-server-on-linux-with-docker-5db6</guid>
      <description>&lt;p&gt;Reference to &lt;a href="https://www.palworldtravel.com/creating-a-cheap-palworld-dedicated-server"&gt;creating-a-cheap-palworld-dedicated-server&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, I will guide you through the steps to set up a PalWorld dedicated server with Docker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Linux-based system&lt;/li&gt;
&lt;li&gt;Docker installed on your system&lt;/li&gt;
&lt;li&gt;Basic knowledge of command-line operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Download the Docker Image
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull thijsvanloef/palworld-server-docker:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Configure and Run the Docker Container
&lt;/h2&gt;

&lt;p&gt;Run the PalWorld server container using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; palworld-server &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 8211:8211/udp &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 27015:27015/udp &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; ./palworld:/palworld/ &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PUID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PGID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8211 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PLAYERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;16 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MULTITHREADING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;RCON_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;RCON_PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;25575 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;TZ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;UTC &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ADMIN_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"adminPasswordHere"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SERVER_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"worldofpals"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;COMMUNITY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SERVER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"palworld-server-docker by Thijs van Loef"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SERVER_DESCRIPTION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"palworld-server-docker by Thijs van Loef"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--restart&lt;/span&gt; unless-stopped &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--stop-timeout&lt;/span&gt; 30 &lt;span class="se"&gt;\&lt;/span&gt;
    thijsvanloef/palworld-server-docker:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parameter explanations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;•    ﻿-d runs the server in detached mode, freeing up your terminal.
•    ﻿--name palworld-server assigns a distinct name to the container.
•    ﻿-p 8211:8211/udp and ﻿-p 27015:27015/udp map the necessary UDP ports from the container to your host machine.
•    ﻿-v ./palworld:/palworld/ sets a volume linking a host system directory to a corresponding directory within the container.
•    The ﻿--restart unless-stopped flag ensures the server resumes operation after unexpected shutdowns or reboots.
•    ﻿--stop-timeout 30 is a grace period for the server to shut down cleanly before Docker forces it to stop.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
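The same flags can also be kept in a `docker-compose.yml` instead of a long `docker run` command; a sketch mirroring the options above (adjust passwords, ports, and player count to taste):

```yaml
services:
  palworld:
    image: thijsvanloef/palworld-server-docker:latest
    container_name: palworld-server
    restart: unless-stopped
    stop_grace_period: 30s        # mirrors --stop-timeout 30
    ports:
      - "8211:8211/udp"
      - "27015:27015/udp"
    volumes:
      - ./palworld:/palworld/
    environment:
      PUID: "1000"
      PGID: "1000"
      PORT: "8211"
      PLAYERS: "16"
      MULTITHREADING: "true"
      RCON_ENABLED: "true"
      RCON_PORT: "25575"
      TZ: "UTC"
      ADMIN_PASSWORD: "adminPasswordHere"
      SERVER_PASSWORD: "worldofpals"
      COMMUNITY: "false"
      SERVER_NAME: "palworld-server-docker by Thijs van Loef"
      SERVER_DESCRIPTION: "palworld-server-docker by Thijs van Loef"
```

Start it with `docker compose up -d` from the directory containing the file.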
&lt;h2&gt;
  
  
  Step 3: Check the Running Server
&lt;/h2&gt;

&lt;p&gt;Ensure everything is working as expected by checking the container’s logs:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &lt;span class="nt"&gt;-f&lt;/span&gt; palworld-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>gamedev</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
