<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Zahin Tapadar</title>
    <description>The latest articles on Forem by Zahin Tapadar (@thefalkonguy).</description>
    <link>https://forem.com/thefalkonguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3743609%2Fa4d5a9e6-5a9d-4417-a889-3ffe8b5b6a14.jpg</url>
      <title>Forem: Zahin Tapadar</title>
      <link>https://forem.com/thefalkonguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thefalkonguy"/>
    <language>en</language>
    <item>
      <title>Installing Qwen 3.5 on Apple Silicon Using MLX for 2X Performance</title>
      <dc:creator>Zahin Tapadar</dc:creator>
      <pubDate>Tue, 03 Mar 2026 17:02:13 +0000</pubDate>
      <link>https://forem.com/thefalkonguy/installing-qwen-35-on-apple-silicon-using-mlx-for-2x-performance-37ma</link>
      <guid>https://forem.com/thefalkonguy/installing-qwen-35-on-apple-silicon-using-mlx-for-2x-performance-37ma</guid>
      <description>&lt;p&gt;&lt;strong&gt;Preface&lt;/strong&gt;&lt;br&gt;
Apple Silicon has rapidly emerged as a major platform for machine learning development and deployment. With unified memory architectures supporting up to 192 GB of shared CPU and GPU memory and memory bandwidth exceeding 400 GB/s, recent Mac devices provide substantial capability for running large language models locally. This has increased interest in efficient inference frameworks tailored specifically to Apple hardware, particularly for development workflows, privacy-sensitive applications, and edge deployment.&lt;/p&gt;

&lt;p&gt;Existing inference solutions, however, present structural limitations. PyTorch’s MPS backend adapts CUDA-style execution to Metal but does not fully exploit the unified memory model. llama.cpp delivers strong performance for text-only models yet lacks support for multimodal architectures. vLLM-metal provides continuous batching but does not include multimodal execution or vision caching support. As a result, the current ecosystem remains fragmented.&lt;/p&gt;

&lt;p&gt;Recent advances in MLX introduce a native execution framework specifically designed for Apple Silicon. By leveraging unified memory and Metal acceleration directly, MLX enables zero-copy tensor operations and optimized kernel execution. Models compiled with MLX demonstrate significantly higher throughput compared to standard PyTorch or GGUF-based builds.&lt;/p&gt;

&lt;p&gt;On M-series systems, MLX-optimized builds of Qwen 3.5 achieve approximately 2x token generation speed relative to baseline implementations on identical hardware.&lt;/p&gt;

&lt;p&gt;This article demonstrates how to deploy Qwen 3.5 using an MLX build through LM Studio, providing a simplified interface for high-performance local inference on Apple Silicon.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y7niwpyp1rz0k5x5m86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y7niwpyp1rz0k5x5m86.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;MLX Outperforms llama.cpp&lt;/strong&gt;&lt;br&gt;
Benchmark results show vllm-mlx consistently exceeding llama.cpp throughput by 21% to 87%. This is attributed to three factors: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;MLX’s native unified memory design enables zero-copy tensor operations, avoiding the memory transfer overhead present in llama.cpp’s Metal backend; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MLX’s lazy evaluation allows operation fusion and reduces kernel launch overhead; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;vllm-mlx’s continuous batching scheduler maximizes GPU utilization by processing multiple sequences simultaneously.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
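&lt;p&gt;Factor 2 can be illustrated with a minimal sketch of MLX’s lazy evaluation. This is a hedged example, not benchmark code: it assumes the &lt;em&gt;mlx&lt;/em&gt; package is installed and runs only on Apple Silicon.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sketch of MLX lazy evaluation (assumes mlx is installed;
# runs only on Apple Silicon).
import mlx.core as mx

a = mx.ones((1024, 1024))
b = mx.ones((1024, 1024))

# No computation happens here: c is a node in a lazy graph, which
# lets MLX fuse these operations into fewer Metal kernel launches.
c = (a + b) * 2

# mx.eval() forces materialization of the fused graph.
mx.eval(c)
print(c.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because arrays live in unified memory, no explicit host-to-device copies are needed before or after evaluation.&lt;/p&gt;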

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3poh9gjcfzma9nk1tu4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3poh9gjcfzma9nk1tu4d.png" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://lmstudio.ai/" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt; provides a graphical interface for downloading and running local models without manual configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpa4x9qaekh8lompw0w1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpa4x9qaekh8lompw0w1.png" alt=" " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Enter Qwen 3.5 MLX in the search field.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select a model (0.8B, 2B, 4B, or 9B) explicitly labeled as MLX. By default, LM Studio indicates which models your device is capable of running.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzt6a5v2apzh8a0vajo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzt6a5v2apzh8a0vajo2.png" alt=" " width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alternatives&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This method uses MLX to run Qwen 3.5 models directly on Apple Silicon hardware. MLX is designed specifically for M-series chips and provides GPU acceleration through the Metal backend while fully exploiting unified CPU/GPU memory. This eliminates explicit memory transfers and improves execution efficiency. MLX supports M1, M2, M3, and newer M-series devices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing MLX on macOS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inference support for Qwen models is provided via the mlx-lm Python package.&lt;/p&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;macOS running on Apple Silicon&lt;/li&gt;
&lt;li&gt;Python 3.11 or newer
&lt;/li&gt;
&lt;/ol&gt;
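&lt;p&gt;Both requirements can be verified before installing. The helper below is a hypothetical stdlib-only check, not part of mlx-lm:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Quick prerequisite check (stdlib only): Apple Silicon + Python 3.11+.
# check_prereqs is a hypothetical helper for illustration.
import platform
import sys

def check_prereqs(machine=None, version=None):
    """Return (ok, messages) for the two MLX requirements."""
    machine = machine or platform.machine()
    version = version or sys.version_info[:2]
    msgs = []
    if machine != "arm64":
        msgs.append(f"Apple Silicon (arm64) required, found {machine}")
    if version &lt; (3, 11):
        msgs.append(f"Python 3.11+ required, found {version[0]}.{version[1]}")
    return (not msgs, msgs)

ok, msgs = check_prereqs()
print("OK" if ok else "; ".join(msgs))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;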
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a clean virtual environment (recommended)
python3 -m venv mlx-qwen
source mlx-qwen/bin/activate

# Install the latest mlx-lm (handles text-only Qwen 3.5 models well)
pip install mlx-lm

# Optional: for vision-language Qwen3.5 models
pip install mlx-vlm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;MLX is currently supported only on Apple Silicon (M-series) systems; Intel Macs are not supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading Qwen 3.5 Models with MLX&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 0.8B model is natively multimodal and highly efficient. You can run it in two primary modes.&lt;br&gt;
For interactive chat:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mlx_lm.chat --model Qwen/Qwen3.5-0.8B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For server mode:&lt;br&gt;
Use this to host an OpenAI-compatible API (at &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;) for integration with custom web UIs or other clients.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mlx_lm.server --model Qwen/Qwen3.5-0.8B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
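&lt;p&gt;Once the server is running, any OpenAI-style client can talk to it. Below is a minimal stdlib-only sketch; the port and the /v1/chat/completions route assume the mlx_lm.server defaults, so adjust them if you changed the configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal client for the local OpenAI-compatible endpoint.
# Assumes mlx_lm.server is running on localhost:8080.
import json
import urllib.request

def build_chat_payload(model, user_message, max_tokens=128):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

def chat(model, user_message):
    body = json.dumps(build_chat_payload(model, user_message)).encode()
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Qwen/Qwen3.5-0.8B", "Say hello in one sentence."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;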



&lt;p&gt;While MLX can load standard Hugging Face repositories, quantized models from the mlx-community organization are recommended for systems with 16 GB of RAM.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quantization: Using 4-bit or 8-bit variants (e.g., mlx-community/Qwen2.5-1.5B-Instruct-4bit) significantly reduces VRAM usage and improves generation speed.&lt;/li&gt;
&lt;li&gt;Unified Memory: Because the GPU and CPU share the same 16GB pool, ensure high-memory applications (like Chrome) are closed when running larger models to prevent Metal "Out of Memory" errors.&lt;/li&gt;
&lt;/ul&gt;
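&lt;p&gt;The same quantized models can also be driven from Python through the mlx-lm API. A minimal sketch, assuming mlx-lm is installed on Apple Silicon (the model weights download on first use):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: generate text from a 4-bit quantized model with mlx-lm.
# Requires mlx-lm on Apple Silicon; downloads weights on first run.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-1.5B-Instruct-4bit")

# Format the prompt with the model's chat template.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain unified memory in one sentence."}],
    add_generation_prompt=True,
    tokenize=False,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;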

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
