<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aditi Bindal</title>
    <description>The latest articles on Forem by Aditi Bindal (@aditi_b).</description>
    <link>https://forem.com/aditi_b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2426053%2F079f8e5d-9e23-483c-8386-c68a45b5c3d0.jpg</url>
      <title>Forem: Aditi Bindal</title>
      <link>https://forem.com/aditi_b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aditi_b"/>
    <language>en</language>
    <item>
      <title>A Step-by-Step Guide to Install Qwen3-Next 80B</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 22 Sep 2025 07:15:58 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/a-step-by-step-guide-to-install-qwen3-next-80b-3dho</link>
      <guid>https://forem.com/nodeshiftcloud/a-step-by-step-guide-to-install-qwen3-next-80b-3dho</guid>
      <description>&lt;p&gt;If you're relentlessly following AI advancements, one thing can be clearly observed, the trend has been simple: go bigger. However, the new Qwen3-Next-80B series models challenges this paradigm by focusing on groundbreaking efficiency rather than just raw scale. This model represents a monumental leap forward, delivering the performance of a much larger model with a fraction of the computational cost. At its core is a revolutionary Hybrid Attention mechanism, to process ultra-long context lengths, natively supporting 262,144 tokens and extensible to over a million. This is paired with a High-Sparsity Mixture-of-Experts (MoE) architecture that keeps a staggering 80 billion total parameters on tap while only activating 3 billion at any given time. The result? Drastically reduced computational load, leading to inference speeds up to 10 times faster than its predecessors on long-context tasks. With additional enhancements like Multi-Token Prediction for accelerated performance and advanced stability optimizations, Qwen3-Next-80B proves its worth by outperforming models like Qwen3-32B with only 10% of the training cost and performing on par with models on key reasoning, coding, and alignment benchmarks.&lt;/p&gt;

&lt;p&gt;In this article, we'll walk through the installation, setup, and usage of this model, step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 2x H200s or 4x H100s&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 1TB+ (preferred)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 160GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
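&lt;p&gt;The VRAM figure roughly follows from the weight footprint alone: 80 billion parameters stored in bf16 take 2 bytes each. A back-of-the-envelope check (note that activations and the KV cache add further overhead on top of this):&lt;/p&gt;

```python
# Rough weight-only memory footprint of an 80B-parameter model in bf16.
params = 80e9         # total parameters
bytes_per_param = 2   # bf16 / fp16

weights_gb = params * bytes_per_param / 1e9
print(weights_gb)  # 160.0 -> matches the "at least 160GB" VRAM requirement
```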

&lt;h2&gt;
  
  
  Step-by-step process to install Qwen3-Next-80B-A3B Locally
&lt;/h2&gt;

&lt;p&gt;For this tutorial, we’ll use a GPU-powered Virtual Machine from NodeShift, since it provides high-compute Virtual Machines at a very affordable cost, at a scale that meets GDPR, SOC2, and ISO27001 requirements. It also offers an intuitive, user-friendly interface, making it easier for beginners to get started with cloud deployments. That said, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines from NodeShift. They are highly customizable, letting you configure the GPUs (ranging from H100s to A100s), CPUs, RAM, and storage according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using a 2x H200 GPU; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 5 TB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ex5301m0jmnmar6gbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ex5301m0jmnmar6gbe.png" alt="Image-step3-1" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region and according to (or very close to) your configuration. In our case, we’ll choose a 2x H200 140GB GPU node with 192vCPUs/504GB RAM/5TB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cbh4j66eg49118de0mi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cbh4j66eg49118de0mi.png" alt="Image-step4-1" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning your Compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvw5wtv572xfkv01hoyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvw5wtv572xfkv01hoyp.png" alt="Image-step6-1" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for confirmation before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n qwen python=3.11 -y &amp;amp;&amp;amp; conda activate qwen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flriuy01pg5znv68e8o5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flriuy01pg5znv68e8o5c.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 
pip install git+https://github.com/huggingface/transformers.git@main
pip install git+https://github.com/huggingface/accelerate
pip install huggingface_hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7x6myqz7f0exmlawxik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7x6myqz7f0exmlawxik.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Log in to Hugging Face with your HF READ token.&lt;/p&gt;

&lt;p&gt;This is a gated model; make sure you have been granted access from the model card.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hf auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8ft3orv1ku3qro1jdpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8ft3orv1ku3qro1jdpn.png" alt="Image-step7-3" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4) Install and run Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5) If you’re on a remote machine (e.g., a NodeShift GPU node), you’ll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the PORT allotted to your remote server (For the NodeShift server – you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb4ojg5ic1gib6uigxtb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb4ojg5ic1gib6uigxtb.png" alt="Image-step7-4" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server's terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz74xo2mne7xx5flisla6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz74xo2mne7xx5flisla6.png" alt="Image-step7-5" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjeb8u2ttf96pxi3enag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjeb8u2ttf96pxi3enag.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoints.&lt;/p&gt;

&lt;p&gt;To download the thinking model, just replace the model_name value with &lt;code&gt;"Qwen/Qwen3-Next-80B-A3B-Thinking"&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsnbi60mg9ihs7qyy6k2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsnbi60mg9ihs7qyy6k2.png" alt="Image-step8-2" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Run the model for inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for inferencing with the thinking model, use the following snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (&amp;lt;/think&amp;gt;)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content) # no opening &amp;lt;think&amp;gt; tag
print("content:", content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2tngy25r69t5d6tkd11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2tngy25r69t5d6tkd11.png" alt="Image-step8-3" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;
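&lt;p&gt;To see what the parsing step in the thinking snippet does, here is a minimal standalone illustration with a dummy token list in place of real model output (151668 is the id of the end-of-thinking token in Qwen3's vocabulary, as used above):&lt;/p&gt;

```python
# Standalone illustration of the thinking/content split used above,
# with a dummy token list instead of real model output.
THINK_END = 151668  # id of the end-of-thinking token in Qwen3's vocabulary

output_ids = [101, 202, THINK_END, 303, 404]  # dummy generated token ids

try:
    # find the position just after the LAST end-of-thinking token
    index = len(output_ids) - output_ids[::-1].index(THINK_END)
except ValueError:
    index = 0  # token not found: treat everything as final content

thinking_ids = output_ids[:index]  # includes the end-of-thinking token itself
content_ids = output_ids[index:]

print(thinking_ids)  # [101, 202, 151668]
print(content_ids)   # [303, 404]
```

&lt;p&gt;Decoding the two slices separately, as the snippet above does, gives you the model's reasoning trace and its final answer as distinct strings.&lt;/p&gt;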

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Qwen3-Next-80B model represents a shift in AI development, prioritizing efficiency over raw scale through its Hybrid Attention and High-Sparsity Mixture-of-Experts (MoE) architecture. This allows it to achieve high performance at a fraction of the computational load, handling massive context lengths with much faster inference. NodeShift Cloud helps make this technology accessible and practical by providing a cost-effective, secure platform for deploying and running such compute-intensive models. By offering affordable GPU resources, NodeShift Cloud lets developers and businesses leverage models like Qwen3-Next-80B without the prohibitive costs and infrastructure management typically associated with large-scale AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>qwen</category>
    </item>
    <item>
      <title>Generate Expressive, Long Form Multi-Speaker Audios &amp; Podcasts with Microsoft's VibeVoice</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Wed, 03 Sep 2025 16:48:39 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/generate-expressive-long-form-multi-speaker-audios-podcasts-with-microsofts-vibevoice-4hfp</link>
      <guid>https://forem.com/nodeshiftcloud/generate-expressive-long-form-multi-speaker-audios-podcasts-with-microsofts-vibevoice-4hfp</guid>
      <description>&lt;p&gt;If you're looking for an open-source text-to-speech system that can generate podcasts, audiobooks, or multi-speaker conversations that actually sound real, Microsoft’s VibeVoice is a model you’ll want to try. Unlike traditional TTS systems that often feel robotic, inconsistent, or restricted to short clips, VibeVoice is designed from the ground up to produce expressive, long-form, multi-speaker audio with remarkable naturalness and flow. It can synthesize speech lasting up to 90 minutes and seamlessly handle up to four distinct speakers, an impressive upgrade over most existing models that struggle to maintain quality beyond a few minutes or across more than two voices. What makes this possible is its continuous speech tokenizers (acoustic and semantic) that operate at a very low frame rate (7.5 Hz), preserving audio richness while drastically reducing computation. On top of this, the model uses a next-token diffusion framework, powered by a Qwen2.5-based LLM, to understand dialogue context and generate nuanced turn-taking, while a lightweight diffusion head ensures high-fidelity acoustic detail. The result: smooth, consistent, and lifelike conversations that feel like they were recorded, not generated.&lt;/p&gt;

&lt;p&gt;In this guide, we cover a simple, step-by-step walkthrough of how to get this model up and running locally or in a GPU-accelerated environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x RTX 4090 or 1x RTX A6000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50GB (preferred)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 16GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run VibeVoice
&lt;/h2&gt;

&lt;p&gt;For this tutorial, we’ll use a GPU-powered Virtual Machine from &lt;a href="https://nodeshift.com" rel="noopener noreferrer"&gt;NodeShift&lt;/a&gt;, since it provides high-compute Virtual Machines at a very affordable cost, at a scale that meets GDPR, SOC2, and ISO27001 requirements. It also offers an intuitive, user-friendly interface, making it easier for beginners to get started with cloud deployments. That said, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines from NodeShift. They are highly customizable, letting you configure the GPUs (ranging from H100s to A100s), CPUs, RAM, and storage according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using a 1x A100 SXM4 GPU; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 100GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" alt="Image-step3-1" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or closely match) your configuration. In our case, we’ll choose a 1x RTXA6000 GPU node with 64vCPUs/63GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" alt="Image-step4-1" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, running a CUDA-dependent application like VibeVoice requires a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.&lt;/p&gt;

&lt;p&gt;We chose the following image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This image is essential because it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full CUDA toolkit (including nvcc)&lt;/li&gt;
&lt;li&gt;Proper support for building and running GPU-based applications&lt;/li&gt;
&lt;li&gt;Compatibility with CUDA 12.1.1 required by certain model operations&lt;/li&gt;
&lt;/ul&gt;
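Once the node is up, a quick way to confirm you actually landed in the devel image (and not a runtime-only variant) is to check for `nvcc`. A minimal sketch; it simply reports what it finds:

```shell
# Sanity check: the -devel image ships nvcc; runtime-only images do not.
if command -v nvcc >/dev/null 2>&1; then
  CUDA_STATUS="$(nvcc --version | grep -i release)"
else
  CUDA_STATUS="nvcc not found - this is likely not a CUDA devel image"
fi
echo "$CUDA_STATUS"
```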

&lt;h4&gt;
  
  
  Launch Mode
&lt;/h4&gt;

&lt;p&gt;We selected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Interactive shell server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhol4bqjn2f6zi2sv4ab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhol4bqjn2f6zi2sv4ab.png" alt="Image-step5-2" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Repository Authentication
&lt;/h4&gt;

&lt;p&gt;We left all fields &lt;strong&gt;empty&lt;/strong&gt; here.&lt;/p&gt;

&lt;p&gt;Since the Docker image is publicly available on Docker Hub, no login credentials are required.&lt;/p&gt;

&lt;h4&gt;
  
  
  Identification
&lt;/h4&gt;

&lt;p&gt;Template Name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt7p0dw5c44p7u3axrs9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt7p0dw5c44p7u3axrs9.png" alt="Image-step5-3" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-4" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" alt="Image-step5-5" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your Compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqwto9145nkb27vd3d71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqwto9145nkb27vd3d71.png" alt="Image-step6-1" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for confirmation before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
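For a more compact readout than the full `nvidia-smi` table, the tool's query interface can print just the fields you care about. A small sketch (it degrades gracefully on machines without an NVIDIA driver):

```shell
# Query selected GPU fields as CSV; fall back cleanly if no NVIDIA driver is present.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPU_INFO="$(nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader)"
else
  GPU_INFO="nvidia-smi not available on this machine"
fi
echo "$GPU_INFO"
```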



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n vibe python=3.11 -y &amp;amp;&amp;amp; conda activate vibe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa44mazey6p4kary0n39p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa44mazey6p4kary0n39p.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Clone the official repository and move inside the project directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguwl3vseqa9vadrmvjkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguwl3vseqa9vadrmvjkg.png" alt="Image-step7-2" width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -e .
pip install flash-attn --no-build-isolation
apt update &amp;amp;&amp;amp; apt install ffmpeg -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyux491lk91d70ve2hqir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyux491lk91d70ve2hqir.png" alt="Image-step7-3" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;
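Before launching the demo, it can help to verify that the editable install actually registered. A small check, assuming the package installs under the name `vibevoice` (adjust if your install registers a different name):

```shell
# Verify the editable install; the package name "vibevoice" is an assumption
# based on the repo - adjust it if your install uses a different name.
if python3 -c "import vibevoice" 2>/dev/null; then
  VIBE_CHECK="vibevoice package importable"
else
  VIBE_CHECK="vibevoice not importable yet - rerun 'pip install -e .' inside the repo"
fi
echo "$VIBE_CHECK"
```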

&lt;p&gt;4) Launch the Gradio demo. This will automatically download the model checkpoints as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib4ai6y8vtdm1v9nfz9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib4ai6y8vtdm1v9nfz9u.png" alt="Image-step7-4" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhf2qf7pmjiw0eu590fy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhf2qf7pmjiw0eu590fy.png" alt="Image-step7-5" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5) If you're on a remote machine (e.g., a NodeShift GPU node), you'll need to set up SSH port forwarding to access the Gradio session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the PORT allotted to your remote server (For the NodeShift server - you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 7860:localhost:7860 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this, copy the URL shown in your remote server's terminal: &lt;a href="http://0.0.0.0:7860" rel="noopener noreferrer"&gt;http://0.0.0.0:7860&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then paste it into your local browser (replacing 0.0.0.0 with localhost) to access the Gradio session.&lt;/p&gt;
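As a concrete illustration, here is how the forwarding command comes together; the port, key path, and IP below are hypothetical placeholders, not real server details:

```shell
# Hypothetical values - substitute the port, key path, and IP of your own node.
YOUR_SERVER_PORT=22
PATH_TO_SSH_KEY="$HOME/.ssh/id_ed25519"
YOUR_SERVER_IP="203.0.113.10"

# -L 7860:localhost:7860 maps local port 7860 to the remote Gradio port,
# so the session becomes reachable from the local browser.
FORWARD_CMD="ssh -L 7860:localhost:7860 -p ${YOUR_SERVER_PORT} -i ${PATH_TO_SSH_KEY} root@${YOUR_SERVER_IP}"
echo "$FORWARD_CMD"
```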

&lt;h3&gt;
  
  
  Step 8: Run the model
&lt;/h3&gt;

&lt;p&gt;1) Once you access the Gradio interface, it will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv99jk5854w87o84sfqiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv99jk5854w87o84sfqiw.png" alt="Image-step8-1" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Generate a podcast from a script.&lt;/p&gt;

&lt;p&gt;Use any script of your choice; we’re using one of the &lt;a href="https://github.com/microsoft/VibeVoice/blob/main/demo/text_examples/4p_climate_45min.txt" rel="noopener noreferrer"&gt;example scripts&lt;/a&gt; from the official repo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgm3ovx013se97c9o6nl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgm3ovx013se97c9o6nl.png" alt="Image-step8-2" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;
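If you'd rather write your own script than reuse the repo's examples, the example files follow a simple speaker-prefixed layout; the exact `Speaker N:` convention below is an assumption drawn from those files, so check them for the precise format:

```shell
# Write a minimal two-speaker script; the "Speaker N:" prefixes mirror the
# repo's text_examples (an assumption - verify against the example files).
cat > my_script.txt <<'EOF'
Speaker 1: Welcome back to the show. Today we are trying an open-source TTS model.
Speaker 2: Thanks for having me. Multi-speaker scripts are where it gets interesting.
EOF
cat my_script.txt
```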

&lt;p&gt;3) The app will start streaming the generated podcast audio in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pit993yejnqhe0y45kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pit993yejnqhe0y45kj.png" alt="Image-step8-3" width="800" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;VibeVoice stands out as a groundbreaking open-source TTS framework that combines continuous speech tokenizers, a Qwen2.5-powered LLM, and a diffusion head to deliver expressive, long-form, multi-speaker audio that feels astonishingly real. Its ability to generate up to 90 minutes of consistent, multi-voice speech makes it a powerful tool for creators and researchers alike. And while running it locally is a great way to get started, NodeShift makes the experience even smoother by providing GPU-accelerated environments, simplified deployment, and scalability out of the box, so you can focus on exploring and scaling with the model’s capabilities without worrying about complex infrastructure setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>microsoft</category>
      <category>podcast</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to Install DeepSeek V3.1</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 01 Sep 2025 18:21:46 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/a-step-by-step-guide-to-install-deepseek-v31-3dgg</link>
      <guid>https://forem.com/nodeshiftcloud/a-step-by-step-guide-to-install-deepseek-v31-3dgg</guid>
      <description>&lt;p&gt;DeepSeek has once again pushed the boundaries of what’s possible in open-source AI with the release of DeepSeek-V3.1, a next-generation hybrid model that seamlessly supports both thinking and non-thinking modes. Building on the foundation of its powerful V3 base checkpoint, this version introduces smarter tool calling, faster reasoning efficiency, and a more versatile chat template design that adapts effortlessly to different use cases. Its post-training optimization dramatically boosts performance in agent tasks and tool usage, making it a strong choice for developers working on automation, research assistance, and coding agents. Moreover, the model’s ability to process extended contexts has been expanded through a two-phase long context extension approach: a massive 10x increase in the 32K token phase to 630B tokens and a 3.3x increase in the 128K token phase to 209B tokens. Combined with training on the cutting-edge UE8M0 FP8 data format, DeepSeek-V3.1 not only ensures efficiency and scalability but also guarantees compatibility with modern microscaling data pipelines.&lt;/p&gt;

&lt;p&gt;Deploying a model of this caliber locally might seem daunting at first due to its substantial 671 billion parameters. However, Unsloth has made it entirely feasible. Unsloth has used selective quantization techniques to reduce the model's size without any significant loss of accuracy by targeting specific layers, such as the Mixture-of-Experts (MoE) layers, while preserving the precision of attention and other critical layers.&lt;/p&gt;

&lt;p&gt;In the following guide, we'll walk you through the step-by-step process of installing and running DeepSeek-V3.1 locally using LLaMA.cpp and Unsloth's dynamic quants, ensuring you can access its full potential efficiently and effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The system requirements for running DeepSeek-V3.1 are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: Multiple H100s or H200s (count may vary across quantization bit-widths)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 1TB+ (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NVIDIA CUDA installed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disk Space requirements depending on the type of model are as follows:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rtfv01ia0jnqggux4nb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rtfv01ia0jnqggux4nb.png" alt="Image-prequisites" width="800" height="426"&gt;&lt;/a&gt;&lt;br&gt;
Source: Unsloth&lt;/p&gt;

&lt;p&gt;We recommend taking a screenshot of this chart and saving it for quick reference to the disk space requirements before trying a specific quantized version.&lt;/p&gt;

&lt;p&gt;For this article, we’ll download the 2.71-bit version (recommended).&lt;/p&gt;
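Before downloading a quant, it's worth confirming the VM actually has the free space the chart calls for. A quick check (the mount point `/` is an assumption; adjust it to wherever your storage is mounted):

```shell
# Report free space on the root filesystem; compare it manually
# against the chart's figure for your chosen quantized version.
AVAIL=$(df -h / | awk 'NR==2 {print $4}')
echo "Available on /: $AVAIL"
```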
&lt;h2&gt;
  
  
  Step-by-step process to install DeepSeek-V3.1 Locally
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;log in&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you control the environment configuration for GPUs ranging from H100s to A100s, as well as CPUs, RAM, and storage, according to your needs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x H200 GPU; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 200 GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ex5301m0jmnmar6gbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ex5301m0jmnmar6gbe.png" alt="Image-step3-1" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or closely match) your configuration. In our case, we’ll choose a 1x H100 SXM 80GB GPU node with 192vCPUs/80GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffs8a0r7ibcqiyeu9v2oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffs8a0r7ibcqiyeu9v2oy.png" alt="Image-step4-1" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your Compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8bldddlk57tshry6ngf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8bldddlk57tshry6ngf.png" alt="Image-step6-1" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for confirmation before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Install and build LLaMA.cpp
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; is a C++ library for running LLaMA and other large language models efficiently on GPUs, CPUs and edge devices.&lt;/p&gt;

&lt;p&gt;We’ll first install &lt;code&gt;llama.cpp&lt;/code&gt; as we’ll use it to download and run DeepSeek-V3.1.&lt;/p&gt;

&lt;p&gt;1) Start by creating a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n deepseek python=3.11 -y &amp;amp;&amp;amp; conda activate deepseek
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hwf3q01mgtlwhxavyue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hwf3q01mgtlwhxavyue.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Once inside the environment, update the Ubuntu package lists to fetch the latest repository updates and patches.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;3) Install dependencies for llama.cpp.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgotwcz9o08de7837bj33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgotwcz9o08de7837bj33.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4) Clone the official repository of llama.cpp.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/ggml-org/llama.cpp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpg8mmhvabkqtol1ldex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpg8mmhvabkqtol1ldex.png" alt="Image-step7-3" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5) Generate &lt;code&gt;llama.cpp&lt;/code&gt;’s build files.&lt;/p&gt;

&lt;p&gt;In the command below, keep &lt;code&gt;-DGGML_CUDA=OFF&lt;/code&gt; if you’re running on a system without a GPU. We recommend keeping it OFF even on a GPU-based system: it makes llama.cpp compile on the CPU, which in this case is faster than GPU-based compilation, and compiling llama.cpp with CUDA enabled can sometimes throw unwanted errors.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DLLAMA_CURL=ON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpobvl0sy0x4vp1ivg5h1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpobvl0sy0x4vp1ivg5h1.png" alt="Image-step7-4" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;6) Build &lt;code&gt;llama.cpp&lt;/code&gt; from the build directory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym4abt71d49jjhwb13s6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym4abt71d49jjhwb13s6.png" alt="Image-step7-5" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;7) Finally, we’ll copy all the executables from &lt;code&gt;llama.cpp/build/bin/&lt;/code&gt; that start with &lt;code&gt;llama-&lt;/code&gt; into the &lt;code&gt;llama.cpp&lt;/code&gt; directory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp llama.cpp/build/bin/llama-* llama.cpp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 8: Download the Model Files
&lt;/h3&gt;

&lt;p&gt;We’ll download the model files from Hugging Face using a Python script.&lt;/p&gt;

&lt;p&gt;1) To do that, let’s first install the Hugging Face Python packages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install huggingface_hub hf_transfer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;huggingface_hub&lt;/code&gt;&lt;/strong&gt; – Provides an interface to interact with the Hugging Face Hub, allowing you to download, upload, and manage models, datasets, and other resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;hf_transfer&lt;/code&gt;&lt;/strong&gt; – A tool optimized for faster uploads and downloads of large files (e.g., LLaMA, DeepSeek models) from the Hugging Face Hub using a more efficient transfer protocol.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgdd0nqckkule24fogxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgdd0nqckkule24fogxo.png" alt="Image-step8-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Run the model download script with Python.&lt;/p&gt;

&lt;p&gt;The script below will download the specified quant’s checkpoints from &lt;a href="https://huggingface.co/unsloth/DeepSeek-V3.1" rel="noopener noreferrer"&gt;unsloth/DeepSeek-V3.1&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -c "import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER']='0'; from huggingface_hub import snapshot_download; snapshot_download(repo_id='unsloth/DeepSeek-V3.1-GGUF', local_dir='unsloth/DeepSeek-V3.1-GGUF', allow_patterns=['*UD-Q2_K_XL*'])"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8jthrsf9njkvdi9zyb3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8jthrsf9njkvdi9zyb3.png" alt="Image-step8-2" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on your network speed, the download can be slow and take some time. It might also seem stuck at some points, which is normal, so do not interrupt or kill the download in between.&lt;/p&gt;
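If you want to sanity-check progress without interrupting the download, one option (an illustrative sketch, not from the original article; the `list_gguf_shards` helper name is ours) is to list the GGUF shards already present in the local directory and their sizes. The shard filename used in Step 9 suggests this quant ships as six shards:

```python
import glob
import os


def list_gguf_shards(local_dir):
    """Return (path, size in GB) for every GGUF shard found under local_dir."""
    pattern = os.path.join(local_dir, "**", "*.gguf")
    shards = sorted(glob.glob(pattern, recursive=True))
    return [(p, os.path.getsize(p) / 1e9) for p in shards]


if __name__ == "__main__":
    for path, gb in list_gguf_shards("unsloth/DeepSeek-V3.1-GGUF"):
        print(f"{path}: {gb:.1f} GB")
```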

&lt;h3&gt;
  
  
  Step 9: Run the model for Inference
&lt;/h3&gt;

&lt;p&gt;Finally, once all checkpoints are downloaded, we can proceed to the inference part.&lt;/p&gt;

&lt;p&gt;In the command below, we’ll run the model through llama.cpp’s &lt;code&gt;llama-cli&lt;/code&gt; tool, with the prompt wrapped in the model’s chat template. The prompt asks the model to create a Flappy Bird game in Python, complete with interface, logic, and controls.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3.1-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
    --cache-type-k q4_0 \
    --threads -1 \
    --n-gpu-layers 99 \
    --prio 3 \
    --temp 0.6 \
    --top_p 0.95 \
    --min_p 0.01 \
    --ctx-size 16384 \
    --seed 3407 \
    -ot ".ffn_.*_exps.=CPU" \
    -no-cnv \
    --prompt "&amp;lt;｜User｜&amp;gt;Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.&amp;lt;｜Assistant｜&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;The model has started generating the code as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb15xhpd0wel4j1kt2mx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb15xhpd0wel4j1kt2mx.png" alt="Image-step9-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the process is complete, it may end the output like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ma3y5j4w9b5fxm6nh1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ma3y5j4w9b5fxm6nh1r.png" alt="Image-step9-2" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we run the Flappy Bird code generated by DeepSeek-V3.1 in the VS Code editor, it opens a game panel as shown below (Note: install pygame in your editor’s environment before running the code):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvduyozxu1wf5th8n9ak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvduyozxu1wf5th8n9ak.png" alt=" " width="655" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see a live demonstration of the game in the video attached to the original article &lt;a href="https://nodeshift.cloud/blog/a-step-by-step-guide-to-install-deepseek-v3-1?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=content_share" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this guide, we explored how DeepSeek-V3.1 elevates open-source AI with its hybrid thinking modes, smarter tool calling, faster reasoning, and extended long-context capabilities, all supported by efficient training techniques like FP8 scaling and Unsloth’s dynamic quantization. While deploying such a massive model locally with LLaMA.cpp is now more accessible, it still demands considerable compute resources. This is where NodeShift Cloud steps in, offering a seamless alternative with scalable, cost-effective GPU and compute infrastructure. By offloading deployment to NodeShift’s intuitive cloud platform, developers can unlock the full potential of DeepSeek-V3.1 without the burden of managing heavy local infrastructure, making experimentation, scaling, and production use both faster and simpler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deepseek</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>A Complete Setup Guide to Powerful AI Image Editing with Qwen-Image-Edit</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 01 Sep 2025 16:52:19 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/a-complete-setup-guide-to-powerful-ai-image-editing-with-qwen-image-edit-3kg1</link>
      <guid>https://forem.com/nodeshiftcloud/a-complete-setup-guide-to-powerful-ai-image-editing-with-qwen-image-edit-3kg1</guid>
      <description>&lt;p&gt;Image editing has always required a delicate balance between precision and creativity, and that’s exactly what Qwen-Image-Edit delivers. Built on the robust 20B Qwen-Image model, this cutting-edge tool takes image editing to the next level by combining semantic control (powered by Qwen2.5-VL) with appearance control (via its VAE Encoder). This dual-system approach allows users to seamlessly perform both low-level edits, like adding or removing objects while keeping the rest of the image untouched, and high-level transformations, such as rotating objects, transferring artistic styles, or even creating new concepts entirely. What truly sets Qwen-Image-Edit apart, however, is its precise text editing capability, enabling direct modification of text in English and Chinese while preserving the original font, size, and style. &lt;/p&gt;

&lt;p&gt;If you’re looking for an image editing model that’s powerful, versatile, and incredibly easy to use, Qwen-Image-Edit is a must-try. Let's see how to get it up and running on your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x H100&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50 GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 64 GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run Qwen Image Edit
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by &lt;a href="https://nodeshift.com" rel="noopener noreferrer"&gt;NodeShift&lt;/a&gt; since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;log in&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image), now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you configure everything from the GPU model (H100s, A100s, and more) to CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x H100 GPU, however, you can choose any GPU as per the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 200 GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nwqblqu9dtn5vbnnpvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nwqblqu9dtn5vbnnpvm.png" alt="Image-step3-1" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region and according to (or very close to) your configuration. In our case, we’ll choose a 1x H100 SXM 80GB GPU node with 192vCPUs/80GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ctuz0kf0n28kilc7zj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ctuz0kf0n28kilc7zj.png" alt="Image-step4-1" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filngygf82xfgk2o7lyxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filngygf82xfgk2o7lyxv.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds to a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning your Compute Node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqckok7vzis7m6g0pecxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqckok7vzis7m6g0pecxw.png" alt="Image-step6-1" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After copying the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n qwen python=3.11 -y &amp;amp;&amp;amp; conda activate qwen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffytezgl8rh29jhqc9dk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffytezgl8rh29jhqc9dk8.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/huggingface/diffusers
pip install transformers accelerate gradio pillow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluaqs2mc3pgm6b65qusv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluaqs2mc3pgm6b65qusv.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install and run Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4) If you’re on a remote machine (e.g., a NodeShift GPU), you’ll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the PORT allotted to your remote server (For the NodeShift server – you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r8j1owgltt9aq3dt6yq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r8j1owgltt9aq3dt6yq.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server’s terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfvxbys411rncwon4x0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfvxbys411rncwon4x0s.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tiohhqac3l7fo2zgy0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tiohhqac3l7fo2zgy0n.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoints.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
from PIL import Image
import torch

from diffusers import QwenImageEditPipeline

pipeline = QwenImageEditPipeline.from_pretrained("Qwen/Qwen-Image-Edit")
print("pipeline loaded")
pipeline.to(torch.bfloat16)
pipeline.to("cuda")
pipeline.set_progress_bar_config(disable=None)
image = Image.open("./cat.jpg").convert("RGB")
prompt = "Add a pillow under cat's head and cover it with a blanket."
inputs = {
    "image": image,
    "prompt": prompt,
    "generator": torch.manual_seed(0),
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 50,
}

with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save("output_image_edit.png")
    print("image saved at", os.path.abspath("output_image_edit.png"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F021cu0h7kfedcuvtv2pe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F021cu0h7kfedcuvtv2pe.png" alt="Image-step8-2" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Original Image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8o8vgqmw74kjunq1i7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8o8vgqmw74kjunq1i7n.png" alt="Image-step8-3" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Edited Image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk26adm8z6p71vx7tn68a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk26adm8z6p71vx7tn68a.png" alt="Image-step8-4" width="786" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Qwen-Image-Edit stands out as a next-generation image editing model, seamlessly blending semantic intelligence with appearance precision to enable everything from subtle object adjustments to bold creative transformations, all while offering unmatched text editing capabilities. By running it on Nodeshift Cloud, you gain a frictionless way to harness this power, eliminating complex setup hurdles and ensuring a smooth, scalable environment for experimentation. Together, Qwen-Image-Edit and Nodeshift Cloud make advanced image editing not just possible, but practical and accessible for creators, developers, and enterprises alike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qwen</category>
      <category>opensource</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Get Started with MiniCPM-v4: The Next-Gen Multimodal AI Model by OpenBMB</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 01 Sep 2025 16:24:49 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/get-started-with-minicpm-v4-the-next-gen-multimodal-ai-model-by-openbmb-4occ</link>
      <guid>https://forem.com/nodeshiftcloud/get-started-with-minicpm-v4-the-next-gen-multimodal-ai-model-by-openbmb-4occ</guid>
<description>&lt;p&gt;As multimodal AI rapidly evolves, MiniCPM-V 4.0 by OpenBMB emerges as a game-changer, combining cutting-edge visual understanding with unprecedented efficiency. Built on SigLIP2-400M and MiniCPM4-3B, this compact yet powerful model packs just 4.1B parameters yet consistently punches above its weight. It not only inherits the strong single-image, multi-image, and video comprehension capabilities of its predecessor (MiniCPM-V 2.6), but also surpasses it with remarkable efficiency. Benchmark results on OpenCompass demonstrate this leap. MiniCPM-V 4.0 achieves a 69.0 average score, outperforming models like GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B), and Qwen2.5-VL-3B-Instruct, proving that smaller can indeed be smarter. What makes it even more exciting is its real-world usability: the model runs seamlessly on end devices, delivering under 2s first-token delay and over 17 tokens/s decoding on an iPhone 16 Pro Max, all without heating issues, making on-device multimodal AI finally practical. With easy integration across frameworks like llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory, and even a native iOS app, MiniCPM-V 4.0 isn’t just another AI model; it’s a versatile, efficient, and deployment-ready multimodal powerhouse.&lt;/p&gt;

&lt;p&gt;In this article, we're going to see a step-by-step process to install and run this model locally or in GPU-accelerated environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x RTX 4090 or 1x RTX A6000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 16GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run MiniCPM-v4
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by &lt;a href="https://nodeshift.com" rel="noopener noreferrer"&gt;NodeShift&lt;/a&gt; since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. These nodes are highly customizable and let you control different environmental configurations for GPUs ranging from H100s to A100s, CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x A100 SXM4 GPU, however, you can choose any GPU as per the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 100GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" alt="Image-step3-1" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come close to) your configuration. In our case, we’ll choose a 1x RTX A6000 GPU node with 64vCPUs/63GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" alt="Image-step4-1" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" alt="Image-step5-3" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds or a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your Compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" alt="Image-step6-1" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you’ve copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n minicpm python=3.11 -y &amp;amp;&amp;amp; conda activate minicpm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1g2ubh3ivrkyj8f4pyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1g2ubh3ivrkyj8f4pyo.png" alt="Image-step7-1" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 
pip install einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkb0c52i63ff1t01k0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkb0c52i63ff1t01k0d.png" alt="Image-step7-2" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install and run Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4) If you're on a remote machine (e.g., a NodeShift GPU), you'll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the PORT allotted to your remote server (For the NodeShift server - you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server's terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'openbmb/MiniCPM-V-4'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
                                  # sdpa or flash_attention_2, no eager
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True)



image = Image.open('./landform.jpg').convert('RGB')

# First round chat 
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    image=image,
    tokenizer=tokenizer
)
print(answer)


# Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": [
            "What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    image=None,
    tokenizer=tokenizer
)
print(answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4wbgse6j4bjxhe2vidk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4wbgse6j4bjxhe2vidk.png" alt="Image-step8-2" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the image we used to test the model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqmmtouc957t3wd0uycp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqmmtouc957t3wd0uycp.png" alt="Image-step8-3" width="200" height="300"&gt;&lt;/a&gt;&lt;br&gt;
Picsum ID: 866&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgztxszybzjktbxkxa6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgztxszybzjktbxkxa6a.png" alt="Image-step8-4" width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To wrap up, MiniCPM-V 4.0 clearly demonstrates how multimodal AI is becoming more efficient, accessible, and deployment-ready, setting a new benchmark in balancing compact design with powerful visual and reasoning capabilities. From its ability to outperform larger models on benchmarks to its seamless real-world usability on devices like the iPhone 16 Pro Max, it proves that high performance no longer requires massive scale. At the same time, Nodeshift Cloud makes experimenting with and deploying such state-of-the-art models far more practical, offering GPU-accelerated environments, simple setup workflows, and flexible scaling to match your needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Install &amp; Run Qwen Image</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 01 Sep 2025 16:10:21 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/how-to-install-run-qwen-image-1pjd</link>
      <guid>https://forem.com/nodeshiftcloud/how-to-install-run-qwen-image-1pjd</guid>
      <description>&lt;p&gt;Imagine transforming a simple text prompt into a high-quality image with just a few lines of code. Qwen-Image makes this possible by combining advanced image generation with precise text rendering, whether you’re working in English or Chinese. It handles everything from photorealistic scenes and impressionist-style paintings to clean, minimalist designs, adapting its output to your needs. On top of that, Qwen-Image offers powerful editing features: you can insert or remove objects, fine-tune colours and details, edit text directly within an image, and even adjust human poses—all through clear, natural-language commands. Behind the scenes, it also performs tasks like object detection, semantic segmentation, depth estimation and super-resolution, giving you a complete toolkit for creating and refining images with ease.&lt;/p&gt;

&lt;p&gt;Getting started is simple. In the next section, you’ll see exactly how to install Qwen-Image and run your first prompt in minutes.&lt;/p&gt;
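As a rough sketch of what "a few lines of code" looks like, here is a hedged, minimal text-to-image call. It assumes the &lt;code&gt;Qwen/Qwen-Image&lt;/code&gt; checkpoint id on Hugging Face and the diffusers &lt;code&gt;DiffusionPipeline&lt;/code&gt; loader; the heavy generation step only runs when torch/diffusers are installed and a CUDA GPU is available:

```python
# Hedged sketch of a minimal Qwen-Image text-to-image call (assumes the
# "Qwen/Qwen-Image" Hugging Face checkpoint id). The generation arguments
# are collected first so the call shape is visible even without a GPU.
prompt = "A coffee shop entrance with a chalkboard sign reading 'Qwen Coffee'"
gen_kwargs = {
    "prompt": prompt,
    "negative_prompt": " ",
    "num_inference_steps": 50,
    "true_cfg_scale": 4.0,
}

try:
    import torch
    from diffusers import DiffusionPipeline
    if torch.cuda.is_available():
        # Downloads the full checkpoint (tens of GB) on first run.
        pipe = DiffusionPipeline.from_pretrained(
            "Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
        image = pipe(**gen_kwargs).images[0]
        image.save("qwen_image.png")
except ImportError:
    pass  # torch/diffusers missing; gen_kwargs still documents the call shape
```

The full installation and a step-by-step walkthrough follow in the sections below.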

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x H100&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50 GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 64 GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run Qwen Image
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by &lt;a href="https://nodeshift.com" rel="noopener noreferrer"&gt;NodeShift&lt;/a&gt; since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines from NodeShift. They are highly customizable, letting you configure the GPU type (ranging from H100s to A100s), CPUs, RAM, and storage according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using a 1x H100 GPU; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 200 GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nwqblqu9dtn5vbnnpvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nwqblqu9dtn5vbnnpvm.png" alt="Image-step3-1" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region and according to (or very close to) your configuration. In our case, we’ll choose a 1x H100 SXM 80GB GPU node with 192vCPUs/80GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ctuz0kf0n28kilc7zj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ctuz0kf0n28kilc7zj.png" alt="Image-step4-1" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filngygf82xfgk2o7lyxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filngygf82xfgk2o7lyxv.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see a green &lt;strong&gt;Running&lt;/strong&gt; status, meaning the compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqckok7vzis7m6g0pecxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqckok7vzis7m6g0pecxw.png" alt="Image-step6-1" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
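&lt;p&gt;To double-check that the node satisfies the VRAM prerequisite, you can also parse the memory column that &lt;code&gt;nvidia-smi&lt;/code&gt; prints. A minimal sketch, assuming the typical &lt;code&gt;used / total&lt;/code&gt; MiB format; the sample line and helper name are illustrative, not part of the original tutorial:&lt;/p&gt;

```python
import re

def vram_total_mib(smi_line):
    """Extract the total GPU memory (MiB) from an nvidia-smi memory field
    formatted like '11441MiB / 81559MiB' (used / total)."""
    match = re.search(r"/\s*(\d+)MiB", smi_line)
    return int(match.group(1)) if match else None

# Example: an H100 SXM 80GB reports roughly 81559 MiB in total
sample = "| 31%   42C    P0    68W / 700W |  11441MiB / 81559MiB |"
print(vram_total_mib(sample))  # 81559, i.e. about 80 GiB
```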



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n qwen-img python=3.11 -y &amp;amp;&amp;amp; conda activate qwen-img
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zy0g2h5aopcr1jenciq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zy0g2h5aopcr1jenciq.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 
pip install einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
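&lt;p&gt;Before moving on, it can help to confirm that the key packages resolved inside the active conda environment. A small sanity-check sketch using only the standard library; the module list below is an assumption based on the installs above (note that pip names and import names can differ, e.g. &lt;code&gt;pillow&lt;/code&gt; imports as &lt;code&gt;PIL&lt;/code&gt;):&lt;/p&gt;

```python
from importlib.util import find_spec

def missing_modules(modules):
    """Return the subset of module names that cannot be resolved
    in the current environment."""
    return [name for name in modules if find_spec(name) is None]

required = ["torch", "transformers", "accelerate", "diffusers", "PIL", "sentencepiece"]
print(missing_modules(required))  # an empty list means everything resolved
```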



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq57nt1dltxdp38396921.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq57nt1dltxdp38396921.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install and run Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4) If you’re on a remote machine (e.g., a NodeShift GPU), you’ll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;` with the PORT allotted to your remote server (For the NodeShift server – you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
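&lt;p&gt;To make the substitutions less error-prone, the forwarding command can be assembled from the three placeholder values. A minimal sketch; the parameter names mirror the placeholders above, and the sample values are purely illustrative:&lt;/p&gt;

```python
def ssh_forward_command(server_port, key_path, server_ip,
                        local_port=8888, remote_port=8888):
    """Build an SSH command that forwards the remote Jupyter port to localhost."""
    return (
        f"ssh -L {local_port}:localhost:{remote_port} "
        f"-p {server_port} -i {key_path} root@{server_ip}"
    )

print(ssh_forward_command(40123, "~/.ssh/id_ed25519", "203.0.113.7"))
```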

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r8j1owgltt9aq3dt6yq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r8j1owgltt9aq3dt6yq.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server’s terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfvxbys411rncwon4x0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfvxbys411rncwon4x0s.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And paste this on your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tiohhqac3l7fo2zgy0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tiohhqac3l7fo2zgy0n.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoints.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from diffusers import DiffusionPipeline
import torch

model_name = "Qwen/Qwen-Image"

# Load the pipeline
if torch.cuda.is_available():
    torch_dtype = torch.bfloat16
    device = "cuda"
else:
    torch_dtype = torch.float32
    device = "cpu"

pipe = DiffusionPipeline.from_pretrained(model_name, torch_dtype=torch_dtype)
pipe = pipe.to(device)

positive_magic = {
    "en": "Ultra HD, 4K, cinematic composition.",  # for English prompts
    "zh": "超清，4K，电影级构图",  # for Chinese prompts
}

# Generate image
prompt = '''A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197". Ultra HD, 4K, cinematic composition'''

negative_prompt = " "  # use a blank prompt if you have no specific concept to remove

# Generate with different aspect ratios
aspect_ratios = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1140),
    "3:4": (1140, 1472),
    "3:2": (1584, 1056),
    "2:3": (1056, 1584),
}

width, height = aspect_ratios["16:9"]

image = pipe(
    prompt=prompt + positive_magic["en"],
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_inference_steps=50,
    true_cfg_scale=4.0,
    generator=torch.Generator(device=device).manual_seed(42)
).images[0]

image.save("example.png")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
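&lt;p&gt;If the aspect ratio you need is not an exact key in the table above, you can snap to the closest supported preset instead. A minimal sketch; the helper is ours, not part of the Qwen-Image API:&lt;/p&gt;

```python
# Supported presets from the snippet above, keyed by aspect ratio
ASPECT_RATIOS = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1140),
    "3:4": (1140, 1472),
    "3:2": (1584, 1056),
    "2:3": (1056, 1584),
}

def closest_resolution(target_ratio):
    """Return the (label, (width, height)) preset whose width/height ratio
    is nearest to target_ratio (a float such as 1920 / 1080)."""
    def distance(item):
        width, height = item[1]
        return abs(width / height - target_ratio)
    return min(ASPECT_RATIOS.items(), key=distance)

print(closest_resolution(21 / 9))  # an ultrawide target snaps to the 16:9 preset
```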

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F704sfgamozztlieecl2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F704sfgamozztlieecl2r.png" alt="Image-step8-2" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71h42kbar10oiyz8qz71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71h42kbar10oiyz8qz71.png" alt="Image-step8-3" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You’ve seen how Qwen-Image turns simple text prompts into stunning, high-fidelity images, whether photorealistic, painterly, or minimal, and offers intuitive editing for objects, color, text, and even human poses, all backed by robust image-understanding capabilities like segmentation and super-resolution. Getting up and running is equally straightforward: a few commands install the model via diffusers, and within minutes you’re generating your first visuals. By pairing Qwen-Image with NodeShift Cloud, you gain instant access to scalable GPU instances, automated deployment of your inference pipeline, and managed versioning, so you can focus on creativity while NodeShift ensures performance, reliability, and easy integration into your existing workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qwen</category>
      <category>ai</category>
      <category>genai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>A Step-By-Step Guide to Install Qwen3 30B Locally</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 11 Aug 2025 13:43:54 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/a-step-by-step-guide-to-install-qwen3-30b-locally-o7j</link>
      <guid>https://forem.com/nodeshiftcloud/a-step-by-step-guide-to-install-qwen3-30b-locally-o7j</guid>
      <description>&lt;p&gt;The Qwen3-30B-A3B-Instruct-2507 is an advanced iteration of the Qwen3 series, marking a significant leap forward in the landscape of causal language models. Boasting an impressive 30.5 billion parameters with 3.3 billion actively engaged, this model excels across a diverse array of capabilities such as instruction following, complex logical reasoning, text comprehension, mathematics, and science. Its robust coding proficiency, demonstrated by high scores in benchmarks such as MultiPL-E and LiveCodeBench, makes it particularly attractive to developers and researchers. The model also excels in multilingual contexts and handles extensive 256K token contexts effortlessly, making it ideal for intricate, lengthy tasks. Furthermore, its refined alignment with user preferences in subjective and open-ended scenarios ensures that interactions feel natural, intuitive, and highly personalised.&lt;/p&gt;

&lt;p&gt;In this article, we guide you step-by-step through installing Qwen3-30B locally or in a GPU-accelerated environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x H200&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50 GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 64 GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run Qwen3-30B
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines from NodeShift. They are highly customizable, letting you configure the GPU type (ranging from H100s to A100s), CPUs, RAM, and storage according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x H200 GPU, however, you can choose any GPU as per the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 200 GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4eg1srvuvt289uey0baa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4eg1srvuvt289uey0baa.png" alt="Image-step3-1" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region and according to (or very close to) your configuration. In our case, we’ll choose a 1x H100 SXM 80GB GPU node with 192vCPUs/80GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuuphp4rmseb1dr3c6qt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuuphp4rmseb1dr3c6qt.png" alt="Image-step4-1" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see a green &lt;strong&gt;Running&lt;/strong&gt; status, meaning the compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15qhftsr1k75orubyj15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15qhftsr1k75orubyj15.png" alt="Image-step6-1" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n qwen python=3.11 -y &amp;amp;&amp;amp; conda activate qwen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hwf3q01mgtlwhxavyue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hwf3q01mgtlwhxavyue.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Once you’re inside the environment, install vLLM along with its dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade vllm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8l3uska8dmuptv4jyb96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8l3uska8dmuptv4jyb96.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Also, open a second terminal, connect to the remote server via SSH, and install Open WebUI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install open-webui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Download the model with vLLM and serve the endpoint on port 8000.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --max-model-len 32768 --gpu-memory-utilization 0.95
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b7fry6idimqoi1sl2ur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b7fry6idimqoi1sl2ur.png" alt="Image-step8-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) In the second terminal connected to the GPU host via SSH, serve the Open WebUI frontend endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;open-webui serve --port 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhszbpe7ynff5vyqgx36w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhszbpe7ynff5vyqgx36w.png" alt="Image-step8-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhk910xjcstcgte8buly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhk910xjcstcgte8buly.png" alt="Image-step8-3" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Forward both ports and tunnel them to access the services in your local browser.&lt;/p&gt;

&lt;p&gt;If you’re on a remote machine (e.g., NodeShift GPU), you’ll need to set up SSH port forwarding to access both the vLLM and Open WebUI sessions in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the PORT allotted to your remote server (For the NodeShift server – you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 3000:localhost:3000 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In another local terminal, forward the port for the vLLM endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8000:localhost:8000 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
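&lt;p&gt;Optionally, both forwards can be combined into a single SSH session instead of two terminals; a sketch using the same placeholders as above:&lt;/p&gt;

```shell
# Forward Open WebUI (3000) and the vLLM API (8000) through one tunnel.
# Replace the placeholders exactly as described above.
ssh -L 3000:localhost:3000 -L 8000:localhost:8000 -p <YOUR_SERVER_PORT> -i <PATH_TO_SSH_KEY> root@<YOUR_SERVER_IP>
```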



&lt;h3&gt;
  
  
  Step 9: Run the model via Open WebUI Interface
&lt;/h3&gt;

&lt;p&gt;Once the ports are forwarded, you can simply access the model via the Open WebUI interface and chat with it.&lt;/p&gt;

&lt;p&gt;1) Before running the model, connect Open WebUI to the vLLM API endpoint in the settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2aoj18gjzcu8pvrori0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2aoj18gjzcu8pvrori0.png" alt="Image-step9-1" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Select the Qwen3-30B model in the chat page and run the prompt.&lt;/p&gt;

&lt;p&gt;For example, we’re testing the following prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Summarize the following passage in 3 bullet points.
2. Then, extract 3 key insights and explain their implications.
3. Finally, write a Python function that could analyze similar passages for sentiment.

---
Passage:

"The rapid advancement of AI technologies has transformed industries across the globe. In healthcare, AI models are diagnosing diseases earlier and more accurately. In finance, algorithmic trading and risk modeling are becoming more sophisticated. Yet, as AI grows more powerful, ethical questions around bias, privacy, and job displacement remain urgent. Policymakers and technologists must collaborate to create guardrails that ensure innovation benefits society as a whole."
---

Give your response in clearly separated sections.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtl4170rm5axn0mcm3ek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtl4170rm5axn0mcm3ek.png" alt="Image-step9-2" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxdp9zrtimcx7ageygde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxdp9zrtimcx7ageygde.png" alt="Image-step9-3" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskm5ssf4mesaqw432140.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskm5ssf4mesaqw432140.png" alt="Image-step9-4" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Installing Qwen3-30B-A3B-Instruct-2507 locally equips developers and researchers with a cutting-edge language model, renowned for its powerful reasoning, extensive multilingual support, and exceptional handling of long-context tasks. Pairing it with NodeShift GPUs further enhances this experience, providing streamlined deployment, efficient resource management, and scalable infrastructure. Together, these tools empower users to harness advanced AI capabilities effectively, bridging innovation with accessibility and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qwen</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How AI is Saving Sales Teams 1,000+ Hours Annually - Securely and at Scale</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 11 Aug 2025 12:20:12 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/how-ai-is-saving-sales-teams-1000-hours-annually-securely-and-at-scale-3anc</link>
      <guid>https://forem.com/nodeshiftcloud/how-ai-is-saving-sales-teams-1000-hours-annually-securely-and-at-scale-3anc</guid>
      <description>&lt;p&gt;Sales teams are under immense pressure. Quarter after quarter, they’re expected to hit ambitious revenue targets, respond faster than ever, and deliver personalized experiences across every touchpoint. However, there’s a hidden roadblock that no one talks about, i.e., time. From drafting repetitive outreach emails, chasing follow-ups, and customizing proposals to filling out RFPs and updating CRM entries, a massive portion of a sales representative’s day is spent in "not selling". In fact, studies show that over 60% of a representative’s time is consumed by non-revenue-generating admin work. That’s not just frustrating, it’s expensive.&lt;/p&gt;

&lt;p&gt;For mid-sized sales teams, this amounts to over 1,000 hours lost per year, often by the highest-paid and most skilled employees. And while traditional sales automation tools offer some relief, they rarely deliver the deep contextual understanding or security that today’s enterprise environments demand.&lt;/p&gt;

&lt;p&gt;That’s where NodeShift’s sovereign AI platform steps in, offering a private, self-hosted solution that seamlessly integrates with your existing CRM tools like HubSpot, Apollo, or Salesforce and more. Powered by state-of-the-art open-source models like LLaMA, DeepSeek, and Mistral, NodeShift delivers intelligent sales copilots that work just like ChatGPT, only fully within your infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Ways Sales Teams Save Time with AI
&lt;/h2&gt;

&lt;p&gt;Sales teams today are overwhelmed by admin-heavy workflows that slow down pipeline momentum and eat into selling time. With NodeShift’s sovereign AI platform, sales orgs can deploy high-performing, private AI agents that embed directly into their tools and workflows, turning weeks of manual work into minutes of intelligent automation.&lt;/p&gt;

&lt;p&gt;Here are five real-world applications where sales teams are regaining control of their time and focus:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1) Auto-Generate Personalized Emails &amp;amp; Outreach&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Writing personalized prospecting emails at scale is time-consuming, and using templates often leads to low engagement. NodeShift-hosted models, such as LLaMA 3 or Mistral-7B, can be fine-tuned on your past outreach, ICP definitions, and product positioning. Once integrated with your CRM (e.g., HubSpot, Apollo via MCPs or APIs), the AI can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Auto-draft hyper-personalized cold emails for each lead based on firmographic and behavioral data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tailor follow-ups dynamically based on engagement history&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Localize content (e.g., for multilingual markets) using in-house translation support&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Sales reps save 6–10 hours/week on outbound efforts, while improving open and reply rates by 20–35%.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2) Instantly Generate Proposals, Sales Decks &amp;amp; Security Responses&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Creating customized proposals, tailored pitch decks, and responses to procurement or security questionnaires is a time-consuming process. Reps and solution engineers often spend hours assembling assets from multiple documents, past deals, and stakeholder inputs, often under tight deadlines.&lt;/p&gt;

&lt;p&gt;NodeShift’s sovereign AI platform acts as your always-on sales assistant. By connecting securely to your internal document repository, past proposals, slide decks, pricing sheets, and product documentation, your AI copilot can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Generate fully customized sales proposals and SoWs (Statements of Work) based on customer profile, industry, and deal stage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Auto-build branded presentation decks with editable slides, based on opportunity data, product fit, and competitor positioning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Complete security questionnaires and RFPs by retrieving previously approved answers and formatting them to match submission requirements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pull insights from your CRM, knowledge base, and compliance docs without ever sending data to any external third-party service or cloud&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system outputs editable content in formats like PDF, Google Slides, Microsoft PowerPoint, and Word, making review and refinement effortless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Sales teams reduce proposal cycles from 3–5 days to less than 3 hours, reclaiming dozens of hours per month while improving consistency and response quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3) Call Summarization &amp;amp; Deal Insights&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Valuable insights from discovery calls, demos, and follow-ups often get buried in meeting recordings or never make it into the CRM. &lt;/p&gt;

&lt;p&gt;Once connected to tools like Zoom or Google Meet transcripts, NodeShift’s AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Summarizes key takeaways, objections, and action items&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatically updates the CRM (Salesforce, HubSpot) with notes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Suggests next steps or personalized recap emails&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Reps save ~5 hours/week on call logging, while managers gain full visibility into deal health without micromanaging.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4) Intelligent CRM Hygiene &amp;amp; Lead Prioritization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;CRMs are often cluttered, outdated, or underutilized, making it hard to prioritize the right deals.&lt;/p&gt;

&lt;p&gt;By integrating NodeShift's AI models with your CRM and third-party data sources (via APIs or the Model Context Protocol), AI agents can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Auto-enrich lead profiles with public data (e.g., LinkedIn, Clearbit, Apollo)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flag duplicate or stale records&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Score leads based on buying intent and account fit&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recommend the next best action (call, email, nurture)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Sales operations save 2–4 hours/week, while reps spend less time guessing and more time engaging.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5) Conversational Team Onboarding &amp;amp; Enablement&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Onboarding new sales reps can take 1–2 months of shadowing, reading, and repetitive Q&amp;amp;A.&lt;/p&gt;

&lt;p&gt;For this, you can deploy a custom AI copilot on NodeShift, which is trained on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Product FAQs and battlecards&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Objection handling scripts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Past deals and CRM notes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Competitive positioning documents&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;New hires can interact with the AI like a live trainer:&lt;/p&gt;

&lt;p&gt;“How do we position our product against [Competitor] for [Healthcare CTOs]?”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Onboarding time drops by 30–50%, and reps are deal-ready weeks faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  NodeShift's Sovereign AI - Designed for Security and Scale
&lt;/h2&gt;

&lt;p&gt;NodeShift offers a fundamentally different approach, a sovereign AI platform purpose-built for privacy-conscious organizations that want the power of open-source large language models (LLMs) without sacrificing data ownership or operational control.&lt;/p&gt;

&lt;p&gt;By securely deploying models like Mistral, LLaMA 3, DeepSeek, and others directly within your infrastructure, and integrating seamlessly with CRM systems like HubSpot, Apollo, Salesforce, and internal knowledge bases, NodeShift enables organizations to build intelligent sales copilots that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deeply contextualized with your product, customer, and market knowledge&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fully under your control (on-prem, VPC, or private sovereign cloud)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Capable of automating hours of manual tasks across the entire sales funnel&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your team is spending more time documenting deals than closing them, it’s time to rethink what AI can do for your sales operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to See NodeShift AI in Action?
&lt;/h2&gt;

&lt;p&gt;Sales performance is no longer just about hiring more reps; it’s about removing the friction that slows them down.&lt;/p&gt;

&lt;p&gt;With NodeShift’s sovereign AI platform, your organization can harness the power of cutting-edge open-source models like Mistral, LLaMA, and DeepSeek, securely deployed inside your infrastructure, connected to your CRM, and tailored to your workflows.&lt;/p&gt;

&lt;p&gt;Now that you’ve seen how sales teams can reclaim 1,000+ hours annually, it’s time to explore what that transformation looks like for your team. Our team is ready to help you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Assess integration points and automation gaps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploy the right open-source AI model in your own infrastructure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Launch in under a week with full ownership and compliance&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>startup</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to Install &amp; Run Qwen3-Thinking</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 04 Aug 2025 16:48:42 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/how-to-install-run-qwen3-thinking-42gi</link>
      <guid>https://forem.com/nodeshiftcloud/how-to-install-run-qwen3-thinking-42gi</guid>
      <description>&lt;p&gt;In the world of open-source AI, very few models come close to rivaling the intellectual firepower of proprietary giants, until now. Introducing Qwen3-235B-A22B-Thinking-2507, a frontier model in the realm of thinking-capable language models. Engineered by Alibaba Cloud, this 235B-parameter behemoth, 22B of which are actively used per inference, excels in high-level reasoning, mathematical problem solving, scientific logic, and advanced coding tasks. With its unprecedented 256K context length, this model is built not just for chat, but for deep, extended reasoning across massive documents and chains of logic. From dominating benchmarks like AIME25 and HMMT25 to outperforming Claude Opus in reasoning-heavy scenarios, Qwen Thinking isn’t just another LLM, it’s a state-of-the-art brain built for intellectual rigor. And the best part? You can now run it locally.&lt;/p&gt;

&lt;p&gt;Let's walk through how to install and harness this thinking model right from your own machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 8x H200&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 1 TB (preferred)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 1 TB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run Qwen Thinking
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image), now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you control different environment configurations for GPUs ranging from H100s to A100s, as well as CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using an 8x H200 GPU node; however, you can choose any GPU as per the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 1 TB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgly6qkzpqv51516doe07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgly6qkzpqv51516doe07.png" alt="Image-step2-3" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come very close to) your configuration. In our case, we’ll choose an 8x H200 140 GB GPU node with 384 vCPUs/2 TB RAM/200 GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi69b5gfnkjz0h73sylw6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi69b5gfnkjz0h73sylw6.png" alt="Image-step4-1" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed4vdnyb08thflcpa2nc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed4vdnyb08thflcpa2nc.png" alt="Image-step5-3" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning the compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm63iskjch3q6digrgaik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm63iskjch3q6digrgaik.png" alt="Image-step6-1" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
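&lt;p&gt;If you'd rather capture the same stats programmatically, &lt;code&gt;nvidia-smi&lt;/code&gt; can emit CSV via &lt;code&gt;--query-gpu&lt;/code&gt;. Below is a minimal parsing sketch; the sample line is illustrative, not real output from this VM:&lt;/p&gt;

```python
# Hypothetical sketch: parse one CSV line produced by
#   nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv,noheader,nounits
def parse_gpu_line(line):
    # Fields arrive comma-separated; memory values are in MiB with nounits.
    name, mem_total, mem_used = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "mem_total_mib": int(mem_total),
        "mem_used_mib": int(mem_used),
    }

# Illustrative sample line, not actual output from this machine.
sample = "NVIDIA A100-SXM4-80GB, 81920, 1024"
info = parse_gpu_line(sample)
print(info["name"], info["mem_total_mib"] - info["mem_used_mib"], "MiB free")
```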



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n qwen-thinking python=3.11 -y &amp;amp;&amp;amp; conda activate qwen-thinking
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5di06b2nzhrps7yayxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5di06b2nzhrps7yayxz.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Once you're inside the environment, install the necessary dependencies to run the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install --upgrade transformers accelerate einops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvn22juw19oekhs965trk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvn22juw19oekhs965trk.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;
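&lt;p&gt;Before moving on, you can sanity-check that the key packages actually resolved. A small hedged sketch using only the standard library (package names match the pip installs above):&lt;/p&gt;

```python
# Hedged sanity check: confirm the packages installed in the previous step.
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg):
    # Returns the installed version string, or None if the package is absent.
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

for pkg in ("torch", "transformers", "accelerate", "einops"):
    print(pkg, installed_version(pkg) or "NOT INSTALLED")
```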

&lt;p&gt;3) Install PyTorch, Transformers, and other Python packages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio 
pip install einops timm pillow
pip install transformers==4.47.0 git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy ffmpeg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4) Install and run Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5) If you're on a remote machine (e.g., a NodeShift GPU), you'll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the port allotted to your remote server (for a NodeShift server, you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server's terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;
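&lt;p&gt;To double-check the forwarding command before running it, here is a sketch that assembles it from the three placeholders. All three values below are hypothetical and must be replaced with your own server details:&lt;/p&gt;

```python
# Hypothetical placeholder values; substitute your own server details.
server_port = "12345"              # stands for YOUR_SERVER_PORT
ssh_key_path = "~/.ssh/id_ed25519" # stands for PATH_TO_SSH_KEY
server_ip = "203.0.113.7"          # stands for YOUR_SERVER_IP (example address)

# -L 8888:localhost:8888 forwards local port 8888 to the notebook on the VM.
cmd = "ssh -L 8888:localhost:8888 -p {} -i {} root@{}".format(
    server_port, ssh_key_path, server_ip
)
print(cmd)
```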

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoints and run inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (&amp;lt;/think&amp;gt;)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qdr0t8ls70s4vx3z6aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qdr0t8ls70s4vx3z6aw.png" alt="Image-step8-3" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tekpwxcn52nw11ncs2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tekpwxcn52nw11ncs2n.png" alt="Image-step8-4" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcff8bxz74q03gbwyau8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcff8bxz74q03gbwyau8.png" alt="Image-step8-5" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh5q7yhastuw4rftidxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmh5q7yhastuw4rftidxz.png" alt="Image-step8-6" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;
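&lt;p&gt;A note on the &lt;code&gt;try/except&lt;/code&gt; in the snippet above: it locates the last occurrence of token id 151668 (the end-of-thinking marker) and splits the generated ids there. The same logic as a standalone sketch, with illustrative token ids:&lt;/p&gt;

```python
END_THINK_ID = 151668  # end-of-thinking special token id, per the snippet above

def split_thinking(output_ids, end_think_id=END_THINK_ID):
    # Index just past the LAST occurrence of end_think_id: everything before
    # it is "thinking" content, everything after is the final answer.
    try:
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        index = 0  # no marker found: treat the whole output as the answer
    return output_ids[:index], output_ids[index:]

thinking, answer = split_thinking([11, 22, 151668, 33, 44])
print(thinking, answer)
```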

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Installing Qwen3-235B-A22B-Thinking-2507 locally isn’t just a technical feat; it’s a gateway to unlocking one of the most advanced open-source reasoning models available today. In this guide, we explored what makes this model a powerhouse for logical reasoning, coding, and long-context understanding, and how its “thinking mode” elevates it far beyond conventional LLMs. NodeShift Cloud played a pivotal role by simplifying the deployment process, offering the compute muscle and flexible infrastructure needed to run such a massive model seamlessly. Whether you’re experimenting, building, or benchmarking, NodeShift makes it easier than ever to bring cutting-edge AI capabilities right to your fingertips.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qwen</category>
      <category>ai</category>
      <category>coding</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Install &amp; Run Higgs Audio v2 Locally</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 04 Aug 2025 16:12:38 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/how-to-install-run-higgs-audio-v2-locally-3ppg</link>
      <guid>https://forem.com/nodeshiftcloud/how-to-install-run-higgs-audio-v2-locally-3ppg</guid>
      <description>&lt;p&gt;Imagine an audio generation model so expressive, it can narrate a story with human-like cadence, speak in your cloned voice with melodic voice, or conduct a natural conversation between two completely different speakers, all in multiple languages and without any fine-tuning. That’s exactly what Higgs Audio v2 delivers. Pretrained on over 10 million hours of meticulously annotated audio and text, this powerful open-source audio foundation model pushes the boundaries of what's possible in text-to-speech (TTS) and audio synthesis. Built on top of Llama 3.2-3B and enhanced with a novel DualFFN audio adapter architecture, Higgs Audio v2 combines the deep language understanding of LLMs with a cutting-edge discretized audio tokenizer capable of capturing both semantic and acoustic detail at just 25 fps. It excels in zero-shot prosody adaptation, multilingual translation, multi-speaker dialogue generation, and even simultaneous background music and speech synthesis. With state-of-the-art results on Seed-TTS Eval, ESD, and EmergentTTS-Eval, and win rates of up to 75.7% over GPT-4o-mini-TTS in emotion-rich generations, this model is not just a technical marvel, it's an invitation to explore the future of voice AI.&lt;/p&gt;

&lt;p&gt;If you're ready to build with the next generation of audio intelligence, this guide will walk you through installing Higgs Audio v2 locally, unlocking everything from high-fidelity narration to real-time multilingual voice cloning, right on your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x RTX 4090 or 1x RTX A6000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 16GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
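&lt;p&gt;The 16GB VRAM figure is consistent with a rough back-of-the-envelope estimate: a roughly 3B-parameter model stored in bf16 needs about 5.6 GiB for the weights alone, before the audio tokenizer, KV cache, and activations are added on top. A hedged sketch of that arithmetic (the parameter count is an assumption based on the Llama 3.2-3B backbone):&lt;/p&gt;

```python
def weight_memory_gib(params_billion, bytes_per_param=2):
    # bf16/fp16 stores each parameter in 2 bytes; fp32 would use 4.
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / 2**30  # convert bytes to GiB

# Assumed ~3B parameters for the Higgs Audio v2 generation model.
print(round(weight_memory_gib(3), 1), "GiB for weights alone")
```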

&lt;h2&gt;
  
  
  Step-by-step process to install and run Higgs Audio v2
&lt;/h2&gt;

&lt;p&gt;For this tutorial, we’ll use a GPU-powered virtual machine from NodeShift, which provides high-compute virtual machines at a very affordable cost while meeting GDPR, SOC2, and ISO 27001 compliance requirements. It also offers an intuitive and user-friendly interface, making it easier for beginners to get started with cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;log in&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you control the configuration, from GPUs (ranging from H100s to A100s) to CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x A100 SXM4 GPU; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 100GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" alt="Image-step3-1" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region and according to (or very close to) your configuration. In our case, we’ll choose a 2x A100 80GB GPU node with 32vCPUs/131GB RAM/100GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" alt="Image-step4-1" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" alt="Image-step5-3" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning the compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" alt="Image-step6-1" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n higgs python=3.11 -y &amp;amp;&amp;amp; conda activate higgs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8hbhqft6yydyippha53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8hbhqft6yydyippha53.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Once you’re inside the environment, clone the official repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/boson-ai/higgs-audio.git
cd higgs-audio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3mv27z3tbc2phnvcq8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3mv27z3tbc2phnvcq8v.png" alt="Image-step7-2" width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
pip install -e .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4) Install PyTorch, Transformers, and other Python packages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio 
pip install einops timm pillow
pip install transformers==4.47.0 git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy ffmpeg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5) Install and run Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6) If you're on a remote machine (e.g., a NodeShift GPU), you'll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the port allotted to your remote server (for a NodeShift server, you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server's terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download model checkpoints and run the model for inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from boson_multimodal.serve.serve_engine import HiggsAudioServeEngine, HiggsAudioResponse
from boson_multimodal.data_types import ChatMLSample, Message, AudioContent

import torch
import torchaudio

MODEL_PATH = "bosonai/higgs-audio-v2-generation-3B-base"
AUDIO_TOKENIZER_PATH = "bosonai/higgs-audio-v2-tokenizer"

system_prompt = (
    "Generate audio following instruction.\n\n&amp;lt;|scene_desc_start|&amp;gt;\nAudio is recorded from a quiet room.\n&amp;lt;|scene_desc_end|&amp;gt;"
)

messages = [
    Message(
        role="system",
        content=system_prompt,
    ),
    Message(
        role="user",
        content="The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
    ),
]
device = "cuda" if torch.cuda.is_available() else "cpu"

serve_engine = HiggsAudioServeEngine(MODEL_PATH, AUDIO_TOKENIZER_PATH, device=device)

output: HiggsAudioResponse = serve_engine.generate(
    chat_ml_sample=ChatMLSample(messages=messages),
    max_new_tokens=1024,
    temperature=0.3,
    top_p=0.95,
    top_k=50,
    stop_strings=["&amp;lt;|end_of_text|&amp;gt;", "&amp;lt;|eot_id|&amp;gt;"],
)
torchaudio.save("output.wav", torch.from_numpy(output.audio)[None, :], output.sampling_rate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmv11819gpgs796sha0ny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmv11819gpgs796sha0ny.png" alt="Image-step8-2" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;
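&lt;p&gt;The script above writes the generated speech to &lt;code&gt;output.wav&lt;/code&gt;. Before downloading it from the remote machine, you can sanity-check the file with the Python standard library alone. This is a minimal, optional sketch; the only assumption is the filename used in the snippet above.&lt;/p&gt;

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        frames = f.getnframes()
    return rate, frames / rate

# Example (run after the generation cell has produced output.wav):
# rate, seconds = wav_info("output.wav")
# print(rate, round(seconds, 2))
```

&lt;p&gt;If the reported duration is near zero, the generation step likely failed silently and is worth re-running.&lt;/p&gt;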

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Higgs Audio v2 showcases the cutting edge of expressive audio generation, from zero-shot multilingual TTS to realistic multi-speaker dialogues, all powered by innovations like DualFFN architecture, a unified audio tokenizer, and training on 10 million hours of diverse audio. Installing it locally opens the door to these advanced capabilities for developers, researchers, and creatives alike. Powered by NodeShift Cloud, the deployment process becomes even more seamless, offering scalable compute, fast storage, and integrated tooling that accelerates experimentation and production workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Transform Clinical Research with Microsoft's MediPhi-Instruct</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 04 Aug 2025 15:32:16 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/transform-clinical-research-with-microsofts-mediphi-instruct-1ma</link>
      <guid>https://forem.com/nodeshiftcloud/transform-clinical-research-with-microsofts-mediphi-instruct-1ma</guid>
      <description>&lt;p&gt;In an era where medical language understanding is fast becoming indispensable, Microsoft’s MediPhi-Instruct stands out as a game-changing clinical AI model that combines precision, efficiency, and modular design. Built on the Phi-3.5-mini-instruct foundation, MediPhi isn’t just one model, it’s a symphony of seven finely tuned experts, each trained on distinct medical corpora such as PubMed, medical guidelines, and clinical documents. These specialists are smartly fused using SLERP and BreadCrumbs techniques to retain both depth and generality, culminating in the final MediPhi-Instruct model, aligned using Microsoft’s massive MediFlow dataset. The result? A compact, 3.8B-parameter powerhouse optimized for clinical NLP tasks, from parsing medical codes and literature to assisting in healthcare research, all while running efficiently in memory-constrained or low-latency environments. With its clinical alignment, and modular architecture, MediPhi is not just a research tool, it’s an invitation to revolutionize how we interact with medical language models.&lt;/p&gt;

&lt;p&gt;If you're a medical researcher, NLP engineer, or healthcare innovator, MediPhi-Instruct is remarkably easy to install and tailor to your own clinical use cases. Let’s dive in and get it running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x RTX 4090 or 1x RTX A6000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 16GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run MediPhi-Instruct
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;log in&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image), now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. These nodes are highly customizable and let you control different environmental configurations for GPUs ranging from H100s to A100s, CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x A100 SXM4 GPU; however, you can choose any GPU as per the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 100GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" alt="Image-step3-1" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come very close to) your configuration. In our case, we’ll choose a 2x A100 80GB GPU node with 32vCPUs/131GB RAM/100GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" alt="Image-step4-1" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" alt="Image-step5-3" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds or a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your Compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" alt="Image-step6-1" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n med python=3.11 -y &amp;amp;&amp;amp; conda activate med
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2xp66y4v4j0a8rfaxrk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2xp66y4v4j0a8rfaxrk.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Once you're inside the environment, install the necessary dependencies to run the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install huggingface_hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wqfibm9bty9zblvcjx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wqfibm9bty9zblvcjx7.png" alt=" " width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install and run Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4) If you're on a remote machine (e.g., a NodeShift GPU node), you'll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the port allotted to your remote server (for a NodeShift server, you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server's terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download model checkpoints and run the model for inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model_name = "microsoft/MediPhi-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gehjnbhmuz74pyrynwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gehjnbhmuz74pyrynwv.png" alt="Image-step8-2" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;
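&lt;p&gt;With &lt;code&gt;torch_dtype="auto"&lt;/code&gt;, the weights typically load in a 16-bit format on modern GPUs. As a rough back-of-the-envelope check (illustrative arithmetic only; real usage also needs memory for activations and the KV cache), the weights of a ~3.8B-parameter model at 2 bytes per parameter occupy about 7.6 GB, which is why the 16GB VRAM figure in the prerequisites is comfortable:&lt;/p&gt;

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate GPU memory taken by the model weights alone, in GB."""
    return n_params * bytes_per_param / 1e9

# ~3.8B parameters at fp16/bf16 (2 bytes each):
print(round(weight_memory_gb(3.8e9), 1))  # 7.6
```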

&lt;p&gt;3) Run the model for inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt = "Operative Report:\nPerformed: Cholecystectomy\nOperative Findings: The gallbladder contained multiple stones and had thickening of its wall. Mild peritoneal fluid was noted."

messages = [
    {"role": "system", "content": "Extract medical keywords from these operative notes; focus on anatomical, pathological, or procedural vocabulary."},
    {"role": "user", "content": prompt},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdteg6ztwcmzgvnzehinv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdteg6ztwcmzgvnzehinv.png" alt="Image-step8-3" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;
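&lt;p&gt;The pipeline returns plain text in &lt;code&gt;output[0]['generated_text']&lt;/code&gt;. If you want the keywords as a Python list for downstream processing, a small standard-library helper can normalize the string. Note the assumption here: this sketch presumes the model lists keywords separated by commas, semicolons, newlines, or simple bullets, which is typical but not guaranteed.&lt;/p&gt;

```python
import re

def parse_keywords(text):
    """Split a comma/semicolon/newline/bullet-separated keyword string into a list."""
    parts = re.split(r"[,\n;]+", text)
    cleaned = []
    for part in parts:
        # Strip whitespace and common bullet markers (assumed output formats).
        term = part.strip().lstrip("-*• ").strip()
        if term:
            cleaned.append(term.lower())
    return cleaned

# Hypothetical model output, for illustration only:
# parse_keywords("Cholecystectomy, gallbladder; stones\n- peritoneal fluid")
```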

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MediPhi-Instruct brings together the power of modular fine-tuning, expert model merging, and clinical alignment to deliver a lightweight yet robust solution for medical language processing. From decoding complex medical literature to enabling real-time clinical assistance in low-resource settings, it represents a significant leap in accessible and efficient healthcare AI. Powered by Microsoft’s deep domain expertise, it’s designed not only to be technically advanced but also practical for real-world use. And with NodeShift, deploying and experimenting with MediPhi becomes even easier, offering a seamless, GPU-backed environment where researchers can skip the infrastructure hassles and focus on what matters most: unlocking insights and driving innovation in medical AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>medical</category>
      <category>microsoft</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Combine the Power of ASR &amp; LLM with NVIDIA's Canary-Qwen-2.5B</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 04 Aug 2025 14:37:36 +0000</pubDate>
      <link>https://forem.com/nodeshiftcloud/combine-the-power-of-asr-llm-with-nvidias-canary-qwen-25b-24bb</link>
      <guid>https://forem.com/nodeshiftcloud/combine-the-power-of-asr-llm-with-nvidias-canary-qwen-25b-24bb</guid>
      <description>&lt;p&gt;If you’ve been looking for a way to bring powerful, reliable speech recognition to your local environment, without relying on external APIs, NVIDIA’s new Canary-Qwen-2.5B might be exactly what you need. With 2.5 billion parameters under the hood, this model doesn’t just transcribe English speech with near state-of-the-art accuracy, it does so with punctuation, capitalization, and ultra fast speed (418 RTFx). Canary-Qwen stands out with its two-in-one nature: in ASR mode, it delivers high-quality transcriptions; in LLM mode, it can go further, summarizing, answering questions, or post-processing transcripts using full language understanding. It's fast, it's flexible, and it’s designed for real-world, production-grade use.&lt;/p&gt;

&lt;p&gt;In this article, we’ll walk you through how to get Canary-Qwen-2.5B up and running locally or in GPU-accelerated environments in minutes.&lt;/p&gt;
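&lt;p&gt;A quick note on the 418 RTFx figure above: RTFx is the inverse real-time factor, i.e., how many seconds of audio the model transcribes per second of wall-clock compute. The sketch below is illustrative arithmetic only, not a benchmark of any particular hardware:&lt;/p&gt;

```python
def transcription_seconds(audio_seconds, rtfx=418.0):
    """Wall-clock seconds needed to transcribe a clip at a given RTFx."""
    return audio_seconds / rtfx

# A one-hour recording at RTFx 418 takes under ten seconds of compute:
print(round(transcription_seconds(3600), 1))
```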

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x RTX 4090 or 1x RTX A6000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 16GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run Canary-Qwen-2.5B
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;log in&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image), now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines provided by NodeShift. They are highly customizable and let you configure the GPU (ranging from H100s to A100s), CPUs, RAM, and storage according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x A100 SXM4 GPU; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 100GB storage using the slider. You can also select the region where you want your GPU to reside from the available options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" alt="Image-step3-1" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or closely match) your configuration. In our case, we’ll choose a 2x A100 80GB GPU node with 32 vCPUs/131GB RAM/100GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" alt="Image-step4-1" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Review the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" alt="Image-step5-3" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed within a few seconds to a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning your compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" alt="Image-step6-1" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
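&lt;p&gt;Besides &lt;code&gt;nvidia-smi&lt;/code&gt;, once PyTorch is installed (Step 7) you can confirm it sees the GPU from Python. A minimal sketch that degrades gracefully when torch or a CUDA device is absent:&lt;/p&gt;

```python
import importlib.util

def gpu_summary():
    """Report CUDA device info if torch is importable, else a short fallback note."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    return f"{torch.cuda.device_count()}x {torch.cuda.get_device_name(0)}"

print(gpu_summary())
```

On the node above, this should report the A100 devices; on a machine without a GPU it simply prints the fallback string.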



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n canary python=3.11 -y &amp;amp;&amp;amp; conda activate canary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1sgx53f9r79vgt0uud3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1sgx53f9r79vgt0uud3.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Once you're inside the environment, install the necessary dependencies to run the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch&amp;gt;=2.6.0 torchvision torchaudio 
pip install einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jse1jf4m6npm21vmbq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jse1jf4m6npm21vmbq2.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install NVIDIA NeMo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m pip install "nemo_toolkit[asr,tts] @ git+https://github.com/NVIDIA/NeMo.git"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;4) Install some other required dependencies for audio processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get update
sudo apt-get install ffmpeg
pip install sacrebleu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5) Install and run Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6) If you're on a remote machine (e.g., a NodeShift GPU), you'll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the port allotted to your remote server (for a NodeShift server, you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL you received on your remote server:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;
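&lt;p&gt;The URL printed by Jupyter carries a &lt;code&gt;token&lt;/code&gt; query parameter. If you only need the token (for example, to log in at &lt;code&gt;localhost:8888&lt;/code&gt; manually), it can be extracted with the standard library (the URL below is a made-up example, not a real token):&lt;/p&gt;

```python
from urllib.parse import urlparse, parse_qs

# Made-up example of the URL form Jupyter prints on startup
url = "http://localhost:8888/tree?token=abc123def456"

# Parse the query string and pull out the token value
token = parse_qs(urlparse(url).query)["token"][0]
print(token)  # → abc123def456
```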

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoint, load it onto the GPU, and configure the prompt and audio file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
import torchaudio
from nemo.collections.speechlm2.models import SALM

# Load the model
model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')

# Load and preprocess the audio
waveform, sample_rate = torchaudio.load("speech.wav")

# The model expects a specific sample rate; resample if the file differs
expected_sr = model.perception.preprocessor.featurizer.sample_rate

if sample_rate != expected_sr:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=expected_sr)
    waveform = resampler(waveform)

# Downmix multi-channel audio to mono
if waveform.dim() == 2:
    waveform = waveform.mean(dim=0)

# Add a batch dimension and record the audio length in samples
waveform = waveform.unsqueeze(0)
audio_lens = torch.tensor([waveform.shape[1]])

prompt = [[{"role": "user", "content": f"Transcribe the following: {model.audio_locator_tag}"}]]

answer_ids = model.generate(
    prompts=prompt,
    audios=waveform,
    audio_lens=audio_lens,
    max_new_tokens=128,
)

# Decode result
text = model.tokenizer.ids_to_text(answer_ids[0].tolist())
print(text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s the output for the given audio file:&lt;/p&gt;

&lt;p&gt;Audio:&lt;br&gt;
&lt;a href="https://drive.google.com/file/d/1u9583BT8pvQB_pmxHhrDlNDfVi8BTG4_/view?usp=sharing" rel="noopener noreferrer"&gt;https://drive.google.com/file/d/1u9583BT8pvQB_pmxHhrDlNDfVi8BTG4_/view?usp=sharing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj6db2f7ijg2650rbvix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuj6db2f7ijg2650rbvix.png" alt="Image-step8-2" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs91wy4o6j35wht39mva0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs91wy4o6j35wht39mva0.png" alt="Image-step8-3" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We’ve covered everything from installing dependencies and loading the 2.5B-parameter Canary-Qwen model locally or on GPU-accelerated infrastructure, to handling audio preprocessing, transcription, and optional LLM post-processing. What makes this setup rock-solid is pairing Canary-Qwen with NodeShift's AI infrastructure. NodeShift offers affordable, scalable, on-demand GPU instances across global regions, automated deployment via Terraform or GitHub Actions, and enterprise-grade compliance, so you can spin up an A100- or H100-backed VM in minutes, run your transcription and other AI workflows, and scale them securely and cost-effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
