
Image by Author
DeepSeek-R1-0528 is the most recent update to DeepSeek's R1 reasoning model. It requires 715GB of disk space, making it one of the largest open-source models available. However, thanks to advanced quantization techniques from Unsloth, the model's size can be reduced to 162GB, an 80% reduction. This allows users to experience the full power of the model with significantly lower hardware requirements, albeit with a slight trade-off in performance.
In this tutorial, we will:
- Set up Ollama and Open Web UI to run the DeepSeek-R1-0528 model locally.
- Download and configure the 1.78-bit quantized version (IQ1_S) of the model.
- Run the model using both GPU + CPU and CPU-only setups.
Step 0: Prerequisites
To run the IQ1_S quantized version, your system must meet the following requirements:
- GPU Requirements: At least one 24GB GPU (e.g., NVIDIA RTX 4090 or A6000) and 128GB of RAM. With this setup, you can expect a generation speed of roughly 5 tokens/second.
- RAM Requirements: A minimum of 64GB of RAM is required to run the model without a GPU, but performance will be limited to about 1 token/second.
- Optimal Setup: For the best performance (5+ tokens/second), you need at least 180GB of unified memory or a combination of 180GB of RAM + VRAM.
- Storage: Ensure you have at least 200GB of free disk space for the model and its dependencies.
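Before downloading anything, it is worth confirming your machine actually meets these numbers. A quick check on Linux (output format varies by distribution, and nvidia-smi is only present when NVIDIA drivers are installed):

```shell
# Total RAM
free -h | awk '/^Mem:/ {print "RAM: " $2}'
# Free disk space on the root filesystem
df -h / | awk 'NR==2 {print "Free disk: " $4}'
# GPU name and VRAM, if an NVIDIA driver is available
command -v nvidia-smi >/dev/null \
  && nvidia-smi --query-gpu=name,memory.total --format=csv,noheader \
  || echo "No NVIDIA GPU detected"
```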
Step 1: Install Dependencies and Ollama
Update your system and install the required tools. Ollama is a lightweight server for running large language models locally. Install it on an Ubuntu distribution using the following commands:
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
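To verify that the install succeeded, check that the ollama binary is on your PATH (the version number will differ depending on when you install):

```shell
# Print the installed Ollama version; fall back to a message if it is missing
command -v ollama >/dev/null && ollama --version || echo "ollama not found in PATH"
```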
Step 2: Download and Run the Model
Download and run the 1.78-bit quantized version (IQ1_S) of the DeepSeek-R1-0528 model using the following commands:
# Start the Ollama server in the background
ollama serve &
# Pull and run the quantized model from Hugging Face
ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0
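The download is roughly 162GB, so this step can take a while. Once `ollama serve` is up, the model can also be queried over Ollama's HTTP API on its default port 11434, which is handy for scripting. The prompt below is just an example:

```shell
# Build the request payload for Ollama's /api/generate endpoint
PAYLOAD='{"model": "hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0", "prompt": "What is 2 + 2?", "stream": false}'
# Send it to the local Ollama server (prints a message if the server is down)
curl -s http://localhost:11434/api/generate -d "$PAYLOAD" \
  || echo "Ollama server not reachable on port 11434"
```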

Step 3: Set Up and Run Open Web UI
Pull the Open Web UI Docker image with CUDA support, then run the container with GPU support and Ollama integration. This command will:
- Start the Open Web UI server on port 8080 inside the container (mapped to host port 9783)
- Enable GPU acceleration using the --gpus all flag
- Mount the required data directory (-v open-webui:/app/backend/data)
docker pull ghcr.io/open-webui/open-webui:cuda
docker run -d -p 9783:8080 --gpus all -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:cuda
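If the page does not load, confirm the container is up and check its logs. These checks assume Docker is installed and use the container name from the command above:

```shell
# Show the container status and the 9783->8080 port mapping
command -v docker >/dev/null \
  && docker ps --filter name=open-webui --format '{{.Names}} {{.Status}} {{.Ports}}' \
  || echo "docker not found in PATH"
# Tail recent logs for startup errors (no-op if docker is missing)
command -v docker >/dev/null && docker logs --tail 20 open-webui || true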
Once the container is running, access the Open Web UI interface in your browser at http://localhost:9783/.
Step 4: Running DeepSeek R1 0528 in Open WebUI
Select the hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0 model from the model menu.

If the Ollama server fails to properly use the GPU, you can switch to CPU execution. While this will significantly reduce performance (roughly 1 token/second), it ensures the model can still run.
# Kill any existing Ollama processes
pkill ollama
# List the processes holding GPU memory (rerun with -k to kill them)
sudo fuser -v /dev/nvidia*
# Restart Ollama with the GPU hidden so it falls back to CPU
CUDA_VISIBLE_DEVICES="" ollama serve
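After the restart, you can confirm the server is up and check whether the model is running on CPU: `ollama ps` reports the processor used by each loaded model. Both checks below fall back gracefully if nothing is running:

```shell
# Lightweight liveness check against Ollama's default port
curl -s http://localhost:11434/api/version || echo "Ollama server not reachable"
# Show loaded models and whether they run on CPU or GPU
command -v ollama >/dev/null && ollama ps || echo "ollama not available or server not running"
```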
Once the model is running, you can interact with it via Open Web UI. However, note that the speed will be limited to about 1 token/second due to the lack of GPU acceleration.

Final Thoughts
Running even the quantized version was challenging. You need a fast internet connection to download the model, and if the download fails, you have to restart the entire process from the beginning. I also faced many issues trying to run it on my GPU, as I kept getting GGUF errors related to low VRAM. Despite trying several common fixes for GPU errors, nothing worked, so I eventually switched everything to CPU. While this did work, it now takes about 10 minutes just for the model to generate a response, which is far from ideal.
I'm sure there are better solutions out there, perhaps using llama.cpp, but trust me, it took me the whole day just to get this working.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.