

Have you ever wondered if there's a better way to install and run llama.cpp locally? Almost every local large language model (LLM) application today relies on llama.cpp as the backend for running models. But here's the catch: most setups are either too complex, require multiple tools, or don't give you a solid user interface (UI) out of the box.
Wouldn't it be nice if you could:
- Run a powerful model like GPT-OSS 20B with just a few commands
- Get a modern web UI instantly, without extra hassle
- Have a fast, well-optimized setup for local inference
That's exactly what this tutorial is about.
In this guide, we will walk through a fast, optimized way to run the GPT-OSS 20B model locally using the llama-cpp-python package together with Open WebUI. By the end, you'll have a fully working local LLM environment that's easy to use, efficient, and production-ready.
# 1. Setting Up Your Environment
If you already have the uv command installed, your life just got easier. If not, don't worry: you can install it quickly by following the official uv installation guide.
Once uv is installed, open your terminal and install Python 3.12 with:
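```bash
# Install a uv-managed Python 3.12
uv python install 3.12
```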
Next, let's set up a project directory, create a virtual environment, and activate it:
```bash
mkdir -p ~/gpt-oss && cd ~/gpt-oss
uv venv .venv --python 3.12
source .venv/bin/activate
```
# 2. Installing Python Packages
Now that your environment is ready, let's install the required Python packages.
First, upgrade pip to the latest version. Next, install the llama-cpp-python server package. This build comes with CUDA support (for NVIDIA GPUs), so you'll get maximum performance if you have a compatible GPU:
```bash
uv pip install --upgrade pip
uv pip install "llama-cpp-python[server]" --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
```
Finally, install Open WebUI and Hugging Face Hub:
```bash
uv pip install open-webui huggingface_hub
```
- Open WebUI: Provides a ChatGPT-style web interface for your local LLM server
- Hugging Face Hub: Makes it easy to download and manage models directly from Hugging Face
# 3. Downloading the GPT-OSS 20B Model
Next, let's download the GPT-OSS 20B model in a quantized format (MXFP4) from Hugging Face. Quantized models are optimized to use less memory while still maintaining strong performance, which makes them ideal for running locally.
Run the following command in your terminal:
```bash
huggingface-cli download bartowski/openai_gpt-oss-20b-GGUF openai_gpt-oss-20b-MXFP4.gguf --local-dir models
```
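Equivalently, since we installed huggingface_hub, the same file can be fetched from Python. A minimal sketch:
```python
# Download the quantized GGUF file into ./models (same as the CLI command above)
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/openai_gpt-oss-20b-GGUF",
    filename="openai_gpt-oss-20b-MXFP4.gguf",
    local_dir="models",
)
print(path)  # local path to the downloaded model file
```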
# 4. Serving GPT-OSS 20B Locally Using llama.cpp
Now that the model is downloaded, let's serve it using the llama.cpp Python server.
Run the following command in your terminal:
```bash
python -m llama_cpp.server \
  --model models/openai_gpt-oss-20b-MXFP4.gguf \
  --host 127.0.0.1 --port 10000 \
  --n_ctx 16384
```
Here's what each flag does:
- `--model`: Path to your quantized model file
- `--host`: Local host address (127.0.0.1)
- `--port`: Port number (10000 in this case)
- `--n_ctx`: Context length (16,384 tokens for longer conversations)
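One flag worth knowing about, though not used in the command above, is `--n_gpu_layers`, which offloads model layers to the GPU (assuming the CUDA build installed earlier). A sketch of the same command with full GPU offload:
```bash
# Optional: offload all layers to the GPU for faster inference (CUDA build only)
# -1 means "offload every layer"; smaller values split work between CPU and GPU
python -m llama_cpp.server \
  --model models/openai_gpt-oss-20b-MXFP4.gguf \
  --host 127.0.0.1 --port 10000 \
  --n_ctx 16384 \
  --n_gpu_layers -1
```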
If everything is working, you will see logs like this:
```
INFO: Started server process [16470]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:10000 (Press CTRL+C to quit)
```
To confirm that the server is running and the model is available, run:
```bash
curl http://127.0.0.1:10000/v1/models
```
Expected output:
{"object":"listing","knowledge":[{"id":"models/openai_gpt-oss-20b-MXFP4.gguf","object":"model","owned_by":"me","permissions":[]}]}
Next, we will integrate it with Open WebUI to get a ChatGPT-style interface.
# 5. Launching Open WebUI
We have already installed the open-webui Python package; now let's launch it.
Open a new terminal window (keep your llama.cpp server running in the first one) and run:
```bash
open-webui serve --host 127.0.0.1 --port 9000
```
This will start the WebUI server at: http://127.0.0.1:9000
When you open the link in your browser for the first time, you will be prompted to:
- Create an admin account (using your email and a password)
- Log in to access the dashboard
This admin account ensures your settings, connections, and model configurations are saved for future sessions.
# 6. Setting Up Open WebUI
By default, Open WebUI is configured to work with Ollama. Since we're running our model with llama.cpp, we need to adjust the settings.
Follow these steps inside the WebUI:
// Add llama.cpp as an OpenAI Connection
- Open the WebUI: http://127.0.0.1:9000 (or your forwarded URL).
- Click your avatar (top-right corner) → Admin Settings.
- Go to: Connections → OpenAI Connections.
- Edit the existing connection:
  - Base URL: http://127.0.0.1:10000/v1
  - API Key: (leave blank)
- Save the connection.
- (Optional) Disable the Ollama API and Direct Connections to avoid errors.
// Map a Friendly Model Alias
- Go to: Admin Settings → Models (or under the connection you just created)
- Edit the model name to gpt-oss-20b
- Save the model
// Start Chatting
- Open a new chat
- In the model dropdown, select gpt-oss-20b (the alias you created)
- Send a test message
# Final Thoughts
I honestly didn't expect it to be this easy to get everything running with just Python. In the past, setting up llama.cpp meant cloning repositories, running CMake builds, and debugging endless errors, a painful process many of us are familiar with.
But with this approach, using the llama.cpp Python server together with Open WebUI, the setup worked right out of the box. No messy builds, no complicated configs, just a few simple commands.
In this tutorial, we:
- Set up a clean Python environment with uv
- Installed the llama.cpp Python server and Open WebUI
- Downloaded the GPT-OSS 20B quantized model
- Served it locally and connected it to a ChatGPT-style interface
The result? A fully local, private, and optimized LLM setup that you can run on your own machine with minimal effort.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.