
Local Models with tidyllm
Source: vignettes/articles/tidyllm_local_models.Rmd
Running large language models locally has become a practical option for researchers. Modern small models handle most annotation, extraction, and classification tasks that previously required a paid cloud API, while keeping data entirely on your machine. The three concrete advantages are privacy: sensitive data such as patient records, survey responses, or proprietary documents never leaves your infrastructure; cost at scale: once a model is running there are no per-request charges, so annotating millions of documents is essentially free; and reproducibility: a saved model file produces identical outputs indefinitely, whereas cloud API models are updated or retired without notice.
This article walks through two local inference tools that tidyllm supports: Ollama, which prioritises ease of setup, and llama.cpp, which exposes more control for advanced workflows. It also covers how to use OpenRouter to test open-source models via a cloud API before committing to running them locally.
Ollama
Ollama is the easiest path to running models locally. It packages model downloads, storage, and a local server into a single application that requires no configuration to get started.
Setup
Install Ollama from ollama.com or via Homebrew on macOS:
brew install ollama
On macOS, Ollama runs as a background service after installation. On
Linux and Windows, start it manually with ollama serve. For
most setups, no further configuration is needed; tidyllm connects to
Ollama’s default address automatically.
Downloading and managing models
tidyllm exposes the full Ollama model lifecycle from R. The
list_models() verb shows what is already available on the
machine:
## # A tibble: 4 × 6
## name size format family parameter_size quantization_level
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 qwen3.5:4b 2.6 GB gguf qwen3.5 4.0B Q4_K_M
## 2 qwen3.5:14b 9.0 GB gguf qwen3.5 14.0B Q4_K_M
## 3 qwen3-embedding:0.6b 522 MB gguf qwen3 0.6B F16
## 4 smollm2:1.7b 1.0 GB gguf smollm2 1.7B Q4_K_M
ollama_download_model() pulls a model by name from the
Ollama registry and shows a progress bar:
ollama_download_model("qwen3.5:4b")
## Pulling qwen3.5:4b...
## ✓ Downloaded qwen3.5:4b (2.6 GB)
It also accepts a full HuggingFace URL. GGUF is the standard file format for distributing local models, and HuggingFace hosts thousands of them. Any GGUF-based model there can be installed directly into Ollama without leaving R:
ollama_download_model("https://huggingface.co/nsatya/SmolLM2-1.7B-Instruct")
## Pulling from https://huggingface.co/nsatya/SmolLM2-1.7B-Instruct...
## ✓ Downloaded smollm2:1.7b (1.0 GB)
To remove a model when you no longer need it:
ollama_delete_model("smollm2:1.7b")
Chatting and embeddings
Chat with the default model (qwen3.5:4b) using the
standard tidyllm verb pattern:
reply <- llm_message("What is the difference between a tibble and a data frame in R?") |>
chat(ollama())
get_reply(reply)
## [1] "A tibble is a modern reimagining of the data frame from the tidyverse. The main
## differences are that tibbles never convert strings to factors, never change column
## names, and print only the first ten rows by default, showing column types beneath
## the header. They also give more informative error messages when you access a column
## that does not exist."
The .num_ctx parameter controls how much text the model
can read at once, known as its context window. Text is measured
in tokens, which are roughly three-quarters of a word each. The function
default is 2048 tokens, which is quite short and much
smaller than what modern models like Qwen3.5 actually support. For long
documents, increase it to fit your input:
long_document |>
llm_message() |>
chat(ollama(.num_ctx = 32768))
For embeddings, use the embed() verb. Embeddings convert
text into numeric vectors that capture meaning, making them useful for
similarity search and clustering. The default embedding model is
qwen3-embedding:0.6b, a purpose-built model for this
task:
## # A tibble: 3 × 2
## input embeddings
## <chr> <list>
## 1 neural networks <dbl [1,024]>
## 2 deep learning <dbl [1,024]>
## 3 baking bread <dbl [1,024]>
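Once text is embedded, similarity search is just vector arithmetic. The helper below is a minimal sketch, not a tidyllm function: it computes cosine similarity between two plain numeric vectors, the same operation you would apply to the embeddings list-column above (toy three-dimensional vectors stand in for the real 1,024-dimensional ones):

```r
# Cosine similarity: 1 means identical direction, 0 means unrelated.
# Minimal sketch -- works on any pair of numeric vectors.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
}

# Toy stand-ins for the embedding vectors above:
nn    <- c(0.9, 0.1, 0.0)  # "neural networks"
dl    <- c(0.8, 0.2, 0.1)  # "deep learning"
bread <- c(0.0, 0.1, 0.9)  # "baking bread"

cosine_sim(nn, dl)     # high: related topics
cosine_sim(nn, bread)  # low: unrelated topics
```

Applied to real embeddings, the two machine-learning phrases score close to each other and far from the baking phrase, which is the basis of semantic search and clustering.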
Finding the Right Model
The open-source landscape moves fast enough that a model recommended a few months ago may already have been superseded. Here is how to navigate it.
Where to look
Chatbot Arena is the most reliable signal for general instruction-following quality. Visitors vote on blind pairwise comparisons between randomly assigned models, using their own prompts. Because models cannot overfit to the prompts, the resulting rankings reflect what people actually prefer. It is the best single answer to “which model should I try first?”
Hugging Face Open LLM Leaderboard tracks automated test scores across a range of academic benchmarks. Useful for filtering clearly weak models and spotting improvements between versions; less reliable as a final verdict because scores can be inflated by training specifically for these tests.
Your own pilot beats both. If your task is classification, extraction, or annotation, run 50–100 real examples through two or three candidate models. The results will tell you more than any leaderboard because they reflect your specific domain, label set, and prompt style.
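A pilot comparison needs very little code. The sketch below is a hypothetical outline, assuming you have hand-coded gold labels for a small sample and one vector of predicted labels per candidate model (in practice the predictions would come from chat() calls); it reports simple agreement per model:

```r
# Hypothetical pilot evaluation: agreement between model labels and
# hand-coded gold labels.
pilot <- data.frame(
  text = c("great product", "broken on arrival", "does the job"),
  gold = c("positive", "negative", "neutral")
)

# Stand-in predictions -- replace with real chat() outputs per model.
preds <- list(
  model_a = c("positive", "negative", "neutral"),
  model_b = c("positive", "negative", "positive")
)

agreement <- sapply(preds, function(p) mean(p == pilot$gold))
agreement
```

With 50–100 real examples, a gap of even a few percentage points between candidates is usually visible immediately, along with the kinds of inputs each model gets wrong.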
Understanding quantization
Models on HuggingFace are distributed as GGUF files in various quantization levels. Quantization is the process of compressing a model by storing its internal numbers at lower precision, similar to saving an image as a smaller JPEG instead of a full-resolution PNG. A more compressed model uses less memory and runs faster, but may produce slightly worse outputs on complex tasks.
The label in the filename tells you how aggressively the model has been compressed:
| Label | Quality vs. full precision | Memory use | Typical use |
|---|---|---|---|
| Q8_0 | Near-identical | Largest | Reference quality; use when you have enough GPU memory |
| Q4_K_M | Small, often imperceptible loss | Medium | Best everyday tradeoff |
| Q3_K_M | Noticeable on complex reasoning | Smaller | Only when memory is very tight |
| Q2_K | Significant degradation | Smallest | Experimentation only |
For most research tasks, Q4_K_M on a model one size class larger than your first instinct is a reliable starting point. If a 7B model at Q4_K_M does not perform well enough, try a 13B at Q4_K_M rather than upgrading to Q8_0 on the 7B.
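A rough memory estimate makes this tradeoff concrete. As a rule of thumb, weight memory in GB is roughly parameters (in billions) times bits per weight divided by 8, before context-cache overhead. The sketch below encodes that back-of-the-envelope rule; the effective bits-per-weight values for each quantization level are approximations, not exact figures:

```r
# Back-of-the-envelope weight memory: params (billions) * bits / 8 gives GB.
# Bits-per-weight per quantization level are approximate.
quant_bits <- c(Q8_0 = 8.5, Q4_K_M = 4.8, Q3_K_M = 3.9, Q2_K = 2.6)

model_memory_gb <- function(params_billion, quant = "Q4_K_M") {
  params_billion * quant_bits[[quant]] / 8
}

model_memory_gb(7)          # 7B at Q4_K_M: ~4 GB
model_memory_gb(13)         # 13B at Q4_K_M: ~8 GB
model_memory_gb(7, "Q8_0")  # 7B at Q8_0: ~7 GB
```

By this estimate a 13B at Q4_K_M takes roughly the same memory as a 7B at Q8_0, which is why stepping up a size class at Q4_K_M is usually the better use of that memory.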
The current model landscape
Model size is measured in parameters, the numerical weights learned during training. More parameters generally means more capable, but also more memory required and slower inference. As of early 2026, strong open-weight families come from a range of labs:
| Family | Lab | Sizes | Notes |
|---|---|---|---|
| Qwen3.5 | Alibaba | 0.8B–9B; 27B, 35B-A3B, 122B-A10B; 397B-A17B | 201 languages, 1M context, thinking mode on by default; default in tidyllm |
| Mistral Small 4 | Mistral AI | 119B total / 6B active (MoE) | Reasoning, vision, and agentic coding unified; 256K context; Apache 2.0 |
| Llama 4 Scout | Meta | 109B total / ~17B active (MoE) | Native multimodal, 10M token context |
| Gemma 3 | Google | 1B, 4B, 12B, 27B | Compact multimodal models; permissive license |
| Kimi K2.5 | Moonshot AI | 1T total / 32B active (MoE) | Strong coding and vision; MIT license |
| MiniMax M2.5 | MiniMax | large MoE | Near-frontier quality at a fraction of proprietary API cost |
| DeepSeek V3.2 | DeepSeek | 671B MoE | Open-weight general model; best when hardware is not the constraint |
| Phi-4 | Microsoft | 14B | Punches well above weight class on reasoning tasks |
Several of these are Mixture-of-Experts (MoE) models. Instead of activating the entire network for every word, MoE models route each step through only a small subset of specialised sub-networks. Compute cost therefore depends on the active parameter count, while memory still scales with the total, since all experts must be loaded. In the table, notation like “35B-A3B” means 35 billion total parameters but only 3 billion active per step: such a model generates text roughly as fast as a 3B model, though it still needs the memory of a 35B one.
Testing with OpenRouter before committing hardware
Running a large model locally requires significant hardware: a 13B model at Q4_K_M needs around 10 GB of memory; a 70B model needs 40 GB or more. Before investing in hardware or reserving compute time, test the model via a cloud API first to confirm it works for your task.
OpenRouter provides a single API
key for hundreds of models, including the Qwen, Mistral, Llama, Kimi,
Gemma, and MiniMax families. Because tidyllm uses the same
verb-and-provider pattern for every provider, switching between
candidates requires only changing the .model argument:
prompt <- llm_message("Classify this job description into a one-sentence occupation label:
'I oversee a team of data engineers building our company data platform'")
result_qwen <- prompt |> chat(openrouter(.model = "qwen/qwen3.5-35b-a3b"))
result_mistral <- prompt |> chat(openrouter(.model = "mistralai/mistral-small-4"))
result_kimi <- prompt |> chat(openrouter(.model = "moonshotai/kimi-k2.5"))
get_reply(result_qwen)
get_reply(result_mistral)
get_reply(result_kimi)
## [1] "Data Engineering Manager: oversees a team responsible for building and maintaining the company data platform."
## [1] "Data Engineering Manager: leads engineers developing the organisation's core data infrastructure."
## [1] "Data Engineering Manager: manages a team of data engineers constructing the company's data platform."
openrouter_list_models() returns a table of all
available models with pricing. You can filter it by context length (how
much text the model can process at once) and sort by cost to shortlist
candidates:
openrouter_list_models() |>
filter(context_length >= 32000) |>
arrange(prompt_price_per_million) |>
select(id, name, context_length, prompt_price_per_million, completion_price_per_million)
## # A tibble: 8 × 5
## id name context_length prompt_price_per_mil…¹ completion_price_per…²
## <chr> <chr> <int> <dbl> <dbl>
## 1 qwen/qwen3… Qwen… 1000000 0.1 0.3
## 2 mistralai/… Mist… 256000 0.1 0.3
## 3 google/gem… Gemm… 131072 0.1 0.2
## 4 microsoft/… Phi-4 131072 0.14 0.14
## 5 meta-llama… Llam… 10000000 0.17 0.6
## 6 minimax/mi… Mini… 1000000 0.2 0.55
## 7 moonshotai… Kimi… 131072 0.5 1.5
## 8 deepseek/d… Deep… 163840 0.55 2.19
## # ℹ abbreviated names: ¹prompt_price_per_million, ²completion_price_per_million
Prices are in US dollars per million tokens. Once a model performs
well on your task, find the corresponding GGUF on HuggingFace and run it
locally for free. openrouter_credits() tracks how much of
your balance the pilot has consumed:
## # A tibble: 1 × 3
## total_credits total_usage remaining
## <dbl> <dbl> <dbl>
## 1 15 0.153 14.8
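At these prices, the cost of a pilot is easy to estimate in advance. The helper below is a hypothetical back-of-the-envelope function, not part of tidyllm; token counts per document are rough averages you would estimate from your own data:

```r
# Estimate pilot cost in USD from per-million-token prices
# (hypothetical helper; token counts are rough per-document averages).
pilot_cost <- function(n_docs, prompt_tokens, completion_tokens,
                       prompt_price, completion_price) {
  n_docs * (prompt_tokens * prompt_price +
            completion_tokens * completion_price) / 1e6
}

# 100 pilot documents, ~500 prompt tokens and ~50 completion tokens each,
# at $0.10 / $0.30 per million tokens (the cheapest rows above):
pilot_cost(100, 500, 50, 0.10, 0.30)  # about two-thirds of a cent
```

Even a multi-model pilot over a few hundred documents typically costs well under a dollar, which is why testing via OpenRouter before buying hardware is cheap insurance.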
llama.cpp
Ollama is built on top of llama.cpp and covers most common use cases well. Going directly to llama.cpp gives you access to features Ollama does not expose: hard guarantees on output format (grammar constraints), confidence scores for each prediction (log-probabilities), and finer control over how the model generates text. It is worth the extra setup effort when your workflow needs any of these.
Setup
Installation. On macOS, install via Homebrew:
brew install llama.cpp
On Linux, build from source or use a pre-built release from the llama.cpp GitHub releases page. On Windows, pre-built binaries are available from the same page; pick the variant that matches your hardware (GPU-accelerated with CUDA for Nvidia or Vulkan for AMD/Intel, or CPU-only if you have no dedicated GPU).
Starting a server. Unlike Ollama, llama.cpp does not
manage models for you: you point it at a GGUF file on disk and it serves
that model until you stop the process. The server command is
llama-server:
llama-server -m ~/models/Qwen3.5-8B-Q4_K_M.gguf -c 32768
The -c flag sets the context window size in tokens
(32768 is a good default for most tasks). Once started, the server
listens on http://localhost:8080 by default.
Embeddings require a separate process. The embedding endpoint uses a dedicated embedding model and must run on a different port:
llama-server -m ~/models/Qwen3-Embedding-0.6B-Q8_0.gguf --embedding --port 8081
You can run both commands in separate terminal tabs, or keep them running in the background using a terminal multiplexer like tmux.
Telling tidyllm where to look. Add the following to
your .Renviron file (run
usethis::edit_r_environ() to open it, then restart R):
LLAMACPP_SERVER=http://localhost:8080
LLAMACPP_MODEL_DIR=~/models
LLAMACPP_SERVER is the address tidyllm uses to reach the
chat server. LLAMACPP_MODEL_DIR is the folder where model
files are stored and where the management functions below will look for
them.
Downloading models from HuggingFace
HuggingFace is the primary source for GGUF files. tidyllm provides functions to browse and download directly from R.
list_hf_gguf_files() shows all the quantization variants
available for a given model repository:
list_hf_gguf_files("Qwen/Qwen3.5-8B-GGUF")
## # A tibble: 5 × 3
## filename size_gb url
## <chr> <dbl> <chr>
## 1 Qwen3.5-8B-Q2_K.gguf 3.19 https://huggingface.co/Qwen/Qwen3.5-8B-GGUF/re…
## 2 Qwen3.5-8B-Q3_K_M.gguf 4.02 https://huggingface.co/Qwen/Qwen3.5-8B-GGUF/re…
## 3 Qwen3.5-8B-Q4_K_M.gguf 5.16 https://huggingface.co/Qwen/Qwen3.5-8B-GGUF/re…
## 4 Qwen3.5-8B-Q6_K.gguf 6.6 https://huggingface.co/Qwen/Qwen3.5-8B-GGUF/re…
## 5 Qwen3.5-8B-Q8_0.gguf 8.54 https://huggingface.co/Qwen/Qwen3.5-8B-GGUF/re…
llamacpp_download_model() downloads a specific file with
a progress bar:
llamacpp_download_model(
.repo = "Qwen/Qwen3.5-8B-GGUF",
.filename = "Qwen3.5-8B-Q4_K_M.gguf"
)
## Downloading Qwen3.5-8B-Q4_K_M.gguf from Qwen/Qwen3.5-8B-GGUF...
## [==============================] 5.16 GB / 5.16 GB
## ✓ Saved to ~/models/Qwen3.5-8B-Q4_K_M.gguf
llamacpp_list_local_models() shows what GGUF files are
already in your model directory, without needing a running server:
## # A tibble: 2 × 4
## filename size_gb modified path
## <chr> <dbl> <chr> <chr>
## 1 Qwen3.5-8B-Q4_K_M.gguf 5.16 2026-03-17 09:14:22 ~/models/Qwen3.5-8…
## 2 Qwen3-Embedding-0.6B-Q8_0.gguf 0.62 2026-03-10 14:03:05 ~/models/Qwen3-Emb…
To verify the server is running and see which model is loaded:
## $status
## [1] "ok"
##
## $slots_idle
## [1] 1
##
## $slots_processing
## [1] 0
## # A tibble: 1 × 3
## id object created
## <chr> <chr> <int>
## 1 local-model model 1742204062
Chatting
Basic chat works through the same verb pattern as all other providers:
reply <- llm_message("Explain in one sentence what a mixture-of-experts model is.") |>
chat(llamacpp())
get_reply(reply)
## [1] "A mixture-of-experts model routes each token through only a subset of specialised
## networks called experts, keeping inference cost low while maintaining the capacity
## of a much larger model."
Confidence scores for annotation quality control
When you use a model for classification or annotation, not all predictions are equally confident. Log-probabilities give you a numeric confidence score for each output token: a score close to zero means the model was highly certain; a strongly negative score means it was genuinely uncertain between several options.
Passing .logprobs = TRUE stores these scores in the
response metadata alongside the reply:
result <- llm_message(
"Reply with exactly one word. Classify the sentiment of this review:
'The delivery was two days late but the product itself is solid.'"
) |>
chat(llamacpp(.logprobs = TRUE, .top_logprobs = 3))
get_reply(result)
get_metadata(result)$api_specific$logprobs
## [1] "neutral"
## $content
## $content[[1]]
## $content[[1]]$token
## [1] "neutral"
## $content[[1]]$logprob
## [1] -0.847
## $content[[1]]$top_logprobs
## # A tibble: 3 × 2
## token logprob
## <chr> <dbl>
## 1 neutral -0.847
## 2 mixed -1.20
## 3 positive -2.03
A score of -0.85 on “neutral” means the model had genuine
uncertainty; the review is mixed and the model knows it. In a large
annotation batch, you can flag predictions where the winning score is
better than -1 (high confidence) for direct use, and route the rest to a
larger model or human review. Log-probability scores are also available
on openai() and ollama().
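Log-probabilities convert directly to probabilities with exp(), which makes thresholds easier to reason about: a -1 cutoff corresponds to roughly 37% probability on the winning token, and the scores from the example above become an interpretable distribution over the candidate labels:

```r
# Convert token log-probabilities to probabilities: p = exp(logprob).
logprobs <- c(neutral = -0.847, mixed = -1.20, positive = -2.03)
probs <- exp(logprobs)
round(probs, 3)

# A -1 log-probability threshold in probability terms (about 0.368):
exp(-1)
```

In a batch workflow, filtering on exp(logprob) rather than the raw score lets you state the rule in plain terms, for example “keep predictions where the model puts at least 40% probability on its answer.”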
Grammar constraints for reliable structured output
Instruction-tuned models generally follow format instructions well, but they can occasionally produce malformed output, especially with smaller models or unusual inputs. Grammar constraints enforce the output format at the sampling step of text generation: tokens that would violate the grammar are masked out before the model can choose them, regardless of model size or how well it follows instructions.
This means even a small model running on a laptop will always return
valid JSON when you pass a .json_schema. The
.grammar parameter goes further: it accepts a grammar
definition that can constrain output to any pattern, not just JSON:
sentiment_grammar <- r"(root ::= ("positive" | "negative" | "neutral"))"
result <- llm_message(
"Classify the sentiment of this review. Reply with one word only:
'Fast shipping, exactly as described, would buy again.'"
) |>
chat(llamacpp(.grammar = sentiment_grammar))
get_reply(result)
## [1] "positive"
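Grammars use llama.cpp’s GBNF format and can describe any token-level pattern, not just fixed word lists. As an illustrative sketch (this grammar string is an assumption for demonstration, not taken from the tidyllm documentation), a grammar that forces the output to be an ISO-8601 date:

```r
# GBNF grammar constraining output to a YYYY-MM-DD date (illustrative sketch).
date_grammar <- r"(
root  ::= digit digit digit digit "-" digit digit "-" digit digit
digit ::= [0-9]
)"

# Usage (requires a running llama-server):
# llm_message("When was R 1.0 released? Answer with the date only.") |>
#   chat(llamacpp(.grammar = date_grammar))
```

Whatever the model would otherwise have said, only strings matching the pattern can be emitted, so downstream parsing never fails.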
Local embeddings and reranking
The embedding server provides local dense vectors for similarity
search and clustering. Point the llamacpp() provider at it
via the .server argument:
c("transformer architecture", "attention mechanism", "baking sourdough") |>
embed(llamacpp(.server = "http://localhost:8081"))
## # A tibble: 3 × 2
## input embeddings
## <chr> <list>
## 1 transformer architecture <dbl [1,024]>
## 2 attention mechanism <dbl [1,024]>
## 3 baking sourdough <dbl [1,024]>
If you are building a document search system,
llamacpp_rerank() can re-order a set of candidate documents
by how relevant they are to a query. This is more accurate than sorting
by raw embedding similarity alone, and runs on the same local server
with no additional model:
llamacpp_rerank(
.query = "how does self-attention work",
.documents = c(
"Attention is All You Need introduced the transformer architecture.",
"Sourdough bread uses a wild yeast starter for leavening.",
"Self-attention computes a weighted sum of all positions in a sequence.",
"BERT uses bidirectional transformers pre-trained on masked language modelling."
)
)
## # A tibble: 4 × 3
## index document relevance_score
## <int> <chr> <dbl>
## 1 2 Self-attention computes a weighted sum of all positions… 0.921
## 2 0 Attention is All You Need introduced the transformer ar… 0.783
## 3 3 BERT uses bidirectional transformers pre-trained on mas… 0.641
## 4 1 Sourdough bread uses a wild yeast starter for leavening. 0.021
Combined with embed() for retrieval and
llamacpp_rerank() for re-ranking, you have a fully local
document search pipeline: no cloud calls, no token costs, and
reproducible results.
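The plumbing of that two-stage pipeline is simple to express in R. The sketch below uses stand-in score vectors where the real calls would go: stage one would come from embed() plus cosine similarity, stage two from llamacpp_rerank():

```r
# Retrieve-then-rerank plumbing with stand-in scores.
docs <- c("doc A", "doc B", "doc C", "doc D", "doc E")

# Stage 1: cheap embedding similarity over the whole corpus
# (stand-in for embed() + cosine similarity).
embed_scores <- c(0.62, 0.91, 0.15, 0.78, 0.40)
top_k <- order(embed_scores, decreasing = TRUE)[1:3]
candidates <- docs[top_k]

# Stage 2: accurate re-ordering of the shortlist
# (stand-in for llamacpp_rerank() relevance scores).
rerank_scores <- c(0.55, 0.97, 0.80)
reranked <- candidates[order(rerank_scores, decreasing = TRUE)]
reranked
```

The point of the split is cost: embedding similarity is cheap enough to scan thousands of documents, while the more accurate reranker only ever sees the short candidate list.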
Any OpenAI-Compatible Endpoint
Many universities, hospitals, and cloud providers run their own LLM
servers using tools like vLLM or text-generation-inference.
These services typically expose the same API format as OpenAI, so they
work directly with tidyllm’s openai() provider by pointing
it at a different server address:
llm_message("Summarise this paragraph in one sentence.") |>
chat(openai(.api_url = "http://my-internal-server:8000/v1"))
For llama.cpp specifically, using llamacpp() with
LLAMACPP_SERVER set in .Renviron is the
cleaner path since it also exposes the llama.cpp-specific parameters
covered above. The openai() route is there when you are
connecting to a generic institutional endpoint that does not need any of
those extras.