Embedding Models in tidyllm

While most tidyllm use cases revolve around chat models, the package also supports embedding models, another type of large language model. These models generate numerical vectors that map input text to points in a high-dimensional space. Each point represents the semantic meaning of the text:

Similar meanings are close together: For example, a text about a “cat” and another about a “kitten” would be mapped to nearby points.
Different meanings are farther apart: Conversely, a text about a “cat” and one about a “car” would have points that are much farther apart.
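
As a minimal illustration of this idea, the sketch below uses hand-made three-dimensional vectors. They are hypothetical stand-ins chosen for this example, not output of a real embedding model, and they only serve to show what "nearby" and "farther apart" mean numerically.

```r
# Hypothetical 3-dimensional "embeddings" (made up for illustration);
# real embedding models produce hundreds or thousands of dimensions
toy_embeddings <- rbind(
  cat    = c(0.90, 0.80, 0.10),
  kitten = c(0.85, 0.90, 0.15),
  car    = c(0.10, 0.20, 0.95)
)

# Pairwise Euclidean distances between the three points
toy_distances <- as.matrix(dist(toy_embeddings))
round(toy_distances, 2)
```

In this toy setup, the distance between "cat" and "kitten" is much smaller than the distance between "cat" and "car".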

Semantic Search in Economics Paper Abstracts

To demonstrate embeddings in action, we’ll implement a semantic search on a dataset of 22,960 economics paper abstracts published between 2010 and 2024. Instead of relying on keyword matching, we’ll use embeddings to find papers with similar topics based on their underlying meaning.

Let’s start by loading and exploring the data:

library(tidyverse)
library(tidyllm)
library(here)

abstracts <- read_rds("abstracts_data.rds")
#The structure of our file:
print(abstracts,width = 60)
## # A tibble: 22,960 × 8
##    year  journal  authors volume firstpage lastpage abstract
##    <chr> <chr>    <chr>   <chr>  <chr>     <chr>    <chr>   
##  1 2024  Journal… Bauer,… 22     2075      2107     This pa…
##  2 2019  The Rev… Karlan… 86     1704      1746     We use …
##  3 2022  Journal… Corset… 20     513       548      We stud…
##  4 2018  The Rev… Anagol… 85     1971      2004     We stud…
##  5 2024  America… Thores… 16     447       79       This pa…
##  6 2024  Journal… Ren, Y… 238    NA        NA       The rap…
##  7 2013  The Rev… Adhvar… 95     725       740      A key p…
##  8 2022  Econome… Brooks… 90     2187      2214     If expe…
##  9 2011  Health … Fletch… 20     553       570      We exam…
## 10 2010  Journal… Rohwed… 24     119       38       Early r…
## # ℹ 22,950 more rows
## # ℹ 1 more variable: pdf_link <chr>

target_abstract <-  "We use the expansion of the high-speed rail network in Germany  as a natural experiment to examine the causal effect of reductions in commuting time  betweenregions on the commuting decisions of workers and their choices regarding  where tolive and where to work. We exploit three key features in this setting:i) investmentin high-speed rail has, in some cases dramatically, reduced travel times between regions,ii) several small towns were connected to the high-speed rail network onlyfor political reasons, and iii) high-speed trains have left the transportation of goodsunaffected. Combining novel information on train schedules and the opening ofhigh-speed rail stations with panel data on all workers in Germany, we show that a reduction in travel time by one percent raises the number of commuters betweenregions by 0.25 percent. This effect is mainly driven by workers changing jobs to smaller cities while keeping their place of residence in larger ones. Our findings support the notion that benefits from infrastructure investments accrue in particular to peripheral regions, which gain access to a large pool of qualified workers with a preference for urban life. We find that the introduction of high-speed trains led to a modal shift towards rail transportation in particular on medium distances between 150 and 400 kilometers."

With the dataset ready, we’ll now use tidyllm to generate embeddings and perform a semantic search for papers similar to the target_abstract, which comes from a paper on commuting and the expansion of high-speed rail by Heuermann and Schmieder (2018). For this task, we’ll use the mxbai-embed-large model. According to its developer, Mixedbread.ai, this model achieves state-of-the-art performance among efficiently sized models and outperforms closed-source models like OpenAI’s text-embedding-ada-002. If you have Ollama installed, you can download it using ollama_download_model("mxbai-embed-large"). Choose your embedding model carefully upfront: each model produces its own numerical representation of text, so embeddings from different models are not interchangeable.

Embedding APIs are also available for mistral(), gemini(), and openai() (as well as azure_openai()).

Step 1: Computing Embeddings for one abstract

To compute an embedding of the target abstract, we use the embed() function with ollama() as the provider function:

target_tbl <- target_abstract |>
  embed(ollama,.model="mxbai-embed-large:latest")

target_tbl
str(target_tbl)

## # A tibble: 1 × 2
##   input                                           embeddings
##   <chr>                                           <list>    
## 1 We use the expansion of the high-speed rail ne… <dbl>

The embed() function returns a tibble with two columns:

input: The original text provided for embedding.
embeddings: A list column containing the numerical vector representation (embedding) for each input text.

In our case, we have a single input, and the embeddings column contains a 1,024-dimensional vector for this input:

str(target_tbl$embeddings)
## List of 1
##  $ : num [1:1024] -0.838 0.669 0.202 -0.686 -0.985 ...

Step 2: Computing Embeddings for the entire abstract corpus

When working with a large corpus like our 22,960 abstracts, embedding all entries in a single call to embed() is impractical and often leads to errors. Commercial APIs typically impose strict limits on the number of inputs allowed per request (usually caps of 50 or 100 inputs). For local APIs, resource constraints such as memory and processing power impose similar restrictions.

To efficiently handle this, we batch the data, processing a manageable number of abstracts at a time. The generate_abstract_embeddings() function below takes a vector of abstracts as input and divides them into manageable batches of 200. For each batch, it uses the embed() function to compute embeddings via ollama(). Progress is logged to the console to keep track of batch completion and provide a clear view of the process.

Since long-running processes are prone to interruptions, such as network timeouts or unexpected system errors, the function saves the results of each batch to disk as .rds files (consider arrow::write_parquet() or a database for really big workloads). On a MacBook Pro with an M1 Pro processor, this function completes embedding the entire dataset in approximately 25 minutes and writes 207 MB of data to disk. The time may vary depending on system specifications and batch size. To compare multiple target abstracts against the entire collection, we only need to embed the collection once.

#Our batched embedding function
generate_abstract_embeddings <- function(abstracts){
  
  #Make sure the output folder exists before writing to it
  dir.create(here("embedded_asbtracts"), showWarnings = FALSE)
  
  #Prepare abstract batches of 200 (rows 1-200 form batch 1, and so on)
  embedding_batches <- tibble(abstract = abstracts) |>
    group_by(batch = (row_number() - 1) %/% 200 + 1) |>
    group_split()
  
  #Report how many batches will be processed
  n_batches <- length(embedding_batches)
  glue::glue("Processing {n_batches} batches of 200 abstracts") |> cat("\n")
  
  #Embed the batches via ollama mxbai-embed-large 
  embedding_batches |>
    walk(~{
      batch_number <- pull(.x,batch) |> unique() 
      glue::glue("Generate Text Embeddings for Abstract Batch: {batch_number}/{n_batches}") |> cat("\n")
      
      #Embed the current batch and save the result to disk right away
      .x$abstract |>
        embed(ollama,.model="mxbai-embed-large:latest") |>
        write_rds(here("embedded_asbtracts",paste0(batch_number,".rds")))
    })
}

#Run the function over all abstracts
abstracts |>
  pull(abstract) |>
  generate_abstract_embeddings()

## Processing 115 batches of 200 abstracts
## Generate Text Embeddings for Abstract Batch: 1/115
## Generate Text Embeddings for Abstract Batch: 2/115 
## Generate Text Embeddings for Abstract Batch: 3/115
## ...
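
Because the results are written batch by batch, an interrupted run can in principle pick up where it stopped. The following is a minimal base-R sketch of that idea, not part of tidyllm: make_batches(), embed_in_batches() and fake_embed() are hypothetical helpers introduced here, and fake_embed() merely stands in for a real embedding call so that the example runs without Ollama.

```r
# Split a character vector into consecutive batches of at most `batch_size`
make_batches <- function(texts, batch_size = 200) {
  split(texts, ceiling(seq_along(texts) / batch_size))
}

# Process the batches one by one and skip every batch whose result file
# already exists. `embed_fun` is a placeholder for the real embedding call.
embed_in_batches <- function(texts, embed_fun, out_dir, batch_size = 200) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  batches <- make_batches(texts, batch_size)
  for (i in seq_along(batches)) {
    out_file <- file.path(out_dir, paste0(i, ".rds"))
    if (file.exists(out_file)) next  # finished in an earlier run
    saveRDS(embed_fun(batches[[i]]), out_file)
  }
  invisible(length(batches))
}

# Stand-in for embed(): returns string lengths instead of real embeddings
fake_embed <- function(x) data.frame(input = x, n_chars = nchar(x))

# 450 dummy texts are split into batches of 200, 200 and 50
demo_dir <- file.path(tempdir(), "embedding_batches_demo")
n_done <- embed_in_batches(sprintf("abstract %d", 1:450), fake_embed, demo_dir)
```

To use this with real embeddings, fake_embed could be swapped for a wrapper such as function(x) embed(x, ollama, .model = "mxbai-embed-large:latest"), following the call pattern used above.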

After the function has finished, we only need to load the computed embeddings:

embedded_asbtracts <- here("embedded_asbtracts") |>
  dir() |>
  map_dfr(~read_rds(here("embedded_asbtracts",.x)))

Step 3: Performing the Semantic Search

With the embeddings precomputed, we can now perform a semantic search to find abstracts most similar to our target. For the search we will use cosine similarity to compare embedding vectors. Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them. It ranges from -1 (opposite directions) to 1 (identical directions). The formula is:

$\text{cosine_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\sum_{i=1}^n a_i b_i}{\sqrt{\sum_{i=1}^n a_i^2} \cdot \sqrt{\sum_{i=1}^n b_i^2}}$

Where:

$\sum a_i b_i$ is the dot product.
$\sqrt{\sum a_i^2}$ and $\sqrt{\sum b_i^2}$ are the magnitudes of the vectors.

Vectors in an embedding space represent semantic meanings of texts. In this space:

Cosine similarity focuses on direction rather than magnitude.
Two vectors pointing in similar directions (small angle) will have a cosine similarity close to 1.
Vectors at 90 degrees (orthogonal, no semantic overlap) have a cosine similarity of 0.
Vectors pointing in opposite directions (large angle, entirely dissimilar) have a cosine similarity of -1.

With the model we use, each embedding vector represents a point in a space with 1,024 dimensions. Even though these dimensions cannot be pictured like 2D or 3D space, the underlying principle still holds: cosine similarity measures how much two vectors “lean” in the same direction. Note that cosine similarity is not a percentage of shared direction: a value of 0.25 corresponds to an angle of about 75.5 degrees, so the two vectors are only weakly aligned, while a value of 0.9 corresponds to an angle of about 26 degrees.
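
To relate a cosine similarity to the angle between two vectors, the inverse cosine can be used. The helper cosine_to_degrees() below is a small illustrative function introduced for this example, not part of tidyllm.

```r
# Convert a cosine similarity into the corresponding angle in degrees
cosine_to_degrees <- function(cos_sim) acos(cos_sim) * 180 / pi

# Angles for a few example similarity values
angles <- cosine_to_degrees(c(1, 0.9, 0.25, 0, -1))
round(angles, 1)
```

The mapping is not linear: similarity values close to 1 already correspond to noticeably different angles, which is worth keeping in mind when reading similarity scores.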

To compute cosine similarity between two vectors we express it in a simple function:

cosine_similarity <- function(a, b) {
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
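
A quick sanity check on small hand-made vectors (chosen only for illustration) shows the properties listed above. The function is repeated here so that the snippet runs on its own.

```r
cosine_similarity <- function(a, b) {
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

a <- c(1, 2, 3)

sim_same       <- cosine_similarity(a, a)               # same direction
sim_scaled     <- cosine_similarity(a, 10 * a)          # same direction, longer vector
sim_orthogonal <- cosine_similarity(c(1, 0), c(0, 1))   # right angle
sim_opposite   <- cosine_similarity(a, -a)              # opposite direction

c(sim_same, sim_scaled, sim_orthogonal, sim_opposite)
```

Scaling a vector leaves the result unchanged, which is what "direction rather than magnitude" means in practice.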

We apply this function to compute the cosine similarity of abstracts in the corpus to the target paper to find the 10 most similar abstracts:

top10_similar <- embedded_asbtracts %>%
  mutate(cosine_sim = map_dbl(embeddings, 
                              ~cosine_similarity(.x, 
                                                 target_tbl$embeddings[[1]])))|>
  arrange(desc(cosine_sim)) |>
  slice(1:10) |> 
  rename(abstract=input) |>
  left_join(abstracts |>
              select(year,authors,journal,abstract), by="abstract") |>
  select(-embeddings)
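
For larger corpora, calling the similarity function once per row can become slow. One possible alternative, sketched here in base R on a small simulated corpus, is to stack the embeddings into a matrix and compute all similarities with a single matrix product. The data, the column names and the 4-dimensional vectors are assumptions made for this example only.

```r
# Hypothetical corpus: 5 documents with simulated 4-dimensional embeddings
set.seed(42)
corpus <- data.frame(id = paste0("doc", 1:5))
corpus$embeddings <- replicate(5, rnorm(4), simplify = FALSE)

# Target vector: same direction as doc3, but twice as long
target <- corpus$embeddings[[3]] * 2

# Stack the list column into a documents x dimensions matrix
emb_matrix <- do.call(rbind, corpus$embeddings)

# Cosine similarity of every row with the target in one step:
# dot products divided by the product of the vector lengths
cosine_sims <- as.vector(emb_matrix %*% target) /
  (sqrt(rowSums(emb_matrix^2)) * sqrt(sum(target^2)))

# Rank the documents by similarity to the target
corpus$cosine_sim <- cosine_sims
corpus[order(-corpus$cosine_sim), c("id", "cosine_sim")]
```

Since the target points in the same direction as doc3, that document is ranked first with a similarity of 1.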

The top ten most similar articles in the corpus based on our search are these:

Top 10 Most Similar Abstracts
abstract	cosine_sim	year	authors	journal
We use the expansion of the high-speed rail (HSR) network in Germany as a natural experiment to examine the causal effect of reductions in commuting time between regions on the commuting decisions of workers and their choices regarding where to live and where to work. We exploit three key features in this setting: (i) investment in HSR has, in some cases dramatically, reduced travel times between regions, (ii) several small towns were connected to the HSR network only for political reasons, and (iii) high-speed trains have left the transportation of goods unaffected. Combining novel information on train schedules and the opening of HSR stations with panel data on all workers in Germany, we show that a reduction in travel time by 1% raises the number of commuters between regions by 0.25%. This effect is mainly driven by workers changing jobs to smaller cities while keeping their place of residence in larger ones. Our findings support the notion that benefits from infrastructure investments accrue in particular to peripheral regions, which gain access to a large pool of qualified workers with a preference for urban life. We find that the introduction of high-speed trains led to a modal shift toward rail transportation in particular on medium distances between 150 and 400 km.	0.9935939	2019	Heuermann, Daniel; Schmieder, Johannes	Journal of Economic Geography
We analyze the economic impact of the German high-speed rail (HSR) connecting Cologne and Frankfurt, which provides plausibly exogenous variation in access to surrounding economic mass. We find a causal effect of about 8.5% on average of the HSR on the GDP of three counties with intermediate stops. We make further use of the variation in bilateral transport costs between all counties in our study area induced by the HSR to identify the strength and spatial scope of agglomeration forces. Our most careful estimate points to an elasticity of output with respect to market potential of 12.5%. The strength of the spillover declines by 50% every 30 min of travel time, diminishing to 1% after about 200 min. Our results further imply an elasticity of per-worker output with respect to economic density of 3.8%, although the effects seem driven by worker and firm selection.	0.8984364	2018	Ahlfeldt, Gabriel; Feddersen, Arne	Journal of Economic Geography
We investigate whether localities gain or lose employment when there are connected to a transportation network, such as a high-speed railway line. We argue that long-haul economies—implying that the marginal transportation cost decreases with network distance—play a pivotal role in understanding the location choices of firms. We develop a new spatial model to show that improvements in transportation infrastructure have nontrivial impacts on the location choices of firms. Using data on Japan’s Shinkansen, we show that ‘in-between’ municipalities that are connected to the Shinkansen witness a sizable decrease in employment.	0.8691969	2022	Koster, Hans; Tabuchi, Takatoshi; Thisse, Jacques	Journal of Economic Geography
How does intercity passenger transportation shape urban employment and specialization patterns? To shed light on this question I study China’s High Speed Railway (HSR), an unprecedentedly large-scale network that connected 81 cities from 2003 to 2014 with trains running at speeds over 200 km/h. Using a difference-in-differences approach, I find that an HSR connection increases city-wide passenger flows by 10% and employment by 7%. To deal with the issues of endogenous railway placement and simultaneous public investments accompanying HSR connection, I examine the impact of a city’s market access changes purely driven by the HSR connection of other cities. The estimates suggest that HSR-induced expansion in market access increases urban employment with an elasticity between 2 and 2.5. Further evidence on sectoral employment suggests that industries with a higher reliance on nonroutine cognitive skills benefit more from HSR-induced market access to other cities.	0.8684935	2017	Lin, Yatang	Journal of Urban Economics
We use the natural experiment provided by the opening and progressive extension of the Regional Express Rail (RER) between 1970 and 2000 in the Paris metropolitan region, and in particular the departure from the original plans due to budget constraints and technical considerations, to identify the causal impact of urban rail transport on firm location, employment and population growth. We apply a difference-in-differences method to a particular subsample, selected to minimize the endogeneity that is routinely found in the evaluation of the effects of transport infrastructure. We find that the RER opening caused a 8.8% rise in employment in the municipalities connected to the network between 1975 and 1990. While we find no effect on overall population growth, our results suggest that the arrival of the RER may have increased competition for land, since high-skilled households were more likely to locate in the vicinity of a RER station.	0.8664813	2017	Mayer, Thierry; Trevien, Corentin	Journal of Urban Economics
Infrastructure investment may reshape economic activities. In this article, I examine the distributional impacts of high-speed rail upgrades in China, which have improved passengers’ access to high-speed train services in the city nodes but have left the peripheral counties along the upgraded railway lines bypassed by the services. By exploiting the quasi-experimental variation in whether counties were affected by this project, my analysis suggests that the affected counties on the upgraded railway lines experienced reductions in GDP and GDP per capita following the upgrade, which was largely driven by the concurrent drop in fixed asset investments. This article provides the first empirical evidence on how transportation costs of people affect urban peripheral patterns.	0.8579637	2017	Qin, Yu	Journal of Economic Geography
Many US cities have made large investments in light rail transit in order to improve commuting networks. I analyse the labour market effects of light rail in four US metros. I propose a new instrumental variable to overcome endogeneity in transit station location, enabling causal identification of neighbourhood effects. Light rail stations are found to drastically improve employment outcomes in the surrounding neighbourhood. To incorporate endogenous sorting by workers, I estimate a structural neighbourhood choice model. Light rail systems tend to raise rents in accessible locations, displacing lower skilled workers to isolated neighbourhoods, which reduces aggregate metropolitan employment in equilibrium.	0.8559837	2021	Tyndall, Justin	Journal of Urban Economics
I study Los Angeles Metro Rail’s effects using panel data on bilateral commuting flows, a quantitative spatial model, and historically motivated quasi-experimental research designs. The model separates transit’s commuting effects from local productivity or amenity effects, and spatial shift-share instruments identify inelastic labor and housing supply. Metro Rail connections increase commuting by 16% but do not have large effects on local productivity or amenities. Metro Rail generates $94 million in annual benefits by 2000 or 12–25% of annualized costs. Accounting for reduced congestion and slow transit adoption adds, at most, another $200 million in annual benefits.	0.8432086	2023	Severen, Christopher	The Review of Economics and Statistics
We examine the effect of commuting distance on workers’ labour supply patterns, distinguishing between weekly labour supply, number of workdays per week and daily labour supply. We account for endogeneity of distance by using employer-induced changes in distance. In Germany, distance has a slight positive effect on daily and weekly labour supply, but no effect on the number of workdays. The effect of distance on labour supply patterns is stronger for female workers, but it is still small.	0.8416284	2010	Gutiérrez-i-Puigarnau, Eva; van Ommeren, Jos	Journal of Urban Economics
We estimate the causal impact of wage variations on commuting distance of workers. We test whether higher wages across years lead workers to live further away from their working place. We use employer–employee data for the French Ile-de-France region (surrounding Paris), from 2003 to 2008, and we deal with the endogenous relation between income and commuting using an instrumental variable strategy. We estimate that increases in wages coming from exogenous exposure to trade activities lead workers to increase their commuting distance and to settle closer to the city of Paris historical center. Our results cast novel insights upon the causal mechanisms from wage to spatial allocation of workers.	0.8367841	2022	Aboulkacem, El-Mehdi; Nedoncelle, Clément	Journal of Economic Geography

Unsurprisingly, the top result is the target abstract itself, presented with slight formatting differences. The remaining top results align closely with the target’s thematic focus on the economic and social impacts of transportation infrastructure, and of high-speed rail in particular. The second result discusses the economic effects of German high-speed rail (HSR), while the third and fourth focus on Japan’s Shinkansen and China’s HSR network, respectively. These findings illustrate the model’s ability to identify semantic connections across different contexts rather than relying on shared keywords.

Sidenote: Multimodal Embeddings

While most embedding functions work only on text, the Voyage.ai functions in tidyllm allow embedding both text and images in the same space. This feature is particularly useful for cross-modal search, where text descriptions and images need to be compared on a semantic level.

With tidyllm, you can use the img() function to create image objects and mix them with text in a list. When passing such a list to voyage_embedding(), the function automatically switches to Voyage’s multimodal API. Suppose we want to compare a textual description of an object with an image embedding to see if they align in meaning.

# Define a text description and an image of the same object
list("a dish consisting of a sausage served in the slit of a partially sliced bun", img(here("hotdog.jpg"))) |>
 embed(voyage)

## # A tibble: 2 × 2
##   input                                                               embeddings
##   <chr>                                                               <list>    
## 1 A dish consisting of a sausage served in the slit of a partially s… <dbl>     
## 2 [IMG] hotdog.jpg                                                    <dbl>

Since the text description and the image refer to the same concept (a hotdog), their embeddings should be close together in the high-dimensional space, enabling similarity-based retrieval and comparison.

Outlook: Clustering and Beyond

While this article focused on semantic search, embeddings open the door to a wide range of advanced analytical techniques.

Here are some potential further use-cases for embeddings:

Clustering for Topic Discovery: Embeddings can be leveraged with unsupervised learning methods like K-means or hierarchical clustering to automatically group text into topics or themes. This approach is particularly beneficial for exploratory research, enabling users to uncover hidden patterns and structures within large corpora without predefined labels.
Dimensionality Reduction and Visualization: High-dimensional embedding vectors often need simplification for human interpretation. Techniques like Principal Component Analysis (PCA), t-SNE, or UMAP allow us to project embeddings into lower-dimensional spaces (e.g., 2D or 3D). These visualizations can reveal clusters, trends, and outliers, providing insights at a glance.
Retrieval-Augmented Generation (RAG): Embeddings play a crucial role in retrieval-augmented generation, a technique that enhances large language models (LLMs). In this workflow, an embedding search retrieves semantically similar documents from a corpus, which are then appended to the LLM’s prompt. This process helps the model generate contextually rich and accurate responses, especially in knowledge-intensive tasks.
Supporting Workflows in Qualitative Research: Embeddings can also transform workflows in qualitative research, as discussed in this paper by Kugler et al. (2023). The integration of Natural Language Processing (NLP) tools enables researchers to automate coding steps that traditionally require manual effort. Specifically, embeddings can assist in categorizing text data according to predefined themes, making research workflows more efficient and transparent. While these models bring significant advantages, challenges remain. The study highlights that off-the-shelf language models often struggle to discern implicit references and closely related topics as effectively as human researchers. However, more modern embedding models than the ones used in the study might help to deal with some of these challenges.
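
As a rough sketch of the first two use-cases, the example below runs k-means and PCA from base R on a simulated matrix that stands in for real embeddings. The three artificial "topics", the 16 dimensions and all object names are assumptions made for illustration. With real data, the matrix could be built from the embeddings list column, for example with do.call(rbind, ...).

```r
set.seed(123)

# Simulated stand-in for an embedding matrix: 60 "documents" in 16 dimensions,
# drawn around three different centres to mimic three topics
centres <- matrix(rnorm(3 * 16, sd = 3), nrow = 3)
topic   <- rep(1:3, each = 20)
emb     <- centres[topic, ] + matrix(rnorm(60 * 16), nrow = 60)

# Scale every row to unit length. For unit vectors, squared Euclidean
# distance equals 2 - 2 * cosine similarity, so Euclidean k-means then
# groups documents in a way that is consistent with cosine similarity.
emb_unit <- emb / sqrt(rowSums(emb^2))

# K-means clustering with three clusters and several random starts
clusters <- kmeans(emb_unit, centers = 3, nstart = 10)

# PCA projection onto the first two principal components for plotting
pca    <- prcomp(emb_unit)
coords <- pca$x[, 1:2]

# Compare the simulated topics with the clusters that were found
table(topic, cluster = clusters$cluster)
# plot(coords, col = clusters$cluster, pch = 19)
```

The number of clusters is fixed at three here because the data were simulated that way. For a real corpus it would have to be chosen, for instance by comparing several values.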