⚠️ Note: This article refers to the development version 0.2.5 of tidyllm, which has a changed interface compared to the last CRAN release. You can install the current development version directly from GitHub using devtools:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install tidyllm from GitHub
devtools::install_github("edubruell/tidyllm")
While most tidyllm use cases revolve around chat models, the package also supports embedding models, another type of large language model. These models generate numerical vectors that map input text to points in a high-dimensional space, where each point represents the semantic meaning of the text:
- Similar meanings are close together: For example, a text about a “cat” and another about a “kitten” would be mapped to nearby points.
- Different meanings are farther apart: Conversely, a text about a “cat” and one about a “car” would have points that are much farther apart.
Semantic Search in Economics Paper Abstracts
To demonstrate embeddings in action, we’ll implement a semantic search on a dataset of 22,960 economics paper abstracts published between 2010 and 2024. Instead of relying on keyword matching, we’ll use embeddings to find papers with similar topics based on their underlying meaning.
Let’s start by loading and exploring the data:
library(tidyverse)
library(tidyllm)
library(here)
abstracts <- read_rds("abstracts_data.rds")
#The structure of our file:
print(abstracts,width = 60)
## # A tibble: 22,960 × 8
## year journal authors volume firstpage lastpage abstract
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2024 Journal… Bauer,… 22 2075 2107 This pa…
## 2 2019 The Rev… Karlan… 86 1704 1746 We use …
## 3 2022 Journal… Corset… 20 513 548 We stud…
## 4 2018 The Rev… Anagol… 85 1971 2004 We stud…
## 5 2024 America… Thores… 16 447 79 This pa…
## 6 2024 Journal… Ren, Y… 238 NA NA The rap…
## 7 2013 The Rev… Adhvar… 95 725 740 A key p…
## 8 2022 Econome… Brooks… 90 2187 2214 If expe…
## 9 2011 Health … Fletch… 20 553 570 We exam…
## 10 2010 Journal… Rohwed… 24 119 38 Early r…
## # ℹ 22,950 more rows
## # ℹ 1 more variable: pdf_link <chr>
target_abstract <- "We use the expansion of the high-speed rail network in Germany as a natural experiment to examine the causal effect of reductions in commuting time betweenregions on the commuting decisions of workers and their choices regarding where tolive and where to work. We exploit three key features in this setting:i) investmentin high-speed rail has, in some cases dramatically, reduced travel times between regions,ii) several small towns were connected to the high-speed rail network onlyfor political reasons, and iii) high-speed trains have left the transportation of goodsunaffected. Combining novel information on train schedules and the opening ofhigh-speed rail stations with panel data on all workers in Germany, we show that a reduction in travel time by one percent raises the number of commuters betweenregions by 0.25 percent. This effect is mainly driven by workers changing jobs to smaller cities while keeping their place of residence in larger ones. Our findings support the notion that benefits from infrastructure investments accrue in particular to peripheral regions, which gain access to a large pool of qualified workers with a preference for urban life. We find that the introduction of high-speed trains led to a modal shift towards rail transportation in particular on medium distances between 150 and 400 kilometers."
With the dataset ready, we’ll now use tidyllm to generate embeddings and perform a semantic search to find papers that are similar to the target_abstract from a paper on commuting and the expansion of high-speed rail by Heuermann and Schmieder (2018). For this task, we’ll use the mxbai-embed-large model. This model, developed by Mixedbread.ai, achieves state-of-the-art performance among efficiently sized models and outperforms closed-source models like OpenAI’s text-embedding-ada-002. If you have Ollama installed, you can download it using ollama_download_model("mxbai-embed-large"). It’s important to choose your embedding model carefully upfront, as each model produces unique numerical representations of text that are not interchangeable between models.
Alternatively, embedding APIs are also available for mistral(), gemini(), and openai() (as well as azure_openai()).
Step 1: Computing Embeddings for one abstract
To compute an embedding of the target abstract, we use the embed() function with ollama() as the provider function:
target_tbl <- target_abstract |>
embed(ollama,.model="mxbai-embed-large:latest")
target_tbl
str(target_tbl)
## # A tibble: 1 × 2
## input embeddings
## <chr> <list>
## 1 We use the expansion of the high-speed rail ne… <dbl>
The embed() function returns a tibble with two columns:
- input: The original text provided for embedding.
- embeddings: A list column containing the numerical vector representation (embedding) for each input text.
In our case, we have a single input, and the embeddings column contains a 1,024-dimensional vector for this input:
str(target_tbl$embeddings)
## List of 1
## $ : num [1:1024] -0.838 0.669 0.202 -0.686 -0.985 ...
Step 2: Computing Embeddings for the entire abstract corpus
When working with a large corpus like our 22,960 abstracts, embedding all entries in a single pass using embed() is impractical and often leads to errors. Commercial APIs typically impose strict limits on the number of inputs allowed per request (caps of 50 or 100 inputs are common). For local APIs, resource constraints such as memory and processing power impose similar restrictions.
To handle this efficiently, we batch the data, processing a manageable number of abstracts at a time. The generate_abstract_embeddings() function below takes a vector of abstracts as input and divides it into batches of 200. For each batch, it uses the embed() function to compute embeddings via ollama(). Progress is logged to the console to keep track of batch completion.
Since long-running processes are prone to interruptions, such as network timeouts or unexpected system errors, the function saves each batch’s results to disk as .rds files (consider arrow::write_parquet() or a database for really big workloads). On a MacBook Pro with an M1 Pro processor, this function completes embedding the entire dataset in approximately 25 minutes and writes 207 MB of data to disk. The time may vary depending on system specifications and batch size. To compare multiple target abstracts against the entire collection, we only need to embed the corpus once.
# Our batched embedding function
generate_abstract_embeddings <- function(abstracts){
  # Prepare abstract batches: rows 1-200 form batch 1, rows 201-400 batch 2, ...
  embedding_batches <- tibble(abstract = abstracts) |>
    group_by(batch = (row_number() - 1) %/% 200 + 1) |>
    group_split()
  # Work with batches of 200 abstracts
  n_batches <- length(embedding_batches)
  # glue is not attached by library(tidyverse), so it is called with its namespace
  glue::glue("Processing {n_batches} batches of 200 abstracts") |> cat("\n")
  # Embed the batches via ollama mxbai-embed-large and save each result to disk
  embedding_batches %>%
    walk(~{
      batch_number <- pull(.x, batch) |> unique()
      glue::glue("Generate Text Embeddings for Abstract Batch: {batch_number}/{n_batches}") |> cat("\n")
      .x$abstract %>%
        embed(ollama, .model = "mxbai-embed-large:latest") |>
        write_rds(here("embedded_asbtracts", paste0(batch_number, ".rds")))
    })
}
#Run the function over all abstracts
abstracts |>
pull(abstract) |>
generate_abstract_embeddings()
## Processing 115 batches of 200 abstracts
## Generate Text Embeddings for Abstract Batch: 1/115
## Generate Text Embeddings for Abstract Batch: 2/115
## Generate Text Embeddings for Abstract Batch: 3/115
## ...
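Because each batch is written to its own file, an interrupted run can in principle be resumed by checking which batch files are still missing. A minimal sketch, assuming the numbered file naming used above (1.rds, 2.rds, ...); missing_batches() is a hypothetical helper and not part of tidyllm:

```r
# Hypothetical helper: which batch numbers still lack a result file in `dir`?
missing_batches <- function(n_batches, dir) {
  done <- list.files(dir, pattern = "^[0-9]+\\.rds$")
  done_ids <- as.integer(sub("\\.rds$", "", done))
  setdiff(seq_len(n_batches), done_ids)
}

# Demo in a temporary folder: pretend batches 1 and 2 finished before a crash
demo_dir <- file.path(tempdir(), "embedded_demo")
dir.create(demo_dir, showWarnings = FALSE)
saveRDS(list(), file.path(demo_dir, "1.rds"))
saveRDS(list(), file.path(demo_dir, "2.rds"))

missing_batches(4, demo_dir)  # batches 3 and 4 remain
```

Only the returned batch numbers would then need to be embedded again.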
After the function has finished, we only need to load the computed embeddings:
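One way to do this is sketched below. It assumes the batch files sit in the embedded_asbtracts folder written by the function above; load_embedded_batches() is a hypothetical helper name, and the result is stored as embedded_asbtracts, the object used in the search step:

```r
library(tidyverse)
library(here)

# Hypothetical helper: read every .rds batch file in `dir` and stack the tibbles
load_embedded_batches <- function(dir) {
  list.files(dir, pattern = "\\.rds$", full.names = TRUE) |>
    map(read_rds) |>
    bind_rows()
}

# Files are read in alphabetical order (1, 10, 100, ...), which does not
# matter here because the search only ranks rows by similarity
embedded_asbtracts <- load_embedded_batches(here("embedded_asbtracts"))
```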
Step 3: Performing the Semantic Search
With the embeddings precomputed, we can now perform a semantic search to find abstracts most similar to our target. For the search, we will use cosine similarity to compare embedding vectors. Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them. It ranges from -1 (opposite directions) to 1 (identical directions). The formula is:
$$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
Where:
- $A \cdot B$ is the dot product of the two vectors.
- $\|A\|$ and $\|B\|$ are the magnitudes of the vectors.
Vectors in an embedding space represent semantic meanings of texts. In this space:
- Cosine similarity focuses on direction rather than magnitude.
- Two vectors pointing in similar directions (small angle) will have a cosine similarity close to 1.
- Vectors at 90 degrees (orthogonal, no semantic overlap) have a cosine similarity of 0.
- Vectors pointing in opposite directions (large angle, entirely dissimilar) have a cosine similarity of -1.
With the model we use, each embedding vector represents a point in a space with 1,024 dimensions. Even though these dimensions are not spatially interpretable like in 2D or 3D, the underlying principle still holds: cosine similarity measures how much two vectors “lean” in the same direction. A cosine similarity of 0.25, for example, corresponds to an angle of about 75.5 degrees, so the two vectors are only weakly aligned; the value is the cosine of the angle, not a percentage of shared direction.
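As a quick check of how cosine values translate into angles, the inverse cosine can be evaluated directly in base R. The values below are arbitrary illustrations:

```r
# Convert a few cosine similarity values to angles in degrees
cos_values <- c(1, 0.9, 0.25, 0, -1)
angles_deg <- acos(cos_values) * 180 / pi
round(angles_deg, 1)
# 0.0 25.8 75.5 90.0 180.0
```

Values near 1 correspond to small angles, while values near 0 approach a right angle.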
To compute cosine similarity between two vectors we express it in a simple function:
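A minimal sketch of such a function, written from the definition of cosine similarity (dot product divided by the product of the magnitudes). The name cosine_similarity matches the call used in the search step; the body is an assumption about how it could be implemented:

```r
# Cosine similarity: dot product divided by the product of the vector norms
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Sanity checks on simple 2D vectors
cosine_similarity(c(1, 0), c(2, 0))   # same direction: 1
cosine_similarity(c(1, 0), c(0, 3))   # orthogonal: 0
cosine_similarity(c(1, 0), c(-1, 0))  # opposite direction: -1
```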
We apply this function to compute the cosine similarity of abstracts in the corpus to the target paper to find the 10 most similar abstracts:
top10_similar <- embedded_asbtracts %>%
  mutate(cosine_sim = map_dbl(embeddings,
                              ~cosine_similarity(.x, target_tbl$embeddings[[1]]))) |>
  arrange(desc(cosine_sim)) |>
  slice(1:10) |>
  rename(abstract = input) |>
  left_join(abstracts |>
              select(year, authors, journal, abstract), by = "abstract") |>
  select(-embeddings)
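For larger corpora, the row-by-row map_dbl() call could be replaced by a single matrix product. The sketch below uses made-up toy vectors; emb_list and target are illustrative names standing in for the embeddings list column and the target embedding:

```r
# Toy stand-ins for the real data: 5 "documents" with 4-dimensional embeddings
set.seed(1)
emb_list <- replicate(5, rnorm(4), simplify = FALSE)
target   <- emb_list[[3]]  # pretend document 3 is the query

# Stack the list of vectors into a matrix with one row per document
emb_matrix <- do.call(rbind, emb_list)

# Dot products with the target, divided by the product of the norms
sims <- as.vector(emb_matrix %*% target) /
  (sqrt(rowSums(emb_matrix^2)) * sqrt(sum(target^2)))

# Document indices ordered from most to least similar
top_idx <- order(sims, decreasing = TRUE)
top_idx[1]  # document 3 matches itself best
```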
The top ten most similar articles in the corpus based on our search are these:
read_rds("top10_asbtracts.rds") |>
mutate(abstract = stringr::str_wrap(abstract, width = 60)) |>
kableExtra::kable("html", caption = "Top 10 Most Similar Abstracts") |>
kableExtra::kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
abstract | cosine_sim | year | authors | journal |
---|---|---|---|---|
We use the expansion of the high-speed rail (HSR) network in Germany as a natural experiment to examine the causal effect of reductions in commuting time between regions on the commuting decisions of workers and their choices regarding where to live and where to work. We exploit three key features in this setting: (i) investment in HSR has, in some cases dramatically, reduced travel times between regions, (ii) several small towns were connected to the HSR network only for political reasons, and (iii) high-speed trains have left the transportation of goods unaffected. Combining novel information on train schedules and the opening of HSR stations with panel data on all workers in Germany, we show that a reduction in travel time by 1% raises the number of commuters between regions by 0.25%. This effect is mainly driven by workers changing jobs to smaller cities while keeping their place of residence in larger ones. Our findings support the notion that benefits from infrastructure investments accrue in particular to peripheral regions, which gain access to a large pool of qualified workers with a preference for urban life. We find that the introduction of high-speed trains led to a modal shift toward rail transportation in particular on medium distances between 150 and 400 km. | 0.9935939 | 2019 | Heuermann, Daniel; Schmieder, Johannes | Journal of Economic Geography |
We analyze the economic impact of the German high-speed rail (HSR) connecting Cologne and Frankfurt, which provides plausibly exogenous variation in access to surrounding economic mass. We find a causal effect of about 8.5% on average of the HSR on the GDP of three counties with intermediate stops. We make further use of the variation in bilateral transport costs between all counties in our study area induced by the HSR to identify the strength and spatial scope of agglomeration forces. Our most careful estimate points to an elasticity of output with respect to market potential of 12.5%. The strength of the spillover declines by 50% every 30 min of travel time, diminishing to 1% after about 200 min. Our results further imply an elasticity of per-worker output with respect to economic density of 3.8%, although the effects seem driven by worker and firm selection. | 0.8984364 | 2018 | Ahlfeldt, Gabriel; Feddersen, Arne | Journal of Economic Geography |
We investigate whether localities gain or lose employment when there are connected to a transportation network, such as a high-speed railway line. We argue that long-haul economies—implying that the marginal transportation cost decreases with network distance—play a pivotal role in understanding the location choices of firms. We develop a new spatial model to show that improvements in transportation infrastructure have nontrivial impacts on the location choices of firms. Using data on Japan’s Shinkansen, we show that ‘in-between’ municipalities that are connected to the Shinkansen witness a sizable decrease in employment. | 0.8691969 | 2022 | Koster, Hans; Tabuchi, Takatoshi; Thisse, Jacques | Journal of Economic Geography |
How does intercity passenger transportation shape urban employment and specialization patterns? To shed light on this question I study China’s High Speed Railway (HSR), an unprecedentedly large-scale network that connected 81 cities from 2003 to 2014 with trains running at speeds over 200 km/h. Using a difference-in-differences approach, I find that an HSR connection increases city-wide passenger flows by 10% and employment by 7%. To deal with the issues of endogenous railway placement and simultaneous public investments accompanying HSR connection, I examine the impact of a city’s market access changes purely driven by the HSR connection of other cities. The estimates suggest that HSR-induced expansion in market access increases urban employment with an elasticity between 2 and 2.5. Further evidence on sectoral employment suggests that industries with a higher reliance on nonroutine cognitive skills benefit more from HSR-induced market access to other cities. | 0.8684935 | 2017 | Lin, Yatang | Journal of Urban Economics |
We use the natural experiment provided by the opening and progressive extension of the Regional Express Rail (RER) between 1970 and 2000 in the Paris metropolitan region, and in particular the departure from the original plans due to budget constraints and technical considerations, to identify the causal impact of urban rail transport on firm location, employment and population growth. We apply a difference-in-differences method to a particular subsample, selected to minimize the endogeneity that is routinely found in the evaluation of the effects of transport infrastructure. We find that the RER opening caused a 8.8% rise in employment in the municipalities connected to the network between 1975 and 1990. While we find no effect on overall population growth, our results suggest that the arrival of the RER may have increased competition for land, since high-skilled households were more likely to locate in the vicinity of a RER station. | 0.8664813 | 2017 | Mayer, Thierry; Trevien, Corentin | Journal of Urban Economics |
Infrastructure investment may reshape economic activities. In this article, I examine the distributional impacts of high-speed rail upgrades in China, which have improved passengers’ access to high-speed train services in the city nodes but have left the peripheral counties along the upgraded railway lines bypassed by the services. By exploiting the quasi-experimental variation in whether counties were affected by this project, my analysis suggests that the affected counties on the upgraded railway lines experienced reductions in GDP and GDP per capita following the upgrade, which was largely driven by the concurrent drop in fixed asset investments. This article provides the first empirical evidence on how transportation costs of people affect urban peripheral patterns. | 0.8579637 | 2017 | Qin, Yu | Journal of Economic Geography |
Many US cities have made large investments in light rail transit in order to improve commuting networks. I analyse the labour market effects of light rail in four US metros. I propose a new instrumental variable to overcome endogeneity in transit station location, enabling causal identification of neighbourhood effects. Light rail stations are found to drastically improve employment outcomes in the surrounding neighbourhood. To incorporate endogenous sorting by workers, I estimate a structural neighbourhood choice model. Light rail systems tend to raise rents in accessible locations, displacing lower skilled workers to isolated neighbourhoods, which reduces aggregate metropolitan employment in equilibrium. | 0.8559837 | 2021 | Tyndall, Justin | Journal of Urban Economics |
I study Los Angeles Metro Rail’s effects using panel data on bilateral commuting flows, a quantitative spatial model, and historically motivated quasi-experimental research designs. The model separates transit’s commuting effects from local productivity or amenity effects, and spatial shift-share instruments identify inelastic labor and housing supply. Metro Rail connections increase commuting by 16% but do not have large effects on local productivity or amenities. Metro Rail generates $94 million in annual benefits by 2000 or 12–25% of annualized costs. Accounting for reduced congestion and slow transit adoption adds, at most, another $200 million in annual benefits. | 0.8432086 | 2023 | Severen, Christopher | The Review of Economics and Statistics |
We examine the effect of commuting distance on workers’ labour supply patterns, distinguishing between weekly labour supply, number of workdays per week and daily labour supply. We account for endogeneity of distance by using employer-induced changes in distance. In Germany, distance has a slight positive effect on daily and weekly labour supply, but no effect on the number of workdays. The effect of distance on labour supply patterns is stronger for female workers, but it is still small. | 0.8416284 | 2010 | Gutiérrez-i-Puigarnau, Eva; van Ommeren, Jos | Journal of Urban Economics |
We estimate the causal impact of wage variations on commuting distance of workers. We test whether higher wages across years lead workers to live further away from their working place. We use employer–employee data for the French Ile-de-France region (surrounding Paris), from 2003 to 2008, and we deal with the endogenous relation between income and commuting using an instrumental variable strategy. We estimate that increases in wages coming from exogenous exposure to trade activities lead workers to increase their commuting distance and to settle closer to the city of Paris historical center. Our results cast novel insights upon the causal mechanisms from wage to spatial allocation of workers. | 0.8367841 | 2022 | Aboulkacem, El-Mehdi; Nedoncelle, Clément | Journal of Economic Geography |
Unsurprisingly, the top result is the target abstract itself, presented with slight formatting differences. The remaining top results align strongly with the target’s thematic focus on the economic and social impacts of transportation infrastructure, and of high-speed rail in particular. The second result discusses the economic effects of German high-speed rail (HSR), while the third and fourth focus on Japan’s Shinkansen and China’s HSR network, respectively. These findings highlight the model’s ability to identify thematic connections across diverse contexts.
Outlook: Clustering and Beyond
While this article focused on semantic search, embeddings open the door to a wide range of advanced analytical techniques.
Here are some potential further use-cases for embeddings:
Clustering for Topic Discovery: Embeddings can be leveraged with unsupervised learning methods like K-means or hierarchical clustering to automatically group text into topics or themes. This approach is particularly beneficial for exploratory research, enabling users to uncover hidden patterns and structures within large corpora without predefined labels.
Dimensionality Reduction and Visualization: High-dimensional embedding vectors often need simplification for human interpretation. Techniques like Principal Component Analysis (PCA), t-SNE, or UMAP allow us to project embeddings into lower-dimensional spaces (e.g., 2D or 3D). These visualizations can reveal clusters, trends, and outliers, providing insights at a glance.
Retrieval-augmented generation (RAG): Embeddings play a crucial role in retrieval-augmented generation, a technique that enhances large language models (LLMs). In this workflow, an embedding search retrieves semantically similar documents from a corpus, which are then appended to the LLM’s prompt. This helps the model generate contextually rich and accurate responses, especially in knowledge-intensive tasks.
Supporting Workflows in Qualitative Research: Embeddings can also transform workflows in qualitative research, as discussed in this paper by Kugler et al. (2023). The integration of Natural Language Processing (NLP) tools enables researchers to automate coding steps that traditionally require manual effort. Specifically, embeddings can assist in categorizing text data according to predefined themes, making research workflows more efficient and transparent. While these models bring significant advantages, challenges remain. The study highlights that off-the-shelf language models often struggle to discern implicit references and closely related topics as effectively as human researchers. However, more modern embedding models than the ones used in the study might help to deal with some of these challenges.
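The first two use-cases can be sketched with base R alone. The example below runs on randomly generated toy vectors rather than real embeddings, and all object names are illustrative. With real data, the matrix could be built by stacking the embeddings list column with do.call(rbind, ...):

```r
set.seed(42)
# Toy stand-in for an embedding matrix: two groups of 20 "documents" in 8 dimensions
emb_matrix <- rbind(
  matrix(rnorm(20 * 8, mean = -2), ncol = 8),
  matrix(rnorm(20 * 8, mean =  2), ncol = 8)
)

# Scale each row to unit length so Euclidean distance tracks cosine similarity
emb_unit <- emb_matrix / sqrt(rowSums(emb_matrix^2))

# Topic discovery: k-means with two clusters
clusters <- kmeans(emb_unit, centers = 2, nstart = 10)

# Visualization: project onto the first two principal components
pca <- prcomp(emb_unit)
plot(pca$x[, 1:2], col = clusters$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

On real abstracts the number of clusters is not known in advance, so centers would have to be chosen by inspection or with a criterion such as the within-cluster sum of squares.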