⚠️ Note: This article refers to the development version 0.2.5 of tidyllm, which has a changed interface compared to the last CRAN release. You can install the current development version directly from GitHub using devtools:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install TidyLLM from GitHub
devtools::install_github("edubruell/tidyllm")

Navigating through a large collection of academic papers can be time-consuming, especially when you’re trying to extract specific insights or determine the relevance of each document to your research. With tidyllm, you can streamline this process by automating the extraction of structured answers directly from PDF documents using large language models.

Imagine you have a folder of papers on the economic effects of generative AI, and you need to assess how each paper is related to your own research interests. This article will guide you through setting up a workflow that processes the first few pages of each paper, asks an AI model targeted questions, and returns the answers in a structured format — perfect for converting into a table for easy review and analysis.

Example Workflow

Imagine your folder looks something like this—many downloaded papers, but no structure yet:

library(tidyverse)
library(tidyllm)
dir("aipapers")
##  [1] "2018_Felten_etal_AILinkOccupations.pdf"                                                           
##  [2] "2024_Bick_etal_RapidAdoption.pdf"                                                                 
##  [3] "2024_Caplin_etal_ABCsofAI.pdf"                                                                    
##  [4] "2301.07543v1.pdf"                                                                                 
##  [5] "2302.06590v1.pdf"                                                                                 
##  [6] "2303.10130v5.pdf"                                                                                 
##  [7] "488.pdf"                                                                                          
##  [8] "88684e36-en.pdf"                                                                                  
##  [9] "ABCs_AI_Oct2024.pdf"                                                                              
## [10] "acemoglu-restrepo-2019-automation-and-new-tasks-how-technology-displaces-and-reinstates-labor.pdf"
## [11] "BBD_GenAI_NBER_Sept2024.pdf"                                                                      
## [12] "Deming-Ong-Summers-AESG-2024.pdf"                                                                 
## [13] "dp22036.pdf"                                                                                      
## [14] "FeltenRajSeamans_AIAbilities_AEA.pdf"                                                             
## [15] "JEL-2023-1736_published_version.pdf"                                                              
## [16] "Noy_Zhang_1.pdf"                                                                                  
## [17] "sd-2024-09-falck-etal-kuenstliche-intelligenz-unternehmen.pdf"                                    
## [18] "ssrn-4700751.pdf"                                                                                 
## [19] "SSRN-id4573321.pdf"                                                                               
## [20] "The Simple Macroeconomics of AI.pdf"                                                              
## [21] "w24001.pdf"                                                                                       
## [22] "w24871.pdf"                                                                                       
## [23] "w31161.pdf"                                                                                       
## [24] "w32430.pdf"

Our goal is to get a first overview of what these papers are about and to give them informative file names.

Step 1: Setting up the Document Prompt

First, we create a prompt designed to elicit detailed responses for each document. This structured prompt asks for specific metadata like title, authors, and type, along with deeper content-related questions about empirical methods and theoretical frameworks.


document_prompt <- '
Below are the first 5 pages of a document. Answer the questions below in detail.
Provide each answer as a standalone response. 

1. Title: Provide the full, exact title of the document as it appears on the first page.
2. Authors: List all authors as stated in the document, including any institutional affiliations if mentioned.
3. Suggested Filename: Suggest a filename in the format "ReleaseYear_Author_etal_ShortTitle.pdf". Use the publication year if available; otherwise, use XXXX.
4. Document Type: Specify if this is a "Policy Report" or a "Research Paper" based on the document’s style, structure, and purpose.
5. Key Citations: Identify the four most frequently cited or critical references in the first 5 pages that support the document’s primary claims or background.

Additionally, answer the following questions about the document’s content. Each answer should be concise but comprehensive, ideally 150 words:

6. Empirical Methods: Describe any empirical methods used, including data collection, statistical techniques, or analysis methods mentioned.
7. Theoretical Framework: Outline the primary theoretical framework or models discussed, if any, that underpin the analysis or arguments.
8. Main Point: Summarize the central argument or main point the document presents.
9. Key Contribution: Explain the unique contribution of this document, particularly what it adds to the field or topic.
'

Step 2: Generating Messages for Each Document

Next, we prepare a list of messages for all PDFs in the folder by applying llm_message() with the prompt to the first five pages of each document. Each entry in this list specifies a task to retrieve structured answers for one file. Even though the gpt-4o-mini model we use in this example can process up to 128,000 tokens (approximately 80-90 pages of English text), we limit the input to five pages for demonstration purposes and to focus on the introduction, which is usually enough to get a first overview of a paper.

files <- dir("aipapers", full.names = TRUE) # Full paths to all PDFs in the folder

document_tasks <- map(files, ~llm_message(document_prompt, 
            .pdf = list(
              filename = .x,
              start_page = 1,
              end_page = 5)) # At most 5 pages, since we are not sure how long each document is
) 

Step 3: Defining the Schema for Structured Output

In this step, we define a schema that outlines the expected data types for each field in the model’s responses. This schema enables the large language model to return answers in a structured, consistent format that can later be converted into a table, making it easy to analyze and compare results across documents.

document_schema <- tidyllm_schema(
  name = "DocumentAnalysisSchema",
  Title = "character",
  Authors = "character",
  SuggestedFilename = "character",
  Type = "factor(Policy, Research)",
  Empirics = "character",
  Theory = "character",
  MainPoint = "character",
  Contribution = "character",
  KeyCitations = "character"
)

In tidyllm_schema(), we begin by naming the schema with a character string, which is required by the OpenAI API. Then, we specify each field along with its expected data type. Supported types include character (with string as an accepted synonym), logical, and numeric. For fields where categorical responses are needed, we can use factor() to define specific allowed options. For instance, factor(Policy, Research) creates a categorical field with the choices “Policy” and “Research.” To indicate that a field should return a list of values, we append [] to the type. For example, setting Authors = "character[]" allows multiple entries in a list format for the Authors field. However, we intentionally avoid lists here to maintain a flat structure. This ensures that the output can be easily converted to a single-row tibble.
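
As a hypothetical illustration, a variant of the schema above could capture the authors as a list of strings; we do not use this in the workflow, since it would no longer collapse neatly into a single-row tibble:

nested_schema <- tidyllm_schema(
  name    = "DocumentAnalysisNested",   # illustrative schema name
  Title   = "character",
  Authors = "character[]",              # a list of author names instead of one string
  Type    = "factor(Policy, Research)"
)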

Step 4: Running the Analysis on a Sample Document

To test the setup, we run the analysis on a single document with the standard chat() function, using the schema to ensure structured output.

example_task <- document_tasks[[1]] |> 
  chat(openai(.json_schema = document_schema,
              .model       = "gpt-4o-mini"))

Step 5: Extracting and Formatting the Results for our example task

We use get_reply_data() to extract the structured answers from the model's reply.

get_reply_data(example_task)
## $Title
## [1] "A Method to Link Advances in Artificial Intelligence to Occupational Abilities†"
## 
## $Authors
## [1] "Edward W. Felten, Manav Raj, and Robert Seamans, Princeton University, NYU Stern School of Business"
## 
## $SuggestedFilename
## [1] "2018_Felten_etal_AILinkOccupations.pdf"
## 
## $Type
## [1] "Research"
## 
## $Empirics
## [1] "The paper leverages two main datasets: the Electronic Frontier Foundation (EFF) AI Progress Measurement Dataset and the Occupational Information Network (O*NET). It tracks AI advancements from 2010 to 2015 and correlates these to occupational abilities, focusing on the impact on U.S. occupations and their skill requirements, based on job descriptions and metric changes collected in those years."
## 
## $Theory
## [1] "The theoretical framework links AI advancements to occupational abilities. The authors develop a method for analyzing how different AI capabilities affect various skills required in the labor market, building on existing models that assess automation risk and labor economics, particularly regarding how tasks vary across occupations."
## 
## $MainPoint
## [1] "The central argument is that while AI may contribute to economic growth, its effects on labor are complex and depend on the specific abilities required for different occupations. The paper introduces a method to quantitatively link AI progress to changes in occupational skill requirements, aiming to provide better insights for policymakers and researchers."
## 
## $Contribution
## [1] "This document contributes a novel methodological framework to the study of AI's impact on labor, addressing the gap in systematic empirical research linking AI advancements to specific job-related skills and enabling a clearer understanding of how different occupations may be affected. It benefits researchers and policymakers by offering a structured way to analyze the implications of AI developments on the workforce."
## 
## $KeyCitations
## [1] "Frey and Osborne (2017); Autor and Salomons (2017); Bessen (2017); Brynjolfsson, Mitchell, and Rock (2018)."

The model seems to have answered our questions reasonably and in the structured format we provided for our first example task. We can also look at the token usage of the example task with get_metadata():

get_metadata(example_task)
## # A tibble: 1 × 5
##   model         timestamp           prompt_tokens completion_tokens total_tokens
##   <chr>         <dttm>                      <int>             <int>        <int>
## 1 gpt-4o-mini-… 2024-11-13 07:41:29          4356               273         4629

At a price of $0.15 per million input tokens and $0.60 per million output tokens for gpt-4o-mini, the cost of our example task is less than a cent.
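
We can check this with a quick back-of-the-envelope calculation based on the token counts above:

4356 * 0.15 / 1e6 + 273 * 0.60 / 1e6 # input cost + output cost in dollars
## [1] 0.0008172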

Step 6: Scaling up to a whole batch of papers

After confirming that our single-document analysis is working well, we can extend this workflow to process a larger batch of documents. Batch processing is particularly valuable when handling a large collection of files, as it allows us to submit multiple messages at once, which are then processed together on the model provider’s servers.

Batch APIs, like those from Anthropic and OpenAI, often offer up to 50% savings compared to single-interaction requests. In tidyllm, we can use send_batch() to submit batch requests. The OpenAI Batch API supports up to 50,000 requests in a single batch with a maximum file size of 100 MB. Additionally, batch API rate limits are separate from the standard per-model limits, meaning batch usage doesn’t impact your regular API rate allocations.

document_tasks |> 
  send_batch(openai(.json_schema = document_schema,
                    .model       = "gpt-4o-mini")) |>
  write_rds("document_batch.rds")

After the batch is sent, the output of send_batch() contains the input list of message histories along with batch metadata, such as the batch ID stored as an attribute, as well as unique names for each list element that are used to match messages with their replies once they are ready. If you provide a named list of messages, tidyllm will use these names as identifiers in the batch (provided the names are unique). Batches are processed within 24 hours, usually much faster; the batch request for this example was processed within 10 minutes.
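
For example, a minimal sketch of naming the tasks before sending them, so that replies can be matched back to specific documents (basename() just strips the folder from each path, and the names must be unique):

# Name each task after its PDF so replies are identifiable by file
named_tasks <- set_names(document_tasks, basename(files))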

⚠️ Note: We save the RDS file to disk to preserve the state of our batch request, including all messages and their unique identifiers. This file acts as a checkpoint, allowing us to reload it and check the batch status or retrieve results across R sessions without needing to resend the entire batch if we close the session in the meantime.

To check whether a batch was completed, you can load the saved batch object and pass it to check_batch():

read_rds("document_batch.rds")|>
  check_batch(openai())

Alternatively, you can list all OpenAI batches with list_openai_batches() or list_batches(openai()). Of course, you can also look at the batches dashboard of the OpenAI platform for the same overview. Since OpenAI batches are sent as .jsonl files and stored on the OpenAI servers, you can also use the files tab of the dashboard to delete old files from time to time.

Step 7: Getting data from the entire batch

Once the batch is complete, we fetch all responses with fetch_batch():

results <- read_rds("document_batch.rds")|>
  fetch_batch(openai())

We can then process the results into a table by mapping get_reply_data() and as_tibble() over the batch output:

document_table <- results |>
  map(get_reply_data) |>
  map_dfr(as_tibble)
  
document_table
## # A tibble: 24 × 9
##    Title  Authors SuggestedFilename Type  Empirics Theory MainPoint Contribution
##    <chr>  <chr>   <chr>             <chr> <chr>    <chr>  <chr>     <chr>       
##  1 A Met… Edward… 2018_Felten_etal… Rese… The doc… The p… The docu… The key con…
##  2 The R… Alexan… 2024_Bick_etal_R… Rese… The pap… The p… The docu… This docume…
##  3 THE A… Andrew… 2024_Caplin_etal… Rese… The doc… The p… The cent… This resear…
##  4 Large… John J… 2023_Horton_etal… Rese… The pap… The t… The cent… This docume…
##  5 The I… Sida P… 2023_Peng_etal_A… Rese… The stu… The t… The pape… This docume…
##  6 GPTs … Tyna E… 2023_Eloundou_et… Rese… The emp… The p… The cent… This docume…
##  7 Autom… Philip… 2023_Lergetporer… Rese… The emp… The t… The cent… The documen…
##  8 Artif… Andrew… 2024_Green_etal_… Rese… The rep… The a… The cent… This docume…
##  9 THE A… Andrew… 2024_Caplin_etal… Rese… The doc… Theor… The cent… This paper …
## 10 Autom… Daron … 2019_Acemoglu_Re… Rese… The pap… The d… The cent… This docume…
## # ℹ 14 more rows
## # ℹ 1 more variable: KeyCitations <chr>

This table can be exported to Excel with writexl::write_xlsx() for further review. Additionally, we could programmatically rename the PDFs using the model’s suggested filenames, helping maintain an organized document structure for future analysis.
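
As a minimal sketch, assuming the batch results come back in the same order as files and that the suggested filenames are unique, the export and renaming could look like this (the Excel file name is illustrative):

# Export the overview table to Excel for review
writexl::write_xlsx(document_table, "ai_paper_overview.xlsx")

# Rename each PDF to the model's suggested filename
walk2(files,
      file.path("aipapers", document_table$SuggestedFilename),
      file.rename)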

Further notes on working with large documents

Context length

When working with long documents like research papers, reports, or books, one common challenge is context length—the maximum amount of text the model can process in a single query. If a document exceeds this limit, the model will only see a portion of it, which may lead to missing important sections or incomplete answers.

In most models, context length is measured in tokens, the basic units of text. For example, many small local models have a maximum context length of around 8,192 tokens, which a long academic paper can easily exceed. This means the model may only see the beginning of the document, potentially omitting later sections like bibliographies or appendices, where key references or results might appear. Moreover, appending a whole document to a prompt might push your actual instructions out of the context.

To manage this, a common approach is to limit the number of pages sent to the model. In this workflow, we focus on the first five pages for an initial overview. This typically includes the abstract, introduction, methodology, results, and discussion—enough to capture the essence of the paper. While this approach ensures that the model can process core content, it may omit information found in later sections.

Alternatively, for very large documents, you could split them into smaller sections and process each separately, covering more content without exceeding the model’s context window. However, splitting can disrupt the document’s flow, which may affect how well the model retains context across sections.
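
A minimal sketch of such a split, assuming the pdftools package is installed to count pages (the five-page chunk size mirrors the workflow above):

# Split one long PDF into 5-page chunks and create one task per chunk
n_pages <- pdftools::pdf_length(files[[1]])
starts  <- seq(1, n_pages, by = 5)

chunk_tasks <- map(starts, ~llm_message(document_prompt,
    .pdf = list(filename   = files[[1]],
                start_page = .x,
                end_page   = min(.x + 4, n_pages))))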

Gemini for image-heavy PDFs

For gemini() there is an alternative to adding the text from a PDF with llm_message(). You can directly upload a PDF to Google’s servers with gemini_upload_file() and use it in the context of your messages. The advantage of this approach is that Gemini can handle images in PDFs or even image-only PDFs (such as scanned documents). See the article on Video and Audio Data with the Gemini API for an example of how to use the gemini_upload_file() feature.

Local Models

Using open-source local models for PDF processing is also possible, though remote models like gpt-4o-mini tend to handle longer documents more effectively. Smaller local models, like gemma2:9B in ollama(), may struggle with large content and complex, structured queries, even if they support extended context lengths. For more demanding tasks, larger local models like llama3:70B may perform better with complex queries, but they require substantial hardware resources to run smoothly. Since Ollama has a default context length of just 2,048 tokens, you will likely need to adjust the context length with the .num_ctx option in ollama(). Sending a message that is longer than the context length can cause strange results, since Ollama silently truncates the input, so only parts of the document may be processed, possibly without your instructions. Note that increasing the context length will likely slow down processing due to higher memory usage. In these cases, reducing the number of pages may help.

⚠️ Note: When using paid remote models, it’s important to consider data privacy and security, especially if you’re processing sensitive documents, as uploading data to external servers may introduce risks — local models provide more control in such cases.

Another current problem for local models is structured output. If you use ollama(), it currently only supports a simpler version of structured output, where you put the desired schema in the prompt as a JSON example and enable .json = TRUE:

ollama_output <- llm_message('
Below are the first 5 pages of a document. 
Answer the following queries about the document in this provided JSON format:

{
  "Title": "Full, exact title of the document as it appears on the first page.",
  "Authors": "List of all authors as stated in the document, including any institutional affiliations if mentioned.",
  "Filename": "Suggested filename in the format \'ReleaseYear_Author_etal_ShortTitle.pdf\'. Use the publication year if available; otherwise, use \'XXXX\'.",
  "DocType": "Specify if this is a \'Policy Report\' or a \'Research Paper\' based on the document’s style, structure, and purpose.",
  "KeyCitations": "The four most frequently cited or critical references in the first 5 pages that support the document’s primary claims or background.",
  "EmpiricalMethods": "Description of any empirical methods used, including data collection, statistical techniques, or analysis methods mentioned.",
  "TheoreticalFramework": "Outline of the primary theoretical framework or models discussed, if any, that underpin the analysis or arguments.",
  "MainPoint": "Summary of the central argument or main point the document presents.",
  "KeyContribution": "Explanation of the unique contribution of this document, particularly what it adds to the field or topic."
}',   .pdf = list(
  filename = files[[1]],
  start_page = 1,
  end_page = 5)) |>
  chat(ollama(.json=TRUE,.num_ctx = 8192))
ollama_output |> get_reply_data() |> as_tibble()
## # A tibble: 1 × 9
##   Title                   Authors Filename DocType KeyCitations EmpiricalMethods
##   <chr>                   <chr>   <chr>    <chr>   <chr>        <chr>           
## 1 A Method to Link Advan… "Edwar… 2018_Fe… Resear… "  - Frey a… "The authors de…
## # ℹ 3 more variables: TheoreticalFramework <chr>, MainPoint <chr>,
## #   KeyContribution <chr>

While this worked for the current document, it is not clear that the model will adhere to a strict schema for every document, so you might have to manually clean cases where fields are missing. This is especially a problem with nested outputs.
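
As a minimal sketch of such manual cleaning, assuming each field comes back as a single string (as in the example output above), one could pad any missing fields with NA before converting to a tibble; the vector of expected field names simply repeats the keys from the prompt:

expected_fields <- c("Title", "Authors", "Filename", "DocType", "KeyCitations",
                     "EmpiricalMethods", "TheoreticalFramework", "MainPoint",
                     "KeyContribution")

reply <- get_reply_data(ollama_output)
reply[setdiff(expected_fields, names(reply))] <- NA_character_ # fill missing keys
as_tibble(reply[expected_fields])                              # consistent column order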

Outlook

This structured question-answering workflow not only streamlines the extraction of key insights from academic papers but can also be adapted for other document-heavy tasks. Whether you’re working with reports, policy documents, or news articles, this approach can quickly help you summarize and categorize information for further analysis.