Navigating through a large collection of academic papers can be time-consuming, especially when you’re trying to extract specific insights or determine the relevance of each document to your research. With tidyllm, you can streamline this process by automating the extraction of structured answers directly from PDF documents using large language models.

Imagine you have a folder of papers on the economic effects of generative AI, and you need to assess how each paper is related to your own research interests. This article guides you through setting up a workflow that processes the first few pages of each paper, asks an AI model targeted questions, and returns the answers in a structured format, perfect for converting into a table for easy review and analysis.

Example Workflow

Your folder might look something like this, with many downloaded papers but no structure yet:

library(tidyverse)
library(tidyllm)
dir("aipapers")
##  [1] "2018_Felten_etal_AILinkOccupations.pdf"                                                           
##  [2] "2024_Bick_etal_RapidAdoption.pdf"                                                                 
##  [3] "2024_Caplin_etal_ABCsofAI.pdf"                                                                    
##  [4] "2301.07543v1.pdf"                                                                                 
##  [5] "2302.06590v1.pdf"                                                                                 
##  [6] "2303.10130v5.pdf"                                                                                 
##  [7] "488.pdf"                                                                                          
##  [8] "88684e36-en.pdf"                                                                                  
##  [9] "ABCs_AI_Oct2024.pdf"                                                                              
## [10] "acemoglu-restrepo-2019-automation-and-new-tasks-how-technology-displaces-and-reinstates-labor.pdf"
## [11] "BBD_GenAI_NBER_Sept2024.pdf"                                                                      
## [12] "Deming-Ong-Summers-AESG-2024.pdf"                                                                 
## [13] "dp22036.pdf"                                                                                      
## [14] "FeltenRajSeamans_AIAbilities_AEA.pdf"                                                             
## [15] "JEL-2023-1736_published_version.pdf"                                                              
## [16] "Noy_Zhang_1.pdf"                                                                                  
## [17] "pb01-25.pdf"                                                                                      
## [18] "sd-2024-09-falck-etal-kuenstliche-intelligenz-unternehmen.pdf"                                    
## [19] "ssrn-4700751.pdf"                                                                                 
## [20] "SSRN-id4573321.pdf"                                                                               
## [21] "The Simple Macroeconomics of AI.pdf"                                                              
## [22] "w24001.pdf"                                                                                       
## [23] "w24871.pdf"                                                                                       
## [24] "w31161.pdf"                                                                                       
## [25] "w32430.pdf"

Our goal is to get a first overview of these papers and to give them good file names.

Step 1: Generating Messages for Each Document

First, we prepare a list of messages for all PDFs in the folder by applying llm_message() with a prompt to the first five pages of each document. Even though gpt-5.4 can process up to 128,000 tokens (roughly 80-90 pages of English text), we limit the input to five pages for demonstration purposes; the introduction is usually enough to get a first overview of a paper.

files <- list.files("aipapers", full.names = TRUE, pattern = "\\.pdf$")

document_tasks <- files |>
  map(\(f) llm_message(
    "Below are the first 5 pages of a document.
     Summarise the document based on the provided schema.",
    .pdf = list(
      filename   = f,
      start_page = 1,
      end_page   = 5
    )
  ))

Step 2: Defining the Schema for Structured Output

We define a schema that outlines the expected data types for each field in the model’s responses. This enables the model to return answers in a consistent, structured format that converts directly into a tibble.

document_schema <- tidyllm_schema(
  name             = "DocumentAnalysisSchema",
  Title            = field_chr("The full title of the provided document"),
  Authors          = field_chr("A semicolon-separated list of authors"),
  SuggestedFilename = field_chr("Suggest a filename in the format \"ReleaseYear_Author_etal_ShortTitle.pdf\". Use the publication year if available; otherwise, use XXXX."),
  Type             = field_fct("Is the document a Policy Report or a Research Paper based on its style, structure, and purpose?",
                               .levels = c("Policy", "Research")),
  Empirics         = field_chr("A 100 word description of empirical methods used, including data collection, statistical techniques, or analysis methods mentioned."),
  Theory           = field_chr("A 100 word outline of the primary theoretical framework or models discussed, if any."),
  MainPoint        = field_chr("A one-sentence summary of the main point raised"),
  Contribution     = field_chr("A short explanation of the main contributions claimed in the document"),
  KeyCitations     = field_chr("The four most frequently cited or critical references in the first 5 pages")
)

field_chr() returns text, field_dbl() returns numbers, and field_fct() restricts values to the categories listed in its .levels argument. Setting .vector = TRUE on any field makes it return a list of values instead of a single one; here we keep a flat structure so that each document maps to a single-row tibble. name is not an output field: it provides an identifier for the schema.
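As an illustration of the field types not used above, the following sketch defines a small schema with a numeric field and a vector-valued field. It assumes field_dbl() and the .vector argument behave as described in the previous paragraph; the schema name and the field names ReleaseYear and Keywords are made up for this example and are not part of the workflow in this article.

```r
library(tidyllm)

# Illustrative schema only: ReleaseYear and Keywords are hypothetical fields
extended_schema <- tidyllm_schema(
  name        = "ExtendedDocumentSchema",
  Title       = field_chr("The full title of the provided document"),
  ReleaseYear = field_dbl("The publication year as a four-digit number"),
  # .vector = TRUE asks for several values instead of a single string
  Keywords    = field_chr("Up to five keywords describing the paper",
                          .vector = TRUE)
)
```

A vector-valued field such as Keywords no longer fits into a single cell of a flat one-row table without extra handling, which is why the schema in this article uses a semicolon-separated string for Authors instead.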

Step 3: Running the Analysis on a Sample Document

To test the setup, we run the analysis on a single document with chat() and the schema:

example_task <- document_tasks[[1]] |>
  chat(openai(.json_schema = document_schema,
              .model       = "gpt-5.4"))

Step 4: Extracting and Formatting the Results

get_reply_data() extracts the model’s structured response:

get_reply_data(example_task)
## $Title
## [1] "A Method to Link Advances in Artificial Intelligence to Occupational Abilities"
## 
## $Authors
## [1] "Edward W. Felten; Manav Raj; Robert Seamans"
## 
## $SuggestedFilename
## [1] "2018_Felten_etal_AILinkOccupations.pdf"
## 
## $Type
## [1] "Research"
## 
## $Empirics
## [1] "The study employs two main databases: the Electronic Frontier Foundation (EFF) AI Progress Measurement dataset and the Occupational Information Network (O*NET). The EFF dataset tracks task-specific AI performance metrics across various categories from 2010 to 2015, while O*NET provides contemporary occupational definitions. The authors calculate the correlation between AI advancements and occupational changes using an analysis of variance methodology to derive impact scores for occupations."
## 
## $Theory
## [1] "The theoretical framework relies on linking advancements in various AI fields (e.g., image recognition) to a set of 52 abilities outlined by O*NET. The authors construct a matrix correlating EFF AI categories to O*NET abilities, assessing how AI impacts specific skills important for different occupations, theoretically grounded in human capital and job task frameworks."
## 
## $MainPoint
## [1] "The paper introduces a novel methodology to quantitatively assess the impact of AI advancements on occupational abilities, facilitating better understanding for researchers and policymakers."
## 
## $Contribution
## [1] "This research contributes a systematic approach for linking AI advancements to occupational changes, enhancing the understanding of AI's role in labor markets and the skills needed across different jobs. The methodology allows for further analysis of the varying impacts of AI on occupations, potentially aiding policy development."
## 
## $KeyCitations
## [1] "Autor and Handel (2013); Brynjolfsson et al. (2018); Acemoglu and Restrepo (2017); Frey and Osborne (2017)"

The model returns a named list matching our schema, which as_tibble() converts to a single row. We can also inspect token usage:

get_metadata(example_task)
## # A tibble: 1 × 7
##   model  timestamp           prompt_tokens completion_tokens total_tokens stream
##   <chr>  <dttm>                      <int>             <int>        <int> <lgl> 
## 1 gpt-5… 2026-03-16 10:00:00          4265               337         4602 FALSE 
## # ℹ 1 more variable: api_specific <list>

At roughly 4,600 tokens per document, the cost per paper is small: at current gpt-5.4 batch pricing, processing a folder of 50 papers costs well under a dollar.
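The arithmetic behind such an estimate can be sketched as follows. The token counts are taken from the get_metadata() output above; the two prices are placeholders chosen only to make the example run, not actual gpt-5.4 prices, and the helper estimate_cost() is hypothetical. Replace the prices with the values from the provider's current price list.

```r
# Rough cost estimate from token counts (all prices in USD per million tokens)
estimate_cost <- function(prompt_tokens, completion_tokens,
                          usd_per_m_input, usd_per_m_output,
                          batch_discount = 0) {
  per_request <- prompt_tokens * usd_per_m_input / 1e6 +
    completion_tokens * usd_per_m_output / 1e6
  per_request * (1 - batch_discount)
}

per_doc <- estimate_cost(
  prompt_tokens     = 4265,   # from get_metadata(example_task)
  completion_tokens = 337,    # from get_metadata(example_task)
  usd_per_m_input   = 2.50,   # PLACEHOLDER price, not a real quote
  usd_per_m_output  = 10.00,  # PLACEHOLDER price, not a real quote
  batch_discount    = 0.5     # assumes the full 50% batch saving applies
)

per_doc       # estimated cost of one document
per_doc * 50  # estimated cost of a folder of 50 papers
```

Because input tokens dominate in this workflow, the page range passed to llm_message() is the main lever on cost.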

Step 5: Scaling Up to a Whole Batch of Papers

After confirming the single-document analysis works, we extend the workflow to process a full batch. Batch APIs from Anthropic and OpenAI offer up to 50% savings compared to single requests, and batch rate limits are separate from standard per-model limits so they don’t affect your regular quota.

document_tasks |>
  send_batch(openai(.json_schema = document_schema,
                    .model       = "gpt-5.4")) |>
  write_rds("document_batch.rds")

The output of send_batch() contains the input message list plus batch metadata (a batch_id attribute and unique per-message identifiers). Batches are processed within 24 hours; the example here completed in about 10 minutes.

⚠️ Note: Save the RDS file to disk. It acts as a checkpoint, letting you reload and check batch status or retrieve results across R sessions without resubmitting the batch.

Check whether a batch has completed:

read_rds("document_batch.rds") |>
  check_job(openai())

You can also list all OpenAI batches with list_batches(openai()) or view them in the OpenAI batches dashboard.

Step 6: Getting Data from the Entire Batch

Once the batch is complete, fetch all responses with fetch_batch():

results <- read_rds("document_batch.rds") |>
  fetch_batch(openai())

Map get_reply_data() and as_tibble() over the batch output to produce a tidy table:

document_table <- results |>
  map(get_reply_data) |>
  map_dfr(as_tibble)

document_table
## # A tibble: 24 × 9
##    Title  Authors SuggestedFilename Type  Empirics Theory MainPoint Contribution
##    <chr>  <chr>   <chr>             <chr> <chr>    <chr>  <chr>     <chr>       
##  1 A Met… Edward… 2018_Felten_etal… Rese… The pap… Theor… The pape… The primary…
##  2 The R… Alexan… 2024_Bick_etal_R… Rese… The emp… The t… The pape… The primary…
##  3 The A… Andrew… 2024_Caplin_etal… Rese… The stu… The t… AI can s… The paper c…
##  4 Large… John J… 2023_Horton_etal… Rese… The doc… The c… The pape… This resear…
##  5 The I… Sida P… 2023_Peng_etal_I… Rese… The stu… The p… The stud… This resear…
##  6 GPTs … Tyna E… 2023_Eloundou_et… Rese… The stu… The t… The intr… This paper …
##  7 Autom… Philip… 2023_Lergetporer… Rese… This st… The u… Workers … This paper …
##  8 Artif… Andrew… 2024_Green_etal_… Rese… The doc… The r… AI is tr… The primary…
##  9 The A… Andrew… 2024_Caplin_etal… Rese… The res… The s… The abil… This resear…
## 10 Autom… Daron … 2019_Acemoglu_Re… Rese… The doc… The p… The impa… The authors…
## # ℹ 14 more rows
## # ℹ 1 more variable: KeyCitations <chr>

This table can be exported to Excel with writexl::write_xlsx() for further review, or used to programmatically rename the PDFs using the model’s suggested filenames.
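The renaming step could look like the sketch below. It uses base R only and runs on dummy files in a temporary directory; the suggested names are invented for the demonstration. In the real workflow, old_paths would be files and suggested would be document_table$SuggestedFilename. This assumes that row i of the table belongs to file i, which should be checked first: the table above has 24 rows while the folder listing shows 25 files. The helper clean_filenames() is hypothetical.

```r
# Dummy files standing in for the downloaded papers
paper_dir <- tempfile("aipapers_demo_")
dir.create(paper_dir)
old_paths <- file.path(paper_dir, c("488.pdf", "w24001.pdf", "dp22036.pdf"))
file.create(old_paths)

# Invented model suggestions with typical problems
suggested <- c("2024_Smith_etal_AI:Work.pdf",  # character unsafe in file names
               "2024_Smith_etal_AI:Work.pdf",  # duplicate suggestion
               "XXXX_Doe_etal_Robots")         # missing extension

clean_filenames <- function(x) {
  x <- sub("\\.pdf$", "", basename(x), ignore.case = TRUE)  # drop path and .pdf
  x <- gsub("[^A-Za-z0-9_-]+", "_", x)   # replace anything outside a safe set
  x <- make.unique(x, sep = "_")         # disambiguate identical suggestions
  paste0(x, ".pdf")
}

new_paths <- file.path(paper_dir, clean_filenames(suggested))

# Rename only where the target does not exist, so nothing is overwritten
ok      <- !file.exists(new_paths)
renamed <- file.rename(old_paths[ok], new_paths[ok])

basename(new_paths)
```

Duplicate suggestions are realistic when the same paper was downloaded twice, as rows 3 and 9 of the table above suggest, so deduplicating before renaming avoids silently losing a file.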

Further Notes on Working with Large Documents

Context Length

When working with long documents, context length limits how much text the model can process in a single query. If a document exceeds this limit, the model only sees a portion of it and may miss important sections.

Focusing on the first five pages, as in this workflow, typically captures the abstract, introduction, and methodology, which is enough for a first overview. For documents where later sections matter, split them into smaller, overlapping chunks and process each chunk separately. A small overlap between consecutive chunks helps preserve continuity across chunk boundaries.
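One way to produce overlapping chunks is to work with page ranges. The helper below is a minimal sketch, not part of tidyllm: page_windows() and its arguments are hypothetical names, and the number of pages has to be supplied by the caller.

```r
# Split a document of n_pages into windows of `size` pages, where
# consecutive windows share `overlap` pages
page_windows <- function(n_pages, size = 5, overlap = 1) {
  stopifnot(n_pages >= 1, size >= 1, overlap >= 0, overlap < size)
  step   <- size - overlap
  # Stop early enough that the last window is not fully covered by the previous one
  starts <- seq(1, max(1, n_pages - overlap), by = step)
  data.frame(
    start_page = starts,
    end_page   = pmin(starts + size - 1, n_pages)
  )
}

# A 12-page paper yields the windows 1-5, 5-9 and 9-12
page_windows(12, size = 5, overlap = 1)
```

Each row can then be passed as start_page and end_page in the .pdf argument of llm_message(), as in Step 1, producing one message per chunk. The page count of a file can be obtained, for example, with pdftools::pdf_info(f)$pages.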

Gemini and Claude for Image-Heavy PDFs

For gemini() there is an alternative to embedding PDF text in the prompt. You can upload a PDF directly to Google’s servers with gemini_upload_file() and reference it in your messages. The key advantage is that Gemini can then handle image-heavy PDFs and scanned documents whose content cannot be extracted as text. See the Video and Audio Data with the Gemini API article for an example of the file upload workflow.

Claude offers a parallel option via the Claude Files API (claude_upload_file()), which is well suited for text-heavy PDFs where citation accuracy and long-context coherence matter most. The two providers complement each other: Gemini for image-only or scan-heavy documents, Claude for dense academic text.

Local Models

Local models via ollama() or llamacpp() are a strong choice for privacy-sensitive documents, as no data leaves your machine. Modern models like qwen3.5 support a context window of up to 262,144 tokens in principle; the practical limit on a typical laptop is RAM. On a 16 GB machine a 9B model at Q4 quantisation already uses roughly 6 GB for weights, leaving limited headroom for long contexts before inference slows to a crawl or the system starts swapping.

For most academic papers the first five pages fit comfortably within what laptop hardware can handle. For longer documents, limit the page range or reduce the context window explicitly via the .num_ctx parameter in ollama().

⚠️ Note: When using paid remote APIs, consider data privacy, especially for sensitive documents. Local models via ollama() or llamacpp() keep all processing on-device.

Outlook

This structured question-answering workflow streamlines the extraction of key insights from academic papers and adapts directly to other document-heavy tasks. Reports, policy documents, contracts, or news archives can all be processed with the same pattern: define a schema, test it on a single document, then scale up with a batch.