Video and Audio Data with the Gemini API • tidyllm

The tidyllm package aims to implement some API-specific functionality beyond the main features that are available on all APIs. With the Google Gemini API, you can now upload files directly, supporting a range of formats, including video and audio. You can use uploaded files in the context of your messages through the .fileid-argument in gemini_chat(). This feature opens up a world of possibilities for interactive data analysis and communication.

Sending an example video to Gemini

We’ll explore how to leverage this functionality by uploading The Thinking Machine, an interesting 1960s TV segment on AI , and interact with Gemini using this file as context:

We have this segment on our PC as mp4-file and can simply use gemini_upload_file() to make it available in gemini() requests.

library(tidyverse)
library(tidyllm)
upload_info <- gemini_upload_file("the_thinking_machine_1960s.mp4")
upload_info
str(upload_info)

## # A tibble: 1 × 7
##   name   display_name mime_type size_bytes create_time uri  
##   <chr>  <chr>        <chr>          <dbl> <chr>       <chr>
## 1 files… the_thinkin… video/mp4    6520447 2024-11-18… http…
## # ℹ 1 more variable: state <chr>
## tibble [1 × 7] (S3: tbl_df/tbl/data.frame)
##  $ name        : chr "files/wutpmcv9jve9"
##  $ display_name: chr "the_thinking_machine_1960s.mp4"
##  $ mime_type   : chr "video/mp4"
##  $ size_bytes  : num 6520447
##  $ create_time : chr "2024-11-18T09:40:29.603980Z"
##  $ uri         : chr "https://generativelanguage.googleapis.com/v1beta/files/wutpmcv9jve9"
##  $ state       : chr "PROCESSING"

When you use gemini_upload_file(), it returns a tibble containing detailed metadata about the uploaded file. The output includes several key columns: name (a unique identifier for the file on the server), display_name (the original file name), mime_type (the file’s media type, e.g., video/mp4), size_bytes (the file size in bytes), create_time (timestamp indicating when the file was uploaded), uri (a URL pointing to the file’s location on Google’s servers) and state (indicating the current processing status of the file).

From this range of metadata, you primarily need the name, which you can use in the .fileid argument in gemini():

llm_message("Give me a detailed summary of this video") |>
  chat(gemini(.fileid = upload_info$name))

## Message History:
## system:
## You are a helpful assistant
## --------------------------------------------------------------
## user:
## Give me a 200 word summary of this video
## --------------------------------------------------------------
## assistant:
## Here is a summary of the video.
## 
## This 1961 black and white Paramount News clip explores
## the burgeoning field of artificial intelligence. The host
## interviews Professor Jerome B. Wiesner, director of MIT's
## Research Laboratory of Electronics. Wiesner discusses the
## capabilities of computers, noting that while their abilities
## were previously limited, he suspects they'll be able to
## "think" within a few years. The segment shows a computer
## playing checkers against a human opponent. Other experts,
## including Oliver Selfridge and Claude Shannon, offer their
## perspectives on whether machines can truly think and the
## potential implications for the future, particularly in
## language translation. One expert predicts that within
## 10–15 years, machines will be performing tasks previously
## considered the realm of human intelligence. The film ends
## by showing a computer that translates Russian to English.
## Despite this success, one expert notes that computers will
## not replace translators of poetry and novels.
## --------------------------------------------------------------

Once a file is uploaded you can reuse it for different requests. Note that Gemini also supports tidyllm_schema(), allowing you to get structured responses for all file types that you can use with Gemini:

structured_request <- llm_message("Extract some details about this video") |>
  chat(gemini(.fileid = upload_info$name),
       .json_schema = tidyllm_schema(name="videoschema",
                                     Persons = "character[]",
                                     ScientistQuotes = "character[]",
                                     Topics = "character[]",
                                     runtime = "numeric"))

structured_request |>
  get_reply_data()

## $Persons
## [1] "Professor Jerome B. Wiesner" "Oliver G. Selfridge"        
## [3] "Claude Shannon"             
## 
## $ScientistQuotes
## [1] "Well, that's a very hard question to answer. If you'd asked me that question just a few years ago I'd have said it was very far fetched. And today I just have to admit I don't really know. I suspect that if you'd come back in four or five years I'll say sure they really do think."                                                                                                                                              
## [2] "I'm convinced that machines can and will think. I don't mean that machines will behave like men. I don't think for a very long time we're going to have a difficult problem distinguishing a man from a robot. And I don't think my daughter will ever marry a computer. But I think the computers will be doing the things that men do when we say they're thinking. I am convinced that machines can and will think in our lifetime."
## [3] "I confidently expect that within a matter of 10 or 15 years, something will emerge from the laboratories which is not too far from the robot of science fiction fame"                                                                                                                                                                                                                                                                  
## 
## $Topics
## [1] "Artificial Intelligence"        "Computer"                      
## [3] "Thinking Machine"               "Machine Translation"           
## [5] "Russian to English Translation" "Cold War"                      
## 
## $runtime
## [1] 172

gemini_list_files() creates a tibble with an overview of files you have uploaded to Gemni. You can use it to check which files are currently available for use with Gemini.

gemini_files <- gemini_list_files()
gemini_files

## # A tibble: 3 × 10
##   name  display_name mime_type size_bytes create_time update_time
##   <chr> <chr>        <chr>          <dbl> <chr>       <chr>      
## 1 file… example.mp3  audio/mp3    2458836 2024-11-15… 2024-11-15…
## 2 file… the_thinkin… video/mp4    6520447 2024-11-18… 2024-11-18…
## 3 file… example_akt… applicat…     193680 2024-11-17… 2024-11-17…
## # ℹ 4 more variables: expiration_time <chr>, sha256_hash <chr>,
## #   uri <chr>, state <chr>

For example, we can now ask gemini() what the audio file I uploaded earlier is about:

llm_message("What's in this audio file. Describe it in 100 words.") |>
  gemini_chat(.fileid=gemini_files$name[1])

## Message History:
## system:
## You are a helpful assistant
## --------------------------------------------------------------
## user:
## What's in this audio file. Describe it in 100 words.
## --------------------------------------------------------------
## assistant:
## This is an audio recording of an interview with Robert
## Bosch. He discusses his early life, his career path from
## apprentice to founder of his renowned company, and his
## philosophy on business. Bosch recounts his decision to
## become a precision mechanic, his travels to America and
## England, and the development of his famous magneto ignition
## system. He emphasizes his principles of high-quality work,
## fair treatment of employees, and the importance of trust and
## reliability in business dealings. The interview concludes
## with well wishes.
## --------------------------------------------------------------

Once you have completed working with a file, you can delete it with gemini_delete_file()

gemini_files$name[1] |>
  gemini_delete_file()

## File files/nzq2zw9u30y8 has been successfully deleted.

Supported file types

The Google Gemini API supports a wide range of file formats, enabling seamless multimodal workflows. For documents, it handles PDFs, plain text, HTML, CSS, Markdown, CSV-tables, XML, and RTF. Uploading documents with gemini_upload_file() is useful, since the standard llm_message() function with the .pdf argument is designed to extract only the textual content from PDFs. In contrast, the Google Gemini API enhances this by supporting multimodal PDFs, including those that contain images or are image-only (such as scanned documents). This allows users to work with both textual and visual data in their interactions.

For example, we can upload an old scan of a paper that contains no extractable text at all and ask gemini to extract all the references:

neal1995 <- gemini_upload_file("1995_Neal_Industry_Specific.pdf")
            
bib <- llm_message("Extract all the references from this paper in the specified format") |>         
            chat(gemini(.fileid=neal1995$name),
                 .json_schema = tidyllm_schema(name="references",
                                               APAreferences ="character[]")
            )

references <- bib |> get_reply_data()
references[1:5]

## [1] "Addison, John, and Portugal, Pedro. \"Job Displacement, Relative Wage Changes, and Duration of Unemployment.\" *Journal of Labor Economics* 7 (July 1989): 281–302."
## [2] "Altonji, Joseph, and Shakotko, Robert. \"Do Wages Rise with Seniority?\" *Review of Economic Studies* 54 (July 1987): 437–59."                                      
## [3] "Becker, Gary. *Human Capital*. New York: Columbia University Press, 1975."                                                                                          
## [4] "Carrington, William. \"Wage Losses for Displaced Workers: Is It Really the Firm That Matters?\" *Journal of Human Resources* 28 (Summer 1993): 435–62."             
## [5] "Carrington, William, and Zaman, Asad. \"Interindustry Variation in the Costs of Job Displacement.\" *Journal of Labor Economics* 12 (April 1994): 243–76."

Image formats for gemini_upload_file() include PNG, JPEG, WEBP, HEIC, and HEIF, while supported video formats cover MP4, MPEG, MOV, AVI, FLV, MPG, WEBM, WMV, and 3GPP. For audio, the API works with WAV, MP3, AIFF, AAC, OGG Vorbis, and FLAC. Code files for JavaScript and Python are also supported.

Conclusion

The ability to work with diverse media types through the Gemini API opens up a wide range of applications across scientific disciplines. In linguistics and humanities research, text, audio, and video can be used to analyze language patterns, conduct speech recognition, or study historical footage. Social scientists can process interview recordings, video observations, and there might other interesting application across a whole range of fields. By integrating text, audio, video, images, and code into a unified workflow, using the Gemini API with tidyllm offers many possibilities that are not available with the standard multimodal tools in llm_message().