Video and Audio Data with the Gemini API
⚠️ Note: This article refers to the development version 0.2.5 of tidyllm, which has a changed interface compared to the last CRAN release. You can install the current development version directly from GitHub using devtools:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install TidyLLM from GitHub
devtools::install_github("edubruell/tidyllm")
The tidyllm package aims to implement some
API-specific functionality beyond the main features that are available
on all APIs. With the Google Gemini API, you can upload files
directly, in a range of formats that includes video and audio. You
can then use uploaded files in the context of your messages through the
.fileid argument of gemini_chat(). This makes it possible to ask
questions about recordings, footage, and scanned documents from within
a tidyllm workflow.
Sending an example video to Gemini
We’ll explore how to leverage this functionality by uploading The Thinking Machine, an interesting 1960s TV segment on AI, and interacting with Gemini using this file as context. We have this segment on our PC as an mp4 file and can simply use gemini_upload_file() to make it available in gemini() requests.
library(tidyverse)
library(tidyllm)
upload_info <- gemini_upload_file("the_thinking_machine_1960s.mp4")
upload_info
str(upload_info)
## # A tibble: 1 × 7
## name display_name mime_type size_bytes create_time uri
## <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 files… the_thinkin… video/mp4 6520447 2024-11-18… http…
## # ℹ 1 more variable: state <chr>
## tibble [1 × 7] (S3: tbl_df/tbl/data.frame)
## $ name : chr "files/wutpmcv9jve9"
## $ display_name: chr "the_thinking_machine_1960s.mp4"
## $ mime_type : chr "video/mp4"
## $ size_bytes : num 6520447
## $ create_time : chr "2024-11-18T09:40:29.603980Z"
## $ uri : chr "https://generativelanguage.googleapis.com/v1beta/files/wutpmcv9jve9"
## $ state : chr "PROCESSING"
When you use gemini_upload_file(), it returns a tibble containing
detailed metadata about the uploaded file. The output includes several
key columns: name (a unique identifier for the file on the server),
display_name (the original file name), mime_type (the file’s media
type, e.g., video/mp4), size_bytes (the file size in bytes),
create_time (a timestamp indicating when the file was uploaded), uri
(a URL pointing to the file’s location on Google’s servers), and state
(the current processing status of the file).
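The upload output above reports the state PROCESSING, so it can be useful to wait until a file is ready before sending a request. The following is a minimal sketch of such a polling loop, not part of tidyllm: `wait_until_active()` is a hypothetical helper, and the state values "ACTIVE" and "FAILED" are assumptions based on the Gemini Files API rather than anything shown in this article. The demonstration runs offline with simulated states.

```r
# Hypothetical helper (not a tidyllm function): repeatedly calls get_state()
# until it returns "ACTIVE". The "ACTIVE"/"FAILED" values are assumed from
# the Gemini Files API.
wait_until_active <- function(get_state, max_tries = 10, wait_seconds = 2) {
  for (attempt in seq_len(max_tries)) {
    state <- get_state()
    if (identical(state, "ACTIVE")) {
      return(invisible(attempt))  # number of checks that were needed
    }
    if (identical(state, "FAILED")) {
      stop("File processing failed on the server.")
    }
    Sys.sleep(wait_seconds)
  }
  stop("File was still not ACTIVE after ", max_tries, " checks.")
}

# Offline demonstration with a simulated state sequence instead of the API
simulated_states <- c("PROCESSING", "PROCESSING", "ACTIVE")
check_count <- 0
fake_state <- function() {
  check_count <<- check_count + 1
  simulated_states[check_count]
}
attempts_needed <- wait_until_active(fake_state, wait_seconds = 0)

# With the real API, the state could be read from the file listing, e.g.:
# wait_until_active(function() {
#   files <- gemini_list_files()
#   files$state[files$name == upload_info$name]
# })
```

Passing the state lookup in as a function keeps the loop independent of how the state is fetched, which also makes it easy to try out without an API key.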
From this range of metadata, you primarily need the name, which you
can pass to the .fileid argument of gemini():
llm_message("Give me a 200 word summary of this video") |>
  chat(gemini(.fileid = upload_info$name))
## Message History:
## system: You are a helpful assistant
## --------------------------------------------------------------
## user: Give me a 200 word summary of this video
## --------------------------------------------------------------
## assistant: Here is a summary of the video.
##
## This 1961 black and white Paramount News clip explores the burgeoning field of artificial intelligence. The host interviews Professor Jerome B. Wiesner, director of MIT's Research Laboratory of Electronics. Wiesner discusses the capabilities of computers, noting that while their abilities were previously limited, he suspects they'll be able to "think" within a few years. The segment shows a computer playing checkers against a human opponent. Other experts, including Oliver Selfridge and Claude Shannon, offer their perspectives on whether machines can truly think and the potential implications for the future, particularly in language translation. One expert predicts that within 10–15 years, machines will be performing tasks previously considered the realm of human intelligence. The film ends by showing a computer that translates Russian to English. Despite this success, one expert notes that computers will not replace translators of poetry and novels.
## --------------------------------------------------------------
Once a file is uploaded, you can reuse it for different requests. Note
that Gemini also supports tidyllm_schema(), allowing you to get
structured responses for all file types that you can use with Gemini:
structured_request <- llm_message("Extract some details about this video") |>
  chat(gemini(.fileid = upload_info$name),
       .json_schema = tidyllm_schema(name = "videoschema",
                                     Persons = "character[]",
                                     ScientistQuotes = "character[]",
                                     Topics = "character[]",
                                     runtime = "numeric"))

structured_request |>
  get_reply_data()
## $Persons
## [1] "Professor Jerome B. Wiesner" "Oliver G. Selfridge"
## [3] "Claude Shannon"
##
## $ScientistQuotes
## [1] "Well, that's a very hard question to answer. If you'd asked me that question just a few years ago I'd have said it was very far fetched. And today I just have to admit I don't really know. I suspect that if you'd come back in four or five years I'll say sure they really do think."
## [2] "I'm convinced that machines can and will think. I don't mean that machines will behave like men. I don't think for a very long time we're going to have a difficult problem distinguishing a man from a robot. And I don't think my daughter will ever marry a computer. But I think the computers will be doing the things that men do when we say they're thinking. I am convinced that machines can and will think in our lifetime."
## [3] "I confidently expect that within a matter of 10 or 15 years, something will emerge from the laboratories which is not too far from the robot of science fiction fame"
##
## $Topics
## [1] "Artificial Intelligence" "Computer"
## [3] "Thinking Machine" "Machine Translation"
## [5] "Russian to English Translation" "Cold War"
##
## $runtime
## [1] 172
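Because the structured reply is a named list, it can be reshaped with ordinary R tools. The sketch below is illustrative only: `video_details` is a stand-in list whose values are copied from the output above (so the block runs without an API call), and treating runtime as seconds is an assumption, since the schema only asks for a number.

```r
# Stand-in for the list returned by get_reply_data();
# values copied from the printed output above
video_details <- list(
  Persons = c("Professor Jerome B. Wiesner", "Oliver G. Selfridge",
              "Claude Shannon"),
  Topics = c("Artificial Intelligence", "Computer", "Thinking Machine",
             "Machine Translation", "Russian to English Translation",
             "Cold War"),
  runtime = 172
)

# One row per topic; the scalar runtime is recycled across rows
topics_df <- data.frame(
  topic = video_details$Topics,
  runtime = video_details$runtime
)

# Assumption: runtime is in seconds; format it as minutes:seconds
runtime_label <- sprintf("%d:%02d",
                         video_details$runtime %/% 60,
                         video_details$runtime %% 60)
```

If the unit matters for later analysis, it is probably safer to state it in the prompt than to infer it from the returned value.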
gemini_list_files() creates a tibble with an overview of the
files you have uploaded to Gemini. You can use it to check which files
are currently available for use with Gemini.
gemini_files <- gemini_list_files()
gemini_files
## # A tibble: 3 × 10
## name display_name mime_type size_bytes create_time update_time
## <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 file… example.mp3 audio/mp3 2458836 2024-11-15… 2024-11-15…
## 2 file… the_thinkin… video/mp4 6520447 2024-11-18… 2024-11-18…
## 3 file… example_akt… applicat… 193680 2024-11-17… 2024-11-17…
## # ℹ 4 more variables: expiration_time <chr>, sha256_hash <chr>,
## # uri <chr>, state <chr>
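Since the listing is a regular tibble, it can be filtered by any of its columns, for instance to select files by media type instead of by row position. The sketch below uses a small stand-in data frame so that it runs offline: only the column names come from the listing above, while the file names, the third entry, and the state values are hypothetical placeholders.

```r
# Stand-in for the result of gemini_list_files(); the name values,
# the third row, and the states are hypothetical placeholders
gemini_files_demo <- data.frame(
  name = c("files/placeholder-a", "files/placeholder-b", "files/placeholder-c"),
  display_name = c("example.mp3", "the_thinking_machine_1960s.mp4",
                   "document.pdf"),
  mime_type = c("audio/mp3", "video/mp4", "application/pdf"),
  state = c("ACTIVE", "ACTIVE", "PROCESSING"),
  stringsAsFactors = FALSE
)

# Keep only audio files, selected by the prefix of their MIME type
is_audio <- startsWith(gemini_files_demo$mime_type, "audio/")
audio_files <- gemini_files_demo[is_audio, ]

# The name column holds the value that would be passed as .fileid
audio_file_id <- audio_files$name[1]
```

Selecting by mime_type or display_name is less fragile than indexing by row, because the order of the listing may change as files are added or expire.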
For example, we can now ask gemini() what the audio file
we uploaded earlier is about:
llm_message("What's in this audio file. Describe it in 100 words.") |>
  gemini_chat(.fileid = gemini_files$name[1])
## Message History:
## system: You are a helpful assistant
## --------------------------------------------------------------
## user: What's in this audio file. Describe it in 100 words.
## --------------------------------------------------------------
## assistant: This is an audio recording of an interview with Robert Bosch. He discusses his early life, his career path from apprentice to founder of his renowned company, and his philosophy on business. Bosch recounts his decision to become a precision mechanic, his travels to America and England, and the development of his famous magneto ignition system. He emphasizes his principles of high-quality work, fair treatment of employees, and the importance of trust and reliability in business dealings. The interview concludes with well wishes.
## --------------------------------------------------------------
Once you have finished working with a file, you can delete it with
gemini_delete_file():
gemini_files$name[1] |>
  gemini_delete_file()
## File files/nzq2zw9u30y8 has been successfully deleted.
Supported file types
The Google Gemini API supports a wide range of file formats, enabling
multimodal workflows. For documents, it handles PDFs, plain text,
HTML, CSS, Markdown, CSV tables, XML, and RTF. Uploading documents with
gemini_upload_file() is useful because the standard llm_message()
function with the .pdf argument extracts only the textual content from
PDFs. The Google Gemini API, in contrast, supports multimodal PDFs,
including those that contain images or are image-only (such as scanned
documents). This allows users to work with both textual and visual data
in their interactions.
For example, we can upload an old scan of a paper that contains no extractable text at all and ask Gemini to extract all the references:
neal1995 <- gemini_upload_file("1995_Neal_Industry_Specific.pdf")

bib <- llm_message("Extract all the references from this paper in the specified format") |>
  chat(gemini(.fileid = neal1995$name),
       .json_schema = tidyllm_schema(name = "references",
                                     APAreferences = "character[]"))

references <- bib |> get_reply_data()
references[1:5]
## [1] "Addison, John, and Portugal, Pedro. \"Job Displacement, Relative Wage Changes, and Duration of Unemployment.\" *Journal of Labor Economics* 7 (July 1989): 281–302."
## [2] "Altonji, Joseph, and Shakotko, Robert. \"Do Wages Rise with Seniority?\" *Review of Economic Studies* 54 (July 1987): 437–59."
## [3] "Becker, Gary. *Human Capital*. New York: Columbia University Press, 1975."
## [4] "Carrington, William. \"Wage Losses for Displaced Workers: Is It Really the Firm That Matters?\" *Journal of Human Resources* 28 (Summer 1993): 435–62."
## [5] "Carrington, William, and Zaman, Asad. \"Interindustry Variation in the Costs of Job Displacement.\" *Journal of Labor Economics* 12 (April 1994): 243–76."
Image formats for gemini_upload_file() include PNG, JPEG, WEBP,
HEIC, and HEIF, while supported video formats cover MP4, MPEG,
MOV, AVI, FLV, MPG, WEBM, WMV, and 3GPP. For audio, the API works with
WAV, MP3, AIFF, AAC, OGG Vorbis, and FLAC. Code files such as JavaScript
and Python are also supported.
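The format lists above can be turned into a quick local check before uploading. The following is a rough sketch, not part of tidyllm: `media_category()` is a hypothetical helper, and the mapping from format names to file extensions (for example "md" for Markdown or "jpg" for JPEG) is an assumption, since the article lists formats rather than extensions.

```r
# Assumed extension lists, derived from the formats named in the text;
# code files are left out because the text only gives two examples
supported_extensions <- list(
  document = c("pdf", "txt", "html", "css", "md", "csv", "xml", "rtf"),
  image    = c("png", "jpg", "jpeg", "webp", "heic", "heif"),
  video    = c("mp4", "mpeg", "mov", "avi", "flv", "mpg", "webm", "wmv",
               "3gpp"),
  audio    = c("wav", "mp3", "aiff", "aac", "ogg", "flac")
)

# Hypothetical helper: returns the media category for a file path,
# or NA if the extension is not in the lists above
media_category <- function(path) {
  ext <- tolower(tools::file_ext(path))
  for (category in names(supported_extensions)) {
    if (ext %in% supported_extensions[[category]]) {
      return(category)
    }
  }
  NA_character_
}

media_category("the_thinking_machine_1960s.mp4")
media_category("1995_Neal_Industry_Specific.pdf")
media_category("notes.docx")
```

A check like this only looks at the file name; the API decides for itself whether it accepts the actual content.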
Conclusion
The ability to work with diverse media types through the Gemini API
opens up a wide range of applications across scientific disciplines. In
linguistics and humanities research, text, audio, and video can be used
to analyze language patterns, conduct speech recognition, or study
historical footage. Social scientists can process interview recordings
and video observations, and there may be other interesting applications
across a whole range of fields. By integrating text, audio, video,
images, and code into a unified workflow, using the Gemini API with
tidyllm offers many possibilities that are not available with the
standard multimodal tools in llm_message().