
Implement batch processing #143

Open
dylanpieper opened this issue Nov 1, 2024 · 26 comments

Comments

@dylanpieper

dylanpieper commented Nov 1, 2024

Implement batch processing in ellmer to process lists of inputs through a single prompt. Batch processing enables running multiple inputs to get multiple responses while maintaining safety and the ability to leverage ellmer's other features.

Basic Features (likely in scope):

  • Parallel or sequential (single "parallel") API requests
  • Prompt caching
  • Integration with ellmer's core functionality
    • Tooling and structured data extraction
    • Error, retry, and resume handling
    • Token and rate limit handling

Ambitious Features / Extensions (likely out of scope):

  • Asynchronous batching (polling process)
  • Similarity scoring for multi-model/provider requests or prompt tweaking and eval
    (something like inter-rater reliability but for LLMs)
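
For context, the baseline today, without such a feature, is a plain sequential loop over a cloned chat, which handles none of the retry, resume, or rate-limit concerns listed above (a minimal sketch using purrr; the prompts are placeholders):

library(ellmer)
library(purrr)

chat <- chat_openai("You reply succinctly")
prompts <- list(
  "Summarise record 1.",
  "Summarise record 2."
)

# One clone per prompt keeps the conversations independent; everything runs
# sequentially, and any failure stops the loop.
responses <- map(prompts, \(prompt) chat$clone()$chat(prompt, echo = FALSE))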
@hadley
Member

hadley commented Jan 23, 2025

Do you think this code could live in ellmer itself? Do you have any sense of what the UI might look like?

@hadley hadley mentioned this issue Jan 23, 2025
@hadley
Member

hadley commented Jan 24, 2025

I was thinking that in a batch scenario you still want to be able to do all the same things that you do in a regular chat, but you just want to do many of them at once. So maybe an interface like this would work:

library(ellmer)

chat <- chat_openai("You reply succinctly")
chat$register_tool(tool(function() Sys.Date(), "Return the current date"))
chat$chat_batch(list(
  "What's the date today?",
  "What's the date tomorrow?",
  "What's the date yesterday?",
  "What's the date next week?",
  "What's the date last week?"
))

I think the main challenge is what chat_batch() would return. I think it would make sense to return a list of text answers, but then how do you access the chat if there's a problem? Maybe we just have to accept that there's no easy way to get them, and if you want to debug you'd need to do it with individual requests.

Other thoughts:

  • Would also need a batched version of structured data extraction.
  • Will need error handling like req_perform_parallel(): https://httr2.r-lib.org/reference/req_perform_parallel.html#arg-on-error (see the httr2 sketch after this list).
  • Unlike req_perform_parallel(), it probably needs an explicit argument for parallelism.
  • Likely to require work on req_perform_parallel() in order to support req_retry() (and possibly uploading OAuth tokens).
  • Will want to implement after #107 (Think about prompt caching), since you'd want to cache the base chat.
  • This would probably not support asynchronous batch requests (e.g. https://www.anthropic.com/news/message-batches-api), since that needs a separate polling process. Although we could have an async = TRUE argument that then returns a batch object with a polling method? There's some argument that it would make sense for chat_batch() to always return such an object, since that could provide better interrupt support and a way to access individual chat objects for debugging. But that would require rearchitecting req_perform_parallel() to run in the background.
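
For reference, a minimal httr2 sketch of the error-handling pieces mentioned above (plain httr2, not ellmer code; the URLs are placeholders):

library(httr2)

reqs <- lapply(
  c("https://example.com/a", "https://example.com/b"),
  function(url) request(url) |> req_retry(max_tries = 3)
)

# on_error = "continue" records failures instead of stopping the whole batch
resps <- req_perform_parallel(reqs, on_error = "continue")

resps_successes(resps)  # responses that succeeded
resps_failures(resps)   # error conditions for requests that failed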

dylanpieper changed the title from "Proposal for helmer: Build LLM Chat Pipelines from a Data Frame or Tibble" to "Build LLM Chat Pipelines (Batch Processing)" on Jan 24, 2025
@dylanpieper
Author

dylanpieper commented Jan 24, 2025

After revisiting this, I do think my original ideas were ambitious, and I would like to see this code in the ellmer package if possible. I revised my original post to simplify the concept.

I think your interface idea is a good start. As for the returned object, would it be possible to return not only a list of the text answers but also a nested list with each chat object for diagnostics? That could be a beefy object for big batches, though (maybe only return the chat object if there was an issue?).

@dylanpieper
Author

@t-emery - Adding Teal because I know he has experience using ellmer for batch processing.

@t-emery

t-emery commented Jan 24, 2025

FWIW, my use-case was batched structured data extraction.

I was reading text data from a tibble and doing structured data extraction (x 18,000). In my initial attempts, I found that I was using far more tokens than made sense given the inputs. Eventually I found that the workable answer was to clear the turns after each call. Then everything worked as expected.

I'm new to using LLM APIs, so I have great humility that I might be missing something simple. I haven't had time to think about this deeply yet, and I haven't figured out what part of this is a documentation issue (creating documentation about best practices for running a lot of data) versus a feature issue.

Here are the relevant functions. The key fix was:

# Clear all prior turns so we start fresh for *each* record
chat$set_turns(list())

library(dplyr)
library(purrr)
library(tidyr)

# 1. Core Classification Function ----
classify_single_project <- function(text, chat, type_spec) {
  result <- purrr::safely(
    ~ chat$extract_data(text, type = type_spec),
    otherwise = NULL,
    quiet = TRUE
  )()
  
  tibble::tibble(
    success = is.null(result$error),
    error_message = if(is.null(result$error)) NA_character_ else as.character(result$error),
    classification = list(result$result)
  )
}

# 2. Process a chunk of data ----
# 3. Modify process_chunk to handle provider types
process_chunk <- function(chunk, chat, type_spec, provider_type = NULL) {
  # Use the provider_type parameter to determine which classification function to use
  classify_fn <- if (!is.null(provider_type) && provider_type == "deepseek") {
    classify_single_project_deepseek
  } else {
    classify_single_project
  }
  
  chunk |>
    mutate(
      classification_result = map(
        combined_text,
        ~{
          # Clear all prior turns so we start fresh for *each* record
          chat$set_turns(list())
          
          classify_fn(.x, chat, type_spec)
        }
      )
    ) |>
    unnest(classification_result) |>
    mutate(
      primary_class = map_chr(classification, ~.x$classification$primary %||% NA_character_),
      confidence = map_chr(classification, ~.x$classification$confidence %||% NA_character_),
      project_type = map_chr(classification, ~.x$classification$project_type %||% NA_character_),
      justification = map_chr(classification, ~.x$justification %||% NA_character_),
      evidence = map_chr(classification, ~.x$evidence %||% NA_character_)
    ) |>
    select(-classification)
}

hadley changed the title from "Build LLM Chat Pipelines (Batch Processing)" to "Implement batch processing" on Jan 24, 2025
@hadley
Member

hadley commented Jan 27, 2025

@t-emery a slightly better way to ensure that the chats are standalone is to call chat$clone().
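
For example, in the process_chunk() helper above, the set_turns() reset could be replaced with a per-record clone (a sketch, assuming dplyr and purrr are attached as before):

chunk |>
  mutate(
    classification_result = map(
      combined_text,
      # Each record gets its own standalone copy of the configured chat,
      # so no turns leak between records.
      ~ classify_fn(.x, chat$clone(), type_spec)
    )
  )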

@dylanpieper
Author

dylanpieper commented Jan 27, 2025

Beyond retries, it would be great to handle interrupted batches and resume when you re-call the function. For example:

library(ellmer)

chat <- chat_openai("You reply succinctly")

chat$register_tool(tool(
  function() Sys.Date(), 
  "Return the current date"
))

prompts <- list(
  "What's the date today?",
  "What's the date tomorrow?",
  "What's the date yesterday?",
  "What's the date next week?",
  "What's the date last week?"
)

#  Initial processing (interrupted and returns a partial object)
responses <- chat$chat_batch(prompts)

#  Resume processing
responses <- chat$chat_batch(prompts, responses)

Where chat_batch() may look something like:

chat_batch <- function(prompts, last_chat = NULL) {
  if (!is.null(last_chat) && !last_chat$complete) {
    chat$process_batch(prompts, start_at = last_chat$last_index + 1)
  } else {
    chat$process_batch(prompts)
  }
}

@hadley
Member

hadley commented Jan 27, 2025

@dylanpieper I think it's a bit better to return a richer object that lets you resume. Something like this maybe:

batched_chat <- chat$chat_batch(prompts)
batched_chat$process()
# ctrl + c
batched_chat$process() # resumes where the work left off

That object would also be the way you get either rich chat objects or simple text responses:

batched_chat$texts()
batched_chat$chats()

For the batch (not parallel) case, where it might take up to 24 hours, the function would also need to serialise something to disk, so if your R process completely dies, you can resume it in a fresh session:

batched_chat <- resume(some_path)

(All of these names are up for discussion, I just brain-dumped quickly.)
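
For concreteness, here is one rough way such an object could hang together (every name is hypothetical; ellmer's Chat is R6, so R6 is used for illustration, and the disk-serialisation and async cases are ignored):

library(ellmer)
library(R6)

BatchedChat <- R6Class("BatchedChat",
  public = list(
    initialize = function(chat, prompts) {
      private$chat <- chat
      private$prompts <- prompts
      private$results <- vector("list", length(prompts))
    },
    process = function() {
      for (i in seq_along(private$prompts)) {
        if (!is.null(private$results[[i]])) next  # already done; resume here
        this_chat <- private$chat$clone()
        text <- this_chat$chat(private$prompts[[i]], echo = FALSE)
        private$results[[i]] <- list(chat = this_chat, text = text)
      }
      invisible(self)
    },
    texts = function() lapply(private$results, function(x) x$text),
    chats = function() lapply(private$results, function(x) x$chat)
  ),
  private = list(chat = NULL, prompts = NULL, results = list())
)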

@dylanpieper
Author

The naming seems good and the functional OOP style is elegant for sure. I can start tinkering with defining an S7 class and writing functions for the sequential batch case. I need to get hands-on and wrap my head around the ellmer architecture. I feel less confident about the error handling needs, which is likely where I'll need more help.

@hadley
Member

hadley commented Feb 1, 2025

I'm working on parallel requests, which I think should be superior to sequential (especially since you'll be able to opt in to sequential by using just a single "parallel" request). Probably won't get this finished for a week or two, since we have a company get-together and I have to think through how to resolve the current limitations of req_perform_parallel().

@dylanpieper
Author

@hadley Can you explain the reasoning for defaulting to parallel? I was thinking that the only advantage of parallel is for multi-model or multi-provider batches, where you are seen as a new requester per API call. What I've read about single-model parallelism is that it is often slower because of the provider's rate limits per model.

@hadley
Member

hadley commented Feb 2, 2025

Looking at the OpenAI rate limits suggests that parallelism is going to be a big improvement for most use cases. (Plus I'd rather solve this problem once, in a way that also benefits the design of httr2.)

@bengowan

bengowan commented Feb 5, 2025

Just an FYI, naming-wise: there is something on both OpenAI and Anthropic called batch processing, which is more of an overnight job-submission option.

Just wanted to alert you in case you wanted to distinguish this as parallel_chats or something along those lines to avoid confusion.

@Marwolaeth

@t-emery a slightly better way to ensure that the chats are standalone is to call chat$clone().

The tutorial states: “The most programmatic way to chat is to create the chat object inside a function. By doing so, live streaming is automatically suppressed, and chat() returns the result as a string.” For me, it seemed intuitive to avoid creating multiple chat instances and instead use a single chat object for a function that I wanted to vectorise over a batch of texts. However, the Ellmer Assistant convinced me that creating a chat object inside a function is safer and worth the overhead it introduces. I wonder if combining chat <- chat$clone() with chat$set_turns(list()) inside a function is just as safe while avoiding the creation of multiple chats with the same system prompt.
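
A sketch of the two patterns in question (chat_openai() stands in for any provider; base_chat is assumed to be a pre-configured chat object):

library(ellmer)

# Option 1: create a fresh chat object inside the function for every text.
classify_texts_fresh <- function(texts, system_prompt) {
  vapply(texts, function(txt) {
    chat <- chat_openai(system_prompt = system_prompt)
    chat$chat(txt)
  }, character(1))
}

# Option 2: clone a pre-configured chat and clear its turns for every text.
classify_texts_clone <- function(texts, base_chat) {
  vapply(texts, function(txt) {
    chat <- base_chat$clone()  # keeps the system prompt and settings
    chat$set_turns(list())     # start with no prior turns
    chat$chat(txt)
  }, character(1))
}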

@hadley
Member

hadley commented Feb 5, 2025

@Marwolaeth why are you concerned with creating multiple chat objects?

@Marwolaeth

@Marwolaeth why are you concerned with creating multiple chat objects?

I wasn't sure whether it would introduce significant overhead. I also suspected that a chat object might embed the system prompt upon creation, which could lead to wasted resources since I have a rather large prompt. However, a microbenchmark demonstrated that there is no difference in performance between these options, at least for chat_ollama(), proving my suspicions wrong.

So the final answer: creating multiple chats does not degrade performance and should be preferred.

I should have realized that a chat object is an R object and does not interact with any API upon creation. I apologize for the oversight.
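
For anyone wanting to reproduce that check, a rough sketch of the comparison with the bench package (chat_openai() is used as a stand-in here; assumes OPENAI_API_KEY is set, although no API request is made at object creation):

library(bench)
library(ellmer)

big_prompt <- paste(rep("You are a meticulous classifier of records.", 200),
                    collapse = " ")
base_chat <- chat_openai(model = "gpt-4o-mini", system_prompt = big_prompt)

# Object creation makes no API call, so this only measures R-side overhead.
bench::mark(
  fresh_object  = chat_openai(model = "gpt-4o-mini", system_prompt = big_prompt),
  cloned_object = base_chat$clone(),
  check = FALSE
)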

@dylanpieper
Author

I packaged my interim batch processing solution (sequential) until this is complete. https://github.com/dylanpieper/hellmer

@dylanpieper
Author

I added parallel processing to my package via future/furrr and it's indeed a big improvement. I now see the logic in doing it right the first time with httr2 (at least as a core ellmer feature). I'm not sure how well my retry handling will hold up in the wild, but I urge people to try it out; it's much better than any other solution I've encountered thus far. https://github.com/dylanpieper/hellmer

@CorradoLanera

Hi everyone! I came across this issue and this comment in particular, and I would like to add my thoughts (as an intensive user of batch LLM calls). The approach you are currently discussing seems to rely on parallel calls rather than genuine "batch" calls. With a proper batch feature (similar to OpenAI's), your calls can run without keeping your session active or even your system switched on; there is a substantially reduced rate for bulk processing (i.e., batch_price = standard_price/2), and a guarantee your job will finish within the next 24h (in my experience, often a lot less!). This can be especially helpful when dealing with large workloads (e.g., ~10^4+ calls).

If {ellmer} provided a genuine batch interface (that handles object creation for batch submissions, submissions themselves, easy completion checks, global and individual error management, result retrieval/aggregation, ...), it could substantially benefit many users. Those who need extensive parallelization might already have tools (like {targets}) that handle retries, logging, and scaling well and might not see as much benefit from a new chat_batch() function that runs parallel calls behind the scenes. Still, a batch feature in {ellmer} would be a great addition for others.

For example, what advantages would a "parallel" chat_batch() approach offer in the following scenario? (And how readily would it generalize to structured output, function calls, data-interpolated prompts, and all the other great {ellmer} features, or, e.g., automatically manage partial changes to some of the prompts, sending only the modified/updated ones to the LLM?)

library(targets)

tar_dir({
  tar_script({
    library(targets)
    library(tarchetypes)
    library(crew)
    
    ellmer_controller <- crew_controller_local(
      name = "ellmer's crew controller",
      workers = 2
    )
    
    tar_option_set(
      controller = ellmer_controller,
      packages = c("ellmer", "dplyr")
    )
    
    list(
      tar_target(
        chatMaster,
        chat_openai(system_prompt = "Reply concisely, one sentence.")
      ),
      tar_target(
        prompts, 
        c(
          "What is the meaning of life?",
          "What is the best way to learn R?",
          "What is the best way to learn Tidyverse?"
        )
      ),
      tar_target(
        chatResponses,
        {
          tibble(
            chat = list(chatMaster$clone()),
            response = chat[[1]]$chat(prompts),
            tokens_input = chat[[1]]$tokens()[3, "input"],
            tokens_output = chat[[1]]$tokens()[3, "output"]
          )
        },
        pattern = map(prompts)
      )
    )
  })
  tar_make()
  tar_read(chatResponses) |> print()
  tar_crew()
})
#> ▶ dispatched target chatMaster
#> ▶ dispatched target prompts
#> ● completed target prompts [0.125 seconds, 118 bytes]
#> ● completed target chatMaster [0.172 seconds, 8.91 kilobytes]
#> ▶ dispatched branch chatResponses_8aae0534d8dc1db8
#> ▶ dispatched branch chatResponses_05413695e06a1f47
#> ● completed branch chatResponses_8aae0534d8dc1db8 [1.078 seconds, 13.392 kilobytes]
#> ▶ dispatched branch chatResponses_eec392bfe0cfa1d2
#> ● completed branch chatResponses_05413695e06a1f47 [1.125 seconds, 13.464 kilobytes]
#> ● completed branch chatResponses_eec392bfe0cfa1d2 [0.938 seconds, 13.469 kilobytes]
#> ● completed pattern chatResponses 
#> ▶ ended pipeline [4.578 seconds]
#> # A tibble: 3 × 4
#>   chat   response                                     tokens_input tokens_output
#>   <list> <chr>                                               <dbl>         <dbl>
#> 1 <Chat> The meaning of life is subjective and can v…           26            20
#> 2 <Chat> The best way to learn R is to combine hands…           28            40
#> 3 <Chat> The best way to learn Tidyverse is by follo…           30            39
#> # A tibble: 2 × 4
#>   controller               worker seconds targets
#>   <chr>                    <chr>    <dbl>   <int>
#> 1 ellmer's crew controller 1         2.30       3
#> 2 ellmer's crew controller 2         1.36       2

Created on 2025-02-17 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31 ucrt)
#>  os       Windows 11 x64 (build 26100)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       Europe/Rome
#>  date     2025-02-17
#>  pandoc   3.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  backports     1.5.0   2024-05-23 [1] CRAN (R 4.4.0)
#>  base64url     1.4     2018-05-14 [1] CRAN (R 4.4.1)
#>  callr         3.7.6   2024-03-25 [1] CRAN (R 4.4.1)
#>  cli           3.6.3   2024-06-21 [1] CRAN (R 4.4.1)
#>  codetools     0.2-20  2024-03-31 [2] CRAN (R 4.4.2)
#>  coro          1.1.0   2024-11-05 [1] CRAN (R 4.4.2)
#>  data.table    1.16.4  2024-12-06 [1] CRAN (R 4.4.2)
#>  digest        0.6.37  2024-08-19 [1] CRAN (R 4.4.1)
#>  ellmer        0.1.1   2025-02-07 [1] CRAN (R 4.4.2)
#>  evaluate      1.0.1   2024-10-10 [1] CRAN (R 4.4.1)
#>  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.1)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.1)
#>  fs            1.6.5   2024-10-30 [1] RSPM
#>  glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.1)
#>  httr2         1.1.0   2025-01-18 [1] CRAN (R 4.4.2)
#>  igraph        2.1.1   2024-10-19 [1] CRAN (R 4.4.2)
#>  knitr         1.49    2024-11-08 [1] RSPM
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.1)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.1)
#>  processx      3.8.4   2024-03-16 [1] CRAN (R 4.4.1)
#>  ps            1.8.1   2024-10-28 [1] RSPM (R 4.4.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.1)
#>  rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.4.1)
#>  reprex        2.1.1   2024-07-06 [1] CRAN (R 4.4.1)
#>  rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.1)
#>  rmarkdown     2.29    2024-11-04 [1] RSPM
#>  rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.2)
#>  S7            0.2.0   2024-11-07 [1] CRAN (R 4.4.2)
#>  secretbase    1.0.3   2024-10-02 [1] CRAN (R 4.4.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.1)
#>  targets     * 1.9.1   2024-12-04 [1] CRAN (R 4.4.2)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.1)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.1)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.1)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.1)
#>  withr         3.0.2   2024-10-28 [1] CRAN (R 4.4.2)
#>  xfun          0.49    2024-10-31 [1] RSPM
#>  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.1)
#> 
#>  [1] C:/Users/corra/AppData/Local/R/win-library/4.4
#>  [2] C:/Program Files/R/R-4.4.2/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

@dylanpieper
Author

@CorradoLanera This doesn't really address the ellmer implementation, but you can interact with batch APIs of several providers using tidyllm.

@hadley
Member

hadley commented Feb 17, 2025

@CorradoLanera yes, I understand the difference between parallel and batch calls.

The advantage of ellmer providing tools for parallel requests is that it's going to be much more efficient to do everything in the same process, rather than having to spin out multiple processes. It also makes it much easier to implement throttling.

@dylanpieper
Author

Parallel is also twice as fast, even without the graceful features that Hadley is planning.

Image

@hadley
Member

hadley commented Feb 17, 2025

@dylanpieper we'll see how much faster doing it in httr2 is, but I would hope it was more like 5-10x faster.

@hadley
Member

hadley commented Feb 17, 2025

Yeah, performance looks pretty good:

library(ellmer)
library(purrr)

prompts <- list(
  "What is the meaning of life?",
  "What is the best way to learn R?",
  "What is the best way to learn Tidyverse?",
  "What colours are in the rainbow?",
  "What's funny about legs?",
  "What's the capital of the United States?",
  "Where is the moon?"
)

chat <- chat_openai(system_prompt = "Reply concisely, one sentence.")
system.time(chats <- chat$chat_parallel(prompts))
#>    user  system elapsed 
#>   0.039   0.003   1.248 

chat <- chat_openai(system_prompt = "Reply concisely, one sentence.")
system.time(chats <- map(prompts, \(prompt) chat$clone()$chat(prompt, echo = FALSE)))
#>    user  system elapsed 
#>   0.770   0.017   5.602 

@dylanpieper
Author

That's awesome! I know there are still some limitations to req_perform_parallel (throttling, retries, cache, etc.) but I'm curious if the end goal will be to support all of the features (would be cool). Just for fun - this is what hellmer looks like. I interrupt it to show that behavior/UI.

Image

hadley added a commit that referenced this issue Feb 17, 2025
@hadley
Member

hadley commented Feb 17, 2025

@dylanpieper as of a couple of days ago, most of those limitations have been lifted. I implemented something similar for interrupts too.

hadley added a commit that referenced this issue Feb 18, 2025