
Implement batch processing #143

Open
dylanpieper opened this issue Nov 1, 2024 · 26 comments

Comments

@dylanpieper

dylanpieper commented Nov 1, 2024

Implement batch processing in ellmer to process lists of inputs through a single prompt. Batch processing enables running multiple inputs to get multiple responses while maintaining safety and the ability to leverage ellmer's other features.

Basic Features (likely in scope):

  • Parallel or sequential (single "parallel") API requests
  • Prompt caching
  • Integration with ellmer's core functionality
    • Tooling and structured data extraction
    • Error, retry, and resume handling
    • Token and rate limit handling

Ambitious Features / Extensions (likely out of scope):

  • Asynchronous batching (polling process)
  • Similarity scoring for multi-model/provider requests or prompt tweaking and eval
    (something like inter-rater reliability but for LLMs)
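
For context, the baseline today, without such a feature, is a plain sequential loop over a cloned chat, which handles none of the retry, resume, or rate-limit concerns listed above (a minimal sketch using purrr; the prompts are placeholders):

library(ellmer)
library(purrr)

chat <- chat_openai("You reply succinctly")
prompts <- list(
  "Summarise record 1.",
  "Summarise record 2."
)

# One clone per prompt keeps the conversations independent; everything runs
# sequentially, and any failure stops the loop.
responses <- map(prompts, \(prompt) chat$clone()$chat(prompt, echo = FALSE))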
@hadley
Member

hadley commented Jan 23, 2025

Do you think this code could live in ellmer itself? Do you have any sense of what the UI might look like?

@hadley hadley mentioned this issue Jan 23, 2025
@hadley
Member

hadley commented Jan 24, 2025

I was thinking that in a batch scenario you still want to be able to do all the same things that you do in a regular chat, but you just want to do many of them at once. So maybe an interface like this would work:

library(ellmer)

chat <- chat_openai("You reply succinctly")
chat$register_tool(tool(function() Sys.Date(), "Return the current date"))
chat$chat_batch(list(
  "What's the date today?",
  "What's the date tomorrow?",
  "What's the date yesterday?",
  "What's the date next week?",
  "What's the date last week?"
))

I think the main challenge is what chat_batch() would return. I think it would make sense to return a list of text answers, but then how do you access the chat if there's a problem? Maybe we just have to accept that there's no easy way to get them, and if you want to debug you'd need to do it with individual requests.

Other thoughts:

  • Would also need a batched version of structured data extraction.
  • Will need error handling like req_perform_parallel(): https://httr2.r-lib.org/reference/req_perform_parallel.html#arg-on-error (see the httr2 sketch after this list).
  • Unlike req_perform_parallel(), it probably needs an explicit argument for parallelism.
  • Likely to require work on req_perform_parallel() in order to support req_retry() (and possibly uploading OAuth tokens).
  • Will want to implement after #107 (Think about prompt caching), since you'd want to cache the base chat.
  • This would probably not support asynchronous batch requests (e.g. https://www.anthropic.com/news/message-batches-api), since that needs a separate polling process. Although we could have an async = TRUE argument that then returns a batch object with a polling method? There's some argument that it would make sense for chat_batch() to always return such an object, since that could provide better interrupt support and a way to access individual chat objects for debugging. But that would require rearchitecting req_perform_parallel() to run in the background.
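
For reference, a minimal httr2 sketch of the error-handling pieces mentioned above (plain httr2, not ellmer code; the URLs are placeholders):

library(httr2)

reqs <- lapply(
  c("https://example.com/a", "https://example.com/b"),
  function(url) request(url) |> req_retry(max_tries = 3)
)

# on_error = "continue" records failures instead of stopping the whole batch
resps <- req_perform_parallel(reqs, on_error = "continue")

resps_successes(resps)  # responses that succeeded
resps_failures(resps)   # error conditions for requests that failed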

dylanpieper changed the title from "Proposal for helmer: Build LLM Chat Pipelines from a Data Frame or Tibble" to "Build LLM Chat Pipelines (Batch Processing)" on Jan 24, 2025
@dylanpieper
Author

dylanpieper commented Jan 24, 2025

After revisiting this, I do think my original ideas were ambitious, and I would like to see this code in the ellmer package if possible. I revised my original post to simplify the concept.

I think your interface idea is a good start. As for the returned object, would it be possible to return not only a list of the text answers but also a nested list with each chat object for diagnostics? That could be a beefy object for big batches, though (maybe only return the chat object if there was an issue?).

@dylanpieper
Author

@t-emery - Adding Teal because I know he has experience using ellmer for batch processing.

@t-emery

t-emery commented Jan 24, 2025

FWIW, my use-case was batched structured data extraction.

I was reading text data from a tibble and doing structured data extraction (x 18,000). In my initial attempts, I found that I was using far more tokens than made sense given the inputs. Eventually I found that the workable answer was to clear the turns after each call. Then everything worked as expected.

I'm new to using LLM APIs, so I have great humility that I might be missing something simple. I haven't had time to think about this deeply yet, and I haven't figured out what part of this is a documentation issue (creating documentation about best practices for running a lot of data) versus a feature issue.

Here are the relevant functions. The key fix was:

# Clear all prior turns so we start fresh for *each* record
chat$set_turns(list())

library(dplyr)
library(purrr)
library(tidyr)

# 1. Core Classification Function ----
classify_single_project <- function(text, chat, type_spec) {
  result <- purrr::safely(
    ~ chat$extract_data(text, type = type_spec),
    otherwise = NULL,
    quiet = TRUE
  )()
  
  tibble::tibble(
    success = is.null(result$error),
    error_message = if(is.null(result$error)) NA_character_ else as.character(result$error),
    classification = list(result$result)
  )
}

# 2. Process a chunk of data ----
# 3. Modify process_chunk to handle provider types
process_chunk <- function(chunk, chat, type_spec, provider_type = NULL) {
  # Use the provider_type parameter to determine which classification function to use
  classify_fn <- if (!is.null(provider_type) && provider_type == "deepseek") {
    classify_single_project_deepseek
  } else {
    classify_single_project
  }
  
  chunk |>
    mutate(
      classification_result = map(
        combined_text,
        ~{
          # Clear all prior turns so we start fresh for *each* record
          chat$set_turns(list())
          
          classify_fn(.x, chat, type_spec)
        }
      )
    ) |>
    unnest(classification_result) |>
    mutate(
      primary_class = map_chr(classification, ~.x$classification$primary %||% NA_character_),
      confidence = map_chr(classification, ~.x$classification$confidence %||% NA_character_),
      project_type = map_chr(classification, ~.x$classification$project_type %||% NA_character_),
      justification = map_chr(classification, ~.x$justification %||% NA_character_),
      evidence = map_chr(classification, ~.x$evidence %||% NA_character_)
    ) |>
    select(-classification)
}

hadley changed the title from "Build LLM Chat Pipelines (Batch Processing)" to "Implement batch processing" on Jan 24, 2025
@hadley
Member

hadley commented Jan 27, 2025

@t-emery a slightly better way to ensure that the chats are standalone is to call chat$clone().
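
For example, in the process_chunk() helper above, the set_turns() reset could be replaced with a per-record clone (a sketch, assuming dplyr and purrr are attached as before):

chunk |>
  mutate(
    classification_result = map(
      combined_text,
      # Each record gets its own standalone copy of the configured chat,
      # so no turns leak between records.
      ~ classify_fn(.x, chat$clone(), type_spec)
    )
  )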

@dylanpieper
Author

dylanpieper commented Jan 27, 2025

Beyond retries, it would be great to handle interrupted batches and resume when you re-call the function. For example:

library(ellmer)

chat <- chat_openai("You reply succinctly")

chat$register_tool(tool(
  function() Sys.Date(), 
  "Return the current date"
))

prompts <- list(
  "What's the date today?",
  "What's the date tomorrow?",
  "What's the date yesterday?",
  "What's the date next week?",
  "What's the date last week?"
)

#  Initial processing (interrupted and returns a partial object)
responses <- chat$chat_batch(prompts)

#  Resume processing
responses <- chat$chat_batch(prompts, responses)

Where chat_batch() may look something like:

chat_batch <- function(prompts, last_chat = NULL) {
  if (!is.null(last_chat) && !last_chat$complete) {
    chat$process_batch(prompts, start_at = last_chat$last_index + 1)
  } else {
    chat$process_batch(prompts)
  }
}

@hadley
Member

hadley commented Jan 27, 2025

@dylanpieper I think it's a bit better to return a richer object that lets you resume. Something like this maybe:

batched_chat <- chat$chat_batch(prompts)
batched_chat$process()
# ctrl + c
batched_chat$process() # resumes where the work left off

That object would also be the way you get either rich chat objects or simple text responses:

batched_chat$texts()
batched_chat$chats()

For the batch (not parallel) case, where it might take up to 24 hours, the function would also need to serialise something to disk, so if your R process completely dies, you can resume it in a fresh session:

batched_chat <- resume(some_path)

(All of these names are up for discussion, I just brain-dumped quickly.)
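
For concreteness, here is one rough way such an object could hang together (every name is hypothetical; ellmer's Chat is R6, so R6 is used for illustration, and the disk-serialisation and async cases are ignored):

library(ellmer)
library(R6)

BatchedChat <- R6Class("BatchedChat",
  public = list(
    initialize = function(chat, prompts) {
      private$chat <- chat
      private$prompts <- prompts
      private$results <- vector("list", length(prompts))
    },
    process = function() {
      for (i in seq_along(private$prompts)) {
        if (!is.null(private$results[[i]])) next  # already done; resume here
        this_chat <- private$chat$clone()
        text <- this_chat$chat(private$prompts[[i]], echo = FALSE)
        private$results[[i]] <- list(chat = this_chat, text = text)
      }
      invisible(self)
    },
    texts = function() lapply(private$results, function(x) x$text),
    chats = function() lapply(private$results, function(x) x$chat)
  ),
  private = list(chat = NULL, prompts = NULL, results = list())
)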

@dylanpieper
Author

The naming seems good and the functional OOP style is elegant for sure. I can start tinkering with defining an S7 class and writing functions for the sequential batch case. I need to get hands-on and wrap my head around the ellmer architecture. I feel less confident about the error handling needs, which is likely where I'll need more help.

@hadley
Member

hadley commented Feb 1, 2025

I'm working on parallel requests, which I think should be superior to sequential (especially since you'll be able to opt in to sequential by using just a single "parallel" request). Probably won't get this finished for a week or two, since we have a company get-together and I have to think through how to resolve the current limitations of req_perform_parallel().

@dylanpieper
Author

@hadley Can you explain the reasoning for defaulting to parallel? I was thinking that the only advantage of parallel is for multi-model or multi-provider batches, where you are seen as a new requester per API call. What I've read about single-model parallelism is that it is often slower because of the provider's rate limits per model.

@hadley
Member

hadley commented Feb 2, 2025

Looking at the OpenAI rate limits suggests that parallelism is going to be a big improvement for most use cases. (Plus I'd rather solve this problem once, in a way that also benefits the design of httr2.)

@bengowan

bengowan commented Feb 5, 2025

Just an FYI, naming-wise: there is something on both OpenAI and Anthropic called batch processing, which is more of an overnight job-submission option.

Just wanted to alert you in case you wanted to distinguish this as parallel_chats or something along those lines to avoid confusion.

@Marwolaeth

@t-emery a slightly better way to ensure that the chats are standalone is to call chat$clone().

The tutorial states: “The most programmatic way to chat is to create the chat object inside a function. By doing so, live streaming is automatically suppressed, and chat() returns the result as a string.” For me, it seemed intuitive to avoid creating multiple chat instances and instead use a single chat object for a function that I wanted to vectorise over a batch of texts. However, the Ellmer Assistant convinced me that creating a chat object inside a function is safer and worth the overhead it introduces. I wonder if combining chat <- chat$clone() with chat$set_turns(list()) inside a function is just as safe while avoiding the creation of multiple chats with the same system prompt.
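
A sketch of the two patterns in question (chat_openai() stands in for any provider; base_chat is assumed to be a pre-configured chat object):

library(ellmer)

# Option 1: create a fresh chat object inside the function for every text.
classify_texts_fresh <- function(texts, system_prompt) {
  vapply(texts, function(txt) {
    chat <- chat_openai(system_prompt = system_prompt)
    chat$chat(txt)
  }, character(1))
}

# Option 2: clone a pre-configured chat and clear its turns for every text.
classify_texts_clone <- function(texts, base_chat) {
  vapply(texts, function(txt) {
    chat <- base_chat$clone()  # keeps the system prompt and settings
    chat$set_turns(list())     # start with no prior turns
    chat$chat(txt)
  }, character(1))
}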

@hadley
Member

hadley commented Feb 5, 2025

@Marwolaeth why are you concerned with creating multiple chat objects?

@Marwolaeth

@Marwolaeth why are you concerned with creating multiple chat objects?

I wasn't sure whether it would introduce significant overhead. I also suspected that a chat object might embed the system prompt upon creation, which could lead to wasted resources since I have a rather large prompt. However, a microbenchmark demonstrated that there is no difference in performance between these options, at least for chat_ollama(), proving my suspicions wrong.

So the final answer: creating multiple chats does not degrade performance and should be preferred.

I should have realized that a chat object is an R object and does not interact with any API upon creation. I apologize for the oversight.
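
For anyone wanting to reproduce that check, a rough sketch of the comparison with the bench package (chat_openai() is used as a stand-in here; assumes OPENAI_API_KEY is set, although no API request is made at object creation):

library(bench)
library(ellmer)

big_prompt <- paste(rep("You are a meticulous classifier of records.", 200),
                    collapse = " ")
base_chat <- chat_openai(model = "gpt-4o-mini", system_prompt = big_prompt)

# Object creation makes no API call, so this only measures R-side overhead.
bench::mark(
  fresh_object  = chat_openai(model = "gpt-4o-mini", system_prompt = big_prompt),
  cloned_object = base_chat$clone(),
  check = FALSE
)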

@dylanpieper
Author

I packaged my interim batch processing solution (sequential) until this is complete. https://github.com/dylanpieper/hellmer

@dylanpieper
Author

I added parallel processing to my package via future/furrr and it's indeed a big improvement. I now see the logic in doing it right the first time with httr2 (at least as a core ellmer feature). I'm not sure how well my retry handling will hold up in the wild, but I urge people to try it out; it's much better than any other solution I've encountered thus far. https://github.com/dylanpieper/hellmer

@CorradoLanera

Hi everyone! I came across this issue and this comment in particular, and I would like to add my thoughts (as an intensive user of batch LLM calls). The approach you are currently discussing seems to rely on parallel calls rather than genuine "batch" calls. With a proper batch feature (similar to OpenAI's), your calls can run without keeping your session active or even your system switched on; there is a substantially reduced rate for bulk processing (i.e., batch_price = standard_price/2), and a guarantee your job will finish within the next 24h (in my experience, often a lot less!). This can be especially helpful when dealing with large workloads (e.g., ~10^4+ calls).

If {ellmer} provided a genuine batch interface (that handles object creation for batch submissions, submissions themselves, easy completion checks, global and individual error management, result retrieval/aggregation, ...), it could substantially benefit many users. Those who need extensive parallelization might already have tools (like {targets}) that handle retries, logging, and scaling well and might not see as much benefit from a new chat_batch() function that runs parallel calls behind the scenes. Still, a batch feature in {ellmer} would be a great addition for others.

For example, what advantages would a "parallel" chat_batch() approach offer in the following scenario? (And how readily would it generalize to structured output, function calls, data-interpolated prompts, and all the other great {ellmer} features, or, e.g., automatically manage partial changes to some of the prompts, sending only the modified/updated ones to the LLM?)

library(targets)

tar_dir({
  tar_script({
    library(targets)
    library(tarchetypes)
    library(crew)
    
    ellmer_controller <- crew_controller_local(
      name = "ellmer's crew controller",
      workers = 2
    )
    
    tar_option_set(
      controller = ellmer_controller,
      packages = c("ellmer", "dplyr")
    )
    
    list(
      tar_target(
        chatMaster,
        chat_openai(system_prompt = "Reply concisely, one sentence.")
      ),
      tar_target(
        prompts, 
        c(
          "What is the meaning of life?",
          "What is the best way to learn R?",
          "What is the best way to learn Tidyverse?"
        )
      ),
      tar_target(
        chatResponses,
        {
          tibble(
            chat = list(chatMaster$clone()),
            response = chat[[1]]$chat(prompts),
            tokens_input = chat[[1]]$tokens()[3, "input"],
            tokens_output = chat[[1]]$tokens()[3, "output"]
          )
        },
        pattern = map(prompts)
      )
    )
  })
  tar_make()
  tar_read(chatResponses) |> print()
  tar_crew()
})
#> ▶ dispatched target chatMaster
#> ▶ dispatched target prompts
#> ● completed target prompts [0.125 seconds, 118 bytes]
#> ● completed target chatMaster [0.172 seconds, 8.91 kilobytes]
#> ▶ dispatched branch chatResponses_8aae0534d8dc1db8
#> ▶ dispatched branch chatResponses_05413695e06a1f47
#> ● completed branch chatResponses_8aae0534d8dc1db8 [1.078 seconds, 13.392 kilobytes]
#> ▶ dispatched branch chatResponses_eec392bfe0cfa1d2
#> ● completed branch chatResponses_05413695e06a1f47 [1.125 seconds, 13.464 kilobytes]
#> ● completed branch chatResponses_eec392bfe0cfa1d2 [0.938 seconds, 13.469 kilobytes]
#> ● completed pattern chatResponses 
#> ▶ ended pipeline [4.578 seconds]
#> # A tibble: 3 × 4
#>   chat   response                                     tokens_input tokens_output
#>   <list> <chr>                                               <dbl>         <dbl>
#> 1 <Chat> The meaning of life is subjective and can v…           26            20
#> 2 <Chat> The best way to learn R is to combine hands…           28            40
#> 3 <Chat> The best way to learn Tidyverse is by follo…           30            39
#> # A tibble: 2 × 4
#>   controller               worker seconds targets
#>   <chr>                    <chr>    <dbl>   <int>
#> 1 ellmer's crew controller 1         2.30       3
#> 2 ellmer's crew controller 2         1.36       2

Created on 2025-02-17 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31 ucrt)
#>  os       Windows 11 x64 (build 26100)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       Europe/Rome
#>  date     2025-02-17
#>  pandoc   3.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  backports     1.5.0   2024-05-23 [1] CRAN (R 4.4.0)
#>  base64url     1.4     2018-05-14 [1] CRAN (R 4.4.1)
#>  callr         3.7.6   2024-03-25 [1] CRAN (R 4.4.1)
#>  cli           3.6.3   2024-06-21 [1] CRAN (R 4.4.1)
#>  codetools     0.2-20  2024-03-31 [2] CRAN (R 4.4.2)
#>  coro          1.1.0   2024-11-05 [1] CRAN (R 4.4.2)
#>  data.table    1.16.4  2024-12-06 [1] CRAN (R 4.4.2)
#>  digest        0.6.37  2024-08-19 [1] CRAN (R 4.4.1)
#>  ellmer        0.1.1   2025-02-07 [1] CRAN (R 4.4.2)
#>  evaluate      1.0.1   2024-10-10 [1] CRAN (R 4.4.1)
#>  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.1)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.1)
#>  fs            1.6.5   2024-10-30 [1] RSPM
#>  glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.1)
#>  httr2         1.1.0   2025-01-18 [1] CRAN (R 4.4.2)
#>  igraph        2.1.1   2024-10-19 [1] CRAN (R 4.4.2)
#>  knitr         1.49    2024-11-08 [1] RSPM
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.1)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.1)
#>  processx      3.8.4   2024-03-16 [1] CRAN (R 4.4.1)
#>  ps            1.8.1   2024-10-28 [1] RSPM (R 4.4.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.1)
#>  rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.4.1)
#>  reprex        2.1.1   2024-07-06 [1] CRAN (R 4.4.1)
#>  rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.1)
#>  rmarkdown     2.29    2024-11-04 [1] RSPM
#>  rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.2)
#>  S7            0.2.0   2024-11-07 [1] CRAN (R 4.4.2)
#>  secretbase    1.0.3   2024-10-02 [1] CRAN (R 4.4.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.1)
#>  targets     * 1.9.1   2024-12-04 [1] CRAN (R 4.4.2)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.1)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.1)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.1)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.1)
#>  withr         3.0.2   2024-10-28 [1] CRAN (R 4.4.2)
#>  xfun          0.49    2024-10-31 [1] RSPM
#>  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.1)
#> 
#>  [1] C:/Users/corra/AppData/Local/R/win-library/4.4
#>  [2] C:/Program Files/R/R-4.4.2/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

@dylanpieper
Author

@CorradoLanera This doesn't really address the ellmer implementation, but you can interact with batch APIs of several providers using tidyllm.

@hadley
Member

hadley commented Feb 17, 2025

@CorradoLanera yes, I understand the difference between parallel and batch calls.

The advantage of ellmer providing tools for parallel requests is that it's going to be much more efficient to do everything in the same process, rather than having to spin out multiple processes. It also makes it much easier to implement throttling.

@dylanpieper
Author

Parallel is also twice as fast, even without the graceful features that Hadley is planning.

Image

@hadley
Member

hadley commented Feb 17, 2025

@dylanpieper we'll see how much faster doing it in httr2 is, but I would hope it was more like 5-10x faster.

@hadley
Member

hadley commented Feb 17, 2025

Yeah, performance looks pretty good:

library(ellmer)
library(purrr)

prompts <- list(
  "What is the meaning of life?",
  "What is the best way to learn R?",
  "What is the best way to learn Tidyverse?",
  "What colours are in the rainbow?",
  "What's funny about legs?",
  "What's the capital of the United States?",
  "Where is the moon?"
)

chat <- chat_openai(system_prompt = "Reply concisely, one sentence.")
system.time(chats <- chat$chat_parallel(prompts))
#>    user  system elapsed 
#>   0.039   0.003   1.248 

chat <- chat_openai(system_prompt = "Reply concisely, one sentence.")
system.time(chats <- map(prompts, \(prompt) chat$clone()$chat(prompt, echo = FALSE)))
#>    user  system elapsed 
#>   0.770   0.017   5.602 

@dylanpieper
Author

That's awesome! I know there are still some limitations to req_perform_parallel (throttling, retries, cache, etc.) but I'm curious if the end goal will be to support all of the features (would be cool). Just for fun - this is what hellmer looks like. I interrupt it to show that behavior/UI.

Image

hadley added a commit that referenced this issue Feb 17, 2025
@hadley
Member

hadley commented Feb 17, 2025

@dylanpieper as of a couple of days ago, most of those limitations have been lifted. I implemented something similar for interrupts too.

hadley added a commit that referenced this issue Feb 18, 2025