How does the Batch API 50% discount actually work? Are there any exceptions?
The discount applies to both input and output tokens at the standard rate, each cut to half price. For example, if Claude Sonnet's standard input pricing is $3 per million tokens, the Batch API input price would be $1.50; the same applies to output.
One exception to be aware of: prompt caching. If you are using both Batch API and prompt caching simultaneously, the pricing logic for cached tokens is more complex — check the current official pricing page for the specifics. Also, different models have different base prices; the 50% discount is consistent, but the absolute amount depends on which model you are using. For high-volume use cases, running a small-scale test to estimate actual costs before deploying is more accurate than relying solely on the pricing page numbers.
How do I know when my Batch job is done?
Two options. First, polling: use the batch_id you received to call the status endpoint periodically; it returns the current status (in_progress, ended, canceling, etc.) and the count of completed requests. Set up a loop that checks every few minutes to every hour, depending on the scale of your job.
Second, webhooks (if you have your own server): specify a webhook URL when submitting the Batch, and Anthropic will call your endpoint when the job completes — no polling needed. For automated production pipelines, webhooks are usually cleaner; for one-off scripted jobs, polling is simpler. Either way, the final step is downloading the JSONL results file.
Are there Batch API use cases I might not have thought of?
The most commonly overlooked but high-value application is automated evaluation (eval). If you are building an AI product, you need to assess model output quality — which typically means using Claude to evaluate another set of Claude outputs (the LLM-as-judge pattern). These jobs usually require hundreds to thousands of evaluation samples, which fits the Batch API perfectly; the 50% discount also lets you run larger-scale evaluations without worrying about cost.
Another commonly missed use case is knowledge base construction: batch-converting your company documents, FAQs, and product descriptions into structured formats, or pre-generating the raw material for embeddings. These tasks don't need real-time responses but are critical preparatory steps in RAG system design, and the Batch API makes them highly cost-efficient.
Advanced: is there a best practice for combining Batch API with prompt caching?
You can use them together, but it helps to understand how they interact. Prompt caching lets you reuse the same long prefix (e.g. a very long system prompt) across multiple requests — the first request pays full price; subsequent requests pay only a small fraction for the cached portion. Batch API then takes 50% off that already-reduced cost.
The most effective combination: you have a batch of requests that all share a long system prompt or prefixed document, but each request has different user input. In this case, put the long document in a cacheable prefix and submit all requests via Batch API — the two discounts stack, and the token cost for the shared prefix becomes very low. One caveat: prompt caching behavior in Batch API may differ slightly from real-time API; test the actual cache hit rate to confirm costs match expectations.
If you are using the Claude API for batch processing tasks — evaluating large datasets, labeling training data, generating content in bulk — you may be paying more than you need to. Anthropic's Batch API cuts token costs in half for these workloads, in exchange for waiting up to 24 hours for results. This article explains how it works, when it is worth using, and a few things to watch out for.
The standard Claude API (real-time API) is a synchronous model: send a request, wait a few seconds, get a response. The Batch API is asynchronous: submit hundreds to tens of thousands of requests at once, let the system process them during idle capacity, and receive all results within 24 hours.
The essential difference is time in exchange for cost. The real-time API gives you answers in seconds at full token pricing. The Batch API lets Anthropic process your tasks when it is convenient for them; in exchange, you get a 50% discount. Currently the Batch API discount applies to both input and output tokens.
One test: does this task have a time requirement? If not, the Batch API is almost always the better-value choice.
Best scenarios: dataset labeling (ten thousand records you need Claude to classify), large-scale content evaluation (model output quality scoring, automated evaluation runs), bulk document processing (summarizing hundreds of contracts into structured data), offline content generation (pre-generating large volumes of FAQs or product descriptions). These share one trait — you don't need each result in seconds; you just need it today or tomorrow.
Not suitable: anything a user is waiting on (chatbots, real-time Q&A), chained tasks where each step depends on the previous result, tasks with strict deadlines measured in minutes.
Three steps. First, package your requests in JSONL format — each line is one standalone request object containing a custom_id (your identifier for matching results later) and the API request parameters. Second, submit the file to the Batch API endpoint; it returns a batch_id. Third, poll the status with the batch_id (or set up a webhook); when the status becomes ended, download the results file.
Results are also in JSONL format, one line per request, carrying your custom_id and the response content. Design your custom_id scheme at submission time so you can correctly match responses back to their original requests when results arrive.
First, 24 hours is the ceiling, not a guarantee. Anthropic's documentation says 'typically completes within an hour, but may take up to 24 hours.' Don't put Batch API completion time in your critical path for time-sensitive work.
Second, the Batch API supports the same models as the real-time API, but not all features. Tool use (function calling) and vision are supported in Batch API; some newer features may have delayed support — check the current documentation before building on them.
Third, a single Batch has a request limit (currently 100,000 requests or 256 MB, whichever comes first). If your job exceeds this, split it into multiple Batches submitted separately.
If you have any existing Claude API usage for batch-style tasks that do not need real-time responses, the Batch API can cut your API bill in half without any changes to models or prompts — only the way you submit requests changes. For data-intensive AI applications, this is typically the fastest, lowest-effort cost optimization available.