For 1 million users sending 1,000 requests per month, the total size saved would be approximately 22.9 TB per month!
In today’s fast-paced world of AI applications, optimizing performance should be one of your top priorities. This guide walks you through a simple yet powerful way to reduce OpenAI embedding response sizes by 75%—cutting them from 32 KB to just 8 KB per request. By switching from float32 to base64 encoding in your Retrieval-Augmented Generation (RAG) system, you can achieve a 4x efficiency boost, minimizing network overhead, saving costs and dramatically improving responsiveness.
Let's consider the following scenario.
Use Case: RAG Application Processing a 10-Page PDF
A user interacts with a RAG-powered application that processes a 10-page PDF and uses OpenAI embedding models to make the document searchable from an LLM. The goal is to show how optimizing embedding response size impacts overall system performance.
Step 1: Embedding Creation from the 10-Page PDF
In a typical RAG system, the first step is to embed documents (in this case, a 10-page PDF) to store meaningful vectors that will later be retrieved for answering queries. The PDF is split into chunks. In our example, each chunk contains approximately 100 tokens (for the sake of simplicity), but the recommended chunk size varies based on the language and the embedding model.
Assumptions for the PDF:
- A 10-page PDF has approximately 3,325 tokens (roughly 330 tokens per page).
- You’ll split this document into 34 chunks (each containing up to 100 tokens).
- Each chunk is then sent to the OpenAI embeddings API for processing, as sketched below.
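To make the chunking step concrete, here is a minimal TypeScript sketch of token-based chunking. It assumes the js-tiktoken package and the cl100k_base encoding (used by OpenAI’s embedding models); the pdfText placeholder stands in for the text extracted from the PDF, and the 100-token chunk size comes from the scenario above.

import { getEncoding } from "js-tiktoken";

// Split a document into fixed-size token chunks (100 tokens each, as in this scenario).
function chunkByTokens(text: string, chunkSize = 100): string[] {
  const enc = getEncoding("cl100k_base");
  const tokens = enc.encode(text);
  const chunks: string[] = [];
  for (let i = 0; i < tokens.length; i += chunkSize) {
    chunks.push(enc.decode(tokens.slice(i, i + chunkSize)));
  }
  return chunks;
}

const pdfText = "...text extracted from the 10-page PDF...";
const chunks = chunkByTokens(pdfText);
// A ~3,325-token document yields 34 chunks of up to 100 tokens each.
console.log(chunks.length);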
Step 2: The User Interacts with the RAG Application
Once the embeddings for the PDF are created, the user interacts with the RAG application, querying it multiple times. Each query is processed by retrieving the most relevant pieces of the document using the previously created embeddings.
For simplicity, let’s assume:
- The user sends 10 queries, each containing 200 tokens.
- Each query requires 2 embedding requests (since the query is split into 100-token chunks for embedding).
- After embedding the query, the system performs retrieval and returns the most relevant documents (the RAG response).
Embedding Response Size
The OpenAI Embeddings models take an input of tokens (the text to embed) and return a list of numbers called a vector. This list of numbers is the “embedding” of the input: a point in the model’s vector space that can be compared with other vectors to measure similarity, for example with cosine similarity, as sketched below. In RAG, we use embedding models to quickly search for relevant data in a vector database.
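As a quick illustration of what “measuring similarity” means, here is a minimal cosine-similarity sketch in TypeScript; in a real RAG system, the vector database performs this comparison for you.

// Cosine similarity between two embedding vectors:
// close to 1 means very similar, close to 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}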
By default, embeddings are serialized as an array of floating-point values in a JSON document, so each response from the embeddings API is relatively large. The array values are 32-bit floating-point numbers (float32). Each float32 value occupies 4 bytes, and models like OpenAI’s text-embedding-ada-002 return a 1536-dimensional vector.
The challenge is the size of the embedding response:
- Each response consists of 1536 float32 values (one per dimension).
- 1536 float32 values result in 6144 bytes (1536 × 4 bytes).
- When those float32 values are serialized as JSON text for transmission over the network, each value becomes a decimal string of roughly 20 characters; together with delimiters, this results in approximately 32 KB per response. (You can verify this locally, as sketched below.)
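The following Node.js sketch serializes a 1536-dimension vector of random values (standing in for a real embedding) both as a JSON array of floats and as base64-encoded raw float32 bytes:

// Approximate payload sizes for a 1536-dimension embedding.
const dims = 1536;
const vector = Array.from({ length: dims }, () => Math.random() * 2 - 1);

// float32 values written as JSON text: roughly 20 characters per number plus delimiters -> ~32 KB.
const jsonBytes = Buffer.byteLength(JSON.stringify(vector), "utf8");

// The same values as raw 4-byte floats (1536 × 4 = 6,144 bytes), base64-encoded -> 8,192 bytes.
const raw = Buffer.from(new Float32Array(vector).buffer);
const base64Bytes = Buffer.byteLength(raw.toString("base64"), "utf8");

console.log({ jsonBytes, base64Bytes }); // roughly { jsonBytes: 32000, base64Bytes: 8192 }

Running this prints numbers close to the 32 KB and 8 KB figures used throughout this article.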
Optimizing Embedding Response Size
One approach to optimizing the embedding response size is to serialize the embedding as base64. Base64 does not compress the data; it encodes the raw 4-byte float32 values directly instead of writing each value out as a long decimal string, preserving the embedding exactly while significantly reducing the size of the response.
With base64-encoded embeddings, the response size reduces from 32 KB to approximately 8 KB, as demonstrated below:
| base64 vs float32 | float32 Min (bytes) | float32 Max (bytes) | float32 Mean (bytes) | base64 Min (bytes) | base64 Max (bytes) | base64 Mean (bytes) |
| --- | --- | --- | --- | --- | --- | --- |
| 100-token embeddings: text-embedding-3-small | 32,673 | 32,751 | 32,703.8 | 8,192 (4.0x, 74.9% smaller) | 8,192 (4.0x, 75.0% smaller) | 8,192 (4.0x, 74.9% smaller) |
| 100-token embeddings: text-embedding-3-large | 65,757 | 65,893 | 65,810.2 | 16,384 (4.0x, 75.1% smaller) | 16,384 (4.0x, 75.1% smaller) | 16,384 (4.0x, 75.1% smaller) |
| 100-token embeddings: text-embedding-ada-002 | 32,882 | 32,939 | 32,909.0 | 8,192 (4.0x, 75.1% smaller) | 8,192 (4.0x, 75.2% smaller) | 8,192 (4.0x, 75.1% smaller) |
The source code of this benchmark can be found at https://github.com/manekinekko/rich-bench-node (kudos to Anthony Shaw for creating the rich-bench Python runner).
Comparing the Two Scenarios
Let’s break down and compare the total performance of the system in two scenarios:
- Scenario 1: Embeddings serialized as float32 (32 KB per response)
- Scenario 2: Embeddings serialized as base64 (8 KB per response)
Scenario 1: Embeddings Serialized as Float32
In this scenario, the PDF embedding creation and user queries involve larger responses due to float32 serialization. Let’s compute the total response size for each phase:
1. Embedding Creation for the PDF:
- 34 embedding requests (one per 100-token chunk).
- 34 responses with 32 KB each.
Total size for PDF embedding responses: 34 × 32 KB = 1,088 KB (≈ 1.06 MB)
2. User Interactions with the RAG App:
- Each user query consists of 200 tokens (which is split into 2 chunks of 100 tokens).
- 10 user queries, requiring 2 embedding responses per query (for 2 chunks).
- Each embedding response is 32 KB.
Total size for user queries:
Embedding responses: 20 × 32 KB = 640 KB.
RAG responses (assuming each is roughly 32 KB): 10 × 32 KB = 320 KB.
Total size for user interactions: 640 KB (embedding) + 320 KB (RAG) = 960 KB.
3. Total Size:
Total size for embedding responses (PDF + user queries): 1,088 KB + 640 KB = 1,728 KB (≈ 1.69 MB)
Total size for RAG responses: 320 KB.
Overall total size for all responses: 1,728 KB + 320 KB = 2,048 KB = 2 MB
Scenario 2: Embeddings Serialized as Base64
In this optimized scenario, the embedding response size is reduced to 8 KB by using base64 encoding.
1. Embedding Creation for the PDF:
- 34 embedding requests.
- 34 responses with 8 KB each.
Total size for PDF embedding responses: 34 × 8 KB = 272 KB.
2. User Interactions with the RAG App:
- Embedding responses for 10 queries, 2 responses per query.
- Each embedding response is 8 KB.
Total size for user queries:
Embedding responses: 20 × 8 KB = 160 KB.
RAG responses (assuming each is roughly 8 KB): 10 × 8 KB = 80 KB.
Total size for user interactions: 160 KB (embedding) + 80 KB (RAG) = 240 KB
3. Total Size (Optimized Scenario):
Total size for embedding responses (PDF + user queries): 272 KB + 160 KB = 432 KB.
Total size for RAG responses: 80 KB.
Overall total size for all responses: 432 KB + 80 KB = 512 KB
Performance Gain: Comparison Between Scenarios
The optimized scenario (base64 encoding) transfers 4 times less data than the original (float32 encoding): 2,048 KB / 512 KB = 4.
The total size reduction between the two scenarios is 2,048 KB − 512 KB = 1,536 KB, which is a (1,536 / 2,048) × 100 = 75% reduction in data size.
How to Configure the base64 Encoding Format
To get a vector representation of a given input that can be easily consumed by machine learning models and algorithms, you typically either call the OpenAI API endpoint directly or use one of the official client libraries for your programming language.
Calling the OpenAI or Azure OpenAI APIs
Using the OpenAI endpoint:
curl -X POST "https://api.openai.com/v1/embeddings" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"input": "The five boxing wizards jump quickly",
"model": "text-embedding-ada-002",
"encoding_format": "base64"
}'
Or, calling Azure OpenAI resources:
curl -X POST "https://{endpoint}/openai/deployments/{deployment-id}/embeddings?api-version=2024-10-21" \
-H "Content-Type: application/json" \
-H "api-key: YOUR_API_KEY" \
-d '{
"input": ["The five boxing wizards jump quickly"],
"encoding_format": "base64"
}'
Using OpenAI Libraries
JavaScript/TypeScript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.embeddings.create({
input: "The five boxing wizards jump quickly",
model: "text-embedding-3-small",
encoding_format: "base64"
});
A pull request has been sent to the openai SDK for Node.js repository to make base64 the default encoding when the user does not provide one. Please feel free to give that PR a thumbs up.
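Note that when encoding_format is explicitly set to "base64", the embedding field in the response contains a base64 string rather than an array of numbers (even though the SDK types it as number[]). Building on the response object from the snippet above, here is a minimal sketch of decoding it back into a Float32Array in Node.js:

// With encoding_format: "base64", the embedding arrives as a base64 string.
const base64Embedding = response.data[0].embedding as unknown as string;

// Decode the base64 string to raw bytes, then reinterpret those bytes as 32-bit floats.
const bytes = Buffer.from(base64Embedding, "base64");
const vector = new Float32Array(bytes.buffer, bytes.byteOffset, bytes.byteLength / 4);

console.log(vector.length); // 1536 for text-embedding-3-small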
Python
from openai import OpenAI

client = OpenAI()

embedding = client.embeddings.create(
input="The five boxing wizards jump quickly",
model="text-embedding-3-small",
encoding_format="base64"
)
NB: starting with version 1.62, the openai SDK for Python defaults to base64 encoding.
Java
EmbeddingCreateParams embeddingCreateParams = EmbeddingCreateParams
.builder()
.input("The five boxing wizards jump quickly")
.encodingFormat(EncodingFormat.BASE64)
.model("text-embedding-3-small")
.build();
.NET
The openai-dotnet library already enforces base64 encoding and does not allow the user to set encoding_format.
Conclusion
By optimizing the embedding response serialization from float32 to base64, you achieve a 75% reduction in data size, transferring 4x less data over the network. This reduction significantly enhances the efficiency of your RAG application, especially when processing large documents like PDFs and handling multiple user queries.
For 1 million users sending 1,000 requests per month, saving roughly 24 KB per response adds up to approximately 22.9 TB per month, simply by using base64-encoded embeddings.
As demonstrated, optimizing the size of the API responses is not only crucial for reducing network overhead but also for improving the overall responsiveness of your application. In a world where efficiency and scalability are key to delivering robust AI-powered solutions, this optimization can make a substantial difference in both performance and user experience.
Shoutout to my colleague Anthony Shaw for the long and great discussions we had about embedding optimizations.
Updated Mar 07, 2025 (Version 1.0) · wassimchegham, Microsoft