TL;DR – We are excited to announce voyage-3 and voyage-3-lite embedding models, advancing the frontier of retrieval quality, latency, and cost. voyage-3 outperforms OpenAI v3 large by 7.55% on average across all evaluated domains, including code, law, finance, multilingual, and long-context, with 2.2x lower costs and a 3x smaller embedding dimension, resulting in 3x lower vectorDB costs. voyage-3-lite offers 3.82% better retrieval accuracy than OpenAI v3 large while costing 6x less and having a 6x smaller embedding dimension. Both models support a 32K-token context length, 4x more than OpenAI.
In the last nine months, we have released a suite of Voyage 2 series embedding models, including state-of-the-art general-purpose models, such as voyage-large-2, and domain-specific models, such as voyage-code-2, voyage-law-2, voyage-finance-2, and voyage-multilingual-2, all extensively trained on data from their respective domains. For example, voyage-multilingual-2 demonstrates superior retrieval quality in French, German, Japanese, Spanish, and Korean, while still providing best-in-class performance in English. We have also fine-tuned models for companies with specific use cases and data, e.g., Harvey.ai.
Now, we are thrilled to introduce our Voyage 3 series embedding models, voyage-3 and voyage-3-lite, with voyage-3-large coming in a few weeks. These models outperform competitors1 in retrieval quality while significantly reducing price and downstream vectorDB costs. Specifically, voyage-3:
- Outperforms OpenAI v3 large across all eight evaluated domains (tech, code, web, law, finance, multilingual, conservation, and long-context) by 7.55% on average.
- Costs 2.2x less than OpenAI v3 large and 1.6x less than Cohere English v3, at $0.06 per 1M tokens.
- Has a 3-4x smaller embedding dimension (1024) compared to OpenAI (3072) and E5 Mistral (4096), resulting in 3-4x lower vectorDB costs.
- Supports a 32K-token context length, compared to OpenAI (8K) and Cohere (512).
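The dimension-to-cost relationship above follows from vector storage scaling linearly with embedding dimension. A back-of-envelope sketch (the 100M-vector corpus size and float32 storage are illustrative assumptions, not figures from this post):

```python
# Rough vectorDB storage for a hypothetical corpus of 100M vectors,
# stored as float32 (4 bytes per dimension). Storage, and hence
# vectorDB cost, scales linearly with embedding dimension.
def storage_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dim * bytes_per_value / 1e9

voyage3_gb = storage_gb(100_000_000, 1024)  # voyage-3 at 1024 dims
openai_gb = storage_gb(100_000_000, 3072)   # OpenAI v3 large at 3072 dims
ratio = openai_gb / voyage3_gb              # the 3x factor cited above
```

The same arithmetic gives the 4x factor against E5 Mistral's 4096 dimensions.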
voyage-3-lite is a more lightweight model optimized for latency and low cost, which:
- Outperforms OpenAI v3 large by 3.82% on average across the domains.
- Costs 6.5x less than OpenAI v3 large, at $0.02 per 1M tokens.
- Outperforms OpenAI v3 small by 7.58% with the same price.
- Has a 6-8x smaller embedding dimension (512) compared to OpenAI (3072) and E5 Mistral (4096), resulting in 6-8x lower vectorDB costs.
- Supports a 32K-token context length, compared to OpenAI (8K) and Cohere (512).
The table below summarizes the key aspects of these models along with a few competitors, accompanied by a plot of retrieval quality versus cost2.
| Model | Dimensions | Context Length | Cost (per 1M tokens) | Retrieval Quality (NDCG@10) |
|---|---|---|---|---|
| voyage-3 | 1024 | 32K | $0.06 | 76.72 |
| voyage-3-lite | 512 | 32K | $0.02 | 72.98 |
| OpenAI v3 large | 3072 | 8K | $0.13 | 69.17 |
| OpenAI v3 small | 1536 | 8K | $0.02 | 67.08 |
| Cohere English v3 | 1024 | 512 | $0.10 | 59.33 |
| E5 Mistral | 4096 | 4K | $0.10 | 70.13 |
| BGE M3 | 1024 | 8K | $0.016 | 66.61 |
voyage-3 and voyage-3-lite are the result of several research innovations, including an improved architecture, distillation from larger models, over 2T high-quality tokens in pre-training, and retrieval result alignment via human feedback.
Recommendations. General-purpose embedding users can upgrade to voyage-3 for better retrieval quality at a lower cost, or to voyage-3-lite for further cost savings. If you are particularly interested in code, law, finance, or multilingual retrieval, the Voyage 2 series domain-specific models (voyage-code-2, voyage-law-2, voyage-finance-2, and voyage-multilingual-2) remain best for their respective domains, even though voyage-3 is highly competitive as well (see the section below). If you’ve used Voyage embeddings before, simply specify "voyage-3" or "voyage-3-lite" as the model parameter in Voyage API calls, for both the corpus and queries.
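The upgrade path above can be sketched with the official voyageai Python client. This is a minimal illustration, assuming the `voyageai` package is installed and a Voyage API key is configured; the small helper function is our own, not part of the client:

```python
def voyage3_request(texts, input_type):
    # The same request shape is used for both the corpus and queries;
    # only input_type changes: "document" for corpus texts, "query" for queries.
    return {"texts": texts, "model": "voyage-3", "input_type": input_type}

def embed_with_voyage3(texts, input_type):
    # Assumes `pip install voyageai` and a VOYAGE_API_KEY in the environment.
    import voyageai
    vo = voyageai.Client()
    result = vo.embed(**voyage3_request(texts, input_type))
    return result.embeddings  # 1024-dimensional vectors for voyage-3
```

Swapping in "voyage-3-lite" as the model value is the only change needed for the lighter model.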
Evaluation Details
Datasets. We evaluate on 40 domain-specific retrieval datasets spanning eight domains: technical documentation, code, law, finance, web reviews, multilingual, long documents, and conversations. Each dataset consists of a corpus to be retrieved from and a set of queries. The corpus typically encompasses documents in a particular domain, such as answers on StackExchange, court opinions, or technical documentation, and the queries can be questions, summaries of long documents, or simply individual documents. The following table lists the datasets in the eight categories except multilingual. The multilingual domain comprises 62 datasets covering 26 languages, including French, German, Japanese, Spanish, Korean, Bengali, Portuguese, and Russian. Each of the first five languages has multiple datasets; the other languages have one dataset each and are grouped into an OTHER category in the multilingual radar chart below.
| Category | Description | Datasets |
|---|---|---|
| TECH | Technical documentation | Cohere, 5G, OneSignal, LangChain, PyTorch |
| CODE | Code snippets, docstrings | LeetCodeCpp, LeetCodeJava, LeetCodePython, HumanEval, MBPP, DS1000-referenceonly, DS1000, apps_5doc |
| LAW | Cases, court opinions, statutes, patents | LeCaRDv2, LegalQuAD, LegalSummarization, AILA casedocs, AILA statutes |
| FINANCE | SEC filings, finance QA | RAG benchmark (Apple-10K-2022), FinanceBench, TAT-QA, Finance Alpaca, FiQA Personal Finance, Stock News Sentiment, ConvFinQA, FinQA, HC3 Finance |
| WEB | Reviews, forum posts, policy pages | Huffpostsports, Huffpostscience, Doordash, Health4CA |
| LONG-CONTEXT | Long documents on assorted topics: government reports, academic papers, and dialogues | NarrativeQA, Needle, Passkey, QMSum, SummScreenFD, WikimQA |
| CONVERSATION | Meeting transcripts, dialogues | Dialog Sum, QA Conv, HQA |
A list of all evaluation datasets is available in this spreadsheet.
Models. We evaluate voyage-3 and voyage-3-lite alongside several alternatives, including OpenAI v3 small (text-embedding-3-small) and large (text-embedding-3-large), E5 Mistral (intfloat/e5-mistral-7b-instruct), BGE M3 (BAAI/bge-m3), Cohere English v3 (embed-english-v3.0), and voyage-large-2-instruct. For the domain-specific and multilingual datasets, we also evaluate voyage-law-2, voyage-finance-2, voyage-multilingual-2, Multilingual E5 (intfloat/multilingual-e5-large), and Cohere multilingual v3 (embed-multilingual-v3.0).
Metrics. Given a query, we retrieve the top 10 documents based on cosine similarities and report the normalized discounted cumulative gain (NDCG@10), a standard metric for retrieval quality that rewards ranking relevant documents higher in the result list.
Results
Retrieval Across Domains. As discussed earlier and shown in the first radar chart of this post, voyage-3 outperforms OpenAI v3 large by an average of 7.55% across domains. Furthermore, voyage-3 trails closely behind Voyage’s domain-specific models, as shown in the bar plots below.
Multilingual Retrieval. As shown in the radar chart below, voyage-3’s multilingual retrieval quality is just slightly behind voyage-multilingual-2, with lower latency and at half the cost. voyage-3-lite outperforms all non-Voyage models, besting OpenAI v3 large, Cohere multilingual v3, and Multilingual E5 by 4.55%, 3.13%, and 3.89%, respectively.
All the evaluation results are available in this spreadsheet.
Try Voyage 3 Series!
Give voyage-3 and voyage-3-lite a try today! The first 200M tokens are free. Head over to our docs to learn more. If you’re also interested in fine-tuning embeddings, we’d love to hear from you—please email us at contact@voyageai.com. Follow us on X (Twitter) and LinkedIn, and join our Discord for more updates.
