#mlnews #openai #embeddings
COMMENTS DIRECTLY FROM THE AUTHOR (thanks a lot for reaching out Arvind 🙂 ):
1. The FIQA results you share also have code to reproduce the results in the paper using the API: There’s no discrepancy AFAIK.
2. We leave out 6 not 7 BEIR datasets. Results on msmarco, nq and triviaqa are in a separate table (Table 5 in the paper). NQ is part of BEIR too and we didn’t want to repeat it. Finally, the 6 datasets we leave out are not readily available and it is common to leave them out in prior work too. For example, SPLADE v2 also evaluates on the same 12 BEIR datasets.
3. Finally, I’m now working on time travel so that I can cite papers from the future 🙂
END COMMENTS FROM THE AUTHOR
OpenAI has launched an embeddings endpoint in its API, providing high-dimensional vector embeddings for use in text similarity, text search, and code search. Embeddings are widely recognized as a standard tool for processing natural language, but people have raised doubts about the quality of OpenAI’s embeddings: one blog post found that they are often outperformed by open-source models that are much smaller and would cost a fraction of what OpenAI charges to run. In this video, we examine the claims made and determine what it all means.
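For readers new to the topic, here is a minimal sketch of how embeddings are typically used for text search: documents and a query are each mapped to a vector, and documents are ranked by cosine similarity to the query. The vectors below are toy 3-dimensional values invented for illustration. They do not come from OpenAI's API or any real model, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumption: real ones have hundreds or thousands of dimensions).
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.6, 0.8, 0.0],
}
query = [1.0, 0.0, 0.0]

ranking = rank_documents(query, docs)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In practice the vectors would come from whichever embedding model is being evaluated, which is exactly where the quality and cost comparison discussed in the video comes in.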
OUTLINE:
0:00 – Intro
0:30 – Sponsor: Weights & Biases
2:20 – What embeddings are available?
3:55 – OpenAI shows promising results
5:25 – How good are the results really?
6:55 – Criticism: Open models might be cheaper and smaller
10:05 – Discrepancies in the results
11:00 – The author’s response
11:50 – Putting things into perspective
13:35 – What about real world data?
14:40 – OpenAI’s pricing strategy: Why so expensive?
Sponsor: Weights & Biases
Merch: store.ykilcher.com
ERRATA: At 13:20 I say “better”, it should be “worse”
If you want to support me, the best thing to do is to share out the content 🙂
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n