Retrieval-augmented generation (RAG) is a technique that enhances a large language model's ability to generate relevant and accurate responses by grounding them in an external source of information. Although RAG is fairly new, it can mitigate errors caused by LLM hallucination. Semantic search is the most common retrieval method in RAG applications. Instead of relying on traditional Boolean logic, semantic search captures relationships between words by generating embeddings and storing them in a vector database. Put simply, embeddings transform words or sentences into dense vectors of real numbers in a continuous vector space, and these vectors encode the contextual and semantic meaning of the text. When a user submits a semantic query, the search system tries to understand the intent and context behind it: it breaks the query into tokens, translates them into vector representations using an embedding model, and returns results ranked by relevance.
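Here is a minimal sketch of that flow, assuming the sentence-transformers package (the model name below is just an example): documents and the query are embedded into the same vector space and ranked by cosine similarity.

```python
# Minimal sketch of embedding-based semantic search.
# Assumes the sentence-transformers package; the model name is an example.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

documents = [
    "How to reset a forgotten password",
    "Quarterly revenue grew by 12 percent",
    "Steps to recover account access after losing credentials",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

query = "I can't log in to my account"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# Cosine similarity (dot product of normalized vectors), then rank.
scores = doc_vectors @ query_vector
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```

In production the in-memory dot product is replaced by a vector database with an approximate nearest-neighbor index, but the ranking principle is the same.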
RAG has shown promising results in improving the accuracy and relevance of responses generated by large language models, but it also raises privacy concerns. In the paper "Text Embeddings Reveal (Almost) As Much As Text", researchers at Cornell University trained a model to invert embeddings. Text inversion refers to recovering the original text from its dense text embeddings: put simply, given an embedding (a vector representation of text), the goal is to recreate the original text or phrases from that vector alone.
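The authors released their inversion code as the open-source vec2text library. The snippet below is only a hedged sketch of what reconstruction from an embedding looks like in practice; the function names and the "gtr-base" corrector identifier are assumptions based on the project's documentation and may differ in current releases.

```python
# Hedged sketch of embedding inversion with the authors' vec2text library.
# Function names and the "gtr-base" identifier are assumptions from the
# project's docs and may differ across releases.
import vec2text

# Load the trained "corrector" that iteratively refines text hypotheses for
# embeddings produced by the GTR encoder used in the paper.
corrector = vec2text.load_pretrained_corrector("gtr-base")

# Embed a string with the matching encoder, then try to reconstruct it from
# the embedding alone. The library also exposes knobs for the number of
# correction steps and sequence-level beam width (see its docs).
reconstructions = vec2text.invert_strings(
    ["The patient was admitted with acute chest pain."],
    corrector=corrector,
)
print(reconstructions)
```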
A high-level overview of the results is below:
In-Domain Reconstruction
The model outperformed baselines on all metrics within the in-domain evaluation. Increasing the number of rounds improved performance, with 77% of the BLEU score recovered in just 5 rounds of correction, although running for 50 rounds led to even higher reconstruction performance. Sequence-level beam search proved particularly effective, increasing the exact-match score by 2 to 6 times across the three settings. While the model struggled to exactly recover longer texts, it still managed to recover a significant portion of the original content. A toy illustration of this round-based correction loop follows.
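To make "rounds of correction" concrete, here is a self-contained toy sketch (not the paper's model): the "embedding" is just a normalized bag-of-words vector, and each round greedily edits the hypothesis to move its embedding closer to the target. The real system uses a trained Transformer corrector over dense neural embeddings, but the control flow is the same: embed the hypothesis, compare to the target vector, refine, repeat.

```python
# Toy round-based correction against a target embedding.
import numpy as np

# Tiny vocabulary; the real attack operates over a full subword vocabulary.
VOCAB = ["the", "patient", "was", "admitted", "with", "acute", "chest", "pain",
         "and", "discharged", "after", "treatment", "dog", "ran", "park"]

def embed(tokens):
    """Toy 'embedding': L2-normalized bag-of-words counts over VOCAB."""
    v = np.array([tokens.count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def correct(hypothesis, target_emb):
    """One round of correction: try single-token edits, keep the best-scoring one."""
    best, best_sim = hypothesis, float(embed(hypothesis) @ target_emb)
    for w in VOCAB:
        for cand in (hypothesis + [w], [t for t in hypothesis if t != w]):
            sim = float(embed(cand) @ target_emb)
            if sim > best_sim:
                best, best_sim = cand, sim
    return best, best_sim

secret = "the patient was admitted with acute chest pain".split()
target = embed(secret)                 # the attacker only ever sees this vector

hypothesis = "the dog ran".split()     # arbitrary starting guess
for round_ in range(1, 11):            # more rounds -> closer reconstruction
    hypothesis, sim = correct(hypothesis, target)
    print(f"round {round_:2d}  cosine={sim:.3f}  text={' '.join(sorted(hypothesis))}")
```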
Out-of-Domain Reconstruction
The model was tested on 15 datasets from the BEIR benchmark, with varying results. It performed best on Quora, the dataset with the shortest texts in BEIR, where it exactly recovered 66% of examples. Overall, the model adapted well to inputs of different lengths, producing reconstructions with an average length error of fewer than 3 tokens. Reconstruction accuracy generally decreased as example length increased, which the paper discusses further in its Section 7. Despite this, the model achieved a Token F1 of at least 41 and a consistently high cosine similarity to the true embedding across all datasets.
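For reference, the two metrics above are straightforward to compute. The sketch below shows word-level Token F1 between a reference text and a reconstruction, plus cosine similarity between two embedding vectors; the example strings are ours, not the paper's.

```python
# Word-level Token F1 and embedding cosine similarity, the two metrics
# reported for the out-of-domain results above.
from collections import Counter
import numpy as np

def token_f1(reference: str, reconstruction: str) -> float:
    ref, rec = Counter(reference.split()), Counter(reconstruction.split())
    overlap = sum((ref & rec).values())          # tokens shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(rec.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(token_f1("what is the capital of france",
               "what is the capital city of france"))  # ~0.92
```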
Case study: MIMIC
Medical data can reveal a person's predispositions and vulnerabilities. In the final case study, the authors embedded a "pseudo re-identified" version of the MIMIC-III clinical notes. They found that the inversion model recovered 94% of first names, 95% of last names, and 80% of full names. Furthermore, 26% of the documents were recovered exactly.
There has been a surge in the use of vector databases. While traditional encryption, whether symmetric (e.g., AES) or asymmetric (e.g., RSA), can safeguard data, it is not well-suited to protecting embeddings because of the unique properties of vectors.
Loss of Semantic Properties: Traditional encryption methods obscure the original data, making it unreadable to unauthorized parties. However, this also destroys the semantic properties of vector representations that are crucial for tasks like similarity search and classification.
Inefficiency in Query Processing: Encrypting individual embeddings would require decrypting them before performing any computation, making real-time query processing and large-scale analysis impractical. The sketch below illustrates both problems.
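Here is a minimal sketch, assuming the Python cryptography package and using random vectors as stand-ins for real embeddings: once vectors are AES-encrypted, the ciphertexts carry no usable geometry, so every similarity query degenerates into a decrypt-and-scan pass over the whole store.

```python
# Sketch: AES-encrypted embeddings can be stored safely, but every similarity
# computation forces a decrypt-first workflow, so the index loses the geometry
# that makes vector search fast. Uses the `cryptography` package; the vectors
# are random stand-ins for real embeddings.
import os
import numpy as np
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = os.urandom(32)  # AES-256 key

def encrypt(vec: np.ndarray) -> tuple[bytes, bytes]:
    nonce = os.urandom(16)
    enc = Cipher(algorithms.AES(KEY), modes.CTR(nonce)).encryptor()
    return nonce, enc.update(vec.astype(np.float32).tobytes()) + enc.finalize()

def decrypt(nonce: bytes, ciphertext: bytes) -> np.ndarray:
    dec = Cipher(algorithms.AES(KEY), modes.CTR(nonce)).decryptor()
    return np.frombuffer(dec.update(ciphertext) + dec.finalize(), dtype=np.float32)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
store = [encrypt(rng.normal(size=384)) for _ in range(10_000)]  # encrypted "index"
query = rng.normal(size=384).astype(np.float32)

# Ciphertexts cannot be ranked directly: to score anything, every single
# vector must be decrypted first -- a full decrypt-and-scan pass per query.
scores = [cosine(query, decrypt(nonce, ct)) for nonce, ct in store]
print("best match:", int(np.argmax(scores)), max(scores))
```

Stay tuned for our next blog, where we will investigate how to defend against embedding inversion.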