Introducing TextRerank™: A privacy-first AI document processing pipeline

updated on 18 February 2024

TextRerank™ combines state-of-the-art artificial intelligence tooling with proprietary Silatus algorithms to produce best-in-class document retrieval, processing, and generation results.

At Silatus, we are on a mission to become the world's best research automation engine. Towards that end, we are constantly searching for ways to improve our processing pipelines. TextRerank™ is the next evolution of Silatus research automation.

Background

To understand why Silatus' TextRerank™ is so powerful, we first have to investigate and understand the history and components of search.

Famously, Google co-founders Sergey Brin and Larry Page developed the PageRank algorithm in 1998 to rank websites against human search queries. The algorithm built a "relevance web", using the links between pages to form a graph of relevance. It was a major revolution in search technology, but it did not fully address the challenge of searching within webpages or documents to fully contextualize their contents.

Keyword Search

Keyword search algorithms, like Okapi BM25, attempt to address this challenge. When a user types in a search query, the information retrieval system matches each word in the query against the documents in its database and returns the documents with the most matches.

Importantly, the BM25 algorithm does not consider the relative position of each word. For example, if the user's search query is "best fictional war movies", the algorithm will treat a document containing either of the following sentences as a match: "These are the best fictional war movies" and "These war movies are not fictional, but they are the best". Clearly, one sentence fits the query better than the other, but the keyword search algorithm cannot distinguish those semantics.
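
To make this concrete, here is a minimal sketch using the open-source rank_bm25 Python package (not part of our pipeline); the extra filler documents simply give the term statistics something to work with:

```python
import re
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# The two example sentences from above, plus unrelated filler documents.
corpus = [
    "These are the best fictional war movies",
    "These war movies are not fictional, but they are the best",
    "A guide to baking sourdough bread at home",
    "Quarterly earnings reports for the automotive sector",
    "Hiking trails and campsites in the Rocky Mountains",
]

tokenize = lambda text: re.findall(r"\w+", text.lower())
bm25 = BM25Okapi([tokenize(doc) for doc in corpus])

scores = bm25.get_scores(tokenize("best fictional war movies"))
for doc, score in sorted(zip(corpus, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")
# Both of the first two sentences score as strong matches, because every query
# term appears in each of them; BM25 cannot tell that only one answers the query.
```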

Semantic Search

Semantic search began gaining prominence in the early 2000s, moving beyond keyword matching to understanding the context and intent behind user queries. The goal is to interpret the meaning of queries to provide more relevant search results. Techniques like latent semantic indexing and subsequent developments in semantic algorithms played a crucial role in this evolution.

The first major breakthroughs began with the introduction of neural networks and deep learning techniques in the early 2010s. These technologies allowed for a more nuanced understanding of language, considering the relationships and subtleties within text. Google's RankBrain, introduced in 2015, was a notable example, using machine learning to interpret complex queries and provide more tailored results. 

More recently, the industry has been looking for ways to leverage large language models (LLMs) to conduct semantic searches. LLMs, like OpenAI's GPT-4, are excellent at determining a piece of content's relevance relative to a user query. They do it so well that, on some benchmarks, they perform better than humans who manually select relevant documents. The problem with LLMs is that they are slow and expensive. 

Vector Databases

Vector databases efficiently store, or cache, a numerical semantic representation of content - an embedding - so that the LLM doesn't have to reprocess the content every time a user submits a query. In other words, the LLM runs over the content once, storing a numerical vector in the vector database. Then, when a user submits a search, the LLM runs only over that search query, not the stored content. The embedding of the search query is then compared against the previously stored embeddings in the vector database - using mathematical functions like dot products or Euclidean distances - to find matches.
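
To make the comparison step concrete, here is a minimal sketch with tiny made-up vectors standing in for real embeddings, which typically have hundreds of dimensions:

```python
import numpy as np

# Made-up vectors standing in for cached document embeddings.
doc_vectors = {
    "doc_a": np.array([0.9, 0.1, 0.3]),
    "doc_b": np.array([0.2, 0.8, 0.5]),
}
query_vector = np.array([0.85, 0.15, 0.25])  # embedding of the user's query

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the query embedding against every cached document embedding.
scores = {doc_id: cosine_similarity(query_vector, vec) for doc_id, vec in doc_vectors.items()}
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```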

Combining Keyword and Semantic Search

Hybrid search is an advanced approach in information retrieval that blends traditional keyword-based searching with more sophisticated semantic search techniques. In this methodology, the search system integrates the speed of straightforward keyword search — where users input specific words or phrases to find relevant documents — with the contextual understanding of semantic search. The latter involves interpreting the user's intent and the meaning behind their query, going beyond mere word matching.

In a hybrid search system, when a user inputs a query, the system employs both keyword matching and semantic analysis. It first identifies the key terms within the query, akin to a traditional search. Simultaneously, it also analyzes the query's semantic context, trying to grasp the user's underlying intent and the nuanced relationships between the words. By combining these two search paradigms, hybrid search can provide results that are not only textually relevant (based on keyword match) but also contextually appropriate (based on semantic understanding). This dual approach enables a more comprehensive and accurate retrieval of information, catering to a wider range of user queries and needs.
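
As a rough illustration - not the fusion method TextRerank™ actually uses, which we cover in the post-processing section below - a hybrid system might normalize the two scores and blend them with a simple weighted sum:

```python
def hybrid_score(keyword_score: float, semantic_score: float, alpha: float = 0.5) -> float:
    """Blend a normalized BM25 score with a normalized cosine-similarity score."""
    return alpha * keyword_score + (1 - alpha) * semantic_score

# Hypothetical normalized scores for two documents against one query.
candidates = {
    "doc_a": {"keyword": 0.92, "semantic": 0.40},  # strong word overlap, weaker meaning match
    "doc_b": {"keyword": 0.55, "semantic": 0.88},  # weaker word overlap, strong meaning match
}

ranked = sorted(
    candidates,
    key=lambda d: hybrid_score(candidates[d]["keyword"], candidates[d]["semantic"]),
    reverse=True,
)
print(ranked)  # the blended score balances textual and contextual relevance
```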

TextRerank

We have developed a comprehensive and robust research pipeline that can process extremely large numbers of documents efficiently and at scale - transforming unstructured data into valuable, actionable insights. Our primary objective with TextRerank™ is to revolutionize research automation, making it easier than ever to make sense of large amounts of textual data. Let us guide you through the nuances of our state-of-the-art pipeline.

Document Uploads

The journey begins with the document upload stage. Documents are scanned for viruses and processed by our secure, privacy-first third-party provider, Transloadit, before being stored in our encrypted central relational database. This process plays a critical role in ensuring the integrity and quality of the data, and it strengthens our ability to scale our data processing up and down.

The TextRerank™ Pipeline

Using OpenSearch for Keyword Search

Once the documents find their place in our central repository, they are forwarded to OpenSearch for indexing and keyword search. Our choice of OpenSearch is deliberate and strategic. It is resilient, highly customizable, scalable, and performant. It uses a variant of the industry-standard BM25 keyword search algorithm, and AWS offers a managed, highly scalable, production-ready OpenSearch service. Given the dynamic nature of our operations, it is vital for our search engine to not just meet our current needs but to be capable of scaling and evolving alongside our growth.
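
For illustration, a basic BM25-style keyword query against an OpenSearch index looks roughly like this with the opensearch-py client; the host, credentials, and index name here are placeholders, not our production configuration:

```python
from opensearchpy import OpenSearch  # pip install opensearch-py

# Placeholder connection details - not our production configuration.
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    http_auth=("admin", "admin"),
    use_ssl=True,
    verify_certs=False,
)

# Full-text match query (scored with OpenSearch's BM25 variant) against a hypothetical index.
response = client.search(
    index="documents",
    body={"query": {"match": {"text": "best fictional war movies"}}, "size": 10},
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```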

Semantic Search Integration with Qdrant

At the same time, we send the documents to our text pre-processor for embedding and eventual storage in our encrypted vector database. After investigating various solutions, we decided to use Qdrant as our vector database and search engine, and Qdrant's FastEmbed for generating embeddings.

We chose Qdrant for the following reasons:

1. It can be self-hosted, so we can fully secure customer data in a private, controlled environment.

2. It is scalable and distributed, so we don't have to worry about performance bottlenecks.

3. It has best-in-class performance. As of January 2024, Qdrant significantly outperformed other popular vector database options like OpenSearch and Redis across a robust suite of performance evaluations. 

Once the data comes out of the text pre-processor, it goes to our customized FastEmbed embedding engine. FastEmbed, created by Qdrant's Nirant Kasliwal, is a lightweight, fast Python library built for embedding generation. We modified it to run as a Rust library with GPU acceleration and a larger embedding model for even better performance and throughput. Once the embeddings are generated from the pre-processed text, they are passed into the Qdrant vector database, where they are indexed, encrypted, isolated by customer, and stored.
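
Our production engine is a modified, GPU-accelerated Rust port, but the stock Python versions of FastEmbed and the Qdrant client illustrate the same flow; the model, collection name, and sample documents below are illustrative:

```python
from fastembed import TextEmbedding                      # pip install fastembed
from qdrant_client import QdrantClient                   # pip install qdrant-client
from qdrant_client.models import Distance, VectorParams, PointStruct

documents = [
    "Quarterly revenue grew 12% year over year.",
    "The new battery chemistry improves energy density.",
]

# Generate embeddings once, at ingest time.
model = TextEmbedding("BAAI/bge-small-en-v1.5")           # 384-dimensional vectors
vectors = list(model.embed(documents))

# Store them in a Qdrant collection configured for cosine similarity.
client = QdrantClient(":memory:")                         # in-memory for the sketch; we self-host in production
client.create_collection(
    collection_name="customer_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="customer_docs",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": text})
        for i, (text, vec) in enumerate(zip(documents, vectors))
    ],
)

# At query time, embed only the query and search the cached vectors.
query_vector = list(model.embed(["battery energy density improvements"]))[0]
hits = client.search(collection_name="customer_docs", query_vector=query_vector.tolist(), limit=3)
```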

Semantic Storage Performance

To understand how well Qdrant's vector database stores and clusters relevant documents together, we used its visualization tooling.

Relative similarity of processed documents visualized in 2D

Each blue bubble represents a document stored in the vector database. The image shows the cosine similarity of the documents to one another. As we would hope and expect, documents from the same search sit very close together, creating clusters of documents by search query. This indicates that Qdrant is functioning well for semantic search.

The Neural Query Preprocessor (NQPP)

Once data is processed and stored, it's ready for consumption by the user. One of our novel innovations and key differentiators is our neural query preprocessor. Among other things, the NQPP employs LLMs to extract unique contextual information like dates and proper nouns, and it generates variations on the user prompt as well as hypothetical document embeddings (HyDE) to enhance the retrieval process.
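
The NQPP itself is proprietary, but the hypothetical document embedding technique it builds on is public. The sketch below shows the core idea using the OpenAI Python client; the prompt, model choice, and function name are illustrative, not our production code:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hypothetical_document(user_query: str) -> str:
    """Ask an LLM to draft a short passage that *would* answer the query.
    Embedding this passage often retrieves better matches than embedding the
    raw query, because it looks more like the documents being searched."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Write a short, factual passage that would answer the user's question."},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

# The resulting text is embedded and sent to the vector database as the search vector.
hyde_text = hypothetical_document("What were the main drivers of lithium prices in 2023?")
```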

Searching Across Internal and External Data

The modified prompts from the NQPP are then passed to OpenSearch, Qdrant, and our External Data Integration Engine (EDIE), where documents are fetched and returned. EDIE has access to over 100,000 external sources and can use those sources to enhance the final result presented to the user.

Importantly, Qdrant, OpenSearch, and EDIE conduct their processing concurrently, so the pipeline is only as slow as the slowest of the three engines.
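
A simplified sketch of that fan-out using Python's asyncio; the search functions here are stand-ins for the real engine calls:

```python
import asyncio

# Stand-ins for the real engine calls (OpenSearch, Qdrant, and EDIE).
async def search_keyword(query: str) -> list:
    return []

async def search_semantic(query: str) -> list:
    return []

async def search_external(query: str) -> list:
    return []

async def fan_out(query: str):
    # All three searches run concurrently, so total latency is roughly the
    # latency of the slowest engine rather than the sum of all three.
    return await asyncio.gather(
        search_keyword(query),
        search_semantic(query),
        search_external(query),
    )

keyword_hits, semantic_hits, external_hits = asyncio.run(fan_out("lithium-ion supply chain risks"))
```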

Aggregation and Post-Processing

Once results are returned from these sources, they have to be post-processed to determine which documents are most relevant and which parts of each document are relevant. Each source provides a list of the five to ten documents it considers most relevant, along with relative relevance scores for the documents it returns.

Since LLMs have limited input data sizes, they can only make sense of so much data at a time. While companies like Anthropic offer models with very large input context windows, we've found that those companies' models still only make sense of a relatively small amount of the data. Moreover, every character of text sent to the LLM increases cost and time spent in the pipeline. Therefore, we created a data post-processor to minimize input context size - reducing costs and maximizing performance.

There are two commonly used techniques for narrowing document result sets in a post-processor like ours: statistical fusion and ML re-ranking. In statistical fusion, a mathematical function is used to prioritize documents. In ML re-ranking, an AI model is used. While ML re-ranking models are constantly improving, we have found that they are not yet performant enough for effective re-ranking in our pipeline, so we went with statistical fusion.

There are multiple statistical fusion algorithms, including Condorcet Fuse, CombMNZ, and Reciprocal Rank Fusion (RRF). In 2009, Cormack et al. found that RRF outperformed both Condorcet Fuse and CombMNZ at combining document rankings from different sources. Microsoft recommends RRF as the standard for hybrid search document aggregation. In our own testing, we found that RRF was effective, but it had some drawbacks and limitations. So, we wrote our own proprietary algorithm that better factors in repeated documents, among other variables. This is another novel innovation and differentiator that makes TextRerank™ extremely powerful and unique.
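
For reference, the standard RRF formulation scores each document as the sum of 1/(k + rank) over every result list it appears in, with k commonly set to 60. A minimal sketch (our production algorithm is a proprietary variant of this, and the result lists below are hypothetical):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine several ranked lists of document IDs with RRF (Cormack et al., 2009)."""
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from the keyword, semantic, and external engines.
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],   # OpenSearch
    ["doc1", "doc3", "doc9"],   # Qdrant
    ["doc1", "doc7", "doc2"],   # EDIE
])
print(fused)  # doc1 ranks first because it appears near the top of all three lists
```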

Producing Content

In the final stage, we send the post-processed document contents to a state-of-the-art LLM for contextualization and content generation. As of today, we primarily leverage OpenAI's GPT-4, but going forward we will use the best LLM available to us for the job. The LLM can return content in whatever format is specified by the user, giving them a well-researched, factual, hallucination-resistant, and highly relevant response.
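
Conceptually, this final step is a retrieval-augmented generation call: the fused, post-processed passages are packed into the prompt and the model is instructed to answer from them. A simplified sketch, with illustrative prompt wording and function name:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_report(user_query: str, top_passages: list[str]) -> str:
    """Send only the post-processed passages to the LLM, keeping the context small."""
    context = "\n\n".join(top_passages)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer using only the provided context, and note anything the context does not cover."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"},
        ],
    )
    return response.choices[0].message.content
```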

Conclusion

What Silatus has created isn't just a search pipeline; it's a nexus of traditional and AI-powered search mechanisms that brings forth precision, speed, and scalability. Each of these meticulously designed stages in our pipeline underlines our commitment to delivering a search experience that is not only fast but also acutely relevant and insightful. By leveraging cutting-edge technology and innovative methodologies, we at Silatus are setting new benchmarks in document retrieval and data processing.

TextRerank™ is rolling out gradually to Silatus customers. We expect it to be available to all customers by the end of February 2024. We're excited to see how this technology will continue to empower our users to conduct high-quality research with unparalleled efficiency and speed.

This article was created with the help of Silatus AI.
