The use of Retrieval-Augmented Generation (RAG) applications has skyrocketed in the past year. Users and developers can index hundreds of data files and then have a Large Language Model (LLM) answer questions about them. That said, RAG is not perfect and requires refinement. Welcome back to Silatus Weekly. Today we are discussing the issues with PDFs in RAG applications.
Chunking
The first step in a RAG application is the data loading process. Users load their PDFs and the application extracts the text. PDF extraction can be done with libraries such as pypdf, pdfminer, and pdfplumber. Another option is Optical Character Recognition (OCR), which converts images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example, the text on signs and billboards in a landscape photo), or subtitle text superimposed on an image (for example, from a television broadcast).
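As a rough illustration of this loading step, here is a minimal sketch that pulls text out of each page with pypdf and then splits it into overlapping chunks. The function names, the chunk size, and the overlap are assumptions made for this example, not settings from any particular RAG product.

```python
from typing import List


def extract_pages(pdf_path: str) -> List[str]:
    """Return the extracted text of each page, using pypdf (third-party)."""
    # Imported lazily so the chunking helper below works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Scanned pages often yield no text here, which is where OCR comes in.
    return [page.extract_text() or "" for page in reader.pages]


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Fixed-size character chunks are the simplest option. They ignore page, table, and section boundaries entirely, which is part of why structured PDFs cause trouble later in the pipeline.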
The P in the PDF Problem
While text extraction and scans are suitable for most PDFs, things tend to get messy with complex PDF structures. PDFs are naturally structured into pages, tables, sections, and so on. Representing a structured document as plain text strips away the formatting and organization that reflect that structure. When a system needs to search for information within the document, this mismatch becomes apparent, and even simple questions can confuse the system.
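To make the mismatch concrete, here is a toy sketch with made-up data. The table, the `flatten` helper, and the `lookup` helper are all hypothetical. `flatten` only mimics what naive text extraction tends to produce: the cell values survive, but the row and column boundaries do not.

```python
# A small table as structured rows (hypothetical data for illustration).
table = [
    {"quarter": "Q1", "revenue": 120, "costs": 80},
    {"quarter": "Q2", "revenue": 150, "costs": 95},
]


def flatten(rows):
    """Mimic naive extraction: keep the cell values, lose the table layout."""
    header = " ".join(rows[0].keys())
    body = " ".join(str(value) for row in rows for value in row.values())
    return f"{header} {body}"


def lookup(rows, quarter, column):
    """With the structure intact, a question becomes a direct lookup."""
    for row in rows:
        if row["quarter"] == quarter:
            return row[column]
    return None


flat_text = flatten(table)
print(flat_text)
print(lookup(table, "Q2", "costs"))
```

In the flattened string, nothing says which number belongs to which column, so a question like "what were the costs in Q2?" has to be answered by guessing from word order.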
Furthermore, for complex PDF documents, the traditional approach of predicting the next token (or word) during the self-supervised training phase can be limiting. Not all preceding words are relevant, because text can be arranged on a page in varied and creative ways. Words may be laid out horizontally, vertically, or even in a zigzag pattern, which makes a previous-word predictor less effective.
Legal complaints, for example, come with headers located in the margins. When pypdf is used for extraction, these headers are pulled out along with the main content of the document. As a result, a model like ChatGPT in a RAG pipeline has trouble telling the header apart from the actual argument. Scientific papers are another example. These papers often contain complex tables with a rich array of information arranged in rows and columns, and extracting that tabular data as text presents its own challenges. Text extraction on such papers can therefore result in inaccurate or incomplete information retrieval.
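One common cleanup heuristic for the header problem, which is not something the tools above do for you, is to drop any line that repeats on most pages. The sketch below assumes the text has already been extracted page by page. The function name and the 0.8 threshold are choices made for this example.

```python
from collections import Counter
from typing import List


def strip_repeated_lines(pages: List[str], min_fraction: float = 0.8) -> List[str]:
    """Drop lines that appear on at least `min_fraction` of the pages."""
    if len(pages) < 2:
        # With a single page there is no way to tell what repeats.
        return list(pages)
    page_lines = [[line.strip() for line in page.splitlines()] for page in pages]
    # Count each distinct non-empty line once per page.
    counts = Counter(line for lines in page_lines for line in set(lines) if line)
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [
        "\n".join(line for line in lines if line and line not in repeated)
        for lines in page_lines
    ]
```

This only catches headers that repeat verbatim. Headers that change from page to page, such as ones carrying a page number, would need fuzzier matching or layout information.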
From Problem to Possibilities
Although PDFs have their problems, there are a few workarounds. In the "PDFTriage" paper, researchers went beyond text extraction. First, they extracted the structural elements of the document into readable metadata. They used the Adobe Extract API to convert digital PDF files into a format like a webpage, which makes it possible to pull out sections, titles, tables, images, and other parts of the document. Each element carries additional information such as its page number and location. The researchers then organized the data into JSON, which they used as the starting input for their language model.
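The idea of structured metadata can be sketched in a few lines. The schema below (an `Element` with `kind`, `page`, and `text`) is hypothetical and made up for this example. It is not the Adobe Extract API's output format, nor the exact JSON used in PDFTriage.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Element:
    kind: str  # e.g. "title", "section", "paragraph", "table"
    page: int  # page the element appears on
    text: str  # extracted text of the element


def to_document_json(elements: List[Element]) -> str:
    """Serialize the structural elements into one JSON document."""
    return json.dumps({"elements": [asdict(e) for e in elements]}, indent=2)


def elements_on_page(doc_json: str, page: int) -> List[dict]:
    """Answer a structural question ("what is on page N?") from the JSON."""
    doc = json.loads(doc_json)
    return [e for e in doc["elements"] if e["page"] == page]
```

With metadata like this, a question such as "summarize the table on page 3" can be routed to the right element instead of being matched against an undifferentiated wall of text.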
JPMorgan took another approach and trained its own document model, DocLLM. The model was pre-trained in a self-supervised manner on a large corpus of documents, including the IIT-CDIP and DocBank datasets. A text infilling objective was used, in which spans of text are randomly masked and the model tries to predict the masked spans. This helps the model handle irregular document layouts and disjointed text segments. After pre-training, the model was fine-tuned using instruction-based prompts from 16 datasets covering tasks such as visual QA and information extraction. The prompts provide examples from which the model learns these downstream document understanding tasks.
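For intuition, here is a toy version of how training examples for a text infilling objective can be built: some blocks of tokens are hidden behind a placeholder, and the hidden blocks become the prediction targets. This is a simplified sketch, not DocLLM's actual implementation. The `[MASK]` token, the block length, and the function names are assumptions for this example.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def mask_blocks(tokens: List[str], num_masked: int = 2, block_len: int = 2,
                seed: int = 0) -> Tuple[List[str], List[List[str]]]:
    """Hide randomly chosen blocks of tokens; return (masked input, targets)."""
    rng = random.Random(seed)
    # Cut the sequence into consecutive blocks, then pick some to hide.
    blocks = [tokens[i:i + block_len] for i in range(0, len(tokens), block_len)]
    chosen = set(rng.sample(range(len(blocks)), k=min(num_masked, len(blocks))))
    masked_input: List[str] = []
    targets: List[List[str]] = []
    for idx, block in enumerate(blocks):
        if idx in chosen:
            masked_input.append(MASK)  # the model sees only the placeholder
            targets.append(block)      # and must predict the hidden tokens
        else:
            masked_input.extend(block)
    return masked_input, targets


def fill(masked_input: List[str], targets: List[List[str]]) -> List[str]:
    """Put the target blocks back in place of the placeholders."""
    remaining = iter(targets)
    restored: List[str] = []
    for token in masked_input:
        if token == MASK:
            restored.extend(next(remaining))
        else:
            restored.append(token)
    return restored
```

Because the model sees the text on both sides of each gap, it is not forced to rely only on the words that happen to come before, which is what makes this objective a better fit for disjointed layouts.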
The Future of RAG
While PDF structures are complex, today's PDF problems will become tomorrow's solved issues. Researchers and developers are tackling these challenges and open-sourcing their work. I've heard that Silatus is making major upgrades to their RAG pipeline. PDF understanding is an active area of research and rapid progress is being made. Challenging problems that seem intractable now may well become tractable in the near future as new techniques emerge. It's an exciting time, and we look forward to seeing what new innovations arise in this space. Stay tuned!
Silatus AI helped to write this article.