Last week on Silatus Weekly, we gave our thoughts on OpenAI’s new RAG pipeline. To be brief, we thought the all-in-one pipeline was interesting on paper, but mediocre at best in practice. This week, however, we are taking a deep dive into the latest model, GPT-4-Turbo-128k.
GPT-4-Turbo-128k's headline features are largely self-explanatory, so we will give a quick overview of the model. First, GPT-4-Turbo-128k is faster than the previous GPT-4. Earlier versions such as GPT-4-0314 and GPT-4-0613 were notoriously slow, often taking a minute or more to complete a prompt. GPT-4-Turbo-128k finishes comparable tasks much more quickly; for example, it can generate a poem within 30 seconds. Furthermore, the new context length is 128k tokens, which works out to roughly 300 pages of text!
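As a quick sanity check on that 300-page figure, here is a back-of-the-envelope estimate. It assumes the common rules of thumb of roughly 0.75 English words per token and roughly 300 words per printed page; neither number comes from OpenAI's announcement.

```python
# Rough estimate of how many printed pages fit in a 128k-token context window.
# Assumes ~0.75 words per token and ~300 words per page (rules of thumb, not official figures).
context_tokens = 128_000
words_per_token = 0.75
words_per_page = 300

estimated_pages = context_tokens * words_per_token / words_per_page
print(f"~{estimated_pages:.0f} pages")  # ~320 pages
```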
Surprisingly, GPT-4-Turbo-128k's per-token rate is half that of GPT-4. So, why would OpenAI launch a superior model at half the price? The Silatus team has several theories, but today we'll focus on one: quantization.
Quantization is a process that reduces the computational complexity of a model. It converts continuous values, or large sets of values, into a limited set of possibilities, making the model more efficient without significantly affecting performance. In essence, quantization shrinks the model by representing its weights with fewer bits. This not only lessens GPU memory demands but also speeds up model loading, and it lets models be trained and served on less hardware. Common precision sizes include 16-bit, 8-bit, and 4-bit.
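To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization using NumPy. This is a toy illustration of the general technique, not how OpenAI or any particular library implements it.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric 8-bit quantization: map float weights onto 255 integer levels."""
    scale = np.max(np.abs(weights)) / 127.0  # one float stored alongside the int8 data
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4096, 4096).astype(np.float32)  # ~67 MB in float32
q, scale = quantize_int8(weights)                          # ~17 MB in int8 (4x smaller)
error = np.abs(weights - dequantize_int8(q, scale)).mean()
print(f"mean absolute error after round-trip: {error:.5f}")
```

The point of the sketch is that each weight now occupies a single byte plus a shared scale factor, which is where the memory and loading-time savings come from.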
Typically, a smaller precision size results in faster inference. However, model performance can deteriorate at 4-bit precision, so for many developers 8-bit precision is the most practical balance. Quantization also applies to fine-tuning: popular open-source techniques such as LoRA and QLoRA use it to save memory, time, and cost. LoRA is typically run at 16-bit or 8-bit precision, while QLoRA goes further and quantizes the frozen base model to 4-bit.
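For readers who want to see what this looks like in practice, here is a sketch of a QLoRA-style setup using the open-source Hugging Face transformers, peft, and bitsandbytes libraries. The model name is a placeholder open model, and the hyperparameters are typical defaults rather than anything OpenAI has disclosed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with its weights quantized to 4-bit (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type proposed in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # matrix multiplies still run in 16-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder open model, not GPT-4
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; the 4-bit base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # typically well under 1% of the full model
```

Because only the small adapter matrices are trained while the quantized base model stays frozen, fine-tuning fits on far less hardware than full-precision training would require.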
Considering the speed and cost of GPT-4-Turbo-128k, it's possible that OpenAI uses quantization, among other methods. It's also plausible that GPT-4-Turbo-128k was fine-tuned with a LoRA- or QLoRA-style approach. Remember, GPT-4's training data only extended up to September 2021, while GPT-4-Turbo-128k's data includes facts up to April 2023. Considering GPT-4 was only released in March 2023, OpenAI has impressively brought GPT-4-Turbo-128k up to date with two additional years of data in a very short time. This fast turnaround further suggests the use of quantization.
Ultimately, only OpenAI knows the specifics of their operations. However, the evidence strongly hints at the use of quantization. Quantization speeds up training and inference, reduces memory requirements, and consequently saves costs. In short, it promotes speed, efficiency, and affordability, making it a natural fit for the development and improvement of models like GPT-4-Turbo-128k.
Silatus AI helped to write this article.