How to optimize prompts to reduce LLM costs

By optimizing a prompt, it is possible to drastically reduce the number of tokens used by an LLM and therefore its cost.

Mastering the art of prompting with restraint will undoubtedly be a key skill in 2024. Long, inefficient prompts have no direct impact on the price of a ChatGPT subscription, but calling large language models through an API quickly becomes expensive, especially in production. To bill for model usage, most providers charge per token (except Google, which bills its Gemini model per character). Generally, the cost per token is higher for output (the text the LLM produces) than for input (the prompt and associated resources).

For example, in January 2024, OpenAI's GPT-4 (8K) has one of the highest per-token prices among proprietary LLMs: 30 dollars per million input tokens and 60 dollars per million output tokens. However, per-token pricing has a major bias: the tokenization process varies from one model to another. Consequently, the same prompt will contain a different number of tokens depending on the LLM used.
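To make the arithmetic concrete, here is a minimal Python sketch that applies the two GPT-4 (8K) prices quoted above to a single API call. The workload figures (prompt size, answer size, calls per day) are hypothetical and only serve to illustrate the calculation.

```python
# Prices quoted in the article for GPT-4 (8K), in US dollars per million tokens.
INPUT_PRICE_PER_MILLION = 30.0
OUTPUT_PRICE_PER_MILLION = 60.0


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in dollars of one API call billed per token."""
    input_cost = input_tokens * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 1,500-token prompt, 500-token answer, 10,000 calls a day.
per_call = request_cost(1_500, 500)
per_day = per_call * 10_000
print(f"per call: ${per_call:.3f}, per day: ${per_day:,.2f}")
```

Under these assumed volumes, trimming a few hundred tokens from every prompt changes the daily bill noticeably, which is why prompt length matters far more in production than in occasional use.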

Comparison of token pricing for each LLM. © Philipp Schmid / Hugging Face

In the beginning was tokenization

To understand this difference, it is necessary to understand the tokenization process. Before a large language model such as those behind ChatGPT is trained, tokenization divides the text of the training corpus into units called tokens. “GPT 3.5 and 4 use more than 100,000 unique tokens to represent the vocabulary. To generate this set of tokens optimally, ChatGPT is based on the most frequent characters and words in the corpus,” explains Nicolas Cavallo, head of generative AI at Octo Technology. In concrete terms, the algorithm first looks for the most frequent sequences of two characters. “For example, if it detects that ‘th’ is very frequent, it will add the token ‘th’. Then iteratively, it merges the most probable tokens, like ‘th’ and ‘e’ to generate the token ‘the’,” explains the expert.

The process is repeated until the vocabulary reaches its target size, with the aim of dividing the corpus into tokens that represent the text as compactly as possible. The ultimate goal is to maximize the model’s performance for a given number of tokens. The result is a custom tokenization based on the statistics of the corpus, which vary with the languages it contains and the frequencies of character sequences. Although most large language models use a similar subword tokenization approach (Byte-Pair Encoding, or BPE), there are variations, because each model trains its own tokenizer on its own training corpus. Depending on the size and composition of that corpus, the statistics of frequent character sequences change.
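The merge loop described above can be sketched in a few lines of Python. This is a toy illustration of the Byte-Pair Encoding idea, not the tokenizer of any real model: the corpus is a made-up sentence, words are split on spaces, and training simply stops after a fixed number of merges chosen by the caller.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Word = Tuple[str, ...]
Pair = Tuple[str, str]


def most_frequent_pair(words: Dict[Word, int]) -> Optional[Pair]:
    """Return the adjacent symbol pair with the highest total count, if any."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words: Dict[Word, int], pair: Pair) -> Dict[Word, int]:
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus: str, num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges


def tokenize(word: str, merges: List[Pair]) -> List[str]:
    """Split a word into characters, then apply the learned merges in order."""
    words = {tuple(word): 1}
    for pair in merges:
        words = merge_pair(words, pair)
    return list(next(iter(words)))


merges = train_bpe("the the the then this that", num_merges=2)
print(merges)                    # [('t', 'h'), ('th', 'e')]
print(tokenize("then", merges))  # ['the', 'n']
```

On this tiny corpus the first merge is ‘t’ + ‘h’ and the second is ‘th’ + ‘e’, mirroring the example in the quote. A different corpus would yield different merges, which is why two models rarely split the same prompt in the same way.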

Language, a major cost variable

Since each LLM breaks text down in its own way, the number of tokens differs from one model to the next. There are several ways to count tokens: use the tools provided by vendors when they exist (such as OpenAI's Tokenizer), or proceed iteratively, reading the token counts that the API returns with each query.

Beyond the difference between models, there is also a real gap between the languages used with LLMs. The overrepresentation of English in most training corpora leads to a tokenization that is better optimized for that language. “In addition to the encoding used (UTF-8), a text in French will be tokenized with about 32% more tokens (at an equal character count) compared to the same text in English with the GPT-3.5 model,” says Nicolas Cavallo. To reduce prompting costs, the advice is therefore to prompt in English, even to generate French text. On a production system, the cost difference can be significant.
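As a rough illustration of the iterative approach, the sketch below reads the token counts from two API responses and compares them. It assumes responses shaped like OpenAI's chat completion JSON, which carries a `usage` object with `prompt_tokens` and `completion_tokens`. The two payloads are hand-written samples, with numbers chosen to mirror the 32% figure quoted above; they are not measurements.

```python
import json
from typing import Tuple

# Hand-written sample payloads (illustrative numbers, not real measurements),
# shaped like the "usage" object of an OpenAI chat completion response.
ENGLISH_RESPONSE = '{"usage": {"prompt_tokens": 100, "completion_tokens": 40, "total_tokens": 140}}'
FRENCH_RESPONSE = '{"usage": {"prompt_tokens": 132, "completion_tokens": 40, "total_tokens": 172}}'


def read_usage(raw_response: str) -> Tuple[int, int]:
    """Return (prompt_tokens, completion_tokens) from a raw JSON response."""
    usage = json.loads(raw_response)["usage"]
    return usage["prompt_tokens"], usage["completion_tokens"]


def relative_overhead(baseline_tokens: int, other_tokens: int) -> float:
    """Return how many more tokens `other` uses than `baseline`, as a fraction."""
    return (other_tokens - baseline_tokens) / baseline_tokens


english_prompt, _ = read_usage(ENGLISH_RESPONSE)
french_prompt, _ = read_usage(FRENCH_RESPONSE)
print(f"French prompt overhead: {relative_overhead(english_prompt, french_prompt):.0%}")
```

Logging these counts for every variant of a prompt is a simple way to check, on your own data and with your own model, whether a rewrite or a switch to English actually saves tokens.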

Another factor influencing tokenization, secondary but still relevant, is the choice of vocabulary. Since tokenization depends directly on how often expressions appeared in the data the model saw during training, a common expression will logically require fewer tokens than a rare one. “The results show that if ‘Bonjour’ is written with a capital ‘B’, it only requires one token with GPT-3. On the other hand, if ‘bonjour’ is written entirely in lowercase, it is encoded with 2 distinct tokens (‘bon’ and ‘jour’),” notes Nicolas Cavallo. Using simple, common vocabulary in the prompt can therefore further reduce the number of tokens and, with it, the overall cost.

“Bonjour” requires one fewer token than “bonjour”. © JDN / OpenAI

Open source models (also) affected

When running an open source model on-premise or in the cloud, the problem is different. The cost of prompting is not measured directly per token but depends on the resources needed to run the model. However, the more tokens the prompt (or output) requires, the longer generation takes. Resources are therefore occupied for longer, which raises the overall operating cost.
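A back-of-the-envelope sketch of this relationship follows. Every number in it is a hypothetical assumption (hourly machine price, prompt-processing and generation speeds), and the model is deliberately simple: it ignores batching, idle time and startup costs.

```python
def self_hosted_cost(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float,
    decode_tokens_per_s: float,
    hourly_price: float,
) -> float:
    """Estimate the compute cost in dollars of one request on a rented machine."""
    # Time to read the prompt plus time to generate the answer, in seconds.
    seconds = prompt_tokens / prefill_tokens_per_s + output_tokens / decode_tokens_per_s
    return seconds * hourly_price / 3600


# Hypothetical machine: $2 per hour, 2,000 prompt tokens/s, 50 generated tokens/s.
long_request = self_hosted_cost(1_000, 500, 2_000, 50, 2.0)
short_request = self_hosted_cost(500, 250, 2_000, 50, 2.0)
print(f"long: ${long_request:.5f}, short: ${short_request:.5f}")
```

In this simplified model, halving both the prompt and the output halves the time the machine is busy, and therefore the cost attributed to the request.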

Currently, aside from using English and simpler wording, there are few truly effective methods to reduce the number of tokens. However, the state of the art is evolving very quickly and new solutions could soon emerge. Very recently, for example, Microsoft presented LLMLingua, a model capable of identifying and removing non-essential tokens in prompts. The researchers promise compression of up to 20 times with minimal performance loss. These promising results still need to be tested and adapted to specific use cases.
