January 2025 shook the AI landscape. The seemingly invincible OpenAI and the powerful American tech giants were shocked by what we can certainly call an underdog in the field of large language models (LLMs). DeepSeek, a Chinese firm not on anyone's radar, suddenly challenged OpenAI. It is not that DeepSeek-R1 was better than the top models from the American giants; it was slightly behind in terms of benchmarks. But it suddenly made everyone think about efficiency in terms of hardware and energy use.
Given the unavailability of the best high-end hardware, it seems that DeepSeek was motivated to innovate in the area of efficiency, which was a lesser concern for the larger players. OpenAI has claimed it has evidence suggesting DeepSeek may have used its model for training, but we have no concrete proof to support this. So whether it is true, or whether OpenAI is simply trying to appease its investors, is a topic of debate. However, DeepSeek has published its work, and people have verified that the results are reproducible, at least on a much smaller scale.
But how could DeepSeek achieve such cost savings while American companies could not? The short answer is simple: they had more motivation. The long answer requires a little more technical explanation.
DeepSeek used KV-cache optimization
One important cost saving for GPU memory was the optimization of the key-value (KV) cache used in every attention layer in an LLM.
LLMs are made up of transformer blocks, each of which comprises an attention layer followed by a plain vanilla feed-forward network. The feed-forward network conceptually models arbitrary relationships, but in practice, it is difficult for it to always determine patterns in the data on its own. The attention layer solves this problem for language modeling.
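As a rough sketch (my own illustration, not DeepSeek's code), a transformer block chains the two layers with residual connections; layer normalization and multi-head details are omitted for brevity, and the attention layer is passed in as a placeholder callable:

```python
import numpy as np

def feed_forward(x, w1, w2):
    """Plain vanilla feed-forward network: expand, apply a non-linearity, project back."""
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU non-linearity

def transformer_block(x, attention_layer, w1, w2):
    """x: (num_words, d_model). The attention layer mixes information across words;
    the feed-forward network then transforms each word independently."""
    x = x + attention_layer(x)        # words get modified by their context
    x = x + feed_forward(x, w1, w2)   # per-word transformation
    return x

# Demo with random weights and a do-nothing attention placeholder.
d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((5, d))  # five words in context
w1, w2 = rng.standard_normal((d, 4 * d)), rng.standard_normal((4 * d, d))
out = transformer_block(x, lambda h: np.zeros_like(h), w1, w2)
```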
The model processes text using tokens, but for simplicity, we will refer to them as words. In an LLM, each word is assigned a vector in a high-dimensional space (say, a thousand dimensions). Conceptually, each dimension represents a concept, like being hot or cold, being green, being soft, being a noun. A word's vector representation is its meaning and its values along each of these dimensions.
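As a toy illustration with invented numbers and only four dimensions (real models learn these values and the dimensions are not human-labeled):

```python
# Hypothetical concept dimensions; actual embedding dimensions are learned, not named.
concepts = ["green-ness", "is_fruit", "is_noun", "is_soft"]

embedding = {
    "apple": [0.3, 0.9, 1.0, 0.4],  # somewhat green, strongly a fruit, a noun
    "green": [1.0, 0.0, 0.1, 0.0],  # strongly about green-ness
    "table": [0.0, 0.0, 1.0, 0.2],  # a noun, nothing to do with green or fruit
}
```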
However, our language allows other words to modify the meaning of each word. For example, an apple has a meaning. But we can have a green apple as a modified version. A more extreme example of modification would be that an apple in an iPhone context differs from an apple in a meadow context. How do we let our system modify the vector meaning of a word based on another word? This is where attention comes in.
The attention model assigns two additional vectors to each word: a key and a query. The query represents the qualities of a word's meaning that can be modified, and the key represents the modifications it can provide to other words. For example, the word 'green' can provide information about color and green-ness. So, the key of the word 'green' will have a high value on the 'green-ness' dimension. On the other hand, the word 'apple' can be green or not, so the query vector of 'apple' will also have a high value on the green-ness dimension. If we take the dot product of the key of 'green' with the query of 'apple', the product should be relatively large compared to the product of the key of 'table' and the query of 'apple'. The attention layer then adds a small fraction of the value of the word 'green' to the value of the word 'apple'. This way, the value of the word 'apple' is modified to be a little greener.
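To make this concrete, here is a tiny numerical sketch of that dot-product attention update, with invented four-dimensional keys, queries and values; real models learn thousand-dimensional versions of these during training:

```python
import numpy as np

# Keys say what a word can modify in others; queries say what can be modified in it.
key   = {"green": np.array([1.0, 0.0, 0.0, 0.0]),   # offers green-ness information
         "table": np.array([0.0, 0.0, 0.9, 0.0])}   # offers noun-ness, no green-ness
query = {"apple": np.array([0.9, 0.1, 0.0, 0.0])}   # open to modification on green-ness
value = {"green": np.array([0.8, 0.0, 0.0, 0.0]),
         "table": np.array([0.0, 0.0, 0.5, 0.0])}

# Dot products: 'green' matches the query of 'apple' far better than 'table' does.
scores = np.array([key[w] @ query["apple"] for w in ("green", "table")])
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over the context words

apple_value = np.array([0.3, 0.9, 1.0, 0.4])
apple_value = apple_value + weights[0] * value["green"] + weights[1] * value["table"]
print(apple_value)  # the value of 'apple' is now a little greener
```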
When an LLM generates text, it does so one word after another. When it generates a word, all the previously generated words become part of its context. However, the keys and values of those words have already been computed. When a new word is added to the context, its value needs to be updated based on its own query and the keys and values of all the previous words. That is why all those keys and values are stored in GPU memory. This is the KV cache.
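A minimal sketch of why the cache helps, assuming hypothetical learned projection matrices wq, wk and wv; without the cache, every generation step would recompute the keys and values of the entire context from scratch:

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # keys and values of all previous words stay in GPU memory

def generate_step(new_word_vec):
    """Process one newly generated word without recomputing old keys and values."""
    q = new_word_vec @ wq
    k_cache.append(new_word_vec @ wk)  # computed once, reused at every later step
    v_cache.append(new_word_vec @ wv)
    keys, values = np.stack(k_cache), np.stack(v_cache)
    scores = keys @ q / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum()
    return new_word_vec + weights @ values  # updated value of the new word

for _ in range(5):  # each step only computes one new key and value
    out = generate_step(rng.standard_normal(d))
```

The memory cost is the catch: the cache grows with the context length, holding one full key and one full value per word.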
DeepSeek determined that the key and the value of a word are related. The meaning of the word green and its ability to affect the greenness of other words are obviously very closely related. So it is possible to compress both into a single (and perhaps smaller) vector and decompress it very easily while processing. DeepSeek found that this affects their performance on benchmarks, but it saves a lot of GPU memory.
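Here is a rough sketch of the compression idea, in the spirit of DeepSeek's multi-head latent attention but with illustrative sizes and random matrices rather than their actual architecture:

```python
import numpy as np

d_model, d_latent = 1024, 128  # the latent is several times smaller than the model width
rng = np.random.default_rng(0)
w_down = rng.standard_normal((d_model, d_latent))  # compress into a latent vector
w_up_k = rng.standard_normal((d_latent, d_model))  # decompress latent -> key
w_up_v = rng.standard_normal((d_latent, d_model))  # decompress latent -> value

word_vec = rng.standard_normal(d_model)
latent = word_vec @ w_down  # this small vector is all that goes into the cache

# At attention time, recover an approximate key and value from the shared latent.
key, value = latent @ w_up_k, latent @ w_up_v

print(latent.size, "floats cached instead of", 2 * d_model)  # 128 vs. 2048
```

The trade-off is a little extra computation to decompress on the fly, exchanged for a much smaller per-word cache.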
DeepSeek implemented MoE
The nature of a neural network is that the entire network needs to be evaluated (or computed) for every query. However, not all of this is useful computation. Knowledge of the world sits in the weights, or parameters, of a network. Knowledge about the Eiffel Tower is not used to answer questions about the history of South American tribes. Knowing that an apple is a fruit is not useful while answering questions about the general theory of relativity. However, when the network is computed, all parts of it are processed regardless. This incurs huge computation costs during text generation that should ideally be avoided. This is where the idea of the mixture of experts (MoE) comes in.
In an MoE model, the neural network is divided into multiple smaller networks called experts. Note that the 'expertise' in a subject is not explicitly defined; the network figures it out during training. However, the network assigns a relevance score to each expert for each query and only activates the experts with the highest matching scores, as in the sketch below. This provides huge savings in computation. Note that some questions need expertise in multiple areas to be answered properly, and the performance of the model on such questions will be degraded. However, because the areas of expertise are figured out from the data, the number of such questions is minimized.
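A minimal sketch of such routing, with a hypothetical top-k gating scheme and toy matrices standing in for the experts (real MoE layers use full feed-forward networks as experts, and DeepSeek's routing details differ):

```python
import numpy as np

d, n_experts, top_k = 64, 8, 2
rng = np.random.default_rng(0)
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # tiny "experts"
w_gate = rng.standard_normal((d, n_experts))                       # learned router

def moe_layer(x):
    scores = x @ w_gate                   # relevance score of each expert for this query
    chosen = np.argsort(scores)[-top_k:]  # activate only the best-matching experts
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    # Only top_k of the n_experts networks are computed, so the cost per
    # query drops roughly by a factor of top_k / n_experts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

out = moe_layer(rng.standard_normal(d))
```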
The importance of reinforcement learning
An LLM is typically taught to think through a chain-of-thought approach, with the model fine-tuned to imitate thinking before delivering the answer. The model is asked to verbalize its thought process (generate the thought before generating the answer). The model is then evaluated on both the thought and the answer, and trained with reinforcement learning (rewarded for a correct match and penalized for an incorrect match with the training data).
This requires expensive training data annotated with thought tokens. DeepSeek instead only asked the system to generate its thoughts between the tags <think> and </think> and the answer between the tags <answer> and </answer>, rewarding the model purely on the format of the output and the correctness of the final answer. Useful reasoning then emerged on its own through reinforcement learning, without expensive human-annotated thoughts.
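To illustrate, here is a sketch of the kind of rule-based reward this setup enables; this is my reading of the approach with a hypothetical reward function and scoring values, not DeepSeek's actual training code:

```python
import re

def reward(completion: str, reference_answer: str) -> float:
    """Reward format compliance and answer correctness only;
    the thoughts between <think> tags are never graded."""
    match = re.fullmatch(
        r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*",
        completion, re.DOTALL,
    )
    if match is None:
        return -1.0  # penalize malformed output (missing or broken tags)
    answer = match.group(2).strip()
    return 1.0 if answer == reference_answer else -0.5  # grade the answer, not the thought

print(reward("<think>2 + 2 is 4</think><answer>4</answer>", "4"))  # 1.0
```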
DeepSeek employs several additional optimization tricks. However, they are highly technical, so I will not delve into them here.
Final thoughts on DeepSeek and the wider market
In any technology research, we first need to explore what is possible before improving efficiency. This is a natural progression. DeepSeek's contribution to the LLM landscape is remarkable. Their academic contribution cannot be ignored, whether or not their model was trained using OpenAI's output. It can also transform the way startups operate. But there is no reason for OpenAI or the other American giants to despair. This is how research works: one group benefits from the research of other groups. DeepSeek certainly benefited from the earlier research done by Google, OpenAI and numerous other researchers.
However, the idea that OpenAI's LLMs will dominate the world indefinitely now looks very unlikely. No amount of regulatory lobbying or finger-pointing will preserve their monopoly. The technology is already out in the open, which makes its progress unstoppable. Although this may be a bit of a headache for OpenAI's investors, it is ultimately a win for the rest of us. While the future belongs to many, we will always be grateful to early contributors like Google and OpenAI.
Debasish Ray Chawdhuri is a senior principal engineer at Talentica Software.

