Mathematics can turn treacherous in large-scale data centers. Thanks to the sheer volume of calculation going on in hyperscale facilities, with their millions of nodes and vast amounts of silicon, extremely improbable errors crop up around the clock. These rare, “silent” data errors don’t show up in traditional quality-control screening, even when companies spend hours looking for them.
This month at the IEEE International Reliability Physics Symposium in Monterey, California, Intel engineers described a technique that uses reinforcement learning to rapidly expose more silent data errors. The company is using the machine-learning method to assure the quality of its Xeon processors.
When an error turns up in a data center, operators can either pull the node and replace it, or relegate the flawed system to lower-stakes computing, says Manu Shamsa, an electrical engineer at Intel’s campus in Chandler, Arizona. But it would be far better to detect the errors earlier, ideally before a chip is ever installed in a computer system, when the design or manufacturing process can still be changed to prevent the flaw from recurring.
Finding these flaws is not easy. Shamsa says engineers have been so mystified by them that he jokes they must be due to “spooky action at a distance,” Einstein’s phrase for quantum entanglement. But there is nothing spooky about them, and Shamsa has spent years characterizing them. In a paper presented at the same conference last year, his team provided a comprehensive list of the causes of these errors. Most come down to infinitesimal variations in manufacturing.
Even if every one of the billions of transistors on a chip is functional, they are not all exactly alike. Subtle differences in how a given transistor responds to changes in temperature, voltage, or frequency, for example, can cause an error.
Those subtleties are more likely to crop up in huge data centers because of the pace of computing and the sheer quantity of silicon involved. “In a laptop, you will never notice an error. In data centers, with really dense nodes, there are high chances that the stars will align and there will be an error,” Shamsa says.
Some errors can crop up only after a chip has been running for months in a data center, as small shifts in transistor properties wear it down over time. One such silent error Shamsa has found is related to electrical resistance: a transistor that works properly at first, and passes the standard tests that look for shorts, degrades with use until it becomes too resistive.
“You’re thinking that everything is fine, but underneath, an error is making a wrong decision,” Shamsa says. Over time, thanks to a slight weakness in a single transistor, “a plus one goes wrong, quietly, until you see the effect,” he says.
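The kind of failure Shamsa describes can be illustrated in miniature. The hypothetical Python sketch below (not Intel code; `flip_bit` is an invented stand-in for a hardware fault) shows how a single flipped bit silently corrupts an addition, with no exception or warning:

```python
# Hypothetical sketch: a single bit flip silently corrupts arithmetic.
# No exception is raised; the result is simply wrong.

def flip_bit(value: int, bit: int) -> int:
    """Simulate a hardware fault by flipping one bit of an integer."""
    return value ^ (1 << bit)

a, b = 1, 1
correct = a + b              # 2, as expected
faulty = flip_bit(a, 1) + b  # the stored 1 became 3, so 3 + 1 = 4
print(correct, faulty)       # prints: 2 4
```

The faulty result propagates into downstream computation just like a correct one, which is why such errors are called “silent.”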
The new technique builds on an existing set of methods for detecting silent errors called Eigen tests. These tests give the chip hard math problems over time, in the hope of flushing silent errors into the open. They include operations on matrices of various sizes filled with random data.
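The idea can be sketched as follows; this is an illustrative toy in Python, not the actual Eigen-test implementation. Random matrices are fed to the unit under test, and any mismatch against an independent reference multiply flags a silent error:

```python
import random

def matmul(a, b):
    """Plain triple-loop matrix multiply, used as the reference result."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def eigen_style_test(compute, size=8, trials=100, seed=1):
    """Feed random matrices to `compute`; any mismatch against the
    reference multiply indicates a silent data error."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [[rng.random() for _ in range(size)] for _ in range(size)]
        b = [[rng.random() for _ in range(size)] for _ in range(size)]
        if compute(a, b) != matmul(a, b):
            return False  # silent error exposed
    return True

print(eigen_style_test(matmul))  # a healthy unit passes: True
```

A real test runs on the hardware’s own arithmetic units rather than comparing a function against itself, but the structure is the same: random inputs, a trusted reference, and a mismatch check.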
There is an enormous number of possible Eigen tests. Running all of them would take an impractical amount of time, so chipmakers use a random approach to generate a manageable subset. That saves time, but it lets errors slip through. “There’s no principle to guide the selection of inputs,” Shamsa says. He wanted a way to guide the selection so that a relatively small number of tests could uncover more errors.
The Intel team used reinforcement learning to develop tests for the part of its Xeon CPU that multiplies matrices, which carries out the fused multiply-add (FMA) instruction. Shamsa says he chose the FMA unit because it occupies a relatively large area of the chip, making it more vulnerable to potential silent errors: more silicon, more problems. What’s more, defects in this part of a chip can produce electromagnetic fields that affect other parts of the system. And because the FMA unit is switched off to save power when it is not in use, testing it involves repeatedly powering it down and back up, to activate latent defects that would not otherwise show up in standard tests.
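Conceptually, the test loop alternates power-state transitions with FMA workloads. The hypothetical Python sketch below is only a structural outline: `power_gate` is an invented stub standing in for real hardware power gating, and the check compares each fused multiply-add style result (a·b + c) against an independent recomputation:

```python
import random

def power_gate(on: bool) -> None:
    """Invented stand-in: a real test toggles the FMA unit's power
    state in hardware; here it is a no-op stub."""
    pass

def fma_workload(fma, trials=1000, seed=0) -> bool:
    """Stress the unit-under-test `fma` with random a*b + c operations
    and check each result against an independent recomputation."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c = (rng.uniform(-1.0, 1.0) for _ in range(3))
        if fma(a, b, c) != a * b + c:
            return False  # silent error exposed
    return True

def power_cycle_test(fma, cycles=50) -> bool:
    """Repeatedly power the unit down and back up, then stress it,
    to provoke defects that appear only after a power transition."""
    for _ in range(cycles):
        power_gate(False)
        power_gate(True)
        if not fma_workload(fma):
            return False
    return True

healthy = lambda a, b, c: a * b + c
print(power_cycle_test(healthy))  # prints: True
```

Note that real hardware FMA rounds once rather than twice, so a faithful reference check on silicon is subtler than this float comparison suggests.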
During each stage of training, the reinforcement-learning agent selects different tests to run on a potentially defective chip. Every detected error counts as a reward, and over time the agent learns to choose the tests that maximize the chances of detecting errors. After approximately 500 testing cycles, the algorithm had learned the set of Eigen tests best suited to catching errors in the FMA unit.
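The training loop described above can be sketched as a simple multi-armed-bandit learner. This is a hypothetical toy, not Intel’s implementation: each “test” has some probability (unknown to the agent) of exposing a defect, each detection is a reward of 1, and an epsilon-greedy agent gradually concentrates on the tests with the highest detection rates:

```python
import random

def train_test_selector(detect_probs, cycles=500, eps=0.1, seed=42):
    """Epsilon-greedy bandit: learn which tests maximize the chance of
    detecting errors. detect_probs[i] is the (hidden) probability that
    test i exposes a defect on a given run."""
    rng = random.Random(seed)
    n = len(detect_probs)
    counts = [0] * n      # how many times each test has been run
    values = [0.0] * n    # running estimate of each test's detection rate
    for _ in range(cycles):
        if rng.random() < eps:                       # explore
            i = rng.randrange(n)
        else:                                        # exploit best estimate
            i = max(range(n), key=lambda j: values[j])
        reward = 1.0 if rng.random() < detect_probs[i] else 0.0
        counts[i] += 1
        values[i] += (reward - values[i]) / counts[i]  # incremental mean
    return values

# Three hypothetical tests with different (hidden) detection rates.
estimates = train_test_selector([0.05, 0.10, 0.40])
best = max(range(len(estimates)), key=lambda i: estimates[i])
```

Intel’s actual agent operates over real Eigen tests and hardware feedback, but the reward structure, per-cycle test selection, and roughly 500 training cycles map onto this shape.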
Shamsa says this technique is more likely to detect a defect than a randomly chosen Eigen test. The Eigen tests are open source, part of the OpenDCDiag framework for data centers. He says other users should be able to apply reinforcement learning to adapt these tests to their own systems.
To some extent, silent, subtle flaws are an unavoidable part of the manufacturing process; perfection and uniformity remain out of reach. But Shamsa says Intel is trying to use this research to learn to find precursors that foreshadow silent data errors. He is investigating whether there are red flags that could provide early warning of future errors, and whether chip recipes or designs can be adjusted to manage them.