A few days ago, we reported that owners of the AMD Radeon RX 7900 XTX are affected by a problem that drives the GPU hot-spot temperature up to 110 degrees Celsius, resulting in thermal throttling (downclocking).
So far, the problem appears to affect only the MBA (Made-by-AMD) reference design. Owners have grown frustrated after AMD rejected their RMA requests with the reply that “Temperatures are normal”.
Andreas Schilling of hardwareLUXX and Roman Hartung, better known as der8auer, both investigated the issue with the help of their respective communities. They confirmed a large gap between the GPU temperature and the hot-spot temperature on the MBA cards, which in turn causes the thermal throttling.
der8auer: In-Depth Testing of the Issue
In his initial investigation, Roman tested only one reference card. He concluded that the issue could have any number of causes, such as poor mounting pressure or uneven thermal paste spread, but that pinpointing the exact cause was difficult with a sample of just one card.
Since he posted that video, however, members of the community have reached out to Roman with a total of 48 confirmed cases in which the hot-spot temperature reaches 110°C: 29 with the card mounted horizontally and 19 with it mounted vertically. That is a concerning number, considering that der8auer's audience is probably mostly German.
Many affected viewers even offered to send Roman their cards for further testing. He ended up buying four cards from his viewers and set out to test them thoroughly to determine what is causing the issue and how serious it is.
Roman's methodology was thorough: every card was tested on an open bench for at least 10 minutes, using both FurMark and Remnant: From the Ashes as a gaming workload.
From there, he used a process of elimination to zero in on the cause. First, by testing the cards in both vertical and horizontal orientations, he found that the issue is much worse when the card is mounted horizontally.
In the top part of the graph, which shows the hot-spot temperatures of the tested cards, you can see the temperature reach 110 degrees within minutes in the horizontal test, causing the affected cards to throttle.
When the cards were mounted vertically, the fans typically ran at 1,700 to 1,800 RPM. In the horizontal orientation, the fans struggled to bring the temperature down and ran well above 2,000 RPM; one card even reached 2,800 RPM.
To rule out gravity as the cause in the horizontal orientation (with insufficient mounting pressure, gravity could pull the cooler down and away from the GPU), Roman built a rig that cancelled out its effect. The results did not change.
He then disassembled the card to determine whether the cooler itself could be at fault. The cooler consists of the vapor chamber, which cools the GPU and memory, and what Roman calls the ‘mid-plate’, which cools the VRMs.
Theorizing that mid-plate stands taller than necessary could prevent the PCB from making proper contact with the cooler, he cut a few millimeters off them.
With the shortened stands, reassembling the cooler produces an enormous amount of mounting pressure, so if insufficient pressure were the culprit, this step should have resolved the issue. It did not: the affected cards went straight back to 110 degrees in both the horizontal and vertical positions.
This left Roman with only one conclusion: the vapor chamber itself must be faulty. To confirm this, he ran a flip test. He first ran the cards in a vertical position, where temperatures stayed within normal ranges and there were no signs of the issue.
He then flipped the cards to a horizontal position mid-test, which quickly pushed the hot-spot temperature to 110 degrees. Flipping them back to vertical did not resolve the issue; the temperature stayed at 110 degrees.
The Vapor Chamber
The liquid inside the vapor chamber goes through a continuous cycle: it absorbs heat from the GPU and vaporizes, the vapor travels to the cooler area under the fins, where it releases its heat and condenses back into liquid, and the liquid then returns to the area near the GPU to repeat the cycle.
The flip test suggests that something may be preventing the vapor from condensing back into liquid, breaking the cooling cycle. According to Roman, several factors could cause this, such as incorrect pressure inside the chamber, the wrong amount of liquid, or a mechanical design flaw in the vapor chamber itself.
The irony is that during Nvidia's melting 12VHPWR power connector problems, AMD executives mocked Nvidia by recommending the RX 7900 XTX as the ‘safer’ option. Sasa Marinkovic, AMD's senior director of gaming marketing, tweeted “Stay safe this holiday season” alongside an image of the RX 7900 XTX's dual 8-pin connectors.
Stay safe this holiday season. @amdradeon pic.twitter.com/DOpg0f2qaP
— Sasa Marinkovic (@SasaMarinkovic) November 17, 2022
In an earlier statement, AMD said it is aware of the hot-spot temperature issue causing thermal throttling on reference RX 7900 XTX models and that its GPU team is investigating it.
Bear in mind that the issue is not limited to cards sold by AMD itself; it potentially affects any partner cards that use the MBA design.
It will be interesting to see exactly how AMD remedies the situation, but at this point even a recall of MBA cards would not be surprising, given that the issue potentially affects thousands of customers.
Source: der8auer