Matrix Core Programming on AMD CDNA Architecture

(blogs.amd.com)

56 points | by salykova 5 days ago

2 comments

  • phkahler 20 hours ago
    So from CDNA3 to 4 they doubled fp16 and fp8 performance but cut fp32 and fp64 by half?

    Wonder why the regression on non-AI workloads?

    • adrian_b 16 hours ago
      Because those who have money to invest nowadays do not put it into the research problems whose solutions are urgently needed for the survival of humanity, e.g. developing technologies for using all substances in closed cycles (as the biosphere did before humans). Instead they pour all their money into research chasing the dream of AGI, which even if successful will benefit only a small number of humans, not all of mankind.

      The fp64 and fp32 performance is needed for physical simulations required by the former goal, while fp16 and fp8 performance is useful only for the latter goal.

      So AMD's choice logically follows the choice of those who control the investment money.

      • Archit3ch 11 hours ago
        > The fp64 and fp32 performance is needed for physical simulations

        In the very unlikely case where

        1) You need fp64 Matrix-Matrix products for physical simulations

        2) You bought the MI355X accelerator instead of hardware better suited for the task

        you can still emulate them with the Ozaki scheme.
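
        Roughly, the scheme splits each fp64 operand into low-precision slices whose pairwise products are exact, runs those products on the fast low-precision units, and sums the partial results in fp64. Below is a toy sketch of the splitting idea for a single dot product, not the full scheme (which works per matrix block, chooses the slice width from the problem size, and feeds the slice products to the matrix units as GEMMs); the constants are picked so the fp32 partial sums stay exact:

          #include <cmath>
          #include <cstdio>
          #include <vector>

          // Toy illustration of Ozaki-style splitting for one dot product.
          // Each fp64 vector is cut into SLICES pieces of at most BITS leading
          // significand bits (relative to the vector's largest entry), so that an
          // fp32 dot product of any two slices is exact as long as roughly
          // 2*BITS + log2(N) stays below fp32's 24 significand bits.
          constexpr int BITS   = 7;   // bits kept per slice (toy choice)
          constexpr int SLICES = 8;   // 8 * 7 bits covers the leading fp64 significand
          constexpr int N      = 64;  // 2*7 + log2(64) = 20 < 24

          std::vector<std::vector<float>> split(std::vector<double> v) {
              double amax = 0.0;
              for (double x : v) amax = std::fmax(amax, std::fabs(x));
              int top = (amax > 0.0) ? std::ilogb(amax) + 1 : 0;
              std::vector<std::vector<float>> s(SLICES, std::vector<float>(v.size()));
              for (int k = 0; k < SLICES; ++k) {
                  // sigma is a power of two chosen so that (x + sigma) - sigma rounds x
                  // to a multiple of 2^(top - (k+1)*BITS), i.e. extracts the next BITS bits.
                  double sigma = std::ldexp(1.0, top - (k + 1) * BITS + 52);
                  for (size_t i = 0; i < v.size(); ++i) {
                      double piece = (v[i] + sigma) - sigma;  // error-free extraction
                      s[k][i] = static_cast<float>(piece);    // few bits -> exact in fp32
                      v[i] -= piece;                          // remainder for the next slice
                  }
              }
              return s;
          }

          int main() {
              std::vector<double> a(N), b(N);
              for (int i = 0; i < N; ++i) { a[i] = std::sin(i + 1.0); b[i] = std::cos(3.0 * i); }

              auto as = split(a), bs = split(b);
              double dot = 0.0;
              for (int p = 0; p < SLICES; ++p)
                  for (int q = 0; q < SLICES; ++q) {
                      float partial = 0.0f;                 // exact low-precision accumulation
                      for (int i = 0; i < N; ++i) partial += as[p][i] * bs[q][i];
                      dot += static_cast<double>(partial);  // cheap fp64 reduction at the end
                  }

              double ref = 0.0;
              for (int i = 0; i < N; ++i) ref += a[i] * b[i];
              std::printf("ozaki-style: %.17g\nplain fp64 : %.17g\n", dot, ref);
          }

        In a real implementation the number of slices, and hence the cost, grows with the precision and dynamic range that have to be covered.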

      • jjtheblunt 12 hours ago
        Expanding (I think) on your point: perhaps it's just a fork into two product lines for different uses?
        • walterbell 10 hours ago
          Will there be future hardware optimized for physical simulations, or should existing/faster hardware be stockpiled now?
          • adrian_b 7 hours ago
            I am still using ancient AMD GPUs, bought between 2015 and 2019, because all later GPUs have much worse FP64 throughput per dollar.

            So I was never able to upgrade them, because all newer GPUs are worse.

            There was a little hope when the latest generation of Intel discrete desktop GPUs (Battlemage) improved their FP64 throughput. While that throughput is relatively modest, i.e. half of a Zen 5 desktop Ryzen, they are extremely cheap, so their performance per dollar is very good. They can therefore be used to multiply the throughput of a desktop computer at a modest additional cost.

            Unfortunately, with the new Intel CEO the future of Intel's GPUs is very unclear, so it is unknown whether they will be followed by better GPUs or be canceled. If Intel stupidly chooses to no longer compete in the GPU market, the last source of GPUs with good FP64 throughput will disappear.

            The datacenter GPUs that still have good FP64 throughput carry huge prices that cannot be justified for any small business or individual. To recover the cost of such a GPU you must have a workload that keeps it busy continuously, day and night, and such workloads must be aggregated from a large number of users. So we have regressed to the time-sharing mainframes of the early 1970s, a step backwards from the freedom of personal computers.

            I see no hope for the future availability of any computing devices with better FP64 throughput per dollar than desktop CPUs. Technically it would be trivial to make such devices, but companies like AMD and NVIDIA do not care about small-business or individual customers, only about selling to other equally huge companies, so they dimension their devices accordingly. They also set fictitious list prices many times greater than the actual prices negotiated with the big companies. While the big companies pay much less, small businesses and individuals cannot buy at anything other than the list prices, which means they must give up on such devices, as they are not worth those prices.

            • walterbell 6 hours ago
              It took about 25 years for the cycle from Napster -> MP3 players -> flash memory -> smartphones -> big data -> big GPUs -> LLMs and generative AI -> OpenAI buying 100% of the remaining memory wafer capacity from SK Hynix and Samsung, leaving little for the edge and bringing 100% price hikes for consumer DIMMs.

              https://openai.com/index/samsung-and-sk-join-stargate/

              > Samsung Electronics and SK hynix plan to scale up production of advanced memory chips, targeting 900,000 DRAM wafer starts per month at an accelerated capacity rollout, critical for powering OpenAI’s advanced AI models.

              We need a new "Napster moment" to restart supply chain investment and business models at the edge. Humanoid robotics might qualify, since robots will need low-latency responses to local sensor input.

              Another factor in edge vs. mainframe economics is the cost of energy in each location.

    • bigdict 20 hours ago
      cuz area and power
      • fancyfredbot 14 hours ago
        Area and power are why there was a choice to make. AI data centre demand is why they made this choice specifically.
    • trueismywork 15 hours ago
      Non-AI workloads prefer vector units and not matrix units
      • adrian_b 6 hours ago
        False.

        While there are indeed parts of the workloads that must be executed in vector units, those parts are limited by the memory interface throughput, not by the computational throughput.

        Only the matrix-matrix operations are limited by the computational throughput rather than the memory throughput. All matrix-matrix operations (this includes the solving of dense systems of equations, which is the most frequent kind of non-AI workload) are better done with dedicated matrix units, because the matrix units reduce the number of memory transfers required to perform them.
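
        To make the dense-solver point concrete: a blocked (right-looking) LU factorization spends nearly all of its flops in the matrix-matrix update of the trailing submatrix, which is exactly the part a matrix unit (or a GEMM library call) absorbs. A rough sketch, without pivoting, just to show where the work goes:

          #include <algorithm>
          #include <vector>

          // Rough sketch of a blocked (right-looking) LU factorization without pivoting,
          // on a column-major n x n matrix.  Per block step, the panel factorization and
          // the triangular solve are O(n * b^2) work, while the trailing update is a plain
          // GEMM and accounts for essentially all of the ~2/3 n^3 flops.
          void lu_blocked(std::vector<double>& A, int n, int b) {
              auto a = [&](int i, int j) -> double& { return A[i + j * n]; };

              for (int k = 0; k < n; k += b) {
                  int kb = std::min(b, n - k);

                  // 1) Unblocked LU of the tall panel A[k:n, k:k+kb] (small, vector-unit work).
                  for (int j = k; j < k + kb; ++j) {
                      for (int i = j + 1; i < n; ++i) a(i, j) /= a(j, j);
                      for (int jj = j + 1; jj < k + kb; ++jj)
                          for (int i = j + 1; i < n; ++i) a(i, jj) -= a(i, j) * a(j, jj);
                  }

                  // 2) Triangular solve: U12 = L11^{-1} * A[k:k+kb, k+kb:n] (still small).
                  for (int j = k + kb; j < n; ++j)
                      for (int jj = k; jj < k + kb; ++jj)
                          for (int i = jj + 1; i < k + kb; ++i)
                              a(i, j) -= a(i, jj) * a(jj, j);

                  // 3) Trailing update: A22 -= L21 * U12.  This is the GEMM where nearly all
                  //    the flops are; in a real code it is a single gemm() call per step.
                  for (int j = k + kb; j < n; ++j)
                      for (int p = k; p < k + kb; ++p)
                          for (int i = k + kb; i < n; ++i)
                              a(i, j) -= a(i, p) * a(p, j);
              }
          }

          int main() {
              const int n = 8, b = 4;
              std::vector<double> A(n * n);
              for (int j = 0; j < n; ++j)
                  for (int i = 0; i < n; ++i)
                      A[i + j * n] = (i == j) ? n : 1.0 / (1.0 + i + j);  // diagonally dominant
              lu_blocked(A, n, b);  // A now holds the unit-lower L and U in place
          }

        There is no dedicated LU hardware; the point is that blocking routes almost all of the flops through GEMM, which is what the matrix units accelerate.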

        • Archit3ch 6 hours ago
          > this includes the solving of dense systems of equations

          Is there even dedicated hardware for LU?

      • phkahler 11 hours ago
        >> Non-AI workloads prefer vector units and not matrix units

        FEA and other "scientific" workloads are all matrix math. This is why supercomputers have been benchmarked using BLAS and LAPACK for the past 40 years. OTOH, are those matrix * vector while AI is matrix * matrix?

        Either way it's a regression, which seems strange.

        • trueismywork 10 hours ago
          Nvidia's B200 did the same. A lot of FEA goes explicit (matrix-free) because the scaling is better.

          Also, look up the Ozaki algorithms.

          • adrian_b 6 hours ago
            I do not see what the relationship is between the Ozaki algorithms and algorithms that are supposedly "matrix free".

            The Ozaki scheme and its variants improve the precision of matrix-matrix multiplications, allowing a multiplication done with lower-precision operations to approach the precision of the same multiplication done with higher-precision operations.

            So it is an improvement for matrix-matrix operations, which are better done in matrix units. It is not any kind of "matrix free" algorithm.

            The Ozaki scheme is not good enough for emulating FP64 on a GPU with poor FP64 throughput but good FP32 throughput. The reason is that not only the greater precision of FP64 matters, but also its much greater dynamic range compared with FP32. In computations with FP64, overflows and underflows are extremely rare events and easy to avoid. In complex physical simulations with FP32, on the other hand, it is impossible to avoid overflows and underflows unless one uses extremely cumbersome, frequent rescalings, which eliminate all the advantages of using floating-point numbers instead of fixed-point numbers.
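
            A trivial illustration of the dynamic-range point (the numbers are just the Sun-Earth gravitational force in SI units; any similar intermediate will do):

              #include <cstdio>

              // An intermediate value that is perfectly ordinary in a physical simulation
              // overflows FP32 (max ~3.4e38) while being nowhere near the FP64 limit (~1.8e308).
              int main() {
                  double G = 6.674e-11, m1 = 1.99e30, m2 = 5.97e24, r = 1.496e11;  // SI units
                  double f64 = G * m1 * m2 / (r * r);   // fine: the intermediate is ~8e44
                  float  f32 = (float)G * (float)m1 * (float)m2 / ((float)r * (float)r);
                  std::printf("fp64: %g\nfp32: %g\n", f64, (double)f32);  // fp32 prints inf
              }

            Reordering the expression as G * (m1 / r) * (m2 / r) keeps every intermediate within FP32 range; doing that kind of rescaling everywhere in a large simulation code is exactly the cumbersome work mentioned above.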

            I do not know which kind of "matrix free" algorithms for FEA you are referring to.

            Nevertheless, the problem of any "matrix free" algorithm is exactly its poor scaling: such an algorithm must perform similar numbers of computational operations and memory transfers, which limits its performance to that of the memory interface and prevents scaling.

            The advantage of the matrix-based algorithms is exactly their better scaling, because only such algorithms can perform many more computational operations than memory transfers, so their scaling is no longer limited by the memory interface.

            For implementing matrix-matrix operations, the matrix units introduced first by NVIDIA and then by AMD, Apple and Intel, and starting next year also by Arm, are preferable, because they reduce the number of memory transfers that prevent scaling even further than implementing the same matrix-matrix operations in vector units.
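
            A back-of-the-envelope check of the flops-per-byte argument (sizes arbitrary, fp64 operands, counting each array as moved once):

              #include <cstdio>

              // Arithmetic intensity = flops per byte of off-chip traffic.  A matrix-vector
              // product (the "matrix free" style of work) stays at O(1) flop/byte no matter
              // the size, so it is pinned to the memory bandwidth; a matrix-matrix product
              // grows as O(n) flop/byte, so it can keep the compute units busy.
              int main() {
                  const long sizes[] = {1024, 4096, 16384};
                  for (long n : sizes) {
                      double gemv_flops = 2.0 * n * n;                    // y = A*x
                      double gemv_bytes = 8.0 * (1.0 * n * n + 2.0 * n);  // A, x, y in fp64
                      double gemm_flops = 2.0 * n * n * n;                // C = A*B
                      double gemm_bytes = 8.0 * 3.0 * n * n;              // A, B, C in fp64
                      std::printf("n=%6ld  GEMV: %5.2f flop/byte   GEMM: %8.1f flop/byte\n",
                                  n, gemv_flops / gemv_bytes, gemm_flops / gemm_bytes);
                  }
              }

            The matrix units push the GEMM side even further in the same direction, because the operands of each small block are loaded into the unit once and reused for a whole block of multiply-accumulates instead of being re-read for every vector instruction.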

  • saagarjha 11 hours ago
    If AMD were serious they would show a fully worked-out GEMM, not just "here is our theoretical performance, this is the instruction to use".
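
    For the record, the smallest thing that would count as "worked out" is probably a single wavefront issuing one MFMA for one 16x16 tile. A sketch along those lines (the builtin is real; the lane-to-element index mapping is my reading of the ISA docs and may well need correcting, which is rather the point):

      // One wavefront (64 lanes), one V_MFMA_F32_16X16X16F16: D = A*B for a single
      // 16x16 tile, fp16 inputs, fp32 accumulation.  Compile with something like
      // `hipcc --offload-arch=gfx90a ...` (or gfx942); the inputs are stored as float
      // on the host and narrowed to fp16 in the kernel just to keep the demo short.
      // A real GEMM would tile this over LDS and many waves, which is exactly the
      // part that needs spelling out.
      #include <hip/hip_runtime.h>
      #include <cstdio>
      #include <vector>

      typedef _Float16 f16x4 __attribute__((ext_vector_type(4)));
      typedef float    f32x4 __attribute__((ext_vector_type(4)));

      __global__ void mfma_tile(const float* A, const float* B, float* D) {
          const int lane = threadIdx.x;  // 0..63, one wavefront
          const int r = lane % 16;       // row of A, column of B and D
          const int g = lane / 16;       // which group of four k (or m) indices

          f16x4 a, b;
          for (int i = 0; i < 4; ++i) {
              a[i] = (_Float16)A[r * 16 + (4 * g + i)];   // A[m][k], row-major, m = r, k = 4g+i
              b[i] = (_Float16)B[(4 * g + i) * 16 + r];   // B[k][n], row-major, k = 4g+i, n = r
          }
          f32x4 acc = {0.f, 0.f, 0.f, 0.f};
          acc = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, acc, 0, 0, 0);

          for (int i = 0; i < 4; ++i)
              D[(4 * g + i) * 16 + r] = acc[i];           // D[m][n], row-major, m = 4g+i, n = r
      }

      int main() {
          std::vector<float> hA(256), hB(256);
          for (int i = 0; i < 16; ++i)
              for (int j = 0; j < 16; ++j) {
                  hA[i * 16 + j] = (i == j) ? 1.0f : 0.0f;  // A = identity, so D should equal B
                  hB[i * 16 + j] = float(i + j);            // arbitrary small test values
              }
          float *dA, *dB, *dD;
          hipMalloc((void**)&dA, 256 * sizeof(float));
          hipMalloc((void**)&dB, 256 * sizeof(float));
          hipMalloc((void**)&dD, 256 * sizeof(float));
          hipMemcpy(dA, hA.data(), 256 * sizeof(float), hipMemcpyHostToDevice);
          hipMemcpy(dB, hB.data(), 256 * sizeof(float), hipMemcpyHostToDevice);
          hipLaunchKernelGGL(mfma_tile, dim3(1), dim3(64), 0, 0, dA, dB, dD);
          std::vector<float> hD(256);
          hipMemcpy(hD.data(), dD, 256 * sizeof(float), hipMemcpyDeviceToHost);
          std::printf("D[0][5] = %g (with A = I this should equal B[0][5] = 5)\n", hD[5]);
      }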