Merge pull request #9698 from pinewall/submit-tech20180516-3

submit tech/20180516 How Graphics Cards Work.md
This commit is contained in:
Xingyu.Wang 2018-08-04 21:52:30 +08:00 committed by GitHub
commit d574c6e0f4
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 74 additions and 76 deletions

View File

@ -1,76 +0,0 @@
pinewall translating
How Graphics Cards Work
======
![AMD-Polaris][1]
Ever since 3dfx debuted the original Voodoo accelerator, no single piece of equipment in a PC has had as much of an impact on whether your machine could game as the humble graphics card. While other components absolutely matter, a top-end PC with 32GB of RAM, a $500 CPU, and PCIe-based storage will choke and die if asked to run modern AAA titles on a ten year-old card at modern resolutions and detail levels. Graphics cards (also commonly referred to as GPUs, or graphics processing units) are critical to game performance and we cover them extensively. But we dont often dive into what makes a GPU tick and how the cards function.
By necessity, this will be a high-level overview of GPU functionality and cover information common to AMD, Nvidia, and Intels integrated GPUs, as well as any discrete cards Intel might build in the future. It should also be common to the mobile GPUs built by Apple, Imagination Technologies, Qualcomm, ARM, and other vendors.
### Why Dont We Run Rendering With CPUs?
The first point I want to address is why we dont use CPUs for rendering workloads in gaming in the first place. The honest answer to this question is that you can run rendering workloads directly on a CPU, at least in theory. Early 3D games that predate the widespread availability of graphics cards, like Ultima Underworld, ran entirely on the CPU. UU is a useful reference case for multiple reasons — it had a more advanced rendering engine than games like Doom, with full support for looking up and down, as well as then-advanced features like texture mapping. But this kind of support came at a heavy price — many people lacked a PC that could actually run the game.
![](https://www.extremetech.com/wp-content/uploads/2018/05/UU.jpg)
In the early days of 3D gaming, many titles like Half Life and Quake II featured a software renderer to allow players without 3D accelerators to play the title. But the reason we dropped this option from modern titles is simple: CPUs are designed to be general-purpose microprocessors, which is another way of saying they lack the specialized hardware and capabilities that GPUs offer. A modern CPU could easily handle titles that tended to stutter when run in software 18 years ago, but no CPU on Earth could easily handle a modern AAA game from today if run in that mode. Not, at least, without some drastic changes to the scene, resolution, and various visual effects.
### Whats a GPU?
A GPU is a device with a set of specific hardware capabilities that are intended to map well to the way that various 3D engines execute their code, including geometry setup and execution, texture mapping, memory access, and shaders. Theres a relationship between the way 3D engines function and the way GPU designers build hardware. Some of you may remember that AMDs HD 5000 family used a VLIW5 architecture, while certain high-end GPUs in the HD 6000 family used a VLIW4 architecture. With GCN, AMD changed its approach to parallelism, in the name of extracting more useful performance per clock cycle.
![](https://www.extremetech.com/wp-content/uploads/2018/05/GPU-Evolution.jpg)
Nvidia first coined the term “GPU” with the launch of the original GeForce 256 and its support for performing hardware transform and lighting calculations on the GPU (this corresponded, roughly to the launch of Microsofts DirectX 7). Integrating specialized capabilities directly into hardware was a hallmark of early GPU technology. Many of those specialized technologies are still employed (in very different forms), because its more power efficient and faster to have dedicated resources on-chip for handling specific types of workloads than it is to attempt to handle all of the work in a single array of programmable cores.
There are a number of differences between GPU and CPU cores, but at a high level, you can think about them like this. CPUs are typically designed to execute single-threaded code as quickly and efficiently as possible. Features like SMT / Hyper-Threading improve on this, but we scale multi-threaded performance by stacking more high-efficiency single-threaded cores side-by-side. AMDs 32-core / 64-thread Epyc CPUs are the largest you can buy today. To put that in perspective, the lowest-end Pascal GPU from Nvidia has 384 cores. A “core” in GPU parlance refers to a much smaller unit of processing capability than in a typical CPU.
**Note:** You cannot compare or estimate relative gaming performance between AMD and Nvidia simply by comparing the number of GPU cores. Within the same GPU family (for example, Nvidias GeForce GTX 10 series, or AMDs RX 4xx or 5xx family), a higher GPU core count means that GPU is more powerful than a lower-end card.
The reason you cant draw immediate conclusions on GPU performance between manufacturers or core families based solely on core counts is because different architectures are more and less efficient. Unlike CPUs, GPUs are designed to work in parallel. Both AMD and Nvidia structure their cards into blocks of computing resources. Nvidia calls these blocks an SM (Streaming Multiprocessor), while AMD refers to them as a Compute Unit.
![](https://www.extremetech.com/wp-content/uploads/2018/05/PascalSM.png)
Each block contains a group of cores, a scheduler, a register file, instruction cache, texture and L1 cache, and texture mapping units. The SM / CU can be thought of as the smallest functional block of the GPU. It doesnt contain literally everything — video decode engines, render outputs required for actually drawing an image on-screen, and the memory interfaces used to communicate with onboard VRAM are all outside its purview — but when AMD refers to an APU as having 8 or 11 Vega Compute Units, this is the (equivalent) block of silicon theyre talking about. And if you look at a block diagram of a GPU, any GPU, youll notice that its the SM/CU thats duplicated a dozen or more times in the image.
![](https://www.extremetech.com/wp-content/uploads/2016/11/Pascal-Diagram.jpg)
The higher the number of SM/CU units in a GPU, the more work it can perform in parallel per clock cycle. Rendering is a type of problem thats sometimes referred to as “embarrassingly parallel,” meaning it has the potential to scale upwards extremely well as core counts increase.
When we discuss GPU designs, we often use a format that looks something like this: 4096:160:64. The GPU core count is the first number. The larger it is, the faster the GPU, provided were comparing within the same family (GTX 970 versus GTX 980 versus GTX 980 Ti, RX 560 versus RX 580, and so on).
### Texture Mapping and Render Outputs
There are two other major components of a GPU: texture mapping units and render outputs. The number of texture mapping units in a design dictates its maximum texel output and how quickly it can address and map textures on to objects. Early 3D games used very little texturing, because the job of drawing 3D polygonal shapes was difficult enough. Textures arent actually required for 3D gaming, though the list of games that dont use them in the modern age is extremely small.
The number of texture mapping units in a GPU is signified by the second figure in the 4096:160:64 metric. AMD, Nvidia, and Intel typically shift these numbers equivalently as they scale a GPU family up and down. In other words, you wont really find a scenario where one GPU has a 4096:160:64 configuration while a GPU above or below it in the stack is a 4096:320:64 configuration. Texture mapping can absolutely be a bottleneck in games, but the next-highest GPU in the product stack will typically offer at least more GPU cores and texture mapping units (whether higher-end cards have more ROPs depends on the GPU family and the card configuration).
Render outputs (also sometimes called raster operations pipelines) are where the GPUs output is assembled into an image for display on a monitor or television. The number of render outputs multiplied by the clock speed of the GPU controls the pixel fill rate. A higher number of ROPs means that more pixels can be output simultaneously. ROPs also handle antialiasing, and enabling AA — especially supersampled AA — can result in a game thats fill-rate limited.
### Memory Bandwidth, Memory Capacity
The last components well discuss are memory bandwidth and memory capacity. Memory bandwidth refers to how much data can be copied to and from the GPUs dedicated VRAM buffer per second. Many advanced visual effects (and higher resolutions more generally) require more memory bandwidth to run at reasonable frame rates because they increase the total amount of data being copied into and out of the GPU core.
In some cases, a lack of memory bandwidth can be a substantial bottleneck for a GPU. AMDs APUs like the Ryzen 5 2400G are heavily bandwidth-limited, which means increasing your DDR4 clock rate can have a substantial impact on overall performance. The choice of game engine can also have a substantial impact on how much memory bandwidth a GPU needs to avoid this problem, as can a games target resolution.
The total amount of on-board memory is another critical factor in GPUs. If the amount of VRAM needed to run at a given detail level or resolution exceeds available resources, the game will often still run, but itll have to use the CPUs main memory for storing additional texture data — and it takes the GPU vastly longer to pull data out of DRAM as opposed to its onboard pool of dedicated VRAM. This leads to massive stuttering as the game staggers between pulling data from a quick pool of local memory and general system RAM.
One thing to be aware of is that GPU manufacturers will sometimes equip a low-end or midrange card with more VRAM than is otherwise standard as a way to charge a bit more for the product. We cant make an absolute prediction as to whether this makes the GPU more attractive because honestly, the results vary depending on the GPU in question. What we can tell you is that in many cases, it isnt worth paying more for a card if the only difference is a larger RAM buffer. As a rule of thumb, lower-end GPUs tend to run into other bottlenecks before theyre choked by limited available memory. When in doubt, check reviews of the card and look for comparisons of whether a 2GB version is outperformed by the 4GB flavor or whatever the relevant amount of RAM would be. More often than not, assuming all else is equal between the two solutions, youll find the higher RAM loadout not worth paying for.
Check out our [ExtremeTech Explains][2] series for more in-depth coverage of todays hottest tech topics.
--------------------------------------------------------------------------------
via: https://www.extremetech.com/gaming/269335-how-graphics-cards-work
作者:[Joel Hruska][a]
选题:[lujun9972](https://github.com/lujun9972)
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]:https://www.extremetech.com/author/jhruska
[1]:https://www.extremetech.com/wp-content/uploads/2016/07/AMD-Polaris-640x353.jpg
[2]:http://www.extremetech.com/tag/extremetech-explains

View File

@ -0,0 +1,74 @@
显卡工作原理简介
======
![AMD-Polaris][1]
自从 sdfx 推出最初的 Voodoo 加速器以来,不起眼的显卡对你的 PC 是否可以玩游戏起到决定性作用PC 上任何其它设备都无法与其相比。其它组件当然也很重要,但对于一个拥有 32GB 内存、价值 500 美金的 CPU 和 基于 PCIe 的存储设备的高端 PC如果使用 10 年前的显卡,都无法以最高分辨率和细节质量运行当前<ruby>最高品质的游戏<rt>AAA titles</rt></ruby>,会发生卡顿甚至无响应。显卡(也常被称为 GPU, 或<ruby>图形处理单元<rt>Graphic Processing Unit</rt></ruby>)对游戏性能影响极大,我们反复强调这一点;但我们通常并不会深入了解显卡的工作原理。
出于实际考虑,本文将概述 GPU 的上层功能特性,内容包括 AMD 显卡、Nvidia 显卡、Intel 集成显卡以及 Intel 后续可能发布的独立显卡之间共同的部分。也应该适用于 Apple, Imagination Technologies, Qualcomm, ARM 和 其它显卡生产商发布的移动平台 GPU。
### 我们为何不使用 CPU 进行渲染?
我要说明的第一点是我们为何不直接使用 CPU 完成游戏中的渲染工作。坦率的说,在理论上你确实可以直接使用 CPU 完成<ruby>渲染<rt>rendering</rt></ruby>工作。在显卡没有广泛普及之前,早期的 3D 游戏就是完全基于 CPU 运行的,例如 Ultima UnderworldLCTT 译注:中文名为 _地下创世纪_ ,下文中简称 UU。UU 是一个很特别的例子,原因如下:与 Doom LCTT 译注:中文名 _毁灭战士_相比UU 具有一个更高级的渲染引擎,全面支持<ruby>向上或向下查找<rt>looking up and down</rt></ruby>以及一些在当时比较高级的特性,例如<ruby>纹理映射<rt>texture mapping</rt></ruby>。但为支持这些高级特性,需要付出高昂的代价,很少有人可以拥有真正能运行起 UU 的 PC。
![](https://www.extremetech.com/wp-content/uploads/2018/05/UU.jpg)
对于早期的 3D 游戏,包括 Half Life 和 Quake II 在内的很多游戏,内部包含一个软件渲染器,让没有 3D 加速器的玩家也可以玩游戏。但现代游戏都弃用了这种方式原因很简单CPU 是设计用于通用任务的微处理器,意味着缺少 GPU 提供的<ruby>专用硬件<rt>specialized hardware</rt></ruby><ruby>功能<rt>capabilities</rt></ruby>。对于 18 年前使用软件渲染的那些游戏,当代 CPU 可以轻松胜任;但对于当代最高品质的游戏,除非明显降低<ruby>景象质量<rt>scene</rt></ruby>、分辨率和各种虚拟特效,否则现有的 CPU 都无法胜任。
### 什么是 GPU ?
GPU 是一种包含一系列专用硬件特性的设备,其中这些特性可以让各种 3D 引擎更好地执行代码,包括<ruby>形状构建<rt>geometry setup</rt></ruby>,纹理映射,<ruby>访存<rt>memory access</rt></ruby><ruby>着色器<rt>shaders</rt></ruby>等。3D 引擎的功能特性影响着设计者如何设计 GPU。可能有人还记得AMD HD5000 系列使用 VLIW5 <ruby>架构<rt>archtecture</rt></ruby>;但在更高端的 HD 6000 系列中使用了 VLIW4 架构。通过 GCN LCTT 译注GCN 是 Graphics Core Next 的缩写字面意思是下一代图形核心既是若干代微体系结构的代号也是指令集的名称AMD 改变了并行化的实现方法,提高了每个时钟周期的有效性能。
![](https://www.extremetech.com/wp-content/uploads/2018/05/GPU-Evolution.jpg)
Nvidia 在发布首款 GeForce 256 时(大致对应 Microsoft 推出 DirectX7 的时间点)提出了 GPU 这个术语,这款 GPU 支持在硬件上执行转换和<ruby>光照计算<rt>lighting calculation</rt></ruby>。将专用功能直接集成到硬件中是早期 GPU 的显著技术特点。很多专用功能还在(以一种极为不同的方式)使用,毕竟对于特定类型的工作任务,使用<ruby>片上<rt>on-chip</rt></ruby>专用计算资源明显比使用一组<ruby>可编程单元<rt>programmable cores</rt></ruby>要更加高效和快速。
GPU 和 CPU 的核心有很多差异但我们可以按如下方式比较其上层特性。CPU 一般被设计成尽可能快速和高效的执行单线程代码。虽然 <ruby>同时多线程<rt>SMT, Simultaneous multithreading</rt></ruby><ruby>超线程<rt>Hyper-Threading</rt></ruby>在这方面有所改进但我们实际上通过堆叠众多高效率的单线程核心来扩展多线程性能。AMD 的 32 核心/64 线程 Epyc CPU 已经是我们能买到的核心数最多的 CPU相比而言Nvidia 最低端的 Pascal GPU 都拥有 384 个核心。但相比 CPU 的核心GPU 所谓的核心是处理能力低得多的的处理单元。
**注意:** 简单比较 GPU 核心数,无法比较或评估 AMD 与 Nvidia 的相对游戏性能。在同样 GPU 系列(例如 Nvidia 的 GeForce GTX 10 系列,或 AMD 的 RX 4xx 或 5xx 系列)的情况下,更高的 GPU 核心数往往意味着更高的性能。
你无法只根据核心数比较不同供应商或核心系列的 GPU 之间的性能,这是因为不同的架构对应的效率各不相同。与 CPU 不同GPU 被设计用于并行计算。AMD 和 Nvidia 在结构上都划分为计算资源<ruby><rt>block</rt></ruby>。Nvidia 将这些块称之为<ruby>流处理器<rt>SM, Streaming Multiprocessor</rt></ruby>,而 AMD 则称之为<ruby>计算单元<rt>Compute Unit</rt></ruby>
![](https://www.extremetech.com/wp-content/uploads/2018/05/PascalSM.png)
每个块都包含如下组件:一组核心,一个<ruby>调度器<rt>scheduler</rt></ruby>,一个<ruby>寄存器文件<rt>register file</rt></ruby>,指令缓存,纹理和 L1 缓存以及纹理<ruby>映射单元<rt>mapping units</rt></ruby>。SM/CU 可以被认为是 GPU 中最小的可工作块。SM/CU 没有涵盖全部的功能单元,例如视频解码引擎,实际在屏幕绘图所需的渲染输出,以及与<ruby>板载<rt>onboard</rt></ruby><ruby>显存<rt>VRAM, Video Memory</rt></ruby>通信相关的<ruby>内存接口<rt>memory interfaces</rt></ruby>都不在 SM/CU 的范围内;但当 AMD 提到一个 APU 拥有 8 或 11 个 Vega 计算单元时,所指的是(等价的)<ruby>硅晶块<rt>block of silicon</rt></ruby>数目。如果你查看任意一款 GPU 的模块设计图,你会发现图中 SM/CU 是反复出现很多次的部分。
![](https://www.extremetech.com/wp-content/uploads/2016/11/Pascal-Diagram.jpg)
GPU 中的 SM/CU 数目越多,每个时钟周期内可以并行完成的工作也越多。渲染是一种通常被认为是“高度并行”的计算问题,意味着随着核心数增加带来的可扩展性很高。
当我们讨论 GPU 设计时,我们通常会使用一种形如 4096:160:64 的格式,其中第一个数字代表核心数。在核心系列(如 GTX970/GTX 980/GTX 980 Ti, 如 RX 560/RX 580 等等一致的情况下核心数越高GPU 也就相对更快。
### 纹理映射和渲染输出
GPU 的另外两个主要组件是纹理映射单元和渲染输出。设计中的纹理映射单元数目决定了最大的<ruby>纹素<rt>texel</rt></ruby>输出以及可以多快的处理并将纹理映射到对象上。早期的 3D 游戏很少用到纹理,这是因为绘制 3D 多边形形状的工作有较大的难度。纹理其实并不是 3D 游戏必须的,但不使用纹理的现代游戏屈指可数。
GPU 中的纹理映射单元数目用 4096:160:64 指标中的第二个数字表示。AMDNvidia 和 Intel 一般都等比例变更指标中的数字。换句话说,如果你找到一个指标为 4096:160:64 的 GPU同系列中不会出现指标为 4096:320:64 的 GPU。纹理映射绝对有可能成为游戏的瓶颈但产品系列中次高级别的 GPU 往往提供更多的核心和纹理映射单元(是否拥有更高的渲染输出单元取决于 GPU 系列和显卡的指标)。
<ruby>渲染输出单元<rt>Render outputs, ROPs</rt></ruby>(有时也叫做<ruby>光栅操作管道<rt>raster operations pipelines</rt></ruby>是 GPU 输出汇集成图像的场所,图像最终会在显示器或电视上呈现。渲染输出单元的数目乘以 GPU 的时钟频率决定了<ruby>像素填充速率<rt>pixel fill rate</rt></ruby>。渲染输出单元数目越多意味着可以同时输出的像素越多。渲染输出单元还处理<ruby>抗锯齿<rt>antialiasing</rt></ruby>,启用抗锯齿(尤其是<ruby>超级采样<rt>supersampled</rt></ruby>抗锯齿)会导致游戏填充速率受限。
### 显存带宽与显存容量
我们最后要讨论的是<ruby>显存带宽<rt>memory bandwidth</rt></ruby><ruby>显存容量<rt>memory capacity</rt></ruby>。显存带宽是指一秒时间内可以从 GPU 专用的显存缓冲区内拷贝进或拷贝出多少数据。很多高级视觉特效(以及更常见的高分辨率)需要更高的显存带宽,以便保证足够的<ruby>帧率<rt>frame rates</rt></ruby>,因为需要拷贝进和拷贝出 GPU 核心的数据总量增大了。
在某些情况下,显存带宽不足会成为 GPU 的显著瓶颈。以 Ryzen 5 2400G 为例的 AMD APU 就是严重带宽受限的,以至于提高 DDR4 的时钟频率可以显著提高整体性能。导致瓶颈的显存带宽阈值,也与游戏引擎和游戏使用的分辨率相关。
板载内存大小也是 GPU 的重要指标。如果按指定细节级别或分辨率运行所需的显存量超过了可用的资源量,游戏通常仍可以运行,但会使用 CPU 的主存存储额外的纹理数据;而从 DRAM 中提取数据比从板载显存中提取数据要慢得多。这会导致游戏在板载的快速访问内存池和系统内存中共同提取数据时出现明显的卡顿。
有一点我们需要留意GPU 生产厂家通常为一款低端或中端 GPU 配置比通常更大的显存,这是他们为产品提价的一种常用手段。很难说大显存是否更具有吸引力,毕竟需要具体问题具体分析。大多数情况下,用更高的价格购买一款仅显存更高的显卡是不划算的。经验规律告诉我们,低端显卡遇到显存瓶颈之前就会碰到其它瓶颈。如果存在疑问,可以查看相关评论,例如 4G 版本或其它数目的版本是否性能超过 2G 版本。更多情况下,如果其它指标都相同,购买大显存版本并不值得。
查看我们的[极致技术讲解][2]系列,深入了解更多当前最热的技术话题。
--------------------------------------------------------------------------------
via: https://www.extremetech.com/gaming/269335-how-graphics-cards-work
作者:[Joel Hruska][a]
选题:[lujun9972](https://github.com/lujun9972)
译者:[pinewall](https://github.com/pinewall)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]:https://www.extremetech.com/author/jhruska
[1]:https://www.extremetech.com/wp-content/uploads/2016/07/AMD-Polaris-640x353.jpg
[2]:http://www.extremetech.com/tag/extremetech-explains