Multi-GPU frame rendering

US 10,430,915 B2

Assignee
NVIDIA Corporation
Inventors
Andrei Khodakovsky, Kirill A. Dmitriev, Rouslan L. Dimitrov, Tzyywei Hwang, Wishwesh Anil Gandhi, Lacky Vasant Shah
Filing date
January 24, 2018
Publication date
October 1, 2019
Classifications
CPC: G06T1/20, G06T1/60, G06T15/005, G06T2200/24, G06T2210/52
IPC: G06T1/20, G06T1/60, G06T15/00

Abstract

One or more copy commands are scheduled for locating one or more pages of data in a local memory of a graphics processing unit (GPU) for more efficient access to the pages of data during rendering. A first processing unit coupled to a first GPU receives a notification that an access request count has reached a specified threshold, where the access request count indicates a number of access requests generated by the first GPU targeting a first page of data residing at a second GPU. The first processing unit schedules a copy command to copy the first page of data from a second memory circuit of the second GPU to a first memory circuit of the first GPU. The copy command is included within a GPU command stream.

Claims

1. A computer-implemented method, comprising:
receiving, by a first processing unit coupled to a first graphics processing unit (GPU), a notification generated by the first GPU that an access request count has reached a specified threshold, wherein the access request count indicates a number of access requests generated during rendering of a first frame by the first GPU targeting a first page of data residing at a second GPU;
scheduling, by the first processing unit, a copy command to copy the first page of data to a first memory circuit of the first GPU from a second memory circuit of the second GPU before the first GPU accesses the first page of data during rendering of a second frame; and
inserting, by the first processing unit, the copy command within a GPU command stream for rendering the second frame,
wherein the first page of data is copied to the first memory circuit.

[Dependent claims 2-10 omitted.]

11. A computer-implemented method, comprising:
selecting, by a first processing unit coupled to a first graphics processing unit (GPU), an experiment, wherein the first GPU is configured to generate a number of access requests targeting a first page of data residing at a second GPU;
generating, by the first processing unit, one or more GPU command streams according to the selected experiment;
causing, by the first processing unit, the first GPU to execute one of the one or more GPU command streams and the second GPU to execute one of the one or more GPU command streams;
measuring, by the first processing unit, execution metrics associated with the one or more GPU command streams;
recording, by the first processing unit, the execution metrics associated with the one or more GPU command streams;
receiving, by the first processing unit, a notification that an access request count has reached a specified threshold, wherein the access request count indicates a number of access requests generated by the first GPU targeting a first page of data residing at a second GPU; and
scheduling, by the first processing unit, a copy command to copy the first page of data to a first memory circuit of the first GPU from a second memory circuit of the second GPU for rendering a frame by the first GPU.

[Dependent claims 12-13 omitted.]

14. A processing system configured to:
receive, by a first processing unit coupled to a first graphics processing unit (GPU), a notification generated by the first GPU that an access request count has reached a specified threshold, wherein the access request count indicates a number of access requests generated during rendering of a first frame by the first GPU targeting a first page of data residing at a second GPU;
schedule, by the first processing unit, a copy command to copy the first page of data to a first memory circuit of the first GPU from a second memory circuit of the second GPU before the first GPU accesses the first page of data during rendering of a second frame; and
insert, by the first processing unit, the copy command within a GPU command stream for rendering the second frame,
wherein the first page of data is copied to the first memory circuit.

[Dependent claims 15-16 omitted.]

17. A processing system configured to:
select, by a first processing unit coupled to a first graphics processing unit (GPU), an experiment, wherein the first GPU is configured to generate a number of access requests targeting a first page of data residing at a second GPU;
generate, by the first processing unit, one or more GPU command streams according to the selected experiment;
cause, by the first processing unit, the first GPU to execute one of the one or more GPU command streams and the second GPU to execute one of the one or more GPU command streams;
measure, by the first processing unit, execution metrics associated with the one or more GPU command streams;
record, by the first processing unit, the execution metrics associated with the one or more GPU command streams;
receive, by the first processing unit, a notification that an access request count has reached a specified threshold, wherein the access request count indicates a number of access requests generated by the first GPU targeting a first page of data residing at a second GPU; and
schedule, by the first processing unit, a copy command to copy the first page of data to a first memory circuit of the first GPU from a second memory circuit of the second GPU for rendering a frame by the first GPU.

[Dependent claim 18 omitted.]

19. A non-transitory, computer-readable storage medium storing instructions that, when executed by a first processing unit coupled to a first graphics processing unit (GPU), cause the first processing unit to:
receive, by the first processing unit, a notification generated by the first GPU that an access request count has reached a specified threshold, wherein the access request count indicates a number of access requests generated during rendering of a first frame by the first GPU targeting a first page of data residing at a second GPU;
schedule, by the first processing unit, a copy command to copy the first page of data to a first memory circuit of the first GPU from a second memory circuit of the second GPU before the first GPU accesses the first page of data during rendering of a second frame; and
insert, by the first processing unit, the copy command within a GPU command stream for rendering the second frame,
wherein the first page of data is copied to the first memory circuit.

Description

CLAIM OF PRIORITY

This application is a continuation-in-part of U.S. Non-Provisional application Ser. No. 15/857,330 titled MULTI-GPU FRAME RENDERING, filed Dec. 28, 2017, the entire contents of which are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to image rendering, and more particularly to frame rendering with multiple graphics processing units.

BACKGROUND

Two key performance metrics in a graphics rendering system are frame rate and latency. In many applications, such as augmented reality, reducing latency is critical for a realistic user experience. Alternate frame rendering (AFR) can improve frame rate by assigning alternate frames to a corresponding alternate graphics processing unit (GPU). However, AFR does not typically improve latency and can cause performance degradation in applications with inter-frame data dependencies. Split-frame rendering (SFR) is another technique that can improve frame rate for certain applications, but it is impractical for many modern workloads. Furthermore, both AFR and SFR impose computational and run-time restrictions that limit their applicability to many current graphics applications. Thus, there is a need for addressing these issues and/or other issues associated with the prior art.

SUMMARY

A method, computer readable medium, and system are disclosed for rendering graphics frames using multiple graphics processing units (GPUs). The method schedules one or more copy commands for locating one or more pages of data in local memory of a graphics processing unit (GPU) for more efficient access to the pages of data during rendering. The method comprises receiving, by a first processing unit coupled to a first GPU, a notification that an access request count has reached a specified threshold; scheduling, by the first processing unit, a copy command to copy the first page of data to a first memory circuit of the first GPU from a second memory circuit of the second GPU; and including, by the first processing unit, the copy command within a GPU command stream. In one embodiment, the access request count indicates a number of access requests generated by the first GPU targeting a first page of data residing at a second GPU. Furthermore, the first page of data is copied to the first memory circuit through a data link coupled to the first GPU and the second GPU.

The computer readable medium includes instructions that, when executed by a processing unit, perform the method. Furthermore, the system includes circuitry configured to perform the method.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A illustrates a flowchart of a method for notifying a driver that a request count has reached a specified threshold, in accordance with one embodiment;

FIG. 1B illustrates a flowchart of a method for accessing a page of copied data, in accordance with one embodiment;

FIG. 1C illustrates a flowchart of a method for scheduling a copy command for a page of data, in accordance with one embodiment;

FIG. 1D illustrates a technique for allocating rendering work based on a screen space checkerboard pattern, in accordance with one embodiment;

FIG. 1E illustrates a system for transmitting compressed data through a high-speed data link, in accordance with one embodiment;

FIG. 1F illustrates a system comprising hardware counters for accumulating access request counts within a graphics processing unit, in accordance with one embodiment;

FIG. 1G illustrates an exemplary chronology for copying and accessing pages of data, in accordance with one embodiment;

FIG. 2 illustrates a parallel processing unit, in accordance with one embodiment;

FIG. 3A illustrates a general processing cluster of the parallel processing unit of FIG. 2, in accordance with one embodiment;

FIG. 3B illustrates a partition unit of the parallel processing unit of FIG. 2, in accordance with one embodiment;

FIG. 4 illustrates the streaming multi-processor of FIG. 3A, in accordance with one embodiment;

FIG. 5 illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented;

FIG. 6 is a conceptual diagram of a graphics processing pipeline implemented by the parallel processing unit of FIG. 2, in accordance with one embodiment;

FIG. 7A is a conceptual diagram of a two processor configuration, in accordance with one embodiment;

FIG. 7B is a conceptual diagram of a four processor configuration, in accordance with one embodiment;

FIG. 7C is a conceptual diagram of a second four processor configuration, in accordance with one embodiment; and

FIG. 8 illustrates a flowchart of a method 800 for managing GPU operation, in accordance with one embodiment.

DETAILED DESCRIPTION

As described further herein, distributing rendering across multiple GPUs reduces rendering latency and provides a more general and scalable GPU computation environment than conventional split-frame rendering (SFR) or alternate-frame rendering (AFR) techniques.

In one embodiment, two or more GPUs are configured to operate as peers, with one peer able to access data (e.g., surfaces) in the local memory of another peer through a high-speed data link (e.g., NVLINK, high-speed data link 150 of FIG. 1E). For example, a first GPU of the two or more GPUs may perform texture mapping operations using surface data residing remotely within a memory of a second GPU. Because the bandwidth and/or latency of the high-speed data link may be inferior to that of a local memory interface, the remote surface data may be copied to the local memory of the first GPU based on remote memory access tracking data. In certain embodiments, a given frame to be rendered is partitioned into regions (e.g., rectangular regions) forming a checkerboard pattern, with adjacent regions that share a common edge generally assigned to different GPUs. In other embodiments, the frame is partitioned into regions that may overlap by one or more pixels (e.g., to trade off redundant computation for potentially reduced inter-processor communication). For large surfaces, the regions are aligned to memory page boundaries for render targets and distributed textures. In one embodiment, the number of regions is dynamically determined and updated for new frames to reduce remote transfers and provide overall load balancing among the two or more GPUs. The regions are rendered separately by the different GPUs for the frame and combined to form a complete frame in a frame buffer. In one embodiment, the frame buffer is located in the local memory of one of the two or more GPUs.
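
As a concrete illustration of the checkerboard partitioning described above, the following host-side sketch assigns fixed-size tiles to GPUs so that tiles sharing an edge land on different GPUs. It is a minimal sketch only: the Region type, tile size, and assignment policy are illustrative assumptions, not the patent's implementation, and in practice the tile size would be chosen so regions align with the memory page boundaries noted above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical description of one checkerboard region and its owner GPU.
struct Region { uint32_t x, y, width, height; int ownerGpu; };

// Partition a frame into tiles and assign owners in a checkerboard
// pattern: adjacent tiles belong to different GPUs when numGpus >= 2.
std::vector<Region> checkerboardPartition(uint32_t frameWidth,
                                          uint32_t frameHeight,
                                          uint32_t tileSize,
                                          int numGpus)
{
    std::vector<Region> regions;
    for (uint32_t ty = 0; ty * tileSize < frameHeight; ++ty) {
        for (uint32_t tx = 0; tx * tileSize < frameWidth; ++tx) {
            Region r;
            r.x = tx * tileSize;
            r.y = ty * tileSize;
            r.width  = std::min(tileSize, frameWidth  - r.x);
            r.height = std::min(tileSize, frameHeight - r.y);
            // Alternate ownership along both axes.
            r.ownerGpu = static_cast<int>((tx + ty) % numGpus);
            regions.push_back(r);
        }
    }
    return regions;
}
```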

In certain embodiments, primitives (e.g., represented by primitive commands) for rendering the frame are transmitted to the two or more GPUs, and hardware circuits therein provide clip windows used to select which of the primitives are included in a given checkerboard rectangle. Non-selected primitives are discarded early to reduce GPU resource utilization. In one embodiment, complete primitive discard is accelerated at a primitive engine, where a given triangle bounding box is tested for intersection with a currently active ownership region (e.g., a rectangle in the checkerboard pattern that is assigned to a given GPU). Such primitive discard can be performed by hardware logic circuits, which may be positioned in a rendering pipeline after a position transform pipeline stage. In one embodiment, the transformed attributes of a discarded triangle are not written into memory and the triangle is not rasterized, thereby saving both processing cycles and memory bandwidth. In one embodiment, pixel-exact discard is implemented in a rasterizer circuit or a rasterizer shader function.
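
The coarse discard described above reduces, in essence, to a bounding-box overlap test between a transformed triangle and the GPU's active ownership region. The sketch below models that test in C++; the types are hypothetical, and the real test runs in hardware logic after the position transform stage.

```cpp
// Hypothetical screen-space axis-aligned bounding box.
struct Aabb2D { float minX, minY, maxX, maxY; };

// Returns true when the triangle's bounding box overlaps the currently
// active ownership region; otherwise the primitive can be discarded
// before its attributes are written or the triangle is rasterized.
bool primitiveIntersectsOwnedRegion(const Aabb2D& triangleBounds,
                                    const Aabb2D& ownedRegion)
{
    return triangleBounds.minX < ownedRegion.maxX &&
           triangleBounds.maxX > ownedRegion.minX &&
           triangleBounds.minY < ownedRegion.maxY &&
           triangleBounds.maxY > ownedRegion.minY;
}
```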

Each of the two or more GPUs render selected primitives within assigned regions. Rendering may include multiple rendering passes, and results from one rendering pass stored in one or more surfaces may be used by the two or more GPUs for one or more subsequent rendering passes. Rendering a given pass for an assigned region on a first GPU may require remote data from a second GPU. The remote data can be fetched on demand from the second (remote) GPU in response to a request by the first GPU, or the remote data can be copied asynchronously in advance of an anticipated request to potentially achieve a lower overall latency. In many common scenarios, sequentially rendered frames are self-similar and a memory access request pattern for one frame is substantially replicated in a subsequent frame, making each frame a good predictor of access patterns in a subsequent frame. For example, two sequential frames may include substantially the same command primitives, with each generating substantially the same memory access requests while being rendered. Consequently, an access pattern observed in connection with executing a first command stream to render a first frame may be used to anticipate which blocks of memory should be copied in advance of executing a second command stream to render a second frame. In certain embodiments, data within memory pages is stored in a compressed format and remote data is copied in the compressed format to reduce overall utilization of the high-speed data link.
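
A minimal sketch of the frame-coherence bookkeeping this paragraph implies: pages observed as remote hot pages during a rendering pass of frame N become prefetch candidates for the same pass of frame N+1. The data structures and names are assumptions for illustration only.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using PageId = uint64_t;   // page base address
using PassId = uint32_t;   // rendering pass index

// Hot pages reported while rendering the previous frame, keyed by pass.
std::map<PassId, std::set<PageId>> hotPagesLastFrame;

// Per-pass prefetch list for the next frame, predicted from the
// previous frame's observed access pattern.
std::vector<PageId> pagesToCopyBeforePass(PassId pass)
{
    auto it = hotPagesLastFrame.find(pass);
    if (it == hotPagesLastFrame.end()) return {};
    return { it->second.begin(), it->second.end() };
}
```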

FIG. 1A illustrates a flowchart of a method 100 for notifying a driver that a request count has reached a specified threshold, in accordance with one embodiment. Although the method 100 is described in conjunction with the systems of FIGS. 2-7, any processing system that implements method 100 is within the scope and spirit of embodiments of the present disclosure. In various embodiments, method 100 is implemented in the context of a graphics system configured to render graphics frames from scene information comprising graphics primitives. One or more operations of the method 100 may be performed by task-specific circuitry or by a combination of task-specific circuitry and general-purpose processing units. In one embodiment, method 100 is performed by a processing system, which may include a general-purpose central processing unit (CPU), a parallel processing unit (PPU), such as the PPU 200 of FIG. 2, or any combination thereof.

In one embodiment, the processing system includes a first GPU that is directly coupled to a first memory circuit, and a second GPU that is directly coupled to a second memory circuit. Furthermore, the first GPU is coupled to the second GPU through the high-speed data link. In one embodiment, the high-speed data link provides atomic peer access operations, and transfers data at a rate of at least one gigabyte per second. The first memory circuit is not directly coupled to the second GPU and the second memory circuit is not directly coupled to the first GPU.
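
The patent does not prescribe a software interface for this configuration; as a hedged analogy, the public CUDA runtime API can establish a comparable peer mapping between two GPUs, under which one device may load, store, and perform atomics on the other's memory over NVLink where available.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Enable mutual peer access between two GPUs (sketch using the public
// CUDA runtime API, not the patent's internal configuration path).
void enablePeerAccess(int gpu0, int gpu1)
{
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, gpu0, gpu1);
    cudaDeviceCanAccessPeer(&canAccess10, gpu1, gpu0);
    if (!canAccess01 || !canAccess10) {
        std::printf("GPUs %d and %d are not peer-capable\n", gpu0, gpu1);
        return;
    }
    // Each GPU maps the other's local memory for direct peer access.
    cudaSetDevice(gpu0);
    cudaDeviceEnablePeerAccess(gpu1, 0);
    cudaSetDevice(gpu1);
    cudaDeviceEnablePeerAccess(gpu0, 0);
}
```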

At step 102, the first GPU generates one or more first memory access requests in connection with rendering a first primitive (e.g., executing a first primitive command) of the first command stream, where at least one of the first memory access requests targets a first page of data that physically resides within the second memory circuit. In this context, the first primitive is associated with a first frame.

At step 104, the first GPU requests the first page of data through the high-speed data link. The request may include, without limitation, a read access request. To anticipate which pages of data should be copied in advance from a remote memory to a first memory that is local to the first GPU, remote memory accesses are tracked within each GPU. Specifically, hardware counters are configured to count access requests to different pages of data in memory and report high-traffic pages to a GPU driver. The hardware counters (registers) can be configured to selectively accumulate (by incrementing) access requests to different remote pages while the frame (current frame) is being rendered. For example, the hardware counters can be configured to selectively accumulate access requests only for certain contexts or sub-contexts of a specified rendering pass. A given counter can be restricted to accumulating access requests for a specific rendering pass and may be reset at the start of the rendering pass. In one embodiment, the hardware counters each include a programmable address range for accumulating an access request count. Only an access request with an address within the programmable address range can cause any given hardware counter to increment.
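
The following C++ model sketches the behavior of one such counter: it filters accesses by rendering pass and by a programmable address range, and raises a driver notification when a programmed threshold is reached. The struct and callback are illustrative assumptions; the actual counters are hardware register circuits.

```cpp
#include <cstdint>
#include <functional>

// Software model of one per-page remote-access counter.
struct AccessCounter {
    uint64_t rangeBase  = 0;   // programmable range [rangeBase, rangeLimit)
    uint64_t rangeLimit = 0;
    uint32_t passId     = 0;   // only this rendering pass is counted
    uint32_t count      = 0;
    uint32_t threshold  = 0;
    std::function<void(uint64_t)> notifyDriver;  // e.g., driver interrupt

    void onRemoteAccess(uint64_t address, uint32_t currentPass) {
        if (currentPass != passId) return;                // pass/context filter
        if (address < rangeBase || address >= rangeLimit) return;
        if (++count == threshold && notifyDriver)
            notifyDriver(rangeBase);  // report a high-traffic page
    }
};
```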

At step 106, a register circuit (e.g., a hardware counter) accumulates an access request count for the first page of data. For example, each access to the first page of data may cause the register circuit to increment an accumulated count by one. In one embodiment, the count is initialized to an integer value N and decremented by one for each access. In another embodiment, the count is initialized (e.g., to zero) and incremented until reaching a programmed threshold count.

In one embodiment, when an access count for a particular remote page exceeds a certain threshold, the GPU driver is notified and an identifier for the page (e.g., address and rendering pass) is added to a collection of high-traffic pages. The GPU driver may be configured to schedule high-traffic pages to be copied (e.g., using a hardware copy engine) from the local memory of one GPU that writes the pages to a local memory of a second GPU that reads the high-traffic pages. With the high-traffic pages copied to the local memory of the second GPU, remote traffic can be reduced. By separating access counts according to rendering pass, the GPU driver is better able to schedule when a particular page is copied in the overall sequence of rendering for a given frame. Furthermore, by restricting which contexts are counted, the hardware counters can be allocated more efficiently.

At step 108, the first GPU notifies a driver that the access request count has reached a specified threshold (i.e., a threshold value). The driver may comprise a software driver configured to execute in any technically feasible position within a given system architecture. For example, the driver may execute within a central processing unit (CPU) responsible for managing the operation of the first GPU and the second GPU.

In one embodiment, the first primitive is rendered by the first GPU according to a first primitive command included in a first command stream for a first frame and the second frame is rendered subsequent to the first frame. In one embodiment, the first command stream specifies a first rendering pass performed in connection with rendering the first frame, and a second command stream specifies the same rendering pass performed subsequently in connection with rendering the second frame, and the notifying occurs during the rendering of the first frame.

More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may or may not be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.

FIG. 1B illustrates a flowchart of a method 110 for accessing a page of copied data, in accordance with one embodiment. Although the method 110 is described in conjunction with the systems of FIGS. 2-7, any processing system that implements method 110 is within the scope and spirit of embodiments of the present disclosure. In various embodiments, method 110 is implemented in the context of a graphics system configured to render graphics frames from scene information comprising graphics primitives. One or more operations of the method 110 may be performed by task-specific circuitry or by a combination of task-specific circuitry and general-purpose processing units. In one embodiment, method 110 is performed by a processing system, which may include a general-purpose central processing unit (CPU), a parallel processing unit (PPU), such as the PPU 200 of FIG. 2, or any combination thereof. In one embodiment, method 110 is performed in response to receiving a copy command from the driver. Furthermore, the driver schedules the copy command upon receiving a notification, as described in step 108 of method 100.

At step 112, the first GPU receives a first copy command to copy the first page of data from the second memory circuit through the high-speed data link to produce a copy of the first page of data within the first memory circuit. In one embodiment, the first copy command is executed before the first GPU accesses the first page of data in connection with rendering the first primitive. In one embodiment, the first primitive is rendered by the first GPU according to a first primitive command included in a first command stream for a first frame and the second frame is rendered subsequent to the first frame. Rendering the first primitive for the second frame may cause the first GPU to generate one or more second memory access requests for data residing within the copy of the first page of data residing within the first memory circuit.

At step 114, the first GPU executes the first copy command to copy the first page of data from the second memory circuit (local memory of the second GPU) to the first memory circuit (local memory of the first GPU). At step 116, the first GPU generates the one or more second memory access requests targeting the first page of data residing within the first memory circuit in connection with rendering the first primitive for a second frame.
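
Expressed with the public CUDA runtime API as a stand-in for the patent's internal copy-engine interface, the copy of steps 112-114 might look like the sketch below; the function name and parameters are assumptions.

```cpp
#include <cuda_runtime.h>

// Asynchronously copy one page of data from the second GPU's local
// memory to the first GPU's local memory across the high-speed link,
// scheduled to complete before the first GPU renders the second frame.
void copyPageToLocal(void* dstLocal, int firstGpu,
                     const void* srcRemote, int secondGpu,
                     size_t pageBytes, cudaStream_t copyStream)
{
    cudaMemcpyPeerAsync(dstLocal, firstGpu, srcRemote, secondGpu,
                        pageBytes, copyStream);
}
```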

FIG. 1C illustrates a flowchart of a method 120 for scheduling a copy command for a page of data, in accordance with one embodiment. Although the method 120 is described in conjunction with the systems of FIGS. 2-7, any processing system that implements method 120 is within the scope and spirit of embodiments of the present disclosure. In various embodiments, method 120 is implemented in the context of a graphics system configured to render graphics frames from scene information comprising graphics primitives. One or more operations of the method 120 may be performed by task-specific circuitry or by a combination of task-specific circuitry and general-purpose processing units. In one embodiment, method 120 is performed by a processing system, which may include a general-purpose central processing unit (CPU), a parallel processing unit (PPU), such as the PPU 200 of FIG. 2, or any combination thereof. In one embodiment, a driver executing within the CPU is configured to perform method 120.

At step 122, the driver receives a notification that an access request count for the first page of data has reached a specified threshold. For example, in the course of rendering the first frame, the first GPU may access the first page of data, residing in the second memory circuit of the second GPU, as described in step 104 of FIG. 1A. When the first GPU accesses the first page of data a number of times equal to the specified threshold, the first GPU notifies the driver, causing the driver to receive the notification, as described in step 108 of FIG. 1A. Alternatively, when the second GPU services a threshold number of access requests to the first page of data from the first GPU, the second GPU may notify the driver, causing the driver to receive the notification.

At step 124, the driver schedules a copy command to copy the first page of data from the second GPU to the first GPU. In practice, the first page of data may reside within the second memory circuit of the second GPU, or within a cache circuit of the second GPU. The first page of data is transmitted to the first GPU and stored as a copy within the first memory circuit of the first GPU, or within a cache circuit of the first GPU. At step 126, the driver includes the copy command within a GPU command stream for rendering the second frame. In one embodiment, including the copy command within the GPU command stream comprises inserting the copy command into the stream.
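
A driver-side sketch of steps 122 through 126, with hypothetical CommandStream and CopyCommand types; a real driver would encode hardware-specific commands into a pushbuffer rather than use these structures.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical encoding of a scheduled page copy between GPUs.
struct CopyCommand { uint64_t pageAddr; uint32_t srcGpu, dstGpu; };

struct CommandStream {
    // Copies placed ahead of the second frame's rendering commands.
    std::vector<CopyCommand> scheduledCopies;
    void scheduleCopy(const CopyCommand& c) { scheduledCopies.push_back(c); }
};

// Invoked when the driver receives a threshold notification (step 122).
void onThresholdNotification(CommandStream& secondFrameStream,
                             uint64_t pageAddr,
                             uint32_t srcGpu, uint32_t dstGpu)
{
    // Steps 124-126: schedule the copy and include it in the command
    // stream so the page is local before the second frame reads it.
    secondFrameStream.scheduleCopy({ pageAddr, srcGpu, dstGpu });
}
```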

In one embodiment, the GPU command stream is the second command stream for the first GPU, and the copy command may cause the first GPU to perform a copy operation specified by the copy command. In an alternative embodiment, the GPU command stream is a command stream for the second GPU, and the copy command may cause the second GPU to perform a copy operation specified by the copy command.

In one embodiment, the copy command is included in a dedicated copy command stream for copy commands, while graphics primitives are included in a general command stream. In such an embodiment, a semaphore may be used to mediate and synchronize progress of the copy command stream and the general command stream. For example, the semaphore mechanism may be configured to guarantee completion of the copy command prior to the start of a specified rendering pass comprising commands from the general command stream that will access the first page of data. In addition to a rendering pass serving as a synchronization barrier, individual graphics primitives and/or general commands may also serve in this way, and execution of specific commands may be mediated by a semaphore.
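
The semaphore's mediating role can be illustrated with CUDA events, which similarly gate one stream on completion of work in another; this mirrors the synchronization described above rather than the patent's own mechanism.

```cpp
#include <cuda_runtime.h>

// Make the render stream wait until the copy stream's transfer is done,
// analogous to the semaphore between the copy and general command streams.
void synchronizeCopyBeforePass(cudaStream_t copyStream,
                               cudaStream_t renderStream)
{
    cudaEvent_t copyDone;
    cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming);

    // ... enqueue cudaMemcpyPeerAsync(...) on copyStream here ...

    cudaEventRecord(copyDone, copyStream);           // signal: copy finished
    cudaStreamWaitEvent(renderStream, copyDone, 0);  // barrier for the pass
    // Rendering-pass work enqueued on renderStream after this point will
    // not begin executing until the page copy has completed.
    cudaEventDestroy(copyDone);
}
```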

In another embodiment, the copy command and graphics primitives are included in a unified command stream. A given copy command may execute asynchronously relative to other commands in the unified command stream, and any technically feasible technique (including a semaphore per copy command) may be implemented to provide a synchronization barrier between otherwise asynchronous copy completion and execution of commands that depend on copied data.

During normal operation, the driver may receive notifications from the first GPU, the second GPU, and, optionally, additional GPUs. In a typical usage mode, the first GPU writes data to a given page of data (e.g., during a first rendering pass) and the first GPU subsequently references the data many times for further processing (e.g., during a subsequent rendering pass). The notifications mark certain pages as high-traffic (hot) pages, which may benefit from being copied to a local memory where accesses occur locally rather than accessed repeatedly through the high-speed data link. With the notification information, the driver is able to schedule a copy command for a high-traffic page of data (e.g., the first page of data) determined during rendering of a first frame to occur prior to high-traffic access through the high-speed data link in connection with rendering a second frame. For example, the driver may schedule a copy command to copy the first page of data from the second memory circuit to the first memory circuit based on access intensity to the first page of data while one frame is rendered prior to the first GPU needing to access the first page of data to render a subsequent frame. In this way, the driver is able to adaptively manage where data resides for overall improved performance.

Allocation of rendering work between the first GPU and the second GPU may be accomplished using any technically feasible technique. One such technique is illustrated in FIG. 1D. More generally, rendering work may be allocated between two or more GPUs using any technically feasible technique without departing from the scope and spirit of various embodiments.

Citations

US 2014/0344528 A1 - TECHNIQUES FOR ASSIGNING PRIORITIES TO MEMORY COPIES
One embodiment sets forth a method for guiding the order in which a parallel processing subsystem executes memory copies. A driver creates semaphores for all...

US 2011/0057939 A1 - Reading a Local Memory of a Processing Unit
Disclosed herein are systems, apparatuses, and methods for enabling efficient reads to a local memory of a processing unit. In an embodiment, a processing unit...

US 2011/0210976 A1 - TECHNIQUES FOR TRANSFERRING GRAPHICS DATA FROM SYSTEM MEMORY TO A DISCRETE GPU
A method for transferring graphics data includes receiving graphics data in the system memory. The graphics data may be loaded into system memory by and...

US 2014/0176586 A1 - MULTI-MODE MEMORY ACCESS TECHNIQUES FOR PERFORMING GRAPHICS PROCESSING UNIT-BASED MEMORY TRANSFER OPERATIONS
This disclosure describes techniques for performing memory transfer operations with a graphics processing unit (GPU) based on a selectable memory transfer mode, and techniques for...
