
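 * Original link: http://developer.nvidia.com/object/Efficient_Use_Vertex_Buffers.html
  • The article is written mainly around DX7, but its content should also apply to GL, so it is summarized here. It is NVIDIA-centric, but by now these are probably rules worth following regardless of vendor.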

Contents

1 Terminology
2 The Problem
3 The Goal
3.1 Getting decent throughput for all kinds of data
4 Complications
4.1 Dynamic vs Static Data
4.2 Optimal Vertex Buffer Sizes
4.3 Optimal FVFs
4.4 To Index or Not to Index?
4.5 Strips vs Lists
4.6 The Meaning of "Optimize"
4.7 Which caches are at work and how do they affect things
4.7.1 GPU Memory Cache
4.7.2 GPU Vertex Cache
4.8 The Best Way to Create a Vertex Buffer
4.9 What does it mean to Lock a Vertex Buffer?
4.9.1 Using Flags to Improve Performance
4.9.2 What happens if I use both of DISCARDCONTENTS and NOOVERWRITE?
4.10 When Is It Bad to Use a Vertex Buffer?
5 Problem Resolution
5.1 The way to handle all cases:
5.1.1 Static Data
5.1.2 Dynamic Data
5.1.3 Large data sets
6 Essential thinking… or “Why it should all make sense”
6.1 Why the Vertex Buffer is good in the first case
6.2 Register renaming in CPUs
6.3 Thinking about the GPU caches
6.4 How DX8 changes things
7 Appendix 1
8 Appendix 2

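1 Terminology #

This article assumes the reader is familiar with DirectX 7.

HEL: The software emulation layer supported by DX7 for handling transform, lighting and rasterization (everything is done on the CPU).
Regular HAL: The software TnL HAL in DX7. All transform and lighting is handled by the CPU; hardware acceleration is used only for rasterization.
TnL HAL: Uses hardware acceleration in DX7 to handle all of transform, clipping, lighting and rasterization.
VB: Vertex Buffer, as defined in DX7.
GPU: Graphics Processing Unit. A chip containing all the logic needed to handle transform, lighting, clipping and rasterization.
FVF: Flexible Vertex Format, as defined in DX7.

The following lists the kinds of memory typically found on a PC and what each is used for.
Local video memory: frame buffer, Z buffer, most textures
AGP memory: textures that did not fit because video memory was full, vertex buffers
System memory: program code, vertex buffers explicitly loaded into system memory, all system memory surfaces and copies

2 The Problem #

Misuse of vertex buffers is the single most common structural problem in DX7 applications. In day-to-day work with developers, the issue that developer technology staff run into most often is inappropriate use of VBs. The side effects of sub-optimal use in DX7 range from significant to extremely severe. Making the right decisions early in development saves a great deal of time. Re-architecting an application that abuses vertex buffers can be a painful task.

3 The Goal #

3.1 Getting decent throughput for all kinds of data #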

The objective for any performance sensitive app should be to produce a polygon throughput that stresses the platform on which it is running. One of the main difficulties with achieving this is the wide spread of capabilities from low-end legacy hardware to top performance modern hardware. In spite of this spread, there are a few basic rules that you can apply. These rules will help you to maximize the performance on each and every one of the machines that your app will end up running on. The key notion that is examined in this paper is Vertex Buffers - and their efficient use.

We will target high poly throughputs with both static and dynamic data and show that, unlike previous hardware generations, it is now possible to write apps that actually get within a few percentage points of the ‘theoretical’ maximum throughput.

4 Complications #

4.1 Dynamic vs Static Data #

One of the simplest misconceptions about hardware transform and lighting ("H/W TnL") is that it is designed exclusively to handle static data, and that dynamic data suffers such a severe performance penalty as to eliminate the advantage of hardware acceleration.

In fact, correct and careful use of the API is capable of showing very high throughputs for dynamically generated data (far higher than using the CPU alone). Throughputs as high as 11 million triangles per second have been achieved on GeForce 256 with dynamic data (which is around 70% of the peak theoretical figure for static data!).

Static data is easier to handle and many benchmarks exist which are able to demonstrate polygon throughputs that actually match the hardware specified rates.

It is necessary to use different data management approaches for dynamic and static data, but both can be handled within the same application without great difficulty.

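4.2 Optimal Vertex Buffer Sizes #

There is no single optimal VB size for the GeForce family of GPUs, but the following rules apply:
  1. Switching VBs is expensive (much more so than with the regular HAL), so it is a good idea to batch multiple objects into a single VB. Doing this alone saves the transition cost.
  2. Batches of fewer than about 200 triangles are certain to be sub-optimal. Very small batches (10 polygons or fewer) should be treated as a prime target for optimization.
  3. Carrying redundant data in the vertex format can significantly bloat the VB and reduces the overall rate at which vertex data can be fetched across the AGP bus.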

Note that (1) and (3) can produce conflicting demands. Since switching VB is so expensive it can be advantageous to use a common vertex format throughout all or much of your app so that all vertex data can live in a small number of VBs. This is quite typical of optimization issues and makes the task significantly more complex.

Microsoft recommends "around 1000 vertices" for most uses of DX7, but the evidence so far suggests that larger is better, and sizes of 2000 vertices or more are known to give clearly better results.

4.3 Optimal FVFs #

The general rule is: Prefer compact FVFs. Redundant data tends to reduce the efficiency of the bus usage when transferring data – and is usually a bad thing. Sometimes it can be justified on the basis of saving VB transitions or eliminating duplicate vertices, but most usually it simply slows the system down and should be avoided. If including a small amount of redundant data (say, a second texture coordinate pair) allows you to use just one dynamic VB then it’s almost certain to be a win because there’s a bug in the DX7 runtime which means that there are special performance benefits to using one, and only one, dynamic VB.

That said there are certain highly efficient data types which the GeForce family of GPUs handle very quickly. These are typically 32 bytes or 64 bytes in length.
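To see how a format ends up at one of the 32 or 64 byte sizes mentioned above, the per-vertex size can be summed from its components. The Python sketch below is illustrative only: the component names and both helper functions are hypothetical and are not part of the DX7 API, and it assumes 32-bit floats and 32-bit packed colours.

```python
# Byte sizes of common vertex components (assumed: 32-bit floats, packed colours).
COMPONENT_BYTES = {
    "position": 12,  # x, y, z as three floats
    "normal": 12,    # nx, ny, nz as three floats
    "diffuse": 4,    # one packed 32-bit colour
    "specular": 4,   # one packed 32-bit colour
    "texcoord": 8,   # one (u, v) pair as two floats
}


def vertex_stride(components):
    """Return the size in bytes of one vertex built from the named components."""
    return sum(COMPONENT_BYTES[name] for name in components)


def is_preferred_size(stride):
    """True when the stride is one of the 32 or 64 byte sizes the article singles out."""
    return stride in (32, 64)
```

Under these assumptions, position plus normal plus one texture coordinate pair comes to exactly 32 bytes, while adding a diffuse colour pushes the vertex to 36 bytes and off the preferred size.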

4.4 To Index or Not to Index? #

Generally, indexed primitives are to be preferred over non-indexed primitives. The single best format in which to send data to the API is as indexed strips, a close second is indexed lists. Indexing can substantially reduce the total bandwidth consumed when handling long sequences of triangles; as can be seen from the optimal case in which a strip ends with a final index that refers to a vertex that is already in the GPU cache. In such a case the total amount of data that needs to be loaded into the GPU to setup the final triangle is just two bytes (since all indices in DX7 are 16 bit quantities).

As bandwidth demands increase in future chips, it is reasonable to expect indexing to be preferred well into the future. In addition, the introduction in DX8 of index buffers is a step that will most certainly help.

4.5 Strips vs Lists #

Because strips represent each additional triangle by adding just one extra vertex, it should come as no surprise to discover that strips are the preferred way of passing data to the API. In the limit, rectangular grids approach the ideal throughput of two triangles for every additional vertex. In such cases, indexed strips can send just 1 word per triangle (indexed lists require at least 3 words per triangle).
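The index traffic behind this comparison can be worked out with a little arithmetic. The sketch below is a simplified model, not a measurement: the function names are hypothetical, it assumes one uninterrupted strip with no restarts or degenerate triangles, and it uses the 16-bit index size that DX7 specifies.

```python
INDEX_BYTES = 2  # DX7 indices are 16-bit quantities


def strip_index_bytes(triangles):
    """Index bytes for one strip: 3 indices for the first triangle, 1 for each after."""
    if triangles <= 0:
        return 0
    return (triangles + 2) * INDEX_BYTES


def list_index_bytes(triangles):
    """Index bytes for an indexed list: 3 indices per triangle."""
    return 3 * triangles * INDEX_BYTES


def non_indexed_list_bytes(triangles, stride):
    """Vertex bytes for a non-indexed list: 3 full vertices per triangle."""
    return 3 * triangles * stride
```

For 1000 triangles this model gives 2004 bytes of indices for a strip and 6000 bytes for a list, against 96000 bytes of vertex data for a non-indexed list of 32-byte vertices.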

4.6 The Meaning of "Optimize" #

Calling Optimize() on a VB has three main effects.
  1. It means that the contents of the VB are re-arranged in such a way that they can no longer be used with any other rendering device. So a VB allocated in AGP memory which is passed to Optimize can no longer be read directly by the CPU. Locks, and calls to ProcessVertices, will both fail.
  2. Rendering will, in some cases, be noticeably faster. For hardware TnL devices you should expect only a minimal speed up, but for software devices Optimize is a significant opportunity.
  3. VBs that have been optimized can never be locked thereafter.

We recommend using Optimize in all cases where your data is highly persistent. You should not call Optimize on dynamic data (because you’ll lose the right to lock that VB) and you should not call Optimize on data that lasts for only a few renders.

If calling Optimize would force you to Create or Destroy VBs in any time critical code then you should avoid it at all costs. You should never call Optimize in time-critical code.

4.7 Which caches are at work and how do they affect things #

On the GeForce family of GPUs there are two separate caches, and effective use of both can make a big difference.

4.7.1 GPU Memory Cache #

There is a pure memory cache which simply stores the most recently used lines of AGP memory which were read when fetching vertices from VBs. The size of the cache line is 32 bytes so in cases where your FVF isn’t an exact multiple of 32 then it’s in your interest to access vertex data in a roughly sequential way (because otherwise reads into the cache will often fetch data which will not be used). True random access into a VB effectively eliminates this cache and therefore should be avoided. As with all caches “locality of reference” is the basic principle that brings rewards.

“Locality of reference” is a term which simply means tending to make each memory access as close to the previous access as possible.
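The effect of access order on a line-based cache can be illustrated with a toy model. The sketch below is not a description of the real hardware: the article gives only the 32-byte line size, so the cache capacity of four lines, the least-recently-used replacement policy and the function name are all assumptions made for illustration.

```python
from collections import OrderedDict

LINE_BYTES = 32  # cache line size stated in the article


def count_line_fetches(vertex_order, stride, cache_lines=4):
    """Count cache-line fetches needed to read vertices in the given order.

    cache_lines and the LRU policy are assumed values, not hardware facts.
    """
    cache = OrderedDict()  # line number -> True, oldest entry first
    fetches = 0
    for vertex in vertex_order:
        first = (vertex * stride) // LINE_BYTES
        last = (vertex * stride + stride - 1) // LINE_BYTES
        for line in range(first, last + 1):
            if line in cache:
                cache.move_to_end(line)  # mark as most recently used
            else:
                fetches += 1
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return fetches
```

With a 24-byte vertex, reading 100 vertices in order touches each of the 75 lines exactly once in this model, whereas a scattered order such as `[(i * 37) % 100 for i in range(100)]` refetches lines that straddle vertex boundaries and costs noticeably more.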

4.7.2 GPU Vertex Cache #

The second GPU cache (which is better known) is a cache of post-transform and light data. The last ten distinct vertices that the chip has processed are maintained in a cache that is driven as a FIFO (first in first out). The value of this cache is greater the higher the load on the GPU. The fact that the cache is run as a FIFO is not intuitive – but in practice it has been shown to be highly effective. Just like the memory cache the essential lesson is to go for locality of reference. Vertex data that has just been handled should be used again as soon as possible. As with the memory cache, random access is a bad policy.

It should be obvious that the Vertex Cache is only able to help you if you use indexed primitives. If that’s not obvious, think about it for a short while…
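A small simulation makes the point concrete. The sketch below models the ten-entry FIFO described above and counts how many vertices have to be transformed for a given index sequence. The function name is hypothetical and the model ignores everything about the hardware except the FIFO behaviour.

```python
from collections import deque


def count_vertex_transforms(indices, cache_size=10):
    """Count post-transform cache misses for an index stream, using a FIFO cache."""
    fifo = deque()
    misses = 0
    for index in indices:
        if index in fifo:
            continue  # hit: a FIFO does not reorder entries on reuse
        misses += 1
        fifo.append(index)
        if len(fifo) > cache_size:
            fifo.popleft()  # evict the oldest entry
    return misses
```

Two triangles sharing an edge, indexed as `[0, 1, 2, 2, 1, 3]`, need only four transforms in this model. Sent without indices, the same two triangles are six separate vertices and nothing can be reused.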

4.8 The Best Way to Create a Vertex Buffer #

When VBs are going to be used for rendering (i.e. not submitted to ProcessVertices) then their optimal placement is an important matter.

Software devices (e.g. the regular HAL) need their VBs to be placed in system memory at creation time. Hardware devices (i.e. the TnL HAL) need their VBs to reside in non-system memory. Typically this means that the driver will put them into AGP memory. AGP memory is fast for the CPU to write, and fast for the GPU to read – but very slow for the CPU to read.

For a GPU the ideal create call will use only WRITEONLY. Generally, you should not specify SYSTEMMEMORY for VBs that will be used for the GPU rendering methods.

The underlying logic is surprisingly simple. You should apply the following rule. “VBs should be created with the SYSTEMMEMORY flag if and only if they will be read by the CPU”. That includes the app using the CPU directly to access data, and it includes the runtime accessing data via ProcessVertices. There’s no gray area here. If the CPU is going to read from them, then your VBs belong in system memory.

Naturally, you should avoid creating ‘pathological’ situations where you use the CPU to read from a VB very rarely, accompanied by very frequent access by the GPU. In such cases, you probably do better to maintain a system memory copy for the CPU to use and an AGP copy for the GPU.

It’s also worth noting that by far the most typical usage pattern for VB data is that the app creates it and only ever writes to it. Using the CPU to read from a VB is relatively unusual…
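The placement rule in this section reduces to a single decision. The sketch below expresses it as a helper that returns flag names as plain strings. The function and its parameters are hypothetical and it does not call any DirectX API; it only encodes the rule that SYSTEMMEMORY is used if and only if the CPU will read the VB, either directly or through ProcessVertices.

```python
def creation_flags(app_reads_with_cpu, used_with_process_vertices=False):
    """Return the set of creation flag names suggested by the article's rule."""
    cpu_will_read = app_reads_with_cpu or used_with_process_vertices
    if cpu_will_read:
        return {"SYSTEMMEMORY"}  # CPU reads: keep the VB in system memory
    return {"WRITEONLY"}         # GPU-only reads: let the driver use AGP memory
```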

4.9 What does it mean to Lock a Vertex Buffer? #

Locking a VB is the way by which you get direct access to the memory within a VB via a pointer which is returned from the Lock call. It is important to note that this pointer is only valid for the duration of the Lock. Like all DirectX surfaces there is no reason to expect the pointer returned from successive calls to Lock to remain valid. This means that you should never attempt to retain a pointer into a VB after releasing the lock. Using that pointer is almost certain to lead to major system instabilities that can be very hard to track down.

As with all surfaces in DirectX, locking a VB can be an arbitrarily slow process, and for that reason Lock should be used with great care.

4.9.1 Using Flags to Improve Performance #

Critically, locking a VB that is currently in use by the GPU can stall the whole graphics pipeline and cause severe performance impact. When locking VBs care should be taken to make sure that the flags WRITEONLY, DISCARDCONTENTS and NOOVERWRITE are always used in the correct way.

The flags are clearly described in the DX7 help files and you should take great care to read and understand the pseudo code that is provided at the end of this document in Appendix 1.

4.9.2 What happens if I use both of DISCARDCONTENTS and NOOVERWRITE? #

Many people (including the author) have wrongly assumed that DISCARDCONTENTS is the ‘winning’ flag if both are set. This is not the case. The implicit promise given when using NOOVERWRITE clearly indicates that the app is guaranteeing that it will not trample on any data that is currently in use – and for that reason DISCARDCONTENTS is ignored if NOOVERWRITE is also supplied.

It is recommended that you avoid passing both flags to a single call to Lock.

Using WRITEONLY for both creation and locking is highly advantageous as it allows the driver to return AGP memory. CPU reads from AGP memory are very slow, but GPU reads from AGP are several times faster than from CPU-cached system memory.
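The precedence between the two lock flags can be written down as a tiny decision function. This is a sketch of the behaviour described in this section only: the function name and the "DEFAULT" label are hypothetical, and flags are modelled as strings rather than real DX7 constants.

```python
def effective_lock_behaviour(flags):
    """Return which lock behaviour applies for a collection of flag names."""
    flags = set(flags)
    if "NOOVERWRITE" in flags:
        # Per the article, DISCARDCONTENTS is ignored when NOOVERWRITE is also given.
        return "NOOVERWRITE"
    if "DISCARDCONTENTS" in flags:
        return "DISCARDCONTENTS"
    return "DEFAULT"  # hypothetical label for a lock with neither flag
```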

4.10 When Is It Bad to Use a Vertex Buffer? #

Never!

If you are coding for a hardware accelerated transform and lighting device then you should always use vertex buffers. This applies to all vertex data, even for non-3D operations like HUD art or text. If you fail to use a vertex buffer the runtime will copy your data into its own internal VB but will not apply the optimal rules for using DISCARDCONTENTS and NOOVERWRITE.

In severe cases we have seen performance cut by more than one half as a consequence of introducing non-VB based rendering. Don’t make this mistake!

5 Problem Resolution #

5.1 The way to handle all cases: #

5.1.1 Static Data #

Static data is any vertex data that does not change for hundreds of frames, possibly for the life of the game (or level). Static data should be handled by committing it to AGP memory (by specifying WRITEONLY and not specifying SYSTEMMEMORY). Because there is typically much more AGP memory than free video memory this can work even if you have many megabytes that need to be held. Most current games don’t use in excess of 32MB of vertex data – if you are considering doing so then you should be careful to ensure that you have good justification. Remember also that AGP allocations always consume real physical memory that cannot be swapped out by the Windows virtual memory manager.

For each VB of static data the following rules should be applied.
  1. Create the VB using only the flag WRITEONLY.
  2. Lock the VB once (again using WRITEONLY) and fill the VB. If this code is likely to be executed in a performance sensitive situation then try to make your writes sequential to allow the write combining capabilities of the CPU to work for you. Random access writes to AGP memory are much slower.
  3. Unlock the VB.
  4. Optimize the VB.

5.1.2 Dynamic Data #

Dynamic data is data which can be varied or ‘written’ to (even if it’s only changed infrequently) in the course of the game (or level). As such, it will require locking from time to time and therefore can never be the subject of a call to Optimize.

Dynamic data falls into two distinct categories. Data which is only the subject of write operations is most efficiently rendered and this should generally be your aim with dynamic data. The alternative is to have data which can be both read and written by the CPU. For performance reasons this should not be kept in AGP memory as CPU reads from AGP are very slow. These two dynamic data types are referred to as W/O (Write Only) and R/W (read write) in the remainder of this section.

For each VB of dynamic data the following rules should be applied.

W/O: Create the VB using only the flag WRITEONLY and then every time the data is updated…
  1. Lock the VB (using WRITEONLY and one of DISCARDCONTENTS and NOOVERWRITE) and fill the VB. See the pseudo code in Appendix 1 for guidance on which flags to use and when. If this code is likely to be executed in a performance sensitive situation then try to make your writes sequential to allow the write combining capabilities of the CPU to work for you.
  2. Unlock the VB.
  3. Submit the VB to DIPVB.

R/W: Create the VB with the flag SYSTEMMEMORY and then every time the data is updated…
  1. Lock the VB and fill the VB with any read/modify/write operations required. Note that you will be unable to specify any of the DISCARDCONTENTS, NOOVERWRITE and WRITEONLY flags.
  2. Unlock the VB.
  3. Submit the VB to DIPVB.

N.B. Just to complicate matters, there’s a bug in the DX7 runtime which means that there are special performance benefits to using exactly one dynamic VB.

5.1.3 Large data sets #

Large data sets are typically best broken down into subsets and each subset treated as static or dynamic on its own merits. If you are thinking of creating very many large VBs or substantial numbers of VBs you should be aware of the direct consequences.

Firstly, for every VB which you create there is a system memory overhead of around 2K which is used by the runtime for managing its own resources.

If you have large numbers of VBs it’s usually an indication that you will be switching your rendering from one VB to another frequently. Since this transition is one of the most expensive operations under DX7 you should strive to avoid this penalty.

Note that although the maximum number of vertices in a VB is 65535 it’s not usually a great idea to approach this kind of size. Since the VB renaming scheme in DX7 requires the driver to find a contiguous free block of AGP memory which is the same size as the original buffer, it can be easier for the driver to satisfy these requirements if the request is for a smaller chunk of memory. Typically that’s more likely in cases where the VBs are themselves smaller.

Data sets which are as large or larger than the actual physical memory of the machine are typically best handled as in the following case:

Situation:
  • Physical memory: 128MB
  • AGP Heap size: 44MB
  • Vertex Data: 256MB

Suggested arrangement:

For each FVF type that you need to support:
  1. Create a WRITEONLY VB in AGP memory which is large enough to hold 4K of vertices.
  2. Create a system memory copy of all the data which matches that FVF. Do this even when it exhausts all physical memory and forces Windows to virtualize your vertex data.

Then at render time:
  1. BeginScene()
    1. Lock the appropriate VB using WRITEONLY and NOOVERWRITE for the first lock, and using both WRITEONLY and DISCARDCONTENTS on subsequent locks.
    2. Copy your virtualized vertex data into the matched VB until the VB is full. Unlock the VB, and immediately render from it. Continue with this VB and FVF until all data of this format is rendered.
    3. Cycle through the full set of FVFs until they’ve all been rendered.
  2. EndScene()
  3. Flip() or Blt()

It is advantageous to reverse the order of rendering from one frame to the next as this allows data which is left in physical memory at the end of one frame to be used at the start of the next without requiring too much intervention from the Windows virtual memory manager.

As before it’s also helpful to arrange for your writes to the VB to visit successive memory locations. Any other access pattern will show poorer performance because of the way the CPU interacts with AGP memory.
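The render-time loop for one FVF can be sketched as a per-frame plan of chunks. The code below only computes which slice of the system memory copy goes into the VB on each pass and which lock flags accompany it; it performs no rendering. The function name is hypothetical, flags are plain strings, and "4K of vertices" is taken to mean 4096 vertices, which is an assumption.

```python
def plan_frame(total_vertices, vb_capacity=4096, reverse=False):
    """Return a list of (start, count, lock_flags) tuples for one frame.

    The first lock uses NOOVERWRITE and later locks use DISCARDCONTENTS,
    following the arrangement described in this section.
    """
    chunks = [
        (start, min(vb_capacity, total_vertices - start))
        for start in range(0, total_vertices, vb_capacity)
    ]
    if reverse:
        chunks.reverse()  # alternate direction on the next frame
    plan = []
    for n, (start, count) in enumerate(chunks):
        flag = "NOOVERWRITE" if n == 0 else "DISCARDCONTENTS"
        plan.append((start, count, ("WRITEONLY", flag)))
    return plan
```

Calling `plan_frame(n, reverse=frame_number % 2 == 1)` is one way to alternate the traversal order between frames, so that the data touched last in one frame is touched first in the next.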

6 Essential thinking… or “Why it should all make sense” #


6.1 Why the Vertex Buffer is good in the first case #

VBs are good because they have semantics of ownership and because the driver is able to place them in optimal memory. The correct use of flags when creating and locking is critical to getting best performance.

Without clear semantics of ownership the driver would need to copy the contents of the VB into driver-owned memory with the consequential loss in performance and waste of bandwidth.

6.2 Register renaming in CPUs #

The technique applied by the driver known as “VB renaming” is a technique borrowed from modern high-performance CPU design. For those who have not met the technique before I’ll describe the situation in which a highly parallel register-based processor is able to gain benefits from register renaming.

Suppose the object code contains the following sequence of instructions.
mov	a,1
mov	b,a
mov	a,2

The first instruction poses no problem. A value of 1 is placed in ‘a’. The second instruction copies the content of ‘a’ into ‘b’. The third instruction waits for the second to complete and then a value of 2 is placed in ‘a’.

The problem which we seek to address is how to remove the stall implied in the execution of the third instruction. We cannot allow the value of 2 to be written to ‘a’ before the previous instruction completes…

The trick is to introduce a 3rd register which is not directly accessible to the code and which is managed by the processor itself.

Since the 3rd instruction is guaranteed to destroy the previous contents of ‘a’ we don’t really care where the value is put provided that we subsequently take all mention of ‘a’ as referring to that value. In this case we’ll introduce a shadow register ‘S’ which effectively takes the place of ‘a’ in subsequent instructions. So we can substitute the following code and gain the same effect.
mov	a,1
mov	b,a
mov	S,2

Now, provided that we subsequently direct all references to ‘a’ towards ‘S’ then we can completely parallelize the 2nd and 3rd instructions. If we achieve this then we have doubled the number of instructions which can be handled at one time.

VB renaming works in the same way but generalizes the renaming method by allowing the driver to perform renaming as many times as the app (and free memory) allows.

Without renaming performance can be pretty unimpressive.

With constructive use of the Lock and Create flags performance can be several times higher.
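The register renaming idea can be shown with a toy renamer. The sketch below is a simplified illustration, not a model of any real CPU or of the driver: every write gets a fresh physical register (named p0, p1, … here, which are made-up names), and every read is redirected to the most recent physical register for that name.

```python
def rename(instructions):
    """Rename registers in a list of (dest, src) moves.

    src is either an integer constant or the name of an earlier destination.
    Returns the renamed instructions and the final name-to-register mapping.
    """
    mapping = {}
    renamed = []
    for n, (dest, src) in enumerate(instructions):
        if isinstance(src, str):
            src = mapping[src]  # read the latest physical copy of the source
        physical = f"p{n}"      # each write gets a fresh physical register
        mapping[dest] = physical
        renamed.append((physical, src))
    return renamed, mapping
```

Applied to the three moves above, the third instruction writes p2 while the second reads p0, so the two no longer touch a common register and can proceed in parallel. Later references to ‘a’ resolve to p2, which plays the role of ‘S’.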

6.3 Thinking about the GPU caches #

The post-transform cache is only able to work for you at all when you use the indexed rendering interfaces (DrawIndexedPrimitiveVB using strips or lists).

The memory cache is only working for you when you use high locality of reference.

For these reasons, for maximum performance you should use indexed triangle strips where the vertices have been arranged in the VB in such a way as to make locality of reference of the vertex data implicit in the index sequence.

If it proves impractical to use strips then use indexed lists but otherwise observe all of the same rules.

6.4 How DX8 changes things #

The two main changes to the handling of VBs under DX8 are the introduction of Index Buffers and lightweight VB changes.

Index Buffers (or ‘IB’s) are a highly efficient way of passing indexing data to the API. Index sets which are constant, or which change only infrequently and which are regularly associated with specific VBs offer excellent optimization opportunities.

DX8 also introduces 32 bit indices and allows VBs to be much larger than before.

In DX8 it is now harder to render without using Vertex Buffers – which reduces the number of opportunities for getting things wrong.

And, for advanced users, there is a new idea of separate DMA streams which allow you to take different components of the vertex data from different VBs. This has the additional benefit of allowing the unchanging parts of the vertex data to be placed in a separate, static VB and therefore allowing the app to access and change only that part of the data which actually needs to change.

It’s fair to say that DX8 makes it easier to get things right, and harder to get things wrong. If you have not considered upgrading to DX8 then you should seriously consider it now.

7 Appendix 1 #

The following is pseudo code for updating a dynamic vertex buffer.
	CreateVB(WRITEONLY, 4K in size);
	i = 0;
Add:	Is there space in the VB for N more vertices?
	   Yes: { Flag = NOOVERWRITE; }
	   No:  { Flag = DISCARDCONTENTS; i = 0; }
	
	Lock(Flag | WRITEONLY);
	Fill in N vertices starting at index i.
	Unlock();

	DIPVB(i);
	i += N;
	GOTO Add;
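The flag selection in the pseudo code above can be exercised as ordinary code. The Python sketch below reproduces only the bookkeeping: it decides which flag to use, where to write and where the next write position falls. The function name is hypothetical, flags are plain strings, and the capacity of 4096 vertices is an assumed reading of "4K".

```python
def next_lock(write_pos, n, capacity=4096):
    """Decide the lock flag and write offset for appending n vertices.

    Returns (flag, offset, new_write_pos).
    """
    if n > capacity:
        raise ValueError("batch is larger than the vertex buffer")
    if write_pos + n <= capacity:
        flag, offset = "NOOVERWRITE", write_pos  # append after data still in use
    else:
        flag, offset = "DISCARDCONTENTS", 0      # buffer full: start a fresh one
    return flag, offset, offset + n
```

Appending three batches of 1500 vertices to a 4096-vertex buffer gives NOOVERWRITE at offset 0, NOOVERWRITE at offset 1500, and then DISCARDCONTENTS at offset 0, because the third batch no longer fits.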

8 Appendix 2 #

Please follow these rules.
  • Always use VBs, everywhere.
  • If you must lock a VB, always use WRITEONLY together with one of DISCARDCONTENTS or NOOVERWRITE.
  • Do not place VBs in system memory unless the CPU is going to read data from them.
  • Do not have the CPU read from VBs that the GPU reads.
  • Do not use ProcessVertices() in an attempt to speed up TnL hardware; in most cases it actually makes things slower.
