NV_fragment_shader_interlock

Name

NV_fragment_shader_interlock

Name Strings

GL_NV_fragment_shader_interlock

Contact

Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

Contributors

Jeff Bolz, NVIDIA Corporation
Mathias Heyer, NVIDIA Corporation

Status

Shipping

Version

Last Modified Date:         March 27, 2015
NVIDIA Revision:            2

Number

OpenGL Extension #468
OpenGL ES Extension #230

Dependencies

This extension is written against the OpenGL 4.3
(Compatibility Profile, dated February 14, 2013), and the
OpenGL ES 3.1.0 (dated March 17, 2014) Specification

This extension is written against the OpenGL Shading Language
Specification (version 4.30, revision 8) and the OpenGL ES Shading
Language Specification (version 3.10, revision 2).

OpenGL 4.3 and GLSL 4.30 are required in an OpenGL implementation
OpenGL ES 3.1 and GLSL ES 3.10 are required in an OpenGL ES implementation

This extension interacts with NV_shader_buffer_load and
NV_shader_buffer_store.

This extension interacts with NV_gpu_program4 and NV_gpu_program5.

This extension interacts with EXT_tessellation_shader.

This extension interacts with OES_sample_shading

This extension interacts with OES_shader_multisample_interpolation

This extension interacts with OES_shader_image_atomic

Overview

In unextended OpenGL 4.3 or OpenGL ES 3.1, applications may produce a
large number of fragment shader invocations that perform loads and
stores to memory using image uniforms, atomic counter uniforms,
buffer variables, or pointers. The order in which loads and stores
to common addresses are performed by different fragment shader
invocations is largely undefined.  For algorithms that use shader
writes and touch the same pixels more than once, one or more of the
following techniques may be required to ensure proper execution ordering:

  * inserting Finish or WaitSync commands to drain the pipeline between
    different "passes" or "layers";

  * using only atomic memory operations to write to shader memory (which
    may be relatively slow and limits how memory may be updated); or

  * injecting spin loops into shaders to prevent multiple shader
    invocations from touching the same memory concurrently.

This extension provides new GLSL built-in functions
beginInvocationInterlockNV() and endInvocationInterlockNV() that delimit a
critical section of fragment shader code.  For pairs of shader invocations
with "overlapping" coverage in a given pixel, the OpenGL implementation
will guarantee that the critical section of the fragment shader will be
executed for only one fragment at a time.

There are four different interlock modes supported by this extension,
which are identified by layout qualifiers.  The qualifiers
"pixel_interlock_ordered" and "pixel_interlock_unordered" provides mutual
exclusion in the critical section for any pair of fragments corresponding
to the same pixel.  When using multisampling, the qualifiers
"sample_interlock_ordered" and "sample_interlock_unordered" only provide
mutual exclusion for pairs of fragments that both cover at least one
common sample in the same pixel; these are recommended for performance if
shaders use per-sample data structures.

Additionally, when the "pixel_interlock_ordered" or
"sample_interlock_ordered" layout qualifier is used, the interlock also
guarantees that the critical section for multiple shader invocations with
"overlapping" coverage will be executed in the order in which the
primitives were processed by the GL.  Such a guarantee is useful for
applications like blending in the fragment shader, where an application
requires that fragment values to be composited in the framebuffer in
primitive order.

This extension can be useful for algorithms that need to access per-pixel
data structures via shader loads and stores.  Such algorithms using this
extension can access such data structures in the critical section without
worrying about other invocations for the same pixel accessing the data
structures concurrently.  Additionally, the ordering guarantees are useful
for cases where the API ordering of fragments is meaningful.  For example,
applications may be able to execute programmable blending operations in
the fragment shader, where the destination buffer is read via image loads
and the final value is written via image stores.

New Procedures and Functions

None.

New Tokens

None.

Modifications to the OpenGL 4.3 Specification (Compatibility Profile)

None.

Modifications to the OpenGL Shading Language Specification, Version 4.30

Including the following line in a shader can be used to control the
language features described in this extension:

  #extension GL_NV_fragment_shader_interlock : <behavior>

where <behavior> is as specified in section 3.3.

New preprocessor #defines are added to the OpenGL Shading Language:

  #define GL_NV_fragment_shader_interlock           1


Modify Section 4.4.1.3, Fragment Shader Inputs (p. 58)

(add to the list of layout qualifiers containing "early_fragment_tests",
 p. 59, and modify the surrounding language to reflect that multiple
 layout qualifiers are supported on "in")

  layout-qualifier-id
    pixel_interlock_ordered
    pixel_interlock_unordered
    sample_interlock_ordered
    sample_interlock_unordered

(add to the end of the section, p. 59)

The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered",
"sample_interlock_ordered", and "sample_interlock_unordered" control the
ordering of the execution of shader invocations between calls to the
built-in functions beginInvocationInterlockNV() and
endInvocationInterlockNV(), as described in section 8.13.3. A
compile or link error will be generated if more than one of these layout
qualifiers is specified in shader code. If a program containing a
fragment shader includes none of these layout qualifiers, it is as
though "pixel_interlock_ordered" were specified.

Add to the end of Section 8.13, Fragment Processing Functions (p. 168)

8.13.3, Fragment Shader Execution Ordering Functions

By default, fragment shader invocations are generally executed in
undefined order. Multiple fragment shader invocations may be executed
concurrently, including multiple invocations corresponding to a single
pixel. Additionally, fragment shader invocations for a single pixel might
not be processed in the order in which the primitives generating the
fragments were specified in the OpenGL API.

The paired functions beginInvocationInterlockNV() and
endInvocationInterlockNV() allow shaders to specify a critical section,
inside which stronger execution ordering is guaranteed.  When using the
"pixel_interlock_ordered" or "pixel_interlock_unordered" qualifier,
ordering guarantees are provided for any pair of fragment shader
invocations X and Y triggered by fragments A and B corresponding to the
same pixel. When using the "sample_interlock_ordered" or
"sample_interlock_unordered" qualifier, ordering guarantees are provided
for any pair of fragment shader invocations X and Y triggered by fragments
A and B that correspond to the same pixel, where at least one sample of
the pixel is covered by both fragments. No ordering guarantees are
provided for pairs of fragment shader invocations corresponding to
different pixels. Additionally, no ordering guarantees are provided for
pairs of fragment shader invocations corresponding to the same fragment.
When multisampling is enabled and the framebuffer has sample buffers,
multiple fragment shader invocations may result from a single fragment due
to the use of the "sample" auxilliary storage qualifier, OpenGL API
commands forcing multiple shader invocations per fragment, or for other
implementation-dependent reasons.

When using the "pixel_interlock_unordered" or "sample_interlock_unordered"
qualifier, the interlock will ensure that the critical sections of
fragment shader invocations X and Y with overlapping coverage will never
execute concurrently. That is, invocation X is guaranteed to complete its
call to endInvocationInterlockNV() before invocation Y completes its call
to beginInvocationInterlockNV(), or vice versa.

When using the "pixel_interlock_ordered" or "sample_interlock_ordered"
layout qualifier, the critical sections of invocations X and Y with
overlapping coverage will be executed in a specific order, based on the
relative order assigned to their fragments A and B.  If fragment A is
considered to precede fragment B, the critical section of invocation X is
guaranteed to complete before the critical section of invocation Y begins.
When a pair of fragments A and B have overlapping coverage, fragment A is
considered to precede fragment B if

  * the OpenGL API command producing fragment A was called prior to the
    command producing B, or

  * the point, line, triangle, [[compatibility profile: quadrilateral,
    polygon,]] or patch primitive producing fragment A appears earlier in
    the same strip, loop, fan, or independent primitive list producing
    fragment B.

When [[compatibility profile: decomposing quadrilateral or polygon
primitives or]] tessellating a single patch primitive, multiple
primitives may be generated in an undefined implementation-dependent
order.  When fragments A and B are generated from such unordered
primitives, their ordering is also implementation-dependent.

If fragment shader X completes its critical section before fragment shader
Y begins its critical section, all stores to memory performed in the
critical section of invocation X using a pointer, image uniform, atomic
counter uniform, or buffer variable qualified by "coherent" are guaranteed
to be visible to any reads of the same types of variable performed in the
critical section of invocation Y.

If multisampling is disabled, or if the framebuffer does not include
sample buffers, fragment coverage is computed per-pixel. In this case,
the "sample_interlock_ordered" or "sample_interlock_unordered" layout
qualifiers are treated as "pixel_interlock_ordered" or
"pixel_interlock_unordered", respectively.


  Syntax:

    void beginInvocationInterlockNV(void);
    void endInvocationInterlockNV(void);

  Description:

The beginInvocationInterlockNV() and endInvocationInterlockNV() may only
be placed inside the function main() of a fragment shader and may not be
called within any flow control.  These functions may not be called after a
return statement in the function main(), but may be called after a discard
statement.  A compile- or link-time error will be generated if main()
calls either function more than once, contains a call to one function
without a matching call to the other, or calls endInvocationInterlockNV()
before calling beginInvocationInterlockNV().

Additions to the AGL/GLX/WGL Specifications

None.

Errors

None.

New State

None.

New Implementation Dependent State

None.

Interactions with OpenGL ES 3.1

Disabling multisample rasterization is not available on OpenGL ES;
it is always enabled.

Dependencies on EXT_tessellation_shader

 If this extension is implemented on OpenGL ES and EXT_tessellation_shader
 is not supported, remove language referring to tessellation of patch
 primitives.

Dependencies on OES_sample_shading

 If this extension is implemented on OpenGL ES and OES_sample_shading
 is not supported, remove references to per-sample shading via
 MinSampleShading[OES]()

Dependencies on OES_shader_image_atomic

If this extension is implemented on OpenGL ES and OES_shader_image_atomic
is not supported, disregard language referring to atomic memory operations.

Dependencies on OES_shader_multisample_interpolation

If this extension is implemented on OpenGL ES and OES_shader_- multisample_interpolation is not supported, ignore language about the "sample" auxilliary storage qualifier.

Dependencies on NV_shader_buffer_load and NV_shader_buffer_store

If NV_shader_buffer_load and NV_shader_buffer_store are not supported,
references to ordering memory accesses using pointers should be deleted.

Dependencies on NV_gpu_program4 and NV_fragment_program4

Modify Section 2.X.2, Program Grammar, of the NV_fragment_program4
specification (which modifies the NV_gpu_program4 base grammar)

  <SpecialInstruction>    ::= "FSIB"
                            | "FSIE"


Modify Section 2.X.4, Program Execution Environment

(add to the opcode table)

              Modifiers
  Instruction F I C S H D  Out Inputs    Description
  ----------- - - - - - -  --- --------  --------------------------------
  FSIB        - - - - - -  -   -         begin fragment shader interlock
  FSIE        - - - - - -  -   -         end fragment shader interlock


Modify Section 2.X.6, Program Options

+ Fragment Shader Interlock (NV_pixel_interlock_ordered,
  NV_pixel_interlock_unordered, NV_sample_interlock_ordered, and
  NV_sample_interlock_ordered)

If a fragment program specifies the "NV_pixel_interlock_ordered",
"NV_pixel_interlock_unordered", "NV_sample_interlock_ordered", or
"NV_sample_interlock_ordered" options, it will configure a critical
section using the FSIB (fragment shader interlock begin) and FSIE opcodes
(fragment shader interlock end) opcodes.  The execution of the critical
sections will be ordered for pairs of program invocations corresponding to
the same pixel, as described in Section 8.13.3 of the OpenGL Shading
Language Specification, where the four options are considered to specify
layout qualifiers with names equivalent to matching the program option.

A program will fail to load if it specifies more than one of these program
options, if it specifies exactly one of these options but does not contain
exactly one FSIB instruction and one FSIE instruction, or if it contains
an FSIB or FSIE instruction without specifying any of these options.


Add the following subsections to section 2.X.8, Program Instruction Set


Section 2.X.8.Z, FSIB:  Fragment Shader Interlock Begin

The FSIB instruction specifies the beginning of a critical section in a
fragment program, where execution of the critical section is ordered
relative to other fragments.  This instruction has no other effect.

The FSIB instruction is not allowed in arbitrary locations in a program.
A program will fail to load if it includes an FSIB instruction inside a
IF/ELSE/ENDIF block, inside a REP/ENDREP block, or inside any subroutine
block other than the one labeled "main".  Additionally, a program will
fail to load if it contains more than one FSIB instruction, or if its one
FSIB instruction is not followed by an FSIE instruction.

FSIB has no operands and generates no result.


Section 2.X.8.Z, FSIE:  Fragment Shader Interlock End

The FSIE instruction specifies the end of a critical section in a fragment
program, where execution of the critical section is ordered relative to
other fragments.  This instruction has no other effect.

The FSIE instruction is not allowed in arbitrary locations in a program.
A program will fail to load if it includes an FSIE instruction inside a
IF/ELSE/ENDIF block, inside a REP/ENDREP block, or inside any subroutine
block other than the one labeled "main".  Additionally, a program will
fail to load if it contains more than one FSIE instruction, or if its one
FSIE instruction is not preceded by an FSIB instruction.

FSIE has no operands and generates no result.

Issues

(1) What should this extension be called?

  RESOLVED:  NV_fragment_shader_interlock.  The
  beginInvocationInterlockNV() and endInvocationInterlockNV() commands
  identify a critical section during which other invocations with
  overlapping coverage are locked out until the critical section
  completes.

(2) When using multisampling, the OpenGL specification permits
    multiple fragment shader invocations to be generated for a single
    fragment.  For example, per-sample shading using the "sample"
    auxilliary storage qualifier or the MinSampleShading() OpenGL API command
    can be used to force per-sample shading.  What execution ordering
    guarantees are provided between fragment shader invocations generated
    from the same fragment?

  RESOLVED:  We don't provide any ordering guarantees in this extension.
  This implies that when using multisampling, there is no guarantee that
  two fragment shader invocations for the same fragment won't be executing
  their critical sections concurrently.  This could cause problems for
  algorithms sharing data structures between all the samples of a pixel
  unless accesses to these data structures are performed atomically.

  When using per-sample shading, the interlock we provide *does* guarantee
  that no two invocations corresponding to the same sample execute the
  critical section concurrently.  If a separate set of data structures is
  provided for each sample, no conflicts should occur within the critical
  section.

  Note that in addition to the per-sample shading options in the shading
  language and API, implementations may provide multisample antialiasing
  modes where the implementation can't simply run the fragment shader once
  and broadcast results to a large set of covered samples.

(3) What performance differences are expected between shaders using the
   "pixel" and "sample" layout qualifier variants in this extension (e.g.,
   "pixel_invocation_ordered" and "sample_invocation_ordered")?

  RESOLVED:  We expect that shaders using "sample" qualifiers may have
  higher performance, since the implementation need not order pairs of
  fragments that touch the same pixel with "complementary" coverage.  Such
  situations are fairly common:  when two adjacent triangles combine to
  cover a given pixel, two fragments will be generated for the pixel but
  no sample will be covered by both.  When using "sample" qualifiers, the
  invocations for both fragments can run concurrently.  When using "pixel"
  qualifiers, the critical section for one fragment must wait until the
  critical section for the other fragment completes.

(4) What performance differences are expected between shaders using the
   "ordered" and "unordered" layout qualifier variants in this extension
   (e.g., "pixel_invocation_ordered" and "pixel_invocation_unordered")?

  RESOLVED:  We expect that shaders using "unordered" may have higher
  performance, since the critical section implementation doesn't need to
  ensure that all previous invocations with overlapping coverage have
  completed their critical sections.  Some algorithms (e.g., building data
  structures in order-independent transparency algorithms) will require
  mutual exclusion when updating per-pixel data structures, but do not
  require that shaders execute in a specific ordering.

(5) Are fragment shaders using this extension allowed to write outputs?
    If so, is there any guarantee on the order in which such outputs are
    written to the framebuffer?

  RESOLVED:  Yes, fragment shaders with critical sections may still write
  outputs.  If fragment shader outputs are written, they are stored or
  blended into the framebuffer in API order, as is the case for fragment
  shaders not using this extension.

(6) What considerations apply when using this extension to implement a
    programmable form of conventional blending using image stores?

  RESOLVED:  Per-fragment operations performed in the pipeline following
  fragment shader execution obviously have no effect on image stores
  executing during fragment shader execution.  In particular, multisample
  operations such as broadcasting a single fragment output to multiple
  samples or modifying the coverage with alpha-to-coverage or a shader
  coverage mask output value have no effect.  Fragments can not be killed
  before fragment shader blending using the fixed-function alpha test or
  using the depth test with a Z value produced by the shader.  Fragments
  will normally not be killed by fixed-function depth or stencil tests,
  but those tests can be enabled before fragment shader invocations using
  the layout qualifier "early_fragment_tests".  Any required
  fixed-function features that need to be handled before programmable
  blending that aren't enabled by "early_fragment_tests" would need to be
  emulated in the shader.

  Note also that performing blend computations in the shader are not
  guaranteed to produce results that are bit-identical to these produced
  by fixed-function blending hardware, even if mathematically equivalent
  algorithms are used.

(7) For operations accessing shared per-pixel data structures in the
    critical section, what operations (if any) must be performed in shader
    code to ensure that stores from one shader invocation are visible to
    the next?

  RESOLVED:  The "coherent" qualifier is required in the declaration of
  the shared data structures to ensure that writes performed by one
  invocation are visible to reads performed by another invocation.

  In shaders that don't use the interlock, "coherent" is not sufficient as
  there is no guarantee of the ordering of fragment shader invocations --
  even if invocation A can see the values written by another invocation B,
  there is no general guarantee that invocation A's read will be performed
  before invocation B's write.  The built-in function memoryBarrier() can
  be used to generate a weak ordering by which threads can communicate,
  but it doesn't order memory transactions between two separate
  invocations.  With the interlock, execution ordering between two threads
  from the same pixel is well-defined as long as the loads and stores are
  performed inside the critical section, and the use of "coherent" ensures
  that stores done by one invocation are visible to other invocations.

(8) Should we provide an explicit mechanisms for shaders to indicate a
    critical section?  Or should we just automatically infer a critical
    section by analyzing shader code?  Or should we just wrap the entire
    fragment shader in a critical section?

  RESOLVED:  Provide an explicit critical section.

  We definitely don't want to wrap the entire shader in a critical section
  when a smaller section will suffice.  Doing so would hold off the
  execution of any other fragment shader invocation with the same (x,y)
  for the entire (potentially long) life of the fragment shader.  Hardware
  would need to track a large number of fragments awaiting execution, and
  may be so backed up that further fragments will be blocked even if they
  don't overlap with any fragments currently executing.  Providing a
  smaller critical section reduces the amount of time other fragments are
  blocked and allows implementations to perform useful work for
  conflicting fragments before they hit the critical section.

  While a compiler could analyze the code and wrap a critical section
  around all memory accesses, it may be difficult to determine which
  accesses actually require mutual exclusion and ordering, and which
  accesses are safe to do with no protection.  Requiring shaders to
  explicitly identify a critical section doesn't seem overwhelmingly
  burdensome, and allows applications to exclude memory accesses that it
  knows to be "safe".

(9) What restrictions should be imposed on the use of the
    beginInvocationInterlockNV() and endInvocationInterlockNV() functions
    delimiting a critical section?

  RESOLVED:  We impose restrictions similar to those on the barrier()
  built-in function in tessellation control shaders to ensure that any
  shader using this functionality has a single critical section that can
  be easily identified during compilation.  In particular, we require that
  these functions be called in main() and don't permit them to be called
  in conditional flow control.

  These restrictions ensure that there is always exactly one call to the
  "begin" and "end" functions in a predictable location in the compiled
  shader code, and ensure that the compiler and hardware don't have to
  deal with unusual cases (like entering a critical section and never
  leaving, leaving a critical section without entering it, or trying to
  enter a critical section more than once).

Revision History

Revision 2, 2015/03/27
  - Add ES interactions

Revision 1
  - Internal revisions