The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
llvm/lib/Target/AMDGPU directory.
Compute kernels executed on HSA [HSA] compatible runtimes
such as:
AMD’s ROCm™ runtime [AMD-ROCm] using the rocm-amdhsa
loader on Linux. See AMD ROCm Platform Release Notes [AMD-ROCm-Release-Notes] for supported hardware and
software.
AMD’s PAL runtime using the pal-amdhsa loader on
Windows.
amdpal
Graphics shaders and compute kernels executed on AMD’s PAL
runtime using the pal-amdpal loader on Windows and Linux
Pro.
mesa3d
Graphics shaders and compute kernels executed on AMD’s Mesa
3D runtime using the mesa-mesa3d loader on Linux.
Use the Clang options -mcpu=<target-id> or --offload-arch=<target-id> to
specify the AMDGPU processor together with optional target features. See
Target ID and Target Features for AMD GPU target
specific information.
Every processor supports every OS ABI (see AMDGPU Operating Systems) with the following exceptions:
Generic processors allow execution of a single code object on any of the processors that
it supports. Such code objects may not perform as well as those for the non-generic processors.
Generic processors are only available on code object V6 and above (see ELF Code Object).
Generic processor code objects are versioned. See Generic Processor Versioning for more information on how versioning works.
For a generic code object, adding a new supported processor may require the code generated for the generic target to be changed
so it can continue to execute on the previously supported processors as well as on the new one.
When this happens, the generic code object version number is incremented at the same time as the generic target is updated.
Each supported processor of a generic target is mapped to the version it was introduced in.
A generic code object can execute on a supported processor if the version of the code object being loaded is
greater than or equal to the version in which the processor was added to the generic target.
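The loading rule above can be sketched as a simple version comparison. This is a minimal illustration only: the processor names and version numbers in the table are hypothetical placeholders, not entries of any real generic target.

```python
# Hypothetical table for one generic target: processor name -> generic code
# object version in which that processor was added. Illustrative values only.
ADDED_IN_VERSION = {"proc-a": 1, "proc-b": 1, "proc-c": 2}


def can_execute(code_object_version, processor):
    """Return True if a generic code object of this version may run on processor."""
    added = ADDED_IN_VERSION.get(processor)
    if added is None:
        return False  # processor is not supported by this generic target
    # The code object must be at least as new as the version that added the processor.
    return code_object_version >= added
```

Under these assumed values, a version 1 code object runs on proc-a and proc-b but not on proc-c, which was only added in version 2.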
Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.
The target features supported by each processor are listed in
Processors.
Target features are controlled by exactly one of the following Clang
options:
-mcpu=<target-id> or --offload-arch=<target-id>
The -mcpu and --offload-arch options can specify target features as
optional components of the target ID. If a target feature is omitted, it has
the any value. See Target ID.
-m[no-]<target-feature>
Target features not specified by the target ID are specified using a
separate option. These target features can have an on or off
value. on is specified by omitting the no- prefix, and
off is specified by including the no- prefix. The default
if not specified is off.
Control the wavefront execution mode used
when generating code for kernels. When disabled
native WGP wavefront execution mode is used,
when enabled CU wavefront execution mode is used
(see Memory Model).
sramecc
-mcpu
--offload-arch
If specified, generate code that can only be
loaded and executed in a process that has a
matching setting for SRAMECC.
If not specified for code object V2 to V3, generate
code that can be loaded and executed in a process
with SRAMECC enabled.
If not specified for code object V4 or above, generate
code that can be loaded and executed in a process
with either setting of SRAMECC.
tgsplit
-m[no-]tgsplit
Enable/disable generating code that assumes
work-groups are launched in threadgroup split mode.
When enabled the waves of a work-group may be
launched in different CUs.
wavefrontsize64
-m[no-]wavefrontsize64
Control the wavefront size used when
generating code for kernels. When disabled
native wavefront size 32 is used, when enabled
wavefront size 64 is used.
xnack
-mcpu
--offload-arch
If specified, generate code that can only be
loaded and executed in a process that has a
matching setting for XNACK replay.
If not specified for code object V2 to V3, generate
code that can be loaded and executed in a process
with XNACK replay enabled.
If not specified for code object V4 or above, generate
code that can be loaded and executed in a process
with either setting of XNACK replay.
XNACK replay can be used for demand paging and
page migration. If enabled in the device, then if
a page fault occurs the code may execute
incorrectly unless generated with XNACK replay
enabled, or generated for code object V4 or above without
specifying XNACK replay. Executing code that was
generated with XNACK replay enabled, or generated
for code object V4 or above without specifying XNACK replay,
on a device that does not have XNACK replay
enabled will execute correctly but may be less
performant than code generated for XNACK replay
disabled.
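The loading rules stated for sramecc and xnack follow the same pattern and can be sketched as one predicate. This is a hedged illustration: the function name and the "on"/"off"/None encoding are assumptions, and for code object V2 to V3 the text only states the enabled case, so treating the disabled case as not loadable is a conservative assumption.

```python
def documented_as_loadable(code_setting, process_setting, code_object_version):
    """Sketch of the sramecc/xnack loading rules described above.

    code_setting: "on" or "off" if given in the target ID, None if not specified.
    process_setting: "on" or "off" for the process that loads the code.
    """
    if code_setting is not None:
        # A specified setting must match the process setting exactly.
        return code_setting == process_setting
    if code_object_version >= 4:
        # Not specified, V4 or above: either process setting is acceptable.
        return True
    # Not specified, V2 to V3: only the "enabled" case is stated in the text.
    # Rejecting the "off" case here is a conservative assumption.
    return process_setting == "on"
```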
AMDGPU supports target IDs. See Clang Offload Bundler for a general
description. The AMDGPU target specific information is:
processor
Is an AMDGPU processor or alternative processor name specified in
AMDGPU Processors. The non-canonical form target ID allows both
the primary processor and alternative processor names. The canonical form
target ID only allows the primary processor name.
target-feature
Is a target feature name specified in AMDGPU Target Features that
is supported by the processor. The target features supported by each processor
are specified in AMDGPU Processors. Those that can be specified in
a target ID are marked as being controlled by -mcpu and
--offload-arch. Each target feature must appear at most once in a target
ID. The non-canonical form target ID allows the target features to be
specified in any order. The canonical form target ID requires the target
features to be specified in alphabetical order.
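The ordering and uniqueness rules can be illustrated with a small canonicalization helper. This sketch assumes the colon-separated syntax with a trailing + or - per feature (for example gfx908:sramecc+:xnack-) described in the Clang Offload Bundler documentation. The function name is hypothetical, and mapping alternative processor names to the primary name is deliberately not modelled.

```python
def canonicalize_target_id(target_id):
    """Sort target features alphabetically and reject repeated features.

    Assumes the form processor(:feature(+|-))*. Alternative processor names
    are NOT translated to the primary name in this sketch.
    """
    processor, *features = target_id.split(":")
    settings = {}
    for feature in features:
        name, sign = feature[:-1], feature[-1:]
        if not name or sign not in ("+", "-"):
            raise ValueError(f"malformed target feature: {feature!r}")
        if name in settings:
            raise ValueError(f"target feature given more than once: {name!r}")
        settings[name] = sign
    ordered = [name + settings[name] for name in sorted(settings)]
    return ":".join([processor] + ordered)
```

For example, canonicalize_target_id("gfx908:xnack-:sramecc+") yields "gfx908:sramecc+:xnack-".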
The generic address space is supported unless the Target Properties column
of AMDGPU Processors specifies Does not support generic address
space.
The generic address space uses the hardware flat address support for two fixed
ranges of virtual addresses (the private and local apertures), that are
outside the range of addressable global memory, to map from a flat address to
a private or local address. This uses FLAT instructions that can take a flat
address and access global, private (scratch), and group (LDS) memory depending
on whether the address is within one of the aperture ranges.
Flat access to scratch requires hardware aperture setup and setup in the
kernel prologue (see Flat Scratch). Flat
access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
setup (see M0).
To convert between a private or group address space address (termed a segment
address) and a flat address, the base address of the corresponding aperture
can be used. For GFX7-GFX8 these are available in the
HSA AQL Queue the address of which can be obtained with
Queue Ptr SGPR (see Initial Kernel Execution State). For
GFX9-GFX11 the aperture base addresses are directly available as inline
constant registers SRC_SHARED_BASE/LIMIT and SRC_PRIVATE_BASE/LIMIT.
In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
aligned to 2^32 which makes it easier to convert from flat to segment or
segment to flat.
A global address space address has the same value when used as a flat address
so no conversion is needed.
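Because each aperture is 2^32 bytes and its base is aligned to 2^32, the conversions reduce to bit operations on the upper and lower 32 bits. The following is a minimal sketch of that arithmetic for 64-bit address mode only. The function names are hypothetical and any base value passed in is a placeholder, since real bases come from the queue or the SRC_*_BASE registers.

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode, per the text


def segment_to_flat(aperture_base, segment_address):
    """Combine an aligned aperture base with a 32-bit segment address."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def flat_to_segment(flat_address):
    """Keep the low 32 bits; only meaningful if the address is in the aperture."""
    return flat_address & (APERTURE_SIZE - 1)


def in_aperture(aperture_base, flat_address):
    """An address is in the aperture when its upper 32 bits match the base."""
    return (flat_address >> 32) == (aperture_base >> 32)
```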
Global and Constant
The global and constant address spaces both use global virtual addresses,
which are the same virtual address space used by the CPU. However, some
virtual addresses may only be accessible to the CPU, some only accessible
by the GPU, and some by both.
Using the constant address space indicates that the data will not change
during the execution of the kernel. This allows scalar read instructions to
be used. Because the constant address space can only be modified from the host
side, a generic pointer loaded from the constant address space can safely be
assumed to be a global pointer, since only device global memory is visible to
and managed by the host. The vector and scalar L1 caches are invalidated
of volatile data before each kernel dispatch execution to allow constant
memory to change values between kernel dispatches.
Region
The region address space uses the hardware Global Data Store (GDS). All
wavefronts executing on the same device will access the same memory for any
given region address. However, the same region address accessed by wavefronts
executing on different devices will access different memory. It is higher
performance than global memory. It is allocated by the runtime. The data
store (DS) instructions can be used to access it.
Local
The local address space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates the wavefronts of a
work-group, and freed when all the wavefronts of a work-group have
terminated. All wavefronts belonging to the same work-group will access the
same memory for any given local address. However, the same local address
accessed by wavefronts belonging to different work-groups will access
different memory. It is higher performance than global memory. The data store
(DS) instructions can be used to access it.
Private
The private address space uses the hardware scratch memory support which
automatically allocates memory when it creates a wavefront and frees it when
a wavefront terminates. The memory accessed by a lane of a wavefront for any
given private address will be different to the memory accessed by another lane
of the same or different wavefront for the same private address.
If a kernel dispatch uses scratch, then the hardware allocates memory from a
pool of backing memory allocated by the runtime for each wavefront. The lanes
of the wavefront access this using dword (4 byte) interleaving. The mapping
used from private address to backing memory address is:
If each lane of a wavefront accesses the same private address, the
interleaving results in adjacent dwords being accessed and hence requires
fewer cache lines to be fetched.
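The dword interleaving can be sketched as follows. The mapping expression itself is not reproduced in the text above, so the formula in this sketch is an assumption chosen to be consistent with the description (4-byte interleaving across the lanes of a wavefront). The function and parameter names are hypothetical.

```python
def scratch_backing_address(wavefront_scratch_base, private_address,
                            lane_id, wavefront_size):
    """Map a per-lane private address to a backing memory address.

    Assumed layout: dword N of every lane is stored contiguously, lane after
    lane, before dword N + 1 of any lane.
    """
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4  # skip earlier dwords of all lanes
            + lane_id * 4                       # this lane's slot within the row
            + byte_in_dword)                    # byte offset inside the dword
```

With this layout, lanes 0 to 3 reading private address 8 in a 64-lane wavefront touch four consecutive dwords of backing memory.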
There are different ways that the wavefront scratch base address is
determined by a wavefront (see
Initial Kernel Execution State).
Scratch memory can be accessed in an interleaved manner using buffer
instructions with the scratch buffer descriptor and per wavefront scratch
offset, by the scratch instructions, or by flat instructions. Multi-dword
access is not supported except by flat and scratch instructions in
GFX9-GFX11.
On targets without “Globally Accessible Scratch” (introduced in GFX125x), code that
manipulates the stack values in other lanes of a wavefront, such as by
addrspacecast-ing stack pointers to generic ones and taking offsets that reach other
lanes or by explicitly constructing the scratch buffer descriptor, triggers undefined
behavior when it modifies the scratch values of other lanes. The compiler may assume
that such modifications do not occur for such targets.
When using code object V5, LIBOMPTARGET_STACK_SIZE may be used to provide the
private segment size in bytes, for cases where a dynamic stack is used.
Constant 32-bit
TODO
Buffer Fat Pointer
The buffer fat pointer is an experimental address space that is currently
unsupported in the backend. It exposes a non-integral pointer that is in
the future intended to support the modelling of 128-bit buffer descriptors
plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
pointer), allowing normal LLVM load/store/atomic operations to be used to
model the buffer descriptors used heavily in graphics workloads targeting
the backend.
The buffer descriptor used to construct a buffer fat pointer must be raw:
the stride must be 0, the “add tid” flag must be 0, the swizzle enable bits
must be off, and the extent must be measured in bytes. (On subtargets where
bounds checking may be disabled, buffer fat pointers may choose to enable
it or not). The cache swizzle support introduced in gfx942 may be used.
These pointers can be created by addrspacecast from a buffer resource
(ptr addrspace(8)) or by using llvm.amdgcn.make.buffer.rsrc to produce a
ptr addrspace(7) directly, which produces a buffer fat pointer with an initial
offset of 0 and prevents the address space cast from being rewritten away.
The align attribute on operations from buffer fat pointers is deemed to apply
to all components of the pointer - that is, an align 4 load is expected to
both have the offset be a multiple of 4 and to have a base pointer with an
alignment of 4.
This componentwise definition of alignment is needed to allow for promotion of
aligned loads to s_buffer_load, which requires that both the base pointer and
offset be appropriately aligned.
Buffer Resource
The buffer resource pointer, in address space 8, is the newer form
for representing buffer descriptors in AMDGPU IR, replacing their
previous representation as <4xi32>. It is a non-integral pointer
that represents a 128-bit buffer descriptor resource (V#).
Since, in general, a buffer resource supports complex addressing modes that cannot
be easily represented in LLVM (such as implicit swizzled access to structured
buffers), performing address computations such as getelementptr is not
recommended on ptr addrspace(8)s (if such computations are performed, the
offset must be wavefront-uniform.) Note that such a usage of GEP is currently
unimplemented in the backend, as it would require a wrapping 48-bit addition.
Buffer resources may be passed to AMDGPU buffer intrinsics, and they may be
converted to and from i128.
Casting a buffer resource to a buffer fat pointer is permitted and adds an offset
of 0.
Buffer resources can be created from 64-bit pointers (which should be either
generic or global) using the llvm.amdgcn.make.buffer.rsrc intrinsic, which
takes the pointer, which becomes the base of the resource,
the 16-bit stride (and swizzle control) field stored in bits 63:48 of a V#,
the 32-bit NumRecords/extent field (bits 95:64), and the 32-bit flags field
(bits 127:96). The specific interpretation of these fields varies by the
target architecture and is detailed in the ISA descriptions.
On gfx1250, the base pointer is instead truncated to 57 bits and the NumRecords
field is 45 bits, which necessitated a change to make.buffer.rsrc’s arguments
in order to make that field an i64.
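The field layout described above can be sketched as plain bit packing. This models only the pre-gfx1250 layout. Placing the base pointer in bits 47:0 is an assumption inferred from the stride occupying bits 63:48, and the function name is hypothetical. The sketch does not interpret the stride or flags bits, which vary by target.

```python
def pack_buffer_resource(base, stride, num_records, flags):
    """Pack the four make.buffer.rsrc arguments into a 128-bit V# value."""
    if not 0 <= stride < (1 << 16):
        raise ValueError("stride field is 16 bits")
    if not 0 <= num_records < (1 << 32):
        raise ValueError("NumRecords field is 32 bits")
    if not 0 <= flags < (1 << 32):
        raise ValueError("flags field is 32 bits")
    base48 = base & ((1 << 48) - 1)   # assumption: base occupies bits 47:0
    return (base48
            | (stride << 48)          # bits 63:48
            | (num_records << 64)     # bits 95:64
            | (flags << 96))          # bits 127:96
```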
When buffer resources are passed to buffer intrinsics such as
llvm.amdgcn.raw.ptr.buffer.load or
llvm.amdgcn.struct.ptr.buffer.store, the align attribute on the
pointer is assumed to apply to both the offset and the base pointer value.
That is, align 8 means that both the base address within the ptr addrspace(8)
and the offset argument have their three lowest bits set
to 0. If the stride of the resource is nonzero, the stride must be a multiple
of the given alignment.
In other words, the align attribute specifies the alignment of the effective
address being loaded from/stored to and acts as a guarantee that this is
not achieved from adding lower-alignment parts (as hardware may not always
allow for such an addition). For example, if a buffer resource has the base
address 0xfffe and is accessed with a raw.ptr.buffer.load with an offset
of 2, the load must not be marked align 4 (even though the
effective address 0x10000 is so aligned) as this would permit the compiler
to make incorrect transformations (such as promotion to s_buffer_load,
which requires such componentwise alignment).
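The componentwise rule can be checked mechanically. The sketch below is an illustration with a hypothetical function name: it tests whether a given align value may be attached to an access, requiring the base, the offset, and any nonzero stride each to be a multiple of the alignment.

```python
def align_is_permitted(align, base, offset, stride=0):
    """Return True if align may be claimed under the componentwise rule."""
    if align <= 0 or align & (align - 1):
        raise ValueError("alignment must be a positive power of two")
    if base % align != 0 or offset % align != 0:
        return False  # alignment of the sum alone is not sufficient
    if stride != 0 and stride % align != 0:
        return False  # a nonzero stride must also be a multiple of align
    return True
```

For the example in the text, align_is_permitted(4, 0xfffe, 2) is False even though 0xfffe + 2 is a multiple of 4, while an alignment of 2 is permitted.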
Buffer Strided Pointer
The buffer strided pointer is an experimental address space. It represents
a 128-bit buffer descriptor and a 32-bit offset, like the Buffer Fat
Pointer. Additionally, it contains an index into the buffer, which
allows the direct addressing of structured elements. These components appear
in that order, i.e., the descriptor comes first, then the 32-bit offset
followed by the 32-bit index.
The bits in the buffer descriptor must meet the following requirements:
the stride is the size of a structured element, the “add tid” flag must be 0,
and the swizzle enable bits must be off.
These pointers can be created by addrspacecast from a buffer resource
(ptr addrspace(8)) or by using llvm.amdgcn.make.buffer.rsrc to produce a
ptr addrspace(9) directly, which produces a buffer strided pointer whose initial
index and offset values are both 0. This prevents the address space cast from
being rewritten away.
As with buffer fat pointers, alignment of a buffer strided pointer applies to
both the base pointer address and the offset. In addition, the alignment also
constrains the stride of the pointer. That is, if you do an align 4 load from
a buffer strided pointer, this means that the base pointer is align(4), that
the offset is a multiple of 4 bytes, and that the stride is a multiple of 4.
Streamout Registers
Dedicated registers used by the GS NGG Streamout Instructions. The register
file is modelled as a memory in a distinct address space because it is indexed
by an address-like offset in place of named registers, and because register
accesses affect LGKMcnt. This is an internal address space used only by the
compiler. Do not use this address space for IR pointers.
This section provides LLVM memory synchronization scopes supported by the AMDGPU
backend memory model when the target triple OS is amdhsa (see
Memory Model and Target Triples).
The memory model supported is based on the HSA memory model [HSA] which is
based in turn on HRF-indirect with scope inclusion [HRF]. The happens-before
relation is transitive over the synchronizes-with relation independent of scope
and synchronizes-with allows the memory scope instances to be inclusive (see
table AMDHSA LLVM Sync Scopes).
This is different to the OpenCL [OpenCL] memory model which does not have scope
inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.
Synchronizes with, and participates in modification
and seq_cst total orderings with, other operations
(except image operations) for all address spaces
(except private, or generic that accesses private)
provided the other operation’s sync scope is:
system.
agent and executed by a thread on the same
agent.
workgroup and executed by a thread in the
same work-group.
wavefront and executed by a thread in the
same wavefront.
agent
Synchronizes with, and participates in modification
and seq_cst total orderings with, other operations
(except image operations) for all address spaces
(except private, or generic that accesses private)
provided the other operation’s sync scope is:
system or agent and executed by a thread
on the same agent.
workgroup and executed by a thread in the
same work-group.
wavefront and executed by a thread in the
same wavefront.
cluster
Synchronizes with, and participates in modification
and seq_cst total orderings with, other operations
(except image operations) for all address spaces
(except private, or generic that accesses private)
provided the other operation’s sync scope is:
system, agent or cluster and
executed by a thread on the same cluster.
workgroup and executed by a thread in the
same work-group.
wavefront and executed by a thread in the
same wavefront.
On targets that do not support workgroup cluster
launch mode, this behaves like agent scope instead.
workgroup
Synchronizes with, and participates in modification
and seq_cst total orderings with, other operations
(except image operations) for all address spaces
(except private, or generic that accesses private)
provided the other operation’s sync scope is:
system, agent or workgroup and
executed by a thread in the same work-group.
wavefront and executed by a thread in the
same wavefront.
wavefront
Synchronizes with, and participates in modification
and seq_cst total orderings with, other operations
(except image operations) for all address spaces
(except private, or generic that accesses private)
provided the other operation’s sync scope is:
system, agent, workgroup or
wavefront and executed by a thread in the
same wavefront.
singlethread
Only synchronizes with and participates in
modification and seq_cst total orderings with,
other operations (except image operations) running
in the same thread for all address spaces (for
example, in signal handlers).
one-as
Same as system but only synchronizes with other
operations within the same address space.
agent-one-as
Same as agent but only synchronizes with other
operations within the same address space.
cluster-one-as
Same as cluster but only synchronizes with other
operations within the same address space.
workgroup-one-as
Same as workgroup but only synchronizes with
other operations within the same address space.
wavefront-one-as
Same as wavefront but only synchronizes with
other operations within the same address space.
singlethread-one-as
Same as singlethread but only synchronizes with
other operations within the same address space.
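The scope inclusion rules in the table can be summarised as: two operations synchronize when both threads belong to the same instance of the narrower of the two scopes. The sketch below encodes that reading. It is a simplification under stated assumptions: the thread location tuple is a hypothetical representation, the one-as variants, address space exclusions, and image operations are not modelled, and mixing singlethread with wider scopes is an extrapolation the table does not state.

```python
# Scopes from widest to narrowest, using the names from the table above.
SCOPES = ["system", "agent", "cluster", "workgroup", "wavefront", "singlethread"]


def synchronizes(scope_a, thread_a, scope_b, thread_b):
    """Decide whether two operations can synchronize under scope inclusion.

    Each thread is a tuple (agent, cluster, workgroup, wavefront, thread),
    where every id is only unique within the enclosing level.
    """
    # The narrower scope decides: its index is how many leading ids must match.
    depth = max(SCOPES.index(scope_a), SCOPES.index(scope_b))
    return thread_a[:depth] == thread_b[:depth]
```

For instance, an agent scope operation and a workgroup scope operation synchronize only if both threads are in the same work-group.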
Named barriers are fixed function hardware barrier objects that are available
in gfx12.5+ in addition to the traditional default barriers.
In LLVM IR, named barriers are represented by global variables of type
target("amdgcn.named.barrier",0) in the LDS address space. Named barrier
global variables do not occupy actual LDS memory, but their lifetime and
allocation scope matches that of global variables in LDS. Programs in LLVM IR
refer to named barriers using pointers.
The following named barrier types are supported in global variables, defined
recursively:
a single, standalone target("amdgcn.named.barrier",0)
an array of supported types
a struct containing a single element of supported type
Named barriers do not have an underlying byte representation.
It is undefined behavior to use a pointer to any part of a named barrier object
as the pointer operand of a regular memory access instruction or intrinsic.
Pointers to named barrier objects are intended to be used with dedicated
intrinsics. Reading from or writing to such pointers is undefined behavior.
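The recursive definition of supported named barrier types can be expressed as a small predicate. The tuple encoding of types used here is purely hypothetical and exists only to make the recursion concrete; it is not an LLVM API.

```python
# Hypothetical type encoding for illustration:
#   ("barrier",)                 the target("amdgcn.named.barrier", 0) type
#   ("array", element_type, n)   an array of n elements
#   ("struct", [member_types])   a struct with the given members
#   ("i32",) etc.                any other type
def is_supported_barrier_type(ty):
    """Check a type against the recursive definition given above."""
    kind = ty[0]
    if kind == "barrier":
        return True
    if kind == "array":
        return is_supported_barrier_type(ty[1])
    if kind == "struct":
        members = ty[1]
        # Only structs with exactly one member of a supported type qualify.
        return len(members) == 1 and is_supported_barrier_type(members[0])
    return False
```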
Implemented for float and half (and vectors of float or
half). Not implemented for double. Hardware provides
1ULP accuracy for float, and 0.51ULP for half. Float
instruction does not natively support denormal
inputs.
Implemented for float and half (and vectors of float or
half). Not implemented for double. Hardware provides
1ULP accuracy for float, and 0.51ULP for half. Float
instruction does not natively support denormal
inputs.
The natural floating-point mode type is i32. This
is implemented by extracting relevant bits out of the MODE
register with s_getreg_b32. The first 10 bits are the
core floating-point mode. Bits 12:18 are the exception
mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
relevant to floating-point instructions are 0s.
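The bit selection described above amounts to masking the MODE register value. The sketch below shows only that masking step on an already-read 32-bit value. The function name and the is_gfx9_plus flag are hypothetical, and the s_getreg_b32 access itself is not modelled.

```python
CORE_FP_MODE_MASK = (1 << 10) - 1          # bits 0..9: core floating-point mode
EXCEPTION_MASK = ((1 << 7) - 1) << 12      # bits 12..18: exception mask
FP16_OVFL_BIT = 1 << 23                    # bit 23, gfx9+ only per the text


def fp_mode_from_mode_register(mode, is_gfx9_plus):
    """Keep only the floating-point related bits; all other bits read as 0."""
    mask = CORE_FP_MODE_MASK | EXCEPTION_MASK
    if is_gfx9_plus:
        mask |= FP16_OVFL_BIT
    return mode & mask
```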
AMDGPU supports two separately controllable rounding
modes depending on the floating-point type. One
controls float, and the other controls both double and
half operations. If both modes are the same, returns
one of the standard return values. If the modes are
different, returns one of 12 extended values
describing the two modes.
To nearest, ties away from zero is not a supported
mode. The raw rounding mode values in the MODE
register do not exactly match the FLT_ROUNDS values,
so a conversion is performed.
Input value expected to be one of the valid results
from ‘llvm.get.rounding’. Rounding mode is
undefined if not passed a valid input. This should be
a wave uniform value. In case of a divergent input
value, the first active lane’s value will be used.
Returns the current value of the AMDGPU floating point environment.
This stores information related to the current rounding mode,
denormalization mode, enabled traps, and floating point exceptions.
The format is a 64-bit concatenation of the MODE and TRAPSTS registers.
Sets the floating point environment to the specified state.
llvm.amdgcn.load.to.lds.p<1/7>
Loads values from global memory (either in the form of a global pointer or
a raw fat buffer pointer) to LDS. The size of the data copied can be 1, 2,
or 4 bytes (and gfx950 also allows 12 or 16 bytes). The LDS pointer
argument should be wavefront-uniform; the global pointer need not be.
The LDS pointer is implicitly offset by 4 * lane_id bytes for size <= 4 bytes
and 16 * lane_id bytes for larger sizes. This lowers to global_load_lds,
buffer_load_* … lds, or global_load__* … lds depending on address
space and architecture. amdgcn.global.load.lds has the same semantics as
amdgcn.load.to.lds.p1.
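The implicit per-lane offset of the LDS destination can be sketched as follows. The function name is hypothetical, and the sketch only computes the address each lane writes to; it does not check whether the target (for example gfx950 for the 12 and 16 byte sizes) supports the requested size.

```python
def lds_destination(lds_base, lane_id, size):
    """Return the LDS byte address a lane writes for a given copy size."""
    if size in (1, 2, 4):
        lane_stride = 4       # sizes up to 4 bytes: 4 * lane_id
    elif size in (12, 16):
        lane_stride = 16      # larger sizes: 16 * lane_id
    else:
        raise ValueError("size must be 1, 2, 4, 12, or 16 bytes")
    return lds_base + lane_stride * lane_id
```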
llvm.amdgcn.load.async.to.lds.p<1/7>
Same as llvm.amdgcn.load.to.lds.p<1/7>, but the completion of this
asynchronous version is not automatically tracked
by the compiler. The user must explicitly track the completion with asyncmark
operations before using their side-effects.
llvm.amdgcn.readfirstlane
Provides direct access to v_readfirstlane_b32. Returns the value in
the lowest active lane of the input operand. Currently implemented
for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>,
i64, double, pointers, multiples of the 32-bit vectors.
llvm.amdgcn.readlane
Provides direct access to v_readlane_b32. Returns the value in the
specified lane of the first input operand. The second operand specifies
the lane to read from. Currently implemented for i16, i32, float, half,
bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers,
multiples of the 32-bit vectors.
llvm.amdgcn.writelane
Provides direct access to v_writelane_b32. Writes value in the first input
operand to the specified lane of divergent output. The second operand
specifies the lane to write. Currently implemented for i16, i32, float,
half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers,
multiples of the 32-bit vectors.
llvm.amdgcn.wave.reduce.umin
Performs an arithmetic unsigned min reduction on the unsigned values
provided by each lane in the wavefront.
Intrinsic takes a hint for reduction strategy using second operand
0: Target default preference,
1: Iterative strategy, and
2: DPP.
If the target does not support the DPP operations (e.g. gfx6/7),
reduction will be performed using default iterative strategy.
Intrinsic is implemented for i32 and i64 types.
llvm.amdgcn.wave.reduce.min
Similar to llvm.amdgcn.wave.reduce.umin, but performs a signed min
reduction on signed integers.
Intrinsic is implemented for i32 and i64 types.
llvm.amdgcn.wave.reduce.fmin
Similar to llvm.amdgcn.wave.reduce.umin, but performs a floating point min
reduction on floating point values.
Intrinsic is implemented for float and double types.
Intrinsic is modelled similar to llvm.minnum intrinsic.
For a reduction between two NAN values, a NAN is returned.
For a reduction between a NAN and a number, the number is returned.
-0.0 < +0.0 is true for this reduction.
The ordering behaviour of SNANs is non-deterministic.
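The NaN and signed-zero rules listed for the floating-point min reduction can be modelled on the host as below. This is a reference sketch of the stated semantics only, with hypothetical function names. It treats every NaN as quiet (signaling NaN ordering is not modelled) and ignores the reduction strategy hint.

```python
import math
from functools import reduce


def _fmin(a, b):
    """Pairwise min: a number beats NaN, and -0.0 orders below +0.0."""
    if math.isnan(a):
        return b              # also returns NaN when both inputs are NaN
    if math.isnan(b):
        return a
    if a == 0.0 and b == 0.0:
        # Both are zeros; prefer the negative one if present.
        return a if math.copysign(1.0, a) < 0 else b
    return min(a, b)


def wave_reduce_fmin(lane_values):
    """Reduce the per-lane values of one wavefront with the rules above."""
    return reduce(_fmin, lane_values)
```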
llvm.amdgcn.wave.reduce.umax
Performs an arithmetic unsigned max reduction on the unsigned values
provided by each lane in the wavefront.
Intrinsic takes a hint for reduction strategy using second operand
0: Target default preference,
1: Iterative strategy, and
2: DPP.
If the target does not support the DPP operations (e.g. gfx6/7),
reduction will be performed using default iterative strategy.
Intrinsic is implemented for i32 and i64 types.
llvm.amdgcn.wave.reduce.max
Similar to llvm.amdgcn.wave.reduce.umax, but performs a signed max
reduction on signed integers.
Intrinsic is implemented for i32 and i64 types.
llvm.amdgcn.wave.reduce.fmax
Similar to llvm.amdgcn.wave.reduce.umax, but performs a floating point max
reduction on floating point values.
Intrinsic is implemented for float and double types.
Intrinsic is modelled similar to llvm.maxnum intrinsic.
For a reduction between two NAN values, a NAN is returned.
For a reduction between a NAN and a number, the number is returned.
-0.0 < +0.0 is true for this reduction.
The ordering behaviour of SNANs is non-deterministic.
llvm.amdgcn.wave.reduce.add
Performs an arithmetic add reduction on the signed/unsigned values
provided by each lane in the wavefront.
Intrinsic takes a hint for reduction strategy using second operand
0: Target default preference,
1: Iterative strategy, and
2: DPP.
If the target does not support the DPP operations (e.g. gfx6/7),
reduction will be performed using default iterative strategy.
Intrinsic is implemented for signed/unsigned i32 and i64 types.
llvm.amdgcn.wave.reduce.fadd
Similar to llvm.amdgcn.wave.reduce.add, but performs a floating point add
reduction on floating point values.
Intrinsic is implemented for float and double types.
llvm.amdgcn.wave.reduce.sub
Performs an arithmetic sub reduction on the signed/unsigned values
provided by each lane in the wavefront.
Intrinsic takes a hint for reduction strategy using second operand
0: Target default preference,
1: Iterative strategy, and
2: DPP.
If the target does not support the DPP operations (e.g. gfx6/7),
reduction will be performed using default iterative strategy.
Intrinsic is implemented for signed/unsigned i32 and i64 types.
llvm.amdgcn.wave.reduce.fsub
Similar to llvm.amdgcn.wave.reduce.sub, but performs a floating point sub
reduction on floating point values.
Intrinsic is implemented for float and double types.
llvm.amdgcn.wave.reduce.and
Performs a bitwise-and reduction on the values
provided by each lane in the wavefront.
Intrinsic takes a hint for reduction strategy using second operand
0: Target default preference,
1: Iterative strategy, and
2: DPP.
If the target does not support the DPP operations (e.g. gfx6/7),
reduction will be performed using default iterative strategy.
Intrinsic is implemented for i32 and i64 types.
llvm.amdgcn.wave.reduce.or
Similar to llvm.amdgcn.wave.reduce.and, but performs a bitwise-or
reduction on the values provided by each lane in the wavefront.
Intrinsic is implemented for i32 and i64 types.
llvm.amdgcn.wave.reduce.xor
Similar to llvm.amdgcn.wave.reduce.and, but performs a bitwise-xor
reduction on the values provided by each lane in the wavefront.
Intrinsic is implemented for i32 and i64 types.
llvm.amdgcn.permlane16
Provides direct access to v_permlane16_b32. Performs arbitrary gather-style
operation within a row (16 contiguous lanes) of the second input operand.
The third and fourth inputs must be scalar values. These are combined into
a single 64-bit value representing lane selects used to swizzle within each
row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>,
<2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.
llvm.amdgcn.permlanex16
Provides direct access to v_permlanex16_b32. Performs arbitrary gather-style
operation across two rows of the second input operand (each row is 16 contiguous
lanes). The third and fourth inputs must be scalar values. These are combined
into a single 64-bit value representing lane selects used to swizzle within each
row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>,
<2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.
llvm.amdgcn.permlane64
Provides direct access to v_permlane64_b32. Performs a specific permutation across
lanes of the input operand where the high half and low half of a wave64 are swapped.
Performs no operation in wave32 mode. Currently implemented for i16, i32, float, half,
bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the
32-bit vectors.
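The permutation performed by llvm.amdgcn.permlane64 can be modelled on a list of per-lane values. The sketch below is a host-side illustration with a hypothetical function name: in wave64 the two 32-lane halves trade places, and in wave32 the values are returned unchanged.

```python
def permlane64(lane_values):
    """Swap the low and high halves of a wave64; no-op for a wave32."""
    values = list(lane_values)
    if len(values) == 32:
        return values
    if len(values) != 64:
        raise ValueError("expected 32 or 64 lane values")
    return values[32:] + values[:32]
```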
llvm.amdgcn.udot2
Provides direct access to v_dot2_u32_u16 across targets which
support such instructions. This performs an unsigned dot product
with two v2i16 operands, summed with the third i32 operand. The
i1 fourth operand is used to clamp the output.
llvm.amdgcn.udot4
Provides direct access to v_dot4_u32_u8 across targets which
support such instructions. This performs an unsigned dot product
with two i32 operands (holding a vector of 4 8bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp
the output.
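The arithmetic of llvm.amdgcn.udot4 can be sketched on the host as below. Two points are assumptions rather than statements from the text: that clamping saturates the result to the largest unsigned 32-bit value, and that the unclamped result wraps modulo 2^32. The function name mirrors the intrinsic only for readability.

```python
U32_MAX = 0xFFFFFFFF


def udot4(a, b, c, clamp):
    """Unsigned dot product of four packed 8-bit values, plus accumulator c."""
    total = c
    for i in range(4):
        a_byte = (a >> (8 * i)) & 0xFF
        b_byte = (b >> (8 * i)) & 0xFF
        total += a_byte * b_byte
    if clamp:
        return min(total, U32_MAX)   # assumption: saturate on overflow
    return total & U32_MAX           # assumption: wrap on overflow
```

For example, udot4(0x01020304, 0x01010101, 10, False) sums 4 + 3 + 2 + 1 with the accumulator 10.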
llvm.amdgcn.udot8
Provides direct access to v_dot8_u32_u4 across targets which
support such instructions. This performs an unsigned dot product
with two i32 operands (holding a vector of eight 4-bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp
the output.
llvm.amdgcn.sdot2
Provides direct access to v_dot2_i32_i16 across targets which
support such instructions. This performs a signed dot product
with two v2i16 operands, summed with the third i32 operand. The
i1 fourth operand is used to clamp the output.
When applicable (e.g. no clamping), this is lowered into
v_dot2c_i32_i16 for targets which support it.
llvm.amdgcn.sdot4
Provides direct access to v_dot4_i32_i8 across targets which
support such instructions. This performs a signed dot product
with two i32 operands (holding a vector of four 8-bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp
the output.
When applicable (i.e. no clamping / operand modifiers), this is lowered
into v_dot4c_i32_i8 for targets which support it.
RDNA3 does not offer v_dot4_i32_i8, and rather offers
v_dot4_i32_iu8 which has operands to hold the signedness of the
vector operands. Thus, this intrinsic lowers to the signed version
of this instruction for gfx11 targets.
llvm.amdgcn.sdot8
Provides direct access to v_dot8_i32_i4 across targets which
support such instructions. This performs a signed dot product
with two i32 operands (holding a vector of eight 4-bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp
the output.
When applicable (i.e. no clamping / operand modifiers), this is lowered
into v_dot8c_i32_i4 for targets which support it.
RDNA3 does not offer v_dot8_i32_i4, and rather offers
v_dot8_i32_iu4 which has operands to hold the signedness of the
vector operands. Thus, this intrinsic lowers to the signed version
of this instruction for gfx11 targets.
llvm.amdgcn.sudot4
Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This performs a
dot product with two i32 operands (holding a vector of four 8-bit values), summed
with the fifth i32 operand. The i1 sixth operand is used to clamp
the output. The i1s preceding the vector operands decide the signedness.
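The operand order described above (signedness flag, vector, signedness flag, vector, accumulator, clamp) can be sketched as follows. The names are hypothetical; treating the accumulator as a signed 32-bit value and clamping to the signed 32-bit range are assumptions made for the sketch.

```python
I32_MIN, I32_MAX = -(1 << 31), (1 << 31) - 1

def unpack_8x4(word, signed):
    """Split a 32-bit word into four 8-bit elements, optionally sign-extended."""
    lanes = [(word >> (8 * i)) & 0xFF for i in range(4)]
    return [v - 256 if (signed and v >= 128) else v for v in lanes]

def sudot4_model(a_signed, a, b_signed, b, c, clamp):
    """Mixed-signedness dot product; c is a Python int in i32 range."""
    products = zip(unpack_8x4(a, a_signed), unpack_8x4(b, b_signed))
    total = c + sum(x * y for x, y in products)
    if clamp:
        return max(I32_MIN, min(I32_MAX, total))   # assumed: signed saturation
    return (total - I32_MIN) % (1 << 32) + I32_MIN  # assumed: wrap to i32

# The top byte 0xFF is -1 when the first operand is signed, 255 otherwise.
as_signed = sudot4_model(True, 0xFF010203, False, 0x02020202, 0, False)
as_unsigned = sudot4_model(False, 0xFF010203, False, 0x02020202, 0, False)
```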
llvm.amdgcn.sudot8
Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This performs a
dot product with two i32 operands (holding a vector of eight 4-bit values), summed
with the fifth i32 operand. The i1 sixth operand is used to clamp
the output. The i1s preceding the vector operands decide the signedness.
llvm.amdgcn.sched.barrier
Controls the types of instructions that may be allowed to cross the intrinsic
during instruction scheduling. The parameter is a mask for the instruction types
that can cross the intrinsic.
0x0000: No instructions may be scheduled across sched_barrier.
0x0001: All non-memory, non-side-effect-producing instructions may be
scheduled across sched_barrier, i.e. allow ALU instructions to pass.
0x0002: VALU instructions may be scheduled across sched_barrier.
0x0004: SALU instructions may be scheduled across sched_barrier.
0x0008: MFMA/WMMA instructions may be scheduled across sched_barrier.
0x0010: All VMEM instructions may be scheduled across sched_barrier.
0x0020: VMEM read instructions may be scheduled across sched_barrier.
0x0040: VMEM write instructions may be scheduled across sched_barrier.
0x0080: All DS instructions may be scheduled across sched_barrier.
0x0100: All DS read instructions may be scheduled across sched_barrier.
0x0200: All DS write instructions may be scheduled across sched_barrier.
0x0400: All Transcendental (e.g. V_EXP) instructions may be scheduled across sched_barrier.
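Because the parameter is a bit mask, several instruction classes can be allowed at once by OR-ing the values listed above. A minimal Python sketch of the encoding follows; the enum and its member names are hypothetical, and only the numeric values come from the list.

```python
from enum import IntFlag

class SchedBarrierMask(IntFlag):
    NONE = 0x0000        # nothing may cross
    ALU = 0x0001         # all non-memory, non-side-effect instructions
    VALU = 0x0002
    SALU = 0x0004
    MFMA_WMMA = 0x0008
    VMEM = 0x0010
    VMEM_READ = 0x0020
    VMEM_WRITE = 0x0040
    DS = 0x0080
    DS_READ = 0x0100
    DS_WRITE = 0x0200
    TRANS = 0x0400

# Allow only VALU instructions and DS reads to cross the barrier.
mask = SchedBarrierMask.VALU | SchedBarrierMask.DS_READ
mask_value = int(mask)
```

The resulting integer (0x0102 here) is the value that would be passed as the intrinsic's mask argument.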
llvm.amdgcn.sched.group.barrier
Creates schedule groups with specific properties to create custom scheduling
pipelines. The ordering between groups is enforced by the instruction scheduler.
The intrinsic applies to the code that precedes the intrinsic. The intrinsic
takes three values that control the behavior of the schedule groups.
Mask : Classify instruction groups using the llvm.amdgcn.sched.barrier mask values.
Size : The number of instructions that are in the group.
SyncID : Order is enforced between groups with matching values.
The mask can include multiple instruction types. It is undefined behavior to set
values beyond the range of valid masks.
Combining multiple sched_group_barrier intrinsics enables an ordering of specific
instruction types during instruction scheduling. For example, the following enforces
a sequence of 1 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA
instructions.
// 1 VMEM read
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 1 VALU
__builtin_amdgcn_sched_group_barrier(2, 1, 0)
// 5 MFMA
__builtin_amdgcn_sched_group_barrier(8, 5, 0)
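The decimal masks in the example are the sched_barrier values 0x0020 (VMEM read), 0x0002 (VALU) and 0x0008 (MFMA/WMMA). When such pipelines are generated by a script, a tiny helper like the hypothetical one below can render the calls from named masks. It is a sketch of the bookkeeping only and does not interact with the compiler.

```python
VMEM_READ, VALU, MFMA = 0x0020, 0x0002, 0x0008

def sched_group_barrier_calls(groups, sync_id=0):
    """Render one builtin call per (mask, size) pair, in pipeline order."""
    return [
        f"__builtin_amdgcn_sched_group_barrier({mask}, {size}, {sync_id});"
        for mask, size in groups
    ]

# 1 VMEM read, then 1 VALU, then 5 MFMA, all in sync group 0.
calls = sched_group_barrier_calls([(VMEM_READ, 1), (VALU, 1), (MFMA, 5)])
```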
llvm.amdgcn.iglp.opt
An experimental intrinsic for instruction group level parallelism. The intrinsic
implements predefined instruction scheduling orderings. The intrinsic applies to the
surrounding scheduling region. The intrinsic takes a value that specifies the
strategy. The compiler implements four strategies.
Interleave DS and MFMA instructions for small GEMM kernels.
Interleave DS and MFMA instructions for single wave small GEMM kernels.
Interleave TRANS and MFMA instructions, as well as their VALU and DS predecessors, for attention kernels.
Interleave TRANS and MFMA instructions, with no predecessor interleaving, for attention kernels.
Only one iglp_opt intrinsic may be used in a scheduling region. The iglp_opt intrinsic
cannot be combined with sched_barrier or sched_group_barrier.
The iglp_opt strategy implementations are subject to change.
llvm.amdgcn.s.getpc
Provides access to the s_getpc_b64 instruction, but with the return value
sign-extended from the width of the underlying PC hardware register even on
processors where the s_getpc_b64 instruction returns a zero-extended value.
llvm.amdgcn.ballot
Returns a bitfield (i32 or i64) containing the result of its i1 argument
in all active lanes, and zero in all inactive lanes.
Provides a way to convert an i1 in LLVM IR to an i32 or i64 lane mask, the bitfield
used by hardware to control active lanes when used in the EXEC register.
For example, ballot(i1 true) returns the EXEC mask.
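The ballot semantics can be modeled in a few lines of Python, with the per-lane i1 values given as a list and EXEC as an integer bit mask. The function name is hypothetical and the model is only meant to illustrate the description above.

```python
def ballot_model(predicates, exec_mask):
    """Set bit i when lane i is active in exec_mask and its predicate is true."""
    result = 0
    for lane, pred in enumerate(predicates):
        active = (exec_mask >> lane) & 1
        if active and pred:
            result |= 1 << lane
    return result

# ballot(true) reproduces the EXEC mask.
all_true = ballot_model([True] * 32, 0x0000FFFF)
# Even lanes vote true, but only lanes 0-3 are active.
even_lanes = ballot_model([lane % 2 == 0 for lane in range(32)], 0xF)
```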
llvm.amdgcn.mfma.scale.f32.16x16x128.f8f6f4
Emit v_mfma_scale_f32_16x16x128_f8f6f4 to set the scale factor. The
last 4 operands correspond to the scale inputs.
2-bit byte index to use for each lane for matrix A
Matrix A scale values
2-bit byte index to use for each lane for matrix B
Matrix B scale values
llvm.amdgcn.mfma.scale.f32.32x32x64.f8f6f4
Emit v_mfma_scale_f32_32x32x64_f8f6f4
llvm.amdgcn.permlane16.swap
Provides direct access to the v_permlane16_swap_b32 instruction on supported targets.
Swaps the values across lanes of the first two operands. Odd rows of the first operand are
swapped with even rows of the second operand (one row is 16 lanes).
Returns a pair for the swapped registers. The first element of the return corresponds
to the swapped element of the first argument.
llvm.amdgcn.permlane32.swap
Provides direct access to the v_permlane32_swap_b32 instruction on supported targets.
Swaps the values across lanes of the first two operands. Rows 2 and 3 of the first operand are
swapped with rows 0 and 1 of the second operand (one row is 16 lanes).
Returns a pair for the swapped registers. The first element of the return
corresponds to the swapped element of the first argument.
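The two swap intrinsics can be pictured with the Python sketch below, which models a wave64 as two lists of 64 lane values and returns the pair of swapped registers. The helper names are hypothetical, and pairing odd row r of the first operand with even row r-1 of the second operand is an assumption about how "odd rows" and "even rows" correspond.

```python
ROW = 16  # lanes per row

def _swap_rows(a, b, pairs):
    """Swap row row_a of a with row row_b of b for every pair."""
    a, b = list(a), list(b)
    for row_a, row_b in pairs:
        sa = slice(row_a * ROW, (row_a + 1) * ROW)
        sb = slice(row_b * ROW, (row_b + 1) * ROW)
        a[sa], b[sb] = b[sb], a[sa]
    return a, b

def permlane16_swap_model(a, b):
    # Assumed pairing: odd row r of a <-> even row r - 1 of b.
    rows = len(a) // ROW
    return _swap_rows(a, b, [(r, r - 1) for r in range(1, rows, 2)])

def permlane32_swap_model(a, b):
    # Rows 2 and 3 of a <-> rows 0 and 1 of b (needs 64 lanes).
    return _swap_rows(a, b, [(2, 0), (3, 1)])

first = list(range(64))
second = [100 + lane for lane in range(64)]
a16, b16 = permlane16_swap_model(first, second)
a32, b32 = permlane32_swap_model(first, second)
```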
llvm.amdgcn.mov.dpp
The llvm.amdgcn.mov.dpp.<type> intrinsic represents the mov.dpp operation in AMDGPU.
This operation is being deprecated and can be replaced with llvm.amdgcn.update.dpp.
llvm.amdgcn.update.dpp
The llvm.amdgcn.update.dpp.<type> intrinsic represents the update.dpp operation in AMDGPU.
It takes an old value, a source operand, a DPP control operand, a row mask, a bank mask, and a bound control.
Various data types are supported, including bf16, f16, f32, f64, i16, i32, i64, p0, p3, p5, v2f16, v2f32, v2i16, v2i32, v2p0, v3i32, v4i32, and v8f16.
This operation is equivalent to a sequence of v_mov_b32 operations.
It is preferred over llvm.amdgcn.mov.dpp.<type> for future use.
llvm.amdgcn.update.dpp.<type> <old> <src> <dpp_ctrl> <row_mask> <bank_mask> <bound_ctrl>
Should be equivalent to:
llvm.prefetch
Implemented on gfx1250, ignored on earlier targets.
First argument is flat, global, or constant address space pointer.
Any other address space is not supported.
On gfx125x generates flat_prefetch_b8 or global_prefetch_b8 and brings data to GL2.
Second argument is rw and currently ignored. Can be 0 or 1.
Third argument is locality, 0-3. Translates to memory scope:
0 - SCOPE_SYS
1 - SCOPE_DEV
2 - SCOPE_SE
3 - SCOPE_SE
Note that SCOPE_CU is not generated and not safe on an invalid address.
Fourth argument is cache type:
0 - Instruction cache, currently ignored and no code is generated.
1 - Data cache.
Instruction cache prefetches are unsafe on invalid address.
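The argument handling described above amounts to a small lookup, sketched here in Python. The function name and the use of None for "no code generated" are hypothetical conventions for the sketch; the rw argument is omitted because it is currently ignored.

```python
LOCALITY_TO_SCOPE = {
    0: "SCOPE_SYS",
    1: "SCOPE_DEV",
    2: "SCOPE_SE",
    3: "SCOPE_SE",
}

def prefetch_scope(locality, cache_type):
    """Return the memory scope name, or None when nothing is emitted."""
    if cache_type == 0:
        # Instruction cache prefetch: currently ignored, no code generated.
        return None
    if cache_type != 1:
        raise ValueError("cache type must be 0 (instruction) or 1 (data)")
    return LOCALITY_TO_SCOPE[locality]

scope = prefetch_scope(locality=1, cache_type=1)
```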
llvm.amdgcn.s.barrier
Performs a barrier arrive operation immediately followed
by a barrier wait operation on the workgroup barrier object.
See Execution Barriers.
llvm.amdgcn.s.barrier.init
Performs a barrier init operation on the barrier object determined by the first operand.
See Execution Barriers.
Available starting GFX12.5.
llvm.amdgcn.s.barrier.signal
Performs a barrier arrive operation on the barrier object determined by the i32 immediate argument.
See Execution Barriers.
Available starting GFX12.
llvm.amdgcn.s.barrier.signal.var
Performs a barrier arrive operation on the barrier object determined by the first argument.
The second argument is an i32 immediate expected count. The expected count of the barrier object
is only set when the argument is not zero, and when the barrier object is a named barrier object.
See Execution Barriers.
Available starting GFX12.
llvm.amdgcn.s.barrier.signal.isfirst
Performs a barrier arrive operation on the barrier object determined by the i32 immediate argument.
Returns i1 1 if the wave is running in a workgroup, and it was the first wave to arrive at the barrier
object, otherwise returns i1 0.
See Execution Barriers.
Available starting GFX12.
llvm.amdgcn.s.barrier.wait
Performs a barrier wait operation on the barrier object determined by the i16 immediate argument.
If waiting on a named barrier object, this instruction always waits on the last named barrier object
that the thread has joined, even if it is different from the argument.
See Execution Barriers.
Available starting GFX12.
llvm.amdgcn.s.barrier.join
Performs a barrier join operation on the barrier object determined by the first operand.
See Execution Barriers.
Available starting GFX12.5.
llvm.amdgcn.s.barrier.leave
Performs a barrier drop operation.
See Execution Barriers.
Available starting GFX12.5.
llvm.amdgcn.flat.load.monitor
Available on GFX12.5 only.
Corresponds to flat_load_monitor_b32/64/128 (.b32/64/128 suffixes)
instructions.
For the purposes of the memory model, this is an atomic load operation in
the generic (flat) address space.
Synchronization Scope.
Note that the scope used must ensure that the L2 cache will be hit.
llvm.amdgcn.global.load.monitor
Available on GFX12.5 only.
Corresponds to global_load_monitor_b32/64/128 (.b32/64/128 suffixes)
instructions.
For the purposes of the memory model, this is an atomic load operation in
the global address space.
The llvm.amdgcn.cooperative.atomic family of intrinsics
provides atomic load and store operations to naturally aligned contiguous memory regions.
Memory is accessed cooperatively by a collection of convergent threads, with each thread accessing
a fraction of the contiguous memory region.
This intrinsic has a memory ordering and may be used to synchronize-with another cooperative atomic.
If the memory ordering is relaxed, it may pair with a fence if that same fence is executed by
all participating threads with the same synchronization scope and set of address spaces.
In both cases, a synchronize-with relation can only be established between cooperative atomics with the
same total access size.
Each target may have additional restrictions on how the intrinsic may be used; see
the table below.
Targets not covered in the table do not support these intrinsics.
If the intrinsic is used without meeting all of the above conditions, or the target-specific conditions,
then this intrinsic causes undefined behavior.
Intrinsic operands in this format are always i32 integer constants whose value is
determined by the C ABI encoding of atomic memory orderings. The supported values are in
the table below.
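Table 35 AMDGPU Intrinsics C ABI Atomic Memory Ordering Values

====== ====================== ===================================
Value  Atomic Memory Ordering Notes
====== ====================== ===================================
i32 0  relaxed                The default for unsupported values.
i32 2  acquire                Only for loads.
i32 3  release                Only for stores.
i32 5  seq_cst
====== ====================== ===================================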
Example:
; "i32 5" is the atomic ordering operand
%0 = tail call i32 @llvm.amdgcn.cooperative.atomic.load.32x4B.p0(ptr %addr, i32 5, metadata !0)
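The ordering table can be read as the following decoding rule, sketched in Python. The function name is hypothetical, and treating an acquire on a store or a release on a load as an "unsupported value" that falls back to relaxed is an interpretation of the table's notes, not something it states outright.

```python
ORDERINGS = {0: "relaxed", 2: "acquire", 3: "release", 5: "seq_cst"}

def decode_ordering(value, is_load):
    """Map a C ABI ordering constant to the ordering the intrinsic uses."""
    name = ORDERINGS.get(value, "relaxed")  # unsupported values: relaxed
    if name == "acquire" and not is_load:
        return "relaxed"  # assumed: acquire is only meaningful for loads
    if name == "release" and is_load:
        return "relaxed"  # assumed: release is only meaningful for stores
    return name

load_order = decode_ordering(5, is_load=True)
```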
Intrinsic operands in this format are metadata strings which must be one of the supported
memory scopes.
The metadata node must be made of a single MDString at the top level.
Asserts a memory operation does not access bytes in host memory, or
remote connected peer device memory (the address must be device
local). This is intended for use with atomicrmw
and other atomic instructions. This is required to emit a native
hardware instruction for some system scope atomic operations on some subtargets. For most
integer atomic operations, this is a sufficient restriction to emit a
native atomic instruction.
An atomicrmw without metadata will be treated
conservatively as required to preserve the operation behavior in all
cases. This will typically be used in conjunction with
!amdgpu.no.fine.grained.memory.
; Indicates the atomic does not access fine-grained memory, or
; remote device memory.
%old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory !0

; Indicates the atomic does not access peer device memory.
%old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.remote.memory !0

!0 = !{}
Asserts a memory access does not access bytes allocated in
fine-grained allocated memory. This is intended for use with
atomicrmw and other atomic instructions. This is
required to emit a native hardware instruction for some system
scope atomic operations on some subtargets. An
atomicrmw without metadata will be treated
conservatively as required to preserve the operation behavior in all
cases. This will typically be used in conjunction with
!amdgpu.no.remote.memory.access.
; Indicates the access does not access fine-grained memory, or
; remote device memory.
%old0 = atomicrmw sub ptr %ptr0, i32 1 acquire, !amdgpu.no.fine.grained.memory !0, !amdgpu.no.remote.memory.access !0

; Indicates the access does not access fine-grained memory
%old2 = atomicrmw sub ptr %ptr2, i32 1 acquire, !amdgpu.no.fine.grained.memory !0

!0 = !{}
For use with atomicrmw floating-point
operations. Indicates the handling of denormal inputs and results is
insignificant and may be inconsistent with the expected floating-point
mode. This is necessary to emit a native atomic instruction on some
targets for some address spaces where float denormals are
unconditionally flushed. This is typically used in conjunction with
!amdgpu.no.remote.memory.access and !amdgpu.no.fine.grained.memory.
Specify the minimum and maximum flat work group sizes that
will be specified when the kernel is dispatched. Generated
by the amdgpu_flat_work_group_size CLANG attribute [CLANG-ATTR].
The IR implied default value is 1,1024. Clang may emit this attribute
with more restrictive bounds depending on language defaults.
If the actual block or workgroup size exceeds the limit at any point during
the execution, the behavior is undefined. For example, even if there is
only one active thread but the thread local id exceeds the limit, the
behavior is undefined.
Specifies the number of SGPRs to use. Generated by
the amdgpu_num_sgpr CLANG attribute [CLANG-ATTR].
“amdgpu-num-vgpr”=”n”
Specifies the number of VGPRs to use. Generated by the
amdgpu_num_vgpr CLANG attribute [CLANG-ATTR].
“amdgpu-waves-per-eu”=”m,n”
Specify the minimum and maximum number of waves per
execution unit. Generated by the amdgpu_waves_per_eu
CLANG attribute [CLANG-ATTR]. This is an optimization hint,
and the backend may not be able to satisfy the request. If
the specified range is incompatible with the function’s
“amdgpu-flat-work-group-size” value, the implied occupancy
bounds by the workgroup size takes precedence.
“amdgpu-ieee” true/false.
GFX6-GFX11 (Except GFX11.7) Only
Specify whether the function expects the IEEE field of the
mode register to be set on entry. Overrides the default for
the calling convention.
“amdgpu-dx10-clamp” true/false.
GFX6-GFX11 (Except GFX11.7) Only
Specify whether the function expects the DX10_CLAMP field of
the mode register to be set on entry. Overrides the default
for the calling convention.
“amdgpu-no-workitem-id-x”
Indicates the function does not depend on the value of the
llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
attribute, or reached through a call site marked with this attribute,
and that intrinsic is called, the behavior of the program is undefined.
(Whole-program undefined behavior is used here because, for example,
the absence of a required workitem ID in the preloaded register set can
mean that all other preloaded registers are earlier than the compilation
assumed they would be.) The backend can generally infer this during code
generation, so typically there is no benefit to frontends marking
functions with this.
“amdgpu-no-workitem-id-y”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.workitem.id.y intrinsic.
“amdgpu-no-workitem-id-z”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.workitem.id.z intrinsic.
“amdgpu-no-workgroup-id-x”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.workgroup.id.x intrinsic.
“amdgpu-no-workgroup-id-y”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.workgroup.id.y intrinsic.
“amdgpu-no-workgroup-id-z”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.workgroup.id.z intrinsic.
“amdgpu-no-cluster-id-x”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.cluster.id.x intrinsic.
“amdgpu-no-cluster-id-y”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.cluster.id.y intrinsic.
“amdgpu-no-cluster-id-z”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.cluster.id.z intrinsic.
“amdgpu-no-dispatch-ptr”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.dispatch.ptr intrinsic.
“amdgpu-no-implicitarg-ptr”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.implicitarg.ptr intrinsic.
“amdgpu-no-dispatch-id”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.dispatch.id intrinsic.
“amdgpu-no-wwm”
The same as amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.strict.wwm intrinsic.
“amdgpu-no-queue-ptr”
Similar to amdgpu-no-workitem-id-x, except for the
llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
attributes, the queue pointer may be required in situations where the
intrinsic call does not directly appear in the program. Some subtargets
require the queue pointer to handle some addrspacecasts, as well
as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
llvm.debugtrap intrinsics.
“amdgpu-no-hostcall-ptr”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
kernel argument that holds the pointer to the hostcall buffer. If this
attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-no-heap-ptr”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
kernel argument that holds the pointer to an initialized memory buffer
that conforms to the requirements of the malloc/free device library V1
version implementation. If this attribute is absent, then the
amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-no-multigrid-sync-arg”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
kernel argument that holds the multigrid synchronization pointer. If this
attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-no-default-queue”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
kernel argument that holds the default queue pointer. If this
attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-no-completion-action”
Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
kernel argument that holds the completion action pointer. If this
attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
“amdgpu-lds-size”=”min[,max]”
Min is the minimum number of bytes that will be allocated in the Local
Data Store at address zero. Variables are allocated within this frame
using absolute symbol metadata, primarily by the AMDGPULowerModuleLDS
pass. Optional max is the maximum number of bytes that will be allocated.
Note that min==max indicates that no further variables can be added to
the frame. This is an internal detail of how LDS variables are lowered,
language front ends should not set this attribute.
“amdgpu-gds-size”
Bytes expected to be allocated at the start of GDS memory at entry.
“amdgpu-git-ptr-high”
The hard-wired high half of the address of the global information table
for AMDPAL OS type. 0xffffffff represents no hard-wired high half, since
current hardware only allows a 16-bit value.
“amdgpu-32bit-address-high-bits”
Assumed high 32-bits for 32-bit address spaces which are really truncated
64-bit addresses (i.e., addrspace(6))
“amdgpu-color-export”
Indicates shader exports color information if set to 1.
Defaults to 1 for amdgpu_ps, and 0 for other calling
conventions. Determines the necessity and type of null exports when a shader
terminates early by killing lanes.
“amdgpu-depth-export”
Indicates shader exports depth information if set to 1. Determines the
necessity and type of null exports when a shader terminates early by killing
lanes. A depth-only shader will export to depth channel when no null export
target is available (GFX11+).
“InitialPSInputAddr”
Set the initial value of the spi_ps_input_addr register for
amdgpu_ps shaders. Any bits enabled by this value will
be enabled in the final register value.
“amdgpu-wave-priority-threshold”
VALU instruction count threshold for adjusting wave priority. If exceeded,
temporarily raise the wave priority at the start of the shader function
until its last VMEM instructions to allow younger waves to issue their VMEM
instructions as well.
“amdgpu-memory-bound”
Set internally by backend
“amdgpu-wave-limiter”
Set internally by backend
“amdgpu-unroll-threshold”
Set base cost threshold preference for loop unrolling within this function,
default is 300. Actual threshold may be varied by per-loop metadata or
reduced by heuristics.
“amdgpu-max-num-workgroups”=”x,y,z”
Specify the maximum number of work groups for the kernel dispatch in the
X, Y, and Z dimensions. Each number must be >= 1. Generated by the
amdgpu_max_num_work_groups CLANG attribute [CLANG-ATTR]. Clang only
emits this attribute when all the three numbers are >= 1.
“amdgpu-hidden-argument”
This attribute is used internally by the backend to mark function arguments
as hidden. Hidden arguments are managed by the compiler and are not part of
the explicit arguments supplied by the user.
“amdgpu-agpr-alloc”=”min(,max)”
Indicates a minimum and maximum range for the number of AGPRs to make
available to allocate. The values will be rounded up to the next multiple
of the allocation granularity (4). The minimum value is interpreted as the
minimum required number of AGPRs for the function to allocate (that is, the
function requires no more than min registers). If only one value is specified,
it is interpreted as the minimum register budget. The maximum will restrict
allocation to use no more than max AGPRs.
The values may be ignored if satisfying it would violate other allocation
constraints.
The behavior is undefined if a function which requires more AGPRs than the
lower bound is reached through any function marked with a higher value of this
attribute. A minimum value of 0 indicates the function does not require
any AGPRs.
This is only relevant on targets with AGPRs which support accum_offset (gfx90a+).
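The rounding rule for this attribute can be sketched as below. The parser and its name are hypothetical; it only illustrates how a "min" or "min,max" string would be rounded up to the allocation granularity of 4 mentioned above, and does not model the other allocation constraints.

```python
GRANULE = 4  # AGPR allocation granularity stated above

def round_up(n, granule=GRANULE):
    return -(-n // granule) * granule

def parse_agpr_alloc(value):
    """Parse 'min' or 'min,max' and round each bound up to the granule."""
    parts = [int(p) for p in value.split(",")]
    if not 1 <= len(parts) <= 2:
        raise ValueError("expected 'min' or 'min,max'")
    lower = round_up(parts[0])
    upper = round_up(parts[1]) if len(parts) == 2 else None
    return lower, upper

only_min = parse_agpr_alloc("5")
both = parse_agpr_alloc("0,30")
```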
“amdgpu-sgpr-hazard-wait”
Disables SGPR hazard wait insertion if set to 0.
Exists for testing performance impact of SGPR hazard waits only.
“amdgpu-sgpr-hazard-boundary-cull”
Enable insertion of SGPR hazard cull sequences at function call boundaries.
Cull sequence reduces future hazard waits, but has a performance cost.
“amdgpu-sgpr-hazard-mem-wait-cull”
Enable insertion of SGPR hazard cull sequences before memory waits.
Cull sequence reduces future hazard waits, but has a performance cost.
Attempt to amortize cost by overlapping with memory accesses.
“amdgpu-sgpr-hazard-mem-wait-cull-threshold”
Sets the number of active SGPR hazards that must be present before
inserting a cull sequence at a memory wait.
“amdgpu-promote-alloca-to-vector-max-regs”
Maximum vector size (in 32b registers) to create when promoting alloca.
“amdgpu-promote-alloca-to-vector-vgpr-ratio”
Ratio of VGPRs to budget for promoting alloca to vectors.
“amdgpu-dynamic-vgpr-block-size”
Represents the size of a VGPR block in the “Dynamic VGPR” hardware mode,
introduced in GFX12.
A value of 0 (default) means that dynamic VGPRs are not enabled.
Valid values for GFX12+ are 16 and 32.
Waves launched in this mode may allocate or deallocate the VGPRs
using dedicated instructions, but may not send the DEALLOC_VGPRS
message. If a shader has this attribute, then all its callees must
match its value.
An amd_cs_chain CC function with this enabled has an extra symbol
prefixed with “_dvgpr$” with the value of the function symbol,
offset by one less than the number of dynamic VGPR blocks required
by the function encoded in bits 5..3.
“amdgpu-cluster-dims”=”x,y,z”
Specify the cluster workgroup dimensions. A value of “0,0,0” indicates that
cluster is disabled. A value of “1024,1024,1024” indicates that cluster is enabled,
but the dimensions cannot be determined at compile time. Any other value explicitly
specifies the cluster dimensions.
This is only relevant on targets with cluster support.
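The three cases of the attribute value can be told apart with a small classifier such as the Python sketch below. The function name and the returned labels are hypothetical; only the two special values come from the description above.

```python
def classify_cluster_dims(value):
    """Classify an 'x,y,z' attribute string into one of the three cases."""
    dims = tuple(int(p) for p in value.split(","))
    if len(dims) != 3:
        raise ValueError("expected 'x,y,z'")
    if dims == (0, 0, 0):
        return ("disabled", None)
    if dims == (1024, 1024, 1024):
        # Clusters are enabled, but the dimensions are not known statically.
        return ("enabled-unknown", None)
    return ("fixed", dims)

kind, dims = classify_cluster_dims("2,2,1")
```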
“amdgpu-expert-scheduling-mode” true/false.
Enable expert scheduling mode 2 for this function. This is a hardware execution
mode introduced in GFX12.
This is only relevant on GFX12+.
“amdgpu-expand-waitcnt-profiling”
Enable expansion of s_waitcnt instructions for profiling purposes.
When enabled, each s_waitcnt instruction that waits on multiple counter
types is expanded into a sequence of s_waitcnt instructions, each waiting
on a single counter type. This allows PC-sampling based profilers to
attribute wait cycles to specific counter types (e.g., VMEM, LDS, EXP).
“amdgpu-no-fwd-progress”
Disable forward progress mode for wave priority
(enabled by default).
ccc
The C calling convention. Used by default.
See Non-Kernel Functions
for more details.
fastcc
The fast calling convention. Mostly the same as the ccc.
coldcc
The cold calling convention. Mostly the same as the ccc.
amdgpu_cs
Used for Mesa/AMDPAL compute shaders.
..TODO::
Describe.
amdgpu_cs_chain
Similar to amdgpu_cs, with differences described below.
Functions with this calling convention cannot be called directly. They must
instead be launched via the llvm.amdgcn.cs.chain intrinsic.
Arguments are passed in SGPRs, starting at s0, if they have the inreg
attribute, and in VGPRs otherwise, starting at v8. Using more SGPRs or VGPRs
than available in the subtarget is not allowed. On subtargets that use
a scratch buffer descriptor (as opposed to scratch_{load,store}_* instructions),
the scratch buffer descriptor is passed in s[48:51]. This limits the
SGPR / inreg arguments to the equivalent of 48 dwords; using more
than that is not allowed.
The return type must be void.
Varargs, sret, byval, byref, inalloca, preallocated are not supported.
Values in scalar registers as well as v0-v7 are not preserved. Values in
VGPRs starting at v8 are not preserved for the active lanes, but must be
saved by the callee for inactive lanes when using WWM (a notable exception is
when the llvm.amdgcn.init.whole.wave intrinsic is used in the function - in this
case the backend assumes that there are no inactive lanes upon entry; any inactive
lanes that need to be preserved must be explicitly present in the IR).
Chain functions receive a stack pointer from their caller (in s32), similar to
amdgpu_gfx functions. If needed, the frame pointer is s33 and the base pointer
is s34. Calls to amdgpu_gfx functions are allowed and behave like they do in
amdgpu_cs functions.
A function may have multiple exits (e.g. one chain exit and one plain ret void
for when the wave ends), but all llvm.amdgcn.cs.chain exits must be in
uniform control flow.
amdgpu_cs_chain_preserve
Same as amdgpu_cs_chain, but active lanes for VGPRs starting at v8 are preserved.
Calls to amdgpu_gfx functions are not allowed, and any calls to llvm.amdgcn.cs.chain
must not pass more VGPR arguments than the caller’s VGPR function parameters.
amdgpu_es
Used for AMDPAL shader stage before geometry shader if geometry is in
use. So either the domain (= tessellation evaluation) shader if
tessellation is in use, or otherwise the vertex shader.
..TODO::
Describe.
amdgpu_gfx
Used for AMD graphics targets. Functions with this calling convention
cannot be used as entry points.
..TODO::
Describe.
amdgpu_gfx_whole_wave
Used for AMD graphics targets. Functions with this calling convention
cannot be used as entry points. They must have an i1 as the first argument,
which will be mapped to the value of EXEC on entry into the function. Other
arguments will contain poison in their inactive lanes. Similarly, the return
value for the inactive lanes is poison.
The function will run with all lanes enabled, i.e. EXEC will be set to -1 in the
prologue and restored to its original value in the epilogue. The inactive lanes
will be preserved for all the registers used by the function. Active lanes
will only be preserved for the callee-saved registers.
In all other respects, functions with this calling convention behave like
amdgpu_gfx functions.
amdgpu_gs
Used for Mesa/AMDPAL geometry shaders.
..TODO::
Describe.
amdgpu_hs
Used for Mesa/AMDPAL hull shaders (= tessellation control shaders).
..TODO::
Describe.
amdgpu_ls
Used for AMDPAL vertex shader if tessellation is in use.
..TODO::
Describe.
amdgpu_ps
Used for Mesa/AMDPAL pixel shaders.
..TODO::
Describe.
amdgpu_vs
Used for Mesa/AMDPAL last shader stage before rasterization (vertex
shader if tessellation and geometry are not in use, or otherwise
copy shader if one is needed).
..TODO::
Describe.
The following ABI conventions apply to all calling conventions that are used for
callable functions (i.e. those that do not correspond to hardware entry points):
On entry to a function the dependency counters (VMcnt, LOADcnt etc.)
are in an indeterminate state.
On return from a function, all dependency counters must be zero except for
VScnt/STOREcnt.
For entry points, the ABI conventions are dictated by the hardware behavior at
wave launch and wave termination:
When a wave is launched the shader can assume that all dependency counters are
zero.
The shader can leave the dependency counters in any state before terminating
the wave (e.g. with s_endpgm).
A function’s resource usage depends on each of its callees’ resource usage. The
expressions used to denote resource usage reflect this by propagating each
callee's equivalent expressions. Said expressions are emitted as symbols by the
compiler when compiling to either assembly or object format and should not be
overwritten or redefined.
The following describes all emitted function resource usage symbols:
Furthermore, three symbols are additionally emitted describing the compilation
unit’s worst case (i.e., maxima) num_vgpr, num_agpr, and
numbered_sgpr which may be referenced and used by the aforementioned
symbolic expressions. These three symbols are amdgcn.max_num_vgpr,
amdgcn.max_num_agpr, and amdgcn.max_num_sgpr.
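The propagation of callee resource usage can be illustrated with the Python sketch below. It assumes that a register count for a function is the maximum of its own usage and that of every callee, and that the call graph is acyclic; both are assumptions for the sketch, and the function and dictionary names are hypothetical.

```python
def resolve_num_vgpr(func, own, callees, cache=None):
    """Return max(own usage, usage of every transitive callee)."""
    cache = {} if cache is None else cache
    if func not in cache:
        callee_usage = [
            resolve_num_vgpr(callee, own, callees, cache)
            for callee in callees.get(func, [])
        ]
        cache[func] = max([own[func]] + callee_usage)
    return cache[func]

# Hypothetical module: kernel -> helper -> leaf.
own_vgprs = {"kernel": 10, "helper": 24, "leaf": 8}
call_graph = {"kernel": ["helper"], "helper": ["leaf"]}
kernel_vgprs = resolve_num_vgpr("kernel", own_vgprs, call_graph)
```

In this example the kernel's own code needs 10 VGPRs, but the resolved value is 24 because of the helper it calls.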
The AMDGPU backend generates a standard ELF [ELF] relocatable code object that
can be linked by lld to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.
ELFCLASS64 for amdgcn architecture which only supports 64-bit
process address space applications.
e_ident[EI_DATA]
All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
e_ident[EI_OSABI]
One of the following AMDGPU target architecture specific OS ABIs
(see AMDGPU Operating Systems):
ELFOSABI_NONE for unknown OS.
ELFOSABI_AMDGPU_HSA for amdhsa OS.
ELFOSABI_AMDGPU_PAL for amdpal OS.
ELFOSABI_AMDGPU_MESA3D for mesa3d OS.
e_ident[EI_ABIVERSION]
The ABI version of the AMDGPU target architecture specific OS ABI to which the code
object conforms:
ELFABIVERSION_AMDGPU_HSA_V2 is used to specify the version of AMD HSA
runtime ABI for code object V2. Can no longer be emitted by this version of LLVM.
ELFABIVERSION_AMDGPU_HSA_V3 is used to specify the version of AMD HSA
runtime ABI for code object V3. Can no longer be emitted by this version of LLVM.
ELFABIVERSION_AMDGPU_HSA_V4 is used to specify the version of AMD HSA
runtime ABI for code object V4. Specify using the Clang option
-mcode-object-version=4.
ELFABIVERSION_AMDGPU_HSA_V5 is used to specify the version of AMD HSA
runtime ABI for code object V5. Specify using the Clang option
-mcode-object-version=5. This is the default code object
version if not specified.
ELFABIVERSION_AMDGPU_HSA_V6 is used to specify the version of AMD HSA
runtime ABI for code object V6. Specify using the Clang option
-mcode-object-version=6.
ELFABIVERSION_AMDGPU_PAL is used to specify the version of AMD PAL
runtime ABI.
ELFABIVERSION_AMDGPU_MESA3D is used to specify the version of AMD MESA
3D runtime ABI.
e_type
Can be one of the following values:
ET_REL
The type produced by the AMDGPU backend compiler as it is relocatable code
object.
ET_DYN
The type produced by the linker as it is a shared code object.
The AMD HSA runtime loader requires an ET_DYN code object.
The entry point is 0 as the entry points for individual kernels must be
selected in order to invoke them through AQL packets.
e_flags
The AMDGPU backend uses the following ELF header flags:
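Table 42 AMDGPU ELF Header e_flags for Code Object V2

================================= ===== ==============================================
Name                              Value Description
================================= ===== ==============================================
EF_AMDGPU_FEATURE_XNACK_V2        0x01  Indicates if the xnack target feature is
                                        enabled for all code contained in the code
                                        object. If the processor does not support the
                                        xnack target feature then must be 0. See
                                        Target Features.
EF_AMDGPU_FEATURE_TRAP_HANDLER_V2 0x02  Indicates if the trap handler is enabled for
                                        all code contained in the code object. If the
                                        processor does not support a trap handler then
                                        must be 0. See Target Features.
================================= ===== ==============================================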
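Table 43 AMDGPU ELF Header e_flags for Code Object V3

============================ ===== ==============================================
Name                         Value Description
============================ ===== ==============================================
EF_AMDGPU_FEATURE_XNACK_V3   0x100 Indicates if the xnack target feature is
                                   enabled for all code contained in the code
                                   object. If the processor does not support the
                                   xnack target feature then must be 0. See
                                   Target Features.
EF_AMDGPU_FEATURE_SRAMECC_V3 0x200 Indicates if the sramecc target feature is
                                   enabled for all code contained in the code
                                   object. If the processor does not support the
                                   sramecc target feature then must be 0. See
                                   Target Features.
============================ ===== ==============================================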
Table 44 AMDGPU ELF Header e_flags for Code Object V4 and V5

======================================== ========== ===========================================
Name                                     Value      Description
======================================== ========== ===========================================
EF_AMDGPU_FEATURE_XNACK_V4               0x300      XNACK selection mask for
                                                    EF_AMDGPU_FEATURE_XNACK_*_V4 values.
EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4   0x000      XNACK unsupported.
EF_AMDGPU_FEATURE_XNACK_ANY_V4           0x100      XNACK can have any value.
EF_AMDGPU_FEATURE_XNACK_OFF_V4           0x200      XNACK disabled.
EF_AMDGPU_FEATURE_XNACK_ON_V4            0x300      XNACK enabled.
EF_AMDGPU_FEATURE_SRAMECC_V4             0xc00      SRAMECC selection mask for
                                                    EF_AMDGPU_FEATURE_SRAMECC_*_V4 values.
EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4 0x000      SRAMECC unsupported.
EF_AMDGPU_FEATURE_SRAMECC_ANY_V4         0x400      SRAMECC can have any value.
EF_AMDGPU_FEATURE_SRAMECC_OFF_V4         0x800      SRAMECC disabled.
EF_AMDGPU_FEATURE_SRAMECC_ON_V4          0xc00      SRAMECC enabled.
EF_AMDGPU_GENERIC_VERSION_V              0xff000000 Generic code object version selection
                                                    mask. This is a value between 1 and 255,
                                                    stored in the most significant byte of
                                                    EFLAGS. See Generic Processor Versioning.
======================================== ========== ===========================================
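The V4-style e_flags fields can be pulled apart with plain masking. The following is a minimal sketch, not the loader's implementation: the function name decode_e_flags_v4 is hypothetical, the XNACK mask is taken to be 0x300 (the union of the listed XNACK values), and the shift of 24 follows from the generic version being stored in the most significant byte.

```python
# Masks and values follow the e_flags entries described above.
XNACK_MASK = 0x300            # assumed: union of the listed XNACK values
SRAMECC_MASK = 0xC00
GENERIC_VERSION_MASK = 0xFF000000
GENERIC_VERSION_SHIFT = 24    # most significant byte of e_flags

XNACK_NAMES = {0x000: "unsupported", 0x100: "any", 0x200: "off", 0x300: "on"}
SRAMECC_NAMES = {0x000: "unsupported", 0x400: "any", 0x800: "off", 0xC00: "on"}


def decode_e_flags_v4(e_flags):
    """Hypothetical helper: split e_flags into the feature fields."""
    return {
        "xnack": XNACK_NAMES[e_flags & XNACK_MASK],
        "sramecc": SRAMECC_NAMES[e_flags & SRAMECC_MASK],
        "generic_version": (e_flags & GENERIC_VERSION_MASK) >> GENERIC_VERSION_SHIFT,
    }
```

A generic_version of 0 would mean the code object is not a generic code object, since valid generic versions are between 1 and 255.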
These sections have their standard meanings (see [ELF]) and are only generated
if needed.
.debug*
The standard DWARF sections. See DWARF Debug Information for
information on the DWARF produced by the AMDGPU backend.
.dynamic, .dynstr, .dynsym, .hash
The standard sections used by a dynamic loader.
.note
See Note Records for the note records supported by the AMDGPU
backend.
.relaname, .rela.dyn
For relocatable code objects, name is the name of the section to which the
relocation records apply. For example, .rela.text is the section name for
relocation records associated with the .text section.
For linked shared code objects, .rela.dyn contains all the relocation
records from each of the relocatable code objects’ .relaname sections.
See Relocation Records for the relocation records supported by
the AMDGPU backend.
.text
The executable machine code for the kernels and the functions they call.
Generated as position independent code. See Code Conventions for
information on the conventions used in ISA generation.
As required by ELFCLASS32 and ELFCLASS64, minimal zero-byte padding
must be generated after the name field to ensure the desc field is 4-byte
aligned. In addition, minimal zero-byte padding must be generated to
ensure the desc field size is a multiple of 4 bytes. The sh_addralign
field of the .note section must be at least 4 to indicate at least 4-byte
alignment.
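The padding rule above can be sketched as follows. This assumes the standard ELF note layout (three 32-bit words for name size, desc size and type, followed by the name and desc fields) and little-endian encoding as used by all AMDGPU targets. The helper names align4 and build_note are hypothetical, and the note type passed in is an arbitrary placeholder.

```python
import struct


def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def build_note(name, note_type, desc):
    """Hypothetical helper: serialize one note record with 4-byte padding."""
    name_z = name + b"\0"  # the name size includes the terminating NUL
    header = struct.pack("<III", len(name_z), len(desc), note_type)
    # Pad after the name so that desc starts 4-byte aligned.
    name_pad = b"\0" * (align4(len(name_z)) - len(name_z))
    # Pad after desc so that the field occupies a multiple of 4 bytes.
    desc_pad = b"\0" * (align4(len(desc)) - len(desc))
    return header + name_z + name_pad + desc + desc_pad
```

The size fields in the header record the unpadded lengths. Only the stored bytes are padded.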
vendor_name_size and architecture_name_size are the length of the
vendor and architecture names respectively, including the NUL character.
vendor_and_architecture_name contains the NUL terminated string for the
vendor, immediately followed by the NUL terminated string for the
architecture.
This note record is used by the HSA runtime loader.
Code object V2 only supports a limited number of processors and has fixed
settings for target features. See
AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings for a list of
processors and the corresponding target ID. In the table the note record ISA
name is a concatenation of the vendor name, architecture name, major, minor,
and stepping separated by a “:”.
The target ID column shows the processor name and fixed target features used
by the LLVM compiler. The LLVM compiler does not generate a
NT_AMD_HSA_HSAIL note record.
A code object generated by the Finalizer also uses code object V2 and always
generates a NT_AMD_HSA_HSAIL note record. The processor name and
sramecc target feature are as shown in
AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings but the xnack
target feature is specified by the EF_AMDGPU_FEATURE_XNACK_V2 e_flags
bit.
NT_AMD_HSA_ISA_NAME
Specifies the target ISA name as a non-NUL terminated string.
This note record is not used by the HSA runtime loader.
See the NT_AMD_HSA_ISA_VERSION note record description of the code object
V2’s limited support of processors and fixed settings for target features.
See AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings for a mapping
from the string to the corresponding target ID. If the xnack target
feature is supported and enabled, the string produced by the LLVM compiler
may have +xnack appended. The Finalizer did not do the appending and
instead used the EF_AMDGPU_FEATURE_XNACK_V2 e_flags bit.
NT_AMD_HSA_METADATA
Specifies extensible metadata associated with the code objects executed on HSA
[HSA] compatible runtimes (see AMDGPU Operating Systems). It is required when the
target triple OS is amdhsa (see Target Triples). See
Code Object V2 Metadata for the syntax of the code object
metadata string.
Global variables both used and defined by the compilation unit.
If the symbol is defined in the compilation unit then it is allocated in the
appropriate section according to whether it has initialized data or is read-only.
If the symbol is external then its section is STN_UNDEF and the loader
will resolve relocations using the definition provided by another code object
or explicitly defined by the runtime.
If the symbol resides in local/group memory (LDS) then its section is the
special processor specific section name SHN_AMDGPU_LDS, and the
st_value field describes alignment requirements as it does for common
symbols.
Kernel descriptor
Every HSA kernel has an associated kernel descriptor. It is the address of the
kernel descriptor that is used in the AQL dispatch packet used to invoke the
kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
defined in Kernel Descriptor.
Kernel entry point
Every HSA kernel also has a symbol for its machine code entry point.
The AMDGPU backend generates Elf64_Rela relocation records for
AMDHSA or Elf64_Rel relocation records for Mesa/AMDPAL. Supported
relocatable fields are:
word32
This specifies a 32-bit field occupying 4 bytes with arbitrary byte
alignment. These values use the same byte order as other word values in the
AMDGPU architecture.
word64
This specifies a 64-bit field occupying 8 bytes with arbitrary byte
alignment. These values use the same byte order as other word values in the
AMDGPU architecture.
The following notations are used for specifying relocation calculations:
A
Represents the addend used to compute the value of the relocatable field. If
the addend field is smaller than 64 bits then it is zero-extended to 64 bits
for use in the calculations below. (In practice this only affects _HI
relocation types on Mesa/AMDPAL, where the addend comes from the 32-bit field
but the result of the calculation depends on the high part of the full 64-bit
address.)
G
Represents the offset into the global offset table at which the relocation
entry’s symbol will reside during execution.
GOT
Represents the address of the global offset table.
P
Represents the place (section offset for ET_REL or address for ET_DYN)
of the storage unit being relocated (computed using r_offset).
S
Represents the value of the symbol whose index resides in the relocation
entry. Relocations not using this must specify a symbol index of
STN_UNDEF.
B
Represents the base address of a loaded executable or shared object which is
the difference between the ELF address and the actual load address.
Relocations using this are only valid in executable or shared objects.
R_AMDGPU_ABS32_LO and R_AMDGPU_ABS32_HI are only supported by
the mesa3d OS, which does not support R_AMDGPU_ABS64.
There is no current OS loader support for 32-bit programs and so
R_AMDGPU_ABS32 is only generated for static relocations, for example to
implement some DWARF32 forms.
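The notation above can be read as ordinary integer arithmetic that is then truncated to the width of the relocated field. The sketch below covers only a handful of relocation types. The formulas used (S + A, S + A - P, B + A and their low/high 32-bit halves) are the usual definitions for these type names and should be checked against the relocation table. The function name relocate is hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def relocate(kind, S=0, A=0, P=0, B=0):
    """Hypothetical helper: compute the value stored into the relocated field.

    S, A, P and B have the meanings given in the notation list above.
    """
    if kind == "R_AMDGPU_ABS64":          # word64: S + A
        return (S + A) & MASK64
    if kind == "R_AMDGPU_ABS32_LO":       # word32: low half of S + A
        return (S + A) & MASK32
    if kind == "R_AMDGPU_ABS32_HI":       # word32: high half of S + A
        return ((S + A) & MASK64) >> 32
    if kind == "R_AMDGPU_REL32":          # word32: PC-relative
        return (S + A - P) & MASK32
    if kind == "R_AMDGPU_RELATIVE64":     # word64: load-base relative
        return (B + A) & MASK64
    raise ValueError("relocation type not covered by this sketch: " + kind)
```

Masking after the subtraction gives the two's complement encoding for negative PC-relative displacements.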
The AMD GPU code object loader represents the path of the ELF shared object from
which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in memory loaded relocated form of the ELF
shared object. Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.
The loaded code object path URI syntax is defined by the following BNF syntax:
Is a C integral literal where hexadecimal values are prefixed by “0x” or “0X”,
and octal values by “0”.
file_path
Is the file’s path specified as a URI encoded UTF-8 string. In URI encoding,
every character that is not in the regular expression [a-zA-Z0-9/_.~-] is
encoded as two uppercase hexadecimal digits preceded by “%”. Directories in
the path are separated by “/”.
offset
Is a 0-based byte offset to the start of the code object. For a file URI, it
is from the start of the file specified by the file_path, and if omitted
defaults to 0. For a memory URI, it is the memory address and is required.
size
Is the number of bytes in the code object. For a file URI, if omitted it
defaults to the size of the file. It is required for a memory URI.
process_id
Is the identity of the process owning the memory. For Linux it is the C
unsigned integral decimal literal for the process ID (PID).
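Two of the rules above are mechanical enough to sketch: the URI encoding of file_path and the parsing of C-style integer literals. The function names encode_file_path and parse_number are hypothetical, and the sketch does not attempt to assemble or validate a complete URI.

```python
import re

# Bytes that are left unencoded, per the character class given above.
_UNRESERVED = re.compile(rb"[a-zA-Z0-9/_.~-]")


def encode_file_path(path):
    """Hypothetical helper: URI-encode a path as UTF-8 with uppercase %XX."""
    out = []
    for byte in path.encode("utf-8"):
        ch = bytes([byte])
        if _UNRESERVED.fullmatch(ch):
            out.append(ch.decode("ascii"))
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)


def parse_number(text):
    """Hypothetical helper: parse a C integral literal (hex, octal or decimal)."""
    if text[:2] in ("0x", "0X"):
        return int(text[2:], 16)
    if len(text) > 1 and text[0] == "0":
        return int(text[1:], 8)
    return int(text, 10)
```

Each byte of a multi-byte UTF-8 character is encoded separately, so one non-ASCII character becomes several %XX groups.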
This section describes provisional support for AMDGPU DWARF [DWARF] that
is not currently fully implemented and is subject to change.
AMDGPU generates DWARF [DWARF] debugging information ELF sections (see
ELF Code Object) which contain information that maps the code
object executable code and data to the source language constructs. It can be
used by tools such as debuggers and profilers. It uses features defined in
DWARF Extensions For Heterogeneous Debugging that are made available in
DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
This section defines the AMDGPU target architecture specific DWARF mappings.
This section defines the AMDGPU target architecture register numbers used in
DWARF operation expressions (see DWARF Version 5 section 2.5 and
A.2.5.4 DWARF Operation Expressions) and Call Frame Information
instructions (see DWARF Version 5 section 6.4 and
A.6.4 Call Frame Information).
A single code object can contain code for kernels that have different wavefront
sizes. The vector registers and some scalar registers are based on the wavefront
size. AMDGPU defines distinct DWARF registers for each wavefront size. This
simplifies the consumer of the DWARF so that each register has a fixed size,
rather than being dynamic according to the wavefront size mode. Similarly,
distinct DWARF registers are defined for those registers that vary in size
according to the process address size. This allows a consumer to treat a
specific AMDGPU processor as a single architecture regardless of how it is
configured at run time. The compiler explicitly specifies the DWARF registers
that match the mode in which the code it is generating will be executed.
DWARF registers are encoded as numbers, which are mapped to architecture
registers. The mapping for AMDGPU is defined in
AMDGPU DWARF Register Mapping. All AMDGPU targets use the same
mapping.
============== ================ ======== ==============================================
DWARF Register AMDGPU Register  Bit Size Description
============== ================ ======== ==============================================
0              PC_32            32       Program Counter (PC) when executing in a 32-bit process address space. Used in the CFI to describe the PC of the calling frame.
1              EXEC_MASK_32     32       Execution Mask Register when executing in wavefront 32 mode.
2-15           Reserved                  Reserved for highly accessed registers using DWARF shortcut.
16             PC_64            64       Program Counter (PC) when executing in a 64-bit process address space. Used in the CFI to describe the PC of the calling frame.
17             EXEC_MASK_64     64       Execution Mask Register when executing in wavefront 64 mode.
18-31          Reserved                  Reserved for highly accessed registers using DWARF shortcut.
32-95          SGPR0-SGPR63     32       Scalar General Purpose Registers.
96-127         Reserved                  Reserved for frequently accessed registers using DWARF 1-byte ULEB.
128            STATUS           32       Status Register.
129-511        Reserved                  Reserved for future Scalar Architectural Registers.
512            VCC_32           32       Vector Condition Code Register when executing in wavefront 32 mode.
513-767        Reserved                  Reserved for future Vector Architectural Registers when executing in wavefront 32 mode.
768            VCC_64           64       Vector Condition Code Register when executing in wavefront 64 mode.
769-1023       Reserved                  Reserved for future Vector Architectural Registers when executing in wavefront 64 mode.
1024-1087      Reserved                  Reserved for padding.
1088-1129      SGPR64-SGPR105   32       Scalar General Purpose Registers.
1130-1535      Reserved                  Reserved for future Scalar General Purpose Registers.
1536-2047      VGPR0-VGPR511    32*32    Vector General Purpose Registers when executing in wavefront 32 mode.
2048-2303      AGPR0-AGPR255    32*32    Vector Accumulation Registers when executing in wavefront 32 mode.
2304-2559      Reserved                  Reserved for future Vector Accumulation Registers when executing in wavefront 32 mode.
2560-2815      VGPR0-VGPR255    64*32    Vector General Purpose Registers when executing in wavefront 64 mode.
2816-3071      Reserved                  Reserved for future Vector General Purpose Registers when executing in wavefront 64 mode.
3072-3327      AGPR0-AGPR255    64*32    Vector Accumulation Registers when executing in wavefront 64 mode.
3328-3583      Reserved                  Reserved for future Vector Accumulation Registers when executing in wavefront 64 mode.
3584-4095      VGPR512-VGPR1023 32*32    Second block of Vector General Purpose Registers when executing in wavefront 32 mode.
============== ================ ======== ==============================================
The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits each), one per lane, with the dword
at the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the DW_OP_LLVM_offset and
DW_OP_LLVM_push_lane operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
If the wavefront size is 32 lanes then the wavefront 32 mode register
definitions are used. If the wavefront size is 64 lanes then the wavefront 64
mode register definitions are used. Some AMDGPU targets support executing in
both wavefront 32 and wavefront 64 mode. The register definitions corresponding
to the wavefront mode of the generated code will be used.
If code is generated to execute in a 32-bit process address space, then the
32-bit process address space register definitions are used. If code is generated
to execute in a 64-bit process address space, then the 64-bit process address
space register definitions are used. The amdgcn target only supports the
64-bit process address space.
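The register mapping can be expressed as a small lookup that selects the register block from the wavefront size and process address size. This is an illustrative sketch only. The function name dwarf_register and its parameters are hypothetical, and the value 0 for PC_32 is an assumption based on the layout of the mapping (PC_64 is 16 and EXEC_MASK_32 is 1).

```python
def dwarf_register(name, index=0, wavefront_size=64, address_bits=64):
    """Hypothetical helper: return the DWARF register number for a register."""
    if wavefront_size not in (32, 64) or address_bits not in (32, 64):
        raise ValueError("unsupported wavefront size or address size")
    wave32 = wavefront_size == 32
    if name == "PC":
        return 0 if address_bits == 32 else 16   # PC_32 = 0 is an assumption
    if name == "EXEC":
        return 1 if wave32 else 17
    if name == "STATUS":
        return 128
    if name == "VCC":
        return 512 if wave32 else 768
    if name == "SGPR":
        if 0 <= index <= 63:
            return 32 + index
        if 64 <= index <= 105:
            return 1088 + (index - 64)
    if name == "VGPR":
        if wave32 and 0 <= index <= 511:
            return 1536 + index
        if wave32 and 512 <= index <= 1023:
            return 3584 + (index - 512)      # second block, wavefront 32 only
        if not wave32 and 0 <= index <= 255:
            return 2560 + index
    if name == "AGPR" and 0 <= index <= 255:
        return (2048 if wave32 else 3072) + index
    raise ValueError("no DWARF register for %s%d" % (name, index))
```

Because the wavefront size is an explicit input, the same physical VGPR maps to different DWARF numbers in the two modes, which is what gives each DWARF register a fixed size.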
The DWARF memory space represents the source language memory space. See DWARF
Version 5 section 2.12 which is updated by the DWARF Extensions For
Heterogeneous Debugging section A.2.14 Memory Spaces.
The DWARF memory space values defined in the DWARF Extensions For Heterogeneous
Debugging section A.2.14 Memory Spaces are used.
In addition, DW_ADDR_AMDGPU_region is encoded as a vendor extension. This is
available for use for the AMD extension for access to the hardware GDS memory
which is scratchpad memory allocated per device.
For AMDGPU if no DW_AT_LLVM_memory_space attribute is present, then the
default memory space of DW_MSPACE_LLVM_none is used.
See Address Space Identifier for information on the AMDGPU
mapping of DWARF memory spaces to DWARF address spaces, including address size
and NULL value.
DWARF address spaces correspond to target architecture specific linear
addressable memory areas. See DWARF Version 5 section 2.12 and DWARF Extensions
For Heterogeneous Debugging section A.2.13 Address Spaces.
See Address Spaces for information on the AMDGPU LLVM IR address
spaces including address size and NULL value.
The DW_ASPACE_LLVM_none address space is the default target architecture
address space used in DWARF operations that do not specify an address space. It
therefore has to map to the global address space so that the DW_OP_addr* and
related operations can refer to addresses in the program code.
The DW_ASPACE_AMDGPU_generic address space allows location expressions to
specify the flat address space. If the address corresponds to an address in the
local address space, then it corresponds to the wavefront that is executing the
focused thread of execution. If the address corresponds to an address in the
private address space, then it corresponds to the lane that is executing the
focused thread of execution for languages that are implemented using a SIMD or
SIMT execution model.
Note
CUDA-like languages such as HIP that do not have address spaces in the
language type system, but do allow variables to be allocated in different
address spaces, need to explicitly specify the DW_ASPACE_AMDGPU_generic
address space in the DWARF expression operations as the default address space
is the global address space.
The DW_ASPACE_AMDGPU_local address space allows location expressions to
specify the local address space corresponding to the wavefront that is executing
the focused thread of execution.
The DW_ASPACE_AMDGPU_private_lane address space allows location expressions
to specify the private address space corresponding to the lane that is executing
the focused thread of execution for languages that are implemented using a SIMD
or SIMT execution model.
The DW_ASPACE_AMDGPU_private_wave address space allows location expressions
to specify the unswizzled private address space corresponding to the wavefront
that is executing the focused thread of execution. The wavefront view of private
memory is the per wavefront unswizzled backing memory layout defined in
Address Spaces, such that address 0 corresponds to the first
location for the backing memory of the wavefront (namely the address is not
offset by wavefront-scratch-base). The following formula can be used to
convert from a DW_ASPACE_AMDGPU_private_lane address to a
DW_ASPACE_AMDGPU_private_wave address:
If the DW_ASPACE_AMDGPU_private_lane address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:
A compiler can use the DW_ASPACE_AMDGPU_private_wave address space to read a
complete spilled vector register back into a complete vector register in the
CFI. The frame pointer can be a private lane address which is dword aligned,
which can be shifted to multiply by the wavefront size, and then used to form a
private wavefront address that gives a location for a contiguous set of dwords,
one per lane, where the vector register dwords are spilled. The compiler knows
the wavefront size since it generates the code. Note that the type of the
address may have to be converted as the size of a
DW_ASPACE_AMDGPU_private_lane address may be smaller than the size of a
DW_ASPACE_AMDGPU_private_wave address.
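As an illustration of the conversion described above, the sketch below assumes the swizzled layout interleaves private memory one dword per lane: dword N of every lane is stored contiguously, lane 0 first. Under that assumption a lane address splits into a dword index and a byte within the dword. The function name private_lane_to_wave is hypothetical.

```python
def private_lane_to_wave(lane_address, lane_id, wavefront_size):
    """Hypothetical helper: map a per-lane private address to a per-wave one.

    Assumes a dword-interleaved layout with 4 bytes per lane per row.
    """
    dword_index, byte_in_dword = divmod(lane_address, 4)
    row_start = dword_index * wavefront_size * 4  # start of this dword's row
    return row_start + lane_id * 4 + byte_in_dword
```

For a dword-aligned lane address and lane 0, this reduces to lane_address * wavefront_size, which is the shift-based multiplication the CFI use case relies on.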
DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner, and on which a source language maps its
threads of execution onto those lanes. The DWARF lane identifier is pushed by
the DW_OP_LLVM_push_lane DWARF expression operation. See DWARF Version 5
section 2.5 which is updated by DWARF Extensions For Heterogeneous Debugging
section A.2.5.4 DWARF Operation Expressions.
For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
wavefront. It is numbered from 0 to the wavefront size minus 1.
DWARF expressions are used to compute program values and the locations of
program objects. See DWARF Version 5 section 2.5 and
A.2.5.4 DWARF Operation Expressions.
DWARF location descriptions describe how to access storage which includes memory
and registers. When accessing storage on AMDGPU, bytes are ordered with least
significant bytes first, and bits are ordered within bytes with least
significant bits first.
For AMDGPU CFI expressions, DW_OP_LLVM_select_bit_piece is used to describe
unwinding vector registers that are spilled under the execution mask to memory:
the zero-single location description is the vector register, and the one-single
location description is the spilled memory location description. The
DW_OP_LLVM_form_aspace_address is used to specify the address space of the
memory location description.
In AMDGPU expressions, DW_OP_LLVM_select_bit_piece is used by the
DW_AT_LLVM_lane_pc attribute expression where divergent control flow is
controlled by the execution mask. An undefined location description together
with DW_OP_LLVM_extend is used to indicate the lane was not active on entry
to the subprogram. See DW_AT_LLVM_lane_pc for an example.
For AMDGPU expressions, DW_OP_convert may be used to convert between
DW_ATE_address-encoded base types in different address spaces.
Conversions are defined as in Address Spaces when all relevant
conditions described there are met, and otherwise result in an evaluation
error.
Note
For a target which does not support a particular address space, converting to
or from that address space is always an evaluation error.
For targets which support the generic address space, converting from
DW_ASPACE_AMDGPU_generic to DW_ASPACE_LLVM_none is defined when the
generic address is in the global address space. The conversion requires no
change to the literal value of the address.
Converting from DW_ASPACE_AMDGPU_generic to any of
DW_ASPACE_AMDGPU_local, DW_ASPACE_AMDGPU_private_wave or
DW_ASPACE_AMDGPU_private_lane is defined when the relevant hardware
support is present, any required hardware setup has been completed, and the
generic address is in the corresponding address space. Conversion to
DW_ASPACE_AMDGPU_private_lane additionally requires the context to
include the active lane.
For AMDGPU, the DW_AT_LLVM_lane_pc attribute is used to specify the program
location of the separate lanes of a SIMT thread.
If the lane is an active lane then this will be the same as the current program
location.
If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.
If the lane was not active on entry to the subprogram, then this will be the
undefined location. A client debugger can check if the lane is part of a valid
work-group by checking that the lane is in the range of the associated
work-group within the grid, accounting for partial work-groups. If it is not,
then the debugger can omit any information for the lane. Otherwise, the debugger
may repeatedly unwind the stack and inspect the DW_AT_LLVM_lane_pc of the
calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames that it has a non-undefined
DW_AT_LLVM_lane_pc.
The following example illustrates how the AMDGPU backend can generate a DWARF
location list expression for the nested IF/THEN/ELSE structures of the
following subprogram pseudo code for a target with 64 lanes per wavefront.
The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (EXEC) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
First the THEN region is executed by setting the EXEC mask to the
logical AND of the current EXEC mask with the condition mask. Then the
ELSE region is executed by negating the EXEC mask and logical AND of
the saved EXEC mask at the start of the region. After the IF/THEN/ELSE
region the EXEC mask is restored to the value it had at the beginning of the
region. This is shown below. Other approaches are possible, but the basic
concept is the same.
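The mask manipulation described above can be modelled with integers, one bit per lane. This is a toy model of the sequence of EXEC values only, not the MIR the backend emits. The function name if_then_else_masks is hypothetical.

```python
def if_then_else_masks(exec_mask, cond_mask, wavefront_size=64):
    """Hypothetical helper: EXEC values for the THEN, ELSE and exit points."""
    all_lanes = (1 << wavefront_size) - 1
    saved = exec_mask & all_lanes            # EXEC saved on entry to the region
    then_exec = saved & cond_mask            # lanes where the condition is true
    else_exec = (~then_exec & all_lanes) & saved  # negate, then AND with saved
    restored = saved                         # EXEC restored after the region
    return then_exec, else_exec, restored
```

Lanes that were inactive on entry stay inactive in both regions, because both masks are ANDed with the saved EXEC value.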
To create the DWARF location list expression that defines the location
description of a vector of lane program locations, the LLVM MIR DBG_VALUE
pseudo instruction can be used to annotate the linearized control flow. This can
be done by defining an artificial variable for the lane PC. The DWARF location
list expression created for it is used as the value of the
DW_AT_LLVM_lane_pc attribute on the subprogram’s debugger information entry.
A DWARF procedure is defined for each well nested structured control flow region
which provides the conceptual lane program location for a lane if it is not
active (namely it is divergent). The DWARF operation expression for each region
conceptually inherits the value of the immediately enclosing region and modifies
it according to the semantics of the region.
For an IF/THEN/ELSE region the divergent program location is at the start of
the region for the THEN region since it is executed first. For the ELSE
region the divergent program location is at the end of the IF/THEN/ELSE
region since the THEN region has completed.
The lane PC artificial variable is assigned at each region transition. It uses
the immediately enclosing region’s DWARF procedure to compute the program
location for each lane assuming they are divergent, and then modifies the result
by inserting the current program location for each lane that the EXEC mask
indicates is active.
By having separate DWARF procedures for each region, they can be reused to
define the value for any nested region. This reduces the total size of the DWARF
operation expressions.
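The per-lane update described above can be pictured as a vector of program locations, one per lane, where None stands in for the undefined location. This is a conceptual model rather than DWARF: the function name update_lane_pc and the list representation are hypothetical.

```python
UNDEFINED = None  # stands in for the undefined location description


def update_lane_pc(enclosing_lane_pc, exec_mask, current_pc):
    """Hypothetical helper: overlay the current PC onto the active lanes.

    enclosing_lane_pc holds the divergent program location computed by the
    enclosing region for every lane. Lanes set in exec_mask are active.
    """
    return [
        current_pc if (exec_mask >> lane) & 1 else pc
        for lane, pc in enumerate(enclosing_lane_pc)
    ]
```

Inactive lanes keep whatever the enclosing region computed for them, including the undefined location for lanes that were never active in the subprogram.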
The following provides an example using pseudo LLVM MIR.
The DWARF procedure %__active_lane_pc is used to update the lane pc elements
that are active, with the current program location.
Artificial variables %__lex_1_save_exec and %__lex_1_1_save_exec are created for
the execution masks saved on entry to a region. Using the DBG_VALUE pseudo
instruction, location list entries will be created that describe where the
artificial variables are allocated at any given program location. The compiler
may allocate them to registers or spill them to memory.
The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
An IF/THEN/ELSEIF/ELSEIF/... region can be treated as a nest of
IF/THEN/ELSE regions.
The DWARF procedures can use the active lane artificial variable described in
DW_AT_LLVM_active_lane rather than the actual
EXEC mask in order to support whole or quad wavefront mode.
The DW_AT_LLVM_active_lane attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.
The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the EXEC mask,
update it to enable the necessary lanes, perform the operations, and then
restore the EXEC mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
EXEC value.
This is handled by defining an artificial variable for the active lane mask. The
active lane mask artificial variable would be the actual EXEC mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the DW_AT_LLVM_active_lane
attribute.
For AMDGPU, the DW_AT_LLVM_augmentation attribute of a compilation unit
debugger information entry has the following value for the augmentation string:
[amdgpu:v0.0]
The “vX.Y” specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit. The version number
conforms to [SEMVER].
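A consumer might extract the vendor and version from such an augmentation string as sketched below. The bracketed vendor:vX.Y form is taken from the text above. The function name parse_augmentation and the decision to accept several bracketed entries in one string are assumptions.

```python
import re

# One "[vendor:vMAJOR.MINOR]" entry.
_ENTRY = re.compile(r"\[([^:\]]+):v(\d+)\.(\d+)\]")


def parse_augmentation(text):
    """Hypothetical helper: return (vendor, major, minor) for each entry."""
    return [
        (vendor, int(major), int(minor))
        for vendor, major, minor in _ENTRY.findall(text)
    ]
```

Comparing the major number first and the minor number second matches the ordering that semantic versioning prescribes.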
DWARF Call Frame Information (CFI) describes how a consumer can virtually
unwind call frames in a running process or core dump. See DWARF Version 5
section 6.4 and A.6.4 Call Frame Information.
For AMDGPU, the Common Information Entry (CIE) fields have the following values:
augmentation string contains the following null-terminated UTF-8 string:
[amd:v0.0]
The vX.Y specifies the major X and minor Y version number of the AMDGPU
extensions used in this CIE or in the FDEs that use it. The version number
conforms to [SEMVER].
segment_selector_size is 0 as AMDGPU does not use a segment selector.
code_alignment_factor is 4 bytes.
data_alignment_factor is 4 bytes.
return_address_register is PC_32 for 32-bit processes and PC_64
for 64-bit processes defined in Register Identifier.
initial_instructions Since a subprogram X with fewer registers can be
called from subprogram Y that has more allocated, X will not change any of
the extra registers as it cannot access them. Therefore, the default rule
for all columns is samevalue.
For AMDGPU the register number follows the numbering defined in
Register Identifier.
For AMDGPU the instructions are variable size. A consumer can subtract 1 from
the return address to get the address of a byte within the call site
instructions. See DWARF Version 5 section 6.4.4.
For AMDGPU the lookup by name section header table:
augmentation_string_size (uword)
Set to the length of the augmentation_string value which is always a
multiple of 4.
augmentation_string (sequence of UTF-8 characters)
Contains the following UTF-8 string null padded to a multiple of 4 bytes:
[amdgpu:v0.0]
The “vX.Y” specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of this index. The version number conforms to
[SEMVER].
Note
This is different to the DWARF Version 5 definition that requires the first
4 characters to be the vendor ID. But this is consistent with the other
augmentation strings and does allow multiple vendor contributions. However,
backwards compatibility may be more desirable.
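The relationship between augmentation_string and augmentation_string_size in the lookup by name header can be sketched as follows. This is an illustrative Python helper, not code from LLVM; the name `encode_augmentation` is hypothetical.

```python
def encode_augmentation(text):
    """Encode text as UTF-8 and null pad it to a multiple of 4 bytes.

    Returns (augmentation_string_size, augmentation_string).
    """
    data = text.encode("utf-8")
    padded_len = (len(data) + 3) // 4 * 4  # round up to a multiple of 4
    padded = data.ljust(padded_len, b"\x00")
    return len(padded), padded


if __name__ == "__main__":
    size, payload = encode_augmentation("[amdgpu:v0.0]")
    print(size, payload)
```

For the 13-character string `[amdgpu:v0.0]` this yields a size of 16, with three trailing null bytes.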
AMDGPU does not use the isa state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header e_flags field
in the EF_AMDGPU_MACH bit position (see ELF Header). See DWARF Version 5 section 6.2.2.
For AMDGPU the line number program header fields have the following values (see
DWARF Version 5 section 6.2.4):
AMDGPU does not use a segment selector so this is 0.
minimum_instruction_length (ubyte)
For GFX9-GFX11 this is 4.
maximum_operations_per_instruction (ubyte)
For GFX9-GFX11 this is 1.
Source text for online-compiled programs (for example, those compiled by the
OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
See DWARF Version 5 section 6.2.4.1 which is updated by DWARF Extensions For
Heterogeneous Debugging section DW_LNCT_LLVM_source.
Enable/disable embedding source text in DWARF
debug sections. Useful for environments where
source cannot be written to disk, such as
when performing online compilation.
Code object metadata is specified in a note record (see
Note Records) and is required when the target triple OS is
amdhsa (see Target Triples). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries. For
example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included. For
example, the AMD OpenCL runtime records kernel argument information.
Code object V2 generation is no longer supported by this version of LLVM.
Code object V2 metadata is specified by the NT_AMD_HSA_METADATA note record
(see Code Object V2 Note Records).
The metadata is specified as a YAML formatted string (see [YAML] and
YAML I/O).
The metadata is represented as a single YAML document comprised of the mapping
defined in table AMDHSA Code Object V2 Metadata Map and
referenced tables.
For boolean values, the string values of false and true are used for
false and true respectively.
Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by “vendor-name.”.
If not 0, 0, 0 then all values
must be >=1 and the dispatch
work-group size X, Y, Z must
correspond to the specified
values. Defaults to 0, 0, 0.
Corresponds to the OpenCL
reqd_work_group_size
attribute.
“WorkGroupSizeHint”
sequence of
3 integers
The dispatch work-group size
X, Y, Z is likely to be the
specified values.
Corresponds to the OpenCL
work_group_size_hint
attribute.
“VecTypeHint”
string
The name of a scalar or vector
type.
Corresponds to the OpenCL
vec_type_hint attribute.
“RuntimeHandle”
string
The external symbol name
associated with a kernel.
OpenCL runtime allocates a
global buffer for the symbol
and saves the kernel’s address
to it, which is used for
device side enqueueing. Only
available for device side
enqueued kernels.
Kernel argument alignment in
bytes. Must be a power of two.
“ValueKind”
string
Required
Kernel argument kind that
specifies how to set up the
corresponding argument.
Values include:
“ByValue”
The argument is copied
directly into the kernarg.
“GlobalBuffer”
A global address space pointer
to the buffer data is passed
in the kernarg.
“DynamicSharedPointer”
A group address space pointer
to dynamically allocated LDS
is passed in the kernarg.
“Sampler”
A global address space
pointer to a S# is passed in
the kernarg.
“Image”
A global address space
pointer to a T# is passed in
the kernarg.
“Pipe”
A global address space pointer
to an OpenCL pipe is passed in
the kernarg.
“Queue”
A global address space pointer
to an OpenCL device enqueue
queue is passed in the
kernarg.
“HiddenGlobalOffsetX”
The OpenCL grid dispatch
global offset for the X
dimension is passed in the
kernarg.
“HiddenGlobalOffsetY”
The OpenCL grid dispatch
global offset for the Y
dimension is passed in the
kernarg.
“HiddenGlobalOffsetZ”
The OpenCL grid dispatch
global offset for the Z
dimension is passed in the
kernarg.
“HiddenNone”
An argument that is not used
by the kernel. Space needs to
be left for it, but it does
not need to be set up.
“HiddenPrintfBuffer”
A global address space pointer
to the runtime printf buffer
is passed in kernarg. Mutually
exclusive with
“HiddenHostcallBuffer”.
“HiddenHostcallBuffer”
A global address space pointer
to the runtime hostcall buffer
is passed in kernarg. Mutually
exclusive with
“HiddenPrintfBuffer”.
“HiddenDefaultQueue”
A global address space pointer
to the OpenCL device enqueue
queue that should be used by
the kernel by default is
passed in the kernarg.
“HiddenCompletionAction”
A global address space pointer
to help link enqueued kernels into
the ancestor tree for determining
when the parent kernel has finished.
“HiddenMultiGridSyncArg”
A global address space pointer for
multi-grid synchronization is
passed in the kernarg.
“ValueType”
string
Unused and deprecated. This should no longer
be emitted, but is accepted for compatibility.
“PointeeAlign”
integer
Alignment in bytes of pointee
type for pointer type kernel
argument. Must be a power
of 2. Only present if
“ValueKind” is
“DynamicSharedPointer”.
“AddrSpaceQual”
string
Kernel argument address space
qualifier. Only present if
“ValueKind” is “GlobalBuffer” or
“DynamicSharedPointer”. Values
are:
“Private”
“Global”
“Constant”
“Local”
“Generic”
“Region”
“AccQual”
string
Kernel argument access
qualifier. Only present if
“ValueKind” is “Image” or
“Pipe”. Values
are:
“ReadOnly”
“WriteOnly”
“ReadWrite”
“ActualAccQual”
string
The actual memory accesses
performed by the kernel on the
kernel argument. Only present if
“ValueKind” is “GlobalBuffer”,
“Image”, or “Pipe”. This may be
more restrictive than indicated
by “AccQual” to reflect what the
kernel actually does. If not
present then the runtime must
assume what is implied by
“AccQual” and “IsConst”. Values
are:
“ReadOnly”
“WriteOnly”
“ReadWrite”
“IsConst”
boolean
Indicates if the kernel argument
is const qualified. Only present
if “ValueKind” is
“GlobalBuffer”.
“IsRestrict”
boolean
Indicates if the kernel argument
is restrict qualified. Only
present if “ValueKind” is
“GlobalBuffer”.
“IsVolatile”
boolean
Indicates if the kernel argument
is volatile qualified. Only
present if “ValueKind” is
“GlobalBuffer”.
“IsPipe”
boolean
Indicates if the kernel argument
is pipe qualified. Only present
if “ValueKind” is “Pipe”.
The size in bytes of
the kernarg segment
that holds the values
of the arguments to
the kernel.
“GroupSegmentFixedSize”
integer
Required
The amount of group
segment memory
required by a
work-group in
bytes. This does not
include any
dynamically allocated
group segment memory
that may be added
when the kernel is
dispatched.
“PrivateSegmentFixedSize”
integer
Required
The amount of fixed
private address space
memory required for a
work-item in
bytes. If the kernel
uses a dynamic call
stack then additional
space must be added
to this value for the
call stack.
“KernargSegmentAlign”
integer
Required
The maximum byte
alignment of
arguments in the
kernarg segment. Must
be a power of 2.
“WavefrontSize”
integer
Required
Wavefront size. Must
be a power of 2.
“NumSGPRs”
integer
Required
Number of scalar
registers used by a
wavefront for
GFX6-GFX11. This
includes the special
SGPRs for VCC, Flat
Scratch (GFX7-GFX10)
and XNACK (for
GFX8-GFX10). It does
not include the 16
SGPRs added if a trap
handler is
enabled. It is not
rounded up to the
allocation
granularity.
“NumVGPRs”
integer
Required
Number of vector
registers used by
each work-item for
GFX6-GFX11
“MaxFlatWorkGroupSize”
integer
Required
Maximum flat
work-group size
supported by the
kernel in work-items.
Must be >=1 and
consistent with
ReqdWorkGroupSize if
not 0, 0, 0.
“NumSpilledSGPRs”
integer
Number of stores from
a scalar register to
a register allocator
created spill
location.
“NumSpilledVGPRs”
integer
Number of stores from
a vector register to
a register allocator
created spill
location.
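The constraints stated for the V2 code properties can be checked mechanically. The sketch below is illustrative Python, not part of any runtime. It assumes the metadata has already been loaded into a plain dictionary, and it interprets "consistent with ReqdWorkGroupSize" as "the product of the three required sizes does not exceed MaxFlatWorkGroupSize", which is an assumption of this example.

```python
def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def check_code_props(code_props, reqd_work_group_size=(0, 0, 0)):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for key in ("KernargSegmentAlign", "WavefrontSize"):
        if not is_power_of_two(code_props[key]):
            problems.append(key + " must be a power of 2")

    max_flat = code_props["MaxFlatWorkGroupSize"]
    if max_flat < 1:
        problems.append("MaxFlatWorkGroupSize must be >= 1")

    reqd = tuple(reqd_work_group_size)
    if reqd != (0, 0, 0):
        if min(reqd) < 1:
            problems.append("ReqdWorkGroupSize values must all be >= 1")
        elif reqd[0] * reqd[1] * reqd[2] > max_flat:
            # Assumed meaning of "consistent" for this sketch.
            problems.append("ReqdWorkGroupSize exceeds MaxFlatWorkGroupSize")
    return problems


if __name__ == "__main__":
    props = {"KernargSegmentAlign": 8, "WavefrontSize": 64,
             "MaxFlatWorkGroupSize": 256}
    print(check_code_props(props, (16, 16, 1)))
```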
The metadata is represented as Message Pack formatted binary data (see
[MsgPack]). The top level is a Message Pack map that includes the
keys defined in table
AMDHSA Code Object V3 Metadata Map and referenced
tables.
Additional information can be added to the maps. To avoid conflicts,
any key names should be prefixed by “vendor-name.” where
vendor-name can be the name of the vendor and specific vendor
tool that generates the information. The prefix is abbreviated to
simply “.” when it appears within a map that has been added by the
same vendor-name.
If not 0, 0, 0 then all values
must be >=1 and the dispatch
work-group size X, Y, Z must
correspond to the specified
values. Defaults to 0, 0, 0.
Corresponds to the OpenCL
reqd_work_group_size
attribute.
“.workgroup_size_hint”
sequence of
3 integers
The dispatch work-group size
X, Y, Z is likely to be the
specified values.
Corresponds to the OpenCL
work_group_size_hint
attribute.
“.vec_type_hint”
string
The name of a scalar or vector
type.
Corresponds to the OpenCL
vec_type_hint attribute.
“.device_enqueue_symbol”
string
The external symbol name
associated with a kernel.
OpenCL runtime allocates a
global buffer for the symbol
and saves the kernel’s address
to it, which is used for
device side enqueueing. Only
available for device side
enqueued kernels.
“.kernarg_segment_size”
integer
Required
The size in bytes of
the kernarg segment
that holds the values
of the arguments to
the kernel.
“.group_segment_fixed_size”
integer
Required
The amount of group
segment memory
required by a
work-group in
bytes. This does not
include any
dynamically allocated
group segment memory
that may be added
when the kernel is
dispatched.
“.private_segment_fixed_size”
integer
Required
The amount of fixed
private address space
memory required for a
work-item in
bytes. If the kernel
uses a dynamic call
stack then additional
space must be added
to this value for the
call stack.
“.kernarg_segment_align”
integer
Required
The maximum byte
alignment of
arguments in the
kernarg segment. Must
be a power of 2.
“.wavefront_size”
integer
Required
Wavefront size. Must
be a power of 2.
“.sgpr_count”
integer
Required
Number of scalar
registers required by a
wavefront for
GFX6-GFX9. A register
is required if it is
used explicitly, or
if a higher numbered
register is used
explicitly. This
includes the special
SGPRs for VCC, Flat
Scratch (GFX7-GFX9)
and XNACK (for
GFX8-GFX9). It does
not include the 16
SGPRs added if a trap
handler is
enabled. It is not
rounded up to the
allocation
granularity.
“.vgpr_count”
integer
Required
Number of vector
registers required by
each work-item for
GFX6-GFX9. A register
is required if it is
used explicitly, or
if a higher numbered
register is used
explicitly.
“.agpr_count”
integer
Required
Number of accumulator
registers required by
each work-item for
GFX90A, GFX908.
“.max_flat_workgroup_size”
integer
Required
Maximum flat
work-group size
supported by the
kernel in work-items.
Must be >=1 and
consistent with
ReqdWorkGroupSize if
not 0, 0, 0.
“.sgpr_spill_count”
integer
Number of stores from
a scalar register to
a register allocator
created spill
location.
“.vgpr_spill_count”
integer
Number of stores from
a vector register to
a register allocator
created spill
location.
“.kind”
string
The kind of the kernel
with the following
values:
“normal”
Regular kernels.
“init”
These kernels must be
invoked after loading
the containing code
object and must
complete before any
normal and fini
kernels in the same
code object are
invoked.
“fini”
These kernels must be
invoked before
unloading the
containing code object
and after all init and
normal kernels in the
same code object have
been invoked and
completed.
If omitted, “normal” is
assumed.
“.max_num_work_groups_{x,y,z}”
integer
The max number of
launched work-groups
in the X, Y, and Z
dimensions. Each number
must be >=1.
Kernel argument offset in
bytes. The offset must be a
multiple of the alignment
required by the argument.
“.value_kind”
string
Required
Kernel argument kind that
specifies how to set up the
corresponding argument.
Values include:
“by_value”
The argument is copied
directly into the kernarg.
“global_buffer”
A global address space pointer
to the buffer data is passed
in the kernarg.
“dynamic_shared_pointer”
A group address space pointer
to dynamically allocated LDS
is passed in the kernarg.
“sampler”
A global address space
pointer to a S# is passed in
the kernarg.
“image”
A global address space
pointer to a T# is passed in
the kernarg.
“pipe”
A global address space pointer
to an OpenCL pipe is passed in
the kernarg.
“queue”
A global address space pointer
to an OpenCL device enqueue
queue is passed in the
kernarg.
“hidden_global_offset_x”
The OpenCL grid dispatch
global offset for the X
dimension is passed in the
kernarg.
“hidden_global_offset_y”
The OpenCL grid dispatch
global offset for the Y
dimension is passed in the
kernarg.
“hidden_global_offset_z”
The OpenCL grid dispatch
global offset for the Z
dimension is passed in the
kernarg.
“hidden_none”
An argument that is not used
by the kernel. Space needs to
be left for it, but it does
not need to be set up.
“hidden_printf_buffer”
A global address space pointer
to the runtime printf buffer
is passed in kernarg. Mutually
exclusive with
“hidden_hostcall_buffer”
before Code Object V5.
“hidden_hostcall_buffer”
A global address space pointer
to the runtime hostcall buffer
is passed in kernarg. Mutually
exclusive with
“hidden_printf_buffer”
before Code Object V5.
“hidden_default_queue”
A global address space pointer
to the OpenCL device enqueue
queue that should be used by
the kernel by default is
passed in the kernarg.
“hidden_completion_action”
A global address space pointer
to help link enqueued kernels into
the ancestor tree for determining
when the parent kernel has finished.
“hidden_multigrid_sync_arg”
A global address space pointer for
multi-grid synchronization is
passed in the kernarg.
“.value_type”
string
Unused and deprecated. This should no longer
be emitted, but is accepted for compatibility.
“.pointee_align”
integer
Alignment in bytes of pointee
type for pointer type kernel
argument. Must be a power
of 2. Only present if
“.value_kind” is
“dynamic_shared_pointer”.
“.address_space”
string
Kernel argument address space
qualifier. Only present if
“.value_kind” is “global_buffer” or
“dynamic_shared_pointer”. Values
are:
“private”
“global”
“constant”
“local”
“generic”
“region”
“.access”
string
Kernel argument access
qualifier. Only present if
“.value_kind” is “image” or
“pipe”. Values
are:
“read_only”
“write_only”
“read_write”
“.actual_access”
string
The actual memory accesses
performed by the kernel on the
kernel argument. Only present if
“.value_kind” is “global_buffer”,
“image”, or “pipe”. This may be
more restrictive than indicated
by “.access” to reflect what the
kernel actually does. If not
present then the runtime must
assume what is implied by
“.access” and “.is_const”. Values
are:
“read_only”
“write_only”
“read_write”
“.is_const”
boolean
Indicates if the kernel argument
is const qualified. Only present
if “.value_kind” is
“global_buffer”.
“.is_restrict”
boolean
Indicates if the kernel argument
is restrict qualified. Only
present if “.value_kind” is
“global_buffer”.
“.is_volatile”
boolean
Indicates if the kernel argument
is volatile qualified. Only
present if “.value_kind” is
“global_buffer”.
“.is_pipe”
boolean
Indicates if the kernel argument
is pipe qualified. Only present
if “.value_kind” is “pipe”.
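Several keys in the kernel argument map are described as only present for particular “.value_kind” values. A minimal Python sketch of that rule follows. It assumes the Message Pack map has already been decoded into a dictionary; the helper name `check_kernel_arg` is hypothetical.

```python
# Optional keys and the ".value_kind" values for which the table above
# says they may be present.
ALLOWED_VALUE_KINDS = {
    ".pointee_align": {"dynamic_shared_pointer"},
    ".address_space": {"global_buffer", "dynamic_shared_pointer"},
    ".access": {"image", "pipe"},
    ".actual_access": {"global_buffer", "image", "pipe"},
    ".is_const": {"global_buffer"},
    ".is_restrict": {"global_buffer"},
    ".is_volatile": {"global_buffer"},
    ".is_pipe": {"pipe"},
}


def check_kernel_arg(arg):
    """Return a list of problems found in one decoded kernel argument map."""
    kind = arg.get(".value_kind")
    if kind is None:
        return [".value_kind is required"]

    problems = []
    for key, kinds in ALLOWED_VALUE_KINDS.items():
        if key in arg and kind not in kinds:
            problems.append(key + " is not expected for " + kind)

    align = arg.get(".pointee_align")
    if align is not None and (align <= 0 or align & (align - 1)):
        problems.append(".pointee_align must be a power of 2")
    return problems


if __name__ == "__main__":
    print(check_kernel_arg({".value_kind": "by_value", ".is_pipe": True}))
```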
Indicates if the kernel
requires that each dimension
of global size is a multiple
of corresponding dimension of
work-group size. Value of 1
implies true and value of 0
implies false. Metadata is
only emitted when value is 1.
Kernel argument kind that
specifies how to set up the
corresponding argument.
Values include:
the same as code object V3 metadata
(see AMDHSA Code Object V3 Kernel Argument Metadata Map)
with the following additions:
“hidden_block_count_x”
The grid dispatch work-group count for the X dimension
is passed in the kernarg. Some languages, such as OpenCL,
support a last work-group in each dimension being partial.
This count only includes the non-partial work-group count.
This is not the same as the value in the AQL dispatch packet,
which has the grid size in work-items.
“hidden_block_count_y”
The grid dispatch work-group count for the Y dimension
is passed in the kernarg. Some languages, such as OpenCL,
support a last work-group in each dimension being partial.
This count only includes the non-partial work-group count.
This is not the same as the value in the AQL dispatch packet,
which has the grid size in work-items. If the grid dimensionality
is 1, then this must be 1.
“hidden_block_count_z”
The grid dispatch work-group count for the Z dimension
is passed in the kernarg. Some languages, such as OpenCL,
support a last work-group in each dimension being partial.
This count only includes the non-partial work-group count.
This is not the same as the value in the AQL dispatch packet,
which has the grid size in work-items. If the grid dimensionality
is 1 or 2, then this must be 1.
“hidden_group_size_x”
The grid dispatch work-group size for the X dimension is
passed in the kernarg. This size only applies to the
non-partial work-groups. This is the same value as the AQL
dispatch packet work-group size.
“hidden_group_size_y”
The grid dispatch work-group size for the Y dimension is
passed in the kernarg. This size only applies to the
non-partial work-groups. This is the same value as the AQL
dispatch packet work-group size. If the grid dimensionality
is 1, then this must be 1.
“hidden_group_size_z”
The grid dispatch work-group size for the Z dimension is
passed in the kernarg. This size only applies to the
non-partial work-groups. This is the same value as the AQL
dispatch packet work-group size. If the grid dimensionality
is 1 or 2, then this must be 1.
“hidden_remainder_x”
The grid dispatch work group size of the partial work group
of the X dimension, if it exists. Must be zero if a partial
work group does not exist in the X dimension.
“hidden_remainder_y”
The grid dispatch work group size of the partial work group
of the Y dimension, if it exists. Must be zero if a partial
work group does not exist in the Y dimension.
“hidden_remainder_z”
The grid dispatch work group size of the partial work group
of the Z dimension, if it exists. Must be zero if a partial
work group does not exist in the Z dimension.
“hidden_grid_dims”
The grid dispatch dimensionality. This is the same value
as the AQL dispatch packet dimensionality. Must be a value
between 1 and 3.
“hidden_heap_v1”
A global address space pointer to an initialized memory
buffer that conforms to the requirements of the malloc/free
device library V1 version implementation.
“hidden_dynamic_lds_size”
Size of the dynamically allocated LDS memory is passed in the kernarg.
“hidden_private_base”
The high 32 bits of the flat addressing private aperture base.
Only used by GFX8 to allow conversion between private segment
and flat addresses. See Flat Scratch.
“hidden_shared_base”
The high 32 bits of the flat addressing shared aperture base.
Only used by GFX8 to allow conversion between shared segment
and flat addresses. See Flat Scratch.
“hidden_queue_ptr”
A global memory address space pointer to the ROCm runtime
struct amd_queue_t structure for the HSA queue of the
associated dispatch AQL packet. It is only required for pre-GFX9
devices for the trap handler ABI (see Trap Handler ABI).
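The relationship between the hidden block count, group size, remainder, and grid dimension arguments can be sketched as follows. This is an illustrative Python function, not runtime code. It assumes the grid size is given in work-items (as in the AQL dispatch packet) and that unused dimensions are treated as having a grid size and work-group size of 1; the function name is hypothetical.

```python
def hidden_dispatch_args(grid_size, workgroup_size, grid_dims):
    """Derive the hidden dispatch values for a grid.

    grid_size and workgroup_size are (x, y, z) tuples in work-items.
    """
    if grid_dims not in (1, 2, 3):
        raise ValueError("grid dimensionality must be between 1 and 3")

    args = {"hidden_grid_dims": grid_dims}
    for axis, name in enumerate("xyz"):
        if axis < grid_dims:
            grid, group = grid_size[axis], workgroup_size[axis]
        else:
            grid, group = 1, 1  # unused dimensions
        # Only the non-partial work-groups are counted.
        args["hidden_block_count_" + name] = grid // group
        args["hidden_group_size_" + name] = group
        # Size of the partial work-group, or 0 if there is none.
        args["hidden_remainder_" + name] = grid % group
    return args


if __name__ == "__main__":
    print(hidden_dispatch_args((1000, 1, 1), (256, 1, 1), 1))
```

For a one-dimensional grid of 1000 work-items with a work-group size of 256, this gives three full work-groups in X and a partial work-group of 232 work-items.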
The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using an HSA
compatible runtime (see AMDGPU Operating Systems), in which AQL packets (all of which
are 64 bytes) can be placed. See the HSA Platform System Architecture
Specification [HSA] for the AQL queue mechanics and packet layouts.
The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).
An HSA compatible runtime can be used to allocate an AQL queue object. It uses
the kernel mode driver to initialize and register the AQL queue with CP.
To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.
A pointer to an AQL queue for the kernel agent on which the kernel is to be
executed is obtained.
A pointer to the kernel descriptor (see
Kernel Descriptor) of the kernel to execute is obtained.
It must be for a kernel that is contained in a code object that was loaded
by an HSA compatible runtime on the kernel agent with which the AQL queue is
associated.
Space is allocated for the kernel arguments using the HSA compatible runtime
allocator for a memory region with the kernarg property for the kernel agent
that will execute the kernel. It must be at least 16-byte aligned.
Kernel argument values are assigned to the kernel argument memory
allocation. The layout is defined in the HSA Programmer’s Language
Reference [HSA]. For AMDGPU the kernel execution directly accesses the
kernel argument memory in the same way constant memory is accessed. (Note
that the HSA specification allows an implementation to copy the kernel
argument contents to another location that is accessed by the kernel.)
An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
runtime API uses 64-bit atomic operations to reserve space in the AQL queue
for the packet. The packet must be set up, and the final write must use an
atomic store release to set the packet kind to ensure the packet contents are
visible to the kernel agent. AQL defines a doorbell signal mechanism to
notify the kernel agent that the AQL queue has been updated. These rules, and
the layout of the AQL queue and kernel dispatch packet, are defined in the HSA
System Architecture Specification [HSA].
A kernel dispatch packet includes information about the actual dispatch,
such as grid and work-group size, together with information from the code
object about the kernel, such as segment sizes. The HSA compatible runtime
queries on the kernel symbol can be used to obtain the code object values
which are recorded in the Code Object Metadata.
CP executes micro-code and is responsible for detecting and setting up the
GPU to execute the wavefronts of a kernel dispatch.
CP ensures that when a wavefront starts executing the kernel machine
code, the scalar general purpose registers (SGPR) and vector general purpose
registers (VGPR) are set up as required by the machine code. The required
setup is defined in the Kernel Descriptor. The initial
register state is defined in
Initial Kernel Execution State.
The prolog of the kernel machine code (see
Kernel Prolog) sets up the machine state as necessary
before continuing executing the machine code that corresponds to the kernel.
When the kernel dispatch has completed execution, CP signals the completion
signal specified in the kernel dispatch packet if not 0.
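The ordering of the packet submission step described above (reserve a slot, fill in the packet, write the packet kind last, then signal the doorbell) can be illustrated with a toy model. The Python below is a single-threaded sketch only: Python has no atomic store release, the class `ToyAqlQueue` is hypothetical, and placing the packet kind in a leading 16-bit header is an assumption of this example. The actual layout and rules are defined in the HSA specifications.

```python
PACKET_SIZE = 64  # all AQL packets are 64 bytes
HEADER_SIZE = 2   # assumption: packet kind lives in a leading 16-bit header


class ToyAqlQueue:
    """Single-threaded stand-in for an AQL queue ring buffer."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.ring = bytearray(num_slots * PACKET_SIZE)
        self.write_index = 0  # a real runtime updates this with 64-bit atomics
        self.doorbell = None  # last packet index signalled to the agent

    def reserve_slot(self):
        # No queue-full check here; a real implementation must wait for space.
        index = self.write_index
        self.write_index += 1
        return index

    def publish(self, index, header, body):
        if len(header) != HEADER_SIZE or len(body) != PACKET_SIZE - HEADER_SIZE:
            raise ValueError("packet must be exactly 64 bytes")
        base = (index % self.num_slots) * PACKET_SIZE
        # 1. Fill in everything except the header.
        self.ring[base + HEADER_SIZE:base + PACKET_SIZE] = body
        # 2. Write the header last (an atomic store release in real code).
        self.ring[base:base + HEADER_SIZE] = header
        # 3. Signal the doorbell so the agent knows the queue was updated.
        self.doorbell = index


if __name__ == "__main__":
    queue = ToyAqlQueue(4)
    slot = queue.reserve_slot()
    queue.publish(slot, b"\x02\x00", bytes(62))
    print(queue.doorbell, queue.write_index)
```

Writing the header last matters because the packet processor treats a slot as ready only once its packet kind is valid, so the rest of the packet must already be visible at that point.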
The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.
Using the constant memory space indicates that the data will not change during
the execution of the kernel. This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.
The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.
The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving. The mapping used from private
address to physical address is:
There are different ways that the wavefront scratch base address is determined
by a wavefront (see Initial Kernel Execution State). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9-GFX11.
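The effect of wavefront lane dword interleaving can be illustrated with a small Python sketch. The formula used here is an assumption of this example: it places dword N of every lane contiguously, which matches the behaviour described above (lanes accessing the same private address touch adjacent dwords). The function and parameter names are hypothetical.

```python
def scratch_physical_address(wavefront_scratch_base, private_address,
                             lane_id, wavefront_size):
    """Map a per-lane private address to a dword-interleaved address."""
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4  # skip earlier dword rows
            + lane_id * 4                       # this lane's dword in the row
            + byte_in_dword)


if __name__ == "__main__":
    # Lanes 0..3 reading private address 8 hit adjacent dwords.
    print([scratch_physical_address(0, 8, lane, 64) for lane in range(4)])
```

With a wavefront size of 64, private address 8 is dword 2, so its row starts at byte 512 and lanes 0 to 3 map to 512, 516, 520, and 524.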
The generic address space uses the hardware flat address support available in
GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.
FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see
Flat Scratch). Flat access to LDS requires
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
M0).
To convert between a segment address and a flat address the base addresses of
the apertures can be used. For GFX7-GFX8 these are available in the
HSA AQL Queue the address of which can be obtained with
Queue Ptr SGPR (see Initial Kernel Execution State). For
GFX9-GFX11 the aperture base addresses are directly available as inline constant
registers SRC_SHARED_BASE/LIMIT and SRC_PRIVATE_BASE/LIMIT. In 64-bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.
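Because each aperture is 2^32 bytes and its base is aligned to 2^32, the conversions reduce to operations on the upper and lower 32 bits of the address. The following Python sketch illustrates this under those stated properties only. The aperture base value in the usage example is made up, the function names are hypothetical, and null pointer handling is deliberately ignored.

```python
APERTURE_SIZE = 1 << 32


def segment_to_flat(aperture_base, segment_address):
    """Combine a 2^32-aligned aperture base with a 32-bit segment address."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def flat_to_segment(flat_address):
    """Keep the low 32 bits; valid only for addresses inside the aperture."""
    return flat_address & (APERTURE_SIZE - 1)


def in_aperture(aperture_base, flat_address):
    """True if the flat address has the same upper 32 bits as the base."""
    return (flat_address >> 32) == (aperture_base >> 32)


if __name__ == "__main__":
    base = 0x0001_0000_0000_0000  # made-up aperture base for illustration
    flat = segment_to_flat(base, 0x1234)
    print(hex(flat), in_aperture(base, flat), hex(flat_to_segment(flat)))
```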
Image and sampler handles created by an HSA compatible runtime (see
AMDGPU Operating Systems) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
object respectively. In order to support the HSA query_sampler operations
two extra dwords are used to store the HSA BRIG enumeration values for the
queries that are not trivially deducible from the S# representation.
HSA signal handles created by an HSA compatible runtime (see AMDGPU Operating Systems)
are 64-bit addresses of a structure allocated in memory accessible from both the
CPU and GPU. The structure is defined by the runtime and subject to change
between releases. For example, see [AMD-ROCm-github].
The HSA AQL queue structure is defined by an HSA compatible runtime (see
AMDGPU Operating Systems) and subject to change between releases. For example, see
[AMD-ROCm-github]. For some processors it contains fields needed to implement
certain language features such as the flat address aperture bases. It also
contains fields used by CP such as managing the allocation of scratch memory.
A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.
The amount of fixed local
address space memory
required for a work-group
in bytes. This does not
include any dynamically
allocated local address
space memory that may be
added when the kernel is
dispatched.
63:32
4 bytes
PRIVATE_SEGMENT_FIXED_SIZE
The amount of fixed
private address space
memory required for a
work-item in bytes. When
this cannot be predicted,
code object v4 and older
sets this value to be
higher than the minimum
requirement.
95:64
4 bytes
KERNARG_SIZE
The size of the kernarg
memory pointed to by the
AQL dispatch packet. The
kernarg memory is used to
pass arguments to the
kernel.
If the kernarg pointer in
the dispatch packet is NULL
then there are no kernel
arguments.
If the kernarg pointer in
the dispatch packet is
not NULL and this value
is 0 then the kernarg
memory size is
unspecified.
If the kernarg pointer in
the dispatch packet is
not NULL and this value
is not 0 then the value
specifies the kernarg
memory size in bytes. It
is recommended to provide
a value as it may be used
by CP to optimize making
the kernarg memory
visible to the kernel
code.
127:96
4 bytes
Reserved, must be 0.
191:128
8 bytes
KERNEL_CODE_ENTRY_BYTE_OFFSET
Byte offset (possibly
negative) from base
address of kernel
descriptor to kernel’s
entry point instruction
which must be 256 byte
aligned.
The total number of SGPR
user data registers
requested must not exceed
16 and match value in
compute_pgm_rsrc2.user_sgpr.user_sgpr_count.
Any requests beyond 16
will be ignored.
>448
1 bit
ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER
If the Target Properties
column of
AMDGPU Processors
specifies Architected flat
scratch then this is not
supported and must be 0.
>449
1 bit
ENABLE_SGPR_DISPATCH_PTR
>450
1 bit
ENABLE_SGPR_QUEUE_PTR
>451
1 bit
ENABLE_SGPR_KERNARG_SEGMENT_PTR
>452
1 bit
ENABLE_SGPR_DISPATCH_ID
>453
1 bit
ENABLE_SGPR_FLAT_SCRATCH_INIT
If the Target Properties
column of
AMDGPU Processors
specifies Architected flat
scratch then this is not
supported and must be 0.
>454
1 bit
ENABLE_SGPR_PRIVATE_SEGMENT_SIZE
457:455
3 bits
Reserved, must be 0.
458
1 bit
ENABLE_WAVEFRONT_SIZE32
GFX6-GFX9
Reserved, must be 0.
GFX10-GFX11
If 0 execute in
wavefront size 64 mode.
If 1 execute in
native wavefront size
32 mode.
459
1 bit
USES_DYNAMIC_STACK
Indicates if the generated
machine code is using a
dynamically sized stack.
This is only set in code
object v5 and later.
463:460
4 bits
Reserved, must be 0.
470:464
7 bits
KERNARG_PRELOAD_SPEC_LENGTH
GFX6-GFX9
Reserved, must be 0.
GFX90A, GFX942
The number of dwords from
the kernarg segment to preload
into User SGPRs before kernel
execution. (see
Preloaded Kernel Arguments).
479:471
9 bits
KERNARG_PRELOAD_SPEC_OFFSET
GFX6-GFX9
Reserved, must be 0.
GFX90A, GFX942
An offset in dwords into the
kernarg segment to begin
preloading data into User
SGPRs. (see
Preloaded Kernel Arguments).
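The bit ranges in the kernel descriptor table can be read directly out of the raw bytes. The Python sketch below is illustrative only: it assumes a 64-byte descriptor stored in little-endian byte order, decodes just a few of the fields listed above, and uses hypothetical function names.

```python
def bit_field(value, high, low):
    """Extract bits high:low (inclusive) from an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def to_signed(value, width):
    """Reinterpret an unsigned width-bit value as two's complement."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_kernel_descriptor(raw):
    """Decode selected fields from an assumed 64-byte descriptor."""
    if len(raw) != 64:
        raise ValueError("expected a 64-byte kernel descriptor")
    value = int.from_bytes(raw, "little")
    return {
        "private_segment_fixed_size": bit_field(value, 63, 32),
        "kernarg_size": bit_field(value, 95, 64),
        "kernel_code_entry_byte_offset":
            to_signed(bit_field(value, 191, 128), 64),
        "enable_sgpr_dispatch_ptr": bit_field(value, 449, 449),
        "enable_sgpr_kernarg_segment_ptr": bit_field(value, 451, 451),
        "enable_wavefront_size32": bit_field(value, 458, 458),
        "uses_dynamic_stack": bit_field(value, 459, 459),
        "kernarg_preload_spec_length": bit_field(value, 470, 464),
        "kernarg_preload_spec_offset": bit_field(value, 479, 471),
    }


def entry_point(descriptor_address, decoded):
    """Entry address is the descriptor base plus the (signed) byte offset."""
    address = descriptor_address + decoded["kernel_code_entry_byte_offset"]
    if address % 256 != 0:
        raise ValueError("entry point must be 256 byte aligned")
    return address
```

Since the entry offset may be negative, it is sign-extended before being added to the descriptor address.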
Number of vector register
blocks used by each work-item;
granularity is device
specific:
GFX6-GFX9
vgprs_used 0..256
max(0, ceil(vgprs_used / 4) - 1)
GFX90A, GFX942
vgprs_used 0..512
vgprs_used = align(arch_vgprs, 4) + acc_vgprs
max(0, ceil(vgprs_used / 8) - 1)
GFX10-GFX12 (wavefront size 64)
max_vgpr 1..256
max(0, ceil(vgprs_used / 4) - 1)
GFX10-GFX12 (wavefront size 32)
max_vgpr 1..256
max(0, ceil(vgprs_used / 8) - 1)
GFX125X (wavefront size 32)
max_vgpr 1..1024
max(0, ceil(vgprs_used / 16) - 1)
Where vgprs_used is defined
as the highest VGPR number
explicitly referenced plus
one.
Used by CP to set up
COMPUTE_PGM_RSRC1.VGPRS.
The
Assembler
calculates this
automatically for the
selected processor from
values provided to the
.amdhsa_kernel directive
by the
.amdhsa_next_free_vgpr
nested directive (see
AMDHSA Kernel Assembler Directives).
9:6
4 bits
GRANULATED_WAVEFRONT_SGPR_COUNT
Number of scalar register
blocks used by a wavefront;
granularity is device
specific:
GFX6-GFX8
sgprs_used 0..112
max(0, ceil(sgprs_used / 8) - 1)
GFX9
sgprs_used 0..112
2 * max(0, ceil(sgprs_used / 16) - 1)
GFX10-GFX12
Reserved, must be 0.
(128 SGPRs always
allocated.)
Where sgprs_used is
defined as the highest
SGPR number explicitly
referenced plus one, plus
a target specific number
of additional special
SGPRs for VCC,
FLAT_SCRATCH (GFX7+) and
XNACK_MASK (GFX8+), and
any additional
target specific
limitations. It does not
include the 16 SGPRs added
if a trap handler is
enabled.
The target specific
limitations and special
SGPR layout are defined in
the hardware
documentation, which can
be found in the
Processors
table.
Used by CP to set up
COMPUTE_PGM_RSRC1.SGPRS.
The
Assembler
calculates this
automatically for the
selected processor from
values provided to the
.amdhsa_kernel directive
by the
.amdhsa_next_free_sgpr
and .amdhsa_reserve_*
nested directives (see
AMDHSA Kernel Assembler Directives).
11:10
2 bits
PRIORITY
Must be 0.
Start executing wavefront
at the specified priority.
CP is responsible for
filling in
COMPUTE_PGM_RSRC1.PRIORITY.
13:12
2 bits
FLOAT_ROUND_MODE_32
Wavefront starts execution
with specified rounding
mode for single (32-bit)
precision floating point
operations.
Used by CP to set up
COMPUTE_PGM_RSRC1.FLOAT_MODE.
20
1 bit
PRIV
Must be 0.
Start executing wavefront
in privilege trap handler
mode.
CP is responsible for
filling in
COMPUTE_PGM_RSRC1.PRIV.
21
1 bit
ENABLE_DX10_CLAMP
WG_RR_EN
GFX9-GFX11 (except GFX11.7)
Wavefront starts execution
with DX10 clamp mode
enabled. Used by the vector
ALU to force DX10 style
treatment of NaNs (when
set, clamp NaN to zero,
otherwise pass NaN
through).
Used by CP to set up
COMPUTE_PGM_RSRC1.DX10_CLAMP.
GFX11.7
Reserved. Must be 0.
GFX12
If 1, wavefronts are scheduled
in a round-robin fashion with
respect to the other wavefronts
of the SIMD. Otherwise, wavefronts
are scheduled in oldest age order.
CP is responsible for filling in
COMPUTE_PGM_RSRC1.WG_RR_EN.
22
1 bit
DEBUG_MODE
Must be 0.
Start executing wavefront
in single step mode.
CP is responsible for
filling in
COMPUTE_PGM_RSRC1.DEBUG_MODE.
23
1 bit
ENABLE_IEEE_MODE
DISABLE_PERF
GFX9-GFX11 (except GFX11.7)
Wavefront starts execution
with IEEE mode
enabled. Floating point
opcodes that support
exception flag gathering
will quiet and propagate
signaling-NaN inputs per
IEEE 754-2008. Min_dx10 and
max_dx10 become IEEE
754-2008 compliant due to
signaling-NaN propagation
and quieting.
Used by CP to set up
COMPUTE_PGM_RSRC1.IEEE_MODE.
GFX11.7
Reserved. Must be 0.
GFX12
Reserved. Must be 0.
24
1 bit
BULKY
Must be 0.
Only one work-group allowed
to execute on a compute
unit.
CP is responsible for
filling in
COMPUTE_PGM_RSRC1.BULKY.
25
1 bit
CDBG_USER
Must be 0.
Flag that can be used to
control debugging code.
CP is responsible for
filling in
COMPUTE_PGM_RSRC1.CDBG_USER.
26
1 bit
FP16_OVFL
GFX6-GFX8
Reserved, must be 0.
GFX9-GFX12
Wavefront starts execution
with specified fp16 overflow
mode.
If 0, fp16 overflow generates
+/-INF values.
If 1, fp16 overflow that is the
result of an +/-INF input value
or divide by 0 produces a +/-INF,
otherwise clamps computed
overflow to +/-MAX_FP16 as
appropriate.
Used by CP to set up
COMPUTE_PGM_RSRC1.FP16_OVFL.
27
1 bit
RESERVED
FLAT_SCRATCH_IS_NV
GFX6-GFX120*
Reserved, must be 0.
GFX125*
0 - Use the NV ISA as indication
that scratch is NV. 1 - Force
scratch to NV = 1, even if
ISA.NV == 0 if the address falls
into scratch space (not global).
This allows global.NV = 0 and
scratch.NV = 1 for flat ops. Other
threads use the ISA bit value.
Used by CP to set up
COMPUTE_PGM_RSRC1.FLAT_SCRATCH_IS_NV.
28
1 bit
RESERVED
Reserved, must be 0.
29
1 bit
WGP_MODE
GFX6-GFX9
Reserved, must be 0.
GFX10-GFX12
If 0 execute work-groups in
CU wavefront execution mode.
If 1 execute work-groups
in WGP wavefront execution mode.
Controls the behavior of the
s_waitcnt’s vmcnt and vscnt
counters.
If 0 vmcnt reports completion
of load and atomic with return
out of order with sample
instructions, and the vscnt
reports the completion of
store and atomic without
return in order.
If 1 vmcnt reports completion
of load, atomic with return
and sample instructions in
order, and the vscnt reports
the completion of store and
atomic without return in order.
Used by CP to set up
COMPUTE_PGM_RSRC1.MEM_ORDERED.
31
1 bit
FWD_PROGRESS
GFX6-GFX9
Reserved, must be 0.
GFX10-GFX12
If 0 execute SIMD wavefronts
using oldest first policy.
If 1 execute SIMD wavefronts to
ensure wavefronts will make some
forward progress.
Used by CP to set up
COMPUTE_PGM_RSRC1.FWD_PROGRESS.
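The granulated register count formulas given for GRANULATED_WORKITEM_VGPR_COUNT and GRANULATED_WAVEFRONT_SGPR_COUNT can be sketched as follows. This is only an illustration of the arithmetic in the table, not the assembler's implementation; the function names and the explicit granule parameter are assumptions made for the example.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def granulated_vgpr_count(vgprs_used: int, granule: int) -> int:
    # max(0, ceil(vgprs_used / granule) - 1); per the table the granule is
    # 4, 8 or 16 depending on the target and wavefront size.
    return max(0, ceil_div(vgprs_used, granule) - 1)


def granulated_sgpr_count_gfx6_gfx8(sgprs_used: int) -> int:
    # GFX6-GFX8: max(0, ceil(sgprs_used / 8) - 1)
    return max(0, ceil_div(sgprs_used, 8) - 1)


def granulated_sgpr_count_gfx9(sgprs_used: int) -> int:
    # GFX9: 2 * max(0, ceil(sgprs_used / 16) - 1)
    return 2 * max(0, ceil_div(sgprs_used, 16) - 1)
```

Under these formulas, 256 VGPRs at a granularity of 4 encode as 63, and 112 SGPRs on GFX9 encode as 12.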
If the Target Properties
column of
AMDGPU Processors
does not specify
Architected flat
scratch then enable the
setup of the SGPR
wavefront scratch offset
system register (see
Initial Kernel Execution State).
Used by CP to set up
COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT.
13
1 bit
ENABLE_EXCEPTION_ADDRESS_WATCH
Must be 0.
Wavefront starts execution
with address watch
exceptions enabled which
are generated when L1 has
witnessed a thread access
an address of
interest.
CP is responsible for
filling in the address
watch bit in
COMPUTE_PGM_RSRC2.EXCP_EN_MSB
according to what the
runtime requests.
14
1 bit
ENABLE_EXCEPTION_MEMORY
Must be 0.
Wavefront starts execution
with memory violation
exceptions
enabled which are generated
when a memory violation has
occurred for this wavefront from
L1 or LDS
(write-to-read-only-memory,
mis-aligned atomic, LDS
address out of range,
illegal address, etc.).
CP sets the memory
violation bit in
COMPUTE_PGM_RSRC2.EXCP_EN_MSB
according to what the
runtime requests.
23:15
9 bits
GRANULATED_LDS_SIZE
Must be 0.
CP uses the rounded value
from the dispatch packet,
not this value, as the
dispatch may contain
dynamically allocated group
segment memory. CP writes
directly to
COMPUTE_PGM_RSRC2.LDS_SIZE.
Amount of group segment
(LDS) to allocate for each
work-group. Granularity is
device specific:
GFX6
roundup(lds-size / (64 * 4))
GFX7-GFX12
roundup(lds-size / (128 * 4))
GFX950
roundup(lds-size / (320 * 4))
GFX125*
roundup(lds-size / (512 * 4))
24
1 bit
ENABLE_EXCEPTION_IEEE_754_FP
_INVALID_OPERATION
Wavefront starts execution
with specified exceptions
enabled.
Used by CP to set up
COMPUTE_PGM_RSRC2.EXCP_EN
(set from bits 0..6).
IEEE 754 FP Invalid
Operation
25
1 bit
ENABLE_EXCEPTION_FP_DENORMAL
_SOURCE
FP Denormal one or more
input operands is a
denormal number
26
1 bit
ENABLE_EXCEPTION_IEEE_754_FP
_DIVISION_BY_ZERO
IEEE 754 FP Division by
Zero
27
1 bit
ENABLE_EXCEPTION_IEEE_754_FP
_OVERFLOW
IEEE 754 FP Overflow
28
1 bit
ENABLE_EXCEPTION_IEEE_754_FP
_UNDERFLOW
IEEE 754 FP Underflow
29
1 bit
ENABLE_EXCEPTION_IEEE_754_FP
_INEXACT
IEEE 754 FP Inexact
30
1 bit
ENABLE_EXCEPTION_INT_DIVIDE_BY
_ZERO
Integer Division by Zero
(rcp_iflag_f32 instruction
only)
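The GRANULATED_LDS_SIZE granularities listed above reduce to a ceiling division. The sketch below assumes that lds-size is a byte count and that roundup means rounding up to the next whole granule; the dictionary keys and function name are illustrative only. As the field description notes, the kernel descriptor value itself must be 0 and CP derives the value from the dispatch packet.

```python
# Granule size in dwords, keyed by the target ranges named in the table.
LDS_GRANULE_DWORDS = {
    "GFX6": 64,
    "GFX7-GFX12": 128,
    "GFX950": 320,
    "GFX125*": 512,
}


def granulated_lds_size(lds_size_bytes: int, granule_dwords: int) -> int:
    # roundup(lds-size / (granule_dwords * 4)), assuming lds-size is in bytes.
    granule_bytes = granule_dwords * 4
    return (lds_size_bytes + granule_bytes - 1) // granule_bytes
```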
Offset of a first AccVGPR in the unified register file. Granularity 4.
Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, …,
63 - accum-offset = 256.
15:6
10 bits
Reserved, must be 0.
16
1 bit
TG_SPLIT
If 0 the waves of a work-group are
launched in the same CU.
If 1 the waves of a work-group can be
launched in different CUs. The waves
cannot use S_BARRIER or LDS.
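The accumulation offset encoding described above (0 for accum-offset = 4 up to 63 for accum-offset = 256) can be sketched as a pair of helpers. The function names are hypothetical and the range check reflects only the values listed in the field description.

```python
def encode_accum_offset(accum_offset: int) -> int:
    # The field stores (accum_offset / 4) - 1, so 4 -> 0, 8 -> 1, ..., 256 -> 63.
    if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
        raise ValueError("accum_offset must be a multiple of 4 in 4..256")
    return accum_offset // 4 - 1


def decode_accum_offset(field_value: int) -> int:
    # Inverse mapping for the 6-bit field value 0..63.
    if not 0 <= field_value <= 63:
        raise ValueError("field value must be in 0..63")
    return (field_value + 1) * 4
```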
Number of shared VGPR blocks when executing in subvector mode. For
wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
9:4
6 bits
INST_PREF_SIZE
GFX10
Reserved, must be 0.
GFX11
Number of instruction bytes to prefetch, starting at the kernel’s entry
point instruction, before wavefront starts execution. The value is 0..63
with a granularity of 128 bytes.
10
1 bit
TRAP_ON_START
GFX10
Reserved, must be 0.
GFX11
Must be 0.
If 1, wavefront starts execution by trapping into the trap handler.
CP is responsible for filling in the trap on start bit in
COMPUTE_PGM_RSRC3.TRAP_ON_START according to what the runtime
requests.
11
1 bit
TRAP_ON_END
GFX10
Reserved, must be 0.
GFX11
Must be 0.
If 1, wavefront execution terminates by trapping into the trap handler.
CP is responsible for filling in the trap on end bit in
COMPUTE_PGM_RSRC3.TRAP_ON_END according to what the runtime requests.
30:12
19 bits
Reserved, must be 0.
31
1 bit
IMAGE_OP
GFX10
Reserved, must be 0.
GFX11
If 1, the kernel execution contains image instructions. If executed as
part of a graphics pipeline, image read instructions will stall waiting
for any necessary WAIT_SYNC fence to be performed in order to
indicate that earlier pipeline stages have completed writing to the
image.
Not used for compute kernels that are not part of a graphics pipeline and
must be 0.
Number of instruction bytes to prefetch, starting at the kernel’s entry
point instruction, before wavefront starts execution. The value is 0..255
with a granularity of 128 bytes.
12
1 bit
RESERVED
Reserved, must be 0.
13
1 bit
GLG_EN
If 1, group launch guarantee will be enabled for this dispatch.
16:14
3 bits
RESERVED
NAMED_BAR_CNT
GFX120*
Reserved, must be 0.
GFX125*
Number of named barriers to allocate for each workgroup, in granularity of
4. Range is from 0-4 allocating 0, 4, 8, 12, 16.
17
1 bit
RESERVED
ENABLE_DYNAMIC_VGPR
GFX120*
Reserved, must be 0.
GFX125*
Enables dynamic VGPR mode, where each wave allocates one VGPR chunk
at launch and can request additional space to use during
execution in SQ.
Used by CP to set up COMPUTE_PGM_RSRC3.DYNAMIC_VGPR.
If 1, the kernel execution contains image instructions. If executed as
part of a graphics pipeline, image read instructions will stall waiting
for any necessary WAIT_SYNC fence to be performed in order to
indicate that earlier pipeline stages have completed writing to the
image.
Not used for compute kernels that are not part of a graphics pipeline and
must be 0.
32
Total size 4 bytes.
Table 80 Floating Point Rounding Mode Enumeration Values
Table 82 Floating Point Denorm Mode Enumeration Values
Enumeration Name
Value
Description
FLOAT_DENORM_MODE_FLUSH_SRC_DST
0
Flush Source and Destination Denorms
FLOAT_DENORM_MODE_FLUSH_DST
1
Flush Output Denorms
FLOAT_DENORM_MODE_FLUSH_SRC
2
Flush Source Denorms
FLOAT_DENORM_MODE_FLUSH_NONE
3
No Flush
Denormal flushing is sign respecting, i.e., the behavior expected by
"denormal-fp-math"="preserve-sign". The behavior is undefined with
"denormal-fp-math"="positive-zero"
Table 83 System VGPR Work-Item ID Enumeration Values
This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.
The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the enable_sgpr_* bit
fields (see Kernel Descriptor). The register numbers used
for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.
The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the enable_sgpr_* bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.
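The dense numbering rule described above can be sketched as a simple pass over the registers in their defined order. The entry list used here is a hypothetical illustration, not the actual set-up order table, and the function name is an assumption.

```python
from typing import Dict, List, Tuple


def assign_dense_registers(entries: List[Tuple[str, int, bool]]) -> Dict[str, int]:
    """Map each enabled entry to its first register number.

    entries holds (name, number_of_registers, enabled) tuples in the
    defined order. Disabled entries receive no register number.
    """
    next_reg = 0
    assigned = {}
    for name, count, enabled in entries:
        if not enabled:
            continue  # disabled registers do not consume a number
        assigned[name] = next_reg
        next_reg += count
    return assigned
```

For example, with the hypothetical entries `[("A", 4, True), ("B", 2, False), ("C", 2, True)]`, entry A starts at register 0 and entry C at register 4, because B is disabled and takes no numbers.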
The 32-bit byte size of a
single work-item’s memory
allocation. This is the
value from the kernel
dispatch packet Private
Segment Byte Size rounded up
by CP to a multiple of
DWORD.
Having CP load it once avoids
loading it at the beginning of
every wavefront.
This is not used for
GFX7-GFX8 since it is the same
value as the second SGPR of
Flat Scratch Init. However, it
may be needed for GFX9-GFX11 which
changes the meaning of the
Flat Scratch Init value.
The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the enable_vgpr* bit
fields (see Kernel Descriptor). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
VGPR number.
There are different methods used for the VGPR initial state:
Table 85 VGPR Register Set Up Order for Unpacked Work-Item ID Method
VGPR Order
Name
(kernel descriptor enable
field)
Number
of
VGPRs
Description
First
Work-Item Id X
(Always initialized)
1
32-bit work-item id in X
dimension of work-group for
wavefront lane.
then
Work-Item Id Y
(enable_vgpr_workitem_id
> 0)
1
32-bit work-item id in Y
dimension of work-group for
wavefront lane.
then
Work-Item Id Z
(enable_vgpr_workitem_id
> 1)
1
32-bit work-item id in Z
dimension of work-group for
wavefront lane.
Table 86 Register Layout for Packed Work-Item ID Method
Bits
Size
Field Name
Description
0:9
10 bits
Work-Item Id X
Work-item id in X
dimension of work-group for
wavefront lane.
Always initialized.
10:19
10 bits
Work-Item Id Y
Work-item id in Y
dimension of work-group for
wavefront lane.
Initialized if enable_vgpr_workitem_id >
0, otherwise set to 0.
20:29
10 bits
Work-Item Id Z
Work-item id in Z
dimension of work-group for
wavefront lane.
Initialized if enable_vgpr_workitem_id >
1, otherwise set to 0.
30:31
2 bits
Reserved, set to 0.
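The packed layout above places three 10-bit work-item ids in a single 32-bit register. A minimal sketch of how such a value could be decoded follows; the function names are illustrative, and real kernels would do this with bit-field extract instructions rather than Python.

```python
from typing import Tuple

WORK_ITEM_ID_MASK = 0x3FF  # each id occupies 10 bits


def unpack_work_item_id(packed: int) -> Tuple[int, int, int]:
    x = packed & WORK_ITEM_ID_MASK          # bits 0..9
    y = (packed >> 10) & WORK_ITEM_ID_MASK  # bits 10..19
    z = (packed >> 20) & WORK_ITEM_ID_MASK  # bits 20..29
    return x, y, z


def pack_work_item_id(x: int, y: int, z: int) -> int:
    # Inverse of unpack_work_item_id; bits 30..31 stay 0 (reserved).
    return (x & WORK_ITEM_ID_MASK) | ((y & WORK_ITEM_ID_MASK) << 10) | (
        (z & WORK_ITEM_ID_MASK) << 20
    )
```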
The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
SGPRs before the Work-Group Ids are set by CP using the 16 User Data
registers.
Work-group Id registers X, Y, Z are set by ADC which supports any
combination including none.
Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
its value cannot be included with the flat scratch init value which is per
queue (see Flat Scratch).
The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
or (X, Y, Z).
Flat Scratch register pair initialization is described in
Flat Scratch.
The global segment can be accessed either using buffer instructions (GFX6 which
has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
instructions (GFX9-GFX11).
If buffer operations are used, then the compiler can generate a V# with the
following properties:
base address of 0
no swizzle
ATC: 1 if IOMMU present (such as APU)
ptr64: 1
MTYPE set to support memory coherence that matches the runtime (such as CC for
APU and NC for dGPU).
On hardware that supports this feature, kernel arguments can be preloaded into
User SGPRs, up to the maximum number of User SGPRs available. The allocation of
Preload SGPRs occurs directly after the last enabled non-kernarg preload User
SGPR. (See Initial Kernel Execution State.)
The preloaded data is copied from the kernarg segment; the amount of data is
determined by the value specified in the kernarg_preload_spec_length field of
the kernel descriptor. This data is then loaded into consecutive User SGPRs. The
number of SGPRs receiving preloaded kernarg data corresponds with the value
given by kernarg_preload_spec_length. The preloading starts at the dword offset
within the kernarg segment, which is specified by the
kernarg_preload_spec_offset field.
If the kernarg_preload_spec_length is non-zero, the CP firmware will append an
additional 256 bytes to the kernel_code_entry_byte_offset. This addition
facilitates the incorporation of a prologue to the kernel entry to handle cases
where code designed for kernarg preloading is executed on hardware equipped with
incompatible firmware. If hardware has compatible firmware the 256 bytes at the
start of the kernel entry will be skipped.
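The relationship between the two preload fields and the entry point adjustment can be sketched as below. This assumes firmware that supports kernarg preloading; the function names are hypothetical and the sketch models only the arithmetic stated above.

```python
KERNARG_PRELOAD_PROLOGUE_BYTES = 256  # skipped by compatible firmware


def preloaded_kernarg_dwords(spec_offset: int, spec_length: int) -> range:
    # Dword indices of the kernarg segment copied into consecutive User
    # SGPRs: spec_length dwords starting at dword offset spec_offset.
    return range(spec_offset, spec_offset + spec_length)


def effective_entry_byte_offset(kernel_code_entry_byte_offset: int,
                                spec_length: int) -> int:
    # With a non-zero preload length, compatible firmware starts execution
    # 256 bytes past the entry point given in the kernel descriptor.
    if spec_length != 0:
        return kernel_code_entry_byte_offset + KERNARG_PRELOAD_PROLOGUE_BYTES
    return kernel_code_entry_byte_offset
```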
With code object V5 and later, hidden kernel arguments that are normally
accessed through the Implicit Argument Ptr, may be preloaded into User SGPRs.
These arguments are added to the kernel function signature and are marked with
the attributes “inreg” and “amdgpu-hidden-argument”. (See
AMDGPU LLVM IR Attributes).
The compiler performs initialization in the kernel prologue depending on the
target and information about things like stack usage in the kernel and called
functions. Some of this initialization requires the compiler to request certain
User and System SGPRs be present in the
Initial Kernel Execution State via the
Kernel Descriptor.
The CFI CFA is defined using an expression which evaluates to a location
description that comprises one memory location description for the
DW_ASPACE_AMDGPU_private_lane address space address 0.
The M0 register must be initialized with a value of at least the total LDS size
if the kernel may access LDS via DS or flat operations. The total LDS size is
available in the dispatch packet. Alternatively, M0 can be set to the maximum
possible LDS size for the given target (0x7FFF for GFX6 and 0xFFFF for
GFX7-GFX8).
GFX9 and later
The M0 register is not used for range checking LDS accesses and so does not
need to be initialized in the prolog.
If the kernel has function calls it must set up the ABI stack pointer described
in Non-Kernel Functions by setting
SGPR32 to the unswizzled scratch offset of the address past the last local
allocation.
If the kernel needs a frame pointer for the reasons defined in
SIFrameLowering then SGPR33 is used and is always set to 0 in the
kernel prolog. On GFX12+, when dynamic VGPRs are enabled, the prologue will
check if the kernel is running on a compute queue, and if so it will reserve
some scratch space for any dynamic VGPRs that might need to be saved by the
CWSR trap handler. In this case, the frame pointer will be initialized to
a suitably aligned offset above this reserved area. If a frame pointer is not
required then all uses of the frame pointer are replaced with immediate 0
offsets.
There are different methods used for initializing flat scratch:
If the Target Properties column of AMDGPU Processors
specifies Does not support generic address space:
Flat scratch is not supported and there is no flat scratch register pair.
If the Target Properties column of AMDGPU Processors
specifies Offset flat scratch:
If the kernel or any function it calls may use flat operations to access
scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
(FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
Scratch Wavefront Offset SGPR registers (see
Initial Kernel Execution State):
The low word of Flat Scratch Init is the 32-bit byte offset from
SH_HIDDEN_PRIVATE_BASE_VIMID to the base of scratch backing memory
being managed by SPI for the queue executing the kernel dispatch. This is
the same value used in the Scratch Segment Buffer V# base address.
CP obtains this from the runtime. (The Scratch Segment Buffer base address
is SH_HIDDEN_PRIVATE_BASE_VIMID plus this offset.)
The prolog must add the value of Scratch Wavefront Offset to get the
wavefront’s byte scratch backing memory offset from
SH_HIDDEN_PRIVATE_BASE_VIMID.
The Scratch Wavefront Offset must also be used as an offset with Private
segment address when using the Scratch Segment Buffer.
Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
shifted by 8 before moving into FLAT_SCRATCH_HI.
FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
SGPRn is the highest numbered SGPR allocated to the wavefront).
FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
added to SH_HIDDEN_PRIVATE_BASE_VIMID to calculate the per wavefront
FLAT SCRATCH BASE in flat memory instructions that access the scratch
aperture.
The second word of Flat Scratch Init is the 32-bit byte size of a single
work-item's scratch memory usage.
CP obtains this from the runtime, and it is always a multiple of DWORD. CP
checks that the value in the kernel dispatch packet Private Segment Byte
Size is not larger and requests the runtime to increase the queue’s scratch
size if necessary.
CP directly loads from the kernel dispatch packet Private Segment Byte Size
field and rounds up to a multiple of DWORD. Having CP load it once avoids
loading it at the beginning of every wavefront.
The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
in flat memory instructions.
If the Target Properties column of AMDGPU Processors
specifies Absolute flat scratch:
If the kernel or any function it calls may use flat operations to access
scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
(FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
Initial Kernel Execution State):
The Flat Scratch Init is the 64-bit address of the base of scratch backing
memory being managed by SPI for the queue executing the kernel dispatch.
CP obtains this from the runtime.
The kernel prolog must add the value of the wave’s Scratch Wavefront Offset
and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
memory instructions.
The Scratch Wavefront Offset must also be used as an offset with Private
segment address when using the Scratch Segment Buffer (see
Private Segment Buffer).
If the Target Properties column of AMDGPU Processors
specifies Architected flat scratch:
If ENABLE_PRIVATE_SEGMENT is enabled in
compute_pgm_rsrc2 for GFX6-GFX12 then the FLAT_SCRATCH
register pair will be initialized to the 64-bit address of the base of scratch
backing memory being managed by SPI for the queue executing the kernel
dispatch plus the value of the wave’s Scratch Wavefront Offset for use as the
flat scratch base in flat memory instructions.
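The prolog arithmetic for the Offset and Absolute flat scratch methods described above can be sketched as follows. This models only the register values, under the assumption that the additions wrap at 32 and 64 bits respectively; the function and parameter names are illustrative and do not correspond to compiler APIs.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def offset_flat_scratch(init_lo: int, init_hi: int, wave_offset: int):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the Offset method.

    init_lo: byte offset of the queue's scratch base (first word of
             Flat Scratch Init).
    init_hi: per-work-item scratch size in bytes (second word).
    """
    # Add the wavefront offset, then convert to units of 256 bytes.
    flat_scratch_hi = ((init_lo + wave_offset) & MASK32) >> 8
    # The size word is moved unchanged into FLAT_SCRATCH_LO.
    flat_scratch_lo = init_hi
    return flat_scratch_lo, flat_scratch_hi


def absolute_flat_scratch(init64: int, wave_offset: int):
    """Return (low word, high word) of the 64-bit base for the Absolute method."""
    base = (init64 + wave_offset) & MASK64
    return base & MASK32, base >> 32
```

In the Offset case the hardware later multiplies FLAT_SCRATCH_HI by 256 again, which is why the offset is expected to be a multiple of 256 bytes.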
If the Target Properties column of AMDGPU Processors specifies
Architected flat scratch then a Private Segment Buffer is not supported.
Instead the flat SCRATCH instructions are used.
Otherwise, Private Segment Buffer SGPR register is used to initialize 4 SGPRs
that are used as a V# to access scratch. CP uses the value provided by the
runtime. It is used, together with Scratch Wavefront Offset as an offset, to
access the private memory space using a segment address. See
Initial Kernel Execution State.
The scratch V# is a four-aligned SGPR and always selected for the kernel as
follows:
If it is known during instruction selection that there is stack usage,
SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
optimizations are disabled (-O0), if stack objects already exist (for
locals, etc.), or if there are any function calls.
Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
are reserved for the tentative scratch V#. These will be used if it is
determined that spilling is needed.
If no use is made of the tentative scratch V#, then it is unreserved,
and the register count is determined ignoring it.
If use is made of the tentative scratch V#, then its register numbers
are shifted to the first four-aligned SGPR index after the highest one
allocated by the register allocator, and all uses are updated. The
register count includes them in the shifted location.
In either case, if the processor has the SGPR allocation bug, the
tentative allocation is not shifted or unreserved in order to ensure
the register count is higher to work around the bug.
Note
This approach of using a tentative scratch V# and shifting the register
numbers if used avoids having to perform register allocation a second
time if the tentative V# is eliminated. This is more efficient and
avoids the problem that the second register allocation may perform
spilling which will fail as there is no longer a scratch V#.
When the kernel prolog code is being emitted it is known whether the scratch V#
described above is actually used. If it is, the prolog code must set it up by
copying the Private Segment Buffer to the scratch V# registers and then adding
the Private Segment Wavefront Offset to the queue base address in the V#. The
result is a V# with a base address pointing to the beginning of the wavefront
scratch backing memory.
The Private Segment Buffer is always requested, but the Private Segment
Wavefront Offset is only requested if it is used (see
Initial Kernel Execution State).
The barrier execution model is experimental and subject to change.
Threads can synchronize execution by performing barrier operations on barrier objects as described below:
Each barrier object has the following state:
An unsigned positive integer expected count: counts the number of arrive operations
expected for this barrier object.
An unsigned non-negative integer arrive count: counts the number of arrive operations
already performed on this barrier object.
The initial value of arrive count is zero.
When an operation causes arrive count to be equal to expected count, the barrier is completed,
and the arrive count is reset to zero.
Barrier-mutually-exclusive is a symmetric relation between barrier objects that share resources
in a way that restricts how a thread can use them at the same time.
Barrier operations are performed on barrier objects. A barrier operation is a dynamic instance
of one of the following:
Barrier init
Barrier init takes an additional unsigned positive integer argument k.
Sets the expected count of the barrier object to k.
Resets the arrive count of the barrier object to zero.
Barrier join.
Allow the thread that executes the operation to wait on a barrier object.
Barrier drop.
Decrements expected count of the barrier object by one.
Barrier arrive.
Increments the arrive count of the barrier object by one.
If supported, an additional argument to arrive can also update the expected count of the
barrier object before the arrive count is incremented;
the new expected count cannot be less than or equal to the arrive count,
otherwise the behavior is undefined.
Barrier wait.
Introduces execution dependencies between threads; this operation depends on
other barrier operations to complete.
Barrier modification operations are barrier operations that modify the barrier object state:
Barrier init.
Barrier drop.
Barrier arrive.
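The state changes made by the barrier modification operations can be illustrated with a small sequential model. This is a sketch of the counting rules only: it does not model join, wait, the optional expected count argument to arrive, or any ordering between threads, and it signals the cases the text calls undefined behavior by raising exceptions. The class and attribute names are assumptions.

```python
class BarrierObject:
    """Sequential model of a barrier object's expected and arrive counts."""

    def __init__(self) -> None:
        self.expected_count = None  # unset until a barrier init occurs
        self.arrive_count = 0       # initial arrive count is zero
        self.completions = 0        # number of times the barrier completed

    def _require_init(self) -> None:
        if self.expected_count is None:
            raise RuntimeError("undefined behavior: barrier not initialized")

    def _maybe_complete(self) -> bool:
        # The barrier completes when arrive count equals expected count;
        # the arrive count is then reset to zero.
        if self.arrive_count == self.expected_count:
            self.arrive_count = 0
            self.completions += 1
            return True
        return False

    def init(self, k: int) -> None:
        if k <= 0:
            raise ValueError("k must be a positive integer")
        self.expected_count = k
        self.arrive_count = 0

    def drop(self) -> bool:
        self._require_init()
        if self.expected_count == 0:
            raise RuntimeError("undefined behavior: negative expected count")
        self.expected_count -= 1
        return self._maybe_complete()

    def arrive(self) -> bool:
        self._require_init()
        self.arrive_count += 1
        return self._maybe_complete()
```

For instance, after `init(3)`, two arrives followed by one drop complete the barrier, since the drop lowers the expected count to match the arrive count.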
Thread-barrier-order<BO> is the subset of program-order that only
relates barrier operations performed on a barrier object BO.
All barrier modification operations on a barrier object BO occur in a strict total order called
barrier-modification-order<BO>; it is the order in which BO observes barrier
operations that change its state. For any valid barrier-modification-order<BO>, the
following must be true:
Let A and B be two barrier modification operations where A->B in
thread-barrier-order<BO>, then A->B is also in barrier-modification-order<BO>.
The first element in barrier-modification-order<BO> is always a barrier init, otherwise
the behavior is undefined.
barrier-participates-in relates barrier operations to the barrier waits that depend on them
to complete. A barrier operation X barrier-participates-in a barrier wait W
if and only if all of the following are true:
X and W are both performed on the same barrier object BO.
X is a barrier arrive or drop operation.
X does not barrier-participate-in another distinct barrier wait W' in the same thread as W.
W->X is not in thread-barrier-order<BO>.
All dependent constraints and relations are satisfied as well. [0]
For the set S consisting of all barrier operations that barrier-participate-in a barrier wait W for some
barrier object BO:
The elements of S all exist in a continuous, uninterrupted interval of barrier-modification-order<BO>.
The arrive count of BO is zero before the first operation of S in barrier-modification-order<BO>.
The arrive count and expected count of BO are equal after the last operation of S in
barrier-modification-order<BO>. The arrive count and expected count of BO cannot
be equal at any other point in S.
A barrier join J is barrier-joined-before a barrier operation X if and only if all
of the following are true:
J->X in thread-barrier-order<BO>.
X is not a barrier join.
There is no barrier join or drop JD where J->JD->X in thread-barrier-order<BO>.
There is no barrier join J' on a distinct barrier object BO' such that J->J'->X in
program-order, and BO barrier-mutually-exclusive BO'.
A barrier operation A barrier-executes-before another barrier operation B if any of the
following is true:
A->B in program-order.
A->B in barrier-participates-in.
A barrier-executes-before some barrier operation X, and X barrier-executes-before B.
Barrier-executes-before is consistent with barrier-modification-order<BO>
for every barrier object BO.
For every barrier drop D performed on a barrier object BO:
There is a barrier join J such that J->D in barrier-joined-before;
otherwise, the behavior is undefined.
D cannot cause the expected count of BO to become negative; otherwise, the behavior is undefined.
For every pair of barrier arrive A and barrier drop D performed on a barrier object BO, such that A->D in thread-barrier-order<BO>, one of the following must be true:
A does not barrier-participate-in any barrier wait.
A barrier-participates-in at least one barrier wait W
such that W->D in barrier-executes-before.
For every barrier wait W performed on a barrier object BO:
There is a barrier join J such that J->W in barrier-joined-before, and
J must barrier-execute-before at least one operation X that
barrier-participates-in W; otherwise, the behavior is undefined.
barrier-phase-with is a symmetric relation over barrier operations defined as the
transitive closure of: barrier-participates-in and its inverse relation.
For every barrier operation A that barrier-participates-in a barrier wait W on a barrier object BO:
There is no barrier operation X on BO such that A->X->W in
barrier-executes-before, and X barrier-phase-with a non-empty set of operations
that does not include W.
Note
Barriers only synchronize execution and do not affect the visibility of memory operations between threads.
Refer to the execution barriers memory model
to determine how to synchronize memory operations through barrier-executes-before.
This section covers properties of barrier operations and objects that are specific to the implementation of
barriers in AMDGPU hardware.
Barrier operations have the following additional target-specific properties:
Barrier operations are convergent within a wave. All threads of a wavefront use the same barrier object when
performing any barrier operation.
Thus, barrier operations can only be performed in wave-uniform control flow.
All barrier objects have the following additional target-specific properties:
Barrier objects are allocated and managed by the hardware.
Barrier objects are stored in an unspecified memory region that does not alias with
any other address space. Updates to the barrier object are not done in order with any other
memory operation in any other address space.
Barrier objects exist within a scope (see AMDHSA LLVM Sync Scopes),
and each instance of a barrier object can only be accessed by threads in the scope where
the instance lives. The following scopes are supported:
workgroup.
cluster.
See AMDGPU LLVM IR Intrinsics for more information on how to perform barrier operations using
LLVM IR intrinsic calls, or see the sections below to perform barrier operations using machine code.
Informally, we can deduce from the above formal model that execution barriers behave as follows:
Synchronization of threads always happens at a wavefront granularity.
Barrier-executes-before relates the dynamic instances of operations from different threads together.
For example, if A->B in barrier-executes-before, then the execution of A must complete
before the execution of B can complete.
This property can also be combined with program-order. For example, let two (non-barrier) operations
X and Y where X->A and B->Y in program-order, then we know that the execution
of X completes before the execution of Y does.
Barriers do not complete “out-of-thin-air”; a barrier wait W cannot depend on a barrier operation
X to complete if W->X in barrier-executes-before.
It is undefined behavior to operate on an uninitialized barrier object.
It is undefined behavior for a barrier wait to never complete.
It is not mandatory to drop a barrier after joining it.
A thread may not arrive and then drop a barrier object unless the barrier completes before the
barrier drop. Incrementing the arrive count and decrementing the expected count directly
after may cause undefined behavior.
Joining a barrier is only useful if the thread will wait on that same barrier object later.
Targets from GFX6 through GFX11 inclusive do not have the split barrier feature.
The barrier arrive and barrier wait operations cannot be performed independently.
There is only one workgroup barrier object of workgroup scope that is implicitly used
by all barrier operations.
The following code sequences can be used to implement the barrier operations described by the above specification:
Automatically initialized by the hardware when a workgroup
is launched. The expected count of this barrier is set
to the number of waves in the workgroup.
join
Workgroup barrier
Any thread launched within a workgroup automatically joins
this barrier object.
drop
Workgroup barrier
When a thread ends, it automatically drops this barrier
object if it had previously joined it.
Arrive and Wait
arrive then wait
Workgroup barrier
BackOffBarrier
s_barrier
No BackOffBarrier
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
s_barrier
If the target does not have the BackOffBarrier feature,
then there cannot be any outstanding memory operations
before issuing the s_barrier instruction.
The waitcnts can independently be moved earlier, or
removed entirely as long as the associated
counter remains at zero before issuing the
s_barrier instruction.
The s_barrier instruction cannot complete
before all waves of the workgroup have launched.
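The arrive-then-wait sequence above can be sketched as a small selection function. This is only an illustrative model, not the backend's implementation: the function name and both parameters are hypothetical, and the `has_vscnt` switch reflects the assumption that the separate store counter waited on by `s_waitcnt_vscnt` exists only on GFX10 and later targets.

```python
def arrive_then_wait(has_back_off_barrier, has_vscnt=True):
    """Return the instruction sequence for a workgroup barrier (GFX6-GFX11).

    Hypothetical helper: both parameters are illustrative names, not
    LLVM subtarget feature queries.
    """
    seq = []
    if not has_back_off_barrier:
        # Without BackOffBarrier, no memory operation may be outstanding
        # when s_barrier is issued, so wait for every counter to reach zero.
        seq.append("s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)")
        if has_vscnt:
            # Assumption: the vscnt counter only exists on GFX10 and later.
            seq.append("s_waitcnt_vscnt null, 0x0")
    seq.append("s_barrier")
    return seq


print(arrive_then_wait(has_back_off_barrier=False))
```

As the table notes, the waits may be hoisted or dropped when the associated counters are already known to be zero; this sketch always emits them.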
GFX12 targets have the split-barrier feature, and also offer multiple barrier objects per workgroup
(see AMDHSA Execution Barriers IDs GFX12). Each barrier object has a unique barrier ID that
instructions use to operate on them.
GFX12.5 additionally introduces new barrier objects that offer more flexibility for synchronizing the execution
of a subset of waves of a workgroup, or synchronizing execution across workgroups within a workgroup cluster.
Note
Check the table below to determine which barrier IDs are
available to the shader on a given target.
The following code sequences can be used to implement the barrier operations described by the above specification:
Automatically initialized by the hardware when a workgroup
is launched. The expected count of this barrier is set
to the number of waves in the workgroup.
init
-4, -3
Automatically initialized by the hardware when a workgroup
is launched as part of a workgroup cluster.
The expected count of this barrier is set to the number
of workgroups in the workgroup cluster.
init
0
Automatically initialized by the hardware and always
available. This barrier object is opaque and immutable
as all operations other than barrier join are no-ops.
init
[1,16]
s_barrier_init <N>
<N> is an immediate constant, or stored in the lower
half of m0.
The value to set as the expected count of the barrier
is stored in the upper half of m0.
join
-2, -1
Any thread launched within a workgroup automatically joins
this barrier object.
join
-4, -3
Any thread launched within a workgroup cluster
automatically joins this barrier object.
join
0
[1,16]
s_barrier_join <N>
<N> is an immediate constant, or stored in the lower
half of m0.
drop
0
[1,16]
s_barrier_leave
s_barrier_leave takes no operand. It can only be used
to drop a barrier object BO if BO was
previously joined using s_barrier_join.
Drops the barrier object BO if and only if
there is a barrier join J such that J is
barrier-joined-before this barrier
drop operation.
drop
-2, -1
-4, -3
When a thread ends, it automatically drops this barrier
object if it had previously joined it.
Arrive and Wait
arrive
-4, -3
-2, -1
0
[1,16]
s_barrier_signal <N>
Or
s_barrier_signal_isfirst <N>
<N> is an immediate constant, or stored in bits [4:0] of m0.
The _isfirst variant sets SCC=1 if this wave is the first
to signal the barrier, otherwise SCC=0.
For barrier objects [1,16]: When using m0 as an operand,
if there is a non-zero value contained in the bits [22:16] of m0,
the expected count of the barrier object is set to that value before
the arrive count of the barrier object is incremented.
The new expected count value must be greater than or equal to the
arrive count, otherwise the behavior is undefined.
For barrier objects -4 and -3
(cluster barriers): only one wave
per workgroup may arrive at the barrier on behalf of
its entire workgroup. However, any wave within the workgroup
cluster can then wait on this barrier object.
This is a no-op on the NULL named barrier object
(barrier object 0).
wait
-4, -3
-2, -1
0
[1,16]
s_barrier_wait <N>
<N> is an immediate constant.
For barrier objects -2 and -1: This instruction
cannot complete before all waves of the
workgroup have launched.
For barrier objects -4 and -3 (cluster barriers):
This instruction cannot complete before all waves of the
workgroup cluster have launched.
This is a no-op on the NULL named barrier object
(barrier object 0).
For named barrier objects, this instruction always waits on the
last named barrier object that the thread has joined, even
if it is different from the barrier object passed to the
instruction.
Cluster trap barrier; cluster barrier object for use by
all workgroups of a workgroup cluster. Dedicated for the trap
handler and only available in privileged execution mode
(not accessible by the shader).
-3
cluster
GFX12.5
Cluster user barrier; cluster barrier object for use by
all workgroups of a workgroup cluster.
-2
workgroup
GFX12 (all)
Workgroup trap barrier, dedicated for the trap handler and
only available in privileged execution mode
(not accessible by the shader).
-1
workgroup
GFX12 (all)
Workgroup barrier.
0
workgroup
GFX12.5
NULL named barrier object. Barrier-mutually-exclusive with
barriers [1,16].
[1,16]
workgroup
GFX12.5
Named barrier object. All barrier objects in this range are
barrier-mutually-exclusive with other barriers in [0,16].
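The barrier ID table above can be summarized as a lookup, which may help when checking whether an ID is usable by a shader on a given target. This is a minimal sketch: `BarrierInfo` and `describe_barrier` are hypothetical names, and the scope and availability recorded for ID -4 are inferred from its description as a cluster barrier rather than read from the table.

```python
from typing import NamedTuple


class BarrierInfo(NamedTuple):
    scope: str               # "workgroup" or "cluster"
    availability: str        # targets that provide the barrier object
    shader_accessible: bool  # False for trap-handler-only barriers
    kind: str


_FIXED_BARRIERS = {
    # Assumption: scope/availability of -4 inferred from its description.
    -4: BarrierInfo("cluster", "GFX12.5", False, "cluster trap barrier"),
    -3: BarrierInfo("cluster", "GFX12.5", True, "cluster user barrier"),
    -2: BarrierInfo("workgroup", "GFX12 (all)", False, "workgroup trap barrier"),
    -1: BarrierInfo("workgroup", "GFX12 (all)", True, "workgroup barrier"),
    0: BarrierInfo("workgroup", "GFX12.5", True, "NULL named barrier"),
}


def describe_barrier(barrier_id):
    """Classify a GFX12 barrier ID according to the table above."""
    if barrier_id in _FIXED_BARRIERS:
        return _FIXED_BARRIERS[barrier_id]
    if 1 <= barrier_id <= 16:
        return BarrierInfo("workgroup", "GFX12.5", True, "named barrier")
    raise ValueError(f"unknown barrier ID: {barrier_id}")


print(describe_barrier(-1))
```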
Informally, we can note that:
All operations on the NULL named barrier object other than join are no-ops.
As the NULL named barrier object (barrier ID 0) is barrier-mutually-exclusive with all other
named barrier objects (barrier IDs [1,16]), a thread can use a join on the NULL
barrier as a way to “unjoin” a named barrier (break barrier-joined-before) without
having to use a drop operation.
When a thread ends, it does not implicitly drop any named barrier objects
(barrier IDs [0,16]) it has joined.
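For named barriers, the m0 operand layout described for the arrive operation (barrier ID in bits [4:0], optional new expected count in bits [22:16]) can be illustrated with a small packing helper. This is a sketch under those stated bit positions only; the function names are hypothetical, and the helper is deliberately restricted to named barrier IDs [1,16] because the text does not say how negative IDs would be encoded in m0.

```python
def pack_m0_for_named_signal(barrier_id, expected_count=0):
    """Build an m0 value for an arrive on a named barrier (IDs 1..16).

    expected_count == 0 leaves the barrier's expected count unchanged,
    since only a non-zero value in bits [22:16] updates it.
    """
    if not 1 <= barrier_id <= 16:
        raise ValueError("named barrier IDs are in the range [1, 16]")
    if not 0 <= expected_count <= 0x7F:
        raise ValueError("expected count must fit in bits [22:16]")
    return (expected_count << 16) | barrier_id


def unpack_m0(m0):
    """Return (barrier_id, expected_count) from an m0 value."""
    return m0 & 0x1F, (m0 >> 16) & 0x7F


m0 = pack_m0_for_named_signal(barrier_id=3, expected_count=4)
print(hex(m0), unpack_m0(m0))
```

Note that the new expected count must still be greater than or equal to the current arrive count; the helper cannot check that, as it depends on run-time barrier state.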
The AMDGPU backend supports the memory synchronization scopes specified in
Memory Scopes.
The code sequences used to implement the memory model specify the order of
instructions that a single thread must execute. The s_waitcnt and cache
management instructions such as buffer_wbinvl1_vol are defined with respect
to other memory instructions executed by the same thread. This allows them to be
moved earlier or later which can allow them to be combined with other instances
of the same instruction, or hoisted/sunk out of loops to improve performance.
Only the instructions related to the memory model are given; additional
s_waitcnt instructions are required to ensure registers are defined before
being used. These may be able to be combined with the memory model s_waitcnt
instructions as described above.
The AMDGPU backend supports the following memory models:
The OpenCL memory model which has separate happens-before relations for the
global and local address spaces. Only a fence specifying both global and
local address space, and seq_cst instructions join the relationships. Since
the LLVM fence instruction does not allow an address space to be
specified, the OpenCL fence has to conservatively assume both local and
global address spaces were specified. However, optimizations can often be
done to eliminate the additional s_waitcnt instructions when there are
no intervening memory instructions which access the corresponding address
space. The code sequences in the table indicate what can be omitted for the
OpenCL memory model. The target triple environment is used to determine if the
source language is OpenCL (see OpenCL).
ds/flat_load/store/atomic instructions to local memory are termed LDS
operations.
buffer/global/flat_load/store/atomic instructions to global memory are
termed vector memory operations.
global_load_lds or buffer/global_load instructions with the lds flag
are LDS DMA loads. They interact with caches as if the loaded data were
being loaded to registers and not to LDS, and therefore support the same
cache modifiers. They cannot be performed atomically. They implement volatile
(via aux/cpol bit 31) and nontemporal (via metadata) as if they were loads
from the global address space.
The LDS DMA instructions are synchronous by default, which means that the
compiler will automatically ensure that the corresponding operation has
completed before its side-effects are used. The asynchronous
versions of these same instructions perform the same
operations, but without automatic tracking in the compiler; the user must
explicitly track the completion of these instructions before using their
side-effects.
Private address space uses buffer_load/store using the scratch V#
(GFX6-GFX8), or scratch_load/store (GFX9-GFX11). Since only a single thread
is accessing the memory, atomic memory orderings are not meaningful, and all
accesses are treated as non-atomic.
Constant address space uses buffer/global_load instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch it is not legal to perform
stores, and atomic memory orderings are not meaningful, and all accesses are
treated as non-atomic.
A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.
When a work-group’s maximum flat work-group size does not exceed the wavefront
size, the work-group fits within a single wavefront. In this case, LLVM
workgroup synchronization scope is equivalent to wavefront scope.
If the compiler can determine this bound (e.g., via amdgpu-flat-work-group-size),
the AMDGPU backend optimizes workgroup scope operations by lowering them to
wavefront-scoped machine instructions.
This lowering applies to atomic load, store, atomicrmw, and cmpxchg
instructions, and to fence instructions, when they use synchronizing memory
orderings (acquire, release, acq_rel, or seq_cst).
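The workgroup-to-wavefront scope narrowing described above can be sketched as follows. This is an illustrative model rather than the backend's code: the function and parameter names are hypothetical, and `max_flat_work_group_size=None` stands for the case where the compiler cannot determine a bound.

```python
SYNCHRONIZING_ORDERINGS = {"acquire", "release", "acq_rel", "seq_cst"}


def effective_sync_scope(scope, ordering, max_flat_work_group_size, wavefront_size):
    """Return the scope used for code generation.

    A workgroup-scoped synchronizing operation is narrowed to wavefront
    scope when the whole work-group is known to fit in one wavefront.
    """
    fits_in_one_wavefront = (
        max_flat_work_group_size is not None
        and max_flat_work_group_size <= wavefront_size
    )
    if (
        scope == "workgroup"
        and ordering in SYNCHRONIZING_ORDERINGS
        and fits_in_one_wavefront
    ):
        return "wavefront"
    return scope


print(effective_sync_scope("workgroup", "release", 64, 64))
```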
The memory model does not support the region address space which is treated as
non-atomic.
Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.
Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.
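The rules above for orderings and address spaces that are not meaningful can be collected into one normalization step. The sketch below is only a model of those rules: the function name and the string `"non-atomic"` used as a result are hypothetical conventions, not LLVM identifiers.

```python
NON_ATOMIC_ADDRESS_SPACES = {"private", "constant", "region"}


def effective_ordering(op, ordering, address_space="global"):
    """Map an atomic operation's ordering to the one actually implemented.

    op is "load", "store", or "atomicrmw".
    """
    if address_space in NON_ATOMIC_ADDRESS_SPACES:
        # Private, constant, and region accesses are treated as non-atomic.
        return "non-atomic"
    if op == "store" and ordering == "acquire":
        return "non-atomic"
    if op == "load" and ordering == "release":
        return "non-atomic"
    if ordering == "acq_rel":
        if op == "load":
            return "acquire"
        if op == "store":
            return "release"
    return ordering


print(effective_ordering("load", "acq_rel"))
```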
Table 90 AMDHSA Memory Model Single Thread Optimization Constraints
LLVM Memory Ordering
Optimization Constraints
unordered
none
monotonic
none
acquire
If a load atomic/atomicrmw then no following load/load
atomic/store/store atomic/atomicrmw/fence instruction can be
moved before the acquire.
If a fence then same as load atomic, plus no preceding
associated fence-paired-atomic can be moved after the fence.
release
If a store atomic/atomicrmw then no preceding load/load
atomic/store/store atomic/atomicrmw/fence instruction can be
moved after the release.
If a fence then same as store atomic, plus no following
associated fence-paired-atomic can be moved before the
fence.
acq_rel
Same constraints as both acquire and release.
seq_cst
If a load atomic then same constraints as acquire, plus no
preceding sequentially consistent load atomic/store
atomic/atomicrmw/fence instruction can be moved after the
seq_cst.
If a store atomic then the same constraints as release, plus
no following sequentially consistent load atomic/store
atomic/atomicrmw/fence instruction can be moved before the
seq_cst.
If an atomicrmw/fence then same constraints as acq_rel.
The code sequences used to implement the memory model are defined in the
following sections:
See Execution Barriers for definitions of the terminology used
in this section.
A barrier arrive operation A can pair with a release fence program-ordered before it
to form a barrier-arrive-release BR. The synchronization scope and the set of address
spaces affected are determined by the release fence.
A barrier wait operation W can pair with an acquire fence program-ordered after it to
form a barrier-wait-acquire BA. The synchronization scope and the set of address
spaces affected are determined by the acquire fence.
A BR synchronizes-with BA in an address space AS if and only if:
LLVM fences do not have address space information, thus, fence
codegen usually needs to conservatively synchronize all address spaces.
In the case of OpenCL, where fences only need to synchronize
user-specified address spaces, this can result in extra unnecessary waits.
For instance, a fence that is supposed to only synchronize local memory will
also have to wait on all global memory operations, which is unnecessary.
Memory Model Relaxation Annotations can
be used as an optimization hint for fences to solve this problem.
The AMDGPU backend recognizes the following tags on fences to control which address
space a fence can synchronize:
amdgpu-synchronize-as:local - for the local address space
amdgpu-synchronize-as:global - for the global address space
Multiple tags can be used at the same time to synchronize with more than one address space.
Note
Because they are optimization hints, these tags are not guaranteed to survive until
code generation. Optimizations are free to drop the tags to allow for
better code optimization, at the cost of synchronizing additional address
spaces.
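How the tags narrow a fence can be modeled with a short helper. This is a sketch only: it treats tags as plain `prefix:suffix` strings, the function name is hypothetical, and returning `None` is an illustrative convention meaning that the fence conservatively synchronizes all address spaces, which is also what happens when the tags have been dropped.

```python
TAG_PREFIX = "amdgpu-synchronize-as"
KNOWN_ADDRESS_SPACES = {"local", "global"}


def fence_address_spaces(mmra_tags):
    """Return the address spaces a fence must synchronize, or None for all.

    mmra_tags is an iterable of "prefix:suffix" strings attached to the fence.
    """
    selected = set()
    for tag in mmra_tags:
        prefix, _, suffix = tag.partition(":")
        if prefix == TAG_PREFIX and suffix in KNOWN_ADDRESS_SPACES:
            selected.add(suffix)
    # No recognized tag: fall back to synchronizing every address space.
    return selected or None


print(fence_address_spaces(["amdgpu-synchronize-as:local"]))
```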
Each CU has multiple SIMDs that execute wavefronts.
The wavefronts for a single work-group are executed in the same CU but may be
executed by different SIMDs.
Each CU has a single LDS memory shared by the wavefronts of the work-groups
executing on it.
All LDS operations of a CU are performed as wavefront wide operations in a
global order and involve no caching. Completion is reported to a wavefront in
execution order.
The LDS memory has multiple request queues shared by the SIMDs of a
CU. Therefore, the LDS operations performed by different wavefronts of a
work-group can be reordered relative to each other, which can result in
reordering the visibility of vector memory operations with respect to LDS
operations of other wavefronts in the same work-group. An s_waitcnt lgkmcnt(0) is required to ensure synchronization between LDS operations and
vector memory operations between wavefronts of a work-group, but not between
operations performed by the same wavefront.
The vector memory operations are performed as wavefront wide operations and
completion is reported to a wavefront in execution order. The exception is
that for GFX7-GFX9 flat_load/store/atomic instructions can report out of
vector memory order if they access LDS memory, and out of LDS operation order
if they access global memory.
The vector memory operations access a single vector L1 cache shared by all
SIMDs of a CU. Therefore, no special action is required for coherence between the
lanes of a single wavefront, or for coherence between wavefronts in the same
work-group. A buffer_wbinvl1_vol is required for coherence between
wavefronts executing in different work-groups as they may be executing on
different CUs.
The scalar memory operations access a scalar L1 cache shared by all wavefronts
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
scalar operations are used in a restricted way so do not impact the memory
model. See Memory Spaces.
The vector and scalar memory operations use an L2 cache shared by all CUs on
the same agent.
The L2 cache has independent channels to service disjoint ranges of virtual
addresses.
Each CU has a separate request queue per channel. Therefore, the vector and
scalar memory operations performed by wavefronts executing in different
work-groups (which may be executing on different CUs) of an agent can be
reordered relative to each other. An s_waitcnt vmcnt(0) is required to
ensure synchronization between vector memory operations of different CUs. It
ensures a previous vector memory operation has completed before executing a
subsequent vector memory or LDS operation and so can be used to meet the
requirements of acquire and release.
The L2 cache can be kept coherent with other agents on some targets, or ranges
of virtual addresses can be set up to bypass it to ensure system coherence.
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope const variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
Memory Spaces.
The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then an s_dcache_wb is inserted before the s_endpgm and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for an
s_dcache_inv as all scalar writes are write-before-read in the same thread.
For kernarg backing memory:
CP invalidates the L1 cache at the start of each kernel dispatch.
On dGPU the kernarg backing memory is allocated in host memory accessed as
MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
causes it to be treated as non-volatile and so is not invalidated by
*_vol.
On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
and so the L2 cache will be coherent with the CPU and other agents.
Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as *_vol to only invalidate the volatile cache lines.
A wave waiting on an s_barrier is unable to handle traps or exceptions,
thus an s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) is required before entering
the barrier so that no memory exception can occur during the barrier.
Must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than the
value read by the
fence-paired-atomic.
fence
acquire
agent
system
none
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
buffer_wbinvl1_vol.
Ensures that the
fence-paired-atomic
has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
fence-paired-atomic.
buffer_wbinvl1_vol
Must happen before any
following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
Release Atomic
store atomic
release
singlethread
wavefront
global
local
generic
buffer/global/ds/flat_store
store atomic
release
workgroup
global
generic
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
to local have
completed before
performing the
store that is being
released.
buffer/global/flat_store
store atomic
release
workgroup
local
ds_store
store atomic
release
agent
system
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
to memory have
completed before
performing the
store that is being
released.
buffer/global/flat_store
atomicrmw
release
singlethread
wavefront
global
local
generic
buffer/global/ds/flat_atomic
atomicrmw
release
workgroup
global
generic
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to local have
completed before
performing the
atomicrmw that is
being released.
buffer/global/flat_atomic
atomicrmw
release
workgroup
local
ds_atomic
atomicrmw
release
agent
system
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global and local
have completed
before performing
the atomicrmw that
is being released.
Must happen after
any preceding
local/generic
load/load
atomic/store/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
to local have
completed before
performing the
following
fence-paired-atomic.
fence
release
agent
system
none
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
Acquire-Release Atomic
atomicrmw
acq_rel
singlethread
wavefront
global
local
generic
buffer/global/ds/flat_atomic
atomicrmw
acq_rel
workgroup
global
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to local have
completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
atomicrmw
acq_rel
workgroup
local
ds_atomic
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than the local load
atomic value being
acquired.
atomicrmw
acq_rel
workgroup
generic
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to local have
completed before
performing the
atomicrmw that is
being released.
flat_atomic
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than a local load
atomic value being
acquired.
atomicrmw
acq_rel
agent
system
global
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global have
completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
s_waitcnt vmcnt(0)
Must happen before
following
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
cache.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
atomicrmw
acq_rel
agent
system
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global have
completed before
performing the
atomicrmw that is
being released.
flat_atomic
s_waitcnt vmcnt(0) &
lgkmcnt(0)
If OpenCL, omit
lgkmcnt(0).
Must happen before
following
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
cache.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
fence
acq_rel
singlethread
wavefront
none
none
fence
acq_rel
workgroup
none
s_waitcnt lgkmcnt(0)
If OpenCL and
address space is
not generic, omit.
However,
since LLVM
currently has no
address space on
the fence need to
conservatively
always generate
(see comment for
previous fence).
Must happen after
any preceding
local/generic
load/load
atomic/store/store
atomic/atomicrmw.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that all
memory operations
to local have
completed before
performing any
following global
memory operations.
Ensures that the
preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before following
global memory
operations. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
local/generic store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
fence
acq_rel
agent
system
none
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
buffer_wbinvl1_vol.
Ensures that the
preceding
global/local/generic
load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before invalidating
the cache. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
global/local/generic
store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data. This
satisfies the
requirements of
acquire.
Sequentially Consistent Atomic
load atomic
seq_cst
singlethread
wavefront
global
local
generic
Same as corresponding
load atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
workgroup
global
generic
s_waitcnt lgkmcnt(0)
Must
happen after
preceding
local/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent local
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waitcnt
could be placed after
seq_store or before
the seq_load. We
choose the load to
make the s_waitcnt be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
workgroup
local
Same as corresponding
load atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
agent
system
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
Could be split into
separate s_waitcnt
vmcnt(0)
and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt lgkmcnt(0)
must happen after
preceding
local/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
s_waitcnt vmcnt(0)
must happen after
preceding
global/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vmcnt(0) and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waitcnt
could be placed after
seq_store or before
the seq_load. We
choose the load to
make the s_waitcnt be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
store atomic
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding
store atomic release,
except must generate
all instructions even
for OpenCL.
atomicrmw
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding
atomicrmw acq_rel,
except must generate
all instructions even
for OpenCL.
fence
seq_cst
singlethread
wavefront
workgroup
agent
system
none
Same as corresponding
fence acq_rel,
except must generate
all instructions even
for OpenCL.
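The note in the load atomic seq_cst rows above, on why seq_cst is stronger than acquire/release, can be illustrated with the classic store-buffering litmus test. The following is a minimal Python sketch, not part of the AMDGPU backend: it enumerates the completion orders of two threads that each store to one location and then load the other. The names `outcomes` and `allow_store_load_reorder` are hypothetical; the flag models a store release followed by a load acquire completing out of order, which is the case the s_waitcnt placed before the seq_cst load is meant to rule out.

```python
from itertools import permutations


def outcomes(allow_store_load_reorder):
    """Return the set of (r0, r1) results of the store-buffering litmus test.

    Thread 0: store x = 1 ; r0 = load y
    Thread 1: store y = 1 ; r1 = load x

    This is an illustrative model only (an assumption of this sketch),
    not a description of the hardware.
    """
    ops = [
        (0, "store", "x"), (0, "load", "y"),
        (1, "store", "y"), (1, "load", "x"),
    ]
    results = set()
    for order in permutations(ops):
        if not allow_store_load_reorder:
            # Each thread's store must complete before its following load.
            in_program_order = all(
                order.index((t, "store", s)) < order.index((t, "load", l))
                for t, s, l in ((0, "x", "y"), (1, "y", "x"))
            )
            if not in_program_order:
                continue
        memory = {"x": 0, "y": 0}
        regs = {}
        for thread, kind, location in order:
            if kind == "store":
                memory[location] = 1
            else:
                regs[thread] = memory[location]
        results.add((regs[0], regs[1]))
    return results


sequentially_consistent = outcomes(allow_store_load_reorder=False)
store_load_reordered = outcomes(allow_store_load_reorder=True)
```

In this model the result (0, 0) appears only when a store may complete after the load that follows it, which is the reordering that acquire/release alone does not prevent.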
Each CU has multiple SIMDs that execute wavefronts.
The wavefronts for a single work-group are executed in the same CU but may be
executed by different SIMDs. The exception is tgsplit execution mode, in which
the wavefronts may be executed by different SIMDs in different CUs.
Each CU has a single LDS memory shared by the wavefronts of the work-groups
executing on it. The exception is tgsplit execution mode, in which no LDS
is allocated because wavefronts of the same work-group can be in different CUs.
All LDS operations of a CU are performed as wavefront wide operations in a
global order and involve no caching. Completion is reported to a wavefront in
execution order.
The LDS memory has multiple request queues shared by the SIMDs of a
CU. Therefore, the LDS operations performed by different wavefronts of a
work-group can be reordered relative to each other, which can result in
reordering the visibility of vector memory operations with respect to LDS
operations of other wavefronts in the same work-group. A s_waitcnt lgkmcnt(0) is required to ensure synchronization between LDS operations and
vector memory operations between wavefronts of a work-group, but not between
operations performed by the same wavefront.
The vector memory operations are performed as wavefront wide operations and
completion is reported to a wavefront in execution order. The exception is
that flat_load/store/atomic instructions can report out of vector memory
order if they access LDS memory, and out of LDS operation order if they access
global memory.
The vector memory operations access a single vector L1 cache shared by all
SIMDs of a CU. Therefore:
No special action is required for coherence between the lanes of a single
wavefront.
No special action is required for coherence between wavefronts in the same
work-group since they execute on the same CU. The exception is when in
tgsplit execution mode as wavefronts of the same work-group can be in
different CUs and so a buffer_wbinvl1_vol is required as described in
the following item.
A buffer_wbinvl1_vol is required for coherence between wavefronts
executing in different work-groups as they may be executing on different
CUs.
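The three vector L1 coherence cases above can be summarized as a small decision function. This is a minimal Python sketch under the assumptions stated in the list; the function and parameter names are hypothetical and do not correspond to any backend API.

```python
def needs_buffer_wbinvl1_vol(same_wavefront, same_workgroup, tgsplit):
    """Return True if a buffer_wbinvl1_vol is required for L1 coherence.

    Hypothetical helper that encodes the three cases listed above.
    """
    if same_wavefront:
        # Lanes of a single wavefront need no special action.
        return False
    if same_workgroup and not tgsplit:
        # Wavefronts of one work-group execute on the same CU and share its L1.
        return False
    # Different work-groups, or the same work-group in tgsplit mode,
    # may be executing on different CUs.
    return True
```

For example, two wavefronts of the same work-group need the invalidate only when tgsplit execution mode is in use.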
The scalar memory operations access a scalar L1 cache shared by all wavefronts
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
scalar operations are used in a restricted way so do not impact the memory
model. See Memory Spaces.
The vector and scalar memory operations use an L2 cache shared by all CUs on
the same agent.
The L2 cache has independent channels to service disjoint ranges of virtual
addresses.
Each CU has a separate request queue per channel. Therefore, the vector and
scalar memory operations performed by wavefronts executing in different
work-groups (which may be executing on different CUs), or the same
work-group if executing in tgsplit mode, of an agent can be reordered
relative to each other. A s_waitcnt vmcnt(0) is required to ensure
synchronization between vector memory operations of different CUs. It
ensures a previous vector memory operation has completed before executing a
subsequent vector memory or LDS operation and so can be used to meet the
requirements of acquire and release.
The L2 cache of one agent can be kept coherent with other agents by:
using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
Any local memory cache lines will be automatically invalidated by writes
from CUs associated with other L2 caches, or writes from the CPU, due to
the cache probe caused by coherent requests. Coherent requests are caused
by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
XGMI, and by PCIe requests that are configured to be coherent requests.
XGMI accesses from the CPU to local memory may be cached on the CPU.
Subsequent access from the GPU will automatically invalidate or writeback
the CPU cache due to the L2 probe filter and the PTE C-bit being set.
Since all work-groups on the same agent share the same L2, no L2
invalidation or writeback is required for coherence.
To ensure coherence of local and remote memory writes of work-groups in
different agents a buffer_wbl2 is required. It will writeback dirty L2
cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
(used for remote coarse grain memory). Note that MTYPE CC (used for local
fine grain memory) causes write through to DRAM, and MTYPE UC (used for
remote fine grain memory) bypasses the L2, so both will never result in
dirty L2 cache lines.
To ensure coherence of local and remote memory reads of work-groups in
different agents a buffer_invl2 is required. It will invalidate L2
cache lines with MTYPE NC (used for remote coarse grain memory). Note that
MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
coarse grain memory) cause local reads to be invalidated by remote writes with
the PTE C-bit so these cache lines are not invalidated. Note that
MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
never result in L2 cache lines that need to be invalidated.
PCIe access from the GPU to the CPU memory is kept coherent by using the
MTYPE UC (uncached) which bypasses the L2.
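The interaction between the MTYPE settings and the two L2 maintenance instructions described above can be collected into a lookup table. This is a minimal Python sketch that only restates the items above; the dictionary layout and the helper name `mtypes_affected_by` are hypothetical.

```python
# Per-MTYPE L2 behavior as described in the list above (illustrative only).
MTYPE_L2_BEHAVIOR = {
    # Local coarse grain: may be dirty in L2; kept fresh by probes.
    "RW": {"wbl2_writes_back": True, "invl2_invalidates": False},
    # Remote coarse grain: may be dirty in L2 and may become stale.
    "NC": {"wbl2_writes_back": True, "invl2_invalidates": True},
    # Local fine grain: write-through to DRAM, so never dirty in L2.
    "CC": {"wbl2_writes_back": False, "invl2_invalidates": False},
    # Remote fine grain: bypasses the L2 entirely.
    "UC": {"wbl2_writes_back": False, "invl2_invalidates": False},
}


def mtypes_affected_by(instruction):
    """Return the sorted MTYPEs whose L2 lines the given instruction acts on."""
    key = {
        "buffer_wbl2": "wbl2_writes_back",
        "buffer_invl2": "invl2_invalidates",
    }[instruction]
    return sorted(m for m, behavior in MTYPE_L2_BEHAVIOR.items() if behavior[key])
```

Read this way, buffer_wbl2 matters for both coarse grain MTYPEs, while buffer_invl2 matters only for remote coarse grain memory.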
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope const variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
Memory Spaces.
The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a s_dcache_wb is inserted before the s_endpgm and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
s_dcache_inv as all scalar writes are write-before-read in the same thread.
For kernarg backing memory:
CP invalidates the L1 cache at the start of each kernel dispatch.
On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
cache. This also causes it to be treated as non-volatile and so is not
invalidated by *_vol.
On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
so the L2 cache will be coherent with the CPU and other agents.
Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as *_vol to only invalidate the volatile cache lines.
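As a reading aid for the code sequence table that follows, the load atomic acquire rows for the global address space can be expressed as a function of sync scope and execution mode. This is a minimal Python sketch that transcribes those rows only; the function name `load_acquire_global` and the representation of instructions as strings are hypothetical, and it does not reflect how the backend actually selects instructions.

```python
def load_acquire_global(scope, tgsplit=False):
    """Return the GFX90A sequence for a global load atomic acquire.

    Transcribed from the table rows for the global address space;
    instruction alternatives are kept in the table's own notation.
    """
    if scope in ("singlethread", "wavefront"):
        return ["buffer/global/ds/flat_load"]
    if scope == "workgroup":
        if not tgsplit:
            # glc=1, the s_waitcnt and the L1 invalidate are all omitted.
            return ["buffer/global_load"]
        return [
            "buffer/global_load glc=1",
            "s_waitcnt vmcnt(0)",
            "buffer_wbinvl1_vol",
        ]
    if scope == "agent":
        return [
            "buffer/global_load glc=1",
            "s_waitcnt vmcnt(0)",   # load completes before the invalidate
            "buffer_wbinvl1_vol",
        ]
    if scope == "system":
        return [
            "buffer/global/flat_load glc=1",
            "s_waitcnt vmcnt(0)",   # load completes before the invalidates
            "buffer_invl2",
            "buffer_wbinvl1_vol",
        ]
    raise ValueError(f"unknown sync scope: {scope}")
```

The wider the sync scope, the more cache maintenance follows the load: nothing at wavefront scope, an L1 invalidate at agent scope, and both L2 and L1 invalidates at system scope.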
Table 92 AMDHSA Memory Model Code Sequences GFX90A
LLVM Instr
LLVM Memory
Ordering
LLVM Memory
Sync Scope
AMDGPU
Address
Space
AMDGPU Machine Code
GFX90A
Non-Atomic
load
none
none
global
generic
private
constant
!volatile & !nontemporal
buffer/global/flat_load
!volatile & nontemporal
buffer/global/flat_load
glc=1 slc=1
volatile
buffer/global/flat_load
glc=1
s_waitcnt vmcnt(0)
Must happen before
any following volatile
global/generic
load/store.
Ensures that
volatile
operations to
different
addresses will not
be reordered by
hardware.
load
none
none
local
ds_load
store
none
none
global
generic
private
constant
!volatile & !nontemporal
buffer/global/flat_store
!volatile & nontemporal
buffer/global/flat_store
glc=1 slc=1
volatile
buffer/global/flat_store
s_waitcnt vmcnt(0)
Must happen before
any following volatile
global/generic
load/store.
Ensures that
volatile
operations to
different
addresses will not
be reordered by
hardware.
store
none
none
local
ds_store
Unordered Atomic
load atomic
unordered
any
any
Same as non-atomic.
store atomic
unordered
any
any
Same as non-atomic.
atomicrmw
unordered
any
any
Same as monotonic atomic.
Monotonic Atomic
load atomic
monotonic
singlethread
wavefront
global
generic
buffer/global/flat_load
load atomic
monotonic
workgroup
global
generic
buffer/global/flat_load
glc=1
If not TgSplit execution
mode, omit glc=1.
load atomic
monotonic
singlethread
wavefront
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_load
load atomic
monotonic
agent
global
generic
buffer/global/flat_load
glc=1
load atomic
monotonic
system
global
generic
buffer/global/flat_load
glc=1
store atomic
monotonic
singlethread
wavefront
workgroup
agent
global
generic
buffer/global/flat_store
store atomic
monotonic
system
global
generic
buffer/global/flat_store
store atomic
monotonic
singlethread
wavefront
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_store
atomicrmw
monotonic
singlethread
wavefront
workgroup
agent
global
generic
buffer/global/flat_atomic
atomicrmw
monotonic
system
global
generic
buffer/global/flat_atomic
atomicrmw
monotonic
singlethread
wavefront
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
Acquire Atomic
load atomic
acquire
singlethread
wavefront
global
local
generic
buffer/global/ds/flat_load
load atomic
acquire
workgroup
global
buffer/global_load glc=1
If not TgSplit execution
mode, omit glc=1.
s_waitcnt vmcnt(0)
If not TgSplit execution
mode, omit.
Must happen before the
following buffer_wbinvl1_vol.
buffer_wbinvl1_vol
If not TgSplit execution
mode, omit.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following
loads will not see
stale data.
load atomic
acquire
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_load
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than the local load
atomic value being
acquired.
load atomic
acquire
workgroup
generic
flat_load glc=1
If not TgSplit execution
mode, omit glc=1.
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit lgkmcnt(0).
Must happen before
the following
buffer_wbinvl1_vol and any
following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than a local load
atomic value being
acquired.
buffer_wbinvl1_vol
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
load atomic
acquire
agent
global
buffer/global_load
glc=1
s_waitcnt vmcnt(0)
Must happen before
following
buffer_wbinvl1_vol.
Ensures the load
has completed
before invalidating
the cache.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following
loads will not see
stale global data.
load atomic
acquire
system
global
buffer/global/flat_load
glc=1
s_waitcnt vmcnt(0)
Must happen before
following buffer_invl2 and
buffer_wbinvl1_vol.
Ensures the load
has completed
before invalidating
the cache.
buffer_invl2;
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following
loads will not see
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
load atomic
acquire
agent
generic
flat_load glc=1
s_waitcnt vmcnt(0) &
lgkmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL omit
lgkmcnt(0).
Must happen before
following
buffer_wbinvl1_vol.
Ensures the flat_load
has completed
before invalidating
the cache.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
load atomic
acquire
system
generic
flat_load glc=1
s_waitcnt vmcnt(0) &
lgkmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL omit
lgkmcnt(0).
Must happen before
following
buffer_invl2 and
buffer_wbinvl1_vol.
Ensures the flat_load
has completed
before invalidating
the caches.
buffer_invl2;
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following
loads will not see
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
atomicrmw
acquire
singlethread
wavefront
global
generic
buffer/global/flat_atomic
atomicrmw
acquire
singlethread
wavefront
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
atomicrmw
acquire
workgroup
global
buffer/global_atomic
s_waitcnt vmcnt(0)
If not TgSplit execution
mode, omit.
Must happen before the
following buffer_wbinvl1_vol.
Ensures the atomicrmw
has completed
before invalidating
the cache.
buffer_wbinvl1_vol
If not TgSplit execution
mode, omit.
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
atomicrmw
acquire
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than the local
atomicrmw value
being acquired.
atomicrmw
acquire
workgroup
generic
flat_atomic
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit lgkmcnt(0).
Must happen before
the following
buffer_wbinvl1_vol and
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than a local
atomicrmw value
being acquired.
buffer_wbinvl1_vol
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
atomicrmw
acquire
agent
global
buffer/global_atomic
s_waitcnt vmcnt(0)
Must happen before
following
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
cache.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
atomicrmw
acquire
system
global
buffer/global_atomic
s_waitcnt vmcnt(0)
Must happen before
following buffer_invl2 and
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
caches.
buffer_invl2;
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following
loads will not see
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
atomicrmw
acquire
agent
generic
flat_atomic
s_waitcnt vmcnt(0) &
lgkmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Must happen before
following
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
cache.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
atomicrmw
acquire
system
generic
flat_atomic
s_waitcnt vmcnt(0) &
lgkmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Must happen before
following
buffer_invl2 and
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
caches.
buffer_invl2;
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following
loads will not see
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
fence
acquire
singlethread
wavefront
none
none
fence
acquire
workgroup
none
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load
atomic/
atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
buffer_wbinvl1_vol and
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than the
value read by the
fence-paired-atomic.
buffer_wbinvl1_vol
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
fence
acquire
agent
none
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
buffer_wbinvl1_vol.
Ensures that the
fence-paired atomic
has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
fence-paired-atomic.
buffer_wbinvl1_vol
Must happen before any
following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
fence
acquire
system
none
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following buffer_invl2 and
buffer_wbinvl1_vol.
Ensures that the
fence-paired atomic
has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
fence-paired-atomic.
buffer_invl2;
buffer_wbinvl1_vol
Must happen before any
following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following
loads will not see
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
Release Atomic
store atomic
release
singlethread
wavefront
global
generic
buffer/global/flat_store
store atomic
release
singlethread
wavefront
local
If TgSplit execution mode,
local address space cannot
be used.
ds_store
store atomic
release
workgroup
global
generic
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit lgkmcnt(0).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
have
completed before
performing the
store that is being
released.
buffer/global/flat_store
store atomic
release
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_store
store atomic
release
agent
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
to memory have
completed before
performing the
store that is being
released.
buffer/global/flat_store
store atomic
release
system
global
generic
buffer_wbl2
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after any
preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after any
preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
to memory and the L2
writeback have
completed before
performing the
store that is being
released.
buffer/global/flat_store
atomicrmw
release
singlethread
wavefront
global
generic
buffer/global/flat_atomic
atomicrmw
release
singlethread
wavefront
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
atomicrmw
release
workgroup
global
generic
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit
lgkmcnt(0).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
buffer/global/flat_atomic
atomicrmw
release
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
atomicrmw
release
agent
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global and local
have completed
before performing
the atomicrmw that
is being released.
buffer/global/flat_atomic
atomicrmw
release
system
global
generic
buffer_wbl2
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to memory and the L2
writeback have
completed before
performing the
atomicrmw that is being
released.
buffer/global/flat_atomic
fence
release
singlethread
wavefront
none
none
fence
release
workgroup
none
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/load
atomic/store/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
fence
release
agent
none
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
fence
release
system
none
buffer_wbl2
If OpenCL and
address space is
local, omit.
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
Acquire-Release Atomic
atomicrmw
acq_rel
singlethread
wavefront
global
generic
buffer/global/flat_atomic
atomicrmw
acq_rel
singlethread
wavefront
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
atomicrmw
acq_rel
workgroup
global
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit
lgkmcnt(0).
Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
s_waitcnt vmcnt(0)
If not TgSplit execution
mode, omit.
Must happen before
the following
buffer_wbinvl1_vol.
Ensures any
following global
data read is no
older than the
atomicrmw value
being acquired.
buffer_wbinvl1_vol
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
atomicrmw
acq_rel
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than the local
atomicrmw value being
acquired.
atomicrmw
acq_rel
workgroup
generic
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit
lgkmcnt(0).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
flat_atomic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If not TgSplit execution
mode, omit vmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Must happen before
the following
buffer_wbinvl1_vol and
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than a local
atomicrmw value being
acquired.
buffer_wbinvl1_vol
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
atomicrmw
acq_rel
agent
global
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global have
completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
s_waitcnt vmcnt(0)
Must happen before
following
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
cache.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
atomicrmw
acq_rel
system
global
buffer_wbl2
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global and L2 writeback
have completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
s_waitcnt vmcnt(0)
Must happen before
following buffer_invl2 and
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
caches.
buffer_invl2;
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following
loads will not see
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
atomicrmw
acq_rel
agent
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global have
completed before
performing the
atomicrmw that is
being released.
flat_atomic
s_waitcnt vmcnt(0) &
lgkmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Must happen before
following
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
cache.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
atomicrmw
acq_rel
system
generic
buffer_wbl2
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global and L2 writeback
have completed before
performing the
atomicrmw that is
being released.
flat_atomic
s_waitcnt vmcnt(0) &
lgkmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Must happen before
following buffer_invl2 and
buffer_wbinvl1_vol.
Ensures the
atomicrmw has
completed before
invalidating the
caches.
buffer_invl2;
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following
loads will not see
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
fence
acq_rel
singlethread
wavefront
none
none
fence
acq_rel
workgroup
none
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
However,
since LLVM
currently has no
address space on
the fence, it must
conservatively
always be generated
(see comment for
previous fence).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/load
atomic/store/store
atomic/atomicrmw.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that all
memory operations
have
completed before
performing any
following global
memory operations.
Ensures that the
preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before following
global memory
operations. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
local/generic store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
Must happen before
the following
buffer_wbinvl1_vol.
Ensures that the
acquire-fence-paired
atomic has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
acquire-fence-paired-atomic.
buffer_wbinvl1_vol
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
fence
acq_rel
agent
none
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
buffer_wbinvl1_vol.
Ensures that the
preceding
global/local/generic
load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before invalidating
the cache. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
global/local/generic
store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data. This
satisfies the
requirements of
acquire.
fence
acq_rel
system
none
buffer_wbl2
If OpenCL and
address space is
local, omit.
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following buffer_invl2 and
buffer_wbinvl1_vol.
Ensures that the
preceding
global/local/generic
load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before invalidating
the cache. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
global/local/generic
store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
buffer_invl2;
buffer_wbinvl1_vol
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following
loads will not see
stale L1 global data,
nor see stale L2 MTYPE
NC global data.
MTYPE RW and CC memory will
never be stale in L2 due to
the memory probes.
Sequential Consistent Atomic
load atomic
seq_cst
singlethread
wavefront
global
local
generic
Same as corresponding
load atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
workgroup
global
generic
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
s_waitcnt lgkmcnt(0) must
happen after
preceding
local/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
s_waitcnt vmcnt(0)
must happen after
preceding
global/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vmcnt(0) and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global/local
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waitcnt
could be placed after
the seq_cst store or before
the seq_cst load. We
choose the load to
make the s_waitcnt be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
Same as corresponding
load atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
agent
system
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0)
and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt lgkmcnt(0)
must happen after
preceding
local/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
s_waitcnt vmcnt(0)
must happen after
preceding
global/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vmcnt(0) and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waitcnt
could be placed after
the seq_cst store or before
the seq_cst load. We
choose the load to
make the s_waitcnt be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
store atomic
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding
store atomic release,
except must generate
all instructions even
for OpenCL.
atomicrmw
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding
atomicrmw acq_rel,
except must generate
all instructions even
for OpenCL.
fence
seq_cst
singlethread
wavefront
workgroup
agent
system
none
Same as corresponding
fence acq_rel,
except must generate
all instructions even
for OpenCL.
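As a rough illustration of the sequentially consistent load rows above, the following Python sketch selects the s_waitcnt counters that precede a seq_cst load atomic to the global or generic address space. The function name and the list-of-strings return format are hypothetical and not part of LLVM; only the counter choices come from the table. Because seq_cst must generate all instructions even for OpenCL, the sketch has no OpenCL parameter.

```python
def seq_cst_load_waitcnt(scope, tgsplit):
    """Counters waited on before a seq_cst load atomic (global/generic).

    Hypothetical helper for illustration only; it mirrors the table rows
    for workgroup, agent and system scope and nothing else.
    """
    if scope == "workgroup":
        # Use lgkmcnt(0) if not TgSplit execution mode and vmcnt(0) if TgSplit.
        return ["vmcnt(0)"] if tgsplit else ["lgkmcnt(0)"]
    if scope in ("agent", "system"):
        # lgkmcnt(0) & vmcnt(0); in TgSplit execution mode omit lgkmcnt(0).
        return ["vmcnt(0)"] if tgsplit else ["lgkmcnt(0)", "vmcnt(0)"]
    # singlethread/wavefront rows defer to load atomic acquire (not modelled).
    raise ValueError("scope not covered by this sketch: %s" % scope)


if __name__ == "__main__":
    for scope in ("workgroup", "agent", "system"):
        for tgsplit in (False, True):
            print(scope, tgsplit, seq_cst_load_waitcnt(scope, tgsplit))
```

The instructions that follow the load are the same as for the corresponding load atomic acquire and are not modelled here.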
Each CU has multiple SIMDs that execute wavefronts.
The wavefronts for a single work-group are executed in the same CU but may be
executed by different SIMDs. The exception is tgsplit execution mode, in which
the wavefronts may be executed by different SIMDs in different CUs.
Each CU has a single LDS memory shared by the wavefronts of the work-groups
executing on it. The exception is tgsplit execution mode, in which no LDS
is allocated because wavefronts of the same work-group can be in different CUs.
All LDS operations of a CU are performed as wavefront wide operations in a
global order and involve no caching. Completion is reported to a wavefront in
execution order.
The LDS memory has multiple request queues shared by the SIMDs of a
CU. Therefore, the LDS operations performed by different wavefronts of a
work-group can be reordered relative to each other, which can result in
reordering the visibility of vector memory operations with respect to LDS
operations of other wavefronts in the same work-group. An s_waitcnt lgkmcnt(0)
is required to ensure synchronization between LDS operations and
vector memory operations between wavefronts of a work-group, but not between
operations performed by the same wavefront.
The vector memory operations are performed as wavefront wide operations and
completion is reported to a wavefront in execution order. The exception is
that flat_load/store/atomic instructions can report out of vector memory
order if they access LDS memory, and out of LDS operation order if they access
global memory.
The vector memory operations access a single vector L1 cache shared by all
SIMDs of a CU. Therefore:
No special action is required for coherence between the lanes of a single
wavefront.
No special action is required for coherence between wavefronts in the same
work-group since they execute on the same CU. The exception is tgsplit
execution mode, in which wavefronts of the same work-group can be in
different CUs, so a buffer_inv sc0 is required to invalidate
the L1 cache.
A buffer_inv sc0 is required to invalidate the L1 cache for coherence
between wavefronts executing in different work-groups as they may be
executing on different CUs.
Atomic read-modify-write instructions implicitly bypass the L1 cache.
Therefore, they do not use the sc0 bit for coherence and instead use it to
indicate if the instruction returns the original value being updated. They
do use sc1 to indicate system or agent scope coherence.
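The vector L1 coherence rules above can be restated as a small decision function. This is a minimal sketch, assuming the relationship between the two communicating wavefronts is known; the function and parameter names are hypothetical and only the three rules themselves come from the text.

```python
def needs_l1_invalidate(same_wavefront, same_workgroup, tgsplit):
    """Return True if a buffer_inv sc0 is needed for vector L1 coherence.

    Hypothetical helper: it encodes only the rules listed above.
    """
    if same_wavefront:
        # Lanes of a single wavefront need no special action.
        return False
    if same_workgroup:
        # Same CU (and so same L1) unless in tgsplit execution mode.
        return tgsplit
    # Different work-groups may be executing on different CUs.
    return True


if __name__ == "__main__":
    print(needs_l1_invalidate(False, True, tgsplit=False))
    print(needs_l1_invalidate(False, True, tgsplit=True))
```

Atomic read-modify-write instructions are outside this sketch, since they bypass the L1 cache.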
The scalar memory operations access a scalar L1 cache shared by all wavefronts
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
scalar operations are used in a restricted way so do not impact the memory
model. See Memory Spaces.
The vector and scalar memory operations use an L2 cache.
The gfx942 can be configured as a number of smaller agents with each having
a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
larger agents with groups of CUs on each agent each sharing separate L2
caches.
The L2 cache has independent channels to service disjoint ranges of virtual
addresses.
Each CU has a separate request queue per channel for its associated L2.
Therefore, the vector and scalar memory operations performed by wavefronts
executing with different L1 caches and the same L2 cache can be reordered
relative to each other.
An s_waitcnt vmcnt(0) is required to ensure synchronization between
vector memory operations of different CUs. It ensures a previous vector
memory operation has completed before executing a subsequent vector memory
or LDS operation and so can be used to meet the requirements of acquire and
release.
An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
(read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
the PTE C-bit set for memory not local to the L2.
Any local memory cache lines will be automatically invalidated by writes
from CUs associated with other L2 caches, or writes from the CPU, due to
the cache probe caused by the PTE C-bit.
XGMI accesses from the CPU to local memory may be cached on the CPU.
Subsequent access from the GPU will automatically invalidate or writeback
the CPU cache due to the L2 probe filter.
To ensure coherence of local memory writes of CUs with different L1 caches
in the same agent a buffer_wbl2 is required. It does nothing if the
agent is configured to have a single L2, or will writeback dirty L2 cache
lines if configured to have multiple L2 caches.
To ensure coherence of local memory writes of CUs in different agents a
buffer_wbl2 sc1 is required. It will writeback dirty L2 cache lines.
To ensure coherence of local memory reads of CUs with different L1 caches
in the same agent a buffer_inv sc1 is required. It does nothing if the
agent is configured to have a single L2, or will invalidate non-local L2
cache lines if configured to have multiple L2 caches.
To ensure coherence of local memory reads of CUs in different agents a
buffer_inv sc0 sc1 is required. It will invalidate non-local L2 cache
lines if configured to have multiple L2 caches.
PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
UC (uncached) which bypasses the L2.
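The following Python sketch summarizes which cache writeback and invalidate instructions accompany a release or an acquire at each synchronization scope. It is a sketch under stated assumptions: the function name and tuple format are hypothetical, and the instruction spellings follow the GFX942 code sequence table rows in this section (for example buffer_wbl2 sc1=1 at agent scope), not the prose above.

```python
def gfx942_cache_maintenance(scope, tgsplit=False):
    """Return (writeback, invalidate) instruction strings for a scope.

    Hypothetical helper. None means that no instruction is emitted.
    Spellings follow the GFX942 code sequence table rows (an assumption).
    """
    if scope in ("singlethread", "wavefront"):
        return (None, None)
    if scope == "workgroup":
        # The L1 invalidate is only needed in tgsplit execution mode.
        return (None, "buffer_inv sc0=1" if tgsplit else None)
    if scope == "agent":
        return ("buffer_wbl2 sc1=1", "buffer_inv sc1=1")
    if scope == "system":
        return ("buffer_wbl2 sc0=1 sc1=1", "buffer_inv sc0=1 sc1=1")
    raise ValueError("unknown scope: %s" % scope)


if __name__ == "__main__":
    for scope in ("wavefront", "workgroup", "agent", "system"):
        print(scope, gfx942_cache_maintenance(scope))
```

In the table, the writeback precedes the s_waitcnt of a release, and the invalidate follows the s_waitcnt of an acquire.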
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope const variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
Memory Spaces.
The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then an s_dcache_wb is inserted before the s_endpgm and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for an
s_dcache_inv as all scalar writes are write-before-read in the same thread.
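The placement rule for s_dcache_wb can be sketched as a pass over a list of instruction strings. This is an illustrative model, not the backend's implementation: the function name is hypothetical, and treating s_setpc_b64 as the function return instruction is an assumption of the sketch.

```python
def insert_scalar_writeback(instrs, uses_scalar_writes,
                            exits=("s_endpgm", "s_setpc_b64")):
    """Insert s_dcache_wb before each program end or function return.

    Hypothetical helper. `exits` lists the mnemonics treated as program
    end or function return; s_setpc_b64 as "return" is an assumption.
    """
    if not uses_scalar_writes:
        # No scalar writes were used for spilling, so nothing to write back.
        return list(instrs)
    out = []
    for ins in instrs:
        parts = ins.split()
        if parts and parts[0] in exits:
            out.append("s_dcache_wb")
        out.append(ins)
    return out


if __name__ == "__main__":
    body = ["; scalar spill store (placeholder)", "s_endpgm"]
    print(insert_scalar_writeback(body, uses_scalar_writes=True))
```

No s_dcache_inv is inserted, matching the write-before-read argument above.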
For kernarg backing memory:
CP invalidates the L1 cache at the start of each kernel dispatch.
On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
cache. This also causes it to be treated as non-volatile and so is not
invalidated by *_vol.
On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
so the L2 cache will be coherent with the CPU and other agents.
Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
only accessed by a single thread, and is always write-before-read, there is
never a need to invalidate these entries from the L1 cache. Hence all cache
invalidates are done as *_vol to only invalidate the volatile cache lines.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load
atomic/
atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
buffer_inv and
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than the
value read by the
fence-paired-atomic.
buffer_inv sc0=1
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
fence
acquire
agent
none
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
buffer_inv.
Ensures that the
fence-paired atomic
has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
fence-paired-atomic.
buffer_inv sc1=1
Must happen before any
following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
fence
acquire
system
none
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
buffer_inv.
Ensures that the
fence-paired atomic
has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
fence-paired-atomic.
buffer_inv sc0=1 sc1=1
Must happen before any
following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
Release Atomic
store atomic
release
singlethread
wavefront
global
generic
GFX942
buffer/global/flat_store
store atomic
release
singlethread
wavefront
local
If TgSplit execution mode,
local address space cannot
be used.
ds_store
store atomic
release
workgroup
global
generic
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit lgkmcnt(0).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
have
completed before
performing the
store that is being
released.
GFX942
buffer/global/flat_store
sc0=1
store atomic
release
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_store
store atomic
release
agent
global
generic
buffer_wbl2 sc1=1
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at agent scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
to memory have
completed before
performing the
store that is being
released.
GFX942
buffer/global/flat_store
sc1=1
store atomic
release
system
global
generic
buffer_wbl2 sc0=1 sc1=1
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after any
preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after any
preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
to memory and the L2
writeback have
completed before
performing the
store that is being
released.
buffer/global/flat_store
sc0=1 sc1=1
atomicrmw
release
singlethread
wavefront
global
generic
buffer/global/flat_atomic
atomicrmw
release
singlethread
wavefront
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
atomicrmw
release
workgroup
global
generic
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit
lgkmcnt(0).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
buffer/global/flat_atomic sc0=1
atomicrmw
release
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
atomicrmw
release
agent
global
generic
buffer_wbl2 sc1=1
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at agent scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global and local
have completed
before performing
the atomicrmw that
is being released.
buffer/global/flat_atomic sc1=1
atomicrmw
release
system
global
generic
buffer_wbl2 sc0=1 sc1=1
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to memory and the L2
writeback have
completed before
performing the
atomicrmw that is being
released.
buffer/global/flat_atomic
sc0=1 sc1=1
fence
release
singlethread
wavefront
none
none
fence
release
workgroup
none
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/load
atomic/store/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
fence
release
agent
none
buffer_wbl2 sc1=1
If OpenCL and
address space is
local, omit.
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at agent scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
fence
release
system
none
buffer_wbl2 sc0=1 sc1=1
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
Acquire-Release Atomic
atomicrmw
acq_rel
singlethread
wavefront
global
generic
buffer/global/flat_atomic
atomicrmw
acq_rel
singlethread
wavefront
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
atomicrmw
acq_rel
workgroup
global
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit
lgkmcnt(0).
Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
s_waitcnt vmcnt(0)
If not TgSplit execution
mode, omit.
Must happen before
the following
buffer_inv.
Ensures any
following global
data read is no
older than the
atomicrmw value
being acquired.
buffer_inv sc0=1
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
atomicrmw
acq_rel
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
ds_atomic
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than the local
atomicrmw value being
acquired.
atomicrmw
acq_rel
workgroup
generic
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL, omit
lgkmcnt(0).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
flat_atomic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If not TgSplit execution
mode, omit vmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Must happen before
the following
buffer_inv and
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures any
following global
data read is no
older than a local
atomicrmw value being
acquired.
buffer_inv sc0=1
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
atomicrmw
acq_rel
agent
global
buffer_wbl2 sc1=1
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at agent scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global have
completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
s_waitcnt vmcnt(0)
Must happen before
following
buffer_inv.
Ensures the
atomicrmw has
completed before
invalidating the
cache.
buffer_inv sc1=1
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
atomicrmw
acq_rel
system
global
buffer_wbl2 sc0=1 sc1=1
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global and L2 writeback
have completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
sc1=1
s_waitcnt vmcnt(0)
Must happen before
following
buffer_inv.
Ensures the
atomicrmw has
completed before
invalidating the
caches.
buffer_inv sc0=1 sc1=1
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
MTYPE NC global data.
MTYPE RW and CC memory will
never be stale due to the
memory probes.
atomicrmw
acq_rel
agent
generic
buffer_wbl2 sc1=1
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at agent scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global have
completed before
performing the
atomicrmw that is
being released.
flat_atomic
s_waitcnt vmcnt(0) &
lgkmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Must happen before
following
buffer_inv.
Ensures the
atomicrmw has
completed before
invalidating the
cache.
buffer_inv sc1=1
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
atomicrmw
acq_rel
system
generic
buffer_wbl2 sc0=1 sc1=1
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global and L2 writeback
have completed before
performing the
atomicrmw that is
being released.
flat_atomic sc1=1
s_waitcnt vmcnt(0) &
lgkmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL, omit
lgkmcnt(0).
Must happen before
following
buffer_inv.
Ensures the
atomicrmw has
completed before
invalidating the
caches.
buffer_inv sc0=1 sc1=1
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
MTYPE NC global data.
MTYPE RW and CC memory will
never be stale due to the
memory probes.
fence
acq_rel
singlethread
wavefront
none
none
fence
acq_rel
workgroup
none
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0).
However,
since LLVM
currently has no
address space on
the fence, both
must conservatively
always be generated
(see comment for
previous fence).
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/
load atomic/store atomic/
atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/load
atomic/store/store
atomic/atomicrmw.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that all
memory operations
have
completed before
performing any
following global
memory operations.
Ensures that the
preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before following
global memory
operations. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
local/generic store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
Must happen before
the following
buffer_inv.
Ensures that the
acquire-fence-paired
atomic has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
acquire-fence-paired-atomic.
buffer_inv sc0=1
If not TgSplit execution
mode, omit.
Ensures that
following
loads will not see
stale data.
fence
acq_rel
agent
none
buffer_wbl2 sc1=1
If OpenCL and
address space is
local, omit.
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at agent scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
buffer_inv.
Ensures that the
preceding
global/local/generic
load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before invalidating
the cache. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
global/local/generic
store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
buffer_inv sc1=1
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data. This
satisfies the
requirements of
acquire.
fence
acq_rel
system
none
buffer_wbl2 sc0=1 sc1=1
If OpenCL and
address space is
local, omit.
Must happen before
following s_waitcnt.
Performs L2 writeback to
ensure previous
global/generic
store/atomicrmw are
visible at system scope.
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0) and
s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/store/load
atomic/store
atomic/atomicrmw.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
buffer_inv.
Ensures that the
preceding
global/local/generic
load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before invalidating
the cache. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
global/local/generic
store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
buffer_inv sc0=1 sc1=1
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
MTYPE NC global data.
MTYPE RW and CC memory will
never be stale due to the
memory probes.
Sequentially Consistent Atomic
load atomic
seq_cst
singlethread
wavefront
global
local
generic
Same as corresponding
load atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
workgroup
global
generic
s_waitcnt lgkm/vmcnt(0)
Use lgkmcnt(0) if not
TgSplit execution mode
and vmcnt(0) if TgSplit
execution mode.
s_waitcnt lgkmcnt(0) must
happen after
preceding
local/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
s_waitcnt vmcnt(0)
must happen after
preceding
global/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vmcnt(0) and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global/local
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waitcnt
could be placed after
the seq_cst store or before
the seq_cst load. We
choose the load to
make the s_waitcnt be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
workgroup
local
If TgSplit execution mode,
local address space cannot
be used.
Same as corresponding
load atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
agent
system
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0)
If TgSplit execution mode,
omit lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0)
and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt lgkmcnt(0)
must happen after
preceding
local/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
s_waitcnt vmcnt(0)
must happen after
preceding
global/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vmcnt(0) and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waitcnt
could be placed after
the seq_cst store or before
the seq_cst load. We
choose the load to
make the s_waitcnt be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
store atomic
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding
store atomic release,
except must generate
all instructions even
for OpenCL.
atomicrmw
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding
atomicrmw acq_rel,
except must generate
all instructions even
for OpenCL.
fence
seq_cst
singlethread
wavefront
workgroup
agent
system
none
Same as corresponding
fence acq_rel,
except must generate
all instructions even
for OpenCL.
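The acq_rel and seq_cst atomicrmw rows for the global address space in the table above can be condensed into a small selection function. The following is a minimal sketch and not the backend's implementation: the function name, the boolean parameters, and the use of plain strings for instructions are hypothetical, and the strings simply repeat the notation used in the table.

```python
def atomicrmw_global_sequence(scope, tgsplit=False, opencl=False, seq_cst=False):
    """Return the table's code sequence for an acq_rel (or seq_cst) atomicrmw
    on the global address space, as a list of strings (illustrative only)."""
    # seq_cst rows: same as acq_rel, but every instruction is generated
    # even for OpenCL, so the OpenCL omissions are disabled.
    if seq_cst:
        opencl = False

    if scope in ("singlethread", "wavefront"):
        return ["buffer/global_atomic"]

    if scope == "workgroup":
        seq = []
        if tgsplit:
            seq.append("s_waitcnt vmcnt(0)")      # vmcnt(0) in TgSplit mode
        elif not opencl:
            seq.append("s_waitcnt lgkmcnt(0)")    # lgkmcnt(0) otherwise; omitted for OpenCL
        seq.append("buffer/global_atomic")
        if tgsplit:
            # Both are omitted when not in TgSplit execution mode.
            seq += ["s_waitcnt vmcnt(0)", "buffer_inv sc0=1"]
        return seq

    if scope in ("agent", "system"):
        bits = "sc1=1" if scope == "agent" else "sc0=1 sc1=1"
        # lgkmcnt(0) is omitted in TgSplit mode and for OpenCL.
        if tgsplit or opencl:
            wait = "s_waitcnt vmcnt(0)"
        else:
            wait = "s_waitcnt lgkmcnt(0) & vmcnt(0)"
        atomic = "buffer/global_atomic" + (" sc1=1" if scope == "system" else "")
        return ["buffer_wbl2 " + bits, wait, atomic,
                "s_waitcnt vmcnt(0)", "buffer_inv " + bits]

    raise ValueError("unknown sync scope: " + scope)
```

For example, `atomicrmw_global_sequence("workgroup", opencl=True)` yields only the atomic itself, while passing `seq_cst=True` as well restores the leading `s_waitcnt lgkmcnt(0)`.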
Each CU has multiple SIMDs that execute wavefronts.
The wavefronts for a single work-group are executed in the same
WGP. In CU wavefront execution mode the wavefronts may be executed by
different SIMDs in the same CU. In WGP wavefront execution mode the
wavefronts may be executed by different SIMDs in different CUs in the same
WGP.
Each WGP has a single LDS memory shared by the wavefronts of the work-groups
executing on it.
All LDS operations of a WGP are performed as wavefront wide operations in a
global order and involve no caching. Completion is reported to a wavefront in
execution order.
The LDS memory has multiple request queues shared by the SIMDs of a
WGP. Therefore, the LDS operations performed by different wavefronts of a
work-group can be reordered relative to each other, which can result in
reordering the visibility of vector memory operations with respect to LDS
operations of other wavefronts in the same work-group. An s_waitcnt lgkmcnt(0)
is required to ensure synchronization between LDS operations and
vector memory operations between wavefronts of a work-group, but not between
operations performed by the same wavefront.
The vector memory operations are performed as wavefront wide operations.
Completion of load/store/sample operations is reported to a wavefront in
execution order of other load/store/sample operations performed by that
wavefront.
The vector memory operations access a vector L0 cache. There is a single L0
cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
special action is required for coherence between the lanes of a single
wavefront. However, a buffer_gl0_inv is required for coherence between
wavefronts executing in the same work-group as they may be executing on SIMDs
of different CUs that access different L0s. A buffer_gl0_inv is also
required for coherence between wavefronts executing in different work-groups
as they may be executing on different WGPs.
The scalar memory operations access a scalar L0 cache shared by all wavefronts
on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
operations are used in a restricted way so do not impact the memory model. See
Memory Spaces.
The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
the same SA. Therefore, no special action is required for coherence between
the wavefronts of a single work-group. However, a buffer_gl1_inv is
required for coherence between wavefronts executing in different work-groups
as they may be executing on different SAs that access different L1s.
The L1 caches have independent quadrants to service disjoint ranges of virtual
addresses.
Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
vector and scalar memory operations performed by different wavefronts, whether
executing in the same or different work-groups (which may be executing on
different CUs accessing different L0s), can be reordered relative to each
other. An s_waitcnt vmcnt(0) & vscnt(0) is required to ensure
synchronization between vector memory operations of different wavefronts. It
ensures a previous vector memory operation has completed before executing a
subsequent vector memory or LDS operation and so can be used to meet the
requirements of acquire, release and sequential consistency.
The L1 caches use an L2 cache shared by all SAs on the same agent.
The L2 cache has independent channels to service disjoint ranges of virtual
addresses.
Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
quadrant has a separate request queue per L2 channel. Therefore, the vector
and scalar memory operations performed by wavefronts executing in different
work-groups (which may be executing on different SAs) of an agent can be
reordered relative to each other. An s_waitcnt vmcnt(0) & vscnt(0) is
required to ensure synchronization between vector memory operations of
different SAs. It ensures a previous vector memory operation has completed
before executing a subsequent vector memory operation and so can be used to meet the
requirements of acquire, release and sequential consistency.
The L2 cache can be kept coherent with other agents on some targets, or ranges
of virtual addresses can be set up to bypass it to ensure system coherence.
On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU memory.
The MALL cache is fully coherent with GPU memory and has no impact on system
coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
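The cache sharing described above (one vector L0 per CU, one L1 per SA, one L2 per agent) determines which invalidates an acquiring wavefront needs before it can observe another wavefront's data. The sketch below models only that reasoning for two wavefronts on the same agent. The `Placement` type and both function names are hypothetical, and the result lists use this document's instruction names purely as labels.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    """Where a wavefront executes (hypothetical model, same agent assumed)."""
    sa: int   # shader array index
    wgp: int  # WGP index within the SA
    cu: int   # CU index within the WGP


def nearest_shared_cache(a: Placement, b: Placement) -> str:
    """Return the closest cache level that both wavefronts access."""
    if a.sa != b.sa:
        return "L2"   # different SAs use different L1s
    if (a.wgp, a.cu) != (b.wgp, b.cu):
        return "L1"   # different CUs use different L0s
    return "L0"       # same CU: the SIMDs share a single L0


def acquire_invalidates(a: Placement, b: Placement) -> list:
    """Invalidates the acquiring wavefront needs to avoid stale vector data."""
    shared = nearest_shared_cache(a, b)
    if shared == "L0":
        return []
    if shared == "L1":
        return ["buffer_gl0_inv"]
    return ["buffer_gl1_inv", "buffer_gl0_inv"]
```

Two wavefronts of one work-group in WGP wavefront execution mode may land on different CUs of the same WGP, which is the `"L1"` case and so needs only `buffer_gl0_inv`.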
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope const variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
Memory Spaces.
The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then an s_dcache_wb is inserted before the s_endpgm and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for an
s_dcache_inv as all scalar writes are write-before-read in the same thread.
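The placement rule for s_dcache_wb can be illustrated with a small pass over a list of instruction strings. This is only a sketch of the rule as stated above, not the backend's code. The function name is hypothetical, and treating `s_setpc_b64` as the function-return instruction is an assumption made for illustration.

```python
def insert_scalar_writebacks(instructions, uses_scalar_writes,
                             return_mnemonic="s_setpc_b64"):
    """Insert s_dcache_wb before program end and before function returns.

    `return_mnemonic` is an assumed stand-in for whatever instruction
    implements a function return.
    """
    if not uses_scalar_writes:
        # No scalar writes were used for spilling, so nothing to write back.
        return list(instructions)

    out = []
    for inst in instructions:
        fields = inst.split()
        mnemonic = fields[0] if fields else ""
        if mnemonic in ("s_endpgm", return_mnemonic):
            out.append("s_dcache_wb")
        out.append(inst)
    return out
```

No matching s_dcache_inv is inserted, in line with the write-before-read argument above.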
For kernarg backing memory:
CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
needing to invalidate the L2 cache.
On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
so the L2 cache will be coherent with the CPU and other agents.
Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.
Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and the vscnt reports the completion of
store and atomic without return in order. See MEM_ORDERED field in
compute_pgm_rsrc1 for GFX6-GFX12.
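The split between the two counters can be written down as a lookup. The following minimal sketch picks the counter that reports completion of a vector memory operation and builds the wait needed for a set of outstanding operations. Both function names and the `(kind, returns_value)` encoding are hypothetical, and the output string follows the notation of this document's tables rather than assembler syntax.

```python
def completion_counter(kind, returns_value=False):
    """Counter that reports completion of a vector memory operation."""
    if kind in ("load", "sample"):
        return "vmcnt"
    if kind == "store":
        return "vscnt"
    if kind == "atomic":
        # Atomics with return are counted like loads, without return like stores.
        return "vmcnt" if returns_value else "vscnt"
    raise ValueError("unknown operation kind: " + kind)


def waitcnt_for(ops):
    """Build the wait covering every (kind, returns_value) pair in `ops`."""
    counters = sorted({completion_counter(kind, ret) for kind, ret in ops})
    if not counters:
        return None
    return "s_waitcnt " + " & ".join(c + "(0)" for c in counters)
```

A release that follows only stores and no-return atomics therefore needs just `vscnt(0)`, which is why the tables allow the combined wait to be split.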
Wavefronts can be executed in WGP or CU wavefront execution mode:
In WGP wavefront execution mode the wavefronts of a work-group are executed
on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
CU L0 caches is required for work-group synchronization. Also, accesses to L1
at work-group scope need to be explicitly ordered as the accesses from
different CUs are not ordered.
In CU wavefront execution mode the wavefronts of a work-group are executed on
the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
the work-group use the same L0, which in turn ensures L1 accesses are
ordered and so do not require explicit management of the caches for
work-group synchronization.
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load
atomic/
atomicrmw-with-return-value
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
atomicrmw-no-return-value
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
buffer_gl0_inv.
Ensures that the
fence-paired atomic
has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
fence-paired-atomic.
buffer_gl0_inv
If CU wavefront execution
mode, omit.
Ensures that
following
loads will not see
stale data.
fence
acquire
agent
system
none
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0) and vscnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load
atomic/
atomicrmw-with-return-value
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
atomicrmw-no-return-value
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
buffer_gl*_inv.
Ensures that the
fence-paired atomic
has completed
before invalidating
the
caches. Therefore
any following
locations read must
be no older than
the value read by
the
fence-paired-atomic.
buffer_gl1_inv;
buffer_gl0_inv
Must happen before any
following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
Release Atomic
store atomic
release
singlethread
wavefront
global
local
generic
buffer/global/ds/flat_store
store atomic
release
workgroup
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
have
completed before
performing the
store that is being
released.
buffer/global/flat_store
store atomic
release
workgroup
local
s_waitcnt vmcnt(0) & vscnt(0)
If OpenCL, omit.
Could be split into
separate s_waitcnt
vmcnt(0) and s_waitcnt
vscnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
Must happen before
the following
store.
Ensures that all
global memory
operations have
completed before
performing the
store that is being
released.
ds_store
store atomic
release
agent
system
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt vscnt(0)
and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
store.
Ensures that all
memory operations
have
completed before
performing the
store that is being
released.
buffer/global/flat_store
atomicrmw
release
singlethread
wavefront
global
local
generic
buffer/global/ds/flat_atomic
atomicrmw
release
workgroup
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL, omit lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
buffer/global/flat_atomic
atomicrmw
release
workgroup
local
s_waitcnt vmcnt(0) & vscnt(0)
If OpenCL, omit.
Could be split into
separate s_waitcnt
vmcnt(0) and s_waitcnt
vscnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
Must happen before
the following
atomicrmw.
Ensures that all
global memory
operations have
completed before
performing the
atomicrmw that is
being released.
ds_atomic
atomicrmw
release
agent
system
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/load atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global and local
have completed
before performing
the atomicrmw that
is being released.
buffer/global/flat_atomic
fence
release
singlethread
wavefront
none
none
fence
release
workgroup
none
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0) and vscnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store atomic/
atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
fence
release
agent
system
none
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0) and vscnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/load atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
Acquire-Release Atomic
atomicrmw
acq_rel
singlethread
wavefront
global
local
generic
buffer/global/ds/flat_atomic
atomicrmw
acq_rel
workgroup
global
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL, omit
lgkmcnt(0).
Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0), and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
s_waitcnt vm/vscnt(0)
Use vmcnt(0) if atomic with
return and vscnt(0) if
atomic with no-return.
Must happen before
the following
buffer_gl0_inv.
Ensures any
following global
data read is no
older than the
atomicrmw value
being acquired.
buffer_gl0_inv
If CU wavefront execution
mode, omit.
Ensures that
following
loads will not see
stale data.
atomicrmw
acq_rel
workgroup
local
s_waitcnt vmcnt(0) & vscnt(0)
If OpenCL, omit.
Could be split into
separate s_waitcnt
vmcnt(0) and s_waitcnt
vscnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
Must happen before
the following
atomicrmw.
Ensures that all
global memory
operations have
completed before
performing the
atomicrmw that is
being released.
ds_atomic
s_waitcnt lgkmcnt(0)
If OpenCL, omit.
Must happen before
the following
buffer_gl0_inv.
Ensures any
following global
data read is no
older than the local load
atomic value being
acquired.
buffer_gl0_inv
If CU wavefront execution
mode, omit.
If OpenCL, omit.
Ensures that
following
loads will not see
stale data.
atomicrmw
acq_rel
workgroup
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL, omit lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
flat_atomic
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If atomic with return, omit
vscnt(0), if atomic with
no-return, omit vmcnt(0).
If OpenCL, omit lgkmcnt(0).
Must happen before
the following
buffer_gl0_inv.
Ensures any
following global
data read is no
older than the load
atomic value being
acquired.
buffer_gl0_inv
If CU wavefront execution
mode, omit.
Ensures that
following
loads will not see
stale data.
atomicrmw
acq_rel
agent
system
global
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/load atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
to global have
completed before
performing the
atomicrmw that is
being released.
buffer/global_atomic
s_waitcnt vm/vscnt(0)
Use vmcnt(0) if atomic with
return and vscnt(0) if
atomic with no-return.
Must happen before
following
buffer_gl*_inv.
Ensures the
atomicrmw has
completed before
invalidating the
caches.
buffer_gl1_inv;
buffer_gl0_inv
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
atomicrmw
acq_rel
agent
system
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL, omit
lgkmcnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0), and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/load atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
flat_atomic
s_waitcnt vm/vscnt(0) &
lgkmcnt(0)
If OpenCL, omit
lgkmcnt(0).
Use vmcnt(0) if atomic with
return and vscnt(0) if
atomic with no-return.
Must happen before
following
buffer_gl*_inv.
Ensures the
atomicrmw has
completed before
invalidating the
caches.
buffer_gl1_inv;
buffer_gl0_inv
Must happen before
any following
global/generic
load/load
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data.
fence
acq_rel
singlethread
wavefront
none
none
fence
acq_rel
workgroup
none
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0) and vscnt(0).
However,
since LLVM
currently has no
address space on
the fence, all waits
must conservatively
always be generated
(see comment for
previous fence).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store atomic/
atomicrmw.
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that all
memory operations
have
completed before
performing any
following global
memory operations.
Ensures that the
preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before following
global memory
operations. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
local/generic store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
Must happen before
the following
buffer_gl0_inv.
Ensures that the
acquire-fence-paired
atomic has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
acquire-fence-paired-atomic.
buffer_gl0_inv
If CU wavefront execution
mode, omit.
Ensures that
following
loads will not see
stale data.
fence
acq_rel
agent
system
none
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
If OpenCL and
address space is
local, omit
vmcnt(0) and vscnt(0).
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
must happen after
any preceding
global/generic
load/load
atomic/
atomicrmw-with-return-value.
s_waitcnt vscnt(0)
must happen after
any preceding
global/generic
store/store atomic/
atomicrmw-no-return-value.
s_waitcnt lgkmcnt(0)
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
buffer_gl*_inv.
Ensures that the
preceding
global/local/generic
load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
acquire-fence-paired-atomic)
has completed
before invalidating
the caches. This
satisfies the
requirements of
acquire.
Ensures that all
previous memory
operations have
completed before a
following
global/local/generic
store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
release-fence-paired-atomic).
This satisfies the
requirements of
release.
buffer_gl1_inv;
buffer_gl0_inv
Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
Ensures that
following loads
will not see stale
global data. This
satisfies the
requirements of
acquire.
Sequential Consistent Atomic
load atomic
seq_cst
singlethread
wavefront
global
local
generic
Same as corresponding
load atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
workgroup
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0), and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt lgkmcnt(0) must
happen after
preceding
local/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
s_waitcnt vmcnt(0)
must happen after
preceding
global/generic load
atomic/
atomicrmw-with-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vmcnt(0) and so do
not need to be
considered.)
s_waitcnt vscnt(0)
Must happen after
preceding
global/generic store
atomic/
atomicrmw-no-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vscnt(0) and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global/local
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waitcnt
could be placed after
seq_store or before
the seq_load. We
choose the load to
make the s_waitcnt be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
workgroup
local
s_waitcnt vmcnt(0) & vscnt(0)
Could be split into
separate s_waitcnt
vmcnt(0) and s_waitcnt
vscnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt vmcnt(0)
Must happen after
preceding
global/generic load
atomic/
atomicrmw-with-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vmcnt(0) and so do
not need to be
considered.)
s_waitcnt vscnt(0)
Must happen after
preceding
global/generic store
atomic/
atomicrmw-no-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vscnt(0) and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waitcnt
could be placed after
seq_store or before
the seq_load. We
choose the load to
make the s_waitcnt be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
agent
system
global
generic
s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
vscnt(0) and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
s_waitcnt lgkmcnt(0)
must happen after
preceding
local load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
s_waitcnt vmcnt(0)
must happen after
preceding
global/generic load
atomic/
atomicrmw-with-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vmcnt(0) and so do
not need to be
considered.)
s_waitcnt vscnt(0)
Must happen after
preceding
global/generic store
atomic/
atomicrmw-no-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
vscnt(0) and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waitcnt
could be placed after
seq_store or before
the seq_load. We
choose the load to
make the s_waitcnt be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
store atomic
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding
store atomic release,
except must generate
all instructions even
for OpenCL.
atomicrmw
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding
atomicrmw acq_rel,
except must generate
all instructions even
for OpenCL.
fence
seq_cst
singlethread
wavefront
workgroup
agent
system
none
Same as corresponding
fence acq_rel,
except must generate
all instructions even
for OpenCL.
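The omission rules that recur in the fence entries above can be summarized in a short sketch. The helper name `fence_wait_counters` and its parameters are hypothetical and not part of LLVM; the function only restates the two OpenCL omission rules as worded in the table, together with the seq_cst rule that all instructions are generated even for OpenCL. As the table notes, LLVM currently has no address space on the fence, so a real code generator conservatively emits all three counters.

```python
def fence_wait_counters(ordering, is_opencl, address_space="generic"):
    """Counters a workgroup-or-wider fence waits on (illustrative helper).

    ordering:      "release", "acq_rel" or "seq_cst"
    address_space: "global", "local" or "generic"
    """
    counters = ["lgkmcnt(0)", "vmcnt(0)", "vscnt(0)"]
    # seq_cst must generate all instructions even for OpenCL.
    if not is_opencl or ordering == "seq_cst":
        return counters
    # If OpenCL and address space is not generic, omit lgkmcnt(0).
    if address_space != "generic":
        counters.remove("lgkmcnt(0)")
    # If OpenCL and address space is local, omit vmcnt(0) and vscnt(0).
    if address_space == "local":
        counters.remove("vmcnt(0)")
        counters.remove("vscnt(0)")
    return counters
```

For example, `fence_wait_counters("release", True, "global")` yields only the `vmcnt(0)` and `vscnt(0)` waits, while the same call with `"seq_cst"` yields all three.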
Each CU has multiple SIMDs that execute wavefronts.
The wavefronts for a single work-group are executed in the same
WGP.
In CU wavefront execution mode the wavefronts may be executed by different SIMDs
in the same CU.
In WGP wavefront execution mode the wavefronts may be executed by different SIMDs
in different CUs in the same WGP.
Each WGP has a single LDS memory shared by the wavefronts of the work-groups
executing on it.
All LDS operations of a WGP are performed as wavefront wide operations in a
global order and involve no caching. Completion is reported to a wavefront in
execution order.
The LDS memory has multiple request queues shared by the SIMDs of a
WGP. Therefore, the LDS operations performed by different wavefronts of a
work-group can be reordered relative to each other, which can result in
reordering the visibility of vector memory operations with respect to LDS
operations of other wavefronts in the same work-group. An s_wait_dscnt 0x0
is required to ensure synchronization between LDS operations and
vector memory operations between wavefronts of a work-group, but not between
operations performed by the same wavefront.
The vector memory operations are performed as wavefront wide operations.
Vector memory operations are divided into different types. Completion of a
vector memory operation is reported to a wavefront in-order within a type,
but may be out of order between types. The types of vector memory operations
(and their associated s_wait instructions) are:
LDS: s_wait_dscnt
Load (global, scratch, flat, buffer and image): s_wait_loadcnt
Store (global, scratch, flat, buffer and image): s_wait_storecnt
Sample and Gather4: s_wait_samplecnt
BVH: s_wait_bvhcnt
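Because completion is reported in order only within a type, waiting for all outstanding operations takes one wait per pending type. A minimal sketch of that bookkeeping follows; the dictionary `WAIT_FOR_TYPE`, its keys, and the helper `waits_needed` are illustrative names and not part of any LLVM API.

```python
# Wait instruction associated with each vector memory operation type,
# as listed above. The short keys are illustrative labels only.
WAIT_FOR_TYPE = {
    "lds": "s_wait_dscnt",
    "load": "s_wait_loadcnt",
    "store": "s_wait_storecnt",
    "sample": "s_wait_samplecnt",
    "bvh": "s_wait_bvhcnt",
}


def waits_needed(pending_types):
    """Return one 's_wait_* 0x0' per distinct pending operation type."""
    return [f"{WAIT_FOR_TYPE[t]} 0x0" for t in sorted(set(pending_types))]
```

A wavefront with two loads and one store outstanding would, under this sketch, need `s_wait_loadcnt 0x0` and `s_wait_storecnt 0x0`, but no wait on the other counters.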
Vector and scalar memory instructions contain a SCOPE field with values
corresponding to each cache level. The SCOPE determines whether a cache
can complete an operation locally or whether it needs to forward the operation
to the next cache level. The SCOPE values are:
SCOPE_CU: Compute Unit (NOTE: not affected by CU/WGP mode)
SCOPE_SE: Shader Engine
SCOPE_DEV: Device/Agent
SCOPE_SYS: System
When a memory operation with a given SCOPE reaches a cache with a smaller
SCOPE value, it is forwarded to the next level of cache.
When a memory operation with a given SCOPE reaches a cache with a SCOPE
value greater than or equal to its own, the operation can proceed:
Reads can hit in the cache.
Writes can happen in this cache and the transaction is acknowledged
from this level of cache.
RMW operations can be done locally.
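The forwarding rule above can be modelled as a walk along the cache hierarchy. The sketch below is a simplified model, not hardware behaviour: the `Scope` enum, the function `completing_level`, and the idea of passing the hierarchy as a list of per-cache scope values (nearest cache first) are all assumptions made for illustration, and only the relative order of the scopes matters.

```python
from enum import IntEnum


class Scope(IntEnum):
    # Ordered from narrowest to widest; numeric values are illustrative.
    CU = 0
    SE = 1
    DEV = 2
    SYS = 3


def completing_level(op_scope, cache_scopes):
    """Return the scope of the first cache where the operation can proceed.

    cache_scopes lists the SCOPE value of each cache, nearest first.
    Returns None if every cache forwards the operation onward.
    """
    for cache_scope in cache_scopes:
        if cache_scope >= op_scope:
            return cache_scope  # operation can complete at this cache
        # cache_scope < op_scope: forwarded to the next level of cache
    return None
```

With a hypothetical hierarchy `[Scope.CU, Scope.SE, Scope.DEV]`, a `Scope.DEV` operation is forwarded past the first two caches and completes at the third, while a `Scope.SYS` operation is forwarded past all of them.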
global_inv, global_wb and global_wbinv instructions are used to
invalidate, write-back and write-back+invalidate caches. The affected
cache(s) are controlled by the SCOPE: of the instruction.
global_inv invalidates caches whose scope is strictly smaller than the
instruction’s. The invalidation requests cannot be reordered with pending or
upcoming memory operations.
global_wb is a writeback operation that additionally ensures previous
memory operations done at a lower scope level have reached the SCOPE:
of the global_wb.
global_wb can be omitted for scopes other than SCOPE_SYS in
gfx120x.
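As a rough illustration of the two rules above, the sketch below lists which cache scopes a global_inv affects and when a global_wb may be skipped on gfx120x. The function names and the `SCOPES` list are hypothetical and exist only to restate the text.

```python
# Scopes ordered from narrowest to widest, as listed above.
SCOPES = ["SCOPE_CU", "SCOPE_SE", "SCOPE_DEV", "SCOPE_SYS"]


def invalidated_by_global_inv(inst_scope):
    """Scopes of the caches invalidated: strictly smaller than inst_scope."""
    return SCOPES[:SCOPES.index(inst_scope)]


def global_wb_may_be_omitted(inst_scope):
    """On gfx120x, global_wb can be omitted for scopes other than SCOPE_SYS."""
    return inst_scope != "SCOPE_SYS"
```

Under this model a global_inv with `SCOPE_SE` invalidates only caches at `SCOPE_CU`, and a global_inv with `SCOPE_CU` invalidates nothing.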
The vector memory operations access a vector L0 cache. There is a single L0
cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
special action is required for coherence between the lanes of a single
wavefront. To achieve coherence between wavefronts executing in the same
work-group:
In CU wavefront execution mode, no special action is required.
In WGP wavefront execution mode, a global_inv scope:SCOPE_SE is required
as wavefronts may be executing on SIMDs of different CUs that access different L0s.
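The dependence on wavefront execution mode can be written as a one-line decision. This is only a sketch of the rule just stated; the function name and the boolean parameter are hypothetical.

```python
def workgroup_coherence_invalidate(wgp_mode):
    """Return the invalidate needed for work-group coherence, or None.

    wgp_mode is True for WGP wavefront execution mode and False for
    CU wavefront execution mode.
    """
    # In CU mode all wavefronts of the work-group share one L0 cache.
    return "global_inv scope:SCOPE_SE" if wgp_mode else None
```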
The scalar memory operations access a scalar L0 cache shared by all wavefronts
on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
operations are used in a restricted way so do not impact the memory model. See
Memory Spaces.
The vector and scalar memory L0 caches use an L1 buffer shared by all WGPs on
the same SA. The L1 buffer acts as a bridge to L2 for clients within a SA.
The L1 buffers have independent quadrants to service disjoint ranges of virtual
addresses.
Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
vector and scalar memory operations performed by different wavefronts, whether
executing in the same or different work-groups (which may be executing on
different CUs accessing different L0s), can be reordered relative to each
other. Some or all of the wait instructions below are required to ensure
synchronization between vector memory operations of different wavefronts. They
ensure a previous vector memory operation has completed before executing a
subsequent vector memory or LDS operation and so can be used to meet the
requirements of acquire, release and sequential consistency.
s_wait_loadcnt 0x0
s_wait_samplecnt 0x0
s_wait_bvhcnt 0x0
s_wait_storecnt 0x0
The L1 buffers use an L2 cache shared by all SAs on the same agent.
The L2 cache has independent channels to service disjoint ranges of virtual
addresses.
Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
quadrant has a separate request queue per L2 channel. Therefore, the vector
and scalar memory operations performed by wavefronts executing in different
work-groups (which may be executing on different SAs) of an agent can be
reordered relative to each other. Some or all of the wait instructions below are
required to ensure synchronization between vector memory operations of
different SAs. They ensure a previous vector memory operation has completed
before executing a subsequent vector memory operation and so can be used to meet the
requirements of acquire, release and sequential consistency.
s_wait_loadcnt 0x0
s_wait_samplecnt 0x0
s_wait_bvhcnt 0x0
s_wait_storecnt 0x0
The L2 cache can be kept coherent with other agents, or ranges
of virtual addresses can be set up to bypass it to ensure system coherence.
A memory attached last level (MALL) cache exists for GPU memory.
The MALL cache is fully coherent with GPU memory and has no impact on system
coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope const variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
Memory Spaces.
For kernarg backing memory:
CP invalidates caches at the start of each kernel dispatch.
On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
needing to invalidate the L2 cache.
On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
so the L2 cache will be coherent with the CPU and other agents.
Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from L0.
Wavefronts can be executed in WGP or CU wavefront execution mode:
In WGP wavefront execution mode the wavefronts of a work-group are executed
on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
CU L0 caches is required for work-group synchronization. Also accesses to L1
at work-group scope need to be explicitly ordered as the accesses from
different CUs are not ordered.
In CU wavefront execution mode the wavefronts of a work-group are executed on
the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
the work-group use the same L0, which in turn ensures L1 accesses are
ordered, so no explicit management of the caches is required for
work-group synchronization.
The table applies if and only if it is directly referenced by an entry in
AMDHSA Memory Model Code Sequences GFX12, and it applies only to
the instruction in the code sequence that references the table.
Note: we don’t have to use
s_wait_samplecnt 0x0 or
s_wait_bvhcnt 0x0 because
there are no atomic sample or
BVH instructions that the fence
could pair with.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0
must happen after
any preceding
global/generic load
atomic/
atomicrmw-with-return-value
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_wait_storecnt 0x0
must happen after
any preceding
global/generic
atomicrmw-no-return-value
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_wait_dscnt 0x0
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
global_inv.
Ensures that the
fence-paired atomic
has completed
before invalidating
the
cache. Therefore
any following
locations read must
be no older than
the value read by
the
fence-paired-atomic.
global_inv scope:SCOPE_SE
If CU wavefront execution
mode, omit.
Ensures that
following
loads will not see
stale data.
Note: we don’t have to use
s_wait_samplecnt 0x0 or
s_wait_bvhcnt 0x0 because
there are no atomic sample or
BVH instructions that the fence
could pair with.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0
must happen after
any preceding
global/generic load
atomic/
atomicrmw-with-return-value
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_wait_storecnt 0x0
must happen after
any preceding
global/generic
atomicrmw-no-return-value
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
s_wait_dscnt 0x0
must happen after
any preceding
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Must happen before
the following
global_inv.
Ensures that the
fence-paired atomic
has completed
before invalidating the
caches. Therefore
any following
locations read must
be no older than
the value read by
the
fence-paired-atomic.
Ensures that
following
loads will not see
stale data.
Release Atomic
store atomic
release
singlethread
wavefront
global
local
generic
buffer/global/ds/flat_store
store atomic
release
workgroup
global
generic
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_dscnt 0x0.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0,
s_wait_samplecnt 0x0 and
s_wait_bvhcnt 0x0
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_wait_storecnt 0x0
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_wait_dscnt 0x0
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
store that is being
released.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0,
s_wait_samplecnt 0x0 and
s_wait_bvhcnt 0x0
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_wait_storecnt 0x0
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
Must happen before the
following store.
Ensures that all
global memory
operations have
completed before
performing the
store that is being
released.
ds_store
store atomic
release
agent
system
global
generic
global_wb scope:SCOPE_SYS
If agent scope, omit.
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_dscnt 0x0.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0,
s_wait_samplecnt 0x0 and
s_wait_bvhcnt 0x0
must happen after
any preceding
global/generic
load/load
atomic/
atomicrmw-with-return-value.
s_wait_storecnt 0x0
must happen after
global_wb if present, or
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_wait_dscnt 0x0
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before the
following store.
Ensures that all
memory operations
have
completed before
performing the
store that is being
released.
If OpenCL and CU wavefront
execution mode, omit all.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0,
s_wait_samplecnt 0x0 and
s_wait_bvhcnt 0x0
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_wait_storecnt 0x0
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_wait_dscnt 0x0
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before the
following atomic.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0,
s_wait_samplecnt 0x0 and
s_wait_bvhcnt 0x0
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_wait_storecnt 0x0
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
Must happen before the
following atomic.
Ensures that all
global memory
operations have
completed before
performing the
atomicrmw that is being
released.
ds_atomic
atomicrmw
release
agent
system
global
generic
global_wb scope:SCOPE_SYS
If agent scope, omit.
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_dscnt 0x0.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0,
s_wait_samplecnt 0x0 and
s_wait_bvhcnt 0x0
must happen after
any preceding
global/generic
load/load atomic/
atomicrmw-with-return-value.
s_wait_storecnt 0x0
must happen after
global_wb if present, or
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_wait_dscnt 0x0
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before the
following atomic.
Ensures that all
memory operations
to global and local
have completed
before performing
the atomicrmw that
is being released.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0,
s_wait_samplecnt 0x0 and
s_wait_bvhcnt 0x0
must happen after
any preceding
global/generic
load/load
atomic/
atomicrmw-with-return-value.
s_wait_storecnt 0x0
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_wait_dscnt 0x0
must happen after
any preceding
local/generic
load/store/load
atomic/store atomic/
atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0,
s_wait_samplecnt 0x0 and
s_wait_bvhcnt 0x0
must happen after
any preceding
global/generic
load/load atomic/
atomicrmw-with-return-value.
s_wait_storecnt 0x0
must happen after
global_wb if present, or
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_wait_dscnt 0x0
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
any following store
atomic/atomicrmw
with an equal or
wider sync scope
and memory ordering
stronger than
unordered (this is
termed the
fence-paired-atomic).
Ensures that all
memory operations
have
completed before
performing the
following
fence-paired-atomic.
Acquire-Release Atomic
atomicrmw
acq_rel
singlethread
wavefront
global
local
generic
buffer/global/ds/flat_atomic
atomicrmw
acq_rel
workgroup
global
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_dscnt 0x0.
Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0,
s_wait_samplecnt 0x0 and
s_wait_bvhcnt 0x0
must happen after
any preceding
global/generic load/load
atomic/
atomicrmw-with-return-value.
s_wait_storecnt 0x0
must happen after
any preceding
global/generic
store/store
atomic/
atomicrmw-no-return-value.
s_wait_dscnt0x0
must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
Must happen before
the following
atomicrmw.
Ensures that all
memory operations
have
completed before
performing the
atomicrmw that is
being released.
Ensures any
following global
data read is no
older than the
atomicrmw value
being acquired.
global_invscope:SCOPE_SE
If CU wavefront execution
mode, omit.
Ensures that
following
loads will not see
stale data.
atomicrmw
acq_rel
workgroup
local
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0, s_wait_samplecnt 0x0 and s_wait_bvhcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
Must happen before the following store.
Ensures that all global memory operations have completed before performing the store that is being released.
ds_atomic
s_wait_dscnt 0x0
If OpenCL, omit.
Must happen before the following global_inv.
Ensures any following global data read is no older than the local load atomic value being acquired.
global_inv scope:SCOPE_SE
If CU wavefront execution mode, omit.
If OpenCL, omit.
Ensures that following loads will not see stale data.
atomicrmw
acq_rel
workgroup
generic
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_loadcnt 0x0.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0, s_wait_samplecnt 0x0 and s_wait_bvhcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before the following atomicrmw.
Ensures that all memory operations have completed before performing the atomicrmw that is being released.
Ensures any following global data read is no older than the load atomic value being acquired.
global_inv scope:SCOPE_SE
If CU wavefront execution mode, omit.
Ensures that following loads will not see stale data.
atomicrmw
acq_rel
agent
system
global
global_wb scope:SCOPE_SYS
If agent scope, omit.
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_dscnt 0x0.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0, s_wait_samplecnt 0x0 and s_wait_bvhcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after global_wb if present, or any preceding global/generic store/store atomic/atomicrmw-no-return-value.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before the following atomicrmw.
Ensures that all memory operations to global have completed before performing the atomicrmw that is being released.
Must happen before any following global/generic load/load atomic/atomicrmw.
Ensures that following loads will not see stale global data.
atomicrmw
acq_rel
agent
system
generic
global_wb scope:SCOPE_SYS
If agent scope, omit.
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_dscnt 0x0.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0, s_wait_samplecnt 0x0 and s_wait_bvhcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after global_wb if present, or any preceding global/generic store/store atomic/atomicrmw-no-return-value.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before the following atomicrmw.
Ensures that all memory operations have completed before performing the atomicrmw that is being released.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0, s_wait_samplecnt 0x0 and s_wait_bvhcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
Ensures that all memory operations have completed before performing any following global memory operations.
Ensures that the preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before following global memory operations. This satisfies the requirements of acquire.
Ensures that all previous memory operations have completed before a following local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.
Must happen before the following global_inv.
Ensures that the acquire-fence-paired atomic has completed before invalidating the cache. Therefore any following locations read must be no older than the value read by the acquire-fence-paired-atomic.
global_inv scope:SCOPE_SE
If CU wavefront execution mode, omit.
Ensures that following loads will not see stale data.
fence
acq_rel
agent
system
none
global_wb scope:SCOPE_SYS
If agent scope, omit.
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL and address space is not generic, omit s_wait_dscnt 0x0.
If OpenCL and address space is local, omit all but s_wait_dscnt 0x0.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0, s_wait_samplecnt 0x0 and s_wait_bvhcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after global_wb if present, or any preceding global/generic store/store atomic/atomicrmw-no-return-value.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before the following global_inv.
Ensures that the preceding global/local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before invalidating the caches. This satisfies the requirements of acquire.
Ensures that all previous memory operations have completed before a following global/local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.
Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
Ensures that following loads will not see stale global data. This satisfies the requirements of acquire.
Sequentially Consistent Atomic
load atomic
seq_cst
singlethread
wavefront
global
local
generic
Same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
load atomic
seq_cst
workgroup
global
generic
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_dscnt 0x0.
The waits can be independently moved according to the following rules:
s_wait_dscnt 0x0 must happen after preceding local/generic load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_wait_dscnt 0x0 and so do not need to be considered.)
s_wait_loadcnt 0x0, s_wait_samplecnt 0x0 and s_wait_bvhcnt 0x0 must happen after preceding global/generic load atomic/atomicrmw-with-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own waits and so do not need to be considered.)
s_wait_storecnt 0x0 must happen after preceding global/generic store atomic/atomicrmw-no-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_wait_storecnt 0x0 and so do not need to be considered.)
Ensures any preceding sequentially consistent global/local memory instructions have completed before executing this sequentially consistent instruction. This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the s_waits of the release, but there is nothing preventing a store release followed by load acquire from completing out of order. The s_waits could be placed after seq_store or before the seq_load. We choose the load to make the s_waits be as late as possible so that the store may have already completed.)
Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
load atomic
seq_cst
workgroup
local
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit all.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0, s_wait_samplecnt 0x0 and s_wait_bvhcnt 0x0 must happen after preceding global/generic load atomic/atomicrmw-with-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waits and so do not need to be considered.)
s_wait_storecnt 0x0 must happen after preceding global/generic store atomic/atomicrmw-no-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_wait_storecnt 0x0 and so do not need to be considered.)
Ensures any preceding sequentially consistent global memory instructions have completed before executing this sequentially consistent instruction. This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the s_waits of the release, but there is nothing preventing a store release followed by load acquire from completing out of order. The s_waits could be placed after seq_store or before the seq_load. We choose the load to make the s_waits be as late as possible so that the store may have already completed.)
Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
load atomic
seq_cst
agent
system
global
generic
s_wait_bvhcnt 0x0
s_wait_samplecnt 0x0
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_dscnt 0x0.
The waits can be independently moved according to the following rules:
s_wait_dscnt 0x0 must happen after preceding local load atomic/store atomic/atomicrmw with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_wait_dscnt 0x0 and so do not need to be considered.)
s_wait_loadcnt 0x0, s_wait_samplecnt 0x0 and s_wait_bvhcnt 0x0 must happen after preceding global/generic load atomic/atomicrmw-with-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_waits and so do not need to be considered.)
s_wait_storecnt 0x0 must happen after preceding global/generic store atomic/atomicrmw-no-return-value with memory ordering of seq_cst and with equal or wider sync scope. (Note that seq_cst fences have their own s_wait_storecnt 0x0 and so do not need to be considered.)
Ensures any preceding sequentially consistent global memory instructions have completed before executing this sequentially consistent instruction. This prevents reordering a seq_cst store followed by a seq_cst load. (Note that seq_cst is stronger than acquire/release as the reordering of load acquire followed by a store release is prevented by the s_waits of the release, but there is nothing preventing a store release followed by load acquire from completing out of order. The s_waits could be placed after seq_store or before the seq_load. We choose the load to make the s_waits be as late as possible so that the store may have already completed.)
Following instructions same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
store atomic
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding store atomic release, except must generate all instructions even for OpenCL.
atomicrmw
seq_cst
singlethread
wavefront
workgroup
agent
system
global
local
generic
Same as corresponding atomicrmw acq_rel, except must generate all instructions even for OpenCL.
fence
seq_cst
singlethread
wavefront
workgroup
agent
system
none
Same as corresponding fence acq_rel, except must generate all instructions even for OpenCL.
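The seq_cst load rows above note that nothing in acquire/release alone prevents a store followed by a load from completing out of order. The following is a minimal sketch of the classic store-buffering litmus test that illustrates why this matters. It is an abstract interleaving model, not a model of the hardware, and the function name `sb_outcomes` and the variable names are hypothetical.

```python
from itertools import permutations


def sb_outcomes(allow_store_load_reorder):
    """Enumerate (r0, r1) results of the store-buffering litmus test.

    Thread 0: store x = 1; r0 = load y
    Thread 1: store y = 1; r1 = load x
    """
    t0 = [("st", "x"), ("ld", "y", "r0")]
    t1 = [("st", "y"), ("ld", "x", "r1")]
    # Optionally let each thread's load complete before its own earlier store.
    orders0 = [t0] + ([t0[::-1]] if allow_store_load_reorder else [])
    orders1 = [t1] + ([t1[::-1]] if allow_store_load_reorder else [])
    outcomes = set()
    for a in orders0:
        for b in orders1:
            # Every interleaving that keeps each thread's chosen order.
            for schedule in set(permutations([0, 0, 1, 1])):
                mem = {"x": 0, "y": 0}
                regs = {}
                ia = ib = 0
                for who in schedule:
                    if who == 0:
                        op = a[ia]
                        ia += 1
                    else:
                        op = b[ib]
                        ib += 1
                    if op[0] == "st":
                        mem[op[1]] = 1
                    else:
                        regs[op[2]] = mem[op[1]]
                outcomes.add((regs["r0"], regs["r1"]))
    return outcomes


ordered = sb_outcomes(allow_store_load_reorder=False)
reordered = sb_outcomes(allow_store_load_reorder=True)
```

With the store kept ahead of the load, both threads can never read 0. Once the load may pass the store, that outcome becomes possible, which is the reordering the waits before a seq_cst load are there to rule out.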
Each WGP has 4 SIMD32 (2 SIMD32-pairs) that execute wavefronts.
The wavefronts for a single work-group are executed in the same
WGP.
Device Memory:
Each WGP has a single write-through WGP cache (WGP$) shared by the wavefronts of the
work-groups executing on it. The WGP$ is divided between LDS and vector L0 memory.
Vector L0 memory holds clean data only.
Each WGP$ has two request queues; one per SIMD32-pair.
Each queue can handle both LDS and vector L0 requests. Requests in one queue
are executed serially and in-order, but are not kept in order with the other queue.
The scalar memory operations access a scalar L0 cache shared by all wavefronts
on a WGP. The scalar and vector L0 caches are not kept coherent by hardware. However, scalar
operations are used in a restricted way so do not impact the memory model. See
Memory Spaces.
The vector and scalar memory L0 caches are both clients of an L1 buffer shared by
all WGPs on the same SE.
Each L1 buffer has a separate request queue for each WGP$ it serves. Requests in one queue
are executed serially and in-order, but are not kept in order with other queues.
L1 buffers are clients of the L2 cache.
There may be multiple L2 caches per agent. Ranges of virtual addresses can be set up as follows:
Be non-hardware-coherent; copies of the data are not coherent between multiple L2s.
Be read-write hardware-coherent with other L2 caches on the same or other agents.
Bypass L2 entirely to ensure system coherence.
L2 caches have multiple memory channels to service disjoint ranges of virtual
addresses.
Memory Model:
Note
This section is currently incomplete as work on the compiler is still ongoing.
The following is a non-exhaustive list of unimplemented/undocumented features:
non-volatile bit code sequences, globally accessing scratch atomics,
multicast loads, barriers (including split barriers) and cooperative atomics.
Scalar operations memory model needs more elaboration as well.
Vector memory operations are performed as wavefront wide operations, with the
EXEC mask predicating which lanes execute.
Consecutive vector memory operations from the same wavefront are issued in program order.
Vector memory operations are issued (and executed) in no particular order between wavefronts.
Wave execution of a vector memory operation instruction issues (initiates) the operation,
but completion occurs an unspecified amount of time later.
The s_wait_*cnt instructions must be used to determine if the operation has completed.
The types of vector memory operations (and their associated s_wait_*cnt instructions) are:
Load (global, scratch, flat, buffer): s_wait_loadcnt
Store (global, scratch, flat, buffer): s_wait_storecnt
non-ASYNC LDS: s_wait_dscnt
ASYNC LDS: s_wait_asynccnt
Tensor: s_wait_tensorcnt
s_wait_xcnt waits on a counter that is incremented when a memory operation is issued, and
decremented when memory address translation for that instruction is completed.
Waiting on a memory counter s_wait_*cnt N also waits on s_wait_xcnt N.
s_wait_xcnt 0x0 is required before flat and global atomic stores/read-modify-write
operations to guarantee atomicity during an xnack replay.
Within a wavefront, vector memory operation completion (s_wait_*cnt decrement) is
reported in order of issue within a type, but in no particular order between types.
Within a wavefront, the order in which data is returned to registers by a vector memory
operation can be different from the order in which the vector memory operations were issued.
Thus, an s_wait_*cnt instruction must be used to prevent multiple vector memory operations
that return results to the same register from executing concurrently as they may not return
their results in instruction issue order, even though they will be reported as completed in
instruction issue order by the decrementing of the counter.
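The counter behavior described above can be sketched with a small toy model. This is a minimal sketch and not the hardware implementation: the class `WaveCounters` and its method names are hypothetical, the "load" entry is inferred from the code sequence tables (which wait on loads with s_wait_loadcnt), and the "at most N outstanding" wait semantics is an assumption of the model.

```python
from collections import deque


class WaveCounters:
    """Toy per-wavefront model of the s_wait_*cnt counters (hypothetical)."""

    # Operation type -> counter. The "load" entry is inferred from the code
    # sequence tables, which wait on loads with s_wait_loadcnt.
    COUNTER_FOR = {"load": "loadcnt", "store": "storecnt", "lds": "dscnt"}

    def __init__(self):
        self.pending = {name: deque() for name in self.COUNTER_FOR.values()}

    def issue(self, op_type, tag):
        """Issuing an operation increments the counter for its type."""
        self.pending[self.COUNTER_FOR[op_type]].append(tag)

    def report_completion(self, counter):
        """Completion is reported in issue order within a type."""
        return self.pending[counter].popleft()

    def wait_satisfied(self, counter, n):
        """Assumption: a wait on N is satisfied once at most N ops are outstanding."""
        return len(self.pending[counter]) <= n


wave = WaveCounters()
wave.issue("load", "ld0")
wave.issue("load", "ld1")
wave.issue("store", "st0")

blocked = wave.wait_satisfied("loadcnt", 0)      # two loads still outstanding
first_done = wave.report_completion("loadcnt")   # oldest load is reported first
partial = wave.wait_satisfied("loadcnt", 1)      # one load may remain
store_pending = not wave.wait_satisfied("storecnt", 0)
```

The model only tracks completion reporting. It deliberately says nothing about the order in which data lands in registers, which the text above notes can differ from issue order.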
Within a wavefront, consecutive loads and stores to the same address will be processed in program order
by the memory subsystem. Loads and stores to different addresses may be processed
out of order with respect to each other.
All non-ASYNC LDS vector memory operations of a WGP are performed as wavefront wide
operations in a global order and involve no caching. Completion is reported to a wavefront in
execution order.
ASYNC LDS and tensor vector memory operations are not covered by the memory model implemented
by the AMDGPU backend. Neither s_wait_asynccnt nor s_wait_tensorcnt is inserted
automatically. They must be emitted using compiler built-in calls.
Some vector memory operations contain a SCOPE field with values
corresponding to each cache level. The SCOPE determines whether a cache
can complete an operation locally or whether it needs to forward the operation
to the next cache level. The SCOPE values are:
SCOPE_CU: WGP
SCOPE_SE: Shader Engine
SCOPE_DEV: Device/Agent
SCOPE_SYS: System
Each cache is assigned a SCOPE by the hardware depending on the agent’s
configuration.
This ensures that SCOPE_DEV can always be used to implement agent coherence,
even in the presence of multiple non-coherent L2 caches on the same agent.
When a vector memory operation with a given SCOPE reaches a cache with a smaller
SCOPE value, it is forwarded to the next level of cache.
When a vector memory operation with a given SCOPE reaches a cache with a SCOPE
value greater than or equal to its own, the operation can proceed:
Reads can hit into the cache.
Writes can happen in this cache and completion (s_wait decrement) can be
reported.
RMW operations can be done locally.
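The forwarding rule above can be sketched as a walk through the cache hierarchy. This is a minimal sketch under stated assumptions: the function name `servicing_level` is hypothetical, and the scopes assigned to each cache level in `HIERARCHY` are illustrative only, since the text says the hardware assigns them depending on the agent's configuration.

```python
from enum import IntEnum


class Scope(IntEnum):
    CU = 0   # SCOPE_CU: WGP
    SE = 1   # SCOPE_SE: Shader Engine
    DEV = 2  # SCOPE_DEV: Device/Agent
    SYS = 3  # SCOPE_SYS: System


def servicing_level(op_scope, cache_scopes):
    """Return the index of the first cache that may complete the operation.

    A cache with a smaller SCOPE than the operation forwards it to the next
    level. A cache with a SCOPE greater than or equal to the operation's can
    proceed. None means the operation passed every cache in the list.
    """
    for level, cache_scope in enumerate(cache_scopes):
        if cache_scope >= op_scope:
            return level
    return None


# Hypothetical assignment: WGP$ = CU, L1 = SE, L2 = DEV.
HIERARCHY = [Scope.CU, Scope.SE, Scope.DEV]

cu_level = servicing_level(Scope.CU, HIERARCHY)
se_level = servicing_level(Scope.SE, HIERARCHY)
dev_level = servicing_level(Scope.DEV, HIERARCHY)
sys_level = servicing_level(Scope.SYS, HIERARCHY)
```

Under this assumed assignment a SCOPE_SYS operation is forwarded past all three caches, which corresponds to bypassing the caches for system coherence.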
Some memory operations contain an nv bit, for “non-volatile”, which indicates
memory that is not expected to change during a kernel’s execution.
This information is propagated to the cache lines for that address
(referred to as $nv).
When nv=0 reads hit dirty $nv=1 data in cache, the hardware will
write back the data to the next level in the hierarchy and then subsequently read
it again, updating the cache line with a clean $nv=0 copy of the data.
global_inv, global_wb and global_wbinv are cache control instructions.
The affected cache(s) are controlled by the SCOPE of the instruction.
Only caches whose scope is strictly smaller than the instruction’s are affected.
global_inv invalidates the data in affected caches so that subsequent reads
will re-read from the next level in the cache hierarchy.
The invalidation requests cannot be reordered with pending or upcoming
memory operations. Instruction completion is reported using s_wait_loadcnt.
global_wb flushes the dirty data in affected caches to the next level in
the cache hierarchy. This instruction additionally ensures that previous
memory operations done at a lower scope level have reached the desired
SCOPE. Instruction completion is reported using s_wait_storecnt once
all data has been acknowledged by the next level in the cache hierarchy.
global_wbinv performs a global_inv then a global_wb.
Instruction completion is reported using s_wait_storecnt.
global_inv, global_wb and global_wbinv with nv=0 can only
affect $nv=0 cache lines, whereas nv=1 can affect all cache lines.
global_inv, global_wb and global_wbinv behave like memory operations
issued to every address at the same time. They are kept in order with other
memory operations from the same wave.
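The selection rules for the cache control instructions can be sketched as two predicates. This is a minimal sketch, assuming an illustrative scope assignment per cache level; the names `affected_caches`, `line_affected` and `HIERARCHY` are hypothetical and the real assignment is made by the hardware.

```python
# SCOPE values ordered from narrowest to widest.
SCOPE_CU, SCOPE_SE, SCOPE_DEV, SCOPE_SYS = range(4)

# Hypothetical assignment: WGP$ = CU, L1 = SE, L2 = DEV.
HIERARCHY = [SCOPE_CU, SCOPE_SE, SCOPE_DEV]


def affected_caches(instr_scope, cache_scopes):
    """Only caches whose scope is strictly smaller than the instruction's are affected."""
    return [level for level, scope in enumerate(cache_scopes) if scope < instr_scope]


def line_affected(instr_nv, line_nv):
    """nv=0 instructions only affect $nv=0 lines; nv=1 instructions affect all lines."""
    return instr_nv == 1 or line_nv == 0


inv_cu = affected_caches(SCOPE_CU, HIERARCHY)
inv_se = affected_caches(SCOPE_SE, HIERARCHY)
wb_sys = affected_caches(SCOPE_SYS, HIERARCHY)
```

Under this assumed assignment an instruction at SCOPE_CU affects no cache, while one at SCOPE_SYS affects every level listed.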
global_load_monitor_* and flat_load_monitor_* instructions load
data and request that the wave is notified (see s_monitor_sleep) if
the L2 cache line that holds the data is evicted, or written to.
In order to monitor a cache line in the L2 cache, these instructions must
ensure that the L2 cache is always hit by setting the SCOPE of the instruction
appropriately.
For non-atomic and atomic code sequences, it is valid to replace
global_load_b32/64/128 with a global_load_monitor_b32/64/128 and a
flat_load_b32/64/128 with a flat_load_monitor_b32/64/128.
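The substitution described above is a purely textual mapping between mnemonics, which can be sketched as follows. The helper name `to_monitor` is hypothetical and the sketch only rewrites the six mnemonics named in the text; it is not how the backend selects instructions.

```python
import re

# Matches global_load_b32/64/128 and flat_load_b32/64/128 only.
_PLAIN_LOAD = re.compile(r"^(global|flat)_load_(b32|b64|b128)$")


def to_monitor(mnemonic):
    """Return the monitor variant of a plain load, or the input unchanged."""
    match = _PLAIN_LOAD.match(mnemonic)
    if match is None:
        return mnemonic
    return f"{match.group(1)}_load_monitor_{match.group(2)}"


global_variant = to_monitor("global_load_b32")
flat_variant = to_monitor("flat_load_b128")
untouched = to_monitor("global_store_b32")
```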
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope const variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
Memory Spaces.
Atomics in the scratch address space are handled as follows:
Data types <= 32 bits: The instruction is converted into an atomic in the
generic (flat) address space. All properties of the atomic
(atomic ordering, volatility, alignment, etc.) are preserved.
Refer to the generic address space code sequences for further information.
Data types >32 bits: unsupported and an error is emitted.
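The two cases above amount to a size check, sketched below. The function name `lower_scratch_atomic` and the string it returns are hypothetical; the sketch only mirrors the decision, not the actual lowering code.

```python
def lower_scratch_atomic(bit_width):
    """Pick the address space used for a scratch atomic of the given width.

    Widths up to 32 bits are rewritten to the generic (flat) address space
    with all atomic properties preserved. Wider types are rejected.
    """
    if bit_width <= 32:
        return "generic"
    raise ValueError("scratch atomics wider than 32 bits are unsupported")


narrow = lower_scratch_atomic(32)
try:
    lower_scratch_atomic(64)
    wide_rejected = False
except ValueError:
    wide_rejected = True
```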
The table applies if and only if it is directly referenced by an entry in
AMDHSA Memory Model Code Sequences GFX125x, and it only applies to
the instruction in the code sequence that references the table.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic load atomic/atomicrmw-with-return-value with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
s_wait_storecnt 0x0 must happen after any preceding global/generic atomicrmw-no-return-value with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
s_wait_dscnt 0x0 must happen after any preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
Ensures that the fence-paired atomic has completed before invalidating the cache. Therefore any following locations read must be no older than the value read by the fence-paired-atomic.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic load atomic/atomicrmw-with-return-value with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
s_wait_storecnt 0x0 must happen after any preceding global/generic atomicrmw-no-return-value with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
s_wait_dscnt 0x0 must happen after any preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
Must happen before the following global_inv.
Ensures that the fence-paired atomic has completed before invalidating the caches. Therefore any following locations read must be no older than the value read by the fence-paired-atomic.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic load atomic/atomicrmw-with-return-value with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
s_wait_storecnt 0x0 must happen after any preceding global/generic atomicrmw-no-return-value with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
s_wait_dscnt 0x0 must happen after any preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
Must happen before the following global_inv.
Ensures that the fence-paired atomic has completed before invalidating the caches. Therefore any following locations read must be no older than the value read by the fence-paired-atomic.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before any following store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
Ensures that all memory operations have completed before performing the following fence-paired-atomic.
fence
release
agent
system
none
s_wait_loadcnt 0x0
s_wait_storecnt 0x0
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
Must happen before the following global_wb.
Ensures that all global memory store/rmw operations have completed before their data is written back.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after global_wb or any preceding global/generic store/store atomic/atomicrmw-no-return-value.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before any following store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the fence-paired-atomic).
Ensures that all memory operations have completed before performing the following fence-paired-atomic.
Acquire-Release Atomic
atomicrmw
acq_rel
singlethread
wavefront
global
local
generic
s_wait_xcnt 0x0
Ensures the operation remains atomic even during an xnack replay.
Only needed for flat and global operations.
buffer/global/ds/flat_atomic
atomicrmw
acq_rel
workgroup
cluster
global
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit s_wait_dscnt 0x0.
Must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before the following atomicrmw.
Ensures that all memory operations have completed before performing the atomicrmw that is being released.
s_wait_xcnt 0x0
Ensures the operation remains atomic even during an xnack replay.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
Ensures that all memory operations have completed before performing any following global memory operations.
Ensures that the preceding local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before following global memory operations. This satisfies the requirements of acquire.
Ensures that all previous memory operations have completed before a following local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.
Ensures that the acquire-fence-paired atomic has completed before invalidating the cache. Therefore any following locations read must be no older than the value read by the acquire-fence-paired-atomic.
fence
acq_rel
agent
system
none
s_wait_loadcnt 0x0
s_wait_storecnt 0x0
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after any preceding global/generic store/store atomic/atomicrmw-no-return-value.
Must happen before the following global_wb.
Ensures that all global memory store/rmw operations have completed before their data is written back.
The waits can be independently moved according to the following rules:
s_wait_loadcnt 0x0 must happen after any preceding global/generic load/load atomic/atomicrmw-with-return-value.
s_wait_storecnt 0x0 must happen after global_wb.
s_wait_dscnt 0x0 must happen after any preceding local/generic load/store/load atomic/store atomic/atomicrmw.
Must happen before the following global_inv.
Ensures that the preceding global/local/generic load atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the acquire-fence-paired-atomic) has completed before invalidating the caches. This satisfies the requirements of acquire.
Ensures that all previous memory operations have completed before a following global/local/generic store atomic/atomicrmw with an equal or wider sync scope and memory ordering stronger than unordered (this is termed the release-fence-paired-atomic). This satisfies the requirements of release.
Must happen before any following global/generic load/load atomic/store/store atomic/atomicrmw.
Ensures that following loads will not see stale global data. This satisfies the requirements of acquire.
s_wait_loadcnt 0x0 must happen after global_inv and before subsequent memory operations.
Sequentially Consistent Atomic
load atomic
seq_cst
singlethread
wavefront
global
local
generic
Same as corresponding load atomic acquire, except must generate all instructions even for OpenCL.
load atomic
seq_cst
workgroup
global
generic
s_wait_storecnt0x0
s_wait_loadcnt0x0
s_wait_dscnt0x0
If OpenCL, omit
s_wait_dscnt0x0
The waits can be
independently moved
according to the
following rules:
s_wait_dscnt0x0 must
happen after
preceding
local/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_wait_dscnt0x0
and so do not need to be
considered.)
s_wait_loadcnt0x0
must happen after
preceding
global/generic load
atomic/
atomicrmw-with-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own waits and so do
not need to be
considered.)
s_wait_storecnt 0x0
Must happen after
preceding
global/generic store
atomic/
atomicrmw-no-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_wait_storecnt 0x0
and so do not need to be
considered.)
Ensures any
preceding
sequentially
consistent global/local
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waits of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waits
could be placed after
seq_store or before
the seq_load. We
choose the load to
make the s_waits be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
workgroup
local
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit all.
The waits can be
independently moved
according to the
following rules:
s_wait_loadcnt 0x0
must happen after
preceding
global/generic load
atomic/
atomicrmw-with-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waits and so do
not need to be
considered.)
s_wait_storecnt 0x0
Must happen after
preceding
global/generic store
atomic/
atomicrmw-no-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_wait_storecnt 0x0
and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waits of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waits
could be placed after
seq_store or before
the seq_load. We
choose the load to
make the s_waits be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
load atomic
seq_cst
cluster
agent
system
global
generic
s_wait_storecnt 0x0
s_wait_loadcnt 0x0
s_wait_dscnt 0x0
If OpenCL, omit
s_wait_dscnt 0x0
The waits can be
independently moved
according to the
following rules:
s_wait_dscnt 0x0
must happen after
preceding
local load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_wait_dscnt 0x0
and so do
not need to be
considered.)
s_wait_loadcnt 0x0
must happen after
preceding
global/generic load
atomic/
atomicrmw-with-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waits and so do
not need to be
considered.)
s_wait_storecnt 0x0
Must happen after
preceding
global/generic store
atomic/
atomicrmw-no-return-value
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own
s_wait_storecnt 0x0 and so do
not need to be
considered.)
Ensures any
preceding
sequentially
consistent global
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
s_waits of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order. The s_waits
could be placed after
seq_store or before
the seq_load. We
choose the load to
make the s_waits be
as late as possible
so that the store
may have already
completed.)
Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.
store atomic
seq_cst
singlethread
wavefront
workgroup
cluster
agent
system
global
local
generic
Same as corresponding
store atomic release,
except must generate
all instructions even
for OpenCL.
atomicrmw
seq_cst
singlethread
wavefront
workgroup
cluster
agent
system
global
local
generic
Same as corresponding
atomicrmw acq_rel,
except must generate
all instructions even
for OpenCL.
fence
seq_cst
singlethread
wavefront
workgroup
cluster
agent
system
none
Same as corresponding
fence acq_rel,
except must generate
all instructions even
for OpenCL.
The collection of convergent threads participating in a cooperative atomic must belong
to the same wave32.
Only naturally-aligned, contiguous groups of lanes may be used;
see the table below for the set of
possible lane groups.
Cooperative atomics may be executed by more than one group per wave.
Using an unsupported lane group, or using more lane groups per wave than the maximum will
cause undefined behavior.
Using the intrinsic also causes undefined behavior if it loads or stores to addresses that:
Are not in the global address space (e.g., private and local address spaces).
Are only reachable through a bus that does not support 128B/256B requests
(e.g.: host memory over PCIe)
Any other unsupported addresses (TBD, needs refinement)
For code objects generated by the AMDGPU backend for HSA [HSA] compatible
runtimes (see AMDGPU Operating Systems), the runtime installs a trap handler that
supports the s_trap instruction. For usage see:
Table 100 AMDGPU Trap Handler for AMDHSA OS Code Object V2

=================== ============= =================== =======================================
Usage               Code Sequence Trap Handler Inputs Description
=================== ============= =================== =======================================
reserved            s_trap 0x00                       Reserved by hardware.
debugtrap(arg)      s_trap 0x01   SGPR0-1: queue_ptr  Reserved for Finalizer HSA debugtrap
                                  VGPR0: arg          intrinsic (not implemented).
llvm.trap           s_trap 0x02   SGPR0-1: queue_ptr  Causes wave to be halted with the PC at
                                                      the trap instruction. The associated
                                                      queue is signalled to put it into the
                                                      error state. When the queue is put in
                                                      the error state, the waves executing
                                                      dispatches on the queue will be
                                                      terminated.
llvm.debugtrap      s_trap 0x03   none                If debugger not enabled then behaves
                                                      as a no-operation. The trap handler
                                                      is entered and immediately returns to
                                                      continue execution of the wavefront.
                                                      If the debugger is enabled, causes
                                                      the debug trap to be reported by the
                                                      debugger and the wavefront is put in
                                                      the halt state with the PC at the
                                                      instruction. The debugger must
                                                      increment the PC and resume the wave.
reserved            s_trap 0x04                       Reserved.
reserved            s_trap 0x05                       Reserved.
reserved            s_trap 0x06                       Reserved.
reserved            s_trap 0x07                       Reserved.
reserved            s_trap 0x08                       Reserved.
reserved            s_trap 0xfe                       Reserved.
reserved            s_trap 0xff                       Reserved.
=================== ============= =================== =======================================
Table 101 AMDGPU Trap Handler for AMDHSA OS Code Object V3

=================== ============= =================== =======================================
Usage               Code Sequence Trap Handler Inputs Description
=================== ============= =================== =======================================
reserved            s_trap 0x00                       Reserved by hardware.
debugger breakpoint s_trap 0x01   none                Reserved for debugger to use for
                                                      breakpoints. Causes wave to be halted
                                                      with the PC at the trap instruction.
                                                      The debugger is responsible to resume
                                                      the wave, including the instruction
                                                      that the breakpoint overwrote.
llvm.trap           s_trap 0x02   SGPR0-1: queue_ptr  Causes wave to be halted with the PC at
                                                      the trap instruction. The associated
                                                      queue is signalled to put it into the
                                                      error state. When the queue is put in
                                                      the error state, the waves executing
                                                      dispatches on the queue will be
                                                      terminated.
llvm.debugtrap      s_trap 0x03   none                If debugger not enabled then behaves
                                                      as a no-operation. The trap handler
                                                      is entered and immediately returns to
                                                      continue execution of the wavefront.
                                                      If the debugger is enabled, causes
                                                      the debug trap to be reported by the
                                                      debugger and the wavefront is put in
                                                      the halt state with the PC at the
                                                      instruction. The debugger must
                                                      increment the PC and resume the wave.
reserved            s_trap 0x04                       Reserved.
reserved            s_trap 0x05                       Reserved.
reserved            s_trap 0x06                       Reserved.
reserved            s_trap 0x07                       Reserved.
reserved            s_trap 0x08                       Reserved.
reserved            s_trap 0xfe                       Reserved.
reserved            s_trap 0xff                       Reserved.
=================== ============= =================== =======================================
Table 102 AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above

=================== ============= ================== ================= =======================================
Usage               Code Sequence GFX6-GFX8 Inputs   GFX9-GFX11 Inputs Description
=================== ============= ================== ================= =======================================
reserved            s_trap 0x00                                        Reserved by hardware.
debugger breakpoint s_trap 0x01   none               none              Reserved for debugger to use for
                                                                       breakpoints. Causes wave to be halted
                                                                       with the PC at the trap instruction.
                                                                       The debugger is responsible to resume
                                                                       the wave, including the instruction
                                                                       that the breakpoint overwrote.
llvm.trap           s_trap 0x02   SGPR0-1: queue_ptr none              Causes wave to be halted with the PC at
                                                                       the trap instruction. The associated
                                                                       queue is signalled to put it into the
                                                                       error state. When the queue is put in
                                                                       the error state, the waves executing
                                                                       dispatches on the queue will be
                                                                       terminated.
llvm.debugtrap      s_trap 0x03   none               none              If debugger not enabled then behaves
                                                                       as a no-operation. The trap handler
                                                                       is entered and immediately returns to
                                                                       continue execution of the wavefront.
                                                                       If the debugger is enabled, causes
                                                                       the debug trap to be reported by the
                                                                       debugger and the wavefront is put in
                                                                       the halt state with the PC at the
                                                                       instruction. The debugger must
                                                                       increment the PC and resume the wave.
=================== ============= ================== ================= =======================================
This section describes the call convention ABI for functions other than the
outer kernel function.
If a kernel has function calls then scratch is always allocated and used for
the call stack which grows from low address to high address using the swizzled
scratch address space.
Base address pointing to the beginning of the wavefront scratch backing
memory.
Swizzled with dword element size and stride of wavefront size elements.
The FLAT_SCRATCH register pair is setup. See
Flat Scratch.
GFX6-GFX8: M0 register set to the size of LDS in bytes. See
M0.
The EXEC register is set to the lanes active on entry to the function.
MODE register: TBD
VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
below.
SGPR30-31 return address (RA). The code address that the function must
return to when it completes. The value is undefined if the function is no
return.
SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
offset relative to the beginning of the wavefront scratch backing memory.
The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
manner.
The unswizzled SP value can be converted into the swizzled SP value by:
swizzled SP = unswizzled SP / wavefront size
This may be used to obtain the private address space address of stack
objects and to convert this address to a flat address by adding the flat
scratch aperture base address.
The swizzled SP value is always 4-byte aligned for the r600
architecture and 16-byte aligned for the amdgcn architecture.
Note
The amdgcn value is selected to avoid dynamic stack alignment for the
OpenCL language which has the largest base type defined as 16 bytes.
On entry, the swizzled SP value is the address of the first function
argument passed on the stack. Other stack passed arguments are positive
offsets from the entry swizzled SP value.
The function may use positive offsets beyond the last stack passed argument
for stack allocated local variables and register spill slots. If necessary,
the function may align these to greater alignment than 16 bytes. After these
the function may dynamically allocate space for such things as runtime sized
alloca local allocations.
If the function calls another function, it will place any stack allocated
arguments after the last local allocation and adjust SGPR32 to the address
after the last local allocation.
All other registers are unspecified.
Any necessary s_waitcnt has been performed to ensure memory is available
to the function.
Use pass-by-reference (byref) instead of pass-by-value (byval) for struct
arguments in C ABI. Callee is responsible for allocating stack memory and
copying the value of the struct if modified. Note that the backend still
supports byval for struct arguments.
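The relationship between the unswizzled and swizzled stack pointer described above can be sketched as follows. This is a minimal illustration, not backend code: the function names are hypothetical, and the check that the unswizzled SP is a whole multiple of the wavefront size is an assumption of the sketch rather than a rule stated in this document.

```python
# Hypothetical helper names. Only the division rule and the alignment
# values (4 bytes for r600, 16 bytes for amdgcn) come from the text above.
SWIZZLED_SP_ALIGNMENT = {"r600": 4, "amdgcn": 16}


def swizzled_sp(unswizzled_sp: int, wavefront_size: int) -> int:
    """Apply: swizzled SP = unswizzled SP / wavefront size."""
    if wavefront_size <= 0:
        raise ValueError("wavefront size must be positive")
    if unswizzled_sp % wavefront_size != 0:
        # Assumption of this sketch: a valid unswizzled SP is a whole
        # multiple of the wavefront size.
        raise ValueError("unswizzled SP is not a multiple of the wavefront size")
    return unswizzled_sp // wavefront_size


def is_swizzled_sp_aligned(sp: int, arch: str) -> bool:
    """Check the per-architecture alignment of a swizzled SP value."""
    return sp % SWIZZLED_SP_ALIGNMENT[arch] == 0
```

For example, with a wavefront size of 64 an unswizzled SP of 1024 corresponds to a swizzled SP of 16, which meets the 16-byte amdgcn alignment.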
On exit from a function:
VGPR0-31 and SGPR4-29 are used to pass function result arguments as
described below. Any registers used are considered clobbered registers.
The following registers are preserved and have the same value as on entry:
FLAT_SCRATCH
EXEC
GFX6-GFX8: M0
All SGPR registers except the clobbered registers of SGPR4-31.
VGPR40-47
VGPR56-63
VGPR72-79
VGPR88-95
VGPR104-111
VGPR120-127
VGPR136-143
VGPR152-159
VGPR168-175
VGPR184-191
VGPR200-207
VGPR216-223
VGPR232-239
VGPR248-255
Note
Except the argument registers, the VGPRs clobbered and the preserved
registers are intermixed at regular intervals in order to keep a
similar ratio independent of the number of allocated VGPRs.
GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
Lanes of all VGPRs that are inactive at the call site.
For the AMDGPU backend, an inter-procedural register allocation (IPRA)
optimization may mark some of the clobbered SGPR and VGPR registers as
preserved if it can be determined that the called function does not change
their value.
The PC is set to the RA provided on entry.
MODE register: TBD.
All other registers are clobbered.
Any necessary s_waitcnt has been performed to ensure memory accessed by
function is available to the caller.
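The preserved VGPR ranges listed above follow a regular pattern: blocks of 8 registers starting at VGPR40 and repeating every 16 registers. The sketch below reproduces that list; the function names are hypothetical, and the closed-form rule is inferred from the list rather than quoted from the backend.

```python
def preserved_vgpr_ranges(first=40, last=255, block=8, stride=16):
    """Return the (low, high) VGPR ranges preserved across a call."""
    return [(lo, lo + block - 1) for lo in range(first, last + 1, stride)]


def is_preserved_vgpr(n: int) -> bool:
    """True if VGPR n falls in one of the preserved blocks listed above."""
    return 40 <= n <= 255 and (n - 40) % 16 < 8
```

Any VGPR for which `is_preserved_vgpr` returns false is treated as clobbered by the call, subject to the IPRA refinement and the inactive-lane rule described above.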
The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.
The source language input arguments are:
Any source language implicit this or self argument comes first as a
pointer type.
Followed by the function formal arguments in left to right source order.
The source language result arguments are:
The function result argument.
The source language input or result struct type arguments that are less than or
equal to 16 bytes, are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this direct
struct.
The source language input struct type arguments that are greater than 16 bytes,
are passed by reference. The caller is responsible for allocating a stack
location to make a copy of the struct value and pass the address as the input
argument. The called function is responsible to perform the dereference when
accessing the input argument. Clang terms this by-value struct.
A source language result struct type argument that is greater than 16 bytes, is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible to perform the dereference when
storing the result value. Clang terms this structured return (sret).
TODO: correct the ``sret`` definition.
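The 16-byte threshold in the struct passing rules above can be summarized with a small classifier. This is only a sketch of the rules as written here: the function names and the returned labels are hypothetical, and the `sret` case inherits the caveat noted in the TODO above.

```python
STRUCT_DIRECT_LIMIT_BYTES = 16  # threshold stated in the text above


def classify_struct_argument(size_bytes: int) -> str:
    """How an input struct argument of the given size is passed."""
    if size_bytes <= STRUCT_DIRECT_LIMIT_BYTES:
        # Decomposed recursively into base type fields, each passed as if
        # it were a separate argument.
        return "direct"
    # The caller copies the struct to the stack and passes its address.
    return "by-reference"


def classify_struct_result(size_bytes: int) -> str:
    """How a result struct of the given size is returned."""
    if size_bytes <= STRUCT_DIRECT_LIMIT_BYTES:
        return "direct"
    # The caller allocates the result and passes its address as the last
    # source language input argument.
    return "sret"
```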
Lambda argument types are treated as struct types with an implementation defined
set of fields.
For AMDGPU backend all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked inreg in which case
they are passed in SGPRs.
The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:
Work-Item ID (1 VGPR)
The X, Y and Z work-item ID are packed into a single VGPR with the following
layout. Only fields actually used by the function are set. The other bits
are undefined.
The value is computed by adding an offset to Kernarg Segment Ptr to get the
global address space pointer to the first kernarg implicit argument.
The input and result arguments are assigned in order in the following manner:
Note
There are likely some errors and omissions in the following description that
need correction.
VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
VGPR31.
If there are more arguments than will fit in these registers, the remaining
arguments are allocated on the stack in order on naturally aligned
addresses.
SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
SGPR29.
If there are more arguments than will fit in these registers, the remaining
arguments are allocated on the stack in order on naturally aligned
addresses.
Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.
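The VGPR assignment rule above can be illustrated with a small allocator. This is a sketch under several assumptions that the text does not state: each argument occupies whole dwords, an argument is never split between registers and the stack, once one argument goes to the stack all later ones do too, and natural alignment equals the argument size (so sizes are powers of two). The function name and the tuple layout are hypothetical.

```python
def assign_vgpr_arguments(arg_sizes_bytes, max_vgprs=32):
    """Assign arguments to VGPR0..VGPR31, then to naturally aligned stack slots."""
    locations = []
    next_vgpr = 0
    stack_offset = 0
    on_stack = False
    for size in arg_sizes_bytes:
        dwords = (size + 3) // 4  # round up to whole 32-bit registers
        if not on_stack and next_vgpr + dwords <= max_vgprs:
            locations.append(("vgpr", next_vgpr, next_vgpr + dwords - 1))
            next_vgpr += dwords
        else:
            on_stack = True
            # Round the offset up to the argument's natural alignment.
            stack_offset = (stack_offset + size - 1) // size * size
            locations.append(("stack", stack_offset))
            stack_offset += size
    return locations
```

With 32 dword arguments followed by a 4-byte and an 8-byte argument, the first 32 land in VGPR0 to VGPR31 and the last two land at stack offsets 0 and 8.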
The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:
SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
unswizzled scratch address. It is only needed if runtime sized alloca
are used, or for the reasons defined in SIFrameLowering.
Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
to access the incoming stack arguments in the function. The BP is needed
only when the function requires the runtime stack alignment.
Allocating SGPR arguments on the stack is not supported.
CFI will be generated that defines the CFA as the unswizzled address
relative to the wave scratch base in the unswizzled private address space
of the lowest address stack allocated local variable.
DW_AT_frame_base will be defined as the swizzled address in the
swizzled private address space by dividing the CFA by the wavefront size
(since CFA is always at least dword aligned which matches the scratch
swizzle element size).
If no dynamic stack alignment was performed, the stack allocated arguments
are accessed as negative offsets relative to DW_AT_frame_base, and the
local variables and register spill slots are accessed as positive offsets
relative to DW_AT_frame_base.
Function argument passing is implemented by copying the input physical
registers to virtual registers on entry. The register allocator can spill if
necessary. These are copied back to physical registers at call sites. The
net effect is that each function call can have these values in entirely
distinct locations. The IPRA can help avoid shuffling argument registers.
Call sites are implemented by setting up the arguments at positive offsets
from SP. Then SP is incremented to account for the known frame size before
the call and decremented after the call.
Note
The CFI will reflect the changed calculation needed to compute the CFA
from SP.
4-byte spill slots are used in the stack frame. One slot is allocated for an
emergency spill slot. Buffer instructions are used for stack accesses and
not the flat_scratch instruction.
The metadata is currently in development and is subject to major
changes. Only the current version is supported. When this document
was generated the version was 2.6.
The metadata is represented as Message Pack formatted binary data (see
[MsgPack]). The top level is a Message Pack map that includes the keys
defined in table AMDPAL Code Object Metadata Map
and referenced tables.
Additional information can be added to the maps. To avoid conflicts, any
key names should be prefixed by “vendor-name.” where vendor-name
can be the name of the vendor and specific vendor tool that generates the
information. The prefix is abbreviated to simply “.” when it appears
within a map that has been added by the same vendor-name.
Internal compiler hash for this pipeline. Lower
64 bits is the “stable” portion of the hash, used
for e.g. shader replacement lookup. Upper 64 bits
is the “unique” portion of the hash, used for
e.g. pipeline cache lookup. The value is
implementation defined, and can not be relied on
between different builds of the compiler.
Number of user data entries accessed by this
pipeline.
“.spill_threshold”
integer
The user data spill threshold. 0xFFFF for
NoUserDataSpilling.
“.uses_viewport_array_index”
boolean
Indicates whether or not the pipeline uses the
viewport array index feature. Pipelines which use
this feature can render into all 16 viewports,
whereas pipelines which do not use it are
restricted to viewport #0.
“.es_gs_lds_size”
integer
Size in bytes of LDS space used internally for
handling data-passing between the ES and GS
shader stages. This can be zero if the data is
passed using off-chip buffers. This value should
be used to program all user-SGPRs which have been
marked with “UserDataMapping::EsGsLdsSize”
(typically only the GS and VS HW stages will ever
have a user-SGPR so marked).
“.nggSubgroupSize”
integer
Explicit maximum subgroup size for NGG shaders
(maximum number of threads in a subgroup).
“.num_interpolants”
integer
Graphics only. Number of PS interpolants.
“.mesh_scratch_memory_size”
integer
Max mesh shader scratch memory used.
“.api”
string
Name of the client graphics API.
“.api_create_info”
binary
Graphics API shader create info binary blob. Can
be defined by the driver using the compiler if
they want to be able to correlate API-specific
information used during creation at a later time.
Table 107 AMDPAL Code Object API Shader Metadata Map
String Key
Value Type
Required?
Description
“.api_shader_hash”
sequence of
2 integers
Required
Input shader hash, typically passed in from the client. The value
is implementation defined, and can not be relied on between
different builds of the compiler.
“.hardware_mapping”
sequence of
string
Required
Flags indicating the HW stages this API shader maps to. Values
include:
The ELF symbol pointing to this pipeline’s stage entry point.
“.scratch_memory_size”
integer
Scratch memory size in bytes.
“.lds_size”
integer
Local Data Share size in bytes.
“.perf_data_buffer_size”
integer
Performance data buffer size in bytes.
“.vgpr_count”
integer
Number of VGPRs used.
“.agpr_count”
integer
Number of AGPRs used.
“.sgpr_count”
integer
Number of SGPRs used.
“.dynamic_vgpr_saved_count”
integer
No
Number of dynamic VGPRs that can be stored in scratch by the
CWSR trap handler. Only used on GFX12+.
“.vgpr_limit”
integer
If non-zero, indicates the shader was compiled with a
directive to instruct the compiler to limit the VGPR usage to
be less than or equal to the specified value (only set if
different from HW default).
“.sgpr_limit”
integer
SGPR count upper limit (only set if different from HW
default).
“.threadgroup_dimensions”
sequence of
3 integers
Thread-group X/Y/Z dimensions (Compute only).
“.wavefront_size”
integer
Wavefront size (only set if different from HW default).
“.uses_uavs”
boolean
The shader reads or writes UAVs.
“.uses_rovs”
boolean
The shader reads or writes ROVs.
“.writes_uavs”
boolean
The shader writes to one or more UAVs.
“.writes_depth”
boolean
The shader writes out a depth value.
“.uses_append_consume”
boolean
The shader uses append and/or consume operations, either
memory or GDS.
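Table 111 AMDPAL Code Object Shader Function Metadata Map

====================== ============ =================================================================
String Key             Value Type   Description
====================== ============ =================================================================
“.api_shader_hash”     sequence of  Input shader hash, typically passed in from the client. The value
                       2 integers   is implementation defined, and can not be relied on between
                                    different builds of the compiler.
“.scratch_memory_size” integer      Size in bytes of scratch memory used by the shader.
====================== ============ =================================================================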
regoffset is the dword offset into the GFXIP register space of
a GRBM register (i.e., driver accessible GPU register number, not
shader GPR register number). The driver is required to program each
specified register to the corresponding specified value when
executing this pipeline. Typically, the regoffsets are the
uint16_t offsets to each register as defined by the hardware
chip headers. The register is set to the provided value. However, a
regoffset that specifies a user data register (e.g.,
COMPUTE_USER_DATA_0) needs special treatment. See
User Data section for more
information.
Each hardware stage has a set of 32-bit physical SPI user data registers
(either 16 or 32 based on graphics IP and the stage) which can be
written from a command buffer and then loaded into SGPRs when waves are
launched via a subsequent dispatch or draw operation. This is the way
most arguments are passed from the application/runtime to a hardware
shader.
PAL abstracts this functionality by exposing a set of 128 user data
entries per pipeline a client can use to pass arguments from a command
buffer to one or more shaders in that pipeline. The ELF code object must
specify a mapping from virtualized user data entries to physical user
data registers, and PAL is responsible for implementing that mapping,
including spilling overflow user data entries to memory if needed.
Since the user data registers are GRBM-accessible SPI registers, this
mapping is actually embedded in the .registers metadata entry. For
most registers, the value in that map is a literal 32-bit value that
should be written to the register by the driver. However, when the
register is a user data register (any USER_DATA register e.g.,
SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
the driver to write either a user data entry value or one of several
driver-internal values to the register. This encoding is described in
the following table:
Note
Currently, user data registers 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
and SPI_SHADER_USER_DATA_PS_1) are reserved. User data register 0 must
always be programmed to the address of the GlobalTable, and user data
register 1 must always be programmed to the address of the PerShaderTable.
32-bit value of user_data_entry[N] as specified via CmdSetUserData()
0x10000000
GlobalTable
32-bit pointer to GPU memory containing the global internal table (should
always point to user data register 0).
0x10000001
PerShaderTable
32-bit pointer to GPU memory containing the per-shader internal table. See
Per-Shader Table
for more detail (should always point to user data register 1).
0x10000002
SpillTable
32-bit pointer to GPU memory containing the user data spill table. See
Spill Table for
more detail.
0x10000003
BaseVertex
Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn’t
reference the draw index in the vertex shader. Only supported by the first
stage in a graphics pipeline.
0x10000004
BaseInstance
Instance offset (32-bit unsigned integer). Only supported by the first stage in
a graphics pipeline.
0x10000005
DrawIndex
Draw index (32-bit unsigned integer). Only supported by the first stage in a
graphics pipeline.
0x10000006
Workgroup
Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
a buffer containing the grid dimensions for a Compute dispatch operation. The
high half of the address is stored in the next sequential user-SGPR. Only
supported by compute pipelines.
0x1000000A
EsGsLdsSize
Indicates that PAL will program this user-SGPR to contain the amount of LDS
space used for the ES/GS pseudo-ring-buffer for passing data between shader
stages.
0x1000000B
ViewId
View id (32-bit unsigned integer) identifies a view of graphic
pipeline instancing.
0x1000000C
StreamOutTable
32-bit pointer to GPU memory containing the stream out target SRD table. This
can only appear for one shader stage per pipeline.
0x1000000D
PerShaderPerfData
32-bit pointer to GPU memory containing the per-shader performance data buffer.
0x1000000F
VertexBufferTable
32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
only appear for one shader stage per pipeline.
0x10000010
UavExportTable
32-bit pointer to GPU memory containing the UAV export SRD table. This can
only appear for one shader stage per pipeline (PS). These replace color targets
and are completely separate from any UAVs used by the shader. This is optional,
and only used by the PS when UAV exports are used to replace color-target
exports to optimize specific shaders.
0x10000011
NggCullingData
64-bit pointer to GPU memory containing the hardware register data needed by
some NGG pipelines to perform culling. This value contains the address of the
first of two consecutive registers which provide the full GPU address.
0x10000015
FetchShaderPtr
64-bit pointer to GPU memory containing the fetch shader subroutine.
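A driver-side reader of the user data mapping could decode a register value along these lines. The special values and names are taken from the table above; the function name, the returned tuples, and the treatment of any value below 0x10000000 as a user data entry index are assumptions made for illustration.

```python
# Special encodings copied from the table above.
USER_DATA_SPECIAL_VALUES = {
    0x10000000: "GlobalTable",
    0x10000001: "PerShaderTable",
    0x10000002: "SpillTable",
    0x10000003: "BaseVertex",
    0x10000004: "BaseInstance",
    0x10000005: "DrawIndex",
    0x10000006: "Workgroup",
    0x1000000A: "EsGsLdsSize",
    0x1000000B: "ViewId",
    0x1000000C: "StreamOutTable",
    0x1000000D: "PerShaderPerfData",
    0x1000000F: "VertexBufferTable",
    0x10000010: "UavExportTable",
    0x10000011: "NggCullingData",
    0x10000015: "FetchShaderPtr",
}


def decode_user_data_mapping(value: int):
    """Classify the value stored for a USER_DATA register in .registers."""
    if value in USER_DATA_SPECIAL_VALUES:
        return ("internal", USER_DATA_SPECIAL_VALUES[value])
    if 0 <= value < 0x10000000:
        # Assumption: smaller values are the index N of user_data_entry[N].
        return ("user_data_entry", value)
    return ("unknown", value)
```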
Low 32 bits of the GPU address for an optional buffer in the .data
section of the ELF. The high 32 bits of the address match the high 32 bits
of the shader’s program counter.
The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the .data section.
Typically, this could be a table of buffer SRD’s and the data pointed to
by the buffer SRD’s, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.
Each shader’s table in the .data section is referenced by the symbol
_amdgpu_xs_shdr_intrl_data where xs corresponds with the
hardware shader stage the data is for. E.g.,
_amdgpu_cs_shdr_intrl_data for the compute shader hardware stage.
It is possible for a hardware shader to need access to more user data
entries than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary user data entries to be spilled to GPU memory and use
one user data register to point to the spilled user data memory. The
value of the user data entry must then represent the location where
a shader expects to read the low 32-bits of the table’s GPU virtual
address. The spill table itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.
For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
not install a trap handler. The llvm.trap and llvm.debugtrap
instructions are handled as follows:
This section describes the format of core files supporting AMDGPU. Core dumps
for an AMDGPU program can come in 2 flavors: split or unified core files.
The split layout consists of one host core file containing the information to
rebuild the image of the host process and one AMDGPU core file that contains
the information for the AMDGPU agents used in the process. The AMDGPU core
file consists of:
A note describing the state of the AMDGPU agents, AMDGPU queues, and AMDGPU
runtime for the process (see Core file notes).
A list of load segments containing an image of the AMDGPU agents’ memory (see
Memory segments).
The unified core file is the union of all the information contained in
the two files of the split layout (all notes and load segments). It contains
all the information required to reconstruct the image of the process across all
the agents.
The definition of all the kfd_* types comes from the
include/uapi/linux/kfd_ioctl.h header file from the KFD repository. It is
usually installed in /usr/include/linux/kfd_ioctl.h. The version of the
kfd_ioctl.h file used must define values for
KFD_IOCTL_MAJOR_VERSION and KFD_IOCTL_MINOR_VERSION matching
the values of kfd_version_major and kfd_version_minor from the
note.
An AMDGPU core file must contain an image of the AMDGPU agents’ memory in load
segments (of type PT_LOAD). Those segments must correspond to the memory
regions where the content of the agent memory is mapped into the host process
by the ROCr runtime (note that those memory mappings are usually not readable
by the process itself).
When using the split core file layout, those segments must be included in the
AMDGPU core file.
The order of operands and modifiers is fixed.
Most modifiers are optional and may be omitted.
Links to detailed instruction syntax description may be found in the following
table. Note that features under development are not included
in this description.
s_barrier
s_nop 2
s_endpgm
s_waitcnt 0 ; Wait for all counters to be 0
s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
s_sethalt 9
s_sleep 10
s_sendmsg 0x1
s_sendmsg sendmsg(MSG_INTERRUPT)
s_trap 1
For a full list of supported instructions, refer to “SOPP Instructions” in the
ISA Manual.
Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically selects the optimal encoding based on the operands.
To force a specific encoding, add a suffix to the opcode of the instruction:
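As a rough illustration of the suffix mechanism, the sketch below maps an encoding name to an opcode suffix. The suffix spellings (`_e32`, `_e64`, `_dpp`, `_sdwa`) are not stated in the text above; they are given here as the conventional AMDGPU assembler spellings and should be checked against the assembler in use. The helper `force_encoding` is hypothetical.

```python
# Assumed mapping from encoding name to opcode suffix (not taken from the text above).
ENCODING_SUFFIX = {
    "VOP1": "_e32",
    "VOP2": "_e32",
    "VOPC": "_e32",
    "VOP3": "_e64",
    "VOP_DPP": "_dpp",
    "VOP_SDWA": "_sdwa",
}


def force_encoding(opcode, encoding):
    """Return the opcode spelled with the suffix that forces `encoding`."""
    suffix = ENCODING_SUFFIX[encoding]
    # Refuse opcodes that already carry one of the known suffixes.
    if any(opcode.endswith(s) for s in set(ENCODING_SUFFIX.values())):
        raise ValueError(f"{opcode} already has an encoding suffix")
    return opcode + suffix
```

For example, `force_encoding("v_add_f32", "VOP3")` produces the spelling `v_add_f32_e64`, which requests the VOP3 form even when a shorter encoding would fit the operands.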
Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a “GFX9” target this will be set to the integer
value “9”. The possible GFX major generation numbers are presented in
Processors.
Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a “GFX810” target this will be set to the integer
value “1”. The possible GFX minor generation numbers are presented in
Processors.
Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a “GFX704” target this will be set to the
integer value “4”. The possible GFX stepping generation numbers are presented
in Processors.
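The three values can be derived from a processor name as in the sketch below. It assumes names of the form gfx<major><minor><stepping>, where the minor number is one decimal digit and the stepping is one hexadecimal digit; generic or otherwise irregular names are rejected. The function name is illustrative.

```python
import re


def gfx_version(processor):
    """Split a name such as 'gfx810' into (major, minor, stepping)."""
    # Assumption: one decimal digit for minor, one hex digit for stepping.
    m = re.fullmatch(r"gfx(\d+)(\d)([0-9a-f])", processor.lower())
    if m is None:
        raise ValueError(f"unrecognized processor name: {processor}")
    return int(m.group(1)), int(m.group(2)), int(m.group(3), 16)
```

With this rule, "gfx810" yields a minor number of 1 and "gfx704" yields a stepping number of 4, matching the examples in the text.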
Set to zero each time a
.amdgpu_hsa_kernel (name) directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum VGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that VGPR number plus
one.
Set to zero each time a
.amdgpu_hsa_kernel (name) directive is
encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus
one.
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the .end_amd_kernel_code_t directive. For any
amd_kernel_code_t values that are unspecified a default value will be used. The
default value for all keys is 0, with the following exceptions:
amd_kernel_code_version_major defaults to 1.
amd_kernel_code_version_minor defaults to 2.
amd_machine_kind defaults to 1.
amd_machine_version_major, amd_machine_version_minor, and
amd_machine_version_stepping are derived from the value of the -mcpu option
that is passed to the assembler.
kernel_code_entry_byte_offset defaults to 256.
wavefront_size defaults to 6 for all targets before GFX10. For GFX10 onwards
it defaults to 6 if target feature wavefrontsize64 is enabled, otherwise 5.
Note that wavefront size is specified as a power of two, so a value of n
means a size of 2^n.
call_convention defaults to -1.
kernarg_segment_alignment, group_segment_alignment, and
private_segment_alignment default to 4. Note that alignments are specified
as a power of 2, so a value of n means an alignment of 2^n.
enable_tg_split defaults to 1 if target feature tgsplit is enabled for
GFX90A onwards.
enable_wgp_mode defaults to 1 if target feature cumode is disabled for
GFX10 onwards.
enable_mem_ordered defaults to 1 for GFX10 onwards.
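The power-of-two defaults in the list above can be worked through with a short sketch. The function names are illustrative only; the values come from the defaults listed above.

```python
def default_wavefront_size_log2(gfx_major, wavefrontsize64):
    """Default wavefront_size key value (log2 of the wavefront size)."""
    if gfx_major < 10:
        return 6  # all targets before GFX10
    return 6 if wavefrontsize64 else 5  # GFX10 onwards depends on wavefrontsize64


def decode_log2(n):
    """Convert a power-of-two encoded key value into the size it denotes."""
    return 1 << n


DEFAULT_ALIGNMENT_LOG2 = 4  # kernarg, group and private segment alignment default
```

So a GFX10 target without wavefrontsize64 defaults to `decode_log2(5)`, a wavefront size of 32, and the default segment alignment is `decode_log2(4)`, that is 16.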
The .amd_kernel_code_t directive must be placed immediately after the
function label and before any instructions.
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document, the
comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and test/CodeGen/AMDGPU/hsa.s.
Set to the GFX major generation number of the target being assembled for. For
example, when assembling for a “GFX9” target this will be set to the integer
value “9”. The possible GFX major generation numbers are presented in
Processors.
Set to the GFX minor generation number of the target being assembled for. For
example, when assembling for a “GFX810” target this will be set to the integer
value “1”. The possible GFX minor generation numbers are presented in
Processors.
Set to the GFX stepping generation number of the target being assembled for.
For example, when assembling for a “GFX704” target this will be set to the
integer value “4”. The possible GFX stepping generation numbers are presented
in Processors.
Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.
Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.
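The update rule shared by both symbols can be written as a single step function. This is a sketch of the rule as stated above, not the assembler's implementation; the function name is hypothetical and `referenced` stands for the register numbers that one instruction names explicitly.

```python
def update_next_free(current, referenced):
    """Apply the per-instruction update rule to a next-free-register symbol."""
    if referenced and current <= max(referenced):
        # The symbol becomes the highest referenced register number plus one.
        return max(referenced) + 1
    return current


# Fold the rule over the VGPRs referenced by four consecutive instructions.
next_free_vgpr = 0  # set to zero before assembly begins
for regs in ([0, 1], [5], [2], []):
    next_free_vgpr = update_next_free(next_free_vgpr, regs)
```

After the loop the symbol holds 6: the instruction that references v5 raises it, and later instructions that only use lower-numbered registers leave it unchanged.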
Directives which begin with .amdgcn are valid for all amdgcn
architecture processors, and are not OS-specific. Directives which begin with
.amdhsa are specific to amdgcn architecture processors when the
amdhsa OS is specified. See Target Triples and
Processors.
Optional directive which declares the <target-triple>-<target-id> supported
by the containing assembler source file. Used by the assembler to validate
command-line options such as -triple, -mcpu, and
--offload-arch=<target-id>. A non-canonical target ID is allowed. See
Target Triples and Target ID.
Note
The target ID syntax used for code object V2 to V3 for this directive differs
from that used elsewhere. See Code Object V2 to V3 Target ID.
Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
<name>.kd, in the current location of the current section. Only valid when
the OS is amdhsa. <name> must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.
Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in Kernel Descriptor.
Directives which may appear in this list are described in
AMDHSA Kernel Assembler Directives. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
Kernel Descriptor. If a directive is not specified, it is
assumed to have its default value, unless it is marked as “Required”, in which
case it is an error to omit the directive. This list of directives is
terminated by an .end_amdhsa_kernel directive.
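The ordering, repetition, and "Required" rules above can be sketched as a small validator over the directive lines found between .amdhsa_kernel and .end_amdhsa_kernel. The contents of `REQUIRED` are an assumption made for illustration; the authoritative list is the one in AMDHSA Kernel Assembler Directives. Range checking against the kernel descriptor fields is not modelled.

```python
# Assumption for illustration: these two directives are marked "Required".
REQUIRED = frozenset({".amdhsa_next_free_vgpr", ".amdhsa_next_free_sgpr"})


def check_kernel_block(lines):
    """Return a list of error strings for a kernel descriptor directive list."""
    seen = set()
    errors = []
    for line in lines:
        text = line.split(";", 1)[0].strip()  # drop trailing ';' comments
        if not text:
            continue
        name = text.split()[0]
        if name in seen:
            errors.append(f"{name} is repeated")
        seen.add(name)
    # Order does not matter, so only presence is checked.
    for name in sorted(REQUIRED - seen):
        errors.append(f"{name} is required but missing")
    return errors
```

A directive list that names every required directive exactly once, in any order, produces an empty error list; omitted optional directives simply keep their defaults.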
Whether the kernel may use the special VCC SGPR.
Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
compute_pgm_rsrc1 for GFX6-GFX12.
.amdhsa_reserve_flat_scratch
1
GFX7-GFX10 (except GFX942)
Whether the kernel may use flat instructions to access scratch memory.
Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
compute_pgm_rsrc1 for GFX6-GFX12.
.amdhsa_reserve_xnack_mask
Target Feature Specific (xnack)
GFX8-GFX10
Whether the kernel may trigger XNACK replay.
Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
compute_pgm_rsrc1 for GFX6-GFX12.
If an assembly source file contains multiple kernels and/or functions, the
.amdgcn.next_free_vgpr and
.amdgcn.next_free_sgpr symbols may be reset using
the .set <symbol>, <expression> directive. For example, in the case of two
kernels, where function1 is only called from kernel1, it is sufficient
to group the function with the kernel that calls it and reset the symbols
between the two connected components:
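As a hedged illustration, the layout can be modelled in Python. The assembly text in `SOURCE` is a made-up example of the arrangement described above, and `usage_at_endpgm` is a hypothetical helper that mimics the symbol tracking; it only understands `.set` with an integer literal and register operands written as vN, sN, v[a:b], or s[a:b].

```python
import re

REG = re.compile(r"\b([vs])(?:(\d+)|\[(\d+):(\d+)\])")
SET = re.compile(r"\.set\s+\.amdgcn\.next_free_([vs])gpr\s*,\s*(\d+)$")

SOURCE = """
function1:
  v_add_f32 v5, v0, v1
  s_setpc_b64 s[30:31]
kernel1:
  v_mov_b32 v2, v0
  s_swappc_b64 s[30:31], s[4:5]
  s_endpgm
.set .amdgcn.next_free_vgpr, 0
.set .amdgcn.next_free_sgpr, 0
kernel2:
  v_mov_b32 v1, v0
  s_endpgm
"""


def usage_at_endpgm(source):
    """Return (next_free_vgpr, next_free_sgpr) observed at each s_endpgm."""
    next_free = {"v": 0, "s": 0}  # both symbols start at zero
    snapshots = []
    for raw in source.splitlines():
        line = raw.split(";", 1)[0].strip()
        if not line or line.endswith(":"):
            continue  # skip blank lines and labels
        reset = SET.match(line)
        if reset:
            next_free[reset.group(1)] = int(reset.group(2))
            continue
        for kind, single, lo, hi in REG.findall(line):
            top = int(single) if single else max(int(lo), int(hi))
            if next_free[kind] <= top:
                next_free[kind] = top + 1
        if line.split()[0] == "s_endpgm":
            snapshots.append((next_free["v"], next_free["s"]))
    return snapshots
```

With the reset in place, the value observed at the end of kernel2 reflects only kernel2's own registers, while the value at the end of kernel1 also covers function1. Removing the two `.set` lines from `SOURCE` makes kernel2 inherit the larger counts of the first component.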
These symbols cannot identify connected components in order to automatically
track the usage for each kernel. However, in some cases careful organization of
the kernels and functions in the source file means there is minimal additional
effort required to accurately calculate GPR usage.