API Documentation

Note

This document describes the new C API that closely follows the NVIDIA Collective Communications Library (NCCL)* API standard. Documentation for the legacy C++ API can be found here.

Communicator Creation API

This section includes functions related to initializing and managing communicators in oneCCL.

onecclResult_t onecclGetVersion(int *version)

Function to obtain the oneCCL version encoded as an integer.

This function returns the version number encoded as a single integer, which combines the oneCCL major version, minor version, and patch number. Use onecclExtractVersionComponents to extract the individual components.

Parameters:
  • version[out] Pointer to store the encoded version integer

Returns:

Result of the operation

onecclResult_t onecclGetUniqueId(onecclUniqueId *uniqueId)

Function to get a unique ID.

This function generates a unique ID to be used in onecclCommInitRank or onecclCommInitRankConfig. Call onecclGetUniqueId once before creating a communicator, and send the resulting ID to all the ranks that will participate in the communicator before they call onecclCommInitRank or onecclCommInitRankConfig.

Parameters:
  • uniqueId[out] Pointer to store the unique ID

Returns:

Result of the operation

onecclResult_t onecclExtractVersionComponents(int versionCode, int *major, int *minor, int *patch)

Function to extract version components from the encoded version integer.

This function takes the encoded version integer obtained with onecclGetVersion and returns the oneCCL major version, minor version, and patch number.

Parameters:
  • versionCode[in] Encoded version integer

  • major[out] Pointer to store major version

  • minor[out] Pointer to store minor version

  • patch[out] Pointer to store patch version

Returns:

Result of the operation

onecclResult_t onecclCommInitRank(onecclComm_t *comm, size_t nranks, onecclUniqueId commId, int rank)

Function to create a new communicator.

This function creates a new communicator with nranks ranks, where rank must be an integer between 0 and nranks-1 that is unique within the communicator. commId is the unique ID obtained with onecclGetUniqueId. This is a collective call: every process participating in the communicator must call it.

Before this call, each rank must specify the device it is associated with, for example with onecclSetDevice(device_idx), where device_idx is the device index.

Parameters:
  • comm[out] Pointer to store the initialized communicator

  • nranks[in] Number of ranks

  • commId[in] Unique ID for the communicator

  • rank[in] Rank within the communicator

Returns:

Result of the operation

onecclResult_t onecclCommInitRankConfig(onecclComm_t *comm, size_t nranks, onecclUniqueId commId, int rank, const onecclConfig_t *config)

Function to create a new communicator using a config argument.

This function is similar to onecclCommInitRank, but it also takes a config argument with additional attributes for the communicator.

Parameters:
  • comm[out] Pointer to store the new communicator

  • nranks[in] Number of ranks

  • commId[in] Unique ID for the communicator

  • rank[in] Rank within the communicator

  • config[in] Configuration attributes for the communicator

Returns:

Result of the operation

onecclResult_t onecclCommInitAll(onecclComm_t *comm, int ndev, const int *devlist)

Function to create a single-process communicator.

This API is not implemented yet.

Parameters:
  • comm[out] Pointer to store the initialized communicator

  • ndev[in] Number of devices

  • devlist[in] List of devices

Returns:

Result of the operation

onecclResult_t onecclCommFinalize(onecclComm_t comm)

Function to flush all communication inside the communicator.

This API is not implemented yet.

Parameters:
  • comm[in] Communicator to finalize

Returns:

Result of the operation

onecclResult_t onecclCommDestroy(onecclComm_t comm)

Function to destroy a communicator.

This function frees the local resources allocated to the communicator object comm.

Parameters:
  • comm[in] Communicator to destroy

Returns:

Result of the operation

onecclResult_t onecclCommAbort(onecclComm_t comm)

Function to abort uncompleted operations and destroy the communicator.

This API is not implemented yet.

Parameters:
  • comm[in] Communicator to abort

Returns:

Result of the operation

onecclResult_t onecclCommSplit(onecclComm_t comm, int color, int key, onecclComm_t *newcomm, onecclConfig_t *config)

Function to create a set of new communicators from an existing one.

This function creates a set of new communicators from comm. Ranks that pass the same color end up in the same new communicator. Ranks that pass ONECCL_SPLIT_NOCOLOR as color do not become part of any new communicator, and NULL is returned in newcomm. The key determines the ordering within each new communicator: a smaller key yields a smaller rank in the new communicator, and ranks with the same key are ordered by their rank in the original communicator.

If the new communicators need a different configuration, pass it in the config argument. Otherwise, set config to NULL and the new communicators inherit the configuration of the original communicator.

There must be no pending operations on comm when this function is called; otherwise, the call could deadlock.

Parameters:
  • comm[in] Original communicator

  • color[in] Color identifier for splitting

  • key[in] Key for ranking within color

  • newcomm[out] Pointer to store newly created communicator

  • config[in] Configuration for the new communicator, or NULL to inherit the configuration of comm

Returns:

Result of the operation
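The ordering rule for onecclCommSplit can be checked with a small host-side model. The sketch below is plain C and does not call oneCCL: it takes the color and key that each rank of an original communicator would pass and computes each rank's position in, and the size of, its new communicator. SKETCH_NOCOLOR and split_ranks are hypothetical names; SKETCH_NOCOLOR stands in for ONECCL_SPLIT_NOCOLOR, whose actual value is defined by the oneCCL header.

```c
#include <assert.h>

/* Hypothetical stand-in for ONECCL_SPLIT_NOCOLOR; the real value comes from
 * the oneCCL header and is not specified in this document. */
#define SKETCH_NOCOLOR (-1)

/* Models the documented onecclCommSplit ordering rule on the host.
 * color[r] and key[r] are the arguments passed by original rank r.
 * On return, new_rank[r] is r's rank in its new communicator (or -1 if it
 * passed SKETCH_NOCOLOR) and new_size[r] is that communicator's size. */
void split_ranks(int nranks, const int *color, const int *key,
                 int *new_rank, int *new_size)
{
    assert(nranks > 0);
    for (int r = 0; r < nranks; ++r) {
        if (color[r] == SKETCH_NOCOLOR) {
            new_rank[r] = -1; /* oneCCL returns NULL in newcomm here */
            new_size[r] = 0;
            continue;
        }
        int rank = 0, size = 0;
        for (int s = 0; s < nranks; ++s) {
            if (color[s] != color[r])
                continue;
            ++size;
            /* s precedes r if it has a smaller key, or an equal key and a
             * smaller rank in the original communicator. */
            if (key[s] < key[r] || (key[s] == key[r] && s < r))
                ++rank;
        }
        new_rank[r] = rank;
        new_size[r] = size;
    }
}
```

For example, with colors {0, 1, 0, 1, SKETCH_NOCOLOR, 0} and keys {5, 0, 5, 1, 0, 1}, original rank 5 becomes rank 0 of the color-0 communicator because it has the smallest key, followed by original ranks 0 and 2, which tie on key 5 and therefore keep their original order.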

const char *onecclGetErrorString(onecclResult_t result)

Returns a string for the error code in result.

Returns a human-readable string corresponding to the error code in result.

Parameters:
  • result[in] Return code to describe

Returns:

Description of the return code

const char *onecclGetLastError(onecclComm_t comm)

Function that returns an error message for the last error that occurred in the communicator.

This function returns a human-readable string corresponding to the last error that occurred in the communicator. Note that the error message may not relate to the current call, but rather to a previous non-blocking call.

Parameters:
  • comm[in] Communicator to query for last error

Returns:

Description of the last error

onecclResult_t onecclCommCount(const onecclComm_t comm, int *size)

Function to obtain communicator size.

This function returns the number of ranks in the communicator.

Parameters:
  • comm[in] Communicator to query

  • size[out] Pointer to store the size

Returns:

Result of the operation

onecclResult_t onecclCommDevice(const onecclComm_t comm, int *device)

Function to get the device used by the communicator.

This function returns the device associated with the communicator.

Parameters:
  • comm[in] Communicator to query

  • device[out] Pointer to store the device index

Returns:

Result of the operation

onecclResult_t onecclCommUserRank(const onecclComm_t comm, int *rank)

Function to get the rank within the communicator.

This function returns the rank of the caller in the communicator.

Parameters:
  • comm[in] Communicator to query

  • rank[out] Pointer to store the rank

Returns:

Result of the operation

onecclResult_t onecclSetDevice(uint32_t index)

Function to set the device index for the calling rank.

This function records the device index associated with the calling rank/thread.

Parameters:
  • index[in] Index of device to select

Returns:

Result of the operation

onecclResult_t onecclCommGetAsyncError(onecclComm_t comm)

Function to check for errors of asynchronous oneCCL operations in the communicator.

This API is not implemented yet.

Parameters:
  • comm[in] Communicator to check

Returns:

Result of the operation

Collective Functions API

This section includes functions related to collective communication operations.

onecclResult_t onecclReduce(const void *sendbuff, void *recvbuff, size_t count, onecclDataType_t datatype, onecclRedOp_t redop, int root, onecclComm_t comm, void *stream)

Function to perform a Reduce operation.

Reduce is a collective communication operation that applies the reduction operation redop to count elements in sendbuff across all ranks and places the result in the recvbuff of the root rank. recvbuff is only used on the root rank.

This operation is in-place if sendbuff == recvbuff.
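To make the element-wise semantics concrete, here is a minimal host-side model of Reduce written in plain C; it does not use oneCCL. Each simulated rank's sendbuff is one entry of an array of pointers, double stands in for the element type, and sketch_op, combine, and reduce_model are hypothetical names, with sketch_op mirroring a few onecclRedOp_t values.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for a few onecclRedOp_t values (illustrative only). */
typedef enum { SKETCH_SUM, SKETCH_PROD, SKETCH_MAX, SKETCH_MIN } sketch_op;

/* Applies one reduction step to a pair of values. */
static double combine(sketch_op op, double a, double b)
{
    switch (op) {
    case SKETCH_SUM:  return a + b;
    case SKETCH_PROD: return a * b;
    case SKETCH_MAX:  return a > b ? a : b;
    case SKETCH_MIN:  return a < b ? a : b;
    }
    return a; /* not reached for valid ops */
}

/* Host-side model of onecclReduce: send[r] is rank r's sendbuff of count
 * elements; only recv_root (the root's recvbuff) is written. */
void reduce_model(int nranks, const double *const *send, double *recv_root,
                  size_t count, sketch_op op)
{
    assert(nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        double acc = send[0][i];
        for (int r = 1; r < nranks; ++r)
            acc = combine(op, acc, send[r][i]); /* element i of every rank */
        recv_root[i] = acc;
    }
}
```

Element i of the result depends only on element i of each rank's sendbuff, which is why the in-place form (sendbuff == recvbuff on the root) is safe.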

The stream is usually a pointer to a SYCL queue, but it can be NULL for host buffers. For details, see the plugin documentation.

Parameters:
  • sendbuff[in] Buffer with the data to send

  • recvbuff[out] Buffer to receive reduced data

  • count[in] Number of elements

  • datatype[in] Data type of elements

  • redop[in] Reduction operation

  • root[in] Root rank of the operation

  • comm[in] Communicator for the operation

  • stream[in] Stream for the reduction

Returns:

Result of the operation

onecclResult_t onecclAllReduce(void *sendbuff, void *recvbuff, size_t count, onecclDataType_t datatype, onecclRedOp_t reduction_op, onecclComm_t comm, void *stream)

Function to perform an AllReduce operation.

AllReduce is a collective communication operation that applies the reduction operation reduction_op to count elements in sendbuff across all ranks and places the result in the recvbuff of every rank, so recvbuff holds the same result on all ranks.

This operation is in-place if sendbuff == recvbuff.
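The following sketch models AllReduce on the host in plain C, without calling oneCCL, including the in-place case where each rank passes the same pointer for sendbuff and recvbuff. It assumes that onecclAvg divides the sum by the number of ranks, as ncclAvg does in NCCL; this document does not define the semantics of onecclAvg. The function name allreduce_model and its average flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of onecclAllReduce with a sum, or an average when
 * average != 0 (assumed to be the sum divided by nranks).
 * in[r] / out[r] are rank r's sendbuff / recvbuff; passing the same pointer
 * for both models the in-place case (sendbuff == recvbuff). */
void allreduce_model(int nranks, double *const *in, double *const *out,
                     size_t count, int average)
{
    assert(nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        double acc = 0.0;
        for (int r = 0; r < nranks; ++r)
            acc += in[r][i];       /* read element i of every rank first */
        if (average)
            acc /= nranks;
        for (int r = 0; r < nranks; ++r)
            out[r][i] = acc;       /* every rank receives the same value */
    }
}
```

Reading element i from every rank before writing any output is what keeps the in-place variant correct in this model.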

The stream is usually a pointer to a SYCL queue, but it can be NULL for host buffers. For details, see the plugin documentation.

Parameters:
  • sendbuff[in] Buffer with the data to send

  • recvbuff[out] Buffer to receive reduced data

  • count[in] Number of elements

  • datatype[in] Data type of elements

  • reduction_op[in] Reduction operation

  • comm[in] Communicator for the operation

  • stream[in] Stream for the reduction

Returns:

Result of the operation

onecclResult_t onecclBroadcast(const void *sendbuff, void *recvbuff, size_t count, onecclDataType_t datatype, int root, onecclComm_t comm, void *stream)

Function to perform a Broadcast operation.

Broadcast is a collective communication operation that copies count elements from sendbuff on the root rank to the recvbuff of every rank. sendbuff is only used on the root rank.

The operation is in-place if sendbuff == recvbuff.
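A host-side model of Broadcast in plain C, with no oneCCL calls, shows the data movement and the in-place case, in which the root's recvbuff is the same buffer as its sendbuff. The function name broadcast_model and its parameters are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-side model of onecclBroadcast: copies count elements from the root's
 * sendbuff (root_send) into every rank's recvbuff (bufs[r]). When
 * bufs[root] == root_send, the root's copy is skipped (in-place case). */
void broadcast_model(int nranks, int root, const double *root_send,
                     double *const *bufs, size_t count)
{
    assert(root >= 0 && root < nranks);
    for (int r = 0; r < nranks; ++r)
        if (bufs[r] != root_send)   /* nothing to copy onto itself */
            memcpy(bufs[r], root_send, count * sizeof *root_send);
}
```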

The stream is usually a pointer to a SYCL queue, but it can be NULL for host buffers. For details, see the plugin documentation.

Parameters:
  • sendbuff[in] Buffer with the data to send from root

  • recvbuff[out] Buffer to receive broadcast data

  • count[in] Number of elements

  • datatype[in] Data type of elements

  • root[in] Root rank of the operation

  • comm[in] Communicator for the operation

  • stream[in] Stream for the broadcast

Returns:

Result of the operation

onecclResult_t onecclReduceScatter(const void *sendbuff, void *recvbuff, size_t recvcount, onecclDataType_t datatype, onecclRedOp_t redop, onecclComm_t comm, void *stream)

Function to perform a ReduceScatter operation.

ReduceScatter is a collective communication operation that applies the reduction operation redop across all ranks to the nranks*recvcount elements in sendbuff and scatters the result over the recvbuff of the participating ranks, so that the recvbuff of rank i contains the i-th chunk of recvcount elements of the result.

This operation assumes that the send count equals nranks*recvcount, that is, sendbuff holds at least nranks*recvcount elements.

This operation is in-place if recvbuff == sendbuff + rank * recvcount.
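The offsets above can be sketched with a host-side model in plain C that does not call oneCCL. It uses a sum reduction, simulates each rank's buffers with an array of pointers, and includes a helper that checks the documented in-place condition recvbuff == sendbuff + rank * recvcount. The names reduce_scatter_model and reduce_scatter_in_place are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of onecclReduceScatter with a sum. send[s] is rank s's
 * sendbuff of nranks * recvcount elements; recv[r] is rank r's recvbuff of
 * recvcount elements. Rank r receives the reduced r-th chunk. */
void reduce_scatter_model(int nranks, double *const *send, double *const *recv,
                          size_t recvcount)
{
    assert(nranks > 0);
    for (int r = 0; r < nranks; ++r)
        for (size_t j = 0; j < recvcount; ++j) {
            size_t off = (size_t)r * recvcount + j; /* element j of chunk r */
            double acc = 0.0;
            for (int s = 0; s < nranks; ++s)
                acc += send[s][off];
            recv[r][j] = acc;
        }
}

/* Returns nonzero when recvbuff sits at the documented in-place position. */
int reduce_scatter_in_place(const double *sendbuff, const double *recvbuff,
                            int rank, size_t recvcount)
{
    return recvbuff == sendbuff + (size_t)rank * recvcount;
}
```

In this model, chunk r of any sendbuff is read only while computing rank r's result, so writing rank r's result into chunk r of its own sendbuff (the in-place layout) cannot disturb the other ranks' results.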

The stream is usually a pointer to a SYCL queue, but it can be NULL for host buffers. For details, see the plugin documentation.

Parameters:
  • sendbuff[in] Buffer with the data to send

  • recvbuff[out] Buffer to receive scattered data

  • recvcount[in] Number of elements to receive

  • datatype[in] Data type of elements

  • redop[in] Reduction operation

  • comm[in] Communicator for the operation

  • stream[in] Stream for the scatter

Returns:

Result of the operation

onecclResult_t onecclAllGather(const void *sendbuff, void *recvbuff, size_t sendcount, onecclDataType_t datatype, onecclComm_t comm, void *stream)

Function to perform an AllGather operation.

AllGather is a collective communication operation that gathers sendcount elements from sendbuff on each rank and places them in the recvbuff of every participating rank. The data from sendbuff on rank i is found in recvbuff at offset i*sendcount.

This operation assumes that the receive count is equal to nranks*sendcount, that is, the recvbuff has a count of at least nranks*sendcount elements.

This operation is in-place if sendbuff == recvbuff + rank * sendcount.
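Similarly, a host-side model of AllGather in plain C, with no oneCCL calls, places rank s's data at offset s*sendcount of every recvbuff. The test layout uses the in-place form sendbuff == recvbuff + rank * sendcount; memmove is used because in that case the source and destination of a rank's own chunk are the same memory. The name allgather_model is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-side model of onecclAllGather. send[s] is rank s's sendbuff of
 * sendcount elements; recv[r] is rank r's recvbuff of nranks * sendcount
 * elements. Rank s's data lands at offset s * sendcount in every recvbuff. */
void allgather_model(int nranks, double *const *send, double *const *recv,
                     size_t sendcount)
{
    assert(nranks > 0);
    for (int r = 0; r < nranks; ++r)
        for (int s = 0; s < nranks; ++s)
            /* memmove tolerates the in-place case, where source and
             * destination are the same memory */
            memmove(recv[r] + (size_t)s * sendcount, send[s],
                    sendcount * sizeof(double));
}
```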

The stream is usually a pointer to a SYCL queue, but it can be NULL for host buffers. For details, see the plugin documentation.

Parameters:
  • sendbuff[in] Buffer with the data to send

  • recvbuff[out] Buffer to receive gathered data

  • sendcount[in] Number of elements to send

  • datatype[in] Data type of elements

  • comm[in] Communicator for the operation

  • stream[in] Stream for the gather

Returns:

Result of the operation

onecclResult_t onecclAllToAll(const void *sendbuff, void *recvbuff, size_t count, onecclDataType_t datatype, onecclComm_t comm, void *stream)

Function to perform an AllToAll operation.

AllToAll is a collective communication operation in which each rank sends count elements to every rank and receives count elements from every rank. The data to send to destination rank j is located at sendbuff+j*count, and the data received from source rank i is placed at recvbuff+i*count.

This operation assumes that sendbuff and recvbuff each hold nranks*count elements.

The stream is usually a pointer to a SYCL queue, but it can be NULL for host buffers. For details, see the plugin documentation.

Parameters:
  • sendbuff[in] Buffer with the data to send to each process

  • recvbuff[out] Buffer to receive data

  • count[in] Number of elements to send and receive from each process

  • datatype[in] Data type of elements

  • comm[in] Communicator for the operation

  • stream[in] Stream for the AllToAll operation

Returns:

Result of the operation
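The block layout of AllToAll can be illustrated with a host-side model in plain C that does not call oneCCL. It follows the documented offsets: block j of rank s's sendbuff (at offset j*count) lands in rank j's recvbuff at offset s*count. The model assumes non-overlapping send and receive buffers, since this document does not describe an in-place form for AllToAll; the name alltoall_model is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-side model of onecclAllToAll. send[src] and recv[dst] each hold
 * nranks * count elements and must not overlap in this model. Block dst of
 * rank src's sendbuff goes to rank dst at offset src * count. */
void alltoall_model(int nranks, double *const *send, double *const *recv,
                    size_t count)
{
    assert(nranks > 0);
    for (int src = 0; src < nranks; ++src)
        for (int dst = 0; dst < nranks; ++dst)
            memcpy(recv[dst] + (size_t)src * count,
                   send[src] + (size_t)dst * count,
                   count * sizeof(double));
}
```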

onecclResult_t onecclSend(const void *sendbuff, size_t count, onecclDataType_t datatype, int peer, onecclComm_t comm, void *stream)

Function to perform a send operation.

This operation sends count elements from sendbuff to the peer rank. The peer rank must call onecclRecv with the same count and datatype.

This operation may block the GPU. If multiple onecclSend() and onecclRecv() operations need to progress concurrently without blocking, place them between onecclGroupStart() and onecclGroupEnd() calls.

The stream is usually a pointer to a SYCL queue, but it can be NULL for host buffers. For details, see the plugin documentation.

Parameters:
  • sendbuff[in] Buffer with the data to send

  • count[in] Number of elements

  • datatype[in] Data type of elements

  • peer[in] Rank of the peer

  • comm[in] Communicator for the operation

  • stream[in] Stream for the send

Returns:

Result of the operation

onecclResult_t onecclRecv(void *recvbuff, size_t count, onecclDataType_t datatype, int peer, onecclComm_t comm, void *stream)

Function to perform a receive operation.

This operation receives count elements from the peer rank into recvbuff. The peer rank must call onecclSend with the same count and datatype.

This operation may block the GPU. If multiple onecclSend() and onecclRecv() operations need to progress concurrently without blocking, place them between onecclGroupStart() and onecclGroupEnd() calls.

The stream is usually a pointer to a SYCL queue, but it can be NULL for host buffers. For details, see the plugin documentation.

Parameters:
  • recvbuff[out] Buffer to receive data

  • count[in] Number of elements

  • datatype[in] Data type of elements

  • peer[in] Rank of the peer

  • comm[in] Communicator for the operation

  • stream[in] Stream for the receive

Returns:

Result of the operation

onecclResult_t onecclGroupStart()

Function to start a group call.

This function indicates that all subsequent oneCCL calls, up to the matching onecclGroupEnd, will not block due to CPU synchronization.

Returns:

Result of the operation

onecclResult_t onecclGroupEnd()

Function to end a group call.

This function starts all the oneCCL operations submitted since the most recent onecclGroupStart.

Currently, the only operations supported between onecclGroupStart and onecclGroupEnd are collectives and point-to-point send and receive.

Returns:

Result of the operation

onecclResult_t onecclRedOpCreatePreMulSum(onecclRedOp_t *redop, void *scalar, onecclDataType_t datatype, onecclScalarResidence_t residence, onecclComm_t comm)

Function to create a custom reduction operation that performs a pre-multiplied sum.

This function creates a new reduction operator that pre-multiplies the input values by scalar before reducing them with the peer values using a sum.

The input data and the scalar are of type datatype.

The residence argument indicates whether the memory pointed to by scalar resides in host or device memory. See onecclScalarResidence_t.

The handle to the newly created reduction operation is stored in redop.

Parameters:
  • redop[out] Pointer to store the created reduction operation

  • scalar[in] Pointer to the scalar value to pre-multiply

  • datatype[in] Data type of the scalar value

  • residence[in] Memory residence of the scalar value

  • comm[in] Communicator for the operation

Returns:

Result of the operation
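As a plain-C host-side sketch with no oneCCL calls, the effect of a PreMulSum operator used in an AllReduce can be modeled as below. It assumes, as with NCCL's ncclRedOpCreatePreMulSum, that each rank may supply its own scalar, so each output element is the sum over ranks of scalar[r] * input[r]. Passing 1/nranks on every rank then yields an average. The name premulsum_model is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of a reduction with a PreMulSum operator:
 * out[i] = sum over r of scalar[r] * in[r][i], where scalar[r] is the value
 * rank r passed when creating the operator (assumed per-rank). */
void premulsum_model(int nranks, const double *scalar, double *const *in,
                     double *out, size_t count)
{
    assert(nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        double acc = 0.0;
        for (int r = 0; r < nranks; ++r)
            acc += scalar[r] * in[r][i];   /* multiply before summing */
        out[i] = acc;
    }
}
```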

onecclResult_t onecclRedOpDestroy(onecclRedOp_t redop, onecclComm_t comm)

Function to destroy a previously created custom reduction operation.

Destroys the reduction operation redop.

This API assumes the reduction operation was created with onecclRedOpCreatePreMulSum using the communicator comm. An operation can be destroyed as soon as the last oneCCL function using that reduction operation returns.

Parameters:
  • redop[in] Previously created reduction operation

  • comm[in] Communicator for the operation

Returns:

Result of the operation

Types API

This section includes types and definitions used throughout the oneCCL API.

enum onecclResult_t

Enum for possible result codes for oneCCL functions.

Values:

enumerator onecclSuccess
enumerator onecclError
enumerator onecclSystemError
enumerator onecclInternalError
enumerator onecclInvalidArgument
enumerator onecclInvalidUsage
enumerator onecclInProgress
enumerator onecclFailureGPU
enumerator onecclFailureCPU
enumerator onecclAllocFailureCPU
enumerator onecclAllocFailureGPU
enumerator onecclPluginException
enumerator onecclNotImplemented

enum onecclDataType_t

Enum for different oneCCL data types.

Values:

enumerator onecclInt8
enumerator onecclChar
enumerator onecclUint8
enumerator onecclInt32
enumerator onecclInt
enumerator onecclUint32
enumerator onecclInt64
enumerator onecclUint64
enumerator onecclFloat16
enumerator onecclHalf
enumerator onecclFloat32
enumerator onecclFloat
enumerator onecclFloat64
enumerator onecclDouble
enumerator onecclBfloat16

enum onecclRedOp_t

Enum for reduction operations in oneCCL.

Values:

enumerator onecclSum
enumerator onecclProd
enumerator onecclMax
enumerator onecclMin
enumerator onecclAvg
enumerator onecclNumOps
enumerator onecclMaxRedOp

enum onecclScalarResidence_t

Scalar residence for user-defined operations.

Values:

enumerator onecclScalarDevice
enumerator onecclScalarHostImmediate

enum onecclPluginType_t

Enum to specify plugin types.

Values:

enumerator onecclPluginAny
enumerator onecclNull
enumerator onecclLegacy
enumerator onecclLegacyCPU
enumerator onecclUserBackend

struct onecclUniqueId

Structure to store unique communicator identifier details.

Public Members

char legacy[512]

Legacy identifier.

char nccl[512]

NCCL identifier.

char any[2048]

Additional space for any identifier.

char data[4096]

union onecclUniqueId

struct onecclConfig

Configuration structure for oneCCL communicators.