Dropout

General

The dropout operation is a regularization technique that randomly zeroes elements of the input tensor during training and scales the remaining outputs so that the expected sum of output elements matches the sum of input elements. It relies on a deterministic PRNG (the current implementation uses a variation of the Philox algorithm) and transforms the values as follows:

\[\begin{split}\mathrm{mask}[:] = (\mathrm{PRNG}(seed, offset, :) > P) \\ \mathrm{dst}[:] = \mathrm{mask}[:] \cdot {{\mathrm{src}[:]} \over {1 - P}}\end{split}\]

where:

  • \(\mathrm{mask}\) values are either 0, if the corresponding value in \(\mathrm{dst}\) was zeroed (a.k.a. dropped out), or 1 otherwise.

  • \(seed, offset\) are the seed and the offset for the PRNG algorithm. The seed initializes the PRNG state, while the offset allows generating different random sequences from the same seed, ensuring reproducibility across executions.

  • \(P\) is the probability for any given value to get dropped out, \(0 \leq P \leq 1\).
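
As a concrete illustration of this transformation, below is a minimal C++ sketch of the forward (training) pass. The function name, the flat std::vector layout, and the use of std::mt19937_64 in place of the Philox-based generator mentioned above are simplifying assumptions for the sketch, not part of the library API.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Illustrative forward (training) dropout following the formula above.
// std::mt19937_64 stands in for the Philox-based generator the actual
// implementation uses; seed and offset parameterize the random sequence.
std::vector<float> dropout_fwd_train(const std::vector<float> &src, float p,
                                     std::uint64_t seed, std::uint64_t offset,
                                     std::vector<std::uint8_t> &mask) {
    std::mt19937_64 prng(seed);
    prng.discard(offset); // offset selects a different point in the same sequence
    std::uniform_real_distribution<float> uniform(0.f, 1.f);

    const float scale = 1.f / (1.f - p); // keeps the expected sum unchanged (requires p < 1)
    std::vector<float> dst(src.size());
    mask.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        mask[i] = uniform(prng) > p ? 1 : 0;     // mask[:] = (PRNG(seed, offset, :) > P)
        dst[i] = mask[i] ? src[i] * scale : 0.f; // dst[:] = mask[:] * src[:] / (1 - P)
    }
    return dst;
}
```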

Forward (Training) applies the dropout mask and scaling as described above. Forward (Inference) passes the input directly to the output without modification. Backward applies the same mask and scaling to the gradient \(\mathrm{diff\_dst}\) to compute \(\mathrm{diff\_src}\).
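
A matching sketch of the backward pass, under the same assumptions as the forward sketch above: the mask saved during forward training and the same \(1 / (1 - P)\) scaling are applied to \(\mathrm{diff\_dst}\) to produce \(\mathrm{diff\_src}\). Names and layout are again hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative backward dropout: reuse the mask saved in the forward
// (training) pass and apply the same 1 / (1 - P) scaling to diff_dst.
std::vector<float> dropout_bwd(const std::vector<float> &diff_dst, float p,
                               const std::vector<std::uint8_t> &mask) {
    const float scale = 1.f / (1.f - p);
    std::vector<float> diff_src(diff_dst.size());
    for (std::size_t i = 0; i < diff_dst.size(); ++i)
        diff_src[i] = mask[i] ? diff_dst[i] * scale : 0.f;
    return diff_src;
}
```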