on where the channels are updated using a weighted sum. The weights are obtained from the (softmax normalized) Cross-covariance matrix (Q^T \cdot K \in d_h \times d_h) rO