e https://arxiv.org/abs/2209.10655 and the original fairseq implementation at https://github.com/facebookresearch/mega (copyright Meta Research, licensed under the MIT License). Differences from the original implementation include a hidden state refactor and a fix for an inconsistency with additive / multiplicative attention masks r