The article discusses self-supervised learning for video understanding, focusing on VideoMAE and its follow-ups. VideoMAE adapts image masked autoencoders to video and addresses two challenges specific to that setting: temporal redundancy and temporal correlation between frames. It trains efficiently and performs strongly, and VideoMAEv2 and MGMAE improve on it further. The most recent work, ARVideo, introduces an autoregressive approach to self-supervised video pretraining.
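
To make the temporal-correlation issue concrete, here is a minimal sketch of tube masking, the strategy the VideoMAE paper uses: the same spatial patches are hidden in every frame at a high ratio, so a masked patch cannot simply be copied from a neighbouring frame. This is an illustration, not the authors' implementation. The function name `tube_mask`, the 14x14 patch grid, the 8 frames, and the 0.9 ratio are assumptions chosen for the example.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=None):
    """Return a nested list mask[t][row][col]; True means the patch is masked.

    Hypothetical helper for illustration: one spatial mask is sampled and
    then repeated along the time axis, forming "tubes" of masked patches.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = round(mask_ratio * num_patches)

    # Sample the masked spatial positions once, without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))
    spatial = [
        [(row * grid_w + col) in masked for col in range(grid_w)]
        for row in range(grid_h)
    ]

    # Repeat the same spatial mask for every frame (copy rows per frame).
    return [[row[:] for row in spatial] for _ in range(num_frames)]


# Example: 8 temporal positions over a 14x14 patch grid (assumed sizes).
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
visible_per_frame = sum(not m for row in mask[0] for m in row)
print(f"visible patches per frame: {visible_per_frame} of {14 * 14}")
```

Because only the visible patches would be passed to the encoder in a masked-autoencoder setup, a high ratio like this also shrinks the encoder's input, which is one reason such training can be efficient.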