Self-Supervised Interleaved Motion Representation Learning for Long-Range Sports Video Analytics

Ishaan Smith

Authors

Ishaan Smith, Department of Computer Science, University of North Texas, Denton, TX, USA

Keywords:

self-supervised learning, interleaved motion representation, long-range video analytics, sports video understanding, hierarchical encoding, system architecture, fairness in AI, video prediction

Abstract

Long-range sports video analytics must capture fine-grained motion patterns over extended temporal horizons while remaining computationally efficient and robust to domain shifts. Traditional supervised approaches require extensive human annotation and often fail to generalize across sports, camera setups, and environmental conditions. This paper proposes a self-supervised interleaved motion representation learning framework that uses hierarchical multi-stream architectures to encode motion at multiple temporal scales without labeled data. The framework combines contrastive and predictive self-supervised objectives within an interleaved encoder design, so that the model learns structured representations that disentangle short-term dynamics from long-term dependencies. The paper examines system-level considerations, including the architectural trade-off between model capacity and inference speed, the role of data augmentation and negative sampling strategies, and the implications for deployment on edge devices. It also addresses fairness issues, such as demographic biases in broadcast sports data, and discusses governance frameworks for responsible deployment in automated coaching and officiating assistance. Experimental evaluations on benchmark sports video datasets show that the proposed approach achieves competitive performance on downstream tasks including action recognition, event localization, and player trajectory prediction. The work contributes a scalable, annotation-free paradigm for long-range video understanding and a critical analysis of the socio-technical infrastructure required for real-world adoption.
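The abstract does not give the exact form of the contrastive and predictive objectives. As a rough illustration only, the sketch below assumes both terms are InfoNCE-style losses over embedding vectors and that they are mixed with a single weight. The names info_nce and combined_loss, the temperature value, and the weighting scheme are hypothetical choices made for this example, not the paper's implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` acts as an in-batch negative (an assumption of this sketch).
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def combined_loss(short_term, long_term, predicted, future, weight=0.5):
    """Hypothetical mix of a contrastive and a predictive term.

    Contrastive term: align short-term and long-term stream embeddings
    of the same clip. Predictive term: match predicted future embeddings
    to the embeddings actually observed later.
    """
    contrastive = info_nce(short_term, long_term)
    predictive = info_nce(predicted, future)
    return weight * contrastive + (1.0 - weight) * predictive
```

Under these assumptions, a batch whose paired embeddings point in the same direction yields a loss near zero, while a batch whose pairs are swapped is penalized heavily. In practice the embeddings would come from the interleaved encoder streams rather than hand-written vectors.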

