Explainable Long Video Understanding through Dynamic Motion Tokens and Temporal Causal Discovery

Authors

  • Henri M. Rose, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, USA

Keywords

long video understanding, explainability, motion tokens, temporal causal discovery, dynamic representation, video analytics, causal inference, system architecture

Abstract

Long video understanding remains a central challenge in artificial intelligence due to the complexity of temporal dependencies, the volume of redundant visual data, and the opacity of deep learning models. This paper proposes a framework that integrates dynamic motion tokens and temporal causal discovery to produce interpretable analyses of extended video sequences. Dynamic motion tokens are learned representations that condense local motion patterns into discrete, semantically meaningful units while preserving temporal ordering. Temporal causal discovery then identifies directed causal relationships among these tokens across time, yielding a graph-based explanation of event progression. The system is explainable by design rather than reliant on post-hoc interpretation. The paper examines the structural trade-offs among token granularity, causal graph sparsity, and computational efficiency. It also discusses architectural choices for large-scale deployment, including distributed processing pipelines and memory-bounded inference. Robustness is addressed with particular attention to distribution shift and adversarial perturbations. Fairness and policy implications are explored in the context of video surveillance and content moderation applications. The framework is contrasted with existing methods such as SlowFast, VideoMAE, and other transformer-based architectures, highlighting the benefits of causal explainability in high-stakes domains. The work concludes with a forward-looking discussion of governance and sustainability, arguing that transparent causal models are essential for accountable video analytics in societal infrastructure. This research contributes to the growing intersection of explainable artificial intelligence and temporal reasoning, offering a principled pathway toward trustworthy long video understanding.
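
The two stages named in the abstract, discretizing motion into ordered tokens and then linking tokens across time in a directed graph, can be illustrated with a toy sketch. This is not the paper's method: the function names `tokenize` and `lagged_edges`, the fixed three-entry codebook, and the lift threshold are all hypothetical choices made for illustration. In particular, the lagged conditional-probability lift used here only measures temporal association and stands in for a real causal discovery procedure.

```python
from collections import Counter

def tokenize(motion, codebook):
    """Map each per-frame motion vector to the index of its nearest codebook
    entry (squared Euclidean distance). Frame order is preserved."""
    def nearest(v):
        return min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])),
        )
    return [nearest(v) for v in motion]

def lagged_edges(tokens, lag=1, min_lift=1.5, min_support=2):
    """Return {(a, b): lift} for token pairs where observing a at time t
    raises the empirical probability of b at time t + lag.
    Assumption: lift is a stand-in score, not a causal test."""
    n = len(tokens) - lag
    if n <= 0:
        return {}
    pair_counts = Counter(zip(tokens[:n], tokens[lag:]))
    src_counts = Counter(tokens[:n])
    dst_counts = Counter(tokens[lag:])
    edges = {}
    for (a, b), count in pair_counts.items():
        if a == b or count < min_support:
            continue  # skip self-loops and rarely observed transitions
        p_b_given_a = count / src_counts[a]
        p_b = dst_counts[b] / n
        lift = p_b_given_a / p_b
        if lift >= min_lift:
            edges[(a, b)] = lift
    return edges

# Hypothetical codebook: 0 = still, 1 = rightward motion, 2 = upward motion.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical per-frame mean motion vectors (noisy still/right/up cycle).
motion = [
    (0.1, 0.0), (0.9, 0.1), (0.0, 1.1),
    (0.0, 0.1), (1.2, 0.0), (0.1, 0.8),
    (-0.1, 0.0), (1.0, -0.1), (0.1, 0.9),
    (0.0, 0.0),
]

tokens = tokenize(motion, codebook)
edges = lagged_edges(tokens, lag=1)
for (src, dst), lift in sorted(edges.items()):
    print(f"token {src} -> token {dst} (lift {lift:.2f})")
```

On this toy input the tokens cycle through still, right, and up, so the sketch reports an edge for each step of the cycle. A faithful implementation would learn the codebook from data and replace the lift score with a method that conditions on other tokens and longer lags.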

References

1. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299–6308).

2. Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision (pp. 6202–6211).

3. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). ViViT: A video vision transformer. In Proceedings of the IEEE International Conference on Computer Vision (pp. 6836–6846).

4. Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems (Vol. 35, pp. 10078–10093).

5. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (pp. 618–626).

6. Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning (pp. 2668–2677).

7. Jin, H., Yi, H., Zhao, W., Luo, J., Ye, S., Guan, Z., ... & Yu, T. (2026). HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding. arXiv preprint arXiv:2605.08158.

8. Runge, J., Nowack, P., Kretschmer, M., Flaxman, S., & Sejdinovic, D. (2019). Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11), eaau4996.

9. Tank, A., Covert, I., Foti, N., Shojaie, A., & Fox, E. (2022). Neural Granger causality for nonlinear time series. Journal of Machine Learning Research, 23(1), 4560–4620.

10. Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of causal inference: Foundations and learning algorithms. MIT Press.

11. Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction, and search (2nd ed.). MIT Press.

12. Shojaie, A., & Michailidis, G. (2010). Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3), 519–538.

13. Zhu, P., Zhao, S., Deng, H., & Han, F. (2025). Attentive radiate graph for pedestrian trajectory prediction in disconnected manifolds. IEEE Transactions on Intelligent Transportation Systems.

14. Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.

15. Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., & Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5), 612–634.

16. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.

17. Lakkaraju, H., Arsov, N., & Leskovec, J. (2020). Interpretable machine learning: A path to more trustworthy AI. Nature Machine Intelligence, 2(7), 361–362.

18. Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press.

19. Goyal, R., Kahou, S. E., Michalski, V., Pal, C., & Bengio, Y. (2017). The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5842–5850).

20. Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7794–7803).

Published

2026-06-05

How to Cite

Rose, H. M. (2026). Explainable Long Video Understanding through Dynamic Motion Tokens and Temporal Causal Discovery. Journal of Advanced Artificial Intelligence Research, 5(1). https://www.jaair.org/index.php/home/article/view/43