A Multimodal Spatio-Temporal Transformer for Trajectory-Aware Long Video Event Understanding in Intelligent Transportation Systems

Olivier Jarvinen; Bedra Millis; Jeffrey Daker

Authors

Olivier Jarvinen, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA
Bedra Millis, School of Computing, Clemson University, Clemson, SC, USA
Jeffrey Daker, Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA

Keywords

multimodal transformer, long video understanding, trajectory prediction, intelligent transportation systems, spatio-temporal attention, edge-cloud deployment, fairness, sustainability

Abstract

The growing deployment of intelligent transportation systems (ITS) demands robust, real-time video understanding that goes beyond isolated frame analysis to capture long-range spatio-temporal dependencies in traffic scenes. This paper proposes a multimodal spatio-temporal transformer architecture designed for trajectory-aware long video event understanding. The framework integrates heterogeneous sensor streams, including visual, LiDAR, and radar data, into a unified token representation that preserves spatial topology and temporal continuity. A hierarchical attention mechanism processes video segments of arbitrary length while keeping computation tractable through windowed self-attention and cross-modal fusion layers. A dedicated motion encoder supplies explicit trajectory priors, enabling the model to reason about agent interactions over extended time horizons. The paper examines the structural trade-offs of this design: model depth versus inference latency, the cost of modality alignment, and the balance between local and global receptive fields. Deployment on edge-cloud hybrid infrastructure is analyzed with emphasis on sustainability, energy efficiency, and real-time constraints. Robustness to sensor noise and adversarial perturbations is discussed in terms of redundancy and failover mechanisms. Fairness and governance issues arising from biased training data and uneven coverage across demographic groups are critically assessed, and policy implications for regulatory compliance, privacy preservation, and public accountability are outlined. The proposed architecture demonstrates how trajectory-aware multimodal transformers can achieve state-of-the-art performance in tasks such as traffic accident anticipation, pedestrian intention recognition, and congestion pattern evolution, while highlighting the need for responsible deployment strategies that prioritize both performance and equity.
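
The abstract names windowed self-attention as the means of keeping long sequences tractable but does not give its layout. The following is a minimal sketch of the general technique, not the authors' implementation. It assumes non-overlapping temporal windows and uses the raw tokens as queries, keys, and values with no learned projections; the function names and the `window` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_self_attention(tokens, window):
    """Attend each token only to the tokens in its own window.

    tokens: list of equal-length feature vectors, one per time step.
    window: tokens per non-overlapping window (assumed, not from the paper).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        for query in block:
            # Scaled dot-product scores against keys in the same window only.
            scores = [scale * sum(q * k for q, k in zip(query, key))
                      for key in block]
            weights = softmax(scores)
            # Weighted average of the window's value vectors.
            out.append([sum(w * value[d] for w, value in zip(weights, block))
                        for d in range(dim)])
    return out


def attention_pairs(num_tokens, window):
    """Count the query-key pairs scored by the windowed scheme."""
    full, rest = divmod(num_tokens, window)
    return full * window * window + rest * rest


# Toy sequence of 8 two-dimensional tokens, processed in windows of 4.
tokens = [[float(t), 1.0] for t in range(8)]
mixed = windowed_self_attention(tokens, window=4)

# 1024 tokens with window 16 score 16384 pairs; full attention scores 1048576.
windowed_cost = attention_pairs(1024, 16)
full_cost = 1024 * 1024
```

With a fixed window the number of scored pairs grows linearly with sequence length rather than quadratically, which is the tractability argument the abstract makes. Information crosses window boundaries only through additional stages, for example the hierarchical and cross-modal fusion layers the abstract mentions, which this sketch leaves out.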

