Unified spatio-temporal attention MixFormer for visual object tracking

Abstract

In this paper, we present a unified spatio-temporal attention MixFormer framework for visual object tracking. Within the vision transformer framework, we design a cohesive network consisting of target template and search region feature extraction, cross-attention utilizing spatial and temporal information, and task-specific heads, all operating in an end-to-end manner. Incorporating spatial and temporal attention modules within the network enables simultaneous feature extraction and emphasis, allowing the model to concentrate on target-specific discriminative features despite changes in illumination, occlusion, scale, camera pose, and background clutter. Stacking multiple non-hierarchical blocks allows meaningful features to be extracted while irrelevant features are discarded from the provided target template and search region. The simultaneous spatio-temporal attention module is employed to accentuate target appearance features and alleviate variation in the object state across frame sequences. Qualitative and quantitative analysis, including ablation tests based on various tracking benchmarks, validates the robustness of the proposed tracking methodology.
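The idea of attending jointly over target template and search region features can be illustrated with a small sketch. This is not the paper's implementation: it assumes single-head scaled dot-product attention over the concatenated token sequences, and the function name `mixed_attention`, the token counts, the feature width, and the random projection matrices are all hypothetical choices made for illustration. The template tokens can be thought of as coming from an earlier frame and the search tokens from the current one.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mixed_attention(template, search, w_q, w_k, w_v):
    """Single-head attention over concatenated template and search tokens.

    Hypothetical sketch: every token attends to every other token, so
    template-to-search and search-to-template interactions are computed
    in the same step as self-attention within each group.
    """
    tokens = np.concatenate([template, search], axis=0)  # (Nt + Ns, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (Nt + Ns, Nt + Ns)
    weights = softmax(scores, axis=-1)
    out = weights @ v
    n_t = template.shape[0]
    return out[:n_t], out[n_t:], weights


# Assumed toy sizes: 4 template tokens, 16 search tokens, feature width 8.
rng = np.random.default_rng(0)
d = 8
template = rng.normal(size=(4, d))
search = rng.normal(size=(16, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

t_out, s_out, weights = mixed_attention(template, search, w_q, w_k, w_v)
print(t_out.shape, s_out.shape, weights.shape)
```

Stacking several such blocks at a fixed token resolution would correspond loosely to the non-hierarchical design the abstract mentions; the task-specific heads would then operate on the search-region outputs.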

Article
2024
English
Featured Keywords

Visual object tracking
Unified vision transformer
Spatio-temporal model