Abstract: Transformer-based end-to-end multi-object tracking (MOTR) suffers from slow training, high memory usage, and small batch sizes because self-attention computation grows quadratically with input resolution. To address these issues, an efficient MOTR (EMOTR) method is proposed that leaves the MOTR paradigm unchanged. Fast multi-scale attention (FMA) combines an I/O-friendly attention kernel with multi-scale feature fusion to reduce computational and storage overhead while improving the feature resolution available for small objects. Spatio-temporal batch processing and decoder weight sharing allow variable-length sequences to be trained within the same batch and cut the parameter count by approximately 15%. Automatic mixed precision (AMP) combined with dynamic loss scaling makes full use of Tensor Core throughput. Experiments on VisDrone2019 show that, compared with the original MOTR, training time is reduced by 80.4%, the parameter count is reduced by 15.5%, MOTA rises by 2.1 to reach 24.9, and IDF1 remains stable, indicating that the practicality of end-to-end MOT can be improved substantially without sacrificing accuracy.
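As a rough illustration of the dynamic loss scaling mentioned in the abstract, the following minimal sketch shows the general technique in plain Python: the loss is multiplied by a scale factor before backpropagation, the step is skipped and the scale is reduced when any gradient overflows, and the scale is increased again after a run of overflow-free steps. The class name `DynamicLossScaler` and its parameters are hypothetical, and the default values are assumptions chosen to resemble common framework defaults; the abstract does not state the settings actually used in EMOTR.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (illustrative only, not the authors' code)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        # All defaults are assumptions, not values taken from the paper.
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without overflow

    def scale_loss(self, loss):
        # Scale the loss so that small half-precision gradients do not underflow.
        return loss * self.scale

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the step must be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow detected: shrink the scale and skip this update.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        # Unscale with the scale that was used for this step.
        grads = [g / self.scale for g in scaled_grads]
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale next time.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return grads
```

In a real training loop this bookkeeping is normally delegated to the framework's AMP utilities rather than written by hand; the sketch only makes the skip, back-off, and growth behaviour explicit.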