OOIR: Observatory of International Research

Papers

(The H4-Index of IEEE Transactions on Multimedia is 78. The table below lists the papers with citation counts at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-06-01 to 2026-06-01.)

Article	Citations
Improving Vision Anomaly Detection With the Guidance of Language Modality	931
Focusing on Subtle Differences: A Feature Disentanglement Model for Series Photo Selection	509
Vulnerability of Feature Extractors in 2D Image-Based 3D Object Retrieval	365
SkyML: A MLaaS Federation Design for Multicloud-Based Multimedia Analytics	336
Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention	248
Rethinking Affine Transform for Efficient Image Enhancement: A Color Space Perspective	200
FoodSAM: Any Food Segmentation	191
Online Low-Light Sand-Dust Video Enhancement Using Adaptive Dynamic Brightness Correction and a Rolling Guidance Filter	184
Semi-Supervised Domain Adaptation via Joint Transductive and Inductive Subspace Learning	183
Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization	180
SGG-Nets: Generic Rotation-Invariant Plugin Networks for Point Cloud Analysis	176
ViDR-GNN: Vision Implicit Discriminative Reorganization Graph Neural Networks	175
Dual-Task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding	157
Late Fusion Multiple Kernel Clustering With Local Kernel Alignment Maximization	153
Ensemble Prototype Networks for Unsupervised Cross-Modal Hashing With Cross-Task Consistency	153
Few-Shot Generative Model Adaptation via Style-Guided Prompt	150
Outliers Adaptation Exploration and Centroids Matching Label Refinement for Unsupervised Person Re-identification	149
Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations	144
Feature First: Advancing Image-Text Retrieval Through Improved Visual Features	144
Quality Assessment for DIBR-Synthesized Views Based on Wavelet Transform and Gradient Magnitude Similarity	143
Towards Substation Semantic Segmentation: A benchmark dataset and a cross-attention embedded hierarchical network	141
HRVFusion: Video-based Long-Term Heart Rate Variability Measurement with Conditional Diffusion Models	139
ICE: Interactive 3D Game Character Facial Editing via Dialogue	135
Revisiting the Adversarial Transferability: Towards a Perspective of Semantic Preservation	133
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition	132
Exploring Kernel Transformations for Implicit Neural Representations	130
Posture-Movement-Frequency-Enhanced Graph Convolutional Network for Gait Emotion Recognition	127
Mask-Aware Kernel Learning for Action Recognition	126
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework	125
Siamese Alignment Network for Weakly Supervised Video Moment Retrieval	125
Anomaly-Led Prompting Learning Caption Generating Model and Benchmark	124
BMB: Balanced Memory Bank for Long-Tailed Semi-Supervised Learning	123
LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation	120
Hierarchical Equalization Loss for Long-Tailed Instance Segmentation	119
Bias-Correction Feature Learner for Semi-Supervised Instance Segmentation	116
BASNet: Boundary Assisted Network for Image Splicing Forgery Detection	116
Pixel Bleach Network for Detecting Face Forgery Under Compression	113
Mix-Based Training Strategies for Learning Implicit Neural Representations	113
Skeleton-Based Action Recognition With Select-Assemble-Normalize Graph Convolutional Networks	111
Bidirectional Translation Between UHD-HDR and HD-SDR Videos	110
Neighborhood Contrastive Transformer for Change Captioning	109
Scale Up Composed Image Retrieval Learning via Modification Text Generation	107
Optimal Transport-Based Patch Matching for Image Style Transfer	105
Robust Multi-Stage Tracking via Multi-Scale and Multi-Level Representation Learning	104
Long Video Understanding with Learnable Retrieval in Video-Language Models	103
Transferable Backdoor Attack on Any CLIP Model With Any Target Class by Pre-Trained Hack Network	101
Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames	100
Watch Where You Move: Region-Aware Dynamic Aggregation and Excitation for Gait Recognition	100
PropMambaSR: Lightweight Image Super-Resolution with Propagation State Space Model	97
3D-SceneQ: Empowering 3D LLM with Query-Guided Adaptive Pruning and Multi-modal Feature Enhancement	97
DWSF-Net: A Dynamic Wavelet-based Spatial-frequency Fusion Network for Multispectral Object Detection	97
Perceptual Image Hashing Using Feature Fusion of Orthogonal Moments	97
MHRN: A Multimodal Hierarchical Reasoning Network for Topic Detection	96
MVPC-CLIP: Multi-Granularity Visual Prompt Co-Operative for Aerial Video Recognition	96
Adaptive Weight Generator for Multi-Task Image Recognition by Task Grouping Prompt	96
Deep Semantic-Consistent Penalizing Hashing for Cross-Modal Retrieval	95
Semi-Supervised Contrastive Learning With Similarity Co-Calibration	95
Semantic Dual-Adversarial Network for Blended-Target Domain Adaptation	95
SCSP: An Unsupervised Image-to-Image Translation Network Based on Semantic Cooperative Shape Perception	94
Self-Guided Discriminative Locality Preserving Projections	94
Semantic-Aware Triplet Loss for Image Classification	94
Improving Pre-Trained Model-Based Speech Emotion Recognition From a Low-Level Speech Feature Perspective	93
Adaptive HEVC Video Steganography With High Performance Based on Attention-Net and PU Partition Modes	91
Distributed Deep Point Cloud Feature Compression for Vehicle-to-Vehicle Cooperative Perception	90
AMS-Net: Adaptive Multi-Scale Network for Image Compressive Sensing	89
Self-Mining the Confident Prototypes for Source-Free Unsupervised Domain Adaptation in Image Segmentation	89
Unsupervised Learning-Based Framework for Deepfake Video Detection	89
Progressive Local Filter Pruning for Image Retrieval Acceleration	88
Interpretable Graph Convolutional Network for Multi-View Semi-Supervised Learning	87
Weakly-Supervised 3D Visual Grounding Based on Visual Language Alignment	84
Guided Image-to-Image Translation by Discriminator-Generator Communication	84
Dynamic Contrastive Distillation for Image-Text Retrieval	82
Multi-Level Transitional Contrast Learning for Personalized Image Aesthetics Assessment	82
Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation With Interpretability	82
PhotoHelper: Portrait Photographing Guidance Via Deep Feature Retrieval and Fusion	81
SLCGC: A lightweight Self-supervised Low-Pass Contrastive Graph Clustering Network for Hyperspectral Images	80
Asymptotics-Aware Multi-View Subspace Clustering	79
One-Shot Human Motion Transfer via Occlusion-Robust Flow Prediction and Neural Texturing	78
Semi-Supervised Domain Adaptation for Major Depressive Disorder Detection	78
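
The threshold used for the table above can be illustrated with a short sketch. It assumes the H4-Index is a standard h-index computed only over papers from the four-year window: the largest h such that at least h papers have h or more citations. The page does not spell out its exact procedure, so this is an assumption, and the names `h_index`, `parse_row`, and `papers_at_or_above` are hypothetical helpers, not part of any OOIR or CrossRef API.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations (standard h-index)."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def parse_row(line):
    """Split one 'Title<TAB>Citations' row into (title, citation_count)."""
    title, count = line.rsplit("\t", 1)
    return title.strip(), int(count)


def papers_at_or_above(rows, threshold):
    """Keep the (title, count) pairs whose count meets the threshold."""
    return [(title, count) for title, count in rows if count >= threshold]


# Toy data (not taken from the table): five papers with these citation counts.
toy_counts = [10, 8, 5, 4, 3]
toy_h = h_index(toy_counts)  # four papers have at least 4 citations, so h = 4

# Parsing a row in the same tab-separated layout as the table.
row = parse_row("FoodSAM: Any Food Segmentation\t191")
```

Under this definition a paper whose count equals the index still qualifies, which matches the table ending with entries at 78 citations.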