OOIR: Observatory of International Research

Papers

(The median citation count of IEEE Computer Architecture Letters is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-06-01 to 2026-06-01.)

Article	Citations
Old is Gold: Optimizing Single-Threaded Applications With ExGen-Malloc	94
Speculative Multi-Level Access in LSM Tree-Based KV Store	26
Toward Practical 128-Bit General Purpose Microarchitectures	24
A Characterization of Generative Recommendation Models: Study of Hierarchical Sequential Transduction Unit	22
Characterization and Analysis of Text-to-Image Diffusion Models	21
Exploration of Algorithm-Hardware Co-Design for Floating-Point Digital Compute-in-Memory	21
Accelerating Programmable Bootstrapping Targeting Contemporary GPU Microarchitecture	20
The Architectural Sustainability Indicator	19
Time Series Machine Learning Models for Precise SSD Access Latency Prediction	18
SCALES: SCALable and Area-Efficient Systolic Accelerator for Ternary Polynomial Multiplication	18
Context-Aware Set Dueling for Dynamic Policy Arbitration	16
Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure	13
De-Quantization Penalties for Interactive LLM Inference on Prosumer GPUs	13
In-Depth Characterization of Machine Learning on an Optimized Multi-Party Computing Library	12
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference	12
A Quantitative Analysis of Mamba-2-Based Large Language Model: Study of State Space Duality	12
SoCurity: A Design Approach for Enhancing SoC Security	11
Straw: A Stress-Aware WL-Based Read Reclaim Technique for High-Density NAND Flash-Based SSDs	11
Improving Energy-Efficiency of Capsule Networks on Modern GPUs	11
Exploring KV Cache Quantization in Multimodal Large Language Model Inference	11
OASIS: Outlier-Aware KV Cache Clustering for Scaling LLM Inference in CXL Memory Systems	10
Wafer-scale GPU Memory Pool with In-Package Optics for Enhanced Capacity and Bandwidth	10
AiDE: Attention-FFN Disaggregated Execution for Cost-Effective LLM Decoding on CXL-PNM	10
A Flexible Embedding-Aware Near Memory Processing Architecture for Recommendation System	9
REDIT: Redirection-Enabled Memory-Side Directory Architecture for CXL Memory Fabric	9
In-Memory Versioning (IMV)	8
RouteReplies: Alleviating Long Latency in Many-Chip-Module GPUs	8
A Case for In-Memory Random Scatter-Gather for Fast Graph Processing	8
StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models	8
Disaggregated Speculative Decoding for Carbon-Efficient LLM Serving	7
Enabling Computation and Communication Overlap in PIMs for On-Device LLM Inference	7
Exploring the DIMM PIM Architecture for Accelerating Time Series Analysis	7
High-Bandwidth Flash for KV Caches: Endurance and Performance Implications	6
Improving Performance on Tiered Memory With Semantic Data Placement	6
Mitigating Timing-Based NoC Side-Channel Attacks With LLC Remapping	6
Accelerating Deep Reinforcement Learning via Phase-Level Parallelism for Robotics Applications	6
Thread-Adaptive: High-Throughput Parallel Architectures of SLH-DSA on GPUs	6
PUDTune: Multi-Level Charging for High-Precision Calibration in Processing-Using-DRAM	6
DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity	6
NoHammer: Preventing Row Hammer With Last-Level Cache Management	6
QuArch: A Question-Answering Dataset for AI Agents in Computer Architecture	6
Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference	6
Security Helper Chiplets: A New Paradigm for Secure Hardware Monitoring	6
pNet-gem5: Full-System Simulation With High-Performance Networking Enabled by Parallel Network Packet Processing	6
Efficient Deadlock Avoidance by Considering Stalling, Message Dependencies, and Topology	5
Hisui: Unlocking Tiered Memory Efficiency for FaaS Workloads	5
SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency	5
LADIO: Leakage-Aware Direct I/O for I/O-Intensive Workloads	5
SparseLeakyNets: Classification Prediction Attack Over Sparsity-Aware Embedded Neural Networks Using Timing Side-Channel Information	5
High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network	5
RAESC: A Reconfigurable AES Countermeasure Architecture for RISC-V With Enhanced Power Side-Channel Resilience	5
Memory-Centric MCM-GPU Architecture	5
PreGNN: Hardware Acceleration to Take Preprocessing Off the Critical Path in Graph Neural Networks	4
ReplayOpt: Optimizer-State Replay to Resolve Critical-Path Bottlenecks in Offloaded Training	4
A Flexible Hybrid Interconnection Design for High-Performance and Energy-Efficient Chiplet-Based Systems	4
Managing Prefetchers With Deep Reinforcement Learning	4
H³: Hybrid Architecture Using High Bandwidth Memory	4
Xami: Expert-Aware Adaptive Compression for Mi	4
Enhancing the Reach and Reliability of Quantum Annealers by Pruning Longer Chains	4
Nighthawk: Zero-Copy Cache Quarantine for Invisible Speculation	4
KiF: Accelerating Low-Batch LLM Inference Using In-Flash KV Cache	4
Primate: A Framework to Automatically Generate Soft Processors for Network Applications	4
FPGA-Accelerated Data Preprocessing for Personalized Recommendation Systems	3
Fast Inter-Enclave Communication Encryption	3
Driving the Core Frontend With LiteBTB	3
T-CAT: Dynamic Cache Allocation for Tiered Memory Systems With Memory Interleaving	3
LeakDiT: Diffusion Transformers for Trace-Augmented Side-Channel Analysis	3
Understanding the Performance Behaviors of End-to-End Protein Design Pipelines on GPUs	3
Accelerators & Security: The Socket Approach	3
ZoneBuffer: An Efficient Buffer Management Scheme for ZNS SSDs	3
Enabling Cost-Efficient LLM Inference on Mid-Tier GPUs With NMP DIMMs	3
Fast Performance Prediction for Efficient Distributed DNN Training	3
Direct-Coding DNA With Multilevel Parallelism	3
Exploring Volatile FPGAs Potential for Accelerating Energy-Harvesting IoT Applications	3
Adaptive Web Browsing on Mobile Heterogeneous Multi-cores	3
SSE: Security Service Engines to Accelerate Enclave Performance in Secure Multicore Processors	3
Guard Cache: Creating Noisy Side-Channels	3
A Quantum Computer Trusted Execution Environment	3
SEMS: Scalable Embedding Memory System for Accelerating Embedding-Based DNNs	3
Camulator: A Lightweight and Extensible Trace-Driven Cache Simulator for Embedded Multicore SoCs	3
Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads	2
Hungarian Qubit Assignment for Optimized Mapping of Quantum Circuits on Multi-Core Architectures	2
Cost-Effective Extension of DRAM-PIM for Group-Wise LLM Quantization	2
Characterization and Analysis of the 3D Gaussian Splatting Rendering Pipeline	2
Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-Based Generative Models	2
Minimal Counters, Maximum Insight: Simplifying System Performance With HPC Clusters for Optimized Monitoring	2
A First-Order Model to Assess Computer Architecture Sustainability	2
R.I.P. Geomean Speedup Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead	2
DRAM-CAM: General-Purpose Bit-Serial Exact Pattern Matching	2
gem5-accel: A Pre-RTL Simulation Toolchain for Accelerator Architecture Validation	2
FullPack: Full Vector Utilization for Sub-Byte Quantized Matrix-Vector Multiplication on General Purpose CPUs	2
EgDiff: An Enhanced Global Load Value Predictor	2
Enhancing DNN Training Efficiency Via Dynamic Asymmetric Architecture	2
Redundant Array of Independent Memory Devices	2
eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models	2
A Case Study of a DRAM-NVM Hybrid Memory Allocator for Key-Value Stores	2
MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage	2
FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems	2
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System	2
Analyzing and Exploiting Memory Hierarchy Parallelism With MLP Stacks	2
LWAL: Lightweight Adaptive Learning-Driven Cache Bypassing for GPUs	2
Approximate Multiplier Design With LFSR-Based Stochastic Sequence Generators for Edge AI	2
Architectural Implications of GNN Aggregation Programming Abstractions	2
Characterization and Implementation of Radar System Applications on a Reconfigurable Dataflow Architecture	2
Overcoming Memory Capacity Wall of GPUs With Heterogeneous Memory Stack	2
CABANA: Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX	2
Computational CXL-Memory Solution for Accelerating Memory-Intensive Applications	2
CGR-NPU: A Hybrid CGRA and NPU Architecture for Adaptive Neural Computing Workloads	2
Capacity-Latency Tradeoffs in CXL Memory Expander at Hyperscale	2
Amethyst: Reducing Data Center Emissions With Dynamic Autotuning and VM Management	2
IntervalSim++: Enhanced Interval Simulation for Unbalanced Processor Designs	2
On Internally Tagged Instruction Set Architectures	2
Near-HBM Tensor Core Acceleration for Fine-Grained Sparse Matrix-Matrix Multiplication	2
Reducing the Silicon Area Overhead of Counter-Based Rowhammer Mitigations	2
Accelerating Page Migrations in Operating Systems With Intel DSA	2
PINSim: A Processing In- and Near-Sensor Simulator to Model Intelligent Vision Sensors	2
Enhancing DCIM Efficiency with Multi-Storage-Row Architecture for Edge AI Workloads	2
Energy-Efficient Bayesian Inference Using Bitstream Computing	2
A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models	1
GraNDe: Near-Data Processing Architecture With Adaptive Matrix Mapping for Graph Convolutional Networks	1
Approximate SFQ-Based Computing Architecture Modeling With Device-Level Guidelines	1
Hashing ATD Tags for Low-Overhead Safe Contention Monitoring	1
Pyramid: Accelerating LLM Inference With Cross-Level Processing-in-Memory	1
Contention-Aware GPU Thread Block Scheduler for Efficient GPU-SSD	1
Exploiting Intel AMX Power Gating	1
Cache and Near-Data Co-Design for Chiplets	1
Characterizing and Understanding End-to-End Multi-Modal Neural Networks on GPUs	1
A Case for Hardware Memoization in Server CPUs	1
Electra: Eliminating the Ineffectual Computations on Bitmap Compressed Matrices	1
An Intermediate Language for General Sparse Format Customization	1
Canal: A Flexible Interconnect Generator for Coarse-Grained Reconfigurable Arrays	1
Intelligent SSD Firmware for Zero-Overhead Journaling	1
Towards an Accelerator for Differential and Algebraic Equations Useful to Scientists	1
JBOC: Just a Bunch of CXL-Enabled SSDs for Resource-Efficient LLM Checkpointing	1
Tulip: Turn-Free Low-Power Network-on-Chip	1
On Variable Strength Quantum ECC	1
Architectural Security Regulation	1
Fusing Adds and Shifts for Efficient Dot Products	1
Dicemite: Scaling Shared-Memory Isolation for Highly Consolidated FaaS Workers	1
Structured Combinators for Efficient Graph Reduction	1
Exploiting Direct Memory Operands in GPU Instructions	1
GPU-Centric Memory Tiering for LLM Serving With NVIDIA Grace Hopper Superchip	1
A Partial Tag–Data Decoupled Architecture for Last-Level Cache Optimization	1
SPAM: Streamlined Prefetcher-Aware Multi-Threaded Cache Covert-Channel Attack	1
Characterizing and Understanding HGNNs on GPUs	1
Efficient MoE Model Fine-Tuning on Commodity GPU Server With Offloading	1
HINT: A Hardware Platform for Intra-Host NIC Traffic and SmartNIC Emulation	1
Supporting a Virtual Vector Instruction Set on a Commercial Compute-in-SRAM Accelerator	1
SPGPU: Spatially Programmed GPU	1
EONSim: An NPU Simulator for On-Chip Memory and Embedding Vector Operations	1
X-PPR: Post Package Repair for CXL Memory	1
Stardust: Scalable and Transferable Workload Mapping for Large AI on Multi-Chiplet Systems	1
Low-Latency PIM Accelerator for Edge LLM Inference	1
UDIR: Towards a Unified Compiler Framework for Reconfigurable Dataflow Architectures	1
A Data Prefetcher-Based 1000-Core RISC-V Processor for Efficient Processing of Graph Neural Networks	1
InfAMAX: Bridging the Compute-Memory Gap in Intel AMX for Efficient LLM Inference	1
MajorK: Majority Based kmer Matching in Commodity DRAM	1
Balancing Performance Against Cost and Sustainability in Multi-Chip-Module GPUs	1
A Multiple-Aspect Optimal CNN Accelerator in Top1 Accuracy, Performance, and Power Efficiency	1
Address Scaling: Architectural Support for Fine-Grained Thread-Safe Metadata Management	1
GEMM the New Gem: The Inevitable Kernel and its Sensitivity to Compiler Optimizations and Libraries	1
Halis: A Hardware-Software Co-Designed Near-Cache Accelerator for Graph Pattern Mining	1
MixDiT: Accelerating Image Diffusion Transformer Inference With Mixed-Precision MX Quantization	1
TeleVM: A Lightweight Virtual Machine for RISC-V Architecture	1
I/O-ETEM: An I/O-Aware Approach for Estimating Execution Time of Machine Learning Workloads	1