OOIR: Observatory of International Research

Papers

(The median citation count of ACM Transactions on Architecture and Code Optimization is 1. The table below lists the papers at or above that threshold, based on CrossRef citation counts [max. 250 papers]. It covers papers published in the past four years, i.e., from 2022-06-01 to 2026-06-01.)

Article	Citations
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency	52
ASM: An Adaptive Secure Multicore for Co-located Mutually Distrusting Processes	48
An Intelligent Scheduling Approach on Mobile OS for Optimizing UI Smoothness and Power	30
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework	29
ESMPC: An Efficient Neural Network Training Framework for Secure Two- and Three-Party Computation	28
Accelerating Verifiable Queries over Blockchain Database System Using Processing-in-memory	28
Intra-request Lag-aware Cache Management to Enhance I/O Responsiveness of SSDs	26
ModNEF: An Open Source Modular Neuromorphic Emulator for FPGA for Low-Power In-Edge Artificial Intelligence	26
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators	24
Supporting QoS Guarantee in Heterogeneous Object Storage System: A Spatio-Temporal Graph Data Processing Method	24
Performance, Energy and NVM Lifetime-Aware Data Structure Refinement and Placement for Heterogeneous Memory Systems	22
HierMine: Accelerating Graph Pattern Mining via Hierarchical Sampling	22
Tiaozhuan: A General and Efficient Indirect Branch Optimization for Binary Translation	18
Characterizing Digital DRAM PIM through Modeling and Benchmarking	18
A Concise Concurrent B⁺-Tree for Persistent Memory	18
DCMA: Accelerating Parallel DMA Transfers with a Multi-Port Direct Cached Memory Access in a Massive-Parallel Vector Processor	17
Source Matching and Rewriting for MLIR Using String-Based Automata	16
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs	16
COER: A Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign	16
MLKAPS: Machine Learning and Adaptive Sampling for HPC Kernel Auto-tuning	15
FlashGEMM: Optimizing Sequences of Matrix Multiplication by Exploiting Data Reuse on CPUs	13
DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster	12
Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product	12
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments	12
Osiris: A Systolic Approach to Accelerating Fully Homomorphic Encryption	12
Mitigating the Bandwidth Wall via Data-Streaming System–Accelerator Co-Design	11
A NUMA-Aware Version of an Adaptive Self-Scheduling Loop Scheduler	11
COX: Exposing CUDA Warp-level Functions to CPUs	10
ODGS: Dependency-Aware Scheduling for High-Level Synthesis with Graph Neural Network and Reinforcement Learning	10
AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUs	10
Flexible and Effective Object Tiering for Heterogeneous Memory Systems	10
SnsBooster: Enhancing Sampling-based μArch Evaluation Efficiency through Online Performance Sensitivity Analysis	10
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping	9
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions	9
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems	9
Accelerating Parallel Structures in DNNs via Parallel Fusion and Operator Co-Optimization	9
BridgeGC: An Efficient Cross-Level Garbage Collector for Big Data Frameworks	9
Towards high scalability and fine-grained parallelism on distributed HPC platforms	9
Quantifying Resource Contention of Co-located Workloads with the System-level Entropy	9
Accelerating Nearest Neighbor Search in 3D Point Cloud Registration on GPUs	9
EXPERTISE: An Effective Software-level Redundant Multithreading Scheme against Hardware Faults	9
FDSR: Efficient Model Training via Adaptive Tensor Quantization Based on Frequency Domain Division and Similarity Data Reuse	9
Power Scheduling for Maximizing Throughput and Fairness in Co-running Applications	9
A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks	8
A Step toward Stateful HW-SW Migration: An Architecture-agnostic Checkpointing-rollback Toolchain	8
PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service	8
HyGain: High-performance, Energy-efficient Hybrid Gain Cell-based Cache Hierarchy	8
CLAP: Cross-Layer Adaptive Pipelining Inference Scheduling for Resource-Efficient Edge-Cloud Vision Systems	8
NEM-GNN: DAC/ADC-less, Scalable, Reconfigurable, Graph and Sparsity-Aware Near-Memory Accelerator for Graph Neural Networks	8
Environmental Condition Aware Super-Resolution Acceleration Framework in Server-Client Hierarchies	8
Towards High Performance QNNs via Distribution-Based CNOT Gate Reduction	8
Optimizing Attention for Large Language Model Inference on the MT-3000 Many-Core Processor	8
Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture	8
TPRepair: Tree-based Pipelined Repair in Clustered Storage Systems	8
WSGraph: A Framework for Tackling Redundant and Irregular Data Access in Streaming Graph Processing	7
Enabling Low-Latency, GPU-Efficient Serverless Inference with Model Swapping	7
CML-PowF: Data Clustering Matching Based Low-overhead Multiple CPU Real-time Power Forecasting	7
Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models	7
A Decoupled Analytical Model for Tile Size Selection in Affine Programs	7
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion	7
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals	7
HEngine: A High Performance Optimization Framework on a GPU for Homomorphic Encryption	7
PctoDL: Adaptive GPU Throughput Optimization for Deep Learning Inference with Power Constraints	7
RaNAS: Resource-Aware Neural Architecture Search for Edge Computing	7
DTAP: Accelerating Strongly-Typed Programs with Data Type-Aware Hardware Prefetching	7
HiSo: Co-optimizing the Intra-layer and Inter-layer Scheduling Schemes with the Hybrid Data Flow for PIM Architectures	6
Towards Optimizing Learned Index for High Performance, Memory Efficiency and NUMA Awareness	6
SimTrace: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis	6
gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography	6
Enabling Efficient Vector Processing: A Heterogeneous Vector Architecture with in-SRAM Computing	6
A Memory-Aware Sparse Matrix-Matrix Multiplication on Multicore Architectures	6
A Stable Idle Time Detection Platform for Real I/O Workloads	6
RACER: Avoiding End-to-End Slowdowns in Accelerated Chip Multi-Processors	6
Pac-PIM: A Parallel Communication Framework for Commodity Processing-in-memory Systems	6
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing	6
Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices	6
EDAS: Enabling Fast Data Loading for GPU Serverless Computing	6
Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks	5
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs	5
WIPE: A Write-Optimized Learned Index for Persistent Memory	5
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs	5
PaTGen: Temporal Similarity-Driven Proxy Benchmark Generation Method for Cloud Workloads	5
Toward Comprehensive Design Space Exploration on Heterogeneous Multi-core Processors	5
xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage	5
FlexHM: A Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations	5
Efficient and Scalable Hybrid Parallelization of Unstructured Computational Fluid Dynamics with Geometric Multigrid	5
JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation	5
GraphTune: An Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing	5
CGCGraph: Efficient CPU-GPU Co-execution for Concurrent Dynamic Graph Processing	5
Lightweight Code Outlining for Android Applications	5
Performance Prediction of Concurrent DNN Training Tasks in GPU Spatial Sharing Environments	5
BLR-Krylov: A Single-GPU Iterative SpMM Framework with Communication Avoidance and Block Low-Rank Optimization	5
Shift-CIM: In-SRAM Alignment To Support General-Purpose Bit-level Sparsity Exploration in SRAM Multiplication	5
Improving Utilization of Dataflow Unit for Multi-Batch Processing	5
Accelerating the Simulation of Parallel Workloads using Loop-Bounded Checkpoints	5
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs	5
TSN Cache: Exploiting Data Localities in Graph Computing Applications	4
MetaEC: An Efficient and Resilient Erasure-Coded KV Store on Disaggregated Memory	4
Capability-Based Efficient Data Transmission Mechanism for Serverless Computing	4
SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs	4
Architecting Optically Controlled Phase Change Memory	4
Iterating Pointers: Enabling Static Analysis for Loop-based Pointers	4
Address/Data Instruction Steering in Clustered General Purpose Processors	4
RaKV: A Write-Optimized LSM Store for Cloud Block Storage with Robust SLA	4
Koala: Efficient Pipeline Training through Automated Schedule Searching on Domain-Specific Language	4
BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation	4
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources	4
Rethinking Variable-Length Encoding: Exploiting Bit Sparsity for Parallel Decoding in LLM Accelerators	4
BLG-Tuning: Benchmark-Based Low-Cost General-Purpose I/O Modeling and Tuning	4
Efficient Flexible Edge Inference for Mixed-Precision Quantized DNN using Customized RISC-V Core	4
Scale-out Systolic Arrays	4
CoolDC: A Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling	4
Analytical Modeling of Set-Associative Caches for Optimizing Tensor Operations	4
Consequence-based Clustered Architecture	4
Towards Efficient Extendible Perfect Hashing for Hybrid PM-DRAM Memory	3
PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems	3
PANDA: Adaptive Prefetching and Decentralized Scheduling for Dataflow Architectures	3
A Low-latency On-chip Cache Hierarchy for Load-to-use Stall Reduction in GPUs	3
High Performance Singular Value Decomposition on GPU Architectures	3
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler	3
Heterogeneous Confidential Computing System for Large Language Models: A Survey	3
An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU	3
Cheetah: Accelerating Dynamic Graph Mining with Grouping Updates	3
Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy	3
GenCNN: A Partition-Aware Multi-Objective Mapping Framework for CNN Accelerators Based on Genetic Algorithm	3
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead	3
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access	3
MatXtract: Sparsity-Aware Matrix Transformation via Cascaded Compute Density EXtraction for SpMV	3
Compressing and Accelerating Sparse CNNs Using Sign-Reserved Toeplitz Filters and Input Activation Density-aware Dataflow	3
Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance	3
Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters	3
GraphService: Topology-aware Constructor for Large-scale Graph Applications	3
Optimizing OpenCL Barrier Synchronization and Memory Efficiency on Multi-Core DSPs	3
Matrix: Multi-Cipher Structures Dataflow for Parallel and Pipelined TFHE Accelerator	3
GNΩSIS: Lessons Learned in Generating a High-Level Synthesis Dataset	3
Making Root Cause Localization on FPGA Simulation Tools Robust	3
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks	3
CheriMore: On-Demand Vertical Memory Expansion for Capability Serverless Runtime	3
Data Deduplication Based on Content Locality of Transactions to Enhance Blockchain Scalability	3
PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation	3
SpiderSense: Lightweight Last-Level Cache Management via Time Period Tagging for LLC-Critical Workloads	3
Corrigendum: Unified and Efficient Factor Graph Accelerator Design for Robotic Optimization	3
SAL: Optimizing the Dataflow of Spin-based Architectures for Lightweight Neural Networks	3
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel	3
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications	3
3D GNLM: Efficient 3D Non-Local Means Kernel with Nested Reuse Strategies for Embedded GPUs	3
Diamond Tiling for Periodic Stencil Loop Nests by Means of Transitive Closure-based Extraction of Overlapping Iteration Spaces	3
Abakus: Accelerating k-mer Counting with Storage Technology	3
A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning	3
FastCC: A System-Algorithm Co-Design for Connected Components Computation on Large Power-Law Graphs	3
High-performance Deterministic Concurrency Using Lingua Franca	3
An Optimized GPU Implementation for GIST Descriptor	3
STen: Productive and Efficient Sparsity in PyTorch	3
CoNST: Code Generator for Sparse Tensor Networks	3
SPIRIT: Scalable and Persistent In-Memory Indices for Real-Time Search	2
Conflict Management in Vector Register Files	2
Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs	2
Partitioned Scheduling and Analysis for a Typed DAG Task on Heterogeneous Multi-Cores	2
CPU-GPU Workload Distribution during Throughput-Oriented LLM Inference on Single-GPU Systems	2
ALOHA: Accelerating Leveled Fully Homomorphic Encryption with Cryptography-Specific Architectures	2
UniTe: A Universal Tensor Abstraction for Capturing Spatial Relationships	2
Understanding Silent Data Corruption in Processors for Mitigating its Effects	2
HAVIT: An Efficient Hardware-Accelerator for Vision Tra	2
Bubble-Swap Flow Control	2
Supporting Dynamic Program Sizes in Deep Learning-Based Cost Models for Code Optimization	2
FORTIFY: Feature-Oriented Representation and Graph Topology Integration for Path-Level Vulnerability Detection	2
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs	2
ShieldCXL: A Practical Obliviousness Support with Sealed CXL Memory	2
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM	2
Membrane: Accelerating Database Analytics with DRAM-Based PIM Filtering and Schema Denormalization	2
GOLDYLOC: Global Optimizations & Lightweight Dynamic Logic for Concurrency	2
QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers	2
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads	2
AccelHSA: Modeling Single-ISA Heterogeneous GPU Architectures	2
SuccinctKV: a CPU-efficient LSM-tree Based KV Store with Scan-based Compaction	2
In-SRAM Parallel Data Shuffle	2
PARADISE: Criticality-Aware Instruction Reordering for Power Attack Resistance	2
SSD-SGD: Communication Sparsification for Distributed Deep Learning Training	2
Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints	2
Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication	2
Assessing the Impact of Compiler Optimizations on GPUs Reliability	2
HAIR: Halving the Area of the Integer Register File with Odd/Even Banking	2
Compiler Support for Sparse Tensor Computations in MLIR	2
ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors	2
ZNSFQ: An Efficient and High-Performance Fair Queue Scheduling Scheme for ZNS SSDs	2
ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection	2
Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs	2
SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V	2
DCSolver: Accelerating Sparse Iterative Solvers via Divide-and-Conquer on GPUs	1
Energy-efficient In-Memory Address Calculation	1
Solving Sparse Assignment Problems on FPGAs	1
DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip	1
The Design of an Efficient Lossy Compressor for Time Series Databases	1
WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs	1
Scheduling Language Chronology: Past, Present, and Future	1
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing	1
A Survey of General-purpose Polyhedral Compilers	1
A Lock-free RDMA-friendly Index in CPU-parsimonious Environments	1
PRAGA: A Priority-Aware Hardware/Software Co-design for High-Throughput Graph Processing Acceleration	1
ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads	1
SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching	1
Hardware-hardened Sandbox Enclaves for Trusted Serverless Computing	1
ScaleGS: Closing the Gap between Real-time 3D Gaussian Splatting and Real-time XR Rendering	1
PDGNN: Efficient Micro-batch GNN Training via Degree-Pruned Partitioning and Redundancy Elimination	1
Attack and Defense: Enhancing Robustness of Binary Hyper-Dimensional Computing	1
Thoth: Uncovering Data-Dependent Memory Access Patterns via Annotation-Directed Load Sampling	1
Phronesis: Efficient Performance Modeling for High-dimensional Configuration Tuning	1
IBing: An Efficient Interleaved Bidirectional Ring All-Reduce Algorithm for Gradient Synchronization	1
A²: Towards Accelerator Level Parallelism for Autonomous Micromobility Systems	1
Optimizing Garbage Collection for ZNS SSDs via In-storage Data Migration and Address Remapping	1
Fixed-point Encoding and Architecture Exploration for Residue Number Systems	1
Constructing a Supplementary Benchmark Suite to Represent Android Applications with User Interactions by using Performance Counters	1
Characterizing and Optimizing LDPC Performance on 3D NAND Flash Memories	1
Second-level Caches: Not for Instructions	1
Gator: Accelerating Graph Attention Networks by Jointly Optimizing Attention and Graph Processing	1
LitTLS: Lightweight Thread-Level Speculation on Little Cores	1
CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems	1
gHyPart: GPU-friendly End-to-End Hypergraph Partitioner	1
Fast One-Sided RDMA-Based State Machine Replication for Disaggregated Memory	1
JUNO++: Optimizing ANNS and Enabling Efficient Sparse Attention in LLM via Ray Tracing Core	1
NICE: Deep Neural Network Acceleration via Hardware-Friendly Index Assisted Compression	1
VersaTile: Flexible Tiled Architectures via Associative Processors	1
TianheGraph: Topology-aware Graph Processing	1
Cppless: Single-Source and High-Performance Serverless Programming in C++	1
HotLD: A Workload-Aware Method for Global Code-Layout Optimization of Shared Libraries	1
An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing	1
Overlapping Aware Data Placement Optimizations for LSM Tree-Based Store on ZNS SSDs	1
Quantitative Analysis and Performance Optimization of Graph Neural Networks on Multi-core CPUs	1
Critical Data Backup with Hybrid Flash-Based Consumer Devices	1
RegCPython: A Register-based Python Interpreter for Better Performance	1
Unleashing Parallelism with Elastic-Barriers	1
Dynamic Power Management Through Multi-agent Deep Reinforcement Learning for Heterogeneous Systems	1
DFGAS: Exploring the Balance of HW-SW Scheduling through the DFG-Aware Scheme	1
Leveraging the Hardware Resources to Accelerate cryo-EM Reconstruction of RELION on the New Sunway Supercomputer	1
Lock-Free High-performance Hashing for Persistent Memory via PM-aware Holistic Optimization	1
The Droplet Search Algorithm for Kernel Scheduling	1
Unveiling and Evaluating Vulnerabilities in Branch Predictors via a Three-Step Modeling Methodology	1
FlexPointer: Fast Address Translation Based on Range TLB and Tagged Pointers	1
TEA+: A Novel Temporal Graph Random Walk Engine with Hybrid Storage Architecture	1
A Sparsity-Aware Autonomous Path Planning Accelerator with HW/SW Co-Design and Multi-Level Dataflow Optimization	1
DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping	1
HAKV: A Hotness-Aware Zone Management Approach to Optimizing Performance of LSM-tree-based Key-Value Stores	1
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis	1
Symbolic Analysis for Data Plane Programs Specialization	1
Maximizing Data and Hardware Reuse for HLS with Early-Stage Symbolic Partitioning	1
Turn-based Spatiotemporal Coherence for GPUs	1
An Optimized Framework for Matrix Factorization on the New Sunway Many-core Platform	1
HuntKTm: Hybrid Scheduling and Automatic Management for Efficient Kernel Execution on Modern GPUs	1
AOBO: A Fast-Switching Online Binary Optimizer on AArch64	1