| Thursday - 30th October 2025 | ||
|---|---|---|
| Morning Session: 09:00 - 13:00, Venue: CSA 104 | ||
| Session Chair: Yogesh Simmhan (IISc) | ||
| Time Slot | Title of the Talk | Speaker (Affiliation) |
| 08:00 - 09:00 | Breakfast | |
| 09:00 - 09:05 | Welcome Remarks | Vinod Ganapathy |
| 09:10 - 10:10 |
Keynote
In Computer Architecture, We Don't Change the Questions, We Change the Answers
When I was a new professor in the late 1980s, my senior colleague Jim Goodman told me, "On the computer architecture PhD qualifying exam, we don't change the questions, we only change the answers". More generally, I now augment this to say, "In computer architecture, we don't change the questions, application and technology innovations change the answers, and it's our job to recognize those changes." The eternal questions this talk will sample concern how best to handle the following interacting factors: compute, memory, storage, interconnect/networking, security, power, cooling, and one more. The talk will not provide the answers but will leave that as an exercise for the audience.
|
Mark D. Hill (University of Wisconsin-Madison) |
| 10:10 - 10:45 |
TREEBEARD: A Retargetable Compiler for Decision Tree Inference
Decision tree-based models are among the most popular models for tabular data. Decision tree ensemble inference is usually performed with libraries. While these libraries apply a fixed set of optimizations, the resulting solutions lack portability and fail to fully exploit hardware- or model-specific information.
In this talk, we present the design of a schedule-guided, retargetable compiler for decision tree-based models, called Treebeard, which has two core components. The first is a scheduling language that encapsulates the optimization space, together with techniques to efficiently explore this space. The second is an optimizing retargetable compiler that can generate code for any specified schedule by lowering the inference computation to optimized CPU or GPU code through multiple intermediate abstractions. By applying model-specific optimizations at the higher levels, tree-walk optimizations at the middle level, and machine-specific optimizations lower down, Treebeard can specialize inference code for each model on each supported target. Treebeard combines several novel optimizations at various abstraction levels and uses different data layouts, loop structures, and caching strategies to mitigate architectural bottlenecks and achieve portable performance across a range of targets. Treebeard is implemented using the MLIR compiler infrastructure and can generate code for single- and multi-core CPUs as well as GPUs (both NVIDIA and AMD MI GPUs). Treebeard demonstrates significant performance gains over state-of-the-art methods, both on CPUs and on GPUs.
|
R. Govindarajan (IISc) |
| 10:45 - 11:00 | Coffee Break | |
| Session Chair: Kanishka Lahiri (AMD) | ||
| 11:00 - 11:35 |
Let's Make ML Affordable
Machine learning (ML) training and inference services are some of the most compute-hungry workloads we have ever encountered. As ML parameter counts explode into the trillions, these workloads have an insatiable appetite for GPUs and high-bandwidth memory (HBM), both of which are expensive resources. In this talk I present our group's work on reducing our reliance on GPUs and HBM in two ML domains: recommendation models and LLMs. Recommendation models have trillions of embedding table entries that are usually distributed across GPU clusters. In our cDLRM research we made the observation that the training of ML models is highly predictable, as we can look ahead into the future to extract the training batches. This predictability can be exploited to move the embedding tables to CPU DRAM (think very cheap memory!) and transfer only a tiny, but relevant, portion of the embedding tables to the GPU HBM just in time. In our follow-up work, titled LEAF, we explored an orthogonal design space of embedding table compression. LEAF is a multi-level hashing framework that compresses the large embedding tables based on the real-time access frequency distribution. In particular, LEAF leverages a streaming algorithm to estimate access distributions on the fly, without relying on model gradients or requiring a priori knowledge of the access distribution, and achieves two orders of magnitude of compression with limited loss in model accuracy.
In the second part of the talk I will present resource-efficient solutions for LLMs. In our KVPR research we argued for offloading KV caches to CPU DRAM. But CPU-GPU PCIe bandwidth can be a serious impediment to performance. We present a novel cache+recompute approach in which part of the KV cache data is transferred from the CPU to the GPU, while the rest of the KV values are recomputed on the GPU to overlap the communication delay with computation. Finally, in our DEL research we tackle the problem of making speculative decoding affordable. Speculative decoding is a key technique for enhancing token generation speed, but identifying the right speculative decoding architecture is challenging, as some tokens demand large speculation resources while others are much easier to predict. DEL provides a dynamic approach that selects an optimal speculation resource for each token.
|
Murali Annavaram (University of Southern California) |
| 11:35 - 12:10 |
Sparse Attention Techniques for Long-Context Inference
Long-context workloads have become increasingly common in LLM inference, driven by applications such as RAG, multimodal inference, and the recent progress in chain-of-thought reasoning. The self-attention operation, which scales quadratically with context length, becomes a dominant cost for long-context inference. This talk will discuss some case studies in reducing the cost of the self-attention operation using training-free sparse attention mechanisms.
|
Saurabh Goyal (Microsoft Research India) |
| 12:10 - 12:45 |
Extracting Useful Parallelism from User-Perceived Ideal Parallelism
Multicore systems have taken the computing world by storm, with an ever-increasing amount of parallelism in the hardware and a continuously changing landscape of parallel programming. Programmers are expected to think in parallel and express the program logic (ideal parallelism) using parallel languages of their choice. However, a parallel program is not guaranteed to be efficient just because it is parallel. This problem becomes challenging because many of the traditional assumptions about serial programs do not hold in the context of parallel programs. In this talk, we will discuss some of our experiences in bridging the gap between ideal and useful parallelism.
The talk will first explain the performance challenges in parallel programs. We will follow it up with our experience in identifying different patterns in parallel programs that can be exploited to realize highly performant code; these can be seen as both manual and compiler optimizations. We will focus on the safety, profitability, and opportunities of such optimizations in the context of task-parallel programs. We will also explain the insufficiency of traditional analyses for safely transforming parallel programs and discuss how may-happen-in-parallel analysis plays a vital role in the sound and precise analysis of parallel programs. In addition to covering traditional HPC kernels, we will also pay particular attention to irregular task-parallel programs, which are becoming critical workloads. We will explain the challenges with irregular task-parallel programs and then discuss how we can achieve high performance on them.
|
V. Krishna Nandivada (IIT Madras) |
| 12:45 - 14:00 | Lunch Break | |
| Afternoon Session: 14:00 - 17:00, Venue: CSA 104 | ||
| Session Chair: Murali Annavaram (University of Southern California) | ||
| 14:00 - 14:35 |
CXL - From Research to Reality
Compute Express Link (CXL) is an open industry-standard interconnect offering high-bandwidth, low-latency connectivity between host processors and devices such as accelerators, memory buffers, and smart I/O devices. It is designed to address growing high-performance computational workloads by supporting heterogeneous processing and memory systems through cache coherency and memory semantics. This talk will cover the following areas:
- Motivation for CXL and its use cases
- CXL research areas in academia and industry
- CXL hardware (vendors/products) and the industry landscape
- Software/solution impacts and opportunities with CXL
|
Mohan Parthasarathy (Hewlett Packard Enterprise) |
| 14:35 - 15:10 |
The Micro Things That Matter: Microarchitecture for Macro Servers
Many-core servers are the compute engines that drive large-scale datacenters 24/7, 365 days a year. These servers consist of 10s to 100s of processor cores running applications with huge code and data footprints, and the performance of the memory hierarchy plays an important role in their overall throughput. The talk will be about our journey in designing micro things to improve the cache hierarchy for macro servers, keeping huge code/data footprints and limited DRAM bandwidth in mind. My awesome mentees (Sweta, Vedant, Prerna, and Hrishikesh) and I embarked on this journey together.
|
Biswabandan Panda (IIT Bombay) |
| 15:10 - 15:45 |
Evolving LLM Systems: Inference Opportunities and AMD MI Roadmap
Large language models (LLMs) are rapidly increasing in scale and capability, placing growing demands on hardware, software, and deployment stacks. In this talk, we present key research challenges in engineering LLM inference and motivate systems research for efficient, scalable deployment. Building on this foundation, we describe IISc-AMD collaborative work on optimizations that accelerate LLM inference on CPUs and GPUs. Finally, we present AMD's roadmap for MI-class accelerators for inference and training, and show how upcoming hardware and software capabilities will address critical systems challenges while enabling the research community through an open ecosystem.
|
Arun Ramachandran (AMD) |
| 16:00 - 18:00 | Poster Presentation | |
| 18:00 - Onwards | High Tea | |
| Friday - 31st October 2025 | ||
| Morning Session: 09:00 - 13:00, Venue: CSA 104 | ||
| Session Chair: Ranjita Bhagwan (Google) | ||
| 08:00 - 09:00 | Breakfast | |
| 09:10 - 10:10 |
Keynote
Cloud Native and AI in a Multicloud World: Simplifying Innovation and Resilience
Explore practical and strategic considerations for enabling developers, platform engineers, and IT teams to move faster and more efficiently when implementing AI, bringing new apps online, and adopting hybrid and multicloud environments. Learn how cloud-native architectures, data locality, and infrastructure abstraction can simplify deployment, enhance resilience, and optimize cost, helping leaders align technology with business outcomes.
|
Manosiz Bhattacharyya (Nutanix) |
| 10:10 - 10:45 |
Building Effective Compilers for AI Programming Frameworks
This talk will focus on the role of compilers in the era of AI programming frameworks (e.g., PyTorch, JAX) and AI hardware accelerators. AI models are evolving and continue to rely heavily on high-performance computing. Specialized hardware for AI is often hard to program to exploit peak performance, and AI models are also evolving in ways that are coupled with hardware strengths. This talk will describe how to build effective compiler systems using the MLIR infrastructure in a layered way to improve hardware usability and deliver high performance as automatically as possible.
|
Uday Reddy Bondhugula (IISc) |
| 10:45 - 11:00 | Coffee Break | |
| Session Chair: Ashish Panwar (Microsoft) | ||
| 11:00 - 11:35 |
Robust Query Processing: Where Geometry Beats ML!
Over the past half-century, the design and implementation of declarative query processing techniques in relational database systems have been a foundational topic. Despite this sustained study, the solutions have largely remained a "black art" due to the complexities of database query processing. Recent work explores two directions: learning-based query performance prediction and geometric search strategies.
|
Jayant Haritsa (IISc) |
| 11:35 - 12:10 |
Challenges in Observability of the Google Network
Google owns and operates one of the world's largest networks, supporting billions of users. Today, this network not only supports the users of Google's various applications such as Gemini, Search, YouTube, Gmail, and Maps; it also forms a critical part of the infrastructure supporting enterprise customers of the Google Cloud Platform. Given the scale and complexity of such varied applications, observing the performance and reliability of the network and its various components is of prime importance. In this talk, I will present some of the challenges that we envision for network observability in the coming years, and how we plan to address them.
|
Ranjita Bhagwan (Google) |
| 12:10 - 12:45 |
Driving Innovation in Academia and Industry
You might think that the key to driving innovation in academia and industry is solving problems and refining solutions. While these are important, the most critical and valued talents are picking a problem and getting started toward understanding it. I will share some tips on these talents from forty years of experience. Even with generative AI, I predict that humans will drive these steps. Reference: Increasing Your Research Impact, SIGARCH Blog, 08/2019 [https://www.sigarch.org/increasing-your-research-impact/].
|
Mark D. Hill (University of Wisconsin-Madison) |
| 12:45 - 14:00 | Lunch Break | |
| Afternoon Session: 14:00 - 17:00, Venue: CSA 104 | ||
| Session Chair: Jayashree Mohan (Microsoft) | ||
| 14:00 - 14:35 |
Enabling Angstrom-scale Manufacturing with AI & HPC
Semiconductor manufacturing is approaching the Angstrom era, with innovations such as gate-all-around transistors and chip-to-chip integration technologies enabling the continuation of Moore's law. This talk will highlight some of the challenges that these advanced technologies pose to semiconductor manufacturing, and will cover how modern AI & HPC technologies are being leveraged to address these challenges and enable high-volume manufacturing. We will also give a peek into some of the solutions that KLA is pioneering in this space.
|
Pradeep Ramachandran (KLA) |
| 14:35 - 15:10 |
Imagining a Next-Generation Superoptimizer
A program superoptimizer uses a search procedure, e.g., a probabilistic backtracking algorithm, to find an optimized (and sometimes optimal) implementation of a program specification on a given machine architecture. Each candidate implementation proposed by the search procedure is checked for equivalence with the input program specification, eventually identifying a sound optimization that can subsequently be stored as a peephole optimization rule. This is in contrast to a traditional compiler optimizer, which is typically organized as a sequence of algorithmic passes that successively transform the program towards an optimized implementation.
I will share my thoughts on why the traditional model of compiler development may be unsustainable, and why, considering recent advances in AI, a superoptimizer is likely to become a mainstream method of optimizing programs in the foreseeable future. I will then present our formal program equivalence checker, which is intended to enable such a superoptimizer.
|
Sorav Bansal (IIT Delhi) |
| 15:10 - 15:25 | Coffee Break | |
| Session Chair: R. Govindarajan (IISc) | ||
| 15:25 - 16:00 |
Advancing General Purpose CPU Computing in the AI Era
In the age of artificial intelligence, the ongoing evolution and optimization of general-purpose, high-performance, out-of-order cores remains crucial for modern computing. As advancements in microprocessor technology become increasingly challenging, substantial research efforts are now directed towards enhancing the performance of these cores through innovations in micro-architecture. This talk will delve into the primary bottlenecks encountered in contemporary server cores, the micro-architectural innovations necessary to address these challenges, and the role that AI can play in improving overall core performance and efficiency. Furthermore, it will explore how CPUs with advanced AI extensions can significantly benefit critical AI inference applications.
|
Jayesh Gaur (IBM) |
| 16:00 - 16:35 |
Computing Less by Understanding More
Large foundation models have significantly reshaped how machine learning and computer vision problems are approached. However, these models are often treated as black boxes, with interaction limited to prompting or fine-tuning via loss functions. This talk advocates for a deeper examination of their internal behavior, e.g., understanding the latent spaces and attention mechanisms to uncover signals that can be systematically leveraged to improve efficiency, accuracy, and interpretability for different tasks. This talk will present a set of methods that utilize internal computation patterns to achieve systems-level optimizations. For instance, it will show how attention logits in transformers reveal stable contextual relationships that can inform both cache management and fine-grained attribution. In retrieval-augmented generation, attention states corresponding to frequently retrieved chunks can be reused across queries, provided their validity is carefully checked and maintained. For generative models such as diffusion pipelines, the reuse of intermediate noise states or prompt-derived visual concepts offers a path to accelerate inference without compromising output quality. Finally, in continuous-time generative models, the model's own recent outputs can be used to speculatively predict future steps, allowing computation to be skipped when these predictions are accurate. These techniques do not require retraining and are compatible with existing model architectures. By interpreting the computations that foundation models already perform, these works highlight how internal structure can be exposed and repurposed to build systems that are faster, more efficient, and more explainable.
|
Subrata Mitra (Adobe Research) |
| 16:35 - 16:40 | Vote of Thanks & Closing Remarks | |
| 16:40 - Onwards | High Tea | |