- 2022 Feb TFRT: A Progress Update
- 2021 Optimization based on LLVM global instruction selection
- FPL: Fast Presburger Arithmetic through Transprecision
- Automatic Horizontal Fusion for GPU Kernels
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
- https://arxiv.org/abs/2110.15352
- IREE: https://discord.com/channels/689900678990135345/760577505840463893/933243735356084245
- From Ben Vanik
Yeah, that would be awesome - we've been calling that "vertical tiling" or "vertical slicing". To an extent it's kind of what our linalg fusion does by not looking at layers and instead looking at the loop structure - only (today) it has some specific requirements about the loops it can put together. There's a few scales of this approach, though, and the higher level ones (partitioning entire slices of the model across devices) are still TBD. The lower level ones (what our fusion does and some linalg transformations for slicing things up) are easier for us to do automatically, while the higher level ones may need frontend involvement. In the jax world it's https://jax.readthedocs.io/en/latest/jax-101/06-parallelism.html pmap & co, which I don't think we support yet. If we did support it we could more easily use multiple device queues/multiple devices by just using the pmap as the partitioning mechanism - cheat our way to distribution :P Since it looks like what they did involved training to handle it, they're likely more on the jax pmap side of things (meaning that if we supported pmap - even running serially by just translating it to an scf.for loop - we could do what they did in the paper). (We could do it without pmap and stuff today, mostly just anchoring on that as a nice user-level mechanism.)
- From Ben Vanik
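The "pmap running serially as a loop" idea from the note above can be illustrated with a minimal sketch. This is not IREE or JAX code: `serial_pmap` is a hypothetical helper written only for illustration, and it assumes plain Python lists stand in for arrays whose leading axis is the mapped (per-device) axis.

```python
def serial_pmap(fn):
    """Hypothetical stand-in for a pmap-like transform, run serially.

    Each argument is a sequence whose leading axis is the mapped axis.
    Instead of sending each slice to its own device, the slices are
    visited one after another in an ordinary loop.
    """
    def mapped(*args):
        if not args:
            raise ValueError("at least one mapped argument is required")
        n = len(args[0])
        if any(len(a) != n for a in args):
            raise ValueError("all mapped arguments need the same leading size")
        results = []
        for i in range(n):  # the serial loop replacing per-device execution
            results.append(fn(*(a[i] for a in args)))
        return results
    return mapped


# Usage: apply a per-slice function across the leading axis.
scaled_add = serial_pmap(lambda x, y: 2 * x + y)
out = scaled_add([1, 2, 3], [10, 20, 30])
```

The point of the sketch is only that the user-level partitioning annotation stays the same whether the slices run on several devices or one after another.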
- (ongoing) https://www.assemblyai.com/blog/pytorch-vs-tensorflow-in-2022/
- https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
- Parallel matrix transpose algorithms on distributed memory concurrent computers
- Awesome Tensor Compilers (Paper collection)
- A High-Performance Sparse Tensor Algebra Compiler in Multi-Level IR
- Compiler Explorer with LLVM optimization pipeline viewer
- GEF - GDB Enhanced Features
- ByteDance Service Mesh data plane compilation optimization in practice
- What is Envoy
- MPI
- https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf#687
- The goal of the Message-Passing Interface simply stated is to develop a widely used
standard for writing message-passing programs. As such the interface should establish a
practical, portable, efficient, and flexible standard for message passing.
A complete list of goals follows:
- Design an application programming interface (not necessarily for compilers or a system implementation library).
- Allow efficient communication: Avoid memory-to-memory copying, allow overlap of computation and communication, and offload to communication co-processors, where available.
- Allow for implementations that can be used in a heterogeneous environment.
- Allow convenient C and Fortran bindings for the interface.
- Assume a reliable communication interface: the user need not cope with communication failures. Such failures are dealt with by the underlying communication subsystem.
- Define an interface that can be implemented on many vendor’s platforms, with no significant changes in the underlying communication and system software.
- Semantics of the interface should be language independent.
- The interface should be designed to allow for thread safety.
- An Introduction to CUDA-Aware MPI
- 2020 LLVM in HPC Workshop: Keynote: MLIR: an Agile Infrastructure for Building a Compiler Ecosystem
- TVM Conf 2020 - Day 2 - MLIR and MLIR in the TensorFlow Ecosystem
- Systolic Arrays
- https://www.sciencedirect.com/topics/computer-science/systolic-arrays
- C = AB, mkn = 4x4x4 matrix multiplication implementation:
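The original illustration for this item is not present in these notes, so the following is only a sketch of one common way to do it, not the referenced implementation. It assumes an output-stationary array: each processing element PE(i, j) keeps its own accumulator for C[i][j], rows of A stream in from the left, columns of B stream in from the top, and inputs are skewed by one cycle per row/column. The function name `systolic_matmul` and the register layout are illustrative choices.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    A is m x k, B is k x n (lists of lists). Returns (C, cycles).
    """
    m, k, n = len(A), len(A[0]), len(B[0])
    acc = [[0] * n for _ in range(m)]    # C[i][j] accumulates in PE(i, j)
    a_reg = [[0] * n for _ in range(m)]  # A value currently held by each PE
    b_reg = [[0] * n for _ in range(m)]  # B value currently held by each PE
    cycles = m + n + k - 2               # last PE finishes at this cycle count

    for t in range(cycles):
        # A values move one PE to the right; row i is fed with a delay of i.
        for i in range(m):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            kk = t - i
            a_reg[i][0] = A[i][kk] if 0 <= kk < k else 0
        # B values move one PE down; column j is fed with a delay of j.
        for j in range(n):
            for i in range(m - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            kk = t - j
            b_reg[0][j] = B[kk][j] if 0 <= kk < k else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(m):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


# Usage: the 4x4x4 case from the note above.
A = [[i * 4 + j for j in range(4)] for i in range(4)]
B = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
C, cycles = systolic_matmul(A, B)  # cycles is 4 + 4 + 4 - 2 = 10
```

Because of the skew, PE(i, j) sees A[i][t-i-j] and B[t-i-j][j] together at cycle t, so the matching k indices always meet in the same element.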
- Linux performance observability tools