LATER: Linear Algebra on TEnsoRcore

Introduction

This project delivers a mixed single/half precision linear algebra package that effectively uses the TensorCore engines in NVIDIA GPUs of the Volta and Turing architectures. By performing the bulk of the arithmetic in half precision on TensorCore, it can be substantially faster than cuSOLVER or MAGMA at a moderate loss of accuracy. The goals of the project are performance, ease of use, functionality, and robustness.

https://github.com/Orgline/LATER
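
The central building block is a matrix multiplication that takes half-precision inputs but accumulates in single precision on the TensorCore units. Below is a minimal sketch of this pattern using plain cuBLAS (CUDA 10.x-era API), not LATER's own interface; the wrapper name tc_sgemm is ours, and error checking is omitted:

    // Mixed-precision GEMM on TensorCore: FP16 inputs, FP32 accumulation.
    // Sketch only; assumes column-major storage and no transposition.
    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    void tc_sgemm(cublasHandle_t handle, int m, int n, int k,
                  const __half *A, const __half *B, float *C) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH); // enable TensorCore math
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &alpha,
                     A, CUDA_R_16F, m,   // half-precision input A
                     B, CUDA_R_16F, k,   // half-precision input B
                     &beta,
                     C, CUDA_R_32F, m,   // single-precision output C
                     CUDA_R_32F,         // accumulate in FP32
                     CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }

Casting the inputs to FP16 is where the accuracy loss comes from; accumulating in FP32 is what keeps that loss moderate.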

Supported NVIDIA GPUs include TensorCore-equipped devices of the Volta and Turing architectures (e.g., V100, Titan V, RTX 20 series, T4).

Features:

  1. Implements common BLAS3/LAPACK matrix computations: TRSM, LU/Cholesky/QR factorizations, and eigenvalue/singular value decompositions, with various algorithms offering different tradeoffs between accuracy and performance.
  2. Drop-in replacement for cuSOLVER subroutines, for easy adoption (see the sketch after this list).
  3. Single header file distribution: just #include and be done!
  4. Standard static/dynamic library distribution is also supported, enabling high-level interfaces in C/C++, Fortran, Python, MATLAB, R, and Julia.
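
To illustrate the drop-in idea in features 2 and 3, the sketch below replaces a cuSOLVER LU factorization with a same-signature LATER-style call. Only cusolverDnSgetrf is a real cuSOLVER routine; the header name later.h and the routine name later_sgetrf are hypothetical placeholders, not confirmed API:

    #include <cusolverDn.h>
    // #include "later.h"  // single-header distribution (hypothetical name)

    void lu_factor(cusolverDnHandle_t handle, int m, int n,
                   float *dA, int lda, float *dWork, int *dIpiv, int *dInfo) {
        // Before: plain cuSOLVER LU factorization, FP32 throughout.
        cusolverDnSgetrf(handle, m, n, dA, lda, dWork, dIpiv, dInfo);

        // After: same call shape, TensorCore mixed precision underneath
        // (hypothetical name, shown commented out):
        // later_sgetrf(handle, m, n, dA, lda, dWork, dIpiv, dInfo);
    }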

Performance & Accuracy

Development Status and Roadmap

TO BE ADDED

References/Citing

[1] Zhang, Shaoshuai, E. Baharlouei, and Panruo Wu. "High Accuracy Matrix Computations on Neural Engines: A Study of QR Factorization and its Applications." In Proceedings of the 29th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), Stockholm, Sweden, June 23-26, 2020.
[2] Zhang, Shaoshuai, Vivek Karihaloo, and Panruo Wu. "Basic Linear Algebra Operations on TensorCore GPU." In 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), pp. 44-52. IEEE, 2020.
[3] Zhang, Shaoshuai, and Panruo Wu. "Recursion Brings Speedup to Out-of-Core TensorCore-based Linear Algebra Algorithms: A Case Study of Classic Gram-Schmidt QR Factorization." In Proceedings of the 50th International Conference on Parallel Processing, pp. 1-11. 2021.

Developers and Contributors