Performance

SINGLE-NODE CPU performance comparison

Comparison between LAMA, PETSc, and a plain MKL BLAS implementation of a CG solver running 1000 iterations

 

System

 

  • Both libraries make use of Intel®’s high-performance MKL BLAS implementation (illustrated in the sketch below)
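
The benchmarked kernel is plain unpreconditioned CG, so each iteration costs one sparse matrix-vector product plus a few BLAS-1 operations. The minimal C++ sketch below shows one such iteration with the vector updates mapped to MKL's CBLAS routines; it illustrates the measured workload, not LAMA's or PETSc's actual code, and the csr_spmv helper is hypothetical.

```cpp
#include <mkl_cblas.h>
#include <vector>

// y = A*x for a CSR matrix: every stored entry is touched exactly once,
// so the work per iteration is proportional to the number of non-zeros.
void csr_spmv(int n, const int* rowPtr, const int* col, const double* val,
              const double* x, double* y) {
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

// One CG iteration; the benchmark runs 1000 of these.
// On entry: r is the residual, p the search direction, rho = r.r.
void cg_step(int n, const int* rowPtr, const int* col, const double* val,
             std::vector<double>& x, std::vector<double>& r,
             std::vector<double>& p, std::vector<double>& q, double& rho) {
    csr_spmv(n, rowPtr, col, val, p.data(), q.data());        // q = A*p
    double alpha = rho / cblas_ddot(n, p.data(), 1, q.data(), 1);
    cblas_daxpy(n,  alpha, p.data(), 1, x.data(), 1);         // x += alpha*p
    cblas_daxpy(n, -alpha, q.data(), 1, r.data(), 1);         // r -= alpha*q
    double rhoNew = cblas_ddot(n, r.data(), 1, r.data(), 1);
    cblas_dscal(n, rhoNew / rho, p.data(), 1);                // p = r + beta*p
    cblas_daxpy(n, 1.0, r.data(), 1, p.data(), 1);
    rho = rhoNew;
}
```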

 

Results

 

  • Runtime is proportional to the number of non-zeros, as expected: the SpMV touches every stored entry exactly once per iteration
  • only the irregularly structured matrices inline_1 and audikw_1 show noticeably higher runtimes
  • this demonstrates that the design overhead of both LAMA and PETSc is negligible

In Summary

 

  • LAMA and PETSc perform similarly on the CPU

SINGLE-NODE GPU performance comparison

Comparison between LAMA and PETSc implementations of a CG solver running 1000 iterations

 

System

 

  • Nvidia® K40 (12 GB GDDR5)
  • CSR and ELL matrix formats

 

CSR format results

 

  • the runtime is proportional to the number of non-zeros
  • the irregular structure of inline_1 and audikw_1 leads to higher runtimes

 

ELL format results

 

  • show shorter runtimes in general
  • except for inline_1 and audikw_1, which exhibit nearly twice as many entries per row as the other matrices (see the layout sketch below)
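
The sensitivity to entries per row follows from the storage scheme itself. The self-contained C++ sketch below (illustrative only, not LAMA's data structures) stores a small matrix in ELL format: every row is padded to the width of the widest row, and the arrays are laid out column-major so that, on a GPU, consecutive threads (one per row) read consecutive memory locations.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // 4x4 bidiagonal example; the widest row has 2 entries, so the ELL
    // width is 2 and shorter rows carry an explicit zero-padding slot.
    // Matrices with many more entries per row (inline_1, audikw_1)
    // enlarge this width and therefore the work for every row.
    const int n = 4, width = 2;
    std::vector<double> val = {4, 4, 4, 4,    // slot 0 of rows 0..3
                               1, 1, 1, 0};   // slot 1; row 3 is padding
    std::vector<int>    col = {0, 1, 2, 3,
                               1, 2, 3, 3};   // padding repeats a valid index
    std::vector<double> x = {1, 2, 3, 4}, y(n);

    for (int i = 0; i < n; ++i) {             // on the GPU: one thread per row
        double sum = 0.0;
        for (int k = 0; k < width; ++k)       // column-major -> coalesced loads
            sum += val[k * n + i] * x[col[k * n + i]];
        y[i] = sum;
    }
    for (double v : y) std::printf("%g\n", v);  // 6 11 16 16
}
```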

In Summary

 

  • for the CSR format
    • LAMA and PETSc perform similarly, with a small overall advantage for LAMA
    • both libraries rely on the cuSPARSE SpMV implementation, which dominates with about 80% of the overall runtime
    • LAMA calls cuBLAS routines for the axpy and dot operations (see the sketch below), while PETSc uses implementations based on the Thrust library
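
For illustration, the fragment below shows the shape of the cuBLAS calls such a CG update maps to. cublasDdot and cublasDaxpy are real cuBLAS routines, but the buffer names are assumptions, and handle creation, device allocation, and error checking are omitted for brevity.

```cpp
#include <cublas_v2.h>

// alpha = rho / (p . q), with q = A*p already computed on the device
double cg_alpha(cublasHandle_t h, int n, const double* d_p, const double* d_q,
                double rho) {
    double pq = 0.0;
    cublasDdot(h, n, d_p, 1, d_q, 1, &pq);
    return rho / pq;
}

// x += alpha*p and r -= alpha*q, both as device-side axpy operations
void cg_update(cublasHandle_t h, int n, double alpha,
               const double* d_p, const double* d_q, double* d_x, double* d_r) {
    double minusAlpha = -alpha;
    cublasDaxpy(h, n, &alpha,      d_p, 1, d_x, 1);
    cublasDaxpy(h, n, &minusAlpha, d_q, 1, d_r, 1);
}
```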

 

  • for the ELL format
    • the runtime results are more sensitive to the actual sparse matrix structure than with CSR
    • LAMA uses a custom kernel that exploits the texture cache, which increases performance slightly in most cases

MULTI-NODE MULTI-GPU performance comparison

System

 

  • Intel® Xeon® E5-1650v2 and Nvidia® K40 (12 GB GDDR5) per node
  • InfiniBand node interconnect
  • ELL matrix format
  • 3D 27-point Poisson matrices with 10 million unknowns per node (weak scaling)
  • LAMA running in asynchronous communication mode (sketched below)
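
Asynchronous communication pays off because a distributed SpMV needs remote (halo) values only for rows along a partition boundary. The MPI sketch below shows the overlap idea; it is not LAMA's API, and the spmv stubs and the simplified two-neighbor exchange are assumptions (a 3D 27-point stencil exchanges halos with up to 26 neighbors), but the compute/communication overlap is the same.

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical local kernels: interior rows need no remote data,
// boundary rows read the received halo values.
inline void spmv_interior() { /* ... SpMV on purely local rows ... */ }
inline void spmv_boundary() { /* ... SpMV on halo-dependent rows ... */ }

void async_spmv(MPI_Comm comm, int left, int right,
                std::vector<double>& sendL, std::vector<double>& sendR,
                std::vector<double>& recvL, std::vector<double>& recvR) {
    MPI_Request req[4];
    // 1. post the halo exchange without blocking
    MPI_Irecv(recvL.data(), static_cast<int>(recvL.size()), MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recvR.data(), static_cast<int>(recvR.size()), MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(sendL.data(), static_cast<int>(sendL.size()), MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(sendR.data(), static_cast<int>(sendR.size()), MPI_DOUBLE, right, 0, comm, &req[3]);
    // 2. overlap: compute on interior rows while messages are in flight
    spmv_interior();
    // 3. wait for the halos, then finish the boundary rows
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    spmv_boundary();
}
```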

 

Results for 2 nodes

 

  • both LAMA and PETSc show a small overhead (under 5%) on the CPU and the GPU

 

Results for 4 nodes

 

  • the overhead increases for both libraries on the CPU
  • on the GPU, LAMA's overhead increases by far less than PETSc's

In Summary

 

  • better scalability for LAMA (especially on the GPU) due to asynchronous execution

LAMA White Paper

 

LAMA in the Press - Publications

 

Süß, T., Döring, N., Gad, R., Nagel, L., Brinkmann, A., Feld, D., Schricker, E., Soddemann, T.: Impact of the Scheduling Strategy in Heterogeneous Systems That Provide Co-Scheduling. In: Proceedings of the 1st COSH Workshop on Co-Scheduling of HPC Applications, 2016, DOI: 10.14459/2016md1286954

 

Förster, M., Kraus, J.: Scalable parallel AMG on cc-NUMA machines with OpenMP. In: Computer Science - Research and Development, 2011, Volume 26, Issue 3-4, pp 221-228, DOI: 10.1007/s00450-011-0159-z

 

Kraus, J., Förster, M.: Efficient AMG on Heterogeneous Systems. In: Facing the Multicore Challenge II, Lecture Notes in Computer Science, 2012, Volume 7174, pp 133-146, DOI: 10.1007/978-3-642-30397-5_12

 

Kraus, J., Förster, M., Brandes, T., Soddemann, T.: Using LAMA for efficient AMG on hybrid clusters. In: Computer Science - Research and Development, 2013, Volume 28, Issue 2-3, pp 211-220, DOI: 10.1007/s00450-012-0223-3