LAMMPS Benchmarks

This page lists LAMMPS performance on several benchmark problems, run on various machines, both in serial and parallel and on GPUs. Note that input and sample output files for many of these benchmark tests are provided in the bench directory of the LAMMPS distribution. See the bench/README file for details.

Note that most of these numbers are quite old and thus do not reflect numbers that can be achieved with current hardware or what kind of improvements have been added to the LAMMPS code since then. They do, however, still represent the relative strong and weak scaling behavior in most cases.

CPU benchmarks

One processor = relative CPU cost of the following 5 benchmarks

Lennard Jones = atomic fluid with Lennard-Jones potential
Polymer = bead-spring polymer melt of 100-mer chains
Metal = metallic solid
Granular = granular chute flow
Protein = rhodopsin protein in solvated lipid bilayer

GPU (Kepler) and Intel Xeon Phi benchmarks using all accelerator packages

Accelerator packages: GPU, KOKKOS, OPT, USER-CUDA, USER-INTEL, USER-OMP
Oct 2016, CPU vs GPU vs KNL performance
Sept 2014, GPU cluster = Dual 8-core Sandy Bridge Xeons with 2 Kepler GPUs

GPU (Fermi) benchmarks using the GPU and USER-CUDA packages

Aug 2012, Desktop system = Dual hex-core Xeons with 2 Tesla GPUs
Aug 2012, Supercomputer = Titan development machine at ORNL

Interatomic potential benchmarks

Jun 2012, Potentials = relative CPU cost of different interatomic potentials

Billion atom benchmarks

Billion-atom = Lennard-Jones systems

Thanks to the following individuals for running the various benchmarks:

Sikandar Mashayak (UIUC), Kokkos and other accelerator results on GPU and Phi clusters
Mike Brown (ORNL), Titan results
Carl Ponder (NVIDIA), Keeneland results
Christian Trott (U of Technology Ilmenau), Lincoln results
Paul Crozier (Sandia), IBM BG/L results
Fiona Reid (Edinburgh Parallel Computing Centre), IBM p690+ results
Courtenay Vaughan (Sandia), Cray XT3 results

Machine specifications

These are the parallel machines for which benchmark data is given for the CPU benchmarks below. See the Kokkos, Intel, and GPU sections for machine specifications for those GPU and Phi platforms.

The "Processors" column is the most number of processors on that machine that LAMMPS was run on. Message passing bandwidth and latency is in units of Mb/sec and microsecs at the MPI level, i.e. what a program like LAMMPS sees. More information on machine characteristics, including their "birth" year, is given at the bottom of the page.

Vendor/Machine	Processors	Site	CPU	Interconnect	Bandwidth	Latency
Dell T7500 dual hex-core desktop	12	SNL	3.47 GHz Xeon	on-chip	??	??
Xeon/Myrinet cluster	512	SNL	3.4 GHz dual Xeons (64-bit)	Myrinet	230	9
IBM p690+	512	Daresbury	1.7 GHz Power4+	custom	1450	6
IBM BG/L	65536	LLNL	700 MHz PowerPC 440	custom	150	3
Cray XT3	10000	SNL	2.0 GHz Opteron	Cstar	1100	7
Cray XT5	1920	SNL	2.4 GHz Opteron	Cstar	1100	7

One-processor timings are also listed for some older machines whose characteristics are also given below.

Name	Machine	Processors	Site	CPU	Interconnect	Bandwidth	Latency
Laptop	Mac PowerBook	1	SNL	1 GHz G4 PowerPC	N/A	N/A	N/A
ASCI Red	Intel	1500	SNL	333 MHz Pentium III	custom	310	18
Ross	custom Linux cluster	64	SNL	500 MHz DEC Alpha	Myrinet	100	65
Liberty	HP Linux cluster	64	SNL	3.0 GHz dual Xeons (32-bit)	Myrinet	230	9
Cheetah	IBM p690	64	ORNL	1.3 GHz Power4	custom	1490	7

Billion-atom LJ timings are also given for GPU clusters, with more characteristics given below.

Name	Machine	GPUs	Site	GPU	Interconnect	Bandwidth	Latency
Keeneland	Intel/NVIDIA cluster	360	ORNL	Tesla M2090	Qlogic QDR	???	???
Lincoln	Intel/NVIDIA cluster	384	NCSA	Tesla C1060	Infiniband	1500	12

For each of the 5 benchmarks, fixed- and scaled-size timings are shown in tables and in comparative plots. Fixed-size means that the same problem with 32,000 atoms was run on varying numbers of processors. Scaled-size means that when run on P processors, the number of atoms in the simulation was P times larger than the one-processor run. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms.

All listed CPU times are in seconds for 100 timesteps. Parallel efficiencies refer to the ratio of ideal to actual run time. For example, if perfect speed-up would have given a run-time of 10 seconds, and the actual run time was 12 seconds, then the efficiency is 10/12 or 83.3%. In most cases parallel runs were made on production machines while other jobs were running, which can sometimes degrade performance.

The files needed to run these benchmarks are part of the LAMMPS distribution. If your platform is sufficiently different from the machines listed, you can send your timing results and machine info and we'll add them to this page. Note that the CPU time (in seconds) for a run is what appears in the "Loop time" line of the output log file, e.g.

Loop time of 3.89418 on 8 procs for 100 steps with 32000 atoms

These benchmarks are meant to span a range of simulation styles and computational expense for interaction forces. Since LAMMPS run time scales roughly linearly in the number of atoms simulated, you can use the timing and parallel efficiency data to estimate the CPU cost for problems you want to run on a given number of processors. As the data below illustrates, fixed-size problems generally have parallel efficiencies of 50% or better so long as the atoms/processor is a few hundred or more. Scaled-size problems generally have parallel efficiencies of 80% or more across a wide range of processor counts.

One processor comparisons

This is a summary of single-processor LAMMPS performance in CPU secs per atom per timestep for the 5 benchmark problems which follow. This is on a Dell Precision T7500 desktop Red Hat linux box with dual hex-core 3.47 GHz Intel Xeon processors, using the Intel 11.1 icc compiler. The ratios indicate that if the atomic LJ system has a normalized cost of 1.0, the bead-spring chains and granular systems run 2x and 4x faster, while the EAM metal and solvated protein models run 2.6x and 16x slower respectively. These differences are primarily due to the expense of computing a particular pairwise force field for a given number of neighbors per atom.

Problem:	LJ	Chain	EAM	Chute	Rhodopsin
CPU/atom/step:	7.02E-7	3.15E-7	1.85E-6	1.71E-7	1.14E-5
Ratio to LJ:	1.0	0.45	2.64	0.24	16.2

Lennard-Jones liquid benchmark

Input script for this problem.

Atomic fluid:

32,000 atoms for 100 timesteps
reduced density = 0.8442 (liquid)
force cutoff = 2.5 sigma
neighbor skin = 0.3 sigma
neighbors/atom = 55 (within force cutoff)
NVE time integration

Performance data:

One-processor performance

These plots show fixed-size parallel efficiency for the same 32K atom problem run on different numbers of processors and scaled-size efficiency for runs with 32K atoms/proc. Thus a scaled-size 64-processor run is for 2,048,000 atoms; a 32K proc run is for ~1 billion atoms. Each curve is normalized to be 100% on one processor for the respective machine; one-processor timings are shown in parenthesis. Click on the plot for a larger version.

Polymer chain melt benchmark

Input script for this problem.

Bead-spring polymer melt with 100-mer chains and FENE bonds:

32,000 atoms for 100 timesteps
reduced density 0.8442 (liquid)
force cutoff of 2^(1/6) sigma
neighbor skin = 0.4 sigma
neighbors/atom = 5 (within force cutoff)
NVE time integration

Performance data:

One-processor performance

EAM metallic solid benchmark

Input script for this problem.

Cu metallic solid with embedded atom method (EAM) potential:

32,000 atoms for 100 timesteps
force cutoff of 4.95 Angstroms
neighbor skin = 1.0 Angstrom
neighbors/atom = 45 (within force cutoff)
NVE time integration

Performance data:

One-processor performance

Granular chute flow benchmark

Input script for this problem.

Chute flow of packed granular particles with frictional history potential:

32,000 atoms for 100 timesteps
force cutoff of 1.0 sigma
neighbor skin of 0.1 sigma
neighbors/atom = 7
NVE time integration

Performance data:

One-processor performance

Rhodopsin protein benchmark

Input script for this problem.

All-atom rhodopsin protein in solvated lipid bilayer with CHARMM force field, long-range Coulombics via PPPM (particle-particle particle mesh), SHAKE constraints. This model contains counter-ions and a reduced amount of water to make a 32K atom system:

32,000 atoms for 100 timesteps
LJ force cutoff of 10.0 Angstroms
neighbor skin of 1.0 sigma
neighbors/atom = 440 (within force cutoff)
NPT time integration

Performance data:

One-processor performance

Accelerator benchmarks for CPU, GPU, KNL - Oct 2016

Benchmark of the Lennard-Jones potential: /src/USER-INTEL/TEST/in.intel.lj, run using "-v m 0.1". On Sandy Bridge CPU and KNL, "atom_modify sort 100 2.8" was added to the input script.

Legend:

16 cores, Sandy Bridge CPU: no accelerator package, 1MPI task per core = 16 MPI tasks total

2 K80 GPUs, Kokkos: Kokkos package used 1 MPI task per internal K80 GPU = 4 MPI tasks total, running on 4 Sandy Bridge CPU cores, full neighbor list, newton off, neighbor list binsize = 2.8

2 K80 GPUs, GPU: GPU package used 4 MPI tasks per internal K80 GPU = 16 MPI tasks total running on 16 Sandy Bridge CPU cores

1 KNL, Kokkos: Kokkos package used 64 MPI tasks x 4 hyperthreads x 4 OpenMP threads, full neighbor list, newton off

1 KNL, USER-OMP: USER-OMP package used 64 MPI tasks x 4 hyperthreads x 4 OpenMP threads

1 KNL, USER-INTEL: USER-INTEL package used 64 MPI tasks x 2 hyperthreads x 2 OpenMP threads

Benchmark of the Stillinger-Weber potential: /src/USER-INTEL/TEST/in.intel.sw, run using "-v m 0.1". For KNL, "atom_modify sort 100 4.77" was added to the input script.

Legend:

16 cores, Sandy Bridge CPU: no accelerator package, 1MPI task per core = 16 MPI tasks total

2 K80 GPUs, Kokkos: Kokkos package used 1 MPI task per internal K80 GPU = 4 MPI tasks total running on 4 Sandy Bridge CPU cores, half neighbor list, newton on

2 K80 GPUs, GPU: GPU package used 4 MPI tasks per internal K80 GPU = 16 MPI tasks total, running on 16 Sandy Bridge CPU cores

1 KNL, Kokkos: Kokkos package used 64 MPI tasks x 4 hyperthreads x 4 OpenMP threads, half neighbor list, newton on

1 KNL, USER-OMP: USER-OMP package used 64 MPI tasks x 4 hyperthreads x 4 OpenMP threads

1 KNL, USER-INTEL: USER-INTEL package used 64 MPI tasks x 4 hyperthreads x 4 OpenMP threads

Benchmark of the Tersoff potential: /src/USER-INTEL/TEST/in.intel.tersoff, was run using "-v m 0.2". For KNL, "atom_modify sort 100 4.2" was added to the input script.

Legend:

16 cores, Sandy Bridge CPU: no accelerator package, 1MPI task per core = 16 MPI tasks total

2 K80 GPUs, Kokkos: Kokkos package used 1 MPI task per internal K80 GPU = 4 MPI tasks total, running on 4 Sandy Bridge CPU cores, half neighbor list, newton on

2 K80 GPUs, GPU: GPU package used 4 MPI tasks per internal K80 GPU = 16 MPI tasks total, running on 16 Sandy Bridge CPU cores

1 KNL, Kokkos: Kokkos package used 64 MPI tasks x 4 hyperthreads x 4 OpenMP threads, half neighbor list, newton on

1 KNL, USER-OMP: USER-OMP package used 64 MPI tasks x 4 hyperthreads x 4 OpenMP threads

1 KNL, USER-INTEL: USER-INTEL package used 64 MPI tasks x 4 hyperthreads x 4 OpenMP threads

Accelerator benchmarks on a GPU cluster with Kepler GPUs - Sept 2014

This section benchmarks all the accelerator packages available in LAMMPS on the same machine: GPU, KOKKOS, OPT, USER-CUDA, USER-INTEL, USER-OMP. Unaccelerated runs (using the standard LAMMPS styles) are also included for reference.

This section shows performance results for a GPU cluster at Sandia National Labs called "shannon". It has 32 nodes, each with two 8-core Sandy Bridge Xeon CPUs (E5-2670, 2.6GHz, HT deactivated), for a total of 512 cores. Twenty-four of the nodes have two NVIDIA Kepler GPUs (K20x, 2688 732 MHz cores). LAMMPS was compiled with the Intel icc compiler, using module openmpi/1.8.1/intel/13.1.SP1.106/cuda/6.0.37.

The benchmark problems themselves are described in more detail above in the CPU section. The input scripts and instructions for running these test cases are included in the bench/KEPLER directory of the LAMMPS distribution.

Lennard-Jones liquid

The first set of 5 plots are for running problmes of various sizes (2K to 8192K atoms) on a single node (16 cores, 1 or 2 GPUs). Note that the y-axis scale is not the same in the various plots. Click on the plots for a larger version.

The corresponding raw timing data for the 5 plots is in these 5 files. The 1st column is the x-axis, the 2nd is the time for 100 steps; the 3rd is the y-axis.

The first plot shows the performance of all 6 accelerator packages (two variants for KOKKOS). The GPU, USER-CUDA, and KOKKOS/CUDA curves run primarily on the 2 GPUs. They perform better the more atoms that are modeled. All the other curves run on the 16 CPU cores. All of these performed best using essentially MPI-only parallelization (single thread per MPI task if multi-threading). The CPU curve is for running without enabling any of the accelerator packages. All the runs were in double precision.

The second plot shows the same data as the first (on a different y-axis scale), but only for the runs on the CPU cores. The Intel package has an option to perform pairwise calculations in double, mixed, or single precision, so 3 curves are shown for it.

The third plot also shows the same data as the first, but only for the runs on the GPUs. For each of the 3 packages, runs were made on one or two GPUs. All the runs were in double precision.

The fourth plot shows the performance effect of running the 3 GPU-based packages with varying precision (double, mixed, single). The KOKKOS package only currently allows for double precision calculations.

The fifth plot shows the effect on KOKKOS GPU performance of transferring data back and forth between the CPUs and GPUs every timestep. The benchmark problem in the previous 4 plots could run continuously on the GPU (e.g. between occasional thermodynamic output in a production run). But a LAMMPS input script may require more frequent communication back to the CPU. E.g. if a diagnostic is invoked that runs on the CPU or some form of output is triggered every few timesteps. Or if a fix or compute style is used that is not yet KOKKOS-enabled. The benchmark run for this plot required data to move back-and-forth every timestep (worst case). The curve for the GPU package shows this performance hit may be possible to partially overcome, since the GPU package also moves data back and forth every step to perform time integration on the CPU.

The next 2 plots are for parallel runs on 1 to 20 nodes of the cluster. The strong scaling plot is for runs of a 2048K atom problem. The weak scaling plot is for runs with 512K atoms per node. Note that performance (y-axis) is normalized by the number of nodes. Thus in both plots, horizontal lines would be 100% parallel efficiency. Note that for the strong scaling curves on GPUs, there is an pseudo-efficieny loss when running on more nodes due to shrinking the number of atoms per node and thus moving to the left on the single-node GPU performance curves in the plots above.

The corresponding raw timing data is in these 2 files. The 1st column is the x-axis, the 2nd is the time for 100 steps; the 3rd is the y-axis.

GPU and USER-CUDA package benchmarks on Desktop system with Fermi GPUs

This section shows performance results for a desktop system with dual hex-core Xeon processors and 2 NVIDIA Tesla/Fermi GPUs. More system details are given below, for the "Desktop" entry.

The benchmark problems themselves are described in more detail above in the CPU section. The input scripts and instructions for running these GPU test cases are included in the bench/FERMI directory of the LAMMPS distribution.

Lennard-Jones liquid
EAM metallic solid
Rhodopsin protein

The performance is plotted as a function of system size (number of atoms), where the size of the benchmark problems was varied. The Y-axis is atom-timesteps per second. Thus a y-axis value of 10 million for a 1M atom system means it ran at a rate of 10 timesteps/second.

Results are shown for running in CPU-only mode, and on 1 or 2 GPUs, using the GPU package.

The CPU-only results are for running on a single core and on all 12 cores, always in double precision.

For the GPU package, the number of CPU cores/node used was whatever gave the fastest performance for a particular problem size. For small problems this is typically less than all 12; for large problems it is typically all 12. The precision refers to the portion of the calculation performed on the GPU (pairwise interactions). Results are shown for single precision, double precision, and mixed precision which means pairwise interactions calculated in single precision, with the aggregate per-atom force accumulated in double precision.

For the USER-CUDA package, the number of CPU cores used is always equal to the number of GPUs used, i.e. 1 or 2 for this system. The three precisions have the same meaning as for the GPU package, except that other portions of the calculation are also performed on the GPU, e.g. time integration.

Click on the plots for a larger version.

GPU package benchmarks on Titan development supercomputer with Fermi GPUs

This section shows performance results for the Titan development system, each node of which has a 16-core AMD CPU and a single NVIDIA Tesla/Fermi GPU. More system details are given below, for the "Titan Development" entry. Note that the eventual Titan machine will have Tesla/Kepler GPUs, and more of them.

Lennard-Jones liquid
EAM metallic solid
Rhodopsin protein

For the rhodopsin benchmark, which computes long-range Coulombics via the PPPM option of the kspace_style command, these benchmarks were run with the run_style verlet/split command, to split the real-space versus K-space computations across the CPUs. This makes little difference on small node counts, but on large node counts, it enables better scaling, since the FFTs computed by PPPM are performed on fewer processors. This was done for both the strong- and weak-scaling results below. The ratio of real-to-kspace processors was chosen to give the best performance, and was 7:1 on this 16 core/node machine.

For the strong-scaling plots, a fixed-size problem of 256,000 atoms was run for all node counts. The node count varied from 1 to 128, or 16 to 2048 cores. The Y-axis is atom-timesteps per second. Thus a value of 10 for the 256,000 atom system means it ran at a rate of roughly 40 timesteps/second.

Strong-scaling results are shown for running in CPU-only mode, and on the GPU, using the GPU package. The CPU-only results are double-precision, the GPU results are for mixed precision which means pairwise interactions calculated in single precision, with the aggregate per-atom force accumulated in double precision. The dotted line indicates the slope for perfect scalability.

For the GPU package, the number of CPU cores/node used was whatever gave the fastest performance for a particular problem size. For the strong-scaling results with the large per-node atom count (256000), this was typically nearly all 16 cores.

Click on the plots for a larger version.

For the weak-scaling plots, a scaled-size problem of 32,000 atoms/node was run for all node counts. The node count varied from 1 to 8192, or 16 to 131072 cores; only 960 nodes on the current development machine have GPUs. Thus the largest system on 8192 nodes has ~262 million atoms.

The Y-axis is atom-timesteps per second. Thus a value of 100 for a 1M atom system (32 nodes) means it ran at a rate of 100 timesteps/second.

Weak-scaling results are shown for running in CPU-only mode, in CPU-only mode with the numa option invoked for the processors command, and on the GPU, using the GPU package. The CPU-only and NUMA results are double-precision, the GPU results are for mixed precision which means pairwise interactions calculated in single precision, with the aggregate per-atom force accumulated in double precision. The dotted line indicates the slope for perfect scalability.

The NUMA results alter the layout of cores to the logical 3d grid of processors that overlays the simulation domain. The processors numa command does this so that cores within a node and within a NUMA region (inside the node) are close together in the topology of the 3d grid, to reduce off-node communication costs. This can give a speed-up of 10-15% on large node counts, as shown in the plots.

For the GPU package, the number of CPU cores/node used was whatever gave the fastest performance for a particular problem size. For the weak-scaling results with the smaller per-node atom count (32000), this was typically 4-8 cores out of 16.

Click on the plots for a larger version.

Interatomic potential comparisons

The following table summarizes the CPU cost of various potentials, as implemented in LAMMPS, each for a system commonly modeled by that potential. The desktop machine these were run on is described below. The last 3 entries are for VASP timings, to give a comparison with DFT calculations. The details for the VASP runs are described below.

The listed timing is CPU seconds per timestep per atom for a one processor (core) run. Note that this is per timestep, as is the ratio to LJ; the timestep size is listed in the table. In each case a short 100-step run of a roughly 32000 atom system was performed. The speed-up is for a 4-processor run of the same 32000-atom system. Speed-ups greater than 4x are due to cache effects.

To first order, the CPU and memory cost for simulations with all these potentials scales linearly with the number of atoms N, and inversely with the number of processors P when running in parallel. This assumes the density doesn't change so that the neighbors per atom stays constant as you change N. This holds for N/P ratios larger than some threshhold, say 1000 atoms per processor. Thus you can use this data to estimate the run-time of different size problems on varying numbers of processors.

Potential	System	# Atoms	Timestep	Neighs/atom	Memory	CPU	LJ Ratio	P=4 Speed-up	Input script	Tarball
Granular	chute flow	32000	0.0001 tau	7.2	33 Mb	2.08e-7	0.26x	4.28x	in.granular	bench_granular.tar.gz
FENE bead/spring	polymer melt	32000	0.012 tau	9.7	8.4 Mb	2.86e-7	0.36x	3.78x	in.fene	bench_fene.tar.gz
Lennard-Jones	LJ liquid	32000	0.005 tau	76.9	12 Mb	8.01e-7	1.0x	3.56x	in.lj	bench_lj.tar.gz
DPD	pure solvent	32000	0.04 tau	41.3	9.4 Mb	1.22e-6	1.53x	3.54x	in.dpd	bench_dpd.tar.gz
EAM	bulk Cu	32000	5 fmsec	75.5	13 Mb	1.87e-6	2.34x	3.83x	in.eam	bench_eam.tar.gz
REBO	polyethylene	32640	0.5 fmsec	149	33 Mb	3.18e-6	3.97x	3.61x	in.rebo	bench_rebo.tar.gz
Stillinger-Weber	bulk Si	32000	1 fmsec	30.0	11 Mb	3.28e-6	4.10x	3.83x	in.sw	bench_sw.tar.gz
Tersoff	bulk Si	32000	1 fmsec	16.6	9.2 Mb	3.74e-6	4.67x	3.92x	in.tersoff	bench_tersoff.tar.gz
ADP	bulk Ni	32000	5 fmsec	83.6	25 Mb	5.58e-6	6.97x	3.61x	in.adp	bench_adp.tar.gz
EIM	crystalline NaCl	32000	0.5 fmsec	98.9	14 Mb	5.60e-6	6.99x	3.86x	in.eim	bench_eim.tar.gz
Peridynamics	glass fracture	32000	22.2 nsec	422	144 Mb	7.46e-6	9.31x	3.78x	in.peri	bench_peri.tar.gz
SPC/E	liquid water	36000	2 fmsec	700	86 Mb	8.77e-6	11.0x	3.46x	in.spce	bench_spce.tar.gz
CHARMM + PPPM	solvated protein	32000	2 fmsec	376	124 Mb	1.13e-5	14.1x	3.66x	in.protein	bench_protein.tar.gz
MEAM	bulk Ni	32000	5 fmsec	48.8	54 Mb	1.32e-5	16.5x	3.73x	in.meam	bench_meam.tar.gz
Gay-Berne	ellipsoid mixture	32768	0.002 tau	140	21 Mb	2.20e-5	27.5x	3.63x	in.gb	bench_gb.tar.gz
BOP	bulk CdTe	32000	1 fmsec	4.4	74 Mb	2.51e-5	31.3x	3.88x	in.bop	bench_bop.tar.gz
AIREBO	polyethylene	32640	0.5 fmsec	681	101 Mb	3.25e-5	40.6x	3.66x	in.airebo	bench_airebo.tar.gz
ReaxFF/C	PETN crystal	32480	0.1 fmsec	667	976 Mb	1.09e-4	136x	3.17x	in.reaxc	bench_reaxc.tar.gz
COMB	crystalline SiO2	32400	0.2 fmsec	572	85 Mb	2.00e-4	250x	3.89x	in.comb	bench_comb.tar.gz
eFF	H plasma	32000	0.001 fmsec	5066	365 Mb	2.16e-4	270x	3.71x	in.eff	bench_eff.tar.gz
ReaxFF	PETN crystal	16240	0.1 fmsec	667	425 Mb	2.84e-4	354x	3.78x	in.reax	bench_reax.tar.gz
VASP/small	water	192/512	0.3 fmsec	N/A	320 procs	26.2	17.7e6	100%	N/A	N/A
VASP/medium	CO2	192/1024	0.8 fmsec	N/A	384 procs	252	170e6	100%	N/A	N/A
VASP/large	Xe	432/3456	2.0 fmsec	N/A	384 procs	1344	908e6	100%	N/A	N/A

Notes:

The 1-processor runs were made on a single core of a Dell Precision T7500 desktop system, running Linux (Red Hat release 6.1, 64-bit). The box has dual hex-core 3.47 GHz Intel Xeon X5690 nodes. LAMMPS was compiled using the Intel icc compiler, version 11.1. The 4-processor runs were made on 4 cores of the same system.
The number of neighbors per atom is based on what LAMMPS tallied at the end of the short run within the neighbor cutoff = force cutoff + neighbor skin distance.
The tarballs contain the input and output files for each benchmark.
The VASP quantum DFT calculations and timings are from Thomas Mattsson at Sandia for production runs he performs. The 3 VASP runs (small, medium, large) were performed on large parallel machines: a Cray XT3 for the small, and a dual quad-core Nehalem-based Linux cluster for the medium and large. The single core speed of these machines is slightly slower (maybe 25% slower) of the Dell desktop single core speed for LAMMPS calculations. The "# of Atoms" entry is listed as N/M where N is the number of atoms and M is the number of electrons, since the latter is more relevant for the cost of the DFT calculation. The listed timestep is for ab initio MD calculations. The total memory footprint is not given, but can be inferred from the processor count (320 or 384 procs) since each processor has one to a few Gb of memory. The CPU time per timestep per atom for one processor was calculated from the parallel run-time in the following manner. For the small problem, VASP ran at 15.7 seconds per timestep on 320 procs. The listed CPU time = 320*15.7/192 = 26.2 seconds per atom per timestep per processor. Similarly for the medium and large problems which ran at 126 and 1512 seconds per timestep in parallel. Note that this assumes the parallel VASP run-time is 100% efficient, which thus over-estimates the 1-processor time (assuming these problems could be run on a single processor, which they can't). Also note that the per-atom VASP timings increase with problem size, since DFT scales roughly as O(N^2.5) in the number of electrons. This means the per-atom per-timestep cost of a quantum calculation would increase dramatically if extrapolated to the 32,000 atom problem sizes listed for the MD calculations.

Details for different systems:

Granular: chute flow, dense packing of sigma = 1 diameter particles, contact forces, neighbor skin = 0.1 sigma, NVE with rotational DOF.
FENE bead/spring: melt of 320 100-mer chains, rho* = 0.844, FENE bond potential, LJ cutoff = 2^(1/6) sigma, neighbor skin = 0.4 sigma, NVE + Langevin thermoastat at T* = 1.0.
Lennard-Jones: liquid, rho* = 0.844, T* = 0.75, cutoff = 2.5 sigma, neighbor skin = 0.3 sigma, NVE.
DPD: pure solvent, rho* = 3.0, T* = 1.0, cutoff = 1.0 sigma, neighbor skin = 0.5 sigma, NVE.
EAM: bulk fcc Cu at 800K, cutoff = 4.95 Angs, neighbor skin = 1.0 Angs, NVE.
Tersoff: bulk diamond Si at 500K, cutoff = 3.2 Angs, neighbor skin = 1.0 Angs, NVE.
Stillinger-Weber: bulk diamond Si at 500K, cutoff = 3.77 Angs, neighbor skin = 1.0 Angs, NVE.
EIM: NaCl crystal in CsCl structure (would anneal to NaCl polycrystals), cutoff = 7.54 Angs, neighbor skin = 0.3 Angs, NPT.
SPC/E: liquid water at 300K, SPC/E water model with LJ/Coulombic pair potential, cutoff = 9.8 Angs, neighbor skin = 2.0 Angs, long-range Coulombics via PPPM to 1.0e-4 tolerance every step, SHAKE on bonds and angle, NVT.
CHARMM + PPPM: rhodopsin protein in solvated lipid bilayer, CHARMM force field with LJ/Coulombic pair potential, cutoff = 10.0 Angs, neighbor skin = 2.0 Angs, long-range Coulombics via PPPM to 1.0e-4 tolerance every step, SHAKE on hydrogens and water, NPT.
MEAM: bulk fcc Ni at 800K, cutoff = 4.0 Angs, neighbor skin = 1.0 Angs, NVE.
Peridynamics: brittle fracture of notched sodalime glass, bond-based prototype microelastic brittle (PMB) potential, cutoff = 1.5 mm, neighbor skin = 1.0 mm, NVE.
Gay-Berne: biaxial ellipsoids at volume fraction = 0.48 and T* = 2.4, ellipsoid shape (in sigma) = 2 by 1.5 by 1, cutoff = 4.0 sigma, neighbor skin = 0.8 sigma, NPT.
BOP: bulk CdTe at 550K, cutoff at 4.9 Angs, neighbor skin = 0.1 Angs, NVE. Note that the BOP manybody calculations involve neighbors of neighbors of neighbors, so the effective cutoff is 14.7 Angstroms.
AIREBO: polyethylene chains at 300K, cutoff = 10.2 Angs for C-C interactions, neighbor skin = 0.5 Angs, NVE.
COMB: bulk alpha-cristobalite SiO2 at 300K, short-range cutoff = 3.1 Angs, long-range Coulombics with cutoff = 12.0 Angs, neighbor skin = 0.5 Angs, QEq via EE every 10 steps to 1.0e-3 tolerance (QEq took 55% of time), NVT.
eFF: H plasma at 40,000K, short-range cutoff = 12 Bohr = 6.35 Angs, neighbor skin = 2.0 Bohr = 1.06 Angs, NVE.
ReaxFF: PETN crystal, cutoff = 10.0 Angs, neighbor skin = 2.0 Angs, QEq every step to 1.0e-6 tolerance, NVE. To fit 16240 atoms in one processor's memory, use provided reax_defs.h in lib/reax when compiling Fortran ReaxFF library.
ReaxFF/C: same system as ReaxFF, run with the newer C-library version of ReaxFF in USER-REAX.
VASP/small: 64 water molecules with 512 electrons at 300K, using a deuterium mass for H.
VASP/medium: 64 CO2 molecules with 1024 electrons at 1000K and liquid density.
VASP/large: 432 Xe atoms with 3456 electrons at 6000K and 10 g/cm3, a high-pressure melting calculation.

Billion-atom LJ benchmarks

The Lennard-Jones benchmark problem described above (100 timesteps, reduced density of 0.8442, 2.5 sigma cutoff, etc) has been run on different machines for billion-atom tests. For the LJ benchmark LAMMPS requires a little less than 1/2 Terabyte of memory per billion atoms, which is used mostly for neighbor lists.

Machine	# of Atoms	Processors	CPU Time (secs)	Parallel Efficiency	Flop Rate	Date
Keeneland	1 million	1 GPU	2.35	100%	27.0 Gflop	2012
Keeneland	1 billion	288 GPUs	17.7	46.3%	3.60 Tflop	2012
Lincoln	1 million	1 GPU	4.24	100%	15.0 Gflop	2011
Lincoln	1 billion	288 GPUs	28.7	51.3%	2.21 Tflop	2011
Cray XT5	1 million	1	148.7	100%	427 Mflop	2011
Cray XT5	1 billion	1920	103.0	75.1%	616 Gflop	2011
Cray XT3	1 million	1	235.3	100%	270 MFlop	2006
Cray XT3	1 billion	10000	25.1	93.6%	2.53 Tflop	2006
Cray XT3	10 billion	10000	246.8	95.2%	2.57 Tflop	2006
Cray XT3	40 billion	10000	979.0	96.0%	2.59 Tflop	2006
IBM BG/L	1 million	1	898.3	100%	70.7 Mflop	2005
IBM BG/L	1 billion	4096	227.6	96.3%	279 Gflop	2005
IBM BG/L	1 billion	32K	30.2	90.7%	2.10 Tflop	2005
IBM BG/L	1 billion	64K	16.0	85.6%	3.97 Tflop	2005
IBM BG/L	10 billion	64K	148.9	92.0%	4.26 Tflop	2005
IBM BG/L	40 billion	64K	585.4	93.6%	4.34 Tflop	2005
ASCI Red	32000	1	62.88	100%	32.3 Mflop	2004
ASCI Red	750 million	1500	1156	85.0%	41.2 Gflop	2004

The parallel efficiencies are estimated from the per-atom CPU or GPU time for a large single processor (or GPU) run on each machine:

2.35e-8 sec/atom/timestep on Keeneland
4.24e-8 sec/atom/timestep on Lincoln
1.49e-6 sec/atom/timestep on the Cray XT5
2.35e-6 sec/atom/timestep on the Cray XT3
8.98e-6 sec/atom/timestep on the IBM BG/L
1.96e-5 sec/atom/timestep on ASCI Red

The aggregate flop rate is estimated using the following values for the pairwise interactions, which dominate the run time:

27.6 pairwise interactions per atom (with Newton's 3rd law for this density and cutoff)
23 flops per LJ interaction

This is a conservative estimate in the sense that flops computed for atom pairs outside the force cutoff, building neighbor lists, and time integration are not counted. For the USER-CUDA package running on GPUs, Newton's 3rd law is not used (because it's faster not to), which doubles the pairwise interaction count, but that is not included in the flop rate either.

Machines

This section lists characteristics of machines used in the benchmarking along with options used in compiling LAMMPS. The communication parameters are for bandwidth and latency at the MPI level, i.e. what a program like LAMMPS sees.

Desktop = Dell Precision T7500 desktop workstation running Red Hat linux

processor: dual hex-core 3.47 GHz Intel Xeons X5690
sited at Sandia National Labs (~2011)
communication: on chip and between procs, ?? MB/sec bandwidth, ?? usec latency
Intel compiler: icc -O (version 11.1)
GPUs: 2 Tesla C2075 GPU boards

Mac laptop = PowerBook G4 running OS X 10.3

processor: 1 GHz G4 PowerPC
sited at Sandia National Labs (~2004)
Mac compiler (g++): c++ -O

ASCI Red = ASCI Intel Tflops MPP

~9000 processors
sited at Sandia National Labs (~2002 for upgrade, original machine ~1998)
processor: 333 MHz Pentium III
communication: Intel custom interconnect, 310 MB/sec bandwidth, 18 usec latency
PGI compiler: CiCC -O4 -Knoieee

Ross = CPlant DEC Alpha/Myrinet cluster

1000+ processors
sited at Sandia National labs (~2004)
processor: 500 MHz EV6 DEC Alpha
communication: Myrinet, 100 MB/sec bandwidth, 65 usec latency
DEC compiler: c++ -O

Liberty = Intel/Myrinet cluster packaged by HP

472 procs = 236 dual-processors nodes
sited at Sandia National Labs (~2004)
processor: 3.06 GHz Xeon
communication: Myrinet, 230 MB/sec bandwidth, 9 usec latency
Intel compiler: mpiCC -O

Cheetah = IBM p690 cluster

864 procs = 27 32-processor nodes
sited at Oak Ridge National Labs (~2004)
processor: 1.3 GHz Power4
communication: IBM Federation interconnect, 1.49 GB/sec bandwidth, 7 usec latency
IBM compiler: mpCC_r -O4 -qnoipa

Xeon/Myrinet cluster = Spirit

1024 procs = 512 dual-processors nodes
sited at Sandia National Labs (~2005)
processor: 3.4 GHz EM64T Xeon
communication: Myrinet, 230 MB/sec bandwidth, 9 usec latency
Intel compiler: mpiCC -O

IBM p690+ cluster = HPCx

1600 procs = 50 32-processor nodes
sited at CCLRC Daresbury Laboratory (~2005)
processor: 1.7 GHz Power4+
communication: IBM Federation interconnect, 1.45 GB/sec bandwidth for one link, 4.5 GB/sec bandwidth for all 4 links, 6 usec latency
IBM compiler: mpCC_r -q64 -O3 -qarch=pwr4 -qtune=pwr4

IBM BG/L = Blue Gene Light

65536 (64K) procs
sited at Lawrence Livermore National Labs (~2005)
processor: 700 MHz PowerPC 440
communication: IBM custom interconnect, 150 MB/sec bandwidth, 3 usec latency
IBM compiler: blrts_xlC -O3

Cray XT3 = Red Storm

10368 processors
sited at Sandia National Labs (~2005)
processor: 2.0 GHz AMD Opteron
communication: Cray Cstar custom interconnect, 1100 MB/sec bandwidth, 7 usec latency
PGI 6.0 compiler: CC -fastsse

Cray XT5 = xtp

160 nodes, 1920 cores (12 cores/node)
sited at Sandia National Labs (~2010)
processor: dual hex-core 2.4 GHz AMD Opteron
communication: Cray Cstar custom interconnect, 1100 MB/sec bandwidth, 7 usec latency
PGI 9.0 compiler: CC -fastsse

Lincoln = GPU cluster

192 nodes, each with 2 CPUs and 2 GPUs
sited at NCSA (~2010)
CPU = 2.33 GHz Intel X5300
GPU = NVIDIA Tesla C1060
communication: SDR Infiniband
NVIDIA Cuda compiler

Keeneland = GPU cluster

120 nodes, each with 2 CPUs and 3 GPUs
sited at ORNL (~2011)
CPU = 2.80 GHz Intel X5660
GPU = NVIDIA Tesla M2090
communication: Qlogic QDR
NVIDIA Cuda 4.1 compiler

Titan Development = GPU-enabled supercomputer (used to be Jaguar)

18688 XK6 nodes, each with a 16-core CPU and 1 GPU (only 960 nodes have GPUs)
sited at ORNL (~2012)
CPU = 16-core 64-bit AMD Opterion 6200
GPU = NVIDIA Tesla X2090 (Fermi)
communication: Gemini interconnect
NVIDIA Cuda 4.1 compiler