The Journal of Chemical Physics. 2017 May 22;146(20):204101. doi: 10.1063/1.4983393

Efficient evaluation of three-center Coulomb integrals

Gyula Samu 1,a), Mihály Kállay 1,b)
PMCID: PMC5440237  PMID: 28571354

Abstract

In this study we pursue the most efficient paths for the evaluation of three-center electron repulsion integrals (ERIs) over solid harmonic Gaussian functions of various angular momenta. First, the adaptation of the well-established techniques developed for four-center ERIs, such as the Obara–Saika, McMurchie–Davidson, Gill–Head-Gordon–Pople, and Rys quadrature schemes, and the combinations thereof for three-center ERIs is discussed. Several algorithmic aspects, such as the order of the various operations and primitive loops as well as prescreening strategies, are analyzed. Second, the number of floating point operations (FLOPs) is estimated for the various algorithms derived, and based on these results the most promising ones are selected. We report the efficient implementation of the latter algorithms invoking automated programming techniques and also evaluate their practical performance. We conclude that the simplified Obara–Saika scheme of Ahlrichs is the most cost-effective one in the majority of cases, but the modified Gill–Head-Gordon–Pople and Rys algorithms proposed herein are preferred for particular shell triplets. Our numerical experiments also show that even though the solid harmonic transformation and the horizontal recurrence require significantly fewer FLOPs if performed at the contracted level, this approach does not improve the efficiency in practical cases. Instead, it is more advantageous to carry out these operations at the primitive level, which allows for more efficient integral prescreening and memory layout.

I. INTRODUCTION

Electron repulsion integrals (ERIs), which describe the Coulomb interaction of two charge distributions, are one of the basic quantities in quantum chemistry. In conventional formulations, these are four-center integrals defined as

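$$(\phi_A\phi_B|\phi_C\phi_D)=\iint\frac{\phi_A(\mathbf{r}_1)\phi_B(\mathbf{r}_1)\phi_C(\mathbf{r}_2)\phi_D(\mathbf{r}_2)}{|\mathbf{r}_1-\mathbf{r}_2|}\,d\mathbf{r}_1\,d\mathbf{r}_2 \tag{1}$$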

for basis functions ϕA, ϕB, ϕC, and ϕD with r1 and r2 being the coordinates of the electrons. The evaluation of such integrals is often the limiting step for Hartree–Fock (HF) and density functional theory (DFT) calculations, while their transformation from the atomic orbital (AO) to the molecular orbital (MO) basis can be a bottleneck for correlated methods. The computational requirements for both of these tasks can be efficiently reduced by invoking the density fitting (DF) approximation, which is equivalent to the resolution of identity technique if the so-called Coulomb metric is used.1–5 In this approach, the generalized electron densities given by the product of two basis functions are expanded in an auxiliary (fitting) basis in a manner that minimizes the error of the electric field generated by the charge distributions3,4 as

$$(\phi_A\phi_B|\phi_C\phi_D)\approx\sum_{Q,R}(\phi_A\phi_B|\rho_Q)\,V^{-1}_{QR}\,(\rho_R|\phi_C\phi_D), \tag{2}$$

where ρQ and ρR denote functions from the fitting basis, $V^{-1}_{QR}$ is the corresponding element of the inverse of the matrix containing the two-center ERIs (ρQ|ρR), and (ϕAϕB|ρQ) is a three-center Coulomb integral. The main advantage of applying this approximation is that the $O(N^4)$ scaling of evaluating and processing the ERIs is reduced to $O(N^2M)$, with N (M) being the size of the AO (auxiliary) basis, and the calculation of these integrals over a reduced number of Gaussian basis functions is also considerably simpler than that for the four-center ones. When dealing with large systems, even the necessary three-center ERIs become too numerous to store on disk, or it is more advantageous to recalculate them since the sparsity of the integrals can be efficiently utilized with prescreening techniques. These observations have led to the development of integral-direct algorithms, where the ERIs are recalculated whenever they are needed, e.g., in each cycle of a direct self-consistent field (SCF) procedure6–8 or for the overlapping domains in a local correlation calculation.9 The efficiency of such algorithms obviously depends on the speed of the integral evaluation.

For the evaluation of four-center ERIs, several efficient schemes have been constructed. The oldest of the still popular methods is the one developed by King, Dupuis, and Rys,10–16 commonly referred to as the Rys quadrature scheme, which is a Gaussian quadrature based technique for the evaluation of integrals containing functions with arbitrary angular momenta. Other methods are mainly based on recurrence relations using scaled Boys functions17 as their starting values. The scheme of McMurchie and Davidson18 (MD) utilizes the fact that Cartesian Gaussian overlap distributions can be written in terms of Hermite Gaussian functions and also that the two-center Hermite integrals necessary for this expansion can be reduced to one-center ones. Later, Obara and Saika19,20 (OS) presented their method based on recurrence relations connecting auxiliary integrals of various angular momenta. Their scheme arguably remains the most widely used one, due to the subsequent introduction of the horizontal recurrence relation21–23 (HRR) by Head-Gordon and Pople and the electron transfer relation24 (ETR) by Hamilton and Schaefer. The latter recurrence was also presented by Lindh, Riu, and Liu utilizing the close relationship between the OS and the Rys quadrature schemes, and these authors also developed the reduced multiplication scheme by combining the Rys quadrature approach with the ETR and the HRR.25,26 Gill and co-workers,27–34 amongst other contributions, achieved a synthesis of the OS and the MD methods by moving the transformation of Hermite integrals into integrals over Cartesian overlap distributions to the contracted level by properly scaling the intermediate one-center integrals, resulting in a scheme that is very efficient for integrals over highly contracted functions.

Concerning the evaluation of three-center integrals, fewer studies can be found in the literature. Köster, exploiting the uncontracted nature of the auxiliary basis sets, combined the OS, MD, and Gill–Head-Gordon–Pople (GHP) algorithms for three-center ERIs over Cartesian Gaussians.35 Later he also proposed the use of Hermite Gaussian auxiliary functions,36,37 which saves the transformation from Hermite to Cartesian functions in an MD scheme. Reine, Tellgren, and Helgaker38,39 showed that Hermite Gaussians transform into solid harmonic ones exactly the same way as Cartesian Gaussians do, and utilizing this finding these authors also put forward a scheme for the evaluation of three-center integrals over solid harmonic Gaussians which avoids the Hermite to Cartesian transformation. A remarkable improvement on the OS scheme for solid harmonic three-center ERIs was achieved by Ahlrichs,40 who realized that the recurrence relation for the build-up of angular momentum on the fitting function greatly simplifies for three-center integrals. Efficient three-center ERI implementations can be found in the Libint library of Valeev41,42 and the adaptive integral core code of Knizia,43 who both applied the results of Ahlrichs.

It is also important to mention here that there exist several approaches that employ the DF approximation but at least partly avoid the explicit construction of the three-center ERI lists. The so-called J-engine and the related schemes exploit the structure of the Coulomb term in a direct SCF calculation, and instead of performing the relatively expensive recursions and transformations for the ERIs the reverse operations are carried out for the quantities by which the ERIs are multiplied.36,44,45 These algorithms are particularly useful for Kohn–Sham SCF calculations where the significantly more costly exact exchange term is not computed, but efficient DF HF and hybrid DFT algorithms can also be designed if the J-engine approaches are combined with low-cost schemes for the evaluation of the Fock exchange.9,46–51 A further possibility for the reduction of the costs of DF SCF calculations is to approximate far-field ERIs invoking asymptotic or multipole expansions and to evaluate only the near-field integrals analytically.52–54 Nonetheless, there are numerous applications where the explicit evaluation of the three-center ERIs cannot be avoided. For the evaluation of the Fock exchange in a DF SCF calculation or for any correlated calculation employing the DF approximation, at least one AO index of the three-center integrals must be transformed to the MO basis, and, to the best of our knowledge, there exist no algorithms that use similar tricks as the J-engine scheme. In the above cases, at least the near-field three-center integrals must be computed, which requires a considerable computation time, especially with basis sets including functions of high angular momentum. Thus, the cost-effective evaluation of three-center Coulomb integrals is of utmost importance for DF methods.

The aim of this paper is to find the most efficient route for the evaluation of three-center ERIs over solid harmonic Gaussian functions of various angular momenta. We compare the OS, MD, GHP, and Rys quadrature schemes and their combinations and discuss several algorithmic aspects for the evaluation of three-center ERIs. In Sec. II the adaptation of the aforementioned methods for the evaluation of three-center ERIs is presented. The given equations form the basis for the estimation of the floating point operations (FLOPs) required by the various approaches, detailed in Sec. III. The implementation of the schemes with the lowest theoretical FLOP counts along with various prescreening strategies and orders of the operations is discussed in Sec. IV, and the comparison of practical performances is done in Sec. V. Finally, in Sec. VI the efficiency of our implementation is demonstrated by calculating the ERIs for medium to large systems.

II. THEORY

A. Three-center Coulomb integrals

In this work we are concerned with the evaluation of three-center ERIs over contracted solid harmonic Gaussian basis functions, which are gained by linear transformations of integrals over unnormalized primitive Cartesian Gaussian basis functions. These functions are defined as

$$G_{IJK}(\mathbf{r},a,\mathbf{A})=x_A^I\,y_A^J\,z_A^K\exp(-a r_A^2), \tag{3}$$

where r denotes the position vector of the electron, A is the position of the nucleus on which the function is centered, a is a constant Gaussian exponent, and $r_A$ is the magnitude of the vector $\mathbf{r}_A=\mathbf{r}-\mathbf{A}$, with $x_A$ being the x component of $\mathbf{r}_A$. L = I + J + K will be called the angular momentum of $G_{IJK}$, and the vector $\mathbf{L}=(I,J,K)$ will be referred to as the angular momentum vector of $G_{IJK}$. Functions with the same center, exponent, and angular momentum constitute a shell with (L + 1)(L + 2)/2 components. The primitive Gaussians are separable in the three Cartesian directions, that is, $G_{IJK}=G_IG_JG_K$, where, for instance, $G_I=x_A^I\exp(-a x_A^2)$. They also obey the following recurrence relation for differentiation with respect to a nuclear coordinate (given here for the x direction only):

$$\frac{\partial G_I}{\partial A_x}=2aG_{I+1}-IG_{I-1}, \tag{4}$$

where Ax is the x component of A.
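The differentiation rule of Eq. (4) is easy to check numerically. The following is a minimal sketch in one dimension that compares the recurrence with a central finite difference; the function names and the step size are illustrative assumptions, not part of the original work.

```python
import math

def gauss_1d(i, x, a, ax):
    """Unnormalized 1D Cartesian Gaussian G_I = x_A^I exp(-a x_A^2), x_A = x - A_x."""
    if i < 0:
        return 0.0  # G_{-1} only appears multiplied by I = 0
    xa = x - ax
    return xa**i * math.exp(-a * xa * xa)

def dgauss_recurrence(i, x, a, ax):
    """Right-hand side of Eq. (4): 2a G_{I+1} - I G_{I-1}."""
    return 2.0 * a * gauss_1d(i + 1, x, a, ax) - i * gauss_1d(i - 1, x, a, ax)

def dgauss_numeric(i, x, a, ax, h=1e-5):
    """Central finite difference of G_I with respect to the center coordinate A_x."""
    return (gauss_1d(i, x, a, ax + h) - gauss_1d(i, x, a, ax - h)) / (2.0 * h)

if __name__ == "__main__":
    for i in range(4):
        print(i, dgauss_recurrence(i, 0.7, 1.3, -0.2), dgauss_numeric(i, 0.7, 1.3, -0.2))
```

The two columns should agree to within the finite-difference error for each I.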

For solid harmonic Gaussian functions, one needs to combine functions with the same exponent, angular momentum, and center, but different angular momentum vectors as

$$G_{Lm}(\mathbf{r},a,\mathbf{A})=\sum_{I+J+K=L}C^{Lm}_{IJK}\,G_{IJK}(\mathbf{r},a,\mathbf{A}). \tag{5}$$

A shell of solid harmonic Gaussians consists of functions with $0\le|m|\le L$, having 2L + 1 components. The $C^{Lm}_{IJK}$ coefficients in Eq. (5) only depend on the angular momentum vector and the values of L and m.17

We obtain contracted Gaussians by linearly combining functions with different exponents a but the same angular momentum vector and center,

$$\chi_A^{Lm}(\mathbf{r},\mathbf{A})=\sum_a G_{Lm}(\mathbf{r},a,\mathbf{A})\,d_a^{\chi_A}, \tag{6}$$

where the contraction coefficients $d_a^{\chi_A}$ also include the norm of the solid harmonic Gaussian function and are the same for a given shell. Of course, the transformation given by Eq. (5) can also be applied to integrals in the contracted basis and the one defined by Eq. (6) to the integrals in the primitive Cartesian basis as well.

Three-center ERIs over primitive Gaussian functions are defined as

$$(\mathbf{L}_a\mathbf{L}_b|\mathbf{L}_c)=\iint\frac{G_{I_aJ_aK_a}(\mathbf{r}_1,a,\mathbf{A})\,G_{I_bJ_bK_b}(\mathbf{r}_1,b,\mathbf{B})\,G_{I_cJ_cK_c}(\mathbf{r}_2,c,\mathbf{C})}{|\mathbf{r}_1-\mathbf{r}_2|}\,d\mathbf{r}_1\,d\mathbf{r}_2, \tag{7}$$

where $\mathbf{L}_a=(I_a,J_a,K_a)$ stands for the angular momentum vector and $L_a=I_a+J_a+K_a$ is the angular momentum of the function with exponent a. From these, integrals over solid harmonic contracted Gaussians are computed by applying Eqs. (5) and (6) in an arbitrary order on the three centers. We will refer to primitive integrals sharing angular momenta $L_a$, $L_b$, and $L_c$, centers A, B, and C, and exponents a, b, and c as a primitive class, e.g., the class (11|1) consists of 27 primitive integrals. Similarly, the members of contracted classes are integrals over contracted Gaussians of the same angular momenta and centers. A shell triplet will refer to all the integrals over solid harmonic Gaussians belonging to centers A, B, and C and angular momenta $L_a$, $L_b$, and $L_c$.
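The component counts quoted above can be reproduced with a few lines of code. This is a minimal sketch; the function names are hypothetical, and the ordering of the Cartesian components is one arbitrary choice among several in common use.

```python
def cartesian_components(l):
    """All angular momentum vectors (I, J, K) with I + J + K = l."""
    return [(i, j, l - i - j) for i in range(l, -1, -1) for j in range(l - i, -1, -1)]

def n_cartesian(l):
    """Number of Cartesian components in a shell: (L + 1)(L + 2)/2."""
    return (l + 1) * (l + 2) // 2

def n_solid_harmonic(l):
    """Number of solid harmonic components in a shell: 2L + 1."""
    return 2 * l + 1

def primitive_class_size(la, lb, lc):
    """Number of primitive Cartesian integrals in the class (La Lb | Lc)."""
    return n_cartesian(la) * n_cartesian(lb) * n_cartesian(lc)

if __name__ == "__main__":
    print(cartesian_components(1))
    print(primitive_class_size(1, 1, 1))
```

For the (11|1) class this gives 3 x 3 x 3 = 27 primitive integrals, in line with the text.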

An important special case is the primitive integral where La = Lb = Lc = 0, the value of which can be expressed directly40 as

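$$(\mathbf{0}\mathbf{0}|\mathbf{0})^{(0)}=(\mathbf{0}\mathbf{0}|\mathbf{0})=\theta_{pc}\,\kappa_{ab}\,F_0(\alpha R_{PC}^2), \tag{8}$$

with

$$\kappa_{ab}=\exp(-\mu R_{AB}^2),\quad \mu=\frac{ab}{a+b},\quad \mathbf{R}_{AB}=\mathbf{A}-\mathbf{B},\quad p=a+b,\quad \theta_{pc}=\frac{2\pi^{5/2}}{pc\sqrt{p+c}},\quad \mathbf{P}=\frac{a\mathbf{A}+b\mathbf{B}}{p},\quad \mathbf{R}_{PC}=\mathbf{P}-\mathbf{C},\quad \alpha=\frac{pc}{p+c},$$

and $F_n$ being the Boys function of order n, defined as

$$F_n(x)=\int_0^1 t^{2n}\exp(-xt^2)\,dt. \tag{9}$$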

The integral in Eq. (8) and also other auxiliary integrals where the order of the Boys function is greater than 0 are the starting points for the OS,19 MD,18 and GHP27 schemes for the evaluation of the integrals in Eq. (7) for arbitrary angular momenta.
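As an illustration of Eqs. (8)-(10), the sketch below evaluates the Boys function and the auxiliary $(\mathbf{0}\mathbf{0}|\mathbf{0})^{(n)}$ integrals in plain Python. It is not the scheme used in the paper: the Boys function is obtained from its standard power series for moderate arguments and from the large-argument limit beyond an arbitrarily chosen threshold of 50, and all function names are assumptions made for this example.

```python
import math

def boys(n, x):
    """Boys function F_n(x) of Eq. (9) for x >= 0 (sketch, not production quality)."""
    if x > 50.0:
        # Large-argument limit: F_n(x) ~ (2n-1)!! / 2^(n+1) * sqrt(pi / x^(2n+1)).
        double_factorial = 1.0
        for k in range(1, 2 * n, 2):
            double_factorial *= k
        return double_factorial / 2.0 ** (n + 1) * math.sqrt(math.pi / x ** (2 * n + 1))
    # Series: F_n(x) = exp(-x) * sum_k (2x)^k / [(2n+1)(2n+3)...(2n+2k+1)].
    term = 1.0 / (2 * n + 1)
    total = term
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= 2.0 * x / (2 * n + 2 * k + 1)
        total += term
    return math.exp(-x) * total

def primitive_ss_s(a, A, b, B, c, C, n=0):
    """Auxiliary integral (00|0)^(n) = theta_pc * kappa_ab * F_n(alpha * R_PC^2)."""
    p = a + b
    mu = a * b / p
    alpha = p * c / (p + c)
    P = [(a * u + b * v) / p for u, v in zip(A, B)]
    r_ab2 = sum((u - v) ** 2 for u, v in zip(A, B))
    r_pc2 = sum((u - v) ** 2 for u, v in zip(P, C))
    kappa_ab = math.exp(-mu * r_ab2)
    theta_pc = 2.0 * math.pi ** 2.5 / (p * c * math.sqrt(p + c))
    return theta_pc * kappa_ab * boys(n, alpha * r_pc2)

if __name__ == "__main__":
    print(primitive_ss_s(0.8, (0.0, 0.0, 0.0), 1.1, (0.3, 0.0, 0.0), 0.9, (0.0, 0.0, 1.5)))
```

A useful sanity check is the long-range limit: for a distant fitting function the integral approaches the product of the two total charges, $\kappa_{ab}(\pi/p)^{3/2}$ and $(\pi/c)^{3/2}$, divided by $R_{PC}$.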

B. Obara–Saika recursion

The OS scheme utilizes recurrence relations of auxiliary intermediate integrals to construct the true ERIs with the desired angular momenta. An efficient application of this method to three-center ERIs was presented by Ahlrichs.40 This approach will be referred to as OS1. The first step here is to evaluate the required auxiliary integrals

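$$(\mathbf{0}\mathbf{0}|\mathbf{0})^{(n)}=\theta_{pc}\,\kappa_{ab}\,F_n(\alpha R_{PC}^2) \tag{10}$$

for $L_c\le n\le L_a+L_b+L_c$. Then the vertical recurrence relation19 (VRR) is used to increment the angular momentum of the first function on the bra side (given here for the x direction) as

$$([\mathbf{l}_a+\mathbf{1}_x]\mathbf{0}|\mathbf{0})^{(n)}=X_{PA}(\mathbf{l}_a\mathbf{0}|\mathbf{0})^{(n)}-\frac{\alpha}{p}X_{PC}(\mathbf{l}_a\mathbf{0}|\mathbf{0})^{(n+1)}+\frac{i_a}{2p}\left(([\mathbf{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{0})^{(n)}-\frac{\alpha}{p}([\mathbf{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{0})^{(n+1)}\right), \tag{11}$$

where $X_{PA}$ is the x component of the vector $\mathbf{R}_{PA}$, and generally $\mathbf{1}_\sigma=(\delta_{\sigma,x},\delta_{\sigma,y},\delta_{\sigma,z})$ for $\sigma=x,y,z$. Here and later, $l_a$, $i_a$, and $\mathbf{l}_a$ refer to the angular momentum, its x component, and the angular momentum vector of the first Gaussian in the intermediate integrals, respectively, and a similar notation will be used for the angular momenta of the second and third functions and their components. With Eq. (11), the classes $(\mathbf{l}_a\mathbf{0}|\mathbf{0})^{(L_c)}$ are calculated for $\max(1,L_a-L_c)\le l_a\le L_a+L_b$. Next, in the case where solid harmonic basis functions are supposed to be on the ket side, $\mathbf{l}_c$ can be built up by a two-term VRR,40

$$(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c+\mathbf{1}_x])^{(n)}=\frac{\alpha}{c}X_{PC}(\mathbf{l}_a\mathbf{0}|\mathbf{l}_c)^{(n+1)}+\frac{i_a}{2(p+c)}([\mathbf{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)^{(n+1)}. \tag{12}$$

Eq. (12) is used to produce the $(\mathbf{l}_a\mathbf{0}|\mathbf{L}_c)^{(0)}$ classes for $L_a\le l_a\le L_a+L_b$. From here on, the superscript (n) will be dropped when it is equal to 0. The last step is to increment $\mathbf{l}_b$, which is efficiently done by the HRR of Head-Gordon and Pople,21

$$(\mathbf{l}_a[\mathbf{l}_b+\mathbf{1}_x]|\mathbf{l}_c)=([\mathbf{l}_a+\mathbf{1}_x]\mathbf{l}_b|\mathbf{l}_c)+X_{AB}(\mathbf{l}_a\mathbf{l}_b|\mathbf{l}_c). \tag{13}$$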
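The HRR of Eq. (13) rests only on the identity $x_B=x_A+X_{AB}$ inside the integrand, so it can be illustrated without a full ERI code. The sketch below uses a one-dimensional two-Gaussian moment integral, evaluated by the trapezoid rule, as a stand-in for the three-center integral; the function names, grid size, and integration window are assumptions made for this example.

```python
import math

def moment_integral(i, j, a, ax, b, bx, npts=4001, half_width=12.0):
    """Trapezoid estimate of the integral of x_A^i x_B^j exp(-a x_A^2 - b x_B^2) over x."""
    p = a + b
    px = (a * ax + b * bx) / p  # center of the product Gaussian
    h = 2.0 * half_width / (npts - 1)
    total = 0.0
    for k in range(npts):
        x = px - half_width + k * h
        weight = 0.5 if k in (0, npts - 1) else 1.0
        xa, xb = x - ax, x - bx
        total += weight * xa**i * xb**j * math.exp(-a * xa * xa - b * xb * xb)
    return total * h

def hrr_step(i, j, a, ax, b, bx):
    """1D analog of Eq. (13): returns (i, j+1) built from (i+1, j) + X_AB * (i, j)."""
    xab = ax - bx
    return moment_integral(i + 1, j, a, ax, b, bx) + xab * moment_integral(i, j, a, ax, b, bx)

if __name__ == "__main__":
    args = (0.9, 0.2, 1.4, -0.5)
    print(hrr_step(2, 1, *args), moment_integral(2, 2, *args))
```

Because the relation does not involve the exponents, the same step can be applied to contracted quantities, which is the property exploited later in the text.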

Besides the above algorithm, there are at least three other possibilities to get the target integrals with OS-type recursions. The first one, labeled as OS2, evaluates the same auxiliary integrals with Eq. (10) as in OS1 and then applies the VRR to the ket side first as

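$$(\mathbf{0}\mathbf{0}|[\mathbf{l}_c+\mathbf{1}_x])^{(n)}=\frac{\alpha}{c}X_{PC}(\mathbf{0}\mathbf{0}|\mathbf{l}_c)^{(n+1)} \tag{14}$$

to construct the classes $(\mathbf{0}\mathbf{0}|\mathbf{l}_c)^{(n)}$ for $\max(1,L_c-L_a-L_b)\le l_c\le L_c$ and $L_c-l_c\le n\le L_a+L_b$. This is followed by building up the angular momentum of the first function on the bra side as

$$([\mathbf{l}_a+\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)^{(n)}=X_{PA}(\mathbf{l}_a\mathbf{0}|\mathbf{l}_c)^{(n)}-\frac{\alpha}{p}X_{PC}(\mathbf{l}_a\mathbf{0}|\mathbf{l}_c)^{(n+1)}+\frac{i_a}{2p}\left(([\mathbf{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)^{(n)}-\frac{\alpha}{p}([\mathbf{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)^{(n+1)}\right)+\frac{i_c}{2(p+c)}(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c-\mathbf{1}_x])^{(n+1)}, \tag{15}$$

to compute $(\mathbf{l}_a\mathbf{0}|\mathbf{L}_c)$ for $L_a\le l_a\le L_a+L_b$, and finally the algorithm is finished with Eq. (13).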

Apart from the VRR, another way to build up la or lc is to use the ETR24 arising from the translational invariance of integrals, and also Eqs. (4) and (13). For three-center ERIs, the ETR has the form

$$([\mathbf{l}_a+\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)=-\frac{b}{p}X_{AB}(\mathbf{l}_a\mathbf{0}|\mathbf{l}_c)+\frac{i_a}{2p}([\mathbf{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)-\frac{c}{p}(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c+\mathbf{1}_x])+\frac{i_c}{2p}(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c-\mathbf{1}_x]) \tag{16}$$

for the $\mathbf{l}_c\to\mathbf{l}_a$ conversion, and

$$(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c+\mathbf{1}_x])=-\frac{b}{c}X_{AB}(\mathbf{l}_a\mathbf{0}|\mathbf{l}_c)+\frac{i_a}{2c}([\mathbf{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)-\frac{p}{c}([\mathbf{l}_a+\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c) \tag{17}$$

for the $\mathbf{l}_a\to\mathbf{l}_c$ transfer. We note that, in principle, Eq. (17) also contains a fourth term on the right-hand side, $\frac{i_c}{2c}(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c-\mathbf{1}_x])$, but this term is canceled for the same reasons as discussed by Ahlrichs for the VRR40 when transforming to the solid harmonic basis. This cancellation also takes place for the third and fourth terms in both Eqs. (11) and (15), and the second term in Eq. (16), but only in the case when $L_b=0$. It should be noted that the numerical instability in the ETR associated with the addition $pX_{PA}/(c+d)+X_{QC}$55 (where, in the four-center case, d is the exponent of the fourth Gaussian and $\mathbf{Q}=(c\mathbf{C}+d\mathbf{D})/(c+d)$, D being the center of the fourth function) does not appear here. This is because in the absence of the fourth center, Eq. (13) only has to be applied to the bra side, reducing the aforementioned sum to $(p/c)X_{PA}=-(b/c)X_{AB}$. If we wish to build up the integrals necessary for Eq. (13) with Eq. (16), we cannot use Eq. (14) for the construction of the $(\mathbf{0}\mathbf{0}|\mathbf{l}_c)$ type classes; instead we have to employ the full vertical recurrence,40

$$(\mathbf{0}\mathbf{0}|[\mathbf{l}_c+\mathbf{1}_x])^{(n)}=\frac{\alpha}{c}X_{PC}(\mathbf{0}\mathbf{0}|\mathbf{l}_c)^{(n+1)}+\frac{i_c}{2c}\left((\mathbf{0}\mathbf{0}|[\mathbf{l}_c-\mathbf{1}_x])^{(n)}-\frac{\alpha}{c}(\mathbf{0}\mathbf{0}|[\mathbf{l}_c-\mathbf{1}_x])^{(n+1)}\right), \tag{18}$$

for the ket side. The terms corresponding to the ones in the big parentheses in Eq. (18) vanish in Eqs. (12) and (14) during the solid harmonic transformation40 of the ket side; however, with Eq. (16) terms belonging to angular momenta other than $l_c$ get built into the integrals to be transformed, and these will not cancel. The scheme where we first employ Eq. (18) to build up $\mathbf{l}_c$ and then Eq. (16) for $\mathbf{l}_a$ will be referred to as OS3. In this route, we first use Eq. (10) to calculate the $(\mathbf{0}\mathbf{0}|\mathbf{0})^{(n)}$ integrals for $[L_c \bmod 2]\le n\le L_a+L_b+L_c$, then Eq. (18) for the classes $(\mathbf{0}\mathbf{0}|\mathbf{l}_c)$ with $\max(L_c-L_a-L_b,1)\le l_c\le L_a+L_b+L_c$, and thereafter we apply Eq. (16) to get the $(\mathbf{l}_a\mathbf{0}|\mathbf{L}_c)$ classes for $L_a\le l_a\le L_a+L_b$. Finally, in the algorithm denoted as OS4, $\mathbf{l}_a$ is built up by Eq. (11), and $\mathbf{l}_c$ is incremented by the ETR, Eq. (17). Here the necessary $(\mathbf{0}\mathbf{0}|\mathbf{0})^{(n)}$ integrals are in the range $0\le n\le L_a+L_b+L_c$ and are used to calculate the $(\mathbf{l}_a\mathbf{0}|\mathbf{0})$ classes for $\max(L_a-L_c,1)\le l_a\le L_a+L_b+L_c$.
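The ETR of Eq. (16) can be recovered in a few lines from translational invariance together with Eqs. (4) and (13). The steps below are a sketch reconstructed from those equations, given for the x direction only; they are not a quotation of the original derivation.

```latex
% Step 1: translational invariance of the integral with respect to a common
% shift of the three centers.
\left(\frac{\partial}{\partial A_x}+\frac{\partial}{\partial B_x}
     +\frac{\partial}{\partial C_x}\right)(\mathbf{l}_a\mathbf{0}|\mathbf{l}_c)=0

% Step 2: apply Eq. (4) on each center (the second function has i_b = 0).
2a\,([\mathbf{l}_a+\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)
 - i_a\,([\mathbf{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)
 + 2b\,(\mathbf{l}_a\mathbf{1}_x|\mathbf{l}_c)
 + 2c\,(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c+\mathbf{1}_x])
 - i_c\,(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c-\mathbf{1}_x]) = 0

% Step 3: remove the angular momentum on the second function with Eq. (13).
(\mathbf{l}_a\mathbf{1}_x|\mathbf{l}_c)
 = ([\mathbf{l}_a+\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)
 + X_{AB}\,(\mathbf{l}_a\mathbf{0}|\mathbf{l}_c)

% Step 4: collect terms with p = a + b and divide by 2p, which gives Eq. (16).
([\mathbf{l}_a+\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)
 = -\frac{b}{p}X_{AB}\,(\mathbf{l}_a\mathbf{0}|\mathbf{l}_c)
 + \frac{i_a}{2p}\,([\mathbf{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{l}_c)
 - \frac{c}{p}\,(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c+\mathbf{1}_x])
 + \frac{i_c}{2p}\,(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c-\mathbf{1}_x])
```

Solving the relation of Step 2 (after Step 3) for the $(\mathbf{l}_a\mathbf{0}|[\mathbf{l}_c+\mathbf{1}_x])$ term instead, that is, dividing by 2c, gives Eq. (17) before the fourth term is dropped.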

C. McMurchie–Davidson scheme

The strategy of the MD method is to expand ERIs over Gaussian overlap distributions arising from multiplying GIaJaKa(𝐫𝟏,a,𝐀) and GIbJbKb(𝐫𝟏,b,𝐁) into integrals over Hermite Gaussian functions centered on P, defined as

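$$H_{\bar I_p\bar J_p\bar K_p}(\mathbf{r}_1,p,\mathbf{P})=\frac{\partial^{\bar L_p}\exp(-p r_P^2)}{\partial P_x^{\bar I_p}\,\partial P_y^{\bar J_p}\,\partial P_z^{\bar K_p}}, \tag{19}$$

where the bars over the total angular momentum and its components are used to distinguish them from those of the corresponding Cartesian Gaussians. In this scheme, one has to evaluate two-center Coulomb integrals over Hermite Gaussians centered on P and C, which, exploiting translational invariance (that is, $\partial/\partial P_x=-\partial/\partial C_x$), can be written as18

$$(\bar{\mathbf{l}}_p|\bar{\mathbf{l}}_c)=\frac{\theta_{pc}}{(2c)^{\bar l_c}}\,\frac{\partial^{\bar l_p+\bar l_c}F_0(\alpha R_{PC}^2)}{\partial P_x^{\bar i_p}\,\partial P_y^{\bar j_p}\,\partial P_z^{\bar k_p}\,\partial C_x^{\bar i_c}\,\partial C_y^{\bar j_c}\,\partial C_z^{\bar k_c}}=\frac{\theta_{pc}}{(-2c)^{\bar l_c}}\,\frac{\partial^{\bar l_p+\bar l_c}F_0(\alpha R_{PC}^2)}{\partial P_x^{\bar i_p+\bar i_c}\,\partial P_y^{\bar j_p+\bar j_c}\,\partial P_z^{\bar k_p+\bar k_c}}=(\bar{\mathbf{l}}_u) \tag{20}$$

with $\bar{\mathbf{l}}_u=\bar{\mathbf{l}}_p+\bar{\mathbf{l}}_c$. The scaling with $(2c)^{\bar l_c}$ is applied since for the Hermite Gaussian in the ket we follow the definition of Reine and co-workers,38 which will allow us to transform the ket side into the solid harmonic Gaussian basis without the transformation into Cartesian Gaussians first [note that this is not necessary for $(\bar{\mathbf{l}}_p|$]. The one-center integrals on the rightmost side of Eq. (20) can be computed by the two-term recursion18

$$(\bar{\mathbf{l}}_u+\mathbf{1}_x)^{(n)}=X_{PC}(\bar{\mathbf{l}}_u)^{(n+1)}+\bar i_u(\bar{\mathbf{l}}_u-\mathbf{1}_x)^{(n+1)} \tag{21}$$

with

$$(\bar{\mathbf{0}})^{(n)}=(-2\alpha)^n\,\kappa_{ab}\,\frac{\theta_{pc}}{(-2c)^{\bar L_c}}\,F_n(\alpha R_{PC}^2). \tag{22}$$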

From the one-center integrals, three-center ERIs with two Cartesian Gaussians in the bra and a Hermite Gaussian in the ket are evaluated as18

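$$(\mathbf{L}_a\mathbf{L}_b|\bar{\mathbf{L}}_c)=\sum_{\bar i_p=0}^{I_a+I_b}E_{\bar i_p}^{I_a,I_b}\sum_{\bar j_p=0}^{J_a+J_b}E_{\bar j_p}^{J_a,J_b}\sum_{\bar k_p=0}^{K_a+K_b}E_{\bar k_p}^{K_a,K_b}\,(\bar{\mathbf{l}}_p+\bar{\mathbf{L}}_c). \tag{23}$$

The E expansion coefficients appearing in Eq. (23) can be constructed by a set of recurrence relations,17

$$E_{\bar 0}^{i_a+1,0}=X_{PA}E_{\bar 0}^{i_a,0}+E_{\bar 1}^{i_a,0}, \tag{24}$$
$$E_{\bar 0}^{i_a,i_b+1}=X_{PB}E_{\bar 0}^{i_a,i_b}+E_{\bar 1}^{i_a,i_b}, \tag{25}$$
$$E_{\bar i_p+1}^{i_a,i_b}=\frac{1}{2p(\bar i_p+1)}\left(i_aE_{\bar i_p}^{i_a-1,i_b}+i_bE_{\bar i_p}^{i_a,i_b-1}\right),\quad \bar i_p\ge 0, \tag{26}$$

with $E_{\bar 0}^{0,0}=1$.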
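The recurrences of Eqs. (24)-(26) translate directly into a short memoized routine. The sketch below handles one Cartesian direction; the function name is hypothetical, and the two-term relation of Eq. (24) is applied here for arbitrary $i_b$ (the general form of that recurrence), which is an assumption going slightly beyond the $i_b=0$ case written in the text.

```python
from functools import lru_cache

def expansion_coefficients(a, ax, b, bx):
    """Return a function e(i_a, i_b, t) giving E_t^{i_a,i_b} for one Cartesian direction."""
    p = a + b
    px = (a * ax + b * bx) / p
    xpa, xpb = px - ax, px - bx

    @lru_cache(maxsize=None)
    def e(i, j, t):
        if i < 0 or j < 0 or t < 0 or t > i + j:
            return 0.0  # coefficients outside the valid index range vanish
        if i == 0 and j == 0:
            return 1.0  # E_0^{0,0} = 1; the kappa_ab factor is kept separately
        if t > 0:
            # Eq. (26), written for t = i_p + 1
            return (i * e(i - 1, j, t - 1) + j * e(i, j - 1, t - 1)) / (2.0 * p * t)
        if i > 0:
            # Eq. (24)
            return xpa * e(i - 1, j, 0) + e(i - 1, j, 1)
        # Eq. (25)
        return xpb * e(i, j - 1, 0) + e(i, j - 1, 1)

    return e

if __name__ == "__main__":
    e = expansion_coefficients(0.9, 0.2, 1.4, -0.5)
    print([e(2, 1, t) for t in range(4)])
```

For low orders the results can be compared with closed forms, for example $E_{\bar 0}^{1,1}=X_{PA}X_{PB}+1/(2p)$ and $E_{\bar 2}^{2,0}=1/(4p^2)$.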

The expansion defined by Eq. (23) can be applied to produce various types of three-center ERIs. In the MD1 algorithm, for example, we get the $(\mathbf{L}_a\mathbf{L}_b|\bar{\mathbf{L}}_c)$ classes directly. First the expansion coefficients are computed; e.g., in the x direction the $E_{\bar i_p}^{i_a,i_b}$ values are needed for $0\le i_a\le L_a$, $0\le i_b\le L_b$, and $0\le\bar i_p\le i_a+i_b$. This is followed by the calculation of the $(\bar{\mathbf{0}})^{(n)}$ integrals for $\lfloor\bar L_c/2\rfloor+[\bar L_c \bmod 2]\le n\le L_a+L_b+\bar L_c$, with $\lfloor x\rfloor$ denoting the integer part of x. The one-center integrals $(\bar{\mathbf{l}}_u)$ for $\bar L_c\le\bar l_u\le L_a+L_b+\bar L_c$ are built up by Eq. (21), from which the target integrals are readily assembled by Eq. (23). The work done in this assembly step can be reduced by performing it at an earlier stage to construct intermediate classes and using OS-type recursions for the evaluation of the target integrals. In the MD2 scheme, the $(\mathbf{l}_a\mathbf{0}|\bar{\mathbf{L}}_c)$ classes for $L_a\le l_a\le L_a+L_b$ are evaluated with Eq. (23). Here the necessary expansion coefficients are in the range of $0\le i_a\le L_a+L_b$, $i_b=0$, and $0\le\bar i_p\le i_a$, and the required one-center integrals are the same as in MD1. After the assembly, the final integrals are computed by Eq. (13). A third option (MD3) is to obtain the $(\mathbf{l}_a\mathbf{0}|\mathbf{0})^{(\bar L_c)}$ type intermediates for $\max(1,L_a-\bar L_c)\le l_a\le L_a+L_b$ with Eq. (23), then to build up $\bar{\mathbf{l}}_c$ with Eq. (12), and to finish with Eq. (13). Here the $(-1)^{\bar L_c}$ scaling factor is absent from Eq. (22), and the required $(\bar{\mathbf{0}})^{(n)}$ values are in the range of $\bar L_c\le n\le L_a+L_b+\bar L_c$ and used for calculating the $(\bar{\mathbf{l}}_u)^{(\bar L_c)}$ integrals for $0\le\bar l_u\le L_a+L_b$. The index range for the expansion coefficients is the same as in the MD2 scheme.

An alternative method for transforming the Hermite integrals into ones over Cartesian overlaps is the use of the

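$$\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p}=x_A^{i_a}y_A^{j_a}z_A^{k_a}\,x_B^{i_b}y_B^{j_b}z_B^{k_b}\,\frac{\partial^{\bar l_p}\exp(-p r_P^2)}{\partial P_x^{\bar i_p}\,\partial P_y^{\bar j_p}\,\partial P_z^{\bar k_p}} \tag{27}$$

hybrid functions17 on the bra side. As is clear from Eq. (27), these functions reduce to Hermite Gaussians if $\mathbf{l}_a=\mathbf{l}_b=\mathbf{0}$ and to Cartesian overlap distributions centered on P without the $\kappa_{ab}$ factor if $\bar{\mathbf{l}}_p=\mathbf{0}$. Introducing the notation $(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{l}}_c)$ for the auxiliary integrals over hybrid bras and Hermite kets and applying the recurrence relations17 for the functions in Eq. (27), we can write

$$(\Omega_{\mathbf{l}_a+\mathbf{1}_x,\mathbf{l}_b}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{l}}_c)=\bar i_p(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p-\mathbf{1}_x}|\bar{\mathbf{l}}_c)+X_{PA}(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{l}}_c)+\frac{1}{2p}(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p+\mathbf{1}_x}|\bar{\mathbf{l}}_c) \tag{28}$$

and

$$(\Omega_{\mathbf{l}_a,\mathbf{l}_b+\mathbf{1}_x}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{l}}_c)=\bar i_p(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p-\mathbf{1}_x}|\bar{\mathbf{l}}_c)+X_{PB}(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{l}}_c)+\frac{1}{2p}(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p+\mathbf{1}_x}|\bar{\mathbf{l}}_c). \tag{29}$$

Relying on these relations, one can start from the two-center Hermite integrals $(\Omega_{\mathbf{0},\mathbf{0}}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{L}}_c)$, which are, by Eq. (20), practically scaled one-center $(\bar{\mathbf{l}}_p+\bar{\mathbf{L}}_c)$ integrals, and, through hybrid intermediates, convert these into the target $(\Omega_{\mathbf{L}_a,\mathbf{L}_b}^{\bar{\mathbf{0}}}|\bar{\mathbf{L}}_c)=(\mathbf{L}_a\mathbf{L}_b|\bar{\mathbf{L}}_c)$ classes with a purely Cartesian bra side. In the MD4, MD5, and MD6 schemes, we proceed the same way as in the MD1, MD2, and MD3 cases, respectively, with the difference that the calculation of the expansion coefficients is omitted, and instead of Eq. (23) we apply Eqs. (28) and (29) for the transformation of the bra side.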

D. Gill–Head-Gordon–Pople algorithm

Here we consider the original algorithm of Gill, Head-Gordon, and Pople27 with the modifications needed for three-center ERIs. In this method, the procedure is very similar to the MD5 scheme. The difference lies in the introduction of the β,ζ-scaled auxiliary integrals defined as

$$(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{l}}_c)_{\beta,\zeta}=\frac{(2b)^{\beta}}{(2p)^{\zeta}}(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{l}}_c), \tag{30}$$

where β and ζ are positive integers. With these quantities, substituting $X_{PA}=-\frac{2b}{2p}X_{AB}$, Eq. (28) can be rewritten as27

$$(\Omega_{\mathbf{l}_a+\mathbf{1}_x,\mathbf{l}_b}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{l}}_c)_{\beta,\zeta}=\bar i_p(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p-\mathbf{1}_x}|\bar{\mathbf{l}}_c)_{\beta,\zeta}-X_{AB}(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{l}}_c)_{\beta+1,\zeta+1}+(\Omega_{\mathbf{l}_a,\mathbf{l}_b}^{\bar{\mathbf{l}}_p+\mathbf{1}_x}|\bar{\mathbf{l}}_c)_{\beta,\zeta+1}, \tag{31}$$

which is a relation that does not depend explicitly on the Gaussian exponents and therefore can be applied to the β,ζ-scaled auxiliary integrals transformed to the contracted basis.

The strategy of the GHP scheme for three-center ERIs is thus the following. First, the necessary Hermite integrals $(\Omega_{\mathbf{0},\mathbf{0}}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{L}}_c)=(\bar{\mathbf{l}}_p+\bar{\mathbf{L}}_c)$ are computed for $0\le\bar l_p\le L_a+L_b$. Then, all the scaled classes of these integrals required to compute the $(\Omega_{\mathbf{l}_a,\mathbf{0}}^{\bar{\mathbf{0}}}|\bar{\mathbf{L}}_c)_{0,0}$ classes with Eq. (31) for $L_a\le l_a\le L_a+L_b$ are produced. For each of these classes, we need to start from the $(\Omega_{\mathbf{0},\mathbf{0}}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{L}}_c)_{\beta,\zeta}$ scaled Hermite intermediates for $0\le\bar l_p\le l_a$. To determine the β,ζ-scaled classes needed for each $(\Omega_{\mathbf{0},\mathbf{0}}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{L}}_c)$ that will be used for the calculation of a given $(\Omega_{\mathbf{l}_a,\mathbf{0}}^{\bar{\mathbf{0}}}|\bar{\mathbf{L}}_c)_{0,0}$, we have to trace back the recursion defined by Eq. (31). As each recursion step increments $\mathbf{l}_a$ by $\mathbf{1}_\sigma$, there are $l_a$ steps. By analyzing the positions where $(\Omega_{\mathbf{0},\mathbf{0}}^{\bar{\mathbf{l}}_p}|\bar{\mathbf{L}}_c)_{\beta,\zeta}$ and the intermediates connected to it can appear in Eq. (31) during the recursion, we see that such intermediates have to be the third term at least $\bar l_p$ times to reduce $\bar{\mathbf{l}}_p$ to $\bar{\mathbf{0}}$. In the additional $l_a-\bar l_p$ steps, these intermediates have to appear at the first and the third positions an equal number of times if $\bar{\mathbf{l}}_p$ is to stay equal to $\bar{\mathbf{0}}$, and in the remaining steps they have to be the second term. From this it follows that for each $l_a,\bar l_p$ pair there are $\lfloor(l_a-\bar l_p)/2\rfloor+1$ different scalings to consider. The β and ζ for these can be obtained by looking at how the changes in these values depend on the position the intermediates take in Eq. (31). The scaling indices are determined by how many times the connected intermediates take the second or third position. For example, in the case they take the third place $\bar l_p$ times and the second position in the remaining $l_a-\bar l_p$ steps, the values of β and ζ are $l_a-\bar l_p$ and $l_a$, respectively. Another example is when the intermediate takes the second position in two fewer steps in the recursion, and both the first and the third places are taken one more time than in the former example, making the scaling indices $\beta=l_a-\bar l_p-2$ and $\zeta=l_a-1$. Let us denote the scaled class in the first example as class 1 and that in the second example as class 2.
In general, class n can be defined for the scaling indices $\beta=l_a-\bar l_p-2(n-1)$ and $\zeta=l_a-(n-1)$. After these classes for $1\le n\le\lfloor(l_a-\bar l_p)/2\rfloor+1$ have been calculated for all the primitive classes, the scaled one-center integrals are transformed to the contracted basis by Eq. (6). When using segmented basis sets, the multiplication work in this contraction step can be reduced to simply multiplying Eq. (22) with the appropriate $d_a^{\chi_A}$ coefficients. Following the contraction, Eq. (31) is applied, and lastly Eq. (13) is used to build up $\mathbf{l}_b$.
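The bookkeeping of the scaled classes can be made concrete with a small helper that lists the (β, ζ) pairs for a given $l_a$ and $\bar l_p$. This is a minimal sketch of the counting rule described above; the function name is hypothetical, and the number of classes is assumed to be the integer part of $(l_a-\bar l_p)/2$ plus one.

```python
def ghp_scalings(la, lp_bar):
    """List the (beta, zeta) scaling indices of classes n = 1, 2, ... for given l_a and l_p-bar."""
    if not 0 <= lp_bar <= la:
        raise ValueError("requires 0 <= lp_bar <= la")
    n_classes = (la - lp_bar) // 2 + 1
    # class n: beta = l_a - l_p - 2(n - 1), zeta = l_a - (n - 1)
    return [(la - lp_bar - 2 * (n - 1), la - (n - 1)) for n in range(1, n_classes + 1)]

if __name__ == "__main__":
    for la in range(4):
        for lp_bar in range(la + 1):
            print(la, lp_bar, ghp_scalings(la, lp_bar))
```

For $l_a=2$ and $\bar l_p=0$, for instance, the two classes are (β, ζ) = (2, 2), where the second term of Eq. (31) is taken twice, and (0, 1), where the third and the first terms are taken once each.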

E. Rys quadrature method

The algorithms discussed before are all based on calculating scaled Boys functions of various orders and using them as starting values for a recursive procedure. Inspecting these methods and utilizing Eq. (9) it is evident that the target integral can be expressed as

$$(\mathbf{L}_a\mathbf{L}_b|\mathbf{L}_c)=\sum_{n=0}^{L_a+L_b+L_c}Z_nF_n(\alpha R_{PC}^2)=\int_0^1\sum_{n=0}^{L_a+L_b+L_c}Z_n\,t^{2n}\exp(-\alpha R_{PC}^2t^2)\,dt, \tag{32}$$

where the values of the coefficients $Z_n$ can be obtained by, for example, backtracking the OS recursions until the integral is only expanded in Boys functions. Eq. (32) is an integral over a polynomial $f(t^2)=\sum_{n=0}^{L_a+L_b+L_c}Z_n t^{2n}$ multiplied by a weight function $W(T,t^2)=\exp(-Tt^2)$ with $T=\alpha R_{PC}^2$. According to the theory of Gauss–Rys quadrature,10,17 these integrals can be evaluated exactly as

$$\int_0^1 f(t^2)\,W(T,t^2)\,dt=\sum_{n=1}^{N_{rts}}f(t_n^2)\,w_n \tag{33}$$

with $N_{rts}$ being an integer satisfying $N_{rts}>(L_a+L_b+L_c)/2$ and $t_n^2$ being the square of the nth positive root of the $(2N_{rts})$th order Rys polynomial in t. These polynomials are defined to be orthonormal on the interval [0,1] with the weight function $W(T,t^2)$. $w_n$ is the T-dependent weight factor of the quadrature associated with $t_n^2$. For the calculation of the roots of the Rys polynomials and the weight factors, we followed the approach of King and Dupuis10 for the $N_{rts}\le 5$ cases and the work of Flocke and Lotrich56 for $N_{rts}>5$.
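The simplest instance of Eq. (33) is the one-root quadrature, which is exact for polynomials that are linear in $t^2$. In that case the weight and the root follow directly from the first two Boys functions, $w_1=F_0(T)$ and $t_1^2=F_1(T)/F_0(T)$. The sketch below illustrates this special case only; it is not the root-finding procedure of the cited works, the function names are hypothetical, and the way $F_1$ is obtained from $F_0$ is adequate only for T well away from zero.

```python
import math

def n_rys_roots(la, lb, lc):
    """Smallest integer N_rts satisfying N_rts > (La + Lb + Lc) / 2."""
    return (la + lb + lc) // 2 + 1

def boys01(T):
    """F_0(T) and F_1(T) for T > 0; the step F_1 = (F_0 - exp(-T)) / (2T) loses accuracy for tiny T."""
    f0 = 0.5 * math.sqrt(math.pi / T) * math.erf(math.sqrt(T))
    f1 = (f0 - math.exp(-T)) / (2.0 * T)
    return f0, f1

def one_root_rys(T):
    """Squared root t_1^2 and weight w_1 of the one-point Rys quadrature."""
    f0, f1 = boys01(T)
    return f1 / f0, f0

def rys_integral(z0, z1, T):
    """Eq. (33) with N_rts = 1 applied to f(t^2) = Z_0 + Z_1 t^2."""
    t2, w = one_root_rys(T)
    return w * (z0 + z1 * t2)

if __name__ == "__main__":
    print(n_rys_roots(1, 0, 0), rys_integral(0.3, -1.2, 1.7))
```

With one root the quadrature returns $Z_0F_0(T)+Z_1F_1(T)$, which is exactly the form of Eq. (32) for a shell triplet with $L_a+L_b+L_c=1$.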

Substituting the identity

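$$\frac{1}{|\mathbf{r}_1-\mathbf{r}_2|}=\frac{2}{\pi^{1/2}}\int_0^\infty\exp(-|\mathbf{r}_1-\mathbf{r}_2|^2u^2)\,du \tag{34}$$

into Eq. (7) and changing the order of integration, we get

$$(\mathbf{L}_a\mathbf{L}_b|\mathbf{L}_c)=\frac{2}{\pi^{1/2}}\int_0^\infty\left[\iint G_{I_aJ_aK_a}(\mathbf{r}_1,a,\mathbf{A})\,G_{I_bJ_bK_b}(\mathbf{r}_1,b,\mathbf{B})\exp(-|\mathbf{r}_1-\mathbf{r}_2|^2u^2)\,G_{I_cJ_cK_c}(\mathbf{r}_2,c,\mathbf{C})\,d\mathbf{r}_1\,d\mathbf{r}_2\right]du. \tag{35}$$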

It is possible to factorize the bracketed integrand in Eq. (35) into three two-dimensional (2D) integrals associated with the three Cartesian directions12 to get

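$$(\mathbf{L}_a\mathbf{L}_b|\mathbf{L}_c)=\frac{2}{\pi^{1/2}}\int_0^\infty\bar\Theta_x^{I_a,I_b,I_c}(u^2)\,\bar\Theta_y^{J_a,J_b,J_c}(u^2)\,\bar\Theta_z^{K_a,K_b,K_c}(u^2)\,du, \tag{36}$$

where

$$\bar\Theta_x^{I_a,I_b,I_c}(u^2)=\iint x_A^{I_a}x_B^{I_b}x_C^{I_c}\exp\left[-\mu X_{AB}^2-p x_P^2-c x_C^2-u^2|x_1-x_2|^2\right]dx_1\,dx_2. \tag{37}$$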

By making a change of variable from u to t as

$$u^2=\frac{\alpha t^2}{1-t^2}, \tag{38}$$
$$du=dt\,\alpha^{1/2}\left(\frac{1}{1-t^2}\right)^{3/2}, \tag{39}$$

defining the modified 2D integrals as

$$\Theta_x^{I_a,I_b,I_c}(t^2)=\bar\Theta_x^{I_a,I_b,I_c}(u^2)\exp(\alpha X_{PC}^2t^2)\,(1-t^2)^{-1/2}, \tag{40}$$

and also noting that as u varies from 0 to infinity, t varies from 0 to 1, we can rewrite Eq. (35) as

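$$(\mathbf{L}_a\mathbf{L}_b|\mathbf{L}_c)=2\left(\frac{\alpha}{\pi}\right)^{1/2}\int_0^1\Theta_x^{I_a,I_b,I_c}(t^2)\,\Theta_y^{J_a,J_b,J_c}(t^2)\,\Theta_z^{K_a,K_b,K_c}(t^2)\,W(T,t^2)\,dt. \tag{41}$$

From Eq. (41) it is clear that $f(t^2)$ can be written as

$$f(t^2)=2\left(\frac{\alpha}{\pi}\right)^{1/2}\Theta_x^{I_a,I_b,I_c}(t^2)\,\Theta_y^{J_a,J_b,J_c}(t^2)\,\Theta_z^{K_a,K_b,K_c}(t^2), \tag{42}$$

and, since the 2D integrals are polynomials in $t^2$,12 Eq. (33) takes the form

$$\int_0^1 f(t^2)\,W(T,t^2)\,dt=2\left(\frac{\alpha}{\pi}\right)^{1/2}\sum_{n=1}^{N_{rts}}\Theta_x^{I_a,I_b,I_c}(t_n^2)\,\Theta_y^{J_a,J_b,J_c}(t_n^2)\,\Theta_z^{K_a,K_b,K_c}(t_n^2)\,w_n. \tag{43}$$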

The value of ΘxIa,Ib,Ic(tn2) can be calculated recursively17 (and similarly for the y and z directions) as

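$$\Theta_x^{i_a+1,0,0}(t_n^2)=\left(X_{PA}-\frac{\alpha}{p}X_{PC}t_n^2\right)\Theta_x^{i_a,0,0}(t_n^2)+\frac{i_a}{2p}\left(1-\frac{\alpha}{p}t_n^2\right)\Theta_x^{i_a-1,0,0}(t_n^2) \tag{44}$$

for $i_a$ and as

$$\Theta_x^{i_a,0,i_c+1}(t_n^2)=\frac{\alpha}{c}X_{PC}t_n^2\,\Theta_x^{i_a,0,i_c}(t_n^2)+\frac{i_at_n^2}{2(p+c)}\Theta_x^{i_a-1,0,i_c}(t_n^2) \tag{45}$$

for $i_c$. Finally, $i_b$ is built up by

$$\Theta_x^{i_a,i_b+1,i_c}(t_n^2)=\Theta_x^{i_a+1,i_b,i_c}(t_n^2)+X_{AB}\Theta_x^{i_a,i_b,i_c}(t_n^2). \tag{46}$$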

We note that, in the general case, Eq. (45) contains a third term which can be neglected if the ket side is to be transformed to the solid harmonic Gaussian basis. The derivation of Eq. (45) is given in Appendix A. Instead of performing the assembly step as it is defined by Eq. (43) and starting the recursion of Eq. (44) with $\Theta_x^{0,0,0}(t_n^2)=\pi\exp(-\mu X_{AB}^2)/\sqrt{pc}$ (and analogously for the other directions),11 it is more beneficial to start with $\Theta_z^{0,0,0}(t_n^2)=\theta_{pc}\kappa_{ab}w_n$ and $\Theta_x^{0,0,0}(t_n^2)=\Theta_y^{0,0,0}(t_n^2)=1$, making the equation for the assembly

$$\int_0^1 f(t^2)\,W(T,t^2)\,dt=\sum_{n=1}^{N_{rts}}\Theta_x^{I_a,I_b,I_c}(t_n^2)\,\Theta_y^{J_a,J_b,J_c}(t_n^2)\,\Theta_z^{K_a,K_b,K_c}(t_n^2). \tag{47}$$

For the four-center ERIs, it has also been shown25 that the direct evaluation of the target integrals from the 2D integrals is not the only possibility, but it can be advantageous to construct intermediate integrals from the 2D ones and use OS-type recursions to get the target integral.

Here we will investigate three possibilities for the three-center ERIs. In the RYS1 algorithm, we evaluate the $(\mathbf{L}_a\mathbf{L}_b|\mathbf{L}_c)$ integrals directly by Eq. (47). For this purpose, we have to compute $\Theta_x^{i_a,i_b,i_c}(t_n^2)$ for $1\le i_a\le L_a$, $1\le i_b\le L_b$, and $1\le i_c\le L_c$ for the $N_{rts}$ roots and for the three directions. In the RYS2 scheme, the $(\mathbf{l}_a\mathbf{0}|\mathbf{L}_c)$ classes are calculated on the quadrature for $L_a\le l_a\le L_a+L_b$, then the OS-type HRR, Eq. (13), is applied. The indices of the necessary 2D integrals here are in the range of $1\le i_a\le L_a+L_b$, $i_b=0$, and $1\le i_c\le L_c$. We also explored here a completely different strategy which has not yet been considered in the literature even for four-center integrals. We utilize that it is also possible to construct the $(\mathbf{l}_a\mathbf{0}|\mathbf{0})^{(L_c)}$ auxiliary integrals as

$$(\mathbf{l}_a\mathbf{0}|\mathbf{0})^{(L_c)}=\sum_{n=0}^{l_a}Z_nF_{n+L_c}(\alpha R_{PC}^2)=\int_0^1\sum_{n=0}^{l_a}Z_n\,t^{2(n+L_c)}\exp(-\alpha R_{PC}^2t^2)\,dt. \tag{48}$$

In this case, the value of the polynomial f at $t_n^2$ can be written as

$$f(t_n^2)=t_n^{2L_c}\,2\left(\frac{\alpha}{\pi}\right)^{1/2}\Theta_x^{i_a,0,0}(t_n^2)\,\Theta_y^{j_a,0,0}(t_n^2)\,\Theta_z^{k_a,0,0}(t_n^2). \tag{49}$$

The extra multiplication with $t_n^{2L_c}$ can also be built into $\Theta_z^{0,0,0}(t_n^2)$. In this algorithm (RYS3), the needed 2D integrals are $\Theta_x^{i_a,0,0}(t_n^2)$ for $1\le i_a\le L_a+L_b$ for all the roots and directions, the $(\mathbf{l}_a\mathbf{0}|\mathbf{0})^{(L_c)}$ classes are constructed for $\max(0,L_a-L_c)\le l_a\le L_a+L_b$ by Eq. (47), and the target integrals are built up via Eqs. (12) and (13).

F. Algorithmic considerations

Since its introduction the HRR equation, Eq. (13), has been a standard tool for evaluating molecular integrals over Gaussian functions. In addition to being a simple two-term recurrence relation, it is also independent of the basis set exponents, making it possible to apply it to contracted integrals instead of primitive ones, which (usually) means that a smaller number of integrals are to be treated. The same is true for the transformation to the solid harmonic Gaussian basis, and it has also been proposed that these two operations for one side (bra or ket) can be efficiently combined into a single matrix multiplication.56 On the other hand, if we choose to use Eqs. (13) and (5) at the contracted level, we have to first contract the components of the classes (la0|Lc) for LalaLa+Lb, which consist of [(La + Lb + 1) (La + Lb + 2) (La + Lb + 3)/6 − 1 − La(La + 1)/2](Lc + 1) (Lc + 2)/2 integrals for every final class of (LaLb|Lc). If we perform the HRR and the solid harmonic transformation at the primitive level instead, this number becomes (2La + 1) (2Lb + 1) (2Lc + 1), which is smaller in all the cases. This does not only affect the operation count of the contraction step but the memory use of the code as well. For example, if we apply the nested loop structure shown in Algorithm 1, the arrays storing the partially and fully contracted integrals will be the largest ones used in the process of evaluating all (LaLb|Lc) ERIs for three given centers. This means that we can expect the most data cache-miss events (meaning that the copy of the data stored at a referenced memory address cannot be found in the cache memory of the central processing unit (CPU)) to happen at this stage of the algorithm. 
Since fetching data from main memory is about an order of magnitude slower than from the cache (two orders of magnitude if the data reside in the first level of the cache), such misses can have a considerable effect on the performance of the code, and fewer misses are expected for a smaller array. Thus we see that it is not a trivial decision where Eqs. (13) and (5) should be applied. The schemes where the HRR and the solid harmonic transformation are done at the primitive level will be denoted as IN, while the ones where these two steps are performed at the contracted level will be labeled as OUT.
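
The two integral counts compared above can be tabulated with a short script. This is only an illustrative sketch: the function names are hypothetical, and the OUT count is obtained by directly summing the Cartesian components of the (la0|Lc) classes with La ≤ la ≤ La + Lb, which is our reading of what has to be contracted. For (dd|d) it gives 186 and 125 integrals for OUT and IN, respectively, the values quoted in Sec. IV.

```python
def n_cartesian(l):
    """Number of Cartesian Gaussian components in a shell of angular momentum l."""
    return (l + 1) * (l + 2) // 2


def n_solid_harmonic(l):
    """Number of solid harmonic components in a shell of angular momentum l."""
    return 2 * l + 1


def integrals_to_contract(La, Lb, Lc):
    """Return the (OUT, IN) number of integrals to contract per (La Lb|Lc) class.

    OUT: Cartesian (la 0|Lc) classes with La <= la <= La + Lb (assumed count).
    IN:  one solid harmonic (La Lb|Lc) class.
    """
    n_out = sum(n_cartesian(la) for la in range(La, La + Lb + 1)) * n_cartesian(Lc)
    n_in = n_solid_harmonic(La) * n_solid_harmonic(Lb) * n_solid_harmonic(Lc)
    return n_out, n_in


if __name__ == "__main__":
    for label, shells in [("(dd|d)", (2, 2, 2)), ("(ff|g)", (3, 3, 4))]:
        print(label, integrals_to_contract(*shells))
```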

Algorithm 1.

abc primitive loop order.

Loop over a
 Loop over b
  Algorithm pPRE2: estimate (00|0) for the smallest c
  Loop over c
   Algorithm pPRE1: estimate (00|0)
   Algorithm OUT: Build up (la0|Lc) for La ≤ la ≤ La + Lb in the Cartesian Gaussian basis
   Algorithm IN: Build up (LaLb|Lc) in the solid harmonic Gaussian basis
  End loop
  Contract the third function for all classes with exponents a and b
 End loop
 Contract the second function for all classes with exponent a
End loop
Contract the first function for all classes
Loop over χA (executed only in the case of algorithm OUT)
 Loop over χB
  Algorithm cPRE2: Look up the integral of highest absolute value in the contracted (la0|Lc) classes needed for the contracted (LaLb|Lc) class with the smallest c
  Algorithm cPRE3: Estimate the integral of highest absolute value in the contracted (la0|Lc) classes needed for the contracted (LaLb|Lc) class with the smallest c
  Loop over χC
   Algorithm cPRE1: Look up the integral of highest absolute value in the contracted (la0|Lc) classes needed for the contracted (LaLb|Lc) class
   Algorithm OUT: perform HRR to get (LaLb|Lc), perform solid harmonic transformation
  End loop
 End loop
End loop
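
The staged contraction of Algorithm 1 can be illustrated by a toy example. The following is a minimal sketch, not the generated Fortran code: `primitive_integral` is a hypothetical stand-in (not a real ERI), every class is assumed to hold a single integral, and prescreening is omitted. The partial sums are accumulated in small buffers, and the result is compared against an unstaged triple sum.

```python
import math


def primitive_integral(a, b, c):
    """Hypothetical stand-in for one primitive class value; NOT a real ERI."""
    return math.exp(-a * b / (a + b)) / math.sqrt(a + b + c)


def contract_abc(exps, coefs):
    """Staged contraction in the abc loop order for classes of size N_S = 1.

    exps  = (exps_a, exps_b, exps_c): primitive exponents on the three centers.
    coefs = (d_a, d_b, d_c): d_x[i][k] is the coefficient of primitive i
            in contracted function k on that center.
    Returns result[ka][kb][kc].
    """
    (ea, eb, ec), (da, db, dc) = exps, coefs
    na, nb, nc = len(da[0]), len(db[0]), len(dc[0])
    result = [[[0.0] * nc for _ in range(nb)] for _ in range(na)]
    for ia, a in enumerate(ea):
        part_a = [[0.0] * nc for _ in range(nb)]  # contracted over b and c
        for ib, b in enumerate(eb):
            prims = [primitive_integral(a, b, c) for c in ec]  # loop over c
            # Contract the third function for the current exponents a and b.
            part_ab = [sum(dc[ic][kc] * prims[ic] for ic in range(len(ec)))
                       for kc in range(nc)]
            # Accumulate the contraction of the second function for this a.
            for kb in range(nb):
                for kc in range(nc):
                    part_a[kb][kc] += db[ib][kb] * part_ab[kc]
        # Accumulate the contraction of the first function.
        for ka in range(na):
            for kb in range(nb):
                for kc in range(nc):
                    result[ka][kb][kc] += da[ia][ka] * part_a[kb][kc]
    return result


def contract_direct(exps, coefs):
    """Reference: unstaged triple sum over all primitives."""
    (ea, eb, ec), (da, db, dc) = exps, coefs
    return [[[sum(da[i][ka] * db[j][kb] * dc[k][kc] * primitive_integral(a, b, c)
                  for i, a in enumerate(ea)
                  for j, b in enumerate(eb)
                  for k, c in enumerate(ec))
              for kc in range(len(dc[0]))]
             for kb in range(len(db[0]))]
            for ka in range(len(da[0]))]
```

The staged variant multiplies by each contraction coefficient once per partially contracted value rather than once per primitive triplet, which is why the sizes of the intermediate buffers matter for the cache behavior.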

Our contraction procedure distinguishes between contracted and uncontracted functions for all three centers, especially because there can be a significant number of uncontracted functions in generally contracted basis sets, e.g., in the cc-pVXZ bases.57,58 For example, in the cc-pVTZ basis for elements Li to Ne all the d and f functions are uncontracted, and out of the four s and three p functions only two and one are contracted, respectively, and all the functions in the corresponding fitting basis,59 cc-pVTZ-RI, are uncontracted. For the integrals that are evaluated over primitives which contribute to an uncontracted function, the quantity $\theta_{pc}\kappa_{ab}$ is multiplied by the norm factor of the function, which is otherwise absorbed into the contraction coefficients, and the integrals are written directly into the array that stores the contracted integrals; therefore, both the floating-point and memory operations for the contraction are saved. If these primitives also contribute to other, contracted functions, the coefficients of the affected primitives for these contracted functions in Eq. (6) are divided by the above-mentioned norm. Further notes on the efficient treatment of integral contraction are given in Sec. IV.

The sizes of the arrays for integral contraction can be further reduced when the auxiliary basis set used for the density fitting approximation is uncontracted even if the functions on centers A and B are contracted. If we change the order of loops from a, b, c to c, a, b, as shown in Algorithm 2, the sizes of the arrays for the contraction of the first and second functions are reduced by a factor equal to the number of contracted functions on the third center. Here the loop over the exponents of the ket side is also the loop over the contracted functions on C, and all calculations are performed inside this loop. This scheme, however, has the disadvantage that we have to precalculate the a- and b-dependent quantities in a separate loop to avoid unnecessary recalculations. Schemes with the a,b,c primitive loop structure will be referred to as abc, while the ones with c,a,b order will be denoted by cab.

Algorithm 2.

cab primitive loop order.

Loop over a
 Loop over b
  Calculate the quantities depending on functions in the bra
 End loop
End loop
Loop over c
 Loop over a
  Algorithm pPRE2: estimate (00|0) for the smallest b
  Loop over b
   Algorithm pPRE1: estimate (00|0)
   Algorithm OUT: Build up (la0|Lc) for La ≤ la ≤ La + Lb in the Cartesian Gaussian basis
   Algorithm IN: Build up (LaLb|Lc) in the solid harmonic Gaussian basis
  End loop
  Contract the second function for all classes with exponents c and a
 End loop
 Contract the first function for all classes with exponent c
 Loop over χA (executed only in the case of algorithm OUT)
  Loop over χB
   Algorithm cPRE1: Look up the integral of highest absolute value in the contracted (la0|Lc) classes needed for the contracted (LaLb|Lc) class
   Algorithm cPRE4: Estimate the integral of highest absolute value in the contracted (la0|Lc) classes needed for the contracted (LaLb|Lc) class
   Algorithm OUT: perform HRR to get (LaLb|Lc), perform solid harmonic transformation
  End loop
 End loop
End loop

Another aspect that can have a strong effect on the performance is the prescreening of integrals which are lower in absolute value than a user-defined threshold, hereafter denoted by ε. In our code, as usual, the entire shell triplets are prescreened invoking the Schwarz inequality,7 and we also employ the distance-dependent estimator of Valeev and co-workers.60 In addition, the screening of the primitive integrals is also implemented. For the latter, the threshold is also tied to ε by dividing it by the maximal level of contraction, that is, the product of the number of primitive functions on each center. Exceptions to this rule are integrals that contain a primitive (centered on, for example, A) which contributes to only one contracted function χA. Then, ε is not divided by the number of primitives on A but rather by the level of contraction for χA, making the threshold for primitive prescreening higher. For the estimation of the magnitude of the primitive integrals, we will use the value of the (00|0) ERI evaluated with the exponents of the functions of higher angular momentum. Instead of directly calculating (00|0) according to Eq. (8), we can use the upper bound for the zeroth-order Boys function,17 from which we get

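$$(\mathbf{0}\mathbf{0}|\mathbf{0}) = \theta_{pc}\kappa_{ab}F_0(\alpha R_{PC}^2) \le \theta_{pc}\kappa_{ab}\min\left(1, \sqrt{\frac{\pi}{4\alpha R_{PC}^2}}\right).$$ (50)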

The minimum criterion appears since the approximation used in Eq. (50) is only accurate for high values of $\alpha R_{PC}^2$ (greater than about 74), and for smaller arguments it can give results greater than 1, which is the highest value the zeroth-order Boys function can take (when $\alpha R_{PC}^2 = 0$). In actual calculations, it is more beneficial to use the square of the rightmost side of Eq. (50) for screening, so the expensive square root calculation only has to be done for classes with small $\alpha R_{PC}^2$ that survive the prescreening. In this method (algorithm pPRE1), the estimate for $|(00|0)|^2$ is compared to the square of the threshold, and if the former value is greater, the class is evaluated. This is not an exact screening since Eq. (50) is not a rigorous upper bound for the target ERIs. Instead, this approach is related to the one proposed by Almlöf and co-workers,6 who used the common factor (in our case $\kappa_{ab}\theta_{pc}$) by which all the integrals in a class are multiplied to gain an estimate for the magnitude of the primitive integrals in a given class. In our scheme, this value is multiplied by a number smaller than 1, resulting in a less precise but more efficient screening method. In practice, we found that it can be more efficient to screen a batch of primitive exponent triplets than each individual one. Here we make use of the fact that the value of the right-hand side of Eq. (50) increases as the Gaussian exponent c of the ket side decreases. This can be seen by noting that $\partial\alpha/\partial c = p^2/(p+c)^2$ is always a positive number. Hence, we only need to estimate the (00|0) integral with the smallest c in an abc scheme before the innermost loop (algorithm pPRE2). One could proceed the same way in a cab scheme, estimating the integral with the smallest b before the loop over b, but we found this choice to be inefficient, as will be discussed in Sec. V. The accuracy of the pPRE2 screening method and the effect of its inexact nature on HF energies are discussed in the supplementary material.
The derivation of an exact but less efficient prescreening method based on the Schwarz inequality7 is presented in Appendix B.
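
The square-root-free test of the pPRE1 scheme can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: the function names are hypothetical, `prefactor` stands for the product θpcκab, `x` for αR²PC, and the primitive threshold is taken as ε divided by the product of the numbers of primitives, ignoring the exceptions mentioned above. The closed form F0(x) = (1/2)(π/x)^(1/2) erf(x^(1/2)) is used only to check the bound.

```python
import math


def boys_f0(x):
    """Zeroth-order Boys function F0(x) = (1/2) sqrt(pi/x) erf(sqrt(x))."""
    if x < 1e-12:
        return 1.0 - x / 3.0  # leading terms of the Taylor series around x = 0
    return 0.5 * math.sqrt(math.pi / x) * math.erf(math.sqrt(x))


def f0_bound_squared(x):
    """Square of min(1, sqrt(pi / (4 x))); evaluated without a square root."""
    if x <= 0.0:
        return 1.0
    return min(1.0, math.pi / (4.0 * x))


def primitive_threshold(eps, n_prim_a, n_prim_b, n_prim_c):
    """Threshold for primitives: eps divided by the maximal level of contraction."""
    return eps / (n_prim_a * n_prim_b * n_prim_c)


def survives_screening(prefactor, x, threshold):
    """Keep the class if the squared (00|0) estimate exceeds the squared threshold."""
    return prefactor * prefactor * f0_bound_squared(x) > threshold * threshold
```

For a batch-type test in the spirit of pPRE2, the same function would be called once with the quantities belonging to the smallest exponent c instead of once per exponent triplet.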

The primitive prescreening described above does not reduce the work of the HRR and the solid harmonic transformation steps if these are performed at the contracted level (algorithm OUT). The simplest option in this case is, for each combination of the contracted functions, to check if the largest value out of the contracted (la0|Lc) classes needed for a class of (LaLb|Lc) is greater than the threshold before applying Eqs. (13) and (5) to get the given class (algorithm cPRE1). We can also choose to screen a bigger batch of contracted classes instead by performing the search for the integral of highest absolute value before the loop over χC in an abc scheme or χB in a cab scheme. This is advantageous when the fitting basis is uncontracted and an abc scheme is applied (see Algorithm 1). In these cases, we will work with the assumption that the integrals involving the most diffuse functions (that is, the smallest c) on the ket side will have higher absolute values than those containing higher c exponents, and therefore screening for the classes with the smallest c is enough to see if any of the integrals in the batch will reach the threshold (algorithm cPRE2). Like the pPRE1 and pPRE2 methods, this is not a rigorous screening, but its accuracy is demonstrated in the supplementary material. An alternative method is to estimate the integral with the highest absolute value out of the screened batch. For this purpose, we save the estimates of the (00|0) integrals made by Eq. (50). Then, an estimated upper bound for the integral of highest value of a contracted class is gained by taking the (00|0) estimate calculated from the smallest a, b, and c exponents which contribute to the contracted functions in question and multiplying it by both the degree of contraction (product of the number of primitives for the three functions) and the maximal contraction coefficient used for each contracted function.
This estimation can also be done before the loop over χC for the class with the smallest c (algorithm cPRE3) in an abc scheme (see Algorithm 1) when the fitting basis is uncontracted. With a cab loop order (Algorithm 2) we cannot assume which contracted class contains the integrals of highest absolute value; therefore, the estimation is performed for each class inside the loop over χB (algorithm cPRE4).
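
The estimate used in the cPRE3 and cPRE4 schemes can be written down in a few lines. The sketch below reflects our reading of the description above: the saved (00|0) estimate for the smallest exponents is multiplied by the degree of contraction and by the product of the largest absolute contraction coefficients of the three functions. The function and argument names are hypothetical.

```python
def contracted_class_estimate(prim_estimate, coefs_a, coefs_b, coefs_c):
    """Estimate the largest integral of a contracted class.

    prim_estimate: saved (00|0) estimate for the smallest a, b, and c exponents
                   contributing to the three contracted functions.
    coefs_x:       contraction coefficients of the primitives of chi_X.
    """
    degree = len(coefs_a) * len(coefs_b) * len(coefs_c)
    max_coef = 1.0
    for coefs in (coefs_a, coefs_b, coefs_c):
        max_coef *= max(abs(d) for d in coefs)
    return prim_estimate * degree * max_coef


def class_is_kept(prim_estimate, coefs_a, coefs_b, coefs_c, threshold):
    """Apply the HRR and the solid harmonic transformation only if this is True."""
    estimate = contracted_class_estimate(prim_estimate, coefs_a, coefs_b, coefs_c)
    return estimate > threshold
```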

Finally, from the recursive formulas for the calculation of six-dimensional integrals given in Secs. II B–II D it is evident that an integral can be constructed in numerous ways by such recursions, depending on which of the x,y,z components of the angular momentum is raised in the various recursion steps. A well-known consequence of this is that not all components of the intermediate classes have to be calculated and that different paths in the recursion have different operation counts.22,28 In our algorithms, the related tree-search problems were treated utilizing the ideas of Ryu and co-workers.22

III. FLOATING POINT OPERATION COUNTS

The FLOP requirements of the discussed schemes were estimated by a simple program developed for this purpose. The considered operations include the calculation of the primitive integrals and the transformation into the solid harmonic Gaussian and contracted bases. Estimations for the evaluation of Boys functions and the roots and weights for the Rys quadratures are omitted because the computational requirements of both steps depend heavily on the actual values of $\alpha R_{PC}^2$. Nevertheless, we found that the computation time spent on the two operations is rather similar, thus the neglect of their FLOP counts is not expected to influence our conclusions. Prescreening of the integrals is also not taken into account since this is also strongly system-dependent. The program counts the FLOP requirements of the schemes according to the equations given in Sec. II supposing that reusable compound quantities, such as (α/p)XPC in Eq. (11), are precalculated and treated as single variables. The sparsity of the transformation matrices for the solid harmonic Gaussian transformation and the primitive contraction is taken into consideration. The abc primitive loop structure was used and the solid harmonic transformation and the HRR were performed at the contracted level since this is the most conventional approach, but this does not change the theoretical order of efficiency for the investigated schemes. In the calculations presented in the following, a model system of three carbon atoms was chosen, and the number of FLOPs needed to evaluate all the ERIs over three separate centers was estimated for Dunning’s57 correlation consistent cc-pVXZ (X = D,T,Q,5) basis sets (XZ for short) for the bra side and the corresponding auxiliary basis sets of Weigend59 (cc-pVXZ-RI) for the ket side.

The overall FLOP counts for all the shell triplets for the various algorithms are presented in Table I. Figures that show the theoretical performance of the other algorithms relative to the OS1 scheme can be found in the supplementary material. It can be seen that out of the OS-based schemes the OS1 algorithm shows the best theoretical performance. In the OS2 and OS3 schemes, the more expensive recursion for la takes place after the build up of lc, which makes these algorithms perform progressively worse with basis sets of higher cardinal number compared to OS1. In the OS4 route, the extra work introduced on the bra side with the use of the ETR becomes less and less significant with higher angular momenta in the bra, making the relative performance of OS4 better with bigger bases. Nevertheless, the OS1 scheme provides the lowest FLOP counts for each shell triplet. For the MD-based algorithms, the introduction of both the HRR for the bra (MD2 and MD5) and the VRR for the ket side (MD3 and MD6) improves the performance with respect to the MD1 and MD4 schemes, and increasingly so with the growth of Lb and Lc, respectively. None of the MD routes perform better than the OS1 for any shell triplets except for (ss|p), where the MD1, MD2, MD4, and MD5 schemes are slightly cheaper since the additional calculation of αc from Eq. (12) is not necessary. Looking at the best performing MD3 and MD6 schemes, we see that the use of Eqs. (28) and (29) is preferred to the assembly of Eq. (23), except when Lb = 0. The GHP scheme performs better than the OS1 when the bra side is (ps| since the extra contraction work for the scaled Hermite classes needed for Eq. (31) is negligible in these cases [except for very high angular momenta in the ket, see, for example, the (ps|i) shell triplet] and the s and p shells are contracted in all the investigated basis sets. The (ss|p) shell triplet also performs better, for the same reason as with the MD schemes. For higher angular momenta Eq. 
(31) becomes inefficient, hence the GHP scheme is only competitive for the DZ basis. As in the MD cases, the HRR for RYS2 and the ket-side VRR for RYS3 improve the FLOP counts. The RYS2 and RYS3 algorithms outmatch the OS1 in most of the cases when Lc = 0. For example, the OS1 scheme is better for (ds|s), but not for (dp|s). This is because the two-point quadrature is more costly than Eq. (11) for the former case, but it is cheaper for the latter. The RYS1 is the worst performing one of the Rys-based algorithms, but it is still superior to OS1 for particular shell triplets, for example, for (fd|s). The RYS3 scheme can be better than the OS1 for p kets if the change from s to p does not increase the number of quadrature points. However, since from Eq. (12) the $(\boldsymbol{l}_a\mathbf{0}|\mathbf{0})^{(L_c)}$ integral classes that have to be calculated with quadrature for RYS3 are in the range of $\max(0, L_a - L_c) \le l_a \le L_a + L_b$, the growth of Lc also increases the work in the quadrature step, so this is only the case for higher angular momentum bras. All in all, there is only a small difference between the overall estimates for the best performing OS1 and RYS3 algorithms. Because of this, and also because the FLOP counts of the Boys functions and the roots and weights of the Rys quadratures are not estimated, these two schemes were implemented efficiently using automated code generation and wall time measurements were carried out, as will be discussed in Secs. IV and V, to decide which of the two is the most efficient scheme. The GHP algorithm for the (ps|s)–(ps|g) integrals has also been implemented “by hand” because the FLOP counts with this scheme are the lowest for these triplets.

TABLE I.

FLOP counts for the various algorithms with the cc-pVXZ basis sets.

Algorithm      X = D        X = T         X = Q          X = 5
OS1          445 777    2 231 707    14 074 904     71 407 908
OS2          545 297    2 967 883    19 981 747    106 671 377
OS3          632 210    3 465 805    22 599 746    116 871 757
OS4          754 037    3 587 118    21 812 481    106 908 381
MD1          599 215    3 801 560    30 617 263    198 278 829
MD2          555 359    3 165 292    22 249 286    125 117 638
MD3          474 978    2 473 785    15 766 358     80 931 165
MD4          616 235    3 824 184    29 178 035    173 497 467
MD5          570 267    3 272 596    22 532 400    121 220 098
MD6          470 050    2 420 151    15 243 362     77 170 230
GHP          499 430    3 188 703    25 032 932    152 491 888
RYS1         622 518    3 181 603    20 929 684    112 060 719
RYS2         585 778    2 902 749    18 203 750     92 155 512
RYS3         467 187    2 308 256    14 413 073     72 659 045
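
The relative standing of the schemes is easier to see when the counts of Table I are normalized to OS1. The snippet below is merely a convenience sketch with hypothetical names; the numbers are copied from Table I. With the 5Z basis, for instance, the RYS3 total is less than 2% above the OS1 total.

```python
# Total FLOP counts from Table I, ordered as the cc-pVXZ bases X = D, T, Q, 5.
BASES = ("D", "T", "Q", "5")
FLOPS = {
    "OS1": (445777, 2231707, 14074904, 71407908),
    "OS2": (545297, 2967883, 19981747, 106671377),
    "OS3": (632210, 3465805, 22599746, 116871757),
    "OS4": (754037, 3587118, 21812481, 106908381),
    "MD1": (599215, 3801560, 30617263, 198278829),
    "MD2": (555359, 3165292, 22249286, 125117638),
    "MD3": (474978, 2473785, 15766358, 80931165),
    "MD4": (616235, 3824184, 29178035, 173497467),
    "MD5": (570267, 3272596, 22532400, 121220098),
    "MD6": (470050, 2420151, 15243362, 77170230),
    "GHP": (499430, 3188703, 25032932, 152491888),
    "RYS1": (622518, 3181603, 20929684, 112060719),
    "RYS2": (585778, 2902749, 18203750, 92155512),
    "RYS3": (467187, 2308256, 14413073, 72659045),
}


def relative_to(reference="OS1"):
    """FLOP counts of every scheme divided by those of the reference scheme."""
    ref = FLOPS[reference]
    return {name: tuple(count / r for count, r in zip(counts, ref))
            for name, counts in FLOPS.items()}


def cheapest(column):
    """Name of the scheme with the lowest total count for the given basis index."""
    return min(FLOPS, key=lambda name: FLOPS[name][column])


if __name__ == "__main__":
    for name, ratios in relative_to().items():
        print(name, " ".join(f"{basis}: {r:.3f}" for basis, r in zip(BASES, ratios)))
```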

The FLOP counts for the four different possible combinations of the IN-OUT and abc-cab schemes for the OS1 algorithm are shown in Table II. The conclusions are also true for the RYS3 algorithm since the OS1 and RYS3 schemes do not differ in any part that is affected by varying these four algorithmic approaches. The estimates for the abc and cab cases are essentially the same; the small difference comes from the fact that for the abc schemes the additional costs of the pPRE2 type primitive prescreening are also counted because additional calculations are needed here before the loop over c. The differences between the IN and OUT algorithms are more significant, and as expected, performing the HRR and the solid harmonic transformation at the contracted level is theoretically more efficient in every case when at least one of the functions is contracted. The difference becomes less pronounced with higher basis sets because d and higher shells are uncontracted in the investigated bases. These results, however, do not provide information about the difference in performance that could arise from the different memory layouts and prescreening strategies of the schemes. Hence, to assess the wall time performances as well as cache-miss rates these four variations have also been efficiently implemented for both the OS1 and RYS3 algorithms, and the abc and cab versions of the GHP schemes were also programmed.

TABLE II.

FLOP counts for the four different OS1 algorithms with the cc-pVXZ basis sets.

Algorithm      X = D        X = T         X = Q          X = 5
IN-abc       566 748    2 664 883    15 919 037     79 233 985
IN-cab       565 054    2 662 374    15 916 043     79 232 960
OUT-abc      445 777    2 231 707    14 074 904     71 407 908
OUT-cab      443 201    2 227 609    14 069 600     71 404 912
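
The statement that the gap between the IN and OUT variants narrows with the cardinal number can be checked directly from Table II. The following sketch (hypothetical names, data copied from the table) computes the IN/OUT ratio for a given loop order; for the abc order it decreases from about 1.27 (DZ) to about 1.11 (5Z).

```python
# FLOP counts from Table II, ordered as the cc-pVXZ bases X = D, T, Q, 5.
OS1_FLOPS = {
    "IN-abc": (566748, 2664883, 15919037, 79233985),
    "IN-cab": (565054, 2662374, 15916043, 79232960),
    "OUT-abc": (445777, 2231707, 14074904, 71407908),
    "OUT-cab": (443201, 2227609, 14069600, 71404912),
}


def in_over_out(loop_order):
    """Ratio of the IN and OUT FLOP counts for the 'abc' or 'cab' loop order."""
    counts_in = OS1_FLOPS["IN-" + loop_order]
    counts_out = OS1_FLOPS["OUT-" + loop_order]
    return tuple(i / o for i, o in zip(counts_in, counts_out))


if __name__ == "__main__":
    for order in ("abc", "cab"):
        print(order, " ".join(f"{r:.4f}" for r in in_over_out(order)))
```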

IV. IMPLEMENTATION

The four combinations of the IN-OUT and abc-cab schemes for the OS1 and RYS3 algorithms together with the prescreening approaches discussed in Sec. II F have been implemented in the Mrcc program suite61 by means of automated code generation. The abc and cab variants of the GHP algorithm for the (ps|s)–(ps|g) triplets have been implemented in the conventional way. An individual Fortran 95 subroutine was created for every shell triplet up to (hh|i). The subroutines contain the loops over the primitive and the contracted Gaussians, the calculation of the necessary exponent- and center-dependent quantities, the evaluation of Boys functions (or the roots and weights for the Rys quadrature), the recursive build-up of angular momenta (or the quadrature for la), and the transformations to the solid harmonic and contracted bases. The code generation based implementation is particularly useful for the exploitation of the fact that not all the intermediate integrals are needed for a given class when using the 6D recurrences of Eqs. (11)–(13) and the 3D recurrence of Eq. (21), and the statements for calculating the unnecessary integrals are simply omitted from the code. For the 2D recursions of the RYS3 scheme, this does not apply since the recursions for the x, y, and z directions are performed separately and all the components are needed in the recursion defined by Eq. (44). The calculation of the 2D integrals is vectorized for the roots of the Rys polynomials, and the quadrature for the $(\boldsymbol{l}_a\mathbf{0}|\mathbf{0})^{(L_c)}$ classes has been implemented utilizing the reduced multiplication scheme of Lindh and co-workers.25 All the intermediate and target integrals are stored in one-index arrays.
The build-up of angular momenta and the solid harmonic transformation are performed for one class at a time, which means that the arrays for storing the intermediates of these tasks are of fixed length and the indices can be explicitly generated, eliminating the integer and memory operations for the calculation of indices.

A significant amount of vectorization can be achieved for the HRR and the solid harmonic transformation provided that the data are stored in the appropriate order. The HRR can be trivially vectorized for the components of Lc since Eq. (13) does not depend on the function in the ket. Systematic vectorization for the components of la is also possible if the component of lb is the slowest changing property in the array. If the ordering of Cartesian components is as shown in Fig. 1, then the components of la can only be partially vectorized if z or y is raised in the angular momentum of lb and fully if x is incremented; therefore, whenever it is possible, x should be raised by the HRR. For the GHP algorithm with a (ps| bra, where the target integrals are calculated directly from the one-center ones, Eq. (31) was vectorized in the same manner for the components of Lc. For the solid harmonic transformation of one of the functions, the loops over all the (Cartesian or solid harmonic) components of the other two functions can only be vectorized if the components of the transformed function change most slowly. We found it to be efficient to rearrange the ordering of integrals before these highly vectorizable tasks. The sparsity of the solid harmonic transformation is fully exploited in our implementation, and the values of the coefficients in Eq. (5) are explicitly generated into the code. We have also considered the approach where the HRR and the solid harmonic transformation for the bra are treated as one matrix multiplication by precalculating the combined transformation matrix,56 storing it in compressed sparse column format for a given bra, and reusing this matrix with a sparse matrix multiplication routine for the transformation of integrals.
It was our experience that performing the HRR separately step by step for each lb with the vectorization scheme described above and exploiting that some components are unnecessary for the recursion is a more beneficial strategy. It should also be mentioned that the solid harmonic transformation of the ket side is always performed before the HRR since this makes the latter step less expensive.
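
The structure of the step-by-step HRR can be illustrated with a one-dimensional model. This sketch is not the generated code and does not reproduce the three-dimensional Eq. (13): it assumes the standard transfer form I(a, b+1) = I(a+1, b) + (A − B) I(a, b), and `model_integral` is a hypothetical stand-in (a weighted sum of polynomial products) for which this relation holds exactly. The transfer only needs the b = 0 values with La ≤ a ≤ La + Lb and the geometric factor A − B.

```python
def hrr_1d(source, La, Lb, AB):
    """Build I(La, Lb) from source[a] = I(a, 0) with La <= a <= La + Lb.

    Applies I(a, b + 1) = I(a + 1, b) + AB * I(a, b), where AB = A - B.
    """
    # column[i] holds I(La + i, b) for the current value of b
    column = [source[a] for a in range(La, La + Lb + 1)]
    for _ in range(Lb):
        column = [column[i + 1] + AB * column[i] for i in range(len(column) - 1)]
    return column[0]


def model_integral(a, b, A, B, points, weights):
    """Hypothetical 1D stand-in: sum_k w_k (x_k - A)^a (x_k - B)^b."""
    return sum(w * (x - A) ** a * (x - B) ** b for x, w in zip(points, weights))


if __name__ == "__main__":
    A, B = 0.3, -0.7
    points, weights = [-1.0, 0.2, 0.9, 1.7], [0.4, 1.1, 0.8, 0.25]
    source = {a: model_integral(a, 0, A, B, points, weights) for a in range(2, 5)}
    print(hrr_1d(source, 2, 2, A - B), model_integral(2, 2, A, B, points, weights))
```

Since the transfer is linear in the source values and contains no exponents, applying it to a linear combination of sources gives the same combination of the results, which is the property that permits its use at the contracted level.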

FIG. 1.

Two possible ways of calculating integrals with a (𝒅𝒅| bra side and lb = (1, 0, 1) are by (a) incrementing z and (b) incrementing x in lb. The indices for the Cartesian components increase as we proceed from top to bottom in the columns for the f and d shells above. The operations which can be vectorized are highlighted by boxes of various colors. In our implementation, incrementing x is always better suited for vectorization. The ket side of the integrals is not shown since the HRR equation is invariant to the function in the ket.

The contraction of primitives can be treated in a vectorized manner without the rearrangement of data. For generally contracted functions, the multiplication with the coefficients in Eq. (6) is vectorized for all the necessary classes, e.g., for the construction of the integrals over all components of one of the χB functions in an abc scheme, $N_{\chi_C} N_S$ integrals are treated simultaneously, where $N_{\chi_C}$ is the number of contracted functions centered on C and $N_S$ is the number of integrals in the class (for algorithm IN) or in the necessary (la0|Lc) classes (for algorithm OUT). For example, for a (dd|d) class $N_S = 5 \times 5 \times 5 = 125$ for algorithm IN and $N_S = 6 \times 1 \times 6 + 10 \times 1 \times 6 + 15 \times 1 \times 6 = 186$ for algorithm OUT because here we need the (ds|d), (fs|d), and (gs|d) classes for the HRR. It is also noteworthy that, at the contraction of the functions centered on B, instead of performing the summation of Eq. (6) in the array of length $N_a N_{\chi_B} N_{\chi_C} N_S$ used to store these partially contracted integrals (where $N_a$ is the number of primitives centered on A), it is more cache-friendly to do the summation in a buffer array of size $N_{\chi_C} N_S$ and then to copy the data into the array that will be used for the contraction of primitives centered on A.

The implementation of ERIs also utilizes a coarse-grained OpenMP parallelization for the innermost atomic loop. A figure showing the performance of the parallelization can be found in the supplementary material. We also note that, to demonstrate the efficiency of the generated implementation, we have also coded a subroutine that uses the OS1 scheme for arbitrary angular momenta. Here, the recursions of Eqs. (11) and (12) are performed by general loops, and the intermediates are stored in a two-index array. The HRR and the solid harmonic transformation steps are done at the contracted level with a sparse matrix multiplication routine, which is applied to the solid harmonic transformation of the ket and the combined HRR and solid harmonic transformation of the bra56 as described above.

V. PERFORMANCE TESTS

In this section, we present the wall time performances of the implemented algorithms measured using a single core of a 2-core 3.00 GHz Intel Xeon E3110 CPU. The generated subroutines were compiled with the Intel Fortran compiler using the highest level of optimization. Measurements were carried out for penicillin62 (PEN) and two DNA systems with one (DNA1) and two (DNA2) adenine-thymine base pairs.63 The threshold ε for contracted integrals was set to $10^{-10}\,E_h$ in all of the calculations. Only the results for DNA2 with the cc-pVTZ basis set are presented here. The results for the other measurements, which show that the conclusions gained hold for all the investigated systems, can be found in the supplementary material. Cache simulations were performed for hydrogen peroxide ($R_{OO}$ = 2.7514 bohrs, $R_{HO}$ = 1.8274 bohrs, ∠HOO = 102.32°, dihedral angle = 115.89°) with the Valgrind program package64 supposing a three-level CPU cache structure which is common these days: 64 kB of level 1 (L1, 32 kB each for data and instructions), 256 kB of level 2, and 4 MB of level 3 (last level, LL) cache. In the simulations, an L1 miss means that the data or instructions have not been found in the first level, while an LL miss indicates that no copy of the requested information can be found in the cache at all. Note that the number of L1 misses also contains the LL misses.

Fig. 2 shows the difference between the pPRE1 and pPRE2 primitive prescreening schemes in the case of the IN-abc algorithm. The pPRE2 method saves entering the loop over c and the prescreening for each c at the price that classes containing integrals of insignificant absolute values that would be screened out with the pPRE1 scheme are also computed. With the abc loop order, the pPRE2 approach is clearly more efficient. The difference between the performance of the two prescreening schemes, as well as the significance of primitive prescreening, shrinks with the decrease in the number of primitive functions. On the other hand, from Fig. 3 we see that the pPRE1 prescreening is more economical in the case of a cab scheme since the Schwarz screening already throws out most of the shell pairs where no b gives a significant contribution. The figures presenting the timings for the various cPRE algorithms can be found in the supplementary material. The cPRE type of screening has less effect, and for triplets that do not require either the HRR or the solid harmonic transformation, it merely saves the writing of integrals into their final storing array. As the former two tasks become more significant, the cPRE screening gets more beneficial, especially with higher basis sets, where there are more contracted functions for higher angular momenta. For the OUT-abc scheme, the lookup of the integrals of highest absolute value (cPRE1 and cPRE2) is preferred over the estimation of this quantity (cPRE3). The cPRE1 and cPRE2 schemes have very similar performance, with cPRE2 being slightly more efficient. The same tendencies can be observed with the OUT-cab algorithm, where cPRE1 is the more efficient method. We conclude that for the abc primitive loop order, the pPRE2 and cPRE2 are the prescreening schemes of choice, while for the cab algorithms the pPRE1 and cPRE1 screenings are preferred.

FIG. 2.

Wall times measured in seconds obtained by calculating all three-center ERIs of the DNA2 molecule with the cc-pVTZ basis set by applying the OS1-IN-abc algorithm with various prescreening strategies.

FIG. 3.

Wall times measured in seconds obtained by calculating all three-center ERIs of the DNA2 molecule with the cc-pVTZ basis set by applying the OS1-IN-cab algorithm with various prescreening strategies.

The wall times measured for the shell triplets with the four variants of the OS1 algorithm, using the most efficient prescreening methods, are shown in Fig. 4. For triplets containing small angular momenta, the cab schemes are inefficient, even without primitive prescreening (see also Figs. 2 and 3). The reason for this is that the arrays that become smaller with a cab algorithm are already too short in these cases. For example, the length of the buffer array used for the contraction of functions centered on B for (ss|s) is NχC and 1 using an abc and a cab scheme, respectively. Here, applying the cab loop order ruins the vectorization for the primitive contraction. This effect loses its importance with the growth of Lc since NS becomes bigger and NχC becomes smaller. The difference between the abc and cab schemes grows when using basis sets of higher cardinal number because of the higher number of contracted functions. The IN algorithms generally perform better than the OUT ones. One of the reasons is the apparent superiority of the pPRE-type screening, which lessens the amount of work for the HRR and solid harmonic transformation steps using the IN schemes. We must note, however, that only the s and p shells are contracted in the considered basis sets, making the OUT route theoretically more efficient only in shell triplets containing at least one such shell.

FIG. 4.

Wall times measured in seconds obtained by calculating all three-center ERIs of the DNA2 molecule with the cc-pVTZ basis set by applying the four OS1 algorithms with the most efficient prescreening strategies.

The timings can be better interpreted inspecting the results of the cache performance simulations. The cumulated results for all the shell triplets in the TZ basis are presented in Table III, while the results with the other basis sets can be found in the supplementary material. We see that the number of level 1 instruction fetch misses (L1Is) is lower for the OUT-abc scheme than for the IN-abc, but a higher percentage of these is also last level misses. This is because with an IN algorithm the calculation of primitive integrals and the conversion into the solid harmonic Gaussian basis are done continuously step by step inside the primitive loops, while in the OUT case this procedure is divided into two parts with two separate loop structures, making it more friendly for the instruction cache for higher angular momenta, where the generated codes are lengthy. This effect is more pronounced with basis sets of higher cardinal number, where the angular momenta are higher and the loops over primitive and contracted functions perform more cycles. With the QZ and 5Z bases, we can observe the same for OUT-cab: the number of L1Is is smaller than for the IN schemes, but higher than for the OUT-abc since all the calculations take place in the loop over c, making the reuse of instructions less temporally local (that is, the same tasks are not performed as frequently as they would be with the loop over c being the innermost one). For this reason, the abc schemes are always more friendly to the instruction cache. This aspect of the performance is the reason why the OUT schemes are sometimes more efficient for shell triplets we would not expect theoretically, for example, for the (fd|f) and (ff|d) cases with the TZ basis, and also explains why the performance of this approach improves with higher basis sets. 
As anticipated from the sizes of the arrays used for the primitive contraction, the IN algorithms produce fewer data misses of both the read and write kind, and the cab loop order is beneficial in this aspect. This difference also grows with the cardinal number of the basis sets and is more significant for write misses since the read operations are usually carried out from arrays that have been written in a previous calculation step.
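
The difference between the IN and OUT variants discussed above can be illustrated with a small sketch. It is not the generated code of this work: the function names, the random 5 × 6 matrix standing in for the solid harmonic transformation, and the random primitive batches are all hypothetical. The sketch only shows that a linear transformation commutes with the primitive contraction, so that the IN variant applies it once per primitive while the OUT variant applies it once per contracted batch.

```python
import random


def transform(cart, matrix):
    """Apply a linear map (stand-in for the solid harmonic transformation)."""
    return [sum(m * x for m, x in zip(row, cart)) for row in matrix]


def contract_in(primitives, coeffs, matrix):
    """IN style: transform each primitive batch inside the primitive loop."""
    total = [0.0] * len(matrix)
    n_transforms = 0
    for coeff, prim in zip(coeffs, primitives):
        sph = transform(prim, matrix)
        n_transforms += 1
        for i, value in enumerate(sph):
            total[i] += coeff * value
    return total, n_transforms


def contract_out(primitives, coeffs, matrix):
    """OUT style: contract the Cartesian batches first, transform once afterwards."""
    acc = [0.0] * len(primitives[0])
    for coeff, prim in zip(coeffs, primitives):
        for i, value in enumerate(prim):
            acc[i] += coeff * value
    return transform(acc, matrix), 1


# Hypothetical data: 4 primitives, 6 Cartesian components, 5 transformed components.
rng = random.Random(0)
n_prim, n_cart, n_sph = 4, 6, 5
matrix = [[rng.uniform(-1, 1) for _ in range(n_cart)] for _ in range(n_sph)]
primitives = [[rng.uniform(-1, 1) for _ in range(n_cart)] for _ in range(n_prim)]
coeffs = [rng.uniform(0, 1) for _ in range(n_prim)]

res_in, count_in = contract_in(primitives, coeffs, matrix)
res_out, count_out = contract_out(primitives, coeffs, matrix)
print(count_in, count_out)
```

Both routes give the same numbers up to rounding. The OUT route needs fewer transformations but keeps a larger Cartesian accumulator alive across the primitive loop, which is the kind of array-size effect the data-miss discussion above refers to.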

TABLE III.

Cache performance simulation results for H2O2 with the cc-pVTZ basis set.

Algorithm
Event IN-abc IN-cab OUT-abc OUT-cab
L1 instruction fetch miss 687 995 720 295 665 954 820 972
LL instruction fetch miss 576 727 582 575 641 900 666 989
L1 data read miss 219 741 192 668 252 112 202 708
LL data read miss 199 655 191 036 200 703 199 053
L1 data write miss 484 132 385 047 552 062 407 074
LL data write miss 482 194 383 336 544 077 404 681
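
The statement that a larger share of the level 1 instruction fetch misses of OUT-abc are also last-level misses can be read off Table III with a few divisions. The short script below is only a convenience for that arithmetic; the dictionary names are ours, and the numbers are copied from the first two rows of the table.

```python
# Instruction fetch misses from Table III (cc-pVTZ).
l1_instr = {"IN-abc": 687995, "IN-cab": 720295, "OUT-abc": 665954, "OUT-cab": 820972}
ll_instr = {"IN-abc": 576727, "IN-cab": 582575, "OUT-abc": 641900, "OUT-cab": 666989}

# Fraction of level 1 instruction misses that also miss in the last-level cache.
ll_share = {name: ll_instr[name] / l1_instr[name] for name in l1_instr}

for name, share in ll_share.items():
    print(f"{name}: {100.0 * share:.1f}% of L1 instruction misses reach the last level")
```

For IN-abc the share is roughly 84%, while for OUT-abc it is roughly 96%, in line with the discussion in the text.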

Fig. 5 compares the efficiency of the OS1 and RYS3 schemes. For each shell triplet, the selected algorithmic approach was the one that best performed according to Fig. 4, keeping in mind that the most efficient combination of the IN-OUT and abc-cab approaches for the OS1 scheme is also the most efficient one for the RYS3 since the OS1 and RYS3 schemes do not differ in any part that depends on using the IN-OUT or abc-cab approaches. While the performances fall close, the OS1 scheme is superior in almost every case. The differences are more pronounced for the shell triplets with small angular momenta in the bra. The advantage of using OS1 becomes larger for the shell triplets where the number of Rys quadrature points is over 5. In these cases, the roots and weights are calculated by applying Wheeler’s algorithm65 and Golub’s matrix method,66 while otherwise the less expensive schemes proposed by King and Dupuis10 are employed. The disagreement between the timings and the FLOP estimates must come from the task that is not estimated by the operation counts, that is, the evaluation of Boys functions and the roots and weights for the Rys quadrature. In some cases, the RYS3 scheme is still slightly more efficient, e.g., for the (fd|p) and (gd|p) shell triplets. The GHP scheme is competitive for the implemented cases (see Sec. IV) with the 5Z basis, where the degree of contraction is the highest. For smaller basis sets, for the (ps|p) triplet, GHP performs slightly better than OS1 since here the number of integrals to be contracted, that is, the number of integrals included in the scaled classes (Ω0,0𝟎¯|𝟏¯)1,1 and (Ω0,0𝟏¯|𝟏¯)0,1 needed for Eq. (31), is the same as the number of integrals to be contracted in the OS1 scheme, and all of the functions are contracted. The application of the cab loop order on the (ps|g) and (ps|f) triplets makes the GHP algorithm perform better for these cases than the other ones with the TZ and the QZ bases, respectively.

FIG. 5.

Wall times measured in seconds obtained by calculating all three-center ERIs of the DNA2 molecule with the cc-pVTZ basis set by applying the most efficient OS1 and RYS3 algorithms.

As was pointed out, the relative performances of the discussed approaches depend on the number of functions and the degree of contraction, and therefore on the applied basis set itself. For the three test molecules we investigated, it was our experience that the best algorithm for a given shell triplet with a given basis is mostly independent of the calculated system. Based on our measurements with the cc-pVXZ bases for first row elements, in Table IV we present our recommendations for the algorithms for the shell triplets up to (hh|i). The list in Table IV was compiled by selecting the schemes that are the most beneficial for the TZ and the QZ basis sets because such bases are used most frequently in DF calculations. The best algorithm for a triplet is the same with both basis sets in most of the cases. As we can see, even though the considered basis sets are similar in that only the s and p shells are contracted, the increase in the number of functions and the level of contraction makes the cab and OUT schemes more beneficial with the bigger bases.

TABLE IV.

Recommended algorithms for the various shell triplets.

Shell triplet Algorithm Shell triplet Algorithm Shell triplet Algorithm
(ss|s) OS1-IN-abc (fp|s) OS1-IN-abc (gg|s) OS1-OUT-abc
(ss|p) OS1-IN-abc (fp|p) OS1-OUT-abc (gg|p) OS1-IN-cab
(ss|d) OS1-IN-abc (fp|d) OS1-IN-cab (gg|d) OS1-IN-cab
(ss|f) OS1-IN-abc (fp|f) OS1-IN-cab (gg|f) OS1-IN-cab
(ss|g) OS1-IN-cab (fp|g) OS1-IN-cab (gg|g) OS1-IN-cab
(ss|h) OS1-IN-cab (fp|h) OS1-OUT-cab (gg|h) OS1-OUT-cab
(ss|i) OS1-IN-cab (fp|i) OS1-OUT-cab (gg|i) OS1-OUT-abc
(ps|s) OS1-IN-abc (fd|s) OS1-IN-cab (hs|s) RYS3-IN-abc
(ps|p) GHP-abc (fd|p) RYS3-IN-cab (hs|p) OS1-IN-abc
(ps|d) OS1-IN-abc (fd|d) OS1-IN-cab (hs|d) OS1-IN-abc
(ps|f) OS1-IN-abc (fd|f) OS1-OUT-abc (hs|f) OS1-IN-abc
(ps|g) GHP-cab (fd|g) OS1-OUT-abc (hs|g) OS1-IN-abc
(ps|h) OS1-IN-cab (fd|h) OS1-IN-abc (hs|h) OS1-IN-cab
(ps|i) OS1-IN-cab (fd|i) OS1-OUT-abc (hs|i) OS1-IN-cab
(pp|s) OS1-IN-abc (ff|s) OS1-IN-cab (hp|s) RYS3-IN-cab
(pp|p) OS1-OUT-abc (ff|p) OS1-OUT-abc (hp|p) OS1-IN-cab
(pp|d) OS1-IN-cab (ff|d) OS1-OUT-abc (hp|d) OS1-IN-abc
(pp|f) OS1-IN-cab (ff|f) OS1-IN-cab (hp|f) OS1-OUT-abc
(pp|g) OS1-IN-cab (ff|g) OS1-IN-cab (hp|g) OS1-OUT-abc
(pp|h) OS1-IN-cab (ff|h) OS1-IN-cab (hp|h) OS1-IN-cab
(pp|i) OS1-IN-cab (ff|i) OS1-OUT-cab (hp|i) OS1-IN-cab
(ds|s) OS1-IN-abc (gs|s) OS1-IN-abc (hd|s) RYS3-IN-cab
(ds|p) OS1-IN-abc (gs|p) OS1-IN-abc (hd|p) OS1-OUT-abc
(ds|d) OS1-IN-abc (gs|d) OS1-IN-abc (hd|d) OS1-OUT-cab
(ds|f) OS1-IN-abc (gs|f) OS1-IN-abc (hd|f) OS1-OUT-cab
(ds|g) OS1-IN-abc (gs|g) OS1-IN-abc (hd|g) OS1-OUT-abc
(ds|h) OS1-IN-cab (gs|h) OS1-IN-cab (hd|h) OS1-OUT-cab
(ds|i) OS1-IN-cab (gs|i) OS1-IN-cab (hd|i) OS1-OUT-cab
(dp|s) OS1-IN-abc (gp|s) OS1-IN-cab (hf|s) OS1-OUT-abc
(dp|p) OS1-OUT-abc (gp|p) OS1-IN-cab (hf|p) OS1-OUT-abc
(dp|d) OS1-IN-cab (gp|d) OS1-IN-cab (hf|d) OS1-IN-cab
(dp|f) OS1-IN-cab (gp|f) OS1-OUT-abc (hf|f) OS1-IN-cab
(dp|g) OS1-IN-cab (gp|g) OS1-IN-cab (hf|g) OS1-IN-cab
(dp|h) OS1-IN-cab (gp|h) OS1-IN-cab (hf|h) OS1-OUT-cab
(dp|i) OS1-OUT-cab (gp|i) OS1-OUT-abc (hf|i) OS1-OUT-abc
(dd|s) OS1-IN-abc (gd|s) OS1-IN-cab (hg|s) OS1-OUT-abc
(dd|p) OS1-IN-cab (gd|p) OS1-IN-cab (hg|p) OS1-IN-cab
(dd|d) OS1-IN-cab (gd|d) OS1-OUT-abc (hg|d) OS1-IN-cab
(dd|f) OS1-IN-cab (gd|f) OS1-OUT-abc (hg|f) OS1-OUT-cab
(dd|g) OS1-OUT-abc (gd|g) OS1-OUT-abc (hg|g) OS1-OUT-cab
(dd|h) OS1-OUT-cab (gd|h) OS1-IN-cab (hg|h) OS1-OUT-cab
(dd|i) OS1-OUT-cab (gd|i) OS1-OUT-cab (hg|i) OS1-OUT-abc
(fs|s) OS1-IN-abc (gf|s) RYS3-IN-cab (hh|s) OS1-IN-cab
(fs|p) OS1-OUT-abc (gf|p) OS1-OUT-abc (hh|p) OS1-IN-cab
(fs|d) OS1-IN-abc (gf|d) OS1-OUT-abc (hh|d) OS1-IN-cab
(fs|f) OS1-IN-abc (gf|f) OS1-OUT-abc (hh|f) OS1-OUT-cab
(fs|g) OS1-IN-cab (gf|g) OS1-IN-cab (hh|g) OS1-OUT-cab
(fs|h) OS1-IN-cab (gf|h) OS1-IN-abc (hh|h) OS1-OUT-abc
(fs|i) OS1-IN-cab (gf|i) OS1-OUT-abc (hh|i) OS1-OUT-cab

VI. BENCHMARK CALCULATIONS

To demonstrate the efficiency of our implementation based on the above recommendations, in Table V we present the wall times measured for the evaluation of three-center ERIs for test systems of various sizes, namely, penicillin,62 DNA fragments containing 1 and 4 adenine-thymine base pairs67 (DNA1 and DNA4, respectively), indinavir,68 angiotensin II,69 and a halloysite clay structure.70 The measurements were carried out using 8 cores of a 3.00 GHz Intel Xeon E5-1660 CPU. The wall times scale close to quadratically with the total number of basis functions due to the various integral screenings, and the prefactor is kept small by the efficient implementation. We have also observed a roughly constant speedup of about a factor of 3 compared to our general purpose routine using the OS1 scheme, which shows the benefit of an implementation optimized for each shell triplet separately. We note that three-center ERIs can also be easily computed with the algorithms developed for four-center ones by constraining two of the four centers to be coincident. Since many quantum chemistry software packages evaluate three-center Coulomb integrals in this way, it is instructive to compare the speed of an explicitly three-center code to that of a four-center one for three-center ERIs. Therefore, we compared our three-center code to our previous OS-based four-center integral program71 and found that the former program is roughly 3.5 times faster than the latter one. We also note that the efficiency of our integral code has also recently been demonstrated in the case of the integral-direct local correlation approach of Ref. 9, where roughly one-third of the entire computation time is spent on the calculation of three-center ERIs.

TABLE V.

Wall times of three-center ERI calculations in minutes measured for various test systems with the cc-pVXZ basis sets. N + M denotes the total number of ordinary basis functions and fitting functions.

X
D T Q 5
Test system Time N + M Time N + M Time N + M Time N + M
Penicillin 0.008 430 + 2 136 0.022 946 + 2 478 0.088 1 864 + 3 504 0.372 3 178 + 5 033
DNA1 0.016 625 + 3 071 0.049 1 428 + 3 575 0.201 2 735 + 5 087 0.883 4 670 + 7 351
Indinavir 0.033 865 + 4 231 0.118 2 008 + 4 965 0.492 3 885 + 7 167 2.251 6 680 + 10 471
Angiotensin II 0.104 1 405 + 6 883 0.380 3 244 + 8 055 1.609 6 255 + 11 571 7.245 10 730 + 16 843
DNA4 0.474 2 746 + 19 820 1.777 6 192 + 15 794 8.307 11 774 + 22 202 33.174 20 012 + 31 744
Halloysite 1.306 3 700 + 19 820 4.607 7 970 + 22 435 19.854 14 855 + 30 280 68.447 24 985 + 41 510
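
One rough way to probe how the wall times in Table V grow with system size is a log-log least-squares fit. The sketch below does this for the cc-pVTZ column only, taking N + M as the size measure; the helper name `loglog_slope` and this choice of size measure are our assumptions, and the fitted exponent depends on which systems and basis sets are included, so it is an illustration rather than a verification of the scaling.

```python
import math

# cc-pVTZ column of Table V: (wall time in minutes, N, M).
tz_rows = {
    "Penicillin": (0.022, 946, 2478),
    "DNA1": (0.049, 1428, 3575),
    "Indinavir": (0.118, 2008, 4965),
    "Angiotensin II": (0.380, 3244, 8055),
    "DNA4": (1.777, 6192, 15794),
    "Halloysite": (4.607, 7970, 22435),
}


def loglog_slope(points):
    """Least-squares slope of ln(y) against ln(x) for a list of (x, y) pairs."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    return s_xy / s_xx


# Effective exponent k in time ~ (N + M)**k for the six test systems.
exponent = loglog_slope([(n + m, t) for t, n, m in tz_rows.values()])
print(f"effective scaling exponent (cc-pVTZ): {exponent:.2f}")
```

The fitted exponent lies between 2 and 3, that is, below the formal cubic growth of the number of three-center integrals.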

VII. CONCLUSIONS

We have compared the Obara–Saika, McMurchie–Davidson, Gill–Head-Gordon–Pople, and Rys quadrature schemes as well as their combinations for the evaluation of three-center Coulomb integrals. Various algorithmic considerations, such as the order of loops for primitive functions, the application of the horizontal recurrence relation, and the solid harmonic transformation at the primitive or contracted level, and several prescreening strategies have also been investigated. Based on estimations for the number of necessary floating point operations for a simple model system, we concluded that the Obara–Saika scheme, utilizing the vertical recurrence relation of Ahlrichs,40 is the most efficient choice, with the Gill–Head-Gordon–Pople algorithm and the combination of the Rys quadrature and the Obara–Saika schemes being competitive for a few special cases. The most promising algorithms were implemented via automated code generation for all shell triplets up to (hh|i) along with the discussed algorithmic approaches. Wall time measurements for medium sized molecules also showed the Obara–Saika scheme to be superior, and the most effective prescreening technique was determined for each algorithmic approach. Even though the floating point operation counts suggested that the horizontal recurrence relation and the solid harmonic transformation are significantly more efficient when applied to contracted integrals, this does not seem to be the case for the majority of shell triplets encountered in practical calculations. The reason for this is that performing these two tasks on primitive integrals allows for the use of more effective prescreening and memory layout. Based on our investigations, we have presented a recommendation for the algorithms to be used for the various shell triplets, favoring the ones that perform the best with triple- and quadruple-zeta basis sets.

SUPPLEMENTARY MATERIAL

See supplementary material for the analysis of the prescreening schemes presented in Sec. II F, for the relative theoretical performances of the investigated algorithms referred to in Sec. III, for the wall time measurement and cache simulation results discussed in Sec. V, for the performance of the ERI calculation on multiple CPU cores, and for the geometries of the molecules used in the performance tests and benchmark calculations.

ACKNOWLEDGMENTS

The authors are indebted to Professor Reinhart Ahlrichs and Dr. Gerald Knizia for useful discussions. The computing time granted on the Hungarian HPC Infrastructure at NIIF Institute, Hungary, is gratefully acknowledged.

APPENDIX A: IMPROVED RECURRENCE RELATION FOR THE 2D INTEGRALS OF THE RYS SCHEME

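In the general case, Eq. (45) contains a third term and has the form (Ref. 17)

$$\Theta^{x}_{i_a,0,i_c+1}(t_n^2) = \frac{\alpha}{c} X_{PC}\, t_n^2\, \Theta^{x}_{i_a,0,i_c}(t_n^2) + \frac{i_a t_n^2}{2(p+c)}\, \Theta^{x}_{i_a-1,0,i_c}(t_n^2) + \frac{i_c}{2c}\left(1 - \frac{\alpha}{c} t_n^2\right) \Theta^{x}_{i_a,0,i_c-1}(t_n^2). \tag{A1}$$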

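With the help of Eq. (12) we can show that, if the ket side is transformed into the solid harmonic Gaussian basis, the third term on the right-hand side of Eq. (A1) can be omitted. To see this, we first notice from backtracking the recursion defined by Eq. (12) that an integral $(\boldsymbol{l}_a^{\#}\mathbf{0}|\boldsymbol{l}_c^{\#})^{(m+n)}$ contributes to $(\boldsymbol{l}_a\mathbf{0}|\boldsymbol{L}_c)^{(m)}$ only if

$$l_c^{\#} = l_c - n \tag{A2}$$

since each recursion step decreases $n$ and increases $l_c^{\#}$ by one. Then, let us express $(\boldsymbol{l}_a\mathbf{0}|\boldsymbol{L}_c)^{(m)}$ as

$$(\boldsymbol{l}_a\mathbf{0}|\boldsymbol{l}_c)^{(m)} = \sum_{n=1}^{N_{\mathrm{rts}}} t_n^{2m}\, \Theta^{x}_{i_a,0,i_c}(t_n^2)\, \Theta^{y}_{j_a,0,j_c}(t_n^2)\, \Theta^{z}_{k_a,0,k_c}(t_n^2). \tag{A3}$$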

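Substituting Eq. (A1), with $i_c$ lowered by one, into Eq. (A3) we get

$$\begin{aligned}
(\boldsymbol{l}_a\mathbf{0}|\boldsymbol{l}_c)^{(m)} = \sum_{n=1}^{N_{\mathrm{rts}}} t_n^{2m}
&\left[\frac{\alpha}{c} X_{PC}\, t_n^2\, \Theta^{x}_{i_a,0,i_c-1}(t_n^2) + \frac{i_a t_n^2}{2(p+c)}\, \Theta^{x}_{i_a-1,0,i_c-1}(t_n^2) + \frac{i_c-1}{2c}\left(1 - \frac{\alpha}{c} t_n^2\right) \Theta^{x}_{i_a,0,i_c-2}(t_n^2)\right] \\
\times &\left[\frac{\alpha}{c} Y_{PC}\, t_n^2\, \Theta^{y}_{j_a,0,j_c-1}(t_n^2) + \frac{j_a t_n^2}{2(p+c)}\, \Theta^{y}_{j_a-1,0,j_c-1}(t_n^2) + \frac{j_c-1}{2c}\left(1 - \frac{\alpha}{c} t_n^2\right) \Theta^{y}_{j_a,0,j_c-2}(t_n^2)\right] \\
\times &\left[\frac{\alpha}{c} Z_{PC}\, t_n^2\, \Theta^{z}_{k_a,0,k_c-1}(t_n^2) + \frac{k_a t_n^2}{2(p+c)}\, \Theta^{z}_{k_a-1,0,k_c-1}(t_n^2) + \frac{k_c-1}{2c}\left(1 - \frac{\alpha}{c} t_n^2\right) \Theta^{z}_{k_a,0,k_c-2}(t_n^2)\right].
\end{aligned} \tag{A4}$$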

Each of the terms arising by performing the multiplications amongst the brackets can contribute to an integral determined by the indices of the 2D integrals. For example, the term arising from multiplying the first terms of the brackets contributes to a scaled version of $(\boldsymbol{l}_a^{\#}\mathbf{0}|\boldsymbol{l}_c^{\#})^{(m+3)}$ with $\boldsymbol{l}_a^{\#} = (i_a, j_a, k_a)$ and $\boldsymbol{l}_c^{\#} = (i_c-1, j_c-1, k_c-1)$ through Eq. (A3), which is used in the expansion of $(\boldsymbol{l}_a\mathbf{0}|\boldsymbol{L}_c)^{(m)}$ by Eq. (12) if we go three steps back in the recursion. The terms containing the third 2D integral from one or more brackets in Eq. (A4) are used to build the $(\boldsymbol{l}_a^{\#}\mathbf{0}|\boldsymbol{l}_c^{\#})^{(m+n)}$ classes with Eq. (A3) where $0 \le n \le 3$ (because the third 2D integral can be multiplied by a quantity that does or does not contain $t_n^2$), $l_a - 2 \le l_a^{\#} \le l_a$ (because the product can contain a maximum of two of the second 2D integrals, each of which reduces $l_a^{\#}$ by one), and $l_c - 6 \le l_c^{\#} \le l_c - 4$ (because the first two 2D integrals each reduce $l_c^{\#}$ by one, while the third does so by two). Since none of these satisfy Eq. (A2), the contributions containing the third terms in the brackets in Eq. (A4) will be canceled during the solid harmonic transformation and can be taken to be zero, which means that Eq. (A1) reduces to Eq. (45). The same reasoning applies to the second term in Eq. (44) in the case of $L_b = 0$, when the third and fourth terms in Eq. (11) vanish.
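
The counting argument above can be checked by brute force. The sketch below enumerates the 27 products obtained by picking one term from each bracket of Eq. (A4) and records, for each product, the shift of $l_a^{\#}$ and $l_c^{\#}$ and the powers of $t_n^2$ it can carry. The bookkeeping (the `TERMS` tuple and the function name) is ours and only encodes what is stated in the text: the first and second terms carry one factor of $t_n^2$, the third term carries zero or one.

```python
from itertools import product

# (label, shift of l_a#, shift of l_c#, possible powers of t_n^2 in the coefficient)
TERMS = (
    ("first", 0, -1, (1,)),
    ("second", -1, -1, (1,)),
    ("third", 0, -2, (0, 1)),
)


def classify_products():
    """Return one record per product of bracket terms (x, y, z directions)."""
    records = []
    for combo in product(TERMS, repeat=3):
        la_shift = sum(term[1] for term in combo)
        lc_shift = sum(term[2] for term in combo)
        # All values of n (total power of t_n^2) this product can contribute to.
        n_values = {sum(powers) for powers in product(*(term[3] for term in combo))}
        has_third = any(term[0] == "third" for term in combo)
        # Eq. (A2): l_c# = l_c - n, i.e. the l_c shift must equal -n.
        satisfies_a2 = any(lc_shift == -n for n in n_values)
        records.append((has_third, la_shift, lc_shift, n_values, satisfies_a2))
    return records


records = classify_products()
with_third = [rec for rec in records if rec[0]]
print(len(records), len(with_third))
print(all(not rec[4] for rec in with_third))
```

Of the 27 products, the 19 that contain at least one third term all fail Eq. (A2), while the remaining 8 products satisfy it with n = 3.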

APPENDIX B: A RIGOROUS UPPER BOUND FOR PRIMITIVE THREE-CENTER ERIs

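It is possible to construct an exact prescreening scheme for the primitive integrals based on the Schwarz inequality,

$$|(\boldsymbol{L}_a\boldsymbol{L}_b|\boldsymbol{L}_c)|^2 \le |(\boldsymbol{L}_a\boldsymbol{L}_b|\boldsymbol{L}_a\boldsymbol{L}_b)|\,|(\boldsymbol{L}_c|\boldsymbol{L}_c)|, \tag{B1}$$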

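by giving upper bounds to the integrals on the right-hand side of Eq. (B1). In fact, the exact value of $(\boldsymbol{L}_c|\boldsymbol{L}_c)$ can be simply calculated by using Eq. (12) and noting that in this special case $\boldsymbol{R}_{PC} = \mathbf{0}$, which gives

$$(\boldsymbol{L}_c|\boldsymbol{L}_c) = \frac{L_c!}{(4c)^{L_c}}\,(\mathbf{0}|\mathbf{0})^{(L_c)} = \frac{L_c!}{(4c)^{L_c}}\,\theta_{cc}\,\frac{1}{2L_c+1}, \tag{B2}$$

where it was also exploited that $F_n(0) = 1/(2n+1)$ (Ref. 17). To gain an upper bound for $|(\boldsymbol{L}_a\boldsymbol{L}_b|\boldsymbol{L}_a\boldsymbol{L}_b)|$, we have to track back the recursions necessary to build up this integral. Let us first define the maximum absolute value component of $\boldsymbol{R}_{AB}$ as

$$mR_{AB} = \max(|X_{AB}|, |Y_{AB}|, |Z_{AB}|). \tag{B3}$$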

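Then, by Eq. (13), an upper bound for $|(\boldsymbol{L}_a\boldsymbol{L}_b|\boldsymbol{L}_a\boldsymbol{L}_b)|$ is

$$|(\boldsymbol{L}_a\boldsymbol{L}_b|\boldsymbol{L}_a\boldsymbol{L}_b)| \le \left[\sum_{l_b=0}^{L_b} \binom{L_b}{l_b}\, mR_{AB}^{\,l_b}\right] M_{L_aL_bl_b} = U_{\mathrm{HRR}}\, M_{L_aL_bl_b}, \tag{B4}$$

where $M_{L_aL_bl_b}$ is a value that is greater than the absolute value of any of the integrals $(\boldsymbol{L}_a\boldsymbol{L}_b|\boldsymbol{l}_b\mathbf{0})$ for $L_a \le l_b \le L_a+L_b$. Proceeding in the same manner for the bra side, we get

$$|(\boldsymbol{L}_a\boldsymbol{L}_b|\boldsymbol{L}_a\boldsymbol{L}_b)| \le U_{\mathrm{HRR}}^2\, M_{l_al_b}, \tag{B5}$$

where, similarly, $M_{l_al_b}$ is an upper bound for $|(\boldsymbol{l}_a\mathbf{0}|\boldsymbol{l}_b\mathbf{0})|$ with $L_a \le l_a \le L_a+L_b$ and $L_a \le l_b \le L_a+L_b$. To get an upper bound for these types of integrals, we inspect the VRR for four-center ERIs (Ref. 19)

$$(\boldsymbol{l}_a\mathbf{0}|[\boldsymbol{l}_b+\mathbf{1}_x]\mathbf{0})^{(n)} = X_{PA}\,(\boldsymbol{l}_a\mathbf{0}|\boldsymbol{l}_b\mathbf{0})^{(n)} + \frac{i_b}{2p}\,(\boldsymbol{l}_a\mathbf{0}|[\boldsymbol{l}_b-\mathbf{1}_x]\mathbf{0})^{(n)} - \frac{i_b}{4p}\,(\boldsymbol{l}_a\mathbf{0}|[\boldsymbol{l}_b-\mathbf{1}_x]\mathbf{0})^{(n+1)} + \frac{i_a}{4p}\,([\boldsymbol{l}_a-\mathbf{1}_x]\mathbf{0}|\boldsymbol{l}_b\mathbf{0})^{(n+1)}, \tag{B6}$$

which can be used to expand $(\boldsymbol{l}_a\mathbf{0}|\boldsymbol{l}_b\mathbf{0})$ type ERIs in $(\boldsymbol{l}_a\mathbf{0}|\mathbf{0}\mathbf{0})$ type ones. The highest number of terms in this expansion, $N_{\mathrm{VRR1}}$, will belong to $([L_a+L_b]\mathbf{0}|[L_a+L_b]\mathbf{0})$. We can then write

$$|(\boldsymbol{L}_a\boldsymbol{L}_b|\boldsymbol{L}_a\boldsymbol{L}_b)| \le U_{\mathrm{HRR}}^2\, N_{\mathrm{VRR1}}\, U_{\mathrm{VRR}}\, M_{l_a}, \tag{B7}$$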

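where

$$U_{\mathrm{VRR}} = \max\left[\, mR_{PA}^{\,L_a+L_b},\ \left(\frac{L_a+L_b}{2p}\right)^{\frac{L_a+L_b}{2}},\ 1 \right] \tag{B8}$$

is the biggest recursion coefficient that can occur, and $M_{l_a}$ is an upper bound for $|(\boldsymbol{l}_a\mathbf{0}|\mathbf{0}\mathbf{0})|$ with $0 \le l_a \le L_a+L_b$. $N_{\mathrm{VRR1}}$ can be given as

$$N_{\mathrm{VRR1}} = \sum_{m=0}^{\frac{L_a+L_b}{2}} \binom{L_a+L_b-m}{m}\, 2^{L_a+L_b-m}. \tag{B9}$$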

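It only remains to give an appropriate value of $M_{l_a}$, for which we use the VRR

$$([\boldsymbol{l}_a+\mathbf{1}_x]\mathbf{0}|\mathbf{0}\mathbf{0})^{(n)} = X_{PA}\,(\boldsymbol{l}_a\mathbf{0}|\mathbf{0}\mathbf{0})^{(n)} + \frac{i_a}{2p}\,([\boldsymbol{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{0}\mathbf{0})^{(n)} - \frac{i_a}{4p}\,([\boldsymbol{l}_a-\mathbf{1}_x]\mathbf{0}|\mathbf{0}\mathbf{0})^{(n+1)} \tag{B10}$$

to expand $([L_a+L_b]\mathbf{0}|\mathbf{0}\mathbf{0})$ in $N_{\mathrm{VRR2}}$ $(\mathbf{0}\mathbf{0}|\mathbf{0}\mathbf{0})^{(n)}$ type integrals, the greatest of which will be $\kappa_{ab}^2\theta_{pp}F_0(0) = \kappa_{ab}^2\theta_{pp}$. We then get

$$|(\boldsymbol{L}_a\boldsymbol{L}_b|\boldsymbol{L}_a\boldsymbol{L}_b)| \le U_{\mathrm{HRR}}^2\, N_{\mathrm{VRR1}}\, N_{\mathrm{VRR2}}\, U_{\mathrm{VRR}}^2\, \kappa_{ab}^2\, \theta_{pp} \tag{B11}$$

with

$$N_{\mathrm{VRR2}} = \sum_{m=0}^{\frac{L_a+L_b}{2}} \binom{L_a+L_b-m}{m}\, 2^{m}. \tag{B12}$$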

Note that $U_{\mathrm{HRR}}$ only depends on the inter-nuclear distances in the bra and on $L_b$; $N_{\mathrm{VRR1}}$ and $N_{\mathrm{VRR2}}$ only depend on $L_a + L_b$; and $mR_{PA} = (b/p)\, mR_{AB}$. If desired, a bound for integrals over spherical harmonic Gaussians can be given by multiplying the screening value by $(2L_a+1)(2L_b+1)(2L_c+1)$ and the maximal coefficients in Eq. (5) for the three shells. In our experience, if we neglect this, the integrals that are falsely discarded have the same magnitude as the tolerance. Applying the scheme described above, roughly an extra 5% and 10% of the integrals are calculated with respect to the approaches presented in Sec. II F for the TZ and QZ bases, respectively, and the wall times increase by about 10%.
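
The term counts of Eqs. (B9) and (B12) can be cross-checked against a direct recursive count. In the sketch below, Eq. (B6) is read as producing two terms that lower $l_b$ by one and two terms that lower it by two, and Eq. (B10) as producing one term that lowers $l_a$ by one and two that lower it by two. This reading, the function names, and the integer (floor) upper limit of the sums are our assumptions.

```python
from functools import lru_cache
from math import comb


def n_vrr1(l_sum):
    """Closed form of Eq. (B9) for l_sum = L_a + L_b (upper limit taken as l_sum // 2)."""
    return sum(comb(l_sum - m, m) * 2 ** (l_sum - m) for m in range(l_sum // 2 + 1))


def n_vrr2(l_sum):
    """Closed form of Eq. (B12) for l_sum = L_a + L_b (upper limit taken as l_sum // 2)."""
    return sum(comb(l_sum - m, m) * 2 ** m for m in range(l_sum // 2 + 1))


@lru_cache(maxsize=None)
def count_terms(l, n_single, n_double):
    """Number of terms when each step down by one has n_single terms and by two has n_double."""
    if l < 0:
        return 0
    if l == 0:
        return 1
    return (n_single * count_terms(l - 1, n_single, n_double)
            + n_double * count_terms(l - 2, n_single, n_double))


for l_sum in range(0, 11):
    print(l_sum, n_vrr1(l_sum), count_terms(l_sum, 2, 2),
          n_vrr2(l_sum), count_terms(l_sum, 1, 2))
```

Since both counts depend only on $L_a + L_b$, they can be tabulated once and reused for every primitive pair.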

We note that an upper bound can also be derived directly for the (LaLb|Lc) integrals in a way similar to the one outlined here for (LaLb|LaLb), but the resulting scheme is less efficient due to the increased number of FLOPs and logical operations necessary inside the primitive loops.
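
Putting Eqs. (B1), (B2), and (B11) together, a screening test for a primitive shell triplet could look like the sketch below. It is a minimal illustration rather than the implementation used in this work: all function and argument names are ours, $p = a + b$ is assumed, and $\kappa_{ab}$, $\theta_{pp}$, and $\theta_{cc}$ are passed in as precomputed numbers because their definitions are given elsewhere in the paper. The exponent $(L_a+L_b)/2$ in Eq. (B8) is taken literally, and the sums in Eqs. (B9) and (B12) run up to the integer part of $(L_a+L_b)/2$.

```python
from math import comb, factorial, sqrt


def u_hrr(l_b, m_r_ab):
    """Bracketed sum of Eq. (B4); equals (1 + mR_AB)**L_b by the binomial theorem."""
    return sum(comb(l_b, k) * m_r_ab ** k for k in range(l_b + 1))


def u_vrr(l_sum, p, m_r_pa):
    """Largest recursion coefficient of Eq. (B8)."""
    return max(m_r_pa ** l_sum, (l_sum / (2.0 * p)) ** (l_sum / 2.0), 1.0)


def bra_bound(l_a, l_b, a, b, r_ab, kappa_ab, theta_pp):
    """Upper bound for |(L_a L_b|L_a L_b)| following Eq. (B11)."""
    p = a + b  # assumption: p is the sum of the two bra exponents
    m_r_ab = max(abs(x) for x in r_ab)  # Eq. (B3)
    m_r_pa = b / p * m_r_ab
    l_sum = l_a + l_b
    n_vrr1 = sum(comb(l_sum - m, m) * 2 ** (l_sum - m) for m in range(l_sum // 2 + 1))
    n_vrr2 = sum(comb(l_sum - m, m) * 2 ** m for m in range(l_sum // 2 + 1))
    return (u_hrr(l_b, m_r_ab) ** 2 * n_vrr1 * n_vrr2
            * u_vrr(l_sum, p, m_r_pa) ** 2 * kappa_ab ** 2 * theta_pp)


def ket_value(l_c, c, theta_cc):
    """Exact (L_c|L_c) from Eq. (B2)."""
    return factorial(l_c) / (4.0 * c) ** l_c * theta_cc / (2 * l_c + 1)


def schwarz_bound(l_a, l_b, l_c, a, b, c, r_ab, kappa_ab, theta_pp, theta_cc):
    """Upper bound for |(L_a L_b|L_c)| from Eq. (B1)."""
    return sqrt(bra_bound(l_a, l_b, a, b, r_ab, kappa_ab, theta_pp)
                * ket_value(l_c, c, theta_cc))


def can_skip(bound, tolerance):
    """A primitive triplet may be discarded when its bound is below the tolerance."""
    return bound < tolerance
```

A call such as `can_skip(schwarz_bound(...), 1e-10)` would then sit inside the primitive loops, where the tolerance value is only a placeholder.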

REFERENCES

  • 1. Boys S. F. and Shavitt I., University of Wisconsin Naval Research Laboratory Report No. WIS-AF-13, 1959.
  • 2. Baerends E. J., Ellis D. E., and Ros P., Chem. Phys. 2, 41 (1973). doi: 10.1016/0301-0104(73)80059-x
  • 3. Whitten J. L., J. Chem. Phys. 58, 4496 (1973). doi: 10.1063/1.1679012
  • 4. Dunlap B. I., Connolly J. W. D., and Sabin J. R., J. Chem. Phys. 71, 3396 (1979). doi: 10.1063/1.438728
  • 5. Dunlap B. I., Phys. Chem. Chem. Phys. 2, 2113 (2000). doi: 10.1039/b000027m
  • 6. Almlöf J., Fægri K., Jr., and Korsell K., J. Comput. Chem. 3, 385 (1982). doi: 10.1002/jcc.540030314
  • 7. Häser M. and Ahlrichs R., J. Comput. Chem. 10, 104 (1989). doi: 10.1002/jcc.540100111
  • 8. Weigend F., Phys. Chem. Chem. Phys. 4, 4285 (2002). doi: 10.1039/b204199p
  • 9. Nagy P. R., Samu G., and Kállay M., J. Chem. Theory Comput. 12, 4897 (2016). doi: 10.1021/acs.jctc.6b00732
  • 10. King H. F. and Dupuis M., J. Comput. Phys. 21, 144 (1976). doi: 10.1016/0021-9991(76)90008-5
  • 11. Dupuis M., Rys J., and King H. F., J. Chem. Phys. 65, 111 (1976). doi: 10.1063/1.432807
  • 12. Rys J., Dupuis M., and King H. F., J. Comput. Chem. 4, 154 (1983). doi: 10.1002/jcc.540040206
  • 13. Komornicki A. and King H. F., J. Chem. Phys. 134, 244115 (2011). doi: 10.1063/1.3600745
  • 14. King H. F., J. Phys. Chem. A 120, 9348 (2016). doi: 10.1021/acs.jpca.6b10004
  • 15. Dupuis M. and Marquez A., J. Chem. Phys. 114, 2067 (2001). doi: 10.1063/1.1336541
  • 16. Dupuis M., Comput. Phys. Commun. 134, 150 (2001). doi: 10.1016/s0010-4655(00)00195-8
  • 17. Helgaker T., Jørgensen P., and Olsen J., Molecular Electronic Structure Theory (Wiley, Chichester, 2000).
  • 18. McMurchie L. E. and Davidson E. R., J. Comput. Phys. 26, 218 (1978). doi: 10.1016/0021-9991(78)90092-x
  • 19. Obara S. and Saika A., J. Chem. Phys. 84, 3963 (1986). doi: 10.1063/1.450106
  • 20. Honda H., Yamaki T., and Obara S., J. Chem. Phys. 117, 1457 (2002). doi: 10.1063/1.1485958
  • 21. Head-Gordon M. and Pople J. A., J. Chem. Phys. 89, 5777 (1988). doi: 10.1063/1.455553
  • 22. Ryu U., Lee Y. S., and Lindh R., Chem. Phys. Lett. 185, 562 (1991). doi: 10.1016/0009-2614(91)80260-5
  • 23. Johnson B. G., Gill P. M. W., and Pople J. A., Chem. Phys. Lett. 206, 229 (1993). doi: 10.1016/0009-2614(93)85546-z
  • 24. Hamilton T. P. and Schaefer H. F. III, Chem. Phys. 150, 163 (1991). doi: 10.1016/0301-0104(91)80126-3
  • 25. Lindh R., Ryu U., and Liu B., J. Chem. Phys. 95, 5889 (1991). doi: 10.1063/1.461610
  • 26. Lindh R., Theor. Chim. Acta 85, 423 (1993). doi: 10.1007/bf01112982
  • 27. Gill P. M. W., Head-Gordon M., and Pople J. A., Int. J. Quantum Chem. 36, 269 (1989). doi: 10.1002/qua.560360831
  • 28. Johnson B. G., Gill P. M. W., and Pople J. A., Int. J. Quantum Chem. 40, 809 (1991). doi: 10.1002/qua.560400610
  • 29. Gill P. M. W., Johnson B. G., and Pople J. A., Chem. Phys. Lett. 217, 65 (1994). doi: 10.1016/0009-2614(93)e1340-m
  • 30. Johnson B. G., Gill P. M. W., Pople J. A., and Fox D. J., Chem. Phys. Lett. 206, 239 (1993). doi: 10.1016/0009-2614(93)85547-2
  • 31. Gill P. M. W., Head-Gordon M., and Pople J. A., J. Phys. Chem. 94, 5564 (1990). doi: 10.1021/j100377a031
  • 32. Gill P. M. W. and Pople J. A., Int. J. Quantum Chem. 40, 753 (1991). doi: 10.1002/qua.560400605
  • 33. Gill P. M. W. and Johnson B. G., Int. J. Quantum Chem. 40, 745 (1991). doi: 10.1002/qua.560400604
  • 34. Gill P. M. W., Adv. Quantum Chem. 25, 141 (1994). doi: 10.1016/s0065-3276(08)60019-2
  • 35. Köster A. M., J. Chem. Phys. 104, 4114 (1996). doi: 10.1063/1.471224
  • 36. Köster A. M., J. Chem. Phys. 118, 9943 (2003). doi: 10.1063/1.1571519
  • 37. Calaminici P., Domínguez-Soria V. D., Geudtner G., Hernández-Marín E., and Köster A. M., Theor. Chem. Acc. 115, 221 (2006). doi: 10.1007/s00214-005-0005-0
  • 38. Reine S., Tellgren E., and Helgaker T., Phys. Chem. Chem. Phys. 9, 4771 (2007). doi: 10.1039/b705594c
  • 39. Reine S., Helgaker T., and Lindh R., Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2, 290 (2012). doi: 10.1002/wcms.78
  • 40. Ahlrichs R., Phys. Chem. Chem. Phys. 6, 5119 (2004). doi: 10.1039/b413539c
  • 41. Valeev E. F., Libint: A library for the evaluation of molecular integrals of many-body operators over Gaussian functions, http://libint.valeyev.net/.
  • 42. Valeev E. F. and Janssen C. L., J. Chem. Phys. 121, 1214 (2004). doi: 10.1063/1.1759319
  • 43. Werner H.-J., Knizia G., and Manby F. R., Mol. Phys. 109, 407 (2011). doi: 10.1080/00268976.2010.526641
  • 44. Shao Y. and Head-Gordon M., Chem. Phys. Lett. 323, 425 (2000). doi: 10.1016/s0009-2614(00)00524-8
  • 45. Sodt A., Subotnik J. E., and Head-Gordon M., J. Chem. Phys. 125, 194109 (2006). doi: 10.1063/1.2370949
  • 46. Polly R., Werner H.-J., Manby F. R., and Knowles P. J., Mol. Phys. 102, 2311 (2004). doi: 10.1080/0026897042000274801
  • 47. Reine S., Tellgren E., Krapp A., Kjærgaard T., Helgaker T., Jansik B., Høst S., and Salek P., J. Chem. Phys. 129, 104101 (2008). doi: 10.1063/1.2956507
  • 48. Manzer S. F., Epifanovsky E., and Head-Gordon M., J. Chem. Theory Comput. 11, 518 (2014). doi: 10.1021/ct5008586
  • 49. Mejía-Rodríguez D. and Köster A. M., J. Chem. Phys. 141, 124114 (2014). doi: 10.1063/1.4896199
  • 50. Mejía-Rodríguez D., Huang X., del Campo J. M., and Köster A. M., Adv. Quantum Chem. 71, 41 (2015). doi: 10.1016/bs.aiq.2015.03.009
  • 51. Köppl C. and Werner H.-J., J. Chem. Theory Comput. 12, 3122 (2016). doi: 10.1021/acs.jctc.6b00251
  • 52. Sierka M., Hogekamp A., and Ahlrichs R., J. Chem. Phys. 118, 9136 (2003). doi: 10.1063/1.1567253
  • 53. Alvarez-Ibarra A. and Köster A. M., J. Chem. Phys. 139, 024102 (2013). doi: 10.1063/1.4812183
  • 54. Alvarez-Ibarra A. and Köster A. M., Mol. Phys. 113, 3128 (2015). doi: 10.1080/00268976.2015.1078009
  • 55. Ishida K., J. Chem. Phys. 98, 2176 (1993). doi: 10.1063/1.464196
  • 56. Flocke N. and Lotrich V., J. Comput. Chem. 29, 2722 (2008). doi: 10.1002/jcc.21018
  • 57. Dunning T. H., Jr., J. Chem. Phys. 90, 1007 (1989). doi: 10.1063/1.456153
  • 58. Woon D. E. and Dunning T. H., Jr., J. Chem. Phys. 98, 1358 (1993). doi: 10.1063/1.464303
  • 59. Weigend F., Köhn A., and Hättig C., J. Chem. Phys. 116, 3175 (2002). doi: 10.1063/1.1445115
  • 60. Hollman D. S., Schaefer H. F. III, and Valeev E. F., J. Chem. Phys. 142, 154106 (2015). doi: 10.1063/1.4917519
  • 61. MRCC, a quantum chemical program suite written by Kállay M., Rolik Z., Csontos J., Ladjánszki I., Szegedy L., Ladóczki B., Samu G., Petrov K., Farkas M., Nagy P., Mester D., and Hégely B., see also Ref. 71 as well as http://www.mrcc.hu/.
  • 62. Neese F., Hansen A., and Liakos D. G., J. Chem. Phys. 131, 064103 (2009). doi: 10.1063/1.3173827
  • 63. Helgaker T., Gauss J., Jørgensen P., and Olsen J., J. Chem. Phys. 106, 6430 (1997). doi: 10.1063/1.473634
  • 64. Weidendorfer J., Kowarschik M., and Trinitis C., in Proceedings of the 4th International Conference on Computational Science (ICCS 2004), Krakow, Poland, 2004.
  • 65. Wheeler J. C., Rocky Mt. J. Math. 4, 287 (1974). doi: 10.1216/rmj-1974-4-2-287
  • 66. Golub G. H. and Welsch J. H., Math. Comput. 23, 221 (1969). doi: 10.1090/s0025-5718-69-99647-1
  • 67. Doser B., Lambrecht D. S., Kussmann J., and Ochsenfeld C., J. Chem. Phys. 130, 064107 (2009). doi: 10.1063/1.3072903
  • 68. Schütz M., Hetzer G., and Werner H.-J., J. Chem. Phys. 111, 5691 (1999). doi: 10.1063/1.479957
  • 69. Eshuis H., Yarkony J., and Furche F., J. Chem. Phys. 132, 234114 (2010). doi: 10.1063/1.3442749
  • 70. Hári J., Polyák P., Mester D., Mitušík M., Omastová M., Kállay M., and Pukánszky B., Appl. Clay Sci. 132, 167 (2016). doi: 10.1016/j.clay.2016.06.001
  • 71. Rolik Z., Szegedy L., Ladjánszki I., Ladóczki B., and Kállay M., J. Chem. Phys. 139, 094105 (2013). doi: 10.1063/1.4819401


Articles from The Journal of Chemical Physics are provided here courtesy of American Institute of Physics
