**CSE8803 / ME 8883 Materials Informatics**

Fall 2016 Georgia Tech

*Anh Tran and Zhiyu Wang*

- High-dimensional PES metamodel via DFT - Blog 5: Battling against the curse of dimensionality using permutation in material science (Group:WangTran)
- High-dimensional PES metamodel via DFT - Blog 6: Potential energy surface (PES) meta-modeling and the search for minimum energy path (MEP) (Group:WangTran)
- High-dimensional PES metamodel via DFT - Blog 3: Reduce the curse of dimensionality on meta-modeling (Group:WangTran)
- High-dimensional PES metamodel via DFT - Blog 2: High-dimensional kriging with modifications (Group:WangTran) - Report 21Sep16
- High-dimensional PES metamodel via DFT - Blog 1: Data acquisition (Group:WangTran) - Report 13Sep2016
- High-dimensional PES metamodel via DFT - Blog 7: Simulating and searching for the MEP on PES (Group:WangTran)
- High-dimensional PES metamodel via DFT - Blog 4: Searching for local minima and saddle points algorithm with kriging (Group: WangTran)
- High-dimensional PES metamodel via DFT - Blog 8: Searching and refining based on Kriging metamodel (Group:WangTran)
- High-dimensional PES metamodel via DFT - Blog 9: The limitations of metamodel (Group:WangTran) - Report 19Nov2016
- High-dimensional PES metamodel via DFT - Front Page

**Motivation**

For quantum chemist, materials engineers who are concerned about Fermi energy band gap, potential energy surface is surely not a new concept. In fact, there are many different approaches including density functional theory (DFT), quantum Monte Carlo (QMC) methods, classical potentials, and Ising models (including cluster expansion, 2D square-lattice). Potential energy surface (PES) is indeed a mathematical function, that maps from all atomistic velocities and positions to yield a potential energy, as described in Equation (1). Typically the problem setting is a little simpler, called ground state search, by assuming the atomistic velocities are exactly 0 and 0K temperature. Apparently, the PES is a high-dimensional function ($$6n$$), but by assuming the velocities go away, we have reduced it to $$3N$$. For a small MD simulation containing $$\sim 5000$$ atoms, the dimensionality is 15000, yet still considered small in simulation. High-dimensionality has a curse, so-called the curse of dimensionality. Take an example, if you have 2 data points (only 2) for each dimensions, for a 10D problem it would take $$2^{10}$$ data points (typically obtained by numerical simulation or experimental data) to get a very crude approximation. As a matter of fact, in design of experiments, there are bunch of parameters that control the experimental settings but we cannot include all due to the curse. The same thing applies for numerical simulation, where there are also bunch of parameters that affects the output (and typically we try to ignore those effects by telling ourselves it’s OK to change one parameters at a time).

$$ \text{ground state potential energy} = f (\{v_i,x_i\}_{i=1}^N ) \quad \text{ where } N \text{ is the number of atoms}$$

Here this study is concerned with the hydrogen embrittlement in bcc iron. The ratio set for DFT calculation is $$\frac{1}{8}$$ between hydrogen and iron, i.e. $$ \text{Fe}_8\text{H} $$. Searching for the metastable and stable ground state configurations is in fact one of the most important objectives for designing new materials. Those states correspond to the local minima and saddle points on the PES, and there are many lines connecting them that represent the transition pathways. Among all these lines, there is one line connecting from a given initial state to a given final state, that requires minimum activation energy, called minimum energy path (MEP). Searching for this line is equally important, because materials are changing structure morphology along this line, so we desperately want to know where it is.

That being said, our goals are twofold: (1) search for saddle points and local minima (2) connect them together to make a path. Therefore, it belongs to the structure-properties linkage. Structure is the geometric configuration of the materials, and properties is the ground-state potential energy. Because the dimensionality is high (27 for our system), the searching algorithm takes a long time to finish. Even worse, the algorithm uses real DFT calculation to collect the data points. Exhausting DFT calculation is OK, but we can do better but a supporting metamodel to predict and interpolate DFT calculation. The steps of searching algorithms can be roughly summarized as: – First: local 2 local minima by conjugate gradient method – Second: breaking a curve successively to locate more local minima between those 2 points – Third: climbing process to locate saddle point

Material configuration changes along the transition pathway

The transition pathways is described by the PES and a reaction coordinate. The reaction coordinate is a chemistry terminology to describe the configuration change of the materials.

The main motivation for metamodel is the acceleration in high-dimensional space. By imposing many thresholds, DFT calculations can be replaced by a kriging metamodel if there is sufficient amount of data. Later on, the searching algorithm switches back to DFT calculations to feed more data and recalibrate itself. This fed-forward scheme is quite popular in machine learning. Kriging, also known as Gaussian process regression, is more well known in design optimization community than others. The technique was first invented for geostatistics purposes, but later on become widely used due to its rigorous mathematical foundation.

**The breakdown of classical approach in Big Data**

Like any approach, metamodel also has its Achilles heel. The bottleneck of kriging is its expensive computation of inverse covariance matrix. We had a classical kriging metamodel to support the searching algorithm running on PACE for 19 days before breaking down at 207GB of physical memory. The cause of the issue is 9579 data points, which correspond to 9579 $$\times$$ 9579 covariance matrix. Computing the inverse of this matrix requires $$10^{12}$$, 1 tetraFLOP, which is not trivial.

Forensic scene of a failing expensive numerical experiment.

At this point, we are currently working on a cluster kriging approach, where database are broken into finitely many cluster. Since kriging predictor is normal distributed, a weighted average predictor from each cluster would also yield a normally distributed global predictor. A little more about our current stage: Big Data. Big Data is a trending terminology to characterize big data, where high-dimensionality and large sample data set are two main problems. High-dimensionality causes 1) noise accumulation 2) spurious correlations, and 3) incidental homogeneity. The high-dimensionality and large dataset problem together causes 1) high computational cost and 2) algorithmic instability. Big Data problems arise in many fields, such as biostatistics, genomics, social (media) network and engineering. Machine learning, or more simply fed-forward scheme to perform classification/regression is a promising approach toward Big Data. Local Gaussian regression is one of the solutions for our case. This is an ongoing work.