Best of both worlds: Large, cheap data sets enhance machine learning of high-value properties of ordered and disordered materials


It is a truism that nothing in life is free, and the same principle applies to materials properties. Generally, the more accurate a method for computing or measuring a property, the more expensive it is. That is why, despite maturing software and exponentially growing computing power, large databases of materials properties are still based primarily on cheap but less accurate semi-local density functional theory (DFT) functionals. Data sets based on higher-accuracy DFT methods and experimental measurements tend to be orders of magnitude smaller and less diverse in coverage. This scarcity and heterogeneity of high-quality data is a critical bottleneck in the development of machine learning (ML) models for accurate materials property prediction.

Our idea for addressing this fundamental trade-off is an extraordinarily simple one: by combining data from multiple fidelities to train a single model, we can leverage the large low-fidelity data set so that the model learns better latent representations of materials, which in turn leads to more accurate predictions on small high-fidelity data sets. Our model of choice is the MatErials Graph Network (MEGNet) framework[1], a deep learning approach that naturally represents the atoms and bonds in a material as the nodes and edges of a mathematical graph. Information flows between connected nodes and edges via graph convolutional layers, mimicking the atomic interactions in a real-world material. In addition, the MEGNet architecture incorporates a global state input, which provides a conduit for encoding fidelity information.

Figure 1. Multi-fidelity graph networks. (a) Representation of a material in a graph network model, with atoms as the nodes and bonds as the edges, coupled with a structure-independent global state. The fidelity of each data point is encoded as an integer. (b) Graph convolution layers pass information between connected nodes and edges.
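To make the fidelity encoding concrete, here is a minimal sketch of how data sets of different fidelities might be merged into a single training set, with the fidelity stored as an integer that selects a learnable global state vector. It is an illustration under stated assumptions, not the MEGNet implementation: the integer codes, the function names `merge_fidelities` and `init_state_embedding`, the state dimension of 4, and the toy records are all hypothetical.

```python
import random

# Assumed integer code for each fidelity; the actual assignment is not given in this post.
FIDELITY_CODE = {"PBE": 0, "GLLB-SC": 1, "HSE": 2, "SCAN": 3, "Exp": 4}


def merge_fidelities(datasets):
    """Flatten {fidelity: [(structure_id, band_gap), ...]} into one list of samples.

    Each sample keeps its target value and the integer fidelity code, so a
    single model can be trained on all fidelities at once.
    """
    merged = []
    for fidelity, records in datasets.items():
        code = FIDELITY_CODE[fidelity]
        for structure_id, band_gap in records:
            merged.append(
                {"structure": structure_id, "fidelity": code, "band_gap": band_gap}
            )
    return merged


def init_state_embedding(n_fidelities, dim, seed=0):
    """Create one small random vector per fidelity (a stand-in for a trainable embedding)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_fidelities)]


# Placeholder records, invented purely for illustration (not real band gaps).
toy = {
    "PBE": [("structure-A", 1.0), ("structure-B", 2.0)],
    "Exp": [("structure-A", 1.5)],
}
samples = merge_fidelities(toy)
state_table = init_state_embedding(len(FIDELITY_CODE), dim=4)

# The integer code selects the global state vector fed to the model with each graph.
state_vectors = [state_table[s["fidelity"]] for s in samples]
```

In this sketch the same structure can appear at several fidelities with different targets; only the global state tells the model which fidelity a given target belongs to.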

The effectiveness of our approach is summarized in Figure 2. The low-fidelity data set comprises more than 50,000 band gaps calculated using the standard semi-local PBE functional. The high-fidelity data sets comprise ~1,000-3,000 band gaps computed using the more accurate GLLB-SC, HSE, and SCAN functionals or obtained from experimental measurements. The multi-fidelity (2-fi, 4-fi, and 5-fi) models achieve significantly lower mean absolute errors (~20-40% reduction) than the single-fidelity (1-fi) models.

Figure 2. Performance of graph network models with different fidelity combinations. The 4-fi models used the PBE, GLLB-SC, HSE and Exp data sets, i.e., the very small SCAN data set is excluded. All errors were obtained on the test set of the corresponding fidelity. The error bars show one standard deviation.
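For readers who want to reproduce this kind of comparison on their own models, the two quantities involved are the mean absolute error (MAE) on a test set and the relative reduction of that error with respect to a single-fidelity baseline. The sketch below shows both; the function names are hypothetical and the numbers are placeholders chosen for easy arithmetic, not results from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |target - prediction| over a test set."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(mae_baseline, mae_model):
    """Fraction by which the model lowers the baseline error (0.3 means 30% lower)."""
    return (mae_baseline - mae_model) / mae_baseline


# Placeholder values only: three hypothetical targets and predictions.
toy_mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])

# Hypothetical baseline and multi-fidelity errors in the same units.
toy_reduction = relative_reduction(0.5, 0.35)
```

With these placeholder inputs, the absolute errors are 0.5, 0.0 and 1.0, so the MAE is 0.5, and going from 0.5 to 0.35 is a 30% reduction.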

Serendipity led us to one of the other potentially transformative features of the MEGNet approach. In the MEGNet framework, each element is represented by a learned length-16 embedding vector. The correlations between the embedding vectors of different elements reproduce the chemical trends in the periodic table. We found that interpolating these learned embedding vectors provides a way to model disordered materials, i.e., materials with sites occupied by more than one element and/or vacancies. While the bulk of computational and machine learning work has focused on ordered materials, disordered compounds actually form the majority of known materials. Using this approach, multi-fidelity graph network models can reproduce trends in the band gaps of disordered materials to reasonable accuracy (Figure 3).

Figure 3. Performance of disordered multi-fidelity graph network models. Predicted and experimental band gaps Eg as a function of composition variable x in (a) AlxGa1-xN, (b) ZnxCd1-xSe and (c) MgxZn1-xO. (d) Comparison of the change in band gap with respect to Lu3Al5O12 (ΔEg) with x in Lu3(GaxAl1-x)5O12. The error bars indicate one standard deviation.
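One simple way to picture the interpolation of element embeddings is as an occupancy-weighted average of the vectors of the species sharing a site. The sketch below takes that reading as an assumption, including the choice that a vacancy contributes nothing to the sum; the function name `disordered_site_embedding` and the toy length-16 vectors are hypothetical and are not learned MEGNet embeddings.

```python
def disordered_site_embedding(occupancies, element_embeddings, tol=1e-8):
    """Occupancy-weighted sum of element embedding vectors for one site.

    occupancies: {"Al": 0.25, "Ga": 0.75}; a total below 1 is treated as a
    partial vacancy that contributes a zero vector (an assumption of this sketch).
    """
    total = sum(occupancies.values())
    if any(occ < 0 for occ in occupancies.values()) or total > 1.0 + tol:
        raise ValueError("occupancies must be non-negative and sum to at most 1")

    dim = len(next(iter(element_embeddings.values())))
    site = [0.0] * dim
    for element, occ in occupancies.items():
        for i, value in enumerate(element_embeddings[element]):
            site[i] += occ * value
    return site


# Toy length-16 vectors standing in for learned embeddings.
toy_embeddings = {"Al": [1.0] * 16, "Ga": [0.0] * 16}

# A mixed Al/Ga site with x = 0.25, as in AlxGa1-xN.
mixed_site = disordered_site_embedding({"Al": 0.25, "Ga": 0.75}, toy_embeddings)
```

For an ordered site, where a single element has occupancy 1, the function returns that element's embedding unchanged, so ordered and disordered structures can pass through the same model.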

If you are interested in more details, please refer to our paper in Nature Computational Science, “Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data”[2], available at https://doi.org/10.1038/s43588-020-00002-x


[1] Chen, C.; Ye, W.; Zuo, Y.; Zheng, C.; Ong, S. P. Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. Chemistry of Materials 2019, 31(9), 3564-3572. doi:10.1021/acs.chemmater.9b01294

[2] Chen, C.; Zuo, Y.; Ye, W.; Li, X. G.; Ong, S. P. Learning Properties of Ordered and Disordered Materials from Multi-fidelity Data. Nature Computational Science 2020. doi:10.1038/s43588-020-00002-x

Chi Chen

Assistant Project Scientist, University of California San Diego