## Morphology control in auto-assembly of Zinc meso-tetra (4-pyridyl) porphyrin (ZnTPyP) – Executive Summary

by Hyung Nun (Jonny) Kim and Yongshuai (Eric) Wang

Materials Informatics Course Project, Fall 2016

Data provided by Hongyou Fan at Sandia National Laboratory

Professor: Surya R. Kalidindi

## Introduction

# What are porphyrins?

A general porphyrin skeletal formula is shown below (far left):

The ‘M’ site is flexible: as long as the chemical bonding allows it, the porphyrin can be chelated with many different elements, and the choice of element affects how the compound interacts with incoming light. Two naturally occurring porphyrins are:

**1) Chlorophyll:** this magnesium chelated porphyrin is able to absorb light in the red/blue spectrum, and reflect light in the green spectrum (hence, green pigmentation in plants).

**2) Heme:** this iron chelated porphyrin is present in red blood cells. These porphyrins have a unique functionality in their ability to transport oxygen in blood circulation. The chemical bonds resulting from iron-chelation reflect red light (hence, red pigmentation in blood).

This study is motivated by the existence of such efficient photonic devices.

Our project’s material of interest is Zinc meso-tetra (4-pyridyl) porphyrin (ZnTPyP). A ZnTPyP molecule is approximately 1.6 nm X 1.6 nm X 0.5 nm in size and is highly stable. ZnTPyP has great potential in optoelectronics (electronics that are capable of sourcing, detecting, or controlling light) and nanotechnology (it can be assembled into nano-rods, wires, or tubes). We are interested in how well we can control the morphologies of these ZnTPyP nano-materials.

## Project Objective

In this project, we aim to construct a process-structure linkage for the ZnTPyP auto-synthesis process. A linkage is a reduced-order model that allows us to predict a structure for any process within the defined process space. Therefore, our inputs are the process variables involved, and our outputs are structure descriptors (in our case, principal component scores; this will be explained later in more detail).

# Process Description

The ZnTPyP described in this project is synthesized in a liquid solution as described below:

**Step 1)** ZnTPyP is not water-soluble. Therefore, it is protonated beforehand to form the tetrapyridinium cation $$\text{ZnTPyP-H}_4^{4+}$$, which is water-soluble. This solution is typically added first into the batch.

**Step 2)** Next, a solution of a catalyst called a surfactant is added into the mix. A surfactant (described in the figure below) is defined by its micellar structure with two discrete regions (hydrophilic exterior and hydrophobic interior).

**Step 3)** A solution of NaOH (basic aqueous solution) is lastly added to the mix. The presence of NaOH deprotonates the ZnTPyP, thereby making it water-insoluble once again. Due to the presence of surfactants in the liquid solution, the water-insoluble ZnTPyP are then forced to confine themselves within the hydrophobic surfactant interior simply due to interaction of charges. This process is analogous to interactions between water and oil-like species.

**Step 4)** Steps 2 and 3 are done while continuous stirring occurs. After the constituents are added, continuous stirring occurs for 48 additional hours.

**Step 5)** The liquid solution is dried, and the end result is ZnTPyP nano-materials of various morphologies.

These steps can be visualized in the figure below.

# Data Inspection

Through numerous experiments, different morphologies of ZnTPyP have been observed due to different concentrations of ZnTPyP, surfactant (sodium dodecyl sulfate, or SDS), or NaOH used. In the experiments, these values were logged as:

ZnTPyP = ZnTPyP concentration (mM)

Surfactant = SDS-to-ZnTPyP ratio

NaOH = acidity (pH)

The data from Sandia National Laboratory was collected by varying these 3 process parameters, and the resulting morphologies were observed in the SEM. These SEM images are our datasets.

A summary of all the data is given in table format below:

As can be seen, 3 different experiments were conducted. Each experiment had different independent and control variables so that their effects on the nanorod diameter and length could be observed. This kind of experimental setup is very intuitive, and the quantifiable structure measurements are well defined. However, throughout the course we learned an alternative means of quantifying structure data in the form of n-point statistics. n-point statistics capture not only diameter and length information but orientation information as well; this truly changed the dynamics of our data analysis. All of this will be discussed in significantly more detail later. To summarize, as data scientists, we ask a slightly different question than the experimentalists.

**Experimentalist:** How does pH, ZnTPyP concentration, or SDS:ZnTPyP ratio *individually* change the diameter and lengths of the nanorods?

**Materials Informatics:** How well can we predict the microstructure for *any given combination* of pH, ZnTPyP concentration, and SDS:ZnTPyP ratio?

Therefore, the way that we visualize the data is also slightly different, shown below:

Note that the SDS concentrations in samples 8 to 13 were converted to SDS:ZnTPyP ratios for consistency. We have a total of 15 unique processes. Although 4 images were taken at different viewfields per process, these images were taken at the same location. Therefore, many of the images were considered redundant; we used the largest-scale image for each process and treated that one image as representative.

# Example Images

A collection of all images can be downloaded here: Full Dataset

## Data Preparation

For any type of data analysis, the data must be well-defined. To determine how to describe our data, we must first consider what we want to do with it. We plan to use 2-point statistics to describe our data, and to use this technique, some prerequisites must be fulfilled. Firstly, all of our data must be scaled to equal pixel resolution for any meaningful analysis. Secondly, we must binarize our images. To understand why these prerequisites exist, it is necessary to understand how 2-point statistics work. For more information on 2-point statistics, refer to the links below:

Ahmet’s 2-point statistics Tutorial

# Image Rescaling

For spatial correlations to be meaningful, the resolution has to be consistent for all images. Image rescaling was a simple task in MATLAB (built-in function, imresize). All that was needed for rescaling was the desired pixel resolution. We decided to rescale all of the images to 10 nm / pixel, because the majority of the images were close to that resolution already.
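To make the rescaling step concrete, here is a minimal Python sketch of resampling an image to 10 nm / pixel. It is not the MATLAB code used in the project: the function name `rescale_to_resolution`, the example native resolution of 5 nm / pixel, and the nearest-neighbour lookup are all assumptions chosen for brevity (MATLAB's imresize interpolates bicubically by default).

```python
import numpy as np

def rescale_to_resolution(image, native_nm_per_px, target_nm_per_px=10.0):
    """Resample a 2-D image so that one pixel spans target_nm_per_px."""
    scale = native_nm_per_px / target_nm_per_px  # >1 upsamples, <1 downsamples
    rows, cols = image.shape
    new_rows = max(1, int(round(rows * scale)))
    new_cols = max(1, int(round(cols * scale)))
    # Nearest-neighbour lookup: map each output pixel centre back to the source grid.
    src_r = np.minimum(((np.arange(new_rows) + 0.5) / scale).astype(int), rows - 1)
    src_c = np.minimum(((np.arange(new_cols) + 0.5) / scale).astype(int), cols - 1)
    return image[np.ix_(src_r, src_c)]

# Hypothetical example: a 200 x 300 image acquired at 5 nm / pixel.
img = np.random.default_rng(0).random((200, 300))
out = rescale_to_resolution(img, native_nm_per_px=5.0)
print(out.shape)  # (100, 150)
```

The only input that matters is the ratio of native to target resolution, which is why knowing the desired pixel resolution (and each image's scale bar) is sufficient.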

# Image Binarization

Binarization of the images was a more challenging task. The reason we perform binarization is to clearly define material states. Consider the example image shown below:

We are interested in describing the ZnTPyP nanorods. We only have two states that we can really define, as follows:

1) ZnTPyP

2) Void (empty space)

We would like ZnTPyP to be defined as value 1 (white) and void space as value 0 (black). We can achieve this through strategic binarization. The main challenge in this step was to account for unwanted shadow gradients in many of the images. Various segmentation techniques were employed, and these techniques are listed with their sources:

# 1. ShadowRemover.m (Ahmet Cecen)

Shadow Remover does a parabolic surface fit of the background. This background is then subtracted from the original image to remove subtle shadow gradients.
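The idea of a parabolic background fit can be sketched in a few lines of Python. This is only an illustration of the technique described above, not a port of ShadowRemover.m: the function name and the choice of an ordinary least-squares fit over all pixels are assumptions.

```python
import numpy as np

def remove_parabolic_background(image):
    """Fit z = a + b*x + c*y + d*x^2 + e*x*y + f*y^2 by least squares and subtract it."""
    rows, cols = image.shape
    y, x = np.mgrid[0:rows, 0:cols]
    x = x.ravel() / max(cols - 1, 1)  # normalise coordinates to [0, 1] for conditioning
    y = y.ravel() / max(rows - 1, 1)
    A = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
    coeffs, *_ = np.linalg.lstsq(A, image.ravel().astype(float), rcond=None)
    background = (A @ coeffs).reshape(rows, cols)
    return image - background, background

# Synthetic example: a smooth quadratic shading with no nanorods in it.
rows, cols = 60, 80
yy, xx = np.mgrid[0:rows, 0:cols]
shading = 0.3 + 0.002 * xx + 0.00005 * (yy - 30) ** 2
flat, background = remove_parabolic_background(shading)
print(float(np.abs(flat).max()))  # close to zero: the shading is fully captured by the fit
```

On a real SEM image the bright nanorods also pull on the fit, so a smooth low-order surface like this is only suited to subtle gradients.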

# 2. BigDiskFilter.m (Ahmet Cecen)

Big Disk Filter applies a disk-shaped filter to the image. It essentially calculates a moving average of neighboring elements for every pixel which becomes the calculated background. This background is then subtracted from the original image. This technique is able to remove harsher shadow gradients than the formerly mentioned Shadow Remover.
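A moving-average background of this kind can be sketched in Python as follows. This is an illustration under stated assumptions rather than a port of BigDiskFilter.m: the function names, the reflected padding at the image border, and the example radius are hypothetical choices.

```python
import numpy as np

def disk_mean_background(image, radius):
    """Mean of all pixels within `radius` of each pixel (disk-shaped window)."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    # Reflect the border so that edge pixels also see a full disk (radius must be
    # smaller than the image dimensions for this padding mode).
    padded = np.pad(image, radius, mode="reflect")
    total = np.zeros_like(image)
    count = 0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            if dr * dr + dc * dc <= radius * radius:
                total += padded[radius + dr: radius + dr + rows,
                                radius + dc: radius + dc + cols]
                count += 1
    return total / count

def big_disk_filter(image, radius):
    """Subtract the disk-averaged background from the image."""
    return np.asarray(image, dtype=float) - disk_mean_background(image, radius)

# A flat image has no structure, so the filtered result is zero everywhere.
flat_image = np.full((30, 40), 0.6)
filtered = big_disk_filter(flat_image, radius=5)
print(float(np.abs(filtered).max()))
```

Because the background is local rather than a single global surface, the disk radius controls how sharp a gradient can be removed; it should stay well above the nanorod width so the rods themselves are not averaged away.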

# 3. imageSegmenter (Built-in MATLAB GUI)

This is a built-in MATLAB GUI that has many binarization capabilities. Otsu’s method was used to binarize the images. The results are shown below:
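For reference, Otsu's method picks the gray-level threshold that maximizes the between-class variance of the two resulting pixel populations. The Python sketch below shows that calculation directly; the function name, the 256-bin histogram, and the synthetic test image are assumptions, and the MATLAB GUI may differ in details such as binning.

```python
import numpy as np

def otsu_threshold(image, nbins=256):
    """Return the threshold that maximizes the between-class variance."""
    hist, edges = np.histogram(np.ravel(image), bins=nbins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                 # pixel count in bins 0..k (class 0)
    w1 = w0[-1] - w0                     # pixel count in bins k+1.. (class 1)
    s0 = np.cumsum(hist * centers)
    mu0 = s0 / np.maximum(w0, 1)         # class means (guard against empty classes)
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)
    between = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance up to a constant
    k = int(np.argmax(between))
    return edges[k + 1]                  # upper edge of the last class-0 bin

# Synthetic two-population image: dark background and bright rods.
low = np.linspace(0.10, 0.30, 500)
high = np.linspace(0.70, 0.90, 500)
image = np.concatenate([low, high]).reshape(20, 50)
t = otsu_threshold(image)
binary = image > t   # 1 (white) = ZnTPyP, 0 (black) = void
print(int(binary.sum()))  # 500
```

A single global threshold like this only works once the shadow gradients have been removed, which is why the background-subtraction steps come first.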

# 4. bwareaopen.m (Built-in MATLAB function)

This function removes connected regions smaller than a specified pixel area from a binary image. Applied to the inverted image, it fills unwanted pores: you simply define a threshold for the area of the pores to be filled.
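The area-based clean-up can be sketched in Python as below. This is an illustrative stand-in, not MATLAB's implementation: the function names are hypothetical, 8-connectivity is assumed for both the objects and the pores, and the breadth-first search is written for clarity rather than speed.

```python
import numpy as np
from collections import deque

def remove_small_objects(binary, min_area):
    """Drop 8-connected foreground components with fewer than min_area pixels."""
    binary = np.asarray(binary, dtype=bool)
    rows, cols = binary.shape
    keep = np.zeros_like(binary)
    visited = np.zeros_like(binary)
    for r0, c0 in zip(*np.nonzero(binary)):
        if visited[r0, c0]:
            continue
        visited[r0, c0] = True
        queue, component = deque([(r0, c0)]), []
        while queue:  # breadth-first flood fill of one component
            r, c = queue.popleft()
            component.append((r, c))
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and binary[rr, cc] and not visited[rr, cc]):
                        visited[rr, cc] = True
                        queue.append((rr, cc))
        if len(component) >= min_area:
            comp_r, comp_c = zip(*component)
            keep[list(comp_r), list(comp_c)] = True
    return keep

def fill_small_pores(binary, max_pore_area):
    """Fill void regions of at most max_pore_area pixels by filtering the inverse."""
    binary = np.asarray(binary, dtype=bool)
    return ~remove_small_objects(~binary, max_pore_area + 1)

# Example: a one-pixel pore inside the material is filled, a large void is kept.
img = np.ones((8, 8), dtype=bool)
img[3, 3] = False      # small pore
img[:, 6:] = False     # large void region (16 pixels)
filled = fill_small_pores(img, max_pore_area=2)
print(bool(filled[3, 3]), bool(filled[:, 6:].any()))  # True False
```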

# Masking

We now have our ensemble of binarized images. It is remarkable how well Ahmet’s segmentation tools worked. However, we have some images where the majority of the image is void space. An example of such an image is shown below:

We only have a few of these kinds of images. Surely, not all of the void space is meaningful for us to keep. This is where the concept of masking comes into play.

# Defining the masked regions

We do not want to remove all of the void space from all of the images; this would provide meaningless 2-point statistics. However, as discussed previously, we do not want to keep all of the void space either. Therefore, we propose that we define a 3rd state: mask.

Void spaces in the proximity of ZnTPyP are meaningful in that they carry ZnTPyP orientation information. However, the farther the void space is from any ZnTPyP, the less meaning it provides. Therefore, we defined a distance threshold beyond which the void space is considered meaningless. These meaningless regions are defined as a mask. Now we have 3 states: ZnTPyP, void space, and mask. A visualization is provided below:

ZnTPyP = white … Void Space = gray … Mask = black
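The three-state labelling can be sketched in Python as follows. The integer labels, the function name, and the use of a Euclidean distance threshold in pixels are assumptions made for illustration; the distance threshold actually used in the project is not stated here.

```python
import numpy as np

VOID, ZNTPYP, MASK = 0, 1, 2  # hypothetical integer labels for the three states

def label_states(binary, max_distance):
    """Label ZnTPyP, void within max_distance pixels of ZnTPyP, and mask elsewhere."""
    binary = np.asarray(binary, dtype=bool)
    rows, cols = binary.shape
    d = max_distance
    padded = np.pad(binary, d, mode="constant", constant_values=False)
    near = np.zeros_like(binary)
    # Dilate the ZnTPyP phase with a disk of radius max_distance.
    for dr in range(-d, d + 1):
        for dc in range(-d, d + 1):
            if dr * dr + dc * dc <= d * d:
                near |= padded[d + dr: d + dr + rows, d + dc: d + dc + cols]
    states = np.full(binary.shape, MASK, dtype=np.uint8)
    states[near] = VOID       # void close enough to a nanorod to be informative
    states[binary] = ZNTPYP   # the nanorods themselves
    return states

# Example: one ZnTPyP pixel in the centre of an otherwise empty image.
binary = np.zeros((11, 11), dtype=bool)
binary[5, 5] = True
states = label_states(binary, max_distance=2)
print(states[5, 5], states[5, 7], states[5, 8])  # 1 0 2
```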

2-point statistics with mask implementation is calculated slightly differently. For more information, refer to the following paper:

# 2-point statistics

Now that we have our ensemble of binarized images with masks to account for undesirable volume of void space, we are now ready to represent our images in 2-point statistics. For 2-point statistics, we are required to define the following:

**1. Non-periodic**: it was very apparent that we were dealing with non-periodic structure. Many of the images do not show redundant patterns. Although the individual nano-rods are very similar in geometry, this is still different from what we consider periodic.

**2. Auto-correlation:** because we are essentially dealing with a two phase image (discounting the mask, refer to Ahmet’s paper for clarification), a simple auto-correlation is sufficient to describe the structure.

**3. The dimensions of the 2-point statistics matrix are constrained by the smallest image that we have.** This happened to be 97 X 97 pixels. Therefore, our 2-point statistics for each and every image were 97 X 97 pixels. This dimensional consistency is a requirement for us to use principal component analysis (PCA).

Example 2-point statistics representations are shown below:

For a total of 15 non-redundant images, the ensemble of 2-point statistics data were vectorized and compiled into a single matrix of dimension 15 X 9409. This concludes our data preparation, and we are ready to move onto PCA.
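The choices above (non-periodic, auto-correlation, mask, fixed 97 X 97 output) can be sketched in Python with FFTs. This is a minimal illustration under assumptions, not the code used in the project: the function names are hypothetical, and normalizing by the number of unmasked pixel pairs is our reading of how the masked statistics are computed.

```python
import numpy as np

def masked_autocorrelation(phase, valid):
    """Non-periodic autocorrelation of `phase`, counting only pixel pairs
    in which both pixels lie in `valid` (i.e. outside the mask)."""
    valid = np.asarray(valid, dtype=float)
    phase = np.asarray(phase, dtype=float) * valid
    rows, cols = phase.shape
    shape = (2 * rows - 1, 2 * cols - 1)   # zero-padding removes periodic wrap-around

    def correlate(a):
        F = np.fft.rfft2(a, s=shape)
        return np.fft.fftshift(np.fft.irfft2(F * np.conj(F), s=shape))

    numerator = correlate(phase)     # pairs with both ends in the ZnTPyP phase
    denominator = correlate(valid)   # pairs with both ends unmasked
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denominator > 0.5, numerator / denominator, 0.0)

def center_crop(stats, size):
    """Keep the size x size window around the zero vector."""
    r0 = (stats.shape[0] - size) // 2
    c0 = (stats.shape[1] - size) // 2
    return stats[r0:r0 + size, c0:c0 + size]

# Stand-in for one binarized, masked image (random pixels, not real data).
rng = np.random.default_rng(0)
image = rng.random((120, 150)) > 0.7
valid = np.ones(image.shape, dtype=bool)
valid[:, 100:] = False                    # hypothetical masked strip
crop = center_crop(masked_autocorrelation(image, valid), 97)
row = crop.ravel()                        # one row of the 15 X 9409 matrix
print(crop.shape, row.shape)  # (97, 97) (9409,)
```

The value at the centre of the cropped map is the ZnTPyP fraction of the unmasked area, which is a quick sanity check on any implementation.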

## Dimensionality Reduction

We have a total of 15 images and the 2-point statistics of each image are of size 97 * 97, so we get an ensemble matrix of size 15 * 9409. We are faced with the “curse of dimensionality” because the number of samples (15) is far smaller than the number of features (9409). We conducted dimensionality reduction using principal component analysis (PCA) in order to solve this problem. PCA can be achieved via singular value decomposition as follows:

$$X = U*S*{V}^T$$ where $$U$$ is the matrix of left singular vectors (the eigenvectors of $$X{X}^T$$); its columns are orthonormal. $$S$$ is a rectangular diagonal matrix whose main diagonal holds the singular values, i.e. the square roots of the eigenvalues of $${X}^TX$$. $$V$$ is the orthogonal matrix of right singular vectors, whose columns are the principal directions.

The PC score matrix is then calculated by $${T}_L = U*{S}_L$$ where $${S}_L$$ is the truncated $$S$$ matrix with only the first 15 columns. The following scree plot shows the variance explained by the PCs:

Even the first PC alone explains 99% of the variance, so for simplicity we use only the first 3 PCs for model construction.
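The SVD route to PC scores can be sketched in Python as below. The data matrix here is random and only stands in for the real 15 * 9409 ensemble, the function name is hypothetical, and mean-centering the columns before the decomposition is an assumption (it is standard for PCA, but the text above does not state whether it was applied).

```python
import numpy as np

def pca_scores(X, n_components):
    """PCA via SVD: returns PC scores, principal directions and variance ratios."""
    X_centered = X - X.mean(axis=0)          # assumption: column-wise mean-centering
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # T_L = U * S_L
    explained = s**2 / np.sum(s**2)          # fraction of variance per PC (scree plot)
    return scores, Vt[:n_components], explained

rng = np.random.default_rng(0)
X = rng.random((15, 9409))                   # stand-in for the 2-point statistics matrix
scores, components, explained = pca_scores(X, n_components=3)
print(scores.shape, components.shape)  # (15, 3) (3, 9409)
```

With only 15 samples there are at most 15 non-trivial PCs, so the economy-size SVD (`full_matrices=False`) loses nothing and avoids building a 9409 * 9409 matrix.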

## Model Construction

We use the process parameters, namely SDS:ZnTPyP ratio, pH, and ZnTPyP concentration, as our inputs. The outputs are the PC scores. The error metric we use is the mean absolute error (MAE), defined as $$\frac{1}{m} \sum_{i=1}^{m} |h_{\theta}(x^{(i)}) - y^{(i)}|$$. Compared to mean squared error (MSE), MAE is more robust to outliers, which are very common in our dataset.

We experimented with 3 different models. We started with multiple linear regression, and then tried polynomial regression with different degrees. Finally, we tried polynomial regression with Ridge regularization. We found that the last model works best in terms of cross validation MAE.

We experimented with Ridge regression of different degrees and plotted the training error and cross validation error as follows:

We found that 2nd order Ridge regression works best for PC1, PC2 and PC3, because it reduces training error by a lot without considerably increasing the cross validation error.
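A degree sweep of this kind can be sketched in Python as below. Everything specific in it is an assumption for illustration: the synthetic data, the regularization strength `alpha=0.1`, the use of leave-one-out cross validation, and the helper names are not taken from the project.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, degree):
    """All monomials of the input columns up to `degree`, including the bias term."""
    n, d = X.shape
    cols = [np.ones(n)]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def fit_ridge(Phi, y, alpha):
    """Closed-form Ridge solution; the intercept column is left unpenalized."""
    penalty = alpha * np.eye(Phi.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ y)

def mae(pred, y):
    return float(np.mean(np.abs(pred - y)))

def loo_cv_mae(X, y, degree, alpha):
    """Leave-one-out cross validation MAE."""
    errors = []
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        w = fit_ridge(polynomial_features(X[train], degree), y[train], alpha)
        pred = polynomial_features(X[i:i + 1], degree) @ w
        errors.append(abs(pred[0] - y[i]))
    return float(np.mean(errors))

# Synthetic stand-in: 15 processes, 3 scaled process parameters, one PC score.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(15, 3))
y = 1.0 + X[:, 0] - 2.0 * X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=15)

for degree in (1, 2, 5):
    Phi = polynomial_features(X, degree)
    w = fit_ridge(Phi, y, alpha=0.1)
    print(degree, round(mae(Phi @ w, y), 3), round(loo_cv_mae(X, y, degree, 0.1), 3))
```

With 3 inputs, a 5th order model has 56 coefficients for only 15 samples, so it can drive the training error toward zero while the cross validation error grows; comparing the two columns is how the preferred degree is chosen.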

The following plots show the goodness of fit obtained with Ridge regression of different degrees: For PC1:

For PC2:

For PC3:

It is obvious that linear regression is a case of underfitting, while 5th order Ridge regression is a case of overfitting. 2nd order Ridge regression works best for our dataset.

# Visual verification of our model performance

Other images are shown in this attachment: Prediction_vs_Actual.pptx

## Conclusion

**Why did the model perform better at predicting some points more than others?**

For us to attempt to answer this question, we have to refer back to the data itself. Shown below is a figure of the collection of data points, in process space:

For the figure on the left, we have 3 different colors, dividing up the data set into three different experiments as the experimentalists intended. We can visually see which variables are control and which are independent. The figure on the right shows the same data points, but with the colors now signifying the Pearson correlation coefficient between the predicted and actual 2-point statistics. Therefore, a higher value (yellow) means the model did a better job at prediction. Notice how the model performs much better on points that are located far from the cluster of points near the center. I believe that this issue is related to problems that occur from having identical processes but different microstructures. This issue has been well-discussed between our group and Sanam + Sepideh’s group. Basically, when the data we are using contains multiple outputs for a single input, it creates problems for our model, which is supposed to map each input to a single output. Similar problems appear when data points are close to each other (even if not identical). We can see the effects of having sub-optimally distributed data points in process space.

This tells me that for the approach studied in this course, it is generally preferred to have widely distributed data points. This often goes against the intuition of an experimentalist; it is customary for an experimentalist to clearly define control variables and isolate independent variables to study the resulting effects on a dependent variable of interest. However, from a materials informatics data science point of view, it is much more beneficial to shy away from this practice and spread out the processes to be investigated. To clarify this point… if I had the time to conduct the same number of experiments (15 processes), then I would prefer collecting data at the following locations instead (without control variables):

# Future Work

If there is a possibility of a publication from this work, we would be more than willing to explore this opportunity. Regarding future work, there are a few avenues that I can see being worthwhile.

1) Solving the other half of the problem in structure-property linkage and completing the PSP linkage. The material properties of interest will probably be related to optoelectronic applications.

2) Other factors have been identified to also affect morphologies of ZnTPyP nanorods. Such examples are: using a different type of surfactant such as CTAB, or simply changing the order of how each solution is added. These will require additional models.

3) It is possible to arrange and align the nanorods in a uniform monolayer. Achieving this requires extra steps in the processing, but it is possible. By having the nanorods prepared in this manner, I expect the quality of the images to greatly improve, thereby improving our model itself.

# Acknowledgements

We are grateful to Hongyou Fan from Sandia National Laboratory for providing us data for our project. We are thankful to Professor Surya R. Kalidindi and Noah Paulson for their extensive advice and guidance regarding the project. Ahmet Cecen’s MATLAB codes were tremendously helpful in the binarization of our images. We also acknowledge Alehksandr Blekh for his work on MATIN development, and everyone in the Materials Informatics class for all the discussions that took place in MATIN.