Deep Learning for Classifying Congenital Heart Disease from Ultrasound

Final Report | AC 299R - Special Topics in Applied Computation | Harvard Spring 2019


Background


Recent advances in machine learning have led to incredibly successful applications in visual recognition tasks. In particular, there has been a huge surge in using convolutional models to extract abstract features from medical imagery for a host of purposes spanning segmentation, classification, de-noising, and more. This integration of computer science with medicine is often motivated by the lack of human resources or skill to effectively examine grainy, busy images from hundreds of patients each day. Whether it's a pathologist scanning stained tissue slides for signs of malignant anomalies, or a neurosurgeon attempting to assess degeneration from an MRI, doctors would benefit from artificial intelligence (AI) aid. As it currently stands, convolutional models have been promisingly effective in numerous scenarios. A few notable examples include diabetic retinopathy detection,1 automatic fetal body and amniotic fluid segmentation,2 and breast cancer diagnosis.3

The focus of this report is on developing a temporal machine learning model to classify congenital heart disease from neonatal ultrasound image sequences. Echocardiography is performed on newborns to identify several different types of heart conditions that are fatal unless operated on soon after birth. These ultrasound scans are preferred as a non-invasive method, but the downside is that the generated image sequences are grainy and low-resolution, requiring an expert cardiologist to analyze them for possible issues. The main problem is that many more babies are born than cardiologists can examine, so a proxy pulse oximetry test is performed to screen for signs of a deadly heart condition. This surrogate is imperfect, as low oxygen in a baby's blood may simply indicate the heart has not yet transitioned out of the fetal circulatory system if the test is performed too soon after birth. Often, waiting to resolve this confounder would be fatal, as a congenital heart disease could lead to heart failure once this transition occurs. As such, there is great interest in an AI algorithm that could automatically classify ultrasound scans as indicative of a particular pathology, flagging the need for immediate surgery and thus saving little lives.

Previous work performed by Dr. Perrin, my research advisor on this project, involved training a model for static image classification of congenital heart disease from ultrasound. Intuitively, however, the presence of a heart defect has the potential to also affect the rhythm of the heartbeat. This effect would manifest temporally throughout a sequence of ultrasound frames taken from any given heart scan. My task was to extend prior work by investigating the application of temporal machine learning methods to a dataset of ultrasound videos, to try to pick up such a signal.


Data


The data consist of 1,666 ultrasound videos spanning four different pathologies (plus normal), and four different views of the heart. Each frame of the original dataset is a black and white image. The sizes of these images are variable, as are the frame rates, orientations, and the lengths of the videos. The distribution of video lengths is displayed in Figure 1 below:

Figure 1: The distribution of video lengths. Note that the distribution
is right-skewed, with a median length of 30 frames.

The four different pathologies are as follows:

  1. CA - Coarctation of the Aorta: The narrowing of the aorta, the large blood vessel that delivers oxygen-rich blood to the body. This over-stresses the left ventricle.
  2. TGA - Transposition of the Great Arteries: The two main arteries are flipped, causing oxygen-poor blood to be circulated insufficiently throughout the body by the smaller right ventricle, while the left ventricle pumps oxygen-rich blood straight back into the lungs.
  3. HLHS - Hypoplastic Left Heart Syndrome: The left side of the heart is underdeveloped, and therefore can’t supply the body with oxygen-rich blood effectively.
  4. SV - Single Ventricle: A class of congenital heart defects where the heart is essentially operating with a single ventricle. HLHS is a subclass of SV, but in the data they are considered separate.

To be explicitly clear, there are five class labels: CA, TGA, HLHS, SV, and Normal, for examples where no heart defect is present. The pathology class label distribution is not quite uniform. Figure 2 shows the breakdown of how these pathologies are represented in the data:

Figure 2: The distribution of class labels. SV is underrepresented in the data with only 92 examples,
while Normal is the most represented with 483 examples.

In addition to the pathologies, there are four different views that are represented in the data. They are apical four chamber, suprasternal notch, parasternal long axis, and parasternal short axis. At the time of the analysis, information was not available to link sequences and views. Their presence, however, is important when considering model bias, and will be brought up again in the conclusions section.

The raw echo images are screen captures that occasionally include annotations made by clinicians, and legends produced by the echocardiogram software. A few examples are shown below in Figure 3:

Figure 3: Three examples of echo images to demonstrate the types of annotations that are present in the data, as well as the varying sizes/orientations of the images.

The primary focus is the conical section of feedback coming from the reflections of the sound waves that were pulsed into the patient's chest (hence the term "echo"). Prior work has shown that these annotations can interfere with model training. Specifically, they add a confounding source of bias if one type of annotation appears more commonly for one pathology over another, which may very well be expected if the annotations reference the pathology. The traditional method to address this is to create masks that separate the echo cone from the cluttered background. Another Master's student in Dr. Perrin's lab spent the past six months developing a deep neural network to generate masks for these echo images. The best network-generated masks were in 95% agreement with a test set of hand-masked echo images, rivaling state-of-the-art computer vision techniques. I wrote a bash script using ImageMagick to apply these masks to the raw echo images, after standardizing each to 200 x 300. Figure 4 shows the post-masked versions of the example images presented above:

Figure 4: The masked and resized versions of the images presented in Figure 3. As can be seen, most of the annotations have been wiped out and replaced with a neutral black background, save for annotations that overlap with the echo cone itself.

The masking was effective, though it could not remove the annotations that overlapped with the conical section. A manual examination of images sampled from each of the 1,666 videos revealed this to be a negligible minority of the population, with only ~25 videos displaying visible annotations that were not fully removed. These masked data were used to train and evaluate each of the models described in this report.
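For illustration, the snippet below sketches the mask-and-resize step in Python with Pillow (the actual pipeline used a bash script with ImageMagick). The file paths are hypothetical, and the network-generated mask is assumed to be a binary image with the same dimensions as the raw frame.

```python
# Sketch of the mask-and-resize step; the real pipeline used bash + ImageMagick.
from PIL import Image
import numpy as np

def mask_and_resize(frame_path, mask_path, out_path, size=(300, 200)):
    """Black out everything outside the echo cone, then resize to 200 x 300."""
    frame = np.array(Image.open(frame_path).convert("L"))
    mask = np.array(Image.open(mask_path).convert("L"))      # assumed same size as frame
    masked = np.where(mask > 0, frame, 0).astype(np.uint8)   # keep only the cone
    # PIL's size argument is (width, height), so (300, 200) yields a 200 x 300 image
    Image.fromarray(masked).resize(size).save(out_path)

# Hypothetical usage:
# mask_and_resize("raw/CA/clip_001/frame_004.png",
#                 "masks/CA/clip_001/frame_004.png",
#                 "masked/CA/clip_001/frame_004.png")
```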


Models


There are a number of avenues to consider when building a model for video classification. For years, the traditional methodology in the medical community was to perform manual feature extraction from image sequences, and feed these features into a classifier such as a multilayer perceptron or decision tree to make an inference.4 These types of classifiers, however, are susceptible to image distortion, and Convolutional Neural Networks (CNNs) have superseded them: they are translation invariant and learn more abstract representations that aren't as easily perturbed. As a result, I decided to focus on building CNN-based models to capitalize on their automatic feature extraction capabilities.

Extracting features from the images is one piece of the puzzle. The second is figuring out how they relate to each other. To capture the time-varying aspect of the ultrasound videos, I used Recurrent Neural Networks (RNNs). RNNs are adapted to capture this interdependency, as they have an internal "hidden state" that learns to retain relevant information across multiple inputs, akin to memory. The most prominent type of RNN is the Long Short-Term Memory (LSTM) unit. In this report, I chose to use the Gated Recurrent Unit (GRU) instead, as its performance is indistinguishable from the LSTM yet its internal mechanism is much simpler (the input and forget gates are coupled, and there is only one persistent internal state).5

Preprocessing

The masked and resized images were stored in a file structure whereby each pathology had its own folder, with subfolders corresponding to each unique video clip. Each subfolder held a series of .png files, where the suffixes corresponded to their order in the sequence. I first parsed this file tree to create a "master list" of all the frames, including information such as each filepath, clip id, clip length, and pathology label. I shuffled the clips, but sorted the images within a unique clip id to ensure that when a clip gets read into a model, it is read as a coherent sequence. Once this was complete, I split the master list into training, validation, and test sets, keeping each clip intact so that a given sequence would not be divided across multiple splits.
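A sketch of this preprocessing step is shown below. The directory layout, column names, and split fractions are assumptions for illustration (the 12% validation/test fractions follow the evaluation section), and clip ids are assumed to be unique across pathologies.

```python
# Sketch of the "master list" construction and clip-level train/val/test split.
import os
import random
import pandas as pd

def build_master_list(root):
    rows = []
    for pathology in sorted(os.listdir(root)):                 # CA, TGA, HLHS, SV, Normal
        for clip_id in sorted(os.listdir(os.path.join(root, pathology))):
            clip_dir = os.path.join(root, pathology, clip_id)
            frames = sorted(os.listdir(clip_dir))              # assumes zero-padded frame suffixes
            for frame in frames:
                rows.append({"filepath": os.path.join(clip_dir, frame),
                             "clip_id": clip_id,
                             "clip_length": len(frames),
                             "label": pathology})
    return pd.DataFrame(rows)

def split_by_clip(master, val_frac=0.12, test_frac=0.12, seed=0):
    """Shuffle clips (not frames) so no clip is divided across splits."""
    clips = master["clip_id"].drop_duplicates().tolist()
    random.Random(seed).shuffle(clips)
    n_val, n_test = int(val_frac * len(clips)), int(test_frac * len(clips))
    val, test = set(clips[:n_val]), set(clips[n_val:n_val + n_test])
    split = master["clip_id"].map(lambda c: "val" if c in val
                                  else "test" if c in test else "train")
    return master.assign(split=split)
```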

Because the pathology distribution was imbalanced, I reweighted the classes so that the models place equal emphasis on the cost of a missed label for each pathology type. These weights were included as extra information in the train/val/test sets.
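One common way to compute such weights is inverse-frequency ("balanced") weighting; the snippet below follows the column names from the preprocessing sketch above and illustrates the idea rather than the exact implementation.

```python
# Inverse-frequency class weights computed over clips (one label per clip).
# `master` is the DataFrame produced by build_master_list above.
clip_labels = master.drop_duplicates("clip_id")["label"]
counts = clip_labels.value_counts()
class_weight = {label: len(clip_labels) / (len(counts) * n)
                for label, n in counts.items()}
# After mapping labels to integer indices, these can be passed to Keras via
# model.fit(..., class_weight={index: weight, ...}).
```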

In total, I built and evaluated three models on the echo data. The following sections illustrate and discuss the model design for each.

Model 1: CNN & RNN Fusion

The first architecture I built was essentially a time-distributed CNN stacked on top of an RNN. Figure 5 shows a schematic of the layers that comprised the model:

Figure 5: The design of Model 1. Note that each of the layers comprising the CNN part of the model is
wrapped in a TimeDistributed wrapper to link it up with the RNN.

The design of the CNN was largely motivated by the highly regarded VGG16 architecture, which has 5 convolutional blocks with increasing numbers of 3x3 filters.6 Each convolutional block contains two convolution layers, a maxpooling layer, and dropout to enforce regularization and prevent the model from overfitting. The last feature maps get flattened and sent into a single "unrolled" unit of the bidirectional GRU, as demonstrated in Figure 6:

Figure 6: Schematic of how the CNN and RNN link up. The frames are passed one by one into the RNN,
and the hidden state gets updated and transferred each time. A dense layer maps the RNN cell outputs
to a five dimensional probability vector.
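A minimal Keras sketch of this fusion architecture is given below. The filter counts, GRU width, and dropout rates are illustrative, not the exact tuned configuration; the variable-length time dimension works because training uses one clip per batch.

```python
# Sketch of Model 1: a time-distributed, VGG-style CNN feeding a bidirectional GRU.
from tensorflow.keras import layers, models

def build_cnn_rnn(n_classes=5, frame_shape=(200, 300, 1)):
    model = models.Sequential()
    model.add(layers.Input(shape=(None,) + frame_shape))   # (time, height, width, channels)
    for filters in (8, 16, 32, 64, 128):                    # five blocks, increasing filters
        for _ in range(2):                                  # two 3x3 convolutions per block
            model.add(layers.TimeDistributed(
                layers.Conv2D(filters, 3, padding="same", activation="relu")))
        model.add(layers.TimeDistributed(layers.MaxPooling2D(2)))
        model.add(layers.TimeDistributed(layers.Dropout(0.25)))
    model.add(layers.TimeDistributed(layers.Flatten()))
    model.add(layers.Bidirectional(layers.GRU(64, stateful=False)))  # no memory across clips
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```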

The dataset was much too large to load into memory all at once, therefore I built a custom data generator to read a single batch of images into memory at a time, perform preprocessing, and then pass them along to the model. For each image, I normalized the pixel values to aid in the convergence of the model, and cast the normalized pixel values as single-precision floats to conserve memory. GPU memory was a concern, as the GPU must be able to store at least one batch at a time to perform the gradient updates. As a rough calculation, the longest sequence is around 500 frames of 200 x 300 images. Each pixel value is encoded as a 4-byte float, so the maximum size of a single batch was around 120 MB (500 x 200 x 300 x 4 bytes). This does not reflect the total memory used to train a single batch (to do that we would have to factor in the weights of the model and the intermediate storage that the TensorFlow backend uses), but this calculation verifies that our batch memory is much less than the GPU's maximum capacity of ~10 GB. The memory cost of these long sequences restricted us to training with a single clip per batch. This is less than ideal, as it increases the stochasticity of the gradients, but it's still much better than using a CPU.
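Below is a sketch of such a one-clip-per-batch generator built on keras.utils.Sequence; it assumes the master-list columns from the preprocessing sketch and is illustrative rather than the exact implementation.

```python
# Sketch of a generator that yields one clip per batch as normalized float32 frames.
import numpy as np
from PIL import Image
from tensorflow.keras.utils import Sequence, to_categorical

class ClipGenerator(Sequence):
    def __init__(self, master, label_to_index, n_classes=5):
        # group the frame-level master list into clips (one batch = one clip);
        # label_to_index maps pathology strings to integer class indices
        self.clips = [clip.sort_values("filepath")
                      for _, clip in master.groupby("clip_id")]
        self.label_to_index = label_to_index
        self.n_classes = n_classes

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        clip = self.clips[idx]
        frames = np.stack([np.asarray(Image.open(p), dtype=np.float32) / 255.0
                           for p in clip["filepath"]])
        frames = frames[np.newaxis, ..., np.newaxis]          # (1, T, 200, 300, 1)
        label = to_categorical(self.label_to_index[clip["label"].iloc[0]],
                               self.n_classes)[np.newaxis]    # (1, 5)
        return frames, label
```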

One important aspect to consider when working with RNNs is what Keras calls "statefulness." Statefulness refers to whether or not the hidden state gets reset in between batches. In our situation, each video is independent of the previous one, so we definitely do not want the RNN to "remember" anything about the previous video clip. Therefore, I explicitly set statefulness to False. One design choice I may have overlooked was that I only shuffled the training data once before training began, and not after each epoch. Given more time, it would be interesting to look into whether per-epoch shuffling has an effect.

Model 2: 3D Convolution

The second model I built factored in time as an extra dimension to perform convolution over. 3D CNNs have been demonstrated to be capable of resolving temporal relationships, and to outperform 2D CNNs in picking out spatiotemporal features.7 A schematic of the 3D CNN I built is shown in Figure 7.

Figure 7: The design of Model 2. Note that the first dimension of the maxpooling and filter sizes is the length of the sequence.

Once again, the 3D CNN roughly follows the VGG16 structure, with five convolutional blocks that have maxpooling and increasing numbers of filters. There is no dropout in between convolutional layers, as it did not improve performance during parameter tuning, which will be described in more detail in the following section.

An unfortunate restriction presented by 3D convolution is that the lengths of the sequences must be fixed. Given that the video lengths follow a right-skewed distribution (refer to Figure 1), this means that we must either pad, crop, or both. I decided to take the median video length, 30 frames, as the set size for the model. This design choice was mostly motivated by memory concerns, as 3D convolution is memory-intensive. In the future, it might be worthwhile setting the sequence length to the minimum for which one or two heartbeats pass at the slowest frame rate in the data.
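A minimal Keras sketch of this 3D CNN is given below. The filter counts and pooling sizes are illustrative, not the exact tuned configuration; the temporal dimension is only pooled in the earlier blocks so that 30 frames are not reduced to nothing.

```python
# Sketch of Model 2: a VGG-style 3D CNN over fixed-length clips of 30 frames.
from tensorflow.keras import layers, models

def build_3d_cnn(n_classes=5, clip_shape=(30, 200, 300, 1)):
    model = models.Sequential()
    model.add(layers.Input(shape=clip_shape))
    filters = (8, 16, 32, 64, 64)
    pools = [(2, 2, 2), (2, 2, 2), (2, 2, 2), (1, 2, 2), (1, 2, 2)]  # (time, height, width)
    for f, p in zip(filters, pools):                 # five blocks, no dropout
        model.add(layers.Conv3D(f, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling3D(pool_size=p))
    model.add(layers.Flatten())
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```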

Handling sequences that are longer than 30 frames is relatively straightforward. Essentially, one must choose 30 frames from some part of the sequence, either randomly or systematically. I chose to take the first 30 frames of the sequence. Padding shorter sequences is less obvious. Two common approaches are:

  1. zero padding: Fill up the extra space with blank images
  2. mirror padding: Flip the sequence and concatenate it to itself. For example, if we want to extend the sequence [1, 2, 3, 4, 5] to be 8 elements instead of 5, it would become [1, 2, 3, 4, 5, 5, 4, 3]. This allows for a smooth transition back and forth, almost like a periodic boundary condition.

For the design of the 3D CNN, I chose to implement mirror padding.
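The cropping and mirror-padding logic can be expressed compactly with NumPy; the sketch below is illustrative (np.pad's "symmetric" mode reproduces the mirroring in the example above, repeating the reflection for clips much shorter than 30 frames).

```python
# Sketch of the fixed-length preprocessing: crop long clips, mirror-pad short ones.
import numpy as np

def fix_length(frames, target=30):
    """frames: array of shape (T, H, W); returns an array of shape (target, H, W)."""
    if len(frames) >= target:
        return frames[:target]                        # keep the first 30 frames
    pad = target - len(frames)
    # "symmetric" mirrors the sequence back on itself: [1,2,3,4,5] -> [1,2,3,4,5,5,4,3]
    return np.pad(frames, ((0, pad), (0, 0), (0, 0)), mode="symmetric")
```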

Model 3: Transfer Learning

Transfer learning is the idea that a powerful model trained to perform a certain task really well can be transferred to another domain and excel at a different task too. The most powerful open-source CNNs are those trained on ImageNet data for the famous ImageNet Large Scale Visual Recognition Challenge (ILSVRC). ILSVRC contains 1,000 classes of color images of objects like dogs and boats. While these image classes are very different from black and white medical imagery, such models can be successfully tuned to excel at medical image classification tasks.8

Transfer learning is extremely useful in situations where data is "scarce." In particular, the amount of data that is necessary scales with the difficulty of the problem. Classifying heart conditions from noisy ultrasound videos is such a problem, and thousands of examples may not be sufficient to reach high accuracies. This was my reasoning behind exploring transfer learning as a possible solution. It is important to point out that tuning an ImageNet model to a very different domain requires careful retraining of the weights layer by layer with very small learning rates. Due to time constraints, I simply replaced the outermost layer of a pre-trained CNN, passing the extracted feature embeddings to a bidirectional GRU, similarly to the first model. By saving the image embeddings as a preprocessing step, I was able to speed up training somewhat, though the training process is still dominated by I/O operations.

The pre-trained CNN I used was the Inception V3 model. A schematic of the model setup is shown in Figure 8.

Figure 8: Schematic of the transfer learning dataflow. The embeddings are saved to disk and then read in at runtime.
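A sketch of this pipeline is shown below: a frozen ImageNet-pretrained Inception V3 turns each grayscale frame (tiled to three channels) into an embedding vector, and a bidirectional GRU classifies the embedding sequence. Layer sizes are illustrative, not the exact configuration.

```python
# Sketch of Model 3: frozen InceptionV3 frame embeddings feeding a bidirectional GRU.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def embed_clip(frames):
    """frames: (T, H, W) uint8 grayscale; returns (T, 2048) embeddings to save to disk."""
    rgb = np.repeat(frames[..., np.newaxis], 3, axis=-1).astype(np.float32)
    return extractor.predict(preprocess_input(rgb))

def build_embedding_rnn(n_classes=5):
    model = models.Sequential([
        layers.Input(shape=(None, 2048)),            # variable-length embedding sequences
        layers.Bidirectional(layers.GRU(64)),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```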


Results


For each of the models outlined above, I tuned the learning rate, the number of filters, the number of hidden units, and the amount of regularization. Each model was trained for at least 40 epochs, which empirically was enough time for convergence. A single run took anywhere from 12 to 18 hours on Harvard's computing cluster, Odyssey, depending on network speeds and cluster demand.

In terms of model evaluation, my main concern was not overall accuracy, but the accuracy within each pathology. To measure this, for each model I produced a confusion matrix, and then "unrolled" this information into colored sets of bar graphs that show the breakdown of the correct and incorrect predictions made by the model. All tuning was done using a held-out validation set, and the final evaluation was performed on a held-out test set, each of which comprised 12% of the overall data. We can compare our models' accuracy to a simple baseline of predicting the most commonly occurring class label. This baseline would yield an accuracy of ~28%, corresponding to the fraction of class labels that belong to the most populous class, Normal.
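For reference, this per-pathology evaluation can be computed directly from a confusion matrix; the snippet below is a sketch, where y_true and y_pred are hypothetical lists of true and predicted class labels for the test set.

```python
# Per-pathology accuracy and majority-class baseline from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["CA", "TGA", "HLHS", "SV", "Normal"]
cm = confusion_matrix(y_true, y_pred, labels=classes)   # rows = true class, cols = predicted
per_class_accuracy = np.diag(cm) / cm.sum(axis=1)       # accuracy within each pathology
majority_baseline = cm.sum(axis=1).max() / cm.sum()     # always predict the most common class
```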

To better understand how the model was learning, I saved the model history for each run to produce loss and accuracy curves for both the training and validation sets. From doing so, I noticed that each of the three models reached more or less the same level of validation accuracy before the validation loss started to diverge from the training loss. This effect is highlighted in Figure 9.

Figure 9: Example of model overfitting.

When the training and validation performances begin to diverge, this is typically indicative of overfitting. In other words, the model is picking up on aspects specific to the training data that may be correlated with pathology, but are not generalizable. For example, if we had fed in the raw dataset, the annotations would likely have been picked up and used to explain the variance in the target variable. Interestingly, even though the losses diverge, the validation accuracy continues to increase slowly (from 50% to 60% as training accuracy goes from 50% to 95%). Once again, this is likely not due to actual generalizable knowledge learned about the relationship between echo data and pathology, but instead due to the model figuring out how to discern two or three of the five pathologies, and predicting extreme probabilities such as 0 or 1. Predicting a very low probability for a label that then appears causes the log term in the categorical cross-entropy loss function to explode, resulting in higher net loss even though the accuracy may increase overall.
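A small numeric illustration of this effect: the categorical cross-entropy for a single example is the negative log of the probability assigned to the true class, so one confidently wrong prediction can dominate the average loss even while most predictions remain correct (the probabilities below are made up for illustration).

```python
# How a single confidently wrong prediction inflates the mean cross-entropy loss.
import numpy as np

p_true_class = np.array([0.9, 0.8, 0.7, 0.01])  # probability assigned to the true label
losses = -np.log(p_true_class)                  # [0.11, 0.22, 0.36, 4.61]
print(losses.mean())                            # ~1.32, dominated by the one outlier
```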

Methods to deal with overfitting include regularization (through dropout, batch normalization, modifying the objective function, etc.), scaling down the model parameters to reduce model complexity, stopping the training early when the losses start to diverge, and of course, using more data. I used three different techniques: implementing dropout, scaling down the number of model parameters, and incorporating early stopping with a patience of 5 epochs. Early stopping is a callback in the Keras API that tracks the validation loss and terminates training if the validation loss does not improve at all for 5 epochs after it hits its minimum. I determined that a patience of 5 was enough to account for the stochasticity in the validation accuracy (the validation set is 200 examples, which is small enough that there is variation of a few percent between evaluations).
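For reference, this corresponds to Keras's EarlyStopping callback; the sketch below reuses the model and generator names from the earlier sketches.

```python
# Early stopping on the validation loss with a patience of 5 epochs.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=5)
model.fit(train_generator, validation_data=val_generator,
          epochs=40, callbacks=[early_stop])
```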

The final results for each of the model architectures are summarized in the sections below.

Model 1: CNN & RNN Fusion

Figure 10: Final loss and accuracy curves for Model 1.

Figure 11: Final confusion matrix for Model 1.

Figure 12: Final predictions for Model 1. Note, the correct predictions are shown in green and the incorrect predictions are shown in red.

From the loss and accuracy curves, we see the validation accuracy plateaus at around 50%. Disappointingly, the test set accuracy was lower, at only 44.5%. Perhaps the validation/test sets should have been larger to get them to agree, at the cost of losing training data. Encouragingly, however, the model is very good at predicting the SV and TGA pathologies, while it is very bad at identifying HLHS (essentially the same as a coin flip). It is moderately successful at identifying which examples are Normal.

Model 2: 3D Convolution

Figure 13: Final loss and accuracy curves for Model 2.

Figure 14: Final confusion matrix for Model 2.

Figure 15: Final predictions for Model 2. Note, the correct predictions are shown in green and the incorrect predictions are shown in red.

From the loss and accuracy curves, we see the validation accuracy plateaus at around 55%. This is in agreement with the test accuracy, which was measured to be 54%, roughly 10 percentage points higher than Model 1. Interestingly, Model 2 is worse than Model 1 at picking out the SV pathology, mixing it up with HLHS half the time. Even with this mixup, if we are concerned only with the binary indication of whether a congenital heart condition is present, Model 2 would provide 100% accuracy in the SV case. Model 2 is also better than Model 1 at identifying HLHS, though it still struggles. Model 2 is particularly adept at picking out the CA, TGA, and Normal classes.

Model 3: Transfer Learning

Figure 16: Final loss and accuracy curves for Model 3.

Figure 17: Final confusion matrix for Model 3.

Figure 18: Final predictions for Model 3. Note, the correct predictions are shown in green and the incorrect predictions are shown in red.

The loss and accuracy curves for the transfer model show that the validation accuracy plateaus at around 40%, though it oscillates wildly around that level. This is in agreement with the test accuracy, which was measured to be 39%, notably lower than both Models 1 and 2. Interestingly, Model 3 is better at identifying HLHS than both Models 1 and 2, and also matches Model 1's 100% accuracy rate in recognizing SV. However, Model 3 is quite poor at determining what is Normal, instead overpredicting TGA quite drastically. In fact, the model appears to overpredict TGA for all categories except SV (and TGA itself, of course).


Conclusions


To recap, we fit three different models on thousands of sequences containing masked echo cone images. The echo images spanned five different class labels and four different views, and had varying lengths and frame rates. The model that appears to have performed the best was the 3D CNN, with a test accuracy of 54%, and a substantial majority of correct predictions for four of the five classes (all save for SV). That being said, the sample size for the SV class was only 8 examples, which is not large enough to draw any conclusions as to the efficacy of the model's predictions for that pathology. In particular, one important future endeavor would be to perform parametric and non-parametric uncertainty estimation for the model predictions. The parametric procedure would be to assume that the predictions, conditional on a particular pathology, follow some underlying true rate, thereby turning the counts for each class into a Poisson random variable. The Poisson random error is approximately the square root of the number of counts (for counts greater than ~10), so this could be used to estimate error bars for the predictions within a class. The non-parametric way to estimate uncertainty would involve sampling different splits from the population a number of times, training a model on each, and then aggregating the results across all the models. This would be very computationally expensive, so a cheaper alternative would be to perform something similar to K-fold cross validation using the same model architecture.

The most exciting aspect of the results is that TGA is actually the hardest category to detect from static images, whereas all three temporal models performed best in this category. It is important to note that Models 1 and 3 overpredicted TGA for other categories. This behavior was not as visible for Model 2, however, suggesting there may be a significant arrhythmia indicative of TGA. This was the most beneficial outcome of my research project, as my work serves as a pilot study indicating that this effect is worth investigating further.

One major difficulty I faced in this project was the overfitting of the models to the training data. There are a few preprocessing ideas I would implement going forward to rectify this, beyond changing the model architecture or training regimen. First and foremost, I would shuffle the training data on each epoch. Keep in mind, shuffling does not mean shuffling the order in which the images appear within a batch, but rather the order in which batches appear. Secondly, I would decrease the resolution of the images even further, perhaps from 200 x 300 to 100 x 150 or 50 x 75. This would allow us to train with batch sizes greater than 1, which may help the model converge to a better accuracy.

Above all else, I think the most rational next step in this analysis would be to try to enhance the data by nailing down some of the confounders. First, it would be valuable to acquire labels that indicate the view of the echo cone, so that we can train a model on a single view at a time. This is especially important because some views do not show the portion of the heart that is affected by a given pathology. To illustrate this point, TGA would not be detectable from a static image taken from a parasternal short axis view. Separating out this view, and then training a temporal model, could demonstrate that there are temporal indications of TGA if the accuracy is significantly better than a coin flip (provided the model is not picking up on biased noise). In the same vein, as mentioned previously, it would also be advantageous to examine and perhaps standardize the frame rates of each sequence, at the very least ensuring that enough heartbeats are contained in a sequence to make a prediction. The shortest sequence in the dataset is only 3 frames, which I doubt is sufficient for prediction.

Furthermore, while the masking process did a great job of removing the annotations, there were still 1-5% of sequences that had minor markings on them (some extremely minor), and a couple of sequences had echo cones that were partially out of frame. In addition, there are two different orientations of echo cones: one where the thicker part of the cone is facing down, and one where it is facing up. It would be good to ascertain whether this is due to the location of the probe, or whether the heart itself is upside-down in these images, and come to a consensus on how best to standardize (or remove) these data. Eliminating these sources of bias may reduce the amount of valuable training data, though augmenting the dataset with rotations and contrast-enhanced versions of the images may offset this cost and decrease the chance of the model overfitting.

Beyond efforts to enhance the quality of the data, there are a number of very difficult, but potentially worthwhile extensions to the base convolutional models I discuss in this report. Two recent areas of research are the SlowFast network, which uses multiple parallel pathways learning on different frame rates to achieve separate levels of semantic meaning from sequences of images,9 and the Inflated 3D CNN, which essentially extends powerful 2D CNNs to higher dimensions by initializing the new dimensions with transformed versions of the powerful weights already present in the model.10 It could also be very powerful to implement two parallel pathways, where the raw images pass through one, while transformed versions of the images (optical flow, Fourier transforms, contrast enhancement) pass through the other. This has been implemented successfully in the action recognition field.11

Overall, multinomial classification of ultrasound videos through deep learning is a difficult, yet fascinating problem. Throughout this process, I came to fully understand how a model is only as good as its data, and that adding more degrees of freedom to a model can only help so much. The end results were promising in that the 3D CNN correctly predicted the majority of each class label (save for SV), but it is clear there is a long way to go to achieve accuracies that would allow a tool like this to be used as a surrogate for a pulse oximetry test, or even, one day, a stand-in for a preoccupied expert cardiologist.

References

  1. Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., … & Kim, R. (2016). Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22), 2402-2410.
  2. Li, Y., Xu, R., Ohya, J., & Iwata, H. (2017, July). Automatic fetal body and amniotic fluid segmentation from fetal ultrasound images by encoder-decoder network with inner layers. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 1485-1488). IEEE.
  3. Antropova, N., Huynh, B. Q., & Giger, M. L. (2017). A deep feature fusion methodology for breast cancer diagnosis demonstrated on three imaging modality datasets. Medical physics, 44(10), 5162-5171.
  4. Giger, M. L., Chan, H. P., & Boone, J. (2008). Anniversary paper: history and status of CAD and quantitative image analysis: the role of medical physics and AAPM. Medical physics, 35(12), 5799-5820.
  5. Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2016). LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10), 2222-2232.
  6. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  7. Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 4489-4497).
  8. Tajbakhsh, N., Shin, J. Y., Gurudu, S. R., Hurst, R. T., Kendall, C. B., Gotway, M. B., & Liang, J. (2016). Convolutional neural networks for medical image analysis: Full training or fine tuning?. IEEE transactions on medical imaging, 35(5), 1299-1312.
  9. Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2018). SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982.
  10. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299-6308).
  11. Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems (pp. 568-576).