Estimation of the Left Ventricular Ejection Fraction from Spatiotemporal Echocardiography

Contrastive learning has proven useful in many applications where access to labelled data is limited. The lack of annotated data is particularly problematic in medical image segmentation, because it is difficult to have clinical experts manually annotate large volumes of data, such as cardiac structures in ultrasound images of the heart. In this paper, we propose a self-supervised contrastive learning method to segment the left ventricle from echocardiography when few annotated images are available. Furthermore, we study the effect of contrastive pretraining on two well-known segmentation networks, UNet and DeepLabV3. Our results show that contrastive pretraining improves performance on left ventricle segmentation, particularly when annotated data is scarce. We show that models trained in a self-supervised fashion and then fine-tuned on just 5% of the data achieve results comparable to state-of-the-art fully supervised algorithms. Our solution outperforms currently published work on a large public dataset (EchoNet-Dynamic), achieving a Dice score of 0.9252. We also evaluate our solution on a smaller dataset (CAMUS) to demonstrate its generalizability.

Learning spatiotemporal features is an important task for efficient video understanding, especially in medical imaging such as echocardiograms. Convolutional neural networks (CNNs) and the more recent vision transformers (ViTs) are the most commonly used methods. However, according to the literature, each approach has certain limitations. CNNs are good at capturing local context but fail to learn global information across video frames. Vision transformers, on the other hand, can incorporate global details and long sequences but are computationally expensive and typically require more data to train. In this paper, we propose a method that addresses the limitations typically faced when training on medical video data such as echocardiographic scans. The proposed algorithm (EchoCoTr) combines the strengths of vision transformers and CNNs to estimate the left ventricular ejection fraction (LVEF) from ultrasound videos. We demonstrate that the proposed method outperforms state-of-the-art work to date on the EchoNet-Dynamic dataset, with an MAE of 3.95 and an R² of 0.82. These results show a noticeable improvement over all published research. In addition, we present extensive ablations and comparisons with several algorithms, including ViT and BERT.
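
For readers unfamiliar with the quantities reported above, the following is a minimal sketch of how they are commonly defined: the Dice score for segmentation overlap, LVEF from end-diastolic and end-systolic volumes, and MAE and R² for regression. The function names are hypothetical and chosen for illustration only; this is not the evaluation code used in the thesis, and the handling of empty masks is an assumed convention.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks (1 = left ventricle)."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ejection_fraction(edv: float, esv: float) -> float:
    """LVEF in percent from end-diastolic (edv) and end-systolic (esv) volume."""
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from any dataset.
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
    print(ejection_fraction(edv=120.0, esv=50.0))
    print(mean_absolute_error([60, 55, 40], [58, 57, 43]))
    print(r2_score([60, 55, 40], [58, 57, 43]))
```

A Dice score close to 1 indicates near-complete overlap between the predicted and reference masks, a lower MAE indicates smaller average error in the predicted LVEF, and an R² close to 1 indicates that the predictions explain most of the variance in the reference values.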

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc degree in Machine Learning

Advisors: Dr. Mohammad Yaqub, Dr. Hang Dai

Online access provided for MBZUAI patrons