On the Identifiability of Causal Discovery from Temporally Aggregated Data

Document Type



Causal discovery is central to understanding the relationships between variables in fields such as economics, the social sciences, and the natural sciences. Time-delayed data arise in many real-world applications, and Granger causality is a widely used method for causal discovery in such data. In practice, however, causal effects often take place at a very fine time scale, and it is difficult to obtain time series sampled at a matching frequency. The observed data therefore appear simultaneous but are in fact aggregates over a period of time. Researchers may consequently apply instantaneous causal discovery methods to these data, which raises questions about the validity of such approaches. This thesis asks whether the underlying time-delayed causal model can be recovered from temporally aggregated data in vector autoregressive (VAR) models. In doing so, it aims to clarify the conditions under which causal information is preserved and to improve the reliability and applicability of causal discovery methods in practice. By examining the identifiability of causal discovery from temporally aggregated data in the general functional case, the thesis goes beyond the linear models often used in the literature. Its key contributions include a rigorous theoretical analysis of the functional equation, which investigates the properties of the lag function (f) and the estimated function and the consistency between them. It also studies conditional independence between variables, exploring when the conditional independence set induced by the distribution of the temporally aggregated data agrees with the one entailed by the underlying causal model.
This investigation considers chain-like, fork-like, and collider-like structures and examines the conditions needed to keep conditional independence consistent. Simulation experiments are also conducted to validate the theoretical results. The results show that, for the first problem, the estimated instantaneous function can be consistent with the underlying lag function only if the function is additive, which often implies linearity. This suggests that it is difficult to relax the linearity requirement when recovering the underlying lag function from temporally aggregated data. For the second problem, preserving conditional independence, collider structures preserve it naturally. For fork and chain structures, partial linearity in the system (not necessarily complete linearity) is sufficient for consistent conditional independence.
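The aggregation effect described above can be illustrated with a minimal simulation. The sketch below is not the thesis's experimental setup. It assumes a two-variable linear VAR(1) process in which the first variable drives the second with a one-step delay, and it assumes aggregation by summing over non-overlapping windows. The function names, the coefficient 0.8, and the window length 50 are hypothetical choices made only for illustration.

```python
import numpy as np


def simulate_var1(A, n_steps, rng):
    """Simulate x[t] = A @ x[t-1] + e[t] with standard normal noise (assumed model)."""
    d = A.shape[0]
    x = np.zeros((n_steps, d))
    for t in range(1, n_steps):
        x[t] = A @ x[t - 1] + rng.standard_normal(d)
    return x


def aggregate(x, k):
    """Sum each block of k consecutive time steps (non-overlapping windows)."""
    n = (x.shape[0] // k) * k  # drop the incomplete trailing window
    return x[:n].reshape(-1, k, x.shape[1]).sum(axis=1)


rng = np.random.default_rng(0)

# Hypothetical causal structure: x1(t-1) -> x2(t), with no other dependence.
A = np.array([[0.0, 0.0],
              [0.8, 0.0]])

x = simulate_var1(A, n_steps=200_000, rng=rng)
agg = aggregate(x, k=50)

# Correlation within the same aggregated period versus across adjacent periods.
same_corr = np.corrcoef(agg[:, 0], agg[:, 1])[0, 1]
lag_corr = np.corrcoef(agg[:-1, 0], agg[1:, 1])[0, 1]

print(f"same-period correlation:  {same_corr:.3f}")
print(f"one-period-lag correlation: {lag_corr:.3f}")
```

Under these assumptions, the dependence between the two aggregated series shows up almost entirely within the same period, while the dependence across adjacent periods is close to zero. This is why the aggregated data look instantaneous even though the underlying mechanism is purely time-delayed.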

Publication Date



Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc degree in Machine Learning

Advisors: Dr. Kun Zhang, Dr. Bin Gu

Online access available for MBZUAI patrons