Research Overview

The research of the Ztatistics (Zhang's Statistics) Lab focuses on the four pillars of Statistics: Method, Theory, Computation, and Application. We develop novel statistical methods to provide data-driven answers to challenging questions from the real world. We develop new limit theorems and statistical theory to advance our knowledge of uncertainty quantification and to guide the design of next-generation inference protocols. We develop computational algorithms to implement our methods and take advantage of high-performance computing clusters. We apply our results to data from a variety of applications to make new scientific discoveries, and we look forward to collaborating with researchers on projects that can benefit from our expertise.

A brief summary of some of our research directions can be found below.

Some Research Directions

The Phenomenon of Serial Tail Dependence

The phenomenon of tail dependence refers to dependence in the tail of a distribution, which is closely tied to extreme risks. Under the conventional notion of dependence, one may be interested in knowing, for example, whether a stock price drop today will be followed by another drop tomorrow. In the tail dependence setting, however, we are mainly interested in whether a big price drop today will be followed by more big drops in the following days. This is frequently observed in a crisis, such as the 2007 financial crisis or the 2020 pandemic, and as a result understanding tail dependence can be critical in modeling extreme risks. Through the tail adversarial stability framework recently developed by the PI, we have been able to advance the knowledge frontier on statistical inference for tail-dependent time series, and our results are expected to impact not only statistics but any discipline that involves the analysis of tail-dependent time series data.
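
As a simple illustration of the phenomenon (a hedged sketch, not our methodology), the snippet below estimates the lag-1 lower-tail dependence P(X_{t+1} <= q | X_t <= q) from a simulated return series with volatility clustering; the threshold level and the GARCH-type coefficients are hypothetical choices made only for illustration.

```python
import numpy as np

def lag1_lower_tail_dependence(x, alpha=0.05):
    """Empirical P(X_{t+1} <= q | X_t <= q), with q the alpha-quantile of x."""
    x = np.asarray(x, dtype=float)
    q = np.quantile(x, alpha)      # threshold defining a "big drop"
    today = x[:-1] <= q            # big drop on day t
    tomorrow = x[1:] <= q          # big drop on day t + 1
    return np.nan if today.sum() == 0 else (today & tomorrow).sum() / today.sum()

rng = np.random.default_rng(0)
n = 5000
eps = rng.standard_normal(n)
# GARCH(1,1)-type recursion: volatility clustering induces serial tail dependence.
x = np.empty(n)
sigma2 = np.empty(n)
sigma2[0], x[0] = 1.0, eps[0]
for t in range(1, n):
    sigma2[t] = 0.05 + 0.15 * x[t - 1] ** 2 + 0.80 * sigma2[t - 1]
    x[t] = np.sqrt(sigma2[t]) * eps[t]

print("benchmark if days were independent   :", 0.05)
print("P(big drop tomorrow | big drop today):",
      round(lag1_lower_tail_dependence(x, alpha=0.05), 3))
```

If consecutive days were independent, the conditional probability would match the unconditional 5% level; under serial tail dependence it is noticeably larger.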

[Plot from Zhang (2021, Biometrika)] 

High-Dimensional Time Series

High-dimensional time series data have been emerging from applications in a number of disciplines. For example, a brain fMRI dataset may contain time series of measurements taken over a large number of voxels, and can thus be viewed as a high-dimensional time series. Also, recent advances in climate science have allowed the collection of satellite-derived high-resolution temperature series over a region of interest, providing another example of high-dimensional time series. Although high-dimensional data have been studied extensively in the recent big data era, most existing results were developed for temporally independent random vectors under certain restrictions on the tail behavior of the underlying distribution. We have been developing statistical methods, and their associated theory, that remain valid under nonnegligible temporal dependence without specifying any parametric time series model, thereby avoiding model misspecification issues.
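
To see why temporal dependence cannot simply be ignored, the hedged sketch below (with hypothetical dimensions and an AR(1) coefficient chosen only for illustration; it is not our methodology) compares the true Monte Carlo variability of coordinate-wise sample means of a dependent high-dimensional series with the standard error one would report under an i.i.d. assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, phi, reps = 200, 50, 0.7, 1000   # length, dimension, AR(1) coefficient, replications

sample_means = np.empty((reps, p))
for r in range(reps):
    x = np.empty((n, p))
    x[0] = rng.standard_normal(p) / np.sqrt(1 - phi**2)   # stationary start
    for t in range(1, n):                                  # each coordinate is an AR(1)
        x[t] = phi * x[t - 1] + rng.standard_normal(p)
    sample_means[r] = x.mean(axis=0)

true_sd = sample_means.std(axis=0).mean()            # Monte Carlo truth, averaged over coordinates
naive_se = np.sqrt(1 / (1 - phi**2)) / np.sqrt(n)    # i.i.d.-style sigma / sqrt(n)
print("Monte Carlo sd of a coordinate's sample mean:", round(true_sd, 4))
print("i.i.d.-based standard error                 :", round(naive_se, 4))
# The ratio is roughly sqrt((1 + phi) / (1 - phi)), about 2.4 here, so
# i.i.d.-based inference would be far too optimistic under this dependence.
```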

[Plot from Zhang (2013, Journal of the American Statistical Association)] 

Nonparametric and Semiparametric Learning

Although parametric models are easy to use and implement in statistical practice, they can be misspecified, leading to unreliable or erroneous conclusions. Nonparametric methods, on the other hand, are model-free and thus robust to model misspecification. However, quantifying the uncertainty of results from a nonparametric method can be a difficult problem, especially for dependent data. For example, when a nonparametric method is used to estimate a target function, the usual pointwise confidence intervals may not be of direct interest, as one often seeks to gauge the estimation uncertainty across the whole domain of the target function. This relates to the difficult problem of simultaneous inference for nonparametric functions. We have been developing novel results on uncertainty quantification for nonparametric and semiparametric learning schemes, and have applied them to address important questions from climate science, finance, and telecommunications.
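
The sketch below illustrates the gap between pointwise and simultaneous inference using a textbook Nadaraya-Watson estimator on simulated i.i.d. data (all tuning choices are hypothetical and the smoothing bias is ignored; this is an illustration, not our method): intervals calibrated pointwise at 95% cover the whole curve far less often than 95%, which is why simultaneous bands require a wider critical value.

```python
import numpy as np

rng = np.random.default_rng(2)

def coverage_one_replication(n=500, h=0.05, sigma=0.3):
    """One simulated dataset: pointwise 95% intervals for a Nadaraya-Watson fit."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(np.pi * x) + sigma * rng.standard_normal(n)
    grid = np.linspace(0.1, 0.9, 50)                       # interior grid to limit boundary effects
    w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)   # Gaussian kernel weights
    fhat = (w * y).sum(axis=1) / w.sum(axis=1)             # Nadaraya-Watson estimate
    se = sigma * np.sqrt((w**2).sum(axis=1)) / w.sum(axis=1)     # conditional se, bias ignored
    covered = np.abs(fhat - np.sin(np.pi * grid)) <= 1.96 * se
    return covered.mean(), covered.all()

results = [coverage_one_replication() for _ in range(300)]
print("average pointwise coverage        :", round(np.mean([r[0] for r in results]), 3))
print("whole-curve (simultaneous) coverage:", round(np.mean([r[1] for r in results]), 3))
```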

[Plot from Zhang (2015, Journal of Econometrics)] 

Self-Normalized Automation

Designing statistical inference protocols for different quantities often requires a case-by-case study. For example, inference procedures for the mean and the median are usually different and are guided by different limit theorems. In the time series setting, developing such procedures can be even more challenging due to the complicated form of the asymptotic variance, which is nevertheless essential for obtaining the statistical cut-off threshold. For example, in the mean case, the asymptotic variance under dependence is no longer the simple marginal variance but an infinite sum of autocovariances of all orders, making its estimation a nontrivial problem that requires regularization techniques such as banding or thresholding. We have been developing self-normalized automation schemes for time series inference, which avoid direct estimation of the complicated asymptotic variance by using a sequence of recursive estimators and are versatile in handling various quantities such as the mean, the median, and beyond.
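
The sketch below shows the general self-normalization idea in a textbook form for the mean (not our specific procedure): the normalizer is built from recursive sample means, the limiting null distribution is pivotal and can be simulated once, and no long-run variance estimation or bandwidth choice is ever needed.

```python
import numpy as np

def sn_statistic(x, mu0):
    """Self-normalized statistic for the mean:
    T_n = n (xbar_n - mu0)^2 / V_n,  V_n = n^{-2} sum_t [ t (xbar_t - xbar_n) ]^2,
    where xbar_t are recursive sample means; no long-run variance is estimated."""
    x = np.asarray(x, dtype=float)
    n = x.size
    recursive_means = np.cumsum(x) / np.arange(1, n + 1)
    xbar = recursive_means[-1]
    v = np.sum((np.arange(1, n + 1) * (recursive_means - xbar)) ** 2) / n**2
    return n * (xbar - mu0) ** 2 / v

rng = np.random.default_rng(3)

# The limiting null distribution is pivotal (free of the dependence structure),
# so a critical value can be obtained once by simulation with i.i.d. data.
null_draws = [sn_statistic(rng.standard_normal(1000), 0.0) for _ in range(2000)]
crit95 = np.quantile(null_draws, 0.95)

# Apply the same statistic to an AR(1) series with true mean 1: the temporal
# dependence enters the normalizer automatically, and no bandwidth is chosen.
n, phi = 800, 0.5
x = np.empty(n)
x[0] = rng.standard_normal()
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()
x += 1.0
print("simulated 95% critical value :", round(crit95, 1))
print("T_n at the true mean mu0 = 1 :", round(sn_statistic(x, 1.0), 1))
print("T_n at a wrong mean mu0 = 0  :", round(sn_statistic(x, 0.0), 1))
```

At the true mean the statistic stays below the simulated critical value with roughly 95% probability, while at a wrong mean it blows up, all without ever estimating the infinite sum of autocovariances.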

[Plot from Zhang and Lavitas (2018, Journal of the American Statistical Association)]