题目：Deep Network Approximation: Error Characterization in term of Width and Depth
摘要：Deep neural networks are a powerful tool in many applications in sciences, engineering, technology, and industries, especially for large-scale and high-dimensional learning problems. This talk focuses on the mathematical understanding of deep neural networks. In particular, a relation of the approximation properties of deep neural networks and function compositions is characterized. The approximation error of ReLU networks in terms of the width and depth is given for various function spaces, such as space of polynomials, continuous functions, or smooth functions on a hypercube. Finally, to achieve a better approximation error, we introduce a new type of network called Floor-ReLU networks, built with each neuron activated by either Floor or ReLU.
题目：Deep Network Approximation: Achieving Arbitrary Error with Fixed Size
摘要：This talk discusses a new type of simple feed-forward neural network that achieves the universal approximation property for all continuous functions with a fixed network size. This new type of neural network is simple because it is designed with a simple and computable continuous activation function σ leveraging a triangular-wave function and a softsign function. First, we prove that σ-activated networks with width 36d(2d + 1) and depth 11 can approximate any continuous function on a d-dimensional hypercube with an arbitrarily small error. Next, we show that classification functions arising from image and signal classification can be exactly represented by σ-activated networks with width 36d(2d + 1) and depth 12, when there exist pairwise disjoint closed bounded subsets of Rd such that the samples of the same class are located in the same subset.
Tecent Meeting 下载链接： https://meeting.tencent.com/download-center.html
备用会议ID(无密码)： 625 2803 3607
题目：Embedding Principle of Loss Landscape of Deep Neural Networks
Affiliations： Shanghai Jiao Tong University
摘要：Understanding the structure of loss landscape of deep neural networks (DNNs) is obviously important. In this talk, we present an embedding principle that the loss landscape of a DNN "contains" all the critical points of all the narrower DNNs. More precisely, we propose a critical embedding such that any critical point, e.g., local or global minima, of a narrower DNN can be embedded to a critical point/affine subspace of the target DNN with higher degeneracy and preserving the DNN output function. Note that, given any training data, differentiable loss function and differentiable activation function, this embedding structure of critical points holds. This general structure of DNNs is starkly different from other nonconvex problems such as protein-folding. Empirically, we find that a wide DNN is often attracted by highly-degenerate critical points that are embedded from narrow DNNs. The embedding principle provides a new perspective to study the general easy optimization of wide DNNs and unravels a potential implicit low-complexity regularization during the training. Overall, this work provides a skeleton for the study of loss landscape of DNNs and its implication, by which a more exact and comprehensive understanding can be anticipated in the near future.
题目：Global Loss Landscape of Neural Networks: What do We Know and What do We Not Know?
Affiliations： University of Illinois at Urbana-Champaign
摘要：The recent success of neural networks suggests that their loss landscape is not too bad, but what concrete results do we know and not know about the landscape? In this talk, we present a few recent results on the landscape. First, non-linear neural nets can have sub-optimal local minima under mild assumptions (for arbitrary width, for generic input data and for most activation functions). Second, wide networks do not have sub-optimal ``basin'' for any continuous activation, while narrow networks can have sub-optimal basin. We will present a simple 2D geometrical object that is a basic component of neural net landscape, which can visually explain the above two results. Third, we show that for ReQU and ReLU networks, adding a proper regularizer can eliminate sub-optimal local minima and decreasing paths to infinity. Together, these results demonstrate that wide neural nets have a ``nice landscape'', but the meaning of ``nice landscape'' is more subtle than we expected. We will also mention the limitation of existing results, e.g., in what settings we do not know the existence of sub-optimal local minima. We will briefly discuss the relevance of the landscape results in two aspects: (1) these results can help understand the training difficulty of narrow networks; (2) these results can potentially help convergence analysis and implicit regularization.
题目：Some theoretical results on model-based reinforcement learning
Affiliations： Princeton University
摘要：We discuss some recent results on model-based methods for reinforcement learning (RL) in both online and offline problems. For the online RL problem, we discuss several model-based RL methods that adaptively explore an unknown environment and learn to act with provable regret bounds. In particular, we focus on finite-horizon episodic RL where the unknown transition law belongs to a generic family of models. We propose a model based ‘value-targeted regression’ RL algorithm that is based on optimism principle: In each episode, the set of models that are `consistent' with the data collected is constructed. The criterion of consistency is based on the total squared error of that the model incurs on the task of predicting values as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret, for arbitrary family of transition models, using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014). Next we discuss batch data (offline) reinforcement learning, where the goal is to predict the value of a new policy using data generated by some behavior policy (which may be unknown). We show that the fitted Q-iteration method with linear function approximation is equivalent to a model-based plugin estimator. We establish that this model-based estimator is minimax optimal and its statistical limit is determined by a form of restricted chi-square divergence between the two policies.
题目：Policy Cover Guided Exploration in Model-free and Model-based Reinforcement Learning
Affiliations： Cornell University
摘要：Existing RL methods that leverage random exploration techniques often fail to learn efficiently in environments where strategic exploration is needed. In this talk, we introduce a new concept called Policy Cover, which is an ensemble of learned policies. The policy cover encodes information about which part of the state space is explored and such information can be used to further guide the exploration process. We show that this idea can be used in both model-free and model-based algorithmic frameworks. Particularly, for model-free learning, we present the first policy gradient algorithm--- Policy Cover Policy Gradient (PG-PG), that can explore and learn with polynomial sample complexity. For model-based setting, we present an algorithm — Policy-Cover Model Learning and Planning (PC-MLP), that also learns and explores with polynomial sample complexity. For both approaches, we show that they are flexible enough to be used together with deep neural networks, and their deep versions achieve state-of-art performance on common benchmarks, including exploration challenging tasks such as reward-free Maze exploration.
题目：In Search of Effective and Reproducible Clinical Imaging Biomarkers for Population Health and Oncology Applications of Screening, Diagnosis and Prognosis.
报告人：Le Lu (吕乐)
Affiliations： PhD, FIEEE, MICCAI Board Member, AE for TPAMI
摘要：This talk will first give an overall on the work of employing deep learning to permit novel clinical workflows in two population health tasks, namely using conventional ultrasound for liver steatosis screening and quantitative reporting; osteoporosis screening via conventional X-ray imaging and "AI readers". These two tasks were generally considered as infeasible tasks for human readers, but as proved by our scientific and clinical studies and peer-reviewed publications, they are suitable for AI readers. AI can be a supplementary and useful tool to assist physicians for cheaper and more convenient/precision patient management. Next, the main part of this talk describes a roadmap on three key problems in pancreatic cancer imaging solution: early screening, precision differential diagnosis, and deep prognosis on patient survival prediction. (1) Based on a new self-learning framework, we train the pancreatic ductal adenocarcinoma (PDAC) segmentation model using a larger quantity of patients, with a mix of annotated/unannotated venous or multi-phase CT images. Pseudo annotations are generated by combining two teacher models with different PDAC segmentation specialties on unannotated images, and can be further refined by a teaching assistant model that identifies associated vessels around the pancreas. Our approach makes it technically feasible for robust large-scale PDAC screening from multi-institutional multi-phase partially-annotated CT scans. (2) We propose a holistic segmentation-mesh classification network (SMCN) to provide patient-level diagnosis, by fully utilizing the geometry and location information. SMCN learns the pancreas and mass segmentation task and builds an anatomical correspondence-aware organ mesh model by progressively deforming a pancreas prototype on the raw segmentation mask. Our results are comparable to a multimodality clinical test that combines clinical, imaging, and molecular testing for clinical management of patients with cysts. (3) Accurate preoperative prognosis of resectable PDACs for personalized treatment is highly desired in clinical practice. We present a novel deep neural network for the survival prediction of resectable PDAC patients, 3D Contrast-Enhanced Convolutional Long Short-Term Memory network (CE-ConvLSTM), to derive the tumor attenuation signatures from CE-CT imaging studies. Our framework can significantly improve the prediction performances upon existing state-of-the-art survival analysis methods. This deep tumor signature has evidently added values (as a predictive biomarker) to be combined with the existing clinical staging system.
题目：Deep Learning for inverse problems
摘要：This talk is about some recent progress on solving inverse problems using deep learning. Compared to traditional machine learning problems, inverse problems are often limited by the size of the training data set. We show how to overcome this issue by incorporating mathematical analysis and physics into the design of neural network architectures. We first describe neural network representations of pseudodifferential operators and Fourier integral operators. We then continue to discuss applications including electric impedance tomography, optical tomography, inverse acoustic/EM scattering, seismic imaging, and travel-time tomography.
题目：Self-supervised Deep Learning for Solving Inverse Problems in Imaging
摘要：Deep learning has become a prominent tool for solving many inverse problems in imaging sciences. Most existing SOTA solutions are built on supervised learning with a prerequisite on the availability of a large-scale dataset of many degraded/truth image pairs. In recent years, driven by practical need, there is an increasing interest on studying deep learning methods under limited data resources, which has particular significance for imaging in science and medicine. This talk will focus on the discussion of self-supervised deep learning for solving inverse imaging problems, which assumes no training sample is available. By examining deep learning from the perspective of Bayesian inference, we will present several results and techniques on self-supervised learning for MMSE (minimum mean squared error) estimator. Built on these techniques, we will show that, in several applications, the resulting dataset-free deep learning methods provide very competitive performance in comparison to their SOTA supervised counterparts. While the demonstrations only cover image denoising, compressed sensing, and phase retrieval, the presented techniques and methods are quite general which can used for solving many other inverse imaging problems.
Affiliations： Pennsylvania State University
摘要：In this talk, we will present some recent results related to the Barron space studiedf by E et al in 2019. For the ReLU activation function, we give an equivalent characterization of the corresponding Barron space in terms of the convex hull of an appropriate dictionary. This characterization enables a generalization of the notion of Barron space to neural networks with general activation functions, and even to general dictionaries of functions. We provide an explicit representation of some Barron norms which are of particular interest to the theory of neural networks, specifically corresponding to a dictionary of decaying Fourier modes and the dictionary corresponding to shallow ReLU^k networks. Next, we present optimal estimates of approximation rates, metric entropy, Kolmogorov and Gelfand n-widths of the Barron space unit ball with respect to L^2 for ReLU^k activation functions and the dictionary of decaying Fourier modes. These results provide a solution to several open problems concerning the precise approximation properties of these spaces. If time allows, we will also give recent results on the approximation rates and metric entropies for sigmoidal and ReLU Barron spaces with respect to L^p for p > 2. This talk is based on joint work with Jonathan Siegel.
题目：Convergence analysis for the gradient descent optimization method in the training of artificial neural networks with ReLU activation for piecewise linear target functions
Applied Mathematics: Institute for Analysis and Numerics, Faculty of Mathematics and Computer Science, University of Muenster, Germany;
School of Data Science and Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China
摘要：Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Despite the great success of GD type optimization methods in numerical simulations for the training of ANNs with ReLU activation, it remains -- even in the simplest situation of the plain vanilla GD optimization method with random initializations and ANNs with one hidden layer -- an open problem to prove (or disprove) the conjecture that the risk of the GD optimization method converges in the training of ANNs with ReLU activation to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity. In this talk we prove this conjecture in the special situation where the probability distribution of the input data is absolutely continuous with respect to the continuous uniform distribution on a compact interval and where the target function under consideriation is piecewise linear.
题目：Learning and Learning to Solve PDEs
摘要：Deep learning continues to dominate machine learning and has been successful in computer vision, natural language processing, etc. Its impact has now expanded to many research areas in science and engineering. In this talk, I will mainly focus on some recent impact of deep learning on computational mathematics. I will present our recent work on bridging deep neural networks with numerical differential equations. On the one hand, I will show how to design transparent deep convolutional networks to uncover hidden PDE models from observed dynamical data. On the other hand, I will present our recent preliminary attempts to combine wisdoms from numerical PDEs and machine learning to design data-driven solvers for PDEs.
题目：Neural Operator: Learning Maps Between Function Spaces
摘要：The classical development of neural networks has primarily focused on learning mappings between finite dimensional Euclidean spaces or finite sets. We propose a generalization of neural networks tailored to learn operators mapping between infinite dimensional function spaces. We formulate the approximation of operators by composition of a class of linear integral operators and nonlinear activation functions, so that the composed operator can approximate complex nonlinear operators. We prove a universal approximation theorem for our construction. The proposed neural operators are resolution-invariant: they share the same network parameters between different discretization of the underlying function spaces and can be used for zero-shot super-resolutions. Numerically, the proposed models show superior performance compared to existing machine learning based methodologies on Burgers' equation, Darcy flow, and the Navier-Stokes equation, while being several order of magnitude faster compared to conventional PDE solvers.
报告人：许志钦 (Zhi-Qin John Xu), 上海交通大学
摘要：In this talk, I would introduce frequency principle (F-Principle) in detail, including experiments, and theory. I would also connect the F-Principle with traditional iterative methods, such as Jacobi methods, understanding the training of neural networks from the perspective of numerical analysis. Then, I will use some examples to show how F-Principle benefits the design of neural networks. Finally, I would talk about some open questions about the F-Principle.
题目：DeePKS: a machine learning assisted electronic structure model
摘要：We introduce a general machine learning-based framework for building an accurate and widely-applicable energy functional within the framework of generalized Kohn-Sham density functional theory. In particular, we develop a way of training self-consistent models that are capable of taking large datasets from different systems and different kinds of labels. We demonstrate that the functional that results from this training procedure, with the efficiency of cheap density functional models, gives chemically accurate predictions on energy, force, dipole, and electron density for a large class of molecules.
题目：Modelling Temporal Data: from RNNs to CNNs
报告人：Zhong Li, Qianxiao Li
摘要：There are several competing models in deep learning when modelling
input-output relationships in temporal
data: recurrent neural networks (RNNs), convolutional neural networks (CNNs),
transformers, etc. In recent
work, we study the approximation properties and optimization dynamics of RNNs when
applied to learn
temporal dynamics. We consider the simple but representative setting of using
continuous-time linear RNNs
to learn from data generated by linear relationships. Mathematically, the latter can be
understood as a
sequence of linear functionals. We prove a universal approximation theorem of such
linear functionals and
characterize the approximation rate. Moreover, we perform a fine-grained dynamical
analysis of training
linear RNNs by gradient methods. A unifying theme uncovered is the non-trivial effect of
memory, a notion
that can be made precise in our framework, on both approximation and optimization: when
there is longterm memory in the
target, it takes a large number of neurons to approximate it. Moreover, the training
process will suffer from severe slow downs. In particular, both of these effects become
pronounced with increasing memory - a phenomenon we call the “curse of memory”. These
a basic step towards a concrete mathematical understanding of new phenomenons arising in
relationships using recurrent architectures.
We also study the approximation properties of convolutional architectures applied to time series modelling. Similar to the recurrent setting, parallel results for convolutional architectures are derived regarding to the approximation efficiency, with WaveNet being a prime example. Our results reveal that under this new setting, the approximation efficiency is not only characterized by memory, but also additional fine structures in the target relationship. This leads to a novel definition of spectrum-based regularity that measures the complexity of temporal relationships under the convolutional approximation scheme. These analyses provide a foundation to understand the differences between architectural choices for temporal modelling with theoretically grounded guidance for practical applications.
题目：Implicit biases of SGD for neural network models
报告人：Lei Wu (Princeton University)
摘要：Understanding the implicit biases of optimization algorithms is one of the core problems in theoretical machine learning. This refers to the fact that even without any explicit regularizations to avoid overfitting, the dynamics of an optimizer itself is biased to pick solutions that generalize well. This talk introduces the recent progress in understanding the implicit bias of stochastic gradient descent (SGD) for neural network models. First, we consider the gradient descent flow, i.e., SGD with an infinitesimal learning rate, for two-layer neural networks. In particular, we will see how the implicit bias is affected by the extent of over-parameterization. Then, we turn to SGD with a finite learning rate. The influence of learning rate as well as the batch size will be studied from the perspective of dynamical stability. The concept of uniformity is introduced, which, together, with flatness characterizes the accessibility of a particular SGD to a global minimum. This analysis shows that learning rate and batch size play different roles in selecting global minima. Extensive empirical results correlate well with the theoretical findings.
题目：The landscape-dependent annealing strategy in machine learning: How Stochastic-Gradient-Descent finds flat minima
报告人：Yuhai Tu (IBM T. J. Watson Research Center)
摘要：Despite tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds ``good" solutions (low generalization error) in the high-dimensional weight space. In this talk, we discuss our recent work [1,2] on establishing a theoretical framework based on nonequilibrium statistical physics to understand the SGD learning dynamics, the loss function landscape, and their relation. Our study shows that SGD dynamics follows a low-dimensional drift-diffusion motion in the weight space and the loss function is flat with large values of flatness (inverse of curvature) in most directions. Furthermore, our study reveals a robust inverse relation between the weight variance in SGD and the landscape flatness opposite to the fluctuation-response relation in equilibrium systems. We develop a statistical theory of SGD based on properties of the ensemble of minibatch loss functions and show that the noise strength in SGD depends inversely on the landscape flatness, which explains the inverse variance-flatness relation. Our study suggests that SGD serves as an ``intelligent" annealing strategy where the effective temperature self-adjusts according to the loss landscape in order to find the flat minimum regions that contain generalizable solutions. Finally, we discuss an application of these insights for reducing catastrophic forgetting efficiently for sequential multiple tasks learning.
Reference： 1. “The inverse variance-flatness relation in Stochastic-Gradient-Descent is critical for finding flat minima”, Y. Feng and Y. Tu, PNAS, 118 (9), 2021. 2. “Phases of learning dynamics in artificial neural networks: in the absence and presence of mislabeled data”, Y. Feng and Y. Tu, Machine Learning: Science and Technology (MLST), April 7, 2021. https://iopscience.iop.org/article/10.1088/2632-2153/abf5b9/pdf
题目：Stochastic gradient descent for noise with ML-type scaling
报告人：Stephan Wojtowytsch (Princeton University)
摘要：In the literature on stochastic gradient descent, there are two types of convergence results: (1) SGD finds minimizers of convex objective functions and (2) SGD finds critical points of smooth objective functions. Classical results are obtained under the assumption that the stochastic noise is L^2-bounded and that the learning rate decays to zero at a suitable speed. We show that, if the objective landscape and noise possess certain properties which are reminiscent of deep learning problems, then we can obtain global convergence guarantees of first type under second type assumptions for a fixed (small, but positive) learning rate. The convergence is exponential, but with a large random coefficient. If the learning rate exceeds a certain threshold, we discuss minimum selection by studying the invariant distribution of a continuous time SGD model. We show that at a critical threshold, SGD prefers minimizers where the objective function is 'flat' in a precise sense.
题目：Machine Learning and Dynamical Systems
摘要：In this talk, we discuss some recent work on the connections between machine learning and dynamical systems. These come broadly in three categories, namely machine learning via, for and of dynamical systems, and here we will focus on the first two. In the direction of machine learning via dynamical systems, we introduce a dynamical approach to deep learning theory with particular emphasis on its connections with control theory. In the reverse direction of machine learning for dynamical systems, we discuss the approximation and optimization theory of learning input-output temporal relationships using recurrent neural networks, with the goal of highlighting key new phenomena that arise in learning in dynamic settings. If time permits, we will also discuss some applications of dynamical systems on the analysis of optimization algorithms commonly applied in machine learning.
摘要：近年来，以神经网络模型为基础的深度学习方法在不同领域取得了前所未有的成功， 例如计算机视觉、科学计算等。从逼近论的角度来说，这些成功依赖于神经网络强大的逼近高维函数的能力。而我们知道传统方法在逼近高维函数时必然会遭受维数诅咒。 这是否表明神经网络在某种意义下可以避免维数诅咒？如果可以，那么背后的机制又是什么呢？ 我们将围绕kernel方法、两层神经网络和深度残差网络三个模型来讨论这些问题。 特别地，我们将刻画每个模型所逼近的高维函数空间。最后，我们会罗列一些公开问题帮助大家对这个领域有个整体理解。
摘要：近年来深度学习的蓬勃发展为我们求解高维计算提供了新的强有力工具，其中梯度反向传播和随机梯度下降的算法为我们求解最优神经网络提供了高效的算法。本报告将先以控制论的视角来分析神经网络，探讨求解最优神经网络和最优控制问题的相似点以及上述算法对于我们求解高维控制问题的启发。在此启发下我们会展示两个利用深度学习求解高维控制问题的工作，(1) 求解基于模型的高维随机控制问题；(2) 结合倒向随机方程的变分形式求解抛物类偏微分方程。这些算法相比于以往的受限于维数灾难的传统算法体现出巨大的计算优势，大幅度提高了我们处理一大类高维问题的计算能力。最后我们会讨论一些相关方向的未解决问题帮助大家对这个领域有更好的理解。
时间：2020年11月15日 10:00 - 12:00
时间：2020年11月08日 10:00 - 12:00
时间：2020年11月01日 10:00 - 12:00