Machine Learning and Scientific Application

Joint Seminar of Machine Learning

This joint machine learning seminar program aims to bring together high-quality teaching and research resources from home and abroad for outstanding domestic students and young faculty, and to create a world-class environment and platform for learning and research that helps them become leading young talents in machine learning and related fields. Please see here for the earlier event webpage.

At this stage, we mainly invite the researchers most active in machine learning, from home and abroad, to systematically introduce the latest developments in machine learning research and the most cutting-edge scientific problems. The main topics are: fundamental research in machine learning, applications of machine learning to scientific problems, and applications of machine learning in industry. Seminars are held every two weeks, with two scholars invited to speak each time.

Organizing Committee:Weinan E, Bin Dong, Weiguo Gao, Zhongyi Huang, Han Wang, Zhiqin John Xu, Zhouwang Yang, Linfeng Zhang.

For past lectures, please visit: History

 

Time:2021-11-27   9:30-11:30

 

 

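Reporter:Yuhang Wang (王宇航)

Affiliations: DP Technology (北京深势科技有限公司), Beijing

Title:Principles of Cryo-Electron Microscopy and Related Algorithms

Abstract:Cryo-electron microscopy (cryo-EM) is a very important experimental technique in structural biology and has seen breakthrough progress over the past decade. It has been of great help in elucidating many biomedical questions, including the mechanism of action of the novel coronavirus. Its three founders were awarded the Nobel Prize in Chemistry in 2017. Compared with traditional structural biology techniques, cryo-EM imaging is applicable to a wider range of biological samples, and multiple structures can be resolved from the same batch of samples, giving a more complete picture of the functional mechanisms of biological macromolecules. This talk will focus on the fastest-growing branch of cryo-EM: single-particle analysis. I will introduce the principles and applications of this technique, along with related traditional algorithms and machine learning based algorithms. Although cryo-EM experimental techniques are gradually maturing, the accuracy and robustness of the associated data analysis algorithms still have substantial room for improvement. I hope this talk will spark your interest in cryo-EM so that together we can advance the development of cryo-EM algorithms.

Bio:
Yuhang Wang (cryo-EM algorithm researcher at DP Technology) graduated from the University of Illinois at Urbana-Champaign with a degree in computational biophysics. His main doctoral project used molecular dynamics methods to study the mechanisms underlying the function of biological macromolecules; the systems studied include membrane proteins such as ion channels, as well as T-cell receptor proteins. During his PhD his research focused on combining cryo-EM measurement data with molecular dynamics simulation to resolve the structures of biological macromolecules. Later, as a postdoc at the California Institute of Technology, he used cryo-electron tomography to study the structure and function of biological macromolecules in the cellular environment. His current work focuses on developing new cryo-EM data analysis algorithms.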

 

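Reporter:Tong Zhu (朱通)

Affiliations: East China Normal University

Title:Neural Network Based Combustion Simulation of Hydrocarbon Fuels

Abstract:Hydrocarbon molecules are the main component of fuels, and elucidating their combustion mechanisms is one of the key problems to be solved in order to simulate engine combustion and thereby advance engine design. Limited by force-field accuracy, the reliability of combustion reaction simulations from the widely used molecular mechanics methods still has considerable room for improvement. Because quantum chemistry methods require large computational resources, using them directly to simulate hydrocarbon fuel combustion mechanisms is not feasible. In earlier work, we developed MFCC-combustion, a fragment-based quantum chemistry method that computes the energies and forces of the simulated system efficiently and accurately. Combined with dynamics simulation software and with appropriate temperature and pressure control, the method enables ab initio molecular dynamics (AIMD) simulation of fuel combustion. Recently, using the Deep Potential model, we further improved the simulation efficiency of AIMD by roughly three orders of magnitude, enabling nanosecond-scale reactive dynamics simulation of hydrocarbon fuels. This method is expected to provide an efficient and accurate research tool for studying hydrocarbon combustion mechanisms and for building and refining fundamental combustion databases.

Bio:Tong Zhu received his PhD in 2013 from the State Key Laboratory of Precision Spectroscopy at East China Normal University, was a visiting scholar at Academia Sinica, Taiwan, from 2016 to 2018, and is now an associate research professor at the School of Chemistry and Molecular Engineering, East China Normal University. His main research uses quantum chemistry calculations and molecular dynamics simulations to study the structure and properties of complex chemical systems, including the interactions of metal ions with proteins and nucleic acids and the combustion reaction mechanisms of hydrocarbon fuels.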

 

 

Tencent Meeting Link: https://meeting.tencent.com/dm/p1q2HxwW3x4r

Tencent Meeting download link: https://meeting.tencent.com/download-center.html

Meeting ID:329744341

Password:974114

Live stream: bilibili live room

Alternate meeting ID (no password): 553 2421 7498

Time:2021-10-30   9:30-11:30

Title:Deep Network Approximation: Achieving Arbitrary Error with Fixed Size

Reporter:Shijun Zhang

Affiliations: National University of Singapore

Slides: Download

Replay: https://www.bilibili.com/video/BV1pf4y1u79M

Homepage: https://shijunzhang.top/

Abstract:This talk discusses a new type of simple feed-forward neural network that achieves the universal approximation property for all continuous functions with a fixed network size. This new type of neural network is simple because it is designed with a simple and computable continuous activation function σ leveraging a triangular-wave function and a softsign function. First, we prove that σ-activated networks with width 36d(2d + 1) and depth 11 can approximate any continuous function on a d-dimensional hypercube with an arbitrarily small error. Next, we show that classification functions arising from image and signal classification can be exactly represented by σ-activated networks with width 36d(2d + 1) and depth 12, when there exist pairwise disjoint closed bounded subsets of R^d such that the samples of the same class are located in the same subset.
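As a rough illustration of the ingredients named in this abstract, the sketch below builds one continuous activation from a triangular wave and a softsign, and evaluates the quoted width 36d(2d + 1) for a few input dimensions. The abstract does not give the formula for σ, so the piecewise definition here (triangular wave for x >= 0, softsign for x < 0) and the function names are assumptions, not necessarily the speaker's exact construction.

```python
import math

def sigma(x: float) -> float:
    """Assumed activation: period-2 triangular wave for x >= 0, softsign for x < 0."""
    if x >= 0:
        return abs(x - 2 * math.floor((x + 1) / 2))
    return x / (abs(x) + 1)

def network_width(d: int) -> int:
    """Width 36d(2d + 1) quoted in the abstract for input dimension d."""
    return 36 * d * (2 * d + 1)

# The width grows quadratically in d but does not depend on the target accuracy.
for d in (1, 2, 10):
    print(d, network_width(d))
```

Both pieces of this σ meet at 0 with value 0, so the assumed function is continuous, as the abstract requires.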

Time:2021-10-30   9:30-11:30

Title:Deep Network Approximation: Error Characterization in Terms of Width and Depth

Reporter:Haizhao Yang

Affiliations: Purdue University

Slides: Download

Replay: https://www.bilibili.com/video/BV1TL4y1q76Z

Homepage: https://haizhaoyang.github.io/

Abstract:Deep neural networks are a powerful tool in many applications in sciences, engineering, technology, and industries, especially for large-scale and high-dimensional learning problems. This talk focuses on the mathematical understanding of deep neural networks. In particular, a relation of the approximation properties of deep neural networks and function compositions is characterized. The approximation error of ReLU networks in terms of the width and depth is given for various function spaces, such as space of polynomials, continuous functions, or smooth functions on a hypercube. Finally, to achieve a better approximation error, we introduce a new type of network called Floor-ReLU networks, built with each neuron activated by either Floor or ReLU.

Time:2021-10-17   9:30-11:30

Title:Embedding Principle of Loss Landscape of Deep Neural Networks

Reporter:Yaoyu Zhang

Affiliations: Shanghai Jiao Tong University

Replay: https://www.bilibili.com/video/BV1x44y1x7Qg

Homepage: http://old.ins.sjtu.edu.cn/p/zhangyaoyu

Abstract:Understanding the structure of loss landscape of deep neural networks (DNNs) is obviously important. In this talk, we present an embedding principle that the loss landscape of a DNN "contains" all the critical points of all the narrower DNNs. More precisely, we propose a critical embedding such that any critical point, e.g., local or global minima, of a narrower DNN can be embedded to a critical point/affine subspace of the target DNN with higher degeneracy and preserving the DNN output function. Note that, given any training data, differentiable loss function and differentiable activation function, this embedding structure of critical points holds. This general structure of DNNs is starkly different from other nonconvex problems such as protein-folding. Empirically, we find that a wide DNN is often attracted by highly-degenerate critical points that are embedded from narrow DNNs. The embedding principle provides a new perspective to study the general easy optimization of wide DNNs and unravels a potential implicit low-complexity regularization during the training. Overall, this work provides a skeleton for the study of loss landscape of DNNs and its implication, by which a more exact and comprehensive understanding can be anticipated in the near future.
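A minimal sketch of the output-preserving part of this idea, assuming a one-hidden-layer tanh network and a neuron-splitting embedding (both are illustrative choices; the abstract does not spell out the embedding used in the talk). The sketch checks only that the wider network computes the same function, not that critical points are mapped to critical points.

```python
import math

def two_layer(params, x):
    """f(x) = sum_j a_j * tanh(w_j * x + b_j) for params = [(a_j, w_j, b_j), ...]."""
    return sum(a * math.tanh(w * x + b) for a, w, b in params)

def split_neuron(params, j, alpha):
    """Widen the network by one neuron: duplicate neuron j's input weights and
    share its output weight as alpha * a_j and (1 - alpha) * a_j."""
    a, w, b = params[j]
    return params[:j] + [(alpha * a, w, b), ((1 - alpha) * a, w, b)] + params[j + 1:]

narrow = [(0.7, 1.3, -0.2), (-1.1, 0.4, 0.5)]   # hypothetical width-2 network
wide = split_neuron(narrow, j=0, alpha=0.3)     # width-3 network, same function

for x in (-2.0, 0.0, 1.5):
    print(x, two_layer(narrow, x), two_layer(wide, x))
```

Every value of alpha gives the same output function, which is one way to picture the "higher degeneracy" mentioned above: a single point of the narrow network corresponds to a whole line of parameters in the wide one.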

Time:2021-10-17   9:30-11:30

Title:Global Loss Landscape of Neural Networks: What do We Know and What do We Not Know?

Reporter:Ruoyu Sun

Affiliations: University of Illinois at Urbana-Champaign

Slides: Download link

Replay: https://www.bilibili.com/video/BV1cq4y1G7Zn

Homepage: https://ruoyus.github.io/

Abstract:The recent success of neural networks suggests that their loss landscape is not too bad, but what concrete results do we know and not know about the landscape? In this talk, we present a few recent results on the landscape. First, non-linear neural nets can have sub-optimal local minima under mild assumptions (for arbitrary width, for generic input data and for most activation functions). Second, wide networks do not have sub-optimal "basin" for any continuous activation, while narrow networks can have sub-optimal basin. We will present a simple 2D geometrical object that is a basic component of neural net landscape, which can visually explain the above two results. Third, we show that for ReQU and ReLU networks, adding a proper regularizer can eliminate sub-optimal local minima and decreasing paths to infinity. Together, these results demonstrate that wide neural nets have a "nice landscape", but the meaning of "nice landscape" is more subtle than we expected. We will also mention the limitation of existing results, e.g., in what settings we do not know the existence of sub-optimal local minima. We will briefly discuss the relevance of the landscape results in two aspects: (1) these results can help understand the training difficulty of narrow networks; (2) these results can potentially help convergence analysis and implicit regularization.

Time:2021-9-25   9:30-11:30

Title:Some theoretical results on model-based reinforcement learning

Reporter:Mengdi Wang

Affiliations: Princeton University

Homepage: https://mwang.princeton.edu/

Abstract:We discuss some recent results on model-based methods for reinforcement learning (RL) in both online and offline problems. For the online RL problem, we discuss several model-based RL methods that adaptively explore an unknown environment and learn to act with provable regret bounds. In particular, we focus on finite-horizon episodic RL where the unknown transition law belongs to a generic family of models. We propose a model-based 'value-targeted regression' RL algorithm that is based on the optimism principle: in each episode, the set of models that are 'consistent' with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting values as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret, for an arbitrary family of transition models, using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014). Next we discuss batch data (offline) reinforcement learning, where the goal is to predict the value of a new policy using data generated by some behavior policy (which may be unknown). We show that the fitted Q-iteration method with linear function approximation is equivalent to a model-based plugin estimator. We establish that this model-based estimator is minimax optimal and its statistical limit is determined by a form of restricted chi-square divergence between the two policies.

Time:2021-9-25   9:30-11:30

Title:Policy Cover Guided Exploration in Model-free and Model-based Reinforcement Learning

Reporter:Wen Sun

Affiliations: Cornell University

Replay: https://www.bilibili.com/video/BV1hL4y1z7GN

Homepage: https://wensun.github.io/

Abstract:Existing RL methods that leverage random exploration techniques often fail to learn efficiently in environments where strategic exploration is needed. In this talk, we introduce a new concept called Policy Cover, which is an ensemble of learned policies. The policy cover encodes information about which part of the state space is explored and such information can be used to further guide the exploration process. We show that this idea can be used in both model-free and model-based algorithmic frameworks. Particularly, for model-free learning, we present the first policy gradient algorithm, Policy Cover Policy Gradient (PC-PG), that can explore and learn with polynomial sample complexity. For the model-based setting, we present an algorithm, Policy-Cover Model Learning and Planning (PC-MLP), that also learns and explores with polynomial sample complexity. For both approaches, we show that they are flexible enough to be used together with deep neural networks, and their deep versions achieve state-of-the-art performance on common benchmarks, including exploration-challenging tasks such as reward-free Maze exploration.

Time:2021-9-11   9:30-11:30

Title:In Search of Effective and Reproducible Clinical Imaging Biomarkers for Population Health and Oncology Applications of Screening, Diagnosis and Prognosis.

Reporter:Le Lu (吕乐)

Affiliations: PhD, FIEEE, MICCAI Board Member, AE for TPAMI

Homepage: https://www.cs.jhu.edu/~lelu/

Abstract:This talk will first give an overview of the work on employing deep learning to permit novel clinical workflows in two population health tasks, namely liver steatosis screening and quantitative reporting using conventional ultrasound, and osteoporosis screening via conventional X-ray imaging and "AI readers". These two tasks were generally considered infeasible for human readers, but as shown by our scientific and clinical studies and peer-reviewed publications, they are suitable for AI readers. AI can be a useful supplementary tool to assist physicians in cheaper, more convenient, and more precise patient management. Next, the main part of this talk describes a roadmap for three key problems in pancreatic cancer imaging: early screening, precision differential diagnosis, and deep prognosis on patient survival prediction. (1) Based on a new self-learning framework, we train the pancreatic ductal adenocarcinoma (PDAC) segmentation model using a larger quantity of patients, with a mix of annotated/unannotated venous or multi-phase CT images. Pseudo annotations are generated by combining two teacher models with different PDAC segmentation specialties on unannotated images, and can be further refined by a teaching assistant model that identifies associated vessels around the pancreas. Our approach makes it technically feasible for robust large-scale PDAC screening from multi-institutional multi-phase partially-annotated CT scans. (2) We propose a holistic segmentation-mesh classification network (SMCN) to provide patient-level diagnosis, by fully utilizing the geometry and location information. SMCN learns the pancreas and mass segmentation task and builds an anatomical correspondence-aware organ mesh model by progressively deforming a pancreas prototype on the raw segmentation mask. Our results are comparable to a multimodality clinical test that combines clinical, imaging, and molecular testing for clinical management of patients with cysts.
(3) Accurate preoperative prognosis of resectable PDACs for personalized treatment is highly desired in clinical practice. We present a novel deep neural network for the survival prediction of resectable PDAC patients, 3D Contrast-Enhanced Convolutional Long Short-Term Memory network (CE-ConvLSTM), to derive the tumor attenuation signatures from CE-CT imaging studies. Our framework can significantly improve prediction performance over existing state-of-the-art survival analysis methods. This deep tumor signature evidently adds value (as a predictive biomarker) when combined with the existing clinical staging system.

Time:2021-8-28   9:30-11:30

Title:Deep Learning for inverse problems

Reporter:Lexing Ying

Affiliations: Stanford University

Replay: https://www.bilibili.com/video/BV1Qq4y1K71Q

Homepage: https://web.stanford.edu/~lexing/

Abstract:This talk is about some recent progress on solving inverse problems using deep learning. Compared to traditional machine learning problems, inverse problems are often limited by the size of the training data set. We show how to overcome this issue by incorporating mathematical analysis and physics into the design of neural network architectures. We first describe neural network representations of pseudodifferential operators and Fourier integral operators. We then continue to discuss applications including electric impedance tomography, optical tomography, inverse acoustic/EM scattering, seismic imaging, and travel-time tomography.

Time:2021-8-28   9:30-11:30

Title:Self-supervised Deep Learning for Solving Inverse Problems in Imaging

Reporter:Hui Ji

Affiliations: National University of Singapore

Replay: https://www.bilibili.com/video/BV1uU4y177VF

Homepage: https://blog.nus.edu.sg/matjh/

Abstract:Deep learning has become a prominent tool for solving many inverse problems in imaging sciences. Most existing SOTA solutions are built on supervised learning with a prerequisite on the availability of a large-scale dataset of many degraded/truth image pairs. In recent years, driven by practical need, there has been increasing interest in studying deep learning methods under limited data resources, which has particular significance for imaging in science and medicine. This talk will focus on the discussion of self-supervised deep learning for solving inverse imaging problems, which assumes no training sample is available. By examining deep learning from the perspective of Bayesian inference, we will present several results and techniques on self-supervised learning for the MMSE (minimum mean squared error) estimator. Built on these techniques, we will show that, in several applications, the resulting dataset-free deep learning methods provide very competitive performance in comparison to their SOTA supervised counterparts. While the demonstrations only cover image denoising, compressed sensing, and phase retrieval, the presented techniques and methods are quite general and can be used for solving many other inverse imaging problems.

Time:2021-8-14   9:30-11:30

Title:Barron Spaces

Reporter:Jinchao Xu

Affiliations: Pennsylvania State University

Replay: https://www.bilibili.com/video/BV1Yq4y1Q7n2

Homepage: http://www.personal.psu.edu/jxx1/

Abstract:In this talk, we will present some recent results related to the Barron space studied by E et al. in 2019. For the ReLU activation function, we give an equivalent characterization of the corresponding Barron space in terms of the convex hull of an appropriate dictionary. This characterization enables a generalization of the notion of Barron space to neural networks with general activation functions, and even to general dictionaries of functions. We provide an explicit representation of some Barron norms which are of particular interest to the theory of neural networks, specifically corresponding to a dictionary of decaying Fourier modes and the dictionary corresponding to shallow ReLU^k networks. Next, we present optimal estimates of approximation rates, metric entropy, Kolmogorov and Gelfand n-widths of the Barron space unit ball with respect to L^2 for ReLU^k activation functions and the dictionary of decaying Fourier modes. These results provide a solution to several open problems concerning the precise approximation properties of these spaces. If time allows, we will also give recent results on the approximation rates and metric entropies for sigmoidal and ReLU Barron spaces with respect to L^p for p > 2. This talk is based on joint work with Jonathan Siegel.

Time:2021-8-14   9:30-11:30

Title:Convergence analysis for the gradient descent optimization method in the training of artificial neural networks with ReLU activation for piecewise linear target functions

Reporter:Arnulf Jentzen

Affiliations:
Applied Mathematics: Institute for Analysis and Numerics, Faculty of Mathematics and Computer Science, University of Muenster, Germany;
School of Data Science and Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China

Slides: Download

Replay: https://www.bilibili.com/video/BV1DU4y177A1

Homepage: http://www.ajentzen.de/

Abstract:Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Despite the great success of GD type optimization methods in numerical simulations for the training of ANNs with ReLU activation, it remains -- even in the simplest situation of the plain vanilla GD optimization method with random initializations and ANNs with one hidden layer -- an open problem to prove (or disprove) the conjecture that the risk of the GD optimization method converges in the training of ANNs with ReLU activation to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity. In this talk we prove this conjecture in the special situation where the probability distribution of the input data is absolutely continuous with respect to the continuous uniform distribution on a compact interval and where the target function under consideration is piecewise linear.

Time:2021-7-31   9:30-11:30

Title:Learning and Learning to Solve PDEs

Reporter:Bin Dong (董彬), Peking University

Slides: Download

Replay: https://www.bilibili.com/video/BV1dh411z7JA

Homepage: https://bicmr.pku.edu.cn/~dongbin/

Abstract:Deep learning continues to dominate machine learning and has been successful in computer vision, natural language processing, etc. Its impact has now expanded to many research areas in science and engineering. In this talk, I will mainly focus on the recent impact of deep learning on computational mathematics. I will present our recent work on bridging deep neural networks with numerical differential equations. On the one hand, I will show how to design transparent deep convolutional networks to uncover hidden PDE models from observed dynamical data. On the other hand, I will present our recent preliminary attempts to combine wisdom from numerical PDEs and machine learning to design data-driven solvers for PDEs.

Time:2021-7-31   9:30-11:30

Title:Neural Operator: Learning Maps Between Function Spaces

Reporter:Zongyi Li (李宗宜), California Institute of Technology

Slides: Download

Replay: https://www.bilibili.com/video/BV1MA411P7HU

Homepage: https://zongyi-li.github.io/

Abstract:The classical development of neural networks has primarily focused on learning mappings between finite dimensional Euclidean spaces or finite sets. We propose a generalization of neural networks tailored to learn operators mapping between infinite dimensional function spaces. We formulate the approximation of operators by composition of a class of linear integral operators and nonlinear activation functions, so that the composed operator can approximate complex nonlinear operators. We prove a universal approximation theorem for our construction. The proposed neural operators are resolution-invariant: they share the same network parameters between different discretizations of the underlying function spaces and can be used for zero-shot super-resolution. Numerically, the proposed models show superior performance compared to existing machine learning based methodologies on Burgers' equation, Darcy flow, and the Navier-Stokes equation, while being several orders of magnitude faster compared to conventional PDE solvers.
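A minimal sketch of why a layer built from an integral operator can share parameters across discretizations. It assumes a single scalar layer on [0, 1], a fixed Gaussian-shaped kernel standing in for a learned one, a pointwise weight w, and the midpoint rule for the integral; all of these names and choices are hypothetical and are not taken from the talk.

```python
import math

def kernel(x: float, y: float) -> float:
    """Hypothetical fixed kernel standing in for a learned one."""
    return math.exp(-(x - y) ** 2)

def operator_layer(v, x: float, n: int, w: float = 0.5) -> float:
    """One layer u(x) = relu(w * v(x) + integral_0^1 kernel(x, y) v(y) dy),
    with the integral approximated by the midpoint rule on n cells."""
    h = 1.0 / n
    integral = sum(kernel(x, (i + 0.5) * h) * v((i + 0.5) * h) for i in range(n)) * h
    return max(0.0, w * v(x) + integral)

def v(y: float) -> float:
    """Input function fed to the layer."""
    return math.sin(2 * math.pi * y)

# The same parameters (kernel and w) are evaluated on two different grids.
coarse = operator_layer(v, 0.25, n=64)
fine = operator_layer(v, 0.25, n=256)
print(coarse, fine)
```

Because the parameters live in the kernel rather than in a grid-sized weight matrix, refining the grid only changes the quadrature error, so the two outputs agree closely.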

Time:2021-7-17   9:30-11:30

Title:Frequency Principle

Reporter:Zhi-Qin John Xu (许志钦), Shanghai Jiao Tong University

Slides: Download

Replay: https://www.bilibili.com/video/BV1Yy4y1T7M8

Homepage: https://ins.sjtu.edu.cn/people/xuzhiqin/

Abstract:In this talk, I will introduce the frequency principle (F-Principle) in detail, including experiments and theory. I will also connect the F-Principle with traditional iterative methods, such as Jacobi methods, to understand the training of neural networks from the perspective of numerical analysis. Then, I will use some examples to show how the F-Principle benefits the design of neural networks. Finally, I will talk about some open questions about the F-Principle.
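For the classical side of this comparison, the sketch below measures how quickly a Jacobi-type sweep removes error at different frequencies. Relaxation schemes of this kind damp high-frequency error fastest, whereas the F-Principle refers to networks fitting low frequencies first. The setup is an assumption chosen for illustration: weighted Jacobi with ω = 2/3 on the 1D Poisson matrix tridiag(-1, 2, -1) with 63 interior points.

```python
import math

def jacobi_damping(k: int, n: int = 63, steps: int = 10, omega: float = 2 / 3) -> float:
    """Ratio of error norms after/before `steps` weighted-Jacobi sweeps on the
    1D Poisson matrix tridiag(-1, 2, -1), starting from the error mode sin(k*pi*x)."""
    e = [math.sin(k * math.pi * i / (n + 1)) for i in range(1, n + 1)]
    start = math.sqrt(sum(t * t for t in e))
    for _ in range(steps):
        padded = [0.0] + e + [0.0]  # zero Dirichlet boundary values
        e = [(1 - omega) * padded[i] + omega * 0.5 * (padded[i - 1] + padded[i + 1])
             for i in range(1, n + 1)]
    return math.sqrt(sum(t * t for t in e)) / start

low = jacobi_damping(k=1)    # smooth error mode: barely reduced
high = jacobi_damping(k=48)  # oscillatory error mode: removed almost entirely
print(low, high)
```

Each sine mode is an eigenvector of the sweep, with factor 1 - 2ω sin²(kπ / (2(n + 1))) per step, which is close to 1 for small k and small in magnitude for large k.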

Time:2021-7-3   9:30-11:30

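Title:Deep Learning Based Molecular Dynamics Simulation

Reporter:Han Wang (王涵), Institute of Applied Physics and Computational Mathematics, Beijing

Slides: Download link

Abstract:Molecular dynamics simulation requires an accurate description of the interatomic interactions (the potential energy function), but one faces a dilemma: first-principles methods are accurate but expensive, while empirical potentials are fast but of limited accuracy. In this talk we discuss solutions from two angles: constructing the potential function and generating data. For potential construction, we introduce the Deep Potential method, an accurate representation of the first-principles potential energy function. For data generation, we introduce the concurrent learning scheme DP-GEN, which automatically generates a minimal training dataset meeting a specified accuracy requirement. Compared with empirical potentials, DP-GEN opens up the possibility of continually improving the Deep Potential by exploring configuration and chemical space. In the last part of the talk, we introduce an optimized implementation of the Deep Potential method for heterogeneous CPU+GPU supercomputers. This implementation reached a peak performance of 91 PFLOPS in double precision on the Summit supercomputer and can complete nanosecond-scale molecular dynamics simulations with first-principles accuracy within one day, more than 1000 times faster than the previous baseline. In our work, the combination of physical models, deep learning, and high-performance computing provides a powerful simulation tool for major scientific discoveries.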

Time:2021-7-3   9:30-11:30

Title:DeePKS: a machine learning assisted electronic structure model

Reporter:Linfeng Zhang (张林峰), Beijing Institute of Big Data Research and DP Technology

Slides: Download link

Homepage: https://cn.linkedin.com/in/linfeng-zhang-%E5%BC%A0%E6%9E%97%E5%B3%B0-312242a8

Abstract:We introduce a general machine learning-based framework for building an accurate and widely-applicable energy functional within the framework of generalized Kohn-Sham density functional theory. In particular, we develop a way of training self-consistent models that are capable of taking large datasets from different systems and different kinds of labels. We demonstrate that the functional that results from this training procedure, with the efficiency of cheap density functional models, gives chemically accurate predictions on energy, force, dipole, and electron density for a large class of molecules.

Time:2021-6-19   9:30-11:30

Title:Modelling Temporal Data: from RNNs to CNNs

Reporter:Zhong Li, Qianxiao Li

Replay: https://www.bilibili.com/video/BV1Wq4y1L7Cg

Homepage: https://blog.nus.edu.sg/qianxiaoli/

Abstract:There are several competing models in deep learning when modelling input-output relationships in temporal data: recurrent neural networks (RNNs), convolutional neural networks (CNNs), transformers, etc. In recent work, we study the approximation properties and optimization dynamics of RNNs when applied to learn temporal dynamics. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem of such linear functionals and characterize the approximation rate. Moreover, we perform a fine-grained dynamical analysis of training linear RNNs by gradient methods. A unifying theme uncovered is the non-trivial effect of memory, a notion that can be made precise in our framework, on both approximation and optimization: when there is long-term memory in the target, it takes a large number of neurons to approximate it. Moreover, the training process will suffer from severe slowdowns. In particular, both of these effects become exponentially more pronounced with increasing memory - a phenomenon we call the "curse of memory". These analyses represent a basic step towards a concrete mathematical understanding of new phenomena arising in learning temporal relationships using recurrent architectures.

We also study the approximation properties of convolutional architectures applied to time series modelling. Similar to the recurrent setting, parallel results for convolutional architectures are derived regarding the approximation efficiency, with WaveNet being a prime example. Our results reveal that under this new setting, the approximation efficiency is not only characterized by memory, but also by additional fine structures in the target relationship. This leads to a novel definition of spectrum-based regularity that measures the complexity of temporal relationships under the convolutional approximation scheme. These analyses provide a foundation to understand the differences between architectural choices for temporal modelling with theoretically grounded guidance for practical applications.

Time:2021-6-5   9:30-11:30

Topic:Implicit biases of SGD for neural network models

Reporter:Lei Wu

Replay: https://www.bilibili.com/video/BV1xv411V7v3

Slides: Download

Homepage: https://leiwu0.github.io/index.html

Abstract:Understanding the implicit biases of optimization algorithms is one of the core problems in theoretical machine learning. This refers to the fact that even without any explicit regularizations to avoid overfitting, the dynamics of an optimizer itself is biased to pick solutions that generalize well. This talk introduces the recent progress in understanding the implicit bias of stochastic gradient descent (SGD) for neural network models. First, we consider the gradient descent flow, i.e., SGD with an infinitesimal learning rate, for two-layer neural networks. In particular, we will see how the implicit bias is affected by the extent of over-parameterization. Then, we turn to SGD with a finite learning rate. The influence of learning rate as well as the batch size will be studied from the perspective of dynamical stability. The concept of uniformity is introduced, which, together with flatness, characterizes the accessibility of a particular SGD to a global minimum. This analysis shows that learning rate and batch size play different roles in selecting global minima. Extensive empirical results correlate well with the theoretical findings.
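The dynamical-stability viewpoint can be illustrated in its simplest form, assuming full-batch gradient descent on a one-dimensional quadratic f(x) = sharpness * x^2 / 2. This covers only the learning-rate and flatness part of the picture; the batch-size and uniformity effects discussed in the talk need a stochastic model and are not reproduced here. The function name and the numerical values are hypothetical.

```python
def gd_final_distance(sharpness: float, lr: float, steps: int = 50, x0: float = 1e-3) -> float:
    """Run gradient descent on f(x) = sharpness * x**2 / 2 from a small
    perturbation x0 of the minimum at 0 and return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of f is sharpness * x
    return abs(x)

lr = 0.1
flat = gd_final_distance(sharpness=5.0, lr=lr)    # lr * sharpness = 0.5 < 2: contracts
sharp = gd_final_distance(sharpness=30.0, lr=lr)  # lr * sharpness = 3.0 > 2: escapes
print(flat, sharp)
```

Each step multiplies the perturbation by (1 - lr * sharpness), so a minimum is stable for this iteration only when sharpness < 2 / lr; a larger learning rate therefore rules out sharper minima.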

Time:2021-6-5   9:30-11:30

Topic:The landscape-dependent annealing strategy in machine learning: How Stochastic-Gradient-Descent finds flat minima

Reporter:Yuhai Tu (IBM T. J. Watson Research Center)

Replay: https://www.bilibili.com/video/BV1v64y1R7Xx

Slides: Download

Homepage: https://researcher.watson.ibm.com/researcher/view.php?person=us-yuhai

Abstract:Despite tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds "good" solutions (low generalization error) in the high-dimensional weight space. In this talk, we discuss our recent work [1,2] on establishing a theoretical framework based on nonequilibrium statistical physics to understand the SGD learning dynamics, the loss function landscape, and their relation. Our study shows that SGD dynamics follows a low-dimensional drift-diffusion motion in the weight space and the loss function is flat with large values of flatness (inverse of curvature) in most directions. Furthermore, our study reveals a robust inverse relation between the weight variance in SGD and the landscape flatness opposite to the fluctuation-response relation in equilibrium systems. We develop a statistical theory of SGD based on properties of the ensemble of minibatch loss functions and show that the noise strength in SGD depends inversely on the landscape flatness, which explains the inverse variance-flatness relation. Our study suggests that SGD serves as an "intelligent" annealing strategy where the effective temperature self-adjusts according to the loss landscape in order to find the flat minimum regions that contain generalizable solutions. Finally, we discuss an application of these insights for reducing catastrophic forgetting efficiently for sequential multiple tasks learning.

Reference:
1. “The inverse variance-flatness relation in Stochastic-Gradient-Descent is critical for finding flat minima”, Y. Feng and Y. Tu, PNAS, 118 (9), 2021.
2. “Phases of learning dynamics in artificial neural networks: in the absence and presence of mislabeled data”, Y. Feng and Y. Tu, Machine Learning: Science and Technology (MLST), April 7, 2021. https://iopscience.iop.org/article/10.1088/2632-2153/abf5b9/pdf

Time:2021-6-5   9:30-11:30

Topic:Stochastic gradient descent for noise with ML-type scaling

Reporter:Stephan Wojtowytsch (Princeton University)

Replay: https://www.bilibili.com/video/BV1ky4y137ou

Slides: Download

Homepage: https://www.swojtowytsch.com/

Abstract:In the literature on stochastic gradient descent, there are two types of convergence results: (1) SGD finds minimizers of convex objective functions and (2) SGD finds critical points of smooth objective functions. Classical results are obtained under the assumption that the stochastic noise is L^2-bounded and that the learning rate decays to zero at a suitable speed. We show that, if the objective landscape and noise possess certain properties which are reminiscent of deep learning problems, then we can obtain global convergence guarantees of first type under second type assumptions for a fixed (small, but positive) learning rate. The convergence is exponential, but with a large random coefficient. If the learning rate exceeds a certain threshold, we discuss minimum selection by studying the invariant distribution of a continuous time SGD model. We show that at a critical threshold, SGD prefers minimizers where the objective function is 'flat' in a precise sense.

Time:2021-5-22   9:30-11:30

Topic:Learn like a child: super-large-scale multi-modal pre-training model

Reporter:Jirong Wen

Replay: https://www.bilibili.com/video/BV1Z54y1G7du

Abstract:The experience revolution in cognitive science offers a new perspective on how meaning is understood from language: the ability to think and use language results from the cooperation between body and mind. The body involves various modalities such as vision, hearing, smell, touch, and motor nerves. Human children learn language in a multi-modal environment, which is something current AI lacks. This lecture will introduce some of our work on cross-modal understanding of language and images. Starting from the relationship between vision and language, we use tens of millions or even hundreds of millions of image-text pairs from the Internet to build one of the largest general-purpose Chinese image-text pre-training models with self-supervised tasks, as an initial exploration of whether AI can learn language in a multi-modal environment. By analyzing how language changes from monomodal to multimodal learning, we found some phenomena closely related to human cognition.

Time:2021-5-22,9:30-11:30

Topic:Knowledge-guided pre-training language model

Reporter:Zhiyuan Liu

Replay: https://www.bilibili.com/video/BV1MK4y1G7Eb

Abstract：In recent years, deep learning has become a key technology for natural language processing. In particular, pre-trained language models have significantly improved the overall performance of natural language processing since 2018. As a typical data-driven method, deep learning represented by pre-trained language models still faces problems such as poor interpretability and poor robustness. Introducing large amounts of human linguistic knowledge and world knowledge into the model is an important direction for improving the performance of deep learning, but it also faces many challenges. This report will systematically introduce the latest developments and trends in knowledge-guided pre-trained language models.

Time:2021-5-9,9:30-11:30

Topic:Introduction to Reinforced Dynamics

Reporter：Han Wang

Slides: Download

Abstract：This report will introduce the basic concepts of molecular dynamics simulation, as well as a fundamental problem in molecular dynamics simulation: the sampling problem. We briefly introduce the limitations and challenges of two types of enhanced sampling methods. In particular, the report will introduce in detail our solution to the enhanced sampling problem: an enhanced sampling method based on deep learning, called reinforced dynamics. Finally, we show the effect of reinforced dynamics on protein structure prediction.

Time:2021-5-9,9:30-11:30

Topic:Flow Model: A Computational Physics Perspective

Reporter:Lei Wang

Replay: https://www.bilibili.com/video/BV1gp4y1t7ND

Slides: Download

Abstract：This report will draw on some personal research experience to introduce scientific questions and scientific applications related to flow-based generative models. From the perspective of computational physics, we will see how flow models relate to optimal transport theory, fluid mechanics, symplectic geometric algorithms, the renormalization group, and Monte Carlo calculations.

Time:2021-4-17,10:30-11:30

Topic:Understand the training process of neural networks

Reporter:Zhiqin John Xu

Replay: https://www.bilibili.com/video/BV1Rh411U7xH

Slides: Download

Abstract：From the perspective of approximation theory alone, an over-parameterized neural network can have infinitely many solutions that minimize the error on the training set. In actual training, however, the neural network always seems to find a solution that generalizes well. To understand how a neural network can learn a class of well-generalizing solutions out of infinitely many possibilities, it is necessary to understand the training process that leads to the solution. In this report, I will introduce some progress in understanding the training process of neural networks, such as how the complexity of a neural network changes during training, its frequency behavior, and how initialization affects training. Finally, I will discuss some open issues and explore what the training process suggests about the development trend of neural networks.

Time:2021-4-17,9:30-10:30

Topic:Machine Learning and Dynamical Systems

Reporter:Qianxiao Li

Replay: https://www.bilibili.com/video/BV1jh411S7ds

Slides: Download

Abstract：In this talk, we discuss some recent work on the connections between machine learning and dynamical systems. These come broadly in three categories, namely machine learning via, for and of dynamical systems, and here we will focus on the first two. In the direction of machine learning via dynamical systems, we introduce a dynamical approach to deep learning theory with particular emphasis on its connections with control theory. In the reverse direction of machine learning for dynamical systems, we discuss the approximation and optimization theory of learning input-output temporal relationships using recurrent neural networks, with the goal of highlighting key new phenomena that arise in learning in dynamic settings. If time permits, we will also discuss some applications of dynamical systems to the analysis of optimization algorithms commonly applied in machine learning.

Time:2021-4-3,10:30-11:30

Topic:Neural network and high-dimensional function approximation

Reporter:Lei Wu

Replay: https://www.bilibili.com/video/BV1R64y1m7fJ

Abstract：In recent years, deep learning methods based on neural network models have achieved unprecedented success in different fields, such as computer vision and scientific computing. From the perspective of approximation theory, these successes depend on the powerful ability of neural networks to approximate high-dimensional functions. Traditional methods, by contrast, inevitably suffer from the curse of dimensionality when approximating high-dimensional functions. Does this indicate that neural networks can avoid the curse of dimensionality in some sense? If so, what is the mechanism behind it? We will discuss these questions for three models: kernel methods, two-layer neural networks, and deep residual networks. In particular, we will characterize the high-dimensional function space that each model can approximate. Finally, we will list some open questions to give everyone an overall understanding of this field.

Time:2021-4-3,9:30-10:30

Topic:Use deep learning to solve high-dimensional control problems

Reporter:Jiequn Han

Replay: https://www.bilibili.com/video/BV1uK411c7Cz

Abstract：The vigorous development of deep learning in recent years has provided us with powerful new tools for high-dimensional computation. Among them, backpropagation and stochastic gradient descent give us efficient algorithms for finding optimal neural networks. This report will first analyze neural networks from the perspective of control theory, discussing the similarities between finding an optimal neural network and solving an optimal control problem, and the insights these algorithms offer for solving high-dimensional control problems. Guided by these insights, we will present two works that use deep learning to solve high-dimensional problems: (1) solving model-based high-dimensional stochastic control problems; (2) using the variational form of backward stochastic differential equations to solve parabolic partial differential equations. Compared with traditional algorithms, which are limited by the curse of dimensionality, these algorithms show large computational advantages and greatly improve our ability to handle a broad class of high-dimensional problems. Finally, we will discuss some unresolved issues in related directions to give everyone a better understanding of this field.

The 3rd course of 2020 "Deep Learning Fundamentals and Practice"

Reporter:Linfeng Zhang

Time:2020-11-15, 10:00 - 12:00

Replay: https://www.bilibili.com/video/BV1i64y1y7nN?p=2

The 2nd course of 2020 "Deep Learning Fundamentals and Practice"

Reporter:Lei Wu

Time:2020-11-08 10:00 - 12:00

Replay: https://www.bilibili.com/video/BV1i64y1y7nN?p=1

The 1st course of 2020 "Deep Learning Fundamentals and Practice"

Reporter:Weinan E

Time:2020-11-01 10:00 - 12:00