Algorithms for stochastic nonconvex and nonsmooth optimization

Courtney Paquette
University of Waterloo

Nonsmooth and nonconvex loss functions are often used to model physical phenomena, provide robustness, and improve stability. While convergence guarantees in the smooth, convex setting are well documented, algorithms for solving large-scale nonsmooth and nonconvex problems remain in their infancy.
I will begin by isolating a class of nonsmooth and nonconvex functions that can be used to model a variety of statistical and signal processing tasks. Standard statistical assumptions on such inverse problems often endow the optimization formulation with an appealing regularity condition: the objective grows sharply away from the solution set. We show that under such regularity, a variety of simple algorithms, including subgradient and Gauss-Newton-type methods, converge rapidly when initialized within a constant relative error of the optimal solution. We illustrate the theory and algorithms on the real phase retrieval problem, and survey a number of other applications, including blind deconvolution and covariance matrix estimation.
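As a concrete illustration of this first part, the sketch below runs a subgradient method with the Polyak step size on a synthetic real phase retrieval instance, minimizing (1/m) * sum_i |(a_i^T x)^2 - b_i|. The noiseless setting (optimal value 0), the Gaussian measurements, and the 30% relative-error initialization are assumptions made for the example; the exact algorithmic choices in the talk may differ.

```python
import numpy as np

def polyak_subgradient_phase_retrieval(A, b, x0, max_iter=500, f_star=0.0):
    """Polyak subgradient method for min_x (1/m) * sum_i |(a_i^T x)^2 - b_i|.

    Assumes the noiseless setting, so the minimal value f_star is 0.
    """
    x = x0.copy()
    m = A.shape[0]
    for _ in range(max_iter):
        Ax = A @ x
        r = Ax ** 2 - b                      # residuals (a_i^T x)^2 - b_i
        f = np.mean(np.abs(r))               # objective value
        if f <= 1e-12:
            break
        g = (2.0 / m) * (A.T @ (np.sign(r) * Ax))   # a subgradient at x
        x = x - (f - f_star) / np.dot(g, g) * g      # Polyak step
    return x

# Toy instance: Gaussian measurements of a planted signal, initialized
# within 30% relative error of the solution set {x_true, -x_true}.
rng = np.random.default_rng(0)
d, m = 50, 400
x_true = rng.standard_normal(d)
A = rng.standard_normal((m, d))
b = (A @ x_true) ** 2
delta = rng.standard_normal(d)
delta /= np.linalg.norm(delta)
x0 = x_true + 0.3 * np.linalg.norm(x_true) * delta
x_hat = polyak_subgradient_phase_retrieval(A, b, x0)
print(min(np.linalg.norm(x_hat - x_true), np.linalg.norm(x_hat + x_true)))
```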
One of the main advantages of smooth optimization over its nonsmooth counterpart is the potential to use a line search for improved numerical performance. A long-standing open question is how to design a line-search procedure in the stochastic setting. In the second part of the talk, I will present a practical line-search method for smooth stochastic optimization that has rigorous convergence guarantees and requires only knowable quantities for implementation. While traditional line-search methods rely on exact computations of the gradient and function values, our method assumes that these values are available only up to a dynamically adjusted accuracy that holds with a sufficiently high, but fixed, probability. We show that the expected number of iterations to reach an approximate stationary point matches the worst-case efficiency of typical first-order methods, while for convex and strongly convex objectives it achieves the corresponding rates of deterministic gradient descent.
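To make the flavor of such a procedure concrete, here is a minimal sketch of a backtracking Armijo-type line search driven entirely by stochastic estimates of the function and gradient. The helper names f_est and grad_est, the simple accept/shrink rule, and the toy least-squares usage are illustrative assumptions, not the precise method or analysis presented in the talk.

```python
import numpy as np

def stochastic_armijo_line_search(grad_est, f_est, x0, alpha0=1.0, theta=0.5,
                                  c=1e-4, alpha_max=10.0, max_iter=300):
    """Backtracking line search on stochastic estimates (illustrative sketch).

    f_est(x) and grad_est(x) are assumed to return mini-batch estimates of the
    objective and gradient that are sufficiently accurate with a fixed, high
    probability; the step size grows after successful steps and shrinks after
    unsuccessful ones.
    """
    x, alpha = x0.copy(), alpha0
    for _ in range(max_iter):
        g = grad_est(x)
        fx = f_est(x)
        x_trial = x - alpha * g
        # Armijo sufficient-decrease test evaluated on the (noisy) estimates
        if f_est(x_trial) <= fx - c * alpha * np.dot(g, g):
            x = x_trial
            alpha = min(alpha / theta, alpha_max)   # successful step: enlarge
        else:
            alpha = theta * alpha                   # unsuccessful step: shrink
    return x

# Toy usage: mini-batch estimates for a least-squares objective.
rng = np.random.default_rng(1)
n, d, batch = 1000, 20, 32
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d)

def f_est(x):
    idx = rng.integers(0, n, batch)
    return 0.5 * np.mean((A[idx] @ x - y[idx]) ** 2)

def grad_est(x):
    idx = rng.integers(0, n, batch)
    return A[idx].T @ (A[idx] @ x - y[idx]) / batch

x_hat = stochastic_armijo_line_search(grad_est, f_est, np.zeros(d))
print(f_est(x_hat))
```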