**Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs**

Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, Josh Tenenbaum

**Abstract:** The idea of computer vision as the Bayesian inverse problem to computer graphics has a long history and an appealing elegance, but it has proved difficult to directly implement. Instead, most vision tasks are approached via complex bottom-up processing pipelines. Here we show that it is possible to write short, simple probabilistic graphics programs that define flexible generative models and to automatically invert them to interpret real-world images. Generative probabilistic graphics programs consist of a stochastic scene generator, a renderer based on graphics software, a stochastic likelihood model linking the renderer’s output and the data, and latent variables that adjust the fidelity of the renderer and the tolerance of the likelihood model. Representations and algorithms from computer graphics, originally designed to produce high-quality images, are instead used as the deterministic backbone for highly approximate and stochastic generative models. This formulation combines probabilistic programming, computer graphics, and approximate Bayesian computation, and depends only on general-purpose, automatic inference techniques. We describe two applications: reading sequences of degraded and adversarially obscured alphanumeric characters, and inferring 3D road models from vehicle-mounted camera images. Each of the probabilistic graphics programs we present relies on under 20 lines of probabilistic code, and supports accurate, approximately Bayesian inferences about ambiguous real-world images.

**Memoized Online Variational Inference for Dirichlet Process Mixture Models**

Michael Hughes, Erik Sudderth

**Abstract:** Variational inference algorithms provide the most effective framework for large-scale training of Bayesian nonparametric models. Stochastic online approaches are promising, but are sensitive to the chosen learning rate and often converge to poor local optima. We present a new algorithm, memoized online variational inference, which scales to very large (yet finite) datasets while avoiding the complexities of stochastic gradient. Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples. Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization. Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed. Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.

**On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations**

Tamir Hazan, Subhransu Maji, Tommi Jaakkola

**Abstract:** In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical high signal – high coupling’’ regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. ”

**Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC**

Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen

**Abstract:** State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models. We place a Gaussian process prior over the transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. However, to enable efficient inference, we marginalize over the dynamics of the model and instead infer directly the joint smoothing distribution through the use of specially tailored Particle Markov Chain Monte Carlo samplers. Once an approximation of the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. We make use of sparse Gaussian process models to greatly reduce the computational complexity of the approach.

**Structured Learning via Logistic Regression**

Justin Domke

**Abstract:** A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where exists to minimize a logistic loss.oraclean smoothedis

**A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks**

Junming Yin, Qirong Ho, Eric Xing

**Abstract:** We propose a scalable approach for making inference about latent spaces of large networks. With a succinct representation of networks as a bag of triangular motifs, a parsimonious statistical model, and an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices and hundreds of latent roles on a single machine in a matter of hours, a setting that is out of reach for many existing methods. When compared to the state-of-the-art probabilistic approaches, our method is several orders of magnitude faster, with competitive or improved accuracy for latent space recovery and link prediction.

**Learning Stochastic Inverses**

Andreas Stuhlmuller, Jacob Taylor, Noah Goodman

**Abstract:** We describe a class of algorithms for amortized inference in Bayesian networks. In this setting, we invest computation upfront to support rapid online inference for a wide range of queries. Our approach is based on learning an inverse factorization of a model’s joint distribution: a factorization that turns observations into root nodes. Our algorithms accumulate information to estimate the local conditional distributions that constitute such a factorization. These stochastic inverses can be used to invert each of the computation steps leading to an observation, sampling backwards in order to quickly find a likely explanation. We show that estimated inverses converge asymptotically in number of (prior or posterior) training samples. To make use of inverses before convergence, we describe the Inverse MCMC algorithm, which uses stochastic inverses to make block proposals for a Metropolis-Hastings sampler. We explore the efficiency of this sampler for a variety of parameter regimes and Bayes nets.

**Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex**

Yee Whye Teh, Sam Patterson

**Abstract:** In this paper we investigate the use of Langevin Monte Carlo methods on the probability simplex and propose a new method, Stochastic gradient Riemannian Langevin dynamics, which is simple to implement and can be applied online. We apply this method to latent Dirichlet allocation in an online setting, and demonstrate that it achieves substantial performance improvements to the state of the art online variational Bayesian methods.

NIPS was fabulous this year, kudos to all the organizers, area chairs, reviewers, and volunteers. Between the record number of attendees, multitude of corporate sponsors, and the Mark Zuckerburg show, this year’s conference is most notable for sheer magnitude. It’s past the point that one person can summarize it effectively, but here’s my retrospective, naturally heavily biased towards my interests.

The keynote talks were all excellent, consistent with the integrative “big picture” heritage of the conference. My favorite was by Daphne Koller, who talked about the “other online learning”, i.e., pedagogy via telecommunications. Analogous to how moving conversations online allows us to precisely characterize the popularity of Snooki, moving instruction online facilitates the use of machine learning to improve human learning. Based upon the general internet arc from early infovore dominance to mature limbic-stimulating pablum, it’s clear the ultimate application of the Coursera platform will be around courtship techniques, but in the interim a great number of people will experience more substantial benefits.

As far as overall themes, I didn’t detect any emergent technologies, unlike previous years where things like deep learning, randomized methods, and spectral learning experienced a surge. Intellectually the conference felt like a consolidation phase, as if the breakthroughs of previous years were still being digested. However, output representation learning and extreme classification (large cardinality multiclass or multilabel learning) represent interesting new frontiers and hopefully next year there will be further progress in these areas.

There were several papers about improving the convergence of stochastic gradient descent which appeared broadly similar from a theoretical standpoint (Johnson and Zhang; Wang et. al.; Zhang et. al.). I like the control variateinterpretation of Wang et. al. the best for generating an intuition, but if you want to implement something than Figure 1 of Johnson and Zhang has intelligible pseudocode.

Covariance matrices were hot, and not just for PCA. The BIG & QUIC algorithm of Hseih et. al. for estimating large sparse inverse covariance matrices was technically very impressive and should prove useful for causal modeling of biological and neurological systems (presumably some hedge funds will also take interest). Bartz and Müller had some interesting ideas regarding shrinkage estimators, including the “orthogonal complement” idea that the top eigenspace should not be shrunk since the sample estimate is actually quite good.

An interesting work in randomized methods was from McWilliams et. al., in which two random feature maps are then aligned with CCA over unlabeled data to extract the “useful” random features. This is a straightforward and computationally inexpensive way to leverage unlabeled data in a semi-supervised setup, and it is consistent with theoretical results from CCA regression. I’m looking forward to trying it out.

The workshops were great, although as usual there are so many interesting things going on simultaneously that it made for difficult choices. I bounced between extreme classification, randomized methods, and big learning the first day. Michael Jordan’s talk in big learning was excellent, particularly the part juxtaposing decreasing computational complexity of various optimization relaxations with increasing statistical risk (both effects due to the expansion of the feasible set). This is starting to get at the tradeoff between data and computation resources. Extreme classification (large cardinality multiclass or multilabel learning) is an exciting open area which is important (e.g., for structured prediction problems that arise in NLP) and appears tractable in the near-term. Two relevant conference papers were Frome et. al. (which leveraged word2vec to reduce extreme classification to regression with nearest-neighbor decode) and Cisse et. al. (which exploits the near-disconnected nature of the label graph often encountered in practice with large-scale multi-label problems).

The second day I mostly hung out in spectral learning but I saw Blei’s talk in topic modeling. Spectral learning had a fun discussion session. The three interesting questions were

- Why aren’t spectral techniques more widely used?
- How can spectral methods be made more broadly easily applicable, analogous to variational Bayes or MCMC for posterior inference?
- What are the consequences of model mis-specification, and how can spectral methods be made more robust to model mis-specification?

With respect to the first issue, I think what’s missing is rock solid software that can easily found, installed, and experimented with. Casual practitioners do not care about theoretical benefits of algorithms, in fact they tend to view “theoretical” as a synonym for “putative”. Progress on the second issue would be great, c.f., probabilistic programming. Given where hardware is going, the future belongs to the most declarative. The third issue is a perennial Bayesian issue, but perhaps has special structure for spectral methods that might suggest, e.g., robust optimization criterion.

**Optimization**

– Non-strongly-convex smooth stochastic approximation with convergence rate by Francis Bach and Eric Moulines. With noisy gradients (see this post) it is known that the best rate of convergence to minimize an -strongly convex function is of order while for convex functions it is of order . Unfortunately in Machine Learning applications the strong convexity parameter is often a regularization parameter that can be as small as , in which case the standard analysis using strong convexity do not yield any acceleration. In this paper Bach and Moulines show that in the case of the square loss (whose strong convexity parameter depends on the smallest non-zero eigenvalue of the covariance matrix of the covariates) one can obtain a rate of order (i.e. with no dependency on the smallest eigenvalue ). The proposed algorithm is simply Stochastic Gradient Descent with a constant step-size (and averaging for the output). A more intricate algorithm is also proposed for the logistic loss.

– On Decomposing the Proximal Map by Yaoliang Yu. Algorithms such as ISTA and FISTA (see this post) require to compute the proximal operator

Recall that this proximal operator arises naturally for the minimization of a function of the form , i.e. when one wants to minimize some function while enforcing some of the ‘properties’ of in the solution. For instance with one would like to output a sparse solution. Thus it is very natural to try to understand the relation between and . This paper consider various properties under which one has .

– Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Rie Johnson and Tong Zhang. This paper gives a beautiful new algorithm achieving the same performances as SDCA and SAG (see this post). The algorithm/analysis are much more intuitive than the one of SDCA and SAG. I will make a more detailed post on this paper later next year.

– Accelerated Mini-Batch Stochastic Dual Coordinate Ascent by Shai Shalev-Shwartz and Tong Zhang. Both SDCA and SAG have a linear dependency on the condition number . For the deterministic case Nesterov’s accelerated gradient descent attains a linear dependency on . This paper partially bridges the gap between these results and present an accelerated version of SDCA using mini batches.

– Mixed Optimization for Smooth Functions by Mehrdad Mahdavi, Lijun Zhang and Rong Jin. This paper considers a new setting which seems quite natural: what if on top of noisy first order oracle one can also access a regular first order oracle? Mehrdad will do a guest post on this problem soon, but the short answer is that with only a logarithmic number of calls to the regular oracle one can attain a rate of order for smooth optimization (while with only the noisy oracle the rate is ).

– Estimation, Optimization, and Parallelism when Data is Sparse by John Duchi, Mike Jordan and Brendan McMahan. For this paper too I am hoping to have a guest post (by John Duchi this time) that would explain the new algorithm and its properties. The idea is roughly to do a gradient descent where the step-size adapts to the sparsity of the observed gradient, allowing for much faster rates in certain situations.

**Bandits**

– Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search by Alexander Zimin and Gergely Neu. This paper shows that one can solve episodic loop-free MDPs by simply using a combinatorial semi-bandit strategy (see this paper by Audibert, myself and Lugosi where we solved the semi-bandit problem). I believe that this paper initiate a research direction that will be very fruitful in the future. Namely reducing (or rather reformulating) a complicated sequential decision making problem as a linear bandit (or semi-bandit). A similar approach is very popular in optimization where everyone knows that one should try very hard to formulate the problem of interest as a convex program. On the other hand such an approach in online learning/sequential decision making has not been recognized yet. I believe that at the moment the most promising direction is to try to formulate the problem as a linear bandit as it is both an extremely general problem but also one for which we have seemingly canonical algorithms. A related paper is Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions by Abbasi-Yadkori, Bartlett, Kanade, Seldin and Szepesvari.

– Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards by Thomas Bonald and Alexandre Proutiere. This paper considers the famous Bayesian setting of Berry, Chen, Zame, Heath and Shepp where one has a countable set of arms with Bernoulli distributions and means drawn uniformly on . This paper shows the first optimal strategy with a Bayesian regret of order . I recommend to take a look at the strategy, it is both very elegant and quite smart.

– Sequential Transfer in Multi-armed Bandit with Finite Set of Models by Mohammad Azar, Alessandro Lazaric and Emma Brunskill. This paper considers an elegant and simple model for transfer learning in multi-armed bandit problems. To put it simply the setting is the one of a Bayesian multi-armed bandit where the underlying parameter is replaced by a new fresh independent sample very steps. If the prior is known then this problem is fairly straightforward (given our current knowledge of the multi-armed bandit). The issue is what to do when the prior is unknown. The paper proposes an interesting strategy based on learning latent variable models via the method of moments (see this paper by Anandkumar, Ge, Hsu, Kakade, and Telgarsky for instance). While this gives non-trivial results I believe that much more can be said by fully embracing the sequential nature of the problem rather than trying to blend together a batch method with standard bandit strategies. I suspect that the technical difficulties to obtain an optimal strategy for this setting will be tremendous (which is quite exciting!).

– Eluder Dimension and the Sample Complexity of Optimistic Exploration by Daniel Russo and Benjamin Van Roy; Prior-free and prior-dependent regret bounds for Thompson Sampling by myself and Che-Yu Liu. These two papers ask the same question (with the latter paper being inspired by the former): we know that Thompson Sampling can be as good as UCB when initialized with a well-chosen priors, but can we show that it has a significant advantage in the cases where the prior is actually informative? The first paper addresses this question in the context of linear bandits (which is, as pointed out above, the most fundamental bandit problem), while the second paper considers a much more restricted class of problems but for which surprisingly strong results can be obtained.

**Online Learning**

– Dimension-Free Exponentiated Gradient by Francesco Orabona. This paper introduces a new regularizer for Mirror Descent and shows that it adapts automatically to the norm of the unknown comparator. In my opinion we know very few interesting regularizers for MD (especially in the full-information setting) and any non-trivial addition to this set seems difficult. This paper manages to do just that. It would be interesting to see if this gives new insights for bandit regularizers.

– Minimax Optimal Algorithms for Unconstrained Linear Optimization by Brendan McMahan and Jacob Abernethy. My advice for this paper is to look at Section 3.2 which gives one very concrete and very cool application of the method they develop.

**Other**

I’m running out of stamina so I will now just list the other papers that I found very interesting.

– Density estimation from unweighted k-nearest neighbor graphs: a roadmap by von Luxburg and Alamgir.

– Stochastic blockmodel approximation of a graphon: Theory and consistent estimation by Airoldi, Costa and Chan.

– Near-Optimal Entrywise Sampling for Data Matrices by Achlioptas, Karnin and Liberty.

– Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic by Sharpnack, Krishnamurthy and Singh.

– Estimating the Unseen: Improved Estimators for Entropy and other Properties by Valiant and Valiant.

– Information-theoretic lower bounds for distributed statistical estimation with communication constraints by Zhang, Duchi, Jordan and Wainwright.