Project page for Multi-Modal Mean-Fields (CVPR 2017).

The arXiv paper is available; the code is coming soon.

Mean Field inference is central to statistical physics. It has attracted much interest in the Computer Vision community as an efficient way to solve problems expressible in terms of large Conditional Random Fields. However, since it models the posterior probability distribution as a product of marginal probabilities, it may fail to properly account for important dependencies between variables. We therefore replace the fully factorized distribution of Mean Field by a weighted mixture of such distributions, which similarly minimizes the KL-Divergence to the true posterior. We can perform this minimization efficiently by introducing two new ideas: conditioning on groups of variables instead of single ones, and selecting such groups using a parameter of the conditional random field potentials that we identify with the temperature in the sense of statistical physics. Our extension of the clamping method proposed in previous works allows us both to produce a more descriptive approximation of the true posterior and, inspired by the diverse MAP paradigms, to fit a mixture of Mean Field approximations. We demonstrate that this positively impacts real-world algorithms that initially relied on mean fields.
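
To make the idea of clamping concrete, here is a minimal sketch on a toy binary model rather than the CRFs used in the paper. It assumes spins s_i in {-1, +1} with energy E(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i and posterior proportional to exp(-E(s)/T). The function names, the choice of clamping the whole group to a single shared value, and the weighting of each mode by exp(-F_k), where F_k is its mean-field free energy, are illustrative assumptions and not the authors' implementation.

```python
import math

def mean_field(J, h, T, clamped=None, iters=200):
    """Naive mean-field for spins in {-1, +1}; returns magnetizations m_i = E_q[s_i].
    `clamped` maps a variable index to a fixed value (+1.0 or -1.0)."""
    n = len(h)
    clamped = clamped or {}
    m = [clamped.get(i, 0.0) for i in range(n)]  # free variables start at 0
    for _ in range(iters):
        for i in range(n):
            if i in clamped:
                continue
            field = h[i] + sum(J[i][j] * m[j] for j in range(n) if j != i)
            m[i] = math.tanh(field / T)  # coordinate-wise fixed-point update
    return m

def free_energy(J, h, T, m):
    """Variational free energy F(q) = E_q[E]/T - H(q) of the factorized q."""
    n = len(h)
    energy = -sum(J[i][j] * m[i] * m[j] for i in range(n) for j in range(i + 1, n))
    energy -= sum(h[i] * m[i] for i in range(n))
    entropy = 0.0
    for mi in m:
        for p in ((1.0 + mi) / 2.0, (1.0 - mi) / 2.0):
            if p > 0.0:  # clamped variables contribute zero entropy
                entropy -= p * math.log(p)
    return energy / T - entropy

def multi_modal_mean_field(J, h, T, group):
    """Clamp `group` to +1, then to -1, run mean-field on the remaining
    variables, and weight each mode by exp(-F_k) (an assumed weighting)."""
    modes = []
    for sign in (1.0, -1.0):
        m = mean_field(J, h, T, {i: sign for i in group})
        modes.append((m, free_energy(J, h, T, m)))
    f_min = min(f for _, f in modes)  # subtract the minimum for numerical stability
    unnorm = [math.exp(-(f - f_min)) for _, f in modes]
    z = sum(unnorm)
    return [(w / z, m) for w, (m, _) in zip(unnorm, modes)]

# Example: 6 spins with uniform ferromagnetic coupling and no unary term.
n = 6
J = [[0.0 if i == j else 0.5 for j in range(n)] for i in range(n)]
h = [0.0] * n
for weight, m in multi_modal_mean_field(J, h, T=1.0, group=[0]):
    print(round(weight, 3), [round(x, 3) for x in m])
```

In this toy model the role of the temperature is visible directly: at low T the two clamped runs settle into clearly distinct modes, while at high T the free variables stay close to zero in both runs and a single factorized distribution already describes the posterior well.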

Additional proofs related to Section 5 and pseudo-code.

We illustrate on a single-frame example how our multi-modal algorithm of Section 7.2 resolves ambiguities.

We explain the details of our tracking algorithm of Section 7.2, which is inspired by the K-Shortest Paths (KSP) algorithm.

We provide a short video highlighting the performance of our tracking algorithm compared to the POM+KSP baseline. We use 4 modes per frame and our new version of KSP.

People Tracking video.

We provide the derivation of the critical temperature for the Dense Gaussian CRF, which is used in Section 6, together with experimental validation.

A short video illustrating the advantages of the multi-modal segmentation of Section 7.3 is available. We use pairwise potentials from a Dense Gaussian CRF and unaries computed by a CNN. We generate 16 modes per frame, from which the best one is chosen by temporal consistency.
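
As a rough illustration of choosing among modes by temporal consistency, the sketch below picks the candidate segmentation that agrees most with the labeling retained for the previous frame. The consistency measure is not specified here, so per-pixel label agreement is purely an assumption, and `pick_mode` is a hypothetical helper name.

```python
def pick_mode(modes, previous):
    """Return the index of the candidate label map closest to `previous`.
    `modes` is a list of flat label lists, one per mode; `previous` is the
    flat label list kept for the previous frame (same length)."""
    def agreement(a, b):
        # Fraction of pixels carrying the same label in both maps.
        return sum(x == y for x, y in zip(a, b)) / len(a)
    return max(range(len(modes)), key=lambda k: agreement(modes[k], previous))

# Example with three candidate modes over four pixels.
candidates = [[0, 0, 1, 1], [1, 1, 1, 1], [0, 1, 1, 1]]
best = pick_mode(candidates, previous=[0, 1, 1, 0])
```

Other consistency scores, such as intersection-over-union per class, would slot into `agreement` without changing the selection logic.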

Segmentation video.

Finally, we present additional sample results for MMMF on the PascalVOC dataset.