<h3 id="optimization-on-a-manifold">Optimization On a Manifold</h3>
<p><em>Syed Ashar Javed, 2020-01-29</em></p>
<p>In machine learning and robotics, data and model parameters often lie on spaces which are non-Euclidean. This means that these spaces don’t follow flat Euclidean geometry, and our models and algorithms need to account for this. To clarify with a well-known example, suppose our optimization algorithm gave us an update to apply in a parameter space which we wrongly assumed was Euclidean, but which was actually ellipsoidal. This is the case shown below when minimizing the distance between two points on the earth’s surface. If we optimized in Euclidean space, we would end up with the flat, straight line shown in black, whereas the shortest path between the two points (the geodesic) is in reality the red line, due to the ellipsoidal geometry of the earth (that’s why flight routes appear curved on 2D maps). This example shows why it’s important to correctly model the geometry of the parameter space during optimization. This post will introduce the concept of a manifold, motivate the need to optimize over manifolds in machine learning and finally go into some detail about doing this optimization for matrix Lie groups like rotation matrices and poses, which frequently come up in robotics and computer vision.</p>
<p><br /><br />
<img src="/images/geodesic.png" alt="Example of non-Euclidean Geometry" />
<br /><br /></p>
<h4 id="what-is-a-manifold">What is a manifold</h4>
<p>Intuitively, a manifold is a topological space that locally looks like a <a href="https://en.wikipedia.org/wiki/Euclidean_space">Euclidean space</a>. For example, the earth’s surface is spherical but looks planar locally. Stated more formally, each point on an n-dimensional manifold has a local neighbourhood that is <a href="https://en.wikipedia.org/wiki/Homeomorphism">homeomorphic</a> (related by a one-to-one mapping with a continuous inverse) to the Euclidean space of n dimensions. An even more formal definition based on Hausdorff spaces can be found in this <a href="http://bjlkeng.github.io/posts/manifolds/">blog post</a>, which does a really good job of introducing these concepts. For those who understand better from video lectures, I found short but very lucid explanations on this <a href="https://www.youtube.com/playlist?list=PLeFwDGOexoe8cjplxwQFMvGLSxbOTUyLv">YouTube channel</a>. For this blog post, the above intuitive definition of a space that can be <a href="https://en.wikipedia.org/wiki/Homeomorphism#/media/File:Mug_and_Torus_morph.gif">deformed locally</a> to be Euclidean should be sufficient.</p>
<h4 id="why-care-about-manifolds-in-machine-learning">Why care about manifolds in machine learning</h4>
<p>As mentioned earlier, we often make Euclidean assumptions about our data or models which might not be correct. For example, representing a document as a vector in Euclidean space might be problematic, as algebraic operations like addition or multiplication by a scalar on these data points might not have any meaning in the data space. Another example in computer vision [<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>] is representing images using a low-dimensional subspace, explicitly assuming that these points lie on a <a href="https://en.wikipedia.org/wiki/Grassmannian">Grassmann Manifold</a>. Many other manifold learning methods like LLE (Locally Linear Embedding) [<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>] and Isomap [<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>] try to do the same without explicitly choosing a manifold. This is a direct motivation for the extensively used <a href="https://heavytailed.wordpress.com/2012/11/03/manifold-hypothesis-part-1-compression-learning-and-the-strong-mh/">manifold hypothesis</a> in machine learning. Such compact, better-fitting representations can then be used for various learning-based estimation tasks and better feature learning. There are tons of other examples where manifolds are useful for optimization with different data types, ranging from trajectories for path planning in robotics, to graphs for gene expression in genetics, to 3D shapes and surfaces in computer vision, all of which involve a non-Euclidean manifold.</p>
<h4 id="optimization-on-lie-groups">Optimization on Lie Groups</h4>
<p>Modeling rotations is a very common task in computer vision, graphics and robotics. Images and their parts can be described via transformations involving rotation, the motion of complicated surfaces and rigid bodies can be described using rotations, and robots need to know the position and orientation of the sensors mounted on them, which again involves rotations. There are various ways to <a href="https://en.wikipedia.org/wiki/Rotation_formalisms_in_three_dimensions">parameterize rotations</a> when optimizing over them. A common use-case is using a <script type="math/tex">3 \times 3</script> rotation matrix and running optimization algorithms to recover the true rotation values. However, when optimizing these parameters, a gradient-based update is often applied to the initial value of the rotation, and these updates can modify the matrix such that it isn’t a valid rotation matrix anymore. Thus, there is a need to apply these updates on a manifold in a way which preserves the property of the model parameters, i.e., the parameters continue to lie on the rotation manifold. This manifold is in fact a smooth, differentiable manifold, also called the Special Orthogonal Group or <script type="math/tex">SO(n)</script> group. For 2D and 3D rotations, this is referred to as the <script type="math/tex">SO(2)</script> and <script type="math/tex">SO(3)</script> group respectively. These <script type="math/tex">SO(n)</script> groups are part of the much broader class of <a href="https://en.wikipedia.org/wiki/Lie_group">Lie Groups</a>, which intuitively are groups allowing smooth and continuous operations on their elements (so that we can use differential calculus with them). Another example is the <script type="math/tex">SE(n)</script> group, or Special Euclidean Group, which consists of rotations and translations applied simultaneously and is heavily used for rigid transformations in robotics (a robot pose, i.e., position and orientation, is specified using this group). 
Essentially, using Lie Groups like <script type="math/tex">SO(n)</script> or <script type="math/tex">SE(n)</script> allows us to optimize over rotations/poses in a space where they continue to lie on the rotation/pose manifold after the update is applied.</p>
<h5 id="optimizing-rotations-using-exponential-maps-the-theory">Optimizing rotations using exponential maps: The theory</h5>
<p>This subsection will simply outline the use of exponential maps for adding rotations without going into any proofs or justifications. The next subsection talks about the intuition for our choices.</p>
<p>Let <script type="math/tex">R \in \mathbb{R}^{3 \times 3}</script> be the initial estimate of the rotation matrix. If we add an arbitrary update matrix to it, the resulting matrix will almost never lie on the <script type="math/tex">SO(3)</script> manifold (one straightforward way to see this is that adding arbitrary elements destroys the orthogonality of the newly formed matrix).</p>
<p>Since rotations have 3 degrees of freedom, let <script type="math/tex">\omega \in \mathbb{R}^3</script> be the update which needs to be added to the initial rotation. This vector space is also called the tangent space and is loosely the same as something called the Lie algebra, or <script type="math/tex">so(3)</script>, of the group (a bijective mapping exists between the Lie algebra and the tangent space). In order to optimize on the rotation manifold, we need a mapping which can take us from the tangent space (a vector of 3 elements) to a valid rotation lying on the <script type="math/tex">SO(3)</script> manifold. This mapping from the tangent space (or, in turn, the Lie algebra) to the Lie Group (or the manifold) is called the exponential map. Let <em>g</em> be the Lie algebra and <em>G</em> be the Lie Group. Then the exponential map defines a mapping <script type="math/tex">g \rightarrow G</script>.</p>
<p>Since we need to apply an update on top of an initial rotation estimate while staying on its manifold, the exponential map should be able to characterize the local neighbourhood on the manifold. More specifically, we need derivatives describing how the <script type="math/tex">SO(3)</script> manifold changes around a specific point in order to guide how an arbitrary 3-vector is mapped to a valid rotation on <script type="math/tex">SO(3)</script>. This is done using the tangent space of <script type="math/tex">SO(3)</script> at the identity, which is the reason why it’s called the tangent space in the first place. The bijective mapping from <script type="math/tex">\mathbb{R}^3</script> to <script type="math/tex">so(3)</script> is the skew-symmetric matrix operator, and the exponential map function is the matrix exponential. We will come back to see why these choices make sense. For now, the figure below visually shows the mappings <script type="math/tex">\mathbb{R}^3 \rightarrow so(3) \rightarrow SO(3)</script> (image taken from the amazing course <a href="http://www.cs.cmu.edu/~hebert/geom.html">Geometry-based Methods in Vision</a> at CMU):</p>
<p><br /><br />
<img src="/images/exp_map_rot3.png" alt="Exponential map for 3D rotations" />
<br /><br /></p>
<p>Therefore, the exponential map takes a vector from the Euclidean space and maps it to a valid rotation on the manifold.</p>
<!-- The bijective mapping $$\omega \in \mathbb{R}^{3} \rightarrow \hat{\omega} \in g$$. For the general case of matrix Lie Groups of n-dimensions, the hat operator takes the following form:
<center>
$$
\begin{align}
\hat{\omega} = \sum_i^n{\omega G^i}
\end{align}
$$
</center>
where $$G \in R^{n \times n}$$ are called generators and have a specific geometric interpretation depending on the group G. The generators are sets of vectors which form a basis for the tangents and any rotations can be described as linear combinations of these generators. / -->
<p>Mathematically, for adding the update to our initial rotation, we are finding an incremental rotation <script type="math/tex">R_{inc}</script> around the identity matrix and composing our initial rotation with it to get the updated rotation <script type="math/tex">R'</script>.</p>
<center>
$$
\begin{align}
R + \omega \triangleq R' = R \cdot R_{inc}(\omega)\\
\end{align}
$$
</center>
<p>Here, the incremental rotation is the exponential map function shown in the figure above.</p>
<center>
$$
\begin{align}
R' = R \, e^{[\omega]_{\times}}
\end{align}
$$
</center>
<p>where <script type="math/tex">[\omega]_{\times}</script> is the skew-symmetric matrix associated with the vector <script type="math/tex">\omega = (\omega_1, \omega_2, \omega_3)^T</script>, defined as:</p>
<center>
$$
\begin{align}
[\omega]_{\times} = \begin{bmatrix} 0 & -\omega_3 & \omega_2\\ \omega_3 & 0 & -\omega_1\\ -\omega_2 & \omega_1 & 0\end{bmatrix}\\
\end{align}
$$
</center>
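<p>To make this concrete, here is a minimal sketch in NumPy (the helper names are mine, not from any particular library, and the matrix exponential is computed with a truncated power series; in practice you would use a library routine or the closed form discussed below):</p>

```python
import numpy as np

def skew(w):
    """R^3 -> so(3): the skew-symmetric matrix [w]_x."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def mat_exp(A, terms=30):
    """Matrix exponential via the truncated power series sum_k A^k / k!."""
    out, term = np.eye(3), np.eye(3)
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

def apply_update(R, w):
    """R' = R exp([w]_x): compose R with the exponential map of the tangent-space update w."""
    return R @ mat_exp(skew(w))

R = np.eye(3)                     # initial rotation estimate
w = np.array([0.1, -0.2, 0.05])   # update from the optimizer, living in the tangent space
R_new = apply_update(R, w)

# The updated matrix stays on SO(3): orthogonal, with determinant +1.
assert np.allclose(R_new @ R_new.T, np.eye(3), atol=1e-9)
assert np.isclose(np.linalg.det(R_new), 1.0)
```

Note that naively computing <script type="math/tex">R + \text{something}</script> would break orthogonality, while composing with the exponential map never leaves the manifold.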
<h5 id="intuition-behind-exponential-maps">Intuition behind exponential maps</h5>
<p>Why does exponentiating a skew-symmetric version of the 3-vector give us a valid rotation on the <script type="math/tex">SO(3)</script> manifold? This is actually a two part question. Firstly, why does <script type="math/tex">\mathbb{R}^3 \rightarrow so(3)</script> mapping involve creating a skew-symmetric matrix from the vectors and secondly, why does the mapping <script type="math/tex">so(3) \rightarrow SO(3)</script> involve taking matrix exponentials. The answers to these arise from the very definitions of <script type="math/tex">so(3)</script> and <script type="math/tex">SO(3)</script> spaces.</p>
<p>Written in cross-product form, the skew-symmetric matrix <script type="math/tex">[\omega]_{\times}</script> acts on any vector <script type="math/tex">a</script> as a cross product with <script type="math/tex">\omega</script>:</p>
<center>
$$
\begin{align}
[\omega]_{\times}a = \omega \times a \quad \forall a \in \mathbb{R}^3
\end{align}
$$
</center>
<p>Now, since <script type="math/tex">so(3)</script> is the Lie algebra, which is defined as the tangent space at the identity, the cross product operation, and in turn the set of all skew-symmetric matrices, forms the <script type="math/tex">so(3)</script> space. A proof with equations is given <a href="https://math.stackexchange.com/questions/903861/lie-algebra-for-so3-as-a-skew-symmetric-matrix">here</a>.</p>
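<p>This cross-product identity is easy to check numerically (a small sanity-check sketch with a hand-rolled <code>skew</code> helper, using the standard component-wise definition):</p>

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix [w]_x such that [w]_x a = w x a."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

w = np.array([1.0, 2.0, 3.0])
a = np.array([-0.5, 0.4, 2.0])

# The matrix-vector product equals the cross product for any a.
assert np.allclose(skew(w) @ a, np.cross(w, a))
# Skew-symmetry: [w]_x^T = -[w]_x.
assert np.allclose(skew(w).T, -skew(w))
```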
<p>Coming to the second mapping, we can see why exponentiation leads to a valid <script type="math/tex">SO(3)</script> point by analyzing the limit definition of the matrix exponential.</p>
<center>
$$
\begin{align}
e^A = \lim_{n\to\infty} (I + \frac{1}{n} A)^n
\end{align}
$$
</center>
<p>As the identity matrix is an element of the <script type="math/tex">SO(3)</script> group, the term <script type="math/tex">I + \frac{1}{n} A</script> approaches an element of <script type="math/tex">SO(3)</script> as <script type="math/tex">n</script> tends to infinity. Additionally, raising this term to the power of <script type="math/tex">n</script> keeps it within <script type="math/tex">SO(3)</script>, as the group is closed under multiplication. Another intuitive way to see this is that the first two terms of the Taylor expansion are simply <script type="math/tex">I + A</script>, which is what we originally (and naively) wanted to do with simple addition of rotations, but which would’ve deviated the final result from the rotation manifold. However, each additional higher-power term in the expansion pulls the point towards the <script type="math/tex">SO(3)</script> manifold. This can be seen in the figure below, taken from Tom Drummond’s <a href="https://www.dropbox.com/s/5y3tvypzps59s29/3DGeometry.pdf?dl=0">notes</a>.</p>
<p><br /><br />
<img src="/images/exp_expansion.png" alt="Exponential map for 3D rotations" />
<br /><br /></p>
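<p>A quick numerical check of this limit (a sketch in NumPy only; <code>mat_exp</code> is a truncated power series standing in for the true matrix exponential, and the helper names are mine):</p>

```python
import numpy as np

def skew(w):
    """R^3 -> so(3): skew-symmetric matrix of w."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def mat_exp(A, terms=30):
    """Truncated power series e^A = sum_k A^k / k!."""
    out, term = np.eye(3), np.eye(3)
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

A = skew(np.array([0.2, -0.4, 0.3]))
errs = {}
for n in [1, 10, 1000]:
    # (I + A/n)^n should converge to e^A as n grows.
    approx = np.linalg.matrix_power(np.eye(3) + A / n, n)
    errs[n] = np.linalg.norm(approx - mat_exp(A))
    print(f"n={n:5d}  ||(I + A/n)^n - e^A|| = {errs[n]:.2e}")
```

The error shrinks as <script type="math/tex">n</script> grows, matching the intuition that each refinement pulls the product back towards the manifold.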
<p>A more mathematical interpretation of the exponential map comes from the fact that the matrix exponential is the solution to the differential equation <script type="math/tex">\frac{dR}{dt} = AR</script>, which gives <script type="math/tex">R(t) = e^{tA}</script>. This solution creates a relation between the derivatives of rotations and the rotations themselves, or equivalently, a relation between <script type="math/tex">so(3)</script> and <script type="math/tex">SO(3)</script>.</p>
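<p>To connect the two views (a short standard derivation, not from the original post): differentiating the orthogonality constraint <script type="math/tex">R(t)R(t)^T = I</script> with respect to <script type="math/tex">t</script> shows why the tangent space at the identity consists exactly of skew-symmetric matrices.</p>
<center>
$$
\begin{align}
\dot{R}R^T + R\dot{R}^T = 0 \quad \Rightarrow \quad A + A^T = 0 \quad \text{at } R(0) = I, \; A = \dot{R}(0)
\end{align}
$$
</center>
<p>So any derivative of a rotation trajectory at the identity is skew-symmetric, which is precisely the <script type="math/tex">so(3)</script> condition used above.</p>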
<p>It is important to note that the exponential map is an exact mapping even for arbitrarily large vectors <script type="math/tex">\omega</script> and not just an approximation. Also, there is a closed-form expression for the matrix exponential on <script type="math/tex">SO(3)</script>, called <a href="http://mathworld.wolfram.com/RodriguesRotationFormula.html">Rodrigues’ Formula</a>, which makes computing these updates to the rotations pretty convenient.</p>
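<p>Rodrigues’ formula gives the exponential of a skew-symmetric matrix in closed form: with <script type="math/tex">\theta = \|\omega\|</script>, <script type="math/tex">e^{[\omega]_{\times}} = I + \frac{\sin\theta}{\theta}[\omega]_{\times} + \frac{1-\cos\theta}{\theta^2}[\omega]_{\times}^2</script>. A minimal NumPy sketch (helper names are mine), checked against a truncated power series of the matrix exponential:</p>

```python
import numpy as np

def skew(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rodrigues(w, eps=1e-8):
    """Closed-form exp([w]_x) on SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    W = skew(w)
    if theta < eps:
        # Small-angle regime: first-order approximation avoids 0/0.
        return np.eye(3) + W
    return (np.eye(3)
            + (np.sin(theta) / theta) * W
            + ((1.0 - np.cos(theta)) / theta**2) * (W @ W))

def mat_exp(A, terms=30):
    """Reference: truncated power series sum_k A^k / k!."""
    out, term = np.eye(3), np.eye(3)
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

w = np.array([0.3, -1.2, 0.7])
R = rodrigues(w)
assert np.allclose(R, mat_exp(skew(w)), atol=1e-9)   # matches the series
assert np.allclose(R @ R.T, np.eye(3), atol=1e-9)    # a valid rotation
```

The closed form costs a handful of trigonometric calls instead of a matrix power series, which is why libraries use it for <script type="math/tex">SO(3)</script>.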
<h5 id="optimization-over-other-general-manifolds">Optimization over other general manifolds</h5>
<p>In addition to <script type="math/tex">SO(3)</script>, we can use similar machinery for elements of the General Linear Groups, <script type="math/tex">GL(n)</script>, which are essentially the sets of all <script type="math/tex">n \times n</script> invertible matrices. All of them entail the use of exponential coordinates, but use different forms of the Lie algebra depending on the group structure. In fact, even arbitrary manifolds which do not have a group structure can be optimized over through something called retractions, which also map local coordinates onto the manifold, much like the incremental rotations mentioned above. The book by Absil et al [<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>] goes into a lot of detail about optimization methods for matrix manifolds and is a great resource for any of the topics mentioned in this post.</p>
<h4 id="references">References</h4>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.7010&rep=rep1&type=pdf">Statistical analysis on Stiefel and Grassmann Manifolds with applications in Computer Vision</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://cs.nyu.edu/~roweis/lle/papers/lleintro.pdf">Locally Linear Embedding</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p><a href="http://web.mit.edu/cocosci/Papers/sci_reprint.pdf">Isomap</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="http://www.eeci-institute.eu/GSC2011/Photos-EECI/EECI-GSC-2011-M5/book_AMS.pdf">Optimization Algorithms on Matrix Manifolds</a> <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<h3 id="joining-a-startup-vs-a-big-company">Joining a Startup vs a Big Company</h3>
<p><em>2019-11-16</em></p>
<p>The ever-alive dilemma of whether to join a startup or a big company has been a very active topic of discussion in my student circle over the last few months (graduation time!). In this post, I write about my personal views on when it can make sense to choose a startup over a big tech company and the various tradeoffs associated with the decision.</p>
<h4 id="the-importance-of-defining-priorities-and-risk-appetite">The Importance of defining priorities and risk appetite</h4>
<p>Let me say this upfront. Like most other things in life, there’s no single golden rule that can make this decision easy. But before jumping into the chaos of all the different factors which can affect this career decision, I find it helpful to reflect and ask myself what I am fundamentally trying to optimize for when making it. For the startup vs big company decision, this can be an ordered list of several things: maximizing learning opportunity, minimizing regret, maximizing money, maximizing work-life balance, maximizing the possibility of exploration, maximizing flexibility, minimizing effort and so on. All of these appear in different proportions in different kinds of jobs, and the amount of variability you accept in each forms the perimeter of risk you’re willing to take across each aspect. To take a few specific examples from real life, a lot of friends from my machine learning/computer vision circle decide to join or not join a specific company for reasons like “I want to stay in the self-driving industry as it’s hot right now”, “I can afford to work for less money on more interesting problems at this stage of my life”, “I want to stay in the Bay Area as it’s easier to switch jobs”, “I want to work on research” or the good old “I want to earn a shit ton of money and xyz company is paying a lot!”. All of these are examples where people were optimizing for a set of things they wanted to prioritize in life. This is a very good thing. Being clear about what you want from life might not give you what you want. But it can often give direction to your decisions and limit blind fluctuations in the end result.</p>
<h4 id="disclaimer">Disclaimer</h4>
<p>Before giving my list on how one might evaluate startup vs big-company job opportunities, it’s critical to note a few things. First, startups come in different flavors. Not all of them have a hippie hacker culture, not all of them pay badly, not all of them are doing disruptive work, not all are cradles of rapid learning experience and most of them will fail at whatever they’re trying to do. They come in different sizes, very different cultures, different future prospects and different opportunities. It’s tough, but necessary to evaluate them individually. The points below only apply to startups where these characteristics exist. Also, I recommend you read <a href="https://danluu.com/startup-tradeoffs/">this blog post</a> by Dan Luu for an interesting point of view.</p>
<h4 id="choosing-between-a-startup-and-a-big-company">Choosing between a startup and a big company</h4>
<h5 id="impact">Impact</h5>
<p>There are countless stories on Reddit and Blind about software engineers in <a href="https://en.wikipedia.org/wiki/Facebook,_Apple,_Amazon,_Netflix_and_Google">FAANG</a> questioning their work-life satisfaction after years in big tech. Sometimes, writing a load balancer for some internal platform or making APIs for a nondescript feature in an app, though a useful component for the company, is of little personal value to the individuals working on that team. In a big company with high hiring volume, you’re always functionally dispensable, and the impact you make on something in the real world can sometimes be very low and, more importantly, is often opaque, even after accounting for the fact that larger companies serve millions of users. Amazon sometimes has multiple teams competing for the same product, and after working on a project for months, a lot of your work might not even see the light of day. However, depending on the size and nature of a startup, you can feel the tangible effect of your work. My friends interning at bigger self-driving companies like Uber ATG had a harder time knowing whether their work would be deployed and of use to anyone than the ones interning at smaller companies like Aurora and Nuro, where they saw their work being adopted and tested while they were still interning. I myself had a similar experience at PathAI, even though I worked on research projects which are even less likely to get deployed. All good startups aim to add value to society, and it’s a great feeling to go back home knowing you effected real change that was visible to the company and perhaps even to society. Of course, staying clear of startups that don’t add value is critical here (smart <a href="https://gizmodo.com/these-flip-flops-are-smart-for-the-dumbest-possible-rea-1793730937">flip-flops</a> and smart <a href="https://www.cnet.com/news/juicero-is-still-the-greatest-example-of-silicon-valley-stupidity/">juicers</a>! Give me a break).</p>
<h5 id="learning-opportunities--career-development">Learning opportunities & career development</h5>
<p>There are some things smaller startups can be great at. They allow you to have a more holistic view of the running of the whole company. They also allow you to build from the ground up and take more initiative on high-impact projects at a younger age than a large company would. Anyone with future entrepreneurial ambitions can extract a lot of learning from such startup experiences. Another thing which lends itself to faster career development is the lack of regimented structure and bureaucracy in smaller companies. It’s possible for people to grow career-wise in larger companies by focusing on metrics which might not correlate directly with what is beneficial for the company or the learning of individuals and teams. In smaller companies, however, this alignment is much easier to evaluate. Another area in which I’ve read contrasting viewpoints is startup vs big-company jobs for recent graduates. On one hand, larger companies have streamlined pipelines, mature tools, best coding practices and a stronger code-review culture, and hence help recent graduates build better fundamentals. Startups are often fighting for survival or growth, and hacking your way to a solution while ignoring best practices makes partial sense for them. On the other hand, startups can often rapidly shift to new tech stacks and adopt new technologies, being less tied to the massive pipelines and structures which help larger companies scale their processes but at the same time make them slow and hard to change. A friend recently pointed out that his team at Microsoft took half a year to migrate a small tool to prevent deprecation of certain services. Another friend told me how a team at Amazon had to wait two months before trying out a new set of machine learning algorithms in their workflow because the pipeline for it was still being built so that the company could scale it easily once done. Things like these drastically slow down your own learning.</p>
<h5 id="money">Money</h5>
<p>Startups have it hard when competing with big tech on salaries, especially as you go towards more senior positions. This is why they complement the lower salaries with equity or stock options. However, it’s worthwhile to evaluate the chances of the startup you’re considering making it big and, more importantly, your own chances of offsetting the lower pay with a higher, but riskier, payout in the future. There are so many factors here that it’s easier for me to point to <a href="https://danluu.com/startup-options/">this</a> and <a href="http://yosefk.com/blog/stock-options-a-balanced-approach.html">this</a> blog post for two slightly different points of view on whether the stock in a startup is worth the risk of accepting less money. Also see this <a href="https://www.reddit.com/r/startups/comments/4xec0h/a_crash_course_on_startup_job_offers_and_how_to/">Reddit post</a> on evaluating startup offers. In general, from the statistics I’ve seen thrown around on the internet, accounting for the risk of a future liquidity event and adjusting for the possibility of extremely high payouts for early startup employees, most startups are certainly less attractive options if money is a high-priority item on your list. But if you have faith in your ability to judge the health of a startup, the problem it is solving and its value to the market, and finally your own circumstance-dependent evaluation of your risk appetite, it might be reasonable to choose a startup for money.</p>
<h4 id="conclusion">Conclusion</h4>
<p>I have pointed out some reasons why a startup is more likely to work out better for you than a larger company. But there are so many individual data points across the spectrum that this analysis needs to be done on a case-by-case basis. Many large companies have small teams which offer many of the same benefits as startups without the risky downsides. Most big tech companies today have specialized teams working on cutting-edge areas like machine learning, NLP or blockchain, and the high impact of the projects and the rate of learning can be as good as at a startup without the need to compromise on money. However, most non-senior roles in big companies will rarely offer exposure to self-initiated, high-impact projects. People themselves have different strengths and not everyone can be successful in a startup environment. All this taken together means that the final decision of how to choose a place to work depends heavily on the specific options you have and your own interests and abilities.</p>
<h3 id="dual-process-theory-and-machine-learning">Dual Process Theory and Machine Learning</h3>
<p><em>2019-01-20</em></p>
<p>The dual process theory, defined very generically, states that our cognitive processes are of two types: Type 1, which is unconscious and rapid, and Type 2, which is controlled and slow (also called System 1 & 2). The origins of these ideas go as far back as the 1970s and 1980s, but most non-experts like me know of them through the popular work and subsequent book (<em>Thinking, Fast and Slow</em>) by Daniel Kahneman. He often terms the two systems intuition and reasoning. See <a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow#Two_systems">these examples</a> for the two types of thoughts. 
To clarify, this theory isn’t without faults, and there’s a lot of newer work on more precisely defining the nature of these types or even proposing theories with more than two types [<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>]. The link to a relatively recent paper about these discussions by Evans et al [<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>] is given below. However, in this post, I would like to very briefly explore which properties of the theory have been or can be incorporated into machine learning. In contrast to previous posts, this will be a less rigorous and short article simply discussing some raw and speculative ideas. <strong>It’s possible that many questions I pose here have already been answered in cognitive science, neuroscience or psychology research, or that I might be going off in the wrong direction</strong>, in which case let me know in the comments. I also found a NIPS 2017 paper by Anthony et al from David Barber’s lab at UCL called “Thinking Fast and Slow with Deep Learning and Tree Search” [<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>] which explores the interplay between intuition and reasoning in RL systems through a new algorithm called Expert Iteration (ExIt). See <a href="https://davidbarber.github.io/blog/2017/11/07/Learning-From-Scratch-by-Thinking-Fast-and-Slow-with-Deep-Learning-and-Tree-Search/">this blog</a> if you’re interested in a more detailed explanation.</p>
<h4 id="characterizing-dual-process-theory-dpt">Characterizing Dual Process Theory (DPT)</h4>
<p>In order to use DPT in ML, it should be helpful to enumerate some key properties of each type of thinking. The <a href="https://en.wikipedia.org/wiki/Dual_process_theory#Systems">Wikipedia page</a> for DPT gives a list of items taken from Kahneman’s book. I have cherry-picked a few which I think have been or should be integrated with ML models.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Type 1</strong></th>
<th style="text-align: center"><strong>Type 2</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Unconscious reasoning</td>
<td style="text-align: center">Conscious reasoning</td>
</tr>
<tr>
<td style="text-align: center">Mostly linked to emotions (‘gut feeling’)</td>
<td style="text-align: center">Mostly detached from emotions</td>
</tr>
<tr>
<td style="text-align: center">Low Effort</td>
<td style="text-align: center">High Effort</td>
</tr>
<tr>
<td style="text-align: center">Large Capacity</td>
<td style="text-align: center">Small Capacity</td>
</tr>
<tr>
<td style="text-align: center">Rapid</td>
<td style="text-align: center">Slow</td>
</tr>
<tr>
<td style="text-align: center">Default process</td>
<td style="text-align: center">Inhibitory</td>
</tr>
<tr>
<td style="text-align: center">Nonverbal</td>
<td style="text-align: center">Linked to language or images in most people</td>
</tr>
<tr>
<td style="text-align: center">Contextualized</td>
<td style="text-align: center">Abstract</td>
</tr>
</tbody>
</table>
<h4 id="connecting-dpt-to-machine-learning">Connecting DPT to machine learning</h4>
<p>The difference between intuition and conscious reasoning can be interpreted in ML in various ways. A lot of people like to think of intuition as priors (perhaps built into the structure of the model) or as information we can access from a memory. But do such priors or embeddings retrieved from memory also allow for the other properties of Type 1 thinking mentioned above? And how do they compare with their Type 2 counterparts? Are there any other ways to incorporate the notion of intuition? Note that a completely separate and perhaps even more important question is how to learn intuition from experience and data, and what kinds of models we need to enable this learning (evolution had millions of years to fine-tune our biology to make our brain’s structure fit for learning from our world’s data).</p>
<h5 id="voluntary-vs-involuntary-thought">Voluntary vs involuntary thought</h5>
<p>It’s not clear to me that there is a single way in which two ML systems can differ in terms of being voluntary. Maybe involuntary and rapid action can simply be achieved by a model which, given a stimulus, accesses a memory indexed in a certain way (associative rules or context-specific rules). Maybe it need not be an explicit memory access, but an implicit distributed model which produces the rapid unconscious thought (the wiki article lists neural networks under Type 1 systems). In the last few years, the use of modular and even symbolic ML architectures for reasoning has become very popular. These models have decomposable modules which often deal with specialized inputs and grounded symbols and do explicit reasoning (like a ‘locate’ module only responsible for localizing a primitive in visual tasks). The difference between distributed processing in a general neural network and such models, where reasoning is explicit and information flow is constrained by the nature of the modules, can perhaps be the difference between voluntary and involuntary thought.</p>
<h5 id="role-of-intuitive-psychology">Role of intuitive psychology</h5>
<p>I wrote earlier about the ideas of Lake et al (2016) which highlight how our psycho-social experiences shape our internal models. The ‘gut feeling’ we experience can be treated as our priors and incorporated in ML models as mentioned earlier but it’s tricky to define what kind of representations need to be used to enable such ‘gut feeling’ to be modelled.</p>
<h5 id="difference-in-effort-and-capacity">Difference in effort and capacity</h5>
<p>The two types of thinking also differ in the amount of computation and cognitive load involved. This suggests that to replicate a Type 1 system in ML, it should involve less computation and place a smaller load on working memory. The concept of effort can mean multiple things, but it can be defined more precisely than the previous properties. Similarly, capacity differences between the two types of thinking are also easier to define and implement. However, an interesting question to ask here is how exactly a Type 1 system which involves low computation and is rapid results in the general intelligence and high capacity that we have as humans.</p>
<h5 id="type-1-as-the-default-process">Type 1 as the default process</h5>
<p>An important feature of the original DPT is that Type 1 thinking is the default due to its cognitive ease (also see <a href="https://en.wikipedia.org/wiki/Default_mode_network">default mode networks</a>). For Type 2 thinking, we make a conscious effort to go beyond instinctive thoughts and reason about the task in a more involved way. But under which circumstances do we initiate the Type 2 thinking process instead of relying on our default? We know this will be a function of the task, the context in which we are performing it (eg our intrinsic or extrinsic motivations or the fact that we might have already tried our instinctive action which hasn’t worked) and our past experiences relating to such tasks. I’m not sure how we can link this with the current ML models in an elegant way, but a naive solution can be to have an implicit or explicit meta-controller which is responsible for making this decision of switching from a default model to a dedicated model for reasoning.</p>
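<p>To make this meta-controller idea concrete, here is a purely hypothetical sketch (all names and the caching scheme are my own invention, not a proposal from the DPT literature): reuse a cached, confident ‘Type 1’ response when one exists, otherwise invoke a slow ‘Type 2’ routine, and cache its result so the behaviour becomes automatic next time.</p>

```python
def meta_controller(stimulus, cache, deliberate, confidence_threshold=0.9):
    # Fast path: a confident cached response plays the role of Type 1 thinking.
    if stimulus in cache and cache[stimulus][1] >= confidence_threshold:
        return cache[stimulus][0], "type1"
    # Slow path: invoke the costly Type 2 reasoning routine.
    answer = deliberate(stimulus)
    # With practice, the Type 2 result becomes an automatic Type 1 response.
    cache[stimulus] = (answer, 1.0)
    return answer, "type2"

cache = {"2+2": (4, 0.99)}
deliberate = lambda s: eval(s)  # stand-in for an expensive reasoning model
r1 = meta_controller("2+2", cache, deliberate)    # familiar: fast path
r2 = meta_controller("17*23", cache, deliberate)  # novel: slow path
r3 = meta_controller("17*23", cache, deliberate)  # now automatic
```

The caching step also illustrates the later point about tasks migrating from Type 2 to Type 1 with experience.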
<h5 id="type-2-is-verbal-and-visuospatial">Type 2 is verbal and visuospatial</h5>
<p>This difference might be a consequence of the fact that most of the reasoning we do is done through visual or verbal methods. But it also means that we can constrain the structure of reasoning-based models using our visual world or using the structure of language. Many reasoning-based methods in ML are defined in a particular way to exploit the regularities of the visual world or to constrain learning with the help of the syntactic or semantic structures imposed by language. The fact that this is missing from Type 1 processes highlights the difficulty in defining concepts like intuition, as they lead to representations which mostly aren’t verbal or visual.</p>
<h5 id="the-mapping-between-tasks-and-type-of-thinking-isnt-fixed">The mapping between tasks and type of thinking isn’t fixed</h5>
<p>Another important point I want to mention is that it isn’t necessary that a slow Type 2 thought process requiring controlled reasoning and effort will always remain so. We know from experience that many Type 2 processes become Type 1 through practice and acclimatization. Placing your fingers a certain way when learning to play the guitar is obviously a Type 2 activity. But once learnt, we see experienced guitarists fluidly moving their fingers around, and the task requires as little cognitive load as a Type 1 task. RL systems do learn subsequent tasks at a faster rate, but I’m not sure if a switch occurs from high-effort, slow processing to low-effort, rapid processing once they have learnt how to perform a task.</p>
<h4 id="conclusion">Conclusion</h4>
<p>We haven’t yet explored ML systems where these two systems co-exist, and it would be very interesting to see what advantages this can offer. It might be the case that we don’t actually have distinct systems within our brain to which intuition and reasoning tasks can be localized separately. To point out one such possibility, many competing theories which extend DPT believe that the two types of thinking occur simultaneously [<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>] (which sounds super exciting to me in terms of the richness of learning which is possible). In any case, building such properties into ML systems from the top down will certainly open up new avenues and is worth discussing and experimenting with.</p>
<h4 id="references">References</h4>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/diversity-in-reasoning-and-rationality-metacognitive-and-developmental-considerations/AE82722C96F7E92E852030B7F09940F7">Diversity in reasoning and rationality: Metacognitive and developmental considerations</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://scottbarrykaufman.com/wp-content/uploads/2014/04/dual-process-theory-Evans_Stanovich_PoPS13.pdf">Dual-Process Theories of Higher Cognition: Advancing the Debate</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p><a href="https://papers.nips.cc/paper/7120-thinking-fast-and-slow-with-deep-learning-and-tree-search.pdf">Thinking Fast and Slow with Deep Learning and Tree Search</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="https://www.frontiersin.org/articles/10.3389/fnhum.2012.00274/full">Toward an integrative account of social cognition: marrying theory of mind and interactionism to study the interplay of Type 1 and Type 2 processes</a> <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Can the influential ideas from Kahneman and others' research about humans having two modes of thinking, one fast, one slow, be applied to current machine learning systems?Uncertainty Estimation in Deep Learning2019-01-10T00:00:00+00:002019-01-10T00:00:00+00:00http://stillbreeze.github.io/Uncertainty%20Estimation%20in%20Deep%20Learning<p>Neural networks have seen an amazing diversification of their applications in the last 10 years. The effectiveness of deep networks as complex function approximators has allowed us to surpass many benchmarks across domains. Models using some form of deep learning have been widely adopted for real-world tasks, which has brought to the fore the very important topic of model confidence. More often than not, the predictions from a deep network are used for some downstream decision-making. A semantic segmentation map produced by a CNN is used to plan future trajectories of driverless cars. The credit scores from a model are used towards loan approval/denial decisions. Thus it makes sense to have reasonable estimates of the uncertainty of our model’s predictions. It is interesting to note that many papers from more than a couple of decades ago tried to solve this problem for neural networks through their Bayesian treatment. The first ideas behind Bayesian Neural Networks (BNNs) can be found as early as 1992-1995 in various works by David Mackay [<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>], Radford Neal [<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>] and Hinton and Van Camp [<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>]. A great <a href="https://www.youtube.com/watch?v=FD8l2vPU5FY">keynote talk</a> about the history of these ideas was given by Zoubin Ghahramani at the NIPS 2016 workshop.
More recently, Yarin Gal came up with a Bayesian interpretation of dropout-based deep models which has resulted in a flurry of research into this area (not to mention funny comments like this <a href="https://www.inference.vc/everything-that-works-works-because-its-bayesian-2/">post from Ferenc</a> and the comical cartoon below!) In this post, I would like to summarize a few interesting papers on uncertainty estimation from the recent literature.</p>
<p><br /><br />
<img src="/images/scooby_doo_dl.png" alt="Reparameterization" />
<br /><br /></p>
<h4 id="overview-of-methods">Overview of methods</h4>
<p>Before jumping into the specific papers, it’s always nice to have some mental buckets into which all approaches can be categorized. In my limited knowledge, most works on uncertainty estimation in neural networks fall under two heads. The first treats the network as a Bayesian model, placing a prior distribution over its weights and using data to learn a posterior distribution. The problem of doing inference in such setups has been solved through either Markov Chain Monte Carlo (MCMC) based methods (like the ones by Neal mentioned earlier) or variational methods (see this great <a href="https://csc2541-f17.github.io/slides/lec04.pdf">lecture slide</a>). The majority of papers follow this paradigm, and these methods are often clubbed together as Bayesian Deep Learning. The second category, for lack of a better name, is the non-Bayesian one. Many recent papers have explored ideas other than approximate Bayesian NNs: some try to obtain a frequentist estimate of uncertainty, others enforce an explicit minimization of KL divergences between certain distributions of in-domain and out-of-domain samples, while yet others use adversarial or contrastive samples to build uncertainty estimates. The key ideas from some of these papers are given below. The main intention here is to convey the breadth of techniques being explored.</p>
<h4 id="types-of-uncertainty">Types of uncertainty</h4>
<p>It is also useful to quickly describe the different types of predictive uncertainty estimated in these papers. The first is model uncertainty or epistemic uncertainty, which is a consequence of mis-specification of the model or its parameters for some given data. The second is data uncertainty or aleatoric uncertainty, which is a result of the complex or noisy nature of the data. This is further divided into homoscedastic and heteroscedastic uncertainty, where the former is constant across data samples while the latter changes with the inputs. Some works also address a third type called distributional uncertainty, which is the uncertainty in prediction due to a change in the data distribution from train to test.</p>
<h4 id="papers">Papers</h4>
<h5 id="1-dropout-as-a-bayesian-approximation-representing-model-uncertainty-in-deep-learning-by-gal-et-al-">1. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning by Gal et al [<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>]</h5>
<ul>
<li>First paper to formulate stochastic regularization techniques like dropout in deep learning as approximate Bayesian inference.</li>
<li>Shows that training a neural network with dropout is equivalent to doing approximate variational inference in a probabilistic deep Gaussian process. This means that for the predictive distribution, we can simply have a Bernoulli (or other, depending on the kind of stochastic noise being injected) distribution over the weights; doing multiple forward passes through the network and averaging them is then the same as Monte Carlo integration to find the expected output value of the model under the predictive distribution (this is called Monte Carlo dropout).</li>
<li>For a much more detailed (and amazingly lucid!) explanation, please see chapter 3.2 in <a href="http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf">Gal’s thesis</a> where he explains how variational inference in Bayesian NNs is the same as Monte Carlo dropout for various stochastic regularization techniques, and how moment matching can be used to obtain uncertainty estimates in this formulation.</li>
</ul>
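<p>A minimal numpy sketch of the resulting procedure (my own toy illustration, not the paper’s code; the network and its weights are made-up stand-ins for a trained model):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "trained" weights of a tiny two-layer regression network.
W1 = rng.normal(size=(1, 64))
W2 = rng.normal(size=(64, 1)) / 8.0

def forward(x, keep_prob=0.9):
    # Dropout stays active at prediction time: each pass samples a Bernoulli
    # mask over hidden units, i.e. one draw from the approximate posterior.
    h = np.tanh(x @ W1)
    mask = (rng.random(h.shape) < keep_prob) / keep_prob
    return (h * mask) @ W2

def mc_dropout_predict(x, n_passes=200):
    # Averaging the passes approximates the predictive mean (Monte Carlo
    # integration); the spread across passes estimates epistemic uncertainty.
    preds = np.stack([forward(x) for _ in range(n_passes)])
    return preds.mean(axis=0), preds.std(axis=0)

mean, std = mc_dropout_predict(np.array([[0.5]]))
```

Here <code>std</code> captures only the epistemic part; the full formulation in the paper also accounts for the inherent observation noise.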
<h5 id="2-what-uncertainties-do-we-need-in-bayesian-deep-learning-for-computer-vision-by-kendall-et-al-">2. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? by Kendall et al [<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>]</h5>
<ul>
<li>A follow up paper by Kendall and Gal discussing two kinds of uncertainties, namely epistemic and aleatoric.</li>
<li>Show how both uncertainty estimates can be obtained from the same model. Use Monte Carlo dropout for epistemic uncertainty as shown in the previous paper, and predict a variance term using the model itself to handle the aleatoric uncertainty of each input. The usual regression loss function is extended with a variance term, which drives the predicted variance up whenever the model outputs a very wrong value.</li>
<li>Show some really nice (state-of-the-art) results on real-world semantic segmentation and depth regression datasets.</li>
</ul>
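<p>This kind of variance-weighted regression objective can be sketched as follows (a toy numpy version under my own notation; predicting the log-variance for numerical stability is a common practical choice):</p>

```python
import numpy as np

def heteroscedastic_loss(y_true, y_pred, log_var):
    # The network predicts both a mean y_pred and a per-input log-variance.
    # Residuals are down-weighted where the predicted variance is high, while
    # the log-variance term penalizes claiming high uncertainty everywhere.
    precision = np.exp(-log_var)
    return np.mean(0.5 * precision * (y_true - y_pred) ** 2 + 0.5 * log_var)

y = np.array([1.0, 2.0, 3.0])
pred = np.array([1.1, 1.9, 3.5])
confident = heteroscedastic_loss(y, pred, log_var=np.full(3, -2.0))
uncertain = heteroscedastic_loss(y, pred, log_var=np.full(3, 2.0))
```

With mostly accurate predictions, claiming low variance yields a lower loss than claiming high variance, which is what lets the network learn a sensible per-input variance.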
<h5 id="3-weight-uncertainty-in-neural-networks-by-blundell-et-al-">3. Weight Uncertainty in Neural Networks by Blundell et al [<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup>]</h5>
<ul>
<li>Another early work, following the rise of deep learning and variational autoencoders (VAEs), which learns a distribution over neural network weights using variational inference, similar to techniques mentioned in Gal’s thesis.</li>
<li>Apply the reparameterization trick to get a variational approximation to the distribution over weights, as opposed to the distribution over hidden units as done in VAE papers. Use a scale mixture of Gaussians as the prior combined with a diagonal Gaussian posterior distribution. Also has a small interesting paragraph on why optimizing the prior distribution’s parameters based on the data (empirical Bayes) does not work for their model.</li>
<li>Show that their uncertainty estimates for the weights can be used to decide exploration strategies in contextual bandits with Thompson sampling, by modeling the conditional reward distribution using their neural network. Interesting idea!</li>
</ul>
<h5 id="4-simple-and-scalable-predictive-uncertainty-estimation-using-deep-ensembles-by-lakshminarayanan-et-al-">4. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles by Lakshminarayanan et al [<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup>]</h5>
<ul>
<li>A departure from the usual Bayesian modeling for getting uncertainty estimates, using ensembling and adversarial training instead. A side-effect of being non-Bayesian is that it comes with none of the mathematical guarantees of the previous work (this of course is not a general comment on non-Bayesian techniques).</li>
<li>Define proper scoring rules which can be used to measure the quality of the predictive distribution, which, it turns out, are many of the standard loss functions used for training neural networks. The scoring rules are then used with adversarial training to smooth the predictive distribution, and the ensemble is used as a uniformly-weighted mixture model to get the predicted value (mean) and the variance associated with the prediction, by assuming the output conditional distribution to be a mixture of Gaussians.</li>
<li>Also show empirically that another intuitive solution, making multiple predictions with the ensemble and using the empirical variance of the outputs as the measure of uncertainty, does not provide good estimates (it consistently underestimates the uncertainty). The final method is fairly simple and is shown to give better estimates than Monte Carlo dropout.</li>
</ul>
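<p>The mixture bookkeeping is simple enough to sketch directly (my own illustration with toy numbers): for a uniformly-weighted mixture of Gaussians, the mixture mean is the average of the member means, and the mixture variance combines the average predicted variance with the disagreement between members.</p>

```python
import numpy as np

def ensemble_moments(means, variances):
    # Uniformly-weighted Gaussian mixture over M ensemble members:
    # mu  = (1/M) sum_m mu_m
    # var = (1/M) sum_m (sigma_m^2 + mu_m^2) - mu^2
    # i.e. average member variance plus the spread of the member means.
    mu = np.mean(means, axis=0)
    var = np.mean(variances + means ** 2, axis=0) - mu ** 2
    return mu, var

means = np.array([1.0, 1.2, 0.8])      # per-member predicted means
variances = np.array([0.1, 0.2, 0.1])  # per-member predicted variances
mu, var = ensemble_moments(means, variances)
```

Note how disagreement between members inflates <code>var</code> beyond the average of the per-member variances.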
<h5 id="5-bayesian-uncertainty-estimation-for-batch-normalized-deep-networks-by-teye-et-al-">5. Bayesian Uncertainty Estimation for Batch Normalized Deep Networks by Teye et al [<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup>]</h5>
<ul>
<li>Shows that just like Monte Carlo dropout, using batch normalization in neural networks can also be cast as doing approximate Bayesian inference. This is possible due to the stochasticity involved in sampling mini-batches for getting batch statistics in batch normalization.</li>
<li>Similar to the analysis done in Gal’s thesis, this paper shows the equivalence between the variational inference objective in a Bayesian NN and the loss of a network trained with batch norm in all its layers. However, it also makes certain other assumptions about the nature of each batch norm layer, the units in each layer being uncorrelated and so on, in order to cast the prior term in the objective as weight decay.</li>
</ul>
<h5 id="6-accurate-uncertainties-for-deep-learning-using-calibrated-regression-by-kuleshov-et-al-">6. Accurate Uncertainties for Deep Learning Using Calibrated Regression by Kuleshov et al [<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup>]</h5>
<ul>
<li>Propose a method for obtaining well-calibrated uncertainty estimates for Bayesian NNs using post-hoc recalibration motivated by <a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt’s scaling</a>. Intuitively, a calibrated model means that whenever it predicts an output with probability 0.7, that output should occur 70% of the time. See their first couple of sections to read more about calibration and sharpness of model predictions.</li>
<li>Train an auxiliary regression model to recalibrate uncertainty estimates. This model is formulated as estimating a CDF arrived at through the earlier-defined properties of calibrated models. The estimation is done on a separate calibration set to prevent overfitting and uses isotonic regression, a non-parametric method which can learn the true distribution given enough iid data.</li>
<li>The proposed method also works with probabilistic predictions such as the ones from a Bayesian NN (they show it for papers 1 and 4 along with others). Thus this method can be used with any black-box method to recalibrate its uncertainty estimates.</li>
</ul>
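<p>To see operationally what ‘calibrated’ means for regression: if the predicted CDFs are correct, evaluating each CDF at its observed outcome yields Uniform(0, 1) values, so the empirical frequency below any level p should be close to p. A small self-contained check on synthetic data (my own sketch, not the paper’s method):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def calibration_curve(pit_values, levels):
    # pit_values[i] = F_i(y_i): each predicted CDF evaluated at its observed
    # outcome (the probability integral transform). For a perfectly calibrated
    # model these are Uniform(0, 1), so the empirical frequency below each
    # level p should be close to p itself.
    return np.array([(pit_values <= p).mean() for p in levels])

levels = np.linspace(0.1, 0.9, 9)
pit = rng.random(10_000)              # a well-calibrated forecaster's PIT values
freq = calibration_curve(pit, levels)
max_gap = np.max(np.abs(freq - levels))  # small gap means well calibrated
```

A recalibrator such as isotonic regression is then fit to map predicted probabilities onto these empirical frequencies, pulling the curve back toward the diagonal.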
<h5 id="7-reliable-uncertainty-estimates-in-deep-neural-networks-using-noise-contrastive-priors-by-hafner-et-al-">7. Reliable Uncertainty Estimates In Deep Neural Networks Using Noise Contrastive Priors by Hafner et al [<sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup>]</h5>
<ul>
<li>Many earlier Bayesian NN methods use a standard normal prior over weights, which imposes weight shrinkage in the form of weight decay in the final objective. These priors are uninformative about the function class and the data and only depend on the parameterization, which can cause the posterior to be overconfident on out-of-distribution (OOD) samples not seen during training. This paper proposes a new contrastive prior which explicitly ensures high uncertainty for OOD samples.</li>
<li>Since generating OOD data means finding the complement of the training distribution, which is tricky, the paper uses a few key ideas. The first is to approximate OOD inputs using random contrastive noise (motivated by noise contrastive estimation). Another is to encourage high uncertainty at data points close to the boundary of the training distribution and let this effect propagate through the OOD space.</li>
<li>These contrastive data points are used during training, and the prior KL loss term for them is added to the final objective, which can be interpreted as minimizing a KL divergence on pseudo-data points from the OOD inputs. This kind of prior is used to extend the work from paper 3 above and shows good uncertainty estimates on small datasets.</li>
</ul>
<h4 id="references">References</h4>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.29.274&rep=rep1&type=pdf">A Practical Bayesian Framework for Backprop Networks</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="http://www.cs.toronto.edu/pub/radford/thesis.pdf">Bayesian Learning for Neural Networks</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p><a href="http://www.cs.toronto.edu/~fritz/absps/colt93.pdf">Keeping Neural Networks Simple by Minimizing the Description Length of the Weights</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="http://proceedings.mlr.press/v48/gal16.pdf">Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning</a> <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p><a href="https://papers.nips.cc/paper/7141-what-uncertainties-do-we-need-in-bayesian-deep-learning-for-computer-vision.pdf">What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?</a> <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p><a href="http://proceedings.mlr.press/v37/blundell15.pdf">Weight Uncertainty in Neural Networks</a> <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p><a href="https://papers.nips.cc/paper/7219-simple-and-scalable-predictive-uncertainty-estimation-using-deep-ensembles.pdf">Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles</a> <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p><a href="https://arxiv.org/pdf/1802.06455.pdf">Bayesian Uncertainty Estimation for Batch Normalized Deep Networks</a> <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p><a href="http://proceedings.mlr.press/v80/kuleshov18a/kuleshov18a.pdf">Accurate Uncertainties for Deep Learning Using Calibrated Regression</a> <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p><a href="https://arxiv.org/pdf/1807.09289.pdf">Reliable Uncertainty Estimates In Deep Neural Networks Using Noise Contrastive Priors</a> <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Collection of some recent work on uncertainty estimation for deep learning models using Bayesian and non-Bayesian methodsREINFORCE vs Reparameterization Trick2018-08-09T00:00:00+00:002018-08-09T00:00:00+00:00http://stillbreeze.github.io/REINFORCE%20vs%20Reparameterization%20trick<p>In machine learning, it is often required to compute gradients of a loss function for stochastic optimization, and sometimes these loss functions are expressed as an expectation. For example, in <a href="https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf">variational inference</a> (converting an inference problem in a probabilistic model to an optimization problem), we need to compute the derivative of the ELBO loss, which is written in terms of an expectation. Another example is the <a href="http://www.scholarpedia.org/article/Policy_gradient_methods#Likelihood_Ratio_Methods_and_REINFORCE">policy gradient algorithm</a> in reinforcement learning, where the objective function is the expected reward. REINFORCE and the reparameterization trick are two of the many methods which allow us to calculate gradients of the expectation of a function. However, both of them make different assumptions about the underlying model and data distributions and thus differ in their usefulness. This post will introduce both methods and, in the process, draw a comparison between them. There are multiple tutorials which already cover REINFORCE and reparameterization gradients, but I’ve often found them in the context of specific models like VAEs or DRAW, which slightly obfuscates the general picture of these methods. Shakir Mohamed’s <a href="http://blog.shakirm.com/">blog</a> also covers these topics in an excellent way and I would highly advise everyone to go check it out.</p>
<h4 id="the-setup">The setup</h4>
<p>Given a random variable <script type="math/tex">x \sim p_{\theta}(x)</script> where <script type="math/tex">p_{\theta}</script> is a parametric distribution and a function <script type="math/tex">f</script>, for which we wish to compute the gradient of its expected value, the quantity of interest is:</p>
<script type="math/tex; mode=display">\nabla_{\theta}\mathbb E_{x\sim p_{\theta}(x)}[f(x)]</script>
<p>For an optimization problem, the above refers to the derivative of the expected value of the loss function. The difficulty in evaluating this term is that in the general case, the expectation is unknown and the derivative is taken wrt the parameters of the distribution <script type="math/tex">p_{\theta}</script>.</p>
<h4 id="reinforce">REINFORCE</h4>
<p>The REINFORCE algorithm [<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>], also known as the score function estimator [<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>], uses a simple identity called the log-derivative trick, which is just the differentiation rule for the logarithm.</p>
<script type="math/tex; mode=display">\nabla_{\theta}p_{\theta}(x) = p_{\theta}(x) \nabla_{\theta}\log p_{\theta}(x)\tag{1}</script>
<p>Although the ‘trick’ as written above seems very plain, it is very useful in situations where <script type="math/tex">p_{\theta}</script> is the likelihood of a random variable (also, likelihoods often belong to exponential families, which makes the expression on the right more amenable). The term <script type="math/tex">\nabla_{\theta}\log p_{\theta}(x)</script> is called the score and regularly comes up in maximum likelihood estimation. It also has many wonderful properties, such as having zero expected value (which proves useful when using it for variational inference, among other things).</p>
<p>With this, we get back to our problem of estimating the gradient. Using the definition of expectation,</p>
<center>
$$
\begin{align}
\nabla_{\theta}\mathbb E_{x\sim p_{\theta}(x)}[f(x)] & = \nabla_{\theta}\int{f(x)p_{\theta}(x)dx}\tag{2}\\
& = \int{f(x)\nabla_{\theta}p_{\theta}(x)dx}\tag{3}\\
& = \int{f(x)p_{\theta}(x) \nabla_{\theta}\log p_{\theta}(x)dx}\tag{4}\\
& = \mathbb E_{x\sim p_{\theta}(x)}[f(x)\nabla_{\theta}\log p_{\theta}(x)]\tag{5}\\
\end{align}
$$
</center>
<p>The reason the integral and differentiation can be switched in equation <script type="math/tex">3</script> is the <a href="https://en.wikipedia.org/wiki/Leibniz_integral_rule">Leibniz integral rule</a>. Equation <script type="math/tex">4</script> is just the application of the log-derivative trick from equation <script type="math/tex">1</script>. Now, since we know the distribution under the expectation, we can use Monte Carlo sampling to approximate it.</p>
<script type="math/tex; mode=display">\approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)\nabla_{\theta}\log p_{\theta}(x_i)\tag{6}</script>
<p>Note that the above is an unbiased estimator of the gradient (the expected value of the estimate is the same as the true gradient), and hence optimization with such gradients can converge to a local optimum under the Robbins-Monro conditions. The score function estimator assumes it is possible to cheaply sample from the distribution <script type="math/tex">p_{\theta}(x)</script>. It’s also interesting to note that REINFORCE places no restriction on the nature of the function <script type="math/tex">f</script>: it doesn’t even need to be differentiable for us to estimate the gradients of its expected value.</p>
<p>Of course, being unbiased does not make the estimates good: the variance of these gradients is typically very high. This can be thought of as a result of sampling values of <script type="math/tex">x</script> which are rare. To counter this, a common solution is to use something called control variates. The basic idea is to replace the function under the expectation with another function which has the same expected value but lower variance. This can be done by subtracting from the original function a term whose expectation is zero. Many other techniques like importance sampling or Rao-Blackwellization can also be used for variance reduction. Refer to chapters 8, 9 and 10 of this <a href="https://statweb.stanford.edu/~owen/mc/">book</a> for details on those methods.</p>
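<p>As a concrete sanity check of equation 6 (my own toy example, not from the references): take f(x) = x² with x ~ N(θ, 1), so E[f(x)] = θ² + 1 and the true gradient is 2θ, while the score of a unit-variance Gaussian with respect to its mean is simply (x − θ).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5
f = lambda x: x ** 2  # E[f(x)] = theta**2 + 1, so the true gradient is 2 * theta

# REINFORCE / score function estimator (equation 6):
# average f(x_i) * grad_theta log p_theta(x_i) over samples from p_theta.
x = rng.normal(theta, 1.0, size=100_000)
score = x - theta                     # grad_theta log N(x; theta, 1)
grad_samples = f(x) * score
grad_reinforce = grad_samples.mean()  # close to 2 * theta = 3
var_reinforce = grad_samples.var()    # note the large per-sample variance
```

Even on this one-dimensional problem the per-sample variance is large, which is exactly what control variates are meant to reduce.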
<h4 id="reparameterization-trick">Reparameterization trick</h4>
<p>Recall that our object of interest is the gradient of the expected value of the function.</p>
<script type="math/tex; mode=display">\nabla_{\theta}\mathbb E_{x\sim p_{\theta}(x)}[f(x)]</script>
<p>Also recall that the problem in evaluating this quantity is the fact that the expectation is taken wrt a distribution with parameters <script type="math/tex">\theta</script> and we can’t compute the derivative of that stochastic quantity. Reparameterization gradients also known as pathwise gradients allow us to compute this by re-writing the samples of the distribution <script type="math/tex">p_{\theta}</script> in terms of a noise variable <script type="math/tex">\varepsilon</script>, that’s independent of <script type="math/tex">\theta</script>. More concretely,</p>
<center>
$$
\begin{align}
\varepsilon & \sim q(\varepsilon)\tag{7}\\
x & = g_{\theta}(\varepsilon)\tag{8}\\
\nabla_{\theta}\mathbb E_{x\sim p_{\theta}(x)}[f(x)] & = \nabla_{\theta}\mathbb E_{\varepsilon\sim q(\varepsilon)}[f(g_{\theta}(\varepsilon))]\tag{9}\\
& = \mathbb E_{\varepsilon\sim q(\varepsilon)}[\nabla_{\theta}f(g_{\theta}(\varepsilon))]\tag{10}\\
\end{align}
$$
</center>
<p>Thus, x is reparameterized as a function of <script type="math/tex">\varepsilon</script> and the stochasticity of <script type="math/tex">p_{\theta}</script> is pushed to the distribution <script type="math/tex">q(\varepsilon)</script> where <script type="math/tex">q</script> can be chosen as any random noise distribution, eg a standard Gaussian <script type="math/tex">\mathcal{N}(0,1)</script>. An example of such reparameterization can be highlighted by assuming <script type="math/tex">x</script> is sampled from a Gaussian, <script type="math/tex">x \sim \mathcal{N}(\mu,\sigma)</script>. The function <script type="math/tex">g_{\theta}(\varepsilon)</script> then can be defined as the following:</p>
<script type="math/tex; mode=display">g_{\theta}(\varepsilon) = \mu_{\theta} + \varepsilon\sigma_{\theta}</script>
<p>where <script type="math/tex">\varepsilon \sim \mathcal{N}(0,1)</script></p>
<p>The figure below taken from <a href="https://jaan.io/what-is-variational-autoencoder-vae-tutorial/">Jaan’s blog</a> shows it succinctly for the case of a VAE (he uses <script type="math/tex">z</script> as the random variable instead of the <script type="math/tex">x</script> I have been using). Circles are stochastic nodes whereas diamonds are deterministic nodes.</p>
<p><br /><br />
<img src="/images/reparameterization.png" alt="Reparameterization" />
<br /><br /></p>
<p>As evident from equation <script type="math/tex">10</script>, the reparameterization has changed the expectation to a distribution independent of <script type="math/tex">\theta</script> and can now be computed using Monte Carlo provided <script type="math/tex">f(g_{\theta}(\varepsilon))</script> is differentiable wrt <script type="math/tex">\theta</script>.</p>
<script type="math/tex; mode=display">\nabla_{\theta}\mathbb E_{x\sim p_{\theta}(x)}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} (\nabla_{\theta}f(g_{\theta}(\varepsilon_i)))</script>
<p>Reparameterization gradients have been shown to typically have lower variance than REINFORCE gradients, or even REINFORCE with control variates (for example, in variational inference [<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>]). But they do require the function to be differentiable, as shown above.</p>
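<p>A toy check of the pathwise estimator (my own example, not from the references): for f(x) = x² with x ~ N(θ, 1), reparameterize x = g(ε) = θ + ε with ε ~ N(0, 1). The true gradient of the expectation is 2θ, and the per-sample variance of the pathwise gradient turns out to be far lower than that of the score function estimator on the same problem.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5
n = 100_000

# Reparameterize: x = g(eps) = theta + eps pushes the randomness into eps.
eps = rng.normal(0.0, 1.0, size=n)
x = theta + eps

# Pathwise estimator (equation 10): differentiate through g.
# d/dtheta f(g(eps)) = f'(x) * dg/dtheta = 2 * x * 1
grad_samples = 2.0 * x
grad_reparam = grad_samples.mean()  # close to the true gradient 2 * theta = 3
var_reparam = grad_samples.var()    # far smaller than the score-function case
```

The estimator exploits the derivative of f itself, which is why it needs a differentiable f but rewards us with much lower variance.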
<h4 id="summary-of-differences">Summary of differences</h4>
<p>The key differences between the two gradient estimation techniques are summarized in the table below.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Properties</strong></th>
<th style="text-align: center">REINFORCE</th>
<th style="text-align: center">Reparameterization</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><strong>Differentiability requirements</strong></td>
<td style="text-align: center">Can work with a non-differentiable model</td>
<td style="text-align: center">Needs a differentiable model</td>
</tr>
<tr>
<td style="text-align: center"><strong>Gradient variance</strong></td>
<td style="text-align: center">High variance; needs variance reduction techniques</td>
<td style="text-align: center">Low variance due to implicit modeling of dependencies</td>
</tr>
<tr>
<td style="text-align: center"><strong>Type of distribution</strong></td>
<td style="text-align: center">Works for both discrete and continuous distributions</td>
<td style="text-align: center">In the current form, only valid for continuous distributions</td>
</tr>
<tr>
<td style="text-align: center"><strong>Family of distribution</strong></td>
<td style="text-align: center">Works for a large class of distributions of x</td>
<td style="text-align: center">Requires that x can be reparameterized as shown above</td>
</tr>
</tbody>
</table>
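<p>The variance row of the table can be checked empirically. The sketch below (my own illustration, reusing the toy objective <script type="math/tex">\mathbb E_{x\sim \mathcal{N}(\theta,1)}[x^2]</script>) forms both per-sample gradient estimates; they agree in expectation, but the REINFORCE samples are far more spread out:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 1.5
N = 200_000

eps = rng.standard_normal(N)
x = theta + eps  # samples from N(theta, 1)

# REINFORCE (score-function) per-sample gradients:
# f(x) * grad_theta log N(x; theta, 1) = x^2 * (x - theta)
reinforce_grads = x**2 * (x - theta)

# Reparameterization per-sample gradients:
# grad_theta (theta + eps)^2 = 2 * (theta + eps)
reparam_grads = 2.0 * x

# Both target grad_theta E[x^2] = 2 * theta = 3.0,
# but their per-sample spreads differ by an order of magnitude.
print(reinforce_grads.mean(), reinforce_grads.var())
print(reparam_grads.mean(), reparam_grads.var())
```

<p>At <script type="math/tex">\theta=1.5</script> the reparameterization gradients have variance exactly 4, while the REINFORCE gradients have variance above 50, which is why REINFORCE typically needs control variates in practice.</p>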
<h4 id="references">References</h4>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Simple statistical gradient-following algorithms for connectionist reinforcement learning</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://www.sciencedirect.com/science/article/pii/S0927050706130194">Gradient Estimation</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p><a href="https://arxiv.org/pdf/1603.00788.pdf">Automatic Differentiation Variational Inference</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p><em>An introduction and comparison of two popular techniques for estimating gradients in machine learning models</em></p>
<hr />
<h4 id="ingredients-for-closing-the-gap-between-machines-and-humans">Ingredients For Closing the Gap Between Machines and Humans</h4>
<p><em>2018-07-17 · <a href="http://stillbreeze.github.io/Ingredients%20For%20Closing%20the%20Gap%20Between%20Machines%20and%20Humans">stillbreeze.github.io</a></em></p>
<p>I have recently been going through some papers in the cognitive sciences, specifically related to cognitive theories and new ideas at the intersection of machine learning (ML) and human learning. The central question these papers aim to answer is how to bridge the gap between current ML systems and the general learning abilities which humans possess. <strong>The primary focus of this post will be to summarize one such paper along with some response commentaries it received from other researchers in the field</strong>. Lake et al., in their paper <a href="https://arxiv.org/pdf/1604.00289.pdf">“Building Machines That Learn and Think Like People”</a>, identify certain ingredients of human cognition which can help ML researchers realise systems which learn like humans. The rest of this post summarises some interesting ideas from the paper and the commentary, but do check out [<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>] for the full paper and [<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>] for all the related commentaries.</p>
<h4 id="core-ingredients-of-human-intelligence">Core ingredients of human intelligence</h4>
<p>An overview of the described human-like learning characteristics from Lake et al:</p>
<ol>
<li>Developmental start-up software
<ol>
<li>Intuitive physics</li>
<li>Intuitive psychology</li>
</ol>
</li>
<li>Learning as rapid model building
<ol>
<li>Compositionality</li>
<li>Causality</li>
<li>Learning-to-learn</li>
</ol>
</li>
<li>Thinking Fast
<ol>
<li>Approximate inference in structured models</li>
<li>Model-based and model-free reinforcement learning</li>
</ol>
</li>
</ol>
<h5 id="1-developmental-start-up-software">1. Developmental start-up software</h5>
<p>The authors contend that humans have a foundational basis for understanding certain concepts like set operations, mechanics, geometry and agency. For example, a small infant is capable of identifying distinct objects, anticipating how rigid objects move under gravity, how solids differ from liquids when touched or how human agents differ from inanimate objects. All these examples are grouped under the foundational start-up software, which comes intuitively to a child and on top of which further experiences build more knowledge.</p>
<h6 id="11-intuitive-physics">1.1. Intuitive physics</h6>
<p>Children have a developed intuitive physical state representation which gives an approximate, probabilistic and oversimplified account of the physical world and its interactions. Many recent works have tried to embed deep learning models within a physics simulation engine from which they can learn this notion of intuitive physics, but how well they incorporate the physical rules of the world, whether they can learn with as few experiences as humans do and how we evaluate what they have learnt are all challenging problems.</p>
<h6 id="12-intuitive-psychology">1.2. Intuitive psychology</h6>
<p>How children perceive world agents like other humans or animate objects and how they react to these agents gives us a view of how psycho-social experiences shape our intrinsic mind models. Lake et al. give examples of how children associate negatively with an agent who blocks a positive action based on cues. But the number of cues needs to scale rapidly as situations become more complex for this to be plausible. Alternatively, such reasoning can be thought of as a generative model of actions where the child is seen as optimising for some goal through mental planning (like that of an MDP or POMDP). However, research connecting these psycho-computational theories to deep learning models has only just begun (see [<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>] and [<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>]).</p>
<h5 id="2-learning-as-rapid-model-building">2. Learning as rapid model building</h5>
<p>Humans have an amazing capacity for generalizing with few examples. We can see, relate, imagine and describe new concepts and make plausible inferences about them. Moreover, there is considerable evidence that this few-shot learning occurs on top of domain knowledge of various other classes of concepts (we can mentally picture a monkey with wings roller skating on the road because we have previous knowledge about the mentioned objects and their functional nature). The question is how to integrate various domain knowledge into current ML models to enable rapid model building.</p>
<h6 id="21-compositionality">2.1. Compositionality</h6>
<p>Compositionality is the mechanism which allows humans to build complex representations by composing multiple primitives. Therefore, instead of individually learning complex concepts, which is combinatorially expensive, they are learnt as general compositions of simple representations. This allows faster few-shot learning of novel concepts. Many recent papers explore the cognitive theories ([<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>] and [<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup>]) and computational models ([<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup>], [<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup>] and [<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup>]) of concept composition in machine learning, but of course a lot remains to be understood.</p>
<h6 id="22-causality">2.2. Causality</h6>
<p>Causality refers to the generative process by which a certain prediction or observation is produced in humans. Novel few-shot learning is often dependent on the nature of causal models we have learnt in the past. Causality acts as a glue for binding together various concepts and events in order to constrain our learning towards the real-world observations. It is important to note however, that not all generative models in machine learning are necessarily causal as they might not have anything to do with the actual process of generating that data.</p>
<h6 id="23-learning-to-learn">2.3 Learning-to-learn</h6>
<p>Many priors and inductive biases humans gain during the learning of one task are often useful for newer tasks. Learning-to-learn is thus the ability to transfer representation and computational structure to solve novel tasks. This problem has long been discussed in the machine learning community (Jürgen Schmidhuber did a lot of early work in the 1980s) and a lot of new ideas in deep RL and supervised transfer learning have pushed the performance benchmarks further (also see <a href="https://github.com/floodsung/Meta-Learning-Papers">this</a> list of papers on meta learning). Nevertheless, modern ML systems don’t learn as rapidly and flexibly as humans do and meta-learning will certainly have an important role to play here.</p>
<h5 id="3-thinking-fast">3. Thinking Fast</h5>
<p>Given that humans seem to have complex and structured models which allow for rapid generalization (previous three characteristics), it’s even more remarkable that the inference in these models is extremely fast. Deep learning based approaches are often advantageous due to their fast inference times and scalability and can form a viable basis for more human-like ML systems.</p>
<h6 id="31-approximate-inference-in-structured-models">3.1. Approximate inference in structured models</h6>
<p>It is vital for any human-like ML model to perform approximate inference, as calculating the probability distribution over the entire search space is almost always intractable. Some cognitive theories posit that humans perform approximate Bayesian inference using stochastic sampling methods like Monte Carlo sampling. Inductive biases are also evoked for facilitating rapid hypothesis selection in addition to hypothesis evaluation. For example, we know the answer to the question “how old is that tree?” is a number even though we may not know the correct answer, or that a never-before-seen object with wheels can be moved around even though we haven’t interacted with it yet (it might not even move in reality, but we still make these inferences). In the recent ML literature, many methods learn to do amortised inference in graphical models, and the work done in probabilistic inference in generative models or differentiable programming are exciting avenues for the integration of deep learning and structured probabilistic models.</p>
<h6 id="32-model-based-and-model-free-reinforcement-learning">3.2. Model-based and model-free reinforcement learning</h6>
<p>There’s significant evidence that the human brain uses fast model-free algorithms like the ones used in DQN models. However many cognitive capabilities which we exhibit also point to the presence of model-based learning. For example, for a given state-action environment, our brain can flexibly adapt to optimise for different reward signals without re-learning. This highlights our capacity to build a cognitive map of the environment and re-use it for different end goals. Thus, it is necessary for ML systems to allow for both model-free and model-based mechanisms.</p>
<h4 id="response-commentary-from-behavioral-and-brain-sciences-journal">Response commentary from Behavioral and Brain Sciences Journal</h4>
<p>There are 27 commentaries on the above summarized Lake et al. paper, but I have chosen a select few from them based on what I found interesting and have tried to condense them into few bullet points. The purpose here is to stimulate thoughts in different broad directions.</p>
<h5 id="1-the-architecture-challenge-future-artificial-intelligence-systems-will-require-sophisticated-architectures-and-knowledge-of-the-brain-might-guide-their-construction-by-baldassarre-et-al">1. The architecture challenge: Future artificial-intelligence systems will require sophisticated architectures, and knowledge of the brain might guide their construction by Baldassarre et al</h5>
<ul>
<li>Developing new architectures is essential for human-level AI systems.</li>
<li>Looking at the brain can provide guidance as to which architecture spaces to look at for navigating through the tons of possible architectures. E.g., the cortex is organised along multiple cortical pathways, which are hierarchical, with higher ones focusing on motivational information and lower ones on sensation.</li>
</ul>
<h5 id="2-building-machines-that-learn-and-think-for-themselves-by-botvinick-et-al">2. Building machines that learn and think for themselves by Botvinick et al</h5>
<ul>
<li>Agree with the list of ingredients, but focus should be on autonomy to reach these goals (agents learn their own internal models and how to use them instead of relying on human engineering).</li>
<li>Learning agents should be capable across multiple domains without needing too much a priori knowledge.</li>
<li>The idea is to use high-level prior knowledge like general structures about compositionality or causality (just like translational invariance was built into CNNs) along with large-scale and general architectures and algorithms like attentional filtering, learning through intrinsic motivation, episodic learning and memory augmented systems.</li>
<li>Models should be calibrated not just to individual tasks but to a distribution of tasks, learnt through experience and evolution. Thus, autonomous learning of internal models, such that these models can be shaped by a specific set of tasks, is advantageous.</li>
<li>Autonomy also depends on control functions (processes that use the model to make decisions). Even these control functions should co-evolve with models over time, hence agent-based approaches are important to develop.</li>
<li>Model-free methods might be of primary importance; it’s premature to relegate them to a supporting role.</li>
</ul>
<h5 id="3-the-humanness-of-artificial-non-normative-personalities-by-kevin-b-clark">3. The humanness of artificial non-normative personalities by Kevin B. Clark</h5>
<ul>
<li>Cognitive-emotional behaviour and non-normative (unique) personalities, and in turn the dynamic expression of human intelligence and identity, are key aspects of being human which are overlooked in Lake et al.</li>
<li>Attributes like resoluteness, meticulousness, fallibility and natural dispositions are all very human traits and must be accounted for in an artificially intelligent agent, both to realise their effects on learning and to prevent unwanted machine behaviour.</li>
</ul>
<h5 id="4-evidence-from-machines-that-learn-and-think-like-people-by-forbus-and-gentner">4. Evidence from machines that learn and think like people by Forbus and Gentner</h5>
<ul>
<li>Analogical comparison is an important part of human reasoning and might be better than learning structured relational representations.</li>
<li>Qualitative representations, not quantitative simulations, are the main ingredients of conceptual structure in the brain. The actual dynamics of the physics might not be known or even encoded in the model; just a qualitative experience is needed. Hence, Monte Carlo simulation of the kind used in Lake et al. (another 2015 paper) might not work.</li>
</ul>
<h5 id="5-the-importance-of-motivation-and-emotion-for-explaining-human-cognition-by-güss-and-dörner">5. The importance of motivation and emotion for explaining human cognition by Güss and Dörner</h5>
<ul>
<li>Lake et al. focus only on cognitive factors and miss out on motivation and emotion. Motivation, diverse exploration (seeking uncertainty in order to minimise it later on) and emotion lead to many human behaviours which interact with cognitive processes.</li>
</ul>
<h5 id="6-building-on-prior-knowledge-without-building-it-in-by-hansen-et-al">6. Building on prior knowledge without building it in by Hansen et al</h5>
<ul>
<li>
<p>The compositional approach is limited because it downplays the complex interaction of multiple contextual variables related to the various tasks where the representations are used. Not committing to compositionality provides more flexible ways of dealing with learning complex representations.</p>
</li>
<li>
<p>An important direction to explore is how humans learn from a rich ensemble of multiple varying, but partially related tasks.</p>
</li>
<li>
<p>Meta-learning of these related sub-tasks can be done, with the meta-tasks becoming more general (e.g., give an explanation for your behaviour, incorporate comments from a teacher, etc.), which will not rely on start-up software requiring domain-specific prior knowledge.</p>
</li>
</ul>
<h5 id="7-benefits-of-embodiment-by-maclennan">7. Benefits of embodiment by MacLennan</h5>
<ul>
<li>
<p>Lake et al. focus on the start-up software but neglect the nature of the software and how it is acquired. For understanding intuitive physics and physical causality, the embodied interaction of an organism with an environment serves as a guide for higher-order imagination and conceptual physical understanding. Simulations, in principle, can help in developing similar competencies, but generating simulations with enough complexity is difficult.</p>
</li>
<li>
<p>Explicit models are the ones which scientists construct in terms of symbolic variables and reason about discursively (including mathematically). Implicit models are constructed in terms of a large number of sub-symbolic variables which are densely interrelated (like a neural network). Implicit models allow for emergent behaviour and are more likely to be relevant to the goal of human-like learning.</p>
</li>
</ul>
<h5 id="8-autonomous-development-and-learning-in-artificial-intelligence-and-robotics-scaling-up-deep-learning-to-human-like-learning-by-oudeyer">8. Autonomous development and learning in artificial intelligence and robotics: Scaling up deep learning to human-like learning by Oudeyer</h5>
<ul>
<li>
<p>Curiosity, intrinsic motivation, social learning and natural interaction with peers and embodiment are interesting areas to probe.</p>
</li>
<li>
<p>Many of the current systems have manually specified, task-specific objectives. Many are learnt offline, often on large datasets. Human learning, in contrast, has open-ended goals and explores various skills; it is online and incremental in nature and involves free play.</p>
</li>
<li>
<p>Human learning happens in the physical world under constraints of energy, time and computation. Thus, embodiment is crucial for learning. Sensorimotor constraints in the models can simplify learning.</p>
</li>
</ul>
<h5 id="9-crossmodal-lifelong-learning-in-hybrid-neural-embodied-architectures-by-wermter-et-al">9. Crossmodal lifelong learning in hybrid neural embodied architectures by Wermter et al</h5>
<ul>
<li>
<p>Lifelong learning through experiencing the ‘world’ is the next big direction for ML systems. Additionally, these models should facilitate cross-modal learning to make sense of the multimodal stimuli of the environment.</p>
</li>
<li>
<p>Start-up software is tightly coupled with the general learning mechanisms of the brain. Past research suggests that architectural mechanisms, like different timings of information processing in the cortex, foster compositionality, which in turn enables more complex actions.</p>
</li>
<li>
<p>Transfer learning shouldn’t merely be switching between modalities, but integrating multiple modalities which are richer than the sum of their parts.</p>
</li>
</ul>
<h4 id="acknowledgement">Acknowledgement</h4>
<p>A big thanks to <a href="http://web.stanford.edu/~lampinen/">Andrew Lampinen</a> for helping me access the paper commentaries.</p>
<h4 id="references">References</h4>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://arxiv.org/pdf/1604.00289.pdf">Building Machines That Learn and Think Like People</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/building-machines-that-learn-and-think-like-people/A9535B1D745A0377E16C590E14B94993#fndtn-related-commentaries">Related Commentaries: Building Machines That Learn and Think Like People</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p><a href="https://hal.inria.fr/hal-01404278/document">Intrinsic motivation, curiosity and learning: theory and applications in educational technologies</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="https://arxiv.org/pdf/1712.06560.pdf">Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents</a> <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p><a href="https://pdfs.semanticscholar.org/6bd9/fa9aad10e0edd965c9bb43882a487c875d08.pdf">Cognitively Plausible Theories of Concept Composition </a> <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p><a href="https://link.springer.com/chapter/10.1007/978-3-319-45977-6_10">Conceptual Versus Referential Affordance in Concept Composition</a> <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p><a href="http://openaccess.thecvf.com/content_cvpr_2017/papers/Misra_From_Red_Wine_CVPR_2017_paper.pdf">From Red Wine to Red Tomato: Composition with Context</a> <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p><a href="https://arxiv.org/pdf/1705.10762.pdf">Generative Models Of Visually Grounded Imagination</a> <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p><a href="https://arxiv.org/pdf/1803.09851.pdf">Attributes as Operators</a> <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p><em>Highlighting ideas at the intersection of cognitive science and machine learning by summarizing the work of Lake et al (2016) and its related paper commentaries</em></p>
<hr />
<h4 id="deep-learning-and-the-demand-for-interpretability">Deep Learning and the Demand for Interpretability</h4>
<p><em>2017-05-02 · <a href="http://stillbreeze.github.io/Deep%20Learning%20and%20the%20Demand%20For%20Interpretability">stillbreeze.github.io</a></em></p>
<p>Deep learning has always been under fire for a lot of things in a lot of contexts. There is criticism about the arbitrariness of its hyperparameters and choice of architecture (Yann LeCun’s <a href="https://plus.google.com/+YannLeCunPhD/posts/gurGyczzsJ7">strong reaction</a> to a rejected paper from CVPR’12). There is also <a href="http://lists.numenta.org/pipermail/nupic-theory_lists.numenta.org/2014-October/001453.html">criticism</a> about how these models don’t reflect the true functioning of what we know about the human brain. In the academic setting, another criticism I’ve noticed is how quite a few people suggest (I think, correctly) that “deep learning”, if just dealt with as a method for stacking ad-hoc layers and loss functions, is not worth a student’s time (see Ferenc Huszár’s views <a href="http://www.inference.vc/deep-learning-is-easy/">here</a>). Another popular area of discussion which has recently gained importance is how deep learning is essentially a black box, which may be fine for prediction tasks where only the results matter, but not in inference problems or tasks requiring an explanation of its results.</p>
<p>Although all these comments on deep learning belong to very diverse areas and are often over-generalized (deep learning in practice isn’t a monolithic, standalone technique), in this post, I will write specifically about the notion of interpretability for these deep models. As the use of deep learning in real-life, decision-making systems increases, it becomes imperative that we are able to explain, to some degree, how our models come to the conclusions they do. But what exactly is interpretability and why is it needed at all? The remainder of this post discusses these two questions and finally explores some papers which try to make deep models more interpretable.</p>
<h4 id="what-is-interpretability">What is interpretability?</h4>
<p><br /><br />
<img src="/images/interpretability.jpg" alt="Perils of using black boxes" />
<br /><br /></p>
<p>If a CNN model is to be made interpretable, what will make it so? Is it the features it generates, which should be interpretable, or the weights, or the choice of hyperparameters, or the learning algorithm, or the architecture itself? As far as supervised deep models are concerned, we know very well how the learning algorithm works to minimize the loss through gradient updates. We even have a fair idea of what the topology of these loss functions looks like in high-dimensional space and how we can possibly escape the local minima and saddle points and get to the optima. Does it mean that such a CNN model is interpretable? Or does knowing which specific neurons activate for an input and how the prediction accuracy varies when we obscure a part of the image make the model interpretable? Not necessarily. The questions above deal with various nuances of interpretability.</p>
<p>Zachary Lipton compiled his article on KDnuggets into a workshop paper called <a href="https://arxiv.org/pdf/1606.03490.pdf">The Mythos of Model Interpretability</a> at the 2016 ICML Workshop on Human Interpretability of Machine Learning. In section 3, he defines two characteristics of an interpretable model: Transparency and Post-hoc Interpretability, each with more sub-parts. Transparency, he defines as <em>“opposite of blackbox-ness”</em> and <em>“some sense of understanding the mechanism by which the model works”</em>, which seems like a very broad definition and highlights the difficulty in defining it. Post-hoc interpretability, on the other hand, is simply the extraction and analysis of information from models after they have been learned. Clearly, the first one is the more interesting characteristic here, but also the one more difficult to achieve. He also argues in these sections that the poster boy of model interpretability in machine learning, i.e., a decision tree, can be analyzed simply because of its size and its computational requirements, and that there is nothing intrinsically interpretable about it. He says this is the case for most techniques and that there is often a tradeoff between constraining the size of the model or its computational requirements and its performance, which in turn is often a good reason to ignore the opaqueness of the model.</p>
<p>I personally think that the task of defining interpretability formally is not the best way to go about the problem of making models more interpretable. Answering the question ‘why interpretability’, on the other hand, can give more specific and useful ways to approach the problem.</p>
<h4 id="why-do-we-need-interpretability">Why do we need interpretability?</h4>
<p>A very popular thought in the machine learning circle goes like this:</p>
<p><em>“The demand for complete interpretability from intelligent systems is overblown. Humans too are poor at explaining their decisions. We too are not completely interpretable.”</em></p>
<p>Although this statement glosses over a lot of specific legal, ethical and philosophical questions, it is nevertheless important to justify whether or not we need to invest time in transparent techniques, which will often gain transparency at the cost of performance. It helps to differentiate between the various motivations for such models.</p>
<h5 id="1-interpretability-for-real-world-applications">1. Interpretability for real-world applications</h5>
<p>This sentence from a <a href="https://www.datanami.com/2017/03/15/scrutinizing-inscrutability-deep-learning/">blog</a> is a good indicator of the need to have understandable models.
<em>“Try explaining an “ADAM Optimizer” to the judge when your GAN inadvertently crashes an autonomous vehicle into a crowd of innocent people.”</em></p>
<p>The reason why this is a good indicator isn’t because of its correct technical understanding of the GANs or the machine learning models deployed in self-driving cars, but because of exactly the opposite reason. Users of these models are usually people who don’t understand these models. And they shouldn’t need to. Users should be able to trust these systems for them to be adopted. It is interesting to note here that this motivation for interpretability is very different from the rest. The transparency that the model may provide might not serve any other purpose than to make the general public comfortable in using the system. This is in contrast to other motivations in the real world where interpretability is largely a necessity. My first project in computer vision was to detect fire in industrial areas using surveillance cameras. The model consisted of a set of hand-engineered bag of features for the regions where motion was present, followed by a binary classification using an SVM. I later discovered that it was not robust against adversarial video frames. A person walking past the camera with clothes of colours and textures similar to those of fire also triggered the system. But since the features were hand-engineered and small in number, I could identify why certain clothes predicted fire and subsequently managed to add more features like the flickering motion of fire pixels to handle the adversarial examples. On the other hand, deep CNNs have been known to be vulnerable to small, imperceptible adversarial changes in the input and don’t allow for a robust analysis of why this is the case, because, among other things, their distributed representations make it difficult to analyse how the neurons behave to adversarial examples. A lot of similar applications in healthcare and medicine also require justifications from the model as to why and how it produces its output.
In fact, this is one of the reasons why quite a few industries still use extremely simple linear models or decision trees. However, it is important to keep track of the implications of using/not using more complex models by compromising transparency and explainability. Some would argue that even if the inner functioning of an autonomous car is partially opaque, knowing from empirical experiments just the fact that its adoption will reduce the number of accidents and deaths is enough to give it leeway in terms of policy regulations. More generally, whether a use-case of machine learning needs to be interpretable, and if yes, then to what extent, must be decided on a case-by-case basis. This is something that was recently discussed at the panel discussion at the <a href="https://www.youtube.com/watch?v=09yQG_A1kHM">Frontiers of Machine Learning</a>.</p>
<h5 id="2-interpretability-for-furthering-research">2. Interpretability for furthering research</h5>
<p>Although many researchers don’t agree with this, the theoretical foundations of many practices in deep learning is lacking. The immense potential and the fast growth of the field has led the researchers to come up with a lot of practical techniques to train, improve and modify these networks, with the theoretical understanding of them lagging behind. One of the motivations to invest time in the interpretability of these models is to identify the limitations and make theoretically sound improvements to the existing models. The next section talks very briefly about some of the works that I know of which try to do the same.</p>
<h4 id="work-in-deep-learning-and-interpretability">Work in deep learning and interpretability</h4>
<p>The papers mentioned here are deeply limited by my own reading list, so if I miss out on any important work, please let me know.</p>
<p>There has been a lot of work in trying to make deep models more explainable. For a clearer demarcation between these approaches, I find it useful to classify them into 3 types even though they aren’t necessarily mutually exclusive or exhaustive in nature:</p>
<ol>
<li>
<p>Post-hoc interpretability</p>
<p>Most work done in explaining the predictions of neural networks belongs to this class of approach. It involves using a trained model and analysing the weights, features, co-occurrence patterns, sensitivity to obscuration and much more in order to understand what the network has learnt. Early work in vision often learnt a separate inverting mechanism to visualise features from already trained CNN models (see [<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>] and [<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>]). In the text domain, models trained using RNN/LSTM have also been analysed through post-hoc analysis of representations, predictions and errors of the model (see [<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>] and [<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>]). Another popular work done in interpretability is LIME [<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>], which uses simpler surrogate models like linear models and decision trees to construct a model-agnostic <em>‘explainer’</em>. More recent works in vision like Excitation Backprop [<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup>], Grad-CAM [<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup>] and Network Dissection [<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup>] try to obtain a visual explanation for the prediction through individual neurons and layers and then subject them to quantitative and qualitative experiments.</p>
<p>Post-hoc techniques are very useful in understanding the nature of features learnt and the predictions made, but are mostly empirical and qualitative. They explain the ‘what’ to some degree, but not the ‘how’. That being said, with post-hoc interpretability, models usually don’t have to sacrifice performance in order to be interpretable.</p>
</li>
<li>
<p>Inherent interpretability</p>
<p>The second class of interpretability approach is found in deep models where interpretability of some kind is achieved as a by-product of the model or the training method. The best example of this is the attention model. Networks trained with an attention module are inherently interpretable throughout training and at test time too. In tasks like captioning or visual question answering, attention over images and text allows us to visualise the parts of the image and text the network is looking at in order to produce a prediction (see [<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup>] and [<sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup>]). Similarly, attention in generative models like DRAW [<sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup>] allows a temporal visualisation of how the network generates an image. Another recent work in which the architecture allows for interpretability is the paper on visual reasoning [<sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup>], where the model itself contains functional modules and thus makes it possible to follow its chain of reasoning. Apart from the architecture, the choice of objective being optimised can also result in relatively more transparent models. For example, in [<sup id="fnref:13"><a href="#fn:13" class="footnote">13</a></sup>], learning an MIL-based detector on image regions using single-word concepts also leads to an attention-like visualisation of the image.</p>
<p>The obvious problem with this class of interpretable models is that they are very task-specific and thus can’t be extended to generic use-cases. This kind of interpretability, like the first, is also mostly limited to explaining the ‘what’ rather than the ‘how’.</p>
</li>
<li>
<p>Intrinsic interpretability</p>
<p>Instead of providing an account of the model extracted extrinsically, this class of approaches applies theoretical analysis to interpret what the models have learnt. The basic idea is that a better theoretical understanding of deep learning will yield models whose predictions and errors can be better explained. A lot of work at Cambridge’s machine learning lab, previously in David MacKay’s group and now in Zoubin Ghahramani’s lab, has revealed Bayesian interpretations of neural networks. More recently, Yarin Gal’s work [<sup id="fnref:14"><a href="#fn:14" class="footnote">14</a></sup>] on how dropout can be used to estimate uncertainty bounds on a network’s predictions is one example where theoretical insight enables us to quantify what the model does and doesn’t know (<a href="http://mlg.eng.cam.ac.uk/yarin/PDFs/2015_UCL_Bayesian_Deep_Learning_talk.pdf">see slides from his talk</a>).</p>
<p>In contrast to the previous class of models, this approach results in more general explainability. Although there is a lot of work exploring how neural networks learn, it doesn’t necessarily translate into explaining how they arrive at their predictions.</p>
</li>
</ol>
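<p>To make the post-hoc family concrete, here is a minimal sketch of an occlusion-sensitivity map of the kind alluded to above: slide a patch over the image, re-run the classifier, and record how much the target class score drops. The <code>model</code> callable and the patch parameters are my own illustrative choices, not taken from any of the cited papers.</p>

```python
import numpy as np

def occlusion_map(model, image, target_class, patch=8, stride=8):
    """Post-hoc saliency: score drop when each region is occluded.

    model: callable mapping an image of shape (H, W, C) to class scores.
    Returns a 2D map; high values mark regions the prediction relies on.
    """
    h, w, _ = image.shape
    base = model(image)[target_class]            # unoccluded score
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            # replace the patch with the image's mean value ("grey" patch)
            occluded[y:y + patch, x:x + patch, :] = image.mean()
            heat[i, j] = base - model(occluded)[target_class]
    return heat
```

<p>Running this over a trained classifier and overlaying the heat map on the input gives the kind of qualitative evidence most post-hoc methods produce.</p>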
<p>From my observations, the ‘black-boxness’ of deep learning is overhyped in some situations while in others it is underestimated. In the practical world, interpretability is needed only in specific circumstances and might serve a very different purpose than what research-world interpretability is expected to do. In any case, as we go forward, we will be seeing much more work on both the practical and the research end-goals of explainable models.</p>
<h4 id="references">References</h4>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://arxiv.org/pdf/1311.2901.pdf">Visualizing and Understanding Convolutional Networks</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://www.robots.ox.ac.uk/~vedaldi/assets/pubs/mahendran15understanding.pdf">Understanding Deep Image Representations by Inverting Them</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p><a href="https://arxiv.org/pdf/1506.02078.pdf">Visualizing and Understanding Recurrent Network</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="https://arxiv.org/pdf/1506.01066.pdf">Visualizing and Understanding Neural Models in NLP</a> <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p><a href="https://arxiv.org/pdf/1602.04938v1.pdf">“Why Should I Trust You?” Explaining the Predictions of Any Classifier</a> <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p><a href="https://arxiv.org/pdf/1608.00507.pdf">Top-down Neural Attention by Excitation Backprop</a> <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p><a href="https://arxiv.org/pdf/1610.02391.pdf">Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization</a> <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p><a href="http://netdissect.csail.mit.edu/final-network-dissection.pdf">Network Dissection: Quantifying Interpretability of Deep Visual Representations</a> <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p><a href="http://proceedings.mlr.press/v37/xuc15.pdf">Show, Attend and Tell: Neural Image Caption Generation with Visual Attention</a> <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p><a href="http://papers.nips.cc/paper/6202-hierarchical-question-image-co-attention-for-visual-question-answering.pdf">Hierarchical Question-Image Co-Attention for Visual Question Answering</a> <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
<li id="fn:11">
<p><a href="https://arxiv.org/pdf/1502.04623.pdf">DRAW: A Recurrent Neural Network For Image Generation</a> <a href="#fnref:11" class="reversefootnote">↩</a></p>
</li>
<li id="fn:12">
<p><a href="https://arxiv.org/pdf/1705.03633.pdf">Inferring and Executing Programs for Visual Reasoning</a> <a href="#fnref:12" class="reversefootnote">↩</a></p>
</li>
<li id="fn:13">
<p><a href="https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Fang_From_Captions_to_2015_CVPR_paper.pdf">From Captions to Visual Concepts and Back</a> <a href="#fnref:13" class="reversefootnote">↩</a></p>
</li>
<li id="fn:14">
<p><a href="https://arxiv.org/pdf/1506.02142.pdf">Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning</a> <a href="#fnref:14" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<h4 id="variational-inference-and-expectation-maximization">Variational Inference and Expectation Maximization (2017-03-28)</h4>
<p>A few days ago, while reading about variational autoencoders, I came to know that variational inference is in fact a generalization of the popular expectation maximization (EM) algorithm. The aim of this post is to explain the relationship between these two techniques.
The post assumes the reader is familiar with the EM algorithm, but if you need a reference for it before starting here, have a look at <a href="http://www.svcl.ucsd.edu/courses/ece271A/handouts/EM2.pdf">this</a>.</p>
<h4 id="notation">Notation</h4>
<ul>
<li>X denotes the observed random variables</li>
<li>p(X) or q(X) denotes probability distribution over the variables X</li>
<li>Z denotes the latent random variables</li>
</ul>
<h4 id="expectation-maximization">Expectation Maximization</h4>
<p>EM is used to find maximum-likelihood estimates of the model parameters when the data involves latent variables. To do so, EM repeats the following two steps until convergence:</p>
<p><strong>E step</strong>: Compute the posterior distribution over the latent variables using the current model parameters<br />
<strong>M step</strong>: Update the model parameters by maximizing the expected complete-data log-likelihood</p>
<p><br /><br />
<img src="/images/em.jpg" alt="EM algorithm cycle" />
<br /><br /></p>
<p>What this means mathematically is:</p>
<p><strong>E step</strong>: Estimate <script type="math/tex">Q(\theta, \theta_t)</script> for iteration <script type="math/tex">t</script><br />
<strong>M step</strong>: Maximize <script type="math/tex">Q(\theta, \theta_t)</script> wrt to <script type="math/tex">\theta</script></p>
<p>where</p>
<center>$$Q(\theta, \theta_t) = \mathbb E_{p(Z \mid X;\theta_t)}[\log p(X,Z;\theta)]$$</center>
<p>and
the probability distribution <script type="math/tex">p</script> is parameterized by <script type="math/tex">\theta</script>, i.e. <script type="math/tex">\theta</script> is the model parameter.</p>
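<p>To see the two steps in action, here is a minimal EM loop for a toy model chosen purely for illustration: a 1-D mixture of two Gaussians with known, equal variance, where <script type="math/tex">Z</script> is the component assignment. The E step computes the posterior responsibilities <script type="math/tex">p(Z \mid X;\theta_t)</script> and the M step maximizes the expected complete-data log-likelihood, which has a closed form here.</p>

```python
import numpy as np

def em_gmm_1d(x, iters=50, sigma=1.0):
    """EM for a two-component 1-D Gaussian mixture with fixed variance.

    theta = (mu0, mu1, pi): the component means and the mixing weight.
    """
    mu = np.array([x.min(), x.max()])   # crude initialisation at the extremes
    pi = 0.5
    for _ in range(iters):
        # E step: responsibility r_i = p(Z_i = 1 | x_i; theta_t)
        lik0 = (1 - pi) * np.exp(-0.5 * ((x - mu[0]) / sigma) ** 2)
        lik1 = pi * np.exp(-0.5 * ((x - mu[1]) / sigma) ** 2)
        r = lik1 / (lik0 + lik1)
        # M step: maximise E_{p(Z|X)}[log p(X, Z; theta)] in closed form
        mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r),
                       np.sum(r * x) / np.sum(r)])
        pi = r.mean()
    return mu, pi
```

<p>On well-separated data the responsibilities quickly become near-hard assignments, and the means converge to the cluster centres.</p>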
<h4 id="variational-inference">Variational Inference</h4>
<p>Variational inference is a method which tries to do inference in complicated graphical models where the distribution to be computed is intractable. It does this by re-framing the inference problem into an optimization problem. In the Bayesian framework, inference is formulated as computing the posterior distribution over the set of latent variables:</p>
<center>$$p(Z \mid X;\theta) = \frac{p(X,Z;\theta)}{\int_Z p(X,Z;\theta)}$$</center>
<p>The integral in the denominator is intractable for many distributions of interest, so the problem boils down to finding a close approximation to the posterior. Sampling-based techniques like MCMC do this by constructing a Markov chain over the latent variables, but they can be very slow to converge. Instead, VI replaces the intractable distribution <script type="math/tex">p(Z \mid X;\theta)</script> with a proxy distribution <script type="math/tex">q(Z)</script> and performs inference with it. For this to work, the following needs to be taken care of:</p>
<ol>
<li>The proxy distribution should closely resemble the original posterior</li>
<li>The proxy distribution should be simple enough to perform inference on</li>
</ol>
<p>For measuring deviation from the posterior, we use the KL divergence of the proxy distribution with respect to the original posterior. For simplicity, I drop the model parameter <script type="math/tex">\theta</script> for now, but will include it in the end.</p>
<center>
$$
\begin{align}
KL(q(Z)||p(Z \mid X)) & = \mathbb E_{q(Z)}[\log \frac{q(Z)}{p(Z \mid X)}]\\
& = \mathbb E_{q(Z)}[\log q(Z)] - \mathbb E_{q(Z)}[\log p(Z \mid X)]\\
& = \mathbb E_{q(Z)}[\log q(Z)] - \mathbb E_{q(Z)}[\log p(X,Z)] + \log(p(X))\tag{1}\\
\end{align}
$$
</center>
<p><br />
Now, we take a detour to calculate the log likelihood for the observed data. This quantity is of interest to us because we often use it for maximum likelihood estimation.</p>
<center>$$\log(p(X)) = \log(\int_Z p(X,Z))$$</center>
<p>This is just the marginal likelihood, obtained by integrating out the latent variables. Now, to change it into an expectation, we apply a small trick: we multiply and divide the term inside the integral by <script type="math/tex">q(Z)</script>.</p>
<center>
$$
\begin{align}
\log(p(X)) & = \log(\int_Z \frac{p(X,Z) q(Z)}{q(Z)})\\
& = \log(\mathbb E_{q(Z)}[\frac{p(X,Z)}{q(Z)}])
\end{align}
$$
</center>
<p>Now using <a href="http://www.sef.hku.hk/~wsuen/teaching/micro/jensen.pdf">Jensen’s Inequality</a>, we switch the log and the expectation and update the inequality.</p>
<center>
$$
\begin{align}
\log(p(X)) & \ge \mathbb E_{q(Z)}[\log(\frac{p(X,Z)}{q(Z)})]\\
& = \mathbb E_{q(Z)}[\log p(X,Z)] - \mathbb E_{q(Z)}[\log q(Z)]\tag{2}\\
\end{align}
$$
</center>
<p>The important thing to notice here is that the right-hand side of equation <script type="math/tex">2</script> is a lower bound on the log probability of the data (the evidence), and hence it is often called the evidence lower bound, or ELBO.</p>
<p>Now, we go back to equation <script type="math/tex">1</script> and notice that its RHS also contains the ELBO.
Substituting the ELBO into equation <script type="math/tex">1</script> gives us:</p>
<center>$$KL(q(Z)||p(Z \mid X)) = -ELBO + \log(p(X))\tag{3}$$</center>
<p>Finally, coming back to our original problem of minimizing the KL divergence, we can see that since the second term on the RHS of equation <script type="math/tex">3</script> is independent of <script type="math/tex">q</script>, minimizing the KL divergence is the same as maximizing the ELBO. Furthermore, we have also seen that the log probability of the data is lower-bounded by the ELBO, and that the gap between the two is exactly the KL divergence between the approximating distribution and the original posterior.</p>
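<p>Equation <script type="math/tex">3</script> is easy to verify numerically. The sketch below uses a small discrete joint distribution, made up here purely for illustration, picks an arbitrary proxy <script type="math/tex">q(Z)</script>, and checks that the log evidence decomposes into ELBO plus KL, with the bound becoming tight exactly when <script type="math/tex">q</script> equals the posterior.</p>

```python
import numpy as np

# A made-up joint distribution over one observed x and 3 latent states:
# p(x, z) for z in {0, 1, 2}; summing over z gives the evidence p(x).
p_xz = np.array([0.10, 0.25, 0.15])
log_px = np.log(p_xz.sum())            # log evidence
posterior = p_xz / p_xz.sum()          # p(z | x)

q = np.array([0.5, 0.3, 0.2])          # arbitrary proxy distribution q(z)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))        # eq. (2)
kl = np.sum(q * (np.log(q) - np.log(posterior)))     # KL(q || p(z|x))

# eq. (3): log p(X) = ELBO + KL, so the ELBO lower-bounds the evidence
assert np.isclose(log_px, elbo + kl)

# The bound is tight when q equals the true posterior (KL = 0):
elbo_star = np.sum(posterior * (np.log(p_xz) - np.log(posterior)))
assert np.isclose(elbo_star, log_px)
```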
<h4 id="em-as-a-special-case-of-variational-inference">EM as a special case of Variational Inference</h4>
<p>So, variational inference is all about changing the posterior estimation problem to an optimization problem, namely the maximization of ELBO. Let’s have a closer look at it, this time with the model parameters.</p>
<center>$$ELBO(q,\theta) = \mathbb E_{q(Z)}[\log p(X,Z;\theta)] - \mathbb E_{q(Z)}[\log q(Z)]\tag{4}$$</center>
<p>The ELBO is in fact a function of both the proxy distribution <script type="math/tex">q</script> and the model parameters <script type="math/tex">\theta</script>.</p>
<p>The EM algorithm described in the beginning can be interpreted as coordinate ascent on <script type="math/tex">ELBO(q,\theta)</script>, optimizing one argument at a time while keeping the other fixed.</p>
<p>The two steps can be re-stated more generally in the following manner:</p>
<p><strong>E step</strong>:</p>
<center>$$\mathop{\arg\,\max}\limits_q (ELBO(q,\theta_t))$$</center>
<p>This step performs coordinate ascent on <script type="math/tex">ELBO(q,\theta_t)</script> at iteration <script type="math/tex">t</script>.<br />
Since the ELBO is maximized over <script type="math/tex">q</script> exactly when the approximating distribution equals the true posterior, the solution is simply</p>
<center>$$q_t(Z) = p(Z \mid X;\theta_t)\tag{5}$$</center>
<p>Note that this step is the same as estimating the function <script type="math/tex">Q(\theta,\theta_t)</script> as done in the E-step of EM described above. It assumes that the approximating distribution is the same as the posterior.</p>
<p><strong>M step</strong>:</p>
<center>$$\mathop{\arg\,\max}\limits_\theta (ELBO(q_t,\theta))$$</center>
<p>Substituting the ELBO from equation <script type="math/tex">4</script> and the E-step solution from equation <script type="math/tex">5</script>, we have</p>
<center>$$\mathop{\arg\,\max}\limits_\theta (\mathbb E_{p(Z \mid X;\theta_t)}[\log p(X,Z;\theta)] - \mathbb E_{p(Z \mid X;\theta_t)}[\log p(Z \mid X;\theta_t)])$$</center>
<p>Since the second expectation term is independent of <script type="math/tex">\theta</script>, the problem simplifies to the original M-step of the EM algorithm as described above.</p>
<center>$$\mathop{\arg\,\max}\limits_\theta (\mathbb E_{p(Z \mid X;\theta_t)}[\log p(X,Z;\theta)])$$</center>
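<p>This coordinate-ascent view can be checked on a toy model, made up here for illustration: a mixture of two fixed Bernoulli components with an unknown mixing weight <script type="math/tex">\pi</script> as the only parameter. After each E step the KL term vanishes, so the ELBO touches the log-likelihood, and the M step can only push the log-likelihood up, which is why EM increases the likelihood monotonically.</p>

```python
import numpy as np

# Toy model: mixture of two fixed Bernoullis; theta is the mixing weight pi.
f0, f1 = 0.2, 0.8                      # p(x=1 | z=0) and p(x=1 | z=1)
rng = np.random.default_rng(0)
z_true = rng.random(500) < 0.7         # true mixing weight is 0.7
x = (rng.random(500) < np.where(z_true, f1, f0)).astype(float)

def loglik(pi):
    """Observed-data log-likelihood log p(X; pi)."""
    px = pi * np.where(x == 1, f1, 1 - f1) + (1 - pi) * np.where(x == 1, f0, 1 - f0)
    return np.log(px).sum()

pi, lls = 0.5, []
for _ in range(30):
    # E step: q(Z) = p(Z | X; pi_t), which makes KL(q || posterior) = 0,
    # so the ELBO equals the log-likelihood at pi_t.
    num = pi * np.where(x == 1, f1, 1 - f1)
    den = num + (1 - pi) * np.where(x == 1, f0, 1 - f0)
    r = num / den
    # M step: argmax_pi E_q[log p(X, Z; pi)] has the closed form pi = mean(r)
    pi = r.mean()
    lls.append(loglik(pi))
```

<p>Tracking <code>lls</code> across iterations shows a non-decreasing log-likelihood, exactly as the coordinate-ascent argument predicts.</p>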
<h4 id="conclusion">Conclusion</h4>
<p>The above analysis shows that EM is the special case of variational inference in which the variational distribution is taken to be the exact posterior.
In other words, EM assumes that the expectation over the posterior is computable without any approximation, so the KL divergence from equation <script type="math/tex">3</script> becomes zero and the ELBO touches the log-likelihood.</p>
<p>I highly recommend reading <a href="https://arxiv.org/pdf/1601.00670.pdf">this review paper</a> and <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.137.221&rep=rep1&type=pdf">these slides</a> for more on variational inference.</p>