willtipton.com: Coding and poker, mostly.
http://willtipton.com/
Thu, 15 Jun 2017 21:59:15 -0700
Playing a toy poker game with Reinforcement Learning
<p>Reinforcement learning (RL) has had some high-profile successes lately, e.g. <a href="https://en.wikipedia.org/wiki/AlphaGo">AlphaGo</a>, but the basic ideas are fairly straightforward. Let’s try RL on our favorite toy problem: the heads-up no limit shove/fold game. This is a pedagogical post rather than a research write-up, so we’ll develop all of the ideas (and code!) more or less from scratch. Follow along in a Python 3 <a href="http://jupyter.org/">Jupyter</a> notebook!
<!--more--></p>
<ul>
<li><a href="#problem-setup">Problem setup</a></li>
<li><a href="#reinforcement-learning">Reinforcement learning</a>
<ul>
<li><a href="#features-inputs-to-hatq">Features: inputs to <script type="math/tex">\hat{Q}</script></a></li>
<li><a href="#a-linear-model-for-hatq">A linear model for <script type="math/tex">\hat{Q}</script></a></li>
<li><a href="#simulating-poker">Simulating poker</a></li>
<li><a href="#learning-updating-hatq">Learning: updating <script type="math/tex">\hat{Q}</script></a></li>
</ul>
</li>
<li><a href="#putting-it-all-together">Putting it all together</a></li>
<li><a href="#results">Results</a>
<ul>
<li><a href="#interpreting-the-model">Interpreting the model</a></li>
<li><a href="#visualizing-the-strategies">Visualizing the strategies</a></li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>
<h2 id="problem-setup">Problem setup</h2>
<p>A quick reminder – shove/fold is a 2-player no limit hold’em game where:</p>
<ol>
<li>Both players start with stacks of <script type="math/tex">S</script> and a randomly-dealt 2 card hand.</li>
<li>The BB player posts 1.0 blind, and the SB player posts 0.5 blind.</li>
<li>The SB can go all-in or fold.</li>
<li>Facing an all-in, the BB can call or fold.</li>
</ol>
<p>We might visualize this as a decision tree like the one shown here. A hand begins in <script type="math/tex">E</script> where the SB can shove or fold. If she folds, we transition to <script type="math/tex">A</script>, and the hand is over. If she shoves, we end up in <script type="math/tex">D</script> where the BB must decide between calling and folding. If one player folds, the other captures the blinds, and if both players get all-in, a 5 card board is dealt and payouts are given according to the normal rules of poker.</p>
<p align="center">
<img alt="Shove/fold game decision tree" src="/images/shove_fold_tree_labelled.svg" />
</p>
<p>The solution for this game is <a href="http://www.dandbpoker.com/preflop-charts">well known</a>, and we’ve looked at other approaches elsewhere, e.g. <a href="https://www.youtube.com/watch?v=MVMfDswjJE0">fictitious play</a> and <a href="http://willtipton.com/coding/poker/2016/03/06/shove-fold-with-tensorflow.html">direct optimization</a>. Here, we’ll estimate the solution using RL.</p>
<p>Now, there are <script type="math/tex">\binom{52}{2} = 1326</script> distinct starting 2-card hand combinations in hold’em. We can thus order all the hands and number them from 0 to 1325. The specific order won’t matter as long as we’re consistent. The following function implicitly defines such an ordering and creates a map from hand number back to the strategically-relevant info: card ranks and suitedness.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">numHoldemHands</span> <span class="o">=</span> <span class="mi">1326</span> <span class="c"># nchoosek(52,2)</span>
<span class="n">ranks</span> <span class="o">=</span> <span class="p">[</span><span class="s">'2'</span><span class="p">,</span><span class="s">'3'</span><span class="p">,</span><span class="s">'4'</span><span class="p">,</span><span class="s">'5'</span><span class="p">,</span><span class="s">'6'</span><span class="p">,</span><span class="s">'7'</span><span class="p">,</span><span class="s">'8'</span><span class="p">,</span><span class="s">'9'</span><span class="p">,</span><span class="s">'T'</span><span class="p">,</span><span class="s">'J'</span><span class="p">,</span><span class="s">'Q'</span><span class="p">,</span><span class="s">'K'</span><span class="p">,</span><span class="s">'A'</span><span class="p">]</span>
<span class="n">suits</span><span class="o">=</span><span class="p">[</span><span class="s">'c'</span><span class="p">,</span><span class="s">'s'</span><span class="p">,</span><span class="s">'d'</span><span class="p">,</span><span class="s">'h'</span><span class="p">]</span>
<span class="n">numRanks</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)</span>
<span class="n">numSuits</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">suits</span><span class="p">)</span>
<span class="c"># Input: N/A</span>
<span class="c"># Output:</span>
<span class="c"># A map from int hand representation in [0,1325] to tuple of form</span>
<span class="c"># (higher rank, lower rank, isSuited).</span>
<span class="k">def</span> <span class="nf">makeIntToHandMap</span><span class="p">():</span>
    <span class="n">result</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">numHoldemHands</span><span class="p">)]</span>
    <span class="n">c</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">r1</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">numRanks</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">r2</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">r1</span><span class="p">,</span> <span class="n">numRanks</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">s1</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">numSuits</span><span class="p">):</span>
                <span class="k">for</span> <span class="n">s2</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">numSuits</span><span class="p">):</span>
                    <span class="k">if</span> <span class="n">r1</span> <span class="o">==</span> <span class="n">r2</span> <span class="ow">and</span> <span class="n">s1</span> <span class="o">>=</span> <span class="n">s2</span><span class="p">:</span>
                        <span class="k">continue</span>
                    <span class="c"># hand number c corresponds to holding</span>
                    <span class="c"># ranks[r2], suits[s2], ranks[r1], suits[s1]</span>
                    <span class="n">result</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">r2</span><span class="p">,</span> <span class="n">r1</span><span class="p">,</span> <span class="n">s1</span> <span class="o">==</span> <span class="n">s2</span><span class="p">)</span>
                    <span class="n">c</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">result</span>
<span class="n">intToHandMap</span> <span class="o">=</span> <span class="n">makeIntToHandMap</span><span class="p">()</span></code></pre></figure>
<p>Notice that the first entry in the output tuples (<code>r2</code> in the code) is always the higher rank, if there is one. For example, hand number 57 happens to be 6♥2♣, and we have:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">r2</span><span class="p">,</span> <span class="n">r1</span><span class="p">,</span> <span class="n">suited</span> <span class="o">=</span> <span class="n">intToHandMap</span><span class="p">[</span><span class="mi">57</span><span class="p">]</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">ranks</span><span class="p">[</span><span class="n">r2</span><span class="p">],</span> <span class="n">ranks</span><span class="p">[</span><span class="n">r1</span><span class="p">],</span> <span class="s">'s'</span> <span class="k">if</span> <span class="n">suited</span> <span class="k">else</span> <span class="s">'o'</span><span class="p">)</span>
<span class="mi">6</span> <span class="mi">2</span> <span class="n">o</span></code></pre></figure>
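<p>As a quick sanity check on this ordering, the sketch below rebuilds the same list in a self-contained way and counts how the 1326 combinations split into pairs, suited hands, and offsuit hands. The helper names <code>buildHandMap</code> and <code>countHandTypes</code> are ours and are not part of the original notebook.</p>

```python
from collections import Counter

NUM_RANKS = 13
NUM_SUITS = 4

def buildHandMap():
    """Rebuild the (higher rank, lower rank, isSuited) list in the same loop order as makeIntToHandMap."""
    result = []
    for r1 in range(NUM_RANKS):
        for r2 in range(r1, NUM_RANKS):
            for s1 in range(NUM_SUITS):
                for s2 in range(NUM_SUITS):
                    # Skip duplicate pairs and impossible holdings such as 2c2c.
                    if r1 == r2 and s1 >= s2:
                        continue
                    result.append((r2, r1, s1 == s2))
    return result

def countHandTypes(handMap):
    """Count pairs, suited hands, and offsuit hands (hypothetical helper)."""
    counts = Counter()
    for high, low, suited in handMap:
        if high == low:
            counts['pair'] += 1
        elif suited:
            counts['suited'] += 1
        else:
            counts['offsuit'] += 1
    return counts

handMap = buildHandMap()
counts = countHandTypes(handMap)
print(len(handMap), dict(counts))
```

<p>The counts should come out to 78 pairs (13 ranks times 6 suit combinations), 312 suited hands, and 936 offsuit hands, which sum to 1326.</p>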
<p>When the players get all-in, the amount of the pot they capture on average (their “equity”) is given by the rules of the game. The file <a href="/static/pf_eqs.dat">pf_eqs.dat</a> holds a numpy matrix <code>pfeqs</code> (see <a href="http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.savetxt.html">numpy.savetxt</a>) where <code>pfeqs[i,j]</code> is the equity of hand <script type="math/tex">i</script> when the opponent holds hand <script type="math/tex">j</script>.</p>
<p>Of course, sometimes two starting hands have a card in common, in which case they can’t both be dealt simultaneously, and it doesn’t make sense to ask for their equities. The file <a href="/static/pf_confl.dat">pf_confl.dat</a> holds another <script type="math/tex">1326 \times 1326</script> matrix where every entry is either 0 or 1. A 0 indicates that the hands conflict, and a 1 indicates that they don’t.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="n">pfeqs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s">"pf_eqs.dat"</span><span class="p">)</span>
<span class="n">pfconfl</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s">"pf_confl.dat"</span><span class="p">)</span></code></pre></figure>
<p>For example, since hand 56 is 6♦2♣, 57 is 6♥2♣, and 58 is 6♣2♠, we have:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">pfconfl</span><span class="p">[</span><span class="mi">56</span><span class="p">,</span><span class="mi">57</span><span class="p">]</span>
<span class="mf">0.0</span>
<span class="o">>>></span> <span class="n">pfconfl</span><span class="p">[</span><span class="mi">57</span><span class="p">,</span><span class="mi">58</span><span class="p">]</span>
<span class="mf">1.0</span>
<span class="o">>>></span> <span class="n">pfeqs</span><span class="p">[</span><span class="mi">57</span><span class="p">,</span><span class="mi">58</span><span class="p">]</span>
<span class="mf">0.49655199999999999</span></code></pre></figure>
<p>Why isn’t that last result precisely 0.5, by the way?</p>
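<p>If you don’t have the data files handy, the conflict entries can be reproduced from the hand ordering alone. The following sketch uses the same loop order as <code>makeIntToHandMap</code> but keeps the suits, then checks whether two holdings share a card. The names <code>makeIntToCardsMap</code> and <code>handsConflict</code> are hypothetical helpers introduced for this illustration.</p>

```python
ranks = ['2','3','4','5','6','7','8','9','T','J','Q','K','A']
suits = ['c','s','d','h']

def makeIntToCardsMap():
    """Map hand number -> (card, card) strings, using the same ordering as makeIntToHandMap."""
    result = []
    for r1 in range(len(ranks)):
        for r2 in range(r1, len(ranks)):
            for s1 in range(len(suits)):
                for s2 in range(len(suits)):
                    if r1 == r2 and s1 >= s2:
                        continue
                    result.append((ranks[r2] + suits[s2], ranks[r1] + suits[s1]))
    return result

def handsConflict(cardsA, cardsB):
    """True if the two holdings share at least one card."""
    return len(set(cardsA) & set(cardsB)) > 0

intToCards = makeIntToCardsMap()
for h in (56, 57, 58):
    print(h, intToCards[h])
print(handsConflict(intToCards[56], intToCards[57]))  # both hold the 2c
print(handsConflict(intToCards[57], intToCards[58]))  # no shared card
```

<p>Note the opposite sign convention: <code>handsConflict</code> returns <code>True</code> where <code>pfconfl</code> holds a 0.</p>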
<h2 id="reinforcement-learning">Reinforcement learning</h2>
<p>Now – a crash course on RL. There are three important components to an RL problem: state, action, and reward. They fit together as follows:</p>
<ol>
<li>We are in some <strong>state</strong> (i.e. the state of the world, which we observe).</li>
<li>We use that info to take some <strong>action</strong>.</li>
<li>We get some <strong>reward</strong>.</li>
<li>Repeat.</li>
</ol>
<p>We do this over and over again: observe state, take action, get reward, observe new state, take another action, get another reward, etc. The RL problem is simply to figure out how to choose actions to get as much reward as possible.</p>
<p>This turns out to be a pretty general framework. Lots of problems can be thought of in this way, and there are lots of different approaches to solving these problems. Generally, solutions involve wandering around, choosing various actions in various states, remembering which rewards were obtained, and then trying to do something smart with that info to make better choices in the future.</p>
<p>How does this apply to the shove/fold game? At any decision point, the player knows her hole cards and the position she’s in. This is the state. She can then take an action: either FOLD or GII. (GII is “get it in”! For the SB, GII means shove, and for the BB, GII is a call). And then rewards – this is money won, and we’ll use the players’ total stack sizes at the end of the hand. For example, if the initial stack size is <script type="math/tex">S=10</script>, the SB shoves and the BB folds, then the players’ rewards are 11 and 9, respectively.</p>
<p>We’ll find strategies for playing this game by simulating a bunch of hands. We’ll deal both players some random cards, let them make decisions about how to play, and then observe how much money they end up with at the end each time. We’ll use this info to learn (estimate) a function <script type="math/tex">Q(S,A)</script>. <script type="math/tex">Q</script> takes in a description of the state <script type="math/tex">S</script> (not to be confused with the stack size above) and an action <script type="math/tex">A</script> and outputs the value of taking that action in that state. Once we have <script type="math/tex">Q</script> (or some estimate thereof), making strategy choices is easy: we can just evaluate each of our options and see which one is better.</p>
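<p>To make the last point concrete, here is a minimal sketch of picking the better option once value estimates are available. The function <code>greedyAction</code> and the numbers in <code>qValues</code> are invented for illustration and are not results from the model.</p>

```python
def greedyAction(qValues):
    """Return the action whose estimated value is highest."""
    return max(qValues, key=qValues.get)

# Hypothetical value estimates (ending stack, in big blinds) for one state.
qValues = {'FOLD': 9.5, 'GII': 10.2}
print(greedyAction(qValues))
```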
<p>So, our job here is to estimate <script type="math/tex">Q</script>, and we’ll use <script type="math/tex">\hat{Q}</script> (pronounced “Q hat”) to refer to this estimate. We’ll start with some random initial guess for <script type="math/tex">\hat{Q}</script>. Then, we’ll simulate a bunch of hands where both players make decisions according to <script type="math/tex">\hat{Q}</script>. After each hand, we’ll adjust the estimate <script type="math/tex">\hat{Q}</script> to reflect the actual values the players got after taking particular actions in particular states. Eventually, we should end up with a pretty good estimate, which, again, is all we need to determine the players’ strategies.</p>
<p>There’s one wrinkle here – we need to make sure to take all the actions in all the states at least occasionally if we want to end up with good estimates of each possibility’s value. So, we’ll have the players act randomly some small fraction <script type="math/tex">\epsilon</script> of the time but otherwise use their (currently-estimated) best options. At first, we should explore our options a lot, frequently making a random choice. As time goes on, we’ll opt to exploit the knowledge we’ve gained more often. That is to say, <script type="math/tex">\epsilon</script> will shrink over time. There are lots of ways to do this. Here’s one:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Input:</span>
<span class="c"># nHands: total number of simulations we plan to run</span>
<span class="c"># i: current simulation number</span>
<span class="c"># Output:</span>
<span class="c"># Fraction of the time we should choose our action randomly.</span>
<span class="k">def</span> <span class="nf">epsilon</span><span class="p">(</span><span class="n">nHands</span><span class="p">,</span> <span class="n">i</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">nHands</span> <span class="o">-</span> <span class="n">i</span><span class="p">)</span> <span class="o">/</span> <span class="n">nHands</span></code></pre></figure>
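<p>For a feel of how this schedule behaves, the sketch below evaluates it at a few points of a hypothetical 1000-hand run. It relies on Python 3’s true division; under Python 2, integer division would make the result 0 for every <code>i > 0</code>.</p>

```python
def epsilon(nHands, i):
    # Linear decay from 1 (all random) towards 0 (all greedy).
    return (nHands - i) / nHands

nHands = 1000  # illustrative run length, not the value used in the post
for i in (0, 250, 500, 750, 999):
    print(i, epsilon(nHands, i))
```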
<p><script type="math/tex">Q</script> happens to be called <em>the action-value function</em>, since it gives the value of taking any particular action (from any state). It plays a central role in most RL methods. How exactly might we represent <script type="math/tex">\hat{Q}</script>? Evaluate it? Update it after every hand?</p>
<h3 id="features-inputs-to-hatq">Features: inputs to <script type="math/tex">\hat{Q}</script></h3>
<p>First, <script type="math/tex">\hat{Q}</script>’s inputs: the state and action. We could pass this info to our function for <script type="math/tex">Q</script> straightaway as the position (say, 1 for SB and 0 for BB), the hand number (from 0 to 1325), and the action (say, 1 for GII and 0 for FOLD). However, as we’ll see, we’ll get better results if we do a bit more legwork. Here, we’ll describe the state and action together with a vector of 7 numbers:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">nParams</span> <span class="o">=</span> <span class="mi">7</span>
<span class="c"># Input:</span>
<span class="c"># hand: int hand between 0 and 1325</span>
<span class="c"># isSB: boolean indicating whether position is SB, else BB</span>
<span class="c"># isGII: boolean indicating whether action is GII, else FOLD</span>
<span class="c"># Output:</span>
<span class="c"># numpy array containing features describing a state and action</span>
<span class="k">def</span> <span class="nf">phi</span><span class="p">(</span><span class="n">hand</span><span class="p">,</span> <span class="n">isSB</span><span class="p">,</span> <span class="n">isGII</span><span class="p">):</span>
    <span class="n">rank2</span><span class="p">,</span> <span class="n">rank1</span><span class="p">,</span> <span class="n">isSuited</span> <span class="o">=</span> <span class="n">intToHandMap</span><span class="p">[</span><span class="n">hand</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span>
                     <span class="n">rank2</span><span class="o">/</span><span class="n">numRanks</span> <span class="k">if</span> <span class="n">isGII</span> <span class="k">else</span> <span class="mi">0</span><span class="p">,</span>
                     <span class="n">rank1</span><span class="o">/</span><span class="n">numRanks</span> <span class="k">if</span> <span class="n">isGII</span> <span class="k">else</span> <span class="mi">0</span><span class="p">,</span>
                     <span class="nb">abs</span><span class="p">(</span><span class="n">rank2</span><span class="o">-</span><span class="n">rank1</span><span class="p">)</span><span class="o">**</span><span class="mf">0.25</span> <span class="k">if</span> <span class="n">isGII</span> <span class="k">else</span> <span class="mi">0</span><span class="p">,</span>
                     <span class="mi">1</span> <span class="k">if</span> <span class="p">(</span><span class="n">isSuited</span> <span class="ow">and</span> <span class="n">isGII</span><span class="p">)</span> <span class="k">else</span> <span class="mi">0</span><span class="p">,</span>
                     <span class="mi">1</span> <span class="k">if</span> <span class="n">isSB</span> <span class="k">else</span> <span class="mi">0</span><span class="p">,</span>
                     <span class="mi">1</span> <span class="k">if</span> <span class="n">isSB</span> <span class="ow">and</span> <span class="n">isGII</span> <span class="k">else</span> <span class="mi">0</span><span class="p">,</span>
                    <span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float64</span><span class="p">)</span></code></pre></figure>
<p>The vector <script type="math/tex">\phi</script> returned by <code>phi</code> here will be the input to <script type="math/tex">\hat{Q}</script>. It is known as the <em>feature vector</em>, and its individual entries are <em>features</em> (<script type="math/tex">\phi</script> is pronounced “fee” – get it?). As we’ll see, the features we choose can make a big difference in the quality of the result. Choosing features (known as “feature engineering”) is one place where we can leverage domain knowledge about our problem. It’s as much art as science. Here, we’ve encoded our knowledge about what info is relevant in this situation in several ways. Let’s take a look.</p>
<p>The first entry is always a 1, for convenience. Consider the next four entries. These represent the player’s hand. We’ve converted from hand number to <code>rank1</code>, <code>rank2</code>, and <code>isSuited</code>. These three variables technically give the same info as the hand number (neglecting the particular suits), but the model will make better use of the information in this format. In addition to the raw ranks, we’ve also included <code>abs(rank1-rank2)**0.25</code>. We happen to know that connectedness is an important property of hold’em hands, and that’s what this represents. Also, the model will learn better if all the features have about the same magnitudes. Here, all the features are approximately between 0 and 1, and we’ve divided the ranks by <code>numRanks</code> to make this so.</p>
<p>Finally, if <code>not isGII</code> (i.e. if the action is FOLD) we actually set each of these numbers to 0. We know that the particular holding doesn’t have any effect on the result when the player is folding (neglecting minor card removal effects), so we just remove the extraneous info in this case.</p>
<p>Now consider the final two entries. The first of these straightforwardly encodes the player’s position, but the second depends on both <code>isSB and isGII</code>. Why might this be? We’ll show the need for this “cross term” later.</p>
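<p>To see these design choices in action, the sketch below evaluates the same seven features for a couple of concrete holdings. So that it runs on its own, it takes the ranks and suitedness directly rather than a hand number; <code>phiFromRanks</code> is a hypothetical variant of <code>phi</code> written for this illustration.</p>

```python
import numpy as np

numRanks = 13

def phiFromRanks(rank2, rank1, isSuited, isSB, isGII):
    """Same features as phi, but taking rank indices directly (0 is a deuce, 12 is an ace)."""
    return np.array([1,
                     rank2/numRanks if isGII else 0,
                     rank1/numRanks if isGII else 0,
                     abs(rank2-rank1)**0.25 if isGII else 0,
                     1 if (isSuited and isGII) else 0,
                     1 if isSB else 0,
                     1 if (isSB and isGII) else 0,
                    ], dtype=np.float64)

# Ace-king suited (rank indices 12 and 11) in the SB, shoving and folding.
aksShove = phiFromRanks(12, 11, True, True, True)
aksFold = phiFromRanks(12, 11, True, True, False)
# Seven-deuce offsuit (rank indices 5 and 0) in the SB, folding.
sevenDeuceFold = phiFromRanks(5, 0, False, True, False)

print(aksShove)
print(aksFold)
# When folding, the hand-specific entries are zeroed, so the vectors match.
print(np.array_equal(aksFold, sevenDeuceFold))
```

<p>One thing this makes visible: the connectedness feature is not strictly bounded by 1. For the widest possible gap (an ace and a deuce) it is <code>12**0.25</code>, a bit under 1.9, which is still the same order of magnitude as the other entries.</p>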
<h3 id="a-linear-model-for-hatq">A linear model for <script type="math/tex">\hat{Q}</script></h3>
<p>We’re going to learn a linear function for our estimate <script type="math/tex">\hat{Q}</script>. This means we’ll actually learn a vector of parameters, usually called <script type="math/tex">\theta</script>, of the same length (7) as the feature vector. Then, we’ll evaluate our estimator <script type="math/tex">\hat{Q}</script> for a particular <script type="math/tex">\phi</script> with</p>
<script type="math/tex; mode=display">\hat{Q}(\phi;\theta) = \sum_{i=1}^7 \phi_i \cdot \theta_i</script>
<p>Here, the subscripts <script type="math/tex">i</script> refer to the particular elements of the vectors, and writing the argument list as <script type="math/tex">(\phi;\theta)</script> indicates that the value of <script type="math/tex">\hat{Q}</script> depends on both <script type="math/tex">\phi</script> and <script type="math/tex">\theta</script> but that we’re mostly thinking of it as a function of <script type="math/tex">\phi</script> with <script type="math/tex">\theta</script> fixed. It’s simple in code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Inputs:</span>
<span class="c"># theta: vector of parameters of our model</span>
<span class="c"># phi: vector of features</span>
<span class="c"># Output:</span>
<span class="c"># Qhat(phi; theta), an estimate of the action-value</span>
<span class="k">def</span> <span class="nf">evalModel</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">phi</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">theta</span> <span class="o">*</span> <span class="n">phi</span><span class="p">)</span></code></pre></figure>
<p>Though it’s commonly used, there’s nothing particularly fundamental about this scheme that makes it the right choice for this problem. It’s just one way (among many) to combine some learned parameters with some features to get an output, and it’s entirely up to us to find a vector <script type="math/tex">\theta</script> that makes this produce the output we want. However, with the right choice of <script type="math/tex">\theta</script>, this will give us a pretty good estimate of the value of taking a particular action in a particular position with a particular hand.</p>
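<p>As a small worked example, the sketch below evaluates the linear model for one feature vector and confirms that the elementwise-multiply-and-sum form is just a dot product. The numbers in <code>theta</code> are invented for illustration and are not learned values.</p>

```python
import numpy as np

def evalModel(theta, phi):
    return np.sum(theta * phi)

# Invented parameters, for illustration only.
theta = np.array([9.0, 1.0, 0.5, -0.5, 0.2, 0.0, 0.3])
# Feature vector for ace-king suited, SB, GII (see the feature definitions above).
phi = np.array([1.0, 12/13, 11/13, 1.0, 1.0, 1.0, 1.0])

q = evalModel(theta, phi)
print(q, np.dot(theta, phi))
```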
<h3 id="simulating-poker">Simulating poker</h3>
<p>We’re going to “play” a bunch of hands. We’ll put everything together to do this in the next couple of sections, but for now, let’s build three important pieces. These relate to the three important components of an RL problem: state, action, and reward. First, the state – each hand, we’ll start by dealing each player a random holding.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Input: N/A</span>
<span class="c"># Output: tuple of two random hand numbers representing hands</span>
<span class="c"># that don't conflict.</span>
<span class="k">def</span> <span class="nf">dealCards</span><span class="p">():</span>
    <span class="n">hand1</span> <span class="o">=</span> <span class="n">hand2</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="ow">not</span> <span class="n">pfconfl</span><span class="p">[</span><span class="n">hand1</span><span class="p">,</span> <span class="n">hand2</span><span class="p">]:</span>
        <span class="n">hand1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">numHoldemHands</span><span class="p">)</span>
        <span class="n">hand2</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">numHoldemHands</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">hand1</span><span class="p">,</span> <span class="n">hand2</span></code></pre></figure>
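<p><code>dealCards</code> uses rejection sampling: it draws two hand numbers independently and retries whenever they conflict. Each hand conflicts with 101 of the 1326 hands (itself included), so roughly 92% of draws are accepted on the first try. For comparison, here is a sketch of an alternative that never retries, drawing four distinct cards from a 52-card deck. It returns card indices rather than hand numbers, so it is not a drop-in replacement, and <code>dealCardsDirect</code> is a hypothetical name.</p>

```python
import numpy as np

def dealCardsDirect(rng):
    """Deal two non-overlapping 2-card hands as card indices in [0, 52)."""
    cards = rng.choice(52, size=4, replace=False)
    return tuple(cards[:2]), tuple(cards[2:])

rng = np.random.default_rng(0)
sbCards, bbCards = dealCardsDirect(rng)
print(sbCards, bbCards)
```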
<p>Second, we need actions. Each player will use the current model (given by <code>theta</code>) and knowledge of his hand (<code>hand</code>) and position (<code>isSB</code>) to choose an action. In the following function, we estimate the values of GII and FOLD (<code>qGII</code> and <code>qFOLD</code>, resp.). We then choose the best option <script type="math/tex">(1-\epsilon)</script> of the time and otherwise choose randomly. We return the action taken, as well as the corresponding value estimate and feature vector which we’ll need later.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Input:</span>
<span class="c"># theta: parameter for current model Qhat</span>
<span class="c"># hand: hand number</span>
<span class="c"># isSB: boolean position</span>
<span class="c"># epsilon: chance of making a random move</span>
<span class="c"># Output:</span>
<span class="c"># A tuple of form (isGII, qhat, phi) describing the action</span>
<span class="c"># taken, its value, and its feature vector.</span>
<span class="k">def</span> <span class="nf">act</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">hand</span><span class="p">,</span> <span class="n">isSB</span><span class="p">,</span> <span class="n">epsilon</span><span class="p">):</span>
    <span class="n">phiGII</span> <span class="o">=</span> <span class="n">phi</span><span class="p">(</span><span class="n">hand</span><span class="p">,</span> <span class="n">isSB</span><span class="p">,</span> <span class="bp">True</span><span class="p">)</span>
    <span class="n">phiFOLD</span> <span class="o">=</span> <span class="n">phi</span><span class="p">(</span><span class="n">hand</span><span class="p">,</span> <span class="n">isSB</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
    <span class="n">qGII</span> <span class="o">=</span> <span class="n">evalModel</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">phiGII</span><span class="p">)</span>
    <span class="n">qFOLD</span> <span class="o">=</span> <span class="n">evalModel</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">phiFOLD</span><span class="p">)</span>
    <span class="n">isGII</span> <span class="o">=</span> <span class="n">qGII</span> <span class="o">></span> <span class="n">qFOLD</span>
    <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">()</span> <span class="o"><</span> <span class="n">epsilon</span><span class="o">/</span><span class="mi">2</span><span class="p">:</span>
        <span class="n">isGII</span> <span class="o">=</span> <span class="ow">not</span> <span class="n">isGII</span>
    <span class="k">if</span> <span class="n">isGII</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">isGII</span><span class="p">,</span> <span class="n">qGII</span><span class="p">,</span> <span class="n">phiGII</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">isGII</span><span class="p">,</span> <span class="n">qFOLD</span><span class="p">,</span> <span class="n">phiFOLD</span></code></pre></figure>
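<p>One detail worth checking: <code>act</code> flips the greedy choice with probability <code>epsilon/2</code> rather than literally choosing at random with probability <code>epsilon</code>. With only two actions these are the same thing, since a uniformly random pick lands on the non-greedy action half the time. The following sketch compares the two schemes empirically; the helper names and the seed are our own choices.</p>

```python
import numpy as np

def chooseFlip(greedy, eps, rng):
    """Flip the greedy choice with probability eps/2 (the scheme used in act)."""
    return (not greedy) if rng.random() < eps / 2 else greedy

def chooseUniform(greedy, eps, rng):
    """With probability eps pick uniformly between the two actions; otherwise act greedily."""
    if rng.random() < eps:
        return bool(rng.integers(2))
    return greedy

rng = np.random.default_rng(0)
eps, n = 0.3, 100000
# Fraction of the time each scheme ends up taking the non-greedy action.
flipRate = np.mean([chooseFlip(True, eps, rng) is False for _ in range(n)])
uniformRate = np.mean([chooseUniform(True, eps, rng) is False for _ in range(n)])
print(flipRate, uniformRate)
```

<p>Both rates should land close to <code>eps/2 = 0.15</code>.</p>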
<p>Third, once we know each player’s cards and action, we simulate the rest of the hand to get the players’ rewards. If either player folded, we can immediately return the correct values. Otherwise, we reference the players’ equities to choose a random winner the right fraction of the time.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Input:</span>
<span class="c"># S: stack size at the beginning of the hand</span>
<span class="c"># sbHand: SB hand number</span>
<span class="c"># sbIsGII: boolean indicating SB's action</span>
<span class="c"># bbHand: BB hand number</span>
<span class="c"># bbIsGII: boolean indicating BB's action</span>
<span class="c"># Output:</span>
<span class="c"># A tuple of the form (SB value, BB value) indicating each player's</span>
<span class="c"># stack size at the end of the hand.</span>
<span class="k">def</span> <span class="nf">simulateHand</span><span class="p">(</span><span class="n">S</span><span class="p">,</span> <span class="n">sbHand</span><span class="p">,</span> <span class="n">sbIsGII</span><span class="p">,</span> <span class="n">bbHand</span><span class="p">,</span> <span class="n">bbIsGII</span><span class="p">):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">sbIsGII</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">S</span><span class="o">-</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">S</span><span class="o">+</span><span class="mf">0.5</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">bbIsGII</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">S</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">S</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="c"># GII. Note: neglecting chops!</span>
    <span class="n">sbEquity</span> <span class="o">=</span> <span class="n">pfeqs</span><span class="p">[</span><span class="n">sbHand</span><span class="p">,</span> <span class="n">bbHand</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">()</span> <span class="o"><</span> <span class="n">sbEquity</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">(</span><span class="mi">2</span><span class="o">*</span><span class="n">S</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">S</span><span class="p">)</span></code></pre></figure>
<p>We’ve cheated a little here in the case where the players get all-in. Instead of actually simulating the game by dealing out a 5 card board and evaluating the players’ hands to see who won, we instead just choose a random winner according to the pre-calculated probabilities. This is mathematically equivalent (neglecting chops which don’t matter here); it’s just a bit more convenient and computationally efficient.</p>
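<p>To illustrate why sampling a winner from the equity gives the right values on average, the sketch below repeats the all-in branch many times and compares the SB’s average ending stack with the expected value <code>2*S*sbEquity</code>. It uses the equity 0.496552 from the <code>pfeqs[57,58]</code> example above; the function name <code>allInPayoffs</code>, the seed, and the sample size are our own choices.</p>

```python
import numpy as np

def allInPayoffs(S, sbEquity, rng):
    """All-in branch of simulateHand: the SB wins the whole pot with probability sbEquity."""
    if rng.random() < sbEquity:
        return (2*S, 0)
    return (0, 2*S)

rng = np.random.default_rng(0)
S, sbEquity, n = 10, 0.496552, 200000
sbMean = np.mean([allInPayoffs(S, sbEquity, rng)[0] for _ in range(n)])
print(sbMean, 2 * S * sbEquity)
```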
<p>Most importantly, our learning process isn’t taking advantage of these equities or knowledge about how the game “works”. As we’ll see shortly, the learning process would proceed exactly the same if we did bother to do the full simulation, or indeed even if our agent were interacting with some external, black-box poker game system which might even behave according to some different rules! Now, speaking of the learning process, how does that work exactly?</p>
<h3 id="learning-updating-hatq">Learning: updating <script type="math/tex">\hat{Q}</script></h3>
<p>After a hand is over, we need to update <code>theta</code>. For each player, we have the state observed and the action taken. We also have the estimated value of the action as well as the actual rewards obtained from the game. In some sense, the actual reward obtained is the “correct answer”, and if the value we estimated is different than this, there’s an error in our model. We want to update <code>theta</code> to make <script type="math/tex">\hat{Q}(\phi;\theta)</script> a little closer to this correct answer.</p>
<p>Let <script type="math/tex">\phi</script> be a particular state seen by one of the players and <script type="math/tex">R</script> the actual reward she obtained. Let <script type="math/tex">L = (R - \hat{Q}(\phi;\theta))^2</script>. <script type="math/tex">L</script> here is known as the <em>loss function</em>. We’ve constructed <script type="math/tex">L</script> so that the smaller it is, the closer <script type="math/tex">R</script> is to <script type="math/tex">\hat{Q}(\phi;\theta)</script>, and if <script type="math/tex">L</script> is 0, then <script type="math/tex">\hat{Q}</script> exactly equals <script type="math/tex">R</script>. So, we want to find a small adjustment to <script type="math/tex">\theta</script> that makes <script type="math/tex">L</script> somewhat smaller. (Note that there are many possible loss functions that do what we want: get smaller as <script type="math/tex">\hat{Q}</script> gets closer to <script type="math/tex">R</script>. Here we’ve just made a common choice.)</p>
<p>So, “updating <script type="math/tex">\hat{Q}</script>” means changing <script type="math/tex">\theta</script> to make <script type="math/tex">L</script> smaller. There’s more than one way to do this as well, but a simple approach is known as <em>stochastic gradient descent</em>. Wikipedia has the <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">details</a>, but in short, the rule for updating <script type="math/tex">\theta</script> is:</p>
<script type="math/tex; mode=display">\theta \leftarrow \theta - \frac{\alpha}{2} \nabla_\theta L</script>
<p>We get to choose the “hyperparameter” <script type="math/tex">\alpha</script> (known as the <em>learning rate</em>) which controls the size of the updates we make. If <script type="math/tex">\alpha</script> is too small, learning is slow, but if it’s too large, the process may not converge. Plugging in <script type="math/tex">L</script> to this update rule and doing a couple lines of calculus, we get</p>
<script type="math/tex; mode=display">\theta \leftarrow \theta - \frac{\alpha}{2} \nabla_\theta (R - \hat{Q}(\phi;\theta))^2</script>
<script type="math/tex; mode=display">\theta \leftarrow \theta - \frac{\alpha}{2} (-2) (R - \hat{Q}(\phi;\theta)) \nabla_\theta \hat{Q}(\phi;\theta)</script>
<script type="math/tex; mode=display">\theta \leftarrow \theta + \alpha (R - \hat{Q}(\phi;\theta)) \phi</script>
<p>The last line gives us the version of the update rule that we’ll code. Keep in mind that both <script type="math/tex">\theta</script> and <script type="math/tex">\phi</script> here are vectors of length <script type="math/tex">7</script>. This update rule applies to each element individually.</p>
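<p>Here is a small sketch of a single update, written with plain Python lists so it runs on its own. The starting <code>theta</code>, the learning rate, and the helper names <code>q_hat</code> and <code>sgd_update</code> are made up for illustration. For a linear model, one step multiplies the error <script type="math/tex">R - \hat{Q}</script> on that same example by <script type="math/tex">1 - \alpha\,\phi\cdot\phi</script>, so the loss shrinks as long as <script type="math/tex">\alpha</script> is small enough.</p>

```python
def q_hat(theta, phi):
    """Linear model: dot product of weights and features."""
    return sum(t * p for t, p in zip(theta, phi))

def sgd_update(theta, phi, reward, alpha):
    """Return theta + alpha * (R - Q_hat) * phi, applied element-wise."""
    error = reward - q_hat(theta, phi)
    return [t + alpha * error * p for t, p in zip(theta, phi)]

# Hypothetical numbers, for illustration only.
theta = [0.5, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7]
phi = [1, 0, 0, 0, 0, 1, 0]   # features for the SB folding
reward = 9.5                  # the SB keeps S - 0.5 = 9.5 when S = 10
alpha = 0.1

loss_before = (reward - q_hat(theta, phi)) ** 2
theta_new = sgd_update(theta, phi, reward, alpha)
loss_after = (reward - q_hat(theta_new, phi)) ** 2

# Only the entries whose feature is nonzero have moved.
print(theta_new)
print(loss_before, loss_after)
```

<p>Here <script type="math/tex">\phi\cdot\phi = 2</script>, so the error drops from 8.4 to 8.4 × (1 − 0.2) = 6.72 after one step.</p>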
<h3 id="putting-it-all-together">Putting it all together</h3>
<p>Finally, it’s time to put everything together. We’ll repeatedly:</p>
<ol>
<li>Deal each player a random hand.</li>
<li>Let them each choose an action.</li>
<li>Get the results.</li>
<li>Update the model using the state-action-result tuples we observe.</li>
</ol>
<p>The following function implements this Monte Carlo algorithm and returns the parameters <code>theta</code> of the model we learn.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Input:</span>
<span class="c">#   S: effective stack size in BB</span>
<span class="c">#   nHands: number of random hands to play</span>
<span class="c">#   alpha: learning rate hyperparameter</span>
<span class="c"># Output:</span>
<span class="c">#   A 7-vector of weights parameterizing our linear model</span>
<span class="k">def</span> <span class="nf">mc</span><span class="p">(</span><span class="n">S</span><span class="p">,</span> <span class="n">nHands</span><span class="p">,</span> <span class="n">alpha</span><span class="p">):</span>
    <span class="c"># Start with a random guess for theta.</span>
    <span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">nParams</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nHands</span><span class="p">):</span>
        <span class="n">sbHand</span><span class="p">,</span> <span class="n">bbHand</span> <span class="o">=</span> <span class="n">dealCards</span><span class="p">()</span>
        <span class="c"># SB action</span>
        <span class="n">sbIsGII</span><span class="p">,</span> <span class="n">sbQhat</span><span class="p">,</span> <span class="n">sbPhi</span> <span class="o">=</span> <span class="n">act</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">sbHand</span><span class="p">,</span> <span class="bp">True</span><span class="p">,</span> <span class="n">epsilon</span><span class="p">(</span><span class="n">nHands</span><span class="p">,</span> <span class="n">i</span><span class="p">))</span>
        <span class="c"># BB action</span>
        <span class="n">bbIsGII</span><span class="p">,</span> <span class="n">bbQhat</span><span class="p">,</span> <span class="n">bbPhi</span> <span class="o">=</span> <span class="n">act</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">bbHand</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="n">epsilon</span><span class="p">(</span><span class="n">nHands</span><span class="p">,</span> <span class="n">i</span><span class="p">))</span>
        <span class="c"># get result from environment</span>
        <span class="n">sbReward</span><span class="p">,</span> <span class="n">bbReward</span> <span class="o">=</span> <span class="n">simulateHand</span><span class="p">(</span><span class="n">S</span><span class="p">,</span> <span class="n">sbHand</span><span class="p">,</span> <span class="n">sbIsGII</span><span class="p">,</span> <span class="n">bbHand</span><span class="p">,</span> <span class="n">bbIsGII</span><span class="p">)</span>
        <span class="c"># update the model using each player's results</span>
        <span class="n">theta</span> <span class="o">+=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="n">sbReward</span><span class="o">-</span><span class="n">sbQhat</span><span class="p">)</span> <span class="o">*</span> <span class="n">sbPhi</span>
        <span class="n">theta</span> <span class="o">+=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="n">bbReward</span><span class="o">-</span><span class="n">bbQhat</span><span class="p">)</span> <span class="o">*</span> <span class="n">bbPhi</span>
    <span class="k">return</span> <span class="n">theta</span></code></pre></figure>
<p>Notice in particular how the update rule derived in the previous section was expressed in code.</p>
<h2 id="results">Results</h2>
<h3 id="interpreting-the-model">Interpreting the model</h3>
<p>We’ll fix <script type="math/tex">S=10</script> for the sake of the example.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">theta</span> <span class="o">=</span> <span class="n">mc</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10000000</span><span class="p">,</span> <span class="mf">0.0001</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span>
<span class="p">[</span> <span class="mf">9.63250365</span><span class="p">,</span> <span class="mf">6.16764962</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.34843332</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.72533963</span><span class="p">,</span> <span class="mf">0.22571655</span><span class="p">,</span>
<span class="o">-</span><span class="mf">0.15230302</span><span class="p">,</span> <span class="mf">0.14547532</span><span class="p">]</span></code></pre></figure>
<p>We have numbers! Do they make sense? There are actually a couple ways we can sanity check these and gain some intuition for what our model is telling us.</p>
<p>First, let’s consider some specific situations. What’s the estimated value for the SB when she plays FOLD? This is easy to find because <script type="math/tex">\phi</script> is quite simple in this case. In fact, all its entries are 0 except for the first (fixed to 1) and the sixth (corresponding to isSB): <code>phi = [1, 0, 0, 0, 0, 1, 0]</code>. So, evaluating our linear model just amounts to adding the first and sixth entries of <code>theta</code>:</p>
<script type="math/tex; mode=display">\hat{Q}(\text{SB, FOLD}) = 9.63250365 + (-0.15230302) = 9.48020063</script>
<p>Now, we actually know that the value for the SB when she chooses to fold is exactly 9.5 according to the rules of the game. So, cool – our model is pretty close! This is a nice sanity check and gives us one example of the magnitude of errors our model might be making.</p>
<p>Take another specific situation: BB folding. Only the first entry of phi is nonzero here, and we find an estimated value <script type="math/tex">\hat{Q}(\text{BB, FOLD}) = 9.63250365</script>. It’s not as clear what the right answer should be here, except that it should certainly be between 10.5 (as it would be if SB always plays FOLD) and 9 (what it’d be if SB always plays GII). And indeed, it is, and it’s somewhat closer to 9 than 10.5, which is consistent with the SB playing GII more than FOLD.</p>
<p>There’s a more general way to think about each individual entry of <script type="math/tex">\theta</script>. An entry <script type="math/tex">\theta_i</script> is the increment to <script type="math/tex">\hat{Q}</script> due to increasing the corresponding feature <script type="math/tex">\phi_i</script> by 1. For example, having a suited hand when playing GII increases the fifth entry of <script type="math/tex">\phi</script> by exactly one. Thus, the estimated benefit of having the suited hand is 0.22571655 – a small, positive benefit. Seems reasonable.</p>
<p>The second entry of <script type="math/tex">\theta</script> is 6.16764962. It is the weight on the feature <code>rank2/numRanks if isGII else 0</code>, which encodes the player’s higher-ranked card when playing GII. We divide here by <code>numRanks</code>, so an increment of 1 in the feature is approximately the difference between a 2 and an ace. An extra 6 BB for getting it in with an ace rather than a 2 seems reasonable. (But, why do you think that having a higher second card is apparently negative?!)</p>
<p>Examining the entry of <script type="math/tex">\theta</script> corresponding to the sixth feature (<code>1 if isSB else 0</code>), the additional value of being in the SB is apparently -0.15230302, if all other features are equal. We might interpret this as the positional disadvantage: the small penalty that comes from having to act first.</p>
<p>However, all else isn’t necessarily equal. If the SB is playing GII, the last feature becomes active as well. So, -0.15230302 is the additional value of being in the SB only when playing FOLD. When playing GII, we include the contribution from the last feature also to find a benefit of <script type="math/tex">-0.15230302 + 0.14547532 = -0.0068277</script>. Apparently the positional disadvantage is less when the SB takes the more aggressive option!</p>
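<p>These sanity checks are easy to reproduce. The sketch below plugs the learned <code>theta</code> printed above into a stand-in for the linear model (the name <code>q_hat</code> is ours, standing in for <code>evalModel</code>) and assumes the feature layout described in the text: a constant 1 first, <code>isSB</code> sixth, and <code>isSB and isGII</code> seventh.</p>

```python
# Learned weights, copied from the output printed above.
theta = [9.63250365, 6.16764962, -1.34843332, -2.72533963,
         0.22571655, -0.15230302, 0.14547532]

def q_hat(theta, phi):
    """Linear model: dot product of weights and features."""
    return sum(t * p for t, p in zip(theta, phi))

# When folding, all hand-related features are 0.
phi_sb_fold = [1, 0, 0, 0, 0, 1, 0]
phi_bb_fold = [1, 0, 0, 0, 0, 0, 0]

q_sb_fold = q_hat(theta, phi_sb_fold)   # about 9.48 (true value: 9.5)
q_bb_fold = q_hat(theta, phi_bb_fold)   # about 9.63

# Extra value of being the SB, holding the other features fixed.
sb_offset_fold = theta[5]               # about -0.152
sb_offset_gii = theta[5] + theta[6]     # about -0.007

print(q_sb_fold, q_bb_fold, sb_offset_fold, sb_offset_gii)
```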
<p>As we see here, choosing features that are meaningful in our problem domain can help us to meaningfully interpret our results. Interestingly, there’s an old scheme for playing shove/fold scenarios known as <a href="http://pokerartikelen.blogspot.com/2007/06/are-you-sage-getting-edge-in-heads-up.html">SAGE</a>. It was constructed to be easy to remember at the table during live tournament play. The idea is to construct the “power index” for your hand which has contributions for rank, suitedness, and pair and then use that to decide whether to GII or not. How does their set of features compare to ours? How about their results?</p>
<p>Finally, why did we choose the last feature to depend on <code>isSB and isGII</code> rather than just <code>isGII</code>? Think about it as follows. The estimated value of (BB, FOLD) is simply the first entry of <script type="math/tex">\theta</script>, so this first entry needs to be free to change to whatever it needs to be in order to get the correct value for (BB, FOLD). Then, the sixth entry is the additional contribution for being in the SB, and it needs to be free to vary to get (SB, FOLD) right.</p>
<p>Once we switch from FOLD to GII, entries 2-5 become active and adjust the value for the player’s particular hand, but these contributions apply equally to both the SB and BB. The model needs some way to give a different contribution for the SB getting all-in as opposed to the BB doing so.</p>
<p>Suppose our final feature were just <code>1 if isGII else 0</code>. This doesn’t depend on the player, so the only difference in estimated value between SB and BB would be due to the <code>isSB</code> term. This single number would have to account for both the difference between SB and BB when playing FOLD as well as the difference between SB and BB when playing GII. The model would be forced to pick a single number for both of these differences, and it would probably end up with some poor compromise between them. Instead, we need <code>1 if isGII and isSB else 0</code>. With this, the model can differentiate between the incremental value of the SB GII vs the BB GII.</p>
<p>Note that there are still lots of subtle details that this model can’t capture. For example, it is entirely built-in to the functional form of our model that the <em>difference</em> in estimated values of GII with two particular hands, e.g. A2 and K2, is exactly the same for the SB as for the BB. It’s impossible for our model to predict otherwise, regardless of the values of <script type="math/tex">\theta</script>.</p>
<p>We say that such a model has <em>high bias</em>. Basically, it’s inflexible and has a strong built-in “opinion” about what the result will look like. This is why the feature engineering was so important. If we hadn’t made some attempt at providing the algorithm with well-crafted features, it probably wouldn’t have even been capable of representing a good solution at all.</p>
<p>We can add more features such as other cross terms to get a lower bias model, but this might come with downsides. We’d lose interpretability very quickly, and we may hit upon more technical problems such as overfitting as well. (Of course, in most modern applications, this ship has sailed. Accuracy is more important than interpretability, and there are ways to deal with overfitting).</p>
<h3 id="visualizing-the-strategies">Visualizing the strategies</h3>
<p>To find the complete strategies, we’ll evaluate the model to see whether GII or FOLD is better for each of the 1326 hand combinations, for each player:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">sbGIIRange</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">numHoldemHands</span><span class="p">)</span>
<span class="n">bbGIIRange</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">numHoldemHands</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">numHoldemHands</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">isSB</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="p">[(</span><span class="bp">True</span><span class="p">,</span> <span class="n">sbGIIRange</span><span class="p">),</span> <span class="p">(</span><span class="bp">False</span><span class="p">,</span> <span class="n">bbGIIRange</span><span class="p">)]:</span>
        <span class="n">qHatGII</span> <span class="o">=</span> <span class="n">evalModel</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">phi</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">isSB</span><span class="p">,</span> <span class="bp">True</span><span class="p">))</span>
        <span class="n">qHatFOLD</span> <span class="o">=</span> <span class="n">evalModel</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">phi</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">isSB</span><span class="p">,</span> <span class="bp">False</span><span class="p">))</span>
        <span class="n">r</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">qHatGII</span> <span class="o">></span> <span class="n">qHatFOLD</span> <span class="k">else</span> <span class="mi">0</span></code></pre></figure>
<p>It looks like the SB is shoving with about 55% of hands, and the BB calls about 49% of the time:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">sbGIIRange</span><span class="p">)</span><span class="o">/</span><span class="n">numHoldemHands</span><span class="p">)</span>
<span class="mf">0.553544494721</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">bbGIIRange</span><span class="p">)</span><span class="o">/</span><span class="n">numHoldemHands</span><span class="p">)</span>
<span class="mf">0.485671191554</span></code></pre></figure>
<p>Finally, we can generate some SVGs to draw the GII ranges themselves in a Jupyter environment:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">SVG</span><span class="p">,</span> <span class="n">display</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="k">def</span> <span class="nf">drawRange</span><span class="p">(</span><span class="n">r</span><span class="p">):</span>
    <span class="n">weights</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
    <span class="n">counts</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">numHoldemHands</span><span class="p">):</span>
        <span class="n">rank2</span><span class="p">,</span> <span class="n">rank1</span><span class="p">,</span> <span class="n">isSuited</span> <span class="o">=</span> <span class="n">intToHandMap</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="n">hand</span> <span class="o">=</span> <span class="n">ranks</span><span class="p">[</span><span class="n">rank2</span><span class="p">]</span> <span class="o">+</span> <span class="n">ranks</span><span class="p">[</span><span class="n">rank1</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="s">'s'</span> <span class="k">if</span> <span class="n">isSuited</span> <span class="k">else</span> <span class="s">'o'</span><span class="p">)</span>
        <span class="n">weights</span><span class="p">[</span><span class="n">hand</span><span class="p">]</span> <span class="o">+=</span> <span class="n">r</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="n">counts</span><span class="p">[</span><span class="n">hand</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="n">svg</span> <span class="o">=</span> <span class="s">'<svg xmlns="http://www.w3.org/2000/svg" version="1.1" width="325" height="325">'</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)):</span>
            <span class="k">if</span> <span class="n">i</span><span class="o"><</span><span class="n">j</span><span class="p">:</span>
                <span class="n">hand</span> <span class="o">=</span> <span class="n">ranks</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="o">+</span><span class="n">ranks</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="s">'s'</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">hand</span> <span class="o">=</span> <span class="n">ranks</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">ranks</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="o">+</span><span class="s">'o'</span>
            <span class="n">frac</span> <span class="o">=</span> <span class="n">weights</span><span class="p">[</span><span class="n">hand</span><span class="p">]</span> <span class="o">/</span> <span class="n">counts</span><span class="p">[</span><span class="n">hand</span><span class="p">]</span>
            <span class="n">hexcolor</span> <span class="o">=</span> <span class="s">'#</span><span class="si">%02</span><span class="s">x</span><span class="si">%02</span><span class="s">x</span><span class="si">%02</span><span class="s">x'</span> <span class="o">%</span> <span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="mi">255</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">frac</span><span class="p">)),</span> <span class="mi">255</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="mi">255</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">frac</span><span class="p">)))</span>
            <span class="n">svg</span> <span class="o">+=</span> <span class="s">'<rect x="'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)</span><span class="o">-</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="mi">25</span><span class="p">)</span> <span class="o">+</span> <span class="s">'" y="'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)</span><span class="o">-</span><span class="n">j</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="mi">25</span><span class="p">)</span> \
                <span class="o">+</span> <span class="s">'" width="25" height="25" fill="'</span><span class="o">+</span><span class="n">hexcolor</span><span class="o">+</span><span class="s">'"></rect>'</span>
            <span class="n">svg</span> <span class="o">+=</span> <span class="s">'<text x="'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)</span><span class="o">-</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="mi">25</span><span class="p">)</span><span class="o">+</span><span class="s">'" y="'</span><span class="o">+</span><span class="nb">str</span><span class="p">(((</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)</span><span class="o">-</span><span class="n">j</span><span class="p">))</span><span class="o">*</span><span class="mi">25</span><span class="o">-</span><span class="mi">10</span><span class="p">)</span> \
                <span class="o">+</span> <span class="s">'" font-size="11" >'</span> <span class="o">+</span> <span class="n">hand</span> <span class="o">+</span> <span class="s">'</text>'</span>
    <span class="n">svg</span> <span class="o">+=</span> <span class="s">'</svg>'</span>
    <span class="n">display</span><span class="p">(</span><span class="n">SVG</span><span class="p">(</span><span class="n">svg</span><span class="p">))</span>
<span class="n">drawRange</span><span class="p">(</span><span class="n">sbGIIRange</span><span class="p">)</span>
<span class="n">drawRange</span><span class="p">(</span><span class="n">bbGIIRange</span><span class="p">)</span></code></pre></figure>
<p align="center">
<img alt="SB shoving range" src="/images/rl_linear_sb_10bb.svg" />
<img alt="BB calling range" src="/images/rl_linear_bb_10bb.svg" />
</p>
<p>How’d we do? A lot of qualitative features we expect are here: big cards are good, pairs are good, suited hands are somewhat better than unsuited ones, the SB plays looser than the BB, etc. However, borderline hands are sometimes played differently than in the true equilibrium strategy.</p>
<h2 id="conclusions">Conclusions</h2>
<p>A fairly introductory use of RL techniques got us some fairly reasonable strategies for playing the shove/fold game. The learning process didn’t rely on any knowledge of the structure or rules of the game. It occurred purely by having the agent play itself, observing the results, and using them to make better decisions in the future. On the other hand, significant feature engineering requiring some domain expertise was necessary to learn a good model.</p>
<p>Finally, a bit of context. Many problems can be expressed as RL challenges, and there are many different ways to approach them as well. The solution here might be characterized as model-free, value-based, Monte Carlo, on policy, undiscounted, and using a linear function approximator.</p>
<ul>
<li><strong>Model-free</strong>: Our agent learned simply by taking actions and observing rewards. It didn’t require any <em>a priori</em> knowledge about how those rewards were generated (e.g. knowledge about things like ranges, equities, or even the rules of the game) nor did it try to learn such things on the fly. In poker, we actually do know a lot about how certain hands and actions lead to particular rewards (and we could have taken advantage of this), but that’s not the case in many other applications.</li>
<li><strong>Value-based</strong>: We focused on finding the values of each action in each state and then the actual policy (i.e. strategy) was more or less an afterthought. There are also policy-based methods (such as fictitious play), where the focus is on directly learning the action to take in each spot.</li>
<li><strong>Monte Carlo</strong>: We sampled entire hands (episodes) and learned based on the values we got at the end of the hand. “Temporal difference” methods make estimates of expected values in all the intermediate states before the hand is over and can learn more efficiently using those. Given that each player only makes a single action in the shove/fold game before it ends, this wasn’t important for us, but it can make a big difference in problems with more states.</li>
<li><strong>On policy</strong>: We estimated the values of the same strategies that our players were playing. This is actually somewhat problematic. Because the players sometimes took a random (non-optimal) action, the values we estimated were not quite the values of the optimal strategy, which is what we’d really like. More sophisticated “off policy” methods can actually learn about the optimal policy even while exploring non-optimal choices.</li>
<li><strong>Undiscounted</strong>: Most RL problems involve situations where there are many (possibly infinitely many!) states from the beginning to end of an episode. Of course, in that case the agent is looking to maximize the sum of all future rewards rather than its immediate reward. In such problems, future rewards are often <em>discounted</em>: the agent is assumed to have a small preference for getting a reward now rather than sometime in the future. A hand of shove/fold play is always very short, so we didn’t need to worry about this.</li>
<li><strong>Linear function approximator</strong>: We learned a linear function to map from our representation of the state-action pair to the value. Alternatives include simple tables which store a separate estimate of the value of every action in every state as well as many other types of function approximators. Neural networks, in particular, have been very successful. To some degree, this is because they don’t require much feature engineering to get good results. Neural nets can often learn both a good set of features and how to use them! But that’s a topic for another day.</li>
</ul>
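<p>To make the “undiscounted” point concrete, here is a minimal sketch of how a discounted return is usually computed. The function name <code>discounted_return</code> and the reward sequences are made up for illustration. With a discount factor of 1 it reduces to the plain sum of rewards, and for a one-step episode like a shove/fold hand the discount factor doesn’t matter at all.</p>

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, with the reward at step t weighted by gamma**t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A longer, hypothetical episode: the only reward arrives at the third step.
long_episode = [0.0, 0.0, 10.0]
undiscounted = discounted_return(long_episode, 1.0)   # plain sum
discounted = discounted_return(long_episode, 0.9)     # later reward counts for less

# A shove/fold hand has a single reward, so gamma has no effect.
one_step = discounted_return([9.5], 0.9)

print(undiscounted, discounted, one_step)
```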
<p>References:</p>
<ul>
<li><a href="http://incompleteideas.net/sutton/book/the-book-2nd.html">Textbook by Sutton and Barto</a></li>
<li><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">Lectures by David Silver</a></li>
</ul>
Tue, 06 Jun 2017 00:00:00 -0700
http://willtipton.com/coding/poker/2017/06/06/shove-fold-with-reinforcement-learning.html
coding, poker
Games, Strategies, and GTO Strategies
<p><em>This is Part 1 of 6 of an adaptation of my chapter “Game Theory Optimal Strategies: What Are They Good For?” from <a href="http://amzn.to/1QHNNHY">Excelling at No-Limit Hold’em</a> edited by Jonathan Little.</em></p>
<p>Much of the reason I wrote Expert Heads Up NLHE was to explain the ideas of game theory, poorly understood in the community at the time, to the average poker player. Heads up no limit (HUNL) is my game of choice personally, so it made sense to use it as the primary example. However, HUNL is something of a simple case, and there’s a bit more to be said about how game theory applies to other games. In this chapter, I’ll give a quick introduction to game theory as it applies to a variety of common poker formats. We’ll see when it’s useful, and more importantly, when it’s not – when it’s appropriate to use game theory-inspired strategies, and when it just can’t really guide our play. I promise to cover a practical skill or two as well.<!--more--></p>
<h3 id="games-strategies-and-gto-strategies">Games, Strategies, and GTO Strategies</h3>
<p>So what is <strong>game theory optimal</strong> (GTO) play? First of all, people tend to get hung up on the word <em>optimal</em>, so I want to dispel some common misconceptions. Imagine this – there’s some mathematician. She’s made up some potentially useful concept with a moderately complicated definition, and she wants to discuss it with other people. What does she do? Well, first, she probably needs to give her concept a name. That way, she can just say, “Suppose I have a continuous function <script type="math/tex">f(x)</script>” instead of “Suppose I have a function <script type="math/tex">f(x)</script> such that, at every point <script type="math/tex">a</script> in its domain, the limit of <script type="math/tex">f(x)</script> as <script type="math/tex">x</script> approaches <script type="math/tex">a</script> in the domain equals <script type="math/tex">f(a)</script>.” Much easier, right? Now don’t worry – you don’t need to know anything about functions, continuous or otherwise, to read this chapter. The point is that the word “continuous” wasn’t made up from scratch – it was a pre-existing word in spoken English that means something only vaguely related to what the mathematician actually wants you to think about when you hear “continuous”.</p>
<p>The “O” in GTO is like that. There’s a very specific technical definition for “GTO strategy” which we’ll get to shortly. We could have decided to call these strategies crunchy or yellow or Vulcan, but hopefully game theory optimal is a little more evocative of what we mean, even if it isn’t perfect. So please forget any preconceived bias you have about the word optimal. In this chapter, GTO means exactly the following, no more and no less.</p>
<p>Ok so suppose you have some players playing a game, and you have a set of strategies (one for each player) such that no player can improve his EV by changing his strategy. Then, we say that any one of those players’ strategies is a GTO strategy for that player in that game. Great. In a minute, we’ll tease out some consequences of that definition: what special properties such a strategy has, etc. But first, if you’re paying attention, you might feel like you’ve been cheated! I told you that “GTO” has a very specific technical meaning, but then I gave you a definition that relies on more fuzzy terms: <strong>game</strong> and <strong>strategy</strong>. As you may guess, we mean something specific by those terms as well. Let’s talk about those ideas and then come back to GTO. We’ll say something more about EV in the future as well.</p>
<p>I’m going to tweak the next couple of definitions a little bit to make them more useful for poker. For us, a “game” will correspond more or less to a single hand. It is composed of the following four things:</p>
<ul>
<li>A set of players</li>
<li>Starting ranges for each player</li>
<li>A <strong>decision tree</strong> that describes all the possible sequences of actions that the players (and <strong>Nature</strong>, i.e., random chance) can take, and</li>
<li>Payoffs that describe how much money or chips or value each player has at the end of the hand, for every way the hand can end</li>
</ul>
<p>When we describe a game, we’ll also usually want to specify the starting pot and stack sizes of each player, although presumably we could find them by starting at the bottom of a decision tree (at the end of the hand) and working back up the series of actions to the beginning, tallying bets as we go.</p>
<p>A player’s range tells us the different hands he can hold as well as how likely each of them is. A player’s starting range is his range at the beginning of the game. Of course, a player’s possible holdings at the beginning of a holdem hand are well-known, so we often won’t need to specify them. However, we’ll sometimes find it convenient to set up sort of artificial games that describe play over just part of a hand. For example, we could draw a decision tree that describes play on just a single river. In that case, we’ll need to specify the ranges of each player at the start of river play to fully describe the situation.</p>
<p>I should say what a decision tree is! A picture is best. Check out the figure below. This picture corresponds to a game with 3 players, named BU, SB, and BB. There are two components in the diagram: circles and lines. Each circle in the tree represents a spot where a player has to make a decision – we call them <strong>decision points</strong>. More specifically, each decision point corresponds to a distinct set of <strong>public</strong> information – the information you’d have available if you were a third party watching the game (with no hole card cam) – basically everything except the hole cards. I’ve labelled each point with the name of the player who owns it, i.e., who gets to make a decision there. Each arrow leaving a point represents an action the player can choose, and when he takes an action, the game moves to the point indicated by the arrow.</p>
<p><img src="/images/larger_game_tree_example.svg" alt="Larger game tree example" /></p>
<p>The game begins at the top of the tree. (Here, I’ve neglected to draw actions for posting blinds, but they’re implied.) Then, BU can fold, call, or raise. If he folds, the SB also has the options to fold, call, or raise. If the SB calls, the action moves to a point owned by the BB. And so on. Points all the way at the bottom of the tree (which are reached at the end of a hand, i.e., at showdown or after all but one player folds) are known as the <strong>leaves</strong> of the tree. (Get it?) A tree describing all of the possible lines, including all future streets and so on, would be a bit unwieldy, so I’ve left dangling arrows to indicate places where much more lies below, undrawn. You can imagine how it would go.</p>
<p>So that’s a game. Strategy is another word that has an English meaning that’s close to but not quite the same as its technical definition. For us, a strategy for a player is something that tells him exactly how to make every decision he could face in the game. Practically, it tells him, for every one of his decision points and every hole card combination that doesn’t conflict with the board, how he will choose between each of the options available to him there. Now, we could imagine some fairly convoluted decision making processes, but we’ll generally restrict ourselves to one of the two following types. If a player takes one action all the time (with a particular hand at a particular point) we say he’s playing a pure strategy there, and if he chooses randomly between multiple options with certain probability (say fold <script type="math/tex">30\%</script> and call <script type="math/tex">70\%</script>), then he’s playing a mixed strategy.</p>
<p>Now, if we know a player’s strategy, we can find his range at any point in the game. We have his starting range, and then at each of his decision points, he splits the range with which he arrives there. He chooses an action to take for each component of his range. If we know a player’s range for taking each action, we can often more or less work out his strategy. For example, if we know he arrives at a point with <script type="math/tex">20\%</script> of a hand, and his range for taking one action includes <script type="math/tex">15\%</script> of the hand and the other includes <script type="math/tex">5\%</script>, then we can reason that at that point, his strategy involves taking the first action three-quarters of the time and the second one-quarter of the time. However, if a player arrives at a point with <script type="math/tex">0\%</script> of a hand (because his strategy is such that he never gets to this spot with this hand), then all of his subsequent action ranges must also contain <script type="math/tex">0\%</script> of the hand. His strategy, by definition, must dictate his play here, but we can’t use his ranges to figure out his frequencies.</p>
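<p>To make that arithmetic concrete, here is a minimal Python sketch. The function name <code>action_frequencies</code> and the action labels are made up for illustration; the numbers are the ones from the example above.</p>

```python
def action_frequencies(action_weights):
    """Turn the weight of one hand in each action range into frequencies.

    Returns None when the hand never reaches this decision point, since
    the ranges then tell us nothing about the strategy there.
    """
    total = sum(action_weights.values())
    if total == 0:
        return None
    return {action: w / total for action, w in action_weights.items()}


# We arrive with 20% of a hand: 15% takes the first action, 5% the second.
freqs = action_frequencies({"first": 0.15, "second": 0.05})

# A hand we never arrive with: every action range holds 0% of it.
unreached = action_frequencies({"first": 0.0, "second": 0.0})

print(freqs, unreached)
```

<p>The <code>None</code> case is the 0% situation: the strategy still has to say something there, but the ranges alone can’t tell us what.</p>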
<p>So if we know a strategy, we can find the ranges, and if we know ranges, we can work out parts of the strategy – those that we might consider most important – the parts that describe play in spots the players can actually get to when they play their strategies. For practical purposes, when we describe players’ strategies, we’ll usually talk about their ranges, but to be clear, they’re not exactly the same thing.</p>
<p>Great, now we’re ready to revisit GTO in full force. So again, a set of strategies is GTO if no player can unilaterally deviate and increase his average profit. An equivalent way to put this is to say that every player is playing <strong>maximally exploitatively</strong> (i.e., as profitably as possible), given his opponents’ strategies. So, if all players but one in a game are playing strategies from a GTO set, then the last player can do no better than to also play his strategy from the set. A set of GTO strategies is also called an <strong>equilibrium</strong> or a <strong>Nash equilibrium</strong>, and if all players are playing their strategy from an equilibrium, we say we’re <strong>at equilibrium</strong>.</p>
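<p>As a quick sanity check of the definition, here is a sketch for a game far smaller than poker: matching pennies, a two-player zero-sum game. The function <code>is_equilibrium</code> and the payoff matrix are assumptions for illustration, not something from the chapter. Because a player’s EV is linear in his own mixing frequencies, it is enough to check deviations to pure strategies.</p>

```python
import numpy as np


def is_equilibrium(payoff, x, y, tol=1e-9):
    """Check a strategy pair in a two-player zero-sum matrix game.

    payoff[i, j] is the row player's winnings when row plays i and column
    plays j; the column player wins the negative of that.
    """
    payoff = np.asarray(payoff, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    row_ev = x @ payoff @ y
    # Best EV each player could reach by unilaterally switching strategies.
    best_row_ev = np.max(payoff @ y)
    best_col_ev = np.max(-(x @ payoff))
    return bool(best_row_ev <= row_ev + tol and best_col_ev <= -row_ev + tol)


# Matching pennies: row wins 1 when the coins match, column wins 1 otherwise.
pennies = [[1, -1], [-1, 1]]
both_mix = is_equilibrium(pennies, [0.5, 0.5], [0.5, 0.5])
row_always_heads = is_equilibrium(pennies, [1.0, 0.0], [0.5, 0.5])
print(both_mix, row_always_heads)
```

<p>The 50/50 pair passes. The second pair fails: if the row player always shows heads, the column player can improve his EV by always showing tails, so that pair is not GTO.</p>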
<p>Let’s take a look at one consequence of these definitions that many players find counterintuitive. This isn’t super important in and of itself, but it’ll help us to become more familiar with the concepts. <em>A GTO strategy can involve folding the nuts, even on the river.</em> Suppose we’re at equilibrium. No player has any incentive to change his strategy. Imagine taking Hero’s strategy in a spot that play never reaches and tweaking it so that it folds the nuts a small amount of the time. By “small” here, I mean that we don’t start playing poorly enough that our opponents actually can improve their EV by switching up their play to arrive at that spot. Well then the tweaked strategy is still GTO, since it’s still the case that no player can increase his EV by unilaterally deviating. Folding the nuts on the river doesn’t affect our EV if it’s in a spot we never get to at equilibrium. However, if we did get there (perhaps because Villain played a non-GTO strategy), we could find ourselves folding the nuts despite playing a GTO strategy.</p>
<p>This is a pretty good example of how the normal English meaning of “optimal” conflicts with our definition. Few people would call folding the nuts on the river optimal, but such play is consistent with a GTO strategy. By the way, notice that in the previous paragraph, we imagined constructing two distinct strategies for a player, and we said both were GTO. Indeed, there is no reason to think that GTO strategies are unique, and they’re often not. This point will become important for us shortly.</p>
<p>The next section, GTO Play in Cash Games and Tournaments, will be posted eventually, and the full book is available now:
<a href="http://amzn.to/1QHNNHY">Excelling at No-Limit Hold’em</a>.</p>
Mon, 07 Mar 2016 00:00:00 -0800
http://willtipton.com/poker/2016/03/07/gto-pt1.html
http://willtipton.com/poker/2016/03/07/gto-pt1.htmlpokerSolving the Shove/fold Game with TensorFlow<p>Google recently open-sourced TensorFlow (<a href="https://www.tensorflow.org">website</a>, <a href="http://download.tensorflow.org/paper/whitepaper2015.pdf">whitepaper</a>), a software package primarily meant for training <a href="http://neuralnetworksanddeeplearning.com/chap1.html">neural networks</a>. However, neural nets come in all shapes and sizes, so TF is fairly general. Essentially, you can write down some expression in terms of vectors, matrices, and other tensors, and then tell TF to minimize it.</p>
<p>I ran through a couple of their very well written <a href="https://www.tensorflow.org/versions/master/tutorials/index.html">tutorials</a> and then decided to try it out on one of my standard toy problems: the HUNL shove/fold game.<!--more--></p>
<p>As a reminder, shove/fold is a 2-player no limit hold’em model game where play proceeds as follows:</p>
<ol>
<li>Both players start with stacks of <code>S</code>.</li>
<li>The player in the BB posts <script type="math/tex">1.0</script> blind, and the player in the SB posts <script type="math/tex">0.5</script> blind.</li>
<li>The SB can go all-in or fold.</li>
<li>Facing an all-in, the BB can call or fold.</li>
</ol>
<p>Notice that the SB’s strategy is completely specified by his shoving range, and the BB’s by his calling range. The equilibrium for this game is <a href="http://www.dandbpoker.com/preflop-charts">well known</a> and exhaustively described in
<a href="http://www.amazon.com/gp/product/1904468942/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1904468942&linkCode=as2&tag=willtipton-20&linkId=W7WIPDD4JQK7EK6Y">EHUNL v1</a><img src="http://ir-na.amazon-adsystem.com/e/ir?t=willtipton-20&l=as2&o=1&a=1904468942" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />
chapter 3.</p>
<p>So, the equilibrium is a pair of strategies, one for each player, where both are maximally exploiting each other, simultaneously. There are several ways we can find it. In the <a href="https://www.youtube.com/watch?v=MVMfDswjJE0">Solving Poker</a> video series, we used the Fictitious Play algorithm. There, we maintained guesses for both players’ optimal strategies and repeatedly calculated each player’s maximally exploitative counter strategy, taking a small step towards it each time.</p>
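<p>For a feel of how Fictitious Play behaves, here is a sketch on rock-paper-scissors instead of poker. Everything in it is an assumption for illustration: the payoff matrix, the function names, and in particular the shrinking step size <code>1/(n+1)</code>, which is one common choice and not necessarily what the video series used.</p>

```python
import numpy as np

# Rock-paper-scissors winnings for the row player (zero-sum game).
# Rows and columns are ordered rock, paper, scissors.
PAYOFF = np.array([[0.0, -1.0, 1.0],
                   [1.0, 0.0, -1.0],
                   [-1.0, 1.0, 0.0]])


def fictitious_play(payoff, x, y, n_iters=20000):
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    for n in range(1, n_iters + 1):
        # Maximally exploitative pure responses to the current guesses.
        br_row = np.zeros_like(x)
        br_row[np.argmax(payoff @ y)] = 1.0
        br_col = np.zeros_like(y)
        br_col[np.argmax(-(x @ payoff))] = 1.0
        # Take a shrinking step from each guess towards its counter strategy.
        step = 1.0 / (n + 1)
        x += step * (br_row - x)
        y += step * (br_col - y)
    return x, y


def exploitability(payoff, x, y):
    """Total EV the two players could gain by best-responding (0 at equilibrium)."""
    return np.max(payoff @ y) + np.max(-(x @ payoff))


start = np.array([1.0, 0.0, 0.0])  # both players begin by always throwing rock
x, y = fictitious_play(PAYOFF, start, start)
print(x, y, exploitability(PAYOFF, x, y))
```

<p>Both guesses drift towards throwing each option a third of the time, and the exploitability shrinks towards zero as they do.</p>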
<p>An alternate approach lets us express the problem directly as an optimization problem, which is what TF is good at. The BB’s equilibrium strategy is the one that minimizes the EV of the SB’s maximally exploitative (ME) response. So, we just need to write down the EV of the SB’s ME strategy as a function of the BB’s strategy and then minimize it.</p>
<p>Let’s get started. Feel free to follow along in a <a href="http://jupyter.org/">Jupyter</a> (née iPython) notebook after <a href="https://www.tensorflow.org/versions/master/get_started/index.html">installing TF</a>.</p>
<p>First, import some libraries.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="kn">as</span> <span class="nn">tf</span>
<span class="kn">import</span> <span class="nn">numpy</span></code></pre></figure>
<p>Second, we’re going to need some data. There are <script type="math/tex">\binom{52}{2} = 1326</script> distinct starting hands in NLHE, so we can describe a player’s range with a vector of <script type="math/tex">1326</script> numbers. To do this, we need to determine some sort of ordering of all the hand combos. I’ll give some code to convert from a vector to a human-understandable picture of a range at the end of this article. For now, the particular order doesn’t matter as long as we’re consistent.</p>
<p>We’ll need hand-vs-hand equities, and we can put the equity of every starting hand versus every other in a <script type="math/tex">1326 \times 1326</script> matrix. The file <a href="/static/pf_eqs.dat">pf_eqs.dat</a> contains such a matrix as output by <a href="http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.savetxt.html">numpy.savetxt</a>. Of course, some entries in that matrix don’t make sense. If two starting hands have a card in common, they can’t both be held simultaneously. The file <a href="/static/pf_confl.dat">pf_confl.dat</a> holds another <script type="math/tex">1326 \times 1326</script> matrix where every entry is either 0 or 1. A 1 indicates that we can compare two hands, and a 0 indicates that the hands conflict.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">pfeqs</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s">"pf_eqs.dat"</span><span class="p">)</span>
<span class="n">pfconfl</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">loadtxt</span><span class="p">(</span><span class="s">"pf_confl.dat"</span><span class="p">)</span></code></pre></figure>
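<p>If you’re curious what goes into a conflict matrix, here is a sketch that builds one from scratch. The card numbering and hand ordering are hypothetical and need not match the ordering used in <code>pf_eqs.dat</code> and <code>pf_confl.dat</code>, so don’t mix this matrix with those files.</p>

```python
import itertools

import numpy as np

# Number the 52 cards 0..51 and list every 2-card combo (hypothetical order).
hands = list(itertools.combinations(range(52), 2))

# has_card[h, c] is 1 if hand h contains card c.
has_card = np.zeros((len(hands), 52))
for h, (c1, c2) in enumerate(hands):
    has_card[h, c1] = 1
    has_card[h, c2] = 1

# shared[i, j] counts the cards that hands i and j have in common.
shared = has_card @ has_card.T

# Same convention as pf_confl.dat: 1 = the hands can coexist, 0 = they conflict.
confl = (shared == 0).astype(float)

print(confl.shape, confl.sum(axis=1).min(), confl.sum(axis=1).max())
```

<p>Every row sums to <script type="math/tex">\binom{50}{2} = 1225</script>: once one hand is fixed, that many combos remain for the other player.</p>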
<p>Now we’ll use TensorFlow to set up and minimize the SB’s maximally exploitative EV (MEEV) as a function of the BB’s strategy. The process will be:</p>
<ol>
<li>Define some constants: the two arrays we just loaded and the stack size.</li>
<li>Define a variable: the BB’s calling range.</li>
<li>Write the SB’s MEEV in terms of those things.</li>
<li>Use TF to find the value of the variable (BB’s calling range) that minimizes the SB’s MEEV.</li>
</ol>
<p>First, the constants. The <a href="https://www.tensorflow.org/versions/master/api_docs/python/constant_op.html#constant">tf.constant</a> method creates a TensorFlow constant. We give it the data, the type of the data, the shape of the data, and a name that TF will give back to us in error messages if we mess up. We’ll choose a stack size of <script type="math/tex">10</script> here for no particular reason.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">equity_array</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">constant</span><span class="p">(</span><span class="n">pfeqs</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="p">[</span><span class="mi">1326</span><span class="p">,</span> <span class="mi">1326</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">'equity_array'</span><span class="p">)</span>
<span class="n">confl_hands</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">constant</span><span class="p">(</span><span class="n">pfconfl</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="p">[</span><span class="mi">1326</span><span class="p">,</span> <span class="mi">1326</span><span class="p">],</span> <span class="n">name</span><span class="o">=</span><span class="s">'confl_hands'</span><span class="p">)</span>
<span class="n">S</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">constant</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="p">[],</span> <span class="n">name</span><span class="o">=</span><span class="s">'S'</span><span class="p">)</span></code></pre></figure>
<p>The variable will be represented by an instance of the <a href="https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Variable">tf.Variable</a> class. We pass in an initial value: a vector of <script type="math/tex">1326</script> zeroes.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">_bb_call_range</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Variable</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">zeros</span><span class="p">([</span><span class="mi">1326</span><span class="p">,</span><span class="mi">1</span><span class="p">]))</span></code></pre></figure>
<p>Now, the vector representing the BB’s calling range should have <script type="math/tex">1326</script> numbers, where each number corresponds to a particular hand and represents the probability he will call an all-in with that hand. Since it’s a probability, it should be a number from <script type="math/tex">0</script> to <script type="math/tex">1</script>. But the optimizer isn’t going to know about that constraint. Instead, it will change the entries of this vector in any way necessary to minimize the MEEV. We need to account for this, or else the result will probably tell the BB to call with aces 1000000000% of the time and fold everything else ;).</p>
<p>So, I use a trick here and define <code>bb_call_range</code> to be the sigmoid function of <code>_bb_call_range</code>. From the graph of the <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid function</a>, we can see that this function takes any real number as input and produces something between <script type="math/tex">0</script> and <script type="math/tex">1</script> as output. This way, we can let the optimizer do whatever it wants to <code>_bb_call_range</code>, and by doing so, it will be able to make the entries of <code>bb_call_range</code> anything and only anything between <script type="math/tex">0</script> and <script type="math/tex">1</script>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">bb_call_range</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">_bb_call_range</span><span class="p">)</span></code></pre></figure>
<p>Of course, the sigmoid function isn’t the only one that would work here. I’d be interested if anyone has a sufficiently different way to introduce bounded variables. I suspect something involving subclassing either <code>Variable</code> or the optimizer op is possible…</p>
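<p>Here’s a quick NumPy check of the squashing behavior, independent of TensorFlow. The helper <code>sigmoid</code> and the sample inputs are just for illustration.</p>

```python
import numpy as np


def sigmoid(x):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))


raw = np.array([-30.0, -1.0, 0.0, 1.0, 30.0])  # unbounded, like _bb_call_range
probs = sigmoid(raw)                            # bounded, like bb_call_range
print(probs)
```

<p>Note that an input of <script type="math/tex">0</script> maps to <script type="math/tex">0.5</script>, so initializing <code>_bb_call_range</code> with zeroes means the BB starts out calling with every hand half the time.</p>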
<p>Now for some math. Big picture, we want the SB’s MEEV, which is the average of his MEEV for every particular hand, which is the maximum of his EV of folding and his EV of shoving with that hand. We’ll start with something easy: the EV of folding each hand. If he open-folds, the SB ends up with a stack of <script type="math/tex">S-0.5</script>, regardless of his holding.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">sb_ev_fold</span> <span class="o">=</span> <span class="p">(</span><span class="n">S</span><span class="o">-</span><span class="mf">0.5</span><span class="p">)</span><span class="o">*</span><span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">([</span><span class="mi">1326</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span></code></pre></figure>
<p>Now for the EV of jamming. We start by finding the number of hand combos in the BB’s range for each SB hand. This is almost just a sum of all the entries in <code>bb_call_range</code>, except that we need to remove the entries that correspond to hands that conflict with the SB’s hand because of card removal effects. We achieve this by multiplying with the <code>confl_hands</code> binary array.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">bb_num_calling_hands</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">confl_hands</span><span class="p">,</span> <span class="n">bb_call_range</span><span class="p">)</span></code></pre></figure>
<p>The number of BB hand combos left after we fix a particular SB hand is <script type="math/tex">\binom{50}{2}=1225</script>, so we divide <code>bb_num_calling_hands</code> by <script type="math/tex">1225</script> to get the chance the BB calls a shove.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">chance_bb_calls</span> <span class="o">=</span> <span class="n">bb_num_calling_hands</span> <span class="o">/</span> <span class="mi">1225</span></code></pre></figure>
<p>The SB’s equity when called with each particular hand is just the average of his equity versus each possible BB holding, weighted by how likely the BB is to hold each holding after calling an all-in.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">sb_equity_when_called</span> <span class="o">=</span> <span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">equity_array</span><span class="p">,</span> <span class="n">bb_call_range</span><span class="p">))</span> <span class="o">/</span> <span class="n">bb_num_calling_hands</span></code></pre></figure>
<p>And then the SB’s EV of shoving is the average of his stack size when the BB does and does not call, weighted by the probability of each case.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">sb_ev_shove</span> <span class="o">=</span> <span class="n">sb_equity_when_called</span> <span class="o">*</span> <span class="p">(</span><span class="n">chance_bb_calls</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="mi">2</span><span class="o">*</span><span class="n">S</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="mf">1.0</span><span class="o">-</span><span class="n">chance_bb_calls</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">S</span><span class="o">+</span><span class="mf">1.0</span><span class="p">)</span></code></pre></figure>
<p>Keep in mind that <code>sb_ev_shove</code> here is a vector of length <script type="math/tex">1326</script> – we have a (potentially) different EV for each SB holding. TensorFlow makes this easy in that basic operations (*, +) between objects of the same shape are done elementwise, and operations between a matrix or vector and a number (such as <script type="math/tex">(2*S)</script>) are handled via <a href="http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html">broadcasting</a>.</p>
<p>Finally, we can write the SB’s MEEV with each hand and take the average to get his average MEEV.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">sb_meev_by_hand</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">sb_ev_shove</span><span class="p">,</span> <span class="n">sb_ev_fold</span><span class="p">)</span>
<span class="n">sb_meev</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="n">sb_meev_by_hand</span><span class="p">)</span> <span class="o">/</span> <span class="mi">1326</span></code></pre></figure>
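<p>To double-check the formulas by hand, it can help to run the same calculation in plain NumPy on a game small enough to verify on paper. The sketch below assumes a made-up 3-hand game where a higher-numbered hand always beats a lower one. Unlike the TF code, it normalizes by the row sums of the conflict matrix instead of the hard-coded <script type="math/tex">1225</script> and masks the equity matrix explicitly, so that it works for any number of hands.</p>

```python
import numpy as np


def sb_meev(bb_call_range, equity, confl, S):
    """SB's maximally exploitative EV, mirroring the TF graph step by step."""
    n_opp_hands = confl.sum(axis=1)  # 1225 for every hand in real hold'em
    bb_num_calling_hands = confl @ bb_call_range
    chance_bb_calls = bb_num_calling_hands / n_opp_hands
    # Assumes the BB calls with at least some hand against every SB holding.
    sb_equity_when_called = ((equity * confl) @ bb_call_range) / bb_num_calling_hands
    sb_ev_shove = (sb_equity_when_called * chance_bb_calls * (2 * S)
                   + (1.0 - chance_bb_calls) * (S + 1.0))
    sb_ev_fold = (S - 0.5) * np.ones_like(sb_ev_shove)
    return np.maximum(sb_ev_shove, sb_ev_fold).mean()


# Toy game: 3 hands, a hand conflicts only with itself, higher index wins.
confl = 1.0 - np.eye(3)
equity = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0]])
call_range = np.array([0.0, 0.5, 1.0])
meev = sb_meev(call_range, equity, confl, S=10.0)
print(meev)
```

<p>Here the two weaker SB hands prefer folding (EV <script type="math/tex">9.5</script>), while the best hand shoves for <script type="math/tex">0.25 \cdot 20 + 0.75 \cdot 11 = 13.25</script>, giving an average of <script type="math/tex">10.75</script>.</p>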
<p>Now for the optimization.
Up to this point, we haven’t actually done any calculations. We’ve simply set up the sequence of operations that lead from our constants and the variable describing the BB’s range to the SB’s MEEV. Actually, this sequence forms a graph, and it’s possible to directly <a href="https://www.tensorflow.org/versions/master/how_tos/graph_viz/index.html">visualize</a> this graph. Do you see the inputs at the bottom and the output up top? Which part of this graph corresponds to the EV of folding calculation and which to the EV of shoving?</p>
<p><img src="/images/tf-graph.png" alt="TF Graph" /></p>
<p>We’ll now define an <a href="https://www.tensorflow.org/versions/r0.7/api_docs/python/train.html#Optimizer">Optimizer</a> whose minimize method annotates this graph with various bits of info (about how to compute and apply gradients) necessary to do the actual optimization. We’ll use a GradientDescentOptimizer and give it the quantity we want to minimize (<code>sb_meev</code>) and the list of variables we want to optimize (<code>[_bb_call_range]</code>).</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">bb_train_step</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">train</span><span class="o">.</span><span class="n">GradientDescentOptimizer</span><span class="p">(</span><span class="mf">1000.</span><span class="p">)</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">sb_meev</span><span class="p">,</span> <span class="n">var_list</span><span class="o">=</span><span class="p">[</span><span class="n">_bb_call_range</span><span class="p">])</span></code></pre></figure>
<p>Finally, we need to run the optimization. We’ll set up a <a href="https://www.tensorflow.org/versions/r0.7/api_docs/python/client.html#Session">tf.Session</a> object which is responsible for tracking the state of our optimization. We initialize all our variables, and then we run the optimizer a bunch of times. Afterwards, we extract the BB’s optimal strategy.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">sess</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">initialize_all_variables</span><span class="p">())</span>
<span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">):</span>
    <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">bb_train_step</span><span class="p">)</span>
<span class="n">bb_range</span> <span class="o">=</span> <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">bb_call_range</span><span class="p">)</span></code></pre></figure>
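<p>To see what each training step is doing under the hood, here’s a hand-rolled version of gradient descent through a sigmoid for a single number. The objective (squared distance to a made-up target frequency of <script type="math/tex">0.8</script>), the learning rate, and the step count are all assumptions for illustration; TF works out the corresponding gradients for the real <code>sb_meev</code> automatically.</p>

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


target = 0.8         # hypothetical "best" calling frequency for one hand
z = 0.0              # unbounded variable, playing the role of _bb_call_range
learning_rate = 1.0

for _ in range(5000):
    p = sigmoid(z)   # bounded frequency, playing the role of bb_call_range
    # Chain rule: d/dz (p - target)^2 = 2 (p - target) * p (1 - p)
    grad = 2.0 * (p - target) * p * (1.0 - p)
    z -= learning_rate * grad

p = sigmoid(z)
print(z, p)
```

<p>The optimizer only ever touches <code>z</code>, which is free to wander anywhere on the real line, while the frequency <code>p</code> it implies stays between <script type="math/tex">0</script> and <script type="math/tex">1</script>.</p>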
<p>Lastly, as promised, here’s some code to create a nicely-formatted version of the BB’s range, presented without comment. It creates an <a href="http://www.w3schools.com/svg/">SVG</a>, and the <code>_repr_svg_</code> magic will cause the image to be embedded directly in your browser if you’re using Jupyter. Otherwise, save the text and open it in your browser or some such.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">RangeDrawer</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">r</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">r</span> <span class="o">=</span> <span class="n">r</span>
    <span class="k">def</span> <span class="nf">_repr_svg_</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">ranks</span><span class="o">=</span><span class="p">[</span><span class="s">'2'</span><span class="p">,</span><span class="s">'3'</span><span class="p">,</span><span class="s">'4'</span><span class="p">,</span><span class="s">'5'</span><span class="p">,</span><span class="s">'6'</span><span class="p">,</span><span class="s">'7'</span><span class="p">,</span><span class="s">'8'</span><span class="p">,</span><span class="s">'9'</span><span class="p">,</span><span class="s">'T'</span><span class="p">,</span><span class="s">'J'</span><span class="p">,</span><span class="s">'Q'</span><span class="p">,</span><span class="s">'K'</span><span class="p">,</span><span class="s">'A'</span><span class="p">]</span>
        <span class="n">suits</span><span class="o">=</span><span class="p">[</span><span class="s">'c'</span><span class="p">,</span><span class="s">'s'</span><span class="p">,</span><span class="s">'d'</span><span class="p">,</span><span class="s">'h'</span><span class="p">]</span>
        <span class="n">weights</span><span class="o">=</span><span class="p">{}</span>
        <span class="n">counts</span><span class="o">=</span><span class="p">{}</span>
        <span class="n">c</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">for</span> <span class="n">r1</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)):</span>
            <span class="k">for</span> <span class="n">r2</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">r1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)):</span>
                <span class="k">for</span> <span class="n">s1</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">suits</span><span class="p">)):</span>
                    <span class="k">for</span> <span class="n">s2</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">suits</span><span class="p">)):</span>
                        <span class="k">if</span> <span class="n">r1</span> <span class="o">==</span> <span class="n">r2</span> <span class="ow">and</span> <span class="n">s1</span> <span class="o">>=</span> <span class="n">s2</span><span class="p">:</span>
                            <span class="k">continue</span>
                        <span class="k">if</span> <span class="n">s1</span> <span class="o">==</span> <span class="n">s2</span><span class="p">:</span>
                            <span class="n">hand</span> <span class="o">=</span> <span class="n">ranks</span><span class="p">[</span><span class="n">r2</span><span class="p">]</span><span class="o">+</span><span class="n">ranks</span><span class="p">[</span><span class="n">r1</span><span class="p">]</span><span class="o">+</span><span class="s">'s'</span>
                        <span class="k">else</span><span class="p">:</span>
                            <span class="n">hand</span> <span class="o">=</span> <span class="n">ranks</span><span class="p">[</span><span class="n">r2</span><span class="p">]</span><span class="o">+</span><span class="n">ranks</span><span class="p">[</span><span class="n">r1</span><span class="p">]</span><span class="o">+</span><span class="s">'o'</span>
                        <span class="n">weights</span><span class="p">[</span><span class="n">hand</span><span class="p">]</span> <span class="o">=</span> <span class="n">weights</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">hand</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">r</span><span class="p">[</span><span class="n">c</span><span class="p">]</span>
<span class="n">counts</span><span class="p">[</span><span class="n">hand</span><span class="p">]</span> <span class="o">=</span> <span class="n">counts</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">hand</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">c</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">result</span> <span class="o">=</span> <span class="s">'<svg xmlns="http://www.w3.org/2000/svg" version="1.1" width="325" height="325">'</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)):</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)):</span>
<span class="k">if</span> <span class="n">i</span><span class="o"><</span><span class="n">j</span><span class="p">:</span>
<span class="n">hand</span> <span class="o">=</span> <span class="n">ranks</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="o">+</span><span class="n">ranks</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="s">'s'</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">hand</span> <span class="o">=</span> <span class="n">ranks</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">ranks</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="o">+</span><span class="s">'o'</span>
<span class="n">frac</span> <span class="o">=</span> <span class="n">weights</span><span class="p">[</span><span class="n">hand</span><span class="p">]</span> <span class="o">/</span> <span class="n">counts</span><span class="p">[</span><span class="n">hand</span><span class="p">]</span>
<span class="n">hexcolor</span> <span class="o">=</span> <span class="s">'#</span><span class="si">%02</span><span class="s">x</span><span class="si">%02</span><span class="s">x</span><span class="si">%02</span><span class="s">x'</span> <span class="o">%</span> <span class="p">(</span><span class="mi">255</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">frac</span><span class="p">),</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">frac</span><span class="p">))</span>
<span class="n">result</span> <span class="o">+=</span> <span class="s">'<rect x="'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)</span><span class="o">-</span><span class="n">i</span><span class="p">)</span><span class="o">*</span><span class="mi">25</span><span class="p">)</span> <span class="o">+</span> <span class="s">'" y="'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)</span><span class="o">-</span><span class="n">j</span><span class="p">)</span><span class="o">*</span><span class="mi">25</span><span class="p">)</span> \
<span class="o">+</span> <span class="s">'" width="25" height="25" fill="'</span><span class="o">+</span><span class="n">hexcolor</span><span class="o">+</span><span class="s">'"></rect>'</span>
<span class="n">result</span> <span class="o">+=</span> <span class="s">'<text x="'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)</span><span class="o">-</span><span class="n">i</span><span class="p">)</span><span class="o">*</span><span class="mi">25</span><span class="p">)</span><span class="o">+</span><span class="s">'" y="'</span><span class="o">+</span><span class="nb">str</span><span class="p">(((</span><span class="nb">len</span><span class="p">(</span><span class="n">ranks</span><span class="p">)</span><span class="o">-</span><span class="n">j</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="mi">25</span><span class="o">-</span><span class="mi">10</span><span class="p">)</span> \
<span class="o">+</span> <span class="s">'" font-size="11" >'</span> <span class="o">+</span> <span class="n">hand</span> <span class="o">+</span> <span class="s">'</text>'</span>
<span class="n">result</span> <span class="o">+=</span> <span class="s">'</svg>'</span>
<span class="k">return</span> <span class="n">result</span>
<span class="n">RangeDrawer</span><span class="p">(</span><span class="n">bb_range</span><span class="p">)</span></code></pre></figure>
<p><strong>BB calling range:</strong></p>
<p><img src="/images/bb_10bb_shovefold.svg" alt="BB calling range" /></p>
<p><strong>Questions:</strong></p>
<ul>
<li>What does the graph look like after we create the Optimizer?</li>
<li>Try other types of Optimizers and/or a different learning rate. Can you train faster?</li>
<li>Find the SB’s optimal shoving range. You need to express the BB’s ME EV in terms of the SB’s strategy and minimize that to find the SB’s jamming range. Make sure to write the BB’s EV at the beginning of the hand and not after he is facing a shove.</li>
</ul>
Sun, 06 Mar 2016 00:00:00 -0800
http://willtipton.com/coding/poker/2016/03/06/shove-fold-with-tensorflow.html
Running it up, Part 3<p>We doubled up twice – time for round 3!</p>
<!--more-->
<center>
<iframe width="640" height="390" src="//www.youtube.com/embed/dBdwNA9eI50" frameborder="0" allowfullscreen=""></iframe>
</center>
Sun, 17 May 2015 00:00:00 -0700
http://willtipton.com/poker/2015/05/17/martingaling-pt-3.html
Running it up, Part 2<p>We doubled the roll once, can we do it again?</p>
<!--more-->
<center>
<iframe width="640" height="390" src="//www.youtube.com/embed/qE5fIoVh0os" frameborder="0" allowfullscreen=""></iframe>
</center>
Sat, 11 Apr 2015 00:00:00 -0700
http://willtipton.com/poker/2015/04/11/martingaling-pt-2.html
Running it up, Part 1<p>Tonight, Carbon was down, but Black Chip Poker gave me a few dollars to play with, so I’m going to try to run it up.</p>
<!--more-->
<center>
<iframe width="640" height="390" src="//www.youtube.com/embed/Wc2Vzq-W6aw" frameborder="0" allowfullscreen=""></iframe>
</center>
Sat, 04 Apr 2015 00:00:00 -0700
http://willtipton.com/poker/2015/04/04/martingaling-pt-1.html
Value categories in C++11<p>One of the most important additions to C++ in the C++11 standard was the introduction of movable types. This feature has consequences for many common programming tasks such as assigning variables and passing arguments to or returning objects from a function. Move semantics are a bit subtle, and when reading documentation, it helps to understand some vocabulary: value categories.
<!--more--></p>
<p>So, before C++11, any expression in C++ was categorized as either an lvalue or an rvalue, depending on whether it had <em>identity</em>. Basically, an expression has identity if it has a name and thus outlives an expression that uses it. <strong>Pre-C++11</strong>, an <strong>lvalue</strong> was something with identity, and an <strong>rvalue</strong> was anything else, e.g. any temporary value that doesn’t outlive its particular expression. So if we have a line like</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">MyClass</span> <span class="n">foo</span> <span class="o">=</span> <span class="n">bar</span><span class="p">();</span></code></pre></figure>
<p>then <code>foo</code> is an lvalue because it has a name and life beyond this particular line, whereas the temporary object returned by <code>bar()</code> does not and is thus an rvalue. The terms come from the fact that lvalues usually show up on the <em>l</em>eft side of an assignment while rvalues are found on the <em>r</em>ight. However, that doesn’t always hold – a <code>const</code> object can’t be assigned to, but it still has a name and a lifetime beyond a particular expression, so it’s still an lvalue.</p>
<p>Some more examples:</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">int</span> <span class="n">f</span><span class="p">();</span>
<span class="kt">int</span><span class="o">&</span> <span class="n">g</span><span class="p">();</span>
<span class="kt">int</span> <span class="n">bar</span><span class="p">;</span>
<span class="n">bar</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span> <span class="c1">// good
</span><span class="mi">4</span> <span class="o">=</span> <span class="n">bar</span><span class="p">;</span> <span class="c1">// bad -- can't assign to the rvalue 4
</span><span class="n">bar</span> <span class="o">=</span> <span class="n">f</span><span class="p">();</span> <span class="c1">// good
</span><span class="n">f</span><span class="p">()</span> <span class="o">=</span> <span class="n">bar</span><span class="p">;</span> <span class="c1">// bad -- can't assign to the temporary int returned by f()
</span><span class="n">g</span><span class="p">()</span> <span class="o">=</span> <span class="n">bar</span><span class="p">;</span> <span class="c1">// good -- the reference returned by g() is an lvalue</span></code></pre></figure>
<p>So anyway, the point here is that prior to C++11, expressions were categorized based on identity. Lvalues have identity, so they can be assigned to, it makes sense to take their address, etc., while we can’t assign to or ask for the address of an rvalue.</p>
<p>In C++11, things changed a bit, thanks to the need to talk about the new move semantics. It turns out to be helpful to categorize expressions not just based on identity, but also based on <em>movability</em>. Identity means the same thing as before, but we should clarify movability. When we say an object “can be moved”, we don’t mean that it’s possible to copy all its bits to a different spot in memory and thus have moved it. Essentially, an expression is movable if we can use it as input to a <em>move constructor</em> or <em>move assignment operator</em>. These move methods were introduced in C++11 and are allowed to basically cannibalize one object in order to cheaply construct a new (or moved) object.</p>
<p>An easy example here is <code>std::vector</code>. A vector object is basically composed of a bit of accounting information along with a pointer to a potentially-large, dynamically-allocated array. It can be expensive to copy vectors around (say, when returning from a function) because the potentially-large array has to be copied. So instead, it’s sometimes easier for a new vector to be constructed from an old one not by copying the old vector’s array but rather just by stealing it. This is often implemented with a <code>swap</code> operation – the new object starts as an empty vector and then swaps its pointer with the old object’s. Anyhow, this is destructive to the original object. After the move, the old (cannibalized) object is left in a “valid but unspecified” state. Basically, we shouldn’t do anything with it as-is other than destroy it or assign it a new value.</p>
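<p>Here’s a minimal sketch of that stealing in action. It assumes a mainstream standard library, where a move-constructed vector simply adopts the source’s array. The names <code>MoveReport</code> and <code>steal_buffer</code> are made up for illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record of where the array lived before and after the move.
struct MoveReport {
    const int* buffer_before;  // address of the source's array before the move
    const int* buffer_after;   // address of the destination's array after the move
    bool source_empty;         // whether the moved-from vector ended up empty
};

// Move-construct one vector from another and record what happened
// to the underlying array.
MoveReport steal_buffer(std::size_t n) {
    std::vector<int> source(n, 7);  // n copies of 7 in a heap-allocated array
    assert(!source.empty());        // we need a real array to observe
    const int* before = source.data();

    // std::move is only a cast; the vector's move constructor does the stealing.
    std::vector<int> destination = std::move(source);

    // empty() has no preconditions, so it's safe to call on a moved-from vector.
    return MoveReport{before, destination.data(), source.empty()};
}
```

<p>With a typical standard library, <code>steal_buffer(1000)</code> should report the same address before and after the move and an empty source: no elements were copied, the pointer just changed hands.</p>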
<p>So, there are some safety concerns here. The language can’t go around using the move functionality willy-nilly under the covers, because the programmer might end up trying to use an object after it’s been cannibalized. There are two spots where it’s safe to use move operations:</p>
<ol>
<li>when the programmer explicitly says so</li>
<li>for temporary objects – if we don’t have a name for something, we can’t shoot ourselves in the foot by accessing it later</li>
</ol>
<p>Great, so we say something <em>is movable</em> if it’s safe to cannibalize it, and that only happens in the two cases above.</p>
<p>Now we have two properties of expressions: identity and movability. Any given expression either has identity or does not and is movable or is not. There are 4 possible combinations:</p>
<ul>
<li><strong>lvalues</strong> have identity and are not movable</li>
<li><strong>xvalues</strong> (I’ve heard eXtraordinary, eXpert, eXpiring) have identity and are movable</li>
<li><strong>prvalues</strong> (pure rvalues) do not have identity but are movable</li>
<li>The last combination, no identity and not movable, is not useful and thus not used.</li>
</ul>
<p>Any expression in C++11 belongs to exactly one of these three value categories.</p>
<p>So let’s see – lvalues are pretty much the same as before C++11. An lvalue is anything that’s not safe to move – it’s something that has a name and a life. We can assign to it and take its address. We can’t move (i.e. cannibalize) it because we might access it later and find it in an unspecified state. The other two value categories <em>are</em> movable and correspond to the two different ways it can be safe to move something.</p>
<p>First, the compiler can move something if the programmer explicitly says it’s OK (using, perhaps, <a href="http://en.cppreference.com/w/cpp/utility/move"><code>std::move()</code></a>). In this case, we get something that has a name but is also movable – an xvalue. It is up to the programmer to avoid doing anything dumb with an xvalue’s name. Second, the compiler can move a temporary object with no identity, a.k.a. a prvalue.</p>
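<p>One way to see both cases is to overload a function on <code>const T&amp;</code> and <code>T&amp;&amp;</code> and watch which version the compiler picks. This is just a sketch: <code>Widget</code>, <code>which_overload</code>, and the three <code>pass_*</code> helpers are hypothetical names, not anything from the standard library.</p>

```cpp
#include <cassert>  // for the run-time checks mentioned in the note below
#include <string>
#include <utility>

struct Widget {};  // hypothetical stand-in for any movable type

// Chosen for lvalues: the argument has a name the caller may still use.
std::string which_overload(const Widget&) { return "lvalue"; }

// Chosen for rvalues (xvalues and prvalues): safe to cannibalize.
std::string which_overload(Widget&&) { return "rvalue"; }

// A named object is an lvalue, so it is never moved from silently.
std::string pass_named() {
    Widget w;
    return which_overload(w);
}

// The programmer says moving is OK: std::move(w) is an xvalue.
std::string pass_moved() {
    Widget w;
    return which_overload(std::move(w));
}

// A temporary with no name: Widget{} is a prvalue.
std::string pass_temporary() {
    return which_overload(Widget{});
}
```

<p>Here <code>pass_named()</code> returns <code>"lvalue"</code>, while <code>pass_moved()</code> and <code>pass_temporary()</code> both return <code>"rvalue"</code>, which can be confirmed with something like <code>assert(pass_moved() == "rvalue")</code>. Move constructors and move assignment operators get selected by the same overload rule.</p>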
<p>That’s pretty much all there is to it – expressions in C++11 are partitioned into lvalues, xvalues, and prvalues based on whether they have identity and whether they’re movable. There are things that can’t be moved at all (lvalues), things that can be moved because the programmer says so (xvalues), and things that can be moved because they don’t have a name (prvalues).</p>
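<p>If you want the compiler to tell you which of the three categories an expression falls in, <code>decltype</code> with an extra pair of parentheses does the trick. The following is a small sketch; <code>Category</code>, <code>classify</code>, <code>CATEGORY_OF</code>, <code>make_int</code>, and <code>named_value</code> are names invented for this example.</p>

```cpp
#include <cassert>  // for the run-time checks mentioned in the note below
#include <type_traits>
#include <utility>

enum class Category { lvalue, xvalue, prvalue };

// decltype((expr)) is T& for an lvalue, T&& for an xvalue,
// and plain T for a prvalue.
template <typename T>
constexpr Category classify() {
    return std::is_lvalue_reference<T>::value ? Category::lvalue
         : std::is_rvalue_reference<T>::value ? Category::xvalue
                                              : Category::prvalue;
}

// The double parentheses matter: decltype(name) would report the
// declared type of the variable instead of the expression's category.
#define CATEGORY_OF(expr) classify<decltype((expr))>()

int make_int() { return 42; }
int named_value = 7;

static_assert(CATEGORY_OF(named_value) == Category::lvalue,
              "a named variable is an lvalue");
static_assert(CATEGORY_OF(std::move(named_value)) == Category::xvalue,
              "std::move produces an xvalue");
static_assert(CATEGORY_OF(make_int()) == Category::prvalue,
              "a temporary returned by value is a prvalue");
static_assert(CATEGORY_OF(named_value + 1) == Category::prvalue,
              "so is the result of built-in arithmetic");
```

<p>The <code>static_assert</code>s are checked at compile time, so the file compiling at all is the test. The same macro also works at run time, e.g. <code>assert(CATEGORY_OF(make_int()) == Category::prvalue);</code>.</p>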
<p>It turns out there are a couple more terms used that group those 3 fundamental categories in different ways. Any expression with identity (regardless of movability) is called a glvalue (generalized lvalue). In other words, an expression is a glvalue if it’s either an lvalue or an xvalue. Anything that’s movable (regardless of identity) is called an rvalue. That is, something’s an rvalue if it’s either a prvalue or an xvalue. Here’s a picture that might help:</p>
<div style="text-align:center">
<p><img src="/images/value_types.png" alt="Value categories" /></p>
</div>
<p><strong>Quiz</strong>:</p>
<ol>
<li>What do you call a glvalue that’s movable?</li>
<li>What do you call a C++11 rvalue that doesn’t have identity?</li>
<li>What type of expression does std::move return?</li>
</ol>
Sun, 14 Dec 2014 00:00:00 -0800
http://willtipton.com/coding/2014/12/14/value_categories.html
EDVis v1.1<p>Changes from v1.0 to v1.1:</p>
<ul>
<li>Control fractions of individual hand combos</li>
<li>View and set fractions of hands of a particular suit</li>
<li>Account for card removal effects when drawing the distributions</li>
</ul>
<!--more-->
<iframe width="420" height="315" src="//www.youtube.com/embed/Sk4qxezzQPQ" frameborder="0" allowfullscreen=""></iframe>
<p>Download it at <a href="http://www.dandbpoker.com/product/expert-heads-up-no-limit-holdem-volume-1">dandbpoker.com</a></p>
Sat, 13 Dec 2014 00:00:00 -0800
http://willtipton.com/poker/2014/12/13/edvis-v1.1.html
Debugging<p>Pretty much any nontrivial piece of software will have bugs during development. Fixing bugs is thus an unavoidable part of programming, and it’s important that all programmers have some skill at the task. I recently made a video series about developing some poker-related software. The focus was on the problem domain, but much of the audience was new to programming, and I didn’t talk too much about what to do when things don’t go perfectly, i.e. when there are bugs. So, this post is a quick intro to debugging methodology in general, but I have my poker audience in mind.
<!--more--></p>
<p>So again, for my poker students, if some of your code doesn’t work while you’re going through the series, and you have to fix it, please don’t think of that as an unfortunate detour. On the contrary, it’s an important and valuable part of the process. If you just copy down everything I write, and it all works perfectly, you haven’t gotten your money’s worth :). When you write your own code, there won’t be an answer key available, and you’ll have to figure out problems on your own.</p>
<p>Generally, the debugging process starts when I notice the code is doing something that’s not what I expect. This can happen naturally while using the software or it can be the result of explicitly testing the code. I usually start out by looking for low-hanging fruit. Maybe I can think of a couple things that could conceivably cause the unexpected behavior, so I go check on those first. If I don’t have any luck, however, I start a more systematic approach…</p>
<p>The approach basically involves coming up with a series of assumptions about how code should work and then verifying those assumptions and digging into them when they’re violated. We’ll find a series of violated assumptions that form a sort of trail leading us to the root of the problem. The trail begins with whatever unfulfilled expectation led us to start debugging in the first place. (“My code gives the wrong answer.”) How do we find the next step?</p>
<p>So, we have a function that’s producing a wrong answer, and that function has some inputs. Google image search provides this helpful illustration, where I guess “I” is “Input”:</p>
<div style="text-align:center">
<p><img src="http://i.investopedia.com/inv/articles/site/trading/112404_2.gif" alt="functions have inputs and outputs" title="Function" /></p>
</div>
<p>As far as I can recall, all the functions we write in our video series produce output that follows deterministically from the input, and they are otherwise “pure” in the sense described <a href="http://en.wikipedia.org/wiki/Pure_function">here</a>. If we run such a function a bunch of times with the same input, we’ll get the same output every time. Things become a bit trickier with code that uses random numbers or reaches out to the user or over the network for data, or things like that. In our case, however, if the output is wrong, then either</p>
<ol>
<li>one of the inputs is wrong, or</li>
<li>the logic in the function itself is wrong</li>
</ol>
<p>Test each of those cases, starting with the inputs. For each input to the function, figure out what you think it should look like, and then check what it actually is when you run the program. A debugger can help to inspect the values of variables when a function runs, but print statements are a simple way to accomplish this as well. If one of the inputs is bad, move to the place where that input is being generated, i.e., to the function that called this one. The caller is another function that’s producing an unexpected result causing the problem in the current one. Repeat the process there.</p>
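<p>As a rough illustration of checking inputs, here’s a C++ sketch of a small pure function that prints what it was given and asserts what we think the inputs should look like. The function <code>call_ev</code> and its parameters are hypothetical and not taken from the video series.</p>

```cpp
#include <cassert>
#include <iostream>

// Hypothetical example: EV of calling a bet, relative to folding.
//   equity  - our chance of winning at showdown, between 0 and 1
//   pot     - chips in the middle before we call (including the bet we face)
//   to_call - chips we must put in to call
double call_ev(double equity, double pot, double to_call) {
    // Print statement: see what the inputs actually are when the program runs.
    std::cerr << "call_ev(equity=" << equity << ", pot=" << pot
              << ", to_call=" << to_call << ")\n";

    // Assertions: write down what we think the inputs should look like.
    // If one fails, the bug is upstream, in whatever produced that input.
    assert(equity >= 0.0 && equity <= 1.0);
    assert(pot >= 0.0);
    assert(to_call >= 0.0);

    // If every input checks out, suspicion shifts to this line.
    return equity * (pot + to_call) - to_call;
}
```

<p>For instance, <code>call_ev(0.5, 3.0, 1.0)</code> should come out to <code>1.0</code>. If the program prints an equity of <code>50</code> instead of <code>0.5</code>, the first assertion fires, and we know to go look at the caller rather than at the formula.</p>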
<p>On the other hand, if all the inputs are correct, then it’s the current function itself that is at fault. Work through it line by line, figuring out the values of all variables involved and verifying that they are as you expect. When one is not, that’s the smoking gun that indicates our root problem. Verifying individual functions line-by-line can be tricky. It helps a lot to follow good programming practice and break complicated tasks into small, easily-analyzable functions. Facility with a debugger is useful as well.</p>
<p>So that’s it. If we follow these steps, and our code satisfies our original assumptions re: pureness, this process should eventually lead to the root cause of unexpected behavior.</p>
<p>Now, if you can’t figure out an issue, and you need to ask for help, the sort of information gleaned in this process is exactly the sort that somebody will need to help debug your problem. So (dear poker students), please provide <em>all</em> of the following info (in a nicely formatted way) if you need debugging help:</p>
<ol>
<li>What function is giving a wrong answer, what is that answer, and what should the answer be?</li>
<li>For all inputs to the function:
<ol>
<li>what should the input be?</li>
<li>what is the input when you actually run the code?</li>
</ol>
</li>
<li>If all the inputs are correct, then presumably the problem is in the current function. Please provide your code for that function.</li>
<li>If one of the inputs was not correct, then move to the function that produced that input and go back to 1.</li>
</ol>
<p>glgl!</p>
Sun, 05 Oct 2014 00:00:00 -0700
http://willtipton.com/coding/2014/10/05/debugging.html
Arbitrarily-deep nested loops<p>I finished a first pass at my <a href="https://github.com/wtipton/latticeregression">lattice regression library</a> over the weekend. The idea with that is pretty straightforward. Essentially, there’s some function we want to model, and it’s unknown, but we have a bunch of observations of inputs and corresponding outputs. So, we throw down a lattice (i.e. a regularly-spaced grid) of points over the space of inputs, and we use the data to “learn” some values of the function at the lattice points. Then, we discard the training data but can predict new values of the function by interpolating between the values at the lattice points. For more details, see, e.g. <a href="http://www.mayagupta.org/publications/GarciaAroraGupta_lattice_regression_IEEETransImageProcessing2012.pdf">this paper</a>.</p>
<p>Code-wise, one challenge of the project was in representing and dealing with the lattice. For example, suppose the function we want to model has 4 inputs. Then, our learned values on a grid over the space of inputs might naturally be stored in something like a 4-D array,<!--more--></p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">double</span> <span class="n">b</span><span class="p">[</span><span class="n">p</span><span class="p">][</span><span class="n">p</span><span class="p">][</span><span class="n">p</span><span class="p">][</span><span class="n">p</span><span class="p">];</span></code></pre></figure>
<p>where <code>p</code> is the number of grid points in any one direction. Then, if we want to do something to every lattice point, we can do something like</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">p</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">p</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o"><</span> <span class="n">p</span><span class="p">;</span> <span class="n">k</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">l</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">l</span> <span class="o"><</span> <span class="n">p</span><span class="p">;</span> <span class="n">l</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">do_something_to</span><span class="p">(</span><span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">][</span><span class="n">k</span><span class="p">][</span><span class="n">l</span><span class="p">]);</span>
      <span class="p">}</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>Unfortunately, we want our code to be able to model functions with different numbers of inputs. Effectively, the number of dimensions needed in the <code>b</code> array and the number of nested <code>for</code> loops necessary to act on it are unknown until runtime. To handle the general case, we need something a bit smarter than the above. Essentially, we need some way to simulate arbitrarily-deeply nested loops. There are good recursive solutions to this problem, but I didn’t go that route. Here are a couple approaches I used in the Lattice Regression library.</p>
<p>1) Put all the lattice points in a long 1-D array</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">double</span> <span class="n">b</span><span class="p">[</span><span class="n">pow</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="n">d</span><span class="p">)];</span></code></pre></figure>
<p>where <code>d</code> is the number of dimensions. Then, loop over it like normal</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">pow</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="n">d</span><span class="p">);</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// do cool things that matter
</span><span class="p">}</span></code></pre></figure>
<p>If we need to know the “real” coordinates of the current point, we can extract them from the loop counter <code>i</code> by mapping <code>i</code> to the “real” coordinates <script type="math/tex">(x_1,x_2,...,x_d)</script>. If we only care about the real coordinates of points, then any one-to-one mapping from <code>i</code> to real coordinates will work. I happen to like</p>
<script type="math/tex; mode=display">i = x_1 + x_2*p + x_3*p^2 + ... + x_d*p^{(d-1)} = \sum_{k=1}^d x_k p^{k-1}</script>
<p>Then, we can convert from <code>i</code> to the <script type="math/tex">d</script> coordinates with something like</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">int</span> <span class="n">indx</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">coords</span><span class="p">[</span><span class="n">d</span><span class="p">];</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o"><</span> <span class="n">d</span><span class="p">;</span> <span class="n">k</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">coords</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">indx</span> <span class="o">%</span> <span class="p">(</span><span class="n">p</span><span class="p">);</span>
  <span class="n">indx</span> <span class="o">/=</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>and we can go back again, by just performing the above sum.</p>
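<p>For concreteness, here’s a self-contained sketch of both directions of the mapping, using <code>std::vector</code> so the sizes can be chosen at runtime. The function names <code>index_to_coords</code> and <code>coords_to_index</code> are made up for this example, and I’m assuming the convention that <code>coords[0]</code> holds <script type="math/tex">x_1</script>, the least-significant digit.</p>

```cpp
#include <cassert>
#include <vector>

// Split a flat index into d base-p digits, least-significant first,
// so coords[k] plays the role of x_{k+1} in the sum above.
std::vector<int> index_to_coords(int index, int p, int d) {
    assert(p > 0 && d >= 0 && index >= 0);
    std::vector<int> coords(d);
    for (int k = 0; k < d; k++) {
        coords[k] = index % p;  // current least-significant digit
        index /= p;             // shift the remaining digits down
    }
    return coords;
}

// Inverse mapping: evaluate the sum of coords[k] * p^k with Horner's rule,
// starting from the most-significant digit.
int coords_to_index(const std::vector<int>& coords, int p) {
    int index = 0;
    for (int k = static_cast<int>(coords.size()) - 1; k >= 0; k--) {
        index = index * p + coords[k];
    }
    return index;
}
```

<p>With <code>p = 3</code> and <code>d = 4</code>, index 23 maps to the coordinates (2, 1, 2, 0), since 23 = 2 + 1*3 + 2*9 + 0*27, and <code>coords_to_index</code> takes those coordinates back to 23.</p>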
<p>Now, if we’re iterating over all points, then we’re essentially counting in base <code>p</code> from 0 to <code>(p-1)(p-1)...(p-1)</code> (where each <code>(p-1)</code> is one digit of a base-<code>p</code> number). We can make that more explicit and perhaps make a couple things easier by storing each “digit” separately…</p>
<p>2) Put artificial loop counters into a dynamically-sized array and “count” upwards.</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">int</span> <span class="n">counters</span><span class="p">[</span><span class="n">d</span><span class="p">];</span>
<span class="n">memset</span><span class="p">(</span><span class="n">counters</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">d</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">));</span></code></pre></figure>
<p>We start out with each loop counter at 0. Then we increment the first one until we reach its maximum value (<code>p-1</code> in our case). Then we set it back to zero, increment the second loop counter once, and again increment the first counter until it maxes out. Then we increment the second loop counter once more, and pass over the first counter again. Eventually, the second counter maxes out also, and we’ve covered all possible combinations of the first two counters once. We then increment the third counter once, reset the first and second, and repeat all combinations of the first two. And so on.</p>
<p>In pseudocode,</p>
<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="k">while</span> <span class="p">(</span><span class="o">!</span> <span class="n">all_the_counters_are_maxed_out</span><span class="p">())</span> <span class="p">{</span>
<span class="c1">// do some stuff with the "loop indices" counters[0],counters[1],...,counters[d-1]
</span>
<span class="c1">// increment counters
</span> <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">counters</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="n">max_value_of_counter_i</span><span class="p">)</span>
<span class="n">counters</span><span class="p">[</span><span class="n">i</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">counters</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>The lines here under <code>//increment counters</code> set all maxed-out counters back to 0 and then increment the first not-maxed-out counter found. Of course, the most common case is that the least-significant counter is not maxed out, in which case the body of the <code>while</code> loop doesn’t execute at all.</p>
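<p>To make the pseudocode concrete, here is one possible runnable version, assuming every counter runs from 0 to <code>p-1</code>. The function name <code>for_each_point</code> and the <code>visit</code> callback are hypothetical, not part of any library. One deliberate difference from the pseudocode: the termination check happens after the loop body, so the final all-maxed-out combination is visited too.</p>

```cpp
#include <cassert>
#include <vector>

// Visit every combination of d counters, each running over 0..p-1.
// Calls visit(counters) once per combination and returns the number visited.
template <typename Visit>
long long for_each_point(int d, int p, Visit visit) {
    assert(d >= 0 && p > 0);
    std::vector<int> counters(d, 0);
    long long visited = 0;
    while (true) {
        // Do some stuff with the "loop indices" counters[0..d-1].
        visit(counters);
        visited++;

        // Increment: reset every maxed-out counter to 0, stopping at the
        // first counter that is not maxed out.
        int i = 0;
        while (i < d && counters[i] == p - 1) {
            counters[i++] = 0;
        }
        if (i == d) {
            break;  // all counters were maxed out, so we have wrapped around
        }
        counters[i]++;
    }
    return visited;
}
```

<p>With <code>d = 2</code> and <code>p = 3</code>, this visits (0,0), (1,0), (2,0), (0,1), and so on up to (2,2), for 9 combinations in total. The bounds check <code>i &lt; d</code> is what keeps the inner loop from running off the end of the array on the last pass.</p>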
Tue, 30 Sep 2014 00:00:00 -0700
http://willtipton.com/coding/2014/09/30/arbitrarily-deep-nested-loops.html