<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://mehrdadmoghimi.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mehrdadmoghimi.github.io/" rel="alternate" type="text/html" /><updated>2026-05-14T03:42:18+00:00</updated><id>https://mehrdadmoghimi.github.io/feed.xml</id><title type="html">Mehrdad Moghimi</title><subtitle>Ph.D. student, Department of Mathematics and Statistics, York University</subtitle><author><name>Mehrdad Moghimi</name></author><entry><title type="html">Decoupling Time and Risk: Risk-Sensitive RL with General Discounting</title><link href="https://mehrdadmoghimi.github.io/posts/2026/02/rigor/" rel="alternate" type="text/html" title="Decoupling Time and Risk: Risk-Sensitive RL with General Discounting" /><published>2026-02-09T00:00:00+00:00</published><updated>2026-02-09T00:00:00+00:00</updated><id>https://mehrdadmoghimi.github.io/posts/2026/02/rigor</id><content type="html" xml:base="https://mehrdadmoghimi.github.io/posts/2026/02/rigor/"><![CDATA[<p>In standard Reinforcement Learning (RL), the discount factor (\(\gamma\)) is often treated as a fixed parameter of the Markov Decision Process or a tunable hyperparameter for training stability. We typically default to <strong>exponential discounting</strong>, where the value of a reward decays by a constant factor at every time step.</p>

<p>While mathematically convenient, this standard formulation is restrictive. It limits our ability to model complex <strong>time preferences</strong> (how an agent values the future vs. the present) and <strong>risk preferences</strong> (how an agent handles uncertainty) independently.</p>

<p>In our recent <a href="https://arxiv.org/abs/2602.04131">paper</a>, <strong>“Decoupling Time and Risk: Risk-Sensitive RL with General Discounting,”</strong> we propose a unified framework that supports general discount functions and risk measures. By properly handling <strong>time consistency</strong> and tracking accumulated rewards, we show that we can capture more expressive behaviors, like preference reversals, and significantly improve performance in complex environments.</p>

<h2 id="the-problem-with-stationary-hyperbolic-discounting">The Problem with “Stationary” Hyperbolic Discounting</h2>

<p>A major motivation for this work was to revisit <strong>hyperbolic discounting</strong>. Unlike exponential discounting, hyperbolic discounting models agents that are impatient in the short term but patient in the long term—a behavior observed in humans and animals.</p>

<p>A notable approach by <strong>Fedus et al. (2019)</strong> attempted to introduce hyperbolic discounting into Deep RL. They approximated the hyperbolic discount function as a weighted average of multiple exponential discount factors \(\gamma_i\):</p>

\[Q_{\text{hyperbolic}}(s,a) \approx \sum_{i} w_i Q_{\gamma_i}(s,a)\]

<p>However, there is a theoretical issue here. Fedus et al. used <strong>fixed weights</strong> \(w_i\) throughout the episode. By enforcing a stationary policy (one that acts the same way regardless of time), they implicitly reset the agent’s “time zero” at every step. This leads to <strong>time inconsistency</strong>: the policy the agent plans at \(t=0\) is not the policy it wants to execute at \(t=1\). The agent is effectively fighting its future selves.</p>
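
<p>A small numerical sketch makes the inconsistency concrete. The rewards, arrival times, and the coefficient <code>k</code> below are hypothetical values chosen for illustration and are not taken from the paper. The helper <code>choice</code> compares a smaller, sooner reward with a larger, later one, either with the clock reset to zero at the current step (the stationary view) or with the clock that started at \(t=0\) (the time-consistent view).</p>

```python
def hyperbolic(t, k=1.0):
    """Hyperbolic discount d(t) = 1 / (1 + k t); k = 1 is an illustrative choice."""
    return 1.0 / (1.0 + k * t)


def exponential(t, gamma=0.9):
    """Exponential discount d(t) = gamma^t."""
    return gamma ** t


# Hypothetical rewards and arrival times (not from the paper).
SMALL, T_SMALL = 10.0, 5   # smaller reward, arrives sooner
LARGE, T_LARGE = 15.0, 7   # larger reward, arrives later


def choice(discount, now, reset_clock):
    """Return which reward looks better when evaluated at time `now`."""
    if reset_clock:
        # Stationary view: delays are measured from `now`, as if now were t = 0.
        v_small = SMALL * discount(T_SMALL - now)
        v_large = LARGE * discount(T_LARGE - now)
    else:
        # Time-consistent view: keep the original clock and rescale by d(now).
        v_small = SMALL * discount(T_SMALL) / discount(now)
        v_large = LARGE * discount(T_LARGE) / discount(now)
    return "small-soon" if v_small > v_large else "large-late"


hyp_plan = choice(hyperbolic, now=0, reset_clock=True)         # plan made at t = 0
hyp_reset = choice(hyperbolic, now=5, reset_clock=True)        # re-evaluated at t = 5
hyp_consistent = choice(hyperbolic, now=5, reset_clock=False)  # original clock kept
exp_reset = choice(exponential, now=5, reset_clock=True)       # exponential baseline

print(hyp_plan, hyp_reset, hyp_consistent, exp_reset)
```

<p>With these numbers, the hyperbolic agent that resets its clock plans to wait for the larger reward at \(t=0\) but switches to the smaller one at \(t=5\), while the version that keeps the original clock does not switch. The exponential agent is unaffected by the reset because \(\gamma^{t+s}/\gamma^{t} = \gamma^{s}\).</p>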

<h2 id="our-solution-time-dependent-weights">Our Solution: Time-Dependent Weights</h2>

<p>We argue that to solve general discounting problems correctly, the agent must be explicitly aware of time, and the weights must evolve.</p>

<p>In our <strong>multi-horizon framework</strong>, we show that as time \(t\) progresses, the effective contribution of each exponential discount factor \(\gamma_i\) changes. The weights should not be static constants \(w_i\), but rather time-dependent weights \(w_{i,t}\):</p>

\[w_{i,t} \propto w_i \gamma_i^t\]

<p>Because these weights vary with time, the decision problem becomes <strong>non-stationary</strong>. To handle this, our agent learns a set of distributional value functions for different \(\gamma_i\)’s, but combines them dynamically based on the current time step \(t\).</p>
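
<p>The effect of evolving the weights can be checked numerically. The sketch below is a minimal illustration, not the paper's implementation: it assumes the hyperbolic discount \(d(t) = 1/(1+kt)\), builds the mixture from a midpoint Riemann sum of the identity \(1/(1+kt) = \int_0^1 u^{kt}\,du\), and normalises \(w_{i,t}\) to sum to one. The values of <code>K</code>, <code>N</code>, <code>t</code>, and <code>s</code> are arbitrary choices for the example.</p>

```python
K = 0.1    # hyperbolic coefficient (illustrative)
N = 1000   # number of exponential components (illustrative)


def hyperbolic(t, k=K):
    """Hyperbolic discount d(t) = 1 / (1 + k t)."""
    return 1.0 / (1.0 + k * t)


def make_mixture(n=N, k=K):
    """Midpoint Riemann sum of 1/(1 + k t) = integral over u in (0, 1) of u^(k t)."""
    gammas = [((i + 0.5) / n) ** k for i in range(n)]  # gamma_i = u_i^k
    weights = [1.0 / n] * n                            # w_i, equal quadrature weights
    return gammas, weights


def weights_at(t, gammas, weights):
    """Time-dependent weights w_{i,t} proportional to w_i * gamma_i^t (normalised here)."""
    raw = [w * g ** t for g, w in zip(gammas, weights)]
    total = sum(raw)
    return [r / total for r in raw]


def mixture_discount(s, gammas, weights):
    """Discount applied to a reward s steps ahead under the given mixture weights."""
    return sum(w * g ** s for g, w in zip(gammas, weights))


gammas, weights = make_mixture()
t, s = 20, 5
w_t = weights_at(t, gammas, weights)

evolved = mixture_discount(s, gammas, w_t)     # uses w_{i,t}
fixed = mixture_discount(s, gammas, weights)   # uses the time-zero weights w_i
target = hyperbolic(t + s) / hyperbolic(t)     # discount from t to t + s on the original clock

print(f"target={target:.4f} evolved={evolved:.4f} fixed={fixed:.4f}")
```

<p>In this example the evolved weights reproduce \(d(t+s)/d(t)\), whereas the fixed weights reproduce \(d(s)\), which is the discount an agent would apply if its clock were reset at step \(t\).</p>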

<video controls="" autoplay="" loop="" muted="" playsinline="" style="width: 100%;">
  <source src="/files/EvolvingWeightsExact.mp4" type="video/mp4" />
  Your browser does not support the video tag.
</video>

<p>This formulation is theoretically coherent and finds the true optimal non-stationary policy that maximizes the objective defined at the start of the episode.</p>

<h2 id="a-unified-framework-for-time-and-risk">A Unified Framework for Time and Risk</h2>

<p>Our contributions go beyond just fixing hyperbolic discounting. We introduce a broad framework called <strong>RIGOR</strong> (<strong>RI</strong>sk-sensitive RL under <strong>G</strong>eneral discounting <strong>O</strong>f <strong>R</strong>eturns) that decouples time and risk:</p>

<ol>
  <li><strong>Stock-Augmented Distributional RL:</strong> We build on the idea of augmenting the state with a “stock” \(c\) that tracks accumulated rewards. We derive an “Anytime Proxy” equation that guarantees the agent optimizes the global objective from any time \(t\):
\(C^d_0 + G^d_0 \overset{D}{=} d_t \left( C^d_t + G^d_t \right)\)</li>
  <li><strong>General Discount Functions:</strong> Our method supports any non-increasing discount function (hyperbolic, quasi-hyperbolic, etc.), not just exponential.</li>
  <li><strong>OCE Risk Measures:</strong> By operating on the full return distribution, we can optimize for <strong>Optimized Certainty Equivalent (OCE)</strong> risk measures, such as <strong>Conditional Value at Risk (CVaR)</strong> or <strong>Entropic Risk</strong>.</li>
</ol>
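
<p>To build intuition for the Anytime Proxy equation, the sketch below checks it along a single reward sequence under one plausible reading of the symbols. The definitions used here are assumptions made for illustration and may differ from the paper's: \(G^d_t = \sum_{k \ge t} (d_k / d_t)\, R_k\) is the return from time \(t\) measured in time-\(t\) units, and \(C^d_t = (C^d_0 + \sum_{k &lt; t} d_k R_k) / d_t\) is the stock of rewards collected so far, rescaled the same way. The hyperbolic discount and its coefficient are also illustrative.</p>

```python
import random


def hyperbolic(t, k=0.5):
    """Illustrative discount function d(t) = 1 / (1 + k t), with d(0) = 1."""
    return 1.0 / (1.0 + k * t)


def anytime_proxy_values(rewards, c0=0.0, discount=hyperbolic):
    """Return (C_0 + G_0, [d_t * (C_t + G_t) for each t]) for one reward sequence.

    Assumed definitions (hypothetical, for illustration only):
      G_t = sum_{k >= t} d_k / d_t * r_k
      C_t = (C_0 + sum_{k < t} d_k * r_k) / d_t
    """
    horizon = len(rewards)
    g0 = sum(discount(k) * r for k, r in enumerate(rewards))
    values = []
    for t in range(horizon):
        d_t = discount(t)
        g_t = sum(discount(k) / d_t * rewards[k] for k in range(t, horizon))
        c_t = (c0 + sum(discount(k) * rewards[k] for k in range(t))) / d_t
        values.append(d_t * (c_t + g_t))
    return c0 + g0, values


random.seed(0)
rewards = [random.uniform(-1.0, 1.0) for _ in range(12)]
target, values = anytime_proxy_values(rewards, c0=2.0)
max_gap = max(abs(v - target) for v in values)
print(f"C_0 + G_0 = {target:.6f}, largest deviation over t = {max_gap:.2e}")
```

<p>Under these assumed definitions the identity holds along every trajectory, so it also holds in distribution. The practical point is that an agent observing the stock at time \(t\) has enough information to keep optimizing the objective that was fixed at the start of the episode.</p>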

<p><img src="/images/utilities.png" alt="OCE Risk Measures Utilities" />
<em>Utility functions for common OCE risk measures. Our framework allows us to plug in different utility functions \(f\) to shape the agent’s risk profile, independent of the discount function.</em></p>
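
<p>As a minimal sketch of how a utility function determines the risk measure, the code below evaluates the standard OCE form \(\sup_b \{\, b + \mathbb{E}[f(X - b)] \,\}\) on a small sample of returns. The sample values, the levels <code>ALPHA</code> and <code>BETA</code>, and the search over candidate values of \(b\) are assumptions made for the example. CVaR is taken here as the mean of the worst \(\alpha\)-fraction of returns, and the entropic risk as \(-\tfrac{1}{\beta}\log \mathbb{E}[e^{-\beta X}]\).</p>

```python
import math


def oce(samples, f, candidates):
    """Sample-based OCE: maximum over candidate b of  b + mean(f(x - b))."""
    n = len(samples)
    return max(b + sum(f(x - b) for x in samples) / n for b in candidates)


ALPHA = 0.2   # CVaR level (illustrative)
BETA = 0.5    # entropic risk parameter (illustrative)


def f_cvar(x):
    """Utility whose OCE is the lower-tail CVaR at level ALPHA."""
    return min(x, 0.0) / ALPHA


def f_entropic(x):
    """Utility whose OCE is the entropic risk with parameter BETA."""
    return (1.0 - math.exp(-BETA * x)) / BETA


# Hypothetical sample of returns.
returns = [-4.0, -1.0, 0.0, 2.0, 3.0, 5.0, 6.0, 7.0, 8.0, 10.0]

# The CVaR objective is piecewise linear in b, so a maximiser lies at a sample point.
cvar_oce = oce(returns, f_cvar, candidates=returns)
worst = sorted(returns)[: round(ALPHA * len(returns))]
cvar_direct = sum(worst) / len(worst)

# The entropic objective is smooth, so a fine grid over b is used instead.
grid = [-10.0 + 0.01 * i for i in range(2001)]
entropic_oce = oce(returns, f_entropic, candidates=grid)
entropic_direct = -math.log(sum(math.exp(-BETA * x) for x in returns) / len(returns)) / BETA

print(f"CVaR: {cvar_oce:.4f} vs {cvar_direct:.4f}")
print(f"Entropic: {entropic_oce:.4f} vs {entropic_direct:.4f}")
```

<p>Only the utility changes between the two cases. Nothing in the computation refers to a discount function, which is the sense in which the risk profile can be chosen separately from the time preference.</p>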

<h2 id="preference-reversals-in-wealth-management">Preference Reversals in Wealth Management</h2>

<p>To demonstrate that our agent captures human-like time preferences, we tested it on a “Goal-Based Wealth Management” problem. We compared a standard Exponential agent against our Hyperbolic agent.</p>

<p>The results showed a clear <strong>preference reversal</strong>. When the “late goal” (stars) was more valuable, the Hyperbolic agent (red) showed impatience for immediate rewards but patience for distant ones, shifting its probability of success in a way the Exponential agent (blue) could not capture.</p>

<p><img src="/images/gbwm-risk-E.png" alt="Goal-Based Wealth Management" />
<em>Monte-Carlo probabilities of achieving goals. The shift in the red markers (Hyperbolic) compared to the blue (Exponential) illustrates the agent’s non-linear time preference, capturing behaviors that standard RL misses.</em></p>

<h2 id="improving-performance-in-atari">Improving Performance in Atari</h2>

<p>Finally, we evaluated whether fixing the theoretical inconsistency in Fedus et al. actually matters for performance. We compared our <strong>Time-Consistent</strong> agent against the <strong>Time-Inconsistent</strong> baseline across 50 Atari games.</p>

<p>It does. By correctly modeling the non-stationary optimal policy and evolving the weights \(w_{i,t}\) over time, our method achieved higher returns in <strong>39 out of 50 games</strong>, with a mean improvement of roughly <strong>40%</strong>.</p>

<p><img src="/images/improvement2.png" alt="Atari Performance Improvement" />
<em>Relative performance improvement of our Time-Consistent algorithm across 50 Atari games. The improvement is positive in most games, which illustrates the benefit of maintaining time-consistency under hyperbolic discounting.</em></p>

<h2 id="conclusion">Conclusion</h2>

<p>Discounting is a fundamental part of the problem definition. It encodes <strong>time preference</strong>, which is distinct from the <strong>risk preference</strong> encoded in the objective function.</p>

<p>By decoupling these two dimensions and ensuring our optimization remains time-consistent, we can build RL agents that are not only more expressive and robust but also perform better on complex control tasks.</p>

<p>For the full theoretical analysis, performance bounds, and proofs, please check out the <a href="https://arxiv.org/abs/2602.04131">paper</a>.</p>

<h2 id="references">References</h2>

<p>Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. (2019). Hyperbolic Discounting and Learning over Multiple Horizons. <a href="https://arxiv.org/abs/1902.06865">arXiv preprint arXiv:1902.06865</a>.</p>

<hr />]]></content><author><name>Mehrdad Moghimi</name></author><category term="Distributional RL" /><category term="Stock-augmentation" /><category term="General Discounting" /><summary type="html"><![CDATA[In standard Reinforcement Learning (RL), the discount factor (\(\gamma\)) is often treated as a fixed parameter of the Markov Decision Process or a tunable hyperparameter for training stability. We typically default to exponential discounting, where the value of a reward decays by a constant factor at every time step.]]></summary></entry></feed>