Three lightweight approaches — no GPU, no neural nets — that still beat hand-crafted rules
🔍 Exploration vs. Exploitation
Every algorithm here faces the same core dilemma: try something new (explore) or
stick with what works (exploit). Each solves it differently.
Simulated Annealing starts hot — accepts even bad moves to escape local optima —
then cools down and converges on the best policy it found.
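The acceptance rule behind that hot-to-cold behaviour is the Metropolis criterion. A minimal sketch (the temperature values in the comments are illustrative, not from the article):

```python
import math
import random

def accept_move(delta_reward: float, temperature: float) -> bool:
    """Metropolis criterion: always accept improvements; accept a
    worse move with probability exp(delta / T), which shrinks as T cools."""
    if delta_reward >= 0:
        return True
    return random.random() < math.exp(delta_reward / temperature)

# Hot (T = 10): a move costing 1.0 reward is accepted ~90% of the time.
# Cold (T = 0.1): the same move is accepted ~0.005% of the time.
```

The same -1.0 move goes from almost always accepted to almost never accepted purely through the cooling schedule, which is what lets the search roam early and settle late.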
Thompson Sampling maintains a probability distribution over each price's
expected reward. Uncertainty drives exploration naturally: untested prices have wide
distributions and are sampled often; well-tested prices are exploited confidently.
Q-Learning starts fully random (ε = 100%) and gradually reduces
exploration as its Q-table fills with experience.
📊 Three Algorithms, One Problem
Simulated Annealing (SA) searches for an optimal price table:
one price per (inventory level × time remaining) cell. It perturbs one cell at a time
and uses a temperature schedule to accept worse solutions early on.
Pure NumPy — trains in ~3 s on any CPU.
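The search loop can be sketched in a few lines of NumPy. Everything problem-specific here is an assumption for illustration — the toy linear demand curve, the price grid, and the schedule constants — but the structure (perturb one cell, accept via temperature, track the best table) matches the description above:

```python
import numpy as np

rng = np.random.default_rng(42)

N_INV, N_TIME = 5, 10                 # inventory levels x time-remaining buckets
PRICES = np.array([5.0, 7.5, 10.0, 12.5, 15.0])

def revenue(table: np.ndarray) -> float:
    """Toy objective (an assumption): mean revenue under a linear
    price-sensitive demand curve, in place of the real simulator."""
    demand = np.clip(1.0 - table / 20.0, 0.0, 1.0)   # higher price -> less demand
    return float((table * demand).mean())

# Start from a random table: one price per (inventory, time) cell
table = PRICES[rng.integers(0, len(PRICES), size=(N_INV, N_TIME))]
best, best_rev = table.copy(), revenue(table)

T, cooling = 2.0, 0.995               # temperature schedule (assumed values)
for step in range(2000):
    cand = table.copy()
    i, j = rng.integers(N_INV), rng.integers(N_TIME)
    cand[i, j] = rng.choice(PRICES)               # perturb a single cell
    delta = revenue(cand) - revenue(table)
    if delta >= 0 or rng.random() < np.exp(delta / T):
        table = cand                              # may accept a worse move, early on
        if revenue(table) > best_rev:
            best, best_rev = table.copy(), revenue(table)
    T *= cooling                                  # cool down
```

Under this demand curve the per-cell optimum is a price of 10.0, so the best table found drifts toward a grid of 10s as the temperature drops.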
Thompson Sampling (TS) treats each price as a "bandit arm" per market context.
After each step it updates a Gaussian posterior: mean reward and uncertainty per price.
No gradients, no episodes — purely Bayesian updating.
Q-Learning (QL) stores Q(state, action) values in a lookup table and
updates them via the Bellman equation after every step. An ensemble of 4 runs with
averaged Q-tables reduces variance from random initialisation.
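The tabular update and the 4-run ensemble can be sketched as follows. The toy sell-probability environment, the episode count, and the ε floor are assumptions; the Bellman step and the averaged Q-tables follow the description above:

```python
import numpy as np

rng = np.random.default_rng(7)

N_INV, N_TIME = 4, 6
PRICES = np.array([4.0, 10.0, 16.0])    # candidate actions (example values)

def step_env(inv: int, a: int):
    """Toy environment (an assumption): a unit sells with probability
    1 - price/20; the reward is the price when it sells."""
    sold = inv > 0 and rng.random() < 1.0 - PRICES[a] / 20.0
    return inv - int(sold), (PRICES[a] if sold else 0.0)

def train(episodes=3000, gamma=1.0):
    # Random initialisation -- the source of run-to-run variance
    Q = rng.normal(0.0, 0.1, size=(N_INV + 1, N_TIME + 1, len(PRICES)))
    Q[:, 0, :] = 0.0                     # no value at the horizon
    visits = np.zeros_like(Q)
    eps = 1.0                            # start fully random
    for _ in range(episodes):
        inv = N_INV
        for t in range(N_TIME, 0, -1):
            if rng.random() < eps:
                a = int(rng.integers(len(PRICES)))   # explore
            else:
                a = int(np.argmax(Q[inv, t]))        # exploit
            inv2, r = step_env(inv, a)
            visits[inv, t, a] += 1
            alpha = 1.0 / visits[inv, t, a]          # decaying learning rate
            # Bellman update: move Q toward reward + best next-state value
            target = r + gamma * Q[inv2, t - 1].max()
            Q[inv, t, a] += alpha * (target - Q[inv, t, a])
            inv = inv2
        eps = max(0.05, eps * 0.999)                 # gradually reduce exploration
    return Q

# Ensemble: average the Q-tables of 4 independent runs to damp the
# variance left over from random initialisation and exploration noise
Q_avg = np.mean([train() for _ in range(4)], axis=0)
```

In this toy setting the middle price maximises expected per-step revenue, so the averaged table's greedy action at the starting state settles on it.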
🏢 Business Applications
This exact framework applies to real business problems:
- E-commerce: Dynamic pricing for flash sales & limited stock
- Hotels & Airlines: Yield management — sell seats at the right price as the flight fills up
- Advertising: Bid optimisation in real-time auctions
- Supply Chain: Inventory replenishment and markdown decisions
- Ride-sharing: Surge pricing based on real-time demand signals
- SaaS: Optimal discount depth for conversion vs. lifetime value
All three algorithms here run on a single CPU core with no GPU — making them
deployable in cost-constrained environments without sacrificing learning quality.