

Reading notes: LADDER: A Human-Level Bidding Agent for Large-Scale Real-Time Online Auctions

This post is about the paper on JD's real-time online bidding system. The paper introduces JD's LADDER system, the first deep reinforcement learning agent to successfully handle a large-scale real-time problem directly from raw inputs carrying complex semantic information. The agent is based on DASQN, an asynchronous stochastic variant of DQN. The system increased ad revenue by more than 50% while also significantly improving the advertisers' return on investment (ROI).

Abstract: We present LADDER, the first deep reinforcement learning agent that can successfully learn control policies for large-scale real-world problems directly from raw inputs composed of high-level semantic information. The agent is based on an asynchronous stochastic variant of DQN (Deep Q Network) named DASQN. The inputs of the agent are plain-text descriptions of states of a game of incomplete information, i.e. real-time large scale online auctions, and the rewards are auction profits of very large scale. We apply the agent to an essential portion of JD's online RTB (real-time bidding) advertising business and find that it easily beats the former state-of-the-art bidding policy that had been carefully engineered and calibrated by human experts: during JD.com's June 18th anniversary sale, the agent increased the company's ads revenue from the portion by more than 50%, while the advertisers' ROI (return on investment) also improved significantly.

Contents:
Problems to be solved:
First, the solution space of the auction game is tremendous. The JD DSP system is bidding on 100,000s of auctions per second; assuming we have 10 actions and each day is an episode (ad plans are usually on a daily basis), simple math shows the solution space is on the order of 10^(10^9). For comparison, the solution space of the game of Go is about 10^360 (Allis and others 1994; Silver et al. 2016).
Second, state-of-the-art RL algorithms are inherently sequential, hence cannot be applied to large-scale practical problems such as the auction game, for our online service cannot afford the inefficiencies of sequential algorithms.
Third, auction requests are actually triggered by JD users and randomness of human behaviors implies stochastic transitions of states. That’s very different from Atari games, text-based games (Narasimhan, Kulkarni, and Barzilay 2015) and the game of Go (Silver et al. 2016).
Besides, we have widely ranged rewards of which the maximum may be 100,000 times larger than the minimum, which implies only very expressive models are suitable.
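The back-of-the-envelope arithmetic behind the first point can be checked directly. This is my own sketch, taking the per-second auction volume and the 10-action assumption from the text above:

```python
# Order-of-magnitude check of the solution-space claim (my own back-of-the-
# envelope arithmetic, not a reproduction of the paper's exact numbers).
import math

auctions_per_second = 100_000          # "100,000s of auctions per second"
seconds_per_day = 24 * 60 * 60
actions = 10                           # assumed discrete bidding actions

# One episode is one day, so the number of decision steps per episode:
steps_per_episode = auctions_per_second * seconds_per_day   # ~8.6e9

# Solution space = actions ** steps; with 10 actions, the exponent of 10
# is simply the number of steps.
exponent = steps_per_episode * math.log10(actions)
print(f"steps per episode ≈ 10^{math.log10(steps_per_episode):.1f}")
print(f"solution space ≈ 10^{exponent:.3g}")   # vastly beyond Go's ~10^360
```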

Approach:
In this paper, we model the auction game as a partially observable Markov decision process (POMDP) and present the DASQN algorithm, which resolves the inherent sequentiality of RL algorithms and handles the stochastic transitions of the game. We encode each auction request into plain text in a domain-specific natural language, feed the encoded request to a deep convolutional neural network (CNN), and make full use of the high-level semantic information without any sophisticated feature engineering. The result is a lightweight model, both responsive and expressive, that updates in real time and reacts rapidly to changes in the auction environment. The whole architecture is named LADDER.
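The sequentiality problem can be pictured with a toy sketch. This is my own minimal illustration of decoupling action serving from learning, not the actual DASQN algorithm: an actor answers every auction request immediately from the current parameter snapshot, while a learner consumes logged transitions asynchronously, off the critical path.

```python
# Toy illustration (not DASQN itself): the actor never blocks on learning.
import queue
import random
import threading

q_values = {a: 0.0 for a in range(10)}   # shared toy "Q table" for one state
transitions = queue.Queue()              # actor -> learner hand-off

def serve(requests):
    """Actor: must answer every auction request with bounded latency."""
    answers = []
    for _ in requests:
        a = max(q_values, key=q_values.get)  # greedy w.r.t. current snapshot
        reward = random.random()             # stand-in for auction profit
        transitions.put((a, reward))         # log transition, don't wait
        answers.append(a)
    return answers

def learn(n, lr=0.1):
    """Learner: updates parameters in the background."""
    for _ in range(n):
        a, reward = transitions.get()
        q_values[a] += lr * (reward - q_values[a])

learner = threading.Thread(target=learn, args=(100,))
learner.start()
answers = serve(range(100))   # serving proceeds while learning runs
learner.join()
```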
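A minimal sketch of the plain-text encoding idea follows. The field names, the `key:value` line format, and the character-level vectorization are my own guesses for illustration, not the paper's actual domain-specific language:

```python
# Hypothetical sketch: render an auction request as a plain-text line,
# then map it to a fixed-length integer sequence a character-level CNN
# could consume. All field names here are invented for illustration.
def encode_request(request: dict) -> str:
    """Render an auction request as one plain-text line (sorted for determinism)."""
    return " ".join(f"{k}:{v}" for k, v in sorted(request.items()))

def vectorize(text: str, max_len: int = 64) -> list[int]:
    """Map characters to integer ids, padded/truncated to a fixed length."""
    ids = [ord(c) % 128 for c in text[:max_len]]
    return ids + [0] * (max_len - len(ids))

req = {"user_region": "beijing", "ad_slot": "app_banner", "hour": 21}
line = encode_request(req)
vec = vectorize(line)
```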

For the full details, see the paper:
https://arxiv.org/abs/1708.05565

Last updated: 2017-08-23 10:33:18
