Zheng Wen

ORCID: 0000-0001-6987-6463
Research Areas
  • Advanced Bandit Algorithms Research
  • Machine Learning and Algorithms
  • Reinforcement Learning in Robotics
  • Optimization and Search Problems
  • Auction Theory and Applications
  • Data Stream Mining Techniques
  • Smart Grid Energy Management
  • Supply Chain and Inventory Management
  • Adversarial Robustness in Machine Learning
  • Multimodal Machine Learning Applications
  • Consumer Market Behavior and Pricing
  • Gaussian Processes and Bayesian Inference
  • Domain Adaptation and Few-Shot Learning
  • Power Systems and Renewable Energy
  • Recommender Systems and Techniques
  • Mobile Crowdsensing and Crowdsourcing
  • Advanced Algorithms and Applications
  • Explainable Artificial Intelligence (XAI)
  • Advanced Control Systems Optimization
  • Game Theory and Applications
  • Topic Modeling
  • Optical measurement and interference techniques
  • Distributed Sensor Networks and Detection Algorithms
  • Transportation and Mobility Innovations
  • Generative Adversarial Networks and Image Synthesis

Jiangsu University
2023

DeepMind (United Kingdom)
2019-2023

Adobe Systems (United States)
2016-2021

Google (United States)
2019-2021

Yunnan University
2018

Northeastern University
2007-2017

CRRC (China)
2016

Statistics Belgium
2016

Shanghai University
2016

Yahoo (United States)
2014-2015

10.1561/2200000070 article EN Foundations and Trends® in Machine Learning 2018-01-01

Demand response (DR) for residential and small commercial buildings is estimated to account for as much as 65% of the total energy savings potential of DR, and previous work shows that a fully automated energy management system (EMS) is a necessary prerequisite for DR in these areas. In this paper, we propose a novel EMS formulation for DR problems in these sectors. Specifically, we formulate the EMS's rescheduling problem as a reinforcement learning (RL) problem, and argue that this RL problem can be approximately solved by decomposing it over device clusters. Compared...

10.1109/tsg.2015.2396993 article EN IEEE Transactions on Smart Grid 2015-02-16
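
The decomposition idea can be illustrated with a toy sketch: instead of one Q-function over the exponential joint action space of all devices, each device cluster keeps its own small Q-table and performs its own TD update. Everything here (the stand-in environment, cluster count, and cost function) is hypothetical and not the paper's formulation:

```python
import numpy as np

# Hypothetical sketch: decompose an EMS rescheduling RL problem over
# device clusters, learning one small Q-table per cluster instead of a
# joint Q over the exponential joint action space.
n_clusters, n_states, n_actions = 3, 24, 4   # e.g., hour-of-day state
Q = np.zeros((n_clusters, n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

def step(state, actions):
    # Stand-in environment: rewards add across clusters (assumption).
    rewards = -np.abs(actions - state % n_actions)  # placeholder cost
    return (state + 1) % n_states, rewards

state = 0
for t in range(10_000):
    # Each cluster picks its own epsilon-greedy action.
    acts = np.array([
        rng.integers(n_actions) if rng.random() < eps
        else int(np.argmax(Q[c, state]))
        for c in range(n_clusters)
    ])
    nxt, rws = step(state, acts)
    for c in range(n_clusters):  # independent per-cluster TD updates
        td = rws[c] + gamma * Q[c, nxt].max() - Q[c, state, acts[c]]
        Q[c, state, acts[c]] += alpha * td
    state = nxt
```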

We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions. We explain why versions of least-squares value iteration that use Boltzmann or epsilon-greedy exploration can be highly inefficient, and we present computational results that demonstrate dramatic efficiency gains enjoyed by RLSVI. Further, we establish an upper bound on the expected regret of RLSVI that demonstrates near-optimality in a tabula rasa context. More...

10.48550/arxiv.1402.0635 preprint EN other-oa arXiv (Cornell University) 2014-01-01
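
A minimal sketch of the RLSVI idea under a Bayesian linear regression view: fit regularized least squares to noise-perturbed targets and sample the value parameters from the resulting Gaussian, then act greedily on the sample. The finite-horizon recursion and the constants of the actual algorithm are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

def randomized_lsq(Phi, targets, sigma=1.0, lam=1.0):
    """One RLSVI-style step (sketch): fit regularized least squares to
    noise-perturbed targets, then sample the value parameters from the
    induced Gaussian so the value estimate randomizes plausibly."""
    noisy = targets + rng.normal(0.0, sigma, size=targets.shape)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    theta_bar = np.linalg.solve(A, Phi.T @ noisy)
    cov = sigma**2 * np.linalg.inv(A)   # Gaussian N(theta_bar, sigma^2 A^-1)
    return rng.multivariate_normal(theta_bar, cov)

# Acting greedily w.r.t. Phi(s, a) @ theta_tilde each episode yields
# exploration akin to posterior sampling (illustration only).
Phi = rng.normal(size=(50, 5))          # feature matrix for (s, a) pairs
targets = Phi @ rng.normal(size=5) + rng.normal(0, 0.1, size=50)
theta_tilde = randomized_lsq(Phi, targets)
```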

A search engine usually outputs a list of $K$ web pages. The user examines this list, from the first page to the last, and chooses the first attractive page. This model of user behavior is known as the cascade model. In this paper, we propose cascading bandits, a learning variant of the cascade model where the objective is to identify the $K$ most attractive items. We formulate our problem as a stochastic combinatorial partial monitoring problem. We propose two algorithms for solving it, CascadeUCB1 and CascadeKL-UCB. We also prove gap-dependent upper bounds on the regret of these algorithms and derive a lower bound in...

10.48550/arxiv.1502.02763 preprint EN other-oa arXiv (Cornell University) 2015-01-01
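
A compact CascadeUCB1-style loop, assuming Bernoulli attraction probabilities and top-K ranking by UCB; the initialization and the exact confidence radius are simplified relative to the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
L, K, T = 20, 4, 5000
w = rng.uniform(0.05, 0.4, size=L)       # true attraction probabilities
pulls, wins = np.ones(L), np.zeros(L)    # simplified optimistic init

for t in range(1, T + 1):
    ucb = wins / pulls + np.sqrt(1.5 * np.log(t) / pulls)
    ranked = np.argsort(-ucb)[:K]        # K items with highest UCBs
    clicks = rng.random(K) < w[ranked]   # cascade: user scans top-down
    c = int(np.argmax(clicks)) if clicks.any() else K  # first click index
    for k in range(min(c + 1, K)):       # only items up to the click are observed
        pulls[ranked[k]] += 1
        wins[ranked[k]] += float(k == c)
```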

A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we analyze a UCB-like algorithm for solving this problem, which is known to be computationally efficient, and prove $O(K L (1 / Δ) \log n)$ and $O(\sqrt{K L n \log n})$ upper bounds on its $n$-step regret, where $L$ is the number of ground items, $K$ is the maximum...

10.48550/arxiv.1410.0949 preprint EN other-oa arXiv (Cornell University) 2014-01-01
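
A CombUCB1-style sketch of the analyzed algorithm: compute a UCB for every ground item and let an offline oracle pick the feasible subset maximizing the optimistic sum. The top-K oracle and Gaussian observation noise below are stand-in assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
L, K, T = 30, 5, 5000
w = rng.uniform(0, 1, size=L)            # true mean item weights
pulls, total = np.ones(L), rng.random(L) # simplified init: one observation each

def oracle(scores, K):
    # Offline maximizer for the combinatorial constraint; top-K here
    # stands in for any structured feasible set (assumption).
    return np.argsort(-scores)[:K]

for t in range(1, T + 1):
    ucb = total / pulls + np.sqrt(1.5 * np.log(t) / pulls)
    A = oracle(ucb, K)                   # optimistic feasible subset
    obs = w[A] + 0.1 * rng.standard_normal(K)  # semi-bandit feedback
    pulls[A] += 1
    total[A] += obs
```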

We study the use of randomized value functions to guide deep exploration in reinforcement learning. This offers an elegant means for synthesizing statistically and computationally efficient exploration with common practical approaches to value function learning. We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through computational studies. We also prove a regret bound that establishes statistical efficiency with a tabular representation.

10.48550/arxiv.1703.07608 preprint EN other-oa arXiv (Cornell University) 2017-01-01
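
One common practical instantiation of randomized value functions is an ensemble of value estimates with one member sampled per episode; committing to that member for the whole episode gives temporally consistent, "deep" exploration. The toy environment and update rule below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(4)
M, S, A = 10, 16, 4            # ensemble size, states, actions
Q = rng.normal(size=(M, S, A)) # randomly initialized ensemble of Q-tables
alpha, gamma = 0.2, 0.9

def env_step(s, a):            # stand-in environment (assumption)
    return (s + a) % S, float((s + a) % S == S - 1)

for episode in range(200):
    m = int(rng.integers(M))   # sample one value function per episode
    s = 0
    for t in range(20):
        a = int(np.argmax(Q[m, s]))      # follow the sampled member greedily
        s2, r = env_step(s, a)
        for j in range(M):               # train each member on its own
            tgt = r + rng.normal(0, 0.1) + gamma * Q[j, s2].max()
            Q[j, s, a] += alpha * (tgt - Q[j, s, a])  # noise-perturbed target
        s = s2
```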

We address the problem of best policy identification in preference-based reinforcement learning (PbRL), where learning occurs from noisy binary preferences over trajectory pairs rather than explicit numerical rewards. This approach is useful for post-training optimization of generative AI models during multi-turn user interactions, where preference feedback can be more robust than handcrafted reward models. In this setting, learning is driven by both an offline dataset -- collected from a rater of unknown 'competence' -- and online data collected with...

10.48550/arxiv.2501.18873 preprint EN arXiv (Cornell University) 2025-01-30
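
Preference-based learning of this kind is often grounded in a Bradley-Terry-style model, where the probability that trajectory A is preferred over B is a logistic function of their reward difference. The sketch below fits a linear reward model to noisy binary preferences by logistic regression; the model choice and all names are assumptions, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 8, 500
X_a, X_b = rng.normal(size=(n, d)), rng.normal(size=(n, d))  # trajectory features
theta_true = rng.normal(size=d)

# Bradley-Terry-style assumption: P(A preferred) = sigmoid(r(A) - r(B)).
p = 1 / (1 + np.exp(-(X_a - X_b) @ theta_true))
y = (rng.random(n) < p).astype(float)   # noisy binary preference labels

theta = np.zeros(d)
for _ in range(500):                    # logistic regression, gradient ascent
    p_hat = 1 / (1 + np.exp(-(X_a - X_b) @ theta))
    theta += 0.5 * (X_a - X_b).T @ (y - p_hat) / n
```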

A matroid is a notion of independence in combinatorial optimization which is closely related to computational efficiency. In particular, it is well known that the maximum of a constrained modular function can be found greedily if and only if the constraints are associated with a matroid. In this paper, we bring together the ideas of bandits and matroids, and propose a new class of combinatorial bandits, matroid bandits. The objective in these problems is to learn how to maximize a modular function on a matroid. This function is stochastic and initially unknown. We propose a practical algorithm for solving our problem,...

10.48550/arxiv.1403.5045 preprint EN other-oa arXiv (Cornell University) 2014-01-01
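
The optimistic-greedy idea is easy to sketch: compute a UCB for every item, then run the greedy matroid maximizer on the UCBs, which is exactly optimal on a matroid. A partition matroid stands in for the general case; parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
L, T = 12, 3000
groups = np.array([i % 3 for i in range(L)])  # partition matroid (assumption)
cap = {0: 2, 1: 2, 2: 1}                      # per-group capacities
w = rng.uniform(0, 1, size=L)                 # true mean item weights
pulls, total = np.ones(L), rng.random(L)

def greedy_basis(scores):
    # Greedy is optimal on a matroid: scan items by decreasing score,
    # keep each one that preserves independence.
    taken, used = [], {g: 0 for g in cap}
    for i in np.argsort(-scores):
        g = groups[i]
        if used[g] < cap[g]:
            taken.append(i)
            used[g] += 1
    return np.array(taken)

for t in range(1, T + 1):
    ucb = total / pulls + np.sqrt(1.5 * np.log(t) / pulls)
    A = greedy_basis(ucb)                     # optimistic basis
    total[A] += w[A] + 0.1 * rng.standard_normal(len(A))
    pulls[A] += 1
```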

A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization, and as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the problem can be solved efficiently. We...

10.48550/arxiv.1406.7443 preprint EN other-oa arXiv (Cornell University) 2014-01-01
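
A CombLinTS-style sketch: maintain linear-regression statistics over item features, sample a parameter vector from a (simplified) Gaussian posterior, and hand the sampled item weights to the offline oracle. Top-K selection stands in for the oracle, and the noise scaling is an assumption:

```python
import numpy as np

rng = np.random.default_rng(7)
L, K, d, T = 50, 5, 8, 2000
X = rng.normal(size=(L, d)) / np.sqrt(d)  # item features
theta_true = rng.normal(size=d)

B, f = np.eye(d), np.zeros(d)             # regression sufficient statistics
for t in range(T):
    theta_s = rng.multivariate_normal(    # posterior sample (simplified
        np.linalg.solve(B, f),            # covariance, no noise scaling)
        np.linalg.inv(B))
    A = np.argsort(-(X @ theta_s))[:K]    # offline oracle on sampled weights
    obs = X[A] @ theta_true + 0.1 * rng.standard_normal(K)
    B += X[A].T @ X[A]                    # update statistics on feedback
    f += X[A].T @ obs
```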

Most recommender systems recommend a list of items. The user examines the list, from the first item to the last, and often chooses the first attractive item and does not examine the rest. This type of behavior can be modeled by the cascade model. In this work, we study cascading bandits, an online learning variant of the cascade model where the goal is to recommend the $K$ most attractive items from a large set of $L$ candidate items. We propose two algorithms for solving this problem, which are based on the idea of linear generalization. The key idea in our solutions is that we learn a predictor of the attraction...

10.48550/arxiv.1603.05359 preprint EN other-oa arXiv (Cornell University) 2016-01-01
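
A sketch in the spirit of linear generalization for cascading bandits: score items with a LinUCB-style index over shared features, rank the top K, and update only on the examined prefix. The feature model and constants are assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
L, K, d, T = 100, 4, 10, 3000
X = rng.normal(size=(L, d)) / np.sqrt(d)     # item features
theta_true = rng.uniform(0, 1, size=d)

M, b = np.eye(d), np.zeros(d)
for t in range(T):
    theta = np.linalg.solve(M, b)
    Minv = np.linalg.inv(M)
    conf = np.sqrt(np.einsum('id,de,ie->i', X, Minv, X))  # per-item width
    ranked = np.argsort(-(X @ theta + 0.5 * conf))[:K]    # UCB ranking
    attr = np.clip(X[ranked] @ theta_true, 0, 1)          # attraction probs
    clicks = rng.random(K) < attr
    c = int(np.argmax(clicks)) if clicks.any() else K
    for k in range(min(c + 1, K)):   # only examined items give feedback
        x = X[ranked[k]]
        M += np.outer(x, x)
        b += x * float(k == c)
```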

Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component with classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm can achieve a nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$...

10.48550/arxiv.1802.03692 preprint EN other-oa arXiv (Cornell University) 2018-01-01
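
A toy version of the UCB-plus-change-detection recipe: a two-window mean-shift test on each arm's recent rewards triggers a restart of the UCB statistics. M-UCB's actual test statistic, threshold, and forced-exploration schedule are simplified away:

```python
import numpy as np

rng = np.random.default_rng(9)
K, T, w = 3, 6000, 100                  # arms, horizon, detection window

def means(t):                           # piecewise-stationary Bernoulli means
    base = np.array([0.2, 0.5, 0.8])
    return base[::-1] if t > T // 2 else base

def change_detected(x, w, b=0.25):
    # Two-window mean-shift test (sketch): compare the first and second
    # halves of the arm's last w observations against a threshold b.
    if len(x) < w:
        return False
    half = w // 2
    return abs(np.mean(x[-w:-half]) - np.mean(x[-half:])) > b

hist = [[] for _ in range(K)]
pulls, total, t0 = np.ones(K), rng.random(K), 0
for t in range(1, T + 1):
    ucb = total / pulls + np.sqrt(2 * np.log(max(t - t0, 2)) / pulls)
    a = int(np.argmax(ucb))
    r = float(rng.random() < means(t)[a])
    hist[a].append(r)
    pulls[a] += 1
    total[a] += r
    if change_detected(hist[a], w):
        pulls, total, t0 = np.ones(K), rng.random(K), t  # restart after change
        hist = [[] for _ in range(K)]
```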

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product...

10.48550/arxiv.1707.02038 preprint EN other-oa arXiv (Cornell University) 2017-01-01
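
The Bernoulli bandit is the canonical opening example for Thompson sampling: keep a Beta posterior per arm, draw one sample from each, and play the argmax. A minimal loop (variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
K, T = 5, 5000
theta = rng.uniform(0.1, 0.9, size=K)   # true success probabilities
alpha, beta = np.ones(K), np.ones(K)    # Beta(1, 1) priors per arm

for t in range(T):
    sample = rng.beta(alpha, beta)      # one draw from each posterior
    a = int(np.argmax(sample))          # act greedily on the sample
    r = float(rng.random() < theta[a])
    alpha[a] += r                       # conjugate posterior update
    beta[a] += 1 - r
```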

We propose a bandit algorithm that explores by randomizing its history of rewards. Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards. We design the pseudo rewards such that the bootstrap mean is optimistic with a sufficiently high probability. We call our algorithm Giro, which stands for garbage in, reward out. We analyze Giro in a Bernoulli bandit and derive an $O(K Δ^{-1} \log n)$ bound on its $n$-round regret, where $Δ$ is the difference between the expected rewards of the optimal and the best suboptimal arms, and $K$ is the number of arms. The main advantage of our exploration is that it can be easily...

10.48550/arxiv.1811.05154 preprint EN other-oa arXiv (Cornell University) 2018-01-01
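
A toy Bernoulli-bandit version of the Giro idea: keep each arm's reward history, append one pseudo reward of 0 and one of 1 after every pull (the "garbage"), and play the arm whose bootstrap resample has the highest mean. The seeding of the histories is a simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(11)
K, T, a_pseudo = 3, 3000, 1             # arms, rounds, pseudo rewards per pull
theta = np.array([0.3, 0.5, 0.7])       # true success probabilities
hist = [[0.0, 1.0] for _ in range(K)]   # seed each arm's history (assumption)

for t in range(T):
    means = []
    for i in range(K):
        h = np.array(hist[i])
        boot = rng.choice(h, size=len(h), replace=True)  # bootstrap resample
        means.append(boot.mean())
    i = int(np.argmax(means))           # pull arm with highest bootstrap mean
    r = float(rng.random() < theta[i])
    hist[i].append(r)
    hist[i] += [0.0, 1.0] * a_pseudo    # pseudo rewards keep the mean optimistic
```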