- Advanced Bandit Algorithms Research
- Reinforcement Learning in Robotics
- Military Defense Systems Analysis
- Guidance and Control Systems
- Adversarial Robustness in Machine Learning
- Optimization and Search Problems
- UAV Applications and Optimization
- Aerospace and Aviation Technology
- Musculoskeletal Pain and Rehabilitation
- Smart Grid Energy Management
- Auction Theory and Applications
- Robotics and Sensor-Based Localization
- Data Stream Mining Techniques
- Pain Management and Opioid Use
- Machine Learning and ELM
- Simulation Techniques and Applications
- Age of Information Optimization
- Machine Learning and Algorithms
- Influenza Virus Research Studies
- Advanced Image Processing Techniques
- Game Theory and Voting Systems
- Spacecraft Dynamics and Control
- Face and Expression Recognition
- Pain Management and Placebo Effect
- Image Processing Techniques and Applications
Institute of Electronics
2024
Nanjing University of Information Science and Technology
2020-2024
Tsinghua University
2019
This study aimed to investigate the pain situation, functional limitations, treatments used, care-seeking behaviors, and educational preferences of adults with musculoskeletal pain in mainland China. An online questionnaire was developed through expert validation, and participants were recruited via social media platforms. Inclusion criteria required access to the Internet via smartphones, while individuals with significant cognitive impairments or severe mental illness were excluded. Of 1566 participants, predominantly male (951) a...
We study the problem of allocating T indivisible items that arrive online to agents with additive valuations. The allocation must satisfy a prominent fairness notion, envy-freeness up to one item (EF1), at each round. To make this possible, we allow the reallocation of previously allocated items, but aim to minimize these so-called adjustments. For the case of two agents, we show that algorithms that are informed about the values of future items can get by without any adjustments, whereas uninformed algorithms require Theta(T) adjustments. For the general case of three or...
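The EF1 condition used in this abstract can be sketched as a simple check: under additive valuations, agent i should value its own bundle at least as much as any other agent's bundle after removing that bundle's single most valuable item (in i's eyes). This is an illustrative check only, not the paper's online algorithm; the value-matrix representation is an assumption.

```python
import numpy as np

def is_ef1(values, bundles):
    """Check envy-freeness up to one item (EF1) for additive valuations.

    values[i][g] : agent i's value for item g
    bundles[i]   : set of item indices currently held by agent i
    """
    n = len(bundles)
    for i in range(n):
        own = sum(values[i][g] for g in bundles[i])
        for j in range(n):
            if i == j:
                continue
            other = sum(values[i][g] for g in bundles[j])
            # most valuable single item in j's bundle, from i's perspective
            best = max((values[i][g] for g in bundles[j]), default=0)
            if own < other - best:
                return False
    return True

# two agents, four items: neither agent envies the other up to one item
vals = [[5, 1, 3, 2],
        [2, 4, 1, 5]]
is_ef1(vals, [{0, 2}, {1, 3}])  # True
```

In the online setting studied in the paper, such a check would be run after each arriving item, with reallocations ("adjustments") made when no EF1-preserving assignment exists.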
Semi-supervised learning has been proven to be effective in utilizing unlabeled samples to mitigate the problem of limited labeled data. Traditional semi-supervised methods generate pseudo-labels for unlabeled samples and train a classifier using both labeled and pseudo-labeled samples. However, in data-scarce scenarios, reliance on the initial pseudo-label generation can degrade performance. Methods based on consistency regularization have shown promising results by encouraging consistent outputs for different semantic variations of the same sample obtained...
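The consistency-regularization idea mentioned above can be sketched as a loss term that penalizes disagreement between a model's predictions on two augmented views of the same unlabeled sample. This is a minimal illustration; the model, augmentations, and loss form used in the paper are not specified here, so these names are assumptions.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_weak, logits_strong):
    """Consistency regularization sketch: penalize the squared difference
    between predicted class distributions for two augmented views
    (e.g., weak vs. strong augmentation) of the same unlabeled sample."""
    p = softmax(logits_weak)
    q = softmax(logits_strong)
    return float(np.mean((p - q) ** 2))
```

During training, this term would be added to the supervised loss on labeled data, encouraging the classifier to be invariant to semantic-preserving perturbations.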
We study federated contextual linear bandits, where $M$ agents cooperate with each other to solve a global bandit problem with the help of a central server. We consider the asynchronous setting, where all agents work independently and communication between one agent and the server does not trigger other agents' communication. We propose a simple algorithm named \texttt{FedLinUCB} based on the principle of optimism. We prove that the regret is bounded by $\tilde{O}(d\sqrt{\sum_{m=1}^M T_m})$ and the communication complexity is $\tilde{O}(dM^2)$, where $d$ is the dimension of the context vector and $T_m$...
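The per-agent optimistic selection underlying LinUCB-style algorithms can be sketched as follows. This is a generic single-agent sketch, not the paper's \texttt{FedLinUCB}: the exact confidence radius and the asynchronous synchronization rule with the server are simplified assumptions.

```python
import numpy as np

class LinUCBAgent:
    """Minimal optimistic linear bandit agent (illustrative sketch)."""

    def __init__(self, d, lam=1.0, beta=1.0):
        self.A = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)       # feature-weighted reward sum
        self.beta = beta           # confidence-radius scale

    def choose(self, contexts):
        # pick the arm maximizing the upper confidence bound
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta + self.beta * bonus))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

In the federated setting, each agent would accumulate local $(A, b)$ increments and exchange them with the central server only occasionally, which is what keeps the communication complexity low.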
Command and control in modern air combat shows a pressing demand for intelligent decision-making. Research on decision-making techniques relies on data accumulation through simulations, which still remains underdeveloped at present. In this paper we introduce a tactical-level simulation system for decision-making, which can simulate combat between formations and has multiple application modes, namely Man-Man, Man-Machine, and Machine-Machine. This system can collect diversified fine-grained data and provide support for the training of air-combat...
In emergency search and rescue scenarios, the quick location of trapped people is essential. However, disasters can render the Global Positioning System (GPS) unusable. Unmanned aerial vehicles (UAVs) with localization devices can serve as mobile anchors due to their agility and high line-of-sight (LoS) probability. Nonetheless, the number of available UAVs during the initial stages of disaster relief is limited, so innovative methods are needed to quickly plan UAV trajectories to locate non-uniformly distributed dynamic targets...
Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit and design an...
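The contextual dueling bandit formulation typically models pairwise preferences with a Bradley-Terry-style link, and a query-efficient method can prioritize the most uncertain comparisons. The sketch below illustrates that idea; the feature representation and the "closest to 0.5" selection heuristic are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def preference_prob(theta, x_a, x_b):
    """Bradley-Terry style preference model commonly used in dueling-bandit
    formulations of RLHF: P(a preferred over b) = sigmoid(<theta, x_a - x_b>)."""
    z = theta @ (x_a - x_b)
    return 1.0 / (1.0 + np.exp(-z))

def most_informative_pair(theta, pairs):
    """Active-learning heuristic (assumption): query the duel whose predicted
    preference probability is closest to 0.5, i.e., the most uncertain one."""
    probs = [preference_prob(theta, a, b) for a, b in pairs]
    return int(np.argmin([abs(p - 0.5) for p in probs]))
```

Querying human labelers only on such uncertain pairs is what makes the overall procedure query-efficient.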
We study constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinite episodes with high probability. We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs), where both the transition kernel and the reward function can be approximated by some linear function up to a misspecification level $\zeta$. At the core of Cert-LSVI-UCB is an innovative certified estimator, which facilitates a fine-grained concentration...
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose...
Although Convolutional Neural Networks (CNNs) have significantly advanced SAR image super-resolution (SR) technology in recent years, it remains very challenging to reconstruct images with large-scale factors, such as ×4 and ×8, due to the limited information available from the low-resolution image. Co-registered high-resolution optical imagery has been successfully applied to enhance reconstruction quality through its discriminative characteristics. Compared with single-frame SR reconstruction technology, image-guided methods achieve better...
We study multi-agent reinforcement learning in the setting of episodic Markov decision processes, where multiple agents cooperate via communication through a central server. We propose a provably efficient algorithm based on value iteration that enables asynchronous communication while preserving the advantage of cooperation with low communication overhead. With linear function approximation, we prove that our algorithm enjoys an $\tilde{\mathcal{O}}(d^{3/2}H^2\sqrt{K})$ regret with $\tilde{\mathcal{O}}(dHM^2)$ communication complexity, where $d$ is the feature dimension, $H$...
This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL), where the transition dynamics can be corrupted by an adversary. Existing studies on corruption-robust RL mostly focus on the setting of model-free RL, where robust least-squares regression is often employed for value function estimation. However, these techniques cannot be directly applied to model-based RL. In this paper, we focus on model-based RL and take the maximum likelihood estimation (MLE) approach to learn the transition model. Our work encompasses both online...
We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs), whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the feature dimension, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted regression scheme with a carefully designed weight, which...
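The weighted regression step at the heart of such algorithms can be sketched as weighted ridge regression. This is a generic sketch only: in the paper the weights are derived from carefully designed variance estimates, which are not reproduced here.

```python
import numpy as np

def weighted_ridge(X, y, w, lam=1.0):
    """Weighted ridge regression sketch:
        theta = (X^T W X + lam * I)^{-1} X^T W y
    where W = diag(w). Down-weighting high-variance samples is what yields
    the sharper, variance-aware confidence sets used by such algorithms."""
    d = X.shape[1]
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X + lam * np.eye(d), X.T @ W @ y)
```

With all weights equal this reduces to ordinary ridge regression; the regret improvement comes from choosing the weights adaptively.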
We study the linear contextual bandit problem in the presence of adversarial corruption, where the reward at each round is corrupted by an adversary, and the corruption level (i.e., the sum of corruption magnitudes over the horizon) is $C\geq 0$. The best-known algorithms for this setting are limited in that they are either computationally inefficient, require a strong assumption on the corruption, or have regret at least $C$ times worse than without corruption. In this paper, to overcome these limitations, we propose a new algorithm based on the principle of optimism in the face...
Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolate between the worst-case regime and the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown variance of the noise. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of the noise...
We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class up to a bounded misspecification level $\zeta>0$. We propose an algorithm based on a novel data selection scheme, which only selects the contextual vectors with large uncertainty for online regression. We show that, when $\zeta$ is dominated by $\tilde O (\Delta / \sqrt{d})$, with $\Delta$ being the minimal sub-optimality gap and $d$ the dimension of the contextual vectors, our algorithm enjoys the same gap-dependent regret bound...
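The uncertainty-based data selection described in this abstract can be sketched with the standard elliptical uncertainty measure: a context $x$ is kept for regression only if $x^\top A^{-1} x$ exceeds a threshold, where $A$ is the Gram matrix of previously selected contexts. The threshold value and selection rule below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def select_uncertain(A, x, tau):
    """Keep context x for online regression only if its elliptical
    uncertainty x^T A^{-1} x exceeds the threshold tau."""
    return float(x @ np.linalg.solve(A, x)) > tau

# Gram matrix grows only with the selected contexts:
A = np.eye(2)
x = np.array([1.0, 0.0])
if select_uncertain(A, x, 0.5):
    A += np.outer(x, x)
```

Discarding low-uncertainty contexts limits how much the misspecification error can accumulate in the regression, which is what enables the gap-dependent regret bound.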