TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback

DC Field | Value | Language
dc.contributor.author | Eunseop Yoon | ko
dc.contributor.author | Hee Suk Yoon | ko
dc.contributor.author | SooHwan Eom | ko
dc.contributor.author | Daniel Wontae Nam | ko
dc.contributor.author | Daejin Jo | ko
dc.contributor.author | Kyoung-Woon On | ko
dc.contributor.author | Mark A. Hasegawa-Johnson | ko
dc.contributor.author | Sungwoong Kim | ko
dc.contributor.author | Yoo, Chang-Dong | ko
dc.date.accessioned | 2024-09-28T04:00:07Z | -
dc.date.available | 2024-09-28T04:00:07Z | -
dc.date.created | 2024-09-28 | -
dc.date.issued | 2024-08 | -
dc.identifier.citation | The 62nd Annual Meeting of the Association for Computational Linguistics | -
dc.identifier.uri | http://hdl.handle.net/10203/323301 | -
dc.description.abstract | Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human essence. These human preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and tokens, which are autoregressively generated from the language model. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, these typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0), failing to account for varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which incorporates a discriminator trained to distinguish positive and negative tokens, and the confidence of the discriminator is used to assign continuous rewards to each token considering the context. Extensive experiments show that TLCR leads to consistent performance improvements over previous sequence-level or token-level discrete rewards on open-ended generation benchmarks. | -
dc.language | English | -
dc.publisher | The 62nd Annual Meeting of the Association for Computational Linguistics | -
dc.title | TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback | -
dc.title.alternative | TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback | -
dc.type | Conference | -
dc.type.rims | CONF | -
dc.citation.publicationname | The 62nd Annual Meeting of the Association for Computational Linguistics | -
dc.identifier.conferencecountry | TH | -
dc.identifier.conferencelocation | Centara Grand Convention Center | -
dc.contributor.localauthor | Yoo, Chang-Dong | -
dc.contributor.nonIdAuthor | Eunseop Yoon | -
dc.contributor.nonIdAuthor | Hee Suk Yoon | -
dc.contributor.nonIdAuthor | SooHwan Eom | -
dc.contributor.nonIdAuthor | Daniel Wontae Nam | -
dc.contributor.nonIdAuthor | Daejin Jo | -
dc.contributor.nonIdAuthor | Kyoung-Woon On | -
dc.contributor.nonIdAuthor | Mark A. Hasegawa-Johnson | -
dc.contributor.nonIdAuthor | Sungwoong Kim | -
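
The abstract above describes the core mechanism of TLCR: a discriminator scores each generated token, and its confidence is turned into a continuous per-token reward. The snippet below is a minimal sketch of that idea, not the authors' implementation; the names (TokenDiscriminator, token_level_rewards), the toy linear scorer, and the 2p - 1 mapping from confidence to reward are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (assumed, not the authors' code): continuous token-level
# rewards derived from a token discriminator's confidence.

import torch
import torch.nn as nn


class TokenDiscriminator(nn.Module):
    """Toy stand-in for a trained token-level preference discriminator.

    Given contextual token states (context encoding is assumed to happen
    upstream), it outputs one logit per token; sigmoid(logit) is read as the
    confidence that the token is 'preferred'.
    """

    def __init__(self, hidden_dim: int = 16):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim) -> logits: (batch, seq_len)
        return self.scorer(token_states).squeeze(-1)


def token_level_rewards(logits: torch.Tensor) -> torch.Tensor:
    """Map discriminator confidence to a continuous per-token reward in [-1, 1].

    p = sigmoid(logit) is the confidence that a token is positive; the reward
    2p - 1 is near +1 for confidently positive tokens, near -1 for confidently
    negative ones, and near 0 when the discriminator is uncertain, i.e. a
    continuous relaxation of a discrete +1 / -1 / 0 scheme.
    """
    p_positive = torch.sigmoid(logits)
    return 2.0 * p_positive - 1.0


if __name__ == "__main__":
    torch.manual_seed(0)
    discriminator = TokenDiscriminator(hidden_dim=16)
    # Fake contextual token states: batch of 2 sequences, 5 tokens, 16 dims.
    token_states = torch.randn(2, 5, 16)
    rewards = token_level_rewards(discriminator(token_states))
    print(rewards)  # one continuous reward per generated token
```

In an actual RLHF pipeline such per-token values would serve as the dense reward signal during policy optimization instead of a single sequence-level scalar; here the script only prints the reward tensor to show its shape, one scalar per token.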
Appears in Collection
EE-Conference Papers (Conference Papers)
Files in This Item
There are no files associated with this item.
