Author: "Benhaim, Alon" / Publication Year Range: Last 3 years - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Benhaim, Alon"' showing total 14 results

1. POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization

Author: Karaman, Batuhan K., Zabir, Ishmam, Benhaim, Alon, Chaudhary, Vishrav, Sabuncu, Mert R., and Song, Xia
Subjects: Computer Science - Computation and Language
Abstract: Balancing safety and usefulness in large language models has become a critical challenge in recent years. Models often exhibit unsafe behavior or adopt an overly cautious approach, leading to frequent overrefusal of benign prompts, which reduces their usefulness. Addressing these issues requires methods that maintain safety while avoiding overrefusal. In this work, we examine how the overgeneration of training data using advanced teacher models (e.g., GPT-4o), including responses to both general-purpose and toxic prompts, influences the safety and overrefusal balance of instruction-following language models. Additionally, we present POROver, a strategy to use preference optimization methods in order to reduce overrefusal, via employing a superior teacher model's completions. Our results show that overgenerating completions for general-purpose prompts significantly improves the balance between safety and usefulness. Specifically, the F1 score calculated between safety and usefulness increases from 70.8% to 88.3%. Moreover, overgeneration for toxic prompts substantially reduces overrefusal, decreasing it from 94.4% to 45.2%. Furthermore, preference optimization algorithms, when applied with carefully curated preference data, can effectively reduce a model's overrefusal from 45.2% to 15.0% while maintaining comparable safety levels. Our code and data are available at https://github.com/batuhankmkaraman/POROver.
Published: 2024

2. Scaling Laws for Multilingual Language Models

Author: He, Yifei, Benhaim, Alon, Patra, Barun, Vaddamanu, Praneetha, Ahuja, Sanchit, Chopra, Parul, Chaudhary, Vishrav, Zhao, Han, and Song, Xia
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We propose a novel scaling law for general-purpose decoder-only language models (LMs) trained on multilingual data, addressing the problem of balancing languages during multilingual pretraining. A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. To address this, we shift the focus from individual languages to language families. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio, independent of other languages in the mixture. This insight simplifies the complexity of multilingual scaling and makes the analysis scalable to an arbitrary number of languages. Building on this hypothesis, we derive a power-law relationship that links performance with dataset size, model size and sampling ratios. This relationship enables us to predict performance across various combinations of the above three quantities, and derive the optimal sampling ratios at different model scales. To demonstrate the effectiveness and accuracy of our proposed scaling law, we perform a large-scale empirical study, training more than 100 models on 23 languages spanning 5 language families. Our experiments show that the optimal sampling ratios derived from small models (85M parameters) generalize effectively to models that are several orders of magnitude larger (1.2B parameters), offering a resource-efficient approach for multilingual LM training at scale.
Published: 2024

3. On The Adaptation of Unlimiformer for Decoder-Only Transformers

Author: Ahrabian, Kian, Benhaim, Alon, Patra, Barun, Pujara, Jay, Singhal, Saksham, and Song, Xia
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: One of the prominent issues stifling the current generation of large language models is their limited context length. Recent proprietary models such as GPT-4 and Claude 2 have introduced longer context lengths, 8k/32k and 100k, respectively; however, despite the efforts in the community, most common models, such as LLama-2, have a context length of 4k or less. Unlimiformer (Bertsch et al., 2023) is a recently popular vector-retrieval augmentation method that offloads cross-attention computations to a kNN index. However, its main limitation is incompatibility with decoder-only transformers out of the box. In this work, we explore practical considerations of adapting Unlimiformer to decoder-only transformers and introduce a series of modifications to overcome this limitation. Moreover, we expand the original experimental setup on summarization to include a new task (i.e., free-form Q&A) and an instruction-tuned model (i.e., a custom 6.7B GPT model). Our results showcase the effectiveness of these modifications on summarization, performing on par with a model with 2x the context length. Moreover, we discuss limitations and future directions for free-form Q&A and instruction-tuned models. Comment: 8 pages, 6 figures
Published: 2024

4. Scaling Optimal LR Across Token Horizons

Author: Bjorck, Johan, Benhaim, Alon, Chaudhary, Vishrav, Wei, Furu, and Song, Xia
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: State-of-the-art LLMs are powered by scaling -- scaling model size, dataset size and cluster size. It is economically infeasible to extensively tune hyperparameters for the largest runs. Instead, approximately optimal hyperparameters must be inferred or transferred from smaller experiments. Hyperparameter transfer across model sizes has been studied in Yang et al. However, hyperparameter transfer across dataset size -- or token horizon -- has not been studied yet. To remedy this we conduct a large scale empirical study on how optimal learning rate (LR) depends on token horizon in LLM training. We first demonstrate that the optimal LR changes significantly with token horizon -- longer training necessitates smaller LR. Secondly we demonstrate that the optimal LR follows a scaling law, and that the optimal LR for longer horizons can be accurately estimated from shorter horizons via such scaling laws. We also provide a rule-of-thumb for transferring LR across token horizons with zero overhead over current practices. Lastly we provide evidence that LLama-1 used too high LR, and estimate the performance hit from this. We thus argue that hyperparameter transfer across data size is an important and overlooked component of LLM training.
Published: 2024

5. The Hitchhiker's Guide to Human Alignment with *PO

Author: Ahrabian, Kian, Lin, Xihui, Patra, Barun, Chaudhary, Vishrav, Benhaim, Alon, Pujara, Jay, and Song, Xia
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: With the growing utilization of large language models (LLMs) across domains, alignment towards human preferences has become one of the most critical aspects of training models. At the forefront of state-of-the-art human alignment methods are preference optimization methods (*PO). However, prior research has often concentrated on identifying the best-performing method, typically involving a grid search over hyperparameters, which can be impractical for general practitioners. In this paper, we aim to identify the algorithm that, while being performant, is simultaneously more robust to varying hyperparameters, thereby increasing the likelihood of achieving better results. We focus on a realistic out-of-distribution (OOD) scenario that mirrors real-world applications of human alignment, offering practical insights into the strengths and weaknesses of these methods. Furthermore, to better understand the shortcomings of generations from the different methods, we analyze the model generations through the lens of KL divergence of the SFT model and the response length statistics. Our analysis reveals that the widely adopted DPO method consistently produces lengthy responses of inferior quality that are very close to the SFT responses. Motivated by these findings, we propose an embarrassingly simple extension to the DPO algorithm, LN-DPO, resulting in more concise responses without sacrificing quality compared to the policy obtained by vanilla DPO. Comment: 10 pages
Published: 2024

6. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Author: Abdin, Marah, Aneja, Jyoti, Awadalla, Hany, Awadallah, Ahmed, Awan, Ammar Ahmad, Bach, Nguyen, Bahree, Amit, Bakhtiari, Arash, Bao, Jianmin, Behl, Harkirat, Benhaim, Alon, Bilenko, Misha, Bjorck, Johan, Bubeck, Sébastien, Cai, Martin, Cai, Qin, Chaudhary, Vishrav, Chen, Dong, Chen, Dongdong, Chen, Weizhu, Chen, Yen-Chun, Chen, Yi-Ling, Cheng, Hao, Chopra, Parul, Dai, Xiyang, Dixon, Matthew, Eldan, Ronen, Fragoso, Victor, Gao, Jianfeng, Gao, Mei, Gao, Min, Garg, Amit, Del Giorno, Allie, Goswami, Abhishek, Gunasekar, Suriya, Haider, Emman, Hao, Junheng, Hewett, Russell J., Hu, Wenxiang, Huynh, Jamie, Iter, Dan, Jacobs, Sam Ade, Javaheripi, Mojan, Jin, Xin, Karampatziakis, Nikos, Kauffmann, Piero, Khademi, Mahoud, Kim, Dongwoo, Kim, Young Jin, Kurilenko, Lev, Lee, James R., Lee, Yin Tat, Li, Yuanzhi, Li, Yunsheng, Liang, Chen, Liden, Lars, Lin, Xihui, Lin, Zeqi, Liu, Ce, Liu, Liyuan, Liu, Mengchen, Liu, Weishung, Liu, Xiaodong, Luo, Chong, Madan, Piyush, Mahmoudzadeh, Ali, Majercak, David, Mazzola, Matt, Mendes, Caio César Teodoro, Mitra, Arindam, Modi, Hardik, Nguyen, Anh, Norick, Brandon, Patra, Barun, Perez-Becker, Daniel, Portet, Thomas, Pryzant, Reid, Qin, Heyang, Radmilac, Marko, Ren, Liliang, de Rosa, Gustavo, Rosset, Corby, Roy, Sambudha, Ruwase, Olatunji, Saarikivi, Olli, Saied, Amin, Salim, Adil, Santacroce, Michael, Shah, Shital, Shang, Ning, Sharma, Hiteshi, Shen, Yelong, Shukla, Swadheen, Song, Xia, Tanaka, Masahiro, Tupini, Andrea, Vaddamanu, Praneetha, Wang, Chunyu, Wang, Guanhua, Wang, Lijuan, Wang, Shuohang, Wang, Xin, Wang, Yu, Ward, Rachel, Wen, Wen, Witte, Philipp, Wu, Haiping, Wu, Xiaoxia, Wyatt, Michael, Xiao, Bin, Xu, Can, Xu, Jiahang, Xu, Weijian, Xue, Jilong, Yadav, Sonali, Yang, Fan, Yang, Jianwei, Yang, Yifan, Yang, Ziyi, Yu, Donghan, Yuan, Lu, Zhang, Chenruidong, Zhang, Cyril, Zhang, Jianwen, Zhang, Li Lyna, Zhang, Yi, Zhang, Yue, Zhang, Yunan, and Zhou, Xiren
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts. Comment: 24 pages
Published: 2024

7. A Length-Extrapolatable Transformer

Author: Sun, Yutao, Dong, Li, Patra, Barun, Ma, Shuming, Huang, Shaohan, Benhaim, Alon, Chaudhary, Vishrav, Song, Xia, and Wei, Furu
Subjects: Computer Science - Computation and Language
Abstract: Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. Then we propose two designs to improve the above metric of Transformers. Specifically, we introduce a relative position embedding to explicitly maximize attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants with language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer. Comment: 9 pages
Published: 2022

8. TorchScale: Transformers at Scale

Author: Ma, Shuming, Wang, Hongyu, Huang, Shaohan, Wang, Wenhui, Chi, Zewen, Dong, Li, Benhaim, Alon, Patra, Barun, Chaudhary, Vishrav, Song, Xia, and Wei, Furu
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Large Transformers have achieved state-of-the-art performance across many tasks. Most open-source libraries on scaling Transformers focus on improving training or inference with better parallelization. In this work, we present TorchScale, an open-source toolkit that allows researchers and developers to scale up Transformers efficiently and effectively. TorchScale has the implementation of several modeling techniques, which can improve modeling generality and capability, as well as training stability and efficiency. Experimental results on language modeling and neural machine translation demonstrate that TorchScale can successfully scale Transformers to different sizes without tears. The library is available at https://aka.ms/torchscale. Comment: Work in progress
Published: 2022

9. Foundation Transformers

Author: Wang, Hongyu, Ma, Shuming, Huang, Shaohan, Dong, Li, Wang, Wenhui, Peng, Zhiliang, Wu, Yu, Bajaj, Payal, Singhal, Saksham, Benhaim, Alon, Patra, Barun, Liu, Zhun, Chaudhary, Vishrav, Song, Xia, and Wei, Furu
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3). Comment: Work in progress
Published: 2022

10. Scaling Blockchains: Can Committee-Based Consensus Help?

Author: Benhaim, Alon, primary, Falk, Brett H., additional, and Tsoukalas, Gerry, additional
Published: 2023
Full Text: View/download PDF

11. Balancing Power in Decentralized Governance: Quadratic Voting under Imperfect Information

Author: Benhaim, Alon, primary, Hemenway Falk, Brett, additional, and Tsoukalas, Gerry, additional
Published: 2023
Full Text: View/download PDF

12. A Length-Extrapolatable Transformer

Author: Sun, Yutao, primary, Dong, Li, additional, Patra, Barun, additional, Ma, Shuming, additional, Huang, Shaohan, additional, Benhaim, Alon, additional, Chaudhary, Vishrav, additional, Song, Xia, additional, and Wei, Furu, additional
Published: 2023
Full Text: View/download PDF

13. Study of Nash Equilibria in Blockchain Voting Systems

Author: Benhaim, Alon
Abstract: In the first part of this thesis we analyze the three most common blockchain committee selection strategies: lottery, single-vote and approval voting, where voters can “approve” of any number of candidates. We first show that all these mechanisms converge to optimality exponentially quickly as the size of the committee grows. Approval voting requires that even honest voters act strategically; we characterize different approval voting strategies and we show that although finding the optimal approval voting strategy is extremely complex, almost any approval voting strategy outperforms the single-vote mechanism enforced on the majority of blockchains. In the second part, we investigate a blockchain governance model where a group of n voters must choose between two collective alternatives. As opposed to the usual voting system (one person – one vote), we propose a voting system where each agent buys votes in favor of their preferred alternative, paying the m-th root of the number of votes purchased. Its novelty relies on allowing voters to express the intensity of their preferences in a simple manner. We provide a rigorous comparison of the utilitarian welfare between Regular Voting (m = 1) and Quadratic Voting (m = 2). We present closed-form equilibrium solutions to the 2-voter and 3-voter games. In addition to characterizing the nature of equilibria, one of our main results demonstrates that the normalized utilitarian welfare of the mechanisms tends to one as the population size becomes large.
Published: 2022

14. Scaling Blockchains: Can Committee-based Consensus Help?

Author: Benhaim, Alon, Falk, Brett Hemenway, and Tsoukalas, Gerry
Subjects: Computer Science - Cryptography and Security, Quantitative Finance - Trading and Market Microstructure, Computer Science - Computer Science and Game Theory, Computer Science - Information Theory, Economics - General Economics
Abstract: In the high-stakes race to develop more scalable blockchains, some platforms (Binance, Cosmos, EOS, TRON, etc.) have adopted committee-based consensus (CBC) protocols, whereby the blockchain's record-keeping rights are entrusted to a committee of elected block producers. In theory, the smaller the committee, the faster the blockchain can reach consensus and the more it can scale. What's less clear is whether such protocols ensure that honest committees can be consistently elected, given blockchain users typically have limited information on who to vote for. We show that the approval voting mechanism underlying most CBC protocols is complex and can lead to intractable optimal voting strategies. We empirically characterize some simpler intuitive voting strategies that users tend to resort to in practice and prove that these nonetheless converge to optimality exponentially quickly in the number of voters. Exponential convergence ensures that despite its complexity, CBC exhibits robustness and has some efficiency advantages over more popular staked-weighted lottery protocols currently underlying many prominent blockchains such as Ethereum.
Published: 2022
Full Text: View/download PDF
