Author: "Buratti, Luca" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Buratti, Luca"' showing total 11 results

1. Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models

Author: Vishwakarma, Sanjay, Harkins, Francis, Golecha, Siddharth, Bajpe, Vishal Sharathchandra, Dupuis, Nicolas, Buratti, Luca, Kremer, David, Faro, Ismael, Puri, Ruchir, and Cruz-Benito, Juan
Subjects: Quantum Physics, Computer Science - Artificial Intelligence
Abstract: Quantum programs are typically developed using quantum Software Development Kits (SDKs). The rapid advancement of quantum computing necessitates new tools to streamline this development process, and one such tool could be Generative Artificial Intelligence (GenAI). In this study, we introduce and use the Qiskit HumanEval dataset, a hand-curated collection of tasks designed to benchmark the ability of Large Language Models (LLMs) to produce quantum code using Qiskit, a quantum SDK. This dataset consists of more than 100 quantum computing tasks, each accompanied by a prompt, a canonical solution, a comprehensive test case, and a difficulty scale to evaluate the correctness of the generated solutions. We systematically assess the performance of a set of LLMs against the Qiskit HumanEval dataset's tasks and focus on the models' ability to produce executable quantum code. Our findings not only demonstrate the feasibility of using LLMs for generating quantum code but also establish a new benchmark for ongoing advancements in the field and encourage further exploration and development of GenAI-driven tools for quantum code generation.
Published: 2024

2. Qiskit Code Assistant: Training LLMs for generating Quantum Computing Code

Author: Dupuis, Nicolas, Buratti, Luca, Vishwakarma, Sanjay, Forrat, Aitana Viudes, Kremer, David, Faro, Ismael, Puri, Ruchir, and Cruz-Benito, Juan
Subjects: Quantum Physics, Computer Science - Artificial Intelligence
Abstract: Code Large Language Models (Code LLMs) have emerged as powerful tools, revolutionizing the software development landscape by automating the coding process and reducing time and effort required to build applications. This paper focuses on training Code LLMs to specialize in the field of quantum computing. We begin by discussing the unique needs of quantum computing programming, which differ significantly from classical programming approaches or languages. A Code LLM specializing in quantum computing requires a foundational understanding of quantum computing and quantum information theory. However, the scarcity of available quantum code examples and the rapidly evolving field, which necessitates continuous dataset updates, present significant challenges. Moreover, we discuss our work on training Code LLMs to produce high-quality quantum code using the Qiskit library. This work includes an examination of the various aspects of the LLMs used for training and the specific training conditions, as well as the results obtained with our current models. To evaluate our models, we have developed a custom benchmark, similar to HumanEval, which includes a set of tests specifically designed for the field of quantum computing programming using Qiskit. Our findings indicate that our model outperforms existing state-of-the-art models in quantum computing tasks. We also provide examples of code suggestions, comparing our model to other relevant code LLMs. Finally, we introduce a discussion on the potential benefits of Code LLMs for quantum computing computational scientists, researchers, and practitioners. We also explore various features and future work that could be relevant in this context.
Published: 2024

3. Insights from the Usage of the Ansible Lightspeed Code Completion Service

Author: Sahoo, Priyam, Pujar, Saurabh, Nalawade, Ganesh, Gebhardt, Richard, Mandel, Louis, and Buratti, Luca
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Programming Languages
Abstract: The availability of Large Language Models (LLMs) that can generate code has made it possible to create tools that improve developer productivity. Integrated development environments (IDEs), which developers use to write software, are often used as an interface to interact with LLMs. Although many such tools have been released, almost all of them focus on general-purpose programming languages. Domain-specific languages, such as those crucial for Information Technology (IT) automation, have not received much attention. Ansible is one such YAML-based IT automation-specific language. Ansible Lightspeed is an LLM-based service designed explicitly to generate Ansible YAML, given a natural language prompt. In this paper, we present the design and implementation of the Ansible Lightspeed service. We then evaluate its utility to developers using diverse indicators, including extended utilization, analysis of user-edited suggestions, as well as user sentiment analysis. The evaluation is based on data collected for 10,696 real users including 3,910 returning users. The code for Ansible Lightspeed service and the analysis framework is made available for others to use. To our knowledge, our study is the first to involve thousands of users of code assistants for domain-specific languages. We are also the first code completion tool to present N-Day user retention figures, with 13.66% retention on Day 30. We propose an improved version of user acceptance rate, called Strong Acceptance rate, where a suggestion is considered accepted only if less than 50% of it is edited and these edits do not change critical parts of the suggestion. By focusing on Ansible, Lightspeed is able to achieve a strong acceptance rate of 49.08% for multi-line Ansible task suggestions. With our findings we provide insights into the effectiveness of small, dedicated models in a domain-specific context.
Comment: This paper has been published at the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024), Industry Showcase under the title "Ansible Lightspeed: A Code Generation Service for IT Automation"
Published: 2024
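
The Strong Acceptance rule described in the abstract above (less than 50% of a suggestion edited, and no critical part changed) can be illustrated with a minimal sketch. The character-level diff, the function names, and the idea of passing critical parts as substrings that must survive are all assumptions made for illustration; the abstract does not say how edits or critical parts are measured.

```python
from difflib import SequenceMatcher


def edited_fraction(suggested: str, final: str) -> float:
    """Share of the suggestion's characters that did not survive into the final text.

    Assumption: edits are measured with a character-level diff.
    """
    if not suggested:
        return 0.0
    matcher = SequenceMatcher(None, suggested, final, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(suggested)


def strongly_accepted(suggested: str, final: str, critical_parts: list[str]) -> bool:
    """True only if under 50% was edited and every critical part is still present.

    Assumption: a "critical part" is a substring (for example a module name)
    that must appear unchanged in the final text.
    """
    critical_intact = all(part in final for part in critical_parts)
    return edited_fraction(suggested, final) < 0.5 and critical_intact


suggestion = "ansible.builtin.package:\n  name: nginx\n  state: present\n"
light_edit = suggestion.replace("present", "latest")
module_swap = suggestion.replace("ansible.builtin.package", "ansible.builtin.apt")

print(strongly_accepted(suggestion, light_edit, ["ansible.builtin.package"]))
print(strongly_accepted(suggestion, module_swap, ["ansible.builtin.package"]))
```

In this sketch a small value change still counts as accepted, while swapping the module name does not, even though few characters changed.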

4. Learning Transfers over Several Programming Languages

Author: Baltaji, Razan, Pujar, Saurabh, Mandel, Louis, Hirzel, Martin, Buratti, Luca, and Varshney, Lav
Subjects: Computer Science - Computation and Language, I.2.7, I.2.5
Abstract: Large language models (LLMs) have become remarkably good at improving developer productivity for high-resource programming languages. These models use two kinds of data: large amounts of unlabeled code samples for pre-training and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. Unfortunately, many programming languages are low-resource, lacking labeled samples for most tasks and often even lacking unlabeled samples. Therefore, users of low-resource languages (e.g., legacy or new languages) miss out on the benefits of LLMs. Cross-lingual transfer uses data from a source language to improve model performance on a target language. It has been well-studied for natural languages, but has received little attention for programming languages. This paper reports extensive experiments on four tasks using a transformer-based LLM and 11 to 41 programming languages to explore the following questions. First, how well does cross-lingual transfer work for a given task across different language pairs? Second, given a task and target language, how should one choose a source language? Third, which characteristics of a language pair are predictive of transfer performance, and how does that depend on the given task? Our empirical study with 1,808 experiments reveals practical and scientific insights, such as Kotlin and JavaScript being the most transferable source languages and different tasks relying on substantially different features. Overall, we find that learning transfers well across several programming languages.
Comment: 15 pages, 9 figures, 8 tables
Published: 2023

5. Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

Author: Min, Marcus J., Ding, Yangruibo, Buratti, Luca, Pujar, Saurabh, Kaiser, Gail, Jana, Suman, and Ray, Baishakhi
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Software Engineering, 68, I.2, D.2
Abstract: Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and conventional accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain.
Comment: ICLR 2024
Published: 2023

6. CONCORD: Clone-aware Contrastive Learning for Source Code

Author: Ding, Yangruibo, Chakraborty, Saikat, Buratti, Luca, Pujar, Saurabh, Morari, Alessandro, Kaiser, Gail, and Ray, Baishakhi
Subjects: Computer Science - Software Engineering
Abstract: Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning. On the one hand, human developers tend to write repetitive programs referencing existing code snippets from the current codebase or online resources (e.g., Stack Overflow website) rather than implementing functions from scratch; such behaviors result in a vast number of code clones. In contrast, a deviant clone by mistake might trigger malicious program behaviors. Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware contrastive learning drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
Comment: Camera-ready for 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 23)
Published: 2023

7. Automated Code generation for Information Technology Tasks in YAML through Large Language Models

Author: Pujar, Saurabh, Buratti, Luca, Guo, Xiaojie, Dupuis, Nicolas, Lewis, Burn, Suneja, Sahil, Sood, Atin, Nalawade, Ganesh, Jones, Matthew, Morari, Alessandro, and Puri, Ruchir
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Programming Languages
Abstract: The recent improvement in code generation capabilities due to the use of large language models has mainly benefited general-purpose programming languages. Domain-specific languages, such as the ones used for IT Automation, have received far less attention, despite involving many active developers and being an essential component of modern cloud platforms. This work focuses on the generation of Ansible-YAML, a widely used markup language for IT Automation. We present Ansible Wisdom, a natural-language to Ansible-YAML code generation tool, aimed at improving IT automation productivity. Ansible Wisdom is a transformer-based model, extended by training with a new dataset containing Ansible-YAML. We also develop two novel performance metrics for YAML and Ansible to capture the specific characteristics of this domain. Results show that Ansible Wisdom can accurately generate Ansible scripts from natural language prompts with performance comparable to or better than existing state-of-the-art code generation models. In few-shot settings, we assess the impact of training with Ansible and YAML data and compare with different baselines including Codex-Davinci-002. We also show that after finetuning, our Ansible-specific model (BLEU: 66.67) can outperform a much larger Codex-Davinci-002 (BLEU: 50.4) model, which was evaluated in few-shot settings.
Published: 2023

8. Towards Learning (Dis)-Similarity of Source Code from Program Contrasts

Author: Ding, Yangruibo, Buratti, Luca, Pujar, Saurabh, Morari, Alessandro, Ray, Baishakhi, and Chakraborty, Saikat
Subjects: Computer Science - Programming Languages, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Software Engineering
Abstract: Understanding the functional (dis)-similarity of source code is significant for code modeling tasks such as software vulnerability and code clone detection. We present DISCO (DIS-similarity of COde), a novel self-supervised model focusing on identifying (dis)similar functionalities of source code. Different from existing works, our approach does not require a huge amount of randomly collected datasets. Rather, we design structure-guided code transformation algorithms to generate synthetic code clones and inject real-world security bugs, augmenting the collected datasets in a targeted way. We propose to pre-train the Transformer model with such automatically generated program contrasts to better identify similar code in the wild and differentiate vulnerable programs from benign ones. To better capture the structural features of source code, we propose a new cloze objective to encode the local tree-based context (e.g., parents or sibling nodes). We pre-train our model with a much smaller dataset, the size of which is only 5% of the state-of-the-art models' training datasets, to illustrate the effectiveness of our data augmentation and the pre-training approach. The evaluation shows that, even with much less data, DISCO can still outperform the state-of-the-art models in vulnerability and code clone detection tasks.
Comment: ACL 2022 Camera-Ready
Published: 2021

9. CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Author: Puri, Ruchir, Kung, David S., Janssen, Geert, Zhang, Wei, Domeniconi, Giacomo, Zolotov, Vladimir, Dolby, Julian, Chen, Jie, Choudhury, Mihir, Decker, Lindsey, Thost, Veronika, Buratti, Luca, Pujar, Saurabh, Ramji, Shyam, Finkler, Ulrich, Malaika, Susan, and Reiss, Frederick
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence
Abstract: Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled numerous breakthroughs, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of AI for Code has garnered new interest and gathered momentum. In this paper, we present a large-scale dataset CodeNet, consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages, which is aimed at teaching AI to code. In addition to its large scale, CodeNet has a rich set of high-quality annotations to benchmark and help accelerate research in AI techniques for a variety of critical coding tasks, including code similarity and classification, code translation between a large variety of programming languages, and code performance (runtime and memory) improvement techniques. Additionally, CodeNet provides sample input and output test sets for 98.5% of the code samples, which can be used as an oracle for determining code correctness and potentially guide reinforcement learning for code quality improvements. As a usability feature, we provide several pre-processing tools in CodeNet to transform source code into representations that can be readily used as inputs into machine learning models. Results of code classification and code similarity experiments using the CodeNet dataset are provided as a reference. We hope that the scale, diversity and rich, high-quality annotations of CodeNet will offer unprecedented research opportunities at the intersection of AI and Software Engineering.
Comment: 22 pages including references
Published: 2021

10. D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

Author: Zheng, Yunhui, Pujar, Saurabh, Lewis, Burn, Buratti, Luca, Epstein, Edward, Yang, Bo, Laredo, Jim, Morari, Alessandro, and Su, Zhong
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Static analysis tools are widely used for vulnerability detection as they understand programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose D2A, a differential analysis based approach to label issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset to train models for vulnerability identification. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first.
Comment: Accepted to the 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP '21)
Published: 2021
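
The differential labeling idea in the abstract above (issues reported before a bug-fixing commit that disappear afterwards are likely real bugs) can be sketched as a set comparison. The function name, the label strings, and the representation of each issue as a plain string identifier are hypothetical choices for illustration, not the D2A pipeline itself. The abstract only characterizes issues that disappear, so the sketch marks the remaining ones neutrally as still reported.

```python
def label_issues(before: set[str], after: set[str]) -> dict[str, str]:
    """Label each issue reported on the before-commit version.

    Assumption: an issue is identified by a stable string (e.g. file, function
    and bug type) so the same issue can be matched across the two versions.
    """
    labels = {}
    for issue in sorted(before):
        if issue in after:
            labels[issue] = "still_reported"      # present before and after the commit
        else:
            labels[issue] = "likely_fixed_bug"    # disappeared after the bug-fixing commit
    return labels


before_commit = {"parser.c:read_header:buffer_overrun", "util.c:copy:null_deref"}
after_commit = {"util.c:copy:null_deref"}

for issue, label in label_issues(before_commit, after_commit).items():
    print(f"{label}: {issue}")
```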

11. Exploring Software Naturalness through Neural Language Models

Author: Buratti, Luca, Pujar, Saurabh, Bornea, Mihaela, McCarley, Scott, Zheng, Yunhui, Rossiello, Gaetano, Morari, Alessandro, Laredo, Jim, Thost, Veronika, Zhuang, Yufan, and Domeniconi, Giacomo
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Programming Languages
Abstract: The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST) while our transformer-based language models work on raw source code. This work is the first to investigate whether such language models can discover AST features automatically. To achieve this, we introduce a sequence labeling task that directly probes the language model's understanding of AST. Our results show that transformer-based language models achieve high accuracy in the AST tagging task. Furthermore, we evaluate our model on a software vulnerability identification task. Importantly, we show that our approach obtains vulnerability identification results comparable to graph-based approaches that rely heavily on compilers for feature extraction.
Published: 2020
