CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code

Authors :: Guan, Batu
Wan, Yao
Bi, Zhangqian
Wang, Zheng
Zhang, Hongyu
Sui, Yulei
Zhou, Pan
Sun, Lichao
Publication Year :: 2024
Abstract: As Large Language Models (LLMs) are increasingly used to automate code generation, it is often desirable to know whether code is AI-generated and by which model, especially for purposes such as protecting intellectual property (IP) in industry and preventing academic misconduct in education. Incorporating watermarks into machine-generated content is one way to provide code provenance, but existing solutions are restricted to a single bit or lack flexibility. We present CodeIP, a new watermarking technique for LLM-based code generation. CodeIP enables the insertion of multi-bit information while preserving the semantics of the generated code, improving the strength and diversity of the inserted watermark. This is achieved by training a type predictor to predict the grammar type of the next token, which enhances the syntactic and semantic correctness of the generated code. Experiments on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
Comment: 13 pages, 7 figures