Background: Vocabulary learning in a second language (L2) encompasses both single words and collocations. Research indicates that L2 learners can incidentally learn single words from captioned videos, but less is known about the incidental learning of collocations, let alone about how learning gains for single words and collocations differ across captioning conditions or which individual differences may account for such differences. Objectives: This study aimed to fill this gap by comparing the learning gains of single words and collocations while investigating the influence of vocabulary knowledge (VK) and working memory (WM) on the learning results under three captioning conditions: full captions, keyword captions, and no captions. Methods: The study involved 129 young Chinese ESL learners who completed vocabulary tests assessing their meaning recall before, immediately after, and 2 weeks after the study, as well as tests for VK and WM. Results and Conclusions: The results showed that full captions are the most efficacious condition for enhancing both single word and collocation learning. The depth of VK, as well as phonological and complex WM, were significant factors in the learning of new language items. Takeaways: Different types of captioning (full or keyword) contribute differently to the learning of various language items. Individual differences in WM and depth of VK among learners should be considered when utilizing captioned videos for language learning.
Lay Description: What is already known about this topic: The type of captions employed influences incidental single word learning. Incidental collocation learning from captioned videos is worth investigating because of the essential role of collocation knowledge in L2 development and the limited classroom time. The breadth of prior vocabulary knowledge (VK) affects the incidental learning of single words across different captioning conditions. However, the findings remain inconclusive. Working memory (WM) plays a vital role in single word learning. However, limited emphasis has been placed on examining how WM affects the incidental learning of collocations across various captioning conditions. What this paper adds: Different captioning conditions play different roles in the incidental learning of different language units: single word learning benefits most from full captioning, whereas both full and keyword captioning lead to significant improvements in incidental collocation learning. The depth of VK is a key determinant of both single word and collocation learning from captioned videos, but its impact is greater for collocations. The breadth of VK is more relevant to collocation learning than to single word learning. Both phonological and complex WM play an important role in learning both single words and collocations, but their contribution is greater for collocations.
Implications for practice and/or policy: L2 policymakers can incorporate short storytelling videos into the EFL curriculum to facilitate vocabulary learning among young learners and ultimately enhance their L2 proficiency. Teachers can strategically design and implement various types of captioned videos (full or keyword) as out-of-class extensive viewing activities, targeting different language components such as single words and collocations. Teachers should be mindful of individual differences (e.g., VK and WM) among learners when utilizing captioned videos for language learning, particularly when it comes to incidental collocation learning. Parents are encouraged to include short storytelling videos with captions as part of their children's home entertainment activities. [ABSTRACT FROM AUTHOR]