Human listeners can accurately recognize an impressive range of complex sounds, such as musical instruments or voices. The underlying mechanisms are still poorly understood. Here, we aimed to characterize the processing time needed to recognize a natural sound. To do so, by analogy with the "rapid visual sequential presentation paradigm", we embedded short target sounds within rapid sequences of distractor sounds. The core hypothesis is that any correct report of the target implies that sufficient processing for recognition had been completed before the time of occurrence of the subsequent distractor sound. We conducted four behavioral experiments using short natural sounds (voices and instruments) as targets or distractors. We report how performance, measured as the fastest presentation rate that still allowed recognition, was affected by sound duration, the number of sounds in a sequence, the relative pitch between target and distractors, and target position in the sequence. Results showed very rapid auditory recognition of natural sounds in all cases. Targets could be recognized at rates of up to 30 sounds per second. In addition, the best performance was observed for voices in sequences of instruments. These results give new insights into the remarkable efficiency of timbre processing in humans, using an original behavioral paradigm to provide strong constraints on future neural models of sound recognition.

Anecdotally, we as human listeners seem remarkably adept at recognizing sound sources: the sound of a voice, approaching footsteps, or musical instruments in each of our cultures. There is now quantitative behavioral evidence supporting this idea (for a review, see Agus et al., in press [1]). However, the underlying neural mechanisms for such an impressive feat remain unclear. One way to constrain the range of possible mechanisms is to measure the temporal characteristics of sound source recognition.
Using a straightforward operational definition of recognition as a correct response to a target sound defined by its category (e.g., a voice among musical instruments), Agus et al. [2] showed that reaction times for recognition were remarkably short, with an overhead compared to simple detection of between 145 ms and 250 ms, depending on target type. When natural sounds were artificially shortened by applying an amplitude "gate" of variable duration, recognition remained above chance for durations in the millisecond range [3-5]. However, none of these results speaks directly to the processing time required for sound recognition. For reaction times, the comparison of recognition and simple detection times cannot be used unequivocally to estimate processing time [6]. For gating, recognizing a very short sound presented in isolation could still require a very long processing time: the short sound duration only constrains the type of acoustic features that are used [7].

Similar questions about the processing time required for visual recognition of natural objects have been asked [8,9]. They have typically been addressed with the now-classic "rapid sequential visual presentation task" (RSVP) [10-13] (for a review, see Keysers et al. [13]). Briefly, in RSVP, images are flashed in a rapid sequence, with images from one target category presented among many distractors belonging to other categories. Participants are asked to report the sequences containing an image from the target category. The fastest presentation rate for which target recognition remains accurate is taken as a measure of processing time. The core hypothesis is that, for a target to be accurately recognized, it needs to have been sufficiently processed before the next distractor is flashed [13].
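
The logic of this measure can be made concrete with a short worked calculation. The following is a minimal sketch, assuming that items are presented back to back with no silent gap, so that the interval between successive onsets is simply the reciprocal of the presentation rate. Under the core hypothesis, that interval is an upper bound on the processing time needed for recognition. The function names and the accuracy criterion are illustrative assumptions, not the analysis used in the studies cited.

```python
from typing import Mapping, Optional


def onset_interval_ms(rate_per_second: float) -> float:
    """Interval between successive item onsets, in milliseconds.

    Assumes back-to-back presentation with no silent gap (an assumption
    made for this sketch).
    """
    if rate_per_second <= 0:
        raise ValueError("presentation rate must be positive")
    return 1000.0 / rate_per_second


def fastest_accurate_rate(
    accuracy_by_rate: Mapping[float, float], criterion: float
) -> Optional[float]:
    """Fastest presentation rate whose accuracy meets the criterion.

    `accuracy_by_rate` maps a rate (items per second) to the proportion of
    correct target reports. `criterion` is a hypothetical accuracy threshold
    chosen by the experimenter. Returns None if no rate meets it.
    """
    passing = [rate for rate, acc in accuracy_by_rate.items() if acc >= criterion]
    return max(passing) if passing else None


if __name__ == "__main__":
    # A rate of 30 items per second leaves about 33.3 ms per item.
    print(round(onset_interval_ms(30), 1))
```

Given a set of measured accuracies, `fastest_accurate_rate` would pick the highest rate that still meets the chosen criterion, and `onset_interval_ms` would convert that rate into the corresponding time budget per item.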