The complexity of many embedded software applications and their stringent performance requirements have reached a point where they can no longer be supported by conventional embedded-system architectures based on a single general-purpose processor. Nanometer fabrication technologies have made it feasible to integrate several embedded processors on a single chip to create multiprocessor systems-on-chip (MPSoCs) for a wide range of applications. It is already well known that, for embedded systems, customizing the processor to the application can significantly increase performance and energy efficiency. Single-chip heterogeneous multiprocessors, where the different processors are customized to the tasks they perform, are thus attracting attention. Despite their significant potential, a primary bottleneck to their adoption remains the development of programming paradigms and tools to alleviate the design complexity. Some advances have been made in the areas of custom multiprocessor architecture, tool support, and design methodology in order to reduce the design turnaround time. However, because of the heterogeneity of the custom processors, the designers are still required to manually map the application tasks to the different processors and, at the same time, customize each processor so that the overall performance requirements are satisfied. A multilevel custom multiprocessor-synthesis methodology to perform the assignment and scheduling of the application tasks on the various processors together with processor customization in an integrated manner is proposed. This paper focuses on extensible processors, which provide a good tradeoff between efficiency, flexibility, and design turnaround time, by combining a configurable base processor with hardware that implements custom instructions to extend its instruction set. This paper motivates the need for such an integrated approach by demonstrating that custom-instruction selection has complex interdependencies with task assignment and scheduling, and performing these steps independently often results in significant degradation in the quality of the synthesized multiprocessor architecture. The methodology presented here uses an iterative improvement algorithm to first assign and schedule tasks on processors and then select custom instructions along the critical path. It utilizes the concept of "expected execution time" to connect these two steps. It not only considers the currently selected custom instructions for the current task assignment and schedule, but also the possibility of better custom instructions being selected in future iterations. The methodology presented here is also enhanced to integrate task-level software pipelining to further increase the parallelism in the task graph and provide opportunities for multiprocessing. In this paper, the proposed heterogeneous multiprocessor-synthesis methodology is implemented in the context of a commercial extensible-processor design flow, using the Xtensa platform from Tensilica Inc. The presented tool was evaluated by automatically generating custom multiprocessor architectures for several complex embedded software benchmarks. The results show that architectures synthesized by the proposed methodology demonstrate an average speedup of 2.0times (up to 2.9times) compared to symmetric multiprocessor architectures in which the processors have not been augmented with custom instructions. To the best of the authors' knowledge, this is the first tool for the synthesis of custom MPSoCs using extensible processors