Multi-core clusters have become primary tools for high-performance computing owing to their great computing power and favorable cost-to-performance ratio. However, they introduce new challenges for the design of efficient parallel algorithms because of their different storage mechanisms and non-uniform communication latencies. Traditional domain decomposition methods achieve load balancing by direct partitioning, which divides the structure into a number of subdomains of equal size according to the number of processing cores involved in the parallel computation. As the number of processing cores in a single node of a multi-core cluster increases exponentially, the number of subdomains increases dramatically as well. A substantial increase in the number of subdomains rapidly enlarges both the size and the condition number of the interface equations, thereby slowing the numerical convergence of the system. It also considerably increases the number of processes involved in the parallel computation, thereby increasing contention for the limited network ports and bandwidth. The slower numerical convergence and the higher network communication overhead seriously degrade the efficiency of solving the interface equations and greatly reduce the overall parallel efficiency of the domain decomposition method. To make full use of the computing power of multi-core clusters and improve the parallel efficiency of large-scale finite element structural analysis, this paper proposes a hierarchical load balancing approach. The approach fully exploits the parallelism available in the computational tasks. To match the hardware topology of multi-core clusters, the computational tasks of finite element structural analysis are divided into three layers: inter-node parallelism, inter-chip parallelism and inter-core parallelism.
Coarse-grained parallel computing is used for inter-node and inter-chip parallelism, and fine-grained parallel computing is used for inter-core parallelism. By mapping computing tasks to the different hardware layers of multi-core clusters, the proposed method not only achieves load balancing efficiently at each layer, but also greatly reduces the communication overhead of the system. Furthermore, it considerably reduces the number of subdomains and significantly improves the numerical convergence of the interface equations. To verify the effectiveness of the approach, two large-scale parallel numerical experiments in finite element linear static structural analysis were conducted on the “Tianhe 2” supercomputer. For each model, both the traditional domain decomposition method and the proposed hierarchical load balancing approach were run on 50, 100, 150, and 200 nodes. The test results show that the proposed method obtains higher speedup and parallel efficiency than the conventional domain decomposition method. The proposed approach can be applied to many kinds of structural analysis problems, including linear static analysis, nonlinear static analysis and nonlinear dynamic analysis. This paper focuses only on linear static analysis. For nonlinear static or dynamic analysis and other kinds of structural analysis, the proposed method can be used as a sub-procedure, because the calculations are still dominated by the solution of the same sort of equations.
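The effect of the three-layer mapping on the number of subdomains can be illustrated with a minimal sketch. It is not the authors' implementation: the names `ClusterShape`, `direct_subdomain_count` and `hierarchical_layout` are hypothetical, and the node shape used in the example (2 chips of 12 cores each) is an assumption chosen for illustration only. The sketch assumes one coarse-grained subdomain per node at the inter-node layer, one per chip at the inter-chip layer, and fine-grained work sharing among the cores of a chip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterShape:
    """Hypothetical description of a multi-core cluster (illustrative only)."""
    nodes: int
    chips_per_node: int
    cores_per_chip: int


def direct_subdomain_count(cluster: ClusterShape) -> int:
    """Direct partitioning: one subdomain per processing core."""
    return cluster.nodes * cluster.chips_per_node * cluster.cores_per_chip


def hierarchical_layout(cluster: ClusterShape) -> dict:
    """Assumed three-layer mapping of work onto the hardware hierarchy."""
    return {
        # Inter-node layer: one coarse-grained subdomain per node.
        "node_subdomains": cluster.nodes,
        # Inter-chip layer: each node-level subdomain is split once per chip.
        "chip_subdomains": cluster.nodes * cluster.chips_per_node,
        # Inter-core layer: the cores of a chip share one chip-level
        # subdomain through fine-grained parallelism, so no further
        # subdomains are created here.
        "workers_per_chip_subdomain": cluster.cores_per_chip,
    }


if __name__ == "__main__":
    # Assumed shape: 200 nodes, 2 chips per node, 12 cores per chip.
    cluster = ClusterShape(nodes=200, chips_per_node=2, cores_per_chip=12)
    print("direct partitioning:", direct_subdomain_count(cluster))
    print("hierarchical mapping:", hierarchical_layout(cluster))
```

Under these assumptions, direct partitioning creates one subdomain for each of the 4800 cores, whereas the hierarchical mapping creates 200 node-level and 400 chip-level subdomains, with the 12 cores of each chip cooperating on a single subdomain.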