168 results for "multicore"
Search Results
2. 30-GHz Low-Phase-Noise Scalable Multicore Class-F Voltage-Controlled Oscillators Using Coupled-Line-Based Synchronization Topology.
- Author
-
Wan, Jiayue, Li, Xiao, Fei, Zesong, Han, Fang, Li, Xiaoran, Wang, Xinghua, and Chen, Zhiming
- Abstract
In this letter, low-phase-noise multicore class-F voltage-controlled oscillators (VCOs) using coupled-line-based synchronization topology are proposed. Compared to traditional resistance-coupled multicore VCOs, the proposed coupled-line-based topology improves the $Q$ of the small inductors in the millimeter-wave frequency range. Mode ambiguity is eliminated for a robust oscillation startup. Quad-core and oct-core VCO prototypes are designed and implemented in a 65-nm CMOS process, which exhibit a measured frequency tuning range of 20.5% centered at 31.32 GHz. The quad-core VCO has a measured phase noise (PN) of −134.33 dBc/Hz and a corresponding FoM of 191.32 dBc/Hz at 10-MHz offset from 28.28 GHz. The oct-core VCO has a measured PN of −137.23 dBc/Hz and a corresponding FoM of 191.08 dBc/Hz at 10-MHz offset from 28.16 GHz. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
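Note on the figures in entry 2 above: the oscillator figure of merit (FoM) customarily normalizes phase noise by offset frequency and DC power. The exact definition the authors use is not given in the abstract, so the following is only the conventional form:

$$\mathrm{FoM} = \left|\mathrm{PN}(\Delta f)\right| + 20\log_{10}\!\left(\frac{f_0}{\Delta f}\right) - 10\log_{10}\!\left(\frac{P_{\mathrm{DC}}}{1\,\mathrm{mW}}\right).$$

For the quad-core prototype, $20\log_{10}(28.28\,\mathrm{GHz}/10\,\mathrm{MHz}) \approx 69.0$ dB, so the quoted PN of −134.33 dBc/Hz and FoM of 191.32 dBc/Hz are mutually consistent only for a DC power on the order of 16 mW; this is an inference, since the abstract does not state the power consumption.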
3. Memory-Aware Denial-of-Service Attacks on Shared Cache in Multicore Real-Time Systems.
- Author
-
Bechtel, Michael and Yun, Heechul
- Subjects
- DENIAL of service attacks, SHARED workspaces, RANDOM access memory, MULTICORE processors, MICROELECTROMECHANICAL systems
- Abstract
In this paper, we identify that memory performance plays a crucial role in the feasibility and effectiveness for performing denial-of-service attacks on shared cache. Based on this insight, we introduce new cache DoS attacks, which can be mounted from the user-space and can cause extreme worst-case execution time (WCET) impacts to cross-core victims—even if the shared cache is partitioned—by taking advantage of the platform’s memory address mapping information and HugePage support. We deploy these enhanced attacks on two popular embedded out-of-order multicore platforms using both synthetic and real-world benchmarks. The proposed DoS attacks achieve up to 111X WCET increases on the tested platforms. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
4. Fast Newton-Raphson Power Flow Analysis Based on Sparse Techniques and Parallel Processing.
- Author
-
Ahmadi, Afshin, Smith, Melissa C., Collins, Edward R., Dargahi, Vahid, and Jin, Shuangshuang
- Subjects
- ELECTRICAL load, MULTICORE processors, PARALLEL processing, SPARSE matrices, SYSTEM analysis, GRAPHICS processing units
- Abstract
Power flow (PF) calculation provides the basis for the steady-state power system analysis and is the backbone of many power system applications ranging from operations to planning. The calculated voltage and power values by PF are essential to determining the system condition and ensuring the security and stability of the grid. The emergence of multicore processors provides an opportunity to accelerate the speed of PF computation and, consequently, improve the performance of applications that run PF within their processes. This paper introduces a fast Newton-Raphson power flow implementation on multicore CPUs by combining sparse matrix techniques, mathematical methods, and parallel processing. Experimental results validate the effectiveness of our approach by finding the power flow solution of a synthetic U.S. grid test case with 82,000 buses in just 1.8 seconds. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
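The entry above combines sparse matrix techniques with parallel processing; the numerical core of any such Newton-Raphson solver is the repeated solution of a sparse linear system $J(x)\,\Delta x = -f(x)$. The sketch below shows only that generic kernel on an invented toy residual; it is not the authors' power flow formulation, and all names and parameters are illustrative:

```python
# Minimal sketch of the sparse Newton-Raphson kernel a power flow solver is
# built around; NOT the paper's implementation. The toy residual f() and its
# Jacobian are invented so the example stays self-contained and runnable.
import numpy as np
from scipy.sparse import diags, csr_matrix
from scipy.sparse.linalg import spsolve

def f(x):
    """Toy nonlinear residual with a sparse (tridiagonal-like) coupling pattern."""
    r = x**3 - 2.0
    r[1:] += 0.1 * x[:-1]              # weak coupling to the previous unknown
    return r

def jacobian(x):
    """Analytic sparse Jacobian of f in compressed sparse row form."""
    main = 3.0 * x**2
    lower = np.full(len(x) - 1, 0.1)
    return csr_matrix(diags([lower, main], offsets=[-1, 0]))

def newton_sparse(x0, tol=1e-10, max_iter=20):
    """Newton-Raphson: solve J(x) dx = -f(x) and update x until the residual vanishes."""
    x = x0.copy()
    for it in range(max_iter):
        r = f(x)
        if np.linalg.norm(r, np.inf) < tol:
            return x, it
        x += spsolve(jacobian(x), -r)  # sparse solve: the expensive step
    return x, max_iter

if __name__ == "__main__":
    x, iters = newton_sparse(np.ones(1000))
    print(f"converged in {iters} iterations, max residual {np.abs(f(x)).max():.2e}")
```

A real power flow code replaces f() with the active/reactive power mismatches and jacobian() with the bus-admittance-based Jacobian; assembling and solving that sparse system is typically where multicore parallelism pays off.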
5. OPTIMUS: A Security-Centric Dynamic Hardware Partitioning Scheme for Processors that Prevent Microarchitecture State Attacks.
- Author
-
Omar, Hamza, D'Agostino, Brandon, and Khan, Omer
- Subjects
- DYNAMIC random access memory, RANDOM access memory, HARDWARE, MULTICORE processors
- Abstract
Hardware virtualization allows multiple security-critical and ordinary (insecure) processes to co-execute on a processor. These processes temporally share hardware resources and endure numerous security threats on the microarchitecture state. State-of-the-art secure processor architectures, such as MI6 and IRONHIDE enable capabilities to execute security-critical processes in hardware isolated enclaves utilizing the strong isolation security primitive. The MI6 processor purges small state resources on each enclave entry/exit and statically partitions the last-level cache and DRAM regions to ensure strong isolation. IRONHIDE takes a spatial approach and creates two isolated clusters of cores in a multicore processor to ensure strong isolation for processes executing in the enclave cluster. Both architectures observe performance degradation due to static partitioning of shared hardware resources. OPTIMUS proposes a security-centric dynamic hardware resource partitioning scheme that operates entirely at runtime and ensures strong isolation. It enables deterministic resource allocations at the application level granularity, and limits the number of hardware reconfigurations to ensure bounded information leakage via the timing and termination channels. The dynamic hardware resource partitioning capability of OPTIMUS is shown to co-optimize performance and security for the MI6 and IRONHIDE architectures. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
6. Lock-Free Parallelization for Variance-Reduced Stochastic Gradient Descent on Streaming Data.
- Author
-
Peng, Yaqiong, Hao, Zhiyu, and Yun, Xiaochun
- Subjects
- INFORMATION commons, DATA structures, ALGORITHMS, MACHINE learning
- Abstract
Stochastic Gradient Descent (SGD) is an iterative algorithm for fitting a model to the training dataset in machine learning problems. With low computation cost, SGD is especially suited for learning from large datasets. However, the variance of SGD tends to be high because it uses only a single data point to determine the update direction at each iteration of gradient descent, rather than all available training data points. Recent research has proposed variance-reduced variants of SGD by incorporating a correction term to approximate full-data gradients. However, it is difficult to parallelize such variants with high performance and accuracy, especially on streaming data. As parallelization is a crucial requirement for large-scale applications, this article focuses on the parallel setting in a multicore machine and presents LFS-STRSAGA, a lock-free approach to parallelizing variance-reduced SGD on streaming data. LFS-STRSAGA embraces a lock-free data structure to process the arrival of streaming data in parallel, and asynchronously maintains the essential information to approximate full-data gradients with low cost. Both our theoretical and empirical results show that LFS-STRSAGA matches the accuracy of the state-of-the-art variance-reduced SGD on streaming data under sparsity assumption (common in machine learning problems), and that LFS-STRSAGA reduces the model update time by over 98 percent. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
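Entry 6 above builds on variance-reduced SGD with a correction term that approximates the full-data gradient. Below is a minimal single-threaded sketch of one such update rule (a SAGA-style step, shown only to make the correction term concrete); it is not LFS-STRSAGA and omits the lock-free, streaming machinery entirely:

```python
# Single-threaded sketch of a SAGA-style variance-reduced SGD step, i.e. the
# kind of corrected update that LFS-STRSAGA-like methods parallelize lock-free.
# NOT the authors' algorithm; the data, model, and step size are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(w, i):
    """Gradient of the squared loss 0.5 * (x_i . w - y_i)^2 for one example."""
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
memory = np.zeros((n, d))       # last gradient observed for every example
avg = memory.mean(axis=0)       # running average of the stored gradients
step = 0.01

for t in range(20 * n):
    i = rng.integers(n)
    g = grad_i(w, i)
    # Variance-reduced direction: fresh gradient for example i, corrected by
    # its stored gradient and the average of all stored gradients.
    w -= step * (g - memory[i] + avg)
    avg += (g - memory[i]) / n  # keep the running average consistent
    memory[i] = g

print("final mean squared error:", np.mean((X @ w - y) ** 2))
```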
7. BlackParrot: An Agile Open-Source RISC-V Multicore for Accelerator SoCs.
- Author
-
Petrisko, Daniel, Gilani, Farzam, Wyse, Mark, Jung, Dai Cheol, Davidson, Scott, Gao, Paul, Zhao, Chun, Azad, Zahra, Canakci, Sadullah, Veluri, Bandhav, Guarino, Tavio, Joshi, Ajay, Oskin, Mark, and Taylor, Michael Bedford
- Abstract
This article introduces BlackParrot, which aims to be the default open-source, Linux-capable, cache-coherent, 64-bit RISC-V multicore used by the world. In executing this goal, our research aims to advance the world's knowledge about the "software engineering of hardware." Although originally bootstrapped by the University of Washington and Boston University via DARPA funding, BlackParrot strives to be community driven and infrastructure agnostic; a multicore which is Pareto optimal in terms of power, performance, area, and complexity. In order to ensure BlackParrot is easy to use, extend, and, most importantly, trust, development is guided by three core principles: Be Tiny, Be Modular, and Be Friendly. Development efforts have prioritized the use of intentional interfaces and modularity and silicon validation as first-order design metrics, so that users can quickly get started and trust that their design will perform as expected when deployed. BlackParrot has been validated in a GlobalFoundries 12-nm FinFET tapeout. BlackParrot is ideal as a standalone Linux processor or as a malleable fabric for an agile accelerator SoC design flow. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
8. Custard: ASIC Workload-Aware Reliable Design for Multicore IoT Processors.
- Author
-
Lerner, Scott, Yilmaz, Isikcan, and Taskin, Baris
- Subjects
- APPLICATION-specific integrated circuit design & construction, INTERNET of things, MULTICORE processors
- Abstract
In the typical application-specific integrated circuit (ASIC) design flow, reliability-driven performance loss is computed, in part, with switching activity files. However, for ASIC designs of multicore processors, the typical switching activity files lack multithreaded software workload information. An accurate switching activity for a multicore design can be generated using a logic simulator. However, the logic simulator process suffers from long runtimes when dealing with real workloads. This paper analyzes the effects of scaling multithreaded workloads and proposes Custard, a hardware methodology for lifetime improvement of multicore processors by obtaining multithreaded switching activity signatures in a short period of time using a performance simulator (gem5), logic simulator (VCS), and thermal simulator (HotSpot). Custard is particularly important for multicore, Internet of Things processors as the runtime feedback-based reliability mechanisms used on current multicore processors incur area and power overhead that could be prohibitive for smaller form factors and power budgets. Experiments are performed with Custard using real workloads on an OpenSPARC T1 design with two, four, and eight cores that are fully synthesized and routed. The default-sized T1 core is improved to have a reliability increase of $4.1\times$, with 0.08% and 1.57% increase on average in cell area and switching power, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
9. Dynamic Scheduling of Real-Time Tasks in Heterogeneous Multicore Systems.
- Author
-
Baital, Kalyan and Chakrabarti, Amlan
- Abstract
The shift from homogeneous multicore to heterogeneous multicore introduces challenges in scheduling the tasks to the appropriate cores maintaining the time deadline. This letter studies the existing scheduling schemes in a heterogeneous multicore system and finds an approach to enhance the homogeneous system model to heterogeneous scheduling architecture. The proposed model increases the overall system utilization by accommodating almost all the tasks (low power task and high power task) into appropriate cores (big high power and small low power). It further enhances the system performance by allocating rejected jobs from small cores into the big cores through a dispatcher. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
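The letter in entry 9 routes tasks to big or small cores and forwards jobs rejected by the small cores to the big cores through a dispatcher. The toy sketch below only illustrates that two-stage placement idea with a simple utilization-based admission test; the letter's actual model, task parameters, and acceptance criterion are not reproduced here:

```python
# Toy sketch of two-stage task placement: admit tasks on small cores first,
# then let a dispatcher forward rejected tasks to the big cores. This is NOT
# the letter's scheduling model; the admission test and numbers are invented.
from dataclasses import dataclass, field

@dataclass
class Core:
    name: str
    capacity: float                      # schedulable utilization budget
    load: float = 0.0
    tasks: list = field(default_factory=list)

    def try_admit(self, name, wcet, period):
        util = wcet / period
        if self.load + util <= self.capacity:
            self.load += util
            self.tasks.append(name)
            return True
        return False

small = [Core("small0", 0.7), Core("small1", 0.7)]
big = [Core("big0", 1.0), Core("big1", 1.0)]
tasks = [("t%d" % i, w, p) for i, (w, p) in
         enumerate([(2, 10), (3, 10), (5, 20), (8, 20), (4, 10), (9, 30)])]

rejected = []
for t in tasks:                          # first pass: small, low-power cores
    if not any(core.try_admit(*t) for core in small):
        rejected.append(t)

for t in rejected:                       # dispatcher: retry on big cores
    if not any(core.try_admit(*t) for core in big):
        print(t[0], "could not be placed on any core")

for core in small + big:
    print(core.name, "load=%.2f" % core.load, core.tasks)
```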
10. WCRT Analysis and Evaluation for Sporadic Message-Processing Tasks in Multicore Automotive Gateways.
- Author
-
Xie, Guoqi, Zeng, Gang, Kurachi, Ryo, Takada, Hiroaki, Li, Zhetao, Li, Renfa, and Li, Keqin
- Subjects
- MULTICORE processors, REACTION time, GATEWAYS (Computer networks), LOGIC circuits, AUTOMOTIVE electronics, INTEGRATED circuits
- Abstract
We study the worst case response time (WCRT) analysis and evaluation for sporadic message-processing tasks in a multicore automotive gateway of a controller area network (CAN) cluster. We first build a multicore automotive gateway on CAN clusters. Two WCRT analysis methods for message-processing tasks in the multicore gateway are subsequently presented based on global and partitioned scheduling paradigms. We evaluate the WCRT results of two analysis methods with real message sets provided by the automaker, and present the design optimization guide. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
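Entry 10 does not reproduce its analysis, but worst case response time analyses of this kind are usually built on the classical recurrence for sporadic tasks on a single core, iterated from $R_i^{(0)} = C_i$ until it converges or exceeds the deadline:

$$R_i^{(k+1)} = C_i + B_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(k)}}{T_j} \right\rceil C_j,$$

where $C_i$ is the worst case execution time, $B_i$ a blocking term, $T_j$ the minimum inter-arrival time of a higher-priority task $j$, and $hp(i)$ the set of higher-priority tasks. The paper's contribution is extending such per-core reasoning to message-processing tasks under global and partitioned multicore scheduling in a CAN gateway; the baseline recurrence above is shown only for orientation.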
11. SEE Error-Rate Evaluation of an Application Implemented in COTS Multicore/Many-Core Processors.
- Author
-
Ramos, Pablo, Vargas, Vanessa, Baylac, Maud, Zergainoh, Nacer-Eddine, and Velazco, Raoul
- Subjects
- RADIATION, COMPLEMENTARY metal oxide semiconductors, SILICON-on-insulator technology, MICROPROCESSORS, LOGIC circuits
- Abstract
This paper evaluates the error rate of a memory-bound application implemented in different commercial-off-the-shelf multicore and many-core processors. To achieve this goal, two quantitative experiments are performed: fault-injection campaigns and radiation ground testing. In addition, this paper proposes an approach for predicting the application error rate by combining the results issued from both types of experiments. The usefulness of the approach is illustrated by three case studies implemented in processors having different manufacturing technologies and architectures: 45-nm silicon-on-insulator (SOI) Freescale P2041 quad-core processor, 65-nm CMOS Adapteva Epiphany multicore processor, and 28-nm CMOS Kalray multipurpose processing array-256 many-core processor. The reliability of the processors for avionics is obtained from their experimental error rates extrapolated to avionic altitudes. Reliability curves are plotted for observing the prediction accuracy. A comparison of the failure in time of the selected processors shows that the greater single-event effect vulnerability of CMOS technology compared with the SOI one can be compensated with the implementation of effective error detection and correction. These protection mechanisms allow the use of CMOS devices having huge memory capacity in applications operating in severe radiation environments. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
12. FA-Stack: A Fast Array-Based Stack with Wait-Free Progress Guarantee.
- Author
-
Peng, Yaqiong and Hao, Zhiyu
- Subjects
- MULTICORE processors, DATA structures, PARALLEL processing, QUEUEING networks, DATA types (Computer science), PARALLEL computers
- Abstract
The prevalence of multicore processors necessitates the design of efficient concurrent data structures. Shared concurrent stacks are widely used as inter-thread communication structures in parallel applications. Wait-free stacks can ensure that each thread completes operations on them in a finite number of steps. This characteristic is valuable for parallel applications and operating systems, especially in real-time environments. Unfortunately, because wait-free algorithms are typically hard to design and considered inefficient, practical wait-free stacks are rare. In this paper, we present a practical, fast array-based concurrent stack with wait-free progress guarantee, named FA-Stack. A series of optimizations are proposed to bound the number of steps required to complete every push and pop operation. In addition, FA-Stack adopts a time-stamped scheme to reclaim memory. We use linearizability, a correctness condition for concurrent data structures, to prove that FA-Stack is a wait-free linearizable stack with respect to the Last in First Out (LIFO) semantics. Our evaluation with representative benchmarks shows that FA-Stack is an efficient wait-free stack. For example, compared to Sim-Stack (a state-of-the-art wait-free stack), FA-Stack improves the throughput of the halfhalf benchmark by up to 2.4$\times$. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
13. An Efficient Architecture of In-Loop Filters for Multicore Scalable HEVC Hardware Decoders.
- Author
-
Kim, HyunMi, Ko, JeongGil, and Park, Seongmo
- Abstract
This paper proposes an efficient architecture of HEVC in-loop filters (ILFs) with the target of providing effective multicore utilization for ultra-high definition video applications. While HEVC allows for a high level of parallelization, the issue of data dependencies at the ILF leads to inefficient parallel processing performance. The novel memory organization and management techniques address the data dependence-related issues between multiple processing units and enable to filter the flexible area on multicore decoder. In addition, we introduce the adaptive deblocking filtering order (ADFO) to minimize the impact of bus congestion when multiple cores interoperate for processing very large data. Furthermore, we design the deblocking filter with skip mode pipelining to achieve the high performance minimizing the increased cost and the power consumption. For SAO, we apply the window-based parallel SAO filtering scheme. The resource sharing is considered throughout the entire architecture. Based on both experimental and analytical results, our proposed design can achieve more than 1.31 Gpixels/s and less than 2.6 Gpixels/s at maximum frequency 660 MHz in single core, and consumes 56.2 Kgates including 10.6 Kgates for memory management architecture, which supports multicore decoder, and about 20.8 mW power on average when synthesizing with the 28 nm CMOS library. Moreover, the skip modes of DF improve both the performance and the power dissipation. The ADFO improves the performance of ∼9.17% when decoding 8 K sequence on octacore at 400 MHz frequency. TpG (Throughput per Gate) is the highest among the related works. [ABSTRACT FROM PUBLISHER]
- Published
- 2018
- Full Text
- View/download PDF
14. Principal Component Analysis Based Filtering for Scalable, High Precision k-NN Search.
- Author
-
Feng, Huan, Eyers, David, Mills, Steven, Wu, Yongwei, and Huang, Zhiyi
- Subjects
- PRINCIPAL components analysis, COMPUTER vision, MACHINE learning, K-nearest neighbor classification, SCALABILITY, PARALLEL algorithms
- Abstract
Approximate $k$NN (A$k$NN) search in high-dimensional datasets does not scale well on multicore platforms, due to its large memory footprint. This work presents PCAF, a parallel A$k$NN search approach based on principal component analysis filtering. PCAF improves on previous methods, demonstrating sustained, high scalability for a wide range of high-dimensional datasets on both Intel and AMD multicore platforms. Moreover, PCAF maintains highly precise A$k$NN search results.
- Published
- 2018
- Full Text
- View/download PDF
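A generic illustration of the filtering idea named in entry 14: project the dataset onto a few principal components, prune candidates with cheap distances in that subspace, and compute exact distances only for the survivors. This is not the PCAF implementation (which is parallel and more elaborate); the sizes and parameters below are invented:

```python
# Generic illustration of PCA-based filtering for k-NN search (NOT the PCAF
# implementation from the entry above): candidates are pruned with cheap
# distances in a low-dimensional PCA projection, and exact distances are
# computed only for the survivors. All sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(20000, 128)).astype(np.float32)
query = rng.normal(size=128).astype(np.float32)
k, n_components, n_candidates = 10, 8, 500

# Fit PCA via SVD on a sample of the data and project everything once.
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data[:5000] - mean, full_matrices=False)
basis = vt[:n_components].T                      # d x m projection matrix
proj_data = (data - mean) @ basis
proj_query = (query - mean) @ basis

# Filtering stage: approximate distances in the m-dimensional subspace.
approx = np.sum((proj_data - proj_query) ** 2, axis=1)
candidates = np.argpartition(approx, n_candidates)[:n_candidates]

# Refinement stage: exact distances only for the surviving candidates.
exact = np.sum((data[candidates] - query) ** 2, axis=1)
neighbors = candidates[np.argsort(exact)[:k]]
print("approximate k-NN ids:", neighbors)
```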
15. Ring-Core Multicore Few-Mode Erbium-Doped Fiber Amplifier.
- Author
-
Amma, Yoshimichi, Hosokawa, Tsukasa, Ono, Hirotaka, Ichii, Kentaro, Takenaga, Katsuhiro, Matsuo, Shoichiro, and Yamada, Makoto
- Abstract
In this letter, we demonstrate a seven-core few-mode erbium-doped fiber amplifier by core pumping for long-haul transmission system using a few-mode multicore fiber. The differential modal gain between the LP01 and LP11 modes is reduced to 3.2 dB or less regardless of the pumping-light mode using a ring-core refractive-index profile, and flattened averaged gains of over 20 dB and noise figures of less than 7 dB for all cores and all modes of the input light are achieved in the C-band. The inter-core crosstalk between the center and outer cores has a very low value of -49 dB or less at a wavelength of 1560 nm. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
16. Resource Sharing in Multicore Mixed-Criticality Systems: Utilization Bound and Blocking Overhead.
- Author
-
Han, Jian-Jun, Tao, Xin, Zhu, Dakai, and Yang, Laurence T.
- Subjects
- MULTICORE processors, APPLICATION software, SYNCHRONIZATION, SPAM filtering (Email), TASKS
- Abstract
In a mixed-criticality (MC) system, diverse application activities with various certification requirements (different criticality) can share a computing platform, where multicore processors have emerged as the prevailing computing engines. Focusing on the problem of resource access contention in multicore MC systems, we analyze the synchronization issues and blocking characteristics of the Multiprocessor Stack Resource Policy (MSRP) with both priority and criticality inversions among MC tasks being considered. We develop the first criticality-aware utilization bound under partitioned Earliest Deadline First (EDF) and MSRP by taking the worst case synchronization overheads of tasks into account. The non-monotonicity of the bound where it may decrease when more cores are deployed is identified, which can cause anomalies in the feasibility tests. With the objective to improve system schedulability, a novel criticality-cognizant and resource-oriented analysis approach is further studied to tighten the bound on the synchronization overheads for MC tasks scheduled under partitioned EDF and MSRP. The simulation results show that the new analysis approach can effectively reduce the blocking times for tasks (up to 30 percent) and thus improve the schedulability ratio (e.g., 10 percent more). The actual implementation in Linux kernel further shows the practicability of partitioned-EDF with MSRP (with run-time overhead being about 3 to 7 percent of the overall execution time) for MC tasks running on multicores with shared resources. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
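For orientation on the kind of bound entry 16 tightens: the classical single-core baseline for EDF with a stack-based resource policy (Baker's SRP test, for implicit-deadline sporadic tasks indexed by nondecreasing period) is

$$\forall k:\ \sum_{i=1}^{k} \frac{C_i}{T_i} + \frac{B_k}{T_k} \le 1,$$

where $B_k$ bounds the blocking task $k$ can suffer from resource-holding tasks with longer relative deadlines. The paper's criticality-aware utilization bound for partitioned EDF with MSRP on multicores refines tests of this shape; its exact form is not reproduced in the abstract.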
17. BWLOCK: A Dynamic Memory Access Control Framework for Soft Real-Time Applications on Multicore Platforms.
- Author
-
Yun, Heechul, Ali, Waqar, Gondi, Santosh, and Biswas, Siddhartha
- Subjects
- MULTICORE processors, ACCESS control, REAL-time computing, BANDWIDTHS, ELECTRONIC data processing
- Abstract
Soft real-time applications often show bursty memory access patterns—requiring high memory bandwidth for a short duration of time—that are often critical for timely data processing and performance. We call the code sections that exhibit such characteristics Memory-Performance Critical Sections (MPCSs). Unfortunately, in multicore architectures, non-real-time applications on different cores may also demand high memory bandwidth at the same time. Resulting bandwidth contention can substantially increase the time spent on MPCSs of soft real-time applications, which in turn could result in missed deadlines. In this paper, we present a memory access control framework called BWLOCK, which is designed to protect MPCSs of soft real-time applications. BWLOCK consists of a user-level library and a kernel-level memory bandwidth control mechanism. The user-level library provides a lock-like API to declare MPCSs for real-time applications. When a real-time application enters an MPCS, the kernel-level bandwidth control mechanism dynamically throttles memory bandwidth of the rest of the cores to protect the MPCS, until it is completed. We evaluate BWLOCK using CortexSuite benchmarks. By selectively applying BWLOCK, based on the memory intensity of the code blocks in each benchmark, we achieve significant performance improvements, up to 150 percent reduction in slowdown, at a controllable throughput impact to non-real-time applications. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
18. Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores.
- Author
-
Feliu, Josue, Sahuquillo, Julio, Petit, Salvador, and Duato, Jose
- Subjects
- COMPUTER scheduling, SIMULTANEOUS multithreading processors, CACHE memory, BANDWIDTHS, LINUX operating systems
- Abstract
Nowadays, high performance multicore processors implement multithreading capabilities. The processes running concurrently on these processors are continuously competing for the shared resources, not only among cores, but also within the core. While resource sharing increases the resource utilization, the interference among processes accessing the shared resources can strongly affect the performance of individual processes and its predictability. In this scenario, process scheduling plays a key role to deal with performance and fairness. In this work we present a process scheduler for SMT multicores that simultaneously addresses both performance and fairness. This is a major design issue since scheduling for only one of the two targets tends to damage the other. To address performance, the scheduler tackles bandwidth contention at the L1 cache and main memory. To deal with fairness, the scheduler estimates the progress experienced by the processes, and gives priority to the processes with lower accumulated progress. Experimental results on an Intel Xeon E5645 featuring six dual-threaded SMT cores show that the proposed scheduler improves both performance and fairness over two state-of-the-art schedulers and the Linux OS scheduler. Compared to Linux, unfairness is reduced to a half while still improving performance by 5.6 percent. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
19. A 16-Core Voltage-Stacked System With Adaptive Clocking and an Integrated Switched-Capacitor DC–DC Converter.
- Author
-
Lee, Sae Kyu, Tong, Tao, Zhang, Xuan, Brooks, David, and Wei, Gu-Yeon
- Subjects
- ELECTRIC potential, CAPACITORS, CONVERTERS (Electronics), MULTICORE processors, VOLTAGE control
- Abstract
This paper presents a 16-core voltage-stacked system with adaptive frequency clocking (AFClk) and a fully integrated voltage regulator that demonstrates efficient on-chip power delivery for multicore systems. Voltage stacking alleviates power delivery inefficiencies due to off-chip parasitics but adds complexity to combat internal voltage noise. To address the corresponding issue of internal voltage noise, the system utilizes an AFClk scheme with an efficient switched-capacitor dc–dc converter to mitigate noise on the stack layers and to improve system performance and efficiency. Experimental results demonstrate robust voltage noise mitigation as well as the potential of voltage stacking as a highly efficient power delivery scheme. This paper also illustrates that augmenting the hardware techniques with intelligent workload allocation that exploits the inherent properties of voltage stacking can preemptively reduce the interlayer activity mismatch and improve system efficiency. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
20. Cache Hierarchy-Aware Query Mapping on Emerging Multicore Architectures.
- Author
-
Ozturk, Ozcan, Orhan, Umut, Ding, Wei, Yedlapalli, Praveen, and Kandemir, Mahmut Taylan
- Subjects
- COMPUTER architecture, MIXED integer linear programming, COMPUTER programming, CACHE memory, INTEGRATED circuits
- Abstract
One of the important characteristics of emerging multicores/manycores is the existence of “shared on-chip caches,” through which different threads/processes can share data (help each other) or displace each other’s data (hurt each other). Most of current commercial multicore systems on the market have on-chip cache hierarchies with multiple layers (typically, in the form of L1, L2 and L3, the last two being either fully or partially shared). In the context of database workloads, exploiting full potential of these caches can be critical. Motivated by this observation, our main contribution in this work is to present and experimentally evaluate a cache hierarchy-aware query mapping scheme targeting workloads that consist of batch queries to be executed on emerging multicores. Our proposed scheme distributes a given batch of queries across the cores of a target multicore architecture based on the affinity relations among the queries. The primary goal behind this scheme is to maximize the utilization of the underlying on-chip cache hierarchy while keeping the load nearly balanced across domain affinities. Each domain affinity in this context corresponds to a cache structure bounded by a particular level of the cache hierarchy. A graph partitioning-based method is employed to distribute queries across cores, and an integer linear programming (ILP) formulation is used to address locality and load balancing concerns. We evaluate our scheme using the TPC-H benchmarks on an Intel Xeon based multicore. Our solution achieves up to 25 percent improvement in individual query execution times and 15-19 percent improvement in throughput over the default Linux-based process scheduler. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
21. IBM Power9 Processor Architecture.
- Author
-
Sadasivam, Satish Kumar, Thompto, Brian W., Kalla, Ron, and Starke, William J.
- Subjects
- MULTIPROCESSORS, COMPUTER systems, COMPUTING platforms, PERFORMANCE of multiprocessors
- Abstract
The IBM Power9 processor has an enhanced core and chip architecture that provides superior thread performance and higher throughput. The core and chip architectures are optimized for emerging workloads to support the needs of next-generation computing. Multiple variants of silicon target the scale-out and scale-up markets. With a new core microarchitecture design, along with an innovative I/O fabric to support several accelerated computing requirements, the Power9 processor meets the diverse computing needs of the cognitive era and provides a platform for accelerated computing. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
22. Enhancing the Malloc System with Pollution Awareness for Better Cache Performance.
- Author
-
Liao, Xiaofei, Guo, Rentong, Jin, Hai, Yue, Jianhui, and Tan, Guang
- Subjects
- POLLUTION, CACHE memory, MULTICORE processors, POLLUTION control equipment, ONLINE monitoring systems
- Abstract
Cache pollution, by which weak-locality data unduly replaces strong-locality data, may notably degrade application performance in a shared-cache multicore machine. This paper presents NightWatch, a cache management subsystem that provides general, transparent and low-overhead pollution control to applications. NightWatch is based on the observation that data within the same memory chunk or chunks within the same allocation context often share similar locality property. NightWatch embodies this observation by online monitoring current cache locality to predict future behavior and restricting potential cache polluters proactively. We have integrated NightWatch into two popular allocators, tcmalloc and ptmalloc2. Experiments with SPEC CPU2006 show that NightWatch improves application performance by up to 45 percent (18 percent on average), with an average monitoring overhead of 0.57 percent (up to 3.02 percent). [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
23. An Ultralow Phase Noise Eight-Core Fundamental 62-to-67-GHz VCO in 65-nm CMOS.
- Author
-
Zhang, Jingzhi, Zhao, Chenxi, Wu, Yunqiu, Liu, Huihua, Zhu, Yan, and Kang, Kai
- Abstract
An eight-core fundamental VCO with ultralow phase noise (PN) performance is presented. Eight VCOs are in-phase coupled for low PN, and a 9-dB PN improvement is achieved. A scalable layout methodology is presented to solve the layout difficulties in multicore VCOs. Two VCO cores are connected together directly to form a VCO cell. By simply connecting $N/2$ VCO cells, the PN of the entire multicore VCO will achieve a $10\log N$ dB PN improvement compared with the single VCO core. The proposed eight-core VCO is fabricated in 65-nm CMOS and occupies 0.15 mm$^2$. The measured PN is −105.5 dBc/Hz at 1-MHz offset, with the tuning range from 62.2 to 67.3 GHz (7.9%). The VCO consumes 61.2-mW power with the figure-of-merit of −183.5 dBc/Hz at 1-MHz offset. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
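The coupling gain quoted in entry 23 follows the usual $N$-core phase-noise averaging rule: ideally, coupling $N$ identical oscillator cores in phase lowers the phase noise by

$$10\log_{10} N = 10\log_{10} 8 \approx 9.0\ \mathrm{dB}$$

for eight cores, which matches the 9-dB improvement reported; in practice, coupler loss and core mismatch can erode part of this ideal figure.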
24. A $4 \times 4 \times 2$ Homogeneous Scalable 3D Network-on-Chip Circuit With 326 MFlit/s 0.66 pJ/b Robust and Fault Tolerant Asynchronous 3D Links.
- Author
-
Vivet, Pascal, Thonnart, Yvain, Lemaire, Romain, Santos, Cristiano, Beigne, Edith, Bernard, Christian, Darve, Florian, Lattard, Didier, Miro-Panades, Ivan, Dutoit, Denis, Clermidy, Fabien, Cheramy, S., Sheibanyrad, Abbas, Petrot, Frederic, Flamand, Eric, Michailos, Jean, Arriordaz, Alexandre, Wang, Lee, and Schloeffel, Juergen
- Subjects
- INTEGRATED circuits, THREE-dimensional display systems, NETWORKS on a chip
- Abstract
Future many cores, either for high performance computing or for embedded applications, are facing the power wall, and cannot be scaled up using only the reduction of technology nodes; 3D integration, using through silicon via (TSV) as an advanced packaging technology, allows further system integration, while reducing the power dissipation devoted to system-level communication. In this paper, we present a 3D modular and scalable network-on-chip (NoC) architecture implemented using robust asynchronous logic. The 3DNOC circuit targets a Telecom long-term evolution application; it is composed of two die layers, fabricated in 65 nm technology using TSV middle aspect ratio 1:8, and integrates ESD protection, a 3D design-for-test, and a fault tolerant scheme. The 3D links achieve 0.66 pJ/b energy consumption and 326 Mb/s data rate per pin for the parallel link. Thin die effect is demonstrated by thermal analysis and measurements, as well as the dynamic self-adaptation of the 3D link performances with 3D thermal conditions. Finally, the scalability of the 3DNOC circuit, in terms of power delivery network and thermal dissipation, is demonstrated by using simulations up to a 3D stack of eight die layers. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
25. An Extended Shared Logarithmic Unit for Nonlinear Function Kernel Acceleration in a 65-nm CMOS Multicore Cluster.
- Author
-
Gautschi, Michael, Schaffner, Michael, Gurkaynak, Frank K., and Benini, Luca
- Subjects
- NONLINEAR functional analysis, MULTICORE processors, LOGARITHMIC amplifiers
- Abstract
Energy-efficient computing and ultralow-power computing are strong requirements for various application areas, such as internet of things and wearables. While for some applications integer and fixed-point arithmetic suffice, others require a larger dynamic range, typically obtained using floating-point (FP) numbers. Logarithmic number systems (LNSs) have been proposed as energy-efficient alternative, since several complex FP operations translate into simple integer operations. However, additions and subtractions become nonlinear operations, which have to be approximated via interpolation. Even efficient LNS units (LNUs) are still larger than standard FP units (FPUs), rendering them impractical for most general-purpose processors. We show that, when shared among several cores, LNUs become a very attractive solution. A series of compact LNUs is developed, which provide significantly more functionality (such as transcendental functions) than other state-of-the-art designs. This allows, for example, to evaluate the atan2 function with three instructions for only 183.2 pJ/op at 0.8 V. We present the first shared-LNU architecture where these LNUs have been integrated into a multicore system with four 32-b-OpenRISC cores and show measurement results demonstrating that the shared-LNU design can be up to 4.1 $\times $ more energy-efficient in common nonlinear processing kernels, compared with a similar area design with four private FPUs. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
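Background for entry 25: in a logarithmic number system, operands are stored as $a=\log_2|A|$ and $b=\log_2|B|$, so multiplication, division, and square roots reduce to fixed-point additions, subtractions, and shifts, while addition and subtraction require the nonlinear functions that the shared LNUs approximate by interpolation. For positive operands,

$$\log_2(AB)=a+b,\qquad \log_2(A/B)=a-b,\qquad \log_2(A+B)=a+\log_2\!\left(1+2^{\,b-a}\right),$$

and subtraction uses $\log_2\!\left(1-2^{\,b-a}\right)$, which is the harder function to evaluate accurately when $b\approx a$.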
26. Extending Amdahl’s Law for Multicores with Turbo Boost.
- Author
-
Verner, Uri, Mendelson, Avi, and Schuster, Assaf
- Abstract
Rewriting sequential programs to make use of multiple cores requires considerable effort. For many years, Amdahl’s law has served as a guideline to assess the performance benefits of parallel programs over sequential ones, but recent advances in multicore design introduced variability in the performance of the cores and motivated the reexamination of the underlying model. This paper extends Amdahl’s law for multicore processors with built-in dynamic frequency scaling mechanisms such as Intel’s Turbo Boost. Using a model that captures performance dependencies between cores, we present tighter upper bounds for the speedup and reduction in energy consumption of a parallel program over a sequential one on a given multicore processor and validate them on Haswell and Sandy Bridge Intel CPUs. Previous studies have shown that from a processor design perspective, Turbo Boost mitigates the speedup limitations obtained under Amdahl’s law by providing higher performance for the same energy budget. However, our new model and evaluation show that from a software development perspective, Turbo Boost aggravates these limitations by making parallelization of sequential codes less profitable. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
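For reference alongside entry 26, the classical Amdahl bound for a program with parallel fraction $f$ on $n$ cores is

$$S(n)=\frac{1}{(1-f)+f/n}.$$

One simple way to see why per-core frequency boosting tightens this bound from a software perspective (an illustration only, not the paper's exact model): if a single active core can run at a boosted frequency ratio $s_1$ while all $n$ active cores run at a lower ratio $s_n \le s_1$, then, relative to the boosted sequential run,

$$S(n)=\frac{1}{(1-f)+\dfrac{f}{n}\cdot\dfrac{s_1}{s_n}},$$

which is never larger than the classical bound, consistent with the abstract's observation that Turbo Boost makes parallelizing sequential code less profitable.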
27. Analysis of Parallel Computing Strategies to Accelerate Ultrasound Imaging Processes.
- Author
-
Romero-Laorden, D., Villazon-Terrazas, J., Martinez-Graullera, O., Ibanez, A., Parrilla, M., and Penas, M. Santos
- Subjects
- ULTRASONIC imaging, PARALLEL processing, MULTICORE processors, CENTRAL processing units, GRAPHICS processing units
- Abstract
This work analyses the use of parallel processing techniques in synthetic aperture ultrasonic imaging applications. In particular, the Total Focussing Method, which is an $O(N^2 \times P)$ problem, is studied. This work presents different parallelization strategies for multicore CPU and GPU architectures. The parallelization processes on both platforms are discussed and optimized in order to achieve real-time performance. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
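A generic delay-and-sum sketch of the Total Focussing Method analysed in entry 27, to make the $O(N^2 \times P)$ structure concrete ($N$ array elements, $P$ pixels). It is not the paper's parallel implementation; the array geometry and the synthetic data are invented:

```python
# Generic delay-and-sum sketch of the Total Focussing Method over full matrix
# capture data: every pixel accumulates one sample from each of the N*N
# transmit/receive pairs. NOT the paper's parallel implementation; the array
# geometry and the synthetic single-scatterer dataset are invented.
import numpy as np

c, fs = 1540.0, 50e6                    # sound speed (m/s) and sampling rate (Hz)
n_elem, n_samples = 32, 2000
pitch = 0.3e-3
elem_x = (np.arange(n_elem) - (n_elem - 1) / 2) * pitch   # element x positions, z = 0

# Synthetic FMC data: a single point scatterer at (0, 15 mm).
sx, sz = 0.0, 15e-3
fmc = np.zeros((n_elem, n_elem, n_samples), dtype=np.float32)
for tx in range(n_elem):
    d_tx = np.hypot(elem_x[tx] - sx, sz)
    for rx in range(n_elem):
        d_rx = np.hypot(elem_x[rx] - sx, sz)
        fmc[tx, rx, int(round((d_tx + d_rx) / c * fs))] = 1.0

# TFM image: the O(N^2 x P) loop over transmit/receive pairs and pixels.
xs = np.linspace(-5e-3, 5e-3, 101)
zs = np.linspace(10e-3, 20e-3, 101)
gx, gz = np.meshgrid(xs, zs)
dist = np.hypot(elem_x[:, None, None] - gx, gz)    # element-to-pixel distances
image = np.zeros_like(gx)
for tx in range(n_elem):
    for rx in range(n_elem):
        idx = np.rint((dist[tx] + dist[rx]) / c * fs).astype(int)
        image += fmc[tx, rx, np.clip(idx, 0, n_samples - 1)]

peak = np.unravel_index(np.argmax(image), image.shape)
print("brightest pixel near x = %.1f mm, z = %.1f mm"
      % (xs[peak[1]] * 1e3, zs[peak[0]] * 1e3))
```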
28. Design of multicore HEVC decoders using actor-based dataflow models and OpenMP.
- Author
-
Chavarrías, M., Pescador, F., Garrido, M.J., Sanchez, A., and Sanz, C.
- Subjects
- HOUSEHOLD electronics, MATHEMATICAL programming, MULTICORE processors, VIDEO coding, CENTRAL processing units
- Abstract
New multimedia portable devices support increasing spatial and temporal video resolutions as well as new and more efficient video codecs. This scenario leads to the use of multicore chips in order to comply with the higher computational loads. In a context dominated by time-to-market pressure, the programming of multicore chips is a challenge. In this work, a method based on a combination of dataflow models and OpenMP is proposed for quick design of HEVC decoders using multicore chips. The proposed method has been automated by its integration into an open source framework. Tests have been carried out by implementing the same HEVC decoder targeted on three different multicore chips. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
29. Robinhood: Towards Efficient Work-Stealing in Virtualized Environments.
- Author
-
Peng, Yaqiong, Wu, Song, and Jin, Hai
- Subjects
- PARALLEL programming, MOTHERBOARDS, VIRTUAL machine systems, MULTICORE processors, COMPUTER software
- Abstract
Work-stealing, as a common user-level task scheduler for managing and scheduling tasks of multithreaded applications, suffers from inefficiency in virtualized environments, because the steal attempts of thief threads may waste CPU cycles that could be otherwise used by busy threads. This paper contributes a novel scheduling framework named Robinhood. The basic idea of Robinhood is to use the time slices of thieves to accelerate busy threads with no available tasks (referred to as poor workers) at both the guest Operating System (OS) level and Virtual Machine Monitor (VMM) level. In this way, Robinhood can reduce the cost of steal attempts and accelerate the threads doing useful work, so as to put the CPU cycles to better use. We implement Robinhood based on BWS, Linux and Xen. Our evaluation with various benchmarks demonstrates that Robinhood paves a way to efficiently run work-stealing applications in virtualized environments. Compared to Cilk++ and BWS, Robinhood can reduce up to 90 and 72 percent execution time of work-stealing applications, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
30. Parallelized Network-on-Chip-Reused Test Access Mechanism for Multiple Identical Cores.
- Author
-
Han, Taewoo, Choi, Inhyuk, Oh, Hyunggoy, and Kang, Sungho
- Subjects
- NETWORKS on a chip, EMBEDDED computer systems, SYSTEMS on a chip, ELECTRONIC systems, COMPUTER engineering
- Abstract
This paper proposes a new network-on-chip (NoC)-reused test access mechanism (TAM) for testing multiple identical cores. It can test multiple cores concurrently and identify faulty cores to derate the chip by excluding the core. In order to minimize the test time, the TAM utilizes the majority value of test response data. All of the cores can thereby be tested in parallel and test costs (in both test pins and test time) are exactly the same as those for a single core. The hardware overhead is minimized by reusing the NoC infrastructures and transfer-counters are designed as a majority analyzer. The experimental results in this paper show that the proposed TAM can test multiple cores in the same time as a single core and with negligible hardware overhead. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
31. Threads and Data Mapping: Affinity Analysis for Traffic Reduction.
- Author
-
Hu, Qi, Liu, Peng, and Huang, Michael C.
- Abstract
Modern processors spend significant amount of time and energy moving data. With the increase in core count, the relative importance of such latency and energy expenditure will only increase with time. Inter-core communication traffic when executing a multithreaded application is one such source of latency and energy expenditure. This traffic is influenced by the mapping of threads and data onto multicore systems. This paper investigates the impact of threads and data mapping on traffic in a chip-multiprocessor, and exploits the potential for traffic reduction through threads and data mapping. Based on the analysis and estimation of the lowest traffic, we propose a threads and data mapping mechanism to approach the lowest traffic. The mapping takes both the correlation among threads and the affinity of data with individual threads into account, and results in significant traffic reduction and energy savings. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
32. The Effect of Temperature on Amdahl Law in 3D Multicore Era.
- Author
-
Yavits, L., Morad, A., and Ginosar, R.
- Subjects
- MULTICORE processors, TEMPERATURE measurements, MAXIMUM power transfer theorem, PARALLEL processing, THREE-dimensional display systems, RANDOM access memory, COMPUTER architecture
- Abstract
This work studies the influence of temperature on performance and scalability of 3D Chip Multiprocessors (CMP) from Amdahl's law perspective. We find that 3D CMP may reach its thermal limit before reaching its maximum power. We show that a high level of parallelism may lead to high peak temperatures even in small scale 3D CMPs, thus limiting 3D CMP scalability and calling for different, in-memory computing architectures. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
33. Unlocking Ordered Parallelism with the Swarm Architecture.
- Author
-
Jeffrey, Mark C., Subramanian, Suvinay, Yan, Cong, Emer, Joel, and Sanchez, Daniel
- Subjects
- PARALLEL processing, ALGORITHMS, COMPUTER software, COMPUTER programming, ELECTRONIC data processing
- Abstract
The authors present Swarm, a parallel architecture that exploits ordered parallelism, which is abundant but hard to mine with current software and hardware techniques. Swarm programs consist of short tasks, as small as tens of instructions each, with programmer-specified order constraints. Swarm executes tasks speculatively and out of order and efficiently speculates thousands of tasks ahead of the earliest active task to uncover enough parallelism. Several techniques allow Swarm to scale to large core counts and speculation windows. The authors evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm outperforms sequential implementations of these algorithms by 43 to 117 times and state-of-the-art software-only parallel algorithms by 3 to 18 times. Besides achieving near-linear scalability, Swarm programs are almost as simple as their sequential counterparts, because they do not use explicit synchronization. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
34. A Methodology for Cognitive NoC Design.
- Author
-
Wu, Wo-Tak and Louri, Ahmed
- Abstract
The number of cores in a multicore chip design has been increasing in the past two decades. The rate of increase will continue for the foreseeable future. With a large number of cores, the on-chip communication has become a very important design consideration. The increasing number of cores will push the communication complexity level to a point where managing such highly complex systems requires much more than what designers can anticipate for. We propose a new design methodology for implementing a cognitive network-on-chip that has the ability to recognize changes in the environment and to learn new ways to adapt to the changes. This learning capability provides a way for the network to manage itself. Individual network nodes work autonomously to achieve global system goals, e.g., low network latency, higher reliability, power efficiency, adaptability, etc. We use fault-tolerant routing as a case study. Simulation results show that the cognitive design has the potential to outperform the conventional design for large applications. With the great inherent flexibility to adopt different algorithms, the cognitive design can be applied to many applications. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
35. Cost-Effective Design of Mesh-of-Tree Interconnect for Multicore Clusters With 3-D Stacked L2 Scratchpad Memory.
- Author
-
Kang, Kyungsu, Benini, Luca, and De Micheli, Giovanni
- Subjects
- COST effectiveness, MANUFACTURING processes, FABRICATION (Manufacturing), MULTICORE processors, THREE-dimensional integrated circuits
- Abstract
3-D integrated circuits (3-D ICs) offer a promising solution to overcome the scaling limitations of 2-D ICs. However, using too many through-silicon-vias (TSVs) poses a negative impact on 3-D ICs due to the large overhead of TSV (e.g., large footprint and low yield). In this paper, we propose a new TSV sharing method for a circuit-switched 3-D mesh-of-tree (MoT) interconnect, which supports high-throughput and low-latency communication between processing cores and 3-D stacked multibanked L2 scratchpad memory. The proposed method supports traffic balancing and TSV-failure tolerant routing. The proposed method advocates a modular design strategy to allow stacking multiple identical memory dies without the need for different masks for dies at different levels in the memory stack. We also investigate various parameters of 3-D memory stacking (e.g., fabrication technology, TSV bonding technique, number of memory tiers, and TSV sharing scheme) that affect interconnect latency, system performance, and fabrication cost. Compared to conventional MoT interconnect [6] that is straightforwardly adapted to 3-D integration, the proposed method yields up to $\times 2.11$ and $\times 1.11$ improvements in terms of cost efficiency (i.e., performance/cost) for microbump TSV bonding and direct Cu–Cu TSV bonding techniques, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
36. Majority-Based Test Access Mechanism for Parallel Testing of Multiple Identical Cores.
- Author
-
Han, Taewoo, Choi, Inhyuk, and Kang, Sungho
- Subjects
- MULTICORE processors, COMPARATOR circuits, ON-chip transformers, VERY large scale circuit integration, MICROPROCESSOR testing
- Abstract
The increased use of multicore chips diminishes per-core complexity and also demands parallel design and test technologies. An especially important evolution of the multicore chip has been the use of multiple identical cores, providing a homogenous system with various merits. This paper introduces a novel test access mechanism (TAM) for parallel testing of multiple identical cores and identifying faulty cores to derate the chip by excluding it. Instead of typical test response data from the cores, the test output data used in this paper are the majority values, that is, the typical test responses from the cores. All the cores can thereby be tested in parallel and test costs (in both test pins and test time) are exactly the same as for a single core. The proposed TAM can be implemented with on-chip comparators and majority analyzers. The experimental results in this paper show that the proposed TAM can test multiple cores with minimal test pins and test time and with hardware overhead of $<0.1$ %. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
37. XCollOpts: A Novel Improvement of Network Virtualizations in Xen for I/O-Latency Sensitive Applications on Multicores.
- Author
-
Zeng, Lingfang, Wang, Yang, Feng, Dan, and Kent, Kenneth B.
- Abstract
It has long been recognized that the Credit scheduler selectively favors CPU-bound applications whereas for I/O-latency sensitive workloads, such as those related to stream-based audio/video services, it only exhibits tolerable, or even worse, unacceptable performance. The reasons behind this phenomenon are the poor understanding (to some degree) of the virtual machine scheduling as well as the network I/O virtualizations. In order to address these problems and make the system more responsive to the I/O-latency sensitive applications, in this paper, we present XCollOpts which performs a collection of novel optimizations to improve the Credit scheduler and the underlying I/O virtualizations in multicore environments, each from two perspectives. To optimize the schedule, in XCollOpts, we first pinpoint the Imbalanced Multi-Boosting problem among the cores, thereby minimizing the system response time by load balancing the BOOST VCPUs. Then, we describe the Premature Preemption problem and address it by monitoring the received network packets in the driver domain and deliberately preventing it from being prematurely preempted during the packet delivery. However, these optimizations on the scheduling strategies cannot be fully exploited if the performance issues of the underlying supportive communication mechanisms are not considered. To this end, we make two further optimizations for the network I/O virtualizations, namely, Multi-Tasklet Pairs and Optimized Small Data Packet. Our empirical studies show that with XCollOpts, we can significantly improve the performance of the latency-sensitive applications at a cost of relatively small system overhead. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
38. A multicore DSP HEVC decoder using an actorbased dataflow model and OpenMP.
- Author
-
Chavarrías, M., Pescador, F., Garrido, M. J., Juárez, E., and Sanz, C.
- Subjects
- VIDEO coding, VIDEO codecs, DATA flow computing, MULTIMEDIA systems, MULTICORE processors
- Abstract
Video coding is one of the most demanding applications, in terms of computational cost, for portable multimedia terminals. In the last years, the new video coding standards, like High Efficiency Video Coding (HEVC), and the increasing resolutions of video codecs have overtaken the capacities of the single core processors in embedded systems. In consequence, multicore architectures are used in current multimedia systems. Besides, new methodologies and frameworks are arising to speed-up the design cycle. In this paper, a methodology based on the Reconfigurable Video Coding CAL Actor Language (RVC-CAL) and the OpenMP API has been used to implement an HEVC decoder based on a multicore DSP. A RVC-CAL description of the HEVC decoder has been used as starting point. The Open RVC-CAL compiler framework (Orcc) has been used to generate C-code from the RVC-CAL specification. This code and the OpenMP library have been ported to the multicore DSP environment. Decoders running on 1, 2, 3 and 4 cores have been tested. Also, the multi DSP based HEVC decoder has been compared with other implementations based on multicore GPPs. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
39. Always-on Vision Processing Unit for Mobile Applications.
- Author
-
Barry, Brendan, Brick, Cormac, Connor, Fergal, Donohoe, David, Moloney, David, Richmond, Richard, O'Riordan, Martin, and Toma, Vasile
- Subjects
- MULTICORE processors, CENTRAL processing units, EMBEDDED computer systems, COMPUTER architecture, COMPUTER vision
- Abstract
Myriad 2 is a multicore, always-on system on chip that supports computational imaging and visual awareness for mobile, wearable, and embedded applications. The vision processing unit incorporates parallelism, instruction set architecture, and microarchitectural features to provide highly sustainable performance efficiency across a range of computational imaging and computer vision applications, including those with low latency requirements on the order of milliseconds. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
40. Task Migrations for Distributed Thermal Management Considering Transient Effects.
- Author
-
Liu, Zao, Tan, Sheldon X.-D., Huang, Xin, and Wang, Hai
- Subjects
- THERMAL management (Electronic packaging), TASK analysis, TRANSIENT analysis, HEAT sinks (Electronics), TEMPERATURE distribution, MULTICORE processors, SYSTEMS on a chip, ENERGY consumption
- Abstract
In this brief, a new distributed thermal management scheme using task migrations based on a new temperature metric called effective initial temperature is proposed to reduce the on-chip temperature variance and the occurrence of hot spots for many-core microprocessors. The new temperature metric derived from frequency domain moment matching technique incorporates both initial temperature and other transient effects to make optimized task migration decisions, which leads to more effective reduction of hot spots in the experiments on a 100-core microprocessor than the existing distributed thermal management methods. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
41. List Scheduling Strategies for Task Graphs with Data Parallelism.
- Author
-
Liu, Yang, Taniguchi, Ittetsu, Tomiyama, Hiroyuki, and Meng, Lin
- Abstract
This paper studies task scheduling algorithms which schedule a set of tasks on multiple cores so that the total scheduling length is minimized. Most of the algorithms developed in the past assume that a task is executed on a single core. Unlike the previous algorithms, the algorithms studied in this paper allow a task to be executed on multiple cores. This paper proposes six algorithms. All of the six algorithms are based on list scheduling, but the strategy for priority assignment is different. In our experiments, the six algorithms as well as an integer linear programming method are evaluated. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
- Full Text
- View/download PDF
42. Dynamically reducing overestimated design margin of MultiCores.
- Author
-
Sato, Toshinori, Hayashida, Takanori, and Yano, Ken
- Abstract
MultiCore processor is one of the promising techniques to satisfy computing demands of the future consumer devices. However, MultiCore processor is still threatened by increasing energy consumption due to PVT (Process-Voltage-Temperature) variations. They require large design margins in the supply voltage, resulting in large energy consumption. The combination of DVS (Dynamic voltage scaling) technique and Canary FF (flip-flop), named Canary-DVS, has been proposed to eliminate the overestimated voltage margin but has only been evaluated under the assumption of typical delay. This paper considers C2C (Core-to-Core) variations and evaluates how Canary-DVS eliminates the energy waste under the practical assumption of delay variations. We adopt Canary-DVS to a commercial processor, Toshiba's quad-core Media embedded Processor (MeP). From Monte Carlo simulations, it is found that energy is reduced by 18.6% on average and there are not any noticeable discrepancies from the typical situations, when 0.064 of σ/μ value is assumed in gate delay. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
43. High performance computing for Optical Diffraction Tomography.
- Author
-
Ortega, G., Lobera, J., Arroyo, M. P., Garcia, I., and Garzon, E. M.
- Abstract
This paper analyses several parallel approaches for the development of a physical model of Non-linear ODT for its application in velocimetry techniques. The main benefits of its application in HPIV are the high accuracy with non-damaging radiation and its imaging capability to recover information from the vessel wall of the flow. Thus ODT-HPIV is suitable for microfluidic devices and biofluidic applications. Our physical model is based on an iterative method which uses double-precision complex numbers, therefore it has a high computational cost. As a result, High Performance Computing is necessary for both: implementation and validation of the model. Concretely, the model has been parallelized by means of different architectures: shared-memory multiprocessors and graphics processing units (GPU) using the CUDA device. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
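The abstract does not give the forward model, so the following is only a generic OpenMP sketch of the kind of shared-memory parallelization it describes: a double-precision complex field updated iteratively in parallel, with a placeholder stencil standing in for the real scattering operator. The array size and iteration count are arbitrary.

```c
#include <complex.h>
#include <stdio.h>

#define N 1024
#define ITERS 100

int main(void)
{
    static double complex field[N], next[N];
    for (int i = 0; i < N; i++) field[i] = 1.0 + 0.0 * I;

    for (int iter = 0; iter < ITERS; iter++) {
        #pragma omp parallel for          /* shared-memory parallel step */
        for (int i = 1; i < N - 1; i++)
            /* placeholder stencil, not the actual ODT operator */
            next[i] = 0.5 * field[i] + 0.25 * (field[i - 1] + field[i + 1]);
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            field[i] = next[i];
    }
    printf("|field[N/2]| = %f\n", cabs(field[N / 2]));
    return 0;
}
```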
44. On the Suitability of Multi-Core Processing for Embedded Automotive Systems.
- Author
-
Jena, Santosh Kumar and Srinivas, M.B.
- Abstract
This paper examines the suitability of multi-core processors over single-core processors in automotive safety-critical applications. As vehicles become more complex, with an embedded network of interconnected ECUs (electronic control units) and ever more features, safety standardization is becoming increasingly important among automakers and OEMs (original equipment manufacturers), and the demand for computing power in the automotive domain keeps growing to meet the requirements of time-critical functionality. Multi-core hardware, with suitable software support, is seen as a way to increase ECU processing power without the power-consumption penalty of higher clock frequencies. In this work, ABS (anti-lock braking system) is taken as an example to demonstrate the suitability of multicore processors: it is shown how, through the scheduling of events in the hard-braking scenario, a multicore processor can help achieve near real-time response. The performance of ABS is studied on the TMS570, a dual-core controller from Texas Instruments (TI), and compared with the TMS470, a single-core controller from the same company. A software architecture using MPI (Message Passing Interface) with shared memory is described in detail and used to quantify the performance. In addition to performance, the power consumption of the TMS570 operating at 150 MHz and the TMS470 operating at 80 MHz is compared at an ambient temperature of 25 °C. [ABSTRACT FROM PUBLISHER] A hypothetical two-core MPI sketch follows this record.
- Published
- 2012
- Full Text
- View/download PDF
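The paper's MPI-with-shared-memory architecture is not detailed in this abstract; as a hypothetical illustration of splitting an ABS-style workload across two cores with MPI, the two-rank sketch below has rank 0 produce wheel-speed samples and rank 1 return a brake-pressure command. The sensor values, threshold, and control law are invented for illustration only.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 5; step++) {
        if (rank == 0) {
            double wheel_speed = 30.0 - 5.0 * step;   /* fake sensor data */
            double pressure;
            MPI_Send(&wheel_speed, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&pressure, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("step %d: pressure command %.1f\n", step, pressure);
        } else if (rank == 1) {
            double wheel_speed, pressure;
            MPI_Recv(&wheel_speed, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* toy control law: release pressure when the wheel slows down */
            pressure = wheel_speed < 15.0 ? 20.0 : 80.0;
            MPI_Send(&pressure, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```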
45. Exploring Speculative Procedure Level Parallelism from Multimedia Applications.
- Author
-
Wang, Yaobin, An, Hong, Liu, Zhiqin, Song, Hui, and Peng, Yong
- Abstract
Multimedia applications are fast becoming one of the dominant workloads for modern computer systems, and how to use multicore computing resources to accelerate them is a common concern. However, the potential speculative procedure-level parallelism in multimedia applications has not yet been explored thoroughly. This paper proposes a procedure-level speculation architecture for accelerating multimedia applications, including its speculative mechanism and analysis method. It also takes several applications from MediaBench and analyses their parallelism coverage, thread size, inter-thread data dependence features and potential speedup. The experimental results show that (1) the best case, adpcm, can achieve a 12.2x speedup with procedure-level speculation, and (2) limited parallelism coverage and severe inter-thread data dependence violations badly limit speculative procedure-level parallelism in some multimedia applications. [ABSTRACT FROM PUBLISHER] A coverage-based speedup illustration follows this record.
- Published
- 2012
- Full Text
- View/download PDF
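A back-of-the-envelope calculation helps explain why limited parallelism coverage caps speculative speedup: with coverage c and an assumed core count, an Amdahl-style bound already shows that double-digit speedups such as adpcm's 12.2x require coverage well above 90%. The core count below is an assumption for illustration, not a figure from the paper, and the bound ignores dependence violations entirely.

```c
/* speedup bound = 1 / ((1 - coverage) + coverage / cores) */
#include <stdio.h>

int main(void)
{
    const double cores = 16.0;                 /* assumed core count */
    const double coverage[] = {0.50, 0.80, 0.95, 0.99};

    for (int i = 0; i < 4; i++) {
        double c = coverage[i];
        double speedup = 1.0 / ((1.0 - c) + c / cores);
        printf("coverage %.0f%% -> speedup bound %.1fx\n", 100.0 * c, speedup);
    }
    return 0;
}
```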
46. An Analysis of Multicore Specific Optimization in MPI Implementations.
- Author
-
Cheng, Pengqi and Gu, Yan
- Abstract
We first introduce the multicore-specific optimization modules of two common MPI implementations, MPICH2 and Open MPI, and then test their performance on a multicore computer. By enabling and disabling these modules, we report their performance, including bandwidth and latency, under different circumstances. Finally, we analyse the two MPI implementations and discuss the choice of MPI implementation and possible improvements. [ABSTRACT FROM PUBLISHER] A minimal ping-pong benchmark follows this record.
- Published
- 2012
- Full Text
- View/download PDF
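The kind of measurement described can be reproduced with a minimal MPI ping-pong benchmark such as the sketch below, run with two ranks on one multicore machine while toggling the implementation's shared-memory transport through its runtime options. The message size and repetition count are arbitrary choices.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 1000, size = 1 << 20;       /* 1 MiB messages */
    char *buf = calloc(size, 1);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0) {
        printf("round-trip time: %.2f us\n", 1e6 * dt / reps);
        printf("bandwidth: %.2f MB/s\n", 2.0 * reps * size / dt / 1e6);
    }
    MPI_Finalize();
    free(buf);
    return 0;
}
```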
47. On the Portability and Performance of Message-Passing Programs on Embedded Multicore Platforms.
- Author
-
Hung, Shih-Hao, Chiu, Po-Hsun, Tu, Chia-Heng, Chou, Wei-Ting, and Yang, Wen-Long
- Abstract
Recently, embedded multicore platforms have become popular, but software development for such platforms is very challenging. While message passing is a popular programming model for parallel applications, it is not adequately supported on current embedded multicore platforms. Much as in the 1980s and 1990s before the advent of MPI, applications are hardly portable across these parallel systems. Unfortunately, MPI is too big for most of today's embedded platforms. Moreover, message-passing functions need to exploit architectural features to offer optimized performance, but such platform-specific optimizations often hurt portability. This paper addresses the portability and performance issues by designing a new message-passing library with a three-layer modular design: the top two layers are mostly platform-independent, and the bottom layer enables platform-specific optimizations. We discuss the performance issues and evaluate them with experimental results. [ABSTRACT FROM PUBLISHER] A structural sketch of such a layered design follows this record.
- Published
- 2012
- Full Text
- View/download PDF
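The library's actual API is not given in this abstract; the sketch below only illustrates the general shape of a three-layer design, with a portable top-level call, a platform-independent protocol layer, and a bottom transport layer swapped per platform through a function-pointer table. All names are invented.

```c
#include <stdio.h>

/* --- layer 3: platform-specific transport, replaced per platform ----- */
typedef struct {
    int (*put)(int core, const void *buf, int len);   /* raw data move   */
    int (*notify)(int core);                          /* wake the target */
} transport_ops;

static int shm_put(int core, const void *buf, int len)
{
    (void)buf;                    /* a real transport would copy the data */
    printf("  [shm] copy %d bytes to core %d\n", len, core);
    return 0;
}

static int shm_notify(int core)
{
    printf("  [shm] signal core %d\n", core);
    return 0;
}

static const transport_ops shm_transport = { shm_put, shm_notify };

/* --- layer 2: platform-independent protocol (fragmenting, ordering) -- */
static int proto_send(const transport_ops *t, int core, const void *buf,
                      int len)
{
    t->put(core, buf, len);      /* a real layer would fragment and queue */
    return t->notify(core);
}

/* --- layer 1: the portable API seen by applications ------------------ */
static const transport_ops *active = &shm_transport;

int msg_send(int core, const void *buf, int len)
{
    return proto_send(active, core, buf, len);
}

int main(void)
{
    const char payload[] = "hello";
    return msg_send(1, payload, (int)sizeof payload);
}
```

Porting to a new platform would then mean supplying another transport_ops table while the upper two layers stay unchanged, which is the portability argument the abstract makes.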
48. Parallelizing the Hamiltonian Computation in DQMC Simulations: Checkerboard Method for Sparse Matrix Exponentials on Multicore and GPU.
- Author
-
Lee, Che-Rung, Chen, Zhi-Hung, and Kao, Quey-Liang
- Abstract
Determinant Quantum Monte Carlo (DQMC) simulation is one of the few numerical methods that can explore the microscopic properties of fermions, which has many technically important applications in chemistry and materials science. Conventionally, its parallelization relies on parallel Monte Carlo, whose speedup is limited by the thermalization process and the underlying matrix computation. To achieve better performance, fine-grained parallelization of its numerical kernels is essential to utilize massively parallel processing units: multicores and/or GPUs interconnected by a high-performance network. In this paper, we address the parallelization of one matrix kernel in DQMC simulations: multiplication by matrix exponentials. The matrix is derived from the kinetic Hamiltonian, which is highly sparse. We approximate its exponential by the checkerboard method, which decomposes the matrix exponential into a product of block sparse matrices. We analyze the block sparse matrices of two commonly used lattice geometries, the 2D torus and the 3D cubic lattice, and parallelize the kernel that multiplies them by a general matrix. The parallel algorithm is designed for multicore CPUs and GPUs. Experiments show a 3x speedup on average on a quad-core processor and up to a 145x speedup on a GPU. [ABSTRACT FROM PUBLISHER] A small checkerboard sketch follows this record.
- Published
- 2012
- Full Text
- View/download PDF
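To make the checkerboard idea concrete, the sketch below applies it to a 1D ring rather than the paper's 2D torus or 3D cubic lattices: exp(t*K) is approximated by the product of two bond-group exponentials, and each group acts on a dense matrix as independent 2x2 blocks [cosh t, sinh t; sinh t, cosh t] that mix two rows at a time. The inner loop over columns is the fine-grained work that would be spread over cores or GPU threads; sizes here are toy values.

```c
#include <math.h>
#include <stdio.h>

#define N 8                       /* lattice sites on a ring (even) */
#define M 4                       /* columns of the dense matrix    */

/* apply exp(t * bond(i,j)) to rows i and j of the dense matrix */
static void apply_bond(double a[N][M], int i, int j, double t)
{
    double c = cosh(t), s = sinh(t);
    for (int k = 0; k < M; k++) {
        double ri = a[i][k], rj = a[j][k];
        a[i][k] = c * ri + s * rj;
        a[j][k] = s * ri + c * rj;
    }
}

/* A <- exp_cb(t*K) * A, with K the ring hopping matrix, split into two
 * groups of non-overlapping bonds (the checkerboard decomposition)      */
static void checkerboard_apply(double a[N][M], double t)
{
    for (int i = 0; i < N; i += 2)        /* group 1: bonds (0,1),(2,3)... */
        apply_bond(a, i, (i + 1) % N, t);
    for (int i = 1; i < N; i += 2)        /* group 2: bonds (1,2),(3,4)... */
        apply_bond(a, i, (i + 1) % N, t);
}

int main(void)
{
    double a[N][M];
    for (int i = 0; i < N; i++)
        for (int k = 0; k < M; k++)
            a[i][k] = (i == k);           /* start from part of the identity */
    checkerboard_apply(a, 0.1);
    printf("a[0][0] = %f, a[1][0] = %f\n", a[0][0], a[1][0]);
    return 0;
}
```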
49. Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems.
- Author
-
Dongarra, Jack, Faverge, Mathieu, Herault, Thomas, Langou, Julien, and Robert, Yves
- Abstract
This paper describes a new QR factorization algorithm especially designed for massively parallel platforms combining parallel distributed multi-core nodes. Such platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls into the category of tile algorithms, which naturally enable good data locality for the sequential kernels executed by the cores (high sequential performance), a low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of multicores, to minimize the number of inter-processor communications (a "communication-avoiding" algorithm), it is natural to consider two-level hierarchical trees composed of an "inter-node" tree that acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) a "TS level" for cache-friendliness, (1) a "low level" for decoupled, highly parallel inter-node reductions, and (2) a "coupling level" to efficiently resolve interactions between local and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-cluster and intra-cluster. Numerical experiments on a cluster of multicore nodes (1) confirm that each of the four levels of our hierarchical tree contributes to building up performance and (2) give insight into how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms. [ABSTRACT FROM PUBLISHER] A toy two-level reduction-tree illustration follows this record.
- Published
- 2012
- Full Text
- View/download PDF
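As a toy illustration of the hierarchical reduction-tree idea (not the paper's algorithm or its actual kernels), the sketch below only prints an elimination ordering for one panel: each node first folds its local tile rows into a node-leader row with a flat intra-node tree, and the leader rows are then combined by a binomial inter-node tree. Node and tile counts are invented.

```c
#include <stdio.h>

#define NODES 4
#define TILES_PER_NODE 3

int main(void)
{
    /* intra-node flat trees: local rows reduced into each leader row */
    for (int n = 0; n < NODES; n++) {
        int leader = n * TILES_PER_NODE;
        for (int t = 1; t < TILES_PER_NODE; t++)
            printf("intra-node %d: eliminate row %d into row %d\n",
                   n, leader + t, leader);
    }
    /* inter-node binomial tree over the leader rows */
    for (int dist = 1; dist < NODES; dist *= 2)
        for (int n = 0; n + dist < NODES; n += 2 * dist)
            printf("inter-node: eliminate leader of node %d into node %d\n",
                   n + dist, n);
    return 0;
}
```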
50. HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters.
- Author
-
Ma, Teng, Bosilca, George, Bouteiller, Aurelien, and Dongarra, Jack
- Abstract
Multicore clusters, which have become the most prominent form of high-performance computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issues exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the core distance, as well as by using single-copy kernel-assisted mechanisms. However, in distributed environments a single-level approach cannot encompass the extreme variations not only in bandwidth and latency, but also in the ability to support duplex communications or operate multiple concurrent copies simultaneously. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism by consolidating the intra- and inter-node communications. In this work, we present HierKNEM, a kernel-assisted, topology-aware collective framework, and show how it orchestrates the collaboration between multiple layers of collective algorithms. The resulting scheme enables perfect overlap of intra- and inter-node communications. We demonstrate experimentally, using three of the most commonly used collective operations (Broadcast, Allgather and Reduction), that (1) the approach is immune to changes of the underlying process-core binding, (2) it outperforms state-of-the-art MPI libraries (Open MPI, MPICH2 and MVAPICH2), with up to a 30x speedup for synthetic benchmarks and up to a 3x acceleration for a parallel graph application (ASP), and (3) it shows a linear speedup as the number of cores per node increases, a paramount requirement for scalability on future many-core hardware. [ABSTRACT FROM PUBLISHER] A plain-MPI two-level broadcast sketch follows this record.
- Published
- 2012
- Full Text
- View/download PDF
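A plain-MPI approximation of the intra-/inter-node split can be written with MPI_Comm_split_type, as in the sketch below: node leaders broadcast across nodes, then each leader broadcasts within its node. HierKNEM's kernel-assisted single-copy transfers and pipelined overlap are not reproduced here; this only shows the topology-aware layering.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, node_rank, data = 0;
    MPI_Comm node_comm, leader_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* ranks sharing a node form the intra-node layer */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* one leader (node_rank == 0) per node forms the inter-node layer */
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    if (world_rank == 0) data = 42;
    if (node_rank == 0)                           /* inter-node broadcast */
        MPI_Bcast(&data, 1, MPI_INT, 0, leader_comm);
    MPI_Bcast(&data, 1, MPI_INT, 0, node_comm);   /* intra-node broadcast */

    printf("rank %d got %d\n", world_rank, data);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```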