13 results for "Weiwen Jiang"
Search Results
2. Contour: A Process Variation Aware Wear-Leveling Mechanism for Inodes of Persistent Memory File Systems
- Author
-
Xinxin Wang, Chaoshu Yang, Qingfeng Zhuge, Weiwen Jiang, Xianzhang Chen, and Edwin H.-M. Sha
- Subjects
File system, Computer science, Linux kernel, inode, Parallel computing, Theoretical Computer Science, Process variation, Memory management, Computational Theory and Mathematics, Hardware and Architecture, Overhead (computing), Table (database), Software, Wear leveling
- Abstract
Existing persistent memory file systems exploit fast, byte-addressable persistent memory (PM) to boost storage performance, but they ignore the limited endurance of PM. The PM storing the inode section is particularly vulnerable, because inodes are updated most frequently, remain at a fixed location throughout their lifetime, and require immediate persistency. The huge endurance variation across memory domains caused by process variation makes things even worse. In this article, we propose Contour, a process-variation-aware wear-leveling mechanism for the inode section of persistent memory file systems. Contour first enables the movement of inodes by virtualizing them with a deflection table. Then, Contour adopts cross-domain and intra-domain migration algorithms to balance writes across and within the memory domains. We implement Contour in Linux kernel 4.4.30 on a real persistent memory file system, SIMFS, and evaluate it with standard benchmarks, including Filebench, MySQL, and FIO. Extensive experimental results show that Contour improves the wear ratio of pages by 417.8× and 4.5× over the original SIMFS and PCV, the state-of-the-art inode wear-leveling algorithm, respectively. Meanwhile, the average performance overhead and wear overhead of Contour are only 0.87 and 0.034 percent, respectively, in application-level workloads.
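The deflection-table idea in the abstract above can be sketched in a few lines: logical inode numbers are mapped to physical slots through an indirection table, so hot inodes can be moved to less-worn locations. Everything here (class name, migration policy) is an illustrative reconstruction, not Contour's actual kernel implementation:

```python
class DeflectionTable:
    """Illustrative inode virtualization: logical inode numbers map to
    physical slots, so frequently written (hot) inodes can be migrated."""

    def __init__(self, num_slots):
        self.logical_to_phys = list(range(num_slots))  # identity mapping at start
        self.write_counts = [0] * num_slots            # per-physical-slot wear

    def write_inode(self, ino):
        """A write lands on whatever physical slot the inode is deflected to."""
        self.write_counts[self.logical_to_phys[ino]] += 1

    def migrate(self):
        """Swap the inodes on the most- and least-worn slots (a stand-in for
        an intra-domain migration policy, not Contour's actual algorithm)."""
        slots = range(len(self.write_counts))
        hot = max(slots, key=self.write_counts.__getitem__)
        cold = min(slots, key=self.write_counts.__getitem__)
        if hot == cold:
            return
        hot_ino = self.logical_to_phys.index(hot)
        cold_ino = self.logical_to_phys.index(cold)
        self.logical_to_phys[hot_ino] = cold
        self.logical_to_phys[cold_ino] = hot
```

After a migration, subsequent writes to the hot inode land on the previously cold slot, which is what balances wear without the file system ever seeing the inode move.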
- Published
- 2021
3. Device-Circuit-Architecture Co-Exploration for Computing-in-Memory Neural Accelerators
- Author
-
Lei Yang, Yiyu Shi, Xiaobo Sharon Hu, Weiwen Jiang, Jingtong Hu, Zheyu Yan, and Qiuwen Lou
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning (cs.LG), Computer Science - Neural and Evolutionary Computing (cs.NE), Artificial neural network, Computer science, Cognitive neuroscience of visual object recognition, Variation (game tree), Device type, Theoretical Computer Science, Computational Theory and Mathematics, Computer architecture, Hardware and Architecture, Search algorithm, Circuit architecture, Architecture, Software, Efficient energy use
- Abstract
Co-exploration of neural architectures and hardware design is promising because it simultaneously optimizes network accuracy and hardware efficiency. However, state-of-the-art neural architecture search (NAS) algorithms for co-exploration target the conventional von Neumann computing architecture, whose performance is heavily limited by the well-known memory wall. In this paper, we are the first to bring the computing-in-memory architecture, which can easily transcend the memory wall, into interplay with neural architecture search, aiming to find the most efficient neural architectures with high network accuracy and maximized hardware efficiency. Such a novel combination creates opportunities to boost performance but also brings a host of challenges: the design space spans multiple layers, from device type and circuit topology to neural architecture, and performance may degrade in the presence of device variation. To address these challenges, we propose a cross-layer exploration framework, NACIM, which jointly explores the device, circuit, and architecture design spaces and takes device variation into consideration to find the most robust neural architectures. Experimental results demonstrate that NACIM finds robust neural networks with only 0.45% accuracy loss in the presence of device variation, compared with a 76.44% loss for state-of-the-art NAS without consideration of variation; in addition, NACIM achieves an energy efficiency of up to 16.3 TOPs/W, 3.17× higher than state-of-the-art NAS.
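The variation-aware evaluation the abstract describes can be illustrated with a toy robustness check: perturb a model's weights with magnitude-proportional Gaussian noise (a common device-variation model) and score the worst case over several samples. The function name and noise model are assumptions for illustration, not NACIM's implementation:

```python
import random

def worst_case_accuracy(weights, evaluate, sigma=0.05, trials=20, seed=0):
    """Perturb each weight with Gaussian noise proportional to its magnitude
    and report the worst score seen; a search framework would use such a
    metric to prefer architectures that degrade gracefully under variation."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        noisy = [w + rng.gauss(0.0, sigma * abs(w)) for w in weights]
        worst = min(worst, evaluate(noisy))
    return worst
```

A NAS loop would rank candidates by this worst-case (or average-case) score rather than by nominal accuracy alone, which is the essence of variation-aware co-exploration.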
- Published
- 2021
4. On the Design of Minimal-Cost Pipeline Systems Satisfying Hard/Soft Real-Time Constraints
- Author
-
Qingfeng Zhuge, Edwin H.-M. Sha, Lei Yang, Weiwen Jiang, Xianzhang Chen, and Hailiang Dong
- Subjects
Mathematical optimization, Computer science, Pipeline (computing), Probabilistic logic, Approximation algorithm, Computer Science Applications, Human-Computer Interaction, Pipeline transport, Computer Science (miscellaneous), Time complexity, Throughput (business), Random variable, Integer programming, Information Systems
- Abstract
Pipeline systems provide high throughput for applications by overlapping the executions of tasks. In heterogeneous architectures, two basic issues in the design of application-specific pipelines need to be studied: what type of functional unit should execute each task, and where buffers should be placed. Due to the increasing complexity of applications, pipeline designs face a number of problems. One of the most challenging is the uncertainty of execution times, which makes deterministic techniques inapplicable. In this paper, execution times are modeled as random variables. Given an application, our objective is to construct the optimal pipeline, such that the total cost of the resultant pipeline is minimized while the required timing constraints are satisfied with a given guaranteed probability. We first prove the NP-hardness of the problem. Then, we present Mixed Integer Linear Programming (MILP) formulations to obtain the optimal solution. Due to the high time complexity of MILP, we devise an efficient $(1+\varepsilon)$-approximation algorithm, where the value of $\varepsilon$ is less than 5 percent in practice. Experimental results show that our algorithms achieve significant cost reductions over existing techniques, reaching up to 31.93 percent on average.
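The probabilistic timing constraint described above can be made concrete for the simple case of stages in series: convolve the discrete execution-time distributions and check whether the deadline is met with the required probability. This is a toy model of the constraint itself, not the paper's MILP or approximation algorithm:

```python
def convolve(d1, d2):
    """Combine two independent discrete execution-time distributions
    (dicts: time -> probability) for tasks executed in sequence."""
    out = {}
    for t1, p1 in d1.items():
        for t2, p2 in d2.items():
            out[t1 + t2] = out.get(t1 + t2, 0.0) + p1 * p2
    return out

def meets_constraint(stage_dists, deadline, guarantee):
    """True if P(total time <= deadline) >= guarantee for stages in series."""
    total = {0: 1.0}
    for d in stage_dists:
        total = convolve(total, d)
    return sum(p for t, p in total.items() if t <= deadline) >= guarantee
```

An optimizer would evaluate such a check for each candidate assignment of functional-unit types and keep the cheapest one that still passes.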
- Published
- 2021
5. On the Design of Time-Constrained and Buffer-Optimal Self-Timed Pipelines
- Author
-
Edwin H.-M. Sha, Lei Yang, Weiwen Jiang, Jingtong Hu, Qingfeng Zhuge, and Xianzhang Chen
- Subjects
Marked graph, Matching (graph theory), Computer science, Pipeline (computing), Parallel computing, Computer Graphics and Computer-Aided Design, Synchronization, Reduction (complexity), Asynchronous communication, Electrical and Electronic Engineering, Field-programmable gate array, Integer programming, Software
- Abstract
Pipelining is a powerful technique to achieve high performance in computing systems. However, as computing platforms become large-scale and integrate with heterogeneous processing elements (PEs) (CPUs, GPUs, field-programmable gate arrays, etc.), it is difficult to employ a global clock to achieve synchronous pipelines. Therefore, self-timed (or asynchronous) pipelines are usually adopted. Nevertheless, due to their complex running behavior, the performance modeling and systematic optimizations for self-timed pipeline (STP) systems are more complicated than those for synchronous ones. This paper employs marked graph theory to model STPs and presents algorithms to detect performance bottlenecks. Based on the proposed model, we observe that the system performance can be improved by inserting buffers. Due to the limited memory resources on the PEs, it is critical to minimize the number of buffers for STPs while satisfying the required timing constraints. In this paper, we propose integer linear programming formulations to obtain the optimal solutions and devise efficient algorithms to obtain the near-optimal solutions. Experimental results show that the proposed algorithms can achieve 53.10% improvement in the maximum performance and 54.04% reduction in the number of buffers, compared with the technique for the slack matching problem.
- Published
- 2019
6. Thermal-Aware Task Mapping on Dynamically Reconfigurable Network-on-Chip Based Multiprocessor System-on-Chip
- Author
-
Weichen Liu, Weiwen Jiang, Nikil Dutt, Wei Zhang, Lei Yang, Nan Guan, and Liang Feng
- Subjects
Engineering::Computer science and engineering [DRNTU], Computer science, Reconfigurable NoC, Reconfigurability, Multiprocessing, Energy consumption, Chip, Theoretical Computer Science, Scheduling (computing), Network on a chip, Computational Theory and Mathematics, Hardware and Architecture, Embedded system, Scalability, Dark silicon, System on a chip, Task Mapping, Software
- Abstract
Dark silicon is the phenomenon that a fraction of a many-core chip has to be turned off or run in a low-power state in order to maintain a safe chip temperature. System-level thermal management techniques normally map applications onto non-adjacent cores, which degrades communication efficiency among these cores on a conventional network-on-chip (NoC). Recently, the SMART NoC architecture was proposed, enabling single-cycle multi-hop bypass channels to be built between distant cores at runtime to reduce communication latency. However, the communication efficiency of SMART NoC is diminished by communication contention, which in turn decreases system performance. In this paper, we first propose an Integer Linear Programming (ILP) model that properly addresses the communication problem and generates optimal solutions with consideration of inter-processor communication. We further present a novel heuristic algorithm for task mapping in dark silicon many-core systems, called TopoMap, on top of the SMART architecture, which can effectively solve the communication contention problem in polynomial time. With fine-grained consideration of chip thermal reliability and inter-processor communication, the presented approaches control the reconfigurability of the NoC communication topology during task mapping and scheduling. Thermal safety is guaranteed by physically decentralized active cores, and communication overhead is reduced by minimized communication contention and maximized bypass routing. Performance evaluation on PARSEC shows the applicability and effectiveness of the proposed techniques, which achieve on average 42.5 and 32.4 percent improvement in communication and application performance, respectively, and a 32.3 percent reduction in system energy consumption, compared with state-of-the-art techniques. TopoMap introduces only a 1.8 percent performance difference compared to the ILP model and is more scalable to large-size NoCs.
- Published
- 2018
7. Heterogeneous FPGA-Based Cost-Optimal Design for Timing-Constrained CNNs
- Author
-
Lei Yang, Qingfeng Zhuge, Jingtong Hu, Edwin H.-M. Sha, Weiwen Jiang, and Xianzhang Chen
- Subjects
Optimization problem, Speedup, Cost efficiency, Data parallelism, Computer science, Pipeline (computing), Task parallelism, Computer Graphics and Computer-Aided Design, Dynamic programming, Reduction (complexity), Memory management, Computer engineering, Electrical and Electronic Engineering, Software
- Abstract
Field-programmable gate array (FPGA) has been one of the most popular platforms for implementing convolutional neural networks (CNNs) due to its high performance and cost efficiency; however, limited by on-chip resources, existing single-FPGA architectures cannot fully exploit the parallelism in CNNs. In this paper, we explore heterogeneous FPGA-based designs to effectively leverage both task and data parallelism, such that the resultant system achieves the minimum cost while satisfying timing constraints. To maximize task parallelism, we investigate two critical problems: 1) buffer placement: where to place buffers to partition CNNs into pipeline stages; and 2) task assignment: what type of FPGA to use to implement different CNN layers. We first formulate the system-level optimization problem as a mixed integer linear programming model. Then, we propose an efficient dynamic programming algorithm to obtain the optimal solutions. On top of that, we devise an efficient algorithm that exploits data parallelism within CNN layers to further improve cost efficiency. Evaluations on well-known CNNs demonstrate that the proposed techniques obtain an average of 30.82% reduction in system cost under the same timing constraint, and an average 1.5× speedup in performance under the same cost budget, compared with the state-of-the-art techniques.
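The interplay of buffer placement and task assignment can be sketched with a simplified dynamic program: partition consecutive CNN layers into pipeline stages (each stage boundary is a buffer) and pick an FPGA type per stage. The cost/speed model below is invented for illustration and is much coarser than the paper's MILP formulation:

```python
def min_cost_pipeline(layer_times, fpga_types, stage_bound):
    """fpga_types: list of (cost, speed) pairs. A stage of consecutive layers
    mapped onto one FPGA takes sum(times)/speed and must fit within
    stage_bound (the pipeline initiation interval). Returns the minimum
    total cost, or None if infeasible. O(n^2 * types) prefix DP."""
    n = len(layer_times)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i] = min cost to implement layers[:i]
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):                    # layers j..i-1 form one stage
            work = sum(layer_times[j:i])
            for cost, speed in fpga_types:
                if work / speed <= stage_bound:
                    best[i] = min(best[i], best[j] + cost)
    return best[n] if best[n] < INF else None
```

Tightening `stage_bound` (a stricter timing constraint) forces more stages or faster, costlier FPGAs, which is exactly the cost/timing trade-off the paper optimizes.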
- Published
- 2018
8. Towards the Design of Efficient and Consistent Index Structure with Minimal Write Activities for Non-Volatile Memory
- Author
-
Runyu Zhang, Xianzhang Chen, Zhulin Ma, Weiwen Jiang, Edwin H.-M. Sha, Hailiang Dong, and Qingfeng Zhuge
- Subjects
Speedup, CPU cache, Computer science, Search engine indexing, Linked list, Parallel computing, Data structure, Theoretical Computer Science, Database index, Tree (data structure), Tree structure, Computational Theory and Mathematics, Data retrieval, Hardware and Architecture, Search algorithm, Software
- Abstract
Index structures can significantly accelerate data retrieval operations in data-intensive systems, such as databases. Tree structures, such as the B⁺-tree and its variants, are commonly employed as index structures; however, we found that tree structures may not be appropriate for Non-Volatile Memory (NVM) in terms of the requirements for high performance and high endurance. This paper studies what the best index structure for NVM-based systems is and how to design such index structures. The design of an NVM-friendly index structure faces several challenges. First, in order to prolong the lifetime of NVM, write activities on NVM should be minimized; to this end, the index structure should be as simple as possible. The index proposed in this paper is based on the simplest data structure, i.e., the linked list. Second, the simple structure makes it challenging to achieve high-performance data retrieval operations. To overcome this challenge, we design a novel technique that explicitly builds a contiguous virtual address space over the linked list, such that efficient search algorithms can be performed. Third, we need to carefully consider data consistency issues in NVM-based systems, because the order of memory writes may be changed and the data content in NVM may become inconsistent due to the write-back effects of the CPU cache. This paper devises a novel indexing scheme, called “Virtual Linear Addressable Buckets” (VLAB). We implement VLAB in a storage engine and plug it into MySQL. Evaluations are conducted on an NVDIMM workstation using YCSB workloads and real-world traces. Results show that the write activities of state-of-the-art indexes are 6.98 times those of ours; meanwhile, VLAB achieves a 2.53× speedup.
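The core idea, an index as simple as a linked list (few writes per insert, no rebalancing) that is still binary-searchable through a contiguous virtual address space over its nodes, can be sketched as follows. The class and field names are hypothetical; an ordinary array of node references stands in for the virtual address space:

```python
class Node:
    def __init__(self, key):
        self.key, self.next = key, None

class VLABLikeIndex:
    """Keys live in a sorted singly linked list (the NVM-resident part,
    touched by only one or two pointer writes per insert); vspace is a
    contiguous index over the nodes enabling O(log n) binary search."""

    def __init__(self):
        self.head = None
        self.vspace = []                    # vspace[i] -> i-th node in key order

    def _locate(self, key):
        lo, hi = 0, len(self.vspace)        # binary search on the virtual space
        while lo < hi:
            mid = (lo + hi) // 2
            if self.vspace[mid].key < key:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def insert(self, key):
        node = Node(key)
        pos = self._locate(key)
        if pos == 0:                        # splice at list head
            node.next, self.head = self.head, node
        else:                               # splice after predecessor
            prev = self.vspace[pos - 1]
            node.next, prev.next = prev.next, node
        self.vspace.insert(pos, node)       # index-side update, cheap to rebuild

    def search(self, key):
        pos = self._locate(key)
        return pos < len(self.vspace) and self.vspace[pos].key == key
```

The wear argument is visible in `insert`: the persistent structure only sees pointer splices, while all the search acceleration lives in the rebuildable index.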
- Published
- 2018
9. Optimal Functional-Unit Assignment for Heterogeneous Systems Under Timing Constraint
- Author
-
Edwin H.-M. Sha, Qingfeng Zhuge, Xianzhang Chen, Lei Zhou, Weiwen Jiang, and Lei Yang
- Subjects
Mathematical optimization, Computer science, Directed acyclic graph, Graph, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, Graph (abstract data type), Algorithm design, Retiming, Algorithm, Time complexity, Integer programming
- Abstract
High-level synthesis for real-time systems typically employs heterogeneous functional-unit types to achieve high-performance and low-cost designs. In the design phase, it is critical to determine which functional-unit type each operation in a given application should be mapped to, such that the total cost is minimized while the deadline is met. For a path- or tree-structured application, existing approaches can obtain the minimum-cost assignment, called the “optimal assignment”, under which the resultant system satisfies a given timing constraint. However, it remains an open question whether efficient algorithms exist to obtain the optimal assignment for directed acyclic graphs (DAGs) or, more generally, data-flow graphs with cycles (cyclic DFGs). For DAGs, by analyzing the properties of the problem, this paper designs an efficient algorithm to obtain the optimal assignments. For cyclic DFGs, we approach the problem in combination with the retiming technique to thoroughly explore the design space. We formulate a Mixed Integer Linear Programming (MILP) model to give the optimal solution, but because of its high time complexity, we also devise a practical algorithm that obtains near-optimal solutions within a minute. Experimental results show the effectiveness of our algorithms: compared with existing techniques, we achieve 25.70 and 30.23 percent reductions in total cost on DAGs and cyclic DFGs, respectively.
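For intuition, the path-structured case that existing approaches already solve optimally admits a compact pseudo-polynomial DP: pick a functional-unit type per operation on a chain so the total time meets the deadline at minimum cost. This is an illustrative sketch of that special case only, not the paper's DAG or cyclic-DFG algorithms:

```python
def optimal_chain_assignment(op_options, deadline):
    """op_options[i]: list of (cost, time) choices for operation i, one per
    functional-unit type; operations execute sequentially. Returns the
    minimum total cost with total time <= deadline (an integer), or None
    if infeasible. DP over the time budget."""
    INF = float("inf")
    best = [0.0] * (deadline + 1)         # zero operations cost nothing
    for options in op_options:
        new = [INF] * (deadline + 1)
        for t in range(deadline + 1):     # best cost within time budget t
            for cost, time in options:
                if time <= t and best[t - time] + cost < new[t]:
                    new[t] = best[t - time] + cost
        best = new
    return best[deadline] if best[deadline] < INF else None
```

Relaxing the deadline lets the DP trade fast, expensive units for slow, cheap ones, which is the cost/timing tension the paper studies on general graphs.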
- Published
- 2017
10. FoToNoC: A Folded Torus-Like Network-on-Chip Based Many-Core Systems-on-Chip in the Dark Silicon Era
- Author
-
Lei Yang, Mengquan Li, Weichen Liu, Peng Chen, Weiwen Jiang, and Edwin H.-M. Sha
- Subjects
Computer science, Transistor, Energy consumption, Power budget, Network on a chip, Computational Theory and Mathematics, Hardware and Architecture, Embedded system, Signal Processing, Dark silicon, Scalability, Efficient energy use
- Abstract
Dark silicon refers to the phenomenon that a fraction of a many-core chip has to become “dark” or “dim” in order to keep the system within a safe temperature range and an allowable power budget. Techniques have been developed to selectively activate non-adjacent cores on a many-core chip to avoid temperature hotspots, but they result in an unexpected increase in communication overhead due to the longer average distance between active cores, in turn hurting application performance and energy efficiency when a Network-on-Chip (NoC) is used as the scalable communication subsystem. To address the brand-new challenges brought by dark silicon, in this paper we present FoToNoC, a Folded Torus-like NoC, coupled with a hierarchical management strategy for heterogeneous many-core systems. On top of it, the objectives of maximizing application performance, energy efficiency, and chip reliability are isolated and achieved by hardware-software co-design in several phases, including application mapping and scheduling, cluster management, and DVFS control. Evaluations on PARSEC benchmark applications demonstrate the significance of the entire strategy. Compared with state-of-the-art approaches, the proposed FoToNoC organization achieves on average 35.4 and 35.2 percent improvement in communication efficiency and application performance, respectively, while maintaining a safe chip temperature. The hierarchical cluster-based management strategy further reduces total energy consumption by an average of 34.6 percent, with a notable reduction in chip peak temperature. The significant gains in system energy efficiency and the reduction in chip temperature on the H.264 decoder and DSPstone benchmarks additionally verify the effectiveness of the proposed methods.
- Published
- 2017
11. Efficient Data Placement for Improving Data Access Performance on Domain-Wall Memory
- Author
-
Edwin H.-M. Sha, Qingfeng Zhuge, Chun Jason Xue, Xianzhang Chen, Weiwen Jiang, and Yuangang Wang
- Subjects
Computer science, Locality, Parallel computing, Data access, Hardware and Architecture, Algorithm design, Electrical and Electronic Engineering, Integer programming, Software
- Abstract
Domain-wall memory (DWM) is becoming an attractive candidate to replace traditional memories for its high density, low leakage power, and low access latency. Accessing data on DWM is accomplished by shift operations that move the data located on nanowires to read/write ports. Because of this construction, data accesses on DWM exhibit varying access latencies; therefore, the data placement (DP) strategy has a significant impact on the performance of data accesses on DWM. In this paper, we prove the NP-completeness of the DP problem on DWM. For DWMs organized as a single DWM block cluster (DBC), we present integer linear programming formulations to solve the problem optimally. We also propose an efficient single-DBC placement (S-DBC-P) algorithm to exploit the benefits of multiple read/write ports and data locality. Compared with the sequential DP strategy, S-DBC-P reduces shift operations by 76.9% on average for eight-port DWMs. Furthermore, for the DP problem on DWMs organized in multiple DBCs, we develop an efficient multiple-DBC placement (M-DBC-P) algorithm to utilize the parallelism of DBCs. The experimental results show that M-DBC-P achieves a 90% performance improvement over the sequential DP strategy.
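Why placement matters on DWM can be seen with a toy shift-cost model: aligning a cell with a port shifts the whole nanowire, so consecutive accesses pay for the distance between their chosen alignments, and having more ports gives each access a cheaper alignment to choose. The model below is a simplification invented for illustration, not the paper's cost model:

```python
def shift_cost(accesses, ports):
    """Count shifts on one DWM nanowire. Aligning cell c with port p puts
    the tape at offset c - p; each access pays the change in offset for
    its cheapest port. A toy model of why data placement matters."""
    offset, total = 0, 0
    for cell in accesses:
        step = min(abs((cell - p) - offset) for p in ports)
        offset = min((cell - p for p in ports), key=lambda o: abs(o - offset))
        total += step
    return total
```

With a single port, a ping-pong access pattern pays full shifts every time, while a second well-placed port (or, equivalently, a placement that co-locates hot data near a port) can eliminate the shifts entirely, which is the effect S-DBC-P exploits.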
- Published
- 2016
12. A New Design of In-Memory File System Based on File Virtual Address Framework
- Author
-
Xianzhang Chen, Edwin H.-M. Sha, Liang Shi, Weiwen Jiang, and Qingfeng Zhuge
- Subjects
Computer science, Stub file, Theoretical Computer Science, Persistence (computer science), Design rule for Camera File system, Versioning file system, SSH File Transfer Protocol, File system fragmentation, Flash file system, File system, Random access memory, Address space, Computer file, Device file, Everything is a file, Unix file types, Virtual file system, Torrent file, Memory-mapped file, File Control Block, Self-certifying File System, Computational Theory and Mathematics, Virtual address space, Hardware and Architecture, Journaling file system, Operating system, Software
- Abstract
The emerging technologies of persistent memory, such as PCM and MRAM, provide opportunities for preserving files in memory, and traditional file system structures may need to be re-studied. Even though several file systems have been proposed for memory, most of them deliver limited performance because they do not fully utilize the hardware on the processor side. This paper presents a framework based on a new concept, the “File Virtual Address Space”. A file system, the Sustainable In-Memory File System (SIMFS), is designed and implemented, which fully utilizes the memory-mapping hardware on the file access path. First, SIMFS embeds the address space of an open file into the process's address space. Then, file accesses are handled by the memory-mapping hardware. Several optimization approaches are also presented for SIMFS. Extensive experiments show that the throughput of SIMFS achieves significant performance improvement over state-of-the-art in-memory file systems.
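The "file accesses handled by memory-mapping hardware" idea has a familiar user-space analogue in mmap: once a file is mapped into the process's address space, reads become plain memory loads serviced by the MMU rather than per-access system calls. A minimal sketch of that analogue (not SIMFS itself, which embeds the file address space inside the kernel):

```python
import mmap

def mapped_read(path, offset, length):
    """Read bytes from a file through a memory mapping: after mmap, the
    slice below is serviced by the MMU and page tables, not by a read()
    system call per access."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]
```

SIMFS pushes this further by keeping the file's page tables ready inside the file system itself, so even the mapping setup cost on the access path is minimized.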
- Published
- 2016
13. Application Mapping and Scheduling for Network-on-Chip-Based Multiprocessor System-on-Chip With Fine-Grain Communication Optimization
- Author
-
Lei Yang, Edwin H.-M. Sha, Weiwen Jiang, Juan Yi, Weichen Liu, and Mengquan Li
- Subjects
Rate-monotonic scheduling, Earliest deadline first scheduling, FIFO (computing and electronics), Computer science, Processor scheduling, Multiprocessing, Dynamic priority scheduling, Parallel computing, Energy consumption, MPSoC, Round-robin scheduling, Bottleneck, Fair-share scheduling, Multiprocessor scheduling, Scheduling (computing), Fixed-priority pre-emptive scheduling, Network on a chip, Hardware and Architecture, Two-level scheduling, Electrical and Electronic Engineering, Software
- Abstract
Network-on-chip (NoC) is a promising communication paradigm for the next-generation multiprocessor system-on-chip (MPSoC). As communication has become an integral part of on-chip computing, and even the performance bottleneck, researchers are paying much attention to its implementation and optimization. Traditional techniques that model communication inaccurately lead to unexpected runtime performance, which we observe to be on average 90.8% worse than the predicted results, and are not suitable for the deep optimization of communication-intensive scenarios. In this paper, techniques are presented for NoC-based MPSoCs that integrate optimization of interprocessor communications with the objective of minimizing the schedule length. A fine-grained integer linear programming (ILP) model is proposed to properly address communication latency under network contention, generating runtime schedules with trivial performance differences from the predictions. We further propose a heuristic algorithm, unified priority-based scheduling (UPS), which effectively solves the contention problem in polynomial time by assigning priorities to messages. Evaluation results show that the solutions obtained by the ILP model outperform state-of-the-art techniques by 31.1%, and that UPS improves application performance by 34.7% and 44.4% compared with first-in-first-out (FIFO)-based and random-based methods, respectively. In addition, UPS obtains results within 8.3% of the optimal ILP solutions on average. A case study on an H.264 high-definition television (HDTV) decoder and digital signal processor (DSP) filter benchmarks shows significant improvement in performance and result-prediction accuracy, as well as prominent reductions in the number of network contentions and in energy consumption.
- Published
- 2016
Discovery Service for Jio Institute Digital Library