721 results on '"Instruction cycle"'
Search Results
2. The PIC18F1220 Microcontroller
- Author
-
Katzen, Sid
- Published
- 2010
3. CPU Architectures for Speech Processing
- Author
-
Sinha, Priyabrata
- Published
- 2010
4. Toward low CPU usage and efficient DPDK communication in a cluster
- Author
-
Mingjie Wu, Jingjuan Wang, and Qingkui Chen
- Subjects
business.industry ,Computer science ,CPU time ,Thread (computing) ,Theoretical Computer Science ,Hardware and Architecture ,Embedded system ,Forwarding plane ,CPU core voltage ,Central processing unit ,Polling ,business ,Instruction cycle ,Frequency scaling ,Software ,Information Systems - Abstract
In recent years, DPDK (Data Plane Development Kit, a data-plane development tool set provided by Intel that focuses on high-performance processing of data packets in network applications), one of the high-performance packet I/O frameworks, has been widely used to improve the efficiency of data transmission in clusters. However, the busy polling used in DPDK not only wastes a large number of CPU cycles and causes unnecessary power consumption, but its high CPU usage also degrades the performance of other applications on the host. Although technologies such as DVFS (dynamic voltage and frequency scaling, which dynamically adjusts the operating frequency and voltage of the chip according to the computing needs of the running application in order to save energy) and LPI (low power idle, which saves power by switching off supporting circuits while a CPU core is idle) can reduce power consumption by adjusting CPU voltage and frequency, they can also degrade the performance of other applications. Putting the polling thread to sleep is a promising way to reduce CPU usage and power consumption, but it is challenging because the appropriate sleep duration cannot be determined accurately. In this paper, we propose a model that finds the optimal thread sleep duration to address these challenges. From the model, we can balance thread CPU usage against transmission efficiency to obtain the optimal sleep duration, called the transmission performance threshold. Experiments show that the proposed model significantly reduces thread CPU usage: while communication performance is only slightly reduced, CPU utilization drops by about 80%.
- Published
- 2021
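The sleep-duration tuning described in result 4 can be illustrated with a minimal polling-loop sketch; the queue object, step sizes, and cap below are invented placeholders rather than the authors' model, which derives the duration analytically from a transmission-performance threshold.

```python
import time
from collections import deque

def adaptive_rx_loop(rx_queue, handle_packet, rounds, max_sleep_s=1e-3, step_s=5e-5):
    """Busy-poll rx_queue, but sleep for an adaptive duration while it is empty.

    Illustrative stand-in for the paper's sleep-duration model: the sleep time
    ramps up while polls come back empty (saving CPU cycles) and resets as soon
    as traffic arrives (protecting transmission performance). All constants are
    made-up placeholders.
    """
    sleep_s = 0.0
    for _ in range(rounds):
        if rx_queue:
            handle_packet(rx_queue.popleft())
            sleep_s = 0.0            # traffic present: go back to pure polling
        else:
            sleep_s = min(max_sleep_s, sleep_s + step_s)
            time.sleep(sleep_s)      # idle: yield the CPU instead of spinning
    return sleep_s

if __name__ == "__main__":
    q = deque([b"pkt"] * 100)
    print("sleep after draining queue:", adaptive_rx_loop(q, lambda p: None, rounds=200))
```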
5. A NOVEL MULTI GRANULARITY LOCKING SCHEME BASED ON CONCURRENT MULTI -VERSION HIERARCHICAL STRUCTURE
- Author
-
Vivek Jaglan, Swati, and Shalini Bhaskar Bajaj
- Subjects
Reduction (complexity) ,Critical section ,Computer science ,Concurrency ,Granularity ,Multiple granularity locking ,Parallel computing ,Software_PROGRAMMINGTECHNIQUES ,Instruction cycle ,Protocol (object-oriented programming) ,Hierarchical database model - Abstract
We present an efficient locking scheme for hierarchical data structures. The existing multi-granularity locking mechanism works at two extremes: fine-grained locking, which maximizes concurrency, and coarse-grained locking, which minimizes locking cost. Between the two extremes lie several Pareto-optimal options that trade off the attainable concurrency against locking cost. In this work, we present a locking technique, Collaborative Granular Version Locking (CGVL), which selects an optimal locking combination to serve locking requests in a hierarchical structure. In CGVL a series of versions is maintained at each granular level, which allows read and write operations on a data item to execute simultaneously. Our study reveals that, to achieve optimal performance, the lock manager explores various locking options by converting certain non-supporting locking modes into supporting ones, thereby improving the existing compatibility matrix of the multiple-granularity locking protocol. Our claim is quantitatively validated using a Java Sun JDK environment, which shows that CGVL performs better than state-of-the-art MGL methods. In particular, CGVL attains a 20% reduction in execution time for locking operations, evaluated over the following parameters: i) the number of threads, ii) the number of locked objects, and iii) the duration of the critical section (CPU cycles), which significantly enhances concurrency in terms of the number of concurrent read accesses.
- Published
- 2021
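As a rough illustration of the versioned-lock idea in result 5, the sketch below combines the classic multiple-granularity compatibility matrix with per-granule committed versions so that a reader need not block on a writer; the version-serving rule is an assumption, not the paper's actual CGVL protocol.

```python
# Classic multiple-granularity lock compatibility (IS, IX, S, X); CGVL's
# contribution, per the abstract, is to serve some nominally incompatible
# requests from older versions instead of blocking. The version handling
# below is a guess at that behaviour, not the paper's actual protocol.
COMPATIBLE = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

class VersionedGranule:
    """A node in the lock hierarchy that keeps committed versions of its data."""
    def __init__(self, value):
        self.versions = [value]   # committed versions, oldest first
        self.held = []            # lock modes currently granted

    def request(self, mode):
        if all(COMPATIBLE[(h, mode)] for h in self.held):
            self.held.append(mode)
            return "granted", self.versions[-1]
        if mode == "S":
            # Reader conflicts with a writer: serve the last committed version
            # instead of blocking -- the "collaborative version" idea.
            return "granted-on-version", self.versions[-1]
        return "blocked", None

    def commit_write(self, new_value):
        self.versions.append(new_value)
        self.held = [m for m in self.held if m != "X"]

if __name__ == "__main__":
    g = VersionedGranule(value=1)
    print(g.request("X"))        # writer takes the exclusive lock
    print(g.request("S"))        # reader is served from version 1, not blocked
    g.commit_write(2)
    print(g.request("S"))        # now sees version 2
```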
6. ODMDEF: On-Device Multi-DNN Execution Framework Utilizing Adaptive Layer-Allocation on General Purpose Cores and Accelerators
- Author
-
Cheolsun Lim and Myungsun Kim
- Subjects
Co-scheduling ,General Computer Science ,Computer science ,GPU ,multi-DNN framework ,02 engineering and technology ,Parallel computing ,01 natural sciences ,Bottleneck ,Field (computer science) ,embedded system ,Kernel (linear algebra) ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,General Materials Science ,Data synchronization ,Instruction cycle ,010302 applied physics ,Multi-core processor ,General Engineering ,020207 software engineering ,TK1-9971 ,prediction model ,Electrical engineering. Electronics. Nuclear engineering ,Data transmission - Abstract
On-device DNN processing has become a common interest in the field of autonomous-driving research. For better accuracy, both the number of DNN models and their complexity have increased. To respond to this, hardware platforms structured with multicore CPUs and DNN accelerators have been released, and the GPU is generally used as the accelerator. When multiple DNN workloads are requested sporadically, the GPU can easily be oversubscribed, leading to an unexpected performance bottleneck. We propose an on-device CPU-GPU co-scheduling framework for multi-DNN execution that removes the performance barrier that leaves DNN executions bounded by the GPU. Our framework fills unused CPU cycles with DNN computations to ease the computational burden on the GPU. To provide a seamless computing environment for the two different core types, the framework formats each layer execution according to the computational methods supported by CPU and GPU cores. To cope with irregular arrivals of DNN workloads and accommodate their fluctuating demands for hardware resources, our framework dynamically selects the best-fit core type after comparing the current availability of the two core types. During core selection, offline-trained prediction models are used to obtain a precise predicted execution time for the issued layer. Our framework also mitigates the fact that even identical DNN models can show large performance deviations due to the GPU-agnostic process scheduler of the underlying OS. In addition, the framework minimizes the memory-copy overhead that inevitably occurs in the data-synchronization phase between the heterogeneous cores. To do so, we analyze GPU-to-CPU and CPU-to-GPU data-transfer cases separately and apply the solution that best suits each case. For multi-DNN inference jobs on the NVIDIA Jetson AGX Xavier platform, our framework speeds up execution time by up to 46.6% over a GPU-only solution.
- Published
- 2021
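Result 6's core-selection step, where an offline-trained latency predictor and the current availability of each device decide whether a layer runs on the CPU or the GPU, can be sketched roughly as follows; the device model, the lookup-table "predictor", and all timings are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    busy_until: float    # time (ms) when the device frees up

def predicted_runtime(layer, device_name):
    """Stand-in for the offline-trained per-layer latency predictors the
    abstract mentions; here just a lookup table of made-up numbers (ms)."""
    table = {("conv", "cpu"): 8.0, ("conv", "gpu"): 2.0,
             ("fc",   "cpu"): 1.5, ("fc",   "gpu"): 1.0}
    return table[(layer, device_name)]

def pick_device(layer, now, devices):
    """Choose the core type that finishes the layer earliest, counting both the
    device's current backlog and the predicted layer runtime."""
    def finish_time(dev):
        start = max(now, dev.busy_until)
        return start + predicted_runtime(layer, dev.name)
    best = min(devices, key=finish_time)
    best.busy_until = finish_time(best)
    return best.name

if __name__ == "__main__":
    cpu, gpu = Device("cpu", busy_until=0.0), Device("gpu", busy_until=20.0)
    # The GPU is oversubscribed (busy for 20 ms), so early layers spill to the CPU.
    for layer in ["conv", "conv", "fc", "conv"]:
        print(layer, "->", pick_device(layer, now=0.0, devices=[cpu, gpu]))
```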
7. Data Mining Techniques in Materialised Projection View
- Author
-
Teh, Ying Wah, Zaitun, Abu Bakar, Kacprzyk, Janusz, editor, Abraham, Ajith, editor, Franke, Katrin, editor, and Köppen, Mario, editor
- Published
- 2003
8. REFLIX: A Processor Core for Reactive Embedded Applications
- Author
-
Salcic, Zoran, Biglari-Abhari, Morteza, Bigdeli, Abbas, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Glesner, Manfred, editor, Zipf, Peter, editor, and Renovell, Michel, editor
- Published
- 2002
9. Interrupt Handling
- Author
-
Katzen, Sid, and Sammes, A. J., editor
- Published
- 2001
10. CrocodileDB in action
- Author
-
Zechao Shang, Sanjay Krishnan, Dixin Tang, Michael J. Franklin, and Aaron J. Elmore
- Subjects
Query plan ,Set (abstract data type) ,Stream processing ,Resource (project management) ,Computer science ,Distributed computing ,General Engineering ,Central processing unit ,Latency (engineering) ,Tuple ,Instruction cycle - Abstract
Existing stream processing and continuous query processing systems eagerly maintain standing queries by consuming all available resources to finish the jobs at hand, which can be a major source of wasted CPU cycles and memory. However, users sometimes do not need to see the up-to-date query result right after the data is ready, and can thus allow some slack time before the result is returned, which provides new opportunities to avoid wasting resources. We proposed CrocodileDB, a resource-efficient database in which users specify a performance goal representing the maximum allowed time slack, and the system generates a query plan that minimizes resource consumption (e.g., memory consumption or CPU cycles) while meeting this performance goal. In this paper, we demonstrate how users interact with CrocodileDB and show how the time slack enables our optimization for reducing CPU consumption: Incrementability-aware Query Processing (InQP). With the slack specified by users, InQP can reduce wasted computing resources by selectively deferring the execution of the parts of a query that are not amenable to incremental execution (i.e., that output tuples likely to be deleted by later executions). In this demonstration, users can set the performance goal as a trade-off between CPU consumption and query latency, and observe CPU usage and other statistics to understand how InQP reduces computing resources.
- Published
- 2020
11. Towards Power Efficient High Performance Packet I/O
- Author
-
Xuesong Li, Wenxue Cheng, Bailong Yang, Fengyuan Ren, and Tong Zhang
- Subjects
Power management ,CPU power dissipation ,Network packet ,Computer science ,business.industry ,Packet processing ,CPU time ,Throughput ,Idle ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,Latency (engineering) ,Polling ,business ,Instruction cycle ,Computer network - Abstract
Recently, high performance packet I/O frameworks continue to flourish for their ability to process packets from high-speed links. To achieve high throughput and low latency, high performance packet I/O frameworks usually employ busy polling. As busy polling will burn all CPU cycles even if there's no packet to process, these frameworks are quite power inefficient. However, exploiting power management techniques such as DVFS and LPI in the frameworks is challenging, because neither the OS nor the frameworks can provide information (e.g., actual CPU utilization, available idle period, or the target frequency) required by these techniques. In this article, we establish a model that can formulate the packet processing flow of high performance packet I/O to help and address the above challenges. From the model, we can deduce the information needed for power management techniques, and gain the insights to balance the power and latency. After suggesting to use pause instruction to reduce CPU power within short idle period, we propose two approaches to conduct power conservation for high performance packet I/O: one with the aid of traffic information and the other without. Experiments with Intel DPDK show that both approaches can achieve significant power reduction with little latency increase.
- Published
- 2020
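The idle-period handling discussed in result 11, bridging short gaps with pause instructions and reserving LPI or a DVFS frequency drop for longer gaps, might look like the following decision sketch; the break-even constants are invented, not measured values from the paper.

```python
def choose_idle_action(predicted_idle_us,
                       pause_cost_us=0.1, lpi_entry_exit_us=30.0):
    """Pick a power-saving action for one idle period.

    Mirrors the intuition in the abstract: very short gaps are bridged with
    pause instructions (cheap, near-zero wakeup latency), while gaps long
    enough to amortise the entry/exit latency justify LPI-style sleep or a
    DVFS frequency drop. The break-even constants are illustrative only.
    """
    if predicted_idle_us < 2 * lpi_entry_exit_us:
        spins = max(1, int(predicted_idle_us / pause_cost_us))
        return f"spin with ~{spins} pause instructions"
    return "enter low-power idle (or scale frequency down) and wake on arrival"

if __name__ == "__main__":
    for gap in (1, 10, 50, 200, 5000):   # microseconds between packet bursts
        print(f"{gap:>5} us idle -> {choose_idle_action(gap)}")
```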
12. EQueue: Elastic Lock-Free FIFO Queue for Core-to-Core Communication on Multi-Core Processors
- Author
-
Xiong Fu, Tian Yangfeng, and Junchang Wang
- Subjects
020203 distributed computing ,Multi-core processor ,General Computer Science ,Computer science ,CPU cache ,General Engineering ,020207 software engineering ,02 engineering and technology ,Parallel computing ,Burstiness ,0202 electrical engineering, electronic engineering, information engineering ,Non-blocking algorithm ,Memory footprint ,multi-core processors ,General Materials Science ,Double-ended queue ,Lock-free queue ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,Electrical and Electronic Engineering ,Instruction cycle ,Queue ,pipeline parallelism ,lcsh:TK1-9971 - Abstract
In recent years, the number of CPU cores in a multi-core processor keeps increasing. To leverage the growing hardware resources, programmers need to develop parallelized software. One promising approach to parallelizing high-performance applications is pipeline parallelism, which divides a task into a series of subtasks and then maps these subtasks to a group of CPU cores, making the communication scheme between subtasks running on different cores a critical component of the parallelized program. One widely used implementation of this communication scheme is a software-based, lock-free first-in-first-out queue that moves data between subtasks. The primary design goal of prior lock-free queues was higher throughput, so the technique of batching data was heavily used in their enqueue and dequeue operations. Unfortunately, a lock-free queue with batching depends heavily on the assumption that data arrive at a constant rate and that the queue is in an equilibrium state. Experimentally, we found that this equilibrium state rarely holds in real, high-performance use cases (e.g., 10 Gbps+ network applications) because the data arrival rate fluctuates sharply. As a result, existing queues suffer from performance degradation when used in real applications on multi-core processors. In this paper, we present EQueue, a lock-free queue that handles this robustness issue. EQueue is lock-free, efficient, and robust. It can adaptively (1) shrink its queue size when the data arrival rate is low, keeping its memory footprint small to better utilize the CPU cache, and (2) enlarge its queue size to avoid overflow when arrivals are bursty. Experimental results show that when used in high-performance applications, EQueue can always perform an enqueue/dequeue operation in less than 50 CPU cycles, outperforming FastForward and MCRingBuffer, two state-of-the-art queues, by factors of 3 and 2, respectively.
- Published
- 2020
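A plain-Python sketch of the elastic-resizing policy behind EQueue (result 12) is given below; the real queue is a lock-free, batched C implementation, so only the grow-on-burst and shrink-on-idle behaviour is modelled here, with arbitrary capacity thresholds.

```python
class ElasticQueue:
    """Ring buffer that grows under bursts and shrinks when traffic is light,
    so its footprint stays cache-friendly. A plain-Python illustration of the
    resizing policy only -- the real EQueue is lock-free and batches entries."""

    def __init__(self, capacity=8, min_capacity=8, max_capacity=4096):
        self.buf = [None] * capacity
        self.head = self.tail = self.count = 0
        self.min_capacity, self.max_capacity = min_capacity, max_capacity

    def _resize(self, new_cap):
        items = [self.buf[(self.head + i) % len(self.buf)] for i in range(self.count)]
        self.buf = items + [None] * (new_cap - len(items))
        self.head, self.tail = 0, self.count % new_cap

    def enqueue(self, item):
        if self.count == len(self.buf):                      # burst: avoid overflow
            self._resize(min(len(self.buf) * 2, self.max_capacity))
        if self.count == len(self.buf):
            return False                                     # truly full
        self.buf[self.tail] = item
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1
        return True

    def dequeue(self):
        if self.count == 0:
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        if len(self.buf) > self.min_capacity and self.count < len(self.buf) // 4:
            self._resize(max(self.min_capacity, len(self.buf) // 2))  # light load: shrink
        return item

if __name__ == "__main__":
    q = ElasticQueue()
    for i in range(100):          # bursty arrival
        q.enqueue(i)
    print("capacity after burst:", len(q.buf))
    while q.dequeue() is not None:
        pass
    print("capacity after drain:", len(q.buf))
```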
13. Memory-Aware Fair-Share Scheduling for Improved Performance Isolation in the Linux Kernel
- Author
-
Myungsun Kim, Jung-Ho Kim, Philkyue Shin, and Seongsoo Hong
- Subjects
General Computer Science ,CFS ,business.industry ,Computer science ,Quality of service ,Linux ,General Engineering ,Temporal isolation among virtual machines ,Linux kernel ,Fair-share scheduling ,Scheduling (computing) ,backend stall cycle ,Kernel (image processing) ,Embedded system ,operating system ,Human multitasking ,General Materials Science ,Resource management ,Cache ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,business ,Instruction cycle ,lcsh:TK1-9971 ,Memory-related interference - Abstract
Performance interference between QoS and best-effort applications is getting more aggravated as data-intensive applications are rapidly and widely spreading in recently emerging computing systems. While the completely fair scheduler (CFS) of the Linux kernel has been extensively used to support performance isolation in a multitasking environment, it falls short of addressing memory-related interference due to memory access contention and insufficient cache coverage. Though quite a few memory-aware performance isolation mechanisms have been proposed in the literature, many of them rely on hardware-based solutions, inflexible resource management or ineffective execution throttling, which makes it difficult for them to be used in widely deployed operating systems like Linux running on a COTS SoC platform. We propose a memory-aware fair-share scheduling algorithm that can make QoS applications less susceptible to memory-related interference from other co-running applications. Our algorithm carefully separates the genuine memory-related stall from a running task’s CPU cycles and compensates the task for the memory-related interference so that the task gets the desired share of CPU before it is too late. The proposed approach is adaptive, effective and efficient in the sense that it does not rely on any static allocation or partitioning of memory hardware resources and improves the performance of QoS applications with only a negligible runtime overhead. Moreover, it is a software-only solution that can be easily integrated into the kernel scheduler with only minimal modification to the kernel. We implement our algorithm into the CFS of Linux and name the end result mCFS. We show the utility and effectiveness of the approach via extensive experiments.
- Published
- 2020
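The compensation idea in result 13, charging a QoS task only for the CPU time it would have needed without memory interference, can be sketched as below; the stall-attribution rule and the 10% solo-stall baseline are assumptions standing in for mCFS's per-task measurements.

```python
def effective_cpu_time(measured_cycles, stall_cycles, solo_stall_fraction=0.10):
    """Estimate the CPU time a QoS task would have needed without co-runner
    interference, by discounting the memory-stall cycles that exceed the
    task's solo-run stall fraction. The 10% solo baseline is a made-up number
    standing in for the per-task profile mCFS would measure."""
    expected_solo_stall = measured_cycles * solo_stall_fraction
    interference_stall = max(0.0, stall_cycles - expected_solo_stall)
    return measured_cycles - interference_stall

def update_vruntime(vruntime, measured_cycles, stall_cycles, weight=1.0):
    """Charge the task's virtual runtime with the interference-free time, so a
    memory-victim task keeps receiving its fair CPU share (CFS-style sketch)."""
    return vruntime + effective_cpu_time(measured_cycles, stall_cycles) / weight

if __name__ == "__main__":
    # A tick of 1,000,000 cycles where 400,000 were backend stalls caused
    # mostly by a co-running, memory-hungry best-effort task.
    print(update_vruntime(0.0, 1_000_000, 400_000))   # charged ~700,000, not 1,000,000
```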
14. Congestion-Free Transient Plane (CFTP) Using Bandwidth Sharing During Link Failures in SDN
- Author
-
Muthumanikandan Vanamoorthy and Valliyammai Chinnaiah
- Subjects
010302 applied physics ,021103 operations research ,General Computer Science ,Computer science ,Network packet ,business.industry ,0211 other engineering and technologies ,02 engineering and technology ,01 natural sciences ,Backup ,Packet loss ,0103 physical sciences ,Convergence (routing) ,Forwarding plane ,Bandwidth (computing) ,Transient (computer programming) ,Instruction cycle ,business ,Computer network - Abstract
Software-defined networking (SDN) is an emerging trend where the control plane and the data plane are separated from each other, culminating in effective bandwidth utilization. This separation also allows multi-vendor interoperability. Link failure is a major problem in networking and must be detected as soon as possible because when a link fails the path becomes congested and packet loss occurs, delaying the delivery of packets to the destination. Backup paths must be configured immediately when a failure is detected in the network to speed up packet delivery, avoid congestion and packet loss and provide faster convergence. Various SDN segment protection algorithms that efficiently reduce CPU cycles and flow table entries exist, but each has drawbacks. An independent transient plane technique can be used to reduce packet loss but is not as efficient when multiple flows try to share the same link. The proposed work focuses on reducing congestion, providing faster convergence with minimal packet loss and effectively utilizing link bandwidth using bandwidth-sharing techniques. An analysis and related studies show that this method performs better and offers a more reliable network without loss, while simultaneously ensuring the swift delivery of data packets toward the destination without congestion, compared to the other existing schemes.
- Published
- 2019
15. Xenic
- Author
-
Weihao Liang, Arvind Krishnamurthy, Jacob Nelson, Henry N. Schuh, and Ming Liu
- Subjects
Remote direct memory access ,Network interface controller ,Asynchronous communication ,Computer science ,Distributed transaction ,Operating system ,Transaction processing system ,Hardware acceleration ,Network interface ,Instruction cycle ,computer.software_genre ,computer - Abstract
High-performance distributed transactions require efficient remote operations on database memory and protocol metadata. The high communication cost of this workload calls for hardware acceleration. Recent research has applied RDMA to this end, leveraging the network controller to manipulate host memory without consuming CPU cycles on the target server. However, the basic read/write RDMA primitives demand trade-offs in data structure and protocol design, limiting their benefits. SmartNICs are a flexible alternative for fast distributed transactions, adding programmable compute cores and on-board memory to the network interface. Applying measured performance characteristics, we design Xenic, a SmartNIC-optimized transaction processing system. Xenic applies an asynchronous, aggregated execution model to maximize network and core efficiency. Xenic's co-designed data store achieves low-overhead remote object accesses. Additionally, Xenic uses flexible, point-to-point communication patterns between SmartNICs to minimize transaction commit latency. We compare Xenic against prior RDMA- and RPC-based transaction systems with the TPC-C, Retwis, and Smallbank benchmarks. Our results for the three benchmarks show 2.42x, 2.07x, and 2.21x throughput improvement, 59%, 42%, and 22% latency reduction, while saving 2.3, 8.1, and 10.1 threads per server.
- Published
- 2021
16. Verification of the Tamarack-3 microprocessor in a hybrid verification environment
- Author
-
Zhu, Zheng, Joyce, Jeff, Seger, Carl, Goos, Gerhard, editor, Hartmanis, Juris, editor, Joyce, Jeffrey J., editor, and Seger, Carl-Johan H., editor
- Published
- 1994
17. Embedded Processor Architectures
- Author
-
Kcs Murti
- Subjects
Instruction set ,Addressing mode ,Memory hierarchy ,Reduced instruction set computing ,business.industry ,Computer science ,Embedded system ,Virtual memory ,Central processing unit ,Instruction cycle ,business ,Page table - Abstract
Improvements in semiconductor technology have enabled smaller feature sizes, better clock speeds, and higher performance. Improvements in computer architecture were enabled by RISC architectures and efficient high-level language compilers. Together, they have enabled customized computer architectures ranging from systems-on-chip to powerful GPUs and high-performance processors. Users expect the CPU to be able to access unlimited amounts of memory with low latency, yet the cost of fast memory is many times that of slower memory. Another characteristic of CPU memory access is the principle of spatial and temporal locality. The solution is to organize memory into a hierarchy by caching data at different levels. Section 12.3 covers cache basics in detail. Not all of the memory addressable by the CPU needs to be in physical memory, due to space and cost; it can reside on disk. The address range is mapped by the virtual memory manager. A virtual address consists of the page number and the offset within the page. The page is placed in a free slot in physical memory and indexed in the page table; thus virtual memory is mapped onto physical memory. Section 12.4 details virtual memory management. RISC stands for Reduced Instruction Set Computer. The clocks per instruction (CPI) is one in RISC. This architecture uses an optimized set of instructions executed in one cycle, which allows pipelining, by which multiple instructions can be executed simultaneously in different stages. RISC has many registers, simple instruction decoding, and simple addressing modes. Section 12.5 explains RISC architectures in detail. An efficient implementation of instruction execution is to overlap instruction executions so that each hardware unit is busy all the time. Section 12.6 explains this concept of pipelining and how hazards are controlled in the architecture. Several advances in pipelining architecture have been developed, but the performance improvements saturate as new constraints and implementation issues arise. When a single instruction operates on multiple data elements in a single instruction cycle, the instructions are called Single Instruction Multiple Data (SIMD) instructions. Section 12.7 introduces data-level parallelism with vector processing. Section 12.9 introduces Single Instruction Multiple Threads (SIMT) execution in GPUs. We can exploit certain types of programs that are inherently parallel and have very little dependence; we call these threads of execution. Thread-Level Parallelism (TLP) is explained in detail in Sect. 12.10. FPGA-based technology has made system-on-chip design straightforward: systems with high performance requirements can be built with hardware configured to those requirements, and temporal reconfiguration in FPGAs, mimicking DLLs in software, allows the same FPGA fabric to be reused for just-in-time "use and throw" hardware blocks. Section 12.11 covers reconfigurable computing in detail. After reading this chapter, readers will be able to understand the internal architecture of a processor, which helps in selecting a processor for a particular requirement.
- Published
- 2021
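The virtual-to-physical address translation summarized in result 17 (page number plus offset, resolved through a page table) can be shown with a small worked example; the page size and page-table contents are arbitrary.

```python
PAGE_SIZE = 4096                      # 4 KiB pages -> 12-bit offset

# Toy page table: virtual page number -> physical frame number.
# Frame numbers are arbitrary; a missing entry models a page that lives on disk.
PAGE_TABLE = {0: 7, 1: 3, 2: 11}

def translate(vaddr):
    """Translate a virtual address as described in the chapter summary: split
    it into a page number and an offset, look the page number up in the page
    table, and keep the offset unchanged."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in PAGE_TABLE:
        raise LookupError(f"page fault: virtual page {vpn} not resident")
    return PAGE_TABLE[vpn] * PAGE_SIZE + offset

if __name__ == "__main__":
    print(hex(translate(0x1ABC)))     # vpn 1 -> frame 3, so 0x3ABC
    try:
        translate(0x5000)             # vpn 5 is not mapped -> page fault
    except LookupError as e:
        print(e)
```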
18. A Robust Tool To Debloat And Optimize Android Devices
- Author
-
Ashref Raj C, Kingsy Grace. R, and Akilavan J
- Subjects
Random access memory ,Data collection ,Computer control ,Software ,User experience design ,business.industry ,Computer science ,Embedded system ,Android (operating system) ,business ,Instruction cycle ,Partition (database) - Abstract
Bloatware is software that offers limited functionality at the expense of RAM and CPU cycles. The proliferation of bloatware on Android devices not only depletes hardware resources but also makes the devices more vulnerable to security and privacy threats. The consumer pays in terms of battery, wasted space, and privacy and security compromises. These applications have also been known to be used for remote control and data collection. Removing them, however, is difficult: such bloatware is often mounted in the system partition, making it hard to remove without root access, and the existing methods of rooting a device are complex and may render the device inoperable if not done cautiously. Each app that has been uninstalled can later be reinstalled using the same tool. The proposed "Android Debloater and Optimizer" uninstalls such bloatware without the risk of rooting the device.
- Published
- 2021
19. Enhanced control path for repeated TCP connections
- Author
-
Niu Zhixiong, Gyeongsik Yang, Junho Lee, Yongqiang Xiong, Chuck Yoo, and Peng Cheng
- Subjects
Stack (abstract data type) ,business.industry ,Property (programming) ,Computer science ,Data path ,Control (management) ,Path (graph theory) ,Measure (physics) ,business ,Instruction cycle ,Computer network - Abstract
This paper presents FALTCON, which enhances the control path for repeated TCP connections. First, we measure and find that the control path of the TCP stack consumes as many CPU cycles as the data path, which underlines the importance of optimizing the control path; yet, to the best of our knowledge, there has been little research effort on the control path. We also observe that a significant portion of TCP traffic (e.g., HTTP) is not only short-lived but also repeated for a given server and client pair. We design FALTCON to take advantage of this repetition. Specifically, FALTCON re-designs the control path to remove the duplicate allocation of structures and redundant operations over them. FALTCON is implemented in Linux 5.1, which has the latest and highly efficient networking stack. Furthermore, we optimize FALTCON to be entirely lockless and to work per-core. The experimental results show that FALTCON handles up to 19% more connections than Linux while using up to 31% fewer CPU cycles.
- Published
- 2021
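One way to picture FALTCON's reuse of control-path state for repeated server-client pairs (result 19) is the cache sketch below; the cached fields, the cost model, and the dictionary-based cache are illustrative guesses, not the kernel data structures the paper modifies.

```python
# Sketch of the "reuse state for repeated peers" idea: the first connection to
# a (local, remote) pair pays the full control-path cost of building its state;
# later connections to the same pair clone a cached template instead. The
# fields and costs are invented for illustration -- the real FALTCON works on
# Linux kernel structures, not Python dicts.
from copy import deepcopy

_template_cache = {}

def expensive_build_state(local, remote):
    return {"local": local, "remote": remote, "congestion": "initial",
            "options": ["SACK", "timestamps"], "buffers": [0] * 1024}

def open_connection(local, remote):
    key = (local, remote)
    if key in _template_cache:
        state = deepcopy(_template_cache[key])         # cheap clone of cached state
        state["reused"] = True
    else:
        state = expensive_build_state(local, remote)   # full control-path work
        _template_cache[key] = deepcopy(state)
        state["reused"] = False
    return state

if __name__ == "__main__":
    a = open_connection(("10.0.0.1", 43210), ("93.184.216.34", 443))
    b = open_connection(("10.0.0.1", 43210), ("93.184.216.34", 443))
    print(a["reused"], b["reused"])   # False True
```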
20. Concordia
- Author
-
Xenofon Foukas and Bozidar Radunovic
- Subjects
Radio access network ,Base station ,Computer science ,Reliability (computer networking) ,Operating system ,CPU time ,Central processing unit ,computer.software_genre ,Instruction cycle ,computer ,Edge computing ,Scheduling (computing) - Abstract
Virtualized Radio Access Network (vRAN) offers a cost-efficient solution for running the 5G RAN as a virtualized network function (VNF) on commodity hardware. The vRAN is more efficient than traditional RANs, as it multiplexes several base station workloads on the same compute hardware. Our measurements show that, whilst this multiplexing provides efficiency gains, more than 50% of the CPU cycles in typical vRAN settings still remain unused. A way to further improve CPU utilization is to collocate the vRAN with general-purpose workloads. However, to maintain performance, vRAN tasks have sub-millisecond latency requirements that have to be met 99.999% of times. We show that this is difficult to achieve with existing systems. We propose Concordia, a userspace deadline scheduling framework for the vRAN on Linux. Concordia builds prediction models using quantile decision trees to predict the worst case execution times of vRAN signal processing tasks. The Concordia scheduler is fast (runs every 20 us) and the prediction models are accurate, enabling the system to reserve a minimum number of cores required for vRAN tasks, leaving the rest for general-purpose workloads. We evaluate Concordia on a commercial-grade reference vRAN platform. We show that it meets the 99.999% reliability requirements and reclaims more than 70% of idle CPU cycles without affecting the RAN performance.
- Published
- 2021
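Result 20's two key steps, predicting worst-case task runtimes from a high quantile and reserving the minimum number of cores for vRAN work, can be approximated with the sketch below; the nearest-rank quantile stands in for the paper's quantile decision trees, and the slot length is an assumed value.

```python
import math

def quantile(samples, q):
    """Empirical quantile (nearest-rank); stands in for the quantile decision
    trees the paper trains offline."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, math.ceil(q * len(ordered)) - 1)
    return ordered[max(0, idx)]

def cores_to_reserve(task_samples_us, slot_us=500.0, q=0.99999):
    """Reserve the minimum number of cores such that the worst-case (high
    quantile) demand of all pending vRAN tasks fits in one scheduling slot.
    Leftover cores can run general-purpose workloads."""
    worst_case_total = sum(quantile(s, q) for s in task_samples_us)
    return math.ceil(worst_case_total / slot_us)

if __name__ == "__main__":
    # Three signal-processing tasks with measured runtimes (microseconds).
    tasks = [[120, 130, 128, 180], [60, 65, 70, 300], [200, 210, 205, 220]]
    print("cores reserved:", cores_to_reserve(tasks))
```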
21. Distributed Multi-Agent Empowered Resource Allocation in Deep Edge Networks
- Author
-
Yongkang Gong, Jingjing Wang, and Haipeng Yao
- Subjects
Optimization problem ,Computer science ,business.industry ,Distributed computing ,Resource allocation ,Wireless ,Overhead (computing) ,Energy consumption ,Enhanced Data Rates for GSM Evolution ,Instruction cycle ,business ,Scheduling (computing) - Abstract
The sixth-generation wireless communication networks (6G) are anticipated to bring disruptive innovation to multiple scenarios, where deep edge networks (DENs) become a vital network structure for vertical industrial paradigms, including the combination of communication, computing, and caching (3C). In this paper, we present the DENs scenario to facilitate the deep convergence of computing and communication resources. More specifically, we formulate an optimization problem in terms of energy consumption and latency in order to minimize the total agent overhead. At the same time, in order to execute tasks while alleviating interference among different edge networks in a highly dynamic network environment, we propose a CPU-cycle-frequency-aided multi-agent deep deterministic policy gradient (C-MADDPG) algorithm framework that optimizes task scheduling, transmission power, CPU cycle frequency, and mutual interference from multiple channels to obtain the optimal overhead. Finally, extensive simulation and experimental results demonstrate that the proposed C-MADDPG algorithm achieves better performance in terms of execution overhead for different network parameters.
- Published
- 2021
22. automemcpy: a framework for automatic generation of fundamental memory operations
- Author
-
Xinliang David Li, Sam Likun Xi, Chris Kennelly, Guillaume Chatelet, Clement Courbet, Bruno de Backer, and Ondrej Sykora
- Subjects
010302 applied physics ,Delegate ,Programming language ,Computer science ,02 engineering and technology ,computer.software_genre ,Supercomputer ,01 natural sciences ,Toolchain ,020202 computer hardware & architecture ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Constraint programming ,Compiler ,User interface ,Instruction cycle ,computer ,C standard library - Abstract
Memory manipulation primitives (memcpy, memset, memcmp) are used by virtually every application, from high performance computing to user interfaces. They often consume a significant portion of CPU cycles. Because they are so ubiquitous and critical, they are provided by language runtimes and in particular by libc, the C standard library. These implementations are heavily optimized, typically written in hand-tuned assembly for each target architecture. In this article, we propose a principled alternative to hand-tuning these functions: (1) we profile the calls to these functions in their production environment and use this data to drive the important high-level algorithmic decisions, (2) we use a high-level language for the implementation, delegate the job of tuning the generated code to the compiler, and (3) we use constraint programming and automatic benchmarks to select the optimal high-level structure of the functions. We compile our memfunctions implementations using the same compiler toolchain that we use for application code, which allows leveraging the compiler further by allowing whole-program optimization. We have evaluated our approach by applying it to the fleet of one of the largest computing enterprises in the world. This work increased the performance of the fleet by 1%.
- Published
- 2021
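The profile-driven selection in result 22 can be reduced to a toy dispatcher: a measured size histogram picks the small-copy cut-off, and the generated function branches on size. The strategies and thresholds below are invented for illustration and say nothing about the actual libc implementations.

```python
from collections import Counter

def choose_small_cutoff(size_histogram, coverage=0.95):
    """Smallest cutoff such that `coverage` of profiled calls are <= cutoff."""
    total = sum(size_histogram.values())
    running = 0
    for size in sorted(size_histogram):
        running += size_histogram[size]
        if running / total >= coverage:
            return size
    return max(size_histogram)

def make_memcpy(small_cutoff):
    """Build a toy three-way dispatch; in Python every branch is just slicing,
    but the labels mark where size-specialised code paths would go."""
    def memcpy(dst, src, n):
        if n <= small_cutoff:
            dst[:n] = src[:n]                    # "inlined small-copy" branch
        elif n <= 4096:
            dst[:n] = src[:n]                    # "vectorised medium" branch
        else:
            for off in range(0, n, 4096):        # "looped large-copy" branch
                end = min(off + 4096, n)
                dst[off:end] = src[off:end]
    return memcpy

if __name__ == "__main__":
    profile = Counter({1: 500, 8: 300, 32: 150, 256: 40, 65536: 10})
    cutoff = choose_small_cutoff(profile)
    print("small-copy cutoff from profile:", cutoff)
    dst, src = bytearray(100_000), bytes(range(256)) * 400
    make_memcpy(cutoff)(dst, src, 100_000)
    assert bytes(dst) == src[:100_000]
```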
23. Porting HEP Parameterized Calorimeter Simulation Code to GPUs
- Author
-
Zhihua Dong, Heather Gray, Charles Leggett, Meifeng Lin, Vincent R. Pascuzzi, and Kwangmin Yu
- Subjects
FOS: Computer and information sciences ,Big Data ,Computer science ,Physics::Instrumentation and Detectors ,large hadron collider ,FOS: Physical sciences ,CUDA ,02 engineering and technology ,Information technology ,Porting ,kokkos ,Computational science ,High Energy Physics - Experiment ,Software portability ,High Energy Physics - Experiment (hep-ex) ,Artificial Intelligence ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Computer Science (miscellaneous) ,particle physics ,Instruction cycle ,Original Research ,performance portability ,Large Hadron Collider ,Calorimeter (particle physics) ,Detector ,gpu ,Computational Physics (physics.comp-ph) ,Supercomputer ,T58.5-58.64 ,high performance computing ,Computer Science - Distributed, Parallel, and Cluster Computing ,020201 artificial intelligence & image processing ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Physics - Computational Physics ,Information Systems - Abstract
The High Energy Physics (HEP) experiments, such as those at the Large Hadron Collider (LHC), traditionally consume large amounts of CPU cycles for detector simulations and data analysis, but rarely use compute accelerators such as GPUs. As the LHC is upgraded to allow for higher luminosity, resulting in much higher data rates, purely relying on CPUs may not provide enough computing power to support the simulation and data analysis needs. As a proof of concept, we investigate the feasibility of porting a HEP parameterized calorimeter simulation code to GPUs. We have chosen to use FastCaloSim, the ATLAS fast parametrized calorimeter simulation. While FastCaloSim is sufficiently fast such that it does not impose a bottleneck in detector simulations overall, significant speed-ups in the processing of large samples can be achieved from GPU parallelization at both the particle (intra-event) and event levels; this is especially beneficial in conditions expected at the high-luminosity LHC, where extremely high per-event particle multiplicities will result from the many simultaneous proton-proton collisions. We report our experience with porting FastCaloSim to NVIDIA GPUs using CUDA. A preliminary Kokkos implementation of FastCaloSim for portability to other parallel architectures is also described.
- Published
- 2021
24. Triggering Adaptive Automation in Naval Command and Control
- Author
-
T. . De Greef and H.F.R. Arciszewski
- Subjects
Engineering ,Adaptive control ,business.industry ,Control theory ,Process (computing) ,Command and control ,Control engineering ,Workload ,business ,Instruction cycle ,Automation ,Task (project management) - Abstract
In many control domains (plant control, air traffic control, military command and control) humans are assisted by computer systems during their assessment of the situation and their subsequent decision making. As computer power increases and novel algorithms are developed, machines move slowly towards capabilities similar to those of humans, which in turn leads to an increased level of control being delegated to them. This technological push has led to innovative but at the same time complex systems that enable humans to work more efficiently and/or effectively. However, in these complex and information-rich environments, task demands can still exceed the cognitive resources of humans, leading to a state of overload due to fluctuations in tasks and the environment. Such a state is characterized by excessive demands on human cognitive capabilities, resulting in lowered efficiency, effectiveness, and/or satisfaction. More specifically, we focus on the human-machine adaptive process that attempts to cope with varying task and environmental demands. In the research field of adaptive control, an adaptive controller is a controller with adjustable parameters and a mechanism for adjusting the parameters (Astrom & Wittenmark, 1994, p. 1), used when the parameters of the system being controlled are slowly time-varying or uncertain. The classic example concerns an airplane whose mass decreases slowly during flight as fuel is consumed. More specifically, the controller being adjusted is the process that regulates the fuel intake, with thrust as output; the parameters of this process are adjusted as the airplane's mass decreases, so that less fuel is injected to yield the same speed. In a similar fashion, a human-machine ensemble can be considered an adaptive controller. In this case, human cognition is a slowly time-varying parameter, the adjustable parameters are the task sets that can be varied between human and machine, and the control mechanism is an algorithm that "has insight" into the workload of the human operator (i.e., an algorithm that monitors human workload). Human performance is reasonably optimal when the human's workload falls within certain margins; severe performance reductions result from a workload that is either too high or (perhaps surprisingly) too low. Consider a situation where the human-machine ensemble works in cooperation to control a process or situation. Both the human and the machine cycle through an information-processing loop: collecting data, interpreting the situation, deciding on actions to achieve one or more stated goals, and acting on the decisions (see, for example, Coram, 2002).
- Published
- 2021
25. HAGP: A Heuristic Algorithm Based on Greedy Policy for Task Offloading with Reliability of MDs in MEC of the Industrial Internet
- Author
-
Xing Huang, Liangyin Chen, Wang Wei, Guo Min, Lei Zhang, Bing Liang, and Yanbing Yang
- Subjects
Optimization problem ,task offloading ,Computer science ,Reliability (computer networking) ,Distributed computing ,TP1-1185 ,02 engineering and technology ,Biochemistry ,Article ,Analytical Chemistry ,Task (project management) ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Electrical and Electronic Engineering ,Instruction cycle ,Instrumentation ,Mobile edge computing ,reliability ,Industrial Internet ,Chemical technology ,020208 electrical & electronic engineering ,020206 networking & telecommunications ,Energy consumption ,Atomic and Molecular Physics, and Optics ,mobile edge computing (MEC) ,Mobile device ,optimization - Abstract
In the Industrial Internet, computing- and power-limited mobile devices (MDs) in the production process can hardly support computation-intensive or time-sensitive applications. As a new computing paradigm, mobile edge computing (MEC) can largely meet latency and computation requirements by handling tasks in close proximity to the MDs. However, the limited battery capacity of MDs causes unreliable task offloading in MEC, which increases system overhead and reduces the economic efficiency of manufacturing in actual production. To make the offloading scheme adaptive to such an uncertain mobile environment, this paper considers the reliability of MDs, defined as the residual energy after completing a computation task. In more detail, we first investigate task offloading in MEC with reliability as an important criterion. To optimize the system overhead caused by task offloading, we then construct mathematical models for two different computing modes, namely local computing and remote computing, and formulate task offloading as a mixed-integer non-linear programming (MINLP) problem. To solve the optimization problem effectively, we further propose a heuristic algorithm based on a greedy policy (HAGP). The algorithm obtains the optimal CPU cycle frequency for local computing and the optimal transmission power for remote computing by alternating optimization, and then makes the offloading decision for each MD with minimal system overhead across the two modes, following the greedy policy under the constraint of limited wireless channels. Finally, multiple simulation experiments verify the advantages of HAGP, and the results confirm that considering the task-offloading reliability of MDs can reduce system overhead and save energy, prolonging battery life and supporting more computation tasks.
- Published
- 2021
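A compact sketch of the greedy offloading decision in result 25 follows: each device compares a local-computing overhead against a remote-computing overhead and offloads only while channels remain and its residual energy (the reliability constraint) stays above a floor. The cost models and constants replace the paper's derived optimal frequency and power with fixed assumed values.

```python
from dataclasses import dataclass

@dataclass
class MobileDevice:
    name: str
    cycles: float        # CPU cycles the task needs
    data_bits: float     # bits to upload if offloaded
    battery_j: float     # residual energy

# Illustrative cost models (weighted sum of latency and energy); the paper
# derives the optimal CPU frequency and transmit power analytically, which we
# replace with fixed representative values here.
LOCAL_FREQ_HZ = 1.0e9
LOCAL_POWER_W = 0.9
UPLINK_BPS    = 2.0e6
TX_POWER_W    = 0.3
W_TIME, W_ENERGY = 0.5, 0.5

def local_overhead(md):
    t = md.cycles / LOCAL_FREQ_HZ
    return W_TIME * t + W_ENERGY * LOCAL_POWER_W * t, LOCAL_POWER_W * t

def remote_overhead(md):
    t = md.data_bits / UPLINK_BPS
    return W_TIME * t + W_ENERGY * TX_POWER_W * t, TX_POWER_W * t

def greedy_offload(devices, channels, min_residual_j=0.2):
    """Greedy policy: offload the devices that gain the most from offloading,
    as long as a wireless channel is free and the device keeps at least
    min_residual_j of battery (the 'reliability' constraint)."""
    plan = {}
    gains = sorted(devices, key=lambda d: remote_overhead(d)[0] - local_overhead(d)[0])
    for md in gains:
        r_cost, r_energy = remote_overhead(md)
        l_cost, _ = local_overhead(md)
        if channels > 0 and r_cost < l_cost and md.battery_j - r_energy >= min_residual_j:
            plan[md.name], channels = "offload", channels - 1
        else:
            plan[md.name] = "local"
    return plan

if __name__ == "__main__":
    mds = [MobileDevice("md1", 8e9, 1e6, 2.0),
           MobileDevice("md2", 2e9, 6e6, 1.0),
           MobileDevice("md3", 9e9, 2e6, 0.4)]   # md3 is low on battery
    print(greedy_offload(mds, channels=1))
```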
26. Autonomous NIC offloads
- Author
-
Dan Tsafrir, Aviad Yehezkel, Adam Morrison, Haggai Eran, Liran Liss, and Boris Pismenny
- Subjects
010302 applied physics ,Software_OPERATINGSYSTEMS ,computer.internet_protocol ,Network packet ,business.industry ,Computer science ,ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS ,Linux kernel ,Throughput ,02 engineering and technology ,Packet segmentation ,Encryption ,01 natural sciences ,020202 computer hardware & architecture ,Firewall (construction) ,Internet protocol suite ,Embedded system ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Instruction cycle ,business ,computer - Abstract
CPUs routinely offload to NICs network-related processing tasks like packet segmentation and checksum. NIC offloads are advantageous because they free valuable CPU cycles. But their applicability is typically limited to layer≤4 protocols (TCP and lower), and they are inapplicable to layer-5 protocols (L5Ps) that are built on top of TCP. This limitation is caused by a misfeature we call ”offload dependence,” which dictates that L5P offloading additionally requires offloading the underlying layer≤4 protocols and related functionality: TCP, IP, firewall, etc. The dependence of L5P offloading hinders innovation, because it implies hard-wiring the complicated, ever-changing implementation of the lower-level protocols. We propose ”autonomous NIC offloads,” which eliminate offload dependence. Autonomous offloads provide a lightweight software-device architecture that accelerates L5Ps without having to migrate the entire layer≤4 TCP/IP stack into the NIC. A main challenge that autonomous offloads address is coping with out-of-sequence packets. We implement autonomous offloads for two L5Ps: (i) NVMe-over-TCP zero-copy and CRC computation, and (ii) https authentication, encryption, and decryption. Our autonomous offloads increase throughput by up to 3.3x, and they deliver CPU consumption and latency that are as low as 0.4x and 0.7x, respectively. Their implementation is already upstreamed in the Linux kernel, and they will be supported in the next-generation of Mellanox NICs.
- Published
- 2021
27. ALICE Connex: A volunteer computing platform for the Time-Of-Flight calibration of the ALICE experiment. An opportunistic use of CPU cycles on Android devices
- Author
-
Khajonpong Akkarajitsakul, Patcharaporn Jenviriyakul, Filippo Costa, Gantaphon Chalumporn, and Tiranee Achalakul
- Subjects
Large Hadron Collider ,Computer Networks and Communications ,Computer science ,Mobile computing ,020206 networking & telecommunications ,Data_CODINGANDINFORMATIONTHEORY ,02 engineering and technology ,computer.software_genre ,Hardware and Architecture ,Volunteer computing ,Middleware ,ComputingMilieux_COMPUTERSANDEDUCATION ,0202 electrical engineering, electronic engineering, information engineering ,Operating system ,020201 artificial intelligence & image processing ,Android (operating system) ,Instruction cycle ,computer ,Mobile device ,Software - Abstract
In this paper, we propose ALICE Connex, a prototype of a volunteer mobile computing platform for the ALICE experiment. The platform adopts the “volunteer computing” concept on mobile devices. Untapped computing power of smartphones can be aggregated and exploited to help in the calibration of the ALICE’s Time-of-Flight (TOF) particle detector. ALICE Connex is built based on the Berkeley Open Infrastructure for Network Computing or BOINC, which is a well-known volunteer computing middleware. In addition, ALICE Connex will offer an outreach service to connect the ALICE experiment to the general public by a way of edutainment and volunteer computing allowing them to be a part of the big experiments at CERN.
- Published
- 2019
28. Enhancing Speculative Execution With Selective Approximate Computing
- Author
-
Moumita Das, Bernard Nongpoh, Rajarshi Ray, and Ansuman Banerjee
- Subjects
Computer science ,Pipeline (computing) ,Reliability (computer networking) ,Speculative execution ,Probabilistic logic ,020207 software engineering ,02 engineering and technology ,Parallel computing ,Branch predictor ,Computer Graphics and Computer-Aided Design ,020202 computer hardware & architecture ,Computer Science Applications ,Parsec ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Instruction cycle ,Rollback - Abstract
Speculative execution is an optimization technique used in modern processors by which predicted instructions are executed in advance with an objective of overlapping the latencies of slow operations. Branch prediction and load value speculation are examples of speculative execution used in modern pipelined processors to avoid execution stalls. However, speculative executions incur a performance penalty as an execution rollback when there is a misprediction. In this work, we propose to aid speculative execution with approximate computing by relaxing the execution rollback penalty associated with a misprediction. We propose a sensitivity analysis method for data and branches in a program to identify the data load and branch instructions that can be executed without any rollback in the pipeline and yet can ensure a certain user-specified quality of service of the application with a probabilistic reliability. Our analysis is based on statistical methods, particularly hypothesis testing and Bayesian analysis. We perform an architectural simulation of our proposed approximate execution and report the benefits in terms of CPU cycles and energy utilization on selected applications from the AxBench, ACCEPT, and Parsec 3.0 benchmarks suite.
- Published
- 2019
29. Handling Limited Resources in Mobile Computing via Closed-Loop Approximate Computations
- Author
-
Dario Pompili and Parul Pandey
- Subjects
Reduction (complexity) ,Ubiquitous computing ,Computational Theory and Mathematics ,Computer science ,Distributed computing ,Testbed ,Mobile computing ,Approximation algorithm ,Energy consumption ,Instruction cycle ,Software ,Energy (signal processing) ,Computer Science Applications - Abstract
Mobile computing is one of the largest untapped reservoirs in today's pervasive computing world as it has the potential to enable a variety of in situ, real-time applications. Yet, this computing paradigm suffers when the available resources—such as energy in the network, CPU cycles, memory, and I/O data rate—are limited. In this paper, the new paradigm of approximate computing is proposed to harness such potential and to enable energy-intensive mobile applications. A reduction in time and energy consumed by an application is obtained via closed-loop approximate computing by decreasing the amount of computation needed; such improvement, however, comes with the potential loss in accuracy. We present our framework on how approximation can be applied at different levels, namely, at the application level and the input data level. The effectiveness of the proposed closed-loop approach is validated both through extensive simulation and testbed experiments by comparing approximate versus exact-computation performance.
- Published
- 2019
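The closed-loop control of approximation described in result 29 can be sketched as a simple feedback loop over an application-specific knob; the callbacks, iteration count, and thresholds below are placeholders for whatever profile the application exposes.

```python
def closed_loop_approximation(run_once, accuracy_of, budget_j,
                              levels=(0, 1, 2, 3), target_accuracy=0.9):
    """Closed-loop knob tuning in the spirit of the abstract: raise the
    approximation level (drop more computation) while the energy spent per run
    exceeds the budget, and lower it again if accuracy falls below target.
    `run_once(level)` must return (result, energy_joules); both callbacks and
    all constants are placeholders for an application-specific profile."""
    level = 0
    for _ in range(10):                       # a few control iterations
        result, energy = run_once(level)
        if energy > budget_j and level < max(levels):
            level += 1                        # too expensive: approximate more
        elif accuracy_of(result) < target_accuracy and level > min(levels):
            level -= 1                        # too sloppy: approximate less
        else:
            break
    return level

if __name__ == "__main__":
    # Toy application: a higher level skips more work, costing accuracy.
    def run_once(level):
        return {"level": level}, 4.0 / (level + 1)     # joules per run
    def accuracy_of(result):
        return 1.0 - 0.05 * result["level"]
    print("chosen level:", closed_loop_approximation(run_once, accuracy_of, budget_j=1.5))
```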
30. Power Trust: Energy Auditing Aware Trust-Based System to Detect Security Attacks in IoT
- Author
-
P. Subhash, K. Samrat Surya, and Gollapudi Ramesh Chandra
- Subjects
Signal processing ,Matching (statistics) ,Measure (data warehouse) ,Computer science ,business.industry ,Node (networking) ,020208 electrical & electronic engineering ,020206 networking & telecommunications ,02 engineering and technology ,Computer security ,computer.software_genre ,0202 electrical engineering, electronic engineering, information engineering ,Wireless ,Network performance ,Electricity ,business ,Instruction cycle ,computer - Abstract
The Internet of Things (IoT) has become a part of our daily life; IoT devices provide information, such as sensory data, that can help us drive a vehicle with ease, manage electricity in homes, deliver health care, and much more. In all these scenarios, the information is generally log data used to measure certain parameters and derive solutions matched to the application. As this information is very sensitive, there is a risk of a device being captured and its data tampered with, causing severe performance degradation of the network. Attacks on these devices fall mainly into two types: physical attacks, which may result from accidents or intentional damage to the device, and cyber-attacks, such as tampering with information, denial-of-service, and device compromise. In this paper, we propose a Power Trust mechanism that assigns a trust value to each node of the network based on energy auditing. The energy auditing is done with reference to CPU cycles, data received, data processed, data sent, network performance, and power consumption. Using the energy-auditing model, we dynamically calculate the trust value of every node in the network and predict physical and cyber-attacks. With the method introduced, IoT devices can be better monitored and secured.
- Published
- 2021
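Result 30's energy-auditing trust computation might be organised as below: an expected-energy model is compared with the measured draw, and the deviation feeds a smoothed trust value. All calibration constants and field names are fabricated for illustration.

```python
def expected_energy_mj(cpu_cycles, bytes_rx, bytes_tx,
                       e_per_cycle=2e-6, e_per_byte_rx=0.004, e_per_byte_tx=0.006):
    """Energy-audit model: what the node *should* have drawn for the work it
    reports. Per-cycle and per-byte costs are fabricated calibration values."""
    return cpu_cycles * e_per_cycle + bytes_rx * e_per_byte_rx + bytes_tx * e_per_byte_tx

def trust_value(node, prev_trust=1.0, alpha=0.7, tolerance=0.25):
    """Update a node's trust from how far its measured energy deviates from the
    audited expectation; large deviations (hidden extra work, exfiltration
    traffic, tampered counters) pull trust down, normal behaviour restores it."""
    expected = expected_energy_mj(node["cpu_cycles"], node["bytes_rx"], node["bytes_tx"])
    deviation = abs(node["measured_energy_mj"] - expected) / max(expected, 1e-9)
    sample_trust = 1.0 if deviation <= tolerance else max(0.0, 1.0 - deviation)
    return alpha * prev_trust + (1 - alpha) * sample_trust

if __name__ == "__main__":
    honest = {"cpu_cycles": 5e6, "bytes_rx": 2000, "bytes_tx": 1500,
              "measured_energy_mj": 28.0}
    compromised = dict(honest, measured_energy_mj=95.0)   # doing hidden extra work
    print("honest trust:     ", round(trust_value(honest), 3))
    print("compromised trust:", round(trust_value(compromised), 3))
```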
31. GPUKV
- Author
-
Jungki Noh, Woosuk Chung, Donggyu Park, Kyoung Hwan Park, Youngjae Kim, Min-Gyo Jung, Sungyong Park, and Chang-Gyu Lee
- Subjects
File system ,Computer science ,020207 software engineering ,02 engineering and technology ,computer.software_genre ,Kernel (linear algebra) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Operating system ,Overhead (computing) ,General-purpose computing on graphics processing units ,Instruction cycle ,Host (network) ,computer ,Abstraction (linguistics) ,PCI Express - Abstract
When data is loaded from a key-value store to the GPU in a conventional GPU-driven computing model, it entails the overhead of all the heavy I/O stacks of the key-value store and file system. This paper presents GPUKV, a GPU-driven computing framework that eliminates the aforementioned overhead with less host-side usage of resources such as CPU and memory. GPUKV has the following three features: (i) GPUKV provides a key-value store abstraction to the GPU; (ii) In GPUKV, when loading data from the key-value store to the GPU, it is performed through PCIe peer-to-peer (P2P) communication without copying to the user and kernel space memory; and (iii) GPUKV uses KVSSD, which implements a key-value store inside an SSD, completely eliminating the interaction with the key-value store and file system for P2P communication. We have developed GPUKV with a KVSSD implemented on the Cosmos+ OpenSSD platform in a Linux environment. Our extensive evaluations demonstrate that GPUKV improves execution time by up to 18.7 times and reduces host CPU cycle usage by up to 175 times compared to conventional CPU-based GPU computing models.
- Published
- 2021
32. Dynamic Mobile Edge Computing empowered with Reconfigurable Intelligent Surfaces
- Author
-
Paolo Di Lorenzo, Emilio Calvanese Strinati, and Mattia Merluzzi
- Subjects
Mobile edge computing ,Dynamic problem ,Computer science ,Distributed computing ,Computation offloading ,Resource allocation ,Context (language use) ,Stochastic optimization ,Instruction cycle ,5G - Abstract
The goal of this work is to propose a novel algorithm for energy-efficient, low-latency dynamic computation offloading in mobile edge computing (MEC), in the context of 5G (and beyond) networks endowed with Reconfigurable Intelligent Surfaces (RISs). In our setting, new requests for computations are continuously generated at each user, and are handled through a dynamic queueing system. Building on stochastic optimization tools, we devise a dynamic algorithm that jointly optimizes radio resources (i.e., power, rates), computation resources (i.e., CPU cycles), and RIS reflectivity parameters (i.e., phase shifts), while guaranteeing a target performance in terms of average end-to-end delay. The proposed strategy is dynamic, since it performs a low-complexity optimization on a per-slot basis while dealing with time-varying radio channels and task arrivals, whose statistics are unknown a priori. Numerical results corroborate the benefits of our strategy in the context of RIS-empowered MEC systems.
- Published
- 2021
33. Kreon: An Efficient Memory-Mapped Key-Value Store for Flash Storage
- Author
-
Pilar González-Férez, Anastasios Papagiannis, Giorgos Kalaentzis, Angelos Bilas, Giorgos Saloustros, and Giorgos Xanthakis
- Subjects
Computer science ,Copy-on-write ,Sorting ,020206 networking & telecommunications ,02 engineering and technology ,Parallel computing ,Memory-mapped I/O ,Data processing system ,Data access ,Hardware and Architecture ,020204 information systems ,Server ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,Instruction cycle - Abstract
Persistent key-value stores have emerged as a main component in the data access path of modern data processing systems. However, they exhibit high CPU and I/O overhead. Nowadays, due to power limitations, it is important to reduce CPU overheads for data processing. In this article, we propose Kreon , a key-value store that targets servers with flash-based storage, where CPU overhead and I/O amplification are more significant bottlenecks compared to I/O randomness. We first observe that two significant sources of overhead in key-value stores are: (a) The use of compaction in Log-Structured Merge-Trees (LSM-Tree) that constantly perform merging and sorting of large data segments and (b) the use of an I/O cache to access devices, which incurs overhead even for data that reside in memory. To avoid these, Kreon performs data movement from level to level by using partial reorganization instead of full data reorganization via the use of a full index per-level. Kreon uses memory-mapped I/O via a custom kernel path to avoid a user-space cache. For a large dataset, Kreon reduces CPU cycles/op by up to 5.8×, reduces I/O amplification for inserts by up to 4.61×, and increases insert ops/s by up to 5.3×, compared to RocksDB.
- Published
- 2021
34. Low Complexity ECG Biometric Authentication for IoT Edge Devices
- Author
-
Deepu John, Avishek Nag, and Guoxin Wang
- Subjects
Authentication ,Biometrics ,Edge device ,Computer science ,business.industry ,Deep learning ,Real-time computing ,Wearable computer ,Enhanced Data Rates for GSM Evolution ,Artificial intelligence ,business ,Instruction cycle ,Convolutional neural network - Abstract
Wearable Internet of Things (IoT) devices are becoming ubiquitous for continuous physiological data acquisition and health monitoring. This paper investigates an electrocardiogram (ECG) based biometric user authentication technique for IoT edge devices. A convolutional neural network (CNN) based deep learning technique for user authentication is proposed. The proposed technique achieves an authentication accuracy of 99.63% when tested with 290 subjects from the Physionet PTB ECG database. To limit the complexity of the technique for IoT edge nodes, we applied optimisation techniques such as binarisation and approximation of the CNN weights. An accuracy-versus-time-complexity trade-off analysis is performed, and results are presented for the different optimisations. Our evaluation shows that the complexity-optimised method achieves 98.88% authentication accuracy with an acceptable number of CPU cycles consumed.
- Published
- 2020
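The weight binarisation mentioned in result 34 can be illustrated with a generic XNOR-net-style sketch on a toy 1-D convolution; this shows the technique class only, not the authors' exact network or optimisation.

```python
import numpy as np

def binarise(weights):
    """XNOR-net style weight binarisation: a filter is replaced by its sign
    pattern times one scaling factor, so convolutions reduce to additions and
    subtractions instead of full multiplications -- the kind of complexity
    reduction the abstract applies to its CNN for edge devices."""
    alpha = np.abs(weights).mean()
    return alpha * np.sign(weights)

def conv1d_valid(signal, kernel):
    """Plain 'valid' 1-D convolution used as a stand-in for one CNN layer."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ecg_window = rng.standard_normal(64)        # pretend ECG samples
    weights = rng.standard_normal(8) * 0.1      # a learned 1-D filter
    full = conv1d_valid(ecg_window, weights)
    approx = conv1d_valid(ecg_window, binarise(weights))
    err = np.linalg.norm(full - approx) / np.linalg.norm(full)
    print(f"relative output error from binarised weights: {err:.2%}")
```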
35. Roadblocks of I/O Parallelization: Removing H/W Contentions by Static Role Assignment in VNFs
- Author
-
Tsunemasa Hayashi, Hiroki Nakayama, Ryota Kawashima, Masahiro Asada, and Hiroshi Matsuo
- Subjects
0303 health sciences ,Computer science ,020206 networking & telecommunications ,02 engineering and technology ,Parallel computing ,Program optimization ,Bottleneck ,Instruction set ,03 medical and health sciences ,Server ,Datapath ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Instruction cycle ,Throughput (business) ,030304 developmental biology - Abstract
Achieving 100 Gbps+ throughput with commodity servers is a challenging goal, even with the state-of-the-art Data Plane Development Kit (DPDK). The fundamental performance of CPU and memory is now the bottleneck, and simple code optimization of Network Functions (NFs) is not a solution. Hardware accelerators, including FPGAs, are attracting attention for performance boosts; however, relying on specific features degrades the manageability of NFV nodes. Common Receive Side Scaling (RSS) provides a means of H/W-level parallelization, but per-flow throughput is not accelerated. Existing software-based approaches distribute the processing load of NFs, but I/O is still serialized for each datapath. In our previous study, we tackled I/O parallelization and encountered puzzling contentions. Specifically, per-thread CPU cycle consumption grew proportionally with the parallelization level, although the overhead of conceivable mutual exclusion (e.g., CAS operations) was trivial. In this paper, we pursue the cause of this issue and upgrade our I/O parallelization scheme. Our careful investigation of NFV-node internals, ranging from the application to the device driver layer, indicates that hidden H/W-level contentions involving DMA heavily consume CPU cycles. We propose a contention-avoidance design based on thread role assignment and show that our design flattens per-thread CPU cycle consumption.
- Published
- 2020
36. Hybrid Job Scheduling in Distributed Systems based on Clone Detection
- Author
-
Uddalok Sen, Madhulina Sarkar, and Nandini Mukherjee
- Subjects
Scheme (programming language) ,Job scheduler ,050101 languages & linguistics ,ComputingMilieux_THECOMPUTINGPROFESSION ,Computer science ,Distributed computing ,05 social sciences ,02 engineering and technology ,Dynamic priority scheduling ,computer.software_genre ,Scheduling (computing) ,Set (abstract data type) ,Resource (project management) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,0501 psychology and cognitive sciences ,Resource management ,Instruction cycle ,computer ,computer.programming_language - Abstract
In order to propose an efficient scheduling policy for a large distributed heterogeneous environment, the resource requirements of newly submitted jobs should be predicted prior to their execution. An execution history can be maintained to store the execution profile of all jobs executed earlier on a given set of resources. The execution history stores the actual CPU cycles consumed by each job as well as the details of the resource on which it executed. A feedback-guided job-modeling scheme can be used to detect similarity between newly submitted jobs and previously executed jobs, and to predict resource requirements based on this similarity. However, efficient resource scheduling based on this knowledge has not been dealt with. In this paper, we propose a hybrid scheduling policy for new, mutually independent jobs based on their similarity with history jobs. Here we focus only on exact clone jobs, i.e., jobs whose identical counterpart is found in the execution history, so that the predicted resource consumption equals the actual resource consumption (a simplified clone-lookup sketch follows this record). We also endeavor to deal with two conflicting parameters, namely execution cost and makespan of jobs. A comparison with other existing algorithms is also presented in this paper.
- Published
- 2020
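The exact-clone case described in the entry above can be pictured as a lookup of a new job's signature in an execution history, reusing the recorded CPU-cycle consumption as the prediction when placing the job. The sketch below is a simplified illustration, not the authors' scheduler; the signature format, node attributes, and placement rule are assumptions.

```python
from dataclasses import dataclass

@dataclass
class HistoryRecord:
    cpu_cycles: int        # actual CPU cycles consumed by the past run
    resource_id: str       # resource on which the job executed

# Execution history keyed by a job "signature" (assumed: executable name + input size).
history = {
    ("blast", 1_000_000): HistoryRecord(cpu_cycles=8_200_000_000, resource_id="node-03"),
    ("ffmpeg", 250_000):  HistoryRecord(cpu_cycles=1_100_000_000, resource_id="node-07"),
}

def predict_cycles(signature):
    """Exact-clone prediction: if an identical job exists in the history,
    its recorded consumption is reused directly; otherwise no prediction."""
    rec = history.get(signature)
    return rec.cpu_cycles if rec else None

def schedule(signature, nodes):
    """Place clone jobs by predicted runtime (cycles / node speed);
    fall back to the least-loaded node for jobs with no history match."""
    cycles = predict_cycles(signature)
    if cycles is None:
        return min(nodes, key=lambda n: nodes[n]["load"])
    return min(nodes, key=lambda n: cycles / nodes[n]["speed_hz"])

if __name__ == "__main__":
    nodes = {"node-03": {"speed_hz": 2.4e9, "load": 0.1},
             "node-07": {"speed_hz": 3.0e9, "load": 0.8}}
    print(schedule(("blast", 1_000_000), nodes))   # clone: placed by predicted runtime
    print(schedule(("new-app", 42), nodes))        # unknown job: least-loaded node
```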
37. Coded Access Architectures for Dense Memory Systems
- Author
-
Ethan R. Elenberg, Sriram Vishwanath, Matthew Edwards, Hardik Jain, and Ankit Singh Rawat
- Subjects
Hardware_MEMORYSTRUCTURES ,Computer science ,020206 networking & telecommunications ,02 engineering and technology ,Coding theory ,High Bandwidth Memory ,Memory controller ,020202 computer hardware & architecture ,Memory bank ,Computer architecture ,0202 electrical engineering, electronic engineering, information engineering ,Double data rate ,Instruction cycle ,Dram ,Block (data storage) - Abstract
We explore the use of coding theory in double data rate (DDR) and high bandwidth memory (HBM) systems. Modern DDR systems incur large latencies due to contention of multiple requests on the same memory bank. Our proposed memory design stores data across DRAM pages in a redundant manner using Reed-Solomon codes as a building block. A memory controller then assigns coded versions of the data to dedicated parity banks. This multi-bank coding scheme creates multiple ways to retrieve a given data element and allows for more efficient scheduling in multi-core memory architectures such as HBM (a simplified parity-bank sketch follows this record). Our approach differs from conventional, uncoded systems, which only optimize the timing of each incoming memory request. We implement our proposed memory design on an HBM DRAM architecture via the Ramulator simulation platform. Experimental results show that multi-bank coding reduces the number of contended memory accesses, and thus the overall latency, for several standard benchmarks. Our design reduces the number of CPU cycles in some scenarios by over 50% compared to an uncoded baseline.
- Published
- 2020
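To illustrate how a parity bank creates an alternative retrieval path as described in the entry above, the toy model below uses a single XOR parity bank rather than the Reed-Solomon codes used in the paper: a read for a contended bank can be served either directly or by reconstructing the element from the other data banks plus the parity bank. Bank layout and the contention model are illustrative assumptions.

```python
from functools import reduce

class CodedMemory:
    """Toy model: k data banks plus one XOR parity bank.

    A read for bank i can be served either directly or by XOR-ing the
    other data banks with the parity bank, which gives the scheduler a
    second path when bank i is busy.
    """
    def __init__(self, banks):
        self.banks = [list(b) for b in banks]
        self.parity = [reduce(lambda a, b: a ^ b, col) for col in zip(*banks)]

    def read(self, bank, row, busy_banks=frozenset()):
        if bank not in busy_banks:
            return self.banks[bank][row]          # direct access
        # Degraded read: reconstruct from the other banks and the parity bank.
        others = [self.banks[i][row] for i in range(len(self.banks)) if i != bank]
        return reduce(lambda a, b: a ^ b, others, self.parity[row])

if __name__ == "__main__":
    mem = CodedMemory([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    assert mem.read(1, 2) == 6                    # bank 1 idle: direct read
    assert mem.read(1, 2, busy_banks={1}) == 6    # bank 1 busy: recovered via parity
    print("both paths return the same data")
```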
38. GPU Direct I/O with HDF5
- Author
-
Suren Byna, Quincey Koziol, and John Ravi
- Subjects
File system ,Parallel processing (DSP implementation) ,Computer science ,Transfer (computing) ,Message passing ,Operating system ,computer.file_format ,Central processing unit ,Hierarchical Data Format ,computer.software_genre ,Instruction cycle ,computer ,Host (network) - Abstract
Exascale HPC systems are being designed with accelerators, such as GPUs, to accelerate parts of applications. In machine learning workloads, as well as large-scale simulations that use GPUs as accelerators, the CPU (or host) memory is currently used as a buffer for data transfers between GPU (or device) memory and the file system. If the CPU does not need to operate on the data, this is sub-optimal because it wastes host memory by reserving space for duplicated data. Furthermore, this “bounce buffer” approach wastes CPU cycles on transferring data. A new technique, NVIDIA GPUDirect Storage (GDS), can eliminate the need to use the host memory as a bounce buffer, making it possible to transfer data directly between the device memory and the file system. This direct data path shortens latency by omitting the extra copy and enables higher bandwidth. To take full advantage of GDS in existing applications, it is necessary to provide support in existing I/O libraries, such as HDF5 and MPI-IO, which are heavily used in applications. In this paper, we describe our effort to integrate GDS with HDF5, the top I/O library at NERSC and at DOE leadership computing facilities. We design and implement this integration using an HDF5 Virtual File Driver (VFD). The GDS VFD provides a file system abstraction to the application that allows HDF5 applications to perform I/O without the need to move data between CPUs and GPUs explicitly. We compare the performance of the HDF5 GDS VFD with explicit data movement approaches and demonstrate superior performance with the GDS method.
- Published
- 2020
39. On-Chip Intelligent Dynamic Frequency Scaling for Real-Time Systems
- Author
-
Prachi Sharma, Arkid Kalyan Bera, and Anu Gupta
- Subjects
Computer science ,business.industry ,Power consumption ,Embedded system ,Dynamic frequency scaling ,Operating frequency ,Latency (engineering) ,Instruction cycle ,business ,Frequency scaling ,Performance per watt ,Voltage - Abstract
To curb redundant power consumption in portable embedded and real-time applications, processors are equipped with various Dynamic Voltage and Frequency Scaling (DVFS) techniques. The accuracy with which any such technique predicts the operating frequency determines how power-efficient it makes a processor across a variety of programs and users. In recent techniques, however, the focus has been too much on saving power, thus ignoring the user-satisfaction metric, i.e., performance. The DVFS technique used to save power in turn introduces unwanted latency due to the high complexity of the algorithm. Also, many modern DVFS techniques rely on feedback manually triggered by the user to change the frequency and conserve energy, further increasing the reaction time. In this paper, we implement a novel Artificial Neural Network-driven frequency scaling methodology, which makes it possible to save power and boost performance at the same time, implicitly, i.e., without any feedback from the user (a toy frequency-prediction sketch follows this record). To make the system more inclusive with respect to the kinds of processes run on it, we trained the ANN not only on CPU-intensive programs but also on those that are more memory-bound, i.e., that make frequent memory accesses during their average CPU cycle. The proposed technique has been evaluated on an Intel i7-4720HQ Haswell processor and has shown a performance boost of up to 20%, SoC power savings of up to 16%, and a Performance per Watt improvement of up to 30%, compared to the existing DVFS technique. The open-source memory-intensive benchmark suite MiBench was used to verify the utility of the suggested technique.
- Published
- 2020
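The idea in the entry above, predicting an operating frequency from runtime counters with a small neural network, can be sketched as follows. This is illustrative only: the features (instructions per cycle and cache-miss rate), the synthetic labels, and the available P-states are all assumptions, not the authors' ANN or training data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic training data (assumed features: IPC and cache-miss rate sampled
# from performance counters; target: a frequency in GHz).
rng = np.random.default_rng(1)
ipc = rng.uniform(0.2, 3.0, 500)
miss_rate = rng.uniform(0.0, 0.3, 500)
# Heuristic label: CPU-bound phases (high IPC, few misses) get high frequency;
# memory-bound phases can run slower with little performance loss.
freq_ghz = 1.0 + 2.5 * (ipc / 3.0) * (1.0 - miss_rate / 0.3)

model = MLPRegressor(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0)
model.fit(np.column_stack([ipc, miss_rate]), freq_ghz)

def pick_frequency(sample_ipc, sample_miss_rate, available_ghz=(1.0, 1.8, 2.6, 3.4)):
    """Predict a target frequency and snap it to the nearest available P-state."""
    pred = model.predict([[sample_ipc, sample_miss_rate]])[0]
    return min(available_ghz, key=lambda f: abs(f - pred))

print(pick_frequency(2.8, 0.02))   # CPU-bound phase -> high frequency
print(pick_frequency(0.5, 0.25))   # memory-bound phase -> lower frequency
```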
40. Assembly programming optimization based on TS201.
- Author
-
Shuting Guo and Hongchun Hu
- Abstract
Optimizing the performance of DSP-based programs is essential for efficient execution. Several software optimization measures for assembly code on the TS201 are compared and analyzed, including avoiding or exploiting pipeline delays, SIMD techniques, software pipelining, and loop unrolling. An example shows that execution efficiency is improved by 49.3%. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
41. Efficient hybrid polling for ultra-low latency storage devices
- Author
-
Seokha Shin, Jinkyu Jeong, and Gyu-Sun Lee
- Subjects
Software_OPERATINGSYSTEMS ,Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,CPU time ,Linux kernel ,Hardware and Architecture ,Embedded system ,Central processing unit ,Timer ,Latency (engineering) ,Polling ,business ,Instruction cycle ,Software - Abstract
With the introduction of ultra-low latency SSDs, which complete I/O operations in a few microseconds, polling is becoming an attractive way to alleviate the overheads of interrupt-driven I/O completion. However, careful use of polling is essential because of its inherent CPU overhead. Hybrid polling, in which a timer-based sleep is inserted in the middle of polling, has recently been proposed to relieve the CPU overhead, but there is still substantial headroom for further optimization to save CPU cycles. In this paper, we propose an efficient hybrid polling scheme that minimizes the CPU cycles spent polling without sacrificing I/O latency (a minimal sketch of the sleep-then-poll pattern follows this record). By considering the I/O time characteristics of idle and busy storage devices, our scheme chooses a sleep time that maximizes I/O performance and minimizes the CPU cycles spent polling. The proposed scheme is implemented in the Linux kernel and evaluated with various I/O workloads. The evaluation results show that, whether an SSD is heavily or lightly loaded, our scheme achieves I/O latency identical to that of classical polling while maintaining low CPU utilization. Compared to the original hybrid polling, our scheme reduces CPU utilization by 5%–40% and provides I/O latency faster by up to 10%.
- Published
- 2022
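The core of the hybrid polling pattern in the entry above, sleep for part of the expected device latency and then busy-poll for the remainder, can be sketched in user space as below. The timings and the completion source are simulated stand-ins, not the kernel implementation evaluated in the paper.

```python
import threading
import time

def hybrid_wait(done_event, expected_latency_s, sleep_fraction=0.5):
    """Hybrid polling: sleep for part of the expected I/O time, then busy-poll.

    Sleeping first releases the CPU for most of the I/O; the final busy-poll
    keeps completion latency close to that of pure polling.
    """
    time.sleep(expected_latency_s * sleep_fraction)   # timer-based sleep phase
    polls = 0
    while not done_event.is_set():                    # polling phase
        polls += 1
    return polls

if __name__ == "__main__":
    done = threading.Event()
    device_latency = 100e-6                           # 100 us "device" (assumed)

    threading.Timer(device_latency, done.set).start() # simulated I/O completion
    start = time.perf_counter()
    polls = hybrid_wait(done, expected_latency_s=device_latency)
    elapsed = time.perf_counter() - start
    print(f"completed after {elapsed*1e6:.0f} us, {polls} poll iterations")
```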
42. Scalable Coordination of Hierarchical Parallelism
- Author
-
Vinay Devadas and Matthew Curtis-Maury
- Subjects
File system ,Hierarchy (mathematics) ,Computer science ,Distributed computing ,Multiprocessing ,Thread (computing) ,computer.software_genre ,Hierarchical database model ,Idle ,Scalability ,Synchronization (computer science) ,Overhead (computing) ,Instruction cycle ,computer - Abstract
Given continually increasing core counts, multiprocessor software scaling is critical. Applications that operate on hierarchical data are especially difficult to parallelize efficiently. In such applications, correct execution relies on all threads coordinating their accesses within the hierarchy. At the same time, high-performance execution requires that this coordination be efficient and that it maximize parallelism. In this paper, we identify two key scalability bottlenecks in the coordination of hierarchical parallelism by studying the hierarchical data partitioning framework within the NetApp® WAFL® file system. We first observe that the global synchronization required to enforce the hierarchical constraints limits performance at increased core counts. We thus propose a distributed architecture, called Scheduler Pools, that divides the hierarchy into disjoint subhierarchies that can be managed independently in the common case, thereby reducing coordination overhead. We next observe that periodically draining all in-flight operations in order to facilitate the execution of coarse-grained operations in the hierarchy results in an excess of idle CPU cycles. To address this issue, we propose a new scheme, called Hierarchy-Aware Draining, that minimizes wasted CPU cycles by draining only those regions of the hierarchy that are required to execute the desired operation (a toy comparison of the two draining strategies follows this record). When implemented together in the context of WAFL, Scheduler Pools and Hierarchy-Aware Draining overcome the observed scalability bottlenecks. Our evaluation with a range of benchmarks on high-end storage systems shows throughput gains of up to 33% and reductions in latency of up to 64%.
- Published
- 2020
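The contrast drawn in the entry above, between draining all in-flight operations and draining only the affected subhierarchy, might be pictured with the toy model below. It is not the WAFL implementation; the node names and hierarchy shape are made up.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.in_flight = 0          # operations currently executing in this partition

    def subtree(self):
        yield self
        for child in self.children:
            yield from child.subtree()

def drain_all(root):
    """Global drain: wait for every partition with in-flight work, even
    partitions unrelated to the coarse-grained operation (wasted idle cycles)."""
    return [n.name for n in root.subtree() if n.in_flight > 0]

def drain_hierarchy_aware(target):
    """Hierarchy-aware drain: only the target node's subtree must quiesce."""
    return [n.name for n in target.subtree() if n.in_flight > 0]

if __name__ == "__main__":
    leaves = [Node(f"vol{i}") for i in range(4)]
    aggr0, aggr1 = Node("aggr0", leaves[:2]), Node("aggr1", leaves[2:])
    root = Node("root", [aggr0, aggr1])
    leaves[0].in_flight, leaves[3].in_flight = 5, 2

    print("global drain waits on:", drain_all(root))                       # vol0 and vol3
    print("hierarchy-aware drain of aggr0 waits on:", drain_hierarchy_aware(aggr0))  # vol0 only
```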
43. The Optimal Initial Buffer and Cycle Time Design for Improving Lean Production Automation Line Efficiency
- Author
-
Duong Vu Xuan Quynh, Fumio Kojima, Chawalit Jeenanunta, and Masahiro Nakamura
- Subjects
Production line ,Optimal design ,Downtime ,Computer science ,business.industry ,Production (economics) ,Line (text file) ,Process engineering ,business ,Instruction cycle ,Lean manufacturing ,Automation - Abstract
Optimizing the size and location of storage buffers between machines is a critical issue in designing an efficient production automation line. The objective of this study is to propose an optimal design of the initial buffer sizes, together with suitable machine cycle times, to increase the efficiency of the automation line. The study considers a production system consisting of three machines with two buffers placed between them. There is regular downtime for maintenance at the second machine, which lowers production efficiency. A cloud-based production simulation is used for modelling and demonstrating the concept. The experimental results revealed that the proposed initial buffer sizes and the proposed cycle time of the maintenance station yield the maximum production line efficiency.
- Published
- 2020
44. VirtualStack: Green High Performance Network Protocol Processing Leveraging FPGAs
- Author
-
Philipp Thomasberger, Jens Heuschkel, Max Mühlhäuser, and Julien Gedeon
- Subjects
NetFPGA ,business.industry ,Computer science ,Network packet ,Clock rate ,02 engineering and technology ,Network interface ,Port (computer networking) ,020202 computer hardware & architecture ,Instructions per second ,Network interface controller ,Embedded system ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,Central processing unit ,business ,Instruction cycle ,Communications protocol ,Host (network) - Abstract
In the era of cloud services and IoT, network protocol processing accounts for a large share of CPU utilization. Foong et al. proposed the rule of thumb for TCP that a single-core CPU needs about 1 Hz of clock frequency to produce 1 bit/s of TCP data. Unfortunately, CPU speed has stagnated at around 5 GHz in recent years, resulting in an upper limit of roughly 5 Gbit/s throughput with single-threaded network processing. Further, CPUs featuring such high clock rates (e.g., Intel Core i7-8086K) have a rated TDP of around 95 W, resulting in very high power consumption in high-throughput situations. Meanwhile, industry offers some hardware acceleration for TCP as part of server network cards to relieve the server CPUs and increase energy efficiency. However, this provides only limited relief, as state and management still require the CPU of the host system. In this paper, we present an approach based on field-programmable gate arrays (FPGAs) that not only frees up CPU cycles but also provides a scalable and energy-efficient way to fully utilize high-speed network interfaces, while maintaining the flexibility of software solutions. For our evaluation, we used the NetFPGA SUME, showing that it achieves the line rate of the connected SFP+ ports while power consumption stays below 6 W. By leveraging network protocol virtualization, the hardware acceleration approach is not only deployable but also remains flexible enough to adapt to new networking paradigms quickly.
- Published
- 2019
45. A server-based approach for predictable GPU access with improved analysis
- Author
-
Hyoseung Kim, Shige Wang, Pratyush Patel, and Ragunathan Rajkumar
- Subjects
0209 industrial biotechnology ,Focus (computing) ,Busy waiting ,Computer science ,business.industry ,Computation ,Graphics processing unit ,02 engineering and technology ,020202 computer hardware & architecture ,Priority inversion ,Task (computing) ,020901 industrial engineering & automation ,Hardware and Architecture ,Embedded system ,Synchronization (computer science) ,0202 electrical engineering, electronic engineering, information engineering ,Instruction cycle ,business ,Software - Abstract
We propose a server-based approach to manage a general-purpose graphics processing unit (GPU) in a predictable and efficient manner. Our approach introduces a GPU server, a dedicated task that handles GPU requests from other tasks on their behalf. The GPU server ensures bounded time to access the GPU and allows other tasks to suspend during their GPU computation to save CPU cycles (a schematic sketch of this structure follows this record). By doing so, we address the two major limitations of the existing real-time synchronization-based GPU management approach: busy waiting within critical sections and long priority inversion. We have implemented a prototype of the server-based approach on a real embedded platform. This case study demonstrates the practicality and effectiveness of the server-based approach. Experimental results indicate that the server-based approach yields significant improvements in task schedulability over the existing synchronization-based approach in most practical settings. Although we focus on a GPU in this paper, the server-based approach can also be used for other types of computational accelerators.
- Published
- 2018
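The server-based structure in the entry above, where tasks hand GPU requests to a dedicated server task and suspend instead of busy-waiting, can be sketched with ordinary threads. This is a schematic analogue under assumed names, not the authors' real-time implementation, and the "kernel" here is just a sleep standing in for GPU work.

```python
import queue
import threading
import time

gpu_requests = queue.Queue()

def gpu_server():
    """Dedicated server task: the only thread that touches the 'GPU'.
    Requests are served in arrival order, bounding each task's access time."""
    while True:
        kernel, done = gpu_requests.get()
        kernel()               # stand-in for launching the GPU kernel
        done.set()             # wake the suspended requester
        gpu_requests.task_done()

def task(name, work_s):
    done = threading.Event()
    gpu_requests.put((lambda: time.sleep(work_s), done))
    done.wait()                # suspend (no busy waiting) until the server finishes
    print(f"{name}: GPU work complete")

if __name__ == "__main__":
    threading.Thread(target=gpu_server, daemon=True).start()
    workers = [threading.Thread(target=task, args=(f"task{i}", 0.05)) for i in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```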
46. Performance guarantees for P4 through cost analysis
- Author
-
Dániel Lukács, Máté Tejfel, and Gergely Pongracz
- Subjects
business.product_category ,Parsing ,Semantics (computer science) ,Computer science ,Distributed computing ,Process (computing) ,computer.software_genre ,Control flow ,Lookup table ,Forwarding plane ,Network switch ,business ,Instruction cycle ,computer - Abstract
For modern switches operating with terabit-scale bandwidth, it is important to obtain guarantees of high performance as early as possible in the design and development process. P4 is the state-of-the-art programming language for defining the control flow of SDN network switches. The most computationally intensive parts of the program flow are the lookup tables. Unfortunately, the execution semantics of P4 lookup tables are highly implementation dependent. Continuing our previous work on P4 parsers, we propose, classify, and evaluate implementation-level cost models for P4 programs composed of parser and lookup constructs, enabling developers to estimate program execution costs (with CPU-cycle precision) before actually deploying the solution to hardware (a toy cost-model sketch follows this record).
- Published
- 2019
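A cost model of the kind described in the entry above can be pictured as summing per-construct cycle estimates along a program's parse-and-lookup pipeline. The sketch below uses made-up construct names and cycle numbers purely for illustration; it is not the authors' cost model.

```python
# Assumed per-construct CPU-cycle costs for one hypothetical target (illustrative only).
COSTS = {
    "parse_ethernet": 12,
    "parse_ipv4": 18,
    "table_lookup_exact": 40,   # e.g. hash-based exact-match table
    "table_lookup_lpm": 95,     # longest-prefix match is pricier
    "action_forward": 8,
    "deparse": 15,
}

def pipeline_cost(constructs, weights=None):
    """Estimate cycles per packet for a sequence of parser/table/action steps.

    'weights' lets a step be scaled (e.g. by hit probability), mirroring how
    lookup cost depends on the implementation-defined table behaviour.
    """
    weights = weights or {}
    total = 0.0
    for step in constructs:
        total += COSTS[step] * weights.get(step, 1.0)
    return total

if __name__ == "__main__":
    l3_pipeline = ["parse_ethernet", "parse_ipv4",
                   "table_lookup_lpm", "action_forward", "deparse"]
    print("estimated cycles/packet:", pipeline_cost(l3_pipeline))
```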
47. Semi-Coherent DMA: An Alternative I/O Coherency Management for Embedded Systems
- Author
-
Mohammad Alian, Nam Sung Kim, Wen-mei W. Hwu, and Seungwon Min
- Subjects
Ethernet ,Hardware_MEMORYSTRUCTURES ,Network packet ,business.industry ,Computer science ,02 engineering and technology ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,020202 computer hardware & architecture ,Network interface controller ,Hardware and Architecture ,Embedded system ,0202 electrical engineering, electronic engineering, information engineering ,Central processing unit ,Cache ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,business ,Instruction cycle ,Throughput (business) ,Data transmission - Abstract
Many modern embedded CPUs adopt Non-Coherent DMA (NC-DMA) over Coherent DMA (C-DMA) for simplicity. An NC-DMA design, however, requires a CPU device driver to explicitly invalidate or flush a wide range of cache space. When an I/O DMA device writes data to a main memory region, the CPU needs to invalidate the cache space corresponding to that memory region twice: (1) to prevent dirty cache lines from overwriting the DMA data and (2) to remove any cache lines prefetched before the DMA is done. In this work, we first show that such explicit invalidations consume 31 percent of CPU cycles, limiting the data transfer throughput of a high-speed network interface card (NIC) when receiving network packets. Second, we propose a Semi-Coherent DMA (SC-DMA) architecture that improves the efficiency of NC-DMA with a slight modification to the hardware. Specifically, our SC-DMA records the DMA region and prohibits any data prefetched from that region from entering the cache, eliminating nearly 50 percent of the unnecessary invalidations. Lastly, we identify several software optimizations that can substantially reduce the excessive cache invalidations prevalent in NIC drivers. Our evaluation with NVIDIA Jetson TX2 shows that our proposed SC-DMA design with the NIC driver optimizations can improve NIC data transfer throughput by up to 53.3 percent.
- Published
- 2018
48. EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud
- Author
-
Junjie Xie, Laurence T. Yang, Yuhui Deng, and Yongtao Zhou
- Subjects
Source code ,Computer Networks and Communications ,Computer science ,Modulo ,media_common.quotation_subject ,Data management ,Cloud computing ,02 engineering and technology ,computer.software_genre ,020204 information systems ,Web page ,0202 electrical engineering, electronic engineering, information engineering ,Instruction cycle ,media_common ,business.industry ,020206 networking & telecommunications ,Computer Science Applications ,Hardware and Architecture ,Computer data storage ,Data mining ,business ,Precision and recall ,Algorithm ,computer ,Software ,Information Systems - Abstract
The explosive growth of data brings new challenges to data storage and management in cloud environments. These data usually have to be processed in a timely fashion in the cloud, so any increased latency may cause a massive loss to enterprises. Similarity detection plays a very important role in data management. Many typical algorithms, such as Shingle, Simhash, Traits, and the Traditional Sampling Algorithm (TSA), are extensively used. The Shingle, Simhash, and Traits algorithms read the entire source file to calculate the corresponding similarity characteristic value, thus requiring many CPU cycles and much memory space and incurring tremendous disk accesses. In addition, their overhead increases with the volume of the data set and results in long delays. Instead of reading the entire file, TSA samples some data blocks and calculates their fingerprints as the similarity characteristic value. The overhead of TSA is fixed and negligible. However, a slight modification of a source file shifts the bit positions of the file content, so a failure of similarity identification is inevitable under such modifications. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity for the cloud by computing sampling positions modulo the file length. EPAS concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shift incurred by modifications (a simplified sampling sketch follows this record). Meanwhile, an improved metric is proposed to measure the similarity between different files and make the possible detection probability close to the actual probability. Furthermore, this paper describes a query algorithm to reduce the time overhead of similarity detection. Our experimental results demonstrate that EPAS significantly outperforms the existing well-known algorithms in terms of time overhead and CPU and memory occupation. Moreover, EPAS makes a more preferable tradeoff between precision and recall than other similarity detection algorithms. Therefore, it is an effective approach to similarity identification for the cloud.
- Published
- 2018
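The position-aware sampling idea in the entry above, taking fixed-size blocks from both the head and the tail of a file with offsets computed modulo the file length so that small edits do not shift every sampled position, can be sketched as below. The block size, number of samples, and the Jaccard-style score are illustrative assumptions, not the exact EPAS parameters or metric.

```python
import hashlib
import os

BLOCK = 4096          # sampled block size (assumption)
SAMPLES = 4           # blocks taken from each end of the file (assumption)

def fingerprints(data: bytes):
    """Sample blocks from the head and the tail of the file and hash them.
    Tail offsets are anchored to the end, so an edit near one end does not
    shift the positions sampled at the other end."""
    n = len(data)
    prints = set()
    for i in range(SAMPLES):
        head = (i * BLOCK) % max(n, 1)
        tail = (n - (i + 1) * BLOCK) % max(n, 1)
        for off in (head, tail):
            prints.add(hashlib.sha1(data[off:off + BLOCK]).hexdigest())
    return prints

def similarity(a: bytes, b: bytes) -> float:
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa | fb)            # Jaccard-style score

if __name__ == "__main__":
    original = os.urandom(200_000)
    edited = original[:100] + b"small change" + original[112:]   # tiny in-place edit
    print(f"similarity: {similarity(original, edited):.2f}")
```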
49. Distributed Logless Atomic Durability with Persistent Memory
- Author
-
Babak Falsafi, Siddharth Gupta, and Alexandros Daglis
- Subjects
010302 applied physics ,Atomicity ,Hardware_MEMORYSTRUCTURES ,Memory hierarchy ,Computer science ,Distributed computing ,Logging ,02 engineering and technology ,Commit ,Persistent Memory ,01 natural sciences ,Durability ,020202 computer hardware & architecture ,Persistence (computer science) ,0103 physical sciences ,Atomic Durability ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Instruction cycle ,Protocol (object-oriented programming) ,Database transaction - Abstract
Datacenter operators have started deploying Persistent Memory (PM), leveraging its combination of fast access and persistence for significant performance gains. A key challenge for PM-aware software is to maintain high performance while achieving atomic durability. The latter typically requires the use of logging, which introduces considerable overhead in the form of additional CPU cycles, write traffic, and ordering requirements. In this paper, we exploit the data multiversioning inherent in the memory hierarchy to achieve atomic durability without logging. Our design, LAD, relies on persistent buffering space at the memory controllers (MCs), already present in modern CPUs, to speculatively accumulate all of a transaction's updates before they are atomically committed to PM. LAD employs an on-chip distributed commit protocol in hardware to manage the distributed speculative state each transaction accumulates across multiple MCs. We demonstrate that LAD is a practical design relying on modest hardware modifications to provide atomically durable transactions, while delivering up to 80% of ideal (i.e., PM-oblivious software's) performance.
- Published
- 2019
50. Approximate computing for multithreaded programs in shared memory architectures
- Author
-
Rajarshi Ray, Ansuman Banerjee, and Bernard Nongpoh
- Subjects
Multi-core processor ,Hardware_MEMORYSTRUCTURES ,Computer science ,Reliability (computer networking) ,020207 software engineering ,02 engineering and technology ,Coherence (statistics) ,Parallel computing ,Shared memory ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Instruction cycle ,Protocol (object-oriented programming) ,Cache coherence - Abstract
In multicore, multi-cache architectures, cache coherence is ensured with a coherence protocol. However, the performance benefits of caching diminish due to the cost associated with the protocol implementation. In this paper, we propose a novel technique to improve the performance of multithreaded programs running on shared-memory multicore processors by embracing approximate computing. Our idea is to relax the coherence requirement selectively in order to reduce the cost associated with a cache-coherence protocol while ensuring bounded QoS degradation with probabilistic reliability. In particular, we detect instructions in a multithreaded program that write to shared data, which we call Shared-Write-Access-Points (SWAPs), and propose an automated statistical analysis to identify those that can tolerate coherence faults. We call such SWAPs approximable. Our experiments on 9 applications from the SPLASH 3.0 benchmark suite reveal that an average of 57% of the tested SWAPs are approximable. To leverage this observation, we propose an adapted cache-coherence protocol that relaxes the coherence requirement on stores from approximable SWAPs. Additionally, our protocol uses stale values for load misses due to coherence, the stale value being the version at the time of invalidation. Architectural simulation of the 9 applications using our approximate execution scheme shows an average 15% reduction in CPU cycles and an 11% reduction in energy footprint.
- Published
- 2019