7,986 results for "Concurrent computing"
Search Results
2. libEnsemble: A Library to Coordinate the Concurrent Evaluation of Dynamic Ensembles of Calculations
- Author
-
Hudson, Stephen, Larson, Jeffrey, Navarro, John-Luke, and Wild, Stefan M
- Subjects
Information and Computing Sciences, Library and Information Studies, Generators, Resource management, Libraries, Optimization, Arrays, History, Concurrent computing, Dynamic ensembles, simulation support systems, workflow management, HPC Python, Computer Software, Distributed Computing, Communications Technologies, Distributed computing and systems software
- Abstract
Almost all applications stop scaling at some point; those that don't are seldom performant when considering time to solution on anything but aspirational/unicorn resources. Recognizing these tradeoffs, as well as the need for greater user functionality in a near-term exascale computing era, we present libEnsemble, a library aimed at particular scalability- and capability-stretching uses. libEnsemble enables running concurrent instances of an application in dynamically allocated ensembles through an extensible Python library. We highlight the structure, execution, and capabilities of the library on leading pre-exascale environments, as well as advanced capabilities for exascale environments and beyond.
- Published
- 2022
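The record above describes coordinating concurrent evaluations of an application; a minimal Python sketch of that pattern follows, using a cheap analytic objective as a stand-in simulation. The names and structure are illustrative, not libEnsemble's actual API.

```python
# Sketch: evaluating an ensemble of candidate points concurrently and keeping
# the best result. The objective is a stand-in for a real simulation.
from concurrent.futures import ThreadPoolExecutor

def simulate(x):
    """Stand-in 'simulation': a cheap analytic objective."""
    return (x - 3.0) ** 2

def run_ensemble(points, workers=4):
    """Evaluate all candidate points concurrently; return (best value, point)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(simulate, points))
    return min(zip(results, points))

best_value, best_point = run_ensemble([i * 0.5 for i in range(13)])
# best_point lands on 3.0, where the objective is exactly 0.0
```

A real ensemble library would also handle dynamic generation of new points from earlier results; this sketch only shows the concurrent-evaluation half.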
3. Enhancing Blockchain Performance via On-chain and Off-chain Collaboration
- Author
-
Chen, Wuhui, Yang, Zhaoxian, Zhang, Jianting, Liang, Junyuan, Sun, Qilin, Zhou, Fan, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Monti, Flavia, editor, Rinderle-Ma, Stefanie, editor, Ruiz Cortés, Antonio, editor, Zheng, Zibin, editor, and Mecella, Massimo, editor
- Published
- 2023
- Full Text
- View/download PDF
4. Emerging Frameworks for Advancing Scientific Workflows Research, Development, and Education
- Author
-
Casanova, Henri, Deelman, Ewa, Gesing, Sandra, Hildreth, Michael, Hudson, Stephen, Koch, William, Larson, Jeffrey, McDowell, Mary Ann, Meyers, Natalie, Navarro, John-Luke, Papadimitriou, George, Tanaka, Ryan, Taylor, Ian, Thain, Douglas, Wild, Stefan M, Filgueira, Rosa, and da Silva, Rafael Ferreira
- Subjects
Information and Computing Sciences, Human-Centred Computing, Scientific workflows, training and education, ensembles, python, concurrent computing, numerical optimization, simulation, Nvidia, GPU, monitoring
- Abstract
Lightning talks of the Workflows in Support of Large-Scale Science (WORKS) workshop are a venue where the workflow community (researchers, developers, and users) can discuss work in progress, emerging technologies and frameworks, and training and education materials. This paper summarizes the WORKS 2021 lightning talks, which cover four broad topics: (i) libEnsemble, a Python library to coordinate the concurrent evaluation of dynamic ensembles of calculations; (ii) Edu WRENCH, a set of online pedagogic modules that provides simulation-driven hands-on activity in the browser; (iii) VisDict, an envisioned visual dictionary framework that will translate terms, jargon, and concepts between research domains and workflow providers; and (iv) Pegasus Kickstart, a lightweight tool for capturing workflow tasks' performance, including performance metrics from Nvidia GPUs.
- Published
- 2021
5. APPLICATION OF REAL-TIME FAN SCHEDULING IN EXPLORATION-EXPLOITATION TO OPTIMIZE MINIMUM FUNCTION OBJECTIVES
- Author
-
Mariano LARIOS-GÓMEZ, Perfecto M. QUINTERO-FLORES, and Mario ANZURES-GARCÍA
- Subjects
real-time task scheduling, genetic algorithms, concurrent computing, Information technology, T58.5-58.64, Electronic computers. Computer science, QA75.5-76.95
- Abstract
This paper presents the application of a task scheduling algorithm called FAN, based on an artificial intelligence technique (genetic algorithms), to the problem of finding minima of objective functions, where predefined equations measure the return on investment. The work combines the methodologies of population exploration and exploitation. Individuals with good fitness are retained until better learning is obtained under non-termination conditions, that is, until an individual with a better predisposition emerges that adheres to the established constraints, after exhausting all possible options and satisfying the stopping condition. A real-time task scheduling algorithm based on consensus techniques was applied. A software tool was developed, and the FAN scheduler was adapted to execute periodic, aperiodic, and sporadic tasks in controlled environments, ensuring that strict timing restrictions are met. The first phase of the work shows how convergence precipitates an evolution within a few iterations. In the second stage, exploitation was improved, giving the algorithm better convergence and feasibility. As a result, applying the FAN algorithm to a population over repeated iterations yielded individuals with better predisposition, produced by asynchronous processes while scheduling in real time.
- Published
- 2023
- Full Text
- View/download PDF
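The FAN abstract above describes a genetic algorithm balancing exploration and exploitation to minimize an objective function; a toy sketch of that loop follows. The population size, mutation width, and objective are illustrative choices, not the paper's scheduler.

```python
# Sketch: a minimal genetic algorithm minimizing an objective function.
# Exploitation keeps the fittest half; exploration mutates it.
import random

def objective(x):
    return (x - 2.0) ** 2 + 1.0   # minimum value 1.0 at x = 2

def evolve(generations=60, pop_size=20, seed=7):
    rng = random.Random(seed)
    pop = [rng.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=objective)
        parents = pop[: pop_size // 2]                       # exploitation
        children = [p + rng.gauss(0, 0.5) for p in parents]  # exploration
        pop = parents + children
    return min(pop, key=objective)

best = evolve()   # converges close to x = 2
```

Because the elite parents survive unchanged each generation, the best individual found so far can never get worse, which is the "retain until something better emerges" behavior the abstract describes.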
6. Discovering and Analyzing Contextual Behavioral Patterns From Event Logs.
- Author
-
Acheli, Mehdi, Grigori, Daniela, and Weidlich, Matthias
- Subjects
POINT processes, INFORMATION storage & retrieval systems
- Abstract
Event logs that are recorded by information systems provide a valuable starting point for the analysis of processes in various domains, reaching from healthcare, through logistics, to e-commerce. Specifically, behavioral patterns discovered from an event log enable operational insights, even in scenarios where process execution is rather unstructured and shows a large degree of variability. While such behavioral patterns capture frequently recurring episodes of a process's behavior, they are not limited to sequential behavior but include notions of concurrency and exclusive choices. Existing algorithms to discover behavioral patterns are context-agnostic, though. They neglect the context in which patterns are observed, which severely limits the granularity at which behavioral regularities are identified. In this paper, we therefore present an approach to discover contextual behavioral patterns. Contextual patterns may be frequent solely in a certain partition of the event log, which enables fine-granular insights into the aspects that influence the conduct of a process. Moreover, we show how to analyze the discovered contextual behavioral patterns in terms of causal relations between context information and the patterns, as well as correlations between the patterns themselves. A complete analysis methodology leveraging all the tools presented in the paper and supplemented by interpretation guidelines is also provided. Finally, experiments with real-world event logs demonstrate the effectiveness of our techniques in obtaining fine-granular process insights. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
7. Efficient Oblivious Query Processing for Range and kNN Queries.
- Author
-
Chang, Zhao, Xie, Dong, Li, Feifei, Phillips, Jeff M., and Balasubramonian, Rajeev
- Subjects
BATCH processing, DATA encryption, MATHEMATICAL optimization, RANDOM access memory, CLOUD computing, TRUST
- Abstract
Increasingly, individuals and companies adopt a cloud service provider as a primary data and IT infrastructure platform. The remote access of the data inevitably brings the issue of trust. Data encryption is necessary to keep sensitive information secure and private on the cloud. Yet adversaries can still learn valuable information regarding encrypted data by observing data access patterns. To solve this problem, Oblivious RAMs (ORAMs) are proposed to completely hide access patterns. However, most ORAM constructions are expensive and not suitable to deploy in a database for supporting query processing over large data. Furthermore, an ORAM processes queries synchronously, and hence does not provide high throughput for concurrent query processing. In this article, we design a practical oblivious query processing framework to enable efficient query processing over a cloud database. In particular, we focus on processing multiple range and kNN queries asynchronously and concurrently with high throughput. The key idea is to integrate indices into ORAM, which leverages a suite of optimization techniques (e.g., oblivious batch processing and caching). The effectiveness and efficiency of our oblivious query processing framework is demonstrated through extensive evaluations over large datasets. Our construction shows an order of magnitude speedup in comparison with other baselines. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
8. Coordinating Fast Concurrency Adapting With Autoscaling for SLO-Oriented Web Applications.
- Author
-
Liu, Jianshu, Zhang, Shungeng, Wang, Qingyang, and Wei, Jinpeng
- Subjects
WEB-based user interfaces, CLOUD computing, RESOURCE allocation, BOTTLENECKS (Manufacturing)
- Abstract
Cloud providers tend to support dynamic computing resources reallocation (e.g., Autoscaling) to handle the bursty workload for web applications (e.g., e-commerce) in the cloud environment. Nevertheless, we demonstrate that directly scaling a bottleneck server without quickly adjusting its soft resources (e.g., server threads and database connections) can cause significant response time fluctuations of the target web application. Since soft resources determine the request processing concurrency of each server in the system, simply scaling out/in the bottleneck service can unintentionally change the concurrency level of related services, inducing either under- or over-utilization of the critical hardware resource. In this paper, we propose the Scatter-Concurrency-Throughput (SCT) model, which can rapidly identify the near-optimal soft resource allocation of each server in the system using the measurement of each server’s real-time throughput and concurrency. Furthermore, we implement a Concurrency-aware autoScaling (ConScale) framework that integrates the SCT model to quickly reallocate the soft resources of the key servers in the system to best utilize the new hardware resource capacity after the system scaling. Based on extensive experimental comparisons with two widely used hardware-only scaling mechanisms for web applications: EC2-AutoScaling (VM-based autoscaler) and Kubernetes HPA (container-based autoscaler), we show that ConScale can successfully mitigate the response time fluctuations over the system scaling phase in both VM-based and container-based environments. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
9. Intermittent-Aware Distributed Concurrency Control.
- Author
-
Tsai, Wei-Che, Chen, Wei-Ming, Kuo, Tei-Wei, and Hsiu, Pi-Cheng
- Subjects
ENERGY harvesting, INTERNET of things, DATA management, NONVOLATILE memory, TASK analysis
- Abstract
Internet of Things (IoT) devices are gradually adopting battery-less, energy harvesting solutions, thereby driving the development of an intermittent computing paradigm to accumulate computation progress across multiple power cycles. While many attempts have been made to enable standalone intermittent systems, little attention has focused on IoT networks formed by intermittent devices. We observe that the computation progress improved by distributed task concurrency in an intermittent network can be significantly offset by data unavailability due to frequent system failures. This article presents an intermittent-aware distributed concurrency control protocol which leverages existing data copies inherently created in the network to improve the computation progress of concurrently executed tasks. In particular, we propose a borrowing-based data management method to increase data availability and an intermittent two-phase commit procedure incorporated with distributed backward validation to ensure data consistency in the network. The proposed protocol was integrated into a FreeRTOS-extended intermittent operating system running on Texas Instruments devices. Experimental results show that the computation progress can be significantly improved, and this improvement is more apparent under weaker power, where more devices remain offline for longer durations. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
10. A High Performance Concurrency Protocol for Smart Contracts of Permissioned Blockchain.
- Author
-
Jin, Cheqing, Pang, Shuaifeng, Qi, Xiaodong, Zhang, Zhao, and Zhou, Aoying
- Subjects
GRAPH algorithms, PARALLEL algorithms, CONTRACTS, SUBGRAPHS, BLOCKCHAINS
- Abstract
Although the emergence of the programmable smart contract makes blockchain systems easily embrace a wide range of industrial services, how to execute smart contracts efficiently becomes a big challenge nowadays. Due to the existence of Byzantine nodes, existing mature concurrency control protocols in database cannot be employed directly, since the mechanism of executing smart contracts varies a lot. Furthermore, even though smart contract execution follows a two-phase style, i.e., the primary node executes a batch of smart contracts in the first phase and the validators replay them in the second phase, existing parallel solutions merely focus on the optimization for the first phase, rather than the second phase. In this paper, we propose a novel two-phase concurrency control protocol to optimize both phases for the first time. First, the primary executes transactions in parallel and generates a transaction dependency graph with high parallelism for validators. Then, a graph partition algorithm is devised to divide the original graph into several sub-graphs to preserve parallelism and reduce communication cost remarkably. Finally, we propose a deterministic replay protocol to re-execute the primary's parallel schedule concurrently. Moreover, this two-phase protocol is further optimized by integrating with PBFT. Theoretical analysis and extensive experimental results illustrate that the proposed scheme outperforms state-of-the-art solutions significantly. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
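The first phase described above has the primary build a transaction dependency graph for the validators; a minimal sketch of deriving such a graph from read/write sets follows. The transactions and keys are toy data, not the paper's protocol.

```python
# Sketch: building a transaction dependency graph from read/write sets.
def dependency_graph(txns):
    """txns: list of (tx_id, read_set, write_set), in execution order.
    An edge (i, j) means tx j must observe tx i's effects:
    their key sets conflict and i precedes j."""
    edges = set()
    for i, (_, r1, w1) in enumerate(txns):
        for j in range(i + 1, len(txns)):
            _, r2, w2 = txns[j]
            if (w1 & r2) or (w1 & w2) or (r1 & w2):   # WR, WW, RW conflicts
                edges.add((txns[i][0], txns[j][0]))
    return edges

txns = [
    ("t1", {"a"}, {"b"}),   # reads a, writes b
    ("t2", {"b"}, {"c"}),   # reads b -> depends on t1 (WR conflict on b)
    ("t3", {"d"}, {"e"}),   # disjoint keys -> can run in parallel
]
deps = dependency_graph(txns)
```

Validators can replay any transactions not connected by an edge concurrently, which is the parallelism the graph is meant to expose.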
11. Space-Efficient Subgraph Search Over Streaming Graph With Timing Order Constraint.
- Author
-
Li, Youhuan, Zou, Lei, Ozsu, M. Tamer, and Zhao, Dongyan
- Subjects
STREAMING video & television, SOCIAL networks, PUBLIC spaces, DATA analysis
- Abstract
The growing popularity of dynamic applications such as social networks provides a promising way to detect valuable information in real time. These applications create high-speed data that can be easily modeled as a streaming graph. Efficient analysis over these data is of great significance. In this paper, we study the subgraph (isomorphism) search over streaming graph data that obeys timing order constraints over the occurrence of edges in the stream. The sliding window model is employed to focus on the most recent data. We propose an efficient solution to answer subgraph search, introduce optimizations to greatly reduce the space cost, and design concurrency management to improve system throughput. Extensive experiments on real network traffic data and synthetic social streaming data show that our solution outperforms comparative ones by one order of magnitude with less space cost. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
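The sliding-window model above keeps only the most recent streaming edges while preserving their timing order; a minimal sketch of such a windowed edge store follows. The window length and tuple layout are illustrative.

```python
# Sketch: a sliding-window edge store for a streaming graph. Edges arrive in
# timestamp order; edges older than the window are expired on insertion.
from collections import deque

class WindowedGraph:
    def __init__(self, window):
        self.window = window
        self.edges = deque()          # (timestamp, src, dst), in arrival order

    def insert(self, ts, src, dst):
        self.edges.append((ts, src, dst))
        while self.edges and self.edges[0][0] <= ts - self.window:
            self.edges.popleft()      # expire edges that fell out of the window

    def ordered_edges(self):
        """Edges still in the window, preserving their timing order."""
        return list(self.edges)

g = WindowedGraph(window=10)
for ts, s, d in [(1, "a", "b"), (5, "b", "c"), (14, "c", "d")]:
    g.insert(ts, s, d)
# the edge at ts=1 has expired once ts=14 arrives
```

A real system would additionally maintain the space-efficient match indices the paper describes; this sketch only shows the window maintenance that keeps memory bounded.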
12. Handling Concurrency in Behavior Trees.
- Author
-
Colledanchise, Michele and Natale, Lorenzo
- Subjects
FINITE state machines, VIDEO game industry, SOURCE code, TREES
- Abstract
This article addresses the concurrency issues affecting behavior trees (BTs), a popular tool to model the behaviors of autonomous agents in the video game and the robotics industry. BT designers can easily build complex behaviors by composing simpler ones, which represents a key advantage of BTs. The parallel composition of BTs expresses a way to combine concurrent behaviors that has high potential, since composing pre-existing BTs in parallel is easier than composing classical control architectures, such as finite state machines or teleo-reactive programs, in parallel. However, BT designers rarely use such composition due to the underlying concurrency problems similar to the ones faced in concurrent programming. As a result, the parallel composition, despite its potential, finds application only in the composition of simple behaviors or where the designer can guarantee the absence of conflicts by design. In this article, we define two new BT nodes to tackle the concurrency problems in BTs and we show how to exploit them to create predictable behaviors. In addition, we introduce measures to assess execution performance and show how different design choices affect them. We validate our approach in both simulations and the real world. Simulated experiments provide statistically significant data, whereas real-world experiments show the applicability of our method on real robots. We provide an open-source implementation of the novel BT formulation and have published all the source code to reproduce the numerical examples and experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
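The parallel composition discussed above ticks several child behaviors concurrently; a generic textbook-style sketch of a BT parallel node with a success threshold follows. This is the classic formulation, not the article's two new node types.

```python
# Sketch: a Parallel behavior-tree node. Children are zero-argument callables
# returning a status; the node succeeds once enough children have succeeded,
# and fails as soon as the threshold can no longer be reached.
SUCCESS, FAILURE, RUNNING = "success", "failure", "running"

class Parallel:
    def __init__(self, children, success_threshold):
        self.children = children
        self.success_threshold = success_threshold

    def tick(self):
        statuses = [child() for child in self.children]
        if statuses.count(SUCCESS) >= self.success_threshold:
            return SUCCESS
        if statuses.count(FAILURE) > len(self.children) - self.success_threshold:
            return FAILURE          # too many failures: threshold unreachable
        return RUNNING

node = Parallel([lambda: SUCCESS, lambda: RUNNING, lambda: SUCCESS], 2)
```

The concurrency problems the article tackles arise precisely because such children may read and write shared state while being ticked together.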
13. ConcSpectre: Be Aware of Forthcoming Malware Hidden in Concurrent Programs.
- Author
-
Liu, Yang, Xu, Zisen, Fan, Ming, Hao, Yu, Chen, Kai, Chen, Hao, Cai, Yan, Yang, Zijiang, and Liu, Ting
- Subjects
MALWARE, INTERNET servers, COMPUTER systems, DEBUGGING, COMPUTER software security
- Abstract
Concurrent programs with multiple threads executing in parallel are widely used to unleash the power of multicore computing systems. Owing to their complexity, a lot of research focuses on testing and debugging concurrent programs. Besides correctness, we find that security can also be compromised by concurrency. In this article, we present concurrent program spectre (ConcSpectre), a new security threat that hides malware in nondeterministic thread interleavings. To demonstrate such threat, we have developed a stealth malware technique called concurrent logic bomb by partitioning a piece of malicious code and injecting its components separately into a concurrent program. The malicious behavior can be triggered by certain thread interleavings that rarely happen (e.g., <1%) under a normal execution environment. However, with a new technique called controllable probabilistic activation, we can activate such ConcSpectre malware with a very high probability (e.g., >90%) by remotely disturbing thread scheduling. In the evaluation, more than 1000 ConcSpectre samples are generated, which bypassed most of the antivirus engines in VirusTotal and four well-known online dynamic malware analysis systems. We also demonstrate how to remotely trigger a ConcSpectre sample on a web server and control its activation probability. Our work shows an urgent need for new malware analysis methods for concurrent programs. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
14. MUSE: A Multistage Assembling Algorithm for Simultaneous Localization of Large-Scale Massive Passive RFIDs.
- Author
-
Ma, Yongtao, Tian, Chenglong, and Liu, Hankai
- Subjects
GRAPH theory, KNOWLEDGE graphs, ALGORITHMS, SLAM (Robotics), SYNTHETIC aperture radar, THEORY of knowledge
- Abstract
In this paper, MUSE, an algorithm enabled by a backscattering tag-to-tag network (BTTN), is presented to accomplish simultaneous 2-D localization of large-scale (10 m × 10 m), massive (20-50) passive UHF RFIDs. In BTTNs, the most intractable problem is the high-frequency loss of range measurements. In a particular case of 30 tags to be located with a maximum communication range of 3 m, the loss rate reaches nearly 85.75 percent. In the proposed framework, we utilize relevant knowledge from graph theory to obtain underlying subsets in which tags can communicate with each other, and then assemble them stage by stage to achieve overall localization. Theoretical analysis shows that multistage assembly imparts extraordinary characteristics to MUSE: assembling rectifies fragment maps under a given condition and, in later stages, prevents errors from flowing into the next stage. Experimental analysis shows that the condition is easy to satisfy. Furthermore, an analytical expression for the Cramér-Rao lower bound is also derived as a benchmark to evaluate the localization performance. Extensive simulations demonstrate that MUSE outperforms existing algorithms for simultaneous localization. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
15. Enabling Scalable and Extensible Memory-Mapped Datastores in Userspace.
- Author
-
Peng, Ivy B., Gokhale, Maya B., Youssef, Karim, Iwabuchi, Keita, and Pearce, Roger
- Subjects
DYNAMIC loads, SYSTEMS software, DYNAMIC balance (Mechanics), MACHINE learning, SOFTWARE architecture, METAGENOMICS
- Abstract
Exascale workloads are expected to incorporate data-intensive processing in close coordination with traditional physics simulations. These emerging scientific, data-analytics and machine learning applications need to access a wide variety of datastores in flat files and structured databases. Programmer productivity is greatly enhanced by mapping datastores into the application process's virtual memory space to provide a unified "in-memory" interface. Currently, memory mapping is provided by system software primarily designed for generality and reliability. However, scalability at high concurrency is a formidable challenge on exascale systems. Also, there is a need for extensibility to support new datastores potentially requiring HPC data transfer services. In this article, we present UMap, a scalable and extensible userspace service for memory-mapping datastores. Through decoupled queue management, concurrency aware adaptation, and dynamic load balancing, UMap enables application performance to scale even at high concurrency. We evaluate UMap in data-intensive applications, including sorting, graph traversal, database operations, and metagenomic analytics. Our results show that UMap as a userspace service outperforms an optimized kernel-based service across a wide range of intra-node concurrency by 1.22-1.9×. We performed two case studies to demonstrate UMap's extensibility. First, a new datastore residing in remote memory is incorporated into UMap as an application-specific plugin. Second, we present a persistent memory allocator Metall built atop UMap for unified storage/memory. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
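Memory-mapping a flat-file datastore, the programming model UMap generalizes, can be sketched with Python's standard mmap module; UMap itself is a separate userspace service, and the fixed-width record layout here is illustrative.

```python
# Sketch: reading fixed-width records from a flat file through an "in-memory"
# interface using the standard mmap module.
import mmap
import os
import tempfile

# Create a tiny flat-file datastore of three 10-byte records.
path = os.path.join(tempfile.mkdtemp(), "store.bin")
with open(path, "wb") as f:
    f.write(b"record-000record-001record-002")

def read_record(path, index, width=10):
    """Map the file into memory and slice out one record by index."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[index * width : (index + 1) * width])

rec = read_record(path, 1)
```

The appeal is exactly what the abstract states: the application addresses the datastore like an in-memory array, while the mapping layer handles paging; UMap moves that layer into userspace where it can be tuned and extended.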
16. Performance Analysis, Design Considerations, and Applications of Extreme-Scale In Situ Infrastructures
- Author
-
Bethel, E. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)]
- Published
- 2016
- Full Text
- View/download PDF
17. Distributed Tree-Based Machine Learning for Short-Term Load Forecasting With Apache Spark
- Author
-
Ameema Zainab, Ali Ghrayeb, Haitham Abu-Rub, Shady S. Refaat, and Othmane Bouhali
- Subjects
Apache spark, concurrent computing, load forecasting, parallel processing, resource management, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Machine learning algorithms have been intensively applied to load forecasting because they achieve better accuracy than traditional statistical methods. However, with the huge increase in data size, sophisticated models have to be created, which requires big data platforms. Optimal and effective use of the available computational resources can be attained by maximizing the effective utilization of the cluster nodes. Parallel computing is needed for optimal resource utilization when dealing with smart grid big data. In this paper, a master-slave parallel computing paradigm is utilized and experimented with for load forecasting in a multi-AMI environment. The paper proposes a concurrent job scheduling algorithm for a multi-energy data source environment using Apache Spark, together with an efficient resource utilization strategy for submitting multiple Spark jobs to reduce job completion time. The optimal number of clusters is used to group the data and further reduce computational time. Multiple tree-based machine learning algorithms with tunable parameters are tested under parallel computation to evaluate performance on a real-world dataset. Three years of real data from one thousand distribution transformers in Spain demonstrate the performance of the proposed methodology, with a trade-off between accuracy and processing time.
- Published
- 2021
- Full Text
- View/download PDF
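The concurrent job scheduling idea above, submitting per-cluster forecasting jobs in parallel rather than sequentially, can be sketched as follows. A placeholder train() function stands in for real Spark job submission; all names and data are illustrative.

```python
# Sketch: running per-cluster forecasting jobs concurrently. In the paper these
# would be Spark jobs over AMI data; here a trivial "model" (the mean load)
# stands in so the scheduling shape is visible.
from concurrent.futures import ThreadPoolExecutor

def train(cluster):
    """Placeholder for training one cluster's forecasting model."""
    name, series = cluster
    return name, sum(series) / len(series)   # stand-in model: the mean load

clusters = [("north", [1.0, 2.0, 3.0]), ("south", [4.0, 6.0]), ("east", [5.0])]

def run_jobs(clusters, max_workers=3):
    """Submit all jobs at once instead of one after another."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(train, clusters))

models = run_jobs(clusters)
```

With independent per-cluster jobs, total completion time approaches that of the slowest single job rather than the sum of all jobs, which is the gain the paper's scheduler pursues at Spark scale.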
18. Garou: An Efficient and Secure Off-Blockchain Multi-Party Payment Hub.
- Author
-
Ye, Yongjie, Ren, Zhifeng, Luo, Xiapu, Zhang, Jingjing, and Wu, Weigang
- Abstract
To mitigate the scalability problem of decentralized cryptocurrencies such as Bitcoin and Ethereum, the payment channel, which allows two parties to perform secure coin transfers without involving the blockchain, has been proposed. The payment channel increases the transaction throughput of two parties to a level that is only limited by their network bandwidth. Recent proposals focus on extending the two-party payment channel to an N-party payment hub. Unfortunately, none of them achieves both efficiency and flexibility in the absence of a trusted third party. In this paper, we propose Garou, a secure N-party payment hub that allows multiple parties to perform secure off-chain coin transfers. Except in the case of disputes, participants within the payment hub can make concurrent and direct coin transfers with each other without the involvement of the blockchain or any third-party intermediaries. This allows Garou to achieve both high performance and flexibility. If all participants keep online, Garou guarantees all honest users' balance security against strong adversarial capabilities. To demonstrate the feasibility of the Garou protocol, we develop a proof-of-concept prototype for the Ethereum network. Experimental evaluations show that, with 300 participants and without disputes, the maximum transaction throughput of Garou is 20× higher than that of state-of-the-art payment hubs. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
19. Performance testing of STL and Qt library elements in multi-threaded processing
- Author
-
Piotr Krasowski and Jakub Smołka
- Subjects
concurrent computing, multithreading, container performance, data structures, Information technology, T58.5-58.64, Electronic computers. Computer science, QA75.5-76.95
- Abstract
In recent years, multithreaded processing has become an important aspect of programming. Computers with multi-core processors are now widely available, enabling the creation of more efficient applications. Many libraries support multithreaded solutions, but performance information is often lacking. The use of appropriate data structures and algorithms significantly speeds up the process of creating and developing applications. The article describes selected elements of the Qt and STL libraries and compares their performance in concurrent programming. The time needed to perform individual operations was analysed.
- Published
- 2020
- Full Text
- View/download PDF
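A multithreaded container benchmark of the kind the article describes has a simple general shape; the sketch below times concurrent appends to a shared Python list. Under CPython's GIL, list appends are safe without locks, whereas the C++ STL and Qt containers the article benchmarks would need explicit synchronization.

```python
# Sketch: the shape of a multithreaded container benchmark — N threads each
# perform a fixed number of operations on a shared container, and we time the
# whole run.
import threading
import time

def benchmark(n_threads, ops_per_thread):
    data = []
    def worker():
        for i in range(ops_per_thread):
            data.append(i)           # atomic under CPython's GIL
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return len(data), elapsed

count, elapsed = benchmark(4, 10_000)   # count == 4 * 10_000
```

Repeating such a run per container type and per thread count, then comparing elapsed times, is the essence of the comparison the article performs for STL and Qt elements.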
20. Dynamic Block-Wise Local Learning Algorithm for Efficient Neural Network Training.
- Author
-
Lee, Gwangho, Lee, Sunwoo, and Jeon, Dongsuk
- Subjects
MACHINE learning, ALGORITHMS, ARTIFICIAL neural networks, COMPUTATIONAL complexity
- Abstract
In the backpropagation algorithm, the error calculated at the output of the neural network must be propagated back through the layers to update each layer's weights, making it difficult to parallelize the training process and requiring frequent off-chip memory access. Local learning algorithms locally generate error signals that are used for weight updates, removing the need for backpropagation of error signals. However, prior works rely on large, complex auxiliary networks for reliable training, which results in large computational overhead and undermines the advantages of local learning. In this work, we propose a local learning algorithm that significantly reduces computational complexity and improves training performance. Our algorithm combines multiple consecutive layers into a block and performs local learning on a block-by-block basis, while dynamically changing block boundaries during training. In experiments, our approach achieves 95.68% and 79.42% test accuracy on the CIFAR-10 and CIFAR-100 datasets, respectively, using a small fully connected layer as the auxiliary network, closely matching the performance of the backpropagation algorithm. Multiply-accumulate (MAC) operations and off-chip memory accesses are also reduced by up to 15% and 81%, respectively, compared to backpropagation. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
21. Workflow Refactoring for Maximizing Concurrency and Block-Structuredness.
- Author
-
Song, Wei, Jacobsen, Hans-Arno, Cheung, S.C., Liu, Hongyu, and Ma, Xiaoxing
- Abstract
In the era of the Internet and big data, contemporary workflows become increasingly large in scale and complex in structure, introducing greater challenges for workflow modeling. Even when languages supporting block-structuredness (e.g., BPEL) are employed, workflows often lack maximized concurrency and block-structuredness in terms of control flow. Existing workflow refactoring approaches mostly focus on maximizing concurrency according to dependences between activities, but do not consider the block-structuredness of the refactored workflow. A block-structured workflow is easier to comprehend and analyze, and easier to transform into BPEL-like processes. In this paper, we aim at maximizing both concurrency and block-structuredness. Nevertheless, not all workflows can be refactored into a block-structured representation, and it is intractable to ensure that the refactored workflows are as block-structured as possible. We first define a well-formed dependence pattern of activities. The control flow among the activities in this pattern can be represented in block-structured forms with maximized concurrency. Then, we propose a greedy heuristics-based graph reduction approach to recursively find such patterns. In this way, the resulting workflow has maximized concurrency and its block-structuredness approximates optimality. We show the effectiveness and efficiency of our approach with real-world scientific workflows. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
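Maximizing concurrency from dependences between activities, as above, can be illustrated by plain topological layering, which groups activities into levels whose members may all run in parallel. This is a much simpler cousin of the paper's graph reduction approach, shown only to make the idea concrete.

```python
# Sketch: grouping workflow activities into maximally concurrent "levels"
# from their dependence edges. Every activity within a level has all of its
# predecessors in earlier levels, so a whole level can execute in parallel.
def concurrency_levels(deps, activities):
    """deps: set of (before, after) pairs; returns a list of levels."""
    remaining = set(activities)
    done, levels = set(), []
    while remaining:
        ready = {a for a in remaining
                 if all(b in done for (b, c) in deps if c == a)}
        if not ready:
            raise ValueError("cyclic dependences")
        levels.append(sorted(ready))
        done |= ready
        remaining -= ready
    return levels

deps = {("fetch", "parse"), ("fetch", "validate"),
        ("parse", "store"), ("validate", "store")}
levels = concurrency_levels(deps, ["fetch", "parse", "validate", "store"])
```

Note that layering says nothing about block-structuredness; capturing both properties at once is precisely the harder problem the paper addresses.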
22. ConcurDB: Concurrent Query Authentication for Outsourced Databases.
- Author
-
Bajaj, Sumeet, Chakraborti, Anrin, and Sion, Radu
- Subjects
DATABASES, NETWORK hubs, DATABASE security, ONLINE data processing, DATA structures
- Abstract
Clients of outsourced databases need Query Authentication (QA) guaranteeing the integrity and authenticity of query results returned by potentially compromised providers. Prior work provides QA assurances for a limited class of queries by deploying several software-based cryptographic constructs. The constructs are often designed assuming read-only or infrequently updated databases. For dynamic datasets, the data owner is required to perform all updates on behalf of clients. Hence, for concurrent updates by multiple clients, such as for OLTP workloads, existing QA solutions are inefficient. We present ConcurDB, a concurrent QA scheme that enables simultaneous updates by multiple clients. To realize concurrent QA, we have designed several new mechanisms. First, we identify and use an important relationship between QA and memory checking to decouple query execution and verification. We allow clients to execute transactions concurrently and perform verifications in parallel using an offline memory checking based protocol. Then, to extend QA to a multi-client scenario, we design new protocols that enable clients to securely exchange a small set of authentication data even when using the untrusted provider as a communication hub. Finally, we overcome provider-side replay attacks. Using ConcurDB, we provide and evaluate concurrent QA for the full TPC-C benchmark. For updates, ConcurDB shows a 4x performance increase over existing solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
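The relationship between query authentication and memory checking mentioned in the abstract above builds on the classic offline memory checker (Blum et al.). The sketch below is a simplified illustration of that primitive, not ConcurDB's protocol: every read and write is folded into incremental multiset hashes, and a final scan over the untrusted store must make the two multisets match, or tampering occurred. The XOR-of-SHA-256 multiset hash here is a simplification for illustration.

```python
import hashlib

def h(addr, value, t):
    """Hash one (address, value, timestamp) record into an integer."""
    return int.from_bytes(hashlib.sha256(f"{addr}|{value}|{t}".encode()).digest(), "big")

class OfflineChecker:
    def __init__(self, store):
        self.store = store   # untrusted key -> value map (the provider)
        self.time = {}       # trusted: last write timestamp per address
        self.clock = 0
        self.rhash = 0       # multiset hash of observed reads
        self.whash = 0       # multiset hash of issued writes

    def write(self, addr, value):
        if addr in self.time:  # absorb the overwritten entry into the read set
            self.rhash ^= h(addr, self.store[addr], self.time[addr])
        self.clock += 1
        self.whash ^= h(addr, value, self.clock)
        self.time[addr] = self.clock
        self.store[addr] = value

    def read(self, addr):
        value = self.store[addr]  # value comes from the untrusted store
        self.rhash ^= h(addr, value, self.time[addr])
        self.clock += 1
        self.whash ^= h(addr, value, self.clock)  # write back, re-timestamped
        self.time[addr] = self.clock
        return value

    def verify(self):
        for addr, t in self.time.items():  # final scan of every address
            self.rhash ^= h(addr, self.store[addr], t)
        return self.rhash == self.whash
```

Because verification only compares two running hashes at the end, execution and verification are decoupled, which is the property ConcurDB exploits to let clients run transactions concurrently and verify in parallel.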
23. Concurrent Healthcare Data Processing and Storage Framework Using Deep-Learning in Distributed Cloud Computing Environment.
- Author
-
Yan, Shengguang, He, Lijuan, Seo, Jaebok, and Lin, Minmin
- Abstract
Distributed cloud computing environments rely on sophisticated communication and sharing paradigms for ease of access, information processing, and analysis. The challenging characteristic of such cloud computing environments is concurrency of access, as both the service provider and the end-user rely on a common sharing platform. In this article, a retrieval and storage-based indexing framework (RSIF) is designed to improve the concurrency of user and service provider access to cloud-stored healthcare data. Concurrency is achieved through replication-free, continuous indexing and time-constrained retrieval of stored information. The process of classifying the constraints for data augmentation and update is performed using deep learning for all the storage instances. Through conditional assessment, the learning process determines the approximation of indexing and ordering for storage and retrieval, respectively. This helps to reduce access and retrieval times concurrently, provided the processes are independent. The simulation analysis, using the metrics of discontinuous indexing, replicated data, retrieval time, and cost, demonstrates the reliability of the proposed framework. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
24. OODIDA: On-Board/Off-Board Distributed Real-Time Data Analytics for Connected Vehicles.
- Author
-
Ulm, Gregor, Smith, Simon, Nilsson, Adrian, Gustavsson, Emil, and Jirstrand, Mats
- Subjects
PROGRAMMING languages ,ELECTRONIC data processing ,VEHICLES ,VEHICLE routing problem ,EDGE computing - Abstract
A fleet of connected vehicles easily produces many gigabytes of data per hour, making centralized (off-board) data processing impractical. In addition, there is the issue of distributing tasks to on-board units in vehicles and processing them efficiently. Our solution to this problem is On-board/Off-board Distributed Data Analytics (OODIDA), which is a platform that tackles both task distribution to connected vehicles as well as concurrent execution of tasks on arbitrary subsets of edge clients. Its message-passing infrastructure has been implemented in Erlang/OTP, while the end points use a language-independent JSON interface. Computations can be carried out in arbitrary programming languages. The message-passing infrastructure of OODIDA is highly scalable, facilitating the execution of large numbers of concurrent tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
25. Global Analysis of C Concurrency in High-Level Synthesis.
- Author
-
Ramanathan, Nadesh, Constantinides, George A., and Wickerson, John
- Subjects
GLOBAL analysis (Mathematics) ,COMPILERS (Computer programs) ,FIELD programmable gate arrays ,CARTOGRAPHY software - Abstract
When mapping C programs to hardware, high-level synthesis (HLS) tools reorder independent instructions, aiming to obtain a schedule that requires as few clock cycles as possible. However, when synthesizing multithreaded C programs, reordering opportunities are limited by the presence of atomic operations (“atomics”), the fundamental concurrency primitives in C. Existing HLS tools analyze and schedule each thread in isolation. In this article, we argue that thread-local analysis is conservative, especially since HLS compilers have access to the entire program. Hence, we propose a global analysis that exploits information about memory accesses by all threads when scheduling each thread. Implemented in the LegUp HLS tool, our analysis is sensitive to sequentially consistent (SC) and weak atomics and supports loop pipelining. Since the semantics of C atomics is complicated, we formally verify that our analysis correctly implements the C memory model using the Alloy model checker. Compared with thread-local analysis, our global analysis achieves a 2.3× average speedup on a set of lock-free data structures and data-flow patterns. We also apply our analysis to a larger application: a lock-free, streamed, and load-balanced implementation of Google’s PageRank, where we see a 1.3× average speedup compared with the thread-local analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
26. Facilitating rapid prototyping in the distributed data analytics platform OODIDA via active-code replacement
- Author
-
Gregor Ulm, Simon Smith, Adrian Nilsson, Emil Gustavsson, and Mats Jirstrand
- Subjects
Distributed computing ,Concurrent computing ,Distributed data processing ,Hot swapping ,Code replacement ,Erlang ,Computer engineering. Computer hardware ,TK7885-7895 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
OODIDA (On-board/Off-board Distributed Data Analytics) is a platform for distributed real-time analytics, targeting fleets of reference vehicles in the automotive industry. Its users are data analysts. The bulk of the data analytics tasks are performed by clients (on-board), while a central cloud server performs supplementary tasks (off-board). OODIDA can be automatically packaged and deployed, which necessitates restarting parts of the system, or all of it. As this is potentially disruptive, we added the ability to execute user-defined Python modules on clients as well as the server. These modules can be replaced without restarting any part of the system; they can even be replaced between iterations of an ongoing assignment. This feature is referred to as active-code replacement. It facilitates use cases such as iterative A/B testing of machine learning algorithms or modifying experimental algorithms on-the-fly. Various safeguards are in place to ensure that custom code does not have harmful consequences, for instance by limiting the allowed types for return values or prohibiting importing of certain modules of the Python standard library. Consistency of results is achieved by majority vote, which prevents tainted state. Our evaluation shows that active-code replacement can be done in less than a second in an idealized setting whereas a standard deployment takes many orders of magnitude more time. The main contribution of this paper is the description of a relatively straightforward approach to active-code replacement that is very user-friendly. It enables a data analyst to quickly execute custom code on the cloud server as well as on client devices. Sensible safeguards and design decisions ensure that this feature can be used by non-specialists who are not familiar with the implementation of OODIDA in general or this feature in particular. As a consequence of adding the active-code replacement feature, OODIDA is now very well-suited for rapid prototyping.
- Published
- 2020
- Full Text
- View/download PDF
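The active-code replacement mechanism described in the abstract above can be illustrated with a minimal sketch (hypothetical API, not OODIDA's implementation): user code arrives as source text, is screened for prohibited imports, has its return types restricted, and is swapped in between iterations of an ongoing assignment without restarting the host process.

```python
import ast
import types

PROHIBITED = {"os", "subprocess", "socket"}       # illustrative deny-list
ALLOWED_RETURNS = (int, float, list, dict, str)   # illustrative allow-list

def load_user_module(source, name="user_code"):
    """Compile user source into a fresh module after screening its imports."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            mods = ([a.name for a in node.names] if isinstance(node, ast.Import)
                    else [node.module])
            if any((m or "").split(".")[0] in PROHIBITED for m in mods):
                raise ValueError(f"prohibited import in user code: {mods}")
    module = types.ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    return module

def run_iteration(module, batch):
    """One assignment iteration; the user module must define process()."""
    result = module.process(batch)
    if not isinstance(result, ALLOWED_RETURNS):
        raise TypeError(f"disallowed return type: {type(result).__name__}")
    return result

# Hot swap between iterations: v2 replaces v1 without restarting anything.
v1 = load_user_module("def process(xs):\n    return sum(xs)")
v2 = load_user_module("def process(xs):\n    return max(xs)")
```

Running an iteration with `v1` and the next with `v2` changes the analytics code on-the-fly, which is the A/B-testing use case the paper describes; the real system adds result validation by majority vote across clients.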
27. A Polynomial-Time Algorithm to Obtain State Machine Cover of Live and Safe Petri Nets.
- Author
-
Karatkevich, Andrei G. and Wisniewski, Remigiusz
- Subjects
- *
PETRI nets , *ALGORITHMS , *DATA mining , *COMPUTATIONAL complexity , *MATHEMATICAL models , *DISCRETE systems - Abstract
A method to find the minimized state machine cover of live and safe Petri nets (PNs) is proposed in this paper. It is required that a cover be obtained in some implementation methods of PN-based control algorithms and other applications such as data mining. Most of the known methods of its computation have at least exponential time and space complexity; this is true not only for the methods which guarantee a minimal cover is obtained but also for some methods of finding approximate solutions. The proposed algorithm is an approximation algorithm and has polynomial computational complexity. The experimental verification shows a very high efficiency of the presented technique. The proposed method is especially valuable in the case of large systems where a solution is obtained hundreds of times quicker than by other methods. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
28. A High Performance Concurrency Protocol for Smart Contracts of Permissioned Blockchain
- Author
-
Aoying Zhou, Zhao Zhang, Cheqing Jin, Shuaifeng Pang, and Xiaodong Qi
- Subjects
Concurrency control ,Dependency graph ,Computational Theory and Mathematics ,Smart contract ,Computer science ,Distributed computing ,Concurrency ,Node (networking) ,Graph partition ,Concurrent computing ,Graph (abstract data type) ,Computer Science Applications ,Information Systems - Abstract
Although the emergence of the programmable smart contract makes blockchain systems easily embrace a wide range of industrial services, how to execute smart contracts efficiently has become a big challenge nowadays. Due to the existence of Byzantine nodes, existing mature concurrency control protocols in databases cannot be employed directly, since the mechanism of executing smart contracts varies a lot. Furthermore, even though smart contract execution follows a two-phase style, i.e., the primary node executes a batch of smart contracts in the first phase and the validators replay them in the second phase, existing parallel solutions merely focus on optimizing the first phase, rather than the second. In this paper, we propose a novel two-phase concurrency control protocol to optimize both phases for the first time. First, the primary executes transactions in parallel and generates a transaction dependency graph with high parallelism for validators. Then, a graph partition algorithm is devised to divide the original graph into several sub-graphs to preserve parallelism and reduce communication cost remarkably. Finally, we propose a deterministic replay protocol to re-execute the primary's parallel schedule concurrently. Moreover, this two-phase protocol is further optimized by integrating it with PBFT. Theoretical analysis and extensive experimental results illustrate that the proposed scheme outperforms state-of-the-art solutions significantly.
- Published
- 2022
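The two-phase idea in the abstract above can be sketched minimally as follows (an illustration of the general technique; graph partitioning and PBFT integration are omitted): the primary records read/write sets while executing, emits a transaction dependency graph, and a validator replays any topological order of that graph to reach the same final state.

```python
from graphlib import TopologicalSorter

def build_dependency_graph(rw_sets):
    """Edge j -> i (i < j) when transaction j conflicts with an earlier i."""
    graph = {i: set() for i in range(len(rw_sets))}
    for j, (rj, wj) in enumerate(rw_sets):
        for i in range(j):
            ri, wi = rw_sets[i]
            if wj & (ri | wi) or (rj & wi):   # WAR, WAW, or RAW conflict
                graph[j].add(i)               # j must run after i
    return graph

def replay(txns, graph, state):
    """Validator side: any topological order reproduces the primary's result."""
    for i in TopologicalSorter(graph).static_order():
        txns[i][2](state)                     # apply the transaction body
    return state
```

Independent transactions (no edge between them) can be replayed concurrently by the validators, which is where the second-phase parallelism comes from.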
29. A Fast MPEG’s CDVS Implementation for GPU Featured in Mobile Devices
- Author
-
Alessandro Garbo and Stefano Quer
- Subjects
Computer applications ,concurrent computing ,embedded software ,image analysis ,object detection ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The Moving Picture Experts Group's Compact Descriptors for Visual Search (MPEG's CDVS) intends to standardize technologies in order to enable an interoperable, efficient, and cross-platform solution for internet-scale visual search applications and services. Among the key technologies within CDVS, we recall the format of visual descriptors, the descriptor extraction process, and the algorithms for indexing and matching. Unfortunately, these steps require high precision and computational accuracy. Moreover, they are very time-consuming, as they require running times on the order of seconds when implemented on the central processing unit (CPU) of modern mobile devices. In this paper, to reduce computation times while maintaining precision and accuracy, we re-design, for many-core embedded graphics processing units (GPUs), all the main local descriptor extraction pipeline phases of the MPEG's CDVS standard. To reach this goal, we introduce new techniques to adapt the standard algorithm to parallel processing. Furthermore, to reduce memory accesses and efficiently distribute the kernel workload, we use new approaches to store and retrieve CDVS information in proper GPU data structures. We present a complete experimental analysis on a large and standard test set. Our experiments show that our GPU-based approach is remarkably faster than the CPU-based reference implementation of the standard, and it maintains comparable precision in terms of true and false positive rates.
- Published
- 2018
- Full Text
- View/download PDF
30. Analysis of Threading Libraries for High Performance Computing.
- Author
-
Castello, Adrian, Gual, Rafael Mayo, Seo, Sangmin, Balaji, Pavan, Quintana-Orti, Enrique S., and Pena, Antonio J.
- Subjects
- *
HIGH performance computing - Abstract
With the appearance of multi-/many-core machines, applications and runtime systems have evolved in order to exploit the new on-node concurrency brought by new software paradigms. POSIX threads (Pthreads) was widely adopted for that purpose and remains the most used threading solution on current hardware. Lightweight thread (LWT) libraries emerged as an alternative, offering lighter mechanisms to tackle the massive concurrency of current hardware. In this article, we analyze in detail the most representative threading libraries, including Pthread- and LWT-based solutions. In addition, to examine the suitability of LWTs for different use cases, we develop a set of microbenchmarks consisting of OpenMP patterns commonly found in current parallel codes, and we compare the results using threading libraries and OpenMP implementations. Moreover, we study the semantics offered by threading libraries in order to expose the similarities among different LWT application programming interfaces and their advantages over Pthreads. This article shows that LWT libraries outperform solutions based on operating system threads when tasks and nested parallelism are required. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
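The kind of microbenchmark the article above describes can be sketched in Python (a stand-in for the article's native-code benchmarks): the cost of spawning one OS thread per task versus dispatching tasks to a reusable pool, which is the overhead gap that LWT libraries target at much finer granularity.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def task(n=1000):
    """A tiny unit of work standing in for a parallel-region body."""
    return sum(range(n))

def spawn_per_task(num_tasks):
    """One OS thread per task: pays creation/join cost every time."""
    t0 = time.perf_counter()
    threads = [threading.Thread(target=task) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

def pooled(num_tasks, workers=8):
    """Reused worker threads: tasks are dispatched, not spawned."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: task(), range(num_tasks)))
    return time.perf_counter() - t0
```

On most systems the pooled variant wins as task counts grow; the precise ratio is machine-dependent, which is why the article measures a range of OpenMP-style patterns rather than a single number.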
31. An MDE-Based Tool for Early Analysis of UML2.0/PSM Atomic and Composite Components.
- Author
-
Rouis, Taoufik Sakka, Bhiri, Mohamed Tahar, Sliman, Layth, and Kmimech, Mourad
- Abstract
System analysis is a crucial activity throughout component-based architecture design. It enables detecting and correcting errors in the early stages of the system development life cycle. In this article, we consider system analysis in the UML2.0 component-based architectural design phase. This is done by proposing a model-driven engineering (MDE) tool called UML2Ada. It enables the systematic translation of a UML2.0 atomic and composite component into the Ada concurrent language. This, in turn, supports the validation of the source description using Ada dynamic analysis tools such as GNATprove and ObjectAda. In addition, by using an Ada static analysis tool such as FLAVERS or INCA, the proposed tool enables the detection of the potential behavioral concurrency properties of the Ada concurrent program. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
32. Adaptive Model-Based Scheduling in Software Transactional Memory.
- Author
-
Di Sanzo, Pierangelo, Pellegrini, Alessandro, Sannicandro, Marco, Ciciani, Bruno, and Quaglia, Francesco
- Subjects
- *
SCHEDULING software , *MARKOV processes , *AUTHORSHIP , *MEMORY , *PREDICTION models - Abstract
Software Transactional Memory (STM) stands as a powerful concurrent programming paradigm, enabling atomicity and isolation while accessing shared data. On the downside, STM may suffer from performance degradation due to excessive conflicts among concurrent transactions, which waste CPU cycles and energy because of transaction aborts. An approach to cope with this issue consists of putting in place smart scheduling strategies which temporarily suspend the execution of some transactions in order to reduce the transaction conflict rate. In this article, we present an adaptive model-based transaction scheduling technique relying on a Markov chain-based performance model of STM systems. Our scheduling technique is adaptive in a twofold sense: (i) it controls the execution of transactions depending on throughput predictions by the model as a function of the current system state; (ii) it re-tunes the Markov chain-based model on-line to adapt it (and the resulting transaction scheduling decisions) to dynamic variations of the workload. We have been able to achieve the latter thanks to the fact that our performance model is extremely lightweight. In fact, to be recomputed, it requires a reduced set of input parameters whose values can be estimated via a few on-line samples of the current workload dynamics. We also present a scheduler that implements our adaptive technique, which we integrated within the open-source TinySTM package. Further, we report the results of an experimental study based on the STAMP benchmark suite, aimed at assessing both the accuracy of our performance model in predicting the actual system throughput and the advantages of the adaptive scheduling policy over literature techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
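The model-based scheduling idea in the abstract above can be sketched with a deliberately simple analytical stand-in (not the paper's Markov-chain model): predict throughput of m concurrent transactions with pairwise conflict probability p as m·(1-p)^(m-1), admit only the predicted-optimal number, and re-tune as p is re-sampled online from observed commits and aborts.

```python
def predicted_throughput(m, p):
    """Toy model: m runners, each survives if it conflicts with none of the others."""
    return m * (1 - p) ** (m - 1)

def optimal_concurrency(p, max_threads=64):
    """Admit the concurrency level that maximizes predicted throughput."""
    return max(range(1, max_threads + 1), key=lambda m: predicted_throughput(m, p))

class AdaptiveScheduler:
    def __init__(self, max_threads=64):
        self.max_threads = max_threads
        self.commits = 0
        self.aborts = 0

    def record(self, committed):
        """Feed back each transaction outcome (the on-line sampling step)."""
        self.commits += committed
        self.aborts += not committed

    def admitted(self):
        """Re-tuned admission level based on the current conflict estimate."""
        total = self.commits + self.aborts
        p = self.aborts / total if total else 0.0
        return optimal_concurrency(p, self.max_threads)
```

With no observed conflicts the scheduler admits every thread; as the abort rate rises, the admitted concurrency drops, which is the suspend-to-reduce-conflicts behavior the paper automates with a far more accurate model.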
33. Advances in Concurrent Computing for Digital Stochastic Measurement Simulation.
- Author
-
Pjevalica, Nebojsa, Pjevalica, Velibor, and Petrovic, Nenad
- Subjects
- *
FOURIER transforms , *ELECTRIC power distribution grids , *STOCHASTIC analysis , *SIMULATION methods & models , *MEASUREMENT , *ELECTRIC transients , *DOMAIN decomposition methods - Abstract
This paper introduces a concurrent computing technique for the acceleration of digital stochastic measurement simulations. Digital stochastic measurement is an advanced methodology based on a specific parallel hardware structure, utilized for orthogonal transformation calculus/decomposition. The methodology is analyzed in detail, starting from the basic idea and moving toward recent references, covering the main research directions and trends. The oversampling nature of the evaluated digital stochastic measurement, along with demanding arithmetic requirements, leads to exhausting simulation complexity. As a test case, several typical power grid signals were harmonically analyzed through a discrete Fourier transformation based on the proposed methodology. A harmonic decomposition was simulated with several levels of computing concurrency. Across all simulated scenarios, the main success criterion was model accuracy, while the parameter used for selecting the optimal simulation computing technique was overall calculus speed. The final results identified the thread-pool computing technique as the optimal simulation platform. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
34. Optimizing Parallel I/O Accesses through Pattern-Directed and Layout-Aware Replication.
- Author
-
He, Shuibing, Yin, Yanlong, Sun, Xian-He, Zhang, Xuechen, and Li, Zongpeng
- Subjects
- *
SOLID state drives , *DATA replication , *BENEFIT performances , *COMPUTER systems , *PATTERNMAKING - Abstract
As the performance gap between processors and storage devices keeps increasing, I/O performance becomes a critical bottleneck of modern high-performance computing systems. In this paper, we propose a pattern-directed and layout-aware data replication design, named PDLA, to improve the performance of parallel I/O systems. PDLA includes an HDD-based scheme H-PDLA and an SSD-based scheme S-PDLA. For applications with relatively low I/O concurrency, H-PDLA identifies access patterns of applications and makes a reorganized data replica for each access pattern on HDD-based servers with an optimized data layout. Moreover, to accommodate applications with high I/O concurrency, S-PDLA replicates critical access patterns that can bring performance benefits on SSD-based servers or on HDD-based and SSD-based servers. We have implemented the proposed replication scheme under MPICH2 library on top of OrangeFS file system. Experimental results show that H-PDLA can significantly improve the original parallel I/O system performance and demonstrate the advantages of S-PDLA over H-PDLA. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
35. LPM: A Systematic Methodology for Concurrent Data Access Pattern Optimization from a Matching Perspective.
- Author
-
Liu, Yuhang and Sun, Xian-He
- Subjects
- *
COMPUTING platforms , *COMPUTER systems , *PATTERN matching , *MATHEMATICAL optimization - Abstract
As applications become increasingly data intensive, conventional computing systems become increasingly inefficient due to data access performance bottlenecks. While intensive efforts have been made in developing new memory technologies and in designing special purpose machines, there is a lack of solutions for evaluating and utilizing recent hardware advancements to address the memory-wall problem in a systematic way. In this study, we present the memory Layered Performance Matching (LPM) methodology to provide a systematic approach for data access performance optimization. LPM uniquely presents and utilizes the data access concurrency, in addition to data access locality, in a memory hierarchical system. The LPM methodology consists of models and algorithms, and is supported with a series of analytic results for its correctness. The rationale of LPM is to reduce the overall data access delay through the matching of data request rate and data supply rate at each layer of a memory hierarchy, with a balanced consideration of data locality, data concurrency, and latency hiding of data flow. Extensive experimentations on both physical platforms and software simulators confirm our theoretical findings, and they show that the LPM approach can be applied in diverse computing platforms and can effectively guide performance optimization of memory systems. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
36. Simulation of Methane Production from Carbon Dioxide on a Collaborative Research Infrastructure
- Author
-
Martí, Carles, Pacifici, Leonardo, Capriccioli, Andrea, Laganà, Antonio, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Gervasi, Osvaldo, editor, Murgante, Beniamino, editor, Misra, Sanjay, editor, Rocha, Ana Maria A. C., editor, Torre, Carmelo M., editor, Taniar, David, editor, Apduhan, Bernady O., editor, Stankova, Elena, editor, and Wang, Shangguang, editor
- Published
- 2016
- Full Text
- View/download PDF
37. Space-Efficient Subgraph Search Over Streaming Graph With Timing Order Constraint
- Author
-
Lei Zou, Youhuan Li, M. Tamer Özsu, and Dongyan Zhao
- Subjects
Theoretical computer science ,Computational Theory and Mathematics ,Computer science ,Server ,Graph (abstract data type) ,Concurrent computing ,Computer Science Applications ,Information Systems ,Data modeling
- Published
- 2022
38. TridentKV: A Read-Optimized LSM-Tree Based KV Store via Adaptive Indexing and Space-Efficient Partitioning
- Author
-
Changhong Fei, Wei Zhao, Jiguang Wan, Tongliang Deng, Kai Lu, and Nannan Zhao
- Subjects
Non-volatile memory ,Metadata ,Computational Theory and Mathematics ,Computer engineering ,Hardware and Architecture ,Computer science ,Signal Processing ,Search engine indexing ,Adaptive indexing ,Concurrent computing ,Tree based ,Space (commercial competition) ,Associative array
- Published
- 2022
39. Toward Automated Analysis of Electrocardiogram Big Data by Graphics Processing Unit for Mobile Health Application
- Author
-
Xiaomao Fan, Runge Chen, Chenguang He, Yunpeng Cai, Pu Wang, and Ye Li
- Subjects
GPU computing ,mobile health ,automated ECG analysis ,parallel algorithm ,concurrent computing ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
With the rapid development of mobile health technologies and applications in recent years, large amounts of electrocardiogram (ECG) signals that need to be processed in a timely manner have been produced. Although the CPU-based sequential automated ECG analysis algorithm (CPU-AECG) designed for identifying seven types of heartbeats has been in use for years, it is single-threaded, and handling many concurrent ECG signals still poses a severe challenge. In this paper, we propose a novel GPU-based automated ECG analysis algorithm (GPU-AECG) to effectively shorten program execution time. A new concurrency-based GPU-AECG, named cGPU-AECG, is also developed to handle multiple concurrent signals. Compared with the CPU-AECG, our cGPU-AECG achieves a 35-times speedup when handling 24-h-long ECG data, without reducing classification accuracy. With cGPU-AECG, we can handle 24-h ECG signals from thousands of users in a few seconds and provide prompt feedback, which not only greatly improves the user experience of mobile health services, but also reduces the economic cost of building healthcare platforms.
- Published
- 2017
- Full Text
- View/download PDF
40. Enabling Scalable and Extensible Memory-Mapped Datastores in Userspace
- Author
-
Maya Gokhale, Keita Iwabuchi, Roger Pearce, Karim Youssef, and Ivy Bo Peng
- Subjects
mmap ,Computer science ,Concurrency ,Memory-mapped I/O ,computer.software_genre ,Allocator ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,Scalability ,Virtual memory ,Operating system ,Concurrent computing ,computer ,System software
- Abstract
Exascale workloads are expected to incorporate data-intensive processing in close coordination with traditional physics simulations. These emerging scientific, data-analytics and machine learning applications need to access a wide variety of datastores in flat files and structured databases. Programmer productivity is greatly enhanced by mapping datastores into the application process's virtual memory space to provide a unified “in-memory” interface. Currently, memory mapping is provided by system software primarily designed for generality and reliability. However, scalability at high concurrency is a formidable challenge on exascale systems. Also, there is a need for extensibility to support new datastores potentially requiring HPC data transfer services. In this article, we present UMap, a scalable and extensible userspace service for memory-mapping datastores. Through decoupled queue management, concurrency aware adaptation, and dynamic load balancing, UMap enables application performance to scale even at high concurrency. We evaluate UMap in data-intensive applications, including sorting, graph traversal, database operations, and metagenomic analytics. Our results show that UMap as a userspace service outperforms an optimized kernel-based service across a wide range of intra-node concurrency by 1.22-1.9×. We performed two case studies to demonstrate UMap's extensibility. First, a new datastore residing in remote memory is incorporated into UMap as an application-specific plugin. Second, we present a persistent memory allocator Metall built atop UMap for unified storage/memory.
- Published
- 2022
41. The Queuing-First Approach for Tail Management of Interactive Services.
- Author
-
Mirhosseini, Amirhossein and Wenisch, Thomas F.
- Subjects
- *
TAILS , *HICCUPS - Abstract
Managing high-percentile tail latencies is key to designing user-facing cloud services. Rare system hiccups or unusual code paths make some requests take 10×-100× longer than the average. Prior work seeks to reduce tail latency by trying to address primarily the root causes of slow requests. However, often the bulk of requests comprising the tail are not these rare slow-to-execute requests. Rather, due to head-of-line blocking, most of the tail comprises requests enqueued behind slow-to-execute requests. Under high-disparity service distributions, queuing effects drastically magnify the impact of rare system hiccups and can result in high tail latencies even under modest load. We demonstrate that improving the queuing behavior of a system often yields greater benefit than mitigating the individual system hiccups that increase service time tails. We suggest two general directions to improve system queuing behavior–server pooling and common-case service acceleration–and discuss circumstances where each is most beneficial. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
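The central observation of the article above is easy to reproduce with a tiny deterministic simulation (an illustration, not the authors' experiments): in a FIFO single-server queue, a rare slow request inflates the latency of the many ordinary requests enqueued behind it, so the tail is dominated by head-of-line blocking rather than by the intrinsically slow requests themselves.

```python
def fifo_latencies(arrival_gap, services):
    """Latency of each request in a single-server FIFO queue."""
    finish = 0.0
    latencies = []
    for i, svc in enumerate(services):
        arrive = i * arrival_gap
        start = max(arrive, finish)   # wait for the server to drain
        finish = start + svc
        latencies.append(finish - arrive)
    return latencies

# 1000 requests, one per time unit; service time 0.5, but every 200th takes 40.
services = [40.0 if i % 200 == 100 else 0.5 for i in range(1000)]
lat = fifo_latencies(1.0, services)

# Split the >10-unit tail into intrinsically slow requests vs. blocked ones.
slow_tail = sum(l > 10 for i, l in enumerate(lat) if services[i] > 1)
blocked_tail = sum(l > 10 for i, l in enumerate(lat) if services[i] <= 1)
```

Only 5 requests are intrinsically slow, yet dozens of ordinary requests per hiccup exceed the same latency threshold, which is why the article argues for fixing queuing behavior (pooling, common-case acceleration) before chasing individual hiccups.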
42. Fast Wait-Free Construction for Pool-Like Objects with Weakened Internal Order: Stacks as an Example.
- Author
-
Peng, Yaqiong, Yun, Xiaochun, and Hao, Zhiyu
- Subjects
- *
DATA structures , *CONSTRUCTION , *ARCHITECTURE - Abstract
This paper focuses on a large class of concurrent data structures that we call pool-like objects (e.g., stack, double-ended queue, and queue). Performance and progress guarantee are two important characteristics for concurrent data structures. In the aspect of performance, weakening the internal order in a pool-like object is an effective technique to reduce the synchronization cost among threads accessing the object, but no objects with weakened internal order provide a progress guarantee as strong as wait-freedom. Meanwhile, wait-free algorithms tend to be inefficient, which is mainly attributed to the helping mechanisms. Based on the philosophy of existing helping mechanisms, a wait-free pool-like object with weakened internal order would suffer from unnecessary process of getting the latest object state and synchronization. This paper takes a state-of-the-art implementation of stacks with weakened internal order as an example, and transforms it into a highly-efficient wait-free stack named WF-TS-Stack. The transformation method includes a helping mechanism with state reuse and a relaxed removal scheme. In addition, we use a simple and effective scheme to further improve the performance of WF-TS-Stack in Non-Uniform Memory Access (NUMA) architectures. Our evaluation with representative benchmarks shows that WF-TS-Stack outperforms its original building blocks by up to 1.45× at maximum concurrency. We also discuss how to yield an efficient double-ended queue (deque) variant of WF-TS-Stack, because deque is a more generalized pool-like object. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
43. Composable Building Blocks to Open Up Processor Design.
- Author
-
Zhang, Sizhuo, Wright, Andrew, and Bourgeat, Thomas
- Subjects
- *
MICROPROCESSORS , *MODULAR design , *RESEARCH libraries , *DESIGN - Abstract
In this article, we present a framework called Composable Modular Design (CMD) to facilitate the design of out-of-order processors. In CMD, 1) the interface methods of modules provide instantaneous access and perform atomic updates to the state elements inside the module; 2) every interface method is guarded, i.e., it cannot be applied unless it is ready; and 3) modules are composed by atomic rules which call interface methods of different modules. A rule either successfully updates the state of all the called modules or does nothing. The atomicity properties of interfaces in CMD ensure composability when modules are refined selectively. We show the efficacy of CMD by building an out-of-order RISC-V processor which boots Linux. Modules designed using CMD (e.g., ROB, load-store unit, etc.) can be used and refined by other implementations. We believe that this framework can revolutionize architectural research and practice as the library of reusable components grows. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
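The two key CMD ideas in the article above (guarded interface methods, and atomic rules that either update all called modules or do nothing) can be sketched in software with hypothetical names; the real framework targets hardware modules such as the ROB and load-store unit.

```python
class FIFO:
    """A module whose interface methods come with explicit readiness guards."""
    def __init__(self, depth):
        self.depth = depth
        self.items = []

    def can_enq(self):            # guard for enq
        return len(self.items) < self.depth

    def can_deq(self):            # guard for deq
        return bool(self.items)

    def enq(self, x):
        self.items.append(x)

    def deq(self):
        return self.items.pop(0)

def rule_transfer(src, dst):
    """Atomic rule: move one element from src to dst, or do nothing at all."""
    if not (src.can_deq() and dst.can_enq()):   # every guard must be ready
        return False
    dst.enq(src.deq())                          # both state updates commit together
    return True
```

If the destination is full, the rule does not fire at all and the source is left untouched; that all-or-nothing property is what lets modules be refined selectively without breaking their compositions.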
44. Applying Transactional Memory for Concurrency-Bug Failure Recovery in Production Runs.
- Author
-
Chen, Yuxi, Wang, Shu, Lu, Shan, and Sankaralingam, Karthikeyan
- Subjects
- *
TRANSACTIONAL analysis , *HARDWARE , *TEXT messages , *SYNCHRONIZATION software , *FEASIBILITY studies - Abstract
Concurrency bugs widely exist and severely threaten system availability. Techniques that help recover from concurrency-bug failures during production runs are highly desired. This paper proposes BugTM, an approach that applies transactional memory techniques for concurrency-bug recovery in production runs. Requiring no knowledge about where concurrency bugs are, BugTM uses static analysis and code transformation to enable BugTM-transformed software to recover from a concurrency-bug failure by rolling back and re-executing the recent history of a failure thread. BugTM is instantiated as three schemes that have different trade-offs in performance and recovery capability: BugTM-H uses existing hardware transactional memory (HTM) support, BugTM-S leverages software transactional memory techniques, and BugTM-HS is a software-hardware hybrid design. BugTM greatly improves the recovery capability of state-of-the-art techniques with low run-time overhead and no changes to OS or hardware, while guaranteeing not to introduce new bugs. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
45. The benefits of concurrent computing in tribology system design.
- Author
-
Wang, Nenzi, Chen, Hsin-Yi, Chen, Yu-Wen, and Lin, Yu-Chun
- Subjects
- *
TRIBOLOGY , *COMPUTER operating systems , *PROGRAMMING languages , *COMPUTER programming , *COMPUTATION laboratories - Abstract
Many optimum designs of tribological components are highly time-constrained before final production. It is well known that the process of a complex simulated design can be considerably accelerated by some form of parallel computing. For many tribological models, additional assumptions can also be relaxed and stricter design constraints imposed if execution can be sped up. In this study, concurrent computing for tribological design is proposed, in which parallel computing is performed using the multitasking capability of today's operating systems. In this approach, a master program, which manages the optimization process, launches a number of standalone slave programs (air-bearing models) in quick succession, and the operating system (MS-Windows) manages their parallel execution. Other than a standard programming language (Fortran 95), this approach uses none of the general parallel programming paradigms or directives, such as the message passing interface, OpenMP, or coding for graphics processing units. In this study, the algorithm for the multiobjective optimization is the group inching fortification method, and the concurrent computing is executed at the algorithm level. High parallel computing speedups are obtained in the simulated bearing designs. The approach can also be applied when using commercial general-purpose packages for modelling and self-coded methods for optimum design of tribological components or systems. Highlights • Concurrent computing for tribological optimization is accomplished through the multitasking ability of the operating system. • Concurrent computing for tribological studies is not restricted to parallel programming languages or directives. • Concurrent computing can handle variable computing loads and parallel tasks in tribological optimizations. [ABSTRACT FROM AUTHOR]
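The master/slave pattern described above — a master launching standalone slave programs and letting the OS schedule them concurrently — can be mimicked in a short sketch. The paper uses Fortran 95 slave programs under MS-Windows; this hypothetical Python version launches separate processes via `subprocess`, with a trivial placeholder standing in for an air-bearing model.

```python
import subprocess
import sys

# Stand-in for a standalone air-bearing simulation (hypothetical):
# reads two design parameters from argv and prints a figure of merit.
SLAVE = "import sys; c, s = map(float, sys.argv[1:]); print(c * s)"

def master(candidate_designs):
    # Launch one standalone slave process per design in quick
    # succession; the operating system multitasks them in parallel.
    procs = [subprocess.Popen(
                 [sys.executable, "-c", SLAVE, str(c), str(s)],
                 stdout=subprocess.PIPE, text=True)
             for c, s in candidate_designs]
    # Collect each slave's printed result once it finishes.
    return [float(p.communicate()[0]) for p in procs]

results = master([(1.0, 2.0), (1.5, 2.0), (2.0, 3.0)])
```

As in the paper's approach, no MPI, OpenMP, or GPU directives are involved; the only concurrency mechanism is the OS scheduling independent processes.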
- Published
- 2019
- Full Text
- View/download PDF
46. Integrating Concurrency Control in n-Tier Application Scaling Management in the Cloud.
- Author
-
Wang, Qingyang, Chen, Hui, Zhang, Shungeng, Hu, Liting, and Palanisamy, Balaji
- Subjects
- *
ELECTRONIC commerce , *RESOURCE management , *DATABASES , *WEB-based user interfaces , *CLOUD computing - Abstract
Scaling complex distributed systems such as e-commerce platforms is an important practice for simultaneously achieving high performance and high resource efficiency in the cloud. Most previous research focuses on hardware resource scaling to handle runtime workload variation. Through extensive experiments using a representative n-tier web application benchmark (RUBBoS), we demonstrate that scaling an n-tier system by adding or removing VMs without appropriately re-allocating soft resources (e.g., server threads and connections) may lead to significant performance degradation resulting from an implicit change of request processing concurrency in the system, causing either over- or under-utilization of the critical hardware resource. We build a concurrency-aware model that determines a near-optimal soft resource allocation for each tier by combining operational queuing laws with fine-grained online measurement data of the system. We then develop a dynamic concurrency management (DCM) framework that integrates the concurrency-aware model to intelligently reallocate soft resources during the system scaling process. We compare DCM with Amazon EC2-AutoScale, the state-of-the-art hardware-only scaling management solution, using six real-world bursty workload traces. The experimental results show that DCM achieves significantly shorter tail latency and higher throughput than Amazon EC2-AutoScale under all the workload traces. [ABSTRACT FROM AUTHOR]
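One way to see how operational queuing laws can size a tier's soft resources is a small sketch. This is not the paper's concurrency-aware model; it is a hedged illustration using two standard operational laws (Little's law, N = X·R, and the Utilization Law, U = X·S) with made-up numbers.

```python
def soft_resource_limit(throughput_rps, service_time_s, target_util=0.9):
    """Rough per-tier concurrency limit from operational laws.

    throughput_rps: measured request throughput X (requests/second)
    service_time_s: measured per-request service demand S (seconds)
    target_util:    desired utilization of the bottleneck resource
    """
    # Little's law: requests resident in the tier at this operating point.
    residency = throughput_rps * service_time_s
    # Leave head-room so the critical hardware resource is not saturated.
    return max(1, round(residency / target_util))

# Example (illustrative numbers): 300 req/s, 10 ms service time.
threads = soft_resource_limit(300, 0.010)
```

The point mirrors the abstract's finding: the right thread/connection limit depends on throughput and service demand, so it must be re-derived whenever VMs are added or removed.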
- Published
- 2019
- Full Text
- View/download PDF
47. A Distributed Heterogeneous Inspection System for High Performance Inline Surface Defect Detection.
- Author
-
Yu-Cheng Chou, Wei-Chieh Liao, Yan-Liang Chen, Ming Chang, and Po Ting Lin
- Subjects
HIGH performance computing ,SURFACE defects ,TURNAROUND time - Abstract
This paper presents the Distributed Heterogeneous Inspection System (DHIS), which comprises two CUDA workstations and is equipped with CPU distributed computing, CPU concurrent computing, and GPU concurrent computing functions. Thirty-two grayscale images, each with 5,000×12,288 pixels and simulated defect patterns, were created to evaluate the performances of three system configurations: (1) DHIS; (2) two CUDA workstations with CPU distributed computing and GPU concurrent computing; (3) one CUDA workstation with GPU concurrent computing. Experimental results indicated that: (1) only DHIS can satisfy the time limit, and the average turnaround time of DHIS is 37.65% of the time limit; (2) a good linear relationship exists between the processing speed ratio and the instruction sequence quantity ratio. [ABSTRACT FROM AUTHOR]
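The reported linear relationship between processing-speed ratio and task-quantity ratio suggests a simple proportional work partition across the heterogeneous workstations. The sketch below is not the DHIS implementation, just a hypothetical illustration of splitting an inspection batch in proportion to measured speeds.

```python
def partition_images(n_images, speed_a, speed_b):
    """Assign image counts proportional to each workstation's speed."""
    share_a = round(n_images * speed_a / (speed_a + speed_b))
    return share_a, n_images - share_a

# Example: 32 images, workstation A measured 1.5x faster than B
# (speeds are illustrative, not from the paper).
a, b = partition_images(32, 1.5, 1.0)
```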
- Published
- 2019
48. BQ: A Lock-Free Queue with Batching
- Author
-
Yossi Lev, Gal Milman, Erez Petrank, Victor Luchangco, and Alex Kogan
- Subjects
Multi-core processor ,Linearizability ,Computer science ,Concurrent data structure ,Software engineering ,Computer and information sciences ,Engineering and technology ,Parallel computing ,Data structure ,Natural sciences ,Computer Science Applications ,Computational Theory and Mathematics ,Hardware and Architecture ,Modeling and Simulation ,Electrical engineering, electronic engineering, information engineering ,Non-blocking algorithm ,Concurrent computing ,Performance improvement ,Queue ,Software - Abstract
Concurrent data structures provide fundamental building blocks for concurrent programming. Standard concurrent data structures may be extended by allowing a sequence of operations to be submitted as a batch for later execution. A sequence of such operations can then be executed more efficiently than the standard execution of one operation at a time. In this article, we develop a novel algorithmic extension to the prevalent FIFO queue data structure that exploits such batching scenarios. An implementation in C++ on a multicore demonstrates significant performance improvement of more than an order of magnitude (depending on the batch lengths and the number of threads) compared to previous queue implementations.
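The batching idea — buffer a sequence of operations locally, then apply them to the shared queue in one combined step — can be sketched as follows. This toy is not the lock-free C++ BQ algorithm; it is a hypothetical single-threaded illustration of why executing a batch at once beats one-operation-at-a-time execution (one traversal of the shared structure instead of many synchronized accesses).

```python
from collections import deque

class BatchingQueue:
    """Toy FIFO queue with batched operation submission (illustrative)."""
    def __init__(self):
        self.q = deque()           # the shared queue
        self.batch = []            # locally buffered pending operations

    def enqueue(self, x):          # buffered; not yet visible in self.q
        self.batch.append(("enq", x))

    def dequeue(self):             # buffered; result produced on execute
        self.batch.append(("deq", None))

    def execute_batch(self):
        """Apply all pending operations in one pass; return dequeue results."""
        results = []
        for op, arg in self.batch:
            if op == "enq":
                self.q.append(arg)
            else:
                results.append(self.q.popleft() if self.q else None)
        self.batch.clear()
        return results

bq = BatchingQueue()
for i in range(3):
    bq.enqueue(i)
bq.dequeue()
out = bq.execute_batch()           # three enqueues + one dequeue, applied together
```

In the real BQ, the batch is made visible with lock-free synchronization and the combined step is what yields the order-of-magnitude speedup the abstract reports.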
- Published
- 2022
49. Modeling Player Knowledge in a Parallel Programming Educational Game
- Author
-
Katelyn Bright Alderfer, Jichen Zhu, Brian K. Smith, Santiago Ontañón, Pavan Kantharaju, and Bruce W. Char
- Subjects
Computer science ,Context (language use) ,Tracing ,Artificial Intelligence ,Control and Systems Engineering ,Software deployment ,Human–computer interaction ,ComputingMilieux_COMPUTERSANDEDUCATION ,Domain knowledge ,Concurrent computing ,Electrical and Electronic Engineering ,Set (psychology) ,Software ,Educational game - Abstract
This paper focuses on tracing player knowledge in educational games. Specifically, given a set of concepts or skills required to master a game, the goal is to estimate the likelihood that the current player has mastered each of those concepts or skills. The main contribution of the work is an approach that integrates machine learning and domain knowledge rules to detect when the player applied a certain skill and either succeeded or failed. This is then given as input to a standard knowledge tracing module (such as those from Intelligent Tutoring Systems) to perform knowledge tracing. We evaluate our approach in the context of an educational game called Parallel, designed to teach parallel and concurrent programming, with data collected from real users, showing that our approach can predict students' skills with a low mean-squared error. We also provide results from our initial deployment of the system in a classroom environment.
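A standard knowledge-tracing module of the kind the abstract refers to is Bayesian Knowledge Tracing (BKT), which updates the probability of skill mastery after each observed success or failure. The sketch below is the textbook BKT update, not the paper's model; the slip/guess/learn parameter values are illustrative.

```python
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.15):
    """Posterior P(skill mastered) after one observed attempt.

    slip:  P(wrong answer | skill mastered)
    guess: P(right answer | skill not mastered)
    learn: P(acquiring the skill during this step)
    """
    if correct:
        cond = (p_know * (1 - slip)
                / (p_know * (1 - slip) + (1 - p_know) * guess))
    else:
        cond = (p_know * slip
                / (p_know * slip + (1 - p_know) * (1 - guess)))
    # Account for the chance of learning during the step.
    return cond + (1 - cond) * learn

# Trace a player's mastery estimate over three observed attempts.
p = 0.3
for observed_correct in [True, True, False]:
    p = bkt_update(p, observed_correct)
```

The paper's contribution sits upstream of this update: using ML plus domain rules to decide, from raw gameplay, when a skill was actually applied and whether it succeeded.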
- Published
- 2022
50. CloudRaid: Detecting Distributed Concurrency Bugs via Log Mining and Enhancement
- Author
-
Lian Li, Feng Li, Chen Liu, Xiaobing Feng, Jie Lu, and Jingling Xue
- Subjects
Service (systems architecture) ,Computer science ,business.industry ,Distributed computing ,Concurrency ,Cloud computing ,Data loss ,Yarn ,Software bug ,visual_art ,visual_art.visual_art_medium ,Concurrent computing ,Overhead (computing) ,business ,Software - Abstract
Cloud systems suffer from distributed concurrency bugs, which often lead to data loss and service outages. This paper presents CLOUDRAID, a new automatic tool for finding distributed concurrency bugs efficiently and effectively. Distributed concurrency bugs are notoriously difficult to find as they are triggered by untimely interactions among nodes, i.e., unexpected message orderings. To detect them efficiently and effectively, CLOUDRAID analyzes and tests automatically only the message orderings that are likely to expose errors. Specifically, CLOUDRAID mines the logs from previous executions to uncover message orderings that are feasible but inadequately tested. In addition, we propose a log-enhancing technique that automatically introduces new logs into the system being tested. These extra logs further improve the effectiveness of CLOUDRAID without introducing any noticeable performance overhead. This log-based approach makes CLOUDRAID well-suited for live systems. We have applied CLOUDRAID to six representative distributed systems: Apache Hadoop2/Yarn, HBase, HDFS, Cassandra, Zookeeper, and Flink. CLOUDRAID has succeeded in testing 60 different versions of these six systems (10 versions per system) in 35 hours, uncovering 31 concurrency bugs, including nine that have never been reported before. All nine new bugs have been confirmed by their original developers; three are critical and have already been fixed.
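The core log-mining idea — find message orderings that are feasible but inadequately tested — can be sketched with a toy. This is not CLOUDRAID's analysis; it is a hypothetical illustration: for each pair of logged messages, record which relative orders were observed across runs, and flag pairs seen in only one order, since the flipped ordering is a candidate to force-test.

```python
from collections import defaultdict

def candidate_orderings(runs):
    """runs: list of message-id sequences, one per logged execution.

    Returns message pairs observed in only a single relative order;
    the unobserved order is a candidate schedule to test.
    """
    seen = defaultdict(set)                 # (a, b) -> subset of {"ab", "ba"}
    for log in runs:
        for i, a in enumerate(log):
            for b in log[i + 1:]:
                key = tuple(sorted((a, b)))
                seen[key].add("ab" if (a, b) == key else "ba")
    return sorted(k for k, orders in seen.items() if len(orders) == 1)

# Two logged executions of a toy protocol (hypothetical message ids).
runs = [["open", "write", "close"],
        ["open", "close", "write"]]
pairs = candidate_orderings(runs)
```

Here `write`/`close` appeared in both orders, so that pair is adequately tested, while the remaining pairs were only ever seen one way; CLOUDRAID's analysis additionally checks that the flipped ordering is actually feasible before testing it.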
- Published
- 2022