3,088 results for "PARALLEL processing"
Search Results
2. Tree Learning: Towards Promoting Coordination in Scalable Multi-Client Training Acceleration
- Author
-
Guo, Tao, Guo, Song, Wu, Feijie, Xu, Wenchao, Zhang, Jiewei, Zhou, Qihua, Chen, Quan, and Zhuang, Weihua
- Abstract
Iteration-based collaborative learning (CL) paradigms, such as federated learning (FL) and split learning (SL), face challenges in training neural models over the rapidly growing yet resource-constrained edge devices. Such devices have difficulty accommodating a full-size large model for FL or affording the excessive waiting time of the mandatory synchronization step in SL. To deal with these challenges, we propose a novel CL framework that adopts a tree-aggregation structure with an adaptive partition and ensemble strategy to achieve optimal synchronization and fast convergence at scale. To find the optimal split point for heterogeneous clients, we also design a novel partitioning algorithm that minimizes idleness during communication and achieves optimal synchronization between clients. In addition, a parallelism paradigm is proposed to unleash the potential of optimal synchronization between the clients and the server, boosting the distributed training process without losing model accuracy on edge devices. Furthermore, we theoretically prove that our framework can achieve a better convergence rate than state-of-the-art CL paradigms. We conduct extensive experiments and show that our framework is 4.6× faster in training than traditional methods, without compromising training accuracy. (A brief illustrative sketch of tree-style aggregation follows this entry.)
- Published
- 2024
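For orientation only, here is a minimal sketch of the general idea of aggregating client model updates up a tree instead of all at once at a central server. The pairwise, weight-carrying reduction below is an assumption made for illustration; it is not the paper's partitioning or ensemble strategy.

```python
# Hedged sketch: reduce client model updates level by level up a tree.
# Carrying a weight with each partial aggregate keeps the final result equal
# to the plain global average, regardless of how clients are paired.
from typing import List, Tuple

def tree_aggregate(updates: List[List[float]]) -> List[float]:
    level: List[Tuple[List[float], int]] = [(u, 1) for u in updates]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (u, wu), (v, wv) = level[i], level[i + 1]
            merged = [(wu * a + wv * b) / (wu + wv) for a, b in zip(u, v)]
            nxt.append((merged, wu + wv))
        if len(level) % 2:          # odd client out is promoted to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0][0]

if __name__ == "__main__":
    clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(tree_aggregate(clients))   # [5.0, 6.0] == element-wise mean
```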
3. Parallel and Distributed Bayesian Network Structure Learning
- Author
-
Yang, Jian, Jiang, Jiantong, Wen, Zeyi, and Mian, Ajmal
- Abstract
Bayesian networks (BNs) are graphical models representing uncertainty in causal discovery, and have been widely used in medical diagnosis and gene analysis due to their effectiveness and good interpretability. However, mainstream BN structure learning methods are computationally expensive, as they must perform numerous conditional independence (CI) tests to decide the existence of edges. Some researchers attempt to accelerate the learning process through parallelism, but face issues including load imbalance and costly parallelism overhead. We propose a multi-threaded method, namely Fast-BNS version 1 (Fast-BNS-v1 for short), on multi-core CPUs to enhance the efficiency of BN structure learning. Fast-BNS-v1 incorporates a series of efficiency optimizations, including a dynamic work pool for better scheduling, grouping of CI tests to avoid unnecessary operations, cache-friendly data storage to improve memory efficiency, and on-the-fly conditioning-set generation to avoid extra memory consumption. To further boost learning performance, we develop a two-level parallel method, Fast-BNS-v2, by integrating edge-level parallelism with multiple processes and CI-level parallelism with multiple threads. Fast-BNS-v2 is equipped with careful optimizations including dynamic work stealing for load balancing, SIMD edge-list deletion for list updating, and effective communication policies for synchronization. Comprehensive experiments show that Fast-BNS achieves 9 to 235 times speedup over the state-of-the-art multi-threaded method on a single machine. When running on multiple machines, it further reduces the execution time of the single-machine implementation by 80%. (A brief work-pool sketch follows this entry.)
- Published
- 2024
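As an illustration of the dynamic work-pool idea only (not Fast-BNS itself), the sketch below has worker threads pull CI-test jobs from a shared queue so that faster threads simply grab more work. The `ci_test` function is a placeholder assumption, not a real statistical test.

```python
# Hedged sketch of a dynamic work pool for conditional independence (CI) tests.
import queue
import threading
from itertools import combinations

def ci_test(x, y, cond_set):
    """Placeholder CI test; a real implementation would run e.g. a chi^2 test."""
    return (x + y + len(cond_set)) % 2 == 0

def worker(jobs, results, lock):
    while True:
        try:
            x, y, cond = jobs.get_nowait()   # dynamic scheduling: idle threads
        except queue.Empty:                  # immediately grab the next job
            return
        independent = ci_test(x, y, cond)
        with lock:
            results[(x, y, cond)] = independent
        jobs.task_done()

def run_ci_tests(n_vars=6, max_cond=1, n_threads=4):
    jobs, results, lock = queue.Queue(), {}, threading.Lock()
    variables = range(n_vars)
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        for k in range(max_cond + 1):
            for cond in combinations(others, k):
                jobs.put((x, y, cond))
    threads = [threading.Thread(target=worker, args=(jobs, results, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(len(run_ci_tests()), "CI tests evaluated")
```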
4. Methods for acceleration, and stability assessment of single-rate and multi-rate electromagnetic transient simulations
- Author
-
Marti, Jose (University of British Columbia), Zhang, Yi (Electrical and Computer Engineering), Filizadeh, Shaahin (Electrical and Computer Engineering), Gole, Aniruddha, and Sinkar, Ajinkya
- Abstract
This thesis aims to investigate methods for speeding up single-rate EMT simulations of large power networks, as well as to develop a rigorous analytical approach for assessing the stability of multi-rate EMT simulations. Firstly, an alternative method for formulating the equations of a network for EMT simulations is presented. It uses descriptor state-space equations (DSE) to represent the dynamical equations of a circuit. A procedure for interfacing a DSE-based formulation with a companion-circuits-based EMT simulator is also developed. This procedure enables the interfacing of arbitrary power networks with any commercial CC-based EMT simulation package and can also be used to speed up the simulation using parallel processing. Subsequently, two of the commonly used sparse-matrix-based parallelization methods are adapted and used for speeding up DSE-based simulations, and their computational performance is compared. The first method transforms a sparse matrix into Block Diagonal (BD) form, whereas the second transforms it into Bordered Block Diagonal (BBD) form. Next, a novel universally passive delay-based interface is developed that uses existing inductors in the circuit to partition the network in EMT simulations. It allows for simulation speed-up when the solution of the partitioned network is computed on a parallel computing platform. It is shown that the proposed interface has superior accuracy compared to other existing inductor-based partitioning approaches and is guaranteed to be passive, thus benefiting the numerical stability of the simulation. Finally, a novel approach for the stability assessment of multi-rate EMT simulations of linear time-invariant (LTI) circuits is developed. Firstly, it is demonstrated that multi-rate EMT simulations can produce unstable results for stable continuous-time LTI circuits even when the well-known A-stable trapezoidal integration method is used. Further, it is shown that such simulations always yield a periodically varying
- Published
- 2023
5. Hybrid and parallel-computing methods for optimization of power systems with electromagnetic transient simulators
- Author
-
McNeill, Dean (Electrical and Computer Engineering), Muthumuni, Dharshana (Electrical and Computer Engineering), Gole, Aniruddha (Electrical and Computer Engineering), Filizadeh, Shaahin, and Kuranage, Dilsha
- Abstract
This thesis introduces new methods for using electromagnetic transient (EMT) simulators to efficiently optimize controllers of power electronic converters in power systems with complicated dynamic behavior. This work is motivated by several challenges that must be overcome during the design process, including the high computational burden of simulating large switching systems, the repetitive nature of the design cycle, and the large number of variables that need to be handled. These challenges are addressed in this research by combining an EMT simulator with optimization algorithms and by developing novel approaches to reduce the overall simulation time. Two screening methods are introduced in this thesis that can identify non-influential parameters so that the number of parameters to be optimized can be reduced, thus decreasing the computational burden of the process. Moreover, multi-algorithm and parallel-processing techniques are developed to achieve additional computational benefits by making the design process faster. In this research, new pathways are created to solve simulation-based design problems with a large number of parameters by amalgamating all the above approaches. Several power system examples are simulated using PSCAD/EMTDC, and the accuracy and efficiency of the proposed methods are assessed and confirmed. The results show significant reductions in the time to design optimal systems without compromising the quality of the optimal performance.
- Published
- 2023
6. Low-power Implementation of Neural Network Extension for RISC-V CPU
- Author
-
Lo Presti Costantino, Dario
- Abstract
Deep learning and neural networks have been studied and developed for many years, but there is still a great need for research in this field because industry needs are changing rapidly. The new challenge in this field is called edge inference: the deployment of deep learning on small, simple and cheap devices, such as low-power microcontrollers. At the same time, in the field of hardware design the industry is moving towards the RISC-V microarchitecture, which is open source and is developing at such a fast rate that it will soon become the standard. A batteryless, ultra-low-power microcontroller based on energy harvesting and the RISC-V microarchitecture is the final target device of this thesis. The challenge on which this project is based is to make a simple neural network work on this chip, i.e., to find out the capabilities and limits of this chip for such an application and to optimize the power and energy consumption as much as possible. To do that, TensorFlow Lite Micro was chosen as the deep learning framework of reference, and a simple existing application was studied and tested first on the SparkFun Edge board and then successfully ported to the RISC-V ONiO.zero core, with its restrictive features. The optimizations were applied only to the convolutional layer of the neural network, both in software, by implementing the im2col algorithm, and in hardware, by designing and implementing a new RISC-V instruction and the corresponding hardware unit that performs four 8-bit multiply-and-accumulate operations in parallel. This new design drastically reduces both the inference time (3.7× reduction) and the number of instructions executed (4.8× reduction), meaning lower overall power consumption. This kind of application on this type of chip can open the doors to a whole new market, giving the possibility of having thousands of small, cheap and self-sufficient chips deploying deep learning applications. (A brief sketch of the packed multiply-and-accumulate idea follows this entry.)
- Published
- 2023
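To make the hardware idea concrete, here is a software-only emulation of a four-lane 8-bit multiply-and-accumulate: four signed bytes are packed into each 32-bit operand, multiplied lane-wise, and summed into an accumulator. The packing helpers are illustrative assumptions; this is not the thesis's RISC-V instruction or its hardware unit.

```python
# Hedged, software-only emulation of a 4-lane 8-bit multiply-and-accumulate.
import numpy as np

def pack_bytes(lanes):
    """Pack four small signed integers into one 32-bit little-endian word."""
    return int.from_bytes(np.array(lanes, dtype=np.int8).tobytes(), "little")

def mac4(acc: int, packed_a: int, packed_b: int) -> int:
    """acc += a0*b0 + a1*b1 + a2*b2 + a3*b3 over the packed 8-bit lanes."""
    a = np.frombuffer(packed_a.to_bytes(4, "little"), dtype=np.int8)
    b = np.frombuffer(packed_b.to_bytes(4, "little"), dtype=np.int8)
    return acc + int(np.dot(a.astype(np.int32), b.astype(np.int32)))

if __name__ == "__main__":
    acc = mac4(0, pack_bytes([1, -2, 3, 4]), pack_bytes([5, 6, -7, 8]))
    print(acc)   # 1*5 + (-2)*6 + 3*(-7) + 4*8 = 4
```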
7. Impact of PS Load on FPGA Object Detection System Performance
- Author
-
Watanabe, Yusuke and Tamukoh, Hakaru
- Abstract
Field-programmable gate array (FPGA) devices with the Zynq architecture have become popular in recent years. Zynq integrates a processing system (PS) and programmable logic (PL) into a single chip. While we tend to focus on the performance of the PL, the PS load cannot be ignored completely. In this paper, using a Zynq FPGA board, we explore how the performance of our object detection system changes with PS load and report our experimental results. Presented at The 2023 International Conference on Artificial Life and Robotics (ICAROB 2023), February 9-12, 2023, online, Oita, Japan.
- Published
- 2023
9. The Tiny-Tasks Granularity Trade-Off: Balancing Overhead vs. Performance in Parallel Systems
- Author
-
Bora, Stefan, Walker, Brenton, and Fidler, Markus
- Abstract
Models of parallel processing systems typically assume that one has l workers and that jobs are split into an equal number of k = l tasks. Splitting jobs into k > l smaller tasks, i.e. using 'tiny tasks', can yield performance and stability improvements because it reduces the variance in the amount of work assigned to each worker, but as k increases, the overhead involved in scheduling and managing the tasks begins to overtake the performance benefit. We perform extensive experiments on the effects of task granularity on an Apache Spark cluster and, based on these, develop a four-parameter model for task and job overhead that, in simulation, produces sojourn time distributions that match those of the real system. We also present analytical results which illustrate how using tiny tasks improves the stability region of split-merge systems, and analytical bounds on the sojourn and waiting time distributions of both split-merge and single-queue fork-join systems with tiny tasks. Finally, we combine the overhead model with the analytical models to produce an analytical approximation to the sojourn and waiting time distributions of systems with tiny tasks which include overhead. We also perform analogous tiny-tasks experiments on a hybrid multi-processor shared-memory system based on MPI and OpenMP which has no load balancing between nodes. Though no longer strict analytical bounds, our analytical approximations with overhead match both the Spark and MPI/OpenMP experimental results very well. (A small simulation sketch of the granularity trade-off follows this entry.)
- Published
- 2023
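A toy simulation of the trade-off discussed above: more tasks per job (larger k) smooth out worker imbalance but pay a per-task cost. The fixed per-task overhead and the greedy placement rule are illustrative assumptions, not the paper's fitted four-parameter overhead model.

```python
# Toy model of the tiny-tasks granularity trade-off for a single fork-join job.
import random

def job_completion_time(work=100.0, k=8, workers=4, per_task_overhead=0.5, rng=None):
    rng = rng or random.Random(0)
    # split the job into k tasks of random size summing (roughly) to `work`,
    # each paying a fixed scheduling/management overhead
    weights = [rng.random() for _ in range(k)]
    tasks = [work * w / sum(weights) + per_task_overhead for w in weights]
    # greedy list scheduling: each task goes to the currently least-loaded worker
    loads = [0.0] * workers
    for t in sorted(tasks, reverse=True):
        loads[loads.index(min(loads))] += t
    return max(loads)   # fork-join: the job finishes when its last task does

if __name__ == "__main__":
    for k in (4, 16, 64, 256):
        print(k, round(job_completion_time(k=k), 2))
```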
11. Worst case execution time and power estimation of multicore and GPU software: a pedestrian detection use case
- Author
-
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Barcelona Supercomputing Center, Rodríguez Ferrández, Iván, Jover Álvarez, Álvaro, Trompouki, Matina Maria, Kosmidis, Leonidas, and Cazorla Almeida, Francisco Javier
- Abstract
Worst-case execution time estimation of software running on parallel platforms is a challenging task, due to resource interference from other tasks and the complexity of the underlying CPU and GPU hardware architectures. Similarly, the increased complexity of the hardware challenges the estimation of worst-case power consumption. In this paper, we employ Measurement-Based Probabilistic Timing Analysis (MBPTA), which is capable of managing complex architectures such as multicores. We enable its use through software randomisation, which we show, for the first time, is also possible on GPUs. We demonstrate our method on a pedestrian detection use case on an embedded multicore and GPU platform for the automotive domain, the NVIDIA Xavier. Moreover, we extend our measurement-based probabilistic method in order to predict the worst-case power consumption of the software on the same platform. This work was funded by the Ministerio de Ciencia e Innovación - Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033 and IJC-2020-045931-I), the European Commission's Horizon 2020 programme under the UP2DATE project (grant agreement 871465), an ERC grant (No. 772773) and the HiPEAC Network of Excellence.
- Published
- 2023
12. Real-time Data Processing for Ultrafast X-Ray Computed Tomography using Modular CUDA based Pipelines
- Author
-
(0000-0003-3558-5750) Windisch, D., (0000-0003-1761-2591) Kelling, J., (0000-0002-9935-4428) Juckeland, G., and (0000-0003-3428-5019) Bieberle, A.
- Abstract
In this article, a new version of the Real-time Image Stream Algorithms (RISA) data processing suite is introduced. It now features online detector data acquisition, high-throughput data dumping and enhanced real-time data processing capabilities. The achieved low-latency real-time data processing extends the application of ultrafast electron beam X-ray computed tomography (UFXCT) scanners to real-time scanner control and process control. We implemented high-performance data packet reception based on the Data Plane Development Kit (DPDK) and high-throughput data storage using both the Hierarchical Data Format version 5 (HDF5) and the Adaptable Input/Output System version 2 (ADIOS2). Furthermore, we extended RISA's underlying pipelining framework to support the fork-join paradigm. This allows for more complex workflows, as is necessary, e.g., for online data processing. Also, the pipeline configuration is moved from compile time to runtime, i.e., processing stages and their interconnections can now be configured using a configuration file. In several benchmarks, RISA is profiled regarding data acquisition performance, data storage throughput and overall processing latency. We found that using direct I/O mode significantly improves data writing performance on the local data storage. We further show that RISA is now capable of processing data from up to 768 detector channels (3072 MB/s) at 8000 fps on a single-GPU computer in real time. (A small fork-join pipeline sketch follows this entry.)
- Published
- 2023
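To illustrate the fork-join pipelining and runtime-configuration ideas in plain software terms, the sketch below wires thread stages together with queues and picks the branch functions from a dictionary at runtime. It mimics the shape of such a pipeline only; RISA's CUDA stages and configuration file format are not modeled.

```python
# Hedged sketch of a runtime-configured fork-join pipeline built from threads.
import queue
import threading

SENTINEL = object()   # marks end of stream

def run_fork(inq, outqs):
    """Duplicate every input item to all branch queues (the 'fork')."""
    while True:
        item = inq.get()
        for q in outqs:
            q.put(item)
        if item is SENTINEL:
            return

def run_stage(fn, inq, outq):
    """Apply fn to each item and forward it downstream."""
    while True:
        item = inq.get()
        if item is SENTINEL:
            outq.put(SENTINEL)
            return
        outq.put(fn(item))

def run_join(inqs, outq):
    """Combine one result from each branch into a tuple (the 'join')."""
    while True:
        items = [q.get() for q in inqs]
        if any(i is SENTINEL for i in items):
            outq.put(SENTINEL)
            return
        outq.put(tuple(items))

def run_pipeline(config, source):
    # config maps branch name -> per-item function, chosen at runtime
    q_in, q_out = queue.Queue(), queue.Queue()
    branch_in = [queue.Queue() for _ in config]
    branch_out = [queue.Queue() for _ in config]
    threads = [threading.Thread(target=run_fork, args=(q_in, branch_in))]
    for q_i, q_o, fn in zip(branch_in, branch_out, config.values()):
        threads.append(threading.Thread(target=run_stage, args=(fn, q_i, q_o)))
    threads.append(threading.Thread(target=run_join, args=(branch_out, q_out)))
    for t in threads:
        t.start()
    for item in source:
        q_in.put(item)
    q_in.put(SENTINEL)
    results = []
    while True:
        r = q_out.get()
        if r is SENTINEL:
            break
        results.append(r)
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    cfg = {"scale": lambda x: x * 2, "offset": lambda x: x + 1}
    print(run_pipeline(cfg, range(5)))
```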
13. Diffuser: Packet Spraying While Maintaining Order: Distributed Event Scheduler for Maintaining Packet Order while Packet Spraying in DPDK
- Author
-
Purushotham Srinivas, Vignesh
- Abstract
The demand for high-speed networking applications has made network processors (NPs) and central processing units (CPUs) increasingly parallel and complex, containing numerous on-chip processing cores. This parallelism can only be exploited fully by the underlying packet scheduler if it efficiently utilizes all the available cores. Classically, packets have been directed towards the processing cores at flow granularity, making them susceptible to traffic locality. Ensuring a good load balance among the processors improves the application's throughput and packet-loss characteristics. Hence, packet-level schedulers dispatch packets to the processing cores at packet granularity to improve the load balance. However, packet-level scheduling combined with advanced parallelism introduces out-of-order departure of the processed packets. Simultaneously optimizing both load balance and packet order is challenging. In this degree project, we micro-benchmark DPDK's (Data Plane Development Kit) event scheduler and identify many performance and scalability bottlenecks. We find the event scheduler consumes around 40% of the cycles on each participating core for event scheduling. Additionally, we find that DSW (Distributed Software Scheduler) cannot saturate all the workers with traffic because a single NIC (Network Interface Card) queue is polled for packets in our test setup. We then propose Diffuser, an event scheduler for DPDK that combines the functional properties of both flow- and packet-level schedulers. Diffuser aims to achieve optimal load balance while minimizing out-of-order packet transmission. It uses stochastic flow assignments along with a load-imbalance feedback mechanism to adaptively control the rate of flow migrations and optimize the scheduler's load distribution. Diffuser reduces packet reordering by at least 65% with ten flows of 100 bytes at 25 MPPS (million packets per second) and by at least 50% with one flow. While Diffuser improves reordering performance, it slightly reduces throughput and increases latency. (A toy sketch of load-feedback flow migration follows this entry.)
- Published
- 2023
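A toy model of the stochastic-migration-with-feedback idea: flows are hashed to cores, and a flow is occasionally migrated to the least-loaded core with a probability that grows with the measured load imbalance. The migration policy and parameters here are illustrative assumptions, not Diffuser's or DPDK's actual scheduler.

```python
# Hedged sketch: flow-to-core assignment with load-imbalance feedback.
import random

class FlowScheduler:
    def __init__(self, n_cores, migrate_gain=0.1, rng=None):
        self.n_cores = n_cores
        self.flow_to_core = {}
        self.load = [0] * n_cores
        self.migrate_gain = migrate_gain   # how aggressively imbalance triggers migration
        self.rng = rng or random.Random(0)

    def dispatch(self, flow_id, pkt_cost=1):
        # default: keep packets of a flow on one core (flow granularity)
        core = self.flow_to_core.setdefault(flow_id, hash(flow_id) % self.n_cores)
        self.load[core] += pkt_cost
        self._maybe_migrate(flow_id)
        return core

    def _maybe_migrate(self, flow_id):
        hi, lo = max(self.load), min(self.load)
        imbalance = (hi - lo) / (hi + 1e-9)
        # migrate with a probability that grows with imbalance, so
        # reordering-prone migrations stay rare when load is already even
        if self.rng.random() < self.migrate_gain * imbalance:
            self.flow_to_core[flow_id] = self.load.index(lo)

if __name__ == "__main__":
    sched = FlowScheduler(n_cores=4)
    for i in range(10000):
        sched.dispatch(flow_id=i % 10)      # ten flows, skewed by hashing
    print(sched.load)
```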
15. GPU acceleration of Levenshtein distance computation between long strings
- Author
-
Barcelona Supercomputing Center and Castells Rufas, David
- Abstract
Computing the edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the elements to be computed, providing additional performance gains. The implementation makes it possible to compute the edit distance between strings having hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than a million characters, the performance is the best ever reported, reaching TCUPS-scale throughput for strings with similarities greater than 70% and above one hundred TCUPS for 99.9% similarity. This research was supported by the European Union Regional Development Fund (ERDF) within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of the total cost eligible under the Designing RISC-V based Accelerators for next generation computers project (DRAC) [001-P-001723], in part by the Catalan Government under grant 2017-SGR-1624, and in part by the Spanish Ministry of Science, Innovation and Universities under grant RTI2018-095209-B-C22. (A reference quadratic-time baseline appears after this entry.)
- Published
- 2023
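For contrast with the WFA approach above, here is the classic O(nm) dynamic-programming Levenshtein distance with two rolling rows; it is the quadratic baseline whose cost motivates wavefront methods, not the paper's GPU implementation.

```python
# Reference quadratic-time Levenshtein distance with two rolling rows.
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (ca != cb))  # match / substitution
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))   # 3
```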
16. Parallel Photonic Convolutional Processing On-chip with Cross-connect Architecture and Cyclic AWGs
- Author
-
Shi, Bin, Calabretta, Nicola, and Stabile, Ripalta
- Abstract
Convolutional neural networks (CNNs) are among the best neural network structures for solving classification problems. The convolutional processing of the network dominates processing time and computing power. Parallel computing for convolutional processing is essential to accelerate the computing speed of the neural network. In this paper, we introduce another domain of parallelism on top of the already demonstrated parallelisms suggested for photonic integrated processors with WDM approaches, to further accelerate the convolutional operation on chip. The operation of the novel parallelism is introduced with an updated cross-connect architecture, exploiting cyclic arrayed waveguide gratings. The photonic CNN system is demonstrated on the handwritten digit classification problem in simulation, with a speed of 2.56 Tera operations/s and an end-to-end system energy efficiency of 3.75 pJ/operation, using 16 weighting elements and 10 Giga samples/s inputs. The proposed parallelism improves CNN acceleration by 4-16 times with respect to state-of-the-art integrated convolutional processors, depending on the available weighting elements per convolutional core.
- Published
- 2023
17. Optimizing GPU-based Graph Sampling and Random Walk for Efficiency and Scalability
- Author
-
Wang, Pengyu, Xu, Cheng, Li, Chao, Wang, Jing, Wang, Taolei, Zhang, Lu, Hou, Xiaofeng, and Guo, Minyi
- Abstract
Graph sampling and random walk algorithms are playing increasingly important roles today because they can significantly reduce graph size while preserving structural information, thus enabling computationally intensive tasks on large-scale graphs. Current frameworks designed for graph sampling and random walk tasks are generally not efficient in terms of memory requirements and throughput, and some of them even produce biased results. To solve these problems, we introduce Skywalker+, a high-performance graph sampling and random walk framework on multiple GPUs supporting multiple algorithms. Skywalker+ makes four key contributions: First, it realizes a highly parallel alias method on GPUs. Second, it applies finely adjusted workload-balancing techniques and locality-aware execution modes to provide a highly efficient execution engine. Third, it optimizes GPU memory usage with efficient buffering and data compression schemes. Last, it scales to multiple GPUs to further enhance system throughput. Extensive experiments show that Skywalker+ exhibits a significant advantage over the baselines in both performance and utility. (A CPU sketch of the alias method follows this entry.)
- Published
- 2023
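The alias method mentioned above is the standard O(1)-per-sample technique for drawing from a discrete distribution (e.g., picking a neighbor proportionally to edge weight). Below is a plain CPU sketch of table construction and sampling; the GPU parallelization in Skywalker+ is not modeled.

```python
# CPU sketch of the alias method: O(n) table build, O(1) sampling.
import random

def build_alias_table(weights):
    n = len(weights)
    total = float(sum(weights))
    prob = [w * n / total for w in weights]
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]          # donate mass from the large bucket
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias, rng=random):
    i = rng.randrange(len(prob))          # pick a column uniformly,
    return i if rng.random() < prob[i] else alias[i]   # then flip its biased coin

if __name__ == "__main__":
    prob, alias = build_alias_table([0.1, 0.2, 0.3, 0.4])
    counts = [0, 0, 0, 0]
    for _ in range(100000):
        counts[alias_sample(prob, alias)] += 1
    print(counts)   # roughly 10k / 20k / 30k / 40k
```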
18. STAR: An STGCN ARchitecture for Skeleton-Based Human Action Recognition
- Author
-
Wu, Weiwei, Tu, Fengbin, Niu, Mengqi, Yue, Zhiheng, Liu, Leibo, Wei, Shaojun, Li, Xiangyu, Hu, Yang, and Yin, Shouyi
- Abstract
Skeleton-based human action recognition (HAR) has drawn increasing attention recently. As an emerging approach for skeleton-based HAR tasks, the Spatial-Temporal Graph Convolution Network (STGCN) achieves remarkable performance by fully exploiting skeleton topology information via graph convolution. Unfortunately, existing GCN accelerators lose efficiency when processing STGCN models due to two limitations. To overcome these limitations, this paper proposes STAR, an STGCN architecture for skeleton-based human action recognition. STAR is designed based on the characteristics of the different computation phases in STGCN. For limitation (1), a spatial-temporal dimension consistent (STDC) dataflow is proposed to fully exploit the data reuse opportunities in all the different dimensions of STGCN. For limitation (2), we propose a node-wise exponent sharing scheme and a temporal-structured redundancy elimination mechanism to exploit the inherent temporal redundancy specifically introduced by STGCN. To further address the under-utilization induced by redundancy elimination, we design a dynamic data scheduler to manage the feature data storage and schedule the features and weights for valid computation in real time. STAR achieves 4.48×, 5.98×, 2.54×, and 103.88× energy savings on average over HyGCN, AWB-GCN, TPU, and the Jetson TX2 GPU.
- Published
- 2023
21. Towards Real-Time Inference Offloading with Distributed Edge Computing: the Framework and Algorithms
- Author
-
Chen, Quan, Guo, Song, Wang, Kaijia, Xu, Wenchao, Li, Jing, Cai, Zhipeng, Gao, Hong, and Zomaya, Albert
- Abstract
By combining edge computing and parallel computing, distributed edge computing has emerged as a new paradigm to exploit the booming IoT devices at the edge. To accelerate computation at the edge, i.e., the inference tasks of DNN-driven applications, the parallelism of both computation and communication needs to be considered for distributed edge computing; thus, the problem of Minimum Latency joint Communication and Computation Scheduling (MLCCS) is proposed. However, existing works make rigid assumptions that the communication time of each device is fixed and that the workload can be split arbitrarily small. Aiming at making the work more practical and general, the MLCCS problem without these assumptions is studied in this paper. Firstly, the MLCCS problem under a general model is formulated and proved to be NP-hard. Secondly, a pyramid-based computing model is proposed to jointly consider the parallelism of communication and computation, which has an approximation ratio of $1+\delta$, where $\delta$ is related to the devices' communication rates. An interesting property under such a computing model is identified and proved: the optimal latency can be obtained under an arbitrary scheduling order when all the devices share the same communication rate. When the workload cannot be split arbitrarily, an approximation algorithm with a ratio of at most $2\cdot(1+\delta)$ is proposed. Additionally, several algorithms are proposed for handling dynamically changing network scenarios. Finally, the theoretical analysis and simulation results verify that the proposed algorithms achieve high performance in terms of latency. Two testbed experiments are also conducted. (A toy latency model follows this entry.)
- Published
- 2023
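As a rough illustration of why communication and computation must be scheduled jointly, the toy model below sends workload shares to devices one at a time over a shared link and lets each device compute as soon as its share arrives; the job finishes when the slowest device does. The rates and shares are made-up numbers, and this is not the paper's pyramid-based model or its approximation algorithm.

```python
# Toy latency model: sequential transmission over a shared link, overlapped
# with per-device computation; the makespan depends on the scheduling order.
def makespan(shares, comm_rate, comp_rate):
    """shares[i]: workload sent to device i; comm_rate[i]: link rate to i;
    comp_rate[i]: device i's processing rate. Devices are served in list order."""
    t_link = 0.0
    finish = []
    for w, r_comm, r_comp in zip(shares, comm_rate, comp_rate):
        t_link += w / r_comm                 # transmission occupies the shared link
        finish.append(t_link + w / r_comp)   # computation overlaps later transmissions
    return max(finish)

if __name__ == "__main__":
    shares = [30, 30, 40]
    print(makespan(shares, comm_rate=[10, 5, 10], comp_rate=[4, 4, 8]))
```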
22. Revisiting Topographic Horizons in the Era of Big Data and Parallel Computing
- Author
-
Dozier, Jeff
- Published
- 2022
23. Research Trends, Enabling Technologies and Application Areas for Big Data
- Author
-
Lundberg, Lars and Grahn, Håkan
- Abstract
The availability of large amounts of data in combination with Big Data analytics has transformed many application domains. In this paper, we provide insights into how the area has developed in the last decade. First, we identify seven major application areas and six groups of important enabling technologies for Big Data applications and systems. Then, using bibliometrics and an extensive literature review of more than 80 papers, we identify the most important research trends in these areas. In addition, our bibliometric analysis also includes trends in different geographical regions. Our results indicate that manufacturing and agriculture or forestry are the two application areas with the fastest growth. Furthermore, our bibliometric study shows that deep learning and edge or fog computing are the enabling technologies increasing the most. We believe that the data presented in this paper provide a good overview of the current research trends in Big Data and that this kind of information is very useful when setting strategic agendas for Big Data research., open access
- Published
- 2022
- Full Text
- View/download PDF
25. Parallel Processing at the Edge in Dense Wireless Networks
- Author
-
Zeng, Ming and Fodor, Viktoria
- Abstract
This paper investigates the gain of parallel processing when mobile edge computing (MEC) is implemented in dense wireless networks. In this scenario users connect to several access points (APs) and utilize the computation capability of multiple servers at the same time. This allows a balanced load at the servers, at the possible cost of decreased spectrum efficiency. The problem of sum transmission energy minimization under response time constraints is considered and proved to be non-convex. The complexity of optimizing a part of the system parameters is investigated, and based on these results an iterative resource allocation procedure is proposed that converges to a local optimum. The performance of the joint resource allocation is evaluated by comparing it to lower and upper bounds defined by less or more flexible multi-cell MEC architectures. The results show that the free selection of the AP is crucial for achieving decent system performance. The average level of parallel processing is in general low in dense systems, but it is an important option for the rare, highly unbalanced instances.
- Published
- 2022
- Full Text
- View/download PDF
27. Towards minimum WCRT bound for DAG tasks under prioritized list scheduling algorithms
- Author
-
Chang, Shuangshuang, Bi, Ran, Sun, Jinghao, Liu, Weichen, Yu, Qi, Deng, Qingxu, and Gu, Zonghua
- Abstract
Many modern real-time parallel applications can be modeled as a directed acyclic graph (DAG) task. Recent studies show that the worst-case response time (WCRT) bound of a DAG task can be significantly reduced when the execution order of the vertices is determined by the priority assigned to each vertex of the DAG. How to obtain the optimal vertex priority assignment, and how far the best-known WCRT bound of a DAG task is from the minimum WCRT bound, are still open problems. In this paper, we aim to construct the optimal vertex priority assignment and derive the minimum WCRT bound for the DAG task. We encode the priority assignment problem into an integer linear programming (ILP) formulation. To solve the ILP model efficiently, we do not involve all variables and constraints at once. Instead, we solve the ILP model iteratively, i.e., we initially solve the ILP model with only a few primary variables and constraints, and then at each iteration we augment the ILP model with the variables and constraints that are most likely to lead to the optimal priority assignment. Experimental work shows that our method is capable of solving the ILP model optimally without involving too many variables or constraints; e.g., for instances with 50 vertices, we find the optimal priority assignment by involving 12.67% of the variables on average and within several minutes on average. This article was presented at the International Conference on Embedded Software (EMSOFT) 2022 and appeared as part of the ESWEEK-TCAD special issue. (A sketch of a classical WCRT bound follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
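For context, the sketch below computes the classical list-scheduling bound R <= len(G) + (vol(G) - len(G)) / m for a DAG task on m cores, i.e., the critical-path length plus the remaining work spread over the cores. This well-known bound is used here as an assumption for illustration; it is the kind of bound the paper's ILP-based priority assignment aims to tighten, not the paper's own bound.

```python
# Classical DAG response-time bound: critical path plus interference term.
from functools import lru_cache

def wcrt_bound(wcet, succ, m):
    """wcet[v]: worst-case execution time of vertex v; succ[v]: successors of v;
    m: number of cores. Returns len(G) + (vol(G) - len(G)) / m."""
    @lru_cache(maxsize=None)
    def longest_from(v):
        return wcet[v] + max((longest_from(u) for u in succ.get(v, ())), default=0)

    volume = sum(wcet.values())                   # vol(G): total work
    length = max(longest_from(v) for v in wcet)   # len(G): critical path
    return length + (volume - length) / m

if __name__ == "__main__":
    wcet = {"a": 2, "b": 3, "c": 4, "d": 1}
    succ = {"a": ("b", "c"), "b": ("d",), "c": ("d",)}
    print(wcrt_bound(wcet, succ, m=2))   # 7 + (10 - 7) / 2 = 8.5
```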
28. Optimized continuous wavelet transform algorithm architecture and implementation on FPGA for motion artifact rejection in radar-based vital signs monitoring
- Author
-
Obadi, A. B. (Ameen Bin), Zeghid, M. (Medien), Kan, P. L. (Phak Len Eh), Soh, P. J. (Ping Jack), Mercuri, M. (Marco), and Aldaye, O. (Omar)
- Abstract
The continuous wavelet transform (CWT) has been used in radar-based vital signs detection to identify and remove motion artifacts from the received radar signals. Since the CWT algorithm is computationally heavy, its processing typically results in long processing times and complex hardware implementations. The algorithm in its standard form typically relies on software processing tools and is unable to support high-performance data processing. The aim of this research is to design an optimized CWT algorithm architecture and implement it on a Field Programmable Gate Array (FPGA) in order to identify the unwanted movement introduced in the retrieved vital signs signals. The optimization approaches in the new implementation structure are based on frequency-domain processing, optimizing the required number of operations, and parallel processing of independent operations. Our design achieves significant gains in processing speed and logic utilization. Processing the algorithm using our proposed hardware architecture is 48 times faster than processing it using MATLAB, and it achieves a 58% improvement in speed compared to alternative solutions reported in the literature. Moreover, efficient resource utilization is achieved and reported. This performance results from combining multiple optimization techniques, yielding multidimensional improvements. As a result, the design is suitable for high-performance data processing applications. (A brief frequency-domain CWT sketch follows this entry.)
- Published
- 2022
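The frequency-domain approach mentioned above treats the CWT as a bank of convolutions, one FFT multiply per scale. The NumPy sketch below illustrates that structure; the complex Morlet wavelet, its normalization, and the chosen scales are illustrative assumptions, not the paper's FPGA architecture.

```python
# Hedged sketch: CWT of a signal via per-scale FFT-based (circular) convolution.
import numpy as np

def morlet(t, scale, w0=6.0):
    x = t / scale
    return np.pi ** -0.25 * np.exp(1j * w0 * x) * np.exp(-0.5 * x ** 2) / np.sqrt(scale)

def cwt_fft(signal, scales, dt=1.0):
    n = len(signal)
    t = (np.arange(n) - n // 2) * dt      # time axis centered on the kernel
    sig_f = np.fft.fft(signal)
    rows = []
    for s in scales:
        kernel = morlet(t, s)
        # one multiply in the frequency domain per scale; ifftshift centers the kernel
        rows.append(np.fft.ifft(sig_f * np.fft.fft(np.fft.ifftshift(kernel))))
    return np.array(rows)

if __name__ == "__main__":
    fs = 100.0
    t = np.arange(0, 2, 1 / fs)
    sig = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 20 * t)
    coeffs = cwt_fft(sig, scales=[2, 4, 8, 16], dt=1 / fs)
    print(coeffs.shape)   # (4, 200): one row of coefficients per scale
```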
29. Communication-Efficient Cluster Scalable Genomics Data Processing Using Apache Arrow Flight
- Author
-
Ahmad, T. (author), Ma, Chengxin (author), Al-Ars, Z. (author), and Hofstee, H.P. (author)
- Abstract
Current cluster-scale genomics data processing solutions rely on big data frameworks like Apache Spark, Hadoop and HDFS for data scheduling, processing and storage. These frameworks come with additional computation and memory overheads by default, and it has been observed that scaling genomics dataset processing beyond 32 nodes is not efficient on such frameworks. To overcome the inefficiencies of big data frameworks for processing genomics data on clusters, we introduce a low-overhead and highly scalable solution on a SLURM-based HPC batch system. This solution uses Apache Arrow as an in-memory columnar data format to store genomics data efficiently, and Arrow Flight as a network protocol to move and schedule this data across the HPC nodes with low communication overhead. As a use case, we use NGS short-read DNA sequencing data for pre-processing and variant calling applications. This solution outperforms existing Apache Spark based big data solutions in terms of both computation time (2x) and communication overhead (more than 20-60% lower, depending on cluster size). Our solution has similar performance to MPI-based HPC solutions, with the added advantage of easy programmability and transparent big data scalability. The whole solution is Python and shell script based, which makes it flexible to update and to integrate alternative variant callers. Our solution is publicly available on GitHub at https://github.com/abs-tudelft/time-to-fly-high/tree/main/genomics (A minimal Arrow Flight example follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
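For readers unfamiliar with Arrow Flight, the snippet below serves a small in-memory Arrow table from a Flight server and fetches it with a client in the same process. The table contents, ticket name and port are made-up stand-ins, and running server and client in one script (with a short sleep) is only a demonstration pattern; the paper's genomics pipeline is not reproduced here.

```python
# Minimal Arrow Flight round trip: serve an in-memory table, fetch it back.
import threading
import time

import pyarrow as pa
import pyarrow.flight

class TinyFlightServer(pa.flight.FlightServerBase):
    """Serves one in-memory table keyed by a ticket (stand-in for record batches
    of genomics reads)."""
    def __init__(self, location):
        super().__init__(location)
        self._tables = {b"reads": pa.table({"id": [1, 2, 3],
                                            "seq": ["ACGT", "TTGA", "CCAT"]})}

    def do_get(self, context, ticket):
        return pa.flight.RecordBatchStream(self._tables[ticket.ticket])

if __name__ == "__main__":
    location = "grpc://127.0.0.1:8815"
    server = TinyFlightServer(location)
    threading.Thread(target=server.serve, daemon=True).start()
    time.sleep(0.5)                       # crude wait for the server to be ready
    client = pa.flight.connect(location)
    table = client.do_get(pa.flight.Ticket(b"reads")).read_all()
    print(table)
    server.shutdown()
```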
30. A parallelized hybrid genetic algorithm with differential evolution for heat exchanger network retrofit
- Author
-
Stampfli, Jan A., Olsen, Donald G., Wellig, Beat, and Hofmann, René
- Abstract
The challenge of heat exchanger network retrofit is often addressed using deterministic algorithms. However, the complexity of the retrofit problems, combined with multi-period operation, makes it very difficult to find any feasible solution. In contrast, stochastic algorithms are more likely to find feasible solutions in complex solution spaces. This work presents a customized evolution-based optimization algorithm to address this challenge. The algorithm has two levels, whereby a genetic algorithm optimizes the topology of the heat exchanger network at the top level. Based on the resulting topology, a differential evolution algorithm optimizes the heat loads of the heat exchangers in each operating period. The following points highlight the customization of the algorithm (a compact two-level sketch follows this entry):
• The advantage of using both algorithms: the genetic algorithm is used for the topology optimization (discrete variables) and the differential evolution for the heat-load optimization (continuous variables).
• Penalizing and preserving strategies are used for constraint handling.
• The evaluation step of the genetic algorithm is parallelized, i.e., the differential evolution algorithm is run for each chromosome in parallel on multiple cores.
- Published
- 2022
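A compact sketch of the two-level idea: an outer evolutionary step searches discrete topologies (which matches exist), and an inner differential-evolution step tunes continuous heat loads for each candidate topology. The cost function is a made-up placeholder, and the per-chromosome evaluation is parallelized with a process pool as a stand-in for the multi-core evaluation described above; none of this is the paper's actual model.

```python
# Hedged two-level sketch: discrete topologies outside, DE over loads inside.
import random
from concurrent.futures import ProcessPoolExecutor

N_MATCHES = 6

def cost(topology, loads):
    # placeholder objective: pay for installed exchangers, reward transferred heat
    return sum(topology) * 10.0 - sum(l * t for l, t in zip(loads, topology))

def inner_de(topology, pop_size=12, gens=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 5) for _ in range(N_MATCHES)] for _ in range(pop_size)]
    for _ in range(gens):
        for i, x in enumerate(pop):
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            trial = [a[d] + 0.8 * (b[d] - c[d]) if rng.random() < 0.9 else x[d]
                     for d in range(N_MATCHES)]
            trial = [min(max(v, 0.0), 5.0) for v in trial]   # keep loads in bounds
            if cost(topology, trial) < cost(topology, x):
                pop[i] = trial
    return min(cost(topology, loads) for loads in pop)

def evaluate(topology):
    return inner_de(topology), topology

if __name__ == "__main__":
    rng = random.Random(42)
    population = [[rng.randint(0, 1) for _ in range(N_MATCHES)] for _ in range(8)]
    with ProcessPoolExecutor() as pool:          # one DE run per chromosome, in parallel
        scored = sorted(pool.map(evaluate, population))
    print("best topology:", scored[0][1], "cost:", round(scored[0][0], 2))
```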
31. Accelerating Elliptic Curve Digital Signature Algorithms on GPUs
- Author
-
Feng, Zonghao, Xie, Qipeng, Luo, Qiong, Chen, Yujie, Li, Haoxuan, Li, Huizhong, and Yan, Qiang
- Abstract
The Elliptic Curve Digital Signature Algorithm (ECDSA) is an essential building block of various cryptographic protocols. In particular, most blockchain systems adopt it to ensure transaction integrity. However, due to its high computational intensity, ECDSA is often the performance bottleneck in blockchain transaction processing. Recent work has accelerated ECDSA algorithms on the CPU; in contrast, success has been limited on the GPU, which has great potential for parallelization but is challenging for implementing elliptic curve functions. In this paper, we propose RapidEC, a GPU-based ECDSA implementation for SM2, a popular elliptic curve. Specifically, we design architecture-aware parallel primitives for elliptic curve point operations, and parallelize the processing of a single SM2 request as well as batches of requests. Consequently, our GPU-based RapidEC outperformed the state-of-the-art CPU-based algorithm by orders of magnitude. Additionally, our GPU-based modular arithmetic functions and point operation primitives can be applied to other computation tasks. (A pure-Python sketch of the underlying point operations follows this entry.)
- Published
- 2022
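The point operations referred to above are the building blocks of any ECDSA-style scheme. Here is a pure-Python sketch of point addition, doubling and double-and-add scalar multiplication on a tiny toy curve; SM2's 256-bit parameters and the GPU parallelization are deliberately not reproduced.

```python
# Textbook short-Weierstrass point arithmetic over a toy prime field.
P = 97            # field prime (toy size)
A, B = 2, 3       # curve: y^2 = x^3 + A*x + B over GF(P)
INF = None        # point at infinity (group identity)

def point_add(p1, p2):
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return INF                                  # inverse points
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P   # tangent slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P) % P          # chord slope
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def scalar_mult(k, point):
    """Double-and-add: the hot loop that GPU implementations parallelize
    across many independent requests."""
    result, addend = INF, point
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result

if __name__ == "__main__":
    G = (3, 6)                 # a point on y^2 = x^3 + 2x + 3 mod 97
    print(scalar_mult(5, G))
```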
32. 1D-CROSSPOINT ARRAY AND ITS CONSTRUCTION, APPLICATION TO BIG DATA PROBLEMS, AND HIGHER DIMENSION VARIANTS
- Author
-
An, Taeyoung
- Abstract
Increased chip densities offer massive computation power to deal with fundamental big-data operations such as sorting. At the same time, the proliferation of processing elements (PEs) in settings such as High Performance Computers (HPCs) or servers, together with the employment of more aggressive parallel algorithms, causes the interprocessor communications to dominate the overall computation time, potentially resulting in reduced computational efficiency. To overcome this issue, this dissertation introduces a new architecture that uses simple crosspoint switches to pair PEs instead of a complex interconnection network. This new architecture may be viewed as a “quadratic” array of processors, as it uses O(n^2) PEs rather than O(n) as in linear array processor models. In addition, three different models for sorting big data in a distributed computing environment such as cloud computing are presented. With the most realistic model of the three, we demonstrate that the high parallelism made possible by the simple communication channels overcomes the seemingly excessive hardware complexity and performs comparably to or better than existing algorithms. Furthermore, two additional algorithms, matrix multiplication and triangle counting, for the 1D-Crosspoint Array are introduced and analyzed. Lastly, two higher-dimensional variants, the 2D- and 3D-Crosspoint Arrays, are also proposed along with a construction method, which succeeds in reducing the number of PEs required by utilizing the communication channels in the added dimensions. (An enumeration-sort sketch of the O(n^2)-comparator idea follows this entry.)
- Published
- 2022
33. A parallelized hybrid genetic algorithm with differential evolution for heat exchanger network retrofit
- Author
-
Stampfli, Jan A., Olsen, Donald G., Wellig, Beat, and Hofmann, René
- Abstract
The challenge of heat exchanger network retrofit is often addressed using deterministic algorithms. However, the complexity of retrofit problems, combined with multi-period operation, makes it very difficult to find any feasible solution. In contrast, stochastic algorithms are more likely to find feasible solutions in complex solution spaces. This work presents a customized evolutionary optimization algorithm to address this challenge. The algorithm has two levels: at the top level, a genetic algorithm optimizes the topology of the heat exchanger network; based on the resulting topology, a differential evolution algorithm optimizes the heat loads of the heat exchangers in each operating period. The following points highlight the customization of the algorithm:
• The two algorithms complement each other: the genetic algorithm handles the topology optimization (discrete variables) and the differential evolution the heat load optimization (continuous variables).
• Penalizing and preserving strategies are used for constraint handling.
• The evaluation of the genetic algorithm is parallelized: the differential evolution algorithm is run for each chromosome in parallel on multiple cores.
+ Publication ID: hslu_93764 + Contribution type: Scientific media + Year: 2022 + Language: English + Last updated: 2023-01-24 12:28:10
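The two-level structure described above can be pictured with the sketch below: an outer genetic algorithm searches over binary topology chromosomes, an inner differential evolution run tunes the continuous loads for each chromosome, and chromosome evaluations run in parallel across cores. The cost function is a toy surrogate, not the paper's heat exchanger network model, and the population sizes and rates are illustrative.

```python
# Two-level sketch: outer GA over binary topologies, inner DE over continuous
# loads, with chromosome evaluations parallelized across cores. The cost function
# is a toy surrogate, NOT the paper's heat exchanger network model.
import numpy as np
from multiprocessing import Pool
from scipy.optimize import differential_evolution

N_MATCHES = 6                                     # candidate exchanger matches
rng = np.random.default_rng(0)

def surrogate_cost(loads, topology):
    active = topology.astype(bool)
    return float(np.sum(active * (loads - 1.0) ** 2) + 5.0 * active.sum())

def evaluate(topology):
    """Inner level: DE optimizes the continuous loads for one fixed topology."""
    result = differential_evolution(surrogate_cost, [(0.0, 10.0)] * N_MATCHES,
                                    args=(topology,), maxiter=30, tol=1e-3,
                                    seed=1, polish=False)
    return result.fun

def genetic_algorithm(pop_size=8, generations=5):
    pop = rng.integers(0, 2, size=(pop_size, N_MATCHES))
    best = (np.inf, None)
    for _ in range(generations):
        with Pool() as pool:                      # parallel evaluation of chromosomes
            fitness = pool.map(evaluate, list(pop))
        order = np.argsort(fitness)
        if fitness[order[0]] < best[0]:
            best = (fitness[order[0]], pop[order[0]].copy())
        parents = pop[order[: pop_size // 2]]     # preserve the better half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, N_MATCHES))
            child = np.concatenate([a[:cut], b[cut:]])        # one-point crossover
            child[rng.random(N_MATCHES) < 0.1] ^= 1           # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents] + children)
    return best

if __name__ == "__main__":
    cost, topology = genetic_algorithm()
    print(cost, topology)
```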
- Published
- 2022
35. Fused Architecture for Dense and Sparse Matrix Processing in TensorFlow Lite
- Author
-
Nunez-Yanez, Jose Luis
- Abstract
In this paper we present a hardware architecture optimized for sparse and dense matrix processing in TensorFlow Lite and compatible with embedded heterogeneous devices that integrate CPU and FPGA resources. The FADES (Fused Architecture for DEnse and Sparse matrices) design offers multiple configuration options that trade off parallelism and complexity, and uses a dataflow model to create four stages that read, compute, scale, and write results. All stages are designed to support TensorFlow Lite operations, including asymmetric quantized activations, column-major matrix write, per-filter/per-axis bias values, and the current scaling specifications. The configurable accelerator is integrated with the TensorFlow Lite inference engine running on the ARMv8 processor. We compare performance/power/energy with the state-of-the-art RUY software multiplication library, showing up to 18x acceleration in dense mode and 48x in sparse mode. The sparse mode benefits from structural pruning to fully utilize the DSP blocks present in the FPGA device., Funding: Royal Society Industry Fellowship [INF\192044]; EPSRC HOPWARE [EP040863\1]; Leverhulme Trust International Fellowship "High-performance video analytics with parallel heterogeneous neural networks" [IF-2021-003]
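For concreteness, the kind of arithmetic such an accelerator offloads is an integer GEMM with an asymmetric activation zero-point, per-filter bias, and per-filter output scaling. Below is a small NumPy reference sketch of that computation in the style of the TensorFlow Lite quantization scheme; it describes the numerics only, not the FADES hardware, and the example values are made up.

```python
# Reference model (not the FADES hardware): int8-style GEMM with an asymmetric
# activation zero-point, per-filter int32 bias, and per-filter output scales,
# in the style of the TensorFlow Lite quantization scheme.
import numpy as np

def quantized_matmul(a_q, a_zero_point, a_scale, w_q, w_scales, bias):
    acc = (a_q.astype(np.int32) - a_zero_point) @ w_q.astype(np.int32)  # int32 accumulate
    acc = acc + bias                                                    # per-filter bias
    return acc.astype(np.float64) * (a_scale * w_scales)                # per-filter rescale

a_q = np.array([[130, 5], [7, 250]], dtype=np.uint8)     # quantized activations
w_q = np.array([[1, -2, 3], [4, 5, -6]], dtype=np.int8)  # quantized weights (2x3)
bias = np.array([10, 0, -10], dtype=np.int32)
out = quantized_matmul(a_q, a_zero_point=128, a_scale=0.02,
                       w_q=w_q, w_scales=np.array([0.5, 0.25, 0.125]), bias=bias)
print(out)   # dequantized float output, one scale per output filter
```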
- Published
- 2022
- Full Text
- View/download PDF
36. STRETCH : Virtual Shared-Nothing Parallelism for Scalable and Elastic Stream Processing
- Author
-
Gulisano, Vincenzo, Najdataei, Hannaneh, Nikolakopoulos, Yiannis, Papadopoulos, Alessandro V., Papatriantafilou, Marina, and Tsigas, Philippas
- Abstract
Stream processing applications extract value from raw data through Directed Acyclic Graphs of data analysis tasks. Shared-nothing (SN) parallelism is the de-facto standard to scale stream processing applications. Given an application, SN parallelism instantiates several copies of each analysis task, making each instance responsible for a dedicated portion of the overall analysis, and relies on dedicated queues to exchange data among connected instances. On the one hand, SN parallelism can scale the execution of applications both up and out since threads can run task instances within and across processes/nodes. On the other hand, its lack of sharing can cause unnecessary overheads and hinder the scaling up when threads operate on data that could be jointly accessed in shared memory. This trade-off motivated us to study a way for stream processing applications to leverage shared memory and boost the scale up (before the scale out) while adhering to the widely-adopted and SN-based APIs for stream processing applications. We introduce STRETCH, a framework that maximizes the scale up and offers instantaneous elastic reconfigurations (without state transfer) for stream processing applications. We propose the concept of Virtual Shared-Nothing (VSN) parallelism and elasticity and provide formal definitions and correctness proofs for the semantics of the analysis tasks supported by STRETCH, showing they extend the ones found in common Stream Processing Engines. We also provide a fully implemented prototype and show that STRETCH's performance exceeds that of state-of-the-art frameworks such as Apache Flink and offers, to the best of our knowledge, unprecedented ultra-fast reconfigurations, taking less than 40 ms even when provisioning tens of new task instances.
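As background for the shared-nothing model that STRETCH starts from, the sketch below hash-partitions a keyed stream across per-instance queues so that each task instance owns its aggregation state exclusively and needs no locks. It shows only the plain SN baseline; STRETCH's virtual shared-nothing layer and elastic reconfiguration are not implemented here.

```python
# Plain shared-nothing (SN) baseline: tuples are routed by key hash to dedicated
# per-instance queues, and every instance owns its state exclusively, so no locks
# are needed. This is only the baseline model; STRETCH's virtual shared-nothing
# layer and elastic reconfiguration are not shown.
import queue
import threading
from collections import defaultdict

N_INSTANCES = 4
queues = [queue.Queue() for _ in range(N_INSTANCES)]
results = [None] * N_INSTANCES

def instance(idx):
    state = defaultdict(int)                    # state private to this instance
    while True:
        item = queues[idx].get()
        if item is None:                        # end-of-stream marker
            break
        key, value = item
        state[key] += value
    results[idx] = dict(state)

def route(key, value):
    queues[hash(key) % N_INSTANCES].put((key, value))   # key-based partitioning

threads = [threading.Thread(target=instance, args=(i,)) for i in range(N_INSTANCES)]
for t in threads:
    t.start()
for record in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    route(*record)
for q in queues:
    q.put(None)
for t in threads:
    t.join()
print(results)
```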
- Published
- 2022
- Full Text
- View/download PDF
37. Research Trends, Enabling Technologies and Application Areas for Big Data
- Author
-
Lundberg, Lars and Grahn, Håkan
- Abstract
The availability of large amounts of data in combination with Big Data analytics has transformed many application domains. In this paper, we provide insights into how the area has developed in the last decade. First, we identify seven major application areas and six groups of important enabling technologies for Big Data applications and systems. Then, using bibliometrics and an extensive literature review of more than 80 papers, we identify the most important research trends in these areas. In addition, our bibliometric analysis also includes trends in different geographical regions. Our results indicate that manufacturing and agriculture or forestry are the two application areas with the fastest growth. Furthermore, our bibliometric study shows that deep learning and edge or fog computing are the enabling technologies increasing the most. We believe that the data presented in this paper provide a good overview of the current research trends in Big Data and that this kind of information is very useful when setting strategic agendas for Big Data research., open access
- Published
- 2022
- Full Text
- View/download PDF
41. Applied heat exchanger network retrofit for multi-period processes in industry: A hybrid evolutionary algorithm
- Author
-
Hofmann, René, Olsen, Donald, Ong, Benjamin Hung Yang, Stampfli, Jan, and Wellig, Beat
- Abstract
In the Swiss process industry, process integration is often applied to retrofit existing plants with multi-period operation. Such periods may experience a high degree of variation in temperature or mass flow. Some process streams may not exist in every period or are soft streams. The resulting retrofitted network needs to be able to ensure feasible heat transfer in each period through the integration of mixer configurations to control the temperature. These attributes increase the complexity of the solution space. Hence, this work proposes an evolutionary two-level algorithm for heat exchanger network retrofit. A genetic algorithm is used for topology optimization, and a differential evolution algorithm handles the heat loads. The algorithm is extended with practical constraints such as a maximum number of heat exchangers. Explicit mixer temperature calculations are implemented using the Lambert W-function. The algorithm was successfully applied to an industrial case study, reducing its total annual cost by approximately 66%., + Publication ID: hslu_93469 + Contribution type: Scientific media + Year: 2022 + Language: English + Last updated: 2022-10-26 17:30:39
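The Lambert W-function gives closed-form solutions to transcendental equations of the form x = a + b·exp(c·x), which is the general mechanism behind the explicit calculations mentioned above. The sketch below shows only that mechanism with SciPy and illustrative numbers; it is not the paper's mixer-temperature model.

```python
# Closed-form solution of x = a + b*exp(c*x) via the Lambert W-function. The
# equation form and numbers are illustrative; the paper's mixer-temperature
# equations are not reproduced here.
import numpy as np
from scipy.special import lambertw

def solve_exponential_balance(a, b, c):
    # x = a + b*exp(c*x)  <=>  x = a - W(-b*c*exp(c*a)) / c
    return float(a - lambertw(-b * c * np.exp(c * a)).real / c)

a, b, c = 80.0, 5.0, -0.02
x = solve_exponential_balance(a, b, c)
print(x, a + b * np.exp(c * x))   # both values should agree
```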
- Published
- 2022
44. Astrea: Auto-Serverless Analytics towards Cost-Efficiency and QoS-Awareness
- Author
-
Jarachanthan, Jananie, Chen, Li, Xu, Fei, and Li, Bo
- Abstract
With the ability to simplify the code deployment with one-click upload and lightweight execution, serverless computing has emerged as a promising paradigm with increasing popularity. However, there remain open challenges when adapting data-intensive analytics applications to the serverless context, in which users of serverless analytics encounter the difficulty in coordinating computation across different stages and provisioning resources in a large configuration space. This paper presents our design and implementation of Astrea, which configures and orchestrates serverless analytics jobs in an autonomous manner, while taking into account flexibly-specified user requirements. Astrea relies on the modeling of performance and cost which characterizes the intricate interplay among multi-dimensional factors (e.g., function memory size, degree of parallelism at each stage). We formulate an optimization problem based on user-specific requirements towards performance enhancement or cost reduction, and develop a set of algorithms based on graph theory to obtain optimal job execution. We deploy Astrea in the AWS Lambda platform and conduct real-world experiments over representative benchmarks, including big data analytics and machine learning workloads, at different scales. Extensive results demonstrate that Astrea can achieve the optimal execution decision for serverless data analytics, in comparison with various provisioning and deployment baselines. For example, when compared with three provisioning baselines, Astrea manages to improve the job completion time performance by 21% to 69% under a given budget constraint, while saving cost by 20% to 84% without violating performance requirements.
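To make the configuration problem concrete, the snippet below brute-forces a small grid of memory sizes and degrees of parallelism against a toy latency/cost model and keeps the cheapest configuration that meets a deadline. The model, prices, and grid are hypothetical placeholders; Astrea's fitted models and graph-based algorithms are not reproduced.

```python
# Toy configuration search: enumerate (memory, parallelism) pairs and keep the
# cheapest one that meets the deadline. The latency/cost model, prices, and grid
# are hypothetical placeholders, not Astrea's fitted models or algorithms.
from itertools import product

PRICE_PER_GB_SECOND = 0.0000166667            # illustrative serverless price
MEMORY_GB = [0.5, 1, 2, 4, 8]
PARALLELISM = [1, 2, 4, 8, 16, 32]

def latency_s(mem_gb, workers, work_gb=64.0):
    compute = work_gb / (workers * mem_gb * 0.8)   # toy: throughput scales with memory
    coordination = 0.05 * workers                  # toy: overhead grows with fan-out
    return compute + coordination

def cost_usd(mem_gb, workers, work_gb=64.0):
    return latency_s(mem_gb, workers, work_gb) * workers * mem_gb * PRICE_PER_GB_SECOND

def cheapest_under_deadline(deadline_s):
    feasible = [(cost_usd(m, w), latency_s(m, w), m, w)
                for m, w in product(MEMORY_GB, PARALLELISM)
                if latency_s(m, w) <= deadline_s]
    return min(feasible) if feasible else None     # (cost, latency, memory, workers)

print(cheapest_under_deadline(deadline_s=30.0))
```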
- Published
- 2022
45. A Structure-Tensor Approach to Integer Matrix Completion in Indivisible Resource Allocation
- Author
-
Mo, Yanfang, Chen, Wei, Khong, Sei Zhen, and Qiu, Li
- Abstract
Indivisible resource allocation motivates us to study matrix completion for the class of (0,1)-matrices with prescribed row/column sums and preassigned zeros. We illustrate and generalize the (0,1)-matrix completion in the following two scenarios: a demand-response application involving nonnegative integer matrices with different bounds across rows and an edge caching matching problem allowing row and column sums to vary within separately designated bounds. The applications require analytic characterizations of supply adequacy and lead to large-scale matrix completion instances. Remarkably, we derive a structure tensor and use its nonnegativity to establish a necessary and sufficient condition under which the considered matrix class is nonempty. The tensor condition can characterize the adequacy of a supply for a prescribed demand and facilitate identifying the minimum supplement to the supply so that the augmented supply becomes adequate when the adequacy gap is nonzero. Notably, we design a tensor-based combinatorial algorithm to construct a required matrix, representing a feasible resource allocation. Numerical simulations justify the efficiency of our approach.
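For orientation, the classical special case of this feasibility question, (0,1)-matrices with prescribed row and column sums and no preassigned zeros, is settled by the Gale-Ryser condition. The check below implements only that classical test, not the paper's structure-tensor generalization.

```python
# Classical Gale-Ryser feasibility test for a (0,1)-matrix with prescribed
# row/column sums and no preassigned zeros. This is only the classical special
# case; the paper's structure-tensor condition generalizes it.
def gale_ryser_feasible(row_sums, col_sums):
    if sum(row_sums) != sum(col_sums):
        return False
    r = sorted(row_sums, reverse=True)
    for k in range(1, len(r) + 1):
        if sum(r[:k]) > sum(min(c, k) for c in col_sums):
            return False
    return True

print(gale_ryser_feasible([3, 2, 1], [2, 2, 1, 1]))   # True: such a matrix exists
print(gale_ryser_feasible([2, 2], [3, 1]))            # False: a column would need 3 ones in 2 rows
```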
- Published
- 2022
46. A Cost-Efficient Resource Provisioning and Scheduling Approach for Deadline-Sensitive MapReduce Computations in Cloud Environment
- Author
-
Ardagna, Claudio Agostino, Chang, Carl K., Daminai, Ernesto, Ranjan, Rajiv, Wang, Zhongjie, Ward, Robert, Zhang, Jia, Zhang, Wensheng, Jabbari, Amir, Masoumiyan, Farzaneh, Hu, Shuwen, Tang, Maolin, and Tian, Yu-Chu
- Abstract
The use of cloud services to process large amounts of data is growing, and the demands for scalable, reliable, and highly available services in cloud environments are rising. These demands, and the urge to develop such facilities, have made parallel computing more appealing. Cloud providers offer various types of Virtual Machines (VMs) that are compatible with parallel processing, and clients pay for their hourly usage. The price varies based on the type, the number, and the hiring time of the VMs. A daily price fluctuation timetable has been proposed, and scaling the number of VMs according to it helps schedule the computations to meet both the deadline and the cost minimization goals. It becomes critical to select appropriate VMs and distribute the workload efficiently across them. Therefore, the computations and the VMs need to be well managed, scheduled, and monitored to meet the deadline while minimizing the total hiring cost. To address these concerns, this paper formulates the problem of calculating the total hiring cost before and during the computations. The execution time and the total cost are calculated based on the application's input size and the required type and number of VMs. We worked on two applications as sample benchmarks to identify the best approach for choosing and managing the VMs that compute them. Our results indicate that, among the different available approaches for hiring VMs, identifying the most affordable approach significantly minimizes the cost.
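A toy version of the price-timetable idea is sketched below: pack the required VM-hours into the cheapest hours before the deadline, subject to a per-hour VM cap. The prices, the VM-hour work measure, and the cap are illustrative assumptions only; the paper's models additionally cover VM types and workload distribution.

```python
# Toy scheduler for the price-timetable idea: fill the required VM-hours into the
# cheapest hours before the deadline, subject to a per-hour VM cap. Prices and
# the work measure are illustrative assumptions, not the paper's model.
def schedule_vm_hours(work_vm_hours, hourly_price, deadline_hour, max_vms_per_hour):
    hours_by_price = sorted(range(deadline_hour), key=lambda h: hourly_price[h])
    plan, remaining, cost = {}, work_vm_hours, 0.0
    for h in hours_by_price:
        if remaining <= 0:
            break
        vms = min(max_vms_per_hour, remaining)
        plan[h], cost, remaining = vms, cost + vms * hourly_price[h], remaining - vms
    return (plan, cost) if remaining <= 0 else None    # None: deadline cannot be met

prices = [0.10, 0.06, 0.06, 0.12, 0.08, 0.05]          # hypothetical hourly VM prices
print(schedule_vm_hours(work_vm_hours=10, hourly_price=prices,
                        deadline_hour=5, max_vms_per_hour=4))
```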
- Published
- 2021
47. Accelerating Sequence-to-Graph Alignment on Heterogeneous Processors
- Author
-
Feng, Zonghao and Luo, Qiong
- Abstract
Sequence alignment is traditionally done between individual reads and a reference sequence. Recently, several sequence-to-graph aligners have been developed, because graph genomes provide more information than sequences. For example, variation graphs, which are created by augmenting linear genomes with known genetic variations, can significantly improve the quality of genotyping. However, existing sequence-to-graph aligners run on the CPU only, and the processing time is long. To speed up the processing, we propose a parallel sequence-to-graph alignment algorithm named HGA (Heterogeneous Graph Aligner) that runs on both the CPU and GPUs. Our algorithm achieves efficient CPU-GPU co-processing through dynamically distributing tasks to each processor. We design optimizations for frequent structures in genome graphs to reduce computational cost. Moreover, we propose the GCSR (Genome CSR) data structure for efficient genome graph processing on GPUs. We also apply architecture-aware optimizations to improve memory locality and increase throughput. Our experiments on a Xeon E5-2683 CPU and eight RTX 2080 Ti GPUs show that our algorithms outperform state-of-the-art aligners by up to 15.8 times and scale well on both the CPU and GPUs. The code of our HGA is available at https://github.com/RapidsAtHKUST/hga. © 2021 ACM.
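A compressed sparse row (CSR) layout is the standard way to keep a graph's adjacency in flat arrays so that neighbor lookups become contiguous reads, which suits GPU memory access. The sketch below builds a plain CSR adjacency for a small directed graph; the genome-specific additions of the paper's GCSR are not reproduced.

```python
# Plain CSR adjacency for a directed (e.g. genome) graph: one offset per node plus
# a flat neighbor array, so neighbor lookups are contiguous reads. The paper's
# GCSR adds genome-specific layout on top of CSR, which is not shown here.
import numpy as np

class CSRGraph:
    def __init__(self, n_nodes, edges):                 # edges: iterable of (src, dst)
        edges = list(edges)
        counts = np.zeros(n_nodes, dtype=np.int64)
        for src, _ in edges:
            counts[src] += 1
        self.offsets = np.zeros(n_nodes + 1, dtype=np.int64)
        np.cumsum(counts, out=self.offsets[1:])          # prefix sums give row starts
        self.neighbors = np.empty(len(edges), dtype=np.int64)
        cursor = self.offsets[:-1].copy()
        for src, dst in edges:
            self.neighbors[cursor[src]] = dst
            cursor[src] += 1

    def successors(self, node):
        return self.neighbors[self.offsets[node]:self.offsets[node + 1]]

g = CSRGraph(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
print(g.successors(0))   # [1 2]
```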
- Published
- 2021
48. Out-Of-Core MapReduce System for Large Datasets
- Author
-
Kaur, Gurneet and Gupta, Rajiv
- Abstract
While single-machine MapReduce systems can squeeze out maximum performance from available multi-cores, they are often limited by the size of main memory and can thus only process small datasets. Even though today's computers are equipped with efficient secondary storage devices, the frameworks do not utilize these devices, mainly because disk access latencies are much higher than those of main memory. Therefore, a single-machine setup of the Hadoop system performs much slower when it is presented with datasets larger than main memory. Moreover, such frameworks also require tuning a lot of parameters, which puts an added burden on the programmer. While distributed computational resources are now easily available, efficiently performing large-scale computations still remains a challenge due to out-of-memory errors and the complexity involved in handling distributed systems. Therefore, we develop techniques to perform large-scale processing on a single machine by reducing the amount of IO and exploiting sequential locality when using disks. First, this dissertation presents OMR, a single-machine out-of-core MapReduce system that can efficiently handle datasets that are far larger than the size of main memory and guarantees linear scaling with growing data sizes. OMR actively minimizes the amount of data to be read/written to/from disk via on-the-fly aggregation, and it uses block-sequential disk read/write operations whenever disk accesses become necessary, to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets that are up to 5× larger than main memory. Our experiments show that, in comparison to the standalone single-machine setup of the Hadoop system, OMR delivers far higher performance. OMR also avoids out-of-memory crashes for large datasets and delivers high performance for datasets that fit in main memory. Second, this dissertation presents a single-level out-of-core partitioner for large
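The general pattern of on-the-fly aggregation plus sequential spill-and-merge can be illustrated with a tiny out-of-core word count, sketched below. It is only a minimal illustration of that pattern, not OMR itself, and the key-count threshold stands in for a real memory budget.

```python
# Minimal out-of-core word count: aggregate on the fly in memory, spill sorted
# runs to disk sequentially when a budget is exceeded, then merge the runs. This
# illustrates the general pattern only, not OMR itself.
import heapq
import os
import tempfile
from collections import defaultdict
from itertools import groupby

SPILL_THRESHOLD = 100_000          # max distinct keys held in memory (toy budget)

def spill(counts, run_dir):
    path = os.path.join(run_dir, f"run-{len(os.listdir(run_dir))}.tsv")
    with open(path, "w") as f:
        for key in sorted(counts):                 # sorted run, written sequentially
            f.write(f"{key}\t{counts[key]}\n")
    return path

def read_run(path):
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            yield key, int(value)

def out_of_core_wordcount(lines, run_dir):
    counts, runs = defaultdict(int), []
    for line in lines:
        for word in line.split():
            counts[word] += 1                      # on-the-fly (map-side) aggregation
        if len(counts) >= SPILL_THRESHOLD:
            runs.append(spill(counts, run_dir))
            counts.clear()
    if counts:
        runs.append(spill(counts, run_dir))
    merged = heapq.merge(*(read_run(r) for r in runs), key=lambda kv: kv[0])
    for key, group in groupby(merged, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)        # reduce over the merged sorted runs

with tempfile.TemporaryDirectory() as run_dir:
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    print(dict(out_of_core_wordcount(text, run_dir)))
```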
- Published
- 2021
49. Resource hiring cost minimisation for constrained MapReduce computations in the cloud
- Author
-
Jabbari Sabegh, Amir Hosein
- Abstract
This research tackles the problem of reducing the cost of cloud-based MapReduce computations while satisfying their deadlines. It is a multi-constrained combinatorial optimisation problem, which is challenging to solve. The minimisation goal is achieved by pre-planning and dynamic scheduling of virtual machine provisioning during computations. The proposed optimisation models and algorithms have been implemented and quantitatively evaluated in comparison with existing approaches. They are shown to reduce the costs of cloud-based computations by hiring fewer VMs and scheduling them more efficiently.
- Published
- 2021
50. Threshold-Based Fast Successive-Cancellation Decoding of Polar Codes
- Author
-
Zheng, Haotian, Hashemi, Seyyed Ali, Balatsoukas-Stimming, Alexios, Cao, Zizheng, Koonen, Ton, Cioffi, John, and Goldsmith, Andrea
- Abstract
Fast SC decoding overcomes the latency caused by the serial nature of the SC decoding by identifying new nodes in the upper levels of the SC decoding tree and implementing their fast parallel decoders. In this work, we first present a novel sequence repetition node corresponding to a particular class of bit sequences. Most existing special node types are special cases of the proposed sequence repetition node. Then, a fast parallel decoder is proposed for this class of node. To further speed up the decoding process of general nodes outside this class, a threshold-based hard-decision-aided scheme is introduced. The threshold value that guarantees a given error-correction performance in the proposed scheme is derived theoretically. Analysis and hardware implementation results on a polar code of length 1024 with code rates 1/4, 1/2, and 3/4 show that our proposed algorithm reduces the required clock cycles by up to 8%, and leads to a 10% improvement in the maximum operating frequency compared to state-of-the-art decoders without tangibly altering the error-correction performance. In addition, using the proposed threshold-based hard-decision-aided scheme, the decoding latency can be further reduced by 57% at Eb/N0=5.0 dB.
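The hard-decision-aided idea can be pictured at the level of a single node's LLRs: if every magnitude clears a threshold, hard decisions are emitted immediately, otherwise the node is decoded conventionally. The sketch below shows only that generic gating step; the paper's node decoders and its theoretical threshold derivation are not reproduced.

```python
# Generic gating step for a threshold-based hard-decision aid: if all LLR
# magnitudes in a node exceed the threshold, emit hard decisions right away;
# otherwise signal that the node needs its regular decoder. The paper's node
# decoders and threshold derivation are not reproduced here.
import numpy as np

def node_hard_decision(llrs, threshold):
    llrs = np.asarray(llrs, dtype=float)
    if np.min(np.abs(llrs)) > threshold:
        return (llrs < 0).astype(int)      # convention: negative LLR -> bit 1
    return None                            # fall back to conventional node decoding

print(node_hard_decision([4.2, -5.1, 3.9, -6.0], threshold=3.0))   # [0 1 0 1]
print(node_hard_decision([4.2, -0.4, 3.9, -6.0], threshold=3.0))   # None
```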
- Published
- 2021