708 results
Search Results
2. Affinity-Aware VNF Placement in Mobile Edge Clouds via Leveraging GPUs.
- Author
-
Xu, Zichuan, Zhang, Zhiheng, Lui, John C. S., Liang, Weifa, Xia, Qiufen, Zhou, Pan, Xu, Wenzheng, and Wu, Guowei
- Subjects
APPROXIMATION algorithms ,ONLINE algorithms ,ALGORITHMS ,MOBILE computing ,VIRTUAL networks ,EDGE computing - Abstract
Mobile edge computing has become a promising technology for mitigating the latency of various cloud services. In addition, network function virtualization (NFV) has shown great potential for reducing the operational cost of cloud services while enhancing the flexibility of virtual network function deployments, by implementing dedicated hardware network functions as pieces of software on generic servers. Recently, GPU acceleration has been investigated to speed up flow processing in virtual network functions (VNFs) by leveraging the parallelism of GPUs. VNFs that need acceleration prefer to stay at cloudlets (locations) equipped with GPUs. However, little attention has been paid to VNF placement that takes GPU affinity in the cloudlets of mobile edge clouds into account. In this paper, we consider the affinity-aware throughput maximization problem in a mobile edge cloud, leveraging the parallelism of GPUs for user requests with VNF requirements. We consider two types of affinities in VNF placement: soft affinity, which allows VNFs to be executed by either CPUs or GPUs in cloudlets, and hard affinity, which only allows VNFs to be placed on the GPUs of a specified set of cloudlets. We formulate two corresponding VNF placement problems in a mobile edge cloud. Specifically, we first propose an exact solution to the soft-affinity throughput maximization problem by formulating an Integer Linear Program (ILP). We then propose an efficient algorithm for the problem: a randomized algorithm with a provable approximation ratio for the hard-affinity-aware throughput maximization problem, which we extend to the soft-affinity throughput maximization problem. Furthermore, assuming that user requests arrive at the mobile edge cloud one by one without knowledge of future arrivals, we devise an online algorithm with a good competitive ratio for this dynamic hard-affinity-aware throughput maximization problem. Finally, we evaluate the performance of the proposed algorithms through simulations and implementations in a real test-bed. Experimental results show that the proposed algorithms outperform their existing counterparts and achieve higher throughput. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2021
- Full Text
- View/download PDF
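As an illustration of the kind of ILP formulation described in the abstract above, here is a minimal sketch in Python using the PuLP modeling library. The instance (request and cloudlet names, rates, demands, capacities) is invented for illustration, and the model is a simplification, not the authors' exact formulation.

    # Hypothetical toy instance of a throughput-maximization placement ILP.
    import pulp

    requests = ["r1", "r2", "r3"]          # user requests with VNF requirements
    cloudlets = ["c1", "c2"]               # edge locations
    rate = {"r1": 5, "r2": 3, "r3": 4}     # throughput of each request if admitted
    demand = {"r1": 2, "r2": 1, "r3": 2}   # resource demand of each request's VNF
    cap = {"c1": 3, "c2": 2}               # capacity of each cloudlet

    prob = pulp.LpProblem("affinity_placement", pulp.LpMaximize)
    x = pulp.LpVariable.dicts(
        "x", [(r, c) for r in requests for c in cloudlets], cat="Binary")

    # Objective: total admitted throughput.
    prob += pulp.lpSum(rate[r] * x[(r, c)] for r in requests for c in cloudlets)
    # Each request is placed on at most one cloudlet.
    for r in requests:
        prob += pulp.lpSum(x[(r, c)] for c in cloudlets) <= 1
    # Cloudlet capacity constraints.
    for c in cloudlets:
        prob += pulp.lpSum(demand[r] * x[(r, c)] for r in requests) <= cap[c]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print(pulp.value(prob.objective))      # maximum admitted throughput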
3. Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for Video Recognition With Hierarchical Tucker Tensor Decomposition.
- Author
-
Gong, Yu, Yin, Miao, Huang, Lingyi, Deng, Chunhua, and Yuan, Bo
- Subjects
PARTICIPATORY design ,ENERGY consumption ,ALGORITHMS ,HARDWARE - Abstract
Long short-term memory (LSTM) is a type of powerful deep neural network that has been widely used in many sequence analysis and modeling applications. However, the large model size of LSTM networks makes their practical deployment still very challenging, especially for video recognition tasks that require high-dimensional input data. Aiming to overcome this limitation and fully unlock the potential of LSTM models, in this paper we propose to perform algorithm and hardware co-design towards high-performance, energy-efficient LSTM networks. At the algorithm level, we develop a fully decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM, which enjoys ultra-low model complexity while still achieving high accuracy. In order to fully reap this attractive algorithmic benefit, we further develop a corresponding customized hardware architecture to support the efficient execution of the proposed FDHT-LSTM model. With a delicate memory access scheme, the complicated matrix transformation can be efficiently supported by the underlying hardware on the fly, without any access conflict. Our evaluation results show that both the proposed ultra-compact FDHT-LSTM models and the corresponding hardware accelerator achieve very high performance. Compared with state-of-the-art compressed LSTM models, FDHT-LSTM enjoys both an order-of-magnitude reduction in model size (more than 1000×) and a significant accuracy improvement (0.6% to 12.7%) across different video recognition datasets. Meanwhile, compared with TIE, the state-of-the-art hardware for tensor-decomposed models, the proposed FDHT-LSTM architecture achieves 2.5×, 1.46×, and 2.41× increases in throughput, area efficiency, and energy efficiency, respectively, on the LSTM-Youtube workload. For the LSTM-UCF workload, the proposed design also outperforms TIE with 1.9× higher throughput, 1.83× higher energy efficiency, and comparable area efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
4. CloudChain: A Cloud Blockchain Using Shared Memory Consensus and RDMA.
- Author
-
Xu, Minghui, Liu, Shuo, Yu, Dongxiao, Cheng, Xiuzhen, Guo, Shaoyong, and Yu, Jiguo
- Subjects
MEMORY ,BLOCKCHAINS ,CLOUD computing ,GOVERNMENT agencies ,ALGORITHMS - Abstract
Blockchain technologies can enable secure computing environments among mistrusting parties. Permissioned blockchains are particularly favored by companies, enterprises, and government agencies due to their efficiency, customizability, and governance-friendly features. Seamlessly fusing blockchain and cloud computing can significantly benefit permissioned blockchains; nevertheless, most blockchains implemented on clouds were originally designed for loosely-coupled networks where nodes communicate asynchronously, and thus fail to take advantage of the closely-coupled nature of cloud servers. In this paper, we propose an innovative cloud-oriented blockchain, CloudChain, a modularized three-layer system composed of a network layer, a consensus layer, and a blockchain layer. CloudChain is based on a shared-memory model in which nodes communicate synchronously through direct memory accesses. We realize the shared-memory model with Remote Direct Memory Access (RDMA) technology, on top of which we propose a shared-memory consensus algorithm to ensure persistence and liveness, the two crucial blockchain security properties countering Byzantine nodes. We also implement a CloudChain prototype on a RoCEv2-based testbed to experimentally validate our design, and the results verify the feasibility and efficiency of CloudChain. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
5. Better Adaptive Malicious Users Detection Algorithm in Human Contact Networks.
- Author
-
Lin, Limei, Huang, Yanze, Xu, Li, and Hsieh, Sun-Yuan
- Subjects
ALGORITHMS ,GRAPH algorithms ,HUMAN beings ,FEATURE extraction - Abstract
A human contact network (HCN) consists of individuals moving around and interacting with each other. In an HCN, it is essential to detect malicious users who disrupt data delivery by terminating it or by tampering with the data. Since malicious users pay more but gain less when disrupting the data delivery of opportunistic contacts, we focus on non-opportunistic contacts, which occur more frequently and stably. It is observed that people contact each other more frequently if they have more social features in common. In this paper, we build a topology structure for the HCN based on social features and propose a graph-theoretical comparison detection model for malicious user detection. We then present an adaptive detection scheme based on Hamiltonian cycle decomposition, and we define comparison-0-strings and comparison-1-strings to improve detection efficiency. Moreover, we perform scenario simulations on real data to demonstrate the detection process for malicious users. Experiments show that, when the number of malicious users is bounded by the dimension of the HCN, our scheme has a detection rate of 100% with both false positive and false negative rates of 0%, and its running cost is also very low compared to baseline approaches. When the number of malicious users exceeds the bound, the detection rate of our scheme decreases slowly while the false positive and false negative rates increase slowly, but they are still better than those of the baseline approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
6. Triangle Counting Accelerations: From Algorithm to In-Memory Computing Architecture.
- Author
-
Wang, Xueyan, Yang, Jianlei, Zhao, Yinglin, Jia, Xiaotao, Yin, Rong, Chen, Xuhang, Qu, Gang, and Zhao, Weisheng
- Subjects
RANDOM access memory ,GRAPH algorithms ,ALGORITHMS ,TRIANGLES ,FIELD programmable gate arrays ,MAGNETIC torque - Abstract
Triangles are the basic substructure of networks, and triangle counting (TC) is a fundamental graph computing problem in numerous fields such as social network analysis. Nevertheless, like other graph computing problems, TC involves a large amount of data transfer due to its high memory-to-computation ratio and random memory access pattern, and thus suffers from the bandwidth bottleneck of the traditional von Neumann architecture. To overcome this challenge, in this paper we propose to accelerate TC with the emerging processing-in-memory (PIM) architecture through algorithm-architecture co-optimization. To enable efficient in-memory implementations, we reformulate TC with bitwise logic operations (such as AND), and develop customized graph compression and mapping techniques for efficient data flow management. With an emerging computational Spin-Transfer Torque Magnetic RAM (STT-MRAM) array, one of the most promising PIM enabling techniques, device-to-architecture co-simulation results demonstrate that the proposed TC in-memory accelerator outperforms state-of-the-art GPU and FPGA accelerators by 12.2× and 31.8×, respectively, and achieves a 34× energy efficiency improvement over the FPGA accelerator. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2022
- Full Text
- View/download PDF
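The bitwise reformulation of triangle counting mentioned above can be illustrated in a few lines: for each edge (u, v), the triangles through it correspond to the 1-bits of the AND of the two endpoints' neighbor bitmaps. This is a software sketch of the AND-based idea only, not of the PIM mapping or compression techniques.

    def count_triangles(adj):
        """adj[v] is an int bitmap; bit u is set iff (v, u) is an edge."""
        total = 0
        n = len(adj)
        for u in range(n):
            for v in range(u + 1, n):
                if (adj[u] >> v) & 1:                       # edge (u, v) exists
                    total += bin(adj[u] & adj[v]).count("1")  # common neighbors
        return total // 3  # each triangle is counted once per edge

    # A single triangle on vertices 0, 1, 2:
    print(count_triangles([0b110, 0b101, 0b011]))  # 1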
7. On the Optimal Pre-Computation of Window τNAF for Koblitz Curves.
- Author
-
Trost, William R. and Xu, Guangwu
- Subjects
ELLIPTIC curve cryptography ,ALGORITHMS ,APPROXIMATION theory ,COMPUTATIONAL complexity ,COEFFICIENTS (Statistics) - Abstract
Koblitz curves have been an important subject of consideration for both theoretical and practical interests. The window τ-adic algorithm of Solinas (window τNAF) is the most powerful method for computing point multiplication for Koblitz curves. Pre-computation plays an important role in improving the performance of point multiplication. In this paper, the concept of optimal pre-computation for window τNAF is formulated. In this setting, an optimal pre-computation has some mathematically natural and clean forms, and requires 2^(w−2) − 1 point additions and two evaluations of the Frobenius map τ, where w is the window width. One of the main results of this paper is to construct an optimal pre-computation scheme for each window width w from 4 to 15 (more than practical needs). These pre-computations can be easily incorporated into implementations of window τNAF. The ideas in the paper can also be used to construct other suitable pre-computations. This paper includes a discussion of coefficient sets for window τNAF and the divisibility by powers of τ through different approaches. Some issues of implementation are also discussed. [ABSTRACT FROM PUBLISHER] (See the sketch below.)
- Published
- 2016
- Full Text
- View/download PDF
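The pre-computation count above is easiest to see through the standard width-w NAF recoding of an ordinary integer, which uses the same odd-digit window idea as window τNAF (the τ-adic version works in Z[τ] and is not reproduced here). With window width w, the nonzero digits are the odd values of magnitude below 2^(w−1), so 2^(w−2) odd multiples of the base point are involved, i.e., 2^(w−2) − 1 point additions beyond the point itself, loosely matching the count in the abstract. A minimal sketch:

    def width_w_naf(k, w):
        """Width-w NAF of integer k: digits are 0 or odd with |d| < 2^(w-1),
        and any w consecutive digits contain at most one nonzero digit."""
        digits = []
        while k > 0:
            if k & 1:
                d = k % (1 << w)              # k mod 2^w
                if d >= (1 << (w - 1)):       # map into the odd range (-2^(w-1), 2^(w-1))
                    d -= 1 << w
                k -= d
            else:
                d = 0
            digits.append(d)
            k >>= 1
        return digits                          # least-significant digit first

    print(width_w_naf(7, 2))  # [-1, 0, 0, 1], i.e., 7 = 8 - 1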
8. Design of Reliable and Secure Devices Realizing Shamir's Secret Sharing.
- Author
-
Wang, Zhen, Karpovsky, Mark, and Bu, Lake
- Subjects
DATA encryption ,INFORMATION sharing ,ALGORITHMS ,MATRICES (Mathematics) ,INFORMATION technology security ,SIMULATION methods & models - Abstract
Shamir's secret sharing scheme is an effective way to distribute a secret to a group of shareholders. The security of the unprotected sharing scheme, however, can easily be broken by cheaters or attackers who maliciously feed incorrect shares during the secret recovery stage or inject faults into the hardware computing the secret. In this paper, we propose cheater detection and identification schemes based on robust and algebraic manipulation detection (AMD) codes and m-disjunct matrices (superimposed codes). We present constructions of codes for cheater detection and identification, and describe how the cheater identification problem can be related to classic group testing algorithms based on m-disjunct matrices. Simulation and synthesis results show that the proposed architecture can improve the security level significantly, even under strong cheating attack models, with reasonable area and timing overheads. [ABSTRACT FROM PUBLISHER] (See the sketch below.)
- Published
- 2016
- Full Text
- View/download PDF
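For reference, here is the unprotected textbook (t, n) Shamir scheme whose recovery stage the paper hardens against cheaters; the prime and parameters are illustrative, and none of the paper's AMD-code protections appear here.

    import random

    P = 2**127 - 1  # a Mersenne prime, used as the field modulus GF(P)

    def split(secret, t, n):
        """Create n shares; any t of them recover the secret."""
        coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
        return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
                for x in range(1, n + 1)]

    def recover(shares):
        """Lagrange interpolation at x = 0 from exactly t shares."""
        secret = 0
        for j, (xj, yj) in enumerate(shares):
            num = den = 1
            for m, (xm, _) in enumerate(shares):
                if m != j:
                    num = num * (-xm) % P
                    den = den * (xj - xm) % P
            secret = (secret + yj * num * pow(den, P - 2, P)) % P
        return secret

    shares = split(1234, t=3, n=5)
    assert recover(shares[:3]) == 1234  # any 3 shares suffice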
9. Test Algorithms for ECC-Based Memory Repair in Ultimate CMOS and Post-CMOS.
- Author
-
Papavramidou, Panagiota and Nicolaidis, Michael
- Subjects
ALGORITHMS ,FABRICATION (Manufacturing) ,CMOS amplifiers ,COMPUTER science - Abstract
In modern SoCs, embedded memories should be protected by ECC against field failures to achieve acceptable reliability. They should also be repaired after fabrication to achieve acceptable fabrication yield. In technologies affected by high defect densities, conventional repair induces very high costs. To reduce these costs, we can use ECC-based repair, which consists in using the ECC to fix words comprising a single faulty cell, and self-repair to fix all other faulty words. However, as we show in this paper, for high defect densities the diagnosis circuitry required for ECC-based repair may induce a very large hardware cost. To fix this issue, we introduce a new family of memory test algorithms that exhibit a property we term “single-read double-fault detection”. This approach gains interest in ultimate CMOS and post-CMOS technologies, where defect densities are expected to increase significantly, and/or in very-low-power designs, as very low voltage sharply increases defect densities. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
10. New Block Recombination for Subquadratic Space Complexity Polynomial Multiplication Based on Overlap-Free Approach.
- Author
-
Park, Sun-Mi, Chang, Ku-Young, Hong, Dowon, and Seo, Changho
- Subjects
POLYNOMIAL time algorithms ,CODING theory ,CRYPTOGRAPHY ,POLYNOMIAL approximation ,ALGORITHMS - Abstract
In this paper, we present new parallel polynomial multiplication formulas that achieve subquadratic space complexity. The schemes are based on a recently proposed block recombination of a polynomial multiplication formula. The proposed two-way, three-way, and four-way split polynomial multiplication formulas achieve the smallest space complexities. Moreover, by providing an area-time tradeoff method, the proposed formulas enable one to choose a parallel polynomial multiplication formula suited to a given design environment. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2017
- Full Text
- View/download PDF
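The two-way split underlying such subquadratic formulas is the Karatsuba-style decomposition, which over GF(2) needs only three half-size multiplications and XORs. The sketch below encodes GF(2) polynomials as Python ints (bit i is the coefficient of x^i); it shows the split idea only, not the paper's block-recombination refinement.

    def clmul(a, b):
        """Schoolbook carry-less (GF(2)) multiply, used as the base case."""
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            b >>= 1
        return r

    def karatsuba_gf2(a, b, n):
        """Multiply GF(2) polynomials of degree < n; n a power of two."""
        if n <= 16:
            return clmul(a, b)
        h = n // 2
        a0, a1 = a & ((1 << h) - 1), a >> h
        b0, b1 = b & ((1 << h) - 1), b >> h
        lo = karatsuba_gf2(a0, b0, h)
        hi = karatsuba_gf2(a1, b1, h)
        mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, h) ^ lo ^ hi  # middle term
        return lo ^ (mid << h) ^ (hi << n)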
11. A Framework for Crossing Temperature-Induced Timing Errors Underlying Hardware Accelerators to the Algorithm and Application Layers.
- Author
-
Paim, Guilherme, Amrouch, Hussam, Rocha, Leandro M. G., Abreu, Brunno, Cesar da Costa, Eduardo Antonio, Bampi, Sergio, and Henkel, Jorg
- Subjects
ALGORITHMS ,TEMPERATURE effect ,HARDWARE ,MOBILE apps ,MACHINE learning - Abstract
Rising temperature is an unavoidable effect in VLSI and has always been a critical issue in any system-on-chip, especially when targeting compute-intensive applications. This effect increases delay in hardware accelerators, resulting in timing errors due to an unsustainable clock frequency, whose impact must be carefully evaluated at design time to measure the performance degradation of the hardware accelerator. Further, hardware operating at a higher temperature ages faster, which incurs more timing errors. This issue is usually addressed by including timing guardbands that compensate for the deleterious effects of temperature, ensuring the hardware accelerator works within a reliable zone, i.e., without any timing errors caused by temperature effects at runtime. However, guardbands directly result in considerable performance and efficiency losses because the circuit is clocked at a frequency lower than its full potential. Accelerators on edge devices often dismiss such guardbands to explore the full potential of the designed circuits, posing an enormous design challenge, as this approach requires a careful evaluation of the impact of timing errors on the quality of the target applications. Many algorithms, such as those in multimedia and machine learning applications, can tolerate hardware errors. Yet, these algorithms have a dynamic (i.e., closed-loop) behavior in which a timing error can propagate and affect subsequent steps. Measuring degradation-induced errors in these applications is very challenging, given that an accurate gate-level simulation investigating degradation-induced timing errors needs to be coupled dynamically with a system-level simulator to unveil how errors induced in the underlying hardware ultimately impact algorithm execution on the hardware accelerator. This is the first work to achieve this goal. State-of-the-art works have studied accelerators under timing errors when removing (or narrowing) guardbands. However, their approach was suitable only for open-loop hardware accelerators, which are entirely agnostic of the complex interactions of the algorithms. Unlike prior work, this paper investigates temperature- and aging-induced timing errors in the joint accelerator-algorithm interactions and their runtime impacts. Our framework investigates aging effects across the different layers, starting from transistor physics all the way up to the algorithm layer. The hardware accelerator employed as a case study in this work is the sum of absolute differences (SAD), the most compute-intensive accelerator in commercial video encoders for mobile applications. Our results demonstrate the runtime behavior impacts of three advanced block-matching algorithms of the video encoder operating jointly with a SAD accelerator under timing errors induced by temperature and aging effects, considering a 14nm FinFET technology. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
12. Adaptive Partition Testing.
- Author
-
Sun, Chang-ai, Dai, Hepeng, Liu, Huai, Chen, Tsong Yueh, and Cai, Kai-Yuan
- Subjects
COMPUTER software testing ,MARKOV processes ,PROBABILITY theory ,ALGORITHMS ,FAULT diagnosis - Abstract
Random testing and partition testing are two major families of software testing techniques. They have been compared both theoretically and empirically in numerous studies over the decades, and it is widely acknowledged that each has its own advantages and disadvantages and that their innate characteristics are fairly complementary. Some work has been conducted to develop advanced testing techniques through the integration of random testing and partition testing, attempting to preserve the advantages of both while minimizing their disadvantages. In this paper, we propose a new testing approach, adaptive partition testing, in which test cases are randomly selected from some partition whose probability of being selected is adaptively adjusted along the testing process. We develop two algorithms to implement the proposed approach: Markov-chain based adaptive partition testing and reward-punishment based adaptive partition testing. The former uses a Markov matrix to dynamically adjust the probability of a partition being selected for conducting tests, while the latter is based on a reward and punishment mechanism. We conduct empirical studies to evaluate the performance of the proposed algorithms using ten faulty versions of three large-scale open source programs. Our experimental results show that, compared with two baseline techniques, namely random partition testing (RPT) and dynamic random testing (DRT), our algorithms deliver higher fault-detection effectiveness with lower test case selection overhead. This demonstrates that the proposed adaptive partition testing is an effective testing approach that takes advantage of both random testing and partition testing. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2019
- Full Text
- View/download PDF
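A reward-and-punishment selection loop in the spirit of the second algorithm can be sketched as follows. The concrete update rule (the eps step and the probability floor) is an assumption made for illustration, not the authors' exact mechanism.

    import random

    def adaptive_partition_test(partitions, run_test, budget, eps=0.05):
        """partitions: list of test-case pools; run_test(tc) -> True iff a failure is found."""
        p = [1.0 / len(partitions)] * len(partitions)
        failures = []
        for _ in range(budget):
            i = random.choices(range(len(partitions)), weights=p)[0]
            tc = random.choice(partitions[i])
            if run_test(tc):
                failures.append(tc)
                p[i] += eps                    # reward: failing partitions get more weight
            else:
                p[i] = max(p[i] - eps, 0.01)   # punish, but keep a floor so no partition starves
            s = sum(p)
            p = [q / s for q in p]             # renormalize to a probability vector
        return failures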
13. Automated Test Generation for Debugging Multiple Bugs in Arithmetic Circuits.
- Author
-
Farahmandi, Farimah and Mishra, Prabhat
- Subjects
DEBUGGING ,ARITHMETIC ,CRYPTOGRAPHY ,SIGNAL processing ,ALGORITHMS - Abstract
Optimized and custom arithmetic circuits are widely used in embedded systems such as multimedia applications, cryptography systems, signal processing, and console games. Debugging arithmetic circuits is a challenge due to their increasing complexity coupled with non-standard implementations. Existing algebraic rewriting techniques produce a remainder to indicate the presence of a potential bug; however, bug localization remains a major bottleneck. Simulation-based validation using random or constrained-random tests is not effective for complex arithmetic circuits due to bit-blasting. In this paper, we present an automated test generation and bug localization technique for debugging arithmetic circuits. This paper makes four important contributions. We propose an automated approach for generating directed tests by suitable assignments of input variables to make the remainder non-zero; the generated tests are guaranteed to activate bugs. We also propose an automatic bug fixing technique that utilizes the patterns of the remainder terms, as well as analysis of the regions activated by the generated tests, to detect and correct the error(s). We further propose an efficient debugging algorithm that can handle multiple dependent as well as independent bugs. Finally, our proposed framework, consisting of directed test generation, bug localization, and bug correction, is fully automated. In other words, our framework is capable of producing a corrected implementation of arithmetic circuits without any manual intervention. Our experimental results demonstrate that the proposed approach can be used for automated debugging of large and complex arithmetic circuits. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
14. Optimizing Polynomial Convolution for NTRUEncrypt.
- Author
-
Dai, Wei, Whyte, William, and Zhang, Zhenfei
- Subjects
COMPUTER network security ,ENCRYPTION protocols ,COMPUTER security standards ,ALGORITHMS ,PUBLIC key cryptography - Abstract
NTRUEncrypt is one of the most promising candidates for quantum-safe cryptography. In this paper, we focus on the NTRU743 parameter set. We give a report on all known attacks against this parameter set and show that it delivers 256 bits of security against classical attackers and 128 bits of security against quantum attackers. We then present a parameter-dependent optimization using a tailored hierarchy of multiplication algorithms as well as the Intel AVX2 instructions, and show that this optimization is constant-time. Our implementation is two to three times faster than the reference implementation of NTRUEncrypt. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2018
- Full Text
- View/download PDF
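The core operation being optimized above is polynomial (cyclic) convolution in a ring of the form Z_q[x]/(x^N − 1). Here is the schoolbook version, without any of the paper's hierarchical-multiplication or AVX2 optimizations; for the NTRU743 parameter set, N would be 743.

    def convolve(a, b, N, q):
        """a, b: coefficient lists of length N; returns a*b mod (x^N - 1, q)."""
        c = [0] * N
        for i in range(N):
            if a[i] == 0:          # skip zero coefficients (NTRU keys are sparse)
                continue
            for j in range(N):
                c[(i + j) % N] = (c[(i + j) % N] + a[i] * b[j]) % q
        return c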
15. Reconfiguration Algorithms for Power Efficient VLSI Subarrays with Four-Port Switches.
- Author
-
Wu Jigang and Srikanthan, Thambipillai
- Subjects
VERY large scale circuit integration ,ALGORITHMS ,ENERGY dissipation ,ELECTRIC switchgear ,DYNAMIC programming ,SIMULATION methods & models ,INTEGRATED circuit interconnections ,HEURISTIC ,LINEAR systems - Abstract
Techniques to determine subarrays when processing elements of VLSI arrays become faulty have been investigated extensively. These tend to identify the largest possible subarray without considering the power efficiency of the resulting subarray. In this paper, we propose new techniques, based on a heuristic strategy and dynamic programming, to minimize interconnect length in an attempt to reduce power dissipation without a performance penalty. Our algorithms show that notable reductions in the number of long interconnects can be realized in linear time without sacrificing the size of the subarray. Our evaluations show that, for a VLSI array of size 256 × 256, the number of long interconnects in the subarray can be reduced by up to 95 percent for clustered faults, and by up to 50 percent and 73 percent for random faults with densities of 10 percent and 0.1 percent, respectively, when compared with the most efficient implementation cited in the literature. The interconnect power saving for a VLSI array of size 512 × 512 is up to 11 percent for random faults. We have also shown that interconnect power savings of up to 14 percent are possible for the cases investigated. Simulations based on several random and clustered fault scenarios clearly reveal the superiority of the proposed techniques for power-efficient realizations. In addition, a lower bound on performance is derived to demonstrate that the proposed algorithms are nearly optimal for the cases considered in this paper. [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
16. A Fast Pessimistic Diagnosis Algorithm for Hypercube-Like Networks under the Comparison Model.
- Author
-
Ye, Liang-Cheng, Liang, Jia-Rong, and Lin, Hai-Xiang
- Subjects
MULTIPROCESSORS ,HYPERCUBE networks (Computer networks) ,FAULT-tolerant computing ,ALGORITHMS ,MATHEMATICAL models - Abstract
Diagnosis by comparison is a realistic approach to detecting faults in multiprocessor systems. This paper considers a pessimistic diagnostic strategy for hypercube-like multiprocessor systems under the comparison model. The pessimistic strategy is a diagnostic process whereby all faulty nodes can be correctly identified and at most one fault-free node may be misjudged as faulty. We propose a pessimistic diagnosis algorithm based on the largest component in the faulty system. For a system with N = 2^n nodes and n ≥ 5, when the number of faulty nodes is bounded by 2n − 2, the algorithm can correctly identify all nodes except at most one node left undiagnosed. The time complexity of the algorithm is O(N log₂ N). [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
17. A Nearly Optimal Packet Scheduling Algorithm for Input Queued Switches with Deadline Guarantees.
- Author
-
Zhang, Baoxian, Wan, Xili, Luo, Junzhou, and Shen, Xiaojun
- Subjects
DATA packeting ,PRODUCTION scheduling ,ALGORITHMS ,QUEUING theory ,QUALITY of service ,GLOBAL Positioning System ,COMPUTER networks - Abstract
Deadline-guaranteed packet scheduling for switches is a fundamental issue in providing guaranteed QoS in digital networks. It is a historically difficult NP-hard problem when three or more deadlines are involved, and all existing algorithms have throughput too low to be used in practice. A key reason is that they use packet deadlines as default priorities to decide which packets to drop whenever conflicts occur. Although such a priority structure can ease scheduling by focusing on one deadline at a time, it hurts throughput greatly. Since deadlines do not necessarily represent the actual importance of packets, throughput can be greatly improved if deadline-induced priority is not enforced. This paper first presents an algorithm that guarantees maximum throughput for the case where only two different deadlines are allowed. Then, an algorithm called iterative scheduling with no priority (ISNOP) is proposed for the general case where k > 2 different deadlines may occur. Not only does this algorithm have dramatically better average performance than all existing algorithms, but it also guarantees an approximation ratio of 2. ISNOP would provide a good practical solution for the historically difficult packet scheduling problem. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
18. Knowledge Sharing in the Online Social Network of Yahoo! Answers and Its Implications.
- Author
-
Shen, Haiying, Li, Ze, Liu, Jinwei, and Grant, Joseph Edward
- Subjects
ONLINE social networks ,INFORMATION sharing ,WEB search engines ,WEBSITES ,ALGORITHMS - Abstract
Question and Answer (Q&A) websites such as Yahoo! Answers provide a platform where users can post questions and receive answers. These systems take advantage of the collective intelligence of users to find information. In this paper, we analyze the online social network (OSN) in Yahoo! Answers. Based on a large amount of collected data, we study the OSN's structural properties, which reveal strikingly distinct characteristics such as low link symmetry and weak correlation between indegree and outdegree. After studying the knowledge base and behaviors of the users, we find that a small number of top contributors answer most of the questions in the system, and that each top contributor focuses on only a few knowledge categories. In addition, the knowledge categories of the users are highly clustered. We also study the knowledge base in a user's social network, which reveals that the members of a user's social network share only a few knowledge categories. Based on these findings, we provide guidance for the design of spammer detection algorithms and distributed Q&A systems. We also propose a friendship-knowledge oriented Q&A framework that synergistically combines current OSN-based Q&A and web Q&A. We believe that the results presented in this paper are crucial to understanding the collective intelligence in web Q&A OSNs and lay a cornerstone for the evolution of next-generation Q&A systems. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
19. An Improved Approximation Ratio to the Partial-Terminal Steiner Tree Problem.
- Author
-
Lee, Chia-Wei, Huang, Chao-Wen, Pi, Wen-Hao, and Hsieh, Sun-Yuan
- Subjects
APPROXIMATION theory ,PROBLEM solving ,COST functions ,SET theory ,ALGORITHMS - Abstract
We consider a generalization of both the classic Steiner tree problem and the terminal Steiner tree problem. Given a complete graph G = (V, E) with a metric cost function c: E → Q≥0 and two proper subsets R ⊂ V and R′ ⊆ R, a partial-terminal Steiner tree is a Steiner tree that contains all vertices in R and in which all vertices in R′ must be leaves. The partial-terminal Steiner tree problem is to find a partial-terminal Steiner tree of minimum cost in G. The previously best-known approximation ratio for the problem is 2ρ, where ρ is the approximation ratio of the Steiner tree problem. In this paper, we improve the ratio from 2ρ to 2ρ − ρ/(3ρ − 2) − f, where f is a non-negative function whose value is between 0 and ρ − ρ/(3ρ − 2). [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
20. Square Root Computation over Even Extension Fields.
- Author
-
Adj, Gora and Rodriguez-Henriquez, Francisco
- Subjects
SQUARE root ,FIELD extensions (Mathematics) ,ALGORITHMS ,MATHEMATICAL models ,ROUGH sets ,COMPUTATIONAL complexity - Abstract
This paper presents a comprehensive study of the computation of square roots over finite extension fields. We propose two novel algorithms for computing square roots over even field extensions of the form F_{q²}, with q = pⁿ, p an odd prime, and n ≥ 1. Both algorithms have an associated computational cost roughly equivalent to one exponentiation in F_{q²}. The first algorithm is devoted to the case q ≡ 1 (mod 4), whereas the second one handles the case q ≡ 3 (mod 4). Numerical comparisons show that the two algorithms presented in this paper are competitive and in some cases more efficient than the square root methods previously known. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2014
- Full Text
- View/download PDF
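The simplest instance of the q ≡ 3 (mod 4) case mentioned above, stated for a prime field GF(p), is the classic exponentiation trick: a square root of a quadratic residue a is a^((p+1)/4) mod p. The paper's algorithms lift this kind of single-exponentiation computation to F_{q²}; the sketch below covers only the prime-field base case.

    def sqrt_mod_p(a, p):
        """p prime with p % 4 == 3; returns r with r*r % p == a % p, or None for a non-residue."""
        assert p % 4 == 3
        r = pow(a, (p + 1) // 4, p)
        return r if r * r % p == a % p else None

    print(sqrt_mod_p(2, 7))  # 4, since 4*4 = 16 ≡ 2 (mod 7)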
21. A Stochastic Computational Multi-Layer Perceptron with Backward Propagation.
- Author
-
Liu, Yidong, Liu, Siting, Wang, Yanzhi, Lombardi, Fabrizio, and Han, Jie
- Subjects
STOCHASTIC analysis ,ARTIFICIAL neural networks ,ELECTRIC power consumption ,MULTILAYER perceptrons ,ALGORITHMS - Abstract
Stochastic computation has recently been proposed for implementing artificial neural networks with reduced hardware and power consumption, but at decreased accuracy and processing speed. Most existing implementations are based on pre-training, such that the weights are predetermined for neurons at different layers; thus, these implementations lack the ability to update the values of the network parameters. In this paper, a stochastic computational multi-layer perceptron (SC-MLP) is proposed by implementing the backward propagation algorithm for updating the layer weights. Using extended stochastic logic (ESL), a reconfigurable stochastic computational activation unit (SCAU) is designed to implement different types of activation functions such as tanh and the rectifier function. A triple modular redundancy (TMR) technique is employed to reduce the random fluctuations in stochastic computation. A probability estimator (PE) and a divider based on TMR and a binary search algorithm are further proposed with progressive precision to reduce the required stochastic sequence length. As a result, the latency and energy consumption of the SC-MLP are significantly reduced. Simulation results show that the proposed design is capable of implementing both the training and inference processes. For the classification of nonlinearly separable patterns, at a slight accuracy loss of 1.32-1.34 percent, the proposed design requires only 28.5-30.1 percent of the area and 18.9-23.9 percent of the energy consumption of a design using floating-point arithmetic. Compared to a fixed-point implementation, the SC-MLP consumes a smaller area (40.7-45.5 percent) and lower energy (38.0-51.0 percent) with similar processing speed and a slight accuracy drop of 0.15-0.33 percent. The area and energy consumption of the proposed design are 80.7-87.1 percent and 71.9-93.1 percent, respectively, of those of a binarized neural network (BNN) with similar accuracy. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2018
- Full Text
- View/download PDF
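The core idea of stochastic computation used by such designs is that a value in [0, 1] is represented as a Bernoulli bitstream, so multiplication reduces to a bitwise AND of two independent streams. A toy illustration of that representation only, not of the ESL/SCAU design:

    import random

    def to_stream(p, n):
        """Encode probability p as an n-bit Bernoulli bitstream."""
        return [1 if random.random() < p else 0 for _ in range(n)]

    n = 100_000
    x, y = to_stream(0.6, n), to_stream(0.5, n)
    prod = [a & b for a, b in zip(x, y)]       # AND gate = multiplier
    print(sum(prod) / n)                       # ~0.30 = 0.6 * 0.5, up to random fluctuation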
22. DAG-Fluid: A Real-Time Scheduling Algorithm for DAGs.
- Author
-
Guan, Fei, Qiao, Jiaqing, and Han, Yu
- Subjects
DIRECTED acyclic graphs ,SCHEDULING ,NUMBER systems ,ALGORITHMS - Abstract
Various scheduling algorithms have been proposed for real-time parallel tasks modeled as a Directed Acyclic Graph (DAG). The capacity augmentation bound is a quantitative metric widely used in this field to compare such algorithms. Among existing algorithms, the lowest capacity augmentation bound for DAG tasks with implicit deadlines is 2, achieved by federated scheduling. To improve schedulability and lower the capacity augmentation bound, this paper proposes DAG-Fluid, an algorithm based on fluid scheduling. We prove that DAG-Fluid has a capacity augmentation bound of 2 − 1/(m+1), in which m is the number of processors in the system. Experiments show that DAG-Fluid performs better than state-of-the-art scheduling algorithms. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2021
- Full Text
- View/download PDF
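The practical meaning of a capacity augmentation bound b, following the standard definition in this literature (e.g., Li et al.), is a simple schedulability test: a DAG task set is deemed schedulable on m cores if total utilization is at most m/b and every task's critical-path length is at most 1/b of its deadline. The checker below plugs in DAG-Fluid's bound from the abstract; the task parameters are invented, and the test illustrates the definition, not the paper's own analysis.

    def schedulable_under_bound(tasks, m):
        """tasks: list of (work, span, period) for implicit-deadline DAG tasks."""
        b = 2 - 1 / (m + 1)                  # DAG-Fluid's bound from the abstract
        util = sum(work / period for work, span, period in tasks)
        return (util <= m / b and
                all(span <= period / b for _, span, period in tasks))

    print(schedulable_under_bound([(12, 3, 10), (8, 2, 20)], m=4))  # True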
23. Algorithmic Aspects of Hardware/Software Partitioning: 1D Search Algorithms.
- Author
-
Wu Jigang, Srikanthan, Thambipillai, and Guang Chen
- Subjects
COMPUTER input-output equipment ,COMPUTER software ,ALGORITHMS ,HEURISTIC ,KNAPSACK problems ,MATHEMATICS - Abstract
Hardware/software (HW/SW) partitioning is one of the key challenges in HW/SW codesign. This paper presents efficient algorithms for the HW/SW partitioning problem, which has been proved to be NP-hard. To reduce time complexity, we reduce the HW/SW partitioning problem to a variation of the knapsack problem that is approximately solved by searching a 1D solution space, instead of the 2D solution space searched in the latest work cited in this paper. Three heuristic algorithms are proposed to determine suitable partitions that satisfy the HW/SW partitioning constraints. We show that the time complexity for partitioning a graph with n nodes and m edges is significantly reduced from O(dx · dy · n³) to O(n log n + d · (n + m)), where d and dx · dy are the numbers of fragments of the searched 1D and 2D solution spaces, respectively. A lower bound on the solution quality is also derived from the new computing model to show that it is comparable to that reported in the literature. Moreover, empirical results show that the proposed algorithms produce comparable and often better solutions compared to the latest algorithm while reducing the time complexity significantly. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2010
- Full Text
- View/download PDF
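The knapsack reduction mentioned above rests on the classic 0/1 knapsack dynamic program, sketched here in a simplified form: choose nodes to move to hardware so that the benefit is maximized under an area budget. The names and the benefit/cost model are illustrative, not the paper's exact formulation.

    def knapsack(values, costs, budget):
        """Maximum total value of a subset of items with total cost <= budget."""
        best = [0] * (budget + 1)
        for v, c in zip(values, costs):
            for cap in range(budget, c - 1, -1):   # reverse scan keeps each item 0/1
                best[cap] = max(best[cap], best[cap - c] + v)
        return best[budget]

    print(knapsack(values=[10, 7, 4], costs=[5, 3, 2], budget=6))  # 11 (items valued 7 and 4)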
24. Real-Time Scheduling and Analysis of OpenMP DAG Tasks Supporting Nested Parallelism.
- Author
-
Sun, Jinghao, Guan, Nan, Li, Feng, Gao, Huimin, Shi, Chang, and Yi, Wang
- Subjects
TASK analysis ,PARALLEL programming ,SCHEDULING ,BENEFIT performances ,TASKS ,ALGORITHMS - Abstract
OpenMP is a promising framework for developing parallel real-time software on multi-cores. Although similar to the DAG task model, OpenMP task systems are significantly more difficult to analyze due to constraints posed by the OpenMP specifications. One of the most interesting features in OpenMP is its support for nested parallelism, which benefits the performance transparency of parallel libraries and promotes reuse of black-box code. Previous research on DAG task scheduling is mainly restricted to a single level of parallelism, and whether OpenMP tasks with multiple levels of parallelism are suitable for real-time systems remains an open problem. In this paper, we study the real-time scheduling and analysis of OpenMP task systems supporting nested parallelism. First, we show that under the scheduling algorithms in existing OpenMP implementations, nested parallelism may indeed lead to extremely bad timing behavior in which the parallel workload is executed completely sequentially. To solve this problem, we propose a new scheduling algorithm and develop two sound response time bounds that trade off simplicity against analysis precision. Experiments demonstrate the efficiency of our methods. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2020
- Full Text
- View/download PDF
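Useful background for the response-time bounds mentioned above is the classic work/span (Graham) bound for a single DAG on m cores under any work-conserving scheduler: makespan ≤ span + (work − span)/m. The paper's OpenMP-specific bounds are more involved; this sketch states only the textbook baseline.

    def graham_bound(work, span, m):
        """Upper bound on makespan of one DAG under a work-conserving scheduler."""
        return span + (work - span) / m

    print(graham_bound(work=100, span=10, m=8))  # 21.25 time units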
25. Effects of Instruction-Set Extensions on an Embedded Processor: A Case Study on Elliptic-Curve Cryptography over GF(2m).
- Author
-
Bartolini, Sandro, Branovic, Irina, Giorgi, Roberto, and Martinelli, Enrico
- Subjects
CRYPTOGRAPHY ,ELLIPTIC curves ,EMBEDDED computer systems ,ALGORITHMS ,COMPUTER programming ,COMPUTER engineering - Abstract
Elliptic-curve cryptography (ECC) is promising for enabling information security in constrained embedded devices. In order to be efficient on a target architecture, ECC requires an accurate choice and tuning of the algorithms that perform the underlying mathematical operations. This paper contributes a cycle-level analysis of the dependence of ECC performance on the interaction between the features of the mathematical algorithms and the actual architectural and microarchitectural features of an ARM-based Intel XScale processor. Another contribution is a cycle-level analysis of a modified ARM processor that includes a word-level finite field polynomial multiplier (poly_mul) in its data path. This extension constitutes a good trade-off between applicability in a number of contexts, simplicity of integration within the processor, and performance. The paper points out the most advantageous mix of elliptic curve (EC) parameters both for the standard ARM-based Intel XScale platform and for the one equipped with the poly_mul unit. In particular, the latter case allows for more than 41 percent execution time reduction on the considered benchmarks. Last, this paper investigates the correlation between the possible architectural organizations of a processor equipped with poly_mul unit(s) and EC benchmark performance. For instance, only superscalar pipelines can exploit out-of-order execution, and only very complex organizations (for example, four-way superscalar) can exploit a high number of available ALUs. Conversely, we show that there is no benefit in endowing the processor with more than one poly_mul, and we point out a possible trade-off between performance and complexity increase: a two-way in-order/out-of-order pipeline allows +50 percent and +90 percent Instructions per Cycle (IPC), respectively. Finally, we show that there are no critical constraints on the latency and pipelining capability of the poly_mul unit for the basic EC point multiplication. [ABSTRACT FROM AUTHOR]
- Published
- 2008
- Full Text
- View/download PDF
26. Optimal and Practical Algorithms for Sorting on the PDM.
- Author
-
Rajasekaran, Sanguthevar and Sen, Sandeep
- Subjects
ALGORITHMS ,COMPUTER algorithms ,OPTICAL disks ,COMPUTER input-output equipment ,COMPUTER networks ,BOTTLENECKS (Manufacturing) - Abstract
The Parallel Disks Model (PDM) has been proposed to alleviate the I/O bottleneck that arises in the processing of massive data sets. Sorting has been extensively studied on the PDM due to the fundamental nature of the problem, and several asymptotically optimal algorithms are known for it. Although randomization has been frequently exploited, most prior algorithms suffer from complicated memory layouts, implementation difficulties, restrictions on the range of parameters, and laborious analysis. In this paper, we present a randomized mergesort algorithm, based on a simple idea, that sorts using an asymptotically optimal number of I/O operations with high probability and has all the desirable features for practical implementation. In the second part of the paper, we also present several novel algorithms for sorting on the PDM that take only a small number of passes through the data. Considerable interest has recently been shown in developing algorithms for problem sizes of practical interest, and we obtain several improvements and simplifications, in particular for random inputs. [ABSTRACT FROM AUTHOR]
- Published
- 2008
- Full Text
- View/download PDF
27. RACE: A Robust Adaptive Caching Strategy for Buffer Cache.
- Author
-
Yifeng Zhu and Hong Jiang
- Subjects
CACHE memory ,COMPUTER storage devices ,COMPUTER input-output equipment ,BUFFER storage (Computer science) ,ALGORITHMS ,SCHEME programming language ,COMPUTER science research ,COMPUTER service industry ,COMPUTER systems management ,MANAGEMENT - Abstract
Although many block replacement algorithms for buffer caches have been proposed to address the well-known drawbacks of the LRU algorithm, they are not robust and cannot maintain a consistent performance improvement over all workloads. This paper proposes a novel and simple replacement scheme, called the Robust Adaptive buffer Cache management schemE (RACE), which differentiates the locality of I/O streams by actively detecting access patterns that are inherently exhibited in two correlated spaces: the discrete block space of program contexts from which I/O requests are issued, and the continuous block space within files to which I/O requests are addressed. This scheme combines the global I/O regularities of an application and the local I/O regularities of individual files accessed in that application to accurately estimate locality strength, which is crucial in deciding which blocks are to be replaced upon a cache miss. Through comprehensive simulations on 10 real-application traces, RACE is shown to have higher hit ratios than LRU and all the other state-of-the-art cache management schemes studied in this paper. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2008
- Full Text
- View/download PDF
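For reference, here is the LRU baseline that RACE and the other schemes are measured against, in a compact Python rendition; it is background only and not part of RACE itself.

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity, self.store = capacity, OrderedDict()

        def access(self, block):
            """Returns True on a hit; inserts or refreshes the block either way."""
            hit = block in self.store
            if hit:
                self.store.move_to_end(block)       # most recently used goes last
            else:
                if len(self.store) >= self.capacity:
                    self.store.popitem(last=False)  # evict the least recently used
                self.store[block] = True
            return hit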
28. Algorithms for Triple-Word Arithmetic.
- Author
-
Fabiano, Nicolas, Muller, Jean-Michel, and Picot, Joris
- Subjects
ARITHMETIC ,ALGORITHMS ,COMPUTER arithmetic ,FLOATING-point arithmetic - Abstract
Triple-word arithmetic consists in representing high-precision numbers as the unevaluated sum of three floating-point numbers (with "nonoverlapping" constraints that are made explicit in the paper). We introduce and analyze various algorithms for manipulating triple-word numbers: rounding a triple-word number to a floating-point number, and adding, multiplying, dividing, and computing square roots of triple-word numbers, etc. We compare our algorithms, implemented in the Campary library, with other solutions of comparable accuracy. It turns out that our new algorithms are significantly faster than what one would obtain by just using the usual floating-point expansion algorithms in the special case of expansions of length 3. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2019
- Full Text
- View/download PDF
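The error-free transformation at the heart of multi-word (double- and triple-word) arithmetic is Knuth's TwoSum: it returns s = fl(a + b) together with the exact rounding error, so no information is lost, and algorithms on triple-word numbers chain such building blocks. A minimal sketch:

    def two_sum(a, b):
        """Error-free transformation: a + b == s + err exactly."""
        s = a + b
        bb = s - a
        err = (a - (s - bb)) + (b - bb)
        return s, err

    s, e = two_sum(1.0, 1e-20)
    print(s, e)   # 1.0 1e-20: the tiny addend survives in the error term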
29. Analysis of Errors and Erasures in Parity Sharing RS Codecs.
- Author
-
Cardarilli, G. C., Pontarelli, S., Re, M., and Salsano, A.
- Subjects
INFORMATION theory ,ERROR-correcting codes ,GALOIS correspondences ,RELIABILITY (Personality trait) ,FAULT tolerance (Engineering) ,ALGORITHMS ,ENGINEERING design ,COMPUTER systems ,TECHNOLOGICAL complexity - Abstract
Reed Solomon (RS) codes are widely used to protect information from errors in transmission and storage systems. Most RS codes are based on GF(2^8) Galois fields and use a byte to encode a symbol, providing codewords of up to 255 symbols. Codewords with more than 255 symbols can be obtained by using GF(2^m) Galois fields with m > 8, but this choice increases the complexity of the encoding and decoding algorithms. This limitation can be superseded by introducing Parity Sharing (PS) RS codes, which are characterized by greater flexibility in terms of design parameters. Consequently, a designer can choose between different PS code implementations in order to meet requirements such as Bit Error Rate (BER), hardware complexity, speed, and throughput. This paper analyzes the performance of PS codes in terms of BER with respect to the code parameters, taking into account either random error or erasure rates as two independent probabilities. This approach provides an evaluation that is independent of the communication channel characteristics and extends the results to memory systems in which permanent faults and transient faults can be modeled, respectively, as erasures and random errors. The paper also provides hardware implementations of the PS encoder and decoder and discusses their performance in terms of hardware complexity, speed, and throughput. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
30. Multiconstrained QoS Routing: A Norm Approach.
- Author
-
Guoliang Xue and Makki, S. Kami
- Subjects
QUALITY of service ,QUALITY control ,ROUTING (Computer network management) ,COMPUTER network management ,CHECK-routing symbols ,NETWORK routers ,ALGORITHMS ,COMPUTER systems ,COMPUTER networks - Abstract
A fundamental problem in quality-of-service (QoS) routing is the multiconstrained path (MCP) problem, where one seeks a source-destination path satisfying K ≥ 2 additive QoS constraints in a network with K additive QoS parameters. The MCP problem is known to be NP-complete. One popular approach is to use the shortest path with respect to a single edge weighting function as an approximate solution to MCP. In pioneering work, Jaffe showed that the shortest path with respect to a scaled 1-norm of the K edge weights is a 2-approximation to MCP, in the sense that the sum of the larger of the path weight and its corresponding constraint is within a factor of 2 of the minimum. In a recent paper, Xue et al. showed that the shortest path with respect to a scaled ∞-norm of the K edge weights is a K-approximation to MCP, in the sense that the largest ratio of the path weight over its corresponding constraint is within a factor of K of the minimum. In this paper, we study the relationship between these two optimization criteria and present a class of provably good approximation algorithms for MCP. We first prove that a good approximation according to the second optimization criterion is also a good approximation according to the first, but not vice versa. We then present a class of very simple K-approximation algorithms according to the second optimization criterion, based on the computation of a shortest path with respect to a single edge weighting function. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2007
- Full Text
- View/download PDF
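The single-edge-weight approach described above can be sketched directly: collapse the K edge weights into one aggregate weight and run an ordinary shortest-path search. One natural scaling, assumed here for illustration, divides each weight by its constraint, w(e) = Σ_k w_k(e)/c_k.

    import heapq

    def mcp_1norm(adj, src, dst, constraints):
        """adj[u] -> list of (v, (w_1, ..., w_K)); constraints = (c_1, ..., c_K).
        Dijkstra on the scaled 1-norm aggregate of the K additive weights."""
        agg = lambda w: sum(wk / ck for wk, ck in zip(w, constraints))
        dist, heap = {src: 0.0}, [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            if d > dist.get(u, float("inf")):
                continue                        # stale queue entry
            for v, w in adj[u]:
                nd = d + agg(w)
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return float("inf")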
31. Modified Low-Density MDS Array Codes for Tolerating Double Disk Failures in Disk Arrays.
- Author
-
Fujita, Hachiro and Sakaniwa, Kohichi
- Subjects
RAID (Computer science) ,COMPUTER storage devices ,ELECTRONIC file management ,ALGORITHMS ,COMPUTER programming ,MATHEMATICAL analysis ,NUMERICAL analysis - Abstract
In this paper, we present a new class of low-density MDS array codes for tolerating double disk failures in disk arrays. The proposed MDS array code has lower encoding and decoding complexity than the EVENODD code of Blaum et al. [ABSTRACT FROM AUTHOR]
- Published
- 2007
- Full Text
- View/download PDF
32. A 2-Level TCAM Architecture for Ranges.
- Author
-
Yeim-Kuan Chang
- Subjects
COMPUTER architecture ,INTERNET ,NETWORK routers ,DATA packeting ,COMPUTER software ,COMPUTER input-output equipment ,ALGORITHMS ,COMPUTER network resources ,COMPUTER science - Abstract
As the demand for high-quality Internet service increases, emerging network applications are spurring the need for faster, feature-rich, and cost-effective routers. Multifield packet classification in routers has been a computation-intensive data path function in software implementations. Therefore, solutions for packet classification based on hardware, such as Ternary Content Addressable Memory (TCAM), are necessary to sustain gigabit line processing rates. Traditionally, TCAMs have been designed for storing prefixes. However, multifield packet classification usually involves two fields of arbitrary ranges, the TCP/IP layer-4 source and destination ports. Storing ranges in TCAMs relies on decomposing each individual range into multiple prefixes, which leads to range-to-prefix blowout. To reduce the total number of prefixes needed to represent all ranges, this paper proposes a 2-level TCAM architecture and two range-to-prefix conversion schemes. In the first proposed scheme, designed for disjoint ranges, the maximum number of TCAM entries needed is 2m − 1 for m disjoint ranges. In the second proposed scheme, designed for contiguous ranges, only m TCAM entries are needed. In the general case of n arbitrary ranges, all ranges can first be converted into disjoint or contiguous ranges, and then the proposed algorithms can be applied. As a result, only 4n − 3 TCAM entries are needed for disjoint ranges and only 2n + 1 TCAM entries for contiguous ranges. This paper also proposes insertion and deletion algorithms to accommodate incremental changes to the range sets. Experiments show that the proposed range-to-prefix conversion schemes perform better than existing schemes in terms of the number of required TCAM entries and the execution time of range update operations. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2006
- Full Text
- View/download PDF
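The standard range-to-prefix decomposition that causes the blowout discussed above splits [lo, hi] into maximal aligned power-of-two blocks, each of which becomes one TCAM prefix entry. The paper's 2-level schemes avoid this expansion; the sketch below shows only the classic decomposition.

    def range_to_prefixes(lo, hi, width):
        """Decompose the integer range [lo, hi] on width-bit values into prefixes."""
        prefixes = []
        while lo <= hi:
            size = lo & -lo or 1 << width      # largest block the alignment of lo allows
            while lo + size - 1 > hi:          # shrink until the block fits in the range
                size >>= 1
            bits = size.bit_length() - 1
            prefixes.append((lo >> bits, width - bits))  # (prefix value, prefix length)
            lo += size
        return prefixes

    # Classic worst case on 4-bit values: [1, 14] needs 2*4 - 2 = 6 prefixes.
    print(range_to_prefixes(1, 14, 4))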
33. Using Indexing Functions to Reduce Conflict Aliasing in Branch Prediction Tables.
- Author
-
Yi Ma, Hongliang Gao, and Huiyang Zhou
- Subjects
COMPUTER storage devices ,BRANCH & bound algorithms ,COMPUTER input-output equipment ,INFORMATION storage & retrieval systems ,INDEXING ,INFORMATION organization ,INFORMATION retrieval ,ALGORITHMS ,MATHEMATICAL models - Abstract
High-accuracy branch prediction is crucial for high-performance processors. Inspired by work on indexing functions that eliminate conflict misses in the memory hierarchy, this paper explores different indexing approaches to reduce conflict aliasing in branch-prediction tables. Our results show that indexing functions provide a highly complexity-effective way to enhance prediction accuracy. [ABSTRACT FROM AUTHOR] (See the sketch below.)
- Published
- 2006
- Full Text
- View/download PDF
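A classic example of such an indexing function is the gshare scheme, which XOR-folds the global branch history into the PC-derived index so that branches sharing low PC bits but arriving with different histories land in different table entries. This is illustrative background, not necessarily one of the specific functions evaluated in the paper.

    def gshare_index(pc, global_history, table_bits):
        """Index into a 2^table_bits-entry branch-prediction table."""
        mask = (1 << table_bits) - 1
        return ((pc >> 2) ^ global_history) & mask  # >> 2 drops byte-offset bits

    print(gshare_index(pc=0x40081c, global_history=0b10110110, table_bits=12))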
34. Objective-Optimal Algorithms for Long-Term Web Prefetching.
- Author
-
Bin Wu and Kshemkalyani, Ajay D.
- Subjects
INFORMATION retrieval ,ALGORITHMS ,INTERNET servers ,CACHE memory ,INFORMATION resources management ,WORLD Wide Web ,INFORMATION science ,COMPUTER storage devices ,INTERNET - Abstract
Web prefetching builds on Web caching and attempts to reduce user-perceived latency. Unlike on-demand caching, Web prefetching fetches objects and stores them in advance, hoping that the prefetched objects are likely to be accessed in the near future, so that such accesses are satisfied from the caches rather than by retrieving the objects from the Web server. This paper reviews the popular prefetching algorithms based on Popularity, Good Fetch, APL characteristic, and Lifetime, and then makes the following contributions: 1) The paper proposes a family of linear-time prefetching algorithms, Objective-Greedy prefetching, wherein each algorithm greedily prefetches those Web objects that most significantly improve performance as measured by the targeted metric. 2) The Hit rate-Greedy and Bandwidth-Greedy algorithms are shown to be optimal for their respective objective metrics, and a linear-time optimal prefetching algorithm that maximizes the H/B metric is proposed. 3) The paper reports a performance analysis via simulations, comparing the proposed algorithms with existing algorithms in terms of the respective objectives: the hit rate, bandwidth, and H/B metrics. The proposed prefetching algorithms are seen to provide better objective-based performance than the existing algorithms; further, H/B-Greedy performs almost as well as H/B-Optimal. (See the sketch below.)
- Published
- 2006
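An objective-greedy prefetcher in the spirit described above ranks candidate objects by a benefit score and prefetches greedily under a budget. The score used here (access probability × freshness lifetime ÷ size) is an invented stand-in; the paper's algorithms use a different, metric-specific scoring per objective.

    def greedy_prefetch(objects, budget):
        """objects: list of (name, p_access, lifetime, size); budget: total bytes."""
        ranked = sorted(objects, key=lambda o: o[1] * o[2] / o[3], reverse=True)
        chosen, used = [], 0
        for name, p, life, size in ranked:
            if used + size <= budget:          # greedy: take best-scoring objects that fit
                chosen.append(name)
                used += size
        return chosen

    print(greedy_prefetch([("a.html", 0.9, 600, 8000),
                           ("b.jpg", 0.4, 3600, 50000),
                           ("c.css", 0.7, 1200, 4000)], budget=20000))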
35. The Granularity Metric for Fine-Grain Real-Time Scheduling.
- Author
-
Palis, Michael A.
- Subjects
ALGORITHMS ,PRODUCTION scheduling ,ONLINE data processing ,COMPUTER algorithms ,TECHNICAL specifications ,OPERATIONS research - Abstract
This paper investigates the task scheduling problem for real-time systems that provide rate-of-progress guarantees on task execution. A parameterized task system model, called the (r, g) task system, is introduced that allows rate-of-progress requirements to be specified in terms of two simple parameters: an execution rate r and a granularity g. The granularity parameter is a new metric that allows the specification of "fine-grain" timing constraints on the task's execution and is a generalization of the stretch metric used in recent research on task scheduling. It is shown that the product r·lg(1/g) is an important determiner of the existence of good online scheduling algorithms. Specifically, there is an upper bound on this product above which there are no good online algorithms, but below which an online algorithm with logarithmic competitive ratio exists. This paper also demonstrates a fundamental difference between two contrasting strategies for admission control: greedy versus non-greedy. It is shown that "greed does not pay": there is a scheduling algorithm with a non-greedy admission policy that provably outperforms the well-known Greedy EDF scheduling algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2005
36. Bounded-Latency Content Distribution: Feasibility and Evaluation.
- Author
-
Chengdu Huang and Abdelzaher, Tarek
- Subjects
ALGORITHMS ,COMPUTER programming ,ANT algorithms ,COMPUTER networks ,DATA transmission systems ,DIGITAL communications - Abstract
This paper investigates the performance of a content distribution network designed to provide bounded content access latency. Content can be divided into multiple classes with different configurable per-class delay bounds. The network uses a simple distributed algorithm to dynamically select subsets of its proxy servers for different classes such that a global per-class delay bound on content access is achieved. The content distribution algorithm is implemented and tested on PlanetLab [25], a worldwide distributed Internet testbed. Evaluation results demonstrate that, despite Internet delay variability, subsecond delay bounds (of 200-500 ms) can be guaranteed with very high probability at only a moderate content replication cost. The distribution algorithm achieves a four- to five-fold reduction in the number of response-time violations compared to prior content distribution approaches that attempt to minimize average latency. To the authors' knowledge, this paper presents the first wide-area performance evaluation of an algorithm designed to bound maximum content access latency, as opposed to optimizing an average performance metric. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
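The sketch below captures the flavor of per-class replica selection for bounded-latency content distribution: grow the proxy set until every client meets the class delay bound. The centralized greedy loop and the delay matrix are simplifying assumptions; the paper's algorithm is distributed.

```python
# Sketch of per-class proxy selection under a delay bound: grow the replica
# set greedily until every client's delay to its nearest selected proxy is
# within the class bound.
def select_proxies(delay, bound):
    """delay[c][p]: measured client->proxy delay (ms); bound: per-class limit."""
    clients, proxies = range(len(delay)), range(len(delay[0]))
    selected = set()

    def worst():  # worst client delay to its nearest selected proxy
        return max(min((delay[c][p] for p in selected), default=float('inf'))
                   for c in clients)

    while worst() > bound:
        # add the proxy that most reduces the worst-case delay
        best = min((p for p in proxies if p not in selected),
                   key=lambda p: max(min([delay[c][q] for q in selected | {p}])
                                     for c in clients))
        selected.add(best)
    return selected

print(select_proxies([[120, 400], [450, 90]], bound=200))  # -> {0, 1}
```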
37. An On-Chip IP Address Lookup Algorithm.
- Author
-
Xuehong Sun and Zhao, Yiqiang Q.
- Subjects
DATA transmission systems ,ALGORITHMS ,DATA compression ,DATABASE management ,DIGITAL communications ,BROADBAND communication systems - Abstract
This paper proposes a new data compression algorithm that stores the routing table in a tree structure using very little memory. This data structure is tailored to a hardware design reference model presented in the paper. By exploiting the low memory access latency and high bandwidth of on-chip memory, high-speed packet forwarding can be achieved using this data structure. With the addition of pipelining in the hardware, IP address lookup is limited only by the memory access speed. The algorithm is also flexible enough for different implementations. Experimental analysis shows that, given a memory width of 144 bits, our algorithm needs only 400kb of memory and five memory accesses per search to store a 20k-entry IPv4 routing table. For a 1M-entry IPv4 routing table, 9Mb of memory and seven memory accesses are needed. With a memory width of 1,068 bits, we estimate that 100Mb of memory and six memory accesses are needed for a routing table with 1M IPv6 prefixes. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
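For intuition on why lookups cost a fixed number of memory accesses, here is a plain fixed-stride trie: each level consumes a fixed slice of the address, so the access count is bounded by the number of levels. This illustrates only the bounded-access idea; the paper's compressed on-chip structure is much denser.

```python
# Illustrative fixed-stride trie lookup: each level consumes STRIDE bits of
# the IPv4 address, so a lookup costs at most 32/STRIDE memory accesses.
STRIDE = 8

def lookup(root, addr):
    """root: nested dicts {chunk: (next_hop_or_None, child_or_None)}; addr: 32-bit int."""
    node, best = root, None
    for level in range(32 // STRIDE):
        shift = 32 - STRIDE * (level + 1)
        chunk = (addr >> shift) & ((1 << STRIDE) - 1)
        if chunk not in node:
            break
        hop, child = node[chunk]
        if hop is not None:
            best = hop          # remember longest prefix match so far
        if child is None:
            break
        node = child
    return best

# 10.0.0.0/8 -> hop "A"; 10.1.0.0/16 -> hop "B"
table = {10: ("A", {1: ("B", None)})}
print(lookup(table, (10 << 24) | (1 << 16) | 5))  # -> "B"
```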
38. Power and Performance Analysis of Motion Estimation Based on Hardware and Software Realizations.
- Author
-
Yang, Shengqi, Wolf, Wayne, and Vijaykrishnan, N.
- Subjects
VIDEO compression ,INTEGRATED circuits ,ELECTRONIC circuit design ,ELECTRONIC circuits ,ALGORITHMS ,IMAGE compression - Abstract
Motion estimation is the most computationally expensive task in MPEG-style video compression. Video compression is starting to be widely used in battery-powered terminals, but surprisingly little is known about the power consumption of modern motion estimation algorithms. This paper describes our effort to analyze the power and performance of realistic motion estimation algorithms in both hardware and software realizations. For custom hardware realizations, this paper presents a general model of VLSI motion estimation architectures. This model allows us to analyze in detail the power consumption of a large class of modern motion estimation engines that can execute the motion estimation algorithms of interest to us. We compare these algorithms in terms of their power consumption and performance. For software realizations, this paper provides the first detailed instruction-level simulation results on motion estimation based on a programmable CPU core. We analyzed various aspects of the selected motion estimation algorithms, such as search speed and power distribution. This paper provides a guideline for two types of machine designs for motion estimation: custom ASIC (Application-Specific Integrated Circuit) designs and custom ASIP (Application-Specific Instruction-set Processor) designs. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
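The arithmetic workload behind the power figures in entry 38 is block matching; a minimal full-search kernel with the sum of absolute differences (SAD) is sketched below. It is a generic reference kernel, not one of the paper's evaluated algorithms.

```python
import numpy as np

# Full-search block matching with SAD: the kernel whose arithmetic dominates
# motion-estimation power and performance figures.
def full_search(cur, ref, bx, by, B=8, R=4):
    """Find the motion vector for the BxB block of `cur` at (by, bx) within radius R."""
    block = cur[by:by+B, bx:bx+B].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-R, R + 1):
        for dx in range(-R, R + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + B > ref.shape[0] or x + B > ref.shape[1]:
                continue
            sad = np.abs(block - ref[y:y+B, x:x+B].astype(np.int32)).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv, best

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (32, 32), dtype=np.uint8)
cur = np.roll(ref, (1, 2), axis=(0, 1))      # scene shifted by (1, 2)
print(full_search(cur, ref, bx=8, by=8))     # -> ((-1, -2), 0)
```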
39. A Systematic Approach to the Design of Distributed Wearable Systems.
- Author
-
Anliker, Urs, Beutel, Jan, Dyer, Matthias, Enzler, Rolf, Lukowicz, Paul, Thiele, Lothar, and Tröster, Gerhard
- Subjects
COMPUTER architecture ,EMBEDDED computer systems ,COMPUTER input-output equipment ,ALGORITHMS ,PORTABLE computers ,MOBILE computing ,COMPUTER science - Abstract
Wearable computing has recently gained much popularity as an ambitious vision for future personalized mobile systems. Its aim is intelligent, environment-aware systems unobtrusively embedded into the mobile environments of their users. With the combination of complex processing requirements, the necessity of placing sensors and input/output modules at different locations on the user's body, and stringent limits on size, weight, and battery capacity, the design of such systems is an inherently challenging problem. This paper demonstrates how systematic design and quantitative analysis can be applied to wearable architectures. We first present a model that allows various factors influencing the design of a wearable system to be incorporated into formal cost metrics. In particular, we show how to consistently incorporate specific wearable factors such as device placement requirements, ergonomics, and dynamic workload profiles into the model. We then discuss how efficient estimation algorithms can be extended and applied to the evaluation of different architectures with respect to our cost metrics. Finally, we discuss quantitative results from a proof-of-concept case study showing the trade-offs between different architectures for a given wearable scenario. In summary, this paper demonstrates how the description and the design of wearable systems can be put on a systematic, formal basis, allowing them to be treated much like conventional embedded systems. [ABSTRACT FROM AUTHOR]
- Published
- 2004
- Full Text
- View/download PDF
40. An Improvement of the Elliptic Net Algorithm.
- Author
-
Chen, Binglong and Zhao, Chang-An
- Subjects
ELLIPTIC curve cryptography ,ALGORITHMS ,ITERATIVE methods (Mathematics) ,HAMMING weight ,MATHEMATICAL models - Abstract
In this paper we propose a modified Elliptic Net algorithm to compute pairings. By reducing the number of intermediate variables that must be updated in the iteration loop of the Elliptic Net algorithm, we speed up the computation of pairings. Experimental results show that the proposed method is more efficient than the original Elliptic Net algorithm on Barreto-Naehrig (BN) curves, which are the best choice for implementing pairings at the 128-bit security level. [ABSTRACT FROM PUBLISHER]
- Published
- 2016
- Full Text
- View/download PDF
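For context, the doubling step of the Elliptic Net algorithm is driven by division-polynomial-style recurrences of the form $W_{2n+1} = W_{n+2}W_n^3 - W_{n-1}W_{n+1}^3$ and $W_{2n} = W_n\,(W_{n+2}W_{n-1}^2 - W_{n-2}W_{n+1}^2)/W_2$: each iteration updates a block of consecutive $W$ values from the current block. The recurrences stated here are the standard ones; the paper's contribution is shrinking the set of intermediate values that must be carried through this update.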
41. Approximate Radix-8 Booth Multipliers for Low-Power and High-Performance Operation.
- Author
-
Jiang, Honglan, Han, Jie, Qiao, Fei, and Lombardi, Fabrizio
- Subjects
ANALOG multipliers ,APPROXIMATION theory ,HIGH performance computing ,ALGORITHMS ,COMPUTATIONAL complexity ,CRITICAL path analysis - Abstract
The Booth multiplier has been widely used for high-performance signed multiplication by encoding and thereby reducing the number of partial products. A multiplier using the radix-4 (or modified Booth) algorithm is very efficient due to the ease of partial product generation, whereas the radix-8 Booth multiplier is slow due to the complexity of generating the odd multiples of the multiplicand. In this paper, this issue is alleviated by the application of approximate designs. An approximate 2-bit adder is deliberately designed for calculating the sum of 1× and 2× of a binary number. This adder requires small area, low power, and a short critical path delay. Subsequently, the 2-bit adder is employed to implement the less significant section of a recoding adder for generating the triple multiplicand with no carry propagation. In the pursuit of a trade-off between accuracy and power consumption, two signed 16 × 16 bit approximate radix-8 Booth multipliers are designed using the approximate recoding adder with and without the truncation of a number of less significant bits in the partial products. The proposed approximate multipliers are faster and more power-efficient than the accurate Booth multiplier. The multiplier with 15-bit truncation achieves the best overall performance in terms of hardware and accuracy when compared to other approximate Booth multiplier designs. Finally, the approximate multipliers are applied to the design of a low-pass FIR filter and they show better performance than other approximate Booth multipliers. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
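The core trick above is producing the odd multiple 3X as X + 2X with an approximate, carry-free low section, so no carry propagates out of the less significant bits. The toy model below uses a plain XOR for the low bits; the paper's approximate 2-bit adder cell is more accurate than this, and the widths are arbitrary.

```python
# Toy model of the approximate recoding adder idea: form 3*X as X + 2*X,
# computing the low k bits carry-free so no carry leaves the low section.
def approx_triple(x, width=16, k=4):
    a, b = x & ((1 << width) - 1), (x << 1) & ((1 << width) - 1)
    lo_mask = (1 << k) - 1
    lo = (a ^ b) & lo_mask            # approximate low bits: sum without carry
    hi = ((a >> k) + (b >> k)) << k   # exact addition in the upper section
    return (hi | lo) & ((1 << width) - 1)

print(approx_triple(53), 3 * 53)  # 159 159: no carry crosses the section boundary
print(approx_triple(7), 3 * 7)    # 9 21: a carry into the upper section is dropped
```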
42. Shell: A Spatial Decomposition Data Structure for Ray Traversal on GPU.
- Author
-
Xiao, Kai, Hu, Xiaobo Sharon, Zhou, Bo, and Chen, Danny Ziyi
- Subjects
DATA structures ,GRAPHICS processing units ,DECOMPOSITION method ,COMPUTER storage devices ,ALGORITHMS - Abstract
Shared memory many-core processors such as GPUs have been extensively used in accelerating computation-intensive algorithms and applications. When porting existing algorithms from sequential or other parallel architecture models to shared memory many-core architectures, non-trivial modifications are often needed to match the execution patterns of the target algorithms with the characteristics of many-core architectures. Ray traversal is a fundamental process in many applications, and is commonly accelerated by spatial decomposition schemes captured in hierarchical data structures (e.g., kd-trees). However, ray traversal using hierarchical data structures needs to conduct repeated hierarchical searches. Such a search process is time-consuming on shared memory many-core architectures since it incurs considerable amounts of expensive memory accesses and execution divergence. In this paper, we propose a novel spatial decomposition based data structure, called Shell, which completely avoids hierarchical search for ray traversal. In Shell, a structure is built on the boundary of each region in the decomposed space, which allows any ray traversing a region to find the next neighboring region to traverse using table lookup schemes, without any hierarchical search. While our ray traversal approach works for other spatial decomposition paradigms and many-core processors in higher dimensional scenes, we illustrate it using kd-trees on GPU for the 3D scenario and compare with the fastest known kd-tree searching algorithms for ray traversal. Experimental results in graphics ray tracing and radiation dose calculation show that our approach improves the performance by 3.5-5.5$\times$.
- Published
- 2016
- Full Text
- View/download PDF
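A minimal rendering of the Shell traversal idea: every region knows, per boundary face, which region lies on the other side, so a ray advances by constant-time table lookups with no tree descent. The flat neighbor dictionaries below stand in for the paper's per-boundary structures built on a kd-tree decomposition.

```python
# Ray traversal by boundary table lookup instead of hierarchical search:
# each region maps "exit face" -> neighboring region id.
def traverse(regions, start, exit_face_of):
    """regions[r]: dict face -> neighbor id (absent at the scene boundary)."""
    r, visited = start, []
    while r is not None:
        visited.append(r)
        face = exit_face_of(r)          # which face the ray leaves through
        r = regions[r].get(face)        # O(1) lookup, no tree descent
    return visited

# three regions in a row: 0 -[right]- 1 -[right]- 2
regions = [{"right": 1}, {"right": 2, "left": 0}, {"left": 1}]
print(traverse(regions, 0, lambda r: "right"))  # -> [0, 1, 2]
```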
43. Bit-Stuffing Algorithms for Crosstalk Avoidance in High-Speed Switching.
- Author
-
Chang, Cheng-Shang, Cheng, Jay, Huang, Tien-Ke, Huang, Xuan-Chao, Lee, Duan-Shin, and Chen, Chao-Yi
- Subjects
ALGORITHMS ,SWITCHING circuits ,CROSSTALK ,COMPUTER simulation ,MARKOV processes ,ENCODING - Abstract
The crosstalk effect is one of the main problems in deep sub-micron designs of high-speed buses. To mitigate the crosstalk effect, several types of crosstalk avoidance codes have been proposed in the literature. In this paper, we are particularly interested in generating forbidden transition codes that do not have opposite transitions on any two adjacent wires. For this, we propose a sequential bit-stuffing algorithm and a parallel bit-stuffing algorithm. For the sequential bit-stuffing algorithm, we perform a worst-case analysis and a probabilistic analysis. We show by both theoretic analysis and simulations that the coding rate of the sequential bit-stuffing encoding scheme is quite close to the Shannon capacity. In particular, for a bus with $n=10$ parallel wires, the difference is only 2.2 percent. Using a Markov chain analysis, we show that the coding rate of the parallel bit-stuffing algorithm is only slightly lower than that of the sequential bit-stuffing algorithm. The implementation complexity of the parallel bit-stuffing algorithm is linear with $n$. In comparison with the existing forbidden transition codes that use the Fibonacci representation in the literature, our bit-stuffing algorithms not only achieve higher coding rates but also have much lower implementation complexity. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
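A simplified sequential bit-stuffing encoder in the spirit of entry 43: codeword bits are placed wire by wire, and whenever the next data bit would create opposite transitions on adjacent wires, a stuffed bit holds the wire steady and the data bit retries on the next wire. The exact stuffing rule and the decoding side are simplified assumptions, not the paper's scheme.

```python
# Sequential bit stuffing for forbidden transition codes (sketch): no two
# adjacent wires may transition in opposite directions between codewords.
def encode_word(prev, data, n):
    """prev: last codeword (list of n bits); data: iterator of payload bits."""
    word, pending = [], None
    for j in range(n):
        bit = pending if pending is not None else next(data, 0)
        pending = None
        if j > 0:
            d_prev = word[j-1] - prev[j-1]      # -1, 0, or +1
            d_here = bit - prev[j]
            if d_prev * d_here == -1:           # opposite transitions: forbidden
                word.append(prev[j])            # stuff: hold wire j steady
                pending = bit                   # data bit retries on wire j+1
                continue
        word.append(bit)
    return word

bits = iter([1, 0, 1])
print(encode_word(prev=[0, 1, 0, 0], data=bits, n=4))  # -> [1, 1, 0, 1]
```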
44. Using Flexibility in P-Circuits by Boolean Relations.
- Author
-
Bernasconi, Anna, Ciriani, Valentina, Trucco, Gabriella, and Villa, Tiziano
- Subjects
BOOLEAN functions ,PROBLEM solving ,ALGORITHMS ,COMPUTER architecture ,LOGIC circuits - Abstract
In this paper we study the problem of characterizing and exploiting the complete flexibility of a special logic architecture, called P-circuits, which realizes a Boolean function by projecting it onto overlapping subsets given by a generalized Shannon decomposition. P-circuits are used to restructure logic by pushing some signals towards the outputs. The algorithms proposed so far for exploiting the structural flexibility of P-circuits do not guarantee to find the best implementation, because they cast the problem as the minimization of an incompletely specified function. Instead, here we show that to explore all solutions we must set up the problem as the minimization of a Boolean relation, because there are don't-care conditions that cannot be expressed by single cubes. Finally, we report the results obtained using a minimizer of Boolean relations, which improve substantially on the previous literature. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
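To make the decomposition concrete: the classical Shannon expansion is $f = \overline{x}\,f|_{x=0} + x\,f|_{x=1}$, and a P-circuit generalizes it by letting the two branches be projections onto overlapping subsets. The set of all legal branch pairs then forms a Boolean relation rather than a single incompletely specified function, because the admissible choices for one branch are correlated with the choices made in the other, and such correlated don't-cares cannot be written cube by cube. (The notation here is generic, not the paper's.)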
45. Opportunistic Offloading of Deadline-Constrained Bulk Cellular Traffic in Vehicular DTNs.
- Author
-
Yao, Hong, Zeng, Deze, Huang, Huawei, Guo, Song, Barnawi, Ahmed, and Stojmenovic, Ivan
- Subjects
WIRELESS communications ,DELAY-tolerant networks ,PROBLEM solving ,ALGORITHMS ,LINEAR programming - Abstract
The ever-growing cellular traffic demand has laid a heavy burden on cellular networks. The recent rapid development in vehicle-to-vehicle communication techniques makes the vehicular delay-tolerant network (VDTN) an attractive candidate for traffic offloading from cellular networks. In this paper, we study a bulk traffic offloading problem with the goal of minimizing the cellular communication cost under the constraint that all subscribers receive their desired content in full before it expires. The problem is to determine the initial offloading points and the dissemination scheme for the offloaded traffic in a VDTN. By describing the content delivery process with a novel contact-based flow model, we formulate the problem as a linear program (LP), based on which an online offloading scheme is proposed to deal with network dynamics (e.g., vehicle arrival/departure). Furthermore, an offline LP-based analysis is derived to obtain the optimal solution. The high efficiency of our online algorithm is extensively validated by simulation results. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
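A stripped-down version of such a flow formulation: with $x_v$ the bytes pushed to vehicle $v$ over cellular, $f_{uv}$ the bytes relayed from $u$ to $v$ over VDTN contacts before the deadline $T$, content size $S$, per-contact capacity $B$, and $N_{uv}(T)$ the number of $u$-$v$ contacts by time $T$, minimize $\sum_v x_v$ subject to $x_v + \sum_u f_{uv} \ge S$ for every subscriber $v$ and $f_{uv} \le B \cdot N_{uv}(T)$, with $x, f \ge 0$. This two-hop snapshot omits the time-expanded chaining of relays in the paper's contact-based model; it is meant only to show the shape of the LP.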
46. Algorithms for Generating Probabilities with Multivalued Stochastic Relay Circuits.
- Author
-
Lee, David T. and Bruck, Jehoshua
- Subjects
SWITCHING circuits ,ALGORITHMS ,PROBABILITY theory ,STOCHASTIC processes ,RANDOM numbers ,PROBLEM solving - Abstract
The problem of random number generation dates back to Von Neumann's work in 1951. Since then, many algorithms have been developed for generating unbiased bits from complex correlated sources as well as for generating arbitrary distributions from unbiased bits. An equally interesting, but less studied, aspect is the structural component of random number generation. That is, given a set of primitive sources of randomness, and given composition rules induced by a device or nature, how can we build networks that generate arbitrary probability distributions? In this paper, we study the generation of arbitrary probability distributions in multivalued relay circuits, a generalization in which relays can take on any of $N$ states.
- Published
- 2015
- Full Text
- View/download PDF
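Entry 46 generalizes the binary case, where the composition rules are simple: two independently closed switches in series are closed with probability $pq$, and in parallel with probability $p + q - pq$. The snippet below realizes $3/8$ from fair switches this way; the $N$-valued generalization is what the paper studies.

```python
from fractions import Fraction

# Composition rules for binary stochastic switching circuits.
def series(p, q):   return p * q            # both switches must be closed
def parallel(p, q): return p + q - p * q    # at least one switch closed

half = Fraction(1, 2)
# parallel pair of fair switches (3/4) in series with one more fair switch
p = series(parallel(half, half), half)
print(p)  # 3/8
```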
47. Optimal Binning for Genomics.
- Author
-
Gulino, Andrea, Kaitoua, Abdulrahman, and Ceri, Stefano
- Subjects
BIG data ,NUCLEOTIDE sequence ,INDIVIDUALIZED medicine ,GENOMICS ,ALGORITHMS - Abstract
Genome sequencing is expected to be the most prolific source of big data in the next decade; millions of whole genome datasets will open new opportunities for biological research and personalized medicine. Genome sequences are abstracted in the form of interesting regions, describing abnormalities of the genome. The parallel execution on the cloud of complex operations for joining and mapping billions of genomic regions is increasingly important. Genome binning, i.e., partitioning of the genome into small-size segments, adapts classic data partitioning methods to genomics; region distributions to bins must reflect operation-specific correctness rules. As a consequence, determining the optimal bin size for such operations is a complex mathematical problem, whose solution requires careful modeling. The main result of this paper is the mathematical formulation and solution of the optimal binning problem for join and map operations in the context of GMQL, a query language over genomic regions; the model is validated by experiments showing its accuracy and sensitivity to the variation of operations’ parameters. We also optimize sequences of operations by inheriting the binning between two consecutive operations and we show the deployment of GMQL and the tuning of the proposed model on different cloud computing systems. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
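A minimal sketch of the binning correctness rule the model above depends on: regions are replicated into every bin they overlap, and a joined pair is emitted only in the bin holding the leftmost point of the intersection, so replication never yields duplicates. The rule and layout here are a common textbook variant, assumed rather than taken from GMQL's implementation.

```python
from collections import defaultdict

# Binned genomic join (sketch): replicate each region into every bin it
# overlaps; emit an overlapping pair only in the bin containing the leftmost
# point of the intersection, so duplicates never appear.
def binned_join(A, B, bin_size):
    bins = defaultdict(lambda: ([], []))
    for side, regions in ((0, A), (1, B)):
        for (start, stop) in regions:
            for b in range(start // bin_size, (stop - 1) // bin_size + 1):
                bins[b][side].append((start, stop))
    out = []
    for b, (av, bv) in bins.items():
        for a in av:
            for r in bv:
                lo, hi = max(a[0], r[0]), min(a[1], r[1])
                if lo < hi and lo // bin_size == b:   # dedup rule: emit once
                    out.append((a, r))
    return out

print(binned_join([(5, 25)], [(18, 40)], bin_size=10))  # one pair, not two
```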
48. On Analysis of Lightweight Stream Ciphers with Keyed Update.
- Author
-
Kara, Orhun and Esgin, Muhammed F.
- Subjects
CRYPTOGRAPHY ,CRYPTOGRAPHERS ,STREAM ciphers ,INTERNET of things ,CYBERTERRORISM ,ALGORITHMS - Abstract
As the need for lightweight cryptography has grown even more due to the evolution of the Internet of Things, it has become a greater challenge for cryptographers to design ultra-lightweight stream ciphers in compliance with the rule of thumb that the internal state size should be at least twice the key size, to defend against generic Time-Memory-Data Tradeoff (TMDT) attacks. Recently, however, in 2015, Armknecht and Mikhalev shed new light on designing keystream generators (KSGs), which in turn yield stream ciphers, with small internal states, called KSGs with Keyed Update Function (KSG with KUF), and gave a concrete construction named Sprout. Yet security analysis of KSGs with KUF in a general setting is currently almost non-existent. Our contribution in this paper is two-fold. 1) We give a general mathematical setting for KSGs with KUF and, for the first time, analyze a class of such KSGs, called KSGs with Boolean Keyed Feedback Function (KSG with Boolean KFF), generically. In particular, we develop two generic attack algorithms applicable to any KSG with Boolean KFF having almost arbitrary output and feedback functions, where the only requirement is that the secret key incorporation is biased. We introduce an upper bound for the time complexity of the first algorithm. Our extensive experiments validate our algorithms and the assumptions made thereof. 2) We study Sprout to show the effectiveness of our algorithms in a practical instance. A straightforward application of our generic algorithm yields one of the most successful attacks on Sprout. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
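A toy keystream generator with a keyed update function (KSG with KUF), to make the design pattern concrete: the secret key enters every state update, which is what allows the internal state to stay smaller than twice the key length. All widths, tap positions, and functions below are invented for illustration; none of this is Sprout.

```python
# Toy KSG with Keyed Update Function: one key bit joins the feedback on
# every clock, so the state transition itself depends on the secret key.
def ksg_kuf(state, key, nbits, width=16):
    mask = (1 << width) - 1
    out = []
    for t in range(nbits):
        z = (state ^ (state >> 3)) & 1                  # output bit (toy filter)
        k = (key >> (t % 64)) & 1                       # keyed feedback bit
        fb = ((state >> (width - 1)) ^ (state >> 5) ^ z ^ k) & 1
        state = ((state << 1) | fb) & mask              # shift in keyed feedback
        out.append(z)
    return out

print(ksg_kuf(state=0xACE1, key=0x0123456789ABCDEF, nbits=16))
```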
49. Radix-2 Division Algorithms with an Over-Redundant Digit Set.
- Author
-
Ebergen, Jo and Jamadagni, Navaneeth
- Subjects
ALGORITHMS ,REDUNDANT number systems ,SET theory ,BINARY number system ,APPROXIMATION theory - Abstract
This paper presents a derivation of four radix-2 division algorithms by digit recurrence. Each division algorithm selects a quotient digit from the over-redundant digit set {−2, −1, 0, 1, 2}, and the selection of each quotient digit depends only on the two most-significant digits of the partial remainder in a redundant representation. Two algorithms use a two’s complement representation for the partial remainder and carry-save additions, and the other two algorithms use a binary signed-digit representation for the partial remainder and carry-free additions. Three algorithms are novel. The fourth algorithm has been presented before. Results from the synthesized netlists show that two of our fastest algorithms achieve an improvement of 10 percent in latency per iteration over a standard radix-2 SRT algorithm at the cost of 36 percent more power and 50 percent more area. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
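A digit-recurrence sketch with the over-redundant set $\{-2,-1,0,1,2\}$: each iteration doubles the partial remainder, subtracts $q_j d$, and accumulates $q_j 2^{-j}$. The selection below inspects the exact remainder for clarity, whereas the paper's algorithms select from only the two most significant digits of a redundant representation.

```python
from fractions import Fraction

# Radix-2 digit recurrence with the over-redundant digit set {-2,-1,0,1,2}:
# w <- 2w - q_j*d, quotient accumulated as sum of q_j * 2^-j.
def divide(x, d, iters=20):
    """x, d in [1/2, 1) with x < d; returns an approximation of x/d."""
    w, q = Fraction(x), Fraction(0)
    for j in range(1, iters + 1):
        w = 2 * w
        qj = max(-2, min(2, round(w / d)))   # over-redundant digit choice
        w -= qj * d
        q += Fraction(qj, 2 ** j)
    return q

print(float(divide(Fraction(3, 8), Fraction(3, 4))), 3 / 8 / (3 / 4))  # ~0.5 0.5
```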
50. Power Budgeting Techniques for Data Centers.
- Author
-
Zhan, Xin and Reda, Sherief
- Subjects
CLOUD computing ,DATA libraries ,DATA science ,DYNAMIC programming ,ALGORITHMS ,SERVER farms (Computer network management) - Abstract
The development of cloud computing and data science has resulted in rapid increases in the number and scale of data centers. Because of cost and sustainability concerns, energy efficiency has been a major goal for data center architects. Power budgeting, which focuses on reducing cooling power and making full use of the available computing power, is an increasingly important requirement for data center operations. In this paper, we present a framework for power budgeting in data centers, considering both computing power and cooling power, to maximize the system normalized performance (SNP) of the entire center under a total power budget. Maximizing the SNP for a given power budget is equivalent to maximizing energy efficiency. We propose a method to partition the total power budget among the cooling and computing infrastructure in a self-consistent way, where the cooling power is sufficient to extract the heat of the computing power. In tandem, we devise an optimal computing power budgeting technique based on a dynamic programming algorithm to determine the optimal power caps for the individual servers, such that the available power can be efficiently translated into performance improvements. The optimal computing budgeting technique leverages a proposed online throughput predictor, based on performance counter measurements, to estimate the change in throughput of heterogeneous workloads as a function of the allocated server power caps. We demonstrate that our proposed power budgeting method outperforms previous methods by 3-4 percent in terms of SNP in our data center simulation environment. While maintaining the improvement in SNP, our method improves fairness by up to 57 percent. We also evaluate the performance of our method in a power-saving scenario and a dynamic power budgeting case. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
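The computing-side allocation above is a knapsack-style dynamic program: discretize the per-server power caps and choose one cap per server to maximize total predicted throughput under the budget. A compact sketch, with made-up throughput values standing in for the paper's online predictor:

```python
# DP over servers for computing-power budgeting: pick one power cap per
# server to maximize total predicted throughput under a shared budget.
def budget_dp(caps, perf, total):
    """caps: allowed cap values (W); perf[i][j]: server i's throughput at caps[j]."""
    n, NEG = len(perf), float('-inf')
    best = [[NEG] * (total + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    choice = {}
    for i in range(n):
        for used in range(total + 1):
            if best[i][used] == NEG:
                continue
            for j, c in enumerate(caps):
                if used + c <= total and best[i][used] + perf[i][j] > best[i + 1][used + c]:
                    best[i + 1][used + c] = best[i][used] + perf[i][j]
                    choice[(i + 1, used + c)] = (j, used)
    used = max(range(total + 1), key=lambda u: best[n][u])
    value, alloc = best[n][used], []
    for i in range(n, 0, -1):             # walk back to recover the caps
        j, used = choice[(i, used)]
        alloc.append(caps[j])
    return alloc[::-1], value

print(budget_dp([50, 100], [[1.0, 1.4], [1.0, 2.0]], total=150))  # ([50, 100], 3.0)
```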