2,310 results for "FLOATING-point arithmetic"
Search Results
2. Computer Arithmetic
- Author
-
Merchant, Farhad and Chattopadhyay, Anupam, editors
- Published
- 2025
- Full Text
- View/download PDF
3. Extension of accurate numerical algorithms for matrix multiplication based on error-free transformation: Error-free transformation of matrix multiplication: K. Ozaki et al.
- Author
-
Ozaki, Katsuhisa, Mukunoki, Daichi, and Ogita, Takeshi
- Abstract
The error-free transformation of matrix multiplication is a useful technique for accurate numerical computations in linear algebra problems. It can be used to transform the product of two floating-point matrices into an unevaluated sum of floating-point matrices, making it useful for developing accurate numerical algorithms for matrix multiplication. This technique splits both the left and right matrices into k floating-point matrices, and then k(k + 1)/2 matrix multiplications are performed. We extend this technique and propose several accurate algorithms for matrix multiplication, which involve p matrix multiplications with p = 4, 5, 8, 9, respectively. The proposed algorithms efficiently provide results that are more accurate than those obtained with double-precision arithmetic, though less accurate than those obtained with quadruple-precision arithmetic. In addition, we propose alternative forms to reduce the number of matrix multiplications subject to rounding errors. Numerical results show that the number of matrix multiplications affects the accuracy of the computed results. This dependence is examined using rounding error analysis and confirmed through numerical experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
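The splitting described in this abstract can be illustrated at the scalar level: Dekker's product transformation, the classic building block behind such error-free schemes, rewrites a product as an unevaluated sum x + y with a·b = x + y exactly. A minimal Python sketch (scalar only, not the matrix algorithm of Ozaki et al.):

```python
from fractions import Fraction

SPLIT = 134217729.0  # 2**27 + 1, Dekker's splitter for IEEE 754 doubles

def split(a):
    """Split a double into high and low parts with non-overlapping bits."""
    c = SPLIT * a
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free transformation: return (x, y) with a*b == x + y exactly."""
    x = a * b
    ah, al = split(a)
    bh, bl = split(b)
    y = al * bl - (((x - ah * bh) - al * bh) - ah * bl)
    return x, y

x, y = two_prod(1.1, 2.2)
# exactness check with rational arithmetic
assert Fraction(1.1) * Fraction(2.2) == Fraction(x) + Fraction(y)
```

Applied entrywise, with splittings chosen so that the partial matrix products incur no rounding, the same idea yields the error-free matrix multiplication the abstract builds on.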
4. Modified fast discrete‐time PID formulas for obtaining double precision accuracy.
- Author
-
Kim, Eungnam and Choi, Jin‐Young
- Subjects
- *
AUTOMATIC control systems , *PID controllers , *INTEGERS , *VALUATION of real property , *ARITHMETIC - Abstract
Proportional integral derivative (PID) controllers are widely used across various industries. This paper presents a new modified PID controller based on integer raw input data, which is equivalent to the classic PID controller based on floating-point actual values. The new formulas provide a mathematical approach to the 'Classic PID Formula', 'Subtractor Formula' and 'Scaling Formula', which form the basis of the classic PID controller. The approach integrates these three formulas and separates them into integer and real-valued parts by applying the properties of associativity and commutativity. The method takes raw integer data as input, performs integer-based computation, and executes a floating-point operation only once. This results in faster computation and energy savings, while showing accuracy comparable to the existing double-precision formulas. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
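The integer-first structure described in the abstract can be sketched as follows; the gain representation, scale factor, and names below are illustrative assumptions, not the paper's formulas.

```python
def pid_int_step(raw_err, state, kp_num, ki_num, kd_num):
    """One PID step on integer raw data: all accumulation and gain
    multiplications stay in integer arithmetic."""
    integ, prev_err = state
    integ += raw_err               # integer integral accumulation
    deriv = raw_err - prev_err     # integer derivative difference
    num = kp_num * raw_err + ki_num * integ + kd_num * deriv
    return num, (integ, raw_err)

# gains expressed as integer numerators over an assumed common denominator of 1000
num, state = pid_int_step(120, (0, 0), kp_num=3, ki_num=1, kd_num=2)
u = num / 1000.0                   # the single floating-point operation
```

Keeping the loop body in integers and deferring the one division to the output is what yields the speed and energy gains the abstract reports.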
5. Computational Insights into the Unstable Fixed Point of the Fractional Difference Logistic Map.
- Author
-
Uzdila, Ernestas, Telksniene, Inga, Telksnys, Tadas, and Ragulskis, Minvydas
- Subjects
- *
FLOATING-point arithmetic , *ORBITS (Astronomy) - Abstract
The divergence from the unstable fixed point of the fractional difference logistic map is investigated in this paper. In contrast to the classical logistic map, the memory horizon of the fractional difference logistic map reaches back to the initial condition. Although higher-order orbits do not exist in the fractional difference logistic map, a trajectory started at the unstable fixed point may continuously remain at the fixed point as the number of iterations tends to infinity. Such an effect is well known for the classical logistic map, but less so for the fractional difference logistic map. It appears that this effect depends on the accuracy of the floating-point arithmetic. It is demonstrated that the divergence from the unstable fixed point of the fractional difference logistic map is entirely a computational artifact. Using double precision, approximately 32% of the values of a from the interval 2.7 < a ≤ 3.7 diverge from the unstable fixed point. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
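The floating-point effect described above is easy to reproduce on the classical logistic map (the fractional-difference map additionally carries the full memory term, omitted in this sketch): the unstable fixed point x* = 1 − 1/a is not exactly representable, and its representation error is amplified by roughly |2 − a| per iteration.

```python
a = 3.7
x_star = 1.0 - 1.0 / a            # unstable fixed point of x -> a*x*(1 - x)

x = x_star                        # start "at" the fixed point (rounded)
diverged = False
for i in range(500):
    x = a * x * (1.0 - x)
    if abs(x - x_star) > 0.1:     # rounding error has been amplified
        diverged = True
        break
```

In exact arithmetic the trajectory would stay at x* forever; in double precision it typically leaves the fixed point within about a hundred iterations, which is the kind of computational artifact the paper quantifies.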
6. Accurate bidiagonal decompositions of Cauchy–Vandermonde matrices of any rank.
- Author
-
Delgado, Jorge, Koev, Plamen, Marco, Ana, Martínez, José‐Javier, Peña, Juan Manuel, Persson, Per‐Olof, and Spasov, Steven
- Subjects
- *
MATRIX decomposition , *MATRIX multiplications , *FLOATING-point arithmetic , *NONNEGATIVE matrices , *EIGENVALUES - Abstract
We present a new decomposition of a Cauchy–Vandermonde matrix as a product of bidiagonal matrices which, unlike its existing bidiagonal decompositions, is now valid for a matrix of any rank. The new decompositions are insusceptible to the phenomenon known as subtractive cancellation in floating point arithmetic and are thus computable to high relative accuracy. In turn, other accurate matrix computations are also possible with these matrices, such as eigenvalue computation amongst others. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
7. An efficient iterative pseudo point elimination technique to represent the shape of the digital image boundary.
- Author
-
Ramaiah, Mangayarkarasi, Ravi, Vinayakumar, Chandrasekaran, Vanmathi, Mohanraj, Vanitha, Mani, Deepa, and Maruthamuthu, Angulakshmi
- Subjects
FLOATING-point arithmetic ,ARITHMETIC ,SUM of squares ,POLYGONS ,INTEGERS - Abstract
Visually, the environment is made up of a chaotic collection of irregular polygons. Representing and comprehending irregular polygons is an important and intriguing issue in many fields of study. However, approximating a polygon presents significant difficulties from a variety of perspectives. The method presented in this research eliminates the pseudo-redundant points that do not contribute to shape retention and then makes the polygonal approximation with the remaining high-curvature points, as opposed to searching for the real points on the digital image boundary curve. The proposed method uses chain code assignment to obtain initial segmentation points. Using integer arithmetic, it calculates the curvature at each initial pseudo point from the sum of squares of deviations. For every initial segmented pseudo point, the deviation incurred by all the boundary points lying between its preceding and succeeding initial pseudo points is taken into account. The method then iteratively removes, from the subset of initial segmentation points, the redundant point whose curvature deviation is lowest, and recalculates the deviation information for the neighbouring pseudo points. Experiments on MPEG datasets and synthetic contours show, both quantitatively and qualitatively, how well the proposed method works. The experimental results show the effectiveness of the proposed method in creating polygons with few points. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
8. The influence of parasitic modes on stable lattice Boltzmann schemes and weakly unstable multi-step Finite Difference schemes.
- Author
-
Bellotti, Thomas
- Subjects
- *
FINITE differences , *LATTICE Boltzmann methods , *FLOATING-point arithmetic , *TRANSPORT equation , *NUMERICAL analysis , *CONSERVATION laws (Mathematics) - Abstract
Numerical analysis for linear constant-coefficient multi-step Finite Difference schemes is a longstanding topic, developed approximately fifty years ago. It relies on the stability of the scheme, and thus, within the L2 setting, on the absence of multiple roots of the amplification polynomial on the unit circle. This allows for the decoupling, while discussing the convergence of the method, of the study of the consistency of the scheme from the precise knowledge of its parasitic/spurious modes, so that the methods can be essentially studied as if they had only one step. Furthermore, stability alleviates the need to delve into the complexities of floating-point arithmetic on computers, which can be challenging topics to address. In this paper, we demonstrate that in the case of "weakly" unstable Finite Difference schemes with multiple roots on the unit circle, although the schemes may remain stable, considering parasitic modes is essential in studying their consistency and, consequently, their convergence. This research was prompted by unexpected numerical results on stable lattice Boltzmann schemes, which can be rewritten in terms of multi-step Finite Difference schemes. Unlike Finite Difference schemes, rigorous numerical analysis for lattice Boltzmann schemes is a contemporary topic with much left for future discoveries. Initial expectations suggested that third-order initialization schemes would suffice to maintain the accuracy of fourth-order schemes. However, this assumption proved incorrect for weakly unstable Finite Difference schemes and for stable lattice Boltzmann methods. This borderline scenario underscores both the particular care that must be taken with lattice Boltzmann schemes and the significance of genuine stability in facilitating the construction of Lax-Richtmyer-like theorems and in mastering the impact of round-off errors for Finite Difference schemes.
Despite the simplicity and apparent lack of practical usage of the linear transport equation at constant velocity considered throughout the paper, we demonstrate that high-order lattice Boltzmann schemes for this equation can be used to tackle nonlinear systems of conservation laws relying on a Jin-Xin approximation and high-order splitting formulæ. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. Stable numerics for finite‐strain elasticity.
- Author
-
Shakeri, Rezgar, Ghaffari, Leila, Thompson, Jeremy L., and Brown, Jed
- Subjects
FLOATING-point arithmetic ,AUTOMATIC differentiation ,NUMERICAL functions ,STRAIN energy ,NUMERICAL calculations - Abstract
A backward stable numerical calculation of a function with condition number κ will have a relative accuracy of κ·ε_machine. Standard formulations and software implementations of finite-strain elastic material models make use of the deformation gradient F = I + ∂u/∂X and Cauchy-Green tensors. These formulations are not numerically stable, leading to loss of several digits of accuracy when used in the small-strain regime, and often precluding the use of single-precision floating-point arithmetic. We trace the source of this instability to specific points of numerical cancellation, interpretable as ill-conditioned steps. We show how to compute various strain measures in a stable way and how to transform common constitutive models to their stable representations, formulated in either the initial or the current configuration. The stable formulations all provide accuracy of order ε_machine. In many cases, the stable formulations have elegant representations in terms of appropriate strain measures and offer geometric intuition that is lacking in their standard representation. We show that algorithmic differentiation can stably compute stresses so long as the strain energy is expressed stably, and give principles for stable computation that can be applied to inelastic materials. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
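The cancellation the authors trace is visible already in one dimension: with deformation gradient F = 1 + h, the Green-Lagrange strain E = (F² − 1)/2 loses digits for small h, while the algebraically identical form E = h + h²/2, built directly from the displacement gradient, is accurate to machine precision. A minimal sketch:

```python
h = 1e-8                          # small displacement gradient
F = 1.0 + h                       # deformation gradient (1D)

E_naive = (F * F - 1.0) / 2.0     # cancels: F*F is barely above 1
E_stable = h + 0.5 * h * h        # same quantity, no subtraction of near-equals

exact = 1.000000005e-8            # h + h^2/2 to full precision
rel_naive = abs(E_naive - exact) / exact
rel_stable = abs(E_stable - exact) / exact
```

The naive form loses roughly half the digits here (relative error around 1e-8 instead of 1e-16), which is exactly the kind of ill-conditioned step the paper eliminates by reformulating strain measures.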
10. CUDASW++4.0: ultra-fast GPU-based Smith–Waterman protein sequence database search.
- Author
-
Schmidt, Bertil, Kallenborn, Felix, Chacon, Alejandro, and Hundt, Christian
- Subjects
- *
FLOATING-point arithmetic , *AMINO acid sequence , *TIME complexity , *PROCESS capability , *DATABASES - Abstract
Background: The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive. Unfortunately, current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance. This motivates the need for more efficient implementations. Results: CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. Our approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions. We provide efficient matrix tiling and sequence database partitioning schemes, and exploit next-generation floating-point arithmetic and novel DPX instructions. This leads to close-to-peak performance on modern GPU generations (Ampere, Ada, Hopper) with throughput rates of up to 1.94 TCUPS, 5.01 TCUPS, and 5.71 TCUPS on an A100, L40S, and H100, respectively. Evaluation on the Swiss-Prot, UniRef50, and TrEMBL databases shows that CUDASW++4.0 achieves over an order-of-magnitude performance improvement over previous GPU-based approaches (CUDASW++3.0, ADEPT, SW#DB). In addition, our algorithm demonstrates significant speedups over top-performing CPU-based tools (BLASTP, SWIPE, SWIMM2.0), can exploit multi-GPU nodes with linear scaling, and features an impressive energy efficiency of up to 15.7 GCUPS/Watt. Conclusion: CUDASW++4.0 changes the standing of GPUs in protein sequence database search with Smith-Waterman alignment by providing close-to-peak performance on modern GPUs. It is freely available at https://github.com/asbschmidt/CUDASW4. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
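For reference, the recurrence that CUDASW++ accelerates is compact; a minimal, unoptimized Python version with a linear gap penalty (the tool itself supports affine gaps and heavy GPU-specific optimization):

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-2):
    """Score of the best local alignment (linear gap penalty)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]   # DP matrix, clamped at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The quadratic cost mentioned in the abstract is the double loop over both sequence lengths; every optimization in CUDASW++ targets exactly this cell-update kernel.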
11. A method for improving the robustness of mixed-precision optimization based on floating-point error analysis (一种基于浮点误差分析的混合精度鲁棒性提升方法).
- Author
-
于恒彪, 易 昕, 李胜国, 李 发, 姜 浩, and 黄 春
- Abstract
Floating-point arithmetic is a typical numerical solution model for high-performance computing. Mixed-precision optimization enhances performance and reduces energy consumption by decreasing the precision of floating-point variables in programs. However, existing automatic mixed-precision optimization techniques are limited by low robustness, meaning that the optimized programs fail to meet the result accuracy constraints for given inputs. To address this issue, a method for improving the robustness of mixed-precision optimization based on floating-point error analysis is proposed. Firstly, inputs that can trigger imprecise calculations in the program are identified through floating-point error analysis. Then, based on these error-triggering inputs, the precision configurations are evaluated to guide the search for highly robust mixed-precision configurations. Experimental results show that for typical floating-point applications, this method can improve the robustness of mixed-precision optimization by an average of 62%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
12. SPIMulator: A Spintronic Processing-in-memory Simulator for Racetracks.
- Author
-
Bera, Pavia, Cahoon, Stephen, Bhanja, Sanjukta, and Jones, Alex
- Subjects
ADVANCED Encryption Standard ,FLOATING-point arithmetic ,MACHINE learning ,POINT processes ,MATRIX multiplications ,NANOWIRE devices - Abstract
In-memory processing is becoming a popular method to alleviate the memory bottleneck of the Von Neumann computing model. With the goal of improving both latency and energy cost associated with such in-memory processing, emerging non-volatile memory technologies, such as Spintronic magnetic memory, are of particular interest, as they can provide a near-SRAM read/write performance and eliminate nearly all static energy without experiencing any endurance limitations. Spintronic Racetrack Memory (RM) further addresses density concerns of spin-transfer torque memory (STT-MRAM). Moreover, it has recently been demonstrated that portions of RM nanowires can function as a polymorphic gate, which can be leveraged to implement multi-operand bulk bitwise operations. With more complex control, they can also be leveraged to build arithmetic integer and floating point processing in memory (PIM) primitives. This article proposes SPIMulator, a Spintronic PIM simulator that can simulate the storage and PIM architecture of executing PIM commands in Racetrack memory. SPIMulator functionally models the polymorphic gate properties recently proposed for Racetrack memory, which allows transverse access that determines the number of "1"s in a segment of each Racetrack nanowire. From this simulation, SPIMulator can report real-time performance statistics such as cycle count and energy. Thus, SPIMulator simulates the multi-operand bit-wise logic operations recently proposed and can be easily extended to implement new PIM operations as they are developed. Due to the functional nature of SPIMulator, it can serve as a programming environment that allows development of PIM-based codes for verification of new acceleration algorithms. 
We demonstrate the value of SPIMulator through the modeling and estimations of performance and energy consumption of a variety of example applications, including the Advanced Encryption Standard (AES) for encryption primarily based on logical and look-up operations; multiplication of matrices, a frequent requirement in scientific, signal processing, and machine learning algorithms; and bitmap indices, a common search table employed for database lookups. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
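The transverse-access primitive described above, which returns the number of 1s in a nanowire segment, is sufficient to derive multi-operand bulk bitwise logic, the kind of operation a functional simulator like SPIMulator models. A sketch of that derivation, with hypothetical helper names:

```python
def transverse_count(bits):
    """Model of a transverse access: number of 1s stored in a segment."""
    return sum(bits)

def multi_operand(bits):
    """Derive k-operand AND/OR/MAJ from a single transverse count."""
    k, ones = len(bits), transverse_count(bits)
    return {
        "AND": ones == k,        # all operands are 1
        "OR":  ones > 0,         # at least one operand is 1
        "MAJ": ones > k // 2,    # strict majority of operands are 1
    }

ops = multi_operand([1, 0, 1, 1])
```

One physical read thus evaluates a gate over many operands at once, which is where the latency and energy advantages of the PIM primitives come from.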
13. Enhancing radar applications: FPGA-driven phase estimation with floating point arithmetic.
- Author
-
Sivaprasad, Ponduri, Venkataraman, Anandi, and Murty, P. Satyanarayana
- Subjects
FIELD programmable gate arrays ,FLOATING-point arithmetic ,DIGITAL signal processing ,PARALLEL processing ,FIX-point estimation - Abstract
This article introduces a paradigm shift in radar technology with field programmable gate array (FPGA)-driven phase estimation using floating-point arithmetic (FPA). Leveraging the FPGA's parallel processing and the precision of FPA, this work promises enhanced accuracy and efficiency. The proposed system's key performance metrics are as follows: number of slices: 20,941; number of look-up tables (LUTs): 22,371; number of digital signal processing (DSP) blocks: 2; delay: 112.9 ns; and power consumption: 7.2 mW. A comparative analysis showcases advantages in area utilization, LUTs, and DSP blocks despite a trade-off in delay. The presented methodology and results demonstrate the feasibility of real-time phase estimation at GHz rates, positioning this approach as transformative for next-generation radar systems. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. ROUNDING-ERROR ANALYSIS OF MULTIGRID V-CYCLES.
- Author
-
MCCORMICK, STEPHEN F. and TAMSTORF, RASMUS
- Subjects
- *
FLOATING-point arithmetic - Abstract
Earlier work on rounding-error analysis of multigrid was restricted to cycles that used one relaxation step before coarsening and none afterwards. The present paper extends this analysis to two-grid methods that use one relaxation step both before and after coarsening. The analysis is based on floating point arithmetic and focuses on a two-grid scheme that is perturbed on the coarse grid to allow for an approximate coarse-grid solve. Leveraging previously published results, this two-grid theory can then be extended to general V(μ, ν)-cycles, as well as full multigrid. It can also be extended to mixed-precision iterative refinement based on these cycles. An added benefit of the theory here over previous work is that it is obtained in a more organized, transparent, and simpler way. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
15. MULTIGRID METHODS USING BLOCK FLOATING POINT ARITHMETIC.
- Author
-
KOHL, NILS, McCORMICK, STEPHEN F., and TAMSTORF, RASMUS
- Subjects
- *
FLOATING-point arithmetic , *ARITHMETIC , *INTEGERS , *EXPONENTS - Abstract
Block floating point (BFP) arithmetic is currently seeing a resurgence in interest because it requires less power and less chip area and is less complicated to implement in hardware than standard floating point arithmetic. This paper explores the application of BFP to mixed- and progressive-precision multigrid methods, enabling the solution of linear elliptic partial differential equations (PDEs) in energy- and hardware-efficient integer arithmetic. While most existing applications of BFP arithmetic tend to use small block sizes, the block size here is chosen to be maximal such that matrices and vectors share a single exponent for all entries. This is sometimes also referred to as a scaled fixed point format. We provide algorithms for BLAS-like routines for BFP arithmetic that ensure exact vector-vector and matrix-vector computations up to a specified precision. Using these algorithms, we study the asymptotic precision requirements for achieving discretization-error-accuracy. We demonstrate that some computations can be performed using only 4-bit integers, while the number of bits required to attain a certain target accuracy is similar to that of standard floating point arithmetic. Finally, we present a heuristic for full multigrid in BFP arithmetic based on saturation and truncation that still achieves discretization-error-accuracy without the need for expensive normalization steps of intermediate results. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
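The maximal-block format the paper adopts, one shared exponent for a whole vector or matrix (i.e. a scaled fixed-point format), can be sketched as follows; this toy version has none of the exactness guarantees of the paper's BLAS-like routines.

```python
import math

def to_bfp(xs, mant_bits=8):
    """Quantize a vector to block floating point: one shared exponent,
    signed integer mantissas of mant_bits."""
    m = max(abs(x) for x in xs)
    # shared exponent chosen so the largest entry fills the mantissa range
    e = math.frexp(m)[1] - (mant_bits - 1) if m else 0
    ints = [round(x / 2.0**e) for x in xs]
    return ints, e

def from_bfp(ints, e):
    return [i * 2.0**e for i in ints]

ints, e = to_bfp([0.5, -0.25, 0.1239])
```

All subsequent arithmetic can then run on the integer mantissas alone, with the exponent handled once per block, which is what makes the format cheap in hardware.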
16. BOUNDS ON NONLINEAR ERRORS FOR VARIANCE COMPUTATION WITH STOCHASTIC ROUNDING.
- Author
-
EL ARAR, E.-M., SOHIER, D., DE OLIVEIRA CASTRO, P., and PETIT, E.
- Subjects
- *
FLOATING-point arithmetic , *MARTINGALES (Mathematics) , *ALGORITHMS , *TEXTBOOKS , *DEFAULT (Finance) - Abstract
The main objective of this work is to investigate nonlinear errors and pairwise summation using stochastic rounding (SR) in variance computation algorithms. We estimate the forward error of computations under SR through two methods: the first is based on a bound of the variance and the Bienaymé–Chebyshev inequality, while the second is based on martingales and the Azuma–Hoeffding inequality. The study shows that for pairwise summation, using SR results in a probabilistic bound on the forward error proportional to √(log n)·u, rather than the deterministic O(log(n)·u) bound obtained with the default rounding mode. We examine two algorithms that compute the variance, one called "textbook" and the other "two-pass", which both exhibit nonlinear errors. Using the two methods mentioned above, we show that the forward errors of these algorithms have probabilistic bounds under SR in O(√n·u) instead of the deterministic n·u bounds. We show that this advantage holds when using pairwise summation for both textbook and two-pass, with probabilistic bounds on the forward error proportional to √(log n)·u. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
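Stochastic rounding itself is simple to state: round down or up with probability proportional to the distance to the other grid neighbour, so the result is unbiased in expectation. A toy sketch on a uniform grid of spacing u (the paper analyses SR in floating point, where the grid spacing varies with magnitude):

```python
import random

def sr(x, u=2.0**-8):
    """Stochastically round non-negative x to the grid {k*u}: E[sr(x)] == x."""
    k, frac = divmod(x / u, 1.0)
    return (k + (random.random() < frac)) * u

random.seed(0)
samples = [sr(0.3) for _ in range(10000)]
mean = sum(samples) / len(samples)
```

Because the per-step errors are zero-mean and independent, they accumulate like a random walk, which is the mechanism behind the √n (rather than n) error bounds in the abstract.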
17. Algorithm 1048: A C++ Class for Robust Linear Barycentric Rational Interpolation.
- Author
-
Fuda, Chiara and Hormann, Kai
- Subjects
- *
FLOATING-point arithmetic , *INTERPOLATION , *C++ , *ALGORITHMS - Abstract
Barycentric rational interpolation is a recent interpolation method with several favourable properties. In this article, we present the BRI class, which features a new C++ class template that contains all variables and functions related to linear barycentric rational interpolation. While several methods exist to evaluate a barycentric rational interpolant, the class is designed to autonomously select the best method to use on a case-by-case basis, as it takes into account the latest results regarding the efficiency and numerical stability of barycentric rational interpolation [15]. Moreover, we describe a new technique that makes the code robust and less prone to overflow and underflow errors. In addition to the standard C++ data types, the BRI template variables can also be defined with arbitrary precision because the BRI class is compatible with the Multiple Precision Floating-Point Reliable (MPFR) library [14]. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
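The object the BRI class evaluates is the second (true) barycentric form; a minimal sketch with Berrut's weights w_i = (−1)^i, the simplest member of the Floater-Hormann family (the class itself selects among evaluation methods adaptively and guards against overflow and underflow):

```python
def berrut_eval(x, nodes, vals):
    """Second barycentric form with Berrut weights w_i = (-1)^i."""
    num = den = 0.0
    for i, (xi, fi) in enumerate(zip(nodes, vals)):
        if x == xi:
            return fi            # exact node hit: avoid division by zero
        w = (-1.0) ** i / (x - xi)
        num += w * fi
        den += w
    return num / den

nodes = [0.0, 0.5, 1.0, 1.5, 2.0]
vals = [n * n for n in nodes]     # sample f(x) = x^2 at the nodes
```

Note the ratio structure: both sums contain the same large factors 1/(x − xi), which is what gives the barycentric form its favourable stability properties relative to other evaluation schemes.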
18. Introduction to the Special Issue on Specification and Design Languages (FDL 2021).
- Author
-
Deantoni, Julien, Girault, Alain, and Grosse, Daniel
- Subjects
ARTIFICIAL neural networks ,PROGRAMMING languages ,SEMANTICS ,FLOATING-point arithmetic ,SOFTWARE product line engineering ,CYBER physical systems - Published
- 2024
- Full Text
- View/download PDF
19. Code Generation for Neural Networks Based on Fixed-point Arithmetic.
- Author
-
Benmaghnia, Hanane, Martel, Matthieu, and Seladji, Yassamine
- Subjects
COMPUTER arithmetic ,FLOATING-point arithmetic ,LINEAR programming ,DRIVERLESS cars ,NEURAL codes - Abstract
Over the past few years, neural networks have started penetrating safety-critical systems to make decisions, for example, in robots, rockets, and autonomous driving cars. Neural networks based on floating-point arithmetic are very time- and memory-consuming, which makes them ill-suited to embedded systems, known to have limited resources. They are also very sensitive to the precision in which they have been trained, so changing this precision generally degrades the quality of their answers. To deal with that, we introduce a new technique to generate fixed-point code for a trained neural network. This technique is based on fixed-point arithmetic with mixed precision. This arithmetic relies on integer operations only, which are compatible with small memory devices. The obtained neural network has the same behavior as the initial one (based on floating-point arithmetic) up to an error threshold defined by the user. The experimental results show the efficiency of our tool SyFix in terms of memory saved and the accuracy of the computations. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
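The core of such a conversion, scaling trained floating-point weights to integers and rescaling once at the output boundary, can be sketched as follows (uniform precision with an assumed Q-format; the SyFix tool described above assigns mixed precisions per value):

```python
FRAC_BITS = 12                       # assumed fixed-point format: 12 fractional bits
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Float -> integer fixed-point representation."""
    return round(x * SCALE)

def fixed_dot(wq, xq):
    """Integer-only dot product; the raw sum carries 2*FRAC_BITS fractional
    bits, so shift back down to FRAC_BITS (floor shift, a small rounding bias)."""
    acc = sum(w * x for w, x in zip(wq, xq))
    return acc >> FRAC_BITS

weights = [0.5, -1.25, 0.75]
inputs = [1.0, 0.5, -2.0]
wq = [to_fixed(w) for w in weights]
xq = [to_fixed(x) for x in inputs]
y_fixed = fixed_dot(wq, xq) / SCALE  # one float division at the boundary
y_float = sum(w * x for w, x in zip(weights, inputs))
```

Everything inside the layer is integer multiply-accumulate, which is the property that makes such generated code fit small devices without an FPU.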
20. An exa-scale high-performance molecular dynamics simulation program: MODYLAS.
- Author
-
Andoh, Yoshimichi, Ichikawa, Shin-ichi, Sakashita, Tatsuya, Fujimoto, Kazushi, Yoshii, Noriyuki, Nagai, Tetsuro, Tang, Zhiye, and Okazaki, Susumu
- Subjects
- *
SIMULATION software , *FAST multipole method , *FLOATING-point arithmetic , *MOLECULAR dynamics , *COMPUTER performance - Abstract
A new version of the highly parallelized general-purpose molecular dynamics (MD) simulation program MODYLAS with high performance on the Fugaku computer was developed. A benchmark test using Fugaku indicated highly efficient communication, single instruction, multiple data (SIMD) processing, and on-cache arithmetic operations. The system's performance deteriorated only slightly, even under high parallelization. In particular, a newly developed minimum transferred data method, requiring a significantly lower amount of data transfer compared to conventional communications, showed significantly high performance. The coordinates and forces of 101 810 176 atoms and the multipole coefficients of the subcells could be distributed to the 32 768 nodes (1 572 864 cores) in 2.3 ms during one MD step calculation. The SIMD effective instruction rates for floating-point arithmetic operations in direct force and fast multipole method (FMM) calculations measured on Fugaku were 78.7% and 31.5%, respectively. The development of a data reuse algorithm enhanced the on-cache processing; the cache miss rate for direct force and FMM calculations was only 2.74% and 1.43%, respectively, on the L1 cache and 0.08% and 0.60%, respectively, on the L2 cache. The modified MODYLAS could complete one MD single time-step calculation within 8.5 ms for the aforementioned large system. Additionally, the program contains numerous functions for material research that enable free energy calculations, along with the generation of various ensembles and molecular constraints. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
21. A Virtual Machine Platform Providing Machine Learning as a Programmable and Distributed Service for IoT and Edge On-Device Computing: Architecture, Transformation, and Evaluation of Integer Discretization.
- Author
-
Bosse, Stefan
- Subjects
- *
INSTRUCTION set architecture , *FLOATING-point arithmetic , *VIRTUAL machine systems , *SENSOR networks , *DISTRIBUTED sensors - Abstract
Data-driven models used for predictive classification and regression tasks are commonly computed using floating-point arithmetic and powerful computers. We address constraints in distributed sensor networks like the IoT, edge, and material-integrated computing, providing only low-resource embedded computers with sensor data that are acquired and processed locally. Sensor networks are characterized by strong heterogeneous systems. This work introduces and evaluates a virtual machine architecture that provides ML as a service layer (MLaaS) on the node level and addresses very low-resource distributed embedded computers (with less than 20 kB of RAM). The VM provides a unified ML instruction set architecture that can be programmed to implement decision trees, ANN, and CNN model architectures using scaled integer arithmetic only. Models are trained primarily offline using floating-point arithmetic, finally converted by an iterative scaling and transformation process, demonstrated in this work by two tests based on simulated and synthetic data. This paper is an extended version of the FedCSIS 2023 conference paper providing new algorithms and ML applications, including ANN/CNN-based regression and classification tasks studying the effects of discretization on classification and regression accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. Optimizing Data Flow in Binary Neural Networks.
- Author
-
Vorabbi, Lorenzo, Maltoni, Davide, and Santi, Stefano
- Subjects
- *
FLOATING-point arithmetic , *DEEP learning , *NEON - Abstract
Binary neural networks (BNNs) can substantially accelerate a neural network's inference time by substituting its costly floating-point arithmetic with bit-wise operations. Nevertheless, state-of-the-art approaches reduce the efficiency of the data flow in the BNN layers by introducing intermediate conversions from 1 to 16/32 bits. We propose a novel training scheme, denoted as BNN-Clip, that can increase the parallelism and data flow of the BNN pipeline; specifically, we introduce a clipping block that reduces the data width from 32 bits to 8. Furthermore, we decrease the internal accumulator size of a binary layer, usually kept at 32 bits to prevent data overflow, with no accuracy loss. Moreover, we propose an optimization of the batch normalization layer that reduces latency and simplifies deployment. Finally, we present an optimized implementation of the binary direct convolution for ARM NEON instruction sets. Our experiments show a consistent inference latency speed-up (up to 1.3× and 2.4× compared to two state-of-the-art BNN frameworks) while reaching an accuracy comparable with state-of-the-art approaches on datasets like CIFAR-10, SVHN, and ImageNet. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
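The bit-wise substitution at the heart of BNNs: with weights and activations in {−1, +1} packed as bits (1 encodes +1, 0 encodes −1), a dot product becomes n − 2·popcount(a XOR b). A minimal sketch:

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element {-1, +1} vectors packed as bits
    (bit i = 1 encodes +1): equals n - 2 * popcount(a XOR b)."""
    return n - 2 * bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")

# a = [+1, +1, -1, +1] and b = [+1, -1, +1, +1], least-significant bit first
a_bits = 0b1011
b_bits = 0b1101
d = binary_dot(a_bits, b_bits, 4)
```

One XOR plus one popcount replaces n floating-point multiply-adds, which is the source of the speed-ups reported above; the conversions the paper removes sit between such binary layers.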
23. Pseudo-Normalization via Integer Fast Inverse Square Root and Its Application to Fast Computation without Division.
- Author
-
Kusaka, Takashi and Tanaka, Takayuki
- Subjects
REAL numbers ,RENDERING (Computer graphics) ,SQUARE root ,FLOATING-point arithmetic ,ARITHMETIC - Abstract
Vector normalization is an important process in several algorithms. It is used in classical physical calculations, mathematical techniques, and machine learning, which has witnessed significant advancements in recent years. Normalization and regularization ensure the stability of solutions and play an important role in algorithm convergence. Normalization typically refers to the division of elements by their norm. Division should be avoided in algorithmic implementations because its computational cost is considerably higher than that of multiply–add operations. Based on this, there is a well-known method referred to as the fast inverse square root (FISR) algorithm for floating-point calculations (IEEE 754). In deeper-level embedded systems that require fast responses or power efficiency, integer arithmetic should be used instead of real number arithmetic (floating-point arithmetic) to increase speed. Embedded engineers therefore encounter problems when they must implement in integer arithmetic, yet real number arithmetic is required to compute vectors and other higher-dimensional algebra. There is no conventional normalization algorithm analogous to the FISR algorithm for integer arithmetic; however, the proposed pseudo-normalization achieves vector normalization within a restricted domain using only multiply–add operations and bit shifts. This allows for fast and robust operations, even on low-performance MCUs that do not have power-efficient FPUs. As an example, this study demonstrates the computation of the arctangent (Arctan2 function; atan2(y, x)) with high precision using only integer multiply–add operations. 
In this study, we proposed a method of vector normalization using only integer arithmetic for embedded systems and confirmed its effectiveness by simulation using Verilog. The research results can contribute to various fields such as signal processing of IMU sensor data, faster artificial intelligence training, and efficient rendering of computer graphics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
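The FISR technique referenced in the abstract above can be sketched in a few lines; the magic constant 0x5F3759DF and the single Newton–Raphson refinement below are the classic IEEE 754 binary32 formulation of FISR, not the authors' integer-only pseudo-normalization.

```python
import struct

def fast_inverse_sqrt(x: float) -> float:
    """Classic FISR: bit-level initial guess plus one Newton step."""
    # Reinterpret the binary32 bit pattern of x as a 32-bit integer.
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    # Magic-constant shift gives a cheap first approximation of 1/sqrt(x).
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # One Newton-Raphson iteration, using only multiplies and subtractions.
    return y * (1.5 - 0.5 * x * y * y)
```

A single Newton step brings the relative error to roughly 0.2%, which is why the method is popular where division and `sqrt` are expensive.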
24. REORTHOGONALIZED BLOCK CLASSICAL GRAM--SCHMIDT USING TWO CHOLESKY-BASED TSQR ALGORITHMS.
- Author
-
BARLOW, JESSE L.
- Subjects
- *
FLOATING-point arithmetic , *KRYLOV subspace , *FACTORIZATION , *MATRIX decomposition , *MATHEMATICS - Abstract
In [Numer. Math., 23 (2013), pp. 395--423], Barlow and Smoktunowicz propose the reorthogonalized block classical Gram--Schmidt algorithm BCGS2. New conditions for the backward stability of BCGS2 that allow the use of a more flexible version of that algorithm are given. Backward stability for BCGS2 means that, in floating point arithmetic with machine precision εM, for a full column rank X ∈ R^{m×n}, m ≥ n, the algorithm produces Q ∈ R^{m×n} and upper triangular R ∈ R^{n×n} such that ||I − Q^T Q|| = O(εM) and ||X − QR|| = O(εM ||X||). However, each major step of BCGS2 requires the QR factorization of two intermediate m×p matrices Y1 and Y2. In many applications of interest m > p, thus these factorizations are called "tall, skinny" QR (TSQR) operations. Each such factorization was assumed to produce Qj, Rj, j = 1, 2, such that ||I − Qj^T Qj|| = O(εM) and ||Yj − Qj Rj|| = O(εM ||Yj||). For this suboperation, the first of these two conditions limits the choice of QR factorization algorithms to those, such as Householder and Givens QR, which may not produce the Qj as efficiently as some with weaker orthogonality restrictions. For the second of these QR factorizations, it is shown that the Cholesky decomposition of Y2^T Y2 followed by Q2 = Y2 R2^{-1} can be substituted without a significant change in the conditions for backward stability. With slightly stronger restrictions, the first QR decomposition can be done by algorithms such as the mixed precision CholQR algorithm described by Yamazaki, Tomov, and Dongarra [SIAM J. Sci. Comput., 37 (2015), pp. C307--C330]. In a GPU/CPU environment, Yamazaki, Tomov, and Dongarra showed that algorithm to be a very efficient method of producing the TSQR. Given that a common application of Gram--Schmidt algorithms is in the implementation of Krylov subspace methods, such as block GMRES, these results make the BCGS2 algorithm more broadly applicable. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
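The Cholesky-based TSQR building block (CholQR) that the abstract above discusses reduces to factoring the small p×p Gram matrix; a minimal dense NumPy sketch, assuming a well-conditioned tall-skinny input:

```python
import numpy as np

def cholqr(Y):
    """CholQR: Q = Y R^{-1} with R the Cholesky factor of Y^T Y."""
    # Gram matrix of the tall, skinny input (p x p, cheap to factor).
    G = Y.T @ Y
    # Upper-triangular R with Y^T Y = R^T R.
    R = np.linalg.cholesky(G).T
    # Triangular solve for Q instead of forming R^{-1} explicitly:
    # Q R = Y  <=>  R^T Q^T = Y^T.
    Q = np.linalg.solve(R.T, Y.T).T
    return Q, R
```

For ill-conditioned Y the Gram matrix squares the condition number, which is exactly why the abstract's backward-stability conditions and mixed-precision variants matter.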
25. Various pseudo random number generators based on memristive chaos map model.
- Author
-
Moussa, Karim H., Mohy El Den, Ahmed M., Mohamed, Islam Abd Ellattif, and Abdelrassoul, Roshdy A.
- Subjects
RANDOM number generators ,SPREAD spectrum communications ,MULTIMEDIA systems ,LINEAR network coding ,FLOATING-point arithmetic ,TELECOMMUNICATION ,IMAGE encryption ,RANDOM numbers - Abstract
The telecommunications industry has made huge strides, and multimedia information transmission is exploding. Text, sound, and video can all be used to create multimedia data. As a result, having systems in place to protect private or sensitive data and keep it secure is critical. In this article, completely random numbers were generated in two different ways, and the extent of their randomness was tested in many ways to ensure their suitability for use in different cryptographic applications. The proposed models in this article depend on chaos-based pseudo-random number generators (PRNGs). PRNGs, which create bit sequences, have evolved into a critical component in many industries, including encrypted communication, wireless communication using a spread spectrum, computational simulations, RF identification networks, and coding for error correction. The PRNGs are designed by combining a discrete memristor with the logistic map and with the sine map, separately, to construct novel algorithms called the two-dimensional memristive logistic map (2D-MLM) and the two-dimensional memristive sine map (2D-MSM); each cycle yields a sequence of 32 random bits. The binary64 double-precision format is employed for floating-point arithmetic in accordance with the IEEE 754-2008 standard. To assess the generator's performance, several statistical analyses are utilized. The results of the tests that evaluated the presented algorithms showed that the key space was improved and increased by 3.2% compared to other generators, and the performance speed was increased by 12.22%. The findings reveal that the generated sequences have an elevated degree of unpredictability and a high level of security, which renders them excellent for cryptographic use in terms of speed, large key space, and high data rate. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
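The abstract above does not give the 2D-MLM/2D-MSM update rules, so the following is only a generic illustration of the underlying idea of harvesting bits from a chaotic map iterated in binary64 arithmetic: a plain 1D logistic map with illustrative parameters, not the proposed generators.

```python
def logistic_prng(seed: float = 0.123456789, r: float = 3.99):
    """Toy chaos-based PRNG: yields one 32-bit word per call."""
    x = seed
    def next_u32() -> int:
        nonlocal x
        word = 0
        for _ in range(32):
            x = r * x * (1.0 - x)                      # chaotic iteration
            word = (word << 1) | (int(x * (1 << 24)) & 1)  # harvest one bit
        return word
    return next_u32
```

Real designs (including the paper's) combine maps and post-process the state precisely because a bare logistic map fails several statistical randomness tests.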
26. IMPROVING PROTECTION OF FALCON ELECTRONIC SIGNATURE SOFTWARE IMPLEMENTATIONS AGAINST ATTACKS BASED ON FLOATING POINT NOISE.
- Author
-
Kachko, Olena, Gorbenko, Yurii, Kandii, Serhii, and Kaptol, Yevhenii
- Subjects
FLOATING-point arithmetic ,CRYPTOGRAPHY ,INFORMATION storage & retrieval systems ,COMPUTER software ,NOISE - Abstract
The object of this study is digital signatures. The Falcon digital signature scheme is one of the finalists in the NIST post-quantum cryptography competition. Its distinctive feature is the use of floating-point arithmetic, which makes possible a key recovery attack using two non-matching signatures formed under special conditions. The work considers the task of improving Falcon to prevent such attacks, as well as the use of fixed-point calculations instead of floating-point calculations in the Falcon scheme. The main results of the work are proposals for methods of improving Falcon's security against attacks based on the use of floating-point calculations. These methods for improving security differ from others in the use of fixed-point calculations with specific, experimentally determined orders of magnitude in one case, and in proposals for modifying the procedures during whose execution the conditions for an implementation-level attack arise in the second case. As a result of the analysis, the probability of a successful secret-key recovery attack on the reference implementation of Falcon was clarified. Specific places in the code that make the attack possible have been localized, and code modifications have been suggested that make the attack impossible. In addition, the necessary scale for fixed-point calculations was determined, at which it is possible to completely get rid of floating-point calculations. The results could be used to qualitatively improve the security of existing digital signatures. This will make it possible to design more reliable and secure information systems using digital signatures. In addition, the results could be implemented in existing systems to ensure their resistance to modern threats. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
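Replacing floating point with fixed point, as the mitigation above proposes, amounts to choosing a scale and renormalizing after each product. A generic Q-format sketch follows; the 32 fractional bits are illustrative, not the experimentally determined scale from the paper.

```python
FRAC_BITS = 32                 # illustrative Q-format, not the paper's choice
SCALE = 1 << FRAC_BITS

def to_fixed(x: float) -> int:
    """Quantize a real value to a scaled integer."""
    return round(x * SCALE)

def from_fixed(a: int) -> float:
    return a / SCALE

def fixed_mul(a: int, b: int) -> int:
    # The raw product carries 2*FRAC_BITS fractional bits; shift back
    # down with rounding so the result stays in the same Q-format.
    return (a * b + (1 << (FRAC_BITS - 1))) >> FRAC_BITS
```

Because every operation is an exact integer operation followed by a deterministic shift, the data-dependent rounding noise that the floating-point attack exploits disappears.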
27. Computing integrals with an exponential weight on the real axis in floating point arithmetic.
- Author
-
Laudadio, Teresa, Mastronardi, Nicola, and Occorsio, Donatella
- Subjects
- *
FLOATING-point arithmetic , *SMOOTHNESS of functions , *SINGULAR value decomposition , *CONTINUOUS functions , *INTEGRALS - Abstract
The aim of this work is to propose a fast and reliable algorithm for computing integrals of the type ∫_{−∞}^{+∞} f(x) e^{−x² − 1/x²} dx, where f(x) is a sufficiently smooth function, in floating point arithmetic. The algorithm is based on a product integration rule, whose rate of convergence depends only on the regularity of f, since the coefficients of the rule are "exactly" computed by means of suitable recurrence relations derived here. We prove stability and convergence in the space of locally continuous functions on R equipped with a weighted uniform norm. By extensive numerical tests, the accuracy of the proposed product rule is compared with that of the Gauss–Hermite quadrature formula applied to the function f(x) e^{−1/x²}. The numerical results confirm the effectiveness of the method, supporting the proven theoretical estimates. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
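The Gauss–Hermite baseline that the authors compare against can be reproduced directly: fold the extra e^{−1/x²} factor into the integrand and apply the standard rule with weight e^{−x²}. This is only the comparison method from the abstract, not the paper's product rule.

```python
import numpy as np

def gauss_hermite_baseline(f, n=64):
    """Approximate the integral of f(x) * exp(-x^2 - 1/x^2) over R by
    applying n-point Gauss-Hermite quadrature to g(x) = f(x) * exp(-1/x^2).
    An even n keeps x = 0 (where g has an essential flat point) off the
    node set, avoiding a division by zero."""
    x, w = np.polynomial.hermite.hermgauss(n)
    return float(np.sum(w * f(x) * np.exp(-1.0 / x ** 2)))
```

For f ≡ 1 the exact value is √π e⁻², a classical closed form, which makes a convenient sanity check; the non-analyticity of e^{−1/x²} at the origin is what slows Gauss–Hermite down and motivates the product rule.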
28. Rational Arithmetic with a Round-Off.
- Author
-
Varin, V. P.
- Subjects
- *
FLOATING-point arithmetic , *RATIONAL numbers , *NUMBER theory , *COMPUTER systems - Abstract
Computations on a computer with floating point arithmetic are always approximate. Conversely, computations with rational arithmetic (in a computer algebra system, for example) are always absolutely exact and reproducible both on other computers and (theoretically) by hand. Consequently, these computations can be demonstrative in the sense that a proof obtained with their help is no different from a traditional one (a computer-assisted proof). However, such computations are usually impossible in a sufficiently complicated problem due to limitations on resources of memory and time. We propose a mechanism of rounding off rational numbers in computations with rational arithmetic, which solves this problem (of resources), i.e., computations can still be demonstrative but do not require unbounded resources. We give some examples of implementation of standard numerical algorithms with this arithmetic. The results have applications to analytical number theory. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
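Python's `fractions` module makes the idea above easy to see: compute exactly in rationals, then round off to a bounded-denominator rational to keep resources in check. This is a sketch of the concept, not Varin's specific rounding rule.

```python
from fractions import Fraction

def round_rational(x: Fraction, max_den: int = 10 ** 6) -> Fraction:
    """Nearest rational with denominator bounded by max_den."""
    return x.limit_denominator(max_den)

def dot_exact_rounded(xs, ys, max_den=10 ** 6):
    # Exact rational accumulation with a single round-off at the end:
    # the result is reproducible on any machine, but its size is bounded.
    acc = Fraction(0)
    for a, b in zip(xs, ys):
        acc += Fraction(a) * Fraction(b)
    return round_rational(acc, max_den)
```

Without the final round-off, denominators in iterative algorithms grow multiplicatively and quickly exhaust memory, which is exactly the resource problem the abstract describes.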
29. Avoiding Breakdown in Incomplete Factorizations in Low Precision Arithmetic.
- Author
-
SCOTT, JENNIFER and TŮMA, MIROSLAV
- Subjects
- *
NUMERICAL solutions for linear algebra , *ARITHMETIC , *FACTORIZATION , *COMPUTER arithmetic , *FLOATING-point arithmetic , *COMPUTERS - Abstract
The emergence of low precision floating-point arithmetic in computer hardware has led to a resurgence of interest in the use of mixed precision numerical linear algebra. For linear systems of equations, there has been renewed enthusiasm for mixed precision variants of iterative refinement. We consider the iterative solution of large sparse systems using incomplete factorization preconditioners. The focus is on the robust computation of such preconditioners in half precision arithmetic and employing them to solve symmetric positive definite systems to higher precision accuracy; however, the proposed ideas can be applied more generally. Even for well-conditioned problems, incomplete factorizations can break down when small entries occur on the diagonal during the factorization. When using half precision arithmetic, overflows are an additional possible source of breakdown. We examine how breakdowns can be avoided and implement our strategies within new half precision Fortran sparse incomplete Cholesky factorization software. Results are reported for a range of problems from practical applications. These demonstrate that, even for highly ill-conditioned problems, half precision preconditioners can potentially replace double precision preconditioners, although unsurprisingly this may be at the cost of additional iterations of a Krylov solver. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
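A standard guard against the breakdowns described above is a global diagonal shift that grows until the factorization succeeds; the paper's half-precision strategies for incomplete factorizations are more refined, but the textbook fallback looks like this (dense Cholesky stands in for the incomplete factorization):

```python
import numpy as np

def shifted_cholesky(A, shift=0.0, max_tries=20):
    """Retry Cholesky with a growing diagonal shift on breakdown.

    A shift alpha*I guards against small or negative pivots (and, in
    low precision, against overflow); returns the factor and the shift
    that was finally used."""
    I = np.eye(A.shape[0])
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(A + shift * I), shift
        except np.linalg.LinAlgError:
            # Double the shift, seeding it from the diagonal scale.
            shift = max(2.0 * shift, 1e-3 * np.abs(A.diagonal()).max())
    raise RuntimeError("Cholesky kept breaking down")
```

The trade-off matches the abstract's observation: a shifted (or half-precision) preconditioner is cheaper but may cost extra Krylov iterations.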
30. Gate-Level Hardware Priority Resolvers for Embedded Systems.
- Author
-
Balasubramanian, Padmanabhan and Maskell, Douglas L.
- Subjects
ARCHITECTURAL design ,FLOATING-point arithmetic ,MODULAR design ,NETWORK routers ,VIDEO coding ,DIGITAL libraries - Abstract
An N-bit priority resolver having N inputs and N outputs functions as polling hardware in an embedded system, enabling access to a resource when multiple devices initiate access requests at its inputs which may be located on-chip or off-chip. Subsystems such as data buses, comparators, fixed- and floating-point arithmetic units, interconnection network routers, etc., utilize the priority resolver function. In the literature, there are many transistor-level designs for the priority resolver based on dynamic CMOS logic, some of which are modular and others are not. This article presents a novel gate-level modular design of priority resolvers that can accommodate any number of inputs and outputs. Based on our modular design architecture, small-size priority resolvers can be conveniently combined to form medium- or large-size priority resolvers along with extra logic. The proposed modular design approach helps to reduce the coding complexity compared to the conventional direct design approach and facilitates scalability. We discuss the gate-level implementation of 4-, 8-, 16-, 32-, 64-, and 128-bit priority resolvers based on the direct and modular approaches and provide a performance comparison between these based on the design metrics. According to the modular approach, different sizes of priority resolver modules were used to implement larger-size priority resolvers. For example, a 4-bit priority resolver module was used to implement 8-, 16-, 32-, 64-, and 128-bit priority resolvers in a modular fashion. We used a 28 nm CMOS standard digital cell library and Synopsys EDA tools to synthesize the priority resolvers. The estimated design metrics show that the modular approach tends to facilitate increasing reductions in delay and power-delay product (PDP) compared to the direct approach, especially as the size of the priority resolver increases. 
For example, a 32-bit modular priority resolver utilizing 16-bit priority resolver modules had a 39.4% reduced delay and a 23.1% reduced PDP compared to a directly implemented 32-bit priority resolver, and a 128-bit modular priority resolver utilizing 16-bit priority resolver modules had a 71.8% reduced delay and a 61.4% reduced PDP compared to a directly implemented 128-bit priority resolver. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
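The one-hot behavior of a priority resolver is easy to model in software. With index 0 as the highest priority, granting the winner is the classic isolate-lowest-set-bit trick, and the gate-level form ANDs each request with the NOR of all higher-priority requests; both are generic illustrations of the function, not the article's CMOS designs.

```python
def priority_resolve(requests: int) -> int:
    """One-hot grant of the highest-priority (lowest-index) request."""
    # x & -x isolates the least significant set bit (two's complement).
    return requests & -requests

def priority_resolve_gates(requests: int, n: int) -> int:
    """Gate-level view: out[i] = in[i] AND NOT(in[0] OR ... OR in[i-1])."""
    grant, seen = 0, 0
    for i in range(n):
        bit = (requests >> i) & 1
        grant |= (bit & (1 - seen)) << i   # grant only if none seen yet
        seen |= bit
    return grant
```

The OR-chain in the gate-level form is the source of the linear delay that the article's modular decomposition attacks.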
31. Formal Verification of Emulated Floating-Point Arithmetic in Falcon
- Author
-
Hwang, Vincent, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Minematsu, Kazuhiko, editor, and Mimura, Mamoru, editor
- Published
- 2024
- Full Text
- View/download PDF
32. DLMF Standard Reference Tables on Demand
- Author
-
Saunders, Bonita V., Brooks, Sean, Buckmire, Ron, E. Vincent-Finley, Rachel, Backeljauw, Franky, Becuwe, Stefan, Miller, Bruce, McClain, Marjorie, Cuyt, Annie, Hartmanis, Juris, Founding Editor, van Leeuwen, Jan, Series Editor, Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Kobsa, Alfred, Series Editor, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Nierstrasz, Oscar, Series Editor, Pandu Rangan, C., Editorial Board Member, Sudan, Madhu, Series Editor, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Weikum, Gerhard, Series Editor, Vardi, Moshe Y, Series Editor, Goos, Gerhard, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Buzzard, Kevin, editor, Dickenstein, Alicia, editor, Eick, Bettina, editor, Leykin, Anton, editor, and Ren, Yue, editor
- Published
- 2024
- Full Text
- View/download PDF
33. On Rounding Errors in the Simulation of Quantum Circuits
- Author
-
Klamroth, Jonas, Beckert, Bernhard, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Monti, Flavia, editor, Plebani, Pierluigi, editor, Moha, Naouel, editor, Paik, Hye-young, editor, Barzen, Johanna, editor, Ramachandran, Gowri, editor, Bianchini, Devis, editor, Tamburri, Damian A., editor, and Mecella, Massimo, editor
- Published
- 2024
- Full Text
- View/download PDF
34. Dyadic linear programming and extensions
- Author
-
Abdi, Ahmad, Cornuéjols, Gérard, Guenin, Bertrand, and Tunçel, Levent
- Published
- 2024
- Full Text
- View/download PDF
35. Theoretical and Practical Bounds on the Initial Value of Clock Skew Compensation Algorithm Immune to Floating-Point Precision Loss for Resource-Constrained Wireless Sensor Nodes.
- Author
-
Kang, Seungyeop and Kim, Kyeong Soo
- Subjects
- *
WIRELESS sensor nodes , *WIRELESS sensor networks , *COMPUTING platforms , *ALGORITHMS , *IMMUNOCOMPUTERS - Abstract
We revisit our prior work on clock skew compensation immune to floating-point precision loss and provide practical as well as theoretical bounds on the initial value of the skew-compensated clock based on a systematic analysis of the errors of floating-point operations; by practical bounds, we mean the actual values of the theoretical bounds calculated on resource-constrained computing platforms like WSN nodes equipped with the 32-bit single-precision floating-point format. Numerical examples demonstrate that the proposed practical bounds on the single-precision floating-point format do not violate the theoretical bounds and can thereby guarantee the correctness of the clock skew compensation on resource-constrained computing platforms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
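Bounds of the kind described above rest on the standard model of floating-point error, fl(a op b) = (a op b)(1 + δ) with |δ| ≤ u, where u = 2⁻²⁴ is the unit roundoff of IEEE 754 binary32. The snippet below checks u empirically and evaluates the standard first-order accumulation bound; it illustrates the quantities involved, not the paper's specific derivation.

```python
import numpy as np

U32 = float(np.float32(2.0) ** -24)   # unit roundoff of IEEE 754 binary32

def accumulated_bound(n_ops: int) -> float:
    """Standard first-order bound gamma_n = n*u / (1 - n*u) on the
    relative error after n successive rounded binary32 operations."""
    nu = n_ops * U32
    return nu / (1.0 - nu)
```

Adding 2⁻²⁵ to 1.0 in binary32 is a halfway case that rounds back to 1.0 (ties-to-even), while 2⁻²³ survives: precisely the precision-loss effect the clock compensation must be immune to.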
36. Floating-Point Quantization Analysis of Multi-Layer Perceptron Artificial Neural Networks.
- Author
-
Al-Rikabi, Hussein and Renczes, Balázs
- Abstract
The impact of quantization in Multi-Layer Perceptron (MLP) Artificial Neural Networks (ANNs) is presented in this paper. In this architecture, the constant increase in size and the demand to decrease bit precision are two factors that contribute to the significant enlargement of quantization errors. We introduce an analytical tool that models the propagation of Quantization Noise Power (QNP) in floating-point MLP ANNs. Contrary to the state-of-the-art approach, which compares the exact and quantized data experimentally, the proposed algorithm can predict the QNP theoretically when the effect of operation quantization and Coefficient Quantization Error (CQE) are considered. This supports decisions in determining the required precision during the hardware design. The algorithm is flexible in handling MLP ANNs of user-defined parameters, such as size and type of activation function. Additionally, a simulation environment is built that can perform each operation on an adjustable bit precision. The accuracy of the QNP calculation is verified with two publicly available benchmarked datasets, using the default precision simulation environment as a reference. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
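The "state-of-the-art" experimental approach that the abstract contrasts itself with — comparing exact and quantized runs of the same network — looks like the following; the layer shapes and tanh activation are arbitrary stand-ins, and float32 plays the role of the reduced precision.

```python
import numpy as np

def mlp_forward(x, weights, dtype):
    """Forward pass of a small dense MLP with tanh activations,
    with all arithmetic carried out in the given dtype."""
    a = x.astype(dtype)
    for W in weights:
        a = np.tanh(W.astype(dtype) @ a)
    return a

def empirical_qnp(x, weights):
    # Quantization noise power measured experimentally: mean squared
    # deviation of the reduced-precision output from a float64 reference.
    ref = mlp_forward(x, weights, np.float64)
    low = mlp_forward(x, weights, np.float32).astype(np.float64)
    return float(np.mean((ref - low) ** 2))
```

The paper's contribution is to predict this quantity analytically from operation and coefficient quantization errors, so that hardware bit widths can be chosen without running such sweeps.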
37. A Hardware Implementation of the PID Algorithm Using Floating-Point Arithmetic.
- Author
-
Kulisz, Józef and Jokiel, Filip
- Subjects
FLOATING-point arithmetic ,DIGITAL signal processing ,GATE array circuits ,ALGORITHMS ,HARDWARE - Abstract
The purpose of the paper is to propose a new implementation of the PID (proportional–integral–derivative) algorithm in digital hardware. The proposed structure is optimized for cost. It follows a serialized, rather than parallel, scheme. It uses only one arithmetic block, performing the multiply-and-add operation. The calculations are carried out in a sequentially cyclic manner. The proposed circuit operates on standard single-precision (32-bit) floating-point numbers. It implements an extended PID formula, containing a non-ideal derivative component, and weighting coefficients, which enable reducing the influence of setpoint changes in the proportional and derivative components. The circuit was implemented in a Cyclone V FPGA (Field-Programmable Gate Array) device from Intel, Santa Clara, CA, USA. The proper operation of the circuit was verified in a simulation. For the specific implementation, which is reported in the paper, the sampling period of 516 ns was obtained, which means that the proposed solution is comparable in terms of speed with other hardware implementations of the PID algorithm operating on single-precision floating-point numbers. However, the presented solution is much more efficient in terms of cost. It uses 1173 LUT (Look-up Table) blocks, 1026 registers, and 1 DSP (Digital Signal Processing) block, i.e., about 30% of logic resources required by comparable solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
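In software form, the extended formula described above — a non-ideal (filtered) derivative plus setpoint weighting — is the standard two-degree-of-freedom discrete PID. The gains, weights, and sampling period below are illustrative, not the paper's hardware parameters.

```python
def make_pid(kp, ki, kd, b=1.0, c=0.0, n=10.0, ts=0.001):
    """Discrete 2-DOF PID: setpoint weights b, c and a first-order
    derivative filter with coefficient n (forward-Euler discretization)."""
    integ = 0.0    # integrator state
    dstate = 0.0   # derivative-filter state
    def step(sp, pv):
        nonlocal integ, dstate
        p = kp * (b * sp - pv)                  # weighted proportional term
        integ += ki * ts * (sp - pv)            # integral term
        d = n * (kd * (c * sp - pv) - dstate)   # non-ideal derivative
        dstate += ts * d
        return p + integ + d
    return step
```

Setting b < 1 and c = 0 reduces the influence of setpoint steps on the proportional and derivative terms, exactly the purpose of the weighting coefficients in the abstract.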
38. SOME NEW RESULTS ON THE MAXIMUM GROWTH FACTOR IN GAUSSIAN ELIMINATION.
- Author
-
EDELMAN, ALAN and URSCHEL, JOHN
- Subjects
- *
GAUSSIAN elimination , *GROWTH factors , *FLOATING-point arithmetic - Abstract
This paper combines modern numerical computation with theoretical results to improve our understanding of the growth factor problem for Gaussian elimination. On the computational side, we obtain lower bounds for the maximum growth for complete pivoting for n = 1, ..., 75 and n = 100 using the Julia JuMP optimization package. At n = 100 we obtain a growth factor bigger than 3n. The numerical evidence suggests that the maximum growth factor is bigger than n if and only if n ≥ 11. We also present a number of theoretical results. We show that the maximum growth factor over matrices with entries restricted to a subset of the reals is nearly equal to the maximum growth factor over all real matrices. We also show that the growth factors under floating point arithmetic and exact arithmetic are nearly identical. Finally, through numerical search, and stability and extrapolation results, we provide improved lower bounds for the maximum growth factor. Specifically, we find that the largest growth factor is bigger than 1.0045n for n > 10, and the lim sup of the ratio with n is greater than or equal to 3.317. In contrast to the old conjecture that growth might never be bigger than n, it seems likely that the maximum growth divided by n goes to infinity as n → ∞. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
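The quantity being bounded above is easy to state in code: run Gaussian elimination with complete pivoting and track the largest entry magnitude seen at any stage relative to the initial one. This plain reproduction of the definition is, of course, not the paper's JuMP-based optimization over matrices.

```python
import numpy as np

def growth_factor_cp(A):
    """Growth factor of Gaussian elimination with complete pivoting."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    init_max = np.abs(A).max()
    g = init_max
    for k in range(n - 1):
        # Complete pivoting: move the largest remaining entry to (k, k).
        sub = np.abs(A[k:, k:])
        i, j = np.unravel_index(sub.argmax(), sub.shape)
        A[[k, k + i], :] = A[[k + i, k], :]
        A[:, [k, k + j]] = A[:, [k + j, k]]
        if A[k, k] == 0.0:
            break
        # Rank-1 Schur-complement update of the trailing submatrix.
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k] / A[k, k], A[k, k + 1:])
        A[k + 1:, k] = 0.0   # multipliers are not part of U
        g = max(g, np.abs(A[k + 1:, k + 1:]).max())
    return g / init_max
```

For the 2×2 matrix [[1, 1], [1, −1]] the growth is exactly 2, the known complete-pivoting maximum for n = 2.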
39. Mathematical modelling for high precision ray tracing in optical design.
- Author
-
Wu, Changmao, Xia, Yuanyuan, Xu, Zhengwei, Liu, Li, Tang, Xiongxin, Chen, Qiao, and Xu, Fanjiang
- Subjects
- *
RAY tracing , *RAY tracing algorithms , *MATHEMATICAL models , *NUMERICAL calculations , *DESIGN software , *FLOATING-point arithmetic - Abstract
Optical systems have conventionally been evaluated by means of ray-tracing techniques, which can extract performance parameters such as aberration and spot size. However, current ray tracing methods in the field of optical design lack satisfactory means of error analysis and accuracy improvement. Additionally, numerical calculations in large-scale, lengthy, and multi-scale ray tracing often produce inaccurate or even invalid outcomes due to round-off errors. In response to this challenge, we adopt floating-point arithmetic theory to analyze the cumulative impact of round-off errors during the ray tracing process and develop a numerical error model for ray tracing. Using this model, we introduce compensation measures to achieve higher precision and then implement a high-precision ray tracing algorithm. Finally, we conduct numerical experiments to validate our error model and high-precision algorithm, demonstrating their superior precision in comparison to the state-of-the-art authoritative optical design software, Zemax and CODE V. • Optical design demands higher precision in ray tracing algorithms. • The error model reveals the sources and constituents of errors in ray tracing. • Five measures are derived from the model to improve the precision of ray tracing. • Our approach excels beyond the state-of-the-art algorithms of Zemax and CODE V. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
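One concrete compensation measure of the sort the abstract above introduces is compensated (Kahan) summation, which recovers the low-order bits each floating-point addition discards; this is a generic illustration of round-off compensation, not the paper's specific measures.

```python
def kahan_sum(values):
    """Compensated summation: carry the rounding error of each add."""
    s = 0.0   # running sum
    c = 0.0   # running compensation (the lost low-order bits)
    for v in values:
        y = v - c          # correct the incoming term
        t = s + y          # big + small: low bits of y are lost here...
        c = (t - s) - y    # ...and recovered algebraically into c
        s = t
    return s
```

Accumulating many small path-length increments onto a large running value is exactly the pattern in long, multi-scale ray traces, and it is where naive summation loses digits.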
40. Hybrid Precision Floating-Point (HPFP) Selection to Optimize Hardware-Constrained Accelerator for CNN Training.
- Author
-
Junaid, Muhammad, Aliev, Hayotjon, Park, SangBo, Kim, HyungWon, Yoo, Hoyoung, and Sim, Sanghoon
- Subjects
- *
ARTIFICIAL neural networks , *FLOATING-point arithmetic , *ENERGY consumption - Abstract
The rapid advancement in AI requires efficient accelerators for training on edge devices, which often face challenges related to the high hardware costs of floating-point arithmetic operations. To tackle these problems, efficient floating-point formats inspired by block floating-point (BFP), such as Microsoft Floating Point (MSFP) and FlexBlock (FB), are emerging. However, they have limited dynamic range and precision for the smaller magnitude values within a block due to the shared exponent. This limits the BFP's ability to train deep neural networks (DNNs) with diverse datasets. This paper introduces the hybrid precision (HPFP) selection algorithms, designed to systematically reduce precision and implement hybrid precision strategies, thereby balancing layer-wise arithmetic operations and data path precision to address the shortcomings of traditional floating-point formats. Reducing the data bit width with HPFP allows more read/write operations from memory per cycle, thereby decreasing off-chip data access and the size of on-chip memories. Unlike traditional reduced precision formats that use BFP for calculating partial sums and accumulating those partial sums in 32-bit Floating Point (FP32), HPFP leads to significant hardware savings by performing all multiply and accumulate operations in reduced floating-point format. For evaluation, two training accelerators for the YOLOv2-Tiny model were developed, employing distinct mixed precision strategies, and their performance was benchmarked against an accelerator utilizing a conventional brain floating point of 16 bits (Bfloat16). The HPFP selection, employing 10 bits for the data path of all layers and for the arithmetic of layers requiring low precision, along with 12 bits for layers requiring higher precision, results in a 49.4% reduction in energy consumption and a 37.5% decrease in memory access. 
This is achieved with only a marginal mean Average Precision (mAP) degradation of 0.8% when compared to an accelerator based on Bfloat16. This comparison demonstrates that the proposed accelerator based on HPFP can be an efficient approach to designing compact and low-power accelerators without sacrificing accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
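The block floating-point family (MSFP, FlexBlock) that the abstract above builds on shares one exponent across a block and stores fixed-point mantissas. A minimal quantizer sketch follows; the mantissa width and the single-block treatment are illustrative, not the paper's HPFP configuration.

```python
import numpy as np

def bfp_quantize(x, mantissa_bits=8):
    """Round a block of values to block floating-point: one shared
    exponent, signed fixed-point mantissas."""
    peak = np.abs(x).max()
    if peak == 0:
        return np.zeros_like(x)
    # Shared exponent chosen so the largest value just fits the mantissa.
    shared_exp = int(np.ceil(np.log2(peak)))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    q = np.clip(np.round(x / scale),
                -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1)
    return q * scale
```

The weakness the abstract points out is visible here: small-magnitude values in a block with a large peak get few effective mantissa bits, which is what motivates hybrid per-layer precision.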
41. A Hardware-Based Orientation Detection System Using Dendritic Computation.
- Author
-
Nomura, Masahiro, Chen, Tianqi, Tang, Cheng, Todo, Yuki, Sun, Rong, Li, Bin, and Tang, Zheng
- Subjects
LOGIC circuits ,FLOATING-point arithmetic ,BIOLOGICALLY inspired computing ,GRAYSCALE model ,VIRTUAL reality ,BIG data - Abstract
Studying how objects are positioned is vital for improving technologies like robots, cameras, and virtual reality. In our earlier papers, we introduced a bio-inspired artificial visual system for orientation detection, demonstrating its superiority over traditional systems with higher recognition rates, greater biological resemblance, and increased resistance to noise. In this paper, we propose a hardware-based orientation detection system (ODS). The ODS is implemented by a multiple dendritic neuron model (DNM), and a neuronal pruning scheme for the DNM is proposed. After performing the neuronal pruning, only the synapses in the direct and inverse connections states are retained. The former can be realized by a comparator, and the latter can be replaced by a combination of a comparator and a logic NOT gate. For the dendritic function, the connection of synapses on dendrites can be realized with logic AND gates. Then, the output of the neuron is equivalent to a logic OR gate. Compared with other machine learning methods, this logic circuit circumvents floating-point arithmetic and therefore requires very little computing resources to perform complex classification. Furthermore, the ODS can be designed based on experience, so no learning process is required. The superiority of ODS is verified by experiments on binary, grayscale, and color image datasets. The ability to process data rapidly owing to advantages such as parallel computation and simple hardware implementation allows the ODS to be desirable in the era of big data. It is worth mentioning that the experimental results are corroborated with anatomical, physiological, and neuroscientific studies, which may provide us with a new insight for understanding the complex functions in the human brain. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
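The pruned neuron described above maps directly onto pure logic: each retained synapse is a comparator (direct) or a comparator plus NOT (inverse), each dendrite ANDs its synapses, and the soma ORs the dendrites. The connection lists and threshold below are hypothetical examples, not the article's trained configuration.

```python
def dendritic_neuron(pixels, threshold, dendrites):
    """Pruned dendritic neuron as pure logic: comparator synapses,
    AND dendrites, OR soma. `dendrites` is a list of branches, each a
    list of (pixel_index, inverse_flag) pairs."""
    def synapse(idx, inverse):
        fired = pixels[idx] > threshold      # comparator
        return (not fired) if inverse else fired
    return any(all(synapse(i, inv) for i, inv in branch)
               for branch in dendrites)

# Hypothetical detector: fires when pixel 0 is bright AND pixel 1 is dark,
# OR when pixel 2 is bright.
DENDRITES = [[(0, False), (1, True)], [(2, False)]]
```

Since every operation is a comparison or a Boolean gate, no floating-point arithmetic is needed at inference time, which is the hardware advantage the abstract claims.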
42. Configurable sparse matrix - matrix multiplication accelerator on FPGA: A systematic design space exploration approach with quantization effects.
- Author
-
Noble, G., Nalesh, S., Kala, S., and Kumar, Akash
- Subjects
MATRIX multiplications ,SPARSE matrices ,FLOATING-point arithmetic ,SPACE (Architecture) ,DEEP learning ,BIG data - Abstract
High-performance sparse matrix multipliers are essential for deep learning applications, and as big data analytics continues to evolve, specialized accelerators are also needed to handle sparse matrix operations efficiently. This paper proposes a modified, configurable, outer product based architecture for sparse matrix multiplication and explores the design space of the proposed architecture. The performance of various architecture configurations has been examined for input samples with similar characteristics. The proposed architecture has been implemented on a Xilinx Kintex-7 FPGA using 32-bit single precision floating-point arithmetic and also in 8-bit, 16-bit and 32-bit fixed-point arithmetic formats. The effect of quantization in the proposed architecture has been analyzed extensively and the results have been reported. The performance of the proposed architecture has been compared with state-of-the-art implementations, and an improvement of 9.21% has been observed in the performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
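The outer-product formulation that the architecture above is based on computes C = A·B as a sum of rank-1 updates, one per shared dimension index, skipping products where either operand is empty. Dense NumPy arrays stand in here for the accelerator's compressed sparse formats.

```python
import numpy as np

def outer_product_spgemm(A, B):
    """C = A @ B as a sum of rank-1 outer products C += A[:, k] * B[k, :],
    skipping all-zero columns/rows (the sparsity win of this dataflow)."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for kk in range(k):
        col, row = A[:, kk], B[kk, :]
        if col.any() and row.any():      # nothing to contribute
            continue_update = True
            C += np.outer(col, row)
    return C
```

Each rank-1 update touches only the nonzeros of one column of A and one row of B, which is why the dataflow suits streaming hardware with an on-chip accumulator.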
43. Computation of parabolic cylinder functions having complex argument.
- Author
-
Dunster, T.M., Gil, A., and Segura, J.
- Subjects
- *
WEBER functions , *FLOATING-point arithmetic , *ANALYTIC functions , *AIRY functions , *POINCARE series , *ASYMPTOTIC expansions - Abstract
Numerical methods for the computation of the parabolic cylinder function U(a, z) for real a and complex z are presented. The main tools are recent asymptotic expansions involving exponential and Airy functions, with slowly varying analytic coefficient functions involving simple coefficients, and stable integral representations; these two main methods can be complemented with Maclaurin series and a Poincaré asymptotic expansion. We provide numerical evidence showing that the combination of these methods is enough for computing the function with 5 × 10⁻¹³ relative accuracy in double precision floating point arithmetic. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. Numerical Approximations of the Riemann–Liouville and Riesz Fractional Integrals.
- Author
-
Ciesielski, Mariusz and Grodzki, Grzegorz
- Subjects
- *
FRACTIONAL integrals , *FLOATING-point arithmetic , *FRACTIONAL calculus , *CAUCHY integrals , *SPLINES , *GAUSSIAN quadrature formulas , *SPLINE theory , *NUMERICAL integration - Abstract
In this paper, numerical algorithms for calculating the values of the left- and right-sided Riemann–Liouville fractional integrals and the Riesz fractional integral using spline interpolation techniques are derived. The linear, quadratic and three variants of cubic splines are taken into account. The estimation of errors using analytical methods is derived. We show four examples of numerical evaluation of the mentioned fractional integrals and determine the experimental rate of convergence for each derived algorithm. The high-precision calculations are executed using 128-bit floating-point numbers and arithmetic routines. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
45. Why Did Thomas Harriot Invent Binary?
- Author
-
Strickland, Lloyd
- Subjects
- *
COMPUTER arithmetic , *FORTIFIED wines , *SPECIFIC gravity , *FLOATING-point arithmetic , *RENAISSANCE - Abstract
The article explores the invention of binary numeration and arithmetic by Thomas Harriot, an English mathematician, astronomer, and alchemist. It challenges the traditional credit given to Gottfried Wilhelm Leibniz and reveals that Harriot used binary notation as early as the 17th century. Harriot's use of binary was influenced by his weighing experiments, where he recorded measurements using a system of part ounces and grains. The article highlights the importance of Harriot's recording system in the development of binary notation and discusses his exploration of binary arithmetic. Despite his achievement, Harriot's work remained unpublished until the 20th century and did not impact the adoption of binary in computer arithmetic. [Extracted from the article]
- Published
- 2024
- Full Text
- View/download PDF
46. Formally-Verified Round-Off Error Analysis of Runge–Kutta Methods.
- Author
-
Faissole, Florian
- Subjects
RUNGE-Kutta formulas ,FLOATING-point arithmetic ,COMPUTER arithmetic ,MATRIX norms ,NUMERICAL integration - Abstract
Numerical errors are insidious, difficult to predict and inherent in different levels of critical systems design. Indeed, numerical algorithms generally constitute approximations of an ideal mathematical model, which itself constitutes an approximation of a physical reality subject to multiple measurement errors. Added to this are rounding errors due to computer arithmetic implementations, often neglected even though they can significantly distort the results obtained. This applies to Runge–Kutta methods used for the numerical integration of ordinary differential equations, which are ubiquitous in modeling fundamental laws of physics, chemistry, biology and economics. We provide a Coq formalization of the rounding error analysis of Runge–Kutta methods applied to linear systems and implemented in floating-point arithmetic. We propose a generic methodology to build a bound on the error accumulated over the iterations, taking gradual underflow into account. We then instantiate this methodology for two classic Runge–Kutta methods, namely Euler and RK2. The formalization of the results includes the definition of matrix norms, the proof of rounding error bounds of matrix operations, and the formalization of the generic results and their application to examples. To support the proposed approach, we provide numerical experiments on examples coming from nuclear physics applications. [ABSTRACT FROM AUTHOR]
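The accumulated round-off that this kind of analysis bounds can be measured directly on a toy case. The sketch below (an illustration of the phenomenon, not the paper's Coq development) runs forward Euler for the scalar linear ODE y′ = λy twice: once in double precision and once in exact rational arithmetic via `fractions.Fraction`, which converts each double exactly, so the difference between the two runs is purely the rounding error of the floating-point iteration:

```python
from fractions import Fraction

def euler_roundoff(lam, y0, h, nsteps):
    """Forward Euler for y' = lam*y, run in double precision and in
    parallel in exact rational arithmetic.  Returns the floating-point
    result and the magnitude of the accumulated round-off."""
    y_flt = y0
    y_exact = Fraction(y0)                 # exact copy of the double y0
    lam_q, h_q = Fraction(lam), Fraction(h)
    for _ in range(nsteps):
        y_flt = y_flt + h * (lam * y_flt)            # rounds each operation
        y_exact = y_exact + h_q * (lam_q * y_exact)  # no rounding at all
    return y_flt, abs(y_flt - float(y_exact))
```

For a stable problem (λ < 0) the round-off after a thousand steps stays many orders of magnitude below the result, consistent with bounds that grow roughly linearly in the step count times the unit round-off.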
- Published
- 2024
- Full Text
- View/download PDF
47. Fast Generation of Custom Floating-Point Spatial Filters on FPGAs
- Author
-
Nelson Campos, Eran Edirisinghe, Slava Chesnokov, and Daniel Larkin
- Subjects
Domain-specific language ,embedded video processing ,floating-point arithmetic ,FPGA ,real-time ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Convolutional Neural Networks (CNNs) have been utilised in many image and video processing applications. The convolution operator, also known as a spatial filter, is usually a linear operation, but linearity discards essential features and details that many applications can only capture with non-linear filters. However, due to their slow processing, non-linear spatial filters are a significant bottleneck in many software applications, and their complexity makes them difficult to accelerate in FPGA or VLSI architectures. This paper presents novel FPGA implementations of linear and non-linear spatial filters. More specifically, the arithmetic computations are carried out in custom floating-point, enabling a trade-off between precision and hardware compactness and reducing algorithm development time. Further, we show that it is possible to process video at a resolution of 1080p at 60 frames per second using a low-cost FPGA board. Finally, we show that a domain-specific language allows the rapid prototyping of image processing algorithms in custom floating-point arithmetic, enabling non-experts to quickly develop real-time video processing applications.
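The precision/compactness trade-off of a custom floating-point format can be simulated in software before committing to hardware. The sketch below (an illustrative model, not the paper's toolchain; the names `quantize` and `conv3x3` are invented here) rounds every intermediate value to a reduced-width significand, mimicking a custom-float datapath in a 3×3 spatial filter:

```python
import math

def quantize(x, mant_bits):
    """Round x to the nearest value whose significand fits in mant_bits
    fraction bits, emulating a reduced-precision custom float format
    (the exponent range is left unconstrained for simplicity)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << mant_bits
    return math.ldexp(round(m * scale) / scale, e)

def conv3x3(image, kernel, mant_bits):
    """Valid 3x3 convolution with every product and accumulation
    re-quantized, approximating a fixed-width custom-float datapath."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    p = quantize(image[i + di][j + dj] * kernel[di][dj],
                                 mant_bits)
                    acc = quantize(acc + p, mant_bits)
            row.append(acc)
        out.append(row)
    return out
```

Sweeping `mant_bits` against a reference output gives a software estimate of the narrowest format that still meets an application's accuracy target, which is the kind of exploration the paper's framework automates in hardware.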
- Published
- 2024
- Full Text
- View/download PDF
48. Framework for rapid hardware prototyping using custom floating-point arithmetic
- Author
-
De-Sousa-Campos, Nelson
- Subjects
004 ,Domain-specific language ,embedded video processing ,floating-point arithmetic ,FPGA ,real-time - Published
- 2022
- Full Text
- View/download PDF
49. Masking Floating-Point Number Multiplication and Addition of Falcon
- Author
-
Keng-Yu Chen and Jiun-Peng Chen
- Subjects
Falcon ,Floating-Point Arithmetic ,Masking ,Post-Quantum Cryptography ,Side-Channel Analysis ,Computer engineering. Computer hardware ,TK7885-7895 ,Information technology ,T58.5-58.64 - Abstract
In this paper, we provide the first masking scheme for floating-point number multiplication and addition to defend against recent side-channel attacks on Falcon’s pre-image vector computation. Our approach involves a masked nonzero check gadget that securely identifies whether a shared value is zero. This gadget can be utilized for various computations such as rounding the mantissa, computing the sticky bit, checking the equality of two values, and normalizing a number. To support the masked floating-point number addition, we also developed a masked shift and a masked normalization gadget. Our masking design provides both first- and higher-order mask protection, and we demonstrate its theoretical security by proving the (Strong)-Non-Interference properties in the probing model. To evaluate the performance of our approach, we implemented unmasked, first-order, and second-order algorithms on an Arm Cortex-M4 processor, providing cycle counts and the number of random bytes used. We also report the time for one complete signing process with our countermeasure on an Intel-Core CPU. In addition, we assessed the practical security of our approach by conducting the test vector leakage assessment (TVLA) to validate the effectiveness of our protection. Specifically, our TVLA experiment results for second-order masking passed the test in 100,000 measured traces.
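The core idea of Boolean masking (sharing a secret as an XOR of random shares so no single intermediate reveals it) can be shown with a toy first-order sketch. This is a pedagogical illustration only, not the paper's gadget: the real nonzero-check gadget operates share-wise to preserve higher-order security, whereas this toy version recombines implicitly through a single comparison whose result is the gadget's public output anyway:

```python
import secrets

BITS = 64  # width of the masked words

def mask(x):
    """Split x into two Boolean shares with x == s0 ^ s1."""
    r = secrets.randbits(BITS)
    return (x ^ r, r)

def refresh(s0, s1):
    """Re-randomize a sharing without changing the masked value."""
    r = secrets.randbits(BITS)
    return (s0 ^ r, s1 ^ r)

def masked_is_nonzero(s0, s1):
    """x != 0 iff the two shares differ.  The comparison reveals only
    the nonzero predicate itself, which is the intended public output;
    every stored intermediate remains a uniformly masked share."""
    s0, s1 = refresh(s0, s1)
    return s0 != s1
```

In the paper's floating-point gadgets, checks like this feed mantissa rounding and sticky-bit computation without ever unmasking the operands.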
- Published
- 2024
- Full Text
- View/download PDF
50. DETERMINING THE EFFECT OF A FLOATING POINT ON THE FALCON DIGITAL SIGNATURE ALGORITHM SECURITY.
- Author
-
Potii, Oleksandr, Kachko, Olena, Kandii, Serhii, and Kaptol, Yevhenii
- Subjects
FLOATING-point arithmetic ,ALGORITHMS ,TELECOMMUNICATION systems ,DIGITAL signatures ,CRYPTOGRAPHY ,INFORMATION storage & retrieval systems - Abstract
The object of research is digital signatures. The Falcon digital signature scheme is one of the finalists in the NIST post-quantum cryptography competition. Its distinctive feature is the use of floating-point arithmetic. However, floating-point arithmetic has so-called rounding noise, which accumulates during computations and in some cases may lead to significant changes in the processed values. The work considers the problem of using rounding noise to build attacks on implementations. The main result of the study is a novel implementation attack that enables secret key recovery. This attack differs from existing attacks in using two separately secure implementations with different computation orders. As a result of the analysis, the conditions under which secret key recovery is possible were revealed. The attack requires 300,000 signatures and two implementations to recover the key. The probability of a successful attack ranges from 70 % to 76 %. This probability is explained by the structure of the Gaussian sampling algorithm used in the Falcon digital signature. At the same time, a necessary condition for conducting the attack is an identical seed during signature generation. This condition makes the attack more theoretical than practical, since a correct implementation of Falcon makes the probability of two identical seeds negligible. However, the possible usage of floating-point noise shows the potential existence of additional attack vectors for Falcon that should be covered in security models. The results could be used in the construction of digital signature security models and their implementation in existing information and communication systems. [ABSTRACT FROM AUTHOR]
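The "different computation orders" exploited here rest on a basic IEEE 754 fact: floating-point addition is not associative, so two correct implementations that sum the same values in different orders can produce different rounding noise. A minimal demonstration (illustrative only; the hypothetical names `sum_forward` and `sum_backward` stand in for the two implementations):

```python
def sum_forward(xs):
    """Left-to-right accumulation; each addition rounds to double."""
    acc = 0.0
    for x in xs:
        acc = acc + x
    return acc

def sum_backward(xs):
    """Same values, reversed order: a different rounding sequence."""
    acc = 0.0
    for x in reversed(xs):
        acc = acc + x
    return acc
```

Even three terms suffice: summing [0.1, 0.2, 0.3] forward and backward yields results that differ in the last bits, and it is exactly such implementation-dependent low-order bits that the attack correlates across the two implementations.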
- Published
- 2024
- Full Text
- View/download PDF