174 results for "Minifloat"
Search Results
2. Applicability of Minifloats for Efficient Calculations in Neural Networks
- Author
-
Kondrat’ev, A. Yu. and Goncharenko, A. I.
- Published
- 2020
3. Applicability of Minifloats for Efficient Calculations in Neural Networks
- Author
-
A. I. Goncharenko and A. Yu. Kondrat’ev
- Subjects
Applied physics, Artificial neural network, Computer science, Deep learning, Inference, Minifloat, Condensed matter physics, Convolutional neural network, Data type, Optics, Standard type, Artificial intelligence, Electrical and electronic engineering, Accumulator (computing), Instrumentation - Abstract
The possibility of running neural network inference on minifloats has been studied. Calculations were performed using a float16 accumulator for intermediate computations. Performance was tested on the GoogleNet, ResNet-50, and MobileNet-v2 convolutional neural networks and the DeepSpeechv01 recurrent network. The experiments showed that the performance of these neural networks with 11-bit minifloats is not inferior to that of networks using the standard float32 type, without additional training. The results indicate that minifloats can be used to design efficient computers for neural network inference.
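The abstract does not spell out how the 11-bit format divides its bits, so the toy quantizer below assumes a 1-bit sign, 5-bit exponent and 5-bit mantissa purely for illustration; quantize_minifloat is a hypothetical helper, not code from the paper.

```python
import math

def quantize_minifloat(x, exp_bits=5, man_bits=5):
    """Round x to the nearest value of a toy 1/exp_bits/man_bits minifloat.
    Illustration only: the 1-5-5 split is an assumption; NaN/inf pass through."""
    if x == 0.0 or math.isnan(x) or math.isinf(x):
        return x
    bias = (1 << (exp_bits - 1)) - 1
    m, e = math.frexp(abs(x))              # abs(x) = m * 2**e, 0.5 <= m < 1
    m, e = 2.0 * m, e - 1                  # normalize so 1 <= m < 2
    if e > bias:                           # overflow: clamp to largest finite
        q = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias
    elif e < 1 - bias:                     # underflow: round on the subnormal grid
        lsb = 2.0 ** (1 - bias - man_bits)
        q = round(abs(x) / lsb) * lsb
    else:                                  # normal: keep man_bits fraction bits
        q = round(m * (1 << man_bits)) / (1 << man_bits) * 2.0 ** e
    return math.copysign(q, x)

print(quantize_minifloat(3.14159))         # 3.125, the nearest 1-5-5 minifloat
```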
- Published
- 2020
4. Optimizing Deep Learning Models for Object Detection
- Author
-
Gabriel Iuhasz and Calin-George Barburescu
- Subjects
Hyperparameter, Artificial neural network, Computer science, Deep learning, Detector, Inference, Minifloat, Machine learning, Object detection, Single-precision floating-point format, Artificial intelligence - Abstract
Deep learning models for object detection have grown larger and larger over the years, spanning from 3.9M trainable parameters for EfficientDet to 209M for the AmoebaNet-based NAS-FPN detector. Different strategies are currently being researched to improve the efficiency of deep learning models for object detection, one of which is running the training and inference of the neural network in low precision. Researchers have achieved interesting results, moving from the original paradigm of performing the necessary operations in IEEE single precision (FP32) to achieving similar model accuracy with custom minifloat formats (FP8). The results can be pushed even further by using genetic algorithms for hyperparameter tuning, in order to find hyperparameters specific to the FP8 version of the model. In this paper, we present the results of our experiments with YOLOv3 using a hybrid floating-point format (HFP8). One of the experiments shows how our solution can be used for checking compliance with social distancing guidelines, a very important topic during the COVID-19 pandemic.
- Published
- 2020
5. Floating-Point Arithmetic
- Author
-
Jo Van Hoey
- Subjects
Quadruple-precision floating-point format, Floating point, Arbitrary-precision arithmetic, Floating-point unit, Double-precision floating-point format, Arithmetic, Minifloat, Extended precision, Single-precision floating-point format, Mathematics - Abstract
You already know about integer arithmetic; now we introduce some floating-point computations. There is nothing difficult here: a floating-point value has a decimal point and zero or more decimal digits. We have two kinds of floating-point numbers, single precision and double precision. Double precision is more accurate because it can handle more significant digits. With that information, you know enough to run and analyze the sample program in this chapter.
- Published
- 2019
6. IEEE standard for floating point numbers
- Author
-
V. Rajaraman
- Subjects
IEEE 754-1985, Computer science, Decimal floating point, Floating-point unit, Software engineering, Double-precision floating-point format, Minifloat, IEEE floating point, Single-precision floating-point format, Education, Real data type, Computer engineering, Information systems, Electrical and electronic engineering, Arithmetic - Abstract
Floating point numbers are an important and extensively used data type in computation. Yet many users do not know the standard used in almost all computer hardware to store and process them. In this article, we explain the standard developed by the Institute of Electrical and Electronics Engineers in 1985, and augmented in 2008, to represent and process floating point numbers. This standard is now used by all computer manufacturers when designing floating point arithmetic units, so that programs are portable among computers.
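As a small companion to the article's topic, the sketch below unpacks the 1-bit sign, 11-bit exponent and 52-bit fraction fields that IEEE 754 prescribes for a binary64 value; the helper name is ours.

```python
import struct

def binary64_fields(x):
    """Split an IEEE 754 binary64 into sign, unbiased exponent and fraction."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63                         # 1 sign bit
    exponent = (bits >> 52) & 0x7FF           # 11 biased exponent bits
    fraction = bits & ((1 << 52) - 1)         # 52 fraction bits
    return sign, exponent - 1023, fraction

print(binary64_fields(-6.25))   # (1, 2, ...): -6.25 = -1.5625 * 2**2
```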
- Published
- 2016
7. Design and Implementation of Double Precision Floating Point Comparator
- Author
-
Anjana Mary Joseph and P. Rony Antony
- Subjects
Floating point comparator, Floating point, Comparator, Computer science, Double precision, Double-precision floating-point format, Minifloat, Parallel prefix tree, Extended precision, IEEE floating point, Computer hardware and architecture, Most significant bit, Computer hardware, Half-precision floating-point format - Abstract
Floating point comparison is a fundamental arithmetic operation in DSP processors. The high dynamic range of floating point comparators finds wide application in data sorting, DSP algorithms, and similar tasks. High performance with optimum area is a major concern for the practical implementation of these comparators. Another major concern with floating point numbers is invalid numbers, which require a separate handling module. In the present work, a double precision floating point comparator design is proposed for efficient floating point comparison. The comparator takes full advantage of a parallel prefix tree architecture: it first compares the most significant bits and proceeds towards the least significant bits only when the compared bits are equal. Representation of floating point numbers follows the IEEE 754 standard. The double precision floating point comparator is modelled in Verilog HDL and synthesized with Xilinx ISE 14.6 targeting a Virtex-5, and with the Cadence Encounter tool. The results show that the new comparator architecture is efficient in handling all invalid floating point numbers.
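The MSB-first comparison strategy has a well-known software counterpart: IEEE 754 bit patterns are sign-magnitude, and a small remapping turns numeric order into plain integer order. The sketch below shows that trick; it is an illustration of the principle, not the paper's parallel prefix hardware.

```python
import struct

def order_key(x):
    """Map a binary64's bits to an integer whose order matches numeric order
    (NaNs excluded): negative values get all bits flipped, positive values
    get the sign bit set."""
    b = struct.unpack("<Q", struct.pack("<d", x))[0]
    return b ^ 0xFFFFFFFFFFFFFFFF if b >> 63 else b | (1 << 63)

print(order_key(-1.5) < order_key(-0.25) < order_key(0.0) < order_key(2.0))  # True
```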
- Published
- 2016
8. A Parameterized Floating-Point Formalization in HOL Light
- Author
-
Ganesh Gopalakrishnan, Charles Jacobsen, and Alexey Solovyev
- Subjects
Discrete mathematics, Floating point, Correctness, Formalization, General computer science, Rounding, Parameterized complexity, IEEE 754-2008, Fixed point, Minifloat, IEEE floating point, Theoretical computer science, NaN, Mathematics - Abstract
We present a new, open-source formalization of fixed and floating-point numbers for arbitrary radix and precision that is now part of the HOL Light distribution [John Harrison. HOL Light: A tutorial introduction. In Formal Methods in Computer-Aided Design, pages 265–269. Springer, 1996]. We prove correctness and error bounds for the four different rounding modes, and formalize a subset of the IEEE 754 [IEEE standard for floating point arithmetic. IEEE Std. 754-2008, 2008] standard by gluing together a set of fixed-point and floating-point numbers to represent the subnormals and normals. In our floating-point proofs, we treat phases of floating-point numbers as copies of fixed-point numbers of varying precision so that we can reuse fixed-point rounding theorems.
- Published
- 2015
9. Harnessing Numerical Flexibility for Deep Learning on FPGAs
- Author
-
Andrew Bitar, Josh Fender, Andrew Ling, Suchit Subhaschandra, David Han, Gordon Raymond Chiu, Roberto DiCecco, Shane O'Connell, Mohamed S. Abdelfattah, Dmitry Denisenko, and Chris N. Johnson
- Subjects
Flexibility (engineering), Floating point, Memory bandwidth, Minifloat, Computer hardware and architecture, Significand, Computer engineering, Information systems, Block floating-point, Field-programmable gate array, Throughput - Abstract
Deep learning has become a key workload in the data centre and at the edge, leading to an arms race for compute dominance in this space. FPGAs have shown they can compete by combining deterministic low latency with high throughput and flexibility. In particular, thanks to their bit-level programmability, FPGAs can efficiently implement arbitrary precisions and numeric data types, which is critical in fast-evolving fields like deep learning. In this work, we explore minifloat (floating point representations with non-standard exponent and mantissa sizes) implementations on the FPGA, and show how we use a block floating point implementation that shares the exponent across many numbers to reduce the logic required to perform floating point operations. We show that using this technique we can significantly improve the performance of the FPGA with no impact on accuracy. Using this approach, we reduce logic utilization by 3x, and the memory bandwidth and capacity required by more than 40%.
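A minimal NumPy sketch of the shared-exponent idea: every element of a block is stored as a small integer mantissa against one common scale, so arithmetic inside the block reduces to integer logic. This illustrates block floating point in general, not Intel's actual FPGA design.

```python
import numpy as np

def bfp_quantize(block, man_bits=8):
    """Quantize a block to one shared exponent plus man_bits-bit signed mantissas."""
    max_exp = int(np.ceil(np.log2(np.max(np.abs(block)) + 1e-300)))
    scale = 2.0 ** (max_exp - (man_bits - 1))          # weight of one mantissa LSB
    lim = 1 << (man_bits - 1)
    mantissas = np.clip(np.round(block / scale), -lim, lim - 1).astype(np.int32)
    return mantissas, scale                            # value = mantissa * scale

m, s = bfp_quantize(np.array([0.11, -0.52, 0.37, 0.05]))
print(m * s)        # dequantized block, all four values sharing one exponent
```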
- Published
- 2018
10. Simple floating-point filters for the two-dimensional orientation problem
- Author
-
Takeshi Ogita, Siegfried M. Rump, Florian Bünger, Shin'ichi Oishi, and Katsuhisa Ozaki
- Subjects
Arithmetic underflow, Floating point, Computer networks and communications, Applied mathematics, Rounding, Binary scaling, Software engineering, Numerical and computational mathematics, Minifloat, Computational geometry, Machine epsilon, Computational mathematics, Filter, Mathematical software, Algorithm, Software, Mathematics - Abstract
This paper is concerned with floating-point filters for the two-dimensional orientation problem, a basic problem in the field of computational geometry. If this problem is solved only approximately, by floating-point arithmetic, an incorrect result may be obtained due to the accumulation of rounding errors. A floating-point filter can quickly guarantee the correctness of the computed result if the problem is well-conditioned. In this paper, a simple semi-static floating-point filter that handles floating-point exceptions such as overflow and underflow with only one branch is developed. In addition, an improved fully-static filter is developed.
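The filtering idea can be sketched in a few lines: evaluate the orientation determinant in ordinary doubles and trust its sign only when it clears an a priori rounding-error bound. The bound coefficient below is Shewchuk's static constant for binary64, used here as a stand-in; the exact-arithmetic fallback that a real filter escalates to is omitted.

```python
EPS = 2.0 ** -53                        # unit roundoff of binary64
ERRBOUND = (3.0 + 16.0 * EPS) * EPS     # Shewchuk's orient2d filter coefficient

def orient2d_filtered(ax, ay, bx, by, cx, cy):
    """+1/-1 when the floating-point sign is provably correct, None otherwise."""
    detleft = (ax - cx) * (by - cy)
    detright = (ay - cy) * (bx - cx)
    det = detleft - detright
    if abs(det) > ERRBOUND * (abs(detleft) + abs(detright)):
        return 1 if det > 0 else -1
    return None                         # too close to call: needs exact fallback

print(orient2d_filtered(0, 0, 1, 0, 0.5, 1e-12))   # +1: C lies above segment AB
```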
- Published
- 2015
11. Parallel Reproducible Summation
- Author
-
Hong Diep Nguyen and James Demmel
- Subjects
Kahan summation algorithm, Floating point, Computer science, Rounding, Double-precision floating-point format, Minifloat, FLOPS, Theoretical computer science, Computational theory and mathematics, Hardware and architecture, Round-off error, Pairwise summation, Massively parallel, Bitwise operation, Algorithm, Software - Abstract
Reproducibility, i.e. getting bitwise identical floating point results from multiple runs of the same program, is a property that many users depend on for debugging or correctness checking in many codes [10]. However, the combination of dynamic scheduling of parallel computing resources and floating point non-associativity makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating point summation that is reproducible independent of the order of summation. Our technique uses Rump's algorithm for error-free vector transformation [7], and is much more efficient than using (possibly very) high precision arithmetic. Our algorithm reproducibly computes highly accurate results with an absolute error bound of $n \cdot 2^{-28} \cdot \mathrm{macheps} \cdot \max_i |v_i|$ at a cost of $7n$ FLOPs and a small constant amount of extra memory. Higher accuracies are also possible by increasing the number of error-free transformations. As long as all operations are performed in round-to-nearest mode, results computed by the proposed algorithms are reproducible for any run on any platform. In particular, our algorithm requires the minimum number of reductions, i.e. one reduction of an array of six double precision floating point numbers per sum, and hence is well suited for massively parallel environments.
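Rump's error-free vector transformation is built from primitives such as Knuth's TwoSum, which recovers exactly what rounding discarded from one addition. A minimal sketch of that primitive (not the paper's full reproducible reduction):

```python
def two_sum(a, b):
    """Knuth's error-free transformation: a + b == s + e exactly, s = fl(a + b)."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

s, e = two_sum(1.0, 2.0 ** -60)
print(s, e)   # 1.0 and ~8.67e-19: the rounded-away part is recovered exactly
```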
- Published
- 2015
12. High-Precision Arithmetic in Mathematical Physics
- Author
-
David H. Bailey and Jonathan M. Borwein
- Subjects
Empirical data, Mathematics, Computation, Integer relation algorithm, Minifloat, Poisson equation, Numerical integration, Ising integrals, PSLQ algorithm, Arbitrary-precision arithmetic, High-precision arithmetic, Arithmetic, Mathematical physics - Abstract
For many scientific calculations, particularly those involving empirical data, IEEE 32-bit floating-point arithmetic produces results of sufficient accuracy, while for other applications IEEE 64-bit floating-point is more appropriate. But for some very demanding applications, even higher levels of precision are often required. This article discusses the challenge of high-precision computation, in the context of mathematical physics, and highlights what facilities are required to support future computation, in light of emerging developments in computer architecture.
- Published
- 2015
13. Optimal approximation for efficient termination analysis of Floating-point Loops
- Author
-
Étienne Payet, Fonenantsoa Maurica, and Frédéric Mesnard
- Subjects
Rational number, Floating point, Linear approximation, Computer science, Rounding, Computation, Numerical analysis, Software engineering, Minifloat, Termination analysis, Linear ranking function, Mathematical software, Algorithm, Floating-point numbers, Software correctness - Abstract
Floating-point numbers are used in a wide variety of programs, from numerical analysis programs to control command programs. However, floating-point computations are affected by rounding errors that make them hard to verify efficiently. In this paper we address termination proving for an important class of programs that manipulate floating-point numbers: simple floating-point loops. Our main contribution is an optimal approximation to the rationals that allows us to efficiently analyze their termination.
- Published
- 2017
14. Better is the enemy of good: Unums — An alternative to IEEE 754 floats and doubles
- Author
-
Thomas Risse
- Subjects
Floating point, Exploit, Computer science, Technological change, Information technology, Minifloat, Supercomputer, IEEE floating point, Memory management, Computer engineering, Operating system - Abstract
Floats and doubles according to the 1985 IEEE 754 standard exhibit very strange behavior in some cases and are no longer compatible with today's technological constraints, especially in high performance computing (HPC). Hence Gustafson's Unums, a substitute for IEEE 754 floating point numbers, are a promising alternative: they promise an 'end of error', and do so by taking into consideration the technological change that has taken place since the 1980s. Of course these benefits do not come without cost: new floating point hardware is needed, and new numerical algorithms have to be invented to make full use of the potential of Unums. As an example, a zero-finding algorithm is proposed that exploits the ability of today's processors to execute several floating point operations in parallel.
- Published
- 2017
15. Finite Word Length Effects
- Author
-
Winser E. Alexander and Cranos M. Williams
- Subjects
Two's complement, Floating point, Rounding, Multiplication, Minifloat, Arithmetic, Algorithm, IEEE floating point, Word (computer architecture), Mathematics - Abstract
Chapter 6 covers methods used to represent numbers and the impact of finite precision arithmetic on the implementation of discrete time systems. It discusses the representation of numbers using the IEEE floating point representation, computational errors due to rounding, and the multiplication of numbers represented using floating point. It covers the analytical basis for the two's complement representation of numbers and computational procedures for numbers represented in two's complement. It covers the scaling of coefficients for discrete time systems for given word sizes, with a restriction to avoid overflow during computations. It also presents a concept for statistical analysis of rounding errors due to word length effects.
- Published
- 2017
16. An Efficient Implementation of Double Precision Floating Point Multiplier Using Booth Algorithm
- Author
-
N. N. Mhala, P. R. Lakhe, and Pallavi Ramteke
- Subjects
Floating point, Computer science, Decimal floating point, Double-precision floating-point format, Minifloat, Extended precision, IEEE floating point, Single-precision floating-point format, Computer hardware, Half-precision floating-point format - Abstract
Floating-point numbers are widely adopted in many applications due to their dynamic representation capabilities; they are one possible way of representing real numbers in binary format. Multiplying floating point numbers is a critical requirement for DSP applications involving a large dynamic range. The IEEE 754 standard presents two different floating point formats, the binary interchange format and the decimal interchange format. This paper presents a floating point multiplier that supports the IEEE 754 binary interchange format, focusing on a double precision floating point multiplier based on the Booth algorithm. The main objective is to reduce power consumption and increase execution speed by implementing a suitable algorithm for multiplying two floating point numbers. The design is written in VHDL and targeted at a Xilinx Virtex-5 FPGA; the implementation's trade-offs are area, speed and power. A shift-and-add multiplier is compared with a radix-4 Booth multiplier. The multiplier also handles overflow and underflow cases, and normalization is applied for high accuracy of the results.
- Published
- 2014
17. Practically Accurate Floating-Point Math
- Author
-
Neil Toronto and Jay McCarthy
- Subjects
IEEE 754-1985, Floating point, General computer science, Computer science, Decimal floating point, General engineering, Binary scaling, Floating-point unit, Double-precision floating-point format, Parallel computing, Minifloat, Single-precision floating-point format, Computational science, IBM Floating Point Architecture, NaN, Fixed-point arithmetic, x87, Half-precision floating-point format - Abstract
With the right tools, floating-point code can be debugged like any other code, drastically improving its accuracy and reliability.
- Published
- 2014
18. Comparative study on performance of single precision floating point multiplier using vedic multiplier and different types of adders
- Author
-
K V Gowreesrinivas and P. Samundiswary
- Subjects
Floating point, Sign bit, Floating-point unit, Double-precision floating-point format, Parallel computing, Minifloat, Single-precision floating-point format, Significand, Multiplication, Arithmetic, Mathematics - Abstract
Floating-point arithmetic plays a major role in computer systems. Many digital signal processing applications use floating-point algorithms for their computations, and every operating system must handle floating-point special cases like underflow and overflow. The single precision floating point arithmetic operations are multiplication, division, addition and subtraction; among these, multiplication is the most extensively used and involves composite arithmetic functions. A single precision (32-bit) floating point number splits into three parts: a sign part, an exponent part and a mantissa part. The most significant bit is the 1-bit sign, the next 8 bits represent the exponent, and the remaining 23 bits represent the mantissa. The mantissa part needs a large 24-bit multiplication. The performance of a single-precision floating point multiplier mostly depends on its occupied area and delay. In this paper, a novel approach for a single-precision floating point multiplier is developed using the Urdhva Tiryagbhyam technique and different adders to decrease the complexity of the mantissa multiplication. This requires less multiplication hardware than conventional multipliers; regular adders such as carry-select, carry-skip and parallel prefix adders are used for exponent addition. The performance parameters are compared in terms of area and delay. All modules are coded in Verilog HDL and simulated with the Xilinx ISE tool.
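For intuition, here is the Urdhva Tiryagbhyam ("vertically and crosswise") pattern in base 10: every result column gathers all digit cross products at once, and carries are propagated in a single pass. Hardware versions apply the same pattern to binary digits with adder trees; this toy is ours, not the paper's Verilog.

```python
def urdhva_multiply(a_digits, b_digits, base=10):
    """Multiply two little-endian digit lists column-by-column, then carry once."""
    cols = [0] * (len(a_digits) + len(b_digits))
    for i, da in enumerate(a_digits):
        for j, db in enumerate(b_digits):
            cols[i + j] += da * db            # crosswise partial products
    carry = 0
    for k in range(len(cols)):                # single carry-propagation pass
        carry, cols[k] = divmod(cols[k] + carry, base)
    return cols

print(urdhva_multiply([3, 2, 1], [6, 5, 4]))  # 123 * 456 -> [8, 8, 0, 6, 5, 0]
```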
- Published
- 2016
19. Floating point unit core for Signal Processing applications
- Author
-
Muktai N. Joshi and Dhanashri H. Gawali
- Subjects
Distributed computing, decimal32 floating-point format, Computer science, Decimal floating point, Floating-point unit, Double-precision floating-point format, Minifloat, Arithmetic logic unit, Execution unit, decimal64 floating-point format, Computer hardware - Abstract
Traditional computer data processing is limited by data input, output, storage and display, and further computing needs repeated binary-decimal conversions. With the expansion of data-intensive and distributed computing, decimal computing on mass data is widely applied in banking, finance, signal processing, biomedicine, astronomy, geography, data acquisition, image compression and other fields, and an independent decimal floating point unit is becoming important in these areas. A floating point unit is the part of a computer system specially designed to carry out operations on floating point numbers; in various systems it has been implemented as a coprocessor rather than as an integrated unit. Floating point arithmetic operations are very important in the design of digital signal processing and application-specific systems. Fixed-point arithmetic logic is faster and more area efficient, but it is sometimes desirable to implement calculations using floating-point numbers. In most digital signal processing applications, addition and multiplication are performed frequently. This paper presents a review of floating point units for signal processing applications with a faster rate of operations.
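The binary-decimal conversion problem that motivates a decimal FPU is easy to demonstrate with Python's decimal module standing in for decimal hardware (16 digits roughly matches the decimal64 format):

```python
from decimal import Decimal, getcontext

getcontext().prec = 16                       # about decimal64's 16 digits
print(0.1 + 0.1 + 0.1 == 0.3)                # False: binary cannot hold 0.1
print(Decimal("0.1") * 3 == Decimal("0.3"))  # True: decimal arithmetic can
```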
- Published
- 2016
20. Implementation of an open core IEEE 754-based FPU with non-linear arithmetic support
- Author
-
Adrian Cervantes, Francis Lopez, Diego Rodriguez, Alfonso Chacon-Rodriguez, Carlos Salazar-Garcia, Jeffry Quiros, and Carlos Meza
- Subjects
Floating point, Clock rate, Parallel computing, Minifloat, IEEE floating point, Dynamic demand, Verilog, CORDIC, Arithmetic, Block (data storage), Mathematics - Abstract
FPGA implementation results of an open core IEEE 754-based FPU with non-linear arithmetic support are shown. Non-linear operations are implemented using variations of the CORDIC algorithm and are tested on a commercial FPGA. The unit provides results in both 32-bit and 64-bit FPU formats, with error bounded to 0.81501% for the cosine operation, 0.91367% for the sine operation, and 0.129% for the natural logarithm operation, using sixteen iterations in all cases and a 64-bit floating point representation. Dynamic power is under 11 mW for each non-linear operational block at a 100 MHz clock speed.
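For intuition, here is a toy rotation-mode circular CORDIC in floating point with the same sixteen iterations the unit was tested at. The real design works in fixed point with shifts and adds, and this sketch omits argument reduction (it is valid for |theta| up to about 1.74 rad).

```python
import math

def cordic_sincos(theta, iterations=16):
    """Rotate (1, 0) towards theta by micro-rotations of atan(2**-i)."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))   # total stretching factor
    x, y, z = 1.0, 0.0, theta
    for i, a in enumerate(angles):
        d = 1.0 if z >= 0 else -1.0                # steer the residual angle to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * a
    return y / gain, x / gain                      # (sin, cos)

s, c = cordic_sincos(0.5)
print(s - math.sin(0.5), c - math.cos(0.5))        # errors below ~3e-5
```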
- Published
- 2016
21. Design and Implementation of Complex Floating Point Processor Using FPGA
- Author
-
Rambabu Ch., Murali Krishna Pavuluri, and Krishna Prasad T.S.R.
- Subjects
Arithmetic logic unit, Computer science, Arbitrary-precision arithmetic, Saturation arithmetic, Floating-point unit, Double-precision floating-point format, Parallel computing, Minifloat, Arithmetic, Fixed-point arithmetic, IEEE floating point - Abstract
This paper presents complete processor hardware with three arithmetic units. The first arithmetic unit performs 32-bit integer arithmetic operations. The second performs addition, subtraction, multiplication, division and square root on 32-bit floating point numbers. The third performs addition, subtraction and multiplication on complex numbers. The specific advancement in this processor is the new architecture introduced for the complex arithmetic unit. In general, complex floating point arithmetic hardware includes floating-to-fixed and fixed-to-floating conversions, but such hardware forces a compromise between accuracy and the number of bits used to represent the fixed point equivalent of floating point numbers. The proposed architecture avoids that compromise and is implemented with fewer look-up tables, saving around 5,500 logic gates. Complex numbers are represented using a subset of the IEEE 754 standard floating point format, with 16 bits for the real part and 16 bits for the imaginary part. The floating point arithmetic unit works on 32-bit IEEE 754 single precision numbers. The instruction set is specially designed to support integer, floating point and complex floating point arithmetic operations. The on-chip RAM is 8 KB, extendable up to 64 KB. As the processor is designed for FPGA implementation, the embedded block RAMs are utilized as RAM.
- Published
- 2013
22. Floating-Point Exponentiation Units for Reconfigurable Computing
- Author
-
Florent de Dinechin, Marisa Lopez-Vallejo, Bogdan Pasca, and P. Echeverria
- Subjects
Exponentiation, Floating point, General computer science, Logarithm, Computer science, Exponentiation unit, Parallel computing, Minifloat, Computational science, Computer arithmetic, Telecommunications, Reconfigurable computing, Exponential function, Significand, Power function, Electronics, Multiplier - Abstract
The high performance and capacity of current FPGAs make them suitable as acceleration co-processors. This article studies the implementation, for such accelerators, of the floating-point power function x^y as defined by the C99 and IEEE 754-2008 standards, generalized here to arbitrary exponent and mantissa sizes. Last-bit accuracy at the smallest possible cost is obtained thanks to a careful study of the various subcomponents: a floating-point logarithm, a modified floating-point exponential, and a truncated floating-point multiplier. A parameterized architecture generator in the open-source FloPoCo project is presented in detail and evaluated.
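The operator is built around the classical decomposition x^y = exp(y * log x). A naive double-precision rendering shows the structure, and also why the paper needs an extended-precision logarithm: any rounding error in y * log x is amplified by the exponential, so last-bit accuracy is unreachable this way.

```python
import math

def pow_via_explog(x, y):
    """x**y as exp(y * log(x)); accurate only to a few ulps in double precision."""
    return math.exp(y * math.log(x))

x, y = 1.0000001, 1.0e7
print(pow_via_explog(x, y))   # ~2.71828..., close to e
print(x ** y)                 # the library pow; last digits may differ
```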
- Published
- 2013
23. Checking Compatibility of Bit Sizes in Floating Point Comparison Operations
- Author
-
Manuel Fähndrich and Francesco Logozzo
- Subjects
Floating point, General computer science, Computer science, Abstract interpretation, Static analysis, Minifloat, Theoretical computer science, .NET, Compatibility, Numerical abstract domains, Algorithm, Design by contracts - Abstract
We motivate, define and design a simple static analysis to check that comparisons of floating point values use compatible bit widths and thus compatible precision ranges. Precision mismatches arise due to the difference in bit widths of processor-internal floating point registers (typically 80 or 64 bits) and their corresponding widths when stored in memory (64 or 32 bits). The analysis guarantees that floating point values from memory (i.e. array elements, instance and static fields) are not compared against floating point numbers in registers (i.e. arguments or locals). Without such an analysis, static symbolic verification is unsound and hence may report false negatives. The static analysis is fully implemented in Clousot, our static contract checker based on abstract interpretation.
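The 80- versus 64-bit register/memory mismatch the analysis targets is hard to reproduce portably, but the same phenomenon shows up between any two widths, for example 32 and 64 bits in NumPy:

```python
import numpy as np

a32 = np.float32(0.1)          # 0.1 rounded to 24 significant bits
print(a32 == 0.1)              # False: the comparison promotes to 64 bits
print(float(a32))              # 0.10000000149011612
```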
- Published
- 2012
24. Computing floating-point logarithms with fixed-point operations
- Author
-
Florent de Dinechin, Nicolas Brunie, Jean-Michel Muller, and Julien Le Maire
- Subjects
Logarithm, Computer arithmetic, Floating-point unit, Double-precision floating-point format, Correct rounding, Minifloat, Binary logarithm, Single-precision floating-point format, Arbitrary-precision arithmetic, Elementary function, Fixed-point arithmetic, Floating-point, Arithmetic, Algorithm, Mathematics - Abstract
Elementary functions from the mathematical library input and output floating-point numbers, yet it is possible to implement them purely with integer/fixed-point arithmetic. This option was not attractive between 1985 and 2005, because mainstream processor hardware supported 64-bit floating-point but only 32-bit integers, and conversions between floating-point and integer were costly. This has changed in recent years, in particular with the generalization of native 64-bit integer support. The purpose of this article is therefore to reevaluate the relevance of computing floating-point functions in fixed-point. For this, several variants of the double-precision logarithm function are implemented and evaluated. Formulating the problem as a fixed-point one is easy after the range has been (classically) reduced. Then, 64-bit integers provide slightly more accuracy than the 53-bit mantissa, which helps speed up the evaluation. Finally, multi-word arithmetic, critical for accurate implementations, is much faster in fixed-point and natively supported by recent compilers. Novel techniques for argument reduction and the rounding test are introduced in this context. Thanks to all this, a purely integer implementation of the correctly rounded double-precision logarithm outperforms the previous state of the art, with the worst-case execution time reduced by a factor of 5. This work also introduces variants of the logarithm that input a floating-point number and output the result in fixed-point; these are shown to be both more accurate and more efficient than the traditional floating-point functions for some applications.
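As a toy version of the idea, the sketch below evaluates ln(x) with the reduced part computed entirely in Q62 integer arithmetic: reduce x = m * 2^e with m near 1, then sum the Mercator series for ln(1 + z) using only integer multiplies and shifts. This is our construction for illustration; the paper's implementation uses far more refined reduction and rounding tests.

```python
import math

LN2_Q62 = int(math.log(2.0) * 2 ** 62)        # ln 2 as a Q62 fixed-point integer

def log_fixed(x, terms=60):
    """ln(x) for x > 0, core series evaluated in Q62 integer arithmetic."""
    m, e = math.frexp(x)                      # x = m * 2**e, 0.5 <= m < 1
    if m < math.sqrt(0.5):                    # keep m in [sqrt(1/2), sqrt(2))
        m, e = 2.0 * m, e - 1
    z = int((m - 1.0) * 2 ** 62)              # |z| < 0.415, in Q62
    acc, zk = 0, z
    for k in range(1, terms + 1):             # ln(1+z) = z - z^2/2 + z^3/3 - ...
        acc += zk // k if k % 2 else -zk // k
        zk = (zk * z) >> 62                   # next power of z, still Q62
    return (e * LN2_Q62 + acc) / 2 ** 62

print(log_fixed(123.456), math.log(123.456))  # agree to about 15 digits
```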
- Published
- 2016
25. Generation of floating point 2D scaling operators for FPGA
- Author
-
Mircea Popa and Ovidiu Sicoe
- Subjects
Floating point, Generator (computer programming), Unit testing, Computer science, Double-precision floating-point format, Parallel computing, Minifloat, Computational science, VHDL, Code generation, Field-programmable gate array - Abstract
This paper presents several architectures for an FPGA implementation of a matrix operator for geometric two-dimensional scaling using floating point numbers. We have generated synthesizable VHDL implementations of the proposed architectures for several floating point precisions: half, single and double. Besides the precision of the floating point operators, the parallelization degree of the internal processing units and the targeted overall frequency can be configured in the generator. Using a generator was also helpful for producing many unit tests for the generated operators, making them easy to validate. Additionally, creating all the unit tests at the beginning of the development process allowed a test-driven approach to building the code generator.
- Published
- 2016
26. Comparison between binary and decimal floating-point numbers
- Author
-
Jean-Michel Muller, Marc Mezzarobba, Nicolas Brisebarre, and Christoph Lauter
- Subjects
Floating point, Computer science, Decimal floating point, Binary scaling, Binary number, Double-precision floating-point format, Minifloat, Decimal, Single-precision floating-point format, Theoretical computer science, Machine epsilon, Arithmetic logic unit, NaN, Arbitrary-precision arithmetic, Saturation arithmetic, Arithmetic, decimal32 floating-point format, decimal64 floating-point format, decimal128 floating-point format, Fixed-point arithmetic, IEEE 754-1985, Decimal data type, Binary Integer Decimal, Computational theory and mathematics, Hardware and architecture, Algorithm, Software - Abstract
We introduce an algorithm to compare a binary floating-point (FP) number and a decimal FP number, assuming the "binary encoding" of the decimal formats is used, and with a special emphasis on the basic interchange formats specified by the IEEE 754-2008 standard for FP arithmetic. It is a two-step algorithm: a first pass, based on the exponents only, quickly eliminates most cases; then, when the first pass does not suffice, a more accurate second pass is performed. We provide an implementation of several variants of our algorithm and compare them.
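For a slow but exact software reference point: a binary64 value and a decimal value are both rational numbers, so lifting them to fractions gives the same answer that the paper's fast two-step exponent test is designed to produce without this cost.

```python
from decimal import Decimal
from fractions import Fraction

def compare_binary_decimal(b, d):
    """Exact three-way comparison of a float and a Decimal via rationals."""
    fb, fd = Fraction(b), Fraction(d)
    return (fb > fd) - (fb < fd)            # -1, 0 or +1

print(compare_binary_decimal(0.1, Decimal("0.1")))   # +1: fl(0.1) > 0.1
```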
- Published
- 2016
27. Exploiting Structure in Floating-Point Arithmetic
- Author
-
Claude-Pierre Jeannerod
- Subjects
Floating point, Floating-point arithmetic, Binary scaling, Numerical analysis, Minifloat, Rounding error analysis, Machine epsilon, Arbitrary-precision arithmetic, IEEE standard 754-2008, Saturation arithmetic, High relative accuracy, Arithmetic, Round-off error, Algorithm, Affine arithmetic, Mathematics - Abstract
Invited paper at MACIS 2015 (Sixth International Conference on Mathematical Aspects of Computer and Information Sciences). The analysis of algorithms in IEEE floating-point arithmetic is most often carried out via repeated applications of the so-called standard model, which bounds the relative error of each basic operation by a common epsilon depending only on the format. While this approach has been eminently useful for establishing many accuracy and stability results, it fails to capture most of the low-level features that make floating-point arithmetic so highly structured. In this paper, we survey some of those properties and how to exploit them in rounding error analysis. In particular, we review some recent improvements of several classical, Wilkinson-style error bounds from linear algebra and complex arithmetic that all rely on such structure properties.
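For reference, the "standard model" the abstract invokes is usually written as follows, where u is the unit roundoff (u = 2^-53 for binary64 with rounding to nearest):

```latex
\[
  \mathrm{fl}(a \circ b) \;=\; (a \circ b)\,(1 + \delta),
  \qquad |\delta| \le u,
  \qquad \circ \in \{+,\, -,\, \times,\, /\}.
\]
```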
- Published
- 2016
28. Exploiting binary floating-point representations for constraint propagation
- Author
-
Roberto Bagnara, Matthieu Carlier, Roberta Gori, and Arnaud Gotlieb
- Subjects
Theoretical computer science, Floating point, Test data generation, Computer science, Rounding, General engineering, Binary scaling, Minifloat, Computer engineering, Local consistency, Software verification, Test data - Abstract
Floating-point computations are quickly finding their way into the design of safety- and mission-critical systems, despite the fact that designing floating-point algorithms is significantly more difficult than designing integer algorithms. For this reason, verification and validation of floating-point computations are hot research topics. An important verification technique, especially in some industrial sectors, is testing. However, generating test data for floating-point-intensive programs has proved to be a challenging problem. Existing approaches usually resort to random or search-based test data generation, but without symbolic reasoning it is almost impossible to generate test inputs that execute complex paths controlled by floating-point computations. Moreover, because constraint solvers over the reals or the rationals do not natively support the handling of rounding errors, the need arises for efficient constraint solvers over floating-point domains. In this paper, we present and fully justify improved algorithms for the propagation of arithmetic IEEE 754 binary floating-point constraints. The key point of these algorithms is a generalization of an idea by B. Marre and C. Michel that exploits a property of the representation of floating-point numbers.
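Constraint propagation over floating-point domains needs outward-rounded interval reasoning. Portable Python cannot switch the hardware rounding mode, so the conservative sketch below widens each endpoint by one ulp with math.nextafter (Python 3.9+); it shows the flavour of a sound propagation step, not the paper's algorithms.

```python
import math

def interval_add(a, b):
    """Enclosure of {x + y : x in a, y in b} over binary64, endpoints widened."""
    lo = math.nextafter(a[0] + b[0], -math.inf)
    hi = math.nextafter(a[1] + b[1], math.inf)
    return lo, hi

print(interval_add((0.1, 0.1), (0.2, 0.2)))   # encloses the real number 0.3
```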
- Published
- 2016
29. Accurate summation, dot product and polynomial evaluation in complex floating point arithmetic
- Author
-
Stef Graillat and Valérie Ménissier-Morain
- Subjects
Kahan summation algorithm, Polynomial, Floating point, Accurate summation, Error-free transformations, Minifloat, Accurate dot product, High precision, Horner's scheme, Dot product, Extended precision, Multiplication, Pairwise summation, Complex floating point arithmetic, Accurate polynomial evaluation, Information systems - Abstract
Several different techniques and software packages intend to improve the accuracy of results computed in a fixed finite precision. Here we focus on methods to improve the accuracy of summation, dot product and polynomial evaluation. Such algorithms exist for real floating-point numbers; in this paper, we provide new algorithms that deal with complex floating-point numbers. We show that the computed results are as accurate as if computed in twice the working precision. The algorithms are simple, since they only require addition, subtraction and multiplication of floating-point numbers in the same working precision as the given data.
- Published
- 2012
30. An FPGA implementation of high speed and area efficient double-precision floating point multiplier using Urdhva Tiryagbhyam technique
- Author
-
M. Kamaraju, Y. Srinivasa Rao, and D V S Ramanjaneyulu
- Subjects
IEEE 754-1985, Multiplication algorithm, Computer science, Arbitrary-precision arithmetic, Floating-point unit, Saturation arithmetic, Double-precision floating-point format, Minifloat, Arithmetic, Fixed-point arithmetic - Abstract
Floating-point arithmetic is ever-present in computer systems, and almost all computer languages support floating-point number types. Compilers call upon floating-point algorithms to execute floating-point arithmetic operations, and every operating system must react to floating-point exceptions like underflow and overflow. Double-precision floating-point arithmetic is mainly used in digital signal processing (filters, FFTs), numerical applications and scientific applications. The double-precision arithmetic operations are addition, subtraction, multiplication and division; among them, multiplication is the most widely used and the most complex. A double-precision (64-bit) floating point number is divided into three fields: a 1-bit sign (the most significant bit), an 11-bit exponent, and a 52-bit mantissa. The double-precision floating-point multiplier therefore requires a large 52x52 mantissa multiplication. The performance of double-precision floating point multiplication depends mainly on area and speed. The proposed work presents a novel approach to reducing this huge mantissa multiplication. The Urdhva Tiryagbhyam technique permits the use of less multiplication hardware than the conventional method. In the traditional method, the partial products are added separately, which takes more time; in the proposed method, the partial products are added concurrently with the multiplication operation, reducing the delay. The double-precision floating multiplier is implemented in Verilog HDL with Xilinx ISE tools on a Virtex-5 FPGA.
- Published
- 2015
31. Are IEEE 754 32-Bit and 64-Bit Binary Floating-Point Accurate Enough?
- Author
-
Mauridhi Hery Purnomo, Bernaridho Hutabarat, Mochamad Hariadi, and I Ketut Eddy Purnama
- Subjects
IEEE 754-1985, Theoretical computer science, Floating point, Computer science, Decimal floating point, Binary number, Double-precision floating-point format, Minifloat, Accuracy, Single-precision floating-point format, IEEE floating point, Arithmetic - Abstract
This paper describes research into the accuracy of floating-point values and an effort to reveal their real accuracy. The methods used are assignment of values, assignment of the values of arithmetic expressions, and output of the values in a floating-point format that helps reveal the accuracy. The programming tools used are Visual C# 9, Visual C++ 9, Java 5, and Visual BASIC 9, running on top of Intel 80x86 hardware. The results show that 1×10^-x cannot be accurately represented, and the approximate accuracy ranges only from 7 to 16 decimal digits.
- Published
- 2011
32. Customizing floating-point units for FPGAs: Area-performance-standard trade-offs
- Author
-
Marisa Lopez-Vallejo and P. Echeverria
- Subjects
Adder, Floating point, Computer networks and communications, Computer science, Divider, Double-precision floating-point format, Minifloat, Square root, Subtractor, Field-programmable gate array, Power function, IEEE 754-1985, Floating-point unit, Hardware and architecture, Electronics, Multiplier, Software, Computer hardware - Abstract
The high integration density of current nanometer technologies allows the implementation of complex floating-point applications in a single FPGA. In this work, the intrinsic complexity of floating-point operators is addressed targeting configurable devices, with design decisions providing the most suitable trade-offs between performance and standard compliance. A set of floating-point libraries composed of adder/subtracter, multiplier, divider, square root, exponential, logarithm and power function units is presented. Each library has been designed taking into account special characteristics of current FPGAs; with this purpose we have adapted the (software-oriented) IEEE floating-point standard to a custom FPGA-oriented format. Extended experimental results validate the design decisions made and prove the usefulness of reducing the format complexity.
- Published
- 2011
33. On Floating-Point Normal Vectors
- Author
-
Gerd Sußner, Quirin Meyer, Günther Greiner, Marc Stamminger, and Jochen Süßmuth
- Subjects
Floating point, Discretization, Octahedron, Computer science, Minifloat, Representation (mathematics), Discretization error, Computer graphics and computer-aided design, Normal, Algorithm, Single-precision floating-point format - Abstract
In this paper we analyze normal vector representations. We derive the error of the most widely used representation, namely 3D floating-point normal vectors. Based on this analysis, we show that, in theory, the discretization error inherent to single precision floating-point normals can be achieved by 2^50.2 uniformly distributed normals, addressable by 51 bits. We review common sphere parameterizations and show that octahedron normal vectors perform best: they are fast and stable to compute, have a controllable error, and require only 1 bit more than the theoretically optimal discretization with the same error.
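The octahedron mapping folds the unit sphere onto a square: project onto the octahedron |x| + |y| + |z| = 1, then unfold the lower half onto the outer corners. A NumPy sketch of the usual encode/decode pair:

```python
import numpy as np

def oct_encode(n):
    """Unit vector -> two coordinates in [-1, 1]."""
    p = n[:2] / np.sum(np.abs(n))                   # project onto the octahedron
    if n[2] < 0:                                    # unfold the lower pyramid
        p = (1.0 - np.abs(p[::-1])) * np.where(p >= 0, 1.0, -1.0)
    return p

def oct_decode(p):
    """Two coordinates in [-1, 1] -> unit vector."""
    v = np.array([p[0], p[1], 1.0 - abs(p[0]) - abs(p[1])])
    if v[2] < 0:                                    # refold the lower pyramid
        v[:2] = (1.0 - np.abs(v[1::-1])) * np.where(v[:2] >= 0, 1.0, -1.0)
    return v / np.linalg.norm(v)

n = np.array([0.6, 0.48, 0.64])                     # a unit vector
print(oct_decode(oct_encode(n)))                    # recovers ~[0.6, 0.48, 0.64]
```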
- Published
- 2010
34. Algorithms for accurate, validated and fast polynomial evaluation
- Author
-
Nicolas Louvet, Stef Graillat, and Philippe Langlois
- Subjects
Polynomial ,Compensated algorithms ,Floating point ,floating-point arithmetic ,Applied Mathematics ,Computation ,ACM G.4 ,General Engineering ,010103 numerical & computational mathematics ,Minifloat ,01 natural sciences ,IEEE floating point ,IEEE-754 floating point arithmetic ,010101 applied mathematics ,Significand ,Transformation (function) ,polynomial evaluation ,IEEE-754 ,0101 mathematics ,Algorithm ,compensated algorithm ,Accurate polynomial evaluation ,[INFO.INFO-MS]Computer Science [cs]/Mathematical Software [cs.MS] ,Integer (computer science) ,Mathematics - Abstract
We survey a class of algorithms for evaluating polynomials with floating-point coefficients where the computation is performed in IEEE-754 floating-point arithmetic. The principle is to apply, once or recursively, an error-free transformation of the polynomial evaluation with the Horner algorithm and to accurately sum the final decomposition. These compensated algorithms are as accurate as the Horner algorithm performed in K times the working precision, for K an arbitrary integer. We prove this accuracy property with an a priori error analysis. We also provide validated dynamic bounds and apply these results to compute a faithfully rounded evaluation. These compensated algorithms are fast. We illustrate their practical efficiency with numerical experiments in significant environments. Compared with existing alternatives, these K-times compensated algorithms are competitive for K up to 4, i.e., up to 212 mantissa bits.
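For the K = 2 case, the compensated Horner scheme can be sketched in a few lines of float64 Python: error-free transformations capture the exact rounding error of each addition (TwoSum) and multiplication (TwoProd, via Dekker splitting), and a second Horner recursion sums those errors. This is a minimal reconstruction for illustration, not the authors' reference code.

```python
def two_sum(a, b):
    """Error-free transformation: a + b = s + e exactly (Knuth)."""
    s = a + b
    t = s - a
    e = (a - t) + (b - (s - t))
    return s, e

def split(a, s=27):
    """Dekker splitting of a float64 into two non-overlapping halves."""
    c = (2.0 ** s + 1.0) * a
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free transformation: a * b = p + e exactly (Dekker)."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e

def comp_horner(coeffs, x):
    """Horner evaluation plus a first-order error compensation term.

    coeffs are ordered from the highest degree down to the constant.
    """
    s = coeffs[0]
    c = 0.0
    for a in coeffs[1:]:
        p, pi_ = two_prod(s, x)
        s, sigma = two_sum(p, a)
        c = c * x + (pi_ + sigma)   # Horner recursion on the rounding errors
    return s + c

# Ill-conditioned near x = 1: coefficients of (x - 1)^5.
print(comp_horner([1.0, -5.0, 10.0, -10.0, 5.0, -1.0], 0.999))  # ~ -1e-15
```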
- Published
- 2009
35. Standardization and testing of implementations of mathematical functions in floating point numbers
- Author
-
Victor V. Kuliamin
- Subjects
Development (topology) ,Floating point ,Standardization ,Computer science ,Arithmetic ,Minifloat ,Representation (mathematics) ,Requirements analysis ,Implementation ,Software ,IEEE floating point - Abstract
The definition of requirements and the development of test suites for implementations of mathematical functions in floating-point arithmetic within the framework of the IEEE 754 standard are considered. A method based on this standard is proposed for defining requirements for such functions; it can be used to standardize implementations of such functions, extending IEEE 754. A method for designing test suites for the verification of those requirements is also presented. The proposed methods are based on specific properties of the representation of floating-point numbers and on some features of the functions under examination.
- Published
- 2007
36. On Subnormal Floating Point and Abnormal Timing
- Author
-
Sorin Lerner, Keaton Mowery, Hovav Shacham, Marc Andrysco, Ranjit Jhala, and David Kohlbrenner
- Subjects
Floating point ,Computer science ,Benchmark (computing) ,Floating-point unit ,x86 ,Multiplication ,Double-precision floating-point format ,Parallel computing ,Fixed point ,Minifloat ,Arithmetic ,Fixed-point arithmetic ,Order of magnitude - Abstract
We identify a timing channel in the floating-point instructions of modern x86 processors: the running time of floating-point addition and multiplication instructions can vary by two orders of magnitude depending on their operands. We develop a benchmark measuring the timing variability of floating-point operations and report on its results. We use floating-point data timing variability to demonstrate practical attacks on the security of the Firefox browser (versions 23 through 27) and the Fuzz differentially private database. Finally, we initiate the study of mitigations to floating-point data timing channels with libfixedtimefixedpoint, a new fixed-point, constant-time math library. Modern floating-point standards and implementations are sophisticated, complex, and subtle, a fact that has not been sufficiently recognized by the security community. More work is needed to assess the implications of the use of floating-point instructions in security-relevant software.
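The operand-dependent slowdown is easy to observe from user space. The sketch below is not the paper's benchmark; it is a rough NumPy timing, under the assumption that the CPU is not running in flush-to-zero/denormals-are-zero mode, comparing multiplication over ordinary doubles against subnormal doubles. On many x86 cores the subnormal case is several times slower.

```python
import numpy as np
import timeit

n = 1_000_000
normal = np.full(n, 1.5)
subnormal = np.full(n, 5e-310)   # below the double-precision normal range

for name, a in (("normal", normal), ("subnormal", subnormal)):
    # Multiplying by 0.5 keeps subnormal inputs subnormal.
    secs = timeit.timeit(lambda a=a: a * 0.5, number=50)
    print(f"{name:9s}: {secs:.3f} s")
```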
- Published
- 2015
37. Design and implementation of Goldschmidt's algorithm for floating point division and square root
- Author
-
Sandeep Kakde, Rupali Bhoyar, and Prasanna Palsodkar
- Subjects
Quadruple-precision floating-point format ,Floating point ,Iterative method ,Computer science ,Division algorithm ,Binary scaling ,Floating-point unit ,Double-precision floating-point format ,Minifloat ,Single-precision floating-point format ,Extended precision ,IEEE floating point ,Fast inverse square root ,Machine epsilon ,Square root ,Arbitrary-precision arithmetic ,Saturation arithmetic ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Arithmetic ,Fixed-point arithmetic ,Algorithm - Abstract
Digital signal processing algorithms are often implemented in fixed-point arithmetic because of the expected area and power savings. However, recent research shows that floating-point arithmetic can be used instead by adopting a reduced-precision format in place of the standard IEEE floating-point format, which avoids the algorithm design and implementation difficulties that arise in fixed-point arithmetic. In this paper, simplified single-precision floating-point arithmetic is used to perform division and square root operations. Goldschmidt's algorithm is an iterative algorithm with a speed advantage over other iterative algorithms. Here, an FMA-based Goldschmidt's algorithm is used to perform division and square root.
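For division, Goldschmidt's iteration scales the numerator and denominator by the same factor until the denominator converges to 1, at which point the numerator holds the quotient; unlike Newton–Raphson division, the two multiplies per step are independent and pipeline well on FMA hardware. The sketch below is a minimal software model of the idea; the function name and the frexp-based pre-scaling are choices made here, not the paper's hardware design.

```python
import math

def goldschmidt_div(a, b, iterations=4):
    """Approximate a / b (b > 0) by Goldschmidt's iteration.

    N/D is kept equal to a/b while D is driven toward 1; D converges
    quadratically, so a handful of iterations suffice.
    """
    m, e = math.frexp(b)        # b = m * 2**e with m in [0.5, 1)
    N, D = a, m
    for _ in range(iterations):
        F = 2.0 - D             # current correction factor
        N, D = N * F, D * F     # two independent multiplies
    return math.ldexp(N, -e)    # undo the pre-scaling of b

print(goldschmidt_div(355.0, 113.0), 355.0 / 113.0)
```

Square root is handled by a similar coupled iteration, with the correction factor (3 − D)/2 in place of 2 − D.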
- Published
- 2015
38. Design and Development of Efficient Reversible Floating Point Arithmetic Unit
- Author
-
Jenil Jain and Rahul Agrawal
- Subjects
Arithmetic logic unit ,Logic synthesis ,Floating point ,Computer science ,Arbitrary-precision arithmetic ,Floating-point unit ,Saturation arithmetic ,Minifloat ,Arithmetic ,Fixed-point arithmetic ,Algorithm - Abstract
For the calculation or representation of very large or very small numbers, a large range is essential. Such values can be represented using IEEE 754 standard floating-point arithmetic. The paper presents an efficient approach to designing a high-speed floating-point unit using reversible logic. Programmable reversible logic design is emerging as a prospective logic design style for implementation in modern nanotechnology and quantum computing, with low impact on circuit heat generation. Various reversible implementations of logical and arithmetic units have been proposed in the existing research, but very few reversible floating-point designs have been presented, even though floating-point operations are used very frequently in nearly all computing disciplines. The proposed reversible circuit can be analyzed in terms of quantum cost, garbage outputs, constant inputs, power consumption, speed, and area.
- Published
- 2015
39. Comparison of Pipelined Floating Point Unit with Unpipelined Floating Point Unit
- Author
-
M. Venkateswarao, J. Triveni, K. Bhargavee Latha, K.R. Pavan, and K. Naga Deepika
- Subjects
Engineering ,business.industry ,Floating-point unit ,Double-precision floating-point format ,Minifloat ,Single-precision floating-point format ,IEEE floating point ,VHDL ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,business ,computer ,Computer hardware ,ModelSim ,Half-precision floating-point format ,computer.programming_language - Abstract
Floating-point numbers are broadly adopted in numerous applications because of their dynamic representation capabilities. Floating-point representation retains its resolution and accuracy compared with fixed-point representations. Many digital signal processing (DSP) algorithms use floating-point arithmetic, which requires millions of calculations per second. For such stringent requirements, the design of fast, accurate, and efficient circuits is the goal of every VLSI designer. This paper presents a comparison of a pipelined floating-point adder compliant with the IEEE 754 format against an unpipelined adder, also compliant with the IEEE 754 format, and describes the IEEE 754 floating-point standard. A pipelined floating-point unit based on the IEEE 754 format is developed, its design is compared with that of an unpipelined floating-point unit, and an analysis is performed for speed, area, and power considerations. Pipelining not only increases the speed but is also energy efficient; all of these improvements come at the cost of a slight increase in chip area. The basic methodology and approach used for the VHDL (VHSIC Hardware Description Language) implementation of the floating-point unit are also described. A detailed synthesis report produced with the Xilinx ISE 11 software and ModelSim is given.
- Published
- 2015
40. Floating Point Arithmetic
- Author
-
William Ford
- Subjects
Discrete mathematics ,Significand ,Floating point ,Sign bit ,Binary scaling ,Integer overflow ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Minifloat ,Arithmetic ,Fixed-point arithmetic ,Machine epsilon ,Mathematics - Abstract
The chapter discusses the handling of integer and floating-point numbers in a digital computer. Integers are stored using two's-complement notation with p bits. The positive integers have a zero in the left-most bit and the binary representation of the integer in the remaining p − 1 bits. The negative integers begin with −1 = 111…111 and end with 1000…000, so they all have a left-most bit of 1. The negative of an integer is computed using the formula 2comp(n) = 1comp(n) + 1, where 1comp(n) flips the bits. The range of the integer representation is −2^(p−1) ≤ n ≤ 2^(p−1) − 1. Subtraction of y from x is performed by executing x − y = x + 2comp(y). The sum of two positive or two negative integers can overflow, which is indicated when the sign bit is the opposite of what it should be. After discussing integers, the chapter presents floating-point arithmetic. The representation involves a sign bit, an exponent, and a mantissa (the significant digits). Since only a finite number of bits are used, usually 32 or 64, most floating-point numbers cannot be represented exactly, and this is the source of roundoff error, a serious problem. The finite representation also gives rise to floating-point overflow and underflow. There are only a finite number of floating-point numbers, so the representation is granular; the granularity depends on the machine constant eps. The chapter discusses floating-point arithmetic and provides error bounds for some floating-point operations. Errors are normally measured using relative error rather than absolute error. Truncation error is also discussed. The chapter concludes with a discussion of how to minimize certain types of floating-point error, in particular losing the effect of a smaller number when adding it to a much larger one, and cancellation error.
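These conventions are quick to check interactively. The snippet below is a small demonstration written for this summary, not the book's code: it prints p-bit two's-complement patterns, the representable range, and the float64 machine epsilon.

```python
import sys

P = 8                                    # word size in bits

def twos_complement(n, p=P):
    """p-bit pattern of n, following 2comp(n) = 1comp(n) + 1."""
    return format(n & (2 ** p - 1), f"0{p}b")

print(twos_complement(5))                # 00000101
print(twos_complement(-5))               # 11111011
print(twos_complement(-1))               # 11111111 (negatives start at -1)
print(-2 ** (P - 1), 2 ** (P - 1) - 1)   # representable range: -128 127

# Machine constant eps: the granularity of float64 near 1.0.
print(sys.float_info.epsilon)            # 2.220446049250313e-16
```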
- Published
- 2015
41. Improving Performance of Floating Point Division on GPU and MIC
- Author
-
Yifeng Chen and Kun Huang
- Subjects
Floating point ,Computer science ,Division algorithm ,Multiplication ,Double-precision floating-point format ,Parallel computing ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Division (mathematics) ,Minifloat ,IEEE floating point ,Half-precision floating-point format - Abstract
Floating-point computing ability is an important concern in high-performance scientific and engineering computing. Although it is a fundamental operation, floating-point division (or reciprocal) has long been much less efficient than addition and multiplication. Architectures like GPU and MIC do not even have an instruction for such division at the instruction level. This paper proposes a fast approximation algorithm to estimate the division of floating-point numbers in IEEE 754 format based on existing instructions, which in most cases is accurate enough for practical computing. It consists of a predicting step and an iterating step, like most iterative numerical algorithms. The predicting step makes use of the properties of the IEEE 754 format to calculate an estimate using only one integer subtraction instruction. The iterating step improves the accuracy by fast iterations in about ten instructions. The new algorithm is extremely easy to implement and shows great performance in practical experiments.
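The predicting step exploits the fact that a float's bit pattern, read as an integer, approximates a scaled and biased log2 of the value; subtracting the pattern from a constant therefore approximates negating the logarithm, i.e., taking a reciprocal. The sketch below reconstructs this idea for float32; the untuned constant 0x7F000000 and the Newton–Raphson refinement are standard textbook choices, not necessarily the paper's exact instruction sequence.

```python
import struct

def approx_recip(d, iterations=3):
    """Estimate 1/d for positive float32-range d.

    Predicting step: one integer subtraction on the IEEE 754 bit
    pattern. 0x7F000000 is the untuned constant; published variants
    tune the low bits for a better first guess.
    """
    (i,) = struct.unpack("<I", struct.pack("<f", d))
    (x,) = struct.unpack("<f", struct.pack("<I", 0x7F000000 - i))
    # Iterating step: Newton-Raphson, x <- x * (2 - d*x), roughly
    # doubles the number of correct bits per pass.
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x

print(approx_recip(3.0), 1.0 / 3.0)
```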
- Published
- 2015
42. What every agent-based modeller should know about floating point arithmetic
- Author
-
Gary Polhill, Nicholas Mark Gotts, and Luis Izquierdo
- Subjects
Environmental Engineering ,Theoretical computer science ,Floating point ,Computer science ,Ecological Modeling ,Arithmetic ,Minifloat ,Representation (mathematics) ,Implementation ,Software ,Simple (philosophy) ,Interval arithmetic - Abstract
Floating point arithmetic is a subject all too often ignored, yet, for agent-based models in particular, it has the potential to create misleading results, and even to influence emergent outcomes of the model. Using a simple demonstration model, this paper illustrates the problems that accumulated floating point errors can cause, and compares a number of techniques that might be used to address them. We show that inexact representation of parameter values, imprecision in calculation results, and differing implementations of mathematical expressions can significantly influence the behaviour of the model, and create issues for replicating results, though they do not necessarily do so. None of the techniques offer a failsafe approach that can be applied in any situation, though interval arithmetic is the most promising.
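Two of the effects listed here, inexact parameter representation and order-dependent arithmetic, take only a few lines to reproduce; the snippet below is a generic illustration, not the paper's demonstration model.

```python
# 0.1 has no exact binary representation, so repeated addition drifts:
total = 0.0
for _ in range(1000):
    total += 0.1
print(total == 100.0)               # False
print(repr(total))                  # 99.9999999999986

# The same model logic phrased differently can give different results:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False: addition is not associative
```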
- Published
- 2006
43. Double-residue modular range reduction for floating-point hardware implementations
- Author
-
Julio Villalba, M.A. Gonzalez, and T. Lang
- Subjects
Floating point ,business.industry ,Floating-point unit ,Double-precision floating-point format ,Modular design ,Minifloat ,Single-precision floating-point format ,Extended precision ,IEEE floating point ,Theoretical Computer Science ,Computational science ,Computational Theory and Mathematics ,Hardware and Architecture ,Arithmetic ,business ,Software ,Mathematics - Abstract
In this paper, we present a novel algorithm and the corresponding architecture for performing range reduction, a preprocessing task required for the evaluation of some elementary functions such as trigonometric and exponential-based functions. The proposed algorithm modifies the modular range reduction algorithm in a way that increases the speed of computation and allows us to design an architecture for the floating-point case. The implementation presented accepts as input any representable number of the standard single-precision IEEE 754 floating-point format and provides maximum accuracy in the final result. This provides a hardware solution to the problem of an input argument lying close to a multiple of the constant. A final comparison with other implementations is presented.
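For readers unfamiliar with range reduction, the standard software counterpart is the two-constant (Cody–Waite style) scheme sketched below. It is background for the problem the paper attacks in hardware, not the paper's double-residue algorithm: splitting pi/2 across two doubles keeps the subtraction accurate for moderate multiples k, and it is exactly the large-k and near-multiple cases where such schemes degrade.

```python
import math

# fdlibm's split of pi/2 across two doubles: PIO2_HI + PIO2_LO together
# carry more bits of pi/2 than a single double can.
PIO2_HI = 1.57079632673412561417e+00
PIO2_LO = 6.07710050650619224932e-11

def reduce_pio2(x):
    """Return (k, r) with x ~ k*(pi/2) + r and |r| <= pi/4."""
    k = round(x / (math.pi / 2.0))
    r = (x - k * PIO2_HI) - k * PIO2_LO   # subtract the constant in two parts
    return k, r

k, r = reduce_pio2(12.0)
print(k, r)          # 8, -0.566370614359...
```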
- Published
- 2006
44. On Roundoff Errors in Block-Floating-Point Arithmetic
- Author
-
Abhijit Mitra
- Subjects
Floating point ,Quantization (signal processing) ,Fixed point ,Minifloat ,External Data Representation ,Computer Science Applications ,Theoretical Computer Science ,Computer Science::Mathematical Software ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Electrical and Electronic Engineering ,Block floating-point ,Arithmetic ,Round-off error ,Fixed-point arithmetic ,Mathematics - Abstract
A special case of floating-point data representation is the block floating-point format, in which a block of operands is forced to share a joint exponent term. This paper deals with the finite-wordlength properties of this data format. The theoretical errors associated with the error model for the block floating-point quantization process are investigated with the help of error distribution functions. A fast and easy approximation formula for calculating the signal-to-noise ratio of quantization to block floating-point format is derived. This representation is found to be efficient compared with the fixed-point and floating-point formats.
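The shared-exponent quantization that the error model describes can be sketched directly: every value in the block is scaled by the exponent of the largest magnitude, so small members of a block with a large dynamic range lose relative precision. The code below is a generic illustration written for this entry; the mantissa width and rounding choice are assumptions.

```python
import numpy as np

def block_float_quantize(block, man_bits=12):
    """Quantize a block of values to a shared-exponent representation.

    All values reuse the exponent of the largest magnitude in the
    block; small values therefore lose relative precision, which is
    the error source the paper models.
    """
    block = np.asarray(block, dtype=np.float64)
    exp = np.frexp(np.max(np.abs(block)))[1]       # joint block exponent
    scale = 2.0 ** (exp - man_bits)
    mantissas = np.round(block / scale)            # shared-scale integers
    return mantissas * scale

x = np.array([1.0, 0.001, -0.75, 300.0])
print(block_float_quantize(x))                     # [1.0, 0.0, -0.75, 300.0]
```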
- Published
- 2006
45. Reconfigurable floating point adder
- Author
-
Vipin Gemini
- Subjects
decimal128 floating-point format ,decimal32 floating-point format ,Computer science ,Decimal floating point ,Decimal data type ,Floating-point unit ,Double-precision floating-point format ,Parallel computing ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Minifloat ,decimal64 floating-point format - Abstract
Decimal floating-point arithmetic is gaining importance because of its higher accuracy for financial, commercial, and web-based applications, while binary floating-point arithmetic is needed for scientific applications. Both kinds of applications are executed on general-purpose processors (GPPs). GPPs have separate hardware for decimal and binary floating-point operations and therefore need a large area for their implementation. In this paper, we present a runtime-reconfigurable floating-point adder that performs both decimal and binary floating-point addition on the same hardware. The proposed design is 24.53% more area-efficient and approximately 7.6% faster than previously reported designs; however, it is 6.3% slower for binary inputs.
- Published
- 2014
46. GENETIC ALGORITHMS, FLOATING POINT NUMBERS AND APPLICATIONS
- Author
-
Willi-Hans Steeb, Ruedi Stoop, and Yorick Hardy
- Subjects
Floating point ,General Physics and Astronomy ,Statistical and Nonlinear Physics ,Interval (mathematics) ,Minifloat ,Computer Science Applications ,Computational Theory and Mathematics ,Core (graph theory) ,Genetic algorithm ,Mutation (genetic algorithm) ,Bitwise operation ,Algorithm ,Mathematical Physics ,Linear equation ,Mathematics - Abstract
The core of most genetic algorithms is the bitwise manipulation of bit strings. We show that one can directly manipulate the bits in floating-point numbers. This means the main bitwise operations in genetic algorithms, mutation and crossover, are done directly inside the floating-point number. Thus the interval under consideration does not need to be known in advance. As applications, we consider finding the roots of polynomials and solutions of linear equations.
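Operating on the bits of an IEEE 754 double takes only a few lines of struct packing in software; the sketch below shows mutation and one-point crossover in that spirit. It is an independent illustration, not the authors' operators: mutation is confined to mantissa bits here because flipping sign or exponent bits can produce NaNs, infinities, or wild magnitude jumps that a real implementation must handle.

```python
import random
import struct

def mutate_float(x, n_flips=1):
    """Flip random bits directly inside the IEEE 754 double for x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    for _ in range(n_flips):
        bits ^= 1 << random.randrange(52)   # flip a mantissa bit only
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def crossover_floats(a, b):
    """One-point crossover on the 64-bit patterns of two doubles."""
    (ba,) = struct.unpack("<Q", struct.pack("<d", a))
    (bb,) = struct.unpack("<Q", struct.pack("<d", b))
    point = random.randrange(1, 64)
    mask = (1 << point) - 1                 # low bits from b, high bits from a
    child = (ba & ~mask) | (bb & mask)
    return struct.unpack("<d", struct.pack("<Q", child))[0]

random.seed(1)
print(mutate_float(3.14159), crossover_floats(1.5, 2.75))
```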
- Published
- 2005
47. FPU Implementations with Denormalized Numbers
- Author
-
Eric M. Schwarz, Son Dao Trong, and Martin S. Schmookler
- Subjects
Floating point ,Computational Theory and Mathematics ,Denormal number ,Hardware and Architecture ,Computer science ,Double-precision floating-point format ,Parallel computing ,Minifloat ,Implementation ,Software ,Theoretical Computer Science - Abstract
Denormalized numbers are the most difficult type of numbers to implement in floating-point units. They are so complex that certain designs have elected to handle them in software rather than in hardware. Traps to software can result in long execution times, which renders denormalized numbers useless to programmers. This does not have to happen. With a small amount of additional hardware, denormalized numbers and underflows can be handled close to the speed of normalized numbers. This paper summarizes the little known techniques for handling denormalized numbers. Most of the techniques described here only appear in filed or pending patent applications.
- Published
- 2005
48. Computer arithmetic and sensitivity of natural measure
- Author
-
Timothy Sauer
- Subjects
Algebra and Number Theory ,Dynamical systems theory ,Applied Mathematics ,Rounding ,Chaotic ,Binary scaling ,Double-precision floating-point format ,Minifloat ,Machine epsilon ,Arbitrary-precision arithmetic ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Arithmetic ,Analysis ,Mathematics - Abstract
In computer simulations of deterministic dynamical systems, floating-point rounding errors and other truncation errors contaminate the simulation results. We investigate the effect of computations using IEEE standard double-precision arithmetic on the inference of the natural measure of chaotic attractors.
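A quick way to see why the question matters: run the same chaotic map at two precisions. The snippet below, a generic illustration rather than the paper's experiment, iterates the logistic map from the same seed in float32 and float64; the trajectories decorrelate within a few dozen steps, which is exactly why one must ask whether statistical quantities such as the natural measure survive rounding.

```python
import numpy as np

# The logistic map x -> 4x(1-x) is chaotic: float32 and float64
# trajectories from the same seed separate after a few dozen steps.
x32 = np.float32(0.2)
x64 = np.float64(0.2)
for _ in range(60):
    x32 = np.float32(4.0) * x32 * (np.float32(1.0) - x32)
    x64 = 4.0 * x64 * (1.0 - x64)
print(float(x32), float(x64))   # essentially unrelated values by step 60
```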
- Published
- 2005
49. Fast and Accurate Floating Point Summation with Application to Computational Geometry
- Author
-
James Demmel and Yozo Hida
- Subjects
Floating point ,Arithmetic underflow ,Applied Mathematics ,Numerical analysis ,Hardware_ARITHMETICANDLOGICSTRUCTURES ,Minifloat ,Pairwise summation ,Algorithm ,Accumulator (cryptography) ,SIMPLE algorithm ,IEEE floating point ,Mathematics - Abstract
We present several simple algorithms for accurately computing the sum of n floating-point numbers using a wider accumulator. Let f and F be the number of significant bits in the summands and the accumulator, respectively. Then, assuming gradual underflow, no overflow, and round-to-nearest arithmetic, up to ⌊2^(F−f)/(1−2^(−f))⌋ + 1 numbers can be accurately added by just summing the terms in decreasing order of exponents, yielding a sum correct to within about 1.5 units in the last place. In particular, if the sum is zero, it is computed exactly. We apply this result to the floating-point formats in the IEEE floating-point standard and investigate its performance. Our results show that in the absence of massive cancellation (the most common case) the cost of guaranteed accuracy is about 30–40% more than straightforward summation. If massive cancellation does occur, the cost of computing the accurate sum is about a factor of ten. Finally, we apply our algorithm to computing a robust geometric predicate (used in computational geometry), where our accurate summation algorithm improves on the existing algorithm by a factor of two on a nearly coplanar set of points.
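The core recipe, sorting by decreasing exponent and adding into a wider accumulator, is short in software. Below is a float32-into-float64 sketch written for this entry (with f = 24 and F = 53 significant bits, the bound above allows on the order of 2^29 summands); it illustrates the stated result and is not the authors' code.

```python
import math
import numpy as np

def wide_accumulator_sum(xs32):
    """Sum float32 summands in a float64 accumulator,
    in decreasing order of exponent."""
    ordered = sorted(xs32, key=lambda v: math.frexp(v)[1], reverse=True)
    acc = 0.0                      # float64 accumulator (F = 53, f = 24)
    for v in ordered:
        acc += float(v)
    return np.float32(acc)

xs = np.random.default_rng(0).normal(size=10_000).astype(np.float32)
print(wide_accumulator_sum(xs))
```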
- Published
- 2004
50. How to read floating point numbers accurately
- Author
-
William Clinger
- Subjects
Floating point ,Integer ,Scientific notation ,Computer science ,Binary number ,Double-precision floating-point format ,Decimal representation ,Minifloat ,Computer Graphics and Computer-Aided Design ,Algorithm ,Software ,Decimal ,Extended precision - Abstract
Converting decimal scientific notation into binary floating point is nontrivial, but this conversion can be performed with the best possible accuracy without sacrificing efficiency.

Consider the problem of converting decimal scientific notation for a number into the best binary floating-point approximation to that number, for some fixed precision. This problem cannot be solved using arithmetic of any fixed precision. Hence the IEEE Standard for Binary Floating-Point Arithmetic does not require the result of such a conversion to be the best approximation.

This paper presents an efficient algorithm that always finds the best approximation. The algorithm uses a few extra bits of precision to compute an IEEE-conforming approximation while testing an intermediate result to determine whether the approximation could be other than the best. If the approximation might not be the best, then the best approximation is determined by a few simple operations on multiple-precision integers, where the precision is determined by the input. When using 64 bits of precision to compute IEEE double-precision results, the algorithm avoids higher-precision arithmetic over 99% of the time.

The input problem considered by this paper is the inverse of an output problem considered by Steele and White: given a binary floating-point number, print a correctly rounded decimal representation of it using the smallest number of digits that will allow the number to be read without loss of accuracy. The Steele and White algorithm assumes that the input problem is solved; an imperfect solution to the input problem, as allowed by the IEEE standard and ubiquitous in current practice, defeats the purpose of their algorithm.
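Whether a conversion routine really returns the best approximation can be verified after the fact with exact rational arithmetic. The check below was written for this entry (it tests Python's built-in float(), which performs a correctly rounded conversion, not the paper's algorithm): it compares the converted double and its two neighbours against the exact value of the decimal string.

```python
import math
from fractions import Fraction

def is_best_approximation(decimal_string):
    """Check that float() returned the double nearest the decimal value."""
    exact = Fraction(decimal_string)          # exact rational value of the input
    d = float(decimal_string)
    candidates = [d, math.nextafter(d, -math.inf), math.nextafter(d, math.inf)]
    best = min(candidates, key=lambda c: abs(Fraction(c) - exact))
    return best == d

print(is_best_approximation("0.1"))                      # True
print(is_best_approximation("2.2250738585072011e-308"))  # True (historically tricky input)
```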
- Published
- 2004