12 results for "Tsutomu Ikegami"
Search Results
2. Implementation of Automatic Differentiation to Python-based Semiconductor Device Simulator
- Author
-
Tsutomu Ikegami, Koichi Fukuda, and Junichi Hattori
- Subjects
Applied physics ,Finite volume method ,Computer science ,Automatic differentiation ,Python (programming language) ,Solver ,Nanoscience & nanotechnology ,Capacitance ,Nonlinear system ,Newton's method ,Simulation - Abstract
A Python-based device simulator named Impulse TCAD was developed. The simulator is built on top of a nonlinear finite volume method (FVM) solver. To describe the physical behavior of non-standard materials, both device properties and their governing equations can be customized. The given FVM equations are solved by the Newton method, where the required derivatives of the equations are derived automatically using an automatic differentiation technique. As a demonstration, a steady-state analysis of negative-capacitance field-effect transistors with ferroelectric materials is presented, where the coupled Poisson and Devonshire equations are implemented in several different ways.
- Published
- 2019
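The abstract's core mechanism, Newton iteration whose derivatives come from automatic differentiation, can be sketched in a few lines of Python. This is an illustrative toy (scalar, forward-mode dual numbers), not code from Impulse TCAD; the `Dual` class and `newton` helper are invented for the example.

```python
class Dual:
    """Number of the form a + b*eps with eps**2 == 0; b carries df/dx."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = __add__

    def __sub__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val - o.val, self.der - o.der)

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.der + self.der * o.val)
    __rmul__ = __mul__

def newton(f, x0, tol=1e-12, max_iter=50):
    """Solve f(x) = 0; f is written once, its derivative comes from AD."""
    x = x0
    for _ in range(max_iter):
        y = f(Dual(x, 1.0))          # seed the derivative slot with 1
        step = y.val / y.der
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: x**2 - 2 == 0  ->  sqrt(2)
root = newton(lambda x: x * x - 2.0, 1.0)
print(root)
```

The same idea scales to systems: seeding one unknown at a time with `der=1` yields one Jacobian column per evaluation, which is what a Newton-based FVM solver needs.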
3. ClPy: A NumPy-Compatible Library Accelerated with OpenCL
- Author
-
Tomokazu Higuchi, Ryousei Takano, Yoriyuki Kitta, Kenjiro Taura, Naoki Yoshifuji, Tsutomu Ikegami, and Tomoya Sakai
- Subjects
Coprocessor ,Computer science ,NumPy ,Deep learning ,Parallel computing ,Python (programming language) ,CUDA ,Artificial intelligence - Abstract
We developed ClPy, a Python library that supports OpenCL through a simple NumPy-like interface, together with an extension of the Chainer machine learning framework for OpenCL support. OpenCL emerged as a parallel computing standard with the goal of supporting a wide range of accelerators, including GPUs (NVIDIA and others), FPGAs, DSPs, and CPUs. In contrast, many machine learning frameworks, including Chainer, have been built on top of CUDA, the predominant API for programming NVIDIA GPUs; as such, they cannot leverage other devices such as non-NVIDIA GPUs and FPGAs. To facilitate the development of cross-platform machine learning frameworks, ClPy is designed with an interface compatible with CuPy (CUDA Python), which itself has a NumPy-compatible interface and is used in Chainer to support both CPUs and NVIDIA GPUs. ClPy extends Chainer to any platform supporting OpenCL and can potentially do the same for other machine learning frameworks. This paper describes the design and implementation of ClPy and demonstrates that it achieves reasonable performance on several machine learning applications. Our experiments show that the overhead of ClPy itself is small, and that the serious performance degradation we observed was caused by the lack of GPU-accelerated OpenCL libraries such as BLAS.
- Published
- 2019
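The portability argument rests on the NumPy-compatible interface: a kernel written once against the NumPy API can run on whichever backend module is importable. A minimal sketch of that pattern, with the backend-probing helper and layer function invented for illustration (only the NumPy fallback is exercised here):

```python
import numpy

def pick_backend():
    """Try an OpenCL (clpy) or CUDA (cupy) backend, else fall back to CPU."""
    for name in ("clpy", "cupy"):
        try:
            return __import__(name)
        except ImportError:
            pass
    return numpy

xp = pick_backend()

def relu_layer(x, w):
    """One dense layer with ReLU, written purely against the NumPy API."""
    return xp.maximum(xp.dot(x, w), 0)

x = xp.ones((2, 3))
w = xp.arange(12, dtype=float).reshape(3, 4)
print(relu_layer(x, w))
```

Because `clpy`, `cupy`, and `numpy` expose the same array interface, `relu_layer` needs no per-device code; this is the design choice the paper exploits to extend Chainer.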
4. Image-Classifier Deep Convolutional Neural Network Training by 9-bit Dedicated Hardware to Realize Validation Accuracy and Energy Efficiency Superior to the Half Precision Floating Point Format
- Author
-
Tomohiro Kudoh, Ryousei Takano, Shin-ichi O'uchi, Takashi Matsukawa, Wakana Nogami, Tsutomu Ikegami, and Hiroshi Fuketa
- Subjects
Floating point ,Computer science ,Sign bit ,Convolutional neural network ,Significand ,Most significant bit ,Multiplier ,Accumulator (computing) ,Computer hardware ,Half-precision floating-point format - Abstract
We propose a 9-bit floating point format for training image-classifier deep convolutional neural networks. The proposed format has a 5-bit exponent, a 3-bit mantissa with a hidden most significant bit (MSB), and a sign bit. The 9-bit format reduces not only the transistor count of the multiplier in the multiply-accumulate (MAC) unit, but also the data traffic for the forward and backward propagations and the weight update; both reductions enable power-efficient training. To maintain validation accuracy, the accumulator is implemented with an internal, longer-bit-length floating point format, while the multiplier accepts the 9-bit format. We examined this format by training AlexNet and ResNet-50 on the ILSVRC 2012 data set. The trained 9-bit AlexNet and ResNet-50 exhibited validation accuracy superior to 16-bit floating point training by 1.2% and 0.5%, respectively. The transistor count of the 9-bit MAC unit is estimated to be 84% lower than that of its 32-bit counterpart.
- Published
- 2018
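The values representable in the described 1-sign / 5-exponent / 3-mantissa layout (hidden leading bit, half-precision-style bias) can be illustrated with a rounding helper. This is a sketch under simplifying assumptions, with subnormals and exponent-range clamping omitted; `quantize_9bit` is invented for the example, not taken from the paper.

```python
import math

def quantize_9bit(x):
    """Round x to the nearest value representable with a 5-bit exponent
    and a 3-bit stored mantissa plus a hidden leading 1 (as in the 9-bit
    format described above). Subnormals and overflow are not handled."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    m, e = math.frexp(abs(x))          # abs(x) = m * 2**e, m in [0.5, 1)
    # rescale so the significand is in [1, 2): abs(x) = (2*m) * 2**(e-1)
    frac, exp = 2.0 * m, e - 1
    # keep 3 fractional mantissa bits, rounding to nearest
    scaled = round((frac - 1.0) * 8)
    if scaled == 8:                    # rounded up past 1.111b
        scaled, exp = 0, exp + 1
    return sign * (1.0 + scaled / 8.0) * 2.0 ** exp

print(quantize_9bit(0.1))   # -> 0.1015625, the nearest 9-bit value
print(quantize_9bit(3.3))   # -> 3.25
```

With only 3 mantissa bits, adjacent representable values differ by 1/8 of the leading power of two, which is why the paper pairs the narrow multiplier with a longer-format accumulator.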
5. Performance comparison of parallel eigensolvers based on a contour integral method and a Lanczos method
- Author
-
Tetsuya Sakurai, Hiroto Tadano, Tsutomu Ikegami, and Ichitaro Yamazaki
- Subjects
Computer Networks and Communications ,Computer science ,Crossover ,Linear system ,Computer Graphics and Computer-Aided Design ,Theoretical Computer Science ,Lanczos method ,Nonlinear system ,Artificial Intelligence ,Hardware and Architecture ,Algorithm ,Software ,Eigenvalues and eigenvectors ,Cauchy's integral formula - Abstract
We study the performance of SSEig, a parallel nonlinear eigensolver based on a contour integral method. We focus on symmetric generalized eigenvalue problems (GEPs) in which interior eigenvalues are computed. We chose GEPs because we can then compare the performance of SSEig with that of TRLan, a publicly available software package based on a thick-restart Lanczos method. To solve this type of problem, SSEig requires the solution of independent linear systems with different shifts, whereas TRLan solves a sequence of linear systems with a single shift. Therefore, while SSEig typically has a greater computational cost than TRLan, it also has greater parallel scalability. To compare the performance of the two solvers, we develop performance models and present numerical results for large-scale eigenvalue problems arising from simulations of accelerator cavities. In particular, we identify the crossover point at which SSEig becomes faster than TRLan. The parallel performance of SSEig on nonlinear eigenvalue problems is also studied.
- Published
- 2013
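The contour-integral approach behind SSEig can be illustrated with a single-vector Sakurai-Sugiura-style sketch: quadrature along a circle yields moments, and a small Hankel matrix pencil recovers the eigenvalues inside the contour. This toy assumes a standard (not generalized) eigenproblem and a known interior eigenvalue count `m`, and omits the block vectors and SVD filtering a production solver such as SSEig uses; all names here are invented for the example.

```python
import numpy as np

def ss_eigs(A, center, radius, m=3, n_quad=64, seed=0):
    """Estimate the eigenvalues of A inside |z - center| = radius.
    Contour quadrature yields moments mu_k = sum_i s_i*(lam_i - center)**k
    over the interior eigenvalues; a Hankel pencil then recovers them.
    m is assumed to equal the interior count (real solvers adapt it)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    u = rng.standard_normal(n)
    v = rng.standard_normal(n)
    mu = np.zeros(2 * m, dtype=complex)
    for j in range(n_quad):                         # trapezoidal rule
        theta = 2.0 * np.pi * (j + 0.5) / n_quad
        z = center + radius * np.exp(1j * theta)
        y = np.linalg.solve(z * np.eye(n) - A, v)   # one shifted solve
        w = radius * np.exp(1j * theta) / n_quad    # dz / (2*pi*i)
        for k in range(2 * m):
            mu[k] += (z - center) ** k * (u @ y) * w
    H0 = np.array([[mu[i + j] for j in range(m)] for i in range(m)])
    H1 = np.array([[mu[i + j + 1] for j in range(m)] for i in range(m)])
    return center + np.linalg.eigvals(np.linalg.solve(H0, H1))

A = np.diag([1.0, 2.0, 3.0, 10.0, 20.0])
est = ss_eigs(A, center=2.0, radius=2.5)
print(np.sort_complex(est))   # close to 1, 2, 3
```

Note that each of the `n_quad` shifted solves is independent, which is exactly the source of the parallel scalability the abstract contrasts with the sequential single-shift solves of a Lanczos method.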
6. Parallel Fock Matrix Construction with Distributed Shared Memory Model for the FMO-MO Method
- Author
-
Toru Yagi, Toshio Watanabe, Hiroto Tadano, Tsutomu Ikegami, Yuichi Inadomi, Takayoshi Ishimoto, Umpei Nagashima, Tetsuya Sakurai, and Hiroaki Umeda
- Subjects
Molecular models ,parallel Fock matrix construction ,Distributed shared memory ,Computer science ,large MO calculation ,Parallel algorithm ,Computational Biology ,Sakurai-Sugiura method ,Basis function ,General Chemistry ,Parallel computing ,FMO-MO method ,ErbB Receptors ,Computational Mathematics ,Fock matrix ,Benchmark (computing) ,Quantum Theory ,Computer Simulation ,Plant Proteins - Abstract
A parallel Fock matrix construction program for the FMO-MO method has been developed using the distributed shared memory model. To construct a large Fock matrix during FMO-MO calculations, a distributed parallel algorithm was designed to make full use of local memory to reduce communication, and was implemented on the Global Arrays toolkit. A benchmark calculation on a small system indicates that the parallelization efficiency of the matrix construction portion is as high as 93% on 1,024 processors. A large FMO-MO application to the epidermal growth factor receptor (EGFR) protein (17,246 atoms and 96,234 basis functions) was also carried out at the HF/6-31G level of theory, with the frontier orbitals extracted by the Sakurai-Sugiura eigensolver. On a PC cluster system using 256 processors, the FMO calculation takes 11.3 h, the Fock matrix construction 49.1 h, and the extraction of 94 eigencomponents 10 min. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2010
- Published
- 2010
7. A moving mesh method for device simulation
- Author
-
Junichi Hattori, Koichi Fukuda, Tsutomu Ikegami, Hiroo Koshimoto, and Hidehiro Asai
- Subjects
Computer science ,MOSFET ,Semiconductor device modeling ,Electronic engineering ,Semiconductor device ,Algorithm - Abstract
A moving mesh method for semiconductor device simulation is developed that effectively improves accuracy without increasing the number of mesh points. In this method, mesh positions are shifted with reference to the solution under the previous bias condition, or to the Newton corrections. The method is applied to PN junctions and MOSFETs, and provides an effective way to track the changes in carrier distributions as the bias conditions change. The algorithm is simple and effective, and can be widely used.
- Published
- 2015
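A generic 1-D equidistribution step shows the flavor of a moving mesh method: points are redistributed so that a monitor function (here arc length) is equal over every cell, crowding the mesh where the solution is steep. This sketch uses a standard monitor-function technique, not the paper's specific scheme of shifting meshes by previous-bias solutions or Newton corrections; all names are invented for the example.

```python
import numpy as np

def equidistribute(x, u):
    """Move 1-D mesh points so the monitor M = sqrt(1 + (du/dx)**2) is
    equidistributed: points concentrate where u varies rapidly."""
    dudx = np.gradient(u, x)
    monitor = np.sqrt(1.0 + dudx ** 2)
    # cumulative integral of the monitor (trapezoidal rule per cell)
    cells = 0.5 * (monitor[1:] + monitor[:-1]) * np.diff(x)
    cum = np.concatenate([[0.0], np.cumsum(cells)])
    # place new points at equal increments of the cumulative monitor
    targets = np.linspace(0.0, cum[-1], len(x))
    return np.interp(targets, cum, x)

# steep junction-like profile: a tanh step at x = 0.5
x = np.linspace(0.0, 1.0, 41)
u = np.tanh((x - 0.5) / 0.02)
x_new = equidistribute(x, u)
# spacing near the step is much finer than at the ends
print(np.min(np.diff(x_new)), np.max(np.diff(x_new)))
```

The point count stays fixed while resolution migrates to the junction region, which is the trade-off the abstract describes for bias-dependent carrier distributions.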
8. Implementation of Fault-Tolerant GridRPC Applications
- Author
-
Yoshio Tanaka, Hidemoto Nakada, Tsutomu Ikegami, Satoshi Sekiguchi, and Yusuke Tanimura
- Subjects
Computer Networks and Communications ,Computer science ,Distributed computing ,Testbed ,Throughput ,Fault tolerance ,Fault detection and isolation ,GridRPC ,Task scheduling ,Hardware and Architecture ,Timeout ,Software ,Information Systems - Abstract
A task-parallel application is implemented with Ninf-G, a GridRPC system. A series of experiments was conducted on a Grid testbed in the Asia-Pacific region over three months. Through tens of long executions, typical fault patterns were collected, and instability of the network throughput was determined to be a major cause of faults. Several points are stressed for avoiding task-throughput decline due to fault-recovery operations: minimizing timeouts for fault detection, background recovery, redundant task assignment, and so on. This study also provides guidance for the design of an automated fault-tolerance mechanism in an upper layer of the GridRPC framework.
- Published
- 2006
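The recommended tactics, short timeouts for fault detection plus redundant task assignment, can be sketched with local threads standing in for GridRPC servers. All names and parameters here are illustrative; this is not Ninf-G code.

```python
import concurrent.futures as cf
import time

def remote_call(task, server):
    """Stand-in for a GridRPC call; the 'slow' server simulates a fault,
    e.g. a stalled network connection."""
    time.sleep(0.5 if server == "slow" else 0.01)
    return task * task

def run_with_redundancy(task, servers, timeout=0.4):
    """Dispatch the same task to every server and take the first result
    that beats the timeout; the straggler's result is discarded (though
    this local executor still waits for it on shutdown)."""
    with cf.ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(remote_call, task, s) for s in servers]
        done, _ = cf.wait(futures, timeout=timeout,
                          return_when=cf.FIRST_COMPLETED)
        if not done:
            raise TimeoutError("all servers missed the deadline")
        return next(iter(done)).result()

results = [run_with_redundancy(t, ["slow", "fast"]) for t in range(5)]
print(results)  # [0, 1, 4, 9, 16]
```

Task throughput is protected because a stalled server never blocks progress: the deadline bounds detection latency and the redundant copy supplies the answer.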
9. A highly available distributed self-scheduler for exascale computing
- Author
-
Hidemoto Nakada, Atsuko Takefusa, Yoshio Tanaka, and Tsutomu Ikegami
- Subjects
Mean time between failures ,Computer science ,Middleware (distributed applications) ,Distributed computing ,Scalability ,Resource management system ,Programming paradigm ,Fault tolerance ,Exascale computing - Abstract
A hierarchical master-worker model is thought to be a promising programming paradigm for exascale-class high performance computers. However, fault resiliency is one of the most important issues for exascale computing, because the Mean Time Between Failures (MTBF) is expected to be short. We propose a fault-resilient middleware suite for exascale computing environments. In this paper, we design a highly available distributed self-scheduler as the resource management system for the proposed middleware suite. The scheduler consists of multiple processes in order to achieve scalability, fault resiliency, and persistence. We also develop a prototype of the middleware using Apache ZooKeeper and Apache Cassandra. Experiments with the prototype show that the proposed distributed self-scheduler provides the desired fault resiliency for an application program developed on the middleware, and that the scheduler itself is also fault resilient. We also confirmed that the overheads caused by distributed processing can be kept low and that the scheduler is scalable.
- Published
- 2015
10. Exploring the Performance Impact of Virtualization on an HPC Cloud
- Author
-
Nuttapong Chakthranont, Ryousei Takano, Phonlawat Khunphet, and Tsutomu Ikegami
- Subjects
Computer science ,Distributed computing ,Cloud computing ,Virtualization ,Supercomputer ,Scalability ,Benchmark (computing) ,Bandwidth (computing) ,Operating system - Abstract
The feasibility of the cloud computing paradigm is examined from the High Performance Computing (HPC) viewpoint. The impact of virtualization is evaluated on our latest private cloud, the AIST Super Green Cloud, which provides elastic virtual clusters interconnected by InfiniBand. Performance is measured with typical HPC benchmark programs on both physical and virtual clusters. The micro-benchmark results indicate that the virtual clusters suffer from a scalability issue on almost all MPI collective functions, with relative performance gradually worsening as the number of nodes increases. On the other hand, benchmarks based on actual applications, including LINPACK, OpenMX, and Graph 500, show that the virtualization overhead is about 5% even when the number of nodes increases to 128. This observation leads to our optimistic conclusion on the feasibility of the HPC cloud.
- Published
- 2014
11. GridFMO — Quantum chemistry of proteins on the grid
- Author
-
Yasuhito Tanaka, Mutsumi Aoyagi, Tsutomu Ikegami, Jun Maki, Satoshi Sekiguchi, Toshiya Takami, and Mitsuo Yokokawa
- Subjects
Grid computing ,Ab initio quantum chemistry methods ,Computer science ,Fault tolerance ,GAMESS ,Grid ,Fragment molecular orbital ,Computational science - Abstract
A GridFMO application was developed by combining the fragment molecular orbital (FMO) method of GAMESS with grid technology. With GridFMO, quantum calculations of macromolecules become possible by using a large amount of computational resources collected from many moderate-sized cluster computers. A new middleware suite was developed based on Ninf-G, whose fault tolerance and flexible resource management were found to be indispensable for long-term calculations. GridFMO was used to draw ab initio potential energy curves of a protein motor system with 16,664 atoms. For the calculations, 10 cluster computers around the Pacific Rim were used, sharing the resources with other users via the batch queue systems on each machine. A series of 14 GridFMO calculations was conducted over 70 days, coping with more than 100 problems that cropped up. The FMO curves were compared against molecular mechanics (MM), confirming that (1) the FMO method is capable of drawing smooth curves despite several cut-off approximations, and that (2) the MM method is reliable enough for molecular modeling.
- Published
- 2007
12. Full Electron Calculation Beyond 20,000 Atoms: Ground Electronic State of Photosynthetic Proteins
- Author
-
Satoshi Sekiguchi, Hiroaki Umeda, Mitsuo Yokokawa, Kazuo Kitaura, Yuichi Inadomi, Dmitri G. Fedorov, Toyokazu Ishida, and Tsutomu Ikegami
- Subjects
Photosynthetic reaction centre ,Computer science ,Rhodopseudomonas viridis ,Electron ,Photosynthesis ,Molecular physics ,Simulation ,Fragment molecular orbital - Abstract
A full-electron calculation of the photosynthetic reaction center of Rhodopseudomonas viridis was performed using the fragment molecular orbital (FMO) method on a massive cluster computer. The target system contains 20,581 atoms and 77,754 electrons, and was divided into 1,398 fragments. Following the FMO prescription, calculations of the fragments and of pairs of fragments were conducted to obtain the electronic state of the system. The calculation at the RHF/6-31G* level of theory took 72.5 hours with 600 CPUs. The CPUs were grouped into several workers, to which the fragment calculations were dispatched. An uneven CPU grouping, which generates two types of workers, was shown to be efficient.
- Published
- 2005
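The FMO prescription mentioned in these abstracts assembles the total energy from fragment (monomer) and fragment-pair (dimer) energies as E = sum_I E_I + sum_{I&lt;J} (E_IJ - E_I - E_J). A sketch of that bookkeeping with made-up energies, not values from the paper:

```python
from itertools import combinations

def fmo_total_energy(monomer, dimer):
    """monomer: {I: E_I}; dimer: {(I, J): E_IJ} with I < J.
    Returns sum_I E_I + sum_{I<J} (E_IJ - E_I - E_J)."""
    e = sum(monomer.values())
    for (i, j) in combinations(sorted(monomer), 2):
        e += dimer[(i, j)] - monomer[i] - monomer[j]
    return e

# three hypothetical fragments (energies in arbitrary units)
monomer = {1: -10.0, 2: -20.0, 3: -30.0}
dimer = {(1, 2): -30.5, (1, 3): -40.2, (2, 3): -50.1}
print(fmo_total_energy(monomer, dimer))  # approximately -60.8
```

Each monomer and dimer energy is an independent quantum calculation, which is why both papers above could dispatch fragments to workers (or clusters) in parallel.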