Descriptor: "Terabyte" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Terabyte"' showing total 1,244 results

Start Over Descriptor "Terabyte"

1,244 results on '"Terabyte"'

151. On Divide&Conquer in Image Processing of Data Monster

Author: Peter Hufnagl, Elsa Irmgard Buchholz, Hermann Hesling, and Marco Strutz
Subjects: Divide and conquer algorithms, Information Systems and Management, Speedup, Computer science, business.industry, Big data, Petabyte, Image processing, 02 engineering and technology, Terabyte, Computer Science Applications, Management Information Systems, Computational science, 020204 information systems, Sandbox (computer security), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, business, Word (computer architecture), Information Systems
Abstract: The steadily improving resolution power of sensors results in larger and larger data objects, which cannot be analysed in a reasonable amount of time on single workstations. To speed up the analysis the Divide and Conquer method can be used by splitting (large) data objects into smaller pieces where each piece is analysed on a single node and, finally, the partial results are collected and combined. We apply this method to the validated bio–medical framework Ki67–Analysis that determines the amount of cancer cells in high–resolution images from breast examinations. In previous work, we observed an anomalous behaviour when the framework is applied to subtiles of an image. To this end, we determined for each subtile a so–called Ki67–Analysis score parameter, which is given by the ratio of the number of identified cancer cells and the total number of cells. This parameter turns out to be underestimated the more the smaller the subtiles. The anomaly prevents a direct application of the Divide and Conquer method. In this work, we suggest a novel grey–box testing method for understanding the origin of the anomaly. It allows to identify a class of subtiles for which the Ki67–Analysis score parameter can be determined reasonably well, i.e. for which the Divide and Conquer method can be applied. By demanding the stability of the framework with regard to small additive noise in brightness, “ghost cells” are identified that turn out to be an artefact of the framework. Finally, the challenge of analysing huge single data objects is considered. The upcoming observatory Square Kilometre Array (SKA) will consist of thousands of antennas and telescopes. Due to the exceptional resolution power of SKA, single images from the Universe may be as large as one Petabyte. “Data monster” of that huge size cannot be analysed reasonably fast on traditional computing architectures. The relatively small throughput rates when reading data from disks is a serious bottleneck (memory–wall problem). Memory–based computing offers a change in paradigm: the current processor–centric architecture is replaced by a memory–based architecture. Hewlett Packard Enterprise (HPE) developed a prototype with 48 Terabyte of memory, called Sandbox. Counting words in large files can be considered as a first step towards simulating image processing of “Data Monster” at SKA. We run the big data framework Thrill on the Sandbox and determine the speedup of different setups for distributed word counting.
Published: 2021

152. The New Moneyball: How Ballpark Sensors Are Changing Baseball

Author: Glenn Healey
Subjects: Engineering, Point (typography), law, business.industry, Computer graphics (images), Doppler radar, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Stereoscopic video, Electrical and Electronic Engineering, Terabyte, business, law.invention
Abstract: Advancements in the capability of sensors, processors, and storage devices have led to an explosion in the amount of data that is captured during sporting events. The Statcast system, for example, uses Doppler radar and stereoscopic video from two arrays of high-resolution optical imagers to acquire seven terabytes of data during each Major League Baseball (MLB) game. One use of these data is to enhance the experience of sports fans. As a case in point, Statcast data are used to generate information and visualizations that are disseminated in real time through telecasts and other media such as an app which displays pitch parameters derived from sensor measurements.
Published: 2017

153. Measuring Scale-Up and Scale-Out Hadoop with Remote and Local File Systems and Selecting the Best Platform

Author: Zhuozhao Li and Haiying Shen
Subjects: Computer science, 02 engineering and technology, Terabyte, computer.software_genre, 0202 electrical engineering, electronic engineering, information engineering, Network File System, Distributed File System, File system fragmentation, File system, 020203 distributed computing, Random access memory, Distributed database, Database, Device file, Petabyte, 020206 networking & telecommunications, computer.file_format, Virtual file system, Torrent file, Memory-mapped file, Self-certifying File System, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, Scalability, Operating system, computer
Abstract: MapReduce is a popular computing model for parallel data processing on large-scale datasets, which can vary from gigabytes to terabytes and petabytes. Though Hadoop MapReduce normally uses Hadoop Distributed File System (HDFS) local file system, it can be configured to use a remote file system. Then, an interesting question is raised: for a given application, which is the best running platform among the different combinations of scale-up and scale-out Hadoop with remote and local file systems. However, there has been no previous research on how different types of applications (e.g., CPU-intensive, data-intensive) with different characteristics (e.g., input data size) can benefit from the different platforms. Thus, in this paper, we conduct a comprehensive performance measurement of different applications on scale-up and scale-out clusters configured with HDFS and a remote file system (i.e., OFS), respectively. We identify and study how different job characteristics (e.g., input data size, the number of file reads/writes, and the amount of computations) affect the performance of different applications on the different platforms. Based on the measurement results, we also propose a performance prediction model to help users select the best platforms that lead to the minimum latency. Our evaluation using a Facebook workload trace demonstrates the effectiveness of our prediction model. This study is expected to provide a guidance for users to choose the best platform to run different applications with different characteristics in the environment that provides both remote and local storage, such as HPC cluster and cloud environment.
Published: 2017

154. Multilevel Analysis of Student's Feedback Using Moodle Logs in Virtual Cloud Environment

Author: Ashok Verma, Santosh K. Vishwakarma, Sumangla Rathore, and Shubham Goswami
Subjects: 020203 distributed computing, Multimedia, Computer science, business.industry, Cloud computing, 02 engineering and technology, Terabyte, computer.software_genre, Witness, World Wide Web, Computer data storage, Node (computer science), 0202 electrical engineering, electronic engineering, information engineering, Virtual learning environment, 020201 artificial intelligence & image processing, Decision-making, business, computer, Word (computer architecture)
Abstract: In the current digital era, education system has witness tremendous growth in data storage and efficient retrieval. Many Institutes have very huge databases which may be of terabytes of knowledge and information. The complexity of the data is an important issue as educational data consists of structural as well as non-structural type which includes various text editors like node pad, word, PDF files, images, video, etc. The problem lies in proper storage and correct retrieval of this information. Different types of learning platform like Moodle have implemented to integrate the requirement of educators, administrators and learner. Although this type of platforms are indeed a great support of educators, still mining of the large data is required to uncover various interesting patterns and facts for decision making process for the benefits of the students.
Published: 2017

155. COMPREHENSIVE STUDY ON CONTENT BASED IMAGE RETRIEVAL WITH THEIR FEATURES

Author: Pritpal kaur and Sukhvir Kaur
Subjects: Information retrieval, Computer science, Feature vector, Search engine indexing, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Process (computing), 020207 software engineering, 02 engineering and technology, Terabyte, Content-based image retrieval, Automatic image annotation, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Visual Word, Image retrieval
Abstract: In current years, very huge collections of images and videos have grown swiftly. In parallel with this boom, content-based image retrieval and querying the indexed collections of images from the large database are required to access visible facts and visual information. Three of the principle additives of the visual images are texture, shape and color. Content based image retrieval from big sources has a wide scope in many application areas and softwareâ€™s. To accelerate retrieval and similarity computation, the database images are analyzed and the extracted regions are clustered or grouped together with their characteristic feature vectors. As a result of latest improvements in digital storage technology, it's easy and possible to create and store the large quantity of images inside the image database. These collections may additionally comprise thousands and thousands of images and terabytes of visual information like their shape, texture and color. For users to make the most from those image databases, efficient techniques and mechanisms of searching should be devised. Having a computer to do the indexing primarily based on a CBIR scheme attempts to deal with the shortcomings of human-based indexing. Since an automated process on a computer can analyze and process the images at a very quick and efficient rate that human can never do alone. In this paper, we will discuss the structure of CBIR with their feature vectors.
Published: 2017

156. CarStream

Author: Tianyu Wo, Xuelian Lin, Tao Xie, Mingming Zhang, and Yaxiao Liu
Subjects: business.industry, Computer science, Process (engineering), Big data, General Engineering, Volume (computing), 02 engineering and technology, Terabyte, computer.software_genre, Data science, Upload, 020204 information systems, Data quality, Scalability, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, The Internet, Data mining, business, computer
Abstract: As the Internet-of-Vehicles (IoV) technology becomes an increasingly important trend for future transportation, designing large-scale IoV systems has become a critical task that aims to process big data uploaded by fleet vehicles and to provide data-driven services. The IoV data, especially high-frequency vehicle statuses (e.g., location, engine parameters), are characterized as large volume with a low density of value and low data quality. Such characteristics pose challenges for developing real-time applications based on such data. In this paper, we address the challenges in designing a scalable IoV system by describing CarStream, an industrial system of big data processing for chauffeured car services. Connected with over 30,000 vehicles, CarStream collects and processes multiple types of driving data including vehicle status, driver activity, and passenger-trip information. Multiple services are provided based on the collected data. CarStream has been deployed and maintained for three years in industrial usage, collecting over 40 terabytes of driving data. This paper shares our experiences on designing CarStream based on large-scale driving-data streams, and the lessons learned from the process of addressing the challenges in designing and maintaining CarStream.
Published: 2017

157. Exploring big volume sensor data with vroom

Author: Vijay Gadepally, Michael Stonebraker, Aaron Zalewski, Samuel Madden, Sudeep Pillai, Oscar Moll, and Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Subjects: 0209 industrial biotechnology, business.industry, Computer science, Real-time computing, Search engine indexing, General Engineering, Volume (computing), 02 engineering and technology, Terabyte, computer.software_genre, Domain (software engineering), 020901 industrial engineering & automation, Analytics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Data mining, State (computer science), business, computer
Abstract: State of the art sensors within a single autonomous vehicle (AV) can produce video and LIDAR data at rates greater than 30 GB/hour. Unsurprisingly, even small AV research teams can accumulate tens of terabytes of sensor data from multiple trips and multiple vehicles. AV practitioners would like to extract information about specific locations or specific situations for further study, but are often unable to. Queries over AV sensor data are different from generic analytics or spatial queries because they demand reasoning about fields of view as well as heavy computation to extract features from scenes. In this article and demo we present Vroom, a system for ad-hoc queries over AV sensor databases. Vroom combines domain specific properties of AV datasets with selective indexing and multi-query optimization to address challenges posed by AV sensor data.
Published: 2017

158. Cloud-based interactive analytics for terabytes of genomic variants data

Author: Gregory McInnes, Jonathan Bingham, Cuiping Pan, Philip S. Tsao, Michael Snyder, Nicole A. Deflaux, and Somalee Datta
Subjects: 0301 basic medicine, Statistics and Probability, Genotype, Computer science, Big data, Cloud computing, Web Browser, Terabyte, Biochemistry, 03 medical and health sciences, 0302 clinical medicine, Gene Frequency, Humans, Ecosystem, Molecular Biology, Genome, Human, business.industry, Genetic Variation, Genomics, Data Compression, Original Papers, Data science, Computer Science Applications, Computational Mathematics, 030104 developmental biology, Computational Theory and Mathematics, Analytics, Scalability, DECIPHER, Databases, Nucleic Acid, business, Software, 030217 neurology & neurosurgery, Data compression
Abstract: Motivation Large scale genomic sequencing is now widely used to decipher questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. With the quantity and diversity these data harbor, a robust and scalable data handling and analysis solution is desired. Results We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate such Big Data computing paradigms can provide orders of magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information. Availability and implementation Our analysis framework is implemented in Google Cloud Platform and BigQuery. Codes are available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs. Supplementary information Supplementary data are available at Bioinformatics online.
Published: 2017

159. Three lessons for genetic toxicology from baseball analytics

Author: Stephen D. Dertinger
Subjects: 0301 basic medicine, Courtesy, Epidemiology, Computer science, business.industry, Health, Toxicology and Mutagenesis, Mindset, 010501 environmental sciences, Terabyte, 01 natural sciences, Data science, 03 medical and health sciences, 030104 developmental biology, Analytics, Sabermetrics, Data analysis, Videography, business, Genetics (clinical), 0105 earth and related environmental sciences, Genetic Toxicology
Abstract: In many respects the evolution of baseball statistics mirrors advances made in the field of genetic toxicology. From its inception, baseball and statistics have been inextricably linked. Generations of players and fans have used a number of relatively simple measurements to describe team and individual player's current performance, as well as for historical record-keeping purposes. Over the years, baseball analytics has progressed in several important ways. Early advances were based on deriving more meaningful metrics from simpler forerunners. Now, technological innovations are delivering much deeper insights. Videography, radar, and other advances that include automatic player recognition capabilities provide the means to measure more complex and useful factors. Fielders' reaction times, efficiency of the route taken to reach a batted ball, and pitch-framing effectiveness come to mind. With the current availability of complex measurements from multiple data streams, multifactorial analyses occurring via machine learning algorithms have become necessary to make sense of the terabytes of data that are now being captured in every Major League Baseball game. Collectively, these advances have transformed baseball statistics from being largely descriptive in nature to serving data-driven, predictive roles. Whereas genetic toxicology has charted a somewhat parallel course, a case can be made that greater utilization of baseball's mindset and strategies would serve our scientific field well. This paper describes three useful lessons for genetic toxicology, courtesy of the field of baseball analytics: seek objective knowledge; incorporate multiple data streams; and embrace machine learning. Environ. Mol. Mutagen. 58:390-397, 2017. © 2017 Wiley Periodicals, Inc.
Published: 2017

160. Big data in mental health research – do the ns justify the means? Using large data-sets of electronic health records for mental health research

Author: Peter R. Schofield
Subjects: business.industry, Big data, Volume (computing), Information technology, Petabyte, Health records, Terabyte, Mental health, Data science, 030227 psychiatry, 03 medical and health sciences, Psychiatry and Mental health, 0302 clinical medicine, Computer data storage, Medicine, 030212 general & internal medicine, business
Abstract: SummaryAdvances in information technology and data storage, so-called ‘big data’, have the potential to dramatically change the way we do research. We are presented with the possibility of whole-population data, collected over multiple time points and including detailed demographic information usually only available in expensive and labour-intensive surveys, but at a fraction of the cost and effort. Typically, accounts highlight the sheer volume of data available in terms of terabytes (1012) and petabytes (1015) of data while charting the exponential growth in computing power we can use to make sense of this. Presented with resources of such dizzying magnitude it is easy to lose sight of the potential limitations when the amount of data itself appears unlimited. In this short account I look at some recent advances in electronic health data that are relevant for mental health research while highlighting some of the potential pitfalls.
Published: 2017

161. Next-generation sequencing: big data meets high performance computing

Author: Andreas Hildebrandt and Bertil Schmidt
Subjects: 0301 basic medicine, Computer science, Distributed computing, Genomic research, Big data, Terabyte, Computing Methodologies, DNA sequencing, 03 medical and health sciences, 0302 clinical medicine, Databases, Genetic, Drug Discovery, Humans, Throughput (business), Pharmacology, Genome, business.industry, High-Throughput Nucleotide Sequencing, Genomics, Sequence Analysis, DNA, Precision medicine, Supercomputer, Data science, Cancer treatment, 030104 developmental biology, 030220 oncology & carcinogenesis, business, Algorithms
Abstract: The progress of next-generation sequencing has a major impact on medical and genomic research. This high-throughput technology can now produce billions of short DNA or RNA fragments in excess of a few terabytes of data in a single run. This leads to massive datasets used by a wide range of applications including personalized cancer treatment and precision medicine. In addition to the hugely increased throughput, the cost of using high-throughput technologies has been dramatically decreasing. A low sequencing cost of around US$1000 per genome has now rendered large population-scale projects feasible. However, to make effective use of the produced data, the design of big data algorithms and their efficient implementation on modern high performance computing systems is required.
Published: 2017

162. Applying Parallel Computing Techniques to Analyze Terabyte Atmospheric Boundary Layer Model Outputs

Author: Song-Lak Kang and Timothy S. Sliwinski
Subjects: Information Systems and Management, 010504 meteorology & atmospheric sciences, business.industry, Serial communication, Computer science, Big data, Message Passing Interface, Graphics processing unit, 02 engineering and technology, Parallel computing, Terabyte, 01 natural sciences, Computer Science Applications, Management Information Systems, CUDA, 0202 electrical engineering, electronic engineering, information engineering, Data analysis, 020201 artificial intelligence & image processing, business, 0105 earth and related environmental sciences, Information Systems, Data transmission
Abstract: In the atmospheric sciences, the size of simulation output continues to grow as computational resources able to handle simulations with fine-scale spatial and temporal resolutions become more accessible. As output size increases, serial data analysis methods become overwhelmed, resulting in either long delays during processing or total failures due to memory constraints. Parallel data analysis methods can alleviate these issues, however atmospheric scientists are often unfamiliar with how to achieve this. Therefore, example methods are needed to help guide the use of parallel processing in the analysis of Big Data from atmospheric simulations. In this work, practical methods are presented by which an analysis may be executed in parallel using the Message Passing Interface (MPI) and Python. These methods first consider the inherent spatial dependencies of a particular data analysis process. By identifying these dependencies, horizontal or vertical distribution of the dataset across processes can be carried out with minimal process intercommunication. In addition, an analysis method is classified as either data-transfer-limited or computationally-limited. In data-transfer-limited problems, data transfer time outweighs processing time. In computationally-limited problems, processing time outweighs data transfer time. The results show that by increasing processor count, the execution time of computationally-limited problems shows improvement. For data-transfer-limited problems, increasing node count offers the greatest improvement. To further improve the performance of computationally-limited problems, a Graphics Processing Unit (GPU) and the Compute Unified Device Architecture (CUDA) framework are used. It is shown that this GPU implementation offers further improvement over the MPI version of the analysis methods tested.
Published: 2017

163. A Multicore Path to Connectomics-on-Demand

Author: Aleksandar Zlateski, Hayk Saribekyan, Tim Kaler, David Budden, Alexander Matveev, Yaron Meirovitch, Gergely Ódor, Wiktor Jakubiuk, Nir Shavit, Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. Department of Mathematics, Matveev, Alexander, Meirovitch, Yaron, Saribekyan, Hayk, Jakubiuk, Wiktor B., Kaler, Timothy, Odor, Gergely, Budden, David, Zlateski, Aleksandar, and Shavit, Nir N.
Subjects: 0301 basic medicine, Multi-core processor, Connectomics, Computer science, Image processing, Parallel computing, Terabyte, Computer Graphics and Computer-Aided Design, Mass storage, 03 medical and health sciences, 030104 developmental biology, 0302 clinical medicine, Path (graph theory), Enhanced Data Rates for GSM Evolution, 030217 neurology & neurosurgery, Software
Abstract: The current design trend in large scale machine learning is to use distributed clusters of CPUs and GPUs with MapReduce-style programming. Some have been led to believe that this type of horizontal scaling can reduce or even eliminate the need for traditional algorithm development, careful parallelization, and performance engineering. This paper is a case study showing the contrary: that the benefits of algorithms, parallelization, and performance engineering, can sometimes be so vast that it is possible to solve "cluster-scale" problems on a single commodity multicore machine. Connectomics is an emerging area of neurobiology that uses cutting edge machine learning and image processing to extract brain connectivity graphs from electron microscopy images. It has long been assumed that the processing of connectomics data will require mass storage, farms of CPU/GPUs, and will take months (if not years) of processing time. We present a high-throughput connectomics-on-demand system that runs on a multicore machine with less than 100 cores and extracts connectomes at the terabyte per hour pace of modern electron microscopes., National Science Foundation (U.S.) (grant IIS-1447786), National Science Foundation (U.S.) (grant CCF1563880), United States. Intelligence Advanced Research Projects Activity (grant 138076-5093555)
Published: 2017

164. SIproc: an open-source biomedical data processing platform for large hyperspectral images

Author: Sebastian Berisha, David Mayerich, Davar Daeinejad, Ziqi He, Shengyuan Chang, Sam Saki, and Rupali Mankar
Subjects: 0301 basic medicine, Gigabyte, Workstation, Computer science, Terabyte, 01 natural sciences, Biochemistry, Article, Analytical Chemistry, law.invention, 03 medical and health sciences, Software, law, Image Processing, Computer-Assisted, Electrochemistry, Octave, Environmental Chemistry, Computer vision, MATLAB, Spectroscopy, computer.programming_language, business.industry, Spectrum Analysis, 010401 analytical chemistry, Process (computing), Hyperspectral imaging, 0104 chemical sciences, 030104 developmental biology, Artificial intelligence, business, computer, Algorithms
Abstract: There has recently been significant interest within the vibrational spectroscopy community to apply quantitative spectroscopic imaging techniques to histology and clinical diagnosis. However, many of the proposed methods require collecting spectroscopic images that have a similar region size and resolution to the corresponding histological images. Since spectroscopic images contain significantly more spectral samples than traditional histology, the resulting data sets can approach hundreds of gigabytes to terabytes in size. This makes them difficult to store and process, and the tools available to researchers for handling large spectroscopic data sets are limited. Fundamental mathematical tools, such as MATLAB, Octave, and SciPy, are extremely powerful but require that the data be stored in fast memory. This memory limitation becomes impractical for even modestly sized histological images, which can be hundreds of gigabytes in size. In this paper, we propose an open-source toolkit designed to perform out-of-core processing of hyperspectral images. By taking advantage of graphical processing unit (GPU) computing combined with adaptive data streaming, our software alleviates common workstation memory limitations while achieving better performance than existing applications.
Published: 2017

165. Big, Linked Geospatial Data and Its Applications in Earth Observation

Author: George Papadakis, Konstantina Bereta, George Stamoulis, Manolis Koubarakis, and Dimitrianos Savva
Subjects: Earth observation, Geospatial analysis, Computer Networks and Communications, Computer science, Data discovery, 02 engineering and technology, Linked data, computer.file_format, Terabyte, computer.software_genre, World Wide Web, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Geospatial PDF, 020201 artificial intelligence & image processing, RDF, computer
Abstract: If the terabytes of Earth observation data currently stored in archives are published on the web using a linked data paradigm, data discovery, integration with other data sources, and the development of applications will become much easier.
Published: 2017

166. Itâ€™s How Many Terabytes?! A Case Study on Managing Large Born Digital Audio-visual Acquisitions

Author: Matthew McKinley and Laura Uglean Jackson
Subjects: World Wide Web, Born-digital, Computer science, Audio visual, Special collections, Terabyte, Digital library, Data science, GeneralLiterature_MISCELLANEOUS, lcsh:Z, lcsh:Bibliography. Library science. Information resources
Abstract: In October 2014, the University of California Irvine (UCI) Special Collections and Archives acquired a born digital collection of 2.5 terabytes – the largest born digital collection acquired by the department to date. This case study describes the challenges we encountered when applying existing archival procedures to appraise, store, and provide access to a large born digital collection. It discusses solutions when they could be found and ideas for solutions when they could not, lessons learned from the experience, and the impact on born-digital policy and procedure at UCI Libraries. Working with a team of archivists, librarians, IT, and California Digital Library (CDL) staff, we discovered issues and determined solutions that will guide our procedures for future acquisitions of large and unwieldy born digital collections.
Published: 2016

167. Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

Author: Davit Buniatyan
Subjects: FOS: Computer and information sciences, Hyperparameter, Multi-core processor, Computer Science - Machine Learning, business.industry, Computer science, Distributed computing, Deep learning, Cloud computing, Terabyte, Machine Learning (cs.LG), Task (computing), Computer Science - Distributed, Parallel, and Cluster Computing, Scalability, Distributed, Parallel, and Cluster Computing (cs.DC), Artificial intelligence, business, Distributed File System
Abstract: Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes, or even, petabyte-scale. We introduce a hybrid distributed cloud framework with a unified view to multiple clouds and an on-premise infrastructure for processing tasks using both CPU and GPU compute instances at scale. The system implements a distributed file system and failure-tolerant task processing scheduler, independent of the language and Deep Learning framework used. It allows to utilize unstable cheap resources on the cloud to significantly reduce costs. We demonstrate the scalability of the framework on running pre-processing, distributed training, hyperparameter search and large-scale inference tasks utilizing 10,000 CPU cores and 300 GPU instances with the overall processing power of 30 petaflops.
Published: 2019

168. Genomic Anomaly Searching with BLAST Algorithm using MapReduce Framework in Big Data Platform

Author: Ramanti Dharayani, Arfive Gandhi, Wahyu Catur Wibowo, and Yova Ruldeviyani
Subjects: Data flow diagram, Complex data type, Gigabyte, Computer science, business.industry, Business process, Big data, Volume (computing), Terabyte, business, Time complexity, Algorithm
Abstract: Biofarma Corp should adopt big data on vaccine and serum development by analyze genomic sequencing using searching any anomaly. As the root problem, it the anomaly searching requires about 1.62 Terabytes data transient as primary data and 301 Gigabytes as secondary data to get analysis from genomic variance. Moreover Biofarma Corp spent 16 hours for one anomaly searching from 3 Terabytes vaccines. This study proposed big data implementation to handle anomaly searching processes by prioritize less time complexity and less spending storage. It was signalized by a research question, “How big data technology is applied in searching anomalies on genomic data”. It aimed to implement big data system to facilitate large volume and complex data in order to fulfill business process on Biofarma Corp. It adopted framework architecture as brought by Demchenko, Ngo, and Membrey. This study has designed data flow from FASTA and FATQ as sources for anomalies searching processes. This data flow is facilitated in big data system as designed in this research. As main contribution, this research adopted MapReduce framework to run BLAST algorithm with less spending time. As comparison, MapReduce framework can handle 21, 33, and 55 K-Mer in four minutes respectively while 50 minutes were spent without MapReduce.
Published: 2019

169. Digital Forensics Investigation Reduction Model (DIFReM) Framework for Windows 10 OS

Author: Yazid Haruna Shayau, Noor Afiza Mohd Ariffin, Siti Nurulain Mohd Rum, and Aziah Asmawi
Subjects: Standardization, business.industry, Computer science, Digital forensics, 020206 networking & telecommunications, 02 engineering and technology, Terabyte, Computer security, computer.software_genre, Automation, Information security standards, 0202 electrical engineering, electronic engineering, information engineering, Microsoft Windows, 020201 artificial intelligence & image processing, The Internet, business, computer, Countermeasure (computer)
Abstract: The advent of the digital age, globalization and automation has made life easier for people and businesses. However, the ubiquitous use of digital devices and the Internet also heightens the risk and incidents of cybercrimes. Under these circumstances, Digital Forensics has become a critical countermeasure. The ISO/IEC 27001 (Information security standards published jointly by the International Organization for Standardization – ISO and the International Electrotechnical Commission-IEC) provides guidance on identifying, gathering/collecting/acquiring, handling and protecting/preserving Digital Forensic evidence for use in court. The most challenging and important part of Digital Forensic Investigation (DFI) is data examination. Knowing the data created by the Operating System (OS) or user beforehand would ease the process. Unfortunately, most of the time, such details are not available to facilitate investigation. The examination phase is the most challenging for an investigator; in Microsoft Windows OS (Operating System). Investigators have to go through terabytes of system data, most of which are OS and application files irrelevant to the investigation from a suspect’s computer. To address the problem highlighted above, this research proposes a data reduction model (DIFReM) and a tool which will not only help the investigator in identifying modified system files but also has the ability to detect files inserted into system directories and also be able to verify integrity using hashing. In the end, this research will provide the investigator with a more effective and efficient digital forensics tools.
Published: 2019

170. Demonstration of Reliable High-Rate Optical Communication over an Atmospheric Link using ARQ

Author: Ajay S. Garg, Bryan S. Robinson, Curt Schieler, Bryan C. Bilyeu, and J. P. Wang
Subjects: Terminal (telecommunication), Computer science, business.industry, Retransmission, Automatic repeat request, ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS, Frame (networking), Optical communication, Fading, Terabyte, Transceiver, business, Computer hardware
Abstract: Optical communication systems for satellites in low-Earth orbit are capable of offloading large quantities of sensor data, on the order of terabytes per pass, to a ground terminal or network of ground terminals. NASA plans to demonstrate this capability with the Terabyte Infrared Delivery (TBIRD) program, which leverages fiber-telecom transceivers to burst data down to ground at 200 Gbps during short passes. The cornerstone of the TBIRD architecture is the use of commercial off-the-shelf transceivers, which perform well in fiber channels but do not guarantee reliable communication in space-to-ground atmospheric channels. To address this deficiency, we have built an Automatic Repeat reQuest (ARQ) system on top of the transceivers that enables high-rate error-free communication over fading channels of interest. The ARQ system implements the digital logic required to process custom data frames and uses a low-rate feedback link for frame retransmission requests. To validate the system, we have operated over a several-kilometer atmospheric link. We have successfully demonstrated error-free communication at data rates close to 100 Gbps in a variety of atmospheric fading conditions.
Published: 2019

171. More Complete Resultset Retrieval from Large Heterogeneous RDF Sources

Author: Tommaso Soru, André Valdestilhas, and Muhammad Saleem
Subjects: Service (systems architecture), Information retrieval, Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, 010401 analytical chemistry, InformationSystems_DATABASEMANAGEMENT, 02 engineering and technology, computer.file_format, Terabyte, computer.software_genre, 01 natural sciences, 0104 chemical sciences, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), SPARQL, RDF, Raw data, computer, Data integration
Abstract: Over the last years, the Web of Data has grown significantly. Various interfaces such as LOD Stats, LOD Laudromat, SPARQL endpoints provide access to the hundered of thousands of RDF datasets, representing billions of facts. These datasets are available in different formats such as raw data dumps and HDT files or directly accessible via SPARQL endpoints. Querying such large amount of distributed data is particularly challenging and many of these datasets cannot be directly queried using the SPARQL query language. In order to tackle these problems, we present WimuQ, an integrated query engine to execute SPARQL queries and retrieve results from large amount of heterogeneous RDF data sources. Presently, WimuQ is able to execute both federated and non-federated SPARQL queries over a total of 668,166 datasets from LOD Stats and LOD Laudromat as well as 559 active SPARQL endpoints. These data sources represent a total of 221.7 billion triples from more than 5 terabytes of information from datasets retrieved using the service "Where is My URI" (WIMU). Our evaluation on state-of-the-art real-data benchmarks shows that WimuQ retrieves more complete results for the benchmark queries.
Published: 2019

172. Co-extruded multilayer optical data storage media: Toward terabyte scale discs (Conference Presentation)

Author: Kenneth D. Singer and Irina Shiyanovskaya
Subjects: 3D optical data storage, Materials science, Fabrication, business.industry, Reading (computer), Terabyte, Laser, Buffer (optical fiber), law.invention, law, Optoelectronics, business, Optical filter, Optical disc
Abstract: Multilayer optical data storage is a promising approach for realizing archival optical discs with terabyte capacity for applications in enterprise data storage. We report on the fabrication of optical discs containing 16 layers from a high-scalable multilayer polymer film co-extrusion process. Polymer co-extrusion is a well-established roll-to-roll manufacturing process with applications as diverse as food packaging and high performance optical filters. We have adapted this to produce films with alternating active and buffer layers. The film is easily fabricated into optical discs with the potential capacity of several terabytes. Data is stored in voxels defined by photobleaching a fluorescent or reflective dye contained in writable layers of 200-300nm thickness separated by inert layers of 2-3 microns. We have shown that at short pulse durations of a pulse-modulated commercial 405nm laser, the nonlinear writing process within the absorption band of the dye exhibits a distinct threshold, thus promising low crosstalk and sub-diffraction limit bit patterns. Results on writing physics will be presented. We have recently demonstrated that data can be written and read using a novel optical pick-up. The confocal optical configuration for reading suggests that the drive developed for our discs could be backward compatible with earlier commercial optical discs. Studies of photostability and defect density suggest the suitability of this technology for long-term, high-performance enterprise archival data storage.
Published: 2019

173. Measurement and Analysis of Adult Websites in IPv6 Networks

Author: Jiahai Yang, Guanglei Song, Jianping Wu, Hui Zhang, and Shize Zhang
Subjects: business.industry, Network packet, Computer science, computer.internet_protocol, ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS, Internet privacy, 020206 networking & telecommunications, 020302 automobile design & engineering, 02 engineering and technology, Terabyte, IPv4, IPv6, Network management, 0203 mechanical engineering, 0202 electrical engineering, electronic engineering, information engineering, The Internet, Duration (project management), business, computer
Abstract: The Internet is in the transition from IPv4 to IPv6. At present, researches on IPv6 networks mainly focus on architectural issues, such as routing, addressing, and security; there are few studies on the operational issues of IPv6 networks. Our preliminary observation shows that there are a large amount adult websites and traffic in IPv6 networks. Adult websites can damage health of teenagers and bring operational issues in IPv6 networks. This paper conducts a comprehensive measurement and analysis of the adult websites and traffic in IPv6 networks to help solve these operational issues. The data used in this paper is the raw packet traffic from CNGI-CERNET2 which is a pure IPv6 academic network in China. The duration of the data is from July 2017 to January 2018 and the total amount is 40+ terabytes. We detected about 3000 adult websites in the global IPv6 network. This paper analyzes these adult websites and traffic from the perspectives of websites, users and ISPs respectively. We find that adult websites are still in the developing stage in IPv6 networks and only 30% adult websites with full resources can be accessed in IPv6-only networks. But due to the IPv6-first policy in RFC 4038, adult traffic will continue to migrate to IPv6 networks from IPv4 networks. On the other hand, we find that CDN vendors promote the development of adult websites in IPv6 networks and many adult website owners use muti-domain policies to escape ISPs restricting. Our findings may help ISPs effectively understand adult websites and enhance the restriction of adult content in IPv6 networks.
Published: 2019

174. Big Data and Cloud Computing: An Emerging Perspective and Future Trends

Author: Sanjiv Sharma, P. K. Yadav, and Amar Singh
Subjects: Exabyte, Computer science, business.industry, Big data, Petabyte, 020207 software engineering, Cloud computing, 02 engineering and technology, Terabyte, Data science, Data warehouse, Variety (cybernetics), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Cloud database, business
Abstract: Big Data is a collection of large amount of data which is growing very rapidly with the popularity of social networking sites. The size of the Big data has been extended from terabytes to petabytes. Big data are characterized by four important attributes: volume, velocity, variety and veracity. The volume attributes describe the data at rest in the range from terabytes to Exabyte’s, the velocity deals with data in motion, i.e streaming the data to respond within milliseconds rather than in seconds, the variety discuss the data in many different forms such as structured, unstructured, text and multimedia data, whereas the veracity deals with the data in doubt, i.e. uncertainty in data due to data inconsistency. These attributes of big data make it a challenge for organizations to have control over, and use such data. In today’s era we are overloaded with the information, however we are lacking the insight. The 90% of the data in the world today has been created in the last two years by various social networking sites. As big data grows it challenges the capabilities of traditional data warehouses that collect and store large amounts of internal and external data. Data drawn from these repositories are used to improve decision making, increase organizational efficiencies, and improve organizational effectiveness. In this paper an attempt has been made to explore the real life application of Big Data , cloud database, Hadoop, Map Reduce and Cloud Computing, in various domains.
Published: 2019

175. Comparative Study of Big Data Frameworks

Author: Hriday Kumar Gupta and Rafat Parveen
Subjects: Complex data type, Data processing, Computer science, Scala, business.industry, Distributed computing, 05 social sciences, Big data, Petabyte, Intension, 02 engineering and technology, Terabyte, 020204 information systems, 0502 economics and business, Scalability, 0202 electrical engineering, electronic engineering, information engineering, 050211 marketing, business, computer, computer.programming_language
Abstract: We are really living in ever growing volume of data production. The huge amount of data in terabyte and petabytes are generating in real word and it is a challenging task to access, storage, analysis of all structured, unstructured and semi structured heterogeneous and complex data, also traditional tools is not suitable towards distributed and real-time processing. We need an efficient framework for processing such heterogeneous data and transform it into optimized meaningful information. There are many frameworks for distributed computing has been developed to perform huge amount of data processing. Hadoop Map Reduce is the extensively used framework because of its scalability, security, latency and efficiency, and reliability. The intension of this paper is to relative study of common framework such as Hadoop, Spark, Flink, Samza and Storm.
Published: 2019

176. Opportunities for Partitioning Non-Volatile Memory DIMMs between Co-scheduled Jobs on HPC Nodes

Author: Brice Goglin, Andres Rubio Proaño, Topology-Aware System-Scale Data Management for High-Performance Computing (TADAAM), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, and Rubio Proaño, Andrès
Subjects: [INFO.INFO-AR]Computer Science [cs]/Hardware Architecture [cs.AR], [INFO.INFO-AR] Computer Science [cs]/Hardware Architecture [cs.AR], Co-scheduling, Computer science, 02 engineering and technology, Non Volatile Memory DIMM, Terabyte, computer.software_genre, 020204 information systems, [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], 0202 electrical engineering, electronic engineering, information engineering, Locality, Hardware_MEMORYSTRUCTURES, Fault tolerance, DIMM, 020202 computer hardware & architecture, Non-volatile memory, NVDIMM, [INFO.INFO-OS] Computer Science [cs]/Operating Systems [cs.OS], DAX, Operating system, [INFO.INFO-OS]Computer Science [cs]/Operating Systems [cs.OS], [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], computer, Partitioning, Volatile memory
Abstract: International audience; The emergence of non-volatile memory DIMMs such as Intel Optane DCPMM blurs the gap between usual volatile memory and persistent storage by enabling byte-accessible persistent memory with reasonable performance. This new hardware supports many possible use cases for high-performance applications, from high performance storage to very high-capacity volatile memory (terabytes). However the numerous ways to configure the memory subsystem raises the question of how to configure nodes to satisfy applications' needs (memory, storage, fault tolerance, etc.). We focus on the issue of partitioning HPC nodes with NVDIMMs in the context of co-scheduling multiple jobs. We show that the basic NVDIMM configuration modes would require node reboots and expensive hardware configuration. Moreover it does not allow the co-scheduling of all kinds of jobs, and it does not always allow locality to be taken into account during resource allocation. Then we show that using 1-Level-Memory and the Device DAX mode by default is a good compromise. It may be easily used and partitioned for storage and memory-bound applications with locality awareness.
Published: 2019

177. Exploring Transfer Learning to Reduce Training Overhead of HPC Data in Machine Learning

Author: Shakeel Alibhai, Qing Liu, Xubin He, Tong Liu, Jinzhen Wang, and Chentao Wu
Subjects: Artificial neural network, Computer science, Process (engineering), business.industry, Scale (chemistry), Training (meteorology), Petabyte, 02 engineering and technology, Terabyte, Machine learning, computer.software_genre, 030218 nuclear medicine & medical imaging, 03 medical and health sciences, 0302 clinical medicine, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Overhead (computing), Artificial intelligence, business, Transfer of learning, computer
Abstract: Nowadays, scientific simulations on high-performance computing (HPC) systems can generate large amounts of data (in the scale of terabytes or petabytes) per run. When this huge amount of HPC data is processed by machine learning applications, the training overhead will be significant. Typically, the training process for a neural network can take several hours to complete, if not longer. When machine learning is applied to HPC scientific data, the training time can take several days or even weeks. Transfer learning, an optimization usually used to save training time or achieve better performance, has potential for reducing this large training overhead. In this paper, we apply transfer learning to a machine learning HPC application. We find that transfer learning can reduce training time without, in most cases, significantly increasing the error. This indicates transfer learning can be very useful for working with HPC datasets in machine learning applications.
Published: 2019

178. A New Storage Workload Management Model for High-Performance I/O Systems

Author: Lu Yutong, Huang Huang, and You Sheng
Subjects: Input/output, 0209 industrial biotechnology, Computer science, Process (engineering), Distributed computing, Petabyte, 02 engineering and technology, Terabyte, Supercomputer, Bottleneck, Petascale computing, 020901 industrial engineering & automation, Middleware, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing
Abstract: Data-intensive applications involve with massive data, ranging from terabytes to petabytes, which is a critical bottleneck in the next generation systems. In this article, we mainly focus on system I/O competition caused by the data-intensive applications and unbalanced storage capacity, and attempt to develop effective solutions to alleviate this issue. To this end, we develop an I/O middleware based method. The core of our middleware is the storage workload management model (SWMM), which allows us to adjust and optimize the I/O access process efficiently and flexibly. We have implemented our solution and conducted extensive tests based on a petascale system. The experimental results demonstrate the effectiveness and efficiency of our solution.
Published: 2019

179. Processing and Analyzing Big Data Generated from Data Communication and Social Networks: In-terms of Performance Speed and Accuracy

Author: L B Shyamasundar and P Jhansi Rani
Subjects: Data processing, Multi-core processor, Computer science, business.industry, Sentiment analysis, Big data, Predictive analytics, Terabyte, computer.software_genre, Visualization, Support vector machine, Data mining, business, computer
Abstract: A multiple layer architecture for sentiment analysis has been discussed in the proposed work which achieves a best accuracy of 89.61% using SVM ML classifier. Obtained results conclude that, a better accuracy has been achieved in the proposed scheme compared to existing schemes in the literature. Predictive analytics has been done against tweets collected during IPL10 cricket 20 overs final match. Existing literature highlight conceptual relationships, physicaal locations or topical changes but the textual is not being visualized primarily. Proposed scheme works on exploring hidden semantics which helps in the visualization of real text content. Also, another work has been done where data processing environment is developed using Apache SPARk which is deployed above an existing YARN cluster. With the proposed setup and tuning of resource allocation strategies, we could analyze 85GB of network trace data within 78 seconds by using a distributed 32 node cluster, having a capacity of 1 terabyte RAM and 256 CPU cores. If processing the same amount of data in traditional systems, it will take around 30 minutes.
Published: 2019

180. 'Space Explorers: Life in Orbit'

Author: Félix Lajeunesse, Liz Warren, Sebastian Sylwan, and Michael A. Interbartolo
Subjects: Engineering, Mission control center, business.industry, 020207 software engineering, 02 engineering and technology, Terabyte, Virtual reality, Session (web analytics), Space exploration, Visual arts, International Space Station, 0202 electrical engineering, electronic engineering, information engineering, Narrative, business, Studio
Abstract: In December 2018, TIME and Felix & Paul Studios launched virtual reality cameras---built to operate in microgravity---to the International Space Station. Since then, filming has documented astronauts from several countries in their daring missions more than 250 miles above Earth, capturing life in space as viewers have never truly seen before, and culminating in the first-ever spacewalk in cinematic virtual reality. Join Felix & Paul Studios, along with collaborators from NASA and the ISS National Lab, as they share insights from one of the most ambitious VR projects ever undertaken. In this production session, we will discuss the background of how this partnership came to be, before diving into the technical challenges of capturing cinematic virtual reality on the ISS. How do you direct a scene in such a tight and constrained place, especially while down on Earth? How can you transfer terabytes of data from the cameras to Mission Control? And finally, what does it take to build and operate cameras that can capture a spacewalk? The team will explore the variety of challenges inherent in such a groundbreaking project, from building a camera that can capture an EVA (extra-vehicular activity) in the extreme environment of space, to tracking and crafting months of astronaut footage into a cohesive episodic narrative. Finally, the team will share never-before-seen early footage from the project.
Published: 2019

181. Complex Approach of High-Resolution Multispectral Data Engineering for Deep Neural Network Processing

Author: Volodymyr V. Hnatushenko and Vadym Zhernovyi
Subjects: Geospatial analysis, Artificial neural network, Emergency management, Computer science, business.industry, Deep learning, Image processing, Terabyte, computer.software_genre, Field (computer science), Image (mathematics), Artificial intelligence, Data mining, business, computer
Abstract: A lot of terabytes of complex geospatial data are acquired every day, and it is used in almost every field of science and solves such problems as vegetation health monitoring, disaster management, surveillance, etc. In order to solve mentioned problems this data usually requires multiple steps of pre-processing before inferencing via machine learning algorithms. These steps may include such families of algorithms as image tiling or data augmentation. However, various studies focused on the basic concepts and research on techniques for remote sensing very high-resolution data pre-processing is in scarce.
Published: 2019

182. Terabyte-scale Particle Data Analysis

Author: Kesheng Wu, Suren Byna, Patrick Kilian, Bin Dong, Xiaocan Li, and Fan Guo
Subjects: Flexibility (engineering), 050101 languages & linguistics, Process (engineering), Computer science, Scale (chemistry), 05 social sciences, Locality, 02 engineering and technology, Terabyte, Computational science, Range (mathematics), 0202 electrical engineering, electronic engineering, information engineering, Particle, 020201 artificial intelligence & image processing, 0501 psychology and cognitive sciences, Reduction (mathematics)
Abstract: A prime question for plasma physicists is how a fraction of charged particles is accelerated to very high energy.To answer this question, physicists simulate trillions of particles with detailed dynamics and analyze their trajectories. This process requires a range of data analysis tasks with high diversity. In this paper, we present a use case of formulating various analysis tasks on terabyte-scale particle data with a novel data analysis framework called ArrayUDF. The flexibility of ArrayUDF allows it to compose a wide range of particle data operations. We also present optimization strategies to avoid frequent global reduction and to take full advantage of the data locality. Tests show that our optimization methods could accelerate these particle data analysis operations by up to 1,600 times.
Published: 2019

183. Applying Big Data to Pediatric Care

Author: Justin P. Smith, Benson S Hsu, and Emily R. Griese
Subjects: Big Data, Geospatial analysis, business.industry, Medical record, Big data, Petabyte, Terabyte, computer.software_genre, Data science, Pediatrics, 03 medical and health sciences, 0302 clinical medicine, 030225 pediatrics, Pediatrics, Perinatology and Child Health, Health care, Medicine, Humans, Relevance (information retrieval), 030212 general & internal medicine, Dimension (data warehouse), business, computer, Delivery of Health Care
Abstract: 1. Benson S. Hsu, MD, MBA*,†,‡ 2. Justin P. Smith, PhD* 3. Emily R. Griese, PhD*,†,‡ 1. *Sanford Health, Sioux Falls, SD 2. †Department of Pediatrics, University of South Dakota Sanford School of Medicine, Vermillion, SD 3. ‡Sanford Research, Sioux Falls, SD At the forefront of health-care innovation is the ability to harness the power and potential of big data, commonly described as data with 3 key components: volume, variety, and velocity. Health-care gathers one of the largest and most diverse data sets, with some health systems harboring nearly 2 decades worth of electronic medical records (EMRs). Big data, however, is more than just a large volume of data (often a terabyte or even petabyte in size); it also intentionally merges a variety of data sources. In health-care, this may comprise traditional EMR data combined with data sets such as census and geospatial data. Coupled with velocity, ie, the capacity to manage rapidly generated data (ie, through personal health monitors), big data provides a new dimension to health-care, one with the possibility to transform the way patient care is provided. This In Brief aims to explore advances in using big data, the possibilities that exist with its application, and its relevance to health-care professionals. Health-care data are messy. Considering the number of data collectors in a clinic or hospital or health-care system, from nurses to physicians to registrars to coders, and the various data captures, from physician notes to pathology reports to coding abstractions, the potential for interpretation and recorder biases are numerous, resulting in the familiar adage of “garbage in, garbage out.” Combining non–health-care data sources with traditional health-care data is further fraught with concerns given the …
Published: 2019

184. Deep Multimodal Learning: An Effective Method for Video Classification

Author: Tianqi Zhao
Subjects: FOS: Computer and information sciences, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Deep learning, Big data, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Terabyte, Machine learning, computer.software_genre, Ensemble learning, Multimedia (cs.MM), Multimodal learning, Recurrent neural network, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Sequence learning, Language model, Artificial intelligence, business, computer, Computer Science - Multimedia
Abstract: Videos have become ubiquitous on the Internet. And video analysis can provide lots of information for detecting and recognizing objects as well as help people understand human actions and interactions with the real world. However, facing data as huge as terabyte (TB) level, effective methods should be applied. Recurrent neural network (RNN) architecture has wildly been used on many sequential learning problems such as Language Model, Time-Series Analysis, etc. In this paper, we propose some variations of RNN such as stacked bidirectional Long Short Term Memory (LSTM)/Gated Recurrent Unit (GRU) network with attention mechanism to categorize large-scale video data. We also explore different multimodal fusion methods. Our model combines both visual and audio information on both video and frame level and received great result. Ensemble methods are also applied. Because of its multimodal characteristics, we decide to call this method Deep Multimodal Learning (DML). Our DML-based model was trained on Google Cloud and our own server and was tested in a well-known video classification competition on Kaggle held by Google.
Published: 2019

185. SMURF: Efficient and Scalable Metadata Access for Distributed Applications from Edge to the Cloud

Author: Tevfik Kosar and Bing Zhang
Subjects: Instruction prefetch, Metadata, Hardware_MEMORYSTRUCTURES, Computer science, business.industry, Distributed computing, Scalability, Locality, Petabyte, Cloud computing, Latency (engineering), Terabyte, business
Abstract: In parallel with big data processing and analysis dominating the usage of distributed and cloud infrastructures, the demand for distributed metadata access and transfer has increased. In many application domains, the volume of data generated exceeds petabytes, while the corresponding metadata amounts to terabytes or even more. In this paper, we propose a novel solution for efficient and scalable metadata access for distributed applications across wide-area networks, dubbed SMURF. Our solution combines novel pipelining and concurrent transfer mechanisms with reliability, provides distributed continuum caching and prefetching strategies to sidestep fetching latency, and achieves scalable and high-performance metadata fetch/prefetch services in the cloud. We also study the phenomenon of semantic locality in real trace logs which is not well utilized in metadata access prediction. We implement our predictor based on this observation and compare it with three existing state-of-the-art prefetch schemes on Yahoo! Hadoop audit traces. By effectively caching and prefetching metadata based on the access patterns, our continuum caching and prefetching mechanism greatly improves local cache hit rate and reduces the average fetching latency. We replayed approximately 20 Million metadata access operations from real audit traces, in which our system achieved 80% accuracy during prefetch prediction and reduced the average fetch latency 50% compared to the state-of-the-art mechanisms.
Published: 2019

186. Fog Computing for Smart Cities

Author: J. Sirisha Devi, Y. Mohana Roopa, and Chintam Muni Kanaka Sri Shalini
Subjects: 020203 distributed computing, business.industry, Computer science, Big data, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Experimental architecture, 020206 networking & telecommunications, Cloud computing, 02 engineering and technology, Terabyte, Data science, Fog computing, Smart city, 0202 electrical engineering, electronic engineering, information engineering, Architecture, business, Internet of Things, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: Most of the rural areas keep declining and smart cities keep evolving in its own phase. Data is one of the most important resources to maintain as terabytes of data is collected from a single city. Data is analyzed using big data analysis and it can be stored by using the fog computing and it also describes how fog computing is effective in IoT when compared to cloud-based computing and how fog-based computing is introducing its innovative applications. The research work describes cloud and fog architecture and furthermore we will learn about a new experimental architecture. It is the combination of both fog based and cloud-based technologies. Fog computing is a technology that is rapidly influencing the emerging digital technologies and applications. There are many challenges in maintaining the data through fog computing technologies, these challenges are mentioned along with their security concerns that help for the transition from cloud to fog. This paper mainly tells us how fog computing would emerge as the future of IoT and how it would change the quality of life at an unprecedented rate.
Published: 2019

187. Anubis

Author: Kazi Abu Zubair and Amro Awad
Subjects: Scheme (programming language), Low overhead, business.industry, Computer science, Distributed computing, Terabyte, Encryption, Merkle tree, Metadata, Overhead (computing), business, Anubis, computer, computer.programming_language
Abstract: Implementing secure Non-Volatile Memories (NVMs) is challenging, mainly due to the necessity to persist security metadata along with data. Unlike conventional secure memories, NVM-equipped systems are expected to recover data after crashes and hence security metadata must be recoverable as well. While prior work explored recovery of encryption counters, fewer efforts have been focused on recovering integrity-protected systems. In particular, how to recover Merkle Tree. We observe two major challenges for this. First, recovering parallelizable integrity trees, e.g., Intel's SGX trees, requires very special handling due to inter-level dependency. Second, the recovery time of practical NVM sizes (terabytes are expected) would take hours. Most data centers, cloud systems, intermittent-power devices and even personal computers, are anticipated to recover almost instantly after power restoration. In fact, this is one of the major promises of NVMs. In this paper, we propose Anubis, a novel hardware-only solution that speeds up recovery time by almost 107 times (from 8 hours to only 0.03 seconds). Moreover, we propose a novel and elegant way to recover inter-level dependent trees, as in Intel's SGX. Most importantly, while ensuring recoverability of one of the most challenging integrity-protection schemes among others, Anubis incurs performance overhead that is only 2% higher than the state-of-the-art scheme, Osiris, which takes hours to recover systems with general Merkle Tree and fails to recover SGX-style trees.
Published: 2019

188. Waveform Signal Entropy and Compression Study of Whole-Building Energy Datasets

Author: Daniel Jorde, Thomas Kriechbaumer, and Hans-Arno Jacobsen
Subjects: Lossless compression, Electrical energy consumption, Computer science, 020209 energy, Building energy, 020206 networking & telecommunications, 02 engineering and technology, Terabyte, Grid, File format, computer.software_genre, 0202 electrical engineering, electronic engineering, information engineering, Waveform, Entropy (information theory), Data mining, computer
Abstract: Electrical energy consumption has been an ongoing research area since the coming of smart homes and Internet of Things. Consumption characteristics and usages profiles are directly influenced by building occupants and their interaction with electrical appliances. Data analysis together with machine learning models can be utilized to extract valuable information for the benefit of occupants themselves (conserve energy and increase comfort levels), power plants (maintenance), and grid operators (stability). Public energy datasets provide a scientific foundation to develop and benchmark these algorithms and techniques. With datasets exceeding tens of terabytes, we present a novel study of five whole-building energy datasets with high sampling rates, their signal entropy, and how a well-calibrated measurement can have a significant effect on the overall storage requirements. We show that some datasets do not fully utilize the available measurement precision, therefore leaving potential accuracy and space savings untapped. We benchmark a comprehensive list of 365 file formats, transparent data transformations, and lossless compression algorithms. The primary goal is to reduce the overall dataset size while maintaining an easy-to-use file format and access API. We show that with careful selection of file format and encoding scheme, we can reduce the size of some datasets by up to 73%.
Published: 2019

189. Exploiting Multi-Level Parallelism for Stitching Very Large Microscopy Images

Author: Massimo Bernaschi, Alessandro Bria, Giulio Iannello, and Massimiliano Guarrasi
Subjects: Speedup, parallel processing, Computer science, terabyte images, GPU, Biomedical Engineering, Neuroscience (miscellaneous), Parallel computing, Terabyte, 050105 experimental psychology, lcsh:RC321-571, Image stitching, Reduction (complexity), 03 medical and health sciences, 0302 clinical medicine, stitching, Code (cryptography), 0501 psychology and cognitive sciences, Technology Report, lcsh:Neurosciences. Biological psychiatry. Neuropsychiatry, [3D microscopy, data partitioning, 05 social sciences, Software maintenance, Computer Science Applications, Parallel processing (DSP implementation), ICT, 3D microscopy, User interface, 030217 neurology & neurosurgery
Abstract: Due to the limited field of view of the microscopes, acquisitions of macroscopic specimens require many parallel image stacks to cover the whole volume of interest. Overlapping regions are introduced among stacks in order to make it possible automatic alignment by means of a 3D stitching tool. Since state-of-the-art microscopes coupled with chemical clearing procedures can generate 3D images whose size exceeds the Terabyte, parallelization is required to keep stitching time within acceptable limits. In the present paper we discuss how multi-level parallelization reduces the execution times of TeraStitcher, a tool designed to deal with very large images. %Parallelization has been carried out in such a way that its impact on user interface and software maintenance is negligible. Two algorithms performing dataset partition for efficient parallelization in a transparent way are presented together with experimental results proving the effectiveness of the approach that achieves a speedup close to 300$\times$, when both coarse- and fine-grained parallelism are exploited. Multi-level parallelization of TeraStitcher led to a significant reduction of processing times with no changes in the user interface, and with no additional effort required for the maintenance of code.
Published: 2019
Full Text: View/download PDF

190. Performance Implementation of Multi-access Edge Computing at Indonesia Telco Operator

Author: Gunawan Wibisono and Marazuddin Budianto Harahap
Subjects: GPRS core network, Market research, Data access, business.product_category, Operator (computer programming), business.industry, Payload, Computer science, Internet access, Terabyte, business, Edge computing, Computer network
Abstract: Increased data traffic continues significantly, especially in Indonesia. Based on a survey by Hootsuite - We are Social, it is said that in 2018 the penetration of Internet users in Indonesia is up to 50% of the total population of Indonesia or amounting to 132.7 billion people. 91% of Internet use is accessed from a smartphone or tablet. Data traffic on PT XYZ cellular operators is increasing every year. In 2017, up to 2 million TeraByte of traffic is handled. This has become a problem in terms of traffic and capacity at PT XYZ as a cellular operator. A solution to these problems is needed, so that XYZ operators can continue to be competent in serving users in terms of data access and Internet access. Based on the literature and previous information, the addition of the amount of processing capacity and processing of data packages is the answer to the problem. Addition of capacity can be done with three solutions choices, namely: the implementation of the Multi-access Edge Computing (MEC) architecture, the addition of the GGSN module, and the addition of the GGSN to existing networks. In this study, an analysis of the feasibility of MEC for PT XYZ operators will be conducted. Analysis of three solutions choices will be carried out. The analysis conducted in this study is seen from the aspects of technological feasibility. The results of this study indicate that MEC implementation is feasible to implement. MEC implementation solution is the best choice, which has capacity 32.293 Gbps in total and has ability to handle traffic till 2023. There are two scenarios in the implementation of the MEC, first is done after the implementation of the temporary solution in October 2021 and the second is immediately at the beginning of 2020.
Published: 2019

191. CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics

Author: Roshan Dathathri, Loc Hoang, Gurbinder Gill, and Keshav Pingali
Subjects: Cusp (singularity), Graph analytics, Gigabyte, Computer science, Graph partition, 010103 numerical & computational mathematics, 02 engineering and technology, Parallel computing, Terabyte, 01 natural sciences, Partition (database), Graph, Vertex (geometry), Range (mathematics), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Cluster (physics), General Earth and Planetary Sciences, Partition (number theory), 020201 artificial intelligence & image processing, Enhanced Data Rates for GSM Evolution, 0101 mathematics, General Environmental Science, Abstraction (linguistics), MathematicsofComputing_DISCRETEMATHEMATICS
Abstract: Graph analytics systems must analyze graphs with billions of vertices and edges which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large graphs since the main memory of a single machine is usually restricted to a few hundreds of gigabytes. This requires partitioning the graph among the machines in the cluster. Existing graph analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy is dependent on the algorithm, input graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and generates high-quality graph partitions fast. For example, it can partition wdc12, the largest publicly available web-crawl graph, with 4 billion vertices and 129 billion edges, in under 2 minutes for clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.
Published: 2019

192. High Speed, Low Power, and Ultra-Small Operating Platform with Three-Dimensional Integration (3DI) by Bumpless Interconnects

Author: Takayuki Ohba and Koji Sakui
Subjects: Three dimensional integration, Computer science, 020208 electrical & electronic engineering, NAND gate, ComputerApplications_COMPUTERSINOTHERSYSTEMS, Hardware_PERFORMANCEANDRELIABILITY, 02 engineering and technology, Terabyte, 021001 nanoscience & nanotechnology, Chip, Line (electrical engineering), Power (physics), Hardware_INTEGRATEDCIRCUITS, 0202 electrical engineering, electronic engineering, information engineering, Electronic engineering, Wafer, 0210 nano-technology, Electrical impedance
Abstract: This paper proposes a fundamental architecture for a realization of an ultra-small computing system. The prospect of three-dimensional (3D) integration for Terabyte large scale integration using bumpless interconnects with low-aspect-ratio TSVs and ultra-thinning are discussed. Bumpless (no bump) interconnects between wafers are a second-generation alternative to the use of micro-bumps for Wafer-on-Wafer (WOW) technology. The bumpless interconnects technology can increase the number of TSVs per chip with fine pitch of TSVs, and reduce the impedance of the TSV interconnects with no bumps. Therefore, a promising operating platform with a higher speed by enhancing parallelism, lower power by no bumps, and smaller size by thinning wafers, can be realized. By using this technology, the 3D NAND can read and program by plane instead of by line.
Published: 2019

193. LightStore

Author: Sungjin Lee, Jinhyung Koo, Arvind, Chanwoo Chung, and Junsu Im
Subjects: 010302 applied physics, Hardware_MEMORYSTRUCTURES, Adapter (computing), Computer science, NAND gate, 020206 networking & telecommunications, 02 engineering and technology, Terabyte, computer.software_genre, 01 natural sciences, Server, 0103 physical sciences, Distributed data store, 0202 electrical engineering, electronic engineering, information engineering, Operating system, x86, Software-defined networking, computer, Dram
Abstract: We propose LightStore, a key-value flash store, as a substitute for x86-based storage servers. A LightStore node has a low-power embedded-class processor, a few gigabytes of DRAM and a few terabytes of NAND flash, and can be directly connected to a network port in a datacenter. A large-scale distributed storage cluster can be formed simply by adding more LightStore nodes to the network. Applications in a datacenter can take multiple software-defined views of LightStore stores via thin LightStore adapter layers, which translate conventional KV, YCSB, block, and file accesses to KV ones for LightStore. LightStore is estimated to be 2.0x power-efficient and 2.3x space-efficient than an x86-based all-flash array system of the same capacity. Experimental results on our LightStore prototype show that 1) the LightStore node performance is comparable to an x86 server with a single SSD; 2) a four-node LightStore cluster exhibits up to 7.4x better ops/J than an x86 server with four SSDs.
Published: 2019

194. Presto: SQL on Everything

Author: Christopher Berner, Wenlei Xie, Haozhun Jin, Raghav Sethi, David Andrew Phillips, Hwang Eric W, Nileema Bharat Shingte, Nezih Yegitbasi, Martin Traverso, Dain Sundstrom, and Yutian Sun
Subjects: SQL, Database, Computer science, business.industry, 02 engineering and technology, Terabyte, computer.software_genre, NoSQL, Data warehouse, Stream processing, Analytics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Plug-in, business, computer, computer.programming_language
Abstract: Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. Presto is designed to be adaptive, flexible, and extensible. It supports a wide variety of use cases with diverse characteristics. These range from user-facing reporting applications with sub-second latency requirements to multi-hour ETL jobs that aggregate or join terabytes of data. Presto's Connector API allows plugins to provide a high performance I/O interface to dozens of data sources, including Hadoop data warehouses, RDBMSs, NoSQL systems, and stream processing systems. In this paper, we outline a selection of use cases that Presto supports at Facebook. We then describe its architecture and implementation, and call out features and performance optimizations that enable it to support these use cases. Finally, we present performance results that demonstrate the impact of our main design decisions.
Published: 2019

195. Three-dimensional Integration (3DI) with Bumpless Interconnects for Tera-scale Generation : High Speed, Low Power, and Ultra-small Operating Platform

Author: Koji Sakui and Takayuki Ohba
Subjects: 010302 applied physics, Materials science, business.industry, Electrical engineering, 02 engineering and technology, Terabyte, 021001 nanoscience & nanotechnology, Chip, 01 natural sciences, Aspect ratio (image), Small form factor, Power (physics), 0103 physical sciences, Wafer, 0210 nano-technology, Tera, business, Electrical impedance
Abstract: The prospect of three-dimensional (3D) integration for Terabyte large scale integration using bumpless interconnects with low-aspect-ratio TSVs and ultra-thinning are discussed. Bumpless (no bump) interconnects between wafers are a second-generation alternative to the use of micro-bumps for Wafer-on-Wafer (WOW) technology. Ultra-thinning of wafers down to 4μm provides the advantage of a small form factor, not only in terms of the total volume of 3D ICs, but also the aspect ratio of Through-Silicon-Vias (TSVs). The bumpless interconnects technology can increase the number of TSVs per chip with fine pitch of TSVs, and reduce the impedance of the TSV interconnects with no bumps. Therefore, a promising operating platform with a higher speed by enhancing parallelism, lower power by no bumps, and smaller size by thinning wafers, can be realized.
Published: 2019

196. More Exploration to Composable Infrastructure: The Application and Analysis of Composable Memory

Author: Cheng-Yueh Liu, Matt Hsiao, Shih-Hao Hung, Wo-Hao Ruan, Kong-Yu Shiu, Keng-Yu Lin, and Andy Liang
Subjects: Ethernet, Hardware_MEMORYSTRUCTURES, Remote direct memory access, business.industry, Computer science, 020206 networking & telecommunications, Cloud computing, 02 engineering and technology, Terabyte, Computer architecture, Software deployment, Spark (mathematics), Virtual memory, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Data center, business
Abstract: Memory-Driven Computing facilitates real-world applications like in-memory database to benefit from substantial memory resources; however, these workloads perform dramatically downgraded if the memory capacity cannot reach to a threshold. In this paper, we present a novel remote memory system, which can handle memory-intensive workloads up to the scale of 1 Terabyte, by leveraging sophisticated tmpfs, virtual memory system, and Remote Direct Memory Access (RDMA) over Ethernet. To the best of our knowledge, this is the first design that provides (1) the most significant memory extension without any software modification and (2) the large-scale performance evaluations comparing to other state-of-the-art works about network-based remote memory. Based on the evaluation results, we believe the proposed composable memory is valuable for readers to re-visit wildly developed computing models (e.g., Apache Spark) and facilitate research work which can potentially reshape the data center architecture/deployment further.
Published: 2019

197. GhostSZ: A Transparent FPGA-Accelerated Lossy Compression Framework

Author: Rushi Patel, Tong Geng, Chen Yang, Martin C. Herbordt, Qingqing Xiong, and Anthony Skjellum
Subjects: Speedup, Computer science, Pipeline (computing), Compression ratio, Bandwidth (computing), Parallel computing, Central processing unit, Terabyte, Lossy compression, Data modeling
Abstract: High-performance computing (HPC) applications often generate enormous amounts of data that must be transferred for check-pointing, in situ processing, or post-execution analysis. To reduce the related network traffic and storage consumption, lossy compression schemes that target scientific data are often used. SZ compression emerged three years ago and has gained much attention because of its high compression ratio. However, performing SZ compression can take half a day per Terabyte of data; this could be a drawback to adoption. We propose GhostSZ an FPGA framework for accelerating tasks in SZ at line rate, and so transparently. The critical problem to be overcome is the tight data dependence central to SZ. GhostSZ solves this with a data transfer path having novel staged hardware. We test our implementation with both synthetic and real HPC application data and show 9.5×-80× core versus pipeline speedup over the optimized production version running on a state-of-the-art CPU and 8.2× per chip. Much of the variance in performance is due to the FPGA already running at line rate and so benefiting less from optimizations applicable to the CPU only on the most favorable data sets. The significance of this work is the possibility of a major reduction in required networking and storage in HPC installations. For example, using GhostSZ, fewer than 10 FPGAs would be sufficient to handle the entire I/O bandwidth of the top entry on the latest IO-500 list.
Published: 2019

198. Body pointing, acquisition and tracking for small satellite laser communication

Author: Jamie W. Burnside, Bryan S. Robinson, Curt Schieler, Jessica T. Chang, Kenneth Aquino, and Kathleen Riesing
Subjects: Microcontroller, Signal processing, Systems analysis, Spacecraft, business.industry, Computer science, Optical communication, Satellite, Terabyte, business, Computer hardware, Free-space optical communication
Abstract: Free-space optical communications in space offer many benefits over established radio frequency based communication links; in particular, high beam directivity results in efficient power usage. Such a reduced power requirement is particularly appealing to small satellites with strict size, weight and power (SWaP) requirements. In the case of free-space optical communication, precise pointing, acquisition and tracking (PAT) of the incoming beam is necessary to close the communication link. Due to the narrow beam of the laser, the critical task of accomplishing PAT becomes increasingly arduous and often requires complex systems of optical and processing hardware to account for relative movement of the terminals. Recent developments in body pointing mecha- nisms have allowed small satellites to point with greater precision. In this work, we consider an approach to a low-complexity PAT system that utilizes a single quad-cell photodetector as an optical spatial sensor, and exploits the body pointing capabilities of the spacecraft to perform the tracking maneuvers, eschewing the need for additional dedicated optical hardware. We look at the PAT performance of this approach from a systems analysis viewpoint and present preliminary experimental results. In particular, we examine the implementation of the system on NASA's TeraByte InfraRed Delivery (TBIRD) demonstration.
Published: 2019

199. The GMRT Archival Data Processing Project

Author: B. P. Ratna Kumar, Yogesh Wadadekar, Ishwara Chandra, Divya Oberoi, and Lijo T. George
Subjects: Pipeline transport, Telescope, Data processing, Computer science, law, Process (computing), Terabyte, Archival research, Pipeline (software), Remote sensing, law.invention
Abstract: The GMRT Online Archive houses over 80 terabytes of interferometric observations obtained with the GMRT, since the first public observing cycle in 2002. The utility of this vast archive of raw UV visibilities, likely the largest of any Indian telescope, can be significantly enhanced if first look (and where possible, science ready) processed images can be made available to the user community. We have initiated a project to pipeline process GMRT images in the 150, 240, 325 and 610 MHz bands. The SPAM pipeline developed by Huib Intema is being used for this purpose.
Published: 2019

200. Developing Analytical Applications and Dashboards Using Social Media

Author: V Anbarasu
Subjects: business.industry, Computer science, Scala, Big data, Terabyte, Data science, Analytics, Spark (mathematics), Scalability, Social media, Real-time data, business, computer, computer.programming_language
Abstract: These days there is a quick boost in the usage of Social Media like Facebook, Twitter, LinkedIn etc. and it deals with Terabytes of Data per day. Also it has become more convenient way for customers to interact to share their ideas, opinions related to current environment. It's a time to dig the data and get some valuable insight. In order to ensure a smooth path in customer journey, without any friction points it gives an opportunity to do analysis on social media. This application will be cheaper and very fast; also it is very useful to do analytics for real time Data and batch processing Data. This framework is a High end Hadoop cluster, with spark and scala which is reliable, scalable and robust in nature. It provides an easy view to create charts based on the hashtags created. The benefit of using Scala, will reduce 50% of code when compared to java. Kafka is a traditional messaging broker, which is used for streaming the data. The output is visualized with Charts. So, anyone can understand the results simply by seeing without having any knowledge of it. The analysis and analytics are achieved by using the Big data application and the outputs.
Published: 2019

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

1,244 results on '"Terabyte"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources