151. On Divide&Conquer in Image Processing of Data Monster
- Author
- Peter Hufnagl, Elsa Irmgard Buchholz, Hermann Hesling, and Marco Strutz
- Subjects
- Divide and conquer algorithms, Information Systems and Management, Speedup, Computer science, Big data, Petabyte, Image processing, Terabyte, Computer Science Applications, Management Information Systems, Computational science, Artificial intelligence & image processing, Sandbox (computer security), Word (computer architecture), Information Systems
- Abstract
- The steadily improving resolution power of sensors results in larger and larger data objects, which cannot be analysed in a reasonable amount of time on single workstations. To speed up the analysis, the Divide and Conquer method can be applied: a large data object is split into smaller pieces, each piece is analysed on a single node, and finally the partial results are collected and combined (a minimal sketch of this tile-and-combine step follows this entry). We apply this method to the validated bio-medical framework Ki67-Analysis, which determines the proportion of cancer cells in high-resolution images from breast examinations. In previous work, we observed an anomalous behaviour when the framework was applied to subtiles of an image. For each subtile we determined a so-called Ki67-Analysis score parameter, given by the ratio of the number of identified cancer cells to the total number of cells. The smaller the subtiles, the more this parameter is underestimated. The anomaly prevents a direct application of the Divide and Conquer method. In this work, we suggest a novel grey-box testing method for understanding the origin of the anomaly. It allows us to identify a class of subtiles for which the Ki67-Analysis score parameter can be determined reasonably well, i.e. for which the Divide and Conquer method can be applied. By demanding that the framework be stable with regard to small additive noise in brightness, we identify "ghost cells" that turn out to be an artefact of the framework. Finally, the challenge of analysing huge single data objects is considered. The upcoming observatory Square Kilometre Array (SKA) will consist of thousands of antennas and telescopes. Due to the exceptional resolution power of SKA, single images of the Universe may be as large as one Petabyte. "Data monsters" of that size cannot be analysed reasonably fast on traditional computing architectures. The relatively small throughput rates when reading data from disks are a serious bottleneck (memory-wall problem). Memory-based computing offers a change in paradigm: the current processor-centric architecture is replaced by a memory-based architecture. Hewlett Packard Enterprise (HPE) developed a prototype with 48 Terabytes of memory, called Sandbox. Counting words in large files can be considered a first step towards simulating the image processing of "Data Monsters" at SKA. We run the big data framework Thrill on the Sandbox and determine the speedup of different setups for distributed word counting.
- Published
- 2021
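
The tile-and-combine step described in the abstract can be illustrated with a short sketch. The Ki67-Analysis framework itself is not reproduced here, so `count_cells` is a hypothetical stand-in for the per-node analysis, and the tile size, noise amplitude, and function names are assumptions rather than details from the paper. The sketch only shows the two ideas named in the abstract: the global score is formed from summed counts rather than averaged per-tile scores, and a small additive-brightness perturbation can flag subtiles whose detections are not stable ("ghost cells").

```python
import numpy as np

TILE = 2048  # assumed subtile edge length in pixels (not taken from the paper)


def count_cells(subtile):
    """Hypothetical stand-in for the per-node Ki67-Analysis step.

    Returns (ki67_positive_cells, total_cells) for one subtile; the real
    framework is not reproduced here.
    """
    raise NotImplementedError("plug in the actual cell-detection framework")


def ki67_score(image, analyse=count_cells):
    """Divide and Conquer: split the image into subtiles, analyse each piece
    (in practice on separate worker nodes), and combine the partial results.

    The raw counts are summed before the ratio is taken, so the global score
    does not inherit the underestimation seen in averaged per-subtile scores.
    """
    positive_total, cells_total = 0, 0
    height, width = image.shape[:2]
    for y in range(0, height, TILE):
        for x in range(0, width, TILE):
            positive, cells = analyse(image[y:y + TILE, x:x + TILE])
            positive_total += positive
            cells_total += cells
    return positive_total / cells_total if cells_total else 0.0


def is_noise_stable(subtile, analyse=count_cells, amplitude=1.0, repeats=5, seed=0):
    """Grey-box stability check: re-run the analysis on copies of the subtile
    perturbed by small additive brightness noise.  Subtiles whose cell counts
    change under the perturbation contain detections that behave like the
    "ghost cells" mentioned in the abstract and should be excluded.
    """
    rng = np.random.default_rng(seed)
    results = set()
    for _ in range(repeats):
        noise = rng.uniform(-amplitude, amplitude, size=subtile.shape)
        noisy = np.clip(subtile.astype(np.float64) + noise, 0, 255)
        results.add(analyse(noisy.astype(subtile.dtype)))
    return len(results) == 1
```

In this reading, `is_noise_stable` would be used to select the class of subtiles on which `ki67_score` is allowed to operate; the distributed word-counting experiments on the Sandbox, by contrast, were run with the C++ framework Thrill and are not sketched here.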