Opening of the GraphHPC-2015 workshop
V. B. Sulimov, PhD, MSU RCC
A. S. Simonov, PhD, JSC NICEVT
A.V. Pozdneev, PhD, IBM
Community detection is an important problem that spans many research areas, such as social networks, systems biology, and power grid optimization. Fine-grained communication and irregular access patterns to memory and the interconnect limit the overall scalability and performance of existing algorithms. This talk presents a highly scalable parallel algorithm for distributed-memory systems. The method employs a novel implementation strategy for storing and processing dynamic graphs. The scalability analysis of the algorithm was performed on two massively parallel machines, Blue Gene/Q and Power7-IH, for graphs with up to hundreds of billions of edges. Leveraging the convergence properties of the algorithm and the efficient implementation, it is possible to analyze communities of large-scale graphs in just a few seconds. The talk is based on a paper accepted for publication in the IPDPS-2015 conference proceedings, kindly provided by Dr. Fabrizio Petrini (IBM Research).
A.N. Daryin, PhD, T-Platforms
We describe an optimized graph structure developed by T-Platforms. This structure allows for significantly better locality of memory access, and for reduced and well-ordered communication between processes. Within each process, thread parallelization becomes possible. Results for BFS are presented.
A.V. Slesarenko; A. N. Filippov; A.V. Romanov; Y.A. Zotov, PhD, Huawei Russian Research Center
Automatic program specialization is a long-standing code optimization practice based on information about a particular usage of the program. In this report we consider a new method of automatic program specialization for different implementations of abstract data types. Graph algorithms are used as an example.
When defining an abstract data type for graph computations, for example, one can omit the concrete internal representation and define only the allowed operations, such as adding edges/vertices, searching for neighbors, etc.
After being expressed in terms of abstract graph operations, the algorithm can be automatically transformed into one operating on any particular implementation of the graph, such as an adjacency matrix or adjacency list, taking advantage of the concrete data representation and performing domain-specific code optimizations along the way.
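The idea can be illustrated with a minimal Python sketch (not the authors' system; the class and function names here are hypothetical): an algorithm written only against abstract graph operations works unchanged with any concrete representation, and a specializer can then inline the chosen representation's code.

```python
class AdjListGraph:
    """Graph stored as adjacency lists (compact for sparse graphs)."""
    def __init__(self, n):
        self.adj = [[] for _ in range(n)]

    def add_edge(self, u, v):
        self.adj[u].append(v)
        self.adj[v].append(u)

    def neighbors(self, u):
        return self.adj[u]


class AdjMatrixGraph:
    """Graph stored as an adjacency matrix (fast edge queries, dense storage)."""
    def __init__(self, n):
        self.m = [[False] * n for _ in range(n)]

    def add_edge(self, u, v):
        self.m[u][v] = self.m[v][u] = True

    def neighbors(self, u):
        return [v for v, e in enumerate(self.m[u]) if e]


def degree(g, u):
    # Written only in terms of the abstract operation `neighbors`,
    # so it works, and can be specialized, for either representation.
    return len(list(g.neighbors(u)))
```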
The report will describe the proposed method, how and why it works, as well as the results of its application to some graph algorithms.
Coffee Break
A.S. Kolganov, Institute of Applied Mathematics, Russian Academy of Sciences
In recent years, GPGPU usage has become increasingly common, driven by the relatively high efficiency and lower cost of GPUs. Many problems based on structured grids, where the parallelism is easy to expose, are solved very well on the GPU. But there are problems that require more computing power and use unstructured data. One of them is breadth-first search (BFS) on undirected graphs, a building block for a number of graph algorithms. The talk discusses an implementation of BFS for processing large graphs on clusters with multicore processors and graphics cards (GPUs), as well as optimization techniques for a single GPU and techniques for accelerating communication between nodes when multiple GPUs are used.
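For reference, the standard level-synchronous formulation that GPU implementations typically parallelize can be sketched in a few lines of serial Python (a generic sketch, not the code presented in the talk): each iteration expands the whole current frontier, and on a GPU the loop over frontier vertices becomes a data-parallel kernel.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency-list graph.

    Returns the BFS level of every vertex (-1 if unreachable).
    On a GPU, the loop over `frontier` is what gets parallelized.
    """
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:          # data-parallel on a GPU
            for v in adj[u]:
                if level[v] == -1:  # atomic claim on a GPU
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```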
M.A. Chernoskutov, Institute of Mathematics and Mechanics, Ural Department of Russian Academy of Sciences; C. Bekas, PhD; Y. Ineichen, PhD, IBM Research
The subgraph centrality metric is based on mathematical tools that give it better differentiation power than some other centrality metrics. However, the algorithm for computing subgraph centrality has cubic complexity in the number of nodes in the graph. This makes exact computation of subgraph centrality values for every node of a dynamic graph very difficult. In this study we introduce a heuristic algorithm that can dynamically update the subgraph centrality metric for several types of graphs.
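For context, the exact metric being approximated is SC(i) = (e^A)_ii = sum_k (A^k)_ii / k!, a count of closed walks at node i downweighted by length. A minimal Python sketch of the exact computation (a truncated-series illustration, not the authors' heuristic) shows why it is expensive: it requires repeated matrix products.

```python
import math

def subgraph_centrality(A, terms=20):
    """Exact subgraph centrality via a truncated series of (e^A)_ii.

    SC(i) = sum_k (A^k)_ii / k!; the matrix products make exact
    computation O(n^3)-per-power, which motivates heuristics for
    dynamic graphs.
    """
    n = len(A)
    # P accumulates A^k; start with the identity (the k = 0 term).
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    sc = [1.0] * n  # k = 0 contributes 1 to every diagonal entry
    for k in range(1, terms):
        P = [[sum(P[i][m] * A[m][j] for m in range(n)) for j in range(n)]
             for i in range(n)]
        for i in range(n):
            sc[i] += P[i][i] / math.factorial(k)
    return sc
```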
Specialized Sparse Matrix Storage Structure for GPU Computing
A.V. Monakov, Institute for System Programming, Russian Academy of Sciences
Sparse matrix-vector multiplication is one of the most important sparse algebra kernels. On GPUs this kernel is memory-bandwidth bound, and the achieved bandwidth in turn depends on the order in which matrix elements are processed and on the data structure used for the matrix. The presentation describes a variant of the SELLPACK format with modifications that reduce memory consumption for the matrix index data. Additionally, auto-tuning of structure parameters and GPU kernel launch parameters can be used to maximize overall performance.
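The sliced-ELLPACK idea behind such formats can be illustrated with a minimal Python sketch (an illustration of the general format family, not the presented GPU code; the function names and layout details are assumptions): rows are grouped into slices, each slice is padded only to its own longest row rather than the global maximum, and values within a slice are stored column-major so adjacent GPU threads read adjacent elements.

```python
def to_sell(rows, slice_height):
    """Pack rows given as lists of (col, val) pairs into a sliced-ELLPACK
    layout: per-slice padding, column-major storage within each slice."""
    slices = []
    for s in range(0, len(rows), slice_height):
        chunk = rows[s:s + slice_height]
        width = max((len(r) for r in chunk), default=0)
        # cols[j][i] / vals[j][i]: element j of row i within the slice;
        # padded entries use column 0 with value 0.0 (harmless in SpMV).
        cols = [[r[j][0] if j < len(r) else 0 for r in chunk]
                for j in range(width)]
        vals = [[r[j][1] if j < len(r) else 0.0 for r in chunk]
                for j in range(width)]
        slices.append((cols, vals, len(chunk)))
    return slices

def sell_spmv(slices, x):
    """y = A*x over the sliced layout; the column-major inner walk models
    coalesced per-thread access on a GPU."""
    y = []
    for cols, vals, height in slices:
        acc = [0.0] * height
        for j in range(len(cols)):
            for i in range(height):
                acc[i] += vals[j][i] * x[cols[j][i]]
        y.extend(acc)
    return y
```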
A.U. Pirova; I.B. Meyerov, PhD, Lobachevsky State University of Nizhni Novgorod
This work deals with the problem of finding an ordering of graph vertices that minimizes the fill-in of the Cholesky factor of the sparse matrix associated with the graph. For this purpose, heuristic approaches based on graph algorithms are applied. The existing libraries for parallel graph ordering have MPI-based implementations, but they do not take into account the architectural features of modern multicore systems. This work considers a new parallel ordering algorithm for shared-memory systems. Parallel processing is done in a task-based fashion employing threading techniques and a concurrent queue. A modified multilevel nested dissection algorithm is used for the ordering. Experimental results on symmetric matrices from the University of Florida Sparse Matrix Collection demonstrate that our shared-memory implementation is competitive with the widely used ParMetis library in both quality and performance.
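The nested dissection principle itself is simple to state: find a small vertex separator, order the two separated halves first and the separator last, so that eliminating either half causes no fill-in across the separator. A toy Python sketch (hypothetical, for a chain graph only, where the median vertex is an exact separator; real libraries find separators of general graphs with multilevel partitioning):

```python
def nested_dissection_order(nodes):
    """Recursive nested dissection ordering for a chain (path) graph.

    The median vertex separates the chain into two halves; ordering the
    halves before the separator limits Cholesky fill-in, since no
    eliminated vertex connects the two halves directly.
    """
    if len(nodes) <= 2:
        return list(nodes)
    mid = len(nodes) // 2
    return (nested_dissection_order(nodes[:mid])
            + nested_dissection_order(nodes[mid + 1:])
            + [nodes[mid]])
```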
Lunch Break
K.K. Chykhradze, A.V. Korshunov, R.K. Pastukhov, Institute for System Programming, Russian Academy of Sciences
Despite the availability of multiple datasets and tools for collecting social network data, there is still a need for random graph models and tools for generating graphs with a specified set of properties. For reliable testing of social data analysis methods, they must be applied to a range of samples with various properties. Collecting real data is hindered by the long time required for downloading and processing large arrays of semi-structured information. Furthermore, it is difficult to obtain a dataset with a desired set of properties.
We introduce tools for generating random graphs satisfying the basic properties of social networks, such as scale-freeness, community structure, etc. The output graphs include (un)directed edges, user-community assignments, user attributes, "likes" and textual content. The Apache Spark-based implementation makes it possible to create large random graphs for testing the performance and accuracy of methods for social data analysis.
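As a point of reference for the scale-freeness property mentioned above, the classic preferential-attachment generator can be sketched in a few lines of Python (a standard Barabasi-Albert-style sketch, not the authors' Spark-based tool): each new vertex attaches to m existing vertices chosen with probability proportional to their current degree.

```python
import random

def barabasi_albert(n, m, seed=0):
    """Generate edges of a scale-free graph by preferential attachment.

    `repeated` holds each vertex once per incident edge, so a uniform
    draw from it is a degree-proportional draw.
    """
    rng = random.Random(seed)
    edges = []
    repeated = list(range(m))  # seed vertices, unit initial weight
    for v in range(m, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(repeated))
        for t in targets:
            edges.append((v, t))
            repeated.append(t)
        repeated.extend([v] * m)
    return edges
```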
A.V.Zinoviev, Omsk F.M. Dostoevsky State University
This report considers the following social network user classification problem: every user should be assigned to one of two classes, User or Spammer. The Odnoklassniki friendship graph and additional social information about users serve as the initial data for the investigation and for building the classification model. Apache Giraph, running on a Hadoop cluster, is used for parallel feature vector calculation.
Parallel Topological Algorithm for Fixing Normals in an Unstructured Multifaceted Computational Grid on Intel Xeon Phi
A.S. Semenov, PhD; A.V. Mukosey; E.A. Golovina, DISLab/JSC NICEVT
Mismatched normal directions can arise during the generation of computational grids in CFD codes. We propose a normals-fixing algorithm based on traversing the computational grid with breadth-first search. On a large computational grid the algorithm runs 1.6 times faster on an Intel Xeon Phi 5110 than on an Intel Xeon E5-2660.
An Efficient Parallel Algorithm for Finding Approximately Matching Words in a Large Set of Long Character Sequences
V.A. Lyubetsky, PhD; L.I. Rubanov, PhD, Institute for Information Transmission Problems, Russian Academy of Sciences
The problem of finding similar words in a large (on the order of 10^8) set of long (up to 10^12 characters) strings is of generally recognized significance. It is of interest if the words appear in almost every string, are very similar to each other, and are of great complexity (weakly compressible). If candidate words are represented as vertices of a graph, and their similarity as weighted edges, then the problem is equivalent to finding massive dense subgraphs in a graph with typically 10^20 vertices and 10^30 edges, or even more. We propose a highly parallel algorithm for the efficient solution of this problem that includes constructing the initial large graph and searching for such subgraphs in it. The algorithm is presented along with applications to important problems of natural science.
Coffee Break
Minimum Spanning Tree (MST) Problem on Multiprocessor: Boruvka's Algorithm Variants and Modifications
V.E. Zaytsev, Novosibirsk State University; K.V. Kalgin, PhD, Institute of Computational Mathematics and Mathematical Geophysics, Novosibirsk Scientific Centre, Russian Academy of Sciences
One of the most widely used algorithms for building a minimum spanning tree is Boruvka's algorithm. This work analyzes different approaches to storing the graph in memory within a parallel implementation of Boruvka's algorithm: a single constant edge array, adjacency lists with two approaches to merging them, and lists of adjacency lists, which allow fast merging. These approaches were tested on RMAT and SSCA2 graphs.
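For reference, the edge-array variant of Boruvka's algorithm can be sketched in serial Python (a standard textbook sketch over a constant edge array, not the authors' parallel code): each round, every component selects its cheapest outgoing edge, and all selected edges are contracted at once via union-find.

```python
def boruvka_mst(n, edges):
    """Boruvka's MST over an edge array [(u, v, w), ...].

    Each round scans all edges once (the part that parallelizes well),
    then merges components along the per-component cheapest edges.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst, total = [], 0.0
    while True:
        cheapest = {}  # component root -> its cheapest outgoing edge
        for u, v, w in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][2]:
                    cheapest[r] = (u, v, w)
        if not cheapest:
            break
        for u, v, w in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:  # guard: both endpoints may already be merged
                parent[ru] = rv
                mst.append((u, v, w))
                total += w
    return mst, total
```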
S.O. Komarov, Novosibirsk State University; K.V. Kalgin, PhD, Institute of Computational Mathematics and Mathematical Geophysics, Novosibirsk Scientific Centre, Russian Academy of Sciences
This work presents a parallel algorithm for finding connected components in an undirected graph. Two parallel algorithms were implemented: our own asynchronous BFS (ABFS) and Shiloach-Vishkin (SV). The tests performed revealed the advantages and disadvantages of each algorithm. On an Intel Xeon E5-2690 system, ABFS significantly outperforms SV on various graphs; however, on an Intel Xeon Phi 7110X the performance of ABFS drops considerably due to frequent cache misses.
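The Shiloach-Vishkin scheme compared above alternates two data-parallel sweeps until a fixed point: hooking, which attaches one tree root under another along an edge, and shortcutting (pointer jumping), which flattens the trees. A simplified serial Python sketch of this scheme (not the authors' implementation):

```python
def shiloach_vishkin_cc(n, edges):
    """Simplified Shiloach-Vishkin connected components.

    After convergence, parent[u] is the smallest vertex id in u's
    component. Both inner sweeps are data-parallel in the real algorithm.
    """
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Hooking: attach the higher root under the lower one.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru < rv and rv == parent[rv]:
                parent[rv] = ru
                changed = True
            elif rv < ru and ru == parent[ru]:
                parent[ru] = rv
                changed = True
        # Shortcutting: pointer jumping flattens every tree to a star.
        for u in range(n):
            while parent[u] != parent[parent[u]]:
                parent[u] = parent[parent[u]]
    return parent
```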
GraphHPC-2015 contest closing
GraphHPC-2015 closing