Program
9:30–10:00
Registration

10:00–10:15
Opening of the GraphHPC-2015 workshop

Community detection is an important problem that spans many research areas, such as social networks, systems biology,
and power grid optimization. The fine-grained communication and irregular access patterns to memory and the
interconnect limit the overall scalability and performance of existing algorithms. This talk presents a highly
scalable parallel algorithm for distributed-memory systems. The method employs a novel implementation strategy to
store and process dynamic graphs. The scalability of the algorithm was analyzed on two massively parallel machines,
Blue Gene/Q and Power7 IH, for graphs with up to hundreds of billions of edges. Leveraging the convergence properties
of the algorithm and the efficient implementation, it is possible to analyze communities of large-scale graphs in just
a few seconds. The talk is based on a paper accepted for publication in the IPDPS 2015 conference proceedings that was
kindly provided by Dr. Fabrizio Petrini (IBM Research).


We describe an optimized graph structure developed by T-Platforms. This structure allows for significantly better
locality of memory access and for reduced, well-ordered communication between processes. Within each process,
thread-level parallelization becomes possible. Results for BFS are presented.


Automatic program specialization is a long-standing practice of code optimization based on information about a
particular use case of the program. In this report we consider a new method of automatic program specialization
for different implementations of abstract data types, using graph algorithms as an example.
When defining an abstract data type for graph computations, for example, one can omit the concrete internal representation and only define the allowed operations, such as adding edges/vertices, searching for neighbors, etc. After being expressed in terms of abstract graph operations, the algorithm can be automatically transformed into one operating on any particular implementation of the graph, such as an adjacency matrix or adjacency list, taking advantage of the concrete data representation and performing domain-specific code optimizations along the way. The report describes the proposed method, how and why it works, as well as the results of its application to some graph algorithms.
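To illustrate the idea (a minimal hypothetical sketch, not the authors' system): an algorithm written purely against an abstract graph interface runs unchanged on any concrete representation, and a specializer could then substitute the representation-specific operations. All class and function names below are illustrative.

```python
# Sketch: a generic BFS written against an abstract graph interface,
# reusable with either an adjacency-list or adjacency-matrix representation.
from abc import ABC, abstractmethod
from collections import deque

class AbstractGraph(ABC):
    @abstractmethod
    def add_edge(self, u, v): ...
    @abstractmethod
    def neighbors(self, u): ...

class AdjacencyListGraph(AbstractGraph):
    def __init__(self, n):
        self.adj = [[] for _ in range(n)]
    def add_edge(self, u, v):
        self.adj[u].append(v)
        self.adj[v].append(u)
    def neighbors(self, u):
        return self.adj[u]

class AdjacencyMatrixGraph(AbstractGraph):
    def __init__(self, n):
        self.m = [[False] * n for _ in range(n)]
    def add_edge(self, u, v):
        self.m[u][v] = self.m[v][u] = True
    def neighbors(self, u):
        return [v for v, flag in enumerate(self.m[u]) if flag]

def bfs_distances(g, source, n):
    """Expressed only in terms of the abstract operations."""
    dist = [None] * n
    dist[source] = 0
    q = deque([source])
    while q:
        u = q.popleft()
        for v in g.neighbors(u):
            if dist[v] is None:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist
```

A specializer as described in the talk would, in effect, inline the chosen implementation's `neighbors` into `bfs_distances` and optimize the result for that data layout.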

Coffee Break


In recent years GPGPU usage has become increasingly widespread, driven by the relatively high efficiency and lower
cost of graphics processors. As is well known, many problems that use structured grids, where the parallelism is easy
to expose, are solved very well on the GPU. But there are problems that require more computing power and use
unstructured grids. One of them is breadth-first search on undirected graphs, which serves as a building block in a
number of graph algorithms. The report discusses an implementation of the breadth-first search algorithm for
processing large graphs on clusters with multi-core processors and graphics cards (GPUs), as well as optimization
techniques for a single GPU and techniques for accelerating communication between nodes when using multiple GPUs.
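For context, the level-synchronous (frontier-based) formulation of BFS is the one that maps naturally onto GPU thread parallelism: each level, every frontier vertex can be expanded independently. A sequential sketch (illustrative Python, not the authors' GPU code):

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency-list graph.
    On a GPU, the inner loop over the frontier becomes one thread
    (or warp) per frontier vertex."""
    n = len(adj)
    level = [-1] * n          # -1 marks unvisited vertices
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:                 # expand the whole frontier
            for v in adj[u]:
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```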


The subgraph centrality metric is built on mathematical tools that give it better differentiation power than some other centrality metrics. However, the algorithm for computing subgraph centrality has cubic complexity in the number of nodes in the graph. This obstacle greatly complicates the exact computation of subgraph centrality values for every node in dynamic graphs. In this study we introduce a heuristic algorithm that dynamically updates the subgraph centrality metric for several types of graphs.


Specialized Sparse Matrix Storage Structure for GPU Computing
Sparse matrix-vector multiplication is one of the most important sparse algebra kernels. On GPUs this kernel is memory-bandwidth bound, and the achieved bandwidth in turn depends on the order in which matrix elements are processed and on the data structure used for the matrix. The presentation describes a variant of the SELL-PACK format with modifications that reduce the memory consumed by matrix index data. Additionally, auto-tuning of the structure parameters and of the GPU kernel launch parameters can be used to maximize overall performance.
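To show the family of formats involved, here is a minimal sequential model of a sliced-ELLPACK layout: rows are grouped into slices of C rows, and each slice is padded only to its own longest row, cutting the padding a plain ELLPACK layout would require. This is an illustrative Python sketch, not the presented format or its index-compression modifications.

```python
import numpy as np

def to_sliced_ellpack(rows, n_cols, C=2):
    """rows: list of lists of (col, val) pairs, one list per matrix row.
    Returns a list of slices, each padded to that slice's max row length."""
    slices = []
    for s in range(0, len(rows), C):
        chunk = rows[s:s + C]
        width = max((len(r) for r in chunk), default=0)
        cols = np.zeros((len(chunk), width), dtype=np.int64)
        vals = np.zeros((len(chunk), width))
        for i, r in enumerate(chunk):
            for j, (c, v) in enumerate(r):
                cols[i, j] = c
                vals[i, j] = v
        slices.append((s, cols, vals))
    return slices

def spmv(slices, x, n_rows):
    """y = A x over the sliced layout; padded entries hold val == 0
    and therefore contribute nothing to the dot products."""
    y = np.zeros(n_rows)
    for start, cols, vals in slices:
        y[start:start + cols.shape[0]] = (vals * x[cols]).sum(axis=1)
    return y
```

On a GPU the per-slice layout is stored so that consecutive threads of a warp read consecutive elements, which is what makes the achieved bandwidth depend on the format.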


This work deals with the problem of finding an ordering of graph vertices that minimizes the fill-in of the Cholesky factor of the sparse matrix associated with the graph. For this purpose, heuristic approaches based on graph algorithms are applied. The existing libraries for parallel graph ordering have MPI-based implementations, but they do not take the architectural features of modern multi-core systems into account. This work considers a new parallel ordering algorithm for shared-memory systems. Parallel processing is done in a task-based fashion employing threading techniques and a concurrent queue. A modified multilevel nested dissection algorithm is used for the ordering. Experimental results on symmetric matrices from the University of Florida Sparse Matrix Collection demonstrate that our implementation is competitive on shared-memory systems with the widely used ParMetis library in both quality and performance.


Lunch Break


Despite the availability of multiple datasets and tools for collecting social network data, there is still a need for
random graph models and tools that generate such graphs with a specified set of properties. For reliable testing of
social data analysis methods, they must be applied to a range of samples with various properties. Collecting real data
is hindered by the long time required to download and process large arrays of semi-structured information; moreover,
it is difficult to obtain a dataset with a desired set of properties.
We introduce tools for generating random graphs that satisfy the basic properties of social networks, such as scale-freeness, community structure, etc. The output graphs include (un)directed edges, user-community assignments, user attributes, "likes", and textual content. The Apache Spark-based implementation makes it possible to create large random graphs for testing the performance and accuracy of social data analysis methods.

This report presents the following social network user classification problem: every user should be assigned to one of two classes, User or Spammer. The Odnoklassniki friendship graph and additional social information about users serve as the initial data for investigating and building the classification model. Apache Giraph, running on a Hadoop cluster, is used for parallel feature vector calculation.


Parallel Topology Algorithm for Normals Fixing in an Unstructured Multifaceted Computational Grid on Intel Xeon Phi
Mismatched normal directions can arise during the generation of computational grids in CFD codes. We propose a normals-fixing algorithm based on traversing the computational grid with breadth-first search. On a large computational grid the algorithm runs 1.6 times faster on an Intel Xeon Phi 5110 than on an Intel Xeon E5-2660.
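The underlying idea, making face orientations consistent by BFS over the face-adjacency graph, can be sketched sequentially for a triangle mesh (an illustrative sketch, not the authors' parallel Xeon Phi code): two faces sharing an edge are consistently oriented exactly when they traverse that edge in opposite directions.

```python
from collections import deque

def fix_orientation(faces):
    """faces: list of triangles, each a list of 3 vertex indices.
    BFS from face 0; any neighbour traversing a shared edge in the
    SAME direction as the current face has its winding flipped."""
    def directed_edges(f):
        return [(f[i], f[(i + 1) % 3]) for i in range(3)]
    # map each undirected edge to the faces touching it
    edge_faces = {}
    for idx, f in enumerate(faces):
        for a, b in directed_edges(f):
            edge_faces.setdefault(frozenset((a, b)), []).append(idx)
    seen = {0}
    q = deque([0])
    while q:
        i = q.popleft()
        for a, b in directed_edges(faces[i]):
            for j in edge_faces[frozenset((a, b))]:
                if j in seen:
                    continue
                # a consistent neighbour traverses the shared edge as (b, a);
                # if it also contains (a, b), its winding must be reversed
                if (a, b) in directed_edges(faces[j]):
                    faces[j] = faces[j][::-1]
                seen.add(j)
                q.append(j)
    return faces
```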


10:15–10:35
An Effective Parallel Algorithm for Finding Approximately Matching Words
in a Large Set of Long Character Sequences
The problem of finding similar words in a large (around 10^8) set of long (up to 10^12 characters) strings is of
generally recognized significance. It is of interest when the words appear in almost every string, are very similar to
each other, and are of great complexity (weakly compressible). If candidate words are represented as vertices of a
graph, and their similarity as weighted edges, then the problem is equivalent to finding massive dense subgraphs in a
graph with typically 10^20 vertices and 10^30 edges, or even more. We propose a deeply parallel algorithm for the
effective solution of this problem that includes constructing the initial large graph and searching for subgraphs in
it. The algorithm is presented along with applications to important problems of natural science.

Coffee Break


Minimum Spanning Tree (MST) Problem on a Multiprocessor: Variants and Modifications of Boruvka's Algorithm
One of the most widely used algorithms for building a minimum spanning tree is Boruvka's algorithm. This work analyzes
different approaches to storing the graph in memory in a parallel implementation of Boruvka's algorithm: a single
constant array of edges, adjacency lists with two approaches to merging them, and lists of adjacency lists, which
allow fast merging. These approaches were tested on R-MAT and SSCA2 graphs.
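For reference, the serial skeleton of Boruvka's algorithm over a constant edge array (the first of the storage variants mentioned) might look like this, an illustrative sketch rather than the parallel implementation under study:

```python
def boruvka_mst(n, edges):
    """edges: list of (u, v, w). Each round, every component selects its
    cheapest outgoing edge; the selected edges join the tree and their
    endpoint components are merged, halving (at least) the component count."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    mst, weight, n_comp = [], 0.0, n
    while n_comp > 1:
        cheapest = {}                        # component root -> cheapest edge
        for u, v, w in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][2]:
                    cheapest[r] = (u, v, w)
        if not cheapest:
            break                            # graph is disconnected
        for u, v, w in set(cheapest.values()):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((u, v, w))
                weight += w
                n_comp -= 1
    return mst, weight
```

The appeal for parallel implementations is that within a round the cheapest-edge selection and the component merges are largely independent, which is where the choice of graph storage (edge array vs. mergeable adjacency lists) matters.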


This work presents a parallel algorithm for finding connected components in an undirected graph. Two parallel
algorithms were implemented: our own asynchronous BFS (ABFS) and Shiloach-Vishkin (SV). The tests performed showed the
advantages and disadvantages of each algorithm. On an Intel Xeon E5-2690 system, ABFS significantly outperforms SV on
a variety of graphs; however, on an Intel Xeon Phi 7110X the performance of ABFS drops considerably due to frequent
cache misses.
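A sequential sketch of the Shiloach-Vishkin hooking-and-jumping scheme (illustrative only; the implementations compared in the talk are parallel, with all edges hooked and all pointers jumped concurrently per round):

```python
def shiloach_vishkin(n, edges):
    """Repeated rounds of hooking (attaching one tree root under another)
    and pointer jumping (shortcutting) until every vertex points directly
    at its component representative."""
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # hooking: attach the larger-indexed root under the smaller one
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:         # only hook actual roots
                    parent[hi] = lo
                    changed = True
        # pointer jumping: collapse every tree toward a star
        for u in range(n):
            while parent[u] != parent[parent[u]]:
                parent[u] = parent[parent[u]]
    return parent                            # component label per vertex
```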


GraphHPC-2015 contest closing


GraphHPC-2015 closing
