Cyber Valley researchers enable better analyses of large networks
Using a recently developed algorithmic sampling method, Nicolò Ruggeri and Caterina De Bacco aim to help social scientists better understand the relationships between nodes in large networks, including social media channels such as Facebook or Twitter.
In an effort to enable more accurate network analyses in applied network science, scientists at the Max Planck Institute for Intelligent Systems in Tübingen recently developed the Theoretical Criterion for Eigenvector Centrality (TCEC), a quantitative sampling method. The method aims to draw finite data samples that preserve a representative network property: the influence of each single node in the network. Now, Nicolò Ruggeri and Caterina De Bacco are expanding the model to include a new sampling algorithm that improves estimates of PageRank centrality (PR), the measure Google Search uses to rank web pages in search engine results.
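For readers unfamiliar with PageRank, the following Python sketch shows the textbook power-iteration form of the measure on a tiny made-up link graph. It is an illustration only, not the sampling algorithm from the paper; the graph, the function name `pagerank`, and the damping factor of 0.85 are assumptions chosen for the example.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    out_links maps each node to the list of nodes it links to.
    Every linked-to node must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each node passes an equal share of its rank to every node it links to.
        for v in nodes:
            for u in out_links[v]:
                new[u] += damping * rank[v] / len(out_links[v])
        rank = new
    return rank


# Hypothetical four-page web: three pages link to C, nobody links to D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph the page that receives the most links ends up with the highest score, while the page nobody links to keeps only the baseline share.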
They present their research in the paper entitled “Sampling on networks: estimating spectral centrality measures and their impact in evaluating other relevant network measures.” The paper was recently published in Applied Network Science. Ruggeri, the paper’s lead author, is a PhD student in the Cyber Valley Physics for Inference and Optimization group at MPI-IS in Tübingen and at ETH Zurich. De Bacco, who co-authored the paper, leads the group.
Ruggeri and De Bacco’s research can make a decisive contribution in a broad range of research fields, for instance in social science analyses that seek to infer the relationship between nodes in a network, such as social media channels.
“Our work was inspired by a conversation we had with a sociologist whose research focuses on analyzing the dynamics of hate speech on Twitter,” said Ruggeri. In such analyses, a major challenge lies in analyzing data samples drawn from complex networks. In most cases, not all the relevant information is available in the sample, and this can lead to biases in the analysis of the network properties under investigation. For instance, social scientists attempting to make an accurate analysis of the most influential users in an entire network would need a data sample that reflected influence accurately.
This is where the TCEC sampling model comes in. With it, Ruggeri and De Bacco have made it possible to get representative snapshots of user dynamics in major networks like Facebook and Twitter. The method does this by selecting subsamples that accurately reflect the influence of single nodes in the network, a measure known as eigenvector centrality. “If a researcher selects network users randomly, the sampled network will likely be very different from the actual whole. Any analysis will then be a distortion of reality. With TCEC, a scientist sampling a network of millions of users like Twitter can select a data sample of 10,000 nodes that is more representative of the whole, in terms of users’ influence as measured by eigenvector centrality,” says De Bacco.
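To make the idea of eigenvector centrality concrete, here is a minimal Python sketch that computes it by power iteration on a small made-up network and then recomputes it on a poorly chosen subsample. This is not the TCEC method; the network, the helper names `eigenvector_centrality` and `induced_subgraph`, and the choice of sample are all assumptions made for illustration.

```python
import math


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on an undirected graph given as {node: set_of_neighbours}."""
    nodes = list(adj)
    x = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # New score = own score + neighbours' scores. Iterating with A + I
        # instead of A keeps the same eigenvectors but avoids oscillation
        # on bipartite graphs.
        new = {v: x[v] + sum(x[u] for u in adj[v]) for v in nodes}
        norm = math.sqrt(sum(s * s for s in new.values()))
        new = {v: s / norm for v, s in new.items()}
        if max(abs(new[v] - x[v]) for v in nodes) < tol:
            return new
        x = new
    return x


def induced_subgraph(adj, sample):
    """Keep only the sampled nodes and the edges between them."""
    return {v: adj[v] & sample for v in sample}


# Hypothetical five-user network: "a" is the hub, "e" is on the periphery.
network = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a", "e"},
    "e": {"d"},
}
scores = eigenvector_centrality(network)

# A sample that happens to miss the hub changes who looks influential.
sample_scores = eigenvector_centrality(
    induced_subgraph(network, {"b", "c", "d", "e"})
)

for user in sorted(network):
    print(user, round(scores[user], 3), round(sample_scores.get(user, 0.0), 3))
```

In the full toy network, users "b" and "c" score higher than "e" because they are attached to the hub. In the sample without the hub they become isolated and their scores collapse, which is the kind of distortion a centrality-aware sampling criterion is meant to reduce.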
By expanding the model to include PageRank centrality, Ruggeri and De Bacco hope to further improve data samples, thus making network analyses even more accurate. So far, however, their theoretical results have shown that sampling for PR would lower the performance of the original TCEC algorithm. As a next step, the researchers plan to continue fine-tuning their method and to apply it in collaboration with sociologists to study how influence on online social networks is related to real-life events.
The original TCEC sampling method is available for use: Ruggeri and De Bacco have released the code at https://github.com/cdebacco/tcec_sampling