- Complex Network Analysis Techniques
- Algorithms and Data Compression
- Web Data Mining and Analysis
- Graph Theory and Algorithms
- Advanced Graph Neural Networks
- Bioinformatics and Genomic Networks
- Data Management and Algorithms
- Advanced Proteomics Techniques and Applications
- Opinion Dynamics and Social Influence
- Advanced Graph Theory Research
- Caching and Content Delivery
- DNA and Biological Computing
- Advanced Image and Video Retrieval Techniques
- Peer-to-Peer Network Technologies
- Graph theory and applications
- Distributed systems and fault tolerance
- Computability, Logic, AI Algorithms
- Semantic Web and Ontologies
- Topic Modeling
- Internet Traffic Analysis and Secure E-voting
- Information Retrieval and Search Behavior
- Health, Environment, Cognitive Aging
- Advanced Database Systems and Queries
- Computational Geometry and Mesh Generation
- Network Packet Processing and Optimization
University of Milan
2014-2024
Yahoo (Spain)
2014
Informa (Italy)
2004
Studying web graphs is often difficult due to their large size. Recently,several proposals have been published about various techniques that allow tostore a graph in memory limited space, exploiting the inner redundancies of web. The WebGraph framework suite codes, algorithms and tools aims at making it easy manipulate graphs. This papers presents compression used WebGraph, which are centred around referentiation intervalisation (which turn dual each other). can compress WebBase (118 Mnodes,...
We continue the line of research on graph compression started with WebGraph, but we move our focus to social networks in a proper sense (e.g., LiveJournal): approaches that have been used for long time compress web graphs rely specific ordering nodes (lexicographical URL ordering) whose extension general is not trivial. In this paper, propose solution mixes clusterings and orders, devise new algorithm, called Layered Label Propagation, builds previous work scalable clustering can be reorder...
Abstract We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation presence faults, very effective assignment function (based on consistent hashing) for partitioning domain to crawl, and more general complete decentralization every task. necessity handling large sets data has highlighted some limitations APIs, which prompted...
Frigyes Karinthy, in his 1929 short story "Láncszemek" (in English, "Chains") suggested that any two persons are distanced by at most six friendship links.1 Stanley Milgram famous experiments challenged people to route postcards a fixed recipient passing them only through direct acquaintances. found the average number of intermediaries on path lay between 4:4 and 5:7, depending sample chosen. We report results first world-scale social-network graph-distance computations, using entire...
Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology, and computer science, whole plethora _centrality measures_ (a.k.a. indices_, or _rankings_) were proposed to account for the importance network. In this study, we try provide mathematically sound survey most important classic centrality measures known from literature propose an _axiomatic_ approach establish whether they actually doing what have designed do. Our...
In this paper we study the problem of local triangle counting in large graphs. Namely, given a graph G = (V;E) want to estimate as accurately possible number triangles incident every node υ ∈ V graph. The computing global has been considered before, but our knowledge is first that addresses with focus on efficiency issues arising massive distribution and related clustering coefficient can be used many interesting applications. For example, show measures compute help detect presence spamming...
Query logs record the queries and actions of users search engines, as such they contain valuable information about interests, preferences, behavior users, well their implicit feedback to engine results. Mining wealth available in query has many important applications including query-log analysis, user profiling personalization, advertising, recommendation, more.In this paper we introduce query-flow graph, a graph representation interesting knowledge latent querying behavior. Intuitively,...
We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if hosts are include spam aspects or not. This is first publicly available collection includes page contents and links, has labelled by diverse judges.
PageRank is defined as the stationary state of a Markov chain. The chain obtained by perturbing transition matrix induced web graph with damping factor α that spreads uniformly part rank. choice eminently empirical, and in most cases original suggestion = 0.85 Brin Page still used. Recently, however, behaviour respect to changes was discovered be useful link-spam detection[21]. Moreover, an analytical justification value chosen for missing. In this paper, we give first mathematical analysis...
We describe the techniques developed to gather and distribute in a highly compressed, yet accessible, form series of twelve snapshot .uk web domain. Ad hoc compression made it possible store snapshots using just 1:9 bits per link, with constant-time access temporal information. Our collection makes study evolution link-based scores (e.g., PageRank), growth online communities, general time-dependent phenomena related link structure.
The query-flow graph [Boldi et al., CIKM 2008] is an aggregated representation of the latent querying behavior contained in a query log. Intuitively, directed edge from qi to qj means that two queries are likely be part same search mission. Any path over may seen as possible task, whose likelihood given by strength edges along path. An (qi, qj) also labelled with some information: e.g., probability user moves qj, or type transition, for instance, fact specialization qi.
The neighbourhood function NG(t) of a graph G gives, for each t ∈ N, the number pairs nodes x, y such that is reachable from x in less hops. provides wealth information about [10] (e.g., it easily allows one to compute its diameter), but very expensive exactly. Recently, ANF algorithm (approximate function) has been proposed with purpose approximating on large graphs. We describe breakthrough improvement over terms speed and scalability. Our algorithm, called HyperANF, uses new HyperLogLog...
Data collected nowadays by social-networking applications create fascinating opportunities for building novel services, as well expanding our understanding about social structures and their dynamics. Unfortunately, publishing social-network graphs is considered an ill-advised practice due to privacy concerns. To alleviate this problem, several anonymization methods have been proposed, aiming at reducing the risk of a breach on published data, while still allowing analyze them draw relevant...
research-article Share on PageRank: Functional dependencies Authors: Paolo Boldi Università degli Studi di Milano, MI, Italy ItalyView Profile , Massimo Santini Sebastiano Vigna Authors Info & Claims ACM Transactions Information SystemsVolume 27Issue 4Article No.: 19pp 1–23https://doi.org/10.1145/1629096.1629097Published:30 November 2009Publication History 71citation1,468DownloadsMetricsTotal Citations71Total Downloads1,468Last 12 Months42Last 6 weeks2 Get Citation AlertsNew Alert added!This...
A minimal perfect hash function maps a set S of n keys into the {0, 1,..., -- 1} bijectively. Classical results state that hashing is possible in constant time using structure occupying space close to lower bound log e bits per element. Here we consider problem monotone hashing, which bijection required preserve lexicographical ordering keys. can be seen as very weak form index provides ranking just on (and answers randomly outside S). Our goal minimise description size function: show that,...
Although web crawlers have been around for twenty years by now, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine tools and at same time scales linearly with amount resources available. This paper aims filling this gap.
This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these there is damping function decreases with distance, so direct link implies more endorsement than long path. PageRank the most widely known this family.The main objective to determine whether techniques has some interest per se, and how different choices for impact on rank quality convergence speed. Even though our results suggest can be approximated other simpler forms...
In this article, we study the problem of approximate local triangle counting in large graphs. Namely, given a graph G =( V,E ) want to estimate as accurately possible number triangles incident every node v ∈ V graph. We consider question both for undirected and directed The computing global has been considered before, but our knowledge is first contribution that addresses with focus on efficiency issues arising massive graphs also considers case. distribution related clustering coefficient...
Article Free AccessComputing anonymously with arbitrary knowledge Share on Authors: Paolo Boldi Dipartimento di Scienze dell'Informazione, Università Milano, Italy ItalyView Profile , Sebastiano Vigna Authors Info & Claims PODC '99: Proceedings of the eighteenth annual ACM symposium Principles distributed computingMay 1999 Pages 181–188https://doi.org/10.1145/301308.301355Online:01 May 1999Publication History 60citation252DownloadsMetricsTotal Citations60Total Downloads252Last 12 Months4Last...
Understanding query reformulation patterns is a key step towards next generation web search engines: it can help improving users' web-search experience by predicting their intent, and thus helping them to locate information more effectively. As in this direction, we build an accurate model for classifying user reformulations into broad classes (generalization, specialization, error correction or parallel move), achieving 92\% accuracy. We apply the automatically label two large logs,...
Decision-making procedures in online social networks should reflect participants' political influence within the network.
Since the first investigations on web-graph compression, it has been clear that ordering of nodes a web graph fundamental influence compression rate (usually expressed as number bits per link). The authors LINK database [Randall et al. 02], for instance, investigated three different approaches: an _extrinsic_ (URL ordering) and two _intrinsic_ orderings based rows adjacency matrix (lexicographic Gray code); they concluded URL many advantages in spite small penalty compression. In this paper...