Paolo Boldi

ORCID: 0000-0002-8297-6255
Publications
Citations
Views
---
Saved
---
About
Contact & Profiles
Research Areas
  • Complex Network Analysis Techniques
  • Algorithms and Data Compression
  • Web Data Mining and Analysis
  • Graph Theory and Algorithms
  • Advanced Graph Neural Networks
  • Bioinformatics and Genomic Networks
  • Data Management and Algorithms
  • Advanced Proteomics Techniques and Applications
  • Opinion Dynamics and Social Influence
  • Advanced Graph Theory Research
  • Caching and Content Delivery
  • DNA and Biological Computing
  • Advanced Image and Video Retrieval Techniques
  • Peer-to-Peer Network Technologies
  • Graph theory and applications
  • Distributed systems and fault tolerance
  • Computability, Logic, AI Algorithms
  • Semantic Web and Ontologies
  • Topic Modeling
  • Internet Traffic Analysis and Secure E-voting
  • Information Retrieval and Search Behavior
  • Health, Environment, Cognitive Aging
  • Advanced Database Systems and Queries
  • Computational Geometry and Mesh Generation
  • Network Packet Processing and Optimization

University of Milan
2014-2024

Yahoo (Spain)
2014

Informa (Italy)
2004

Studying web graphs is often difficult due to their large size. Recently,several proposals have been published about various techniques that allow tostore a graph in memory limited space, exploiting the inner redundancies of web. The WebGraph framework suite codes, algorithms and tools aims at making it easy manipulate graphs. This papers presents compression used WebGraph, which are centred around referentiation intervalisation (which turn dual each other). can compress WebBase (118 Mnodes,...

10.1145/988672.988752 article EN 2004-05-17

We continue the line of research on graph compression started with WebGraph, but we move our focus to social networks in a proper sense (e.g., LiveJournal): approaches that have been used for long time compress web graphs rely specific ordering nodes (lexicographical URL ordering) whose extension general is not trivial. In this paper, propose solution mixes clusterings and orders, devise new algorithm, called Layered Label Propagation, builds previous work scalable clustering can be reorder...

10.1145/1963405.1963488 article EN 2011-03-28

Abstract We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation presence faults, very effective assignment function (based on consistent hashing) for partitioning domain to crawl, and more general complete decentralization every task. necessity handling large sets data has highlighted some limitations APIs, which prompted...

10.1002/spe.587 article EN Software Practice and Experience 2004-03-24

Frigyes Karinthy, in his 1929 short story "Láncszemek" (in English, "Chains") suggested that any two persons are distanced by at most six friendship links.1 Stanley Milgram famous experiments challenged people to route postcards a fixed recipient passing them only through direct acquaintances. found the average number of intermediaries on path lay between 4:4 and 5:7, depending sample chosen. We report results first world-scale social-network graph-distance computations, using entire...

10.1145/2380718.2380723 article EN 2012-06-22

Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology, and computer science, whole plethora _centrality measures_ (a.k.a. indices_, or _rankings_) were proposed to account for the importance network. In this study, we try provide mathematically sound survey most important classic centrality measures known from literature propose an _axiomatic_ approach establish whether they actually doing what have designed do. Our...

10.1080/15427951.2013.865686 article EN Internet Mathematics 2014-04-04

In this paper we study the problem of local triangle counting in large graphs. Namely, given a graph G = (V;E) want to estimate as accurately possible number triangles incident every node υ ∈ V graph. The computing global has been considered before, but our knowledge is first that addresses with focus on efficiency issues arising massive distribution and related clustering coefficient can be used many interesting applications. For example, show measures compute help detect presence spamming...

10.1145/1401890.1401898 article EN 2008-08-24

Query logs record the queries and actions of users search engines, as such they contain valuable information about interests, preferences, behavior users, well their implicit feedback to engine results. Mining wealth available in query has many important applications including query-log analysis, user profiling personalization, advertising, recommendation, more.In this paper we introduce query-flow graph, a graph representation interesting knowledge latent querying behavior. Intuitively,...

10.1145/1458082.1458163 article EN 2008-10-26

We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if hosts are include spam aspects or not. This is first publicly available collection includes page contents and links, has labelled by diverse judges.

10.1145/1189702.1189703 article EN ACM SIGIR Forum 2006-12-01

PageRank is defined as the stationary state of a Markov chain. The chain obtained by perturbing transition matrix induced web graph with damping factor α that spreads uniformly part rank. choice eminently empirical, and in most cases original suggestion = 0.85 Brin Page still used. Recently, however, behaviour respect to changes was discovered be useful link-spam detection[21]. Moreover, an analytical justification value chosen for missing. In this paper, we give first mathematical analysis...

10.1145/1060745.1060827 article EN 2005-01-01

We describe the techniques developed to gather and distribute in a highly compressed, yet accessible, form series of twelve snapshot .uk web domain. Ad hoc compression made it possible store snapshots using just 1:9 bits per link, with constant-time access temporal information. Our collection makes study evolution link-based scores (e.g., PageRank), growth online communities, general time-dependent phenomena related link structure.

10.1145/1480506.1480511 article EN ACM SIGIR Forum 2008-11-30

The query-flow graph [Boldi et al., CIKM 2008] is an aggregated representation of the latent querying behavior contained in a query log. Intuitively, directed edge from qi to qj means that two queries are likely be part same search mission. Any path over may seen as possible task, whose likelihood given by strength edges along path. An (qi, qj) also labelled with some information: e.g., probability user moves qj, or type transition, for instance, fact specialization qi.

10.1145/1507509.1507518 article EN 2009-02-09

The neighbourhood function NG(t) of a graph G gives, for each t ∈ N, the number pairs nodes x, y such that is reachable from x in less hops. provides wealth information about [10] (e.g., it easily allows one to compute its diameter), but very expensive exactly. Recently, ANF algorithm (approximate function) has been proposed with purpose approximating on large graphs. We describe breakthrough improvement over terms speed and scalability. Our algorithm, called HyperANF, uses new HyperLogLog...

10.1145/1963405.1963493 article EN 2011-03-28

Data collected nowadays by social-networking applications create fascinating opportunities for building novel services, as well expanding our understanding about social structures and their dynamics. Unfortunately, publishing social-network graphs is considered an ill-advised practice due to privacy concerns. To alleviate this problem, several anonymization methods have been proposed, aiming at reducing the risk of a breach on published data, while still allowing analyze them draw relevant...

10.14778/2350229.2350254 article EN Proceedings of the VLDB Endowment 2012-07-01

10.1016/s0012-365x(00)00455-6 article EN Discrete Mathematics 2002-01-01

research-article Share on PageRank: Functional dependencies Authors: Paolo Boldi Università degli Studi di Milano, MI, Italy ItalyView Profile , Massimo Santini Sebastiano Vigna Authors Info & Claims ACM Transactions Information SystemsVolume 27Issue 4Article No.: 19pp 1–23https://doi.org/10.1145/1629096.1629097Published:30 November 2009Publication History 71citation1,468DownloadsMetricsTotal Citations71Total Downloads1,468Last 12 Months42Last 6 weeks2 Get Citation AlertsNew Alert added!This...

10.1145/1629096.1629097 article EN ACM transactions on office information systems 2009-11-01

A minimal perfect hash function maps a set S of n keys into the {0, 1,..., -- 1} bijectively. Classical results state that hashing is possible in constant time using structure occupying space close to lower bound log e bits per element. Here we consider problem monotone hashing, which bijection required preserve lexicographical ordering keys. can be seen as very weak form index provides ranking just on (and answers randomly outside S). Our goal minimise description size function: show that,...

10.5555/1496770.1496856 article EN Symposium on Discrete Algorithms 2009-01-04

Although web crawlers have been around for twenty years by now, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine tools and at same time scales linearly with amount resources available. This paper aims filling this gap.

10.1145/2567948.2577304 article EN 2014-04-07

This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these there is damping function decreases with distance, so direct link implies more endorsement than long path. PageRank the most widely known this family.The main objective to determine whether techniques has some interest per se, and how different choices for impact on rank quality convergence speed. Even though our results suggest can be approximated other simpler forms...

10.1145/1148170.1148225 article EN 2006-08-06

In this article, we study the problem of approximate local triangle counting in large graphs. Namely, given a graph G =( V,E ) want to estimate as accurately possible number triangles incident every node v ∈ V graph. We consider question both for undirected and directed The computing global has been considered before, but our knowledge is first contribution that addresses with focus on efficiency issues arising massive graphs also considers case. distribution related clustering coefficient...

10.1145/1839490.1839494 article EN ACM Transactions on Knowledge Discovery from Data 2010-10-01

Article Free AccessComputing anonymously with arbitrary knowledge Share on Authors: Paolo Boldi Dipartimento di Scienze dell'Informazione, Università Milano, Italy ItalyView Profile , Sebastiano Vigna Authors Info & Claims PODC '99: Proceedings of the eighteenth annual ACM symposium Principles distributed computingMay 1999 Pages 181–188https://doi.org/10.1145/301308.301355Online:01 May 1999Publication History 60citation252DownloadsMetricsTotal Citations60Total Downloads252Last 12 Months4Last...

10.1145/301308.301355 article EN 1999-05-01

Understanding query reformulation patterns is a key step towards next generation web search engines: it can help improving users' web-search experience by predicting their intent, and thus helping them to locate information more effectively. As in this direction, we build an accurate model for classifying user reformulations into broad classes (generalization, specialization, error correction or parallel move), achieving 92\% accuracy. We apply the automatically label two large logs,...

10.1109/wi-iat.2009.34 article EN 2009-01-01

Decision-making procedures in online social networks should reflect participants' political influence within the network.

10.1145/1953122.1953154 article EN Communications of the ACM 2011-06-01

Since the first investigations on web-graph compression, it has been clear that ordering of nodes a web graph fundamental influence compression rate (usually expressed as number bits per link). The authors LINK database [Randall et al. 02], for instance, investigated three different approaches: an _extrinsic_ (URL ordering) and two _intrinsic_ orderings based rows adjacency matrix (lexicographic Gray code); they concluded URL many advantages in spite small penalty compression. In this paper...

10.1080/15427951.2009.10390641 article EN Internet Mathematics 2009-01-01
Coming Soon ...