- Complex Network Analysis Techniques
- Data Management and Algorithms
- Opinion Dynamics and Social Influence
- Advanced Graph Neural Networks
- Social Media and Politics
- Data Mining Algorithms and Applications
- Algorithms and Data Compression
- Advanced Database Systems and Queries
- Mobile Crowdsensing and Crowdsourcing
- Web Data Mining and Analysis
- Peer-to-Peer Network Technologies
- Misinformation and Its Impacts
- Graph Theory and Algorithms
- Complexity and Algorithms in Graphs
- Caching and Content Delivery
- Rough Sets and Fuzzy Logic
- Advanced Clustering Algorithms Research
- Privacy-Preserving Technologies in Data
- Optimization and Search Problems
- Recommender Systems and Techniques
- Advanced Image and Video Retrieval Techniques
- Spam and Phishing Detection
- Human Mobility and Location-Based Analysis
- Advanced Graph Theory Research
- Machine Learning and Algorithms
KTH Royal Institute of Technology
2019-2025
Aalto University
2013-2021
Helsinki Institute for Information Technology
2005-2020
National University of Singapore
2020
University of Technology
2019
University of Helsinki
2004-2017
St. Francis Hospital
2016
Taylor Wimpey (United Kingdom)
2016
Yahoo (United Kingdom)
2007-2013
Yahoo (Spain)
2007-2013
The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability such increases, task identifying high-quality sites based on user contributions --social media -- becomes increasingly important. Social in general exhibit a rich variety information sources: addition itself, there is wide array non-content available, as links between items explicit ratings members community. In this paper we investigate methods for exploiting community feedback...
We consider the problem of maintaining aggregates and statistics over data streams, with respect to last N elements seen so far. refer this model as sliding window model. following basic problem: Given a stream bits, maintain count number 1's in from stream. show that, using $O(\frac{1}{\epsilon} \log^2 N)$ bits memory, we can estimate within factor $1 + \epsilon$. also give matching lower bound $\Omega(\frac{1}{\epsilon}\log^2 memory for any deterministic or randomized algorithms. extend...
We consider the following problem: given a set of clusterings, find single clustering that agrees as much possible with input clusterings. This problem, aggregation , appears naturally in various contexts. For example, categorical data is an instance problem; each attribute can be viewed rows where are grouped together if they take same value on attribute. Clustering also used metaclustering method to improve robustness by combining output multiple algorithms. Furthermore, problem...
A lot of research in graph mining has been devoted the discovery communities. Most work focused scenario where communities need to be discovered with only reference input graph. However, for many interesting applications one is interested finding community formed by a given set nodes. In this paper we study query-dependent variant community-detection problem, which call community-search problem: G, and query nodes graph, seek find subgraph G that contains it densely connected.
Social network analysis has gained significant attention in recent years, largely due to the success of online social networking and media-sharing sites, consequent availability a wealth data. In spite growing interest, however, there is little understanding potential business applications mining networks. While large body research on different problems methods for mining, gap between techniques developed by community their deployment real-world applications. Therefore impact these still...
Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial engines to detect pages efficiently and accurately. In this paper we present detection system that combines link-based content-based features, uses topology graph by exploiting link dependencies among pages. We find linked hosts tend belong same class: either both are or non-spam. demonstrate three methods incorporating into predictions obtained our base classifier: (i)...
In this paper we study the problem of local triangle counting in large graphs. Namely, given a graph G = (V;E) want to estimate as accurately possible number triangles incident every node υ ∈ V graph. The computing global has been considered before, but our knowledge is first that addresses with focus on efficiency issues arising massive distribution and related clustering coefficient can be used many interesting applications. For example, show measures compute help detect presence spamming...
We study the problem of online team formation. consider a setting in which people possess different skills and compatibility among potential members is modeled by social network. A sequence tasks arrives an fashion, each task requires specific set skills. The goal to form new upon arrival task, so that (i) possesses all required (ii) has small communication overhead, (iii) workload performing balanced fairest possible way.
Web data and computational models can play important roles in analyzing cultural trends.The current study presents an analysis of 23,492 sneaker images metadata collected from a global reselling shop, StockX.com.Based on encompassing 22 years 1999 to 2020, we propose design index that helps track changes the characteristics sneakers using contrastive learning method.Our suggest designs have been employing brighter colors lower hue saturation values over time.We also observe how popular...
Query logs record the queries and actions of users search engines, as such they contain valuable information about interests, preferences, behavior users, well their implicit feedback to engine results. Mining wealth available in query has many important applications including query-log analysis, user profiling personalization, advertising, recommendation, more.In this paper we introduce query-flow graph, a graph representation interesting knowledge latent querying behavior. Intuitively,...
Which topics spark the most heated debates on social media? Identifying those is not only interesting from a societal point of view but also allows filtering and aggregation media content for disseminating news stories. In this article, we perform systematic methodological study controversy detection by using network structure media. Unlike previous work, rather than studying in single hand-picked topic domain-specific knowledge, take general approach to any domain . Our quantifying based...
Finding dense subgraphs is an important graph-mining task with many applications. Given that the direct optimization of edge density not meaningful, as even a single achieves maximum density, research has focused on optimizing alternative functions. A very popular among such functions average degree, whose maximization leads to well-known densest-subgraph notion. Surprisingly enough, however, densest are typically large graphs, small and diameter.
In this paper we study approximate landmark-based methods for point-to-point distance estimation in very large networks. These involve selecting a subset of nodes as landmarks and computing offline the distances from each node graph to those landmarks. At runtime, when between pair is needed, it can be estimated quickly by combining precomputed distances. We prove that optimal set an NP-hard problem, thus heuristic solutions need employed. therefore explore theoretical insights devise...
We study the problem of correlating micro-blogging activity with stock-market events, defined as changes in price and traded volume stocks. Specifically, we collect messages related to a number companies, search for correlations between events those companies features extracted from messages. The extract can be categorized two groups. Features first group measure overall platform, such posts, re-posts, so on. second properties an induced interaction graph, instance, connected components,...
Echo chambers, i.e., situations where one is exposed only to opinions that agree with their own, are an increasing concern for the political discourse in many democratic countries. This paper studies phenomenon of echo chambers on social media. We identify two components phenomenon: opinion shared, and »chamber» (i.e., network) allows »echo» be re-shared -- examine closely at how these interact. define a production consumption measure social-media users, which captures leaning content shared...
We present a unified approach for simultaneously clustering and discovering outliers in data. Our is formalized as generalization of the k-MEANS problem. prove that problem NP-hard then practical polynomial time algorithm, which guaranteed to converge local optimum. Furthermore we extend our all distance measures can be expressed form Bregman divergence. Experiments on synthetic real datasets demonstrate effectiveness utility carrying out both outlier detection concurrent manner. In...
Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, well-known a priori algorithm is only effective when rules interest are relationships that occur very frequently. However, there number applications, such as data mining, identification similar Web documents, clustering, and collaborative filtering, where have comparatively few instances in data. these cases, we must look for highly correlated items, or possibly even...
Matrix decomposition methods represent a data matrix as product of two factor matrices: one containing basis vectors that meaningful concepts in the data, and another describing how observed can be expressed combinations vectors. Decomposition have been studied extensively, but many return real-valued matrices. Interpreting matrices is hard if original Boolean. In this paper, we describe formulation for Boolean Discrete Basis Problem. The problem seeks binary matrix, thus allowing user to...
We consider the problem of maintaining aggregates and statistics over data streams, with respect to last N elements seen so far. refer this model as sliding window model. following basic problem: Given a stream bits, maintain count number 1's in from stream. show that using O(1/e log2N) bits memory, we can estimate within factor 1 + e. also give matching lower bound Ω(1/e memory for any deterministic or randomized algorithms. extend our scheme sum positive integers. provide upper bounds more...
Complex networks, such as biological, social, and communication often entail uncertainty, thus, can be modeled probabilistic graphs . Similar to the problem of similarity search in standard graphs, a fundamental for is efficiently answer k-nearest neighbor queries ( k -NN), which computing closest nodes some specific node. In this paper we introduce framework processing -NN graphs. We propose novel distance functions that extend well-known graph concepts, shortest paths. order compute them...
The problem of assessing the significance data mining results on high-dimensional 0--1 datasets has been studied extensively in literature. For problems such as frequent sets and finding correlations, testing can be done by standard statistical tests chi-square, or other methods. However, depend only specific attributes not dataset a whole. Moreover, are difficult to apply patterns complex algorithms. In this article, we consider simple randomization technique that deals with shortcoming....
In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore impact of different approaches, such as static vs. dynamic caching, and query results vs.caching posting lists. Using a log spanning whole year limitations demonstrate that lists can achieve higher hit rates than answers. propose new algorithm lists, which outperforms previous methods. also problem finding optimal way to split cache between answers Finally, measure how changes...
XML is rapidly emerging as the new standard for data representation and exchange on Web. An document can be accompanied by a Document Type Descriptor (DTD) which plays role of schema an collection. DTDs contain valuable information structure documents thus have crucial in efficient storage data, well effective formulation optimization queries. In this paper, we propose XTRACT, novel system inferring DTD database documents. Since syntax incorporates full expressive power regular expressions ,...