- Complex Network Analysis Techniques
- Advanced Clustering Algorithms Research
- Opinion Dynamics and Social Influence
- Graph theory and applications
- Advanced Graph Neural Networks
- Web Data Mining and Analysis
- Stochastic processes and statistical mechanics
- Random Matrices and Applications
- Anomaly Detection Techniques and Applications
- Data Management and Algorithms
- Peer-to-Peer Network Technologies
- Advanced Graph Theory Research
- Data Visualization and Analytics
- Attachment and Relationship Dynamics
- Graph Theory and Algorithms
- Advanced Image and Video Retrieval Techniques
- Machine Learning and Algorithms
- Neural Networks and Applications
- Caching and Content Delivery
- Stochastic Gradient Optimization Techniques
- Machine Learning and Data Classification
- Text and Document Classification Technologies
- Imbalanced Data Classification Techniques
- Mental Health Research Topics
- Markov Chains and Monte Carlo Methods
Yandex (Russia)
2013-2022
National Research University Higher School of Economics
2019-2022
Moscow Institute of Physics and Technology
2014-2020
Lomonosov Moscow State University
2013-2015
Retweet cascades play an essential role in information diffusion on Twitter. Popular tweets reflect the current trends on Twitter, while Twitter itself is one of the most important online media. Thus, understanding the reasons why a tweet becomes popular is of great interest for sociologists, marketers and social media researchers. What is even more important is the possibility to make a prognosis of a tweet's future popularity. Besides the scientific significance of such a possibility, this sort of prediction has lots of practical applications such as...
This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets. Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage...
There has been significant research done on developing methods for improving robustness to distributional shift and uncertainty estimation. In contrast, only limited work has examined standard datasets and benchmarks for assessing these approaches. Additionally, most work on uncertainty estimation has developed new techniques based on small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which offer challenges involving...
Node classification is a classical graph machine learning task on which Graph Neural Networks (GNNs) have recently achieved strong results. However, it is often believed that standard GNNs only work well for homophilous graphs, i.e., graphs where edges tend to connect nodes of the same class. Graphs without this property are called heterophilous, and it is typically assumed that specialized methods are required to achieve strong performance on such graphs. In this work, we challenge this assumption. First, we show that the datasets used...
Several performance measures can be used for evaluating classification results: accuracy, F-measure, and many others. Can we say that some of them are better than others, or, ideally, choose one measure that is best in all situations? To answer this question, we conduct a systematic analysis of classification performance measures: we formally define a list of desirable properties and theoretically analyze which measures satisfy which properties. We also prove an impossibility theorem: some desirable properties cannot be simultaneously satisfied. Finally, we propose a new family of measures satisfying...
Community detection is one of the most important problems in network analysis. Among many algorithms proposed for this task, methods based on statistical inference are of particular interest: they are mathematically sound and were shown to provide partitions of good quality. Statistical inference methods are based on fitting some random graph model (a.k.a. null model) to the observed network by maximizing the likelihood. The choice of this model is extremely important and is the main focus of the current study. We provide an extensive theoretical and empirical analysis to compare several models: the widely used...
For many practical, high-risk applications, it is essential to quantify uncertainty in a model's predictions to avoid costly mistakes. While predictive uncertainty is widely studied for neural networks, the topic seems to be under-explored for models based on gradient boosting. However, gradient boosting often achieves state-of-the-art results on tabular data. This work examines a probabilistic ensemble-based framework for deriving uncertainty estimates in the predictions of gradient boosting classification and regression models. We conducted experiments on a range of synthetic and real...
In this paper, we address the problem of quick detection of high-degree entities in large online social networks. The practical importance of this problem is attested by a large number of companies that continuously collect and update statistics about popular entities, usually using the degree of an entity as an approximation of its popularity. We suggest a simple, efficient, and easy to implement two-stage randomized algorithm that provides highly accurate solutions to this problem. For instance, our algorithm needs only one thousand API requests in order to find...
Modularity is designed to measure the strength of division of a network into clusters (known also as communities). Networks with high modularity have dense connections between the vertices within clusters but sparse connections between vertices of different clusters. As a result, modularity is often used in optimization methods for detecting community structure in networks, and so it is an important graph parameter from a practical point of view. Unfortunately, many existing non-spatial models of complex networks do not generate graphs with high modularity; on the other hand,...
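To make the notion of modularity concrete, the following is a minimal sketch that assumes the standard Newman-Girvan definition for an undirected, unweighted graph, Q = sum over clusters c of (e_c / m - (vol_c / 2m)^2), where m is the number of edges, e_c is the number of edges inside cluster c, and vol_c is the sum of degrees in c. The function name `modularity` and the toy graph are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Newman-Girvan modularity of a partition (assumed definition).

    edges      -- list of undirected edges (u, v), no duplicates
    cluster_of -- dict mapping each vertex to its cluster label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    volume = defaultdict(int)    # sum of degrees of the cluster's vertices
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        volume[cu] += 1
        volume[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (volume[c] / (2 * m)) ** 2 for c in volume)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "all" for v in range(6)}

q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
```

On this toy graph, splitting along the bridge gives a positive score, while putting every vertex in one cluster gives zero, which matches the intuition of dense connections inside clusters and sparse connections between them.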
Graph neural networks (GNNs) are powerful models that have been successful in various graph representation learning tasks. At the same time, gradient boosted decision trees (GBDT) often outperform other machine learning methods when faced with heterogeneous tabular data. But what approach should be used for graphs with tabular node features? Previous GNN models have mostly focused on networks with homogeneous sparse features and, as we show, are suboptimal in the heterogeneous setting. In this work, we propose a novel architecture that trains GBDT and GNN jointly to get the best of...
When information or infectious diseases spread over a network, in many practical cases, one can observe when nodes adopt information or become infected, but the underlying network is hidden. In this paper, we analyze the problem of finding communities of highly interconnected nodes, given only the infection times of nodes. We propose, analyze, and empirically compare several algorithms for this task. The most stable performance, which improves the current state-of-the-art, is obtained by our proposed heuristic approaches, which are...
Graph-based approaches are empirically shown to be very successful for the nearest neighbor search (NNS). However, there has been little research on their theoretical guarantees. We fill this gap and rigorously analyze the performance of graph-based NNS algorithms, specifically focusing on the low-dimensional ($d \ll \log n$) regime. In addition to the basic greedy algorithm on nearest neighbor graphs, we also analyze the most successful heuristics commonly used in practice: speeding up via adding shortcut edges and improving accuracy via maintaining a...
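The basic greedy algorithm mentioned above can be sketched as follows. This is an illustrative version only: it assumes Euclidean distance, a brute-force k-nearest-neighbor graph, and a caller-supplied start vertex. The helper names `build_knn_graph` and `greedy_search` are hypothetical, and the sketch omits the shortcut-edge heuristic named in the abstract.

```python
import math
import random


def build_knn_graph(points, k):
    """Brute-force directed k-NN graph: vertex i points to its k closest points."""
    graph = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph.append(others[:k])
    return graph


def greedy_search(points, graph, query, start):
    """Walk to the neighbor closest to the query until no neighbor improves."""
    current = start
    current_dist = math.dist(points[current], query)
    while True:
        if not graph[current]:
            return current
        best = min(graph[current], key=lambda j: math.dist(points[j], query))
        best_dist = math.dist(points[best], query)
        if best_dist >= current_dist:
            return current  # local minimum with respect to the graph
        current, current_dist = best, best_dist


random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
graph = build_knn_graph(points, k=8)
query = (0.5, 0.5)
found = greedy_search(points, graph, query, start=0)
```

The walk always terminates because the distance to the query strictly decreases at every step. It returns a vertex that no graph neighbor improves upon, which is not guaranteed to be the true nearest neighbor.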
In this article, we study the clustering properties of the spatial preferential attachment (SPA) model. This model naturally combines geometry and preferential attachment using the notion of spheres of influence. It was previously shown in several research papers that graphs generated by the SPA model are similar to real-world networks in many aspects. Also, this model was successfully used for several practical applications. However, the clustering properties of the SPA model were not fully analysed. The clustering coefficient is an important characteristic of complex networks which is tightly connected with its community...
In this paper, we study the problem of timely finding and crawling of ephemeral new pages, i.e., pages for which user traffic grows really quickly right after they appear, but lasts only for several days (e.g., news, blog and forum posts). Traditional crawling policies do not give any particular priority to such pages and may thus crawl them not quickly enough, or even crawl already obsolete content. We propose a new metric, well thought out for this task, which takes into account the decrease of user interest in ephemeral pages over time.
In this article, we present a detailed analysis of the global clustering coefficient in scale-free graphs. Many observed real-world networks of diverse nature have a power-law degree distribution. Moreover, the observed degree distribution usually has an infinite variance. Therefore, we are especially interested in such degree distributions. In addition, we analyze the clustering coefficient for both weighted and unweighted graphs. There are two well-known definitions of the clustering coefficient of a graph: the global and the average local clustering coefficients. There are several models proposed in the literature for which the average local clustering coefficient tends to a positive constant...
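The two definitions contrasted above can be illustrated with a short sketch for an unweighted simple graph. It assumes the usual conventions: the global coefficient is the fraction of paths of length two (wedges) whose endpoints are adjacent, and the average local coefficient is the mean over vertices of the fraction of adjacent neighbor pairs, with vertices of degree below two contributing zero. The function name `clustering_coefficients` and the toy graph are illustrative assumptions.

```python
from itertools import combinations


def clustering_coefficients(adj):
    """Return (global, average local) clustering coefficients.

    adj -- dict mapping each vertex to the set of its neighbors
           (undirected simple graph, no self-loops)
    """
    closed = 0       # wedges whose endpoints are adjacent (3 per triangle)
    wedges = 0       # all paths of length two, counted by center vertex
    local_sum = 0.0
    for v, nbrs in adj.items():
        d = len(nbrs)
        pairs = d * (d - 1) // 2
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        wedges += pairs
        closed += links
        if pairs:
            local_sum += links / pairs  # degree < 2 contributes 0 (assumed convention)
    global_cc = closed / wedges if wedges else 0.0
    avg_local_cc = local_sum / len(adj) if adj else 0.0
    return global_cc, avg_local_cc


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant vertex 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
global_cc, avg_local_cc = clustering_coefficients(adj)
```

Even on this four-vertex graph the two values differ (3/5 for the global coefficient and 7/12 for the average local one), because the global version weights each vertex by its number of neighbor pairs while the local average weights all vertices equally.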