- Computational and Text Analysis Methods
- Opinion Dynamics and Social Influence
- Topic Modeling
- Complex Network Analysis Techniques
- Bayesian Methods and Mixture Models
- Advanced Text Analysis Techniques
- Misinformation and Its Impacts
- Stock Market Forecasting Methods
- Social Media and Politics
- Hate Speech and Cyberbullying Detection
- Sentiment Analysis and Opinion Mining
- Neural Networks and Applications
- Digital Marketing and Social Media
- Statistical Mechanics and Entropy
- Authorship Attribution and Profiling
- Media Influence and Politics
- Open Source Software Innovations
- Knowledge Management and Sharing
- Forecasting Techniques and Applications
- Media Studies and Communication
- Complex Systems and Time Series Analysis
- Impact of Technology on Adolescents
- Mental Health via Writing
- Mathematical Biology Tumor Growth
- Advanced Statistical Methods and Models
National Research University Higher School of Economics
2014-2024
Moscow Power Engineering Institute
2016
Topic modeling, in particular the Latent Dirichlet Allocation (LDA) model, has recently emerged as an important tool for understanding large datasets, in particular, user-generated datasets in social studies of the Web. In this work, we investigate the instability of LDA inference, propose a new metric of similarity between topics and a criterion of vocabulary reduction. We show the limitations of the LDA approach for the purposes of qualitative analysis in social science and sketch some ways for improvement.
Qualitative studies, such as sociological research, opinion analysis and media studies, can benefit greatly from automated topic mining provided by topic models such as latent Dirichlet allocation (LDA). However, examples of qualitative studies that employ topic modelling as a tool are currently few and far between. In this work, we identify two important problems along the way to using topic models in qualitative studies: the lack of a good quality metric that closely matches human judgement in understanding topics and the need to indicate specific subtopics that a study may be most...
The random forest algorithm is one of the most popular and commonly used algorithms for classification and regression tasks. It combines the output of multiple decision trees to form a single result. Random forest algorithms demonstrate the highest accuracy on tabular data compared to other algorithms in various applications. However, random forests and, more precisely, decision trees, are usually built with the application of classic Shannon entropy. In this article, we consider the potential of deformed entropies, which are successfully used in the field of complex systems, to increase...
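As a rough illustration of what swapping Shannon entropy for a deformed entropy in a decision-tree split could look like, the sketch below compares the impurity reduction of one candidate split under both measures. The abstract does not say which deformed entropies are studied; Tsallis entropy is used here only as one well-known example, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Classic Shannon entropy (in nats) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def tsallis_entropy(labels, q):
    """Tsallis entropy S_q = (1 - sum_i p_i^q) / (q - 1); q -> 1 recovers Shannon."""
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(labels)
    n = len(labels)
    return (1.0 - sum((c / n) ** q for c in Counter(labels).values())) / (q - 1.0)

def split_gain(parent, left, right, impurity):
    """Impurity reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Hypothetical toy node: 5 samples of class "a" and 5 of class "b".
parent = ["a"] * 5 + ["b"] * 5
left, right = ["a"] * 4 + ["b"], ["a"] + ["b"] * 4

gain_shannon = split_gain(parent, left, right, shannon_entropy)
gain_tsallis = split_gain(parent, left, right, lambda y: tsallis_entropy(y, q=2.0))
```

With q = 2 the Tsallis entropy coincides with the Gini impurity, so the deformation parameter q can be read as interpolating between familiar split criteria.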
This article describes agendas as “packages” of topics of varying salience, set by the Russian Internet users on Russia's leading blog platform LiveJournal. The research involved modeling LiveJournal's topic structure, viewed as an important component of what is termed here self‐generated public opinion. Topic modeling was performed automatically with the LDA algorithm, and complemented with hand labeling of topics. Data were collected with software created by the authors to generate a relational database storing all posts...
Topic modeling is a popular approach for clustering text documents. However, current tools have a number of unsolved problems such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method to partially solve the problems of optimizing model parameters, simultaneously accounting for semantic stability. Our method is inspired by concepts from statistical physics and is based on Sharma–Mittal entropy. We test our method on two models: probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation...
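For readers unfamiliar with Sharma–Mittal entropy, the following minimal sketch computes its standard two-parameter form for a discrete distribution and checks that it reduces to the Tsallis entropy when r = q and approaches the Renyi entropy as r tends to 1. It does not reproduce the paper's procedure for optimizing topic-model parameters; the function names and the example distribution are assumptions made for illustration.

```python
import math

def renyi(p, q):
    """Renyi entropy of order q (q != 1) of a discrete distribution p."""
    return math.log(sum(x ** q for x in p if x > 0)) / (1.0 - q)

def tsallis(p, q):
    """Tsallis entropy of order q (q != 1) of a discrete distribution p."""
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

def sharma_mittal(p, q, r):
    """Sharma-Mittal entropy for q != 1 and r != 1:
    S_{q,r} = [(sum_i p_i^q)^((1 - r) / (1 - q)) - 1] / (1 - r).
    """
    s = sum(x ** q for x in p if x > 0)
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

# Hypothetical distribution, e.g. the weights of three topics in one document.
p = [0.5, 0.3, 0.2]

sm_as_tsallis = sharma_mittal(p, q=2.0, r=2.0)         # r = q gives Tsallis
sm_near_renyi = sharma_mittal(p, q=2.0, r=1.0 + 1e-6)  # r -> 1 approaches Renyi
```

The two parameters make the family flexible: one value of (q, r) recovers each of the better-known entropies, which is why a single implementation can serve for comparing them.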
Topic modeling is a widely used instrument for the analysis of large text collections. In the last few years, neural topic models and models with word embeddings have been proposed to increase the quality of topic solutions. However, these models were not extensively tested in terms of stability and interpretability. Moreover, the question of selecting the number of topics (a model parameter) remains a challenging task. We aim to partially fill this gap by testing four well-known topic models available to a wide range of users, such as the embedded topic model (ETM), Gaussian Softmax...
Social studies of the Internet have adopted large-scale text mining for unsupervised discovery of topics related to specific subjects. A recently developed approach to topic modeling, additive regularization of topic models (ARTM), provides fast inference and more control over the topics with a wide variety of possible regularizers than developing LDA extensions. We apply ARTM to mining ethnic-related content from the Russian-language blogosphere, introduce a new combined regularizer, and compare models derived from ARTM with LDA. We show with human evaluations that ARTM is...
Purpose – The paper addresses the problem of what drives the formation of latent discussion communities, if any, in the blogosphere: topical composition of posts or their authorship? The purpose of this paper is to contribute to the knowledge about the structure of co-commenting. Design/methodology/approach – The research is based on a dataset of 17,386 full text posts written by the top 2,000 LiveJournal bloggers and over 520,000 comments that result in 4.5 million edges in the network of co-commenting, where posts are vertices. The Louvain algorithm is used to detect communities...
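One plausible way to build a co-commenting network with posts as vertices is sketched below. The abstract does not define when two posts are linked; the sketch assumes an edge whenever the same user commented on both posts, weighted by the number of shared commenters. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def co_commenting_edges(comments):
    """Build weighted edges between posts that share at least one commenter.

    `comments` is an iterable of (post_id, commenter_id) pairs.
    Returns {(post_a, post_b): shared_commenters} with post_a < post_b.
    """
    # Group the posts each user has commented on.
    posts_by_commenter = defaultdict(set)
    for post_id, commenter_id in comments:
        posts_by_commenter[commenter_id].add(post_id)

    # Every pair of posts commented on by the same user gains one unit of weight.
    edges = defaultdict(int)
    for posts in posts_by_commenter.values():
        for a, b in combinations(sorted(posts), 2):
            edges[(a, b)] += 1
    return dict(edges)

# Hypothetical toy data: three posts and three commenters.
comments = [
    ("post1", "alice"), ("post2", "alice"),
    ("post1", "bob"), ("post2", "bob"), ("post3", "bob"),
    ("post3", "carol"),
]
edges = co_commenting_edges(comments)
```

The resulting weighted edge list could then be loaded into a graph library that offers Louvain community detection, for example `louvain_communities` in NetworkX; that step is left out here to keep the sketch free of third-party dependencies.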
Topic modeling is a popular technique for clustering large collections of text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models: Probabilistic Latent Semantic Analysis (pLSA),...
Topic modeling is a powerful tool for analyzing large collections of user-generated web content, but it still suffers from problems with topic stability, which are especially important for social sciences. We evaluate stability for different topic models and propose a new model, granulated LDA (gLDA), that samples short sequences of neighboring words at once. We show that gLDA exhibits very stable results.
Hierarchical topic modeling is a potentially powerful instrument for determining topical structures of text collections that additionally allows constructing a hierarchy representing the levels of topic abstractness. However, parameter optimization in hierarchical models, which includes finding an appropriate number of topics at each level of the hierarchy, remains a challenging task. In this paper, we propose an approach based on Renyi entropy as a partial solution to the above problem. First, we introduce an entropy-based...
In this paper we apply multifractal formalism to the analysis of the statistical behaviour of topic models under the condition of a varying number of topics. Our analysis reveals the existence of two self-similar regions and one transition region in the density-of-states function depending on the number of topics. As a function that can be expressed through the density-of-states function was earlier successfully used to determine the optimal number of topics, we test the applicability of the density-of-states function for the same purpose. We provide numerical results for three topic models (PLSA, ARTM, and LDA Gibbs sampling) on marked-up collections containing texts...
In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves an extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process. The paper shows that a function of the output of topic modeling demonstrates self-similar behavior under variation of the number of clusters. Such behavior allows using a renormalization technique. A combination of a renormalization procedure with the Renyi entropy...
Recent advancements in large language models (LLMs) have opened new possibilities for developing conversational agents (CAs) in various subfields of mental healthcare. However, this progress is hindered by limited access to high-quality training data, often due to privacy concerns and high annotation costs for low-resource languages. A potential solution is to create human-AI annotation systems that utilize extensive public domain user-to-user and user-to-professional discussions on social media. These discussions,...
This study investigates the topical structure of the Russian-language blog-publishing service LiveJournal and the change in it that occurred in the course of public activity after the State Duma elections of December 2011 as compared to a previous "control" period (November 27-December 27 and August 15-September 15 respectively). The data for both periods have been automatically obtained from 2000 top-rated blogs on the basis of ratings published by LiveJournal. Unsupervised topic modelling of the sampled posts was done using Latent...
In this paper we describe structural and topical properties of "ordinary" blogs versus "popular" blogs. Using the complete directory of the Russian language LiveJournal, we sample both groups and show that the main difference between them is in the volume of posting activity and commenting feedback and in the skewedness of the respective distributions. No substantial differences in the topical structure obtained with the LDA algorithm are found, which suggests that ordinary bloggers do not hold a specific vision of topic salience and do not set their own "grassroots" agendas.