NFDI4DS | UHH-SEMS - Publication Details

Carolin Strobl

ORCID: 0000-0003-0952-3230

About

Contact & Profiles

A5021387471

Research Areas

Psychometric Methodologies and Testing
Advanced Statistical Modeling Techniques
Mental Health Research Topics
Neural Networks and Applications
Gene expression and cancer classification
Statistical Methods and Inference
Data Mining Algorithms and Applications
Data Analysis with R
Advanced Statistical Methods and Models
Bayesian Modeling and Causal Inference
Statistical Methods and Bayesian Inference
Bioinformatics and Genomic Networks
Cognitive Abilities and Testing
Genetic and phenotypic traits in livestock
Explainable Artificial Intelligence (XAI)
Imbalanced Data Classification Techniques
Statistical Methods in Clinical Trials
Evolutionary Algorithms and Applications
Bayesian Methods and Mixture Models
Multi-Criteria Decision Making
Forest ecology and management
Statistics Education and Methodologies
Sociology and Education Studies
Genetic Associations and Epidemiology
Optimal Experimental Design Methods

University of Zurich
2016-2025

University of Basel
2023

Ludwig-Maximilians-Universität München
2005-2021

Indiana University
2021

University of Washington
2021

University of Missouri
2020

Universität Innsbruck
2011-2016

University Hospital Heidelberg
2011

Zimmer Biomet (Germany)
2011

Heidelberg University
2011

Bias in random forest variable importance measures: Illustrations, sources and a solution

OPENALEX - Publications

Carolin Strobl, Anne‐Laure Boulesteix, Achim Zeileis, Torsten Hothorn

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational...

10.1186/1471-2105-8-25 article EN cc-by BMC Bioinformatics 2007-01-25

Conditional variable importance for random forests

Carolin Strobl, Anne‐Laure Boulesteix, Thomas Kneib, Thomas Augustin, Achim Zeileis

Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. We identify two mechanisms responsible for this finding: (i) A preference for the selection of correlated predictors in the tree building process and (ii) an additional...

10.1186/1471-2105-9-307 article EN cc-by BMC Bioinformatics 2008-07-11

An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests.

Carolin Strobl, James D. Malley, Gerhard Tutz

Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and bioinformatics within the past few years. High-dimensional problems are common not only in genetics, but also in some areas of psychological research, where only a few subjects can be measured because of time or...

10.1037/a0016973 article EN Psychological Methods 2009-12-01

The behaviour of random forest permutation-based variable importance measures under predictor correlation

Kristin K. Nicodemus, James D. Malley, Carolin Strobl, Andreas Ziegler

Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results. In the case when both predictor correlation was present and predictors were associated with the outcome (HA), the unconditional RF VIM attributed a higher share of importance to correlated predictors,...

10.1186/1471-2105-11-110 article EN cc-by BMC Bioinformatics 2010-02-27

An AUC-based permutation variable importance measure for random forests

Silke Janitza, Carolin Strobl, Anne‐Laure Boulesteix

The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge the performance of the VIMs has not yet been examined in the case of strongly unbalanced response classes. In this paper we...

10.1186/1471-2105-14-119 article EN cc-by BMC Bioinformatics 2013-04-05

Unbiased split selection for classification trees based on the Gini Index

Carolin Strobl, Anne‐Laure Boulesteix, Thomas Augustin

10.1016/j.csda.2006.12.030 article EN Computational Statistics & Data Analysis 2006-12-23

Party on!

Carolin Strobl, Torsten Hothorn, Achim Zeileis

10.32614/rj-2009-013 article The R Journal 2009-01-01

A new variable importance measure for random forests with missing data

Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm, Carolin Strobl

10.1007/s11222-012-9349-1 article EN Statistics and Computing 2012-08-27

Rasch Trees: A New Method for Detecting Differential Item Functioning in the Rasch Model

Carolin Strobl, Julia Kopf, Achim Zeileis

10.1007/s11336-013-9388-3 article EN Psychometrika 2013-12-18

Conditional permutation importance revisited

Dries Debeer, Carolin Strobl

Random forest based variable importance measures have become popular tools for assessing the contributions of the predictor variables in a fitted random forest. In this article we reconsider a frequently used variable importance measure, the Conditional Permutation Importance (CPI). We argue and illustrate that the CPI corresponds to a more partial quantification of variable importance and suggest several improvements in its methodology and implementation that enhance its practical value. In addition, we introduce the threshold value in the CPI algorithm as a parameter that can make the CPI more partial or...

10.1186/s12859-020-03622-2 article EN cc-by BMC Bioinformatics 2020-07-14

Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis

Martin Schweinsberg, Michael B. Feldman, Nicola Staub, Olmo R. van den Akker, Robbie C. M. van Aert, and 95 more

In this crowdsourced initiative, independent analysts used the same dataset to test two hypotheses regarding the effects of scientists' gender and professional status on verbosity during group meetings. Not only the analytic approach but also the operationalizations of key variables were left unconstrained and up to individual analysts. For instance, analysts could choose to operationalize status as job title, institutional ranking, citation counts, or some combination. To maximize transparency regarding the process by which analytic choices are made, the analysts used a...

10.1016/j.obhdp.2021.02.003 article EN cc-by Organizational Behavior and Human Decision Processes 2021-06-17

Anchor Selection Strategies for DIF Analysis

Julia Kopf, Achim Zeileis, Carolin Strobl

Differential item functioning (DIF) indicates the violation of the invariance assumption, for instance, in models based on item response theory (IRT). For item-wise DIF analysis using IRT, a common metric for the item parameters of the groups that are to be compared (e.g., for the reference and the focal group) is necessary. In the Rasch model, therefore, the same linear restriction is imposed in both groups. Items in the restriction are termed the "anchor items". Ideally, these items are DIF-free to avoid artificially augmented false alarm rates. However, the question how DIF-free anchor...

10.1177/0013164414529792 article EN Educational and Psychological Measurement 2014-04-21

Predictors of depression among middle-aged and older men and women in Europe: A machine learning approach

Elizabeth P. Handing, Carolin Strobl, Yuqin Jiao, Leilani Feliciano, Stephen Aichele

The high prevalence of depression in a growing aging population represents a critical public health issue. It is unclear how social, health, cognitive, and functional variables rank as risk/protective factors for depression among older adults and whether there are conspicuous differences among men and women. We used random forest analysis (RFA), a machine learning method, to compare 56 risk/protective factors for depression in a large representative sample of European older adults (N = 67,603; ages 45-105y; 56.1% women; 18 countries) from the Survey of Health, Ageing and Retirement...

10.1016/j.lanepe.2022.100391 article EN cc-by-nc-nd The Lancet Regional Health - Europe 2022-04-29

Interpretable machine learning for psychological research: Opportunities and pitfalls.

Mirka Henninger, Rudolf Debelak, Yannick Rothacher, Carolin Strobl

In recent years, machine learning methods have become increasingly popular prediction methods in psychology. At the same time, psychological researchers are typically not only interested in making predictions about the dependent variable, but also in learning which predictor variables are relevant, how they influence the dependent variable, and which predictors interact with each other. However, most machine learning methods are not directly interpretable. Interpretation techniques that support researchers in describing how the machine learning technique came to its prediction may be a means to this end. We present a variety of...

10.1037/met0000560 article EN Psychological Methods 2023-05-25

Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations

Anne‐Laure Boulesteix, Andreas Bender, Justo Lorenzo Bermejo, Carolin Strobl

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered as an important pitfall, namely that common variants are systematically favoured by the widely used Gini VIM. As a consequence, researchers may overlook rare variants that contribute to the missing...

10.1093/bib/bbr053 article EN Briefings in Bioinformatics 2011-09-10

Evaluating Microarray-based Classifiers: An Overview

Anne‐Laure Boulesteix, Carolin Strobl, Thomas Augustin, Martin Däumer

For the last eight years, microarray-based class prediction has been the subject of numerous publications in medicine, bioinformatics and statistics journals. However, in many articles, the assessment of classification accuracy is carried out using suboptimal procedures and is not paid much attention. In this paper, we carefully review various statistical aspects of classifier evaluation and validation from a practical point of view. The main topics addressed are accuracy measures, error rate estimation procedures, variable...

10.4137/cin.s408 article EN cc-by-nc Cancer Informatics 2008-01-01

Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable Importance

Carolin Strobl, Achim Zeileis

Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both the advantages and limitations of different variable importance scores and associated testing procedures, especially in the context of correlated predictor variables. For the test of Breiman and Cutler (2008), we investigate the statistical properties and find that the power depends on the sample size and...

10.5282/ubm/epub.2111 article EN 2008-01-30

Forest management and regional tree composition drive the host preference of saproxylic beetle communities

Jörg Müller, Beate Wende, Carolin Strobl, Manuel J. A. Eugster, Iris Gallenberger, and 5 more

Summary: Among saproxylic beetles, many early colonizers prefer particular host species. Ranking of preferred hosts of local saproxylic beetle communities is critical for effective dead‐wood management in forests, but is rarely done because experiments with numerous tree species are labour and cost intensive. We analysed the host preference of saproxylic beetles on logs of 13 tree species in relation to management (unmanaged and managed beech stands, and managed conifer plantations on natural beech sites) in three regions in Germany during the most critical period of host specificity, that is the first two years after...

10.1111/1365-2664.12421 article EN Journal of Applied Ecology 2015-03-06

Analysis of the individual and aggregate genetic contributions of previously identified serine peptidase inhibitor Kazal type 5 (SPINK5), kallikrein-related peptidase 7 (KLK7), and filaggrin (FLG) polymorphisms to eczema risk

Stephan Weidinger, Hansjörg Baurecht, Stefan Wagenpfeil, John Henderson, Natalija Novak, and 25 more

10.1016/j.jaci.2008.05.050 article EN Journal of Allergy and Clinical Immunology 2008-09-01

Accounting for Individual Differences in Bradley-Terry Models by Means of Recursive Partitioning

Carolin Strobl, Florian Wickelmaier, Achim Zeileis

The preference scaling of a group of subjects may not be homogeneous, but different groups of subjects with certain characteristics may show different preference scalings, each of which can be derived from paired comparisons by means of the Bradley-Terry model. Usually, either different models are fit in predefined subsets of the sample or the effects of subject covariates are explicitly specified in a parametric model. In both cases, categorical covariates can be employed directly to distinguish between different groups, while numeric covariates are typically discretized prior to modeling. Here, a semiparametric approach for...

10.3102/1076998609359791 article EN Journal of Educational and Behavioral Statistics 2011-04-01

Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction

Anne‐Laure Boulesteix, Carolin Strobl

In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this...

10.1186/1471-2288-9-85 article EN cc-by BMC Medical Research Methodology 2009-12-01

Measuring the Stability of Results From Supervised Statistical Learning

Michel Philipp, Thomas Rusch, Kurt Hornik, Carolin Strobl

Stability is a major requirement to draw reliable conclusions when interpreting results from supervised statistical learning. In this article, we present a general framework for assessing and comparing the stability of results, which can be used in real-world statistical learning applications as well as in simulation and benchmark studies. We use the framework to show that stability is a property of both the algorithm and the data-generating process. In particular, we demonstrate that unstable algorithms (such as recursive partitioning) can produce stable results when the functional form...

10.1080/10618600.2018.1473779 article EN Journal of Computational and Graphical Statistics 2018-05-18
