- Gene expression and cancer classification
- Genomics and Phylogenetic Studies
- Bioinformatics and Genomic Networks
- Gene Regulatory Network Analysis
- Genetic diversity and population structure
- Evolution and Paleontology Studies
- Genetic Mapping and Diversity in Plants and Animals
- Statistical Methods and Inference
- Genetic and phenotypic traits in livestock
- Statistical Methods in Clinical Trials
- Molecular Biology Techniques and Applications
- Metabolomics and Mass Spectrometry Studies
- Bayesian Methods and Mixture Models
- RNA and protein synthesis mechanisms
- Statistical Distribution Estimation and Applications
- Algorithms and Data Compression
- Chromosomal and Genetic Variations
- Circadian rhythm and melatonin
- Light effects on plants
- Genetic Associations and Epidemiology
- Mass Spectrometry Techniques and Applications
- Advanced Proteomics Techniques and Applications
- Genetics, Bioinformatics, and Biomedical Research
- Fractal and DNA sequence analysis
- Gaussian Processes and Bayesian Inference
University of Manchester
2019-2023
Imperial College London
2002-2017
Leipzig University
2007-2015
Supélec
2010
Ludwig-Maximilians-Universität München
1996-2008
Institut für Urheber- und Medienrecht
2006
Institut für Angewandte Statistik
2005
University of Oxford
2001-2004
Max Planck Institute of Biochemistry
2000
Analysis of Phylogenetics and Evolution (APE) is a package written in the R language for use in molecular evolution and phylogenetics. APE provides both utility functions for reading and writing data and manipulating phylogenetic trees, as well as several advanced methods for phylogenetic and evolutionary analysis (e.g. comparative and population genetic methods). APE takes advantage of the many R functions for statistics and graphics, and also provides a flexible framework for developing and implementing further statistical methods for the analysis of evolutionary processes. The program is free and available from the official R package archive at...
A versatile method, quartet puzzling, is introduced to reconstruct the topology (branching pattern) of a phylogenetic tree based on DNA or amino acid sequence data. This method applies maximum-likelihood tree reconstruction to all possible quartets that can be formed from n sequences. The quartet trees serve as starting points to reconstruct a set of optimal n-taxon trees. The majority rule consensus of these trees defines the quartet puzzling tree and shows groupings that are well supported. Computer simulations show that the performance of quartet puzzling to reconstruct the true tree is always equal to or better than...
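To make the combinatorics of the quartet step concrete, the following minimal sketch enumerates all four-taxon subsets of n sequences together with the three fully resolved unrooted topologies each quartet admits. The function name `quartet_topologies` and the taxon labels are hypothetical; the likelihood evaluation and the puzzling step described in the abstract are not implemented here.

```python
from itertools import combinations
from math import comb


def quartet_topologies(taxa):
    """Yield each 4-taxon subset with its three fully resolved unrooted topologies.

    A topology is written as a pair of pairs, e.g. ((a, b), (c, d)) for the
    split ab|cd.
    """
    for a, b, c, d in combinations(taxa, 4):
        yield (a, b, c, d), [
            ((a, b), (c, d)),
            ((a, c), (b, d)),
            ((a, d), (b, c)),
        ]


taxa = ["A", "B", "C", "D", "E"]  # hypothetical sequence labels
quartets = list(quartet_topologies(taxa))

# The number of quartets is the binomial coefficient C(n, 4).
assert len(quartets) == comb(len(taxa), 4)
```

Because the number of quartets grows as C(n, 4), the per-quartet likelihood step quickly dominates the running time as n increases.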
Abstract Summary: TREE-PUZZLE is a program package for quartet-based maximum-likelihood phylogenetic analysis (formerly PUZZLE, Strimmer and von Haeseler, Mol. Biol. Evol., 13, 964–969, 1996) that provides methods for reconstruction, comparison, and testing of trees and models on DNA as well as protein sequences. To reduce waiting time for larger datasets the tree reconstruction part of the software has been parallelized using message passing that runs on clusters of workstations as well as parallel computers. Availability:...
Inferring large-scale covariance matrices from sparse genomic data is a ubiquitous problem in bioinformatics. Clearly, the widely used standard covariance and correlation estimators are ill-suited for this purpose. As a statistically efficient and computationally fast alternative we propose a novel shrinkage covariance estimator that exploits the Ledoit-Wolf (2003) lemma for analytic calculation of the optimal shrinkage intensity. Subsequently, we apply this improved covariance estimator (which has guaranteed minimum mean squared error, is well-conditioned, and is always...
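The basic idea of linear shrinkage can be sketched in a few lines. The example below shrinks the off-diagonal entries of a correlation matrix toward zero, that is, toward an identity target. Both the identity target and the function name `shrink_correlation` are assumptions made for illustration, and the shrinkage intensity `lam` is simply passed in rather than computed analytically as the abstract describes.

```python
def shrink_correlation(r, lam):
    """Shrink a correlation matrix (list of lists) toward the identity matrix.

    lam = 0 returns the input unchanged, lam = 1 returns the identity.
    The intensity lam is assumed to be given; estimating it is not shown.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p = len(r)
    return [
        [1.0 if i == j else (1.0 - lam) * r[i][j] for j in range(p)]
        for i in range(p)
    ]


def det2(m):
    """Determinant of a 2x2 matrix, used here as a crude conditioning check."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


r_hat = [[1.0, 0.999], [0.999, 1.0]]  # nearly singular empirical correlation
r_star = shrink_correlation(r_hat, lam=0.5)

# Shrinking moves the matrix away from singularity.
assert det2(r_star) > det2(r_hat)
```

In this toy case the determinant rises from about 0.002 to about 0.75, which illustrates why a shrunken estimate is better conditioned than the raw one.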
Most analysis programs for inferring molecular phylogenies are difficult to use, in particular for researchers with little programming experience. TREEFINDER is an easy-to-use integrative platform-independent analysis environment for molecular phylogenetics. In this paper the main features of TREEFINDER (version of April 2004) are described. TREEFINDER is written in ANSI C and Java and implements powerful statistical approaches for inferring gene trees and related analyses. In addition, it provides a user-friendly graphical interface and a phylogenetic programming language. TREEFINDER is a versatile framework...
We introduce a graphical method, likelihood-mapping, to visualize the phylogenetic content of a set of aligned sequences. The method is based on an analysis of the maximum likelihoods for the three fully resolved tree topologies that can be computed for four sequences. The three likelihoods are represented as one point inside an equilateral triangle. The triangle is partitioned in different regions. One region represents star-like evolution, three regions represent a well-resolved phylogeny, and three regions reflect the situation where it is difficult to distinguish between two...
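The mapping from three likelihoods to a single point in a triangle can be sketched with barycentric coordinates. In the example below the three log-likelihoods are normalized to weights that sum to one, and the weights are used to place a point inside an equilateral triangle of side length 1. The vertex placement and the function name `likelihood_map_point` are assumptions chosen for illustration, not details taken from the abstract.

```python
import math


def likelihood_map_point(log_l1, log_l2, log_l3):
    """Map three log-likelihoods to weights and a point in a unit triangle.

    Assumed vertex layout: topology 1 at (0, 0), topology 2 at (1, 0),
    topology 3 at (0.5, sqrt(3) / 2).
    """
    logs = (log_l1, log_l2, log_l3)
    m = max(logs)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in logs]
    total = sum(w)
    p1, p2, p3 = (v / total for v in w)
    x = p2 + 0.5 * p3
    y = (math.sqrt(3) / 2) * p3
    return (p1, p2, p3), (x, y)


# Three equally likely topologies land at the centre of the triangle.
weights, point = likelihood_map_point(-1234.5, -1234.5, -1234.5)
```

A quartet with one clearly preferred topology falls near the corresponding vertex, while an unresolved quartet falls near the centre.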
Abstract Motivation: Genetic networks are often described statistically using graphical models (e.g. Bayesian networks). However, inferring the network structure offers a serious challenge in microarray analysis where the sample size is small compared to the number of considered genes. This renders many standard algorithms for graphical models inapplicable, and inferring genetic networks an ‘ill-posed’ inverse problem. Methods: We introduce a novel framework for small-sample inference of graphical models from gene expression data. Specifically, we focus on...
Abstract Summary: False discovery rate (FDR) methodologies are essential in the study of high-dimensional genomic and proteomic data. The R package ‘fdrtool’ facilitates such analyses by offering a comprehensive set of procedures for FDR estimation. Its distinctive features include: (i) many different types of test statistics are allowed as input data, such as P-values, z-scores, correlations and t-scores; (ii) simultaneously, both local FDR and tail area-based FDR values are estimated for all test statistics; and (iii) empirical null models are fit where...
Summary: MALDIquant is an R package providing a complete and modular analysis pipeline for quantitative analysis of mass spectrometry data. MALDIquant is specifically designed with application in clinical diagnostics in mind and implements sophisticated routines for importing raw data, preprocessing, non-linear peak alignment, and calibration. It also handles technical replicates as well as spectra with unequal resolution. Availability: MALDIquant and its associated R packages readBrukerFlexData and readMzXmlData are freely available from the R archive CRAN...
The problem of inferring confidence sets of gene trees is discussed without assuming that the substitution model or the branching pattern of any of the investigated trees is correct. In this case, widely used methods to compare genealogies can give highly contradicting results. Here, three methods to infer confidence sets that are robust against model misspecification are compared, including a new approach based on estimating the confidence in a specific tree using expected–likelihood weights. The power of the investigated methods is studied by analysing HIV–1 and mtDNA sequence data as well as simulated...
False discovery rate (FDR) methods play an important role in analyzing high-dimensional data. There are two types of FDR, tail area-based FDR and local FDR, as well as numerous statistical algorithms for estimating or controlling FDR. These differ in terms of underlying test statistics and procedures employed for statistical learning. A unifying algorithm for simultaneous estimation of both local FDR and tail area-based FDR is presented that can be applied to a diverse range of test statistics, including p-values, correlations, z- and t-scores. This approach is semiparametric...
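As a point of reference for tail area-based FDR, the sketch below implements the classical Benjamini-Hochberg adjustment of p-values. This is a standard procedure swapped in for illustration; it is not the unifying algorithm of the abstract, it assumes a theoretical rather than empirical null, and it does not estimate local FDR. The function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    For the i-th smallest of m p-values the raw value is p * m / i; a running
    minimum taken from the largest p-value downwards makes the result monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Rejecting all hypotheses whose adjusted value is at most 0.05 controls the tail area-based FDR at 5% under the usual independence assumptions.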
The use of correlation networks is widespread in the analysis of gene expression and proteomics data, even though it is known that correlations not only confound direct and indirect associations but also provide no means to distinguish between cause and effect. For "causal" analysis typically the inference of a directed graphical model is required. However, this is rather difficult due to the curse of dimensionality. We propose a simple heuristic for the statistical learning of a high-dimensional "causal" network. The method first converts a correlation network into a partial...
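Why correlations confound direct and indirect associations can be seen in the three-variable case. The sketch below uses the textbook formula for the first-order partial correlation of X and Y given Z; the function name `partial_correlation` and the numeric values are illustrative assumptions, and the high-dimensional procedure of the abstract is not reproduced.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y after removing the linear effect of Z."""
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("Z is perfectly correlated with X or Y")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical chain X - Z - Y: X and Y are related only through Z,
# so their marginal correlation equals the product of the two links.
r_xz, r_yz = 0.8, 0.8
r_xy = r_xz * r_yz  # marginal correlation of 0.64 despite no direct link
rho = partial_correlation(r_xy, r_xz, r_yz)
```

Here the marginal correlation of 0.64 would draw an edge between X and Y in a correlation network, while the partial correlation is zero and correctly indicates that the association is indirect.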
Analysis of microarray and other high-throughput data on the basis of gene sets, rather than individual genes, is becoming more important in genomic studies. Correspondingly, a large number of statistical approaches for detecting gene set enrichment have been proposed, but both the interrelations and the relative performance of the various methods are still very much unclear. We conduct an extensive survey of statistical approaches for gene set analysis and identify a common modular structure underlying most published methods. Based on this finding we propose a general...
Whitening, or sphering, is a common preprocessing step in statistical analysis to transform random variables to orthogonality. However, due to rotational freedom there are infinitely many possible whitening procedures. Consequently, there is a diverse range of sphering methods in use, for example based on principal component analysis (PCA), Cholesky matrix decomposition and zero-phase component analysis (ZCA), among others. Here we provide an overview of the underlying theory and discuss five natural whitening procedures. Subsequently, we demonstrate that investigating...
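The rotational freedom mentioned above can be checked numerically for a 2x2 covariance matrix. The sketch below builds one Cholesky-based whitening matrix W (the inverse of the lower-triangular factor of the covariance, which is an assumed convention and may differ from the one used in the paper) and then shows that any rotation of W is again a whitening matrix. All function names and the example covariance are hypothetical.

```python
import math


def cholesky_whitening_2x2(s):
    """Return W = L^{-1}, where s = L L^T, for a 2x2 covariance matrix s."""
    a, b, c = s[0][0], s[0][1], s[1][1]
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)
    return [[1.0 / l11, 0.0], [-l21 / (l11 * l22), 1.0 / l22]]


def matmul(x, y):
    """Product of two 2x2 matrices given as lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(x):
    return [[x[j][i] for j in range(2)] for i in range(2)]


def whitened_cov(w, s):
    """Covariance of the transformed variables, W S W^T."""
    return matmul(matmul(w, s), transpose(w))


def is_identity(m, tol=1e-12):
    return all(abs(m[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(2) for j in range(2))


sigma = [[4.0, 2.0], [2.0, 3.0]]  # hypothetical covariance matrix
w = cholesky_whitening_2x2(sigma)

theta = 0.7  # arbitrary rotation angle
q = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
w_rotated = matmul(q, w)  # a different, equally valid whitening matrix

assert is_identity(whitened_cov(w, sigma))
assert is_identity(whitened_cov(w_rotated, sigma))
```

Since every angle yields a valid whitening matrix, an additional criterion is needed to single out one procedure, which is the question the abstract goes on to address.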
We present an intuitive visual framework, the generalized skyline plot, to explore the demographic history of sampled DNA sequences. This approach is based on a genealogy inferred from the sequences and provides a nonparametric estimate of effective population size through time. In contrast to previous related procedures, the generalized skyline plot is more applicable to cases where the underlying tree is not fully resolved and the data is not highly variable. This is achieved by grouping adjacent coalescent intervals. We employ a small-sample Akaike information...
Abstract Motivation: Microarray experiments are now routinely used to collect large-scale time series data, for example to monitor gene expression during the cell cycle. Statistical analysis of this data poses many challenges, one being that it is hard to identify correctly the subset of genes with a clear periodic signature. This has led to a controversial argument with regard to the suitability of both available methods and current microarray data. Methods: We introduce two simple but efficient statistical signal...
University of New South Wales) for bringing the following error in Equation (5) to our attention: for sample size N even, the intensity of the peak belonging to the Fourier frequency π must not be included in the calculation of the g-statistic. Correspondingly, the summation in the denominator and the maximization in the numerator of Equation (5) run through the indices k = 1 to k = [(N - 1)/2], rather than to k = [N/2] as stated in the paper. Accordingly, we have corrected the implementation of the algorithm in the R package 'GeneCycle' (versions 1.0.4 and later), which is available from...
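The corrected index range can be illustrated with a small pure-Python computation of the g-statistic as the largest periodogram intensity divided by the sum of intensities, taken over k = 1 to [(N - 1)/2]. This is a minimal sketch and not the 'GeneCycle' implementation; the function names and the periodogram normalization (which cancels in the ratio) are assumptions.

```python
import cmath
import math


def periodogram(x):
    """Periodogram intensities at Fourier frequencies 2*pi*k/N, k = 1..(N-1)//2.

    For even N this range excludes k = N/2, i.e. the frequency pi.
    """
    n = len(x)
    kmax = (n - 1) // 2
    out = []
    for k in range(1, kmax + 1):
        omega = 2.0 * math.pi * k / n
        s = sum(x_t * cmath.exp(-1j * omega * t) for t, x_t in enumerate(x))
        out.append(abs(s) ** 2 / n)
    return out


def fisher_g(x):
    """g-statistic: maximum periodogram intensity over the sum of intensities."""
    intensities = periodogram(x)
    total = sum(intensities)
    if total == 0.0:
        raise ValueError("series has no power at the included frequencies")
    return max(intensities) / total


n = 8
# Pure cosine at Fourier index k = 2.
signal = [math.cos(2.0 * math.pi * 2 * t / n) for t in range(n)]
# Same series plus a component at frequency pi (alternating +1, -1).
with_pi = [s + (-1.0) ** t for t, s in enumerate(signal)]

g_signal = fisher_g(signal)
g_with_pi = fisher_g(with_pi)
```

For N = 8 only k = 1, 2, 3 enter the statistic, so adding a component at frequency π leaves g essentially unchanged at a value close to 1 in this example.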
High-dimensional case-control analysis is encountered in many different settings in genomics. In order to rank genes accordingly, scores have been proposed, ranging from ad hoc modifications of the ordinary t statistic to complicated hierarchical Bayesian models. Here, we introduce the shrinkage t statistic that is based on a novel and model-free shrinkage estimate of the variance vector across genes. This is derived in a quasi-empirical Bayes setting. The new score is fully automatic and requires no specification of parameters or...
Causal networks based on the vector autoregressive (VAR) process are a promising statistical tool for modeling regulatory interactions in a cell. However, learning these networks is challenging due to the low sample size and high dimensionality of genomic data. We present a novel and highly efficient approach to estimate a VAR network. This proceeds in two steps: (i) improved estimation of VAR regression coefficients using an analytic shrinkage approach, and (ii) subsequent model selection by testing the associated partial...
We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James–Stein shrinkage estimates of correlations and...
Attempts to estimate the time of origin of human immunodeficiency virus (HIV)-1 by using phylogenetic analysis are seriously flawed because of the unequal evolutionary rates among different viral lineages. Here, we report a new method of molecular clock analysis, called Site Stripping for Clock Detection (SSCD), which allows selection of nucleotide sites evolving at an equal rate in different lineages. The method was validated on a dataset of patients all infected with hepatitis C virus in 1977 by the same donor, and it was able to date exactly the known infection....
Abstract Summary: Phylogenetic Analysis Library (PAL) is a collection of Java classes for use in molecular evolution and phylogenetics. PAL provides a modular environment for the rapid construction of both special-purpose and general analysis programs. PAL version 1.1 consists of 145 public classes or interfaces in 13 packages, including classes for models of character evolution, maximum-likelihood estimation, and the coalescent, with a total of more than 27000 lines of code. The PAL project is set up as a collaborative project to facilitate contributions from other...