Nonlinear Dynamics of Natural Systems

NDNS+

STATISTICS FOR LIFE SCIENCES

Complex Data & High Dimensional Inference

PROGRAMME

November 23, 24 and 25

Het Kasteel,

Groningen, The Netherlands

Monday, 23 November
12:00 — 13:00 Short Course (Simon Wood: Generalized Additive Models)
13:00 — 13:15 Break
13:15 — 14:15 Short Course
14:15 — 14:30 Break
14:30 — 15:30 Short Course
16:00 — 17:00 Keynote: Peter Diggle
17:00 — 18:00 Reception

Tuesday, 24 November
Topic: Penalized smoothing
09:00 — 09:45 Goeran Kauermann
09:45 — 10:30 Paul Eilers
10:30 — 11:00 Break
Topic: Penalized inference
11:00 — 11:45 Cajo ter Braak
11:45 — 12:30 Keith Baggerly
12:30 — 13:30 Lunch
Topic: Network Inference
13:30 — 14:15 Tom Snijders
14:15 — 15:00 Ritsert Jansen
15:00 — 15:30 Break
Topic: Machine Learning
15:30 — 16:15 John Nerbonne
16:15 — 17:00 Michael Biehl
Conference Dinner

Wednesday, 25 November
Topic: p >> n
08:45 — 09:30 Hans van Houwelingen
09:30 — 10:15 Luigi Augugliaro
10:15 — 10:30 Break
Topic: Complexity & Pharmacy
10:30 — 11:15 Siem Heisterkamp
11:15 — 11:30 Break
11:30 — 12:30 Keynote: Stephen Senn
12:30 — 13:30 Lunch + Departure

PROGRAMME AND ABSTRACTS

Monday 23 November 2009

12:00 — 15:30 SHORT COURSE

Simon Wood, University of Bath

COMPLEX MODELLING USING GENERALIZED ADDITIVE MODELS

  • Some examples using GAMs.
  • GAM theory in half an hour: penalized GLMs with functions in the linear predictor; basis expansions and wiggliness; penalized likelihood, mixed models and Bayes; inference; smoothness selection.
  • What smooth to use? Thin plate splines, p-splines, tensor product smoothing, adaptive smoothing, cyclic smooths.
  • GAMs in R/mgcv: model fitting, model checking, model selection, visualization, prediction. More than just GAMs: GAMMs, varying coefficients, signal regression etc.
  • More examples.
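
As an illustration of the "wiggliness penalty" idea listed above, here is a minimal sketch in plain Python. It is not the mgcv implementation and not course material: it is a Whittaker-type smoother that minimises the sum of squared residuals plus `lam` times the sum of squared second differences of the fitted values. The function names `solve` and `whittaker_smooth` are hypothetical, chosen for this example only.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def whittaker_smooth(y, lam):
    """Minimise sum (y_i - z_i)^2 + lam * sum (z_i - 2 z_{i+1} + z_{i+2})^2 over z."""
    n = len(y)
    # Normal equations: (I + lam * D'D) z = y, with D the second-difference matrix.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    stencil = (1.0, -2.0, 1.0)
    for r in range(n - 2):
        for p, dp in enumerate(stencil):
            for q, dq in enumerate(stencil):
                a[r + p][r + q] += lam * dp * dq
    return solve(a, [float(v) for v in y])


noisy = [0.0, 1.2, 0.9, 2.1, 1.8, 3.2, 2.9]
smooth = whittaker_smooth(noisy, 5.0)  # larger lam gives a smoother curve
```

With `lam = 0` the data are returned unchanged, and a straight line is left untouched for any `lam` because its second differences are zero. Choosing `lam` from the data is the smoothness selection problem mentioned in the course outline.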

Participants are asked to bring along a laptop with the open source software R (http://cran.r-project.org, version later than 2.9.0) installed. If you cannot bring a laptop but do want to attend the short course, please contact Ineke Schelhaas (k.m.e.schelhaas@rug.nl).

16:00 — 17:00 KEYNOTE SPEECH

Opening of the conference by Professor F. Zwarts, Vice-Chancellor of the University of Groningen.

Peter J. Diggle, Lancaster University

GEOSTATISTICAL INFERENCE UNDER PREFERENTIAL SAMPLING

Geostatistics involves the fitting of spatially continuous models to spatially discrete data (Chiles and Delfiner, 1999). Preferential sampling arises when the process that determines the data-locations and the process being modelled are stochastically dependent. Conventional geostatistical methods assume, if only implicitly, that sampling is non-preferential. However, these methods are often used in situations where sampling is likely to be preferential. For example, in mineral exploration samples may be concentrated in areas thought likely to yield high-grade ore.

In this talk I will present a model for preferential sampling, and demonstrate through simulated examples that ignoring preferential sampling can lead to seriously misleading inferences. I will then describe an application of the model to a set of bio-monitoring data from Galicia, northern Spain, in which making allowance for preferential sampling materially changes the results of the analysis. The talk is based on joint work with Raquel Menezes and Ting-Li Su (Diggle, Menezes and Su, 2009).

Chiles, J-P and Delfiner, P. (1999). Geostatistics. New York: Wiley.

Diggle, P.J., Menezes, R. and Su, T-L. (2009). Geostatistical inference under preferential sampling (with discussion). Applied Statistics (to appear).

17:00 — 18:00

Opening reception

Tuesday 24 November 2009

9:00 — 9:45

Goeran Kauermann, University of Bielefeld

PENALIZED SPLINE SMOOTHING, MIXED MODELS AND BAYESIAN STATISTICS — THREE SUCCESSFUL PLAYERS IN A LIAISON

Recent years have seen an impressive amount of work in the field of penalized spline smoothing. Originating from the idea of Eilers & Marx (Statistical Science, 1996), penalized spline smoothing was brought together with mixed models by considering the penalty as an a priori normal distribution. The book by Ruppert, Wand & Carroll (Semiparametric Regression, 2003) initiated a wave of papers picking up the idea and extending it in various aspects.

We give a short overview of the field and concentrate on some new ideas. In fact, the penalty approach can be extended by imposing a prior on the regression parameters as well, which leads towards a Bayesian formulation of the model, but still within the framework of linear mixed models. We show how this extension works well in practice and can be used for model selection and model evaluation.

9:45 — 10:30

Paul Eilers, Erasmus Medical Center

EFFICIENT COMPUTATION WITH DATA GRIDS

Fitting statistical models to large data sets can take a lot of time and may need a lot of memory. In a number of applications we can reduce both by orders of magnitude if we introduce a fine multidimensional grid and collect summaries (counts, sums, sums of squares) in its cells. I will illustrate this for trend estimation in scatterplots and estimation of parametric mixtures of regression models.

In recent years, very efficient algorithms for smoothing data on (incomplete) grids with tensor-product splines have been developed. They are attractive for estimation and correction of trends that occur on microarray surfaces, for baseline correction of gels, and for multidimensional density estimation. An important application is improved genotyping of SNPs, using log-concave non-parametric mixtures. I will sketch the principles behind the algorithms, avoiding overly technical details, and show a number of applications.
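
The binning step described in this abstract can be sketched as follows. This is a minimal illustration in plain Python, assuming a one-dimensional grid on x for a scatterplot of (x, y) pairs with all x inside the stated range. The names `grid_summaries` and `cell_means` are hypothetical and the code is not taken from the speaker's software.

```python
from collections import defaultdict


def grid_summaries(xs, ys, x_min, x_max, n_cells):
    """Collect count, sum(y) and sum(y^2) in each cell of a regular grid on x.

    Assumes x_min <= x <= x_max for every x.
    """
    width = (x_max - x_min) / n_cells
    cells = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum(y), sum(y^2)
    for x, y in zip(xs, ys):
        k = min(int((x - x_min) / width), n_cells - 1)  # x == x_max goes in last cell
        cell = cells[k]
        cell[0] += 1
        cell[1] += y
        cell[2] += y * y
    return dict(cells)


def cell_means(cells):
    """Per-cell mean of y, computed from the stored summaries only."""
    return {k: total / count for k, (count, total, _) in cells.items()}


xs = [0.1, 0.2, 0.6, 0.9, 1.0]
ys = [1, 3, 2, 4, 6]
cells = grid_summaries(xs, ys, 0.0, 1.0, 2)
means = cell_means(cells)
```

After this single pass, any later model fitting works on the cell summaries rather than on the raw observations, so its cost depends on the number of cells and not on the number of data points.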

10:30 — 11:00

Break

11:00 — 11:45

Cajo ter Braak, Wageningen University

IDENTITY BY DESCENT PROBABILITY MATRIX DECOMPOSITION BY A LATENT ANCESTOR ALLELE MODEL AND ITS APPLICATION IN QTL ANALYSIS

Elements of IBD probability matrices measure the probability that two individuals share the same allele of a common ancestor. The ancestral alleles and their number remain implicit. For human inspection and QTL analysis, an explicit representation in terms of ancestral allele origin and number of alleles may be desirable. For this purpose, we decompose the IBD matrix by a latent class model with K classes (latent ancestor alleles). We provide an efficient algorithm to fit the model. The algorithm correctly reconstructed the ancestry of 16 maize inbreds from their IBD matrix only. We also show that the model can help QTL detection from connected crosses. This is joint work with Martin Boer, Radu Totir, Chris Winkler, Howie Smith and Marco Bink.

11:45 — 12:30

Keith Baggerly, University of Texas

THE IMPORTANCE OF REPRODUCIBILITY IN HIGH-THROUGHPUT BIOLOGY: A CASE STUDY

Reproducibility is such a hallmark of modern science that it is often taken as assumed. However, reproducing results from many high-throughput biological studies often involves a rather surprising level of detail. To illustrate this point, we present a case study in predicting patient response to chemotherapy. Several recent papers have claimed that array-based "signatures" of response to specific drugs can be derived from cell line data (the NCI60 panel in particular) and used to predict patient response. Successful application of such tests could greatly improve cancer care, as many patients do not respond to front-line therapies but might respond to others. Replicating these results, however, involves an exercise in "forensic bioinformatics", as many of the precise methods used need to be inferred from context. In addition to describing our analysis of the data, we will discuss some of the steps we now employ (in particular, the use of Sweave) to ensure greater reproducibility at our own institution.

12:30 — 13:30

Lunch

13:30 — 14:15

Tom Snijders, University of Oxford, University of Groningen

STATISTICAL METHODS FOR SOCIAL NETWORK DYNAMICS

Social networks are relational structures between social actors, represented usually by directed graphs (digraphs). Nodes in the graph represent social actors, while arcs (directed lines) represent the ties between them. The dependence structures between arcs, e.g. tendencies to transitivity of ties, lead to complications in modelling such data.

A family of probability models will be presented for longitudinal social network data: so-called actor-oriented models, which can be motivated by assumptions about actors trying to optimize their situation with very limited foresight. Such assumptions often make sense as a simple approximation to social science ideas about network dynamics. These models are very flexible, which is important for their use in statistical inference. Two methods of estimation will be discussed. The first is the method of moments, the second the maximum likelihood method. Both can be implemented by Markov chain Monte Carlo methods.

14:15 — 15:00

Ritsert Jansen, University of Groningen

FROM MULTIFACTORIAL GENETIC PERTURBATION TO DEFINING CAUSAL GENETIC NETWORKS

Genetically different individuals can exhibit large quantitative trait variation. Such variation stems, at least partly, from variations in the DNA. Modern sequencing technologies can reveal the variations in the genome, and genome-wide linkage (GWL) analysis and genome-wide association (GWA) analysis can then link or associate them to variations in the trait of interest. These strategies are increasingly applied to a growing number of organisms, including human, mouse, rat, cattle, pigs, A. thaliana, tomato, corn, yeast, C. elegans, and D. melanogaster, and have pinpointed many quantitative trait loci (QTL) on the genome.

To lift the veil that covers the genome-to-phenotype relation we may need to monitor the whole trajectory of intermediate biomolecular phenotypes. Today's molecular technologies, particularly microarray and deep sequencing for epigenome and transcriptome, and high resolution mass spectrometry and nuclear magnetic resonance for proteomics and metabolomics, have reached a cost-efficiency level allowing for comprehensive molecular profiling of many samples at multiple biomolecular levels. We here discuss the promises, statistical methods, results and pitfalls for QTL analysis, network reconstruction and causal inference using system-wide data on chromatin modification (epiQTL), gene expression (eQTL), proteins (pQTL), metabolites (mQTL) and classical phenotypes (phQTL) from studies on human, plant and shorebird.

15:00 — 15:30

Break

15:30 — 16:15

John Nerbonne, University of Groningen

ANALYSING LANGUAGE VARIATION: HIGH DIMENSIONAL, GEOGRAPHICALLY STRUCTURED DATA

Typical data in the study of language variation form a table with sample dialect sites (villages) in one dimension and language items (words, pronunciations or phrasal constructions) in the other. In socially motivated studies, social varieties (combinations of class, sex/gender, and/or age) may replace geographically defined samples. The problem is then to identify the geographical structure in the data as well as its linguistic expression.

We illustrate several techniques we have applied to analyse this data, including some alignment techniques from machine learning (paired Hidden Markov Models), and a clustering technique taken from bio-informatics.

16:15 — 17:00

Michael Biehl, University of Groningen

ADAPTIVE METRICS IN DISTANCE BASED CLASSIFICATION; APPLICATIONS IN BIOLOGY AND MEDICINE

An introduction to distance based classification of multi-dimensional data is given. In the popular Learning Vector Quantization (LVQ), typical representatives of the classes (prototypes) are determined from labelled example data in a training process. In the working phase, the prototypes parameterize a classification scheme which can be applied to novel, unlabelled data.

A key issue in this family of algorithms is the choice of a suitable similarity or distance measure in order to facilitate good classification performance. So-called Relevance Learning schemes address this problem by employing adaptive distance measures which are also determined in the training phase. The recently introduced Matrix Relevance LVQ is discussed in greater detail.

Recent application examples illustrate the method in the context of classification problems in biomedical data sets. For further information and references, see http://www.cs.rug.nl/~biehl
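
The prototype-based scheme outlined in this abstract can be sketched in a few lines. The example below is a minimal illustration of basic LVQ1 with a fixed squared Euclidean distance, assuming a constant learning rate and a fixed presentation order. It does not implement Relevance Learning or Matrix Relevance LVQ, and the names `train_lvq1` and `classify` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(prototypes, x):
    """Index of the prototype closest to x."""
    return min(range(len(prototypes)), key=lambda k: sq_dist(prototypes[k][0], x))


def train_lvq1(data, prototypes, rate=0.1, epochs=20):
    """Training phase. data and prototypes are lists of (vector, label) pairs."""
    protos = [(list(w), c) for w, c in prototypes]  # copy so inputs are not modified
    for _ in range(epochs):
        for x, label in data:
            w, c = protos[nearest(protos, x)]
            sign = 1.0 if c == label else -1.0  # attract if labels agree, else repel
            for i in range(len(w)):
                w[i] += sign * rate * (x[i] - w[i])
    return protos


def classify(protos, x):
    """Working phase: assign the label of the nearest prototype."""
    return protos[nearest(protos, x)][1]


data = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
        ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
protos = train_lvq1(data, [((1, 1), "a"), ((4, 4), "b")])
label = classify(protos, (0.5, 0.5))
```

In this sketch the distance in `sq_dist` is fixed in advance. Relevance learning schemes, as described in the abstract, instead adapt the distance measure during training.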

19:00 — 22:00

Conference dinner, Humphrey's, Vismarkt 42, Groningen

Wednesday 25 November 2009

8:45 — 9:30

Hans van Houwelingen, Leiden University

FITTING SURVIVAL MODELS WITH p>>n PREDICTORS: BEYOND PROPORTIONAL HAZARDS

In our 2006 paper [1] we employed partial likelihood ridge regression, combining penalization with cross-validation, for the prediction of breast cancer survival from gene expression data. A comparative study by Bovelstad et al. [2] showed that our approach is the most successful for this kind of data. However, the proportional hazards model used in these studies is quite simple and might not be realistic if there is a long survival follow-up. Exploring the fit of the model by using a cross-validated prognostic index leads to the conclusion that the effect of the predictor derived in [1] is neither linear nor constant over time.

We will review partial likelihood ridge regression for survival data and will discuss ways of fine-tuning the model. To handle non-proportional hazards, an approach partly based on the reduced rank model of [3] can be employed, while nonlinear effects can be introduced by means of bilinear terms.

[1] van Houwelingen, HC; Bruinsma, T; Hart, AAM; et al. Cross-validated Cox regression on microarray gene expression data STATISTICS IN MEDICINE, 25 (18): 3201-3216 SEP 30 2006

[2] Bovelstad, HM; Nygard, S; Storvold, HL; et al. Predicting survival from microarray data - a comparative study BIOINFORMATICS, 23 (16): 2080-2087 AUG 15 2007

[3] Perperoglou, A; le Cessie, S; van Houwelingen, HC Reduced-rank hazard regression for modeling non-proportional hazards STATISTICS IN MEDICINE, 25 (16): 2831-2845 AUG 30 2006

9:30 — 10:15

Luigi Augugliaro, University of Palermo

A DIFFERENTIAL GEOMETRIC APPROACH TO IDENTIFY IMPORTANT VARIABLES IN GLMs WHEN p >> n

Ultra-high dimensional variable selection plays an important role in regression models applied in many areas of modern scientific research such as genomics, astronomy and remote sensing. For these kinds of problems the number of variables, p, can be much larger than the sample size, n. Many variable selection techniques for high dimensional statistical models are based on a penalized likelihood approach. The LASSO estimator proposed by Tibshirani [5], the SCAD method [2] and the L1-regularization path following algorithm for generalized linear models proposed by Park and Hastie [4] are only some of the most popular methods used to select relevant variables in a generalized linear model. In a recent paper, Fan and Lv [3] proposed a sure independence screening (SIS) method to select relevant variables in a linear regression model defined on an ultrahigh dimensional feature space. The SIS method is based on the geometrical theory underlying the linear regression model. This observation suggests that a genuine generalization of the SIS method for generalized linear models could be founded on an adequate generalization of the geometric interpretation of a linear regression model. Based on this idea, we propose a new statistical method to select relevant variables in a generalized linear model defined on an ultrahigh dimensional feature space, which is based on the differential geometrical theory underlying the dgLARS algorithm proposed by Augugliaro and Wit [1].

[1] L. Augugliaro and E. Wit. Generalizing LARS algorithm using differential geometry. Seventh Scientific Meeting of the CLAssification and Data Analysis Group of the Italian Statistical Society, 2009.

[2] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348-1359, 2001.

[3] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B, 70:849-911, 2008.

[4] M. Y. Park and T. Hastie. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B, 69:659-677, 2007.

[5] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.
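
For orientation, the screening step for the linear model referred to in the abstract above can be sketched as follows. This is a minimal illustration in plain Python of marginal correlation screening in the spirit of SIS [3]: each variable is ranked by the absolute value of its sample correlation with the response and the top d are kept. It is not the differential geometric generalization proposed in the talk. The function names are hypothetical, and the sketch assumes that no column is constant.

```python
from math import sqrt


def correlation(u, v):
    """Sample Pearson correlation of two equally long sequences (non-constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def sis(columns, y, d):
    """Indices of the d columns with the largest absolute marginal correlation with y."""
    scores = [abs(correlation(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


columns = [
    [1, 2, 3, 4, 5],    # strongly related to y
    [2, 1, 2, 1, 2],    # almost unrelated
    [5, 4, 3, 2, 1],    # strongly (negatively) related
    [1, -1, 0, 1, -1],  # weakly related
]
y = [1.1, 2.0, 2.9, 4.2, 5.0]
kept = sis(columns, y, 2)
```

A penalized method such as the LASSO would then typically be applied to the retained variables only, which is what makes screening attractive when p is much larger than n.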

10:15 — 10:30

Break

10:30 — 11:15

Siem Heisterkamp, Schering-Plough, University of Groningen

PHARMACOGENETIC ANALYSIS IN CLINICAL TRIALS USING ELASTIC NET

We performed a pharmacogenetic study using data from a series of multi-centered clinical trials. Approximately 1100 patients from these trials consented to our study. About 7800 SNPs had been handpicked from regions around and within a number of candidate genes. The phenotype - response to drugs used in treatment of a certain disease - is not very well defined. The patients received either placebo, the proprietary compound or one of its competitors. It is not uncommon in this kind of trial that only a relatively small proportion of patients respond favourably to a specific drug. The question is this: can one characterize by the genetic make-up of patients a subgroup which responds in favour of a specific drug more often? In this paper we will address some issues in answering this question. Zou and Hastie [1] and Park and Hastie [2] proposed a regression algorithm for selection of variables in (generalised) linear models: the Elastic Net (EN). In a simulation study (submitted) we found that the selection of 'true' markers depends strongly on pre-selection of the markers. E.g. one could use either all or the 1000 most 'significant' markers. Overlap between selected markers from an EN-regression using different pre-selections based on statistical criteria alone was disappointing. To overcome this problem we pre-selected SNPs in a biologically more meaningful manner. We propose a two-step EN-regression. Firstly we apply EN for each of the genes separately to select SNPs for each of the genes, yielding a single score for each gene. In the second step the latter are used as regressors in EN to select genes. This step bears some resemblance to generalized additive models [3] for which - to the best of our knowledge - no EN-like algorithm yet exists.

[1] Zou, H., Hastie, T. (2005) Regularization and variable selection via the elastic net. J. R. Statist. Soc. B, 67, Part 2, pp. 301-320

[2] Park, M. Y., Hastie, T. (2007) L1-regularization path algorithm for generalized linear models, J. R. Statist. Soc. B, 69, Part 4, 659-677.

[3] Wood, S.N. Generalized Additive Models, an introduction with R. 2006, Chapman & Hall
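
To make the Elastic Net penalty concrete, here is a minimal sketch in plain Python of coordinate descent for a linear model without intercept (centred data assumed). It minimises (1/2n)||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2). This parameterisation is an assumption made for the example. The code is not the algorithm of [1] or [2], nor the two-step procedure of the abstract, and the function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0.0 inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(columns, y, lam, alpha, sweeps=200, tol=1e-10):
    """Coordinate descent; columns is a list of predictor columns of length n."""
    n, p = len(y), len(columns)
    beta = [0.0] * p
    resid = [float(v) for v in y]  # residual y - X beta, with beta = 0 initially
    for _ in range(sweeps):
        biggest = 0.0
        for j, col in enumerate(columns):
            norm = sum(x * x for x in col) / n
            # inner product of column j with the partial residual (column j added back)
            rho = sum(x * r for x, r in zip(col, resid)) / n + norm * beta[j]
            new = soft_threshold(rho, lam * alpha) / (norm + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * x for r, x in zip(resid, col)]
                beta[j] = new
            biggest = max(biggest, abs(delta))
        if biggest < tol:  # stop once a full sweep changes nothing noticeably
            break
    return beta


x1 = [1, -1, 1, -1]
x2 = [1, 1, -1, -1]
y = [2.1, -1.9, 1.9, -2.1]  # equals 2 * x1 + 0.1 * x2
beta = elastic_net([x1, x2], y, lam=0.5, alpha=0.5)
```

In this small orthogonal example the weak second predictor is set exactly to zero while the first is shrunk. This illustrates how the L1 part of the penalty selects variables and the L2 part shrinks the ones that remain.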

11:15 — 11:30

Break

11:30 — 12:30 KEYNOTE SPEECH

Stephen Senn, University of Glasgow

'THERE'S LESS TO THIS THAN MEETS THE EYE' — A SCEPTICAL LOOK AT PERSONALISED MEDICINE

In the medical literature you will come across astonishing claims as to what proportion of patients respond to treatment. Not astonishing because the proportion is low but astonishing because the certainty with which the proportion is named is so high. The truth is otherwise. For most diseases, we simply do not know what proportion of patients respond to treatment because the pharmaceutical industry has hardly ever run the sort of trial that would enable them to identify who does and does not respond to treatment. Rather than encouraging such trials to be run, the European Medicines Agency instead has issued umpteen 'points to consider' documents inviting sponsors to engage in 'responder analysis', a futile and arbitrary dichotomisation of outcomes which, despite the name, does nothing to identify response and merely drives up sample size.

In this talk, I shall show that identifying response is not 'rocket science'. It involves repeated measures designs, mixed models and a little thought. Quite why the biostatistical community has not succeeded in letting their medical colleagues in on the secret is a mystery on which I will also freely speculate.

I conclude that the goal of personalised medicine is probably further away now than it was ten years ago.

12:30 — 13:30

Lunch and departure