STATISTICS FOR LIFE SCIENCES
Complex Data & High Dimensional Inference
PROGRAMME
November 23, 24 and 25
Het Kasteel,
Groningen, The Netherlands
PROGRAMME AND ABSTRACTS
Monday 23 November 2009
12:00 — 15:30 SHORT COURSE
Simon Wood, University of Bath
COMPLEX MODELLING USING GENERALIZED ADDITIVE MODELS
- Some examples using GAMs.
- GAM theory in half an hour: penalized GLMs with functions in the linear
predictor; basis expansions and wiggliness; penalized likelihood, mixed models
and Bayes; inference; smoothness selection.
- What smooth to use? Thin plate splines, p-splines, tensor product
smoothing, adaptive smoothing, cyclic smooths.
- GAMs in R/mgcv: model fitting, model checking, model selection,
visualization, prediction. More than just GAMs: GAMMs, varying coefficients,
signal regression etc.
- More examples.
Participants are asked to bring along a laptop, with
the open source software R (http://cran.r-project.org, version later than 2.9.0)
installed. In case you can't bring a laptop along, but do want to attend the
short course, please contact Ineke Schelhaas
(k.m.e.schelhaas@rug.nl).
16:00 — 17:00 KEYNOTE SPEECH
Opening of the conference by Professor F. Zwarts,
Vice-Chancellor of the University of Groningen.
Peter J. Diggle, Lancaster University
GEOSTATISTICAL INFERENCE UNDER PREFERENTIAL SAMPLING
Geostatistics involves the fitting of spatially continuous
models to spatially discrete data (Chiles and Delfiner, 1999). Preferential
sampling arises when the process that determines the data-locations and the
process being modelled are stochastically dependent. Conventional geostatistical
methods assume, if only implicitly, that sampling is non-preferential. However,
these methods are often used in situations where sampling is likely to be
preferential. For example, in mineral exploration samples may be concentrated in
areas thought likely to yield high-grade ore.
In this talk I will present a model for preferential
sampling, and demonstrate through simulated examples that ignoring preferential
sampling can lead to seriously misleading inferences. I will then describe an
application of the model to a set of bio-monitoring data from Galicia, northern
Spain, in which making allowance for preferential sampling materially changes
the results of the analysis. The talk is based on joint work with Raquel Menezes
and Ting-Li Su (Diggle, Menezes and Su, 2009).
Chiles, J-P and Delfiner, P. (1999). Geostatistics. New
York: Wiley.
Diggle, P.J., Menezes, R. and Su,
T-L. (2009). Geostatistical inference under preferential sampling (with
discussion). Applied Statistics (to appear).
17:00 — 18:00
Opening reception
Tuesday 24 November, 2009
9:00 — 9:45
Goeran Kauermann, University of Bielefeld
PENALIZED SPLINE SMOOTHING, MIXED MODELS AND BAYESIAN
STATISTICS — THREE SUCCESSFUL PLAYERS IN A LIAISON
Recent years have seen an impressive amount of work in
the field of penalized spline smoothing. Originating from the idea of Eilers
& Marx (Statistical Science, 1996), penalized spline smoothing was brought
together with mixed models by treating the penalty as arising from a normal
prior distribution on the spline coefficients. The book by Ruppert, Wand & Carroll (Semiparametric
Regression, 2003) initiated a wave of papers picking up the idea and extending
it in various aspects.
We give a short overview of the field and concentrate on
some new ideas. In fact, the penalty approach can be extended by imposing a
prior on the regression parameters as well, which leads towards a Bayesian
formulation of the model, but still within the framework of linear mixed models.
We show how this extension works well in practice and can be used for model
selection and model evaluation.
9:45 — 10:30
Paul Eilers, Erasmus Medical Center
EFFICIENT COMPUTATION WITH DATA GRIDS
Fitting statistical models to large data sets can take a
lot of time and may need a lot of memory. In a number of applications we can
reduce both by orders of magnitude if we introduce a fine multidimensional grid
and collect summaries (counts, sums, sums of squares) in its cells. I will
illustrate this for trend estimation in scatterplots and estimation of
parametric mixtures of regression models.
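The idea of collecting summaries in grid cells can be sketched in a few lines. The example below is only an illustration under simplifying assumptions, not the speaker's implementation: it uses a one-dimensional grid of equal-width cells, assumes all x values lie in [x_min, x_max], and the function names `grid_summaries` and `cell_means` are made up for this sketch.

```python
from collections import defaultdict


def grid_summaries(xs, ys, x_min, x_max, n_cells):
    """Collect count, sum and sum of squares of y in equal-width cells of x."""
    width = (x_max - x_min) / n_cells
    # cell index -> [count, sum of y, sum of y squared]
    cells = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y in zip(xs, ys):
        # Clamp so that x == x_max falls into the last cell.
        idx = min(int((x - x_min) / width), n_cells - 1)
        cell = cells[idx]
        cell[0] += 1
        cell[1] += y
        cell[2] += y * y
    return dict(cells)


def cell_means(summaries):
    """Per-cell mean of y: a crude trend estimate computed from the summaries alone."""
    return {idx: s / n for idx, (n, s, _) in sorted(summaries.items())}


xs = [0.1, 0.2, 0.6, 0.7, 0.9]
ys = [1.0, 3.0, 2.0, 4.0, 6.0]
summaries = grid_summaries(xs, ys, 0.0, 1.0, 2)
print(cell_means(summaries))  # {0: 2.0, 1: 4.0}
```

Once the summaries are stored, later computations work with the number of cells rather than the number of observations, which is where the savings in time and memory come from.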
In recent years, very efficient algorithms for smoothing
data on (incomplete) grids with tensor-product splines have been developed. They
are attractive for estimation and correction of trends that occur on microarray
surfaces, for baseline correction of gels, and for multidimensional density
estimation. An important application is improved genotyping of SNPs, using
log-concave non-parametric mixtures. I will sketch the principles behind the
algorithms, avoiding overly technical details, and show a number of
applications.
10:30 — 11:00
Break
11:00 — 11:45
Cajo ter Braak, Wageningen University
IDENTITY BY DESCENT PROBABILITY MATRIX DECOMPOSITION BY A
LATENT ANCESTOR ALLELE MODEL AND ITS APPLICATION IN QTL ANALYSIS
Elements of IBD probability matrices measure the probability
that two individuals share the same allele of a common ancestor. The ancestral
alleles and their number remain implicit. For human inspection and QTL analysis,
an explicit representation in terms of ancestral allele origin and number of
alleles may be desirable. For this purpose, we decompose the IBD matrix by a
latent class model with K classes (latent ancestor alleles). We provide an
efficient algorithm to fit the model. The algorithm correctly reconstructed the
ancestry of 16 maize inbreds from their IBD matrix only. We also show that the
model can help QTL detection from connected crosses. This is joint work with
Martin Boer, Radu Totir, Chris Winkler, Howie Smith and Marco Bink.
11:45 — 12:30
Keith Baggerly, University of Texas
THE IMPORTANCE OF REPRODUCIBILITY IN HIGH-THROUGHPUT BIOLOGY: A CASE
STUDY
Reproducibility is such a hallmark of modern science that it
is often taken as assumed. However, reproducing results from many
high-throughput biological studies often involves a rather surprising level of
detail. To illustrate this point, we present a case study in predicting patient
response to chemotherapy. Several recent papers have claimed that array-based
"signatures" of response to specific drugs can be derived from cell line
data (the NCI60 panel in particular) and used to predict patient
response. Successful application of such tests could greatly improve cancer
care, as many patients do not respond to front-line therapies but might respond
to others. Replicating these results, however, involves an exercise in
"forensic bioinformatics", as many of the precise methods used need to
be inferred from context. In addition to describing our analysis of the data, we
will discuss some of the steps we now employ (in particular, the use of Sweave)
to ensure greater reproducibility at our own institution.
12:30 — 13:30
Lunch
13:30 — 14:15
Tom Snijders, University of Oxford, University of
Groningen
STATISTICAL METHODS FOR SOCIAL NETWORK DYNAMICS
Social networks are relational structures between social
actors, represented usually by directed graphs (digraphs). Nodes in the graph
represent social actors, while arcs (directed lines) represent the ties between
them. The dependence structures between arcs, e.g. tendencies to transitivity of
ties, lead to complications in modelling such data.
A family of probability models will be presented for
longitudinal social network data: so-called actor-oriented models, which can be
motivated by assumptions about actors trying to optimize their situation with
very limited foresight. Such assumptions often make sense as a simple
approximation to social science ideas about network dynamics. These models are
very flexible, which is important for their use in statistical inference. Two
methods of estimation will be discussed. The first is a method of moments, the
second the maximum likelihood method. Both can be implemented by Markov chain
Monte Carlo methods.
14:15 — 15:00
Ritsert Jansen, University of Groningen
FROM MULTIFACTORIAL GENETIC PERTURBATION TO DEFINING CAUSAL
GENETIC NETWORKS
Genetically different individuals can exhibit large
quantitative trait variation. Such variation stems, at least partly, from
variations in the DNA. Modern sequencing technologies can reveal the variations
in the genome, and genome-wide linkage (GWL) analysis and genome-wide
association (GWA) analysis can then link or associate them to variations in the
trait of interest. These strategies are increasingly applied to a growing number
of organisms, including human, mouse, rat, cattle, pigs, A. thaliana, tomato,
corn, yeast, C. elegans, and D. melanogaster, and have pinpointed many
quantitative trait loci (QTL) on the genome.
To lift the veil that covers the genome-to-phenotype
relation we may need to monitor the whole trajectory of intermediate
biomolecular phenotypes. Today's molecular technologies, particularly
microarray and deep sequencing for epigenome and transcriptome, and high
resolution mass spectrometry and nuclear magnetic resonance for proteomics and
metabolomics, have reached a cost-efficiency level allowing for comprehensive
molecular profiling of many samples at multiple biomolecular levels. We here
discuss the promises, statistical methods, results and pitfalls for QTL
analysis, network reconstruction and causal inference using system-wide data on
chromatin modification (epiQTL), gene expression (eQTL), proteins (pQTL),
metabolites (mQTL) and classical phenotypes (phQTL) from studies on human, plant
and shorebird.
15:00 — 15:30
Break
15:30 — 16:15
John Nerbonne, University of Groningen
ANALYSING LANGUAGE VARIATION: HIGH DIMENSIONAL, GEOGRAPHICALLY
STRUCTURED DATA
Typical data in the study of language variation is a table with
sample dialect sites (villages) in one dimension and language items (words,
pronunciations or phrasal constructions) in the other. In socially motivated
studies, social varieties (combinations of class, sex/gender, and/or age) may
replace geographically defined samples. The problem is then to identify the
geographical structure in the data as well as its linguistic expression.
We illustrate several techniques we have applied to analyse
this data, including some alignment techniques from machine learning (paired
Hidden Markov Models), and a clustering technique taken from
bio-informatics.
16:15 — 17:00
Michael Biehl, University of Groningen
ADAPTIVE METRICS IN DISTANCE BASED CLASSIFICATION;
APPLICATIONS IN BIOLOGY AND MEDICINE
An introduction to distance based classification of
multi-dimensional data is given. In the popular Learning Vector Quantization
(LVQ), typical representatives of the classes (prototypes) are determined from
labelled example data in a training process. In the working phase, the
prototypes parameterize a classification scheme which can be applied to novel,
unlabelled data.
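The training and working phases can be sketched as follows. This is a minimal illustration of the basic LVQ1 update with a fixed squared Euclidean distance, not the speaker's code: the toy data, the initial prototypes, the learning rate and all function names are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_prototype(prototypes, x):
    """Index of the prototype closest to x."""
    return min(range(len(prototypes)),
               key=lambda i: squared_distance(prototypes[i][0], x))


def train_lvq1(data, labels, prototypes, learning_rate=0.1, epochs=20):
    """LVQ1: move the winning prototype towards x if its label matches, away otherwise."""
    protos = [(list(w), c) for w, c in prototypes]  # copy so the input is not modified
    for _ in range(epochs):
        for x, y in zip(data, labels):
            w, c = protos[nearest_prototype(protos, x)]
            sign = 1.0 if c == y else -1.0
            for j in range(len(w)):
                w[j] += sign * learning_rate * (x[j] - w[j])
    return protos


def classify(prototypes, x):
    """Working phase: assign the label of the nearest prototype."""
    return prototypes[nearest_prototype(prototypes, x)][1]


data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.8, 1.1), (1.1, 1.0)]
labels = ["a", "a", "a", "b", "b", "b"]
initial = [((0.3, 0.3), "a"), ((0.7, 0.7), "b")]
trained = train_lvq1(data, labels, initial)
print(classify(trained, (0.1, 0.1)), classify(trained, (0.9, 1.0)))  # a b
```

After training, only the prototypes are needed to classify new points, so the classifier stays small and easy to inspect.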
A key issue in this family of algorithms is the choice of a
suitable similarity or distance measure in order to facilitate good
classification performance. So-called Relevance Learning schemes address this
problem by employing adaptive distance measures which are also determined in the
training phase. The recently introduced Matrix Relevance LVQ is discussed in
greater detail.
Recent application examples illustrate the method in the
context of classification problems in biomedical data sets. For further
information and references, see http://www.cs.rug.nl/~biehl
19:00 — 22:00
Conference dinner, Humphrey's, Vismarkt 42, Groningen
Wednesday 25 November 2009
8:45 — 9:30
Hans van Houwelingen, Leiden University
FITTING SURVIVAL MODELS WITH p>>n PREDICTORS: BEYOND
PROPORTIONAL HAZARDS
In our 2006 paper [1] we employed partial likelihood
ridge regression for the prediction of breast cancer survival with gene
expression data, combining penalization with cross-validation. A comparative
study by Bovelstad et al. [2] showed that our approach is most successful for
this kind of data. However, the proportional hazards model used in these analyses is
quite simple and might not be realistic if there is a long survival
follow-up. Exploring the fit of the model by using a cross-validated prognostic
index leads to the conclusion that the effect of the predictor derived in [1] is
neither linear nor constant over time.
We will review the partial likelihood ridge regression for
survival data and will discuss ways of fine-tuning the model. Effects that vary
over time can be handled by the reduced rank model of [3], while nonlinear effects can be
introduced by means of bilinear terms.
[1] van Houwelingen, HC; Bruinsma, T; Hart, AAM; et
al. Cross-validated Cox regression on microarray gene expression data. Statistics
in Medicine, 25 (18): 3201-3216, 2006.
[2] Bovelstad, HM; Nygard, S; Storvold, HL; et
al. Predicting survival from microarray data - a comparative study.
Bioinformatics, 23 (16): 2080-2087, 2007.
[3] Perperoglou, A; le Cessie, S; van Houwelingen, HC.
Reduced-rank hazard regression for modeling non-proportional hazards. Statistics
in Medicine, 25 (16): 2831-2845, 2006.
9:30 — 10:15
Luigi Augugliaro, University of Palermo
A DIFFERENTIAL GEOMETRIC APPROACH TO IDENTIFY IMPORTANT
VARIABLES IN GLMs WHEN p >> n
Ultra-high dimensional variable selection plays an
important role in regression models applied in many areas of modern scientific
research such as genomics, astronomy and remote sensing applications. For these
kinds of problems the number of variables, p, can be much larger than the sample
size n. Many variable selection techniques for high dimensional statistical
models are based on a penalized likelihood approach. The LASSO estimator proposed by
Tibshirani [5], the SCAD method [2] and the L1-regularization path following algorithm
for generalized linear models proposed by Park and Hastie [4] are only some of
the most popular methods used to select relevant variables in a generalized
linear model. In a recent paper, Fan and Lv [3] proposed a sure independence
screening (SIS) method to select relevant variables in a linear regression model
defined on an ultrahigh dimensional feature space. The SIS method is based on the
geometrical theory underlying the linear regression model. This observation
suggests that a genuine generalization of the SIS method for generalized linear
models could be founded on an adequate generalization of the geometric
interpretation of a linear regression model. Based on this idea, we propose a new
statistical method to select relevant variables in a generalized linear model
defined in ultrahigh dimensional feature space, which is based on the
differential geometrical theory underlying the dgLARS algorithm proposed by
Augugliaro and Wit [1].
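For orientation, the screening step of SIS in the linear model is usually described as ranking predictors by the absolute value of their marginal correlation with the response and keeping the top few. The sketch below illustrates only that baseline idea, not the differential geometric method proposed in this talk. The function names and toy data are assumptions, and predictors are assumed to be non-constant so that the correlation is defined.

```python
import math


def correlation(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)


def sis_screen(columns, y, keep):
    """Return the indices of the `keep` predictors with largest |marginal correlation|."""
    scores = [abs(correlation(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


y = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = [
    [1.1, 1.9, 3.2, 3.9, 5.1],  # strongly positively related to y
    [2.0, 1.0, 2.0, 1.0, 2.0],  # essentially unrelated to y
    [5.0, 3.0, 4.0, 1.0, 2.0],  # negatively related to y
]
print(sis_screen(columns, y, keep=2))  # [0, 2]
```

Each predictor is scored on its own, so the cost grows linearly in p, which is what makes screening feasible when p is much larger than n.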
[1] L. Augugliaro and E. Wit. Generalizing LARS algorithm
using differential geometry. Seventh Scientific Meeting of the CLAssification
and Data Analysis Group of the Italian Statistical Society, 2009.
[2] J. Fan and R. Li. Variable selection via nonconcave
penalized likelihood and its oracle properties. Journal of the American
Statistical Association, 96:1348-1359, 2001.
[3] J. Fan and J. Lv. Sure independence screening for
ultrahigh dimensional feature space. Journal of the Royal Statistical Society,
Series B, 70:849-911, 2008.
[4] M. Y. Park and T. Hastie. L1-regularization path
algorithm for generalized linear models. Journal of the Royal Statistical
Society, Series B, 69:659-677, 2007.
[5] R. Tibshirani. Regression shrinkage and selection via
the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288,
1996.
10:15 — 10:30
Break
10:30 — 11:15
Siem Heisterkamp, Schering-Plough, University of
Groningen
PHARMACOGENETIC ANALYSIS IN CLINICAL TRIALS USING ELASTIC
NET
We performed a pharmacogenetic study using data from a
series of multi-centered clinical trials. Approximately 1100 patients from these
trials consented to our study. About 7800 SNPs had been handpicked from
regions around and within a number of candidate genes. The phenotype - response
to drugs used in treatment of a certain disease - is not very well defined. The
patients received either placebo, the proprietary compound, or one of its
competitors. It is not uncommon in this kind of trial that only a relatively small
proportion of patients respond favourably to a specific drug. The question is
this: can one characterize, by the genetic make-up of patients, a subgroup which
responds favourably to a specific drug more often? In this paper we will address
some issues in answering this question. Zou and Hastie [1] and Park and Hastie
[2] proposed a regression algorithm for selection of variables in (generalised)
linear models: the Elastic Net (EN). In a simulation study (submitted) we found
that the selection of 'true' markers depends strongly on
pre-selection of the markers. For example, one could use either all or the 1000 most
'significant' markers. Overlap between selected markers from an
EN-regression using different pre-selections based on statistical criteria alone
was disappointing. To overcome this problem we pre-selected SNPs in a
biologically more meaningful manner. We propose a two-step
EN-regression. First, we apply EN for each of the genes separately to select
SNPs for each of the genes, yielding a single score for each gene. In the
second step the latter are used as regressors in EN to select genes. This step
bears some resemblance to generalized additive models [3] for which - to the
best of our knowledge - no EN-like algorithm yet exists.
[1] Zou H, Hastie T. (2005) Regularization and variable
selection via the elastic net. J. R. Statist. Soc. B, 67, Part 2,
pp. 301-320
[2] Park, M. Y., Hastie, T. (2007) L1-regularization path
algorithm for generalized linear models, J. R. Statist. Soc. B, 69, Part 4,
659-677.
[3] Wood, S.N. Generalized Additive Models, an introduction
with R. 2006, Chapman & Hall
11:15 — 11:30
Break
11:30 — 12:30 KEYNOTE SPEECH
Stephen Senn, University of Glasgow
'THERE'S LESS TO THIS THAN MEETS THE EYE' — A
SCEPTICAL LOOK AT PERSONALISED MEDICINE
In the medical literature you will come across astonishing
claims as to what proportion of patients respond to treatment. Not astonishing
because the proportion is low but astonishing because the certainty with which
the proportion is named is so high. The truth is otherwise. For most diseases,
we simply do not know what proportion of patients respond to treatment because
the pharmaceutical industry has hardly ever run the sort of trial that would
enable them to identify who does and does not respond to treatment. Rather than
encouraging such trials to be run, the European Medicines Agency instead has
issued umpteen 'points to consider' documents inviting sponsors to
engage in 'responder analysis', a futile and arbitrary dichotomisation
of outcomes which, despite the name, does nothing to identify response and
merely drives up sample size.
In this talk, I shall show that identifying response is not
'rocket science'. It involves repeated measures designs, mixed models
and a little thought. Quite why the biostatistical community has not succeeded
in letting their medical colleagues in on the secret is a mystery on which I
will also freely speculate.
I conclude that the goal of personalised medicine is
probably further away now than it was ten years ago.
12:30 — 13:30
Lunch and departure