1 Introduction

The problem of subsample selection arises whenever it is not possible to measure the whole population of interest. The most common approach to subsample selection is simple random sampling. However, when the aim is to estimate a prediction model fitted to the subsample as precisely as possible, a carefully selected nonrandom sample may provide higher expected information per unit than a random sample.

The practical motivation for this work arises from the environmental management of aquatic systems. We consider data on 4360 Finnish lakes containing variables from monitoring data and register-based data. This dataset serves as the large dataset from which a subsample will be selected. Based on the monitoring data gathered, each lake either has a healthy water quality status or is in need of management actions to improve the status. Our interest is to model the relationship between the hard-to-measure quality status and the easily available register-based lake features using a Bayesian logistic regression model. The aim is to select a set of lakes such that the model parameters can be estimated as accurately and precisely as possible. However, data acquisition in the current monitoring program is considered costly, which is why the lakes must be selected wisely. The question is how to select the lakes to be monitored if only a subset of them can be measured.

In general, the selection problem considered here is NP-hard (Welch 1982) because of the massive number of combinations. When the number of lakes is n, there are \(2^{n}\) possible subsets if no constraints are applied. A truly optimal selection would require analysing all of these combinations, which is computationally infeasible, so heuristic selection algorithms are needed.
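To illustrate the scale of the problem, the following R sketch (R is the language used for the implementation in Sect. 4) evaluates the number of candidate subsets on the logarithmic scale; the population and subsample sizes anticipate the application in Sect. 5 and are used here purely for illustration.

```r
# Number of possible subsamples, on the log10 scale, illustrating why an
# exhaustive search is infeasible (sizes taken from the application in Sect. 5).
n  <- 4160                                       # candidate lakes
n1 <- 3000                                       # lakes to be selected
log10_all_subsets <- n * log10(2)                # all 2^n subsets
log10_fixed_size  <- lchoose(n, n1) / log(10)    # subsets of fixed size n1
c(all = log10_all_subsets, fixed_size = log10_fixed_size)
```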

Optimal subsample selection is related to optimal experimental design (Ryan et al. 2016; Atkinson et al. 2007; Chaloner and Verdinelli 1995), although there are some substantial differences. In both problems, the goal is to find a design that maximizes (or minimizes) an optimality criterion that is usually a function of the information matrix. A popular and well-known criterion is D-optimality, which is equivalent to maximising the determinant of the information matrix. In the case of a nonlinear model (Pronzato and Pázman 2013), the optimal design depends on the unknown model parameters in both problems. Wynn (1982) and Fedorov (1989) consider optimal design when there are restrictions on the density of the design measure. Pronzato and Wang (2021) propose a sequential subsampling method.

Optimal experimental designs are often found using a point-exchange algorithm (Fedorov 1972) or a coordinate-exchange algorithm (Meyer and Nachtsheim 1995). In subsample selection, the candidate points are restricted to those present in the data and, unlike in optimal experimental design, replicates are not available. This implies that coordinate-exchange algorithms are not applicable, while point-exchange algorithms could be used. In earlier works, approximately optimal subsamples have been found, for instance, by the greedy method (Reinikainen et al. 2016; Reinikainen and Karvanen 2022), a randomized search (Paglia et al. 2022) or a two-step algorithm (Zuo et al. 2021). We focus on methods that are scalable in the sense that the computational load remains bearable in the Bayesian setting even if the size of the population and the size of the subsample increase, and we therefore use the greedy method.

In this paper, we study subsample selection in the Finnish lake monitoring setting with the aim of estimating the parameters of a Bayesian logistic regression model predicting the ecological status of a lake as precisely as possible. As the model is nonlinear, the optimal selection depends on the model parameters via the information matrix. When prior information is scarce, the initial model parameters may not describe the phenomenon well enough. Updating the model sequentially (Pronzato 2006) after each measurement would be theoretically preferable but is not feasible in lake management because of the time needed to analyse water samples. A two-stage strategy (Sitter and Forbes 1997; Montepiedra and Yeh 1998; Guillera-Arroita et al. 2014; Pronzato and Pázman 2013) is a practical compromise that helps to recover from poor prior information but keeps the process relatively straightforward.

Plenty of research is available on two-stage methods in optimal experimental design. Ruggoo and Vandebroek (2004) implement a two-stage procedure to improve the model estimates in Bayesian optimal experimental design. Karvanen (2009) considers the cost of the experiment as a design criterion in sequential design selection. Reilly (1996) studies a general two-stage design in epidemiology, where the response and some easily obtained covariates are available at the first stage, while the more expensive covariates are ascertained only for a subsample of subjects at the second stage.

In the two-stage selection, we use prior data to select lakes to be measured and update the model with these measurements (first stage). Then we use the updated model to select more lakes to be measured (second stage) and combine all measurements to obtain the final estimates. By updating the model once, we expect to reach a better result than if we had selected all the lakes using only the prior information. We are interested in learning how the accuracy of the final estimates behaves as a function of the number of lakes selected at the first stage when the total sample size is fixed.

The paper is organized as follows: Sect. 2 introduces a compound dataset of lake monitoring data and register-based data containing lake features. In Sect. 3, we present notations and derive the D-optimal selection criterion for a logistic regression model. In Sect. 4, we introduce a greedy forward selection algorithm. In Sect. 5, the greedy forward algorithm is applied to the real dataset. Section 6 concludes with discussion about the results and future directions.

2 Data

As a motivating real-life example, we consider an optimal design problem connected to lake monitoring data in Finland. Under the Water Framework Directive (WFD) of the European Union (European Parliament 2000), a monitoring program is implemented to improve and secure the quality of inland waters in the EU. Regular and long-term monitoring data of parameters representing biotic structure, supported by the physical and chemical properties of water and by hydrological and morphological features, are used to classify the waters into ecological status classes (European Communities 2003). For each classification variable, the status class is assessed against the degree of deviance from pre-determined reference conditions (Aroviita et al. 2019), and experts’ opinions about the status of a lake also affect the classification. According to the directive, management actions need to be implemented to improve the ecological status if the status of the lake is moderate or worse. Collecting and analysing the monitoring data, however, is considered expensive. That is why our goal is to select the lakes to be monitored so as to save the resources reserved for data acquisition. In the value-of-information context, lake monitoring has been considered by Koski et al. (2020) for one lake and by Koski and Eidsvik (2024) for several lakes.

We use data on the latest ecological status classification of lakes in Finland, based on monitoring data from the years 2012–2017. The classification is available as open data maintained by the Finnish Environment Institute (http://www.syke.fi/en-US/Open_information). Since the need for management actions is our main interest, we consider the status class based on this need (Fig. 1, left). In addition, we utilise data from other open sources to support the knowledge about the lakes and their characteristics when modelling the ecological status.

According to the Finnish Environment Institute, there are about 187,000 lakes in Finland (Heiskanen et al. 2017). The total number depends on the definition of a lake, namely the lake area and the stability of water. We have the status classification for \(N = 4360\) lakes. Of these, 174 are located in the Helsinki-Uusimaa region, 445 in Southern Finland, 656 in Western Finland and 3085 in Northern and Eastern Finland. For this study, we excluded the lakes of Åland due to small sample sizes and, from the other regions, the lakes that comprise many water bodies, have a large area or are important for other reasons, for example Finland’s largest lake, Saimaa. The latter are monitored frequently in any case. The status classification for these \(N = 4360\) lakes is already available, but in our example we simulate a situation where the status is yet to be determined.

In the current study, the ecological status of a lake is modelled using lake features that are easily available from data sources. Basic features of lakes with an area over one hectare can be found via the interface maintained by the Finnish Environment Institute (https://www.syke.fi/en-US/Open_information/Open_web_services/Environmental_data_API). From this interface, we retrieved for each lake information about the location (the municipality, drainage basin and centre latitude and longitude coordinates) and basic lake features, such as waterbed area (hectares), length of shoreline (kilometres), average and maximum depth (meters), volume of water mass (1000 cubic meters) and altitude above sea level (meters). The circularity of the lake was calculated as the ratio of the circumference of a circular lake of the same area to the length of the shoreline. In addition to these variables, we have the agricultural area (Official Statistics of Finland 2020a) and the number of free-time residences in the municipality where each lake is located (Official Statistics of Finland 2020b). We divided the agricultural area and the number of free-time residences of each municipality by the area of the municipality to obtain the percentage of agricultural area and the density of free-time residences (Fig. 1, right). As a result of model selection, we chose the waterbed area and the agricultural land in the municipality where the lake is located as covariates for the model used in subsample selection.
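As an illustration of how the derived covariates can be computed, the following R sketch builds the circularity and the municipality-level quantities from a small artificial table; the column names and values are hypothetical and do not correspond to the actual source files.

```r
# A minimal sketch of the derived covariates; column names and values are hypothetical.
library(dplyr)

lakes <- data.frame(
  area_ha       = c(120, 45),    # waterbed area (hectares)
  shoreline_km  = c(6.1, 2.8),   # length of shoreline (kilometres)
  agri_ha       = c(5200, 910),  # agricultural area in the municipality (hectares)
  cottages      = c(1800, 240),  # free-time residences in the municipality
  muni_area_km2 = c(410, 160)    # area of the municipality (square kilometres)
)

lakes <- lakes %>%
  mutate(
    # circumference of a circular lake of the same area relative to the shoreline length
    circularity     = 2 * sqrt(pi * area_ha * 1e4) / (shoreline_km * 1e3),
    # percentage of agricultural land and density of free-time residences per municipality
    agri_pct        = 100 * (agri_ha / 100) / muni_area_km2,
    cottage_density = cottages / muni_area_km2
  )
```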

Fig. 1
figure 1

Left: The ecological status classification of 4360 lakes on the map of Finland, where the original five classes are reduced to a binary classification based on the need for management actions: nontarget status replaces the classes moderate, poor and bad, and target status replaces the classes high and good. Right: Data on 4360 lakes on the map of Finland by the percentage of agricultural land in the municipality where the lake is located

3 Bayesian Two-Stage Selection for Logistic Regression

3.1 Two-Stage Designs

A general two-stage procedure has the following steps:

  1. Based on the initial data, obtain initial parameter estimates and use them to choose an optimal design (first stage selection).

  2. Collect the data according to the first stage design and update the initial parameter estimates (first stage analysis).

  3. Based on the initial data and the data collected at the first stage, choose an optimal design for additional data collection (second stage selection).

  4. Collect the data according to the second stage design and analyse the full dataset, from both stages and the initial data, to obtain the final estimates (second stage analysis).

In the next sections, we describe a Bayesian implementation of this procedure.

3.2 Bayesian Subsample Selection

Let N be the number of units in the population of interest, for example lakes as in our case, and let a set of covariates be available for the whole population. First, let us assume that we have already measured an \((n_0 \times 1)\) response vector \({{\varvec{y}}}_0\) (the ecological status) for a small subset \(S_0\) of size \(n_0\). Furthermore, let \({{\varvec{x}}}_0=\{x_{0ij}\}\), \(i=1,\dots ,n_0\), \(j=1,\dots ,J\), be the \((n_0 \times J)\) covariate matrix representing the J covariates.

In the Bayesian framework, we have a likelihood \(p({{\varvec{y}}}_0 \vert {{\varvec{x}}}_0, {\varvec{\beta }})\) where the model parameters \({\varvec{\beta }}\) are defined in parameter space \(\Theta \). The prior distribution is denoted by \(p({\varvec{\beta }})\), and it is obtained from the initial knowledge. The posterior probability distribution \(p({\varvec{\beta }}\vert {{\varvec{y}}}_0, {{\varvec{x}}}_0)\) is proportional to the product \(p({{\varvec{y}}}_0 \vert {{\varvec{x}}}_0,{\varvec{\beta }})p({\varvec{\beta }})\) and describes our current knowledge of the model parameters. In the later analysis, \(p({\varvec{\beta }}\vert {{\varvec{y}}}_0, {{\varvec{x}}}_0)\) is considered as a prior distribution.

First, as the knowledge about the model parameters \({\varvec{\beta }}\) based on the prior is still poor, we are interested in selecting a subset from the \(n=N-n_0\) units with unknown response to improve our knowledge. We have observed the \((n \times J)\) covariate matrix \({{\varvec{x}}}\), but the \((n \times 1)\) response vector \({{\varvec{y}}}\) in space \(\mathcal {Y}\) remains unmeasured. In order to increase the accuracy of the modelling, we aim to select a subset \(S_1\) of size \(n_1<n\) from the population, measure the \((n_1 \times 1)\) vector \({{\varvec{y}}}_1\) for it and thus have the set \(S_0 \cup S_1\) measured. This is called the first stage selection. Using the prior and the data collected at the first stage, we obtain the posterior \(p({\varvec{\beta }}\vert {{\varvec{y}}}_0, {{\varvec{y}}}_1, {{\varvec{x}}}_0, {{\varvec{x}}}_1) \propto p({{\varvec{y}}}_0, {{\varvec{y}}}_1 \vert {{\varvec{x}}}_0,{{\varvec{x}}}_1,{\varvec{\beta }})p({\varvec{\beta }})\) that will be applied as a prior at the second stage.

At the second stage selection, we are interested in selecting a subset from the \(N-n_0-n_1\) units with unknown response. We aim to select a subset \(S_2\) of size \(n_2< n-n_1\) from the part of the population that remains unmeasured. Thus, we want to have the set \(S_0 \cup S_1 \cup S_2\) measured in total.

The selected subsamples are specified by a measurement plan \({{\varvec{r}}}= (r_1, r_2, \dots , r_n)^\intercal \), where \(r_i=1\) if the response \(y_i\) is planned to be measured for a unit i and \(r_i=0\) otherwise. In other words, \(S_1 \cup S_2\) is a set of units \(i=1,\dots ,n\) with \(r_i=1\). With the general notation for missing data, we may write

$$\begin{aligned} y(r_i)= {\left\{ \begin{array}{ll} y_{i}, &{} \text {if } r_i=1 \\ \text {NA}, &{} \text {if } r_i=0. \end{array}\right. } \end{aligned}$$

The predictive distribution of the response \({{\varvec{y}}}({{\varvec{r}}})\) before the data are collected according to a measurement plan can be written at the first stage as

$$\begin{aligned} p({{\varvec{y}}}({{\varvec{r}}}) \vert {{\varvec{x}}}, {{\varvec{y}}}_0,{{\varvec{x}}}_0) = \int _\Theta p({{\varvec{y}}}({{\varvec{r}}}) \vert {{\varvec{x}}}, {\varvec{\beta }}) p({\varvec{\beta }}\vert {{\varvec{y}}}_0,{{\varvec{x}}}_0) \text {d}{\varvec{\beta }}, \end{aligned}$$
(1)

and at the second stage as

$$\begin{aligned} p({{\varvec{y}}}({{\varvec{r}}}) \vert {{\varvec{x}}}, {{\varvec{y}}}_0,{{\varvec{y}}}_1,{{\varvec{x}}}_0,{{\varvec{x}}}_1) = \int _\Theta p({{\varvec{y}}}({{\varvec{r}}}) \vert {{\varvec{x}}}, {\varvec{\beta }}) p({\varvec{\beta }}\vert {{\varvec{y}}}_0,{{\varvec{y}}}_1,{{\varvec{x}}}_0,{{\varvec{x}}}_1) \text {d}{\varvec{\beta }}, \end{aligned}$$
(2)

where \(p({{\varvec{y}}}({{\varvec{r}}}) \vert {{\varvec{x}}}, {\varvec{\beta }})\) is the model for the new data.

Our aim is to select the subsamples \(S_1\) and \(S_2\) in an optimal way. The optimality is defined in terms of a utility function, which we denote in the general case as \(U({\varvec{\beta }}, {{\varvec{y}}}({{\varvec{r}}}), {{\varvec{x}}})\). Given \(n_1\), the optimal subsample is found at the first stage when we find a measurement plan \({{\varvec{r}}}\) that maximises

$$\begin{aligned} {\bar{U}}_1({{\varvec{r}}}) = \int _{\mathcal {Y}} \left[ \int _{\Theta } U({\varvec{\beta }},{{\varvec{y}}}({{\varvec{r}}}),{{\varvec{x}}}) p({\varvec{\beta }}\vert {{\varvec{y}}}({{\varvec{r}}}),{{\varvec{x}}}) \text {d} {\varvec{\beta }}\right] p({{\varvec{y}}}({{\varvec{r}}}) \vert {{\varvec{x}}}, {{\varvec{y}}}_0,{{\varvec{x}}}_0) \text {d} {{\varvec{y}}}\end{aligned}$$
(3)

given the constraint \(\sum _{i=1}^n r_{i}=n_1\). Then, given \(n_2\), the optimal subsample is found at the second stage when we find a measurement plan \({{\varvec{r}}}\) that maximises

$$\begin{aligned} {\bar{U}}_2({{\varvec{r}}}) = \int _{\mathcal {Y}} \left[ \int _{\Theta } U({\varvec{\beta }},{{\varvec{y}}}({{\varvec{r}}}),{{\varvec{x}}}) p({\varvec{\beta }}\vert {{\varvec{y}}}({{\varvec{r}}}),{{\varvec{x}}}) \text {d} {\varvec{\beta }}\right] p({{\varvec{y}}}({{\varvec{r}}}) \vert {{\varvec{x}}}, {{\varvec{y}}}_0,{{\varvec{y}}}_1,{{\varvec{x}}}_0,{{\varvec{x}}}_1) \text {d} {{\varvec{y}}}\end{aligned}$$
(4)

given the constraint \(\sum _{i=1}^{n} r_{i}=n_1+n_2\) and \(r_i=1\) for units already selected at the first stage. The posterior probability distribution \(p({\varvec{\beta }}\vert {{\varvec{y}}}({{\varvec{r}}}),{{\varvec{x}}})\) for the model parameters \({\varvec{\beta }}\) could be estimated for example with importance sampling.

In our analysis, we use an approximation of Eq. (4) (Chaloner and Verdinelli 1995)

$$\begin{aligned} {\bar{U}}_2({{\varvec{r}}}) \approx \int _{\mathcal {Y}} \left[ \int _{\Theta } U({\varvec{\beta }},{{\varvec{y}}}({{\varvec{r}}}),{{\varvec{x}}}) p({\varvec{\beta }}\vert {{\varvec{y}}}_0,{{\varvec{y}}}_1, {{\varvec{x}}}_0,{{\varvec{x}}}_1) \text {d} {\varvec{\beta }}\right] p({{\varvec{y}}}({{\varvec{r}}}) \vert {{\varvec{x}}}, {{\varvec{y}}}_0,{{\varvec{y}}}_1,{{\varvec{x}}}_0,{{\varvec{x}}}_1) \text {d} {{\varvec{y}}}, \end{aligned}$$
(5)

where the posterior \(p({\varvec{\beta }}\vert {{\varvec{y}}}({{\varvec{r}}}),{{\varvec{x}}})\) is replaced with the prior probability distribution \(p({\varvec{\beta }}\vert {{\varvec{y}}}_0,{{\varvec{y}}}_1, {{\varvec{x}}}_0,{{\varvec{x}}}_1)\) for the model parameters \({\varvec{\beta }}\). Equation (3) is approximated in a similar way.

In our case, the utility function measures the precision of the parameters of interest and serves as the selection criterion. Thus, following D-optimality, the selection maximizes the logarithm of the determinant of the expected information matrix

$$\begin{aligned} U({\varvec{\beta }}, {{\varvec{y}}}({{\varvec{r}}}), {{\varvec{x}}}) =\log \det ({\mathcal {I}}({\varvec{\beta }}\vert {{\varvec{x}}}, {{\varvec{r}}})) = \log \det \left( \sum _{i=1}^{n} {\mathcal {I}}({\varvec{\beta }}\vert x_i, r_i)\right) , \end{aligned}$$
(6)

where

$$\begin{aligned} {\mathcal {I}}({\varvec{\beta }}\vert x_i, r_i)= {\left\{ \begin{array}{ll} {\mathcal {I}}({\varvec{\beta }}\vert x_i), &{} r_i=1 \\ 0, &{} r_i=0 \end{array}\right. } \end{aligned}$$

is the expected information (Chaloner and Verdinelli 1995). In the frequentist approach, the distribution \(p({\varvec{\beta }}\vert {{\varvec{y}}}_0,{{\varvec{y}}}_1, {{\varvec{x}}}_0,{{\varvec{x}}}_1)\) would not be used; instead, the value of \({\varvec{\beta }}\) would be fixed to its current maximum likelihood estimate \({\hat{{\varvec{\beta }}}}\), and \(U({\hat{{\varvec{\beta }}}},{{\varvec{y}}}({{\varvec{r}}}),{{\varvec{x}}})\) would measure the utility of the measurement plan \({{\varvec{r}}}\).
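Note that the utility in Eq. (6) does not depend on the unmeasured responses \({{\varvec{y}}}({{\varvec{r}}})\), so the predictive integral in the approximation (5) integrates to one and the approximate expected utility reduces to a posterior expectation of the log-determinant. One way to evaluate it in practice is a Monte Carlo average over MCMC draws \({\varvec{\beta }}^{(1)},\dots ,{\varvec{\beta }}^{(M)}\) from the current posterior:

$$\begin{aligned} {\bar{U}}_2({{\varvec{r}}}) \approx \int _{\Theta } \log \det ({\mathcal {I}}({\varvec{\beta }}\vert {{\varvec{x}}}, {{\varvec{r}}})) \, p({\varvec{\beta }}\vert {{\varvec{y}}}_0,{{\varvec{y}}}_1, {{\varvec{x}}}_0,{{\varvec{x}}}_1) \, \text {d} {\varvec{\beta }}\approx \frac{1}{M} \sum _{m=1}^{M} \log \det \left( {\mathcal {I}}({\varvec{\beta }}^{(m)} \vert {{\varvec{x}}}, {{\varvec{r}}})\right) , \end{aligned}$$

and analogously for the first stage with \(p({\varvec{\beta }}\vert {{\varvec{y}}}_0,{{\varvec{x}}}_0)\).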

The Bayesian model fitting is implemented with the brms package (Bürkner 2017), which provides an R (R Core Team 2023) interface to Stan (Stan Development Team 2022). The package uses Markov chain Monte Carlo (MCMC) algorithms to draw random samples from the posterior in order to estimate the model parameters \({\varvec{\beta }}\). The sampling is implemented via adaptive Hamiltonian Monte Carlo (Hoffman and Gelman 2014).

3.3 Expected Information Matrix for a Logistic Model

The subset selection with the criterion of Eq. (6) is studied further in the context of generalised linear models. We assume a binary response \(y_i\), \(i=1,\dots ,n\), that is distributed as

$$\begin{aligned} P(y_i=1 \mid {{\varvec{x}}}_{i})&=\pi _i, \\ P(y_i=0 \mid {{\varvec{x}}}_{i})&=1-\pi _i, \end{aligned}$$

where the parameter \(\pi _i\) is linked to the set of covariates with a link function \(g(\pi _i)=\eta _i\) as follows:

$$\begin{aligned} g(\pi _i)&={{\,\textrm{logit}\,}}(\pi _i) = \eta _i, \\ \pi _i&= \frac{\exp (\eta _i)}{1 + \exp (\eta _i)}. \nonumber \end{aligned}$$

Here, the response \(y_i\) is modelled with a linear predictor \(\eta _i={{\varvec{x}}}_{i}^\intercal {\varvec{\beta }}\), where \({\varvec{\beta }}=\{\beta _j\}\), \(j=1,\dots , J,\) is an unknown parameter vector that needs to be estimated. The log-likelihood of the data \(\{y_i, {{\varvec{x}}}_{i}^\intercal \}\) is obtained as

$$\begin{aligned} \log p({{\varvec{y}}}\vert {{\varvec{x}}}, {\varvec{\beta }})&= \sum _{i=1}^{n}\log p(y_i \vert x_i, {\varvec{\beta }})= \sum _{i=1}^{n} y_i \log \left( \frac{\exp (\eta _i)}{1+\exp (\eta _i)}\right) \\&\quad +(1-y_i)\log \left( 1-\frac{\exp (\eta _i)}{1+\exp (\eta _i)}\right) \\&= \sum _{i=1}^{n}{y_i {{\varvec{x}}}_{i}^\intercal {\varvec{\beta }}- \log (1+\exp ({{\varvec{x}}}_{i}^\intercal {\varvec{\beta }}))}. \end{aligned}$$

Next, we give the form for the expected information matrix. The expected information matrix (the Fisher information matrix) tells how much information the data are expected to contain. It is used when the collection of the data is still being planned. Formally, the expected information matrix of a generalized linear model is the expected value of the observed information:

$$\begin{aligned} {\mathcal {I}}({\varvec{\beta }}\vert {{\varvec{x}}}, {{\varvec{r}}})=-\sum _{i=1}^n r_i {{\,\mathrm{\mathbb {E}}\,}}\left( \frac{\partial ^2 \log p(y_i \vert x_i, {\varvec{\beta }})}{\partial \beta \partial \beta ^\intercal } \right) , \end{aligned}$$
(7)

where the expectation is taken with respect to the response \(y_i\). In our case, for a binary response and the logit link, the (j, k) element of the matrix in Eq. (7), and hence in Eq. (6), takes the form

$$\begin{aligned} \left[ {\mathcal {I}}({\varvec{\beta }}\vert x_i, r_i)\right] _{jk}= -r_i {{\,\mathrm{\mathbb {E}}\,}}\left[ \frac{\partial ^2 \log p(y_i \vert x_i, {\varvec{\beta }})}{\partial \beta _{j} \partial \beta _{k}}\right] = r_i \left[ \frac{\exp (\eta _i)}{1+\exp (\eta _i)} \left( 1 - \frac{\exp (\eta _i)}{1+\exp (\eta _i)}\right) \right] x_{ij} x_{ik}. \end{aligned}$$
(8)
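The following R sketch illustrates Eqs. (6)–(8): it assembles the per-unit contributions of Eq. (8) into the expected information matrix, evaluates the log-determinant of Eq. (6), and averages it over posterior draws as described above. All object names and the simulated inputs are illustrative.

```r
# Minimal sketch of the D-criterion of Eqs. (6)-(8); names and inputs are illustrative.

# Expected information matrix of the logistic model:
# sum over units with r_i = 1 of pi_i (1 - pi_i) x_i x_i^T  (Eq. 8).
info_matrix <- function(beta, X, r) {
  eta <- as.vector(X %*% beta)
  w   <- plogis(eta) * (1 - plogis(eta))   # pi_i (1 - pi_i)
  crossprod(X * (r * w), X)                # t(X) %*% diag(r * w) %*% X
}

# D-criterion of Eq. (6) for a single parameter vector.
d_criterion <- function(beta, X, r) {
  as.numeric(determinant(info_matrix(beta, X, r), logarithm = TRUE)$modulus)
}

# Bayesian expected D-criterion: average over posterior draws (rows of beta_draws).
expected_d_criterion <- function(beta_draws, X, r) {
  mean(apply(beta_draws, 1, d_criterion, X = X, r = r))
}

# Toy example with simulated covariates and posterior draws.
set.seed(1)
X          <- cbind(1, matrix(rnorm(20 * 2), ncol = 2))    # intercept + two covariates
beta_draws <- matrix(rnorm(100 * 3, sd = 0.5), ncol = 3)   # 100 draws of beta
r          <- as.numeric(seq_len(20) <= 10)                # measure the first 10 units
expected_d_criterion(beta_draws, X, r)
```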

4 Algorithms for Finding D-optimal Selection

Since our selection problem has a massive number of combinations and it is not possible to analyse all of them, we need heuristic algorithms to find approximately optimal designs. Here, we chose to use the greedy forward selection in the two-stage approach to find D-optimal designs. The greedy approach, also known as sequential search (Dykstra 1971), is well known among mathematicians and computer scientists. The idea of the method is to sequentially add units to the design, optimizing the selection criterion only for the selection of a single unit at each round until the desired size \(n_1\) is reached. In many problems, a greedy strategy does not produce an optimal solution, but a greedy heuristic can yield locally optimal solutions that approximate the globally optimal solution and can be found in a reasonable running time. The forward selection (Algorithm 1) starts with no additional units selected and adds the most promising unit at each round until the desired size \(n_1\) is reached; a sketch in R is given after Algorithm 1. If two units give the same criterion value U, the selection between them is performed randomly. If \(n_1=n\), the procedure sorts all units into the preferred selection order. The algorithm was implemented in R (R Core Team 2023).

Algorithm 1
figure a

Greedy forward selection.
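Below is a minimal R sketch of the greedy forward selection, using the posterior-averaged D-criterion of Sect. 3.3 (reproduced in compact form so that the snippet is self-contained). All names and the simulated inputs are illustrative.

```r
# Minimal sketch of Algorithm 1 (greedy forward selection); names are illustrative.

# Posterior-averaged log-determinant of the expected information of the selected units.
crit <- function(sel, X, beta_draws) {
  Xs <- X[sel, , drop = FALSE]
  mean(apply(beta_draws, 1, function(beta) {
    p <- plogis(as.vector(Xs %*% beta))
    as.numeric(determinant(crossprod(Xs * (p * (1 - p)), Xs), logarithm = TRUE)$modulus)
  }))
}

greedy_forward <- function(X, beta_draws, n_select, initial = integer(0)) {
  selected   <- initial
  candidates <- setdiff(seq_len(nrow(X)), selected)
  while (length(selected) < length(initial) + n_select) {
    # criterion value obtained by adding each remaining candidate to the selection
    gains <- vapply(candidates, function(i) crit(c(selected, i), X, beta_draws),
                    numeric(1))
    best  <- candidates[gains == max(gains)]
    pick  <- if (length(best) > 1) sample(best, 1) else best   # ties broken at random
    selected   <- c(selected, pick)
    candidates <- setdiff(candidates, pick)
  }
  selected
}

# Toy usage: start from three already measured units so that the information
# matrix of the selected set is nonsingular, then add five more units.
set.seed(2)
X          <- cbind(1, matrix(rnorm(50 * 2), ncol = 2))
beta_draws <- matrix(rnorm(50 * 3, sd = 0.5), ncol = 3)
greedy_forward(X, beta_draws, n_select = 5, initial = 1:3)
```

This naive sketch recomputes the criterion from scratch for every candidate; a practical implementation would store the information matrix of the already selected units and update it with a rank-one term per candidate.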

5 Application to Finnish Lake Data

We first introduce the setting used to study the performance of the two-stage design selection. Next, we present results on the planning of the two-stage selection in the Bayesian framework. We report the model parameter estimates for the best two-stage scenario and compare them to the single-stage situation.

5.1 Study Setting

We applied the two-stage selection described in Sect. 3 and the greedy forward selection algorithm described in Sect. 4 to the Finnish lake management problem introduced in Sect. 2. We aimed to imitate the real-life decision-making process of monitoring data gathering and therefore fixed the initial data. It is realistic to assume that if any initial measurements are already available, they are available for the largest lakes. However, this implies that some bias may be unavoidable because the initial datasets have not been selected randomly.

The datasets used in the application were formulated as follows. We first sorted the lake data according to the area of the lake from the largest to the smallest and separated the \(\max (n_0) = 200\) largest lakes. Three initial datasets containing the \(n_0 = 25\), \(n_0 = 50\) and \(n_0 = 200\) largest lakes were extracted from this subset. The three choices of \(n_0\) represent situations where a small amount of prior data (\(n_0 = 25\)), a moderate amount of prior data (\(n_0 = 50\)) and a large amount of prior data (\(n_0 = 200\)) are available. Next, we used the remaining dataset of size \(n=4360-200=4160\) for the subsample selection. Figure 2 summarizes the study setting.
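The sorting and the extraction of the initial datasets can be sketched in R as follows, assuming a data frame `lakes` with one row per lake and a hypothetical column `area_ha` for the waterbed area.

```r
# Sketch of the data split; the data frame and column names are hypothetical.
lakes_sorted <- lakes[order(lakes$area_ha, decreasing = TRUE), ]

initial_25  <- lakes_sorted[1:25,  ]   # small amount of prior data
initial_50  <- lakes_sorted[1:50,  ]   # moderate amount of prior data
initial_200 <- lakes_sorted[1:200, ]   # large amount of prior data

candidates  <- lakes_sorted[-(1:200), ]  # n = 4360 - 200 = 4160 lakes for selection
```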

Fig. 2
figure 2

Flow chart of the two-stage study setting using the notation and sample sizes of the Finnish lake data application. The decision-maker is assumed to have all the knowledge in “Present knowledge,” while the first stage and the second stage are to be planned

In the logistic regression model, the binary response was the status of the lake (the target status coded as 1 and the nontarget status as 0), and the waterbed area and the agricultural land in the municipality where the lake is located were the covariates. We aimed to keep the model simple since the optimization is challenging in a high-dimensional model where the size of the information matrix is large (García-Ródenas et al. 2020). The covariates were centred and standardized by dividing by the standard deviation.

As a preparation for the subsample selection, we first fitted a Bayesian logistic regression model to the initial data of size \(n_0\) using noninformative priors, with the model fitting implemented in the brms package (Bürkner 2017). A Student’s t-distribution was used as the prior for the intercept, and uniform distributions were used for the regression coefficients. The model fitted to the initial data represents the knowledge that is available when the subsample selection starts and the decision on the measurement plan for the first stage is to be made.
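A model fit of the kind described here could be written with brms roughly as follows; the data frame, variable names and the Student’s t hyperparameters are illustrative assumptions rather than the exact specification used in the study (in brms, the regression coefficients keep flat, i.e. uniform, priors by default).

```r
# A minimal sketch of the initial model fit; data and variable names are hypothetical.
library(brms)

fit0 <- brm(
  status ~ area_std + agri_std,        # binary status ~ standardized covariates
  data   = initial_data,               # the n0 largest lakes
  family = bernoulli(link = "logit"),
  prior  = prior(student_t(3, 0, 2.5), class = "Intercept"),
  chains = 4, iter = 2000, seed = 1
)

# Posterior draws of the coefficients, used later in the selection criterion;
# brms names the coefficients b_Intercept, b_area_std, b_agri_std.
beta_draws <- as_draws_matrix(fit0)[, c("b_Intercept", "b_area_std", "b_agri_std")]
```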

In the subsample selection, we implemented the first stage selection and selected \(n_1\) additional units using Algorithm 1. Here, we use the model fitted to the initial data as the prior and apply an MCMC algorithm to draw posterior samples to estimate the model parameters. The first stage size was varied and took values \(n_1 \in \{100, 700, 1300, 1900, 2500\}\). For the selected \(n_1\) lakes, the ecological status was “measured”, i.e. obtained from the lake data. We added the selected subset to the initial data to obtain \(n_0 + n_1\) units in total and used this combined dataset to form an informative prior for the second stage.
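Updating the model with the combined data can be done by refitting, for example with the `update()` method of brms; the object names continue the hypothetical sketch above, with `selected_stage1` standing for the index vector returned by the greedy selection.

```r
# Refit the model on the initial data plus the lakes measured at the first stage
# (object names continue the hypothetical sketch above).
stage1_data <- candidates[selected_stage1, ]
fit1 <- update(fit0, newdata = rbind(initial_data, stage1_data))

# The posterior of fit1 acts as the informative prior for the second stage selection.
beta_draws1 <- as_draws_matrix(fit1)[, c("b_Intercept", "b_area_std", "b_agri_std")]
```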

At the second stage, we selected \(n_2\) units using the same procedure as at the first stage but with the informative prior estimated from the \(n_0 + n_1\) units. We fixed the second stage design size \(n_2\) so that the total number of lakes selected over the two stages was \(n_1 + n_2 = 3000\). This makes the selections with different first stage design sizes comparable with each other. Thus, in total, the number of lakes to be measured was \(n_0 + n_1 + n_2\). The D-criteria after this second stage selection are reported in the next section. We repeated the optimal selection 100 times, generating new initial model parameter values each time, and studied the range of the quantities over these 100 repetitions.

For comparison, we also considered a case where \(n_1+n_2=500\) and the first stage size took values \(n_1 \in \{10, 120, 230, 340, 450\}\). All other details were similar to the case with \(n_1+n_2=3000\).

5.2 Results on Bayesian Two-Stage Selection

Table 1 shows the model parameter estimates and the standard deviations when a small amount of prior data (\(n_0=25\)), a moderate amount of prior data (\(n_0=50\)) and a large amount of prior data (\(n_0=200\)) are available. We use these estimates as informative priors in the subsample selection. Naturally, these estimates have larger standard deviations compared to the model fitted to the full data of size \(N=4360\), for which the D-criterion (Eq. 6) equals \(U({\varvec{\beta }}, {{\varvec{y}}}({{\varvec{r}}}), {{\varvec{x}}})=20.07\). Not surprisingly, the estimates seem to be biased for \(n_0=25\) and \(n_0=50\) because the initial data contain only the largest lakes. With \(n_0=200\), on the other hand, the model estimates the effects of the covariates in the same way as for the full data of \(N = 4360\). The D-criteria (Eq. 6) of these initial models are 7.649, 9.791 and 12.570, respectively.

Table 1 Parameter estimates and their standard deviations of a Bayesian logistic regression model fitted to the initial lake data of size \(n_0 \in \{25, 50, 200\}\) and to the full lake data of size \(N=4360\)

Next, we implemented the two-stage selection. The top row in Fig. 3 shows the mean D-criterion over 100 repetitions after the second stage selection with varying first stage design sizes \(n_1\) when the Bayesian two-stage approach is used with different sizes of prior data \(n_0 \in \{25,50,200\}\). The total numbers of selected lakes are \(n_0+n_1+n_2=3025\), \(n_0+n_1+n_2=3050\) and \(n_0+n_1+n_2=3200\), respectively. Comparing the results across the three amounts of prior data, the general level of the D-criterion naturally increases when the total number of observations increases; see the definition of U in Eq. (6).

When \(n_0=25\), the highest second stage D-criterion value is obtained when approximately 1900 lakes are selected to be measured at the first stage. If the first stage design size is larger than 1900, the second stage D-criterion seems to decrease. However, the increase in the D-criterion value starts to flatten out already at \(n_1=700\). Thus, selecting more than 700 lakes at the first stage does not clearly improve the final result, given that \(n_1+n_2=3000\). The same pattern can be seen with \(n_0=50\).

When \(n_0=200\), no flattening out after \(n_1=700\) can be seen, but the second stage D-criterion increases strongly as \(n_1\) increases. Intuitively, a large enough amount of prior data makes the two-stage selection less beneficial: more data gathered at the first stage simply improve the final result. However, similarly to the cases with a small and a moderate amount of prior data, the maximum is still found at \(n_1=1900\).

Fig. 3
figure 3

The second stage D-criterion as a function of the size of the first stage in Bayesian subsample selection. The squares and error bars represent the mean, maximum and minimum D-criteria of 100 independent selections. The total number of lakes selected at the first and second stage is \(n_1+n_2=3000\) (top row) or \(n_1+n_2=500\) (bottom row)

Based on Fig. 3 (top row), selecting \(n_1=1900\) lakes at the first stage and \(n_2=1100\) at the second stage is the best strategy for all scenarios of the amount of prior data. Table 2 shows the parameter estimates and their standard deviations for this strategy as means over 100 independent iterations. The mean D-criteria in these cases are 18.776, 18.841 and 18.969, respectively. In addition, Table 2 shows the model parameters and the standard deviations for the single-stage approach where the sizes are \(n_1=3000\) and \(n_2=0\). Assuming the three different prior data scenarios, the total data size used for model fitting is \(n_0+n_1+n_2 \in \{3025, 3050, 3200\}\). The mean D-criteria in these situations are 18.737, 18.802 and 18.886, respectively. Comparing the two approaches, the two-stage approach works slightly better, but the differences are minor. Once the selection is made, the analysis could be extended by fitting a model that uses the other variables introduced in Sect. 2 as covariates.

Table 2 Parameter estimates and their standard deviations of a Bayesian logistic regression model fitted to the selected additional data

The top row of Fig. 3 shows the results when the total number of selected lakes is quite large, \(n_1+n_2=3000\). We now consider a situation where the total number is smaller. The bottom row of Fig. 3 shows the results for the case where \(n_1 + n_2 = 500\) and the first stage design size takes values \(n_1 \in \{10, 120, 230, 340, 450\}\). For \(n_0=25\), \(n_0=50\) and \(n_0=200\), the total numbers of selected lakes are \(n_0+n_1+n_2 = 525\), \(n_0+n_1+n_2 = 550\) and \(n_0+n_1+n_2 = 700\), respectively. When \(n_0=25\) and \(n_0=50\), the highest second stage D-criterion value is obtained when approximately 230 lakes are selected to be measured at the first stage. If the first stage design size is larger than 230, the D-criterion value decreases quickly. This suggests that the model should be updated early enough; otherwise, the lakes selected on the basis of insufficient prior information at the first stage constitute too large a proportion of all the selected lakes. With \(n_0=200\), the pattern is different: the D-criterion values increase with \(n_1\) until approximately 120 lakes are selected at the first stage and then decrease moderately. It seems that the prior information is sufficient to estimate the model, and thus no major differences are observed between the first stage design sizes. When \(n_1=230\) and \(n_2=270\), the mean D-criteria for \(n_0=25\), \(n_0=50\) and \(n_0=200\) are 16.571, 16.643 and 17.011, respectively. The corresponding mean D-criterion values for single-stage sampling are 16.239, 16.343 and 16.898. The benefits of two-stage sampling seem to be slightly larger when \(n_1 + n_2 = 500\) than when \(n_1 + n_2 = 3000\).

6 Discussion

We have considered the problem of subsample selection in the context of lake management. The aim was to select lakes that are maximally informative in the sense of Bayesian D-optimality for the prediction of water quality status. We compared two-stage selection to single-stage selection and found that two-stage selection may have only a modest advantage.

Subsample selection has similarities with experimental design. In both, the aim is to maximize D-optimality or another function of the information matrix. The main differences are that in subsample selection the set of candidate units is fixed and each unit (lake) can be chosen only once. Algorithms developed for experimental design are often useful also in subsample selection but require modifications. Our choice of a Bayesian approach together with the greedy forward selection takes into account the uncertainty in the model parameters and keeps the computational time reasonable.

Sequential or multi-stage approaches are never theoretically worse than single-stage approaches. In our application of lake management, the sequential strategy is not feasible because of the time needed to analyse water samples. The two-stage approach is a compromise that can be implemented in practice. Figure 3 shows that the D-criterion obtained in two-stage selection depends on the size of the first stage subsample in a nontrivial way. In practice, however, the benefit of the two-stage selection remained modest in the lake management data, as shown in Table 2.

The prior distributions of the model parameters were estimated from the initial data on the lakes with the largest surface area. It is realistic to assume that the initial data are collected for lake management purposes and therefore the lakes are not selected randomly. Using only the largest lakes obviously causes bias in the priors, which leads to suboptimal selection and may cause some bias even in the final estimates, especially if the size of the selected data is small.

The lake data had the status classification measured for 4360 lakes, but in the analysis we simulated a situation where the status is yet to be determined. In reality, 58,707 lakes can be found in the database maintained by the Finnish Environment Institute. Since the status classification is already available for 4360 of those 58,707 lakes with basic characteristics available, the classification is still missing for 54,347 lakes. The current work can be utilized in planning the data collection strategy for these lakes.

The presented methods may be applicable also in other environmental monitoring problems where the quantity of interest is expensive or difficult to measure. In addition to D-optimality, it is possible to consider value of information (Eidsvik et al. 2015; Koski et al. 2020; Koski and Eidsvik 2024) and other criteria that link directly to decision making.