Research Articles

Transfer learning framework for streamflow prediction in large-scale transboundary catchments: Sensitivity analysis and applicability in data-scarce basins

  • MA Kai 1, 2,
  • SHEN Chaopeng 3,
  • XU Ziyue 1, 2,
  • HE Daming 1, 2, *
  • 1. Institute of International Rivers and Eco-security, Yunnan University, Kunming 650091, China
  • 2. Yunnan Key Laboratory of International Rivers and Transboundary Eco-security, Yunnan University, Kunming 650091, China
  • 3. Civil and Environmental Engineering, Pennsylvania State University, University Park, PA, United States
*He Daming (1958-), PhD and Professor, specialized in transboundary hydrology and water resources. E-mail:

Ma Kai (1992-), PhD, specialized in transboundary hydrology and hydrologic modelling. E-mail:

Received date: 2023-09-21

  Accepted date: 2024-01-12

  Online published: 2024-05-31

Supported by

National Key Research and Development Program of China(2022YFF1302405)

National Natural Science Foundation of China(42201040)

The National Key Research and Development Program of China(2016YFA0601601)

The China Postdoctoral Science Foundation(2023M733006)


The imbalance in global streamflow gauge distribution and regional data scarcity, especially in large transboundary basins, challenge regional water resource management. Effectively utilizing these limited data to construct reliable models is of crucial practical importance. This study employs a transfer learning (TL) framework to simulate daily streamflow in the Dulong-Irrawaddy River Basin (DIRB), a less-studied transboundary basin shared by Myanmar, China, and India. Our results show that TL significantly improves streamflow predictions: the optimal TL model achieves an average Nash-Sutcliffe efficiency of 0.872, showing a marked improvement in the Hkamti sub-basin. Despite data scarcity, TL achieves a mean NSE of 0.817, surpassing the 0.655 of the process-based model MIKE SHE. Additionally, our study reveals the importance of source model selection in TL, as different parts of the flow are affected by the diversity and similarity of data in the source model. Deep learning models, particularly TL, exhibit complex sensitivities to meteorological inputs, more accurately capturing non-linear relationships among multiple variables than the process-based model. Integrated gradients (IG) analysis further illustrates TL’s ability to capture spatial heterogeneity in upstream and downstream sub-basins and its adeptness in characterizing different flow regimes. This study underscores the potential of TL in enhancing the understanding of hydrological processes in large-scale catchments and highlights its value for water resource management in transboundary basins under data scarcity.

Cite this article

MA Kai, SHEN Chaopeng, XU Ziyue, HE Daming. Transfer learning framework for streamflow prediction in large-scale transboundary catchments: Sensitivity analysis and applicability in data-scarce basins[J]. Journal of Geographical Sciences, 2024, 34(5): 963-984. DOI: 10.1007/s11442-024-2235-x

1 Introduction

Streamflow modelling accuracy is paramount for effective water resources management (Ault, 2020), safeguarding water security (Cook and Bakker, 2012), assessing climate change impacts (Schewe et al., 2019), and facilitating early warning systems for flooding (Borga et al., 2014). However, in-situ streamflow observations, a crucial basis for accurate streamflow simulation, are distributed globally in a highly imbalanced manner. This imbalance is particularly acute in transboundary basins across Europe and Asia, where significant climate change-induced water security risks are prevalent (Feng and He, 2009; UN World Water Development Report 2020 & 2021). The scarcity of regional data in these areas, combined with complex subsurface conditions, poses substantial challenges for water resource management.
To address the challenges arising from limited streamflow observations, there has been a growing trend toward using model simulations driven by multiple factors. Recent progress in hydrological research underscores the value of incorporating diverse meteorological data, including rainfall corrections and remote sensing attributes, to enhance the model capabilities (Li et al., 2019; Huang et al., 2020; Luo et al., 2020; Wang et al., 2020). Many studies primarily use process-based models to simulate runoff and investigate its response to climate changes in certain sub-basins (Funk et al., 2015; Shrestha et al., 2020). However, the inherent spatial heterogeneity of large-scale transboundary areas can lead to considerable variations in hydrological processes (Kirchner, 2016; Zhu et al., 2023), underscoring the need for innovative approaches that leverage global hydrometeorological data to enhance model accuracy and reliability.
Recently, data-driven models based on machine learning (ML), specifically the long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997), have shown great potential in hydrological research with various datasets (Kratzert et al., 2018; Shen, 2018; Feng et al., 2020; Leng et al., 2023; Li et al., 2023). By employing diverse techniques, data-driven models have made strides in streamflow simulation, particularly in basins where data are sparse (Feng et al., 2021; Lu et al., 2021). Transfer learning (TL) (Thrun and Pratt, 1998; Pan and Yang, 2010), a technique facilitating knowledge transfer across tasks, has been identified as beneficial in studies dealing with small sample sizes (Dong et al., 2019; Sun et al., 2019; Zhao et al., 2021). In our prior work, we built a transfer learning model for streamflow prediction on top of the LSTM. This was designed to address a series of challenges including data-scarce basins, catchment heterogeneity, and the need for global datasets (Ma et al., 2021).
Notwithstanding these advances, several questions remain unanswered. How does transfer learning, coupled with meteorological data from disparate regions, improve streamflow prediction? Is there a comparable underlying data response in TL and physical process models that accounts for their improved performance? And does this applicability extend to sparse data from large-scale catchments?
On the other hand, there has been rising concern over the interpretability of ML models due to their strength and complexity (Carvalho et al., 2019; Linardatos et al., 2020), and some progress has been made to reveal potential patterns by constraining the training process with physical relationships or by conducting sensitivity analysis of the ML models (Rodriguez-Galiano et al., 2015; Karpatne et al., 2017). However, the enormous quantity of parameters with information in ML poses significant challenges in systematically demonstrating how complex, nonlinear interactions in natural processes are reflected. The differentiable architecture of LSTM within neural networks enables the tracking of output gradients in response to variable inputs (Kratzert et al., 2020; Tsai et al., 2021). This gradient tracking can offer insights into the model’s training features. Techniques such as integrated gradients (IG) have been shown to effectively interpret deep learning models (Sundararajan et al., 2017). These discoveries suggest it is possible to interpret how TL enhanced the prediction capabilities, which in turn provides a deeper insight into streamflow extrapolation across heterogeneity.
In this study, we investigate the sensitivity of input climate variables in a transfer learning framework, and assess its applicability in large-scale catchments with scarce data. We aim to address the aforementioned issues in the following ways: Firstly, we construct and compare TLs trained using different source models to evaluate the reliability and the criteria for selecting source models. Following this, we analyze the sensitivity of LSTM and TL to the forcing data variables by comparing them with a physical process model. Furthermore, by leveraging the differentiable structure of the model, we calculate the integrated gradient (IG) of daily streamflow with respect to the forcing variables, which serves to illustrate the links between deep learning modeling enhancement and hydrological process understanding. Lastly, we compare short-term training models of deep learning and physical process models to evaluate their potential for application in sparse data from large-scale catchments. The insights obtained from these analyses would offer valuable guidance for the application of TL in other catchments.

2 Data and methods

2.1 Data

The Dulong-Irrawaddy River Basin (DIRB), positioned across Myanmar, China, and India, covers an expansive area of 420,934 km² and generates an average annual runoff of 480 billion m³. The mainstream originates from the Dulong River, alternatively referred to as the NmaiHka River in southern China, and the MaliHka River in northern Myanmar, extending to a total length of 2288 km. The topography within the designated study area significantly transitions from north to south, reducing from a northern plateau with an elevation of 4761 m in Myanmar to the southern alpine valleys and plains adjacent to the river. However, the runoff processes cannot be sufficiently captured by observational records due to the sparsely deployed gauges and the limited duration of observation (Ji et al., 2020). The absence of coordination and data sharing strategies leads to a prevalent data information asymmetry in a significant number of these transboundary basins, inherently restricting basin-scale hydrological studies in such regions (He et al., 2014).
In this study, our primary focus is on the part of the DIRB upstream of the Pyay station, within the subtropical and tropical monsoon climatic zones. The local data for the DIRB, which includes grid daily precipitation from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) (Funk et al., 2015) and daily temperature (including minimum and maximum) from ERA5 (the fifth generation ECMWF atmospheric reanalysis of the global climate), was collected from January 1, 2001 to January 1, 2011. Observed daily streamflow data from seven gauges in the basin were used to train and evaluate the model’s performance. Other static attributes used in this study, such as soil texture and land use, are listed in Tables S1 and S2. The DIRB is divided into seven sub-basins based on gauge locations and a Digital Elevation Model. These sub-basins have upstream and downstream relationships based on the river’s confluence path (for example, the Monywa sub-basin in the Chindwin River includes upstream reaches of Hkamti and Mawlaik). The model inputs correspond to the characteristics within each sub-basin.
Datasets detailing meteorological and hydrological patterns from a range of countries (including, but not limited to, the US, UK, and Brazil) have been released. These datasets offer a wealth of diverse information - in climate, geography, geology, and other attributes - that is highly beneficial to a transfer learning (TL) application. Accordingly, this research utilized three disparate hydrometeorological datasets as source data to investigate TL. These datasets are the original Catchment Attributes and MEteorology for Large-sample Studies (CAMELS) dataset for the contiguous USA (Newman et al., 2014), the CAMELS-GB for Great Britain (Coxon et al., 2020), and the CAMELS-CL, a moderately-dense dataset for Chile (Alvarez-Garreton et al., 2018). Each dataset incorporates meteorological forcings, daily streamflow records, and static catchment attributes (see Table S1 for further details).
Because diversity in source data could lay a robust foundation for TL, we utilized the full extent of available time series data in our training process to maximize data utilization and develop a source model enriched with informative content. This comprehensive approach to training involved datasets spanning over 30 years: CAMELS, covering 671 catchments from October 1, 1985, to October 1, 2015; CAMELS-GB, also encompassing 671 catchments, from October 1, 1970, to October 1, 2015; and CAMELS-CL, including 516 catchments, from October 1, 1980, to October 1, 2015. These extensive datasets were employed in the training of the source models using LSTM, achieving median Nash-Sutcliffe Efficiency (NSE) results of 0.72, 0.89, and 0.78 respectively. Despite the challenging topographic and climatic conditions in Chile affecting the performance of the CAMELS-CL model, particularly in regions dominated by glacial snowmelt, these catchments were nonetheless included in the source model training with the CAMELS-CL dataset. This decision was taken considering their potential effectiveness in impacting TL, despite their poorer performance.

2.2 Transfer learning (TL) framework based on Long Short-Term Memory (LSTM)

2.2.1 Long Short-Term Memory (LSTM) Network

Long Short-Term Memory (LSTM), a form of Recurrent Neural Network (RNN), is proficient in learning from sequential data and has successfully been used for predictions of streamflow (Feng et al., 2020, 2021; Ma et al., 2021), soil moisture (Fang et al., 2017, 2020), stream temperature (Rahmani et al., 2020), and lake water temperature (Read et al., 2019). Unlike simple RNNs, LSTM incorporates “memory states” and “gates” that facilitate learning of the duration for state information retention, discernment of information to forget, and determination of what to output. This architecture provides a model foundation for hydrological multivariate time series simulations (Shen et al., 2021).
In this study, LSTM is utilized in two distinct roles: as a tool for training the source models, and as a reference for comparison with TL models. Hyperparameters specific to LSTM and TLs, including learning rate and dropout rate, are detailed in Table S2. Optimal performance was ensured by adjusting the final values of the hyperparameters through sensitivity analysis.

2.2.2 TL framework

Figure 1 illustrates the transfer learning (TL) framework employed in this research for daily streamflow forecasting. The framework comprises two main components: a source model and a transfer learning model. The source model, an LSTM trained on data-rich datasets, is a deep network whose extensive parameters are reused during the transfer stage. The transfer learning model, trained on the forcing and attribute data from the DIRB, harnesses these parameters from the source model. Given the diversity inherent to datasets across continents, an initial linear layer was added to the TL framework to harmonize input dimensions with the attributes specific to the target region.
Figure 1 Transfer learning framework based on LSTM model and the location and observation stations of Dulong-Irrawaddy River Basin
Given that the transferred weights from the source model do not have explicit connections to physical processes, we employ three weight freezing strategies during the TL, labeled TL-a, TL-b, and TL-c. These strategies methodically reduce the fraction of frozen weights throughout the transfer process. Frozen weights refer to the transferred weights from the source model, which act as initial values during subsequent training. For instance, TL-a sustains most network weights and fine-tunes the model’s top layers. By weight initialization and freezing, these strategies equip the TL with fitting weight parameters for the target region (Ma et al., 2021). In the end, the optimal transfer strategy is used for the TL prediction of daily streamflow in the DIRB. The equations for the transfer learning (TL) model are detailed in Text S1, while the specific aspects of all weight freezing strategies are described in Figure S1.
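As an illustration only, the idea of initializing from source weights and freezing a shrinking fraction of them can be sketched as follows. The parameter group names, the assignment of groups to TL-a, TL-b, and TL-c, and the plain gradient step are all hypothetical; the actual configurations are those described in Figure S1 and Text S1, which are not reproduced here.

```python
import numpy as np

# Hypothetical parameter groups transferred from the source LSTM.
TRANSFERRED = ["lstm_input_weights", "lstm_recurrent_weights", "output_layer"]

# Hypothetical freezing plans: the frozen fraction shrinks from TL-a to TL-c.
FROZEN = {
    "TL-a": {"lstm_input_weights", "lstm_recurrent_weights"},
    "TL-b": {"lstm_recurrent_weights"},
    "TL-c": set(),
}

def init_target_params(source_params, adapter_shape, rng):
    """Copy source weights and add a new, always-trainable input adapter."""
    params = {name: source_params[name].copy() for name in TRANSFERRED}
    params["input_adapter"] = rng.normal(scale=0.1, size=adapter_shape)
    return params

def sgd_step(params, grads, frozen, lr=0.01):
    """One gradient step that leaves frozen groups at their transferred values."""
    return {
        name: w if name in frozen else w - lr * grads[name]
        for name, w in params.items()
    }

rng = np.random.default_rng(0)
source = {name: rng.normal(size=(4, 4)) for name in TRANSFERRED}
params = init_target_params(source, adapter_shape=(3, 4), rng=rng)
grads = {name: np.ones_like(w) for name, w in params.items()}  # placeholder gradients
updated = sgd_step(params, grads, FROZEN["TL-a"])
for name in params:
    changed = not np.array_equal(updated[name], params[name])
    print(f"{name}: {'updated' if changed else 'frozen'}")
```

In a deep learning library the same effect is usually obtained by excluding the frozen parameters from the optimizer or disabling their gradients; the dictionary-based step above only makes the bookkeeping explicit.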

2.2.3 Model evaluation

We utilize the Nash-Sutcliffe Efficiency Coefficient (NSE) for model evaluations in this research. Considering the stochastic nature of the training, we have employed five different random seeds and computed the average streamflow from this ensemble for metric evaluation (Kratzert et al., 2019; Feng et al., 2021). To evaluate model performance across flow percentiles, we used the percent bias of the top 2% peak flow range (flow duration curve (FDC) high-segment volume, or FHV), along with the percent bias of the median 20%-80% flow range (FDC median-segment volume, or FMV), and the percent bias of the bottom 20% low flow range (FDC low-segment volume, or FLV) (Yilmaz et al., 2008). The equations for FHV, FLV, and FMV are detailed in Text S2.
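The evaluation procedure described above can be sketched roughly as follows. This is a minimal illustration, not the formulation in Text S2: it assumes each FDC metric is a plain percent bias of simulated versus observed volume over a segment of the sorted flows, and the function names, the synthetic data, and the segment bounds passed as arguments are assumptions made for the example.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations of obs."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def fdc_segment_bias(obs, sim, lo, hi):
    """Percent bias of simulated vs. observed volume over one FDC segment.

    lo and hi are exceedance fractions in [0, 1]; 0 corresponds to the highest flow.
    """
    obs_sorted = np.sort(np.asarray(obs, float))[::-1]  # descending, FDC order
    sim_sorted = np.sort(np.asarray(sim, float))[::-1]
    n = len(obs_sorted)
    i0, i1 = int(round(lo * n)), int(round(hi * n))
    seg_obs, seg_sim = obs_sorted[i0:i1], sim_sorted[i0:i1]
    return 100.0 * (seg_sim.sum() - seg_obs.sum()) / seg_obs.sum()

def ensemble_mean(runs):
    """Average the simulated series from several random seeds, day by day."""
    return np.mean(np.asarray(runs, float), axis=0)

rng = np.random.default_rng(0)
obs = rng.gamma(shape=2.0, scale=50.0, size=1000)            # synthetic daily flow
runs = [obs * (1.10 + 0.01 * k) for k in (-2, -1, 0, 1, 2)]  # five stand-in "seeds"
sim = ensemble_mean(runs)

print("NSE:", nse(obs, sim))
print("FHV:", fdc_segment_bias(obs, sim, 0.00, 0.02))  # top 2%
print("FMV:", fdc_segment_bias(obs, sim, 0.20, 0.80))  # 20%-80%
print("FLV:", fdc_segment_bias(obs, sim, 0.80, 1.00))  # bottom 20%
```

Averaging the five runs before scoring, rather than averaging five separate scores, follows the ensemble procedure stated in the text.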

2.3 Model sensitivity analysis

We present a comprehensive sensitivity analysis of the LSTM, TL, and a physically-based hydrological model within a large-scale transboundary basin, aimed at assessing model predictive reliability under various scenarios. To evaluate the models under different situations, we defined three scenarios: a constant relative increase in precipitation of 10%, an increase in temperature of 1 °C, and simultaneous increases in both precipitation and temperature.
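The three scenarios can be expressed as simple transformations of the forcing series. The sketch below is illustrative: it assumes the temperature increase is applied equally to minimum and maximum temperature, and `toy_model` is a hypothetical stand-in for a trained LSTM, TL, or MIKE SHE run, not any of the models used in the study.

```python
import numpy as np

SCENARIOS = {
    "P+10%": dict(precip_factor=1.10, temp_offset=0.0),
    "T+1C": dict(precip_factor=1.00, temp_offset=1.0),
    "P+10%,T+1C": dict(precip_factor=1.10, temp_offset=1.0),
}

def perturb_forcings(precip, tmin, tmax, precip_factor, temp_offset):
    """Scale precipitation and shift both temperature series (assumed equal shift)."""
    return (
        np.asarray(precip, float) * precip_factor,
        np.asarray(tmin, float) + temp_offset,
        np.asarray(tmax, float) + temp_offset,
    )

def streamflow_delta(model, precip, tmin, tmax, scenario):
    """Change in simulated flow: perturbed run minus unperturbed run."""
    base = model(precip, tmin, tmax)
    changed = model(*perturb_forcings(precip, tmin, tmax, **SCENARIOS[scenario]))
    return changed - base

def toy_model(precip, tmin, tmax):
    """Hypothetical linear response used only to exercise the functions."""
    tmean = (np.asarray(tmin, float) + np.asarray(tmax, float)) / 2
    return 0.5 * np.asarray(precip, float) - 0.2 * tmean

precip = np.array([0.0, 12.0, 30.0, 5.0])
tmin = np.array([18.0, 19.0, 20.0, 19.5])
tmax = np.array([27.0, 28.0, 30.0, 29.0])
for name in SCENARIOS:
    print(name, streamflow_delta(toy_model, precip, tmin, tmax, name))
```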
In this section, we introduce the MIKE SHE (Système Hydrologique Européen), a physically-based distributed hydrological model, to evaluate its response alongside LSTM and TL models under varying forcing data conditions. Unlike data-driven DL models, MIKE SHE is grounded in established physical processes, lending greater interpretability to its outputs. Additionally, the efficacy of MIKE SHE in data-scarce areas remains to be compared with that of DL models. The MIKE SHE used in this work was calibrated from January 1, 2001 to January 1, 2006. Detailed information on the driving data and model parameters is provided in Tables S3 and S4.
Moreover, we present an exploratory interpretation of the training process of the LSTM and TLs. Sundararajan et al. (2017) introduced the method of integrated gradients (IG) to articulate the attribution relationship between model predictions and input features, satisfying the foundational sensitivity and implementation invariance axioms. The IG calculation is presented in Equation 1.
$\operatorname{Grads}_{i}(x) := \int_{\alpha=0}^{1} \frac{\partial f\left(x^{\prime}+\alpha\left(x-x^{\prime}\right)\right)}{\partial x_{i}}\, d\alpha \qquad (1)$
where $\frac{\partial f(x)}{\partial x_{i}}$ is the gradient of $f(x)$ along the $i$th dimension, $x$ is the model input, $x^{\prime}$ is the baseline input, and $\alpha$ is the interpolation coefficient along the straight-line path from $x^{\prime}$ to $x$. The IG index conveys the importance of an input feature in the model’s prediction, with larger IG values indicating a more significant impact on the predicted streamflow in this study. Particularly, a negative IG value means that the feature decreases the likelihood of the model predicting a specific outcome, such as an increase in streamflow, thereby offering a potential understanding of the model’s predictive behavior. The mean value of the inputs serves as the baseline for calculating the sensitivity of the forcing data to streamflow.
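A numerical approximation of Equation 1 can be sketched as below. This is an illustration under stated assumptions: the integral is replaced by a midpoint Riemann sum, gradients come from central differences rather than from automatic differentiation of an LSTM, and `toy_flow` is a hypothetical function of three forcings. Note that Sundararajan et al. (2017) define the attribution as the path-averaged gradient multiplied by $(x_i - x_i^{\prime})$; whether that scaling was applied in this study is not stated here, so the sketch returns both quantities.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at point x."""
    x = np.asarray(x, float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        grad.flat[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def integrated_gradients(f, x, baseline, steps=200):
    """Return (path-averaged gradient, scaled attribution) for input x."""
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1]
    grads = [numerical_gradient(f, baseline + a * (x - baseline)) for a in alphas]
    path_grad = np.mean(grads, axis=0)         # approximates the integral in Eq. (1)
    return path_grad, (x - baseline) * path_grad

def toy_flow(v):
    """Hypothetical flow response to (precipitation, T-min, T-max)."""
    p, tmin, tmax = v
    return 0.8 * p + 0.1 * p * tmin - 0.05 * tmax ** 2

x = np.array([20.0, 22.0, 31.0])         # one day's forcings
baseline = np.array([5.0, 20.0, 29.0])   # stand-in for the mean of the inputs
path_grad, attribution = integrated_gradients(toy_flow, x, baseline)
print("path-averaged gradient:", path_grad)
print("scaled attribution:", attribution)
print("sum of attributions:", attribution.sum())
print("f(x) - f(baseline):", toy_flow(x) - toy_flow(baseline))
```

With the scaled form, the attributions sum to the difference between the prediction at the input and at the baseline, which gives a quick sanity check on any implementation.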

2.4 Experiments

To address the previously mentioned issues related to transfer learning (TL) - most notably, its capacity to enhance streamflow prediction and its operational feasibility in data-scarce large-scale basins - we designed two separate experiments. The first, ‘TL utilizing different source models,’ examines the source models and their corresponding TL performance in detail, accompanied by a sensitivity analysis. This investigation aims to reveal inherent patterns within the TL that contribute to improved performance and the determinants for an appropriate selection of source model. Our second experiment, referred to as ‘TL application in data-scarce large-scale basins,’ evaluates models trained or calibrated with just one year’s data, thus simulating and corroborating the TL applicability in large-scale basins under extreme data scarcity.

2.4.1 TL utilizing different source models

The source model, an LSTM trained with diverse datasets, provides essential knowledge on the rainfall-runoff relationship and weight initialization, which constitute a significant portion of the generalization ability inherent in the TL framework. Thus, we developed TL models using source models trained by CAMELS, CAMELS-GB, and CAMELS-CL for the prediction of target streamflow. These diverse datasets were introduced into the TL framework with an aim to ascertain criteria for the selection of source model by investigating the relationship between TL enhancement, source models, and target basins, thereby providing guidance for future widespread TL applications. An additional LSTM model, specifically for the DIRB, acted as the benchmark for comparison. All models were trained over a period of five years, from January 1, 2001, to January 1, 2006, and subsequently validated over the following five years, from January 1, 2006, to January 1, 2011. We used the data from January 1, 2005, to January 1, 2006, as validation data to select the hyperparameters.
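For reference, the periods stated above can be written down as a small configuration. The sketch assumes half-open date ranges (start included, end excluded), which the text does not specify, and the dictionary keys are labels chosen for this example. As stated, the validation year lies inside the training window.

```python
from datetime import date, timedelta

# Periods as stated in the text; half-open [start, end) is an assumption.
PERIODS = {
    "train": (date(2001, 1, 1), date(2006, 1, 1)),
    "validation": (date(2005, 1, 1), date(2006, 1, 1)),  # hyperparameter selection
    "test": (date(2006, 1, 1), date(2011, 1, 1)),
}

def daily_range(start, end):
    """Every date d with start <= d < end."""
    return [start + timedelta(days=i) for i in range((end - start).days)]

for name, (start, end) in PERIODS.items():
    days = daily_range(start, end)
    print(f"{name}: {days[0]} to {days[-1]} ({len(days)} days)")
```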

2.4.2 TL application in data-scarce large-scale basins

The scarcity of observational data is a typical scenario that highlights the potential value of transfer learning (TL) in applications. Short-term hydrological observations, due to factors such as terrain and political restrictions, cannot buttress a regional hydrologic model - a reality all too common for large-scale transboundary basins devoid of data. To simulate this data deficit scenario, we trained LSTM and TLs over a one-year period from January 1, 2005, to January 1, 2006, and the subsequent test period extended from January 1, 2006, to January 1, 2011. The source models for transfer learning (TL) were selected from CAMELS, CAMELS-GB, and CAMELS-CL based on their individual merits. In this initial study, TL models were trained on all sub-basins. For a comparative benchmark, a MIKE SHE model was also calibrated using the identical one-year training period.

3 Results

3.1 Effects of different source models in transfer learning (TL)

Transfer learning (TL) has markedly advanced the prediction accuracy of daily streamflow for the DIRB, with the degree of improvement varying according to source data (Figure 2). Our research identifies TL(US)-b, TL(GB)-c, and TL(CL)-b as the superior options for implementing TL with different source models (Figure S2 details the performance of all TL configurations). This evidence strengthens our earlier findings: retaining a majority of the network weights, through selective freezing, proves beneficial when dealing with smaller target datasets. Compared to LSTM, TL strategies utilizing diverse source data have outperformed, with a gain in NSE ranging from 0.001 to 0.102 across the sub-basins under consideration (Table 1). The highest gain was noted when CAMELS-CL was employed as the source data in TL (mean NSE rose from 0.833 to 0.872), closely followed by optimal TL strategies using CAMELS-US and CAMELS-GB (Figure 2).
Table 1 The Nash-Sutcliffe Efficiency (NSE) of all sub-basins for LSTM and optimal TLs with different source models
Models Hkamti Mawlaik Monywa Katha Sagaing Magway Pyay
LSTM 0.755 0.848 0.817 0.826 0.840 0.878 0.865
TL (US) 0.844 0.863 0.865 0.876 0.855 0.902 0.896
TL (GB) 0.778 0.875 0.885 0.854 0.841 0.916 0.909
TL (CL) 0.857 0.858 0.860 0.876 0.855 0.903 0.894
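The summary figures quoted in the text can be reproduced directly from Table 1. The short check below uses only the tabulated values; the variable names are chosen for this example.

```python
# NSE per sub-basin, in the column order of Table 1
# (Hkamti, Mawlaik, Monywa, Katha, Sagaing, Magway, Pyay).
NSE = {
    "LSTM":    [0.755, 0.848, 0.817, 0.826, 0.840, 0.878, 0.865],
    "TL (US)": [0.844, 0.863, 0.865, 0.876, 0.855, 0.902, 0.896],
    "TL (GB)": [0.778, 0.875, 0.885, 0.854, 0.841, 0.916, 0.909],
    "TL (CL)": [0.857, 0.858, 0.860, 0.876, 0.855, 0.903, 0.894],
}

# Basin-average NSE for each model, rounded as in the text.
means = {model: round(sum(v) / len(v), 3) for model, v in NSE.items()}

# Gain of every TL model over LSTM in every sub-basin.
gains = [
    round(tl - base, 3)
    for model, values in NSE.items() if model != "LSTM"
    for tl, base in zip(values, NSE["LSTM"])
]

print(means)
print("smallest gain:", min(gains), "largest gain:", max(gains))
```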
Figure 2 Performance comparison between LSTM and TLs with different source models. Panel (a) displays the Nash-Sutcliffe Efficiency (NSE) of both LSTM and TL models for the ensemble mean discharge of five members during the test period. Panel (b) illustrates the percentage bias within the 20%-80% flow range (FMV). Panel (c) reveals the percentage bias within the top 2% peak flow range (FHV), while panel (d) represents the percentage bias within the bottom 2% low flow range (FLV). The median value is indicated by the central line, while a plus sign (+) typically represents the mean NSE.
The impact of the source data selection exhibits noticeable variation across different sections of the hydrograph. The FHV of the LSTM manifests a slight positive value, while TLs display a negative bias. The top 2-percentile events prove challenging to capture for all models, most notably when CAMELS datasets serve as the source, inducing the most pronounced negative bias in FHV. This decrement in FHV observed in TL models may be attributed to the source data, as parallel findings of negative bias have been reported in other studies utilizing CAMELS datasets (Kratzert et al., 2019; Feng et al., 2021). The FLV of all TLs demonstrates a modest positive bias in comparison to the LSTM. Notably, the TL model primarily rectifies the positive bias in the 20%-80% mid-flow range, with significant improvement resulting from the employment of CAMELS-GB and CAMELS as source data.

3.2 Model sensitivity analysis

3.2.1 Sensitivity of LSTM, TL, and MIKE SHE model predictions to forcing variable changes

Results from sensitivity analysis point to significant differences in how deep learning models and a physically-based model respond to variations in forcing data. The LSTM and TL models manifest a more dramatic precipitation sensitivity, marked by more substantial changes, compared to the MIKE SHE. Of note, the LSTM model displays the highest precipitation sensitivity, especially during peak flow events. Despite the precipitation sensitivity patterns being similar across different TLs, models employing CAMELS and CAMELS-CL as source data show a heightened sensitivity. Interestingly, the LSTM and TL(GB) models do not demonstrate a considerable decrement in flow, despite the concurrent escalations in both precipitation and temperature - a phenomenon presumably owing to the dominant role of temperature. Conversely, the MIKE SHE model, exhibiting less sensitivity to precipitation changes than the deep learning models, shows smoother fluctuations that initially increase and then decrease, closely aligning with the streamflow hydrograph (Figure 3).
Figure 3 Sensitivity of LSTM, TL and MIKE SHE models in Pyay. It shows the streamflow simulations of different models at the outlet of the DIRB (Pyay) (panel 8) and their observed changes in temperature (panel 1) and rainfall (panel 7). Panels 2-6 show the streamflow changes (Delta) of the machine learning model (LSTM and TL using different source data) and MIKE SHE in the set scenarios, respectively. The shaded sections in the figure highlight examples where extremes in temperature sensitivity align with peak occurrences of rainfall and flow events.
However, LSTM and TL models demonstrate significantly higher sensitivity to temperature than to precipitation (Figure 3), which can also be observed in the multi-year annual runoff simulations (Figure 4). The predominant influence of temperature variation is attributed to the minimum temperature (T-min), which fluctuates more than the maximum temperature, thus impacting the modeling progression (Figure 3 panel-1). In contrast, the MIKE SHE model exhibits comparable sensitivity to both temperature and precipitation, indicating a more uniform process (Figure 3 panel-6). Furthermore, in LSTM and TL models, the peak occurrences of temperature sensitivity and rainfall, as well as runoff peaks, are observed to be synchronous. This observation suggests that the training process may have effectively identified the dynamic interplay between temperature and precipitation. Interestingly, during periods of maximum precipitation, temperature sensitivity in these models can exhibit positive values.
Figure 4 The multi-year annual runoff of Pyay simulated by LSTM, TLs, and MIKE SHE during the test period. The red dashed line represents the observed streamflow during the test years (mm).
The annual runoff for the test years, as shown in Figure 4, reveals that deep learning models are more substantially impacted by changes in temperature and precipitation compared to the MIKE SHE model. Notably, MIKE SHE shows the weakest response to temperature fluctuations. This could indicate that MIKE SHE is less sensitive to such environmental changes, suggesting that in a progressively warming future climate, MIKE SHE may underestimate the influence of temperature on runoff.

3.2.2 Sensitivity of LSTM and TL models’ training process to forcing variables

The integrated gradient (IG), in accordance with Eq. (1), for LSTM and TLs over all the sub-basins, is depicted in Figure 5. The calculated IG for forcing variables is averaged over the test period. Serving as a measure for interpreting the model process, IG clearly illustrates the discrepancies present across different forcing variables, models, and sub-basins.
Figure 5 The integrated gradient (IG) of different forcing variables for LSTM (a), TL(US) (b), TL(GB) (c), and TL(CL) (d) in each sub-basin is illustrated, with the color deepening as the latitude of the sub-basin decreases.
Among all the forcing variables in streamflow prediction, precipitation displays the highest sensitivity, followed by minimum and maximum temperatures. It is observed that the TLs yield higher integrated gradients (IG) than the LSTM, particularly in the upper sub-basins. Moreover, a progressive decline in IG for all variables is apparent in both LSTM and TLs as the sub-basins approach the outlet. This trend is indicated by increasingly darker shades at lower latitudes in Figure 5, signifying the prominent influence of upstream flow on downstream hydrological processes. Such observations imply that the deep learning model devotes more attention to the upper basins within the network for comprehensive streamflow prediction of the entire DIRB, a focus that is notably enhanced in TLs.
The improvements introduced by the TLs also become evident across different streamflow percentiles. Figure 6 presents the integrated gradient (IG) of precipitation - the most influential among the forcing variables we used - for various models at Hkamti, Sagaing, and Pyay. The precipitation IG was partitioned according to different streamflow percentiles - high, medium, and low flow. Generally, LSTM and TL display comparable patterns and both of them display markedly higher IG at high-flow components compared to medium and low-flow components. This suggests their robust capabilities to capture peak flows, which is corroborated by several studies utilizing LSTM (Kratzert et al., 2019; Feng et al., 2021; Klotz et al., 2022). The enhancements seen in TL models, more pronounced upstream (Figure 5), are reflected in the precipitation IG across various flow percentiles. Particularly in Hkamti (Figure 6a), TL models show higher IG for high and mid-flow components than LSTM, while in Sagaing and Pyay, this difference gradually diminishes. Other forcing variables, such as minimum and maximum temperatures, also display greater IG values for TL models in the upstream, although their overall effect is less significant than that of precipitation.
Figure 6 The integrated gradient of precipitation for LSTM and TLs in (a) Hkamti, (b) Sagaing, and (c) Pyay. The box diagrams highlight the IG values corresponding to the top 2% high-flow (HF), 20%-80% mid-flow (MF), and bottom 30% low-flow (LF) part.

3.3 Transfer learning (TL) performance in data-scarce large-scale basins

Considering the limited availability of observed data within this large-scale catchment, we configured one-year training models to represent data scarcity scenarios, using MIKE SHE and LSTM models as benchmarks. As displayed in Table 2, TLs outperform both benchmarks in these data-limited scenarios, achieving a mean NSE of up to 0.817, compared with 0.774 for LSTM and 0.655 for MIKE SHE. TLs using a range of source data deliver comparable maximum improvements, yet the degree of model enhancement differs according to the source data, as evidenced by the biases in different flow segments. It is notable that the optimal TL models, employing CAMELS and CAMELS-CL as source data, exhibit pronounced improvements within the mid-flow regime, excluding the hydrograph extremes. The MIKE SHE model’s bias across different flow partitions appears minimal, while a comparison of evaluations across sub-basins reveals the discrepancies between it and the deep learning models (Table 3).
Table 2 The performance of MIKE SHE, LSTM and optimal TLs trained with different source models for data deficit scenario
Models NSE FHV FLV FMV
MIKE SHE 0.655 -20.952 38.498 5.893
LSTM 0.774 -22.909 47.278 10.662
TL(US) 0.816 -24.907 50.782 0.757
TL(GB) 0.816 -20.539 45.416 7.205
TL(CL) 0.817 -23.391 53.491 3.928

Bold indicates the optimal TL option with different source models.
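For readers who want to reproduce the NSE column, the standard Nash-Sutcliffe efficiency can be computed as sketched below. The function name `nse` and the list-based inputs are illustrative assumptions; only the formula itself (one minus the ratio of squared errors to the variance of the observations about their mean) is the standard definition.

```python
def nse(sim, obs):
    """Nash-Sutcliffe efficiency of simulated versus observed flows.

    Returns 1.0 for a perfect fit and 0.0 when the simulation is no better
    than always predicting the observed mean.
    """
    if len(sim) != len(obs) or not obs:
        raise ValueError("sim and obs must be non-empty and of equal length")
    obs_mean = sum(obs) / len(obs)
    squared_error = sum((s - o) ** 2 for s, o in zip(sim, obs))
    obs_variance = sum((o - obs_mean) ** 2 for o in obs)
    if obs_variance == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - squared_error / obs_variance
```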

Table 3 The sub-basins performance of MIKE SHE, LSTM and optimal TLs for data deficit scenario
Sub-basins Models NSE FHV FLV FMV
Hkamti MIKE SHE 0.694 -43.096 120.874 25.174
LSTM 0.713 -15.062 43.843 2.978
TL(US) 0.766 -8.573 49.933 -3.218
TL(GB) 0.704 1.735 48.006 5.534
TL(CL) 0.784 -6.438 51.001 -14.210
Mawlaik MIKE SHE 0.606 -53.974 25.008 -16.671
LSTM 0.804 -27.454 6.143 -4.908
TL(US) 0.820 -33.287 10.964 -7.304
TL(GB) 0.846 -28.634 -0.134 -2.969
TL(CL) 0.821 -28.656 3.143 -5.487
Monywa MIKE SHE 0.688 -33.390 88.498 -10.337
LSTM 0.800 -22.188 109.393 12.550
TL(US) 0.824 -25.976 126.305 2.934
TL(GB) 0.840 -23.465 94.620 8.971
TL(CL) 0.834 -24.207 112.832 12.804
Katha MIKE SHE 0.666 -22.407 -30.102 -17.640
LSTM 0.719 -9.009 17.872 23.925
TL(US) 0.848 -19.621 14.611 4.918
TL(GB) 0.824 -9.768 31.075 20.119
TL(CL) 0.849 -14.331 24.095 11.791
Sagaing MIKE SHE 0.682 -31.173 68.040 24.324
LSTM 0.765 -33.130 49.622 20.885
TL(US) 0.772 -37.393 49.196 1.288
TL(GB) 0.800 -30.878 54.549 13.486
TL(CL) 0.791 -33.498 60.682 8.731
Magway MIKE SHE 0.613 21.984 -9.157 18.046
LSTM 0.816 -24.745 38.169 7.075
TL(US) 0.845 -24.451 40.021 2.048
TL(GB) 0.854 -25.479 36.321 1.968
TL(CL) 0.833 -26.438 46.936 6.167
Pyay MIKE SHE 0.640 15.390 6.324 18.357
LSTM 0.799 -28.778 65.906 12.124
TL(US) 0.839 -25.048 64.447 4.633
TL(GB) 0.843 -27.280 53.475 3.323
TL(CL) 0.809 -30.171 75.748 7.698

Bold means the TLs’ metrics are better than LSTM and MIKE SHE.

Model evaluations across different sub-basins indicate that the advantage of transfer learning is more pronounced in upstream basins, while the process-based MIKE SHE presents a smaller bias in the low-flow part of the downstream basins (Table 3). One year of data hardly provides the capacity for consistent and accurate prediction of the hydrograph’s extreme portions, yet notable enhancements in TLs for the mid-flow at each sub-basin (with Hkamti as an exception) are observed. The positive mid-flow bias in the lower basins observed in both LSTM and MIKE SHE models is effectively mitigated by transfer learning, contributing significantly to the comprehensive evaluation of the basin.
It is noteworthy that the TL model significantly improves the prediction of peak flow at Hkamti and low flow at Mawlaik. This helps to counteract the extreme flow bias near the catchment outlet, which is primarily due to the considerable contribution of upstream runoff (NWRC, 2018). Furthermore, the MIKE SHE model demonstrates a lower bias for downstream extreme flows, particularly low flows at Pyay. These results reveal a distinctive contrast in the way TL and MIKE SHE enhance flow prediction. In particular, TL first addresses the streamflow’s extreme portion in the sub-basin where its performance is comparatively lacking, as observed in Hkamti, corroborating the findings presented by the IG in the sensitivity analysis.

4 Discussion

4.1 Source models selection and potential application in large-scale data limited region

The source models play distinct roles in the TL framework: each provides a different source of knowledge and is therefore expected to affect TL differently. As a guideline for selecting source models, we acknowledge the connection between the diversity of source model data and the similarity to the target catchment. This suggests that the choice of source models should consider both the variability of data and the resemblance to the hydrological attributes of the target region.
Diverse precipitation patterns and temporal spans in CAMELS-GB, CAMELS, and CAMELS-CL catchments critically enhance transfer learning performance in streamflow simulation. When comparing the water balance of catchments across different datasets, the catchments of CAMELS-GB are primarily energy-limited (Figure 7). Both CAMELS and CAMELS-CL datasets offer a more diverse set of catchments, including those with high precipitation, potentially informing the modelling of DIRB catchments. Similar climatic and hydrological conditions can also provide a favorable learning foundation for the model (Zhang et al., 2023). It is noteworthy that different source models are optimized for distinct parts of simulated streamflow, in reflection of the basins’ characteristics within these models. For example, CAMELS-CL, which encompasses a higher number of high-precipitation and large-area basins, promotes model transferability (Jahanshahi et al., 2022). This provides a solid basis for TL to enhance its understanding of rainfall-runoff relationships, thereby improving overall model performance. Furthermore, the effect of the source models’ time series length is significant. A longer input sequence in the LSTM model positively influences the modelling (Hashemi et al., 2022), and the long-term data (over 30 years) from the source model contributes a climatic change pattern to the transfer process, potentially rectifying the uncertainty bias triggered by insufficient data. These also indicate the potential benefits of diverse datasets, such as Caravan (Kratzert et al., 2023), for furthering transferability.
Figure 7 Water balance of the CAMELS, CAMELS-GB, CAMELS-CL, and DIRB catchments depicted within the Budyko scheme, with the right and upper axes showing corresponding variable histograms
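The Budyko-type comparison above can be illustrated with a short sketch that places a catchment by its aridity index (PET/P) and evaporative index. The function name `budyko_position` is hypothetical, the inputs borrow the attribute names `p_mean`, `pet_mean`, and `q_mean` from Table S1 and are assumed to share the same units, and approximating long-term evapotranspiration as P minus Q (negligible storage change) is an assumption of this sketch.

```python
def budyko_position(p_mean, pet_mean, q_mean):
    """Locate a catchment in Budyko space from long-term mean fluxes.

    Returns (aridity_index, evaporative_index, regime), where
    aridity_index = PET / P and evaporative_index = (P - Q) / P.
    A catchment with PET / P below 1 is labelled energy-limited.
    """
    if p_mean <= 0:
        raise ValueError("p_mean must be positive")
    aridity_index = pet_mean / p_mean
    # Long-term water balance: evapotranspiration ~ precipitation - runoff
    evaporative_index = (p_mean - q_mean) / p_mean
    regime = "energy-limited" if aridity_index < 1.0 else "water-limited"
    return aridity_index, evaporative_index, regime
```

For example, `budyko_position(1000.0, 500.0, 600.0)` describes a wet catchment where available energy, not water, limits evapotranspiration.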
In data-scarce scenarios, our findings underscore the resilience of the TL framework, as it effectively harnesses common rainfall-runoff information, diminishing the need for diverse source data and broadening its utility. In line with our previous findings that a source model encompassing 50-100 catchments substantially enhances TL (Ma et al., 2021), the TL framework constrains the maximum data utilization. We found that the variation in TL enhancement across different source models is relatively small in data scarcity scenarios, despite existing differences among various sub-basins. This pattern suggests that in scenarios where data are limited, the commonality of rainfall-runoff information primarily drives the improvements in TL. Additionally, this finding implies a lessened need for source data diversity in TL under data-scarce scenarios. This enables a more flexible selection of source models, considering both diversity and similarity, while still ensuring satisfactory simulations. Consequently, this expands the practicality and broad applicability of the TL framework in data-scarce scenarios.
Despite relying solely on one year’s worth of data for model training in a data-scarcity scenario, LSTM, TL, and MIKE SHE models exhibit satisfactory performance, attributable to the dominant rainfall-runoff relationship and the limited impact of other factors (e.g., land use change in the DIRB; the lumped attributes also lead to a loss of spatial information in some large areas). Nevertheless, we observe distinct patterns of TL improvement for various target regions. Prior research indicates that when the LSTM model’s performance is relatively poor for the target region (mean NSE < 0.6), such as in the cases of Chile and the UK, the primary advantages of TL are manifested in the correction of extreme flows (Ma et al., 2021). In contrast, our current study shows that when predictions for the target region are comparatively more accurate, the greatest benefits are reaped in the mid-flow part. This pattern of variation may require further corroboration through additional TL experiments conducted in other target regions. Conversely, MIKE SHE performs better in simulating the low-flow portions of the downstream sub-basins. This divergence in improvement patterns for different flow portions in different sub-basins between deep learning models and physical process models highlights the immense potential of incorporating physical processes into deep learning models.

4.2 Sensitivity and interpretation of transfer learning (TL) modelling process

The distinct sensitivity patterns exhibited by machine learning models like LSTM and TL, in contrast to the stable responses of the MIKE SHE physical model, underline the necessity for further exploration into the dynamic interplay of factors in hydrological modelling. The relative stability in the sensitivity of the MIKE SHE model can be attributed to the inherent physical processes represented within the model. In contrast, deep learning models, which primarily learn relationships between forcing data with the objective of fitting observed variables, exhibit varied sensitivity patterns. Notably, between TL and LSTM, TL’s sensitivity response is significantly more stable, likely reflecting the incorporation of more hydrological information from the source model. For example, during 2009-2010, LSTM’s temperature sensitivity experienced a significant drop, while TL’s sensitivities were much more moderated.
The response of runoff to simultaneous increases in temperature and precipitation in MIKE SHE is almost linear: the sum of the runoff changes due to a one-degree increase in temperature and a 10% increase in precipitation equals the combined response to both (Figure 4). However, because temperature and precipitation interact, for example through increased evaporation at higher temperatures potentially offsetting the additional runoff from increased precipitation, DL models, particularly TL models, exhibit a more complex, non-linear response. Moreover, DL models’ enhanced sensitivity to temperature leads to more significant response fluctuations in precipitation and flow peak, suggesting that they may better capture the interplay between temperature and precipitation. These findings indicate that DL models reflect a more realistic sensitivity to meteorological inputs compared to MIKE SHE models.
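The additivity test implied above (comparing the combined response with the sum of the individual responses) can be sketched as below. The function `additivity_gap` and the two toy runoff models are hypothetical stand-ins used only to show the arithmetic; they are not MIKE SHE, LSTM, or TL, and the perturbation sizes simply mirror the one-degree and 10% changes mentioned in the text.

```python
def additivity_gap(model, precip, temp, d_temp=1.0, precip_factor=1.10):
    """Combined response minus the sum of the single-factor responses.

    A value near zero means the model responds additively (linearly) to the
    two perturbations; a non-zero value indicates an interaction effect.
    `model` is any callable mapping (precip, temp) to runoff.
    """
    base = model(precip, temp)
    d_t = model(precip, temp + d_temp) - base
    d_p = model(precip * precip_factor, temp) - base
    d_both = model(precip * precip_factor, temp + d_temp) - base
    return d_both - (d_t + d_p)


def linear_toy(precip, temp):
    """Toy additive model: runoff responds independently to each input."""
    return 0.6 * precip - 2.0 * temp


def interacting_toy(precip, temp):
    """Toy model in which warming removes a fraction of the rainfall."""
    return precip * (1.0 - 0.02 * temp)
```

With `precip=100.0` and `temp=10.0`, the linear toy model gives a gap of approximately zero, whereas the interacting toy model gives about -0.2, because the warming-induced loss scales with the extra rainfall.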
However, deep learning models also lead to a less intuitive observation, in which streamflow shows a positive sensitivity to temperature. Given the lack of embedded physical constraints in deep learning models, we suggest that the model’s interpretation of temperature-induced sensitivity is likely intertwined with other factors. For example, in high-altitude basins dominated by snowmelt, the temperature link may be unclear without the constraints of physical processes, particularly as temperature sensitivity for streamflow is more pronounced around 0℃. Factors like increased glacier melt and changes in land cover due to rising temperatures can also contribute to increased runoff (Bolibar et al., 2022; Ji et al., 2022). This uncertain connection with physical knowledge calls for further investigation, potentially through cross-validation with more examples or by imposing physical constraints on deep learning models, to better comprehend the specific impact mechanisms (Feng et al., 2022; Reichert et al., 2021).
Integrated gradients (IG) analysis uncovers the spatial emphasis and potential of LSTM and TL models in hydrological processes. Upon analyzing the IG, we identified that the modeling process focuses differently on sub-basins. The increase in IG along sub-basins in LSTM and TL models indicates a more concentrated training on the upstream rainfall-runoff relationship, reflecting the spatial connections formed as river confluences progress downstream. A majority of the mean annual flow in the Ayeyarwady River originates from the higher elevation regions of its northern basin (NWRC, 2018), where the runoff depth exceeds 2000 mm, thus contributing significantly to the total runoff. Particularly in TL models used with CAMELS-CL and CAMELS-US, the daily IG across different sub-basins correlates with runoff depth, where diverse data provides a superior constraint relationship for TL (Figure 8). This unique mapping capability of IG sheds new light on the interpretation of LSTM and TL modeling processes, providing a tentative insight into understanding the physical mechanisms encapsulated in deep learning models.
Figure 8 The relationship between runoff depth and integrated gradient (IG) of daily precipitation for Long Short-Term Memory (LSTM) in panel (a), transfer learning from United States (TL(US)) in panel (b), transfer learning from Great Britain (TL(GB)) in panel (c), and transfer learning from Chile (TL(CL)) in panel (d) across all sub-basins
Furthermore, the enhancement of TL can be interpreted through its larger integrated gradients (IG) in the upper sub-basin relative to LSTM, particularly within high and mid-flow sections. This enhancement strengthens the foundational understanding, enabling a superior fit for streamflow predictions. The heightened IG for high flow suggests the potential efficacy of TL in flood detection and attribution analysis. Furthermore, the occurrence of negative IG values for precipitation during certain periods does not imply an overall negative impact of precipitation on the model. Instead, this indicates that in these specific periods, precipitation may interact with other factors, such as high temperatures and human activities in different sub-basins, thus reducing the likelihood of increased streamflow. In our perspective, the subtle alterations within the model and improvements in its performance are preliminary in terms of their linkage with hydrological knowledge. Given the advances in differentiable models and gradient-focused studies, opportunities for uncovering physical processes in hydrological models via machine learning are emerging (Feng et al., 2022; Höge et al., 2022; Shen et al., 2023), promising a more detailed exploration from pattern to hydrological processes in future endeavors.
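To make the IG quantities discussed in this section concrete, the sketch below approximates integrated gradients for a generic scalar function by averaging gradients along the straight path from a baseline to the input and scaling by the input difference. It is an illustration under stated assumptions, not the implementation used in the study: the function name `integrated_gradients`, the finite-difference gradients (a trained network would normally use automatic differentiation), the midpoint integration rule, and the choice of baseline are all assumptions.

```python
def integrated_gradients(f, x, baseline, steps=50, eps=1e-6):
    """Approximate integrated gradients of scalar function f at input x.

    attribution_i = (x_i - baseline_i) * mean over the path of df/dx_i,
    with the path integral approximated by a midpoint Riemann sum and the
    gradient by central finite differences.
    """
    n = len(x)
    mean_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between baseline and input
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up = list(point)
            down = list(point)
            up[i] += eps
            down[i] -= eps
            mean_grad[i] += (f(up) - f(down)) / (2.0 * eps) / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, mean_grad)]
```

A useful sanity check is the completeness property: the attributions should sum approximately to `f(x) - f(baseline)`. A negative attribution for one input, as noted for precipitation above, only means that moving this input from the baseline to its actual value lowered the output along the path.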

4.3 Limitations and prospects for transfer learning (TL) framework

In the study area, affected by the southwestern Indian monsoon, rainfall patterns are markedly concentrated during the monsoon season, providing a basis for model development in areas with scarce data. In regions with complex climate conditions, it may be necessary to conduct comprehensive analyses or divide regions based on their climate for effective model construction. In the DIRB, human activities concentrated in agricultural and urban areas, especially in the middle and lower sections (National Water Resources Committee, 2018), potentially affect DL model training, leading to notable fluctuations in sensitivity. Although this increased sensitivity may not always align with hydrological knowledge, it underscores the model’s ability to discern intricate relationships among various factors. Further studies focused on the specific impacts of different human activities on model sensitivity could offer more precise and reliable flow evaluations under evolving climate patterns.
The TL framework faces uncertainties related to data quality and model structure, impacting its applicability across different catchments. Local data quality is critical for TL’s effectiveness, exemplified by the significant discrepancies in meteorological data in large basins originating from the Tibetan Plateau (Sirisena et al., 2018; Hu et al., 2023), leading to model prediction uncertainties. Additionally, while static features in TL frameworks have less impact than dynamic meteorological characteristics, linear extraction of static features may not fully capture all critical information in complex basins. Furthermore, given the scenario of future climate warming, the temperature sensitivity of TL warrants further investigation. Despite the inherent uncertainties regarding the universal applicability of DL models in various data density scenarios, TL has shown effective application in diverse basins, including those in the USA, Chile, the UK, and China (Minjiang River basin). This study particularly underscores TL’s efficacy in large-scale basins. Future work will focus on elucidating the mechanisms by which multiple factors impact hydrological processes, enhancing model performance in complex climatic and geological settings. This approach is crucial for bolstering the foundation for water security risk assessment and resource management in large-scale transboundary catchments.

5 Conclusions

In this study, we assess the influence of varying source models on streamflow simulation performance within a transfer learning (TL) framework, using hydro-meteorological data from data-rich continents to refine daily streamflow predictions in large-scale transboundary basins. The TL framework markedly improves streamflow prediction, reaching a mean Nash-Sutcliffe efficiency of 0.872 and showing a noteworthy enhancement, from 0.755 to 0.857, compared to LSTM in the Hkamti sub-basin. Additionally, the effects of the diversity and similarity of data in the source model are reflected across different parts of the flow, serving as determinants when selecting the source model in transfer learning. Despite the data scarcity, TL still achieves a mean Nash-Sutcliffe efficiency (NSE) of 0.817, clearly surpassing the 0.655 achieved by MIKE SHE. Moreover, the physically based model and the deep learning models each display unique advantages in flow prediction for data-scarce large-scale basins, implying the significant potential of TL and other deep learning models that integrate physical processes.
The deep learning models, particularly TL, present complex and dramatic sensitivities to forcing. Further analysis employing integrated gradients (IG) reveals the capacity of the TL model to capture the spatial heterogeneity of upstream and downstream sub-basins during simulations, in addition to its adeptness in characterizing different flow regimes. This provides a promising avenue for the investigation of possible physical processes. Our preliminary interpretation suggests that the TL framework enables the model to concentrate more on the hydrological processes in sub-basins with substantial runoff contributions, aligning with accepted hydrological understanding.
From this work, we present an introductory exploration of TL, using LSTM and a physical process-based model as benchmarks. This delivers an initial explanation for the benefits of TL in modeling performance and provides substantial guidance for the application of the TL framework within other large-scale catchments. Future studies are expected to develop systematic strategies to harness the power of the TL framework for enhanced understanding of hydrological processes.
Current challenges in water resource management and security for transboundary catchments in Europe and Asia, particularly due to data scarcity and information imbalances, underscore the importance of this work. The TL framework, when coupled with diverse global datasets, promotes state-of-the-art performance in streamflow prediction, thereby amplifying the value of local data. Both the similarity and uniqueness of catchments can contribute as sources of hydrological knowledge, enhancing streamflow predictions in target basins. This provides a rapid and effective approach that holds significant value for streamflow prediction in transboundary basins, especially in regions with severely limited observational data.


We thank Peter Reichert for the support in the sensitivity analysis (Method used in 3.2.1) that helped improve the manuscript. The hydrologic deep learning code used in this work can be accessed at Data for CAMELS can be downloaded at Data for CAMELS-GB can be downloaded at Data for CAMELS-CL can be downloaded at Discharge data of DIRB can be downloaded at

Appendix A

Text S1 Transfer learning (TL) model based on LSTM.
The forward pass of the TL model is described by the following equations:
Table S1 Summary of the forcing and attribute variables from CAMELS, CAMELS-GB, CAMELS-CL datasets
Dataset Forcing Static basin attributes
Variable name Description
Data for DIRB PRCP Averaged precipitation lat, lon, altitude, area, soil_bulk_density, p_mean, q_mean
T_max Daily maximum temperature
T_min Daily minimum temperature
CAMELS PRCP Averaged precipitation elev_mean, slope_mean, area_gages2, frac_forest, lai_max, lai_diff, dom_land_cover_frac, dom_land_cover, root_depth_50, soil_depth_statsgo, soil_porosity, soil_conductivity, max_water_content, geol_1st_class, geol_2nd_class, geol_porostiy, geol_permeability, p_mean, pet_mean, p_seasonality, frac_snow, aridity, high_prec_freq, high_prec_dur, low_prec_freq, low_prec_dur.*
SRAD Incident shortwave radiation
Tmax Daily maximum temperature
Tmin Daily minimum temperature
Vp Water vapor pressure
Dayl Duration of daylight period
CAMELS-GB Precipitation Averaged precipitation p_mean, pet_mean, aridity, p_seasonality, discharges, inter_high_perc, q_mean, runoff_ratio, stream_elas, baseflow_index, Q5, Q95, wood_perc, ewood_perc, grass_perc, shrub_perc, crop_perc, urban_perc, inwater_perc, bares_perc, sand_perc, silt_perc, clay_perc, organic_perc, bulkdens, tawc, porosity_cosby, porosity_hypres, conductivity_cosby, conductivity_hypres, root_depth, soil_depth_pelletier, gauge_lat, gauge_lon, gauge_elev, area, dpsbar, elev_mean, elev_min.*
Temperature Averaged temperature
Humidity Averaged specific humidity
Pet Averaged potential evapotranspiration
Shortwave_rad Averaged downward shortwave radiation
Longwave_rad Averaged longwave radiation
Windspeed Averaged wind speed
CAMELS-CL precip_cr2met Averaged precipitation area, elev_mean, slope_mean, nested_inner, geol_class_1st_frac, geol_class_2nd_frac, crop_frac, nf_frac, fp_frac, grass_frac, shrub_frac, wet_frac, imp_frac, lc_barren, snow_frac, lc_glacier, fp_nf_index, forest_frac, dom_land_cover_frac, land_cover_missing, p_mean_cr2met, pet_mean, aridity_cr2met, p_seasonality_cr2met, frac_snow_cr2met, high_prec_freq_cr2met, high_prec_dur_cr2met, low_prec_freq_cr2met, low_prec_dur_cr2met, big_dam, p_mean_spread, q_mean, runoff_ratio_cr2met, stream_elas_cr2met, slope_fdc, baseflow_index, hfd_mean, Q95, Q5, high_q_freq, high_q_dur, low_q_freq, low_q_dur, zero_q_freq, sur_rights_n, interv_degree. *
tmax Daily maximum temperature
tmin Daily minimum temperature
swe Daily snow water equivalent
pet_8d_modis Potential evapotranspiration obtained from MODIS

*Because of the long lists of attributes from CAMELS, CAMELS-GB, and CAMELS-CL, we refer the readers to their respective publications for explanations of variable names.

(1) Input transformation: $x^{t}=\text{ReLU}\left(W_{I}I^{t}+b_{I}\right)$
(2) Input node: $g^{t}=\tanh\left(\mathcal{D}\left(W_{gx}x^{t}\right)+b_{gx}+\mathcal{D}\left(W_{gh}h^{t-1}\right)+b_{gh}\right)$
(3) Input gate: $i^{t}=\sigma\left(\mathcal{D}\left(W_{ix}x^{t}\right)+b_{ix}+\mathcal{D}\left(W_{ih}h^{t-1}\right)+b_{ih}\right)$
(4) Forget gate: $f^{t}=\sigma\left(\mathcal{D}\left(W_{fx}x^{t}\right)+b_{fx}+\mathcal{D}\left(W_{fh}h^{t-1}\right)+b_{fh}\right)$
(5) Output gate: $o^{t}=\sigma\left(\mathcal{D}\left(W_{ox}x^{t}\right)+b_{ox}+\mathcal{D}\left(W_{oh}h^{t-1}\right)+b_{oh}\right)$
(6) Cell state: $s^{t}=g^{t}\odot i^{t}+s^{t-1}\odot f^{t}$
(7) Hidden state: $h^{t}=\tanh\left(s^{t}\right)\odot o^{t}$
(8) Output: $y^{t}=W_{hy}h^{t}+b_{y}$
where $I^{t}$ represents the raw inputs for the time step $t$, ReLU is the rectified linear unit, $x^{t}$ is the input vector to the LSTM cell, $\mathcal{D}$ is the dropout operator, the $W$’s are network weights, the $b$’s are bias parameters, $\sigma$ is the sigmoidal function, $\odot$ is the element-wise multiplication operator, $g^{t}$ is the output of the input node, $i^{t}$, $f^{t}$, and $o^{t}$ are the input, forget, and output gates, respectively, $h^{t}$ represents the hidden states, $s^{t}$ represents the memory cell states, and $y^{t}$ is the predicted output.
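Equations (1)-(8) can be traced step by step with the following plain-Python sketch of a single time step. It is meant only to mirror the equations, not to reproduce the authors' code: the function names, the dictionary layout of the parameters (keys such as `W_gx` and `b_gh`), and the treatment of the dropout operator as the identity (as at inference time) are all assumptions.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def vec_add(*vectors):
    """Element-wise sum of several equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(I_t, h_prev, s_prev, p):
    """One forward step of Eqs. (1)-(8); dropout is taken as the identity."""
    # (1) Input transformation with ReLU
    x = [max(0.0, z) for z in vec_add(matvec(p["W_I"], I_t), p["b_I"])]

    def gate(name, activation):
        # Shared form of Eqs. (2)-(5): act(W_x x + b_x + W_h h_prev + b_h)
        pre = vec_add(matvec(p["W_" + name + "x"], x), p["b_" + name + "x"],
                      matvec(p["W_" + name + "h"], h_prev), p["b_" + name + "h"])
        return [activation(z) for z in pre]

    g = gate("g", math.tanh)   # (2) input node
    i = gate("i", sigmoid)     # (3) input gate
    f = gate("f", sigmoid)     # (4) forget gate
    o = gate("o", sigmoid)     # (5) output gate
    # (6) Cell state update
    s = [gv * iv + sv * fv for gv, iv, sv, fv in zip(g, i, s_prev, f)]
    # (7) Hidden state
    h = [math.tanh(sv) * ov for sv, ov in zip(s, o)]
    # (8) Linear output layer
    y = vec_add(matvec(p["W_hy"], h), p["b_y"])
    return y, h, s
```

Running a sequence amounts to calling `lstm_step` once per day and passing the returned `h` and `s` to the next call.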
Text S2 Calculation formula of FHV, FLV, and FMV.
The FHV (flow duration curve (FDC) high-segment volume) is calculated as follows:
$\text{FHV}=\frac{\sum_{h=1}^{N}\left(Q_{\text{sim},h}-Q_{\text{obs},h}\right)}{\sum_{h=1}^{N}Q_{\text{obs},h}}\cdot 100$
Table S2 Hyperparameter values (chosen/tested) for all models
Model* Length of training instances LSTM dropout rate Mini-batching size LSTM hidden size Number of training epochs
LSTM DIRB 365/{100, 200, 365} 0.5/{0, 0.3, 0.5} 2/{2,5} 64/{32,64,128} 250/[100,300]
Source model CAMELS 100/{50,100,200} 256/{128,256} 300/[100,500]
CAMELS-GB 128/{64,128,256} 256/{128,256} 300/[100,500]
CAMELS-CL 128/{64,128,256} 256/{128,256} 300/[100,500]
TL model for DIRB TL (US)-optimal 2/{2,5} 256/256 300/[100,300]
TL (GB)-optimal 2/{2,5} 256/256 240/[100,300]
TL (CL)-optimal 2/{2,5} 256/256 180/[100,300]

* All the models were trained as a five-member ensemble with random seeds. For the tested values, square brackets indicate the range of values tested, while curly braces indicate the discrete values that were tested. Dropout rate is the fraction of connections set to 0 by the dropout operator. Hidden size is the size of g, i, f, o and the associated weight matrices in LSTM. Mini-batch size is how many basins are grouped together to calculate the loss function before a gradient update is executed. Within an epoch, enough forward simulations are run to pass through all data points once.

Here, h = 1, 2, …, N corresponds to the indices of flows with exceedance probabilities less than 0.02. For the FLV and FMV computations, the same indices instead run over the flow values within the low-flow segment (0.7-1.0 flow exceedance probabilities) and the mid-flow segment (0.2-0.8 flow exceedance probabilities) of the flow duration curve, respectively.
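A minimal sketch of these segment-wise volume biases is given below. Several choices are assumptions made for illustration: the helper names, the rank-based exceedance probability (rank divided by the number of days), the inclusive segment bounds, sorting simulated and observed flows independently to form their flow duration curves, and applying the same percent-bias form to all three segments because only that one formula is given in the text.

```python
def fdc_segment_bias(sim, obs, p_low, p_high):
    """Percent volume bias over one flow-duration-curve segment.

    Flows are sorted in descending order; the k-th largest of n values gets
    exceedance probability k / n, and values with p_low <= k / n <= p_high
    form the segment.
    """
    if len(sim) != len(obs) or not obs:
        raise ValueError("sim and obs must be non-empty and of equal length")
    n = len(obs)
    sim_sorted = sorted(sim, reverse=True)
    obs_sorted = sorted(obs, reverse=True)
    idx = [k for k in range(n) if p_low <= (k + 1) / n <= p_high]
    if not idx:
        raise ValueError("segment contains no flow values")
    numerator = sum(sim_sorted[k] - obs_sorted[k] for k in idx)
    denominator = sum(obs_sorted[k] for k in idx)
    return 100.0 * numerator / denominator


def fhv(sim, obs):
    """High-segment bias (exceedance probability up to 0.02)."""
    return fdc_segment_bias(sim, obs, 0.0, 0.02)


def fmv(sim, obs):
    """Mid-segment bias (exceedance probability 0.2-0.8)."""
    return fdc_segment_bias(sim, obs, 0.2, 0.8)


def flv(sim, obs):
    """Low-segment bias (exceedance probability 0.7-1.0)."""
    return fdc_segment_bias(sim, obs, 0.7, 1.0)
```

Positive values indicate that the simulated volume in the segment exceeds the observed volume, and negative values indicate underestimation.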
Table S3 Summary of the driving data for the MIKE SHE
Component Data type Source Time period Resolution
Topography DEM ASTER Global Digital Elevation Model v002 NA 30 m
Meteorology Precipitation CHIRPS 1996-2010 0.25°
Temperature ERA5-land 1996-2010 0.01°
Vegetation Land use MCD12Q1 data 2005 500 m
Leaf-area index CLASS 1996-2010 0.05°
Soil Surface and sectional type Harmonized World Soil Database NA 1 km
Table S4 Summary of the selected parameters used for the MIKE SHE
Component Parameter Unit value
Evapotranspiration Coefficient, C1 - 0.3
Coefficient, C2 - 0.2
Coefficient, C3 mm/day 20
Canopy interception mm 0.05
Two-layer water balance ET parameter - 1
Root-density distribution 1/m 0.25
Snowmelt Melting temperature 0
Degree-day coefficient mm/day/℃ 1.5
Rivers and lakes Manning coefficient $\text{m}^{1/3}/\text{s}$ 30
Overland flow Manning coefficient $\text{m}^{1/3}/\text{s}$ 25
Detention storage mm 0
Initial water depth mm 0
Saturated flow Interflow reservoir Specific yield - 0.35
Time constant day 70
Baseflow reservoir 1 Specific yield - 0.3
Time constant day 260
Baseflow reservoir 2 Specific yield - 0.3
Time constant day 850
Figure S1 The architecture of transfer learning (TL) and weight freezing strategies of TL-a, TL-b and TL-c
Figure S2 NSE values for the mean discharge from a five-member ensemble, corresponding to the local LSTM and TLs with different weight freezing strategies in the data deficit scenario (one-year training) and normal training (five-year training)
Alvarez-Garreton C, Mendoza P A, Boisier J P et al., 2018. The CAMELS-CL dataset: Catchment attributes and meteorology for large sample studies - Chile dataset. Hydrology and Earth System Sciences, 22: 5817-5846.

Ault T R, 2020. On the essentials of drought in a changing climate. Science, 368(6488): 256-260.


Bolibar J, Rabatel A, Gouttevin I et al., 2022. Nonlinear sensitivity of glacier mass balance to future climate change unveiled by deep learning. Nature Communications, 13(1): 409.

Borga M, Stoffel M, Marchi L et al., 2014. Hydrogeomorphic response to extreme rainfall in headwater systems: Flash floods and debris flows. Journal of Hydrology, 518: 194-205.

Carvalho D V, Pereira E M, Cardoso J S et al., 2019. Machine learning interpretability: A survey on methods and metrics. Electronics, 8: 832.

Cook C, Bakker K, 2012. Water security: Debating an emerging paradigm. Global Environmental Change, 22: 94-102.

Coxon G, Addor N, Bloomfield J P et al., 2020. Catchment attributes and hydro-meteorological timeseries for 671 catchments across Great Britain (CAMELS-GB). Earth System Science Data, 12: 2459-2483.

Dong X, Chowdhury S, Qian L et al., 2019. Deep learning for named entity recognition on Chinese electronic medical records: Combining deep transfer learning with multitask bi-directional LSTM RNN. PLoS ONE, 14: e0216046.

Fang K, Kifer D, Lawson K et al., 2020. Evaluating the potential and challenges of an uncertainty quantification method for long short-term memory models for soil moisture predictions. Water Resources Research, 56(12): e2020WR028095.

Fang K, Shen C P, Kifer D et al., 2017. Prolongation of SMAP to spatiotemporally seamless coverage of continental U.S. using a deep learning neural network. Geophysical Research Letters, 44: 11030-11039.

Feng D P, Fang K, Shen C P, 2020. Enhancing streamflow forecast and extracting insights using long‐short term memory networks with data integration at continental scales. Water Resources Research, 56(9): e2019WR026793.

Feng D P, Lawson K, Shen C P, 2021. Mitigating prediction error of deep learning streamflow models in large data‐sparse regions with ensemble modeling and soft data. Geophysical Research Letters, 48: e2021GL092999.

Feng D P, Liu J T, Lawson K et al., 2022. Differentiable, learnable, regionalized process-based models with multiphysical outputs can approach state-of-the-art hydrologic prediction accuracy. Water Resources Research, 58(10): e2022WR032404.

Feng Y, He D M, 2009. Transboundary water vulnerability and its drivers in China. Journal of Geographical Sciences, 19(2): 189-199.


Funk C, Peterson P, Landsfeld M et al., 2015. The climate hazards infrared precipitation with stations: A new environmental record for monitoring extremes. Scientific Data, 2: 150066.

Hashemi R, Brigode P, Garambois P A et al., 2022. How can we benefit from regime information to make more effective use of long short-term memory (LSTM) runoff models? Hydrology and Earth System Sciences, 26: 5793-5816.

He D M, Wu R D, Feng Y et al., 2014. REVIEW: China’s transboundary waters: New paradigms for water and ecological security through applied ecology. Journal of Applied Ecology, 51: 1159-1168.

Hochreiter S, Schmidhuber J, 1997. Long short-term memory. Neural Computation, 9: 1735-1780.


Höge M, Scheidegger A, Baity-Jesi M et al., 2022. Improving hydrologic models for predictions and process understanding using neural ODEs. Hydrology and Earth System Sciences, 26: 5085-5102.

Hu X L, Zhou Z, Xiong H B et al., 2023. Inter-comparison of global precipitation data products at the river basin scale. Hydrology Research, nh2023062.

Huang Q, Long D, Du M D et al., 2020. Daily continuous river discharge estimation for ungauged basins using a hydrologic model calibrated by satellite altimetry: Implications for the SWOT mission. Water Resources Research, 56(7): e2020WR027309.

Jahanshahi A, Patil S D, Goharian E, 2022. Identifying most relevant controls on catchment hydrological similarity using model transferability: A comprehensive study in Iran. Journal of Hydrology, 612: 128193.

Ji X, Chen Y F, Jiang W et al., 2022. Glacier area changes in the Nujiang-Salween River Basin over the past 45 years. Journal of Geographical Sciences, 32(6): 1177-1204.


Ji X, Li Y G, Luo X et al., 2020. Evaluation of bias correction methods for APHRODITE data to improve hydrologic simulation in a large Himalayan basin. Atmospheric Research, 242: 104964.

Karpatne A, Atluri G, Faghmous J H et al., 2017. Theory-guided data science: A new paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29: 2318-2331.

Kirchner J W, 2016. Aggregation in environmental systems (Part 1): Seasonal tracer cycles quantify young water fractions, but not mean transit times, in spatially heterogeneous catchments. Hydrology and Earth System Sciences, 20: 279-297.

Klotz D, Kratzert F, Gauch M et al., 2022. Uncertainty estimation with deep learning for rainfall-runoff modeling. Hydrology and Earth System Sciences, 26: 1673-1693.

Kratzert F, Klotz D, Brenner C et al., 2018. Rainfall-runoff modelling using Long Short-Term Memory (LSTM) networks. Hydrology and Earth System Sciences, 22: 6005-6022.

Kratzert F, Klotz D, Hochreiter S et al., 2020. A note on leveraging synergy in multiple meteorological datasets with deep learning for rainfall-runoff modeling. Hydrology and Earth System Sciences, 25: 2685-2703.

Kratzert F, Klotz D, Shalev G et al., 2019. Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrology and Earth System Sciences, 23: 5089-5110.

Kratzert F, Nearing G, Addor N et al., 2023. Caravan: A global community dataset for large-sample hydrology. Scientific Data, 10: 61.

Leng J, Gao M L, Gong H L et al., 2023. Spatio-temporal prediction of regional land subsidence via ConvLSTM. Journal of Geographical Sciences, 33(10): 2131-2156.

Li B, Li R D, Sun T et al., 2023. Improving LSTM hydrological modeling with spatiotemporal deep learning and multi-task learning: A case study of three mountainous areas on the Tibetan Plateau. Journal of Hydrology, 620: 129401.

Li Y G, Zhang Y Y, He D M et al., 2019. Spatial downscaling of the tropical rainfall measuring mission precipitation using geographically weighted regression kriging over the Lancang River Basin, China. Chinese Geographical Science, 29: 446-462.

Linardatos P, Papastefanopoulos V, Kotsiantis S, 2020. Explainable AI: A review of machine learning interpretability methods. Entropy, 23: 18.

Lu D, Konapala G, Painter S L et al., 2021. Streamflow simulation in data-scarce basins using Bayesian and physics-informed machine learning models. Journal of Hydrometeorology, 22: 1421-1438.

Luo X, Fan X M, Li Y G et al., 2020. Bias correction of a gauge-based gridded product to improve extreme precipitation analysis in the Yarlung Tsangpo-Brahmaputra River basin. Natural Hazards and Earth System Sciences, 20: 2243-2254.

Ma K, Feng D P, Lawson K et al., 2021. Transferring hydrologic data across continents: Leveraging data-rich regions to improve hydrologic prediction in data-sparse regions. Water Resources Research, 57(5): e2020WR028600.

National Water Resources Committee (NWRC), 2018. The Ayeyarwady State of the Basin Assessment. SOBA 1.2: Surface Water Resources. National Water Resources Committee.

Newman A, Sampson K, Clark M et al., 2014. A Large-sample Watershed-scale Hydrometeorological Dataset for the Contiguous USA. Boulder.

Pan S J, Yang Q, 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22: 1345-1359.

Rahmani F, Lawson K, Ouyang W Y et al., 2020. Exploring the exceptional performance of a deep learning stream temperature model and the value of streamflow data. Environmental Research Letters, 16: 024025.

Read J S, Jia X W, Willard J et al., 2019. Process-guided deep learning predictions of lake water temperature. Water Resources Research, 55(11): 9173-9190.

Reichert P, Ammann L, Fenicia F, 2021. Potential and challenges of investigating intrinsic uncertainty of hydrological models with stochastic, time-dependent parameters. Water Resources Research, 57(3): e2020WR028400.

Rodriguez-Galiano V, Sanchez-Castillo M, Chica-Olmo M et al., 2015. Machine learning predictive models for mineral prospectivity: An evaluation of neural networks, random forest, regression trees and support vector machines. Ore Geology Reviews, 71: 804-818.

Schewe J, Gosling S N, Reyer C et al., 2019. State-of-the-art global models underestimate impacts from climate extremes. Nature Communications, 10(1): 1005.

Shen C P, 2018. A transdisciplinary review of deep learning research and its relevance for water resources scientists. Water Resources Research, 54(11): 8558-8593.

Shen C P, Appling A P, Gentine P et al., 2023. Differentiable modelling to unify machine learning and physical models for geosciences. Nature Reviews Earth & Environment, 4: 552-567.

Shen C P, Chen X Y, Laloy E, 2021. Editorial: Broadening the use of machine learning in hydrology. Frontiers in Water, 3: 681023.

Shrestha S, Imbulana N, Piman T et al., 2020. Multimodelling approach to the assessment of climate change impacts on hydrology and river morphology in the Chindwin River Basin, Myanmar. CATENA, 188: 104464.

Sun A Y, Scanlon B R, Zhang Z Z et al., 2019. Combining physically based modeling and deep learning for fusing GRACE satellite data: Can we learn from mismatch? Water Resources Research, 55(2): 1179-1195.

Sundararajan M, Taly A, Yan Q Q, 2017. Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia.

The United Nations World Water Development Report 2020: Water and Climate Change, 2020. UNESCO.

The United Nations World Water Development Report 2021: Valuing Water, 2021. UNESCO.

Thrun S, Pratt L, 1998. Learning to Learn. New York: Springer Science & Business Media.

Tsai W P, Feng D P, Pan M et al., 2021. From calibration to parameter learning: Harnessing the scaling effects of big data in geoscientific modeling. Nature Communications, 12(1): 5988.

Wang Y W, Wang L, Li X P et al., 2020. An integration of gauge, satellite, and reanalysis precipitation datasets for the largest river basin of the Tibetan Plateau. Earth System Science Data, 12: 1789-1803.

Yilmaz K K, Gupta H V, Wagener T, 2008. A process-based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model. Water Resources Research, 44(9): W09417.

Zhang K, Luhar M, Brunner M I et al., 2023. Streamflow prediction in poorly gauged watersheds in the United States through data-driven sparse sensing. Water Resources Research, 59(4): e2022WR034092.

Zhao G, Pang B, Xu Z G et al., 2021. Improving urban flood susceptibility mapping using transfer learning. Journal of Hydrology, 602: 126777.

Zhu Y X, Sang Y F, Wang B et al., 2023. Heterogeneity in spatiotemporal variability of high mountain Asia’s runoff and its underlying mechanisms. Water Resources Research, 59(7): e2022WR032721.