The reliability of a machine learning model's prediction for a given input can be assessed by comparing it against the actual output. However, in hydrological studies, machine learning models are often adopted to predict future or unknown events, where the actual outputs are unavailable. The prediction accuracy of a model, which measures its average performance across an observed data set, may not be relevant for a specific input. This study presents a method based on metamorphic testing (MT), adopted from software engineering, to assess prediction reliability when the actual outputs are unknown. In this method, the predictions for a group of related inputs are considered consistent only if the inputs and outputs follow certain relations that are deduced from the properties of the system being modeled. For instance, in a rainfall‐runoff model, the predicted runoff volume should increase as the rainfall magnitude of an input increases. In this study, the MT‐based method was applied to assess the predictions made by various machine learning models that were trained to predict the magnitude of flood events in Germany. Surprisingly, the prediction accuracy of a model and its ability to provide consistent predictions were found to be uncorrelated. This study further investigated the factors influencing the assessment result of a given input, such as its similarity to observed data. Overall, this research shows that MT is an effective and simple method for detecting inconsistent model predictions and is recommended when a model is employed to make predictions under new conditions.

Plain Language Summary: In hydrological studies, machine learning models are often built to learn the statistical correlations between random variables of interest. In these models, predictions are made based on learned statistical correlations, and the involved hydrological processes are not explicitly considered.
Thus, it is difficult to examine the mechanisms that lead to the prediction made for a given input. Metamorphic testing (MT) is a technique to assess the correctness of the predicted output for a given input that does not require knowledge of the actual output. In MT, the relation among a group of related inputs and the corresponding outputs is examined, and the predictions are considered consistent if this relation satisfies certain criteria that reflect the properties of the system being modeled. This study applied MT to assess various machine learning models that predicted the magnitude of flood events in Germany. MT was found to be very effective in detecting inconsistent predictions. Many models failed to capture the positive correlation between precipitation magnitude and the magnitude of flood events, even models with high prediction accuracies. The MT‐based method presented in this study is useful for assessing the reliability of model predictions, especially when the actual outcomes are unknown.

Key Points:
- This study presents a metamorphic testing‐based method to assess the reliability of model predictions when the actual outcomes are unknown
- Specific tests can be designed using hydrological knowledge to check whether a model preserves certain properties of the system being modeled
- Models with high prediction accuracies can nevertheless respond incorrectly when inputs change and thereby produce inconsistent predictions
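The monotonicity relation described above (increasing rainfall should not decrease predicted runoff) can be sketched as a metamorphic test. The following is a minimal illustration, not the study's actual implementation: the stand-in model, feature set, and perturbation size are all hypothetical assumptions.

```python
# Minimal sketch of a metamorphic test for a rainfall-runoff model.
# The model, inputs, and perturbation size below are illustrative
# assumptions, not the setup used in the study.

def predict_runoff(rainfall_mm: float, soil_moisture: float) -> float:
    """Stand-in for a trained ML model's prediction function."""
    return 0.6 * rainfall_mm + 0.3 * soil_moisture

def metamorphic_test_increasing_rainfall(model, rainfall_mm: float,
                                         soil_moisture: float,
                                         delta: float = 10.0) -> bool:
    """Metamorphic relation: increasing rainfall, with all other
    inputs held fixed, should not decrease the predicted runoff.
    Returns True if the source and follow-up predictions are consistent."""
    source_pred = model(rainfall_mm, soil_moisture)
    follow_up_pred = model(rainfall_mm + delta, soil_moisture)
    return follow_up_pred >= source_pred

# A single test case: perturb rainfall upward and check consistency.
consistent = metamorphic_test_increasing_rainfall(predict_runoff, 50.0, 20.0)
print(consistent)  # True for this stand-in model
```

Note that the test needs no observed runoff value: it checks only the relation between a source prediction and a follow-up prediction, which is what makes MT applicable when actual outcomes are unknown.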