Background: Many questionnaires assessing depressive symptoms are available. Most were constructed using classical test theory, which makes it difficult to compare individual scores across instruments. Item response theory (IRT) allows scores from different instruments to be compared. This study evaluated how IRT-based cross-calibration methods affect the results of a treatment outcome study, using 2 instruments.

Methods: We analyzed data collected at admission and discharge from 1066 inpatients in 2 psychosomatic clinics using different depression measures. To achieve comparability across the applied depression measures, we used an IRT-based conversion table to transform scores from one instrument’s scale to the other. Latent trait values were also estimated using a different instrument in each clinic. We compared these methods to the traditional approach of using the same instrument in both clinics and examined their effects on the statistical analyses.

Results: The interpretation of the study results did not change substantially when different instruments were used. However, F values, P values, and effect sizes in the analysis of variance changed significantly. This might be attributed to differences in the content or measurement properties of the instruments. Interestingly, no difference was observed between the use of transformed sum scores and the use of latent trait values.

Conclusions: IRT cross-calibration methods are a convenient way to enhance the comparability of questionnaire data in applied clinical settings, but they do not appear to overcome differences in the measurement properties of the instruments. Because these differences can lead to biased results, further research into more advanced techniques is needed.
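
To illustrate the idea behind an IRT-based conversion table, the following minimal Python sketch maps sum scores of one instrument to a common latent trait and then to the expected score on a second instrument. It is only an illustration under simplifying assumptions and not the procedure used in the study: the items are treated as dichotomous and modeled with a two-parameter logistic (2PL) model, whereas depression questionnaires typically use polytomous items, and the item parameters (`ITEMS_A`, `ITEMS_B`) and all function names are hypothetical.

```python
import math

# Hypothetical 2PL item parameters (discrimination a, difficulty b) on a common
# latent scale. Illustrative values only, NOT taken from the study.
ITEMS_A = [(1.2, -1.0), (1.0, 0.0), (1.5, 0.5), (0.8, 1.0)]
ITEMS_B = [(1.1, -0.5), (1.3, 0.2), (0.9, 0.8)]


def expected_score(theta, items):
    """Test characteristic curve: expected sum score at latent trait value theta."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in items)


def theta_for_score(score, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Invert the (monotonically increasing) test characteristic curve by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def build_conversion_table(items_from, items_to):
    """Map each interior sum score to (theta, expected score on the other instrument)."""
    table = {}
    # The minimum and maximum sum scores have no finite theta and are skipped here.
    for score in range(1, len(items_from)):
        theta = theta_for_score(score, items_from)
        table[score] = (theta, expected_score(theta, items_to))
    return table


table = build_conversion_table(ITEMS_A, ITEMS_B)
for score, (theta, converted) in table.items():
    print(f"A sum score {score}: theta = {theta:.2f}, B-scale score = {converted:.2f}")
```

In this sketch, the latent trait value `theta` plays the role of the common metric: either it is analyzed directly, or it is mapped back to the sum-score scale of the other instrument, which corresponds to the two approaches compared in the abstract.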