
Seminar information: Statistical Learning for Imbalanced Data Sets (summer semester 2021)

Lecturer: Prof. Dr. Angelika Rohde

A: Dario Kieffer, M.Sc.

Time, place: Tue 14-16, online

Preliminary meeting: Mon, 1 February 2021, 14:00, BBB room vRohde




The seminar talks may be given in either German or English.

  • Monday, 1 February 2021, 14:00: Preliminary meeting and assignment of topics.
  • Wednesday, 21 April 2021: Change to the talk dates.
  • Monday, 7 June 2021: The talk on 8 June has been moved to 15:00.
  • Friday, 2 July 2021: The talk scheduled for 6 July has been moved to 13 July.



  • Tuesday, 8 June 2021, 15:00: Logistic Regression for Massive Data with Rare Events, based on Wang, H. (2020).

Slides: pdf.

  • Tuesday, 22 June 2021, 14:00: More Efficient Estimation for Logistic Regression with Optimal Subsamples, based on Wang, H. (2019).

Slides: pdf.     Write-up: pdf.

  • Tuesday, 29 June 2021, 14:00: Local Case-Control Sampling, based on Fithian, W. & Hastie, T. (2014).

Slides: pdf.     Proofs: pdf.

  • Tuesday, 13 July 2021, 14:00: Bias Correction in Maximum Likelihood Logistic Regression, based on Schaefer, R. (1983).

Slides: pdf.





Imbalanced data sets are known to significantly reduce the performance of classifiers in statistical learning: learning algorithms designed for balanced classes tend to be biased towards the majority class. This is a problem because the minority class is typically the more important one, so misclassifying a minority class example is usually more costly than misclassifying a majority class example. Besides efforts to improve the data mining process, the strategies to overcome this deficiency are either synthetic oversampling, where synthetic minority class examples are generated, or subsampling from the majority class to reduce the number of majority class examples. The latter has the appealing property of reducing the computational cost; however, it may result in a loss of statistical efficiency, as valuable information is discarded. So far, there is no general guidance on when to use each technique.
In this seminar, we shall gain some insight into this important problem by studying a combination of rather theoretical and more applied statistical literature.
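
As a small illustration of subsampling from the majority class, the following Python sketch keeps every minority example and each majority example with a fixed probability. It is not taken from any of the seminar papers. The function names, the 2% event rate and the 5% keep probability are assumptions chosen for the example. It uses an intercept-only logistic model, whose maximum likelihood intercept is the empirical log-odds, and applies the standard case-control correction: adding the logarithm of the keep probability to the subsample intercept.

```python
import math
import random


def undersample_majority(labels, keep_prob, rng):
    """Return the indices kept: every minority (label 1) example,
    and each majority (label 0) example with probability keep_prob."""
    return [i for i, y in enumerate(labels) if y == 1 or rng.random() < keep_prob]


def corrected_intercept(intercept_sub, keep_prob):
    """Shift an intercept estimated on the subsample back to the full-data scale.

    Subsampling only the majority class multiplies the odds of the minority
    class by 1 / keep_prob, so the log-odds grow by -log(keep_prob).
    """
    return intercept_sub + math.log(keep_prob)


def log_odds(labels):
    """Empirical log-odds of the minority class, i.e. the maximum likelihood
    intercept of an intercept-only logistic regression."""
    positives = sum(labels)
    return math.log(positives / (len(labels) - positives))


rng = random.Random(0)

# Illustrative imbalanced data: roughly 2% of the labels are 1.
labels = [1 if rng.random() < 0.02 else 0 for _ in range(100_000)]

keep_prob = 0.05  # assumed sampling rate for the majority class
kept = undersample_majority(labels, keep_prob, rng)
sub_labels = [labels[i] for i in kept]

full = log_odds(labels)
sub = log_odds(sub_labels)
recovered = corrected_intercept(sub, keep_prob)

print(f"full data:  n = {len(labels)}, log-odds = {full:.3f}")
print(f"subsample:  n = {len(sub_labels)}, log-odds = {sub:.3f}")
print(f"corrected subsample log-odds = {recovered:.3f}")
```

The subsample is far smaller than the full data set, which is where the computational saving comes from. The corrected intercept only approximates the full-data value, because the discarded majority examples carried information. This is the efficiency loss mentioned above.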




  • Required prior knowledge: analysis and probability theory
  • Useful prior knowledge: fundamentals of statistics



  • Schaefer, R. (1983). Bias Correction in Maximum Likelihood Logistic Regression. Statistics in Medicine 2, 71-78.
  • Fithian, W. & Hastie, T. (2014). Local Case-Control Sampling: Efficient Subsampling in Imbalanced Data Sets. The Annals of Statistics 42, 1693-1724. pdf
  • Wang, H. (2019). More Efficient Estimation for Logistic Regression with Optimal Subsamples. Journal of Machine Learning Research 20, 1-59. pdf
  • Wang, H. (2020). Logistic Regression for Massive Data with Rare Events. pdf
  • Han, L. & Tan, K. M. & Yang, T. & Zhang, T. (2020). Local Uncertainty Sampling for Large-Scale Multiclass Logistic Regression. The Annals of Statistics 48, 1770-1788. pdf


Supplementary literature on statistical learning

  • Vapnik, V. (1999). An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks 10, 988-999. pdf
  • Bousquet, O. & Boucheron, S. & Lugosi, G. (2004). Introduction to Statistical Learning Theory. Advanced Lectures on Machine Learning, 169-207. pdf
  • Hastie, T. & Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. pdf
  • Dümbgen, L. (2017). Empirische Prozesse. Lecture notes, University of Bern. pdf