Tue 16 Jul 2019 14:30 - 15:00

A comprehensive evaluation is an integral part of modern software engineering research. It is done either with established benchmarks (e.g., DaCapo, the Qualitas corpus, or the XCorpus) or with ad-hoc benchmarks; in a few cases, a specifically created test suite is used. In all cases, the representativeness of the chosen code with respect to the research questions is almost always questionable. The established corpora mentioned above contain a large share of outdated software and, as recent studies have shown, that code is structurally very different from modern Java code as found on, e.g., Maven Central. A second issue of (at least) the established benchmarks is that their usage scenarios are defined only at a very high level of abstraction (e.g., “general software engineering research”), which makes their use in a specific context questionable. Tailored benchmarks and custom test suites are often created without any substantial argument regarding their representativeness, which makes evaluations built on top of them even more questionable. The naive solution, collecting as many projects and as much code as possible, simply doesn’t scale: analyzing an extremely large code base, such as all non-trivial Java projects found on GitHub, is prohibitively expensive even for simple analyses. This immediately leads to the question of how to build (reasonably) representative benchmarks. In this talk, we will discuss the representativeness of benchmarks and then present Hermes, a tool that is a first step towards the creation of representative and minimal benchmarks.

Abhishek Tiwari, Christian Hammer
Lisa Nguyen Quang Do
Michael Eichberg
