A Benchmark for Understanding Data Science Software
This talk will introduce the Boa infrastructure for mining and analyzing large numbers of software repositories at once, along with a new dataset of Python software for Boa. The popularity of the Python programming language has surged in recent years due to its increasing use in Data Science. The availability of Python repositories on GitHub presents an opportunity for mining software repositories research, e.g., suggesting best practices for developing Data Science applications, identifying bug patterns, and recommending code enhancements. To enable this research, we have created a new dataset of 1,558 mature GitHub projects that develop Python software for Data Science tasks. By analyzing metadata and code, we selected projects that use a diverse set of machine learning libraries and are managed by a variety of users and organizations. The dataset is publicly available through the Boa infrastructure, both as a collection of raw projects and in a processed form that can be used for large-scale analysis with the Boa language. We also present two initial applications that demonstrate the potential of the dataset for the community.
Hridesh Rajan is a full professor of Computer Science at Iowa State University, where he has been since 2005. Professor Rajan earned his MS and Ph.D. from the University of Virginia in 2004 and 2005 respectively. Professor Rajan’s recent research and educational activities are aimed at lowering the barrier to entry to data-driven sciences in order to broaden participation. His work on the Boa project is aimed at the invention and refinement of programming languages and cyberinfrastructures that democratize data-driven science and engineering, including software engineering. His work on the Midwest Big Data Summer School is experimenting with broadly accessible data science curricula. Professor Rajan was the founding general chair of the Midwest Big Data Summer School. Professor Rajan’s research interests also include programming language design and implementation, and software engineering. He leads two research projects: Panini, whose goal is to enable modular reasoning about concurrent programs, and Boa, which was established in Summer 2012 as an end-to-end infrastructure for analyzing large-scale software repositories and other open data sets. Professor Rajan is the director of the Laboratory for Software Design at Iowa State University, director of graduate admissions and recruitment for the Department of Computer Science, Professor-In-Charge of the Data Science education programs at Iowa State University, and chair of the information technology committee for the university. Professor Rajan serves on the steering committee of the Midwest Big Data Hub, a consortium of universities in the Midwest region of the United States focused on promoting data science activities. Professor Rajan is a recipient of the National Science Foundation CAREER award in 2009, the LAS Award for Early Achievement in Research in 2010, and a Big-12 Fellowship in 2012. He is a 2018-19 Fulbright U.S. Scholar, a distinguished member of the ACM, and a member of IEEE and AAAS.
He is also the inaugural holder of the Kingland Professorship in the Department of Computer Science.
Tue 16 Jul (displayed time zone: Belfast)
15:30 - 17:00
A Benchmark for Understanding Data Science Software
Hridesh Rajan, Iowa State University
Android Taint-Analysis Benchmarks: Past, Present and Future
Felix Pauck, Paderborn University, Germany
Discussion and Closing