25 AUG 2016: Big Data Management: Theory & Practice – Hands-on Workshop

Professor Christoph Quix
Senior Researcher
Fraunhofer Institute for Applied Information Technology, Germany

The idea of data lakes has been introduced to address the problem of integrating heterogeneous information in big data applications. Data lakes collect data from heterogeneous sources in their original format and perform only a shallow integration on the syntactic level. The semantic integration of the data is left to the user, who can integrate data by using a unified query interface. Data quality is a challenge in data lakes because data is copied 'as-is' from the sources; thus, data might be incorrect, inconsistent, or difficult to interpret when the corresponding metadata is missing. At RWTH Aachen University and the Fraunhofer Institute for Applied Information Technology (FIT), we are currently developing a data lake system in which metadata and data quality management govern the data ingestion process and thereby prevent the data lake from turning into a data swamp. The quality of incoming data is continuously monitored, and if a new data source is, for example, insufficiently described by metadata, countermeasures such as a more detailed metadata extraction or metadata matching can be enabled. The hands-on workshop will give an overview of big data and current trends, a hands-on introduction to Apache Spark, a discussion of issues in big data applications, and a hands-on session on data integration.
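
The idea of letting metadata quality govern ingestion can be illustrated with a minimal Python sketch. It is not the system developed at RWTH Aachen and Fraunhofer FIT: the required metadata fields, the completeness score, the 0.75 threshold, and the action names are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical set of metadata fields a source is expected to provide.
REQUIRED_FIELDS = ("schema", "owner", "format", "description")


@dataclass
class SourceMetadata:
    """Metadata that accompanies a data source at ingestion time."""
    name: str
    fields: dict = field(default_factory=dict)


def metadata_completeness(meta: SourceMetadata) -> float:
    """Fraction of required metadata fields that are present and non-empty."""
    present = sum(1 for key in REQUIRED_FIELDS if meta.fields.get(key))
    return present / len(REQUIRED_FIELDS)


def ingestion_actions(meta: SourceMetadata, threshold: float = 0.75) -> list:
    """Plan the ingestion steps for one source.

    The raw copy always happens; the extra steps (hypothetical names) are
    added only when the source is insufficiently described by metadata.
    """
    actions = ["copy_raw"]
    if metadata_completeness(meta) < threshold:
        actions += ["extract_detailed_metadata", "match_metadata"]
    return actions


well_described = SourceMetadata(
    "sensor_feed",
    {"schema": "id,ts,value", "owner": "lab", "format": "csv", "description": "Sensor readings"},
)
poorly_described = SourceMetadata("legacy_dump", {"format": "csv"})

print(ingestion_actions(well_described))
print(ingestion_actions(poorly_described))
```

In this sketch a source with complete metadata is only copied, while a source that falls below the threshold additionally triggers metadata extraction and matching. A real system would monitor many more quality dimensions than field completeness.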

Short CV: Christoph Quix is a senior researcher in the Life Science Informatics group at the Fraunhofer Institute for Applied Information Technology (FIT) in St. Augustin, Germany, where he leads the department for High Content Analysis. Earlier, he was an assistant professor in the Information Systems Group (Informatik 5) of RWTH Aachen University, Germany, where he received his Ph.D. degree in computer science and completed his habilitation in early 2013. His research focuses on data integration, big data, management of heterogeneous data, metadata management, and semantic web technologies. He has about 80 publications in scientific journals and international conferences. He has been involved in several national and international research projects conducted in cooperation with research and industry partners. He was PC chair of CAiSE 2014, a member of the PC for several major conferences on databases and data modeling (e.g., ER, ICDE, and ODBASE), and the organizing chair of several international workshops.

Tentative Program
Hands-on Workshop
Big Data Management: Theory & Practice

25th August 2016 – Thursday
Venue: Faculty of Computing, UTM Johor Bahru, Malaysia
8:00 – 8:30 am	Registration
8:30 – 10:00 am	Session 1: Introduction. Explaining Big Data; Current Trends; Research Challenges; Big Data Systems: Hadoop, Apache Spark & Co
10:00 – 10:30 am	Morning Break
10:30 am – 12:30 pm	Session 2: Hands-On Part 1: Apache Spark. Setting up a simple data processing workflow in Spark
12:30 – 2:00 pm	Lunch
2:00 – 3:30 pm	Session 3: Important Issues in Big Data Applications. Not just Volume: Variety; Data Integration & Metadata Management
3:30 – 5:00 pm	Session 4: Hands-On Part 2: Data Integration. Defining Data Integration Workflows; Combining data from heterogeneous data sources
5:00 – 5:30 pm	Closing