Scope

Nowadays, AI has become the ultimate goal of almost any analytic project. At the same time, there is a new trend of building systems and novel approaches that enable analysis on practically any kind of data: big or small, unstructured or structured, static or streaming, and so on (the new generation of Big Data). Data science is the interdisciplinary approach to analyzing such data, going from detecting data quality problems, through significant time spent on data pre-processing, to computing a sophisticated Machine Learning model (commonly a neural network).

In this workshop we welcome research that focuses on data aspects before an ML model is computed. That is, we seek contributions on data pre-processing, data models, managing diverse file formats, standard data formats, and so on. We also welcome traditional "analytic" papers, provided they elaborate on how data sets were cleaned, pre-processed, integrated, and transformed before computing ML models.

Topics
  • Data integration from diverse sources
  • Creating input vectors or matrices for an ML model
  • Detecting and solving data quality issues to improve model quality
  • Interoperability of diverse data pre-processing programs, working with different file formats
  • Splitting processing between languages and Big Data systems (e.g. an R or Python runtime alone versus PySpark on a cluster)
  • Extending data science languages with new operators and functions (e.g. as in numpy or pytorch)
  • Accelerating ML algorithms (statistical summarization, stochastic gradient descent)
  • Enabling database query functionality in data science languages (e.g. optimizing Pandas code, as sketched after this list)
  • Cross-language optimization (e.g. optimizing R bottlenecks with C/C++ code)
  • Splitting processing between data science languages and database languages (e.g. Python and SQL, as sketched after this list)
  • Novel parallel data processing architectures (e.g. combining parallel DBMSs, Hadoop and other distributed architectures)
  • Exploiting new-generation file systems beyond HDFS (HPC file systems)
  • Flexible, fast, well-defined interfaces to exchange big data (e.g. CSV, JSON files)
  • Exploiting HPC libraries like LAPACK and MKL in data science languages
  • Web interfaces for complex processing pipelines (e.g. a JavaScript GUI calling Python)
  • Benchmarks and studies of the tradeoffs between runtime performance and ease of use (i.e. the fastest alternative is not necessarily the best)
  • Case studies presenting the technical details of a library or program that can be used by data analysts across several specialties (i.e. how it was programmed and deployed on a given operating system)
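
As an illustration of the Pandas optimization topic above, here is a minimal sketch of rewriting a slow row-by-row loop as a single vectorized expression; the DataFrame is synthetic, purely for illustration:

    # Minimal sketch: replace an interpreted row-by-row loop with one
    # vectorized Pandas expression. The DataFrame here is synthetic.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"amount": np.random.rand(1_000_000),
                       "discount": np.random.rand(1_000_000)})

    # Slow: interpreted Python loop over rows.
    # total = sum(r.amount * (1 - r.discount) for r in df.itertuples())

    # Fast: one vectorized expression, evaluated in compiled code.
    total = (df["amount"] * (1 - df["discount"])).sum()
    print(total)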
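
Similarly, the Python/SQL processing split can be sketched as pushing a heavy aggregation into the database and finishing the lighter per-row computation in Pandas; the database, table, and column names below (example.db, sales, customer_id, amount) are hypothetical:

    # Minimal sketch: let the database do the scan and group-by,
    # returning a small summary that Pandas can post-process cheaply.
    # All table and column names are hypothetical.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("example.db")  # any DB-API connection works

    query = """
        SELECT customer_id,
               COUNT(*)    AS n_orders,
               SUM(amount) AS total_amount
        FROM sales
        GROUP BY customer_id
    """
    summary = pd.read_sql_query(query, conn)

    # Finish in Pandas: per-row math on the small summary result.
    summary["avg_amount"] = summary["total_amount"] / summary["n_orders"]
    print(summary.head())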