Scope

Nowadays, AI has become the ultimate goal of almost any analytic project. At the same time, there is a new trend of building systems and novel approaches that enable analysis on practically any kind of data: big or small, unstructured or structured, static or streaming, and so on (the new generation of Big Data). Data science is the interdisciplinary approach to analyzing such data, going from detecting data quality problems, through significant time spent on data pre-processing, to computing a sophisticated Machine Learning model (commonly a neural network).

In this workshop we welcome research that focuses on data aspects before an ML model is computed. That is, we seek contributions on data pre-processing, data models, managing diverse file formats, standard data formats, and so on. We also welcome traditional "analytic" papers, provided they elaborate on how data sets were cleaned, pre-processed, integrated, and transformed before computing ML models.

Topics
  • Data integration from diverse sources
  • Creating input vectors or matrices for an ML model
  • Detecting and solving data quality issues to improve model quality
  • Interoperability of diverse data pre-processing programs, working with different file formats
  • Splitting processing between languages and Big Data systems (e.g. an R or Python runtime alone versus PySpark on a cluster)
  • Extending data science languages with new operators and functions (e.g. as in numpy or pytorch)
  • Accelerating ML algorithms (statistical summarization, stochastic gradient descent)
  • Enabling database query functionality in data science languages (e.g. optimizing Pandas code, as sketched after this list)
  • Cross-language optimization (e.g. optimizing R bottlenecks with C/C++ code)
  • Splitting processing between data science languages and database languages (e.g. Python and SQL, as sketched after this list)
  • Novel parallel data processing architectures (e.g. combining parallel DBMSs, Hadoop and other distributed architectures)
  • Exploiting new-generation file systems beyond HDFS (HPC file systems)
  • Flexible, fast, well-defined interfaces to exchange big data (e.g. CSV, JSON files)
  • Exploiting HPC libraries like LAPACK and MKL in data science languages
  • Web interfaces for complex processing pipelines (e.g. a JavaScript GUI calling Python)
  • Benchmarks and studies of the tradeoffs between runtime performance and ease of use (i.e. the fastest alternative is not necessarily the best)
  • Case studies presenting the technical details of a library or program that can be used by data analysts across several specialties (i.e. how it was programmed and deployed on a given operating system)
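
As an illustration of the Pandas optimization topic above, here is a minimal sketch of rewriting a slow row-by-row loop as a single vectorized expression; the DataFrame is synthetic, purely for illustration:

    # Minimal sketch: replace an interpreted row-by-row loop with one
    # vectorized Pandas expression. The DataFrame here is synthetic.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"amount": np.random.rand(1_000_000),
                       "discount": np.random.rand(1_000_000)})

    # Slow: interpreted Python loop over rows.
    # total = sum(r.amount * (1 - r.discount) for r in df.itertuples())

    # Fast: one vectorized expression, evaluated in compiled code.
    total = (df["amount"] * (1 - df["discount"])).sum()
    print(total)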
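
Similarly, the Python/SQL processing split can be sketched as pushing a heavy aggregation into the database and finishing the lighter per-row computation in Pandas; the database, table, and column names below (example.db, sales, customer_id, amount) are hypothetical:

    # Minimal sketch: let the database do the scan and group-by,
    # returning a small summary that Pandas can post-process cheaply.
    # All table and column names are hypothetical.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("example.db")  # any DB-API connection works

    query = """
        SELECT customer_id,
               COUNT(*)    AS n_orders,
               SUM(amount) AS total_amount
        FROM sales
        GROUP BY customer_id
    """
    summary = pd.read_sql_query(query, conn)

    # Finish in Pandas: per-row math on the small summary result.
    summary["avg_amount"] = summary["total_amount"] / summary["n_orders"]
    print(summary.head())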