IEEE International Workshop on Data Science Systems (DSS)

Data Science has subsumed Big Data Analytics as an interdisciplinary endeavor, where the analyst uses diverse programming languages, libraries and tools to integrate, explore and build mathematical models on data, in a broad sense. Nowadays, there is a trend toward building systems and novel approaches that enable analysis on practically any kind of data: big or small, structured or unstructured, static or streaming, and so on. Data science has become the umbrella discipline for analyzing data, ranging from detecting data quality problems, through significant time spent on data pre-processing, to building sophisticated machine learning models such as deep neural networks.

In the Data Science Systems (DSS) workshop we welcome interdisciplinary research mixing programming languages, machine learning, database systems and high-performance computing. The DSS workshop will feature "systems" research that enables data science not only on big data (with large-scale parallel processing), but also on "medium scale" data (a powerful workstation with multicore CPUs). Specifically, we welcome papers that present algorithms, data structures, functions, language extensions, and optimizations that work well in modern data science languages, especially Python, R and SQL. It is fair to say that modern analysts can trade off some performance for ease of programming, ease of use or flexibility. Advances in hardware are making the cloud more attractive for data pre-processing, while a local machine remains preferred for number crunching.

Topics of interest include, but are not limited to:

  • Data quality diagnosis and repair, which can be tweaked and customized in DSS languages
  • Interoperability of diverse data pre-processing programs, working with different file formats
  • Querying relational and non-relational data, but outside database systems
  • Splitting processing between DSS languages and Big Data systems (e.g. an R or Python runtime alone vs. PySpark on a cluster)
  • Extending data science languages with new operators and functions (e.g. in the style of numpy)
  • Accelerating ML algorithms (statistical summarization, stochastic gradient descent)
  • Enabling database query functionality in data science languages (e.g. optimizing Pandas code)
  • Cross-language optimization (e.g. optimizing R bottlenecks with C/C++ code)
  • Splitting processing between data science languages and database languages (e.g. Python and SQL)
  • Novel parallel data processing architectures (e.g. combining parallel DBMSs, Hadoop and other distributed architectures)
  • Exploiting new-generation file systems beyond HDFS (HPC file systems)
  • Flexible, fast, well-defined interfaces to exchange big data (e.g. CSV, JSON files)
  • Exploiting HPC libraries like LAPACK and MKL in data science languages
  • Web interfaces for complex processing pipelines (e.g. JavaScript GUI, calling Python)
  • Benchmarks, understanding tradeoffs between time performance and ease of use (i.e. the fastest is not necessarily the best alternative)
  • Case studies, presenting technical details of a library or program that can be used by data analysts across several specialties (i.e. how it was programmed and deployed in some OS)
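As a minimal sketch of one of the topics above, splitting processing between a data science language and a database language, the following example pushes an aggregation down to SQL while keeping flexible post-processing in Python. It uses only Python's built-in sqlite3 module; the table, columns and data are hypothetical placeholders:

```python
import sqlite3

# Hypothetical sales data; in practice this would live in a DBMS.
rows = [("north", 120.0), ("south", 80.0), ("north", 200.0), ("south", 50.0)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Heavy aggregation is pushed down to the SQL engine ...
totals = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# ... while flexible post-processing stays in the host language.
grand_total = sum(totals.values())
share = {region: amount / grand_total for region, amount in totals.items()}
con.close()
```

The design point this split illustrates is that the GROUP BY reduces the data before it crosses the language boundary, so only small aggregates, not raw rows, reach the Python side.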