SciDB: An Open-Source DBMS for Scientific Data

by Michael Stonebraker

SciDB is a native array DBMS that combines data management and mathematical operations in a single system. It is an open-source system that can be downloaded from SciDB.org.

SciDB is an open-source DBMS oriented toward the data management needs of scientists. As such, it combines statistical and linear algebra operations with data management operations, using a natural, nested, multi-dimensional array data model. We have been working on the code for three years, most recently with the help of venture capital backing. Currently, 14 full-time professionals are working on the code base.

SciDB runs on Linux and manages data that can be spread over multiple nodes in a computer cluster, connected by TCP/IP networking. Data is stored in the Linux file system on local disks connected to each node. Hence, it uses a “shared nothing” software architecture.

The data model supported by SciDB is the multi-dimensional array, where each cell can contain a vector of values. Moreover, dimensions can be either the standard integer ones or user-defined data types with non-integer values, such as latitude and longitude. There is no requirement that arrays be rectangular; hence SciDB supports “ragged” arrays.

Access is provided through an array version of SQL, which we term AQL. AQL provides facilities for filtering arrays, joining arrays and aggregating over the cell values in an array. Moreover, Postgres-style user-defined scalar functions, as well as array functions, are provided.
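
To make the data model concrete, the following is a minimal Python sketch of a ragged two-dimensional array whose cells hold several values, together with a filter and an aggregation over one dimension. It is not AQL and not SciDB's API: the names `cells`, `filter_cells` and `aggregate_by_dim`, and the attributes `flux` and `err`, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy array: a dict keyed by (row, col) coordinates.
# Each cell holds several attribute values; coordinates that are
# absent from the dict model the empty cells of a ragged array.
cells = {
    (0, 0): {"flux": 1.5, "err": 0.1},
    (0, 1): {"flux": 2.5, "err": 0.2},
    (1, 0): {"flux": 4.0, "err": 0.4},
    # row 1 has no column 1, and row 2 has only column 2
    (2, 2): {"flux": 8.0, "err": 0.3},
}


def filter_cells(array, predicate):
    """Keep only the cells for which predicate(coord, values) is true."""
    return {coord: v for coord, v in array.items() if predicate(coord, v)}


def aggregate_by_dim(array, dim, attr, func):
    """Group cells by one dimension and apply func to one attribute."""
    groups = defaultdict(list)
    for coord, values in array.items():
        groups[coord[dim]].append(values[attr])
    return {key: func(vals) for key, vals in sorted(groups.items())}


# Filter: cells whose flux exceeds 2.0.
bright = filter_cells(cells, lambda coord, v: v["flux"] > 2.0)

# Aggregate: total flux per row (dimension 0).
row_sums = aggregate_by_dim(cells, 0, "flux", sum)
```

Here `bright` keeps three of the four cells, and `row_sums` holds one total per row that actually contains data.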

In addition, SciDB contains pre-built popular mathematical functions, such as matrix multiply, that operate in parallel on multiple cores on a single node as well as across nodes in a cluster.
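
As an illustration of how a matrix multiply can be decomposed into independent pieces of work, the sketch below splits the left-hand matrix into bands of rows and multiplies each band separately. This is an assumed, simplified decomposition and not SciDB's implementation; `multiply_rows` and `parallel_matmul` are hypothetical names. Python threads are used only to show the structure, since they do not give true multi-core speed-up for pure Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a_rows, b):
    """Multiply a band of rows of A by the full matrix B."""
    cols = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def parallel_matmul(a, b, workers=2):
    """Split A into row bands and multiply each band independently."""
    band = max(1, -(-len(a) // workers))  # ceiling division
    bands = [a[i:i + band] for i in range(0, len(a), band)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves band order, so rows come back in order
        parts = list(pool.map(multiply_rows, bands, [b] * len(bands)))
    return [row for part in parts for row in part]


a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]
product = parallel_matmul(a, b)
```

Because each band needs only its own rows of the left matrix plus the right matrix, the same split could in principle be spread over cores or over nodes.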

Other notable features of SciDB include a no-overwrite storage manager that retains old values of updated data, and provides Postgres-style “time travel” on the various versions of a cell. Moreover, we have extended SciDB with support for multiple notions of “null”. Using this capability, users can distinguish multiple semantic notions, such as “data is missing but it is supposed to be there” and “data is missing and will be present within 24 hours”. Standard ACID transactions are supported, as is an interface to the statistical package R, which can be used to run existing R scripts as well as to visualize the result of SciDB queries.
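
The no-overwrite idea and the multiple notions of “null” can be sketched in a few lines of Python. This is a toy model under stated assumptions, not SciDB's storage manager: the class `VersionedArray`, the enum `Missing` and the `as_of` parameter are hypothetical names introduced for illustration.

```python
from enum import Enum


class Missing(Enum):
    """Two distinct reasons for a missing value (illustrative only)."""
    EXPECTED_BUT_ABSENT = "data is missing but it is supposed to be there"
    ARRIVING_WITHIN_24H = "data is missing and will be present within 24 hours"


class VersionedArray:
    """Toy no-overwrite store: every write appends a new version."""

    def __init__(self):
        self._history = {}  # coord -> list of (version, value), oldest first
        self._version = 0

    def write(self, coord, value):
        """Record a new value without discarding older ones."""
        self._version += 1
        self._history.setdefault(coord, []).append((self._version, value))
        return self._version

    def read(self, coord, as_of=None):
        """Return the latest value, or the value visible at version as_of."""
        visible = [value for version, value in self._history.get(coord, [])
                   if as_of is None or version <= as_of]
        if not visible:
            raise KeyError(coord)
        return visible[-1]


arr = VersionedArray()
v1 = arr.write((3, 4), Missing.ARRIVING_WITHIN_24H)  # placeholder null
v2 = arr.write((3, 4), 17.2)                         # the real value arrives

latest = arr.read((3, 4))             # current value of the cell
earlier = arr.read((3, 4), as_of=v1)  # "time travel" to the first version
```

Reading with `as_of=v1` returns the earlier placeholder rather than the measurement, which is the essence of time travel over cell versions.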

Our storage manager divides arrays, which can be arbitrarily large, into storage “chunks”, which are partitioned across the nodes of a cluster and then allocated to disk blocks. Chunks that are worth keeping close at hand are also cached in main memory for faster access.
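
The mapping from a cell to a chunk, and from a chunk to a node, can be sketched as follows. The fixed chunk shape, the node count and the placement rule are assumptions made for this example; the article does not say how SciDB sizes or places chunks, and `chunk_of` and `node_of` are hypothetical names.

```python
CHUNK_SHAPE = (1000, 1000)  # assumed number of cells per chunk in each dimension
NUM_NODES = 4               # assumed cluster size


def chunk_of(coord, chunk_shape=CHUNK_SHAPE):
    """Return the id of the chunk that contains the given cell."""
    return tuple(c // s for c, s in zip(coord, chunk_shape))


def node_of(chunk_id, num_nodes=NUM_NODES):
    """Assign a chunk to a node with a simple deterministic rule."""
    return sum(chunk_id) % num_nodes


cell = (2500, 999)
chunk = chunk_of(cell)  # which chunk holds the cell
node = node_of(chunk)   # which node stores that chunk
```

With these assumed settings, neighbouring cells fall into the same chunk and therefore onto the same node, while different chunks are spread over the cluster.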

We have benchmarked SciDB against Postgres on an astronomy-style workload that typifies the load generated by the Large Synoptic Survey Telescope (LSST) project. On this benchmark, SciDB outperforms Postgres by two orders of magnitude. We have also benchmarked SciDB analytics against those in R. On a single core, we offer comparable performance; however, SciDB scales linearly with additional cores and additional nodes, a characteristic that does not apply to R.

Early users of SciDB include the LSST project mentioned above, multiple high-energy physics (HEP) projects, as well as commercial applications in genomics, insurance and financial services. SciDB has been downloaded by about 1000 users from a variety of scientific and commercial domains.

A fairly robust and performant version of the system is currently downloadable from our web site (SciDB.org). We plan a production-ready release of SciDB within the next six months. The system is supported by Paradigm4, Inc., a venture-capital-backed company in Waltham, Massachusetts, which provides application consulting as well as a collection of enterprise-oriented extensions to SciDB.

Link: http://www.scidb.org

Please contact:
Marilyn Matz
CEO, Paradigm4, Inc., Waltham, MA, USA
