Sparksoniq 0.9.5 "Larch" (Querying JSON over Spark)
March 6, 2019
Submitted by Ghislain Fourny.
We are happy to announce the latest alpha release of Sparksoniq.
Sparksoniq runs JSONiq queries on top of Spark, reading JSON datasets stored on distributed file systems such as, but not limited to, HDFS. Its goal is to increase productivity when querying heterogeneous, nested datasets that are challenging to handle with DataFrames.
JSONiq is the JSON brother of XQuery (XQuery - XML + JSON) and shares 90% of its DNA.
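As a taste of the language (a minimal illustration, not taken from the release itself), a JSONiq query is an expression that can construct JSON values directly:

```
{ "answer" : 6 * 7 }  (: builds the object { "answer" : 42 } :)
```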
Sparksoniq is open source (Apache 2.0) and can be downloaded for free. The jar as well as the documentation can be found at http://sparksoniq.org/.
Since the announcement of our initial prototype last year, we have made the following progress:
- Many bugfixes following user feedback. Sparksoniq is becoming stable enough that we are considering a beta release soon, and it has already been used in large classrooms.
- All FLWOR clauses are supported both in parallel and (new) locally. Local execution means that the query runs without invoking Spark transformations, i.e., without parallelize() or json-file() calls.
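As a sketch of a purely local FLWOR (the inline data is hypothetical; no Spark job is involved):

```
let $orders := ({"amount": 30}, {"amount": 10}, {"amount": 20})
for $o in $orders
order by $o.amount
return $o.amount
(: expected result: 10 20 30 :)
```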
- FLWOR expressions can fully nest, with the only exception that those that run in parallel cannot nest with each other (because Spark jobs do not nest).
for $i in json-file("hdfs://path/to/orders.json")
(: this will be executed in parallel on that large file, split along HDFS blocks :)
where $i.customer eq "John Smith"
return {
  "sorted-items" : [
    for $j in $i.items
    order by $j.amount
    return $j
  ]
}
- Navigation such as count(json-lines("hdfs:///data.json").foo.bar[]) is pushed down to and parallelized on Spark.
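Step by step, and assuming each line of the (hypothetical) data.json file is an object of the form {"foo": {"bar": [ ... ]}}, this navigation reads as:

```
json-lines("hdfs:///data.json")  (: sequence of top-level objects, one per line :)
  .foo                           (: object lookup: the "foo" member of each object :)
  .bar                           (: object lookup: the "bar" member, an array :)
  []                             (: array unboxing: all members of each array :)
```

count() then aggregates the resulting sequence, and the whole pipeline is executed as Spark transformations rather than being materialized locally.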
- We improved the memory footprint: in particular, filtering queries are streamed through (within a task) rather than materialized.
- We worked on performance: Sparksoniq can handle files of 10,000,000+ objects on a regular laptop for counting, filtering, grouping, and ordering with a local Spark execution. Performance also improved noticeably when querying larger datasets on clusters (tested with several billion objects on 64 machines).