Ideas for a query language #75
to me, it sounds like those are the steps to get there:

db.createQueryStream(function (value) {
  return value.foo == 'bar';
}).pipe(...);

db.createQueryStream({ index: true }, function (value) {
  // ...
});
perhaps the example I should have used for the second part is one where the combination involving substack is a higher percentage. Let's say substack uses …
As I see it, the API @juliangruber proposed is not amenable to optimization: a hypothetical profiling engine would need to parse the filtering function to understand the intent, and that function could contain arbitrarily complex JavaScript, which would make the job virtually impossible. One solution would be either a query language or a naming convention for the filters (which would also be a language). Also, a quick thought: some document databases create all the combinations for the indexes by default, allowing ad-hoc queries (using something like XPath).
@pgte creating all combinations of indexes would be possible, but it could easily get out of hand with a document db; it's much more tractable with a schema/table db because you have a fixed number of columns. Either way, I think my plan is as follows: …
Now, it would be pretty cool to parse JavaScript (probably limited to an expression, no assignment or loops) and then use that as a query, but that would mostly be showing off... and it would tend to isolate the user from what is really happening, which is antithetical to the level-* spirit. Definitely a cool hack though.
Was thinking about this a few weeks back, and found this video on how Elasticsearch goes about things informative in terms of concrete approach.
Made me think of rehydrating this project I used to contribute to.
Presenting queries in JavaScript would be too hard to optimise; you'd need to do crazy AST inspections to figure out what the query actually was. What we really need is some kind of highly structured language for querying the database.
So combining the pieces you'd end up with a small query language. In fact, you could make it specific to sublevels, so you have to say which sublevel you want to grab data from. And then you could even extend the language to enable writes to the database, prefixing each query with the operation type, 'read' or 'write'. It might be best to express this language as an algebraic operation on sets of data: reads would be dissecting those sets into sub-sets, like an algebraic 'select'. If formally defined, we could implement parsers for this language, turn queries into level operations, and inspect which indexes would be best to use and create. In fact, I could see this language expanding beyond the level ecosystem, because it wouldn't be JavaScript specific!
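As a purely hypothetical illustration (both this syntax and the db.query call are invented here, not anything actually specified in the thread), such queries might read:

// hypothetical only: sketching the shape of such a language, scoped
// to a sublevel and prefixed with the operation type
db.query("read posts select * where key >= 'a' and key < 'b'");
db.query("write posts put 'post:1' {\"title\": \"hello\"}");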
@rvagg absolutely!!! If there is something missing from the nosql movement it's probably easy ways to query your data... and also, come to think of it, to validate it too. Everyone writes these orm things that check you have the correct fields. Why don't we just INSERT that INTO the DATABASE? If we are gonna create some crazy new language, there is one thing we should keep in mind though... Also, of course, one of the best things about level-* is being able to have an evented database that TRIGGERS your code when something is inserted, even running a bash script or something. This would usher us into a whole new era of nosql... maybe NOSQL++: Node's Original Structured Query Language. With "++" so that everyone knows it's extra good (we wouldn't want anyone to think this is something we just made up); that is important because we gotta sell this to the enterprise.
oo, and we could actually push this a bit further and store actual code in there; they wouldn't be full-blown JavaScript functions, perhaps more like procedures. Imagine the power and potential of stored procedures; we'd hardly have to write our applications in JavaScript, the application logic could emerge from within the database itself. WHAT POWER!
Would it be possible to separate the (1) query language and the (2) indexes into separate modules in a generic way? (I have no idea how dependent most query languages are on the implementation of their indexes.) For instance, I've been tweaking @tonsky's ClojureScript Datalog implementation to work with pluggable indexes. I have a PouchDB plugin that uses map-reduce to build the indexes, which in this case are triple-store-like entity-attribute-value and attribute-value-entity (though I may need another). It works with simple queries now; I'll be finding out more about how well it does or doesn't work in the coming weeks. I don't particularly have a goal to keep indexes small in my use-case, and I'm just naïvely indexing everything right now, so we'll see. One could also definitely put a more JS-style API on top of Datalog that would probably be more appealing to more folks…
@dominictarr have you considered that you might be describing a graph database (like what levelgraph provides)? the approach is slightly different but the gains are what i think you're looking for. rather than inserting whole documents straight into your db, you decompose the document into a graph of triples where each property of your document is the name of an edge and the nodes are the values. using a hexastore to model these nodes and edges, you have a fairly linear cost (each triple costs about 6 times the size of the data) but now you have an index for every property, which gives you close to linear querying. wikipedia should provide enough explanation of why i think your pain is relieved with a graph database.
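For readers unfamiliar with the idea, here is a minimal sketch of the hexastore layout (the key format here is made up for illustration; levelgraph's actual encoding differs in detail):

function hexastoreKeys (subject, predicate, object) {
  // one triple is written under all six orderings of
  // (subject, predicate, object), so any query pattern can be
  // answered with a single ranged scan over one of the orderings
  var s = subject, p = predicate, o = object;
  return [
    ['spo', s, p, o], ['sop', s, o, p],
    ['pso', p, s, o], ['pos', p, o, s],
    ['osp', o, s, p], ['ops', o, p, s]
  ].map(function (parts) { return parts.join('::'); });
}

// e.g. the property { author: 'substack' } on module 'optimist'
// becomes six index entries, one per ordering:
console.log(hexastoreKeys('optimist', 'author', 'substack'));
// [ 'spo::optimist::author::substack', 'sop::optimist::substack::author', ... ]

The six orderings are where the "6 times the size of the data" cost comes from, and also why every property ends up queryable.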
I'm not sure how much of this discussion is serious anymore, but there's also the Cloudant Query Language, which is very deliberately modeled after MongoDB's queries. Supposedly it's going to get merged into CouchDB 2.0, so if there's any kind of emerging "NoSQL query language," it's probably this.
There's also Datomic's query language, which is based on datalog.
@rvagg totally. great idea! what should we make it like? of course code-in-a-database is very different to js so we should probably make it different, or people might get the wrong idea. We can't use coffeescript, because people are already using that, but what about if we use your CAPSLOCKSCRIPT? not many people are seriously using that yet, and it would be great to give it a boost by wrapping a very useful database around it. Also, using CAPSLOCK helps communicate that DATABASES are SERIOUS BUSINESS.
@neonstalwart graph databases don't address the specific case I mentioned in the opening post.
@dominictarr 👍 CAPSLOCK. Everyone knows that you must …
word @Raynos, that's webscale |
@dominictarr I cannot agree more.
Not sure if I'm contributing to this idea, but I have long thought that the way one queries a tuple space would work very well in JavaScript and would be perfect to implement on top of level. In a tuple space you query by pattern matching. In its simplest form, the database is just a big bucket you stuff objects into. There are no IDs; if you want a unique ID you have to put it in the objects you put in the space. You then search by sending in an object with the keys and values that should be matched. Tuple spaces have more or less been a Java thing, so setting up a query object there is a bit "more" than what we are used to in our JS world. But the literal syntax of a JS object is so visual that I think it would work very well as an easy way for people to express a query.

This is what I would love to see / do. Let's say I have some LP records and I stuff them into a database like this:

db.put([
  {id: 0, album: 'High Visibility', artist: 'The Hellacopters', year: 2002, style: 'Punk rock'},
  {id: 1, album: 'Tender Is the Savage', artist: 'Gluecifer', year: 2000, style: 'Rock'},
  {id: 2, album: 'Deathcrush', artist: 'Mayhem', year: 1987, style: 'Black metal'},
  {id: 3, album: 'Supershitty to the Max!', artist: 'The Hellacopters', year: 1996, style: 'Punk rock'},
  {id: 4, album: 'Money for Soul', artist: 'Baby Woodrose', year: 2003, style: 'Psychedelic rock'}
]);

To query this, I do it with a "template object" like this:

db.get({
  artist: 'The Hellacopters',
  year: 1996
}, function (arr) {});

In the callback I then get an array holding the objects which matched the keys and values of the template object. In this case, one:

[
  {id: 3, album: 'Supershitty to the Max!', artist: 'The Hellacopters', year: 1996, style: 'Punk rock'}
]

For fancier queries, the values of the keys in the template object could take a function:

db.get({
  year: function (value) {
    return value >= 1985 && value <= 1990;
  }
}, function (arr) {});

This will match all objects where the value of "year" is between 1985 and 1990. For the lower-level stuff inside the database I'm no guru, but I guess it would be possible to build indexes of the keys and values. Maybe one could have a function to tell the database which keys / values it should index by providing, yet again, a template object:

db.index({
  artist: true,
  year: true
});

Here we're telling the database to index the keys and values of "artist" and "year"; these then become the only keys and values one can query on. A feature to tell the database which key in the objects holds a unique value would also be nice; that key could then be optimized for fetching a single object. As an example of how easily this composes: imagine going straight from the FormData object provided by XHR2 on submitting an HTML form to querying the database; the FormData object can more or less be used untouched as the template object. Just my two-cent idea for when the topic of how to query pops up.
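A naïve in-memory version of that template matching (ignoring indexes and storage entirely; matchesTemplate is just an illustrative name) could be as small as:

function matchesTemplate (template, record) {
  // plain values must be strictly equal; function values act as
  // per-key predicates on the record's value
  return Object.keys(template).every(function (key) {
    var want = template[key];
    if (typeof want === 'function') return want(record[key]);
    return record[key] === want;
  });
}

console.log(matchesTemplate(
  { year: function (y) { return y >= 1985 && y <= 1990; } },
  { id: 2, album: 'Deathcrush', artist: 'Mayhem', year: 1987, style: 'Black metal' }
)); // true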
A very leveldb way to do this would be to reuse the parameters that createReadStream() already supports:

var search = require('level-indexifier');
var db = require('level')('./db');

var puts = [
  {id: 0, album: 'High Visibility', artist: 'The Hellacopters', year: 2002, style: 'Punk rock'},
  {id: 1, album: 'Tender Is the Savage', artist: 'Gluecifer', year: 2000, style: 'Rock'},
  {id: 2, album: 'Deathcrush', artist: 'Mayhem', year: 1987, style: 'Black metal'},
  {id: 3, album: 'Supershitty to the Max!', artist: 'The Hellacopters', year: 1996, style: 'Punk rock'},
  {id: 4, album: 'Money for Soul', artist: 'Baby Woodrose', year: 2003, style: 'Psychedelic rock'}
];

db.batch(puts, function (err) {
  var s = search(db, {
    artist: 'The Hellacopters',
    year: { gt: 2000 }
  });
  s.on('data', console.log);
});

For each key in the search parameters, if the value is a plain value it is matched for equality, and if it is an object with gt/gte/lt/lte properties it is treated as a range. Using the parameters that createReadStream() already supports will make it easier to learn, more consistent with the rest of level, and easier to implement, because these parameters can just be passed through to the underlying indexes.
template searches and using gt/gte/lt/lte (adding …)
when I get around to implementing it, this module will be called "mynosql"
👍 mynosql |
@dominictarr Can't wait any more
I'm playing with some of these ideas in level-scout, which can search with or without indexes. You create an index with:

index(db, ['x', 'y', 'z'])

which the search then uses for queries over those properties. Also, for equality queries, scout can intersect index scans. If you have two indexes:

index(db, 'x')
index(db, 'y')

and do a search:

search(db, {
  x: 60,
  y: 250
})

.. then scout will access the data via ranged streams on the x and y indexes and intersect the results. But still lots to be done.
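The intersection step can be sketched independently of scout itself. Assuming each index scan yields primary keys in sorted order (intersectSorted and the keys below are invented for illustration), a merge-style intersect looks like:

function intersectSorted (a, b) {
  // advance whichever side is behind; emit on equality
  var out = [], i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
    else if (a[i] < b[j]) i++;
    else j++;
  }
  return out;
}

// primary keys matching x === 60, and primary keys matching y === 250:
console.log(intersectSorted(['k01', 'k07', 'k12'], ['k07', 'k12', 'k30']));
// [ 'k07', 'k12' ]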
@vweevers cool! also by the way I did get around to starting on this: https://github.com/dominictarr/mynosql
it will be interesting to compare our designs!
Hi, there.
once a year I come back to this thread and ponder why we're not all using embedded 'nosql++' indices :) |
@bhurlow btw, I eventually (partially) implemented this idea here: https://github.com/flumedb/flumeview-query
Maybe it's time for a …
i'm totally up for it :) see Level/levelup#354
@fergiemcdowall I'm post-leveldb.
@dominictarr starred.
Would just like to add that I am still a level believer. I think the idea of a simple CRUD abstraction that can be put on top of any underlying datastore is really powerful. I think that defaulting to leveldb is great because it is well suited to machines with fast SSDs (pretty much all new machines these days). I also think that being able to replicate datastores everywhere is a super powerful feature that we are only going to see more and more of. So in summary: I'm a level fanboy, and I hope the project stays healthy.
@fergiemcdowall the question of whether leveldb is a general database abstraction or just an embedded database abstraction has come up before (certainly an interesting thread for historical reasons).
@dominictarr that was an interesting thread, and surely …
@fergiemcdowall I guess it can be different things to different people
This is an idea for a level module, but I am posting it here because it's too early in the morning/late at night to implement, and this will notify exactly the right people.

level is pretty useful as a straight key-value store, but sooner or later you need to query something else, so you add an index. A lot of leveldb modules are about indexes; for example, level-search, which just indexes everything. The trouble with all these indexes is managing them: which index do you need for which query, and when do you add a new index? There is a design trade-off here.

It would be so much easier if you could not think about indexes at all and just have filters: just forEach (scan) over all the data and return only what you want. Totally functional and elegant, just not performant. But what if you could use that interface, and it would transparently use the right indexes when possible (or, even better, generate them when necessary)?
Let's say we have the npm registry, and we want to retrieve all the modules written by @substack. Even though substack has written many modules, it is still less than 1% of the registry, so a full scan would be 99% wasted effort. Say the average registry document is 5k; that is 95000*0.99*5000 = 470mb of unnecessary reads, out of 475mb total.

But what if we indexed all the maintainers? For simplicity, say the average length of both module names and user names is 8 characters. Say we create an index like {username}:{modulename} and write an entry for every module: 95000*(8+8+1) (1 for the separator) = 1.6mb of index data. Now we only need to read through 1% of the index (since entries are ordered by username, we can jump straight to substack): 950*17 = 16150, only about 16k of unnecessary data.

So now we read just 16150 + 950*5000 ≈ 4.8mb. Using this index we increase storage requirements by 1.6mb but improve read performance by 100x.
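As a rough sketch of that maintainer index with plain levelup (the doc:/idx: key layout and the indexModule helper are invented here, and each module is assumed to have a single maintainer for simplicity):

var level = require('level');
var db = level('./registry');

// one document entry plus one index entry per module; the index key
// carries all the information, so its value can stay empty
function indexModule (doc, cb) {
  db.batch([
    { type: 'put', key: 'doc:' + doc.name, value: JSON.stringify(doc) },
    { type: 'put', key: 'idx:' + doc.maintainer + ':' + doc.name, value: '' }
  ], cb);
}

// all of substack's modules: jump straight to his slice of the
// index and never touch the other 99% of the registry
db.createReadStream({ gt: 'idx:substack:', lt: 'idx:substack:\xff', values: false })
  .on('data', function (key) {
    console.log(key.split(':')[2]); // the module name
  });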
But what about a more complex query? Suppose we want to find modules written by substack that depend (directly) on async. A simple way to do this would be to scan through the modules written by substack and check whether async is in the deps. Let's say there was a time when substack used async, in 9 modules, 1% of the time: the scan costs us 4.8mb, like before, but we discard 99% of it.

Okay, so let's create an index for this: udm:{username}:{dep}:{module}, where module depends on dep. This index is more expensive because each entry needs another field, plus a tag to distinguish it from the other indexes: 3 + 3*(8+1) = 30 characters per entry, so this index would need 2.85mb.

How much data would it save us? Now we can directly read out the 9 substack modules that use async: 30*9 + 9*5000 ≈ 45k, which is 100 times less! But here is the problem: how do we know in advance that we'll want this query? What if we want to know all the users that depend on optimist? We'd also need a dum:{dep}:{username}:{module} index, and so on. We want our database to be really, really fast, so let's just index every pair of properties, twice!
Let's say there are 20 properties on average. How many combinations of 2 properties? 20!/(2!*(20-2)!), which cancels down to (20*19)/2 = 190. Times 2, because we need both key orders, gives 380 indexes: 380*30 = 11400 bytes, roughly twice the size of the document. If there were 40 properties it would be 40*39*30 = 46800: doubling the number of properties roughly quadruples the index, which is now over 9 times the size of the original 5k document.
Now, at some point it is better to do a scan than to add another index, because the indexes take up too much space and optimize queries you never use. Clearly this is dependent on what sort of queries you run in practice. Possibly you could have a system that automagically creates indexes when you start to use a query more; maybe it could even know when you have an expensive query. But is there a nice way to gracefully transition between those? Decouple the query interface from the indexes, so you can add and remove indexes without changing the queries at all, and measure the efficiency of the queries so that you know when new indexes are needed.
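As a very rough sketch of what that decoupling could look like (everything here is invented for illustration: db.indexes, index.scan and db.fullScan are hypothetical):

function query (db, fields, test, cb) {
  // use a covering index if one exists for the filtered fields
  var index = null;
  for (var i = 0; i < db.indexes.length && !index; i++) {
    var covers = fields.every(function (f) {
      return db.indexes[i].fields.indexOf(f) !== -1;
    });
    if (covers) index = db.indexes[i];
  }
  var scanned = 0, results = [];
  var source = index ? index.scan(fields) : db.fullScan();
  source.on('data', function (doc) {
    scanned++;
    if (test(doc)) results.push(doc);
  });
  source.on('end', function () {
    // a low returned/scanned ratio means we keep discarding most of
    // what we read: the signal that a new index is needed
    cb(null, results, { scanned: scanned, returned: results.length });
  });
}

A persistently low returned/scanned ratio for a given filter would then be the trigger to create, or at least suggest, a matching index, without the queries themselves ever changing.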