Ideas for a query language #75
to me, it sounds like those are the steps to get there:

db.createQueryStream(function (value) {
  return value.foo == 'bar';
}).pipe(...);

db.createQueryStream({ index: true }, function (value) {
  // ...
});
perhaps the example I should have used for the second part is one where the combination involving substack is a higher percentage. Let's say substack uses …
As I see it, the API @juliangruber proposed is not amenable to optimization: a hypothetical profiling engine would need to parse the filtering function to understand the intent, and that function could contain arbitrarily complex JavaScript, which would make the job virtually impossible. One solution would be either a query language or a naming convention for the filters (which would also be a language). Also, a quick thought: some document databases create all the combinations for the indexes by default, allowing ad-hoc queries (using something like XPath).
@pgte creating all combinations of indexes would be possible, but it could easily get out of hand with a document db; it's much more tractable with a schema/table db because you have a fixed number of columns. Either way, I think my plan is as follows: …
Now, it would be pretty cool to parse JavaScript (probably limited to an expression, no assignment or loops) and then use that as a query, but that would mostly be showing off... and it would tend to isolate the user from what is really happening, which is antithetical to the level-* spirit. Definitely a cool hack though.
Was thinking about this a few weeks back, and found this video on how Elasticsearch goes about things informative in terms of concrete approach.
Made me think of rehydrating this project I used to contribute to.
Presenting queries in JavaScript would be too hard to optimise; you'd need to do crazy AST inspections to figure out what the query actually was. What we really need is some kind of highly structured language for querying the database.
So combining the pieces you'd end up with a small query language. In fact, you could make it specific to sublevels, so you have to say which sublevel you want to grab data from. And then you could even extend the language to enable writes to the database, prefixing each query with the operation type, 'read' or 'write'. It might be best to express this language as an algebraic operation on sets of data: reads would be dissecting those sets into sub-sets, like an algebraic 'select'. If formally defined, we could implement parsers for this language, turn queries into level operations, and inspect which indexes would be best to use and create. In fact, I could see this language expanding beyond the level ecosystem, because it wouldn't be JavaScript specific!
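As a purely hypothetical illustration (both this syntax and the db.query call are invented here, not anything actually specified in the thread), such queries might read:

// hypothetical only: sketching the shape of such a language, scoped
// to a sublevel and prefixed with the operation type
db.query("read posts select * where key >= 'a' and key < 'b'");
db.query("write posts put 'post:1' {\"title\": \"hello\"}");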
@rvagg absolutely!!! If there is something missing from the nosql movement it's probably easy ways to query your data... and also, come to think of it, to validate it too. Everyone writes these orm things that check you have the correct fields. Why don't we just INSERT that INTO the DATABASE? If we are gonna create some crazy new language, there is one thing we should keep in mind though... Also, of course, one of the best things about level-* is being able to have an evented database that TRIGGERS your code when something is inserted, even running a bash script or something. This would usher us into a whole new era of nosql... maybe NOSQL++: Node's Original Structured Query Language. With "++" so that everyone knows it's extra good (we wouldn't want anyone to think this is something we just made up); that is important because we gotta sell this to the enterprise.
oo, and we could actually push this a bit further and store actual code in there; they wouldn't be full-blown JavaScript functions, perhaps more like procedures. Imagine the power and potential of stored procedures; we'd hardly have to write our applications in JavaScript, the application logic could emerge from within the database itself. WHAT POWER!
Would it be possible to separate the (1) query language and the (2) indexes into separate modules in a generic way? (I have no idea how dependent most query languages are on the implementation of their indexes.) For instance, I've been tweaking @tonsky's ClojureScript Datalog implementation to work with pluggable indexes. I have a PouchDB plugin that uses map-reduce to build the indexes, which in this case are triple-store-like entity-attribute-value and attribute-value-entity (though I may need another). It works with simple queries now; I'll be finding out more about how well it does or doesn't work in the coming weeks. I don't particularly have a goal to keep indexes small in my use-case, and I'm just naïvely indexing everything right now, so we'll see. One could also definitely put a more JS-style API on top of Datalog that would probably be more appealing to more folks…
@dominictarr have you considered that you might be describing a graph database (like what levelgraph provides)? the approach is slightly different but the gains are what i think you're looking for. rather than inserting whole documents straight into your db, you decompose the document into a graph of triples where each property of your document is the name of an edge and the nodes are the values. using a hexastore to model these nodes and edges, you have a fairly linear cost (each triple costs about 6 times the size of the data) but now you have an index for every property, which gives you close to linear querying. wikipedia should provide enough explanation of why i think your pain is relieved with a graph database.
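For readers unfamiliar with the idea, here is a minimal sketch of the hexastore layout (the key format here is made up for illustration; levelgraph's actual encoding differs in detail):

function hexastoreKeys (subject, predicate, object) {
  // one triple is written under all six orderings of
  // (subject, predicate, object), so any query pattern can be
  // answered with a single ranged scan over one of the orderings
  var s = subject, p = predicate, o = object;
  return [
    ['spo', s, p, o], ['sop', s, o, p],
    ['pso', p, s, o], ['pos', p, o, s],
    ['osp', o, s, p], ['ops', o, p, s]
  ].map(function (parts) { return parts.join('::'); });
}

// e.g. the property { author: 'substack' } on module 'optimist'
// becomes six index entries, one per ordering:
console.log(hexastoreKeys('optimist', 'author', 'substack'));
// [ 'spo::optimist::author::substack', 'sop::optimist::substack::author', ... ]

The six orderings are where the "6 times the size of the data" cost comes from, and also why every property ends up queryable.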
I'm not sure how much of this discussion is serious anymore, but there's also the Cloudant Query Language, which is very deliberately modeled after MongoDB's queries. Supposedly it's going to get merged into CouchDB 2.0, so if there's any kind of emerging "NoSQL query language," it's probably this.
There's also Datomic's query language, which is based on datalog.
@rvagg totally. great idea! what should we make it like? of course code-in-a-database is very different to js so we should probably make it different, or people might get the wrong idea. We can't use coffeescript, because people are already using that, but what about if we use your CAPSLOCKSCRIPT? not many people are seriously using that yet, and it would be great to give it a boost by wrapping a very useful database around it. Also, using CAPSLOCK helps communicate that DATABASES are SERIOUS BUSINESS.
@neonstalwart graph databases don't address the specific case I mentioned in the opening post.
@dominictarr 👍 CAPSLOCK. Everyone knows that you must …
word @Raynos, that's webscale |
@dominictarr I cannot agree more.
Not sure if I'm contributing to this idea, but I have long thought that the way one queries a tuple space would work very well in JavaScript and would be perfect to implement on top of level. In a tuple space you query by pattern matching. In its simplest form, the database is just a big bucket you stuff objects into. There are no IDs; if you want a unique ID you have to put it in the objects you put in the space. You then search by sending in an object with the keys and values that should be matched. Tuple spaces have more or less been a Java thing, so setting up a query object there is a bit "more" than what we are used to in our JS world. But the literal syntax of a JS object is so visual that I think it would work very well as an easy way for people to express a query.

This is what I would love to see / do. Let's say I have some LP records and I stuff them into a database like this:

db.put([
  {id: 0, album: 'High Visibility', artist: 'The Hellacopters', year: 2002, style: 'Punk rock'},
  {id: 1, album: 'Tender Is the Savage', artist: 'Gluecifer', year: 2000, style: 'Rock'},
  {id: 2, album: 'Deathcrush', artist: 'Mayhem', year: 1987, style: 'Black metal'},
  {id: 3, album: 'Supershitty to the Max!', artist: 'The Hellacopters', year: 1996, style: 'Punk rock'},
  {id: 4, album: 'Money for Soul', artist: 'Baby Woodrose', year: 2003, style: 'Psychedelic rock'}
]);

To query this, I do it with a "template object" like this:

db.get({
  artist: 'The Hellacopters',
  year: 1996
}, function (arr) {});

In the callback I then get an array holding the objects which matched the keys and values of the template object. In this case, one:

[
  {id: 3, album: 'Supershitty to the Max!', artist: 'The Hellacopters', year: 1996, style: 'Punk rock'}
]

For fancier queries, the values of the keys in the template object could take a function:

db.get({
  year: function (value) {
    return value >= 1985 && value <= 1990;
  }
}, function (arr) {});

This will match all objects where the value of "year" is between 1985 and 1990. For the lower-level stuff inside the database I'm no guru, but I guess it would be possible to build indexes of the keys and values. Maybe one could have a function to tell the database which keys / values it should index by providing, yet again, a template object:

db.index({
  artist: true,
  year: true
});

Here we're telling the database to index the keys and values of "artist" and "year"; these then become the only keys and values one can query on. A feature to tell the database which key in the objects holds a unique value would also be nice; that key could then be optimized for fetching a single object. As an example of how easily this composes: imagine going straight from the FormData object provided by XHR2 on submitting an HTML form to querying the database; the FormData object can more or less be used untouched as the template object. Just my two-cent idea for when the topic of how to query pops up.
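A naïve in-memory version of that template matching (ignoring indexes and storage entirely; matchesTemplate is just an illustrative name) could be as small as:

function matchesTemplate (template, record) {
  // plain values must be strictly equal; function values act as
  // per-key predicates on the record's value
  return Object.keys(template).every(function (key) {
    var want = template[key];
    if (typeof want === 'function') return want(record[key]);
    return record[key] === want;
  });
}

console.log(matchesTemplate(
  { year: function (y) { return y >= 1985 && y <= 1990; } },
  { id: 2, album: 'Deathcrush', artist: 'Mayhem', year: 1987, style: 'Black metal' }
)); // true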
A very leveldb way to do this would be to reuse the parameters that createReadStream() already supports:

var search = require('level-indexifier');
var db = require('level')('./db');

var puts = [
  {id: 0, album: 'High Visibility', artist: 'The Hellacopters', year: 2002, style: 'Punk rock'},
  {id: 1, album: 'Tender Is the Savage', artist: 'Gluecifer', year: 2000, style: 'Rock'},
  {id: 2, album: 'Deathcrush', artist: 'Mayhem', year: 1987, style: 'Black metal'},
  {id: 3, album: 'Supershitty to the Max!', artist: 'The Hellacopters', year: 1996, style: 'Punk rock'},
  {id: 4, album: 'Money for Soul', artist: 'Baby Woodrose', year: 2003, style: 'Psychedelic rock'}
];

db.batch(puts, function (err) {
  var s = search(db, {
    artist: 'The Hellacopters',
    year: { gt: 2000 }
  });
  s.on('data', console.log);
});

For each key in the search parameters, if the value is a plain value it is matched for equality, and if it is an object with gt/gte/lt/lte properties it is treated as a range. Using the parameters that createReadStream() already supports will make it easier to learn, more consistent with the rest of level, and easier to implement, because these parameters can just be passed through to the underlying indexes.
template searches and using gt/gte/lt/lte (adding …)
when I get around to implementing it, this module will be called "mynosql"
👍 mynosql |
@dominictarr Can't wait any more
I'm playing with some of these ideas in level-scout, which can search with or without indexes. You create an index with:

index(db, ['x', 'y', 'z'])

which the search then uses for queries over those properties. Also, for equality queries, scout can intersect index scans. If you have two indexes:

index(db, 'x')
index(db, 'y')

and do a search:

search(db, {
  x: 60,
  y: 250
})

.. then scout will access the data via ranged streams on the x and y indexes and intersect the results. But still lots to be done.
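The intersection step can be sketched independently of scout itself. Assuming each index scan yields primary keys in sorted order (intersectSorted and the keys below are invented for illustration), a merge-style intersect looks like:

function intersectSorted (a, b) {
  // advance whichever side is behind; emit on equality
  var out = [], i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
    else if (a[i] < b[j]) i++;
    else j++;
  }
  return out;
}

// primary keys matching x === 60, and primary keys matching y === 250:
console.log(intersectSorted(['k01', 'k07', 'k12'], ['k07', 'k12', 'k30']));
// [ 'k07', 'k12' ]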
@vweevers cool! also by the way I did get around to starting on this: https://github.com/dominictarr/mynosql
it will be interesting to compare our designs!
Hi, there.
once a year I come back to this thread and ponder why we're not all using embedded 'nosql++' indices :) |
@bhurlow btw, I eventually (partially) implemented this idea here: https://github.com/flumedb/flumeview-query
Maybe it's time for a …
i'm totally up for it :) see Level/levelup#354
@fergiemcdowall I'm post-leveldb.
@dominictarr starred.
Would just like to add that I am still a level believer. I think the idea of a simple CRUD abstraction that can be put on top of any underlying datastore is really powerful. I think that defaulting to leveldb is great because it is well suited to machines with fast SSDs (pretty much all new machines these days). I also think that being able to replicate datastores everywhere is a super powerful feature that we are only going to see more and more of. So in summary: I'm a level fanboy, and I hope the project stays healthy.
@fergiemcdowall the question of whether leveldb is a general database abstraction or just an embedded database abstraction has come up before (certainly an interesting thread for historical reasons).
@dominictarr that was an interesting thread, and surely …
@fergiemcdowall I guess it can be different things to different people
This is an idea for a level module, but I am posting it here because it's too early in the morning/late at night to implement, and this will notify exactly the right people.

level is pretty useful as a straight key-value store, but sooner or later you need to query something else, so you add an index. A lot of leveldb modules are about indexes; for example, level-search, which just indexes everything. The trouble with all these indexes is managing them: which index do you need for which query, and when do you add a new index? There is a design trade-off here.

It would be so much easier if you could not think about indexes at all and just have filters: just forEach (scan) over all the data and return only what you want. Totally functional and elegant, just not performant. But what if you could use that interface, and it would transparently use the right indexes when possible (or, even better, generate them when necessary)?
Let's say we have the npm registry, and we want to retrieve all the modules written by @substack. Even though substack has written many modules, it is still less than 1% of the registry, so a full scan would be 99% wasted effort. Say the average registry document is 5k; that is 95000*0.99*5000 = 470mb of unnecessary reads, out of 475mb total.

But what if we indexed all the maintainers? For simplicity, say the average length of both module names and user names is 8 characters. Say we create an index like {username}:{modulename} and write an entry for every module: 95000*(8+8+1) (1 for the separator) = 1.6mb of index data. Now we only need to read through 1% of the index (since entries are ordered by username, we can jump straight to substack): 950*17 = 16150, only about 16k of unnecessary data.

So now we read just 16150 + 950*5000 ≈ 4.8mb. Using this index we increase storage requirements by 1.6mb but improve read performance by 100x.
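As a rough sketch of that maintainer index with plain levelup (the doc:/idx: key layout and the indexModule helper are invented here, and each module is assumed to have a single maintainer for simplicity):

var level = require('level');
var db = level('./registry');

// one document entry plus one index entry per module; the index key
// carries all the information, so its value can stay empty
function indexModule (doc, cb) {
  db.batch([
    { type: 'put', key: 'doc:' + doc.name, value: JSON.stringify(doc) },
    { type: 'put', key: 'idx:' + doc.maintainer + ':' + doc.name, value: '' }
  ], cb);
}

// all of substack's modules: jump straight to his slice of the
// index and never touch the other 99% of the registry
db.createReadStream({ gt: 'idx:substack:', lt: 'idx:substack:\xff', values: false })
  .on('data', function (key) {
    console.log(key.split(':')[2]); // the module name
  });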
But what about a more complex query? Suppose we want to find modules written by substack that depend (directly) on async. A simple way to do this would be to scan through the modules written by substack and check whether async is in the deps. Let's say there was a time when substack used async, in 9 modules, 1% of the time: the scan costs us 4.8mb, like before, but we discard 99% of it.

Okay, so let's create an index for this: udm:{username}:{dep}:{module}, where module depends on dep. This index is more expensive because each entry needs another field, plus a tag to distinguish it from the other indexes: 3 + 3*(8+1) = 30 characters per entry, so this index would need 2.85mb.

How much data would it save us? Now we can directly read out the 9 substack modules that use async: 30*9 + 9*5000 ≈ 45k, which is 100 times less! But here is the problem: how do we know in advance that we'll want this query? What if we want to know all the users that depend on optimist? We'd also need a dum:{dep}:{username}:{module} index, and so on. We want our database to be really, really fast, so let's just index every pair of properties, twice!
Let's say there are 20 properties on average. How many combinations of 2 properties? 20!/(2!*(20-2)!), which cancels down to (20*19)/2 = 190. Times 2, because we need both key orders, gives 380 indexes: 380*30 = 11400 bytes, roughly twice the size of the document. If there were 40 properties it would be 40*39*30 = 46800: doubling the number of properties roughly quadruples the index, which is now over 9 times the size of the original 5k document.
Now, at some point it is better to do a scan than to add another index, because the indexes take up too much space and optimize queries you never use. Clearly this is dependent on what sort of queries you run in practice. Possibly you could have a system that automagically creates indexes when you start to use a query more; maybe it could even know when you have an expensive query. But is there a nice way to gracefully transition between those? Decouple the query interface from the indexes, so you can add and remove indexes without changing the queries at all, and measure the efficiency of the queries so that you know when new indexes are needed.
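As a very rough sketch of what that decoupling could look like (everything here is invented for illustration: db.indexes, index.scan and db.fullScan are hypothetical):

function query (db, fields, test, cb) {
  // use a covering index if one exists for the filtered fields
  var index = null;
  for (var i = 0; i < db.indexes.length && !index; i++) {
    var covers = fields.every(function (f) {
      return db.indexes[i].fields.indexOf(f) !== -1;
    });
    if (covers) index = db.indexes[i];
  }
  var scanned = 0, results = [];
  var source = index ? index.scan(fields) : db.fullScan();
  source.on('data', function (doc) {
    scanned++;
    if (test(doc)) results.push(doc);
  });
  source.on('end', function () {
    // a low returned/scanned ratio means we keep discarding most of
    // what we read: the signal that a new index is needed
    cb(null, results, { scanned: scanned, returned: results.length });
  });
}

A persistently low returned/scanned ratio for a given filter would then be the trigger to create, or at least suggest, a matching index, without the queries themselves ever changing.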