Issue Details (XML | Word | Printable)

Key: NUCMONGODB-68
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Unassigned
Reporter: Chris Rued
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
DataNucleus Store MongoDB

Implement COUNT query on datastore using db.collection.count(selector)

Created: 20/Mar/12 11:37 PM   Updated: 06/Apr/12 05:52 PM   Resolved: 22/Mar/12 07:17 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 3.1.0.m2

File Attachments: 1. Text File NUCMONGODB-68-for-commit.patch (14 kB)
2. Text File NUCMONGODB-68-for-review.patch (14 kB)
3. Text File NUCMONGODB-68-generalized-for-review.patch (15 kB)


Datastore: MongoDB


 Description  « Hide
I'd like to be able to execute a count query on the data store rather than transfer a large collection over the network in order to do so in memory on the Java side. I've traced through the code a bit but am feeling a bit lost. I can see the COUNT is being recognized by the QUERY parser, but it's not clear when I would need to insert this.

My immediate thought is to introduce this into JPQLQuery's performExecute method. Here I could call getCompilation() to find the array of exprResults, look for COUNT and then use db.collection.count(selector) to get the count...but, I'm not quite sure.

I'm happy to take a crack at implementing this if you can point me in the right direction.

Thanks

Andy Jefferson added a comment - 21/Mar/12 07:21 AM - edited
Any handling for result ought to go into QueryToMongoDBMapper.java in compileResult() since this is the result clause. You will get an InvokeExpression IIRC with the method name being "count". You need to update the "resultObject" (or extend it). JDO provides count() queries also, so anything has to be done for both query languages or not at all.

Your idea of using dbCollection.count(filter) will obviously only cater for one (small) use-case where there is a single result clause that is count(). You need to set the resultComplete flag if that is the situation otherwise in-memory will run across the result too. Besides you can only do the count like that if you can evaluate the whole filter in the datastore

Chris Rued added a comment - 21/Mar/12 07:07 PM - edited
It looks like the resultObject is a BasicDBObject. I don't see how I would accomplish the count without a modification to performExecute. I don't see the resultObject being used at all in the current implementation (it's set in the datastoreCompilation, but then never used).

As far as I know, the only way to accomplish count() is Mongo via db.collection.count(), or perhaps using map/reduce. At first, map/reduce seemed too complex, but it may simplify GROUP BY with COUNT.

Did you have something else in mind?


Andy Jefferson added a comment - 21/Mar/12 07:25 PM
It's a placeholder currently (see TODO in compileResult). You can easily replace "BasicDBObject" by "MongoDBResult" (new class) that contains a BasicDBObject, and a flag for whether just the count is required. This is then passed to the place executing the query, and if it sees the flag (and no BasicDBObject) then it does your count() call.

Chris Rued added a comment - 22/Mar/12 07:03 AM
Implementation of COUNT using map/reduce in MongoDB. To be more complete, still need to know whether there is a GROUP BY clause in the JPQL expression and either modify the map key to include the GROUP BY fields or to bypass MongoDB execution if GROUP BY is present... (could be more I haven't considered, it's late here)

Chris Rued made changes - 22/Mar/12 07:03 AM
Field Original Value New Value
Attachment NUCMONGODB-68-for-review.patch [ 11621 ]
Chris Rued added a comment - 22/Mar/12 07:07 AM
Forgot to mention -- this patch also includes the change I mentioned in my latest comment on NUCMONGODB-65

Andy Jefferson added a comment - 22/Mar/12 10:25 AM
Patch looks fine except that I would split the common code out of JDOQLQuery/JPQLQuery and put that as a method in MongoDBUtils; that way we have all methods executing queries in MongoDBUtils

Chris Rued added a comment - 22/Mar/12 06:01 PM
Patch to support count query in the datastore under constrained circumstances:

  1. COUNT() is the only thing in the select
  2. No GROUP BY is present

#2 seems like it could be relaxed without a lot of effort, but I currently don't need it. I can try to find some time if you'd like.

I've taken your suggestion and moved the actual map/reduce count operation into MongoDBUtils. It feels like there may be some refactorings to make the map/reduce metaphor accessible more easily outside the COUNT use case, but that wasn't necessary at the moment.

Chris Rued made changes - 22/Mar/12 06:01 PM
Attachment NUCMONGODB-68-for-commit.patch [ 11622 ]
Andy Jefferson added a comment - 22/Mar/12 07:17 PM
SVN trunk now has your patch. Thx. I removed the "hasGrouping" flag since we know that info in the JDOQLQuery/JPQLQuery (grouping != null).

WRT map-reduce, see also NUCMONGODB-24 where I want to make the general capability available using the JDO API (this API is more open to improvements of this nature).

Andy Jefferson made changes - 22/Mar/12 07:17 PM
Status Open [ 1 ] Resolved [ 5 ]
Fix Version/s 3.1.0.m2 [ 11508 ]
Resolution Fixed [ 1 ]
Chris Rued added a comment - 23/Mar/12 05:18 PM - edited
I have a concern about the correctness of the current COUNT implementation. You may be able to answer more quickly that I can find on my own (came up empty with some quick googling):

What is the expected behavior of COUNT in JPQL or JDOQL wrt NULL values? In SQL NULL values are not counted (in practice, and I believe by spec).

Here is a sample JPQL query:
 
    SELECT count(p.textMessage) FROM TestEntity p

If we have two entities, one whose textMessage is NULL (or absent from the MongoDB document) and one with a value, what is the expected result? I think it should be 1 (counting only the non-null value).

What I've found is that the current implementation give 2 (using both the in-memory and the map-reduce approach).

Do you know which is the correct behavior?



I've been doing a bit of work to investigate implementing other aggregate operations with the map-reduce (mostly out of interest at this point, since I only need count) and I'm making some reasonable progress I think...I could apply the correction for the map-reduce case (whichever way is correct).

I'll share a patch of the work in progress later in the day.


Andy Jefferson added a comment - 23/Mar/12 05:45 PM
JDOQL uses "count(this)" ... so count of the number of objects of the candidate type ... hence your examples should be 2.

JPQL uses "count(a.field)" ... <spec>the number of values to which count can be applied</spec> ... hence if there is a property of that name then it counts (even if null value) ... note that this is an interpretation since JPA is designed around RDBMS where there isn't the concept of column missing (fixed schema)

Chris Rued added a comment - 23/Mar/12 06:30 PM
Here's an updated patch that uses a more generalized map-reduce approach
   * Allowing for SUM() and AVG()
      * The intention was to provide a more general structure for other aggregates
   * In order for SUM() and AVG() to work, we have to translate the xQL expression inside the function call to JavaScript -- so added the beginning of some support for that.

This really isn't meant to be ready for commit yet...just looking for a quick sanity check form your end.

Thanks.

Chris Rued made changes - 23/Mar/12 06:30 PM
Andy Jefferson added a comment - 23/Mar/12 07:46 PM
If aiming to allow SUM, AVG I'd prefer that as a separate JIRA; just raise one when you have something complete and put the patch there.

The patch still checks "isCountOnly" - rename it?

If providing AVG, what about MIN,MAX ? then you have the full set.

Why only allow queries returning a single aggregate? is there a problem with returning multiple aggregates in a query? Obviously if the user has multiple result expressions and some are aggregates and some are PrimaryExpression then you have much more work to do, but maybe just think of multiple aggregates for now

Chris Rued added a comment - 23/Mar/12 08:36 PM - edited
> If aiming to allow SUM, AVG I'd prefer that as a separate JIRA;
> just raise one when you have something complete and put the patch there.

Ok, I'll do that.


> The patch still checks "isCountOnly" - rename it?

Yes, that was an incomplete refactoring.


> If providing AVG, what about MIN,MAX ? then you have the full set.

Those are certainly a possibility


> Why only allow queries returning a single aggregate? is there a
> problem with returning multiple aggregates in a query?

That's a good suggestion...

First, we'd have to have MongoDBResult be able to store multiple sub-objects to represent these (maybe a new class MongoDBResultColumn?)

We'd have to store multiple keys in the map() function. Instead of just "c", we could use "c1", "c2", "c3".

Then the reduce() function would update each of these in turn. This would probably by just concatenating together multiple code blocks to appropriately update each of the (c1, c2, c3...) for the operation

Finally, instead of just looking for "c" when providing the results, we'd have to look for the (c1 ... cn)...we should know how many

If there are other PrimaryExpressions, these would have to be present in the GROUP BY, right? So we'd probably want to make this part of the key that is used in the emit() call from the map() function. Other Expressions that may occur would be Literal or DyadicExpression. A Literal should be simple enough (I think). A DyadicExpression would be more difficult...but we could probably have this evaluated as part of the JavaSript call, I think, and then get it from the results generated by the mapReduce() call.


> Obviously if the user has multiple result expressions and some are aggregates and some are
> PrimaryExpression then you have much more work to do, but maybe just think of multiple
> aggregates for now

As of now, my plan is to create a patch to support the full complement of COUNT, SUM, AVG, MIN, MAX as simple single-aggregate-only queries. Then perhaps allow multiple-aggregate-only queries without any other expressions.

After that it seems to me the next thing to support would be GROUP BY (without any other returned paramater types), and then finally GROUP BY with the ability to return other PrimaryExpressions, etc...I'm not sure how much time I'll have, though ;-)

Andy Jefferson made changes - 06/Apr/12 05:52 PM
Status Resolved [ 5 ] Closed [ 6 ]