Issue Details

Key: NUCRDBMS-736
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Testcase Required
Assignee: Unassigned
Reporter: Tuong Truong
Votes: 0
Watchers: 1
DataNucleus Store RDBMS

DN generates invalid query syntax for DB2 LUW in Hive 12

Created: 03/Jan/14 08:11 PM   Updated: 11/Jan/14 07:43 PM   Resolved: 09/Jan/14 08:46 AM
Component/s: Datastore Adapter
Affects Version/s: 3.2.1
Fix Version/s: 3.2.11

Environment: Linux, Java

Datastore: IBM DB2
Severity: Production


Description
When Hive 12 is configured to use DB2 LUW as the metastore database, many Hive Metastore APIs fail due to invalid SQL syntax generated by DN. So far, we have detected Hive failures with the Alter Table and Drop Table APIs/statements. The cause appears to be an additional, invalid qualifier generated by DataNucleus. Hive 12 uses datanucleus-api-jdo-3.2.1.jar, datanucleus-core-3.2.2.jar, and datanucleus-rdbms-3.2.1.jar. Comparing DB2 and Derby, we see that the queries generated for DB2 use the row_number() over() function and incorrectly expose a nested correlation name.

Reproduction Steps:

Configure Hive 12 (http://www.apache.org/dyn/closer.cgi/hive/) to use DB2 (http://www-01.ibm.com/software/data/db2/express-c/download.html):

using Hive CLI ($HIVE_HOME/bin/hive)

create table test(i1 int);
drop table test; <<< this will fail with a stack trace
alter table test set tblproperties ('comment' = 'test'); <<< this will also fail, but you need to look in the Hive log to see the error



From the Hive logs, we see this:

For Derby, it seems this query was created instead:

SELECT FROM org.apache.hadoop.hive.metastore.model.MStorageDescriptor WHERE this.cd == inCD PARAMETERS MColumnDescriptor inCD RANGE 0,1

Query compiled to datastore query

"SELECT 'org.apache.hadoop.hive.metastore.model.MStorageDescriptor' AS NUCLEUS_TYPE,A0.INPUT_FORMAT,A0.IS_COMPRESSED,A0.IS_STOREDASSUBDIRECTORIES,A0.LOCATION,A0.NUM_BUCKETS,A0.OUTPUT_FORMAT,A0.SD_ID FROM SDS A0 WHERE A0.CD_ID = ?"
 
For DB2, the problem query looks like this:
 
SELECT FROM org.apache.hadoop.hive.metastore.model.MStorageDescriptor WHERE this.cd == inCD PARAMETERS MColumnDescriptor inCD RANGE 0,1
Query compiled to datastore query
"SELECT subq.NUCLEUS_TYPE,subq.A0.INPUT_FORMAT,subq.A0.IS_COMPRESSED,subq.A0.IS_STOREDASSUBDIRECTORIES,subq.A0.LOCATION,subq.A0.NUM_BUCKETS,subq.A0.OUTPUT_FORMAT,subq.A0.SD_ID FROM (SELECT 'org.apache.hadoop.hive.metastore.model.MStorageDescriptor' AS NUCLEUS_TYPE,A0.INPUT_FORMAT,A0.IS_COMPRESSED,A0.IS_STOREDASSUBDIRECTORIES,A0.LOCATION,A0.NUM_BUCKETS,A0.OUTPUT_FORMAT,A0.SD_ID,row_number()over() rn FROM HIVE.SDS A0 WHERE A0.CD_ID = ?) subq WHERE subq.rn<=1"


Hive/DN trace stack:

com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-206, SQLSTATE=42703, SQLERRMC=SUBQ.A0.CREATE_TIME, DRIVER=4.16.53
        at com.ibm.db2.jcc.am.fd.a(fd.java:739)
        at com.ibm.db2.jcc.am.fd.a(fd.java:60)
        at com.ibm.db2.jcc.am.fd.a(fd.java:127)
        at com.ibm.db2.jcc.am.to.c(to.java:2771)
        at com.ibm.db2.jcc.am.to.d(to.java:2759)
        at com.ibm.db2.jcc.am.to.a(to.java:2192)
        at com.ibm.db2.jcc.am.uo.a(uo.java:7827)
        at com.ibm.db2.jcc.t4.ab.h(ab.java:141)
        at com.ibm.db2.jcc.t4.ab.b(ab.java:41)
        at com.ibm.db2.jcc.t4.o.a(o.java:32)
        at com.ibm.db2.jcc.t4.tb.i(tb.java:145)
        at com.ibm.db2.jcc.am.to.kb(to.java:2161)
        at com.ibm.db2.jcc.am.uo.wc(uo.java:3657)
        at com.ibm.db2.jcc.am.uo.b(uo.java:4454)
        at com.ibm.db2.jcc.am.uo.jc(uo.java:760)
        at com.ibm.db2.jcc.am.uo.executeQuery(uo.java:725)
        at com.jolbox.bonecp.PreparedStatementHandle.executeQuery(PreparedStatementHandle.java:172)
        at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeQuery(ParamLoggingPreparedStatement.java:381)
        at org.datanucleus.store.rdbms.SQLController.executeStatementQuery(SQLController.java:504)
        at org.datanucleus.store.rdbms.query.JDOQLQuery.performExecute(JDOQLQuery.java:637)
        at org.datanucleus.store.query.Query.executeQuery(Query.java:1786)
        at org.datanucleus.store.query.Query.executeWithArray(Query.java:1672)
        at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:266)
        at org.apache.hadoop.hive.metastore.ObjectStore.listMPartitions(ObjectStore.java:1698)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsInternal(ObjectStore.java:1428)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPartitions(ObjectStore.java:1402)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
        at com.sun.proxy.$Proxy7.getPartitions(Unknown Source)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.dropPartitionsAndGetLocations(HiveMetaStore.java:1286)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_core(HiveMetaStore.java:1189)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_table_with_environment_context(HiveMetaStore.java:1328)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.j
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.
        at com.sun.proxy.$Proxy8.drop_table_with_environment_context(Unknown Source)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreCl
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreCl
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.j
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaSt
        at com.sun.proxy.$Proxy9.dropTable(Unknown Source)
        at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:869)
        at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:836)
        at org.apache.hadoop.hive.ql.exec.DDLTask.dropTable(DDLTask.java:3329)
        at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:277)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1437)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1215)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1043)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.j
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)




Tuong Truong added a comment - 03/Jan/14 08:18 PM
Note that the select list of the invalid SQL contains column references that keep the A0 qualifier. A0 is only valid in the nested context and is subsumed by subq in the outer context; removing the A0 references makes the SQL valid:
subq.A0.INPUT_FORMAT,subq.A0.IS_COMPRESSED,subq.A0.IS_STOREDASSUBDIRECTORIES,subq.A0.LOCATION,subq.A0.NUM_BUCKETS,subq.A0.OUTPUT_FORMAT,subq.A0.SD_ID

Andy Jefferson added a comment - 03/Jan/14 08:19 PM
Problem reporting is well defined at this link:
http://www.datanucleus.org/project/problem_reporting.html

which does NOT involve downloading and running someone else's software, but instead isolating the "problem" into a standalone testcase. Secondly, you are using OLD versions of DataNucleus there, and it is expected that people use the latest. No idea what your ALTER TABLE has to do with anything; if you think some query creates invalid SQL, then you generate a testcase and quote very clearly the JDOQL and the SQL it generates. Nothing else is accepted.

Tuong Truong added a comment - 03/Jan/14 08:55 PM
Andy,
Thanks for the quick response. It will take some effort for us to write a JDO testcase for this, since we have been working at the Hive level and are not familiar with Hive's JDO layer or DN in particular. We will also work with the Hive team to see how they would like to handle this issue. I will definitely append a testcase once we have one. I don't expect the DN team to run Hive and DB2; the steps above are just to provide some context for your analysis. BTW, we downloaded 3.2.9 and tried it with Hive 12, and we still see the same issue. Thanks again.

Andy Jefferson added a comment - 04/Jan/14 09:35 AM
Tuong,
likely you're referring to the handling of query ranges with DB2. An IBM employee raised an issue some years ago (http://www.datanucleus.org/servlet/jira/browse/NUCRDBMS-433), and I included what they requested. They never confirmed whether that code worked.

I don't have DB2 here, and the only time I tried to get hold of it (for Linux 64-bit) the download/install process was such a PITA that I gave up on it, so I cannot do anything myself to resolve this. That said, I can provide you with details of where the particular code is that generates this SQL, below :-


GitHub project for the RDBMS plugin is at https://github.com/datanucleus/datanucleus-rdbms

The main classes to look at are
DB2Adapter (specifics for use of DB2)
https://github.com/datanucleus/datanucleus-rdbms/blob/master/src/java/org/datanucleus/store/rdbms/adapter/DB2Adapter.java
in particular the method "getRangeByRowNumberColumn", which signifies that DB2 will use the row-number mechanism to get the range.

SQLStatement (that generates the SQL)
https://github.com/datanucleus/datanucleus-rdbms/blob/master/src/java/org/datanucleus/store/rdbms/sql/SQLStatement.java#L1522
and trace through for getRangeByRowNumberColumn for the parts where DB2's range is applied.



Is DB2's range support supposed to be the same SQL structure as Oracle's by any chance? I know that the Oracle variant in DataNucleus works (since I have Oracle and tested it). If so then you could just _change_ that method in DB2Adapter to be

public String getRangeByRowNumberColumn2()
{
    return "row_number()over()";
}

[there are two alternate methods in SQLStatement for ranges, one used by Oracle and one by DB2 - this would switch DB2 to use the same one as Oracle]. Obviously if DB2 supported SQL2008 OFFSET/FETCH then it would all be much simpler ;-)



If someone could look at those, get back here with what works, and attach any patches to this issue (and, if a patch is only applicable to a particular version of DB2, please state which), that would be appreciated. Thanks

Tuong Truong added a comment - 09/Jan/14 02:04 AM
Andy,

Thank you for pointing out the code. I looked through the code you pointed out and isolated the problem to the SQLStatement.java file. This seems to be a general codepath issue with ranges and the row_number column.

Using SVN (learning on the fly), I was able to track it down to revision 13554, in which the method getSelectStatement() was changed to use the alias, when available, when building the query for a row number column select (around line 1660 in r13554).

The old code (prior to r13554) did the right thing by removing the qualifier prefix; the new code fails to handle the case where the alias is not available. A reasonable fix would be something like:
int dotIndex = selectedCol.indexOf(" AS ");
if (dotIndex > 0)
{
    // Use column alias where possible
    selectedCol = selectedCol.substring(dotIndex + 4);
}
else
{
    // Strip out the qualifier from the column name when encountered,
    // since we are adding a new qualifier above...
    // Multiple qualifiers are not valid SQL syntax
    dotIndex = selectedCol.indexOf(".");
    if (dotIndex > 0)
    {
        // Remove the qualifier name and the dot
        selectedCol = selectedCol.substring(dotIndex + 1);
    }
}
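The alias-else-strip logic above can be exercised in isolation; the sketch below is a standalone mirror of it (hypothetical class and method names, not DN's actual code), applied to column expressions of the shape seen in the log:

```java
public class SelectColumnNaming {
    // Mirrors the proposed fix: prefer the column alias when present,
    // otherwise strip the single "A0."-style qualifier so that the outer
    // query's own qualifier does not produce invalid "subq.A0.COL" names.
    static String outerName(String selectedCol) {
        int asIndex = selectedCol.indexOf(" AS ");
        if (asIndex > 0) {
            // Use the column alias where possible
            return selectedCol.substring(asIndex + 4);
        }
        int dotIndex = selectedCol.indexOf('.');
        if (dotIndex > 0) {
            // Remove the qualifier name and the dot
            return selectedCol.substring(dotIndex + 1);
        }
        return selectedCol;
    }

    public static void main(String[] args) {
        System.out.println(outerName("A0.INPUT_FORMAT")); // INPUT_FORMAT
        System.out.println(outerName("x AS NUCLEUS_TYPE")); // NUCLEUS_TYPE
        System.out.println(outerName("SD_ID")); // SD_ID
    }
}
```

Note that this string-level stripping assumes column names are unique across the joined tables; it cannot distinguish A0.COL1 from B0.COL1 once the qualifiers are removed.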

Andy Jefferson added a comment - 09/Jan/14 08:46 AM
Tuong, I applied what I assume you meant to GitHub master. This is left to you to test; let us know if this change doesn't work. Note though that removing the table prefix will not work in an original statement that selects something like A0.COL1 and also B0.COL1.

For future reference, we don't use SVN now, and when proposing an update, a normal "diff" or a Git pull request is the best way to provide it, since then I don't need to guess at line numbers and at what I'm replacing with the proposed code. Thanks.

Tuong Truong added a comment - 09/Jan/14 10:34 PM
Thanks, Andy. I actually applied this fix to an SVN revision closer to the version used by Hive 12, and it fixed the problem we were seeing. Thanks again for delivering the fix.

The issue you mentioned is inherent in duplicate column names in a subselect; column aliases were invented to bypass it, but the onus is on the SQL writer. From the SQL-generation perspective, it's probably good practice to always generate "AS" aliases in a subselect's select list.

Happy New Year!!!