Issue Details (XML | Word | Printable)

Key: NUCCORE-517
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Unassigned
Reporter: Chris Colman
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
DataNucleus Core

NPE at application start up while processing metadata - intermittent

Created: 27/Apr/10 07:17 PM   Updated: 18/Jun/12 10:05 AM   Resolved: 17/Jun/12 09:06 AM
Component/s: MetaData
Affects Version/s: 2.0.4
Fix Version/s: 3.1.0.m4

Environment: MySQL 5, Java 1.6, Windows

Forum Thread URL: http://www.jpox.org/servlet/forum/viewthread_thread,6080
Datastore: MySQL
Severity: Production


 Description  « Hide
Recently updated to 2.0.x nightly build 2010/04/18

I've started seeing an intermittent exception that I've never seen before. It happens at application start up when DataNucleus is initializing:

Caused by: java.lang.NullPointerException
at org.datanucleus.metadata.AbstractClassMetaData.inheritIdentity(AbstractClassMetaData.java:839)
at org.datanucleus.metadata.ClassMetaData.populate(ClassMetaData.java:195)
at org.datanucleus.metadata.AbstractClassMetaData.validateSuperClass(AbstractClassMetaData.java:724)
at org.datanucleus.metadata.AbstractClassMetaData.determineSuperClassName(AbstractClassMetaData.java:64
5)
at org.datanucleus.metadata.ClassMetaData.populate(ClassMetaData.java:193)
at org.datanucleus.metadata.MetaDataManager$1.run(MetaDataManager.java:2394)

The problem is intermittent so it's impossible to reproduce in a reliable way with a test case.

The current workaround, while frustrating, eventually gets the app started up:

Stop tomcat
Remove the application from the container
Restart tomcat
Install the application

Just removing app and reinstalling it will get it working eventually - it can take a number of cycles of the above steps. In other words, the metadata remains exactly the same so I would think that the problem is not caused by invalid metadata.

I've been looking through the source code to try and find an explanation to this. Looking at the line on which it has occurred (marked with >>>>> below) it appears as though pcSuperclassMetaData has not been initialized during some app start ups but not others. There may be some affect where the 'order of class processing' that can trigger this (eg., iterating through a HashSet collection where ordering is largely non deterministic).

I never saw this until recently when I changed the class model and introduced what might be called an 'indirect self referential relationship' to the design:

Was:


        A
  ^ ^
B M
^
C
^
D


(where ^ means lower class extends class above it in the ASCII diagram)

Changed to:


        A
  ^ ^
B M
^
C ^
^
D --------> N


The -----> represents the new relationship to the new class 'N' which shares a common base class in A and which appears to have triggered this intermittent NPE. Perhaps the variance in 'NPE thrown or works ok' is based on whether class D gets processed before or after class N.

Possibly we could be deep in the nested processing of the class D branch of the hierarchy when we experience class N which has a different branch but the same, potentially partially processed superclass at the root of the same hierarchy.

Additional details:
    * Using datastore identity
    * class A uses new table inheritance strategy
    * all derived classes use 'superclass table' inheritance strategy


Relevant source code:

org.datanucleus.metadata.AbstractClassMetaData.java

    /**
     * Inherit the identity definition from super classes.
     * @throws InvalidMetaDataException if the MetaData of this class conflicts with super classes definition
     */
    protected void inheritIdentity()
    {
        if (objectidClass != null)
        {
            // Make sure the objectid-class is fully-qualified (may have been specified in simple form)
            this.objectidClass = ClassUtils.createFullClassName(((PackageMetaData)parent).name, objectidClass);
        }

        // "persistence-capable-superclass"
        // Check that the class can be loaded, and is a true superclass
        if (persistenceCapableSuperclass != null)
        {
            // Class has superclass, yet has objectid-class defined! ops, this might result in user errors
            if (objectidClass != null)
            {
>>>>> String superObjectIdClass = pcSuperclassMetaData.getObjectidClass();
                if (superObjectIdClass == null || !objectidClass.equals(superObjectIdClass))
                {
                    throw new InvalidMetaDataException(LOCALISER, "044085", name, persistenceCapableSuperclass);
                }

                // by default users should only specify the object-id class in the root persistent class
                NucleusLogger.METADATA.warn(LOCALISER.msg("044086", name, persistenceCapableSuperclass));
            }
            else
            {
                // get the objectid class from superclass
                this.objectidClass = pcSuperclassMetaData.getObjectidClass();
            }

            if (this.identityType == null)
            {
                this.identityType = pcSuperclassMetaData.getIdentityType();
            }

            // Superclass identityType must be the same as this classes identityType
            if (!this.identityType.equals(pcSuperclassMetaData.getIdentityType()))
            {
                // We can't change the identity type from what was specified in the base class
                throw new InvalidMetaDataException(LOCALISER, "044093", fullName);
            }
        }
    }

Peter Dettman made changes - 14/May/10 06:18 AM
Field Original Value New Value
Assignee Peter Dettman [ peter.dettman ]
Peter Dettman added a comment - 14/May/10 06:23 AM
Are you able to test this against a 2.1 nightly build?

If you can send me a somewhat self-contained test program, I'd be happy to take a look.

Chris Colman added a comment - 14/May/10 06:31 AM
I'm currently sticking to 2.0.x for production but I will download 2.1 and use that for development for a few days to see if the problem occurs with 2.1.

It would take some time to create a self contained test that reproduces the problem as the app it's occurring in has about 500 classes. I will try to get some time soon to reduce down a much simplified app that still (intermittently :) ) reproduces the problem.

Chris Colman added a comment - 18/May/10 01:10 AM
An interesting observation:

We develop on windows and deploy to Linux production servers (CentOS).

While this intermittent problem occurs on Windows it has *never* occurred on Linux.

It could be something like Linux allocating memory more predictably or something like that.

Given that it works sometimes and not others tells me that there is some kind of non deterministic mechanism in the DataNucleus start up process - which of course is not a problem in itself but it could explain why sometimes it works and sometimes it doesn't - using the exact same codebase.

This could possibly be:
- a critical timing issue (unlikely)
- an 'order of processing' variance eg., iteration through a hashset - the order in which elements are returned by a hash set iterator is non deterministic over successive runs of the same code because the hash function is based on each element's memory address - which may be more predictable in Linux but not in Windows.

Chris Colman added a comment - 01/Aug/10 03:34 AM
I just confirmed that this still occurs on yesterday's 2.2 nightly build.

If anything it appears to occur more frequently than with 2.0 but as this is an intermittent problem it's very statistically unstable but for the last 10 deployments to the dev system during normal development cycle not once has it started without error on the first attempt. They all required that I stop tomcat and restart it.

Andy Jefferson made changes - 19/Oct/10 08:13 AM
Priority Major [ 3 ] Incomplete [ 6 ]
Andy Jefferson made changes - 17/Feb/11 10:20 AM
Assignee Peter Dettman [ peter.dettman ]
Chris Colman added a comment - 10/Apr/11 09:25 AM
Running under IntelliJ Community edition this problem happens 100% of the time - giving me a chance to debug it for the first time.

I'm stepping through the meta data processing code. It's deep, deep, deep inside the processing of a package.jdo file (which is inside a .jar) and I've noticed something that seems not quite right:

The constructor:

    public ClassMetaData(final PackageMetaData parent, final String name)
    {
        super(parent, name);
    }

is called more than once for the exact same class name. Is this to be expected? I would think that only one instance of each ClassMetaData would be required and when required for further use the existing instance would be located via a look up mechanism.

Am I heading in the right direction or is multiple construction of the same class meta data instance normal?

Andy Jefferson added a comment - 10/Apr/11 09:41 AM
if you have metadata in package.jdo, and package-xxx.orm, and annotations then you get one for each before they are merged

Chris Colman added a comment - 10/Apr/11 09:52 AM
Ok, I'll have a look around in other places for a duplicate. I don't use any .orm files but they may exist in some other .jdo somewhere.

Chris Colman added a comment - 10/Apr/11 03:41 PM - edited
Found that two instances of the .jar file containing the package.jdo were in the classpath - explaining why the ClassMetaData constructor was called twice for the OrgUnit class name. Removing one of them drastically reduced the chances of the error occurring although I did see it occur one more time. Putting the duplicate back caused the NPE to occur most of the time.

Would it be possible for the NPE to be replaced by something more descriptive, perhaps suggesting that a .jar may be included more than once.

Andy Jefferson added a comment - 12/Jan/12 05:24 PM
Been open for nearly 2 yrs and never seen here, nor demonstratable so closing. Obviously if the famous testcase is generated then can be reopened

Andy Jefferson made changes - 12/Jan/12 05:24 PM
Status Open [ 1 ] Resolved [ 5 ]
Resolution Cannot Reproduce [ 5 ]
Andy Jefferson made changes - 06/Feb/12 10:13 AM
Status Resolved [ 5 ] Closed [ 6 ]
Chris Colman added a comment - 06/Feb/12 11:23 AM
I recently upgraded the ASM byte code enhancer from version 3.1 to version 3.3 and since doing that this intermittent NPE on startup issue has not occurred once! That's good news.

I'll keep my fingers crossed that it was an ASM 3.1 issue that has now been solved.

Chris Colman added a comment - 07/Feb/12 09:54 PM
Doh! I spoke to soon. It happened again but it *seems* less frequent with ASM 3.1.

I will try to get a reduced persistent class model that reproduces the problem. Reducing is not trivial anymore with >400 persistent classes in the one project!

Chris Colman added a comment - 08/May/12 07:38 AM
As I have been unable to produce a test case to date would it be possible to trap this exception and reproduce a much more terse error message instead of throwing pages of stack trace?

This would help out in the interim while I try to create a minimalist test case.

Andy Jefferson added a comment - 08/May/12 04:04 PM
Chris, you have the stack trace so know the line numbers etc where it occurs. Additionally you know what extra info you want printed out, so just provide a patch to SVN trunk and it can be applied.

Chris Colman added a comment - 17/Jun/12 02:59 AM
I just discovered a clue that may be useful.

Debugging DN core at app start up I have discovered that if the reentrancy prevention semaphore 'populating' is ever true then the exception is thrown. On occassions where the 'populating' flag is never true the problem never occurs. So it appears as though the problem is related to the scenario in which DN prevents reentrancy of this method:

    public synchronized void populate(ClassLoaderResolver clr, ClassLoader primary, MetaDataManager mmgr)
    {
        if (isInitialised() || isPopulated())
        {
            NucleusLogger.METADATA.error(LOCALISER.msg("044068",name));
            throw new NucleusException(LOCALISER.msg("044068",fullName)).setFatal();
        }
        if (populating)
        {
            return; <--- If this ever gets executed the exception in question will be thrown
        }

Perhaps the error is more likely to occur on multi core systems where there could be a greater chance of concurrently executing threads performing the metadata processing.

Chris Colman added a comment - 17/Jun/12 03:26 AM
Could the problem be that the caller to ClassMetaData.populate always assumes that populate has completed when the call returns - ignoring the fact that the call might have been short circuited by the reentrancy prevention mechanism and thus population is not complete because it is still being carried out by another thread that has not completed the populate method yet?

Chris Colman added a comment - 17/Jun/12 06:33 AM
Bingo!!!

The problem happens when a class' meta data is being processed and, in processing that class, metadata for other classes is processed, and, while processing the other classes a class being processed calls ClassMetaData.populate on the original class. That call short circuits because the populating semaphore is true but the the original class is not fully populated because it's original populate method has not completed. An NPE occurs when a test is made on the class' identity type which is null.

The test is intended to ensure that the entity type of the superclass and the current class match. This test can safely be deferred if identity type is null because it will be performed when the stack unwinds and the original invocation of ClassMetaData.populate continues - at that point identityType will be established and so not cause an NPE.

Fix is attached:

Index: /work/libs/datanucleus/datanucleus/platform/core/trunk/src/java/org/datanucleus/metadata/AbstractClassMetaData.java
===================================================================
--- /work/libs/datanucleus/datanucleus/platform/core/trunk/src/java/org/datanucleus/metadata/AbstractClassMetaData.java (revision 14867)
+++ /work/libs/datanucleus/datanucleus/platform/core/trunk/src/java/org/datanucleus/metadata/AbstractClassMetaData.java (working copy)
@@ -917,11 +917,22 @@
             }
 
             // Superclass identityType must be the same as this classes identityType
- if (!this.identityType.equals(pcSuperclassMetaData.getIdentityType()))
- {
- // We can't change the identity type from what was specified in the base class
- throw new InvalidMetaDataException(LOCALISER, "044093", fullName);
- }
+ if (this.identityType != null)
+ {
+ // This is where the NPE comes from when this.identityType == null because further up the stack this class is still being
+ // populated and DN hasn't yet established it's identityType (or objectidClass for that matter)
+ // Defer doing the check if this.identityType is null - it will be checked when the invocation further up in the stack
+ // continues it's execution
+ if (!this.identityType.equals(pcSuperclassMetaData.getIdentityType()))
+ {
+ // We can't change the identity type from what was specified in the base class
+ throw new InvalidMetaDataException(LOCALISER, "044093", fullName);
+ }
+ }
+ else
+ {
+ NucleusLogger.METADATA.debug("Deferring check of identity type until stack unwinds on earlier populate call for this ClassMetaData instance: " + fullName);
+ }
         }
     }
     
The problem is intermittent because of the random nature in the order in which persistent classes are processed which I observed by dumping the name of the class being processed at the start of ClassMetaData.populate.

Andy Jefferson made changes - 17/Jun/12 09:04 AM
Resolution Cannot Reproduce [ 5 ]
Status Closed [ 6 ] Reopened [ 4 ]
Andy Jefferson made changes - 17/Jun/12 09:04 AM
Priority Incomplete [ 6 ] Minor [ 4 ]
Andy Jefferson added a comment - 17/Jun/12 09:06 AM
Patch didn't apply (to SVN trunk), but I changed the code to only apply that check+exception if superclass identity type is set. Obviously not reproducible here, but the change of the check doesn't affect our tests or the JDO TCK; now in SVN trunk.

Andy Jefferson made changes - 17/Jun/12 09:06 AM
Status Reopened [ 4 ] Resolved [ 5 ]
Fix Version/s 3.1.0.m4 [ 11563 ]
Resolution Fixed [ 1 ]
Andy Jefferson made changes - 18/Jun/12 10:05 AM
Status Resolved [ 5 ] Closed [ 6 ]