JPA : Performance Tuning

DataNucleus, by default, provides certain functionality. In some circumstances part of this functionality may not be appropriate, and it may be desirable to turn particular features on or off to gain performance for the application in question. This section contains a few common tips.


You should perform enhancement before runtime. That is, do not use the Java agent, since it enhances classes at runtime, which is exactly when you want responsiveness from your application.

Schema Creation

DataNucleus provides 4 persistence properties (datanucleus.autoCreateSchema, datanucleus.autoCreateTables, datanucleus.autoCreateColumns, and datanucleus.autoCreateConstraints) that allow creation of the datastore tables. This can cause performance issues at startup. We recommend setting these to false at runtime, and instead using SchemaTool to generate any required database schema before running DataNucleus (for RDBMS, HBase, etc).

Schema Validation

DataNucleus provides 3 persistence properties (datanucleus.validateTables, datanucleus.validateConstraints, datanucleus.validateColumns) that enforce strict validation of the datastore tables against the tables defined in the MetaData. This can cause performance issues at startup. In general validation should be run only at schema generation and turned off for production usage, so set all of these properties to false. In addition, the property datanucleus.rdbms.CheckExistTablesOrViews controls whether DataNucleus checks that the tables/views that the classes map onto are present in the datastore. This should be set to false if you require fast start-up. Finally, the property datanucleus.rdbms.initializeColumnInfo determines whether the default values for columns are loaded from the database. Set this property to NONE to avoid loading database metadata.

To sum up, the optimal settings with schema creation and validation disabled are:

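#schema creation
datanucleus.autoCreateSchema=false
datanucleus.autoCreateTables=false
datanucleus.autoCreateColumns=false
datanucleus.autoCreateConstraints=false

#schema validation
datanucleus.validateTables=false
datanucleus.validateConstraints=false
datanucleus.validateColumns=false
datanucleus.rdbms.CheckExistTablesOrViews=false
datanucleus.rdbms.initializeColumnInfo=NONE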
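
The properties named in the two sections above can also be supplied programmatically as overrides when the factory is created. The following is a minimal sketch under stated assumptions: the class name SchemaTuning and the persistence-unit name "myPU" are hypothetical, and the JPA call is shown only in a comment so that the snippet compiles without JPA on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaTuning {
    /** Builds overrides that disable schema creation and validation at startup. */
    public static Map<String, String> startupOverrides() {
        Map<String, String> props = new LinkedHashMap<>();
        // schema creation
        props.put("datanucleus.autoCreateSchema", "false");
        props.put("datanucleus.autoCreateTables", "false");
        props.put("datanucleus.autoCreateColumns", "false");
        props.put("datanucleus.autoCreateConstraints", "false");
        // schema validation
        props.put("datanucleus.validateTables", "false");
        props.put("datanucleus.validateConstraints", "false");
        props.put("datanucleus.validateColumns", "false");
        props.put("datanucleus.rdbms.CheckExistTablesOrViews", "false");
        props.put("datanucleus.rdbms.initializeColumnInfo", "NONE");
        return props;
    }

    public static void main(String[] args) {
        // With JPA on the classpath the map would be passed as ("myPU" is hypothetical):
        //   Persistence.createEntityManagerFactory("myPU", startupOverrides());
        startupOverrides().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Keeping the overrides in code rather than in persistence.xml makes it easier to leave creation and validation enabled in a development profile and switch them off only for production.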

EntityManagerFactory creation

Creation of EntityManagerFactory objects can be expensive and should be kept to a minimum. Depending on the structure of your application, use a single factory per datastore wherever possible. Clearly if your application spans multiple servers then this may be impractical, but should be borne in mind.
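
One way to keep to a single factory per datastore is to cache factories by persistence-unit name. The sketch below is an illustration of that idea, not a DataNucleus facility: FactoryRegistry is a hypothetical helper, written generically so that it compiles without JPA on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Caches one expensive-to-create factory per name (hypothetical helper). */
public class FactoryRegistry<F> {
    private final Map<String, F> factories = new ConcurrentHashMap<>();
    private final Function<String, F> creator;

    public FactoryRegistry(Function<String, F> creator) {
        this.creator = creator;
    }

    /** Returns the cached factory for the name, creating it on first use only. */
    public F get(String unitName) {
        // computeIfAbsent applies the creator at most once per key, even under contention
        return factories.computeIfAbsent(unitName, creator);
    }
}
```

With JPA available, the registry could be created as new FactoryRegistry<EntityManagerFactory>(Persistence::createEntityManagerFactory), so that every caller asking for the same persistence unit shares one EntityManagerFactory.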

You can improve startup speed by not specifying all classes in the persistence-unit so that they are discovered at runtime. Obviously this may impact on persistence operations later if classes are not known about.

Some RDBMS (such as Oracle) have trouble returning information across multiple catalogs/schemas and so, when DataNucleus starts up and tries to obtain information about the existing tables, it can take some time. This is easily remedied by specifying the catalog/schema name to be used, either for the EMF as a whole (using the persistence properties datanucleus.Catalog and datanucleus.Schema) or for the package/class using attributes in the MetaData. This reduces the amount of information that the RDBMS needs to search through and so can give significant speed ups when you have many catalogs/schemas being managed by the RDBMS.

Use of EntityManager

Clearly the structure of your application will have a major influence on how you utilise an EntityManager. A pattern that gives a clean definition of process is to use a different EntityManager for each request to the data access layer. This reduces the risk of conflicts where an operation performed by one thread affects the successful completion of an operation being performed by another thread. Creation of EMs is not an expensive process, and use of multiple threads writing to the same EntityManager should be avoided.

O/R Mapping

Where you have an inheritance tree it is best to add a discriminator to the base class so that it's simple for DataNucleus to determine the class name for a particular row. For RDBMS : this results in cleaner/simpler SQL which is faster to execute, otherwise it would be necessary to do a UNION of all possible tables. For other datastores, a discriminator stores the key information necessary to instantiate the resultant class on retrieval so ought to be more efficient also.

Database Connection Pooling

DataNucleus, by default, will allocate connections when they are required, and will then close each connection after use. In addition, when it needs to perform something via JDBC (RDBMS datastores) it will allocate a PreparedStatement, and then discard the statement after use. This can be inefficient relative to a database connection and statement pooling facility such as Apache DBCP. With Apache DBCP a Connection is allocated when required, and when it is closed the Connection isn't actually closed but is saved in a pool for the next request that comes in for a Connection. This saves the time taken to establish a Connection and hence can give performance speed ups of the order of maybe 30% or more. You can read about how to enable connection pooling with DataNucleus in the Connection Pooling Guide.

As an addendum to the above, you could also turn on caching of PreparedStatements. This can also give a noticeable performance boost, depending on your persistence code and the SQL being issued. Look at the persistence property datanucleus.connectionPool.maxStatements.

Retrieval of object by id

When retrieving objects using their identity, and when the object is cached, DataNucleus by default will validate the existence of the object before handing it out. You can skip this check by setting the persistence property datanucleus.findObject.validateWhenCached to false.

Commit of transaction

DataNucleus verifies on commit whether newly persisted objects are memory reachable; if they are not, they are removed from the database. This process mirrors garbage collection, where objects that are not referenced are removed from memory. Reachability is expensive because it traverses the whole object tree and may require reloading data from the database. If reachability is not needed by your application, you should disable it by setting the persistence property datanucleus.persistenceByReachabilityAtCommit to false.

DataNucleus will, by default, perform a check on any bidirectional relations to make sure that they are set at both sides at commit. If they aren't set at both sides then they will be made consistent. This check process can involve the (re-)loading of some instances. You can skip this step if you always set both sides of a relation by setting the persistence property datanucleus.manageRelationships to false.
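
Turning off datanucleus.manageRelationships is only safe if application code keeps both sides of each relation consistent itself. The following is a minimal sketch of one way to do that, assuming a hypothetical Department/Employee model; the JPA mapping annotations are omitted so that the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

public class BidirectionalExample {
    /** Owning side of a hypothetical one-to-many relation. */
    public static class Department {
        private final List<Employee> employees = new ArrayList<>();

        /** Sets both sides of the relation in a single call. */
        public void addEmployee(Employee employee) {
            Department previous = employee.getDepartment();
            if (previous != null) {
                // detach from the old department so neither side is left stale
                previous.employees.remove(employee);
            }
            employees.add(employee);
            employee.department = this;
        }

        public List<Employee> getEmployees() {
            return employees;
        }
    }

    /** Inverse side; its department is only changed through Department.addEmployee. */
    public static class Employee {
        private Department department;

        public Department getDepartment() {
            return department;
        }
    }
}
```

Routing every change through one method such as addEmployee means there is no code path that updates only one side, which is the precondition for skipping the commit-time check.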

Identity Generators

DataNucleus provides a series of value generators for generation of identity values. These can have an impact on the performance depending on the choice of generator, and also on the configuration of the generator.

  • The max strategy should not really be used for production since it makes a separate DB call for each insertion of an object. Something like the table strategy should be used instead. Better still would be to choose auto and let DataNucleus decide for you.
  • The sequence strategy allows configuration of the datastore sequence. The default can be suboptimal. As a guide, you can try setting key-cache-size to 10.

The auto identity generator value is the recommended choice since this will allow DataNucleus to decide which identity generator is best for the datastore in use.

Collection/Map caching

DataNucleus has 2 ways of handling calls to SCO Collections/Maps. The original method was to pass all calls through to the datastore. The second method (which is now the default) is to cache the collection/map elements/keys/values. This second method will read the elements/keys/values once only and thereafter use the internally cached values. This second method gives significant performance gains relative to the original method. You can configure the handling of collections/maps as follows :-

  • Globally for the EMF - this is controlled by setting the persistence property datanucleus.cache.collections. Set it to true for caching the collections (default), and false to pass through to the datastore.
  • For the specific Collection/Map - this overrides the global setting and is controlled by adding a MetaData <collection> or <map> extension cache. Set it to true to cache the collection data, and false to pass through to the datastore.

The second method also allows a finer degree of control. This allows the use of lazy loading of data, hence elements will only be loaded if they are needed. You can configure this as follows :-

  • Globally for the EMF - this is controlled by setting the property datanucleus.cache.collections.lazy. Set it to true to use lazy loading, and set it to false to load the elements when the collection/map is initialised.
  • For the specific Collection/Map - this overrides the global EMF setting and is controlled by adding a MetaData <collection> or <map> extension cache-lazy-loading. Set it to true to use lazy loading, and false to load once at initialisation.

NontransactionalRead (Reading persistent objects outside a transaction)

NontransactionalRead has advantages and disadvantages in performance and data freshness in cache. In NontransactionalRead=true mode, the EM is able to read objects outside a transaction. The objects read are held cached by the EM. The second time a user application requests the same objects from the EM they are retrieved from cache. The time spent reading the object from cache is minimal, but the objects may become stale and no longer represent the database state. If fresh values need to be loaded from the database, then the user application should first call refresh on the object.

Another disadvantage of NontransactionalRead=true mode is that each operation opens a new database connection, although this cost can be minimised with the use of connection pools.

Reading persistent objects outside a transaction and EntityManager

Reading objects outside a transaction and EntityManager is a trivial task, but the manner in which it is performed can determine the application performance. The objective here is not to give you an absolute answer on the subject, but to point out the benefits and drawbacks of the possible solutions.

  • Use datanucleus.RetainValues=true. This is the default for JPA operation and will ensure that after commit the fields of the object retain their values (rather than being nulled).
  • Use detach method.
    EntityManager em = emf.createEntityManager();
    // retrieve the object in some way: query, find, etc
    Object obj = em.find(MyClass.class, id);
    // JPA's detach() returns void, so "obj" itself becomes the detached instance
    em.detach(obj);
    // read or change the detached object here
  • Use datanucleus.detachAllOnCommit=true. Dependent on the persistence context you may automatically have this set.
    Object obj = null;
    EntityManager em = emf.createEntityManager();
    em.getTransaction().begin();
    // retrieve the object in some way: query, find, etc
    obj = em.find(MyClass.class, id);
    em.getTransaction().commit(); // Object "obj" is now detached
    // read or change the detached object here

The bottom line is to not use detachment if instances will only be used to read values.

Fetch Control

When fetching objects you have control over what gets fetched. This can have an impact if you are then detaching those objects. With JPA the maximum fetch depth is -1 (unlimited). So with JPA you ought to set it to the extent that you want to detach, or better still make use of DataNucleus fetch groups to control the specific fields to detach.

Retrieval of object by identity

If you are retrieving an object by its identity and know that it will be present in the Level2 cache, for example, you can set the persistence property datanucleus.findObject.validateWhenCached to false and this will skip a separate call to the datastore to validate that the object exists in the datastore.


I/O consumes a huge slice of the total processing time. Therefore it is recommended to reduce or disable logging in production. To disable the logging set the DataNucleus category to OFF in the Log4j configuration. See Logging for more information.