Database Integration
EJBCA relies on a shared database model, which means that any data two EJBCA nodes in a cluster shares must be in the common storage.
For this purpose EJBCA uses an SQL database. Not all SQL databases are however created equal and some are definitely more equal than others.
Technology used
Java Persistence API (JPA) is part of the Java Enterprise Edition (JEE) standard and abstraction for talking to the database. JPA is an abstraction layer on to of Java DataBase Connector (JDBC) API.
Just like JDBC we perform operations on the database via a DataSource. The DataSource is either configured in the application server with JDBC driver and injected or manually configured using the JPA provider in stand-alone CLI mode.
Using JEE for persistence and transaction management
We like to use JEE (Java Enterprise Edition) since it allows us to declare transaction boundaries instead of handling this pragmatically which becomes hairy when multiple modules of the application is involved.
Java Persistence API (JPA) is used to abstract away many aspects of database specific variants (and does so well, but not perfectly).
Java Persistence API and the Object Relation Mapping
Object Relation Mapping (ORM) is how Java objects called Entities are mapped to database columns. These mappings are defined in orm-ejbca-{database}.xml files included from persistence.xml. ORM is great since it allows us (theoretically) to work mostly with Java objects and not care so much about the SQL.
The Java Persistence API Query Language (JPQL) is a simplified query language that further tries to abstract the specifics of different SQL dialects making the application more portable.
For advanced operations, like reading presence of indexes we still need to use direct lower level JDBC calls.
Why make the extra effort of writing database agnostic code?
Because the AwesomeSQL DB that we rely on today might be bought up, perverted and/or killed in X years from now.
Keeping things "simple" (or at least agnostic) will allow us to migrate the customers data to another DB that happens to be the best at that point in time.
The Basics and Practical Guidelines
EJBCA relies on the Hibernate implementation as JPA provider even though we theoretically should be able to support others as long as we adhere to the standard.
We bundle the Hibernate JPA provider on application servers that does not provide it and use it in the EJBCA DB CLI.
If a new field is added to a Entity (mapped Java object), all these files must be updated and new table create scripts should be created for all databases.
What to do
Use UpgradableHashMap for all non-searchable fields and store it as a CLOB. XML Serialize using Base64 (or maybe UTF-8).
Use simple types like int, long and String
Set a default value for non-nullable fields in the Entity constructor.
Don't construct queries as Strings. Use Query.setParameter(String, Object) instead.
Mark methods in Entites that begin with "get" with @Transient if you don't intend to map it to database column.
Use transactions if you update or insert
Don't require transactions if you only read (but it can still support transactions)
A declared transaction is only invoked when a method is invoked through an EJB interface.
Update org.ejbca.core.ejb.DatabaseSchemaTest to verify that your mapping can hold what you think it can. This tests also acts as documentation for maximum allowed size of different fields.
Read the rest of this document if you intend to make changes.
What not to do
Don't use boolean data-type. Use an interger with value 0 (false) or 1 (true) instead. For example Ingres 9.x does not support BOOLEAN.
Don't use foreign keys. For example MySQL's MyISAM engine does not support foreign keys.
Don't use BLOB or try to store complex Java objects directly. JBoss Serilization in EJB 2.1 somewhat locked in the EJBCA user to the JBoss platform.
Don't expect an insert to fail (throw an Exception) in the middle of a transaction if it already exists. This will fail at commit-time.
Don't take Serializable isolation mode for granted. Different databases will be configured differently.
Store Date's as something other than "long" (in milliseconds).
Don't name database column after reserved words on _any_ database. We might support additional databases in the future…
Read the rest of this document if you intend to make changes.
Database connections and statements with direct JDBC .
In some advanced use cases where direct JDBC is required, it is very important to clean up properly to avoid resource leakage.
Use the Java 7 try-with-resource-pattern under these circumstances:
try (final Connection connection = dataSource.getConnection();) {
final DatabaseMetaData databaseMetaData = connection.getMetaData();
try (final ResultSet resultSetSchemas = databaseMetaData.getTables(null, null, null, null))
{
...
}
}
You cannot count on that JDBC updates of a database is available from the current transaction and vice versa. So don't mix JDBC rows operations and JPA Entity operations.
Writing database agnostic code with Java Persistence API
Rule #1: You can't ignore the realities of the underlying database
It might look like you are juggling nice POJOs with some annotations in basic examples, but you are not (only) really doing that. There is a database behind all those shiny POJOs, so read the rest.
JPA Criteria API is not great for understanding what will be executed
It is great for building dynamic and advanced queries, but a simple TypedQuery created as a String shows directly what will be executed without having to debug log the actual generated query from the CriteriaBuilder. Keep it simple unless you have to be super-flexible.
NEVER add untrusted input to the query String, always use setParameter(...) calls.
Always define the ORM for each database type
Each JPA provider (and version of the same) like Hibernate might map a Java type to a completely different database column type.
If we don't define the column type, we might end up with different database schemes for two customers that run different versions of the same application server even though they have the same database.
By having this well defined there will be no unpleasant automatic schema updates that prevents 100% uptime for the customer and we have an easier time trying to reproduce the customers environment when needed.
Don't use BLOBs for byte
Stick to the simple data types. For example Postgres will store them separated from the rest of the row data. When the row is removed, the BLOB will remain on storage.
Base64-encode the data. Store it as a CLOB. Enable compression in the database engine to get back the lost space if needed.
On Postgres, don't define the CLOB TEXT column as <lob/> (it will be stored of table exactly like we tried to avoid in the first place).
Don't use stored procedures
No. It wont be easy portable. Stick to more generic optimization techniques instead.
Avoid COUNT(*)
For example Postgres and Oracle will really count every single row in the table →Full table scan.
Avoid boolean / Boolean
Stick to the simple data types. Ingres followed an older version of the SQL standard and there is no boolean database type there. Use int / Integer with value (null,) 0 and 1 instead. Provide transient boolean getters and setters for your code.
Store empty Strings in nullable columns as NULL
Oracle and DB2 in Oracle compatibility mode will do this anyway, so if you are hacking away on another database you might not realize the NPEs you are creating otherwise.
Name JPA getters and setters for CLOBs zzzSomething
Yep. Really. Oracle requires inserts of CLOBs to be last of the prepared statement and Hibernate will (currently 2017 at least) sort by these names. Still define the column name as "something" and provide getSomething() and setSomething() accessors for the rest of the code to use.
Check that your query execution plan is database agnostic
Don't assume that a query execution plan is database agnostic from an observation on your machine. It might very well be dependent on the how the database is currently populated and the specific type of database prioritizes.
Don't assume that the locale of the database is known
Sorting using ORDER BY might not yield the result you expect. Doing toLower() on data in the hope of being able to do case-insensitive search might fail in an Unicode world.
JPA queries might scan all the rows you want it to ignore
Using setFirstResult on a JPA query might still scan all the rows you want it to ignore. Pagination with proper cleanup or sorting with "greater than" might be (less ideal, but anyhow) working solutions.
Don't use pagination if the database client will hit different EJBCA nodes with consecutive requests for next page.
Limits for how wide a row and/or index may be
There are limits for how wide a row and/or index may be in bytes on certain databases. If data doesn't need to be searchable, put it in a CLOB. It is not uncommon that a "char" might map to 3 bytes.
MySQL NDB cluster has historically been the most picky of the supported database, which makes it a good candidate to check for restrictions first.
Calculating checksums on database rows
When calculating checksums on database rows, perform calculation on actual data written. This is not as much a database question, commonly used for our database protection pattern in JPA entities.
Do the checksum over the actually written or read CLOB content instead of Java objects like Map that might not be as deterministic between Java versions as we would hope.
All dates are Unix epoch time 64-bit numbers
Stick to the simple data types. The code can easily work with this and there is no need to special time database column types.
Read and write the ORM comments section
This might inform you about additional quirks and limitations for proper mapping on the specific database type.
Also, please don't invent new type of column mappings unless you have properly researched the subject. It helps if the JPA provider has rather similar mapping to our custom one when it does validation and parsing of data.
Store an upgradeable Java object as a CLOB
An upgradeable Java object stored as a CLOB is easier to upgrade than the schema. If a fields doesn't have to be searchable, it can easily be added or removed from inside a CLOB without schema changes that might be painful for the customers.
(And if we have a policy to only make schema changes in major releases, this will allow us to add new data in minor releases as well.)
Avoid words that are reserved on any database
Don't use words that are reserved on any database. It is usually quite easy to find good list of what this might be. For example a working link 2017-03-31 is https://www.drupal.org/docs/develop/coding-standards/list-of-sql-reserved-words .
Making efficient use of the database from the code
Avoid foreign key mapping on database level
Avoiding foreign key mapping on database level makes Create, Read, Update, Delete (CRUD) simpler. In a perfect world it would make things easier, but perhaps you figure out that you really want to orphan an entry (let say a log entry you want to keep pointing to an object you want to remove).
Even if orphaning seems like a bad idea when designing a feature, keeping a virtual foreign key relation in the code allows this to be more easily changed in the unknown future.
Some databases engines like TokuDB or MySQL's MyIsam don't support foreign keys (but MyIsam does not support transactions either which makes it rather useless for advanced data operations.)
Don't normalize too hard
If you end up doing an additional query to another table each time you fetch something, perhaps you should just have the data in the original row in the fist place and save a round-trip. Mind however that you need to keep replicas of data in sync when they change.
Don't add more indexes than you really need
Whenever you do an update, your indexes over the updated data also needs to be updated. This makes updates slower.
Indexes might also share RAM with caches so consider if scanning through a few recently used rows that fit into memory is better a lookup table for fetching entries from disk. (A bit exaggerated.)
There are two ways to use indexes
Narrow down the number of rows (primary keys) for the rest of the query to search through or act an extra value lookup table (where no row lookup by primary key is needed).
Make sure you know why you added an index and why you selected the order of the columns in the way that you did.
Be optimistic when locking updates
Most of the time you wont have conflicting updates. Use optimistic locking (fail if there was concurrent conflicting updates) instead of blocking others from reading the stuff you want to involve in you transactions. No-one likes to wait.
Don't make updates that don't change anything
Updating an object in the beginning of a transaction and then updating it back in the end will is a good way to waste locking and increases the probability that you will have conflicting updates.
Only tell the database about changes that actually change something and only do it once if you can.
Read objects once
Even if the object will be cached, it probably means that you are doing something wrong or have a problem with the module architecture if you need to ask the database for the same object twice.
And read the full object right away (eager loading) unless you are really sure that you wont use the rest so you don't end up with another round-trip to fetch the "last" pieces data.
Be tolerant and idempotent
If something is gone when you are trying to delete it, was it really a failure? If you make an update but the database already had the correct value, was it a failure? Sometimes yes (think atomic counters).
Most of the time however this is perfectly fine and instead of telling the EntityManager to delete a specific object, you can do an update query that removed any object that has this unique primary key.
Note that partial update queries are harder or impossible when each row is signed. But at least it is nice to not fail hard if the object is already updated with a correct value.
Don't start a transaction unless you intend to use it
If you are just going to read a value you are not going to mandate that it happens in a transaction.
Make sure that you are transactional when needed
If you are going to make changes that depend on input, your want to make sure that you do require a transaction and those readers support transactions.
Don't spread your operations over a lot of small sub-transactions
If you make many small sub-transactions and you have a database system without a battery backed-up cache, you will need to wait for each such transaction to be flushed to disks or cluster. Making all the changes in a single transaction would only require the client to wait for this operation once.
Interacting with external non-transactional systems
If you involve external systems in a transaction, you are responsible for rolling back changes if the overall transaction is rolled back.
The sane thing to do is to not start the transaction at the border to the client, but instead in an internal call where the outcome can be monitored. The intent to interact with such external systems should be committed an queue as part of the inner transaction. If the inner transactions succeeds each such intent could be performed by background threads and taken off the intent-queue. The main call can choose weather to wait for these to complete or let them happen eventually/asynchronously.
Updates might lead to fragmentation of storage
Some databases might never modify written storage. When an existing row is "modified" it simply means that a new row is added marked with the id of the transaction that added it. This makes it easy for different transactions to include the line where the transaction id is confirmed and exclude those like where it isn't. The old row is then eventually garbage collected. Similar reasoning can happen when compression is used and the new data exceeds the existing allocated row space with fragmentations as a result.
This can be avoided by trying to organize data so that static data belonging to an object is stored in one table and dynamic data is organized into another table. This works better if the object will live for a long time and the dynamic part of the object is small. The dynamic part can even be made append-only with timestamps if lookup of the latest item is made fast (few items or indexes).
Having append only data with timestamps can also be used to tolerate merge after split-brain scenarios (or active-active asynchronous setups) where the application can determine from an objects history what the actual state of things is.
Be human readable if you can afford it
For debug and potential use by future clients, it is nice if non-searchable is readable in its raw (CLOB) form.
You should still encode non-ASCII UTF-8 Strings as Base64 in for non-searchable data to ensure what is written in one version will be correctly read by another version. (Remember that people upgrade their systems and might forget to change the specified encoding in the JDBC URL or other parts of the environment that is outside the control of the application.) This does not mean that you need to Base64-encode the entire CLOBs so that numbers and ASCII strings (like property keys and enums) are unreadable from looking at the raw data.
Opportunistic base64-encoding would be your friend and store all values in the CLOB in a semi-human-readable data format (like JSON or XML).
Making 100% uptime compatible code
The idea to make this work is quite simple, but hard to implement.
It is allowed to add completely new tables
It is allowed to add completely new columns that are nullable (this requires that the database type can add new columns to large tables without locking the table for ongoing operations)
Old data that is no longer used should simply be ignored if it is nullable, but needs to be mapped and set to some default value if it is required by the schema.
Strategies for upgrading of how data is used or when existing functionality is extended is described below.
Upgrading format of data X from X1 to X2
Note that these strategies applies both to searchable columns and application-serialized CLOB data.
The strategies depend on the following:
Old nodes read and writes in old format X1. (Nothing we can really do about this.)
New nodes can read and write data both old format X1 and new format X2.
Algorithm "Also write in legacy format"
This is suitable for large data sets where migration of the data might not be feasible. There is need to wait for / check if all nodes have been upgraded.
Code to be able to read old format needs to be kept until data objects are no longer used (or can be assumed to have never been written in the old format.)
New nodes write data in new format X2 and also old format X1
New nodes aways try to read the new format X2 first
if X2 data is present, ignore the data in X1 format (faster once everything is upgraded)
if no X2 data is present, read the data as format X1
Algorithm "Tolerate and keep legacy format"
This is suitable for large data sets where migration of the data might not be feasible.
Code to be able to read old format needs to be kept until data objects are no longer used (or can be assumed to have never been written in the old format.)
New nodes write data in old format X1 until all nodes are running the new version.
New nodes aways try to read the new format X2 first
if X2 data is present, ignore the data in X1 format (faster once everything is upgraded)
if no X2 data is present, read the data as format X1
Algorithm "Migrate everything to new format"
This is suitable for small data sets where migration of the data is feasible.
Code to be able to read old format can be phased out when upgrading from a version where writing this was used is no longer supported.
New nodes write data in new format X2. New nodes also write data in old format X1 and until all nodes are running the new version.
New nodes aways try to read the new format X2 first
if X2 data is present, ignore the data in X1 format (faster once everything is upgraded)
if no X2 data is present, read the data as format X1
When last node is upgraded all existing rows are read and written in the new format.
Setting up development environments for the supported databases
Normally EJBCA expects the "Read Committed" transaction isolation level. Make sure that the database is set up for this. Too strict and you might end up in dead-locks, too weak and your transactions make make calculations on data that will be rolled back.
Below is a set of (by now pretty outdated) instructions for how to setup test environments for different database from the old EJBCA wiki. Please replace this with updated, working and proper confluence pages if you have to do any of these setups.