This post introduces the functionality available through Berkeley DB and the concepts behind it.
Berkeley DB is an open source database management system, but it differs in two major ways from other open source DBMSs such as MySQL and Postgres:
- Berkeley DB is NOT a relational DBMS. Instead of accessing data through SQL statements, it offers an API: a set of functions for accessing and modifying data, and for managing the database.
- Berkeley DB is embedded in the application that uses it. It runs in the same address space, so no inter-process communication mechanism is required, either local (Unix pipes, …) or over the network. The database can be accessed concurrently by several processes, or by several threads inside a process, and the library transparently manages low-level services such as locking, transactions and the shared buffer.
These two differences allow Berkeley DB to outperform relational database implementations in most cases, making it the preferred choice when large amounts of data need to be processed and the application does not need the more sophisticated data access that only a relational DBMS provides.
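To make the "function calls instead of SQL" idea concrete, here is a minimal sketch of an embedded key/value store. It deliberately does not use Berkeley DB: it uses `dbm.dumb` from the Python standard library as a stand-in, and the file name `demo` and the keys are made up for illustration. The point is only the shape of the access pattern: the store is opened by a library call inside the process, and records are read and written by key.

```python
import dbm.dumb
import os
import tempfile


def demo_embedded_store():
    """Store, fetch and delete records with plain function calls, no SQL."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo")  # hypothetical database file name
        # The store lives inside this process: opening it is a library call,
        # not a connection to a separate database server.
        with dbm.dumb.open(path, "c") as db:
            db[b"user:1"] = b"alice"   # put
            db[b"user:2"] = b"bob"
            value = db[b"user:1"]      # get by key
            del db[b"user:2"]          # delete by key
            remaining = sorted(db.keys())
        return value, remaining


value, remaining = demo_embedded_store()
```

Berkeley DB's own API is richer than this (cursors, transactions, several storage structures), but the basic put/get/delete-by-key model is the same.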
Berkeley DB is distributed as a library with interfaces to several programming languages, including C, C++, Java, Perl, Tcl, Python and PHP. Besides the standard Berkeley DB interface for Java, there is also a Berkeley DB Java Edition, which is 100% Java.
The library runs on most platforms, including Linux distributions, Windows and real-time embedded operating systems, on 64-bit as well as 32-bit architectures.
There are four different storage structures that can be chosen for a Berkeley DB database:
In a hash table, each record is identified by a unique key. Hash tables are usually preferred for very large databases that need predictable access and update times.
Btrees make it possible to select all existing records whose keys fall within a given range. The tree structure keeps records with close key values physically close to each other.
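The range selection that an ordered structure makes possible can be sketched in a few lines. This is not a Btree and not the Berkeley DB API; it is a toy store (the class name `SortedStore` is invented here) that keeps keys in a sorted list, which is enough to show why ordered keys allow "all records between two keys" to be found without scanning everything.

```python
import bisect


class SortedStore:
    """Toy ordered key/value store; a stand-in for a Btree, not Berkeley DB."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while keeping order
        self._values[key] = value

    def range(self, low, high):
        """Return (key, value) pairs with low <= key <= high, in key order."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_right(self._keys, high)
        return [(k, self._values[k]) for k in self._keys[start:end]]


store = SortedStore()
for key, value in [("cherry", 3), ("apple", 1), ("banana", 2), ("date", 4)]:
    store.put(key, value)

in_range = store.range("b", "d")  # keys from "b" up to and including "d"
```

A hash table cannot answer this kind of query efficiently, because hashing scatters neighbouring keys across the table.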
Record-number-based storage is provided for applications that identify records by a record number. It is like a hash table where the key is the record number. Berkeley DB automatically generates the record number when a new record is stored in the database.
Queues are suitable for applications that create records and then need to process them in the order they were created. A good example is an online purchasing system: orders can enter the system at any time, but they should be processed in the order they were received.
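The purchasing example can be sketched as follows. This is an in-memory illustration only, with invented names (`OrderQueue`, `append`, `consume`); it is not the Berkeley DB queue interface. It shows the two ideas from the text: each new record gets an automatically generated record number, and records are consumed in arrival order.

```python
from collections import deque
from itertools import count


class OrderQueue:
    """Toy first-in, first-out record queue; not the Berkeley DB API."""

    def __init__(self):
        self._next_recno = count(1)  # record numbers are assigned automatically
        self._pending = deque()

    def append(self, record):
        """Add a record at the tail and return its record number."""
        recno = next(self._next_recno)
        self._pending.append((recno, record))
        return recno

    def consume(self):
        """Remove and return the oldest (recno, record), or None if empty."""
        return self._pending.popleft() if self._pending else None


orders = OrderQueue()
orders.append("order from Ana")
orders.append("order from Ben")
first = orders.consume()
second = orders.consume()
```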
Data Management services
Berkeley DB implements concurrency, transaction and recovery services.
Several operations can be grouped into a single transaction, which can be either validated (committed) or cancelled (rolled back) atomically. Berkeley DB implements a technique called “two-phase locking” to ensure that concurrent transactions are isolated from one another, and a technique called “write-ahead logging” to guarantee that committed transactions survive application, system or hardware failures. When an application opens a database, it can ask the library to perform a recovery.
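The idea behind write-ahead logging and recovery can be illustrated with a very small model. Everything here is a simplifying assumption: the class `ToyTxnStore` and its methods are invented, the "log" is a Python list rather than a file on disk, and locking is not modelled at all. What the sketch does show is the rule that every change is logged before it is applied, and that recovery replays only the transactions that have a commit record.

```python
class ToyTxnStore:
    """Toy transactional store with a write-ahead log; not Berkeley DB."""

    def __init__(self, log=None):
        self.log = [] if log is None else log  # stands in for a log file
        self.data = {}                         # committed state
        self._pending = {}                     # txn_id -> [(key, value), ...]
        self.recover()

    def put(self, txn_id, key, value):
        # Write-ahead: the change is logged before it touches the data.
        self.log.append(("put", txn_id, key, value))
        self._pending.setdefault(txn_id, []).append((key, value))

    def commit(self, txn_id):
        self.log.append(("commit", txn_id))
        for key, value in self._pending.pop(txn_id, []):
            self.data[key] = value

    def rollback(self, txn_id):
        # Pending changes are dropped and never reach the data.
        self._pending.pop(txn_id, None)

    def recover(self):
        """Rebuild the data from the log, keeping committed transactions only."""
        committed = {entry[1] for entry in self.log if entry[0] == "commit"}
        self.data = {}
        for entry in self.log:
            if entry[0] == "put" and entry[1] in committed:
                _, _, key, value = entry
                self.data[key] = value


store = ToyTxnStore()
store.put(1, "a", "1")
store.put(1, "b", "2")
store.commit(1)
store.put(2, "a", "overwritten")  # transaction 2 never commits: simulated crash

recovered = ToyTxnStore(log=store.log)  # reopening runs recovery from the log
```

After recovery, both writes of transaction 1 are present and the write of transaction 2 is absent, which is the all-or-nothing behaviour described above.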
To improve performance, the application can also disable the services it does not need. For instance, a single-user application whose accesses never happen concurrently could disable the locking and transaction subsystems.