CodernityDB - a Fast Pure Python NoSQL Database

Jedrzej Nowak, Codernity, https://codernity.com

CodernityDB is an open source, pure Python, fast, multi-platform, schema-less NoSQL database with no third-party dependencies.

You can also call it a more advanced key-value database, with multiple key-value indexes in the same engine (these are not the indexes you probably know from SQL databases, and it is certainly not a "simple key/value store"). What do we mean by an advanced key-value database? Imagine several "standard" key-value databases inside one database. You can also think of it as a database with one primary index and several secondary indexes. This layout, combined with programmable database behavior, opens up quite a lot of possibilities, some of which are described in this article. We start with a very quick overview of the CodernityDB architecture; more detailed descriptions follow in later sections.

Web Site: http://labs.codernity.com/codernitydb/
Version described: 0.4.2
System requirements: Python 2.6-2.7
License & Pricing: Apache 2.0
Support: directly via db [at] codernity.com and https://bitbucket.org/codernity/codernitydb

General information

To install CodernityDB just run:

pip install CodernityDB

And that's all.

CodernityDB is fully written in Python 2.x. No compilation needed at all. It's tested and compatible with:

CPython 2.6, 2.7
PyPy 1.6+ (and probably older)
Jython 2.7a2+

It is mainly tested on Linux, but it should work anywhere Python runs. You can find more details about the test process in the "how it's tested" section of the documentation.

CodernityDB is one of the projects developed and released by Codernity, so you can contact us directly at db [at] codernity.com (please check the FAQ section first).

Do you want to contribute? Great! Just fork our repository (https://bitbucket.org/codernity/codernitydb) on Bitbucket and send a pull request. It can't be easier! CodernityDB and all related projects are released under the Apache 2.0 license. To file a bug, please also use Bitbucket.

CodernityDB index

What is that mysterious CodernityDB index?

First, you have to know that there is one main index called id. The CodernityDB object is a kind of "smart wrapper" around different indexes, and it cannot work without the id index. That is the only requirement. If this is hard to picture, you can treat the id index as a key-value database (in fact, it is a key/value store). Each index indexes data by key and associates a value with it. When you insert data into the database, you always insert it into the main id index; the database then passes your data to all other indexes. You can't insert directly into a secondary index. The data in the secondary indexes is associated with the primary one, so there is no need to duplicate the stored data inside the index (pass with_doc=True when querying an index to get the full document back). The exception is when you really care about performance: keeping the data in the secondary index as well avoids the background query to the id index. This is also why it pays to define your indexes from the beginning: if you add or change an index when records are already in the database, you will need to re-index it.
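The relationship between the id index and a secondary index can be pictured with a small toy model. This is only a sketch: plain dictionaries stand in for the index files, and the names ToyDatabase, insert and get_by_x are made up for illustration and are not part of the CodernityDB API.

```python
# Toy model only: plain dicts stand in for index files; not the CodernityDB API.
import uuid


class ToyDatabase(object):
    def __init__(self):
        self.id_index = {}   # primary index: _id -> full document
        self.x_index = {}    # secondary index: x value -> list of _id references

    def insert(self, doc):
        doc = dict(doc, _id=uuid.uuid4().hex)
        # Every insert goes to the primary index first ...
        self.id_index[doc['_id']] = doc
        # ... and is then passed to the secondary index, which keeps only a reference.
        if 'x' in doc:
            self.x_index.setdefault(doc['x'], []).append(doc['_id'])
        return doc['_id']

    def get_by_x(self, x, with_doc=False):
        results = []
        for _id in self.x_index.get(x, []):
            entry = {'_id': _id, 'key': x}
            if with_doc:
                # The extra lookup in the primary index that with_doc=True implies.
                entry['doc'] = self.id_index[_id]
            results.append(entry)
        return results


db = ToyDatabase()
first_id = db.insert({'x': 15, 'name': 'a'})
db.insert({'name': 'no x here'})
hits = db.get_by_x(15, with_doc=True)
```

The secondary index holds no document data of its own here, which is why fetching the document costs one more lookup in the primary index.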

A CodernityDB index is nothing more or less than a Python class that is added to the database. You can compare it to a read-only table from SQL databases, or to a View from CouchDB. Here is an example of a very simple index that indexes data by the x value:

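from CodernityDB.hash_index import HashIndex


class XHashIndex(HashIndex):

    def __init__(self, *args, **kwargs):
        # 'I' declares the key as an unsigned integer
        kwargs['key_format'] = 'I'
        super(XHashIndex, self).__init__(*args, **kwargs)

    def make_key_value(self, data):
        # Index the record by its 'x' value; skip records without one
        x = data.get('x')
        if x is None:
            return None
        return x, None

    def make_key(self, key):
        return key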

As you can see, it is a very simple Python class. Nothing non-standard for a Pythonista, right? Having such an index in the database allows you to query for records that have an "x" value. The important parts are make_key_value, make_key and key_format. The make_key_value function can return:

value, data: data has to be a dictionary, record will be indexed
value, None: no data associated with that record in this index (except main data from id index)
None: the record will not be indexed by this index
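The three return forms listed above can be tried out without a database. The sketch below uses a standalone make_key_value function and a hypothetical index_record helper that applies the result to a plain dictionary; neither is CodernityDB code, they only mirror the rules from the list.

```python
def make_key_value(data):
    x = data.get('x')
    if x is None:
        return None               # record is skipped by this index
    if x % 2 == 0:
        return x, {'even': True}  # key plus a small dict stored in the index
    return x, None                # key only; the data stays in the id index


def index_record(index, doc_id, data):
    """Hypothetical helper: apply the result the way the list above describes."""
    result = make_key_value(data)
    if result is None:
        return False
    key, value = result
    index.setdefault(key, []).append((doc_id, value))
    return True


index = {}
index_record(index, 'a', {'x': 2})
index_record(index, 'b', {'x': 3})
indexed_c = index_record(index, 'c', {'y': 1})
```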

You can find a detailed description in the documentation. And don't worry, you will not need to add indexes every time you want to use the database. CodernityDB saves them on disk (in the _indexes directory) and loads them when you open the database.

Please look closer at the class definition: you will find there that our index is a subclass of HashIndex. In CodernityDB we have currently implemented:

* Hash Index - Hash map implementation

Pros

Fast
"Simple"

Cons

Records are not in the order of insert / update / delete but in random order
Can be queried only for given key, or iterate over all keys

* B+Tree Index (called also Tree index, because it's shorter) - B+Tree structure implementation

Pros

Can be queried for range queries
Records are in order (depending on your keys)

Cons

Slower than Hash based indexes
More "complicated" than Hash one
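The trade-off between the two index types comes down to what kind of lookup the structure supports. The sketch below is not a B+Tree and not CodernityDB code; it uses a dict for the hash-style lookup and a sorted list with the standard bisect module as a stand-in for an ordered structure, just to show why ordered keys make range queries possible.

```python
import bisect

records = [(7, 'g'), (2, 'b'), (9, 'i'), (4, 'd')]

# Hash-style lookup: exact key only, no useful ordering between keys.
hash_index = dict(records)

# Tree-style lookup: keys are kept sorted, so a key range is a contiguous slice.
sorted_keys = sorted(key for key, _ in records)


def range_query(start, end):
    """Return keys with start <= key <= end, in key order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


exact = hash_index[4]
in_range = range_query(3, 8)
```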

You should take some time to decide which index fits your use case (or use cases). Currently you can define up to 255 indexes. Please keep in mind that having more indexes slows down insert / update / delete operations, but it doesn't affect get operations at all, because a get is served directly from the index.

You can perform the following basic operations on indexes:

* get - single get
* get_many - get more records with the same key, or within a key range
* all - get all records in a given index

Because it is quite common to associate more than one record in a secondary index with one record in the primary index, we implemented something called a Multikey index.
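The idea of several secondary-index entries pointing at one primary record can be sketched as follows. This is a conceptual illustration only: the make_keys function and the dictionaries are hypothetical and do not reflect how the Multikey index is actually implemented or declared.

```python
def make_keys(data):
    """Hypothetical: return every key a document should appear under."""
    return set(data.get('tags', []))


# Stand-in for the primary (id) index.
primary = {
    'doc1': {'title': 'intro', 'tags': ['python', 'nosql']},
    'doc2': {'title': 'trees', 'tags': ['python']},
}

# One primary record may produce several entries in the secondary index.
multikey_index = {}
for doc_id, doc in primary.items():
    for key in make_keys(doc):
        multikey_index.setdefault(key, set()).add(doc_id)
```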

Writing a whole Python class just to change some part of a method, like indexing y instead of x, would not be very user friendly. So we created something we call IndexCreator. It is a kind of meta-language that lets you create simple indexes faster and much more easily. Here is an example of exactly the same XHashIndex:

name = MyTestIndex
type = HashIndex
key_format = I
make_key_value:
x, None

When you add that index to CodernityDB, the custom logic behind IndexCreator generates the Python code from it. We also created helpers for common things that you might need in your index (see the IndexCreator docs), so once you get used to it, it is pretty straightforward.

As you may have already noticed, you can split data into separate "tables/collections" with CodernityDB indexes (see Tables, Collections in the docs). All you need is a separate index for every table / collection that you want to have.

You should avoid operating directly on indexes. Always run index methods / operations via the Database object, like this:

db.get('x', 15)
db.run('x', 'some_funct', *args, **kwargs)

Usage

You should now have a basic knowledge of CodernityDB internals, and you may wonder how easy it is to use CodernityDB. Here is an example:

#!/usr/bin/env python

from CodernityDB.database import Database

def main():
    db = Database('/tmp/tut1')
    db.create()

    for x in xrange(100):
        print db.insert(dict(x=x))

    for curr in db.all('id'):
        curr['x'] += 1
        db.update(curr)

if __name__ == '__main__':
    main()

This is a fully working example. Adding the index from the previous section lets us do, for example:

print db.get('x', 15)

For detailed usage please refer to the quick tutorial on our web site.

Index functions

You can easily add index-side functions to a database. While adding them to an embedded database might make no sense for you, adding them to the server version is highly recommended. These functions have direct access to the database and index objects. You can, for example, define a function for the x index:

def run_avg(self, db_obj, start, end):
    l = []
    gen = db_obj.get_many(
        'x', start=start, end=end, limit=-1, with_doc=True)
    for curr in gen:
        l.append(curr['doc']['t'])
    return sum(l) / len(l)

Then, when you execute it with:

db.run('x', 'avg', 0, 10)

You get the answer directly from the database. Please keep in mind that changing the code of these functions doesn't require re-indexing your database.

Server version

CodernityDB is an embedded database engine by default, but we also created a server version, along with CodernityDB-PyClient, which allows you to use the server version without any code changes except:

from CodernityDBPyClient import client
client.setup()

Those two lines will patch further CodernityDB imports to use the server version instead. You can migrate from the embedded version to the server version in seconds (plus the time needed to download requirements from PyPI). The Gevent library is strongly recommended for the server version.

Future of CodernityDB

We're currently working on, or have already released (depending on when you read this article), the following features:

TCP server that comes with a TCP client, working exactly the same way as the HTTP one
Permanent changes index, used for "simple" replication for example
Message Queue system (single guaranteed delivery with confirmation)
Change subscription / notification (subscribe to selected database events)

If these features are not yet released and you want to have them before the public release, send us an email and tell us what you are interested in.

For advanced users

ACID

CodernityDB never overwrites existing data. The id index is always consistent. Other indexes can always be restored or refreshed from it (the CodernityDB.database.Database.reindex_index() operation).

At any given time, just one writer is allowed to write into a single index (update / delete actions). Readers are never blocked. A write is first performed on the storage, and then on the index metadata. After every write operation, the index flushes the storage and metadata files. This means that in the worst case (power loss during a write operation) the previous metadata and storage information will still be valid. The database doesn't allow multiple-object operations and has no support for a typical transaction mechanism (like SQL databases have), but a single-object operation is fully atomic. To handle multiple updates to the same document we use the _rev field (as CouchDB does), which tells us the document version. When the _rev does not match the one in the database, the write operation is refused. There is also nothing like a delayed write in the default CodernityDB implementation: after each write, internals and file buffers are flushed, and only then is the write confirmation returned to the user.
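The _rev check is a form of optimistic concurrency control, and its effect can be shown with a small in-memory sketch. Everything here is an assumption made for illustration: ToyStore and RevisionConflict are invented names, and the revision is a simple integer counter rather than whatever format CodernityDB really uses.

```python
class RevisionConflict(Exception):
    """Raised when a write carries an outdated _rev."""


class ToyStore(object):
    def __init__(self):
        self._docs = {}
        self._next_id = 0

    def insert(self, data):
        self._next_id += 1
        doc = dict(data, _id=self._next_id, _rev=1)
        self._docs[doc['_id']] = doc
        return dict(doc)

    def update(self, doc):
        current = self._docs[doc['_id']]
        # Refuse the write when the caller's copy is older than the stored one.
        if doc['_rev'] != current['_rev']:
            raise RevisionConflict('stale _rev, re-read the document first')
        new_doc = dict(doc, _rev=current['_rev'] + 1)
        self._docs[doc['_id']] = new_doc
        return dict(new_doc)


store = ToyStore()
original = store.insert({'x': 1})
stale_copy = dict(original)

fresh = store.update(dict(original, x=2))   # accepted, _rev moves forward

try:
    store.update(dict(stale_copy, x=3))     # still carries the old _rev
    conflict = False
except RevisionConflict:
    conflict = True
```

The second writer has to re-read the document and retry, which is how two concurrent updates are kept from silently overwriting each other.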

CodernityDB does not sync kernel buffers with the disk itself; it relies on the system/kernel to do so. To be sure that data is written to the disk, please call fsync(), or use CodernityDB.patch.patch_flush_fsync() to always call fsync when flush is called after data modification.
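The difference between flush and fsync is easy to miss, so here is a minimal standard-library sketch of the two steps on an ordinary file. It does not touch CodernityDB; durable_append and the temporary file path are made up for the example.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append bytes, then push them through both buffer layers."""
    with open(path, 'ab') as handle:
        handle.write(payload)
        handle.flush()             # Python's buffer -> kernel buffers
        os.fsync(handle.fileno())  # kernel buffers -> storage device


path = os.path.join(tempfile.mkdtemp(), 'storage.bin')
durable_append(path, b'record-1\n')
durable_append(path, b'record-2\n')
with open(path, 'rb') as handle:
    contents = handle.read()
```

Calling fsync on every write costs throughput, which is a reasonable explanation for why it would be opt-in rather than the default.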

Sharding in indexes

If you expect that one of your database indexes (id is the most common candidate) might grow bigger than 4GB, you should shard it. This means that instead of having one big index, CodernityDB splits the index into parts while keeping the API the same as for a single index. It gives you about a 25% performance boost for free.
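The general idea of sharding behind a single-index API can be sketched in a few lines. Note the assumptions: ShardedIndex is an invented class, dictionaries stand in for the index parts, and choosing the shard by CRC32 of the key is only one possible scheme, not necessarily the one CodernityDB uses.

```python
import zlib


class ShardedIndex(object):
    def __init__(self, shard_count=4):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        return self.shards[zlib.crc32(key.encode('utf-8')) % len(self.shards)]

    def insert(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key)[key]


index = ShardedIndex(shard_count=4)
for number in range(100):
    index.insert('doc-%d' % number, number)
```

Callers only see insert and get; which part holds a given key is an internal detail.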

Custom storage

If you need to, you can define your own custom storage. The default storage uses marshal to pack and unpack objects. It should be the best choice for most use cases, but remember that it can serialize only basic types. Implementing a custom storage is very easy. For example, you can implement a storage that uses Pickle (or cPickle) instead of marshal; then you will be able to store your custom classes and build some fancy object store. Anything is possible, in fact. If you prefer, you can implement a remote storage. The sky is the limit there. Another reason might be implementing a secure storage: you can define a storage that encrypts its contents and get a transparent encrypted storage mechanism. No one without access to the key will be able to decrypt it.
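The limitation of marshal mentioned above can be seen directly with the standard library. The sketch below does not define a CodernityDB storage class; it only compares the two serializers on a basic dictionary and on an instance of a made-up Point class.

```python
import marshal
import pickle


class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y


# Basic types round-trip through marshal without trouble.
basic = {'x': 15, 'tags': ['a', 'b']}
basic_roundtrip = marshal.loads(marshal.dumps(basic))

# Instances of custom classes are rejected by marshal ...
point = Point(1, 2)
try:
    marshal.dumps(point)
    marshal_ok = True
except ValueError:
    marshal_ok = False

# ... but pickle can store and restore them.
restored = pickle.loads(pickle.dumps(point))
```

Keep in mind that unpickling data from an untrusted source is unsafe, so a pickle-based storage only makes sense for data you control.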

This article was originally published in the Fall 2013 issue of Methods & Tools
