Versioning Rails models with Neo4j.rb

I recently helped add versioning support to Neo4j.rb. Versioning only works for Rails models at the moment. To add versioning to your model, you'll need to include the Versioning module in the model class.

How it works

Versioning creates snapshots under a given Rails model instance. Note that snapshots aren't Rails models, but vanilla nodes. The relationship from the instance to each snapshot captures information about the snapshot, such as the version number, the class that is being versioned, and the id of the versioned model. When asked to retrieve a particular version of a model, the versioning module internally does a Lucene search using all of these parameters.

Versions

Full snapshots

Snapshots currently capture all the properties of an instance. A snapshot also links to the same nodes that the instance is linked to, with one important difference: the snapshot's relationships all have a 'version_' prefix. The prefix distinguishes 'regular' relationships from those created for versioning. Both incoming and outgoing relationships are recorded.
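
Here's a rough plain-Ruby sketch of the prefixing idea. It doesn't touch Neo4j.rb at all: the hash layout, the snapshot_of helper and the 'drives' / 'employs' relationship names are all made up for illustration. Only the 'version_' prefix comes from the description above.

```ruby
# Plain-Ruby illustration only; this is NOT the Neo4j.rb implementation.
VERSION_PREFIX = 'version_'

# Copy a relationship map, adding the versioning prefix to every type.
def prefixed(relationships)
  relationships.map { |type, nodes| ["#{VERSION_PREFIX}#{type}", nodes.dup] }.to_h
end

# Build a full snapshot: all properties, plus incoming and outgoing
# relationships pointing at the same nodes under prefixed names.
def snapshot_of(node)
  {
    properties: node[:properties].dup,
    outgoing: prefixed(node[:outgoing]),
    incoming: prefixed(node[:incoming])
  }
end

# Hypothetical node layout, used only for this sketch.
driver = {
  properties: { name: 'Driver' },
  outgoing: { 'drives' => ['Porsche'] },
  incoming: { 'employs' => ['Team'] }
}

snapshot = snapshot_of(driver)
```

The snapshot ends up pointing at the same 'Porsche' and 'Team' entries as the driver, but through 'version_drives' and 'version_employs', so a traversal over the regular relationship names never wanders into versioning data.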

Let's look at an example:

With version 1 of the driver, there are no links to any cars:

[Image: Version1]

With version 2, we link the driver to a sports car - a Porsche.

[Image: Version2]

Now, both the driver and the current version are linked to the Porsche.

With version 3, we link the driver to a second car - a Ferrari. The graph now starts getting complicated.

[Image: Version3]

The driver and the latest snapshot are both linked to two cars - the Porsche and the Ferrari.

Working with a version

To retrieve a snapshot, you'll need to call the version API.

The snapshot object responds to exactly the same properties as the instance. It also allows relationships to be traversed using the incoming / outgoing API. The prefix I mentioned above is handled transparently, so you can continue to use the same relationship names that you've used in your model.

Reverting to a snapshot

To revert to a particular version, simply call the revert API with the version number.

Revert restores properties and relationships, and creates a new version. In the driver / car example, here's what the graph looks like when we revert to version 2:

[Image: Version4]

This creates a new version (version 4), and ensures that the driver is linked to one sports car, just like it was with version 2.
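
To make the "revert creates a new version" behaviour concrete, here's a minimal in-memory sketch of the driver / car example. VersionedDriver and its methods are hypothetical names for this sketch, not the Neo4j.rb API, and only the linked cars are tracked.

```ruby
# Plain-Ruby illustration only; class and method names are hypothetical.
class VersionedDriver
  attr_reader :cars, :snapshots

  def initialize
    @cars = []
    @snapshots = []
    save_version # version 1: no cars yet
  end

  # Link a car and record a full snapshot of the current state.
  def link(car)
    @cars << car
    save_version
  end

  # Restore the state captured in the given version, then record the
  # restored state as a brand-new version (history is never rewritten).
  def revert_to(number)
    @cars = @snapshots.fetch(number - 1).dup
    save_version
  end

  def current_version
    @snapshots.size
  end

  private

  def save_version
    @snapshots << @cars.dup
  end
end

driver = VersionedDriver.new # version 1
driver.link('Porsche')       # version 2
driver.link('Ferrari')       # version 3
driver.revert_to(2)          # version 4
```

After the revert the driver is linked to the Porsche only, the current version is 4, and version 3 (with both cars) is still sitting in the history.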

Max versions

If you'd like to limit the maximum number of versions, use the max_versions declaration in your model class. This deletes the oldest version once the max versions threshold is exceeded.
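
The pruning rule can be sketched in a few lines of plain Ruby. VersionHistory is a hypothetical class for illustration, not part of Neo4j.rb. It just shows the oldest snapshot being dropped once the limit is exceeded, while version numbers keep increasing.

```ruby
# Plain-Ruby illustration only; VersionHistory is a hypothetical class.
class VersionHistory
  attr_reader :versions

  def initialize(max_versions)
    @max_versions = max_versions
    @versions = [] # oldest first
    @next_number = 1
  end

  # Store a full copy of the properties, then drop the oldest
  # snapshots until we are back within the limit.
  def record(properties)
    @versions << { number: @next_number, properties: properties.dup }
    @next_number += 1
    @versions.shift while @versions.size > @max_versions
  end
end

history = VersionHistory.new(2)
history.record({ cars: 0 })
history.record({ cars: 1 })
history.record({ cars: 2 }) # exceeds the limit, so version 1 is dropped
```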

Gotchas

In case any of the nodes linked to an instance are deleted, the versioning support makes no attempt to capture this. The relationship between the snapshot and such a node simply disappears.

In other words, when related nodes are deleted, reverting a node to its previous version will not recreate deleted nodes and relationships.

Future improvements

The versioning support doesn't currently capture properties on relationships, but that should be relatively straightforward to address.

Deleting a node does not always delete its snapshots (this can happen whenever snapshots have relationships to other nodes). On second thoughts, this sounds more like a bug than an improvement :-).

Finally, supporting delta updates would allow more efficient storage (like I said earlier, each snapshot currently stores all the properties of a model).

That's about it, feedback and comments welcome!

PS: In case you're wondering who Walter Plinge is, check out http://en.wikipedia.org/wiki/Walter_Plinge

Multitenancy with Neo4j.rb

A little while ago, I helped add multitenancy support to Neo4j.rb. This allows the same Neo4j database instance to support multiple tenants/customers simultaneously. There are several ways of achieving multitenancy, but the main aim of the current solution was to minimize the number of required changes to an existing codebase.

So here's how it all works.

Partitioning graphs

The Neo4j.rb multitenancy approach partitions the graph in such a way that all queries and traversals are scoped to a given tenant. This means that queries like Order.all, for example, would return different (and correct) results depending on what the current tenant is.

Once Neo4j supports sharding, it should be possible to adapt this scheme to take advantage of it.

Reference nodes

The reference node is a starting point in the graph space. All Neo4j graphs are connected by default, which basically means that nodes and relationships cannot exist in isolation. The multitenancy feature works by 'moving' the reference node to whatever the current tenant is.

The Neo4j.rb metamodel

Neo4j.rb stores type metadata in the graph. Let's consider an example where we have two models: Tenant and Country.

In the screenshot below, the home icon represents the reference node, which is the default starting point in the graph. From the reference node, each model type has an outgoing relationship, named after the model class. Neo4j.rb refers to the node at the end of this relationship as a Rule node. The 'all' rule node is connected to every instance, and its count property keeps track of the number of instances.

[Image: Ref_node]

This particular database has a single tenant instance...

[Image: Tenant_instances]

... and three countries. The _classname property stores the Rails model class name for a given node.

[Image: Country_instances]

Tenants

While creating new tenants, it's often necessary to set up data for each tenant. Here's one way of doing this:

In case it takes a while to set up the data for a client, it's worth doing the data setup in the background.

'Movable' reference nodes

Changing the reference node to a tenant ends up effectively partitioning the graph. All tenant-specific entities end up getting attached under the tenant, like so for Tenant 1:

[Image: Tenant_1]

... and for Tenant 2.

[Image: Tenant_2]

Setting the Threadlocal reference node

Neo4j.rb allows a reference node to be set for the current thread. A :before_filter method in your controller is a reasonable place to set the threadlocal reference node.
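
Here's a plain-Ruby sketch of why a per-thread reference node scopes queries like Order.all. RefNode and this toy Order class are hypothetical stand-ins that don't use Neo4j.rb. The with helper plays the role of a before filter followed by a reset at the end of the request.

```ruby
# Plain-Ruby illustration only; RefNode and Order are hypothetical stand-ins.
module RefNode
  DEFAULT = :default_ref_node

  # Thread.current[] is per-thread (strictly speaking, per-fiber) storage.
  def self.current
    Thread.current[:ref_node] || DEFAULT
  end

  def self.current=(node)
    Thread.current[:ref_node] = node
  end

  # Run a block scoped to a tenant and always restore the previous
  # reference node afterwards, even if the block raises.
  def self.with(node)
    previous = Thread.current[:ref_node]
    self.current = node
    yield
  ensure
    Thread.current[:ref_node] = previous
  end
end

class Order
  # One bucket of orders per reference node.
  STORE = Hash.new { |hash, ref_node| hash[ref_node] = [] }

  def self.create(name)
    STORE[RefNode.current] << name
  end

  def self.all
    STORE[RefNode.current].dup
  end
end

RefNode.with(:tenant_1) { Order.create('order A') }
RefNode.with(:tenant_2) { Order.create('order B') }

tenant_1_orders = RefNode.with(:tenant_1) { Order.all }
tenant_2_orders = RefNode.with(:tenant_2) { Order.all }
other_thread_ref = Thread.new { RefNode.current }.value
```

The same Order.all call returns a different list depending on the current reference node, and a freshly started thread falls back to the default reference node because it never saw the tenant set elsewhere.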

Lucene Indexing

By default, for every Rails model, Neo4j.rb creates Lucene indices using this format:

<TopLevelModulename>_<NextlevelModuleName>_<ModelClassName>-exact/fulltext.

The multitenancy work adds the tenant name as an additional dimension to the index name. Each tenant node contributes an index prefix. The prefix makes it possible to partition the Lucene index on a per-tenant basis.

<Tenant Name>_<TopLevelModulename>_<NextlevelModuleName>_<ModelClassName>

The net effect is that Lucene queries are scoped to the current tenant, except in the case of shared models, which are accessible across all tenants and use a single Lucene index.
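
The naming scheme can be sketched as a small helper. index_name is a hypothetical function written for this post, not something Neo4j.rb exposes, and the 'Shop::Billing::Order' model and 'acme' tenant are made-up examples. I'm also assuming that the -exact / -fulltext suffix carries over unchanged to the tenant-prefixed name.

```ruby
# Plain-Ruby illustration only; index_name is a hypothetical helper.
# Assumes the "-exact" / "-fulltext" suffix is kept for tenant indices.
def index_name(model_class_name, type: 'exact', tenant_prefix: nil)
  # Shop::Billing::Order becomes Shop_Billing_Order
  base = model_class_name.split('::').join('_')
  # Shared models pass no prefix and so keep a single, global index.
  base = "#{tenant_prefix}_#{base}" if tenant_prefix
  "#{base}-#{type}"
end

shared_index = index_name('Shop::Billing::Order')
tenant_index = index_name('Shop::Billing::Order', type: 'fulltext', tenant_prefix: 'acme')
```

Here shared_index is 'Shop_Billing_Order-exact' and tenant_index is 'acme_Shop_Billing_Order-fulltext', so two tenants never read from or write to the same index for a tenant-scoped model.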

Reference data / Shared models

What about data that's shared across tenants? Examples could be a list of valid currencies, or a list of countries. It doesn't make sense to replicate this data on a per-tenant basis. To handle these kinds of entities, you can declare the model's reference node to be the default reference node, which ensures that all instances of the model are accessible across all tenants.

Gotchas

  • Clearing the reference node.
    Neo4j.rb has Rack middleware that resets the reference node after every request. Without this reset, you could end up with stale / wrong reference nodes in Rails' multithreaded mode.
  • The 'too many open files' problem.
    Since a set of Lucene indices is created for each tenant, you can expect approximately 7 * N * T open files, where N is the total number of models and T is the total number of tenants. By default, Neo4j's Lucene IndexWriters are never closed (keeping the writers open improves performance). The default per-process open file limit is typically 1024 on Linux.
    I've submitted a pull request to the Neo4j database to fix this issue (see https://github.com/neo4j/community/pull/51). The fix involves using a configurable LRU cache for Lucene index writers/searchers.
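
To get a feel for how quickly the open files estimate grows, here's the arithmetic as a tiny Ruby sketch. The factor of 7 and the 1024 limit come from the text above. The counts of 10 models and 20 tenants are made-up numbers for illustration.

```ruby
# Back-of-the-envelope estimate only; 7 is the approximate factor quoted above.
FILES_PER_INDEX_SET = 7
LINUX_DEFAULT_LIMIT = 1024

# Approximate number of open files for N models and T tenants.
def estimated_open_files(models, tenants)
  FILES_PER_INDEX_SET * models * tenants
end

# Largest tenant count that stays within the limit for a given model count.
def max_tenants(models, limit = LINUX_DEFAULT_LIMIT)
  limit / (FILES_PER_INDEX_SET * models)
end

estimate = estimated_open_files(10, 20) # 7 * 10 * 20 = 1400
tenant_ceiling = max_tenants(10)        # 1024 / 70 = 14 (integer division)
```

With 10 models, the estimate crosses the 1024 limit somewhere between 14 and 15 tenants, which is why keeping every index writer open doesn't scale to many tenants.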