BRAHMS

Main-memory storage for RDF/S

BRAHMS is a fast main-memory RDF/S store, capable of storing, accessing and querying large ontologies. It does not use any database backend; all data is kept in main memory. It is implemented in C++ for high performance and strict memory control.

Purpose

BRAHMS was created as a framework for testing graph search algorithms in the SemDis project, which needs a store that can execute graph search algorithms quickly on large RDF data.

The idea for a system like BRAHMS came after testing other RDF/S stores (such as Jena, Sesame and Redland) with their main-memory models. They work very well for smaller ontologies using a memory-based model, but that model is not sufficient for large ontologies. The only available alternative is a DB-based model, but its access time was not sufficient for running graph search algorithms on big ontologies.

This created a need for a system capable of handling big ontologies (hundreds of megabytes to gigabytes) while still offering fast access and querying. The only solution was a memory-based model, as all disk and database accesses are much slower. The model has to be memory efficient, as the ontology, the indexes and the user program must all fit in the same physical memory. It must also be fast, so proper indexing is needed.

Another constraint came from the SemDis project, which deals with knowledge discovery in ontologies. This means executing various graph-based algorithms, most of which are based on node neighborhoods: querying, merging and intersecting them. This has a direct impact on the indexing scheme of BRAHMS.

Most of the knowledge discovery problems that we work with strongly distinguish between instances (resources), schema (including taxonomy) and literals. For this reason BRAHMS defines different classes for different types of RDF resources and handles them separately.

Driven by these requirements, the following design decisions were made for BRAHMS:

Design

Strengths

The main strength of BRAHMS is its speed for simple graph operations. This is achieved by the indexing scheme and by using simple types as resource identifiers, which speeds up all comparisons. Together with the read-only memory model, this led to a step back from strict object-oriented modelling: objects refer to each other indirectly (by identifier) rather than through memory pointers, and virtual functions are not used, as they add another indirect memory reference. Results showed that these decisions yield high speed for graph computations.
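The identifier-based layout described above can be illustrated with a minimal sketch. All names here (InstanceId, Instance, InstanceStore) are hypothetical and are not taken from the BRAHMS sources; the sketch only assumes that resources live in flat arrays, that an identifier is an index into such an array, and that records are plain structs without virtual functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical identifier type: a plain integer, so comparisons are cheap.
using InstanceId = std::uint32_t;

// Plain record: no virtual functions and no pointers to other objects.
struct Instance {
    std::uint32_t firstEdge;  // offset of the first neighbor in `edges`
    std::uint32_t edgeCount;  // number of neighbors
};

struct InstanceStore {
    std::vector<Instance> instances;  // indexed by InstanceId
    std::vector<InstanceId> edges;    // neighbors of all instances, contiguous

    // Appends an instance with the given neighbors and returns its identifier.
    InstanceId add(const std::vector<InstanceId>& neighbors) {
        const InstanceId id = static_cast<InstanceId>(instances.size());
        instances.push_back({static_cast<std::uint32_t>(edges.size()),
                             static_cast<std::uint32_t>(neighbors.size())});
        edges.insert(edges.end(), neighbors.begin(), neighbors.end());
        return id;
    }

    // Returns the k-th neighbor of an instance, resolved by identifier.
    InstanceId neighbor(InstanceId id, std::size_t k) const {
        assert(id < instances.size());
        const Instance& inst = instances[id];
        assert(k < inst.edgeCount);
        return edges[inst.firstEdge + k];
    }
};
```

In a layout like this, following a reference is an array lookup by integer, and the whole structure can be written to or read from a file without fixing up pointers, which fits a read-only snapshot model.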

The division of resource types gives graph search algorithms focused methods that work only on a given type (of resource or statement) and eliminates the need to filter out unwanted types during the search. As a result, during a search at the instance level (the most common case in the project), the user accesses only instance resources. There is no need to filter out the literals or schema classes associated with an instance. This makes the search simpler and more efficient.

The full indexing scheme of statements [subject - predicate - object] allows a linear merge of statements in path expressions, instead of an unsorted set intersection (which for two unsorted sets takes n^2 time). For a path expression of two adjacent statements, the user can take the first one sorted by "object" and the second one sorted by "subject". The resources common to both statements then appear in sorted order in both iterators, so finding the intersection becomes linear.
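The linear merge can be sketched as follows. This is an illustration under stated assumptions, not the BRAHMS API: ResourceId, Statement and joinResources are hypothetical names, and plain vectors stand in for the sorted iterators mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; the real BRAHMS classes may differ.
using ResourceId = std::uint32_t;

struct Statement {
    ResourceId subject;
    ResourceId predicate;
    ResourceId object;
};

// Returns the distinct resources that occur as the object of a statement in
// `first` and as the subject of a statement in `second`.
// Precondition: `first` is sorted by object, `second` is sorted by subject.
std::vector<ResourceId> joinResources(const std::vector<Statement>& first,
                                      const std::vector<Statement>& second) {
    assert(std::is_sorted(first.begin(), first.end(),
        [](const Statement& a, const Statement& b) { return a.object < b.object; }));
    assert(std::is_sorted(second.begin(), second.end(),
        [](const Statement& a, const Statement& b) { return a.subject < b.subject; }));

    std::vector<ResourceId> common;
    std::size_t i = 0, j = 0;
    // Each step advances at least one cursor, so the loop runs at most
    // first.size() + second.size() times.
    while (i < first.size() && j < second.size()) {
        const ResourceId o = first[i].object;
        const ResourceId s = second[j].subject;
        if (o < s) {
            ++i;
        } else if (s < o) {
            ++j;
        } else {
            if (common.empty() || common.back() != o) common.push_back(o);
            ++i;
            ++j;
        }
    }
    return common;
}
```

The standard library's std::set_intersection applies the same idea to two sorted ranges; the hand-written loop is shown here only to make the two cursors explicit.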

Limitations

BRAHMS is mainly limited by the user's system and available memory. On a 32-bit system it can only handle a 2 GB snapshot file with a maximum of 2^27 (around 134 million) instances of each type. The instance limit is caused by the size of the resource identifier (int) and the use of some higher bits for internal system purposes. The file size limit is caused by the system and the need to index file bytes. On a 64-bit system these limits are much higher, as the address space grows significantly.
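The arithmetic behind the 2^27 limit can be made concrete with a small sketch. The layout below is an assumption made for illustration: the text only says that some higher bits of the identifier are reserved, and a 2^27 limit on a 32-bit identifier implies that 5 bits are. What those bits encode, and the helper names makeId, indexOf and tagOf, are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: 32-bit identifier, upper 5 bits reserved, lower 27 bits
// used as the index of a resource within its type.
constexpr unsigned kIndexBits = 27;
constexpr std::uint32_t kIndexMask = (std::uint32_t{1} << kIndexBits) - 1;
constexpr std::uint32_t kMaxPerType = std::uint32_t{1} << kIndexBits;

static_assert(kMaxPerType == 134217728u, "2^27 is about 134 million");

// Hypothetical helper: packs a small tag into the reserved upper bits.
inline std::uint32_t makeId(std::uint32_t tag, std::uint32_t index) {
    assert(tag < 32u);            // only 5 bits are available for the tag
    assert(index <= kIndexMask);  // the index must fit in 27 bits
    return (tag << kIndexBits) | index;
}

// Extracts the 27-bit index from an identifier.
inline std::uint32_t indexOf(std::uint32_t id) { return id & kIndexMask; }

// Extracts the reserved upper bits from an identifier.
inline std::uint32_t tagOf(std::uint32_t id) { return id >> kIndexBits; }
```

With a 64-bit identifier the same scheme would leave far more bits for the index, which is consistent with the higher limits on 64-bit systems.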

Physical memory is another limitation of BRAHMS. It is needed both for creating a snapshot (loading and parsing RDF/S files into memory) and for using a created snapshot. As BRAHMS is designed to keep all data in physical memory, using files bigger than the available physical memory (less the amount used by system processes) is not recommended. In that case swapping starts and all the speedup gained from accessing data in memory is lost.

One limitation that follows from the design decisions is the read-only knowledge base. BRAHMS can be used successfully for querying an ontology and running different graph algorithms as long as the knowledge base does not change. Each modification of the knowledge base (adding or deleting even one statement) requires stopping BRAHMS, recreating the snapshot and loading the new instance into memory. This store is therefore not designed for environments where the ontology changes frequently.

Use of BRAHMS

BRAHMS is already used successfully in projects created at the LSDIS lab.

Publications