[Mulgara-general] Back online.

Andrae Muys andrae at netymon.com
Mon Dec 11 00:42:55 CST 2006


Well the conference is over, but I'll be in Melbourne until  
Wednesday.  I finally found a network connection which, with only a  
little ethereal work, turned out to include an smtp-relay that will accept my  
outbound email.  So finally, after almost a week incommunicado, I'm  
online again.

I notice that there are no bug reports on the transaction fix, so I  
have started merging the code into trunk.  That should be finished by  
Wed.

OSDC went very well.  I managed to get in a lightning talk on Mulgara  
and RDF, which attracted some interest from various quarters.  Not  
surprising was the interest from those involved in Data-Warehousing,  
and Library systems; a little surprising was the interest expressed  
from a number of people in the perl community who were at the  
conference.

In the week before the conference I enrolled in a Masters of  
Philosophy (IT and Electrical Engineering) at The University of  
Queensland - what is commonly referred to as a "Research Masters".   
The thesis will be on the design and implementation of XA2, which  
means I now have 4 years to develop XA2.  That's longer than  
necessary; indeed, if I can afford to work on it full-time, XA2 can be  
finished over the course of next year.

In between trying to get a paper finished for LCA, and a seminar  
prepared for UQ, I have been fleshing out the design parameters of  
XA2.  At the moment, for 16 billion triples we are looking at 512MB  
per index (uncompressed), i.e. 3GB of uncompressed data in total for the  
StatementStore.  Now even with purely random data we can expect at  
least 25% compression, which means that we probably require  
(worst-case) ~2GB of RAM for reasonable performance of a  
StatementStore with 1e10 triples, which is not unreasonable.  This  
means we can probably do 1e10 in 5-6GB of RAM and probably 10-20TB of  
disk.  These are all back-of-the-envelope calculations, and the  
biggest wild-card is the size of the stringpool, which I'm just  
guessing at.
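
For anyone who wants to replay the StatementStore arithmetic above, here  
is a rough sketch in Python.  The figure of 6 indexes is an assumption  
inferred from 3GB / 512MB, and the function name is purely illustrative;  
none of this is XA2 code.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def statement_store_ram(index_bytes, n_indexes, compression):
    """Return (uncompressed, compressed) size in bytes.

    compression is the fraction saved, e.g. 0.25 for 25%.
    """
    uncompressed = index_bytes * n_indexes
    compressed = uncompressed * (1 - compression)
    return uncompressed, compressed

# Assumed inputs: 512MB per index, 6 indexes (inferred), 25% compression.
uncompressed, compressed = statement_store_ram(512 * MB, 6, 0.25)
print(uncompressed / GB)  # 3.0
print(compressed / GB)    # 2.25
```

The 2.25GB result is in line with the ~2GB quoted, given that 25% is  
meant as a lower bound on compression.  The stringpool is deliberately  
left out, since its size is still a guess.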

If anyone is interested we're looking at ~1.5TB of RAM required for a  
StatementStore containing 4 Trillion Triples - of course that's XA3  
so beyond my design horizon for the moment.

It's worth noting that large XA2 stores may start to run up against  
the max-files-per-process limit.  A single phase of the  
StatementStore with 1e10 triples contains up to 28 files per index -  
that's ~168 files per complete phase.  Now the vast majority will be  
shared with other phases, but the default maximum files per process  
is 256 on many platforms.
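
A quick sketch of the file-count arithmetic, again in Python.  The 6  
indexes are an assumption inferred from 168 / 28, the helper names are  
hypothetical, and the limit check at the end uses the standard  
resource module, which is only available on Unix-like systems.

```python
def files_per_phase(files_per_index, n_indexes):
    """Worst-case number of files in one complete phase."""
    return files_per_index * n_indexes

def headroom(files_needed, max_open_files):
    """Descriptors left over once one full phase is open."""
    return max_open_files - files_needed

# Assumed inputs: up to 28 files per index, 6 indexes (inferred).
needed = files_per_phase(28, 6)
print(needed)                 # 168
print(headroom(needed, 256))  # 88

# Report the actual per-process limit where the platform exposes it.
try:
    import resource
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
except ImportError:
    pass  # resource is not available on this platform
```

With a 256 limit that leaves 88 descriptors for everything else the  
process has open, so a second phase that shares few files with the  
first would already exceed it.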

As far as estimating the StringPool, the biggest problems are the  
significant part compression is intended to play, and the existing  
problem of estimating the repeatability of data.  In much the same  
way that the StatementStore is a cross between a Base-4 Numeric Heap  
and a Skip-List, the current StringPool design is a cross between a  
Zeroless Numeric Random Access List and a Prefix Trie.

Of course the work on the paper and seminar is the reason why I  
haven't published the more detailed design doc I promised.  Much of  
that detail is in the paper; when the paper is finished I'll extract  
the design sections and post them to the list - the rest of the paper  
will be posted after LCA.

Andrae Muys

-- 
Andrae Muys
andrae at netymon.com
Principal Mulgara Consultant
Netymon Pty Ltd


