On 20 Jan 2011, at 21:32, Tudor Girba wrote:
> The goal of Moose is to help us analyze data. This
> means: modeling, mining, measuring, querying, visualizing, browsing, etc. To do this, the
> prerequisite is being able to manipulate the data. Right now, we have all objects in
> memory. To be able to scale we need database support.
Currently, the models have to fit into a 32-bit address space, which caps an image at
4 GB (in practice less). Modern machines support much more than that (data points:
16 GB @ 160 Euro for my current machine; standard workstations support 192 GB). Do you
have many models that wouldn't fit in 192 GB?
The kinds of analysis Moose does are not supported efficiently by standard relational
databases at all. They are optimized for a very different access pattern: selecting a
very small subset of the data and updating it. For the whole-model queries Moose needs,
they only give reasonable results when the dataset (nearly) fits into the database
server's memory anyway. What they buy you is avoiding a 64-bit Pharo image, and letting
the database server use more cores. What they cost you is copying data to and from the
database, and generating queries that fit the object model poorly. They are unlikely to
perform better than a straightforward 64-bit Pharo image would, but they can provide a
short-term solution.
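To make the access-pattern mismatch concrete, here is the shape of a typical Moose
query as a rough Smalltalk sketch (the names model, allClasses and methods are
illustrative, not necessarily the real Moose selectors):

    "find every class in the model with more than 20 methods;
     the query has to visit every class, so a database index
     cannot narrow the scan"
    largeClasses := model allClasses
        select: [ :each | each methods size > 20 ].

In memory this is a single pass over the model. Pushed down to a relational database it
becomes a scan and join over most of the classes and methods tables, which is exactly the
workload those databases are not built for.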
Data-warehouse-style databases (Vertica) and object-oriented databases (GemStone) can
probably do better: data warehouse databases by pregenerating all kinds of cross sections
and projections of the data, and OODBs by navigating instead of joining (and GemStone by
being able to use all the machine's memory). But even there, a lot of the Moose analyses
seem to touch a large part of the model, and the interactivity needed means that
disk-based models will never become popular.
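As a rough illustration of "navigating instead of joining" (again with made-up
accessors; aClass, methods and invokedMethods are just stand-ins):

    "collect the methods invoked from any method of aClass;
     each step is a pointer dereference, no join needed"
    callees := (aClass methods
        flatCollect: [ :each | each invokedMethods ]) asSet.

A relational store would answer the same question with a three-way join over the classes,
methods and invocations tables.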
Scaling of Moose is more likely to come from going 64-bit and distributing the model over
multiple VMs.
Stephan