
by Peter Wone - Independent Developer

A nice hot cuppa

Java is everything I'd hoped, and less.

Beans means Borland

JavaBeans are just like Delphi VCLs, only more so. Brilliant. Borland had a big hand in them, and it shows. Of the various Java development environments I tried out, Borland's JBuilder handles beans best. The bean wizard is easy to use and just plain works. Not one of the environments I've investigated feels like a first try. Maybe as an industry we're starting to learn.

Meanwhile, back at the ranch... MS is still trying to poison Java as an escape route for the terminally sick of Windows. Sun is still playing holier than thou and nobody has done anything really useful with Java. Not even me.

On the other hand, the same could be said of Visual Basic until it reached version three.

Déjà vu

For me, using Java is a strange experience. Architecturally it's familiar: with the advent of beans, very much like Delphi. But syntactically it's C++ flavoured, and that's not entirely comfortable for me. I speak from experience when I say that the key to Delphi is familiarity with its object hierarchy and the strategies it embodies; knowing The Way Things Are Done Around Here, and what gimcrackery is around to help you. That extends to the whole environment - you get a big head start if you already know the lay of the land in Windows.

With Java I'm still a bit lost in that regard. I can't use any Windows specific knowledge. Fortunately Java libraries tend to be a damn sight simpler and more orthogonal (there's that word again) than the WinAPI, but it's slow going to start with.

Performance is... lacklustre. But - two buts, actually - firstly, that will change. Especially when cheap Javachips come onto the market. Secondly, I've scoped out most of the contenders and they're all a bit sluggish.

JBuilder is the environment in which I'm having the least grief coming to grips with Java.

Object materialisation as a JOIN

When a persistent object is streamed back into memory, the code (data about behaviour) is read into memory (if it's not already there due to another instance of the same class) and then the instance data (data about state) are read into memory where the two are combined (joined) to form a single object.

The code portion occurs once for each class. The instance data occurs once for each instance, but many times for each class. The two are in a one-to-many relationship. Object instantiation is a JOIN operation.
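The one-to-many relationship described above can be sketched in Java. This is an illustrative model, not how any JVM actually stores things: the class names, maps and `materialise` method are all invented for the example. The "code" table holds one row per class, the "instance" table many rows per class, and materialisation is the join of the two.

```java
import java.util.*;

public class JoinDemo {
    // "Code" table: one row per class (data about behaviour).
    static Map<String, String> behaviour = new HashMap<>();
    // "Instance" table: one row per instance, many rows per class (data about state).
    static List<Map.Entry<String, Integer>> instances = new ArrayList<>();

    // Materialise objects by joining each instance row to its class row.
    static List<String> materialise() {
        List<String> objects = new ArrayList<>();
        for (Map.Entry<String, Integer> row : instances) {
            String code = behaviour.get(row.getKey()); // the join
            objects.add(code + "(" + row.getValue() + ")");
        }
        return objects;
    }

    public static void main(String[] args) {
        behaviour.put("Account", "deposit/withdraw");
        instances.add(Map.entry("Account", 100));
        instances.add(Map.entry("Account", 250));
        // One code row serves both instances.
        System.out.println(materialise()); // [deposit/withdraw(100), deposit/withdraw(250)]
    }
}
```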

Part of what I like about this observation is what it does to the notion of denormalising for performance. The idea behind this is that since joins take time, giving them the flick will be good for performance.

While this is true in abstractum, there are several implications depending on the way one goes about it.

Messing up the model

Anyone with any real understanding of the whys and wherefores of normalisation will be laughing at the very suggestion. You'd have to perform multiple updates to record a single datum (this is going to speed things up?) which is just plain inviting trouble.

Worse, the behaviour of data no longer necessarily conforms to set theory. Does "black art" ring any bells?

Two models

In this scenario data is written into a properly normalised database. Periodically, frequently performed joins are realised in one big batch and cached.

This will work. It does not jeopardise data integrity and it may well furnish considerable performance improvements to particular specialised operations, but it introduces two issues:

Both of these options will chew resources like nobody's business. Bigger rows (physically stored tuples) and more of them means more I/O. A denormalisation may produce a cross-product - this means an order of magnitude more rows.

I/O is the most time expensive thing you can do with a computer. I leave it as an exercise for the reader as to what this will do to performance - and, for that matter, space requirements.

The two-models approach may well be a good idea for data-marts, having the additional benefit of making it easy to get DSS activity loads off the reference database server (you put the second model on another server).
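The two-models idea - write into the normalised model, batch-realise the popular joins into a cache - can be sketched as follows. The customer/order example and all the names here are invented for illustration; in a real system the "second model" would live in another database, quite possibly on another server, as noted above.

```java
import java.util.*;

public class JoinCache {
    // Reference model: properly normalised.
    static Map<Integer, String> customers = new HashMap<>();
    static Map<Integer, List<String>> ordersByCustomer = new HashMap<>();
    // Second model: the frequently performed join, realised in batch and cached.
    static List<String> cache = new ArrayList<>();

    // The periodic "one big batch": rebuild the denormalised copy from scratch.
    static void refresh() {
        cache.clear();
        for (Map.Entry<Integer, String> c : customers.entrySet())
            for (String o : ordersByCustomer.getOrDefault(c.getKey(), List.of()))
                cache.add(c.getValue() + " ordered " + o);
    }

    public static void main(String[] args) {
        customers.put(1, "Ada");
        ordersByCustomer.put(1, List.of("widget", "sprocket"));
        refresh();
        // Reads hit the cache; integrity lives in the normalised model.
        System.out.println(cache);
    }
}
```

Between refreshes the cache is stale, which is exactly the trade a data-mart makes.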

Now consider our objects in memory where they execute. What maniac would keep a separate copy of behavioural data (code) in each tuple? When a particular class instance (object) method call occurs, a join is performed on demand.

Sometimes the entry point for each possible method call is bound into the object when its Create method is invoked. Sometimes all the object gets is the address of a VMT and it must perform a double join to resolve the many-to-many relationship between class instances and their related methods. It is interesting to note that the former style is the two-models system in action.
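The VMT-style double join can be modelled in a few lines of Java. Again this is a toy, not a real dispatch mechanism - the maps, the `Obj` class and the `call` method are invented for the sketch. Each instance carries only its state and the "address" of its class's VMT; a method call resolves instance to VMT, then VMT to method body.

```java
import java.util.*;
import java.util.function.IntUnaryOperator;

public class VmtDemo {
    // One VMT per class: method name -> entry point.
    static Map<String, Map<String, IntUnaryOperator>> vmts = new HashMap<>();

    // An instance holds only its state and a reference to its class's VMT.
    static class Obj {
        String vmt; int state;
        Obj(String vmt, int state) { this.vmt = vmt; this.state = state; }
    }

    // A method call is the double join: instance -> VMT -> method body.
    static int call(Obj o, String method) {
        return vmts.get(o.vmt).get(method).applyAsInt(o.state);
    }

    public static void main(String[] args) {
        vmts.put("Counter", Map.of("next", n -> n + 1, "twice", n -> n * 2));
        Obj c = new Obj("Counter", 20);
        System.out.println(call(c, "twice")); // prints 40
    }
}
```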

Cache-flow management

Pascal has come a long way since the halcyon days of my youth. It has objects, and knows how to stream them. Perhaps it occurs to the Pascal programmers among us that the principal differences between objects which are in memory and objects which are not are two: how quickly they can be reached, and whether they survive a power cut.

This amounts to a singular silliness. The only differences between RAM and disk are that RAM is fast and volatile, whereas disk is slow and persistent. All the action happens in the CPU's registers. RAM is nothing more than a cache for the registers.

Managing the movement of objects between disk and RAM is a cache management issue. No more, no less. And cache management is a problem long solved. There are even stable reliable distributed cache management technologies.
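As a sketch of the point, here is RAM-as-cache-over-disk in Java, using the standard library's `LinkedHashMap` in access order to get LRU eviction. The two maps and the `fetch` method are invented for the illustration; "disk" is just another map standing in for persistent storage.

```java
import java.util.*;

public class ObjectCache {
    static Map<String, String> disk = new HashMap<>();  // slow, persistent
    static final int CAPACITY = 2;

    // RAM as an LRU cache over disk: when it fills, the least recently
    // used object is written back out and dropped from memory.
    static LinkedHashMap<String, String> ram =
        new LinkedHashMap<String, String>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                if (size() > CAPACITY) {
                    disk.put(e.getKey(), e.getValue()); // write back on eviction
                    return true;
                }
                return false;
            }
        };

    // Fault an object in from disk on a miss; hits never touch the disk.
    static String fetch(String id) {
        String v = ram.get(id);
        if (v == null) { v = disk.get(id); ram.put(id, v); }
        return v;
    }

    public static void main(String[] args) {
        disk.put("a", "1"); disk.put("b", "2"); disk.put("c", "3");
        fetch("a"); fetch("b"); fetch("c");
        System.out.println(ram.keySet()); // only the two most recently used survive
    }
}
```

Application code calls `fetch` and never asks where the object is parked - which is the whole point.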

A web server, for example, is a request broker with a cache manager. A web browser is a display rendering engine with a cache manager. The two cache managers use an extremely sophisticated protocol called HTTP to negotiate transfers. The protocol even supports degrees of staleness, although this is seldom used in practice.

HTTP also supports transfers in both directions. It makes no special distinction between client and server; these are merely roles assumed for the duration of a transfer. Often the web server is not set up to function as a client, and few browsers are set up to function as servers, but these are deficiencies of implementation in particular products, not of the protocol.
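The "degrees of staleness" negotiation mentioned above boils down to simple arithmetic on a response's age against the lifetime its origin granted it. A minimal sketch, with invented method names and the protocol details heavily simplified:

```java
public class HttpFreshness {
    // In the HTTP caching model, a response is fresh while its age is
    // below the max-age its origin granted; after that it is stale and
    // should be revalidated with the server before reuse.
    static boolean isFresh(long ageSeconds, long maxAgeSeconds) {
        return ageSeconds < maxAgeSeconds;
    }

    // The degree of staleness: how far past its lifetime the response is.
    static long staleness(long ageSeconds, long maxAgeSeconds) {
        return Math.max(0, ageSeconds - maxAgeSeconds);
    }

    public static void main(String[] args) {
        System.out.println(isFresh(30, 60));      // still fresh
        System.out.println(staleness(90, 60));    // 30 seconds stale
    }
}
```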

Type as set membership

Class is type. Type is class. Both are indicative of set membership. Set membership tells us (among other things) which items are related to which items, and which operations may be performed upon them.

For example, arithmetic may be performed upon the set of numbers, and only on the set of numbers. The arithmetic operators are instance methods of the class Number. (Let us not confuse them with algebraic operators, which are class methods of the class Number. They have to do with the expression of symbolic relations between subclasses, not the scalar realisation of completely bounded solution spaces.) Therefore, any item with a numeric type is suitable for arithmetic operations, and any item which does not have a numeric type is not suitable.

What is type? If a variable is an object type, it tells the computer how to bind to the code part of the object. If it's a simple type, for example a string, it tells the computer what sort of functions are allowed to operate on the object (a string, in this case). If it's a file, it tells the computer which program to use to manipulate the file.

In all cases, type tells the computer which code (behavioural data) goes with a given datum.

That's not the only use of type. In the case of files, it provides a basis for classifying data for display. In databases and other such structured storage, type also provides a basis for classifying data, in this case in order to find it again later (indexes).

The structure of an entire row can be regarded as a single type for classification purposes - Pascal programmers will recognise the RECORD type.
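The arithmetic-on-numbers example above is visible in Java's own type machinery: an operation defined only on members of the set Number rejects everything outside that set. The `twice` method is invented for the illustration.

```java
public class TypeDemo {
    // Type as set membership: doubling is defined only for members of
    // the set Number; items outside the set are rejected.
    static double twice(Object item) {
        if (item instanceof Number)
            return ((Number) item).doubleValue() * 2;
        throw new IllegalArgumentException("not in the set of numbers");
    }

    public static void main(String[] args) {
        System.out.println(twice(21));   // 42.0 - an Integer is in the set
        System.out.println(twice(1.5));  // 3.0 - so is a Double
        // twice("fish") would throw: a String is not in the set of numbers.
    }
}
```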

Soup of objects

All of this leads inexorably to a sort of "soup" of objects which are just sort of "there". Ultimately, all a database system does is store data, classify it, and find it again on demand.

The misleading and ultimately meaningless notion of "where" the data is located has no place in such a system. Objects exist in abstractum. This is a Good Thing. Most of the code in any program has to do with "where" data is. While this is very necessary activity, it has no place in application code. If things can be fixed so that "where" is a meaningless question, programs must become simpler, and in consequence more robust.

Transactions in a distributed soup of objects

The soup of objects idea is a good one, but it raises another issue: abstracting the physical distribution of data. "Where" starts to have real meaning when multiple physical locations are involved. And in this internetworked world, to say that multiple physical locations are involved is a gross understatement.

You can talk the idea around in circles for ages, but eventually it all comes down to two things: globally unique identification of objects, and distributed transaction management.

Globally unique identifiers aren't as tough as you might think (Microsoft likes to call them "GUIDs", which you pronounce like "squids") but distributed transaction management is another story.
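Java's standard library already does the easy half. `java.util.UUID` mints 128-bit identifiers whose randomness makes a collision between independently generated IDs vanishingly unlikely - which is what lets objects be named without any central registry deciding "where" they live.

```java
import java.util.UUID;

public class GuidDemo {
    // A fresh 128-bit identifier per call; no co-ordination with anyone.
    static UUID newId() {
        return UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(newId()); // e.g. a value like 550e8400-e29b-...
    }
}
```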

So far everybody's attempts at distributed transaction co-ordination are based on two-phase commit. This is less than perfect: two-phase commit has (literally) its moment of weakness, during which a failure can leave the two component (and presumably now isolated) systems uncertain of each other's transactional status, and therefore uncertain of their own status. "In limbo," they call it, and it's a big problem. Not that I have any better ideas.
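For the curious, the shape of two-phase commit is simple to sketch; the fragility lives in the gap between the phases. This is a toy co-ordinator with invented names, ignoring timeouts, logging and recovery - precisely the machinery that the "in limbo" problem makes necessary.

```java
import java.util.*;

public class TwoPhaseCommit {
    interface Participant {
        boolean prepare();   // phase 1: vote, and promise to be able to commit
        void commit();       // phase 2a: make it permanent
        void rollback();     // phase 2b: undo the work
    }

    // Phase 1: collect votes. Phase 2: commit only on unanimity.
    // The moment of weakness is between a unanimous vote and the commit
    // messages: a co-ordinator failure there leaves participants in limbo.
    static boolean run(List<? extends Participant> participants) {
        for (Participant p : participants)
            if (!p.prepare()) {
                for (Participant q : participants) q.rollback();
                return false;
            }
        for (Participant p : participants) p.commit();
        return true;
    }

    static class Fake implements Participant {
        boolean vote; String state = "pending";
        Fake(boolean vote) { this.vote = vote; }
        public boolean prepare() { return vote; }
        public void commit() { state = "committed"; }
        public void rollback() { state = "rolled-back"; }
    }

    public static void main(String[] args) {
        Fake a = new Fake(true), b = new Fake(false);
        System.out.println(run(List.of(a, b)) + " " + a.state); // false rolled-back
    }
}
```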

The net's main transfer protocol, HTTP, isn't even state based, never mind transaction based, but that's not necessarily a bad thing - it inhibits the use of short-sighted strategies suited only to favourable conditions. Oddly, it also reminds me of COBOL.

OOP is designed to give the same effect, but with less of the grief. We have looked at the functional differences between disk and RAM, and observed that a cache manager makes such a distinction unnecessary, separating the humdrum issue of where data (whether instance data or code) happen to be parked from the more significant issues of what it is and how it behaves.

So there we have it: what's needed to achieve a veritable SOOP is some kind of transactional enhancement for HTTP, or maybe another layer over the top of HTTP, or something like it but transaction capable.

So who's going to do all this?

Microsoft is. Badly, as usual. Or maybe not. Maybe you recall that my hate list of stuff that was fundamentally wrong with the design of database servers had evolved into a functional spec for a database server.

I've seen Sphinx, their total rewrite of SQL Server, and so have several of my friends. Those who'd seen my diagram of how a database server ought to be built said anyone would think I'd been leading the development team. What can I say? OK, I'm annoyed that I don't get to be the next Bill Gates, but all I really wanted was for the nightmare to end.

While I can think of lots of nasty things to say about their distributed transaction server (and about their coding philosophy and their codebase in general and...) at least they're building one. And they're building a distributed message queue. Well, sort of. Actually what they're really doing is tarting up a mess they bought from IBM, but there is obviously someone at Microsoft with lots of clout, someone who understands!

There is a glimmer at the end of the tunnel.

Maybe in a couple of years they'll work out that where is less important than what.

When I say denormalising for performance I often include both the strict meaning - the physical storage of relations for which realisation in a normalised database would require one or more joins - and deliberately induced redundancy which does not necessarily affect the degree of normalisation. A good example of this would be keeping the John Doe current balance in the John Doe tuple of the accounts table. Doing this would increase redundancy, because this information is already implied by the records in the Transactions table, but would leave the database in 3NF (assuming it was in 3NF to start with) because the values are indeed functionally dependent on the PK of this table. My thanks to CJ Date for pointing out this crucial subtlety.
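The current-balance example can be made concrete. In this sketch (names invented for illustration) the cached balance buys a fast read at the cost of a second write per transaction - the extra update that induced redundancy always demands.

```java
import java.util.*;

public class BalanceDemo {
    // The Transactions table: the balance is implied by these records.
    static List<Integer> transactions = new ArrayList<>();
    // The redundant copy kept in the account tuple for fast reads.
    static int cachedBalance = 0;

    // Recording one datum now means two updates - the price of redundancy.
    static void post(int amount) {
        transactions.add(amount);
        cachedBalance += amount;
    }

    // What the cache duplicates: the balance derived from the transactions.
    static int derivedBalance() {
        return transactions.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        post(100);
        post(-30);
        System.out.println(cachedBalance + " == " + derivedBalance());
    }
}
```

Miss one of the two updates anywhere in the codebase and the copies silently diverge - which is why the suggestion invites trouble.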

A virtual method table (VMT) expresses the many-to-many relationship between class instances and the methods related to them.

Herein lies an echo of the past. COBOL was a success in part due to something it shares with OOP languages; heavy emphasis on structure, rather than process. Coding procedures was an unbelievably tedious activity, which encouraged planning and the use of data structures designed to minimise the complexity of manipulation. Ironically the emergence of tools to alleviate this problem has actually exacerbated it by reducing evolutionary pressure against ignorance of the issue.

By SOOP I mean Soup-of-Objects Oriented Programming. Indulge me my sense of humour.

Written by: Peter Wone
January '98
