Self-describing data

Lately, for reasons I won't delve into here, I have been thinking quite a lot about self-describing data. I'll have to admit that the first time I heard of this expression, when I was reading about ontologies, DAML+OWL, RDF, I didn't give too much attention to it. It seemed a little odd for me to worry about it.

However, lately I have realized the real importance of it. Its wastefulness has some very interesting consequences:

- Independence between systems: two systems don't have to completely understand each other to interoperate, as long as they can find within their data things they can understand. Independence also relates to same-system upgrades - if you use this data for storing state, it's easier for a new version to understand the old version's state without having to explicitly code for it.
- "Isolationability" of data: data can be understood (somehow) without a system around it - you can build generic tools to look at the data - they will never be optimal for anything, but they allow for inspection and improve debug-ability.
- Extensibility: as you know that your clients can live with only understanding part of your date (yes, your clients have to be built with that in mind), you can extend your data without worrying too much about the consequences.

Let's give an example. Let's say that you are trying to store personal information about your friends around. Your first solution could be something like:

Joe Smith, 10/12, 555-555-5555
Clara Smurf, 05/05/1977, 555-555-5544
Joe Jack, 01/01

Now you have to build a system that will parse each line and understand that the name is before the first comma, the birth date (with or without year) is next and then the phone number (if you know it). Now let's say that you want also to add a city of residence. Now your system, has to know how to parse something completely new.

But what if you write a second file that contains how to parse this file? Something like:

<name>, <birth-month>/<birth-day>[/<birth-year>][, <phone-number>]

Let's call this file a dictionary. Now you have to keep that file always with your initial file. More than this you have to make sure that when you change one, you change the second one too. And that if you are dealing with files written with different dictionaries so you have to keep a copy of all the dictionaries around and how they relate to each other. One misunderstanding on which dictionary to pick and you are suddenly getting everything wrong. But I have to admit that it's better than the first option.

Now, what if you wrote something like:


Friend 1:
Family name: Smith
First name: Joe
Birth month: 10
Birth day: 12
Phone number: 555-555-5555
Friend 2:
Family name: Smurf
First name: Clara
Birth month: 05
Birth day: 05
Birth year: 1977
Phone number: 555-555-5544
Friend 3:
Family name: Jack
First name: Joe
Birth month: 01
Birth day: 01

Isn't it much easier to understand what I mean? Surely it's much more expensive to store in disk and to send out to people, also more expensive to parse and process, but what it's a matter of looking forward and thinking where do you want to go with your data. Are you building something that will only work for what you know you need right now, or do you accept that your needs change with time and you would like not to have to reinvent the wheel ever time it does (and more than this, keep track of all the wheels you have invented in the past, because they might come back to haunt you).

Surely I'm over-simplifying things here. There are lots of things that you can do that will make things not backwards compatible on the self-describing solution. Also there are things you can do to the dictionary solution that would contain the version of the dictionary in the message contents and a dictionary repository that you could get any specific version.

I guess my point is the following: if you can live with the complexities of non-self-describing data, if things don't change, or if change is centered around systems that you have good control over, go for it. For all the rest of us, mortals, it's just too painful for gains that it offers. Especially in our world where networks are fast and disk space is cheap.