Tuesday, April 24, 2007

So much data, so little time

The truth is out there. But it's lost in the static.

Terabytes of information every day, and somewhere in there is the pattern that will create a new drug, predict financial markets or catch fraudsters. Terabytes of information. Er... excuse me... wazzat??

A terabyte is a thousand gigabytes is a thousand megabytes is a thousand kilobytes is a thousand bytes, and the byte is the basic unit of storage. Strictly a byte holds about one character rather than a word, but for a sense of scale a word will do.

So imagine a terabyte is like a thousand billion words.
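If you want to check the arithmetic, here is a minimal sketch in Python. It assumes the decimal (power-of-a-thousand) units used above; the `UNITS` list and the `bytes_in` helper are names made up for this illustration.

```python
# Decimal storage units, each a thousand times the one before it.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte"]


def bytes_in(unit: str) -> int:
    """Return how many bytes make up one of the given unit."""
    return 1000 ** UNITS.index(unit)


if __name__ == "__main__":
    for unit in UNITS:
        # Thousands separators make the growth easy to see.
        print(f"1 {unit} = {bytes_in(unit):,} bytes")
```

A terabyte comes out as 1,000,000,000,000 bytes, which is the thousand billion mentioned above.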

A kilo-word is a short story, a mega-word is a set of encyclopedias, a giga-word is a library, a tera-word is all the libraries in Europe.

So when you hear 10 terabytes, visualise someone trying to search all the words in all the books in all the libraries in the world, and you're getting a taste of the scale of the problem.

In fact you're only getting a taste of the scale of the problem circa 1980, when the first terabyte Data Warehouses were built to run programs that took days to mine the data in their mainframes.

Now, in the internet age, we're beginning to look into the realm of petabytes: thousands of terabytes. And the users of Data Warehouses don't want to wait days anymore; they want the answer in real time.

Search engines need to know what searches are popular, which links are being clicked, how long people are waiting for an answer before clicking elsewhere in disgust.

Financial markets are vast, but within the chaos are patterns. Imagine reading those patterns a second ahead of your competitor.

Retail companies need to know which lines are moving and which are stagnating, which offers are bringing in business and which are a waste of time.

Drug companies generate insane amounts of data from clinical trials: spot the pattern, or else a harmful drug is released or the next big thing is poured down the sink.

In most modern industries the right answer at the right time can be worth billions, and supplying the wherewithal to find that right answer has become the fastest-growing sector of the IT industry.

The world of data warehousing today is an almost religious clash of design philosophies: ETL versus ELT, Inmon versus Kimball, Appliance versus Big Database. All the usual blue-chip computing giants versus a crowd of crazy start-ups, the kind you thought had all died in the Dot-Bomb crash of 2001.

The truth is out there, but it'll cost you.

1 comment:

Kenny said...

You know full well that this one is close to my heart. It is utterly mind-boggling.

I take it by appliance you mean Netxxxx? If so, I'd go that route every day. A huge Oracle database is much harder to maintain. At least the guys from N build in redundancy and you don't have to worry about resiliency.