2007-06-05

Google's infrastructure: scalability++

Google has introduced a vast array of products, yet search remains its main focus.

It is interesting to learn how Google does it and what internal IT services lie behind it.


Here is a list of Google products and programming APIs.

A good place to start is Google's mission statement:


"To organize the world's information and make it universally accessible and useful."

Wow; this is a formidable challenge, and it offers good insight into the underlying data volumes, storage, computational complexity, and the worldwide network and computing topology that Google must design, build, and operate to deliver its services.

I found this video on the subject to be of much interest. The video describes essential building blocks that offer a glimpse of the technology behind Google's services; they include:

  • Computers. Google uses hundreds of thousands of conventional computers, plain vanilla Intel/AMD x86 units, powered by a custom Linux OS.
  • GFS - Google File System. GFS is a distributed file system, the basic storage layer on which abstractions such as BigTable are built.
  • BigTable. BigTable is a storage abstraction for managing structured data, designed to scale to petabytes (10^15 bytes) of data.
  • MapReduce. MapReduce: simplified data processing on large clusters. A programming model that expresses a task over a large data set as a Map function, which emits key/value pairs, and a Reduce function that aggregates the values collected for each key; a minimal sketch follows this list.
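
To make the Map/Reduce abstraction concrete, here is a minimal, single-process sketch of the model in Python, using the canonical word-count job. It only illustrates the Map and Reduce contracts; Google's actual implementation is a distributed system that also handles partitioning, scheduling, and fault tolerance, and the map_reduce helper and function names below are purely illustrative.

    from collections import defaultdict

    def map_reduce(records, map_fn, reduce_fn):
        """Toy single-process MapReduce: apply map_fn to every input
        record, group the emitted (key, value) pairs by key, then call
        reduce_fn once per distinct key. Real MapReduce distributes
        these phases across a cluster of machines."""
        groups = defaultdict(list)
        for record in records:
            for key, value in map_fn(record):   # Map phase: emit pairs
                groups[key].append(value)       # shuffle: group by key
        return {key: reduce_fn(key, values)     # Reduce phase
                for key, values in groups.items()}

    # Canonical word count: Map emits (word, 1) per occurrence,
    # Reduce sums the counts for each word.
    def word_map(document):
        for word in document.split():
            yield word.lower(), 1

    def word_reduce(word, counts):
        return sum(counts)

    pages = ["the quick brown fox", "the lazy dog", "the fox"]
    print(map_reduce(pages, word_map, word_reduce))
    # -> {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}

The point of the model is that the framework owns the generic machinery, so user code shrinks to two small functions.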
There is also a reference on "An Economic Case for Chip Multiprocessing."

I found this blog, which offers more detail about Google's infrastructure.
And this one, which I found recently, has much more information on the subject.

Also, this one has a summary report from Google's Scalability Conference:
At Google they do a lot of processing of very large amounts of data. In the old days, developers would have to write their own code to partition the large data sets, checkpoint code and save intermediate results, handle failover in case of server crashes, and so on, as well as actually writing the business logic for the data processing they wanted to do, which could have been something straightforward like counting the occurrence of words in various Web pages or grouping documents by content checksums. The decision was made to reduce the duplication of effort and complexity of performing data-processing tasks by building a platform technology that everyone at Google could use, one which handled all the generic tasks of working on very large data sets. So MapReduce was born.
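
Both example tasks above fit the model directly. Word counting appears in the sketch earlier in the post; here is the checksum-grouping task expressed with the same toy map_reduce helper (again an illustrative sketch of my own, not Google's code):

    import hashlib

    # Grouping documents by content checksum, e.g. to spot duplicate
    # pages. Map keys each document by a hash of its content; Reduce
    # simply returns the list of document names sharing that checksum.
    def checksum_map(doc):
        name, content = doc
        yield hashlib.md5(content.encode()).hexdigest(), name

    def checksum_reduce(checksum, names):
        return names

    docs = [("a.html", "hello world"),
            ("b.html", "hello world"),
            ("c.html", "goodbye")]
    groups = map_reduce(docs, checksum_map, checksum_reduce)
    for checksum, names in groups.items():
        print(checksum[:8], names)
    # two groups: ['a.html', 'b.html'] share a checksum; ['c.html'] stands alone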
