Microsoft’s long-discontinued efforts to kill Hadoop have not been in vain after all, according to a new report from veteran Redmond watcher Mary Jo Foley. The homegrown alternative to the batch processing platform that the software giant developed until 2011 supposedly gave rise to an entirely new data crunching framework that is set to launch on its public cloud in the foreseeable future.
Cosmos, as the platform is known, has apparently seen extensive use within the company since its predecessor disappeared off the agenda. The internal implementation aggregates data from every major Microsoft service, including Azure, Skype and Bing, into a mostly shared pool of information that the different departments tap for their own purposes.
Over 5,000 engineers and thousands more business workers rely on Cosmos for a broad spectrum of use cases ranging from tracking the state of the company’s data centers to analyzing search traffic for meaningful trends, according to Foley. That suggests the technology is both far enough along in its evolution to power core business processes and robust enough to accommodate a wide variety of workloads.
This should make for a powerful pitch if and when Cosmos becomes available to the outside world. Foley cited an unnamed insider as saying that Microsoft will position the technology as “complementary” to the managed Hadoop service already available on Azure, yet the official description for the technology points to a great deal of overlap between the two.
Cosmos is touted as similar to MapReduce but with the ability to represent jobs spanning multiple operations in the form of directed acyclic graphs, an approach that reduces the time and effort involved in carrying out complex analysis while improving performance. That pits it against Spark, which uses the same abstraction as its native execution format and is the clear front-runner to become the next default data crunching engine in Hadoop.
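To make the directed acyclic graph idea concrete, here is a minimal sketch in Python. It is not based on the Cosmos or Spark APIs; the `run_dag` helper, the operation names and the sample log lines are all hypothetical, chosen only to show how a multi-step job can be declared as a graph and executed in dependency order, with one intermediate result feeding several later steps.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(ops, deps, source):
    """Run every operation once, after all of the operations it depends on.

    ops    -- maps an operation name to a plain Python function (hypothetical)
    deps   -- maps an operation name to the ordered list of names it reads from
    source -- raw input handed to operations that have no dependencies
    """
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        upstream = [results[d] for d in deps.get(name, ())]
        inputs = upstream if upstream else [source]
        results[name] = ops[name](*inputs)
    return results


# Illustrative job: parse log lines once, then reuse the parsed rows twice.
ops = {
    "parse": lambda lines: [line.split() for line in lines],
    "count": lambda rows: len(rows),
    "total_bytes": lambda rows: sum(int(r[1]) for r in rows),
    "mean": lambda n, total: total / n,
}
deps = {
    "parse": [],
    "count": ["parse"],
    "total_bytes": ["parse"],
    "mean": ["count", "total_bytes"],
}

results = run_dag(ops, deps, ["a 10", "b 30", "c 20"])
print(results["mean"])
```

In a chain of separate MapReduce jobs, the parsed rows would typically be written out and read back between stages. Declaring the whole job as one graph lets an engine see that `count` and `total_bytes` share the same input and schedule them accordingly.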
However, it’s worth adding that Microsoft’s analytics framework has apparently evolved a great deal since the paper came out in 2011. Based on Foley’s claim that engineers use it to analyze telemetry coming off the company’s infrastructure, Cosmos might have a stream processing component – but so does Spark. And the similarities don’t end there. The homegrown platform reportedly also includes a structured query interface, another feature that it shares with the open-source project.
Taken together with the fact that Microsoft is using Cosmos for the same workloads that Hadoop is designed to handle, there’s little room for doubt that the engine will compete with the batch processing framework. That puts it in the same category as Google Dataflow, another homegrown analytics service capable of ingesting both historical and real-time data that is being turned into a commercial offering after seeing success internally.
But there is one difference between the two cloud offerings that may work to the search giant’s advantage. Dataflow is an evolution of existing technologies from the Hadoop ecosystem, which gives Google the groundwork to offer interoperability with the framework, and with Spark in particular, functionality that it’s already working on. Cosmos, in contrast, is a proprietary engine that may not be able to provide the same kind of integration.
That’s a major barrier to entry for a number of reasons, not the least of which is that a sizable portion of the companies that would use the service have already aligned their data strategies around Hadoop. But Redmond is apparently confident enough to push forward with the project nonetheless, which sets up the analytics space for a highly competitive year.