Last updated on September 21st, 2015 at 03:08 pm
At the recent @Scale Conference in San Jose, Calif., leading figures and experts in computer engineering, coding and cloud computing gathered to share news, views, successes and failures of their profession. One of those experts, Arun Jayandra, software development lead at Microsoft, shared his experiences using Spark cluster computing and Cassandra database technologies for Big Data analytics.
As part of his work on Office 365, Jayandra and his team at Microsoft designed the online office productivity suite to run with three-nines and four-nines availability, that is, 99.9 and 99.99 percent reliability.
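To put those availability figures in perspective, the downtime each target permits can be worked out directly. The following is illustrative arithmetic only, assuming a 365-day year; the helper name is hypothetical and not something Jayandra presented.

```python
# Illustrative arithmetic only: downtime permitted by an availability target.
# Assumes a 365-day year; allowed_downtime_minutes is a hypothetical helper.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that the target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes per year")
```

Under that assumption, three nines allows roughly 8.8 hours of downtime a year, while four nines allows under an hour.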
Office 365 tenants, or customers, were not satisfied with this level of performance, according to Jayandra. But another issue underlay customer satisfaction with the Office 365 experience—analyzing actual reliability of the applications. “To date, we’re not the most experienced at measuring the availability the tenant is getting,” Jayandra says.
Big believer in Big Data
Naturally, Jayandra’s Microsoft team wanted to use their internal IP to create a Big Data analytics engine for Office 365. But after trying to build the analytics engine with proprietary Microsoft technology, the development team turned to open source solutions to replace their own products.
They did so at least in part based on their need for real time and batch mode analytics. For these purposes, a week’s worth of user data insights seemed sufficient, according to Jayandra. But even seven days of user storage and retrieval information proved daunting. “With Office 365 data there is much data velocity,” Jayandra says. “It’s very high frequency data with 10 terabytes (TB) stored a day.”
Having so much customer data on hand posed a lot of risk for the Office 365 team, which brought about the need to create a protection methodology with resilience and redundancy. “The customer data needed to be protected in multiple geographies replicated across datacenters,” Jayandra says.
He also spotted an issue relating to data signals. “A small set of signals tend to double every eight months. So we needed a model that can scale linearly.” In other words, Microsoft wanted Cassandra, with its “continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple datacenters and cloud availability zones,” as its website notes.
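The doubling rate Jayandra quotes compounds quickly. As a rough sketch, assuming the eight-month doubling holds steadily (the talk does not say for how long, or for which signals), the growth factor over time looks like this; the function name is hypothetical.

```python
# Illustrative arithmetic: growth of a quantity that doubles every eight months.
# The eight-month figure is from the talk; steady doubling is an assumption.
DOUBLING_PERIOD_MONTHS = 8


def growth_factor(months: float) -> float:
    """Return how many times larger the quantity is after `months` months."""
    return 2 ** (months / DOUBLING_PERIOD_MONTHS)


if __name__ == "__main__":
    for months in (8, 16, 24):
        print(f"after {months} months: {growth_factor(months):.0f}x")
```

At that rate, the affected signals would be eight times larger after two years, which helps explain the interest in a store whose capacity grows in proportion to the number of nodes.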
Can’t start a fire without Spark
Able to run on top of Hadoop, standalone or in the cloud, Spark is built for processing large volumes of data quickly. Jayandra had particular interest in using the Spark Streaming solution for building fault-tolerant computer clusters. “We spent time building fault tolerance and resilience,” he says.
Using the Spark connector to Cassandra made Office 365’s performance better, according to Jayandra. For example, the gateway services for Azure, Microsoft’s own cloud computing solution, can pull data from Spark and push it into Cassandra. “In the cluster, we run Spark and Cassandra,” Jayandra says. “Analytics run in the other datacenter.”
However, this was only for batch mode analytics. “We cannot have real-time apps,” Jayandra says. “Even Spark Streaming has no support to pull real time data.”
Data never rests
With geo-redundancy in Microsoft’s Spark strategy, it’s a matter of having a similar passive stack in a different region: one on the U.S. East Coast and one on the U.S. West Coast. “The web server that powers the interface can query both datacenters, depending on which the user is closest to,” Jayandra says. That said, Office 365 does not use the analytics cluster in the passive region.
In other cases, the analytics cluster cannot access data due to legal restrictions in some countries against storing customer data abroad. “So we have to replicate data in country to make data queries faster,” Jayandra says.
Lessons and mistakes with Spark, Cassandra
Overall, while building 36 nodes of Cassandra and Spark, Jayandra came to several conclusions: it is not a low-maintenance process, and it cannot be built with open source Apache products alone. The team also needed to take bits from DataStax, a leading technology provider to Big Data application developers.
Because this was Microsoft’s first open source project, Jayandra says, the team made some rookie mistakes. For example, rows were too wide, which slowed compaction and led to COM errors. Records became really big and rules were too large to load into memory. “What was a stable system had to be remodeled after just three weeks,” Jayandra says.
Although the Spark and Cassandra configuration passed stability tests, the system slowed down considerably once the project was moved to bigger production servers, according to Jayandra. “You can’t test for this,” he says. “Instead of a manual update of tables, the admin created a state where it went up by hundreds of thousands. It got us into a state where there were 200,000 files per node.” And you cannot let a node get like that. “Because there’s no going back,” Jayandra says.
In Azure, only a small bandwidth exists between datacenters, making it impossible to rebuild a datacenter, according to Jayandra. “Instead, we need to back up and restore.” Monitoring is very important in those scenarios where there are datacenter replication problems. Jayandra learned to take a datacenter out of the cluster if problems manifest themselves.
As it is today, Office 365 running on Spark and Cassandra is a low volume activity, with only tens of jobs on a daily basis. “As we increase jobs, we see there is no good job server,” Jayandra says. “We have not had good luck with open source job servers.”
To compensate for the lack of a reliable job server, the team created an alert that fires when performance drops by 10 to 15 percent. “That way we use Cassandra data as a deterministic test to check on the pipeline.”
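A threshold alert of this kind can be sketched in a few lines. This is a minimal illustration, not the team's implementation: the talk does not say which metric is tracked, so the sketch assumes a throughput number (for example, rows processed per minute) compared against a known baseline. All names and the 10 percent default are hypothetical.

```python
# Hypothetical sketch of a performance-drop alert. The metric, the function
# names and the 10 percent default threshold are assumptions for illustration.


def performance_drop(baseline: float, current: float) -> float:
    """Return the fractional drop of `current` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


def should_alert(baseline: float, current: float, threshold: float = 0.10) -> bool:
    """Return True when the drop meets or exceeds the threshold."""
    return performance_drop(baseline, current) >= threshold


if __name__ == "__main__":
    baseline_rows_per_min = 1000.0  # assumed healthy throughput
    for current in (950.0, 880.0):
        status = "ALERT" if should_alert(baseline_rows_per_min, current) else "ok"
        print(f"{current:.0f} rows/min: {status}")
```

In this sketch a 5 percent dip passes quietly while a 12 percent dip raises the alert; a lower threshold would catch problems sooner at the cost of more false alarms.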
Photo via Derek Handova