Industries from healthcare to agriculture to banking to retail are hitching their wagons to the big data and analytics train, and for good reason. Having usable information about your industry, customers and products can provide key competitive advantages and ultimately improve your business. However, outside the world of PhD statisticians, few business users know the technologies and techniques that drive those larger, grander initiatives.
Though the terms analytics, business intelligence and big data are bandied about on a fairly constant basis, many business users still view data science techniques and approaches as the realm of, well, data scientists. This is a shame, as most business users already engage in some form of data science on a daily basis. Every time we sum up a column of sales results for a particular product or perform a VLOOKUP to pull content from one worksheet into another, we are doing data science, just in an ad hoc (and in many cases inefficient) manner.
Here are five reasons it’s time to get to know them, no matter what your job function:
- The amount of data is growing, rapidly
The amount of data that a business professional has at his or her fingertips is staggering. Everything from industry datasets to CRM data to the information gathered from social media makes this one of the richest times in history for those who place a high value on information and data. At the same time, the pressure to do something with that data leads many of us to hit seemingly insurmountable walls during our analysis efforts. These walls often lead either to data that doesn’t tell the full story or to a project that ends far too quickly.
Data scientists are used to dealing with disparate data types, and they have worked to minimize the effort it takes to acquire content from various sources. Data scientists start with data acquisition, typically storing the data in the form of an object that can be easily manipulated. Languages like R and Python or vendor-provided analytics products make it easy to acquire the data, whether from comma-separated files, social media application programming interfaces, databases or even scraped from a website. Once the data is acquired in a form that can be easily manipulated by one of these tools, the normalizing process can begin.
Normalizing datasets is a key tenet of the data science toolkit, and it’s much easier to do if you first acquire the data through proper channels. From there, it’s a matter of applying the right tool for the task.
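As a sketch of that acquisition step, the pandas library in Python reads very different sources into the same kind of object. The products and customers below are invented, and the sources are built in-memory purely so the example is self-contained; in practice they would be real files, APIs or database connections:

```python
import io
import sqlite3

import pandas as pd

# A comma-separated "file" (an in-memory stand-in for sales.csv)
csv_source = io.StringIO("product,units\nwidget,10\ngadget,4\n")
sales = pd.read_csv(csv_source)

# A database works the same way (here, an in-memory SQLite database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, region TEXT)")
conn.execute("INSERT INTO customers VALUES ('Acme', 'East')")
customers = pd.read_sql_query("SELECT * FROM customers", conn)
conn.close()

# Whatever the source, the result is the same manipulable object
print(sales.shape, customers.shape)
```

The payoff is uniformity: once every source is a DataFrame, the same normalizing and combining operations apply regardless of where the data came from.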
- Data wrangling takes time
Acquiring and normalizing data aren’t the end of the process, as you are often expected to combine and subset data from multiple sources. In a recent CrowdFlower survey, 52.9% of the data scientists polled cited collecting data sets, and 66.7% cited cleaning and organizing data, among their most time-consuming tasks. Data scientists seek to create “tidy” datasets, which Hadley Wickham defines as ones that are “easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.”
Gathering data and creating tidy data is the ugly, overlooked part of data science. If there is one skill the average business user should learn, though, it is how to prepare data for analysis. Understanding this process is critical if you want to separate your output from what can be done with the average spreadsheet. There are tools and functions available to automate tasks like deduplication, string separation, joining, and gathering and separating columns. In the hands of someone who has practiced the techniques, these functions are powerful, leading to insights previously unavailable because of the limits of typical office software packages.
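A minimal pandas sketch of a few of those preparation tasks, run against a small hypothetical sales export (the column names and values are invented for illustration):

```python
import pandas as pd

# A "messy" export: a duplicate row, a combined name field, and one
# column per quarter instead of one observation per row.
raw = pd.DataFrame({
    "rep": ["Smith, Jane", "Smith, Jane", "Lee, Sam"],
    "Q1": [100, 100, 80],
    "Q2": [120, 120, 90],
})

tidy = raw.drop_duplicates()                          # deduplication
tidy[["last", "first"]] = (
    tidy["rep"].str.split(", ", expand=True)          # string separation
)
tidy = tidy.melt(                                     # gather quarter columns
    id_vars=["rep", "last", "first"],                 # into one row per
    var_name="quarter", value_name="sales",           # rep-quarter observation
)
print(tidy)
```

The result matches the tidy definition above: each variable (rep, quarter, sales) is a column and each observation is a row, which is the shape most analysis and graphing tools expect.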
Additionally, those data sets are growing well beyond the size limits of a typical spreadsheet (Excel, for example, tops out at 1,048,576 rows), which can cause your project to fail outright when you hit them. In most cases, though, the average business user will have already quit the project out of frustration over the time it takes to manipulate files of this size in a spreadsheet. Tailored software can make data at this scale manageable.
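One way such software sidesteps spreadsheet row limits is streaming: pandas, for instance, can read a file in fixed-size chunks and keep only a running aggregate in memory. A toy sketch, where a tiny in-memory CSV stands in for a file far larger than a spreadsheet could open:

```python
import io

import pandas as pd

# Stand-in for a huge CSV; in practice this would be a path to a file
# with millions of rows.
big_csv = io.StringIO("product,units\n" + "widget,1\n" * 5 + "gadget,2\n" * 5)

total = 0
# chunksize=4 here; for a real multi-gigabyte file something like
# chunksize=1_000_000 keeps memory use flat.
for chunk in pd.read_csv(big_csv, chunksize=4):
    total += chunk["units"].sum()
print(total)
```

Because only one chunk is in memory at a time, the file's total size stops being the limiting factor.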
- Graphs are worth a million words
If a picture is worth a thousand words, a graph is worth a million. A good graph can condense millions of rows of data into a single image and instantly tell a story. Spreadsheet programs make it easy to create and present graphs. For many purposes (if the data is small enough and already in the right format), this works.
When you are talking about graphing in the world of data science, though, there is an abundance of theory and technique that most people will never need. But the principles behind well-constructed graphs, and the tidy data concepts that shape the data feeding them, are very valuable. Just searching for “data graph best practices,” reading up on some of the results, and incorporating those techniques into your daily repertoire may be enough to separate you from your peers and make your graphs more effective.
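As a small illustration of condensing many rows into one image, the sketch below uses matplotlib to summarize 100,000 simulated order values in a single histogram; the data, labels and filename are all hypothetical:

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Simulated stand-in for real order data (log-normal, a common shape
# for monetary values).
rng = np.random.default_rng(0)
orders = rng.lognormal(mean=3.0, sigma=0.5, size=100_000)

fig, ax = plt.subplots()
ax.hist(orders, bins=50)
ax.set_xlabel("Order value")              # labeled axes and a clear title
ax.set_ylabel("Number of orders")         # are among the simplest of the
ax.set_title("Distribution of order values")  # "best practices" above
fig.savefig("orders.png")
```

One hundred thousand rows become a single picture of the distribution, which is the point: the graph tells the story the raw table cannot.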
In addition, there are many software packages that have already incorporated theories from books like The Visual Display of Quantitative Information by Edward Tufte and The Grammar of Graphics by Leland Wilkinson into their toolsets, so you instantly gain from the knowledge of others without having to become a PhD theorist.
- Repeatable and replicable output can be a challenge
Throughout my career, achieving repeatable and replicable results for my projects has always been a frustration. For projects that happened only one to three times a year, I had to go back and retrace my steps to determine where the data came from, how I manipulated it and the assumptions I made to get the outcome I ultimately settled on. This effort, in many cases, took the same amount of time each time, and made up the majority of the total time and energy put into the project.
Upon digging into the world of data science, I found the emphasis data scientists place on replicable and repeatable work refreshing, as is the fact that the tools are designed for it. Tools like R Markdown and Jupyter notebooks, which combine executable code with formatted text, mean you can create a narrative that describes your thoughts and efforts and record the code in the same document.
For users who complete a project only to come back to it several weeks or months later, the breadcrumbs now exist to allow you to quickly regain your footing with the project and get back up to speed as quickly as possible.
- It’s the future
An organization’s desire to know more about its business, partners and customers through data is only going to grow going forward, most likely at a rapid pace. Those individuals who have researched and applied these tools and techniques to their business practices are going to excel in an increasingly data-driven world that has moved beyond the spreadsheet. Continuing to expand your toolkit ahead of the requirements of your job will not only make you shine, but keep you sane going forward.