Removing barriers to data wrangling to let data scientists get down to the science


By Patrick Halpin, Vice President Data Science, 84.51°

When data scientists are tasked with building a business solution, the goal is to efficiently develop a solution that’s accurate, reliable, and scalable. But the data science process involves so many steps that they often don’t get to spend enough time on the modeling or algorithmic development portion of the work needed to develop the optimal solution.

At 84.51°, we’ve developed tools and resources that help data scientists focus on the science more quickly. We’ve created standard, reusable components that data scientists can use to simplify the end-to-end process and make it easier to go from ideation to production.

These components, which we often refer to as being like Lego bricks, embed best practices and enable data scientists to efficiently solve chunks of their solution. They make it easier to build better data science solutions using leading-edge techniques, for solutions that are faster to develop, standardized, and easier to build and share.
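As a rough illustration of the "Lego brick" idea, the sketch below chains small, independently tested functions with a shared interface into a pipeline. All names here (`load_transactions`, `filter_dates`, `aggregate_by_household`) are illustrative assumptions, not 84.51°'s actual API.

```python
# Hypothetical "Lego brick" components: each is small, standard,
# and independently testable, and they compose into a pipeline.

def load_transactions(rows):
    """Pretend data source: returns raw transaction records."""
    return list(rows)

def filter_dates(records, start, end):
    """Keep records whose 'week' falls in [start, end]."""
    return [r for r in records if start <= r["week"] <= end]

def aggregate_by_household(records):
    """Sum spend per household -- a reusable aggregation brick."""
    totals = {}
    for r in records:
        totals[r["household"]] = totals.get(r["household"], 0) + r["spend"]
    return totals

# Bricks compose: swap or reuse any step without rewriting the rest.
raw = load_transactions([
    {"household": "A", "week": 1, "spend": 25.0},
    {"household": "A", "week": 3, "spend": 10.0},
    {"household": "B", "week": 2, "spend": 40.0},
])
result = aggregate_by_household(filter_dates(raw, 1, 2))
print(result)  # {'A': 25.0, 'B': 40.0}
```

Because each brick has a single responsibility, a team can test and tune one piece once and every downstream user benefits.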

Simplifying data wrangling and prep

The packages we have created help streamline the data science process from the beginning. The first step of the process is data wrangling, which involves transforming raw data into a form more useful for the solution being built.

Once they have an idea of the model or solution they’re going to develop to solve a business problem, data scientists spend a lot of time trying to find out where the data is, what the quality of the data is, how to get access to it, and who the experts on that data are so they can learn about any variables.

Through our standard, reusable components, we’re removing those barriers and making it easier to get started. And we’re doing it through both process and code.

Identifying patterns for a faster start

When data scientists start on projects, they usually need to code everything from scratch. But we found that the process typically involves similar patterns for data wrangling, such as pulling similar timeframes, similar data sources, key fields, or other characteristics of our data assets.

We’ve examined those patterns and created packages that make extracting and aggregating that data much simpler: instead of having to write hundreds of lines of custom code each time, data scientists only need to put in five or six different parameters to get back the data set they want.

Not only do these packages save many hours of work in writing and QA testing code, they also standardize the process so it’s easier to bring new people on board and easier to ensure quality. Because those pieces are already built and tested, you don’t need to rebuild them from scratch each time, and we can tune their performance once so that all users benefit.
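A minimal sketch of what such a parameterized entry point might look like is below. The function name and parameters are illustrative assumptions; a real package would query a governed data asset rather than an in-memory table.

```python
# Hypothetical parameterized extraction: a handful of arguments
# replace hundreds of lines of bespoke extraction code.

def get_dataset(source, start_week, end_week, key_fields, measures):
    """Return rows for the timeframe, aggregated by the key fields."""
    # In-memory stand-in for a governed data asset, to keep this runnable.
    table = {
        "transactions": [
            {"week": 1, "store": "S1", "units": 3, "spend": 9.0},
            {"week": 2, "store": "S1", "units": 1, "spend": 4.0},
            {"week": 2, "store": "S2", "units": 2, "spend": 7.0},
        ]
    }
    rows = [r for r in table[source] if start_week <= r["week"] <= end_week]
    out = {}
    for r in rows:
        key = tuple(r[f] for f in key_fields)
        agg = out.setdefault(key, {m: 0 for m in measures})
        for m in measures:
            agg[m] += r[m]
    return out

# Five parameters instead of a custom extraction script:
data = get_dataset("transactions", 1, 2, ["store"], ["units", "spend"])
print(data)
```

The caller specifies only what varies per project (source, timeframe, keys, measures); the tested extraction and aggregation logic is shared.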

Sharing new data

Data scientists love to create new measures, metrics, and segmentations that have some sort of science or aggregation embedded in them. But that data has previously been difficult for others to find and access.

To facilitate smooth sharing, we’ve created packages and processes so other people can register their data, along with metadata about it. As a result, if you’re in one business area and someone on a different team creates a new segmentation or measure, it’s easier to locate, understand and use that information. When you have standards that dictate what should be written up about the data when registering it — what it is and how it’s used, who to contact — it makes people more comfortable about using the data. They can build on the best of what another team has created without having to come up with it themselves from the start.
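A lightweight registry along these lines might look like the sketch below. The required metadata fields (description, usage, contact) mirror the standards described above, but the field names and functions are assumptions, not the actual implementation.

```python
# Hypothetical dataset registry: registration is rejected unless the
# standard metadata fields are supplied, so consumers can trust entries.

REQUIRED_FIELDS = {"description", "usage", "contact"}
_registry = {}

def register_dataset(name, metadata):
    """Register a dataset; reject entries missing required metadata."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError(f"missing metadata fields: {sorted(missing)}")
    _registry[name] = metadata

def find_datasets(keyword):
    """Simple discovery: match a keyword against descriptions."""
    return [n for n, m in _registry.items() if keyword in m["description"]]

register_dataset("loyalty_segments_v2", {
    "description": "Household loyalty segmentation, refreshed weekly",
    "usage": "Join on household_id; segments are mutually exclusive",
    "contact": "segmentation-team@example.com",
})
print(find_datasets("loyalty"))  # ['loyalty_segments_v2']
```

Enforcing the metadata standard at registration time is what makes the registry searchable and trustworthy for teams outside the data's original business area.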

Delivering diagnostics

Once the data wrangling is complete, we have also created components that run diagnostics on the data.

These diagnostics allow data scientists to check the validity of the measures and metrics being used, and they also enable the scientists to create relevant KPIs for monitoring and QA of input and output data, for better overall model health.
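A simple sketch of such post-wrangling diagnostics is below: validity checks on a prepared dataset, summarized as KPIs suitable for monitoring. The check names and thresholds are illustrative assumptions.

```python
# Hypothetical data-health diagnostics: compute a few KPIs that can be
# monitored over time to QA model input data.

def run_diagnostics(rows, required_fields, spend_bounds=(0.0, 10_000.0)):
    """Return a dict of simple data-health KPIs."""
    lo, hi = spend_bounds
    n = len(rows)
    missing = sum(1 for r in rows if any(f not in r for f in required_fields))
    out_of_range = sum(
        1 for r in rows if not (lo <= r.get("spend", lo) <= hi)
    )
    return {
        "row_count": n,
        "pct_missing_fields": missing / n if n else 0.0,
        "pct_spend_out_of_range": out_of_range / n if n else 0.0,
    }

rows = [
    {"household": "A", "spend": 12.5},
    {"household": "B", "spend": -3.0},  # invalid: negative spend
    {"spend": 8.0},                     # invalid: missing household
]
kpis = run_diagnostics(rows, required_fields=["household", "spend"])
print(kpis)
```

Tracking KPIs like these on every refresh makes it easy to catch upstream data problems before they silently degrade a model.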

It’s all part of an effort to make the data wrangling step easier, so data scientists can move more quickly to spending time on the science. Our components don’t take away flexibility or creativity; they make it easier to spend more time on the model development side, for innovative solutions that bring better business results.

As Vice President of Data Science, Patrick Halpin leads a team of more than 40 data scientists, research scientists, and machine learning engineers who help solve problems and drive the evolution of data science work, best practices, and data science governance for the enterprise.
Prior to 84.51°, Patrick held multiple positions at dunnhumbyUSA after joining the company in 2004. He has more than 20 years of experience in analytics and data science across the retail, transportation, gaming and financial industries. He has extensive experience with solving data science problems for some of the world’s largest retailers such as The Kroger Co., Tesco, Best Buy, and Home Depot.
Patrick holds a Master of Applied Mathematics (Operations Research) from the College of William & Mary and a Bachelor of Science in Mathematics (Statistics and Actuarial Science) from Elizabethtown College.

About 84.51°

84.51° is a retail data science, insights and media company helping The Kroger Co., consumer packaged goods companies, agencies, publishers and affiliated partners create more personalized and valuable experiences for shoppers across the path to purchase.
Powered by cutting-edge science, we leverage first-party retail data from nearly 1 of 2 US households and 2BN+ transactions to fuel a more customer-centric journey utilizing 84.51° Insights, 84.51° Loyalty Marketing and our retail media advertising solution, Kroger Precision Marketing.
