Data scientists need to decide which data to include in data warehouses. To make that decision easier, here are tips for keeping control of your data funnel.
As of 2022, roughly 2.5 quintillion bytes of new data are created worldwide every day. Although some of this data will be useful for analysis, it can be time consuming and difficult to sort. By creating an efficient data funnel, you can more easily filter down to the data you actually need.
SEE: Hiring kit: Database engineer (TechRepublic Premium)
What is a data funnel?
The data funnel refers to narrowing the amount of data you allow into your master data store.
A good way to think of a data funnel is to compare it to the hiring funnels that human resources departments use in resume-screening software. HR enters an open position's requirements into analytics software that reviews incoming resumes, producing a smaller funnel of qualified job applicants. This lets HR managers and interviewers focus on more important tasks instead of manually sorting resumes.
The same approach works for data. In one case, a life sciences company studying a particular molecule for its disease-fighting potential eliminated all incoming data sources that did not mention the molecule by name. The goals were to save on storage and processing and to gain insights sooner. While filtering out all of this external data worked for that company, controlling the data funnel is a balancing act between how much data you need and how much data you can afford to store and process.
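A source-level filter like the one the life sciences company used can be sketched in a few lines. This is a minimal illustration, not the company's actual pipeline: the record fields and the molecule name are assumptions made for the example.

```python
# Hypothetical data funnel: keep only incoming records that mention
# a target term, dropping everything else before it reaches the
# master data store. Fields and the term itself are illustrative.

TARGET_TERM = "adenosine"  # stand-in for the molecule under study

def funnel(records, term=TARGET_TERM):
    """Yield only records whose text mentions the target term."""
    for record in records:
        if term.lower() in record.get("text", "").lower():
            yield record

incoming = [
    {"source": "journal_feed", "text": "Adenosine shows anti-inflammatory effects."},
    {"source": "news_feed", "text": "Quarterly earnings beat expectations."},
    {"source": "trial_registry", "text": "Phase II trial of adenosine analog begins."},
]

kept = list(funnel(incoming))
print(f"{len(kept)} of {len(incoming)} records pass the funnel")
```

In practice the same keyword test would run per data source rather than per record, so entire feeds that never mention the term can be dropped from the pipeline altogether.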
How do you decide which data is important?
The sheer costs of storage and processing, whether on premises or in the cloud, force companies to estimate exactly how much data they need for business analysis.
In some cases, it’s easy to decide which data to discard: you probably don’t want the noise of network and machine handshakes in your data. Deciding which topic-related data to exclude is harder, and there is a risk that analytics teams will miss important insights because of the data that was excluded.
For example, relying only on the data it typically collects, a UK retailer might never have discovered that married women made most of its online purchases while their husbands were away at football matches.
Unexpected but impactful insights like this are why IT and end-user business groups need to be careful when deciding how far to narrow the input funnel.
3 best practices for controlling a data funnel
Outline the use cases your analytics support and the data you think they need
This should be a joint exercise between IT/data science teams and end users. Do you want to include social media product complaints when analyzing your sales and revenue data? If you’re studying disease levels in your New York City healthcare service area, are you interested in what’s going on in California?
Determine how accurate your analytics need to be
The gold standard for analytics accuracy is that results should agree at least 95% with what human subject matter experts would conclude, but do you always need 95%?
You may need 95% accuracy if you are assessing the likelihood of a medical diagnosis based on a given patient’s health conditions, but 70% accuracy may be enough if you are predicting what climate conditions might look like in 20 years.
Accuracy requirements shape the data funnel: you may be able to filter out more data and narrow your funnel if you’re only looking for general, long-term trends.
Test the accuracy of your analytics regularly
If your analytics show 95% accuracy the first time you run them but drop to 80% over time, it makes sense to recheck the data you’re using and recalibrate the data funnel.
New data sources may also become available that didn’t exist when you first built the funnel. Adding these sources widens the data funnel, but if they raise your accuracy levels, widening the funnel is worth the cost.
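A recurring accuracy check can be automated as a simple comparison between analytics output and a sample of expert-labeled results. This is a hedged sketch: the labels, predictions, and 95% target below are illustrative assumptions, not values from the article.

```python
# Hypothetical accuracy check: compare analytics predictions against
# expert conclusions on a labeled sample and flag when agreement
# drops below the target, signaling the funnel may need recalibration.

def accuracy(predictions, expert_labels):
    """Fraction of predictions that match the expert conclusions."""
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return matches / len(expert_labels)

TARGET = 0.95  # the "gold standard" agreement level discussed above

expert_labels = ["positive", "negative", "negative", "positive", "negative"]
predictions   = ["positive", "negative", "positive", "positive", "negative"]

score = accuracy(predictions, expert_labels)
if score < TARGET:
    print(f"Accuracy {score:.0%} is below the {TARGET:.0%} target: "
          "recheck data sources and recalibrate the funnel")
```

Running a check like this on a schedule turns "test regularly" from a policy statement into an alert that fires the moment accuracy drifts.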