Building Data Lakes or Data Landfill? Pitfalls and How to Avoid Them

    Are you building data lakes that will create business value or are you creating a data landfill? In spite of the best of intentions, the reality is that many data lake initiatives are doomed to be costly and ineffective from the start. In this post, we look at why the prospects for data lakes can be so dicey, and we examine how to put the odds of success squarely in your favor.

    Hoarding: Not a Winning Strategy

    People have high expectations when building data lakes. Put a lot of data together in one place, and insights will be gleaned that propel the business. Or so the theory goes. However, the reality often falls far short of the vision. To use a real estate equivalent, people envision a clean post-modern mansion with a view of downtown. What they get is something out of an episode of “Hoarders.”

    What’s happening? In short, teams set out to build a data lake without a clear plan. Many leaders start gathering data without a clear idea of how it will be used. They haven’t assembled a data science team, yet they believe that simply pushing all the data into one big database will provide value in the future.

    Too often, in an effort to be comprehensive, teams aggregate as much data as possible in a single location, regardless of the value or utility of the specific records incorporated. These repositories can quickly balloon in size, and the associated maintenance effort, storage costs, and business risks all increase dramatically as well. Ironically, the bigger the repository grows, the lower your odds of success may become.

    Valuable assets may be housed there, but the sheer volume of items overwhelms any ability to find, let alone use, the things that matter. Instead of building data lakes, teams wind up with a data landfill.

    How Data Science Works

    Giving a team a dump of data isn’t useful; it’s counter-productive. To remedy the scenarios above, you need to start with a clear understanding of how data science, analysis, and engineering actually work.

    First, recognize who should run point on a data lake initiative. While data engineers are essential team members, they aren’t the people to lead these efforts. These engineers are responsible for tools and can manage the mechanics of gathering and storing data. However, they don’t analyze the data or create models from it, so they aren’t necessarily who you want deciding which data sets are valuable and need to be aggregated. Instead, look to have data scientists and data analysts lead the way. Scientists can create models that will give you deep insight into the data. Analysts can highlight trends they are seeing and, based on these insights, help point out places to focus transformation efforts.

    The reality is that, without data, there aren’t any dashboards or reports. But having too much unstructured data makes it laborious and tedious to find the specific data that will yield the insights required. Leveraging structured data can help. Still, whether the data is structured or not, unless teams have a clear, consistent understanding of what the data means and what questions they are trying to answer, the data is more apt to be garbage than useful.

    Start with Objectives

    Before your teams break ground on building data lakes and collecting data, it’s critical to start by clearly determining what you’re looking to get out of the effort. This includes answering a number of questions:

    • What business questions do you want to answer?
    • What data do you need to run the business?
    • What questions will you ask your team based on the data you see?

    A big part of this is also aligning the user and objective with the appropriate level of detail. For example, vice presidents don’t need to know when a build is broken—unless the team they manage is falling apart and they are stepping in to provide day-to-day oversight. However, they will want to see if there’s a delay in getting a product to market, which may be the result of builds being continuously broken. When they see that there is a delay in the schedule, they should be able to ask their teams the reason for the delay. Further, teams should be able to provide a concrete answer based on the data they are seeing.

    Similarly, in the technology domain, a CIO will want high-level availability and service level metrics, not CPU utilization metrics for a specific server. At the same time, the server admin will absolutely need those metrics for troubleshooting. The reality is that every minute, sensors or monitoring agents may collect extensive details, but only a handful of metrics may be needed to forecast demand or usage. Also, it may be much better to track rolling averages of a few key elements rather than entire, interval-specific feeds.
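
    To make the rolling-average idea concrete, here is a minimal Python sketch. It assumes a hypothetical per-minute feed of CPU utilization readings; the function name `rolling_mean`, the window size, and the sample values are illustrative assumptions, not part of any particular monitoring tool.

```python
from collections import deque
from typing import Iterable, List


def rolling_mean(samples: Iterable[float], window: int) -> List[float]:
    """Return the mean of the most recent `window` samples at each step.

    Until `window` samples have arrived, the mean covers what is available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # the oldest sample drops off automatically
    running_total = 0.0
    means = []
    for value in samples:
        if len(recent) == window:
            running_total -= recent[0]  # subtract the value about to be evicted
        recent.append(value)
        running_total += value
        means.append(running_total / len(recent))
    return means


# Hypothetical per-minute CPU utilization readings (percent) for one server.
cpu_percent = [20.0, 40.0, 60.0, 80.0, 100.0]
print(rolling_mean(cpu_percent, window=3))
# prints [20.0, 30.0, 40.0, 60.0, 80.0]
```

    Storing only the smoothed series, or a periodic sample of it, is one way to keep what a forecast needs without retaining every interval-specific reading. Whether that trade-off is acceptable depends on who else, such as the server admin, still needs the raw feed.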

    Validating Your Data and Approach

    When building data lakes, it’s important to do so intentionally. Test and validate data, findings, and approaches, and do so both early and often. For example, once an initial feed is complete, create a chart based on the data and run it by others who are close to the source information. Make sure that glitches aren’t skewing results, and that the data being delivered is aligned with reality.
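
    As one possible starting point for this kind of early check, the sketch below scans an initial feed for missing or implausible values before anyone charts it. The function name `find_glitches`, the valid range, and the sample feed are all hypothetical; real checks would come from the people closest to the source information.

```python
from typing import List, Optional, Tuple


def find_glitches(
    readings: List[Optional[float]], low: float, high: float
) -> List[Tuple[int, str]]:
    """Return (index, reason) pairs for readings that look suspicious."""
    problems = []
    for index, value in enumerate(readings):
        if value is None:
            problems.append((index, "missing"))
        elif not (low <= value <= high):
            problems.append((index, "out of range"))
    return problems


# Hypothetical feed of availability percentages; valid values are 0 to 100.
feed = [99.9, 99.8, None, 250.0, 99.7]
print(find_glitches(feed, low=0.0, high=100.0))
# prints [(2, 'missing'), (3, 'out of range')]
```

    A check like this does not replace a review by people who know the data, but it can surface obvious glitches before they skew a chart.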

    After these initial assessments, it’s important to validate and then make necessary fixes and refinements. These fixes can fall into three general categories:

    • Process fixes. Any number of issues can come up within the data aggregation process. This can include data being inadvertently omitted, or, on the contrary, being included when it shouldn’t be. This can also be an opportunity to identify where manual, mistake-prone tasks can be automated, or where inefficiencies can be weeded out.
    • People fixes. To gain the insights you need into the products being created, you may need teams to change the way they’re doing things. If you can’t build good predictability charts from the data collected, you might need better alignment around the data teams enter and update in their user stories, tests, defects, and so on. Or when monitoring in production, you might need certain items to be logged so you can see specific errors.
    • Technology fixes. During initial implementation phases, any number of technology issues may crop up. This can include misconfigured systems and inefficient architectures. In addition, established calculations and algorithms need to be vetted. It’s critical to look for inconsistencies or errors, and determine how to weed them out.
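
    The process fixes above, data being omitted or included when it shouldn’t be, can often be spotted with a simple reconciliation between the source system and the lake. The following is a minimal sketch that assumes each record carries a unique identifier; the function name `reconcile` and the sample IDs are hypothetical.

```python
from typing import Dict, Iterable, Set


def reconcile(
    source_ids: Iterable[str], lake_ids: Iterable[str]
) -> Dict[str, Set[str]]:
    """Compare record IDs in the source system with those loaded into the lake."""
    source, lake = set(source_ids), set(lake_ids)
    return {
        "omitted": source - lake,     # in the source but never loaded
        "unexpected": lake - source,  # loaded but not present in the source
    }


# Hypothetical defect IDs from a tracking system and from the lake.
result = reconcile(
    source_ids=["D-1", "D-2", "D-3"],
    lake_ids=["D-2", "D-3", "D-9"],
)
print(result["omitted"])     # prints {'D-1'}
print(result["unexpected"])  # prints {'D-9'}
```

    Running a comparison like this after each load is one way to catch aggregation problems early, and it is a natural candidate for automation.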

    Conclusion

    No one likes to be dumped upon. That certainly holds true for the teams forced to contend with data lakes composed of large volumes of data that is not filtered, sanitized, or even needed. These dumps present a lot of work for people on the receiving end, with little likelihood of delivering any real value to the business. The guiding principles above are instrumental in ensuring that data lakes don’t inflict raw data dumps on others. Instead, these approaches can set the stage for building data lakes that ultimately deliver the insights that teams need to boost operations, decision making, and planning.