Data Lakes, Data Hubs, Federation: Which One Is Best?

The more things change, the more they stay the same. No matter who you are or what business you are in, it’s safe to say that you still have data silos. And these silos are still slowing down your business or government agency.

Sometimes the data silos are an obvious cause of problems. Teams can’t answer seemingly simple questions because data spans two (or ten) systems. Other times, fragmented and incompatible data stifles innovation, but rather than silos being blamed, the development teams and managers are.

Unfortunately, integrating data to merge the silos is notoriously difficult. There is no magic bullet to integrate disparate data, but there are new, breakthrough techniques that ease the pain.

I’ll cover three newer approaches in this blog: Virtual Database (also called database “Federation”), Data Lakes and Data Hubs – and I’ll explain how they differ and list their pros and cons.

MarkLogic supports all three to different extents, but we prefer Data Hubs, and at the end, I’ll explain why Hubs are the best and how MarkLogic supports that approach. But even if you don’t build a Data Hub or don’t use MarkLogic, I promise this blog will be useful to you.


The Old Way: Comprehensive Enterprise Modeling

The Old Way: Comprehensive Enterprise Modeling

First a brief word on the old approach. People used to (and occasionally still) build a new enterprise data model comprising every field and value in their existing enterprise, across all silos, and then map every silo to this new model in a new data warehouse via ETL jobs.

ITBusinessEdge surveyed companies and found that this approach always fails. Survey respondents report that it goes over budget or fails 100% of the time.

So we’ll skip that and talk about Virtual Databases, Data Lakes and Data Hubs, which each try to solve this problem in a new way.


Definitions: Virtual Databases, Data Lakes & Data Hubs

First, let’s define our terms.

Virtual Database

A Virtual Database is a system that accepts queries and pretends to be a big database that includes many disparate siloed data sets. In actuality, it queries the back-end (live, production or warehouse) systems in real time and converts the data to a common format as it is queried.

This is also called a Federated Database, and the sub-systems integrated are called Federates.

Data Lake

A Data Lake is a term popularized in the Hadoop community and has multiple definitions. Broadly, if you move all your data from disparate silos into one system (Hadoop/HDFS) it is now a Data Lake. The data need not be harmonized, indexed, searchable or even easily usable, but at least you don’t have to connect to a live production system every time you want to access a record.

Data Hub

A Data Hub is a hub-and-spoke approach to data integration where data is physically Moved and Re-Indexed into a new system. To be a Data Hub (vs. a Data Lake) this system would support discovery, indexing, and analytics.


Movement, Harmonization and Indexing

I’m going to make the case that we can distinguish the three based on how and when (and if) there is Movement, Harmonization and Indexing of data. More definitions below.

Maybe you wondered why I capitalized Movement, Harmonization and Indexing? I’ll use these as specific terms throughout, and have capitalized them to emphasize that these are the three Key concepts that distinguish Lakes, Hubs and Federations.

Data Movement

Hubs and Lakes both Move data. The data from silos is copied to a new set of disks and processed by a new set of servers and software. Disk space is now very, very cheap – so the operational (and organizational) benefits of having a separate data store that does not hit your production servers for every lookup is huge.

In contrast, a Virtual (Federated) Database delegates queries to each source system for each query, so the data does not move and remains in the silos.

Data Harmonization

Hubs and Federations both Harmonize data (although Federations only do it as it is returned or processed). Data is in different formats with different field names and schemas in the various silos, but when it is ultimately served up, the records must look at least similar to be processed as a group. Harmonization falls into three categories:

  1. Naming Differences: The simplest to handle, naming differences occur when different silos have an identical value with identical meaning, but use a different name. E.g. “PERSON.firstName” vs. “IDENTITY.GIVEN_NAME” in two different relational database tables – but holding the same sort of data.
  2. Structural Differences: Somewhat more complex than Naming Differences, structural differences occur when the number and combination of fields differs across silos. E.g. “boxes_available” in one system might be total boxes in the warehouse plus a count_per_box field to derive total items, but in another system may directly represent total_items (regardless of boxing). Further: one may represent available inventory not already allocated to unfilled orders, where the other may represent all physical inventory, regardless of open orders.
  3. Semantic Differences: These are the worst. One system may have three patient statuses and another may have five. These statuses will often overlap and be hard to map to one another ({Scheduled, Needs_Followup, Inactive} vs. {Intake, Scheduled, Telemedicine-only, Discharged})

Progressive Harmonization for Agility

All the approaches Harmonize. At the end of the day, well-structured, documented APIs or data exports (or reports) must return coherent data in a defined form. However, that Harmonization can and should be agile and progressive.

Typically, key requirements are identified early, and the associated data elements are defined to be “critical data elements” and are Harmonized at first. Over time, additional requirements drive Harmonization of more and more data. I’ll call this harmonization of the critical data “partial” harmonization. The process of increasing the subset that is harmonized over time is “progressive” harmonization.

MarkLogic particularly shines here, as MarkLogic Server indexes the structure of data from source systems in both Harmonized and Raw forms, so some search and processing can be done in the Data Lake style, but critical data is indexed and accessed in Harmonized form.

Indexing

The final characteristic that defines the approaches and distinguishes among them is in how and when they Index the data. Indexing allows fast lookup and fast analytics on any indexed field. Indexes are written to disks, and build on top of the data fields themselves, so how data is Moved and Harmonized drives how it can be Indexed.

  • Federated (Virtual) Databases don’t do any Indexing, nor do they have separate data storage to store indexes at all. They rely on the various silos and source systems to have adequate indexing (a bad idea and false assumption, by the way). They map any request into a different request for each source system and execute on all source systems.
  • Data Lakes do very little Indexing. If they do index, they do it by integrating additional components (thus building a “franken-stack” from many integrated components) to index parts separately – such as smaller data marts, SOLR text indexes, HBase tables with indexes and so on. See my personal experience with these Franken-systems.
  • Data Hubs Move all data to one place, partially- or fully-Harmonize it, and then index the data in its Harmonized form. This is ideal because the most useful indexes can only be built atop Harmonized data.

World of Pain

Clearly, it is very hard to aggregate, find, group, or analyze data if the data is named differently, has fundamental structural and semantic differences or is not indexed. You or your developers will somehow work around the differences, but if you’re not careful the treatment will be worse than the cure – you’ll end up with piles of un-governable code scattered across dozens of reports, batch processes, and ETL jobs.


Summing Up the Differences

So we have three approaches and three criteria that distinguish them. Here’s a table to sum up the three approaches and how they behave with respect to each criterion:

This makes each algorithm or ETL process responsible for dealing with the variety of data formats in its own way.

Approaches Is Data Moved? Is Data Harmonized? Is Data Indexed?
Federation No. Data is left in the silos.
Data Lakes Yes. Data is moved to one place, but left in its source format. Not much. It’s hard to index data when it is in many incompatible forms.
Data Hub Yes. Data is moved to one place. Yes. Data is (at least partially) harmonized as it is moved Yes, Data is indexed in the harmonized form for efficient access and analysis.

Implications of Each Approach

That covers quite a lot about the differences between the approaches. But they say the devil is in the details.

So now I’ll go into a few of the implications of those differences to show how each plays out in the real world, and make the case that Data Hubs are the best solution.

Virtual Database (Federation) Implications

Because Virtual Federated Databases rely on source systems for query, they are limited by the capabilities and availability of those source systems.

Least Common Denominator Query: If any source system or silo does not support a query – because that query searches by a particular field, orders by a particular field, uses geospatial coordinate search, uses text search, or involves custom relevance scores – then the overall system can’t support it. This also means that any new systems added later may actually decrease the overall capabilities of the Federation, rather than increase it.

Query Re-mapping: The particular shortcoming of Virtual Federated Databases is that every query to the overall system must be converted into many different queries or requests – one for every federated silo or sub-system. This creates additional development work and tightly couples the Federated system to the silos.

Real-Time Data: This is the one benefit of Federation. You don’t need to hook into change detection mechanisms, or implement them if you don’t have them. I would argue that if your various sources and silos have no ability to figure out what has changed every hour or day or on some schedule, that is an enterprise problem you should solve. That said, Federated systems always query the live data, so they are always real-time even without any ability to detect changes in the source systems.

Operational Isolation: Federated systems go down when any federate goes down, or require complex code and user interactions to support partial queries in a degraded state. Often, live source systems do not have capacity for even minimal real-time queries, much less critical batch processes, so the federated virtual database may bring down or impact critical up-stream systems.

This diagram illustrates the just-in-time form of Harmonization in a Federation.

Federated

Virtual (Federated) Databases perform all Harmonization in real-time, they so can support real-time web services to “run the business” but cannot typically scale to support batch processing or analytics. You can run parts of your business on a Federated Database, but have to be cautious about how much load you can put on the systems.

To sum up my views on Virtual Federated Databases:

Even if you automate the process of scurrying around to query 10 different systems, you’re still scurrying around to query 10 different systems. It just doesn’t help much.


Data Lake Implications

Operational Isolation: Data Lakes are better than Federated systems because they do provide operational isolation by moving the data to a separate infrastructure. This is arguably their primary advantage.

Query: Data Lakes do not Index or Harmonize data, and the source system indexes are not available in the Data Lake – so query-ability is actually worse with a Data Lake than a Virtual Federated Database.

Batch Processing: Data Lakes (and Hadoop systems generally) do reasonably well at batch processing, including predictive analytics. If you are willing to load and process every record no matter what, you do not need sophisticated query and can work around lack of indexes.

Simplicity and Manageability: Because data is in various formats in a Data Lake, it requires complex logic in each batch process, ETL job or analytic job. This code rapidly becomes un-governable, mismatched, out of date and a burden.

Federated

Data Lakes require batch analytics to brute force process data during analysis or egress, or batch ETL to load disparate data formats into a new form, so are generally for data analytics or insight, rather than real-time processing. You can “see your business” with a Data Lake, but you cannot “run your business” on them, because of their analytic/batch nature.

As Data Lakes add this additional ETL, they are actually Harmonizing data and moving toward a Data Hub, but still without the transactional, real-time capabilities that allow an enterprise to run the business.

To sum up my views on Data Lakes:

If your child has a disastrously messy room, and picks everything up off the floor only to throw it in the closet, and then closes the closet door, is her room really clean? Or is it still a mess, just located in a central place?

Data Hub Implications

Query: Data Hubs maintain their own indexes – and build these indexes over Harmonized data. Remember that harmonization can be a progressive, agile process – so as the range of Harmonized data increases, the power of the indexing increases as well.

In contrast, other systems must default to “least common denominator” indexing and query. Queries can only be supported in a Federation if they are supported by the least powerful system in that Federation. Analytic or reporting processes can only be supported in a Data Lake if the code can handle the messiest least-compatible data sources in the lake.

Put another way – query is dictated by the least powerful system or data source without Harmonization and Indexing, but query can exceed even the most powerful source system using a Data Hub because it Indexes on top of Harmonized data.

Note that MarkLogic indexes: Fields, XML Structure, JSON structure, Text, RDF semantic data, and geospatial data out of the box. Binary data is also stored, with text and binary metadata extracted from formats including Word documents, PDFs, JPEGs and others. This is one reason MarkLogic typically recommends Data Hubs.

Operational Isolation: Like a Data Lake, a Data Hub Moves data to separate disks and infrastructure, so load and management are isolated from source systems.

Batch Processing: Like a Data Lake, you can query and process many records when needed in a Hub. Unlike a Data Lake, Data Hubs allow Index-driven query to restrict the quantity of data being processed for any particular purpose, and to do fast lookups and joins to combine data during batch jobs without a “sub-batch” to find relevant, related records.

Simplicity and Manageability: Because data is at least partially-Harmonized in a Data Hub, all data can be handled in common code. Building on the inventory example above, the Harmonized data set might have “unitsOnHand” as a harmonized field regardless of which incompatible source system the data came from (total_items vs boxes_available). The Harmonized fields will be indexed and uniform, making access to inventory information much simpler.

Discovery and Exploration: Ad-hoc or unexpected analyses can be served easily from a Data Hub. By Moving the data to one place and Harmonizing the critical data elements, the hard work has been done during Movement and Indexing, making it easy to query, interact, filter and drill down into data whenever needed.

Query re-mapping: Data Hubs avoid query re-mapping to each source system by Harmonizing data first and indexing the uniform data set. Note that Harmonization is a “sunk development cost” since you cannot reasonably have a system that returns data from every source in a different format – all the approaches must eventually Harmonize (even via bespoke code in every report). By Indexing and storing after Harmonization, Data Hubs allow simple, uniform queries.

The Old Way: Comprehensive Enterprise Modeling

Data Hubs have indexed, queryable data in a common form, so they can support web services, real-time applications, batch and also analytics. Because of the real-time services, in particular, you can “run your business” on a Data Hub.

To sum up my views on Data Hubs:

Build a Data Hub. It is the most powerful approach and is the easiest to build and maintain.


Pourquoi choisir MarkLogic ?

I hope all this information is useful regardless of your current approach or ultimate architectural choice. Having established the advantages of the Data Hub approach over others, why use MarkLogic to build one?

While the MarkLogic Server supports Data Lakes and Federations, we have found Hubs to be the best option, and have spent more than a decade building features to keep your operations running smoothly.

These features include:

Universal Index: MarkLogic takes any data, as-is, and stores and indexes it. We leverage the structure already present in common record formats including XML, JSON and RDF, and index that structure automatically (with no coding).

Massive Scale: MarkLogic is inherently (and transparently!) clustered, and uses reverse indexes – which makes MarkLogic scale horizontally like a search engine, rather than vertically like most database products.

Big Data Total Cost of Ownership: As huge amounts of data become integrated and available, MarkLogic automatically moves older or less critical data to cheaper storage, including HDFS or S3 for high-performance computing balanced with cost.

World-class Security: Silo busting is critical for any business, and makes data easier to access… but also easier to steal. MarkLogic secures data at the record level with best-in-class role-based security and compartment security. In MarkLogic 9, field-level automatic redaction will be provided as well.

Flexible, Industry Standard Services: MarkLogic ships with a searchable REST API to expose or customize integrated data. Arbitrary data transforms can be applied as it is served so all clients and APIs get the integrated data in the format they need. Typically this is various JSON, XML, RDF and CSV forms.

Text Search: MarkLogic indexes all words per your database configuration. Stemming, phrases search, tf/idf relevance, and/or/not/near queries, decompounding, multiple languages, and other features are present as expected in a full-featured search product.

Geospatial Query: The more data you integrate, the more ways you’ll need to access it. MarkLogic supports more than other products out of the box, including latitude/longitude geospatial query with points, polygons, distance, and polygons with “holes.”

I know – that sounds like a lot of features and capabilities – almost too good to be true.

Did I mention we’ve been doing this for over a decade?

Interested in Learning More?

Operational Data Hubs: A series of short video tutorials, averaging 15 min each, that guides you through to create a data hub.