Large Scale Data Replication Deployment

Enabling a data lake at a global enterprise with HVR

Practice Areas

  • Data Engineering
  • Data Lakes

Business Impacts

  • Better performance of BI reporting platform
  • Increased data accuracy
  • Reduced data platform management costs

Challenges

  • Aging data platform unable to meet performance requirements
  • Difficult to manage BI reporting platform

Technologies

  • HVR
  • SAP
  • Oracle
  • Greenplum

Background

Our client, a global manufacturer in the power generation industry, faced challenges with an aging reporting environment based on multiple Oracle ODS systems and reporting straight from source databases. The legacy reporting environment placed unnecessary load on source systems, reduced system reliability and made cross-platform data identification difficult to manage. This in turn led to numerous cascading problems across the organization, including performance issues during end-of-quarter closing and underperforming reporting projects due to siloed data. They turned to Starschema for a better solution.

Challenge

The vision was to leverage a data lake fed by near real-time data as a long-term solution running on dedicated hardware. This represented a very significant upgrade from the legacy environment. The initial starting point consisted of several Oracle ERPs and a few custom applications on dedicated Oracle DBMSs.

Another key requirement was the ability to deliver multiple streams of data from a single database read process, and for our client’s developers to be able to use production data in a development environment.

Solution

The answer to our client’s challenges lay in HVR, an enterprise data replication solution built for demanding environments with continuous high-volume, real-time data integration. HVR is the only software designed specifically for the data lake era, powering daily operations in high-throughput powerhouses like Lufthansa and NL Post. Starschema has been working with this client on its company-wide HVR deployment since the project started in 2013.

HVR reads transactional data with a separate job called “capture” and makes copies of the transaction files for different targets. These files are then integrated by separate integrate jobs for each target. This architecture lets HVR scale from a dual-ingestion environment (Non-Prod and Prod, or Hadoop and non-Hadoop in parallel) to potentially hundreds of heterogeneous targets for a single data source — another key requirement.
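The capture/integrate fan-out described above can be sketched as follows. This is an illustrative model of the pattern, not HVR's actual API: one capture job reads the source change stream once and queues an independent copy of each change for every target, so each integrate job drains its own queue at its own pace.

```python
# Illustrative sketch (not HVR's implementation) of the capture/integrate
# fan-out: a single read of the source log serves any number of targets.
from queue import Queue
from threading import Thread

def capture(log_records, target_queues):
    """Single pass over the source log; fan out a copy to every target queue."""
    for record in log_records:
        for q in target_queues.values():
            q.put(record)          # each target gets its own copy
    for q in target_queues.values():
        q.put(None)                # sentinel: end of stream

def integrate(name, q, applied):
    """Per-target integrate job: consumes its own queue independently."""
    while (record := q.get()) is not None:
        applied.setdefault(name, []).append(record)

# Dual ingestion: two targets fed from one capture pass.
targets = {"prod_lake": Queue(), "dev_lake": Queue()}
applied = {}
workers = [Thread(target=integrate, args=(n, q, applied)) for n, q in targets.items()]
for w in workers:
    w.start()
capture([{"op": "INSERT", "id": 1}, {"op": "UPDATE", "id": 1}], targets)
for w in workers:
    w.join()
```

Because each target has its own queue and integrate job, a slow or offline target delays only its own stream, never the capture pass or the other targets.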

The granularity of replication can be fine-tuned in several ways. First, HVR can be configured for certain tables, entire schemas, or even entire databases. Fine-tuning is then possible on a per-column level, and data can even be filtered during the change-data capture process. The ingestion of new sources – even with the security review – can be carried out by DevOps through swift, 10-day development sprints.
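The three levels of granularity above — table enrollment, column selection, and row filtering during capture — can be illustrated with a hypothetical channel definition. This is not HVR's configuration syntax; the table names and structure are invented for the example.

```python
# Hypothetical channel spec (not HVR syntax) showing replication granularity:
# which tables are enrolled, which columns are kept, and a row-level
# filter applied while changes are being captured.
CHANNEL = {
    "orders":    {"columns": ["id", "status", "amount"],
                  "filter":  lambda row: row["status"] != "DRAFT"},
    "customers": {"columns": ["id", "name"],
                  "filter":  lambda row: True},
}

def apply_channel(table, row):
    """Return the projected row if the table is enrolled and the row passes
    its filter; otherwise None (the change is dropped during capture)."""
    spec = CHANNEL.get(table)
    if spec is None or not spec["filter"](row):
        return None
    return {col: row[col] for col in spec["columns"]}
```

A change to a table outside the channel, or a row failing its filter, never leaves the capture stage, which keeps downstream targets free of data they should not hold.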

HVR also provides internal capabilities to determine data quality via a solution called HVR Database Compare. Tables can be compared either in bulk or row-by-row. In-flight data is often a problem with real-time replication, but HVR accounts for transactions that are captured but not integrated yet.
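The in-flight problem can be made concrete with a simplified row-by-row compare. This sketch mirrors the idea, not HVR Database Compare's implementation: keys whose changes are captured but not yet integrated are excluded from the diff instead of being reported as false mismatches.

```python
# Simplified row-by-row compare that accounts for in-flight transactions.
# `source` and `target` map primary keys to row values; `in_flight` is the
# set of keys with captured-but-not-yet-integrated changes.
def compare(source, target, in_flight):
    """Return the keys that genuinely differ between source and target."""
    diffs = []
    for key in source.keys() | target.keys():
        if key in in_flight:
            continue                     # change still in the pipeline; skip
        if source.get(key) != target.get(key):
            diffs.append(key)
    return sorted(diffs)
```

Without the `in_flight` exclusion, every transaction still moving through the pipeline would surface as a spurious discrepancy on a busy system.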

HVR Key Attributes:

  • Suitable for the most complex large-scale deployments.
  • Support for source and target systems includes, but is not limited to, SAP ERP, SAP HANA, Oracle, Salesforce, IBM DB2, Hadoop, AWS S3, Snowflake, Apache Kafka and MSSQL.
  • Addresses challenges introduced by cloud migration and centralized data lakes.
  • Delivers high availability between primary and standby databases and manages geographically distributed data.
  • Able to deliver multiple data streams from a single database read.
  • Enables the use of production data in development environments.
  • Dual ingestion can scale to hundreds of heterogeneous targets.
  • Ability to stream data from cluster and pool tables.
  • Granularity of replication can be fine-tuned.
  • A pre-ingestion review process, including security audits, has been developed for out-of-the-box compliance.
  • Data quality can be determined and managed.

Outcome

In the current environment, 15 ERPs are replicated — nine are Oracle EBS, four are custom-made systems and two are SAP implementations. HVR serves as the backbone of the infrastructure, combined with Pivotal’s Greenplum MPP and Cloudera’s Hadoop implementation that together enable our client to push data from over 80 databases and more than 15,000 tables — approximately an 80 TB mirror. Data is then aggregated and used for reporting through self-service BI tools like Tableau, Cognos, data science applications using R or Python, and a range of custom analytical products. HVR also supports Apache Kafka and AWS S3/BlobStore ingestion for other data lake use cases.

Our client’s current environment sees over 3,000 tables captured from SAP sources. HVR’s SAP replication is one of the few solutions on the market delivering both packed and unpacked object data replication, and SAP’s HANA in-memory DBMS is supported both as a source and a target system. Rows from raw cluster and pool tables are generated in real time for the tables enrolled in the replication channel. Initial data loads for all supported systems can run on multiple parallel threads, with built-in support for table slicing using options like boundaries, modulo or value lists.
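The modulo variant of table slicing mentioned above can be sketched briefly. This is a hypothetical illustration, not HVR's loader: each parallel worker is given a predicate that selects a disjoint slice of rows by primary key, so the full table loads on N threads at once (boundary and value-list slicing partition on key ranges or explicit values instead).

```python
# Sketch of modulo-based table slicing for parallel initial loads
# (illustrative only; `id` is an assumed integer primary key column).
def modulo_slices(num_workers):
    """One WHERE-clause predicate per worker, partitioning rows by id % N."""
    return [f"MOD(id, {num_workers}) = {i}" for i in range(num_workers)]

def assign(rows, num_workers):
    """Apply the same partitioning in Python to show the slices are
    disjoint and together cover every row exactly once."""
    slices = [[] for _ in range(num_workers)]
    for row in rows:
        slices[row["id"] % num_workers].append(row)
    return slices
```

Because the predicates are mutually exclusive and jointly exhaustive over the key space, no coordination between the parallel load threads is needed.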

Enterprise-Scale AWS Cloud Migration

Our client, a Fortune 50 energy company, needed to migrate all their existing data and analytics platforms to Amazon Web Services and redesign the existing architecture to leverage AWS-native technologies. This initiative aimed to optimize performance and reduce long-term operating costs by taking advantage of recent advancements in AWS cloud-native solutions.

Oracle to AWS Data Platform Migration

When large organizations merge or divest, the new entity has business transformation thrust upon it. Our client, a provider of power generation solutions, divested from a larger organization to become a brand-new company. With the looming deadline of a cut-over from the existing data platform, the client engaged Starschema to validate and test the existing technology stack and determine the best way to move forward.

Starschema Antares iDL™

A fully automated, compliant-by-design intelligence data lake architecture with real-time ingestion and best-of-breed standardization and audit features.

Getting to a Single Source of Truth

Large enterprises acquiring companies often face challenges integrating their financial systems. Over the last several decades, our client has acquired dozens of businesses, each with its own ERP system and data warehouse solution.