
Large Scale Data Replication Deployment

Enabling a data lake at a global enterprise with HVR

Practice Area

  • Data Engineering
  • Data Lakes

Business Impact

  • Better performance of BI reporting platform
  • Increased data accuracy
  • Reduced data platform management costs

Challenge

  • Aging data platform unable to meet performance requirements
  • Difficult-to-manage BI reporting platform

Technologies

HVR, SAP, Oracle, Greenplum


Our client, a global manufacturer in the power generation industry, faced challenges with an aging reporting environment based on multiple Oracle ODS systems and reporting directly from source databases. The legacy reporting environment placed unnecessary load on source systems, reduced system reliability and made it difficult to reconcile data across platforms. This in turn led to numerous cascading problems across the organization, including performance issues during end-of-quarter closing and underperforming reporting projects due to siloed data. They turned to Starschema for a better solution.


The vision was to leverage a data lake fed by near real-time data as a long-term solution running on dedicated hardware. This represented a very significant upgrade from the legacy environment. The initial starting point consisted of several Oracle ERPs and a few custom applications on dedicated Oracle DBMSs.

Another key requirement was the ability to deliver multiple streams of data from a single database read process, and for our client’s developers to be able to use production data in a development environment.


The answer to our client’s challenges lay in HVR, an enterprise data replication solution built for demanding environments with continuous high-volume, real-time data integration. HVR is the only software designed specifically for the data lake era, powering daily operations in high-throughput powerhouses like Lufthansa and NL Post. Starschema has been working with this client on its company-wide HVR deployment since the project started in 2013.

HVR reads transactional data with a separate job called “capture” and makes copies of the transaction files for different targets. These files are then applied by separate integrate jobs, one per target. This enables HVR to scale from a dual-ingestion environment (Non-Prod and Prod, or Hadoop and non-Hadoop in parallel) to potentially hundreds of heterogeneous targets for a single data source — another key requirement.
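The single-capture, multi-integrate pattern described above can be modeled in a short sketch. This is an illustrative model only, not HVR's actual API: the queue-based fan-out and the target names are invented to show why adding a target never adds read load on the source.

```python
# Illustrative model (not HVR's API) of one "capture" job feeding
# independent per-target "integrate" jobs. The source transaction
# log is read exactly once, regardless of the number of targets.
from queue import Queue

def capture(transactions, target_queues):
    """Read each transaction once and fan it out to every target queue."""
    for txn in transactions:
        for q in target_queues.values():
            q.put(txn)

def integrate(name, q, sink):
    """Per-target job: drain this target's queue into its sink."""
    while not q.empty():
        sink.append((name, q.get()))

# Dual ingestion: the same capture feeds a prod lake and a dev copy.
targets = {"prod_lake": Queue(), "dev_copy": Queue()}
applied = []
capture([{"op": "INSERT", "id": 1}, {"op": "UPDATE", "id": 1}], targets)
for name, q in targets.items():
    integrate(name, q, applied)
```

Adding a third target here is just another queue entry — the capture loop, and hence the load on the source database, is unchanged.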

The granularity of replication can be fine-tuned in several ways. First, HVR can be configured for certain tables, entire schemas, or even entire databases. Fine-tuning is then possible on a per-column level, and data can even be filtered during the change-data capture process. The ingestion of new sources – even with the security review – can be carried out by DevOps through swift, 10-day development sprints.
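The per-column selection and in-flight filtering described above can be sketched as follows. The table, column names and predicate are hypothetical, invented purely for illustration — HVR itself configures this through channel actions, not Python code.

```python
# Hedged sketch of column-level fine-tuning and row filtering applied
# during change-data capture. Column names and the predicate are
# invented for illustration.
def filter_change(row, keep_columns, predicate):
    """Drop rows failing the predicate; project surviving rows to a column subset."""
    if not predicate(row):
        return None  # row filtered out in-flight, never reaches the target
    return {c: row[c] for c in keep_columns if c in row}

change = {"order_id": 7, "region": "EMEA", "ssn": "redact-me", "amount": 120}
out = filter_change(
    change,
    keep_columns=["order_id", "region", "amount"],  # sensitive column excluded
    predicate=lambda r: r["region"] == "EMEA",      # replicate one region only
)
```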

HVR also provides internal capabilities to determine data quality via a solution called HVR Database Compare. Tables can be compared either in bulk or row-by-row. In-flight data is often a problem with real-time replication, but HVR accounts for transactions that are captured but not integrated yet.
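The bulk versus row-by-row trade-off can be sketched in a few lines. This is illustrative logic, not HVR Database Compare itself: a cheap table-level checksum settles the common case, and the expensive row-by-row diff runs only when the checksums disagree.

```python
# Illustrative compare logic (not HVR Database Compare): bulk checksum
# first, row-by-row diff only on mismatch.
import hashlib

def table_checksum(rows):
    """Order-independent checksum over a table's rows, keyed by id."""
    h = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r["id"]):
        h.update(repr(sorted(row.items())).encode())
    return h.hexdigest()

def row_diff(source, target):
    """Return the ids of rows that differ between source and target."""
    src = {r["id"]: r for r in source}
    tgt = {r["id"]: r for r in target}
    return [k for k in src if src[k] != tgt.get(k)]

source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
target = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}]
mismatches = ([] if table_checksum(source) == table_checksum(target)
              else row_diff(source, target))
```

A real implementation would additionally exclude rows belonging to transactions that are captured but not yet integrated — the in-flight case the paragraph above mentions — before flagging mismatches.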

HVR Key Attributes:

  • Suitable for the most complex large-scale deployments.
  • Support for source and target systems includes, but is not limited to, SAP ERP, SAP HANA, Oracle, Salesforce, IBM DB2, Hadoop, AWS S3, Snowflake, Apache Kafka and MSSQL.
  • Addresses challenges introduced by cloud migration and centralized data lakes.
  • Delivers high availability between primary and standby databases and manages geographically distributed data.
  • Able to deliver multiple data streams from a single database read.
  • Enables the use of production data in development environments.
  • Dual ingestion can scale to hundreds of heterogeneous targets.
  • Ability to stream data from clustered and pool tables.
  • Granularity of replication can be fine-tuned.
  • A pre-ingestion review process, including security audits, has been developed for out-of-the-box compliance.
  • Data quality can be determined and managed.


In the current environment, 15 ERPs are replicated — nine are Oracle EBS, four are custom-made systems and two are SAP implementations. HVR serves as the backbone of the infrastructure, combined with Pivotal’s Greenplum MPP and Cloudera’s Hadoop implementation, which together enable our client to push data from over 80 databases and more than 15,000 tables — amounting to an approximately 80 TB mirror. Data is then aggregated and used for reporting through self-service BI tools like Tableau and Cognos, data science applications using R or Python, and a range of custom analytical products. The replication tool also supports Apache Kafka and AWS S3/BlobStore ingestion for other data lake use cases.

Our client’s current environment sees over 3,000 tables captured from SAP sources. HVR’s SAP replication is one of the few solutions on the market delivering both packed and unpacked object data replication, with SAP’s HANA in-memory DBMS supported as both a source and a target system. Rows from raw cluster and pool tables are generated in real time for the tables enrolled in the replication channel. Initial data loads for all supported systems can run on multiple parallel threads, with built-in support for table slicing using options like boundaries, modulo or value lists.
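Modulo-based table slicing, one of the slicing options mentioned above, can be sketched briefly. This is a hedged illustration, not HVR's implementation: each worker reads only the rows whose key falls in its slice, so N workers can load one large table in parallel without overlap.

```python
# Illustrative modulo slicing for a parallel initial load: slice s of n
# takes every row whose primary key satisfies pk % n == s. Together the
# slices cover the table exactly once, with no coordination needed.
def slice_rows(rows, key, n_slices, slice_id):
    """Return the subset of rows belonging to one modulo slice."""
    return [r for r in rows if r[key] % n_slices == slice_id]

rows = [{"pk": i} for i in range(10)]          # a toy 10-row table
slices = [slice_rows(rows, "pk", 3, s) for s in range(3)]
```

Boundary-based slicing would instead partition on key ranges (e.g. pk < 1000, 1000 ≤ pk < 2000, …), which suits monotonically increasing keys; modulo suits uniformly distributed ones.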

