Large Scale Data Replication Deployment

Enabling a data lake at a global enterprise with HVR

Practice Area

  • Data Engineering

Business Impact

  • Better performance of BI reporting platform
  • Increased data accuracy
  • Reduced data platform management costs

Challenges

  • Aging data platform unable to meet performance requirements
  • Difficult to manage BI reporting platform

Technologies

HVR, SAP, Oracle, Greenplum

Background

Our client, a global manufacturer in the power generation industry, faced challenges with an aging reporting environment based on multiple Oracle ODS systems and reporting run directly against source databases. The legacy reporting environment caused unnecessary load, reduced system reliability and made cross-platform data identification difficult to manage. This in turn led to numerous cascading problems across the organization, including performance issues during end-of-quarter closing and underperforming reporting projects due to siloed data. They turned to Starschema for a better solution.

Challenge

The vision was to leverage a data lake fed by near real-time data as a long-term solution running on dedicated hardware, a significant upgrade from the legacy environment. The starting point consisted of several Oracle ERPs and a few custom applications on dedicated Oracle DBMSs.

Another key requirement was the ability to deliver multiple streams of data from a single database read process, and for our client’s developers to be able to use production data in a development environment.

Solution

The answer to our client’s challenges lay in HVR, an enterprise data replication solution built for demanding environments with continuous high-volume, real-time data integration. HVR is the only software designed specifically for the data lake era, powering daily operations in high-throughput powerhouses like Lufthansa and NL Post. Starschema has been working with this client on its company-wide HVR deployment since the project started in 2013.

HVR reads transactional data with a separate job called “capture” and makes copies of the transaction files for different targets. These files are then applied by a separate “integrate” job for each target. This architecture lets HVR scale from a dual-ingestion environment (Prod and Non-Prod, or Hadoop and non-Hadoop in parallel) to potentially hundreds of heterogeneous targets for a single data source, another key requirement.
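To make the pattern concrete, here is a minimal Python sketch of the fan-out described above: one capture worker reads the source change stream once, and each target consumes its own copy through an independent integrate worker. This is an illustration of the concept only, not HVR’s actual implementation; the target names and data structures are hypothetical.

    # Illustrative sketch only -- not HVR's implementation, just the pattern
    # described above: one capture worker reads the source change stream
    # once, and every target gets its own copy of the changes, consumed by
    # an independent integrate worker. All names are hypothetical.
    import queue
    import threading

    TARGETS = ["greenplum", "hadoop", "dev"]  # hypothetical target list

    # One queue per target stands in for HVR's per-target transaction files.
    target_queues = {t: queue.Queue() for t in TARGETS}

    def capture(source_log):
        """Read the source log once and fan each change out to every target."""
        for change in source_log:
            for q in target_queues.values():
                q.put(change)   # each target receives its own copy
        for q in target_queues.values():
            q.put(None)         # sentinel: end of stream

    def integrate(name, q):
        """Apply changes to one target, independently of the other targets."""
        while (change := q.get()) is not None:
            print(f"[{name}] apply {change}")

    source_log = [("INSERT", "orders", 1), ("UPDATE", "orders", 1)]
    workers = [threading.Thread(target=integrate, args=(t, q))
               for t, q in target_queues.items()]
    for w in workers:
        w.start()
    capture(source_log)
    for w in workers:
        w.join()

The point of the design is that a slow or offline target only delays its own queue; the source database is still read exactly once.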

The granularity of replication can be fine-tuned in several ways. First, HVR can be configured to replicate certain tables, entire schemas or even entire databases. Fine-tuning is then possible on a per-column level, and data can even be filtered during the change-data capture process. The ingestion of new sources, even with the security review, can be carried out by DevOps in swift, 10-day development sprints.
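As a rough sketch of those granularity levels, the hypothetical channel definition below selects tables, trims columns and filters rows during capture. It is written in Python for illustration and is not HVR’s configuration syntax; every name in it is invented.

    # Hypothetical channel definition -- illustrative only, not HVR syntax.
    # It models the three levels of granularity from the text: which tables
    # or schemas to replicate, which columns to keep, and a row-level filter
    # applied during change-data capture.
    CHANNEL = {
        "tables": ["FIN.GL_BALANCES"],                 # or whole schemas/databases
        "columns": {"FIN.GL_BALANCES": ["LEDGER_ID", "PERIOD", "AMOUNT"]},
        "row_filter": lambda row: row.get("LEDGER_ID") == 1,  # drop rows in CDC
    }

    def apply_channel(table, row):
        """Return the (possibly trimmed) row to replicate, or None to skip it."""
        if table not in CHANNEL["tables"] or not CHANNEL["row_filter"](row):
            return None
        keep = CHANNEL["columns"].get(table)
        return {k: v for k, v in row.items() if keep is None or k in keep}

    print(apply_channel("FIN.GL_BALANCES",
                        {"LEDGER_ID": 1, "PERIOD": "2024-Q1",
                         "AMOUNT": 42.0, "AUDIT_BLOB": "..."}))
    # -> {'LEDGER_ID': 1, 'PERIOD': '2024-Q1', 'AMOUNT': 42.0}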

HVR also provides built-in capabilities to verify data quality via a solution called HVR Database Compare. Tables can be compared either in bulk or row by row. In-flight data is often a problem with real-time replication, but HVR accounts for transactions that are captured but not yet integrated.
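The contrast between the two modes can be sketched in a few lines of Python. This is a toy, not the HVR Database Compare implementation: bulk mode answers only “do the tables match?”, while row-by-row mode reports which keys differ. A real compare must also discount in-flight transactions; this static-snapshot version does not.

    import hashlib

    def row_digest(row: dict) -> str:
        """Order-independent fingerprint of one row."""
        return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

    def bulk_compare(source_rows, target_rows) -> bool:
        """Cheap table-level check: compare sorted digests of all rows."""
        digests = lambda rows: sorted(row_digest(r) for r in rows)
        return digests(source_rows) == digests(target_rows)

    def row_by_row_compare(source_rows, target_rows, key="id"):
        """Slower check that reports exactly which keys differ."""
        src = {r[key]: row_digest(r) for r in source_rows}
        tgt = {r[key]: row_digest(r) for r in target_rows}
        return [k for k in src.keys() | tgt.keys() if src.get(k) != tgt.get(k)]

    src = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
    tgt = [{"id": 1, "qty": 5}, {"id": 2, "qty": 9}]
    print(bulk_compare(src, tgt))        # False -- something differs
    print(row_by_row_compare(src, tgt))  # [2]   -- and this is the key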

HVR Key Attributes:

  • Suitable for the most complex large-scale deployments.
  • Support for source and target systems includes, but is not limited to, SAP ERP, SAP HANA, Oracle, Salesforce, IBM DB2, Hadoop, AWS S3, Snowflake, Apache Kafka and MSSQL.
  • Addresses challenges introduced by cloud migration and centralized data lakes.
  • Delivers high availability between primary and standby databases and manages geographically distributed data.
  • Able to deliver multiple data streams from a single database read.
  • Enables the use of production data in development environments.
  • Dual ingestion can scale to hundreds of heterogeneous targets.
  • Ability to stream data from cluster and pool tables.
  • Granularity of replication can be fine-tuned.
  • A pre-ingestion review process, including security audits, has been developed for out-of-the-box compliance.
  • Data quality can be determined and managed.

Results

In the current environment, 15 ERPs are replicated: nine are Oracle EBS, four are custom-made systems and two are SAP implementations. HVR serves as the backbone of the infrastructure, combined with Pivotal’s Greenplum MPP and Cloudera’s Hadoop implementation that together enable our client to push data from over 80 databases and more than 15,000 tables, amounting to an approximately 80 TB mirror. Data is then aggregated and used for reporting through self-service BI tools like Tableau, Cognos, data science applications using R or Python, and a range of custom analytical products. The replication tool also supports Apache Kafka and AWS S3/BlobStore ingestion for other data lake use cases.

Our client’s current environment sees over 3,000 tables captured from SAP sources. HVR’s SAP replication is one of the few solutions on the market delivering both packed and unpacked object data replication, with SAP’s HANA in-memory DBMS supported both as a source and a target system. Rows from raw cluster and pool tables are generated in real time for the tables enrolled in the replication channel. Initial data loads for all supported systems can run on multiple parallel threads, with built-in support for table slicing using options like boundaries, modulo or value lists.
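To illustrate the slicing options named above, the sketch below generates WHERE-clause predicates for boundary, modulo and value-list slicing, one predicate per parallel load thread. The SQL fragments are generic examples, not the predicates HVR actually generates, and the column names are hypothetical.

    # Hypothetical predicate generators for parallel initial loads -- a
    # sketch of the slicing idea, not HVR's generated SQL.
    def boundary_slices(column, bounds):
        """Range slices, e.g. bounds=[1000, 2000] -> three slices."""
        edges = [None, *bounds, None]
        preds = []
        for lo, hi in zip(edges, edges[1:]):
            parts = ([f"{column} >= {lo}"] if lo is not None else []) + \
                    ([f"{column} < {hi}"] if hi is not None else [])
            preds.append(" AND ".join(parts))
        return preds

    def modulo_slices(column, n):
        """n slices by remainder, one per parallel load thread."""
        return [f"MOD({column}, {n}) = {i}" for i in range(n)]

    def value_list_slices(column, groups):
        """Explicit value lists, e.g. one slice per company code."""
        return [f"{column} IN ({', '.join(map(repr, g))})" for g in groups]

    print(boundary_slices("ORDER_ID", [1000, 2000]))
    # ['ORDER_ID < 1000', 'ORDER_ID >= 1000 AND ORDER_ID < 2000',
    #  'ORDER_ID >= 2000']
    print(modulo_slices("ORDER_ID", 4))
    print(value_list_slices("BUKRS", [["1000", "2000"], ["3000"]]))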
