Background
Our client, a global manufacturer in the power generation industry, faced challenges with an aging reporting environment based on multiple Oracle ODS systems and reporting straight from source databases. The legacy reporting environment was causing unnecessary load, reducing system reliability and making cross-platform identification difficult to manage. This in turn led to numerous cascading problems across the organization, including performance issues during end-of-quarter closing and underperforming reporting projects due to siloed data. They turned to Starschema for a better solution.
Challenge
The vision was to leverage a data lake fed by near real-time data as a long-term solution running on dedicated hardware. This represented a significant upgrade from the legacy environment. The starting point consisted of several Oracle ERPs and a few custom applications on dedicated Oracle DBMSs.
Another key requirement was the ability to deliver multiple streams of data from a single database read process, and for our client’s developers to be able to use production data in a development environment.
Solution
The answer to our client’s challenges lay in HVR, an enterprise data replication solution built for demanding environments with continuous high-volume, real-time data integration. HVR is the only software designed specifically for the data lake era, powering daily operations in high-throughput powerhouses like Lufthansa and NL Post. Starschema has been working with this client on its company-wide HVR deployment since the project started in 2013.
HVR reads transactional data with a separate job called “capture” and makes copies of the transaction files for different targets. These files are then applied by separate “integrate” jobs, one per target. This allows HVR to scale from a dual-ingestion environment (Non-Prod and Prod, or Hadoop and non-Hadoop in parallel) to potentially hundreds of heterogeneous targets for a single data source, which was another key requirement.
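To make the fan-out concrete, here is a minimal Python sketch of the capture/integrate pattern. It assumes nothing about HVR’s internals; the names (capture_changes, IntegrateTarget) are illustrative only and show how a single read of the source can serve any number of targets.

```python
# Conceptual sketch of the capture/integrate split: one read of the source
# transaction log feeds any number of targets. Names (capture_changes,
# IntegrateTarget) are illustrative only and are not HVR's actual API.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ChangeRecord:
    table: str
    operation: str          # "insert", "update" or "delete"
    row: dict


def capture_changes(transaction_log: Iterable[dict]) -> List[ChangeRecord]:
    """Single 'capture' pass over the source log; runs once per source."""
    return [ChangeRecord(t["table"], t["op"], t["row"]) for t in transaction_log]


class IntegrateTarget:
    """One 'integrate' job per target (e.g. Prod lake, Non-Prod, Kafka)."""
    def __init__(self, name: str):
        self.name = name
        self.applied: List[ChangeRecord] = []

    def integrate(self, changes: List[ChangeRecord]) -> None:
        # Each target consumes its own copy of the captured transaction files,
        # so adding a target never requires a second read of the source.
        self.applied.extend(changes)


log = [{"table": "GL_BALANCES", "op": "insert", "row": {"id": 1}}]
changes = capture_changes(log)                 # read the source once
targets = [IntegrateTarget("prod_lake"), IntegrateTarget("nonprod_lake")]
for target in targets:                         # fan out to N targets
    target.integrate(changes)
```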
The granularity of replication can be fine-tuned in several ways. First, HVR can be configured for certain tables, entire schemas, or even entire databases. Fine-tuning is then possible on a per-column level, and data can even be filtered during the change-data capture process. The ingestion of new sources – even with the security review – can be carried out by DevOps through swift, 10-day development sprints.
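A rough sketch of what column-level fine-tuning and capture-time filtering amount to in practice is shown below. The configuration shape is hypothetical; HVR expresses the same idea through its own channel configuration rather than Python.

```python
# Illustrative sketch of column-level fine-tuning and row filtering during
# change-data capture. The configuration shape is hypothetical; HVR defines
# this through channel actions, not Python.
CHANNEL_CONFIG = {
    "GL_BALANCES": {
        "columns": ["ledger_id", "period_name", "amount"],   # column subset
        "row_filter": lambda row: row["ledger_id"] == 101,   # capture-time filter
    },
}


def apply_channel_config(table: str, row: dict):
    """Drop filtered rows and project only the configured columns."""
    cfg = CHANNEL_CONFIG.get(table)
    if cfg is None:
        return row                       # table not enrolled: pass through
    if not cfg["row_filter"](row):
        return None                      # filtered out during capture
    return {col: row[col] for col in cfg["columns"]}


print(apply_channel_config("GL_BALANCES",
                           {"ledger_id": 101, "period_name": "2020-Q1",
                            "amount": 42.0, "internal_note": "dropped"}))
```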
HVR also provides internal capabilities to determine data quality via a solution called HVR Database Compare. Tables can be compared either in bulk or row by row. In-flight data is often a problem with real-time replication, but HVR accounts for transactions that have been captured but not yet integrated.
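The following sketch mimics the two compare modes described above: a cheap bulk checksum first, then a row-by-row diff that ignores keys still in flight. It is an illustration of the idea only, not HVR Database Compare itself.

```python
# Minimal sketch of bulk vs. row-by-row comparison that tolerates keys still
# "in flight" (captured but not yet integrated). Not HVR's implementation.
import hashlib


def bulk_checksum(rows: dict) -> str:
    """Cheap whole-table fingerprint for the bulk compare mode."""
    payload = repr(sorted(rows.items())).encode()
    return hashlib.sha256(payload).hexdigest()


def row_by_row_diff(source: dict, target: dict, in_flight: set) -> dict:
    """Report keys that truly differ, ignoring transactions not yet applied."""
    diffs = {}
    for key, value in source.items():
        if key in in_flight:
            continue                     # pending integrate job, not an error
        if target.get(key) != value:
            diffs[key] = (value, target.get(key))
    return diffs


source = {1: "A", 2: "B", 3: "C"}
target = {1: "A", 2: "B"}                # row 3 captured but not yet integrated
if bulk_checksum(source) != bulk_checksum(target):
    print(row_by_row_diff(source, target, in_flight={3}))   # prints {}
```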
HVR Key Attributes:
- Suitable for the most complex large-scale deployments.
- Support for source and target systems includes, but is not limited to, SAP ERP, SAP HANA, Oracle, Salesforce, IBM DB2, Hadoop, AWS S3, Snowflake, Apache Kafka and MSSQL.
- Addresses challenges introduced by cloud migration and centralized data lakes.
- Delivers high availability between primary and standby databases and manages geographically distributed data.
- Able to deliver multiple data streams from a single database read.
- Enables the use of production data in development environments.
- Dual ingestion can scale to hundreds of heterogeneous targets.
- Ability to stream data from clustered and pool tables.
- Granularity of replication can be fine-tuned.
- A pre-ingestion review process, including security audits, has been developed for out-of-the-box compliance.
- Data quality can be determined and managed.
Outcome
In the current environment, 15 ERPs are replicated: nine are Oracle EBS, four are custom-made systems and two are SAP implementations. HVR serves as the backbone of the infrastructure, combined with Pivotal’s Greenplum MPP and Cloudera’s Hadoop implementation, which together enable our client to push data from over 80 databases and more than 15,000 tables, amounting to an approximately 80 TB mirror. Data is then aggregated and used for reporting through self-service BI tools like Tableau and Cognos, data science applications using R or Python, and a range of custom analytical products. The replication tool also supports Apache Kafka and AWS S3/BlobStore ingestion for other data lake use cases.
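Because Greenplum is PostgreSQL-compatible, self-service consumers can query the replicated mirror directly from Python, as in the hypothetical sketch below; the host, credentials and table name are placeholders, not the client’s actual schema.

```python
# Hedged sketch: Greenplum speaks the PostgreSQL protocol, so the replicated
# mirror can be queried with a standard PostgreSQL driver. Host, credentials
# and the table name are placeholders, not the client's actual schema.
import psycopg2

conn = psycopg2.connect(host="greenplum.example.internal", dbname="mirror",
                        user="analyst", password="secret")
with conn.cursor() as cur:
    cur.execute(
        "SELECT period_name, SUM(amount) AS total "
        "FROM gl_balances GROUP BY period_name ORDER BY period_name"
    )
    for period, total in cur.fetchall():
        print(period, total)
conn.close()
```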
Our client’s current environment sees over 3,000 tables captured from SAP sources. HVR’s SAP replication is one of the few solutions on the market delivering both packed and unpacked object data replication, with SAP’s HANA in-memory DBMS supported both as a source and a target system. Rows from SAP’s raw cluster and pool tables are generated in real time for the tables enrolled in the replication channel. Initial data loads for all supported systems can be made on multiple parallel threads, with built-in support for table slicing using options like boundaries, modulo or value lists.
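As an illustration of modulo-based table slicing for a parallel initial load, the sketch below splits a load across worker threads by a predicate on the primary key. Table and column names are placeholders; HVR generates the equivalent slicing predicates itself.

```python
# Illustrative sketch of modulo-based table slicing for a parallel initial
# load: each worker pulls a disjoint slice of the key space. Table and key
# names are placeholders, not the client's schema.
from concurrent.futures import ThreadPoolExecutor

NUM_SLICES = 4


def load_slice(slice_id: int) -> str:
    # Each worker would run something equivalent to:
    #   SELECT * FROM source_table WHERE MOD(primary_key, 4) = slice_id
    predicate = f"MOD(primary_key, {NUM_SLICES}) = {slice_id}"
    return f"loaded slice {slice_id} with predicate: {predicate}"


with ThreadPoolExecutor(max_workers=NUM_SLICES) as pool:
    for result in pool.map(load_slice, range(NUM_SLICES)):
        print(result)
```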