Getting to a Single Source of Truth | Starschema
Getting to a Single Source of Truth

Structured, SOX-compliant, multi-layer data lake

Practice Area

  • Data Engineering
  • Data Lakes

Business Impact

  • Dramatically reduced operational costs
  • Timely, accurate data for critical finance department functions


  • Compliance
  • Auditing capability
  • Large number of source systems
  • Scalability


HVR, Talend, Hadoop, Oracle


Large enterprises that acquire companies often struggle to integrate their financial systems. Over the last several decades, our client, a global enterprise, has acquired dozens of businesses, each with its own ERP system and data warehouse solution. Operating dozens of separate data warehouses multiplied the client's license, equipment and personnel costs.

To improve performance and reduce costs, our client launched an ambitious project to consolidate these disparate data warehouses into a single data lake for its financial data. This presented many technical, operational and security challenges.

Our client engaged Starschema to design and implement a high-performance, secure, SOX-compliant data lake solution.


Every day, our client's companies record tens of thousands of financial transactions. This poses a unique set of challenges for a data lake:

  • Highly granular data, usually transaction-level, needs to be ingested in near real-time (latency under one hour)
  • Over 100 source systems belonging to more than thirty different types
  • The solution needs to comply with Sarbanes-Oxley (SOX) requirements
  • Zero tolerance for data inconsistencies
  • 200TB of enterprise data comprising more than 25,000 data domains (tables) of over five hundred distinct types
  • Data consumption needs to run simultaneously with continuous ingestion
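
The requirement of zero tolerance for data inconsistencies is usually enforced by reconciling each ingested table against its source. The following is a minimal sketch of one common approach, comparing row counts and an order-independent checksum. It is an illustration only, not the client's actual audit logic; the function names `table_fingerprint` and `is_consistent` and the row layout are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple

Row = Tuple[object, ...]  # hypothetical row shape: a tuple of column values


def table_fingerprint(rows: Iterable[Row]) -> Tuple[int, int]:
    """Return (row_count, checksum) for a table, independent of row order."""
    count = 0
    checksum = 0
    for row in rows:
        # Hash a canonical text form of the row; repr() keeps 1 and "1" distinct.
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        # Adding per-row digests modulo 2**64 ignores row order but still
        # counts duplicates (with XOR, two identical rows would cancel out).
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, checksum


def is_consistent(source_rows: Iterable[Row], target_rows: Iterable[Row]) -> bool:
    """True when source and target hold the same multiset of rows."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)
```

A check like this can run per table after each ingestion cycle. A mismatch only signals that a discrepancy exists; locating the offending rows needs a finer-grained comparison, for example per key range.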


Starschema implemented the architecture using its Starschema Antares iDL™ design. Raw data is first mirrored by ingestion into a Massively Parallel Processing (MPP) relational database using HVR and Talend. The data is then replicated to in-memory and Hadoop (file system) based consumption layers for downstream use, including aggregation, data stores, and data science applications.

Data consumption then takes place over a dynamic lambda architecture, providing streaming and batch processing layers. To facilitate the operation of this multi-layer data lake, a standard data definition structure (the standard model) was devised for identical domain types of raw data.
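
In a lambda architecture, a batch layer periodically recomputes views over the full history, a streaming (speed) layer covers records that arrived since the last batch run, and queries merge the two. The sketch below illustrates that general pattern with account totals. It is a simplified assumption of how such layers interact, not the Antares iDL™ implementation; `build_batch_view`, `SpeedLayer`, `query_balance` and the transaction shape are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Transaction = Tuple[str, float]  # hypothetical record: (account, amount)


def build_batch_view(history: Iterable[Transaction]) -> Dict[str, float]:
    """Batch layer: recompute account totals from the full history."""
    totals: Dict[str, float] = defaultdict(float)
    for account, amount in history:
        totals[account] += amount
    return dict(totals)


class SpeedLayer:
    """Streaming layer: incremental totals for records not yet in a batch view."""

    def __init__(self) -> None:
        self._totals: Dict[str, float] = defaultdict(float)

    def ingest(self, tx: Transaction) -> None:
        account, amount = tx
        self._totals[account] += amount

    def view(self) -> Dict[str, float]:
        return dict(self._totals)

    def reset(self) -> None:
        # Called once a new batch view has absorbed the streamed records.
        self._totals.clear()


def query_balance(account: str,
                  batch_view: Dict[str, float],
                  speed_view: Dict[str, float]) -> float:
    """Serving layer: merge the batch result with the real-time delta."""
    return batch_view.get(account, 0.0) + speed_view.get(account, 0.0)
```

The design choice is a trade-off: the batch view is complete but stale, the speed view is fresh but partial, and merging them at query time lets consumers read consistent totals while ingestion continues.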

In addition, the data lake automatically generates a continuously updated metadata knowledge base that stores discovered constraints and relationships within the data, giving the data lake a description of its own underlying structure. This in turn drives the Generic ETL Framework (GEF) and Data Lake Audit Framework (DAF), which constantly maintain and audit the data layers. Operations, change management, and development are supported by ITIL- and SOX-compliant DevOps CI/CD pipeline applications, which maintain compliance through a process-forcing design.
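
One way to picture a metadata-driven ETL framework is a single generic transform whose behavior comes entirely from metadata rather than per-source code. The sketch below assumes the knowledge base can be reduced to a column mapping per source system type. That is an assumption made for illustration; the real GEF is not described in this detail, and `METADATA`, `STANDARD_COLUMNS`, `to_standard_model` and every column name are hypothetical.

```python
from typing import Any, Dict

# Hypothetical metadata knowledge base: for each source system type,
# a mapping from source column name to standard-model column name.
METADATA: Dict[str, Dict[str, str]] = {
    "erp_a": {"doc_no": "document_id", "amt": "amount", "curr": "currency"},
    "erp_b": {"journal": "document_id", "value": "amount", "ccy": "currency"},
}

# Hypothetical columns that every standard-model record must carry.
STANDARD_COLUMNS = ("document_id", "amount", "currency")


def to_standard_model(source_type: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw record to the standard model using metadata only."""
    mapping = METADATA[source_type]  # raises KeyError for an unknown source type
    standard = {mapping[col]: value
                for col, value in record.items()
                if col in mapping}
    missing = [col for col in STANDARD_COLUMNS if col not in standard]
    if missing:
        # Reject incomplete records instead of loading them silently.
        raise ValueError(f"{source_type} record is missing {missing}")
    return standard
```

With this shape, onboarding a new source system type means adding a metadata entry rather than writing a new transform, which is the property that makes one framework manageable across more than thirty source types.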


Every day, the SOX-certified Finance Data Lake ingests approximately 200TB of data through 6,000 parallel ETL processes from a diverse range of source systems (Oracle, SAP, PeopleSoft, Hyperion, enterprise-developed systems and others) into a single Oracle EBS type Standard Model.

Data is ingested in near real-time, allowing the enterprise to perform crucial finance functions, such as closing and reporting, account reconciliation and centralized tax calculation, based on accurate, consistent and up-to-date data.


