Background
A Fortune 50 conglomerate was dissatisfied with its ability to serve internal customers’ data needs and to track the costs associated with data requests. Its existing system made it difficult to provide accurate information for the chargeback process that takes place after a company in the conglomerate accesses data, the costs of which are billed to the central corporate office.
The company needed a solution that would improve data access and bring transparency and accuracy to intra-conglomerate financial affairs through a unified audit log. Based on its experience with an earlier joint project, it selected Starschema to carry out the necessary development.
The solution had to involve a service model that would enable users to subscribe to various data sources and retrieve data from them on a fixed schedule, relieving the stress on the client’s finance data lake while continuing to accommodate ad hoc requests. In addition to promoting easy access to data, the system needed to facilitate charging users for the costs associated with data requests by providing accurate data on transactions. The client also required that the solution comprise a metadata-driven framework, so that modifications could be made by changing metadata rather than through costly and time-consuming ad hoc development. The framework also had to be based on cloud-native technology for flexibility and scalability, which meant abandoning the legacy ETL tool.
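The subscription requirement can be illustrated with a minimal sketch, assuming that each subscription is stored as a metadata record and that a scheduler picks up the records that are due. The `Subscription` fields and the `due_subscriptions` helper are hypothetical names chosen for illustration; they are not taken from the client's framework.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Subscription:
    """One metadata record: who wants which source, and how often (hypothetical schema)."""
    subscriber: str
    source: str
    interval_hours: int
    last_run: Optional[datetime] = None  # None means the feed has never run


def due_subscriptions(subscriptions: List[Subscription], now: datetime) -> List[Subscription]:
    """Return the subscriptions whose fixed schedule says they should run at `now`."""
    due = []
    for sub in subscriptions:
        never_ran = sub.last_run is None
        if never_ran or now - sub.last_run >= timedelta(hours=sub.interval_hours):
            due.append(sub)
    return due
```

Under this assumption, adding a new feed means adding a record rather than writing a new job, which is the kind of change a metadata-driven framework is meant to make cheap.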
After reviewing the options and suggestions provided by Starschema, the client decided against implementing an always-running cluster, as it could have ensured lower costs only at the expense of accuracy in measuring the parameters of individual transactions. Ultimately, they chose a solution that would be more costly but provide uncompromising accuracy to prevent disputes about chargeback amounts.
The technology stack includes Apache Spark as the analytics engine, AWS Glue as the ETL service providing custom-tailored jobs and Amazon RDS for PostgreSQL as the database engine. Amazon S3 makes it possible to measure the amount of storage used for a request, while Glue makes it easier to tag every job run with the name of the subscriber making the request and to follow the pricing of individual jobs, since each job run in Glue represents a separate instance. Because the key AWS-native technologies of the solution clearly indicate the costs associated with a request, their combination makes it easy to calculate exact chargeback values.
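The chargeback arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the client's implementation: the `JobRun` record, the `chargeback` function and both rate constants are hypothetical, and the rates are placeholders rather than quoted AWS prices.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Placeholder rates for illustration only; real pricing varies by region and over time.
DPU_HOUR_RATE_USD = 0.44
STORAGE_GB_MONTH_RATE_USD = 0.023


@dataclass(frozen=True)
class JobRun:
    """One tagged job run (hypothetical record shape)."""
    subscriber: str      # tag identifying who made the request
    dpu_seconds: float   # compute consumed by this run
    storage_gb: float    # storage attributed to this request, in GB-months


def chargeback(runs: Iterable[JobRun]) -> Dict[str, float]:
    """Sum compute and storage costs per subscriber across all job runs."""
    totals: Dict[str, float] = defaultdict(float)
    for run in runs:
        compute_cost = run.dpu_seconds / 3600 * DPU_HOUR_RATE_USD
        storage_cost = run.storage_gb * STORAGE_GB_MONTH_RATE_USD
        totals[run.subscriber] += compute_cost + storage_cost
    # Round only at the end so per-run rounding errors do not accumulate.
    return {name: round(total, 4) for name, total in totals.items()}
```

Because every run carries its own subscriber tag and its own measured usage, each amount can be traced back to individual runs, which is the property the client prioritised over the lower cost of a shared, always-running cluster.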
The DaaS framework will also serve as the basis for a series of developments. To promote cost savings in shared environment usage, when the number of subscribers reaches critical mass, an update to the framework will enable moving part of the ingestion and the architecture to an EMR cluster. Another development will be a metadata-driven “pub-sub” subscription model, which will automatically move approved requests into the metadata layer and start the feed to improve user experience and make maintenance easier. Finally, a mirror stream will complement the current data-model-based system to enable true live streaming of data, fulfil requests more quickly and grant access to data that is not a part of the data model.