Our client had hundreds of Tableau reports built on various data lakes and data marts. The aim of the project was to switch the source of those reports from traditional relational databases to a Hadoop cluster that could keep response times low despite the exponential growth in data.
The Hadoop platform can also serve as a layer for Data Scientists to run their Spark and Impala jobs.
Both Hadoop clusters were spun up with data ingestion and the framework setup needed to implement data mining tasks.
We used tailored Talend jobs to run Sqoop processes on top of the cluster, ingesting approximately 1 terabyte of data into Hadoop daily – which required a custom method for executing incremental loads.
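A minimal sketch of what such an incremental load might look like at the Sqoop level, assembled programmatically the way an orchestration job would. The table name, checkpoint column, connection string, and directory below are hypothetical; the actual Talend jobs and their custom incremental strategy are not reproduced here.

```python
# Sketch: assemble an incremental Sqoop import that only pulls rows
# changed since the previous run's checkpoint. All identifiers below
# (table, column, JDBC URL, target dir) are illustrative assumptions.

def build_sqoop_import(jdbc_url, table, check_column, last_value, target_dir):
    """Build the argument list for an incremental 'lastmodified' import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--incremental", "lastmodified",  # pull inserts and updates
        "--check-column", check_column,   # column that tracks row changes
        "--last-value", last_value,       # checkpoint from the previous run
        "--target-dir", target_dir,
    ]

cmd = build_sqoop_import(
    "jdbc:oracle:thin:@//dwh-host:1521/DWH",  # hypothetical source DB
    "SALES_FACT",
    "UPDATED_AT",
    "2016-05-01 00:00:00",
    "/data/raw/sales_fact",
)
print(" ".join(cmd))
```

In practice the orchestrator persists the new `--last-value` after each successful run, so the next nightly job resumes exactly where the previous one stopped.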
An additional request was to fine-tune the performance of the cluster in order to improve response times for the client's reporting tools. The Hadoop operational toolset exposes hundreds of parameters that can be adjusted to maximize performance. The status of the cluster was tracked with dashboards so that the effect of each configuration change could easily be evaluated. We were also able to experiment with different MapReduce engines alongside the latest Hadoop components, e.g. Hive and Spark 2.0, while meeting the strict security requirements through Ranger and Kerberos.
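One way to make that parameter-by-parameter evaluation repeatable is to render each candidate configuration into SET statements that are prepended to a fixed benchmark query, so a dashboard can attribute any runtime change to a single setting. The sketch below assumes this approach; the parameter values are illustrative, not the ones used on the client's cluster.

```python
# Sketch: turn one candidate tuning configuration into Hive SET
# statements for a benchmark run. Values are illustrative assumptions.

def render_settings(params):
    """Render a dict of Hive parameters as SET statements, one per line."""
    return "\n".join(f"SET {key}={value};" for key, value in params.items())

candidate = {
    "hive.execution.engine": "tez",               # try Tez instead of classic MR
    "hive.vectorized.execution.enabled": "true",  # process rows in batches
    "hive.exec.parallel": "true",                 # run independent stages together
}

prologue = render_settings(candidate)
print(prologue)
```

Running the same benchmark query once per candidate dict, with everything else held constant, gives a clean before/after signal for each of the hundreds of available knobs.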