THIS POST IS THE THIRD IN OUR “BECOMING SUSTAINABLY DATA DRIVEN” SERIES. READ THE OTHER ENTRIES HERE:
- Step One: Acquiring the Data
- Step Two: Build Data Structures for Efficient and Flexible Consumption
- Step Three (YOU ARE HERE): Reporting, Dashboarding and Visualization in a Big Data World
- Step Four: Robust and Agile Advanced Analytics that Make a Difference
In the “olden days” of BI, reports were generated from structured data, most of which came from relational databases whose schemas were predefined to serve a specific purpose. The advent of Big Data has increased not only the volume and variety of data, but its sources as well: increasingly, BI is asked to consume and deliver data from schema-on-read sources. Data now comes from sensors, log files, emails, website clicks, and a host of other unstructured sources. For example, a Hadoop cluster with thousands of nodes can extract data from, and run text analytics on, a stream of PDF documents — a treasure trove of information that would otherwise be inaccessible to a traditional, structured database. This relative lack of standardized structure presents a major challenge for BI developers, who must now meet their organization’s demand for meaningful, contextualized reporting not only from traditional Data Warehouses, but also from a Data Lake or an analytics platform.

When developing BI from Big Data, your first decision should be which reporting tool you will use: a traditional BI tool, a visualization tool, or one of the new Big Data tools, taking into account your organization’s existing tools and the skills of your employees. Most traditional BI tools now support a Hadoop connection, either with a native connector or a standard ODBC/JDBC connection; in some cases, you might need to connect to the data source through a REST API or a third-party connector. Newer visualization tools also support querying Hadoop via ODBC/JDBC. And, of course, Big Data tools such as Hive and Apache Spark are built to access Hadoop data directly through SQL.
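To make that concrete, here is a minimal sketch of querying Hadoop-resident data through Spark SQL from Python. It assumes a PySpark installation and a Hive metastore; the `weblogs` table and its columns are hypothetical placeholders for your own data.

```python
from pyspark.sql import SparkSession

# Start a Spark session with Hive support so Spark can see
# tables registered in the Hive metastore.
spark = (
    SparkSession.builder
    .appName("bi-report-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Ordinary SQL against a Hive table; "weblogs" and its columns
# are hypothetical stand-ins for your own data.
daily_clicks = spark.sql("""
    SELECT event_date, COUNT(*) AS clicks
    FROM weblogs
    GROUP BY event_date
    ORDER BY event_date
""")

daily_clicks.show()
```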
Once you decide on the tool(s), you will need to consider several important aspects:

Security: You will need a comprehensive framework that gives an increasing number of Hadoop users easy and secure access, each with a user identity defined by your security architect. If you use Kerberos for AD and LDAP authentication, you should consider the Apache Knox Gateway (Knox) for centralized authentication and access to the Hadoop environment. Use an SSL connection to secure data transfers. You can also set up specific security for Hive and Spark; for example, you can manage access to Hive and/or Spark on an individual basis.
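On the client side, Kerberos authentication is mostly a matter of connecting with the right options. The sketch below uses PyHive against HiveServer2; it assumes the `pyhive`, `sasl`, and `thrift-sasl` packages, a valid Kerberos ticket (obtained with `kinit`), and a hypothetical host name. Routing through a Knox gateway typically switches the connection to HTTP(S) transport, and SSL settings are driver- and deployment-specific, so both are omitted here.

```python
from pyhive import hive

# Connect to HiveServer2 with Kerberos (SASL/GSSAPI).
# Assumes you already hold a Kerberos ticket (kinit) and that
# the Hive service principal is "hive"; the host is hypothetical.
conn = hive.connect(
    host="hiveserver2.example.com",
    port=10000,
    auth="KERBEROS",
    kerberos_service_name="hive",
)

cursor = conn.cursor()
# Query a view rather than a base table, so row/column access
# can be restricted per user by the security layer.
cursor.execute("SELECT * FROM reporting.v_daily_sales LIMIT 10")
for row in cursor.fetchall():
    print(row)

conn.close()
```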
Data flow: SOURCE DATA → DATA LAKE → HIVE TABLES → VIEWS ON HIVE.
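The last two hops of that flow — creating Hive tables over files already landed in the Data Lake, and exposing them through views — come down to a few DDL statements. The sketch below issues them through Spark SQL; the database names, the schema, and the `/datalake/raw/sales` path are all hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS raw")
spark.sql("CREATE DATABASE IF NOT EXISTS reporting")

# Hive table over files already sitting in the data lake;
# the schema and location are hypothetical.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.sales (
        sale_id BIGINT,
        store_id INT,
        amount DOUBLE,
        sale_date DATE
    )
    STORED AS PARQUET
    LOCATION '/datalake/raw/sales'
""")

# View on Hive: the layer BI tools actually query; it can rename,
# filter, and restrict the raw data without copying it.
spark.sql("""
    CREATE VIEW IF NOT EXISTS reporting.v_sales AS
    SELECT sale_id, store_id, amount, sale_date
    FROM raw.sales
    WHERE amount IS NOT NULL
""")
```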
Traditional ETL tools can be used to move the data along this flow; however, once your data is in Hive, some fine-tuning may be required. For example, you might use materialization to improve performance (as sketched below), or, if you wish to perform time-frame analysis, you might consider moving that specific data to a traditional database.
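Materialization can be as simple as a scheduled CREATE TABLE AS SELECT (CTAS) that pre-computes an aggregate your dashboards query repeatedly. A minimal sketch, reusing the hypothetical names from the previous example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Materialize a frequently queried aggregate as a physical table,
# so dashboards read the pre-computed result instead of re-scanning
# the raw files. Run this on a schedule (e.g., nightly), dropping
# the old copy before rebuilding it.
spark.sql("DROP TABLE IF EXISTS reporting.daily_sales")
spark.sql("""
    CREATE TABLE reporting.daily_sales
    STORED AS PARQUET AS
    SELECT sale_date, SUM(amount) AS total_amount
    FROM reporting.v_sales
    GROUP BY sale_date
""")
```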