Much of the data we can access today has been generated continuously over the years, daily and even hourly. Enormous volumes arrive from many platforms: social media, transactional records in banks, high-quality digital images, motion pictures and much more. It is therefore fitting that companies develop more sophisticated big data solutions that provide a robust, affordable, scalable and flexible way of capturing, processing and storing the data they generate. Organizations need to move from simple, traditional ways of capturing data to approaches that can handle the complex, unstructured data that has emerged with time and technological advancement. Data no longer consists only of figures stored in relations; photos, voice notes, documents in their various forms and videos also require storage. It is therefore important to know which architecture to use and which resources to employ alongside these very large data architectures.
EXPLANATION OF BIG DATA ARCHITECTURE AND ITS TECHNOLOGIES
DEFINITION OF BIG DATA
Big data refers to data volumes too large to store and process within a given time frame using a traditional database management system. The term typically covers data measured in petabytes or more (exabytes, zettabytes and beyond), a scale that makes storing, analysing and visualizing the data difficult. Its volume outstrips the resources used to store or process it. Much of this data is not transactional; it is generated by users or by machines, including AI-driven systems.
BIG DATA ARCHITECTURE
Big data architecture, the basis for big data analytics, results from the interaction of big data application resources. These resources, or database technologies, are combined to achieve high performance, high fault tolerance and scalability. The architecture depends on the resources an organization has and on its data environment.
A big data architecture is designed to handle the ingestion, processing and analysis of data that is too large or complex for traditional database systems. Solutions typically involve batch processing of big data sources (data at rest), real-time processing of big data (data in motion), interactive exploration of the data, and predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
Data source: A solution may have a single data source or many, depending on how much data the organisation creates. Sources range from application data stores such as databases to files produced by applications, such as web server log files.
Data storage: Data for batch processing is written to a distributed file store that can hold immense quantities of data in various formats, commonly referred to as a data lake.
Batch processing: The solution must process data using reliable batch jobs that filter, organize and prepare it for analysis. These jobs read source files, process them and write the output to new files.
Real-time message ingestion: If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.
Analytical data store: The solution should prepare data for analysis and then serve the processed data in a structured format that can be queried with analytical tools.
Orchestration: Most solutions consist of repeated data processing operations that transform source data, move it into a data store and assemble the output into a report. Orchestration technology can be used to coordinate and automate these workflows.
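The batch processing component described above (read source files, process them, write output to new files) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-field log format, the directory names and the function names run_batch and demo are hypothetical and are not taken from any particular product.

```python
import tempfile
from collections import Counter
from pathlib import Path


def run_batch(source_dir: Path, output_dir: Path) -> Path:
    """Read every *.log file, count requests per status code, write a summary file."""
    counts = Counter()
    for log_file in sorted(source_dir.glob("*.log")):
        for line in log_file.read_text().splitlines():
            parts = line.split()
            # Assumed (hypothetical) line format: METHOD PATH STATUS
            if len(parts) == 3:
                counts[parts[2]] += 1
    output_dir.mkdir(parents=True, exist_ok=True)
    out_file = output_dir / "status_counts.csv"
    rows = [f"{status},{n}" for status, n in sorted(counts.items())]
    out_file.write_text("\n".join(rows) + "\n")
    return out_file


def demo() -> str:
    """Create two small source files, run the batch job, return the output text."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "raw"
        src.mkdir()
        (src / "a.log").write_text("GET / 200\nGET /missing 404\n")
        (src / "b.log").write_text("POST /login 200\n")
        out = run_batch(src, Path(tmp) / "curated")
        return out.read_text()


if __name__ == "__main__":
    print(demo())
```

In a real deployment the source and output directories would live in distributed storage such as a data lake, and an orchestration tool would schedule the job; here a temporary local directory stands in for both.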
EXAMPLES OF BIG DATA ARCHITECTURE
Internet of Things (IoT) architecture: The Internet of Things has no single, universally agreed architecture, so multiple architectures have been proposed. These include three-layer and five-layer architectures, cloud-based and fog-based architectures, social IoT, representative architectures and many others. A basic IoT architecture has sensor/device, edge, data intelligence and application layers stacked one on top of another, each carrying out its own tasks and each containing sub-layers.
Lambda architecture: A data processing architecture designed to handle large quantities of data by using both batch and stream processing methods. This approach tries to balance latency, throughput and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The rise of lambda architecture corresponds with the growth of big data and real-time analytics, and with the drive to mitigate the latencies of MapReduce.
Lambda architecture depends on a data model with an append-only, immutable data source that serves as the system of record. It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. It has three layers:
1. The Batch Layer manages the master data and precomputes the batch views.
2. The Speed Layer serves recent data only and increments the real-time views.
3. The Serving Layer is responsible for indexing and exposing the views so that they can be queried.
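As a rough illustration of how the three layers cooperate, the sketch below keeps an append-only event log, a batch view recomputed from the full log, and a real-time view updated incrementally. The class name LambdaSketch and the page-view example are hypothetical. Clearing the real-time view after each batch run is a simplification: a production system would have to account for events that arrive while the batch job is running.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LambdaSketch:
    master_log: list = field(default_factory=list)        # append-only system of record
    batch_view: Counter = field(default_factory=Counter)  # precomputed by the batch layer
    realtime_view: Counter = field(default_factory=Counter)  # maintained by the speed layer

    def ingest(self, timestamp: int, page: str) -> None:
        """Append the event (never overwrite) and increment the real-time view."""
        self.master_log.append((timestamp, page))
        self.realtime_view[page] += 1

    def run_batch(self) -> None:
        """Batch layer: recompute the view from the whole master log.

        Simplification: the real-time view is reset because the batch view
        now covers every event ingested so far.
        """
        self.batch_view = Counter(page for _, page in self.master_log)
        self.realtime_view.clear()

    def query(self, page: str) -> int:
        """Serving layer: answer queries by merging batch and real-time views."""
        return self.batch_view[page] + self.realtime_view[page]


if __name__ == "__main__":
    system = LambdaSketch()
    system.ingest(1, "/home")
    system.ingest(2, "/home")
    system.run_batch()          # batch view now holds both events
    system.ingest(3, "/home")   # only the speed layer has seen this one
    print(system.query("/home"))
```

The query merges both views, so recent events are visible immediately even though the batch view lags behind.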
The three layers are outlined in the diagram below, along with a sample choice of technology stacks:
Hadoop architecture: Working with Hadoop requires considerable knowledge of every layer of the Hadoop stack, from understanding the components of the Hadoop architecture, to designing a Hadoop cluster and tuning its performance, to setting up the tool chain responsible for processing data.
Hadoop follows a master-slave architecture design for storing data and for distributed data processing, using HDFS and MapReduce respectively. The master node for data storage in HDFS is the NameNode, and the master node for parallel data processing with Hadoop's MapReduce is the JobTracker. The slave nodes are the other machines in the Hadoop cluster, which store data and carry out complex computations. Each slave node runs a TaskTracker daemon and a DataNode, which synchronize their processes with the JobTracker and NameNode respectively. In a Hadoop implementation, the master and slave systems can be set up in the cloud or on-premises.
Image Credit : OpenSource.com
BIG DATA TECHNOLOGIES
Big data technologies are the means by which challenges in data analytics, visualization and storage are tackled. The problems created by big data's volume, variety and velocity call for new technology solutions. The most prominent and widely used big data technology is Hadoop, an open source project of the Apache Software Foundation. This open source library was created with a focus on scalable, reliable, distributed and flexible computing systems that can handle big data. Hadoop is built around two components that work hand in hand.
The first is the Hadoop Distributed File System (HDFS), which provides the high-bandwidth storage access that big data computing requires.
The second component is a data processing framework known as MapReduce, based on a programming model that originated in Google's search technology. It distributes huge data sets across many servers; each server processes the part of the data set it receives, and the results are combined into a summary before more traditional analysis tools are used. Processing the distributed pieces is the “map” step, and combining the results into a summary is the “reduce” step.
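The map and reduce steps can be illustrated with the classic word-count example. The sketch below runs in a single Python process and does not use Hadoop's own API; the function names are hypothetical. On a cluster, each chunk would be mapped on a different server and the grouping step (often called the shuffle) would move data across the network.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list:
    """Map: emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list) -> dict:
    """Group all emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: summarise all values for one key, here by summing the counts."""
    return key, sum(values)


def word_count(chunks: list) -> dict:
    """Run map over every chunk, group by word, then reduce each group."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for a chunk of data held on a different server.
    print(word_count(["big data big", "Data lake"]))
```

Because each chunk is mapped independently and each key is reduced independently, both steps can be spread across many machines.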
Hadoop and various other big data tools have evolved to solve the challenges of the big data environment. These tools can be classified into the following categories:
1. Data Storage and Management
Examples include NoSQL databases such as MongoDB, CouchDB, Cassandra, HBase and Neo4j, as well as Talend, Apache Hadoop and Apache ZooKeeper.
2. Data Cleaning
Examples include MS Excel and OpenRefine.
3. Data Mining
The process of discovering insights in a database. Examples include RapidMiner and Teradata.
NoSQL is a collection of concepts that enable efficient, effective and rapid processing of data sets, with an emphasis on reliability, scalability, flexibility, agility and performance. The name is short for “not SQL” or, more accurately, “not only SQL”; it does not mean that these systems must use a language other than SQL. They may use SQL as well as other query languages.
NoSQL represents a move away from the popular relational database management systems (RDBMS). To explain NoSQL, it helps to first explain SQL (Structured Query Language), the query language used by an RDBMS. These databases depend on relations (tables), rows, columns and schemas to organize and retrieve data.
In comparison, NoSQL does not rely on the latter. It instead uses more flexible data models. Because relational database management systems have often been unable to meet the flexibility, performance and scalability needs of data-intensive next-generation applications, many mainstream organizations have embraced NoSQL databases to address these shortcomings.
NoSQL is used particularly to store unstructured data, which grows much more rapidly than structured data and does not fit into RDBMS tables. Common examples of unstructured data include:
Large objects such as videos and images, chat/messaging and log data, user-entered and session-generated data, and time series (real-time) data such as IoT and device data.
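To make the contrast with fixed tables concrete, the sketch below stores records with different fields side by side in one collection, which is the kind of flexible data model described above. It is a toy in-memory example: the class DocumentCollection and its methods are hypothetical and do not reproduce the API of MongoDB, CouchDB or any other product.

```python
import copy


class DocumentCollection:
    """A toy schema-less store: each document is a dict with its own set of fields."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id: str, document: dict) -> None:
        """Store a copy of the document; no table schema is declared or enforced."""
        self._docs[doc_id] = copy.deepcopy(document)

    def find(self, **criteria) -> list:
        """Return the ids of documents whose fields match every given criterion."""
        return [
            doc_id
            for doc_id, doc in self._docs.items()
            if all(name in doc and doc[name] == value
                   for name, value in criteria.items())
        ]


if __name__ == "__main__":
    store = DocumentCollection()
    # Three records with entirely different shapes share one collection.
    store.insert("v1", {"kind": "video", "codec": "h264", "bytes": 52000000})
    store.insert("m1", {"kind": "chat", "user": "ana", "text": "hello"})
    store.insert("s1", {"kind": "sensor", "device": "thermo-7", "readings": [21.5, 21.7]})
    print(store.find(kind="chat"))
    print(store.find(device="thermo-7"))
```

In a relational table, these three records would need either separate tables or many empty columns; here each document simply carries the fields it has.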