Twitter Data Analysis

September 13, 2021

Goal:

Collect who-follows-whom data from Twitter and group the users by the number of users they follow. For each group, calculate the number of users belonging to that group.
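To make the goal concrete, here is a minimal single-machine sketch in Python. The input format (a list of `(follower, followee)` pairs), the sample data and the function name `follow_count_histogram` are assumptions made for illustration; the actual experiments are in the repository linked further down.

```python
from collections import Counter


def follow_count_histogram(edges):
    """Return {number of users followed: number of users with that count}.

    edges is an iterable of (follower, followee) pairs (assumed format).
    """
    # Step 1: count how many distinct users each follower follows.
    follows_per_user = Counter(follower for follower, _ in set(edges))
    # Step 2: count how many users share each follow count.
    return Counter(follows_per_user.values())


# Hypothetical sample data, not taken from the real dataset.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "a"), ("d", "b")]
histogram = follow_count_histogram(edges)
print(sorted(histogram.items()))
```

Users who follow nobody never appear as a follower in the edge list, so this sketch has no group for a follow count of zero.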

Basics:

MapReduce: MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a map procedure, which performs filtering and sorting, and a reduce method, which performs a summary operation.

A MapReduce framework (or system) is usually composed of three operations (or steps):

Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of the redundant input data is processed.

Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.

Reduce: worker nodes now process each group of output data, per key, in parallel.
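The three steps above can be imitated on a single machine to see how the follow-count problem splits into two chained jobs. This is only a sketch: `run_mapreduce`, the mapper and reducer names, and the sample edges are hypothetical, and a real framework would run each phase across worker nodes instead of in one process.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate map, shuffle and reduce in a single process."""
    # Map: apply the mapper to every record and collect (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle: bring all values that share a key together.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: process each key's group independently.
    return [reducer(key, values) for key, values in sorted(groups.items())]


def map_edge(edge):
    # Job 1 mapper: emit (follower, 1) for each follow relationship.
    follower, _followee = edge
    yield follower, 1


def map_count(user_count):
    # Job 2 mapper: emit (follow count, 1) for each user.
    _user, count = user_count
    yield count, 1


def sum_values(key, values):
    # Reducer shared by both jobs: sum the ones emitted for a key.
    return key, sum(values)


# Hypothetical sample data, not taken from the real dataset.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "a"), ("d", "b")]
per_user = run_mapreduce(edges, map_edge, sum_values)
histogram = run_mapreduce(per_user, map_count, sum_values)
print(per_user)
print(histogram)
```

The output of the first job (one follow count per user) becomes the input of the second job, which counts how many users fall into each follow-count group.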

Spark: Apache Spark is an engine for large-scale data processing that runs work in parallel across the nodes of a cluster, which can number in the thousands. Spreading the computation this way makes the most of the processing capacity of those nodes.

Pig: Apache Pig is a high-level platform for processing large datasets. It provides a layer of abstraction over MapReduce through a high-level scripting language, Pig Latin, in which the data analysis code is written. To process data stored in HDFS, programmers first write scripts in Pig Latin. Internally, the Pig Engine (a component of Apache Pig) converts these scripts into map and reduce tasks, which stay hidden from the programmer to preserve the abstraction. Pig Latin and the Pig Engine are the two main components of Apache Pig. Pig's results are stored in HDFS.

Hive: Hive is an ETL and data warehousing tool built on top of the Hadoop Distributed File System (HDFS). It simplifies operations such as data encapsulation, ad-hoc queries, and analysis of huge datasets.

In Hive, tables and databases are created first and then data is loaded into these tables.

As a data warehouse, Hive is designed for managing and querying only structured data stored in tables.

When dealing with structured data, MapReduce lacks optimization and usability features such as UDFs, whereas the Hive framework provides them. Query optimization here means executing a query in a way that is efficient in terms of performance.

Hive’s SQL-inspired language shields the user from the complexity of MapReduce programming. For ease of learning, it reuses familiar concepts from the relational database world, such as tables, rows, columns, and schemas.
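To show what the SQL-style formulation of the follow-count problem might look like, the sketch below runs a nested GROUP BY query from Python. It does not use Hive: SQLite (through Python's built-in `sqlite3` module) stands in for it so the example runs anywhere, and the table name `follows`, its columns and the sample rows are all assumptions. The query sticks to plain SELECT, COUNT, GROUP BY and ORDER BY, so a similar statement should carry over to Hive with little or no change.

```python
import sqlite3

# Inner query: follow count per user. Outer query: users per follow count.
HISTOGRAM_QUERY = """
SELECT follow_count, COUNT(*) AS num_users
FROM (
    SELECT follower, COUNT(*) AS follow_count
    FROM follows
    GROUP BY follower
) AS per_user
GROUP BY follow_count
ORDER BY follow_count
"""


def follow_histogram(edges):
    """Load (follower, followee) pairs into a table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        # As in Hive, the table is created first and the data loaded afterwards.
        conn.execute("CREATE TABLE follows (follower TEXT, followee TEXT)")
        conn.executemany("INSERT INTO follows VALUES (?, ?)", edges)
        return conn.execute(HISTOGRAM_QUERY).fetchall()
    finally:
        conn.close()


# Hypothetical sample data, not taken from the real dataset.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "a"), ("d", "b")]
rows = follow_histogram(edges)
print(rows)
```

Each returned row is a `(follow_count, num_users)` pair, which is the grouping described in the goal.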

Experimental observations and analysis:

https://github.com/swatidamele/DataAnalysis

References:

https://towardsdatascience.com/introduction-to-apache-spark-with-scala-ed31d8300fe4

https://en.wikipedia.org/wiki/MapReduce

https://www.geeksforgeeks.org/introduction-to-apache-pig/

https://www.guru99.com/introduction-hive.html

https://www.projectpro.io/article/mapreduce-vs-pig-vs-hive/163
