
DevOps pipeline for Real-Time Social/Web Mining

Workflow

Technology Stack

  • Git: Version Control

  • GitHub: Distributed Development and SCM

  • Python: Tweepy and Pandas libraries for Data Mining using the Twitter API, and the Matplotlib library for Data Visualization (see the crawler sketch after this list)

  • Java: Big Data cleaning and stripping workflow using MapReduce

  • Apache Maven: Build Automation Tool for Java

  • GitHub Actions: Continuous Integration tool that runs the Apache Maven build whenever Java source code is pushed.

  • Hadoop: Sets up an HDFS cluster for Big Data Analytics.

  • Likert Scaling: Data Classification into a 5-class model.

  • Python: Sentiment Analysis programming

  • Docker: Cross-platform package image pushed to DockerHub.

  • DataDog: Monitoring tool for our Docker Package.

  • Docker-Compose: Integrates the StatusNeo Twitter Mining Docker image with the DataDog Agent

  • HashiCorp Packer: Creates cross-platform deployable images

  • HashiCorp Terraform: Infrastructure as Code

  • Ansible: Configuration Management and Automated Provisioning
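
The Python mining step above pairs Tweepy with Pandas to pull tweets and persist them for the Hadoop stage. Below is a minimal sketch, assuming Tweepy 4.x, placeholder credentials, and an illustrative hashtag query; the columns written to info.csv are assumptions rather than the exact schema of crawler.py.

```python
# Hypothetical sketch: pull recent tweets for a hashtag with Tweepy and
# write them to info.csv with pandas. Credentials and the query are placeholders.
import tweepy
import pandas as pd

# Tweepy 4.x OAuth 1.0a user-context auth (older releases use tweepy.OAuthHandler)
auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"
)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Page through the standard search endpoint for an example hashtag
tweets = tweepy.Cursor(
    api.search_tweets, q="#StatusNeo -filter:retweets", lang="en", tweet_mode="extended"
).items(200)

rows = [
    {"created_at": t.created_at, "user": t.user.screen_name, "text": t.full_text}
    for t in tweets
]

# Write the mined tweets to the CSV consumed by the rest of the pipeline
pd.DataFrame(rows).to_csv("info.csv", index=False)
```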

Important Source files and dependencies

  1. pom.xml - Setup Apache Maven

  2. helloworld.java - Basic Java project setup

  3. maven.yml - setup GitHub Actions

  4. crawler.py - Web Crawler in Python to extract Twitter data based on specific hashtags.

  5. info.csv - data file created as output by the crawler, to be sent to the HDFS core for processing

  6. MapReduce functionalities in Java

  7. Sentiment Analysis in Python (see the classifier sketch after this list)
  • Convolutional Neural Networks
  • Decision Tree
  • SVM
  • Pre-Processing
  • Random Forests
  • Naive Bayes
  • XGBoost

  8. matplotlib.py - Data Visualization using Matplotlib in Python

  9. Hadoop Setup

  10. Dockerfile

  11. Automation.sh - Run locally on a Linux-based machine.

  12. docker-compose.yml for DataDog x Docker integration.

  13. Ansible Playbook

  14. Packer Image Builder

  15. Infrastructure as Code using Terraform
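
Of the classifiers listed under Sentiment Analysis, the sketch below shows just one (Multinomial Naive Bayes over TF-IDF features) using scikit-learn. The labeled_tweets.csv file and its text/label columns are hypothetical, and the repo's own pre-processing and 5-class Likert labels may differ.

```python
# Hypothetical sketch: TF-IDF + Multinomial Naive Bayes sentiment classifier.
# The file name and the "text"/"label" column names are assumptions, not the
# repo's actual schema; swap MultinomialNB for LinearSVC, etc. to compare models.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labeled_tweets.csv")  # assumed columns: text, label (e.g. 1-5 Likert class)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```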

Backlog

[x] Setting up Apache Maven for Java project - User Interface and MapReduce functions

[x] Setting up GitHub repository workflow

[x] Setting up GitHub Actions for automation

[x] Creating a web crawler in Python using the Tweepy library to fetch data based on specified search parameters.

[] Create a User Interface

[x] Create an HDFS cluster for MapReduce functionality and program Hadoop MapReduce in Java

[x] Setup Hadoop Core and create Job Tracker and Task Trackers for the project

[x] Implement MapReduce on HDFS using Java to count the frequency of significant data-dictionary words in the Twitter strings (see the word-count sketch after this backlog)

[x] Configure Apache Maven with MapReduce codes and install Apache Hadoop Jar dependency

[x] Configure MapReduce code in GitHub Actions for automation

[x] Automate the Big Data pipeline till MapReduce using GitHub Actions

[] Use Data Ingestion tools like Flume to send data from the crawler to HDFS in real time

[x] Write a program in Java to implement MapReduce on the JSON file extracted from the crawler to find the frequency of significant words - Textual Analysis

[] Data Classification - create a multi-class data dictionary for sentiment analysis - currently for words (in future, we might extend it to phrases and sentences for improved accuracy)

[x] Data Prediction - Using the KNN algorithm in Python to find the relation between tweets and their sentiments (see the prediction and visualization sketch after this backlog).

[x] Data Visualization - Using the Python Matplotlib library to visualize the results.
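
The word-frequency MapReduce steps above are implemented in Java in this repo; as a compact, language-neutral illustration of the same map and reduce logic, here is a hypothetical Hadoop Streaming-style script in Python. The dictionary.txt file of significant words is an assumed name.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming-style word count (the repo's actual MapReduce is Java).
# With Hadoop Streaming, point -mapper at "wordcount.py map" and -reducer at
# "wordcount.py reduce"; Hadoop sorts the mapper output by key before the reducer runs.
import sys

def mapper():
    # dictionary.txt (assumed name) lists the significant words, one per line
    with open("dictionary.txt") as f:
        significant = {w.strip().lower() for w in f}
    for line in sys.stdin:
        for word in line.lower().split():
            if word in significant:
                print(f"{word}\t1")

def reducer():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```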
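The Data Prediction and Data Visualization items can be sketched together: a KNN classifier over TF-IDF features followed by a Matplotlib bar chart of the predicted classes. This reuses the same hypothetical labeled_tweets.csv layout as the earlier classifier sketch and is not the code in matplotlib.py.

```python
# Hypothetical sketch: KNN sentiment prediction plus a bar chart of predicted classes.
# File and column names ("labeled_tweets.csv", "text", "label") are assumptions.
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labeled_tweets.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# KNN over TF-IDF features; n_neighbors=5 is an illustrative default
knn = make_pipeline(TfidfVectorizer(stop_words="english"), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)

# Bar chart of how many test tweets fall into each predicted sentiment class
counts = Counter(predicted)
plt.bar([str(k) for k in sorted(counts)], [counts[k] for k in sorted(counts)])
plt.xlabel("Sentiment class")
plt.ylabel("Number of tweets")
plt.title("Predicted sentiment distribution")
plt.show()
```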

How to Contribute

This is an open-source project, open to everyone.

Follow these contribution guidelines.

License

MIT License, copyrighted to StatusNeo, forked from Storms in Brewing (2019-2020)