
DevOps pipeline for Real-Time Social/Web Mining

Workflow

Technology Stack

  • Git: Version Control

  • GitHub: Distributed Development and SCM

  • Python: Tweepy and Pandas libraries for Data Mining using the Twitter API, and the Matplotlib library for Data Visualization (see the crawler sketch after this list)

  • Java: Big Data cleaning and stripping workflow using MapReduce

  • Apache Maven: Build Automation Tool for Java

  • GitHub Actions: Continuous Integration tool that runs the Apache Maven build whenever Java source code is pushed.

  • Hadoop: Sets up an HDFS cluster for Big Data Analytics.

  • Likert Scaling: Data Classification into a 5-class model.

  • Python: Sentiment Analysis programming

  • Docker: Cross-platform package image pushed to DockerHub.

  • DataDog: Monitoring tool for our Docker Package.

  • Docker-Compose: Integrates the StatusNeo Twitter Mining Docker image with the DataDog Agent

  • HashiCorp Packer: Creates cross-platform deployable images

  • HashiCorp Terraform: Infrastructure as Code

  • Ansible: Configuration Management and Automated Provisioning
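
The Python mining step above pairs Tweepy with Pandas to pull tweets and persist them for the Hadoop stage. Below is a minimal sketch, assuming Tweepy 4.x, placeholder credentials, and an illustrative hashtag query; the columns written to info.csv are assumptions rather than the exact schema of crawler.py.

```python
# Hypothetical sketch: pull recent tweets for a hashtag with Tweepy and
# write them to info.csv with pandas. Credentials and the query are placeholders.
import tweepy
import pandas as pd

# Tweepy 4.x OAuth 1.0a user-context auth (older releases use tweepy.OAuthHandler)
auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"
)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Page through the standard search endpoint for an example hashtag
tweets = tweepy.Cursor(
    api.search_tweets, q="#StatusNeo -filter:retweets", lang="en", tweet_mode="extended"
).items(200)

rows = [
    {"created_at": t.created_at, "user": t.user.screen_name, "text": t.full_text}
    for t in tweets
]

# Write the mined tweets to the CSV consumed by the rest of the pipeline
pd.DataFrame(rows).to_csv("info.csv", index=False)
```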

Important Source files and dependencies

  1. pom.xml - Setup Apache Maven

  2. helloworld.java - Basic Java project setup

  3. maven.yml - setup GitHub Actions

  4. crawler.py - Web Crawler in Python to extract Twitter data based on specific hashtags.

  5. info.csv - data file created as output by the crawler, to be sent to the HDFS core for processing

  6. MapReduce functionalities in Java

  7. Sentiment Analysis in Python (see the classifier sketch after this list)
  • Convolutional Neural Networks
  • Decision Tree
  • SVM
  • Pre-Processing
  • Random Forests
  • Naive Bayes
  • XGBoost

  8. matplotlib.py - Data Visualization using Matplotlib in Python

  9. Hadoop Setup

  10. Dockerfile

  11. Automation.sh - Run locally on a Linux-based machine.

  12. docker-compose.yml for DataDog x Docker integration.

  13. Ansible Playbook

  14. Packer Image Builder

  15. Infrastructure as Code using Terraform
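
Of the classifiers listed under Sentiment Analysis, the sketch below shows just one (Multinomial Naive Bayes over TF-IDF features) using scikit-learn. The labeled_tweets.csv file and its text/label columns are hypothetical, and the repo's own pre-processing and 5-class Likert labels may differ.

```python
# Hypothetical sketch: TF-IDF + Multinomial Naive Bayes sentiment classifier.
# The file name and the "text"/"label" column names are assumptions, not the
# repo's actual schema; swap MultinomialNB for LinearSVC, etc. to compare models.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labeled_tweets.csv")  # assumed columns: text, label (e.g. 1-5 Likert class)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```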

Backlog

[x] Setting up Apache Maven for Java project - User Interface and MapReduce functions

[x] Setting up GitHub repository workflow

[x] Setting up GitHub Actions for automation

[x] Creating a web crawler in Python using the Tweepy library to fetch data based on specified search parameters.

[] Create a User Interface

[x] Create an HDFS cluster for MapReduce functionality and program Hadoop MapReduce in Java

[x] Setup Hadoop Core and create Job Tracker and Task Trackers for the project

[x] Implement MapReduce on HDFS using Java to count the frequency of significant data-dictionary words in the Twitter strings (see the word-count sketch after this backlog)

[x] Configure Apache Maven with MapReduce codes and install Apache Hadoop Jar dependency

[x] Configure MapReduce code in GitHub Actions for automation

[x] Automate the Big Data pipeline till MapReduce using GitHub Actions

[] Use Data Ingestion tools like Flume to send data from the crawler to HDFS in real time

[x] Write a program in Java to implement MapReduce on the JSON file extracted from the crawler to find the frequency of significant words - Textual Analysis

[] Data Classification - create a multi-class data dictionary for sentiment analysis - currently for words (in future, we might extend it to phrases and sentences for improved accuracy)

[x] Data Prediction - Using the KNN algorithm in Python to find the relation between tweets and their sentiments (see the prediction and visualization sketch after this backlog).

[x] Data Visualization - Using the Python Matplotlib library to visualize the results.
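
The word-frequency MapReduce steps above are implemented in Java in this repo; as a compact, language-neutral illustration of the same map and reduce logic, here is a hypothetical Hadoop Streaming-style script in Python. The dictionary.txt file of significant words is an assumed name.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming-style word count (the repo's actual MapReduce is Java).
# With Hadoop Streaming, point -mapper at "wordcount.py map" and -reducer at
# "wordcount.py reduce"; Hadoop sorts the mapper output by key before the reducer runs.
import sys

def mapper():
    # dictionary.txt (assumed name) lists the significant words, one per line
    with open("dictionary.txt") as f:
        significant = {w.strip().lower() for w in f}
    for line in sys.stdin:
        for word in line.lower().split():
            if word in significant:
                print(f"{word}\t1")

def reducer():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```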
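The Data Prediction and Data Visualization items can be sketched together: a KNN classifier over TF-IDF features followed by a Matplotlib bar chart of the predicted classes. This reuses the same hypothetical labeled_tweets.csv layout as the earlier classifier sketch and is not the code in matplotlib.py.

```python
# Hypothetical sketch: KNN sentiment prediction plus a bar chart of predicted classes.
# File and column names ("labeled_tweets.csv", "text", "label") are assumptions.
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labeled_tweets.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# KNN over TF-IDF features; n_neighbors=5 is an illustrative default
knn = make_pipeline(TfidfVectorizer(stop_words="english"), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)

# Bar chart of how many test tweets fall into each predicted sentiment class
counts = Counter(predicted)
plt.bar([str(k) for k in sorted(counts)], [counts[k] for k in sorted(counts)])
plt.xlabel("Sentiment class")
plt.ylabel("Number of tweets")
plt.title("Predicted sentiment distribution")
plt.show()
```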

How to Contribute

This is an open-source project, open to everyone.

Follow these contribution guidelines.

License

MIT License, copyrighted to StatusNeo, forked from Storms in Brewing (2019-2020)