Repository for all files and code related to the project: A Multilevel Streaming Data Analytics Infrastructure for Predictive Analytics. URL: http://cs.queensu.ca/~farhana/bam-lab/projects/
There has been considerable interest in developing systems for processing continuous data streams, driven by the increasing need for real-time analytics for decision support in business, healthcare, manufacturing, security, and the Internet of Things. Some of the data handled by stream processing systems needs further processing, for which most systems currently store the data on disk and re-load it into memory for the next level of processing. Storing and re-loading large streaming data incurs compute and storage overhead. The goal of this project is to design and implement a multi-level architecture that can support high-speed, real-time streaming data processing and complex machine learning analytics.
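The idea of passing data between processing levels in memory, rather than writing it to disk and re-loading it, can be illustrated with a minimal Python sketch. This is not the project's implementation: the function names (`level_one_clean`, `level_two_window_counts`, `run_pipeline`) and the word-count analytics are hypothetical stand-ins chosen only to show two levels chained without an intermediate store.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, List


def level_one_clean(records: Iterable[str]) -> Iterator[str]:
    """First level (hypothetical): lightweight per-record cleanup."""
    for record in records:
        text = record.strip().lower()
        if text:  # drop empty records
            yield text


def level_two_window_counts(
    records: Iterable[str], window_size: int
) -> Iterator[Dict[str, int]]:
    """Second level (hypothetical): word counts over a sliding window."""
    window = deque(maxlen=window_size)  # keeps only the newest records
    for record in records:
        window.append(record)
        counts: Dict[str, int] = {}
        for item in window:
            for word in item.split():
                counts[word] = counts.get(word, 0) + 1
        yield counts


def run_pipeline(source: Iterable[str], window_size: int = 2) -> List[Dict[str, int]]:
    # The second level consumes the first level's generator directly,
    # so no record is written to disk between levels.
    return list(level_two_window_counts(level_one_clean(source), window_size))


if __name__ == "__main__":
    for counts in run_pipeline(["  Hello World ", "", "hello stream"]):
        print(counts)
```

Because both levels are generators, each record flows through the whole chain as soon as it arrives; a production system would replace these functions with distributed operators, but the hand-off pattern is the same.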
Extracting meaningful and timely insights from unbounded data is very challenging. There are currently many open-source and proprietary systems for data stream processing. This variety is welcome, but it makes selecting the right components or processing framework for a given use case a major challenge. Understanding the required capabilities of streaming architectures is vital to making the right design or usage choice. As a first step toward the objectives of this project, we conducted a systematic literature review, proposed a taxonomy and architecture, performed a comparative study of distributed data stream processing and analytics frameworks, and conducted a critical review of representative open-source (Storm, Spark Streaming, Structured Streaming, Flink, Kafka Streams, KSQL) and commercial (IBM Streams) distributed data stream and graph processing frameworks. This study identified open problems (research opportunities) and can serve as a guide for organizations and individuals planning to implement a real-time data stream processing and analytics framework. The outcome of our review was published in IEEE Access. Paper: A Survey of Distributed Data Stream Processing Frameworks. URL: https://ieeexplore.ieee.org/document/8864052
An essential part of building a data-driven organization is the ability to handle and process continuous streams of data to discover actionable insights. The explosive growth of interconnected devices and the social Web has led to a large volume of data being generated on a continuous basis. Streaming data such as stock quotes, credit card transactions, trending news, traffic conditions, and time-sensitive patient data is not only very common but also loses value rapidly if not processed quickly. The ever-increasing volume and highly irregular data rates pose new challenges to data stream processing systems. One challenging but important task is to accurately ingest and integrate data streams from various sources and locations into an analytics platform. These challenges demand new strategies and systems that can offer the desired degree of scalability and robustness in handling failures. This project investigates the fundamental requirements and the state of the art of existing data stream ingestion systems, proposes a scalable and fault-tolerant data stream ingestion and integration framework that can serve as a reusable component across many feeds of structured and unstructured input data in a given platform, and demonstrates the utility of the framework in a real-world data stream processing case study that integrates Apache NiFi and Kafka for processing high-velocity news articles from across the globe. The study also identifies best practices and gaps for future research in developing large-scale data stream processing infrastructure. The outcome of this study was presented at the 2018 IEEE BigData conference in Seattle, WA, USA. Paper: A Scalable and Robust Framework for Data Stream Ingestion. URL: https://ieeexplore.ieee.org/abstract/document/8622360
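As a rough illustration of what fault-tolerant ingestion from several feeds involves, the following Python sketch pulls records from multiple sources into one shared buffer, retries transient failures, and isolates a failing source so the others keep flowing. It does not use Apache NiFi or Kafka and is not the framework from the paper; `TransientError`, `fetch_with_retry`, `ingest`, and the record fields are all hypothetical names introduced for this example.

```python
import queue
from typing import Callable, Dict, List


class TransientError(Exception):
    """Hypothetical error type for a recoverable source failure."""


def fetch_with_retry(fetch: Callable[[], List[dict]], max_retries: int) -> List[dict]:
    """Call fetch(), retrying up to max_retries times on TransientError."""
    attempt = 0
    while True:
        try:
            return fetch()
        except TransientError:
            if attempt >= max_retries:
                raise  # retries exhausted, let the caller decide
            attempt += 1


def ingest(
    sources: Dict[str, Callable[[], List[dict]]],
    buffer: "queue.Queue[dict]",
    max_retries: int = 2,
) -> List[str]:
    """Pull every source into the shared buffer; return names of failed sources."""
    failed = []
    for name, fetch in sources.items():
        try:
            records = fetch_with_retry(fetch, max_retries)
        except TransientError:
            failed.append(name)  # one bad feed must not stop the others
            continue
        for record in records:
            # Tag each record with its origin so downstream stages can integrate feeds.
            buffer.put({"source": name, **record})
    return failed


if __name__ == "__main__":
    buf: "queue.Queue[dict]" = queue.Queue()
    failed = ingest({"feed_a": lambda: [{"title": "example headline"}]}, buf)
    print(failed, buf.qsize())
```

In a deployed system the in-process queue would typically be replaced by a durable, partitioned log so that buffered records survive a process crash; the sketch only shows the retry and isolation logic.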
The rapid growth of data in velocity, volume, value, variety, and veracity has enabled exciting new opportunities and presented big challenges for businesses of all types. Recently, there has been considerable interest in developing systems for processing continuous data streams, driven by the increasing need for real-time analytics for decision support in business, healthcare, manufacturing, and security. The analytics of streaming data usually relies on the output of offline analytics on static or archived data. However, businesses and organizations such as our industry partner Gnowit strive to provide their customers with real-time market information, and continuously look for a unified analytics framework that can integrate streaming and offline analytics seamlessly to extract knowledge from large volumes of hybrid streaming data. As part of this project, we present our study on designing a multilevel streaming text data analytics framework by comparing leading-edge scalable, open-source, distributed, and in-memory technologies. We demonstrate the functionality of the framework for a use case of multilevel text analytics using deep learning for language understanding and sentiment analysis, including data indexing and query processing. Our framework combines Spark Streaming for real-time text processing, the Long Short-Term Memory (LSTM) deep learning model for higher-level sentiment analysis, and other tools for SQL-based analytical processing to provide a scalable solution for multilevel streaming text analytics. The outcome of this study was presented at the 2019 IEEE COMPSAC conference in Milwaukee, Wisconsin, USA. Paper: A Scalable Framework for Multilevel Streaming Data Analytics using Deep Learning. URL: https://ieeexplore.ieee.org/abstract/document/8754149
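The three levels described above (real-time text processing, sentiment analysis, and SQL-based analytics) can be sketched end to end with the Python standard library. This is only an illustration under strong assumptions: plain micro-batching stands in for Spark Streaming, a tiny hypothetical word lexicon stands in for the LSTM model, and an in-memory SQLite table stands in for the SQL layer. None of the names or word lists come from the paper.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical lexicon; a stand-in for a trained sentiment model.
POSITIVE = {"good", "great", "gain"}
NEGATIVE = {"bad", "loss", "poor"}


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Level 1: group an unbounded stream into small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def score(text: str) -> int:
    """Level 2 stand-in: positive word count minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def analyze(stream: Iterable[str], batch_size: int = 2) -> List[Tuple[str, int]]:
    """Level 3: store scored texts in memory and summarize them with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (text TEXT, sentiment INTEGER)")
    for batch in micro_batches(stream, batch_size):
        conn.executemany(
            "INSERT INTO articles VALUES (?, ?)",
            [(text, score(text)) for text in batch],
        )
    rows = conn.execute(
        "SELECT CASE WHEN sentiment > 0 THEN 'positive' "
        "WHEN sentiment < 0 THEN 'negative' ELSE 'neutral' END AS label, "
        "COUNT(*) FROM articles GROUP BY label ORDER BY label"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(analyze(["great gain today", "bad loss", "market opens", "good news"]))
```

Each batch moves from processing to scoring to the queryable table without leaving memory, which is the property the multilevel design aims for; swapping `score` for a real model would not change the surrounding flow.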
The following are workshops that we hosted as a result of this project:
A Hands-on Tutorial on Deep Learning for Object and Pattern Recognition @CASCON 2019, November 4-6, 2019, Markham, Canada.
The objective of this workshop was to provide an introduction to the fundamental concepts of deep learning algorithms, along with hands-on tutorials, for aspiring data scientists, researchers, industry practitioners, and deep learning enthusiasts looking to build deep learning into their business applications. URL: https://dl.acm.org/doi/abs/10.5555/3370272.3370331
Large-Scale Multilevel Streaming Data Analytics @CASCON 2018, October 29-31, 2018, Markham, Canada.
The objective of this workshop was to provide a forum for researchers and industry practitioners to discuss new ideas and share their experiences in the area of streaming data analytics. Participants presented their work on topics including methods, models, algorithms, infrastructures, quality issues, applications, and open problems for large-scale streaming data analytics. URL: https://dl.acm.org/doi/abs/10.5555/3291291.3291356
The following are talks that we presented as a result of this project:
Spot and Stop Fake News: Using Deep Learning to Predict the Veracity of News Streams. SOSCIP 3 Minute Impact Competition, Advance Ontario, May 15th-16th, 2019.
A Multilevel Streaming Data Analytics Infrastructure for Predictive Analytics. Technology Expo, CASCON '18, October 30th, 2018.
BAM-Lab/Gnowit Collaboration, Impact, and Success Stories. Queen's Post-Doc Research Showcase & Reception, September 20th, 2018.
A Multilevel Streaming Data Analytics Infrastructure for Predictive Analytics. SOSCIP Postdoctoral Fellows Lightning Round Competition, OARCC'18, May 16th, 2018.
This project was featured in the June 2018 SOSCIP newsletter. URL: https://www.soscip.org/wp-content/uploads/2017/08/soscip_impactreport2018_pages.pdf
The postdoctoral fellow was interviewed and featured in the SOSCIP researcher showcase in the September 2018 SOSCIP newsletter. URL: https://www.soscip.org/stories/researcher-spotlight-making-the-connection