Twitter is a powerful platform for accessing real-time information and a valuable source of data for many applications. However, collecting and analyzing Twitter data can be challenging, especially for people without specialized technical skills. A web app that simplifies scraping Twitter data and offers features for analyzing and visualizing it can therefore be useful to researchers, journalists, businesses, and individuals. This is an open source web app that lets you scrape tweets from Twitter using the snscrape library, view the data with Streamlit, and upload the scraped content to MongoDB for future use. It has been developed by me, Nirmal Kumar, a data science enthusiast and Python developer. The app is intended for educational and research purposes only, and should not be used for any commercial or unethical activities.
Before you begin, you will need to have a few tools installed on your machine:
- Python 3.7 or higher. [Note: Streamlit currently runs only .py scripts, so notebook (.ipynb) files are not recommended.]
- MongoDB software.
- The snscrape, pandas, streamlit, and pymongo packages.
Python is the programming language used to develop this project. It is a popular high-level programming language known for its readability and versatility. It is widely used for web development, data analysis, and machine learning. It provides a powerful and flexible foundation for scraping and analyzing Twitter data.
MongoDB is a cross-platform document-oriented database program. It uses JSON-like documents with optional schemas and is classified as a NoSQL database. We used it here to store the scraped Twitter data. It provides a flexible and scalable solution for managing large amounts of data.
snscrape is a Python library that lets you scrape social media data without using the official API and without its request limits. You don't even need an active account to scrape content with snscrape. It supports a variety of platforms, including Twitter, Facebook, and Instagram. We used it here to scrape tweets from Twitter. It provides greater flexibility and control over the data we collect.
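As a rough sketch of how a keyword or hashtag search might be expressed with snscrape: the helper names `build_query` and `scrape_tweets` below are hypothetical and are not taken from this app's source. The `since:` and `until:` parts are Twitter search operators, and the tweet attribute names are assumptions that vary between snscrape versions (newer releases expose `rawContent` instead of `content`).

```python
from datetime import date
from itertools import islice


def build_query(term: str, search_type: str, start: date, end: date) -> str:
    """Build a Twitter search string for a keyword or hashtag in a date range."""
    term = term.strip()
    if search_type == "Hashtag" and not term.startswith("#"):
        term = "#" + term
    # since:/until: are Twitter search operators that bound the tweet date.
    return f"{term} since:{start.isoformat()} until:{end.isoformat()}"


def scrape_tweets(query: str, limit: int) -> list:
    """Collect up to `limit` tweets matching `query` as plain dictionaries."""
    # Imported lazily so build_query can be used without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    scraper = sntwitter.TwitterSearchScraper(query)
    rows = []
    for tweet in islice(scraper.get_items(), limit):
        rows.append({
            "date": tweet.date,
            "user": tweet.user.username,
            # Assumption: the text attribute name depends on the snscrape version.
            "content": getattr(tweet, "rawContent", None) or getattr(tweet, "content", None),
        })
    return rows


query = build_query("python", "Hashtag", date(2023, 1, 1), date(2023, 2, 1))
print(query)
```

Calling `scrape_tweets(query, 100)` would then return at most 100 rows; `islice` stops the generator early, so the scraper is not asked for more results than requested.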
Streamlit is an open-source Python library that makes it easy to create and share custom web apps for machine learning and data science. We used it here to deploy our project as a web app. It makes it easy to create an interactive user interface for exploring and visualizing the scraped Twitter data.
PyMongo is a Python distribution containing tools for working with MongoDB. We used it here to connect to a MongoDB server and perform database operations using Python.
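A minimal sketch of what such an upload could look like. The database name `twitter_db`, the collection name `scraped_tweets`, the document layout, and the local connection URI are all assumptions made for illustration, not the app's actual schema. `MongoClient` and `insert_one` are standard PyMongo calls.

```python
from datetime import datetime, timezone


def make_document(keyword: str, tweets: list) -> dict:
    """Wrap one scrape run in a single document, ready for insertion."""
    return {
        "keyword": keyword,
        "scraped_at": datetime.now(timezone.utc),
        "tweet_count": len(tweets),
        "tweets": tweets,
    }


def upload(collection, keyword: str, tweets: list):
    """Insert the run into any object that offers PyMongo's insert_one method."""
    result = collection.insert_one(make_document(keyword, tweets))
    return result.inserted_id


def get_collection(uri: str = "mongodb://localhost:27017"):
    """Open the (hypothetical) twitter_db.scraped_tweets collection."""
    from pymongo import MongoClient  # lazy import: only needed for a real upload

    return MongoClient(uri)["twitter_db"]["scraped_tweets"]
```

With a MongoDB server running locally, `upload(get_collection(), "python", rows)` would store one document per scrape run. Storing each run as a single document keeps the search term and timestamp together with its tweets, which is one possible way to support a collections history view.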
Pandas is a popular Python library for data manipulation and analysis. We used pandas to convert the list of scraped tweets into a DataFrame and to export that DataFrame in .csv, .json, and dict formats.
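The conversions described above can be sketched as follows. The two sample rows and the column names are made up for illustration; real rows would come from the scraper.

```python
import pandas as pd

# Hypothetical sample rows standing in for scraped tweets.
tweets = [
    {"date": "2023-01-01", "user": "alice", "content": "Hello #python"},
    {"date": "2023-01-02", "user": "bob", "content": "Streamlit is neat"},
]

# A list of dictionaries becomes one row per tweet, one column per key.
df = pd.DataFrame(tweets)

# With no path argument, to_csv and to_json return the data as a string.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# A list of plain dictionaries, one per row.
records = df.to_dict(orient="records")

print(df.shape)
```

The string outputs can be handed directly to a download widget, while the list of dictionaries is the shape MongoDB expects for inserted documents.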
Scraping public tweets with snscrape is generally considered legal, but it is important to make sure that you are not violating any applicable laws or Twitter's terms of service. It is always recommended to obtain explicit consent from Twitter users before scraping their tweets, and to follow ethical and responsible scraping practices. snscrape provides a convenient way to scrape up to 100,000 tweets, which I consider to be well within Twitter's guidelines and framework.
1. Go to the web app URL in your web browser.
2. Choose the type of search - **Keyword** or **Hashtag**
3. Enter the keyword/hashtag of your choice
4. Set a **Start Date** and an **End Date**. By default, the start date is 100 days before today.
5. You can also view your selected options under the **Details Pane:** in the sidebar to ensure accuracy.
6. Select the number of tweets to scrape
7. Click on the **Scrape Tweets** button
8. Use the two tabs, **SHOW** and **DOWNLOAD**, to view the scraped data immediately and to download the scraped tweets in .csv or .json format
9. In the sidebar, you can **UPLOAD DATA TO MONGO_DB** and view the **COLLECTIONS HISTORY**
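The default date range from step 4 can be sketched with the standard library. The function names are hypothetical, and using today as the default end date is an assumption; the steps above only state the default for the start date.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def default_date_range(today: Optional[date] = None, days_back: int = 100) -> Tuple[date, date]:
    """Return (start, end) where start is `days_back` days before `today`."""
    today = today or date.today()
    # Assumption: the end date defaults to today.
    return today - timedelta(days=days_back), today


def validate_range(start: date, end: date) -> None:
    """Reject a range whose start date falls after its end date."""
    if start > end:
        raise ValueError("Start Date must not be after End Date")


start, end = default_date_range(date(2023, 4, 11))
validate_range(start, end)
print(start, end)
```

Passing `today` explicitly keeps the helper easy to test; in the app it would normally be called without arguments.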
To run the app, follow these steps:
1. Clone the repository to your local machine using the following command: `git clone https://github.com/Nirmal7781/Twitter_scrapping.git`
2. Install the required libraries by running the following command: `pip install -r requirements.txt`
3. Open a terminal window and navigate to the directory where the app is located using the following command: `cd [.py file directory]`
4. Run the command `streamlit run twitter_scraper.py` to start the app.
5. The app should now be running on a local server. If it doesn't open in your browser automatically, you can access it at either the Local URL or the Network URL printed in the terminal, for example:
* Local URL: `http://localhost:8501` or * Network URL: `http://192.168.43.83:8501` (the network address depends on your machine).
To modify the app, you can:
1. Add filters to the search results table to allow users to sort and filter the results.
2. Add a visualization of the search results, such as a word cloud or a chart.
3. Allow users to save their search queries for future use.
4. Use machine learning algorithms to perform sentiment analysis on the tweets and display the results.
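As one possible starting point for the visualization idea in item 2, word frequencies can be computed with the standard library alone. This is a sketch, not part of the current app: `top_words` is a hypothetical helper, and the tokenization rule (lowercase ASCII words of at least four letters, URLs removed) is an arbitrary choice.

```python
import re
from collections import Counter


def top_words(texts, n=10, min_length=4):
    """Return the n most frequent words across the given tweet texts."""
    words = []
    for text in texts:
        # Remove URLs before splitting the text into words.
        text = re.sub(r"https?://\S+", " ", text.lower())
        words.extend(w for w in re.findall(r"[a-z']+", text) if len(w) >= min_length)
    return Counter(words).most_common(n)


sample = [
    "Python makes scraping easy",
    "Scraping tweets with Python https://example.com",
]
print(top_words(sample, n=2))
```

The resulting (word, count) pairs could then be passed to a chart, for instance by building a DataFrame from them and calling `st.bar_chart`, or to a word cloud library.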
- Social media monitoring: A company can use the app to monitor mentions of its brand on Twitter and analyze the sentiment of those mentions.
- Influencer marketing: An influencer can use the app to track their own social media presence and engagement metrics, or to identify other influencers in their industry.
- Market research: A business can use the app to monitor conversations about their industry on Twitter and gather insights on consumer behavior, preferences, and trends.
- News and journalism: A journalist can use the app to track breaking news and trending topics on Twitter and gather data for a news story.
- Academic research: A researcher can use the app to collect data from Twitter for academic studies, such as sentiment analysis or social network analysis.
- Political analysis: A political analyst can use the app to monitor public opinion on political issues and track the social media activity of political candidates and parties.
- Before looking at this issue, here is a little background on session_state in Streamlit.
Session state is a powerful feature in Streamlit for building dynamic, interactive apps. It exists because Streamlit re-executes the entire script from top to bottom each time a user interacts with the app, and ordinary variables do not survive that rerun. This can lead to performance issues and makes it difficult to build interactive apps that rely on keeping user data across multiple interactions.
To solve this problem, Streamlit introduced session state, a per-session store whose values persist across reruns. The script is still re-executed, but data saved in `st.session_state` (for example, the scraped tweets) can be read back instead of being recomputed, which makes it easier to build apps that respond to user input.
Now, the potential issue is that, even though session_state is a powerful feature, in my experience the app can still rerun from top to bottom on its own even when session_state is used to prevent this. Session state preserves values across reruns, but it does not stop the reruns themselves, so anything that was not saved in session_state is lost.
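The pattern described above, storing a result once and reading it back on later reruns, can be sketched as follows. The helper `cached_in_state` and the function `render_app` are hypothetical names and are not taken from this app's source. `st.session_state` supports the dictionary-style access used here, so the helper is demonstrated with a plain dict that stands in for it.

```python
def cached_in_state(state, key, compute):
    """Return state[key], computing and storing it only if it is missing."""
    if key not in state:
        state[key] = compute()
    return state[key]


def render_app():
    """Hypothetical page body; a Streamlit script would call this at module level."""
    import streamlit as st  # lazy import so the helper above works without Streamlit

    if st.button("Scrape Tweets"):
        # Placeholder data standing in for a real scrape.
        st.session_state["tweets"] = [{"content": "example tweet"}]

    # This branch also runs on later reruns, when the button reads False again.
    if "tweets" in st.session_state:
        st.write(st.session_state["tweets"])


# Plain-dict demonstration: the second call reuses the stored value.
calls = []


def fake_scrape():
    calls.append(1)
    return ["t1", "t2"]


state = {}
cached_in_state(state, "tweets", fake_scrape)
cached_in_state(state, "tweets", fake_scrape)
print(len(calls))
```

Because the display step checks session state rather than the button, the scraped data would stay visible when another widget, such as a download button, triggers a rerun.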
- The app works well on a local server in every respect described above. However, when it is deployed to the cloud, pymongo cannot be used there to upload the dataset to MongoDB. This is a persistent issue that other developers have also reported on the Streamlit community forum, and it may take some time before it is resolved.
This application is intended for educational and research purposes only and should not be used for any commercial or unethical activities.
If you have any questions, comments, or suggestions for the app, please feel free to contact me at [nirmal.works@outlook.com]