Reddit Scraper is an Apify actor for extracting data from Reddit. It lets you extract posts and comments, together with some user info, without logging in. It is built on top of the Apify SDK, and you can run it both on the Apify platform and locally.
Field | Type | Description | Default value |
---|---|---|---|
startUrls | array | List of Request objects that will be deeply crawled. | |
searches | array | An array of keywords that will be used in Reddit's search engine. Each item in the array performs a different search. This field should be empty when using startUrls. | |
type | enum | The type of search that will be performed: "Posts" or "Communities and users". | "posts" |
time | enum | Filter the searched posts by the last hour, day, week, month, or year. | "all" |
sort | enum | Sort the search by Relevance, Hot, Top, New, or Comments. | null |
maxItems | number | The maximum number of items that will be saved in the dataset. If you are scraping Communities & Users, remember that each category inside a community is saved as a separate item. More details below. | 50 |
maxPostCount | number | The maximum number of posts that will be scraped for each Posts page or Communities & Users URL. | 50 |
maxComments | number | The maximum number of comments that will be scraped for each Comments page. | 50 |
maxCommunitiesAndUsers | number | The maximum number of Communities & Users pages that will be scraped if your search or startUrl is of the Communities & Users type. | 50 |
maxLeaderBoardItems | number | The maximum number of communities inside a leaderboard page that will be scraped. If set to 0, all items will be scraped. | 20 |
extendOutputFunction | string | A JavaScript function, passed as plain text, that can return custom information. More on this in Extend output function below. | |
proxyConfiguration | object | Proxy settings of the run. | {"useApifyProxy": true } |
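For example, a minimal input that searches for posts could look like this (the values are illustrative; the exact enum strings accepted for sort can be checked in the actor's input schema):
{
    "searches": ["web scraping"],
    "type": "posts",
    "sort": "new",
    "maxItems": 50,
    "maxComments": 20,
    "proxyConfiguration": { "useApifyProxy": true }
}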
When searching for Communities & Users, each community has different categories inside it (e.g. New, Hot, Rising, etc.). Each of those is saved as a separate item in the dataset, so you have to account for them when setting the maxItems input. As an example, if you set maxCommunitiesAndUsers to 10 and each community has 4 categories, you will have to set maxItems to at least 40 (10 x 4) to get all the categories for each community in the resulting dataset.
When searching for Posts, you can set maxItems to the same number as maxPostCount, since each post is saved as an item in the dataset. If maxItems is less than maxPostCount, the number of scraped posts will equal maxItems.
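Following the 10 x 4 example above, a Communities & Users input could be sketched like this (the exact enum string for type is an assumption based on the table above; verify it against the actor's input schema):
{
    "searches": ["pizza"],
    "type": "Communities and users",
    "maxCommunitiesAndUsers": 10,
    "maxItems": 40
}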
Almost any URL from Reddit will return a result. If a URL is not supported, the scraper will display a message before scraping the page. Here are some examples of URLs that can be scraped:
{
    "scraping communities": "https://www.reddit.com/r/worldnews/",
    "scraping channels within communities": "https://www.reddit.com/r/worldnews/hot",
    "scraping search results for users/communities": "https://www.reddit.com/search/?q=news&type=sr%2Cuser",
    "scraping popular communities": "https://www.reddit.com/subreddits/leaderboard/crypto/",
    "scraping users": "https://www.reddit.com/user/lukaskrivka/",
    "scraping users' comments": "https://www.reddit.com/user/lukaskrivka/comments/",
    "scraping posts": "https://www.reddit.com/r/learnprogramming/comments/lp1hi4/is_webscraping_a_good_skill_to_learn_as_a_beginner/",
    "scraping search results for posts": "https://www.reddit.com/search/?q=news",
    "scraping popular posts": "https://www.reddit.com/r/popular/"
}
If you use a search URL as one of the startUrls, it will only scrape posts. If you want to search for communities and users, use the searches field or the specific URL instead.
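Since startUrls is a list of Request objects (the standard Apify shape, an object with a url property), direct URLs from the list above are passed like this:
{
    "startUrls": [
        { "url": "https://www.reddit.com/r/worldnews/" },
        { "url": "https://www.reddit.com/user/lukaskrivka/comments/" }
    ]
}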
Output is stored in a dataset.
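If you want to process the results programmatically, you can download them with, for example, the apify-client NPM package (the token and dataset ID below are placeholders you need to fill in):
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

(async () => {
    // Fetch the scraped items from the run's dataset
    const { items } = await client.dataset('YOUR_DATASET_ID').listItems();
    for (const post of items) {
        console.log(post.title, post.numberOfVotes);
    }
})();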
Post Example:
{
    "postUrl": "https://www.reddit.com/r/TrueOffMyChest/comments/hdipdr/my_wife_doesnt_know_but_once_or_twice_a_month/",
    "communityName": "r/TrueOffMyChest",
    "numberOfVotes": 787000,
    "postedBy": "u/Rpark888",
    "postedDate": "2020-06-23T00:53:29.675Z",
    "title": "My wife doesn't know. But once or twice a month after she falls asleep, I order a medium pizza and 8 wings, and I eat them outside in the backyard, by myself, and throw away the evidence before I go back to bed.",
    "text": "It's honestly the most exciting thrill that I often daydream about and look forward to. I wake up pretty thirsty and bloated though, lol.UPDATE: I'm going to pull this off again sometime in the next couple days. I'll try to document it with some pictures of all the glory!!!!UPDATE 2 with pics!!!I ordered a large pizza tonight instead of a medium because of a coupon, but I went with the thin crust. I like regular crust wayyy better!!! After I polished off the 8 hot wings, I had to tap out after 4.5 pizzas..I was just too full. Anyways, thanks for all the love :)Probably won't be doing many more of these, but, glad some of you enjoyed partaking in my secret plate night indulgences.Love yourself. Choose to be happy. Even if it means not sharing with your wife :)",
    "comments": [
        {
            "commentUrl": "https://www.reddit.com/r/TrueOffMyChest/comments/hdipdr/my_wife_doesnt_know_but_once_or_twice_a_month/t1_fvlehno",
            "userName": "annoyedNYC",
            "commentDate": "2020-06-23T00:53:29.677Z",
            "description": "I tried sneaking a pizza past my wife once. I forgot to turn off the smart security camera though!",
            "points": "4"
        },
        {
            "commentUrl": "https://www.reddit.com/r/TrueOffMyChest/comments/hdipdr/my_wife_doesnt_know_but_once_or_twice_a_month/t1_fvlenth",
            "userName": "marijuana-",
            "commentDate": "2020-06-23T00:53:29.677Z",
            "description": "21st century problems amirite",
            "points": "1"
        },
        {
            "commentUrl": "https://www.reddit.com/r/TrueOffMyChest/comments/hdipdr/my_wife_doesnt_know_but_once_or_twice_a_month/t1_fvlrd44",
            "userName": "bemental_",
            "commentDate": "2020-06-23T00:53:29.677Z",
            "description": "I just found out our grocery store loyalty card number tracks and stores all our orders in the same interface my wife uses to schedule our online grocery order pickups.I thought I was being super sneaky going into the store for a quick treaty treat before picking up our groceries.She’s known the whole time and not brought it up. I married way better then I deserved to have.",
            "points": "305"
        },
        {
            "commentUrl": "https://www.reddit.com/r/TrueOffMyChest/comments/hdipdr/my_wife_doesnt_know_but_once_or_twice_a_month/t1_fvltoiq",
            "userName": "gHHqdm5a4UySnUFM",
            "commentDate": "2020-06-23T00:53:29.678Z",
            "description": "I’d rather be caught eating pizza than have to explain why every month at 1am the security cameras are mysteriously turned off",
            "points": "139"
        }
    ]
}
Community Example:
This will be replicated for each category inside the community, so each category's posts are saved in a different object.
{
    "title": "Pizza",
    "title2": "r/Pizza",
    "createdAt": "Created Aug 26, 2008",
    "members": 266000,
    "moderators": ["6745408", "AutoModerator", "BotTerminator"],
    "category": "top",
    "posts": [
        {
            "postUrl": "https://www.reddit.com/r/Pizza/comments/hjtnw4/margherita_life/",
            "numberOfVotes": 10000,
            "communityName": "r/Pizza",
            "postedBy": "u/4000xxl",
            "postedDate": "2020-07-02T09:21:51.445Z",
            "title": "Margherita = life"
        },
        {
            "postUrl": "https://www.reddit.com/user/popdusteats/comments/hfam2q/hellofreshs_newest_offer_is_giving_you_80_off/",
            "numberOfVotes": 3,
            "communityName": "user/popdusteats",
            "postedBy": "u/popdusteats",
            "postedDate": "2020-06-25T00:21:51.448Z",
            "title": "HelloFresh's newest offer is giving you $80 OFF including FREE Shipping! HelloFresh helps you add variety to your daily meals. If you're looking for easy to make meals at an affordable price, click here to learn more."
        }
    ]
}
The compute unit (CU) consumption is expected to be approximately 1.4 CUs per 100 requests. For example, a run that makes 1,000 requests should consume roughly 14 CUs.
You can use this function to update the result output of this actor. You can choose what data from the page you want to scrape. The output from this function will get merged with the result output.
The return value of this function has to be an object!
You can return fields to achieve 3 different things:
- Add a new field - Return an object with a field that is not in the result output
- Change a field - Return an existing field with a new value
- Remove a field - Return an existing field with the value undefined (see the sketch at the end of this section)
async () => {
    return {
        title: document.querySelector("title").innerText,
    };
};
This example will add the title of the page to the final object (since the returned fields are merged into the result, it overwrites the title field that was already there):
{
    "title2": "r/Pizza",
    "createdAt": "Created Aug 26, 2008",
    "members": 266000,
    "moderators": ["6745408", "AutoModerator", "BotTerminator"],
    "category": "top",
    "posts": [
        {
            "postUrl": "https://www.reddit.com/r/Pizza/comments/hjtnw4/margherita_life/",
            "numberOfVotes": 10000,
            "communityName": "r/Pizza",
            "postedBy": "u/4000xxl",
            "postedDate": "2020-07-02T09:21:51.445Z",
            "title": "Margherita = life"
        },
        {
            "postUrl": "https://www.reddit.com/user/popdusteats/comments/hfam2q/hellofreshs_newest_offer_is_giving_you_80_off/",
            "numberOfVotes": 3,
            "communityName": "user/popdusteats",
            "postedBy": "u/popdusteats",
            "postedDate": "2020-06-25T00:21:51.448Z",
            "title": "HelloFresh's newest offer is giving you $80 OFF including FREE Shipping! HelloFresh helps you add variety to your daily meals. If you're looking for easy to make meals at an affordable price, click here to learn more."
        }
    ],
    "title": "homemade chicken cheese masala pasta"
}
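For completeness, here is a sketch of the other two cases, changing and removing fields (the values are purely illustrative):
async () => {
    return {
        // Change a field: overwrite the scraped community name with a new value
        communityName: "r/CustomName",
        // Remove a field: returning undefined drops it from the result
        postedBy: undefined,
    };
};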