docs: add Jsdom approach blog #2687

Merged
merged 12 commits into from
Oct 8, 2024

Conversation

souravjain540
Collaborator

Added the blog post (approved by engineering and marketing).

Member

@B4nan B4nan left a comment

left a few nitpicks about formatting for now.

i haven't read the whole thing, but i see there is not a single mention of the JSDOMCrawler we have in crawlee, which feels weird given it should be published here.

Contributor

@barjin barjin left a comment

I also skimmed through it and left some (incomplete) grammar suggestions.

I don't know how the new rules work, but you can try asking the marketing team for a Grammarly license. I'm dreading the day when my academic pass expires (and I'll likely continue paying for it). It really pays off :)

website/blog/2024/09-30-jsdom-based-scraping/index.md (5 outdated review threads, resolved)
@fnesveda added the t-tooling label (issues with this label are in the ownership of the tooling team) on Oct 2, 2024
Co-authored-by: Martin Adámek <banan23@gmail.com>
Co-authored-by: Jindřich Bär <jindrichbar@gmail.com>
@souravjain540
Collaborator Author

@B4nan @barjin, thanks for the review. Suggestions applied.

I will take care of US English next time :)

Also, according to Alexey, the JSDOM crawler waits until the page is loaded, and there is no option to customize that process, so the approach does not apply to the current version of the crawler. That's why we didn't use it.

However, this article still uses Crawlee and is for experienced developers, so it might be a nice way to reach that audience and introduce them to the crawler.

@B4nan
Member

B4nan commented Oct 3, 2024

I don't understand what he means, but either way, this is the Crawlee blog and we need to mention it. Otherwise, publishing such an article here makes little to no sense, if you ask me. It's like talking about a feature you have while knowing you did it wrong, so you don't even mention it. This is bad marketing to me.

Btw, this is the first time I'm hearing about issues with JSDOM in Crawlee. He should have reached out to us so we could fix things. I wouldn't merge this unless we rewrite it to use the crawler, not just mention it.

@souravjain540
Collaborator Author

okay, i will talk to him :)

@B4nan
Member

B4nan commented Oct 3, 2024

Yeah tell him to reach out to us on slack and we will try to sort things out with priority to unblock this article, I can imagine it won't be much work.

@mnmkng
Member

mnmkng commented Oct 3, 2024

@B4nan @barjin My process is to not bother with grammar too much, because the blogs should go through Content editing at the end.

@souravjain540 can you confirm that it does and that the guys can skip grammar fixes in future reviews?

EDIT: Oh, it says approved from Marketing in the initial post. How could it be approved by Marketing with so many errors?

Co-authored-by: Ondra Urban <23726914+mnmkng@users.noreply.github.com>
Member

@B4nan B4nan left a comment

left a few more comments.

i really think we should simplify the example and remove the SDK/apify related parts like the input handling or proxy configuration

i also dislike the overuse of object destructuring and arrow functions, but maybe that's just my pet peeve again.

website/blog/2024/09-30-jsdom-based-scraping/index.md (6 outdated review threads, resolved)
const items = data.list;
const counter = itemsCounter + items.length;
const dataItems = items.slice(0, resultsLimit && counter > resultsLimit ? resultsLimit - itemsCounter : undefined);
await Actor.pushData(dataItems);
Member

we should use the context helper here instead of the global API. also, the use of the apify SDK here feels a bit weird: it's an article about crawlee and it doesn't mention the apify platform at all, yet the code contains those pieces. it could be confusing.

Collaborator Author

@B4nan it was a POC of building an Actor using Crawlee. I can use the context helper, but I want it to remain an Actor, so I will add a few lines to explain.

Collaborator Author

Also, there are not many articles that link Actor building and Crawlee. I will add an explanation of building an Actor, and then let's see if it makes sense.
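
For readers following this thread, the result-limit slicing in the snippet under review can be isolated into a small pure function, which makes it easy to check independently of where the data is pushed. This is only a sketch: `limitItems` and `handlePage` are hypothetical names, and `context` here is a stand-in object rather than a real crawling context. It also clamps the slice end at zero, which the reviewed snippet does not do.

```javascript
// Hypothetical helper: trims a page of items so that the running total
// never exceeds resultsLimit. A resultsLimit of 0 means "no limit",
// matching the default used in the reviewed snippet.
function limitItems(items, itemsCounter, resultsLimit) {
    const counter = itemsCounter + items.length;
    if (!resultsLimit || counter <= resultsLimit) {
        return items.slice();
    }
    // Deviation from the reviewed code: clamp at 0 so a counter that is
    // already past the limit yields an empty page, not a negative slice end.
    return items.slice(0, Math.max(0, resultsLimit - itemsCounter));
}

// Stand-in for a request handler that uses the context helper instead of
// the global API. `context.pushData` is whatever the caller passes in.
async function handlePage(context, json) {
    const { itemsCounter = 0, resultsLimit = 0 } = context.request.userData;
    const dataItems = limitItems(json.data.list, itemsCounter, resultsLimit);
    await context.pushData(dataItems);
    return dataItems.length;
}
```

With 8 items already stored and a limit of 10, a page of 5 items is trimmed to the first 2.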

souravjain540 and others added 3 commits October 7, 2024 10:51
Co-authored-by: Martin Adámek <banan23@gmail.com>
@souravjain540
Collaborator Author

@B4nan made these changes:

  • formatted code
  • changed title
  • added a note section to mention JSDOM Crawler
  • added link to explain the SDK in between code.

Let’s see how we are going to create the `getApiUrlWithVerificationToken` function:

```js
const getApiUrlWithVerificationToken = async (body, url) => {
```
Member

and this is pretty much the same, the formatting is completely off in many places

Now coming to our main code, we will use CheerioCrawler and will use `prenavigationHooks` to inject the headers that we got from the earlier function into the `requestHandler`.

```js
const crawler = new CheerioCrawler({
```
Member

ok, i see you broke all the code blocks actually :D this one is also misformatted.

Collaborator Author

Grammarly it is 🤦 updating

Member

ouch, formatting code blocks with grammarly, that's bold :D

i can format this myself if you'd struggle with that

Collaborator Author

was editing english with it :)

Collaborator Author

done using prettier now.

Member

@B4nan B4nan left a comment

better, but still not there. also please use 4 spaces instead of 2 for indenting

Comment on lines 150 to 153
if (
["static", "scontent"].find((x) => urlToOpen.startsWith(`https://${x}`))
) {
}
Member

this is not doing anything
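
To illustrate the point: the condition computes a value and then discards it, because the `if` body is empty. A minimal sketch of a version that has an effect is below. It assumes the check was meant to detect asset hosts whose URLs start with `https://static` or `https://scontent`, which the thread does not confirm. `isAssetUrl` and `filterPageUrls` are hypothetical names.

```javascript
// Assumption: the empty `if` was meant to detect asset hosts.
// The prefixes come from the reviewed hunk; the intent is a guess.
const ASSET_PREFIXES = ['static', 'scontent'];

// `some` returns a boolean directly, which reads more clearly in a
// condition than `find`, which returns the matching element or undefined.
function isAssetUrl(urlToOpen) {
    return ASSET_PREFIXES.some((prefix) => urlToOpen.startsWith(`https://${prefix}`));
}

// The result has to drive some behaviour for the check to matter,
// for example dropping asset URLs from a list of candidates.
function filterPageUrls(urls) {
    return urls.filter((url) => !isAssetUrl(url));
}
```

Whether such URLs should be skipped, aborted, or handled differently is a design decision for the article's author; the sketch only shows that the condition needs a consequence.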

Comment on lines 182 to 193
createSessionFunction: async (sessionPool) => createSessionFunction(sessionPool, proxyConfiguration),
},
preNavigationHooks: [
(crawlingContext) => {
const { request, session } = crawlingContext;
request.headers = {
...request.headers,
...session.userData?.headers,

};
},
],
Member

this is still misformatted
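
For reference, here is roughly what the hook from the hunk above looks like with the requested 4-space indentation, run against mock objects so it can be tried on its own. The name `mergeSessionHeaders`, the mock `crawlingContext`, and the header names are all hypothetical; in a real crawler the context is supplied by the framework.

```javascript
// The hook body from the reviewed hunk, indented with 4 spaces.
const mergeSessionHeaders = (crawlingContext) => {
    const { request, session } = crawlingContext;
    request.headers = {
        ...request.headers,
        // Spreading undefined is a no-op, so a session without stored
        // headers leaves the request headers untouched.
        ...session.userData?.headers,
    };
};

// Mock context for illustration only.
const crawlingContext = {
    request: { headers: { accept: 'application/json' } },
    session: { userData: { headers: { 'x-example-token': 'abc123' } } },
};
mergeSessionHeaders(crawlingContext);
```

Because the session headers are spread last, they override any request header with the same name.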

@souravjain540
Collaborator Author

better, but still not there. also please use 4 spaces instead of 2 for indenting

done 👍

Comment on lines 53 to 54
Before making calls to this API, we will need few required headers (auth data, so we will first make the call to `https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en`
We will start this approach by creating a function that will create the URL for the API call for us and, make the call and get the data.
Member

@B4nan B4nan Oct 8, 2024

you are missing a closing ) here (and a dot I guess), and remove the line break (or add a blank line if this was supposed to be two paragraphs)

```js
export const createStartUrls = (input) => {
    const {
        days = "7",
```
Member

we use single quotes everywhere
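
As an illustration of the requested style (single quotes, 4-space indentation), a start-URL builder might look like the sketch below. Only the `days` default comes from the reviewed hunk. The base URL, the `period` and `page` query parameter names, and the returned shape are hypothetical placeholders, since the real endpoint and parameters are not shown in this thread.

```javascript
// Hypothetical base URL; the real endpoint is not shown in this thread.
const API_BASE = 'https://example.com/creative_radar_api/list';

const createStartUrls = (input = {}) => {
    const {
        days = '7',
        page = 1, // hypothetical parameter
    } = input;

    // URLSearchParams takes care of encoding the query string.
    const params = new URLSearchParams({
        period: String(days), // hypothetical query parameter name
        page: String(page),
    });

    return [{ url: `${API_BASE}?${params}`, userData: { page } }];
};
```

Calling `createStartUrls({})` yields a single request whose URL ends in `?period=7&page=1`.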


In the above function, we create the start url for the API call that include various parameters as we talked about earlier. After creating the URL according to the parameters it will call the `creative_radar_api` and fetch all the results.

But it won’t work until we get the headers. So, let’s create a function that will first create a session using sessionPool and proxyConfiguration.
Member

Suggested change
But it won’t work until we get the headers. So, let’s create a function that will first create a session using sessionPool and proxyConfiguration.
But it won’t work until we get the headers. So, let’s create a function that will first create a session using `sessionPool` and `proxyConfiguration`.

@souravjain540
Collaborator Author

done @B4nan

Comment on lines 249 to 250
To make things more clear, here is how code flow looks:
![code flow](./img/code-flow.webp)
Member

Suggested change
To make things more clear, here is how code flow looks:
![code flow](./img/code-flow.webp)
To make things more clear, here is how code flow looks:
![code flow](./img/code-flow.webp)

Comment on lines 206 to 223
const { itemsCounter = 0, resultsLimit = 0 } = userData;
if (!json.data) {
throw new Error('BLOCKED');
}
const { data } = json;
const items = data.list;
const counter = itemsCounter + items.length;
const dataItems = items.slice(
0,
resultsLimit && counter > resultsLimit
? resultsLimit - itemsCounter
: undefined,
);
await context.pushData(dataItems);
const {
pagination: { page, total },
} = data;
log.info(
Member

this is still misindented, 8 spaces instead of 4

Collaborator Author

done 😸

@souravjain540 souravjain540 merged commit 1c5cb67 into master Oct 8, 2024
9 checks passed
@souravjain540 souravjain540 deleted the jsdom branch October 8, 2024 08:37
5 participants