-
Notifications
You must be signed in to change notification settings - Fork 669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: add Jsdom approach blog #2687
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a few nitpicks about formatting for now.
i haven't read the whole thing, but i see there is not a single mention of the JSDOMCrawler we have in crawlee, which feels weird given it should be published here.
website/blog/2024/09-12-finding-students-accommodations/index.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also skimmed through it and left some (incomplete) grammar suggestions.
I don't know how the new rules work, but you can try asking the marketing team for a Grammarly license. I'm dreading the day when my academic pass expires (and I'll likely continue paying for it). It really pays off :)
Co-authored-by: Martin Adámek <banan23@gmail.com> Co-authored-by: Jindřich Bär <jindrichbar@gmail.com>
@B4nan @barjin, thanks for the review. Suggestions applied. I will take care of US English next time :) also, according to Alexey, However, this article still uses Crawlee and is for experienced developers, so it might be a nice way to reach that audience and introduce them to the crawler. |
I don't understand what he means, but either way, this is crawlee blog, we need to mention it, otherwise publishing such article here makes little to no sense if you ask me. Its like talking about a feature you have, but you know you did it wrong, so you don't even mention it. This is bad marketing for me. Btw this is the first time hearing about some issues about jsdom in crawlee, he should have reached out to us instead so we could fix things. I wouldn't merge this unless we rewrite it to use the crawler, not just mention it. |
okay, i will talk to him :) |
Yeah tell him to reach out to us on slack and we will try to sort things out with priority to unblock this article, I can imagine it won't be much work. |
@B4nan @barjin My process is to not bother with grammar too much, because the blogs should go through Content editing at the end. @souravjain540 can you confirm that it does and that the guys can skip grammar fixes in future reviews? EDIT: Oh, it says approved from Marketing in the initial post. How could it be approved by Marketing with so many errors? |
Co-authored-by: Ondra Urban <23726914+mnmkng@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left few more comments.
i really think we shold simplify the example and remove the SDK/apify related parts like the input handling or proxy configuration
i also dislike the overuse of object destructing, and arrow functions, but maybe that's just my pet peeve again.
const items = data.list; | ||
const counter = itemsCounter + items.length; | ||
const dataItems = items.slice(0, resultsLimit && counter > resultsLimit ? resultsLimit - itemsCounter : undefined); | ||
await Actor.pushData(dataItems); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should use the context helper here instead of the global API. also the use of apify SDK here feels a bit weird, its an article about crawlee, it doesn't mention the apify platform anyhow, yet in the code you can find those pieces. it could be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@B4nan it was a POC of building an Actor using Crawlee, I can use context helper, but want it to remain as an Actor, will add few lines to explain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there are not many articles that link Actor building and crawlee. i will add explanation of building Actor and then let's if it makes sense.
@B4nan made these changes:
|
Let’s see how we are going to create the `getApiUrlWithVerificationToken` function: | ||
|
||
```js | ||
const getApiUrlWithVerificationToken = async (body, url) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and this is pretty much the same, the formatting is completely off on many places
Now coming to our main code, we will use CheerioCrawler and will use `prenavigationHooks` to inject the headers that we got from the earlier function into the `requestHandler`. | ||
|
||
```js | ||
const crawler = new CheerioCrawler({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i see you broke all the code blocks actually :D this one is also misformatted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammarly it is 🤦 updating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ouch, formatting code blocks with grammarly, that's bold :D
i can format this myself if you'd struggle with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was editing english with it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done using prettier now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better, but still not there. also please use 4 spaces instead of 2 for indenting
if ( | ||
["static", "scontent"].find((x) => urlToOpen.startsWith(`https://${x}`)) | ||
) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not doing anything
createSessionFunction: async (sessionPool) => createSessionFunction(sessionPool, proxyConfiguration), | ||
}, | ||
preNavigationHooks: [ | ||
(crawlingContext) => { | ||
const { request, session } = crawlingContext; | ||
request.headers = { | ||
...request.headers, | ||
...session.userData?.headers, | ||
|
||
}; | ||
}, | ||
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is still misformatted
done 👍 |
Before making calls to this API, we will need few required headers (auth data, so we will first make the call to `https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en` | ||
We will start this approach by creating a function that will create the URL for the API call for us and, make the call and get the data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are missing a closing )
here (and a dot I guess), and remove the line break (or add a blank line if this was supposed to be two paragraphs)
```js | ||
export const createStartUrls = (input) => { | ||
const { | ||
days = "7", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use single quotes everywhere
|
||
In the above function, we create the start url for the API call that include various parameters as we talked about earlier. After creating the URL according to the parameters it will call the `creative_radar_api` and fetch all the results. | ||
|
||
But it won’t work until we get the headers. So, let’s create a function that will first create a session using sessionPool and proxyConfiguration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it won’t work until we get the headers. So, let’s create a function that will first create a session using sessionPool and proxyConfiguration. | |
But it won’t work until we get the headers. So, let’s create a function that will first create a session using `sessionPool` and `proxyConfiguration`. |
done @B4nan |
To make things more clear, here is how code flow looks: | ||
![code flow](./img/code-flow.webp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make things more clear, here is how code flow looks: | |
![code flow](./img/code-flow.webp) | |
To make things more clear, here is how code flow looks: | |
![code flow](./img/code-flow.webp) |
const { itemsCounter = 0, resultsLimit = 0 } = userData; | ||
if (!json.data) { | ||
throw new Error('BLOCKED'); | ||
} | ||
const { data } = json; | ||
const items = data.list; | ||
const counter = itemsCounter + items.length; | ||
const dataItems = items.slice( | ||
0, | ||
resultsLimit && counter > resultsLimit | ||
? resultsLimit - itemsCounter | ||
: undefined, | ||
); | ||
await context.pushData(dataItems); | ||
const { | ||
pagination: { page, total }, | ||
} = data; | ||
log.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is still misindented, 8 spaces instead of 4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done 😸
added blog(approved from engg+marketing)