Option to not cache entry URLs #1322

SeeYangZhi asked this question in Q&A
- Is there a method to not cache the URLs added to the requestQueue?
Answered by mnmkng on Mar 31, 2022
- What exactly do you mean by that?
- It's a bit verbose, but I wanted to make it clear what's going on:

  ```js
  import Apify from 'apify';

  const repeatableRequests = [];
  const requestQueue = await Apify.openRequestQueue();

  // You can do this only once. The requests will stay in the queue until you delete the file.
  const { request } = await requestQueue.addRequest({ url: 'https://example.com' });
  repeatableRequests.push(request);

  // But you need to save the information about the requests for the subsequent runs.
  await Apify.setValue('repeatable-requests', repeatableRequests);

  // Run the crawler normally.
  const crawler = new Apify.CheerioCrawler({
      requestQueue,
      handlePageFunction: async ({ request }) => {
          console.log(request.url);
      },
  });
  await crawler.run();

  // This is the place where the selected requests are updated to be crawlable again.
  // We're just telling the queue that those requests were not handled yet,
  // which re-enables their crawling in the queue.
  const requestsToUpdate = await Apify.getValue('repeatable-requests');
  const promises = requestsToUpdate.map((req) => {
      return requestQueue.client.updateRequest({
          ...req, // you need a copy of the original request
          handledAt: undefined, // and change their handled state
      });
  });
  await Promise.all(promises);
  ```
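The core idea in the answer above is that a request queue tracks completion through a `handledAt` timestamp, and clearing that field makes a request eligible for processing again. Here is a minimal, self-contained sketch of that pattern in plain Node.js, with no Apify dependency; the `InMemoryQueue` class is hypothetical and exists only to illustrate the mechanism, not to mirror the real `RequestQueue` API:

```javascript
// Hypothetical in-memory stand-in for a request queue that tracks
// handled state via a `handledAt` timestamp, keyed by URL.
class InMemoryQueue {
    constructor() {
        this.requests = new Map();
    }
    // Add a request if its URL is not already known; return the stored record.
    addRequest(req) {
        if (!this.requests.has(req.url)) {
            this.requests.set(req.url, { ...req, handledAt: undefined });
        }
        return this.requests.get(req.url);
    }
    // Simulate the crawler finishing a request.
    markHandled(url) {
        this.requests.get(url).handledAt = new Date();
    }
    // Overwrite the stored record, e.g. to reset its handled state.
    updateRequest(req) {
        this.requests.set(req.url, { ...req });
    }
    // Count requests that are still waiting to be crawled.
    pendingCount() {
        return [...this.requests.values()].filter((r) => !r.handledAt).length;
    }
}

const queue = new InMemoryQueue();
const saved = queue.addRequest({ url: 'https://example.com' });

// After a crawl, the request is marked handled and no longer pending.
queue.markHandled('https://example.com');
console.log(queue.pendingCount()); // 0

// Re-enable it by writing back a copy with `handledAt` cleared,
// mirroring the `updateRequest` step in the answer above.
queue.updateRequest({ ...saved, handledAt: undefined });
console.log(queue.pendingCount()); // 1
```

The important detail, as in the real answer, is that you persist a copy of the original request and write it back with `handledAt` unset, rather than re-adding the URL, because the queue deduplicates URLs it has already seen.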
Answer selected by SeeYangZhi