
Is Scrapy overkill for this kind of task?

Posted: Sun Aug 21, 2016 5:46 pm UTC
by jacques01
Suppose there is a stream of incoming URLs. The task is to get the HTML of each of these URLs and store it in a database. There is no need for traditional crawling, as each URL is the end of the line.

What I'm thinking for my technology stack:

1) Flask for web container
2) Celery to manage tasks / queue of URLs
3) Requests library to get the HTML of each URL
4) Save HTML to a MongoDB (key is URL since it's necessarily unique)
5) Pool of proxies to avoid blacklisting.
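
For scale, here is a minimal sketch of what items 1-4 of that stack boil down to. This assumes a local Redis broker and MongoDB instance; the broker URL and the database/collection names are hypothetical, and the Flask endpoint that would call fetch.delay(url) for each incoming URL is omitted:

[code]
# Minimal sketch of the proposed stack: a Celery task fetches HTML with
# Requests and upserts it into MongoDB keyed by URL.
# Broker URL, database and collection names are hypothetical.
import requests
from celery import Celery
from pymongo import MongoClient

app = Celery("fetcher", broker="redis://localhost:6379/0")
db = MongoClient("mongodb://localhost:27017")["scrape"]

@app.task(bind=True, max_retries=3)
def fetch(self, url):
    try:
        # A rotating proxy from item 5 would plug in here via proxies={...}.
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        # Re-queue with a delay instead of failing outright.
        raise self.retry(exc=exc, countdown=30)
    # Upsert keyed on the URL so a re-queued URL overwrites, not duplicates.
    db.pages.replace_one({"_id": url}, {"_id": url, "html": resp.text},
                         upsert=True)
[/code]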

Things I'd need to implement:

1. Controlling concurrency
2. Adding delays / auto-throttling
3. Detecting when a proxy / IP address is no longer productive
4. Controlling speed of scraping

My understanding is that Scrapy could do all of those things easily, except proxy management. But because I'm really just interested in the HTML and not the crawling, I'm unsure whether Scrapy is necessary for this task.
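
From skimming the docs, I believe points 1-4 map onto built-in Scrapy settings roughly like this (values are just illustrative, and proxy rotation would still need a downloader middleware):

[code]
# Stock Scrapy settings covering points 1-4 above; values illustrative.
custom_settings = {
    "CONCURRENT_REQUESTS": 16,             # 1. global concurrency cap
    "DOWNLOAD_DELAY": 0.5,                 # 2. base delay between requests
    "AUTOTHROTTLE_ENABLED": True,          # 2. adapt the delay to latency
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 4.0,
    "RETRY_ENABLED": True,                 # 3. retry/back off on failures
    "RETRY_TIMES": 3,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,   # 4. per-domain speed limit
}
[/code]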

Re: Is Scrapy overkill for this kind of task?

Posted: Sun Aug 21, 2016 6:59 pm UTC
by Flumble
We may be able to help if you can provide us reasonable evidence that you're not breaching your country's law and the target's terms of service. :wink:

Re: Is Scrapy overkill for this kind of task?

Posted: Tue Aug 23, 2016 10:00 am UTC
by lorb
Scrapy is pretty much exactly the right tool for this. Use Scrapy Cloud and they'll also deal with avoiding blacklisting for you, provided what you're doing is legal/sane. If you really need to scrape so many URLs that concurrency is unavoidable and a real concern, they handle that too, but you'll have to get one of the paid plans.

Re: Is Scrapy overkill for this kind of task?

Posted: Wed Aug 24, 2016 12:43 am UTC
by jacques01
Could you explain why Scrapy is the right tool for this task?

My understanding is that Scrapy is good if you're doing actual crawling, i.e.:

1. Start from a predetermined group of seed URLs
2. Put these into your URL pool
3. For each URL in your URL pool:
   a. Get the page HTML
   b. Do something with the HTML
   c. Find other URLs to add to the pool

I'm only doing step (a). That is, my list of URLs will never change based on what the spider discovers, because it's not discovery. It's just scraping.

What advantages does Scrapy offer for just getting HTML versus the approach I outlined above?

Re: Is Scrapy overkill for this kind of task?

Posted: Wed Feb 08, 2017 2:03 pm UTC
by lorb
It offers the advantage of not having to build the whole tech stack that you outlined. You just run Scrapy and skip steps (b) and (c). All you need is to install Scrapy and write a Python snippet of about 10 lines, or run it from Scrapy Cloud and you don't even have to install anything. I can't imagine anything you build yourself being easier/better.
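
Something like this minimal sketch is all it takes (start_urls here is a placeholder for your incoming stream):

[code]
# Roughly the ~10-line spider in question: fetch HTML, no link discovery.
import scrapy

class HtmlSpider(scrapy.Spider):
    name = "html_only"
    start_urls = ["https://example.com/page1"]  # placeholder URL stream

    def parse(self, response):
        # Step (a) only: emit the raw HTML keyed by URL.
        yield {"url": response.url, "html": response.text}
[/code]

Run it with "scrapy runspider spider.py -o pages.jl" and you get one JSON line per URL; swapping that feed export for a MongoDB item pipeline is only a few more lines.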