Monday 25 May 2015

Data Scraping - One application or multiple?

I have 30+ sources of data I scrape daily in various formats (xml, html, csv). Over the last three years Ive built 20 or so c# console applications that go out, download the data and re-format it into a database. But Im curious what other people are doing for this type of task. Are people building one tool that has a lot of variables and inputs or are people designing 20+ programs to scrape and parse this data. Everything is hard-coded into each console and run through Windows Task Manager.

Added a couple additional thoughts/details:

    Of the 30 sources, they all have unique properties, all are uploaded into individual MySQL tables and all have varying frequencies. For example, one data source is hit once a minute, another on 5 minute intervals. Majority are once an hour and once a day.

At current I download the formats (xml, csv, html), parse them into a formatted csv and put them into staging folders. Within that folder, I run an application that reads a config file specific to the folder. When a new csv is added to the folder, the application then uploads the data into the specific MySQL tables designated in the config file.

Im wondering if it is worth re-building all this into a larger complex program that is more capable of dynamically adding content+scrapes and adjusting to format changes.

Looking for outside thoughts.

5 Answers

What you are working on is basically ETL. So at a high level you need an export component (get stuff) a transform component (map to known format) and a load (take known format and put stuff somewhere). If you are comfortable being tied to a RDBMS you could use something like SQL Server SSIS packages. What I would do is create a host application that managed common aspects of the overall process (errors, and pipeline processing). Then make the specifics of the E, T, and L pluggable. A low ceremony way to get this would be to host the powershell runtime and create each seesion with common context objects that the scripts will use to communicate. You get a built in pipe and filter model for scripts and easy, safe extensibility. This design has worked extremely for my team with a similar situation.

Resist the temptation to rewrite.

However, for new code, you could plan for what you know has already happened. Write a retrieval mechanism that you can reuse through configuration. Write a translation mechanism that you can reuse (maybe in a library that you can call with very little code). Write a saving mechanism that can be called or configured.

At this point, you've written #21(+). Now, the following ones can be handled with a tiny bit of code and configuration. Yay!

(You may want to implement this in a service that handles multiple conversions, but weight the benefits of it versus the ability to separate errors in one module from the rest.)

1

It depends - if you need the scrapers to feed into a single application/database and have a uniform data format, it makes sense to have them all in a single program (possibly inheriting from a common base scraper).

If not and they are completely unrelated to each other, might as well keep them separate so changes in one have no effect on another.

Update, following edits to question:

Don't change things just for the sake of change. You have something that works, don't mess with it too much.

Since your data sources and data sinks are all separate from each other, combining them into one application will simply create a very complicated application that will be very difficult to change when needed.

Since the scrapers are separate, keep the separation as you have it now.

As sbrenton said, this most falls in with ETL. You should check out Talend Open Studio. It specializes in handling data flows like I imagine yours are as well as other things like duplicate removal, normalization of fields; tens/hundreds of drag and drop ETL components, you can also write custom code as Talend is a code generator as well, either Java or Perl are options. You can also use Talend to execute system commands. I use it for my ETL work, although not in production, in production we will use SSIS, mostly due to lots of other Microsoft products in house.

You may want to use some good scheduling library, like Quartz.NET.

In a few words, here's what you can expect:

    Your tasks are represented by classes and not processes

    You can set and forget tasks and scale across multiple servers

    You have an out-of-the-box system to actually take care of what is needed to be run when, what failed and needs to be re-run, etc. etc.

Source: http://programmers.stackexchange.com/questions/118077/data-scraping-one-application-or-multiple/118098#118098


No comments:

Post a Comment