Sunday, 24 August 2014

Php Scraping data from a website

I am very new to programming and need a little help with getting data from a website and passing it into my PHP script.

The website is http://www.birthdatabase.com/.

I would like to plug in a name (First and Last) and retrieve the result. I know you can query the site by passing the name in the URL, but I am having problems scraping the results.

http://www.birthdatabase.com/cgi-bin/query.pl?textfield=FIRST&textfield2=LAST&age=&affid=

I am using the file_get_contents($URL) function to get the page but need help after that. Specifically, I would like to scrape only the results from a certain state if there are multiple results for that name.



You need the awesome simple_html_dom class.

With this class you can query the webpage's DOM in a similar way to jQuery.

First include the class in your page, then get the page content with this snippet:

$html = file_get_html('http://www.birthdatabase.com/cgi-bin/query.pl?textfield=' . $first . '&textfield2=' . $last . '&age=&affid=');

Then you can use CSS selections to scrape your data (something like this):

$n = 0;
foreach($html->find('table tbody tr td div font b table tbody') as $element) {
    @$row[$n]['tr']  = $element->find('tr')->text;
    $n++;
}

// output your data
print_r($row);



Source: http://stackoverflow.com/questions/15601584/php-scraping-data-from-a-website

Obtaining reddit data

I am interested in obtaining data from different reddit subreddits. Does anyone know

if there is a reddit/other api similar like twitter does to crawl all the pages?


Yes, reddit has an API that can be used for a variety of purposes such as data

collection, automatic commenting bots, or even to assist in subreddit moderation.

There are a few places to discover information on reddit's API:

    github reddit wiki -- provides the overview and rules for using reddit's API

(follow the rules)
    automatically generated API docs -- provides information on the requests needed to

access most of the API endpoints
    /r/redditdev -- the reddit community dedicated to answering questions both about

reddit's source code and about reddit's API

If there is a particular programming language you are already familiar with, you

should check out the existing set of API wrappers for various languages. Despite my

bias (I am the package maintainer) I am quite certain PRAW, for python, has support

for the largest number of reddit API features.



Source: http://stackoverflow.com/questions/14322834/obtaining-reddit-data

Saturday, 23 August 2014

Scraping data in dynamic sites

I'm trying to scrape data from our local government. What I want is address from kids adoption offices. Here, in Brazil, all adoptions go through the government. So I have the URL of one office, there are 2 or 3 thousands more. But if I can manage to get one, the others will be easy. I made many attempts, bellow I show three.

The problem could be related to a Javascript (Ajax maybe) that refresh the page.

Note: I am not a PHP developer.

First attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP GET 1</h1>';

echo ini_get("allow_url_fopen");
echo ini_get("allow_url_fopen");

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$html = file_get_contents($url);
var_dump($html);

echo '</body></html>';

// Output
// 11
// Warning:
file_get_contents(http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?
transacao=CONSULTA&vara=2673) [function.file-get-contents]: failed to open stream: HTTP
request failed! HTTP/1.1 404 Not Found in /home/rsl/www/sc01_get.php on line 14
// bool(false)

Second attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 3</h1>';

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");;

$html=@curl_exec($curl);

if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
}
else{
   echo '<br>begin HTML[';
    echo  $html;
   echo '<br>]end html ';
}
echo '</body></html>';

// Output
// 1

third attempt

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.windowsphone.com");

    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 5</h1>';

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");;

$html=@curl($curl);


if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
}
else{
    echo '<br>begin HTML[';
    echo  $html;
    echo '<br>]end html ';
}
echo '</body></html>';

// Output
// cURL error number:0
// cURL error:

If the pages are really ajax based meaning the information that you need to scrape is loaded or shown through javascript execution, you will need another approach. You would need to automate with a real browser. You can go the Selenium route which can be written in a number of languages or use CasperJS with Javascript as the programming language.



Source: http://stackoverflow.com/questions/24611046/scraping-data-in-dynamic-sites

Friday, 22 August 2014

Scraping dynamic data


I am scraping profiles on ask.fm for a research question. The problem is that only the top most recent questions are viewable and I have to click "view more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of calling this 4 times before scraping it. I want the most recent 60 posts on the site. Python is preferable.

You could probably use selenium to browse to the website and click on the button/link a few times. You can get that here:

    https://pypi.python.org/pypi/selenium

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Thursday, 21 August 2014

Web Scraping data from different sites


I am looking for a few ideas on how can I solve a design problem I'm going to be faced with building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem, matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If i scrape this from two different sites, I will encounter the situation where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then when I scrape site A or B I can, over time, build up a list of failures and store them in a list for each scraper so that it can know (if i find id=100 then i know that the firstname will be William etc). I can't help but feel this is a rubbish idea!

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Wednesday, 20 August 2014

Scrape Data Point Using Python


I am looking to scrape a data point using Python off of the url http://www.cavirtex.com/orderbook .

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

<tr>
 <td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
 <td><b>0.0775/0.1146</b></td>
 <td><b>860.00000</b></td>
 <td><b>66.65 CAD</b></td>
</tr>

The relevant point being the 860.00 . I am looking to build this into a script which can send me an email to alert me of certain price differentials compared to other exchanges.

I'm quite noobie so if in your explanations you could offer your thought process on why you've done certain things it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far which will return me the name of the title correctly, I'm having trouble grabbing the table data though.

import urllib2, sys
from bs4 import BeautifulSoup

site= "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title



Here is the code for scraping the lowest bid from the 'Buying BTC' table:

from selenium import webdriver

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get('http://www.cavirtex.com/orderbook')

lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    text = element.get_attribute('innerHTML').strip('<b>|</b>')
    try:
        bid = float(text)
        if lowest_bid > bid:
            lowest_bid = bid
    except:
        pass

browser.quit()
print lowest_bid

In order to install Selenium for Python on your Windows-PC, run from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".

Note:

If you consider performance critical, then you can implement the data-scraping via URL-Connection instead of Selenium, and have your program running much faster. However, your code will probably end up being a lot "messier", due to the tedious XML parsing that you'll be obliged to apply...

Here is the code for sending the previous output in an email from yourself to yourself:

import smtplib,ssl

def SendMail(username,password,contents):
    server = Connect(username)
    try:
        server.login(username,password)
        server.sendmail(username,username,contents)
    except smtplib.SMTPException,error:
        Print(error)
    Disconnect(server)

def Connect(username):
    serverName = username[username.index("@")+1:username.index(".")]
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException,error:
            Print(error)
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException,ssl.SSLError),error:
            Print(error)
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException,error:
        Print(error)

serverDict = {
    "gmail"  :"smtp.gmail.com",
    "hotmail":"smtp.live.com",
    "yahoo"  :"smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com","your_password",str(lowest_bid))

The above code should work if your email provider is either gmail or hotmail or yahoo.

Please note that depending on your firewall configuration, it may ask your permission upon the first time you try it...



Source: http://stackoverflow.com/questions/21217034/scrape-data-point-using-python

Sunday, 17 August 2014

Data From Web Scraping Using Node.JS Request Is Different From Data Shown In The Browser

Right now, I am doing some simple web scraping, for example get the current train arrival/departure information for one railway station. Here is the example link, http://www.thetrainline.com/Live/arrivals/chester, from this link you can visit the current arrival trains in the chester station.

I am using the node.js request module to do some simple web scraping,

app.get('/railway/arrival', function (req, res, next) {
    console.log("/railway/arrival/  "+req.query["city"]);
    var city = req.query["city"];
    if(typeof city == undefined || city == undefined) { console.log("if it undefined"); city ="liverpool-james-street";}
    getRailwayArrival(city,
       function(err,data){
           res.send(data);
        }
       );
});

function getRailwayArrival(station,callback){
   request({
    uri: "http://www.thetrainline.com/Live/arrivals/"+station,
   }, function(error, response, body) {
      var $ = cheerio.load(body);

      var a = new Array();
      $(".results-contents li a").each(function() {
        var link = $(this);
        //var href = link.attr("href");
        var due = $(this).find('.due').text().replace(/(\r\n|\n|\r|\t)/gm,"");   
        var destination = $(this).find('.destination').text().replace(/(\r\n|\n|\r|\t)/gm,"");
        var on_time = $(this).find('.on-time-yes .on-time').text().replace(/(\r\n|\n|\r|\t)/gm,"");
        if(on_time == undefined)  var on_time_no = $(this).find('.on-time-no').text().replace(/(\r\n|\n|\r|\t)/gm,"");
        var platform = $(this).find('.platform').text().replace(/(\r\n|\n|\r|\t)/gm,"");

        var obj = new Object();
        obj.due = due;obj.destination = destination; obj.on_time = on_time; obj.platform = platform;
        a.push(obj);
console.log("arrival  ".green+due+"  "+destination+"  "+on_time+"  "+platform+"  "+on_time_no);      
    });
    console.log("get station data  "+a.length +"   "+ $(".updated-time").text());
    callback(null,a);

  });
}

The code works by giving me a list of data, however these data are different from the data seen in the browser, though the data come from the same url. I don't know why it is like that. is it because that their server can distinguish the requests sent from server and browser, that if the request is from server, so they sent me the wrong data. How can I overcome this problem ?

thanks in advance.

2 Answers

They must have stored session per click event. Means if u visit that page first time, it will store session and validate that session for next action you perform. Say, u select some value from drop down list. for that click again new value of session is generated that will load data for ur selected combobox value. then u click on show list then that previous session value is validated and you get accurate data.

Now see, if you not catch that session value programatically and not pass as parameter with that request, you will get default loaded data or not get any thing. So, its chalenging for you to chatch that data.Use firebug for help.

Another issue here could be that the generated content occurs through JavaScript run on your machine. jsdom is a module which will provide such content but is not as lightweight.

Cheerio does not execute these scripts and as a result content may not be visible (as you're experiencing). This is an article I read a while back and caused me to have the same discovery, just open the article and search for "jsdom is more powerful" for a quick answer:

Source:http://stackoverflow.com/questions/15785360/data-from-web-scraping-using-node-js-request-is-different-from-data-shown-in-the?rq=1