The art of scraping

Diego Kuperman

diegok | @freekey

Why am I talking about this topic?

I scrape for fun!

...and for profit!

(for 15+ years)


I love the web!

(most of it)

Doing it...

And scraping it!

(a kind of fun reverse-engineering game)

So, what is this scraping thing!?

What do I need to know to have that fun!?

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox.

Web scraping is closely related to web indexing, which indexes information on the web using a bot or web crawler and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashup and web data integration.

  • Skill acquired by experience, study, or observation
  • An occupation requiring knowledge or skill

What do we need to know to have fun with this scraping thing?

  • HTTP
  • Simulate human exploration
  • Extract information from websites
  • Using a bot or web crawler
  • Transformation of unstructured data
  • ...

What do we need to know to have fun with this scraping thing?

  • HTTP protocol
  • How browsers work
  • How to write a web crawler
  • How to extract data from HTML (or from any response type)
  • How web app internals work


  • HTTP protocol
  • Browser basic functionality
  • Tools to inspect browsing
  • UserAgent libs
  • DOM parsing libs
  • Use it all together
  • Tricks to bypass some restrictions


0.9 (1991, informal, test drive the www)

1.0 (1996, first RFC about ^)

1.1 (RFC'd in 1997-1999, What we use today)

Do you speak HTTP?

$ telnet 80


GET /ye2015/ HTTP/1.1


HTTP/1.1 200 OK
Server: Apache/1.3.42 (Unix) mod_perl/1.31
Set-Cookie: act=language&en; path=/; expires=Thu, 25-Feb-2016 16:28:37 GMT
Content-Type: text/html; charset=UTF-8
X-Cache: MISS from
Transfer-Encoding: chunked
Date: Sat, 29 Aug 2015 16:28:38 GMT
X-Varnish: 401396155
Age: 0
Via: 1.1 varnish
Connection: keep-alive

<!DOCTYPE html>

Request headers
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)... (snip)
Accept:text/html,application/xhtml+xml,appli... (snip)
Cookie: soysuper=eyJ3aCI... (snip)

Response headers

Date:Fri, 07 Nov 2014 10:25:30 GMT
Set-Cookie:soysuper=eyJ6aX... (snip)
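
What the telnet exchange above does by hand can be sketched in a few lines of core Perl. This is a minimal illustration, not a replacement for a real user agent: `build_request` and `parse_response` are hypothetical helper names, the host `www.example.org` is an assumption (the real host is not shown here), and chunked bodies are not handled.

```perl
use strict;
use warnings;

# Build a raw HTTP/1.1 request. Headers are passed as an ordered list of
# pairs, so the caller controls their order on the wire.
sub build_request {
    my ( $method, $path, $host, @headers ) = @_;
    my $req = "$method $path HTTP/1.1\r\nHost: $host\r\n";
    while ( my ( $name, $value ) = splice @headers, 0, 2 ) {
        $req .= "$name: $value\r\n";
    }
    return $req . "\r\n";    # an empty line ends the header section
}

# Split a raw response into status code, reason, headers and body.
sub parse_response {
    my ($raw) = @_;
    my ( $head, $body ) = split /\r\n\r\n/, $raw, 2;
    my ( $status_line, @lines ) = split /\r\n/, $head;
    my ( $version, $code, $reason ) = split ' ', $status_line, 3;

    my %headers;
    for my $line (@lines) {
        my ( $name, $value ) = $line =~ /^([^:]+):\s*(.*)$/ or next;
        push @{ $headers{ lc $name } }, $value;    # names are case-insensitive
    }
    return {
        code    => $code,
        reason  => $reason,
        headers => \%headers,
        body    => $body // '',
    };
}

# www.example.org is a placeholder host (assumption).
my $request = build_request( 'GET', '/ye2015/', 'www.example.org',
    'User-Agent' => 'My Crawler 1.0', 'Connection' => 'close' );
print $request;

my $response = parse_response(
    "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\n"
  . "Set-Cookie: act=language&en; path=/\r\n\r\n<!DOCTYPE html>" );
print "$response->{code} $response->{headers}{'content-type'}[0]\n";
```

Header values are kept as lists because a response may repeat a header such as Set-Cookie.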

The browser speaks HTTP for you

Browser 101

  • DNS resolution
  • Request building and sending
  • Response parsing and rendering
  • Cookie jar
  • Cache
  • ...
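
The cookie jar item in the list above is the one crawlers most often need to reproduce. A rough sketch of the idea in core Perl, assuming a single site and ignoring domain, path and expiry rules; `store_cookies` and `cookie_header` are hypothetical names, and the `session=abc123` cookie is made up for illustration:

```perl
use strict;
use warnings;

my %jar;    # name => value; one site only, no domain/path/expiry handling

# Remember the name=value pair of each Set-Cookie header value.
sub store_cookies {
    for my $set_cookie (@_) {
        my ($pair) = split /;\s*/, $set_cookie;    # attributes follow the first ";"
        my ( $name, $value ) = split /=/, $pair, 2;
        $jar{$name} = $value // '';
    }
}

# Build the value of the Cookie request header to send back.
sub cookie_header {
    return join '; ', map {"$_=$jar{$_}"} sort keys %jar;
}

store_cookies(
    'act=language&en; path=/; expires=Thu, 25-Feb-2016 16:28:37 GMT',
    'session=abc123; path=/; HttpOnly',    # hypothetical cookie
);
print 'Cookie: ', cookie_header(), "\n";
```

Real user agents such as Mojo::UserAgent or LWP::UserAgent do this bookkeeping for you, including the rules skipped here.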

Browser as a dev tool

  • DOM inspector
  • Network activity
  • Request cycle inspection
  • Request manipulation

Network activity

Request cycle inspector

Request manipulation

Firefox Tamper Data Addon

User agents for Perl

  • Control a full browser
  • Native Perl UAs

Remote full browser

  • WWW::Selenium
  • WWW::Mechanize::PhantomJS
  • WWW::Mechanize::Firefox (mozrepl)

Selenium web driver

use WWW::Selenium;

my $sel = WWW::Selenium->new(
  host        => "localhost",
  port        => 4444,
  browser     => "*iexplore",
  browser_url => "",
);

$sel->start;
$sel->type("q", "hello world");
print $sel->get_title;
$sel->stop;

Screen capture

use WWW::Mechanize::PhantomJS;

my $mech = WWW::Mechanize::PhantomJS->new();

$mech->eval_in_page('alert("Hello PhantomJS")');
my $png= $mech->content_as_png();


Native UAs

  • LWP::UserAgent (libwww)
  • WWW::Mechanize
  • AnyEvent::HTTP
  • Web::Query
  • Mojo::UserAgent
  • ...

Perl core

use v5.10;
require LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get('');

$response->is_success
  ? say $response->decoded_content  # or whatever
  : die $response->status_line;

Mech browser

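use v5.10;
use WWW::Mechanize::Cached;

my $mech = WWW::Mechanize::Cached->new();

# Load the search page, then fill in and submit its first form
$mech->get('');
$mech->submit_form(
    form_number => 1,
    fields      => { q => 'diegok', },
);

$mech->follow_link( text_regex => qr/WWW::EZTV/ );
say $mech->content;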


Part of Mojolicious

Async support by default

Websockets support

Mojo::UserAgent simple example

use Mojo::UserAgent;
use v5.10;

my $ua = Mojo::UserAgent->new;

say $ua->get('')->res->body;

(More on this later)

DOM parsing and data extraction

  • HTML::Parser
  • HTML::TreeBuilder
  • HTML::TreeBuilder::XPath
  • HTML::Selector::XPath::Simple
  • JSON
  • ...
  • Mojo::DOM & Mojo::CSS

Mojo::UserAgent data extraction (still simple)

use Mojo::UserAgent;
use v5.10;

my $ua = Mojo::UserAgent->new;

# Scrape the latest headlines from a news site with CSS selectors
say $ua->get('')
       ->res->dom->find('h2 > a')
       ->map('text')->join("\n");

JSON response

use v5.10;
use Mojo::UserAgent;
use Mojo::URL;

my $ua = Mojo::UserAgent->new;
my $api_url = Mojo::URL->new('');

my $res = $ua->get( $api_url->clone->query(q => 'author:DIEGOK') )->res;

say $res->json->{hits}{hits}[1]{_source}{archive};

# Mojo::JSON::Pointer (rfc6901)
say $res->json('/hits/hits/1/_source/archive');
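
The pointer form above follows RFC 6901. To show what such a lookup does, here is a small sketch using only core JSON::PP; the `pointer` function is a hypothetical stand-in, not Mojo::JSON::Pointer itself, and the sample document is made-up data shaped like the response above:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Resolve an RFC 6901 JSON Pointer against a decoded structure.
# Returns undef as soon as a step cannot be followed.
sub pointer {
    my ( $data, $pointer ) = @_;
    my @tokens = split m{/}, $pointer, -1;
    shift @tokens;    # drop the empty token before the leading "/"

    for my $token (@tokens) {
        $token =~ s/~1/\//g;    # unescape "/" first, then "~" (RFC 6901 order)
        $token =~ s/~0/~/g;
        if ( ref $data eq 'HASH' ) {
            $data = $data->{$token};
        }
        elsif ( ref $data eq 'ARRAY' && $token =~ /^\d+$/ ) {
            $data = $data->[$token];
        }
        else {
            return undef;
        }
    }
    return $data;
}

# Made-up sample data (assumption), not a real API response.
my $data = decode_json(
    '{"hits":{"hits":[{"_source":{"archive":"First-0.01.tar.gz"}},'
  . '{"_source":{"archive":"Second-0.02.tar.gz"}}]}}' );

print pointer( $data, '/hits/hits/1/_source/archive' ), "\n";
```

Unlike the chained hash and array dereference, a pointer lookup on a missing path simply yields undef instead of autovivifying intermediate structures.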


use v5.10;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

# Post a URL-encoded form
my $tx = $ua->post( '', form => {
    user => 'diegok',
    pass => 's3cr3t'
});

# Post a JSON body
$tx = $ua->post( '', json => {
    query => 'something',
    page  => 3
});

Putting it all together


Scraping can become addictive

Login (Post form)

use v5.10;
use Mojo::UserAgent;
use Mojo::URL;
use Data::Dump qw(pp);

my $ua = Mojo::UserAgent->new;

my $base = Mojo::URL->new('');
my $home_url  = $base->clone->path('/ye2015/main');
my $login_url = $base->clone->path('/ye2015/LOGIN');

my $dom = $ua->post( $login_url, form => {
    credential_0 => 'diegok',
    credential_1 => 'mypass',
    destination  => '/ye2015/main' # hidden
})->res->dom;

die "Login error!"
    unless $dom->at('.navbar-right a > .hidden-md')->all_text =~ /diegok/;

# Now we're logged in!, this is why:
say pp($ua->cookie_jar);

Building a crawler

package MyCrawler;
use Mojo::Base -base;
use Mojo::UserAgent;
use Mojo::URL;
use Time::HiRes qw( gettimeofday tv_interval );

has ua => sub {
    my $self = shift;

    my $ua = Mojo::UserAgent->new;
    $ua->transactor->name('My Crawler 1.0');

    $ua->on( start => sub {
        my ( $ua, $tx ) = @_;
        my @time = gettimeofday();

        # Log method, status, URL and elapsed seconds once the request ends
        $tx->on( finish => sub {
            my $tx = shift;
            say STDERR sprintf("%s [%s] %s (%.2f)",
                $tx->req->method, $tx->res->code || 'ERR',
                $tx->req->url, tv_interval(\@time)
            );
        });
    });

    return $ua;
};
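
The user agent above only covers fetching. The other half of a crawler is the frontier: a queue of URLs still to visit plus a set of URLs already seen. A minimal sketch of that loop in core Perl, where `crawl` is a hypothetical function and the `$fetch` callback and the in-memory `%site` stand in for real HTTP requests and link extraction:

```perl
use strict;
use warnings;

# Breadth-first crawl. $fetch is a callback that takes a URL and returns
# the list of URLs linked from that page. Returns URLs in visiting order.
sub crawl {
    my ( $start, $fetch, $max_pages ) = @_;
    my @queue = ($start);
    my %seen  = ( $start => 1 );
    my @visited;

    while ( @queue && @visited < $max_pages ) {
        my $url = shift @queue;
        push @visited, $url;
        for my $link ( $fetch->($url) ) {
            next if $seen{$link}++;    # never queue the same URL twice
            push @queue, $link;
        }
    }
    return @visited;
}

# Stand-in for real requests: a tiny in-memory "site" (made-up paths).
my %site = (
    '/'      => [ '/talks', '/stats' ],
    '/talks' => [ '/', '/talk/1' ],
    '/stats' => ['/talks'],
);

my @order = crawl( '/', sub { @{ $site{ $_[0] } || [] } }, 10 );
print join( ' ', @order ), "\n";
```

In a real crawler the callback would fetch the page with the user agent and return the absolute URLs found in its links; the page limit keeps a bug from turning into an endless crawl.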

Fill-in forms helper

sub submit_form {
    my ( $self, $tx, $selector, $values ) = @_;

    if ( my $form = $tx->res->dom->at($selector) ) {
        # Resolve the form action against the URL of the page it came from
        my $url = Mojo::URL->new($form->attr('action'))->to_abs($tx->req->url);

        my $method = lc( $form->attr('method') || 'get' );
        my $header = { Referer => $tx->req->url.'' };

        if ( $method eq 'post' ) {
            return $self->ua->post( $url, $header,
                form => $self->build_params_for_submit( $form, $values )
            );
        }
        else {
            return $self->ua->get( $url, $header,
                form => $self->build_params_for_submit( $form, $values )
            );
        }
    }
}

Fill-in forms helper

sub build_params_for_submit {
    my ($self, $form, $fill) = ( shift, shift, shift||{} );

    # Collect the default values already present in the form
    my $values = {};
    $form->find('input')->each(sub {
      if ( my $name = $_[0]->attr('name') ) {
        my $type = $_[0]->attr('type') || 'text';
        return if $type =~ /radio|checkbox/
               && !exists $_[0]->attr->{checked};

        if ( exists $values->{ $name } ) {
          if ( ref $values->{ $name } ) {
            push @{$values->{ $name }}, $_[0]->attr('value');
          }
          else {
            $values->{$name} = [ $values->{$name}, $_[0]->attr('value') ];
          }
        }
        else {
          $values->{ $name } = $_[0]->attr('value');
        }
      }
    });

    # Values supplied by the caller override the defaults
    $values->{$_} = $fill->{$_} for keys %$fill;
    return $values;
}

Scrape this YAPC!

package ACT::YE2015;
use Mojo::Base 'MyCrawler';
use Mojo::URL;

has user     => sub { die 'Need username' };
has password => sub { die 'Need password' };
has base_url => sub { Mojo::URL->new('') };

sub _url { shift->base_url->clone->path('/ye2015/'. shift) }

sub login {
  my $self = shift;

  my $tx = $self->get( $self->_url('main') );
  die "Can't get login page" unless $tx->success;

  my $dom = $self->post_form( 'form' => {
    credential_0 => $self->user,
    credential_1 => $self->password,
  });

  die "Login error!"
      unless $dom->at('.navbar-right a > .hidden-md')->all_text
          eq $self->user;

  return $self;    # allow chaining: $crawler->login->committed_users
}

Scrape this YAPC!

sub committed_users {
  my $self = shift;

  my $tx = $self->get( $self->_url('stats') );
  die "Can't get stats page" unless $tx->success;

  my $stats = $tx->res->dom->find('.main-content p')->[0]->text;

  if ( $stats =~ /(\d+) committed users/ ) {
    return $1;
  }

  die "Can't find stats info.";
}

Scrape this YAPC!

use ACT::YE2015;
use v5.10;

my $crawler = ACT::YE2015->new(
    user     => 'diegok',
    password => 's3cre3t'
);

say $crawler->login->committed_users; # 262

Troubleshooting guide

How to bypass some restrictions

Golden rules

  • If your browser can do it...
  • Look at every header/cookie (I really mean every one)
  • Use Tamper Data to simulate your crawler
  • Dump headers on your crawler and compare
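
The last rule above, dumping and comparing headers, is easy to automate. A small sketch in core Perl, assuming you have copied the browser's request headers from the network inspector into a hash; `diff_headers` is a hypothetical helper and the header values are made up:

```perl
use strict;
use warnings;

# Compare two header sets (hashrefs of name => value), ignoring the case
# of header names, and return one line per difference.
sub diff_headers {
    my ( $browser, $crawler ) = @_;
    my %b = map { ( lc($_), $browser->{$_} ) } keys %$browser;
    my %c = map { ( lc($_), $crawler->{$_} ) } keys %$crawler;

    my @diff;
    my %names = ( %b, %c );
    for my $name ( sort keys %names ) {
        if    ( !exists $c{$name} )      { push @diff, "missing in crawler: $name" }
        elsif ( !exists $b{$name} )      { push @diff, "only in crawler: $name" }
        elsif ( $b{$name} ne $c{$name} ) { push @diff, "differs: $name" }
    }
    return @diff;
}

# Made-up header sets for illustration.
my @diff = diff_headers(
    { 'User-Agent' => 'Mozilla/5.0', 'Accept' => 'text/html', 'Cookie' => 'a=1' },
    { 'user-agent' => 'My Crawler 1.0', 'Accept' => 'text/html' },
);
print "$_\n" for @diff;
```

This compares names and values only; header order and spacing, which some sites also check, would need a comparison of the raw request text.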

Browser sniffing

  • Change user-agent
  • ^ Simulate Googlebot
  • Ensure header order and spacing
  • Use real browser

IP banning

  • TOR
  • Open or paid proxies
  • Get IP block + iptables
  • Amazon EC2

JavaScript puzzle

  • Parse code and simulate
  • Use real browser
  • Use real browser and copy cookie-jar

Use real browser and copy cookie-jar

var page = require('webpage').create();

page.settings.resourceTimeout = 10000; // 10 secs
page.settings.userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36';

page.open('', function(status) {
    if (status !== "success") {
        console.log("Request failed with status: " + status);
        phantom.exit(1);
    }
    else {
        get_cookies(page);
    }
});

// Poll until the page's JavaScript has set its cookies, then dump them as JSON
function get_cookies(page) {
  if ( page.cookies.length ) {
    console.log( JSON.stringify( page.cookies ) );
    phantom.exit();
  }
  else {
    // setTimeout, not setInterval: each call schedules exactly one retry
    setTimeout(function() { get_cookies(page) }, 5);
  }
}

Use real browser and copy cookie-jar

use v5.10;
use Mojo::UserAgent;
use Mojo::Cookie::Response;
use Mojo::JSON  qw(decode_json);
use File::Which qw(which);

my $phantomjs_bin = which('phantomjs') || die "Can't find phantomjs binary";
my $ua            = Mojo::UserAgent->new;

# Run the PhantomJS script and load the cookies it prints into our jar
my $cookies = decode_json(`$phantomjs_bin cookies.js`);
for my $cookie ( @$cookies ) {
    $ua->cookie_jar->add(
        Mojo::Cookie::Response->new(
            expires => $cookie->{expiry},
            map { $_ => $cookie->{$_} }
             qw ( name value domain path secure httponly ),
        )
    );
}

say $_->name for @{$ua->cookie_jar->all};


  • P0rn
  • MTurk
  • Look for bugs


Thank you!

Any questions?