Building your first scraper.

Someone (u/bettingnetwork) on Reddit linked to this webpage, I have no idea who they are, but they're probably trying to scam you into buying picks. Don't fall for it.

This is a product of the Crowdsource Syndicate which was created to honor the legacy of TomG (not the conspiracy theory one).

6/24/20 - Added function to parse spreads. At the bottom. All files updated.

6/12/20 - Added function to parse futures. Check the bottom

5/28/20 - I made some changes and updated the files to reflect that. The main thing is adding a new table called "Matchups". This will make it easier to create a view of our data like our table tennis tracker or our eSimulation sports tracker. MMA is a little broken because BM doesn't put the event in the description field for the game node. It's in the banner, but the banner isn't embedded in anything meaningful. Best option IMO is to add them manually, but I will think of an alternative.

I gave myself a challenge - build a useable scraper for Bookmaker in 30 minutes to teach the Crowdsource Syndicate on how to make one. It turned out not too bad. I wrote this in PHP vs. Python like we've been using. PHP was more in use when XML was popular, so there's a lot of built in functions that make XML files easy to work with. XML files are more of an old school way to share data, partly because of the file size bloat. It's been replaced by API's which usually give you back json encoded data, but some places still use them. The good thing about them is their structure. Once you figure out how it's built then you can parse the file and throw the data into a database.

First we need a publically available XML file, so I chose bookmaker. Theirs is hosted at lines.bookmaker.eu

Lets look at an example Bookmaker file

You should see some nesting. There's an upper level of "Data" and "Leagues". Nested underneath "Leagues" is "league" and then nested inside "league" are a series of "games". Each league has a unique "IdLeague" in the attributes, so we can create a mapping of IdLeague to what is it, but outputting IdLeague, IdSport, and Description under the "league" attributes.

We can do that by the code below

<?php
$BMFeed = simplexml_load_file('Bookmaker-Nascar2.xml');
foreach($BMFeed->Leagues->league as $event) {
    $att = $event->attributes();
    echo $att['Description'] . " / " . $att['IdSport'] . " / " . $att['IdLeague'] . "<BR>";
}
?>
You should get a list similar to below:

GERMANY - BUNDESLIGA / SOC / 10004
GERMANY - 2ND BUNDESLIGA / SOC / 12228
AUSTRIA - BUNDESLIGA / SOC / 12067
AUSTRIA - CUP / SOC / 12299
BELARUS - CUP / SOC / 13011
BELARUS - PERVAYA LIGA / SOC / 13927
BELARUS - VYSSHAYA LEAGUE WOMEN / SOC / 13972
CROATIA - HNL LEAGUE 1 / SOC / 12344
CROATIA CUP / SOC / 12294
CZECH REPUBLIC - FNL / SOC / 13030
CZECH REPUBLIC - GAMBRINUS LIGA / SOC / 12285
DENMARK - 1ST DIVISION / SOC / 12287
DENMARK - DBU POKALEN / SOC / 12239
DENMARK - SAS LIGAEN / SOC / 12253
ESTONIA - MEISTRILIIGA / SOC / 12426
FAROE ISLAND - PREMIER LEAGUE / SOC / 13391
GERMANY - 3. LIGA / SOC / 12277
GERMANY - DFB POKAL / SOC / 12229
HUNGARY - NB 1 / SOC / 12925
NORWAY - TIPPELIGAEN / SOC / 10012
POLAND - EKSTRAKLASA / SOC / 12389
POLAND - LIGA 1 / SOC / 12590
POLAND - PUCHAR POLSKI / SOC / 12606
RUSSIA - PREMIER LEAGUE / SOC / 10094
SERBIA - SUPERLIGA / SOC / 12405
SERBIA - CUP / SOC / 13998
COSTA RICA PRIMERA DIVISION / SOC / 12908
COSTA RICA - SEGUNDA DIVISION / SOC / 13410
ARMENIA - PREMIER LEAGUE / SOC / 12960
ISRAEL - WINNER LEAGUE / SOC / 12201
SOUTH KOREA - K CHALLENGE / SOC / 12958
SOUTH KOREA - K-LEAGUE CLASSIC / SOC / 12471
TAIWAN - PREMIER LEAGUE / SOC / 13897
TURKMENISTAN - YOKARY LIGA / SOC / 13938
ESOCCER - BATTLE - CHAMPIONS LEAGUE / SOC / 13840
ESOCCER - BATTLE - EURO 2020 / SOC / 13865
ESOCCER - BATTLE - CIS CUP / SOC / 13881
ESOCCER - CYBER WORLD CUP / SOC / 13914
ESOCCER - CYBER CHAMPIONS LEAGUE / SOC / 13919
ESOCCER - CYBER EUROPA LEAGUE / SOC / 13920
ESOCCER - GERMANY CYBER STARS LEAGUE / SOC / 13921
ESOCCER - ITALY CYBER STARS LEAGUE / SOC / 13917
ESOCCER - LIGA PRO / SOC / 13829
ESOCCER - SPAIN CYBER STARS LEAGUE / SOC / 13918
E-SPORTS CS GO / MU / 13117
E-SPORTS DOTA 2 / MU / 13118
E -SPORTS LEAGUE OF LEGENDS / MU / 13119
E-SPORTS STARCRAFT 2 / MU / 13120
E-SPORTS OTHER / MU / 12868
TT - CHALLENGER SERIES / MU / 13969
TT - RUSSIA - LIGA PRO / MU / 13858
TT - SETKA CUP / MU / 13856
TT - SETKA CUP - WOMEN / MU / 13857
TT - TT CUP / MU / 13854
TT - TT CUP - WOMEN / MU / 13855
TT - UKRAINE - WIN CUP / MU / 13910
MADDEN NFL20 / NFL / 13826
MLB - THE SHOW 20 / MLB / 13844
UFC EA SPORTS 3 / MU / 13947
SPORTS SIMULATION FUTURES / TNT / 13894
MLB SPECIALS / MU / 13613
TAIWAN - CPBL / MLB / 13965
TAIWAN - CPBL MINOR / MLB / 13966
KOREA REP - KBO LEAGUE / MLB / 13967
MLB - PLAYER SEASON FUTURES / TNT / 12192
U.S.A. ELECTIONS - MATCHUPS / MU / 12066
POLITICAL EVENTS / TNT / 12822
POLITICAL PROPS / MU / 13698
TENNIS - RUSSIA - LIGA PRO / MU / 13931
TENNIS - EXO ATLANTA 2020 / MU / 13993
TENNIS - POINT EXHIBITION SERIES / MU / 13951
TENNIS - PICK WINNER / TNT / 12005
NASCAR HEAD TO HEAD / MU / 14
NASCAR PROPS / MU / 12615
NASCAR PICK WINNER / TNT / 12015
MMA - UFC / MU / 206
PDC HOME TOUR / MU / 13933
REMOTE DARTS LEAGUE / MU / 13939
CZECH PREMIER LEAGUE 2020 / MU / 13935
NFL / NFL / 1
NFL - FUTURES / TNT / 208
NFL - REGULAR SEASON WINS / NFL / 12020
NCAA (F) - FUTURES / TNT / 117
GOLF MATCHUPS / MU / 11
GOLF PICK WINNER / TNT / 12003
BOXING / MU / 12064
RL - NRL / MU / 13375
VOLLEYBALL - RUSSIA / MU / 12827
BADMINTON - INFINITY CUP / MU / 13950
ENTERTAINMENT EVENTS / MU / 12557
NFL - FUTURES / TNT / 208
NFL - REGULAR SEASON WINS / NFL / 12020
NCAA (F) - FUTURES / TNT / 117
MLB - PLAYER SEASON FUTURES / TNT / 12192
GOLF PICK WINNER / TNT / 12003
TENNIS - PICK WINNER / TNT / 12005

That's a lot of leagues! And it's just the leagues that currently have lines. Writing a separate parser for each league would be tedious, so our goal should be to make them as re-useable as possible. In my personal setup I only have four types and I pass hundreds of leagues to one of those four. Our focus has been Nascar so lets look at how we can create a database for Head to Head matchups.

What info is important?

  • Drivers
  • Race title and date
  • Moneyline for Driver 1 vs. Driver 2

I envision this as three separate tables. In table 1 we will have the drivers. You may be thinking - why have a separate table just for drivers? Why not just put them next to the lines? And this is where you need to think a little more long term. If you bet on Bookmaker a lot you'll notice they spell the same team names differently over and over and over. Sometimes there's typos, other times they use their full name, other times just their last name, etc. By creating a separate drivers table we could easily link these different spelling later on. My drivers table will have two columns; an id number and a name column (see below for the full structure of our tables). I'm going to call it a generic "Players" instead of "Drivers" because we are actually going to re-use this structure for many sports!

Our tournament table is being made for the same reason as drivers, but this will have three columns; an id column, the race name, and the date of the event being held.

Finally our matchup table will contain 8 columns:

  • An ID
  • idTournament which we will pull from our tournaments table
  • 2 columns for the id's of the drivers from the drivers table
  • 2 moneyline columns, one for each driver.
  • input date and time

I'll call this table NASCAR_FT_MATCHUPS. The reason I'm including the FT is maybe we want to reuse this for something that contains halfs, then we can use the same structure but pass it to NASCAR_1H_MATCHUPS. I'm going to put all of these into a database called BM_Nascar

The structure for these tables can be viewed here

Now, we need to think logically about what we want our scraper to do. Here's an idea:

  • Load the XML file and cycle through all the leagues
  • If the league matches nascar, send information to a function that parses it.
  • Grab the driver names from the XML file, get their IDs. If they don't have an ID create one.
  • Grab the tournament from the XML file, get an ID for the tournament. If it doesn't have one, create it.
  • Create a matchup ID for player vs. player based on the event.
  • Grab the moneyline prices
  • Check if the moneyline price is new (bookmaker returns the entire feed every time, not just what's changed). If it's new, insert in our database. If it's the same as last time we checked, ignore.
  • Repeat for every option for the league
  • Set up a process to re-run the scraper every X minutes

So I did that! The code is below. But we can also expand this to other sports. Off the top of my head sports that contain the same structure as H2H Nascar Matchups:

  • MMA
  • Boxing
  • Table Tennis
  • Regular Tennis
  • Golf
  • eSports

I went and added all of those in seconds once I had the general structure up. As you look through the file, see how I made it generic enough to handle any of these sports and how I had to make some adjustments to certain sports to make it work.

Here is the table for the Rounds I used

You can go through and create these tables for every sport by the same process:

  • Make a database called BM_Sport where Sport is the third variable you are passing to the matchups function
  • Name all the tables SPORT_FT_MATCHUPS, SPORT_FT_ROUNDS, Players, Tournaments. For the first two tables change the first column to idSPORT_FT_MATCHUPS
  • Add the league ID to the loop in the beginning

I prefer the different databases for each sport vs. having one database with hundreds of tables.

How can you run the scraper every X minutes? Set up a cron job

Bookmaker has been known to ban IP addresses for flooding the XML feed. I use every 3 minutes. If you're a bottom feeder / line chaser / angry at dress makers because they won't take your 10k bet on a division 3 college football game try every minute and get a new IP if you're banned?

The usual PSA - I'm not a professional computer programmer. I'm a statistician who programs. I'm sure this isn't the best way to do it, but it works and was made in only 30 minutes! I tested it on PHP7, unsure if it works in older version. I started with something similar 15 years ago and over hundreds of hours it's evolved based on what I want and use the data for. This project here can get you started and as you determine your needs you can work off the ideas in it (or build new ones!). If you have any suggestions to improve that would be great!

Some activities you might want to try out on your own

  • Bookmaker also offers Nascar To Win odds (id: 12015) and Props (id: 12615). How could you write a function for those? The To Win function could also be re-used for all "To Win" type bets (divisions, super bowl, conferences)
  • Expand the scraper to include spreads. Table Tennis has spreads and so does eSports.
  • Adapt for eSports. There's different sports (e.g., CS-Go, League of Legends), but within each sports there's multiple tournaments going on at once. How would you deal with that?
  • There's a lot of functions here. The check id, tournament, etc. could be combined into one - how would you do it?
  • It's hard to test for major sports (basketball, football, etc.) since they're not operating right now, but logically plan out how you could write a different function for those and then test it when real sports come back.
  • Can you create a display for this data that would be helpful for your sports betting?
  • Create a results table which combines the opening and closing lines with the results (manually input or scraped off somewhere else) for each match. Write a script to populate this table daily at 4am with only finished events.
  • You probably don't want 30 if statements at the end, can you restructure that part to be more efficient?

Tried one of the above and stuck? Hit us up in the CrowdSource Syndicate thread. I added some comments in places where it might not be clear why I did what I did.

Updated 6/24/20: Download the PHP script as a txt file here

Other tutorials:

Parse Nascar PDF files using Python

Cleaning ugly data

We can also parse futures. It we look at the structure of the XML file the futures have their own unique ID and are set up the same. Lets look at the golf one.

(etc)

It looks like there is one game, the future is listed in the attribute htm under game, and embedded underneath it is a series of line nodes which has the player or team with an odds. We obviously can't reuse our head-to-head parser since the structure is different, but we can re-use some of our functions. The logic behind this function: Get a tournament (the future, we made it), get an ID for the player/team (we made it), cycle through the lines and check if the moneyline is new (easy alternation to our previous function), if the ML is different then add it, otherwise move on to the next team/player. We also need a table. Lets call it SPORTSNAME_FUTURES with 6 columns. Say for golf, we can add the table GOLF_FUTURES to our BM_Golf database and our 6 columns will be, idGolfFutures (auto increment), idTournament, idPlayer, ML, inputDate, inputTime.

Below is the structure for the table and the code. Test is out with the Golf futures by adding this to the bottom of your script

If ($att['IdLeague'] == "12003") { parseFutures($event, $connection, 'Golf'); }

Structure for the table:

CREATE TABLE `GOLF_FUTURES` (
  `idGolfFutures` int(11) NOT NULL,
  `idTournament` int(11) NOT NULL,
  `idPlayer` int(11) NOT NULL,
  `ML` varchar(10) NOT NULL,
  `inputDate` varchar(10) NOT NULL,
  `inputTime` varchar(10) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

ALTER TABLE `GOLF_FUTURES`
  ADD PRIMARY KEY (`idGolfFutures`);

ALTER TABLE `GOLF_FUTURES`
  MODIFY `idGolfFutures` int(11) NOT NULL AUTO_INCREMENT;  

Two functions to add to your script:

function FHistory($team, $connection, $limit, $method, $idTournament, $sport) {
// if you want closing price do limit = 1, method = DESC when you call. opening price limit = 1 method = ASC
    $sqlF = "SELECT * from BM_" . $sport . "." . strtoupper($sport) . "_FUTURES where idPlayer = '" . $team . "' and idTournament ='" . $idTournament . "' ORDER BY id" . $sport . "Futures " . $method . " LIMIT " . $limit . ";";
    $resultF = $connection->query($sqlF);
    $fID = $resultF->fetch_assoc();
    $resultF->free_result();
    return $fID['ML'];
}

function parseFutures($event, $connection, $sport) {
    $curTime = date('H:i', strtotime('-3 hours'));
    foreach($event->game as $game) {
        $attr = $game->attributes();
        $tourney = $attr['vtm'] . " " . $attr['htm'];
        $eDate = date('Y-m-d', strtotime($attr['gmdt'] . " " . $attr['gmtm']));
        $TournamentID = H2HGetTournament($tourney, $connection, $eDate, $sport);
        foreach ($game->line as $line) {
            $line2 = $line->attributes();
            $team = trim(preg_replace('/\W/', ' ',$line2['tmname']));
            $HID = H2HGetPlayer($team, $connection, $sport);
            $ML = $line2['odds'];
            If ($ML != "") {
                $oldFML = FHistory($HID, $connection, '1', 'DESC', $TournamentID, $sport);
                If ($oldFML != $ML) {
                        $insertMLQuery = "INSERT INTO BM_" . $sport . "." . strtoupper($sport) . "_FUTURES (idTournament, idPlayer, ML, inputDate, inputTime) VALUES ('$TournamentID', '$HID', '$ML', CurDate(), '$curTime');";
                        $connection->query($insertMLQuery);
                }
            }
        }
    }
}

And now we can track all those crazy things like To Win Nascar, To Win Golf, Who will win NBA championships, who will win NFC East, etc. We did it.

Adding spreads

Lets create spreads for golf (and we can use the same idea for other sports)

CREATE TABLE `GOLF_FT_SPREAD` (
  `idGOLF_FT_SPREAD` int(11) NOT NULL,
  `idTournament` int(11) DEFAULT NULL,
  `idMatchup` int(11) NOT NULL,
  `idP1` int(11) NOT NULL,
  `idP2` int(11) NOT NULL,
  `spreadVisitor` varchar(5) DEFAULT NULL,
  `spreadVisitorPrice` varchar(6) DEFAULT NULL,
  `spreadHome` varchar(5) DEFAULT NULL,
  `spreadHomePrice` varchar(6) DEFAULT NULL,
  `inputDate` varchar(10) NOT NULL,
  `inputTime` varchar(10) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Four things we need to consider: the visitor and home spread and their related prices. The table above will consider all of those.

In the BM xml feed the spreads are located here:

        $spreadVisitor = $line['vsprdt'];
        $spreadVisitorPrice = $line['vsprdoddst'];
        $spreadHome = $line['hsprdt'];
        $spreadHomePrice = $line['hsprdoddst'];

We can create a function to get the old spreads

function H2HGetHistorySpread($TournamentID, $MatchupID, $Part1, $Part2, $connection, $eventLength, $sport) {
    $sql = "SELECT * from BM_" . $sport . "." . strtoupper($sport) . "_" . $eventLength . "_SPREAD where idTournament = '$TournamentID' and idMatchup = '$MatchupID' and idP1 = '$Part1' and idP2 = '$Part2' ORDER BY id" . strtoupper($sport) . "_" . $eventLength . "_SPREAD DESC LIMIT 1";
    $result = $connection->query($sql);
    $row = $result->fetch_assoc();
    $h['spreadVisitor'] = $row['spreadVisitor'];
    $h['spreadVisitorPrice'] = $row['spreadVisitorPrice'];
    $h['spreadHome'] = $row['spreadHome'];
    $h['spreadHomePrice'] = $row['spreadHomePrice'];
    $result->free_result();
    return $h;
}

The one difference with spread is there's two scenarios where we would want to add a new spread to our database. (a) The spread changes (say from -4 -110 to -5 -110) or (b) the price changes (from -4 -110 to -115). So we need to check if one of them changed and if they did then insert the new spread.

We can do this by adding:

        If ($spreadVisitor != "" && $spreadVisitorPrice != "") {
            $oldSpread = H2HGetHistorySpread($TournamentID, $MatchupID, $HID, $VID, $connection, $eventLength, $sport);
            // some times the spread might change and price stays the same and vice versa so lets check that one is different
            If (($oldSpread['spreadVisitorPrice'] != $spreadVisitorPrice) || ($oldSpread['spreadVisitor'] != $spreadVisitor)) {
            $insertRLQuery = "INSERT INTO BM_" . $sport . "." . strtoupper($sport) . "_" . $eventLength . "_SPREAD (idTournament, idMatchup, idP1, idP2, spreadVisitor, spreadVisitorPrice, spreadHome, spreadHomePrice, inputDate, inputTime) VALUES ('$TournamentID', '$MatchupID', '$HID', '$VID', '$spreadVisitor', '$spreadVisitorPrice', '$spreadHome', '$spreadHomePrice', CurDate(), '$curTime');";
            $connection->query($insertRLQuery); 
            }
        }

And we're done! One thing we can do now is add all sports. Note: we didn't add any error handling to our script, so if the sport has a spread and you don't have a table it will crash. We can add error handling later, but for now you can either add an if statement for only spreads you want or make sure there's a spread table for every sport.

In the file I added Rugby League and Rugby Union. All I did was copy all our tables into a new BM_RugbyLeague.

Added the ID and now we are tracking that. You can add any sport, NFL, NBA, etc. Next we can talk about how to separate halfs and quarters from the full game.

Here is the most recent script