Lets Lose All Our Money and Time Analyzing NASCAR

Raw H2H, To Win, Top 3, Top 5, etc. lines from sportsbooks

This data is a mess and needs to be cleaned. It contains every bet offered by a major sportsbook from 2012-Present for Nascar.

Nascar Players

The structure of this CSV is two columns. One is an ID (idPlayer) and the other is a name. The name is the proposition that you can bet on. So why are there 8795 rows when there aren't 8795 NASCAR racers? The way this book sets up their bets is you can bet Yes or No on bets such as "Kyle Busch wins Food City 500", but then "Kyle Busch wins Auto Club 400" is a different bet. They are also inconsistent on how they list the racers for their H2H matchups. Some times they use "Kyle Busch", "Kyle Bush" or "Kyle Busch #18". Also note that occasionally Kyle Busch will race in Trucks or XFinity with a different car number. One issue is how to combine all these so we can have a page that is just all bets related to Kyle Busch.

Nascar Tournaments

The structure of this CSV is three columns. One is an ID (idTournament), the Tournament, and the start date. As above, each race can have multiple "tournaments". For example, the Goodys Fast Relief has all these

  • "Race Matchups - Goodys Fast Relief 500 - Martinsville Speedway - Martinsville, VA - Sunday, March 1st, 1:00 PM - Fox"
  • "Race Winner - Goodys Fast Relief 500 - Martinsville Speedway - Martinsville, VA - Fox"
  • "Top 3 Finish - Goodys Fast Relief 500 - Martinsville Speedway - Martinsville, VA - Fox"
  • "Top 5 Finish - Goodys Fast Relief 500 - Martinsville Speedway - Martinsville, VA - Fox"
  • "Race Matchups - Goodys Fast Relief 500 - Martinsville Speedway - Martinsville, VA - Fox" (Notice, this one is the same as the first, but is just missing the Sunday, March 1st, etc. part. When they re-brought up the lines they changed the header).

Ideas on how to fix this: Leverage the start date info and the similar structure to the beginning of the entry. We can parse through and look for "Race Matchups" (H2H), "Race Winner", "Top 3 Finish", etc. and create a new column to distinguish between these.

Nascar Lines Post 2018

This is the key CSV. The idTournament column corresponds to the idTournament above. So we can separate out which race the bet belongs to. idP1 and idP2 correspond to the bets in the Nascar Players CSV. P1ML and P2ML are the corresponding moneylines. We also have the input date and time. This will become important as lines get posted multiple times throughout the week (beginning, after qualifying, after practices) which use different information.

Some of the rows contain idTournament of "0" which is obviously a data input error. We can ignore them, but it is 3,027 entries. A better method to deal with these is to use the inputDate and the idP1 and idP2 to determine which race it belongs to. I don't think we will fix all 3,027 entries, but we can get most of them.

Nascar Lines Pre 2018

Same as above, but some time in 2018 my server died and I archived every line movement. There may be some overlap in idNASCAR_FT_MATCHUPS ID between the two files. We will need to append a 0 at the beginning of this file to prevent this (as none in the other file start with a 0). Or we can add a suffix such as 0-2018.

To-Do List

  • Gathering the Data

    • Cleaning the existing lines data

    • Determining what data we need

    • Creating a new database

    • Scraping NASCAR results data

  • Visualizing the data

    • Create an API

    • Determine which charts are important

    • Creating a front end visualization for all the data (e.g., matchups, results, track data)

  • Developing the model

  • Testing the model

  • Pulling new data automatically and building lines.

  • Shout out TomG

Cleaning up the tournaments file

The first thing we want to do is import our data and create a new column called "BetType". In this new column we can distinguish if it is a Head to Head (H2H) matchup, Race Winner, Top X, etc. type of bet. My process for creating the entries of the new column was sort of trial and error. There were common terms like Race Matchups, Top 3, Top 5 and I coded those first. Then I would check which entries contained "Unknown" in the BetType column. I noticed then that sometimes "Race Matchups" was actually posted as "Race Matchups" (two spaces) or "Race Matchup". I could just do a string_replace to fix them all to be the same, but going forward I'm assuming the sportsbook might be inconsistent, therefore I just added the "or" operator for each case. I cycled through this process several times until I had most of the 1,790 rows coded.

In [1]:
import pandas as pd
NT = pd.read_csv('Nascar-Tournaments.csv', dtype=object)
NT['BetType'] = ''
NT.head()
Out[1]:
idTournament Tournament startDate BetType
0 1 Race Winner 2012-03-18
1 2 Top 3 Finish 2012-03-18
2 3 Top 5 Finish 2012-03-18
3 4 Race Matchups 2012-03-18
4 5 Finishg Position 2012-03-18
In [2]:
#Replacing the values in BetType depending on what is contained in the tournament column, if there is no match add "Unknown"
NT['BetType'] = pd.np.where(NT.Tournament.str.contains("Top 5"), "Top5",
                               pd.np.where(NT.Tournament.str.contains("Top 3|Top3"), "Top3",
                               pd.np.where(NT.Tournament.str.contains("Race Winner|Race  Winner"), "Winner",            
                               pd.np.where(NT.Tournament.str.contains("Top 10"), "Top10",
                               pd.np.where(NT.Tournament.str.contains("Top 8"), "Top8",
                               pd.np.where(NT.Tournament.str.contains("Finishing Position|Finishing position-"), "FP",
                               pd.np.where(NT.Tournament.str.contains("Qualifying Matchups|Qualifying Matchup"), "QH2H",
                               pd.np.where(NT.Tournament.str.contains("Race Matchup|Race  Matchups|Race Matchups"), "H2H",
                               pd.np.where(NT.Tournament.str.contains("Three on One Matchups"), "TON1",
                               pd.np.where(NT.Tournament.str.contains("Sprint Cup Championship|Sprint Cup Champion|Sprint Cup Series Winner|Cup Series Championship - Winner|NASCAR Championship 2018 Winner|Monster Energy Cup Championship - Winner|NASCAR Championship 2018 Winner"), "CUPWIN",
                               pd.np.where(NT.Tournament.str.contains("Race Props|Propositions|Total Cautions|Race Proposition"), "Props", "Unknown")))))))))))
In [3]:
NT[NT['BetType'] == 'Unknown']
Out[3]:
idTournament Tournament startDate BetType
4 5 Finishg Position 2012-03-18 Unknown
244 245 Qua;ifying Matchups - Bank of America 500 - Ch... 2012-10-11 Unknown
277 278 Finishing Positon - AAA Texas 500 - Texas Moto... 2012-11-04 Unknown
467 468 e Motor Speedway - Louden, NH - Friday, 12th a... 2013-07-12 Unknown
479 480 Chase for the Sprint Cup - make the field 2013-07-28 Unknown
494 495 Finishing Postiotion - Crown Royal 400 - India... 2013-07-28 Unknown
575 576 Qualifyung Matchups - Bank of America - Charlo... 2013-10-10 Unknown
576 577 Qualifyung Matchups - Bank of America 500 - Ch... 2013-10-10 Unknown
634 635 2014 Sprint Cup Series - total season wins mat... 2014-02-23 Unknown
687 688 Qualofying Matchups - Duck Commander 500 - Tex... 2014-04-04 Unknown
715 716 Drivers to make 2014 Chase 2014-05-10 Unknown
716 717 2014 Sprint Cup - Makes the Chase 2014-05-10 Unknown
807 808 Finishing positionj- GoBowling.com 400 - Pocon... 2014-08-03 Unknown
890 891 Qualifiyng Matchups - FORD Ecoboost 400 - Home... 2014-11-14 Unknown
895 896 2015 Sprint Cup Series total season wins matchups 2015-02-22 Unknown
896 897 2015 Sprint Cup Series total season points mat... 2015-02-22 Unknown
1032 1033 Race Finish - Toyota - Save Mart 350 - Sonoma ... 2015-06-28 Unknown
1136 1137 Qualifiyng Matchups - MyAFibRisk.com 400 - Chi... 2015-09-18 Unknown
1137 1138 Qualifiyng Matchups - MyAFibRisk.com 400 - Chi... 2015-09-18 Unknown
1209 1210 Qualifiying Matchups - FORD EcoBoost 400 - Hom... 2015-11-22 Unknown
1211 1212 Qualifiying Matchups - FORD EcoBoost 400 - Hom... 2015-11-20 Unknown
1213 1214 Qualifiying Matchups - Ford EcoBoost 400 - Hom... 2015-11-20 Unknown
1224 1225 Total Season Points 2016-02-13 Unknown
1226 1227 Total Season Wins 2016-02-13 Unknown
1244 1245 Race Position - Daytona 500 - Daytona Internat... 2016-02-21 Unknown
1431 1432 NASCAR Cup Series - Points Race Wins 2017-11-19 Unknown
1432 1433 NASCAR Cup Series Championship - Total Season ... 2017-11-19 Unknown
1436 1437 NASCAR Cup Series Championship - Points Race Wins 2017-11-19 Unknown
1482 1483 Race Macthups - Bank of America 500 - Charlott... 2017-10-08 Unknown
1503 1504 Daytona 500 - Matchups 2018-02-11 Unknown
1504 1505 NASCAR Championship 2018 Winner 2018-02-18 Unknown
1518 1519 2018 Nascar Cup Series Championship - Total Se... 2018-02-18 Unknown
1519 1520 2018 Nascar Cup Series Championship - Total Se... 2018-02-18 Unknown
1592 1593 Save Mart 350 - Sonoma Raceway - Sonoma, CA - FS1 2018-06-24 Unknown
1604 1605 Quaker State 400 - Kentucky Speedway - Sparta,... 2018-07-14 Unknown
1611 1612 Gander Outdoors 400 - Pocono Raceway - Long Po... 2018-07-29 Unknown
1617 1618 Go Bowling at The Glen - Watkins Glen Internat... 2018-08-05 Unknown
1623 1624 Consumers Energy 400 - Michigan International ... 2018-08-12 Unknown
1635 1636 Rcae Winner - Bojangles' Southern 500 - Darlin... 2018-09-02 Unknown
1678 1679 2019 Monster Energy Cup Championship - Total S... 2019-11-17 Unknown
1679 1680 2019 Monster Energy Cup Championship - Total S... 2019-11-17 Unknown
1699 1700 O'Reilly Auto Parts 500 - Texas Motor Speedway... 2019-03-31 Unknown
1712 1713 Gander RV 400 - Dover International Speedway -... 2019-05-05 Unknown
1718 1719 Monster Energy NASCAR All-Star Race - Charlott... 2019-05-17 Unknown
1760 1761 First Data 500 - Martinsville Speedway - Marti... 2019-10-26 Unknown
1770 1771 Daytona 500 - Daytona International Speedway -... 2020-02-16 Unknown
1771 1772 2020 Monster Energy Cup Cahmpionship - Winner 2020-11-08 Unknown
1778 1779 Pennzoil 400 - Las Vegas Motor Speedway - Las ... 2020-02-23 Unknown
1779 1780 FanShield 500 - Phoenix Raceway - Avondale, AZ... 2020-03-08 Unknown
1785 1786 Folds of Honor QuikTrip 500 - Atlanta Motor Sp... 2020-03-15 Unknown

This process left about 40 entries that weren't coded. Some are props I don't care too much about, others are typos (Who knew Qualifying was so hard to spell), and then the last couple look like race matchups where they forgot to post "Race Matchups" before the header. Lets tackle entries 468, 1593, 1605, 1612, 1618, 1624, 1761, 1771, 1779, 1780, 1786. I'm pretty sure they're race matchups, but what I'm going to do is load the lines from NASCAR_FT_Matchups for those tournaments and then see if they look like H2H lines or something else just in case.

In [4]:
NP = pd.read_csv('Nascar-Players.csv')
NL = pd.read_csv('NASCAR_FT_MATCHUPS.csv')
In [5]:
#in NP, idPlayer is the ID # and name is the name, in NL we want to look for idP1 and idP2 for a specific idTournament
#What I want to do is replace the idP1 and idP2 in NL with the name from NP
#Create a subset of the data for the specific tournament
NLT = NL[NL['idTournament'] == 1786]
NLT.reset_index(drop=True, inplace=True)
NLT
Out[5]:
idNASCAR_FT_MATCHUPS idTournament idP1 idP2 P1ML P2ML inputDate inputTime
0 27315 1786 112.0 98.0 -115 -115 2020-03-10 15:38
1 27316 1786 7513.0 6119.0 -115 -115 2020-03-10 15:39
2 27317 1786 112.0 6119.0 -115 -115 2020-03-10 15:39
3 27318 1786 6120.0 5230.0 100 -130 2020-03-10 15:39
4 27319 1786 6120.0 7511.0 100 -130 2020-03-10 15:39
5 27320 1786 6119.0 98.0 -115 -115 2020-03-10 15:39
6 27321 1786 7513.0 98.0 -115 -115 2020-03-10 15:43
7 27322 1786 6118.0 113.0 -115 -115 2020-03-10 15:43
8 27323 1786 6178.0 95.0 100 -130 2020-03-10 15:43
9 27324 1786 6120.0 95.0 115 -145 2020-03-10 15:43
10 27325 1786 7511.0 5230.0 -115 -115 2020-03-10 15:44
11 27326 1786 6118.0 2754.0 105 -135 2020-03-10 15:44
12 27327 1786 7513.0 6119.0 -120 -110 2020-03-10 18:46
13 27328 1786 6120.0 7511.0 105 -135 2020-03-10 18:46
14 27329 1786 6118.0 113.0 -105 -125 2020-03-10 18:46
15 27330 1786 6178.0 95.0 105 -135 2020-03-10 18:47
16 27331 1786 6120.0 95.0 120 -150 2020-03-10 18:47
17 27332 1786 6118.0 2754.0 115 -145 2020-03-10 18:47
In [6]:
#Just scroll through the first ten bets to see what the matchups are.
for x in range(0, 10):
    id1 = NLT['idP1'][x]
    id2 = NLT['idP2'][x]
    name1 = NP['name'][NP['idPlayer'] == id1].to_string()
    name2 = NP['name'][NP['idPlayer'] == id2].to_string()
    print(name1, name2)
111    Jimmie Johnson #48                       97    Denny Hamlin #11                        
7512    Kurt Busch #1 6118    William Byron #24
111    Jimmie Johnson #48                       6118    William Byron #24
6119    Ryan Blaney #12 5229    Alex Bowman #88
6119    Ryan Blaney #12 7510    Martin Truex Jr #19
6118    William Byron #24 97    Denny Hamlin #11                        
7512    Kurt Busch #1 97    Denny Hamlin #11                        
6117    Chase Elliott #9 112    Brad Keselowski #2                      
6177    Alex Bowman#88 94    Kyle Busch #18                          
6119    Ryan Blaney #12 94    Kyle Busch #18                          

It appears all of them except for 468 were race matchups. 1593 and 1605 are H2H race matchups, but something happened where they used the last years race id (both races were in 2018 but used the 2017 race id). 468 was a typo in the header which ended up getting fixed immediately and replaced with 469. This is something we will fix later. Next, lets look through all the typo ones and see if they ended up getting replaced.

5 has no entries (delete)

245 replaced by 236

278 has no entries (delete)

468 has no entries (delete)

495 replaced by 496

577 and 578 didn't get replaced, but they're the same bet.

688 replaced by 689

808 replaced by 809

1033 has no entries (delete)

1137 and 1138 not replaced, same bet

1210, 1212, 1214, 1215 all replaced by 1216

1245 prop on race average speed, only one entry (delete)

1483 not replaced

1504 h2h

1636 replaced by 1635

1719 h2h

1772 replaced by 1774

For the races that were replaced, my plan is to delete that entry and then later on when we clean up the Matchups CSV replace the tournament ID (e.g., all the Tournament ID's of 495 will get replaced by 496). So I'll add

Qualifyung Matchups|Qualifiyng Matchups|Qualifying Matchup

Race Macthups

to the filters and delete 5, 495, 577, 688, 1033, 1137, 1210, 1212, 1214, 1215, 1245, 1636. And the rest of the tournaments are things based on points, wins, etc. Stuff I don't really care about so I'll just label them as "Misc" instead of "Unknown".

In [7]:
#Note: drop removes rows based on labels (the left most value in the tables above). Since our idTournament started at 1 and the labels start at 0, just subtract one from everything.
NT = NT.drop([4,244,277,467,494,576,687,807,1032,1136,1209,1211,1213,1214,1244,1635, 1771], axis=0)
In [8]:
#Our new filtering statement
NT['BetType'] = pd.np.where(NT.Tournament.str.contains("Top 5"), "Top5",
                               pd.np.where(NT.Tournament.str.contains("Top 3|Top3"), "Top3",
                               pd.np.where(NT.Tournament.str.contains("Race Winner|Race  Winner"), "Winner",            
                               pd.np.where(NT.Tournament.str.contains("Top 10"), "Top10",
                               pd.np.where(NT.Tournament.str.contains("Top 8"), "Top8",
                               pd.np.where(NT.Tournament.str.contains("Finishing Position|Finishing position-"), "FP",
                               pd.np.where(NT.Tournament.str.contains("Qualifying Matchups|Qualifyung Matchups|Qualifiyng Matchups|Qualifying Matchup"), "QH2H",
                               pd.np.where(NT.Tournament.str.contains("Race Matchup|Race  Matchups|Race Matchups|Race Macthups"), "H2H",
                               pd.np.where(NT.Tournament.str.contains("Three on One Matchups"), "TON1",
                               pd.np.where(NT.idTournament.str.contains("1504|1719|1593|1605|1612|1618|1624|1700|1713|1761|1771|1779|1780|1786"), "H2H",
                               pd.np.where(NT.Tournament.str.contains("Sprint Cup Championship|Championship 2018 Winner|Sprint Cup Champion|Sprint Cup Series Winner|Cup Series Championship - Winner|NASCAR Championship 2018 Winner|Monster Energy Cup Championship - Winner|NASCAR Championship 2018 Winner"), "CUPWIN",
                               pd.np.where(NT.Tournament.str.contains("Race Props|Propositions|Total Cautions|Race Proposition"), "Props", "MiscBets"))))))))))))
In [9]:
NT[NT['BetType'] == 'Unknown']
Out[9]:
idTournament Tournament startDate BetType

Now that everything is categorized, we can re-export the CSV by NT.to_csv('Nascar-Tournaments-clean.csv', index=False) giving us this new Nascar Tournaments file

In [10]:
NT.to_csv('Nascar-Tournaments-clean.csv', index=False)