The main source from data is NASCAR themselves which would be great except they distribute them as PDFs. Other sites take this data and put it into a nicer form, but they tend to leave stuff out (e.g., fantasy racing cheat sheet provides green flag speed averages over a number of races, but doesn't have the speed for one individual race). The other negative is practice and qualifying data is distributed as PDFs and sometimes there's only an hour or two between qualifying and the start of the race, so we can't wait and hope some other site will post them. Luckily tabula exists. We can use it to parse the PDF files. There's some quirks using it and not every file behaves the same (hence all the horrible if statements which we should probably fix).

I just want to build a big data dump, so I'm not worrying too much about the structure of my tables for now. I'm basically building a table for each page I parse and then when we are done we can put it altogether. I prefer this approach because if the scraper breaks or spits out something weird we can just re-dump it into the database vs. deleting it from our main tables.

Here's what I ran into and how I addressed it:

  • Tabula jumbled columns together, but it isn't consistent (hence all the if's we need to fix). With splits we ran into issues if the combined columns contained the teams (since the teams have a variable number of words in their name). For those, teams were always the beginning of the column so we split it, saw how many splits it made, and then used the last two indexes (length-1, length-2 because computer scientists aren't mathematicians) for the variables I wanted.

  • I wanted something to link across the tables, so I made sure to include the track, race, driver, and date in every table. The date in the PDFs isn't the date of the race though, it's when the stats were released. We will combine this with a racing schedule file later and update everything.

  • I set it up so we can drop the new PDFs into a folder, have it process every PDF in the folder, and then move it.

  • We didn't need to parse every page. 1 has most of the info we need. 2,3,4 are redundant with the data on page 1. 5,6,8, and 10 all contain different speeds so I used the same function for everything. Including the lap data caused some issues so I removed it since we already have it. 14 is unique. I created two new columns, place differential and closing speed.

I ran this on all the PDFs since 2005 and it worked for almost every one. The big issues were the Duel's at Daytona. Instead of trying to adjust the script I just added those manually.

Try it out. Any questions, ways to improve, comment in the thread! This isn't the most elegant solution. It works and we aren't trying to blow this up to scale, so I feel investing time in improving it is a waste. If it breaks its semi-easy to fix.

I created a database called Nascar for these tables. BTW you need Java installed to use Tabula and Python 3.6 or above. I lost my PDF loop files pre 2019, but when we are all done I'll post the full database.

SQL Tables

Loop Files From 2019-2020

Code File

In [ ]:
#pip install tabula-py
import tabula
import mysql.connector as mariadb
from datetime import datetime
import os
import shutil

def parse_page1(df, mdbc, cur):
    dat = df[0].fillna("")
    track = dat.iloc[0]['Box Score']
    race = dat.iloc[1]['Box Score']
    ed = dat.iloc[2]['Box Score']
    a = ed.split()
    siz = len(a)
    eventdatetemp = a[siz-3]
    xx = datetime.strptime(eventdatetemp, '%m/%d/%Y')
    eventdate = xx.strftime('%Y-%m-%d')
    a, gfpasses = dat.iloc[3][len(dat.columns)-1].split(":")
    sql = "INSERT INTO temprace_tbl (track,race,eventdate, gfpasses) VALUES (%s, %s, %s, %s)"
    var = (track, race, eventdate, gfpasses)
    cur.execute(sql, var)
    mdbc.commit()
    for i in range(0, len(dat)):
        car = dat.iloc[i][0]
        if car.isnumeric() == True:
            driver = dat.iloc[i][1]
            startpos = dat.iloc[i][2]
            midrace = dat.iloc[i][3]
            closerpos = dat.iloc[i][4]
            finishpos = dat.iloc[i][5]
            if len(dat.columns) == 11:
                highpos = dat.iloc[i][6]
                sp = dat.iloc[i][7]
                a = sp.split()
                siz = len(a)
            if len(dat.columns) == 12: 
                highpos = dat.iloc[i][6]
                sp = dat.iloc[i][8]
                a = sp.split()
                siz = len(a)
            if len(dat.columns) == 13: 
                highpos = dat.iloc[i][6]
                sp = dat.iloc[i][8]
                a = sp.split()
                siz = len(a)
            if len(dat.columns) == 9: 
                sp = dat.iloc[i][6]
                a = sp.split()
                siz = len(a)
            if len(a) == 11:
                plitop15 = a[siz-2]
                litop15 = a[siz-1]
                fastestlaps = a[siz-3]
                pqp = a[siz-4]
                qp = a[siz-5]
                gpd = a[siz-6]
                gp = a[siz-7]
                pd = a[siz-8]
                avgpos = a[siz-9]
                lowpos = a[siz-10]
                highpos = a[siz-11]
                lapsled = dat.iloc[i][7]
                sp2 = dat.iloc[i][8]
                pll,tl,dr,pts = sp2.split(" ")
            if len(a) == 8:
                fastestlaps = a[siz-1]
                pqp = a[siz-2]
                qp = a[siz-3]
                gpd = a[siz-4]
                gp = a[siz-5]
                pd = a[siz-6]
                avgpos = a[siz-7]
                lowpos = a[siz-8]
                sp1 = dat.iloc[i][8]
                litop15,plitop15 = sp1.split(" ")
                lapsled = dat.iloc[i][9]
                sp2 = dat.iloc[i][10]
                pll,tl,dr,pts = sp2.split(" ")
            if len(a) == 7:
                lowpos = dat.iloc[i][7]
                fastestlaps = a[siz-1]
                pqp = a[siz-2]
                qp = a[siz-3]
                gpd = a[siz-4]
                gp = a[siz-5]
                pd = a[siz-6]
                avgpos = a[siz-7]
                sp1 = dat.iloc[i][9]
                litop15,plitop15 = sp1.split(" ")
                lapsled = dat.iloc[i][10]
                sp2 = dat.iloc[i][11]
                pll,tl,dr,pts = sp2.split(" ")
            if len(a) == 6:
                lowpos = dat.iloc[i][7]
                avgpos,pd,gp,gpd,qp,pqp = sp.split(" ")
                fastestlaps = dat.iloc[i][9]
                sp1 = dat.iloc[i][10]
                litop15,plitop15 = sp1.split(" ")
                lapsled = dat.iloc[i][11]
                sp2 = dat.iloc[i][12]
                pll,tl,dr,pts = sp2.split(" ")
            closergained = int(closerpos)-int(finishpos)
            placedifferential = int(startpos)-int(finishpos)
            sql = "INSERT INTO loopdata_tbl (location, race, eventdate, car, driver, startpos, midrace, closerpos, finishpos, highpos, lowpos, avgpos, passdiff, greenpass, greenpassed, qualitypass, qualitypassperc, fastestlaps, lapsintopfifteen, lapsintopfifteenperc, lapsled, lapsledperc, totallaps, driverrating, points, closergained, placedifferential) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
            var = (track, race, eventdate, car,driver,startpos,midrace,closerpos,finishpos,highpos,lowpos,avgpos,pd,gp,gpd,qp,pqp,fastestlaps,litop15,plitop15,lapsled,pll,tl,dr,pts,closergained,placedifferential)
            cur.execute(sql, var)
            mdbc.commit()

def parse_speed(page, mdbc, cur, tbl, chk):
	dat = page[0].fillna("")
	track = dat.iloc[0][3]
	race = dat.iloc[1][3]
	ed = dat.iloc[2][3]
	a = ed.split()
	siz = len(a)
	eventdatetemp = a[siz-3]
	xx = datetime.strptime(eventdatetemp, '%m/%d/%Y')
	eventdate = xx.strftime('%Y-%m-%d')
	numcol = len(dat.columns)
	for i in range(0, len(dat)):
		rank = dat.iloc[i][0]
		if rank.isnumeric() == True:
			car = dat.iloc[i][1]
			driver = dat.iloc[i][2]
			gfs = dat.iloc[i][numcol-1]
			sql = "INSERT INTO " + tbl + " (location,race,eventdate,rank,car,driver,speed) VALUES (%s, %s, %s, %s, %s, %s, %s)"
			var = (track,race,eventdate,rank,car,driver,gfs)
			cur.execute(sql,var)
			mdbc.commit()				

def parse_sbs(p14, mdbc, cur):
	dat = p14[0].fillna("")
	track = dat.iloc[0][3]
	race = dat.iloc[1][3]
	ed = dat.iloc[2][3]
	a = ed.split()
	siz = len(a)
	eventdatetemp = a[siz-3]
	xx = datetime.strptime(eventdatetemp, '%m/%d/%Y')
	eventdate = xx.strftime('%Y-%m-%d')
	for i in range(0, len(dat)):
		rank = dat.iloc[i][0]
		if rank.isnumeric() == True:
			car = dat.iloc[i][1]
			driver = dat.iloc[i][2]
			sp = dat.iloc[i][3]
			#inconsistent size of team names, all three columns on the same column. split by space, get size, last 2 will be finish and q1
			if len(dat.columns) == 8:
				a = sp.split()
				siz = len(a)
				finish = a[siz-3]
				q1 = a[siz-2]
				q2 = a[siz-1]
				q3 = dat.iloc[i][4]
				q4 = dat.iloc[i][6]
				tot = dat.iloc[i][7]
			if len(dat.columns) == 7:
				a = sp.split()
				siz = len(a)
				finish = a[siz-4]
				q1 = a[siz-3]
				q2 = a[siz-2]
				q3 = a[siz-1]
				q4 = dat.iloc[i][5]
				tot = dat.iloc[i][6]
			if len(dat.columns) == 9:
				a = sp.split()
				siz = len(a)
				finish = a[siz-2]
				q1 = a[siz-1]
				q2 = dat.iloc[i][4]
				q3 = dat.iloc[i][5]
				q4 = dat.iloc[i][7]
				tot = dat.iloc[i][8]
			sql = "INSERT INTO loopsbs_tbl (location, race, eventdate, rank, car, driver, finish, q1, q2, q3, q4, total) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
			var = (track,race,eventdate,rank,car,driver,finish,q1,q2,q3,q4,tot)
			cur.execute(sql,var)
			mdbc.commit()

def process_loopdata(pdf, mdbc,cur):
	df = tabula.read_pdf(pdf, pages=1)
	parse_page1(df, mdbc, cur)
	page5 = tabula.read_pdf(pdf, pages=5)
	parse_speed(page5,mdbc,cur,'loopfdearly_tbl','yes')
	page6 = tabula.read_pdf(pdf, pages=6)
	parse_speed(page6,mdbc,cur,'loopfdlate_tbl','yes')
	page8 = tabula.read_pdf(pdf, pages=8)
	parse_speed(page8,mdbc,cur,'loopfonr_tbl','no')
	page10 = tabula.read_pdf(pdf, pages=10)
	parse_speed(page10,mdbc,cur,'loopgfs_tbl','no')
	page14 = tabula.read_pdf(pdf, pages=14)
	parse_sbs(page14,mdbc,cur)

mariadb_connection = mariadb.connect(user='',password='',database='Nascar')
cursor = mariadb_connection.cursor()
#change to wherever you store your data
path = '/var/www/loopdata/'
dirs = os.listdir(path)
for file in dirs:
	if file.endswith('.pdf'):
		print(file)
		process_loopdata(path + file, mariadb_connection, cursor)
		shutil.move(path + file, path + '/processed/' + file)

Next, we want a summary for the races and tracks. The races file will help us match up to the PDF data. The tracks data will be important since there are similarities and differences among the tracks. We expect people will race differently and some people will be better at different types of tracks.

Types of tracks:

Large Oval Tracks - 1.5 to 2 miles. Medium to high banks.

  • Atlanta
  • Auto Club
  • Charlotte
  • Chicagoland
  • Kansas
  • Kentucky
  • Las Vegas
  • Michigan
  • Texas

We can break these up into two mile, 1.5 mile, and quad ovals.

Two Miles - Auto Club and Michigan. High top speeds, wide tracks that lead to a lot of passing. Big difference is Michigan is low tire wear vs. Auto Club which is high tire wear. Both are really stand alone tracks, so I wouldn't use too much info from one another. The package change in 2019 led to a different style of racing at Michigan. I would ignore all the results before it.

1.5 Miles - Chicagoland, Kansas, Kentucky, and Las Vegas. Turning is key. Long straightaway. Shaped like a D.

Quad Ovals - Atlanta, Charlotte, Texas. 1.5 miles, two doglegs. High banked, high speed, corner turning is key. Texas was re-paved recently (2017) and it made it a super boring track.

Steep Tracks - High speeds on the top of the track.

  • Bristol
  • Darlington
  • Dover
  • Homestead-Miami

Two types of steep tracks: concrete and asphalt.

Concrete - Bristol and Dover. One mile or less in length. Concrete has more grip, less tire wear. Passing is hard. Consistent temperatures so day of time or weather doesn't affect speed throughout the race.

Asphalt - Darlington and Homestead-Miami. Larger than one mile. Asphalt is influenced by temperature so car conditions could change throughout the race. More passing.

Flat Tracks - Braking is more important. Hard to pass, only one line in the corner to pass.

  • Indianapolis
  • Martinsville
  • New Hampshire
  • Phoenix
  • Pocono
  • Richmond

Two categories: long and short.

Short flat - Martinsville, New Hampshire, Phoenix, Richmond. People in the back will go a lap down almost instantly. Track position matters.

Long flat - Indianapolis, Pocono. Element of randomness due to pit strategies. Fuel mileage matters and people can pit while green and not lose a lap.

Restrictor Plate - High banked, long (>= 2.5 miles). The crews install a plate on the cars engine to restrict the inflow of air to the carburator. Limits power and speed, so most cars have the same speed. Drafting is important. Easy to pass, tons of accidents.

  • Daytona
  • Talladega

Road Course Tracks - long track, slow speeds. Has both left and right turns. You can pit without losing a lap. Hard to pass. Watkins is faster.

  • Charlotte Road Course
  • Sonoma Raceway (previously Infineon)
  • Watskins Glen International

Here's the tables for tracks and races

CSV for Races table

CSV for Tracks table

In [ ]: