Structuring Data Capture for Future Analysis

robsmith
Posts: 76
Joined: Wed Aug 25, 2010 12:19 pm

Hi,

I am putting together a simple macro to capture data. The idea is to build a database that I can use for analysis.

My question is: what is the best layout to make future analysis easier? For example, is it better to store each race in a separate database table, keep everything in a single table, have a table for each day, etc.?

Also, is the layout below ok or is there a better way? (I intend to sort by date time to make it sequential).

[attached screenshot of the proposed layout]

Thanks
jimibt
Posts: 3641
Joined: Mon Nov 30, 2015 6:42 pm
Location: Narnia

It's better to have everything in a single table with indexes on the key fields (datetime, course, race type, etc.). That way you can query across the entire race universe and get a result set with greater relevance.

You could of course get fancy and have different tables for different sports, but for horse racing this looks about right to me. If it helps, this layout is almost identical to the one that I use for analysis and it has worked well for me.
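To make that concrete, a minimal sketch of that kind of single table in SQLite (table and column names here are just for illustration, not necessarily what you'd want):

import sqlite3

con = sqlite3.connect("racing.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS races (
    race_id       INTEGER PRIMARY KEY,
    race_datetime TEXT NOT NULL,   -- ISO 8601 text sorts chronologically
    course        TEXT NOT NULL,
    race_type     TEXT,
    distance      TEXT,
    going         TEXT,
    winner        TEXT
);
-- indexes on the key fields you expect to filter on
CREATE INDEX IF NOT EXISTS idx_races_datetime ON races (race_datetime);
CREATE INDEX IF NOT EXISTS idx_races_course   ON races (course);
CREATE INDEX IF NOT EXISTS idx_races_type     ON races (race_type);
""")
con.commit()
con.close()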
Mark-H
Posts: 57
Joined: Mon Oct 17, 2016 6:35 pm
Location: North West UK

That's the $64,000 question, and just to play devil's advocate I'm going to suggest that it's probably a deeper question than you originally anticipated. At least it was when I was looking to do the same thing recently.

I've been looking at the same problem using the free Betfair Historical Data that was offered recently. As I see it, it comes down to a number of factors before you even start.

A typical Greyhound race has a fixed 6 runners and doesn't go in-play. Trading usually only occurs in the 3 minutes before the race starts. The resulting database load would be fairly predictable from race to race or day to day. Not too bad.

However, I looked at importing a 24-runner Cheltenham Festival race that traded for many hours before the off and had a healthy in-play component as well. This one race resulted in 165 MB of data to import into the database. What would that be for the whole festival, or per day? I dread to think what the Grand National would be! In situations like this I think it would be necessary to optimise the database as much as possible by moving repetitive or redundant data out to another table. After all, do you really need 100,000 rows all saying "Chelt 13th Mar - 14:10 2m1f Grd3 Hcap Hrd"?
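As a rough illustration of what I mean by moving the repetitive stuff out (table and column names made up for the example), the race description lives once in a markets table and the bulky price rows only carry a key:

import sqlite3

con = sqlite3.connect("prices.db")
con.executescript("""
-- "Chelt 13th Mar - 14:10 2m1f Grd3 Hcap Hrd" is stored once, here
CREATE TABLE IF NOT EXISTS markets (
    market_id   TEXT PRIMARY KEY,
    description TEXT,
    start_time  TEXT
);
-- the 100,000+ price updates only reference the key
CREATE TABLE IF NOT EXISTS price_updates (
    market_id    TEXT REFERENCES markets (market_id),
    selection_id INTEGER,
    published    TEXT,
    back_price   REAL,
    lay_price    REAL,
    traded_vol   REAL
);
""")
con.close()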

It may even be worth considering saving the data as individual files, similar to how Betfair does it. If you only plan to work on one race at a time then it's relatively easy to pull a small(ish) compressed file into your software of choice, say Excel, Python, Microsoft Power BI, etc., and work on it. Text compresses very efficiently, so the files on your hard drive stay fairly small. I know that many users prefer to use services such as AWS, but that adds another layer of complexity and cost to your project.
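If you do go the file route, pulling one of those compressed race files into Python for a look is only a few lines. A minimal sketch, assuming the bz2-compressed, one-JSON-object-per-line format Betfair supplies (the file name is just a placeholder):

import bz2
import json

path = "1.170123456.bz2"  # placeholder market file name

with bz2.open(path, "rt") as f:
    updates = [json.loads(line) for line in f]

print(f"{len(updates)} update messages in this market file")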

Using the Flumine Python package (created and developed by LinusP on this forum), it's possible to run strategies over a large number of race data files on the fly without needing to use a database.

So I guess the ultimate answer is to ask the question: how do you intend to use the data, and what's it worth to you?
towelfox
Posts: 32
Joined: Mon May 27, 2019 10:32 pm

I think having lots of files would get messy and overwhelming very quickly.

A well planned database will save a lot of headaches down the line. Doesn't need to be fancy. I have all of the greyhound races since 2004 in a database structured as follows:

Track table
Dog table (racing dogs, retired sire/dam)
Owner table
Trainer table
Race table

Runner data is in a separate table with dog/race IDs as keys, and that table has the price data stored in a JSON (text) column. Very easy to pull all manner of stats.
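Something along these lines, with the prices kept as a JSON string in a text column (names are illustrative, not my exact schema):

import json
import sqlite3

con = sqlite3.connect("greyhounds.db")
con.execute("""
CREATE TABLE IF NOT EXISTS runners (
    race_id INTEGER,
    dog_id  INTEGER,
    trap    INTEGER,
    result  INTEGER,
    prices  TEXT,                 -- JSON blob, e.g. {"bsp": 3.45, "ticks": [...]}
    PRIMARY KEY (race_id, dog_id)
)""")

# pull one runner back out and unpack the price data
row = con.execute(
    "SELECT prices FROM runners WHERE race_id = ? AND dog_id = ?",
    (1234, 5678),
).fetchone()
if row and row[0]:
    print(json.loads(row[0]).get("bsp"))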

Horses would be a bit more complicated but only because of the jockey.
Mark-H
Posts: 57
Joined: Mon Oct 17, 2016 6:35 pm
Location: North West UK

towelfox wrote:
I think having lots of files would get messy and overwhelming very quickly.

A well planned database will save a lot of headaches down the line. Doesn't need to be fancy. I have all of the greyhound races since 2004 in a database structured as follows...
In general, I agree. Yes, the files can be structured by month/day/race meeting/race file, as Betfair supplies them, but the real crux of your reply is in the next paragraph. Many people trying this stuff are not database experts, and in that case a single table with everything including the kitchen sink will also become very messy very fast.

For most people, your "doesn't need to be fancy..." is actually VERY fancy. But as you rightly point out, it's necessary, or you'll end up either bloating the database or missing out on data that may turn out to be useful but that you haven't thought of yet.

Your database example is the sort of information amateurs like us need at the very start of the data collection process.
Atho55
Posts: 637
Joined: Tue Oct 06, 2015 1:37 pm
Location: Home of Triumph Motorcycles

This is what I have done and it works for me.

Have your data in an external database as you will run out of rows in Excel
Break your data down into individual elements so you can create combinations easily
Delete stuff you might never use

My Access data looks like this
Access.jpg
It only becomes messy if you let it become messy.

Once it's in your database of choice, you can create a link directly from Access to Excel like this. I use pivot tables as I find them easy to create and manage.

Connection.jpg
Once you have your connection you can display your data however you wish. I was pointed towards the Slicer feature in Excel; with a few button presses you can assemble the underlying data into every combination you can imagine, which makes it very easy to use. It looks like this.

Slicer.jpg
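(If you'd rather do the same thing in Python instead of Excel, a rough equivalent of that sort of pivot with pandas, using invented column names, would be:)

import pandas as pd

# hypothetical flat export of the race results
df = pd.read_csv("results.csv")

pivot = pd.pivot_table(
    df,
    values="profit",
    index="course",
    columns="race_type",
    aggfunc="sum",
    fill_value=0,
)
print(pivot)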
ShaunWhite
Posts: 9731
Joined: Sat Sep 03, 2016 3:42 am

Atho55 wrote:
Thu Jul 30, 2020 1:00 pm
Delete stuff you might never use
The stuff you need in the future is always the stuff you didn't think you needed. Sod's Law. Storage costs virtually nothing, so there's no excuse for not keeping everything.

The usual structure for data would be a central repository containing anything and everything from the year dot. Then you pull out a query database containing just what you need for the job at hand. You can't have a decent archive and also expect to use it for running a bunch of variations; it just takes too long.

I know my example isn't typical, but I collect about 1.5 GB of data a day; running 10 queries for strategy variations over that would take more years than I have left. So I extract just the data and fields I need with a simple, fast query (say UK dogs, Q1 2019, back and lay prices and volume from -2 min to the off) and work with that.
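The extract itself can be a single query that writes a small working database out of the archive. A rough sketch (table and column names are made up for the example, and it assumes the archive table already exists):

import sqlite3

archive = sqlite3.connect("archive.db")        # the keep-everything repository
work = sqlite3.connect("uk_dogs_q1_2019.db")   # small database for this job only

rows = archive.execute("""
    SELECT market_id, selection_id, secs_to_off, back_price, lay_price, traded_vol
    FROM prices
    WHERE sport = 'greyhounds'
      AND country = 'GB'
      AND race_date BETWEEN '2019-01-01' AND '2019-03-31'
      AND secs_to_off BETWEEN -120 AND 0
""").fetchall()

work.execute("""
    CREATE TABLE IF NOT EXISTS prices (
        market_id TEXT, selection_id INTEGER, secs_to_off REAL,
        back_price REAL, lay_price REAL, traded_vol REAL
    )""")
work.executemany("INSERT INTO prices VALUES (?, ?, ?, ?, ?, ?)", rows)
work.commit()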
robsmith
Posts: 76
Joined: Wed Aug 25, 2010 12:19 pm

Thanks for all your help, guys.
firlandsfarm
Posts: 2688
Joined: Sat May 03, 2014 8:20 am

My advice would be that you must use a pukka database app, NOT a spreadsheet. I started with MS Access, but that has a file limit of 2 GB including 'working space' (the extra space the database needs to create temporary tables etc. when querying the data). However, an Access database can talk to another Access database, by which I mean that if your database gets too large you can move some tables out to a second Access database to spread the size and link your main database to it. For example you might have one database covering the recent data and another (or two or more) for the longer-term history. Access is very user friendly. Beyond the likes of Access you will be looking at a full SQL database server, which you will find is a whole new ball game.

As for design, search the Internet for "database normalization". Normalization is about designing/splitting the data into tables so that each item of data is only entered once, indexing fields for fast recall of data, etc. … basically making the database as efficient as possible. For example, my racing data is divided into the following cross-referenced tables (there's a small join sketch after the list) …

Races
RaceTypes
RaceClassifications
Courses
Distances
Goings
Horses
HorseDetailsInRace
Jockies
Trainers
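
With tables split like that, getting readable results back is just a matter of joining on the IDs. A rough sketch using a few of the table names above (the column names are invented for the example, not my exact fields):

import sqlite3

con = sqlite3.connect("racing.db")
for row in con.execute("""
    SELECT r.race_datetime, c.course_name, h.horse_name, d.finishing_position
    FROM HorseDetailsInRace AS d
    JOIN Races   AS r ON r.race_id   = d.race_id
    JOIN Courses AS c ON c.course_id = r.course_id
    JOIN Horses  AS h ON h.horse_id  = d.horse_id
    WHERE c.course_name = 'Cheltenham'
    ORDER BY r.race_datetime
"""):
    print(row)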

Time spent designing the database structure is time well spent and will reward you with time saved inputting and recalling data in the future. And I fully agree with Shaun … save everything because I guarantee the stuff you don't need now will be needed in the future. :)
Therrm
Posts: 5
Joined: Mon Jan 06, 2014 10:55 pm

ShaunWhite wrote:
Thu Jul 30, 2020 3:33 pm
Atho55 wrote:
Thu Jul 30, 2020 1:00 pm
Delete stuff you might never use
The stuff you need in the future is always the stuff you didn't think you needed. Sod's Law. Storage costs virtually nothing so no excuse for not keeping everything.

Usual structure for data would be a central repository containing anything and everything from the year dot. Then you pull out a query database containing just what you need for the job at hand. You can't have a decent archive and also expect to use it for running a bunch of variations, it just takes too long.

I know my example isn't typical but i collect about 1.5Gb of data a day, running 10 queries for strategy variations would take more years than I have left. So I extract just the data & fields I need with a simple fast query, say uk dogs Q1 2019, back and lay prices & volume from -2min -> start, and work with that.
Hello Shaun,
I've been collecting lots of data too (maybe not as much as you, though), but now that I want to analyse it I don't know where to start. Excel isn't an option and, as mentioned above, Access is too limited with its 2 GB file limit. Even if you split it across different files, I can't manage all my data with it.
I'm considering transferring the data to a local SQL database and connecting to it from Excel. What do you use to analyse that amount of data?
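For what it's worth, the transfer I have in mind would look roughly like this (connection string, file and table names are placeholders; untested):

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string for a local PostgreSQL instance
engine = create_engine("postgresql://localhost/betfair")

# load one day's export and append it to a prices table
df = pd.read_csv("2020-07-30_prices.csv")
df.to_sql("prices", engine, if_exists="append", index=False)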
Thx
murdok
Posts: 151
Joined: Sun Apr 02, 2017 7:10 pm

my data

[attached screenshot]
gazuty
Posts: 2547
Joined: Sun Jun 26, 2011 11:03 am
Location: Green land :)

Possibly the wrong thread, but it saves starting a new one and it's about data.

Public holiday here today so I took a bit of time to review all my results from this year.

Biggest insight: my automations are consistently an overall loser on R1 of the day at every track. So R1 is now out. Every other race is 👍

Other insight, I’ll be scaling up a little in UK racing.

And a question: is Irish ☘️ racing similar enough to UK racing? More specifically, in terms of volume and the pattern of arrival of money? Or are we talking shamrocks and oranges 🍊?
LinusP
Posts: 1871
Joined: Mon Jul 02, 2012 10:45 pm

gazuty wrote:
Mon Oct 05, 2020 12:26 pm
Biggest insight - My automations are a consistent and overall loser on R1 of the day at every track. So R1 is now out. Every other race is 👍
What's the sample size / p-value?
gazuty
Posts: 2547
Joined: Sun Jun 26, 2011 11:03 am
Location: Green land :)

LinusP wrote:
Mon Oct 05, 2020 12:47 pm
gazuty wrote:
Mon Oct 05, 2020 12:26 pm
Biggest insight - My automations are a consistent and overall loser on R1 of the day at every track. So R1 is now out. Every other race is 👍
Whats the sample size? / p value?
5355 races, of which 495 are R1.

Interesting, because there was a discussion on the forum recently about why R1 is slightly different.
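For anyone who wants to run Linus's check on their own numbers, a quick way to compare R1 against the rest is a two-sample t-test (the arrays below are stand-in data; swap in your per-race P&L):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# stand-in data: replace with the per-race P&L pulled from your own results
r1_pnl    = rng.normal(-0.2, 5.0, size=495)    # race 1 of each meeting
other_pnl = rng.normal( 0.1, 5.0, size=4860)   # every other race

# Welch's t-test: is mean P&L on R1 genuinely different from the rest?
t_stat, p_value = stats.ttest_ind(r1_pnl, other_pnl, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")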
