Data scraping Dos & Don'ts?

Discussion regarding the spreadsheet functionality of Bet Angel.
Brovashift
Posts: 475
Joined: Tue May 18, 2021 12:35 am

Hi all,

Any advice or tips when data scraping? I'm thinking along the lines of things to avoid, to try to prevent things breaking if/when a site gets updated, etc.
I see many things online using Python so was just going to start with that, but if anyone has any better (easier) ideas; in the words of Universal Soldier, "I'm all ears..." 😂
I like to try and work smarter, not harder lol 😉

TIA
Frogmella
Posts: 220
Joined: Mon May 30, 2011 2:44 pm
Location: Towcester

A lot of people do this and I've been looking into it myself. The advice seems to be: where possible, use the website's own API instead of scraping. Apparently many sites have one.
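
For what it's worth, here is a minimal sketch of what "use the API" tends to look like in Python, assuming the site returns JSON. The URL and the field names (`home`, `away`, `aces`) are made up for illustration; a real API will have its own endpoints and layout.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only; replace with a real API URL.
API_URL = "https://example.com/api/match/123/stats"


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_stat(payload, player, stat):
    """Read one stat for one player, returning None if it is missing.

    Using .get() means a renamed or removed field gives None
    instead of crashing the whole script.
    """
    return payload.get(player, {}).get(stat)


# Offline example using a made-up payload shaped like the assumed response.
sample = json.loads('{"home": {"aces": 7}, "away": {"aces": 3}}')
print(extract_stat(sample, "home", "aces"))
print(extract_stat(sample, "away", "unforced_errors"))
```

In a live script you would call `fetch_json(API_URL)` and pass the result to `extract_stat`. Treating a missing field as None rather than an error is one way to stop things breaking outright when a site changes its data.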
Brovashift
Posts: 475
Joined: Tue May 18, 2021 12:35 am

Frogmella wrote: ↑
Sat Jul 30, 2022 11:54 am
A lot of people do this and I've been looking into it myself. The advice seems to be: where possible, use the website's own API instead of scraping. Apparently many sites have one.
Hmm, interesting, thanks for this...

I don't suppose you've got any recommendations, have you? I want to pull some tennis data to display on my one-click screen. I was thinking of getting it from Tennis Abstract, as he collects some pretty unique data on there, but there's no live in-play player data. Doing a quick Google search now, Sportsdata.io looks quite promising, but I hate it when they don't give any sort of prices, just "Get In Touch".
Realrocknrolla
Posts: 1903
Joined: Fri Jun 05, 2020 7:15 pm

Sofascore mate!
Brovashift
Posts: 475
Joined: Tue May 18, 2021 12:35 am

Realrocknrolla wrote: ↑
Sat Jul 30, 2022 2:18 pm
Sofascore mate!
Cheers mate, nice one!!!

This isn't an API, no? Just a site to have open, or scrape if necessary... I don't suppose you know anywhere that includes "Unforced Errors", do you? I can't remember where I've seen it now, possibly Wimbledon. But I think it's a good live stat to have, to know if a player is winning/losing because of skillz or because they keep spooning it lol

Cheers 8-)
Realrocknrolla
Posts: 1903
Joined: Fri Jun 05, 2020 7:15 pm

https://stackoverflow.com/questions/590 ... ing-python

Have a read of this and this forum mate!
greenmark
Posts: 4948
Joined: Mon Jan 29, 2018 2:15 pm

Brovashift wrote: ↑
Sat Jul 30, 2022 11:04 am
Hi all,

Any advice or tips when data scraping? I'm thinking along the lines of things to avoid, to try to prevent things breaking if/when a site gets updated, etc.
I see many things online using Python so was just going to start with that, but if anyone has any better (easier) ideas; in the words of Universal Soldier, "I'm all ears..." 😂
I like to try and work smarter, not harder lol 😉

TIA
My view is that scraping websites is reasonable. They don't change a lot (every change costs them a lot more than it does you).
I did it with Java, and even though I had 20+ years of IT experience, Java was new to me and a colossal pain in the bum.
One of my (highly experienced) former colleagues recommended Python, so that might be the right way to go.
But for sure, using APIs is smart, although you may have to pay.
All I'd say is if you go down the road of writing your own stuff, please, please document every step, it will benefit you long term (despite being a pain right now). Trust me!
Brovashift
Posts: 475
Joined: Tue May 18, 2021 12:35 am

Realrocknrolla wrote: ↑
Sat Jul 30, 2022 4:18 pm
https://stackoverflow.com/questions/590 ... ing-python

Have a read of this and this forum mate!
Been a while since I've logged into Stack Overflow, a good few years lol. But I think I can understand how that code is working. I'll do a few basic Python tutorials to familiarise myself, see how I get on. Cheers RocknRoller, always appreciated 👍🏻
Brovashift
Posts: 475
Joined: Tue May 18, 2021 12:35 am

greenmark wrote: ↑
Sat Jul 30, 2022 4:40 pm
My view is that scraping websites is reasonable. They don't change a lot (every change costs them a lot more than it does you).
I did it with Java, and even though I had 20+ years of IT experience, Java was new to me and a colossal pain in the bum.
One of my (highly experienced) former colleagues recommended Python, so that might be the right way to go.
But for sure, using APIs is smart, although you may have to pay.
All I'd say is if you go down the road of writing your own stuff, please, please document every step, it will benefit you long term (despite being a pain right now). Trust me!
Cheers Greenmark 👍🏻

I've dabbled with Java myself when developing an Android app, but most of my experience is in C#. I was watching a video on YouTube yesterday about programming in finance, and the guy was saying that Python is the way to go to get started, simply because it just works: start it up and write your code. The problem with a lot of programming interfaces is that you have to install all kinds of crap to get them to work... usually followed by an error because you've not got the right .NET Framework installed, or you're missing Java update files, or some nightmare lol.
I like the idea of hopefully copying & pasting some Python code, making a few tweaks, and loading data into a CSV file, ready to be imported into Guardian 👌🏼. That's either naivety or overconfidence talking 😅

Documentation: check 👍🏻
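
A rough sketch of the "load data into a CSV file" step, using only Python's built-in csv module. The column names and the file name are assumptions for illustration; what Guardian actually expects on import isn't covered in this thread, so check that separately.

```python
import csv


def write_rows(path, rows, fieldnames):
    """Write a list of dicts to a CSV file with a header row."""
    # newline="" stops the csv module writing blank lines on Windows.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Made-up example data; the columns are placeholders, not a real format.
rows = [
    {"player": "Player A", "aces": 7, "double_faults": 2},
    {"player": "Player B", "aces": 3, "double_faults": 4},
]
write_rows("tennis_stats.csv", rows, ["player", "aces", "double_faults"])
```

Keeping the column list in one place makes it easier to adjust later if the import side needs a different order or different headings.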
MemphisFlash
Posts: 2126
Joined: Fri May 16, 2014 10:12 pm
Location: Leicester

If Python is beyond you (as it is for me) then use ParseHub or Octoparse.
That's how I get my data.

sniffer66
Posts: 1666
Joined: Thu May 02, 2019 8:37 am

Also, if you want live tennis stats from the SofaScore API I've already done the work for you. Check the tennis automation sub forum.

I'm not a pro coder so it may not be the best/most efficient code but it does work. All the usual in play stats are available and I've provided a sample baf to show how you get the data into Guardian.

If you want to go your own route with Python, it will at least give you a starting point to port code and endpoints from.
foxwood
Posts: 390
Joined: Mon Jul 23, 2012 2:54 pm

Stumbled across this and thought of you lol https://realpython.com/learning-paths/p ... -scraping/

I used to scrape with my own .NET progs in VB/C#, but quite a few sites are anti-scraping, which is tedious to cater for.

This year I switched to Python, using Selenium to handle the website connections. It looks to most sites like a proper user and I've not encountered any blocking since.

If you understand code and HTML document layout then it's generally fairly simple. If not, then maybe the path suggested by @Memphis would be easier for you.
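
To give a feel for the "HTML document layout" part, here is a minimal sketch using Python's built-in html.parser to pull the cells out of a table. The HTML snippet and the `stats` id are invented for illustration; a real page needs its own selectors, and this simple version does not handle nested tables.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag == "td":
            if not self.rows:
                # Cope with a <td> that appears before any <tr>.
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td" and self.in_cell:
            # Tidy up whitespace once the cell is complete.
            self.rows[-1][-1] = self.rows[-1][-1].strip()
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


# Invented HTML standing in for a downloaded page.
html = """
<table id="stats">
  <tr><td>Aces</td><td>7</td></tr>
  <tr><td>Double faults</td><td>2</td></tr>
</table>
"""
parser = TableCellParser("stats")
parser.feed(html)
print(parser.rows)
```

Matching on an id rather than on position in the page is one way to reduce breakage when a site is rearranged, although nothing survives the id itself being renamed.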
jimibt
Posts: 3641
Joined: Mon Nov 30, 2015 6:42 pm
Location: Narnia

foxwood wrote: ↑
Sun Jul 31, 2022 12:18 pm
Stumbled across this and thought of you lol https://realpython.com/learning-paths/p ... -scraping/

I used to scrape with my own .NET progs in VB/C#, but quite a few sites are anti-scraping, which is tedious to cater for.

This year I switched to Python, using Selenium to handle the website connections. It looks to most sites like a proper user and I've not encountered any blocking since.

If you understand code and HTML document layout then it's generally fairly simple. If not, then maybe the path suggested by @Memphis would be easier for you.
I would also recommend using Selenium with ChromeDriver under the covers. You can then (if you need to scrape async) save the session cookies and headers and continue with normal HttpClient libraries. Of course, if it's a site of only a few dozen *pages*, then I'd just use Selenium for the whole shebang.
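
A minimal Python sketch of that cookie hand-off, assuming the cookies arrive as the list of dicts that Selenium's `driver.get_cookies()` returns (each with a `name` and a `value`). The cookie values, the URL, and the helper names `cookie_header` and `build_request` are made up for illustration, and urllib stands in here for whichever HTTP client you prefer.

```python
from urllib.request import Request


def cookie_header(selenium_cookies):
    """Turn Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def build_request(url, selenium_cookies, user_agent):
    """Build a plain HTTP request that reuses the browser session's cookies."""
    headers = {
        "Cookie": cookie_header(selenium_cookies),
        "User-Agent": user_agent,
    }
    return Request(url, headers=headers)


# In a real run these would come from the browser session, e.g.
#   cookies = driver.get_cookies()
#   user_agent = driver.execute_script("return navigator.userAgent")
# The values below are made up so the sketch runs without a browser.
cookies = [
    {"name": "sessionid", "value": "abc123"},
    {"name": "consent", "value": "yes"},
]
request = build_request("https://example.com/data", cookies, "Mozilla/5.0 (example)")
print(request.get_header("Cookie"))
```

The request object can then be passed to `urllib.request.urlopen`. Whether a given site accepts the reused session depends on the site.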
Brovashift
Posts: 475
Joined: Tue May 18, 2021 12:35 am

Cheers for these tips guys, much appreciated 🙌

I seem to have hit a bit of a flat spot in my progress, losing as much as I'm making and making as much as I'm losing, so I'm taking a bit of time this week to work on my execution, to see if I can get a bit of confidence back. Will dig into these suggestions a bit later 👍

Thanks :mrgreen:
Brovashift
Posts: 475
Joined: Tue May 18, 2021 12:35 am

Does anyone here use, or has anyone used, tennisprofits?

I don't suppose it's possible to scrape data from behind a paywall, is it?