SB, thanks for posting this code. Just ran it and it looks good!

spreadbetting wrote: ↑Wed Jan 22, 2020 3:42 pm

Probably not as long as you'd think. I bought the £9 Udemy Python course that was recommended on this thread: viewtopic.php?f=55&t=19959
It's about 30 hours, but most of that is exercises or tests, which I skipped as they were a bit boring; some of the SQL stuff isn't needed to start, but it's not hard either. I finished watching it around Xmas time, so I needed to test my new-found skills on something, and scraping with Python isn't too hard. I had coded previously with PHP, but I still mainly just look on Google when I need to do something. Doing a course does give you that structured learning, though, and the course, even though it gets boring, is quite good and well put together. I probably managed around an hour most days. I can't imagine I'd be able to write anything without Google, though. But for me coding is a means to an end, so once I have something working I never bother coding, or trying to code, any further.
Only had dealings so far with VBA for Excel, PHP for old web stuff, and Python, but I've got to say Python is definitely the easiest and, being a newer language, seems to have learnt a lot from the failings of other coding languages.
Here's the code I wrote. I imagine most pro coders would spot plenty of areas it could be made more efficient, but as a first attempt at a scraper I was happy it actually kicked out what I needed.
Code: Select all
import re
import requests
from bs4 import BeautifulSoup
from requests_html import HTMLSession

def extract_times(text):
    # "Best: xx.xxsLast: xx.xxs" -> average the two times, unless the
    # last time is faster than the best (which looks suspect), in which
    # case fall back to the best time alone.
    times_regex = re.compile(r'Best: (.....)sLast: (.....)s')
    best_times_regex = re.compile(r'Best: (.....)s')
    match = times_regex.search(text)
    best_match = best_times_regex.search(text)
    if match:
        if float(match.group(2)) < float(match.group(1)):
            return float(match.group(1))
        return round((float(match.group(1)) + float(match.group(2))) / 2, 2)
    if best_match:
        return float(best_match.group(1))
    return 100  # no times found; sorts last

session = HTMLSession()
baseUrl = "https://www.sportinglife.com"
racecardPath = "/greyhounds/racecards/20"  # renamed from 'str', which shadowed the built-in

res = requests.get(baseUrl + "/greyhounds/racecards")
soup = BeautifulSoup(res.text, "html.parser")

for link in soup.find_all('a'):
    link = link.get('href')
    if link and racecardPath in link:  # get('href') can return None for bare anchors
        res = session.get(baseUrl + link)
        card = BeautifulSoup(res.text, "html.parser")
        race = card.find_all('h1')[1].get_text()
        distance = card.find(class_='gh-racecard-summary-race-class gh-racecard-summary-always-open').get_text()
        summary = card.find_all(class_="gh-racing-runner-key-info-container")
        # Keyed by average time; note two runners with identical times would collide.
        Runners = dict()
        for runner in summary:
            Trap = runner.find(class_="gh-racing-runner-cloth").get_text()
            Name = re.sub(r'\(.*\)', '', runner.find(class_="gh-racing-runner-greyhound-name").get_text())
            Average_time = extract_times(runner.find(class_="gh-racing-runner-greyhound-sub-info").get_text())
            Runners[Average_time] = Trap + '. ' + Name
        if Runners and ('OR' in distance or 'A' in distance):
            x = sorted(Runners.items())  # fastest first
            if len(x) > 1 and (x[1][0] - x[0][0]) >= 0.1:
                timeDiff = round(x[1][0] - x[0][0], 2)
                print(f"{race},{x[0][1]}, class {distance}, time difference {timeDiff}")
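For anyone curious how the time-parsing rule behaves without running the whole scraper, here's a standalone sketch of the same regex logic on its own. The sample strings are made-up examples of the "Best: xx.xxsLast: xx.xxs" text the scraper pulls from each runner's sub-info, not real site output.

```python
import re

def average_time(text):
    # Mirrors the extract_times() rule above: average best and last,
    # unless last is faster than best, then take best alone.
    times = re.compile(r'Best: (.....)sLast: (.....)s').search(text)
    best_only = re.compile(r'Best: (.....)s').search(text)
    if times:
        best, last = float(times.group(1)), float(times.group(2))
        if last < best:
            # a "last" faster than "best" is treated as suspect
            return best
        return round((best + last) / 2, 2)
    if best_only:
        return float(best_only.group(1))
    return 100  # sentinel for "no times found", sorts last

print(average_time("Best: 28.50sLast: 28.70s"))  # averages the two -> 28.6
print(average_time("Best: 28.50sLast: 28.30s"))  # last < best: best alone -> 28.5
print(average_time("Best: 28.50s"))              # best time only -> 28.5
print(average_time("no times here"))             # sentinel -> 100
```

The 100 sentinel just pushes dogs with no recorded times to the bottom of the sorted list, so they never show up as the "fastest" runner.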
(For anyone new to Python, I had to run:
pip install requests
pip install requests_html
from the terminal to get it running.)
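For a completely fresh machine, the setup might look something like the below. This is a sketch, not from the original post: the filename scraper.py is just a placeholder for whatever you save the code as, and beautifulsoup4 is listed explicitly to cover the "from bs4 import" line in case it isn't pulled in as a dependency of requests-html.

```shell
# One-off environment setup (package names as on PyPI)
python -m venv venv
source venv/bin/activate            # on Windows: venv\Scripts\activate
pip install requests requests-html beautifulsoup4

# Then run the scraper (placeholder filename)
python scraper.py
```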