7 - Scraping the interwebs
Why?
Imagine you really want to extract some information from a website. Let’s say you need the current weather at lots of airports, immediately. I’m thinking ‘Airplane’ set in 2018 with on-board Wi-Fi: “We need to land this plane where there isn’t a storm. And don’t call me Shirley!”
The most immediate data comes from a national agency: the National Weather Service. If Shirley wanted to check the weather at my local airport (Midway, on the South Side of Chicago), he would need to type in its code and click Get METAR data, as shown.
But Leslie Nielsen ain’t got time to be clicking around on a website that looks like I built it for a class project! He needs info on ALL 19,299! NOW!
Solution: Web Scraping
Shirley can automate this process of clicking around and gathering the data so he can focus on important things like inflating the pilot.
URLs
Shirley could do a bit of sleuthing and realize that there is a pattern to the URLs of the pages he is taken to:
http://www.aviationweather.gov/metar/data?ids=Kmdw&format=raw&date=0&hours=0
So he can now just enlist a ‘headless browser’ to go to the page for each airport code.
However, he is only interested in the weather data (which I have highlighted), not the hundred-odd links on this page.
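The URL pattern above can be sketched in a few lines of Python. This is only an illustration: the helper name `metar_url` and the three sample airport codes are my own assumptions, and the query parameters are copied from the example URL rather than from any documented API. It builds the addresses without fetching anything.

```python
from urllib.parse import urlencode

# Base address taken from the example URL above.
BASE_URL = "http://www.aviationweather.gov/metar/data"


def metar_url(airport_code: str) -> str:
    """Build the METAR page URL for one airport code (hypothetical helper)."""
    # Same query parameters, in the same order, as the example URL.
    query = urlencode({"ids": airport_code, "format": "raw", "date": 0, "hours": 0})
    return f"{BASE_URL}?{query}"


# A hypothetical handful of codes; the real list would cover every airport.
airport_codes = ["KMDW", "KORD", "KJFK"]
urls = [metar_url(code) for code in airport_codes]

for url in urls:
    print(url)
```

Each of those URLs could then be handed to whatever tool actually loads the page. Only the `ids` value changes from one airport to the next.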