A script to scrape information from web pages directly into a WordPress Custom Post Type

It was designed for one of my clients, who updates the same information across various websites, including their own. No API is available from the source website, so the only way to get the data is to scrape it. I quite enjoy things like this: they’re just ‘grey internet’ enough to be exciting, and they stretch my PHP skills a little. Actually it’s not that grey; Googlebot does the same thing when it comes round to indexing websites.

I’ve done a few of these over the years, mostly for creating web directories. I even had one that extracted data from emails piped into it, which required regex rather than XPath.

The script is designed to run outside of WordPress, though it could probably do with being converted into a plugin so that it can take advantage of WP-Cron. Getting the featured image loaded was the trickiest part. There are essentially five parts to the script:

1. Include the WP ecosystem, so I can access various functions

2. Set up some helper functions

3. Scrape our seed page, then individual sub pages

4. Use XPath to extract the data and then insert it into our CPT, which also has ACF fields

5. De-dupe any posts that have been removed from the source website
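
To make step 4 a little more concrete, here is a minimal sketch of the XPath extraction using PHP's built-in DOMDocument and DOMXPath. It is not lifted from the script itself: the sample markup, the XPath queries, the helper names `first_match()` and `extract_listing()`, the `listing` post type slug and the `summary` field name are all assumptions for illustration. The WordPress and ACF calls only run if WordPress is already loaded (step 1), so the extraction part can be tried on its own.

```php
<?php
// Hypothetical sample markup standing in for one scraped sub page.
$html = <<<HTML
<html><body>
  <h1 class="title">Example Listing</h1>
  <div class="summary">A short description.</div>
  <img class="main" src="https://example.com/images/listing.jpg">
</body></html>
HTML;

// Return the trimmed value of the first node matching $query, or null if none.
function first_match(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Parse the page and pull out the fields we care about.
// The queries are assumptions; real ones depend on the source site's markup.
function extract_listing(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    return [
        'title'   => first_match($xpath, '//h1[@class="title"]'),
        'summary' => first_match($xpath, '//div[@class="summary"]'),
        'image'   => first_match($xpath, '//img[@class="main"]/@src'),
    ];
}

$listing = extract_listing($html);

// Only attempt the insert when WordPress has been loaded and a title was found.
if (function_exists('wp_insert_post') && $listing['title'] !== null) {
    $post_id = wp_insert_post([
        'post_type'   => 'listing', // hypothetical CPT slug
        'post_title'  => $listing['title'],
        'post_status' => 'publish',
    ]);

    // update_field() is provided by ACF; 'summary' is a hypothetical field name.
    if ($post_id && function_exists('update_field')) {
        update_field('summary', $listing['summary'], $post_id);
    }
}
```

Keeping the extraction in its own function means the XPath queries can be checked against saved HTML without touching the database.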

Below is the script in full. If I were commercialising this, I would want some sort of interface for selecting the XPath from the source web page. I found a great example of a jQuery script that does this in a CodeCanyon plugin called RSS Autopilot. It provided a similar interface to FirePath or Chrome Web Scraper.