Using AppleScript To Turn Safari Into a Web Scraping Robot

R
Ukraine
chatGPT

With extensive help from chatGPT, I created an AppleScript that would download Ukraine web pages that contain Ministry of Defence estimates of Russian losses day by day.

Author

John Goldin

Published

January 16, 2024

Last Spring I did a post to display the trend in Russian losses as reported by the Ukrainian Ministry of Defence. I’ve been updating it every week or so. Originally I was downloading the HTML pages containing the daily reports from the Ukraine Ministry of Defence using the readr::read_lines() function. In December the reports were moved to a more general public information portal (at https://www.kmu.gov.ua/en/tag/oborona). The new server would not allow access via readr::read_lines(). I get a message that says “We apologize for the inconvenience…but your activity and behavior on this site made us think that you are a bot.” They’ve got a point. Presumably they think I might be an ill-intentioned Russkie. I was able to continue to access the page via Safari. So I thought I might be able to continue to get batches of updates through Safari by turning Safari into a bot.

Using AppleScript to Control Safari

To do that I needed to have AppleScript run a script that would control actions in Safari. Although I’ve had some small encounters with AppleScript since it appeared in the early 1990’s, I’ve never done real work with it. So my first problem was that I didn’t really know how to code with AppleScript. But now I know something that does: chatGPT.

I’ve been using chatGPT to help with R coding and I’ve installed GitHub co-pilot into RStudio. It’s very helpful. But I didn’t know whether it would help at all with Safari and AppleScript. To me it seems that AppleScript is a relatively obscure language.

I started small. In the chatGPT prompt I made a simple request:

Using MacOS AppleScript script editor, create a script that will open Safari and open a web page with part of the URL taken from the current date.

Quickly chatGPT produced some AppleScript to fetch a page. I needed to fetch pages for a range of dates so I got involved in the issue of how to get that set up. The process didn’t go very smoothly. So I decided I needed to start over while giving chatGPT a more concrete and explicit statement of what I wanted to do. Here’s my second attempt:

Let’s start over. I want to get a url that ends with a date in the form of DDMMYYYY. An example of the URL is: https://www.kmu.gov.ua/en/news/zahalni-boiovi-vtraty-protyvnyka-z-24022022-po-06012024 I want to ask for input of a date and then return with a url like this that ends with that date.

This worked better. Usually chatGPT would respond with some code along with an almost line by line explanation of what the code was intended to do. It gave me lots of commentary on what was happening with the code. I got into a rhythm. After chatGPT would suggest some code I would change it a bit to move it toward the direction I was going for. Not infrequently that would produce an error. Next I would quote the error message to chatGPT, chatGPT would apologize and then give me something else to try. Over a period of several hours I got closer and closer to something that fit what I needed.

I ended up with a script that asked for a starting and ending date, caused Safari to open the Ukraine reporting page for each date in the range, and saved the HTML page to a file. I was able to run it and get the results I needed for my Ukraine stats page.

I used it to get the stats for 200 days. After I added another small batch of dates, I got the message that the web site suspected I was a bot. But a day later I was able to get another batch of dates. So long as I limit the volume of requests in a short space of time, it seems to work fine.

Just to give some idea of what AppleScript code looks like, here is a function takes a starting and ending date and then creates a list of all the dates in that range.

A sample of AppleScript code that shows how it is formatted.

At the bottom of this post you can find the full script that I ended up with.1 It gets the job done.

The way in which chatGPT can explain the code it has created is a huge help. There’s a line later in the script that runs a shell command to copy a file. I wasn’t sure exactly what was going on in that line so I asked chatGPT to explain it to me. Here’s what it said:

Certainly! Let’s break down what this line in the AppleScript is doing:

do shell script "mkdir -p " & quoted form of (POSIX path of folderPath) & "; echo " & quoted form of theSource & " > " & quoted form of posixPath

This line of code is executing a shell command from within AppleScript. It is composed of two main parts:

  1. Creating the Directory (if needed):
    • mkdir -p: This is a shell command that creates a directory. The -p flag ensures that the command does not throw an error if the directory already exists and also allows the creation of nested directories as needed.
    • & quoted form of (POSIX path of folderPath): This AppleScript part concatenates the mkdir -p command with the path where the file should be saved. The folderPath is an AppleScript variable that holds the path in AppleScript’s path format (colon-separated). POSIX path of folderPath converts this path to a POSIX path (slash-separated, which is the standard format in shell environments). quoted form of ensures that the path is correctly quoted in case it contains spaces or special characters.
  2. Writing the Source Code to a File:
    • ;: This is a semicolon used in shell commands to separate multiple commands. It allows you to execute several commands in sequence.
    • echo: This is another shell command that outputs the string that follows it.
    • & quoted form of theSource: This concatenates the echo command with the source code of the web page. theSource is a variable in your AppleScript that holds the HTML source code of the page.
    • >: This is a redirection operator in shell commands. It takes the output of the command on its left (in this case, the echo command) and writes it to the file specified on its right.
    • & quoted form of posixPath: Finally, this part specifies the file to which the source code should be written. posixPath is the full path to the file, and again, quoted form of is used to handle spaces or special characters in the path.

In summary, this line in your script is creating the necessary directory (if it doesn’t already exist) and then writing the HTML source code of a web page to a file at the specified path. This is a common technique to perform file and directory operations from within AppleScript, leveraging the power of shell commands.

This is typical of the amount and style of explanation chatGPT provides as it suggests code.

Anthropomorphizing chatGPT

It’s difficult to resist thinking of chatGPT as a quasi-living entity. I felt like it was a team mate. That’s not the same thing as regarding it as being a human being. I think we often have experiences where we are interacting with inanimate objects and talking about them as living beings. One might say something like “My car is acting up today.” Or “The lock on my front door always gives me a hard time.” We don’t really think of the car or the door as being alive. But we do think of them as having some sort propensity to behave in a particular way.

That may be the same as saying they have a personality. Personality in this case may refer to a set of our expectations about how an entity is likely to behave. As we interact with chatGPT our expectations of what it might do and say change. We develop our sense that it has a personality. For example, I certainly notice that it’s quick to apologize, even when something is my fault. That affects how I respond to it.

GPT’s Summary of the Result

When I felt I had finished the script, I fed the completed script back into chatGPT, told it I was finished for now, and that I was giving it the finished script in case I wanted to come back and make changes later. I wasn’t really expecting anything more than that. I was surprised when I got a generous closing statement, including praising me.

Thank you for sharing your final script. It’s great to hear that you’ve successfully used it to download a significant number of daily reports. Your script appears well-structured and covers a range of functionalities, from date handling to web page interaction and data processing.

It’s clear you’ve put a lot of effort into ensuring the script not only retrieves data effectively but also handles potential errors (like 404 pages) and organizes the saved data in a structured manner. The inclusion of logging at various stages is a good practice for monitoring the script’s progress and diagnosing any issues.

Your experience with this script is a testament to the power and versatility of AppleScript, especially when combined with other tools and technologies. It’s also a great example of how automation can be applied to gather and organize information efficiently.

If you plan to revisit or further develop this script in the future, or if you have any more questions or need assistance with AppleScript or any other topic, feel free to reach out. Happy scripting!

My initial reaction to this was, Wow, what a suck-up! Is chatGPT going to ask me for a job recommendation later? The paragraph definitely has a certain tone.

I was curious about this so I had chatGPT regenerate a different response to the same prompt. The second time it went in a different direction and provided more of a summary of what the script showed. I like this as an example of getting some interesting variation in responses to the same prompt.

Thank you for sharing your final script. It’s great to hear that you have successfully used it to download over 200 daily reports. This script is a comprehensive example of how AppleScript can be used to automate web data retrieval, process date ranges, interact with a web browser, and handle file operations.

Your script effectively demonstrates several key AppleScript techniques, including:

  1. Date Manipulation: Generating a range of dates based on user input.

  2. Web Interaction: Navigating to specific URLs in Safari and retrieving page source.

  3. Content Validation: Searching the page source for specific strings to check for errors like “404 Not Found.”

  4. File System Operations: Creating directories and saving files with dynamic names based on the retrieved data.

  5. User Interface Interaction: Prompting the user for dates and confirmation, and logging progress and results.

This script can serve as a valuable reference for similar tasks or for anyone learning AppleScript and its application in automating web-related tasks. If you plan to revisit or modify this script in the future, it might be helpful to add comments throughout the script to describe the purpose of each section and any specific logic used. This practice can make it easier to understand and update the script later.

If you have any more questions or need further assistance with your AppleScript projects, feel free to ask!

The Final Code

Here is the full code for the AppleScript I ended up using. I doubt I could have done this without chatGPT. If I had attempted something like this AppleScript it would have taken me much, much longer. So much so that it’s possible I would have given up before finishing it.

Show the full AppleScript code
-- Function to prompt for date input and return it as a date object
on promptForDate(promptMessage)
    -- Get the current date
    set firstDate to ((current date) - 7 * days)
    set currentDay to day of firstDate as number
    set currentMonth to month of firstDate as number
    set currentYear to year of firstDate as number
    
    -- Format the current date as MMDDYYYY
    set defaultDate to my zeroPad(currentMonth) & my zeroPad(currentDay) & currentYear as string
    
    set inputDate to text returned of (display dialog promptMessage default answer defaultDate)
    
    -- Length check for the input
    if length of inputDate is not 8 then
        display dialog "Invalid date. Please enter 8 digits for the date as MMDDYYYY." buttons {"OK"} default button 1
        return promptForDate(promptMessage) -- Recursively call the function until valid input is provided
    end if
    
    -- Convert the string to a date object
    set dayPart to text 1 thru 2 of inputDate
    set monthPart to text 3 thru 4 of inputDate
    set yearPart to text 5 thru 8 of inputDate
    set formattedDateString to dayPart & "/" & monthPart & "/" & yearPart
    
    return date formattedDateString
end promptForDate

-- Function to generate all dates between two given dates
on generateDates(startDate, endDate)
    set currentDate to startDate
    set dateList to {}
    
    repeat until currentDate > endDate
        copy currentDate to the end of dateList
        set currentDate to currentDate + 1 * days
    end repeat
    
    return dateList
end generateDates

-- Main script execution
tell application "System Events"
    log "And away we go..." & ((current date) as string)
    
    set startDate to my promptForDate("Enter the start date (MMDDYYYY):")
    set endDate to my promptForDate("Enter the end date (MMDDYYYY):")
    
    set allDates to my generateDates(startDate, endDate)
    
    -- Check on the number of dates we are about to process
    set userResponse to button returned of (display dialog "Count of dates: " & (count of allDates) & " from " & (item 1 of allDates) & " to " & (item (count of allDates) of allDates) buttons {"Cancel", "OK"} default button 2)
    
end tell
repeat with aDate in allDates
    -- log "in loop " & aDate
    tell application "Safari"
        set formattedDateDMY to my zeroPad(day of aDate as number) & my zeroPad(month of aDate as number) & (year of aDate as number)
        set formattedDateYMD to (year of aDate as number) & "-" & my zeroPad(month of aDate as number) & "-" & my zeroPad(day of aDate as number)
        -- log "formattedDateDMY is " & formattedDateDMY
        -- log "formattedDateYMD is " & formattedDateYMD
        
        -- for debugging, verify before opening web page
        -- set userResponse to button returned of (display dialog "Verify date DMY: " & formattedDateDMY & "Verify date YMD: " & formattedDateYMD buttons {"Cancel", "OK"} default button 2)
        
        set theURL to "https://www.kmu.gov.ua/en/news/zahalni-boiovi-vtraty-protyvnyka-z-24022022-po-" & formattedDateDMY
        
        log "URL: " & theURL
        
        activate
        open location theURL
        delay 5 -- Adjust the delay as necessary
        -- Get the source code from Safari
        set theSource to source of document 1
        -- log "after set theSource to source of document 1"
    end tell
    -- Assuming theSource contains the HTML source code
    
    -- String to search for
    set searchString to ">404</div>"
    
    -- Use AppleScript's text item delimiters to check if the string is in the source
    set AppleScript's text item delimiters to searchString
    set textItems to text items of theSource
    set AppleScript's text item delimiters to "" -- Reset the delimiters
    
    -- Check if the string indicating 404 error was found
    if (count of textItems) > 1 then
        log "404 NOT FOUND:" & theURL
        -- Handle the case where the string is found, i.e., don't save the page source
    else
        -- Handle the case where the string is not found
        -- Define the file path and name
        set homePath to path to home folder as string
        set folderPath to homePath & "Documents:R_local_repos:ukrainestats:ukr_reports:"
        set fileName to "ukraine_stats_" & formattedDateYMD & ".html"
        set fullFilePath to folderPath & fileName
        
        -- log "fullFilePath is " & fullFilePath
        
        -- Replace colons with slashes for shell command
        set posixPath to POSIX path of fullFilePath
        
        log posixPath
        
        -- Save the source code
        do shell script "mkdir -p " & quoted form of (POSIX path of folderPath) & "; echo " & quoted form of theSource & " > " & quoted form of posixPath
    end if
    
    tell application "Safari"
        -- Check if there is at least one window open
        if (count of windows) > 0 then
            set currentWindow to front window
            
            -- Check if there is at least one tab open in the current window
            if (count of tabs of currentWindow) > 0 then
                set currentTab to current tab of currentWindow
                set tabName to name of currentTab
                
                -- Log the name of the current tab
                -- log "Closing tab: " & tabName
                
                -- Close the current tab
                tell currentWindow to close currentTab
            end if
        end if
    end tell
    
end repeat

-- Function to add a leading zero to single-digit numbers
on zeroPad(anumber)
    if anumber < 10 then
        return "0" & (anumber as string)
    else
        return anumber as string
    end if
end zeroPad

Footnotes

  1. Unfortunately the code does not include syntax highlighting. For the code sample above I took a screenshot of the code as it appears in Apple AppleScript editor. This blog is created using Quarto, and Quarto relies on pandoc highlighting rather than Highlight.js.↩︎

Reuse