This support article originally appeared as a blog post on the Talis website.
The Talis Developer Community is a group of people who need to ‘get stuff done’ with Talis and are taking advantage of Advanced MIS and our APIs to make efficiency savings. This contribution to the community comes from Tim Graves, Systems Librarian at the University of Sussex.
Anyone fancy checking 49,937 links in 8 hours?
Over the last couple of years, I have been refining a Python script that checks every link on the current year’s reading lists and tells us which ones are not working.
The results have been positive – from my perspective in Systems. Every year, I have presented colleagues with a huge spreadsheet of links that have been flagged up with the dreaded 404 ‘Page Not Found’. I don’t imagine these spreadsheets of dead links have been anywhere near as welcome to my colleagues in the reading list team who then need to trawl through them.
The original checking process, though, was inelegant. Fiddly. Something only I could do.
In the first year, I had to download a CSV file of all items on current lists from Aspire, manually extract the three columns that I wanted, then isolate the rows that contained a hyperlink.
In the second year, I rewrote my Python script to do all the CSV reformatting for me. The improvements worked. But again, it made for ugly, overcomplicated code that still required me to export a CSV file to start it off.
It was still a job that only I could do.
No chance of fobbing it off on automation.
And then we subscribed to Advanced MIS
Advanced MIS brought with it not only access to the database, but also an API.
That all beckoned to me with the alluring promise of completely automating the job, so that I never had to bother with it again. Using the right machine, I would be able to schedule the link checker to run any time I chose, and send out its results by email or SFTP.
So, for instance, the script could run every week and email its findings to my colleagues. I would be able to forget it existed, and leave the brainwork to other people. If my colleagues wanted, a small tweak to the script could report only on items that had been edited that week, which would give the team a much smaller, regular, ongoing list to deal with.
The best bit was that I would be able to share the code with all Aspire sites, spare them the hassle of having to write it for themselves, and push myself forwards for a Talis Innovation Award 🙂
How does it work?
Advanced MIS fits into the new process by letting the script connect directly to our analytics database for all the data, which dispenses with the need for any CSV faffery.
The first thing that the script does is open a connection to our analytics database, using the four details provided by Talis (sketched in code after the list below):
- Host
- Database name
- User name
- Password
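For anyone curious what that connection step looks like, here is a minimal sketch. It assumes the Advanced MIS database is PostgreSQL-compatible and reachable with the psycopg2 driver, and that the four details live in a small config file; the file name, section name and keys are my own placeholders, not anything Talis prescribes.

import configparser
import psycopg2

# Read the four connection details from a config file
# ('mis.ini' and its section name are placeholders).
config = configparser.ConfigParser()
config.read('mis.ini')
mis = config['advanced_mis']

connection = psycopg2.connect(
    host=mis['host'],
    dbname=mis['database'],
    user=mis['user'],
    password=mis['password'],
)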
The script then uses a straightforward SQL query to ask for every published item on current lists that has a web address:
SELECT
    i.item_url,
    l.title,
    i.web_address
FROM
    f_rl_items i,
    f_rl_lists l
WHERE i.list_guid = l.list_guid
  AND i.status = 'Published'
  AND i.web_address != ''
  AND i.time_period = '21/22'
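In Python terms, running that query and pulling back the rows might look something like this, continuing the connection sketch above. The comment about a weekly filter uses a guessed column name, not the real schema.

# Fetch one (item_url, list title, web_address) row per published item.
query = """
    SELECT i.item_url, l.title, i.web_address
    FROM f_rl_items i, f_rl_lists l
    WHERE i.list_guid = l.list_guid
      AND i.status = 'Published'
      AND i.web_address != ''
      AND i.time_period = '21/22'
"""
cursor = connection.cursor()
cursor.execute(query)
rows = cursor.fetchall()

# The weekly version mentioned earlier would only need one extra WHERE
# condition, e.g. AND i.date_last_edited >= CURRENT_DATE - 7
# (that column name is a placeholder; check the schema for the real one).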
At that stage, my tried-and-tested link-checking code takes over and checks the HTTP status of every hyperlink.
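The core of that checking code fits in a few lines. This sketch uses the requests library; the timeout, the HEAD-then-GET fall-back and the shape of the results are illustrative choices of mine, not necessarily how the real script does it.

import requests

def check_link(url, timeout=30):
    """Return the HTTP status code for a URL, or None if the request fails."""
    try:
        # Try a lightweight HEAD request first; some servers reject HEAD,
        # so fall back to a full GET when the answer looks unhelpful.
        response = requests.head(url, timeout=timeout, allow_redirects=True)
        if response.status_code >= 400:
            response = requests.get(url, timeout=timeout, allow_redirects=True)
        return response.status_code
    except requests.RequestException:
        return None

# Flag anything that did not come back with a healthy status.
broken = []
for item_url, title, web_address in rows:
    status = check_link(web_address)
    if status is None or status >= 400:
        broken.append((item_url, title, web_address, status))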
It takes the script some six to eight hours to run, but it certainly does it faster than I could. And it works nights!
I have now got the code to the point where it runs in Python on either Windows or Linux. I have not yet added the functionality to email or FTP, because I am not sure yet whether my colleagues want them (also, it means I would need to get myself another Linux server set up!).
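If the email step does turn out to be wanted, the standard library already has the pieces. A bare-bones version might look like the sketch below; the addresses and mail server are invented, and a real version would want some formatting and error handling.

import smtplib
from email.message import EmailMessage

# Build a plain-text report of the broken links found above.
# (All addresses and the server name here are placeholders.)
message = EmailMessage()
message['Subject'] = 'Broken reading list links'
message['From'] = 'linkchecker@example.ac.uk'
message['To'] = 'readinglists@example.ac.uk'
message.set_content('\n'.join(
    f'{status}\t{title}\t{web_address}'
    for item_url, title, web_address, status in broken
))

with smtplib.SMTP('mail.example.ac.uk') as server:
    server.send_message(message)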
What price do you pay for all this new Aspire functionality?
Nothing!
The code, along with a wiki of setup instructions, is all at:
Please help yourselves!
Don’t be worried if you have never used Python before. It is a clear and poetic language to get started with, and you will only ever need to add your connection details to a configuration file.
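To make that concrete, the configuration file from my earlier sketch would hold nothing more than the four connection details. Every value below is invented; you would paste in the real ones from Talis.

[advanced_mis]
host = mis-host.example.com
database = your_database_name
user = your_user_name
password = your_password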
If you improve the code, or add any emailing or FTP functions, please let me know, so that I can make sure they are added to GitHub (after I have swiped them for myself).
Tim Graves
University of Sussex Library
t.c.graves@sussex.ac.uk