
Validating Links in Records

Organizations probably will not let you know when a website address has changed, and the result can be annoying or embarrassing if the link goes nowhere or somewhere inappropriate. Checking for broken links can cause problems of its own, though - if not done carefully, link checkers can get out of control and crawl too much too quickly, disrupting service and potentially getting the crawling IP blocked. Automated checking also cannot identify cases where a link still works but no longer goes to the correct location - including cases where the website has been taken over by scammy or spammy sites. Below are some guidelines for keeping your links up-to-date:

Check all record links as part of the update process

The only way to confirm that links go where they are supposed to go is to follow every link in a record each time you update or modify it. This practice will keep you out of trouble if a domain name is taken over and now presents unsafe or unpleasant content. Link review should be a standard practice for all data managers.

Using an automated link checker

If you decide to use an automated link checker, follow these guidelines:

  1. Use the Record List page (e.g. https://test.cioc.ca/recordlist.asp or https://test.cioc.ca/volunteer/recordlist.asp) of your site as the starting point. It is designed for this purpose and ensures that you minimize what is crawled. To help you deal with large numbers of records, you can page the results: for example, https://test.cioc.ca/recordlist.asp?PS=500&Page=2 shows 500 records per page and loads the second page of results.

  2. The link checker program used must allow you to set a maximum recursion depth when following links. The depth should be 1 for most tools: the first (non-recursive) pass reaches the record details pages, and the single level of recursion checks the links on each record page (see the sketch after this list).

  3. Test the checker out in a controlled manner (e.g. with a recursion depth of 0 on a single record details page), and make sure you are actually prepared to deal with the results: if the report doesn't isolate actual problems, you may find yourself with a useless document listing tens of thousands of crawls and checks that you have to wade through. Link reports on thousands of records are enormous; be prepared for a significant review process. If your template (header, footer, menu items) has broken links, fix them before you validate all your records; otherwise your report will be cluttered with repetitions of the same error.

  4. Do not run link checker programs during peak hours (7am EST to 7pm EST for CIOC-hosted members). Limit the use of these tools to once every few months.

  5. Be aware that visits from link checker tools will register as Record Views and affect your Record View statistics (and therefore could have a small impact on your usage fees if run frequently).

  6. Remember that automated tools generally cannot crawl logged-in pages, and will operate like a public user. Many tools also respect the robots.txt file, if present.

  7. Because many sites are switching to SSL-only, be on the lookout for redirects in your report indicating that you need to change the protocol from http:// to https://.
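
If you script your own check instead of using an off-the-shelf tool, the same rules apply: start from a paged Record List page, keep the recursion depth at 1, and throttle every request. Below is a minimal Python sketch of that approach; the example URLs, the pattern used to recognize record details links, the delay, and the User-Agent string are all illustrative assumptions that you would need to adapt to your own site.

    # Minimal sketch: depth-1, throttled link check starting from one Record List page.
    # The URL pattern for record details links and the 2-second delay are assumptions.
    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://test.cioc.ca"                       # placeholder site
    LIST_URL = f"{BASE_URL}/recordlist.asp?PS=500&Page=1"   # paged Record List as the start
    DELAY_SECONDS = 2                                       # throttle between every request

    def fetch(url):
        """GET a page with a polite delay and an identifying User-Agent."""
        time.sleep(DELAY_SECONDS)
        return requests.get(url, headers={"User-Agent": "example-link-audit"}, timeout=30)

    def record_detail_urls(list_url):
        """Depth 0: collect links from the Record List page to record details pages."""
        soup = BeautifulSoup(fetch(list_url).text, "html.parser")
        for a in soup.find_all("a", href=True):
            # "record" is an assumed pattern for record details URLs; adjust to your site.
            if "record" in a["href"].lower():
                yield urljoin(list_url, a["href"])

    def check_outbound_links(record_url):
        """Depth 1: report the status of every external link on one record details page."""
        soup = BeautifulSoup(fetch(record_url).text, "html.parser")
        for a in soup.find_all("a", href=True):
            target = urljoin(record_url, a["href"])
            if target.startswith("http") and not target.startswith(BASE_URL):
                try:
                    status = fetch(target).status_code
                except requests.RequestException as exc:
                    status = f"error: {exc}"
                if status != 200:
                    print(f"{record_url} -> {target}: {status}")

    if __name__ == "__main__":
        for detail_url in record_detail_urls(LIST_URL):
            check_outbound_links(detail_url)

Note that this checks only outbound links (the ones most likely to rot), stays at a recursion depth of 1, and waits between requests so it cannot overwhelm the site.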

If you use a link checking service or software that cannot throttle its requests and it causes a deterioration of service, you will be banned from using that software/service. An example of a free service is https://validator.w3.org/checklink (if you use this service, a page size of 500 is recommended - it cannot handle thousands of record links on one page).
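
When feeding an external service like this, you control the scope through the starting URLs. Here is a minimal sketch for generating the paged Record List URLs at a page size of 500, assuming you know roughly how many pages your record count works out to; the page count and base URL are placeholders.

    # Minimal sketch: build paged Record List URLs (500 records per page) to submit
    # one at a time to an external link checking service.
    BASE = "https://test.cioc.ca/recordlist.asp"   # placeholder site
    TOTAL_PAGES = 6                                # assumption: ceil(record count / 500)

    for page in range(1, TOTAL_PAGES + 1):
        print(f"{BASE}?PS=500&Page={page}")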

There are also a wide variety of link checker add-ons and tools for your browser that automatically check the links on the current page. These may help identify "invisible" link problems (such as missing logo files), but they are not a substitute for actually clicking on and reviewing the links in individual records.
