Recently I blogged about creating health-check monitoring endpoints in your application, even when it’s deployed in production, and pinging them periodically. These are kind of like auto-tests, but they live far along the ALM – way beyond Dev and Testing, they live on in Production.
The intention is that these tests can (if required) perform some quite advanced testing, but they do so in a way that doesn’t strain the system and doesn’t leak any info to the caller. This is important because these endpoints will probably be publicly available, albeit at an unpublished URL. Even so, if these endpoints can be found, someone eventually will find them, so I think it’s wise to ensure that anyone who does discover the URL and call it gets back almost nothing worth getting excited about.
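To make the “reveals almost nothing” point concrete, here’s a minimal sketch in Python (the same idea ports to any stack). The path and `run_checks()` are hypothetical placeholders, not anything from my actual application; the key part is that the response is a bare 200 or 503 with an empty body.

```python
# Minimal sketch of a health-check endpoint that reveals nothing to callers.
# run_checks() is a hypothetical stand-in for your real internal tests;
# the handler maps its result to a bare status code with an empty body,
# so an outsider who finds the URL learns only "up" or "down".
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_checks() -> bool:
    """Placeholder for the real internal tests (DB reachable, queue
    draining, disk space, ...) - swap in whatever advanced testing
    the application needs."""
    return True

def status_for(checks_pass: bool) -> int:
    """Map the check result to a bare HTTP status - nothing else leaks."""
    return 200 if checks_pass else 503

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health-xyzzy":  # an unpublished URL in practice
            self.send_response(status_for(run_checks()))
            self.send_header("Content-Length", "0")  # empty body by design
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

# To run: HTTPServer(("", 8080), HealthHandler).serve_forever()
```

An empty body on both success and failure means even the “down” response gives an attacker nothing to work with, while a monitoring service only needs the status code anyway.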
So creating and exposing the endpoint is relatively easy; the harder question is what service should ping it. My requirements:
- It’s a secure service.
- It logs recent results.
- It alerts you when things go wrong.
- It alerts you when things are back on track.
- It’s fault tolerant (you don’t want to find you’ve lost all your jobs).
- It has a reasonable availability SLA – it would be just your luck to find the pinging service is down for the same 5 hours your production site is. In that case you’ll hear about it from your users first.
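The core of any such service boils down to a ping-and-log loop. A minimal sketch, my own illustration rather than any vendor’s API: any non-2xx response or connection error counts as a failure, and a rolling in-memory log covers the “logs recent results” requirement.

```python
# Illustrative sketch of the ping-and-log core of a monitoring service.
# check_once() is my own illustration, not any vendor's API.
import time
import urllib.request
import urllib.error
from collections import deque

RECENT = deque(maxlen=50)  # rolling log of (timestamp, ok) results

def check_once(url: str, timeout: float = 10.0) -> bool:
    """Ping the health-check URL once; True means a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except OSError:
        # Covers urllib.error.URLError/HTTPError (4xx/5xx, DNS failure,
        # connection refused) as well as socket timeouts.
        ok = False
    RECENT.append((time.time(), ok))
    return ok
```

A real service would run this on a schedule (every 5 minutes in my tests) and feed the result into its alerting logic.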
I’ve now investigated 3 services that can do the pinging for me:
- Azure VM management – custom endpoint for a URL on that VM
- Azure Scheduler – a new base IaaS offering from MS that allows me to schedule anything and call a URL.
- SetCronJob.com – an external 3rd party service, with dead easy setup and amazing pricing.
Today I had them all set up, and I went and killed the AppPool on my application, knowing this would cause my health-check endpoint to return 503 and various alarm bells would start ringing. I wanted to check the alarm bells. All 3 services above were set to ping my URL every 5 minutes, so I expected emails within 5 minutes of killing the AppPool. Here’s what happened:
Azure VM management:
Within 5 minutes I had an email in my Inbox from the Management Services part of Azure alerting me that things were toast. This kind of monitoring works on a state-change basis: it sends one email to say your endpoint is down, then keeps retrying as per the normal cycle (5 mins in my case) – and even if it failed 100 more times, I’d get no more emails. Then, when the problem is resolved, I get one more email saying it’s been ‘resolved’. I quite like this – it’s not too spammy – but it does rely on you taking that first email seriously and not missing it.
SetCronJob.com:
Within 5 minutes, this service had also emailed me telling me the endpoint was not returning valid success codes. Unlike Azure, it sends me an email every 5 minutes. I haven’t re-enabled my AppPool, so as of writing this blog it has emailed me 10 times – and it has now emailed me again saying it’s disabling the CronJob. This is all OK, but it does mean I have to remember to manually re-enable the job after I’ve fixed the problem. Personally I prefer the Azure approach above, but I’m splitting hairs in a way – I could live with either.
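The two alerting styles above can be modelled side by side. A sketch of the behaviour I observed – neither class is a real API, and the email counts are just the ones from my test:

```python
# Toy models of the two alerting styles I observed - not any vendor's API.
class EdgeTriggeredAlerter:
    """Azure-style: one email when the endpoint goes down, one on recovery."""
    def __init__(self):
        self.was_up = True
        self.emails = []

    def record(self, up: bool):
        if up != self.was_up:  # only state *changes* trigger an email
            self.emails.append("resolved" if up else "down")
            self.was_up = up

class LevelTriggeredAlerter:
    """SetCronJob-style: an email per failed check, then the job disables
    itself after max_failures and must be re-enabled by hand."""
    def __init__(self, max_failures: int = 10):
        self.max_failures = max_failures
        self.failures = 0
        self.enabled = True
        self.emails = []

    def record(self, up: bool):
        if not self.enabled:
            return  # disabled jobs no longer check or email
        if up:
            self.failures = 0
            return
        self.failures += 1
        self.emails.append("down")
        if self.failures >= self.max_failures:
            self.emails.append("job disabled")
            self.enabled = False
```

The edge-triggered model sends 2 emails per outage no matter how long it lasts; the level-triggered one spams you proportionally to the outage length but then needs manual intervention – exactly the trade-off described above.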
Azure Scheduler:
This one is strange. The jobs are failing for sure, as expected (see below). Azure has the concept that a job can error, but only after it errors 5 times does it record that as a fault. You can see below the log of the failing job – the ‘Retry’ column cycles from 0 to 4 and then resets.
The dashboard view shows something similar and seems to confirm that 5 failed responses in a row count as one fault (9 faults for 50 errors).
This is all good, and you can see that the QA1 job has had 50 failures and 9 faults, yet I didn’t get a single email alert warning me about this. I can’t find anywhere in the Scheduler to configure alerting – whereas the VM endpoint monitoring had options to email the service owner or an alternate address. So I’m a bit confused.
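My reading of that Retry column can be sketched as a rule: every 5th consecutive error gets promoted to a fault, after which the retry counter resets. This is my interpretation of the 0→4 cycle, not documented Scheduler behaviour:

```python
def count_errors_and_faults(results):
    """results: iterable of booleans, True = job execution succeeded.
    Models my reading of the Scheduler's Retry column: every 5th
    consecutive error is promoted to a fault, then the counter resets."""
    errors = faults = retry = 0
    for ok in results:
        if ok:
            retry = 0  # a success resets the retry cycle
            continue
        errors += 1
        retry += 1
        if retry == 5:  # the Retry column going 0->4 then resetting
            faults += 1
            retry = 0
    return errors, faults
```

Under this rule 50 unbroken errors would give exactly 10 faults, so the 9-faults-for-50-errors figure on my dashboard presumably means one retry cycle hadn’t completed yet when I looked.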
Scheduler is still in Preview and there are a few other things on the portal that are weird – so I presume it will all be ironed out over time.
At this stage SetCronJob.com is probably leading the pack in terms of simplicity, and it’s $10/year for 3000 job executions every day.
Azure Scheduler might win later (at $10/month) if you need mission-critical, fault-tolerant job scheduling. As per normal Azure offerings, your jobs are geo-replicated, so they will just keep working from a different data centre if something goes wrong. That kind of fault-tolerant high availability could be critical in production environments.