Rpi freezes every now and then, how to fix it with a watchdog?

Question

I'm building a system with a Raspberry Pi located in a very remote area, connected to the internet with an internet stick. The tests are promising so far, but the Pi freezes every now and then and I'm no longer able to connect to it. Because I don't want to take a 2-hour drive every time it freezes, I want to build a redundant system in which a second Pi checks the first.
In the worst case, the working Pi should cut power to the frozen system to reboot it.

Now my question, as a total noob when it comes to building electronics:

I checked out the ATXRaspi R3, but I'm not sure how to "digitally" trigger the 6-second press on that power controller from the other Pi to cut the power...

What would be the easiest way to cut power from another Pi? Any hints are very welcome.

berto · Accepted Answer · 2019-06-15T13:14:25.310

Before you go looking into additional hardware, please read up on what's called a "watchdog timer". The Raspberry Pi has a hardware watchdog built in that will power cycle it if the chip is not refreshed within a certain interval.

I have set up the watchdog on a Raspberry Pi 3 with a newish version of Raspbian, and it took very little configuration. The first thing to check is that the hardware watchdog is available (on my system, the installed version of Raspbian compiles watchdog support right into the kernel, so there is no need to load a kernel module):

pi@unicornpi:~ $ ls -al /dev/watchdog*
crw------- 1 root root  10, 130 Nov  3  2016 /dev/watchdog
crw------- 1 root root 252,   0 Nov  3  2016 /dev/watchdog0

If you see /dev/watchdog, you're all set. All you have to do is configure the watchdog facility built into systemd.

In the file /etc/systemd/system.conf, set the following lines:

pi@unicornpi:~ $ grep Watchdog /etc/systemd/system.conf
RuntimeWatchdogSec=10
ShutdownWatchdogSec=10min

What the lines above say is:

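RuntimeWatchdogSec=10: set the hardware watchdog timeout to 10 seconds and have systemd refresh it regularly (at least once every half interval, i.e. every 5 seconds). If the refreshes stop for 10 seconds, for example because the system has frozen, the hardware reboots the system.
ShutdownWatchdogSec=10min: on shutdown or reboot, if the system takes more than 10 minutes to go down, the hardware reboots the system.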
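
For reference, a minimal sketch of how the two settings sit in /etc/systemd/system.conf. On a stock install the entries are normally already present but commented out with a leading #, under the [Manager] section, so uncommenting and editing them is usually enough. The values are the ones from this answer, not defaults.

```
[Manager]
RuntimeWatchdogSec=10
ShutdownWatchdogSec=10min
```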

Once you have this configured and reboot, you will see something like this in the dmesg logs:

pi@orangepi:~ $ dmesg | grep -i watchdog
[    0.763148] bcm2835-wdt 3f100000.watchdog: Broadcom BCM2835 watchdog timer
[    1.997557] systemd[1]: Hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0
[    2.000728] systemd[1]: Set hardware watchdog to 10s.

If you see Set hardware watchdog to 10s you're all set.

The best way I've found to verify that the watchdog works is to overload the system. I've done this with a "fork bomb", which will completely saturate the system with garbage process forks. If you run this the Pi will become unresponsive and the watchdog should kick in. Your system should be up and running again after about a minute:

:(){ :|:& };:

Paste that into a shell and your system will be taken down. You've been warned.

More info on the watchdog system built into systemd is on the author's website.

Milliways · Answer 2 · 2019-06-14T08:44:23.363

Cutting power is a brute force method and has risks.

The conventional solution to lock-up problems is to use a watchdog.

There is a BCM hardware watchdog. To enable it, include dtparam=watchdog=on in /boot/config.txt.

In and of itself this does little, although it should restart the system if not "kicked" regularly. You can write code which opens /dev/watchdog to kick it off.
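
A minimal sketch of such code, in shell, under a few assumptions: the device is /dev/watchdog, the script runs as root, and nothing else (for example systemd's RuntimeWatchdogSec or the watchdog daemon) already holds the device open, since only one process can open it at a time. The function name feed_watchdog and its arguments are made up for illustration. Opening the device arms the timer, each write resets the countdown, and writing the character V just before closing asks the driver to disarm it, on drivers that support this "magic close".

```shell
#!/bin/sh
# feed_watchdog DEVICE INTERVAL [COUNT]
# Hypothetical helper (not from the answer): holds DEVICE open on fd 3 and
# writes one byte every INTERVAL seconds. COUNT=0 (the default) means forever.
feed_watchdog() {
    dev=$1
    interval=$2
    count=${3:-0}
    {
        i=0
        while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
            printf '.' >&3          # any write resets the countdown
            i=$((i + 1))
            sleep "$interval"
        done
        printf 'V' >&3              # "magic close": request a disarm before closing
    } 3> "$dev"
}

# Example (as root): feed_watchdog /dev/watchdog 5
```

If the loop stops running, because the process is killed or the whole system hangs, the writes stop and the hardware resets the Pi once its timeout expires. The 5-second interval in the example is an assumption, chosen to stay well below the timeout.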

There is also a watchdog daemon which you can configure to activate the watchdog; you should be able to start with sudo systemctl enable watchdog

PS Incidentally, if you want to pursue the brute force approach - don't bother cutting power - just pull the Reset pin (labeled RUN) low. This is equivalent to powering off then on again.
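
A rough sketch of how a second Pi could do this, assuming one of its GPIO pins (GPIO 17 here, chosen arbitrarily) is wired to the RUN pin of the Pi to be reset and the two boards share a ground. It uses the legacy sysfs GPIO interface, which is deprecated and may be missing or numbered differently on newer kernels. The function name pulse_reset, the pin number, and the 0.2 s hold time are all assumptions. The pin is driven low briefly and then switched back to an input so that the RUN line is released again.

```shell
#!/bin/sh
# pulse_reset GPIO_DIR HOLD_SECONDS
# Hypothetical helper: GPIO_DIR is the sysfs directory of an already exported
# pin, e.g. /sys/class/gpio/gpio17.
pulse_reset() {
    gpio_dir=$1
    hold=$2
    echo low > "$gpio_dir/direction"   # output, driven low: holds RUN low
    sleep "$hold"
    echo in > "$gpio_dir/direction"    # back to input: releases the RUN line
}

# Example (as root, on the watching Pi):
#   echo 17 > /sys/class/gpio/export
#   pulse_reset /sys/class/gpio/gpio17 0.2
```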

tlfong01 · Answer 3 · 2019-07-27T06:11:41.860

Question

Remote Rpi's freeze from time to time. How to wake them up?

Answer

Update 2019jul27hkt1406

I recently upgraded from my Rpi3B+ (Stretch) to an Rpi4B (Buster), and again I followed @berto's tutorial to set up the watchdog timer. Everything worked as smoothly as before. In other words, no changes need to be made to @berto's tutorial when upgrading to the Rpi4.

Last time I knew nothing about watchdog timers, so it took me more than 3 hours of googling to understand everything inside out (well, almost inside out). This time I knew what was going on, and all the Linux tricks, so it took me only a couple of minutes to complete @berto's tutorial.

2019jun18 Updates

After more thought, I have concluded that my answer is coming to an end. My conclusion is that @berto's watchdog tutorial and suggested experiment are good, and his answer is the real answer to the OP's question.

I did his suggested experiment successfully, verified the results with the fork bomb program, and after more than 10 hours of googling and reading, I think I finally understand the idea of a watchdog timer thoroughly.

Earlier I wrongly thought that I still needed to learn how to set the timer to 10 seconds or more. But as @berto says, 10 seconds is all that needs to be set. I also read that I can set the timer to as long as 16 seconds, and that the Linux watchdog default is even one minute. But that is not critical.

I have removed all the long-winded reading notes in the appendices to make the answer shorter. I would suggest that newbies not try to understand all the details of the watchdog, not to mention the much more complicated systemd, because life is short and those system things are too complicated for non-professionals.

I would like to add two points to end my answer.

(1) There are many reasons for an Rpi to hang after a couple of days (but usually not months). Often it is not the application program's fault, but the drivers or library functions creating too much garbage, e.g. sockets that are created and used but not properly disposed of. If it is the application program itself making the garbage, the program can do "garbage collection" and the problem is solved. But it is hard to remove garbage sockets that were not generated by the application program. So a watchdog timer is useful here.

(2) Other ways to keep garbage from using up resources include rebooting every now and then, by software or hardware. I think rebooting every morning, and also using a software-switchable power supply to reset the system, adds another layer of protection. And using only one Rpi is not very safe. Using two Rpi's as each other's watchdog (using UART for message passing, e.g.) adds one more layer of protection. Another method I have not explored is using ESP8266 WiFi sockets. I hope I can try that later.

This is the end of my answer. Cheers.

2019jun17 Updates

So I tried the fork bomb. The system rebooted after executing the program, in about 15 seconds.

2019jun16 Updates

I found @berto's fork bomb program a bit scary for a newbie, so I am learning Bash to find out what that fork bomb is doing. Basically it is just a function named ":", which is defined as calling itself twice, thus forking indefinitely, as fast as rabbits multiplying exponentially, using up all the resources, and crashing Linux.
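
To make that concrete, here is the same construct written with an ordinary function name instead of ":" (the name bomb is just for illustration). The call that would start it is deliberately left commented out; running it would hang the machine exactly like the one-liner.

```shell
#!/bin/sh
# Same shape as :(){ :|:& };: with ":" renamed to "bomb".
bomb() {
    # Start two more copies of this function, connected by a pipe,
    # and put them in the background so the caller returns at once.
    bomb | bomb &
}

# bomb   # do NOT uncomment unless you want to take the system down
```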

I have also found the following interesting version of forkbomb using Unicode symbols:

( ) { | & } ;

2019jun14/15 Updates

@thesnow suggests a very nice layered approach using a smart plug. I think the smart plug or smart IoT stuff is the way to go. However, I am a not-so-smart newbie in smart stuff, though I am keen to learn. So I am going to buy a smart plug, do some research, and improve my answer afterwards. For now, I have added some related learning resources in the reference section below.

I also found @berto's suggestion of using the Rpi's hardware watchdog timer very good. I have not played with any watchdog stuff before, so I am going to try it now. @berto's instructions are very detailed, but still a bit hard for me, because I don't know the commands "grep" and "dmesg" very well. So I googled and made some reading notes in the appendices below. Then I followed @berto's suggestion and struggled a bit to complete part 1. I have not yet rebooted, because I need to take a break to digest things. Anyway, here is the screen capture.

I rebooted and got the following dmesg:

I think I am going too fast and now need to take a break to first study more Linux things, like systemd, before coming back to carry on with the watchdog test.

/ to continue, ...

The Answer

I have the same problem. I am building a rooftop garden with a couple of Rpi's, each of which connects to various wireless (Bluetooth, WiFi) sensors, relays, and solenoids. There are two huge motors nearby, controlling big water tanks and lifts. The motors generate EMI and from time to time freeze nearby electronics.

My plan is to use software-switchable PSUs (Power Supply Units) to power frozen Rpi's and other devices off and on. (Bluetooth devices freeze most often. The Bluetooth and other little devices do not have any software reset command or hardware reset pin, so powering their 5V Vcc off and on is a quick and dirty, but still safe, workaround.) In short, the Rpi's regularly watch each other and their devices and POR (Power On Reset) any that have fallen asleep.

Of course I can also use a GPIO pin to trigger the Rpi's on-board hardware reset pin. But I am too lazy to do extra wiring, and too poor a hobbyist to afford professional/industrial grade non-stop system devices such as the SwitchDoc Labs Dual WatchDog Timer (see references below).

I modify ordinary DC-DC (12V to 5V) PSUs so that any Rpi or MCP23x17 GPIO pin can switch the PSU's LM2596/LM2941 voltage regulator chip on and off. (The LM2941 can be used for 1A current switches, the LM2596 for a 5V 3A PSU. The on/off pin is also connected to a push button, for manual power on/off testing.)

Actually, each of my 7 Rpi3B+'s is connected to a cheap DS3231 Real Time Clock module, which has a hardware interrupt pin that can reset the PSU, the Rpi, or other devices.

Whenever possible and practical, I tie all the devices' reset pins together (removing some of the pull-up resistors, so as not to overload the GPIO pin).

Now the external DS3231 RTC wakes up everybody in the morning, and switches off lights at midnight, so everybody goes to bed.

References

LM2596/LM2941 Based Software Resettable PSU / Current Switches - Rpi StkEx Discussion

Rpi Hardware watchdog Discussion

SwitchDoc Labs Dual WatchDog Timer

ATXRaspi R3 - LowPowerLab US$14.95

A hackable ESP8266 inside a smart plug Want to play with ESP8266 without worrying about the hardware? - Mat 2017aug06

Reverse Engineering 101 of the Xiaomi IoT ecosystem HITCON Community 2018 – Dennis Giese

Xiaomi WiFi socket + MiHome app 21,307 views

espHome [ESP8266/ESP32]

AliExpress WiFi Smart Plug

Smart device - Wikipedia

WiFi Garage Door Opener using ESP8266 - Ray Wang 2016may13 56,335 views

Appendices

Appendix A - WatchDog Timer Reading Notes

Watchdog timer - Wikipedia

Linux WatchDog Man Page

Linux Watchdog - General Tests

Appendix B - Linux commands grep and dmesg reading notes

Appendix C - systemd references

systemd System and Service Manager - FreeDeskTop

systemd - Wikipedia

Appendix D - Fork and Fork Bomb References

Fork (system call) - Wikipedia

Appendix E - Bash Learning Notes

score 2 · Answer 4 · answered Jun 14 '19 at 19:47

I have quite a few Pis. All of them except one ran flawlessly. The problem child would crash periodically and would never recover after a power outage without being power cycled again. I had it reboot itself every night via cron, and that helped somewhat.
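
A nightly reboot like that can be set up with a single line in root's crontab (edited with sudo crontab -e). The time of 03:00 is an arbitrary choice for illustration; the answer does not say which time was used.

```
# m h dom mon dow  command
0 3 * * * /sbin/shutdown -r now
```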

What fixed it though was taking the SD card and sensor hardware and putting them into another Pi. It has run without error ever since. Maybe you too have a hardware issue.

thesnow · Answer 5 · 2019-06-14T20:41:20.060

If you have Wi-Fi and just need to power off / power on, you could also consider using a smart plug. Amazon makes one for ~$25; you can power it on / off remotely and also set up timer routines if that's preferable. I've had a few for several months and they're quite reliable. You don't actually need an Echo or any other dedicated device. I use my smartphone. Amazon Smart Plug

Edit: I realize this doesn't provide a solution to the first part of the question, but if I had the prospect of a 2 hour drive if something went wrong I'd consider a layered approach.
