What I've done to deal with the problem
I reset the MSP430, which clears the hung state. I've slowed down the I2C clock frequency, beefed up the power supply, and changed the pull-up resistors: no change.
Running additional processes increases the number of errors
However, the number of errors increases significantly when I keep the ARM busy with two calculation-intensive programs. On the RPi I use Python with SMBus and i2c_LCD_driver, plus try/except routines to catch the bad access and reset the MSP. After a reset, the accesses (one every 5 seconds) continue fine until the next hang, ~1000 accesses later. On the MSP430, I use C for the interrupt service routines that manage the interrupts.
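The catch-and-reset approach described above could look roughly like the sketch below. It is not the actual program: `read_with_recovery`, `read_fn` and `reset_fn` are hypothetical names, and it assumes a failed SMBus access surfaces in Python as an `OSError`.

```python
import time


def read_with_recovery(read_fn, reset_fn, retries=3, settle_s=0.1):
    """Call read_fn(); on a bus error, reset the slave and try again.

    read_fn and reset_fn are hypothetical callables supplied by the caller:
    read_fn performs one I2C access, reset_fn resets the MSP430 by whatever
    means the hardware offers. Returns (value, number_of_errors_seen).
    """
    errors = 0
    for attempt in range(retries + 1):
        try:
            return read_fn(), errors
        except OSError:
            # Assumption: a failed access is reported as OSError.
            errors += 1
            if attempt == retries:
                raise  # give up after the last allowed retry
            reset_fn()
            time.sleep(settle_s)  # let the slave come back up
```

With the real bus, `read_fn` would wrap something like a single SMBus register read, and `reset_fn` would do whatever resets the MSP430 in this setup. Returning the error count makes it easy to log errors per 1000 accesses.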
My Question
I know both devices have hardware state machines that manage the I2C bus. Given that the bus hangs more often when the RPi is overworked, I suspect the implementation of the canned I2C Python code. Is anyone aware of any weird stuff about the Broadcom Serial Controller (BSC) I2C controller in the ARM that would hang a slave's state machine depending on how busy the OS managing the BSC is, or of other cases of very intermittent bus hangs?
Response to someone's clock question: I'm using an RPi Zero W, and yes, the I2C frequency changes if I modify the /boot/config.txt file. I confirmed with a scope that I can make it 10 kHz, 400 kHz, or whatever; however, the default is 62.5 kHz, not 100 kHz. This is the entry in that file when set to 10 kHz (the default is to comment out the second line):
dtparam=i2c_arm=on
dtparam=i2c_arm_baudrate=10000
Further issue
However, there is more strangeness with the OS, or whatever is causing my problem:
1) If I run my program using daemontools via a "run" program pointed to in /etc/service, the errors are as I described: ~1 per 1000 accesses.
2) However, if I run the program from a bash window, leaving the window open (i.e. not killing the program), errors come in at a rate of ~10 per 1000 accesses, i.e. 10x more often! That is, at the prompt I run python3 my_program.py. I've also tried sudo python3 my_program.py and it doesn't change: still tons of errors.
What is the difference? It has to be something to do with the OS!
Here is the result of ps aux | grep python3:
1) when I use daemontools (resulting in few errors):
~/code $ ps aux | grep python3
pi 2498 0.0 1.7 11296 7896 pts/0 T 06:50 0:06 python3 my_program.py
root 18682 3.7 1.7 10144 7732 ? S 13:07 0:03 python3 my_program.py
pi 18687 0.0 0.4 4368 1972 pts/1 S+ 13:09 0:00 grep python3
2) when I do it manually in a bash window using sudo (many errors):
~/code $ ps aux | grep python3
root 26325 0.0 0.8 7600 3556 pts/0 S+ 22:16 0:00 sudo python3 my_program.py
root 26329 0.6 1.7 11296 7832 pts/0 S+ 22:16 0:05 python3 my_program.py
pi 31327 1.0 0.4 4368 1800 pts/1 S+ 22:29 0:00 grep python3
Any idea what's going on?
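One way to narrow down "something to do with the OS" is to compare how the kernel schedules the two python3 processes. The sketch below is only a diagnostic aid, not an explanation: `scheduling_snapshot` is a hypothetical helper, it is Linux-only, and it assumes Python 3.3 or later for the `os.sched_*` calls.

```python
import os


def scheduling_snapshot(pid):
    """Return scheduler-related attributes of a process (Linux only)."""
    policies = {
        os.SCHED_OTHER: "SCHED_OTHER",
        os.SCHED_FIFO: "SCHED_FIFO",
        os.SCHED_RR: "SCHED_RR",
        os.SCHED_BATCH: "SCHED_BATCH",
        os.SCHED_IDLE: "SCHED_IDLE",
    }
    policy = os.sched_getscheduler(pid)
    return {
        "pid": pid,
        # Nice value: lower means higher priority under SCHED_OTHER.
        "nice": os.getpriority(os.PRIO_PROCESS, pid),
        # Fall back to the raw number if extra flag bits are set.
        "policy": policies.get(policy, str(policy)),
        # CPUs the process is allowed to run on.
        "cpus": sorted(os.sched_getaffinity(pid)),
    }


if __name__ == "__main__":
    print(scheduling_snapshot(os.getpid()))
```

Calling it with the PID of the daemontools-started process and then with the PID of the shell-started one (the second column of the ps listings) shows whether the nice value, scheduling policy, or CPU affinity differs between the two launch methods.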
Problem solved (but not why)
I increased the I2C bus frequency to 400 kHz, and my worst-case scenario (i.e. two busy-the-OS programs running, and a manual start from an SSH bash shell) has not yielded a single error in 3 hours (that's 2000+ accesses). I am surprised that speeding up the bus makes things better. Since the default is really 62.5 kHz, dropping it to 50 kHz earlier did not produce a statistically significant change, and thus I had concluded that the bus frequency didn't matter. I still think it's something to do with the ARM's BSC and the canned code used to service it: maybe a timer that times out and hangs the bus when the clock frequency is low (it runs with a huge number of errors at 10 kHz!). Time to move on. Thanks for any help provided and/or contemplated :)
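A rough calculation shows how much the bus frequency changes the time each access spends on the wire. This is only arithmetic and does not identify the cause. `transfer_time_ms` is a hypothetical helper, and it assumes an SMBus-style single-register read with 9 clock periods per byte, ignoring start/stop conditions and any clock stretching.

```python
def transfer_time_ms(bus_hz, payload_bytes=1):
    """Approximate time on the wire for one register read, in milliseconds.

    Assumed transaction shape: address+W, register number, repeated start,
    address+R, then payload_bytes of data. Every byte costs 9 clock periods
    (8 data bits plus one ACK/NACK bit).
    """
    bytes_on_wire = 3 + payload_bytes
    clocks = bytes_on_wire * 9
    return clocks / bus_hz * 1000.0


if __name__ == "__main__":
    # Frequencies mentioned in this post.
    for hz in (10_000, 50_000, 62_500, 400_000):
        print(f"{hz:>7} Hz: {transfer_time_ms(hz):.3f} ms per read")
```

Under these assumptions a one-byte read occupies the bus for about 3.6 ms at 10 kHz but only about 0.09 ms at 400 kHz, so a slow bus leaves a much longer window in which a timeout or a busy OS could interfere with a transfer in progress.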