Software Team Meeting March 16th, 2005
Where: At Andrew's Cabin in SE Portland (503.788.1343)
Attendees: AndrewGreenberg, DavidAllen, JoshTriplett, JameySharp, BartMassey, KeithPackard, IanOsgood
Agenda:
1. Eat pizza.
2. Try out Jamey's CAN driver/FC/sequencer/kitchen sink changes on actual hardware.
3. Implement the "nodes.c off" mode in the sequencer for hardware hacking.
4. Discuss Andrew's desired changes to rocketview for better node state transparency.
5. Talk about moving to 2.6.
6. Talk about this pesky PPC thing.
We didn't really get past #2, and surprisingly we didn't quite finish #1. Pretty much the whole evening, for most of us, was spent either discussing or testing issues of performance and message loss. David was working on the simulator, however.
Testing alternative Uncanny and run_threads
Jamey had prepared beforehand a patch to Uncanny and a variety of changes to run_threads, and (besides eating pizza) was mostly interested in testing those. Ian inspected and compiled the Uncanny patch. Andrew left us alone for a long time, and it was during that period that we discovered we couldn't turn the flight computer on... Eventually Andrew came back and fixed it in a few minutes.
We have 48MB of log data from our recent trip to Bend, covering the period when the rocket was turned on and at the tower. Obviously this data was collected with the unpatched Uncanny and unrevised run_threads. At this meeting we collected the following additional data:
1. unpatched Uncanny, new run_threads (40MB)
2. unpatched Uncanny minus some debugging, new run_threads (17MB)
3. patched Uncanny, old run_threads (35kB, 23kB)
4. patched Uncanny, new run_threads (5MB)
In several of these tests, we observed run_threads hanging, though we're no closer to establishing why. (In all cases it responded to SIGINT by correctly cleaning up.) Case #3 was particularly bad: the hang occurred soon after startup in both of the runs we tried.
Detailed analysis of this data is still needed to find out what effect each of these combinations had on message loss. Jamey intends to do that.
GPS data loss
During some of the tests, run_threads complained every five seconds, with nearly 100% consistency, that it had received a message 1102 with a bad checksum. On investigation we found that, at five-second intervals, these messages would arrive with up to half (about 30) of their CAN packets missing.
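The link between missing CAN packets and the bad-checksum complaint can be sketched as follows. This is only an illustration under stated assumptions: it assumes the message body is carried as a series of 8-byte CAN payloads, that the body is treated as little-endian 16-bit words, and that the checksum is the two's complement of the 16-bit sum of those words. None of that has been checked against the Jupiter documentation or against run_threads, and the function names are made up for this example.

```python
MASK16 = 0xFFFF


def words_from_packets(packets):
    """Join CAN payloads and split them into little-endian 16-bit words.

    The 8-byte payloads and the little-endian word order are assumptions
    made for this sketch, not facts taken from the GPS node firmware.
    """
    data = b"".join(packets)
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data) - 1, 2)]


def checksum16(words):
    """Assumed scheme: two's complement of the 16-bit sum of the words."""
    return (-sum(words)) & MASK16


def is_valid(words, checksum):
    """A message is valid when words plus checksum sum to zero mod 2**16."""
    return (sum(words) + checksum) & MASK16 == 0


# Hypothetical message body: 60 packets of 8 bytes each (made-up contents).
packets = [bytes([i + 1] * 8) for i in range(60)]
checksum = checksum16(words_from_packets(packets))

# Every packet arrived: the checksum matches.
print(is_valid(words_from_packets(packets), checksum))

# Half of the packets were lost on the way: the same checksum no longer fits.
print(is_valid(words_from_packets(packets[:30]), checksum))
```

With an additive checksum like this, losing even one packet will almost always change the sum, so a receiver that reassembles whatever arrived and then verifies the checksum reports the loss as a corrupt message rather than as a short one.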
Andrew pulled the Jupiter GPS receiver out and hooked it up directly to a serial port on his laptop, and we analyzed the bytes it produced that way. In the 1MB of data we collected, we saw no data loss in any message 1102.
Hooking the Jupiter board back into its normal configuration, we found that this data loss symptom occurred more often as the number of messages on the CAN bus increased: in particular, it occurred fairly reliably when the IMU was set to full data rate, but sometimes occurred before that as well.
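One way to put numbers on the relationship between bus load and loss is to bucket observations by message rate and compute the fraction of bad message 1102s in each bucket. The sketch below assumes the logs can be reduced to (frames per second, checksum OK) pairs, one per message 1102. That input format, the function name, and the bucket size are all hypothetical; the actual log format is not described here.

```python
from collections import defaultdict


def loss_rate_by_load(samples, bucket_size=100):
    """Return {bucket_start: fraction of bad messages} for each load bucket.

    samples is an iterable of (frames_per_second, checksum_ok) pairs, a
    hypothetical reduction of the logs with one pair per message 1102.
    """
    totals = defaultdict(int)
    bad = defaultdict(int)
    for frames_per_second, checksum_ok in samples:
        # Round the bus load down to the start of its bucket.
        bucket = (frames_per_second // bucket_size) * bucket_size
        totals[bucket] += 1
        if not checksum_ok:
            bad[bucket] += 1
    return {bucket: bad[bucket] / totals[bucket] for bucket in sorted(totals)}


# Made-up observations, only to show the shape of the input and output.
example = [(50, True), (60, True), (150, False), (160, True), (950, False)]
print(loss_rate_by_load(example))
```

If the loss fraction climbs steadily with the bucket, that points toward something load-dependent such as a driver or controller dropping frames; a flat rate would suggest looking elsewhere.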
More troubleshooting is needed. Current candidates for the cause of the problem are:
- The old CAN driver and PicCore, still used on the GPS node.
- The CAN controller in either the GPS node or the flight computer.
- Uncanny.
- run_threads' raw message handling.
The problem can't be the gps component of run_threads because that component is not involved in logging the raw CAN messages.