Thursday, January 2, 2020

Starliner Clock Off By Eleven Hours

NASA has not recently given any updates about the first (and intended to be the last) unmanned test of the Starliner capsule. For that we must turn to SpaceNews.
The spacecraft’s mission elapsed timer, which is set by communicating with its Atlas 5 rocket prior to liftoff, was off by 11 hours. That caused the spacecraft to think it was on the wrong phase of its mission after separation from the rocket’s upper stage, triggering thruster firings that used excessive amounts of fuel until ground controllers could take over and turn off the thrusters.
I previously opined that the snafu seemed reminiscent of the Mars Climate Orbiter from a couple of decades ago, which was lost because of a failure to convert between imperial and metric units and burned up in the Martian atmosphere. The revelation of an 11-hour time gap reinforces that opinion. A differential that large implies that this was not the result of clock drift between the launch system and the spacecraft, but of a fundamental mismatch in how the two encoded time, most likely having to do with timezone differences. Coding for timezones and daylight saving time is a very bug-prone domain, but issues mostly arise when people try to implement their own timezone logic rather than using the standard libraries. Local time should be irrelevant in an orbiting spacecraft, however. At the ISS altitude, Starliner would be crossing all 24 timezones about every 90 minutes. Storing timestamps as anything but UTC would be a disaster-prone mistake. The space station itself operates under UTC.
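To make the UTC point concrete, here is a minimal Python sketch of how a mission elapsed timer goes wrong when one side sends a local wall-clock time and the other side reads the same digits as UTC. Everything in it is an assumption for illustration: the function names, the fixed UTC-5 sender offset, and the liftoff time are mine, and nothing here reflects how Starliner or the Atlas V actually exchange time.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption for illustration: the sender keeps local time at a fixed UTC-5 offset.
SENDER_ZONE = timezone(timedelta(hours=-5))


def elapsed_if_misread_as_utc(liftoff_local: datetime, now_utc: datetime) -> timedelta:
    """Elapsed time when the receiver ignores the sender's offset and
    treats the bare wall-clock digits as if they were already UTC."""
    misread = liftoff_local.replace(tzinfo=UTC)  # same digits, wrong meaning
    return now_utc - misread


def elapsed_in_utc(liftoff_local: datetime, now_utc: datetime) -> timedelta:
    """Elapsed time when both sides convert to UTC before subtracting."""
    return now_utc - liftoff_local.astimezone(UTC)


# Hypothetical liftoff at 06:36 sender-local time, checked 15 minutes later.
liftoff = datetime(2019, 12, 20, 6, 36, tzinfo=SENDER_ZONE)
now = liftoff.astimezone(UTC) + timedelta(minutes=15)

correct = elapsed_in_utc(liftoff, now)            # 0:15:00
wrong = elapsed_if_misread_as_utc(liftoff, now)   # 5:15:00
error = wrong - correct                           # 5:00:00, exactly the zone offset

print(correct, wrong, error)
```

The error is constant and equal to the offset between the two assumed zones, which is why a large, fixed discrepancy looks more like an encoding mismatch than like drift.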

There is no relevant timezone that is offset 11 hours from UTC. Both UTC+11 and UTC-11 sit out in the middle of the vast Pacific Ocean. However, there is a roughly 11-hour time differential (10.5 hours, strictly, while Florida is on Eastern Standard Time) between the launch site in Florida and India - where Boeing previously outsourced critical software components of the disastrous 737 Max to $9-an-hour programmers. One possible scenario is that the timestamps were encoded with local timezones: the Atlas V's set to the timezone of the launch site, and Starliner's hard-coded to the Indian timezone in development and never updated.
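As a quick check on the arithmetic, the sketch below computes the gap between the candidate zones from their UTC offsets on the December launch date. The table and the helper name are my own, and the offsets are hard-coded rather than looked up from a timezone database. Strictly, the Florida-India gap comes out at 10.5 hours rather than a round 11, so the scenario above would need the reported figure to be rounded.

```python
from datetime import timedelta

# UTC offsets in effect in December (Florida on standard time; India has no DST).
UTC_OFFSETS = {
    "Florida (EST)": timedelta(hours=-5),
    "India (IST)": timedelta(hours=5, minutes=30),
    "UTC": timedelta(0),
}


def gap_hours(zone_a: str, zone_b: str) -> float:
    """Absolute wall-clock difference between two zones, in hours."""
    delta = abs(UTC_OFFSETS[zone_a] - UTC_OFFSETS[zone_b])
    return delta.total_seconds() / 3600


florida_india = gap_hours("Florida (EST)", "India (IST)")  # 10.5
florida_utc = gap_hours("Florida (EST)", "UTC")            # 5.0
india_utc = gap_hours("India (IST)", "UTC")                # 5.5

print(florida_india, florida_utc, india_utc)
```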

Assuming the anomaly occurred because of a data format mismatch between the two platforms, then who is to blame? Considering that Boeing designed the spacecraft, and that the launch system is a partnership between Boeing and Lockheed Martin, doesn't the blame fall on Boeing? That's the problem with NASA organizing missions among private vendors: there is endless blame-shifting when things go wrong. (It's the same reason I will only book flights directly with airlines. Otherwise getting stranded leads to the further frustration of being subjected to a perpetual loop of finger pointing among the involved parties.) Boeing is certainly looking to shift the blame.
Why the timer was off, particularly by such a large amount, and why it wasn’t detected prior to launch is not known. “If I knew, it wouldn’t have happened,” said Jim Chilton, senior vice president for Boeing’s space and launch division, at a Dec. 21 briefing. “We are surprised. A very large body of integrated tests, approved by NASA, didn’t surface this.”
If NASA approved the integration tests, the Boeing VP implies, then it's NASA's fault if software faults occur. When we look at the imperial/metric conversion bug of the failed Martian probe, the error occurred at the interface of NASA's platform and vendor software. The fault lay with NASA because they provided too loose a design contract. However, it's too early to say from public statements whether the same kind of problem occurred this time. Integration tests cannot root out all classes of bugs, although they should have ensured that the data flowing between the two systems was properly formatted. As long as the data was being properly sent from the launch vehicle to the spacecraft, it's possible that no integration test would catch that the spacecraft was then misinterpreting it.
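Here is a small Python sketch of that last point: a test that only validates the format of a time message can pass while the receiver still misreads its meaning. The wire format, function names, and offsets are all hypothetical, invented for illustration; this is not a description of the actual Starliner/Atlas V interface or of Boeing's test suite.

```python
from datetime import datetime, timedelta, timezone

WIRE_FORMAT = "%Y-%m-%dT%H:%M:%S"  # hypothetical format that carries no UTC offset


def encode_time(moment: datetime) -> str:
    """Sender side: writes local wall-clock digits and drops the offset."""
    return moment.strftime(WIRE_FORMAT)


def is_well_formed(message: str) -> bool:
    """The kind of check a format-only integration test performs."""
    try:
        datetime.strptime(message, WIRE_FORMAT)
        return True
    except ValueError:
        return False


def decode_as_utc(message: str) -> datetime:
    """Receiver side: assumes the digits are UTC."""
    return datetime.strptime(message, WIRE_FORMAT).replace(tzinfo=timezone.utc)


# Hypothetical sender clock at a fixed UTC-5 offset.
sent = datetime(2019, 12, 20, 6, 36, tzinfo=timezone(timedelta(hours=-5)))
message = encode_time(sent)
received = decode_as_utc(message)

format_ok = is_well_formed(message)   # True: the format test passes
round_trip_ok = received == sent      # False: the instant changed in transit
skew = sent - received                # 5:00:00

print(format_ok, round_trip_ok, skew)
```

A round-trip assertion on the decoded instant, rather than on the message's shape, is what would surface this class of bug.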
After landing, NASA leadership stated that the problem, once understood and corrected, would not necessarily prevent Boeing from proceeding with a crewed test flight. “It is not something that is going to prevent us from moving forward quickly,” NASA Administrator Jim Bridenstine said at a post-landing briefing Dec. 22. “We can still move forward quickly. We can get it fixed.” He also suggested, though, that the timer problem might lead to a more thorough review of the Starliner’s overall flight software or other systems.
The two most recent tests have revealed major failures (a parachute failed to deploy during the ground abort test, although we are told it was redundant and would not have endangered the crew). I would not want to be one of the astronauts about to jump into their rocket. They must be very brave.

While not many outlets have picked this up, Motley Fool chimed in from a business perspective.
If NASA and Boeing are willing to roll the dice, and proceed with a crewed mission to the ISS without redoing the uncrewed test, there's the potential for Boeing to reset the table and (partially, at least) catch up with SpaceX. It would be a risky move for both actors -- and would perhaps attract criticism from space fans, who might accuse NASA of gambling with astronauts' lives to benefit a favored contractor. On the other hand, as astronaut Mann pointed out, having humans in the cockpit might be key to preventing a second failure for Boeing.
The downside of killing a crew of astronauts when already under scrutiny for the deaths of two airliners' worth of passengers is probably steeper than the upside of a successful mission. The 737 Max was the first jet Boeing developed after Diversity became a core company value, and Starliner the first spacecraft. They are learning the hard way that outsourcing critical software is not our strength.
