Dealing with Emergent Complexity by Improving Software Engineering Processes
by Thomas Maufer on 22 January 2009 - 10:03:37 AM
Efficiently Addressing Emergent Complexity Requires Improved Processes
For an example of a process improvement,
look at uninitialized pointers and buffer overflow opportunities in
string handling routines. Both of these were relatively easy to catch,
even by a manual code review. They are much more likely to be caught
now because people have seen these problems over and over (and over!)
again and have learned what to look for through a very painful
education process. But any networking product code is complex not only
because the code is intrinsically complex, but also because it isn't
executing in isolation. Protocol implementations are a very special
(and especially difficult) kind of software because these programs have
no control over their inputs, and because input validation is difficult.
Even
the best programmers are perplexed imagining the near limitless ways
for protocol exchanges to go wrong. The two types of bugs I just
discussed are very localized to small blocks of code, and fairly easy
to spot (again, you have to know what to look for). But when networked
programs interact with other programs, or when complex function call
chains exist within a single program, and when different people write
the various parts of the code, it's much easier for mistakes to emerge
from this complexity.
In the TCP/IP model, the network makes no delivery guarantees. The Internet
Protocol (IP) layer is connectionless (it's also referred to as
"stateless"). It has really simple functionality (and relatively low
complexity), partly because it's not reliable. That's really important!
Moreover, the networks that IP runs over are worse than simply not reliable:
- Network traffic may be corrupted, including truncation or even having extra data appended.
There is a very weak header checksum in the IP header but it doesn't
protect the rest of the packet: The IP payload. In fact, it barely
protects the IP header!
Many people might assume that the MAC checksum protects the frame. While it's true that the MAC (Ethernet) checksum is
much stronger than the IP header checksum, it only protects the frame
when it's on the wire -- not when it's inside a switch or router! So a
packet can be fine when it arrives at a switch, be corrupted inside the
switch, and when it leaves the switch on the outbound interface, the
packet will have a newly calculated checksum that *will* be correct,
but the packet is no longer the same as the one that arrived!
Finally,
the MAC checksum, while admittedly stronger than the IP header's
checksum, can't detect a wide class of multi-bit corruptions that can
change the packet without affecting the checksum. These classes of
corruption are therefore undetectable.
The only guaranteed way
to ensure that a received packet is identical to what was sent is to
use a cryptographically strong checksum that depends on a securely
negotiated session key.
- Traffic may be duplicated, sometimes spectacularly.
- Traffic may be reordered or delayed by varying amounts.
When network traffic traverses WANs, some of the above effects might be
more likely than in LAN scenarios, but they can appear anywhere. It's
really hard to write code that can efficiently expect the unexpected.
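To make the checksum point concrete, here is a minimal Python sketch. It assumes the 16-bit ones'-complement algorithm described in RFC 1071 (the one the IP header checksum uses) and, as a stand-in for "a cryptographically strong checksum that depends on a session key," an HMAC-SHA256 tag. The key, the sample bytes, and the helper names are all made up for illustration; a real session key would be negotiated securely, not hard-coded.

```python
import hashlib
import hmac
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold any carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def swap_first_two_words(data: bytes) -> bytes:
    """Simulate a multi-bit corruption: exchange the first two 16-bit words."""
    return data[2:4] + data[0:2] + data[4:]


def keyed_tag(key: bytes, data: bytes) -> bytes:
    """HMAC-SHA256 tag; stands in for a keyed, cryptographically strong check."""
    return hmac.new(key, data, hashlib.sha256).digest()


# Hypothetical values, for illustration only.
session_key = b"example-session-key"
original = b"\x12\x34\xab\xcd" + b"payload"
corrupted = swap_first_two_words(original)

same_checksum = internet_checksum(original) == internet_checksum(corrupted)
same_tag = hmac.compare_digest(
    keyed_tag(session_key, original), keyed_tag(session_key, corrupted)
)
print("packets differ:   ", original != corrupted)
print("checksums match:  ", same_checksum)
print("keyed tags match: ", same_tag)
```

Because the simple sum does not care about word order, swapping two 16-bit words changes the packet but not its checksum. The keyed tag changes, which is the property the paragraph above is asking for.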
- Networks connect implementations of standards written by different people. This has nothing to do with malicious network exploits.
- Interoperability (or the lack of it)
means two communicating implementations won't behave exactly the same
in all circumstances. This divergence of behavior causes or exposes
bugs that would stay hidden if the implementation only ever received
standards-compliant traffic.
Is a bug still a bug if it only appears for certain classes of input?
Absolutely! A developer can't possibly predict what kinds of broken
traffic their code will be presented with in real-world networks. Code
that only accepts standards-compliant traffic would be too brittle to
use in an open IP network and would crash under the slightest
provocation.
- Software has bugs, and some packets will *start out*
broken, at least in the eyes of the receiver. Whatever damage the
network does before the packets arrive at the receiver will serve to
make those packets worse, not better. Packets that start out broken
will not be fixed by the network.
Even if a receiver can tell that a packet is wrong, sometimes there is
enough good in it that the receiver can figure out what the sender
meant. This is the basis for Postel's Law: "Be conservative in what you do and liberal in what you accept from others." In practice, that's really easy to say but very
hard to do. Inferring what the sender meant when packets arrive over an
actively malicious network is very hard. It's not surprising that
programmers write code that isn't perfect.
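As a small illustration of what "liberal in what you accept" can look like without becoming careless, here is a Python sketch of a parser for a made-up record format: a 1-byte type, a 2-byte big-endian length, then the value. The format, the `MAX_VALUE_LEN` limit, and every name in it are assumptions invented for this example, not part of any real protocol. The parser tolerates extra bytes appended after the record, but refuses to trust a length field that points past the end of the buffer.

```python
import struct
from typing import NamedTuple

# Hypothetical wire format: 1-byte type, 2-byte big-endian length, value.
HEADER = struct.Struct("!BH")
MAX_VALUE_LEN = 1024  # assumed sanity limit for this sketch


class Record(NamedTuple):
    kind: int
    value: bytes
    extra: bytes  # bytes that followed the record; tolerated, not trusted


class MalformedRecord(ValueError):
    """Raised when the input cannot be a valid record."""


def parse_record(buf: bytes) -> Record:
    # Never assume the header is even present.
    if len(buf) < HEADER.size:
        raise MalformedRecord("truncated header")
    kind, declared_len = HEADER.unpack_from(buf)
    # The length field comes from the sender, so bound it before using it.
    if declared_len > MAX_VALUE_LEN:
        raise MalformedRecord("declared length exceeds limit")
    end = HEADER.size + declared_len
    if end > len(buf):
        raise MalformedRecord("value truncated")
    # Liberal: trailing bytes are returned to the caller instead of rejected.
    return Record(kind, buf[HEADER.size:end], buf[end:])


wire = HEADER.pack(7, 5) + b"hello"
print(parse_record(wire))
print(parse_record(wire + b"junk").extra)
```

The design choice here is to be strict about anything that could cause an out-of-bounds read and lenient only about data the parser does not need. Whether trailing bytes should be accepted at all is a judgment call that depends on the protocol.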
The reason that TCP is connection-oriented and reliable is that some
applications need more reliability than what IP provides (i.e., none at
all). TCP exists to provide a reliable, ordered byte stream. UDP is
connectionless like IP, and simply exists to provide a multiplexing
layer above IP so that multiple UDP-based applications can share
the same IP address by using different UDP ports. Again, UDP is
stateless (connectionless), and the UDP checksum (like the TCP checksum)
is the same weak 16-bit sum used for the IP header. It does cover the
payload as well as the header, but it misses many multi-bit corruptions,
and over IPv4 a sender may omit it entirely. The implication of UDP being
stateless is that application developers have to implement their own
customized reliability mechanisms. Unfortunately, it's not easy to
figure out how TCP works and reverse-engineer just the pieces that they
need. Achieving reliability is hard, especially when code is going to
be deployed in aggressively hostile environments.
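To give a sense of what even one small piece of a home-grown reliability layer involves, here is a Python sketch of a receiver that copes with duplicated and reordered datagrams by using sequence numbers. It is not how TCP does it, and it deliberately leaves out acknowledgements, retransmission, timeouts, and sequence-number wraparound. The class name, the `max_pending` bound, and the assumption that sequence numbers start at 0 are all invented for this example.

```python
from typing import Dict, List


class InOrderReceiver:
    """Delivers payloads in sequence order, dropping duplicates.

    Sketch only: no ACKs, no retransmission, no wraparound handling.
    """

    def __init__(self, max_pending: int = 64) -> None:
        self.next_seq = 0                    # next sequence number to deliver
        self.pending: Dict[int, bytes] = {}  # out-of-order payloads held back
        self.max_pending = max_pending       # bound on how far ahead we buffer

    def receive(self, seq: int, payload: bytes) -> List[bytes]:
        """Accept one datagram; return payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate of something already seen
        if seq - self.next_seq >= self.max_pending:
            return []  # too far ahead; drop it to keep memory bounded
        self.pending[seq] = payload
        ready: List[bytes] = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


receiver = InOrderReceiver()
arrivals = [(1, b"b"), (0, b"a"), (0, b"a"), (2, b"c"), (1, b"b")]
delivered = [receiver.receive(seq, data) for seq, data in arrivals]
print(delivered)
```

Note the `max_pending` check: without it, a sender (or an attacker) could make the receiver buffer an unbounded amount of data simply by sending very large sequence numbers. That kind of detail is easy to miss, which is part of why rolling your own reliability is hard.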
But what do we do about this? Graduating from college was hard, but
people do it all the time. The answer lies in the second part of the
statement I quoted from the eWeek article:
...make sure every programming team has processes in place to find, fix or avoid these problems and has the tools needed to verify their code is as free of these errors as automated tools can verify.
That's the key, really: Automated processes. But the processes need teeth: The right tools. In the final segment of this blog posting, we'll look at how Mu is able to integrate with the software development life cycle to provide testing solutions that embrace, rather than ignore, the complexity inherent in the behavior of network protocol implementations.