innovation to day: Java

Just about everyone on the planet agrees that Apple products are the soul of innovative design. But are they good for innovators? For me the answer is "not so much."

I have been using Apple laptops and iPhones for years. As a software developer, I have a list of annoyances with Mac OS X, starting with Apple's incomprehensible management of Java. However, Mac OS X is far more productive than MS Windows, with its viruses, crummy OS releases, and bloatware. iPhones are close to worthless as telephones in the area where I live, in large part due to AT&T's network. But you can now switch to Verizon, so that's not such a problem either.

The real problem with Apple is that their products are closed. Want to install a new file system? Not here. Want to pick a different motherboard to play around with power utilization? Try somewhere else. Want to know what the OS is really doing under the covers or (gasp) inspect the source code? Dieu forfend!

Innovation in my chosen field of databases is increasingly based on breaking down the dividing lines between hardware and software to manage massive quantities of data economically and quickly. The more I learn about hardware, the less I want fully integrated products. I want devices I can interact with and learn from. I want visibility into internals. I want works-in-progress, not ready-made perfection. In short, I want open platforms that give me the parts but do not tell me what to build with them.

A few weeks ago my iPhone dropped on the floor and shattered. The replacement is a Droid 2 Global running Android. The user interface is clumsy. You have to watch out for viruses again. But the hardware is lightning fast. There is a free-for-all of people inventing new Android applications. The source code for Android itself is available on code.google.com. The open nature of Android is rapidly making it the locus of innovation for mobile devices. I feel at home already.

SANs sometimes go bad in completely unpredictable ways. Here's a problem I have now seen twice in production situations. A host boots up nicely and mounts file systems from the SAN. At some point a SAN switch (e.g., through a Fibre Channel controller) fails in such a way that the SAN goes away but the file system still appears visible to applications.

This kind of problem is an example of a Byzantine fault where a system does not fail cleanly but instead starts to behave in a completely arbitrary manner. It seems that you can get into a state where the in-memory representation of the file system inodes is intact but the underlying storage is non-responsive. The non-responsive file system in turn can make operating system processes go a little crazy. They continue to operate but show bizarre failures or hang. The result is problems that may not be diagnosed or even detected for hours.

What to do about this type of failure? Here are some ideas.

Be careful what you put on the SAN. Log files and other local data should not go onto the SAN. Use local files with syslog instead. Think about it: your application is sick and trying to write a log message to tell you about it on a non-responsive file system. In fact, if you have a robust scale-out architecture, don't use a SAN at all. Use database replication and/or DRBD instead to protect your data.
Test the SAN configuration carefully, especially failover scenarios. What happens when the host fails over from one access path to another? What happens when another host picks up the LUN from a "failed" host? Do you have fencing properly enabled?
Actively look for SAN failures. Write test files to each mounted file system and read them back as part of your regular monitoring. That way you know that the file system is fully "live."
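
The last idea, writing a test file and reading it back, might look roughly like the sketch below. It is only an illustration under some assumptions: the function names (probe_filesystem, _write_and_read_back), the probe file prefix, and the five-second timeout are all made up for this example. The probe runs in a background thread so that the monitor itself does not hang when the storage is non-responsive.

```python
import os
import threading
import uuid


def _write_and_read_back(directory, result):
    """Write a small probe file, force it to storage, and read it back."""
    token = uuid.uuid4().hex.encode("ascii")
    # Hypothetical file name prefix; pick whatever suits your monitoring.
    path = os.path.join(directory, ".san_probe_" + uuid.uuid4().hex)
    try:
        with open(path, "wb") as f:
            f.write(token)
            f.flush()
            # fsync pushes the write past the page cache. On a dead SAN
            # this is the call most likely to hang or fail.
            os.fsync(f.fileno())
        # Note: the read may be served from cache, so the fsync above
        # is the part that really exercises the underlying storage.
        with open(path, "rb") as f:
            result["ok"] = f.read() == token
        os.remove(path)
    except OSError as exc:
        result["error"] = str(exc)


def probe_filesystem(directory, timeout_s=5.0):
    """Return (healthy, detail) for one mounted file system."""
    result = {}
    worker = threading.Thread(
        target=_write_and_read_back, args=(directory, result), daemon=True
    )
    worker.start()
    worker.join(timeout_s)  # give up waiting if the I/O never returns
    if worker.is_alive():
        return False, "probe timed out after %.1fs" % timeout_s
    if "error" in result:
        return False, result["error"]
    if not result.get("ok"):
        return False, "read-back did not match what was written"
    return True, "ok"
```

A monitoring job could call probe_filesystem once per mount point and raise an alert on any False result. A timeout is the interesting case here: it is the symptom you would expect when the file system still looks mounted but the SAN behind it has gone away.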

The last idea gets at a core issue with SAN failures: they are rare, so they are not the first thing people think of when there is a problem. The first time this happened on one of my systems it was around 4 a.m. It took a really long time to figure out what was going on. We didn't exactly feel like geniuses when we finally checked the file system.

SANs are great technology, but there is an increasingly large "literature" of SAN failures on the net, such as this overview from Arjen Lentz and this example of a typical failure. You need to design mission-critical systems with SAN failures in mind. Otherwise you may want to consider avoiding SAN use entirely.
