Saturday, October 31, 2009

Programming is not as easy as it sounds

(After reading "Teach Yourself Programming in Ten Years" by Peter Norvig)

I found this article after I read Prof. Johnson's blog post about it. All the points in this article ring true, and I think every programmer should read it.

Back when I was in college, many Computer Science students thought programming was easy and kind of boring: it only requires time, not much skill. Many people who have been programming for years after school still think the same way. I think the major reason is that, as computer hardware, programming languages and development tools have improved so quickly over the past decades, the learning curve for writing runnable and reasonably fast programs keeps getting lower. Even if you know nothing about computers, you can probably write a piece of Python code that does quite a lot in one or two days. But does that really mean programming has become a no-brainer? It certainly does not. The superficial simplicity prevents people from thinking more deeply about it. Like Peter Norvig, I also have a few tips on how to teach yourself programming.

First things first, make yourself interested in programming. Maybe programming is not your dream job, but if it is what you do now, try to become interested in it. This eases the pain and makes your programming time more enjoyable.

If you do not know much about computers, try to learn something about them. I strongly recommend a few books for the basic concepts: Computer Organization and Design: The Hardware/Software Interface, Computer Architecture: A Quantitative Approach, Modern Operating Systems and Computer Networks. These books are all very well written and teach you many things you should know when you write programs. Without this knowledge, you won't be able to understand many important aspects of programming, such as pointers, why buffered I/O streams are faster, how to use threads correctly to avoid concurrency bugs and so on. Additionally, Jeff Dean has a list of numbers he believes everybody should know:


(from the slides of Jeff Dean's talk)
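
To make the buffered I/O point above concrete, here is a minimal Java sketch (the byte-counting program and the file argument are just my illustration): each read() on a plain FileInputStream goes down to the operating system, while a BufferedInputStream serves most reads from an in-memory buffer, which is orders of magnitude cheaper, as the numbers above suggest.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class CountBytes {
        public static void main(String[] args) throws IOException {
            // Unbuffered alternative: every read() is a native call.
            // InputStream in = new FileInputStream(args[0]);

            // Buffered: most read() calls are served from an in-memory
            // buffer (8 KB by default), so the OS is touched far less often.
            InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            in.close();
            System.out.println(count + " bytes");
        }
    }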

Read the Design Patterns book once a year. I personally think it is the best book about programming I have ever read. It opens a new door to thinking about programming. When you read it, think beyond patterns for OO programming: there are programming patterns everywhere, and they are powerful weapons for conquering system design problems. The more patterns you know, the more choices and beautiful ideas you will have when you design a system.
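
As one small taste (my own example, not from the book), here is the classic Strategy pattern in Java: the discount policy is a pluggable object, so new behavior can be added without touching the calling code.

    interface DiscountStrategy {
        double apply(double price);
    }

    public class Checkout {
        private final DiscountStrategy discount;

        Checkout(DiscountStrategy discount) {
            this.discount = discount;
        }

        double total(double price) {
            return discount.apply(price);
        }

        public static void main(String[] args) {
            // Swap the algorithm without changing Checkout itself.
            Checkout regular = new Checkout(new DiscountStrategy() {
                public double apply(double price) { return price; }
            });
            Checkout holiday = new Checkout(new DiscountStrategy() {
                public double apply(double price) { return price * 0.9; }
            });
            System.out.println(regular.total(100.0)); // 100.0
            System.out.println(holiday.total(100.0)); // 90.0
        }
    }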

Open your mind and learn more languages, even if you do not need them for work. Learning a new language helps you re-think programming. For example, say you are originally a Java programmer. When you learn Python, the first thing you may notice is dynamic typing. It may make you uncomfortable at the beginning, because you are afraid you will make typing mistakes. But after a while, you may feel that dynamic typing helps you write programs faster. So, is it good or bad after all?

Read other people's code. One of the best ways to learn is to learn by example, whether the examples are good or bad. You can read open source code or code written by your colleagues. Remember it when you think their code is good, and try to improve it when you think it is not that good. Our company has a very strict code review policy. I have learned a lot by reviewing other people's code and letting them review mine.

Get familiar with software development tools. There is a Chinese proverb: "sharpen your knife before you cut wood" (工欲善其事,必先利其器). Many great development tools can significantly improve the efficiency and quality of software development. Take Java as an example. Java is famous for having many great tools. Just to name a few open-source ones:
  • IDE: Eclipse, NetBeans
  • Building: Ant, Maven
  • Debugging: jdb, jhat (heap analysis), jconsole (monitoring), Eclipse integrated debugging features. 
  • Profiling: jprof, NetBeans Profiler (I really like this)
  • Static analysis tools: FindBugs, PMD
  • Unit testing: JUnit framework, Emma (check unit testing coverage)
Always keep in mind that testing rocks, debugging sucks. I did not really realize this until I started working at the company. I would say that writing solid tests is one of the biggest lessons I learned in the real world. Good testing can totally change the quality of a program and the "life quality" of programmers. Once you adopt test-driven development, you will never go back.
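
For anyone who has not tried it, a JUnit 4 test is as simple as the sketch below; the reverse() helper is a hypothetical unit under test that I made up for illustration.

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class ReverseTest {
        // The hypothetical unit under test.
        static String reverse(String s) {
            return new StringBuilder(s).reverse().toString();
        }

        @Test
        public void handlesEmptyString() {
            assertEquals("", reverse(""));
        }

        @Test
        public void handlesNormalInput() {
            assertEquals("cba", reverse("abc"));
        }
    }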

Think while you do. This is more about attitude. Even if you are in charge of just a small component of a large system, don't bind your mind to that particular component. Think about the big picture: how the whole system works, what is good and what is bad, how to improve it and so on. If you think more, you will find you can learn a lot even when working on something you used to consider boring.

Happy programming. :-)

Sunday, October 25, 2009

When to learn a new language?

"A language that doesn't affect the way you think about programming, is not worth knowing."




I really like this epigram. How have the languages you have learned affected the way you think about programming?

Saturday, October 17, 2009

What does Steve Jobs say about design?

"People think it's this veneer -- that the designers are handed this box and told, 'Make it look good!' That's not what we think design is. It's not just what it looks like and feels like. Design is how it works."
-- Steve Jobs

Wednesday, October 14, 2009

"Stress Testing On Concurrent Problems - CHESS


Testing concurrency problems, such as data races, deadlocks and so on, can be very hard. These problems only happen with certain inputs and depend heavily on how threads/processes are scheduled. To get good coverage of thread/process schedules, people run the same test many times and keep their fingers crossed that they run into the concurrency problem, if there is one. This doesn't work well for problems that happen very rarely. People sometimes also use stress testing, testing with a very heavy workload, to find potential concurrency problems. But this doesn't work well either: a heavy workload may introduce more concurrent operations, but it does not necessarily introduce the right operations in the right order to trigger the problem.
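
Here is a classic example (mine, not from the CHESS paper) of why rerunning a test is such a weak weapon: the unsynchronized counter below loses updates only under certain interleavings, so the failure may not show up for many runs.

    public class RacyCounter {
        static int counter = 0;

        public static void main(String[] args) throws InterruptedException {
            Runnable work = new Runnable() {
                public void run() {
                    for (int i = 0; i < 100000; i++) {
                        counter++;  // read-modify-write, not atomic
                    }
                }
            };
            Thread t1 = new Thread(work);
            Thread t2 = new Thread(work);
            t1.start();
            t2.start();
            t1.join();
            t2.join();
            // Whether this prints 200000 or something smaller depends
            // entirely on how the two threads were interleaved.
            System.out.println(counter);
        }
    }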

The CHESS project from Microsoft Research tries to solve this problem. CHESS uses model checking techniques to explore different concurrent schedules, more specifically how threads interleave with each other, to find one that triggers a concurrency problem. Once the problematic schedule is found, CHESS can accurately reproduce it later and let programmers debug the program. As an example from their paper, CHESS reported a deadlock after executing 6,727 different schedules in about 20 seconds. This is apparently one of those tricky concurrency bugs that would take days to find by hand.

The most interesting part of CHESS to me is how it controls the state space. It is not hard to imagine that the number of different schedules of a multi-threaded program can be huge. CHESS uses various techniques to shrink the search space, with two major strategies. The first is preemption bounding: limit the number of preemptions and use them carefully. Their previous paper shows that once the schedule decides where a few preemptions happen, the rest of the schedule is determined automatically; basically, a problematic schedule only depends on a few preemptions happening at the right places. Their experience shows that allowing at most 2 preemptions is enough to find serious concurrency bugs. The other approach CHESS uses is to cache visited program states to prune redundant search. The question is what can serve as the program state in terms of concurrent behaviors. CHESS uses happens-before graphs. The rationale is that "two executions that generate the same happens-before graph only differ in the order of independent synchronization operations".

Microsoft Research has already integrated CHESS into Visual Studio 2008. It works for multi-threaded single-process applications. If you are interested, download it and give it a try. For the latest updates about CHESS, check out its blog.

Thursday, October 08, 2009

Refactor Java Loops With ParallelArray

(After reading "ReLooper: Refactoring for Loop Parallelism")

The ParallelArray framework is a new feature in Java 7 that provides parallel operations on an array. For example, programmers can apply a modification to each element of an array, map each element of an array to an element of another array, reduce the elements of an array to a single value and so on. These operations are parallelized by the ParallelArray framework automatically. ReLooper is a tool that helps programmers rewrite existing loops with ParallelArray to improve performance. It delivers 1.32-1.96x speedups on a 2-core machine.
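
To give a flavor of the API, here is a minimal sketch. I am writing against the extra166y preview package; the package name has moved between previews (earlier ones used jsr166y.forkjoin) and may change again before Java 7 ships. replaceWithMapping() transforms every element in parallel, and sum() is a parallel reduction.

    import extra166y.Ops;
    import extra166y.ParallelDoubleArray;

    public class ParallelArrayDemo {
        public static void main(String[] args) {
            double[] data = new double[1000000];
            for (int i = 0; i < data.length; i++) data[i] = i;

            // Hand the existing array to the framework.
            ParallelDoubleArray pa = ParallelDoubleArray.createUsingHandoff(
                    data, ParallelDoubleArray.defaultExecutor());

            // Apply a modification to each element, in parallel.
            pa.replaceWithMapping(new Ops.DoubleOp() {
                public double op(double x) { return Math.sqrt(x); }
            });

            // Reduce all elements to a single value, in parallel.
            System.out.println(pa.sum());
        }
    }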

Rewriting a loop into a parallelized version with ParallelArray is not hard. The hard part is making sure the new loop behaves exactly like the original one. ReLooper uses several static analysis techniques to analyze a loop and make sure it can be safely parallelized with ParallelArray. If ReLooper finds any potential problems, it warns the user before starting the refactoring. There are mainly three constraints that may prevent a loop from being parallelized.

First, the original loop should go through all elements of the array in order, with no inter-iteration dependencies. If the original loop is a for-loop, ReLooper checks whether the iteration variable is used as the index into the array and whether it runs from 0 to the length of the array. ReLooper also checks whether any control flow statements, such as break, return and continue, skip some array operations. Both loop shapes are sketched below.
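
Here is a hypothetical sketch of the two cases as I understand this constraint (my own example, not from the paper):

    public class LoopShapes {
        // Parallelizable shape: visits indexes 0..length-1 in order and
        // each iteration touches only its own element.
        static void scale(double[] data, double factor) {
            for (int i = 0; i < data.length; i++) {
                data[i] = data[i] * factor;
            }
        }

        // Should trigger a warning: the early return skips the remaining
        // array operations, so iterations are not independent.
        static boolean containsNegative(double[] data) {
            for (int i = 0; i < data.length; i++) {
                if (data[i] < 0) {
                    return true;
                }
            }
            return false;
        }
    }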

Second, ReLooper checks whether different iterations of the original loop access shared data. If there are any conflicting accesses, the original loop cannot be parallelized easily. This check is not trivial because of the complexity of memory reference analysis and method calls; ReLooper implements an inter-procedural data-flow analysis to do it.

Third, the original loop should not contain blocking I/O operations. Because ParallelArray uses the Fork/Join framework to parallelize operations, blocking I/O can significantly hurt performance.

These constraints make a lot of sense, but in my opinion they do not completely kill the opportunity to parallelize a loop. It would be better if the paper described what should be done when any of them is violated. Is there anything smart ReLooper can do to still parallelize the code, or do programmers have to get involved to make the decision or even adjust the code?

ReLooper is implemented as an Eclipse plug-in and you can download it from its website. I really appreciate the authors' effort to make it real and usable.

Wednesday, October 07, 2009

Refactor Java Programs To Take Advantage of Concurrency

(After reading "Refactoring sequential Java code for concurrency via concurrent libraries".)

There are more and more new libraries and language features to help programmers write better concurrent code. They greatly benefit new code, but what about existing code? We probably need new refactoring techniques to improve existing code. This paper proposes a tool, CONCURRENCER, which can refactor existing sequential Java programs to take advantage of the java.util.concurrent (j.u.c) library. CONCURRENCER provides three main refactorings.

First, it can convert usages of an int field into an AtomicInteger, which makes accesses to the field thread-safe. Besides AtomicInteger, the j.u.c library also provides a few other atomic data types.
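
A hypothetical before/after sketch of this refactoring (the class and field names are mine, not from the paper):

    import java.util.concurrent.atomic.AtomicInteger;

    public class HitCounter {
        // Before: a plain int field.  "hits++" is a read-modify-write
        // sequence, so concurrent calls can lose updates.
        // private int hits;
        // void record() { hits++; }

        // After: an AtomicInteger field with lock-free atomic updates.
        private final AtomicInteger hits = new AtomicInteger();

        void record() {
            hits.incrementAndGet();
        }

        int current() {
            return hits.get();
        }
    }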

Second, it can convert usages of HashMap into ConcurrentHashMap. The traditional way to make a map thread-safe is to use a lock to protect every operation on the map, which introduces unnecessary contention. ConcurrentHashMap is a pretty cool class in the j.u.c library that solves this problem. It uses a smarter locking mechanism that allows fully concurrent retrievals and partially concurrent updates: it divides the hash table into segments, each with its own update lock, so different segments can be updated concurrently.
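
A hypothetical sketch of the refactored usage; putIfAbsent() is the kind of atomic check-then-act that a plain HashMap would need an external lock for:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class SimpleCache {
        // Before: new HashMap<String, String>() with every get/put
        // wrapped in a synchronized block on one shared lock.
        private final ConcurrentMap<String, String> cache =
                new ConcurrentHashMap<String, String>();

        String lookup(String key) {
            return cache.get(key);  // retrievals never block
        }

        void storeIfAbsent(String key, String value) {
            // Atomic check-then-act; with a plain HashMap this would
            // need a lock around containsKey() + put().
            cache.putIfAbsent(key, value);
        }
    }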

Third, and most complex, it can convert a recursive method into a new version based on the Fork/Join framework. To solve a divide-and-conquer problem, the Fork/Join framework recursively solves one task by using multiple threads to solve its sub-tasks, waits for all of them to finish and then aggregates the results for the original task. I wrote about this framework before. According to the experiments, this refactoring improves the average performance of a sequential recursive method by 2.97x (on a 4-core system).
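
For a feel of the target code shape, here is a minimal divide-and-conquer sum written against the jsr166y preview package (java.util.concurrent in Java 7); the threshold value is an arbitrary choice of mine.

    import jsr166y.ForkJoinPool;
    import jsr166y.RecursiveTask;

    public class SumTask extends RecursiveTask<Long> {
        private static final int THRESHOLD = 10000;  // arbitrary cutoff
        private final long[] data;
        private final int lo, hi;

        SumTask(long[] data, int lo, int hi) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
        }

        protected Long compute() {
            if (hi - lo <= THRESHOLD) {
                long sum = 0;  // small task: solve sequentially
                for (int i = lo; i < hi; i++) sum += data[i];
                return sum;
            }
            int mid = (lo + hi) / 2;  // big task: split into sub-tasks
            SumTask left = new SumTask(data, lo, mid);
            SumTask right = new SumTask(data, mid, hi);
            left.fork();                      // run left half asynchronously
            long rightSum = right.compute();  // right half in this thread
            return left.join() + rightSum;    // wait, then aggregate
        }

        public static void main(String[] args) {
            long[] data = new long[1000000];
            for (int i = 0; i < data.length; i++) data[i] = i;
            ForkJoinPool pool = new ForkJoinPool();
            System.out.println(pool.invoke(new SumTask(data, 0, data.length)));
        }
    }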

I like this paper because it is practical. Each of these three refactorings looks small, but when applied correctly, they can greatly improve the code quality and performance of existing sequential programs. The authors also mention that they keep looking for refactoring opportunities that take advantage of other features in the j.u.c library.

One limitation of CONCURRENCER is that it depends on the programmer to pick the code to refactor. It would be more helpful if it could collaborate with other static analysis tools to find such code automatically.

Friday, October 02, 2009

"Creeping Featurism is a Strength" if it is done right.

(Reading notes for Chapter 11 of "Beautiful Architecture": GNU Emacs: Creeping Featurism Is a Strength)


This chapter describes the software architecture of Emacs and discusses its creeping featurism "problem". Creeping featurism means that people tend to put more and more features into a system and eventually make it too complicated to manage. Many people think creeping featurism leads to no good, but it can certainly become a strength if it is done right. Software has become very complicated these days and people expect more and more from it, so a closed system is unlikely to be very successful. We can see this tendency everywhere, from desktop applications like Firefox and iPhoto, to enterprise software like SplunkBase, to web-based applications like Facebook.

Extensibility is especially important for open source software, which usually has a lot of decentralized developers and users. Here are a few thoughts on how to make a software system open to extensions while still preventing creeping featurism.

Firstly, a system should have a fairly simple data model. The book chapter compares the data model of Emacs with that of Microsoft Word, which I think is a very good example. The simplicity of the Emacs data model, strings of characters, makes it much easier to extend than MS Word. I'm not criticizing MS Word; after all, Emacs is an editor, not a word processor.

Secondly, the interface between the core of a system and its other rich features should be clear and well defined. The Emacs MVC model is a good example again. Views are automatically updated once a relevant model change is detected. The core of the controller is written in C, but the code for most commands we use is written in Emacs Lisp, which is also the major language for developing extensions. This design lets extensions integrate into the whole system naturally: most extension code can be modified without recompiling the Emacs core, or even restarting Emacs.

Thirdly, a system should have a good default set of features that fits common needs. Most Emacs users only use a small subset of its features, but a complete installation of Emacs is already pretty big: it takes time to start and occupies quite some system resources when running. This is actually one of the major attacks from Vi advocates. Finding the balance between efficiency and a reasonable feature set is hard. The chapter presents one solution: ship different sets of default features, like expert mode, normal mode, hardcore mode and so on.

Fourthly, make it very easy to install an extension. If it takes hours to install an extension, I don't believe people will bother at all. Emacs extensions are not trivial to install: for most of them, you need to modify your ".emacs" configuration file and copy package files to the right place, and some even require compiling and/or installing extra software. Software like Eclipse and Firefox does a better job here.

Finally, it is important to have a good platform for people to share and look for extensions. If a system is designed to be extended, extensions should be managed efficiently, with good communication between the people who write extensions and the people who request new features.

Making enterprise software extensible while controlling its feature scale is quite different from doing so for open source software. I will probably write another post about that.