By Michael Woloszynowicz

Sunday, January 23, 2011

Why are you Still Making Crap? Open Source to the Rescue

The more I delve into the bevy of open source tools and frameworks, the more I'm stunned when I see poorly executed applications. A few years ago developers had to make concessions when designing an application or implementing certain components, as they were limited by their technical abilities or infrastructure constraints. That's not to say that constraints no longer exist, as time is always a factor, but nowadays you can rarely make the excuse that you're limited by technology. Never before have so many resources been available to even the smallest of companies at little or no cost. The barriers have come down, and newly formed startups now have access to many of the same technologies that were previously reserved for the Fortune 500. Below I discuss some of the challenging problems developers face and the open source toolkits that exist to help us solve them.

Reading Various File Formats
Extraction of metadata and content from a wide range of file types is rarely a task that programmers get excited about. The explosion of file formats and variations has made content extraction a large problem for document management systems and search engines that want to access file content in a standardized way, without having to get into the taxonomies of file formats. Fortunately, an open source Apache product named Tika helps us achieve this with fairly little effort. Tika provides automatic and reliable media type detection, metadata extraction, language detection, and parser selection for a multitude of file formats in an easy-to-use package. Tika outputs parsed content in a standardized fashion as either plain text or XHTML, fed to your application through a SAX parser, which keeps memory usage to a minimum when dealing with large files. Tika's architecture is easily extensible to accommodate new IANA MIME types and custom content handlers or parsers, its basic tools can be learned very quickly, and integration is a breeze. It's one of those tools that's applicable to almost every web application you build.
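
To give a feel for how little code this takes, here is a minimal sketch using Tika's AutoDetectParser to pull plain text and metadata out of a file. The file name report.pdf is just a placeholder, and the default BodyContentHandler shown here buffers the extracted text in memory, so it suits modestly sized documents; for very large files you would feed it your own SAX handler instead.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TextExtractor {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();      // detects the media type and picks a parser
        BodyContentHandler handler = new BodyContentHandler(); // SAX handler that collects plain text
        Metadata metadata = new Metadata();

        InputStream stream = new FileInputStream("report.pdf"); // placeholder file name
        try {
            parser.parse(stream, handler, metadata, new ParseContext());
        } finally {
            stream.close();
        }

        System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
        System.out.println(handler.toString()); // the extracted text
    }
}
```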

Full-Text Search
Lucene is the de facto standard for search and has been battle-tested providing search capabilities to data giants such as Twitter and Wikipedia. Not only is Lucene a high-performance indexing system, it also provides out-of-the-box support for complex analyzers such as N-Grams, Synonyms, Stemming, Double Metaphone, Edit Distance, etc., which allow you to do cool things like processing fuzzy queries. Lucene can also be combined with Tika to index a wide variety of file formats and provide search over their contents. Lucene on its own can be a little onerous to work with, as you'll almost surely have to write some infrastructure code to perform maintenance and deal with performance and ACID issues. There are two additional tools that can help with this. The first is Solr, a REST-based indexing service that works quite well out of the box for typical requirements. The second and more powerful choice is Hibernate Search. The prerequisite for Hibernate Search is of course the use of the ubiquitous Hibernate ORM, but once Hibernate is in place, plugging in search is merely a matter of annotation or XML configuration. Hibernate Search works by keeping your relational database and a Lucene index in sync, and by mapping result sets back to Hibernate entities. This can improve your search drastically, both from a performance and a quality standpoint, as Hibernate Search removes the need for slow SQL LIKE queries and supports all the great Lucene analyzers described above. Finally, Hibernate Search has built-in master-slave style clustering support, which is not provided natively by Lucene and is widely regarded as a nightmare to implement yourself.
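
As a sketch of the annotation-driven approach, the example below targets the Hibernate Search 3.x and Lucene 3.x APIs that are current at the time of writing. The Article entity and its fields are made-up names, and a real setup would also need the usual Hibernate configuration plus an initial index build; the point is simply that a couple of annotations make an entity searchable, including via fuzzy Lucene queries.

```java
import java.util.List;

import javax.persistence.Entity;
import javax.persistence.Id;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

// Hypothetical entity: the annotations are all Hibernate Search needs to mirror it into a Lucene index.
@Entity
@Indexed
public class Article {

    @Id @DocumentId
    private Long id;

    @Field // tokenized and indexed, so it is full-text searchable
    private String title;

    @Field
    private String body;

    // Parses a Lucene query string (e.g. "hadop~" for a fuzzy match) and maps hits back to entities.
    @SuppressWarnings("unchecked")
    public static List<Article> search(Session session, String queryString) throws Exception {
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "body",
                new StandardAnalyzer(Version.LUCENE_30));
        org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);
        return fullTextSession.createFullTextQuery(luceneQuery, Article.class).list();
    }

    // getters and setters omitted for brevity
}
```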

Computations with Very Large Datasets
Many startups are limited in their ability to efficiently process large amounts of data, as writing distributed code is both complex and time consuming. With the introduction of Google's MapReduce algorithm we were given a standard way of dealing with large data sets and mapping them to a desired result set. MapReduce was eventually incorporated into certain non-relational databases such as MongoDB, which made writing MapReduce algorithms fairly straightforward. The problem is that for MapReduce to be useful at a large scale, we still need a way to distribute and merge the data across an array of commodity machines or cloud servers. To help with this, Lucene creator Doug Cutting created an open source implementation of MapReduce called Hadoop. Hadoop does all the work involved in distribution, merging, scheduling, and fail-over, with users only having to write the MapReduce code that feeds into it. Although Hadoop is geared towards Java, it supports streaming via Unix pipes, so jobs can also be written as Python or PHP scripts. If you're planning on writing a lot of MapReduce algorithms you would be wise to study a functional programming language such as LISP in order to get more comfortable with the MapReduce way of thinking. An added benefit for startups with little up-front capital is that Amazon provides on-demand Hadoop clusters (Elastic MapReduce) at relatively low cost. Hadoop is well supported and field tested, as it is continually developed by Yahoo and used by companies such as Facebook, Twitter, and LinkedIn.
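
To make the "you only write the map and reduce code" point concrete, here is the canonical word count job, sketched against the org.apache.hadoop.mapreduce API; input and output paths come from the command line, and everything else (splitting, shuffling, merging, retries) is Hadoop's job.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word; Hadoop has already grouped the values by key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```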

Machine Learning
For many web firms, the ability to classify data or provide recommendations to users is a highly desirable and potentially profitable feature. Companies such as Amazon and Google have perfected this, and it has become an important factor in their success. The downside is that a machine learning toolkit is certainly not something a small startup wants to write from scratch, unless that is your startup's business model. The amount of work involved is immense, and it requires a strong understanding of data mining algorithms. Once again the Apache Software Foundation comes to the rescue, this time with Mahout. Mahout is still in its infancy but already provides many of the tools needed to automatically classify or cluster data and provide recommendations in a scalable manner. Mahout achieves its scalability through the use of Hadoop, so we can immediately see how these various technologies build on one another and why it's important to study them in a hierarchical manner. A less obvious but also useful application of Mahout is performing large or complex vector and matrix computations, which are well supported by Mahout's math package. Mahout does not require prior knowledge of data mining or machine learning (although it helps), and can largely be used as a black box.
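
As an illustration of that black-box flavor, the sketch below wires up one of Mahout's Taste collaborative filtering recommenders. The file ratings.csv is a hypothetical data set of userID,itemID,preference lines, and the choice of Pearson similarity with a neighborhood of 10 users is arbitrary; swapping in other similarity measures or a distributed, Hadoop-backed recommender is a configuration change rather than new math.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ratings file: one "userID,itemID,preference" line per rating.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 42, treated entirely as a black box.
        List<RecommendedItem> recommendations = recommender.recommend(42, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```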

Rich Web Based UI’s
Dealing with cross-browser inconsistencies and maintaining clean, robust JavaScript code once made the development and maintenance of RIAs a nightmare, so the job was often handed off to third-party tools such as Flash or Silverlight. Today we have a wide variety of tools such as Dojo, JavaScriptMVC, jQuery, backbone.js, and many others that allow us to write well-structured MVC JavaScript code that works seamlessly with a REST service to provide loose coupling and portability. In addition, toolkits such as Dojo and jQuery provide a large number of prebuilt widgets for common UI elements like trees, grids, pickers, modal dialogs, tabs, etc. These toolkits also abstract away differences in event handling and DOM styling, and speed up the programmatic creation of DOM nodes. Whether you are an object-oriented or functional programmer, there is a toolkit out there that will dramatically improve both the user experience of your web application and the ease of building it. Of course these tools have to be combined with a good fundamental knowledge of usability and graphic design, but they drastically simplify the path from concept to implementation.

Although the above list contains a good number of open source tools for solving some of these difficult problems, it is by no means complete, so please discuss your favorites in the comments. These are simply the technologies that I've had the most experience with and have found useful. Although a large chunk of the above tools are geared towards Java, this should not discourage you if your application is written in another language, as they can easily be wrapped in a REST or SOAP web service. What is important to take away is that the above technologies are no longer the domain of large corporations, but are available to be harnessed by the smallest of startups. As more and more firms begin to incorporate these technologies and use them in a strategic manner to produce high quality applications, the benchmarks will slowly rise, further necessitating their adoption. If you haven't tried any of these technologies or others like them, I recommend you at least learn more about them so that you can bring them into your applications when needed. I don't recommend that you add approximate search or a recommendation engine purely for the sake of doing so; the use of these technologies must be driven by a market need. The difference between five years ago and today is that any skilled programmer is in fact able to meet that need, and can no longer fall back on excuses of technological constraints.

If you liked this post, please follow me on Twitter for more.

Note: An in-depth discussion of implementing these technologies is beyond the scope of one blog post, so if you want to find out more about using any of the above, Manning Publications has a number of good books on Lucene, Hadoop, Mahout, and Tika, while O'Reilly has good publications on Dojo and jQuery.
