Reading Various File Formats
Extracting metadata and content from a bevy of file types is rarely a task that programmers get excited about. The explosion of file formats and their variations has made content extraction a real problem for document management systems and search engines, which need access to file content in a standardized way without wading into the taxonomy of file formats. Fortunately, an open source Apache product named Tika helps us achieve this with fairly little effort. Tika provides automatic and reliable media type detection, metadata extraction, language detection, and parser selection for a multitude of file formats in an easy-to-use package. Tika emits parsed content in a standardized fashion as either plain text or XHTML, delivered to your application as a stream of SAX events, which keeps memory usage to a minimum when dealing with large files. Tika's architecture is easily extended to accommodate new IANA MIME types and custom content handlers or parsers, its basic tools can be learned very quickly, and integration is a breeze. It's one of those tools that's applicable to almost every web application you build.
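To make this concrete, here is a minimal sketch of the pattern described above: Tika detects the media type, picks a parser, and streams the extracted text through a SAX content handler. The file name is a placeholder, and the snippet assumes the Tika core and parsers jars are on the classpath.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();      // detects the media type and delegates to the right parser
        BodyContentHandler handler = new BodyContentHandler(); // SAX handler that collects the body text
        Metadata metadata = new Metadata();
        // "report.pdf" is a placeholder — any of the supported formats works the same way
        try (InputStream stream = Files.newInputStream(Paths.get("report.pdf"))) {
            parser.parse(stream, handler, metadata);
        }
        System.out.println(metadata.get(Metadata.CONTENT_TYPE)); // detected MIME type
        System.out.println(handler.toString());                  // extracted plain text
    }
}
```

Because the same three lines of parsing code work for a PDF, a Word document, or an HTML page, your application never has to branch on file type.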
Lucene is the de facto standard for search and has been battle tested providing search capabilities to data giants such as Twitter and Wikipedia. Not only is Lucene a high performance indexing system, it also provides out-of-the-box support for sophisticated analyzers such as n-grams, synonyms, stemming, Double Metaphone, and edit distance, which let you do cool things like process fuzzy queries. Lucene can also be combined with Tika to index a wide variety of file formats and provide search over their contents. Lucene on its own can be a little onerous to work with, as you'll almost surely have to write some infrastructure code to perform maintenance and deal with performance and ACID issues. There are two additional tools that can help with this. The first is Solr, a REST-based indexing service that works quite well out of the box for typical requirements. The second and more powerful choice is Hibernate Search. The prerequisite for Hibernate Search is of course the use of the ubiquitous Hibernate ORM, but once Hibernate is in place, plugging in search is merely a matter of annotations or XML configuration. Hibernate Search works by keeping your relational database and a Lucene index synchronized, and by mapping result sets back to Hibernate entities. This can improve your search drastically, both from a performance and a quality standpoint, as Hibernate Search removes the need for slow SQL LIKE queries and supports all the great Lucene analyzers described above. Finally, Hibernate Search has built-in master-slave style clustering support, which is not provided natively by Lucene and is widely regarded as a nightmare to implement.
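As a sketch of how little configuration this takes, the mapping below marks a hypothetical `Product` entity for indexing. The entity and field names are illustrative, and the exact annotation attributes vary a little between Hibernate Search versions, so treat this as the shape of the configuration rather than a definitive recipe.

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed          // keep a Lucene document in sync with each Product row
public class Product {
    @Id @GeneratedValue
    private Long id;

    @Field        // tokenized and indexed with the configured analyzer
    private String name;

    @Field
    private String description;
}
```

Queries then go through `Search.getFullTextEntityManager(em)`, which lets you run a Lucene query and get the hits back as managed `Product` instances instead of raw documents.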
Computations with Very Large Datasets
Many startups are limited in their ability to efficiently process large amounts of data, as writing distributed code is both complex and time consuming. Google's MapReduce paper gave us a standard way of decomposing work over large data sets and mapping it to a desired result set. MapReduce was eventually incorporated into certain non-relational databases such as MongoDB, which made writing MapReduce jobs fairly straightforward. The problem is that for MapReduce to be useful at large scale, we still need a way to distribute the data and merge the results across an array of commodity machines or cloud servers. To help with this, Lucene creator Doug Cutting built Hadoop, an open source implementation of MapReduce. Hadoop does all the work involved in distribution, merging, scheduling, and fail-over, with users only having to write the MapReduce code to feed into it. Although Hadoop is geared towards Java, it supports streaming via Unix pipes, so jobs can also be written in languages such as Python or PHP. If you're planning on writing a lot of MapReduce jobs, you would be wise to study a functional programming language such as Lisp in order to get more comfortable with the MapReduce way of thinking. An added benefit for startups with little up-front capital is that Amazon rents on-demand Hadoop clusters (Elastic MapReduce) at relatively low cost. Hadoop is well supported and field tested, as it is continually developed by Yahoo and used by companies such as Facebook, Twitter, and LinkedIn.
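Real Hadoop jobs subclass its `Mapper` and `Reducer` types, but the way of thinking is easy to show without a cluster. The plain-Java sketch below runs the canonical word count locally: the map step emits a (word, 1) pair for every word, and the reduce step groups the pairs by key and sums them — exactly the two functions you would hand to Hadoop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCount {
    // "Map" step: emit a (word, 1) pair for every word in a line.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // "Reduce" step: group the emitted pairs by word and sum the counts.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.groupingBy(
                Map.Entry::getKey, Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to do is to be");
        Map<String, Integer> counts = reduce(lines.stream().flatMap(WordCount::map));
        System.out.println(counts.get("to")); // 4
        System.out.println(counts.get("be")); // 3
    }
}
```

Hadoop's contribution is everything around these two functions: it shards the input across machines, runs the map step in parallel, shuffles the pairs so that all counts for a word land on the same reducer, and restarts failed tasks.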
For many web firms, the ability to classify data or provide recommendations to users is a highly desirable and potentially profitable feature. Companies such as Amazon and Google have perfected this, and it has become an important factor in their success. The downside is that a machine learning toolkit is certainly not something a small startup wants to write itself, unless that is your startup's core business. The amount of work involved is both immense and difficult, and requires a strong understanding of data mining algorithms. Once again the Apache Software Foundation comes to the rescue, this time with Mahout. Mahout is still in its infancy but already provides many of the tools needed to automatically classify or cluster data and produce recommendations in a scalable manner. Mahout achieves its scalability through the use of Hadoop, so we can immediately see how these technologies build on one another and why it's important to study them in a hierarchical manner. A less obvious but also useful application of Mahout is performing large or complex vector and matrix computations, which are well supported by Mahout's math package. Mahout does not require prior knowledge of data mining or machine learning (although it helps), and can largely be used as a black box.
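As an illustration of that black-box usage, the sketch below wires up Mahout's Taste collaborative-filtering API for user-based recommendations. The ratings file and user ID are hypothetical, and it assumes the Mahout core jar is on the classpath.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical file): one "userID,itemID,preference" line per rating
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for (hypothetical) user 42
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```

Note that nothing here requires knowing how Pearson correlation or neighborhood formation works — you pick a similarity measure and a neighborhood size, and swap them out later if the recommendations disappoint.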
Rich Web-Based UIs
Although the above list contains a good number of open source tools for solving some of these difficult problems, it is by no means complete, so please discuss your favorites in the comments. These are simply the technologies I've had the most experience with and have found useful. Although a large chunk of the above tools are geared towards Java, this should not discourage you if your application is written in another language, as they can easily be wrapped in a REST or SOAP web service. What is important to take away is that these technologies are no longer the domain of large corporations; they are available to be harnessed by the smallest of startups. As more firms incorporate these technologies and use them strategically to produce high quality applications, the benchmarks will slowly rise, further necessitating their adoption. If you haven't tried any of these technologies or others like them, I recommend you at least learn more about them so that you can bring them into your applications when needed. I don't recommend adding approximate search or a recommendation engine purely for the sake of doing so; the use of these technologies must be driven by a market need. The difference between five years ago and today is that any skilled programmer is able to meet this need, and can no longer fall back on excuses of technological constraints.
If you liked this post please follow me on Twitter for more.
Note: An in-depth discussion of implementing these technologies is beyond the scope of one blog post, so if you want to find out more about using any of the above, Manning Publications has a number of good books on Lucene, Hadoop, Mahout, and Tika, while O'Reilly has good publications on Dojo and jQuery.