By Michael Woloszynowicz

Sunday, January 23, 2011

Why are you Still Making Crap? Open Source to the Rescue

The more I delve into the bevy of open source tools and frameworks, the more I'm stunned when I see poorly executed applications. A few years ago developers had to make concessions when designing an application or implementing certain components, as they were limited by their technological abilities or infrastructure constraints. That's not to say that constraints no longer exist, as time is always a factor, but nowadays you can rarely claim that you're limited by technology. Never before have so many resources been available to even the smallest of companies at little or no cost. The barriers have been broken down, and newly formed startups now have access to many of the same technologies that were previously reserved for the Fortune 500. Below I discuss some of the challenging problems developers face and the open source toolkits that exist to help us solve them.

Reading Various File Formats
Extracting metadata and content from a wide range of file types is rarely a task that programmers get excited about. The explosion of file formats and their variations has made content extraction a major problem for document management systems and search engines, which want access to file content in a standardized way without having to get into the internals of each format. Fortunately, an open source Apache product named Tika helps us achieve this with fairly little effort. Tika provides automatic and reliable media type detection, metadata extraction, language detection, and parser selection for a multitude of file formats in an easy-to-use package. Tika outputs parsed content in a standardized fashion, as either plain text or XHTML, streamed to your application via SAX events, which keeps memory usage to a minimum when dealing with large files. Tika's architecture is easily extensible to accommodate new IANA MIME types and custom content handlers or parsers, its basic tools can be learned very quickly, and integration is a breeze. It's one of those tools that's applicable to almost every web application you build.
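Tika's detectors work largely by sniffing a file's leading "magic" bytes (alongside filename and metadata hints). As a rough illustration of that idea only, and not of Tika's Java API, here is a toy detector with a few hand-picked magic numbers:

```python
# Toy media type detection via magic numbers, in the spirit of what
# Tika does automatically for hundreds of IANA MIME types.
MAGIC_NUMBERS = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),  # also docx, xlsx, jar, ...
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

def detect_media_type(data: bytes) -> str:
    """Return a MIME type guess based on the file's leading bytes."""
    for magic, mime in MAGIC_NUMBERS:
        if data.startswith(magic):
            return mime
    return "application/octet-stream"  # unknown fallback
```

A real detector must also look inside container formats (a docx is a zip underneath, hence the shared PK signature), which is exactly the drudgery Tika takes off your hands.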

Search
Lucene is the de facto standard for search and has been battle tested by providing search capabilities to data giants such as Twitter and Wikipedia. Not only is Lucene a high performance indexing system, it also provides out-of-the-box support for complex analyzers such as N-Grams, Synonyms, Stemming, Double Metaphone, and Edit Distance, which allow you to do cool things like process fuzzy queries. Lucene can also be combined with Tika to index a wide variety of file formats and provide search over their contents. Lucene on its own can be a little onerous to work with, as you'll almost surely have to write some infrastructure code to perform maintenance and deal with performance and ACID issues. There are two additional tools that can help with this. The first is Solr, a REST-based indexing service that can work quite well out of the box for typical requirements. The second and more powerful choice is Hibernate Search. The prerequisite for Hibernate Search is of course the use of the ubiquitous Hibernate ORM, but once Hibernate is in place, plugging in search is merely a matter of annotation or XML configuration. Hibernate Search works by synchronizing your relational database with a Lucene index and mapping result sets back to Hibernate entities. This can improve your search drastically, both from a performance and a quality standpoint, as Hibernate Search removes the need to write slow SQL LIKE queries and supports all the great Lucene analyzers described above. Finally, Hibernate Search has built-in master-slave style clustering support, which is not provided natively by Lucene and is widely regarded as a nightmare to implement.
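To give a flavour of what powers fuzzy queries, the sketch below computes Levenshtein edit distance, the measure behind Lucene's fuzzy matching. This is a textbook illustration of the algorithm in Python, not Lucene's actual implementation:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j] as we sweep rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[len(b)]
```

A fuzzy search for "lucene" would then accept the typo "lucine" (distance 1) while rejecting unrelated terms.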

Computations with Very Large Datasets
Many startups are limited in their ability to efficiently process large amounts of data, as writing distributed code is both complex and time consuming. With the introduction of Google's MapReduce algorithm, we were given a standard way of dealing with large data sets and mapping them to a desired result set. MapReduce was eventually incorporated into certain non-relational DBs such as MongoDB, which made writing MapReduce algorithms fairly straightforward. The problem is that for MapReduce to be useful at a large scale, we still need a way to distribute and merge the data across an array of commodity machines or cloud servers. To help with this, Lucene and Tika creator Doug Cutting built Hadoop, an open source implementation of MapReduce. Hadoop does all the work involved in distribution, merging, scheduling, and fail-over, with users only having to write the MapReduce code to feed into it. Although Hadoop is geared towards Java, it supports streaming via Unix pipes, so mappers and reducers can also be written as Python or PHP scripts. If you're planning on writing a lot of MapReduce algorithms, you would be wise to study a functional programming language such as LISP in order to get more comfortable with the MapReduce way of thinking. An added benefit for startups with little up-front capital is that Amazon provides on-demand Hadoop clusters at relatively low cost. Hadoop is well supported and field tested, as it is continually developed by Yahoo and used by companies such as Facebook, Twitter, and LinkedIn.
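The division of labor is easy to see in the canonical word-count example. The sketch below is plain Python in the Hadoop Streaming spirit; the in-memory `sorted` call stands in for the shuffle phase that Hadoop would perform between the two steps:

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce step: pairs arrive grouped by key (the shuffle
    guarantees this), so per-word counts are a simple sum."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# In a real streaming job, Hadoop pipes input lines to the mapper's
# stdin and the shuffled map output to the reducer's stdin, with each
# step writing tab-separated key/value lines to stdout.
counts = dict(reducer(sorted(mapper(["the quick brown fox", "the lazy dog"]))))
```

Hadoop's value is that these same two functions keep working unchanged when the input is terabytes spread across a thousand disks.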

Machine Learning
For many web firms, the ability to classify data or provide recommendations to users is a highly desirable and potentially profitable feature. Companies such as Amazon and Google have perfected this, and it has become an important factor in their success. The downside is that a machine learning tool is certainly not something a small startup wants to start out by writing, unless machine learning is your startup's business model. The work involved is both immense and difficult, and requires a strong understanding of data mining algorithms. Once again the Apache Software Foundation comes to the rescue with Mahout. Mahout is still in its infancy but already provides many of the tools needed to automatically classify or cluster data and provide recommendations in a scalable manner. Mahout achieves its scalability through the use of Hadoop, so we can immediately see how these various technologies build on one another and why it's important to study them in a hierarchical manner. A less obvious but also useful application of Mahout is performing large or complex vector and matrix computations, which are well supported by Mahout's math package. Mahout does not require prior knowledge of data mining or machine learning (although it is helpful), and can largely be used as a black box.
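For a feel of what a recommender does under the hood, here is a deliberately naive user-based collaborative filtering sketch in Python. The function names and data are mine, purely illustrative; Mahout's actual (Java) interfaces are far richer and, crucially, scale out via Hadoop:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def recommend(target, others):
    """Rank items the target user hasn't rated, weighting each other
    user's ratings by how similar their taste is to the target's."""
    scores = {}
    for ratings in others:
        sim = cosine(target, ratings)
        for item, rating in ratings.items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)
```

At web scale, comparing every pair of users is itself a MapReduce-sized job, which is precisely why Mahout builds on Hadoop rather than on loops like these.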

Rich Web-Based UIs
Dealing with cross-browser inconsistencies and maintaining clean, robust JavaScript code once made the development and maintenance of RIAs a nightmare, and the job was often handed off to third-party tools such as Flash or Silverlight. Today we have a wide variety of tools such as Dojo, JavaScriptMVC, jQuery, backbone.js, and many others that allow us to write well structured MVC JS code that works seamlessly in conjunction with a REST service to provide loose coupling and portability. In addition, toolkits such as Dojo and jQuery provide a large number of prebuilt widgets for common UI elements like trees, grids, pickers, modal dialogs, tabs, etc. These toolkits also abstract away differences in event handling and DOM styling, and speed up the programmatic creation of DOM nodes. Whether you are an object oriented or functional programmer, there is a toolkit out there that will dramatically improve the user experience of your web application and the ease of achieving it. Of course these tools have to be combined with good fundamental knowledge of usability and graphic design, but they drastically simplify the steps from concept to implementation.

Although the above list contains a good number of open source tools for solving some of these difficult problems, it is by no means complete, so please discuss your favorites in the comments. These are the technologies that I've had the most experience with and have found useful. Although a large chunk of the above tools are geared towards Java, this should not discourage you if your application is written in another language, as they can easily be wrapped in a REST or SOAP web service. What is important to take away is that the above technologies are no longer the domain of large corporations, but are available to be harnessed by the smallest of startups. As more and more firms begin to incorporate these technologies and use them in a strategic manner to produce high quality applications, the benchmarks will slowly rise, further necessitating their adoption. If you haven't tried any of these technologies or others like them, I recommend you at least learn more about them so that you can bring them into your applications when needed. I don't recommend that you add approximate search or a recommendation engine purely for the sake of doing so; the use of these technologies must be driven by a market need. The difference between 5 years ago and today is that any skilled programmer is able to meet this need, and can no longer fall back on excuses of technological constraints.

If you liked this post please follow me on Twitter for more.

Note: An in depth discussion on implementing these technologies is beyond the scope of one blog post so if you want to find out more on using any of the above, Manning Publications has a number of good books on Lucene, Hadoop, Mahout, and Tika, while O'Reilly has good publications on Dojo and jQuery. 

Tuesday, January 18, 2011

Apple and a Jobsless Future

With yesterday's announcement of Jobs' leave of absence, speculation regarding Apple's future without him has reignited. TechCrunch and many others argued that the situation is not as bleak as it may seem, as Apple's product roadmap is defined for the next 3-4 years, and the operations side of the business is safe with Tim Cook at the helm. While I'm not of the opinion that Apple cannot survive without Jobs, I do take issue with the product roadmap argument. Although it's certainly beneficial to have a roadmap in place, its value depreciates rapidly in an industry plagued by constant change and strengthening competition. A four year plan is more symbolic than it is beneficial, and is assumed to change in response to market conditions and customer tastes. As a result, Jobs' work will only be a starting point for the current management team rather than a four year guarantee.

Although the Apple executive team has all the hard skills necessary to be successful, such as industrial design, operations, marketing, and technology know-how, it is Jobs' vision and strategy that have made the company successful. Their success stems from the fact that Jobs has been able to define the needs of consumers and create new markets, as Apple did with the iPod, iPhone, and iPad. In addition to the physical products, Jobs has proven to be a master strategist by tying in software offerings such as the iTunes music store and the App Store, which have increased the value of the physical goods, boosted customer retention, and created significant barriers to entry for competitors. By creating a cohesive set of products reinforced by a brilliant software strategy, Apple has created a virtuous cycle, with each product's success propelling the next. I would argue that this is Apple's biggest strength and is what will continue to propel them forward for the next few years. Thanks to their first mover advantage in nearly every market they serve, competitors must take a reactionary approach to each of Apple's products, which has left them at a disadvantage. The problem once again is that the competition is catching up, particularly in the mobile space with Android, so a string of market defining products, Jobsian clairvoyance, and strategic ruthlessness will always be needed in order to continue the growth trajectory that the investment community has come to expect.

Apple has a powerful footing and an executive team that can execute flawlessly. What is needed is someone who can inspire people the way Jobs does through his mix of showmanship and vision, and who can tie together the software and hardware parts of the business to maintain a leadership position in the markets they serve today and the ones yet to be explored. All this being said, I wish Jobs a speedy recovery, and hope that Apple doesn't soon have to worry about its future.

If you liked this post please follow me on Twitter for more.

Sunday, January 16, 2011

Why You Need to Learn JavaScript

No more than 5 years ago, back-end programmers viewed JavaScript (JS) with a great deal of disdain, seeing it as little more than a means of hacking around browser inconsistencies and achieving some basic level of interactivity. For the most part they stayed away from it and laughed at those who called themselves JS programmers. To say the tables have turned would be incorrect, as the back-end remains a vital part of any web application, particularly when complex analysis is needed and scalability issues arise. What we have seen is a dramatic increase in the importance of JS to the web community. Today users expect a high degree of interaction and responsiveness from their web applications that simply cannot be achieved through simple page submissions, hence the birth of AJAX. Although AJAX was a monumental step in boosting JavaScript's importance, it didn't result in an immediate shift in sentiment towards JS. A high level of interaction remained out of reach for most companies, as the majority of their client-side developers lacked the programming fundamentals to develop large scale JS based tools, while back-end developers refused to venture into client-side programming.

Where we are today is very different. A great deal of the work that was once done on the server has moved to the client side, and as the complexity of web UIs grows, the need for individuals with a background in CS who can develop well structured and robust JS applications will grow with it. Couple this with the expansion of mobile development using tools like PhoneGap and Sencha Touch, server side frameworks like Node.js, and the growing shift of desktop applications into "the cloud", and it is clear that JS is here to stay. Despite its growing relevance, many professional developers still refuse to embrace it and leave it as something to be dealt with by client side programmers. The problem with this attitude is that web developers are often not well suited for the demands of today's JS applications. Web developers are trained to design appealing and usable interfaces and translate these designs into accessible HTML/CSS pages, not to write 5000+ line object oriented or functional* applications. What we need are individuals with a strong knowledge of design patterns, data structures, and OO or functional programming to produce code that is maintainable and robust. The best way to make this transition is to start with a framework that removes the drudgery of cross browser compatibility and unlocks the power of JS. For me this was the outstanding Dojo toolkit, which has turned JS programming from a chore into a joy. While the learning curve for Dojo can be a bit steep, it lends itself well to server side developers, as it centers around packages, classes, and all the other goodies of OO programming. With a proper toolkit in hand you'll find that JS is not the steaming pile you thought it to be, but rather another valuable programming language to be mastered. Your versatility and value as a programmer will grow, you'll find that developing highly interactive and innovative UIs is both challenging and rewarding, and you'll be well positioned to develop class-leading web applications from start to finish.

Although toolkits like Dojo are excellent and necessary for creating production applications, you can't fully utilize them without first learning the fundamentals of JavaScript and the document object model. Developing with Dojo often involves studying its source code, so a working knowledge of JS is needed. After all, you wouldn't want to jump into Scala without learning Java first. Study some books on JS (JavaScript: The Good Parts is a nice quick read) and practice writing some basic DOM manipulation, event handling, and object based code. Once you have a good understanding of how references are passed around in JavaScript, the various ways methods can be called and referenced, the way attributes can be accessed, how scope and closures work, and how objects can be encapsulated, you should then begin reading some of the Dojo source code. I recommend studying the API (particularly the ItemFileReadStore) first, as it is very useful, well documented, and exemplifies the power of JS and the techniques you can use within your own applications. Once you've become comfortable with JS, dive further into Dojo (see my previous post on Lessons with Dojo) and enjoy writing JS applications just as you would in any other language. JS is no longer an option for true programmers; it's a requirement...and it's not the nightmare you think it is.
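The encapsulation point deserves a concrete example. This small standalone snippet (the names are mine, not Dojo's) shows the classic closure-based module pattern you'll see throughout the Dojo source:

```javascript
// Closure-based encapsulation: `count` is invisible from outside;
// only the two returned methods can reach it.
function makeCounter(start) {
  var count = start;
  return {
    increment: function () { count += 1; return count; },
    value: function () { return count; }
  };
}

var counter = makeCounter(10);
counter.increment();
counter.increment();
// counter.value() is now 12, while counter.count is undefined:
// the internal state can only be touched through the public methods.
```

Each call to makeCounter creates a fresh closure, so counters never share state; this is the same mechanism libraries like Dojo use to build private members without any class keyword.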

Note: Although I've focused largely on Dojo in the last part of this post, I'm not saying that you should ignore other toolkits like jQuery. jQuery is a great framework in its own right, but given that I have geared this post towards server side programmers, I've chosen Dojo as it's syntactically familiar and heavily object driven, and hence more accessible to this group of users.

* Thanks to HN poster gibsonf1 for pointing out my omission of functional programming. Whether you are a functional or object oriented programmer on the server side is irrelevant. Functional frameworks like backbone.js are great if you have a lot of experience with LISP, Clojure, Haskell, or any other functional language. The important thing is to pick a framework that matches your style of server side programming and apply the same rigor towards writing great JS code.

If you liked this post please follow me on Twitter for more.

Saturday, January 8, 2011

The Niche Market Software Squeeze

Every year we see a growing emphasis on, and rapid improvement in, the user experience and innovation of consumer and broad market enterprise products. Niche market products, however, have not fared so well. As I look at numerous products aimed at niche market consumers or businesses, the majority still exhibit poorly designed interfaces and interactions, and lackluster execution of their core functionality. There are a number of reasons why this situation exists. First, the highest talent individuals strive to make as large an impact on society as possible, and thus pursue markets that are large in size and highly visible. Second, VCs primarily back businesses that pursue large target markets, driving funding to those businesses and leaving niche market providers to bootstrap their ideas. And last, niche markets are harder for outsiders to identify, as they require an understanding of the market's inner workings and processes. Two of these factors are also strong reasons why more technology entrepreneurs need to pursue these markets. There are fewer competitors in the market, and those that exist are doing a fairly poor job of inspiring user loyalty and passion in their products. Because of these disappointing products, it's not always necessary to create disruptive innovations and form new markets; it can simply be a matter of taking the services that already exist and delivering them in a way that instills customer loyalty through joy of use. The greatest challenge in many of these markets is the reluctance to change that users exhibit, as they have been forced to use products that don't work the way they should and rarely deliver on their promise. Success in these spaces is not just about servicing a need, it's about servicing it well and doing so with an understanding of the end users, their limitations, and their goals. It's not about force feeding them a generic product targeted at several industries, but one that does the specific job they've hired your product for.

Although an entrepreneur may not gain the same level of notoriety as one who develops the next big thing, your chances of success are much greater, as needs are easier to identify and products are easier to monetize. The funding problem sadly remains, but with the rise of communities such as Y Combinator and the explosion of open source tools and cloud services it becomes less of an issue every day. There's lots of money to be made in these markets, and there are millions of users who are begging you to create software products that rival the consumer products of today. Find people who work in these markets and ask them about their daily challenges. Start with a small and focused product that delivers a minimum level of functionality in a market that has potential for numerous additional offerings. Observe and measure frequently, and use this feedback to refine your product through continuous deployment. Resist the urge to add functionality to satisfy a small set of potential clients, and always maintain focus on the job your product was initially hired for. Try to design your products as services rather than standalone offerings, so that as the number of products grows you build an ecosystem that can be easily integrated, or incorporated with third party solutions. Use each of these smaller offerings to grow your reputation in the market and springboard to new offerings. An in depth discussion of strategies requires an entire book, so I will simply say that principles and theories such as the Lean Startup, Crossing the Chasm, and the Innovator's Dilemma all apply and should be studied by those who decide to pursue these markets. Niche markets are not without their share of problems, but what startup is without them? Forget about the masses and concentrate on changing the way an industry works.

If you liked this post please follow me on Twitter for more.