Update from Lucene

May 10, 2016

It has been a while since we heard about our old friend Apache Lucene, but the open source search engine has something new, says Open Source Connections in the article, “BM25 The Next Generation Of Lucene Relevance.”  Lucene has added BM25 to its search software, and it just might improve search results.

“BM25 improves upon TF*IDF. BM25 stands for “Best Match 25”. Released in 1994, it’s the 25th iteration of tweaking the relevance computation. BM25 has its roots in probabilistic information retrieval. Probabilistic information retrieval is a fascinating field unto itself. Basically, it casts relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant.”

Apache Lucene formerly relied on TF*IDF, a way to score how relevant a text match is likely to be to a user.  It relies on two factors: term frequency (how often a term appears in a document) and inverse document frequency, aka IDF (how many documents the term appears in, which determines how “special” it is).  BM25 improves on the old TF*IDF, but the classic BM25 formula for IDF can give negative scores for terms that have a high document frequency.  The IDF in Lucene’s BM25 solves this problem by adding 1 to the value before taking the logarithm, making it impossible to deliver a negative value.
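To make the negative-score problem concrete, here is a minimal sketch in Python, not Lucene’s actual code. The function names and the sample numbers are made up for illustration; the two formulas are the textbook BM25 IDF and the variant that adds 1 inside the logarithm.

```python
import math

def idf_textbook_bm25(doc_freq, num_docs):
    # Textbook BM25 IDF: goes negative once a term appears in more
    # than half of the documents.
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def idf_with_added_one(doc_freq, num_docs):
    # Same ratio, but with 1 added before the logarithm is taken,
    # so the result can never drop below zero.
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical collection: 1,000 documents, one rare term and one very common term.
for doc_freq in (5, 900):
    print(doc_freq,
          round(idf_textbook_bm25(doc_freq, 1000), 3),
          round(idf_with_added_one(doc_freq, 1000), 3))
```

For the common term (900 of 1,000 documents) the first function returns a negative number and the second a small positive one, while the rare term scores high under both.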

BM25 will have a big impact on Solr and Elasticsearch, improving search results and accuracy through term frequency saturation, which limits how much each additional occurrence of a term can add to a document’s score.
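Term frequency saturation can be seen in a rough Python sketch of the BM25 term-frequency component. This is an illustration only, not Lucene’s implementation: the function name and the document lengths are invented, and k1 = 1.2 and b = 0.75 are the commonly used default parameter values.

```python
def bm25_tf_component(term_freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Grows with term_freq but flattens out, never exceeding k1 + 1.
    return term_freq * (k1 + 1) / (term_freq + norm)

# Hypothetical document of exactly average length.
for term_freq in (1, 2, 10, 100):
    print(term_freq, round(bm25_tf_component(term_freq, 100, 100), 3))
```

Under these assumptions the hundredth occurrence of a term adds almost nothing, whereas a raw term count would keep growing without limit.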

 

Whitney Grace, May 10, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

On the Prevalence of Open Source

November 11, 2015

Who would have thought, two decades ago, that open source code was going to dominate the software field? Vallified’s Philip O’Toole meditates on “The Strange Economics of Open-Source Software.” Though  the industry gives so much away for free, it’s doing quite well for itself.

O’Toole notes that closed-source software is still in wide use, largely in banks’ embedded devices and underpinning services. Also, many organizations are still attached to their Microsoft and Oracle products. But the tide has been turning; he writes:

“The increasing dominance of open-source software seems particularly true with respect to infrastructure software.  While security software has often been open-source through necessity — no-one would trust it otherwise — infrastructure is becoming the dominant category of open-source. Look at databases — MySQL, MongoDB, RethinkDB, CouchDB, InfluxDB (of which I am part of the development team), or cockroachdb. Is there anyone today that would even consider developing a new closed-source database? Or take search technology — elasticsearch, Solr, and bleve — all open-source. And Linux is so obvious, it is almost pointless to mention it. If you want to create a closed-source infrastructure solution, you better have an enormously compelling story, or be delivering it as part of a bigger package such as a software appliance.”

It has gotten to the point where developers may hesitate to work on a closed-source project because it will do nothing for their reputation.  Where do the profits come from, you may ask? Why, from the sale of services, of course. It’s all part of today’s cloud-based reality.

Cynthia Murrell, November 11, 2015


Prepare To Update Your Cassandra

June 2, 2015

It is time for an update to Apache’s headlining open source database software!  SD Times let us know that “DataStax Enterprise 4.7 Released” and it has a slew of updates set to make open source search enthusiasts drool.  DataStax is a company that built itself around the open source Apache Cassandra software.  The company specializes in enterprise applications for search and analytics.

The newest release, DataStax Enterprise 4.7, includes several updates to improve a user’s enterprise experience:

“…includes a production-certified version of Cassandra 2.1, and it adds enhanced enterprise search, analytics, security, in-memory, and database monitoring capabilities. These include a new certified version of Apache Solr and Live Indexing, a new DSE feature that makes data immediately available for search by leveraging Cassandra’s native ability to run across multiple data centers.”

The update also includes DataStax’s OpsCenter 5.2 for enhanced security and encryption.  It can be used to store encryption keys on servers and to manage admin security.

The enhanced search capabilities are the real bragging points: fault-tolerant search operations, used to customize failed search responses; intelligent search query routing, in which queries are routed to the fastest machines in a cluster for the quickest response times; and extended search analytics, in which search and analytics tasks can run simultaneously using Solr search syntax and Apache Spark.

DataStax Enterprise 4.7 improves enterprise search applications.  It will probably pull in users trying to improve their big data plans.  Has DataStax considered how its enterprise platform could be used for the cloud or on mobile computing?

Whitney Grace, June 2, 2015


A Little Lucene History

March 26, 2015

Instead of venturing to Wikipedia to learn about Lucene’s history, visit the Parse.ly blog and read the post, “Lucene: The Good Parts.”  After detailing how Doug Cutting created Lucene in 1999, the post describes how searching through SQL in the early 2000s was a huge task.   SQL databases are not the best when it comes to unstructured search, so developers installed Lucene to make SQL document search more reliable.  What is interesting is how much it has been adopted:

“At the time, Solr and Elasticsearch didn’t yet exist. Solr would be released in one year by the team at CNET. With that release would come a very important application of Lucene: faceted search. Elasticsearch would take another 5 years to be released. With its recent releases, it has brought another important application of Lucene to the world: aggregations. Over the last decade, the Solr and Elasticsearch packages have brought Lucene to a much wider community. Solr and Elasticsearch are now being considered alongside data stores like MongoDB and Cassandra, and people are genuinely confused by the differences.”

The post is worth a read if you need a refresher or a brief overview of how Lucene works, related jargon, tips for using it in big data projects, and a few more tricks.  Lucene might just be a Java library, but it makes using databases much easier.  We have said for a while that information is only useful if you can find it easily.  Lucene made information search and retrieval much simpler and more accurate.  It laid the groundwork for the current big data boom.

Whitney Grace, March 26, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
