Taming the WEASL, Weapon for Evaluation & Analysis of Solr/Lucene


I do a lot of work with Apache Solr, which is a front-end wrapper for Apache Lucene.  Lucene is a text analysis search engine.  Basically, you give Lucene a “document”, and it analyzes the text of that document and stores an index of which words and phrases occur in which documents.  What Solr does is provide an HTTP-based front end and a “ready to go” installation for the entire system.  Lucene is an amazing piece of software, and Solr is just awesome, as far as I’m concerned.

With that being said, Solr’s admin interface leaves a lot to be desired.  Solr has the concept of “cores”, which are independent, unrelated indexes.  Reminiscent of Drupal’s multi-site installation, Solr’s multi-core setup allows you to set up multiple search engines for multiple sites, all on a single installation of Solr.  Unfortunately, you can’t work across multiple cores in the admin interface.  For instance, in one installation that I work on, I have multiple cores, one for each site.  These sites are all similar, and they all use the same Solr schema.  There’s also the potential for the same content to be created across multiple sites and to end up in multiple Solr cores.  Now, if I want to see whether a document occurs in multiple cores, I have to go to each core’s admin interface separately and run a query.  That’s my base gripe with Solr, and I’ve been working on a tool to alleviate the problem for a while now.  I call it The WEASL!

The WEASL (Weapon for Evaluation and Analysis of Solr/Lucene) isn’t quite ready for prime time, but I thought I’d go ahead and put it out there in case it helps anyone else.  Basically, it started as a throwaway Python script that got a list of the cores in a Solr installation for further processing.  I quickly realized that once I had that list, it could be the base for some multi-core querying.  Right now, the WEASL is mostly concerned with counting documents, but it does a few other things:

The WEASL can

  • List all the cores in a Solr installation
  • List document totals per core, and aggregate and export those counts to a CSV file
  • Execute a query across multiple cores
  • Execute a query against a single core
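
The count-and-export item could look something like the sketch below.  This is not lifted from the WEASL source: the function name `export_counts`, the column headers, and the sample core names are all made up for illustration, and it is written for Python 3.  In practice the counts would come from each core's `numFound` value.

```python
import csv
import io


def export_counts(counts, out):
    """Write per-core document counts plus a total row as CSV.

    `counts` maps core name -> document count; `out` is any writable
    text file object. Returns the aggregate total.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["core", "documents"])
    total = 0
    for core in sorted(counts):
        writer.writerow([core, counts[core]])
        total += counts[core]
    writer.writerow(["TOTAL", total])
    return total


# Hypothetical counts for two cores, written to an in-memory buffer
# instead of a real file so the sketch has no side effects.
buffer = io.StringIO()
total = export_counts({"site_a": 120, "site_b": 45}, buffer)
print(buffer.getvalue())
```

Passing the output file object in, rather than a filename, keeps the function easy to try out without touching the disk.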

Like I said before, it started off as a throwaway script, so it needs some cleanup, but it’s usable now.  It needs to be more flexible so that it can be used in other situations.  Right now, some of the queries are hard-coded, which stinks.  I’m also planning on wrapping it in a Python class to make it more suitable for use as a module.

You’re welcome to give it a try; it’s available on GitHub: https://github.com/technopoetic/weasl.  Here are a few of the high points:

This function hits the core admin STATUS action (/solr/admin/cores?action=STATUS) to get the list of cores.  This is the base of the multi-core query functionality.

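# Gets all cores for the Solr installation by parsing the STATUS
# response from the core admin handler.
# Python 2.  ET is assumed to be the standard library's ElementTree, and
# Config a ConfigParser instance that has already read the settings file.
import urllib2
import xml.etree.ElementTree as ET

def get_cores_list():
  admin_url = Config.get("Solr server", "master_host") + "/solr/admin/cores?action=STATUS"
  tree = ET.parse(urllib2.urlopen(admin_url))
  root = tree.getroot()
  # <lst name="status"> holds one child <lst> per core.
  status = root.find(".//lst[@name='status']")
  if status is None:
    return []
  return [child.get('name') for child in status]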
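
To make the parsing step easier to follow, here is a small Python 3 sketch that runs the same XPath lookup against a canned response instead of a live server.  The XML below is an abbreviated, hand-written stand-in for a STATUS response (a real one carries many more fields per core), and `parse_core_names` and the core names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for a STATUS response. Note that the header also
# contains an element named "status", but it is an <int>, not a <lst>.
SAMPLE_STATUS = """<response>
  <lst name="responseHeader"><int name="status">0</int></lst>
  <lst name="status">
    <lst name="site_a"><str name="name">site_a</str></lst>
    <lst name="site_b"><str name="name">site_b</str></lst>
  </lst>
</response>"""


def parse_core_names(xml_text):
    """Return the core names listed under <lst name="status">."""
    root = ET.fromstring(xml_text)
    status = root.find(".//lst[@name='status']")
    if status is None:
        return []
    # Each direct child of the status list represents one core.
    return [child.get("name") for child in status]


core_names = parse_core_names(SAMPLE_STATUS)
print(core_names)
```

Separating the parsing from the HTTP call like this also makes it possible to test the logic without a running Solr instance.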

Once we have the list of cores, we can do stuff with them.  Here’s the function to execute a query across all of the cores.

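# Execute a query across all cores and print per-core and total hit counts.
import urllib  # Python 2: provides quote() for percent-encoding

def query_multi_core(query):
  cores = get_cores_list()
  numResults = 0
  for core in cores:
    # Percent-encode the query so spaces and special characters survive in the URL.
    url_string = Config.get("Solr server", "master_host") + '/solr/{0}/select/?q={1}'.format(core, urllib.quote(query))
    try:
      tree = ET.parse(urllib2.urlopen(url_string))
    except urllib2.URLError:
      # URLError also covers HTTPError, which is its subclass.
      print "Error connecting to core: {0}".format(core)
      continue
    rootElem = tree.getroot().find('result')
    if rootElem is None:
      print "No result element returned by core: {0}".format(core)
      continue
    numFound = int(rootElem.get('numFound'))
    print "\n" + core + ": " + url_string
    print "Results: " + str(numFound)
    numResults += numFound
  print "Total Results across all cores: " + str(numResults)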
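
The two moving parts of that loop, building the select URL and pulling `numFound` out of the response, can be tried in isolation.  The following is a minimal Python 3 sketch under some assumptions: the host `http://localhost:8983`, the core names, and the helper names `build_select_url` and `parse_num_found` are invented for the example, and the XML strings are trimmed-down stand-ins for real select responses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def build_select_url(host, core, query):
    """Build a select URL for one core, percent-encoding the query."""
    return "{0}/solr/{1}/select/?q={2}".format(host, core, quote(query, safe=""))


def parse_num_found(xml_text):
    """Return numFound from a select response, or 0 if <result> is missing."""
    result = ET.fromstring(xml_text).find("result")
    if result is None:
        return 0
    return int(result.get("numFound", 0))


# Hypothetical per-core responses, keyed by core name.
SAMPLE_RESPONSES = {
    "site_a": '<response><result name="response" numFound="3" start="0"/></response>',
    "site_b": '<response><result name="response" numFound="4" start="0"/></response>',
}

url = build_select_url("http://localhost:8983", "site_a", "title:solr rocks")
total = sum(parse_num_found(xml) for xml in SAMPLE_RESPONSES.values())
print(url)
print("Total Results across all cores: " + str(total))
```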

Right now it’s a command-line script, but I plan to wrap it in a Python class so that it can be imported as an actual module.
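
One possible shape for such a class is sketched below in Python 3.  This is purely speculative: the class name `Weasl`, its constructor arguments, and the idea of passing in a `fetch` callable are assumptions made for the example, not the planned design.

```python
import xml.etree.ElementTree as ET


class Weasl(object):
    """Hypothetical wrapper; `fetch` is any callable mapping a URL to XML text."""

    def __init__(self, host, fetch):
        self.host = host
        self.fetch = fetch

    def cores(self):
        """Return the core names reported by the STATUS action."""
        xml_text = self.fetch(self.host + "/solr/admin/cores?action=STATUS")
        status = ET.fromstring(xml_text).find(".//lst[@name='status']")
        if status is None:
            return []
        return [child.get("name") for child in status]


def fake_fetch(url):
    """Stub fetcher so the sketch runs without a Solr server."""
    return '<response><lst name="status"><lst name="site_a"/></lst></response>'


weasl = Weasl("http://localhost:8983", fake_fetch)
print(weasl.cores())
```

Keeping the HTTP call behind a callable means the same class could be pointed at a real server or at canned responses for testing.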