INTERVIEW ON THE PRICE OF BUSINESS SHOW, MEDIA PARTNER OF THIS SITE.
This article discusses some of the issues involved in integrated full-text and metadata searching, using the dtSearch enterprise and developer product line as an example. As general background, dtSearch products instantly search terabytes of “Office” files, PDFs, emails (including nested attachments), databases and Internet or Intranet data. The dtSearch product line can run locally or remotely, including in a cloud-based environment like Azure or AWS.
Because dtSearch can instantly search terabytes, many dtSearch customers are large organizations, including 4 out of 5 of the Fortune 500’s largest Aerospace and Defense companies as well as federal, state and international government agencies. But along with enterprise-level search, dtSearch also lets you instantly search your own documents, emails and the like, and you can download a fully-functional evaluation version anytime at dtSearch.com.
On the technical end, dtSearch products instantly search terabytes after first building one or more terabyte-size indexes. All you have to do is point dtSearch at the relevant folders to start the indexing process. There is no need to tell dtSearch what mix of Word, Excel, PowerPoint, Access, OneNote, XML, HTML, PDF files, emails, ZIP, RAR files, etc. are in the collection. dtSearch will automatically figure that out for itself. And then after indexing, dtSearch provides over 25 different search options to instantly search through the data.
Normally, when addressing the topic of text retrieval, the focus is on full-text searching. So, for example, you could enter a search like “chocolate candy w/30 of peanut butter”, and dtSearch would search both the full text and the metadata of everything, looking for the phrase chocolate candy within 30 words of the phrase peanut butter.
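To make the idea of a proximity search concrete, here is a minimal Python sketch of a "phrase within N words of another phrase" check. It is not dtSearch code and does not use the dtSearch API; the function names, the tokenizer and the way the word gap is counted (from the last word of the earlier phrase to the first word of the later one) are all assumptions chosen for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def phrase_starts(tokens, phrase_tokens):
    """Return every index at which the phrase begins in the token list."""
    n = len(phrase_tokens)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens]

def within(text, phrase_a, phrase_b, max_words):
    """True if phrase_a occurs within max_words words of phrase_b (either order)."""
    tokens = tokenize(text)
    a_tokens, b_tokens = tokenize(phrase_a), tokenize(phrase_b)
    for a in phrase_starts(tokens, a_tokens):
        for b in phrase_starts(tokens, b_tokens):
            # Assumed distance rule: count from the last word of the
            # earlier phrase to the first word of the later phrase.
            if a <= b:
                gap = b - (a + len(a_tokens) - 1)
            else:
                gap = a - (b + len(b_tokens) - 1)
            if gap <= max_words:
                return True
    return False

sample = "Our chocolate candy has a creamy peanut butter center."
print(within(sample, "chocolate candy", "peanut butter", 30))  # True
print(within(sample, "chocolate candy", "peanut butter", 3))   # False
```

A real search engine answers this kind of request from word positions stored in its index rather than by rescanning each document, which is what makes the search fast on large collections.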
But sometimes combining a full-text search with a limit based on metadata yields more relevant search results. You could still enter the “chocolate candy w/30 of peanut butter” search request, but limit the search results to just emails that contain Aunt Sally in the correspondents field. Going one step further, application developers can then take a combination of metadata and full-text searching to an entirely different level.
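The combination of a metadata limit and a full-text match can be sketched in a few lines of Python. This is a generic illustration, not the dtSearch API: the `Document` class, the `type` and `correspondents` field names and the simple substring match standing in for a real full-text query are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    name: str
    text: str
    metadata: dict = field(default_factory=dict)

def metadata_matches(doc, required):
    """True if every required field contains the requested value (case-insensitive)."""
    return all(
        value.lower() in str(doc.metadata.get(key, "")).lower()
        for key, value in required.items()
    )

def search(docs, phrase, **required):
    """Return documents that pass the metadata limit and contain the phrase."""
    phrase = phrase.lower()
    return [d for d in docs if metadata_matches(d, required) and phrase in d.text.lower()]

# Hypothetical sample data.
docs = [
    Document("recipe.eml", "Try peanut butter in the fudge.",
             {"type": "email", "correspondents": "Aunt Sally; Me"}),
    Document("order.eml", "Peanut butter cups shipped today.",
             {"type": "email", "correspondents": "Candy Shop"}),
    Document("notes.docx", "Aunt Sally likes peanut butter.",
             {"type": "document"}),
]

hits = search(docs, "peanut butter", type="email", correspondents="Aunt Sally")
print([d.name for d in hits])  # ['recipe.eml']
```

Note that notes.docx mentions Aunt Sally in its text but is excluded, because the limit applies to the correspondents field and not to the body.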
Developers using the dtSearch Engine can associate a metadata tag with a document or email even if that metadata tag doesn’t appear in the document or email itself. If a certain data collection is part of the ProjectABC fileset, the developer can add ProjectABC as a searchable element relating to each document in the collection, even if not every document references ProjectABC in a metadata field such as Subject.
The developer could of course modify the original documents to add a new field. But typically it is not a good idea to alter original records. The better way to add that type of element is as a metadata element included in the index, regardless of whether the documents themselves contain a matching reference.
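The principle of attaching metadata at indexing time, while leaving the source files untouched, can be illustrated with a toy inverted index in Python. This is a conceptual sketch only: `TinyIndex`, its methods and the `Project` field name are hypothetical and are not part of the dtSearch Engine API.

```python
import re
from collections import defaultdict

class TinyIndex:
    """Toy inverted index mapping (field, term) pairs to document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text, extra_fields=None):
        """Index the document text plus any fields supplied by the caller."""
        for term in re.findall(r"\w+", text.lower()):
            self.postings[("body", term)].add(doc_id)
        # Extra fields live only in the index; the source text is never modified.
        for field_name, value in (extra_fields or {}).items():
            self.postings[(field_name.lower(), value.lower())].add(doc_id)

    def lookup(self, field_name, term):
        """Return the ids of documents holding the term in the given field."""
        return set(self.postings.get((field_name.lower(), term.lower()), set()))

index = TinyIndex()
index.add("memo.txt", "Quarterly budget review", {"Project": "ProjectABC"})
index.add("notes.txt", "Lunch menu for Friday")

print(index.lookup("Project", "ProjectABC"))  # {'memo.txt'}
print(index.lookup("body", "ProjectABC"))     # set()
```

The second lookup returns nothing because the tag was never written into the document body, yet the document is still findable through the added field.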
Sometimes, however, it is not just one or two pieces of metadata that a developer needs to add but a whole backend database’s worth of metadata (from SQL, NoSQL or SharePoint, for example) relating to a separate document collection. In that case, the dtSearch Engine can index the database metadata together with the referenced documents (or BLOB data, as the documents are called when stored inside the database). With that, it is possible to do even more complex combinations of full-text and metadata searching.
One way dtSearch developers can leverage complex metadata is to let end-users use that metadata to do faceted searching. With faceted searching, end-users can drill down through certain metadata prior to entering a full-text search. That way, an end-user could search all toys in an online store, or limit a search to just board games and puzzles and then further limit a search to board games and puzzles recommended for a certain age range.
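As a rough illustration of the drill-down idea, the following Python sketch counts facet values, narrows a catalog by selected facets and then applies a text match to what remains. The catalog entries, field names and helper functions are hypothetical examples, and the substring match is only a stand-in for a real full-text search.

```python
from collections import Counter

def facet_counts(items, facet):
    """Count how many items carry each value of the given facet."""
    return Counter(item[facet] for item in items if facet in item)

def drill_down(items, **selected):
    """Keep items whose value for every selected facet is in the allowed set."""
    return [
        item for item in items
        if all(item.get(facet) in allowed for facet, allowed in selected.items())
    ]

def text_search(items, phrase):
    """Stand-in for full-text search: case-insensitive substring match."""
    return [item for item in items if phrase.lower() in item["description"].lower()]

# Hypothetical online toy store catalog.
catalog = [
    {"name": "Castle Siege", "category": "board games", "age_range": "8-12",
     "description": "A cooperative strategy game"},
    {"name": "Ocean Jigsaw", "category": "puzzles", "age_range": "8-12",
     "description": "A 500 piece puzzle of a coral reef"},
    {"name": "Robot Kit", "category": "building sets", "age_range": "8-12",
     "description": "Build a walking robot"},
    {"name": "Farm Puzzle", "category": "puzzles", "age_range": "3-5",
     "description": "A chunky wooden puzzle"},
]

narrowed = drill_down(catalog, category={"board games", "puzzles"}, age_range={"8-12"})
print([item["name"] for item in narrowed])                         # ['Castle Siege', 'Ocean Jigsaw']
print([item["name"] for item in text_search(narrowed, "puzzle")])  # ['Ocean Jigsaw']
```

The counts from `facet_counts` are what a user interface would typically show next to each facet value, so the end-user can see how many items each choice would leave.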
In an enterprise setting, the dtSearch developer can use the complex metadata associated with each file to make sure that each end-user only accesses documents associated with the specific combination of metadata that defines that end-user’s access rights. With granular data classification, HR will see one specific subset of the ProjectABC files, legal will see another subset, accounting will see another subset, and someone in oversight might see a completely different cross-section of the data.
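One simple way to picture metadata-based access control is as a filter applied to search results before they reach the end-user. The Python sketch below assumes each user holds a list of "grants", where a grant is a combination of metadata values that a document must match in full. The grant model, field names and sample files are hypothetical and do not represent how dtSearch implements access rights.

```python
def is_visible(metadata, grants):
    """True if the metadata satisfies every key/value pair of at least one grant."""
    return any(
        all(metadata.get(key) == value for key, value in grant.items())
        for grant in grants
    )

def filter_for_user(results, grants):
    """Drop any search result the user's grants do not cover."""
    return [doc for doc in results if is_visible(doc["metadata"], grants)]

# Hypothetical search results with classification metadata.
results = [
    {"name": "salaries.xlsx", "metadata": {"project": "ProjectABC", "audience": "hr"}},
    {"name": "contract.pdf", "metadata": {"project": "ProjectABC", "audience": "legal"}},
    {"name": "invoices.csv", "metadata": {"project": "ProjectABC", "audience": "accounting"}},
    {"name": "roadmap.docx", "metadata": {"project": "ProjectXYZ", "audience": "hr"}},
]

hr_grants = [{"project": "ProjectABC", "audience": "hr"}]
oversight_grants = [
    {"project": "ProjectABC", "audience": "legal"},
    {"project": "ProjectABC", "audience": "accounting"},
]

print([d["name"] for d in filter_for_user(results, hr_grants)])         # ['salaries.xlsx']
print([d["name"] for d in filter_for_user(results, oversight_grants)])  # ['contract.pdf', 'invoices.csv']
```

In this sketch a user with no grants sees nothing, which is a deliberate deny-by-default choice. An empty grant (`{}`) would match every document, so a real system would need to guard against that.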
In closing, as mentioned above, you are welcome to go to dtSearch.com and download a fully-functional evaluation version to instantly search terabytes of your own data — and metadata.
The Price of Business is one of the longest running shows of its kind in the country and is in markets coast to coast. The Host, Kevin Price, is a multi-award winning author, broadcast journalist, and syndicated columnist. Learn more about the show and its digital partners at www.PriceofBusiness.com (scroll down to the bottom of the page).