THE TIMES USA

The News that Matters

6 Navigation Tips for Terabytes of Data

Posted on April 23, 2021 by admin

By Elizabeth Thede, Special for The Times USA

 

Need to sift through terabytes of data? This article offers a step-by-step guide to using a search engine to find what you need.

This guide borrows its terminology and examples from the search engine dtSearch®. dtSearch’s enterprise and developer products can run “on premises” or on cloud platforms to instantly search terabytes across a wide range of online and offline data. Many dtSearch customers are Fortune 100 companies and government agencies. But anyone with lots of data can download a fully functional 30-day evaluation copy from dtSearch.com.

Step 1: Use the search engine to index the data. A search engine like dtSearch can search terabytes of “Office” files, PDFs, emails along with nested attachments, etc., even if you don’t build a search index. But while unindexed search is slow, indexed search is typically instantaneous, even for multiple concurrent users across terabytes of data.
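To see why an index makes such a difference, it can help to look at a toy inverted index. The sketch below is only an illustration of the general idea, not dtSearch's actual implementation; the function names (`tokenize`, `build_index`, `search_all`) and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search_all(index, query):
    """Return the documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data standing in for files and emails
docs = {
    "memo.txt": "Quarterly coffee budget approved",
    "mail.eml": "Please order more tea and coffee",
    "notes.txt": "Tea tasting scheduled for Friday",
}
index = build_index(docs)
print(sorted(search_all(index, "coffee tea")))
```

Building the index reads every document once. After that, a query only looks up a few words in the index instead of rescanning all of the text, which is why indexed search stays fast as the data grows.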

How do you get the search engine to build an index? Just point to the directories, email archives and other data repositories you want to index, and dtSearch will do the rest. In fact, the same index can include multiple different data repositories. For each data repository (including compressed archives inside of a data repository), dtSearch goes through every file, email and the like and automatically figures out the relevant file type.

In doing so, dtSearch uses information inside each file rather than relying on the file extension like .PDF, .DOCX, .PST, etc. It is all too easy to have an Access database with an Excel spreadsheet extension, or a PowerPoint with an email file extension. Looking inside the file itself to determine the correct format is essential to correctly parsing the data.
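As a rough illustration of looking inside a file instead of trusting its extension, the sketch below checks the leading bytes of a file against a handful of widely documented format signatures. It is a minimal example under stated assumptions: the function names `sniff_format` and `sniff_file` are hypothetical, the signature list is far from complete, and it does not represent how dtSearch itself identifies formats.

```python
# Leading byte signatures ("magic numbers") for a few common formats.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (DOCX, XLSX, PPTX or plain ZIP)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy DOC, XLS, PPT, MSG)"),
    (b"!BDN", "Outlook PST/OST store"),
]


def sniff_format(header):
    """Guess a format from the first bytes of a file, ignoring its name."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    # Access databases carry an identifying string starting at byte offset 4.
    if header[4:19] in (b"Standard Jet DB", b"Standard ACE DB"):
        return "Access database"
    return "unknown"


def sniff_file(path):
    """Read the first 32 bytes of a file on disk and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(32))


# Hypothetical case: a file named budget.xlsx that really holds an Access database
mislabeled = b"\x00\x01\x00\x00Standard Jet DB\x00" + b"\x00" * 12
print(sniff_format(mislabeled))
```

A real parser would go further, for example by opening a ZIP container to tell a DOCX from an XLSX, but the principle is the same: the bytes decide, not the file name.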

Step 2: Check the index log. Checking the index log is an important step to make sure that everything is fully indexed. For example, you can have “image-only” PDFs mixed in with ordinary PDFs without even realizing it. An “image-only” PDF may look ordinary, but its content is really just a picture. When you try to copy and paste what looks like words, nothing is copied because the apparent words are only part of an image, with no underlying text.

The index log flags image-only PDFs so you can run them through an OCR application like Adobe Acrobat to turn them into regular text-based PDFs. As a side note, when dtSearch updates an index, it need only look at what has been added, deleted or changed rather than rebuilding the whole index from scratch.
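One common way to find out what has been added, deleted or changed is to compare a saved snapshot of file metadata with a fresh one. The sketch below shows that idea only; it assumes that modification time and size are good enough change signals, and the names `snapshot` and `diff_snapshots` are hypothetical. The article does not say how dtSearch detects changes.

```python
import os


def snapshot(root):
    """Record (modification time, size) for every file under root."""
    state = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            info = os.stat(path)
            state[path] = (info.st_mtime_ns, info.st_size)
    return state


def diff_snapshots(old, new):
    """Return sorted lists of added, deleted and changed paths."""
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    return added, deleted, changed


# Hypothetical snapshots taken before and after some file activity
old = {"a.pdf": (1, 10), "b.docx": (1, 20), "c.eml": (1, 30)}
new = {"a.pdf": (1, 10), "b.docx": (2, 25), "d.pst": (1, 40)}
added, deleted, changed = diff_snapshots(old, new)
print("added:", added, "deleted:", deleted, "changed:", changed)
```

Only the paths in those three lists need any work during an update; everything else in the index can stay as it is.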

Step 3: Leverage all of the many search options to refine your search query. The main tip here is not to limit yourself to natural language unstructured search requests or simple word and phrase searching. dtSearch has over 25 different search types, everything from Boolean, proximity and concept searching to metadata-focused options and credit card recognition. Use the full range of search features to generate a query tailored to exactly what you are looking for.

One specific search option to keep in mind is fuzzy searching, adjustable from 0 to 10 to sift through minor typographical or OCR errors. If you are looking for coffee and it is misspelled coffre in an email or as a result of a blurry OCR’ed original, a low level of fuzzy searching will still pick that up. Fuzzy searching works on top of other search options.
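A simple way to picture fuzzy matching is edit distance: the number of single-character changes needed to turn one word into another. The sketch below uses the classic Levenshtein distance as a stand-in. The `max_edits` parameter is a hypothetical knob for illustration; the article does not say how dtSearch's 0 to 10 fuzziness setting is computed.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term, words, max_edits):
    """Return the words within max_edits changes of the search term."""
    term = term.lower()
    return [w for w in words if edit_distance(term, w.lower()) <= max_edits]


# Hypothetical word list from an index
words = ["coffee", "coffre", "toffee", "copper", "tea"]
print(fuzzy_matches("coffee", words, 1))
```

Note the trade-off: allowing one edit picks up the misspelled coffre, but it also picks up toffee. That is why a low level of fuzziness is usually the place to start.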

The search options mentioned here, including fuzzy searching, work not only with English text but also with text in any of the hundreds of international languages that Unicode supports. For use in a multi-user concurrent-searching environment, each search request runs on its own thread. That way, each user’s search requests can proceed separately and with instant response.

Step 4: Sort and re-sort search results by relevance and other sorting metrics. After a search, dtSearch shows a full view of each retrieved file, email and the like with highlighted hits. If your search retrieves only a small number of files, scanning all of them is relatively straightforward. But when a search retrieves a large number of items, sorting becomes important.

Relevancy-ranking uses a “vector-space” algorithm to sort by hit term density and rarity. If a search term is less frequent in indexed data, it will get a higher relevancy ranking. Say you search for coffee or tea. If there are millions of tea references but only a few coffee mentions, coffee references and especially files with denser coffee mentions will have a higher relevancy rank.
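The coffee or tea example can be worked through with a small scoring sketch. It multiplies how dense a term is within a file by how rare the term is across all files, in the spirit of the well-known TF-IDF weighting. This is an assumption-laden illustration, not dtSearch's actual formula; the function `rank`, the `log(1 + N / df)` rarity weight, and the sample files are all hypothetical.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(docs, query_terms):
    """Order matching documents by summed density times rarity."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    total = len(tokens)
    # Rarity: a term found in fewer documents gets a larger weight.
    rarity = {}
    for term in query_terms:
        doc_freq = sum(1 for words in tokens.values() if term in words)
        rarity[term] = math.log(1 + total / doc_freq) if doc_freq else 0.0
    scores = {}
    for name, words in tokens.items():
        if not words:
            continue
        # Density: share of the document's words that are the term.
        score = sum(words.count(t) / len(words) * rarity[t] for t in query_terms)
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical files: tea is everywhere, coffee appears in only one
docs = {
    "a.txt": "tea tea tea time",
    "b.txt": "tea with milk today",
    "c.txt": "coffee coffee beans and tea",
    "d.txt": "green tea leaves steep slowly",
}
print(rank(docs, ["coffee", "tea"]))
```

Even though a.txt is packed with tea, c.txt comes out on top because coffee is the rarer term and c.txt mentions it densely.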

But the main point is that you are not stuck with your initial sorting. If you have relevancy-ranking as the default, you can instantly re-sort by descending or ascending file and email date, by file or email location, etc. Different sorting options can give you a better window into search results.
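Re-sorting is cheap because it only reorders results that have already been retrieved. The short sketch below shows the idea on a hypothetical result list; the field names (`path`, `date`, `score`) and values are invented for illustration.

```python
from datetime import date
from operator import itemgetter

# Hypothetical search results with a relevancy score, date and location
results = [
    {"path": "mail/inbox/order.eml", "date": date(2021, 3, 2), "score": 0.41},
    {"path": "docs/budget.xlsx", "date": date(2020, 11, 17), "score": 0.87},
    {"path": "archive/memo.pdf", "date": date(2021, 4, 9), "score": 0.63},
]

by_relevance = sorted(results, key=itemgetter("score"), reverse=True)
newest_first = sorted(results, key=itemgetter("date"), reverse=True)
oldest_first = sorted(results, key=itemgetter("date"))
by_location = sorted(results, key=itemgetter("path"))

for label, ordering in [
    ("relevance", by_relevance),
    ("newest first", newest_first),
    ("location", by_location),
]:
    print(label, [r["path"] for r in ordering])
```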

Step 5: Generate a search report. dtSearch can also generate a search report pulling together each hit across all retrieved files with as much context around each hit as you want. A search report is a great way to bring together a lot of hits across a large number of files into one easy-to-read summary.
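A search report of this kind is essentially a "keyword in context" listing. The sketch below builds one from plain text, with an adjustable number of context words around each hit. The function `search_report`, its `context_words` parameter, and the bracket markers are hypothetical choices for illustration and do not reflect dtSearch's report format.

```python
import re


def search_report(docs, term, context_words=3):
    """List every hit with a few words of context on each side."""
    lines = []
    for name, text in docs.items():
        words = text.split()
        for i, word in enumerate(words):
            # Compare without punctuation so "coffee." still counts as a hit.
            if re.sub(r"\W", "", word).lower() == term.lower():
                start = max(0, i - context_words)
                parts = words[start:i] + [f"[{word}]"] + words[i + 1:i + 1 + context_words]
                lines.append(f"{name}: " + " ".join(parts))
    return lines


# Hypothetical retrieved files
docs = {
    "memo.txt": "Please order more coffee before the Friday meeting.",
    "mail.eml": "Coffee is out again.",
}
report = search_report(docs, "coffee", context_words=2)
print("\n".join(report))
```

Raising `context_words` gives each hit more surrounding text, which mirrors the idea of choosing as much context as you want.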

Step 6: Consider caching. For caching, you need to go back to step one, where you build an index. Caching can store a full copy of each indexed file or email in the index itself. That way, when dtSearch goes to display a file, even if the original is no longer there or is subject to a spotty online connection, the search results can nonetheless instantly show the retrieved item with highlighted hits.
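The idea behind caching can be sketched as an index that keeps its own copy of each item's text, so a hit can still be displayed when the original is unavailable. The class below is a minimal illustration only: the name `CachingIndex`, the use of zlib compression, and the asterisk highlighting are assumptions, not a description of how dtSearch stores or displays cached documents.

```python
import re
import zlib


class CachingIndex:
    """Toy index that stores a compressed copy of each document's text."""

    def __init__(self):
        self._cache = {}

    def add(self, name, text):
        # Keep a compressed copy so display never needs the original file.
        self._cache[name] = zlib.compress(text.encode("utf-8"))

    def show(self, name, term):
        """Return the cached text with each hit wrapped in asterisks."""
        text = zlib.decompress(self._cache[name]).decode("utf-8")
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        return pattern.sub(lambda m: f"**{m.group(0)}**", text)


# Hypothetical document; after add(), the original could be deleted or offline
cache = CachingIndex()
cache.add("memo.txt", "Order more coffee. Coffee is out.")
print(cache.show("memo.txt", "coffee"))
```

The cost of this convenience is a larger index, since every cached item takes up space alongside the search data.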

If you have terabytes of data you need to navigate, you are welcome to download a 30-day evaluation version from dtSearch.com to get started now.

RELATED: Kevin Price of the Price of Business show discusses the topic with Thede in a recent interview.
