Data Culling: 6 Best Practices to Avoid Overcollection and Reduce eDiscovery Expenses


It’s no secret that eDiscovery is expensive—and that’s a problem for organizations that are looking for ways to cut costs in an inflationary economy. But at the same time, the amount of data that organizations generate and manage is increasing, further driving up the cost of eDiscovery. What’s a cost-conscious organization to do?

Ideally, you’d have less data to manage. One way to accomplish that is to stop overcollecting data during the early stages of eDiscovery. Collecting more than necessary means you’ll spend more money storing and transferring data and more hours processing, reviewing, and analyzing it—wasting precious time and resources that proportional collection could have saved.

Luckily, your organization can avoid overcollecting data by improving its data culling practices, keeping costs down and saving time. What are some best practices for data culling and avoiding overcollection of evidence? Keep reading to find out.


What is data culling?
Why is data culling becoming more important?
What are the advantages of in-house data culling and processing?
6 best practices for efficient in-house data culling
How analytical tools can make data culling more efficient
Why eDiscovery software is critical to strategic data culling

What is data culling?

Data culling is the act of paring down information to eliminate everything irrelevant, redundant, and otherwise unhelpful. You can separate this information from relevant data by file type, search term or keyword, email sender or recipient, and date range. You can also identify and eliminate duplicate data. Data that is culled isn’t subject to the document review and analysis process; it’s cut from the corpus of relevant information, so it no longer needs to be managed through eDiscovery.

Why is data culling becoming more important?

Data culling is becoming more and more important due to our information-driven society. Organizations now have more data to collect, process, review, and analyze than ever before, which means higher eDiscovery costs. Meanwhile, many organizations have less bandwidth to shoulder those expenses.

With global economic growth expected to slow to 2.7% in 2023, down markedly from 6% in 2021, we are likely entering one of the biggest economic slowdowns since 2001, behind only the global financial crisis of 2007 to 2008 and the worst stage of the COVID-19 pandemic. In this context, organizations are rightfully concerned with keeping expenses low, and that is as true as ever when it comes to eDiscovery.

Avoiding overcollection by weeding out irrelevant and duplicative data is key to keeping eDiscovery expenses down. The bottom line is: when you reduce the amount of data you are working with, you reduce your eDiscovery expenses. This is where data culling comes in.

So, how do you achieve that goal, and who should be charged with weeding out unnecessary data? As it turns out, there are many benefits to performing data culling and processing in house.

What are the advantages of in-house data culling and processing?

Culling and processing data in house offers many benefits. Performing these tasks within your organization allows you to:

  • carefully define and limit your project’s scope;  
  • perform as many rounds of data culling as necessary to isolate relevant subsets from the rest of your data; and
  • monitor and control your budget to further reduce costs.

To fully capitalize on these advantages, consider implementing the following best practices to improve and expedite the way your in-house team handles data culling and processing.

6 best practices for efficient in-house data culling

Here are six ways to get the most out of your data culling processes and ultimately avoid overcollection of evidence.

  1. Use data clustering

Clustering is a data classification technique in which an algorithm groups similar kinds of data in a dataset. Clustering can give you an overall picture of the data you’re working with and reveal similarities among data. Clustering also allows you to review different datasets by topic and identify the key terms under each topic. From there, you can create a list of search terms that will lead you to irrelevant data.
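As a rough illustration, the grouping step can be sketched in a few lines of Python. This is a simplified greedy pass over term-frequency vectors, not the algorithm any particular eDiscovery product uses; the sample documents and the similarity threshold are invented for the example.

```python
# Simplified sketch of clustering documents by textual similarity.
# Real tools use more sophisticated methods (e.g., k-means over TF-IDF).
import math
import re
from collections import Counter

def term_vector(text):
    """Lowercased term-frequency vector for a document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Assign each document to the first cluster whose seed is similar enough."""
    clusters = []  # each cluster: {"seed": vector, "members": [doc_ids]}
    for doc_id, text in docs.items():
        vec = term_vector(text)
        for c in clusters:
            if cosine(vec, c["seed"]) >= threshold:
                c["members"].append(doc_id)
                break
        else:
            clusters.append({"seed": vec, "members": [doc_id]})
    return [c["members"] for c in clusters]

docs = {
    "d1": "Quarterly invoice for widget shipment and payment terms",
    "d2": "Invoice payment overdue for the widget shipment",
    "d3": "Company picnic scheduled for Friday afternoon",
}
print(cluster(docs))  # the two invoice documents group together
```

Inspecting the dominant terms within each resulting cluster is what yields the candidate search terms discussed next.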

  2. Generate lists of search terms

Once you understand what the most common terms are within each data set, you can create lists of search terms. You can use these search terms to differentiate relevant documents from irrelevant documents.
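The partitioning step itself can be as simple as string matching. A minimal sketch, with hypothetical documents and term lists (real platforms add stemming, proximity operators, and Boolean logic):

```python
# Sketch: partition documents into relevant and culled piles using
# inclusion and exclusion search-term lists. Terms below are made up.
def matches_any(text, terms):
    """True if any search term appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(t.lower() in lowered for t in terms)

def partition(docs, include_terms, exclude_terms):
    relevant, culled = [], []
    for doc_id, text in docs.items():
        # Exclusionary terms win: spam/newsletter markers cull the document.
        if matches_any(text, exclude_terms):
            culled.append(doc_id)
        elif matches_any(text, include_terms):
            relevant.append(doc_id)
        else:
            culled.append(doc_id)
    return relevant, culled

docs = {
    "d1": "Merger agreement draft attached for review",
    "d2": "Unsubscribe from our weekly newsletter here",
    "d3": "Lunch menu for the cafeteria this week",
}
relevant, culled = partition(docs, ["merger", "agreement"], ["unsubscribe", "newsletter"])
print(relevant, culled)
```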

  3. Isolate custodian data

Make a list of the key players in your case and assign custodian IDs to each of them. You can then isolate important custodians’ data and use that information to iteratively generate search term lists and topic clusters that you can use to further cull data. You may also be able to use this information when you perform searches relating to other custodians.
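In code, isolating custodian data is essentially a grouping operation. A small illustrative sketch (the custodian IDs and item structure are hypothetical):

```python
# Sketch: group collected items by custodian ID so key custodians'
# data can be isolated and mined for new search terms and topics.
from collections import defaultdict

def by_custodian(items):
    """Map each custodian ID to the list of that custodian's documents."""
    groups = defaultdict(list)
    for item in items:
        groups[item["custodian_id"]].append(item["doc_id"])
    return dict(groups)

items = [
    {"doc_id": "d1", "custodian_id": "C001"},
    {"doc_id": "d2", "custodian_id": "C002"},
    {"doc_id": "d3", "custodian_id": "C001"},
]
groups = by_custodian(items)
print(groups["C001"])  # a key custodian's data, isolated for iterative culling
```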

  4. Isolate email domains

An email domain is the part of an email address that follows the “@” symbol, such as “gmail.com.” When culling data, you can identify email domains within each data set and exclude senders and recipients associated with irrelevant email domains, such as those used for spam and newsletters. You can use this same process to identify email exchanges that contain privileged information, such as emails between members of your organization and outside legal counsel.
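A rough sketch of this triage in Python; all domain names below are invented for illustration, and real privilege screening is far more careful than a domain match:

```python
# Sketch: cull messages from irrelevant email domains and flag
# exchanges with outside counsel as potentially privileged.
def domain(address):
    """Return the email domain: the part after the final '@'."""
    return address.rsplit("@", 1)[-1].lower()

SPAM_DOMAINS = {"newsletter.example.com", "deals.example.net"}     # hypothetical
COUNSEL_DOMAINS = {"outsidefirm.example.com"}                      # hypothetical

def triage(messages):
    keep, culled, privileged = [], [], []
    for msg in messages:
        doms = {domain(a) for a in [msg["from"], *msg["to"]]}
        if doms & SPAM_DOMAINS:
            culled.append(msg["id"])
        else:
            keep.append(msg["id"])
            if doms & COUNSEL_DOMAINS:
                privileged.append(msg["id"])  # route to privilege review
    return keep, culled, privileged

messages = [
    {"id": "m1", "from": "cfo@corp.example.com", "to": ["jsmith@outsidefirm.example.com"]},
    {"id": "m2", "from": "promo@newsletter.example.com", "to": ["cfo@corp.example.com"]},
    {"id": "m3", "from": "cfo@corp.example.com", "to": ["coo@corp.example.com"]},
]
keep, culled, privileged = triage(messages)
print(keep, culled, privileged)
```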

  5. Perform quality control on search terms

When performing searches, running statistics on random document samples can help you better understand your search results and determine what steps to take to further cull data. For example, you may discover an additional exclusionary search term that will help you eliminate irrelevant documents from further searches.
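One simple statistic to run is the precision of a search term: the fraction of a random sample of its hits that a reviewer actually codes relevant. A hedged sketch, with stand-in review decisions in place of real reviewer coding:

```python
# Sketch: estimate a search term's precision from a random sample of
# its hits. A low rate suggests adding exclusionary terms before the
# next culling pass.
import random

def sample_precision(hit_ids, coded, sample_size, seed=42):
    """Fraction of a random sample of hits coded relevant by a reviewer."""
    rng = random.Random(seed)  # fixed seed so the QC sample is reproducible
    sample = rng.sample(hit_ids, min(sample_size, len(hit_ids)))
    relevant = sum(1 for doc_id in sample if coded.get(doc_id))
    return relevant / len(sample)

hit_ids = [f"d{i}" for i in range(100)]
# Stand-in for manual review calls; in practice these come from reviewers.
coded = {doc_id: int(doc_id[1:]) % 4 != 0 for doc_id in hit_ids}
print(sample_precision(hit_ids, coded, sample_size=20))
```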

  6. Use in-place search technology

In-place search technology can help you reduce data more efficiently and effectively. This technology allows you to quickly perform comprehensive searches across various data sources before you even collect that data. It can also compile search results into a content index containing valuable information about the data, such as where it is stored, why it is stored there, how long it has been there, and who has access and modification privileges. This index can give you insights into the data’s relevance to your project, informing what data you’re interested in before collection begins.

There’s another aspect to data culling as well: using analytical tools to eliminate extraneous files and “junk data.” We’ll turn there next.

How analytical tools can make data culling more efficient

In addition to the above best practices, consider employing the following analytical tools to further cull your data and avoid the unnecessary costs associated with overcollection.

  • Email threading. An email thread is a sequence of emails replying to an original email—and these threads can get quite long. Email threading is the act of grouping together related emails to make email review and analysis more efficient.  
  • Clustering. As described above, clustering is a form of unsupervised machine learning that groups similar items together and reveals the characteristics or topics those documents share. This lets users act on a whole group of documents at once, streamlining their workflow.
  • Near-duplication. Near-duplication tools identify documents and emails whose text is 50% or more similar, such as multiple versions of a Microsoft Word document with slight modifications or repetitive email threads.
  • Technology-assisted review (TAR). TAR handles the review phase of eDiscovery by deploying algorithms that classify documents based on input from expert reviewers. Because it is typically faster and more consistent than human-only review, TAR expedites the organization and prioritization of the document collection.
  • Sentiment analysis. Sentiment analysis tools automatically extract the positive or negative emotions or viewpoints expressed in text, helping reviewers prioritize data for review and identify documents relevant to a particular case faster.

The above analytical tools can make all the difference in your data culling process, especially when combined with the right eDiscovery software.

Why eDiscovery software is critical to strategic data culling

With the rising pressure to cut costs and the concurrent explosion of data volume, corporate law departments can feel like they’re caught between a rock and a hard place. That makes data culling more important than ever for avoiding overcollection and keeping eDiscovery costs down.

Fortunately, there’s dedicated eDiscovery software that can help in-house teams effectively and efficiently manage their own data culling processes.

ZyLAB ONE is a proven eDiscovery solution that includes all the features you need to identify, analyze, and process data. With ZyLAB ONE, you can connect to live, in-place data across multiple repositories and streamline your eDiscovery process by searching, reviewing, and analyzing data in place before collection even begins.

Live Early Data Assessment (Live EDA) is another invaluable tool for avoiding overcollection. Live EDA’s in-place search technology helps you identify and locate data across various data sources prior to collection. By searching data before collecting it, you can save the time and expense associated with unnecessary collection. If you discover new data sources, custodians, or search terms along the way, you can expand your search at any time without restarting the process.

Ready to learn more about our eDiscovery software to see which solutions are right for you? Get in touch with us today to set up a demonstration.