Having followed the instructions in the previous article in this series, we have downloaded Project Gutenberg’s entire collection of English texts and now have a local repository of approximately 80,000 text files. But something doesn’t smell right here. How can we have this many files when Project Gutenberg only claims to contain approximately 60,000 books in total? Duplicates… that’s how.

In this blog article we will determine how many unique text files we really have, learn about why these duplicates exist, and purge them from our file index. This is known as de-duping.

In doing so, we will also learn some new ways of using the find command, get acquainted with the sort and uniq commands, and construct a basic regular expression.

Why De-Dupe?

You might be forgiven for thinking that Big Data is all about volume. The more data we have, the better. Right? Well, Big Data hasn’t changed anything about the basic fundamentals of data science. Quality still trumps quantity and it always will.

But what about duplicate data? It might be good quality data; we just have multiple copies of it. The trouble is that duplication in your data can contribute to the following problems:

Skew / Bias

Imagine you are counting votes for a national referendum and you mistakenly receive vote counts for a particular region a number of times. If this mistake isn’t spotted and the duplicate votes removed from the national count, the results of the vote will be skewed towards the percentages inherent in your duplicated data.

Inefficiency

The larger our data set is, the longer it will take to perform analysis on it or train ML models with it. By reducing the size of our data set early on in the data validation process through identification and removal of duplicate data (aka de-duping) we can save ourselves significant amounts of unnecessary processing time further down the line.

So let’s get on and check our file store for duplicates.

Leaf Directories In The File Store

We’ll start by traversing the aleph.gutenberg.org file store directory structure and counting the number of leaf directories that we find. These are directories that contain no sub-directories. Move to the parent directory of the file store and type the following into the terminal:

find aleph.gutenberg.org -type d -links 2 | wc -l

We have previously used find -type f to recursively search a directory tree for files; find -type d does the same for directories. Every directory’s link count is made up of its own entry in its parent directory, its link to itself ( . ), and the parent link ( .. ) of each sub-directory that it contains. The -links 2 argument limits the find to directories that have exactly 2 links, which is only true of leaf directories, since they have no sub-directories contributing extra links. By piping ( | ) the output of find through wc -l we produce a count of the number of leaf directories found.
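
If you’d like to verify a directory’s link count for yourself, ls -ld shows it in the second column and stat can print it directly. The path below is just an example; substitute any directory from your own file store:

ls -ld aleph.gutenberg.org/8/5/9/8594
stat -c '%h %n' aleph.gutenberg.org/8/5/9/8594

A leaf directory should report a link count of 2, while a directory containing sub-directories will report a higher number.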

Leaf Directories With Multiple Files

The output of the previous command tells me that I have 48,718 leaf directories (your number may be different). Hmm… I have a lot more files than that. This requires a little more investigation. Let’s identify which leaf directories contain multiple files.

Type the following into the terminal:

find aleph.gutenberg.org -type f -printf '%h\n' | sort | uniq -c -d

We’re using a new find argument here: -printf. This produces formatted output, with the format specified in quotes. The %h directive is shorthand for the full path of the directory that the file resides in. The \n is a newline character, ensuring that each directory is printed on a separate line.
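
If you’re curious what the raw, unsorted output of -printf looks like before it is fed to the rest of the pipeline, you can sample the first few lines with head (the exact paths you see will depend on your copy of the repository):

find aleph.gutenberg.org -type f -printf '%h\n' | head -5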

There are two new Linux commands here that need an introduction: sort and uniq. With no additional options specified, sort orders the lines of your data alphabetically. uniq removes any adjacent lines that are duplicates. Because uniq only filters out adjacent matching lines, it’s important to provide it with data that has been sorted first. As for the options that we have specified for uniq, -d displays duplicate lines only (one for each group) and -c prefixes each displayed line with a count of how many times it occurred.
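
Here is a tiny, self-contained illustration of why the sort step matters (the letters are just made-up data):

printf 'b\na\nb\n' | uniq -c
printf 'b\na\nb\n' | sort | uniq -c

The first command reports each ‘b’ separately because the duplicates are not adjacent; the second groups them first, so uniq reports a count of 2 for ‘b’.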

You should see a list of file paths like the sample shown below. This gives us the file counts for all leaf directories that contain more than 1 file. 

      2 aleph.gutenberg.org/8/5/9/8594
      2 aleph.gutenberg.org/8/6/6/8668
      2 aleph.gutenberg.org/8/6/7/8674

Duplicate Files And Character Sets

The .zip files in our file store are just compressed archives that contain plain text (.txt) files. These files are delivered in zip format to save on bandwidth and storage space. Project Gutenberg has a file formats page that briefly explains what .txt files can contain.
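
If you’d like to peek inside one of these archives without extracting it, and assuming the unzip utility is installed, unzip -l lists an archive’s contents. The path here is just an example taken from the directory listing above:

unzip -l aleph.gutenberg.org/8/5/9/8594/8594.zip

You should see the corresponding plain text (.txt) file listed inside.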

TXT is a generic extension used for any plain text file, regardless of the character set. Thus, while most of our .TXT files contain ASCII, some contain ISO-8859 or Big-5* or Unicode.

Let’s look at the contents of one of the directories in our file store that contains multiple files. Pick any directory displayed in the directory listing that you’ve just generated and type the following, replacing my example directory with the one you have chosen:

ls aleph.gutenberg.org/8/5/9/8594

You should see a file listing containing at least 2 files, similar to the following:

8594-0.zip  8594-8.zip  8594.zip

These are variants of the same text file using different character sets. Where these files exist, the file ending in -0.zip contains UTF-8 text, the file ending in -8.zip contains ISO-8859 text and the remaining file contains ASCII text. Full details of Project Gutenberg’s file naming schema can be found here.
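
If you want to confirm the character sets for yourself, one way (assuming the unzip and file utilities are installed) is to decompress each archive to standard output and let file inspect it:

unzip -p aleph.gutenberg.org/8/5/9/8594/8594.zip | file -
unzip -p aleph.gutenberg.org/8/5/9/8594/8594-0.zip | file -
unzip -p aleph.gutenberg.org/8/5/9/8594/8594-8.zip | file -

file should report something along the lines of ASCII text, UTF-8 Unicode text and ISO-8859 text respectively.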

While Project Gutenberg also mentions Big-5 as a possible character set for plain text files, we don’t expect to find any such files in our data set, as this encoding is used for Chinese text only. However, what we are about to do next will ensure that any Big-5 files that do happen to exist within our repository are removed.

De-Duping The File Index

We’ll focus our efforts on ASCII text only by re-building our file index so that it only contains ASCII-encoded plain text files. Type the following into the terminal window to overwrite our index.txt file with a new de-duped index.

find aleph.gutenberg.org -type f | grep -E "/[0-9]+\.zip" > index.txt

grep prints lines of data that match a specified pattern. In the above example, the pattern matches any filename that is made up of only digits followed by a .zip extension, i.e. it excludes all *-0.zip (UTF-8), *-5.zip (Big-5) and *-8.zip (ISO-8859) files. The -E option tells grep to interpret the pattern as an extended regular expression, which we need here for the + quantifier. Regular expressions are a language for describing patterns in text. The following aspects of our regular expression are worth explaining:

[0-9]+ : one or more digits from 0 through 9
    + means “one or more”
    [0-9] means “any digit from 0 through 9”

\. : a literal period ( . )
    The backslash is an escape character, without which the period has the special meaning of “any single character”.

The remainder of the regular expression is literal text that provides both a left context and a right context for our pattern. Without both of these, the pattern will still admit -0.zip and -8.zip files.

For instance, the pattern “[0-9]+\.zip” (missing the left context) matches all three of the following paths, with the matched portion shown in brackets after each:

aleph.gutenberg.org/8/8/8/8886/8886-0.zip   [matches “0.zip”]
aleph.gutenberg.org/8/8/8/8886/8886.zip     [matches “8886.zip”]
aleph.gutenberg.org/8/8/8/8886/8886-8.zip   [matches “8.zip”]

and the pattern “/[0-9]+” (missing the right context) also matches all three, with a matched portion shown in brackets after each:

aleph.gutenberg.org/8/8/8/8886/8886-0.zip   [matches “/8886”]
aleph.gutenberg.org/8/8/8/8886/8886.zip     [matches “/8886”]
aleph.gutenberg.org/8/8/8/8886/8886-8.zip   [matches “/8886”]
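
A quick way to convince yourself that the full pattern behaves as intended is to feed it a few sample paths by hand (a throwaway one-liner, not part of the index build):

printf '%s\n' aleph.gutenberg.org/8/8/8/8886/8886.zip \
    aleph.gutenberg.org/8/8/8/8886/8886-0.zip \
    aleph.gutenberg.org/8/8/8/8886/8886-8.zip | grep -E "/[0-9]+\.zip"

Only the plain 8886.zip path should be printed; the -0 and -8 variants are filtered out.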

A quick wc -l index.txt to count the number of files in our de-duped file index shows that we have more than halved the length of our file index to 37,512 files. This will save us an awful lot of processing time in what we are about to do next (data validation and cleansing).

It’s worth noting that our new file index length (37,512) is lower than the number of leaf directories in the file store (48,718). This tells us that not all Project Gutenberg English e-books are available in ASCII plain text.

That's All For Now!

We now have a fully de-duped file index that references a local repository of over 37,000 ASCII-encoded plain text files. Let’s pat ourselves on the back for a job well done and call it a day for now.

We’re now ready to perform some serious text analysis. In the next article in the series we will write some code to help us to identify and remove any foreign language and atypical texts (data validation).

