This series of blog articles is a diary of my thoughts and experiences as I attempt to create a data science ready repository of English written texts and use it to answer some interesting questions about language usage.

In this article I explain the origin of this endeavour and talk you through the data collection process. Along the way I introduce some useful Linux commands, including wget, df, cd, mkdir, ls, find, grep, and wc, as well as output redirects ( > ), pipes ( | ) and the Linux abbreviations for the home directory ( ~ ) and the current working directory ( . ).

Background Story

Every data science endeavour starts with a question. This is no different.

We are often taught rules as aides-memoire. One of the most famous of these is the spelling rule: “I before E except after C”. Of course, every rule has exceptions. When the exceptions are few, we simply learn them by rote. However, when the rule breakers outnumber the rule followers, the reliability of the rule is brought into question.

This problem was highlighted in an episode of the BBC TV series QI which presented the following statistics that debunked the “I before E…” rule. 

In the English language

  • there are 923 words with a CIE in them
  • 21 times as many words break the rule as follow it

Not knowing where the QI elves sourced their statistics from I wondered how I might validate them myself. I would need a list containing all the words in the English language, or a reasonable proxy for that. One such proxy is a word list used by a spell checker, such as GNU Aspell.

I was concerned that a spell checking word list might not provide complete enough coverage of the English language. To address this concern, I thought it might be worth merging it with a word list generated from a library of written texts to try and improve the word coverage. This is where Project Gutenberg comes in.

What is Project Gutenberg?

Before I go any further, I should probably explain what Project Gutenberg is. This is best done with reference to its Wikipedia entry.

Project Gutenberg is a volunteer effort to digitize and archive cultural works, to encourage the creation and distribution of eBooks. It was founded in 1971 by American writer Michael S. Hart and is the oldest digital library.

As of 23 June 2018, Project Gutenberg reached 57,000 items in its collection of free eBooks. Most releases are in the English Language, but many non-English works are also available.

Now that we’ve got the introductions out of the way, let’s get down to business. 

In the remainder of this article I will walk through the process that I followed to create a local repository of all the English texts on Project Gutenberg in simple machine-readable ASCII form. I will also show you how to use the file repository to generate an index file, and how to check the efficacy of that index by using it to validate the repository files.

Downloading From Project Gutenberg

Project Gutenberg is intended for human consumption only. Their robot readme page states that they do not allow script-based access to their website:

Any perceived use of automated tools to access the Project Gutenberg website will result in a temporary or permanent block of your IP address.

Drat! I was so looking forward to dusting the cobwebs off my spidering skills and re-acquainting myself with the dark art of parsing HTML using XPaths. Thankfully, anyone wishing to create their own copy of the Project Gutenberg repository can do so using wget instead.

wget is a computer program that retrieves content from web servers. Its name derives from world wide web and get. It supports downloading via HTTP, HTTPS and FTP.

This is because Project Gutenberg allows the use of wget as an exception as long as you follow their recommended usage, as shown below:

wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=FILETYPE"

Where FILETYPE defines the file format from the following options:

  • html
  • txt
  • epub.images
  • epub.noimages
  • kindle.images
  • kindle.noimages
  • mp3

As we are collecting source materials for text analysis we are only interested in text files so we will use filetypes[]=txt.

The recommended wget arguments have the following meanings:

  • -w 2 (--wait): Wait 2 seconds between retrievals. Use of this option lightens the load on Project Gutenberg’s servers by making less frequent requests.
  • -m (--mirror): Turn on recursion and time-stamping, set infinite recursion depth and keep FTP directory listings. Recursion is the act of following all the links found in the files retrieved. Without recursion wget will only retrieve the document found at the specified URL, which is simply an index file containing links to the files of interest.
  • -H (--span-hosts): Enable spanning across hosts when doing recursive retrieving.

wget has many more arguments than this which we won’t go into here. If you’re interested to know more about wget and its command line options you can read more about that here.

Project Gutenberg stores books written in many languages. We would like to extract just one language for now, English. As it happens, we can add another key/value pair to the URL query to focus our file retrieval on a single language.

wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=FILETYPE&langs[]=LANGUAGE"

Where LANGUAGE is a two-letter ISO-639-1 language code.

As we are only interested in English texts we will use langs[]=en. For a full list of language codes you could experiment with, click here.

Let's Get Downloading!

Let’s now retrieve some files. I’m using Raspbian, a modified version of Debian, but the commands that follow should work on any Linux machine.

Step 1: Check disk space

First of all, make sure that you have sufficient available disk space for the file download. At the time of writing this article, Project Gutenberg has approximately 11GB of English text files in its repository. If disk space is at a premium for you there is an option to follow this blog series using only a portion of the repository (I’ll talk about that later).

You can check how much disk space you have available with the df command (disk free). In the terminal window, type the following:

df -h --output='avail' .
The dot at the end is a linux abbreviation meaning “current directory“. Without it, df will report disk usage for all mounted filesystems. Naming our current directory at the end of this command tells df to only report disk usage for the filesystem we are currently using. The -h option reports the result in human friendly form (e.g. “188G” as opposed to “196569920”) and --output='avail' only reports the available disk space value, which is what we are interested in.

You should see output similar to the following (your numbers will be different):

Avail
 188G
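If you would rather do this check programmatically, Python's standard library offers shutil.disk_usage, which reports the same figures as df. Here is a minimal sketch; the 11GB threshold is just an assumption based on the repository size quoted above.

import shutil

# Query the filesystem that holds the current working directory (".")
usage = shutil.disk_usage(".")
available_gb = usage.free / (1024 ** 3)
print(f"Available: {available_gb:.0f}G")

# Assumed threshold: roughly 11GB is needed for the full English text download
if available_gb < 11:
    print("Warning: you may not have enough space for a full download.")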

Step 2: Create a download directory

Create a directory from inside your user’s home directory to contain the repository of data we will be collecting and move into it. In a terminal window type the following:

cd ~
mkdir -p data/text/gutenberg
cd data/text/gutenberg

The cd command (change directory) moves you into the named directory and the mkdir command (make directory) creates the named directory. The ~ character is a special character meaning “my home directory” and the -p argument tells the computer to create parent directories where necessary. In this case, it has to create the ~/data and ~/data/text directories before it can create the ~/data/text/gutenberg directory.
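If you prefer to script this step, the equivalent in Python is a short sketch using pathlib and os from the standard library:

import os
from pathlib import Path

# Equivalent of: mkdir -p ~/data/text/gutenberg
download_dir = Path.home() / "data" / "text" / "gutenberg"
download_dir.mkdir(parents=True, exist_ok=True)  # like -p, creates missing parents

# Equivalent of: cd ~/data/text/gutenberg
os.chdir(download_dir)
print(f"Working directory is now {Path.cwd()}")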

Step 3a: Initiate the download and wait (full repository)

We want text format versions of all English books so type the following wget command into a terminal window:

wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"

This process can take a while. The 2 second pause that is enforced between each file download limits you to a maximum download rate of around 1,800 files per hour (3,600 seconds ÷ 2 seconds per file), regardless of the speed of your internet connection.

For reference, on my 16Mb broadband connection I achieved a download rate of approximately 1000 files per hour. Last time I ran this command it took over 3 days to complete and downloaded 80,000* files taking up 11GB of disk space.

Once the download is complete the text materials will be stored locally on your hard drive, ready for you to perform some cool data science on them.

* the eagle-eyed amongst you might have noticed an anomaly here given the stated size of the Project Gutenberg repository. I suspect duplicate content. We’ll address that concern when we come to validate and cleanse our data later in the series.

Step 3b: Initiate the download and wait (sample only)

If you can’t wait a few days for a full download to complete or you don’t have sufficient space to store the full Project Gutenberg file repository on your hard drive you can opt to follow this tutorial series using just a sample of the full repository. However, be aware that your results will differ from mine.

To download a sample, simply limit the download to a size quota with the -Q (quota) option. wget will stop as soon as the quota is reached. As a guide, a 1GB quota should give you around 8000 files and take around 8 hours to complete. A wget command with a 1GB quota would look like this:

wget -Q1024m -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"

-Q1024m means “stop when we have downloaded 1024 MB of data” (1GB).

The more data we have at our disposal the more accurate results we can get, so choose the largest number of MB that your disk space and patience will support.

Step 4: Sanity check the results

You should see statistics similar to the following reported in the terminal window once the wget command has completed. The numbers of interest are the total number of files downloaded and the total download size.

FINISHED --2020-02-29 14:16:43--
Total wall clock time: 3d 12h 24m 39s
Downloaded 79558 files, 11G in 1d 7h 53m 27s (105 KB/s)

Your numbers may be different. However, if you’ve downloaded the full repository the number of files and disk space usage reported should be the same or higher (the file repository is only likely to have grown in size since this article was written).

If you’ve opted to work with a sample only, the disk usage reported should match your quota value and you should have downloaded approximately 8 files per MB of quota set.

Generating A File Index

Let’s generate a file index. We will use this index to process the full data repository in future data analysis tasks.

Type the following command into the terminal window to list what is in the directory:

ls

You should get the following response, showing the directories created by the download process:

aleph.gutenberg.org  www.gutenberg.org

The www directory contains a set of XML index files which could be traversed by a spidering program to locate all files in the repository on our hard drive. We won’t be needing these. The aleph directory is where all the text files can be found.

We can use a command called find to generate a one-file-per-line file index by traversing the aleph directory structure and recording the full path of every file that we find. This will be much easier to use than the XML index files that were downloaded along with the repository files.

Type the following in the terminal window and hit enter:

find aleph.gutenberg.org -type f > index.txt

This writes the path of everything it finds within the file store directory to a file called index.txt. The -type f option limits this to items that are deemed to be files (there are also directories in there that we wish to ignore).

The > character is known as an output redirect. This redirects the output of the command to the file that follows it (in this case, index.txt). Without this, the command will simply print its result to standard output (the screen).
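For the curious, the same index can be generated from Python. Here is a short sketch using os.walk, which performs the directory traversal that find does for us on the command line:

import os

# Equivalent of: find aleph.gutenberg.org -type f > index.txt
with open("index.txt", "w") as index:
    for root, dirs, files in os.walk("aleph.gutenberg.org"):
        for name in files:  # files only, the equivalent of -type f
            index.write(os.path.join(root, name) + "\n")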

Counting Files in the File Store

We can count how many text files are in the aleph file store by using the wc command (word count). To do this let’s re-run the find command we previously used, this time passing its output into the input of the wc command. We do this by using a | character between the find and the wc commands. We call this piping our output, and the bar is a pipe.

In the terminal, type the following and hit enter:

find aleph.gutenberg.org -type f | wc -l

Ordinarily, wc will print character, word and line counts. The -l argument tells wc to just print the line count. In the above context this is the number of files found.

Repeat this command for the www folder to count the number of index files downloaded:

find www.gutenberg.org -type f | wc -l

As an additional sanity check, the sum of the two numbers reported above should match the file count reported in the wget download log output. You can never have too many checks and balances in data science – “a stitch in time saves nine“.
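If you want to automate this cross-check, a small sketch like the following does the same counting in Python (compare the total it prints against the file count from your own wget log; the 79,558 in my log above is specific to my download):

import os

def count_files(top: str) -> int:
    """Count regular files below a directory, like: find <top> -type f | wc -l"""
    return sum(len(files) for _, _, files in os.walk(top))

aleph_count = count_files("aleph.gutenberg.org")
www_count = count_files("www.gutenberg.org")

# The sum should match the file count reported in the wget download log
print(f"aleph: {aleph_count}, www: {www_count}, total: {aleph_count + www_count}")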

The Contents of the File Store

The aleph file store uses a file naming and directory structure based on the file ID assigned to each file on the Gutenberg Project site. Thus a file called  1002.zip has a Gutenberg file ID of 1002 and it will be stored as aleph.gutenberg.org/1/0/0/1002/1002.zip.
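To make that naming convention concrete, here is a small hypothetical helper that maps a Gutenberg file ID to the path you would expect to find it at in the local file store. It assumes the multi-digit ID scheme described above; very short IDs may be laid out differently.

def gutenberg_zip_path(file_id: int) -> str:
    """Expected location of a Gutenberg text zip within the aleph file store."""
    digits = str(file_id)
    # One directory level per digit of the ID except the last,
    # then a directory named after the full ID.
    levels = "/".join(digits[:-1])
    return f"aleph.gutenberg.org/{levels}/{digits}/{digits}.zip"

print(gutenberg_zip_path(1002))  # aleph.gutenberg.org/1/0/0/1002/1002.zip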

The downloaded files all have .zip extensions. These are text files that have been compressed to save space. It is tempting to consider unzipping the files for convenient access later on. However it’s easy to process zip files directly within our data analysis code so it’s probably best to leave them in their zipped form to save on storage space. If dealing with zip files at scale adds an unnecessary amount of time to our processing we may decide to revisit this decision later.
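To show how little friction the zipped form adds, here is a minimal sketch using Python's built-in zipfile module to read a book's text straight out of its archive. The encoding handling is an assumption; Project Gutenberg texts use a mix of encodings, so a tolerant decode is used here.

import zipfile

def read_gutenberg_text(zip_path: str) -> str:
    """Read a book's text directly from its zip archive, without unzipping to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        # Each archive typically holds a single .txt file
        txt_name = archive.namelist()[0]
        raw = archive.read(txt_name)
    # Tolerant decode: real encodings vary (ASCII, Latin-1, UTF-8)
    return raw.decode("utf-8", errors="replace")

text = read_gutenberg_text("aleph.gutenberg.org/1/0/0/1002/1002.zip")
print(text[:200])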

Testing The Repository & File Index

Before wrapping up this blog article we should probably confirm that our file index works as intended and the file repository is valid and complete. For convenience, I’ve written a test script that you can use for this purpose.

This demonstrates our ability to use the file index to process all files in the file store by testing each of the files listed in the index for existence, file type, and validity as a zip file. It counts the number of files tested (which should match the number of files in our file index) and reports statistics for each failure type, listing all files that fail each test.
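The full script is linked below, but to give a flavour of the checks it performs, a stripped-down sketch might look something like this (the structure and counters here are illustrative only, not the actual script):

import os
import zipfile

counts = {"GOOD": 0, "BAD": 0, "MISSING": 0, "NOT ZIP": 0}

with open("index.txt") as index:
    for line in index:
        path = line.strip()
        if not os.path.isfile(path):
            counts["MISSING"] += 1              # listed in the index but not on disk
        elif not zipfile.is_zipfile(path):
            counts["NOT ZIP"] += 1              # exists, but is not a zip archive
        else:
            with zipfile.ZipFile(path) as archive:
                if archive.testzip() is None:   # CRC-check every member
                    counts["GOOD"] += 1
                else:
                    counts["BAD"] += 1          # archive contains a corrupt member

print(f"FILE COUNT: {sum(counts.values())}")
for label, count in counts.items():
    print(f"{label}: {count}")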

I have hosted this test script on Github. It can be downloaded to the current directory via wget by typing the following into the terminal window and hitting enter:

 wget https://raw.githubusercontent.com/LaurenceMolloy/CreativeSmartThings/master/blog/287_test_gutenberg_file_index.py

Once downloaded, run it by typing the following into the terminal and hitting enter:

 python3 287_test_gutenberg_file_index.py

This reports the following output:

FILE COUNT: 78768
GOOD: 78767
BAD: 0
MISSING: 0
NOT ZIP: 1
[NOT ZIP FILES]
	aleph.gutenberg.org/robots.txt

Hmm… so there is a robots.txt file at the top level of the repository’s file store that I hadn’t noticed. We have that file listed in our file index and it’s been identified by the test script as not being a zip file. Test scripts are great at spotting things us mere mortals can often overlook. This could have really mucked us around when we come to process these files. For one thing, it’s unlikely to contain English written text.

Let’s regenerate the file index, removing this file from the output by piping it through a grep command before redirecting the output to the index file, and then re-run the test script to prove that we’ve fixed the issue:

find aleph.gutenberg.org -type f | grep ".zip" > index.txt 
python3 287_test_gutenberg_file_index.py

grep is a wonderful little command line utility that searches plain text (often files, but in this case our piped find command output) for lines that match a string or regular expression. Here it filters the output of find so that only paths containing ".zip" are written to the file index. Strictly speaking, the pattern ".zip" matches anywhere in the path (and the dot matches any character), so a more precise pattern such as '\.zip$' would keep only paths that actually end in a .zip extension, but for this repository the simpler pattern does the job.
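The same filtering is easy to express in Python if you are generating the index programmatically. Building on the os.walk sketch from earlier, an exact extension check avoids the substring-matching caveat altogether:

import os

# Equivalent of: find aleph.gutenberg.org -type f | grep ".zip" > index.txt
with open("index.txt", "w") as index:
    for root, dirs, files in os.walk("aleph.gutenberg.org"):
        for name in files:
            if name.endswith(".zip"):  # skips stray files such as robots.txt
                index.write(os.path.join(root, name) + "\n")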

The output of the test script now looks like this – yay!

FILE COUNT: 78767
GOOD: 78767
BAD: 0
MISSING: 0
NOT ZIP: 0
*** ALL GOOD ***

That's All For Now!

We now have a fully indexed local repository of text files. Let’s pat ourselves on the back for a job well done and call it a day for now.

In the next article in the series we will examine the contents of our file store more closely to identify and remove any duplicate files (de-duping).

