fastai’s Text DLs Exploration

fastai
technical
exploration
Exploring different methods to create the dls object in fastai
Author

Kevin Bird

Published

May 14, 2021

Introduction

This post is an exploration of how to convert data into the dls objects that fastai’s learner consumes. I was having trouble creating a dls that supported show_batch while also meeting my admittedly arbitrary custom needs, so I set out to figure out how to build one that did both.

This blog post uses the Human Numbers dataset, which counts sequentially from 1 to 9999 in English words rather than numerals. It is an interesting problem because there is quite a bit of repetition, yet new tokens and patterns are introduced regularly that a model will need to figure out.

My goal was to create a dls where, conceptually, X=1,2,3 and y=4: the input is a sequence of tokens and the target is the token that follows. Over the course of this blog post, I will show ~4 ways to create a dls where show_batch works as expected.

from fastai.text.all import *
import fastai, fastcore
fastai.__version__,fastcore.__version__
('2.3.1', '1.3.20')
path = untar_data(URLs.HUMAN_NUMBERS)
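
It helps to peek at the raw files first; each one holds one number per line, written out in English:

lines = (path/'train.txt').readlines() # fastcore patches readlines onto Path
lines[:3]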

First, I create a tokenizer, combine all of the text into a single string, and tokenize each word.

tokenizer = Tokenizer(WordTokenizer())

Testing the Tokenizer

tokenizer('one two three')
(#4) ['xxbos','one','two','three']
tokenizer('one, two')
(#4) ['xxbos','one',',','two']

Reading the train and validation files:

* train: [1-8000]
* valid: [8001-9999]

train_txt = ', '.join(o.strip() for o in (path/'train.txt').readlines())
valid_txt = ', '.join(o.strip() for o in (path/'valid.txt').readlines())

For this problem, I will create my own validation set. It will split in close to the same place, but by creating my own split, I don’t have to do anything special when creating chunks around the train->validation boundary.

all_text = train_txt+valid_txt # no separator here, so the last train token and the first valid token fuse into the stray 'nineeight' token seen in the counts below
all_text_tok = tokenizer(all_text)
all_text_tok
(#63094) ['xxbos','one',',','two',',','three',',','four',',','five'...]

Next, I take the tokenized text, count how many times each token occurs, and create a vocab from those counts.

count=Counter(all_text_tok)
vocab = make_vocab(count)
print(count)
Counter({',': 9996, 'hundred': 9000, 'thousand': 8999, 'one': 2900, 'two': 2900, 'three': 2900, 'four': 2900, 'five': 2900, 'six': 2900, 'seven': 2900, 'nine': 2899, 'eight': 2898, 'twenty': 1000, 'thirty': 1000, 'forty': 1000, 'fifty': 1000, 'sixty': 1000, 'seventy': 1000, 'eighty': 1000, 'ninety': 1000, 'ten': 100, 'eleven': 100, 'twelve': 100, 'thirteen': 100, 'fourteen': 100, 'fifteen': 100, 'sixteen': 100, 'seventeen': 100, 'eighteen': 100, 'nineteen': 100, 'xxbos': 1, 'nineeight': 1})
print(vocab)
['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', ',', 'hundred', 'thousand', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'nine', 'eight', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'xxfake']
all_text_tok_chunked = list(chunked(all_text_tok, 11)) # chunks of 11 tokens: 10 inputs + 1 target
# drop the last, non-full chunk
all_text_tok_chunked = all_text_tok_chunked[:-1]

Next, I create functions that will get_x and get_y from the chunked data: the first 10 tokens of each chunk are the input and the 11th token is the target.

def get_x(o):
    return o[:10] # the first 10 tokens are the input

def get_y(o):
    return [o[10]] if len(o) == 11 else ['.'] # the 11th token is the target; fallback for a short chunk
print(f"{get_x(all_text_tok_chunked[0])} -> {get_y(all_text_tok_chunked[0])}")
['xxbos', 'one', ',', 'two', ',', 'three', ',', 'four', ',', 'five'] -> [',']
print(f"{get_x(all_text_tok_chunked[-1])} -> {get_y(all_text_tok_chunked[-1])}")
['nine', 'thousand', 'nine', 'hundred', 'ninety', 'seven', ',', 'nine', 'thousand', 'nine'] -> ['hundred']

TitledStringDecoder is a transform that only decodes; it is what enables the show_batch and show_results functions to actually work properly. Without it, I had trouble getting those functions to work because TensorText doesn’t have a proper show function or a truncate function.

class TitledStringDecoder(Transform):
    "Decode-only transform: join a list of tokens into a displayable TitledStr"
    def decodes(self, o):
        return TitledStr(' '.join(o))

All TitledStringDecoder really does is take a list of tokens (e.g. ['one', 'two']) and convert it into a space-joined string of type TitledStr, which knows how to display itself in a nice way.

TitledStr(' '.join(['one', 'two']))
'one two'
tmp_ts = TitledStr(' '.join(all_text_tok[:10]))
tmp_ts
'xxbos one , two , three , four , five'
tmp_ts.truncate(3)
'xxbos one ,'

I create the splits based on the chunks, putting the first 80% of the chunks into the training set and the last 20% into the validation set.

cut = int(len(all_text_tok_chunked)*0.8) # index where the 80/20 split falls
splits = [L(range(cut)), L(range(cut, len(all_text_tok_chunked)))]
splits
[(#4588) [0,1,2,3,4,5,6,7,8,9...],
 (#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...]]

Now, let’s test that the transforms work properly.

Numericalize(vocab)(TitledStringDecoder()(get_x(all_text_tok_chunked[0])))
TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16])

And confirm that they work as a pipeline as well.

pipeline = Pipeline([TitledStringDecoder, Numericalize(vocab)])
get_x(pipeline(all_text_tok_chunked[0]))
TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16])
pipeline_x = Pipeline([get_x, TitledStringDecoder, Numericalize(vocab)])
pipeline_y = Pipeline([get_y, TitledStringDecoder, Numericalize(vocab)])
pipeline_y(all_text_tok_chunked[0])
TensorText([9])
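
Since TitledStringDecoder only implements decodes, these pipelines can also map a numericalized tensor back to readable text, which is exactly the path show_batch relies on. As a quick sketch (I expect this to give back the TitledStr 'xxbos one , two , three , four , five'; output not captured here):

# decode runs the pipeline in reverse: Numericalize.decodes maps ids back to tokens,
# then TitledStringDecoder.decodes joins them into a TitledStr
pipeline_x.decode(pipeline_x(all_text_tok_chunked[0]))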

Using Datasets + DataLoaders

dsets = Datasets(all_text_tok_chunked, tfms=[pipeline_x,pipeline_y], splits=splits)
dsets[0]
(TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16]), TensorText([9]))
dsets.show(dsets[0])
xxbos one , two , three , four , five
,

Next, we can create the dataloaders. This can be done with either DataLoaders.from_dsets(...) or dsets.dataloaders(...). Both methods are shown below.

dls = DataLoaders.from_dsets(dsets, shuffle=False, drop_last=True)
dls.show_batch()
   text                                                    text_
0  xxbos one , two , three , four , five                   ,
1  six , seven , eight , nine , ten ,                      eleven
2  , twelve , thirteen , fourteen , fifteen , sixteen      ,
3  seventeen , eighteen , nineteen , twenty , twenty one   ,
4  twenty two , twenty three , twenty four , twenty        five
5  , twenty six , twenty seven , twenty eight ,            twenty
6  nine , thirty , thirty one , thirty two ,               thirty
7  three , thirty four , thirty five , thirty six          ,
8  thirty seven , thirty eight , thirty nine , forty       ,
dls = dsets.dataloaders(bs=16, shuffle=False, drop_last=True)
dls.show_batch()
   text                                                    text_
0  xxbos one , two , three , four , five                   ,
1  six , seven , eight , nine , ten ,                      eleven
2  , twelve , thirteen , fourteen , fifteen , sixteen      ,
3  seventeen , eighteen , nineteen , twenty , twenty one   ,
4  twenty two , twenty three , twenty four , twenty        five
5  , twenty six , twenty seven , twenty eight ,            twenty
6  nine , thirty , thirty one , thirty two ,               thirty
7  three , thirty four , thirty five , thirty six          ,
8  thirty seven , thirty eight , thirty nine , forty       ,

Using Datasets -> train TfmdDL + valid TfmdDL -> DataLoaders

Another way to get dls is to create TfmdDLs and pass those into DataLoaders. If you use a plain DataLoader rather than TfmdDL, the resulting dls won’t have a working show_batch method.
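
A minimal sketch of that failure mode, under my assumption that show_batch comes from TfmdDL and a plain DataLoader has nothing to delegate to:

train_dl_plain = DataLoader(dsets.train, bs=16)
valid_dl_plain = DataLoader(dsets.valid, bs=16)
dls_plain = DataLoaders(train_dl_plain, valid_dl_plain)
try: dls_plain.show_batch() # expected to fail on a plain DataLoader
except Exception as e: print(type(e).__name__, e)

With TfmdDL, everything works: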

train_dl = TfmdDL(dsets.train, bs=16, drop_last=True)
valid_dl = TfmdDL(dsets.valid, bs=16, drop_last=True)
dls = DataLoaders(train_dl, valid_dl)
dls.show_batch()
   text                                                    text_
0  xxbos one , two , three , four , five                   ,
1  six , seven , eight , nine , ten ,                      eleven
2  , twelve , thirteen , fourteen , fifteen , sixteen      ,
3  seventeen , eighteen , nineteen , twenty , twenty one   ,
4  twenty two , twenty three , twenty four , twenty        five
5  , twenty six , twenty seven , twenty eight ,            twenty
6  nine , thirty , thirty one , thirty two ,               thirty
7  three , thirty four , thirty five , thirty six          ,
8  thirty seven , thirty eight , thirty nine , forty       ,
X,y = dls.one_batch()
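
As a quick sanity check on the batch, the shapes should reflect 16 samples of 10 input tokens and 1 target token each (my expectation from the construction above; output not captured):

X.shape, y.shape # expecting (torch.Size([16, 10]), torch.Size([16, 1]))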

Using DataBlock -> datasets -> dataloaders

Another way to get dataloaders is to use DataBlock. DataBlock wants to know what type of data will be passed, which is specified via blocks. It also wants a splitter and the get_x and get_y functions.

blocks = [TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)]), # x piece
          TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)])] # y piece
splits[-1]
(#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...]
IndexSplitter(splits[-1])(all_text_tok_chunked)
((#4588) [0,1,2,3,4,5,6,7,8,9...],
 (#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...])
dblock = DataBlock(blocks=blocks,
                   splitter=IndexSplitter(splits[-1]),
                   get_x=get_x,
                   get_y=get_y)

With the dblock created, you can create a dsets object, and from that dsets, you can create a dls similar to the one created above.

dsets_via_dblock = dblock.datasets(all_text_tok_chunked)
dsets_via_dblock
(#5735) [(TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16]), TensorText([9])),(TensorText([17,  9, 18,  9, 20,  9, 19,  9, 29,  9]), TensorText([30])),(TensorText([ 9, 31,  9, 32,  9, 33,  9, 34,  9, 35]), TensorText([9])),(TensorText([36,  9, 37,  9, 38,  9, 21,  9, 21, 12]), TensorText([9])),(TensorText([21, 13,  9, 21, 14,  9, 21, 15,  9, 21]), TensorText([16])),(TensorText([ 9, 21, 17,  9, 21, 18,  9, 21, 20,  9]), TensorText([21])),(TensorText([19,  9, 22,  9, 22, 12,  9, 22, 13,  9]), TensorText([22])),(TensorText([14,  9, 22, 15,  9, 22, 16,  9, 22, 17]), TensorText([9])),(TensorText([22, 18,  9, 22, 20,  9, 22, 19,  9, 23]), TensorText([9])),(TensorText([23, 12,  9, 23, 13,  9, 23, 14,  9, 23]), TensorText([15]))...]
dsets_via_dblock.show(dsets_via_dblock[0])
xxbos one , two , three , four , five
,
dls = dsets_via_dblock.dataloaders(bs=16, shuffle=False, drop_last=True)
dls.show_batch()
   text                                                    text_
0  xxbos one , two , three , four , five                   ,
1  six , seven , eight , nine , ten ,                      eleven
2  , twelve , thirteen , fourteen , fifteen , sixteen      ,
3  seventeen , eighteen , nineteen , twenty , twenty one   ,
4  twenty two , twenty three , twenty four , twenty        five
5  , twenty six , twenty seven , twenty eight ,            twenty
6  nine , thirty , thirty one , thirty two ,               thirty
7  three , thirty four , thirty five , thirty six          ,
8  thirty seven , thirty eight , thirty nine , forty       ,

Using DataBlock -> dataloaders

Another option is to go directly from dblock to dls with dblock.dataloaders. Behind the scenes this creates a dataset as well, but it can be a cleaner-looking way to handle things if you always go from dblock -> dls.

dls = dblock.dataloaders(all_text_tok_chunked, bs=16, shuffle=False, drop_last=True)
dls.show_batch()
   text                                                    text_
0  xxbos one , two , three , four , five                   ,
1  six , seven , eight , nine , ten ,                      eleven
2  , twelve , thirteen , fourteen , fifteen , sixteen      ,
3  seventeen , eighteen , nineteen , twenty , twenty one   ,
4  twenty two , twenty three , twenty four , twenty        five
5  , twenty six , twenty seven , twenty eight ,            twenty
6  nine , thirty , thirty one , thirty two ,               thirty
7  three , thirty four , thirty five , thirty six          ,
8  thirty seven , thirty eight , thirty nine , forty       ,

Conclusion

Creating dls is an extremely important capability when using fastai, because a dls is what a Learner expects in order to handle all of the data. There are many different ways to create a dls object, so this isn’t a comprehensive list, but it at least shows a few ways to do the task. In a future blog post, I will use this dls to explore transformer models. Hopefully this will help others get their DLs working.
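
To show where this is headed, here is a toy sketch of handing the dls to a Learner. SimpleNet is a hypothetical stand-in, not the model I will actually use:

# SimpleNet is hypothetical: average the 10 input token embeddings, then predict the next token
class SimpleNet(Module):
    def __init__(self, vocab_sz, emb_sz=64):
        self.emb = Embedding(vocab_sz, emb_sz)
        self.lin = nn.Linear(emb_sz, vocab_sz)
    def forward(self, x):
        return self.lin(self.emb(x).mean(dim=1))

learn = Learner(dls, SimpleNet(len(vocab)), loss_func=CrossEntropyLossFlat())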

I’d like to give a special thanks to Arto for helping me get things working properly, and to everybody in the fastai Discord channel for dealing with my questions and for creating a great community to learn with every step of the way.