Introduction

This post is an exploration of how to convert data into dls objects that can be used by fastai's Learner. I was having trouble creating a dls object that supported show_batch and still met my somewhat arbitrary custom needs, so I set out to figure out how to create dls that worked well for me.

This blog post uses the Human Numbers dataset, which counts sequentially from 1 to 9999 in English text rather than numerical form. This is an interesting problem because there is quite a bit of repetition, but new tokens and patterns are also introduced regularly that a model will need to figure out.

My goal was to create a dls with X = 1, 2, 3 and y = 4: the inputs are a sequence of tokens and the target is the token that follows them. Over the course of this blog post, I will show four ways to create a dls that lets show_batch work as expected.

from fastai.text.all import *
import fastai, fastcore
fastai.__version__,fastcore.__version__
('2.3.1', '1.3.20')
path = untar_data(URLs.HUMAN_NUMBERS)
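
Before tokenizing, it's worth a quick peek at the raw files. This is just a sanity-check sketch; each line of train.txt should be one number written out in English.

with open(path/'train.txt') as f:
    print(f.readlines()[:3])  # expected to look something like ['one \n', 'two \n', 'three \n']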

First, I create a tokenizer, combine all of the text into a single string, and tokenize each word.

tokenizer = Tokenizer(WordTokenizer())

Testing the Tokenizer

tokenizer('one two three')
(#4) ['xxbos','one','two','three']
tokenizer('one, two')
(#4) ['xxbos','one',',','two']

Reading the train and validation files:

  • train - [1-8000]
  • valid - [8001-9999]
train_txt = ', '.join(o.strip() for o in (path/'train.txt').readlines())
valid_txt = ', '.join(o.strip() for o in (path/'valid.txt').readlines())

For this problem, I will create my own validation set. It will split at close to the same point, but by creating my own split I don't have to do anything special when creating chunks around the train->validation boundary.

# note: joining without a separator fuses the last token of train with the first token of valid,
# which is why 'nineeight' shows up once in the counts below
all_text = train_txt+valid_txt
all_text_tok = tokenizer(all_text)
all_text_tok
(#63094) ['xxbos','one',',','two',',','three',',','four',',','five'...]

Next, I take the tokenized text, count how many times each token occurs, and create a vocab with that. Note that make_vocab drops tokens that appear fewer than min_freq (3 by default) times, prepends the special tokens (xxbos, xxpad, etc.), and pads the vocab with xxfake entries until its length is a multiple of 8, which is why the printed vocab below doesn't exactly mirror the raw counts.

count=Counter(all_text_tok)
vocab = make_vocab(count)
print(count)
Counter({',': 9996, 'hundred': 9000, 'thousand': 8999, 'one': 2900, 'two': 2900, 'three': 2900, 'four': 2900, 'five': 2900, 'six': 2900, 'seven': 2900, 'nine': 2899, 'eight': 2898, 'twenty': 1000, 'thirty': 1000, 'forty': 1000, 'fifty': 1000, 'sixty': 1000, 'seventy': 1000, 'eighty': 1000, 'ninety': 1000, 'ten': 100, 'eleven': 100, 'twelve': 100, 'thirteen': 100, 'fourteen': 100, 'fifteen': 100, 'sixteen': 100, 'seventeen': 100, 'eighteen': 100, 'nineteen': 100, 'xxbos': 1, 'nineeight': 1})
print(vocab)
['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', ',', 'hundred', 'thousand', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'nine', 'eight', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'xxfake']
# break the token stream into chunks of 11 tokens (10 for x, 1 for y) and drop the final partial chunk
all_text_tok_chunked = list(chunked(all_text_tok, 11))
all_text_tok_chunked = all_text_tok_chunked[:-1]

Next, I create functions that will get_x and get_y from the chunked data.

def get_x(o):
    return o[:10]

def get_y(o):
    return [o[10]] if len(o) == 11 else ['.']
print(f"{get_x(all_text_tok_chunked[0])} -> {get_y(all_text_tok_chunked[0])}")
['xxbos', 'one', ',', 'two', ',', 'three', ',', 'four', ',', 'five'] -> [',']
print(f"{get_x(all_text_tok_chunked[-1])} -> {get_y(all_text_tok_chunked[-1])}")
['nine', 'thousand', 'nine', 'hundred', 'ninety', 'seven', ',', 'nine', 'thousand', 'nine'] -> ['hundred']

TitledStringDecoder is a transform that only decodes; it is what enables the show_batch and show_results functions to actually work properly. Without it, I had trouble getting those functions to work because TensorText doesn't have a proper show function or a truncate function.

class TitledStringDecoder(Transform):
    def decodes(self, o):
        return TitledStr(' '.join(o))

All TitledStringDecoder really does is take a list of tokens (e.g. ['one', 'two']) and convert it into a space-concatenated string of type TitledStr, which knows how to display itself in a nice way.

TitledStr(' '.join(['one', 'two']))
'one two'
tmp_ts = TitledStr(' '.join(all_text_tok[:10]))
tmp_ts
'xxbos one , two , three , four , five'
tmp_ts.truncate(3)
'xxbos one ,'

I create the splits based off the chunks, putting the first 80% of the chunks into the training set and the last 20% into the validation set.

splits = [L(range(int(len(all_text_tok_chunked)*0.8))), L(range(int(len(all_text_tok_chunked)*0.8),len(all_text_tok_chunked)))]
splits
[(#4588) [0,1,2,3,4,5,6,7,8,9...],
 (#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...]]

Now, let's test that the transforms work properly.

Numericalize(vocab)(TitledStringDecoder()(get_x(all_text_tok_chunked[0])))
TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16])

And confirm that they work as a pipeline as well.

pipeline = Pipeline([TitledStringDecoder, Numericalize(vocab)])
get_x(pipeline(all_text_tok_chunked[0]))
TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16])
pipeline_x = Pipeline([get_x, TitledStringDecoder, Numericalize(vocab)])
pipeline_y = Pipeline([get_y, TitledStringDecoder, Numericalize(vocab)])
pipeline_y(all_text_tok_chunked[0])
TensorText([9])
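
Since the whole point of TitledStringDecoder is the decode path, a quick sanity check (a small sketch using the pipelines above) is to decode a numericalized chunk and confirm it round-trips back to a readable TitledStr.

# decode runs the pipeline in reverse: Numericalize -> token list -> TitledStringDecoder -> TitledStr
pipeline_x.decode(pipeline_x(all_text_tok_chunked[0]))  # should give back 'xxbos one , two , three , four , five'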

Using Datasets + Dataloaders

dsets = Datasets(all_text_tok_chunked, tfms=[pipeline_x,pipeline_y], splits=splits)
dsets[0]
(TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16]), TensorText([9]))
dsets.show(dsets[0])
xxbos one , two , three , four , five
,

Next, we can create the dataloaders. This can be done with either DataLoaders.from_dsets(...) or dsets.dataloaders(...). Both methods are shown below.

# note: from_dsets is usually given the train and valid sets separately, e.g. DataLoaders.from_dsets(dsets.train, dsets.valid, ...)
dls = DataLoaders.from_dsets(dsets, shuffle=False, drop_last=True)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,
dls = dsets.dataloaders(bs=16, shuffle=False, drop_last=True)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,

Using Datasets -> train TfmdDL + valid TfmdDL -> dataloaders

Another way to get dls is to create TfmdDLs and pass those into DataLoaders. If you use DataLoader rather than TfmdDL, dls won't have a show_batch method available.

train_dl = TfmdDL(dsets.train, bs=16, drop_last=True)
valid_dl = TfmdDL(dsets.valid, bs=16, drop_last=True)
dls = DataLoaders(train_dl, valid_dl)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,
X,y = dls.one_batch()
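
For contrast, here is roughly what the plain-DataLoader version would look like. This is a sketch of the approach I'm avoiding: batches still come out fine, but the resulting dls has no show_batch to call.

# sketch: same Datasets, but wrapped in the base DataLoader instead of TfmdDL
plain_train_dl = DataLoader(dsets.train, bs=16, drop_last=True)
plain_valid_dl = DataLoader(dsets.valid, bs=16, drop_last=True)
plain_dls = DataLoaders(plain_train_dl, plain_valid_dl)
xb, yb = plain_dls.one_batch()  # batching still works
# plain_dls.show_batch()        # this is what fails without TfmdDL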

Using DataBlock -> datasets -> dataloaders

Another way to get dataloaders is to use DataBlock. DataBlock wants to know what type of data will be passed, which is specified via blocks. It also wants a splitter and the functions to get_x and get_y.

blocks = [TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)]), # x piece
          TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)])] # y piece
splits[-1]
(#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...]
IndexSplitter(splits[-1])(all_text_tok_chunked)
((#4588) [0,1,2,3,4,5,6,7,8,9...],
 (#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...])
dblock = DataBlock(blocks=blocks,
                   splitter=IndexSplitter(splits[-1]),
                   get_x=get_x,
                   get_y=get_y)

With the dblock created, you can create a dsets, and from that dsets you can create a dls similar to the one created above.

dsets_via_dblock = dblock.datasets(all_text_tok_chunked)
dsets_via_dblock
(#5735) [(TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16]), TensorText([9])),(TensorText([17,  9, 18,  9, 20,  9, 19,  9, 29,  9]), TensorText([30])),(TensorText([ 9, 31,  9, 32,  9, 33,  9, 34,  9, 35]), TensorText([9])),(TensorText([36,  9, 37,  9, 38,  9, 21,  9, 21, 12]), TensorText([9])),(TensorText([21, 13,  9, 21, 14,  9, 21, 15,  9, 21]), TensorText([16])),(TensorText([ 9, 21, 17,  9, 21, 18,  9, 21, 20,  9]), TensorText([21])),(TensorText([19,  9, 22,  9, 22, 12,  9, 22, 13,  9]), TensorText([22])),(TensorText([14,  9, 22, 15,  9, 22, 16,  9, 22, 17]), TensorText([9])),(TensorText([22, 18,  9, 22, 20,  9, 22, 19,  9, 23]), TensorText([9])),(TensorText([23, 12,  9, 23, 13,  9, 23, 14,  9, 23]), TensorText([15]))...]
dsets_via_dblock.show(dsets_via_dblock[0])
xxbos one , two , three , four , five
,
dls = dsets_via_dblock.dataloaders(bs=16,shuffle=False, drop_last=True)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,

Using DataBlock -> dataloaders

Another option is to go directly from dblock to dls with dblock.dataloaders. Behind the scenes this creates a dataset as well, but it can be a cleaner-looking way to handle things if you always go from dblock -> dls.

dls = dblock.dataloaders(all_text_tok_chunked, bs=16, shuffle=False, drop_last=True)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,
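
One extra convenience of the DataBlock route: when the pipeline doesn't behave the way you expect, dblock.summary walks a single sample through every step and prints the intermediate results (I won't reproduce its output here).

# run one item through the DataBlock's transforms and print each intermediate result
dblock.summary(all_text_tok_chunked)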

Conclusion

Creating dls is an extremely important capability when using fastai because a dls is what a Learner expects in order to deal with all of the data. There are many different ways to create a dls object, so this isn't a comprehensive list, but it at least shows a few ways to do the task. In a future blog post, I will use this dls to explore transformer models. Hopefully this will help others get their dls working.

I'd like to give a special thanks to Arto for helping me get things working properly, and to everybody in the fastai Discord channel for dealing with my questions and for creating a great community to learn with every step of the way.