Introduction

This post is an exploration of how to convert data into dls objects that can be used by fastai's Learner. I was having trouble creating a dls object that supported show_batch and still met my somewhat arbitrary custom needs, so I set out to figure out how to create dls that worked well for me.

This blog post uses the Human Numbers dataset, which counts sequentially from 1 to 9999 in English text rather than numerical form. This is an interesting problem because there is quite a bit of repetition, but new tokens and patterns are also introduced regularly that a model will need to figure out.

My goal was to create a dls with X = 1, 2, 3 and y = 4: the inputs are a sequence of tokens and the target is the token that follows them. Over the course of this blog post, I will show four ways to create a dls that lets show_batch work as expected.

from fastai.text.all import *
import fastai, fastcore
fastai.__version__,fastcore.__version__
('2.3.1', '1.3.20')
path = untar_data(URLs.HUMAN_NUMBERS)
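
Before tokenizing, it's worth a quick peek at the raw files. This is just a sanity-check sketch; each line of train.txt should be one number written out in English.

with open(path/'train.txt') as f:
    print(f.readlines()[:3])  # expected to look something like ['one \n', 'two \n', 'three \n']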

First, I create a tokenizer, combine all of the text into a single string, and tokenize each word.

tokenizer = Tokenizer(WordTokenizer())

Testing the Tokenizer

tokenizer('one two three')
(#4) ['xxbos','one','two','three']
tokenizer('one, two')
(#4) ['xxbos','one',',','two']

Reading the train and validation files:

  • train - [1-8000]
  • valid - [8001-9999]
train_txt = ', '.join(o.strip() for o in (path/'train.txt').readlines())
valid_txt = ', '.join(o.strip() for o in (path/'valid.txt').readlines())

For this problem, I will create my own validation set. It will split at close to the same point, but by creating my own split I don't have to do anything special when creating chunks around the train->validation boundary.

# note: joining without a separator fuses the last token of train with the first token of valid,
# which is why 'nineeight' shows up once in the counts below
all_text = train_txt+valid_txt
all_text_tok = tokenizer(all_text)
all_text_tok
(#63094) ['xxbos','one',',','two',',','three',',','four',',','five'...]

Next, I take the tokenized text, count how many times each token occurs, and create a vocab with that. Note that make_vocab drops tokens that appear fewer than min_freq (3 by default) times, prepends the special tokens (xxbos, xxpad, etc.), and pads the vocab with xxfake entries until its length is a multiple of 8, which is why the printed vocab below doesn't exactly mirror the raw counts.

count=Counter(all_text_tok)
vocab = make_vocab(count)
print(count)
Counter({',': 9996, 'hundred': 9000, 'thousand': 8999, 'one': 2900, 'two': 2900, 'three': 2900, 'four': 2900, 'five': 2900, 'six': 2900, 'seven': 2900, 'nine': 2899, 'eight': 2898, 'twenty': 1000, 'thirty': 1000, 'forty': 1000, 'fifty': 1000, 'sixty': 1000, 'seventy': 1000, 'eighty': 1000, 'ninety': 1000, 'ten': 100, 'eleven': 100, 'twelve': 100, 'thirteen': 100, 'fourteen': 100, 'fifteen': 100, 'sixteen': 100, 'seventeen': 100, 'eighteen': 100, 'nineteen': 100, 'xxbos': 1, 'nineeight': 1})
print(vocab)
['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', ',', 'hundred', 'thousand', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'nine', 'eight', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'xxfake']
# break the token stream into chunks of 11 tokens (10 for x, 1 for y) and drop the final partial chunk
all_text_tok_chunked = list(chunked(all_text_tok, 11))
all_text_tok_chunked = all_text_tok_chunked[:-1]

Next, I create functions that will get_x and get_y from the chunked data.

def get_x(o):
    return o[:10]

def get_y(o):
    return [o[10]] if len(o) == 11 else ['.']
print(f"{get_x(all_text_tok_chunked[0])} -> {get_y(all_text_tok_chunked[0])}")
['xxbos', 'one', ',', 'two', ',', 'three', ',', 'four', ',', 'five'] -> [',']
print(f"{get_x(all_text_tok_chunked[-1])} -> {get_y(all_text_tok_chunked[-1])}")
['nine', 'thousand', 'nine', 'hundred', 'ninety', 'seven', ',', 'nine', 'thousand', 'nine'] -> ['hundred']

TitledStringDecoder is a transform that only decodes; it is what enables the show_batch and show_results functions to actually work properly. Without it, I had trouble getting those functions to work because TensorText doesn't have a proper show function or a truncate function.

class TitledStringDecoder(Transform):
    def decodes(self, o):
        return TitledStr(' '.join(o))

All TitledStringDecoder really does is take a list of tokens (e.g. ['one', 'two']) and convert it into a space-concatenated string of type TitledStr, which knows how to display itself in a nice way.

TitledStr(' '.join(['one', 'two']))
'one two'
tmp_ts = TitledStr(' '.join(all_text_tok[:10]))
tmp_ts
'xxbos one , two , three , four , five'
tmp_ts.truncate(3)
'xxbos one ,'

I create the splits based off the chunks, putting the first 80% of the chunks into the training set and the last 20% into the validation set.

splits = [L(range(int(len(all_text_tok_chunked)*0.8))), L(range(int(len(all_text_tok_chunked)*0.8),len(all_text_tok_chunked)))]
splits
[(#4588) [0,1,2,3,4,5,6,7,8,9...],
 (#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...]]

Now, let's test that the transforms work properly.

Numericalize(vocab)(TitledStringDecoder()(get_x(all_text_tok_chunked[0])))
TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16])

And confirm that they work as a pipeline as well.

pipeline = Pipeline([TitledStringDecoder, Numericalize(vocab)])
get_x(pipeline(all_text_tok_chunked[0]))
TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16])
pipeline_x = Pipeline([get_x, TitledStringDecoder, Numericalize(vocab)])
pipeline_y = Pipeline([get_y, TitledStringDecoder, Numericalize(vocab)])
pipeline_y(all_text_tok_chunked[0])
TensorText([9])
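
Since the whole point of TitledStringDecoder is the decode path, a quick sanity check (a small sketch using the pipelines above) is to decode a numericalized chunk and confirm it round-trips back to a readable TitledStr.

# decode runs the pipeline in reverse: Numericalize -> token list -> TitledStringDecoder -> TitledStr
pipeline_x.decode(pipeline_x(all_text_tok_chunked[0]))  # should give back 'xxbos one , two , three , four , five'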

Using Datasets + Dataloaders

dsets = Datasets(all_text_tok_chunked, tfms=[pipeline_x,pipeline_y], splits=splits)
dsets[0]
(TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16]), TensorText([9]))
dsets.show(dsets[0])
xxbos one , two , three , four , five
,

Next, we can create the dataloaders. This can be done with either DataLoaders.from_dsets(...) or dsets.dataloaders(...). Both methods are shown below.

# note: from_dsets is usually given the train and valid sets separately, e.g. DataLoaders.from_dsets(dsets.train, dsets.valid, ...)
dls = DataLoaders.from_dsets(dsets, shuffle=False, drop_last=True)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,
dls = dsets.dataloaders(bs=16, shuffle=False, drop_last=True)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,

Using Datasets -> train TfmdDL + valid TfmdDL -> dataloaders

Another way to get dls is to create TfmdDLs and pass those into DataLoaders. If you use DataLoader rather than TfmdDL, dls won't have a show_batch method available.

train_dl = TfmdDL(dsets.train, bs=16, drop_last=True)
valid_dl = TfmdDL(dsets.valid, bs=16, drop_last=True)
dls = DataLoaders(train_dl, valid_dl)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,
X,y = dls.one_batch()
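
For contrast, here is roughly what the plain-DataLoader version would look like. This is a sketch of the approach I'm avoiding: batches still come out fine, but the resulting dls has no show_batch to call.

# sketch: same Datasets, but wrapped in the base DataLoader instead of TfmdDL
plain_train_dl = DataLoader(dsets.train, bs=16, drop_last=True)
plain_valid_dl = DataLoader(dsets.valid, bs=16, drop_last=True)
plain_dls = DataLoaders(plain_train_dl, plain_valid_dl)
xb, yb = plain_dls.one_batch()  # batching still works
# plain_dls.show_batch()        # this is what fails without TfmdDL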

Using DataBlock -> datasets -> dataloaders

Another way to get dataloaders is to use DataBlock. DataBlock wants to know what type of data will be passed, which is specified via blocks. It also wants a splitter and the functions to get_x and get_y.

blocks = [TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)]), # x piece
          TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)])] # y piece
splits[-1]
(#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...]
IndexSplitter(splits[-1])(all_text_tok_chunked)
((#4588) [0,1,2,3,4,5,6,7,8,9...],
 (#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...])
dblock = DataBlock(blocks=blocks,
                   splitter=IndexSplitter(splits[-1]),
                   get_x=get_x,
                   get_y=get_y)

With the dblock created, you can create a dsets, and from that dsets you can create a dls similar to the one created above.

dsets_via_dblock = dblock.datasets(all_text_tok_chunked)
dsets_via_dblock
(#5735) [(TensorText([ 2, 12,  9, 13,  9, 14,  9, 15,  9, 16]), TensorText([9])),(TensorText([17,  9, 18,  9, 20,  9, 19,  9, 29,  9]), TensorText([30])),(TensorText([ 9, 31,  9, 32,  9, 33,  9, 34,  9, 35]), TensorText([9])),(TensorText([36,  9, 37,  9, 38,  9, 21,  9, 21, 12]), TensorText([9])),(TensorText([21, 13,  9, 21, 14,  9, 21, 15,  9, 21]), TensorText([16])),(TensorText([ 9, 21, 17,  9, 21, 18,  9, 21, 20,  9]), TensorText([21])),(TensorText([19,  9, 22,  9, 22, 12,  9, 22, 13,  9]), TensorText([22])),(TensorText([14,  9, 22, 15,  9, 22, 16,  9, 22, 17]), TensorText([9])),(TensorText([22, 18,  9, 22, 20,  9, 22, 19,  9, 23]), TensorText([9])),(TensorText([23, 12,  9, 23, 13,  9, 23, 14,  9, 23]), TensorText([15]))...]
dsets_via_dblock.show(dsets_via_dblock[0])
xxbos one , two , three , four , five
,
dls = dsets_via_dblock.dataloaders(bs=16,shuffle=False, drop_last=True)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,

Using DataBlock -> dataloaders

Another option is to go directly from dblock to dls with dblock.dataloaders. Behind the scenes this creates a dataset as well, but it can be a cleaner-looking way to handle things if you always go from dblock -> dls.

dls = dblock.dataloaders(all_text_tok_chunked, bs=16, shuffle=False, drop_last=True)
dls.show_batch()
text text_
0 xxbos one , two , three , four , five ,
1 six , seven , eight , nine , ten , eleven
2 , twelve , thirteen , fourteen , fifteen , sixteen ,
3 seventeen , eighteen , nineteen , twenty , twenty one ,
4 twenty two , twenty three , twenty four , twenty five
5 , twenty six , twenty seven , twenty eight , twenty
6 nine , thirty , thirty one , thirty two , thirty
7 three , thirty four , thirty five , thirty six ,
8 thirty seven , thirty eight , thirty nine , forty ,
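
One extra convenience of the DataBlock route: when the pipeline doesn't behave the way you expect, dblock.summary walks a single sample through every step and prints the intermediate results (I won't reproduce its output here).

# run one item through the DataBlock's transforms and print each intermediate result
dblock.summary(all_text_tok_chunked)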

Conclusion

Creating dls is an extremely important capability when using fastai because a dls is what a Learner expects in order to deal with all of the data. There are many different ways to create a dls object, so this isn't a comprehensive list, but it at least shows a few ways to do the task. In a future blog post, I will use this dls to explore transformer models. Hopefully this will help others get their dls working.

I'd like to give a special thanks to Arto for helping me get things working properly, and to everybody in the fastai Discord channel for dealing with my questions and for creating a great community to learn with every step of the way.