from fastai.text.all import *
fastai’s Text DLs Exploration
Introduction
This post is an exploration of how to convert data into dls objects that can be used by fastai's Learner. I was having trouble creating a dls object that supported show_batch and also met my somewhat arbitrary custom needs, so I set out to figure out how to create a dls that worked well for them.
This blog post uses the Human Numbers dataset, which counts sequentially from 1 to 9,999 in English text rather than numerical form. This is an interesting problem because there is quite a bit of repetition, but new tokens and patterns are also introduced regularly that a model will need to figure out.
My goal was to create a dls where X is a run of tokens (e.g. 1, 2, 3) and y is the next token (4). Over the course of this blog post, I will show four ways to create a dls that enables show_batch to work as expected.
import fastai, fastcore
fastai.__version__,fastcore.__version__
('2.3.1', '1.3.20')
path = untar_data(URLs.HUMAN_NUMBERS)
First, I create a tokenizer, combine all of the text into a single string, and tokenize each word.
tokenizer = Tokenizer(WordTokenizer())
Testing the Tokenizer
tokenizer('one two three')
(#4) ['xxbos','one','two','three']
tokenizer('one, two')
(#4) ['xxbos','one',',','two']
Reading the train and validation files:
* train: [1-8000]
* valid: [8001-9999]
train_txt = ', '.join(o.strip() for o in (path/'train.txt').readlines())
valid_txt = ', '.join(o.strip() for o in (path/'valid.txt').readlines())
For this problem, I will create my own validation set. It will split at close to the same point, but by creating my own split I don't have to do anything special when creating chunks around the train->validation boundary.
all_text = train_txt + valid_txt
all_text_tok = tokenizer(all_text)
all_text_tok
(#63094) ['xxbos','one',',','two',',','three',',','four',',','five'...]
Next, I take the tokenized text, count how many times each token occurs, and create a vocab from those counts.
count = Counter(all_text_tok)
vocab = make_vocab(count)
print(count)
Counter({',': 9996, 'hundred': 9000, 'thousand': 8999, 'one': 2900, 'two': 2900, 'three': 2900, 'four': 2900, 'five': 2900, 'six': 2900, 'seven': 2900, 'nine': 2899, 'eight': 2898, 'twenty': 1000, 'thirty': 1000, 'forty': 1000, 'fifty': 1000, 'sixty': 1000, 'seventy': 1000, 'eighty': 1000, 'ninety': 1000, 'ten': 100, 'eleven': 100, 'twelve': 100, 'thirteen': 100, 'fourteen': 100, 'fifteen': 100, 'sixteen': 100, 'seventeen': 100, 'eighteen': 100, 'nineteen': 100, 'xxbos': 1, 'nineeight': 1})
print(vocab)
['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj', ',', 'hundred', 'thousand', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'nine', 'eight', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'xxfake']
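As a quick sanity check (this mapping snippet is my own addition, not from the original notebook), the vocab can be used to map tokens to indices and back; the expected values in the comments line up with the vocab printed above.

o2i = {w:i for i,w in enumerate(vocab)}  # token -> index lookup built from the vocab
idxs = [o2i[t] for t in ['xxbos', 'one', ',', 'two']]
idxs                      # -> [2, 12, 9, 13] given the vocab printed above
[vocab[i] for i in idxs]  # -> ['xxbos', 'one', ',', 'two']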
all_text_tok_chunked = list(chunked(all_text_tok, 11))
# drop last non-full row
all_text_tok_chunked = all_text_tok_chunked[:-1]
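To make it concrete what chunked is doing, here is a tiny illustration of my own (not part of the dataset):

# chunked splits an iterable into fixed-size lists; the final chunk may be shorter,
# which is why the last, non-full chunk is dropped above.
list(chunked(range(7), 3))  # -> [[0, 1, 2], [3, 4, 5], [6]]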
Next, I create get_x and get_y functions that pull the input and target out of each chunk of tokenized data.
def get_x(o):
    # the first 10 tokens of the chunk are the model input
    return o[:10]

def get_y(o):
    # the 11th token is the target; the else guards against a short final chunk
    return [o[10]] if len(o) == 11 else ['.']
print(f"{get_x(all_text_tok_chunked[0])} -> {get_y(all_text_tok_chunked[0])}")
['xxbos', 'one', ',', 'two', ',', 'three', ',', 'four', ',', 'five'] -> [',']
print(f"{get_x(all_text_tok_chunked[-1])} -> {get_y(all_text_tok_chunked[-1])}")
['nine', 'thousand', 'nine', 'hundred', 'ninety', 'seven', ',', 'nine', 'thousand', 'nine'] -> ['hundred']
TitledStringDecoder is a transform that only decodes. What it enables is for the show_batch and show_results functions to actually work properly; without it, I had trouble getting those functions to work because TensorText doesn't have a proper show function or a truncate function.
class TitledStringDecoder(Transform):
def decodes(self, o):
return TitledStr(' '.join(o))
All TitledStringDecoder really does is take a list of tokens (e.g. ['one', 'two']) and join them into a space-separated string of type TitledStr, which knows how to display itself in a nice way.
TitledStr(' '.join(['one', 'two']))
'one two'
tmp_ts = TitledStr(' '.join(all_text_tok[:10]))
tmp_ts
'xxbos one , two , three , four , five'
tmp_ts.truncate(3)
'xxbos one ,'
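Because the transform only defines decodes, the encode direction is a no-op. A quick check of that asymmetry (my own snippet):

tsd = TitledStringDecoder()
tsd(['one', 'two'])         # encoding does nothing, the token list passes through unchanged
tsd.decode(['one', 'two'])  # decoding joins the tokens: 'one two' as a TitledStr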
I create the splits based off the chunks, putting the first 80% of the chunks into the training set and the last 20% into the validation set.
splits = [L(range(int(len(all_text_tok_chunked)*0.8))), L(range(int(len(all_text_tok_chunked)*0.8), len(all_text_tok_chunked)))]
splits
[(#4588) [0,1,2,3,4,5,6,7,8,9...],
(#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...]]
Now, let's test that the transforms work properly.
Numericalize(vocab)(TitledStringDecoder()(get_x(all_text_tok_chunked[0])))
TensorText([ 2, 12, 9, 13, 9, 14, 9, 15, 9, 16])
And confirm that they will work as a pipeline as well
pipeline = Pipeline([TitledStringDecoder, Numericalize(vocab)])
get_x(pipeline(all_text_tok_chunked[0]))
TensorText([ 2, 12, 9, 13, 9, 14, 9, 15, 9, 16])
pipeline_x = Pipeline([get_x, TitledStringDecoder, Numericalize(vocab)])
pipeline_y = Pipeline([get_y, TitledStringDecoder, Numericalize(vocab)])
pipeline_y(all_text_tok_chunked[0])
TensorText([9])
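The decode direction is what show_batch relies on: Pipeline.decode runs each transform's decodes in reverse order, so Numericalize maps ids back to tokens and TitledStringDecoder joins them into a displayable TitledStr. A quick check using the pipelines defined above (my own snippet):

enc = pipeline_x(all_text_tok_chunked[0])
pipeline_x.decode(enc)  # -> 'xxbos one , two , three , four , five'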
Using Datasets + Dataloaders
dsets = Datasets(all_text_tok_chunked, tfms=[pipeline_x, pipeline_y], splits=splits)
dsets[0]
(TensorText([ 2, 12, 9, 13, 9, 14, 9, 15, 9, 16]), TensorText([9]))
dsets.show(dsets[0])
xxbos one , two , three , four , five
,
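The splits carry through to the Datasets object; a quick look at the subset sizes (my own snippet, with the expected counts taken from the splits printed earlier):

len(dsets.train), len(dsets.valid)  # -> (4588, 1147), matching the splits above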
Next, we can create the dataloaders. This can be done with either DataLoaders.from_dsets(...) or dsets.dataloaders(...). Both methods are shown below.
dls = DataLoaders.from_dsets(dsets, shuffle=False, drop_last=True)
dls.show_batch()
|   | text | text_ |
|---|---|---|
0 | xxbos one , two , three , four , five | , |
1 | six , seven , eight , nine , ten , | eleven |
2 | , twelve , thirteen , fourteen , fifteen , sixteen | , |
3 | seventeen , eighteen , nineteen , twenty , twenty one | , |
4 | twenty two , twenty three , twenty four , twenty | five |
5 | , twenty six , twenty seven , twenty eight , | twenty |
6 | nine , thirty , thirty one , thirty two , | thirty |
7 | three , thirty four , thirty five , thirty six | , |
8 | thirty seven , thirty eight , thirty nine , forty | , |
dls = dsets.dataloaders(bs=16, shuffle=False, drop_last=True)
dls.show_batch()
|   | text | text_ |
|---|---|---|
0 | xxbos one , two , three , four , five | , |
1 | six , seven , eight , nine , ten , | eleven |
2 | , twelve , thirteen , fourteen , fifteen , sixteen | , |
3 | seventeen , eighteen , nineteen , twenty , twenty one | , |
4 | twenty two , twenty three , twenty four , twenty | five |
5 | , twenty six , twenty seven , twenty eight , | twenty |
6 | nine , thirty , thirty one , thirty two , | thirty |
7 | three , thirty four , thirty five , thirty six | , |
8 | thirty seven , thirty eight , thirty nine , forty | , |
Using Datasets -> train TfmdDL + valid TfmdDL -> dataloaders
Another way to get dls is to create TfmdDLs and pass those into DataLoaders. If you use DataLoader rather than TfmdDL, dls won’t have a show_batch method available.
train_dl = TfmdDL(dsets.train, bs=16, drop_last=True)
valid_dl = TfmdDL(dsets.valid, bs=16, drop_last=True)
dls = DataLoaders(train_dl, valid_dl)
dls.show_batch()
|   | text | text_ |
|---|---|---|
0 | xxbos one , two , three , four , five | , |
1 | six , seven , eight , nine , ten , | eleven |
2 | , twelve , thirteen , fourteen , fifteen , sixteen | , |
3 | seventeen , eighteen , nineteen , twenty , twenty one | , |
4 | twenty two , twenty three , twenty four , twenty | five |
5 | , twenty six , twenty seven , twenty eight , | twenty |
6 | nine , thirty , thirty one , thirty two , | thirty |
7 | three , thirty four , thirty five , thirty six | , |
8 | thirty seven , thirty eight , thirty nine , forty | , |
X,y = dls.one_batch()
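Grabbing one batch is a handy way to confirm the collation. With bs=16 I would expect X to be 16 sequences of 10 token ids and y to be 16 single-token targets (my own check; these shapes weren't shown in the original output):

X.shape, y.shape  # expect (torch.Size([16, 10]), torch.Size([16, 1]))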
Using DataBlock -> datasets -> dataloaders
Another way to get dataloaders is to use DataBlock. DataBlock wants to know what type of data will be passed, which can be specified via blocks. It also wants a splitter and the get_x and get_y functions.
blocks = [TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)]),  # x piece
          TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)])]  # y piece
splits[-1]
(#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...]
IndexSplitter(splits[-1])(all_text_tok_chunked)
((#4588) [0,1,2,3,4,5,6,7,8,9...],
(#1147) [4588,4589,4590,4591,4592,4593,4594,4595,4596,4597...])
dblock = DataBlock(blocks=blocks,
                   splitter=IndexSplitter(splits[-1]),
                   get_x=get_x,
                   get_y=get_y)
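If a block misbehaves, DataBlock.summary is a useful debugging aid: it walks through building one sample and one batch step by step, printing what each transform produces (this call is my addition; output omitted):

dblock.summary(all_text_tok_chunked)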
With the dblock created, you can create a dsets object, and from that dsets you can create a dls similar to the ones created above.
dsets_via_dblock = dblock.datasets(all_text_tok_chunked)
dsets_via_dblock
(#5735) [(TensorText([ 2, 12, 9, 13, 9, 14, 9, 15, 9, 16]), TensorText([9])),(TensorText([17, 9, 18, 9, 20, 9, 19, 9, 29, 9]), TensorText([30])),(TensorText([ 9, 31, 9, 32, 9, 33, 9, 34, 9, 35]), TensorText([9])),(TensorText([36, 9, 37, 9, 38, 9, 21, 9, 21, 12]), TensorText([9])),(TensorText([21, 13, 9, 21, 14, 9, 21, 15, 9, 21]), TensorText([16])),(TensorText([ 9, 21, 17, 9, 21, 18, 9, 21, 20, 9]), TensorText([21])),(TensorText([19, 9, 22, 9, 22, 12, 9, 22, 13, 9]), TensorText([22])),(TensorText([14, 9, 22, 15, 9, 22, 16, 9, 22, 17]), TensorText([9])),(TensorText([22, 18, 9, 22, 20, 9, 22, 19, 9, 23]), TensorText([9])),(TensorText([23, 12, 9, 23, 13, 9, 23, 14, 9, 23]), TensorText([15]))...]
dsets_via_dblock.show(dsets_via_dblock[0])
xxbos one , two , three , four , five
,
dls = dsets_via_dblock.dataloaders(bs=16, shuffle=False, drop_last=True)
dls.show_batch()
|   | text | text_ |
|---|---|---|
0 | xxbos one , two , three , four , five | , |
1 | six , seven , eight , nine , ten , | eleven |
2 | , twelve , thirteen , fourteen , fifteen , sixteen | , |
3 | seventeen , eighteen , nineteen , twenty , twenty one | , |
4 | twenty two , twenty three , twenty four , twenty | five |
5 | , twenty six , twenty seven , twenty eight , | twenty |
6 | nine , thirty , thirty one , thirty two , | thirty |
7 | three , thirty four , thirty five , thirty six | , |
8 | thirty seven , thirty eight , thirty nine , forty | , |
Using DataBlock -> dataloaders
Another option is to go directly from dblock to dls with dblock.dataloaders. Behind the scenes this creates a dataset as well, but it can be a cleaner-looking way to handle things if you always go from dblock -> dls.
dls = dblock.dataloaders(all_text_tok_chunked, bs=16, shuffle=False, drop_last=True)
dls.show_batch()
|   | text | text_ |
|---|---|---|
0 | xxbos one , two , three , four , five | , |
1 | six , seven , eight , nine , ten , | eleven |
2 | , twelve , thirteen , fourteen , fifteen , sixteen | , |
3 | seventeen , eighteen , nineteen , twenty , twenty one | , |
4 | twenty two , twenty three , twenty four , twenty | five |
5 | , twenty six , twenty seven , twenty eight , | twenty |
6 | nine , thirty , thirty one , thirty two , | thirty |
7 | three , thirty four , thirty five , thirty six | , |
8 | thirty seven , thirty eight , thirty nine , forty | , |
Conclusion
Creating dls is an extremely important capability when using fastai because a Learner expects a dls to handle all of the data. There are many different ways to create a dls object, so this isn't a comprehensive list, but it at least shows a few ways to get the job done. In a future blog post, I will be using this dls to explore transformer models. Hopefully this will help others get their dls working.
I’d like to give a special thanks to Arto for helping me get things working properly and everybody in the fastai discord channel for dealing with my questions and for creating a great community to learn with every step of the way.
Useful Links
https://arampacha.github.io/thoughtsamples/fastai/pytorch/2021/01/02/transformer-lm-from-scratch.html
https://github.com/fastai/fastai/blob/ab154927696338741e59e0ffc4774777c4a9781c/nbs/39_tutorial.transformers.ipynb
https://github.com/fastai/fastai/blob/ab154927696338741e59e0ffc4774777c4a9781c/dev_nbs/course/lesson7-human-numbers.ipynb