fastai's Text DLs Exploration
Exploring different methods to create the dls object in fastai
- Introduction
- Using Datasets + Dataloaders
- Using Datasets -> train TfmdDL + valid TfmdDL -> dataloaders
- Using DataBlock -> datasets -> dataloaders
- Using DataBlock -> dataloaders
- Conclusion
- Useful Links
This post is an exploration of how to convert data into dls objects that can be used by fastai's Learner. I was having issues creating a dls object that supported show_batch and met my custom needs, so I set out to figure out how to create a dls that worked well for me.
This blog post uses the Human Numbers dataset, which counts sequentially from 1 to 9,999 in English text rather than numerical form. This is an interesting problem because there is quite a bit of repetition, but new tokens and patterns are also introduced regularly that a model will need to figure out.
My goal was to create a dls where, by analogy, X = 1, 2, 3 and y = 4: each x is a sequence of tokens and each y is the token that follows it. Over the course of this blog post, I will show four ways to create a dls that enables show_batch to work as expected.
from fastai.text.all import *
import fastai, fastcore
fastai.__version__,fastcore.__version__
path = untar_data(URLs.HUMAN_NUMBERS)
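Before tokenizing, it can be helpful to peek at the raw data. This quick check is my own addition; it prints the first characters of train.txt, which should show the counting written out one number per line:
# Peek at the raw data: one number per line, written out in English
print((path/'train.txt').read_text()[:30])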
First, I create a tokenizer, combine all of the text into a single string, and tokenize each word
tokenizer = Tokenizer(WordTokenizer())
Testing the Tokenizer
tokenizer('one two three')
tokenizer('one, two')
Reading the train and validation files:
- train: [1-8000]
- valid: [8001-9999]
train_txt = ', '.join(o.strip() for o in (path/'train.txt').readlines())
valid_txt = ', '.join(o.strip() for o in (path/'valid.txt').readlines())
For this problem, I will create my own validation set. It will split at close to the same point, but by creating my own split, I don't have to do anything special when creating chunks around the train->validation boundary.
all_text = train_txt + ', ' + valid_txt  # join with ', ' so the last train token and the first valid token don't run together
all_text_tok = tokenizer(all_text)
all_text_tok
Next, I take the tokenized text, count how many times each token occurs, and create a vocab from that.
count=Counter(all_text_tok)
vocab = make_vocab(count)
print(count)
print(vocab)
# Break the token stream into chunks of 11 tokens (10 for x, 1 for y)
all_text_tok_chunked = list(chunked(all_text_tok, 11))
# Drop the final chunk, which may have fewer than 11 tokens
all_text_tok_chunked = all_text_tok_chunked[:-1]
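Since the final partial chunk is dropped, every remaining chunk should hold exactly 11 tokens. Here's a quick sanity check (my own addition, not required for the rest of the post):
# Every chunk should hold exactly 11 tokens: 10 for x and 1 for y
assert all(len(c) == 11 for c in all_text_tok_chunked)
len(all_text_tok_chunked)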
Next, I create functions that will get x and y from each chunk.
def get_x(o):
    return o[:10]

def get_y(o):
    return [o[10]] if len(o) == 11 else ['.']
print(f"{get_x(all_text_tok_chunked[0])} -> {get_y(all_text_tok_chunked[0])}")
print(f"{get_x(all_text_tok_chunked[-1])} -> {get_y(all_text_tok_chunked[-1])}")
TitledStringDecoder is a transform that only decodes. What it enables is for the show_batch and show_results functions to actually work properly; without it, I had trouble getting those functions to work because TensorText doesn't have a proper show function or a truncate function.
class TitledStringDecoder(Transform):
    def decodes(self, o):
        return TitledStr(' '.join(o))
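Because the class only defines decodes, calling the transform in the encoding direction leaves the data untouched. A quick demonstration of that asymmetry (my own addition):
tsd = TitledStringDecoder()
tsd(['one', 'two'])         # encoding is a no-op: the list comes back unchanged
tsd.decode(['one', 'two'])  # decoding joins the tokens into a TitledStr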
All TitledStringDecoder really does is take an array of text ('one', 'two') and convert it into a space-concatenated string of type TitledStr, which knows how to display itself in a nice way.
TitledStr(' '.join(['one', 'two']))
tmp_ts = TitledStr(' '.join(all_text_tok[:10]))
tmp_ts
tmp_ts.truncate(3)
I create the splits based on the chunks, putting the first 80% of the chunks into the training set and the last 20% into the validation set.
splits = [L(range(int(len(all_text_tok_chunked)*0.8))), L(range(int(len(all_text_tok_chunked)*0.8),len(all_text_tok_chunked)))]
splits
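As a sanity check (again my own addition), the two ranges should be disjoint and together cover every chunk exactly once:
# Train and valid indices should not overlap and should cover all chunks
assert set(splits[0]).isdisjoint(splits[1])
assert len(splits[0]) + len(splits[1]) == len(all_text_tok_chunked)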
Now, let's test that the transforms work properly.
Numericalize(vocab)(TitledStringDecoder()(get_x(all_text_tok_chunked[0])))
And confirm that they will work as a pipeline as well
pipeline = Pipeline([TitledStringDecoder, Numericalize(vocab)])
get_x(pipeline(all_text_tok_chunked[0]))
pipeline_x = Pipeline([get_x, TitledStringDecoder, Numericalize(vocab)])
pipeline_y = Pipeline([get_y, TitledStringDecoder, Numericalize(vocab)])
pipeline_y(all_text_tok_chunked[0])
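These pipelines can be decoded as well, which reverses the transforms: Numericalize maps the ids back to tokens, and TitledStringDecoder joins them into a displayable TitledStr. A quick round trip (my own check) confirms this:
# Round trip: encode a chunk to token ids, then decode back to a TitledStr
enc = pipeline_x(all_text_tok_chunked[0])
pipeline_x.decode(enc)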
dsets = Datasets(all_text_tok_chunked, tfms=[pipeline_x,pipeline_y], splits=splits)
dsets[0]
dsets.show(dsets[0])
Next, we can create the dataloaders. This can be done with either DataLoaders.from_dsets(...) or dsets.dataloaders(...). Both methods are shown below.
dls = DataLoaders.from_dsets(dsets.train, dsets.valid, shuffle=False, drop_last=True)  # one DataLoader per subset
dls.show_batch()
dls = dsets.dataloaders(bs=16, shuffle=False, drop_last=True)
dls.show_batch()
Another way to get a dls is to create TfmdDLs and pass those into DataLoaders. If you use a plain DataLoader rather than TfmdDL, the dls won't have a show_batch method available.
train_dl = TfmdDL(dsets.train, bs=16, drop_last=True)
valid_dl = TfmdDL(dsets.valid, bs=16, drop_last=True)
dls = DataLoaders(train_dl, valid_dl)
dls.show_batch()
X,y = dls.one_batch()
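It's worth inspecting the batch shapes here (my own addition); with bs=16 and the get_x/get_y defined above, X should be 16 sequences of 10 token ids and y should be 16 single-token targets:
# Expect X: torch.Size([16, 10]) and y: torch.Size([16, 1])
X.shape, y.shape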
Another way to get dataloaders is to use DataBlock. DataBlock wants to know what type of data will be passed, which can be specified via blocks. It also wants a splitter and the get_x and get_y functions.
blocks = [TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)]),  # x piece
          TransformBlock(type_tfms=[TitledStringDecoder, Numericalize(vocab)])]  # y piece
splits[-1]
IndexSplitter(splits[-1])(all_text_tok_chunked)
dblock = DataBlock(blocks=blocks,
                   splitter=IndexSplitter(splits[-1]),
                   get_x=get_x,
                   get_y=get_y)
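If a DataBlock ever misbehaves, dblock.summary is a handy debugging step: it runs one sample from the source through every transform and prints each intermediate result. Calling it here is optional:
# Walk one sample through every transform, printing the intermediate results
dblock.summary(all_text_tok_chunked)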
With the dblock created, you can create a dsets object, and from that you can create a dls similar to the ones created above.
dsets_via_dblock = dblock.datasets(all_text_tok_chunked)
dsets_via_dblock
dsets_via_dblock.show(dsets_via_dblock[0])
dls = dsets_via_dblock.dataloaders(bs=16,shuffle=False, drop_last=True)
dls.show_batch()
Another option is to go directly from dblock to dls with dblock.dataloaders. Behind the scenes this creates a datasets object as well, but it can be a cleaner-looking way to handle it if you always go from dblock -> dls.
dls = dblock.dataloaders(all_text_tok_chunked, bs=16, shuffle=False, drop_last=True)
dls.show_batch()
Creating dls objects is an extremely important capability when using fastai, because a dls is what a Learner expects to handle all of the data. There are many different ways to create a dls object, so this isn't a comprehensive list, but it at least shows a few ways to do the task. In a future blog post, I will use this dls to explore transformer models. Hopefully this will help others get their dls working.
I'd like to give a special thanks to Arto for helping me get things working properly, and to everybody in the fastai Discord channel for dealing with my questions and for creating a great community to learn with every step of the way.