After the second lesson of fastai’s “Practical Deep Learning For Coders 2022” we’re encouraged to try to build our own deep learning classifier using transfer learning. So without getting too far into the deep learning weeds, can I train a model to recognize my prescription medication in photos taken with my laptop’s webcam or my phone? If you don’t want to read about it and just want to see the classifiers in action, click here.
World IBD Day 2023 Update! My prescription medication changed and so did the packaging! I’ve updated this post with a new section just for that.
- Transfer Learning
- Learner = Model + Data
- The Data
- The Model
- Test Dataset
- Other Model Considerations
- Medication Classification In Action
- World IBD Day 2023 Update – My Prescription Medication Changed!
- Some thoughts on fastai so far
Transfer Learning
Transfer learning involves “fine tuning” an existing pre-trained model (a neural network architecture) for a problem it wasn’t initially trained for. The potential advantages are that you start with a “real world tested” model, need less training data, and training is faster than building a new model from scratch.
Learner = Model + Data
fastai defines a “learner” as a model (a neural network architecture) plus the data to train it. In this instance I used a vision learner (a convolutional neural network) and defined the training data with a DataBlock. A DataBlock is a list of the pre-processing steps that convert the images into the inputs the model trains on, along with any image transformations you want to use to provide more variations of the images in your dataset. I will go into some of the details of these transformations with examples later. The fastai library includes pre-trained PyTorch torchvision models and, based on the recommendation from the lectures, I started out with a resnet18 model.
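A minimal sketch of what that looks like in code (the folder layout, with one sub-folder of photos per label, and the folder name are assumptions rather than my exact setup):

```python
from fastai.vision.all import *

# hypothetical folder: medication_photos/calcichew, medication_photos/imraldi, etc.
path = 'medication_photos'

meds_block = DataBlock(
    blocks=(ImageBlock, CategoryBlock),               # inputs are images, targets are labels
    get_items=get_image_files,                        # find every image under `path`
    get_y=parent_label,                               # label = name of the containing folder
    splitter=RandomSplitter(valid_pct=0.2, seed=42),  # hold back 20% as the validation set
    item_tfms=Resize(224),                            # the transformations discussed below
)
dls = meds_block.dataloaders(path)

# learner = pre-trained resnet18 + my data, with error rate as the metric
learn = vision_learner(dls, resnet18, metrics=error_rate)
```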
The Data
I decided to go with the boxes that my medication comes in rather than the individual tablets because I thought it would be an easier way to start. As you can see in the photo below there’s very little visual difference in a webcam photo between two different tablets, and I also didn’t want to be taking tablets out of their packets in case they dropped on the floor, or anything else unforeseen happened that meant I might have to throw them out.
As per the lesson I tried Bing Image Search (and then Google) but neither returned remotely useful images of my prescribed medication. Interestingly (to me at least!) searching for Imraldi (the adalimumab biosimilar I take) returned a lot more results for Humira, the original and significantly more expensive brand name version.
Not wanting to get bogged down in worrying about building a “proper” dataset, the easiest thing to do was take photos myself. I split them between my laptop’s webcam (0.9 megapixels) and my iPhone’s front-facing camera (12 megapixels). I also didn’t worry about properly framing the medication in each photo. The medication box sizes are 8cm (Calcichew), 14.5cm (Pentasa) and 18.5cm (Imraldi). As long as the box was clearly visible in each photo that was fine. Things like proper framing are considerations for any future iterations of the classifier.
When the pharmacist checks off my prescription they sometimes put a sticker on each box, either horizontally or vertically, or the sticker goes on the outside of the bag. The sticker can often cover a significant part of the box (especially the Calcichew). I took 11 photos (6 webcam and 5 iPhone) for each “sticker situation” and added 1 extra for good luck (:-P). Even if it was too small a dataset, fastai has built-in functionality to make modifications to existing images so that you don’t need to cover every possible permutation of photo.
Data Transformations
For the purposes of training, all the items (the images) need to be the same size. The standard size according to the lessons is 224px squares. This basic resizing can be performed with the standard Resize() command that’s part of fastai’s data transformations functionality. By default, Resize grabs the centre of each image and resizes it.
A more common approach is to use RandomResizedCrop(). It takes a different part of an image each time and zooms in on it. An issue with this approach is that you lose some parts of an image. However, this can be outweighed by the fact that the extra variations help your learner to generalize and handle new unseen images, instead of overfitting on the training data and then being unable to accurately classify new images. The grid below shows examples of this transformation.
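Swapping between the two item transforms is a small change to the DataBlock, using the same DataBlock.new pattern the fastai book uses. A sketch, reusing meds_block and path from the earlier snippet:

```python
from fastai.vision.all import RandomResizedCrop

# replace the centre-crop Resize with a random zoomed crop that changes every epoch
meds_block = meds_block.new(item_tfms=RandomResizedCrop(224))
dls = meds_block.dataloaders(path)

# unique=True repeats the same images so you can see the different random crops
dls.train.show_batch(max_n=8, unique=True)
```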
Data Augmentations
These are applied at a batch level, and in order to make use of them you must first resize all your training images so they are the same size, as described above. A “batch” is a group of images that PyTorch loads and processes at the same time on the GPU.
Augmentations allow you to add variety to your existing inputs by making runtime copies and performing manipulations like cropping, perspective warping, or contrast changes. These random variations can make the training data appear different without changing its “meaning”, which further helps to avoid overfitting. For reference, here is the same image from the item transformations section above but with augmentations also applied.
Below are some training images, with their labels above each image, after item and batch transformations were applied.
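Adding the default batch augmentations on top of the item resize looked roughly like this (again a sketch reusing meds_block and path from earlier, with every parameter left at its default):

```python
from fastai.vision.all import Resize, aug_transforms

# item transform: resize every image to a 224px square
# batch transform: the default fastai augmentations (small random flips, rotations,
# zooms, perspective warps and lighting changes), applied on the GPU per batch
meds_block = meds_block.new(item_tfms=Resize(224), batch_tfms=aug_transforms())
dls = meds_block.dataloaders(path)
dls.train.show_batch(max_n=8, unique=True)   # eyeball the augmented copies
```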
While I had some concerns that I didn’t have enough training data, there’s a lot of scope for parameter tinkering with transformations and augmentations that I did not want to get sidetracked into. For the purposes of these first experiments I left all the parameters at their defaults.
A benefit of the augmentations I hadn’t considered until after taking my training set photos is that taking uniform photos with a webcam is not easy. Picking up an object and holding it in the exact same position every time is difficult. So being able to generate resized, skewed and contrast-changed versions with a few lines of code was really useful.
Something to watch out for!
You can see below that the medication was almost or completely removed in some of these potential training images (and a sinister looking section of my face was left behind LOL!). I was lucky to stumble across these when using the show_batch() command.
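Spot-checking like this is just a one-liner on the dataloaders (assuming the dls object from the sketches above):

```python
# show a batch after the item and batch transforms have been applied;
# unique=True repeats the same images with different random augmentations
dls.train.show_batch(max_n=8, unique=True)
```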
I retook the 3 photos with the medication closer to the centre of the image. Problems like these could possibly be overcome by making sure the medication is always in the centre of the frame when taking a photo but that’s an application issue that can be dealt with in a later iteration.
The Model
fastai comes with a lot of pre-trained models and the recommendation from the lessons was to start with ResNet models as they are quick to train when building prototypes. For the experience I also chose several LeViT models; they’re very highly rated according to the chart in Jeremy Howard’s “Which image models are best?” notebook. Using them was extremely easy: install the timm library, import it into my Jupyter Notebook and pass the model name to my vision_learner instance.
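In code it is roughly this (a sketch assuming the dls dataloaders from earlier; passing the model name as a string is what tells fastai to pull the model from timm):

```python
# pip install timm
import timm  # noqa: F401 -- fastai uses timm models when given a name string
from fastai.vision.all import vision_learner, error_rate

# a string model name (instead of a torchvision function like resnet18) selects a timm model
learn = vision_learner(dls, 'levit_384', metrics=error_rate)
```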
Fine Tuning
fastai “fine tunes” the pretrained model, adapting what it was originally trained for to the new dataset I’m trying to use it on. At the end of each training epoch the model’s ability to “learn” the features of the images is tested against the validation set to give an “error rate” metric. I created a Jupyter Notebook for 3 ResNet and 4 LeViT models to train with each of the DataBlocks listed below.
Each learner and DataBlock combination was fine tuned until there was an error rate of 0% on the validation set predictions or 20 epochs were reached with the total time taken recorded and averaged out.
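The training loop for each combination looked something like this (a simplified sketch; resize_only_dls is one of the dataloader variants named in the tables below, and in practice I stopped early once the error rate hit 0%):

```python
from fastai.vision.all import vision_learner, resnet34, error_rate

# resize_only_dls: dataloaders built from the Resize-only DataBlock described above
learn = vision_learner(resize_only_dls, resnet34, metrics=error_rate)
learn.fine_tune(20)   # error_rate on the validation set is reported after every epoch
```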
It’s possible to see how a model performed on a validation set after finishing training by calling show_results(). In the image below the resnet18 model misclassified ‘other’ as ‘imraldi’ and ‘calcichew’.
The ResNet model training results:
| Model | Lowest Error Rate % | DataBlock | Avg. Training Epoch (seconds) |
| --- | --- | --- | --- |
| resnet18 | 3 | resize_only_and_default_augmentations_dls | 9 |
| resnet34 | 0 | resize_only_dls | 13 |
| resnet50 | 0 | resize_only_dls | 24 |
And the additional LeViT model training results:
| Model | Lowest Error Rate % | DataBlock | Avg. Training Epoch (seconds) |
| --- | --- | --- | --- |
| levit_128 | 12 | random_resize_and_default_augmentations_dls | 5 |
| levit_192 | 7 | random_resize_and_default_augmentations_dls | 8 |
| levit_256 | 4 | random_resize_and_default_augmentations_dls | 9 |
| levit_384 | 0 | random_resize_and_default_augmentations_dls | 15 |
Confusion Matrix and Top Losses
fastai has a lot of very useful tools to help you see what’s going on with the fine tuning so that it isn’t some sort of magic black box. The confusion matrix is great for visualising what the model is currently having problems classifying. And plot_top_losses helps to identify images that are poor quality or, if you got them from an internet search, aren’t correctly labeled. The images I happened to catch earlier using show_batch() would probably have appeared in the top losses.
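Both tools hang off fastai’s ClassificationInterpretation. A minimal sketch, assuming the trained learn object from above:

```python
from fastai.vision.all import ClassificationInterpretation

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()   # which classes are being confused with which
interp.plot_top_losses(9)        # the validation images with the highest loss
```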
Interestingly (to me at least :-P), every model had a problem with the below image of the pharmacy’s paper bag. Most would flip-flop between ‘imraldi’ and ‘pentasa’ until they’d gone through at least 10 to 12 fine tuning epochs. I would be curious to know what parts of the image the various models were focusing in on, if that’s possible!
What’s in a prediction?
Once the learner is trained satisfactorily it can be used to make predictions on images by calling its predict function. It returns 3 elements (there’s a short sketch after the list):
- decoded_prediction: the label/class associated with the prediction e.g. “pentasa”
- predicted_class_index: the index of the label/class within the learner’s CategoryBlock’s vocab list e.g. [‘calcichew’, ‘imraldi’, ‘other’, ‘pentasa’]
- predicted_probabilities_per_class: a list of values (one for each label/class) representing how likely the image is one of the labels/classes.
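A minimal sketch of a single prediction (the file names here are placeholders, not my actual files):

```python
from fastai.vision.all import load_learner, PILImage

learn = load_learner('meds_classifier.pkl')          # hypothetical exported learner
img = PILImage.create('webcam_photo.jpg')            # hypothetical webcam snapshot

decoded_prediction, predicted_class_index, probabilities = learn.predict(img)
print(decoded_prediction)                             # e.g. 'pentasa'
print(learn.dls.vocab[int(predicted_class_index)])    # the same label, looked up via the vocab
print(probabilities[predicted_class_index])           # how likely the model thinks that class is
```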
Test Dataset
This wasn’t actually part of the first 2 lectures and I ended up going on a bit of a wild but extremely useful tangent. I wanted to explore the possibility of checking the prediction accuracy of the fine tuned models against lots of new images. Initially I did this using a for loop to iterate over an array of new image files. But given that training and validation work with collections of images, I thought there must be a way to give the trained model a collection of test images. Luckily, poking around in the source code and the documentation, I found references to test_dl and get_preds().
The test set should be made up of images that have not been used for training or validation. Its purpose is to check whether your model has learned the features of your training data too well and struggles to classify new images. I built the test set from 5 randomly taken images of each medication and random non-medication things for a total of 25 images.
By default get_preds() returns the list of predicted probabilities per class per image in the test set (in the same way it returns a list when you call predict()).
Visualising the Test Dataset Results
The text-based output doesn’t really give you much indication of how the learner did on the test set. I found show_results() and plot_top_losses() extremely useful but they don’t exist for test datasets (at the time of writing at least!). If I remember correctly it has something to do with fastai initially being designed with Kaggle competitions in mind, which don’t need to visualise the output. (Unfortunately I can’t remember if I found that out from a forum post, a lecture video or another fastai related video… but I don’t think I imagined it!).
There is an argument, with_decoded, that you can pass to get_preds() to also get the predicted class indices.
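Putting test_dl, get_preds() and with_decoded together looks roughly like this (the test_set folder name is a placeholder):

```python
from fastai.vision.all import get_image_files

test_files = get_image_files('test_set')            # hypothetical folder of the 25 new photos
test_dl = learn.dls.test_dl(test_files)             # build an unlabelled test DataLoader
probs, _, decoded = learn.get_preds(dl=test_dl, with_decoded=True)

# probs has one row of per-class probabilities per image;
# decoded has the corresponding predicted class indices
for f, p in zip(test_files, decoded):
    print(f.name, learn.dls.vocab[int(p)])
```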
I took some inspiration from the output of plot_top_losses() and, in an effort to get better at list comprehensions, I wrote a small script that uses HTML and inline CSS to build a grid based on the with_decoded output inside my Jupyter Notebook. The truncated screenshot below is an example of the output from the script.
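The real script lives in the repository linked below; the rough idea, reusing the names from the get_preds sketch above plus a hypothetical true_labels list of the known class for each test image, is something like this:

```python
import base64
from IPython.display import HTML, display

def img_tag(path):
    # embed the image as base64 so it renders inside the notebook
    data = base64.b64encode(open(path, 'rb').read()).decode()
    return f'<img src="data:image/jpeg;base64,{data}" width="150">'

vocab = learn.dls.vocab
# green border if the prediction matches the known label, red if it doesn't
cells = [
    f'<div style="border:3px solid {"green" if vocab[int(p)] == t else "red"};'
    f'margin:4px;padding:4px;text-align:center;">{img_tag(f)}<br>{vocab[int(p)]}</div>'
    for f, p, t in zip(test_files, decoded, true_labels)
]
display(HTML(f'<div style="display:flex;flex-wrap:wrap;">{"".join(cells)}</div>'))
```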
Here’s the link to the repository if you want to use it for your own test set.
Test Dataset Performance
Here is how each model did against the test set. This was simply a case of running get_preds() with with_decoded, running the visualising script and counting up the red-bordered images.
| Model | Test Set Error Rate % |
| --- | --- |
| resnet34 | 20 |
| resnet50 | 20 |
| levit_384 | 5 |
Other Model Considerations
Inference Speed
Inference speed is how quickly the model returns a prediction. This isn’t an enormous concern of mine at such an early stage; it was more out of curiosity about how the models performed. Inference speed isn’t reported by get_preds(), so these were calculated by iterating over a list of images, calling predict() on each and timing it with the %time IPython magic.
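A sketch of that timing loop, using Python’s time module instead of the notebook magic (the folder name is a placeholder, and learn is a trained learner from earlier):

```python
import time
from fastai.vision.all import get_image_files

image_files = get_image_files('test_set')       # hypothetical folder of images to time against
times = []
with learn.no_bar():                            # suppress the progress bar while timing
    for f in image_files:
        start = time.perf_counter()
        learn.predict(f)
        times.append((time.perf_counter() - start) * 1000)

print(f'average inference speed: {sum(times) / len(times):.2f} ms')
```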
| Model | Average Inference Speed (ms) |
| --- | --- |
| resnet34 | 288.40 |
| resnet50 | 402.85 |
| levit_384 | 105.35 |
Exported Model Size
Like inference speed, the size of the model isn’t a serious issue for me right now and I compared them purely out of curiosity. You can export your fine tuned model to a pickle file using the learner’s export() function.
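A quick sketch of exporting a learner and checking the resulting file size (the file name is a placeholder):

```python
import os

learn.export('levit_384_meds.pkl')                     # hypothetical export file name
export_path = learn.path / 'levit_384_meds.pkl'        # export() writes relative to learn.path
print(f'{os.path.getsize(export_path) / (1024 * 1024):.2f} MB')
```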
| Model | Exported Size (MB) |
| --- | --- |
| resnet34 | 83.42 |
| resnet50 | 98.13 |
| levit_384 | 151.28 |
Medication Classification In Action
The levit_384 model was the best performing of the three and I chose to use that for making a quick prototype.
| Model | Exported Size (MB) | Inference Speed (ms) | Test Set Error Rate % |
| --- | --- | --- | --- |
| resnet34 | 83.42 | 288.40 | 20 |
| resnet50 | 98.13 | 402.85 | 20 |
| levit_384 | 151.28 | 105.35 | 5 |
This very basic version of the application is simply a Flask site running inside a Windows 10 WSL2 install. The photos are taken using getUserMedia() to access a video MediaStream from my laptop webcam or my phone’s front-facing camera. A still photo from the video stream is written to a canvas element, which is then dataURL’d to a full-quality JPEG and POST’d to the Flask endpoint. I created an instance of the fine tuned levit_384 learner from its exported pickle file using load_learner. The learner at the endpoint performs a predict() on the submitted image, and the predicted class and probability (1.0 being 100%) are returned to the front end.
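A stripped-down sketch of the endpoint (the route, form field name and pickle file name are placeholders rather than my exact code):

```python
import base64
from flask import Flask, request, jsonify
from fastai.vision.all import load_learner, PILImage

app = Flask(__name__)
learn = load_learner('levit_384_meds.pkl')      # hypothetical exported learner

@app.route('/classify', methods=['POST'])
def classify():
    # the front end POSTs the canvas dataURL, e.g. "data:image/jpeg;base64,...."
    data_url = request.form['photo']             # 'photo' is a placeholder field name
    img_bytes = base64.b64decode(data_url.split(',', 1)[1])
    pred_class, pred_idx, probs = learn.predict(PILImage.create(img_bytes))
    return jsonify({'prediction': str(pred_class),
                    'probability': float(probs[pred_idx])})
```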
(If you want to run something similar yourself this video provides an excellent explanation of how to set up port forwarding and firewall rules to expose a server running inside WSL2 on your own LAN).
In the two short clips below you can see the classifier performance on my target devices: a laptop webcam and my phone’s camera. I’m extremely pleased with how it has worked out.
There is plenty of room for iterative improvements, and there are several things I’d like to do next: classify from video rather than still images, have a go at pill classification and integrate it all into my existing home healthcare project.
World IBD Day 2023 Update
My prescription medication changed and so did the packaging. The newest box also has 2 stickers (“renew” and “fridge”) on it that I didn’t have when previously training the classifier. Will it still be able to reliably identify it as Imraldi or has the packaging changed too much?
There’s a lot that can change without your knowledge, and in spite of your best efforts, when living with chronic illness. Thankfully this was a good change! Injecting with the “old” Imraldi was super double not fun, to put it politely. I’d liken it to someone pinching and twisting a lump of skin on the back of your thigh for 10 to 15 seconds. I can’t say I ever sprang out of bed at the prospect of injecting myself, but not injecting would obviously be worse. I have also tried all the “tips” and “tricks” to reduce the pain, like leaving the pen outside the fridge for 30 minutes, putting a freezer pack on the injection site to numb it, squeezing the skin together, stretching the skin apart, rubbing the alcohol swab in a counter-clockwise direction, and hopping on one leg an odd number of times. I might have made those last two up, but still nothing ever worked in reducing the pain.
The new stuff is almost painless to inject with (needles are never totally painless!). As it is 0.4ml lighter than the old version, I’m guessing what medical professionals would call “the stingy stuff” was in that 0.4ml. And good riddance to it!
In 2019 I wrote a post on the EU and WHO guidelines regarding “Medication Adherence”. My issues with “the stingy stuff” would probably fall under 2 – “Therapy” and “Patient” related – of the WHO’s 5 factors that impact your ability as a patient to adhere to your medical treatment.
So how did the classifier do with identifying the new and never seen before Imraldi? Surprisingly well as you can see in the example below taken from my laptop’s webcam.
That’s all folks! Happy World IBD Day 2023!
Some thoughts on fastai so far
fastai’s top-down “teaching the whole game” approach has been really appealing. It is inspired by the idea that when teaching kids to play baseball, they learn by playing baseball rather than each individual part of the game in isolation. As someone who, a very long time ago, went to lectures about the basics of neural networks that were mostly very dry maths followed by even drier MATLAB classes, I much prefer fastai’s way of doing things. In the very first lecture you learn how to build an “is it a bird?” classifier, and in the second lecture Jeremy shows you how to gather images online, pick different learners and deploy your learner into production.
In my own experience, I have often learned things “better” when I can build and experiment. And now with a basic working prototype under my belt I know I’ll be able to better understand the deeper theory in the later lectures and chapters of the book.