Making ScanSnap Receipts usable
For a long time I used a service called Shoeboxed to scan and organize my credit card receipts. Basically stuff your receipts in a US prepaid envelope, drop it in the mail, and they scan, OCR and shred them, as well as analyzing the text to extract the information. Unfortunately, since I moved to the UK the service leaves to be desired, and the price has also gone up over time.
I have a couple of Fujitsu ScanSnap document scanners, a S1500M, which is no longer supported on macOS Mojave (but fortunately is by the third-party app ExactScan), and an iX100 which still is supported, as well as a SV600 which is utterly unsuited to dealing with crumpled receipts. The new, dumbed-down ScanSnap Home app that ships with ScanSnaps has a receipt mode. Since I never use the iX100 to scan documents at home given I have a S1500M (it’s a handheld battery=powered simplex scanner that’s mostly intended for mobile use), I dedicated it and ScanSnap Home to scanning receipts.
The basic functionality of scanning, deskewing, OCR-ing and extracting date, amount, vendor and so on mostly works, but otherwise ScanSnap Receipts is an ergonomic disaster. For starters, it never recognizes the currency correctly and always identifies my transactions as being in dollars rather than pounds. Secondly, it inexplicably lacks the ability to batch-edit receipts, e.g. select the date range for my last trip to France and change all of them from dollars to euros. You need to edit them one by one, which is as incredibly tedious as you can imagine.
After a few weeks of this, I decided to take matters in my own hand. It turns out ScanSnap Home uses Core Data backed by an underlying SQLite database. SQLite is the world’s most widely deployed database (every single Android and iOS smartphone includes it, for starters), but the Core Data object-relational mapper above it does a terrific job of obfuscating it and reducing its performance. Nonetheless, after a little bit of digging, I wrote the following script to automate the most repetitive operations:
- Stop all ScanSnap auxiliary processes, as they have the DB opened even if you quit the app
- Set the currency for all transactions tagged as Unchecked (ScanSnap Home
does this by default) to
GBP
- Normalize all unchecked Waitrose vendor names to
Waitrose
- Same for Tesco
- Rename
M&S
toMarks & Spencer
- Fixes for Superdrug and Sainsbury’s
- Set the amount of receipts tagged as duplicates to zero
- Attempt to fix little-endian European format DD/MM/YYYY dates parsed as middle-endian US format MM/DD/YYYY
- Attempt to fix Euro format DD/MM/YY dates parsed as US format YY/MM/DD
- Restart ScanSnap Home
The timestamps in the database are in a strange format that seems to be the number of seconds since an epoch of 2001-01-01T00:00:00 UTC. I hope no one tries to scan receipts older than that… You can convert to UNIX timestamps by adding 978307200 and from there to SQLite’s Julian Day format.
One major annoyance thing this script attempts to fix is dates. Because date formats are ambiguous (is 11/3/20 March 11 2020, or November 3 2020 or perversely March 20 2011?) and point-of-sale vendors are neither ISO 8601 nor even Y2100 compliant, parsing dates is a minefield. My assumption is that receipts will be scanned in a reasonably timely manner, and if there is ambiguity, the closest date should win.
#!/bin/sh
pkill -9 -f ScanSnap
sqlite3 "$HOME/Library/Application Support/PFU/ScanSnap Home/Managed/ScanSnapHome.sqlite" << EOF
.mode lines
UPDATE zcontent
SET zcurrencysign=(SELECT z_pk FROM zcurrencysign WHERE zvalue='GBP')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Waitrose')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue<>'Waitrose' AND zvalue LIKE '%waitrose%'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Tesco')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue<>'Tesco' AND zvalue LIKE '%tesco%'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Marks & Spencer')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue LIKE '%M&S%'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Superdrug')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue<>'Superdrug' AND zvalue LIKE '%superdrug%'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Sainsbury''s')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue<>'Sainsbury''s' AND zvalue LIKE '%Sainsbury%'
);
UPDATE zcontent
SET zamount=0.0
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Duplicate'
);
UPDATE zcontent
SET zreceiptdate = strftime('%s',
strftime('%Y-%d-%m', zreceiptdate+978307200, 'unixepoch', 'localtime'),
'utc'
)-978307200
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
)
AND zreceiptdate IS NOT NULL
AND strftime('%s',
strftime('%Y-%d-%m', zreceiptdate+978307200, 'unixepoch', 'localtime'),
'utc'
)-978307200
BETWEEN zreceiptdate AND strftime('%s', 'now')-978307200;
UPDATE zcontent
SET zreceiptdate = strftime('%s',
strftime('20%d-%m-', zreceiptdate+978307200,
'unixepoch', 'localtime') ||
substr(strftime('%Y', zreceiptdate+978307200,
'unixepoch', 'localtime'), 3),
'utc'
)-978307200
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
)
AND zreceiptdate IS NOT NULL
AND strftime('%s',
strftime('20%d-%m-', zreceiptdate+978307200,
'unixepoch', 'localtime') ||
substr(strftime('%Y', zreceiptdate+978307200,
'unixepoch', 'localtime'), 3),
'utc'
)-978307200
BETWEEN zreceiptdate AND strftime('%s', 'now')-978307200;
EOF
open /Applications/ScanSnapHomeMain.app
Note that the SQLite database is not used only by expenses
(ZCONTENT.ZDOCTYPE=4
) but also to store a summary of all documents
scanned. Also, the ZCONTENT
table has a column ZUNCHECKED
that is not what
you would expect, it is a constant 1 even if you remove the Unchecked
tag
from the transaction.
Now, all the usual disclaimers apply, modifying the database directly is not something supported by the app developer, and could have unintended consequences. If you use this script (or more likely modify it for your needs), I disclaim responsibility for any damages or data loss this may cause.
Update (2020-10-01):
Added:
- setting duplicate receipt amounts to 0
- fixing DD/MM/YYYY dates misparsed as MM/DD/YYYY
- fixing DD/MM/YY dates misparsed as YY/MM/DD