For a long time I used a service called Shoeboxed
to scan and organize my credit card receipts. Basically stuff your receipts in
a US prepaid envelope, drop it in the mail, and they scan, OCR and shred them,
as well as analyzing the text to extract the information. Unfortunately, since
I moved to the UK the service leaves to be desired, and the price has also
gone up over time.
I have a couple of Fujitsu ScanSnap document
scanners, a S1500M, which is no longer supported on macOS Mojave (but
fortunately is by the third-party app
ExactScan), and an iX100 which still is
supported, as well as a SV600 which is utterly unsuited to dealing with
crumpled receipts. The new, dumbed-down ScanSnap Home app that ships with
ScanSnaps has a receipt mode. Since I never use the iX100 to scan documents at
home given I have a S1500M (it’s a handheld battery=powered simplex scanner
that’s mostly intended for mobile use), I dedicated it and ScanSnap Home to
scanning receipts.
The basic functionality of scanning, deskewing, OCR-ing and extracting date,
amount, vendor and so on mostly works, but otherwise ScanSnap Receipts is an
ergonomic disaster. For starters, it never recognizes the currency correctly
and always identifies my transactions as being in dollars rather than
pounds. Secondly, it inexplicably lacks the ability to batch-edit receipts,
e.g. select the date range for my last trip to France and change all of them
from dollars to euros. You need to edit them one by one, which is as
incredibly tedious as you can imagine.
After a few weeks of this, I decided to take matters in my own hand. It turns
out ScanSnap Home uses Core Data backed by an underlying
SQLite database. SQLite is the world’s most
widely deployed database (every single Android and iOS smartphone includes it,
for starters), but the Core Data object-relational mapper above it does a
terrific job of obfuscating it and reducing its performance. Nonetheless,
after a little bit of digging, I wrote the following script to automate the
most repetitive operations:
- Stop all ScanSnap auxiliary processes, as they have the DB opened even if
you quit the app
- Set the currency for all transactions tagged as Unchecked (ScanSnap Home
does this by default) to
GBP
- Normalize all unchecked Waitrose vendor names to
Waitrose
- Same for Tesco
- Rename
M&S
to Marks & Spencer
- Fixes for Superdrug and Sainsbury’s
- Set the amount of receipts tagged as duplicates to zero
- Attempt to fix little-endian European format DD/MM/YYYY dates parsed as
middle-endian US format MM/DD/YYYY
- Attempt to fix Euro format DD/MM/YY dates parsed as US format YY/MM/DD
- Restart ScanSnap Home
The timestamps in the database are in a strange format that seems to be the number of seconds since an epoch of 2001-01-01T00:00:00 UTC. I hope no one tries to scan receipts older than that… You can convert to UNIX timestamps by adding 978307200 and from there to SQLite’s Julian Day format.
One major annoyance thing this script attempts to fix is dates. Because date
formats are ambiguous (is 11/3/20 March 11 2020, or November 3 2020 or
perversely March 20 2011?) and point-of-sale vendors are neither ISO
8601 nor even Y2100
compliant, parsing dates is a minefield. My assumption is that receipts will
be scanned in a reasonably timely manner, and if there is ambiguity, the
closest date should win.
#!/bin/sh
pkill -9 -f ScanSnap
sqlite3 "$HOME/Library/Application Support/PFU/ScanSnap Home/Managed/ScanSnapHome.sqlite" << EOF
.mode lines
UPDATE zcontent
SET zcurrencysign=(SELECT z_pk FROM zcurrencysign WHERE zvalue='GBP')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Waitrose')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue<>'Waitrose' AND zvalue LIKE '%waitrose%'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Tesco')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue<>'Tesco' AND zvalue LIKE '%tesco%'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Marks & Spencer')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue LIKE '%M&S%'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Superdrug')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue<>'Superdrug' AND zvalue LIKE '%superdrug%'
);
UPDATE zcontent
SET zvendor=(SELECT z_pk FROM zvendor WHERE zvalue='Sainsbury''s')
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
) AND zvendor IN (
SELECT z_pk FROM zvendor
WHERE zvalue<>'Sainsbury''s' AND zvalue LIKE '%Sainsbury%'
);
UPDATE zcontent
SET zamount=0.0
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Duplicate'
);
UPDATE zcontent
SET zreceiptdate = strftime('%s',
strftime('%Y-%d-%m', zreceiptdate+978307200, 'unixepoch', 'localtime'),
'utc'
)-978307200
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
)
AND zreceiptdate IS NOT NULL
AND strftime('%s',
strftime('%Y-%d-%m', zreceiptdate+978307200, 'unixepoch', 'localtime'),
'utc'
)-978307200
BETWEEN zreceiptdate AND strftime('%s', 'now')-978307200;
UPDATE zcontent
SET zreceiptdate = strftime('%s',
strftime('20%d-%m-', zreceiptdate+978307200,
'unixepoch', 'localtime') ||
substr(strftime('%Y', zreceiptdate+978307200,
'unixepoch', 'localtime'), 3),
'utc'
)-978307200
WHERE zdoctype=4 AND z_pk IN (
SELECT z_4contents
FROM z_4labels
JOIN zlabel ON z_15labels=zlabel.z_pk
WHERE zlabel.zname='Unchecked'
)
AND zreceiptdate IS NOT NULL
AND strftime('%s',
strftime('20%d-%m-', zreceiptdate+978307200,
'unixepoch', 'localtime') ||
substr(strftime('%Y', zreceiptdate+978307200,
'unixepoch', 'localtime'), 3),
'utc'
)-978307200
BETWEEN zreceiptdate AND strftime('%s', 'now')-978307200;
EOF
open /Applications/ScanSnapHomeMain.app
Note that the SQLite database is not used only by expenses
(ZCONTENT.ZDOCTYPE=4
) but also to store a summary of all documents
scanned. Also, the ZCONTENT
table has a column ZUNCHECKED
that is not what
you would expect, it is a constant 1 even if you remove the Unchecked
tag
from the transaction.
Now, all the usual disclaimers apply, modifying the database directly is not
something supported by the app developer, and could have unintended
consequences. If you use this script (or more likely modify it for your
needs), I disclaim responsibility for any damages or data loss this may cause.
Update (2020-10-01):
Added:
- setting duplicate receipt amounts to 0
- fixing DD/MM/YYYY dates misparsed as MM/DD/YYYY
- fixing DD/MM/YY dates misparsed as YY/MM/DD