Raison d'être for Mint + OCR
A lot of people use Mint.com for keeping track of accounts.
A lot of people have expenses they would like to scan and submit.
Some of us have to reconcile those receipts against our transactions without making mistakes, such as picking the wrong instance of a recurring transaction.
If we could look up Mint.com through an API and download the transactions, then OCR each receipt and match it against the Mint transactions, we could fairly easily build an expense app.
Out of scope
We are not trying to create an app. This is just for proving it works.
Mint API
Thanks to https://github.com/mrooney/mintapi for showing how to get at the Mint data.
I added a way to get transactions out of Mint, as shown below:
import json

def search_transactions(token, query="", offset=0):
    # `session` is assumed to be the logged-in requests session
    # created by the mintapi login code.
    request_id = "42"  # magic number? random number? (unused here)
    rnd = "410249975266"
    data = {
        "acctChanged": "T",
        "filterType": "cash",
        "offset": offset,
        "comparableType": 8,
        "rnd": rnd,
        "query": query,
        "task": "transactions",
    }
    murl = "https://wwws.mint.com/app/getJsonData.xevent?"
    response = session.get(murl, params=data)
    jres = json.loads(response.text)
    # Check that every item can be wrapped as a Mint_Transaction.
    for jresitem in jres["set"][0]["data"]:
        Mint_Transaction(jresitem)
    return jres["set"][0]["data"]
Notes:
- This retrieves 100 transactions per fetch.
- You can get clever with random number generation but a fixed random number appears to work just fine.
- Query is a string and I noticed it is not a multiple word search but an exact phrase search.
- offset allows you to specify a number and it will fetch the previous 100 records from the offset.
- So, using the above along with the mintapi code I was able to have a simple for loop fetch 500 records.
- For storage a simple json.dump is sufficient to cache the transactions on local drive.
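The paging and caching described in the notes above can be sketched roughly as follows. This is a minimal sketch, not the code from the original proof: `fetch_pages`, `save_cache` and `load_cache` are hypothetical helper names, and it assumes the offset counts records, so each fetch advances it by 100.

```python
import json

PAGE_SIZE = 100  # per the notes above, one fetch returns 100 transactions


def fetch_pages(fetch_page, pages=5, page_size=PAGE_SIZE):
    """Collect up to `pages` batches, stepping the offset by `page_size`.

    `fetch_page` is any callable taking an `offset` keyword and returning
    a list of transaction dicts (assumption: offset is a record count).
    """
    transactions = []
    for page in range(pages):
        batch = fetch_page(offset=page * page_size)
        if not batch:  # nothing more to fetch
            break
        transactions.extend(batch)
    return transactions


def save_cache(transactions, path):
    """Cache the raw transaction dicts on the local drive."""
    with open(path, "w") as fh:
        json.dump(transactions, fh)


def load_cache(path):
    """Read the cached transactions back."""
    with open(path) as fh:
        return json.load(fh)
```

With a logged-in session, `fetch_page` could be something like `lambda offset: search_transactions(token, offset=offset)`.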
PyTesser
This is straight from https://code.google.com/p/pytesser/
The only thing I changed was to use the latest Tesseract executable, which happens to be a bit better at OCR.
Learn from CAPTCHA
I found this post, which helped a lot in enhancing the images to prep them for Tesseract: http://www.boyter.org/decoding-captchas/
PIL
PIL is critical for changing the characteristics of an image (resizing, sharpening, rotating and so on) so that Tesseract has a better chance of recognizing the text.
OCR process
OCR is tricky at best, so the current approach is to:
- Start with the image of the receipt as is.
- Run Tesseract.
- Search for transaction information within the text.
- If the amount, vendor, date etc. match a single transaction, exit.
- If not, adjust the image and run through the above again.
- If none of the adjustments produces a single match, return the 'most likely' candidates.
Image adjustment variants
Captcha 'simple':
Enlarges the image to 3000 pixels wide (if it is narrower than that) and runs a sharpen filter.
from PIL import Image, ImageFilter

def captchasimple(im):
    img = im.convert("RGBA")
    basewidth = 3000
    if basewidth > img.size[0]:
        # Scale the height by the same factor to keep the aspect ratio.
        wpercent = basewidth / float(img.size[0])
        x = int(float(img.size[1]) * float(wpercent))
        big = img.resize((basewidth, x), Image.LINEAR)
    else:
        big = img
    big = big.convert("")
    big = big.filter(ImageFilter.SHARPEN)
    return big
Captcha 'threshold':
Converts the image to black and white using a threshold. This works for phone camera images of receipts. A low-contrast image can yield better results with this image manipulation.
def captcha(im, threshold=100):
    img = im.convert("RGBA")
    pixdata = img.load()
    basewidth = 2000
    # Make the letters bolder for easier recognition
    for y in xrange(img.size[1]):
        for x in xrange(img.size[0]):
            if pixdata[x, y][0] < threshold:
                pixdata[x, y] = (0, 0, 0, 255)
            else:
                pixdata[x, y] = (255, 255, 255, 255)
    for y in xrange(img.size[1]):
        for x in xrange(img.size[0]):
            if pixdata[x, y][1] < threshold:
                pixdata[x, y] = (0, 0, 0, 255)
            else:
                pixdata[x, y] = (255, 255, 255, 255)
    for y in xrange(img.size[1]):
        for x in xrange(img.size[0]):
            if pixdata[x, y][2] < threshold:
                pixdata[x, y] = (0, 0, 0, 255)
            else:
                pixdata[x, y] = (255, 255, 255, 255)
    wpercent = basewidth / float(img.size[0])
    x = int(float(img.size[1]) * float(wpercent))
    big = img.resize((basewidth, x), Image.ANTIALIAS)
    big = big.convert("")
    big = big.filter(ImageFilter.SHARPEN)
    return big
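A side note on the three loops above: after the first pass every pixel is already pure black or pure white, so the green and blue passes cannot change anything as long as the threshold is between 1 and 255. In effect the function thresholds on the red channel alone. The sketch below checks that on plain tuples instead of a PIL image; `three_pass` and `red_only` are names made up for this illustration, and it is written for Python 3.

```python
BLACK = (0, 0, 0, 255)
WHITE = (255, 255, 255, 255)


def three_pass(pixels, threshold=100):
    """Mimic the three loops above on a list of (r, g, b, a) tuples."""
    out = list(pixels)
    for channel in (0, 1, 2):  # red, then green, then blue
        out = [BLACK if p[channel] < threshold else WHITE for p in out]
    return out


def red_only(pixels, threshold=100):
    """Single pass that looks at the red channel only."""
    return [BLACK if p[0] < threshold else WHITE for p in pixels]


sample = [
    (30, 200, 200, 255),   # dark red, bright green/blue
    (150, 20, 20, 255),    # bright red, dark green/blue
    (99, 99, 99, 255),     # just under the default threshold
    (100, 0, 255, 255),    # exactly at the default threshold
]
assert three_pass(sample) == red_only(sample)
```

If overall brightness rather than just red were wanted, converting to grayscale first with `im.convert("L")` would be one option, though that is not what the code above does.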
Getting incremental results:
I noted that each run on each image may be favorable for some portions of the image but not others. Therefore I accrued the text results rather than replacing with the new result set. This gave an opportunity for better matches.
def getTextFromImg(imgfn):
    imfh = Image.open(imgfn)
    im = imfh
    ocrts = []
    ocrts.append(image_to_string(im, False))
    yield ocrts
    im = imfh.filter(ImageFilter.SHARPEN)
    ocrts.append(image_to_string(im, False))
    yield ocrts
    im = captchasimple(imfh)
    ocrts.append(image_to_string(im, False))
    yield ocrts
    for t in [100, 102, 103, 104, 105, 106, 110]:
        im = captcha(imfh, t)
        ocrts.append(image_to_string(im, False))
        yield ocrts
The last captcha loop varies the threshold as a last-ditch brute force. I noticed that there are certain sweet spots, at least in my receipt image captures, and these threshold values allowed the sweet spots to be discovered and made the OCR yield better results.
Matching with MINT:
In my proof, I built a class to hold each transaction, with a search function that takes the OCR string and returns a confidence number (sort of like a search engine and an aggregator of results).
Three items are checked:
- Merchant name
- Date
- Amount
import re
import sys
from datetime import timedelta
from dateutil import parser  # assumed: `parser.parse` is dateutil's date parser

class Mint_Transaction():
    """
    {"isTransfer":false
    "isEdited":false
    "isLinkedToRule":false
    "isPercent":false
    "isMatched":false
    "odate":"Jun 11"
    "isSpending":true
    "isFirstDate":true
    "isCheck":false
    "date":"Jun 11"
    "mcategory":"Groceries"
    "isDuplicate":false
    "id":433821332
    "amount":"$17.55"
    "isPending":true
    "fi":"Chase Bank"
    "note":""
    "isAfterFiCreationTime":true
    "txnType":0
    "ruleCategory":""
    "merchant":"Harris Teeter"
    "omerchant":"HARRIS TEETER ####"
    "categoryId":701
    "labels":[]
    "manualType":0
    "category":"Groceries"
    "ruleCategoryId":0
    "numberMatchedByRule":295
    "hasAttachments":false
    "account":"CREDIT CARD"
    "mmerchant":"Harris Teeter"
    "ruleMerchant":""
    "isChild":false
    "isDebit":true
    "userCategoryId":null}
    """
    # Fields left out of the searchable string representation.
    EXCLUDESRCH = ["isTransfer", "isEdited", "isLinkedToRule", "isPercent",
                   "isMatched", "odate", "isSpending", "isFirstDate",
                   "isCheck", "date", "mcategory", "isDuplicate", "id",
                   "amount", "isPending", "fi", "isAfterFiCreationTime",
                   "txnType", "ruleCategory", "categoryId", "manualType",
                   "category", "ruleCategoryId", "numberMatchedByRule",
                   "hasAttachments", "account", "ruleMerchant", "isChild",
                   "isDebit", "userCategoryId"]
    datefields = ["odate", "date"]
    json = None
    strRep = None
    pat = re.compile(r"\W+")

    def __init__(self, jsonitem):
        self.json = jsonitem
        self.strRep = ""
        for key, val in jsonitem.iteritems():
            if val != None:
                if key in self.datefields:
                    try:
                        val = parser.parse(val)
                        # %m is the month (the original used %M, i.e. minutes)
                        self.strRep += " " + val.strftime("%Y %m %d")
                    except:
                        sys.stderr.write("could not parse date:%s\n" % val)
                        self.strRep += " " + str(val).lower()
                elif not key in self.EXCLUDESRCH:
                    self.strRep += " " + str(val).lower()
            setattr(self, key, val)

    def getDataRow(self):
        amount = self.amount.decode("utf-8")
        amount = re.sub("[ $,]", "", amount)
        return [self.id,
                self.date.strftime("%Y"),
                self.date.strftime("%m"),
                self.date.strftime("%d"),
                self.merchant, amount]

    def getDatePatterns(self):
        # Yield regex patterns for the transaction date and the day
        # before/after it, in several year/month/day orderings.
        y = ["%Y", "%y"]
        m = ["%m", "%b"]
        d = "%d"
        dtlist = [self.date - timedelta(days=x) for x in range(-1, 2)]
        for yitem in y:
            for mitem in m:
                pms = [
                    [yitem, mitem, d],
                    [mitem, d, yitem],
                    [d, mitem, yitem],
                ]
                for iteritem in pms:
                    strf = r"[^\d]*".join(iteritem)
                    for dtitem in dtlist:
                        yield dtitem.strftime(strf)

    def getMatchConfidence(self, search):
        search = search.decode("utf-8")
        # Strip all non-word characters so merchant words can match
        # even when the OCR text is fragmented.
        line = self.pat.sub("", search)
        line = line.lower()
        confidence = 0
        # Merchant name: +2 for every word of 4 or more characters found.
        for word in self.merchant.split():
            if len(word) < 4:
                continue
            word = word.lower()
            try:
                num = float(word)
                # don't check for numbers in search
                # possibilities of match is too high for fragmented numbers
                continue
            except:
                pass
            if line.find(word) >= 0:
                sys.stderr.write("Found word in search %s in %s\n" % (word, self.merchant))
                confidence += 2
        # Date: +1 for the first date pattern that matches.
        for dtst in self.getDatePatterns():
            match = re.search(dtst, search)
            if match != None:
                sys.stderr.write("Found matching date in search %s in %s\n" % (dtst, self.merchant))
                confidence += 1
                break
        # Amount: +2 if the exact amount appears in the text.
        amount = self.amount.decode("utf-8")
        amount = re.sub("[ $,]", "", amount)
        if float(amount) > 0:
            if search.find(amount) >= 0:
                confidence += 2
                sys.stderr.write("Found matching amount in search %s in %s\n" % (amount, self.merchant))
        return confidence

    def __repr__(self):
        return "%s %s %s" % (self.merchant, self.date, self.amount)
And the calling 'aggregator':
def getSearchMatches(ocrtext, translist, min=0):
    confmap = {}
    cur_max = 0
    for mt in translist:
        assert isinstance(mt, Mint_Transaction)
        if mt.isDebit == False or mt.isTransfer == True:
            continue
        conf = mt.getMatchConfidence(ocrtext)
        if conf < min:
            continue
        confmap[mt.merchant + str(mt.id)] = [mt, conf]
        if cur_max < conf:
            cur_max = conf
    if len(confmap) == 0:
        return (0, [])
    items = sorted(confmap.iteritems(), key=lambda x: x[1][1], reverse=True)
    retval = []
    if cur_max < 2 or cur_max < min:
        return (cur_max, retval)
    for key, val in items:
        mt = val[0]
        conf = val[1]
        if float(conf) / float(cur_max) > 0.90:
            #print conf,mt.merchant,mt.amount,mt.date
            retval.append(mt)
    return (cur_max, retval)
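The selection rule in the aggregator (keep everything scoring above 90% of the best confidence, and nothing at all if the best is below 2) can be tried out in isolation. The sketch below is a simplified stand-in written for Python 3, not the code above: `pick_candidates` is a hypothetical name, and the sample scores are made up to resemble one amount match (+2) against two date-only matches (+1).

```python
def pick_candidates(scores, ratio=0.90, floor=2):
    """Return the names whose score is above `ratio` of the best score.

    `scores` maps a candidate name to its confidence. Nothing is returned
    when the best score is below `floor`, mirroring the `cur_max < 2` check.
    """
    if not scores:
        return []
    best = max(scores.values())
    if best < floor:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, conf in ranked if conf / float(best) > ratio]


print(pick_candidates({"Red Top Cab": 2, "Amazon": 1, "Washington Gas": 1}))
# prints ['Red Top Cab']
```

Because the cutoff is relative to the best score, two transactions with the same top confidence both come back, which is what leads to the 'most likely' candidates list when no single match emerges.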
And finally a main method to orchestrate the rest
def main(img, tl):
    res = None  # stays None when the file is not a JPEG
    max = 0
    if img.endswith("jpg") or img.endswith("JPG"):
        # Start with an empty OCR log file.
        with open("ocr.txt", "w") as out:
            out.write("")
        prevmax = 0
        for ocrsamples in getTextFromImg(img):
            with open("ocr.txt", "a") as out:
                out.write("--------------\n")
                out.write(ocrsamples[-1])
            # Match against all OCR text accrued so far.
            (max, res) = mint.getSearchMatches("".join(ocrsamples), tl, prevmax)
            if max > prevmax:
                prevmax = max
            else:
                continue
            if len(res) == 1:
                print "-------", max, res[0].merchant, res[0].amount, res[0].date
                return res
    # No single match: report the most likely candidates.
    if res != None:
        for resitem in res:
            print "====", max, resitem.merchant, resitem.amount
    return res
Here 'img' is the image to be OCR'ed, and 'tl' is the transaction list retrieved from Mint using the "search_transactions" function above, with each item wrapped as a Mint_Transaction.
Results
OCR on this image produces the following text:
I Date
Time 05:03:25
Distance 14.00mi
FARE.. . . . . .....$ 32.95
EXTRAS.. .... 1.00
S
TIP............$ 6.79
S
TOTAL.......... 40.74
Visa
xxxx xxxx xxxx 0148
‘!ID 445100003992
The analysis of this text shows:
Tesseract Open Source OCR Engine v3.02 with Leptonica
Found matching amount in search 40.74 in Red Top Cab
Found matching date in search 03[^\d]*25[^\d]*14 in Washington Gas
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
------- 2 Red Top Cab $40.74 2014-04-09 00:00:00
379861323
The last number is the ID of the Mint transaction for the cab ride.
Note that the date search is not well tuned yet: the pattern for 3/25/14 is actually matching the "03:25" in the time 05:03:25 followed by the "14" in "Distance 14.00".
That can be adjusted, but it sounds like a 'brewing' item.
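To see why the date pattern fires on the time and distance lines, and what one possible adjustment might look like, here is a small sketch for Python 3. The `loose` pattern is built the same way as in getDatePatterns; the `strict` pattern is only an assumption about how it could be tightened (real separators only, and no digit, colon or dot right before the month), not something from the original proof.

```python
import re
from datetime import date

ocr_text = "Time 05:03:25\nDistance 14.00mi"
txn_date = date(2014, 3, 25)

# Month, day and two-digit year as strings: ['03', '25', '14']
parts = [txn_date.strftime(fmt) for fmt in ("%m", "%d", "%y")]

# Loose: any run of non-digits (including newlines) between the parts.
loose = r"[^\d]*".join(parts)

# Stricter (hypothetical): exactly one common separator between the parts,
# nothing digit-like directly before the month or after the year.
strict = r"(?<![\d:.])" + r"[/\-. ]".join(parts) + r"(?!\d)"

print(bool(re.search(loose, ocr_text)))          # True: the false positive
print(bool(re.search(strict, ocr_text)))         # False
print(bool(re.search(strict, "Date 03/25/14")))  # True
```

The trade-off is that a stricter pattern will also miss real dates whose separators were mangled by the OCR, so it may be better to keep the loose match and simply weight it lower than a strict one.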