Friday, June 20, 2014

Freecommander multi-rename: using matched sub groups

Quick help for the multi-rename function in FreeCommander.

Choosing "Regular expression" lets you use regular expressions like those in programming languages. The syntax is very similar and, as far as I can see, it is the same.

The pattern 'group' logic is also present, though a bit hidden: a matched group is referenced by its index prefixed with a dollar sign '$'.

So renaming my multi-disc mp3 files, originally ripped as 01_01.mp3 through 04_01.mp3 and so on, to the form Disc 01 Track 01.mp3 would look like:

Result would look like
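
For comparison, the same group logic can be sketched in Python. The pattern and replacement below are assumptions reconstructed from the file names above, not copied from the FreeCommander dialog, and the helper name rename is hypothetical. Note that Python writes group references as \1 and \2 where FreeCommander uses the dollar-prefixed index.

```python
import re

# Assumed search pattern: two digit groups separated by an underscore.
PATTERN = re.compile(r"^(\d+)_(\d+)\.mp3$")


def rename(filename):
    """Rewrite 'DD_TT.mp3' as 'Disc DD Track TT.mp3'; leave other names alone."""
    # \1 is the disc number, \2 the track number.
    return PATTERN.sub(r"Disc \1 Track \2.mp3", filename)


if __name__ == "__main__":
    for name in ["01_01.mp3", "04_01.mp3"]:
        print(name, "->", rename(name))
```

File names that do not fit the pattern pass through unchanged, which makes a dry run on a whole folder safe.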

Tuesday, June 17, 2014

Mint API and OCR using PyTesser = expense application with smarts!

Raison d'etre for Mint + OCR


A lot of people use Mint.com to keep track of their accounts.
A lot of people have expense receipts they would like to scan and submit.
Some of us have to find a way to reconcile the receipts without making mistakes, such as picking the same recurring transaction.
If we could query Mint.com through an API and download the transactions,
and OCR each receipt and match it against the Mint transactions,
we could build an expense app reasonably easily.

Out of scope

We are not trying to create an app. This is just for proving it works.

Mint API

To give an idea of how to get at the Mint data,
I added a way to pull transactions out of Mint, as shown below.


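import json

# `session` is assumed to be the logged-in session object created by the mintapi code.
def search_transactions(token,query="",offset=0):
    request_id = "42" # magic number? random number? (unused here)
    rnd="410249975266" # fixed "random" number, see the notes below
    data = {
        "acctChanged":"T",
        "filterType": "cash", 
        "offset":offset,
        "comparableType":8,
        "rnd":rnd,
        "query":query,
        "task": "transactions"
        }
    murl="https://wwws.mint.com/app/getJsonData.xevent?" 
    response= session.get(murl, params=data)
    jres= json.loads(response.text)
    transactions=jres["set"][0]["data"]
    for jresitem in transactions:
        # The instance is discarded; building it only checks that the item parses
        Mint_Transaction(jresitem)
    # The raw JSON items are returned, not Mint_Transaction objects
    return transactions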

Notes:

  • This retrieves 100 transactions per fetch.
  • You can get clever with random number generation for rnd, but a fixed number appears to work just fine.
  • query is a string, and I noticed it is an exact-phrase search, not a multiple-word search.
  • offset takes a number, and the call fetches the previous 100 records from that offset.
  • Using the above along with the mintapi code, a simple for loop was enough to fetch 500 records.
  • For storage, a simple json.dump is sufficient to cache the transactions on the local drive.
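
The paging loop and the cache described in the notes could look roughly like the sketch below (written for Python 3). The names fetch_transactions, save_cache and load_cache are hypothetical, and the sketch assumes that offsets advance in steps of 100 and that an empty page means there is nothing more to fetch. The page fetcher is passed in as a callable so the loop can be tried without a Mint login.

```python
import json


def fetch_transactions(fetch_page, pages=5, page_size=100):
    """Collect up to `pages` pages by advancing the offset one page at a time."""
    transactions = []
    for page in range(pages):
        batch = fetch_page(offset=page * page_size)
        if not batch:
            break  # assumed: an empty page means no more records
        transactions.extend(batch)
    return transactions


def save_cache(transactions, path):
    """Cache the raw transaction list on the local drive as JSON."""
    with open(path, "w") as fh:
        json.dump(transactions, fh)


def load_cache(path):
    """Read a previously cached transaction list back."""
    with open(path) as fh:
        return json.load(fh)
```

With the function above, the fetcher would be something like lambda offset: search_transactions(token, offset=offset), which with the defaults asks for 5 pages of 100 records.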

PyTesser

This is straight from: https://code.google.com/p/pytesser/
The only thing I changed was to use the latest Tesseract executable, which happens to be a bit better at OCR.

Learn from CAPTCHA

I found this post, which helped a lot with enhancing the images to prepare them for Tesseract: http://www.boyter.org/decoding-captchas/

PIL

PIL is critical for changing the characteristics of an image so that Tesseract has a better chance at recognition, or for rotating the image and so on.

OCR process

OCR is tricky at best, so the current approach is to:
  • Start with the image of the receipt as is.
  • Run Tesseract.
  • Search for transaction information within the text.
  • If the amount, vendor, date etc. match a single transaction, exit.
  • If not, adjust the image and run through the steps above again.
  • If none of the adjustments produces a single match, return the 'most likely' candidates.

Image adjustment variants

Captcha 'simple':
Scales the image up to 3000 pixels wide (only if it is narrower than that) and runs a sharpen filter.

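from PIL import Image, ImageFilter  # with classic PIL: import Image, ImageFilter

def captchasimple(im):
    img=im.convert("RGBA")
    basewidth=3000
    # Only scale up: an image that is already wide enough keeps its size
    if basewidth> img.size[0]:
        wpercent = basewidth/float(img.size[0])
        x=int(float(img.size[1])*float(wpercent))  # new height, same aspect ratio
        big = img.resize((basewidth, x), Image.LINEAR)
    else:
        big=img
    big=big.convert("")  # an empty mode keeps the current mode (returns a copy)
    big=big.filter(ImageFilter.SHARPEN)
    return big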


CAPTCHA, converting the image to black and white using a 'threshold':
This works for phone camera images of receipts. A low-contrast image can yield better results after this manipulation.

def captcha(im,threshold=100):
    img=im.convert("RGBA")
    pixdata=img.load()
    basewidth=2000
    # Threshold to pure black and white: a pixel whose red channel is below
    # `threshold` becomes black, everything else becomes white.
    # The original repeated this pass for the green and blue channels, but
    # after the first pass every channel is already 0 or 255, so for
    # 0 < threshold <= 255 those extra passes changed nothing.
    for y in xrange(img.size[1]):
        for x in xrange(img.size[0]):
            if pixdata[x, y][0] < threshold:
                pixdata[x, y] = (0, 0, 0, 255)
            else:
                pixdata[x, y] = (255, 255, 255, 255)

    # Scale to a fixed width, keeping the aspect ratio, then sharpen
    wpercent = basewidth/float(img.size[0])
    x=int(float(img.size[1])*float(wpercent))
    big = img.resize((basewidth, x), Image.ANTIALIAS)
    big=big.convert("")
    big=big.filter(ImageFilter.SHARPEN)
    return big

Getting incremental results:
I noticed that each adjustment may work well for some portions of an image but not for others. Therefore I accumulated the text results rather than replacing them with each new result set, which gave more opportunities for a match.

def getTextFromImg(imgfn):
    imfh=Image.open(imgfn)
    im=imfh
    ocrts=[]
    ocrts.append(image_to_string(im,False))
    yield ocrts
    im=imfh.filter(ImageFilter.SHARPEN)
    ocrts.append(image_to_string(im,False))
    yield ocrts
    im=captchasimple(imfh)
    ocrts.append(image_to_string(im,False))
    yield ocrts
    for t in [100,102,103,104,105,106,110]:
        im=captcha(imfh,t)
        ocrts.append(image_to_string(im,False))
        yield ocrts

This function runs through each image adjustment scenario and yields the accumulated list of OCR'ed strings after each one.
The final captcha loop varies the threshold as a last-ditch brute-force attempt. I noticed that there are certain sweet spots, at least in my receipt image captures, and these threshold values let them be found so that the OCR yields better results.
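
The accumulation pattern can be shown in isolation with stand-in functions (Python 3). The names accumulate_ocr, variants and ocr are hypothetical, and the image and OCR call are replaced by plain strings so the sketch runs without PIL or Tesseract. One deliberate difference: getTextFromImg yields the same list object every time, while this sketch yields a copy so that earlier snapshots are not changed by later appends.

```python
def accumulate_ocr(image, variants, ocr):
    """Yield the growing list of OCR texts after each image variant is tried."""
    texts = []
    for adjust in variants:
        texts.append(ocr(adjust(image)))
        yield list(texts)  # copy, so each snapshot stays as it was


if __name__ == "__main__":
    # Stand-ins for the real image adjustments and for image_to_string
    variants = [lambda im: im, lambda im: im + "+sharpen"]
    fake_ocr = lambda im: "text(%s)" % im
    for snapshot in accumulate_ocr("receipt", variants, fake_ocr):
        print(len(snapshot), snapshot[-1])
```

A caller can stop iterating as soon as one snapshot produces a single match, so the more expensive variants are only run when the cheaper ones fail.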

Matching with MINT:

In my proof of concept, I built a class to hold each transaction, with a search function that takes the OCR string and returns a confidence number (sort of like a search engine and an aggregator of results).
Three items are checked:
  • Merchant name
  • Date
  • Amount

class Mint_Transaction():
    """
{"isTransfer":false
"isEdited":false
"isLinkedToRule":false
"isPercent":false
"isMatched":false
"odate":"Jun 11"
"isSpending":true
"isFirstDate":true
"isCheck":false
"date":"Jun 11"
"mcategory":"Groceries"
"isDuplicate":false
"id":433821332
"amount":"$17.55"
"isPending":true
"fi":"Chase Bank"
"note":""
"isAfterFiCreationTime":true
"txnType":0
"ruleCategory":""
"merchant":"Harris Teeter"
"omerchant":"HARRIS TEETER ####"
"categoryId":701
"labels":[]
"manualType":0
"category":"Groceries"
"ruleCategoryId":0
"numberMatchedByRule":295
"hasAttachments":false
"account":"CREDIT CARD"
"mmerchant":"Harris Teeter"
"ruleMerchant":""
"isChild":false
"isDebit":true
"userCategoryId":null}

    """
    EXCLUDESRCH=["isTransfer",
    "isEdited",
    "isLinkedToRule",
    "isPercent",
    "isMatched",
    "odate",
    "isSpending",
    "isFirstDate",
    "isCheck",
    "date",
    "mcategory",
    "isDuplicate",
    "id",
    "amount",
    "isPending",
    "fi",
    "isAfterFiCreationTime",
    "txnType",
    "ruleCategory",
    "categoryId",
    "manualType",
    "category",
    "ruleCategoryId",
    "numberMatchedByRule",
    "hasAttachments",
    "account",
    "ruleMerchant",
    "isChild",
    "isDebit",
    "userCategoryId"]


    datefields=["odate","date",]
    json=None
    strRep=None
    pat=re.compile("\W+")
    def __init__(self,jsonitem):
        self.json=jsonitem
        self.strRep=""
        for key,val in jsonitem.iteritems():
            if val != None:
                if key in self.datefields:
                    try:
                        val=parser.parse(val)
                        self.strRep+=" "+val.strftime("%Y %m %d")
                    except:
                        sys.stderr.write("could not parse date:%s\n"%val)
                        self.strRep+=" "+str(val).lower()
                elif not key in self.EXCLUDESRCH:
                    self.strRep+=" "+str(val).lower()
                setattr(self, key, val) 
    def getDataRow(self):
        
        amount=self.amount.decode("utf-8")
        amount=re.sub("[ $,]", "",amount)
        return [ self.id,self.date.strftime("%Y"),
                self.date.strftime("%m"),
                self.date.strftime("%d"),
                self.merchant,amount]
    
    def getDatePatterns(self):
        pat=[]
        y= ["%Y","%y"]
        m=["%m","%b"]
        d="%d"
        dt=self.date
        dtlist=[self.date-timedelta(days=x) for x in range(-1,2) ]
        
        for yitem in y:
            for mitem in m:
                pms=[
                        [yitem,mitem,d],
                        [mitem,d,yitem], 
                        [d,mitem,yitem], 
                     ]
                for iteritem in pms:
                    strf="[^\d]*".join( iteritem)
                    for dtitem in dtlist:
                        yield dtitem.strftime(strf)
                    
                                
    def getMatchConfidence(self,search):    
        search=search.decode("utf-8")
        line = self.pat.sub("",search)
        line=line.lower()
        words= self.pat.split(search)
        confidence=0

        for word in self.merchant.split():
            if len(word) < 4:
                continue
            word=word.lower()
            try:
                num=float(word)
                # don't check for numbers in search 
                # possibilities of match is too high for fragmented numbers
                continue
            except:
                pass 
            if line.find(word)>=0:
                sys.stderr.write("Found word in search %s in %s\n"%(word,self.merchant))
                confidence+=2
        for dtst in self.getDatePatterns():
            match=re.search(dtst, search)
            if match!=None:
                sys.stderr.write("Found matching date in search %s in %s\n"%(dtst,self.merchant))
                confidence+=1
                break
        amount=self.amount.decode("utf-8")
        amount=re.sub("[ $,]", "",amount)
        if float(amount)>0:
            if search.find(amount)>=0:
                confidence+=2
                sys.stderr.write("Found matching amount in search %s in %s\n"%(amount,self.merchant))
        return confidence
    def __repr__(self):
        return "%s %s %s"%(self.merchant,self.date,self.amount) 

And the calling 'aggregator':

def getSearchMatches(ocrtext,translist,min=0):
    confmap={}
    cur_max=0
    for mt in translist:
        assert isinstance(mt, Mint_Transaction)
        if mt.isDebit == False or mt.isTransfer == True:
            continue
        conf=mt.getMatchConfidence(ocrtext)
        if conf < min:
            continue
        confmap[mt.merchant+str(mt.id)]=[mt,conf]
        if cur_max < conf:
            cur_max = conf
    if len(confmap)==0:
        return (0,[])
    items=sorted(confmap.iteritems(),key=lambda x: x[1][1],reverse=True)
    
    retval=[]
    if cur_max < 2 or cur_max < min:
        return (cur_max,retval)
    for key,val in items:
        mt=val[0]
        conf=val[1]
        if float(conf)/float(cur_max) > 0.90:
            #print  conf,mt.merchant,mt.amount,mt.date
            retval.append(mt)
    return (cur_max,retval)

And finally a main method to orchestrate the rest:

def main(img,tl):
    if img.endswith("jpg") or img.endswith("JPG"):
        
        with open("ocr.txt","w") as out:
            out.write("")
        prevmax=0    
        for ocrsamples in getTextFromImg(img):
            with open("ocr.txt","a") as out:
                out.write("--------------\n")
                out.write(ocrsamples[-1])
            (max,res)= mint.getSearchMatches("".join(ocrsamples), tl,prevmax)
            if max > prevmax:
                prevmax=max
            else:
                continue
            if len(res)==1:
                print "-------",max,res[0].merchant, res[0].amount, res[0].date
                return res
            if res != None:
                for resitem in res:
                    print "====",max,resitem.merchant, resitem.amount
        return res

Here img is the file name of the image to be OCR'ed, and 'tl' is the transaction list retrieved from Mint using the "search_transactions" function above, with each item wrapped in a Mint_Transaction (getSearchMatches asserts that type).


Results

OCR text from the receipt image:

I Date

Time 05:03:25
Distance 14.00mi

FARE.. . . . . .....$ 32.95
EXTRAS.. .... 1.00

S
TIP............$ 6.79
S

TOTAL.......... 40.74

Visa
xxxx xxxx xxxx 0148

‘!ID 445100003992


The analysis of this text shows:

Tesseract Open Source OCR Engine v3.02 with Leptonica
Found matching amount in search 40.74 in Red Top Cab
Found matching date in search 03[^\d]*25[^\d]*14 in Washington Gas
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
------- 2 Red Top Cab $40.74 2014-04-09 00:00:00
379861323


Where the last number is the ID of the Mint Transaction related to the cab.
Note that the date search is not well tuned yet: the pattern for 3/25/14 is actually matching the time and distance text, 05:03:25 followed by Distance 14.00.
That can be adjusted, but it sounds like a 'brewing' item.
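
One possible tightening, offered as a sketch rather than as the fix that was eventually used (Python 3): only accept a small set of separators between the date parts, and require that the match is not glued to other digits. The names SEP, strict_date_patterns and matches_date are hypothetical, the separator set is an assumption, and the one-day tolerance from getDatePatterns is left out to keep the example short. The %b month name assumes an English locale.

```python
import re
from datetime import date

# Assumed separators: slash, dash, dot, space, comma (one or two of them).
# Colons and newlines are excluded on purpose, so "05:03:25" cannot match.
SEP = r"[/\-. ,]{1,2}"


def strict_date_patterns(d):
    """Yield compiled patterns for common orderings of year, month and day."""
    years = [d.strftime("%Y"), d.strftime("%y")]
    months = [d.strftime("%m"), d.strftime("%b")]
    day = d.strftime("%d")
    for y in years:
        for m in months:
            for parts in ([y, m, day], [m, day, y], [day, m, y]):
                body = SEP.join(re.escape(p) for p in parts)
                # The lookarounds reject matches embedded in longer numbers
                yield re.compile(r"(?<!\d)" + body + r"(?!\d)", re.IGNORECASE)


def matches_date(text, d):
    """True if any strict pattern for date `d` occurs in `text`."""
    return any(p.search(text) for p in strict_date_patterns(d))
```

With these patterns the time and distance lines from the receipt above no longer count as 3/25/14, while a literal 03/25/14 or Mar 25, 2014 still does. The price is that dates split across lines by the OCR are missed.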