Wednesday, September 10, 2014

Upgrading Portable Python and virtualenv

Steps to upgrade your virtualenvs when you upgrade portable python (and perhaps any other python).

  1. Add this Portable Python to your path variable. I am using msys to give me a Linux-like environment, so I do:
PYTHON_HOME="/PortableApps/PortablePython2.7.6.1/App"

export PATH="$PYTHON_HOME:$PYTHON_HOME/Scripts:$PATH"
  2. I like having pip in the environment, so get pip:
    • easy_install pip
  3. Get virtualenv for this Portable Python:
    • pip install virtualenv
  4. Now redo the virtual env with this virtualenv in the new Python. Note that "--system-site-packages" is optional and depends on what you used previously to build this environment.
    • virtualenv --system-site-packages <path to existing virtual environment>

Friday, June 20, 2014

Freecommander multi-rename: using matched sub groups

Quick help for multi rename function on freecommander

Choosing "Regular expression" allows you programming language like regular expressions with near similar syntax - and as far as I can see it is the same.

The pattern 'group' logic is also present but a bit hidden. It is simply a dollar '$' prefixed index.

So, renaming my mp3 files for multiple discs that were originally ripped as 01_01.mp3 through 04_01.mp3 etc. to Disc 01 Track 01.mp3 would look like the example below.
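For example, a search expression and name mask along these lines should do it (illustrative only - it assumes two-digit disc and track numbers):

Search for: (\d\d)_(\d\d)
New name:   Disc $1 Track $2

Here $1 and $2 refer to the first and second matched groups, so 01_01.mp3 becomes Disc 01 Track 01.mp3.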

Tuesday, June 17, 2014

Mint api and OCR using PyTesseract = expense application with smarts!

Raison d'etre for a Mint + OCR


A lot of people use Mint.com for keeping track of accounts.
A lot of people have expenses they would like to scan and submit.
Some of us have to find a way to reconcile the receipts and not make mistakes such as picking the same recurring transaction etc.
If we could look up Mint.com using an API and download the transactions,
and OCR each receipt and match it against the Mint transactions,
we could reasonably easily build an expense app.

Out of scope

We are not trying to create an app. This is just for proving it works.

Mint API

The mintapi code (used further below) gives a good idea of how to get at the Mint data.
I added a way to get transactions out of Mint, as shown below:


import json

# 'session' is assumed to be an authenticated requests session
# (e.g. the one established by the mintapi login code).
def search_transactions(token,query="",offset=0):
    request_id = "42" # magic number? random number?
    rnd="410249975266"
    data = {
        "acctChanged":"T",
        "filterType": "cash", 
        "offset":offset,
        "comparableType":8,
        "rnd":rnd,
        "query":query,
        "task": "transactions"
        }
    murl="https://wwws.mint.com/app/getJsonData.xevent?" 
    response= session.get(murl, params=data)
    jres= json.loads(response.text)
    for jresitem in jres["set"][0]["data"]:
        Mint_Transaction(jresitem)
    return jres["set"][0]["data"]

Notes:

  • This retrieves 100 transactions per fetch.
  • You can get clever with random-number generation, but a fixed random number appears to work just fine.
  • Query is a string; I noticed it is not a multiple-word search but an exact-phrase search.
  • offset lets you specify a number, and the call will fetch the previous 100 records from that offset.
  • So, using the above along with the mintapi code, a simple for loop was able to fetch 500 records (a sketch follows this list).
  • For storage, a simple json.dump is sufficient to cache the transactions on the local drive.
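A minimal sketch of that fetch-and-cache loop, assuming the search_transactions function above and a hypothetical cache file name:

import json

def fetch_and_cache(token, pages=5, cache_file="mint_transactions.json"):
    # each call returns up to 100 transactions, so 5 pages is roughly 500 records
    all_txns = []
    for page in range(pages):
        all_txns.extend(search_transactions(token, offset=page * 100))
    # cache the raw JSON locally so Mint is not hit on every run
    with open(cache_file, "w") as fh:
        json.dump(all_txns, fh)
    return all_txns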

PyTesser

This is straight from : https://code.google.com/p/pytesser/ 
The only thing I changed is to use the latest Tesseract executable, which happens to be a bit better at OCR.

Learn from CAPTCHA

Found this, which helped a bunch in enhancing the images to prep them for tesseract: http://www.boyter.org/decoding-captchas/

PIL

This is critical for changing the characteristics of an image so tesseract has a better chance at recognition, or for rotating it, and so on.

OCR process

OCR is tricky at best. So the current approach is to:
  • start with the image of the receipt as is
  • run tesseract
  • search for transaction information within the text
  • if the amount, vendor, date etc. match a single entity, exit
  • if not, adjust the image and run through the above again
  • if none of the adjustments have produced a single outcome, return the 'most likely' candidates

Image adjustment variants

Captcha 'simple':
Increases the size of image to 3000 pixels wide and runs a sharpen filter.

from PIL import Image, ImageFilter

def captchasimple(im):
    img=im.convert("RGBA")
    basewidth=3000
    if basewidth> img.size[0]:
        # scale up to basewidth, preserving the aspect ratio
        wpercent = basewidth/float(img.size[0])
        x=int(float(img.size[1])*float(wpercent))
        big = img.resize((basewidth, x), Image.LINEAR)
    else:
        big=img
    big=big.convert("L")  # grayscale for OCR
    big=big.filter(ImageFilter.SHARPEN)
    return big


A CAPTCHA-style conversion of the image to b/w using a 'threshold':
This works for phone-camera images of receipts; a low-contrast image can yield better results with this manipulation.

def captcha(im,threshold=100):
    img=im.convert("RGBA")
    pixdata=img.load()
    basewidth=2000
    # Make the letters bolder for easier recognition:
    # threshold each color channel in turn to force pure black/white
    # (after the first pass the image is already binary).

    for y in xrange(img.size[1]):
        for x in xrange(img.size[0]):
            if pixdata[x, y][0] < threshold:
                pixdata[x, y] = (0, 0, 0, 255)
            else:
                pixdata[x, y] = (255, 255, 255, 255)
                
                
    for y in xrange(img.size[1]):
        for x in xrange(img.size[0]):
            if pixdata[x, y][1] < threshold:
                pixdata[x, y] = (0, 0, 0, 255)
            else:
                pixdata[x, y] = (255, 255, 255, 255)
                
    for y in xrange(img.size[1]):
        for x in xrange(img.size[0]):
            if pixdata[x, y][2] < threshold:
                pixdata[x, y] = (0, 0, 0, 255)
            else:
                pixdata[x, y] = (255, 255, 255, 255)
                
    wpercent = basewidth/float(img.size[0])
    x=int(float(img.size[1])*float(wpercent))
    big = img.resize((basewidth, x), Image.ANTIALIAS)
    big=big.convert("L")  # grayscale for OCR
    big=big.filter(ImageFilter.SHARPEN)
    return big

Getting incremental results:
I noted that each run on each image variant may be favorable for some portions of the image but not others. Therefore I accrued the text results rather than replacing them with each new result set. This gave an opportunity for better matches.

from pytesser import image_to_string

def getTextFromImg(imgfn):
    imfh=Image.open(imgfn)
    im=imfh
    ocrts=[]
    ocrts.append(image_to_string(im,False))
    yield ocrts
    im=imfh.filter(ImageFilter.SHARPEN)
    ocrts.append(image_to_string(im,False))
    yield ocrts
    im=captchasimple(imfh)
    ocrts.append(image_to_string(im,False))
    yield ocrts
    for t in [100,102,103,104,105,106,110]:
        im=captcha(imfh,t)
        ocrts.append(image_to_string(im,False))
        yield ocrts

This function runs through each image-adjustment scenario and offers up an iterator over the accumulated list of OCR'ed strings.
The last captcha loop manipulates the threshold as a last-ditch brute force. I noticed that there are certain sweet spots - at least in my receipt image captures - and these threshold values allowed the sweet spots to be discovered and made the OCR yield better results.

Matching with MINT:

In my proof of concept, I built a class to hold each transaction, with a search function that takes an OCR string and returns a confidence number (sort of like a search engine plus an aggregator of results).
Three items are checked:
  • Merchant name
  • Date
  • Amount

import re
import sys
from datetime import timedelta

from dateutil import parser

class Mint_Transaction():
    """
{"isTransfer":false
"isEdited":false
"isLinkedToRule":false
"isPercent":false
"isMatched":false
"odate":"Jun 11"
"isSpending":true
"isFirstDate":true
"isCheck":false
"date":"Jun 11"
"mcategory":"Groceries"
"isDuplicate":false
"id":433821332
"amount":"$17.55"
"isPending":true
"fi":"Chase Bank"
"note":""
"isAfterFiCreationTime":true
"txnType":0
"ruleCategory":""
"merchant":"Harris Teeter"
"omerchant":"HARRIS TEETER ####"
"categoryId":701
"labels":[]
"manualType":0
"category":"Groceries"
"ruleCategoryId":0
"numberMatchedByRule":295
"hasAttachments":false
"account":"CREDIT CARD"
"mmerchant":"Harris Teeter"
"ruleMerchant":""
"isChild":false
"isDebit":true
"userCategoryId":null}

    """
    EXCLUDESRCH=["isTransfer",
    "isEdited",
    "isLinkedToRule",
    "isPercent",
    "isMatched",
    "odate",
    "isSpending",
    "isFirstDate",
    "isCheck",
    "date",
    "mcategory",
    "isDuplicate",
    "id",
    "amount",
    "isPending",
    "fi",
    "isAfterFiCreationTime",
    "txnType",
    "ruleCategory",
    "categoryId",
    "manualType",
    "category",
    "ruleCategoryId",
    "numberMatchedByRule",
    "hasAttachments",
    "account",
    "ruleMerchant",
    "isChild",
    "isDebit",
    "userCategoryId"]


    datefields=["odate","date",]
    json=None
    strRep=None
    pat=re.compile(r"\W+")
    def __init__(self,jsonitem):
        self.json=jsonitem
        self.strRep=""
        for key,val in jsonitem.iteritems():
            if val != None:
                if key in self.datefields:
                    try:
                        val=parser.parse(val)
                        self.strRep+=" "+val.strftime("%Y %M %d")
                    except:
                        sys.stderr.write("could not parse date:%s\n"%val)
                        self.strRep+=" "+str(val).lower()
                elif not key in self.EXCLUDESRCH:
                    self.strRep+=" "+str(val).lower()
                setattr(self, key, val) 
    def getDataRow(self):
        
        amount=self.amount.decode("utf-8")
        amount=re.sub("[ $,]", "",amount)
        return [ self.id,self.date.strftime("%Y"),
                self.date.strftime("%m"),
                self.date.strftime("%d"),
                self.merchant,amount]
    
    def getDatePatterns(self):
        pat=[]
        y= ["%Y","%y"]
        m=["%m","%b"]
        d="%d"
        dt=self.date
        dtlist=[self.date-timedelta(days=x) for x in range(-1,2) ]
        
        for yitem in y:
            for mitem in m:
                pms=[
                        [yitem,mitem,d],
                        [mitem,d,yitem], 
                        [d,mitem,yitem], 
                     ]
                for iteritem in pms:
                    strf="[^\d]*".join( iteritem)
                    for dtitem in dtlist:
                        yield dtitem.strftime(strf)
                    
                                
    def getMatchConfidence(self,search):    
        search=search.decode("utf-8")
        line = self.pat.sub("",search)
        line=line.lower()
        words= self.pat.split(search)
        confidence=0

        for word in self.merchant.split():
            if len(word) < 4:
                continue
            word=word.lower()
            try:
                num=float(word)
                # don't check for numbers in search 
                # possibilities of match is too high for fragmented numbers
                continue
            except:
                pass 
            if line.find(word)>=0:
                sys.stderr.write("Found word in search %s in %s\n"%(word,self.merchant))
                confidence+=2
        for dtst in self.getDatePatterns():
            match=re.search(dtst, search)
            if match!=None:
                sys.stderr.write("Found matching date in search %s in %s\n"%(dtst,self.merchant))
                confidence+=1
                break
        amount=self.amount.decode("utf-8")
        amount=re.sub("[ $,]", "",amount)
        if float(amount)>0:
            if search.find(amount)>=0:
                confidence+=2
                sys.stderr.write("Found matching amount in search %s in %s\n"%(amount,self.merchant))
        return confidence
    def __repr__(self):
        return "%s %s %s"%(self.merchant,self.date,self.amount) 

And the calling 'aggregator':

def getSearchMatches(ocrtext,translist,min=0):
    confmap={}
    cur_max=0
    for mt in translist:
        assert isinstance(mt, Mint_Transaction)
        if mt.isDebit == False or mt.isTransfer == True:
            continue
        conf=mt.getMatchConfidence(ocrtext)
        if conf < min:
            continue
        confmap[mt.merchant+str(mt.id)]=[mt,conf]
        if cur_max < conf:
            cur_max = conf
    if len(confmap)==0:
        return (0,[])
    items=sorted(confmap.iteritems(),key=lambda x: x[1][1],reverse=True)
    
    retval=[]
    if cur_max < 2 or cur_max < min:
        return (cur_max,retval)
    for key,val in items:
        mt=val[0]
        conf=val[1]
        if float(conf)/float(cur_max) > 0.90:
            #print  conf,mt.merchant,mt.amount,mt.date
            retval.append(mt)
    return (cur_max,retval)

And finally, a main method to orchestrate the rest:

def main(img,tl):
    res=None
    if img.endswith("jpg") or img.endswith("JPG"):
        # start with a clean OCR log for this image
        with open("ocr.txt","w") as out:
            out.write("")
        prevmax=0
        for ocrsamples in getTextFromImg(img):
            with open("ocr.txt","a") as out:
                out.write("--------------\n")
                out.write(ocrsamples[-1])
            (max,res)= mint.getSearchMatches("".join(ocrsamples), tl,prevmax)
            if max > prevmax:
                prevmax=max
            else:
                continue
            if len(res)==1:
                print "-------",max,res[0].merchant, res[0].amount, res[0].date
                return res
            if res != None:
                for resitem in res:
                    print "====",max,resitem.merchant, resitem.amount
        return res

Here, img is the image to be OCR'ed, and 'tl' is the list of Mint_Transaction objects built from the JSON transactions retrieved from Mint using the "search_transactions" function above.
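A minimal driver sketch tying the pieces together; it assumes the transaction code above lives in a module named mint (as the mint.getSearchMatches call suggests) and that the transactions were cached with the hypothetical fetch_and_cache shown earlier:

import json
import sys

import mint  # assumed module holding Mint_Transaction and getSearchMatches

if __name__ == "__main__":
    # sys.argv[1] is the receipt image, e.g. receipt.jpg
    with open("mint_transactions.json") as fh:
        tl = [mint.Mint_Transaction(item) for item in json.load(fh)]
    main(sys.argv[1], tl)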


Results

OCR on the receipt image produces the following text:

I Date

Time 05:03:25
Distance 14.00mi

FARE.. . . . . .....$ 32.95
EXTRAS.. .... 1.00

S
TIP............$ 6.79
S

TOTAL.......... 40.74

Visa
xxxx xxxx xxxx 0148

‘!ID 445100003992


The analysis of this text shows:

Tesseract Open Source OCR Engine v3.02 with Leptonica
Found matching amount in search 40.74 in Red Top Cab
Found matching date in search 03[^\d]*25[^\d]*14 in Washington Gas
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
Found matching date in search 03[^\d]*25[^\d]*14 in Amazon
------- 2 Red Top Cab $40.74 2014-04-09 00:00:00
379861323


Where the last number is the ID of the Mint Transaction related to the cab.
Note that the date search is not quite well tuned: 3/25/14 is actually matching against "05:03:25 Distance 14.00".
Well, that can be adjusted, but it sounds like a 'brewing' item.
 


Wednesday, May 28, 2014

Large LDIF dn record indexer and record getter

I had to retrieve individual records by dn from a 500MB LDIF file, and parsing it with the traditional Python ldif library's LDIFParser took around 90 seconds or so. The issue is that this is a 'sparse' lookup problem: I didn't really need to know about 99% of the records, just the 100-1000 records that could be anywhere in the LDIF file.

I figured it would be preferable to index the entire LDIF as quickly as possible and then use a getDN function to retrieve entries individually. Of course you can get pretty clever with this and extend it to do some more interesting indexing, etc.

Stats of the LDIF file I used:
    size: 448,355kB
    Lines: 16,337,077
    Entries:  389,776
And it was able to index the file in:
    Time to index: 14,763ms
    Time to retrieve a 1000 entries using the getDN: 1,855ms
    Total memory: ~45MB
Using the LDIFParser and storing dns in a map for easy retrieval:
    Time to index: 93,628ms
    Time to retrieve a 1000 entries using the map: 1,432ms
    Total memory: ~1.5GB

import base64
import sys
import StringIO
from collections import OrderedDict

import ldif

class largeLDIF(ldif.LDIFParser):
    """indexes through large ldifs to only store dn vs file.seek number as a map
    retrieval is done on demand afterwards"""
    dnMap=OrderedDict()
    count=0
    def __init__(self,input_file):
        if isinstance(input_file, str):
            self.ldif_if=open(input_file,'rb')
        elif hasattr(input_file, "read"):
            self.ldif_if=input_file
        ldif.LDIFParser.__init__(self,self.ldif_if)
        self.dnMap={}
    
    def parse(self):
        """override parse to only look for dns to generate a map"""
        loc=self.ldif_if.tell()
        self._line  = self.ldif_if.readline()
        while self._line:
            colon_pos=self._line.find(':')
            if self._line[0]=='#' or colon_pos<0:
                loc=self.ldif_if.tell()
                self._line  = self.ldif_if.readline()
                continue
            
            attr_type = self._line[0:colon_pos]
            
            if attr_type == 'dn':
                # check if line is folded, do this only if a dn was found
                unfolded_lines=[self._stripLineSep(self._line)]
                self._line=self.ldif_if.readline()
                # check for fold and don't process if not needed
                
                if self._line and self._line[0]==' ':
                    while self._line and self._line[0]==' ':
                        unfolded_lines.append(self._stripLineSep(self._line))
                        self._line=self.ldif_if.readline()
                        
                    unfoldeddn=''.join(unfolded_lines)
                else:
                    unfoldeddn=unfolded_lines[0]
                value_spec = unfoldeddn[colon_pos:colon_pos+2]
                if value_spec=='::':
                    # attribute value needs base64-decoding
                    attr_value = base64.decodestring(unfoldeddn[colon_pos+2:])
                elif value_spec==':\r\n' or value_spec=='\n':
                    # no use if dn is empty
                    continue
                else:
                    attr_value = unfoldeddn[colon_pos+2:].lstrip()    
                        
                self.dnMap[attr_value]=loc
                self.count+=1
                if self.count % 10000==0:
                    sys.stderr.write("%s - %s\n"%(self.count,attr_value))
                
                # now read until an empty line or end of file is found
                # since that would indicate end of record
                while self._line and not ( self._line == '\r\n' or self._line=='\n') :
                    self._line = self.ldif_if.readline()
                
            loc=self.ldif_if.tell()
            self._line=self.ldif_if.readline()
                
                    
    def getDN(self,dn):
        if self.dnMap.has_key(dn):
            self.ldif_if.seek(self.dnMap[dn])
            #read the entry block into a stringIO object
            ldifStr=StringIO.StringIO()
            line=self.ldif_if.readline()
            while line:
                if not line or line == '\r\n' or line=='\n':
                    ldifStr.write(line)
                    break
                else:
                    ldifStr.write(line)
                line=self.ldif_if.readline()
            ldifStr.seek(0)
            # record is read in
            rec=ldif.LDIFRecordList(ldifStr)
            rec.parse()
            return rec.all_records[0]
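Usage is straightforward; a minimal sketch (the file name and dn are hypothetical):

idx = largeLDIF("export.ldif")
idx.parse()   # builds the dn -> file offset map
entry = idx.getDN("uid=jdoe,ou=people,dc=example,dc=com")
print entry   # a (dn, attributes) tuple, as returned by ldif.LDIFRecordList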

So, I couldn't help it... I tested a simple file read to see the fastest it could get through the file without doing any operations - just a plain read of a 500MB file.
Well, that turned out to be around the 2-second mark.
Next I tried adding one or two if-then-elses to see how they affected the performance, and I got to around 7 seconds.

So I rewrote the class with the fewest branches and operational steps I could, as shown below.
An important thing to note is that using the for loop directly over the file object improves the performance significantly.
Also, to optimize checking for 'dn', I found that startswith takes an enormous amount of time in comparison to a slice comparison (see this post: http://stackoverflow.com/questions/13270888/why-is-startswith-slower-than-slicing).
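A quick way to check the startswith-vs-slice difference on your own machine (a minimal timeit sketch; absolute numbers will vary):

import timeit

setup = 'line = "dn: cn=example,dc=test"'
print timeit.timeit('line.startswith("dn:")', setup=setup)
print timeit.timeit('line[:3] == "dn:"', setup=setup)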


class largeLDIF(ldif.LDIFParser):
    """indexes through large ldifs to only store dn vs line number as a map
    retrieval is done on demand afterwards"""
    dnMap=OrderedDict()
    count=0
    _loc=0L
    _newloc=0L
    def __init__(self,input_file):
        if isinstance(input_file, str):
            self.ldif_if=open(input_file,'rb')
        elif hasattr(input_file, "read"):
            self.ldif_if=input_file
        ldif.LDIFParser.__init__(self,self.ldif_if)
        self.dnMap={}
        
    def parse(self):
        """override parse to only look for dns to generate a map"""
        loc=0L
        newloc=0L
        # prime this
        unfolded_lines=None
        unfoldcheck=False
        dnloc=0L
        readuntilEndofRec =False
        colon_pos=2
        for self._line in self.ldif_if:
            loc=newloc
            newloc+=len(self._line)
            if readuntilEndofRec:
                # skip lines until a blank line marks the end of the record
                if self._line == "\r\n" or self._line == "\n":
                    readuntilEndofRec=False
                continue
                    
            if unfoldcheck:
                if self._line[0]==' ':
                    # folded line detected
                    unfolded_lines.append(self._stripLineSep(self._line[1:]))
                    continue
                else:
                    unfoldeddn=''.join(unfolded_lines)
                    value_spec = unfoldeddn[colon_pos:colon_pos+2]
                    if value_spec=='::':
                        # attribute value needs base64-decoding
                        attr_value = base64.decodestring(unfoldeddn[colon_pos+2:])
                    elif value_spec==':\r\n' or value_spec=='\n':
                        # no use if dn is empty
                        unfoldcheck=False
                        continue
                    else:
                        attr_value = unfoldeddn[colon_pos+2:].lstrip()    
                    
                    self.dnMap[attr_value.lower()]=dnloc
                    if self.count % 10000==0:
                        sys.stderr.write("%s - %s\n"%(self.count,attr_value))
                    unfoldcheck=False
                    readuntilEndofRec=True
                    continue
                
            attr_spec=self._line[:3]    
        
            if attr_spec=='dn:':
                unfolded_lines=[self._stripLineSep(self._line)]
                unfoldcheck=True
                dnloc=loc
                self.count+=1
            
    def getDN(self,dn):
        if self.dnMap.has_key(dn.lower()):
            self.ldif_if.seek(self.dnMap[dn.lower()])
            #read the entry block into a stringIO object
            ldifStr=StringIO.StringIO()
            for line in self.ldif_if:
                ldifStr.write(line)
                if line == '\r\n' or line=='\n':
                    break
            ldifStr.seek(0)
            # record is read in
            rec=ldif.LDIFRecordList(ldifStr)
            rec.parse()
            return rec.all_records[0]
            
Well, here are the results:

    Time to index: 10,515ms
    Time to retrieve a 1000 entries using the getDN: 342ms
    Total memory: ~50MB 
So we shaved another 25% off the time and improved performance!
Also, it is interesting to note that I get solid disk IO at about 50MB/s during the initial indexing step, which makes sense: rough math gives 50MB/s x 10 seconds = 500MB, so it adds up. I don't believe this is a disk-IO bottleneck, however, since a raw read using Python without checking anything is around the 2-second mark; with buffering etc. the perceived disk speed is around 250MB/s (and I don't have an SSD in this test, so this has to be some OS buffering or similar).

 

Tuesday, May 27, 2014

Yahoo mail advertisement-pane (pain) remover (greasemonkey)

This Greasemonkey script removes the advert pane on the right-hand side in Yahoo Mail. You can now reclaim the lost email real estate as well as reduce the ads shown - which happens to be a premium feature. Thanks, Greasemonkey.

https://openuserjs.org/scripts/venkman69/Yahoo_Mail_New_Ad_Remover


Redfin and Walkscore greasemonkey script

Here is a Redfin + Walkscore Greasemonkey script that adds a 'Show walkability' link which you DON'T have to click - a mouse hover shows the Walkscore page. This is mostly all I need: a quick check and carry on, instead of launching a whole new page, going to the page, closing the page, and all that unnecessary effort.

And since userscripts.org is in a tailspin I have uploaded this to openuserjs.org:

https://openuserjs.org/scripts/venkman69/nnat.redfin.walkscore/RedFin_WalkScore_Popup



Wednesday, May 14, 2014

A python CSV writer decorator

A CSV writer decorator for functions returning a list (row)

I was looking for a clean way to implement a repeatable pattern where I could have a function return a value list and for this value list to be output to a CSV. The hope is that it should be adaptable to any other target format as well.

After a bit of research I came across:

And the decorator mentioned in it seemed like an excellent idea.

The additional requirement for this was to render the header (or field names) for the csv as a one time output for the top line of the CSV file.

The following assumptions are made:
  • The 'out' file handle is a global and is setup prior to calling the function
  • The 'attrNameList' is a global and is the header that should be the first line of the CSV file.
  • The function passed to the decorator returns a compatible list-type value so that the decorator can render it using csv.writer.
I am guessing this is not quite rigorously pythonic due to the expectations from the passed-function - whereas a true decorator might expect to be completely agnostic of the function it decorates. But I think it is still reasonable and reasonable is good enough for me in many cases.

For the header, a single boolean flag is injected into the function to indicate whether to write the header or not.

So here is the decorator code:

import csv

def writeCSV(func):
    def wrapcsv(*args,**kwargs):
        global out
        global attrNameList
        cw=csv.writer(out, delimiter=",", lineterminator='\n')
        # write the header exactly once, on the first call
        if not wrapcsv.header:
            cw.writerow(attrNameList)
            wrapcsv.header=True
        funcval=func(*args,**kwargs)
        cw.writerow(funcval)
        return funcval

    wrapcsv.header=False
    return wrapcsv


And here is a function it could decorate (I was parsing an LDIF file and extracting specified attributes to a CSV file):

@writeCSV
def reportAttrs(dn,ent):
    global attrNameList
    global out
    vallist=[]
    for attrName in attrNameList:
        # the dn comes from the parser, not from the entry dict
        if attrName.lower() == "dn":
            vallist.append(dn)
        elif ent.has_key(attrName.lower()):
            vallist.append(ent[attrName.lower()][0])
        else:
            vallist.append("")
    return vallist
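A hypothetical usage sketch, wiring up the globals the decorator expects and feeding it parsed LDIF entries (the attribute names, file name, and parsed_entries variable are illustrative):

attrNameList = ["dn", "cn", "mail"]
out = open("report.csv", "w")

for dn, entry in parsed_entries:   # e.g. ldif.LDIFRecordList(...).all_records
    reportAttrs(dn, entry)

out.close()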

This can be adapted to writing XML, JSON, or whatever format you need.

Wednesday, May 7, 2014

Setting up python virtualenv without sudo

Setup virtualenv and packages locally without sudo/root

I wanted to create a deployable package for production machines on which I have neither root access nor sudo. They are also firewalled, so something simple such as a pip installation is not going to work, since outbound ports are blocked as well.

I usually have ssh and scp access to the machine so I can place files on the server but anything else is pretty locked down.

So after some web research, I came across the following, which helped with scripting a solution.
A couple of notes:
  • At the end of this setup, it would have installed a virtualenv with the tools that were packaged in the setup script's directory.
  • We will have the use of pip to make it easier for subsequent installs.
  • However, pip may or may not work to download packages if the system is constrained (with either network firewalling or automatic proxies - I could not figure a way past auto proxies).
  • You have to provide any required packages as well as provide correct versions of packages.
  • This will be a fully isolated setup.
Steps:
  1. Get virtualenv.py from: https://raw.github.com/pypa/virtualenv/master/virtualenv.py
  2. Get pip from: https://pypi.python.org/pypi/pip#downloads
  3. Get setuptools from https://pypi.python.org/pypi/setuptools; the download link is towards the bottom (clicking the download button didn't work for me):
    https://pypi.python.org/packages/source/s/setuptools/setuptools-17.1.1.zip
  4. Get any additional packages in tar.gz format from pypi etc. and place them into the pypackage directory.
  5. Set up the virtual env (installing to homedir/myenv shown below):
    python $SCRIPTDIR/virtualenv.py --no-setuptools ~/myenv
  6. Activate this environment (note the leading 'dot' followed by a space is required):
    . ~/myenv/bin/activate
  7. Unpack pip to some folder (not the virtual env folder):
    tar xvzf pip-7.0.3.tar.gz
  8. Unpack setuptools:
    unzip setuptools-17.1.1.zip
  9. cd to the unpacked setuptools directory and install it into the virtual env:
    cd setuptools-17.1.1
    python setup.py build
    python setup.py install
  10. cd to the unpacked pip directory and install it into the virtual env:
    cd pip-7.0.3
    python setup.py build
    python setup.py install

From this point forward, once you activate the environment, you can run pip on a locally downloaded tar.gz file as:
pip install <path to tar.gz>
And it will install into the virtual environment.

Wednesday, April 30, 2014

Shell script to show progress of process

Shell script progress

Okay, I know, it is gimmicky. But I get tired of not being able to see a long-running process with some progress indicator, and then you end up using commands like 'watch' and running 'tail' on this, that, and the other thing.

So, this little snippet shows a spinning-character animation of | / - \ in sequence. Yes, it is ultra dumb but it keeps me amused :)
This is to be customized to your needs (this particular command counts user records being exported in an ldapsearch):
   cnt=`grep -c "dn" $ldiffile`
The echo -en line is what does the display; it shows the $cnt value and a rotating feel-good blip.
I also don't have a timeout if the process goes haywire.

Calling script looks something like:
<background someprocess> &
pid=`jobs -p %%`
progscript="grep -c \"someword\" procoutputfile"
showprog $pid $progscript
The function to show the progress is as follows:
function showprog(){
pid=$1
prog=$2

blip="|"
while ps -p $pid >/dev/null
do
    cnt=`eval $prog`
    if [ "$blip" = "|" ]
    then
        blip="/"
    elif [ "$blip" = "/" ]
    then
        blip="-"
    elif [ "$blip" = "-" ]
    then
        blip="\\"
    elif [ "$blip" = "\\" ]
    then
        blip="|"
    fi
    echo -en "\r$cnt$blip"
sleep 2
done
echo
}

Friday, April 25, 2014

How To: Python LDIF package stops parsing upon error - how to get around this

The Python LDIF package stops parsing when a dn does not match its internal regex for dns. While this strictness is great, the problem is when you just want to get the record - good or bad - for purposes other than pushing it into LDAP.
It is simple to override the 'is_dn' function used by the ldif package by replacing the module-level function like so:
import ldif

def is_dn(s):
    return 1

ldif.is_dn=is_dn
 
That is it. From this point on, the ldif package will use the new is_dn() function and will not fail on invalid dns.