Wednesday, May 28, 2014

Large LDIF dn record indexer and record getter

I had to retrieve individual records by dn from a 500MB LDIF file, and doing it with the traditional Python LDIF library's LDIFParser took around 90 seconds. The issue is that this is a 'sparse' lookup problem: I didn't need 99% of the records, just around 100-1000 records scattered anywhere in the LDIF file.

I figured it would be preferable to index the entire LDIF as quickly as possible and then use a getDN function to retrieve entries individually. Of course you can get pretty clever with this and extend it to do more interesting indexing.

Stats of the LDIF file I used:
    size: 448,355kB
    Lines: 16,337,077
    Entries:  389,776
And it was able to index the file in:
    Time to index: 14,763ms
    Time to retrieve 1000 entries using getDN: 1,855ms
    Total memory: ~45MB
Using the LDIFParser and storing dns in a map for easy retrieval:
    Time to index: 93,628ms
    Time to retrieve 1000 entries using the map: 1,432ms
    Total memory: ~1.5GB
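The core idea can be shown as a minimal, self-contained sketch: one pass records each dn's byte offset, and retrieval seeks straight to it. This is illustrative only (Python 3, hypothetical names index_dns/get_record); it ignores LDIF line folding and base64-encoded dns, which the full class below handles.

```python
def index_dns(path):
    """Single pass over the LDIF: map each dn (lowercased) to the
    byte offset where its record starts."""
    dn_map = {}
    with open(path, 'rb') as f:
        offset = f.tell()
        line = f.readline()
        while line:
            if line[:3] == b'dn:':
                dn_map[line[3:].strip().lower().decode('utf-8')] = offset
            offset = f.tell()
            line = f.readline()
    return dn_map


def get_record(path, dn_map, dn):
    """Seek straight to the stored offset and read lines until the
    blank line that ends the record."""
    start = dn_map.get(dn.lower())
    if start is None:
        return None
    lines = []
    with open(path, 'rb') as f:
        f.seek(start)
        for line in f:
            if line in (b'\n', b'\r\n'):
                break
            lines.append(line.decode('utf-8').rstrip('\r\n'))
    return lines
```

The map holds only a string and an integer per entry, which is why the memory footprint stays tiny compared to keeping whole parsed records around.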

import sys
import base64
import StringIO
import ldif

class largeLDIF(ldif.LDIFParser):
    """indexes through large ldifs to only store dn vs file offset as a map
    retrieval is done on demand afterwards"""
    def __init__(self,input_file):
        # accept either a filename or an open file-like object
        if isinstance(input_file, str):
            self.ldif_if=open(input_file,'rb')
        elif hasattr(input_file, "read"):
            self.ldif_if=input_file
        ldif.LDIFParser.__init__(self,self.ldif_if)
        # dn -> byte offset of the start of its record
        self.dnMap={}
        self.count=0

    def parse(self):
        """override parse to only look for dns to generate a map"""
        loc=self.ldif_if.tell()
        self._line  = self.ldif_if.readline()
        while self._line:
            colon_pos=self._line.find(':')
            if self._line[0]=='#' or colon_pos<0:
                loc=self.ldif_if.tell()
                self._line  = self.ldif_if.readline()
                continue
            
            attr_type = self._line[0:colon_pos]
            
            if attr_type == 'dn':
                # check if line is folded, do this only if a dn was found
                unfolded_lines=[self._stripLineSep(self._line)]
                self._line=self.ldif_if.readline()
                # check for fold and don't process if not needed
                
                if self._line and self._line[0]==' ':
                    while self._line and self._line[0]==' ':
                        # LDIF folding: drop the single leading space
                        unfolded_lines.append(self._stripLineSep(self._line[1:]))
                        self._line=self.ldif_if.readline()

                    unfoldeddn=''.join(unfolded_lines)
                else:
                    unfoldeddn=unfolded_lines[0]
                value_spec = unfoldeddn[colon_pos:colon_pos+2]
                if value_spec=='::':
                    # attribute value needs base64-decoding
                    attr_value = base64.decodestring(unfoldeddn[colon_pos+2:])
                elif len(unfoldeddn)<=colon_pos+1:
                    # no use if dn is empty
                    continue
                else:
                    attr_value = unfoldeddn[colon_pos+2:].lstrip()
                        
                self.dnMap[attr_value]=loc
                self.count+=1
                if self.count % 10000==0:
                    sys.stderr.write("%s - %s\n"%(self.count,attr_value))
                
                # now read until an empty line or end of file is found
                # since that would indicate end of record
                while self._line and not ( self._line == '\r\n' or self._line=='\n') :
                    self._line = self.ldif_if.readline()
                
            loc=self.ldif_if.tell()
            self._line=self.ldif_if.readline()
                
                    
    def getDN(self,dn):
        if self.dnMap.has_key(dn):
            self.ldif_if.seek(self.dnMap[dn])
            # read the entry block into a StringIO object
            ldifStr=StringIO.StringIO()
            line=self.ldif_if.readline()
            while line:
                ldifStr.write(line)
                # a blank line terminates the record
                if line == '\r\n' or line=='\n':
                    break
                line=self.ldif_if.readline()
            ldifStr.seek(0)
            # record is read in, parse it with the stock parser
            rec=ldif.LDIFRecordList(ldifStr)
            rec.parse()
            return rec.all_records[0]

So, I couldn't help it... I tested a simple file read to see the fastest Python could get through the file without doing any operations - just a plain read of a 500MB file.
Well, that turned out to be around the 2 second mark.
Next I tried adding one or two if-then-else branches to see how they affected performance, and that landed around 7 seconds.
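The baseline measurement was essentially this sketch (chunked here rather than line-by-line; raw_read_time is an illustrative name, and any large file path works):

```python
import time

def raw_read_time(path, chunk=1 << 20):
    """Baseline: read the file in 1MB chunks with no per-line work,
    returning elapsed seconds and total bytes read."""
    total = 0
    start = time.time()
    with open(path, 'rb') as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    return time.time() - start, total
```

Whatever this returns for your file is the floor; any parsing logic can only add to it.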

So I rewrote the class with the fewest branches and operational steps I could, as shown below.
An important thing to note is that using a for loop over the file object improves performance significantly.
Also, to optimize the check for 'dn', I found that startswith takes an enormous amount of time compared to a slice comparison (see this post: http://stackoverflow.com/questions/13270888/why-is-startswith-slower-than-slicing).
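The gap is easy to reproduce with timeit (absolute numbers vary by machine; the attribute lookup cost of startswith is what adds up over millions of lines):

```python
import timeit

line = "dn: cn=user,dc=example,dc=com"

# both checks are equivalent for a fixed-width prefix...
assert line.startswith('dn:') == (line[:3] == 'dn:')

# ...but per-call overhead differs at this volume
t_startswith = timeit.timeit("line.startswith('dn:')",
                             setup="line='dn: cn=user,dc=example'",
                             number=1000000)
t_slice = timeit.timeit("line[:3] == 'dn:'",
                        setup="line='dn: cn=user,dc=example'",
                        number=1000000)
print("startswith: %.3fs  slice: %.3fs" % (t_startswith, t_slice))
```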


class largeLDIF(ldif.LDIFParser):
    """indexes through large ldifs to only store dn vs file offset as a map
    retrieval is done on demand afterwards"""
    def __init__(self,input_file):
        if isinstance(input_file, str):
            self.ldif_if=open(input_file,'rb')
        elif hasattr(input_file, "read"):
            self.ldif_if=input_file
        ldif.LDIFParser.__init__(self,self.ldif_if)
        self.dnMap={}
        self.count=0
        
    def parse(self):
        """override parse to only look for dns to generate a map"""
        loc=0L
        newloc=0L
        # prime this
        unfolded_lines=None
        unfoldcheck=False
        dnloc=0L
        readuntilEndofRec =False
        colon_pos=2
        for self._line in self.ldif_if:
            loc=newloc
            newloc+=len(self._line)
            if readuntilEndofRec:
                # skip lines until a blank line marks the end of the record
                if self._line == "\r\n" or self._line == "\n":
                    readuntilEndofRec=False
                continue
                    
            if unfoldcheck:
                if  self._line[0]==' ' :
                #folded line detected
                    unfolded_lines.append(self._stripLineSep(self._line[1:]))
                    continue
                else:
                    unfoldeddn=''.join(unfolded_lines)
                    value_spec = unfoldeddn[colon_pos:colon_pos+2]
                    if value_spec=='::':
                        # attribute value needs base64-decoding
                        attr_value = base64.decodestring(unfoldeddn[colon_pos+2:])
                    elif len(unfoldeddn)<=colon_pos+1:
                        # no use if dn is empty
                        unfoldcheck=False
                        continue
                    else:
                        attr_value = unfoldeddn[colon_pos+2:].lstrip()
                    
                    self.dnMap[attr_value.lower()]=dnloc
                    if self.count % 10000==0:
                        sys.stderr.write("%s - %s\n"%(self.count,attr_value))
                    unfoldcheck=False
                    readuntilEndofRec=True
                    continue
                
            attr_spec=self._line[:3]    
        
            if attr_spec=='dn:':
                unfolded_lines=[self._stripLineSep(self._line)]
                unfoldcheck=True
                dnloc=loc
                self.count+=1
            
    def getDN(self,dn):
        if self.dnMap.has_key(dn.lower()):
            self.ldif_if.seek(self.dnMap[dn.lower()])
            #read the entry block into a stringIO object
            ldifStr=StringIO.StringIO()
            for line in self.ldif_if:
                ldifStr.write(line)
                if line == '\r\n' or line=='\n':
                    break
            ldifStr.seek(0)
            # record is read in
            rec=ldif.LDIFRecordList(ldifStr)
            rec.parse()
            return rec.all_records[0]
            
Well, here are the results:

    Time to index: 10,515ms
    Time to retrieve 1000 entries using getDN: 342ms
    Total memory: ~50MB 
So we shaved roughly another 30% off the indexing time and sped up retrieval by over 5x!
Also interesting to note: I see solid disk IO at 50MB/s during the initial indexing step, which adds up (50MB/s x 10 seconds = 500MB). I don't believe this is a disk IO bottleneck, though, since a raw Python read with no checks is around the 2 second mark - a perceived disk speed of about 250MB/s, which without an SSD in this test box has to be OS buffering at work.

 

Tuesday, May 27, 2014

Yahoo mail advertisement-pane (pain) remover (greasemonkey)

This greasemonkey script removes the advert pane on the right-hand side in Yahoo mail. You can now reclaim the lost email real estate as well as reduce the ads shown, which happens to be a premium feature. Thanks, greasemonkey.

https://openuserjs.org/scripts/venkman69/Yahoo_Mail_New_Ad_Remover


Redfin and Walkscore greasemonkey script

Here is a Redfin + Walkscore greasemonkey script that adds a 'Show walkability' link which you DON'T have to click: a mouse hover shows the walkscore page. This is mostly all I need - a quick check and carry on, instead of launching a whole new page, reading it, closing it, and all that unnecessary effort.

And since userscripts.org is in a tailspin I have uploaded this to openuserjs.org:

https://openuserjs.org/scripts/venkman69/nnat.redfin.walkscore/RedFin_WalkScore_Popup



Wednesday, May 14, 2014

A python CSV writer decorator

A CSV writer decorator for functions returning a list (row)

I was looking for a clean way to implement a repeatable pattern where a function returns a value list and that value list gets written out as a CSV row. The hope is that it should be adaptable to any other target format as well.

After a bit of research I came across a post on function decorators, and the decorator mentioned in it seemed like an excellent idea.

The additional requirement for this was to render the header (or field names) for the csv as a one time output for the top line of the CSV file.

The following assumptions are made:
  • The 'out' file handle is a global and is setup prior to calling the function
  • The 'attrNameList' is a global and is the header that should be the first line of the CSV file.
  • The function passed to the decorator returns a compatible list-type value such that the decorator can render it using the csv.writer.
I am guessing this is not quite rigorously pythonic, given the expectations placed on the passed function; a true decorator might expect to be completely agnostic of the function it decorates. But I think it is still reasonable, and reasonable is good enough for me in many cases.

For the header, a single boolean flag is attached to the wrapper function to track whether the header has already been written.

So here is the decorator code:

def writeCSV(func):
    def wrapcsv(*args,**kwargs):
        global out
        global attrNameList
        cw=csv.writer(out, delimiter=",",lineterminator='\n')
        # write the header only on the first call
        if not wrapcsv.header:
            cw.writerow(attrNameList)
            wrapcsv.header=True
        funcval= func(*args,**kwargs)
        cw.writerow(funcval)

        return funcval

    wrapcsv.header=False
    return wrapcsv
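Here is how it wires together end to end. The decorator is repeated so the sketch runs standalone, io.StringIO stands in for the real output file, and person_row is a hypothetical decorated function:

```python
import csv
import io

def writeCSV(func):
    def wrapcsv(*args, **kwargs):
        global out
        global attrNameList
        cw = csv.writer(out, delimiter=",", lineterminator='\n')
        # header is written exactly once, on the first call
        if not wrapcsv.header:
            cw.writerow(attrNameList)
            wrapcsv.header = True
        funcval = func(*args, **kwargs)
        cw.writerow(funcval)
        return funcval
    wrapcsv.header = False
    return wrapcsv

out = io.StringIO()
attrNameList = ["name", "age"]

@writeCSV
def person_row(name, age):
    # hypothetical function returning one row as a list
    return [name, age]

person_row("alice", 30)
person_row("bob", 25)
# out now holds the header line once, then one CSV line per call
```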


And here is a function it could decorate (I was parsing an LDIF file and extracting specified attributes to a CSV file):

@writeCSV
def reportAttrs(dn,ent):
    global attrNameList
    global out
    vallist=[]
    for attrName in attrNameList:
        # the dn is passed separately, not stored in the entry dict
        if attrName.lower() == "dn":
            vallist.append(dn)
        elif ent.has_key(attrName.lower()):
            vallist.append(ent[attrName.lower()][0])
        else:
            vallist.append("")
    return vallist

This can be adapted to writing XML, JSON, or whatever format you need.
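As one sketch of such an adaptation, the same pattern can emit one JSON object per call (JSON Lines) by zipping the returned row against the header names; writeJSONL and person_row are illustrative names, not from the original:

```python
import io
import json

def writeJSONL(func):
    """Same shape as writeCSV, but pairs the returned row with the
    header names and writes one JSON object per call."""
    def wrapjson(*args, **kwargs):
        global out
        global attrNameList
        funcval = func(*args, **kwargs)
        out.write(json.dumps(dict(zip(attrNameList, funcval))) + "\n")
        return funcval
    return wrapjson

out = io.StringIO()
attrNameList = ["name", "age"]

@writeJSONL
def person_row(name, age):
    return [name, age]

person_row("alice", 30)
```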

Wednesday, May 7, 2014

Setting up python virtualenv without sudo

Setup virtualenv and packages locally without sudo/root

I wanted to create a deployable package for production machines on which I have neither root access nor sudo. They are also firewalled, so something simple such as a pip install is not going to work, since outbound ports are blocked as well.

I usually have ssh and scp access to the machine so I can place files on the server but anything else is pretty locked down.

So after some web research, I came across an approach that helped with scripting a solution.
A couple of notes:
  • At the end of this setup, it would have installed a virtualenv with the tools that were packaged in the setup script's directory.
  • We will have pip available, which makes subsequent installs easier.
  • However, pip may or may not be able to download packages if the system is constrained (network firewalling or automatic proxies - I could not figure out a way past auto proxies).
  • You have to provide any required packages yourself, in the correct versions.
  • This will be a fully isolated setup.
Steps:
  1. Get virtualenv.py from: https://raw.github.com/pypa/virtualenv/master/virtualenv.py
  2. Get pip from: https://pypi.python.org/pypi/pip#downloads
  3. Get setuptools from https://pypi.python.org/pypi/setuptools; the download link is towards the bottom (clicking on the download button didn't work for me):
    https://pypi.python.org/packages/source/s/setuptools/setuptools-17.1.1.zip
  4. Get any additional packages in tar.gz format from pypi etc. and place them into the pypackage directory.
  5. Set up the virtual env (setting up to homedir/myenv shown below):
    python $SCRIPTDIR/virtualenv.py --no-setuptools ~/myenv
  6. Activate this environment (note the leading 'dot' is required, with a space following):
    . ~/myenv/bin/activate
  7. Unpack pip to some folder (not the virtual env folder):
    tar xvzf pip-7.0.3.tar.gz
  8. Unpack the setuptools archive:
    unzip setuptools-17.1.1.zip
  9. cd to the unpacked setuptools directory and install setuptools into the virtual env:
    cd setuptools-17.1.1
    python setup.py build
    python setup.py install
  10. cd to the unpacked pip directory and install pip into the virtual env:
    cd ../pip-7.0.3
    python setup.py build
    python setup.py install
From this point forward, once you activate the environment, you can run pip on a locally downloaded tar.gz file as:
pip install <path to tar.gz>
And it will install into the virtual environment.