Wednesday, May 28, 2014

Large LDIF dn record indexer and record getter

I had to retrieve individual records by dn from a 500MB LDIF file, and parsing it with the traditional Python ldif library's LDIFParser took around 90 seconds. The issue is that this is a 'sparse' lookup problem: I didn't need to know about 99% of the records, just some 100-1,000 records scattered anywhere in the LDIF file.

I figured it would be preferable to index the entire LDIF as quickly as possible and then use a getDN function to retrieve entries individually, as sketched below. Of course you can get pretty clever with this and extend it to do more interesting indexing.
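The core idea in miniature, as a rough sketch only (the file name is hypothetical, and the real class below also handles folded lines, base64 dns, and comments):

# sketch: remember the byte offset where each record starts, seek back later
offsets = {}
f = open('big.ldif', 'rb')            # hypothetical file name
loc = f.tell()
line = f.readline()
while line:
    if line[:3] == 'dn:':
        offsets[line[3:].strip()] = loc  # assumes a plain, unfolded dn line
    loc = f.tell()
    line = f.readline()
# later: f.seek(offsets[some_dn]) and read lines until the blank separator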

Stats of the LDIF file I used:
    Size: 448,355kB
    Lines: 16,337,077
    Entries: 389,776
And it was able to index the file in:
    Time to index: 14,763ms
    Time to retrieve 1,000 entries using getDN: 1,855ms
    Total memory: ~45MB
Using the LDIFParser and storing dns in a map for easy retrieval:
    Time to index: 93,628ms
    Time to retrieve 1,000 entries using the map: 1,432ms
    Total memory: ~1.5GB

import base64
import sys
import StringIO

import ldif


class largeLDIF(ldif.LDIFParser):
    """Indexes through a large LDIF, storing only a dn -> file offset map.
    Full records are retrieved on demand afterwards."""

    def __init__(self, input_file):
        # accept either a file name or an already-open file-like object
        if isinstance(input_file, str):
            self.ldif_if = open(input_file, 'rb')
        elif hasattr(input_file, 'read'):
            self.ldif_if = input_file
        else:
            raise TypeError('input_file must be a file name or a file-like object')
        ldif.LDIFParser.__init__(self, self.ldif_if)
        self.dnMap = {}
        self.count = 0

    def parse(self):
        """Override parse to only look for dns and build the offset map."""
        loc = self.ldif_if.tell()
        self._line = self.ldif_if.readline()
        while self._line:
            colon_pos = self._line.find(':')
            # skip comments and lines without an attribute separator
            if self._line[0] == '#' or colon_pos < 0:
                loc = self.ldif_if.tell()
                self._line = self.ldif_if.readline()
                continue

            attr_type = self._line[0:colon_pos]

            if attr_type == 'dn':
                # check if the line is folded; do this only if a dn was found
                unfolded_lines = [self._stripLineSep(self._line)]
                self._line = self.ldif_if.readline()
                # continuation lines start with a single space (RFC 2849);
                # that space is not part of the value, so drop it
                while self._line and self._line[0] == ' ':
                    unfolded_lines.append(self._stripLineSep(self._line[1:]))
                    self._line = self.ldif_if.readline()
                unfoldeddn = ''.join(unfolded_lines)

                value_spec = unfoldeddn[colon_pos:colon_pos + 2]
                if value_spec == '::':
                    # attribute value needs base64-decoding
                    attr_value = base64.decodestring(unfoldeddn[colon_pos + 2:])
                elif len(unfoldeddn) <= colon_pos + 1:
                    # line separators were stripped above, so nothing after
                    # the colon means an empty dn; no use indexing it
                    continue
                else:
                    # lstrip tolerates zero or more spaces after the colon
                    attr_value = unfoldeddn[colon_pos + 1:].lstrip()

                self.dnMap[attr_value] = loc
                self.count += 1
                if self.count % 10000 == 0:
                    sys.stderr.write("%s - %s\n" % (self.count, attr_value))

                # now read until an empty line or end of file is found,
                # since that indicates the end of the record
                while self._line and not (self._line == '\r\n' or self._line == '\n'):
                    self._line = self.ldif_if.readline()

            loc = self.ldif_if.tell()
            self._line = self.ldif_if.readline()

    def getDN(self, dn):
        if dn in self.dnMap:
            self.ldif_if.seek(self.dnMap[dn])
            # read the entry block into a StringIO object, stopping at the
            # blank line that terminates the record
            ldifStr = StringIO.StringIO()
            line = self.ldif_if.readline()
            while line:
                ldifStr.write(line)
                if line == '\r\n' or line == '\n':
                    break
                line = self.ldif_if.readline()
            ldifStr.seek(0)
            # parse just this one record
            rec = ldif.LDIFRecordList(ldifStr)
            rec.parse()
            return rec.all_records[0]

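Usage is straightforward; a quick example (the file name and dn here are hypothetical):

indexer = largeLDIF('export.ldif')   # hypothetical file name
indexer.parse()                      # builds the dn -> offset map
# getDN returns a (dn, entry) tuple, or None if the dn was not indexed
dn, entry = indexer.getDN('cn=user0001,ou=people,dc=example,dc=com')
print dn, entry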
So, I couldn't help it... I tested a simple file read to see how fast Python could get through the file without doing any work: just plain-reading the 500MB file.
That turned out to be around the 2-second mark.
Next I added one or two if-then-else checks to see how they affected performance, which got me to around 7 seconds.
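The baseline measurement was along these lines (the file name is hypothetical):

import time

start = time.time()
f = open('export.ldif', 'rb')   # hypothetical file name
for line in f:                  # plain read, no per-line work
    pass
f.close()
print "raw read: %dms" % ((time.time() - start) * 1000)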

So I rewrote the class with the fewest branches and operational steps I could manage, as shown below.
An important thing to note is that iterating over the file object with a for loop improves performance significantly.
Also, to optimize the check for 'dn', I found that startswith takes an enormous amount of time compared to a slice comparison (see this post: http://stackoverflow.com/questions/13270888/why-is-startswith-slower-than-slicing).
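A quick micro-benchmark in the spirit of that post illustrates the gap (the sample line is made up):

import timeit

setup = "line = 'dn: cn=user0001,ou=people,dc=example,dc=com\\n'"
# attribute lookup plus method-call overhead makes startswith the slower one
print timeit.timeit("line.startswith('dn:')", setup=setup)
print timeit.timeit("line[:3] == 'dn:'", setup=setup)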


class largeLDIF(ldif.LDIFParser):
    """Indexes through large LDIFs, storing only a dn -> byte offset map.
    Retrieval is done on demand afterwards."""

    def __init__(self, input_file):
        # accept either a file name or an already-open file-like object
        if isinstance(input_file, str):
            self.ldif_if = open(input_file, 'rb')
        elif hasattr(input_file, 'read'):
            self.ldif_if = input_file
        else:
            raise TypeError('input_file must be a file name or a file-like object')
        ldif.LDIFParser.__init__(self, self.ldif_if)
        self.dnMap = {}
        self.count = 0

    def parse(self):
        """Override parse to only look for dns and build the offset map."""
        # track the offset by hand: tell() cannot be trusted while iterating
        # over a file in Python 2 because of its read-ahead buffer
        loc = 0L
        newloc = 0L
        dnloc = 0L
        unfolded_lines = None
        unfoldcheck = False
        readuntilEndofRec = False
        colon_pos = 2  # we only match 'dn:', so the colon is always at index 2
        for self._line in self.ldif_if:
            loc = newloc
            newloc += len(self._line)
            if readuntilEndofRec:
                # skip lines until the blank line that ends the record
                if self._line == '\r\n' or self._line == '\n':
                    readuntilEndofRec = False
                continue

            if unfoldcheck:
                if self._line[0] == ' ':
                    # folded line detected; the leading space is not part
                    # of the value, so drop it
                    unfolded_lines.append(self._stripLineSep(self._line[1:]))
                    continue
                # the dn is complete; decode and store it
                unfoldeddn = ''.join(unfolded_lines)
                value_spec = unfoldeddn[colon_pos:colon_pos + 2]
                if value_spec == '::':
                    # attribute value needs base64-decoding
                    attr_value = base64.decodestring(unfoldeddn[colon_pos + 2:])
                elif len(unfoldeddn) <= colon_pos + 1:
                    # no use indexing an empty dn
                    unfoldcheck = False
                    continue
                else:
                    # lstrip tolerates zero or more spaces after the colon
                    attr_value = unfoldeddn[colon_pos + 1:].lstrip()

                self.dnMap[attr_value.lower()] = dnloc
                if self.count % 10000 == 0:
                    sys.stderr.write("%s - %s\n" % (self.count, attr_value))
                unfoldcheck = False
                # skip the rest of this record, unless the current line is
                # already the blank separator that ends it
                readuntilEndofRec = self._line not in ('\r\n', '\n')
                continue

            # slice comparison instead of startswith (see above)
            if self._line[:3] == 'dn:':
                unfolded_lines = [self._stripLineSep(self._line)]
                unfoldcheck = True
                dnloc = loc
                self.count += 1

    def getDN(self, dn):
        key = dn.lower()
        if key in self.dnMap:
            # seek() resets the file's read-ahead buffer, so iterating
            # from the new position is safe
            self.ldif_if.seek(self.dnMap[key])
            # read the entry block into a StringIO object
            ldifStr = StringIO.StringIO()
            for line in self.ldif_if:
                ldifStr.write(line)
                if line == '\r\n' or line == '\n':
                    break
            ldifStr.seek(0)
            # parse just this one record
            rec = ldif.LDIFRecordList(ldifStr)
            rec.parse()
            return rec.all_records[0]
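The numbers below were measured with a simple harness along these lines (the file name is hypothetical):

import time

indexer = largeLDIF('export.ldif')   # hypothetical file name
start = time.time()
indexer.parse()
print "index: %dms" % ((time.time() - start) * 1000)

dns = list(indexer.dnMap)[:1000]     # any 1,000 indexed dns
start = time.time()
for dn in dns:
    indexer.getDN(dn)
print "1,000 lookups: %dms" % ((time.time() - start) * 1000)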
            
Well, here are the results:

    Time to index: 10,515ms
    Time to retrieve 1,000 entries using getDN: 342ms
    Total memory: ~50MB
So we shaved almost 30% off the indexing time and sped up retrieval more than fivefold!
Also interesting to note: I get a solid 50MB/s of disk IO during the initial indexing step, which adds up (50MB/s x 10 seconds = 500MB total). I don't believe this is a disk-IO bottleneck, though, since a raw read in Python without checking anything sits around the 2-second mark; with OS buffering the perceived disk speed is closer to 250MB/s (and there is no SSD in this test, so that has to be OS caching at work).
