Beyond Compare

Notes collected from: 3 Quick Ways To Compare Data with Python.

Image

Photo by Franki Chamaki on Unsplash

The mannual approach

you could comapre two numeric files in Excel - of course. But it consumes much time and energy.

The slightly advanced approach

years ago, I used to build VBA to compare cell by cell, and have logic returns which imply if two cells are the same - it’s faster, but troublesome to code the ‘old-fashioned’ VBA scripts.

The Pythonic way

The original article introduces a Python-based approach to quickly compare two files, this is the focus of this note. \

  1. Check the integrity of data
    • MD5 Checksum: returns a hexadecimal number for the contents of a file.
    import hashlib, sys
    files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
    def md5(fname):
        md5hash = hashlib.md5()
        with open(fname) as handle: #opening the file one line at a time for memory considerations
            for line in handle:
                md5hash.update(line.encode('utf-8'))
        return(md5hash.hexdigest())
    print('Comparing Files:',files[0],'and',files[1])
    if md5(files[0]) == md5(files[1]):
        print('Matched')
    else:
        print('Not Matched')
    

    Running above gives:
    Image

    • the SHA1 algorithm: another hexadecimal algorithm that converts file contents into a string.
      import hashlib, sys
      files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
      def sha1(fname):
        sha1hash = hashlib.sha1()
        with open(fname) as handle: #opening the file one line at a time for memory considerations
            for line in handle:
                sha1hash.update(line.encode('utf-8'))
        return(sha1hash.hexdigest())
      print('Comparing Files:',files[0],'and',files[1])
      if sha1(files[0]) == sha1(files[1]):
        print('Matched')
      else:
        print('Not Matched')
      

      Running above gives:
      Image

  1. Check data contents with SQL
    Using a couple of Python libraries, we can import our files into an SQL database, and use the Except Operator to highlight any differences.
    * The only thing to note is that Except expects the data to be ordered; otherwise, it will highlight everything as a difference.
      import sys, sqlite3, pandas as pd
      files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
      conn = sqlite3.connect(':memory:') #we are spinning an SQL db in memory
      cur = conn.cursor()
      chunksize = 10000
      i=0
      for file in files:
       i = i+1
       for chunk in pd.read_csv(file, chunksize=chunksize): #load the file in chunks in case its too big
           chunk.columns = chunk.columns.str.replace(' ', '_') #replacing spaces with underscores for column names
           chunk.to_sql(name='file' + str(i), con=conn, if_exists='append')
      print('Comparing', files[0], 'to', files[1]) #Compare if all data from File[0] are present in File[1]
      cur.execute( '''SELECT * FROM File1
                   EXCEPT
                   SELECT * FROM File2''')
      i=0
      for row in cur:
       print(row)
       i=i+1
      if i==0: print('No Differences')
      print('Comparing', files[1], 'to', files[0]) #Compare if all data from File[1] are present in File[0]
      cur.execute( '''SELECT * FROM File2
                   EXCEPT
                   SELECT * FROM File1''')
      i=0
      for row in cur:
       print(row)
       i=i+1
      if i==0: print('No Differences')
      cur.close()
    

    Running above gives:
    Image

  2. Check data contents with Pandas
    * prepare data in a dataframe
    • Using the .equals() method
      import sys, sqlite3, pandas as pd
      files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
      df1 = pd.read_csv(files[0])
      df2 = pd.read_csv(files[1])
      df3 = df1.equals(df2)
      print('Matches:', df3)     
      

      Running above gives:
      Image

    • Using the .any()
      import sys, sqlite3, pandas as pd
      files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
      df1 = pd.read_csv(files[0])
      df2 = pd.read_csv(files[1])
      df3 = (df1 != df2).any(axis=None)
      print('Differences in file:', df3)
      df3 = (df1 != df2).any(1)
      ne_stacked = (df1 != df2).stack()
      changed = ne_stacked[ne_stacked]
      changed.index.names = ['id', 'col']
      print('Differences In:')
      print(changed)
      

      Running above gives:
      Image

    • Using the .Eq()
      import sys, sqlite3, pandas as pd, numpy as np
      files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
      df1 = pd.read_csv(files[0])
      df2 = pd.read_csv(files[1])
      df3 = df1.eq(df2)
      print(df3.all())
      #print(df3.all(axis=1))
      df4 = df3.all(axis=1)
      df4 = pd.DataFrame(df4, columns=['Columns'])
      print(df4[df4['Columns']==False])
      

      Running above gives:
      Image