Beyond Compare

Notes collected from: 3 Quick Ways To Compare Data with Python.

Photo by Franki Chamaki on Unsplash

The mannual approach

you could comapre two numeric files in Excel - of course. But it consumes much time and energy.

The slightly advanced approach

years ago, I used to build VBA to compare cell by cell, and have logic returns which imply if two cells are the same - it’s faster, but troublesome to code the ‘old-fashioned’ VBA scripts.

The Pythonic way

The original article introduces a Python-based approach to quickly compare two files, this is the focus of this note. \

Check the integrity of data

MD5 Checksum: returns a hexadecimal number for the contents of a file.

import hashlib, sys
files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
def md5(fname):
    md5hash = hashlib.md5()
    with open(fname) as handle: #opening the file one line at a time for memory considerations
        for line in handle:
            md5hash.update(line.encode('utf-8'))
    return(md5hash.hexdigest())
print('Comparing Files:',files[0],'and',files[1])
if md5(files[0]) == md5(files[1]):
    print('Matched')
else:
    print('Not Matched')

Running above gives:

the SHA1 algorithm: another hexadecimal algorithm that converts file contents into a string.

import hashlib, sys
files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
def sha1(fname):
  sha1hash = hashlib.sha1()
  with open(fname) as handle: #opening the file one line at a time for memory considerations
      for line in handle:
          sha1hash.update(line.encode('utf-8'))
  return(sha1hash.hexdigest())
print('Comparing Files:',files[0],'and',files[1])
if sha1(files[0]) == sha1(files[1]):
  print('Matched')
else:
  print('Not Matched')

Running above gives:

Check data contents with SQL
Using a couple of Python libraries, we can import our files into an SQL database, and use the Except Operator to highlight any differences.
* The only thing to note is that Except expects the data to be ordered; otherwise, it will highlight everything as a difference.

  import sys, sqlite3, pandas as pd
  files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
  conn = sqlite3.connect(':memory:') #we are spinning an SQL db in memory
  cur = conn.cursor()
  chunksize = 10000
  i=0
  for file in files:
   i = i+1
   for chunk in pd.read_csv(file, chunksize=chunksize): #load the file in chunks in case its too big
       chunk.columns = chunk.columns.str.replace(' ', '_') #replacing spaces with underscores for column names
       chunk.to_sql(name='file' + str(i), con=conn, if_exists='append')
  print('Comparing', files[0], 'to', files[1]) #Compare if all data from File[0] are present in File[1]
  cur.execute( '''SELECT * FROM File1
               EXCEPT
               SELECT * FROM File2''')
  i=0
  for row in cur:
   print(row)
   i=i+1
  if i==0: print('No Differences')
  print('Comparing', files[1], 'to', files[0]) #Compare if all data from File[1] are present in File[0]
  cur.execute( '''SELECT * FROM File2
               EXCEPT
               SELECT * FROM File1''')
  i=0
  for row in cur:
   print(row)
   i=i+1
  if i==0: print('No Differences')
  cur.close()

Running above gives:

Check data contents with Pandas
* prepare data in a dataframe

Using the .equals() method

import sys, sqlite3, pandas as pd
files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
df1 = pd.read_csv(files[0])
df2 = pd.read_csv(files[1])
df3 = df1.equals(df2)
print('Matches:', df3)     

Running above gives:

Using the .any()

import sys, sqlite3, pandas as pd
files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
df1 = pd.read_csv(files[0])
df2 = pd.read_csv(files[1])
df3 = (df1 != df2).any(axis=None)
print('Differences in file:', df3)
df3 = (df1 != df2).any(1)
ne_stacked = (df1 != df2).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']
print('Differences In:')
print(changed)

Running above gives:

Using the .Eq()

import sys, sqlite3, pandas as pd, numpy as np
files = [sys.argv[1], sys.argv[2]] #these are the arguments we take
df1 = pd.read_csv(files[0])
df2 = pd.read_csv(files[1])
df3 = df1.eq(df2)
print(df3.all())
#print(df3.all(axis=1))
df4 = df3.all(axis=1)
df4 = pd.DataFrame(df4, columns=['Columns'])
print(df4[df4['Columns']==False])

Running above gives:

Beyond Compare

A Python programme for files comparison

Beyond Compare

A Python programme for files comparison

Beyond Compare

The mannual approach

The slightly advanced approach

The Pythonic way