Finding duplicate BLOBS using DBMS_CRYPTO.HASH and analytic functions

Environment: Oracle Database EE

Suppose that you have a table with a BLOB column, and you want to know if there are duplicate BLOBS in that table. A GROUP BY query with COUNT doesn’t work on BLOB columns… One way would be to write a PL/SQL or Perl program that extracts all BLOBS and compares the values between all rows. So, is there no simple way to do this in SQL? Yes, there is! You can use the DBMS_CRYPTO package to hash the values of the BLOB column and then group the hashed values using an analytic function.

Let’s work out an example.

First, we create a simple table with one BLOB column:

  BLOBID       NUMBER                           NOT NULL,
  BLOBNAME     VARCHAR2(50)                     NOT NULL,
  BLOBCONTENT  BLOB                             NOT NULL


Next, let’s fill the table with some records. In this example, records 1, 4 and 5 have the same BLOB values, as have records 2 and 6.

insert into blobs(blobid,blobname,blobcontent)
values(1,'blob 1',utl_raw.cast_to_raw('contents of blob 1'));

insert into blobs(blobid,blobname,blobcontent)
values(2,'blob 2',utl_raw.cast_to_raw('contents of blob 2'));

insert into blobs(blobid,blobname,blobcontent)
values(3,'blob 3',utl_raw.cast_to_raw('contents of blob 3'));

insert into blobs(blobid,blobname,blobcontent)
values(4,'blob 4',utl_raw.cast_to_raw('contents of blob 1'));

insert into blobs(blobid,blobname,blobcontent)
values(5,'blob 5',utl_raw.cast_to_raw('contents of blob 1'));

insert into blobs(blobid,blobname,blobcontent)
values(6,'blob 6',utl_raw.cast_to_raw('contents of blob 2'));


Now we need to grant our user execute rights on the DBMS_CRYPTO package.
Execute the following SQL with user SYS as SYSDBA:

grant execute on dbms_crypto to matthiash;

Now comes the difficult part: create a query so that the contents of the BLOBS are hashed using DBMS_CRYPTO, hashes that have the same value are grouped together, and the record ID’s are shown so that it’s clear which records have identical BLOB values. I managed to do this using window partitioning. Basically, we group the results into different windows, where each window contains the records which have the same hashed BLOB contents.

This is the query:

select blobid, blobname, blobcontent,row_number()
over(partition by dbms_crypto.hash(blobcontent,2) order by blobid asc) row_number,
count(blobid) over(partition by dbms_crypto.hash(blobcontent,2)) blobcount 
from blobs 
order by blobcount desc, dbms_crypto.hash(blobcontent,2) asc, row_number asc;

And these are the results:

1,blob 1,1,3
4,blob 4,2,3
5,blob 5,3,3
2,blob 2,1,2
6,blob 6,2,2
3,blob 3,1,1

So, BLOBS 1, 4 and 5 are grouped, next we have BLOBS 2 and 6, and finally BLOB 3. Mission accomplished! :-) Mind you that this is a very small table, so the query runs very fast. Using DBMS_CRYPTO.HASH on tables with millions of rows or very large BLOBS might need some CPU power!