Environment: Oracle Database EE 10.2.0.5
Suppose that you have a table with a BLOB column, and you want to know if there are duplicate BLOBS in that table. A GROUP BY query with COUNT doesn’t work on BLOB columns… One way would be to write a PL/SQL or Perl program that extracts all BLOBS and compares the values between all rows. So, is there no simple way to do this in SQL? Yes, there is! You can use the DBMS_CRYPTO package to hash the values of the BLOB column and then group the hashed values using an analytic function.
Let’s work out an example.
First, we create a simple table with one BLOB column:
CREATE TABLE BLOBS
(
BLOBID NUMBER NOT NULL,
BLOBNAME VARCHAR2(50) NOT NULL,
BLOBCONTENT BLOB NOT NULL
)
;
ALTER TABLE BLOBS ADD (
CONSTRAINT BLOBS_PK
PRIMARY KEY
(BLOBID))
;
Next, let’s fill the table with some records. In this example, records 1, 4 and 5 have the same BLOB values, as have records 2 and 6.
insert into blobs(blobid,blobname,blobcontent)
values(1,'blob 1',utl_raw.cast_to_raw('contents of blob 1'));
insert into blobs(blobid,blobname,blobcontent)
values(2,'blob 2',utl_raw.cast_to_raw('contents of blob 2'));
insert into blobs(blobid,blobname,blobcontent)
values(3,'blob 3',utl_raw.cast_to_raw('contents of blob 3'));
insert into blobs(blobid,blobname,blobcontent)
values(4,'blob 4',utl_raw.cast_to_raw('contents of blob 1'));
insert into blobs(blobid,blobname,blobcontent)
values(5,'blob 5',utl_raw.cast_to_raw('contents of blob 1'));
insert into blobs(blobid,blobname,blobcontent)
values(6,'blob 6',utl_raw.cast_to_raw('contents of blob 2'));
commit;
Now we need to grant our user execute rights on the DBMS_CRYPTO package.
Execute the following SQL with user SYS as SYSDBA:
grant execute on dbms_crypto to matthiash;
Now comes the difficult part: create a query so that the contents of the BLOBS are hashed using DBMS_CRYPTO, hashes that have the same value are grouped together, and the record ID’s are shown so that it’s clear which records have identical BLOB values. I managed to do this using window partitioning. Basically, we group the results into different windows, where each window contains the records which have the same hashed BLOB contents.
This is the query:
select blobid, blobname, blobcontent,row_number()
over(partition by dbms_crypto.hash(blobcontent,2) order by blobid asc) row_number,
count(blobid) over(partition by dbms_crypto.hash(blobcontent,2)) blobcount
from blobs
order by blobcount desc, dbms_crypto.hash(blobcontent,2) asc, row_number asc;
And these are the results:
BLOBID,BLOBNAME,ROW_NUMBER,BLOBCOUNT
1,blob 1,1,3
4,blob 4,2,3
5,blob 5,3,3
2,blob 2,1,2
6,blob 6,2,2
3,blob 3,1,1
So, BLOBS 1, 4 and 5 are grouped, next we have BLOBS 2 and 6, and finally BLOB 3. Mission accomplished! :-) Mind you that this is a very small table, so the query runs very fast. Using DBMS_CRYPTO.HASH on tables with millions of rows or very large BLOBS might need some CPU power!
HTH,
Matthias