-
I would think you don't need an image library to do this - simply fetching the URL content and comparing the two streams as byte arrays should do it.
Unless of course you are interested in identifying similar images as well.
-
Why not write one yourself? It does not seem difficult. Check this out.
-
Depending on how detailed you want to get with it:
- download the image
- as you download it generate a hash for it
- make a directory where the directory name is the hash value (if the directory does not exist)
- if directory contains 2 or more files then compare the file sizes
- if the file sizes are the same then do a byte by byte comparison of the image to the bytes of the images in the file
- if the bytes are unique then you have a new image
Regardless of if you want to do all that or not you need to:
- download the images
- do a byte-by-byte comparison of the images
No need to rely on any special imaging libraries, images are just bytes.
-
Look at the MessageDigest class. Essentially, you create an instance of it, then pass it a series of bytes. The bytes could be the bytes directly loaded from the URL if you know that two images that are the "same" will be the selfsame file/stream of bytes. Or if necessary, you could create a BufferedImage from the stream, then pull out pixel values, something like:
MessageDigest md = MessageDigest.getInstance("MD5");
ByteBuffer bb = ByteBuffer.allocate(4 * bimg.getWidth());
for (int y = bimg.getHeight()-1; y >= 0; y--) {
bb.clear();
for (int x = bimg.getWidth()-1; x >= 0; x--) {
bb.putInt(bimg.getRGB(x, y));
}
md.update(bb.array());
}
byte[] digBytes = md.digest();
Either way, MessageDigest.digest() eventually gives you a byte array which is the "signature" of the image. You could convert this to a hex string if it's helpful, e.g. for putting in a HashMap or database table, e.g.:
StringBuilder sb = new StringBuilder();
for (byte b : digBytes) {
sb.append(String.format("%02X", b & 0xff));
}
String signature = sb.toString();
If the content/image from two URLs gives you the same signature, then they're the same image.
Edit: I forgot to mention that if you were hashing pixel values, you'd probably want to include the dimensions of the image in the hash too. (Just to a similar thing-- write two ints to an 8-byte ByteBuffer, then update the MessageDigest with the corresponding 8-byte array.)
The other thing is that somebody mentioned is that MD5 is not collision-resistent. In other words, there is a technique for constructing multiple byte sequences with the same MD5 hash without having to use the "brute force" method of trial and error (where on average, you'd expect to have to try about 2^64 or 16 billion billion files before hitting on a collision). That makes MD5 unsuitable where you're trying to protect against this threat model. If you're not concerned about the case where somebody might deliberately try to fool your duplicate identification, and you're just worried about the chances of a duplicate hash "by chance", then MD5 is absolutely fine. Actually, it's not only fine, it's actually a bit over the top-- as I say, on average, you'd expect one "false duplicate" after about 16 billion billion files. Or put another way, you could have, say, a billion files and the chance of a collision be extremely close to zero.
If you are worried about the threat model outlined (i.e. you think somebody could be deliberately dedicating processor time to constructing files to fool your system), then the solution is to use a stronger hash. Java supports SHA1 out of the box (just replace "MD5" with "SHA1"). This will now give you longer hashes (160 bits instead of 128 bits), but with current knowledge makes finding a collision infeasible.
Personally for this purpose, I would even consider just using a decent 64-bit hash function. That'll still allow tens of millions of images to be compared with close-to-zero chance of a false positive.
-
You could also generate a MD5 signature of the file and ignore duplicate entries. Won't help you find similar images though.
-
calculate MD5s using something like this:
MessageDigest m=MessageDigest.getInstance("MD5");
m.update(image.getBytes(),0,image.length());
System.out.println("MD5: "+new BigInteger(1,m.digest()).toString(16));
Put them in a hashmap.
-
I've done something very similar to this before in Java and I found that the PixelGrabber class inside the java.awt.image package of the api is extremely helpful (if not downright necessary).
Additionally you would definitely want to check out the ColorConvertOp class which can performs a pixel-by-pixel color conversion of the data in the source image and the resulting color values are scaled to the precision of the destination image. The documentation goes on to say that the images can even be the same image in which case it would be quite simple to detect if they are identical.
If you were detecting similarity, you need to use some form of averaging method as mentioned in the answer to this question
If you can, also check out Volume 2 chapter 7 of Horstman's Core Java (8th ed) because there's a whole bunch of examples on image transformations and the like, but again, make sure to poke around the java.awt.image package because you should find you have almost everything prepared for you :)
G'luck!
-
Hashing is already suggested and recognizing if two files are identical is very easy, but you said pixel level.
If you want to recognize two images even if they are in different formats (.png/.jpg/.gif/..) and even if they were scaled I suggest:
(using an image library and if the image are medium/big no 16x16 icons):
- scale the image to some fixed size, it depends on the samples
- transform it to grey scale using the RGB-YUV conversion for exampel and taking Y from there (very easy)
3 Do the hamming distance of each image and set a threshold to decide if they are the same or not.
You will do a sum of the difference of all the grey pixels of both images you get a number if the difference is < T you consider both images identical
--
-
Inspect the response headers and interrogate the HTTP header ETag value, if present. (RFC2616: ETag) They maybe the same for identical images coming from your target web server. This is because the ETag value is often a message digest like MD5, which would allow you to take advantage of the webserver's already completed computations.
This may potentially allow you to not even download the image!
for each imageUrl in myList
Perform HTTP HEAD imageUrl
Pull ETag value from request
If ETag is in my map of known ETags
move on to next image
Else
Download image
Store ETag in map
Of course the ETag has to be present and if not, well the idea is toast. But maybe you have pull with the web server admins?
-
Maybe you could check this one.
http://www.java-forums.org/java-2d/22880-identify-pixel-image.html