I was on Aljazeera Arabic’s website the other day and, as I was voting on a poll, was presented with the following screen:
The CAPTCHA in the screen above immediately caught my attention. The distortions in it seemed very simple: the text was not warped in any way, and the characters did not overlap.
The following is a URL for one of the CAPTCHAs:
http://www.aljazeera.net/Portal/KServices/Controles/SecureCAPTCHA/GenerateImage.aspx?Code=EANmyyXghpajFhOX6rCRKQ==&Length=4
Opening the URL above and refreshing the page a few times gives the following CAPTCHAs:
The dashed grey lines are randomized, while the letters in the CAPTCHAs above are static. The letters are encoded in the Code parameter in the URL. Notice that there are two forms for each character: a straight form and another that is slightly rotated.
Aljazeera’s CAPTCHA can easily be broken by doing the following:
- Removing the dashed grey lines
- Finding the characters in the image
- Separating the characters in the image
- Classifying each character
I’ll be using Octave/Matlab for the above tasks and will be explaining my algorithm using the following CAPTCHA as an example.
I’ll first begin by loading the CAPTCHA in Octave/Matlab using the line of code below. I have already saved the CAPTCHA image in a file called captcha.jpeg.
N = imread('captcha.jpeg');
N is a matrix where each element in the matrix corresponds to the color of the pixel at the corresponding location in the image. For example, the matrix below is a submatrix of N corresponding to the letter ‘J’ in the image. If you look closely, you can almost see the letter ‘J’ in the matrix.
255 249 249 255 255 255 255 255 255 255 255 255 189 255 246 7 251 255 249 255 255 246 249 255 170 76 11 0 253 255 255 250 245 255 255 245 254 24 59 60 255 233 255 255 255 255 230 255 246 1 9 96 170 255 238 255 247 255 246 255 255 2 0 9 60 251 240 255 255 255 255 250 238 255 2 0 0 206 154 251 241 245 255 254 255 253 5 0 9 13 193 128 246 255 253 255 255 243 0 4 0 0 9 241 255 247 255 255 245 251 255 0 0 0 10 0 255 255 246 255 255 255 233 21 0 7 0 2 0 255 245 255 255 246 255 231 11 6 0 5 11 1 245 243 255 253 252 255 248 0 15 0 0 0 255 255 255 255 240 243 255 254 250 11 8 0 0 248 238 251 255 255 255 250 255 241 0 0 10 255 255 252 255 238 244 255 255 240 255 18 0 255 255 246 255 253 255 242 251 255 244 0 12 0 255 251 255 249 254 241 255 253 255 0 17 0 255 255 244 255 255 246 255 250 246 6 0 9 255 243 255 249 255 253 255 243 8 11 12 254 251 250 255 249 245 233 255 10 1 0 0 255 246 255 249 251 8 15 0 0 0 0 255 243 246 255 0 11 0 5 1 0 255 255 237 255 254 0 5 4 13 0 244 255 250 255 255 248 255 255 255 247 249 255 255 244 255 247 255 255
A value of 255 corresponds to a white pixel while 0 corresponds to a black pixel. All values strictly between 0 and 255 correspond to shades of grey.
1. Removing the dashed grey lines
The grey lines serve as a distortion to make the image harder for a computer to read. We can remove all shades of grey using the following line of code:
N = N > 100;
The line of code above sets every element greater than 100 (the light-colored pixels) to one and every other element to zero. The matrix becomes a representation of a binary image where a value of one corresponds to white and a value of zero corresponds to black. The following image is what we see after executing the line of code above. Notice how all shades of grey have been removed and we’re left only with the black letters.
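As a minimal sketch of the thresholding step, here is the same comparison applied to a tiny made-up patch rather than the real CAPTCHA. The pixel values and the name patch are assumptions for illustration only, and the sketch assumes a single-channel (greyscale) image, which is what the matrix shown earlier suggests.

```octave
% Toy 3x4 greyscale patch (values invented for illustration):
% 255/249 = white background, 150/160 = grey line, 0/7/12 = black letter.
patch = uint8([255 150   0 255;
               249  12 150 255;
               255   7 255 160]);

% Same threshold as above: anything brighter than 100 becomes 1 (white),
% everything else becomes 0 (black).
binary = patch > 100;

disp(binary);
% Prints:
%   1  1  0  1
%   1  0  1  1
%   1  0  1  1
```

The grey pixels (150, 160) end up on the white side of the threshold, so only the dark letter pixels remain as zeros.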
2. Finding the characters in the image
I can take advantage of the fact that the characters are well-separated with whitespace in order to identify each character in the image.
First, the columns containing black pixels are identified using the code below.
% find location of dark colours (rows and columns)
[r, c] = find(N==0);
% find out the columns that contain black pixels
c = unique(c);
c is a vector that now contains the column numbers, in sorted order, that contain black pixels.
Second, cluster the columns that are close together. Each cluster of columns is then considered a character. Here, I am declaring two columns to be “close together” if they are no more than three pixels apart. The line of code below does exactly that.
clusters = [0; find(diff(c) > 3); length(c)];
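find(diff(c) > 3) finds the positions in c where the gap to the next column is larger than three pixels, i.e. the last column of each character except the final one, while 0 and length(c) are for the boundaries.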
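To make the boundary arithmetic concrete, here is a small worked sketch on a hypothetical column vector. The values in c are invented; in the real script c comes from find and unique as shown earlier.

```octave
% Hypothetical sorted column indices that contain black pixels:
% one character in columns 5-8 and another in columns 15-17.
c = [5; 6; 7; 8; 15; 16; 17];

% diff(c) is [1; 1; 1; 7; 1; 1], so the only gap larger than 3 is at index 4.
clusters = [0; find(diff(c) > 3); length(c)];   % gives [0; 4; 7]

% Character k spans c(clusters(k) + 1) through c(clusters(k + 1)).
for k = 1:length(clusters) - 1
  first_col = c(clusters(k) + 1);
  last_col  = c(clusters(k + 1));
  fprintf('character %d: columns %d to %d\n', k, first_col, last_col);
end
% Prints:
%   character 1: columns 5 to 8
%   character 2: columns 15 to 17
```

Note that c needs to be a column vector for the vertical concatenation to work, which is what find returns for a matrix input.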
3. Extracting the characters from the image
The submatrix corresponding to each cluster in the image corresponds to a character. Running the following loop extracts each of the four characters from the image.
for i = 1:4
  char_image = N(:, c(clusters(i) + 1):c(clusters(i + 1)));
  % Classify the character here...
end
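The steps so far can be tried end to end on a toy binary image. This is only a sketch under assumed data: the 4x10 image and the cell array pieces are made up for illustration, and the loop bound is derived from clusters instead of being fixed at four.

```octave
% Toy binary image: 1 = white, 0 = black. Two made-up "characters",
% one in columns 2-3 and one in columns 8-9.
N = ones(4, 10);
N(2:3, 2:3) = 0;
N(1:4, 8:9) = 0;

% Columns that contain at least one black pixel.
[r, c] = find(N == 0);
c = unique(c);                                  % [2; 3; 8; 9]

% Boundaries between groups of nearby columns.
clusters = [0; find(diff(c) > 3); length(c)];   % [0; 2; 4]

% Cut out one submatrix per cluster.
pieces = cell(1, length(clusters) - 1);
for i = 1:length(pieces)
  pieces{i} = N(:, c(clusters(i) + 1):c(clusters(i + 1)));
end

disp(size(pieces{1}));   % each piece is 4 rows by 2 columns here
```

Each extracted piece keeps the full image height and only the columns belonging to that character.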
The following are the extracted characters:
4. Classifying each character
Now that each character’s image is separate, all that remains is to identify what character is in the image. As I mentioned earlier, each of the 26 letters has two forms, a straight form and a rotated form. This brings the total number of characters to identify to 52.
Classifying a character is slightly less straightforward than the previous sections since there are some minor differences in the shape. Take the two characters below as an example; they are the same character, but there are some slight differences in each image.

So how can they be identified as the same? A very naive approach to solving this problem that seems to work very well is the following:
- For each of the 52 characters, compute the number of white pixels in each column.
The computations from the task above form our dataset. This can be calculated by executing the following command for each character:
sum(char_image);
Here, char_image is the matrix of the character whose per-column white pixel counts we want. The command works because each white pixel in the matrix has the value one while the other elements of the matrix (i.e. black pixels) are zero, so summing a column counts its white pixels.
- Given an image of a character, compute the number of white pixels in its columns and compare it to that of the 52 characters in the dataset. The character in the dataset with the least difference in white pixels is the most likely match.
% The 52 characters in our dataset
chars = ['L' 'I' ... characters of the dataset ... ];

% Sum of white pixels in each column for each of
% the characters above.
m{1} = [40 28 24 21 28 33 38 39 40 40 40 39 38 39 40];
m{2} = [...]
... rest of the data set ...

% Compute the number of white pixels in each column for the
% image we're trying to classify.
char_image = sum(char_image);

% Assign a score of how different the character we're trying
% to classify is from a character in the dataset. The lower the
% score, the more likely these characters match.
score = 100;
char = '-';

% Compare each character in our dataset to the character
% we're trying to classify
for i = 1:length(m)
  candidate_char = m{i};
  % Only compare characters that are the same width
  if length(candidate_char) == length(char_image)
    candidate_score = max(abs(char_image-candidate_char));
    if candidate_score < score % Did we find a better match?
      score = candidate_score;
      char = chars(i);
    end
  end
end
% char has the solution at this point
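For readers who want to run the matching logic without the full dataset, here is a self-contained sketch with a hypothetical two-entry dataset. The letters, column profiles, and toy image are all invented, and the names col_sums, best_score, and best_char are mine; the sketch also starts the score at Inf instead of 100, which is a small variation on the code above.

```octave
% Hypothetical dataset with two entries (the real one has 52).
chars = ['A' 'B'];
m = {[4 3 2 4], [1 3 3 1]};

% Toy 4x4 binary character image (1 = white, 0 = black).
char_image = [1 0 1 1;
              1 1 0 1;
              1 0 0 1;
              1 1 1 1];

% White pixels per column: [4 2 2 4].
col_sums = sum(char_image);

best_score = Inf;
best_char = '-';
for i = 1:length(m)
  % Only compare profiles of the same width.
  if length(m{i}) == length(col_sums)
    % Largest per-column difference between the two profiles.
    candidate_score = max(abs(col_sums - m{i}));
    if candidate_score < best_score
      best_score = candidate_score;
      best_char = chars(i);
    end
  end
end

fprintf('best match: %s (score %d)\n', best_char, best_score);
% Prints: best match: A (score 1)
```

The score is the largest per-column difference (the Chebyshev distance between the two profiles), so a single badly mismatched column is enough to rule a candidate out, while small differences spread across columns are tolerated.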
I tested the algorithm on roughly 100 CAPTCHAs and the success rate has been 100%.
For completeness, I used the bash script below to fetch CAPTCHAs directly from Aljazeera’s website.
VOTING_URL='http://www.aljazeera.net/Portal/KServices/supportPages/vote/SecureVote.aspx'
# Pull the value of the Code parameter out of the voting page's HTML
CAPTCHA_CODE=$(wget -qO- "$VOTING_URL" | grep '?Code=' | sed "s/.*?Code=\(.*\)&.*/\1/")
echo "Captcha code: $CAPTCHA_CODE"
CAPTCHA_URL="http://www.aljazeera.net/Portal/KServices/Controles/SecureCAPTCHA/GenerateImage.aspx?Code=$CAPTCHA_CODE&Length=4"
# Slashes in the code are not valid in a file name, so replace them
CAPTCHA_FILE=$(echo "$CAPTCHA_CODE" | sed "s,/,_,g")
wget "$CAPTCHA_URL" -O "$CAPTCHA_FILE.jpeg"
If you’re interested, you can download the entire dataset and source code of the algorithm here. You will need either Octave or Matlab to run the scripts.
It’s really a shame that a well-known news network like Aljazeera would use such a silly measure for security. Anyways, happy new year!


