Interesting discovery. I was convinced that the maximum grid size of my machine (GeForce GTX 660M) was [65535, 65535, 65535]. I checked the device parameters using the cudaGetDeviceProperties function and couldn't believe the results. My GPU's limit is actually _way_ higher than I expected; according to cudaGetDeviceProperties it's [2147483647, 65535, 65535]. I looked my card up in the CUDA wiki and it turned out that my device query was not lying.
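If you want to check your own card, the query is a few lines of host code. A minimal sketch (device 0 is assumed, error handling trimmed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0 (assumed to be the card of interest)
    cudaGetDeviceProperties(&prop, 0);
    printf("Max grid size: [%d, %d, %d]\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

On a compute-capability 3.0 card like the 660M this should report the [2147483647, 65535, 65535] grid mentioned above.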
However, I had a serious problem testing the huge grid size I had unexpectedly discovered. My vector addition compiled successfully, but it returned errors and could not be profiled whenever the vector size exceeded 65535 (blocks) * 1024 (threads per block). I started two threads, one on the NVIDIA dev forum and one on Stack Overflow, and finally this brilliant guy helped me.
What was the problem, then? I hadn't changed the default Visual Studio setting that defines the arch and code parameters passed to nvcc. To set the proper values for your card, look up your GPU's compute capability in the CUDA wiki and adjust the code-generation setting in the project properties accordingly.
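For a compute-capability 3.0 card like the GTX 660M, the setting would look something like the following (the Visual Studio property path and the file names are illustrative; check your toolkit version):

```
# Visual Studio: Project Properties -> CUDA C/C++ -> Device -> Code Generation
compute_30,sm_30

# Equivalent nvcc command-line flag
nvcc -gencode arch=compute_30,code=sm_30 kernel.cu -o kernel
```

The default in older project templates targeted compute capability 2.0 or below, where the grid's x-dimension really is capped at 65535, which is why the launch failed even though the hardware allows much more.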
Now I can go beyond that old limit of 65535 and continue my research.
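For reference, this is roughly what a vector addition launched beyond the old 65535-block limit looks like once the code-generation settings are right (the vector size here is illustrative, not from my original test):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, size_t n) {
    // 64-bit index so the computation is safe even for very large grids
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    // More elements than 65535 * 1024, so the grid's x-dimension
    // must exceed the old 65535-block cap
    const size_t n = 100000000;  // illustrative size
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    const int threads = 1024;
    const int blocks = (int)((n + threads - 1) / threads);  // ~97657 blocks

    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaError_t err = cudaDeviceSynchronize();
    printf("Kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

With the wrong arch/code values this same launch fails with an invalid-configuration error; with compute_30/sm_30 it runs fine.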
***I have updated my CUDA Hello World tutorial so you can update that setting as you create the project.***