Arpit,
For object classification we pretrain the convolutional layers on the ImageNet classification task at half the resolution which is 224 × 224 input image. Classification is a simple problem compared to object detection
For object detection we double the resolution which is 448 x 448 . Detection requires fine-grained visual information and that is the reason increasing the input resolution of the network from 224 × 224 to 448 × 448 helps increase the accuracy for object detection
see the network architecture below
Please let me know if the explanation helps
Thanks,
Renu