In training my models I'm almost exclusively using ReLU for activation in the hidden layers and sigmoid for the output (I've mostly done binary classification). I'm interested in what others mainly use, and what experience you've had with adding layers; have you gotten into the dozens? Also, is the learning rate the most commonly modified hyperparameter? (I haven't done much with momentum or regularization.) It all seems like a very try-it-and-see approach: modify the params, watch the cost curve, tweak, and retry.
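
For context, here's a minimal sketch of the kind of setup I mean (Keras, purely illustrative; the layer sizes, input dimension, and learning rate are placeholder values, not anything tuned):

```python
from tensorflow import keras

# Toy binary classifier: ReLU in the hidden layers, sigmoid on the output.
model = keras.Sequential([
    keras.Input(shape=(20,)),                        # placeholder input dim
    keras.layers.Dense(64, activation="relu"),       # hidden layer 1
    keras.layers.Dense(64, activation="relu"),       # hidden layer 2
    keras.layers.Dense(1, activation="sigmoid"),     # single-unit binary output
])

# Learning rate is the knob I mostly end up tweaking.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```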