Statistics isn't going to cover probability the way you need. Take actual probability. You don't really use statistics (the way it's taught in university) to do machine learning, although statistical concepts are used. Learning chi-squared tests and whatnot just isn't a great use of time, since those aren't the meat and potatoes of what you need to know.
Machine learning has a large amount of overlap with signal processing. Many of these classes, depending on the university, find themselves in the Computer Engineering or Electrical Engineering departments. Machine learning isn't an abstract, theoretical space that you can solve by just being clever. Much of it has to do with model-based algorithm development, which is pretty firmly in the domain of electrical engineering, and has been for about 40 years.
That said, there is an abstract component that goes beyond the pragmatic study of these systems. So I'd focus on the following topics:
Differential Equations (useless to do machine learning if you can't understand the model of the world that you're trying to teach the machine)
Digital Signal Processing
(some) Abstract Algebra
If available: automatic classification
Neural Networks are a sort of canonical example of machine learning. However, there's nothing special about them. They're just a network of input-output nodes, with some number of nodes in between. All each node does is compute a number in some way and pass that number on to the next node(s). Where Neural Networks get interesting is the training of them. Well, what do you need to know to train them? You need to understand data analysis/signal processing. You need to know how to select training data. You need to understand linear and nonlinear optimization. You need to understand how to validate the network.
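To make that node-by-node picture concrete, here's a minimal forward-pass sketch in NumPy. The weights are random placeholders (in practice they'd come from training), and the layer sizes are arbitrary:

```python
import numpy as np

# A tiny feedforward network: 3 inputs -> 4 hidden nodes -> 1 output.
# Weights are arbitrary placeholders; real ones come from training.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input-to-hidden weights
W2 = rng.normal(size=(4, 1))   # hidden-to-output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(x @ W1)        # each hidden node computes one number...
    return sigmoid(h @ W2)     # ...and passes it on to the next node(s)

x = np.array([0.5, -1.2, 3.0])
print(forward(x))              # a single number between 0 and 1
```

Every node really is just "compute a number, pass it along" — all the interesting machinery lives in how W1 and W2 get chosen.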
Part of my work involves developing neural network models for automatic classification. Here's a general outline of how I'd approach such a problem:
1.) Develop a feature extraction method from the raw data (frequency components? image histograms? edge detections? means?)
2.) Specify a network structure
3.) Select a representative subset of the data
4.) Compute the hidden layer nodal weights by training the network on the data
5.) Cross-validate the network on the full data set, perhaps using k-fold cross validation, or some other metric
6.) Deploy the network, or networks, in a test environment against virgin data
7.) Evaluate ROC curves and confusion matrices
At each of these steps, I need math learned in different classes. Step 1 is open-ended and depends on the particular problem. Step 2 relies on some EE knowledge and signal processing background. Step 3 involves statistics and probability. Step 4 involves quite a bit of linear algebra, calculus, variational calculus, or differential equations. Step 5 involves statistics and data mining (often an EE course). Step 6 is just business level stuff. Step 7 is again signal processing and probability.
So even a simple, well-known machine learning application involves math from many different fields. Of course, different folks specialize in different aspects of this math. I don't need to go through the nodal weight optimization step by hand. I have software written that does that automatically. But there are still parameters to specify, and there are still edge conditions that can mess up the result. So I do need to understand how the algorithm works, so that I can analyze whether the network I've made is garbage or not.
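The weight optimization the software automates is, at bottom, calculus. A toy gradient-descent sketch shows the idea (one weight, made-up data, arbitrary learning rate and iteration count):

```python
import numpy as np

# Fit a single weight w so that w * x approximates y, by gradient
# descent on the mean squared error. This is a toy version of what
# the training software does across thousands of nodal weights.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                               # the "true" model is w = 2

w, lr = 0.0, 0.05                         # lr is a tunable parameter
for _ in range(200):
    grad = 2 * np.mean((w * x - y) * x)   # d/dw of mean squared error
    w -= lr * grad

print(w)   # converges close to 2.0
```

Pick the learning rate badly and this loop diverges or stalls — exactly the kind of parameter and edge condition that can quietly produce a garbage network if you don't understand what the algorithm is doing.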
Short version: look into the options that the university offers in not just the math department, but also the EE and CompEng departments.