Stream detection with machine learning

The Milky Way galaxy is surrounded by dozens of known dwarf galaxies. Due to dynamical friction, such galaxies occasionally fall into the Milky Way. As they do, they are stretched from a nearly spherical shape into an elongated one and eventually develop very long tails of stars, a phenomenon called tidal disruption. The tails of a disrupted dwarf galaxy contain stars streaming from that galaxy’s core out toward the Milky Way’s halo, and they can be observed as lines running across the night sky. These are called tidal streams. Observing them, however, is not a trivial matter. Since dwarf galaxies contain very few stars to begin with (compared to the Milky Way), the number of stars in a tidal stream is very small compared to the background. Tidal streams are certainly too faint to see with the naked eye, and even telescopic data often require careful analysis to reveal their presence. A census of tidal streams is important for understanding the formation history of the Milky Way and, by extension, the assembly of galaxies in general.

In this project, I constructed an algorithm that accepts a sample of N star coordinates, where N can vary from sample to sample, and returns the most likely geometrical parameters of the stream along with additional information (e.g. its relative amplitude and the detection probability).
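To make the input concrete, here is a minimal sketch of how a mock sample with a known stream could be generated. It is an illustration, not the original code: the function name `mock_sample`, the square field, the uniform background, the straight stream with a Gaussian cross-section, and the meaning given to `angle`, `offset`, `width`, and `amplitude` are all assumptions.

```python
import numpy as np

def mock_sample(n_stars, angle, offset, width, amplitude, rng=None):
    """Return an (N, 2) array of star positions in the square [-1, 1] x [-1, 1].

    All modelling choices here are assumptions made for illustration:
    - the background is uniform over the field;
    - the stream is a straight band at `angle` (radians) to the x-axis,
      displaced from the field centre by `offset`, with Gaussian
      cross-section of standard deviation `width`;
    - `amplitude` is the expected fraction of stars drawn from the stream.
    Stream stars that land outside the field are dropped, so N <= n_stars.
    """
    rng = np.random.default_rng(rng)

    # Split the requested stars between stream and background.
    n_stream = rng.binomial(n_stars, amplitude)
    background = rng.uniform(-1.0, 1.0, size=(n_stars - n_stream, 2))

    # Stream coordinates: uniform along the line, Gaussian across it.
    along = rng.uniform(-np.sqrt(2.0), np.sqrt(2.0), size=n_stream)
    across = offset + width * rng.normal(size=n_stream)
    direction = np.array([np.cos(angle), np.sin(angle)])
    normal = np.array([-np.sin(angle), np.cos(angle)])
    stream = along[:, None] * direction + across[:, None] * normal

    # Keep only the stream stars that fall inside the field.
    inside = np.all(np.abs(stream) <= 1.0, axis=1)
    sample = np.concatenate([background, stream[inside]])

    # Shuffle so the row order carries no information about membership.
    return rng.permutation(sample)
```

For example, `mock_sample(1000, angle=0.3, offset=0.1, width=0.05, amplitude=0.2, rng=0)` yields a field in which roughly a fifth of the stars (fewer after clipping) lie along a narrow band.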

At the heart of this algorithm was the directional 2-point correlation function (D2PC), which is essentially a two-dimensional histogram that counts the number of pairs of stars separated by distance d at angle θ. The “background” D2PC is well understood, and any statistical deviations from it can be picked up by a relatively simple artificial neural network (ANN) trained on large amounts of mock data.
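The pair-counting histogram described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the original implementation: the function name `d2pc`, the bin counts, the maximum separation, and the choice to fold pair orientations into [0, π) are assumptions.

```python
import numpy as np

def d2pc(points, n_dist_bins=16, n_angle_bins=16, max_dist=1.0):
    """2D histogram of pair separations d and pair orientations theta.

    `points` is an (N, 2) array of coordinates. Each unordered pair is
    counted once; pairs farther apart than `max_dist` are ignored.
    Binning and angle convention are illustrative assumptions.
    """
    points = np.asarray(points, dtype=float)

    # Indices of all unordered pairs (i < j).
    i, j = np.triu_indices(len(points), k=1)
    delta = points[j] - points[i]

    # Separation of each pair.
    dist = np.hypot(delta[:, 0], delta[:, 1])

    # Orientation of the line joining the pair, folded into [0, pi)
    # because a pair has no preferred direction.
    theta = np.mod(np.arctan2(delta[:, 1], delta[:, 0]), np.pi)

    hist, d_edges, theta_edges = np.histogram2d(
        dist,
        theta,
        bins=(n_dist_bins, n_angle_bins),
        range=((0.0, max_dist), (0.0, np.pi)),
    )
    return hist, d_edges, theta_edges
```

This version builds all N(N-1)/2 pairs in memory, which is fine for a few thousand points but would need chunking or a tree-based pair search for larger samples. A stream should show up as an excess of pairs in the angle bins near its orientation, relative to the histogram of a pure background field.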

Heat map

The figure above shows the predicted (recovered) values versus the true values of the three geometrical parameters: angle, offset, and width. The results are based on 65,536 test inputs.

This was a nice exercise in machine learning that I undertook in 2017 as part of a project with James Taylor from the University of Waterloo, but it was unfortunately never completed and never published. I’m happy to share the code and notes, but there are probably better tools for the task out there.