Data Science is an interdisciplinary field, combining statistical and computational skills to help solve problems in almost all areas, including industry, commerce, government, and science. There has been a massive increase in the amount of data available from new technologies including network data, image data, and streaming data, and in nearly every facet of life, leading to demand for expertise at the interface of computer and statistical science. Data scientists need skills in data management and organization, statistics and machine learning, and distributed and parallel computing, coupled with excellent communication skills. Statistical reasoning plays a central role, including how to ask questions in order to best extract information from data, how to appropriately apply methods of prediction and estimation for reproducible results, how to quantify uncertainty, how to distinguish between causation and correlation, and how to convey scientific findings through a variety of means including data visualizations.
American Statistical Association’s Statement on The Role of Statistics in Data Science
Students interested in Data Science should consider coupling studies in statistics with studies in computer science. The following third- and fourth-year courses in statistics are particularly applicable to Data Science: Methods of Data Analysis I and II (STA302H1, STA303H1), Design of Scientific Studies (STA305H1), Theory of Statistical Practice (STA355H1), and Statistical Methods for Data Mining and Machine Learning (STA414H1). Other useful courses include Statistical Computation (STA410H1), courses in statistical methods (such as STA437H1 and STA442H1), and courses that apply modern methods of statistical inference to specific types of data (such as Theory and Methods for Complex Spatial Data (STA465H1) and Fundamentals of Statistical Genetics (STA480H1)).
In preparing for a career as a Data Scientist, students should also consider taking part in opportunities to answer questions using large and complex data sets, such as ASA DataFest at the University of Toronto.