# Data Snooping, Mining, Dredging

Data snooping refers to statistical inference that the researcher decides to perform after looking at the data (as contrasted with pre-planned inference, which the researcher plans before looking at the data).

Data snooping can be done professionally and ethically, or misleadingly and unethically, or misleadingly out of ignorance. Misleading data snooping done out of ignorance is a common error in using statistics. The problems with data snooping are essentially the problems of multiple inference.

Source: COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them

**Data Mining** (from Wikipedia)

Data mining (the analysis step of the “Knowledge Discovery and Data Mining” process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

**Data Dredging** (from Wikipedia)

Data dredging (data fishing, data snooping, equation fitting) is the use of data mining to uncover relationships in data that may be misleading because they can arise by chance alone.

The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation. Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of mistaken test results, called the *significance level*. When large numbers of tests are performed, some produce false positives: where no real effect exists, about 5% of the hypotheses tested turn out to be significant at the 5% level, about 1% turn out to be significant at the 1% level, and so on, by chance alone. When enough hypotheses are tested, it is virtually certain that some falsely appear statistically significant, since almost every data set with any degree of randomness is likely to contain some spurious correlations. If they are not cautious, researchers using data mining techniques can be easily misled by these apparently significant results.
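The arithmetic behind this can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the sources quoted above. It assumes the tests are independent and that every null hypothesis is true, so each p-value is uniformly distributed on [0, 1]. The function names `familywise_error_rate` and `simulate_false_positives` are hypothetical.

```python
import random


def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive among independent tests
    when every null hypothesis is actually true (an assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positives(alpha: float, num_tests: int, seed: int = 0) -> int:
    """Count how many true-null tests look 'significant' purely by chance."""
    rng = random.Random(seed)
    # Assumption: under a true null, a continuous test's p-value is
    # uniform on [0, 1], so a uniform draw stands in for running a test.
    p_values = [rng.random() for _ in range(num_tests)]
    return sum(p < alpha for p in p_values)


if __name__ == "__main__":
    alpha = 0.05
    for num_tests in (1, 20, 100):
        rate = familywise_error_rate(alpha, num_tests)
        print(f"{num_tests:>4} tests: P(at least one false positive) = {rate:.3f}")

    hits = simulate_false_positives(alpha, 10_000)
    print(f"{hits} of 10000 pure-noise hypotheses look significant at the 5% level")
```

Under these assumptions, 20 independent tests at the 5% level already carry roughly a 64% chance of at least one spurious "significant" result, and the simulated count of false positives should land near 5% of the hypotheses tested. Real data-mining searches usually involve correlated tests, so the exact figures would differ, but the qualitative point stands.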