In my current position, I work to optimize and parallelize codes that deal with genomic data: DNA, RNA, proteins, and so on. To keep them universally readable, many of the input files holding DNA samples (called reads) are plain text files consisting of the characters 'A', 'C', 'G', and 'T'.
Real-life datasets can have missing values. Sociological surveys and measurements of complex biological systems are two examples in which the researcher must decide how to handle missing observations. Outliers in a dataset can also be treated as samples that are lost. The Intel® Summary Statistics Library already contains functionality to detect outliers and to obtain robust estimates in the presence of "suspicious" observations.
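The library's own routines are not shown here; as a minimal illustration of the idea, the NumPy sketch below flags outliers with a median/MAD rule and then treats them like missing samples. All names (`mad_outliers`, the threshold) are my own choices for this sketch, not the library's API.

```python
import numpy as np

def mad_outliers(x, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 0.6745 rescales the MAD to be consistent with the standard
    # deviation for normally distributed data (modified z-score).
    z = 0.6745 * (x - med) / mad
    return np.abs(z) > threshold

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 55.0])  # one obvious outlier
mask = mad_outliers(x)
clean_mean = x[~mask].mean()  # estimate after dropping the "lost" sample
```

A median/MAD rule is used instead of a mean/standard-deviation rule precisely because the outlier itself would inflate the standard deviation and mask its own detection.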
Algorithm for parameterization of a correlation matrix. The algorithm transforms an input matrix that lacks positive semidefiniteness into an output matrix that satisfies the properties of a correlation matrix. It is based on the spectral decomposition method and can be used in financial computations.
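As a hedged sketch of the spectral-decomposition approach (not the library's actual routine), one common heuristic is: diagonalize the symmetric input, clip negative eigenvalues, reassemble, and rescale the diagonal back to ones. The function name and the example matrix are my own.

```python
import numpy as np

def spectral_correlation_fix(A, eps=0.0):
    """Turn a symmetric matrix with negative eigenvalues into a valid
    correlation matrix: clip the spectrum, then rescale the diagonal
    to ones (a common spectral-decomposition heuristic)."""
    w, V = np.linalg.eigh(A)      # spectral decomposition A = V diag(w) V^T
    w = np.clip(w, eps, None)     # drop negative eigenvalues
    B = (V * w) @ V.T             # reassemble a positive semidefinite matrix
    d = np.sqrt(np.diag(B))
    C = B / np.outer(d, d)        # rescale so that diag(C) == 1
    return (C + C.T) / 2          # enforce exact symmetry

# An "almost correlation" matrix that is not positive semidefinite:
A = np.array([[ 1.0, 0.9, -0.9],
              [ 0.9, 1.0,  0.9],
              [-0.9, 0.9,  1.0]])
C = spectral_correlation_fix(A)
```

The result is the input projected onto the set of unit-diagonal positive semidefinite matrices, which is exactly the property a correlation matrix must have.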
In my previous posts I described some features and the usage model of the Intel® Summary Statistics Library. However, many available statistical packages provide similar functionality. Does the Intel® Summary Statistics Library make a difference, bringing something new and specific? The answer is yes.
Earlier I computed various statistical estimates, such as the mean or the variance-covariance matrix, using the Intel® Summary Statistics Library. In those cases I knew for sure that my datasets did not contain "bad" observations (points that do not belong to the distribution under study) or outliers. However, in some cases we need to deal with datasets that are contaminated with outliers.
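To see why contamination matters, here is a small plain-NumPy illustration (not the library's API) comparing the classical mean with a robust alternative on a sample containing two gross outliers:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(loc=5.0, scale=1.0, size=1000)
contaminated = np.concatenate([clean, [500.0, 700.0]])  # two gross outliers

mean_bad = contaminated.mean()        # pulled far away from 5 by two points
median_bad = np.median(contaminated)  # barely moves
```

Just two bad points out of a thousand shift the sample mean by more than one full standard deviation, while the median stays near the true location; this is the motivation for the robust estimators mentioned above.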
Today I needed to compute statistical estimates for a dataset in which the observations were weighted and only several components of the random vector had to be analyzed. How often do we face such tasks, and how do we solve them in everyday life? If such problems are rare or small, then using a popular statistical package or developing a data processing program is a proper way to address them. But what if I need to process huge data arrays regularly, for example when analyzing gene expression levels?
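The task just described can be sketched in a few lines of NumPy: a weighted mean restricted to a chosen subset of components of the observed random vector. The function name and data are illustrative only, not the library's interface.

```python
import numpy as np

def weighted_mean_subset(X, w, components):
    """Weighted mean of selected components of vector observations.
    X: (n_observations, dim) array of samples of a random vector;
    w: (n_observations,) observation weights."""
    Xs = X[:, components]            # keep only the components of interest
    w = np.asarray(w, dtype=float)
    return (w[:, None] * Xs).sum(axis=0) / w.sum()

X = np.array([[1.0, 10.0, 100.0],
              [3.0, 30.0, 300.0]])
w = [1.0, 3.0]                       # the second observation counts triple
m = weighted_mean_subset(X, w, [0, 2])  # analyze components 0 and 2 only
```

For huge arrays processed regularly, the point of a dedicated library is that such estimates can be computed in one pass, in parallel, without materializing the discarded components; this sketch only shows the mathematics of the estimate itself.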