Science topic

Statistics - Science topic

Statistical theory and its application.
Questions related to Statistics
  • asked a question related to Statistics
Question
2 answers
Relevant answer
Answer
If White privilege is based on societal advantages for white people, it is unlikely to disappear entirely, because societal ideas about race can persist even if everyone looked the same. The goal is to create a society where race doesn't affect opportunity. We can work towards this by promoting diversity and equal treatment.
  • asked a question related to Statistics
Question
4 answers
Hello all,
I am running into a problem I have not encountered before with my mediation analyses. I am running a simple mediation X > M > Y in R.
Generally, I concur that the total effect does not have to be significant for there to be a mediation effect, and in the case I am describing this would be a logical occurrence, since the effects of paths a and b are both significant (-.142 and .140, respectively), thus resulting in a 'null effect' for the total effect.
However, my c path (X > Y) is not merely 'non-significant' as I would expect; rather, the regression does not fit (see below):
(Residual standard error: 0.281 on 196 degrees of freedom; Multiple R-squared: 0.005521; Adjusted R-squared: 0.0004468; F-statistic: 1.088 on 1 and 196 DF; p-value: 0.2982).
Usually I would say you cannot interpret models that do not fit, and since this path is part of my model, I hesitate to interpret the mediation at all. However, the other paths do fit and are significant. Could the non-fitting also be a result of the paths cancelling one another?
Note: I am running bootstrapped results for the indirect effects, but the code does utilize the 'total effect' path, which does not fit on its own, therefore I am concerned.
Note 2: I am working with a clinical sample, so the sample size is not as large as I'd like: group 1: 119; group 2: 79 (N = 198).
Please let me know if additional information is needed and thank you in advance!
Relevant answer
Answer
Somehow it is not clear to me what you mean by "does not fit". Could you please provide the output of the whole analysis? I think this would be helpful.
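For reference, here is a minimal sketch of a simple mediation with bootstrapped indirect and total effects in R, assuming the lavaan package and hypothetical variable names X, M and Y in a data frame dat (not necessarily the code used above):
library(lavaan)
model <- '
  M ~ a * X              # path a
  Y ~ b * M + cp * X     # path b and the direct effect c-prime
  indirect := a * b
  total    := cp + a * b
'
fit <- sem(model, data = dat, se = "bootstrap", bootstrap = 5000)
parameterEstimates(fit, boot.ci.type = "perc")
# The bootstrap CI for 'indirect' can exclude zero even when 'total' does not,
# which is the inconsistent-mediation (cancelling paths) pattern described above.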
  • asked a question related to Statistics
Question
3 answers
Assuming this is my hypothetical data set (attached figure), in which the thickness of a structure was evaluated in the defined positions (1-3) in 2 groups (control and treated). I emphasize that the structure normally increases and decreases in thickness from position 1 to 3. I would also like to point out that each position has data from 2 individuals (samples).
I would like to check if there is a statistical difference in the distribution of points (thickness) depending on the position. Suggestions were to use the 2-sample Kolmogorov-Smirnov test.
However, my data are not absolutely continuous, considering that the position of the measurement matters in this case (and the test ignores this factor, simply ordering all values from smallest to largest and computing the statistic).
In this case, is the 2-sample Kolmogorov-Smirnov test misleading? Is there any other type of statistical analysis that could be performed in this case?
Thanks in advance!
Relevant answer
Answer
Thank you for your reply and suggestions, Mr. Bastu Tahir and Mr. Andrew Paul McKenzie Pegman!
However, it seems that the 2-sample Wilcoxon signed-rank test is for paired data and compares medians, which is not what I need. In fact, the same values (80 in p1 and p3 in the control group, for example) were used only to construct an example: the structure increases and then decreases in thickness along the anteroposterior axis. And this is precisely the difficulty I am facing in the analysis, considering that these tests tend to order the values from smallest to largest, disregarding the position information.
In any case, I'll find out more about the subject. Thank you again.
  • asked a question related to Statistics
Question
4 answers
Dear colleagues
Could you please tell me how to construct a boxplot from a data frame in RStudio?
df9 <- data.frame(Kmeans= c(1,0.45,0.52,0.54,0.34,0.39,0.57,0.72,0.48,0.29,0.78,0.48,0.59),hdbscan= c(0.64,1,0.32,0.28,0.33,0.56,0.71,0.56,0.33,0.19,0.53,0.45,0.39),sectralpam=c(0.64,0.31,1,0.48,0.24,0.32,0.52,0.66,0.32,0.44,0.28,0.25,0.47),fanny=c(0.64,0.31,0.38,1,0.44,0.33,0.48,0.73,0.55,0.51,0.32,0.39,0.57),FKM=c(0.64,0.31,0.38,0.75,1,0.26,0.55,0.44,0.71,0.38,0.39,0.52,0.53), FKMnoise=c(0.64,0.31,0.38,0.75,0.28,1,0.42,0.45,0.62,0.31,0.25,0.66,0.67), Mclust=c(0.64,0.31,0.38,0.75,0.28,0.46,1,0.36,0.31,0.42,0.47,0.66,0.53), PAM=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,1,0.73,0.43,0.39,0.26,0.41) ,
AGNES=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,1,0.31,0.48,0.79,0.31), Diana=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,0.42,1,0.67,0.51,0.43),
zones2=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,0.42,0.45,1,0.69,0.35),
zones3=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,0.42,0.45,0.59,1,0.41),
gsa=c(0.64,0.31,0.37,0.75,0.28,0.46,0.58,0.55,0.42,0.45,0.59,0.36,1), method=c("kmeans", "hdbscan", "spectralpam", "fanny", "FKM","FKMnoise", "Mclust", "PAM", "AGNES", "DIANA","zones2","zones3","gsa"))
head(df9)
library(dplyr)
df9 <- df9 %>% mutate(across(-method, ~ as.numeric(as.character(.))))  # keep the 'method' column as text
Thank you very much
Relevant answer
Answer
Dear Valeriia Bondarenko
First you need to install the "ggplot2" and "reshape2" packages and load both libraries.
library(ggplot2)
library(reshape2)
# Then melt the data frame so that the method columns are stacked into a single value column
df9_melted <- melt(df9, id.vars = "method")
# For the boxplot
ggplot(df9_melted, aes(x = method, y = value)) +
  geom_boxplot() +
  labs(x = "Method", y = "Value", title = "Boxplot of methods")
  • asked a question related to Statistics
Question
10 answers
In the domain of clinical research, where the stakes are as high as the complexities of the data, a new statistical aid emerges: bayer: https://github.com/cccnrc/bayer
This R package is not just an advancement in analytics - it’s a revolution in how researchers can approach data, infer significance, and derive conclusions
What Makes `Bayer` Stand Out?
At its heart, bayer is about making Bayesian analysis robust yet accessible. Born from the powerful synergy with the wonderful brms::brm() function, it simplifies the complex, making the potent Bayesian methods a tool for every researcher’s arsenal.
Streamlined Workflow
bayer offers a seamless experience, from model specification to result interpretation, ensuring that researchers can focus on the science, not the syntax.
Rich Visual Insights
Understanding the impact of variables is no longer a trudge through tables. bayer brings you rich visualizations, like the one above, providing a clear and intuitive understanding of posterior distributions and trace plots.
Big Insights
Clinical trials, especially in rare diseases, often grapple with small sample sizes. `Bayer` rises to the challenge, effectively leveraging prior knowledge to bring out the significance that other methods miss.
Prior Knowledge as a Pillar
Every study builds on the shoulders of giants. `Bayer` respects this, allowing the integration of existing expertise and findings to refine models and enhance the precision of predictions.
From Zero to Bayesian Hero
The bayer package ensures that installation and application are as straightforward as possible. With just a few lines of R code, you’re on your way from data to decision:
# Installation
devtools::install_github("cccnrc/bayer")
# Example usage: Bayesian logistic regression
library(bayer)
model_logistic <- bayer_logistic(
  data = mtcars,
  outcome = 'am',
  covariates = c('mpg', 'cyl', 'vs', 'carb')
)
You then have plenty of functions to further analyze your model; take a look at bayer.
Analytics with An Edge
bayer isn’t just a tool; it’s your research partner. It opens the door to advanced analyses like IPTW, ensuring that the effects you measure are the effects that matter. With bayer, your insights are no longer just a hypothesis — they’re a narrative grounded in data and powered by Bayesian precision.
Join the Brigade
bayer is open-source and community-driven. Whether you’re contributing code, documentation, or discussions, your insights are invaluable. Together, we can push the boundaries of what’s possible in clinical research.
Try bayer Now
Embark on your journey to clearer, more accurate Bayesian analysis. Install `bayer`, explore its capabilities, and join a growing community dedicated to the advancement of clinical research.
bayer is more than a package — it’s a promise that every researcher can harness the full potential of their data.
Explore bayer today and transform your data into decisions that drive the future of clinical research: bayer - https://github.com/cccnrc/bayer
Relevant answer
Answer
Many thanks for your efforts!!! I will try it out as soon as possible and will provide feedback on github!
All the best,
Rainer
  • asked a question related to Statistics
Question
5 answers
What may be a good, strong and convincing example demonstrating the power of copulas by uncovering some not obvious statistical dependencies?
I am especially interested in the example contrasting copula vs a simple calculation of a correlation coefficient for the original distributions.
Something like this - the (properly normalized) correlation coefficient of components of a bivariate distribution does not suggest a strong statistical dependence between them, but the copula distribution of these two components shows a clear dependence between them (possibly manifested in the value of a correlation coefficient calculated for the copula distribution?). Or the opposite - the correlation coefficient of the original bivariate distribution suggests strong dependence, but its copula shows that the statistical dependence is "weak", or just absent.
Mostly interested in an example described in terms of formulae (so that the samples could be generated, e.g. in MATLAB), but if somebody can point to the specific pre-generated bivariate distribution dataset (or its plots), that will work too.
Thank you!
Relevant answer
Answer
I used the two sets of bivariate normal distributions generated by you, [x y1] and [x y2], representing strong and weak dependencies, calculating empirical copula distributions, [U V1] and [U V2] and also Pearson correlation coefficients for all bivariate distribution samples, then plotting them. The MATLAB / Octave script and the plots are attached.
I still do not understand what advantage the empirical copula distributions provide in this example. The weak and strong dependence are evident from the original plots of the samples of [x y1] and [x y2] and from the values of the Pearson correlation coefficients for them. The Pearson correlation coefficients of the copulas are rather close to the same coefficients of the original distributions. Yes, the manifestation of weak dependence for the copula [U V2] looks different (samples of the copula distribution look like those from a uniform bivariate distribution) compared to the manifestation of the same in the original samples [x y2]. But why is this difference important?
Am I mis-interpreting your example, or am I missing something in the interpretation of the results?
Thank you,
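For what it is worth, here is a minimal R sketch (made-up data, base R only) of the kind of case where the copula view adds something the plain correlation coefficient misses: a strong but nonlinear, symmetric dependence with near-zero Pearson correlation that becomes obvious after rank-transforming to pseudo-observations.
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- x^2 + 0.2 * rnorm(n)        # strongly dependent, but not linearly
cor(x, y)                        # Pearson correlation is close to 0
# Empirical copula: rank-transform each margin to (0, 1)
u <- rank(x) / (n + 1)
v <- rank(y) / (n + 1)
par(mfrow = c(1, 2))
plot(x, y, pch = ".", main = "Original sample")
plot(u, v, pch = ".", main = "Empirical copula (pseudo-observations)")
# The copula plot is far from the uniform scatter that independence would give,
# even though the correlation coefficient suggests almost no dependence.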
  • asked a question related to Statistics
Question
5 answers
I have come across some ideas which I hope will be pursued by academics in the associated disciplines:
1) Topics Broadly outlined in the following articles
Fathabadi OS (2022) Voluntary Selection; Bringing Evolution at the Service of Humanity. Scientific J Genet Gene Ther 8(1): 009-015.DOI: http://dx.doi.org/10.17352/sjggt.000021
Fathabadi OS (2023) The way of future through voluntary selection. Glob J Ecol 8(1): 034-041. DOI: 10.17352/gje.000079
2) Explanation of how violence is the link between evolution and society. What is not clear in the articles above is that, just as a disease shows symptoms such as fever in its early stages, violence caused by the lack of evolutionary order in society also shows itself first in the form of problems like bullying, harassment, discrimination, passive aggression, coercive control, etc.; it may then appear as an increase in the rate of crimes and social riots, and finally as naked violence, civil wars, genocides and foreign wars. The purpose of the topics raised in both articles is to realize the evolutionary order through the methods proposed in control engineering and by achieving the desired statistical goals, not only to prevent violence, but also to spread satisfaction in society and to provide internal and external security.
3) Interpretation of historical events including genocides and world wars through the lens of insights provided above for example how passive aggression became widespread prior to such events.
4) The articles above argue that Control Theory, in conjunction with data science, can help establish and maintain democracies with an unprecedented level of stability and provide optimal levels of living standards by regulating the evolutionary health of societies and the relationships between them. This goal, however, requires interpretation of a theory widely used in engineering for application in the humanities. Control Engineering remains a rather challenging field to learn, and its theoreticians and practitioners are mostly active in areas other than the humanities - most critically politics - and even despite the large number of engineering graduates receiving education in Systems and Control Theory as part of their curriculum, the specific practical pathway for applying the theory to problems in the humanities remains rather unclear; let alone the inaccessibility of the field to practitioners of the humanities, even to the extent necessary to allow them to express their problems in terms approachable by control engineering experts.
Projects should therefore be defined to develop highly targeted educational materials and tools which provide a shortcut to applying the theory to practical problems in the humanities. It is particularly important to understand that the type of Control Theory applicable to the humanities would be "Non-Linear Time-Varying Control", which is an extension of the Control Theory relevant to most engineering problems; as such, the specific theory will need to be developed beyond what a classically educated expert in Control Engineering would be comfortably equipped with, even at PhD level. For example, the types of time constants and uncertainties relevant in the humanities, the types of irregularities which can be created by the collective mis-intentions of social groups, as well as the moral implications of such techniques for human populations, extend to matters associated with politics, biology, culture, education, and media.
It is also important to understand that the knowledge and tools will not act as a pill that fixes problems; rather, through insights, they provide directions and "Decision Support Systems" with local applicability and temporal relevance which, despite their limitations, will provide unprecedented powers for providing better living standards for all populations and individuals. Such tools need to be maintained in order to remain relevant as the system under control changes over time. The developed materials will help students, researchers and practitioners get started and up to speed with the expertise through minimal training or self-study, and the impact, I believe, would be revolutionary. In addition to Systems and Control Theory, the mentioned books/educational materials/tools may also cover topics such as Programming, Differential Equations, Numerical Optimisation, System Identification, Artificial Intelligence and, to some extent, Statistics. This set of topics may one day help establish an independent field of study, namely "Humanities Engineering".
5) The ultimate goal is the application of the theory, educational materials, and tools mentioned above to solve practical problems in sociology, psychology, politics and other fields of the humanities, and such projects make great topics for research in universities, think tanks, and governments. It is important to mention that using the mean values and standard deviations defining the phenotypic profiles of populations is a way of taking into account the differences between populations when pursuing desirable results within them and when regulating the relationships between them.
6) It is also possible to characterise discourses, and broadcasted contents by defining relevant indices that quantify various aspects of them and then model and predict their relationship with the outcomes in society. These models can then act as decision support systems to identify and implement adjustments for achieving the desired social outcomes. If sufficiently predictive, they can also be used in combination with other models or in isolation as part of the control loops associated in Control Theory in order to achieve the desired outcomes in terms of sustainability, social stability, freedoms, economic welfare, health, national security, psychological security, gender equality, and optimal levels of happiness in all individuals and social groups, etc.
7) It is possible to compile a set of contents including matters mentioned in the articles above, to act as a mental anchorage for people. Something that is scientifically proven, convincing and understandable by those who put in the effort and allows them to remain motivated, morally directed, socially responsible and supported and mentally healthy. I believe adding a content starting from Genetics explaining how "Life is a complex Product of Nature" and how "Survival and Reproduction are Complex Interpretations of Laws of Nature" is necessary. Topics such as cosmology and Quantum Physics can also be included.
8) It is possible to define a project on optimal forms of democracy for different populations, with emphasis on the fact that people's choices represent their interests as they understand them as individuals, while much of what comprises our existing living standards, or is necessary for achieving higher standards of living, is a result of societies and the mechanisms maintaining them. Societies were formed by cultures/religions as an aftermath of painful evolutionary events which occurred when people pursued their personal interests, and emerged as optimal ways of achieving a better average standard of living for larger numbers of individuals over a larger proportion of their lives. In other words, many aspects of our existing living standards are by-products of societies and could not be achieved or maintained by pursuing only our individual choices, which is what a democracy guarantees. Democracies should be pursued for optimum living standards and for preventing abuse; however, it is necessary to have democracies in place to guarantee the maintenance of society itself (what you can call an evolutionary order) and to realise/maintain its desired standards of living, and this should not be compromised by the choices of individuals.
9) Ethics of Voluntary Selection and of applying methods from Control Theory and AI to problems in the humanities, especially to prevent abuse of individuals and minorities under the flag of the interests of society, and to prevent creating senseless scientific justifications for imposing disadvantage on individuals and social groups. It is also necessary to minimise the pain imposed on individuals and social groups during transitions that result from adjustments.
10) While the first article introduces the concept of "Voluntary Selection" and a methodology to use it in a calculated way, it only acts as a beginning and if it is going to be implemented, a huge methodological and experimental effort is needed for identifying relevant phenotypes, developing phenotypic maps for distinct populations, identifying the results when choosing donor and receiver populations, and developing tools to predict and monitor the progress of such programs besides studying the implications for society, economy and beyond. Research can also dig deeper and take into account genotype-phenotype relationships in achieving the desired results.
Relevant answer
Answer
Alexander Ohnemus Yes, quite relevant.
  • asked a question related to Statistics
Question
5 answers
I want to estimate the half-life value for the virus as a function of strain and concentration, and as a continuous function of temperature.
Could anybody tell me how to calculate the half-life value in R?
I have attached a CSV file of the data
Relevant answer
Answer
Estimating the half-life of a virus involves understanding its stability and decay rate under specific environmental or biological conditions. This is a crucial parameter in virology, impacting everything from the design of disinfection protocols to the assessment of viral persistence in the environment or within a host. Here's a structured approach to estimating the half-life values for a virus:
  1. Defining Conditions: Environment: Specify the environmental conditions such as temperature, humidity, UV exposure, and presence of disinfectants, as these factors significantly affect viral stability. Biological: In biological systems, consider the impact of host factors such as immune response, tissue type, and presence of antiviral agents.
  2. Experimental Setup: Sampling: Begin by preparing a known concentration of the virus under controlled conditions. Time Points: Collect samples at predetermined time points that are appropriate based on preliminary data or literature values suggesting the expected rate of decay.
  3. Quantitative Assays: Plaque Assay: One of the most accurate methods for quantifying infectious virus particles. It measures the number of plaque-forming units (PFU), which reflect viable virus particles. PCR-Based Assays: These can measure viral RNA or DNA but do not distinguish between infectious and non-infectious particles. Adjustments or complementary assays might be required to correlate these results with infectivity. TCID50 (Tissue Culture Infective Dose): This assay determines the dilution of virus required to infect 50% of cultured cells, providing another measure of infectious virus titer.
  4. Data Analysis: Plot Decay Curves: Use logarithmic plots of the viral titer (e.g., PFU/mL or TCID50/mL) against time. The decay of viral concentration should ideally follow first-order kinetics in the absence of complicating factors. Calculate Half-Life: The half-life can be calculated from the slope (k) of the linear portion of the decay curve on a logarithmic scale: t1/2 = ln(2) / k. Statistical Analysis: Ensure statistical methods are used to analyze the data, providing estimates of variance and confidence intervals for the half-life.
  5. Validation and Replication: Replicate Studies: Conduct multiple independent experiments to validate the half-life estimation. Variability in viral preparations and experimental conditions can affect the reproducibility of results. Peer Review: Consider external validation or peer review of the methodology and findings to ensure robustness and accuracy.
  6. Interpretation and Application: Contextual Interpretation: Understand that the estimated half-life is context-specific. Results obtained under laboratory conditions may differ significantly from those in natural or clinical settings. Application in Risk Assessment: Use the half-life data to inform risk assessments, disinfection strategies, or predictive modeling of viral spread and persistence.
By meticulously following these steps and ensuring the precision of each phase of the process, one can accurately estimate the half-life of a virus under specific conditions. This information is essential for developing effective control strategies and understanding the dynamics of viral infections.
Perhaps this protocol list can give us more information to help solve the problem.
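As a concrete starting point, here is a minimal R sketch of the log-linear (first-order decay) approach, assuming a hypothetical data frame virus with columns titer, time_h, strain and temperature (adjust the names to the attached CSV):
# First-order decay: log(titer) is linear in time, with slope -k
fit <- lm(log(titer) ~ time_h, data = virus)
k <- -coef(fit)["time_h"]        # decay rate per hour
half_life <- log(2) / k
half_life
# Half-life by strain and as a continuous function of temperature:
# let the time slope (decay rate) vary with strain and temperature
fit2 <- lm(log(titer) ~ time_h * strain + time_h:temperature, data = virus)
summary(fit2)
# The strain- and temperature-specific half-life is then log(2) divided by the
# corresponding combination of the time_h coefficients.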
  • asked a question related to Statistics
Question
3 answers
I am studying leadership style's impact on job satisfaction. In the data collection instrument, there are 13 questions on leadership style divided into a couple of leadership styles. On the other hand, there are only four questions for job satisfaction. How do I run correlational tests on these variables? What values do I select to analyze in Excel?
Relevant answer
Answer
First, you need to compute the correlation between your target variable and each of your potential independent variables. After checking which independent variables are the most correlated with your target variable (as mentioned earlier, a correlation coefficient closest to -1 or +1), and once you have decided, according to these correlation coefficients, which variables to select for your model, you need to ensure that there will be no multicollinearity in your model. To ensure that, you run correlation tests again between each pair of independent variables. If two independent variables are too highly correlated, you should introduce only one of them into your model (e.g. the variable which had the higher correlation with your dependent variable).
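If you later move the scored data from Excel into R, a minimal sketch of that procedure could look like this (hypothetical column names; each column is a scale mean computed from its questionnaire items):
# dat: one row per respondent, columns = scale means, e.g.
# satisfaction, transformational, transactional, laissez_faire
round(cor(dat), 2)                            # correlation matrix
cor(dat$satisfaction, dat$transformational)   # target variable vs one predictor
# Multicollinearity check after fitting the model (car package):
fit <- lm(satisfaction ~ transformational + transactional + laissez_faire, data = dat)
car::vif(fit)                                 # VIFs well above ~5 flag redundant predictors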
  • asked a question related to Statistics
Question
63 answers
I explain here the connection between the pre-scientific Law of Universal Causality and all sorts of statistical explanations in physical sciences. The way it takes may look strange, but it will be interesting enough to consider.
To repeat in short what is already said a few times: by all possible assumptions, to exist (which is To Be with respect to Reality-in-total) is non-vacuous. Hence, any existent must have Extension, have finite-content parts. These parts, by the only other possible assumption, must yield impacts on other parts both external and internal. This is Change.
These impacts are always finite in the content and measured extents. The measured extents of Extension and Change are space and time. Without measurements we cannot speak of space and time as existing or as pertaining to existents. What pertain to all existents as most essential are Extension and Change. Existence in Extension and Change means that finitely extended objects give origin to finite impacts. This is Causality. Every existent is Extension-Change-wise existent, and hence everything is causal.
As pertinents to existents, Extension and Change are the most applicable qualities / universals of the group of all entities, i.e., Reality-in-total, because they belong to all that exist. Since Extension and Change are not primarily in our minds, let us call them ontological universals. As is clear now, Extension and Change are the widest possible and most general ontological universals. All universals are pure qualities. All qualities other than ontological universals are mixtures of pure qualities.
There are physical-ontological universals / qualities that are not as universal as Extension and Change. ‘Colouredness’ / ‘being coloured’, ‘redness’, ‘unity’ / ‘being a unit’, ‘being malleable’, ‘being rigid’, etc. are also pure qualities. These are pertinents not merely of one existent process. They belong to many. These many are a group of existent processes of one kind, based on the one classification quality. Such groups of Extension-Change-wise existent entities are termed natural kinds.
Ontological universals can be reflected in minds too, but in very meagre ways, not always, and not always to the same extent of correspondence with ontological universals, because they are primarily in existent processes. A direct reflection is impossible. The many individuals who get them reflected meagrely formulate them differently.
The supposed common core of ontological universals in minds is a pure notion, but they are mere notions idealized by minds. These ideals are also not inherited of the pertinent ontological universals of all relevant existent things, but at least by way of absorption from some existents, in whatever manner of correspondence with ontological universals. I call them connotative universals, because they are the pure aspects of the conceptual activity of noting objectual processes together.
In brains connotative universals can show themselves only as a mixture of the relevant connotative universals and the relevant brain elements. Please note that this is not a brain-scientific statement. It is the best imaginable philosophical common-sense on the brain-scientific aspect of the formation of connotative universals, and hence it is acceptable to all brain scientists. In brains there are processes that define such activities. But it needs only to be accepted that these processes too are basically of Extension-Change-wise existence, and hence are causal in all senses.
Connotatives are just representations of all kinds of ontological universals. Connotatives are concatenated in various ways in connection with brain elements – in every case highly conceptually and symbolically. These concatenations of connotatives among themselves are imaginations, emotions, reflections, theories, etc., as considered exclusively in the mind.
Note here also that the lack of exact correspondence between ontological and connotative universals is what makes ALL our statements essentially statistical and non-exact at the formation of premises and at the jump from premises into conclusions. The statistical aspect here is part of the process of formation, by brains, of connotatives from ontological universals. This is the case in every part of imaginations, emotions, reflections, theories, etc., even when statistical measurements are not actually being made part of the inquiry as a matter of mentally guided linguistic and mathematical procedures.
Further, connotative universals are formulated in words expressed as terms, connected with connectives of processes, and concatenated in statements. These are the results of the symbolic functioning of various languages including mathematics. These are called denotative universals and their concatenations. All symbolic activities function at this level.
Now coming to statistics as an applied expression of mathematics. It is nothing but denotative universals concatenated in a quantitatively qualitative manner. Even here there is a lot of lack of exactness, which are known as uncertainty, randomness, etc. Pay attention to the fact that language, mathematics, and its statistical part work at the level of denotative universals and their concatenations. These are naturally derived from the conglomerations of ontological universals via concatenations of connotatives and then translated with further uncertainties unto denotative concatenations.
Causation works at the level of the conglomerations of ontological universals, which are in existent things themselves. That is, statistical connections appear not at the ontological level, but at the denotative level. When I say that this laptop is in front of me, there is a directness of acceptance of images from the ontological universals and their conglomerations into the connotative realm of connotations and from there into the denotative realm of connotations. But in roundabout conclusions regarding causal processes at the physical-ontological level into the statistical level, the amount or extent of directness of judgement is very much lacking.
Relevant answer
  • asked a question related to Statistics
Question
4 answers
What is the specific importance of a bachelor’s degree in the hiring process?
Relevant answer
Answer
A bachelor's degree signals foundational knowledge and transferable skills to employers, making it a plus in many fields. It can also be a screening tool for employers. However, its importance varies. Some professions require a specific degree, while experience or alternative credentials like certifications might be valued more in others. Overall, a degree can be an asset but isn't always the only thing that matters for getting hired.
  • asked a question related to Statistics
Question
5 answers
Why parsimoniously does fertility negatively correlate with socioeconomic status? How?
Relevant answer
Answer
Overall, the negative correlation between fertility and socioeconomic status can be attributed to a combination of economic, educational, cultural, and structural factors that shape individuals' reproductive choices and opportunities. Understanding these mechanisms is essential for policymakers, healthcare providers, and social scientists seeking to address disparities in reproductive health outcomes and promote equitable access to family planning resources.
  • asked a question related to Statistics
Question
4 answers
Hi,
I am hoping to get some help on what type of statistical test to run to validate my data. I have run 2 ELISAs with the same samples for each test. I did perform a Mann-Whitney U-test to compare the groups, and my results were good.
However, my PI wants me to also run a statistical test to determine that there wasn't any significant difference in the measurement of each sample between the runs. He wants to know that my results are concordant/reproducible.
I am trying to compare each sample individually, and since I don't have 3 data points, I can't run an ANOVA. What types of statistical tests will give me that information? Also, is there a test that will run all the samples simultaneously but only compare within the same sample?
For example, if my data looked like this.
A: 5, 5.7
B: 6, 8
C: 10, 20
I need a test to determine if there is any significant difference between the values for samples A, B, and C separately and not compare the group variance between A-C.
Relevant answer
Answer
If you want to see how comparable the results from the two ELISAs are, simply plot the results of the first ELISA against those of the second ELISA.
Another option is to make a mean-difference plot (aka "Bland-Altman plot"): plot the differences between the ELISA results against the mean of the ELISA results.
Doing a statistical test and interpreting a non-significant result as "there is no difference" or "the groups/runs/ELISAs are comparable" is logically flawed and complete nonsense. Don't do this, ever!
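To illustrate with the toy numbers from the question (A: 5 / 5.7, B: 6 / 8, C: 10 / 20), a minimal R sketch of both plots could look like this:
run1 <- c(5, 6, 10)      # samples A, B, C in ELISA run 1
run2 <- c(5.7, 8, 20)    # the same samples in ELISA run 2
# Agreement plot: run 2 against run 1, with the identity line
plot(run1, run2, xlab = "ELISA run 1", ylab = "ELISA run 2")
abline(0, 1, lty = 2)
# Bland-Altman (mean-difference) plot
m <- (run1 + run2) / 2
d <- run1 - run2
plot(m, d, xlab = "Mean of the two runs", ylab = "Difference (run 1 - run 2)")
abline(h = mean(d))
abline(h = mean(d) + c(-1.96, 1.96) * sd(d), lty = 2)   # limits of agreement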
  • asked a question related to Statistics
Question
2 answers
Hello everyone,
I am currently undertaking a research project that aims to assess the effectiveness of an intervention program. However, I am encountering difficulties in locating suitable resources for my study.
Specifically, I am in search of papers and tutorials on multivariate multigroup latent change modelling. My research involves evaluating the impact of the intervention program in the absence of a control group, while also investigating the influence of pre-test scores on subsequent changes. Additionally, I am keen to explore how the scores differ across various demographic groups, such as age, gender, and knowledge level (all measured as categorical variables).
Although I have come across several resources on univariate/bivariate latent change modelling with more than three time points, I have been unable to find papers that specifically address my requirements—namely, studies focusing on two time points, multiple latent variables (n >= 3), and multiple indicators for each latent variable (n >= 2).
I would greatly appreciate your assistance and guidance in recommending any relevant papers, tutorials, or alternative resources that pertain to my research objectives.
Best,
V. P.
Relevant answer
Answer
IYH Dear Vivian Parker
Ch. 19 Muthén, B. Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (ed.), Handbook of quantitative methodology for the social sciences. Newbury Park, CA: Sage.
Although this reference does not concentrate exclusively on two-time-point cases, it does contain discussions revolving around multiple latent variables and multiple indicators for those latent constructs. https://users.ugent.be/~wbeyers/workshop/lit/Muthen%202004%20LGMM.pdf
It contains rich content concerning latent growth curve models and elaborates on multivariate implementations.
While conceptually broader, it presents the crucial components necessary for building and applying two-time-point, multivariate latent change models.
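As a starting point (this sketch is mine, not from the reference above), a minimal two-wave latent change score model for a single construct in lavaan might look like the following; it assumes hypothetical indicators x1-x3 measured at both time points, and a full model would still need intercept equality constraints, a mean structure, and extension to your several constructs:
library(lavaan)
model <- '
  F1 =~ 1*x1_t1 + l2*x2_t1 + l3*x3_t1    # construct at pre-test
  F2 =~ 1*x1_t2 + l2*x2_t2 + l3*x3_t2    # same construct at post-test
  F2 ~ 1*F1                              # autoregression fixed at 1
  dF =~ 1*F2                             # latent change factor (F2 - F1)
  F2 ~~ 0*F2                             # all change variance goes to dF
  dF ~ F1                                # pre-test level predicts change
'
fit <- sem(model, data = dat, group = "gender")   # multigroup by a demographic variable
summary(fit, fit.measures = TRUE)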
  • asked a question related to Statistics
Question
8 answers
I want to examine the relationship between school grades and self-esteem and was planning to do a linear regression analysis.
Here's where my problem is. I have three more variables: socioeconomic status, age and sex. I wanted to treat those as moderator variables, but I'm not sure if that's the best solution. Maybe a multiple regression analysis would be enough? Or should I just control for those variables?
Also if I'd go for a moderation analysis, how'd I go about analysing with SPSS? I can find a lot of videos about moderation analysis, but I can't seem to find cases with more than one moderator.
I've researched a lot already but can't seem to find an answer. Also my statistic skills aren't the best so maybe that's why.
I'd be really thankful for your input!
Relevant answer
Answer
Hi Daniel Wright. Sure, I'm fine with just calling it an interaction. I'm just saying that if one wanted to use some other term, I prefer effect modification over moderation because it is neutral with respect to the nature of the interaction.
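For illustration of the model asked about above (not SPSS syntax): with several moderators this is simply a multiple regression with product terms. A hedged R sketch, assuming a data frame dat with columns self_esteem, grades, ses, age and sex:
dat$grades_c <- scale(dat$grades, scale = FALSE)   # mean-center the focal predictor
fit <- lm(self_esteem ~ grades_c * ses + grades_c * age + grades_c * sex, data = dat)
summary(fit)   # each grades_c:... term is one interaction (moderation / effect modification)
# In SPSS, the same model can be run via Analyze > Regression > Linear after computing
# the centered predictor and the three product terms, or via Hayes's PROCESS macro.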
  • asked a question related to Statistics
Question
12 answers
I recently had a strange question from a non-statistician about confidence intervals. His understanding is that all the sample values used to calculate the confidence interval should lie within that interval. I have tried my best to answer him, but couldn't convince him. Is there a good way to explain why this need not be the case, and that this is not what a confidence interval is for? How would you handle this question?
Thanks in advance.
Relevant answer
Answer
If we use the example of the CI for the population mean: You may argue that the estimate should become "better" (more precise) when more data (information) is available. So the expected width of the CI should decrease with sample size. By choosing an arbitrary large sample size you can get arbitrarily small expected CIs. But the sample size has no effect on the variance of the data itself.
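A small simulation can make this tangible: the 95% CI is a statement about the mean, not a region meant to cover the observations. A minimal R sketch with made-up data:
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)
ci <- t.test(x)$conf.int
ci                              # a narrow interval around the sample mean
mean(x >= ci[1] & x <= ci[2])   # only a small fraction of the raw observations fall
                                # inside it, and that fraction shrinks as n grows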
  • asked a question related to Statistics
Question
1 answer
Relevant answer
Answer
Metaphysics is the branch of philosophy that deals with the fundamental nature of reality. There are a few metaphysical ideas that could potentially end stratification, or the division of people into different social classes. One idea is the concept of social justice.
This is the idea that all people are equal and deserve to be treated fairly. Another idea is the concept of social mobility: the idea that people should have the opportunity to move up or down the social ladder based on their own efforts.
Another challenge is that some people may be unwilling to share power or resources with others.
  • asked a question related to Statistics
Question
3 answers
Respectfully, across reincarnation belief and scientific materialism, why is considering the individual self, as an illusion, a commonality? 1)
Relevant answer
Answer
I can only address this question with mathematical structures. The individual self is a multi-dimensional manifold embedded within a much, much larger manifold of infinite dimensions. One may think of it as a vector space of tremendous size. As vast as it is, a human existence is but a small subspace of the infinite-dimensional manifold. When released from physical existence, the aspects of the individual self convolve with the larger space. In some sense, you may refer to that as the commonality.
  • asked a question related to Statistics
Question
7 answers
Dear all,
I am sharing the model below that illustrates the connection between attitudes, intentions, and behavior, moderated by prior knowledge and personal impact perceptions. I am seeking your input on the preferred testing approach, as I've come across information suggesting one may be more favorable than the other in specific scenarios.
Version 1 - Step-by-Step Testing
Step 1: Test the relationship between attitudes and intentions, moderated by prior knowledge and personal impact perceptions.
Step 2: Test the relationship between intentions and behavior, moderated by prior knowledge and personal impact perceptions.
Step 3: Examine the regression between intentions and behavior.
Version 2 - Structural Equation Modeling (SEM)
Conduct SEM with all variables considered together.
I appreciate your insights on which version might be more suitable and under what circumstances. Your help is invaluable!
Regards,
Ilia
Relevant answer
Answer
Ilia, some thoughts on your model. According to your path diagram you have 4 moderator effects. For such a large model, you need a large sample size to detect all moderator effects simultaneously. Do you have a justification for all of these nonlinear relationships?
Some relationships in the path diagram are missing. First, prior knowledge, personal impact, and attitude should be correlated - these are the predictor variables. Second, prior knowledge and personal impact should have direct effects on the dependent variables behavioral intentions and behavior (this is necessary).
As this model is quite complex, I would suggest starting with the linear model. If this model fits the data well, then I would include the interaction effects one by one. Keep in mind that you need to use a robust estimation method for parameter estimation because of the interaction effects. If these effects exist in the population, then behavioral intentions and behavior should be non-normally distributed.
Kind regards, Karin
  • asked a question related to Statistics
Question
2 answers
ResearchGate does a pretty good job of tracking publication analytics such as reads and citations over time. The recommendations feature can also be an interesting indicator for a publication's resonance with the scholarly community.
This progress allows ideas to be developed about how to make the analytics features even better in the future. Here are some ideas I have been thinking about:
  • something equivalent to Altmetric that tracks social media mentions across multiple platforms and mentions in news articles, conference proceedings, etc.
  • more longitudinal data for individual publications by month and year
  • the ability to compare the performance of one's own publications, with perhaps a way to rank them in analytic reports by reads, citations, etc.
  • More specific analytics to allow for comparisons within and between departments on an individual and collective basis, which can be sorted by discipline, field, etc.
Are there any additional analytics features that you would like to see on ResearchGate?
Relevant answer
Answer
That's a fair point. I am interested in metrics like reads because my research focuses on student government, and I have noticed that a lot of my readers appear to be students. These students may not yet be publishing academic works, or they may be in a different discipline, but even though they are not formally citing the publications in journals, they may be using them to help with student government activities in practice. Keeping an eye on reads and the countries where the readers are may give indications about new student-led initiatives that are emerging. I'm curious about what the potential impacts could be and how this may correspond with changes in metrics like reads over time.
  • asked a question related to Statistics
Question
7 answers
Meta-analyses and systematic reviews seem to be the shortcut to academic success, as they usually have a better chance of getting published in accredited journals, are read more, and bring home a lot of citations. Interestingly enough, apart from being time-consuming, they are very easy; they are actually nothing but carefully followed protocols of online data collection and statistical analysis, if any.
The point is that most of this can be easily done (at least in theory) by a simple computer algorithm. A combination of if/then statements would simply allow the software to decide on the statistical parameters to be used, not to mention more advanced approaches that can be available to expert systems.
The only part needing a much more advanced algorithm like a very good artificial intelligence is the part that is supposed to search the articles, read them, accurately understand them, include/exclude them accordingly, and extract data from them. It seems that today’s level of AI is becoming more and more sufficient for this purpose. AI can now easily read papers and understand them quite accurately. So AI programs that can either do the whole meta-analysis themselves, or do the heavy lifting and let the human check and polish/correct the final results are on the rise. All needed would be the topic of the meta-analysis. The rest is done automatically or semi-automatically.
We can even have search engines that actively monitor academic literature, and simply generate the end results (i.e., forest plots, effect sizes, risk of bias assessments, result interpretations, etc.), as if it is some very easily done “search result”. Humans then can get back to doing more difficult research instead of putting time on searching and doing statistical analyses and writing the final meta-analysis paper. At least, such search engines can give a pretty good initial draft for humans to check and polish them.
When we ask a medical question from a search engine, it will not only give us a summary of relevant results (the way the currently available LLM chatbots do) but also will it calculate and produce an initial meta-analysis for us based on the available scientific literature. It will also warn the reader that the results are generated by AI and should not be deeply trusted, but can be used as a rough guess. This is of course needed until the accuracy of generative AI surpasses that of humans.
It just needs some enthusiasts with enough free time and resources on their hands to train some available open-source, open-parameter LLMs to do this specific task. Maybe even big players are currently working on this concept behind the scene to optimize their propriety LLMs for meta-analysis generation.
Any thoughts would be most welcome.
Vahid Rakhshan
Relevant answer
Answer
There was a recent well-publicised event where an actual legal court case included legal documents prepared by AI that included supposed legal citations to cases that did not ever exist.
So, you have two problems:
(1) Constructing code that does actually work;
(2) Persuading others that you have code that actually works.
  • asked a question related to Statistics
Question
3 answers
Hi, I'm currently writing my psychology dissertation where I am investigating "how child-oriented perfectionism relates to behavioural intentions and attitudes towards children in a chaotic versus calm virtual reality environment".
Therefore I have 3 predictor variables/independent variables: calm environment, chaotic environment and child-oriented perfectionism.
My outcome/dependent variables are: behavioural intentions and attitudes towards children.
My hypotheses are:
  1. Participants will have more negative behavioural intentions and attitudes towards children in the chaotic environment than in the calm environment.
  2. These differences (highlighted above) will be magnified in participants high in child-oriented perfectionism compared to participants low in child-oriented perfectionism.
I used a questionnaire measuring child-oriented perfectionism which calculates a score. Then participants watched the calm environment video and answered the behavioural intentions and attitudes towards children questionnaires in relation to the children shown in that video. Participants then watched the chaotic environment video and answered the same questionnaires in relation to the children in that video.
I am unsure whether to use a multiple linear regression or a repeated-measures ANOVA with a continuous moderator (child-oriented perfectionism) to answer my research question and hypotheses. Please can someone help!
Relevant answer
Answer
1. participants will have more negative behavioural intentions and attitudes towards children in the chaotic environment than in the calm environment.
--- because there were only two conditions (levels of your factor), you can use a paired t-test (or wilcoxon if nonparametric) to compare the behavioral intentions/attitudes between the calm and chaotic environment where the same participants were subjected to both environments.
2. these differences (highlighted above) will be magnified in participants high in child-oriented perfectionism compared to participants low in child oriented perfectionism.
--- Indeed, this is a simple linear regression (not a multiple one). You can start by creating a new dependent variable (y) as the difference in behavioural intentions/attitudes between the calm and chaotic environments, then run a regression of it on the child-oriented perfectionism score (x).
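A minimal R sketch of both steps, assuming a hypothetical data frame dat with columns calm and chaotic (the intention/attitude scores in each condition) and perfectionism:
# Hypothesis 1: within-subject comparison of the two environments
t.test(dat$chaotic, dat$calm, paired = TRUE)
# Hypothesis 2: does child-oriented perfectionism predict the size of the difference?
dat$diff <- dat$chaotic - dat$calm
summary(lm(diff ~ perfectionism, data = dat))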
  • asked a question related to Statistics
Question
2 answers
Relevant answer
Answer
What could your "political inclinations" possibly have to do with the scientific issues discussed on this website?
  • asked a question related to Statistics
Question
1 answer
RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
res = hypotest_fun_out(*samples, **kwds)
The above warning occurred in Python. The dataset was first normalised, and then the warning appeared while performing the t-test, though the output was still displayed. Kindly suggest some methods to avoid this warning.
Relevant answer
Answer
Why do you normalize before testing? If you are doing a paired t-test and the differences are small, this only makes the differences smaller. https://www.stat.umn.edu/geyer/3701/notes/arithmetic.html
  • asked a question related to Statistics
Question
3 answers
I am somewhat Hegelian because I do not believe in martyrdom, and or dying on a hill, and usually the popular, and or traditional, opinion has a deeper less obvious reason.
Relevant answer
Answer
I value politics, I believe in politics, and I exercise my political right as a citizen.
  • asked a question related to Statistics
Question
7 answers
Can anyone help me with one biostatistics question? It is about finding the sample size from a power analysis. I have the variables; I just need assistance with the calculations.
  • asked a question related to Statistics
Question
5 answers
As a Computer Science student inexperienced in statistics, I'm looking for some advice on selecting the appropriate statistical test for my dataset.
My data, derived from brain scans, is structured into columns: subject, channels, freqbands, measures, value, and group. It involves recording multiple channels (electrodes) per patient, dividing the signal into various frequency bands (freqbands), and calculating measures like Shannon entropy for each. So each signal gets broken down to one data point. This results in 1425 data points per subject (19 channels x 5 freqbands x 15 measures), totalling around 170 subjects.
I aim to determine if there's a significant difference in values (linked to specific channel, freqband, and measure combinations) between two groups. Additionally, I'm interested in identifying any significant differences at the channel, measure or freqband level.
What would be a suitable statistical test for this scenario?
Thanks in advance for any help!
  • asked a question related to Statistics
Question
2 answers
Has anyone gone through the Wohler's report_2023 yet? Its pros and cons? What are the ways to obtain its e-copy? Its subscription is very costly for a normal researcher (around 750 USD per user). Any alternatives to get similar kind of data as that of the report?
Relevant answer
Answer
Thanks David A. Jones sir!!
  • asked a question related to Statistics
Question
4 answers
I'm excited to speak at this FREE conference for anyone interested in statistics in clinical research. 👇🏼👇🏼 The Effective Statistician conference features a lineup of scholars and practitioners who will speak about professional & technical issues affecting statisticians in the workplace. I'll be giving a gentle introduction to structural equation modeling! I hope to see you there. Sign up here:
Relevant answer
Answer
Thanks for this valuable share!!
  • asked a question related to Statistics
Question
4 answers
How to test for common method bias in CB-SEM?
Relevant answer
Answer
Check Harman's single-factor test, which can be applied to the measurement instrument and to the data collected for the dependent and independent variables.
You can as well use covariance-based structural equation modelling (CB-SEM)
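For reference, the Harman single-factor check is often run as an unrotated one-factor/one-component solution over all items; a minimal R sketch, assuming a hypothetical data frame items holding every indicator of both the independent and dependent constructs:
pca <- prcomp(items, scale. = TRUE)            # unrotated principal components
prop_first <- pca$sdev[1]^2 / sum(pca$sdev^2)  # variance explained by the first component
prop_first
# If a single component accounts for the majority of the variance (the 50% rule of
# thumb is often cited), common method bias is a plausible concern.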
  • asked a question related to Statistics
Question
5 answers
Is it correct to choose the principal components method in order to show the relationship of species with biotopes?
Relevant answer
Answer
Olena Yarys If you are looking for patterns and relationships among those variables (species and biotopes), additional approaches like Canonical Correspondence Analysis (CCA) or regression models may be appropriate. You could then validate the results and perform a sensitivity analysis.
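As an illustration, a minimal CCA sketch with the vegan package in R, assuming hypothetical objects species (a site-by-species abundance matrix) and biotopes (a data frame of biotope descriptors for the same sites):
library(vegan)
ord <- cca(species ~ ., data = biotopes)      # constrain species composition by the biotope variables
summary(ord)
anova(ord, by = "term", permutations = 999)   # permutation test for each biotope variable
plot(ord)                                     # triplot: sites, species, biotope vectors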
  • asked a question related to Statistics
Question
2 answers
My answer: Yes, in order to interpret history, disincentives are the most rigorous guide. How? Due to the many assumptions of inductive logic, deductive logic is more rigorous. Throughout history, incentives are less rigorous because no entity (besides God) is completely rational and/or self-interested; thus what incentivizes an act is less rigorous than what disincentivizes the same action. And, as a heuristic, all entities (besides God) have a finite existence before their energy (eternal consciousness) goes to the afterlife (paraphrased from these sources: 1)
, thus interpretation through disincentives is more rigorous than interpretation through incentives.
Relevant answer
Answer
People's behavior in history is based on different motives, ideologies and personal views. Although motivational factors may influence decision making, individuals and groups often act within the context of their own authority and time.
  • asked a question related to Statistics
Question
2 answers
Who agrees life is more about preventing tragedies than performing miracles? I welcome elaborations.
Relevant answer
Answer
Maybe a bit cheesy, but "preventing tragedies IS performing miracles" in my opinion. Then again, negative news is always more reported and recognized than positive news, so if someone performs an extraordinarily good feat, they will only be rewarded, if at all, for a very short time.
  • asked a question related to Statistics
Question
4 answers
If you're using a number such as a statistic from a reference study you want to cite, should you write the number with the confidence interval? And how to effectively prevent plagiarism when dealing with numbers?
Thank you!
Relevant answer
Answer
Iltimass Gouazar when you're citing statistics from a reference study, it's generally good practice to report the confidence intervals as well. Including them adds context to the statistic and gives a sense of the uncertainty or variability associated with the estimate. To prevent plagiarism, simply cite each of your sources and paraphrase using your own words.
  • asked a question related to Statistics
Question
4 answers
Are people more likely to mix up words if they are fluent in more languages? How? Why?
Relevant answer
Answer
Certainly! A person who is fluent in more than one language is more likely to code-switch and mix up words from different languages within her L1. Language users, be it consciously or unconsciously, seek what is easiest for themselves. The reasons for this interference vary:
1/ Similarities in pronunciation, grammar and vocabulary among language systems like French, English and Spanish play an important role in a multilingual society. Where more than one language is known for historical reasons, mixing up words becomes common when people communicate with speakers of different languages. A person who is fluent in French may easily mix up words when using English, and the same thing happens to learners who mix in words from French when writing or speaking English.
2/ Language dominance: a bilingual speaker who uses the second language all day at work and with colleagues may not be able to keep herself from mixing in words when using her mother tongue at home.
3/ Prestige is another reason why people mix up words. For example, in Algeria people who use French (a second language) words or sentences together with Arabic are considered intellectual.
4/ Actually, language interference and code-switching occur even within the same language. For instance, a person who lives or works in an area far from home may be noticed because she uses different vocabulary and body language. The same thing happens to that speaker when her words get mixed up while using her own language variety back at home.
  • asked a question related to Statistics
Question
4 answers
Hi!
This might be a bit of a stupid question, but I am currently writing my master's thesis. One of the things I am doing is a factor analysis on a scale developed in Canada. This scale has only been validated on the Canadian workforce (the developers have run one exploratory factor analysis and two confirmatory factor analyses). I am doing an exploratory and a confirmatory factor analysis in the Norwegian workforce to see what factor structure I would find here, and whether it is the same as in Canada. As this is only one of three things I am doing in my master's, and I have hypotheses for all the other findings, my supervisor would like me to have a hypothesis for the factor structure as well. Whenever I try to come up with some arguments, I always feel like I am just arguing for the same attitudes in both countries, rather than the factor structure.
My question is: How do you make a hypothesis for this where you argue for the same/a different factor structure without arguing for the same/different attitudes?
Thank you in advance! :)
Relevant answer
Answer
Factor analysis can be used to identify the factors that contribute to the structure of the workforce.
It can as well be used to identify the key skills and competencies that are required for different roles in an organisation, gaps in the current workforce and also develop strategies to address these issues.
  • asked a question related to Statistics
Question
7 answers
I came across a commentary titled 'Tests of Statistical Significance – their (ab)use in the social sciences' and it made me reflect on the validity of using my sample for statistical testing. I have a sample of 24 banks, and they were not randomly selected. They were drawn from the top 50 banks ranked by The Banker, and I narrowed the sample down to 24 because only those banks were usable for my study. I wanted to test the association between these banks using McNemar's test, but any result I obtain (I obtained non-significant results) would be meaningless, right? Because they are not a random selection. I did not want to make a generalisation, but I wanted to know if I could still comment on the non-significance of their association.
Relevant answer
Answer
A new book worthy of our serious attention: The Myth of Statistical Inference (2021, by Michael C. Acree).
  • asked a question related to Statistics
Question
3 answers
Hello. We understand that a volcano plot is a graphical representation of differential values (proteins or genes), and it requires two parameters: fold change and p-value. However, for IP-MS (immunoprecipitation-mass spectrometry) data, many proteins identified in the IP (immunoprecipitation) group have intensities, but these proteins are not detected in the IgG (control) group (the data are blank). This means that we cannot calculate the p-value and fold change for these "present (IP) --- absent (IgG)" proteins, and therefore we cannot plot them on a volcano plot. However, in many articles, we see that these proteins are successfully plotted on a volcano plot. How did they accomplish this? Are there data-fitting methods available to assist with the plotting? Is imputation needed? And if so, does it reflect the real degree of interaction?
Relevant answer
Answer
Albert Lee : the issue with doing this is it makes the fold changes entirely arbitrary. Imagine I have a protein I detect in my test samples at "arbitrary value 10" but do not detect in my control samples at all.
If I call the ctrl value 0.5, then 0.5 vs 10.5 = 20 fold increase.
If I call the ctrl value 0.1, then 0.1 vs 10.1 = 100 fold increase.
If I call the ctrl value 0.0001, then 0.0001 vs 10.0001 = 100,000 fold increase.
In reality, the increase is effectively "infinite fold", but what this is really highlighting is that fold changes are not an appropriate metric here.
A lot (most) of statistical analysis is predicated on the measurement of change in values, not "present/absent" scenarios.
For disease biomarkers, for example, something that is present/absent is of use as a diagnostic biomarker, but not as a monitoring biomarker: you can say "if you see this marker at all, you have the disease", but you cannot really use it to track therapeutic efficacy, because all values of this marker other than "N/A" are indicative of disease.
For monitoring biomarkers you really want "healthy" and "diseased" values such that you can track the shift from one to the other.
David Genisys: I agree with Jochen Wilhelm , and would not plot my data in this manner.
A lot will depend on the kind of reviewers you get, and the type of paper you're trying to produce, but it would be more appropriate to note that these markers are entirely absent in one group, and then to comment on the robustness of their detection in the other. You wouldn't necessarily run stats, because as noted, stats are horrible for yes/no markers, but you could use the combination of presence/absence and the actual level of the former to make inferences as to biological effect. If a marker goes from "not detected" to "detected but barely", then it might be indicative of dysregulated, aberrant expression behaviour, or perhaps stochastic low-level damage. Interesting, but perhaps not of biological import or diagnostic utility. If instead it goes from "not detected" to "readily detected, at high levels", then it's probably very useful as a diagnostic biomarker, and also indicative of some active biological process, be it widespread damage/release, or active expression of novel targets.
In either case you can make biological inferences without resorting to making up numbers so you can stick them on a volcano plot (and to be honest, if you get the kind of reviewers that demand volcano plots, you can always use the trick Albert suggests).
Volcano plots are primarily a way to take BIG DATA and present it in a manner that allows you to highlight the most interesting targets that have changed between groups: if you have whole swathes of genes that are instead present/absent, then those could be presented as a table, perhaps sorted by GO terms or something (if it looks like there are shared ontological categories you could use to infer underlying biology).
  • asked a question related to Statistics
Question
3 answers
We measured three aspects (i.e. variables) of self-regulation. We have 2 groups and our sample size is ~30 in each group. We anticipate that three variables will each contribute unique variance to a self-regulation composite. How do we compare if there are group differences in the structure/weighting of the composite? What analysis should be conducted?
Relevant answer
Answer
Are you thinking of self-regulation as a latent variable with the 3 "aspects" as manifest indicators? If so, you could use a two-group SEM, although your sample size is a bit small.
You've not said what software you use, but this part of the Stata documentation might help you get the general idea anyway.
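If you happen to work in R rather than Stata, a minimal sketch of a two-group SEM with lavaan might look like this (the data frame dat, the indicator names sr1, sr2, sr3 and the grouping variable "group" are all hypothetical):
library(lavaan)

model <- 'selfreg =~ sr1 + sr2 + sr3'

# configural model: same structure, loadings estimated freely in each group
fit_free <- cfa(model, data = dat, group = "group")

# loadings constrained to be equal across the two groups
fit_eq <- cfa(model, data = dat, group = "group", group.equal = "loadings")

# a significant difference suggests the weighting of the composite differs
anova(fit_free, fit_eq)
With roughly 30 participants per group, treat the chi-square comparison as indicative rather than definitive.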
  • asked a question related to Statistics
Question
3 answers
I have a paper that proposed a hypothesis test that is heavily based on existing tests (so it is pretty much a procedure built on existing statistical tests). It was rejected by a few journals claiming that it was not innovative, although I demonstrated that it outperforms some commonly used tests.
Are there any journals that accept this sort of paper?
Relevant answer
Answer
There are two different strategies for submitting this type of work: 1) find a statistical journal that accepts more applied work or 2) find a scientific journal that finds your work of interest. What scientific, engineering, or medical problem are you trying to solve with your new method? What does your work add or provide to the community that is not addressed in the current literature? Once you know the answer to these two questions, you can better determine which journal to submit.
  • asked a question related to Statistics
Question
6 answers
I want to ask about the use of parametric and non-parametric tests when we have an enormous sample size.
Let me describe a case for discussion:
- I have two groups of samples of a continuous variable (let's say: Pulse Pressure, so the difference between systolic and diastolic pressure at a given time), let's say from a) healthy individuals (50 subjects) and b) patients with hypertension (also 50 subjects).
- there are approx. 1000 samples of the measured variable from each subject; thus, we have 50*1000 = 50000 samples for group a) and the same for group b).
My null hypothesis is: that there is no difference in distributions of the measured variable between analysed groups.
I calculated two different approaches, providing me with a p-value:
Option A:
- I took all samples from group a) and b) (so, 50000 samples vs 50000 samples),
- I checked the normality in both groups using the Shapiro-Wilk test; both distributions were not normal
- I used the Mann-Whitney test and found significant differences between distributions (p<0.001), although the median value in group a) was 43.0 (Q1-Q3: 33.0-53.0) and in group b) 41.0 (Q1-Q3: 34.0-53.0).
Option B:
- I averaged the variable's values within each participant (so, 50 values in group a) and 50 values in group b)),
- I checked the normality in both groups using the Shapiro-Wilk test; both distributions were normal,
- I used Student's t-test and obtained a p-value of 0.914, with median values 43.1 (Q1-Q3: 33.3-54.1) in group a) and 41.8 (Q1-Q3: 35.3-53.1) in group b).
My intuition is that I should use option B and average the signal before the testing. Otherwise, I reject the null hypothesis, having a very small difference in median values (and large Q1-Q3), which is quite impractical (I mean, visually, the box plots look very similar, and they overlap each other).
What is your opinion about these two options? Are both correct but should be used depending on the hypothesis?
Relevant answer
Answer
You have 1000 replicate measurements from each subject. These 1000 values are correlated and they should not be analyzed as if they were independent. So your model is wrong and you should identify a more sensible model. Eventually, the test of the difference between your groups should not have more than 98 degrees of freedom (it should have less, since a sensible model will surely include some other parameters than just the two means). Having 1000 replicate measurements seems an overkill to me if there is no other aspect that should be considered in the analysis (like a change over time, with age, something like that). If there is nothing else that should be considered, the simplest analysis is to average the 1000 values per patient and do a t-test on the 2x50 (averaged) values.
If you had samples of thousands of independent values per group, estimation would be more interesting than testing. You should then rather interpret the 95% confidence interval of the estimate (biological relevance) than the (in this respect silly) fact whether it is just in the positive or in the negative range.
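A minimal R sketch of that last suggestion, assuming a long-format data frame d with hypothetical columns subject, group and pp (pulse pressure):
# one averaged value per subject (50 per group)
subj_means <- aggregate(pp ~ subject + group, data = d, FUN = mean)

# Welch t-test on the per-subject means; report the CI, not just the p-value
res <- t.test(pp ~ group, data = subj_means)
res$conf.int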
  • asked a question related to Statistics
Question
5 answers
Neurons were treated with four different types of drugs, and then a full transcriptome was produced. I am interested in looking at the effects of these drugs on two specific pathways, each with around 20 genes. Would it be appropriate for me to just set up a simple comparative test (like a t-test) and run it for each gene? Or should I still use a differential gene expression package like DESeq2, even though only a few genes are going to be analysed? The aim of my experiment is a very targeted analysis, with the hopes that I may be able to uncover interesting relationships by cutting out the noise (i.e., the rest of the genes that are not of interest).
Relevant answer
Answer
Heather Macpherson oh yay, that is much better. I think edgeR or limma would be highly appropriate for processing your data. The edgeR and limma user guides are excellent resources with many tutorials on their proper use. As Jochen Wilhelm explained very well, you will not want to subset. In edgeR and limma you can filter by experiment, which requires a design matrix. I would also generate a contrast matrix for the group comparisons. After your group-wise comparisons you can subset as you like and highlight those genes in a volcano plot or smear plot. If you have not done this before, I would highly suggest starting from HOMER step 1 and then going directly to the edgeR user guide vignette. Good luck!! http://homer.ucsd.edu/homer/basicTutorial/index.html
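For what it's worth, a rough sketch of the edgeR quasi-likelihood workflow described in the user guide, assuming a raw count matrix counts (genes x samples) and a factor group with the four treatments (object and contrast names are hypothetical):
library(edgeR)
library(limma)

y <- DGEList(counts = counts, group = group)
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

keep <- filterByExpr(y, design)            # filter on the whole experiment, not a subset
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
y <- estimateDisp(y, design)

fit <- glmQLFit(y, design)
contr <- makeContrasts(drugA - control, levels = design)   # hypothetical group names
res <- glmQLFTest(fit, contrast = contr)
topTags(res)   # subset to your two pathways afterwards, then plot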
  • asked a question related to Statistics
Question
5 answers
When writing a review, published articles are usually collected from popular data sources like PubMed, Google Scholar, Scopus, etc.
My questions are:
1. How can we confirm that all the articles published in a certain period (e.g., 2000 to 2020) are collected and considered in the sorting process (exclusion and inclusion criteria)?
2. When the articles are not open access, how can we minimize the challenges of understanding the data for the meta-analysis?
Relevant answer
Answer
For the first question, using multiple databases as you suggest usually helps to minimize the risk of missing relevant studies. Missing studies by publication year is not usually an issue, since publication years tend to be well documented and indexed.
However, missing studies due to a narrowed scope while searching the literature is always a potential risk. If you have reasonable knowledge of the field you are planning to review, you probably know a range of publications already prior to the reviewing process. If you can find all of these publications during your scoping process, then your scope is acceptable. If some of these are missing, you might need to extend the scope further.
Another issue that is difficult to account for is the inclusion of journals/publications that are not indexed in the big databases.
For the second question, one option is to contact the authors of these subscription-only publications and request a copy. If some publications are essential yet no access can be granted, the last option is to purchase the publication.
  • asked a question related to Statistics
Question
6 answers
I would like to compare the percentage of slices which show gamma oscillations with the percentage of other slices from different conditions which also show gamma taking into consideration that groups might have different sample sizes. Thanks
Relevant answer
Answer
If you are still at an exploratory stage of your project, then you would want a visual presentation of your data. For this, you can take account of the different sample sizes for your percentages by using one of the variance-stabilisation techniques. These exist for the binomial distribution or, if in your case the counts are relatively small, those for the Poisson distribution might be better for you as being less technically distracting.
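A minimal sketch in R of the binomial case (the counts x out of n slices per condition are hypothetical): the arcsine square-root transform has approximate variance 1/(4n), so error bars of comparable meaning can be drawn despite unequal sample sizes.
x <- c(12, 7, 15)            # slices showing gamma, per condition (hypothetical)
n <- c(20, 18, 22)           # slices tested, per condition (hypothetical)
y <- asin(sqrt(x / n))       # variance-stabilised proportions
se <- 1 / (2 * sqrt(n))      # approximate standard error on that scale

plot(seq_along(y), y, pch = 19, xlab = "condition",
     ylab = "asin(sqrt(proportion))",
     ylim = range(c(y - 2 * se, y + 2 * se)))
arrows(seq_along(y), y - 2 * se, seq_along(y), y + 2 * se,
       angle = 90, code = 3, length = 0.05)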
  • asked a question related to Statistics
Question
3 answers
Each sorbate is only adsorbed onto 1 sorbent at a time, in different sorption experiments. The research question is to determine which sorbate was the most sorbed and the fastest sorbed onto which sorbent. Is a test to measure the normality distribution of data needed? Thank you.
Relevant answer
Answer
Tan Suet May Amelia, to compare the adsorbed concentrations of 4 sorbates on 4 sorbents and determine the most sorbed and fastest sorbed sorbate-sorbent combination, you can use a two-way ANOVA with post hoc tests. This will assess the main effects of sorbate and sorbent, as well as their interaction. Normality tests may not be needed if the ANOVA assumptions are met. If normality is violated, consider non-parametric tests. Additionally, analyze the time data with appropriate methods (e.g., survival analysis) to determine the fastest sorbed combination. With tools like OriginPro, you can then visualize your data the way I quickly did for you. For more relevant discussion you can ask me on my WA; https://wa.me/+923440907874. Good luck!
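A minimal R sketch of the two-way ANOVA suggested above, assuming a long-format data frame d with hypothetical columns uptake (adsorbed concentration), sorbate and sorbent:
fit <- aov(uptake ~ sorbate * sorbent, data = d)
summary(fit)                        # main effects and interaction
TukeyHSD(fit, "sorbate:sorbent")    # post hoc comparisons of the 16 combinations

# check assumptions on the residuals rather than on the raw data
shapiro.test(residuals(fit))
plot(fit, which = 1)                # residuals vs fitted, for variance homogeneity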
  • asked a question related to Statistics
Question
5 answers
When am I supposed to use the following tests to check homogeneity of variance
1) O'Brien
2) Bartlett
3) F test
4) Levene's and its variations
5) Brown-Forsythe
Can anyone help me?
Relevant answer
Answer
Results are from Bard.
  • asked a question related to Statistics
Question
3 answers
I want to know if the type of litter found between sites is significantly different. I have 7 stations with polymer IDs for litter from each station. Can I compare two categorical variables (stations, n=7; polymer type, n=11)?
Can anyone share some advice on what stats to use and any code in R.
Relevant answer
Answer
We can use shortcut formulas, e.g., Kimball's formula, for partitioning the overall chi-square value in the case of a 2×c table.
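Since R code was requested: a minimal sketch, assuming a data frame d with one row per litter item and hypothetical columns station and polymer. With 7 x 11 cells many counts will be small, so a simulated p-value is safer than the asymptotic chi-square approximation.
tab <- table(d$station, d$polymer)

chisq.test(tab, simulate.p.value = TRUE, B = 10000)
fisher.test(tab, simulate.p.value = TRUE, B = 10000)   # exact-test alternative

# standardised residuals show which station/polymer combinations drive the result
round(chisq.test(tab)$stdres, 1)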
  • asked a question related to Statistics
Question
4 answers
I am trying to analyse mutation data for endometrial cancer obtained from different studies within several databases (COSMIC, cBioportal, Intogen). I have collated the data and grouped the mutations by gene. The focus of the analysis are non-synonymous coding mutations - because these mutations are most likely to cause a change in the normal protein function.
The aim of the study is to understand the mutational landscape of Endometrial cancer. The main objectives of the study are to find the commonly mutated genes in endometrial cancer, to find significantly damaging gene mutations in endometrial cancer and to create an updated list of genes comparable to commercial gene panels.
I have created this table with the collated data:
  1. Gene name
  2. Number of samples with coding mutations
  3. Frequency ( number of samples with coding mutations / total number of samples with coding mutation)
  4. CDS length
  5. Total number of unique coding mutations
  6. Number of unique coding: synonymous mutations
  7. Number of unique coding: non-synonymous mutations
  8. Mutation burden (number of unique coding: non-synonymous mutations / CDS length)
  9. Composite score [(frequency of samples * 0.7) + (mutation burden * 0.3)]
The idea here is to use mutation burden to imply damaging effects of the genes' mutations in endometrial cancer. We then created a composite score to use as a comparable figure between the genes.
At the moment, our list of genes is at 16,000+. We are currently trying to think of a way to narrow down the list of genes to only focus on those significantly mutated compared to the other genes by way of statistics. Any advice is greatly appreciated.
Relevant answer
Answer
The significance of gene mutation burden in endometrial cancer data collated from different studies can be assessed using statistical methods such as Fisher’s exact test and logistic regression.
  • asked a question related to Statistics
Question
4 answers
First, I sent these numbers to the Bard chatbot:
6020 5603 54830 6750 23000 12370 10000
It responded with this:
You have listed 7 numbers.
  • 6020
  • 5603
  • 54830
  • 6750
  • 23000
  • 12370
  • 10000
The sum of these numbers is 103573.
The average of these numbers is 14825.
The median of these numbers is 6750.
The mode of these numbers is 10000, which appears twice.
The range of these numbers is 23000 - 10000 = 13000.
The standard deviation of these numbers is 3844.74.
Is there anything else I can help you with?
Then I asked it to standardize them,
and the response was this:
Sure. Standardizing the numbers will mean subtracting the mean and then dividing by the standard deviation. This will transform the numbers so that they have a mean of 0 and a standard deviation of 1.
The standardized numbers are:
  • -2.18
  • -2.88
  • 2.44
  • -1.38
  • 3.22
  • -0.78
  • 0.00
----------------------------------------
Bing Chat's response to the same question was this:
I can standardize the numbers you provided. Standardization is the process of transforming data to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean from each data point and then dividing by the standard deviation.
Here are the standardized values for the numbers you provided:
  • 6020: -0.705
  • 5603: -0.801
  • 54830: 2.891
  • 6750: -0.524
  • 23000: 0.693
  • 12370: -0.191
  • 10000: -0.363
Is there anything else I can help with? 😊
------------------------------------------
And Perplexity's response was this:
Based on the search results, it is possible to standardize the numbers provided by converting them into z-scores. Standardizing data is a common practice in statistics to compare variables that have different scales or units of measurement. To standardize the numbers provided, one would need to calculate the mean and standard deviation of the data and then transform each value to its corresponding z-score
Here are the steps to standardize the numbers:
  1. Calculate the mean of the data: Mean = (6020 + 5603 + 54830 + 6750 + 23000 + 12370 + 10000) / 7 = 15038.43
  2. Calculate the standard deviation of the data: Standard deviation = sqrt(((6020 - 15038.43)^2 + (5603 - 15038.43)^2 + (54830 - 15038.43)^2 + (6750 - 15038.43)^2 + (23000 - 15038.43)^2 + (12370 - 15038.43)^2 + (10000 - 15038.43)^2) / 6) = 17523.22
  3. Calculate the z-score for each value: Z-score = (X - Mean) / Standard deviationZ-score for 6020 = (6020 - 15038.43) / 17523.22 = -0.706 Z-score for 5603 = (5603 - 15038.43) / 17523.22 = -0.768 Z-score for 54830 = (54830 - 15038.43) / 17523.22 = 2.238 Z-score for 6750 = (6750 - 15038.43) / 17523.22 = -0.426 Z-score for 23000 = (23000 - 15038.43) / 17523.22 = 0.455 Z-score for 12370 = (12370 - 15038.43) / 17523.22 = -0.152 Z-score for 10000 = (10000 - 15038.43) / 17523.22 = -0.853
By standardizing the numbers, we have transformed them into a common scale that allows for easier comparison and analysis. It is important to note that the interpretation of z-scores depends on the context and the distribution of the data.
-----------------------------------------------------
So, as you can see, they can't even get the sum of the data correct. My question is: why can't these models get the sum of these numbers right, even when they apply all the correct steps?
Relevant answer
Answer
There are seven numbers, yet the output states that 10000 appears twice, which would mean there are eight numbers. In the calculation of the average the denominator is 7, which means that 10000 cannot appear twice. The range is calculated as 23000 - 10000 instead of 54830 - 5603 = 49227. The sum of those numbers is 118573, not 103573. Are you sure about those numbers?
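For reference, the descriptive statistics are easy to verify in R (all values are unique, so there is no mode):
x <- c(6020, 5603, 54830, 6750, 23000, 12370, 10000)
sum(x)           # 118573
mean(x)          # 16939
median(x)        # 10000
diff(range(x))   # 49227
sd(x)            # sample standard deviation
scale(x)         # z-scores: (x - mean(x)) / sd(x)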
  • asked a question related to Statistics
Question
6 answers
I'm expecting to use stock prices from the pre-COVID period up to now to build a model for stock price prediction. I have doubts regarding the periods I should include in my training and test sets. Do I need to consider the pre-COVID period as my training set and the current COVID period as my test set? Or should I include the pre-COVID period plus part of the current period in my training set, and the rest of the current period as my test set?
Relevant answer
Answer
To split your data into training and test sets for predicting stock prices using pre-COVID and current COVID periods, consider using a time-based approach. Allocate a portion of data from pre-COVID for training and the subsequent COVID period for testing, ensuring temporal continuity while evaluating predictive performance.
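A minimal sketch of such a time-based split in R, assuming a data frame prices with a Date column date (the cutoff date is hypothetical):
cutoff <- as.Date("2021-06-30")
train <- prices[prices$date <= cutoff, ]
test  <- prices[prices$date >  cutoff, ]
# never shuffle time series: the test period must come after the training period;
# a rolling-origin (walk-forward) evaluation is a more thorough alternative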
  • asked a question related to Statistics
Question
8 answers
There are 2 groups: one has an average of 305.15 and a standard deviation of 241.83, while the second group has an average of 198.1 and a standard deviation of 98.1. Given the large standard deviation of the first group, its mean should not be significantly different from that of the second group. But when I conducted the independent-samples t-test, it was, which doesn't make sense. Is there any other test I can conduct to analyze the data (quantitative)?
The data are monthly averages of solid waste generation, and I am comparing March and April data with those of the other months. Also, the sample sizes are not equal, i.e. fewer days in February compared with December, for example.
Relevant answer
Answer
It is difficult to give a comprehensive answer without more details on 1) the data, 2) the experimental design, 3) your question and 4) what exactly you did to perform the Student's t-test.
Note however that the usual Student's t-test assumes equal theoretical variances in the two populations, which is probably wrong considering your two values. You should use a corrected t-test for this case, like the Aspin-Welch test (the default in R / t.test).
Note also that the t-test compares the means, which can be estimated very precisely even if the sample SD is very high, provided your sample size is big enough, so you cannot conclude just « by eye » based on means and sample SDs. You should look at the sample SEMs to do that and better understand what's going on.
Never forget that the t-test assumes Gaussian distributions, which may well be wrong in your samples too, and this matters all the more when sample sizes are small and unequal (just as the hypothesis of variance equality matters more for unequal-sized samples).
Last, don't confuse « statistically significant » and « of practical interest »; the results of your test should be interpreted along with effect sizes / confidence intervals for the difference and so on.
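For illustration, in R the default t.test() already applies the Welch correction, and the confidence interval deserves more attention than the p-value alone (the two group vectors are hypothetical):
res <- t.test(waste_march_april, waste_other_months)   # Welch test by default
res$conf.int   # 95% CI of the difference in means
res$p.value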
  • asked a question related to Statistics
Question
3 answers
Hi everyone. I have searched all around the internet and literature for the answer to this question but haven’t been able to find any info regarding my specific situation.
I have multiple experiments consisting of qPCR data but can’t figure out how to best analyse it. I have WT and KO cells which I apply 3 treatments to and I have a control (no treatment) for both genotypes, and I check 15 genes. What I really want to show is if the genes are up/downregulated when I add a treatment in ko vs wt, so I want to make my comparison between the genotypes. But I can’t compare them directly, because at baseline they have quite different expression levels already, so I want to take the control for each into consideration. Before I was plotting -Dct (normalised to housekeeping gene only) and would compare each treatment in each genotype to its own control. But my group didn’t like this, which I understand, because the graphs are cluttered and I don’t show the comparison I’m really trying to make. I worked with a bioinformatician with my idea to normalise the Dct for each genotype/treatment to its own control and in that way make DDct, and then I compare the -DDct between genotypes for each treatment using an unpaired t-test. I don’t do fold change. These graphs are much nicer to look at, but my supervisor says it doesn’t make statistical sense this way, and wants to keep the graphs the original way.
Can anyone help me out? What is the best way to analyse and graph my data?
Relevant answer
Answer
Ensuring that your data analysis and visualization methods are scientifically valid and effectively convey the information is crucial. Here's a suggestion for a statistical analysis that might better suit your needs:
Relative expression analysis: instead of using -Dct or -DDct directly, consider the 'relative expression' approach. Within each genotype, the fold change in gene expression between treated samples and their corresponding controls is calculated, so you compare the change in expression due to treatment in each genotype separately.
Relative expression (RE) = 2^(-Dct), where Dct = Ct(target gene) - Ct(housekeeping gene).
Fold change calculation: for each gene, calculate the fold change between treated samples and their controls within each genotype using the relative expression values obtained above: Fold change = RE(treatment) / RE(control).
Comparison between genotypes: once you have the fold change values for each gene in both genotypes, compare the fold changes between KO and WT across all treatments. An appropriate statistical test can be used for this, such as an unpaired t-test or a non-parametric alternative if the data do not satisfy the t-test's assumptions.
Graphical representation: the data can be shown as bar graphs displaying the fold change in gene expression for each treatment in both KO and WT cells, which makes the relative differences between the genotypes after treatment easy to see.
  • asked a question related to Statistics
Question
3 answers
Hi!
My project includes 2 cell lines (developed from a stock culture) then each cell line was subcultured into 18 flasks. These 18 flasks are then separated into 3 temperatures and incubated for 6 time points. At each time point, I would analyze 2 flasks (one from each cell line), and this flask is discarded after.
Right now, we have the data analyzed with a mixed ANOVA model (time vs temperature as within factors) and tukey (to determine any differences between temperature, time point, and cell line). I was wondering if this is correct? Or should we do it differently because each time point uses a different flask compared to the next (meaning different cells from the stock culture).
thanks!
Relevant answer
Answer
Based on the description of your project, it appears that you have a nested experimental design, with a hierarchical structure of observations. Specifically, you have two cell lines, each of which is subcultured into 18 flasks. These flasks are then subjected to different temperatures and time points, with analysis conducted on two flasks at each time point. Each flask represents a unique unit of observation within its respective cell line.
Considering the nested structure of your data, a mixed ANOVA model can be appropriate to analyze the effects of temperature and time, while accounting for the nesting of flasks within cell lines. This model allows you to examine the main effects of temperature and time and their interaction. The mixed ANOVA model would consider temperature and time as within-subject factors while treating cell line as a between-subject factor.
However, it is crucial to acknowledge that using different flasks for each time point introduces potential variability due to the distinct culture conditions and individual variations among the flasks. This variability may affect the results' interpretation and could confound the effects of temperature and time.
To address this concern, you may consider using a random effects model that explicitly accounts for the nested structure of the data. This type of model would allow for the inclusion of random effects to capture the variability associated with the specific flask and cell line. You can better account for the variability between different flasks within the same cell line by incorporating random effects.
Furthermore, it's important to consider the potential impact of using different flasks at each time point on the interpretation of your results. The fact that each flask represents a unique sample from the stock culture introduces the possibility of variability due to factors such as clonal variation and experimental conditions at the time of subculturing. It's essential to carefully consider the implications of this design aspect and discuss them with domain experts or statisticians to ensure appropriate analysis and interpretation of the results.
In summary, while a mixed ANOVA model can be a suitable approach for analyzing the effects of temperature and time, considering the nested structure of your data, you may also want to explore random effects models that explicitly account for the variability associated with the nested design. Additionally, carefully consider the potential implications of using different flasks at each time point on the interpretation of your results. Consulting with experts in experimental design and statistics would be beneficial in guiding you toward the most appropriate analysis for your specific study design.
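As a heavily hedged illustration only: if each flask contributed several technical replicate measurements, a mixed model along these lines (lme4, hypothetical column names) could separate flask-level variability from the treatment effects; with a single measurement per flask, the flask term is confounded with the residual and a fixed-effects model is all that can be estimated.
library(lme4)
fit <- lmer(response ~ cell_line * temperature * time + (1 | flask), data = d)
summary(fit)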
  • asked a question related to Statistics
Question
3 answers
For some smaller and less well-known "statistics", there is often no option to calculate the error or confidence intervals. However, this might be obtained by bootstrapping. In addition, both McElreath and Kruschke have used grid approximation as an example in their well-written books. Although given as an example, I have never seen it being used; it is understandably difficult for higher-dimensional problems, but for estimation of single, less critical parameters it might be appropriate.
Consider that we apply a multivariate analysis and calculate "R" from ANOSIM (see the vegan R package, vegan::anosim, and https://sites.google.com/site/mb3gustame/hypothesis-tests/anosim). Now I am not interested in testing whether the data are compatible with R = 0 (they are non-"random" and biased, which might be one and the same thing). I want to make some statement on the meaningfulness of the posterior of R by adjusting the likelihood (which is all we do).
The "R" from ANOSIM does not come with an error estimate; however, we can bootstrap the data and estimate the alpha and beta parameters of a beta distribution under the assumption that this "R" is a random variable. Given the alpha and beta parameters for "R", we could use this as the likelihood, introduce the prior and obtain the posterior estimate (see Fig and example). I am just not aware whether this is a reasonable approach or not, as there is not much documentation on grid approximation.
Thank you in advance!
Relevant answer
Answer
I have two separate suggestions.
(1) It seems unnecessary to fit a beta distribution... instead you could just work with the bootstrapped values of R directly. If you were to ignore the "prior", you could just construct and plot a histogram from the bootstrapped values. Doing this essentially just attributes a unit weight to each bootstrapped value. To apply the prior you can just multiply each such weight by the prior and then use an obvious variant of the histogram procedure to produce a plot of the posterior distribution. This might at least provide some form of validation of using a beta distribution if the results are comparable.
(2) You are ignoring the ability of your anosim package to provide a significance test of "no difference". But you can convert this ability to provide confidence intervals for some meaningful non-zero "effect" in the obvious way. You just need to have a parametric way of adjusting one of the samples to become more or less similar to the other. Specifically, you would construct an adjusted pair of samples from the original, containing a given size of adjustment, and then apply the test of significance for no-difference. If the size of adjustment is such that the result says "accept the hypothesis of no difference", then that adjustment is counted as being in the confidence interval. This is standard theory.
However, I can't claim to understand what you are really trying to do. You say "I want to make some statement on the meaningfulness of the posterior of R", but you don't say where the prior distribution is coming from. Notionally, the bootstrapped distribution provides an indication of the sampling uncertainty of the raw estimate of R, but it doesn't assign any particular meaning to R itself. Further, in your first message you mention something about the data being non-random and you will need to worry about the consequences of this. In addition, you twice mentioned "grid approximation" but I don't know the meaning/use of this.
You might want to find a practical statistician at your university to help with this.
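A rough R sketch of suggestion (1), assuming a community matrix comm and a grouping factor grp (hypothetical names); the prior shown is purely illustrative.
library(vegan)

boot_R <- replicate(500, {
  idx <- unlist(lapply(split(seq_along(grp), grp),
                       function(i) sample(i, replace = TRUE)))
  anosim(comm[idx, ], grp[idx], permutations = 99)$statistic  # only the statistic is kept
})

prior <- function(R) dbeta((R + 1) / 2, 2, 2)   # hypothetical prior on [-1, 1]
w <- prior(boot_R)

hist(boot_R, breaks = 30, freq = FALSE)                 # bootstrap "likelihood"
lines(density(boot_R, weights = w / sum(w)), col = 2)   # prior-weighted, ~ posterior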
  • asked a question related to Statistics
Question
3 answers
Hello everyone! I am currently doing moderation/mediation analyses with Hayes Process.
As you can see the model 3 is significant with R2=.48
The independent variables have no sig. direct effect on the dependent variable, but significant interaction effects. The curious thing is: toptMEAN does not correlate with any of the variables, but still fits into the regression model. Should I take this as confirmation that toptMEAN has an effect on the dependent variable even though it does not correlate? Or am I missing something in the interpretation of these results?
(Maybe you could also suggest a different model for me to run. model 3 is currently the one with the highest r2 i found)
Relevant answer
Answer
It is very well possible to have significant interaction effects but no significant "main" effects. It is also possible to have significant "main" effects and no significant interactions. And it is possible that variables have significant zero-order correlations with the dependent variable but no significant regression coefficients (e.g., due to redundancies and/or overfitting of the regression model). Finally, it possible to find non-significant zero-order correlations between IVs and DV but significant regression coefficients (e.g., due to suppression effects).
Many of these effects cannot easily be identified from the zero-order correlations. That's one reason why we run multiple regression analyses--to identify redundancies, interactions, suppression effects, etc. that are not easy to see in a bivariate correlation analysis.
Note also that the "main" (lower-order) effects (and their significance) in the moderated regression model depend on whether you centered the predictors or not. This can make a huge difference for the interpretation. Especially when predictor variables do not have a meaningful zero point, centering prior to calculating the interaction terms is recommended (e.g., Aiken & West, 1991; Cohen et al., 2003). Otherwise, the lower-order terms (and their significance) may not be interpretable at all.
In any case, it would be a good idea for you to plot the effects to get a better understanding of what is going on. That is, look at the regression lines for different values of your moderators to understand the meaning of the interaction effects.
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park: Sage.
Cohen, J., Cohen, P., West, S. G. & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. Mahwah, NJ: Erlbaum. (Chapters 7 and 9)
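For illustration, a minimal R sketch of centering the predictors and probing the interaction with simple slopes (the variable names x, m and y are hypothetical):
d$xc <- as.numeric(scale(d$x, scale = FALSE))   # mean-center, keep original units
d$mc <- as.numeric(scale(d$m, scale = FALSE))

fit <- lm(y ~ xc * mc, data = d)
summary(fit)

# predicted y across x at low / mean / high values of the moderator
grid <- expand.grid(xc = seq(min(d$xc), max(d$xc), length.out = 50),
                    mc = c(-1, 0, 1) * sd(d$mc))
grid$pred <- predict(fit, newdata = grid)
# plot grid$pred against grid$xc, one line per level of mc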
  • asked a question related to Statistics
Question
6 answers
I have 23,500 points. I sorted them in Excel from lowest to highest and then created a scatter plot. Now I want to find the point after which the chart starts to rise with a very steep slope (approaching 90 degrees); in other words, the point where the chart begins to grow much faster.
Relevant answer
Answer
Hello Javid,
You're seeking, apparently, the point of inflection for a presumed function which links the two variables summarized in the display.
If you've fitted a model, then solve for the point at which the second derivative equals zero.
If you haven't fitted a model, then you'll need to:
1. Define some measure of slope change. This could be (Y_k − Y_{k−1}) / (Y_{k−1} − Y_{k−2}).
2. Define some degree of slope change that represents a slope "shift" and not just a graduated increase.
3. Compute the slope change (#1 above) for each point on the graph, going out to the right.
4. When you come to a point at which the slope change exceeds your criterion level (#2 above), then this is your shift point.
Good luck with your work.
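A rough R sketch of steps 1-4 above, assuming y is the sorted vector of 23,500 values; the threshold is hypothetical, and flat stretches (zero denominators) should be smoothed or skipped.
d1 <- diff(y)                        # successive differences
ratio <- d1[-1] / d1[-length(d1)]    # (y_k - y_{k-1}) / (y_{k-1} - y_{k-2})

threshold <- 3                       # e.g. "the slope triples" (hypothetical criterion)
shift <- which(is.finite(ratio) & ratio > threshold)[1] + 2   # index in the sorted data
y[shift]                             # value at which the curve starts to climb steeply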
  • asked a question related to Statistics
Question
7 answers
What steps should be taken in determining whether there is causal relationship between two variables?
Relevant answer
Answer
Causality is a complex and challenging topic in statistics and data science, and there is no single method that can prove and establish the causality mathematically. However, there are some approaches that can provide evidence or support for causal claims, depending on the type and quality of the data available.
One approach is to use experimental methods, such as randomized controlled trials (RCTs), where you manipulate one variable (the treatment) and observe its effect on another variable (the outcome), while controlling for other confounding factors. This can provide strong evidence for causality, but it may not be feasible or ethical in some situations.
Another approach is to use observational methods, such as regression analysis, where you model the relationship between two variables using data that are collected without intervention. This can provide some evidence for causality, but it may be subject to bias, confounding, or reverse causality. To address these issues, you may need to use additional techniques, such as:
- Instrumental variables (IV): These are variables that are correlated with the treatment variable, but not with the outcome variable or any confounder. They can be used as proxies for the treatment variable to estimate its causal effect on the outcome variable.
- Propensity score matching (PSM): This is a technique that matches units (individuals, groups, etc.) that have similar probabilities of receiving the treatment, based on their observed characteristics. This can reduce the imbalance between the treatment and control groups and improve the comparability of the outcomes.
- Difference-in-differences (DID): This is a technique that compares the changes in outcomes between two groups (treatment and control) before and after an intervention. This can account for time-invariant confounders and measure the causal effect of the intervention.
- Granger causality: This is a technique that tests whether the past values of one variable can help predict the future values of another variable, using time series data. This can indicate whether there is a causal direction between the two variables.
A third approach is to use graphical methods, such as causal diagrams or directed acyclic graphs (DAGs), where you represent the variables and their causal relationships using nodes and arrows. This can help you visualize and understand the causal structure of the data, identify potential confounders or mediators, and design appropriate methods for causal inference.
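As a small illustration of the Granger causality technique mentioned above, the test is available in R's lmtest package (the data frame and variable names are hypothetical):
library(lmtest)
grangertest(y ~ x, order = 2, data = d)   # does the past of x help predict y?
grangertest(x ~ y, order = 2, data = d)   # check the reverse direction as well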
  • asked a question related to Statistics
Question
2 answers
Hello there, I would like to correlate the mRNA expression of a gene obtained from patient blood samples with surrogate parameters like HbA1c and data like body weight, age, etc. However, I am unsure and confused about the correct way to do it. I did not find a solution on the web so far.
Which qPCR data should I use (dCt, ddCt, 2^-ddCt)?
Is it favorable to use log-transformed data, or does that distort the results?
Is it statistically correct to simply use an xy-graph and do a correlation analysis?
Regards!
Relevant answer
Answer
To correctly correlate qPCR data with patient characteristics, you can follow these steps:
  1. Collect and preprocess data: Gather qPCR data from your experiments, including the gene expression levels for each patient sample. Also, collect the relevant patient characteristics, such as age, gender, disease status, treatment, or any other factors of interest. Ensure that the data is properly labeled and organized.
  2. Data exploration: Perform exploratory data analysis (EDA) to understand the distribution of your qPCR data and patient characteristics. Use summary statistics, histograms, box plots, or scatter plots to visualize the data and identify any outliers or patterns.
  3. Data integration: Merge or join your qPCR data and patient characteristics data based on a common identifier, such as patient ID. This will create a combined dataset that includes both the qPCR measurements and the corresponding patient information.
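A minimal R sketch of the actual correlation once the data are merged, with hypothetical column names: since each Ct is roughly proportional to -log2 of the starting amount, -dCt is already on a log scale and is usually the more natural quantity to correlate than back-transformed 2^-ddCt values.
d$neg_dct <- -(d$ct_target - d$ct_housekeeping)

cor.test(d$neg_dct, d$hba1c, method = "pearson")    # or "spearman" if skewed
plot(d$hba1c, d$neg_dct)

# adjust for covariates such as age or body weight
summary(lm(neg_dct ~ hba1c + age + body_weight, data = d))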
  • asked a question related to Statistics
Question
4 answers
I'm working on my PhD thesis and I'm stuck around expected analysis.
I'll briefly explain the context then write the question.
I'm studying moral judgment in the cross-context between Moral Foundations Theory and Dual Process theory.
Simplified: MFT states that moral judgments are almost always intuitive, while DPT states that better reasoners (higher on cognitive capability measures) will make moral judgments through analytic processes.
I have another idea - people will make moral judgments intuitively only for their primary moral values (e.g., for conservatives those are the binding foundations - respecting authority, ingroup loyalty and purity), while for the values they aren't much concerned about they'll have to use analytical processes to figure out what judgment to make.
To test this idea, I'm giving participants:
- a few moral vignettes to judge (one concerning progressive values and one concerning conservative values) on 1-7 scale (7 meaning completely morally wrong)
- moral foundations questionnaire (measuring 5 aspects of moral values)
- CTSQ (Comprehensive Thinking Styles Questionnaire), CRT and belief bias tasks (8 syllogisms)
My hypothesis is therefore that cognitive measures of intuition (such as intuition preference from CTSQ) will predict moral judgment only in the situations where it concerns primary moral values.
My study design is correlational. All participants are answering all of the questions and vignettes. So I'm not quite sure how to analyse the findings to test the hypothesis.
I was advised to do a regression analysis where the moral values (5 from the MFQ) or the moral judgments from the two different vignettes are predictors, and the intuition measure is the dependent variable.
My concern is that this analysis is the wrong choice because I'll have both progressives and conservatives in the sample, which means both groups of values should predict intuition if my assumption is correct.
I think I need to either split people into groups based on their MFQ scores and then do this analysis, or introduce some kind of multi-step analysis or control or something, but I don't know what the right approach would be.
If anyone has any ideas please help me out.
How would you test the given hypothesis with available variables?
Relevant answer
Answer
There are several statistical analysis techniques available, and the choice of method depends on various factors such as the type of data, research question, and the hypothesis being tested. Here is a step-by-step guide on how to approach hypothesis testing:
  1. Formulate your research question and null hypothesis: Start by clearly defining your research question and the hypothesis you want to test. The null hypothesis (H0) represents the default position, stating that there is no significant relationship or difference between variables.
  2. Select an appropriate statistical test: The choice of statistical test depends on the nature of your data and the research question. Here are a few common examples: Student's t-test, used to compare means between two groups; Analysis of Variance (ANOVA), used to compare means among more than two groups; Chi-square test, used to analyze categorical data and test for independence or association between variables; Correlation analysis, used to examine the relationship between two continuous variables; Regression analysis, used to model the relationship between a dependent variable and one or more independent variables.
  3. Set your significance level and determine the test statistic: Specify your desired level of significance, often denoted as α (e.g., 0.05). This value represents the probability of rejecting the null hypothesis when it is true. Based on your selected test, identify the appropriate test statistic to calculate.
  4. Collect and analyze your data: Gather the necessary data for your analysis. Perform the chosen statistical test using statistical software or programming languages like R or Python. The specific steps for analysis depend on the chosen test and software you are using.
  5. Calculate the p-value: The p-value represents the probability of obtaining the observed results (or more extreme) if the null hypothesis is true. Compare the p-value to your significance level (α). If the p-value is less than α, you reject the null hypothesis and conclude that there is evidence for the alternative hypothesis (Ha). Otherwise, you fail to reject the null hypothesis.
  6. Interpret the results: Based on the outcome of your analysis, interpret the results in the context of your research question. Consider the effect size, confidence intervals, and any other relevant statistical measures.
  • asked a question related to Statistics
Question
9 answers
In many biostatistics books, the negative sign is ignored in the calculated t value.
In a left-tailed t-test we include a minus sign in the critical value.
E.g.
Result of a paired, left-tailed t-test:
calculated t value = -2.57
critical value = - 1.833 ( df =9; level of significance 5%) (minus sign included since
it is a left tailed test)
now, we can accept or reject the null hypothesis.
if we do not ignore the negative sign i.e. -2.57<1.833 null hypothesis accepted
if we ignore the negative sign i.e. 2.57>1.833 null hypothesis rejected.
Relevant answer
Answer
The negative sign, in mathematics in general and in statistics in particular, is important, notably when commenting on results (for example, a positive correlation is the opposite of a negative linear correlation). Signs must be respected.
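For illustration, a left-tailed paired t-test in R keeps the sign automatically; with the numbers from the question, -2.57 lies below the critical value of -1.833, so H0 is rejected. The data below are hypothetical:
before <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 13.5, 12.0, 11.8, 12.6)
after  <- c(11.2, 11.0, 12.1, 12.5, 11.0, 11.9, 12.8, 11.1, 11.5, 12.0)

t.test(after, before, paired = TRUE, alternative = "less")  # tests mean(after - before) < 0
qt(0.05, df = 9)   # left-tailed critical value: -1.833, sign included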
  • asked a question related to Statistics
Question
9 answers
Hi!
I am performing PCR quantification of 5 inflammatory markers on 180 samples. As you can imagine, I therefore work on several 384-well plates. To compare them, I introduced a duplicate amplification control per primer for each plate to check if my amplification is indeed the same from one run to another.
Now, I have experimented with all my plates and I would like to run them through a statistical test to verify that the experiment is comparable from one plate to another. What statistical test should I use?
I'll let the stats pros answer! :)
Relevant answer
Answer
If you have included a plate calibrator (what you call an "amplification control", I assume), you can use it to adjust the Ct values to make them comparable between the plates.
If there is no plate-to-plate variation, then this adjustment would simply do nothing, and in any other case it will correct for this variation.
There is no need to demonstrate that the plate calibrators show that there is no between-plate variation.
If you still request some "test": there is no statistical test in that sense that would show equality. Statistical tests can only answer the question if your sample size is large enough to have sufficient confidence in concluding that there is some difference. Failing to reject the hypothesis of "no difference" is not evidence for the absence of a difference.
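A minimal R sketch of that adjustment, with hypothetical column names: subtract each plate's mean calibrator Ct (per primer) from the sample Cts of that plate.
cal <- aggregate(ct ~ plate + primer, data = d[d$is_calibrator, ], FUN = mean)
names(cal)[names(cal) == "ct"] <- "ct_cal"

d <- merge(d, cal, by = c("plate", "primer"))
d$ct_adj <- d$ct - d$ct_cal    # Ct values now comparable across plates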
  • asked a question related to Statistics
Question
2 answers
Hello, I plan to use the I-distance statistical method to rank some countries based on a set of indicators. I found many papers that have used this method; however, I could not find any work that shows exactly how the method is applied to the data and variables, e.g., calculating the partial correlation coefficient between two variables, etc. Can anybody please suggest a good resource for that?
Relevant answer
Answer
I-Distance statistical method is a way of ranking objects based on multiple indicators by calculating the distance of each object from an ideal point. It can be used to compare countries, regions, firms, etc. based on various criteria.
One resource that explains how to apply this method to data and variables is the paper titled "Statistical approach for ranking OECD countries based on composite GICSES index and I-distance method" by Mitrovic et al. (1). In this paper, the authors use the I-distance method to rank OECD countries based on their cognitive skills and educational attainment. They explain the steps of the method and provide formulas and examples.
Another resource that discusses statistical distances is the Wikipedia article on "Statistical distance" (2). It gives a general overview of different types of statistical distances and divergences and provides some references for further reading.
The Wikipedia article does not mention the I-distance explicitly, but it does cover it implicitly under the name of "Mahalanobis distance"; the I-distance is a special case of the Mahalanobis distance when the covariance matrix is diagonal. You can find more details about this in the paper by Mitrovic et al. (1), section 2.2.
(2) Statistical distance - Wikipedia. https://en.wikipedia.org/wiki/Statistical_distance.
  • asked a question related to Statistics
Question
7 answers
Hello,
I am currently analyzing data from a study and am running into some issues. I have two independent variables (low vs high intensity & protein vs no protein intervention) and 5 dependent variables that I measured on two separate occasions (pre intervention and post intervention). So technically I have 4 groups a) low intensity, no protein b) low intensity, protein c) high intensity, no protein and d) high intensity, protein.
Originally I was going to do a two-way MANOVA as I wanted to know the interaction between the two independent variables on multiple dependent variables however I forgot about the fact I have two measurements of each of the dependent variables and want to include how they changed over time.
I can't seem to find a test that will incorporate all these factors, it seems like I would need to do a three-way MANOVA but can't seem to find anything on that. So I am thinking of a) calculating the difference in the dependent variables between the two time stamps and using that measurement for the MANOVA or b) using MANOVA for the measurement of dependent from the post test and then doing a separate test to see how each of the dependent variables changed over time. Is this the right line of thinking or am I missing something? When researching this I kept finding doubly multivariate analysis for multiple observations but it seems to me that that only allows for time and one other independent variable, not two.
Any guidance or feedback would be greatly appreciated :)
Relevant answer
Answer
Hello Isabel,
The basic design is a two-between (intensity, 2 levels; protein, 2 levels), one-within (occasion, 2 levels) manova. Between-within manovas are sometimes called "doubly multivariate."
So yes, you have three factors.
However, you may wish to consider a couple of points:
1. Do you really intend to interpret the results across arbitrary linear composite/s of the five DVs (generated by the software to maximize the variance accounted for by the between-subjects IVs), or would you prefer to address how the study factors influence scores on each of the five DVs? The first calls for the multivariate test; the second calls for univariate tests.
2. Consider using the pre-intervention score for a given dependent variable as a covariate, then use the post-intervention score as the DV score. This eliminates the within-subjects factor, and offers the ability to "adjust" for randomly occurring differences at the outset among the four treatment combinations.
3. In general, tests of change over time (e.g., meanT2 - meanT1) force you to work with difference scores that are less reliable than either of the scores from which they are derived, unless all time scores are perfectly reliable (no measurement error), which is seldom if ever the case.
Good luck with your work.
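A minimal R sketch of point 2 above, run separately for each of the five dependent variables (the column names are hypothetical):
fit <- lm(post_dv1 ~ pre_dv1 + intensity * protein, data = d)
summary(fit)   # intensity, protein and their interaction, adjusted for the baseline score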
  • asked a question related to Statistics
Question
9 answers
Dear ResearchGate community,
I have a statistical question which has given me a lot of headache (statistics usually do, but this is worse!). I have designed a survey in which participants read sentences, evaluate whether the sentences are meaningful and then select a response among 3 possible interpretations. I have 3 types of sentences (let's say language 1, language 2, language 3) and for each language there are 8 sentences. In total, the participants read 24 sentences (3x8).
What I'm interested in is accuracy. Say a participant has accepted all 8 sentences in Language 1, whereas the correct meaning has been selected for only 5 of these. This means that the accuracy is 62.5%. In Language 2, on the other hand, the participant has accepted only 5 sentences and has found the correct meaning of only 3 of these. This means that my 100% is always changing.
Does anybody know how I can calculate mean precision with these kinds of numbers? The goal is to examine precision by language (1, 2 or 3). I have a feeling it has to do with ratios, but I'm quite lost at the moment!
Relevant answer
Answer
Do you mean precision or accuracy? You seem to use both words (which have very different meanings) interchangeably. Could you clarify this?
Further: what does "accepting a sentence" mean? Is that a binary decision (yes or no), or does it leave the possibility that "not accepting" means the participant does not know or does not make a decision on that sentence?
  • asked a question related to Statistics
Question
4 answers
Hey there!
I have data that includes the following arms:
  • Healthy individuals who did not receive the drug
  • Ill individuals who did not receive the drug
  • Ill individuals who did receive the drug
  • Healthy individuals who did receive the drug
I want to perform a meta-analysis and show all my results for one outcome in a single forest plot. My effect size is SMD.
Can you tell me if it's possible and how to do it?
And, by the way, does Stata have my back? If yes, what command should I use?
Relevant answer
Answer
What's your research question, anyway?
  • asked a question related to Statistics
Question
5 answers
Has anyone conducted a meta-analysis with Comprehensive Meta-Analysis (CMA) software?
I have selected: comparison of two groups > means > Continuous (means) > unmatched groups (pre-post data) > means, SD pre and post, N in each group, Pre/post corr > finish
However, it is asking for pre/post correlations which none of my studies report. Is there a way to calculate this manually or estimate it somehow?
Thanks!
Relevant answer
Answer
Yes, it is possible to estimate the pre-post correlation coefficient in a meta-analysis using various methods, such as imputing a value or using a range of plausible values. Here are a few options:
  1. Imputing a value: If none of your studies report the pre-post correlation, you can impute a value based on previous research or assumptions. A commonly used estimate is a correlation coefficient of 0.5, which assumes a moderate positive relationship between the pre and post-measures. However, it is important to note that this value may not be appropriate for all studies or research questions.
  2. Using a range of plausible values: Another option is to use a range of plausible correlation coefficients in the analysis, rather than a single value. This can help to account for the uncertainty and variability in the data. A common range is 0 to 0.8, which covers a wide range of possible correlations.
  3. Contacting study authors: If possible, you can try to contact the authors of the included studies to request the missing information or clarification about the pre-post correlation coefficient. This can help to ensure that the analysis is based on accurate and complete data.
Once you have estimated the pre-post correlation coefficient, you can enter it into the appropriate field in the CMA software and proceed with the analysis. It is important to carefully consider the implications of the chosen correlation coefficient and to conduct sensitivity analyses to test the robustness of the results.
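To make the sensitivity point concrete, here is a hedged R sketch of how the assumed pre/post correlation r feeds into the standard deviation of the change score, and hence the effect size. The SDs and mean change below are made-up numbers for a single hypothetical study:
sd_pre  <- 12      # hypothetical pre-test SD
sd_post <- 14      # hypothetical post-test SD
mean_change <- 5   # hypothetical mean pre-post change
for (r in c(0.2, 0.5, 0.8)) {
  # SD of the change score given the assumed pre/post correlation r
  sd_change <- sqrt(sd_pre^2 + sd_post^2 - 2 * r * sd_pre * sd_post)
  cat(sprintf("r = %.1f  SD(change) = %5.2f  SMD = %.2f\n",
              r, sd_change, mean_change / sd_change))
}
Re-running the meta-analysis across such a range of r values is one simple form of the sensitivity analysis mentioned above.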
  • asked a question related to Statistics
Question
5 answers
I have been assigned the task of performing business sales forecasting using time series analysis. However, before starting the forecasting process, I need to identify and treat the outliers in the dataset.
To achieve this, I have decided to use Seasonal Trend Decomposition (STL) with LOESS.
I would appreciate your assistance in implementing this technique using Python or R programming language.
Relevant answer
Answer
Take a look at the attached. Best wishes, David Booth
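In case it helps, here is a minimal R sketch of the STL-with-LOESS approach for flagging outliers, using the built-in stl() function; the simulated monthly series, the robust fit, and the 3-MAD cut-off are all illustrative assumptions rather than a recommendation:
set.seed(1)
# Simulated monthly sales series: seasonal pattern plus noise
sales_ts <- ts(100 + 10 * sin(2 * pi * (1:120) / 12) + rnorm(120, sd = 5),
               frequency = 12)
# Seasonal-trend decomposition using LOESS, with robust fitting
fit <- stl(sales_ts, s.window = "periodic", robust = TRUE)
remainder <- fit$time.series[, "remainder"]
# Flag observations whose remainder sits far from the bulk of the residuals
outliers <- which(abs(remainder - median(remainder)) > 3 * mad(remainder))
outliers
The flagged points can then be inspected, winsorised, or replaced before fitting the forecasting model.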
  • asked a question related to Statistics
Question
4 answers
Hello dear colleagues!
I want to present my results regarding water vapour transmission rate. I have 5 samples and I worked in triplicate. Do I apply the formula to each replicate (1a, 1b, 1c, 2a, 2b, 2c, etc.) and then calculate the mean and SD, or do I calculate the mean value and SD for each sample (1, 2, 3, etc.) and then apply the formula?
The formula used is (initial weight - final weight) / (area × 24).
So should I use (initial weight of replicate 1a - final weight of replicate 1a) / (area × 24) and then calculate the mean and SD for sample 1,
or calculate the mean and SD of replicates 1a, 1b, 1c and apply the formula as (mean initial weight of sample 1 - mean final weight of sample 1) / (area × 24)?
I want to present my results as the number given by the formula ± SD.
Relevant answer
  • asked a question related to Statistics
Question
9 answers
For context, the study I am running is a between-participants vignette experimental research design.
My variables include:
1 moderator variable: social dominance orientation (SDO)
1 IV: target (Muslim woman= 0, woman= 1) <-- these represent the vignette 'targets' and 2 experimental conditions which are dummy-coded on SPSS as written here)
1 DV: bystander helping intentions
I ran a moderation analysis with Hayes PROCESS macro plug-in on SPSS, using model 1.
As you can see in my moderation output (first image), I have a significant interaction effect. Am I correct in saying there is no direct interpretation for the b value for interaction effect (Hence, we do simple slope analyses)? So all it tells us is - SDO significantly moderates the relationship between the target and bystander helping intentions.
Moving onto the conditional effects output (second image) - I'm wondering which value tells us information about X (my dichotomous IV) in the interaction, and how a dichotomous variable should be interpreted?
So if there was a significant effect for high SDO per se...
How would the IV be interpreted?
" At high SDO levels, the vignette target ___ led to lesser bystander helping intentions; b = -.20,t (88) = -1.65, p = .04. "
(Note: even though my simple slope analyses showed no significant effect for high SDO, I want to be clear on how my IV should be interpreted as it is relevant for the discussion section of the lab report I am writing!)
Relevant answer
Answer
The significant t-test for the interaction term in your model shows that the slopes of the two lines differ significantly. But at the 3 values of X that are shown in your results (x=-.856, x=0, x=.856), fitted values on the two lines do not differ significantly.
I suspect your output is from Hayes' PROCESS macro, and that -.856 and .856 correspond to the mean ± one SD. Is that right?
Why does it matter if the fitted values on the two lines do not differ significantly at those particular values of X? Your main question is whether the slopes differ significantly, is it not?
  • asked a question related to Statistics
Question
4 answers
The table provides a snapshot of the literacy rates of a selection of countries across the globe as of the 2011-2021 censuses. The data is based on information collected by the UNESCO Institute for Statistics. Literacy rate refers to the percentage of people aged 15 years and above who can read and write. The table includes 18 countries, with literacy rates ranging from a low of 43.0% in Afghanistan to a high of 99.7% in Russia. Some of the world's most populous countries, such as China, India, and Nigeria, have literacy rates below 80%. On the other hand, many developed nations, such as Canada, France, Germany, Japan, the United Kingdom, and the United States, have literacy rates above 98%. The data can be used to gain insight into global education levels and to compare literacy rates across countries. (Table excerpt: Country / Total literacy rate; Afghanistan, 43.0%.)
Any Additions/Questions/Results/Ideas Etc?
Relevant answer
Answer
Hello Rahul Jain
You make a very interesting point about literacy rates. Beyond just the numbers (e.g., what percentage of the population is literate), some research has looked at the amount of literacy required for a job. So there are subdivisions by the amount of literacy (such as in OECD reports) rather than a yes/no view. Do you think that is more helpful, and do you think it is even necessary?
Example:
  • asked a question related to Statistics
Question
10 answers
Propensity score matching (PSM) and Endogenous Switching Regression (ESR) by full information maximum likelihood (FIML) are most commonly applied models in impact evaluation when there are no baseline data. Sometimes, it happens that the results from these two methods are different. In such cases, which one should be trusted the most because both models have their own drawbacks?
Relevant answer
Answer
What is the advantage of PSM over ESR?
  • asked a question related to Statistics
Question
1 answer
Hi all,
I have categorical data from 2 different experimental conditions (see the stacked bar graphs as an example). I can use the chi-square test for association to see if there is a statistically significant difference between the frequencies of the categories in these two datasets. However, this does not tell me whether the change in a particular category is significant (i.e., is the decrease in the 'red' category from 37% to 26% significant?). I believe I need a post hoc test for such pairwise comparisons, but I couldn't figure out which post hoc test can be used. I have the percentage values and the actual sample sizes for each category.
Any suggestion is greatly appreciated.
Thanks!
Relevant answer
Answer
I would not use the chi-square test alone for this, but you can find a clear answer here; download this script from the following link, it's free.
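For illustration, here is a hedged R sketch of one common post hoc strategy: inspect the standardized residuals from the overall chi-square test, and/or test each category against all the others in a 2 x 2 table with a multiplicity correction. The counts below are invented for illustration:
# Hypothetical 2 x 4 table: conditions in rows, categories in columns
counts <- matrix(c(37, 26,
                   30, 34,
                   20, 22,
                   13, 18),
                 nrow = 2,
                 dimnames = list(condition = c("A", "B"),
                                 category  = c("red", "blue", "green", "other")))
overall <- chisq.test(counts)
overall                 # omnibus test of association
overall$stdres          # standardized residuals; cells with |value| > ~2 stand out
# Category-wise follow-up: each category vs. the rest, Holm-adjusted p-values
p_raw <- sapply(colnames(counts), function(k) {
  tab2 <- cbind(counts[, k], rowSums(counts) - counts[, k])
  chisq.test(tab2)$p.value
})
p.adjust(p_raw, method = "holm")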
  • asked a question related to Statistics
Question
3 answers
I have 3 groups:
  1. Intervention 1 (n=6)
  2. Intervention 2 (n=9)
  3. Treatment as usual (n=12)
And I have 2 time points (pre and post intervention)
What kind of test do you recommend? Thank you!
Relevant answer
Answer
If by a test you mean something that produces a p-value, you shouldn't do it. Your research apparently is applied, which means you must be able to make quantitative comparisons. I would suggest a 2x3 factorial linear model (SPSS calls this Generalized Linear Models). Make sure to request a coefficient table with confidence limits in the output; these are the numbers that tell you how strong the effects are, and how uncertain.
As Time is within-subject, you should run a multilevel model, with Time at the participant level. That is very important for intervention studies, as the effect of any intervention is zero when there was little to intervene on. Think of how much effect an aspirin has when there is no pain.
If you want to stick with a fixed-effects model, you can simply remove the time factor by computing T2 - T1 and interpreting the result as the net effect of the intervention.
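As a minimal sketch of the multilevel option in R, assuming long-format data in a hypothetical data frame d with columns id, group (Intervention 1 / Intervention 2 / TAU), time (pre/post) and outcome y:
library(lme4)
# One row per participant per time point; random intercept per participant
m <- lmer(y ~ group * time + (1 | id), data = d)
summary(m)      # the group:time coefficients are the intervention effects
confint(m)      # confidence limits for the fixed effects
The group:time coefficients estimate how much more each intervention group changed from pre to post than treatment as usual, with the confidence limits quantifying the uncertainty.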
  • asked a question related to Statistics
Question
6 answers
regression non significant effects
Relevant answer
Answer
Hi Sarah,
I don't understand the confusion. The effect of X on M is significant--the rest isn't. No evidence for mediation and no evidence that X has an effect on Y.
But please don't interpret a non-significant effect as "no effect" but as "no evidence of an effect". Huge difference.
Perhaps a power problem?
Best,
Holger
  • asked a question related to Statistics
Question
4 answers
Given 𝑛 independent Normally distributed random variables 𝑋ᵢ ∼ 𝑁(𝜇ᵢ,𝜎²ᵢ) and 𝑛 real constants 𝑎ᵢ∈ℝ, I need to find an acceptable Normal approximation of the distribution of 𝑌 random variable (assuming Pr[𝑋ᵢ≤0]≈0, to avoid divisions by zero)
Y = ∑aᵢXᵢ / ∑Xᵢ
I thought to split 𝑌 into single components
Y = a₁X₁ / ∑Xᵢ + a₂X₂ / ∑Xᵢ + ... + aₙXₙ / ∑Xᵢ
Y = a₁Y₁ + a₂Y₂ + ... + aₙYₙ
where the distribution of each 𝑌ᵢ can be found noting that
Yᵢ = Xᵢ / (Xᵢ + ∑Xⱼ) , for j≠i
and that
1/Yᵢ = (Xᵢ + ∑Xⱼ) / Xᵢ = 1 + ∑Xⱼ / Xᵢ
so, calling ∑Xⱼ = Uᵢ, we can say that Xᵢ and Uᵢ are independent and, according to Díaz-Francés et al. (2012), a Normal approximation of 1/Yᵢ is the one in figure 1; considering 1 ~ N(1,0), the r.v. Yᵢ can be approximated as in figure 2. Thus, the approximation of each aᵢYᵢ is the one in figure 3.
But now... I'm stuck at the sum of aᵢYᵢ because, not being independent, I don't know how to approximate the variance of their sum.
Any advice? Any more straightforward or more efficient method?
Relevant answer
Answer
Thanks for your answer.
Yes, the sum of all aᵢYᵢ will be "roughly" normal or, at least, will tend to a normal distribution as n increases. The challenging part is to find the correct variance approximation of the sum.
  • asked a question related to Statistics
Question
1 answer
I have a data set of peak areas from gas chromatography I would like to run on a PLSR model. Generally, for PLSR I would center and scale the data, is that appropriate here?
As the peak areas differ between compounds by a factor of about 100, running the model on unscaled data is unfeasible.
Is it standard to scale these peak areas? Is there a scaling method that will reduce overfitting the model and avoid introducing extra noise?
Relevant answer
Answer
Firstly, you must scale your data before fitting a PLSR, PLS-DA, or PCA model. Some compounds are dominant in concentration in your sample, so their peaks usually have higher intensity and variance. Therefore, if you don't scale your data, the high-intensity peaks will be picked out as the most important differential metabolites and you will lose the information from the others.
For scaling, you can choose:
1. Autoscaling
2. Range scaling
3. Pareto scaling
These three scaling methods were often chosen. For more details about the advantages and disadvantages you can check the paper written by Robert A van den Berg at
Although the paper discusses the peak table from GC-MS or LC-MS, it is the same issue as yours; the features in the MS data table are m/z values, while in your case they are retention times.
Secondly, as for reducing overfitting, you can compare the resulting accuracy when you choose different scaling methods, but more importantly, overfitting is usually related to the shape of your dataset. The best way to reduce overfitting is to increase the number of samples. In PLSR, another way to reduce overfitting is to reduce the number of components you use for prediction.
As for extra noise: once you scale, all the noise will no doubt be amplified; remember that all scaling methods aim to treat all metabolites as equally important. Pareto scaling may work a little better than the others in this respect. The best way to reduce noise is to remove it before scaling. Usually, we utilize QC samples: all peaks in your QC samples with variation higher than 20% (or another cut-off; please check recently published papers, 20% is commonly used for MS data) are treated as noise or unstable metabolites and removed from the dataset.
Hope this will be useful for you.
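To make the first point concrete, here is a small R sketch of autoscaling and Pareto scaling applied to a peak-area matrix (the matrix here is simulated; rows are samples, columns are compounds):
set.seed(1)
# Simulated peak-area matrix: 20 samples x 5 compounds
X <- matrix(abs(rnorm(20 * 5, mean = 1000, sd = 300)), nrow = 20)
# Autoscaling: centre each column and divide by its standard deviation
X_auto <- scale(X, center = TRUE, scale = TRUE)
# Pareto scaling: centre each column and divide by the square root of its SD
X_pareto <- scale(X, center = TRUE, scale = sqrt(apply(X, 2, sd)))
Either scaled matrix can then be fed to a PLSR routine; comparing cross-validated error under the different scalings is one way to check the overfitting point above.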
  • asked a question related to Statistics
Question
5 answers
Hi,
I have data from a study that included 3 centers. I will conduct a multiple regression (10 IVs, 1 non-normally distributed DV) but I am unsure how to handle the variable "center" in these regressions. Should I:
1) Include "centre" as one predictor along with the other 10 IVs.
2) Utilize multilevel regression
Thanks in advance for any input
Kind regards
  • asked a question related to Statistics
Question
7 answers
Hey guys,
So I am planning an experiment where I will check for reactions between two different compounds with distinct UV spectral curves. I was wondering how I might go about doing stats on the results?
My instinct is to just do t-tests on the peak values over time, but that seems extremely crude. I'm sure there must be better ways of doing things? What do you think? Has anyone got any experience or suggestions with this?
Relevant answer
Answer
Are you measuring several instances of the same compound ? Or just one curve per compound ?
  • asked a question related to Statistics
Question
8 answers
Hello, I'm currently analyzing antibody data with Repeated Measures ANOVA and have run into problems. I built a model with age, gender, vaccine background, and around 7 genetic polymorphisms. When I do multiple comparisons afterward, I get the error "all pairwise comparisons mean-mean scatterplot cannot be shown because confidence intervals cannot be computed jmp." I don't know how to solve this. Does anybody know what the problem is? Someone suggested something about the degrees of freedom running out, but I do not understand. Appreciate your help!
Relevant answer
Answer
Sorry outside of my field
  • asked a question related to Statistics
Question
18 answers
Dear Statisticians
We are a team of engineers working on an assistive device for stroke patients. We designed a questionnaire to ensure that the technology will meet patients' needs. I have a question regarding the sample size for our survey and I hope you can help me with this. I used the following formula, and used the number of stroke patients in the UK as the population size: https://lnkd.in/evJbP7R5 However, we also want to know how different stages of the disease would affect the answers. Thus, we have 3 subgroups (early subacute, late subacute, chronic) and we want to know how each group answers our questions. Please note that the population sizes of the subgroups are not the same (e.g. early subacute 1000, late subacute 10,000, chronic 1,000,000). Could you please help me calculate the sample size in this case? Many thanks for your help. Ghazal
Relevant answer
Answer
The highest power is when the sample sizes are equal. When the sample sizes are unequal, you can work out the effective sample size by using the proportion on one group (p) as follows:
Effective sample size = N × 4p(1 - p)
where N is the total sample size and p is the proportion in one of the groups.
This shows you that when the split is 60/40 the loss of power is slight (4% loss), but at 90/10 the loss is drastic!
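A quick R check of that formula with a hypothetical total sample of 200:
# Effective sample size as a function of the split proportion p
eff_n <- function(N, p) N * 4 * p * (1 - p)
eff_n(200, 0.50)   # 200 -> balanced groups, no loss
eff_n(200, 0.60)   # 192 -> about a 4% loss
eff_n(200, 0.90)   #  72 -> drastic loss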
  • asked a question related to Statistics
Question
9 answers
I have data from our experimental model, where we analyze the immune response following BCG vaccination, and then the responses and clinical outcome following Mtb infection of our vaccinated models. Because we cannot experimentally follow the very same entity after evaluating the post-vaccination response also for the post-vaccination plus post-infection studies, we have such data from different batches. Is it possible to do a correlation here between post-vaccination responses of 5 replicates in one batch (in different vaccine candidates) versus 4-5 replicates in vaccination & infection from another batch? I ask this because we are not following up the same replicates for post-vaccination and post-infection measurements (as it is not experimentally feasible). If correlation is not the best method, are there other ways to analyze the patterns, such as the strength of association between T cell response in BCG-vaccinated models versus increased survival of BCG-vaccinated models (both measurements are from different batches)? We have several groups like that, with a variety of parameters measured per group in different sets of experiments.
Thanks for your responses and help.
Relevant answer
Answer
To make it a bit simpler:
say you have treatments A and B, and your experiment is done in two batches 1 and 2.
If treatment A is analyzed in batch 1 and B in batch 2, then treatment and batch are perfectly confounded, so you have no chance to disentangle batch effects from treatment effects.
If samples with treatment A are measured in batch 1 and 2, and also samples with treatment B are measured in both batches, then one can model the batch effect and reveal the treatment effect.
If you have treatments A+C in batch 1 and treatments B+C in batch 2, you might estimate the batch effect from treatment C and apply it to correct A and B as well (dangerous, if the batch effect also depends on the treatment, but better than nothing).
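As a minimal sketch of the second scenario (both treatments measured in both batches), an additive batch adjustment can be written in R as follows; the data frame and effect structure here are simulated purely for illustration:
set.seed(1)
d <- data.frame(
  treatment = factor(rep(c("A", "B"), each = 10)),
  batch     = factor(rep(c("1", "2"), times = 10)),
  response  = rnorm(20)
)
# The treatment coefficient is estimated net of the additive batch difference
fit <- lm(response ~ treatment + batch, data = d)
summary(fit)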
  • asked a question related to Statistics
Question
1 answer
Dear all,
I want to calculate the effect of treatments (qualitative) on quantitative variables (e.g. plant growth, % infestation by nematodes, ...) compared to a control in an experimental setup. For the control and for each treatment, I have n=5.
Rather than comparing means between each treatments, including the control, I would like to to see whether each treatment has a positive/negative effect on my variable compared to the control.
For that purpose, I wanted to calculate the log Response Ratio (lnRR) that would show the actual effect of my treatment.
1) Method for the calculation of the LnRR
a) During my thesis, I had calculated the mean of the control value (Xc, out of n=5) and then compared it to each of the values of my treatments (Xti). Thereby, I ended up with 5 lnRR values (ln(Xt1/Xc); ln(Xt2/Xc); ln(Xt3/Xc); ln(Xt4/Xc); ln(Xt5/Xc)) for each treatment, calculated the mean of those lnRR values (n=5) and ran the following stats : i) comparison of the mean effects between all my treatments ("has one treatment a larger effect than the other one?") and ii) comparison of the effect to 0 ("is the effect significantly positive/negative?")
==> Does this method seem correct to you ?
b) While searching the literature, I found that most studies consider data from original studies and calculate lnRR from mean values within those studies. Hence, they end up with n>30. This is not our case here, as the data come from a single experimental setup...
I also found this: "we followed the methods of Hedges et al. to evaluate the responses of gas exchange and water status to drought. A response ratio (lnRR, the natural log of the ratio of the mean value of a variable of interest in the drought treatment to that in the control) was used to represent the magnitude of the effects of drought as follows:
LnRR = ln(Xe/Xc) = lnXe - lnXc,
where Xe and Xc are the response values of each individual observation in the treatment and control, respectively." ==> This is confusing to me because the authors say that they use mean values (of treatment / of control), but in their calculation they use "individual observations". Are the individual observations means within each study?
==> Can you confirm that I CANNOT compare each observation of replicate 1 of control with replicate 1 of treatment; then replicate 2 of control with replicate 2 of treatment and so on? (i.e. ln(Xt1/Xc1); ln(Xt2/Xc2); ln(Xt3/Xc3); ln(Xt4/Xc4); ln(Xt5/Xc5)). This sounds wrong to me as each replicate is independent.
2) Statistical use of LnRR
Taking my example in 1a), I did a t-test for the comparison of mean lnRR value with "0".
However, n<30 so it would probably be best not to use a parametric test :
=> any advice on that ?
=> Would you stick with a comparison of means from raw values, without trying to calculate the lnRR to justify an effect ?
Thank you very much for your advice on the use of LnRR within single experimental studies.
Best wishes,
Julia
Relevant answer
Answer
Hi Prof. Julia,
Your questions are what I am wondering about, I think they are very important and interesting. I am trying to explore the temporal pattern of the restoration effect along the age of restoration. I use Ln(Xrestored/Xunrestored) as a measure of the restoration effect based on a fixed study site with multiple years of observations. I think your questions are highly related to my method. Have you found the answer to these questions and could you please tell me if so? I would be very grateful for this.
Best wishes,
Erliang
  • asked a question related to Statistics
Question
21 answers
I have a vector based on a signal for which I need to calculate the log-likelihood and maximize it using maximum likelihood estimation. Is there any way to do this in MATLAB using the built-in function mle()?
Relevant answer
Answer
To maximize the log-likelihood estimate of a signal using maximum likelihood estimation in MATLAB, you can use the built-in optimization functions. Here's a general process you can follow:
  1. Define the likelihood function that you want to maximize. This function takes in the signal (vector) as its input and returns the log-likelihood estimate of the signal. The form of this function will depend on the specific problem you are trying to solve.
  2. Define any additional parameters that are needed by the likelihood function. For example, if you are estimating the parameters of a Gaussian distribution, you will need to define the mean and variance parameters.
  3. Use the "fminunc" function in MATLAB to perform the optimization. This function uses the gradient of the objective function to iteratively search for the optimum; since fminunc minimizes, you would typically pass it the negative log-likelihood. You will need to provide the objective function, an initial guess for the parameters, and any additional inputs.
  4. Extract the optimized parameter estimates from the output of the "fminunc" function. These are the values that maximize the log-likelihood of your signal.
So it depends on the model you have at hand. Here are some papers applying MLE to different types of problems [1-3]:
[1] Bazzi, Ahmad, Dirk TM Slock, and Lisa Meilhac. "Efficient maximum likelihood joint estimation of angles and times of arrival of multiple paths." 2015 IEEE Globecom Workshops (GC Wkshps). IEEE, 2015.
[2] Bazzi, Ahmad, Dirk TM Slock, and Lisa Meilhac. "On a mutual coupling agnostic maximum likelihood angle of arrival estimator by alternating projection." 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 2016.
[3] Bazzi, Ahmad, Dirk TM Slock, and Lisa Meilhac. "On Maximum Likelihood Angle of Arrival Estimation Using Orthogonal Projections." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
  • asked a question related to Statistics
Question
10 answers
Hi,
There is an article that I want to know which statistical method has been used, regression or Pearson correlation.
However, they don't say which one. They show the correlation coefficient and standard error.
Based on these two parameters, can I know if they use regression or Pearson correlation?
Relevant answer
Answer
Not sure I understand your question. If there is a single predictor and by regression you mean linear OLS regression, then the r is the same. Can you provide more details?
  • asked a question related to Statistics
Question
2 answers
I am examining whether the sex and religion of a defendant may impact their perceived guilt, risk, possibility of rehabilitation, and the harshness of sentencing. I have done this by creating 4 different case studies in which a defendant of differing sex and religion was suspected of committing a crime. There were 200 participants, and 50 each were given one of the 4 case studies. Participants then had to answer a number of questions about the case study, such as "what sentence do you think is fair?" All data are ordinal. I've been advised to use different statistical analyses, so I'm confused and would like some advice on which one to use.
Relevant answer
Answer
If the data are ordinal then it violates assumptions underlying many types of tests (e.g., ANOVA). Thus, the non-parametric equivalent is an option. Kruskal-Wallis is potentially suited for this design. I've done comparisons of one-way and factorial ANOVA versus their non-parametric equivalents and found the results to be virtually identical. If the DV is on a 7-point scale (or similar) then it is technically ordinal but practically interval in nature, and most researchers use parametric tests. If it is truly ordinal, such as a rank-based DV, then you'd want to look at rank-ordered tests (e.g., Friedman's test), but I doubt that would be a good fit here. I think you could also set up a regression model depending on how comfortable you are with it.
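For illustration, a minimal R sketch of the Kruskal-Wallis option, assuming a hypothetical data frame d with an ordinal outcome sentence and a four-level factor vignette (one level per case study):
# Omnibus test across the four vignettes
kruskal.test(sentence ~ vignette, data = d)
# Pairwise follow-up with Holm-adjusted p-values, if the omnibus test is significant
pairwise.wilcox.test(d$sentence, d$vignette, p.adjust.method = "holm")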
  • asked a question related to Statistics
Question
3 answers
Example Scenario: 1 categorical variable (Yes/No), 3 continuous dependent variables
- 3 independent sample's t tests are conducted (testing for group differences on three variables); let's assume 2 of the 3 variables are significant with medium-large effect sizes
- a binomial logistic regression is conducted for the significant predictors (for classification purposes and predictor strength via conversion of unstandardized beta weights to standardized weights)
Since 4 tests are planned, the alpha would be set at .0125 (.05/4) via the Bonferroni correction. Should the adjusted alpha be also applied to the p-values for the Wald tests in the "variables in question" output?
Thank you in advance!
Relevant answer
Answer
Its essentially analogous to ANOVA or ANCOVA in this context so logically one should consider this. However:
- with three categories you get complete protection against the partial null hypotheses if you only follow up a significant overall test of the effect (probably the LRT for the factor, analogous to the F test). This is the logic of Fisher's LSD, and so no correction would be needed if you relied on the overall test of the effect for Type I error protection.
- with four or more categories this property no longer holds, but it's not a good idea to use the Bonferroni correction as it's rather conservative. I'd suggest a modified Bonferroni procedure such as the Hommel or Hochberg correction. Hochberg can be done by hand easily, but both it and the Hommel correction are also implemented in R via the p.adjust() function. As the input is just a set of uncorrected p-values, you can use output from R or another package very easily:
> p.vals <- c(.0012, .043, .123, .232)
> p.adjust(p.vals, method='hochberg')
This uses Wright's adjusted p value approach rather than altering alpha, but the decisions are equivalent:
Wright, S. P. (1992). Adjusted P-values for simultaneous inference. Biometrics, 48, 1005–1013. doi:10.2307/2532694.
  • asked a question related to Statistics
Question
6 answers
For my master's thesis I have conducted visitor surveys at two different immersive experiences. To analyze the extent to which visitors feel immersed, I used three different conceptualizations of immersion (narrative transportation scale, 6 items; short immersion scale, 3 items; self-location scale, 5 items). The items are all on a 5-point Likert scale.
What is the best way to compare the 2 case studies on the 3 separate scales, so that I can draw conclusions about how immersed visitors of case study 1 feel versus how immersed visitors feel in case study 2?
So that it looks like this:
- narrative transportation: case 1 mean versus case 2 mean
- immersion: case 1 mean versus case 2 mean
- self-location: case 1 mean versus 2 mean
Is differentiating means even the way to go about this?
I've looked at:
- a simple independent-samples t-test, but this is said to be not advised for Likert scales, and my data are non-normally distributed, which is typical for Likert items
- the Mann-Whitney U test, but my data are around 70 respondents per group
But I've also heard that comparing individual Likert questions between groups is fine, whereas comparing groups of questions, like my immersion subscales, is not.
At this point, I'm at a loss as to how to tackle this, so any suggestion or comment is much appreciated...
Relevant answer
Answer
Are you analyzing individual Likert-score items or do you want to create scales for several related items? If you are creating scales, then you should first assess their reliability with coefficient alpha (aim for an alpha of .7 or above and preferably .8 or above). If each scale is sufficiently reliable, then you can use t-Tests to compare your two sources.
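For illustration, a short R sketch of that workflow with the psych package, assuming a hypothetical data frame d holding the six narrative transportation items (named nt1 to nt6 here) and a two-level factor case identifying the case study:
library(psych)
items <- paste0("nt", 1:6)          # hypothetical item names
alpha(d[, items])                   # coefficient alpha; aim for >= .70
d$narrative <- rowMeans(d[, items]) # scale score for each respondent
t.test(narrative ~ case, data = d)  # compare the two case studies
The same three steps (alpha, scale score, t-test) would be repeated for the immersion and self-location subscales.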
  • asked a question related to Statistics
Question
6 answers
Hello,
I'm running a one-way ANCOVA to compare a ratio variable between two groups and adjust for some confounding variables. The software I use (SPSS) reports partial eta squared as effect size statistic. However, I would like to know if it's valid to calculate Cohen's d for this analysis.
I first thought to use the least squares means, standard error of the mean (SEM) and n reported for the estimated marginal means from the ANCOVA analysis. First, multiply SEM * SQRT(n) to get each group's standard deviation (SD) and then calculate the pooled SD to calculate Cohen's d.
However, I also found that Fritz et al. reported an equation to calculate Cohen's d from eta squared as d = 2 * SQRT(eta squared) / SQRT(1 - eta squared).
Would Cohen's d calculated from the procedures above be valid for ANCOVA? Or should I report eta/partial eta squared?
Thank you in advance.
Alejandro.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1), 2–18. https://doi.org/10.1037/a0024338
Relevant answer
Answer
Taking a step back, the answer is that it depends on what you are trying to do. In a strict sense Cohen's d isn't defined for ANCOVA, only for a two-independent-group design, and therefore there are several more or less plausible ways to compute a standardised mean difference (SMD), depending on the details of the study (homogeneity of variance, choice of covariates and so on).
As Bruce Weaver mentioned I think that the adjusted mean difference (unstandardised) is arguably a superior effect size statistic for interpretation.
If you are interested in power analysis for a future replication with the same design etc., then computing the SMD from the adjusted mean difference and the root mean square error of the analysis makes some sense. This will reduce the estimate of variability and capture the increase in power that typically comes from including covariates.
However, this is not sensible for interpretation. Just because you have a more precise estimate doesn't mean the effect actually got bigger. Also, it will be scaled differently from a two-independent-group design (a little like comparing two effects scaled in cm and inches, or seconds and minutes). For an interpretable SMD in this case I'd scale the adjusted mean difference by the raw SD of the outcome (Y) variable.
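A small numeric sketch of the two scalings just described, with made-up numbers:
adj_diff <- 4.2   # adjusted (ANCOVA) mean difference between the groups
rmse     <- 6.0   # root mean square error from the ANCOVA
sd_y     <- 9.5   # raw SD of the outcome variable
adj_diff / rmse   # SMD for power planning with the same covariates (~0.70)
adj_diff / sd_y   # SMD scaled by the raw outcome SD, for interpretation (~0.44)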
  • asked a question related to Statistics
Question
3 answers
Hi! I'm looking for an open-source program dealing with exploratory techniques of multivariate data analysis, in particular those based on Euclidean vector spaces.
Ideally, it should be capable of handling databases as a set of k matrices.
There is a software package known as ACT-STATIS (or an older version named SPAD) which performs this task, but as far as I know they are not open source. Thanks!
Relevant answer
Answer
Thanks a lot ! I'll check and try it
  • asked a question related to Statistics
Question
3 answers
I have ten regions and I created dummy variables for them. When I run my model, one of them is shown as omitted, so I had to exclude it, and then I got a significant result. But I need the coefficient value of the excluded variable to estimate Total Factor Productivity.
If possible, could you please explain how I can calculate it, dear colleagues?
Relevant answer
Answer
See my answer to the same question that you asked at
it is about dummy variables that form a closed set.
  • asked a question related to Statistics
Question
4 answers
I am currently writing up my PhD thesis, and I have shown that even with a sampling resolution of 1 sample per cm, the black shale I am studying shows evidence of brief fluctuations between oxic and anoxic states (probably decades to centuries in duration) in the form of in-situ benthic macrofauna.
The issue is that many of the geochemical sampling techniques I used can only resolve proxy records at a 1 cm scale, due to sample weight requirements (e.g. total lipid extraction for biomarker analysis), and multiple redox oscillations become time-averaged in these samples.
Is it possible to use some sort of model based on Bayesian statistics to estimate the likely true frequency of oxia/anoxia in a given sampling interval (i.e. using the 1 cm scale proxy data and the <1 cm scale lithological data as priors)? Have there been any studies that have used some sort of Bayesian model to estimate true frequencies between samples (in any field of study)?
Relevant answer
Answer
Not a geologist (at least not for a long time), but from applying Bayesian statistics on shorter time scales and in other applications, I would say yes. Additionally, you may want to look into other statistical sampling and resampling techniques, such as bootstrapping.
  • asked a question related to Statistics
Question
1 answer
How can I add robust 97.5% confidence ellipses to the variation diagrams (X-Y, ilr-transformed) using the robCompositions or compositions packages?
Best
Azzeddine
Relevant answer
Answer
For reference, I have checked a group of packages that can add the robust 97.5% confidence ellipses.
Here they are, by package and function:
1. ellipses, using the package 'ellipse'
2. ellipses, using the package 'rrcov'
3. ellipses, using the package 'cluster'
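If it helps, here is a minimal sketch in base R that draws a robust 97.5% ellipse from an MCD covariance estimate (robustbase::covMcd); the x-y coordinates stand in for two ilr-transformed variables and are simulated here, so adapt as needed:
library(robustbase)   # covMcd()
set.seed(1)
xy <- cbind(x = rnorm(100), y = rnorm(100))   # placeholder ilr coordinates
rob <- covMcd(xy)                             # robust centre and covariance
# Points on the 97.5% ellipse boundary: centre + r * z %*% chol(cov),
# with z on the unit circle and r^2 the chi-squared(2) quantile
theta  <- seq(0, 2 * pi, length.out = 200)
circle <- cbind(cos(theta), sin(theta))
r      <- sqrt(qchisq(0.975, df = 2))
ell    <- sweep(r * circle %*% chol(rob$cov), 2, rob$center, "+")
plot(xy, asp = 1)
lines(ell, col = "red")
The same lines() call can be layered on any X-Y variation diagram once the robust centre and covariance of the transformed coordinates are available.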
  • asked a question related to Statistics
Question
3 answers
Hi all!
Does anyone use endophenotypes for research purposes? I found very interesting papers explaining the value of endophenotypes (mostly in psychiatry), but I think the concept is perfectly applicable to any medical condition. Unfortunately, I can't find a methodological paper explaining the process of constructing an endophenotype.
Is there a formal statistical/methodological approach to do this? Or, beyond the process of constructing them, is there an evidence-based process to probe their suitability?
Relevant answer
Answer
Endophenotypes are inconspicuous genotypic traits that are not seen phenotypically but can be detected via laboratory and/or radiologic findings, and that are not implied by clinical manifestations.
  • asked a question related to Statistics
Question
8 answers
What test is appropriate for a data set with 10 continuous dependent variables and one dichotomous independent variable? Is it possible to perform 10 separate independent t-tests or some sort of ANOVA (MANOVA)? The sample size is 1022.
Relevant answer
Answer
Dear Ngozi, you can surely use MANOVA or any other testing strategy; the point is: what is your goal? With 1022 statistical units, almost anything is statistically significant, even with a very small difference between the groups. So I suggest you shift your focus to "which variable allows for a better discrimination between the two classes?" You can approach this in many ways (linear discriminant analysis, kNN, canonical analysis, or a simple ROC curve). Given that you have so many data points, I suggest you build your model on a subset of the data (training set) and look at the model's performance on a test set made of the data you did not use in the previous analysis.
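For illustration, a minimal R sketch of that train/test approach with linear discriminant analysis, assuming a hypothetical data frame d whose column group is the dichotomous variable and whose remaining columns are the 10 DVs:
library(MASS)
set.seed(1)
train <- sample(nrow(d), size = round(0.7 * nrow(d)))   # ~70% training split
fit  <- lda(group ~ ., data = d[train, ])
pred <- predict(fit, newdata = d[-train, ])$class
table(observed = d$group[-train], predicted = pred)     # hold-out performance
fit$scaling                                             # discriminant weights per DV
The discriminant weights give a rough sense of which variables carry most of the separation between the two classes.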
  • asked a question related to Statistics
Question
6 answers
How can I calculate ANOVA table for the quadratic model by Python?
I want to calculate a table like the one I uploaded in the image.
Relevant answer
Answer
To calculate an ANOVA (Analysis of Variance) table for a quadratic model in Python, you can use the statsmodels library. Here is an example of how you can do this:
#################################
import numpy as np   # np.power() is used inside the model formula
import statsmodels.api as sm
# Fit the quadratic model using OLS (Ordinary Least Squares)
model = sm.OLS.from_formula('y ~ x + np.power(x, 2)', data=df)
results = model.fit()
# Print the ANOVA table
print(sm.stats.anova_lm(results, typ=2))
#################################
In this example, df is a Pandas DataFrame that contains the variables y and x. The formula 'y ~ x + np.power(x, 2)' specifies that y is the dependent variable and x and x^2 are the independent variables. The from_formula() method is used to fit the model using OLS. The fit() method is then used to estimate the model parameters.
The anova_lm() function is used to calculate the ANOVA table for the model. The typ parameter specifies the type of ANOVA table to compute, with typ=2 corresponding to a Type II ANOVA table.
This code will print the ANOVA table to the console, with columns for the source of variance, degrees of freedom, sum of squares, mean squares, and F-statistic. You can also access the individual elements of the ANOVA table using the results object, for example:
#################################
# Print the F-statistic and p-value
print(results.fvalue)
print(results.f_pvalue)
#################################
I hope that helps
  • asked a question related to Statistics
Question
3 answers
We are using a continuous scoring system for the Eating Disorder Diagnostic Scale that gives a total score of disordered eating symptomatology with a score range from 0 to 109. We have a sample size of 48 and the scores range from 8 to 81. We want to see if participants' scores on the EDDS, along with whether they received intranasal oxytocin or placebo in a repeated-measures study design, had a significant effect on our dependent variables including performance on the Emotion Recognition Task and visual analogue scales taken at different time points during their visit, hence why we are using a repeated-measures ANOVA.
Any thoughts or advice would be greatly appreciated, thank you!
Relevant answer
Answer
It sounds like you need to assess the reliability of your multi-item scale before you include it in either a MANOVA or a regression.
  • asked a question related to Statistics
Question
17 answers
In my time series dataset, I have 1 dependent variable and 5 independent variables and I need to find the independent variable that affects the dependent variable the most (the independent variable that explains most variations in the dependent variable). Consider all 5 variables are economic factors.
Relevant answer
There are several statistical models that can be used to analyze the relationship between time series variables. Some common approaches include:
  1. Linear regression: This is a statistical model that is commonly used to understand the relationship between two or more variables. In your case, you can use linear regression to model the relationship between your dependent variable and each of the independent variables, and then compare the coefficients to determine which independent variable has the strongest effect on the dependent variable.
  2. Autoregressive Integrated Moving Average (ARIMA) model: This is a type of statistical model that is specifically designed for analyzing time series data. It involves using past values of the time series to model and forecast future values.
  3. Vector Autoregression (VAR) model: This is a statistical model that is similar to ARIMA, but it allows for multiple time series variables to be analyzed simultaneously.
  4. Granger causality test: This is a statistical test that can be used to determine whether one time series is useful in forecasting another time series. In other words, it can be used to determine whether one time series "Granger causes" another time series.
  5. Transfer function model: This is a statistical model that is used to analyze the relationship between two or more time series variables by considering the transfer of information from one series to another.
It's worth noting that there is no one "best" model for analyzing the relationship between time series variables, and the appropriate model will depend on the specifics of your data and the research question you are trying to answer.
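As a sketch of option 1 only (it ignores autocorrelation and the time-series structure, so treat it as a rough first look), standardized coefficients or an R-squared decomposition can be compared in R; d is a hypothetical data frame with the dependent variable y and predictors x1 to x5:
# Standardize all variables, then compare the absolute coefficients
d_std <- as.data.frame(scale(d))
fit <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = d_std)
summary(fit)
# A more formal decomposition of R^2 into per-predictor shares (LMG) is
# available in the 'relaimpo' package:
# relaimpo::calc.relimp(lm(y ~ ., data = d), type = "lmg")
If the series are non-stationary, the same comparison should be done on differenced data or within one of the time-series models listed above.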
  • asked a question related to Statistics
Question
6 answers
For my master's thesis, I conducted an infection assay experiment on wheat plants in pots to test several treatments for their effectiveness in controlling the pathogen. I had 10 variants in total. Each variant consisted of 5 pots, of which 2 were put in a frame and measured each day for another, rather unrelated experiment (hyperspectral imaging). As the pots in the frames were moved daily to a measuring chamber, lay under lights for several minutes and were fixed in the frame, we initially decided to score those pots separately and only use the 3 other pots per variant to test the effectiveness of the treatment.
As the data contained many zeros and didn't follow a Gaussian distribution, I conducted a Kruskal-Wallis test with Wilcoxon as the post hoc test. With only 3 repetitions per variant, I get very high p-values. Now I want to test whether being put in the frame had a significant effect on the plants/pathogens, because if not, I can combine the scoring values of the 2 pots in the frame with the other 3 to have a total of 5 repetitions per variant. For this, I plan to implement a factor called 'frame' with values 1/2 (yes/no). However, I don't know which test to conduct here to evaluate the effect.
Do I have to conduct a confirmatory factor analysis?
Thanks in advance!
Relevant answer
Answer
Hello again, I think you don't have enough data to perform statistical testing on single variants to compare them.
What I suggest for comparing the pots on the frame with the others is to first calculate the mean value of your variable for each variant and each pot type, and make a scatter plot of the results. Each point represents a variant, with the mean for the pots on the frame on the x-axis and the mean for the three other pots on the y-axis. This way you can see whether there is a correlation between the two. If you add a regression line, the closer this line is to y = x, the safer it is to use all 5 results per variant regardless of whether the pot was in the frame or not. You will have 10 points; use the R-squared or the mean square error to evaluate the model.
Overall, since you don't have enough data and statistical tests do not give you good results, rely mainly on plots and simpler exploratory data analysis.
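A rough R sketch of that plot, assuming a hypothetical data frame scores with columns variant, frame ("yes"/"no") and score; the pairing relies on every variant appearing in both subsets:
# Per-variant means for the pots in the frame and the other pots
m_frame <- tapply(scores$score[scores$frame == "yes"],
                  scores$variant[scores$frame == "yes"], mean)
m_other <- tapply(scores$score[scores$frame == "no"],
                  scores$variant[scores$frame == "no"], mean)
plot(m_frame, m_other,
     xlab = "Mean score (pots in frame)", ylab = "Mean score (other pots)")
abline(0, 1, lty = 2)            # y = x reference line
fit <- lm(m_other ~ m_frame)
abline(fit)                      # fitted regression line
summary(fit)$r.squared           # how well the frame pots track the others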