R - Science topic

Explore the latest questions and answers in R, and find R experts.
Questions related to R
  • asked a question related to R
Question
4 answers
I am trying to estimate an ECM with panel data and have run into the following problem. When I run the ecm command from the ecm package, an error occurs saying "non-numeric matrix extent". I found out that the cause lies in how the panel data are created. I tried two different approaches to fix the problem.
At first I created the panel dataset with the pdata.frame command from the plm package.
p.dfA <- pdata.frame(data, index = c("id", "t"))
where “index” indicates the individual and time indexes for the panel dataset. The command converts the id and t variables into factor variables which later leads to the error in the ecm command.
Secondly, I created the panel dataset with the panel_data command from the panelr package.
p.dfB <- panel_data(data, id = "id", wave = "t")
where “id” is the name of the column (unquoted) that identifies participants/entities. A new column will be created called id, overwriting any column that already has that name. “wave” is the name of the column (unquoted) that identifies waves or periods. A new column will be created called wave, overwriting any column that already has that name.
This panel_data command also converts the id variable into a factor variable. So, the same error occurs in the ecm command.
If I convert the factor variables back into numeric variables, I lose the panel structure of the dataset.
Could someone please explain to me how to run an ECM with panel data in R?
Thank you very much in advance!
Sample R Script attached
Relevant answer
Answer
To calculate an Error Correction Model (ECM) with panel data in R, you can use packages like plm or panel. These packages provide functions to estimate fixed effects or random effects models, which can incorporate an error correction term for panel data analysis.
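As a minimal, hedged sketch of one workaround (assuming the id and t columns hold numeric codes): keep numeric copies of the index variables before calling pdata.frame(), so that the factor conversion does not affect the columns you later pass to ecm().
library(plm)
# keep numeric copies of the index variables (hypothetical column names id / t)
data$id_num <- as.numeric(as.character(data$id))
data$t_num  <- as.numeric(as.character(data$t))
p.dfA <- pdata.frame(data, index = c("id", "t"))
# id_num and t_num stay numeric inside the pdata.frame and can be passed to ecm()
str(p.dfA[, c("id", "t", "id_num", "t_num")])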
  • asked a question related to R
Question
4 answers
Hello all,
I am running into a problem I have not encountered before with my mediation analyses. I am running a simple mediation X > M > Y in R.
Generally, I understand that the total effect does not have to be significant for there to be a mediation effect, and in the case I am describing this would be a logical occurrence, since the effects of paths a and b are both significant (-.142 and .140, respectively), thus resulting in a 'null effect' for the total effect.
However, my c path (X > Y) is not simply 'non-significant' as I would expect; rather, the regression does not fit (see below):
(Residual standard error: 0.281 on 196 degrees of freedom Multiple R-squared: 0.005521, Adjusted R-squared: 0.0004468 F-statistic: 1.088 on 1 and 196 DF, p-value: 0.2982).
Usually I would say you cannot interpret models that do not fit, and since this path is part of my model, I hesitate to interpret the mediation at all. However, the other paths do fit and are significant. Could the non-fitting also be a result of the paths cancelling one another?
Note: I am running bootstrapped results for the indirect effects, but the code does utilize the 'total effect' path, which does not fit on its own, therefore I am concerned.
Note 2: I am working with a clinical sample, so the sample size is not as large as I'd like: group 1: 119; group 2: 79 (N = 198).
Please let me know if additional information is needed and thank you in advance!
Relevant answer
Answer
Somehow it is not clear to me what you mean by "does not fit". Could you please provide the output of the whole analysis? I think this would be helpful.
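For reference, a minimal sketch of how such a mediation can be set up in lavaan so that the direct, indirect, and total effects are all estimated (and bootstrapped) within one model; the data frame dat and the variables X, M, and Y are placeholders, not the poster's actual data:
library(lavaan)
model <- '
  M ~ a*X
  Y ~ b*M + c*X          # c = direct effect of X on Y
  indirect := a*b        # mediated effect
  total    := c + a*b    # total effect
'
fit <- sem(model, data = dat, se = "bootstrap", bootstrap = 5000)
summary(fit, ci = TRUE)
A non-significant (or poorly fitting) total-effect regression on its own does not prevent interpreting the bootstrapped indirect effect estimated within this model.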
  • asked a question related to R
Question
4 answers
Hi,
I made metagenomic predictions from 16S data using Tax4Fun and processed the results using STAMP. However, I want to know whether phyloseq is a suitable R package for this. Can someone give me their opinion on that, or recommend another R package for this kind of analysis?
Thanks in advance,
Carolina.
Relevant answer
Answer
Taking advantage of the opportunity of my fellow researcher's question, I am also having difficulty working with the Phyloseq + Tax4Fun2 packages in RStudio. I've read countless tutorials but I can't find the error. If anyone is interested in partnering on publishing and teaching this tool, please get in touch as I would really like to learn.
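As a rough, untested illustration of how a Tax4Fun prediction table could be wrapped into a phyloseq object so the usual phyloseq tools apply (pred_mat, meta, and the sample variable treatment_group are assumptions, not part of the original question):
library(phyloseq)
# pred_mat: numeric matrix of predicted functions (rows) x samples (columns)
# meta: data frame of sample metadata, row names matching colnames(pred_mat)
otu  <- otu_table(pred_mat, taxa_are_rows = TRUE)
samp <- sample_data(meta)
ps <- phyloseq(otu, samp)
# standard phyloseq tooling then works, e.g. ordination of the functional profiles
ord <- ordinate(ps, method = "PCoA", distance = "bray")
plot_ordination(ps, ord, color = "treatment_group")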
  • asked a question related to R
Question
10 answers
In the domain of clinical research, where the stakes are as high as the complexities of the data, a new statistical aid emerges: bayer: https://github.com/cccnrc/bayer
This R package is not just an advancement in analytics - it’s a revolution in how researchers can approach data, infer significance, and derive conclusions
What Makes `Bayer` Stand Out?
At its heart, bayer is about making Bayesian analysis robust yet accessible. Born from the powerful synergy with the wonderful brms::brm() function, it simplifies the complex, making the potent Bayesian methods a tool for every researcher’s arsenal.
Streamlined Workflow
bayer offers a seamless experience, from model specification to result interpretation, ensuring that researchers can focus on the science, not the syntax.
Rich Visual Insights
Understanding the impact of variables is no longer a trudge through tables. bayer brings you rich visualizations, like the one above, providing a clear and intuitive understanding of posterior distributions and trace plots.
Big Insights
Clinical trials, especially in rare diseases, often grapple with small sample sizes. `Bayer` rises to the challenge, effectively leveraging prior knowledge to bring out the significance that other methods miss.
Prior Knowledge as a Pillar
Every study builds on the shoulders of giants. `Bayer` respects this, allowing the integration of existing expertise and findings to refine models and enhance the precision of predictions.
From Zero to Bayesian Hero
The bayer package ensures that installation and application are as straightforward as possible. With just a few lines of R code, you’re on your way from data to decision:
# Installation
devtools::install_github("cccnrc/bayer")
# Example usage: Bayesian logistic regression
library(bayer)
model_logistic <- bayer_logistic(
  data = mtcars,
  outcome = 'am',
  covariates = c('mpg', 'cyl', 'vs', 'carb')
)
You then have plenty of functions to further analyze your model; take a look at bayer.
Analytics with An Edge
bayer isn’t just a tool; it’s your research partner. It opens the door to advanced analyses like IPTW, ensuring that the effects you measure are the effects that matter. With bayer, your insights are no longer just a hypothesis — they’re a narrative grounded in data and powered by Bayesian precision.
Join the Brigade
bayer is open-source and community-driven. Whether you’re contributing code, documentation, or discussions, your insights are invaluable. Together, we can push the boundaries of what’s possible in clinical research.
Try bayer Now
Embark on your journey to clearer, more accurate Bayesian analysis. Install `bayer`, explore its capabilities, and join a growing community dedicated to the advancement of clinical research.
bayer is more than a package — it’s a promise that every researcher can harness the full potential of their data.
Explore bayer today and transform your data into decisions that drive the future of clinical research: bayer - https://github.com/cccnrc/bayer
Relevant answer
Answer
Many thanks for your efforts!!! I will try it out as soon as possible and will provide feedback on github!
All the best,
Rainer
  • asked a question related to R
Question
2 answers
I would like to use the "create.matrix" function (R package "fossil") to create a matrix of taxa, localities, time and abundance.
In the function, time is specified by the arguments "time.col" (= column name or number containing the time periods) and "time" (= which time periods to keep for the matrix). In my data table, time is specified as a calendar date (d.m.y: e.g. 31.12.2022).
How should time ("time periods") be specified or formatted in the data table to be used by the create.matrix function?
How should time periods be specified in the "time" argument?
I want to create a species by (locality by date) matrix (a part of the data is attached). Some localities were sampled only once, some several times. As "time period" in the argument "time" I want to use the date of sampling (column "date" in the dataset), but I do not know how to specify it in the code below ("????"):
Function:
df <- create.matrix(dat, tax.name = "taxon_abbrev", locality = "locality_ID", time.col="date", time="????", abund.col = "num", abund = TRUE)
Relevant answer
Answer
Henry EZE Chukwu Thanks for the answer.
But how to specify the 'time' argument in 'create.matrix' function?
"time" = which time periods to keep for the matrix. I want to keep each date for the matrix.
  • asked a question related to R
Question
4 answers
Software procedures:
Perform a one-way ANOVA analysis of data with R software.
Find the "P" and "F" values from a one-way ANOVA analysis of experimental data.
Relevant answer
Answer
You can run statistical analyses like ANOVA using LibreOffice Calc (pre-installed on Ubuntu systems): 1) enter your data, 2) select 'Data' from the menu above, 3) select ANOVA (one-way or two-way, according to your requirement).
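Since the topic is R, here is a minimal one-way ANOVA sketch in R as well; the data frame df and its columns yield and treatment are placeholders:
# one-way ANOVA: numeric response ~ grouping factor
fit <- aov(yield ~ treatment, data = df)
summary(fit)   # the ANOVA table reports the F value and Pr(>F), i.e. the p-value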
  • asked a question related to R
Question
5 answers
I want to estimate the half-life value for the virus as a function of strain and concentration, and as a continuous function of temperature.
Could anybody tell me, how to calculate the half-life value in R programming?
I have attached a CSV file of the data
Relevant answer
Answer
Estimating the half-life of a virus involves understanding its stability and decay rate under specific environmental or biological conditions. This is a crucial parameter in virology, impacting everything from the design of disinfection protocols to the assessment of viral persistence in the environment or within a host. Here's a structured approach to estimating the half-life values for a virus:
  1. Defining Conditions:Environment: Specify the environmental conditions such as temperature, humidity, UV exposure, and presence of disinfectants, as these factors significantly affect viral stability. Biological: In biological systems, consider the impact of host factors such as immune response, tissue type, and presence of antiviral agents.
  2. Experimental Setup:Sampling: Begin by preparing a known concentration of the virus under controlled conditions. Time Points: Collect samples at predetermined time points that are appropriate based on preliminary data or literature values suggesting the expected rate of decay.
  3. Quantitative Assays:Plaque Assay: One of the most accurate methods for quantifying infectious virus particles. It measures the number of plaque-forming units (PFU) which reflect viable virus particles. PCR-Based Assays: These can measure viral RNA or DNA but do not distinguish between infectious and non-infectious particles. Adjustments or complementary assays might be required to correlate these results with infectivity. TCID50 (Tissue Culture Infective Dose): This assay determines the dilution of virus required to infect 50% of cultured cells, providing another measure of infectious virus titer.
  4. Data Analysis: Plot Decay Curves: Use logarithmic plots of the viral titer (e.g., PFU/mL or TCID50/mL) against time. The decay of viral concentration should ideally follow first-order kinetics in the absence of complicating factors. Calculate Half-Life: The half-life can be calculated from the slope (k) of the linear portion of the decay curve on a logarithmic scale: t1/2 = ln(2) / k. Statistical Analysis: Ensure statistical methods are used to analyze the data, providing estimates of variance and confidence intervals for the half-life.
  5. Validation and Replication:Replicate Studies: Conduct multiple independent experiments to validate the half-life estimation. Variability in viral preparations and experimental conditions can affect the reproducibility of results. Peer Review: Consider external validation or peer review of the methodology and findings to ensure robustness and accuracy.
  6. Interpretation and Application:Contextual Interpretation: Understand that the estimated half-life is context-specific. Results obtained under laboratory conditions may differ significantly from those in natural or clinical settings. Application in Risk Assessment: Use the half-life data to inform risk assessments, disinfection strategies, or predictive modeling of viral spread and persistence.
By meticulously following these steps and ensuring the precision of each phase of the process, one can accurately estimate the half-life of a virus under specific conditions. This information is essential for developing effective control strategies and understanding the dynamics of viral infections.
Perhaps this protocol list can give us more information to help solve the problem.
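As a small R sketch of step 4 (assuming first-order decay and a data frame virus with hypothetical columns titer, time, strain, and temp), the half-life follows directly from the slope of a log-linear fit:
# log-linear fit of the decay curve: log(titer) declines linearly with time
fit <- lm(log(titer) ~ time, data = virus)
k <- -coef(fit)["time"]          # first-order decay rate
half_life <- log(2) / k          # t1/2 = ln(2) / k
# strain- and temperature-specific decay rates via interaction terms
fit2 <- lm(log(titer) ~ time * strain * temp, data = virus)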
  • asked a question related to R
Question
2 answers
I am trying to implement a gross segmentation step in my routine R pipeline for satellite image classification (it could help generate learning polygons for OBIA). Is there an R function to apply segmentation to a spectral band?
Many thanks,
Relevant answer
Answer
Many thanks Jonathan ! I had a previous look on it but my R seems to refuse to install this package, so I was looking for another solution just in case!
  • asked a question related to R
Question
5 answers
Hi there, I am currently struggling with running analysis on my data set due to missing data caused by the research design.
The IV is binary -- 0 = don't use, 1 = use
Participants were asked to rate 2 selection procedures they use and 2 they don't use, and were provided with 5 options. So, for every participant there are ratings for 4 of the 5 procedures.
Previous studies used a multilevel logistic regression, and analysed the data in HLM as it can cope with missing data.
Would R be able to handle this kind of missing data? I currently only have access to either SPSS and R. Or is there a particular way to handle this kind of missing data?
Relevant answer
Answer
I'm not sure you need to impute. Multilevel models can cope with missing outcome data without any imputation (basically this is just imbalance and the model is treating the missing responses as if MNAR). There is an issue here letting participants choose which option not to respond to - as that could bias things. If the choice were balanced or random then it would be better than letting them choose.
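If you do stay in R, a minimal sketch with lme4 is shown below (treating the binary use/don't-use variable as the outcome, with hypothetical long-format column names; adapt the predictors to the models used in the previous studies):
library(lme4)
# one row per participant x procedure; unrated procedures are simply absent
fit <- glmer(use ~ rating + procedure + (1 | participant),
             data = long_dat, family = binomial)
summary(fit)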
  • asked a question related to R
Question
3 answers
Latin hypercube design
global and local optimization
4D color maps
How do we assess the global behavior of a model, obtain its equilibrium points, and analyze their stability through a series of simulations?
Relevant answer
Answer
Assessing the global behavior of a model and obtaining equilibrium points while analyzing their stability through simulation series typically involves the following steps:
  1. Define the Model: Clearly define the mathematical model that represents the system you want to study. This model could be based on differential equations, difference equations, agent-based models, etc.
  2. Identify Parameters and Variables: Identify the parameters and variables involved in the model. Parameters are constants that influence the behavior of the system, while variables are quantities that change over time.
  3. Determine Equilibrium Points: Equilibrium points are where the system's state variables remain constant over time. To find these points, set the derivatives of the state variables to zero and solve the resulting system of equations.
  4. Linearize the Model: Linearize the model around each equilibrium point. This involves approximating the behavior of the system near the equilibrium points using linear differential equations or difference equations.
  5. Stability Analysis: Analyze the stability of each equilibrium point by examining the eigenvalues of the linearized system. If all eigenvalues have negative real parts, the equilibrium point is stable. If any eigenvalue has a positive real part, the equilibrium point is unstable. If some eigenvalues have negative real parts and others have positive real parts, the stability is more complex and could involve limit cycles or chaotic behavior.
  6. Simulation Series: Perform simulation series by numerically integrating the model equations over a range of parameter values or initial conditions. This allows you to observe the dynamic behavior of the system and how it changes with different inputs.
  7. Visualize Results: Visualize the simulation results to gain insights into the system's behavior. This could involve plotting time series, phase portraits, bifurcation diagrams, or other relevant visualizations.
  8. Sensitivity Analysis: Conduct sensitivity analysis to understand how changes in model parameters or initial conditions affect the system's behavior. This helps identify which factors have the most significant impact on the system's dynamics.
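As a compact illustration of steps 3-5 in R (the model, parameter values, and package choice are mine, not the poster's): find an equilibrium numerically with rootSolve, then inspect the eigenvalues of the Jacobian at that point.
library(rootSolve)
# hypothetical two-state model with density dependence (stable equilibrium)
model <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dx <- a * x * (1 - x / K) - b * x * y
    dy <- g * x * y - d * y
    list(c(dx, dy))
  })
}
parms <- c(a = 1, b = 0.5, g = 0.2, d = 0.3, K = 10)
# equilibrium point reached from an initial guess
eq <- steady(y = c(x = 1, y = 1), func = model, parms = parms, method = "runsteady")
# local stability: eigenvalues of the Jacobian evaluated at the equilibrium
J <- jacobian.full(y = eq$y, func = model, parms = parms)
eigen(J)$values   # all real parts negative => locally stable equilibrium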
  • asked a question related to R
Question
2 answers
Which is the more accurate/useful command in R for performing a necessity test: pofind() (with the option "nec") or superSubset()?
Which is more common?
I understand that pofind() tests isolated ("independent") conditions and their negations; while superSubset() seeks configurations (i.e. combinations of conditions)....
I used to use superSubset()... but, perhaps for the two-step QCA process and for an ESA analysis... pofind() might be more useful...
Any insight?
Relevant answer
Answer
There should be no difference regarding accuracy, regardless of whether you use "pof", "pofind", or "superSubset" from the QCA package or "QCAfit" from the SetMethods package.
In terms of utility, this obviously depends on your research aims. As a standard function, I recommend working with QCAfit to investigate the potential necessity of all conditions in a single analytical step. Often, this is what you want to know - whether there are any conditions in your analysis that are in themselves consistently necessary.
The superSubset function is useful but it can also result in misleading conclusions, as when researchers try to make sense of necessary disjunctions with two or more elements. Of course, if there are theoretical expectations about specific SUIN conditions, then these could be tested with superSubset and/or with a specified argument in pof.
  • asked a question related to R
Question
1 answer
Hi all,
I'm currently working on a logistic regression model in which I've included year as a random effect, so in the end I am working with a Generalized Linear Mixed Model (GLMM). I've built the model, obtained the output, and checked the residuals with a very handy package called 'DHARMa', and everything is OK.
But looking through the literature and documentation on GLMMs, I found that a good practice for evaluating logistic regression models is k-fold cross-validation (CV). I would like to perform CV on my model to check how good it is, but I can't find a way to implement it for a GLMM; everything I found is oriented towards GLMs only.
Anyone could help me? I would be very thankful!!
Iraida
Relevant answer
Answer
Use cvMixed function from the cv library. It works smoothly with glmer objects (GLMMs) from the lme4 library.
  • asked a question related to R
Question
4 answers
Dear Colleagues,
I'm looking for a solution for fitting multivariate multiple regression with mixed effects in R.
In my case I have multiple dependent and independent variables with a hierarchical structure: samples are nested inside treatment groups A/B, and treatments are nested in sites.
As far as I know, the lme4 (lmer) and glmmTMB packages work well with mixed effects, but they don't accept multiple response variables.
I am also interested in Bayesian modeling, but it seems that the bnsp and rstanarm packages cannot do both.
Can you recommend a solution to this?
Relevant answer
Answer
Ádám Fehér I would like to second Wim Kaijser 's answer about the use of multivariate analyses. In my research area, psychology, it is very seldom that you have a truly multivariate question. Therefore, it does not make sense to do a multivariate analysis. Nevertheless, I see many MANOVAs, just because you have multiple DVs and your lecturer and textbooks told you to do if you have this situation (and I am pretty sure I was guilty of this behavior, too…). Of course, I cannot tell how it is in your research area.
Just to give two examples from clinical psychology.
1) Imagine you have a treatment (and possible confounders) and you would like to assess the impact of the treatment on several outcome variables, like depression, anxiety, body mass index, checking behavior and some more. Here it does not make any sense imo to do a MAN(C)OVA, since the DVs may be not even correlated with each other, and you cannot tell what the meaning of the artificial variate (DV) will be. It explains the maximum variance between the original DVs, but what does this mean? Typically, you are interested in the treatment effect on each DV separately, i.e. you want to know how the treatment affects each DV, which is an univariate question.
2) Take the example from above, but your DVs are different measures of depression, e.g. measured with different questionnaires, like BDI, PVQ, SCL etc. Now, you may be interested in the treatment effect on depression as such, and you think that an artificial variate representing the “core” of all depression measures gives you the answer. You are not interested in the difference on BDI or SCL itself, just if the treatment affects depression. Here I would say that you have a multivariate question/hypothesis and therefore, a MAN(C)OVA may be appropriate.
As you can see, there is no single answer to your problem. Maybe you want to check Huang (2020) on this topic.
Nevertheless, I also recommend to use the brms package, which allows you for generalized linear mixed models and also multivariate approaches, as well as (truly) non linear models.
Huang, F. L. (2020). MANOVA: A procedure whose time has passed?. Gifted Child Quarterly, 64(1), 56-60.
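To make the brms suggestion concrete, here is a minimal sketch of a multivariate mixed model; the variable and grouping names are placeholders matching the structure described in the question:
library(brms)
# two responses modelled jointly, samples nested in treatment within site
f1 <- bf(y1 ~ x1 + x2 + (1 | site/treatment))
f2 <- bf(y2 ~ x1 + x2 + (1 | site/treatment))
fit <- brm(f1 + f2 + set_rescor(TRUE),   # estimate the residual correlation between y1 and y2
           data = dat, chains = 4, cores = 4)
summary(fit)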
  • asked a question related to R
Question
5 answers
I have two satellite images, named reference.tif and test.tif, and I'd like to calculate the similarity between the rasters using the Spectral Angle Mapper (SAM) in R. From my research I couldn't find a package that performs SAM on raster data. Is it possible to do it in R, or should I move to another software package?
Relevant answer
Answer
Maybe it is possible to not focus on the pure spectral similarity/difference but on the overall similarity/difference between two 2D arrays? e.g., https://stackoverflow.com/questions/189943/how-can-i-quantify-difference-between-two-images
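If you prefer to stay with the spectral angle itself, it can be computed directly with raster algebra; a minimal sketch with the terra package, assuming both images share the same bands, extent, and resolution:
library(terra)
ref  <- rast("reference.tif")    # multi-band reference image
test <- rast("test.tif")         # multi-band test image
# per-pixel spectral angle: arccos of the normalised dot product across bands
num   <- sum(ref * test)                       # sum over bands of band-wise products
denom <- sqrt(sum(ref^2)) * sqrt(sum(test^2))  # product of the two spectral norms
sam   <- acos(num / denom)                     # radians; 0 means identical spectra
plot(sam)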
  • asked a question related to R
Question
6 answers
Hi there,
I am trying to use the variogramST function in R for spatio-temporal kriging.
The values I am working with are monthly, while in variogramST the predefined "tunit" options are hours, days, or weeks.
I would appreciate it if anyone could tell me how to change it to months.
Thanks,
Hamideh
Relevant answer
Answer
Hello all, I have two queries: 1) I have observations on random days over 5 months and need to aggregate these values on a monthly basis. How can this be done? 2) No matter how much I change the parameters in vgmST, my MSE and RMSE are really high (seven-digit numbers).
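On the first query, a minimal monthly-aggregation sketch (the column names station, date, and value are assumptions):
library(dplyr)
library(lubridate)
# obs: hypothetical data frame with columns station, date, value
monthly <- obs %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(station, month) %>%
  summarise(value = mean(value, na.rm = TRUE), .groups = "drop")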
  • asked a question related to R
Question
1 answer
I am new to spatial transcriptomics and am exploring data sets. Can someone walk me through what's contained in the R Object (Robj) files that I have?
Relevant answer
Answer
First load the Robject file and see what it contains.
how to open an Robj file in r? - Stack Overflow
You can use ls() to list all the objects in it.
For most analyses there are libraries already written, so you just need to find the type of analysis you want to conduct, look for the corresponding library, install it, and follow the documentation.
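A minimal sketch of those inspection steps (the file name is a placeholder):
# load() restores the stored objects into the workspace and returns their names
obj_names <- load("spatial_data.Robj")
obj_names
str(get(obj_names[1]))     # inspect the structure of the first object
class(get(obj_names[1]))   # e.g. a Seurat or SpatialExperiment object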
  • asked a question related to R
Question
2 answers
I've tried with many packages, but the main issue arises when you want to introduce additional variables in the mean and the variance equation as well. For example, if you want to introduce news based variable in the variance. Any help is welcome.
Relevant answer
Answer
Preferred packages to conduct a bivariate GARCH analysis are,
  • rmgarch
  • mgarchBEKK
  • ccgarch
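On the external regressors, rugarch (and rmgarch on top of it) exposes external.regressors slots in both the mean and the variance specification; a rough sketch follows, with object names as placeholders - please check the exact arguments against the package documentation:
library(rugarch)
library(rmgarch)
# 'news' = numeric matrix of the news-based variable, 'returns' = two return series
uspec <- ugarchspec(
  mean.model     = list(armaOrder = c(1, 0), external.regressors = news),
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1),
                        external.regressors = news))
dcc_spec <- dccspec(uspec = multispec(replicate(2, uspec)),
                    dccOrder = c(1, 1), distribution = "mvnorm")
fit <- dccfit(dcc_spec, data = returns)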
  • asked a question related to R
Question
13 answers
Hello, I am looking for an advice regarding my experimental design and what statistics I should do.
My experimental design:
I have 3 cell lines (A, B, C) and from each I do 3-4 differentiations into cardiac myocytes (A1-A4, B1-B4, C1-C3). Then each differentiation is treated with the same 3 concentrations of a chemical (T1, T2, T3). And for each treatment, I measure the calcium concentration (y). So I have one continuous dependent variable (y), two categorical independent variables (x1 for cell line and x2 for treatment) and a random error which I want to correct for (e for differentiation).
I want to investigate the impact of the cell line (x1, main effect) on the cell response to the treatment (x2).
  1. I understand that e is nested under x1 but what about x2 ?
  2. Then I am not sure how to translate this in a correct formula for the aov() function in R. I am tempted to use aov( y ~ x1*x2 + Error(e)) but this does not account for the fact that e is nested under x1. Does it matter ?
  3. I don't understand how to interpret the different p-values, which one should I look for to answer my research question?
  4. Can I run a normal Tukey test for post-hoc multiple comparisons by pooling the differentiations (getting rid of the error source) or is there a way to correct for it without including it as another variable ?
Many thanks in advance !
Relevant answer
Answer
Sal Mangiafico you are right, but this seems to be a dummy coded variable with 3 levels. Therefore, the differences/slopes are the same, but the first model uses uncorrected test statistics to compare slopes/differences, whereas the "post hoc" test uses Holm adjusted p-values. Camille Charriere maybe it is possible to change the correction option to "none"? Then it should be the same.
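One way to avoid the aov()/Error() bookkeeping altogether is a mixed model; here is a hedged sketch with lme4/lmerTest and emmeans, using the variable names from the question (y, x1 = cell line, x2 = treatment, e = differentiation):
library(lme4)
library(lmerTest)   # adds F-tests / p-values for lmer models
library(emmeans)
# random intercept for differentiation; x1:e makes the nesting in cell line explicit
# (if the differentiation labels A1, B2, ... are already unique, (1 | e) is enough)
fit <- lmer(y ~ x1 * x2 + (1 | x1:e), data = dat)
anova(fit)   # the x1:x2 interaction tests whether cell lines respond differently to treatment
# Tukey-style post-hoc comparisons of cell lines within each treatment
emmeans(fit, pairwise ~ x1 | x2, adjust = "tukey")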
  • asked a question related to R
Question
4 answers
I need to extract the x,y coordinates of a PCA plot (generated in R) to plot into excel (my boss prefers excel)
The code to generate the PCA:
pca <- prcomp(data, scale=T, center=T)
autoplot(pca, label=T)
If we take a look at pca$x, the first two PC scores for an example point (sample 29) are:
29. 3.969599e+01 6.311406e+01
So for sample 29, the PC scores are 39.69599 and 63.11406.
However if you look at the output plot in R, the coordinates are not 39.69599 and 63.11406 but ~0.09 ~0.2.
Obviously some simple algebra can estimate how the PC scores are converted into the plotted coordinates but I can't do this for ~80 samples.
Can someone please shed some light on how R gets these coordinates and maybe a location to a mystery coordinate file or a simple command to generate a plotted data matrix?
NOTE: pca$x does not give me what I want
Update:
Redoing prcomp() without scale and center gives me this for PC1 and PC2 for the first 5 samples
1 -8.9825883 0.0113775
2 -16.3018548 9.1766104
3 -21.0626458 3.0629666
4 5.5305875 4.0334291
5 0.2349433 12.4872609
However the plot ranges from -0.15 to 0.4 for PC1 and -0.35 to 0.15 for PC2
(Plot attached)
Relevant answer
Answer
Conscious my comment is hardly timely, but I believe the issue might be that some visualisation devices in R happen to "scale" the relevant PCA scores in the background. With this I mean the issue might be autoplot(), or other visualisation facilities - not the PCA you performed, per se.
Take for example the function biplot(), which is readily available in base R to visualise objects generated by the function prcomp(). If you look at how the function is coded (stats:::biplot.prcomp) you'll see that it divides the first two PCA scores by their standard deviation i.e.:
scores <- pca$x
lam <- pca$sdev[1:2]
pca_plot_coord <- t(t(scores[,1:2])/lam)
(Notice that pca$sdev is the same as taking the square root of the relevant eigenvalues of the covariance matrix of your centred data, as they happen to equal the variance along the corresponding scores, if the computational structure of PCA is followed correctly. Fun fact, this equivalence won't necessarily hold if you use some canned R routines for PCA that rely on the singular value decomposition of the data matrix instead of manipulating its covariance matrix).
So in a nutshell, as previous comments have already pointed out, what you're actually interested in are the pca "scores" (pca$x); however some visualisation facilities in R, such as biplot(), might do some scaling in the background.
If you wanted to make your life simpler you could just use PCA() in the package FactoMineR and then plot() the resulting object: as far as I can tell, the plotted result is not manipulated further and what you get in the plot is based on the PCA scores, as you'd expect.
  • asked a question related to R
Question
3 answers
Hi everyone, I am doing Meta-analysis of mediation using the structural equation model in R (the package I will use is “metasem”). May I ask if anybody has experience in doing this type of analysis? I have found a guide to follow but I do not know how to import data with the correct format to do such an analysis.... I would highly appreciate it if you could give me any advice!
Relevant answer
Answer
Conducting a meta-analysis of mediation using structural equation modeling (SEM) in R can be a complex but rewarding task. The "metasem" package is a valuable resource for such analyses. Here are some additional thoughts and suggestions to help you with the process:
  1. Data Preparation:Ensure that your data is in the correct format for SEM analysis. Typically, this involves having data on effect sizes, standard errors, and other relevant statistics for each study. Consult the guide you provided to understand the required data format.
  2. Understanding the SEM Model:Before diving into the analysis, make sure you have a clear understanding of the structural equation model you are specifying. Familiarize yourself with the theoretical framework and the paths you are investigating in terms of mediation.
  3. Check for Publication Bias:Consider assessing and addressing publication bias in your meta-analysis. The guide you provided may cover this aspect, but be sure to explore methods like funnel plots or statistical tests for asymmetry to detect potential bias.
  4. Implementation in R:Follow the step-by-step instructions in the guide carefully. Pay close attention to syntax and parameterization in the "metasem" package.
  5. Data Import: The guide may not explicitly cover data import, but typically you would use functions like read.csv() or read.table() in R to import your data from a CSV or text file. Ensure that your data is correctly formatted with the necessary columns. Example: my_data <- read.csv("your_data_file.csv")
  6. Data Exploration:Before conducting the meta-analysis, explore your data using summary statistics, visualizations, and correlation matrices to ensure there are no unexpected issues or outliers.
  7. Model Modification:Be prepared to iteratively modify your SEM model based on the fit statistics and modifications indices provided by the "metasem" package. This may involve adding or removing paths, covariances, or latent variables to improve model fit.
  8. Diagnostic Checks:After running the meta-analysis, conduct diagnostic checks on the SEM models. This includes assessing goodness-of-fit statistics, standardized residuals, and other diagnostic measures.
  9. Documentation and Reporting:Clearly document your analysis steps, including model specifications, modifications made, and any sensitivity analyses performed. Transparent reporting is crucial for the reproducibility and reliability of your meta-analysis.
  10. Seeking Help:If you encounter specific issues or have questions about the "metasem" package, consider seeking help from the R community, such as posting questions on forums like Stack Overflow or the R-sig-meta-analysis mailing list.
Remember that conducting a meta-analysis, especially involving complex statistical methods like SEM, can be challenging. Take the time to thoroughly understand each step of the process and seek help when needed. Additionally, make use of the resources provided in the "metasem" package documentation and consider reaching out to the package authors for guidance if necessary.
  • asked a question related to R
Question
1 answer
I could not find any function in Rchoice that calculates average marginal effects (AME) of the significant variables on the binary outcome for a random parameters logit model; I only found effect() for heteroskedastic binary models and IV probit models. The same holds for 'marginaleffects' (https://marginaleffects.com). I also tried to incorporate 'margins' (https://cran.r-project.org/web/packages/margins/index.html) but got this error:
Error in x$terms %||% attr(x, "terms") %||% stop("no terms component nor attribute") : no terms component nor attribute
My dataset contains 11 independent variables (10 dummy, 1 continuous); 3 IVs are assumed to be random parameters. summary() and modelsummary() both worked on my model. Any suggestion is really appreciated.
I am yet to finish writing the discussion part of my Master's thesis due to this issue (just to let you know how important it is for me to find a solution). Thanks in advance.
Relevant answer
Answer
To estimate marginal effects using the Rchoice package in R, you can follow these steps:
1. Install the Rchoice package in your R environment.
2. Load the Rchoice package using the `library` function.
3. Specify your probit or logit model using the appropriate functions from the Rchoice package.
4. Estimate the model using the `modelname <- choicemodel(formula, data) ` syntax, where `modelname` is a name you choose for your model, `formula` is the formula specifying the model equation, and `data` is the dataset containing your variables.
5. Extract the estimated model parameters using the `coef(modelname)` function.
6. Use the `mfx()` function in the Rchoice package to calculate the marginal effects.
  • asked a question related to R
Question
1 answer
Good morning everyone! I've just finished reading Shyon Baumann's paper on "Intellectualization and Art World Development: Film in the United States." This excellent paper includes a substantial section of textual analysis where various film reviews are examined. These reviews are considered a fundamental space for the artistic legitimation of films, which, during the 1960s, increasingly gained artistic value. To achieve this, Baumann focuses on two dimensions: critical devices and lexical enrichment. The paper is a bit dated, and the methodologies used can be traced back to a time when text analysis tools were not as widespread or advanced. On the other hand, they are not as advanced yet. The question is: are you aware of literature/methodologies that could provide insights to extend Baumann's work using modern text analysis technologies?
In particular, following the dimensions analyzed by Baumann:
a) CHANGING LANGUAGE
  • Techniques for the formation of artistic dictionaries that can replace the manual construction of dictionaries for artistic vocabulary (Baumann reviews a series of artistic writings and extracts terms, which are then searched in film reviews). Is it possible to do this automatically?
b) CHANGING CRITICAL DEVICES
  1. Positive and negative commentary -> I believe tools capable of performing sentiment analysis can be successfully applied to this dimension. Are you aware of any similar work?
  2. Director is named -> forming a giant dictionary of directors might work. But what about the rest of the crew who worked on the film? Is there a way to automate the collection of information on people involved in films?
  3. Comparison of directors -> Once point 2, which is more feasible, is done, how to recognize when specific individuals are being discussed? Does any tool exist?
  4. Comparison of films -> Similar to point 3.
  5. Film is interpreted -> How to understand when a film is being interpreted? What dimensions of the text could provide information in this regard? The problem is similar for all the following dimensions:
  6. Merit in failure
  7. Art vs. entertainment
  8. Too easy to enjoy
Expanding methods in the direction of automation would allow observing changes in larger samples of textual sources, deepening our understanding of certain historical events. The data could go more in-depth, providing a significant advantage for those who want to view certain artistic phenomena in the context of collective action.
Thank you in advance!
Relevant answer
Answer
I appreciate you raising this insightful question on how to leverage modern text analysis methods to build on Baumann's foundational work examining artistic legitimation. Automated techniques can certainly help scale such textual analysis to larger corpora. However, care must be taken to ensure computational approaches do not lose the nuance of manual qualitative interpretation.
Regarding building artistic dictionaries, word embedding models like word2vec can help uncover semantic relationships and suggest terms related to a seed vocabulary. However, human validation is still important before applying these dictionaries to make inferences.
For sentiment analysis, deep learning approaches like BERT have shown promise, but domain-specific tuning and qualitative checks are key to account for the complex expressions in artistic reviews. Models pre-trained on social media may not transfer well.
To identify creators, named entity recognition using dictionaries, rules, and ML approaches can help. However disambiguation remains challenging, so human-in-the-loop verification is recommended before making claims about individuals.
Overall, I believe the best approach is applying computational methods as a starting point, but having experts qualitatively analyze a sample of results to catch subtleties these tools may miss. If used prudently and in collaboration with scholars like yourself, text mining can uncover exciting new insights at scale. Please feel free to reach out to discuss further.
Wishing you the very best,
#textanalysis #digitalhumanities #mixedmethods
  • asked a question related to R
Question
3 answers
I have a large dataset containing many dates per participant. I want to create a new variable containing the number of unique dates per participant. I have been trying to figure this out but it does not work. An example of what the dataset looks like is shown below:
participant 1 - date.1 - date.2 - date.3 - date.4 ... date.412
participant 2 - date.1 - date.2 - date.3 - date.4 ... date.412
Dates are in the format: 31-oct-20
Some participants have only 10 dates, others have 412 dates
Any idea how I can create this new variable containing the number of unique dates?
Relevant answer
Answer
In general, you can sort the date variable, then create a variable that is 1 if the date is not the same as the preceding observation and zero otherwise.
. sort date
. gen unique=date!=date[_n-1]
would be the Stata code. Note that the very first observation has no preceding observation and will have to be tidied by hand. In Stata you could write
. gen unique=date!=date[_n-1] | (_n==1)
which is a kludge but it works.
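Since this topic is about R, here is an equivalent sketch for the wide-format data frame described in the question (the column-name pattern and date format are assumptions based on the example):
# dat: one row per participant, date columns date.1 ... date.412 stored as text like "31-oct-20"
date_cols <- grep("^date\\.", names(dat))
dat$n_unique_dates <- apply(dat[, date_cols], 1, function(x) {
  d <- as.Date(x, format = "%d-%b-%y")   # month abbreviations are locale-dependent
  length(unique(na.omit(d)))
})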
  • asked a question related to R
Question
8 answers
For our database of odor responses (http://neuro.uni.kn/DoOR) I received data that uses the different chemical identifiers named above. I am looking for an easy way to look up/convert these into one another.
I already found the ChemCell macro (http://depth-first.com/articles/2010/11/01/chemcell-easily-convert-names-and-cas-numbers-to-chemical-structures-in-excel/) but would be happy to have some R code doing the work.
Relevant answer
Answer
Hi! You may try the R package "chem" and convert CAS numbers to SMILES using the function "cas_to_smiles".
R code
# install.packages("remotes")
# remotes::install_github("harveyl888/chem")
library(chem)
library(tcltk)
cas<-readxl::read_xlsx('XXXXXXXXXXX.xlsx')# cas put in col1
cas<-cbind(cas,cas)
u <- 1:nrow(cas)
pb <- tkProgressBar("progress","done %", 0, 100)
for (i in u) {
cas[i,2]<-cas_to_smiles(cas[i,1])
info <- sprintf("done %d%%", round(i*100/length(u)))
setTkProgressBar(pb, i*100/length(u), sprintf("progress (%s)", info), info)
}
  • asked a question related to R
Question
24 answers
I just received the latest TOC alert for Behavior Research Methods, and this article caught my eye:
I've not had time to read it yet, but judging from a quick glance, I wonder if the main "problem" might be that users do not always take time to RTFM* and therefore, do not understand what their software is doing? In any case, I thought some members of this forum might be interested.
Cheers,
Bruce
* RTFM = Read The Fine Manual ;-)
Relevant answer
Answer
Uzair Essa Kori, I repeat my earlier question, which you have not yet addressed: Why in the world did you not say clearly at the top of your post that it was generated using AI? Do you not think that would be the honest and ethical thing to do?
  • asked a question related to R
Question
6 answers
I want to open a dataset of gene-trait association using:
data = read.table(file.choose(), header = T)
attach(data)
View(data)
But instead of the column names (510 traits), these headers are shown:
na, na.1, na.2, ..., na.509
Please find the attached data (ethical point: this data is free access from https://maayanlab.cloud/Harmonizome/dataset/dbGAP+Gene-Trait+Associations)
Thanks
Relevant answer
Answer
Source: Artificial Intelligence Tools
There are a few reasons why column names might be converted to "na.xx" when reading headers in R.
One reason is that the data source does not actually contain headers in the first row. If this is the case, you can use the read.csv() function with the header argument set to FALSE.
Another reason is that the headers are not formatted correctly. R expects the headers to be on a single line, separated by commas or whitespace. If the headers are not formatted correctly, R may assign default names like "na.xx" to the columns.
Finally, it is possible that there are special characters or problematic characters in the header names. R may struggle to read these characters, which can cause the header names to be converted to "na.xx".
Here are some tips for resolving the issue of column names being converted to "na.xx" when reading headers in R:
  • Check the data source to ensure that it actually contains headers in the first row.
  • Use the read.csv() function with the header argument set to TRUE if the data source does contain headers in the first row.
  • Check the formatting of the headers to ensure that they are on a single line, separated by commas or whitespace.
  • Remove any special characters or problematic characters from the header names.
If you are still having trouble reading the headers in your data, you can try using the readLines() function to read the first row of the data file and then use the names() function to assign the headers to the columns.
Here is an example of how to do this
# Read the first row of the data file and split it into the column names
header_line <- readLines("my_data.csv", n = 1)
headers <- strsplit(header_line, ",")[[1]]
# Read the rest of the data file, skipping the header row
data <- read.csv("my_data.csv", header = FALSE, skip = 1)
# Assign the headers to the columns
names(data) <- headers
This should ensure that the column names are read correctly, even if they contain special characters or problematic characters.
  • asked a question related to R
Question
3 answers
Belkin and O'Reilly described in 2008 in "An algorithm for oceanic front detection in chlorophyll and SST satellite imagery" how to detect oceanographic fronts based on satellite imagery.
I also found the reference to the "NOAA CoastWatch Chlorophyll Frontal Product from MODIS/Aqua". But as I understood, this product is only available for the United States area.
I am looking for a product, e.g. as netcdf file from Copernicus, NASA etc. which offers already detected chlorophyll fronts for the area of the Patagonian Shelf, but couldn't find one so far.
If there is none, is there a feasible way to program the algorithm myself (e.g. in R), using processed (L4) chlorophyll data obtained for the region?
Many thanks for any suggestions.
Relevant answer
Answer
  1. You can access remote sensing products for chlorophyll data from organizations like NASA's OceanColor or Copernicus Marine Environment Monitoring Service, which provide pre-processed data for detecting chlorophyll fronts in ocean ecosystems.
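If no ready-made product covers the Patagonian Shelf, a simplified gradient-based front index (not the full Belkin and O'Reilly histogram algorithm) can be coded in a few lines with terra on an L4 chlorophyll grid; the file name and the threshold are placeholders:
library(terra)
chl <- rast("chl_L4_patagonian_shelf.nc")   # hypothetical L4 chlorophyll layer
logchl <- log10(chl)                        # fronts are usually detected on log-chlorophyll
# Sobel-type gradients via 3x3 focal filters
gx <- focal(logchl, w = matrix(c(-1, 0, 1, -2, 0, 2, -1, 0, 1), nrow = 3))
gy <- focal(logchl, w = matrix(c(-1, -2, -1, 0, 0, 0, 1, 2, 1), nrow = 3))
grad <- sqrt(gx^2 + gy^2)
fronts <- grad > quantile(values(grad), 0.95, na.rm = TRUE)   # crude percentile threshold
plot(fronts)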
  • asked a question related to R
Question
4 answers
Hi! I have a dataset with a list of sponge species (Porifera) and the number of specimens found for each species at three different sites. I attach a sample from my dataset. Which test should I use to compare the three sites, showing both which species were found at each site and their abundance? I was also thinking of a visual representation showing just the difference between sites in terms of species diversity (not abundance), so that it is possible to see which species occurred at only one site and which occurred at more than one. For this last purpose I thought about doing an MDS, but I am not sure whether it is the right method, how to do it in R, or how to set up the dataset. Can you help me find a script that also shows the required shape of the dataset? Any advice in general would be great. Thank you!
Relevant answer
Answer
Hi,
I wonder why you would like to ignore the abundance information?
Based on a species-by-site abundance matrix, you could calculate a dissimilarity matrix (if the abundance data should be considered, I would use Bray-Curtis dissimilarity) and conduct a Mantel test between the resulting dissimilarity matrices to test for correlations between the species compositions of the three sites.
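In R, the vegan package covers both the abundance-based and the presence/absence view; a minimal sketch, assuming a community matrix comm with sites as rows and species as columns (with only three sites the ordination itself is of limited value, but the code illustrates the workflow):
library(vegan)
bray <- vegdist(comm, method = "bray")                     # abundance-based dissimilarity
jac  <- vegdist(comm, method = "jaccard", binary = TRUE)   # presence/absence only
# non-metric MDS for the visual comparison of the sites
nmds <- metaMDS(comm, distance = "bray", k = 2)
plot(nmds, type = "t")   # plots site and species labels together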
  • asked a question related to R
Question
3 answers
I want to know whether the types of litter found differ significantly between sites. I have 7 stations, with polymer IDs for the litter from each station. Can I compare two categorical variables (stations, n=7; polymer type, n=11)?
Can anyone share some advice on what stats to use and any code in R.
Relevant answer
Answer
We can use short-cut formulas, such as Kimball's formula, for partitioning the overall chi-square value in the case of a 2×c table.
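In R, a first pass could be a simple contingency-table test; a sketch assuming a long-format data frame litter with one row per item (the columns station and polymer are placeholders):
tab <- table(litter$station, litter$polymer)   # 7 x 11 contingency table
chisq.test(tab)                                # expected counts may be small here
# Monte-Carlo / exact alternative when many cells have low counts
fisher.test(tab, simulate.p.value = TRUE, B = 10000)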
  • asked a question related to R
Question
5 answers
I have a question about what statistical tests I can do in R that would be most suited to analyse my enzymatic assay results.
My data
Dependent variable: Absorbance (its’ a colorimetric assay)
Factor 1: Protein (type of protein, wt and several mutants)
Factor 2: Substrate (several substrates)
For each assay, (e.g. Prot1 with Substr1) I have 3 – 6 data points (repeats), and it is not feasible to obtain more.
An example of my data would be something like fig1.
What I want to do:
1. Test if the different mutations have a different effect on the enzymes’ activity with the different substrates (basically test the dependence of Absorbance on the interaction between Protein and Substrate: Abs~Protein*Substrate)
Visually, if I plot my data on a bar chart (mean +/-SD), it appears to be the case, but I need to verify that what I see is significant.
Normally, I would do this with a two-way ANOVA, however:
my data is not normally distributed (according to Q-Q plot and skewness test, I have over-dispersed residuals (Laplace distribution), without any skew);
the variance is not homogenous (standardized residuals vs fitted values plot shows heteroskedasticity)
What sort of model could I use instead? Is there a way I can transform my data to allow a parametric test (the only transformations I found were against skewness, which I do not have)?
2. Test for each substrate, which mutations make the activity differ significantly from the wt (e.g. whether activity with Substr2 is significantly different for Prot2 and 3 from that for Prot1)
To avoid doing multiple pair-wise comparisons, I would normally do a pairwise.t.test with a Holm family-wise error rate correction.
For non-normally distributed data of non-equal variance, I saw it is recommended to do a Pairwise Wilcoxon rank sum test.
If I group my data by substrate, with one of the substrates (e.g. Substr2) it was normally distributed and of equal variance. I did the pairwise T test and it gave results consistent with what is seen on the graph. However, when I tried a Pairwise Wilcoxon rank sum test (holm correction) it showed no significant difference between any of the Proteins, which makes no sense (e.g. that there is no significant difference between Prot1 and 2, although one has an activity with the substrate and the other doesn’t). So it looks like the non-parametric test may not be powerful enough/ at all useful.
For the other substrates, either the distribution is not normal (again over-dispersed, Laplace distribution, with no skew), or the distribution is normal, but the variance is not equal (heteroskedasticity).
Just to note, one-way ANOVA or Kruskal-Wallis rank sum tests (where applicable) for Absorbance~Protein for each substrate individually, showed there is a significant variation in Absorbance depending on Protein.
What sort of pairwise comparison tests can I do in these cases? Or, again, how can I transform my data to be able to use a pairwise.t.test?
Thank you in advance!
I really appreciate any advice!
Relevant answer
Answer
glm(Absorbance ~ Protein * Substrate, data = Data, family = quasipoisson())
Or
lm(log(Absorbance) ~ Protein * Substrate, data = Data)
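Building on that suggestion, the per-substrate comparisons of each mutant against the wild type (part 2 of the question) could then be obtained from the fitted model with emmeans; a hedged sketch, assuming the wild type is the first level of Protein:
library(emmeans)
fit <- glm(Absorbance ~ Protein * Substrate, data = Data, family = quasipoisson())
emm <- emmeans(fit, ~ Protein | Substrate)
contrast(emm, "trt.vs.ctrl", ref = 1)   # each mutant vs. the wild type, within each substrate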
  • asked a question related to R
Question
1 answer
I have an unusual question: I am working on a Erasmus internship project with Drosophila mutants at 2 different timepoints and with WT, KO and KI condition. A company analyzed the data using DESeq2 and I have only got loads of PDFs and the results_apeglm.xlsx file.
This contains: Transcripts per million for each gene, replicate and timepoint with the comparison for looking at DEGs - so I have a padj and log2FC value. A snippet is attached as an example.
I now want to construct a graph and clustering where genes that are going in changing directions between WT and KO over time become visible out of the hundreds of candidate DEGs. With this I want to narrow down the long list to make it verifiable with qPCR and serve as a marker for transformation from presymptomatic to symptomatic.
I am setting up my analysis in R and want to use the degPatterns() function from DEGreport, as it gives a nice visual output and clusters the genes for me.
How can I now transform my Excel sheets into a matrix format that I can use with degPatterns()? The example with the SummarizedExperiment given in the vignette is not really helpful to me, sadly.
Thank you all for reading, pondering and helping with my question! I would be very happy if there´s a way to solve my data wrangling issue.
All the best,
Paul
Relevant answer
Answer
Hi, what exactly do you need? A matrix from an Excel sheet? Then simply read the Excel file using
as.matrix(readxl::read_excel(your_file_location))
You will need to remove a few columns and then match the remaining columns to your metadata.
And if the degPatterns() function is not working properly, you may need to clean and re-transform your data.
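A rough sketch of that wrangling step (the sheet layout, the column-name patterns, and the metadata parsing are assumptions; adjust the regular expressions to your actual sample names):
library(readxl)
library(DEGreport)
expr <- read_excel("results_apeglm.xlsx")          # hypothetical sheet with gene IDs + TPM columns
tpm  <- expr[, grep("^TPM_", names(expr))]         # keep only the TPM columns (drop padj / log2FC)
ma <- as.matrix(tpm)
rownames(ma) <- expr[[1]]                          # gene IDs assumed to be in the first column
meta <- data.frame(row.names = colnames(ma),
                   genotype  = sub(".*(WT|KO|KI).*", "\\1", colnames(ma)),
                   timepoint = sub(".*(T[0-9]).*",   "\\1", colnames(ma)))
clusters <- degPatterns(ma, metadata = meta, time = "timepoint", col = "genotype")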
  • asked a question related to R
Question
2 answers
Hi everyone,
I would like your opinion on using the complete() function of the mice package on R.
Theoretically, after multiple imputations, analyses should be performed on each imputed dataset, and then the results of the analyses should be pooled (see attached diagram).
However, I consider the complete() function. In summary, it permits generating a unique final dataset using the results of multiple imputations previously performed with the mice() function. This strategy is easy, "inexpensive," and allows us to manipulate only one dataset.
This is a concrete example of the usefulness of this strategy. I am conducting mixed-methods research in which I want to interview some participants after analyzing their responses to my survey. If my respondent John Doe did not answer to an item of a scale, I would risk having 5 plausible answers from John Doe after multiple imputations (if m=5, or 20 plausible responses if m=20, etc.). However, the complete() function will summarize the different estimates into one dataset (instead of 5, or 20, etc.). Basically, during an interview, I will be able to question John Doe based on his scale score computed with NA replacement. So I lose precision, but gain in ability to exploit the answers.
However, this approach seems problematic, as the literature does not support it well. In fact, except for this paper by van Buuren et al. (2011, cf. section 5.2), I cannot find any source that supports this approach:
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. https://doi.org/10.18637/jss.v045.i03
Well, I'm stuck between a rock (a more rigorous approach, i.e. the pooling) and a hard place (a more practical approach, i.e. the complete() function). What do you think?
Hope to read you (and my apologies for my broken English)
FM
Relevant answer
Answer
I believe that "complete" just extracts the imputed data sets, and if you don't specify further, it will give you the first imputed set.
So it seems obvious that you shouldn't use this function and perform analysis or present descriptive statistics just based on this one data set, as you will drastically (depending on the amount of uncertainty in the imputations) underestimate standard errors and with that confidence intervals and p-values. Basically then you would be doing a weird kind of single imputation instead of multiple imputation.
So: Don't.
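For the main analyses themselves, the standard pooled workflow is only a few lines, so the practical cost is low; a generic sketch (the model and variable names are placeholders):
library(mice)
imp <- mice(data, m = 20, seed = 123)                     # multiple imputation
fit <- with(imp, lm(outcome ~ predictor1 + predictor2))   # analyse each imputed data set
pool(fit)                                                 # pool the results with Rubin's rules
# complete(imp, 1) would instead return only the first imputed data set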
  • asked a question related to R
Question
4 answers
When writing code, the kable() function does not display its output inside a for loop.
Relevant answer
Answer
When wrapping the kable() command in print(), avoid using format = "latex". Instead, even when producing LaTeX output, don't specify any format and add {r, results='asis'} to the chunk header. This should work properly.
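A minimal sketch of the loop itself (assuming a list of data frames called tables):
# chunk header: {r, results='asis'}
for (i in seq_along(tables)) {
  print(knitr::kable(tables[[i]]))
  cat("\n\n")   # blank line so consecutive tables render separately
}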
  • asked a question related to R
Question
3 answers
I am capturing fluorescent microscopy images of a developmental process with a high degree of variability. I need to compile these images into a "gallery" so that I can see many/all images at once to look for possible phenotypic differences between conditions, which then I can measure through FIJI. The most basic way I can do this is by arranging the images in a Powerpoint, but this method feels manual and has some limitations (explained below). I am hoping someone may know of a better way to compile image data for investigation/curation, maybe through FIJI, R, or a different dedicated program.
Limitations of Powerpoint:
1. Cannot track sample group or ID: When you upload an image into Powerpoint, there is no way to check its filename. So I am forced to upload one image at a time (and make a text box noting condition and ID), which is slow and unideal for batch processing images.
2. Re-formatting and arranging is manual: Re-sizing, cropping, and positioning images on Powerpoint is very manual and there is no record of changes (so making scale bars is hard). The effort compounded when the need arises to re-size or re-position the images later in the process (see #3).
3. Cannot re-order images dynamically based on values: Because the process I am studying is highly variable, it is necessary to compare images at the same percentile between conditions. Thus, I would like to re-order the images by measured values (so that I can compare min to min, mid to mid, max to max, etc.). This is also useful for selecting representative images for presentations/papers.
4. Interactive/Shiny Graph: To further aid in comparing images based on values, it would be nice to have an interactive/shiny scatterplot of the measured values such that when you hover over/select an image the corresponding datapoint in marked in the graph.
Partial Solutions:
A. Keynote: Solves #1 because uploaded images retain the filename information. However, the file format is limited to Macs which limits data sharing.
B. QuickFigures: This FIJI plugin solves #1 and #2, but it is still clunky to use and cannot re-order images.
C. Image Data Explorer: This R package or browser program solves #4 but can only view one image at a time. It also has the advantage of easy annotation of the images by entering values into the corresponding data spreadsheet/dataframe.
Does anyone know of a way/program that solves any of these limitations? Or does anyone have a different way of compiling image data?
Relevant answer
Answer
Hello Nathan,
It may not be easy, but it is possible. You can rename the files according to the parameters (or calculated values) programmatically, rather than manually, in MATLAB or Python, and you can merge labels into them. You can then arrange the images in Adobe Illustrator instead of PowerPoint. There may be better ways, but that is what would come to mind if I had the same issue; a rough R sketch of the renaming idea follows below.
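Since the rest of the thread uses R, here is a hedged R equivalent of that renaming idea (the file paths and the columns filename, condition and value are assumptions): prefix each image file with its within-condition rank on the measured value so the files sort by phenotype.
meta <- read.csv("measurements.csv")   # assumed columns: filename, condition, value
meta <- meta[order(meta$condition, meta$value), ]
meta$rank <- ave(meta$value, meta$condition,
                 FUN = function(v) rank(v, ties.method = "first"))

new_names <- sprintf("%s_rank%02d_%s", meta$condition, as.integer(meta$rank), meta$filename)
file.rename(file.path("images", meta$filename),
            file.path("images", new_names))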
Good luck!
Grisha
  • asked a question related to R
Question
1 answer
In two structural equation models, I am comparing the association of two different psychopathological measures with emotion regulation. Emotion regulation is a latent variable with 4 manifest variables that construct it (let's call them A, B, C, D), which are the same in both models. In model 1, A and C have positive signs while B and D have negative signs. In model 2, B and D have positive signs while A and C are negative. I would like this to be consistent in both models, as currently model 1 reflects "efficiency of emotion regulation" and model 2 reflects "inefficiency of emotion regulation". This will surely confuse reviewers and readers...
Relevant answer
Answer
You could try assigning user-defined starting values to the factor loadings that have the desired sign. In the lavaan package, this is done as described on the website below:
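In the meantime, a minimal lavaan sketch of that idea (an assumption-based illustration, not the linked tutorial; the indicator names follow the question and dat is a placeholder for your data): giving B and D negative starting values nudges both models toward the same sign pattern.
library(lavaan)

model <- '
  ER =~ A + start(-0.5)*B + C + start(-0.5)*D
'
fit <- cfa(model, data = dat)
summary(fit, standardized = TRUE)

Note that flipping the sign of all loadings on a factor leaves the model fit unchanged; the orientation is arbitrary, so anchoring it (via starting values, or by choosing a marker indicator expected to load positively) is purely for interpretability.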
  • asked a question related to R
Question
5 answers
For my thesis work, I have to deal with Multivariate multiple regression, while in my studies I only have dealt with multivariate regression (one regressand and multiple regressors). Now I have multiple response variables (Y) which are continuous.
My understanding is that although lm() handles vector Y's on the left-hand side of the model formula, it really just fits m separate lm models.
---------------------
To @Rainer Duesing
In my work, I'm dealing with a functional response so I wrote the response data on a common finite grid and now what I need to predict is the row-stacked matrix Y.
Relevant answer
Answer
There are several R packages for multivariate multiple regression. Some of them are:
  1. MGLM
  2. mcglm.
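If the responses are continuous and you mainly need joint estimation and tests, base R's lm() with a matrix response is often enough; a quick sketch on a built-in data set:
# one equation per response column; coef() returns a matrix of coefficients
fit <- lm(cbind(mpg, disp) ~ wt + hp, data = mtcars)
summary(fit)    # separate summaries per response
coef(fit)       # coefficient matrix (one column per response)
anova(fit)      # multivariate (MANOVA-type) tests across the responses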
  • asked a question related to R
Question
10 answers
Looking for R package(s) in the field of soil erosion/sediment estimation and analysis.
Any comment or hint would be welcome.
Relevant answer
Answer
Yes, there are R packages available for modeling soil erosion and sediment yield. One such package is "RUSLE2R" (Revised Universal Soil Loss Equation 2 - R). RUSLE2R is an R implementation of the Revised Universal Soil Loss Equation 2, which is a widely used empirical model for estimating soil erosion. The package allows users to calculate soil erosion based on factors such as rainfall, soil erodibility, slope, land cover, and erosion control practices.
Here's how you can install and use the "RUSLE2R" package in R:
  1. Install the package from CRAN: install.packages("RUSLE2R")
  2. Load the package: library(RUSLE2R)
  3. Use the functions in the package to estimate soil erosion. For example, the RUSLE2 function can be used to calculate soil erosion using the Revised Universal Soil Loss Equation 2:
# Example data
rainfall <- c(1000, 800, 1200, 900, 1100) # Rainfall (mm)
slope <- c(5, 10, 8, 15, 12) # Slope gradient (%)
land_cover <- c("Cultivated", "Grassland", "Forest", "Urban", "Bare soil") # Land cover types
# Soil erosion calculation
result <- RUSLE2(rainfall, slope, land_cover)
print(result)
  • asked a question related to R
Question
1 answer
Hi all, I have scRNA data generated on the BD Rhapsody platform and analysed on the Seven Bridges platform. Could you please give me an idea how to deal with the Seven Bridges output files for Seurat scRNA analysis in R? Mainly I need the Filtered Feature, Barcodes, and Matrix files for the analysis.
Relevant answer
Answer
You would have obtained MolsPerCell.csv and Samples_Tags, among other files, from Seven Bridges.
Just follow the script given at this link; it should do the job.
Let me know if you need more clarification.
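If it helps, here is a hedged sketch of building a Seurat object directly from the MolsPerCell table (the file name, the comment-line handling and the rows-are-cells, counts-only layout are assumptions about the Seven Bridges output; adjust as needed):
library(Seurat)
library(Matrix)

# Rhapsody MolsPerCell tables typically have cells in rows and genes in columns,
# sometimes preceded by '#' comment lines
mols <- read.csv("Combined_MolsPerCell.csv", row.names = 1,
                 check.names = FALSE, comment.char = "#")

counts <- Matrix(t(as.matrix(mols)), sparse = TRUE)   # Seurat expects genes x cells
seu <- CreateSeuratObject(counts = counts, project = "rhapsody",
                          min.cells = 3, min.features = 200)
seu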
  • asked a question related to R
Question
6 answers
Dear colleagues, I need to build a matrix A where there are some values and a lot of zeros.
Thanks a lot for your help
There are the elements ,matrix rows:
a_1=1 0,89 0,75 0,61 0,46 0,35 0,28 0,18 0,08 and after 91 zeros
a_2=0,89 1 0,89 0,75 0,61 0,46 0,35 0,28 0,18 0,08 after 90 zeros
a_3=0,75 0,89 1 0,89 0,75 0,61 0,46 0,35 0,28 0,18 0,08 89 zeros
a_4=0,61 0,75 0,89 1 0,89 0,75 0,61 0,46 0,35 0,28 0,18 0,08 88 z
a_5=0,46 0,61 0,75 0,89 1 0,89 0,75 0,61 0,46 0,35 0,28 0,18 0,08 87 z
a_6=0,35 0,46 0,61 0,75 0,89 1 0,89 0,75 0,61 0,46 0,35 0,28 0,18 0,08 86 z
a_7=0,28 0,35 0,46 0,61 0,75 0,89 1 0,89 0,75 0,61 0,46 0,35 0,28 0,18 0,08 85 z
a_8=0,18 0,28 0,35 0,46 0,61 0,75 0,89 1 0,89 0,75 0,61 0,46 0,35 0,28 0,18 0,08 84 zeros
a_9=0,08 0,18 0,28 0,35 0,46 0,61 0,75 0,89 1 0,89 0,75 0,61 0,46 0,35 0,28 0,18 0,08 83 zeros
Relevant answer
Answer
It is best to have all the vectors of values (a_1, a_2,...) in a list:
a_List <- list(
c(1, 0.89, 0.75, 0.61, 0.46, 0.35, 0.28, 0.18, 0.08),
c(0.89, 1, 0.89, 0.75, ...),
...
)
Each vector in this list can then be supplemented with zeroes to a length of 100:
a_List <- lapply(a_List, function(x) c(x, numeric(100 - length(x))))
All vectors in the list now have the same length of 100. They can be bound row-wise ("rbind") to a matrix:
M <- do.call(rbind, a_List)
M is the desired matrix with 100 columns and as many rows as there are vectors in a_List.
  • asked a question related to R
Question
9 answers
I am currently completing the synthesis for a systematic review on the impact of the use of wireless devices and mental health. The systematic review looks at quantitative studies only. Due to the heterogeneity of outcomes (depression, anxiety, externalizing behaviours etc) and study designs, we have decided not to run a meta-analysis, nor will we produce forest plots. However, I feel that a harvest plot would be an attractive and intelligible method of summarizing our findings, and would complement a narrative synthesis. See below for what I mean by a harvest plot
Here is a great example of what I am trying to produce:
I am very familiar with using R / Python for data visualisation purposes, but I am initially stumped about how I might produce an attractive and aesthetically pleasing plot, short of stodgily moving rectangles around in a word / publisher doc. Can anyone suggest a package / software / website / any method that help me?
Much much much appreciation if you can!
Relevant answer
Answer
It is probably too late for the original poster, but a blog post on how to create a harvest plot in R using ggplot (which I contributed to) is available at https://medium.com/@cxiao94/creating-harvest-plots-in-r-75fb45c8f393 Hopefully the code used there can be adapted to your needs.
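For anyone landing here later, a minimal self-contained ggplot2 sketch of the typical harvest-plot layout (the study data, quality scores and category labels below are entirely made up): one small bar per study, bar height encoding study quality, with rows for outcomes and columns for the direction of effect.
library(ggplot2)

studies <- data.frame(
  outcome   = c("Depression", "Depression", "Anxiety", "Anxiety", "Externalising"),
  direction = c("Harmful", "No effect", "Harmful", "Beneficial", "No effect"),
  study     = c("A", "B", "C", "D", "E"),
  quality   = c(3, 1, 2, 3, 2)
)

ggplot(studies, aes(x = study, y = quality)) +
  geom_col(width = 0.6) +
  facet_grid(outcome ~ direction, switch = "y") +
  labs(x = NULL, y = "Study quality") +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())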
  • asked a question related to R
Question
13 answers
There are many statistics software packages that researchers use in their work. In your opinion, which one is better? Which one would you suggest to start with?
Your opinions and experience can help others, in particular younger researchers, with this choice.
Sincerely
Relevant answer
Answer
SPSS, as my results necessitate analysis using SPSS
  • asked a question related to R
Question
4 answers
I would like to know which R package is good for creating a Taguchi design of experiments and analysing it. What I have seen is that the qualityTools package only creates the design and produces a signal-to-noise ratio graph; it does not do any further analysis beyond that. Kindly guide me in this regard.
Relevant answer
Answer
The package qualityTools provides a function called taguchiDesign that can create a Taguchi design of experiment and perform some analysis such as signal-to-noise ratio and effect plots. However, this package may not support all types of Taguchi designs and analysis.
Another option is to use the package DoE.wrapper, which is a wrapper for several other packages that can create and analyze different types of designs of experiments, including Taguchi designs. This package allows you to use functions from FrF2, DoE.base, rsm and qualityTools with a common interface. You can also use the function Taguchi from the package FrF2 directly to create a Taguchi design.
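Beyond the packages mentioned above, the signal-to-noise ratios themselves are easy to compute and analyse by hand; a hedged sketch (the 9-run layout, the factors A and B and the replicate responses are made up, and the larger-the-better S/N formula is used):
taguchi <- data.frame(
  A  = factor(rep(1:3, each = 3)),
  B  = factor(rep(1:3, times = 3)),
  y1 = c(12, 15, 14, 18, 20, 19, 16, 17, 21),
  y2 = c(13, 14, 15, 17, 21, 18, 15, 18, 20)
)

# larger-the-better: SN = -10 * log10(mean(1 / y^2)) per run
taguchi$SN <- apply(taguchi[, c("y1", "y2")], 1,
                    function(y) -10 * log10(mean(1 / y^2)))

# main-effects analysis of the S/N ratios, going beyond the plain S/N graph
fit_sn <- aov(SN ~ A + B, data = taguchi)
summary(fit_sn)
model.tables(fit_sn, type = "means")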
  • asked a question related to R
Question
5 answers
Dear all,
I'm building a shiny application for analysis of mass-spectrometry based proteomics data (for non-coders), and am working on the section for GO term enrichment.
There are a wide variety of packages available for GO term enrichment in R, and I was wondering which of these tools is most suitable for proteomics data.
The two I'm considering using are https://agotool.org/ which corrects for abundance with PTM data, or STRINGdb which has an enrichment function.
What do you guys recommend?
Best regards,
Sam
Relevant answer
Answer
Hi, I don't know the extent of your shiny application, but you can also use enrichR (in R), which is also good, though it uploads the data to a server during the calculation.
Also, I am curious how you will handle missing IDs during ID mapping. clusterProfiler can map IDs, but there is always some percentage of IDs that go missing.
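A minimal enrichR sketch, for reference (the gene list is made up and the database name is an assumption, check listEnrichrDbs() for current names; note again that the genes are sent to the Enrichr server):
library(enrichR)

genes <- c("HSPA1A", "HSPA5", "DNAJB1", "APP", "PSEN1")

dbs <- c("GO_Biological_Process_2021")
res <- enrichr(genes, dbs)
head(res[["GO_Biological_Process_2021"]])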
  • asked a question related to R
Question
3 answers
From the link https://gtexportal.org/home/datasets, under V7, I'm trying to do R/Python analyses on the Gene TPM and Transcript TPM files. But in these files (and to open them I had to use Universal Viewer since the files are too large to view with an app like NotePad), I'm seeing a bunch of ID's for samples (i.e. GTEX-1117F-0226-SM-5GZZ7), followed by transcript ID's like ENSG00000223972.4, and then a bunch of numbers like 0.02865 (and they take up like 99% of the large files). Can someone help me decipher what the numbers mean, please? And are the numbers supposed to be assigned to a specific sample ID? (The amount of letters far exceed the amount of samples, btw). I tried opening these files as tables in R but I do not think R is categorizing the contents of the file correctly.
Relevant answer
Answer
GTEX-1117F-0226-SM-5GZZ7 is a sample ID, and ENSG00000223972.4 is a versioned Ensembl gene ID (the corresponding gene symbol can be looked up from it, e.g. via biomaRt). The numbers you are referring to are gene expression values. TPM (Transcripts Per Million) is a normalization that scales these expression values so that gene expression can be compared between samples.
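As for reading the files into R: the GTEx TPM files are in GCT format, which has two header lines (a version tag and the table dimensions) before the actual table, so those need to be skipped. A small sketch (the file name is a placeholder; for files this large, data.table::fread with the same skip is much faster):
tpm <- read.delim("GTEx_gene_tpm.gct", skip = 2, check.names = FALSE)

tpm[1:5, 1:5]   # rows = genes (Name/Description columns), remaining columns = sample IDs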
  • asked a question related to R
Question
1 answer
I am already familiar with the process of calculating hedge ratios with linear regression (OLS). I am already running 4 different regressions for calculating hedge ratios between emerging markets and different hedging assets like gold. This is done on in-sample data.
That would look something like this: EM=a+b*GOLD+e
I then construct a portfolio and test the standard deviation of the portfolio and compare that with the non-hedged portfolio of only emerging market equities in the out-sample: R-b*GOLD
However, I want to compare these OLS hedge ratios to conditional hedge ratios from for instance a BEKK GARCH or a DCC GARCH.
I have already tried to work with R and I used the rugarch and rmgarch packages and created a model, modelspec and modelfit, but I do not know how to go from there.
Relevant answer
Answer
Calculating conditional hedge ratios using models like BEKK GARCH or DCC GARCH in R or EViews involves a slightly different approach compared to OLS regression. Here's a general outline of the steps you can follow:
  1. Data Preparation: Ensure that you have your data ready in the appropriate format. Make sure you have the returns or log returns of the relevant variables, such as the emerging markets (EM) and the hedging asset (e.g., GOLD). Ensure that the data is properly aligned and covers the desired time period.
  2. Model Specification: Specify the appropriate conditional volatility model, such as BEKK GARCH or DCC GARCH. In R, the DCC model is handled by the rugarch/rmgarch pair: build the univariate specifications with ugarchspec (e.g., the variance model via variance.model and the error distribution via distribution.model), combine them with multispec, and pass them to dccspec from rmgarch. BEKK is not implemented in rugarch/rmgarch; a dedicated package such as mgarchBEKK can be used instead.
  3. Model Estimation: Estimate the specified model. For DCC GARCH, fit the combined specification with dccfit, providing the matrix of EM and hedging-asset returns; for BEKK, use the estimation function of the BEKK package you chose. The output is a fitted model object that contains the estimated parameters and other relevant information.
  4. Extract Conditional Covariance: Extract the conditional covariance matrices from the fitted model object. For DCC GARCH, applying rcov() directly to the fitted object returns an n x n x T array of conditional covariances (rcor() gives the conditional correlations).
  5. Calculate Conditional Hedge Ratios: Compute the conditional hedge ratios based on the extracted conditional covariance matrix. The conditional hedge ratio can be obtained by dividing the covariance between EM and the hedging asset by the variance of the hedging asset. In the case of BEKK GARCH, you can use the formula: conditional hedge ratio = (covariance EM-GOLD) / (variance GOLD). For DCC GARCH, you can use the formula: conditional hedge ratio = (conditional correlation EM-GOLD) * (conditional standard deviation EM) / (conditional standard deviation GOLD).
  6. Compare Results: Compare the conditional hedge ratios obtained from the GARCH models with the OLS hedge ratios you calculated previously. Assess the differences and evaluate the effectiveness of each approach in terms of risk reduction or portfolio performance.
It's important to note that the specific syntax and function names may vary slightly depending on the version of the packages you are using. Make sure to refer to the package documentation and examples for more detailed guidance.
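To make the DCC part concrete, a hedged sketch with rugarch/rmgarch (the object returns is assumed to be a T x 2 matrix or xts of EM and GOLD returns, in that order):
library(rugarch)
library(rmgarch)

uspec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                    mean.model = list(armaOrder = c(0, 0)))
mspec <- dccspec(uspec = multispec(replicate(2, uspec)),
                 dccOrder = c(1, 1), distribution = "mvnorm")

fit <- dccfit(mspec, data = returns)

H <- rcov(fit)                          # 2 x 2 x T array of conditional covariances
hedge_ratio <- H[1, 2, ] / H[2, 2, ]    # cov_t(EM, GOLD) / var_t(GOLD)
plot.ts(hedge_ratio)                    # time-varying hedge ratio to compare with the OLS one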
  • asked a question related to R
Question
2 answers
Hello! I am doing a moderation analysis using PROCESS MACRO in R. My interaction (moderation) is significant but the bootstrap is not. How can I interpret that? Thank you!
Relevant answer
Answer
When interpreting the results of a moderation analysis using the PROCESS macro in R, it is important to consider both the significance of the interaction term and the bootstrap results. Here's how you can interpret the scenario where the interaction is significant, but the bootstrap is not:
  1. Significance of the interaction term: A significant interaction suggests that the relationship between the independent variable and the dependent variable is influenced by the moderator. In other words, the effect of the independent variable on the dependent variable differs at different levels of the moderator. This finding indicates that there is evidence of moderation.
  2. Non-significant bootstrap results: The bootstrap analysis is commonly used to estimate the confidence intervals and obtain more robust p-values for the indirect effects and the conditional effects (i.e., the moderation effect). If the bootstrap results are not significant, it means that the indirect effect or the conditional effect (moderation effect) is not significantly different from zero based on the bootstrap sampling. It suggests that there is no evidence for mediation or moderation in the specific pathways being examined.
In summary, when the interaction is significant, but the bootstrap results are not, it indicates that although there is evidence of moderation, the specific indirect effects or conditional effects being tested are not significantly different from zero based on the bootstrap analysis. This may suggest that the interaction effect is not strong enough to result in a significant indirect or conditional effect. However, it's important to consider the effect size and the context of your study to fully interpret the findings.
It's worth noting that it's always beneficial to carefully consider the specific details of your analysis and consult with a statistician or data analyst who can provide further guidance based on your specific research question and data.
  • asked a question related to R
Question
4 answers
I am looking for one but I could not find any information. I only found SmartPLS as such a tool, but I am looking for a free alternative.
Relevant answer
Answer
Hello,
you can try this reference.
Gaston Sanchez developed the R package plspm (along with a tutorial), which contains a REBUS-PLS function useful for performing class detection in heterogeneous data.
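A minimal plspm/REBUS sketch (the inner model, the block structure and the data frame dat are made-up placeholders; rebus.pls() is interactive and will show a dendrogram and ask for the number of classes):
library(plspm)

path <- rbind(LV1 = c(0, 0),
              LV2 = c(1, 0))
colnames(path) <- rownames(path)

blocks <- list(LV1 = 1:3, LV2 = 4:6)   # columns of `dat` belonging to each block
modes  <- c("A", "A")                  # reflective measurement

fit   <- plspm(dat, path, blocks, modes = modes)
rebus <- rebus.pls(fit)
summary(rebus)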
  • asked a question related to R
Question
7 answers
Dear community,
I am currently doing a research project including a moderated mediation that I am analysing with R (model 8). I did not find a significant direct effect of the IV on the DV. Furthermore, the moderator did not have a significant effect on any path. As a follow-up, I therefore calculated a second model that excluded the moderator (now mediation only, model 6). In this model, the effect of the IV on the DV is significant. Is it possible that the mere presence of the moderator in the first model influences my direct effect, even if it does not have an effect on my relationship between IV and DV? Is my assumption wrong that direct effects only depict direct effects, without including the influence of other variables in the model?
Can anybody help me with an explanation and maybe also literature for this?
Thank you very much in advance!
KR, Katharina
Relevant answer
Answer
Thom Baguley thank you very very much, I get it now. Very helpful explanations and sources!
  • asked a question related to R
Question
4 answers
Need urgent help regarding this.
I am currently working on a project involving multiple time series forecasting using R. For the forecasting process, I have monthly data available from 2019 up to 2023.
Currently, I have generated baseline forecasts using R, specifically using ARIMA and ETS models. I have attached figures depicting the behavior of one of the time series along with its corresponding forecasted values based on the applied mathematical models.
However, I am facing an issue where most of the forecasted data does not seem to capture the seasonality well. The forecasts appear to be linear rather than capturing the expected seasonal patterns.
Code:
t_data.fit = t_data %>%
  model(arima = ARIMA(Demand))
Relevant answer
Answer
There are several reasons why your forecasts may appear linear rather than capturing the expected seasonal patterns. One common reason is that the automatic model search settles on a non-seasonal specification, which can happen when the time index of the data is not set up as monthly or when the seasonal signal is weak relative to the noise in only a few years of monthly data. Another possible reason is that relevant explanatory variables are missing from the model.
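Since the code snippet above uses fable-style syntax, a hedged sketch of how to check the seasonality and force seasonal terms into the search (this assumes t_data is a tsibble with a monthly index and a Demand column):
library(fpp3)   # loads tsibble, fable, feasts, dplyr

t_data %>% gg_season(Demand)        # visually confirm the seasonal pattern first

fit <- t_data %>%
  model(
    arima  = ARIMA(Demand),                              # automatic search
    sarima = ARIMA(Demand ~ pdq() + PDQ(0:2, 1, 0:2)),   # force a seasonal difference
    ets    = ETS(Demand)
  )

fit %>% forecast(h = "12 months") %>% autoplot(t_data)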
  • asked a question related to R
Question
1 answer
I'm fitting a model in R where response variable is an over-dispersed count data, and where I meet an issue with goodness-of-fit test. As it has been previously discussed in Cross Validated, common GOF test like chi-square test for deviance is inappropriate for negative binomial regression because it includes a dispersion parameter as well as the mean:
I'm confused and would appreciate any suggestions about the appropriate GOF test for NB regression (preferably intuitive).
Relevant answer
Answer
The Likelihood Ratio Chi-Square test is the appropriate goodness-of-fit test for negative binomial regression. This test compares the fit of the negative binomial regression model to an alternative model, typically a simpler model such as a Poisson regression.
To perform the Likelihood Ratio Chi-Square test for negative binomial regression in SPSS, you can follow these steps:
  1. Run the negative binomial regression model using the appropriate command in SPSS. Make sure to include all relevant predictors and specify the negative binomial distribution. This will generate the estimated coefficients and other model outputs.
  2. Obtain the log-likelihood value for the negative binomial regression model. This value can usually be found in the model output or summary table.
  3. Fit the alternative model: a Poisson regression with the same predictors. This is the simpler model nested within the negative binomial model (it lacks the dispersion parameter).
  4. Obtain the log-likelihood value for the alternative model.
  5. Calculate the likelihood ratio (LR) statistic by subtracting the log-likelihood value of the alternative model from the log-likelihood value of the negative binomial regression model: LR = -2 * (log-likelihood of an alternative model - log-likelihood of the negative binomial model).
  6. The LR statistic is compared against a chi-square distribution with degrees of freedom equal to the difference in the number of parameters estimated between the two models. For negative binomial versus Poisson regression with the same predictors, that difference is one (the dispersion parameter); because this parameter is tested on the boundary of its parameter space, the chi-square p-value is often halved.
  7. Compare the calculated LR statistic to the critical chi-square value at the desired significance level. If the calculated LR statistic is greater than the critical value, it suggests that the negative binomial regression model provides a significantly better fit than the alternative model (Poisson regression).
It's important to note that the Likelihood Ratio Chi-Square test assesses the overall goodness-of-fit of the negative binomial regression model but does not provide information about specific predictors or individual model assumptions. Additional diagnostic tests and assessments should be performed to evaluate the model's performance thoroughly.
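Since the question is about R, a hedged sketch of the same test there (the data frame dat and the variable names are placeholders):
library(MASS)

pois <- glm(counts ~ x1 + x2, family = poisson, data = dat)
nb   <- glm.nb(counts ~ x1 + x2, data = dat)

# likelihood ratio statistic for the extra dispersion parameter
lr <- as.numeric(2 * (logLik(nb) - logLik(pois)))
# one-sided test on the boundary: halve the chi-square(1) tail probability
p  <- pchisq(lr, df = 1, lower.tail = FALSE) / 2
c(LR = lr, p = p)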
  • asked a question related to R
Question
3 answers
Hi all,
I'm busy building a shiny analysis pipeline to analyse proteomics data from mass spectrometry, and I was wondering what the exact difference is between the terms "over-represented" and "upregulated". Can they be used interchangeably? Is one more appropriate for RNA or proteins?
Thanks,
Sam
Relevant answer
Answer
I would extend that and say we really need to be careful using upregulated. To me, that means the expression of the protein is increased, so there is a fundamental change in the amount of protein quantified as a result of that. I'd suggest we use the term increased (or decreased) abundance when quantifying the protein, unless we have clear evidence to say otherwise. There are a lot of reasons protein quant differs between samples, and it's not always due to expression level changes. I agree you will find such terms used interchangeably, that does not mean it's correct, or a good idea.
  • asked a question related to R
Question
3 answers
I am trying to run a spatio-temporal autoregressive model (STAR). Therefore I need to create a spatial weight matrix W with N × T rows and N × T columns to weight country interdependencies based on yearly trade data. Could someone please tell me how to create such a matrix in R or Stata?
Relevant answer
Answer
Dear Jan,
OK! I see, you need to create the spatial weight matrix indeed!
There are many possibilities in R:
I strongly advise working with sf because it is so much easier now!
but spdep may still be clearly adapted to your context:
This is one of the definitive books on the subject in R:
there are other references but they are more geospatial (point process) oriented.
Here you should use one of those packages, and nb2mat from the spdep package might do the trick!
All the best.
Franck.
  • asked a question related to R
Question
5 answers
Hi guys, I'm looking for all R libraries developed/intended/oriented to run standard and advanced psychometric analyses. I'm aware of the existence of popular packages such as psych, sem, mirt or CTT, but if you happen to know any other package that performs psychometric analyses I will really appreciate any info about it.
  • asked a question related to R
Question
2 answers
Hi
I am trying to set up a topmodel simulation in R. The flowlength function of the topmodel package requires the outlet coordinate of the DEM matrix, i.e. the row and column position.
Is there a practical way to get that position? I am currently using spreadsheets to find the position visually based on my knowledge of the watershed. Unfortunately, for a large, high-resolution DEM this is almost impossible.
Relevant answer
Answer
To get the outlet coordinate (row and column) of a DEM matrix for Topmodel in R, you can use the following steps:
1. Load the necessary packages in R:
library(raster)
library(hydroTSM)
library(topmodel)
2. Read in your DEM data using the raster package:
dem <- raster("path/to/dem.tif")
3. Define the flow direction and accumulation using the terrain function from the raster package:
flow_dir <- terrain(dem, "flowdir")
flow_acc <- terrain(dem, "flowacc")
4. Calculate the stream network using the topmodel package:
stream_net <- topmodel(flow_dir, flow_acc)
5. Get the outlet cell of the stream network using the hydroTSM package:
outlet_cell <- getOutletCell(stream_net)
6. Convert the outlet cell to row and column coordinates using the raster package:
outlet_row_col <- xyFromCell(dem, outlet_cell)
The outlet_row_col variable now contains the row and column coordinates of the outlet cell of your DEM.
  • asked a question related to R
Question
5 answers
Dear colleague,
I am currently working on a similar topic for my thesis, and I would like to learn more about how to perform this test. Could you please share with me some guidance, steps, or code for implementing these tests? I would greatly appreciate your help and advice.
Relevant answer
Answer
Thank you, Professor, for sharing your code. Could you please also share the reference for the code? It seems that the test in the code is based on a VAR model rather than the GARCH model used in the original paper "A Lagrange multiplier test for causality in variance". Samuel Oluwaseun Adeyemo
  • asked a question related to R
Question
6 answers
I am trying to develop a mixed logit model for my data; the data are in the format shown in the image. I am getting errors while running the model (about the data format, the alternative name not being allowed to be the same as a column name, etc.) in packages like xlogit in Python and mlogit or apollo in R.
My question is: does one need to convert the data into the format shown even if RP data are available and only one specific choice out of a few is recorded for each individual?
If the data set is fairly large, how does one convert the data to that format using code?
Also, what happens in the case where all my variables are categorical and there are no varying variables across individuals?
Relevant answer
Answer
It's difficult to provide a precise answer without more information about your specific data and the packages you are using. However, I can provide some general guidance on working with mixed logit models.
First, it's important to make sure that your data is in the correct format for the package you are using. For example, some packages may require the data to be in a long format with each observation on a separate row, while others may require it to be in a wide format with each individual on a separate row.
If you have RP (Revealed Preference) data and only one specific choice out of a few is available for each individual, you can still use mixed logit models. However, you may need to adjust your data format to make it compatible with the package you are using.
If your data size is large, you can use code to convert your data to the required format. This can be done using functions or loops in Python or R, depending on the package you are using. For example, in Python, you can use the pandas library to reshape your data into the desired format.
If all your variables are categorical and there is no varying variable across each individual, you can still use mixed logit models, but you may need to include random effects or heterogeneity parameters to capture any unobserved differences between individuals.
In general, working with mixed logit models can be complex, and it's important to carefully review the documentation and examples provided by the packages you are using to ensure that your data is in the correct format and that you are using the correct functions and parameters. Additionally, it can be helpful to consult with experts in the field or seek assistance from online forums or communities for guidance on working with these models.
I hope you find this useful.
With best regards
Ajaz Mir
  • asked a question related to R
Question
3 answers
In R, I am using the randomForest and caret packages for a Random Forest (RF) regression task. Genarally, there are two options when fine-tuning a model:
  • GridSearch,
  • RandomSearch.
I have seen in various posts that GridSearch has limitations, such as that there are intangibles that cannot be addressed in a grid search; for example, parameter interaction stabilizes at a slower rate than error, so bootstrap replicates should be adjusted not just on error but also based on the number and complexity of parameters included in the model.
My questions are:
  1. what are those interactions?
  2. how can I find such interactions and visualize them (e.g., in a graph) using R?
Examples online are welcomed.
Relevant answer
Answer
The principle of parameter tuning is to find the optimal parameters for your model. After preparing and splitting your data set, parameter tuning with a grid search (or another search strategy) is carried out alongside cross-validation. Each modelling technique has its own parameters. The "interaction" is simply the combination of your parameters; for a GBM, this would be the number of trees, the tree depth, the shrinkage, and the minimum number of observations per terminal node (or required to create a new branch). With the caret package, you run models for all possible combinations (this step takes a long time); for example, 6,375 models were generated for the graph that I published. You then obtain the best combination of parameters (i.e., the parameters giving the best accuracy and the lowest error rate). Finally, you run your model with the optimised parameters and assess it.
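To illustrate the caret grid-search workflow mentioned above (randomForest in caret tunes only mtry; the formula below uses a built-in data set purely as a placeholder):
library(caret)
library(randomForest)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
grid <- expand.grid(mtry = c(1, 2, 3, 4))         # the grid of parameter values

fit <- train(Sepal.Length ~ ., data = iris,
             method = "rf", ntree = 500,
             trControl = ctrl, tuneGrid = grid)

fit$bestTune   # best parameter combination
plot(fit)      # how performance changes across the grid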
  • asked a question related to R
Question
1 answer
Dear colleagues! I've tried to estimate the histone variants enrichment of several arabidopsis genomic sites but every time my enrichment result looks like the added screenshot. What am I doing wrong?
The running function is:
enr.df <- enrichment(query =gr_list[[1]], catalog = anno, shuffles = 24, nCores = 12)
anno is remap2020_histone_nr_macs2_TAIR10_v1_0.bed,
catalog is a GRanges of sites of interest of A.thaliana (all 5 chromosomes).
There are no warnings after function execution.
Relevant answer
Answer
Found the solution. The problem is that it is not stated that you HAVE to provide the function with valid chromosome lengths, and the format is strict: a data.frame with one column = size and row.names = chromosomes.
So, the function works fine like this:
library(Biostrings)   # readDNAStringSet
library(stringr)      # str_split_fixed
tair <- readDNAStringSet("GCF_000001735.4_TAIR10.1_genomic.fna")
tair <- tair[1:5]
tair@ranges@NAMES <- str_split_fixed(tair@ranges@NAMES, " ", 2)[, 1]
enr.df <- enrichment(query = sites, catalog = anno, shuffles = 6, nCores = 12, byChrom = T, chromSizes = data.frame(size = tair@ranges@width, row.names = tair@ranges@NAMES))
  • asked a question related to R
Question
9 answers
I'm using the package deSolve in R to model a biological ODE system; however, I have the problem that, with certain parameters or initial conditions, the ODE solver produces negative values, which make no sense in a biological scenario.
Is there a way to avoid this without resorting to logarithms?
Relevant answer
Answer
To avoid negative solutions from ODE solvers in R, you can monitor the state during the integration: deSolve lets you supply a root function that fires when a state variable crosses zero, optionally combined with an event function that decides what happens at that moment (for example stopping the integration or resetting the state). Another common remedy is to tighten the solver tolerances (atol, rtol), since small negative values are often just numerical undershoot of a solution that should sit at zero.
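A minimal sketch of the root-function idea (the toy model and parameter values are made up, and the rootfunc argument name is quoted from memory of the deSolve root interface, so check ?lsodar and ?events): with lsodar, supplying a root function and no event simply stops the integration when the state hits zero.
library(deSolve)

model <- function(t, state, parms) {
  with(as.list(c(state, parms)), list(c(dX = -r * X - loss)))
}
rootfunc <- function(t, state, parms) state["X"]   # root when X == 0

out <- ode(y = c(X = 1), times = seq(0, 100, 0.1), func = model,
           parms = c(r = 0.05, loss = 0.02), method = "lsodar",
           rootfunc = rootfunc)
tail(out)   # integration stops at the time X reaches zero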
  • asked a question related to R
Question
5 answers
I want to run the SPSS control file from the PISA 2012 data set in SPSS, but I am not able to convert the .txt file into a .sav file. Does anyone have an idea how to run it?
Relevant answer
Answer
Hi Kaushal Kumar Bhagat Did you sort the problem out? I've met exactly the same issue. Thank you so much if there are any hints.
  • asked a question related to R
Question
5 answers
Hi there!
I'm following the method by Branko Miladinovic et al. "Trial sequential boundaries for cumulative
meta-analyses".
I'm using the following code: metacumbounds ln_hr se_ln_hr N, data(loghr) effect(r) id(study) surv(0.40) loss(0.00) alpha(0.05) beta(0.20) is(APIS) graph spending(1) rrr(.15) kprsrce(StataRsource.R) rdir(C:\Program Files\R\R-4.2.1\bin\x64\) shwRRR pos(10) xtitle(Information size)
but the following error is popping up:
File bounds.dta not found.
Any guess what's causing the error and how to fix it?
Thanks in advance
Relevant answer
Answer
Any new suggestion?
  • asked a question related to R
Question
1 answer
Dear all,
I want to calculate the effect of treatments (qualitative) on quantitative variables (e.g. plant growth, % infestation by nematodes, ...) compared to a control in an experimental setup. For the control and for each treatment, I have n=5.
Rather than comparing means between each treatments, including the control, I would like to to see whether each treatment has a positive/negative effect on my variable compared to the control.
For that purpose, I wanted to calculate the log Response Ratio (lnRR) that would show the actual effect of my treatment.
1) Method for the calculation of the LnRR
a) During my thesis, I had calculated the mean of the control value (Xc, out of n=5) and then compared it to each of the values of my treatments (Xti). Thereby, I ended up with 5 lnRR values (ln(Xt1/Xc); ln(Xt2/Xc); ln(Xt3/Xc); ln(Xt4/Xc); ln(Xt5/Xc)) for each treatment, calculated the mean of those lnRR values (n=5) and ran the following stats : i) comparison of the mean effects between all my treatments ("has one treatment a larger effect than the other one?") and ii) comparison of the effect to 0 ("is the effect significantly positive/negative?")
==> Does this method seem correct to you ?
b) While searching the literature, most studies consider data from original studies and calculate LnRR from mean values within the studies. Hence, they end up with n>30. This is not our case here as data are from 1 experimental setup...
I also found this: "we followed the methods of Hedges et al. to evaluate the responses of gas exchange and water status to drought. A response ratio (lnRR, the natural log of the ratio of the mean value of a variable of interest in the drought treatment to that in the control) was used to represent the magnitude of the effects of drought as follows:
LnRR = ln(Xe/Xc) = lnXe - lnXc,
where Xe and Xc are the response values of each individual observation in the treatment and control, respectively." ==> This is confusing to me because the authors say that they use mean values of treatment / mean values of control), but in their calculation they use "individual observations". Are the individual observation means within each studies ?
==> Can you confirm that I CANNOT compare each observation of replicate 1 of control with replicate 1 of treatment; then replicate 2 of control with replicate 2 of treatment and so on? (i.e. ln(Xt1/Xc1); ln(Xt2/Xc2); ln(Xt3/Xc3); ln(Xt4/Xc4); ln(Xt5/Xc5)). This sounds wrong to me as each replicate is independent.
2) Statistical use of LnRR
Taking my example in 1a), I did a t-test for the comparison of mean lnRR value with "0".
However, n<30 so it would probably be best not to use a parametric test :
=> any advice on that ?
=> Would you stick with a comparison of means from raw values, without trying to calculate the lnRR to justify an effect ?
Thank you very much for your advice on the use of LnRR within single experimental studies.
Best wishes,
Julia
Relevant answer
Answer
Hi Prof. Julia,
Your questions are what I am wondering about, I think they are very important and interesting. I am trying to explore the temporal pattern of the restoration effect along the age of restoration. I use Ln(Xrestored/Xunrestored) as a measure of the restoration effect based on a fixed study site with multiple years of observations. I think your questions are highly related to my method. Have you found the answer to these questions and could you please tell me if so? I would be very grateful for this.
Best wishes,
Erliang
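For what it is worth, a small sketch of the calculation described in the question (the numbers are made up; with n = 5, a Wilcoxon signed-rank test is a common non-parametric alternative to the one-sample t-test against zero):
control   <- c(10.2, 11.1, 9.8, 10.5, 10.9)
treatment <- c(12.3, 13.0, 11.8, 12.9, 12.1)

lnRR <- log(treatment / mean(control))   # one lnRR per treatment replicate

t.test(lnRR, mu = 0)        # parametric test of the mean effect against zero
wilcox.test(lnRR, mu = 0)   # non-parametric alternative for small n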
  • asked a question related to R
Question
1 answer
I'm trying to calculate the cophenetic distance with the R function cophenetic from a phylogenetic tree (with size 16.8MB, generated by the package V.phylomaker2), and RStudio raises the error: "Error in dist.nodes(x) : tree too big", as the picture shows. How can I solve it? Or is there any other method to calculate NRI when the tree is too big (I'm using the function "ses_mpd" from package "picante")?
Thanks a lot!
Relevant answer
Answer
Contact the maintainer whose name and email is on the R pdf document describing the package. Best wishes David Booth.
  • asked a question related to R
Question
6 answers
I am grappling with the problem below; it would be a pleasure to have your ideas.
I've written the same program in two languages, Python and R, but each comes to a completely different result. Before jumping to a conclusion, I declare that:
- Every word of the code in the two languages has been checked multiple times, is correct, and represents the same thing.
- The packages used in the two languages are the same version.
So, what do you think?
The code is about applying deep neural networks for time series data.
Relevant answer
Answer
Good morning. Without the code it is difficult to know where the difference comes from. I do not use Python, I work in R, but maybe the difference is due to the data-splitting stage: did you try setting the same seed for the random number generator, for example set.seed(1234)? (If my memory is good, an equivalent function is also used in Python.) Were your results and evaluation metrics totally different? In that case, maybe there is a reliability issue in your model. You should check your data preparation and feature selection.
  • asked a question related to R
Question
5 answers
I am familiar with the 'sdm' package for constructing species distribution models (SDMs). Now I am facing an issue.
I used predict() from 'dismo' to predict the distribution of a species a few weeks ago and it ran smoothly without consuming much time. It hardly took 3-5 minutes, but now it has been running since morning (8 hrs+) for the same data, and I am still waiting for the result. I have to prepare SDMs for 10 different species. If it takes this much time to predict a single model, then I will have to wait for many days, which is annoying.
How can I fix it? Can anyone share their thoughts in this regard, or is there an alternative to save time?
Thanks
PS: PC configuration: 8 GB RAM, AMD Processor
Relevant answer
Answer
Dear Pietro,
Thanks for your response.
I am using five different algorithms: GLM, BRT, MaxEnt, SVM and RF. I have now found that there is an argument for running the prediction in parallel, which takes less time than I experienced earlier:
predict(model, predictors, nc = 8)
Thanks again.
  • asked a question related to R
Question
1 answer
Good day! I have a list (~10000) of unique DNA sequences about 10-20 bp.
I want to find out if they could evolve from one or several sequences, or emerged independently.
Some of the sequences have similar motifs and could be aligned, others haven't at all - so I can't just perform MSA and make a tree - the distance matrix contains many NAs.
I've tried using principal components analysis on k-mers (1-4) frequencies but it gives me nothing - the frequencies form one dense cloud of points with PC1 that have only ~4% explained variance.
And I found that the universalmotif R package is capable of performing a similar analysis using motif_comparison(), so I converted the sequences into sequence motif format (one for each). But when I tried it on a short set of data, I found that the algorithm works in a very strange way on a list of motifs created each from only one sequence. Different methods give the same result (tree added to the question): sequences that are different are placed near each other instead of sequences that are somewhat similar...
Relevant answer
Answer
You could use k-mer analysis (k = 3 or 4 seems appropriate) to compute a similarity matrix. This would be the counterpart of the distance matrix, which, as you say, is quite incomplete.
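A small base-R sketch of that idea (the example sequences are made up): build 3-mer frequency profiles and a cosine similarity matrix, which is defined for every pair even when the sequences cannot be aligned.
seqs <- c("ACGTACGTGG", "TTGACGTACG", "GGGCCCATAT")

kmers <- function(s, k = 3) {
  n <- nchar(s) - k + 1
  substring(s, 1:n, k:(k + n - 1))
}

all_k <- sort(unique(unlist(lapply(seqs, kmers))))
freq  <- t(sapply(seqs, function(s) table(factor(kmers(s), levels = all_k))))
freq  <- freq / rowSums(freq)

# cosine similarity between all pairs of k-mer profiles
norms  <- sqrt(rowSums(freq^2))
cosine <- (freq %*% t(freq)) / (norms %o% norms)
cosine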
  • asked a question related to R
Question
4 answers
Hello All,
I have a question regarding the SEM-based path analysis. I construct my model as the following:
"Y~M1+M2+X1+X2
M1~X1
M2~X2"
As you can see, both M1 and M2 are used as mediators in my model.
Here comes my question: All my variables (Y, M1, M2, X1, X2) are categorical variables. By running the above model in R ("lavaan" package), I can get the results (i.e., coefficient, p-value) of each category of Y, M1, and M2. However, my results do not show each level of X1 and X2.
Are there any ways to solve this issue?
Any help will be much appreciated!
Relevant answer
Answer
This is SEM with a multicategorical X; see the Hayes and Preacher tutorial "Statistical mediation analysis with a multicategorical independent variable".
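Following that advice, a hedged lavaan sketch of dummy-coding a 3-level exogenous variable before fitting (the level labels, the data frame dat and the assumption that X2 is binary are placeholders; the model lines mirror the question):
library(lavaan)

# dummy-code X1 (reference level "A")
dat$X1_b <- as.numeric(dat$X1 == "B")
dat$X1_c <- as.numeric(dat$X1 == "C")

model <- '
  Y  ~ M1 + M2 + X1_b + X1_c + X2
  M1 ~ X1_b + X1_c
  M2 ~ X2
'
fit <- sem(model, data = dat)
summary(fit)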
  • asked a question related to R
Question
1 answer
I have three raster layers, two coarse resolution and one fine resolution. My goal is to extract GWR's coefficients (intercept and slope) and apply them to my fine resolution raster.
I can do this easily when I perform simple linear regression. For example:
library(terra)
library(sp)
ntl = rast("path/ntl.tif") # coarse res raster
vals_ntl <- as.data.frame(values(ntl))
ntl_coords = as.data.frame(xyFromCell(ntl, 1:ncell(ntl)))
combine <- as.data.frame(cbind(ntl_coords,vals_ntl))
ebbi = rast("path/tirs010.tif") # coarse res raster
ebbi <- resample(ebbi, ntl, method="bilinear")
vals_ebbi <- as.data.frame(values(ebbi))
s = c(ntl, ebbi)
names(s) = c('ntl', 'ebbi')
block.data <- as.data.frame(cbind(combine, vals_ebbi))
names(block.data)[3] <- "ntl"
names(block.data)[4] <- "ebbi"
block.data <- na.omit(block.data)
model <- lm(formula = ntl ~ ebbi, data = block.data)
#predict to a raster
summary(model)
model$coefficients
pop = rast("path/pop.tif") # fine res raster
lm_pred010 = 19.0540153 + 0.2797187 * pop
But when I run GWR, the slope and intercept are not just two numbers (like in linear model) but it's a range. Attached are the results of the GWR.
My question is how can extract GWR model parameters (intercept and slope) and apply them to my fine resolution raster? In the end I would like to do the same thing as I did with the linear model, that is, GWR_intercept + GWR_slope * fine resolution raster.
Here is the code of GWR:
library(GWmodel)
library(raster)
block.data = read.csv(file = "path/block.data00.csv")
#create separate df for the x & y coords
x = as.data.frame(block.data$x)
y = as.data.frame(block.data$y)
sint = as.matrix(cbind(x, y))
#convert the data to spatialPointsdf and then to spatialPixelsdf
coordinates(block.data) = c("x", "y")
# specify a model equation
eq1 <- ntl ~ tirs
dist = GWmodel::gw.dist(dp.locat = sint, focus = 0, longlat = FALSE)
abw = bw.gwr(eq1, data = block.data, approach = "AIC", kernel = "tricube", adaptive = TRUE, p = 2, longlat = F, dMat = dist, parallel.method = "omp", parallel.arg = "omp")
ab_gwr = gwr.basic(eq1, data = block.data, bw = abw, kernel = "tricube", adaptive = TRUE, p = 2, longlat = FALSE, dMat = dist, F123.test = FALSE, cv = FALSE, parallel.method = "omp", parallel.arg = "omp")
ab_gwr
You can download the csv and the fine resolution raster from here (https://drive.google.com/drive/folders/1V115zpdU2-5fXssI6iWv_F6aNu4E5qA7?usp=sharing).
Relevant answer
Answer
library(GWmodel)
library(sp)
library(raster)
library(sf)   # needed for st_as_sf() below
wd = "path/"
provoliko = "EPSG:7767"
# fine resolution covariates
x1= raster(paste0(wd, "x1.tif"))
x2 = raster(paste0(wd, "x2.tif"))
s = stack(x1, x2)
regpoints <- rasterToPoints(s, spatial = TRUE)
# coarse resolution df
block.data = read.csv(paste0(wd, "block.data.csv"))
coordinates(block.data) <- c("x", "y")
proj4string(block.data) <- provoliko
dist = GWmodel::gw.dist(dp.locat = coordinates(block.data),
rp.locat = coordinates(regpoints),
focus = 0,
p = 2,
theta = 0,
longlat = F)
eq1 <- ntl ~ .
abw = bw.gwr(eq1,
data = block.data,
approach = "AIC",
kernel = "gaussian",
adaptive = TRUE,
p = 2,
parallel.method = "omp",
parallel.arg = "omp")
ab_gwr = gwr.predict(eq1,
data = block.data,
predictdata = regpoints,
bw = abw,
kernel = "gaussian",
adaptive = TRUE,
p = 2,
theta = 0,
longlat = FALSE,
dMat1 = dist)
sp <- ab_gwr$SDF
sf <- st_as_sf(sp)
# export prediction
gwr_pred = as.data.frame(sf$prediction)
gwr_pred = SpatialPointsDataFrame(data = gwr_pred, coords = regpoints)
gridded(gwr_pred) <- TRUE
gwr_pred <- raster(gwr_pred)
raster::crs(gwr_pred) <- provoliko
gwr_pred[gwr_pred <= 0] <- 0
writeRaster(gwr_pred,
paste0(wd, "ntl_gwr.tif"),
overwrite = TRUE)
  • asked a question related to R
Question
2 answers
AnnotationDbi can annotate a gene with its full name or GO terms.
However, how can one annotate individual ligand-receptor pairs?
For example, for the pair KNG1-BDKRB2, how can you annotate its role or function with an R or Python module?
Relevant answer
Answer
Wang Lu There is no universal method for annotating individual ligand-receptor pairings. However, utilizing multiple databases and APIs, you may retrieve information on the specific pair.
For example, the biomaRt package in R may be used to obtain functional information from databases such as UniProt, Ensembl, or Gene Ontology. Using the gene symbol or accession number, you may search these databases for pertinent information such as function, pathway, or interaction partners.
The Biopython library in Python gives access to numerous biological databases such as UniProt, Ensembl, and KEGG. Similarly, you may extract ligand and receptor information and utilize it to annotate the ligand-receptor pair.
It is vital to remember that the annotation of individual ligand-receptor pairings may be restricted by the data available in the databases you are using. Furthermore, the data may be incomplete and may not fully reflect the function of the specific ligand-receptor combination in question.
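As a concrete starting point, a hedged biomaRt sketch that pulls GO annotations for both partners of the pair (the attribute names follow the standard Ensembl BioMart fields, but they can change between releases, so treat them as assumptions):
library(biomaRt)

mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

getBM(attributes = c("hgnc_symbol", "go_id", "name_1006", "namespace_1003"),
      filters    = "hgnc_symbol",
      values     = c("KNG1", "BDKRB2"),
      mart       = mart)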
  • asked a question related to R
Question
3 answers
Is there an R or Python package that can annotate the function or related diseases of individual genes?
Thank you for your attention.
Relevant answer
Answer
Wang Lu Individual genes may be annotated using a variety of R and Python programs. Some common choices are:
1. biomaRt (R package) enables the retrieval of gene annotation data from a variety of sources (e.g. Ensembl, UniProt, etc.)
2. pyensembl (Python package) - a Pythonic interface to Ensembl data, including gene annotation.
3. AnnotationDbi (R package) - offers a consistent interface to a range of annotation resources, such as gene ontology and KEGG pathways.
4. BioPython (Python package) - a collection of bioinformatics-related Python modules, including one for processing and editing GenBank entries, which may be used to get gene annotation information.
5. GORILLA (R package) - a R tool for high-throughput gene ontology (GO) analysis, which may be used to annotate genes with their GO concepts.
Many of these programs may annotate genes with disease-related information, such as disease-gene connections from the Online Mendelian Inheritance in Man (OMIM) database or GeneCards.
  • asked a question related to R
Question
5 answers
Hi everyone,
What is the best method for converting nested data stored in a .mat file (created in MATLAB) to a format that can be read by R? (e.g., csv)
Thanks :)
Relevant answer
Answer
Saleh Hamed There are numerous methods for converting a .mat file (generated in MATLAB) to a format that R can read. The readMat() function from the R.matlab package in R is a common way. This function can read MATLAB .mat files and convert them to R objects including data frames, matrices, and lists.
Another approach is to import the .mat file in Python using the scipy.io.loadmat() function, and then use the pandas package to transform the data into a format that R can read, such as a .csv file.
You may also use the save() function in MATLAB to save the data in a format that R can read, such as .csv or .txt.
Which approach is best will be determined by your individual use case as well as the format and functionality required for your data in R.
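A minimal R.matlab sketch (the file name and the nested element name are placeholders, since the structure of your .mat file is unknown):
library(R.matlab)

m <- readMat("results.mat")
str(m, max.level = 2)                  # inspect the nested list structure

df <- as.data.frame(m$subject[[1]])    # hypothetical nested element
write.csv(df, "results_subject1.csv", row.names = FALSE)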
  • asked a question related to R
Question
6 answers
Hello everyone.
I'm conducting a research in which I have pre-test and post-test data. First, I measured the number of subscribers of a service in the first period (pre-test). Second, I measured the number of subscribers of the same service in a second period (post-test). Between the pre-test and the post-test, consumers were shown a stimulus that could have changed their intention of being subscribers of the service. I used 2 types of stimulus, one about time (week and month) and the other about the method (method 1 and method 2). I have also the pre-test and post-test of the number of users who use the service but are not the owners of the subscription.
I have performed a McNemar test to compare if there are differences between the number of subscribers before and after each type of stimulus. That is, one McNemar test to measure differences in weekly time pressure, one for monthly pressure, and two more: one for method 1 and one for method 2. For all of them, I have significant results.
I have also calculated the percentages of decrease in the number of subscribers for each stimulus. For example, with weekly pressure the number of subscribers drops by 7%. In the case of users who use the service but are not the owners of the subscription, the decrease is 54%. I intend to determine which stimulus produces a greater decrease and in which type of users.
I think that maybe this analysis is poor for academic research, and I also want to compare the percentage of decrease in subscriptions with the percentage of decrease of users who use the service but aren't the owners of the subscription, for each type of stimulus.
Can you please recommend me another type of analysis?
Maybe some analysis to measure if there are differences when doing a weekly and monthly pressure (or method 1 and method 2) between subscribers and users who use the service but are not the owners of the subscription. I mean using the two groups of users, the two times pressures or the two methods pressures, and the pretest and posttest.
Thank you very much.
Diana
Relevant answer
Answer
Shezan Ahmed, I have one question:
Can I use two-way ANOVA when I have a binary (dummy) outcome? I can't find any example of ANOVA with a binary dependent variable.
  • asked a question related to R
Question
1 answer
Dear Colleagues!
Is it possible to use percentage cover data (e.g. plant cover in sample plots) when creating rarefaction curves with the iNEXT package in R? (that can compute rarefaction curves for the Hill numbers)
Unfortunately, I cannot convert these percentages to count data.
Therefore I can only use the incidence-frequency based data (0/1) as input for creating these curves, but with this limitation only the species richness curves (q=0) make sense to me, and I lose the abundance-based information of my percent cover data in the shannon (q=1) and simpson (q=2) curves.
What is the proper way when I only have percentage cover data?
Any ideas would be appreciated!
Relevant answer
Answer
You can only develop rarefaction curves with actual count data. But if each point is considered an independent sample unit when point sampling to determine cover, the data will follow a binomial distribution, and analyses should be based on binomial statistics :)
  • asked a question related to R
Question
1 answer
I created this R package to allow easy visual analysis of VCF files, to investigate mutation rates per chromosome, per gene, and much more: https://github.com/cccnrc/plot-VCF
The package is divided into 3 main sections, based on analysis target:
  1. variant Manhattan-style plots: visualize all/specific variants in your VCF file. You can plot subgroups based on position, sample, gene and/or exon
  2. chromosome summary plots: visualize plot of variants distribution across (selectable) chromosomes in your VCF file
  3. gene summary plots: visualize plot of variants distribution across (selectable) genes in your VCF file
Take a look at how many different things you can achieve in just one line of code!
It is extremely easy to install and use, well documented on the GitHub page: https://github.com/cccnrc/plot-VCF
I'd love to have your opinion, bugs you might find etc.
Relevant answer
Answer
I use the TASSEL software for genome analysis. You need PLINK-format .map and .ped files to operate it. You can try and explore this software.
  • asked a question related to R
Question
7 answers
Hi,
Most researchers know the R Views website, which is:
Please, I am wondering whether this website lists all R packages available to researchers.
Thanks & Best wishes
Osman
Relevant answer
Answer
no need to buy R
  • asked a question related to R
Question
2 answers
I have .bed (file format) files, and I would like to make a correlation plot (preferably a line plot) for them.
For example, I want the x axis to be distance (i.e., chromosome 1) and the y axis to be some sort of correlation.
For example, if one of the files has a peak where the other does not, there would be no correlation, and if there are strong peaks for both files at the same location, then there would be high correlation.
Any suggestions? Thanks!
Relevant answer
Answer
Thanks Mohammad Alzeyadi for the suggestions!
  • asked a question related to R
Question
5 answers
I want to select some disease-related terms, such as amyloid-related or stress-related terms, for visualization using clusterProfiler in R (the result of a DisGeNET enrichment analysis). What code can I use to select these terms for visualization in R without picking them manually?
Relevant answer
Answer
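One hedged way to do this, assuming the enrichment result is a clusterProfiler/DOSE object called edo (the object name, the patterns and the showCategory behaviour of recent enrichplot versions are assumptions):
library(enrichplot)

terms <- c("amyloid", "stress")
res   <- as.data.frame(edo)
keep  <- res$Description[grepl(paste(terms, collapse = "|"),
                               res$Description, ignore.case = TRUE)]

# showCategory can be given the selected term names directly
dotplot(edo, showCategory = keep)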
  • asked a question related to R
Question
4 answers
I am running a PCA in JASP and SPSS with the same settings, however, the PCA in SPSS shows some factors with negative value, while in JASP all of them are positive.
In addition, when running a EFA in JASP, it allows me to provide results with Maximum Loading, whle SPSS does not. JASP goes so far with the EFA that I can choose to extract 3 factors and somehow get the results that one would have expected from previous researches. However, SPSS does not run under Maximum Loading setting, regardless of setting it to 3 factors or Eigenvalue.
Has anyone come across the same problem?
UPDATE: Screenshots were updated. EFA also shows results on SPSS, just without cumulative values, because value(s) are over 1. But why the difference in positive and negative factor loadings between JASP and SPSS.
Relevant answer
Answer
Thank you, I found the issues and now it gives me very similar results
  • asked a question related to R
Question
4 answers
Hi all, I am attempting to make a scatterplot of CO2 rate data (y) at varying temperatures (x) with two categorical variables, depth (Mineral, Organic) and substrate (Calcareous, Metapelite, Peridotite) with fitted lines following the equation y = a*exp(B*t) where y = co2 rate, a = basal respiration (intercept), B = slope and t = temperature (time equivalent). I have already fitted all of the exponential curves so I have the values for y, a, B and t for each data point. I am struggling to figure out how to fit the exponential curves across the two groups. So far, I have produced the base scatterplot using GGplot2 colouring by depth and faceting by substrate but I do not know what the best approach is to fit the curves. I have attempted to filter the dataset into categories (substrate = 1, depth = 1) and use nls to fit the curves with such little success that it isn't worth posting the code. Any advice/ guidance is greatly appreciated.
Relevant answer
Answer
To get the exponential curve function from the fitted parameters, you should define
exp_funct <- function(a,b,t) a*exp(b*t)
Just to cross-check:
exp_funct(co2_data$basal_respiration, co2_data$slope, co2_data$chamber_temp)
returns the values very similar to co2_data$co2_rate_u_m_h_g (differences are due to rounding errors, I assume).
This is the code to plot the curves in a temperature range from 10 to 30:
# one colour and one curve per row of co2_data
groups <- paste(co2_data$substrate, co2_data$depth)
palette <- hcl.colors(length(groups), "Set 2")
# empty plot covering the temperature range of interest
plot(NA, xlim = c(10, 30), ylim = c(0, 0.5), xlab = "Temperature", ylab = "CO2 rate")
# add one fitted exponential curve per row, using that row's a and b
for (row in 1:nrow(co2_data)) {
  a <- co2_data$basal_respiration[row]
  b <- co2_data$slope[row]
  curve(exp_funct(a, b, t), from = 10, to = 30, xname = "t", add = TRUE,
        col = palette[row], lwd = 3)
}
legend("topleft", lwd = 3, col = palette, legend = groups, bty = "n")
You can also plot the curves depending on any other of the parameters.
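Since the question asks for a ggplot2 version with facets, here is a minimal sketch of the same idea in ggplot2, predicting each fitted curve over a temperature grid (the column names basal_respiration, slope, substrate and depth match those used above and are assumed to exist in co2_data):
library(ggplot2)
# one prediction grid per row (i.e. per fitted curve)
grid <- expand.grid(row = seq_len(nrow(co2_data)), t = seq(10, 30, by = 0.5))
pred <- cbind(co2_data[grid$row, ], t = grid$t)
pred$co2_pred <- pred$basal_respiration * exp(pred$slope * pred$t)
pred$curve_id <- grid$row
ggplot(pred, aes(x = t, y = co2_pred, group = curve_id, colour = depth)) +
  geom_line() +
  facet_wrap(~ substrate) +
  labs(x = "Temperature", y = "CO2 rate")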
  • asked a question related to R
Question
6 answers
It is a .dta file and cannot be read with read.csv, so I used haven to read it. However, the second row of each column holds the real name, such as Checking Year and Checking Month. How can I extract those?
Relevant answer
Answer
Jochen Wilhelm, sorry for the confusion about my question. I want to extract all the labels of those columns and use them as column names. For example, we want to extract the label names Individual ID, Household ID, Community ID, Checking Year, Checking Season, Checking Date, and Checking Day, and set those as the column names.
Please contact me if more information is needed :)
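If it helps, a minimal sketch of that idea (assuming the data were read with haven::read_dta() into an object called df; the file name is a placeholder; haven stores the Stata variable labels in a "label" attribute on each column):
library(haven)
df <- read_dta("mydata.dta")   # hypothetical file name
# pull the variable label of each column; fall back to the original name if missing
labels <- vapply(df, function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else as.character(lab)
}, character(1))
names(df) <- ifelse(is.na(labels), names(df), labels)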
  • asked a question related to R
Question
4 answers
I want to make a Venn diagram that has more than 50 sets, each containing two numbers.
The data look like the table below; the total column varies among samples, but the intersection is a constant number.
total intersection
set1 1000 10
set2 1200 10
set3 1350 10
.
.
.
set50 3000 10
Relevant answer
Answer
You may also find many valuable answers regarding Venn diagrams, as well as some software solutions, in these posts:
  • asked a question related to R
Question
6 answers
I am measuring two continuous variables over time in four groups. Firstly, I want to determine if the two variables correlate in each group. I then want to determine if there are significant differences in these correlations between groups.
For context, one variable is weight and one is a behaviour score. The groups are receiving various treatments, and I want to test if weight change influences the behaviour score differently in each group.
I have found the r package rmcorr (Bakdash & Marusich, 2017) to calculate correlation coefficients for each group, but am struggling to determine how to correctly compare correlations between more than two groups. The package diffcorr allows comparing between two groups only.
I came across this article describing a different method in SPSS:
However, I don't have access to SPSS so am wondering if anyone has any suggestions on how to do this analysis in r (or even Graphpad Prism).
Or I could use the diffcorr package to calculate differences for each combination of groups, but then would I need to apply a multiple comparison correction?
Alternatively, Mohr & Marcon 2005 describe a different method using Spearman correlation that seems like it might be more relevant; however, I wonder why their method does not seem to have been used by other researchers. It also looks difficult to implement, so I'm unsure whether it's the right choice.
Any advice would be much appreciated!
Relevant answer
Answer
You wrote: "For context, one variable is weight, and one is a behaviour score. The groups are receiving various treatment and I want to test if weight change influences the behaviour score differently in each group."
I'm not sure this is best tested with a correlation coefficient. This sounds like an interaction hypothesis (or moderation if you prefer). What you need I think is the interaction of weight change by group. This is usually tested by the regression coefficient for the interaction. You can standardize this to scale it similarly to a correlation coefficient (though that's actually best done outside the model for interactions).
You can compare correlations but that isn't necessarily sensible because you risk confounding the effects of interest with changes in SD of the variables across groups (and there seems no rationale for needing that).
A further complication is that including weight change without baseline weight as a covariate might be a poor choice. Even if groups are randomized, including baseline weight may increase the precision of the estimates of the other effects.
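A minimal sketch of that interaction (moderation) model, with hypothetical column names and a random intercept per animal to account for the repeated measures over time:
library(lme4)
fit <- lmer(behaviour_score ~ weight_change * group + baseline_weight +
              (1 | animal_id), data = dat)
summary(fit)
# The weight_change:group coefficients test whether the weight-behaviour slope
# differs between treatment groups.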
  • asked a question related to R
Question
3 answers
Dear community,
I am planning to conduct a GWAS analysis with two groups of patients differing in a binary characteristic. As this cohort is naturally very rare, our sample size is limited to a total of approximately 1500 participants (a low number for GWAS). Therefore, we are thinking of studying associations between pre-selected genes that might be phenotypically relevant to our outcome. As there are no previous data/arrays that studied similar outcomes in a different patient cohort, we need to identify regions of interest bioinformatically.
1) Do you know any tools that might help me harvest genetic information for known pathways involved in relevant cell-functions and allow me to downscale my number of SNPs whilst still preserving the exploratory character of the study design? e.g. overall thrombocyte function, endothelial cell function, immune function etc.
2) Alternatively: are there bioinformatic ways (AI etc.) that circumvent the problem of multiple testing in GWAS studies and would allow me to robustly explore my dataset for associations even at lower sample sizes (n < 1500 participants)?
Thank you very much in advance!
Kind regards,
Michael Eigenschink
Relevant answer
Answer
For the second part of your problem, you can try the vcf2gwas pipeline, which is very easy to run as a Docker image.
  • asked a question related to R
Question
4 answers
I want to get the coordinates from the vertices of a set of singlepart polygons (all in the same vector layer .shp) using R. I would like to have a list with the x and y coordinates per polygon. Do you know how this can be obtained?
Thank you in advance!
Relevant answer
Answer
Nikolaos Tziokas, in the end I read the file as a simple feature (sf) and used the sf package for the purpose of my research. This has been a better solution for what I aimed to do.
Thank you for reminding me
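For reference, a minimal sketch of that sf-based solution (the shapefile path is a placeholder): st_coordinates() returns the vertex coordinates together with indices of the ring and feature each vertex belongs to, which can be used to split the vertices into a per-polygon list.
library(sf)
polys  <- st_read("polygons.shp")            # hypothetical path to the vector layer
coords <- as.data.frame(st_coordinates(polys))
# for single-part POLYGON geometries, column L2 is the feature (polygon) index
xy_per_polygon <- split(coords[, c("X", "Y")], coords$L2)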
  • asked a question related to R
Question
5 answers
What package do you use to visualize, for example, UTRs, exons, introns and the read coverage of a single gene? Or a plasmid map in linear form?
Relevant answer
Answer
Found the solution:
I use the gggenes package to plot the genomic features, plot the mapped reads using plotBed() from the Sushi package, and then combine the plots in a PDF editor.
Unfortunately, I could not combine the plots within R.
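For anyone looking for a starting point with gggenes, a minimal sketch based on its bundled example data (example_genes ships with the package; replace it with your own table of gene/exon coordinates):
library(ggplot2)
library(gggenes)
ggplot(example_genes, aes(xmin = start, xmax = end, y = molecule, fill = gene)) +
  geom_gene_arrow() +
  facet_wrap(~ molecule, scales = "free", ncol = 1) +
  theme_genes()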
  • asked a question related to R
Question
5 answers
Hi PLS Experts,
I am an absolute beginner at using PLS, and I need your help. I have practised using PLS in R as well as in SPSS.
I am interested in using PLS as a predictive model. I have 2 dependent variables (DVs): one is continuous while the other is categorical with 3 levels. However, I am not using both in one model but rather in separate models, as the categorical variable is also an independent variable in model 1 (the model with the continuous DV).
I am confused by the term Y-variance explained (for the DV) and its effect on model performance.
Does a low percentage of Y-variance explained (across all components) mean poor prediction by the model?
I recently applied PLS to standardized data with 1 DV and 14 predictors using R (mdatools package). The cumulative X-variance explained was 100%, but the Y-variance explained was only 29% across all 14 components (the optimal number of components is 3).
I am unable to explain the reason for such poor performance.
A summary of the model is attached in the figure (predictions are in the bottom-right part of the image).
Thank you for you time :)
Best
Sarang
#PLS
Relevant answer
Answer
I am a bit late to answer your question, but let me try to explain it as simply as I can.
The continuous dependent variable (y) is used to develop a predictive (regression) model. As you did, you need a multivariate matrix X of size n x p, where n is the number of samples (rows) and p is the number of features (columns). With PLS regression you obtain the weights (regression coefficients) that maximize the covariance between X and y, and with these coefficients you can predict any future or different dataset.
The categorical dependent variable (y), in contrast, is used to develop a classification model. Here you need the matrix X as well, but you need different algorithms such as logistic regression, as David Eugene Booth explained, or PLS-DA or LDA. The goal is to obtain the weights or coefficients that minimize the classification error, and, as in regression, those coefficients let you classify any future or different dataset.
Now to your figure. Since you are dealing with a regression problem and used PLS regression, the first step is to optimize the number of PLS components (latent variables). This can be done using the bottom-left plot. As you can see, it shows the RMSE, which is an error measure, so lower is better. The plot shows that as the number of components (x-axis) increases, the RMSE decreases for both cv (the red line, cross-validation) and cal (the blue line, calibration). I can see that you or the software selected 3 components (n_comp = 3) based on the other plots, but at component 3 neither curve has reached its lowest error. In your case I would select 7 or 8 components, since there both curves have almost reached their minimum and lie close to each other, meaning the model generalizes better than, for example, at components 5 or 6, where the red curve increases again.
The remaining plots are based on the selected number of PLS components, and 7 components should give you a better predictions plot (bottom-right). The top-left plot can be used to detect outlier samples, and the top-right plot shows the regression coefficients which, as mentioned above, can be used for predicting future datasets.
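If you want to re-select the number of components in mdatools, a rough sketch (X and y stand for your predictor matrix and response; the function names below are from the mdatools package, so please double-check them against its documentation):
library(mdatools)
m <- pls(X, y, ncomp = 10, cv = 1)   # cv = 1 requests full cross-validation
summary(m)
plotRMSE(m)                          # the RMSE-vs-components plot discussed above
m <- selectCompNum(m, 7)             # fix the number of components at 7
plotPredictions(m)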
I hope this explanation makes sense to you and is useful. Do not hesitate to ask if you have any other questions.
All the best
  • asked a question related to R
Question
5 answers
Dear all,
I am currently trying to fit the same models in lavaan (R) and Mplus. However, I keep getting slightly different parameter or goodness-of-fit estimates.
E.g., I fitted this model in lavaan:
Bi_HiTop_3 <- '
HiTOP_ges =~ HiTOP_Bodily_Distress + HiTOP_Conver_Symptoms + HiTOP_Health_Anxiety+
HiTOP_Diseas_Convicti + HiTOP_Somati_Preoccup
HiTOP_Health_Anxiety ~~ HiTOP_Diseas_Convicti
HiTOP_ges~~ 1*HiTOP_ges
'
fit_HiTOP_bi3 <- sem(Bi_HiTop_3, vars, estimator= "MLMV", mimic = "mplus")
In Mplus, this is:
MODEL:
Som_HiTOP BY HiTOP_BD* HiTOP_CS HiTOP_HA HiTOP_DC HiTOP_SP;
Som_HiTOP @1;
HITOP_HA with HITOP_DC;
I get slightly different results on these two regarding both parameter and goodness-of-fit estimates. Only LogLikelihood seems to stay the same.
Has anyone encountered this problem and knows why?
Relevant answer
Answer
I'm not sure about your statement concerning "the variance of a single latent variable". It is neither per se unreasonable to fix a factor variance to 1.0, nor is it per se the case that a factor variance must always be equal to 1.0 anyway. That is, a factor variance can be estimated as a free parameter and can take on values other than 1.0 in the unstandardized solution. In case you decide to estimate a factor variance as a free parameter (this is the default in Mplus), you need to fix one loading on that same factor to a non-zero value for identification. Typically, a fixed value of 1.0 is chosen for ease of interpretation, but you could pick other non-zero values as well for identification. Mplus by default fixes the first loading on the factor to 1.0 and estimates the factor variance as a free (unrestricted/unconstrained) parameter.
If you decided to instead fix the factor variance to a value > 0 for identification (again, typically 1.0 is chosen for ease of interpretation), you would (in almost all cases) not also fix a loading on that same factor. Instead, you would then estimate all loadings as free parameters.
Both parameterizations (fixed variance vs. fixed loading) lead to equivalent solutions. That is, regardless of whether you fix the factor variance or one loading, you should get the same chi-square, df, and also the same completely standardized solution (STDYX in Mplus).
Fixing both (the variance and one loading for the same factor) typically leads to a misspecified (overly restricted) "nonsense" model. This can result in a large chi-square as well as potentially obscure/non-sensical/uninterpretable parameter estimates.
Why would we ever fix a factor variance rather than going with the typical default of fixing one loading per factor? One reason that comes to mind is when the indicators of the given factor are measured on very different scales (have very different metrics). In that case, the variances may differ strongly between indicators, and the model may not converge under the fixed-loading approach to identification. In that case, fixing the factor variance instead can sometimes help resolve the convergence problems.
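To make the two identification strategies concrete, a minimal lavaan sketch with four generic indicators y1-y4 and a data frame dat (both hypothetical); the two models should give identical chi-square, df, and standardized solutions:
library(lavaan)
# (a) fixed-loading (marker variable) approach: first loading fixed to 1,
#     factor variance estimated freely; this is lavaan's and Mplus' default
m_marker <- ' F =~ y1 + y2 + y3 + y4 '
# (b) fixed-variance approach: all loadings free (NA* frees the first one),
#     factor variance fixed to 1
m_fixvar <- ' F =~ NA*y1 + y2 + y3 + y4
              F ~~ 1*F '
fit_a <- cfa(m_marker, data = dat)
fit_b <- cfa(m_fixvar, data = dat)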
  • asked a question related to R
Question
4 answers
I'm working with a multiband raster and I want to extract each band into its own single-band raster. I tried two approaches, using R (raster) and QGIS (gdal_translate).
I noticed that the output file from QGIS is around 25 MB while the output file from R is around 2 MB. The original multiband raster is around 490 MB with 19 bands. This led me to think that the QGIS output is more reasonable to use. Note that I will use the bands for SDM.
Is the R output still usable for this purpose? Can you also explain the difference in file sizes?
Relevant answer
Answer
It may happen because, when creating new images or exporting existing images, the bit depth can change, which changes the amount of disk space the new image requires. The reason is the number of bits required to store each individual pixel (cell) in a raster image: an 8-bit image needs 1 byte (8 bits) per pixel, a 16-bit image needs 2 bytes, a 32-bit image needs 4 bytes, and so on. An easy way to estimate the approximate size of an image is: rows x columns x number of bands x pixel depth in bytes. For example, an 8-bit image of 100 rows x 100 columns x 3 bands x 1 byte gives an output raster of approximately 30,000 bytes, whereas a 32-bit image of 1000 rows x 1000 columns x 1 band x 4 bytes gives approximately 4,000,000 bytes (Esri support).
Please check the bit depth of both of your output files.
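To check and control this in R, a small sketch with the terra package (file names are placeholders); the data type chosen when writing largely determines the file size, and GDAL/QGIS and R may also apply different compression defaults:
library(terra)
r <- rast("multiband.tif")                       # the 19-band raster (placeholder name)
# write one band, explicitly choosing the on-disk data type (here 32-bit float)
writeRaster(r[[1]], "band1.tif", datatype = "FLT4S", overwrite = TRUE)
# a compressed alternative via GDAL creation options
writeRaster(r[[1]], "band1_lzw.tif", gdal = c("COMPRESS=LZW"), overwrite = TRUE)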
  • asked a question related to R
Question
3 answers
Hello
Please help me find the area under the curve (as in the attached figure) in R.
The curve was created using the logspline function on the histogram (density).
Relevant answer
Answer
For a numeric vector y, the function logspline(y) returns a fitted object holding information about the density of y. You can evaluate the density for a given logspline object ("fit") at some specific value x with dlogspline(x, fit). The area under the curve (in the bounds from x=x1 to x=x2) is the integral of this function from x1 to x2, which can be calculated in R with
integrate(f = dlogspline, lower = x1, upper = x2, fit = fit)
Note that this will be approximately 1 when the range x1...x2 spans the range of y, no matter how y is distributed.
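A small self-contained example of this (assuming the logspline package), using simulated normal data so the result can be checked against pnorm():
library(logspline)
set.seed(1)
y   <- rnorm(500)
fit <- logspline(y)
# area under the fitted density between -1 and 1
integrate(f = dlogspline, lower = -1, upper = 1, fit = fit)
# for comparison: pnorm(1) - pnorm(-1) is about 0.683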
  • asked a question related to R
Question
3 answers
Hi everyone!
I am investigating if different variables are associated with the choice of a nest box in a wild bird population.
Briefly, I am comparing the characteristics of multiple nest boxes that were visited by 100 different birds. However, from this set of nest boxes, only 1 is chosen by each bird in the end. I wonder which is the best way to analyze this type of data.
For the moment I have built a logistic regression model in which my response variable is a binary outcome: a nest is chosen (1) or not chosen (0). For this, I built a dataset with the 100 nest boxes that were chosen (y = 1) by each of the birds and another 100 nest boxes that were not chosen by the birds (y = 0) (these 'not chosen' nests are picked randomly from the set of nest boxes visited but not chosen by each bird), coding the ID of the bird as a random factor.
I've repeated this procedure 1000 times and computed the mean estimates. But I am not quite sure this is correct, and I cannot find much information on it. Is there a better procedure to analyze this type of data? Or does anybody know the name of this type of analysis?
Thank you in advance!
Iraida
Relevant answer
Answer
Why not run a logistic model including all the nests? You could code the binary variable "chosen" = factor(0,1) and then include the different variables describing nest-box characteristics. You can later use re-sampling techniques to estimate robust confidence intervals around your model coefficients, but I don't see a reason for not fitting a model with all the data. Perhaps you can explain why you came up with that idea?
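A minimal sketch of that model with all visited nest boxes included (predictor and column names are hypothetical; bird identity enters as a random intercept because each bird contributes several boxes):
library(lme4)
fit <- glmer(chosen ~ box_height + entrance_size + distance_to_cover +
               (1 | bird_id),
             family = binomial, data = nest_visits)
summary(fit)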
  • asked a question related to R
Question
7 answers
I am using R/R-studio to do some analysis on genes and I want to do a GO-term analysis. I currently have 10 separate FASTA files, each file is from a different species. Is there a way in R to use a FASTA file of genes to find enriched terms compared to the whole proteome?
Relevant answer
Answer
Dear Khan, I work with Arabidopsis. Sometimes I download the information from TAIR or elsewhere and build my own files, but I also use org.At.tair.db. I don't know whether I can help you, but you could try the following:
If you work with a non-model species, it may be better to check the GitHub issues of clusterProfiler's developer ( https://github.com/YuLab-SMU/clusterProfiler/issues ). If there are no existing issues matching your question, you can post your question about "org.Sc.eg.db" there; the developer or other users may answer you.
* If you have TERM2GENE and TERM2NAME files, you can use the "enricher" function for ORA, but with this function you get only general results; it is better to use "enrichGO", which gives results with the GO classification (BP, CC, MF). I don't know whether you want to do ORA, GSEA or gseGO, but maybe this information will help you.
I don't know whether TERM2GENE and TERM2NAME files are available for your species from MapMan ( https://mapman.gabipd.org/mapman ); maybe you could read about it.
There are other packages for GO analysis as well.
Maybe you could check which options are available for your species.
Whenever I had questions, I wrote an email to the package author and got an answer.
I hope this helps,
Good luck!
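If TERM2GENE/TERM2NAME tables can be built for your species, a minimal sketch of the custom over-representation analysis mentioned above (object names are placeholders):
library(clusterProfiler)
# term2gene: data frame with columns term ID and gene ID
# term2name: data frame with columns term ID and term description
ora <- enricher(gene      = my_gene_ids,   # character vector of genes of interest
                TERM2GENE = term2gene,
                TERM2NAME = term2name)
head(as.data.frame(ora))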
  • asked a question related to R
Question
1 answer
Hi everyone,
I have to identify overlapping polygons, with one of the datasets containing thousands of polygons. I am using the sf package and its st_intersects function, as:
dataframe1 %>%
st_filter(y = dataframe2, .predicate = st_intersects)
which takes about 6 seconds per polygon of the first dataframe, and therefore days for my current dataframes.
The only way I have found so far to make it feasible is to first remove some polygons manually and then split the dataframe before running the intersection.
Would anyone have advice on how to make this faster?
thanks a lot!
Relevant answer
Answer
Dear Juliette,
Predicate functions can be parallelized. So if you have, let's say, 16 CPU cores, the computation time is divided by roughly 16. I attached the function that can be used for parallelizing sf functions like st_intersects().
HTH,
Ákos
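In case the attachment is not visible, here is one generic way to parallelize this kind of filter (not the attached function, just a sketch using future.apply): split the first layer into chunks and run st_filter() on each chunk in parallel.
library(sf)
library(future.apply)
plan(multisession, workers = 8)   # adjust to the number of available cores
chunks <- split(dataframe1, cut(seq_len(nrow(dataframe1)), 8, labels = FALSE))
filtered <- future_lapply(chunks, function(ch)
  st_filter(ch, y = dataframe2, .predicate = st_intersects))
result <- do.call(rbind, filtered)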
  • asked a question related to R
Question
46 answers
(Matlab, R, Python, ...??)
taking into account the current evolution of artificial intelligence
Relevant answer
Answer
You won't regret choosing Python.
  • asked a question related to R
Question
4 answers
Hello,
I have a variable in a dataset with ~500 answers; it essentially represents participants' answers to an essay question. I am interested in how many words each individual has used in the task and I cannot seem to find a function in R to calculate/count each participant's words in that particular variable.
Is there a way to do this in R? Any packages you think could help me do this? Thank you so much in advance!
Relevant answer
Answer
Thank you so much, Daniel Wright, Jochen Wilhelm, Richard Johansen! Your answers were very helpful. I was able to do it through the string package.
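For anyone with the same question, a minimal sketch (assuming the intended package is stringr and that the essay text sits in a column called essay of a data frame dat):
library(stringr)
dat$word_count <- str_count(dat$essay, "\\S+")   # count runs of non-whitespace
# base-R equivalent, no extra package needed:
# dat$word_count <- lengths(strsplit(trimws(dat$essay), "\\s+"))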
  • asked a question related to R
Question
3 answers
I am doing a meta analysis of proportions looking at the healing rate of eardrums after a terrorist bombing.
I am using the metafor package on R. I have transformed the proportions using logit and used a random effects model to calculate CIs. I have also calculated the heterogeneity using the dersimonian laird procedure.
However, the calculated proportions seem to be incorrect: when a study's proportion is 1, the calculated proportion is <1. Please see the graph to see the problem.
I was wondering if anyone has encountered this and whether it is because I should not be using the logit transformation? Also, my heterogeneity estimate has come out strangely high. I'm sure I'm doing something wrong but not exactly sure what.
Any help would be greatly appreciated; I'm a clinician, not an epidemiologist, so all of this is very new to me. I have attached the code from R too.
Relevant answer
Answer
It seems to me that there are two possible problems.
The first is an error in data transformation; the solution is to use logistic regression directly, which is where the logit transform takes you. The other possibility is rare events, which I suspect is common in your application. The approach here is to use Firth logistic regression. All of this is available in R. Best wishes, David Booth
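One way to implement the "logistic regression directly" suggestion within metafor itself (a sketch; the column names healed and total, meaning events and sample size per study, are placeholders):
library(metafor)
res <- rma.glmm(measure = "PLO", xi = healed, ni = total, data = dat)
summary(res)
# back-transform the pooled estimate from the logit scale to a proportion
transf.ilogit(coef(res))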
  • asked a question related to R
Question
2 answers
Hi all,
I would like to perform a redundancy analysis with a response coded as a distance matrix (dissimilarity in species composition). That can be done using the #dbrda function in the #vegan package for R. However, the problem is that I have two predictor matrices: one with "raw" data (soil variables) and the other a distance matrix calculated from "spatial" distances among sites. The #dbrda function accepts more than one explanatory matrix, but not if one of them is a distance matrix. Does anyone know if there is another function in R able to do this? Perhaps in #ade4 or #phytools?
Relevant answer
Answer
Or did I misunderstand something?
Nevertheless, I completely disagree with the previous comment.
In my opinion, it always makes sense to understand what happens in an analysis and to have the chance to access intermediate steps.
Jan
  • asked a question related to R
Question
4 answers
Dear reachers,
I am studying the "qmap" package in R to perform bias correction (quantile mapping). I have read the help documentation of the "qmap" package, and all examples are based on precipitation data. The code includes a common parameter, "wet.day", which is intended for precipitation data.
What is the difference between the R code for different climatic variables, such as precipitation, temperature, solar radiation, or wind speed?
Relevant answer
Answer
Hello.
You can use any kind of data in qmap. In the fitQmap function, first supply the observed data, then the forecasted data, and then specify the method. Note that the number of observed data columns must equal the number of forecasted data columns.
In the next step, you can test the performance by using the doQmap function, where you pass the forecasted data and then the output from fitQmap. The output of doQmap is the bias-corrected data.
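A minimal sketch for a non-precipitation variable such as temperature (the vector names obs_temp and mod_temp are placeholders; the main practical difference from the precipitation examples is that the precipitation-specific wet.day option should be switched off):
library(qmap)
fit <- fitQmapQUANT(obs = obs_temp, mod = mod_temp, wet.day = FALSE)
corrected <- doQmapQUANT(mod_temp, fit)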
  • asked a question related to R
Question
3 answers
Hi folks,
My colleague has just asked me for advice regarding analyzing his BRUV (baited remote underwater video) dataset. It's a video camera fixed on a structure, used to record marine fishes passing by or attracted to the attached bait. It resulted in a wide dataset of species assemblage (lots of zeroes, lots of species columns). He has generated an NMDS ordination plot and ANOSIM to analyze his species assemblage dataset. He sees a spatial (geographic) separation of species composition through these methods.
Now, he wants to understand what drives this assemblage. He has an additional benthic composition dataset (% coral, sand, rubble, etc.), current strength, depth, and more abiotic data. His coauthor is suggesting fitting envfit vectors onto their NMDS and using the p-values of those vectors. I don't think this is a good idea, but I'm not well versed in this topic, so I couldn't explain it convincingly. I think that because the vectors are "retrofitted" onto the ordination, the p-values are not explanatory toward the species assemblage.
The alternative I could think of is running a PERMANOVA or a model. The problem with the former is that the benthic composition variables are related to each other (7 different variables that add up to 100%), so they are not independent of each other.
I'm wondering if anyone has a solution to this or can add to the explanation. Would it be reliable to run a PERMANOVA? Should he transform his benthic composition dataset first? Or would he be better off creating a model, and if yes, which kind?
Thanks!
Relevant answer
Answer
You are right, doing anything parametric in NMDS space is a silly idea. PERMANOVA won't help; it's asking a similar question to the ANOSIM you've done. To relate community composition to drivers (e.g. explanatory variables, singly or in combination) there are many methods available. If I were using PERMANOVA I'd probably go for DISTLM (distance-based linear modelling), though my preference would be to stick with the non-parametric approach underpinning ANOSIM and use a method such as BIOENV, which searches for subsets of explanatory variables that give the best match with the biotic similarity matrix.
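The BIOENV approach is available in the vegan package as bioenv(); a minimal sketch, where spp stands for a site-by-species abundance matrix and env for a data frame of abiotic/benthic variables (both hypothetical object names):
library(vegan)
bio <- bioenv(comm = spp, env = env, method = "spearman", index = "bray")
summary(bio)   # best subsets of explanatory variables and their rank correlations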
  • asked a question related to R
Question
5 answers
I want to upscale (make the pixel size larger) a raster from 15m to 460m spatial resolution, using a Gaussian filter.
The goal
I have a coarse image that I want to downscale. I also have a fine-resolution band to assist the downscaling. The downscaling method I am using is called geographically weighted area-to-point regression kriging (GWATPRK). The method consists of two steps:
  1. GWR and,
  2. ATPK on the GWR's residuals.
In order to perform GWR using raster data, the rasters need to have the same pixel size. This means that my fine-resolution image needs to be upscaled to match the spatial resolution of the coarse band. This upscaling of the fine band needs to be done using a Gaussian kernel (i.e., the PSF). Is there a package in R that can help me do that? Or can anyone help write a custom function?
Relevant answer
Answer
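In case it helps as a starting point, one rough way to approximate a Gaussian PSF when upscaling (a sketch with the terra package; the file names and the kernel sigma are placeholders and this is not a full PSF treatment): smooth the fine band with a Gaussian focal kernel and then resample it onto the coarse grid.
library(terra)
fine   <- rast("fine_15m.tif")      # hypothetical file names
coarse <- rast("coarse_460m.tif")
# Gaussian weights; d is the sigma in map units (an assumption; tune to your PSF)
w <- focalMat(fine, d = 230, type = "Gauss")
smoothed <- focal(fine, w = w, fun = sum, na.rm = TRUE)   # weighted mean if the weights sum to 1
upscaled <- resample(smoothed, coarse, method = "bilinear")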
  • asked a question related to R
Question
4 answers
I have performed a zero-one-inflated beta (ZOIB) regression with a logit link in R. The conditional effects can be extracted, which is useful for plotting given that it returns the intervals. Yet, when I calculate backwards from the returned values to the expected value (ZOIB and B [without ZOI]) and compare this with the expected values the brms function conditional_effects returns, the trend lines do not match. Also, these manually derived values match the data much better; I've been scratching my head for a few days now. The question is: why don't the expected values match, and where do I go awry?
Thank you in advance
------------------
In the appendix the script, data and model
Relevant answer
Answer
Just to come back here: when using the brms function conditional_effects() on a ZOIB model, the function uses the full model (as expected), which also includes the zoi and coi components, i.e. E(Y) = ZOI * COI + mu * (1 - ZOI), where mu is expressed as mu = exp(B0 + B1 * X) / (1 + exp(B0 + B1 * X)). When setting conditional_effects(model, dpar = "mu"), the same results as for the green line are obtained.
The underfit observed when using the conditional_effects() function is due to the high number of 0s compared to the low number of 1s; it was suggested to fit the ZOI and COI components in the model individually. This was all solved by user henningte on the Stan forum.
Just to correct my mistake with the red line: its equation was wrongly written as E(Y) = exp(B0 + B1 * X * ZOI * COI) / (1 + exp(B0 + B1 * X + ZOI + COI)).
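For reference, the two calls compared above (a sketch assuming a brmsfit object called fit from a zero_one_inflated_beta() model):
library(brms)
ce_full <- conditional_effects(fit)               # expectation of the full ZOIB model
ce_mu   <- conditional_effects(fit, dpar = "mu")  # the beta-mean (mu) component only
plot(ce_mu)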