R - Science topic
Explore the latest questions and answers in R, and find R experts.
Questions related to R
I am trying to estimate an error correction model (ECM) with panel data and have run into the following problem: when I run the ecm command from the ecm package, an error occurs saying "non-numeric matrix extent". I found out that the cause lies in the creation of the panel data. I tried two different approaches to fix the problem.
First, I created the panel dataset with the pdata.frame command from the plm package.
p.dfA <- pdata.frame(data, index = c("id", "t"))
where “index” indicates the individual and time indexes for the panel dataset. The command converts the id and t variables into factor variables which later leads to the error in the ecm command.
Secondly, I created the panel dataset with the panel_data command from the panelr package.
p.dfB <- panel_data(data, id = "id", wave = "t")
where “id” is the name of the column (unquoted) that identifies participants/entities. A new column will be created called id, overwriting any column that already has that name. “wave” is the name of the column (unquoted) that identifies waves or periods. A new column will be created called wave, overwriting any column that already has that name.
This panel_data command also converts the id variable into a factor variable. So, the same error occurs in the ecm command.
If I convert the factor variables back into numeric variables, I lose the panel structure of the dataset.
Could someone please explain to me how to run an ECM with panel data in R?
Thank you very much in advance!
Sample R Script attached
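A minimal sketch of one possible workaround, assuming the error stems from the factor-typed index columns: keep the pdata.frame for the panel structure, and hand ecm() a plain data.frame copy whose id and t are numeric again.
library(plm)
p.dfA <- pdata.frame(data, index = c("id", "t"))   # panel structure kept for plm work
df.num <- as.data.frame(p.dfA)                     # plain copy to pass to ecm()
df.num$id <- as.numeric(as.character(df.num$id))   # factor -> numeric, values preserved
df.num$t  <- as.numeric(as.character(df.num$t))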
Hello all,
I am running into a problem I have not encountered before with my mediation analyses. I am running a simple mediation X > M > Y in R.
Generally, I concur that the total effect does not have to be significant for there to be a mediation effect, and in the case I am describing this would be a logical occurrence, since the effects of paths a and b are both significant and are respectively -.142 and .140, thus resulting in a 'null effect' for the total effect.
However, my c path (X > Y) is not simply 'non-significant' as I would expect; rather, the regression does not fit (see below):
(Residual standard error: 0.281 on 196 degrees of freedom
Multiple R-squared: 0.005521, Adjusted R-squared: 0.0004468
F-statistic: 1.088 on 1 and 196 DF, p-value: 0.2982).
Usually I would say you cannot interpret models that do not fit, and since this path is part of my model, I hesitate to interpret the mediation at all. However, the other paths do fit and are significant. Could the non-fitting also be a result of the paths cancelling one another?
Note: I am running bootstrapped results for the indirect effects, but the code does utilize the 'total effect' path, which does not fit on its own, therefore I am concerned.
Note 2: I am working with a clinical sample, therefore the sample size is not as large as I'd like: group 1: 119; group 2: 79 (N = 198).
Please let me know if additional information is needed and thank you in advance!
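For concreteness, a minimal sketch of a bootstrapped simple mediation with the 'mediation' package, using hypothetical variable names X, M, Y:
library(mediation)
med.fit <- lm(M ~ X, data = dat)
out.fit <- lm(Y ~ X + M, data = dat)
med <- mediate(med.fit, out.fit, treat = "X", mediator = "M",
               boot = TRUE, sims = 5000)
summary(med)   # reports ACME (indirect), ADE (direct) and the total effect with CIs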
Hi,
I made metagenomic predictions from 16S data using Tax4Fun and processed the results using STAMP. However, I want to know whether phyloseq is a suitable R package for doing this. Can someone give me their opinion about that, or recommend another R package for this kind of analysis?
Thanks in advance,
Carolina.
In the domain of clinical research, where the stakes are as high as the complexities of the data, a new statistical aid emerges: bayer: https://github.com/cccnrc/bayer
This R package is not just an advancement in analytics - it’s a revolution in how researchers can approach data, infer significance, and derive conclusions.
What Makes `bayer` Stand Out?
At its heart, bayer is about making Bayesian analysis robust yet accessible. Born from the powerful synergy with the wonderful brms::brm() function, it simplifies the complex, making the potent Bayesian methods a tool for every researcher’s arsenal.
Streamlined Workflow
bayer offers a seamless experience, from model specification to result interpretation, ensuring that researchers can focus on the science, not the syntax.
Rich Visual Insights
Understanding the impact of variables is no longer a trudge through tables. bayer brings you rich visualizations, like the one above, providing a clear and intuitive understanding of posterior distributions and trace plots.
Big Insights
Clinical trials, especially in rare diseases, often grapple with small sample sizes. `bayer` rises to the challenge, effectively leveraging prior knowledge to bring out the significance that other methods miss.
Prior Knowledge as a Pillar
Every study builds on the shoulders of giants. `bayer` respects this, allowing the integration of existing expertise and findings to refine models and enhance the precision of predictions.
From Zero to Bayesian Hero
The bayer package ensures that installation and application are as straightforward as possible. With just a few lines of R code, you’re on your way from data to decision:
# Installation
devtools::install_github("cccnrc/bayer")

# Example usage: Bayesian logistic regression
library(bayer)
model_logistic <- bayer_logistic(
  data = mtcars,
  outcome = 'am',
  covariates = c('mpg', 'cyl', 'vs', 'carb')
)
You then have plenty of functions to further analyze your model; take a look at bayer.
Analytics with An Edge
bayer isn’t just a tool; it’s your research partner. It opens the door to advanced analyses like IPTW, ensuring that the effects you measure are the effects that matter. With bayer, your insights are no longer just a hypothesis — they’re a narrative grounded in data and powered by Bayesian precision.
Join the Brigade
bayer is open-source and community-driven. Whether you’re contributing code, documentation, or discussions, your insights are invaluable. Together, we can push the boundaries of what’s possible in clinical research.
Try bayer Now
Embark on your journey to clearer, more accurate Bayesian analysis. Install `bayer`, explore its capabilities, and join a growing community dedicated to the advancement of clinical research.
bayer is more than a package — it’s a promise that every researcher can harness the full potential of their data.
Explore bayer today and transform your data into decisions that drive the future of clinical research: bayer - https://github.com/cccnrc/bayer
I would like to use the "create.matrix" function (R package "fossil") to create a matrix of taxa, localities, time and abundance.
In the function, time is specified by the arguments "time.col" (= column name or number containing the time periods) and "time" (= which time periods to keep for the matrix). In my data table, time is specified as a calendar date (d.m.y: e.g. 31.12.2022).
How should time ("time periods") be specified or formatted in the data table to be used by the create.matrix function?
How should time periods be specified in the "time" argument?
I want to create a species by (locality by date) matrix (a part of the data is attached). Some localities were sampled only once, some several times. As "time period" in the argument "time" I want to use the date of sampling (column "date" in the dataset), but I do not know how to specify it in the code below ("????"):
Function:
df <- create.matrix(dat, tax.name = "taxon_abbrev",
                    locality = "locality_ID",
                    time.col = "date",
                    time = "????",
                    abund.col = "num",
                    abund = TRUE)
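A hedged guess at the missing piece, assuming the "time" argument takes the set of time-period values to keep (matched against time.col): convert the d.m.y strings to Date and pass the sampling dates themselves.
library(fossil)
dat$date <- as.Date(dat$date, format = "%d.%m.%Y")
df <- create.matrix(dat, tax.name = "taxon_abbrev",
                    locality = "locality_ID",
                    time.col = "date",
                    time = unique(dat$date),   # keep all sampling dates
                    abund.col = "num",
                    abund = TRUE)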
Software procedures:
Run a one-way ANOVA analysis of the data with R software.
Find the "P" and "F" values in a one-way ANOVA analysis of experimental data.
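A minimal base-R sketch, assuming a data frame dat with a numeric response y and a grouping factor treatment:
fit <- aov(y ~ treatment, data = dat)
summary(fit)   # the table reports the F value and Pr(>F), i.e. the p-value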
I want to estimate the half-life value for the virus as a function of strain and concentration, and as a continuous function of temperature.
Could anybody tell me how to calculate the half-life value in R?
I have attached a CSV file of the data
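A hedged sketch, assuming first-order decay so that log titer declines linearly with time (column names titer, time, strain, conc are hypothetical):
virus <- read.csv("halflife.csv")
fit <- lm(log(titer) ~ time, data = subset(virus, strain == "A" & conc == 1))
k <- coef(fit)["time"]            # decay rate for that condition
half_life <- log(2) / abs(k)      # temperature can enter via log(titer) ~ time * temp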
I am trying to implement a coarse segmentation step in my routine R pipeline for satellite image classification. (It could help generate learning polygons for OBIA.) Is there an R function to apply segmentation to a spectral band?
Many thanks,
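Not a true object-based segmentation, but a hedged sketch of one coarse option: k-means clustering of a spectral band with terra, which can seed learning polygons for OBIA.
library(terra)
r <- rast("band.tif")                       # hypothetical band file
v <- values(r)
ok <- complete.cases(v)
km <- kmeans(v[ok, , drop = FALSE], centers = 5)
seg_vals <- rep(NA_real_, ncell(r))
seg_vals[ok] <- km$cluster
seg <- setValues(r, seg_vals)
plot(seg)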
Hi there, I am currently struggling with running analysis on my data set due to missing data caused by the research design.
The IV is binary -- 0 = don't use, 1 = use
Participants were asked to rate 2 selection procedures they use and 2 they don't use, and were provided with 5 options. So, for every participant there are ratings for 4 of the 5 procedures.
Previous studies used a multilevel logistic regression, and analysed the data in HLM as it can cope with missing data.
Would R be able to handle this kind of missing data? I currently only have access to SPSS and R. Or is there a particular way to handle this kind of missing data?
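A hedged sketch: a mixed-effects logistic regression in lme4 copes with this planned missingness, as long as the unrated procedures are simply absent rows in long-format data (all names hypothetical):
library(lme4)
m <- glmer(use ~ rating + procedure + (1 | participant),
           data = long_dat, family = binomial)
summary(m)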
Latin hypercube design
Global and local optimization
4D color maps
How do we assess the global behavior of a model, obtain the equilibrium points, and analyze their stability through simulation series?
Which is the most accurate/useful command in R for performing a necessity test: pofind() (with the option "nec") or superSubset()?
Which is the most common?
I understand that pofind() tests isolated ("independent") conditions and their negations; while superSubset() seeks configurations (i.e. combinations of conditions)....
I used to use superSubset()... but, perhaps for the two-step QCA process and for an ESA analysis... pofind() might be more useful...
Any insight?
Hi all,
I'm currently working on a logistic regression model in which I've included year as a random variable, so in the end I am working with a Generalized Linear Mixed Model (GLMM). I've built the model, I got an output, and I've checked the residuals with a very handy package called 'DHARMa'; everything is OK.
But looking for bibliography and documentation on GLMMs, I found out that a good practice for evaluating logistic regression models is the k-fold cross-validation (CV). I would like to perform the CV on my model to check how good it is, but I can't find a way to implement it in a GLMM. Everything I found is oriented for GLM only.
Anyone could help me? I would be very thankful!!
Iraida
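A hedged sketch of a hand-rolled k-fold CV for a binomial GLMM, assuming an lme4 model of the form y ~ x + (1 | year); names are hypothetical.
library(lme4)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))
acc <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]
  test <- dat[folds == i, ]
  m <- glmer(y ~ x + (1 | year), data = train, family = binomial)
  p <- predict(m, newdata = test, type = "response", allow.new.levels = TRUE)
  acc[i] <- mean((p > 0.5) == test$y)   # fold-wise classification accuracy
}
mean(acc)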
Dear Colleagues,
I'm looking for a solution for fitting multivariate multiple regression with mixed effects in R.
In my case I have multiple dependent and independent variables with a hierarchical structure: samples are nested inside treatment groups A/B, and treatments are nested in sites.
As far as I know, the lmer and glmmTMB packages work well with mixed effects but don't accept multiple response variables.
I am also interested in Bayesian modeling, but it seems that the bnsp and rstanarm packages cannot do both.
Can you recommend a solution to this?
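A hedged Bayesian sketch with brms, which does accept multiple responses; variable names are hypothetical, with the nesting written as site/treatment:
library(brms)
f <- bf(mvbind(y1, y2) ~ x1 + x2 + (1 | site/treatment)) + set_rescor(TRUE)
fit <- brm(f, data = dat, chains = 4, cores = 4)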
I have two satellite images, named reference.tif and test.tif, and I'd like to calculate the similarity between the rasters using Spectral Angle Mapper (SAM) in R. From my research I couldn't find a package that performs SAM using raster data. Is it possible to do it in R or should I move to an other software?
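A hedged sketch: the per-pixel spectral angle can be computed by hand with terra, assuming the two rasters share the same grid and band order.
library(terra)
ref  <- rast("reference.tif")
test <- rast("test.tif")
dotp  <- sum(ref * test)               # per-pixel dot product across bands
nref  <- sqrt(sum(ref^2))              # per-pixel band-vector norms
ntest <- sqrt(sum(test^2))
sam <- acos(dotp / (nref * ntest))     # radians; 0 = identical spectra
plot(sam)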
Hi there,
I am trying to use the variogramST function in R for spatio-temporal kriging.
The values I am working with are monthly, while in variogramST the predefined "tunit" options are hours, days, or weeks.
I would appreciate it if anyone could tell me how to change it to months.
Thanks,
Hamideh
I am new to spatial transcriptomics and am exploring data sets. Can someone walk me through what's contained in the R object (.Robj) files that I have?
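A minimal base-R sketch for a first look inside such files: load() restores the saved objects and returns their names.
objs <- load("data.Robj", verbose = TRUE)   # hypothetical file name
objs                                        # names of the restored objects
str(get(objs[1]), max.level = 2)            # inspect the first one's structure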
I've tried many packages, but the main issue arises when you want to introduce additional variables into both the mean and the variance equations; for example, a news-based variable in the variance. Any help is welcome.
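A hedged sketch with rugarch, where both equations accept an external.regressors matrix (the news variable and column names are hypothetical):
library(rugarch)
news <- matrix(dat$news_index, ncol = 1)
spec <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1),
                        external.regressors = news),
  mean.model = list(armaOrder = c(1, 0),
                    external.regressors = news)
)
fit <- ugarchfit(spec, data = dat$returns)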
Hello, I am looking for an advice regarding my experimental design and what statistics I should do.
My experimental design:
I have 3 cell lines (A, B, C) and from each I do 3-4 differentiations into cardiac myocytes (A1-A4, B1-B4, C1-C3). Then each differentiation is treated with the same 3 concentrations of a chemical (T1, T2, T3). And for each treatment, I measure the calcium concentration (y). So I have one continuous dependent variable (y), two categorical independent variables (x1 for cell line and x2 for treatment) and a random error which I want to correct for (e for differentiation).
I want to investigate the impact of the cell line (x1, main effect) on the cell response to the treatment (x2).
- I understand that e is nested under x1, but what about x2?
- I am then not sure how to translate this into a correct formula for the aov() function in R. I am tempted to use aov(y ~ x1*x2 + Error(e)), but this does not account for the fact that e is nested under x1. Does it matter?
- I don't understand how to interpret the different p-values; which one should I look at to answer my research question?
- Can I run a normal Tukey test for post-hoc multiple comparisons by pooling the differentiations (getting rid of the error source), or is there a way to correct for it without including it as another variable?
Many thanks in advance !
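A hedged sketch of one way to handle all three questions with lme4 instead of aov(); because each differentiation label (A1...C3) is unique, (1 | e) already encodes differentiation-within-cell-line.
library(lmerTest)                 # lmer() with p-values in anova()
m <- lmer(y ~ x1 * x2 + (1 | e), data = dat)
anova(m)                          # the x1:x2 row tests whether treatment response differs by cell line
library(emmeans)
emmeans(m, pairwise ~ x2 | x1)    # post hoc without pooling the differentiations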
I need to extract the x,y coordinates of a PCA plot (generated in R) to plot in Excel (my boss prefers Excel).
The code to generate the PCA:
pca <- prcomp(data, scale=T, center=T)
autoplot(pca, label=T)
If we take a look at pca$x, the first two PC scores for an example point are as follows:
29. 3.969599e+01 6.311406e+01
So for sample 29, the PC scores are 39.69599 and 63.11406.
However, if you look at the output plot in R, the coordinates are not 39.69599 and 63.11406 but ~0.09 and ~0.2.
Obviously some simple algebra can estimate how the PC scores are converted into the plotted coordinates but I can't do this for ~80 samples.
Can someone please shed some light on how R gets these coordinates and maybe a location to a mystery coordinate file or a simple command to generate a plotted data matrix?
NOTE: pca$x does not give me what I want
Update:
Redoing prcomp() without scale and center gives me this for PC1 and PC2 for the first 5 samples
1 -8.9825883 0.0113775
2 -16.3018548 9.1766104
3 -21.0626458 3.0629666
4 5.5305875 4.0334291
5 0.2349433 12.4872609
However the plot ranges from -0.15 to 0.4 for PC1 and -0.35 to 0.15 for PC2
(Plot attached)
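A hedged explanation with a sketch: ggfortify's autoplot() appears to rescale scores the way stats::biplot() does, dividing each PC by sdev * sqrt(n); under that assumption the plotted coordinates can be reproduced and exported.
n <- nrow(pca$x)
lam <- pca$sdev * sqrt(n)
plotted <- sweep(pca$x, 2, lam, "/")   # candidate plotted coordinates
head(plotted[, 1:2])
write.csv(plotted[, 1:2], "pca_plot_coords.csv")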
Hi everyone, I am doing a meta-analysis of mediation using structural equation modeling in R (the package I will use is "metaSEM"). May I ask if anybody has experience with this type of analysis? I have found a guide to follow, but I do not know how to import data in the correct format for such an analysis... I would highly appreciate any advice you could give me!
here is the link to the guide: https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/mediation.html
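A hedged sketch of the input format metaSEM's two-stage approach expects: one correlation matrix per study (NA where a path was not measured) plus a vector of sample sizes; the numbers below are placeholders.
library(metaSEM)
vn <- c("X", "M", "Y")
cor1 <- matrix(c(1, .30, .20, .30, 1, .40, .20, .40, 1), 3, 3, dimnames = list(vn, vn))
cor2 <- matrix(c(1, .25, NA, .25, 1, .35, NA, .35, 1), 3, 3, dimnames = list(vn, vn))
stage1 <- tssem1(list(cor1, cor2), n = c(120, 95), method = "REM")
summary(stage1)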
I could not find any function in Rchoice that calculates average marginal effects (AME) of the significant variables on the binary outcome for random parameters logit model, only found for heteroskedastic binary models and IV probit models > effect(). Same with 'marginaleffects' (https://marginaleffects.com).
I also tried to incorporate 'margins' (https://cran.r-project.org/web/packages/margins/index.html) but got this error:
Error in x$terms %||% attr(x, "terms") %||% stop("no terms component nor attribute") :
no terms component nor attribute
My dataset contains:
11 independent variables (10 dummy, 1 continuous)
3 IVs are assumed to be random parameters.
summary(), modelsummary() all worked on my model.
Any suggestion is really appreciated.
I am yet to finish writing the discussion part of my Master's thesis due to this issue (just to let you know how important it is for me to find a solution).
Thanks in advance.
Good morning everyone! I've just finished reading Shyon Baumann's paper "Intellectualization and Art World Development: Film in the United States." This excellent paper includes a substantial section of textual analysis in which various film reviews are examined. These reviews are considered a fundamental space for the artistic legitimation of films, which, during the 1960s, increasingly gained artistic value. To achieve this, Baumann focuses on two dimensions: critical devices and lexical enrichment. The paper is a bit dated, and the methodologies used can be traced back to a time when text analysis tools were not as widespread or advanced as they are today. The question is: are you aware of literature/methodologies that could provide insights to extend Baumann's work using modern text analysis technologies?
In particular, following the dimensions analyzed by Baumann:
a) CHANGING LANGUAGE
- Techniques for the formation of artistic dictionaries that can replace the manual construction of dictionaries for artistic vocabulary (Baumann reviews a series of artistic writings and extracts terms, which are then searched in film reviews). Is it possible to do this automatically?
b) CHANGING CRITICAL DEVICES
- Positive and negative commentary -> I believe tools capable of performing sentiment analysis can be successfully applied to this dimension (see the sketch at the end of this post). Are you aware of any similar work?
- Director is named -> forming a giant dictionary of directors might work. But what about the rest of the crew who worked on the film? Is there a way to automate the collection of information on people involved in films?
- Comparison of directors -> Once point 2, which is more feasible, is done, how to recognize when specific individuals are being discussed? Does any tool exist?
- Comparison of films -> Similar to point 3.
- Film is interpreted -> How to understand when a film is being interpreted? What dimensions of the text could provide information in this regard? The problem is similar for all the following dimensions:
- Merit in failure
- Art vs. entertainment
- Too easy to enjoy
Expanding methods in the direction of automation would allow observing changes in larger samples of textual sources, deepening our understanding of certain historical events. The data could go more in-depth, providing a significant advantage for those who want to view certain artistic phenomena in the context of collective action.
Thank you in advance!
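For the sentiment-analysis idea in b), a hedged sketch with tidytext and the Bing lexicon, on a toy data frame of review texts:
library(dplyr)
library(tidytext)
reviews <- tibble(id = 1:2,
                  text = c("a luminous, masterful film", "dull and clumsy direction"))
reviews %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(id, sentiment)   # positive/negative word counts per review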
I have a large dataset containing many dates per participant. I want to create a new variable containing the number of unique dates per participant. I have been trying to figure this out, but it does not work. An example of what the dataset looks like is shown below:
participant 1 - date.1 - date.2 - date.3 - date.4 ... date.412
participant 2 - date.1 - date.2 - date.3 - date.4 ... date.412
Dates are in the format: 31-oct-20
Some participants have only 10 dates, others have 412 dates
Any idea how I can create this new variable containing the number of unique dates?
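A minimal base-R sketch, assuming wide data with columns date.1 ... date.412:
date_cols <- grep("^date\\.", names(dat))
dat$n_unique_dates <- apply(dat[date_cols], 1,
                            function(r) length(unique(na.omit(r))))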
For our database of odor responses (http://neuro.uni.kn/DoOR) I received data that uses the different chemical identifiers named above. I am looking for an easy way to look up/convert these into one another.
I already found the ChemCell macro (http://depth-first.com/articles/2010/11/01/chemcell-easily-convert-names-and-cas-numbers-to-chemical-structures-in-excel/) but would be happy to have some R code doing the work.
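A hedged sketch with the webchem package, which wraps several identifier-resolution services (exact arguments may differ across versions):
library(webchem)
get_cid("aspirin", from = "name")                 # PubChem CID from a compound name
cir_query("50-78-2", representation = "smiles")   # structure from a CAS number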
I just received the latest TOC alert for Behavior Research Methods, and this article caught my eye:
I've not had time to read it yet, but judging from a quick glance, I wonder if the main "problem" might be that users do not always take time to RTFM* and therefore, do not understand what their software is doing? In any case, I thought some members of this forum might be interested.
Cheers,
Bruce
* RTFM = Read The Fine Manual ;-)
I want to open a dataset of gene-trait associations using:
data = read.table(file.choose(), header = T)
attach(data)
View(data)
But instead of column name (510 traits), these headers are shown:
na, na.1, na.2, ..., na.509
Please find the attached data (ethical point: this data is free access from https://maayanlab.cloud/Harmonizome/dataset/dbGAP+Gene-Trait+Associations)
Thanks
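A hedged workaround sketch, assuming a tab-separated file whose header row is being mangled on import: read the header line separately and attach it.
f <- file.choose()
hdr <- scan(f, what = "", nlines = 1, sep = "\t", quote = "\"")
data <- read.table(f, header = FALSE, skip = 1, sep = "\t")
names(data) <- make.names(hdr, unique = TRUE)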
Belkin and O'Reilly described in 2008 in "An algorithm for oceanic front detection in chlorophyll and SST satellite imagery" how to detect oceanographic fronts based on satellite imagery.
I also found the reference to the "NOAA CoastWatch Chlorophyll Frontal Product from MODIS/Aqua". But as I understood, this product is only available for the United States area.
I am looking for a product, e.g. as netcdf file from Copernicus, NASA etc. which offers already detected chlorophyll fronts for the area of the Patagonian Shelf, but couldn't find one so far.
If there is none, is there a feasible way to program the algorithm myself (e.g. in R), using processed (L4) chlorophyll data obtained for the region?
Many thanks for any suggestions.
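If no ready-made product turns up, a hedged and much-simplified sketch of a gradient-based front proxy (not the full Belkin-O'Reilly histogram algorithm): Sobel gradient magnitude on log-chlorophyll, with high-gradient pixels flagged as candidate fronts.
library(terra)
chl <- rast("chl_L4.nc")                          # hypothetical Copernicus L4 file
lg <- log10(chl)
sx <- matrix(c(-1, 0, 1, -2, 0, 2, -1, 0, 1), nrow = 3)
gx <- focal(lg, w = sx, fun = "sum", na.rm = TRUE)
gy <- focal(lg, w = t(sx), fun = "sum", na.rm = TRUE)
grad <- sqrt(gx^2 + gy^2)
q95 <- quantile(values(grad), 0.95, na.rm = TRUE)
fronts <- grad > q95                              # candidate front pixels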
Hi! I have a dataset with a list of sponge (Porifera) species and the number of specimens found for each species at three different sites (a sample of my dataset is attached). Which test should I use to compare the three sites, showing both which species were found at each site and their abundance? I was also thinking of a visual representation showing just the difference between sites in terms of species diversity (and not abundance), so that it is possible to see which species occur at only one site and which are shared. For this last purpose I thought about doing an MDS, but I am not sure whether it is the right test, how to do it in R, or how to set up the dataset. Can you help me find a script that also shows the shape of the dataset? Any advice in general would be great, thank you!
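A hedged sketch with vegan, assuming a sites-by-species abundance matrix comm: an NMDS on Bray-Curtis distances for the abundance view, and a presence/absence transform for the shared-vs-unique view.
library(vegan)
nmds <- metaMDS(comm, distance = "bray", k = 2)
plot(nmds, type = "t")                   # site and species labels
pa <- decostand(comm, method = "pa")     # 0/1 matrix: which species occur where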
I want to know if the type of litter found between sites is significantly different. I have 7 stations with polymer IDs for litter from each station. Can I compare two categorical variables (stations, n=7; polymer type, n=11)?
Can anyone share some advice on what stats to use, and any code in R?
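A minimal sketch: cross-tabulate station by polymer type, then test the association with a chi-squared test (simulated p-value, since many cells will be sparse); column names are hypothetical.
tab <- table(litter$station, litter$polymer)
chisq.test(tab, simulate.p.value = TRUE, B = 10000)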
I have a question about what statistical tests I can do in R that would be most suited to analyse my enzymatic assay results.
My data
Dependent variable: Absorbance (it's a colorimetric assay)
Factor 1: Protein (type of protein, wt and several mutants)
Factor 2: Substrate (several substrates)
For each assay (e.g. Prot1 with Substr1) I have 3-6 data points (repeats), and it is not feasible to obtain more.
An example of my data would be something like fig1.
What I want to do:
1. Test if the different mutations have a different effect on the enzymes’ activity with the different substrates (basically test the dependence of Absorbance on the interaction between Protein and Substrate: Abs~Protein*Substrate)
Visually, if I plot my data on a bar chart (mean +/-SD), it appears to be the case, but I need to verify that what I see is significant.
Normally, I would do this with a two-way ANOVA, however:
my data is not normally distributed (according to Q-Q plot and skewness test, I have over-dispersed residuals (Laplace distribution), without any skew);
the variance is not homogenous (standardized residuals vs fitted values plot shows heteroskedasticity)
What sort of model could I use instead? Is there a way I can transform my data to allow a parametric test (the only transformations I found were against skewness, which I do not have)?
2. Test for each substrate, which mutations make the activity differ significantly from the wt (e.g. whether activity with Substr2 is significantly different for Prot2 and 3 from that for Prot1)
To avoid doing multiple pair-wise comparisons, I would normally do a pairwise.t.test with a Holm family-wise error rate correction.
For non-normally distributed data of non-equal variance, I saw it is recommended to do a Pairwise Wilcoxon rank sum test.
If I group my data by substrate, with one of the substrates (e.g. Substr2) it was normally distributed and of equal variance. I did the pairwise T test and it gave results consistent with what is seen on the graph. However, when I tried a Pairwise Wilcoxon rank sum test (holm correction) it showed no significant difference between any of the Proteins, which makes no sense (e.g. that there is no significant difference between Prot1 and 2, although one has an activity with the substrate and the other doesn’t). So it looks like the non-parametric test may not be powerful enough/ at all useful.
For the other substrates, either the distribution is not normal (again over-dispersed, Laplace distribution, with no skew), or the distribution is normal, but the variance is not equal (heteroskedasticity).
Just to note, one-way ANOVA or Kruskal-Wallis rank sum tests (where applicable) for Absorbance~Protein for each substrate individually, showed there is a significant variation in Absorbance depending on Protein.
What sort of pairwise comparison tests can I do in these cases? Or, again, how can I transform my data to be able to use a pairwise.t.test?
Thank you in advance!
I really appreciate any advice!
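One nonparametric route for the interaction question, sketched with the ARTool package (aligned rank transform), which supports factorial designs where two-way ANOVA assumptions fail; the data frame name is hypothetical.
library(ARTool)
m <- art(Absorbance ~ Protein * Substrate, data = assay)
anova(m)                                            # main effects and the interaction
art.con(m, "Protein:Substrate", adjust = "holm")    # pairwise follow-ups on aligned ranks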
I have an unusual question: I am working on an Erasmus internship project with Drosophila mutants at 2 different timepoints and with WT, KO, and KI conditions. A company analyzed the data using DESeq2, and I have only got loads of PDFs and the results_apeglm.xlsx file.
This contains: Transcripts per million for each gene, replicate and timepoint with the comparison for looking at DEGs - so I have a padj and log2FC value. A snippet is attached as an example.
I now want to construct a graph and clustering in which genes that change in different directions between WT and KO over time become visible out of the hundreds of candidate DEGs. With this I want to narrow down the long list to make it verifiable with qPCR and serve as a marker for the transformation from presymptomatic to symptomatic.
I am setting up my analysis in R and want to use the degPatterns() function from DEGReport, as it gives a nice visual output and clusters the genes for me.
How can I now transform my Excel sheets, to a matrix format that I can use with degPatterns()? The example with the Summarized Experiment given in the vignette is not really helpful to me, sadly.
Thank you all for reading, pondering and helping with my question! I would be very happy if there's a way to solve my data wrangling issue.
All the best,
Paul
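A hedged sketch of the wrangling step: degPatterns() wants a genes-by-samples matrix plus a metadata data frame whose rows match the matrix columns (the sheet layout and column names below are hypothetical).
library(readxl)
library(DEGreport)
x <- read_excel("results_apeglm.xlsx")
ma <- as.matrix(x[, grep("^TPM_", names(x))])     # the per-sample TPM columns
rownames(ma) <- x$gene_id
meta <- data.frame(row.names = colnames(ma),
                   genotype  = sub("_.*", "", colnames(ma)),
                   timepoint = sub(".*_", "", colnames(ma)))
res <- degPatterns(log2(ma + 1), metadata = meta, time = "timepoint", col = "genotype")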
Hi everyone,
I would like your opinion on using the complete() function of the mice package on R.
Theoretically, after multiple imputations, analyses should be performed on each imputed dataset, and then the results of the analyses should be pooled (see attached diagram).
However, I am considering the complete() function. In summary, it permits generating a unique final dataset from the results of the multiple imputations previously performed with the mice() function. This strategy is easy, "inexpensive," and allows us to manipulate only one dataset.
This is a concrete example of the usefulness of this strategy. I am conducting mixed-methods research in which I want to interview some participants after analyzing their responses to my survey. If my respondent John Doe did not answer an item of a scale, I would risk having 5 plausible answers from John Doe after multiple imputation (if m=5; or 20 plausible responses if m=20, etc.). However, the complete() function will summarize the different estimates into one dataset (instead of 5, or 20, etc.). Basically, during an interview, I will be able to question John Doe based on his scale score computed with NA replacement. So I lose precision but gain the ability to exploit the answers.
However, this approach seems problematic, as the literature does not support it well. In fact, except for this paper by van Buuren et al. (2011, cf. section 5.2), I cannot find any source that supports this approach:
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. https://doi.org/10.18637/jss.v045.i03
Well, I'm stuck between a rock (a more rigorous approach, i.e. the pooling) and a hard place (a more practical approach, i.e. the complete() function). What do you think?
Hope to hear from you (and my apologies for my broken English).
FM
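For reference, the two routes side by side in mice syntax (the analysis model is hypothetical):
library(mice)
imp <- mice(dat, m = 5)              # multiple imputation
fit <- with(imp, lm(y ~ x))          # analyse each imputed dataset
pooled <- pool(fit)                  # Rubin's rules: the rigorous route
one <- complete(imp, 1)              # a single completed dataset: the practical route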
When writing code, the kable function does not show its output inside a for loop.
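The usual fix, sketched under the assumption of an R Markdown chunk: kable() only renders via its print method, so inside a loop it must be printed explicitly, in a chunk set to results='asis'.
# chunk option: results='asis'
for (v in names(mtcars)[1:3]) {
  print(knitr::kable(head(mtcars[, v, drop = FALSE])))
  cat("\n\n")
}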
I am capturing fluorescent microscopy images of a developmental process with a high degree of variability. I need to compile these images into a "gallery" so that I can see many/all images at once to look for possible phenotypic differences between conditions, which then I can measure through FIJI. The most basic way I can do this is by arranging the images in a Powerpoint, but this method feels manual and has some limitations (explained below). I am hoping someone may know of a better way to compile image data for investigation/curation, maybe through FIJI, R, or a different dedicated program.
Limitations of Powerpoint:
1. Cannot track sample group or ID: When you upload an image into Powerpoint, there is no way to check its filename. So I am forced to upload one image at a time (and make a text box noting condition and ID), which is slow and unideal for batch processing images.
2. Re-formatting and arranging is manual: Re-sizing, cropping, and positioning images on Powerpoint is very manual and there is no record of changes (so making scale bars is hard). The effort compounded when the need arises to re-size or re-position the images later in the process (see #3).
3. Cannot re-order images dynamically based on values: Because the process I am studying is highly variable, it is necessary to compare images at the same percentile between conditions. Thus, I would like to re-order the images by measured values (so that I can compare min to min, mid to mid, max to max, etc.). This is also useful for selecting representative images for presentations/papers.
4. Interactive/Shiny Graph: To further aid in comparing images based on values, it would be nice to have an interactive/shiny scatterplot of the measured values such that when you hover over/select an image, the corresponding datapoint is marked in the graph.
Partial Solutions:
A. Keynote: Solves #1 because uploaded images retain the filename information. However, the file format is limited to Macs which limits data sharing.
B. QuickFigures: This FIJI plugin solves #1 and #2, but it is still clunky to use and cannot re-order images.
C. Image Data Explorer: This R package or browser program solves #4 but can only view one image at a time. It also has the advantage of easy annotation of the images by entering values into the corresponding data spreadsheet/dataframe.
Does anyone know of a way/program that solves any of these limitations? Or does anyone have a different way of compiling image data?
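A hedged sketch addressing limitations 1-3 with the magick package: read images by filename (keeping IDs), re-order by a measured value, label each image, and lay out a contact sheet programmatically (the CSV columns file, id, value are hypothetical).
library(magick)
meta <- read.csv("measurements.csv")
meta <- meta[order(meta$value), ]                 # re-order by measurement
imgs <- image_read(meta$file)
for (i in seq_along(imgs)) {
  imgs[i] <- image_annotate(imgs[i], meta$id[i], size = 20, color = "white")
}
sheet <- image_montage(imgs, tile = "5x", geometry = "200x200+2+2")
image_write(sheet, "gallery.png")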
In two structural equation models, I am comparing the association of two different psychopathological measures with emotion regulation. Emotion regulation is a latent variable constructed from 4 manifest variables (let's call them A, B, C, D), which are the same in both models. In model 1, A and C have positive signs while B and D have negative signs. In model 2, B and D have positive signs while A and C are negative. I would like this to be consistent in both models, as currently model 1 reflects "efficiency of emotion regulation" and model 2 reflects "inefficiency of emotion regulation". This will surely confuse reviewers and readers...
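If the models are fitted in lavaan, a hedged sketch of one fix: anchor the latent variable on the same marker indicator in both models by fixing that loading to +1, which pins the factor's orientation.
library(lavaan)
model <- '
  ER =~ 1*A + B + C + D    # A anchors the sign of ER in both models
'
fit <- sem(model, data = dat)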
For my thesis work I have to deal with multivariate multiple regression, while in my studies I have only dealt with multiple regression (one regressand and multiple regressors). Now I have multiple response variables (Y), which are continuous.
My understanding is that although lm() handles a matrix Y on the left-hand side of the model formula, it really just fits m separate lm models.
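For reference, the base-R form; as noted, this just fits one least-squares model per column of Y (names hypothetical):
fit <- lm(cbind(y1, y2, y3) ~ x1 + x2, data = dat)
coef(fit)   # one coefficient column per response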
---------------------
To @Rainer Duesing
In my work I'm dealing with a functional response, so I represented the response data on a common finite grid, and what I need to predict is the row-stacked matrix Y.
Looking for R packages in the field of soil erosion/sediment estimation and analysis.
Any comment or hint would be welcome.
Hi all, I have scRNA-seq data generated on the BD Rhapsody platform and analysed on the Seven Bridges platform. Could you please give me an idea of how to deal with the Seven Bridges output files for Seurat scRNA-seq analysis in R? Mainly, I need the filtered feature, barcode, and matrix files for the analysis.
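A hedged sketch, assuming the pipeline exported a cells-by-genes counts table (exact output file names vary by pipeline version, so the one below is hypothetical):
library(Seurat)
counts <- read.csv("sample_DBEC_MolsPerCell.csv", row.names = 1)
counts <- t(as.matrix(counts))                    # Seurat wants genes x cells
so <- CreateSeuratObject(counts = counts, project = "rhapsody")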
Dear colleagues, I need to build a matrix A that contains some values and a lot of zeros.
Thanks a lot for your help.
These are the elements, by matrix row:
a_1 = 1 0.89 0.75 0.61 0.46 0.35 0.28 0.18 0.08, followed by 91 zeros
a_2 = 0.89 1 0.89 0.75 0.61 0.46 0.35 0.28 0.18 0.08, followed by 90 zeros
a_3 = 0.75 0.89 1 0.89 0.75 0.61 0.46 0.35 0.28 0.18 0.08, followed by 89 zeros
a_4 = 0.61 0.75 0.89 1 0.89 0.75 0.61 0.46 0.35 0.28 0.18 0.08, followed by 88 zeros
a_5 = 0.46 0.61 0.75 0.89 1 0.89 0.75 0.61 0.46 0.35 0.28 0.18 0.08, followed by 87 zeros
a_6 = 0.35 0.46 0.61 0.75 0.89 1 0.89 0.75 0.61 0.46 0.35 0.28 0.18 0.08, followed by 86 zeros
a_7 = 0.28 0.35 0.46 0.61 0.75 0.89 1 0.89 0.75 0.61 0.46 0.35 0.28 0.18 0.08, followed by 85 zeros
a_8 = 0.18 0.28 0.35 0.46 0.61 0.75 0.89 1 0.89 0.75 0.61 0.46 0.35 0.28 0.18 0.08, followed by 84 zeros
a_9 = 0.08 0.18 0.28 0.35 0.46 0.61 0.75 0.89 1 0.89 0.75 0.61 0.46 0.35 0.28 0.18 0.08, followed by 83 zeros
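A minimal base-R sketch: the rows describe a 100 x 100 symmetric banded Toeplitz matrix, which toeplitz() can build directly from the first row.
band <- c(1, 0.89, 0.75, 0.61, 0.46, 0.35, 0.28, 0.18, 0.08)
A <- toeplitz(c(band, rep(0, 91)))   # 100 x 100
A[1:5, 1:12]                         # check the banding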
I am currently completing the synthesis for a systematic review on the impact of the use of wireless devices on mental health. The systematic review looks at quantitative studies only. Due to the heterogeneity of outcomes (depression, anxiety, externalizing behaviours, etc.) and study designs, we have decided not to run a meta-analysis, nor will we produce forest plots. However, I feel that a harvest plot would be an attractive and intelligible method of summarizing our findings and would complement a narrative synthesis. See below for what I mean by a harvest plot.
Here is a great example of what I am trying to produce:
I am very familiar with using R / Python for data visualisation purposes, but I am initially stumped about how I might produce an attractive and aesthetically pleasing plot, short of stodgily moving rectangles around in a word / publisher doc. Can anyone suggest a package / software / website / any method that help me?
Much much much appreciation if you can!
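A hedged ggplot2 sketch of the harvest-plot idea: one bar per study, a facet grid of outcome by direction of effect, and bar height mapped to study quality (all data here are placeholders).
library(ggplot2)
d <- data.frame(study = c("S1", "S2", "S3", "S4"),
                outcome = c("Depression", "Depression", "Anxiety", "Anxiety"),
                direction = c("Negative", "No effect", "Positive", "No effect"),
                quality = c(3, 1, 2, 3))
ggplot(d, aes(study, quality)) +
  geom_col(width = 0.6) +
  facet_grid(outcome ~ direction, switch = "y") +
  theme_minimal()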
There are many statistics packages that researchers usually use in their work. In your opinion, which one is better? Which one would you suggest to start with?
Your opinions and experience can help others, in particular younger researchers, make a selection.
Sincerely
I would like to know which R package is good for creating a Taguchi design of experiments and analysing it. What I have seen is that the qualityTools package only creates the design and makes a signal-to-noise ratio graph; it does not do any analysis beyond that. Kindly guide me in this regard.
Dear all,
I'm building a shiny application for analysis of mass-spectrometry based proteomics data (for non-coders), and am working on the section for GO term enrichment.
There are a wide variety of packages available for GO term enrichment in R, and I was wondering which of these tools is most suitable for proteomics data.
The two I'm considering using are https://agotool.org/ which corrects for abundance with PTM data, or STRINGdb which has an enrichment function.
What do you guys recommend?
Best regards,
Sam
From the link https://gtexportal.org/home/datasets, under V7, I'm trying to do R/Python analyses on the Gene TPM and Transcript TPM files. In these files (which I had to open with Universal Viewer, since they are too large for an app like Notepad), I'm seeing a bunch of IDs for samples (i.e. GTEX-1117F-0226-SM-5GZZ7), followed by transcript IDs like ENSG00000223972.4, and then a bunch of numbers like 0.02865 (which take up about 99% of these large files). Can someone help me decipher what the numbers mean, please? And are the numbers supposed to be assigned to a specific sample ID? (The number of values far exceeds the number of samples, by the way.) I tried opening these files as tables in R, but I do not think R is categorizing the contents of the file correctly.
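A hedged reading sketch: the GTEx TPM files are GCT-formatted, i.e. tab-separated with two header lines (version and dimensions) before the column names, so skip those when importing (the file name below is hypothetical).
tpm <- read.delim("GTEx_gene_tpm.gct", skip = 2, check.names = FALSE)
tpm[1:3, 1:5]   # rows = genes/transcripts, columns = sample IDs, cells = TPM values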
I am already familiar with the process of calculating hedge ratios with linear regression (OLS). I am already running 4 different regressions for calculating hedge ratios between emerging markets and different hedging assets like gold. This is done on in-sample data.
That would look something like this: EM=a+b*GOLD+e
I then construct a portfolio, test its standard deviation, and compare that with the non-hedged portfolio of only emerging market equities in the out-of-sample period: R - b*GOLD
However, I want to compare these OLS hedge ratios to conditional hedge ratios from for instance a BEKK GARCH or a DCC GARCH.
I have already tried to work with R and I used the rugarch and rmgarch packages and created a model, modelspec and modelfit, but I do not know how to go from there.
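A hedged sketch of the next step, assuming a fitted DCC model (your "modelfit") with EM as series 1 and gold as series 2: extract the conditional covariances and form the time-varying hedge ratio beta_t = Cov_t(EM, GOLD) / Var_t(GOLD).
library(rmgarch)
H <- rcov(modelfit)                 # 2 x 2 x T array of conditional covariances
beta_t <- H[1, 2, ] / H[2, 2, ]     # conditional hedge ratio series
plot(beta_t, type = "l")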
Hello! I am doing a moderation analysis using PROCESS MACRO in R. My interaction (moderation) is significant but the bootstrap is not. How can I interpret that? Thank you!
I am looking for one, but I could not find any information. I have only found SmartPLS as such a tool, but I am looking for a free alternative.
Dear community,
I am currently doing a research project including a moderated mediation that I am analysing with R (model 8). I did not find a significant direct effect of the IV on the DV. Furthermore, the moderator did not have a significant effect on any path. As a follow-up, I therefore calculated a second model that excluded the moderator (now mediation only, model 6). In this model, the effect of the IV on the DV is significant. Is it possible that the mere presence of the moderator in the first model influences my direct effect, even if it does not have an effect on the relationship between IV and DV? Is my understanding wrong that direct effects only depict direct effects, without including the influence of other variables in the model?
Can anybody help me with an explanation and maybe also literature for this?
Thank you very much in advance!
KR, Katharina
Need urgent help regarding this.
I am currently working on a project involving multiple time series forecasting using R. For the forecasting process, I have monthly data available from 2019 up to 2023.
Currently, I have generated baseline forecasts using R, specifically using ARIMA and ETS models. I have attached figures depicting the behavior of one of the time series along with its corresponding forecasted values based on the applied mathematical models.
However, I am facing an issue where most of the forecasted data does not seem to capture the seasonality well. The forecasts appear to be linear rather than capturing the expected seasonal patterns.
Code:
t_data.fit <- t_data %>%
  model(arima = ARIMA(Demand))
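A hedged tweak, assuming the fable/fpp3 framework: declare the seasonal terms explicitly so ARIMA() is not allowed to drop them, and fit an additive-seasonal ETS alongside for comparison.
library(fable)
t_data.fit <- t_data %>%
  model(
    arima = ARIMA(Demand ~ pdq() + PDQ(P = 0:2, D = 1, Q = 0:2, period = 12)),
    ets   = ETS(Demand ~ season("A"))
  )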
I'm fitting a model in R where the response variable is over-dispersed count data, and I have met an issue with the goodness-of-fit test. As has been discussed on Cross Validated, a common GOF test like the chi-square test for deviance is inappropriate for negative binomial regression because the model includes a dispersion parameter as well as the mean:
I'm confused and would appreciate any suggestions about the appropriate GOF test for NB regression (preferably intuitive).
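One intuitive option, sketched under the assumption of a glm.nb-style fit: DHARMa's simulation-based residual checks sidestep the deviance chi-square problem for NB models.
library(DHARMa)
res <- simulateResiduals(nb_model)   # nb_model: e.g. a MASS::glm.nb fit
plot(res)                            # QQ plot and residual-vs-predicted panels
testDispersion(res)
testUniformity(res)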
Hi all,
I'm busy building a shiny analysis pipeline to analyse proteomics data from mass spectrometry, and I was wondering what the exact difference is between the terms over-represented and upregulated. Can they be used interchangeably? Is one more appropriate for RNA or proteins?
Thanks,
Sam
I am trying to run a spatio-temporal autoregressive model (STAR). Therefore I need to create a spatial weight matrix W with N × T rows and N × T columns to weight country interdependencies based on yearly trade data. Could someone please tell me how to create such a matrix in R or Stata?
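A hedged sketch in R: with a time-invariant N x N weight matrix W (e.g. row-standardised trade shares), the NT x NT matrix is a Kronecker product; with yearly trade matrices, stack them block-diagonally instead.
W <- matrix(runif(25), 5, 5); diag(W) <- 0       # toy N = 5 matrix
W <- W / rowSums(W)                              # row-standardise
T_periods <- 3
W_full <- kronecker(diag(T_periods), W)          # (N*T) x (N*T), time-invariant case
library(Matrix)
W_full_t <- bdiag(W_list)                        # W_list: one N x N matrix per year (hypothetical)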
Hi guys, I'm looking for all R libraries developed/intended/oriented to run standard and advanced psychometric analyses. I'm aware of popular packages such as psych, sem, mirt, and CTT, but if you happen to know any other package that performs psychometric analyses, I will really appreciate any info about it.
Hi
I am trying to set up a topmodel simulation in R. The flowlength function of the topmodel package, requires the outlet coordinate of the DEM matrix, i.e. the row and column position.
Is there a practical way to get that position? I am currently using spreadsheets to find the position visually, based on my knowledge of the watershed. Unfortunately, for a large, high-resolution DEM this is almost impossible.
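A hedged sketch with terra, assuming the outlet's map coordinates are known: convert them to a cell number, then to the row/column indices topmodel wants.
library(terra)
dem <- rast("dem.tif")                             # hypothetical DEM file
cell <- cellFromXY(dem, cbind(outlet_x, outlet_y))
rowColFromCell(dem, cell)                          # row and column of the outlet cell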
Dear colleague,
I am currently working on a similar topic for my thesis, and I would like to learn more about how to performed this test. Could you please share with me some guidance steps or codes for implementing these tests? I would greatly appreciate your help and advice.
I am trying to develop a mixed logit model for my data; the data are in the format shown in the image. I am getting errors while running the model (data format, alt name cannot be the same as columns, etc.) in packages like xlogit in Python and mlogit and apollo in R.
My question is: does one need to convert the data into the format shown even if RP data are available and only one specific choice out of a few is made by each individual?
If the data set is somewhat large, how does one convert the data to that format using code?
Also, what happens in the case where all my variables are categorical and there are no varying variables across individuals?
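A hedged sketch of the reshaping step with dfidx, mlogit's current input format; the column names (mode as the chosen alternative, plus case-specific income and age) are hypothetical.
library(mlogit)
library(dfidx)
d <- dfidx(dat, shape = "wide", choice = "mode")   # alternatives taken from the levels of mode
m <- mlogit(mode ~ 1 | income + age, data = d)     # purely case-specific variables go after the |
summary(m)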
In R, I am using the randomForest and caret packages for a Random Forest (RF) regression task. Generally, there are two options when fine-tuning a model:
- GridSearch,
- RandomSearch.
I have seen in various posts that grid search has limitations, such as intangibles that cannot be addressed in a grid search: for example, parameter interaction stabilizes at a slower rate than error, so bootstrap replicates should be adjusted not just on error but also based on the number and complexity of parameters included in the model.
My questions are:
- what are those interactions?
- how can I find such interactions and visualize them (e.g., in a graph) using R?
Examples online are welcomed.
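Such interactions can be mapped by hand; a hedged sketch, assuming a regression task with hypothetical data dat and response y: evaluate the OOB error over a two-parameter grid (mtry x nodesize) and draw the error surface, where non-additive patterns across the tiles indicate parameter interaction.
library(randomForest)
library(ggplot2)
grid <- expand.grid(mtry = c(2, 4, 6), nodesize = c(1, 5, 10))
grid$oob <- apply(grid, 1, function(g) {
  rf <- randomForest(y ~ ., data = dat, mtry = g["mtry"],
                     nodesize = g["nodesize"], ntree = 500)
  tail(rf$mse, 1)                       # OOB MSE after the final tree
})
ggplot(grid, aes(factor(mtry), factor(nodesize), fill = oob)) +
  geom_tile() + labs(x = "mtry", y = "nodesize", fill = "OOB error")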
Dear colleagues! I've tried to estimate the enrichment of histone variants at several Arabidopsis genomic sites, but every time my enrichment result looks like the attached screenshot. What am I doing wrong?
The running function is:
enr.df <- enrichment(query =gr_list[[1]], catalog = anno, shuffles = 24, nCores = 12)
anno is remap2020_histone_nr_macs2_TAIR10_v1_0.bed (the catalog),
and the query is a GRanges of sites of interest in A. thaliana (all 5 chromosomes).
There are no warnings after function execution.
I'm using the package deSolve in R to model a biological ODE system; however, with certain parameters or certain initial conditions the ODE solver produces negative values, which make no sense in a biological scenario.
Is there a way to avoid this without resorting to logarithms?
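A hedged sketch of two non-logarithmic options: tighten the solver tolerances, and clamp the state inside the derivative function so rates are evaluated on non-negative values (the logistic model here is only a placeholder).
library(deSolve)
rhs <- function(t, y, p) {
  y <- pmax(y, 0)                             # guard against tiny negative overshoots
  dN <- p["r"] * y[1] * (1 - y[1] / p["K"])
  list(dN)
}
out <- ode(y = c(N = 0.1), times = seq(0, 50, 0.1), func = rhs,
           parms = c(r = 0.5, K = 10), atol = 1e-10, rtol = 1e-10)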
I want to run the SPSS control file from the PISA 2012 data set in SPSS, but I am not able to turn the .txt file into a .sav file. Does anyone have an idea how to run it?
Hi there!
I'm following the method by Branko Miladinovic et al., "Trial sequential boundaries for cumulative meta-analyses".
I'm using the following code: metacumbounds ln_hr se_ln_hr N, data(loghr) effect(r) id(study) surv(0.40) loss(0.00) alpha(0.05) beta(0.20) is(APIS) graph spending(1) rrr(.15) kprsrce(StataRsource.R) rdir(C:\Program Files\R\R-4.2.1\bin\x64\) shwRRR pos(10) xtitle(Information size)
but the following error is popping up:
File bounds.dta not found.
Any guess what's causing the error and how to fix it?
Thanks in advance
Dear all,
I want to calculate the effect of treatments (qualitative) on quantitative variables (e.g. plant growth, % infestation by nematodes, ...) compared to a control in an experimental setup. For the control and for each treatment, I have n=5.
Rather than comparing means between each treatments, including the control, I would like to to see whether each treatment has a positive/negative effect on my variable compared to the control.
For that purpose, I wanted to calculate the log Response Ratio (lnRR) that would show the actual effect of my treatment.
1) Method for the calculation of the LnRR
a) During my thesis, I had calculated the mean of the control value (Xc, out of n=5) and then compared it to each of the values of my treatments (Xti). Thereby, I ended up with 5 lnRR values (ln(Xt1/Xc); ln(Xt2/Xc); ln(Xt3/Xc); ln(Xt4/Xc); ln(Xt5/Xc)) for each treatment, calculated the mean of those lnRR values (n=5) and ran the following stats : i) comparison of the mean effects between all my treatments ("has one treatment a larger effect than the other one?") and ii) comparison of the effect to 0 ("is the effect significantly positive/negative?")
==> Does this method seem correct to you ?
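For concreteness, a minimal sketch of the method described in a), with hypothetical vectors control and treatment (n = 5 each):
Xc_bar <- mean(control)             # mean of the control replicates
lnRR <- log(treatment / Xc_bar)     # one lnRR per treatment replicate
mean(lnRR)
t.test(lnRR, mu = 0)                # is the mean effect different from 0?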
b) While searching the literature, most studies consider data from original studies and calculate lnRR from mean values within the studies. Hence, they end up with n>30. This is not our case here, as our data are from 1 experimental setup...
I also found this: "we followed the methods of Hedges et al. to evaluate the responses of gas exchange and water status to drought. A response ratio (lnRR, the natural log of the ratio of the mean value of a variable of interest in the drought treatment to that in the control) was used to represent the magnitude of the effects of drought as follows:
LnRR = ln(Xe/Xc) = lnXe - lnXc,
where Xe and Xc are the response values of each individual observation in the treatment and control, respectively."
==> This is confusing to me because the authors say that they use mean values of treatment / mean values of control, but in their calculation they use "individual observations". Are the individual observations means within each study?
==> Can you confirm that I CANNOT compare each observation of replicate 1 of control with replicate 1 of treatment; then replicate 2 of control with replicate 2 of treatment and so on? (i.e. ln(Xt1/Xc1); ln(Xt2/Xc2); ln(Xt3/Xc3); ln(Xt4/Xc4); ln(Xt5/Xc5)). This sounds wrong to me as each replicate is independent.
2) Statistical use of LnRR
Taking my example in 1a), I did a t-test for the comparison of mean lnRR value with "0".
However, n<30, so it would probably be best not to use a parametric test:
=> any advice on that?
=> Would you stick with a comparison of means from raw values, without trying to calculate the lnRR to justify an effect?
Thank you very much for your advice on the use of LnRR within single experimental studies.
Best wishes,
Julia
I'm trying to calculate the cophenetic distance with the R function cophenetic() from a phylogenetic tree (16.8 MB, generated by the package V.PhyloMaker2), and RStudio raises the error "Error in dist.nodes(x) : tree too big", as the picture shows. How can I solve it? Or is there any other method to calculate NRI when the tree is too big (I'm using the function ses.mpd from the package picante)?
Thanks a lot!
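A hedged workaround: ses.mpd() only needs distances among the species actually present in the community matrix, so pruning the mega-tree first can keep dist.nodes() within its limits (comm is a hypothetical sites-by-species matrix).
library(ape)
library(picante)
pruned <- keep.tip(tree, colnames(comm))   # drop all other tips
d <- cophenetic(pruned)
res <- ses.mpd(comm, d, null.model = "taxa.labels")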
I am stuck on the problem below; it would be a pleasure to have your ideas.
I've written the same program in two languages, Python and R, but each came to a completely different result. Before jumping to a conclusion, I declare that:
- Every part of the code in the two languages has been checked multiple times, is correct, and represents the same thing.
- The packages used in the two languages are the same versions.
So, what do you think?
The code is about applying deep neural networks for time series data.
I am familiar with the 'sdm' package for constructing species distribution models (SDMs). Now I am facing an issue.
I used predict() from 'dismo' to predict the distribution of a species a few weeks ago, and it ran smoothly without consuming much time; it hardly took 3-5 minutes. But now it has been running since morning (8 hrs+) on the same data, and I am still waiting for the result. I have to prepare SDMs for 10 different species; if it takes this much time to predict a single model, I will have to wait many days, which is annoying.
How can I fix it? Can anyone share his/her thoughts in this regard? Or is there an alternative to save time?
Thanks
PS: PC configuration: 8 GB RAM, AMD Processor
Good day! I have a list (~10000) of unique DNA sequences about 10-20 bp.
I want to find out if they could have evolved from one or several sequences, or emerged independently.
Some of the sequences have similar motifs and can be aligned; others cannot be aligned at all, so I can't just perform an MSA and make a tree: the distance matrix contains many NAs.
I've tried using principal component analysis on k-mer (1-4) frequencies, but it gives me nothing: the frequencies form one dense cloud of points, with PC1 explaining only ~4% of the variance.
I also found that the universalmotif R package is capable of performing a similar analysis using motif_comparison(), so I converted the sequences into sequence motif format (one for each). But when I tried it on a short set of data, I found that the algorithm works in a very strange way on a list of motifs each created from only one sequence. Different methods give the same result (tree added to the question): sequences that are different are placed near each other instead of sequences that are somewhat similar...
Hello All,
I have a question regarding the SEM-based path analysis. I construct my model as the following:
"Y~M1+M2+X1+X2
M1~X1
M2~X2"
As you can see, both M1 and M2 are used as mediators in my model.
Here comes my question: All my variables (Y, M1, M2, X1, X2) are categorical variables. By running the above model in R ("lavaan" package), I can get the results (i.e., coefficient, p-value) of each category of Y, M1, and M2. However, my results do not show each level of X1 and X2.
Are there any ways to solve this issue?
Any help will be much appreciated!
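One possible cause, sketched under the assumption that X1 and X2 are exogenous: lavaan handles endogenous categorical variables via the ordered argument, but exogenous categorical predictors must be dummy coded by hand before each level can appear in the output (the dummy names below are hypothetical).
library(lavaan)
dummies <- model.matrix(~ X1 + X2, data = dat)[, -1]   # drop the intercept column
dat2 <- cbind(dat, dummies)
model <- '
  Y  ~ M1 + M2 + X1b + X1c + X2b
  M1 ~ X1b + X1c
  M2 ~ X2b
'
fit <- sem(model, data = dat2, ordered = c("Y", "M1", "M2"))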
I have three raster layers, two coarse resolution and one fine resolution. My goal is to extract GWR's coefficients (intercept and slope) and apply them to my fine resolution raster.
I can do this easily when I perform simple linear regression. For example:
library(terra)
library(sp)
ntl = rast("path/ntl.tif") # coarse res raster
vals_ntl <- as.data.frame(values(ntl))
ntl_coords = as.data.frame(xyFromCell(ntl, 1:ncell(ntl)))
combine <- as.data.frame(cbind(ntl_coords,vals_ntl))
ebbi = rast("path/tirs010.tif") # coarse res raster
ebbi <- resample(ebbi, ntl, method="bilinear")
vals_ebbi <- as.data.frame(values(ebbi))
s = c(ntl, ebbi)
names(s) = c('ntl', 'ebbi')
block.data <- as.data.frame(cbind(combine, vals_ebbi))
names(block.data)[3] <- "ntl"
names(block.data)[4] <- "ebbi"
block.data <- na.omit(block.data)
model <- lm(formula = ntl ~ ebbi, data = block.data)
#predict to a raster
summary(model)
model$coefficients
pop = rast("path/pop.tif") # fine res raster
lm_pred010 = 19.0540153 + 0.2797187 * pop
But when I run GWR, the slope and intercept are not just two numbers (as in the linear model) but whole ranges. Attached are the results of the GWR.
My question is how can extract GWR model parameters (intercept and slope) and apply them to my fine resolution raster? In the end I would like to do the same thing as I did with the linear model, that is, GWR_intercept + GWR_slope * fine resolution raster.
Here is the code of GWR:
library(GWmodel)
library(raster)
block.data = read.csv(file = "path/block.data00.csv")
# create separate dfs for the x & y coords
x = as.data.frame(block.data$x)
y = as.data.frame(block.data$y)
sint = as.matrix(cbind(x, y))
# convert the data to a SpatialPointsDataFrame
coordinates(block.data) = c("x", "y")
# specify a model equation
eq1 <- ntl ~ tirs
dist = GWmodel::gw.dist(dp.locat = sint, focus = 0, longlat = FALSE)
abw = bw.gwr(eq1,
data = block.data,
approach = "AIC",
kernel = "tricube",
adaptive = TRUE,
p = 2,
longlat = F,
dMat = dist,
parallel.method = "omp",
parallel.arg = "omp")
ab_gwr = gwr.basic(eq1,
data = block.data,
bw = abw,
kernel = "tricube",
adaptive = TRUE,
p = 2,
longlat = FALSE,
dMat = dist,
F123.test = FALSE,
cv = FALSE,
parallel.method = "omp",
parallel.arg = "omp")
ab_gwr
You can download the csv and the fine resolution raster from here (https://drive.google.com/drive/folders/1V115zpdU2-5fXssI6iWv_F6aNu4E5qA7?usp=sharing).
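A hedged sketch of the extraction step: gwr.basic() stores the per-location coefficients in ab_gwr$SDF (one column per term; assuming GWmodel's usual naming, the intercept column is "Intercept" and the slope column is named after the predictor). Rasterize those surfaces, resample to the fine grid, and apply them cell-wise.
library(terra)
coef_pts <- vect(ab_gwr$SDF)                        # coefficients as points
int <- rasterize(coef_pts, ntl, field = "Intercept")
slp <- rasterize(coef_pts, ntl, field = "tirs")
int_f <- resample(int, pop, method = "bilinear")    # bring to the fine grid
slp_f <- resample(slp, pop, method = "bilinear")
gwr_pred010 <- int_f + slp_f * pop                  # GWR_intercept + GWR_slope * fine raster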
AnnotationDbi can annotate a gene with its full name or GO terms.
However, how can one annotate individual ligand-receptor pairs?
For example, for KNG1-BDKRB2, how can you annotate the pair's role or function with an R or Python module?
Is there an R or Python package that can annotate the function or related diseases of individual genes?
Thank you for your attention.
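For the per-gene half, a minimal AnnotationDbi sketch pulling names, GO terms, and KEGG pathway IDs for both partners of a pair:
library(AnnotationDbi)
library(org.Hs.eg.db)
AnnotationDbi::select(org.Hs.eg.db,
                      keys = c("KNG1", "BDKRB2"),
                      keytype = "SYMBOL",
                      columns = c("GENENAME", "GO", "PATH"))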
Hi everyone,
What is the best method for converting nested data stored in a .mat file (created in MATLAB) to a format that can be read by R? (e.g., csv)
Thanks :)
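A minimal sketch with the R.matlab package: readMat() returns the .mat contents as a nested R list, which can be inspected, flattened, and written out (the node name below is hypothetical).
library(R.matlab)
m <- readMat("data.mat")
str(m, max.level = 2)                            # inspect the nesting
write.csv(as.data.frame(m$results), "results.csv")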
Hello everyone.
I'm conducting research in which I have pre-test and post-test data. First, I measured the number of subscribers to a service in a first period (pre-test). Second, I measured the number of subscribers to the same service in a second period (post-test). Between the pre-test and the post-test, consumers were shown a stimulus that could have changed their intention to subscribe to the service. I used 2 types of stimuli, one about time (week and month) and the other about the method (method 1 and method 2). I also have the pre-test and post-test numbers of users who use the service but are not the owners of the subscription.
I have performed a McNemar test to compare if there are differences between the number of subscribers before and after each type of stimulus. That is, one McNemar test to measure differences in weekly time pressure, one for monthly pressure, and two more: one for method 1 and one for method 2. For all of them, I have significant results.
I have also calculated the percentage decrease in the number of subscribers for each stimulus. For example, under weekly pressure the number of subscribers drops by 7%; among users who use the service but are not the owners of the subscription, the decrease is 54%. I intend to determine which stimulus produces a greater decrease, and for which type of user.
I think that maybe this analysis is poor for academic research, and I also want to compare the percentage decrease in subscriptions with the percentage decrease among users who use the service but aren't the owners of the subscription, for each type of stimulus.
Can you please recommend me another type of analysis?
Maybe some analysis to measure whether there are differences between weekly and monthly pressure (or method 1 and method 2) between subscribers and users who use the service but are not the owners of the subscription. I mean using the two groups of users, the two time pressures or the two method pressures, and the pre-test and post-test.
Thank you very much.
Diana
Dear Colleagues!
Is it possible to use percentage cover data (e.g. plant cover in sample plots) when creating rarefaction curves with the iNEXT package in R? (that can compute rarefaction curves for the Hill numbers)
Unfortunately, I cannot convert these percentages to count data.
Therefore I can only use incidence-frequency-based data (0/1) as input for creating these curves, but with this limitation only the species richness curves (q=0) make sense to me, and I lose the abundance-based information of my percent cover data in the Shannon (q=1) and Simpson (q=2) curves.
What is the proper way when I only have percentage cover data?
Any ideas would be appreciated!
I created this R package to allow easy visual analysis of VCF files, investigating mutation rates per chromosome, per gene, and much more: https://github.com/cccnrc/plot-VCF
The package is divided into 3 main sections, based on analysis target:
- variant Manhattan-style plots: visualize all/specific variants in your VCF file. You can plot subgroups based on position, sample, gene and/or exon
- chromosome summary plots: visualize plot of variants distribution across (selectable) chromosomes in your VCF file
- gene summary plots: visualize plot of variants distribution across (selectable) genes in your VCF file
Take a look at how many different things you can achieve in just one line of code!
It is extremely easy to install and use, well documented on the GitHub page: https://github.com/cccnrc/plot-VCF
I'd love to have your opinion, bugs you might find etc.
Hi,
Most researchers know the R Views website, which is:
Please, I am wondering whether this website lists all R packages available to researchers.
Thanks & Best wishes
Osman
I have .bed (file format) files, and I would like to make a correlation plot (preferably a line plot) for them.
For example, I want the x axis to be distance (i.e., chromosome 1) and the y axis to be some sort of correlation.
For example, if one of the files has a peak where the other does not, there would be no correlation, and if there are strong peaks for both files at the same location, then there would be high correlation.
Any suggestions? Thanks!
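A hedged sketch of one way to get a rolling correlation along chromosome 1: bin the chromosome, count peaks per bin in each .bed file, and correlate the binned signals in sliding windows (the chromosome length, window size, and file names are placeholders).
library(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(read.table("a.bed", col.names = c("chrom", "start", "end")))
bed2 <- makeGRangesFromDataFrame(read.table("b.bed", col.names = c("chrom", "start", "end")))
bins <- tileGenome(c(chr1 = 248956422), tilewidth = 1e5, cut.last.tile.in.chrom = TRUE)
c1 <- countOverlaps(bins, bed1)
c2 <- countOverlaps(bins, bed2)
win <- 50                                        # 50-bin sliding window
rc <- sapply(seq_len(length(bins) - win), function(i)
  cor(c1[i:(i + win - 1)], c2[i:(i + win - 1)]))
plot(rc, type = "l", xlab = "bin along chr1", ylab = "rolling correlation")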
I want to select some disease-related terms, such as amyloid-related or stress-related terms, for visualization using clusterProfiler in R (from the results of a DisGeNET enrichment analysis). What code can I use to select these terms for visualization in R without picking them manually?
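A hedged sketch: subset the enrichment object's result table by keyword before plotting, assuming a DisGeNET enrichResult object edo:
library(enrichplot)
keep <- grepl("amyloid|stress", edo@result$Description, ignore.case = TRUE)
edo_sub <- edo
edo_sub@result <- edo@result[keep, ]
dotplot(edo_sub, showCategory = nrow(edo_sub@result))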
I am running a PCA in JASP and SPSS with the same settings; however, the PCA in SPSS shows some factors with negative values, while in JASP all of them are positive.
In addition, when running an EFA in JASP, it allows me to obtain results with Maximum Loading, while SPSS does not. JASP goes so far with the EFA that I can choose to extract 3 factors and somehow get the results one would have expected from previous research. However, SPSS does not run under the Maximum Loading setting, regardless of whether I set it to 3 factors or to eigenvalues.
Has anyone come across the same problem?
UPDATE: Screenshots were updated. The EFA also shows results in SPSS, just without cumulative values, because value(s) are over 1. But why the difference in positive and negative factor loadings between JASP and SPSS?
Hi all, I am attempting to make a scatterplot of CO2 rate data (y) at varying temperatures (x) with two categorical variables, depth (Mineral, Organic) and substrate (Calcareous, Metapelite, Peridotite), with fitted lines following the equation y = a*exp(B*t), where y = CO2 rate, a = basal respiration (intercept), B = slope, and t = temperature (time equivalent). I have already fitted all of the exponential curves, so I have the values of y, a, B and t for each data point. I am struggling to figure out how to fit the exponential curves across the two groups. So far, I have produced the base scatterplot using ggplot2, colouring by depth and faceting by substrate, but I do not know the best approach to fit the curves. I have attempted to filter the dataset into categories (substrate = 1, depth = 1) and use nls to fit the curves, with so little success that it isn't worth posting the code. Any advice/guidance is greatly appreciated.
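Since the a and B values already exist for each depth-substrate combination, one option (a sketch with simulated values; pars, dat and all parameter values are illustrative) is to skip refitting entirely and draw the curves from a prediction grid:
library(ggplot2)
# Lookup table of fitted parameters per depth x substrate combination.
pars <- expand.grid(depth = c("Mineral", "Organic"),
                    substrate = c("Calcareous", "Metapelite", "Peridotite"))
set.seed(1)
pars$a <- runif(6, 0.5, 2)
pars$B <- runif(6, 0.05, 0.15)
# Simulated stand-in for the raw data.
dat <- merge(pars, data.frame(t = runif(20, 0, 30)))
dat$co2 <- dat$a * exp(dat$B * dat$t) + rnorm(nrow(dat), 0, 0.1)
# Prediction grid: evaluate y = a * exp(B * t) along the temperature range.
grid <- merge(pars, data.frame(t = seq(0, 30, length.out = 100)))
grid$y <- grid$a * exp(grid$B * grid$t)
ggplot(dat, aes(t, co2, colour = depth)) +
  geom_point() +
  geom_line(data = grid, aes(t, y, colour = depth)) +
  facet_wrap(~ substrate)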
It is a .dta file that cannot be read with read.csv, so I used haven to read it. However, the second row holds the real column names, such as Checking Year and Checking Month; how can I extract these?
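If I understand correctly, those human-readable names are stored by haven as variable labels, not as a data row; a minimal sketch (the file path is illustrative):
library(haven)
dat <- read_dta("data.dta")
# haven attaches the descriptive name of each column as a "label" attribute.
labels <- vapply(dat, function(x) {
  l <- attr(x, "label")
  if (is.null(l)) NA_character_ else l
}, character(1))
labels
# Optionally rename the columns to their labels where one exists.
names(dat) <- ifelse(is.na(labels), names(dat), labels)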
I want to make a Venn diagram that has more than 50 sets, each containing two numbers.
The data look like the table below; the total column varies among samples, but the intersection is a constant number (see the sketch after the table for a possible alternative visualization).
total intersection
set1 1000 10
set2 1200 10
set3 1350 10
...
set50 3000 10
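A Venn diagram becomes unreadable far below 50 sets, so one commonly used alternative (a suggestion, not the only option) is an UpSet plot; UpSetR::fromExpression can be fed exactly this kind of size table. A sketch with four illustrative sets:
library(UpSetR)
totals <- c(set1 = 1000, set2 = 1200, set3 = 1350, set50 = 3000)
# Unique part of each set = total - shared intersection (10 in the table);
# the last entry encodes the intersection common to all sets.
expr <- as.list(totals - 10)
expr[[paste(names(totals), collapse = "&")]] <- 10
upset(fromExpression(unlist(expr)), nsets = length(totals))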
I am measuring two continuous variables over time in four groups. Firstly, I want to determine if the two variables correlate in each group. I then want to determine if there are significant differences in these correlations between groups.
For context, one variable is weight, and one is a behaviour score. The groups are receiving various treatment and I want to test if weight change influences the behaviour score differently in each group.
I have found the R package rmcorr (Bakdash & Marusich, 2017) to calculate correlation coefficients for each group, but I am struggling to determine how to correctly compare correlations between more than two groups. The diffcorr package allows comparing between two groups only.
I came across this article describing a different method in SPSS:
However, I don't have access to SPSS, so I am wondering if anyone has suggestions on how to do this analysis in R (or even GraphPad Prism).
Or I could use the diffcorr package to calculate differences for each combination of groups, but then would I need to apply a multiple comparison correction?
Alternatively, Mohr & Marcon (2005) describe a different method using Spearman correlation that seems like it might be more relevant; however, I wonder why their method doesn't seem to have been used by other researchers. It also looks difficult to implement, so I'm unsure if it's the right choice.
Any advice would be much appreciated!
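For the pairwise route mentioned above, a minimal base-R sketch of Fisher's r-to-z comparison with a multiplicity correction (the r and n values are illustrative; for repeated-measures correlations from rmcorr, the error degrees of freedom should replace n - 3, so treat this as an approximation):
# Two-sided test for the difference between two independent correlations.
compare_r <- function(r1, n1, r2, n2) {
  z <- (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  2 * pnorm(-abs(z))
}
rs <- c(g1 = 0.40, g2 = 0.15, g3 = 0.55, g4 = 0.30)
ns <- c(20, 22, 19, 21)
pairs <- combn(4, 2)
p <- apply(pairs, 2, function(i) compare_r(rs[i[1]], ns[i[1]], rs[i[2]], ns[i[2]]))
p.adjust(p, method = "holm")   # yes, a correction is advisable here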
Dear community,
I am planning to conduct a GWAS analysis with two groups of patients differing in a binary characteristic. As this cohort is naturally very rare, our sample size is limited to approximately 1500 participants in total (a low number for GWAS). Therefore, we were thinking of studying associations between pre-selected genes that might be phenotypically relevant to our outcome. As there exist no prior data/arrays that studied similar outcomes in a different patient cohort, we need to identify regions of interest bioinformatically.
1) Do you know any tools that might help me harvest genetic information for known pathways involved in relevant cell-functions and allow me to downscale my number of SNPs whilst still preserving the exploratory character of the study design? e.g. overall thrombocyte function, endothelial cell function, immune function etc.
2) Alternatively: are there bioinformatic ways (AI etc.) that circumvent the problem of multiple testing in GWAS studies and would allow me to robustly explore my dataset for associations even at lower sample sizes (n < 1500 participants)?
Thank you very much in advance!
Kind regards,
Michael Eigenschink
I want to get the coordinates of the vertices of a set of singlepart polygons (all in the same .shp vector layer) using R. I would like to have a list with the x and y coordinates per polygon. Do you know how this can be obtained?
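A minimal sketch with the sf package (the file path is illustrative):
library(sf)
polys <- st_read("polygons.shp")
xy <- st_coordinates(polys)
# For singlepart POLYGON geometries, st_coordinates() returns columns
# X, Y, L1 (ring index) and L2 (feature index); split on the feature
# index to get one data frame of vertex coordinates per polygon.
coords_by_polygon <- split(as.data.frame(xy[, c("X", "Y")]), xy[, "L2"])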
Thank you in advance!
What package do you use to visualize, for example, UTRs, exons, introns and read coverage of a single gene? Or a plasmid map in linear form?
Hi PLS Experts,
I am an absolute beginner at using PLS, and I need your help. I have practised using PLS in R as well as in SPSS.
I am interested in using PLS as a predictive model. I have 2 dependent variables (DVs): one is continuous while the other is categorical with 3 levels. However, I am not using both in one model but rather in separate models, as the categorical variable is also an independent variable in model 1 (the model with the continuous DV).
I am confused by the term Y-variance explained (for the DV) and its effect on model performance.
Does a low percentage of Y-variance explained (across all components) mean poor prediction by the model?
I recently applied PLS to standardized data with 1 DV and 14 predictors using R (mdatools package). The cumulative X-variance explained was 100%, but the Y-variance explained was only 29% across all 14 components (the optimal number of components is 3).
I am unable to explain the reason for such poor performance.
A summary of the model is attached in the figure (predictions are in the bottom-right part of the image).
Thank you for your time :)
Best
Sarang
#PLS
Dear all,
I am currently trying to fit the same models in lavaan (R) and Mplus. However, I keep getting slightly different parameter or goodness-of-fit estimates.
E.g., I fitted this model in lavaan:
Bi_HiTop_3 <- '
  HiTOP_ges =~ HiTOP_Bodily_Distress + HiTOP_Conver_Symptoms + HiTOP_Health_Anxiety +
               HiTOP_Diseas_Convicti + HiTOP_Somati_Preoccup
  HiTOP_Health_Anxiety ~~ HiTOP_Diseas_Convicti
  HiTOP_ges ~~ 1*HiTOP_ges
'
fit_HiTOP_bi3 <- sem(Bi_HiTop_3, data = vars, estimator = "MLMV", mimic = "mplus")
In Mplus, this is:
MODEL:
Som_HiTOP BY HiTOP_BD* HiTOP_CS HiTOP_HA HiTOP_DC HiTOP_SP;
Som_HiTOP @1;
HITOP_HA with HITOP_DC;
I get slightly different results from these two regarding both parameter and goodness-of-fit estimates. Only the log-likelihood seems to stay the same.
Has anyone encountered this problem and knows why?
I'm working with a multiband raster and I want to extract each band into a single-band raster. I tried two approaches, using R (raster) and QGIS (gdal_translate).
I noticed that the output file from QGIS is around 25 MB while the output file from R is around 2 MB. The original multiband raster is around 490 MB with 19 bands. This led me to think that the QGIS output is the more reasonable one to use. Note that I will use the bands for SDM.
Is the R output still usable for this purpose? Can you also explain the difference in file sizes?
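The size difference usually comes from compression and data type rather than from the pixel values themselves, so the small R file is normally still fine for SDM. A sketch of extracting one band while controlling both (file names are illustrative; COMPRESS is a GDAL creation option):
library(raster)
s  <- stack("multiband.tif")
b1 <- raster(s, layer = 1)
# Keep the original data type and choose the compression explicitly,
# so the output size is comparable across tools.
writeRaster(b1, "band1.tif",
            datatype = dataType(s)[1],
            options = c("COMPRESS=LZW"))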
Hello
Please help me find the area under the curve (as in the attached figure) in R.
The curve was created using the logspline function on a histogram (density).
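A minimal sketch, assuming 'fit' is the logspline object (here fitted to simulated data) and a, b are the integration bounds: plogspline() is the fitted CDF, so the area under the density between a and b is simply the difference.
library(logspline)
fit <- logspline(rnorm(500))   # stand-in for the fitted object
a <- -1; b <- 1                # illustrative bounds
area <- plogspline(b, fit) - plogspline(a, fit)
area
# Equivalent numeric check via integrate():
integrate(dlogspline, lower = a, upper = b, fit = fit)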
Hi everyone!
I am investigating if different variables are associated with the choice of a nest box in a wild bird population.
Briefly, I am comparing the characteristics of multiple nest boxes that were visited by 100 different birds. However, from this set of nest boxes, only 1 is chosen by each bird in the end. I wonder what the best way is to analyze this type of data.
For the moment I have built a logistic regression model in which my response variable is a binary outcome: a nest is chosen (1) or not chosen (0). For this, I built a dataset with the 100 nest boxes that were chosen (y = 1), one per bird, and another 100 nest boxes that were not chosen (y = 0; these 'not chosen' nests are picked randomly from the set of nest boxes visited but not chosen by each bird), coding the ID of the bird as a random factor.
I've repeated this procedure 1000 times and computed the mean estimates. But I am not quite sure this is correct, and I cannot find much information on it. Is there a better procedure to analyze this type of data? Or does anybody know the name of this type of analysis?
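For what it's worth, this choose-one-of-several design is often handled with conditional logistic regression (a suggestion, not necessarily better than the resampling approach), which compares the chosen box against all boxes the same bird visited instead of a random subsample. A sketch with simulated data; all names are illustrative:
library(survival)
set.seed(1)
# One row per bird x visited nest box; chosen = 1 for the selected box.
visits <- do.call(rbind, lapply(1:100, function(b) {
  k <- sample(3:6, 1)   # number of boxes visited by bird b
  data.frame(bird_id = b,
             cavity_size = rnorm(k),        # stand-in covariate
             chosen = as.integer(seq_len(k) == 1))
}))
fit <- clogit(chosen ~ cavity_size + strata(bird_id), data = visits)
summary(fit)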
Thank you in advance!
Iraida
I am using R/RStudio to do some analysis on genes and I want to do a GO term analysis. I currently have 10 separate FASTA files, each from a different species. Is there a way in R to use a FASTA file of genes to find enriched terms compared to the whole proteome?
Hi everyone,
I have to identify overlapping polygons, with one of the datasets containing thousands of polygons. I am using the sf package and its st_intersects function, as:
dataframe1 %>%
st_filter(y = dataframe2, .predicate = st_intersects)
which takes about 6 seconds per polygon of the first dataframe, and therefore days for my current dataframes.
The only way I have found so far to make this feasible is to first remove some polygons manually and then split the dataframe before running the intersection.
Would anyone have advice on how to make it faster?
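One thing worth trying (a sketch, using the object names above): call st_intersects once over the whole layers rather than filtering polygon by polygon, since the single call builds a spatial index and is vectorized:
library(sf)
hits <- st_intersects(dataframe1, dataframe2)   # sparse list of index matches
overlapping <- dataframe1[lengths(hits) > 0, ]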
thanks a lot!
(Matlab, R, Python, ...??)
taking into account the current evolution of artificial intelligence
Hello,
I have a variable in a dataset with ~500 answers; it essentially represents participants' answers to an essay question. I am interested in how many words each individual has used in the task and I cannot seem to find a function in R to calculate/count each participant's words in that particular variable.
Is there a way to do this in R? Any packages you think could help me do this? Thank you so much in advance!
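A minimal sketch in base R (the data frame and column names are illustrative):
df <- data.frame(essay = c("first answer", "a somewhat longer second answer"))
# Split each answer on whitespace and count the pieces.
word_count <- lengths(strsplit(trimws(df$essay), "\\s+"))
word_count
# Or, equivalently, with stringr:
library(stringr)
word_count <- str_count(df$essay, "\\S+")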
I am doing a meta-analysis of proportions looking at the healing rate of eardrums after a terrorist bombing.
I am using the metafor package in R. I have transformed the proportions using the logit transformation and used a random-effects model to calculate CIs. I have also estimated heterogeneity using the DerSimonian-Laird procedure.
However, the calculated proportions seem to be incorrect: when the observed proportion is 1, the back-transformed proportions are < 1. Please see the graph to see the problem.
I was wondering if anyone has encountered this, and whether it is because I should not be using the logit transformation? My heterogeneity estimate also looks strangely high.
Any help would be greatly appreciated; I'm a clinician, not an epidemiologist, so all of this is very new to me. I have attached the code from R too.
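For reference, a sketch of the logit workflow plus the usual fix for boundary proportions (the counts are illustrative): with the logit, studies at p = 1 have an undefined logit, so metafor applies a continuity correction that pulls the back-transformed estimate below 1; the Freeman-Tukey double arcsine transform (measure = "PFT") handles proportions at the boundary more gracefully.
library(metafor)
# Illustrative data: xi = healed ears, ni = total ears per study.
studies <- data.frame(xi = c(20, 15, 30, 12), ni = c(20, 18, 33, 15))
dat <- escalc(measure = "PLO", xi = xi, ni = ni, data = studies)  # logit
res <- rma(yi, vi, data = dat, method = "DL")
predict(res, transf = transf.ilogit)   # back-transformed pooled proportion
# Alternative that tolerates p = 0 or 1 without continuity correction:
dat2 <- escalc(measure = "PFT", xi = xi, ni = ni, data = studies)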
Hi all,
I would like to perform a redundancy analysis with a response coded as a distance matrix (dissimilarity in species composition). That can be done using the dbrda function in the vegan package for R. However, the problem is that I have two predictor matrices: one with "raw" data (soil variables) and the other a distance matrix calculated from "spatial" distances among sites. dbrda accepts more than one explanatory matrix, but not if one of them is a distance matrix. Does anyone know if there is another function in R able to do this? Perhaps in ade4 or phytools?
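One common workaround (a suggestion, not the only option) is to convert the spatial distance matrix into orthogonal spatial eigenvectors with vegan's pcnm(), which dbrda() can then take as ordinary numeric predictors alongside the soil variables. A runnable sketch with vegan's built-in mite data standing in for the real datasets:
library(vegan)
data(mite); data(mite.env); data(mite.xy)
# PCNM eigenvectors derived from the spatial distance matrix.
space_ev <- as.data.frame(pcnm(dist(mite.xy))$vectors)
preds <- cbind(mite.env[, c("SubsDens", "WatrCont")], space_ev[, 1:5])
mod <- dbrda(vegdist(mite) ~ ., data = preds)
anova(mod, by = "terms")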
Dear researchers,
I am studying the qmap package in R to perform bias correction (quantile mapping). I have read the help documentation of the qmap package, and all the examples are based on precipitation data. These examples include a common parameter, "wet.day", which is intended for precipitation data.
What is the difference between specific R codes for different climatic variables, such as precipitation, temperature, solar, or wind speed?
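As far as I can tell, the fitting functions themselves are variable-agnostic; the main difference is that wet.day only makes sense for precipitation and should be disabled otherwise. A sketch for temperature with simulated series (obs and mod are illustrative):
library(qmap)
set.seed(1)
obs <- rnorm(1000, 10, 5)    # observed temperatures
mod <- rnorm(1000, 12, 6)    # modelled temperatures
# wet.day = FALSE: no precipitation-style wet-day correction.
fit <- fitQmapQUANT(obs, mod, qstep = 0.01, wet.day = FALSE)
corrected <- doQmapQUANT(mod, fit)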
Hi folks,
My colleague has just asked me for advice regarding analyzing his BRUV (baited remote underwater video) dataset. It's a video camera fixed on a structure, used to record marine fishes passing by or attracted to the attached bait. It resulted in a wide dataset of species assemblage (lots of zeroes, lots of species columns). He has generated an NMDS ordination plot and ANOSIM to analyze his species assemblage dataset. He sees a spatial (geographic) separation of species composition through these methods.
Now, he wants to understand what drives this assemblage. He has an additional benthic composition dataset (% coral, sand, rubble, etc.), current strength, depth, and more abiotic data. His coauthor is suggesting fitting envfit vectors on their NMDS and using the p-values of said vectors. I don't think this is a good idea, but I'm not well versed in this topic, so I couldn't explain why convincingly. I think that because the vectors are "retrofitted" onto the ordination, the p-values are not explanatory of the species assemblage.
The alternative I could think of is running a PERMANOVA or a model. The problem with the former is that the benthic composition variables are related to each other (7 different variables, but they all add up to 100%), so they are not independent of each other.
I'm wondering if anyone has any solution to this/or can add to the explanation. Would it be reliable to run a PERMANOVA? Should he be transforming his benthic composition dataset first? Or would he be better off creating a model, and if yes, which kind?
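A PERMANOVA sketch with vegan's built-in dune data standing in for his dataset (to sidestep the sum-to-100% problem, one option is to drop one benthic category or log-ratio transform the composition first):
library(vegan)
data(dune); data(dune.env)
# by = "margin" tests each predictor after accounting for the others.
adonis2(dune ~ Management + A1, data = dune.env,
        method = "bray", by = "margin")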
Thanks!
I want to upscale (make the pixel size larger) a raster from 15m to 460m spatial resolution, using a Gaussian filter.
The goal
I have a coarse image that I want to downscale. I also have a fine-resolution band to assist the downscaling. The downscaling method I am using is called geographically weighted area-to-point regression kriging (GWATPRK). The method consists of two steps:
- GWR and,
- ATPK on the GWR's residuals.
In order to perform GWR using raster data, the rasters need to have the same pixel size. This means that my fine-resolution image needs to be upscaled to match the spatial resolution of the coarse band. This upscaling of the fine band needs to be done using a Gaussian kernel (i.e., the PSF). Is there a package in R that can help me do that? Or can anyone help write a custom function?
From here you can download the image (https://drive.google.com/drive/folders/18_1Kshb8WbT04gwOw4d_xhfQenULDXdB?usp=sharing).
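A sketch with the raster package (paths, sigma and cutoff are assumptions to tune to the actual PSF): build a Gaussian kernel with focalWeight(), smooth the fine band, then resample to the coarse grid.
library(raster)
fine   <- raster("fine_15m.tif")
coarse <- raster("coarse_460m.tif")
# Gaussian kernel: d = c(sigma, distance at which weights reach 0),
# both in map units.
w <- focalWeight(fine, d = c(230, 690), type = "Gauss")
fine_smooth <- focal(fine, w)
fine_460 <- resample(fine_smooth, coarse, method = "bilinear")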
I have performed a zero-one-inflated beta (ZOIB) regression with a logit link in R. The conditional effects can be extracted, which is useful for plotting given that they include the intervals. Yet, when I calculate backwards from the returned values to the expected value (ZOIB and Beta [without ZOI]) and compare this with the expected values the brms function conditional_effects returns, the trend lines do not match. Also, these manually derived values match the data much better; I've been scratching my head for a few days now. The question is: why don't the expected values match, and where do I go awry?
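One hedged sanity check: for a zero-one-inflated beta, the expected value is E[y] = zoi * coi + (1 - zoi) * mu, and brms::posterior_epred() computes exactly this from all distributional parameters. Comparing it with the manual back-transformation shows whether the mismatch lies in the formula or in how conditional_effects() fixes the other predictors ('fit', 'dat' and 'x' are illustrative names):
library(brms)
nd <- data.frame(x = seq(min(dat$x), max(dat$x), length.out = 100))
ep <- posterior_epred(fit, newdata = nd)   # draws of E[y] per new data row
plot(nd$x, colMeans(ep), type = "l",
     xlab = "x", ylab = "expected value of y")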
Thank you in advance
------------------
In the appendix the script, data and model