Dataset - Science topic
Explore the latest questions and answers in Dataset, and find Dataset experts.
Questions related to Dataset
I have a dataset consisting of x and y variables, to which I want to fit a mean function and a standard deviation function using maximum likelihood estimation (MLE). I am estimating the beta and alpha parameters by maximizing the likelihood in order to observe the mean and sigma trends in my data, and finally optimizing the likelihood.
Every time I pass different initial values I obtain different estimates of the mean and sigma; the model is sensitive to the initial parameters. Is there a way to address this so that I reliably obtain the best values for the mean and sigma?
P.S. I have also tried different optimisation methods, but they didn't work.
I have achieved the required results with other techniques, but I want to implement this using MLE. How can I resolve this issue?
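One standard remedy for initial-value sensitivity is a multi-start search: run the optimizer from many random starting points and keep the fit with the lowest negative log-likelihood. A minimal sketch in R, under the assumption of a linear mean beta0 + beta1*x and a log-linear standard deviation exp(alpha0 + alpha1*x) (your actual x, y vectors and functional forms may differ):
negloglik <- function(par, x, y) {
  mu    <- par[1] + par[2] * x                 # mean function
  sigma <- exp(par[3] + par[4] * x)            # exp() keeps sigma positive
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
set.seed(1)
starts <- matrix(rnorm(50 * 4, sd = 2), ncol = 4)    # 50 random starting points
fits <- lapply(seq_len(nrow(starts)), function(i)
  optim(starts[i, ], negloglik, x = x, y = y))       # Nelder-Mead by default
best <- fits[[which.min(sapply(fits, function(f) f$value))]]
best$par    # parameters with the lowest negative log-likelihood
If all starts agree on the same optimum, the earlier sensitivity probably came from poor scaling; standardizing x and y before fitting often helps as well.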
I have a question that I would like to ask: for a data-driven task (for example, one based on machine learning), what kind of dataset counts as a good dataset? Is there a qualitative or quantitative way to describe the quality of a dataset?
Decision-making is an important concept in social networks. If anyone has a relevant website link, please share it.
I am trying to apply a machine-learning classifier to a dataset, but the dataset is in the .pcap file format. How can I apply classifiers to this dataset?
Is there a process to convert the dataset to .csv format?
Thanks,
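One common route, sketched below, is to export packet fields with tshark (the command-line tool that ships with Wireshark) and then read the resulting CSV; the file name capture.pcap and the chosen fields are only examples:
cmd <- paste(
  "tshark -r capture.pcap -T fields",
  "-e frame.time_epoch -e ip.src -e ip.dst -e ip.proto -e frame.len",
  "-E header=y -E separator=,",
  "> capture.csv"
)
system(cmd)                          # call tshark from R
packets <- read.csv("capture.csv")   # now usable for feature extraction and classifiers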
I possess J-V data (current, voltage, and current density) obtained from a simple metal-semiconductor diode. I now aim to plot the Richardson curve and Arrhenius plot to determine the Schottky barrier height (SBH). However, I am encountering a challenge: the plotted curve exhibits a negative slope, deviating from the typical trend. Additionally, I am uncertain about the appropriate values of Js or Io to employ, as well as the extraction method, despite my prior research efforts. Could someone provide a step-by-step guide, preferably using Origin software, to extract the SBH and ideality factor from the Richardson or Arrhenius plot? I have attached the dataset for reference. Your assistance would be greatly appreciated, and I would be grateful for sample data or a worksheet demonstrating the procedure.
Can anyone please tell me the database names or websites from where I can download human SNP datasets along with the quantitative traits (phenotypes) for genome-wide association studies (GWAS)?
I ask this question from the perspective that AI algorithms can automate tasks, analyze vast amounts of data, and suggest new research avenues. They can also improve research efficiency and speed up scientific progress: in analysing massive datasets, they can identify patterns and accelerate scientific discovery. These are undoubtedly benefits, but AI algorithms can also inherit biases from the data they are trained on, leading to discriminatory or misleading results that directly affect the quality of research output. Additionally, the "black box" nature of some AI systems makes it difficult to understand how they reach conclusions, raising concerns about transparency and accountability.
I am working on an EEG classification problem and have collected my own dataset. As a first step I applied filters and ICA, then decomposed the signal with db3 wavelets at level 6. After trying all the usual features (non-statistical, i.e. Hjorth parameters, entropy, and band power; and statistical, i.e. mean, median, variance, skewness, and standard deviation), accuracy with an SVM classifier under 10-fold cross-validation is stuck at 63%, whereas when I feed chunks of the cleaned and pruned data directly to the SVM, the accuracy on different chunks is 80%. What should I report in the paper? Please guide me.
Greetings,
Can I get help finding and downloading a dataset for multimodal fake news detection?
Imagine you join a new research lab and are immediately assigned a dataset that contains the same variables as those in the substance abuse dataset but comprising a new sample. You are told the lab originally collected this data set with one particular research question in mind: “Does satisfaction with life predict health outcomes?” Before doing any statistical tests, you decide to browse through the dataset and make some graphs of the results. It seems to you that in your data, satisfaction with life predicts health outcomes more strongly in males than in females, and on reflection, you can think of several theoretical reasons why that should be the case. You disregard the data on females and investigate the hypothesis that low levels of satisfaction with life (using the “swl” variable) will be positively predictive of mental ill health (using the “psych6” variable) in males. You finally do a statistical test, and obtain a very low p-value (less than .001) associated with the regression coefficient. You write a paper using this single result, concluding that there is strong evidence for your hypothesis.
Question: Is this practice p-hacking or the garden of forking paths?
For my research, I will retrieve data for each firm (100 firms) over 5 years, leading to 500 data points.
Is this dataset size sufficient for using a fixed-effects model?
Hello,
I performed a correspondence analysis on species count data associated with 2 illustrative (qualitative) variables. The problem is that the dataset is (i) relatively small, and (ii) I concatenated two factors into one, i.e. health state (binary) and species name (3 different species), as a single supplementary variable.
I expected Dim 1 and Dim 2 each to be represented by one of the two supplementary variables, but the results were different: I got health state on the first dimension together with the second illustrative variable, and the second dimension explained by species name.
I'm confused! Should I continue interpreting, or perform an MCA (multiple correspondence analysis) instead?
I am working on a deep learning model with a GAN (Generative Adversarial Network) architecture and am stuck at the dataset-creation step (using a TensorFlow function to prepare the data for my model). As a beginner, I am particularly unsure how long this function should take to execute.
My dataset includes 4 types of 2D MRI modalities along with a ground truth for 100 subjects, and for each subject there are 100 images.
Hi all,
I have collected data from 3 treatment groups (n=5) relating to 6 markers of fibrosis. When analysing these 6 datasets (with GraphPad Prism), I noticed that some were normally distributed, others were normally distributed only after logarithmic transformation, and one was not normally distributed even after logarithmic transformation.
As I'm not too familiar with this kind of result, I thought to do the following. Please let me know if this is correct or if I have misunderstood.
- For data sets that are normally distributed: Analyse data as is with parametric one-way ANOVA
- For data sets that are normally distributed after logarithmic transformation: Analyse y=ln(x) transformed data with parametric one-way ANOVA
- For data set that is not normally distributed: Analyse data as is with the non-parametric Kruskal-Wallis One-Way ANOVA. (Is there any other way to transform data as a way of normalising it?)
Would it be strange to present these 6 data sets together when some have undergone log transformation where others have not?
What should I do? I would greatly appreciate any advice
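For what it is worth, the decision flow described above is straightforward to run in R; a minimal sketch, assuming a data frame df with a numeric column marker and a grouping factor group (hypothetical names):
shapiro.test(df$marker)                         # normality on the raw scale
shapiro.test(log(df$marker))                    # normality after log transform
summary(aov(marker ~ group, data = df))         # parametric one-way ANOVA
summary(aov(log(marker) ~ group, data = df))    # ANOVA on the log scale
kruskal.test(marker ~ group, data = df)         # non-parametric fallback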
When a model is trained using a specific dataset with limited diversity in labels, it may accurately predict labels for objects within that dataset. However, when applied to real-time recognition tasks using a webcam, the model might incorrectly predict labels for objects not present in the training data. This poses a challenge as the model's predictions may not align with the variety of objects encountered in real-world scenarios.
- Example: I trained a real-time recognition model for a webcam with a set of classes lc = {a, b, c, ..., m}. The model consistently predicts classes in lc perfectly. However, when I input an object from a class that doesn't belong to lc, it still predicts something from lc.
Are there any solutions or opinions that experts can share to guide me further in improving the model?
Thank you for sharing your opinions on my problem.
Machine learning can be used to predict antibiotic resistance in healthcare settings. The problem is that most hospitals do not have electronic health records, and datasets need to be large to make models reliable. I need a dataset with a minimum of 1,000 records. Kindly assist.
I need help with the following error when loading a dataset into WEKA: "wrong number of values. Read 20, expected 22, read Token[EOL], line 3215. Problem encountered on Line: 3215."
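A quick way to locate the offending rows before loading the file into WEKA is to compare each line's field count with the header's; a sketch in R, assuming a comma-separated file (for an ARFF file, skip the @-header lines first):
lines  <- readLines("data.csv")
counts <- lengths(strsplit(lines, ","))    # number of fields on each line
which(counts != counts[1])                 # lines to inspect (WEKA reported 3215)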
I am working on a project to detect credit card fraud using machine learning, and I am looking for a recent dataset.
Thanks in advance
I have installed R, then RStudio and all the related packages, but I can't import a raw dataset from Web of Science into the biblioshiny package of RStudio. Please see the error message attached and suggest a solution if possible.
It is universally accepted that the Arithmetic Mean (AM), Geometric Mean (GM) and Harmonic Mean (HM) are three measures of the central tendency of data.
Suppose,
X1, X2, ..., Xn
are n observations in a data set with C as their central tendency.
I am trying to understand the logic, or the reason, for which each of AM, GM and HM is accepted as a measure of the central tendency C.
If I accept the argument that each of
AM(X1, X2, ..., Xn), GM(X1, X2, ..., Xn) and HM(X1, X2, ..., Xn)
converges to C as n tends to infinity, and that this is why they are regarded as measures of C, i.e. the central tendency of X1, X2, ..., Xn,
will that be correct?
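For reference, the three means of positive observations are
\mathrm{AM} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \mathrm{GM} = \Bigl(\prod_{i=1}^{n} X_i\Bigr)^{1/n}, \qquad \mathrm{HM} = \frac{n}{\sum_{i=1}^{n} 1/X_i},
and they satisfy \mathrm{HM} \le \mathrm{GM} \le \mathrm{AM}, with equality if and only if all observations are equal. Note, as a caution regarding the convergence argument above, that for a random sample these three statistics converge to three generally different population quantities (the mean of X, the exponential of the mean of log X, and the reciprocal of the mean of 1/X), not to a single common value C.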
Hello guys
I want to employ fMRI in my research.
As a first step, I want to know whether fMRI data is an image, like MRI, or whether I should treat fMRI data as a time series when it comes to analysis.
Thank you.
I am researching automatic modulation classification (AMC). I used the RADIOML 2018.01A dataset to simulate AMC and used a convolutional long short-term deep neural network (CLDNN) to model the classifier. But now I want to generate the dataset myself in MATLAB.
My question is: do you know of good sources (papers or code) that have produced a dataset for AMC in MATLAB (or Python)? Specifically, have they generated the in-phase and quadrature components for different modulations (preferably APSK and PSK)?
I want to train neural networks to evaluate the seismic performance of bridges, but the papers online are all based on the authors' own databases, which have not been published. Where can I find a relevant dataset? The dataset could include the following: yield strength of steel bars, compressive strength of concrete, number of spans, span length, seismic intensity, support type, seismic damage level, etc.
Hello everyone, I would appreciate you helping me with this question,
I am using remote-sensing-based models to calculate actual ET. I want to use the ERA5 dataset as the meteorological input.
I found that ERA-5 dataset has different pressure levels and provides (Relative humidity, Temperature, and Wind speed in u and v directions) but it doesn't provide shortwave solar radiation.
On the other hand, ERA5-Land provides (2m temperature, Surface solar radiation downwards, 10m u-component of wind, 10m v-component of wind) but doesn't provide relative humidity.
My question is:
- Which one is better to use ERA5 or ERA5-land?
If I use ERA5, I face two problems: the first is identifying the appropriate pressure level, and the second is obtaining the shortwave radiation.
If I use ERA5-Land, I don't know how to get relative humidity.
- Also, can I use ERA5 for the relative humidity and the ERA5-land for all other parameters?
- Finally, what is the shortest way to get hourly data for certain days (20 days) at a specific location (E, N) as a CSV file?
I am working on a situation where I have more than one independent (feature) variable, say four, and one dependent (target) variable. My confusion is whether, when the XGBoost algorithm is applied to such a dataset, it treats each feature variable individually during prediction, as linear regression does, or uses them as a group when building the decision trees.
If any reference paper or book on supervised machine-learning algorithms is available, kindly share it.
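For illustration: gradient-boosted trees consider all features jointly, choosing at each split whichever feature is most useful, rather than fitting each feature as a separate additive term. A hedged sketch with the R xgboost package (the feature and target column names are hypothetical):
library(xgboost)
X <- as.matrix(df[, c("f1", "f2", "f3", "f4")])   # the four feature columns
y <- df$target
bst <- xgboost(data = X, label = y, nrounds = 50,
               objective = "reg:squarederror", verbose = 0)
xgb.importance(model = bst)    # how much each feature contributed to the trees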
I have a dataset of lung cancer images with 163 samples (2D images). I fine-tune deep learning models to classify the samples, but the validation loss does not decrease. I augmented the data and used dropout, but the validation loss still didn't drop. How can I solve this problem?
What are the popular facial-image datasets for detecting autism, and what are their sources?
Hi
I have attached two images of scatter plots of some datasets. From these plots, it seems that the data are not linearly separable.
Can anyone please confirm my understanding?
Thanks and Regards
Monika
I currently have the correlation/covariance matrix for a set of variables, as well as the output from a regression analysis, but lack access to the underlying raw dataset. Given these constraints, would it be feasible to conduct a comprehensive analysis of endogeneity? If so, I would greatly appreciate guidance on methodologies or statistical techniques that could be employed to investigate the presence of endogeneity under these circumstances.
Thanks,
Harshavardhana
Hi all,
I am looking for public repositories or services that are willing to host some large scientific data sets in the range of hundreds of GBs. Besides the universities' infrastructures we have found Hugging Face as another option. At the same time, public funding agencies would prefer some public non-commercial platforms. Looking for ideas to complete the list below:
For small data:
- University servers
- Zenodo (<50GB per default)
- GitHub (<2GB free version)
For large data:
- Hugging Face (<5GB per file, multiple files possible)
- ? any ideas?
Thanks!
For example, the dependent variable is daily stock returns and the independent variables are company characteristics such as firm size, leverage, etc.
I urgently need the DEAP dataset. I didn't receive a username and password from the officials; could someone please help me with the dataset, or the credentials, if you have them?
Are synthetic thermal images useful in bio-medical image processing for diagnostic purposes? Please share some resources of such data sets.
1. I'm seeking soil property raster datasets with resolutions matching those of Sentinel or Landsat imagery, as SoilGrids data are currently available only at a 250m resolution. Therefore, I'm interested in finding options with finer resolutions of 10/30m. What are the best available alternatives with finer resolutions?
2. Is there a dataset specific to the Indian context that provides more accurate and locally optimized soil data?
Specifically, when using the data from 1998 to 2014 as the training dataset, the Ljung-Box (Q) statistic, particularly Ljung-Box (18), is not generated in SPSS. However, if the analysis incorporates the entire dataset from 1998 to 2021, the statistic is produced without issue.
I am currently researching phishing detection using URLs, with a logistic regression model. I have a dataset with a 10:1 ratio: 20k legitimate and 2k phishing URLs.
Greetings all,
Our team, '#THE GLOBEST TEAM,' presents a #great opportunity to participate in several #American dataset clinical studies related to the #internal medicine field.
Your mission is to #write a specific section of the article and we will review your work. Once we have reviewed the manuscript, you will make any necessary edits related to your mission.
If you have previously participated in #10 original research studies or reviews, and you are a #strong scientific writer, please leave a comment with your #name, #Google Scholar account, and #email address.
#Research #opportunity #clinicalresearch #USAresearch #National Center for Health Statistics #THE GLOBEST TEAM
A dataset is urgently needed for EEG signals in children with autism.
I have studied a paper entitled "Transcriptome analysis of the response to chronic constant hypoxia in zebrafish hearts". I want to know how the fold change (FC) values reported in this paper were calculated: is there a specific formula, or was it determined entirely by software? I have also accessed the relevant GEO accessions at NCBI-GEO for the same paper (comparing hypoxia and normoxia samples as diseased and control, respectively). But the value given there is a negative log FC for upregulated transcripts (based on individual profile transcript IDs), whereas the paper reports positive values for upregulated transcripts. I cannot account for this discrepancy. I have also tried to calculate the FC from the individual expression values for each transcript and each unique accession for the normal and hypoxia samples, but my FC values are totally different from the paper's. I want to use the microarray data available from GEO in my work. Is there a specific method for processing data from such microarray expression sets? Please share it with me.
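For reference, the usual definitions are
\text{FC} = \frac{\bar{x}_{\text{hypoxia}}}{\bar{x}_{\text{normoxia}}}, \qquad \log_2 \text{FC} = \log_2\left(\frac{\bar{x}_{\text{hypoxia}}}{\bar{x}_{\text{normoxia}}}\right),
so an upregulated transcript has FC > 1 and log2 FC > 0, and reversing the comparison direction (normoxia over hypoxia) flips the sign of log2 FC, which is one common source of the kind of discrepancy described above. Another point worth checking rather than assuming: microarray values deposited in GEO are often already log-transformed, so ratios computed directly from them are not on a linear scale.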
I learned that multiple researchers successfully obtained MOOC datasets from Stanford via the CAROL website: https://datastage.stanford.edu/. The data request form was placed at http://carol.stanford.edu/research. However, the domain carol.stanford.edu, as well as the Center for Advanced Research through Online Learning (CAROL) itself, recently disappeared from the Internet. Consequently, I have no way to submit a request for the datasets I need.
Do you know another URL to submit the data request form, or any alternative solution/repository to obtain some MOOC learners' interaction data from well known course on Coursera or edX?
Thanks in advance
Can artificial intelligence help improve sentiment analysis of changes in Internet user awareness conducted using Big Data Analytics as relevant additional market research conducted on large amounts of data and information extracted from the pages of many online social media users?
In recent years, more and more companies and enterprises, before launching new product and service offerings, commission specialized marketing research firms to carry out sentiment analysis: analysis of changes in public mood, changes in awareness of the company's brand, recognition of the company's mission, and awareness of its offerings. This kind of sentiment analysis is carried out on computerized Big Data Analytics platforms, where a multi-criteria analytical process is applied to a large set of data and information taken from multiple websites. Among the source websites, news portals dominate, publishing news and journalistic articles on a specific issue, including on the company, enterprise or institution commissioning the study. In addition, key sources of online data include online forums and social media, where Internet users discuss various topics, including the product and service offerings of companies, enterprises, and financial or public institutions. With the growing scale of e-commerce, including sales through online stores and shopping portals, and the growing importance of online advertising campaigns and promotions, the importance of such analyses of Internet users' sentiment on specific topics is also growing, as they play a complementary role to more traditionally conducted market research. A key problem for this type of sentiment analysis is the rapidly growing volume of data and information contained in posts, comments, banners and advertising spots published on social media, as well as the constant emergence of new social media. This problem is partly addressed by increasing computing power and multi-criteria processing of large amounts of data on increasingly capable microprocessors and Big Data Analytics platforms. In addition, the capacity for advanced multi-criteria processing of large data sets in ever-shorter timeframes may increase significantly when generative artificial intelligence technology is brought into such data processing.
The key issues of opportunities and threats to the development of artificial intelligence technology are described in my article below:
OPPORTUNITIES AND THREATS TO THE DEVELOPMENT OF ARTIFICIAL INTELLIGENCE APPLICATIONS AND THE NEED FOR NORMATIVE REGULATION OF THIS DEVELOPMENT
I described the applications of Big Data technologies in sentiment analysis, business analytics and risk management in my co-authored article:
APPLICATION OF DATA BASE SYSTEMS BIG DATA AND BUSINESS INTELLIGENCE SOFTWARE IN INTEGRATED RISK MANAGEMENT IN ORGANIZATION
The use of Big Data Analytics platforms of ICT information technologies in sentiment analysis for selected issues related to Industry 4.0
In view of the above, I address the following question to the esteemed community of scientists and researchers:
Can artificial intelligence help improve sentiment analysis of changes in Internet users' awareness conducted using Big Data Analytics as relevant additional market research conducted on a large amount of data and information extracted from the pages of many online social media users?
Can artificial intelligence help improve sentiment analysis conducted on large data sets and information on Big Data Analytics platforms?
What do you think about this topic?
What is your opinion on this issue?
Please answer,
I invite everyone to join the discussion,
Thank you very much,
Best wishes,
Dariusz Prokopowicz
The above text is entirely my own work written by me on the basis of my research.
In writing this text I did not use other sources or automatic text generation systems.
Copyright by Dariusz Prokopowicz
I'm doing research exploring the application of eXplainable Artificial Intelligence (XAI) in the context of brain tumor detection. Specifically, I aim to develop a model that not only accurately detects the presence of brain tumors but also provides clear explanations for its decisions regarding positive or negative results. My main concerns are ensuring that the model's decision-making process is transparent and understanding the underlying reasoning behind its choices. I would be grateful for any thoughts, suggestions, or links to papers or web articles that address the practical application of XAI in this field (including suitable dataset types or anything else related to XAI).
Thank you.
Hello ResearchGate community! I'm working on a Fabric Defect Detection System project and need a diverse fabric defect dataset. Any recommendations or shared datasets would greatly benefit my research. Thank you for your support!
Hello everyone,
I am currently working on a Thesis about the impact of AI on Consulting firms.
I am looking for datasets surrounding this subject. If you have any data, or anything else that could help me, I would be very happy to receive your help.
Thank you very much,
Thibaud
Hello ResearchGate community,
I'm currently analyzing a dataset derived from a survey of ~200 paired responses across two time points. The survey is of teachers and students, using 60 Likert-like items to assess beliefs about education. After factor analysis, I derived three core factors.
I'm now trying to assess the relative magnitude of change in factor scores over two time points. I say relative magnitude because, for all factors, the scores decreased. So I need to see 1) whether changes were significant and 2) the size of those changes.
Preliminary tests, including Shapiro-Wilk, Q-Q plots, and outlier detection, indicated non-normality, guiding me to utilize a Wilcoxon signed-rank test.
However, I'm at a crossroads regarding the appropriate effect size measure. Traditional non-parametric effect sizes, like the rank-biserial correlation, seem to fall short for my purpose, as they primarily address the probability of a difference rather than the magnitude of change I'm interested in capturing. I've established that two factors saw a statistically significant change using the Wilcoxon signed-rank test, but I need to understand how big these decreases were and, ideally, compare the two.
I'm contemplating justifying the use of Cohen's d, or exploring median-based measures for a more accurate reflection of the change magnitude, but I'm struggling to find relevant information online. I've seen references to things like the Hodges-Lehmann estimator and simple median change, but nothing solid.
Does anyone have insights or references on how to effectively apply a median measure in this context or justify using Cohen's d with the Wilcoxon signed-rank test for ordinal Likert data?
I appreciate any guidance or shared experiences on this matter.
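On the Hodges-Lehmann route mentioned above: in R, wilcox.test() with conf.int = TRUE returns the Hodges-Lehmann estimate (the pseudomedian of the paired differences), which expresses the magnitude of change on the original score scale. A minimal sketch with hypothetical vectors score_t1 and score_t2 of paired factor scores:
res <- wilcox.test(score_t2, score_t1, paired = TRUE, conf.int = TRUE)
res$estimate    # Hodges-Lehmann shift (pseudomedian of the differences)
res$conf.int    # confidence interval for that shift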
I have a dataset from an HPLC analysis. I would like to open the HPLC data file, but the MassLynx software version I have (4.1) is not compatible with the file settings. Can anyone suggest alternative software that I can use to open this dataset?
The file was created to be opened in MassLynx version 4.2, but I do not have access to that version.
Thank you.
I found that the structure for TiFeSi is given differently in ICSD and Pearson crystal database as follows:
The Wyckoff positions given in the Pearson database (dataset no. 1822291):
Ti1 Ti 4 b 0.25 0.2207 0.0206
Ti2 Ti 4 b 0.25 0.4979 0.1677
Ti3 Ti 4 b 0.25 0.7996 0.0463
Fe1 Fe 8 c 0.5295 0.1236 0.3699
Fe2 Fe 4 a 0 0 0.0
Si1 Si 8 c 0.506 0.3325 0.2452
Si2 Si 4 b 0.25 0.0253 0.2554
The Wyckoff positions given in the ICSD database (database code 41157):
Ti1 Ti0+ 4 b 0.25 0.2004(7) 0.2964(14)
Ti2 Ti0+ 4 b 0.25 0.7793(6) 0.2707(14)
Ti3 Ti0+ 4 b 0.25 0.9979(6) 0.9178(15)
Fe1 Fe0+ 8 c 0.0295(7) 0.3764(4) 0.12
Fe2 Fe0+ 4 a 0 0 0.2501(12)
Si1 Si0+ 8 c 0.0060(13) 0.1675(9) 0.9953(18)
Si2 Si0+ 4 b 0.25 0.9747(11) 0.5055(23)
The lattice parameters in the two databases are, however, almost the same.
Which one should I take for ab initio calculations or XRD Rietveld refinement?
RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
res = hypotest_fun_out(*samples, **kwds)
The above warning occurred in Python. The dataset was first normalised, and the warning then appeared while performing the t-test, although the output was still displayed. Kindly suggest some methods to avoid this warning.
I am conducting a study on the "Impact of Land Use and Land Cover (LULC) Changes on Land Surface Temperature (LST)" and plan to use Google Earth Engine (GEE) for my analysis. I am at a crossroads in deciding between the "USGS Landsat 8 Level 2, Collection 2, Tier 1" dataset and the "USGS Landsat 8 Collection 2 Tier 1 TOA Reflectance" dataset for LULC classification.
Could the community provide insights on:
- Which dataset would be more suitable for LULC classification, especially in the context of analyzing its impact on LST?
- What specific pre-processing steps would be recommended for the preferred dataset within the GEE environment to ensure data integrity and robustness of the classification?
Any shared experiences, particularly those related to the use of these datasets in GEE for LULC and LST studies, would be incredibly valuable.
Thank you for your contributions!
I'm trying to perform an RNA-seq data analysis, and at the first step a few questions came to mind that I would like to understand. Please help me with these questions.
1) In the 1st image, the raw reads from NCBI-SRA are marked 1 and 2 at the ends of the read names. What does this mean? Do they indicate forward and reverse reads?
2) In the second image, I was trying to run Trimmomatic on this dataset. I chose "paired-end as a collection", but it does not accept any input even though my data are there in "fastqsanger.gz" format. Why is that? Should I treat this paired-end data as single-end data when running Trimmomatic?
3) In the 3rd and 4th images, I collected the same data from ENA, which provides two separate files for the 1- and 2-marked reads in SRA. I then processed them in Trimmomatic using "paired-end as individual datasets" and ran it. Trimmomatic gives me 4 output files. Why is that, and which ones should be used for alignment?
A big thank you in advance :)
I'm performing RNA-seq data analysis. I want to do healthy vs disease_stage_1, Healthy vs disease_stage_2, and Healthy vs disease_stage_3. In the case of healthy, disease_stage_1, disease_stage_2, and disease_stage_3 data sets, I have 19, 7, 8, and 15 biological replicates respectively.
Does this uneven number of replicates affect the data analysis?
Should I use an equal number of replicates for every dataset, e.g. 7 biological replicates each (as the lowest number of replicates here is 7)?
Hi, where can I find a benchmark hourly solar irradiance dataset? And which criteria are essential for it?
Thanks in advance
I would like to explain my question. During a discussion with my colleague (Prof. Dr. Amer), he told me that datasets from other fields, such as communications engineering, can also be used in civil engineering applications. He has used this in structural analysis, for example for slab shear and other elements. Is it as easy to do this in geotechnical engineering (soil improvement, pile groups, etc.)?
A large dataset of more than 350,000 instances has been imported into an Orange3 file. When running predictions, some of the resulting data turn out to be incomplete, which makes them difficult to analyse. What is the technical solution?
Hi Everyone,
I am looking for a dataset to work on customer churn prediction. I found data on Frontier Airlines at statista.com, but it was too expensive to buy. Are there any datasets related to the apparel industry that include customer feedback?
Our work with the outstanding Canadian Prairie data is summarized in this BPI Book
Emerging Issues in Environment, Geography and Earth Science Vol. 4
ISBN 978-81-967981-3-0 (Print)
ISBN 978-81-967981-4-7 (eBook)
DOI: 10.9734/bpi/eieges/v4
We strongly recommend this overview. This Prairie dataset is hourly for over fifty years and is calibrated back to standards. There is no comparable dataset for analyzing land-surface processes and the surface energy balance, both diurnally and seasonally, across time and climate change.
There are five chapters in the book as well as a brief overview of a further six papers of the Prairie data analysis, ECMWF model comparisons and model development.
Hello everyone.
I'm very new to ML and DL models, and I want to train and test a CNN on my dataset. I have a very large dataset, already split into 80% train and 20% test.
I wrote code in Python using TensorFlow to train the CNN, and I'm not sure if it's correct. Now I am struggling with testing the CNN.
Can anyone help me make sure I trained the CNN correctly and tell me how to test it, either by giving advice or by pointing me to useful resources?
I will be truly grateful.
#CNN #ML #MachineLearning #DeepLearning #ConvolutionalNeuralNetwork
Using simulation, how can one generate a dataset for task scheduling, with task characteristics and VM characteristics, so as to train ML models?
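As a starting point, a toy dataset can be synthesized directly; a hedged R sketch in which every column name and value range is an invented placeholder, meant to be replaced by output from a proper simulator (e.g. CloudSim):
set.seed(42)
n <- 1000
sched <- data.frame(
  task_id   = 1:n,
  task_len  = round(runif(n, 1e3, 1e5)),              # task length (million instructions)
  vm_mips   = sample(c(500, 1000, 2000), n, TRUE),    # VM speed
  vm_ram_gb = sample(c(2, 4, 8), n, TRUE)             # VM memory
)
sched$exec_time <- sched$task_len / sched$vm_mips     # label for the ML model to learn
write.csv(sched, "scheduling_dataset.csv", row.names = FALSE)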
How does the incorporation of diverse datasets affect the performance and bias mitigation of ChatGPT in various conversational contexts?
Please give valuable information.
I need an MRI image dataset for HCC to extract some features from; I would be delighted if someone could mention a specific dataset.
How can aerodynamicists generate a sufficient dataset for aerodynamic problems, and what is the time cost for this step (considering a simple 3D problem)?
I need help regarding datasets for the early detection of neurological disorders.
If someone knows of any, please explain briefly!
Thanks a lot.
I have a dataset from the Kaggle site that includes information on speed and location; based on these values, the attacker is identified.
I feed this dataset into a learning algorithm for training.
After that, I have a second dataset resulting from a simulation of the vehicle network, to which an attack rate is applied.
I then run the algorithm on the second dataset to obtain the detection accuracy and the confusion matrix.
Is my thinking correct or not? Please tell me.
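The outline (train on the labeled set, evaluate on the simulated set) is sound. A hedged sketch in R, where randomForest is just one possible learner and the column names speed, location and attacker are assumptions:
library(randomForest)
model <- randomForest(as.factor(attacker) ~ speed + location, data = kaggle_set)
pred  <- predict(model, newdata = sim_set)
table(Predicted = pred, Actual = sim_set$attacker)    # confusion matrix
mean(pred == sim_set$attacker)                        # accuracy on the simulated set
The main caveat is distribution shift: if the simulated network differs systematically from the Kaggle data, accuracy on the second dataset may not reflect real performance.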
task x elemental + creation = relative
task defines available resource given force multiplication, economies of scale and LaGrange control(er$)
Hello Everyone !
I am currently analyzing the emissions of around 300 companies over a time span of roughly 20 years (time series data!). I am wondering what the best way is to approach the analysis of this dataset, and what methods I can use to draw insights from it.
I was thinking of starting by indexing my dataset (since companies don't have the same volume of emissions) and then averaging these indexes according to specific characteristics of the companies (e.g. size, country, etc.) in order to pick up trends.
After the descriptive statistical analysis, I was thinking I could top off my analysis with a regression of emissions on the type of company (investment company, state-owned, etc.). For that matter, is there a specific statistical test I can use to regress time series data on a specific independent variable?
Let me know what you think of this approach. I am listening to your comments!
Cordially,
Diego Spaey
Basically, I have 40 subjects, and for each I recorded:
- coronary cannulation before TAVR, classified as selective, non-selective, or unsuccessful.
I then collected the same data after TAVR, with the same 3 outcome levels.
This is a case of repeated measures with a multi-level outcome. In addition, my contingency table is not square, because there are no "non-selective" outcomes in the "before TAVR" group.
Here is my dataset:
            AfterTAVR
BeforeTAVR   0    1    2
         0   1    0    0
         1   1   16   22
McNemar's test, the Stuart-Maxwell test and Cochran's Q test seem ruled out, due to the non-binary outcome and the non-square (3x2) matrix of my dataset.
Does anyone have suggestions? I will really appreciate it.
I have selected two deep learning models, a CNN and an SAE, for the analysis of a 1-D digitized dataset. I need to justify the choice of these two DL models in comparison to other DL and standard ML models. I am using a GA to optimize the hyperparameter values of the two DL models. Can you give some input on this query? Thanks.
Hello,
I am looking for the best downscaling technique to correct a precipitation climate change dataset. I am not sure which of these two methods is more robust for my task.
Thanks!
So, I have datasets of precipitation and temperature for the years 1980-2020; how do I plot a similar map? I am not sure how to proceed. I have annual precipitation for approximately 15 stations in my catchment. If I take average annual rainfall values, that gives an average annual map, if I am not wrong; then how do I find the change in precipitation? Should I use a formula to find the value for each station?
Hello,
I am trying to remove outliers from a dataset. I removed a few, but new ones keep appearing. What should I do?
Please do not post AI generated answers.
Good morning. I have two datasets with exactly the same columns. I would like to select rows that have a matching ID between the two datasets (please see the tables below). I tried to merge the datasets with the rbind function, but all rows were included. Do you have any advice on keeping only the rows with a matching ID?
Input
df1
ID VAR 1 VAR 2
a ... ...
b ... ...
c ... ...
d ... ...
df2
ID VAR 1 VAR 2
a ... ...
b ... ...
e ... ...
f ... ...
Output
df
ID VAR 1 VAR 2
a ... ...
b ... ...
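rbind() stacks the rows of both data frames, which is why every row was kept. Keeping only the rows of df1 whose ID also appears in df2 is a semi-join; two equivalent sketches:
df <- df1[df1$ID %in% df2$ID, ]       # base R: keep df1 rows with a matching ID
library(dplyr)                        # or the dplyr equivalent
df <- semi_join(df1, df2, by = "ID")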
Firth logistic regression is a special version of usual logistic regression which handles separation or quasi-separation issues. To understand the Firth logistic regression, we have to go one step back.
What is logistic regression?
Logistic regression is a statistical technique used to model the relationship between a categorical outcome/predicted variable, y(usually, binary - yes/no, 1/0) and one or more independent/predictor or x variables.
What is maximum likelihood estimation?
Maximum likelihood estimation is a statistical technique to find the best representative model that represents the relationship between the outcome and the independent/predictor variables of the underlying data (your dataset). The estimation process calculates the probability of different models to represent the dataset and then selects the model that maximizes this probability.
What is separation?
Separation means an empty bucket on one side! Suppose you are trying to predict meeting physical activity recommendations (outcome: 1/yes or 0/no) and you have three independent or predictor variables: gender (male/female), socio-economic condition (rich/poor), and incentive for physical activity (yes/no). Suppose the combination gender = male, socio-economic condition = rich, incentive for physical activity = no always predicts not meeting the physical activity recommendation (outcome: 0/no). This is an example of complete separation.
What is quasi-separation?
Reconsider the above example. We have 50 adolescents with the combination gender = male, socio-economic condition = rich, incentive for physical activity = no. For 48 or 49 of them (nearly, but not exactly, all 50), the outcome is "not meeting the physical activity recommendation" (outcome: 0/no). This is an instance of quasi-separation.
How separation or quasi-separation may impact your night sleep?
When separation or quasi-separation is present in your data, traditional logistic regression will keep increasing the coefficients of the predictors/independent variables without limit (loosely speaking, towards infinity) in order to reproduce the empty, or nearly empty, bucket on one side of the outcome. When this anomaly happens, it is telling you that the traditional logistic regression model is no longer appropriate here.
There is a bookish name for the issue: a convergence problem. But how do you know convergence issues have occurred with your model?
- Very large coefficient estimates. The estimates could be near-infinite too!
- Along with large coefficient estimates, you may see large standard errors too!
- It may also happen that the logistic regression tried several times (known as iterations) but failed to find the best model or, in bookish language, failed to converge.
What to do if such convergence issues have occurred?
Forget all the hard work you have done so far! You have to start a new journey with an alternative, known as Firth logistic regression. But what does Firth logistic regression actually do? Without using too many technical terms: it penalizes the likelihood slightly, which leads to more reliable (finite) coefficients and ultimately helps to choose the best representative model for your data.
How to conduct Firth logistic regression?
First install the package "logistf" and load it in your R-environment.
install.packages("logistf")
library(logistf)
Now, assume you have a dataset "physical_activity" with a binary outcome variable "meeting physical activity recommendation" and three predictor/independent variables: gender (male/female), socio-economic condition (rich/poor), and incentive for physical activity (yes/no).
pa_model <- logistf(meet_PA ~ gender + sec + incentive, data = physical_activity)
Now, display the result.
summary(pa_model)
You got log odds. Now, we have to convert them into odds ratios.
odds_ratios_pa <- exp(coef(pa_model))
print(odds_ratios_pa)
Game over! Now, how to explain the result?
Don't worry! There is nothing special. The interpretation of Firth logistic regression's results is the same as for a traditional logistic regression model. However, if you are struggling with the interpretation, let me know in the comments. I will try my best to reduce your stress!
Note: If you find any serious methodological issue here, my inbox is open!
Hi everyone! I tried to perform a classic one-way ANOVA with the GAD package in R, followed by an SNK test, which I have always used, but it didn't work with this dataset, and I got the same error for both tests:
"Error in if (colnames(tm.class)[j] == "fixed") tm.final[i, j] = 0 :
missing value where TRUE/FALSE needed"
I understand that something is producing NA values in my dataset, but I do not know how to fix it. There are no NA values in the dataset itself. Here is the dataset:
temp Filtr_eff
gradi19 11.33
gradi19 15.90
gradi19 10.54
gradi26 11.01
gradi26 -1.33
gradi26 9.80
gradi30 -49.77
gradi30 -42.05
gradi30 -32.03
So, I have three levels of the factor temp (gradi19, gradi26 and gradi30), and my variable is Filtr_eff. I have also already set the factor as fixed.
Please help me: how do I fix this error? I could run the ANOVA with another package (the car library, for example, worked with this dataset) and use Tukey instead of SNK, but I want to understand why I get this error, since it has never happened to me before. Thanks!
PS: I attached the R and txt files
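A hedged guess at the cause: GAD builds its term-classification table from factors declared with as.fixed() or as.random(), and the error above typically appears when the factor in the fitted model was not carrying that declaration at the moment gad() was called (for example, after re-reading the data). A sketch of the order GAD expects, assuming the data frame is called dat:
library(GAD)
dat$temp <- as.fixed(as.factor(dat$temp))    # declare the factor fixed after loading
mod <- lm(Filtr_eff ~ temp, data = dat)
gad(mod)                                     # GAD one-way ANOVA
snk.test(mod, term = "temp")                 # SNK post-hoc on temp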
Short Course: Statistics, Calibration Strategies and Data Processing for Analytical Measurements
Pittcon 2024, San Diego, CA, USA (Feb 24-28, 2024)
Time: Saturday, February 24, 2024, 8:30 AM to 5:00 PM (Full day course)
Short Course: SC-2561
Presenter: Dr. Nimal De Silva, Faculty Scientist, Geochemistry Laboratories, University of Ottawa, Ontario, Canada K1N 6N5
Email: [email protected]
Abstract:
Over the past few decades, instrumental analysis has come a long way in terms of sensitivity, efficiency, automation, and the use of sophisticated software for instrument control and data acquisition and processing. However, the full potential of such sophistication can only be realized with the user’s understanding of the fundamentals of method optimization, statistical concepts, calibration strategies and data processing, to tailor them to the specific analytical needs without blindly accepting what the instrument can provide. The objective of this course is to provide the necessary knowledge to strategically exploit the full potential of such capabilities and commonly available spreadsheet software. Topics to be covered include Analytical Statistics, Propagation of Errors, Signal Noise, Uncertainty and Dynamic Range, Linear and Non-linear Calibration, Weighted versus Un-Weighted Regression, Optimum Selection of Calibration Range and Standard Intervals, Gravimetric versus Volumetric Standards and their Preparation, Matrix effects, Signal Drift, Standard Addition, Internal Standards, Drift Correction, Matrix Matching, Selection from multiple responses, Use and Misuse of Dynamic Range, Evaluation and Visualization of Calibrations and Data from Large Data Sets of Multiple Analytes using EXCEL, etc. Although the demonstration data sets will be primarily selected from ICPES/MS and Chromatographic measurements, the concepts discussed will be applicable to any analytical technique, and scientific measurements in general.
Learning Objectives:
After this course, you will be familiar with:
- Statistical concepts, and errors relevant to analytical measurements and calibration.
- Pros and cons of different calibration strategies.
- Optimum selection of calibration type, standards, intervals, and accurate preparation of standards.
- Interferences, and various remedies.
- Efficient use of spreadsheets for post-processing of data, refining, evaluation, and validation.
Access to a personal laptop during the course would be helpful for participants, although internet access during the course is not necessary. Some sample and exercise spreadsheets and course material will need to be distributed (emailed) to participants the day before the course.
Target Audience: Analytical Technicians, Chemists, Scientists, Laboratory Managers, Students
Register for Pittcon: https://pittcon.org/register
The datasets are provided as medians and interquartile ranges. Can we perform a pooled analysis? How do we convert these variables to obtain the number of events (N)?
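If a pooled analysis on the mean scale is intended, commonly used approximations (e.g. Wan et al., 2014, BMC Medical Research Methodology) convert a median m and quartiles q1, q3 into an approximate mean and SD; the large-sample forms, which assume roughly symmetric, near-normal data, are
\bar{X} \approx \frac{q_1 + m + q_3}{3}, \qquad S \approx \frac{q_3 - q_1}{1.35}.
Event counts (N), however, cannot be recovered from medians and IQRs; they have to come from the sample sizes and event numbers reported in the source papers.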
I have three RNA-Seq datasets of the same tissue and want to analyse them on Galaxy. My initial literature survey gave me the idea that I can merge the three datasets if they are from the same model and tissue, then form two groups (Control and Test) and run the analysis. Am I correct?
Can somebody with more experience elaborate on this?
Or it is a better idea to analyse the three datasets separately and find the common mRNAs?
How to integrate two different ML or DL models in a single framework?
Outlier detection criteria in a dataset.
I'm not getting any solutions for DE analysis in R and also can't figure out which dataset should be used for this type of analysis. Looking for some help!
I want to get data for climatic variables from the Climate Research Unit dataset for analysis
Seeking insights for algorithmic optimization.
Hello everyone
I am working on my master's thesis.
I have 3 latent IVs and one ordered DV.
Can I use GSEM to deal with my dataset in Stata?
Thank you
I am an MSc student and my thesis is framed around developing a CNN-based approach to predict soil carbon hotspots using remote sensing data. Soil carbon hotspots are areas where the concentration of organic carbon in the soil is unusually high. These hotspots are important because they play a critical role in the global carbon cycle, helping to regulate the Earth's climate. This research will focus on developing a CNN-based approach to predict soil carbon hotspots, which can be used to identify areas that are particularly important for soil carbon sequestration. I am writing to ask for assistance in accessing a dataset combining remote-sensing and satellite data that I can use in my thesis. Thank you for your time and consideration.
Hello!
I want to delineate groundwater potential zones using the FR (frequency ratio) model.
But I am confused about the groundwater well data used in different studies. Most papers divide the well data into two sets (training and testing) for FR calculation and validation, but my study area has only 19 wells. Can I divide them into training and testing datasets, or should all 19 wells be used for both FR calculation and validation?
Please advise.
Thanks in advance.
Lately I have been working on scRNA-seq analysis. It took me some time to find a proper dataset on GEO, with accession ID GSE157783. I expected three files per sample from the dataset, but found that the authors only provided 3 files in total.
Besides, the format of the 3 files differs from that mentioned in the online courses. I believe files ending in "tsv.gz" are needed, but here I found only 3 "tar.gz" files.
Hope someone can help :(
I need your help, please!
For my research paper, in order to develop my dataset, I filled the missing observations using interpolation/extrapolation methods, and I need to ensure the quality and behaviour of the data before starting my analysis. Could you kindly provide more details on the specific steps and methodologies to employ to ensure the meaningfulness and verifiability of the results? I am particularly interested in understanding:
- The quality-assurance measures to take before and after applying interpolation/extrapolation techniques.
- Whether a trend approach should be adopted to reflect developments within the periods of missing data.
- Any diagnostic tests to conduct to validate the reliability of the filled data.
Thank you in advance for your time and consideration.
The iris images of CASIA Iris V3 Lamp were acquired under varying illumination, and local thresholding fails to segment some of the iris images. The alternative is adaptive thresholding: does this type of thresholding technique perform well?
Hi,
I want to study doping-effect characterization using ellipsometry. I have 5 datasets of n and k values for doped thin films. Is there software available to simulate ellipsometry and obtain parameters such as reflectance and delta for further analysis? I tried ANSYS Lumerical but couldn't find any good information about ellipsometry simulation.
Thanks.
Supervised Learning
In supervised learning, the dataset is labeled, meaning each input has an associated output or target variable. For instance, if you're working on a classification problem to predict whether an email is spam or not, each email in the dataset would be labeled as either spam or not spam. Algorithms in supervised learning are trained using this labeled data. They learn the relationship between the input variables and the output by being guided or supervised by this known information. The ultimate goal is to develop a model that can accurately map inputs to outputs by learning from the labeled dataset. Common tasks include classification, regression, and ranking.
Unsupervised Learning
Unsupervised learning deals with unlabeled data, where the information does not have corresponding output labels. There's no specific target variable for the algorithm to predict. Algorithms in unsupervised learning aim to find patterns, structures, or relationships within the data without explicit guidance. For instance, clustering algorithms group similar data points together based on some similarity or distance measure. The primary goal is to explore and extract insights from the data, uncover hidden structures, detect anomalies, or reduce the dimensionality of the dataset without any predefined outcomes.
In short: supervised learning uses labeled data with known outcomes to train models for prediction or classification tasks, while unsupervised learning works with unlabeled data to explore and discover inherent patterns or structures without explicit guidance on the expected output. Both have distinct applications and are used in different scenarios based on the nature of the dataset and the desired outcomes.
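A toy illustration of the contrast in R, using the built-in iris data (a sketch, not a recipe): the supervised model is guided by labels, while k-means has to discover structure on its own.
sup <- glm(I(Species == "versicolor") ~ Sepal.Length + Sepal.Width,
           data = iris, family = binomial)      # supervised: labels guide the fit
clus <- kmeans(iris[, 1:4], centers = 3)        # unsupervised: labels never used
table(Cluster = clus$cluster, Species = iris$Species)   # compare clusters to truth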
In the realm of machine learning, the availability of large and diverse datasets is often crucial for effective model training. However, in certain domains where data is limited or privacy concerns are paramount, exploring the use of synthetic datasets emerges as a compelling alternative.
Question: How can the adoption of synthetic datasets revolutionize machine learning applications in areas with data scarcity and stringent privacy considerations?
I have a dataset that includes 1,900 companies, and I surveyed 10 employees in each company. There is a question about the risk preference of each employee. Now I need to calculate the ICC1 and ICC2 values for each company. Each company has a unique company_id, and in the employee-level dataset (19,000 rows) each employee is matched to a company via company_id. In this case, how do I get the ICC1 and ICC2 values in R? I have been trying for a few days and hope someone can solve my problem.
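A hedged sketch of the usual one-way ANOVA route (Bliese's formulas). Note that ICC(1) and ICC(2) describe the company grouping as a whole rather than giving one value per company; per-company agreement is usually assessed with rwg instead. Assuming an employee-level data frame employees with columns company_id and risk_pref (hypothetical names):
fit <- aov(risk_pref ~ as.factor(company_id), data = employees)
ms  <- summary(fit)[[1]][["Mean Sq"]]
MSB <- ms[1]; MSW <- ms[2]                  # between- and within-company mean squares
k   <- 10                                   # employees surveyed per company
ICC1 <- (MSB - MSW) / (MSB + (k - 1) * MSW) # reliability of a single rating
ICC2 <- (MSB - MSW) / MSB                   # reliability of the group mean
c(ICC1 = ICC1, ICC2 = ICC2)
If memory serves, the multilevel package's mult.icc() computes the same quantities from a fitted lme model, which is worth cross-checking.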