
Dataset - Science topic

Explore the latest questions and answers in Dataset, and find Dataset experts.
Questions related to Dataset
  • asked a question related to Dataset
Question
6 answers
I have a data set consisting of x and y variables, and I want to perform maximum likelihood estimation (MLE) to fit a mean function and a standard deviation function to the data. I am estimating the beta and alpha parameters by maximum likelihood in order to observe the mean and sigma trends in my data, and finally optimizing the likelihood. Every time I pass different initial values I obtain different values for the mean and sigma, so my model is sensitive to the initial parameters. Is there any way to address this situation so that I can obtain the best values for the mean and sigma that automatically fit my model?
P.S. I have also tried different optimisation methods, but that didn't work.
I have achieved the required results by implementing it with other techniques, but I want to implement it using MLE. How can I cope with this issue?
Relevant answer
Answer
By sigma trend, do you mean you are also fitting the standard deviation of the error term as a function of X? If so, you could try first with a fixed variance for the error term: models with non-fixed variance can be more difficult to fit.
You should also check that your likelihood is realistic. For instance, sigmoid models are often fitted to data constrained between 0 and 1 (up to a rescaling), which is not compatible with a Gaussian random variable, so this may interfere with the MLE fit.
Last, if the initial value has a significant impact, and if you really think you are not overfitting the data (one of the most common causes of instability), you may use a grid search on the initial values to find the global minimum. You may also explore the shape of the MLE obtained as a function of the initial values to see whether a pattern appears.
Good luck,
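A minimal sketch of the multi-start/grid-search idea mentioned above, assuming a Gaussian likelihood whose mean is linear in x and whose standard deviation is log-linear in x (a hypothetical parameterisation; swap in your own mean and sigma functions):

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, x, y):
    # Hypothetical parameterisation: mean linear in x, sigma log-linear in x
    b0, b1, a0, a1 = params
    mu = b0 + b1 * x
    sigma = np.exp(a0 + a1 * x)  # exp keeps sigma strictly positive
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((y - mu) / sigma) ** 2)

def multi_start_mle(x, y, n_starts=50, seed=0):
    # Run the optimiser from many random starting points and keep the best fit
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        start = rng.uniform(-2, 2, size=4)
        res = minimize(neg_log_likelihood, start, args=(x, y), method="Nelder-Mead")
        if res.success and (best is None or res.fun < best.fun):
            best = res
    return best

# Example with synthetic data
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = (1.0 + 0.5 * x) + np.exp(-1.0 + 0.1 * x) * rng.normal(size=x.size)
fit = multi_start_mle(x, y)
print(fit.x)  # estimated [b0, b1, a0, a1]

If the best negative log-likelihood is reached from many different starting points, any remaining sensitivity is more likely a flat or multi-modal likelihood than an optimiser problem.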
  • asked a question related to Dataset
Question
3 answers
I have a question that I would like to ask: for a data-driven task (for example, one based on machine learning), what kind of dataset counts as an advantageous dataset? Is there a qualitative or quantitative way to describe the quality of a dataset?
Relevant answer
Answer
The "advantageous" dataset for a data-driven task is one that is relevant, sufficiently large, high-quality, representative, balanced, temporally consistent, labeled, and ethically collected, supporting reliable model training and accurate predictions.
  • asked a question related to Dataset
Question
1 answer
Decision-making is an important concept in social networks. If anyone has a website link to relevant datasets, please share it.
Relevant answer
Answer
I recommend exploring some of the primary graph database resources that offer a wide range of datasets for social network analysis. The Stanford Network Analysis Project (SNAP) hosts a diverse collection of datasets from various social networks, which can be found at http://snap.stanford.edu. Additionally, Gephi, an open-source network visualization tool, provides datasets that are particularly useful for visual analysis and can be accessed through their GitHub repository at https://github.com/gephi/gephi/wiki/Datasets. These resources should be valuable for your decision-making research in social network analysis.
  • asked a question related to Dataset
Question
2 answers
I am trying to apply a machine-learning classifier to a dataset, but the dataset is in the .pcap file format. How can I apply classifiers to this dataset?
Is there any process to convert the dataset into .csv format?
Thanks,
Relevant answer
Answer
"File" > "Export Packet Dissections" > "As CSV..." or "As CSV manually
import pyshark
import csv
# Open the .pcap file
cap = pyshark.FileCapture('yourfile.pcap')
# Open a .csv file in write mode
with open('output.csv', 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
# Write header row
writer.writerow(['No.', 'Time', 'Source', 'Destination', 'Protocol', 'Length'])
# Iterate over each packet
for packet in cap:
try:
# Extract relevant information from each packet
no = packet.number
time = packet.sniff_timestamp
source = packet.ip.src
destination = packet.ip.dst
protocol = packet.transport_layer
length = packet.length
# Write the information to the .csv file
writer.writerow([no, time, source, destination, protocol, length])
except AttributeError:
# Ignore packets that don't have required attributes (e.g., non-IP packets)
pass
(may this will help in python)
  • asked a question related to Dataset
Question
6 answers
I possess JV data (current, voltage, and current density) obtained from a simple metal-semiconductor diode. Currently, I aim to plot the Richardson curve and Arrhenius plot to determine the Schottky barrier height (SBH). However, I am encountering a challenge: the plotted curve exhibits a negative slope, deviating from the typical trend. Additionally, I am uncertain about the appropriate values of Js or Io to employ, as well as the extraction method, despite my prior research efforts. Could someone provide a step-by-step guide, preferably using Origin software, to extract the SBH and ideality factor from the Richardson or Arrhenius plot? I have attached the dataset for reference. Your assistance would be greatly appreciated, and I would be grateful for a sample dataset or worksheet demonstrating the procedure.
Relevant answer
Answer
Jürgen Weippert Hello, thank you for your support. I got help from a senior colleague at another university; I had to use the modified Richardson plot to get the data into linear form.
  • asked a question related to Dataset
Question
7 answers
Can anyone please tell me the database names or websites from where I can download human SNP datasets along with the quantitative traits (phenotypes) for genome-wide association studies (GWAS)?
Relevant answer
Answer
  • asked a question related to Dataset
Question
5 answers
I ask this question from the perspective that AI algorithms can automate tasks, analyze vast amounts of data, and suggest new research avenues. It can also improve research efficiency and speed up scientific progress, in addition to analysing massive datasets, it could identify patterns, and accelerate scientific discovery. These no doubt are advantageous benefits, but AI algorithms can also inherit biases from the data they're trained on, leading to discriminatory or misleading results, which directly affect research in terms of the quality of output. Additionally, the "black box" nature of some AI systems makes it difficult to understand how they reach conclusions, raising concerns about transparency and accountability.
Relevant answer
Answer
AI tools are mainly about extended memory capacity and speed on the information highway Christopher Ufuoma Onova , i.e. for the experienced researcher and master of a subject, it is advancing creativeness and innovation. For the novice in scientific research, it is better to learn the facts of a subject by automated programmed instruction, before using the speed and memory of artificial cognitive systems, because this requires learned human supervision, in terms of rational and ethical faculty.
With the move to digital devices we are coming to a point where people don’t need to remember anything, they have it all in the palm of their hand.
One complaint I have heard from professors and others is that the generation of young people now entering the workplace don't know how to communicate. They are poor writers, and their coordination and collaboration skills are lacking. Some of this would have to be a direct result of their being wedded to their "digital assistants."
We can get smarter, or just more dependent; this is definitely our moral choice, with respect to our freedom-of-choice.
_____________
"I fear the day when the technology overlaps with our humanity. The world will only have a generation of idiots."  Albert Einstein
  • asked a question related to Dataset
Question
2 answers
I am working on an EEG classification problem and have collected my own data set. As a first step I applied filters and ICA, then decomposed the signal with db3 at level 6. After trying all kinds of known features (non-statistical, i.e. Hjorth features, entropy, band power, and statistical features: mean, median, variance, skewness, standard deviation), the accuracy with an SVM classifier under 10-fold cross-validation is stuck at 63%, whereas when I fed chunks of the cleaned and pruned data directly to the SVM, the accuracy on different chunks is 80%. What should I report in the paper? Please guide.
Relevant answer
Answer
In your paper, you should comprehensively report both the traditional feature extraction approach and your method of directly using cleaned and pruned EEG data with the SVM classifier. Here's what to include:
  1. Methodology: Clearly describe the preprocessing steps (filtering, ICA, etc.), and detail how the SVM was applied with and without traditional feature extraction
  2. Results: Present the classification accuracies of both approaches. Highlight the improvement observed when bypassing the feature extraction
  3. Discussion: Offer insights into why the direct approach might be yielding better results, such as the preservation of critical data features that are possibly lost during feature extraction
  4. Validation: Include details of the validation techniques used (like cross-validation) to ensure the results are robust and not merely due to overfitting
  5. Conclusion: Suggest why the direct use of cleaned data could be beneficial, based on your findings and validations.
This structure will help underline the efficacy and reliability of your approach, providing valuable insights into EEG data classification methodologies.
  • asked a question related to Dataset
Question
7 answers
Greetings,
Can I get help finding and downloading a multimodal fake-news dataset?
Relevant answer
Answer
I wish I could help more. I think you need to collect the fake news from different sources, create a database, and define what you want to do with it.
Thanks
  • asked a question related to Dataset
Question
3 answers
Imagine you join a new research lab and are immediately assigned a dataset that contains the same variables as those in the substance abuse dataset but comprising a new sample. You are told the lab originally collected this data set with one particular research question in mind: “Does satisfaction with life predict health outcomes?” Before doing any statistical tests, you decide to browse through the dataset and make some graphs of the results. It seems to you that in your data, satisfaction with life predicts health outcomes more strongly in males than in females, and on reflection, you can think of several theoretical reasons why that should be the case. You disregard the data on females and investigate the hypothesis that low levels of satisfaction with life (using the “swl” variable) will be positively predictive of mental ill health (using the “psych6” variable) in males. You finally do a statistical test, and obtain a very low p-value (less than .001) associated with the regression coefficient. You write a paper using this single result, concluding that there is strong evidence for your hypothesis.
Question: Is this practice p-hacking or the garden of forking paths?
Relevant answer
Answer
It certainly would not be sound to publish this result as if you had had that "hypothesis" prior to looking at the data. It may be fine to publish it as an explicitly exploratory (data-generated/data-informed) finding with a cautionary note stating that this is a data-driven result that needs to be validated/replicated with fresh (and ideally experimental or at least quasi-experimental) data (truly experimental may be difficult in this case because it may not be possible or ethical to manipulate life satisfaction).
  • asked a question related to Dataset
Question
5 answers
Imagine you join a new research lab and are immediately assigned a dataset that contains the same variables as those in the substance abuse dataset but comprising a new sample. You are told the lab originally collected this data set with one particular research question in mind: “Does satisfaction with life predict health outcomes?” Before doing any statistical tests, you decide to browse through the dataset and make some graphs of the results. It seems to you that in your data, satisfaction with life predicts health outcomes more strongly in males than in females, and on reflection, you can think of several theoretical reasons why that should be the case. You disregard the data on females and investigate the hypothesis that low levels of satisfaction with life (using the “swl” variable) will be positively predictive of mental ill health (using the “psych6” variable) in males. You finally do a statistical test, and obtain a very low p-value (less than .001) associated with the regression coefficient. You write a paper using this single result, concluding that there is strong evidence for your hypothesis.
Question: Is this practice p-hacking or the garden of forking paths?
Relevant answer
Answer
IYH dear Miao Yang
It's IMHO p-hacking. Why? By focusing solely on the data from males and disregarding the data from females, the researcher is engaging in a form of cherry-picking, which can lead to biased results and conclusions. This behavior increases the chances of finding a statistically significant result by chance alone (especially if multiple comparisons or tests are conducted).
The practice described in the scenario is not an instance of 'the garden of forking paths'. The garden of forking paths refers to a situation where researchers explore multiple hypotheses, analysis paths, or data subsets, and then selectively report the results that support their preferred hypothesis or direction of the effect. In this case the researcher has not explored multiple alternative hypotheses or analysis paths before settling on the one that supports their initial hypothesis. Instead, the researcher has formed a specific hypothesis based on their observation of the data and the theoretical reasons they can think of, and then tested that hypothesis using a single statistical test.
  • asked a question related to Dataset
Question
2 answers
For my research, I will retrieve data for each firm (100 firms) over 5 years, leading to 500 data points.
Should this dataset size be sufficient for using a fixed effects model?
Relevant answer
Answer
Chuck A Arize Thank you for the quick response. I want to examine a positive linear relation between two continuous variables. Additionally, I use 6 control variables (mostly continuous).
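For reference, a minimal sketch of a firm fixed-effects specification for a panel of this shape (100 firms x 5 years), written as a least-squares-dummy-variable model in Python with statsmodels; the file and column names (firm, year, y, x, c1-c6) are placeholders for your own variables:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: 100 firms x 5 years = 500 firm-year observations
df = pd.read_csv("panel.csv")

# The C(firm) term absorbs time-invariant firm heterogeneity (entity fixed effects);
# C(year) optionally adds time fixed effects.
model = smf.ols("y ~ x + c1 + c2 + c3 + c4 + c5 + c6 + C(firm) + C(year)", data=df)

# Cluster standard errors by firm, as is common with short panels
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["firm"]})
print(result.summary())

With 500 observations, 100 firm dummies plus the slope coefficients still leave a workable number of degrees of freedom, but keep in mind that only the within-firm variation over the 5 years identifies the effect.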
  • asked a question related to Dataset
Question
1 answer
Hello,
I performed a correspondence analysis on species count data with 2 illustrative (qualitative) variables. The problem is that the data set is (i) relatively small and (ii) I concatenated two factors into one supplementary variable (i.e. health state (binary) combined with the species name (3 different species)).
I expected Dim 1 and Dim 2 each to be represented by one of the two supplementary variables, but the results were different: health state loads on the first dimension together with the second illustrative variable, while the second dimension is explained by the species name.
I'm confused! Should I continue interpreting, or perform an MCA (ACM) instead?
Relevant answer
Answer
Take a closer look at your data to understand its characteristics better. Are there any outliers or unusual patterns that could be affecting the results? Exploring the data visually or through other statistical techniques may provide insights into why the analysis is producing unexpected results. You could then consider additional analyses, such as a Multiple Correspondence Analysis (MCA) or an alternating-least-squares scaling approach (ALSCAL). These may provide different perspectives on your data and help you understand the relationships between the variables better.
  • asked a question related to Dataset
Question
1 answer
I am working on a deep learning model with a GAN (Generative Adversarial Network) architecture and got stuck on the dataset-creation step (using TensorFlow functions to prepare the data for my model). In particular, I am unsure how long this function should normally take to execute (since I am a beginner).
My dataset includes 4 types of 2D MRI modalities along with a ground truth for 100 subjects, and for each subject there are 100 images.
Relevant answer
Answer
Without specific details about the dataset size, complexity of the data generation process, and hardware specifications of your GPU-accelerated workstation, it's challenging to provide a precise estimate of the normal execution time. However, for moderately sized datasets and efficient dataset generation functions running on a well-configured GPU-accelerated workstation, you might expect the execution time to range from seconds to minutes. For very large datasets or highly complex data generation processes, it could take longer.
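As a rough illustration of the dataset-creation step (not your exact pipeline), a tf.data setup for four 2D MRI modalities plus a ground-truth mask might look like the sketch below; the load_slice loader, array shapes and slice counts are hypothetical placeholders:

import numpy as np
import tensorflow as tf

def load_slice(idx):
    # Placeholder loader: replace with your own I/O for the four modalities + ground truth
    modalities = np.zeros((240, 240, 4), dtype=np.float32)  # e.g. T1, T1ce, T2, FLAIR stacked
    mask = np.zeros((240, 240, 1), dtype=np.float32)
    return modalities, mask

indices = np.arange(100 * 100)  # 100 subjects x 100 slices (hypothetical)

ds = tf.data.Dataset.from_tensor_slices(indices)
ds = ds.map(lambda i: tf.numpy_function(load_slice, [i], (tf.float32, tf.float32)),
            num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.shuffle(1024).batch(8).prefetch(tf.data.AUTOTUNE)  # overlap I/O with training

for x, y in ds.take(1):
    print(x.shape, y.shape)

Building the pipeline itself is nearly instantaneous because tf.data is lazy; the time you observe is dominated by the per-slice loading function, so profiling that loader (and using num_parallel_calls and prefetch as above) is usually where the gains are.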
  • asked a question related to Dataset
Question
3 answers
Hi all,
I have collected some data from 3 treatment groups (n=5) with relation to 6 markers of fibrosis. When I was looking to analyse (with GraphPad Prism) these 6 data sets, I noticed that some data sets were normally distributed, while others are normally distributed after logarithmic transformation and one data set was not normally distributed (even after logarithmic transformation).
As I'm not too familiar with this kind of result, I thought to do the following. Please let me know if this is correct or if I have misunderstood.
  • For data sets that are normally distributed: Analyse data as is with parametric one-way ANOVA
  • For data sets that are normally distributed after logarithmic transformation: Analyse y=ln(x) transformed data with parametric one-way ANOVA
  • For data set that is not normally distributed: Analyse data as is with the non-parametric Kruskal-Wallis One-Way ANOVA. (Is there any other way to transform data as a way of normalising it?)
Would it be strange to present these 6 data sets together when some have undergone log transformation where others have not?
What should I do? I would greatly appreciate any advice
Relevant answer
Answer
For datasets that are normally distributed:
What to do: Proceed with the parametric one-way ANOVA as planned. This statistical test assumes that the data are normally distributed, which matches your scenario.
For datasets that become normally distributed after logarithmic transformation:
What to do: You can indeed apply a logarithmic transformation to these datasets and then use parametric one-way ANOVA on the transformed data. This approach is valid as long as you specify in your report or presentation that you transformed the data due to non-normality in its original form.
For the dataset that is not normally distributed, even after logarithmic transformation:
What to do: The Kruskal-Wallis test, which you’ve mentioned, is a suitable choice for this dataset. This test does not assume normality and is effective for comparing median differences between groups.
Alternative transformations: Besides logarithmic transformation, you could try other transformations such as square root or Box-Cox transformation to see if they help in achieving normality. However, if transformations do not result in normal distribution, sticking with a non-parametric approach like Kruskal-Wallis is recommended.
Is it okay to present these datasets together?
  • Explain clearly: When presenting your results, be transparent about the methods you used for each dataset. Mention which datasets were transformed and which were analyzed in their original form.
  • Justify the choice: Briefly explain why different approaches were necessary, emphasizing that these methods were chosen based on the statistical properties of each dataset to ensure the most accurate analysis.
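A minimal sketch of that decision flow in Python with scipy, using made-up groups of n = 5 (GraphPad Prism applies the same logic):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.lognormal(0.0, 0.5, 5)  # placeholder data for the three treatment groups
g2 = rng.lognormal(0.2, 0.5, 5)
g3 = rng.lognormal(0.5, 0.5, 5)

def all_normal(groups, alpha=0.05):
    # Shapiro-Wilk on each group; True if no group rejects normality
    return all(stats.shapiro(g)[1] > alpha for g in groups)

if all_normal((g1, g2, g3)):
    stat, p = stats.f_oneway(g1, g2, g3)  # parametric one-way ANOVA
elif all_normal((np.log(g1), np.log(g2), np.log(g3))):
    stat, p = stats.f_oneway(np.log(g1), np.log(g2), np.log(g3))  # ANOVA on ln-transformed data
else:
    stat, p = stats.kruskal(g1, g2, g3)  # non-parametric fallback
print(stat, p)

Note that with n = 5 per group, normality tests have very little power, so the choice between branches is partly a judgment call and should be reported transparently, as suggested above.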
  • asked a question related to Dataset
Question
4 answers
When a model is trained using a specific dataset with limited diversity in labels, it may accurately predict labels for objects within that dataset. However, when applied to real-time recognition tasks using a webcam, the model might incorrectly predict labels for objects not present in the training data. This poses a challenge as the model's predictions may not align with the variety of objects encountered in real-world scenarios.
  • Example: I trained a real-time recognition model for a webcam with classes lc = {a, b, c, ..., m}. The model consistently predicts the classes in lc perfectly. However, when I input an object whose class doesn't belong to lc, it still predicts something from lc.
Are there any solutions or opinions that experts can share to guide me further in improving the model?
Thank you for considering my problem.
Relevant answer
Answer
Some of the solutions are transfer learning, data augmentation, one-shot learning, ensemble learning, active learning, and continuous learning.
  • asked a question related to Dataset
Question
3 answers
Imagine you join a new research lab and are immediately assigned a dataset that contains the same variables as those in the substance abuse dataset but comprising a new sample. You are told the lab originally collected this data set with one particular research question in mind: “Does satisfaction with life predict health outcomes?” Before doing any statistical tests, you decide to browse through the dataset and make some graphs of the results. It seems to you that in your data, satisfaction with life predicts health outcomes more strongly in males than in females, and on reflection, you can think of several theoretical reasons why that should be the case. You disregard the data on females and investigate the hypothesis that low levels of satisfaction with life (using the “swl” variable) will be positively predictive of mental ill health (using the “psych6” variable) in males. You finally do a statistical test, and obtain a very low p-value (less than .001) associated with the regression coefficient. You write a paper using this single result, concluding that there is strong evidence for your hypothesis.
Question: What is a term for the practice that you are engaging in?
Is this practice p-hacking or the garden of forking paths?
Relevant answer
Answer
There is ongoing debate in the scientific community about the validity of p-values in scientific research. Many scientists and statisticians are calling for abandoning statistical significance tests and p-values. The only way to avoid p-hacking is to not use p-values. You should make scientific inference based on some descriptive statistics and your domain knowledge (not p-values).
  • asked a question related to Dataset
Question
1 answer
Machine learning can be used for prediction of antibiotic resistance in healthcare setting. The problem is that most hospitals do not have electronic health records. Datasets need to be large to make models reliable. I need a dataset with minimum of 1000 records. Kindly assist
Relevant answer
Answer
I would approach the pathology laboratory attached to the hospital group and speak to the microbiologist, or speak to the occupational infection control head at the facility that does the surveillance of the hospital's antibiograms.
  • asked a question related to Dataset
Question
3 answers
wrong number of values. Read 20, expected 22, read Token[EOL], line 3215 Problem encountered on Line: 3215.
I need help with this error in order to load the dataset onto WEKA
Relevant answer
Answer
The error message "wrong number of values" typically indicates that there is a mismatch between the number of attributes defined in the dataset and the actual number of values provided in one or more instances. In your case, it seems that WEKA is expecting 22 attributes but is reading only 20 values on line 3215 of your dataset, encountering an end-of-line (EOL) token. To resolve this error, you should carefully examine the dataset, particularly line 3215, to ensure that each instance contains values for all 22 attributes, separated by the appropriate delimiter (usually a comma or a tab). It's possible that there might be missing or incorrectly formatted values in this line or elsewhere in the dataset. Once you correct the dataset to ensure that each instance has the correct number of values, you should be able to load it successfully into WEKA.
  • asked a question related to Dataset
Question
2 answers
I am working on a project to detect credit card fraud using machine learning and am looking for a recent dataset.
Thanks in advance
Relevant answer
Answer
One commonly used dataset for credit card fraud detection is the Credit Card Fraud Detection Dataset available on Kaggle, which contains transactions made by credit cards in September 2013 by European cardholders. This dataset encompasses transactions over a two-day period, including 492 frauds out of 284,807 transactions, making it imbalanced but reflective of real-world scenarios. Additionally, the IEEE-CIS Fraud Detection Dataset on Kaggle offers a more extensive set of real-world features for transactional data, suitable for advanced machine learning models. For cases where real-world data is limited or sensitive, synthetic datasets like the Credit Card Fraud Detection Synthetic Dataset on Kaggle provide an alternative. As with any dataset, it's crucial to understand its limitations, potential biases, and preprocessing requirements while adhering to proper citation and usage protocols.
  • asked a question related to Dataset
Question
1 answer
I have installed R (CRAN), then RStudio and all the related packages, but I can't import the raw dataset from Web of Science into the biblioshiny package in RStudio. Please see the error message attached and suggest a solution if possible.
Relevant answer
Hi, at the bottom of the Biblioshiny page, export the file to Excel, then reopen the file in Biblioshiny format and the problem is solved.
  • asked a question related to Dataset
Question
3 answers
It is universally accepted that Arithmetic Mean (AM), Geometric Mean (GM) and Harmonic Mean (HM) are three measures of central tendency of data.
Suppose,
X1 , X2 , ................ , Xn
are n observations in a data set with C as their central tendency.
I am trying to understand the logic, or the reason, for which each of AM, GM and HM is accepted as a measure of the central tendency C.
If I accept the argument that each of
AM(X1, X2, ..., Xn), GM(X1, X2, ..., Xn) and HM(X1, X2, ..., Xn)
converges to C as n tends to infinity, and that this is why they are regarded as measures of C, i.e. the central tendency of X1, X2, ..., Xn,
will that be correct?
Relevant answer
Answer
Dear Doctor
[Arithmetic Mean of Ungrouped Data: There are two methods for calculating the arithmetic mean of ungrouped data: i) the direct method and ii) the indirect or short-cut method.
i) Direct method:
Arithmetic Mean (A.M.) = Sum of observations / Number of observations
ii) Indirect or short-cut method: In this method an arbitrary assumed mean is used. Deviations of individual observations from this assumed mean are taken for calculating the arithmetic mean.
Arithmetic Mean of Grouped Data: There are two methods for calculating arithmetic mean of grouped data.
i) Direct method
ii) Indirect or step-deviation method
Merits of Arithmetic Mean:
i) It is easy to understand and calculate
ii) It is based on all observations
iii) It is rigidly defined
iv) It is capable of further mathematical treatment
v) It is least affected by sampling fluctuation.
Demerits of Arithmetic Mean:
i) It is unduly affected by extreme values.
ii) In case of open ended classes it cannot be calculated.
B. Geometric Mean:
When we are interested in measuring average rate of change over time then we use geometric mean. Geometric mean is defined as the nth root of the product of n items (or) values.
Uses of Geometric Mean:
Geometrical Mean is especially useful in the following cases.
1) The G.M. is used to find the average percentage increase in sales, production, or other economic or business series. For example, if from 1992 to 1994 prices increased by 5%, 10%, and 18% respectively, then the average annual increase is not 11% (the arithmetic mean) but about 10.9% (the geometric mean).
2) G.M is theoretically considered to be best average in the construction of Index numbers.
C. Harmonic Mean:
The Harmonic Mean (H.M.) is defined as the reciprocal of the arithmetic mean of the reciprocals of the individual observations.
Merits of Harmonic Mean:
1) Its value is based on all the observations of the data.
2) It is less affected by the extreme values.
3) It is strictly defined.
Demerits of Harmonic Mean:
1) It is neither simple to calculate nor easy to understand.
2) It cannot be calculated if one of the observations is zero.
3) The H.M is always less than A.M and G.M.
Uses of Harmonic Mean:
The H.M is used to calculate the averages where two units are involved like rates, speed, etc.
Relation between A.M., G.M. and H.M.
The relation between A.M, G.M, and H.M is given by
A.M >= G.M >= H.M
Note: The equality condition holds true only if all the items are equal in the distribution.
Summary
The measures of central tendency give us an idea about the central value around which the data values cluster. That’s why these values are considered to be representative values i.e. the values which represent the data. Arithmetic mean is the most common measure of central tendency which is obtained by adding all the observations and then dividing the sum by the number of observations. Geometric mean is used for measuring the average rate of change over time. It is defined as the nth root of the product of n items (or) values. Harmonic Mean (H.M.) is defined as the reciprocal of the arithmetic mean of the reciprocals of the individual observations.]
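For a quick numerical illustration of these definitions and of the A.M >= G.M >= H.M relation, here is a small Python example (any positive data set will do):

import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 8.0, 16.0])

am = x.mean()        # arithmetic mean: sum of values / number of values
gm = stats.gmean(x)  # geometric mean: nth root of the product of the values
hm = stats.hmean(x)  # harmonic mean: reciprocal of the mean of the reciprocals

print(am, gm, hm)    # 7.5, about 5.66, about 4.27 -> AM >= GM >= HM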
  • asked a question related to Dataset
Question
3 answers
Hello guys
I want to employ FMRI for conducting research.
As a first step, I want to know whether fMRI data is an image like MRI data,
or whether I should treat fMRI data as a time series when analyzing it.
thank you
Relevant answer
Answer
MRI datasets typically result in high-resolution three-dimensional images representing anatomical structures. These images are often stored in formats such as DICOM (Digital Imaging and Communications in Medicine) or NIfTI (Neuroimaging Informatics Technology Initiative). fMRI datasets produce time-series data representing changes in brain activity over time. These data are often stored in formats compatible with neuroimaging software packages, such as NIfTI, Analyze, or MINC (Medical Imaging NetCDF). fMRI data can be conceptualized and analyzed both as images and time-series. The choice of representation depends on the specific research question and analysis techniques being employed. For many analyses, researchers will use both approaches, leveraging the spatial information provided by the image-like representation and the temporal dynamics captured in the time-series data.
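As a small illustration of the "image versus time series" point, a 4D fMRI NIfTI file can be read both ways with nibabel (the file name and voxel indices below are placeholders):

import nibabel as nib

img = nib.load("func_bold.nii.gz")  # 4D fMRI volume: x, y, z, time
data = img.get_fdata()
print(data.shape)                   # e.g. (64, 64, 36, 200)

volume_t0 = data[..., 0]            # one 3D brain image (spatial view)
voxel_ts = data[30, 30, 18, :]      # the time series of a single voxel (temporal view)
print(volume_t0.shape, voxel_ts.shape)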
  • asked a question related to Dataset
Question
1 answer
I am researching on automatic modulation classification (AMC). I used the "RADIOML 2018.01A" dataset to simulate AMC and used the convolutional long-short term deep neural network (CLDNN) method to model the neural network. But now I want to generate the dataset myself in MATLAB.
My question is: do you know good sources (papers or code) that have produced a dataset for AMC in MATLAB (or Python)? In particular, have they generated the in-phase and quadrature components for different modulations (preferably APSK and PSK)?
Relevant answer
Answer
Automatic Modulation Classification (AMC) is a technique used in wireless communication systems to identify the type of modulation being used in a received signal. This is important because different modulation schemes encode information in different ways, and a receiver needs to know the modulation type to properly demodulate the signal and extract the data.
Here's a breakdown of AMC:
  • Applications: Cognitive radio networks (AMC helps identify unused spectrum bands for efficient communication); military and electronic warfare (recognizing communication types used by adversaries); spectrum monitoring and regulation (ensuring proper usage of allocated frequencies).
  • Types of AMC algorithms: Likelihood-based (LB) algorithms compare the received signal with pre-defined models of different modulation schemes; feature-based (FB) algorithms extract features from the signal (like amplitude variations) and use them to classify the modulation type.
  • Recent advancements: Deep learning architectures, especially Convolutional Neural Networks (CNNs), are showing promising results in AMC due to their ability to automatically learn features from the received signal.
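Regarding generating the dataset yourself: rather than a reference implementation of RADIOML 2018.01A, a rough Python/NumPy starting point for producing I/Q frames of different PSK orders with additive noise is sketched below (no pulse shaping or channel effects; MATLAB's Communications Toolbox offers comparable modulators such as pskmod):

import numpy as np

def psk_iq_frames(order, n_frames=1000, frame_len=1024, snr_db=10, seed=0):
    # Generate noisy PSK frames as an (n_frames, frame_len, 2) array of I/Q samples
    rng = np.random.default_rng(seed)
    symbols = rng.integers(0, order, size=(n_frames, frame_len))
    signal = np.exp(1j * 2 * np.pi * symbols / order)  # unit-energy PSK constellation
    noise_power = 10 ** (-snr_db / 10)
    noise = np.sqrt(noise_power / 2) * (rng.normal(size=signal.shape)
                                        + 1j * rng.normal(size=signal.shape))
    x = signal + noise
    return np.stack([x.real, x.imag], axis=-1).astype(np.float32)

qpsk = psk_iq_frames(order=4)
psk8 = psk_iq_frames(order=8)
X = np.concatenate([qpsk, psk8])
y = np.concatenate([np.zeros(len(qpsk)), np.ones(len(psk8))])  # class labels for the CLDNN
print(X.shape, y.shape)

For APSK you would replace the constellation line with the appropriate ring amplitudes and phase offsets, and for a more realistic dataset you would add pulse shaping, frequency/phase offsets, and fading, much as the RADIOML generators do.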
  • asked a question related to Dataset
Question
1 answer
I want to train neural networks to evaluate the seismic performance of bridges, but the papers online are all based on their own databases and have not been published. Where can I find the relevant dataset? The dataset can include the following content: yield strength of steel bars, compressive strength of concrete, number of spans, span length, seismic intensity, support type, seismic damage level, etc
Relevant answer
Answer
Yongbo Xiang Your inquiry poses an interesting task. I'm also eager to acquire the corresponding data, alongside the documented history of bridge element damages attributed to past seismic events found in the literature.
  • asked a question related to Dataset
Question
4 answers
Hello everyone, I would appreciate you helping me with this question,
I am using remote sensing-based models to calculate ET actual. I want to use ERA-5 dataset as a meteorological dataset
I found that ERA-5 dataset has different pressure levels and provides (Relative humidity, Temperature, and Wind speed in u and v directions) but it doesn't provide shortwave solar radiation.
On the other hand, ERA5-Land provides (2m temperature, Surface solar radiation downwards, 10m u-component of wind, 10m v-component of wind) but doesn't provide relative humidity.
My question is:
  • Which one is better to use ERA5 or ERA5-land?
If I use ERA5 I face two problems: the first is identifying the appropriate pressure level, and the second is obtaining the shortwave radiation.
If I use ERA5-Land, I don't know how to get relative humidity.
  • Also, can I use ERA5 for the relative humidity and the ERA5-land for all other parameters?
  • Finally, what is the shortest way to get hourly data for certain days (20 days) for a specific location (E, N) as a CSV file
Relevant answer
Answer
- To answer your first question, I would suggest that you could use both products. However, regarding the variables missing from each product, you could overcome this by choosing an actual evapotranspiration method that suits the variables each product provides.
- Regarding your second question, I do not see any problem with combining them.
- Regarding the third question, you could use the API by creating an account and writing an API retrieval script.
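If you take the API route, a rough template using the CDS Python client (cdsapi) plus xarray for the point extraction is sketched below; the variable list, request keys and coordinates are assumptions to be checked against the CDS documentation for ERA5-Land:

import cdsapi
import xarray as xr

c = cdsapi.Client()  # requires a free CDS account and an ~/.cdsapirc key

request = {
    "variable": ["2m_temperature", "2m_dewpoint_temperature",
                 "surface_solar_radiation_downwards",
                 "10m_u_component_of_wind", "10m_v_component_of_wind"],
    "year": "2023",
    "month": "07",
    "day": [f"{d:02d}" for d in range(1, 21)],   # the 20 days of interest
    "time": [f"{h:02d}:00" for h in range(24)],  # hourly
    "area": [30.1, 31.2, 30.0, 31.3],            # small N, W, S, E box around the site
    "format": "netcdf",
}
c.retrieve("reanalysis-era5-land", request, "era5_land.nc")

# Reduce to the nearest grid point and export as CSV
ds = xr.open_dataset("era5_land.nc")
point = ds.sel(latitude=30.05, longitude=31.25, method="nearest")
point.to_dataframe().to_csv("era5_land_point.csv")

Relative humidity is not an ERA5-Land output, but it can be derived from the 2 m temperature and 2 m dewpoint temperature retrieved above (e.g. with the Magnus formula), which avoids mixing products.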
  • asked a question related to Dataset
Question
1 answer
I am working on a situation where I have more than one independent (feature) variable, say four, and one dependent (target) variable. My confusion is whether, when the XGBoost algorithm is applied to such a data set, it considers each feature variable individually during prediction (as linear regression does) or uses them as a group to build the decision trees.
If any reference paper or book related to supervised machine learning algorithms is available, kindly share it.
Relevant answer
Answer
The answer is no. XGBoost assumes the independence of the predictors but uses their possible interactions as a basis to create an additive and surrogate structure made up of weak learners. Therefore, the effect of each predictor will be influenced by the effect of the other predictors and their number, and that is the extra point of boosting compared to conventional regression methods. To understand more deeply how XGBoost works, nothing better than a detailed explanation from its creators: https://doi.org/10.1145/2939672.2939785
Best regards!
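A small synthetic sketch showing that the boosted trees use all the features jointly (interactions included), plus a quick look at their relative importances; the data here is made up:

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                    # four feature variables
y = X[:, 0] + 2 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)  # includes an interaction

model = XGBRegressor(n_estimators=300, max_depth=3, learning_rate=0.1)
model.fit(X, y)                                  # each tree may split on any of the features

print(model.feature_importances_)                # relative contribution of each feature
print(model.predict(X[:5]))

Because the interaction between the second and third features can be captured by successive splits within a tree, the model does not treat the predictors as isolated terms the way a plain linear regression would.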
  • asked a question related to Dataset
Question
6 answers
I have a dataset of lung cancer images with 163 samples (2D images). I use fine-tuning of deep learning models to classify the samples, but the validation loss does not decrease. I augmented the data and used dropout, but the validation loss still didn't drop. How can I solve this problem?
Relevant answer
Answer
I feel there are a few checks and techniques that could be applied to avoid/mitigate overfitting:
1. Clean your dataset (check and handle null/missing values and decide whether to keep or remove the affected records).
2. Handle the outliers.
3. Cross-validation: split the data into training and validation/test sets to evaluate model performance on unseen data. Use techniques like k-fold cross-validation to get a more robust estimate of model generalization.
4. Feature selection / dimensionality reduction: identify and remove irrelevant, redundant or noisy features that may be causing overfitting; techniques like Principal Component Analysis (PCA) can reduce the dimensionality of the data.
5. Thoroughly evaluate model performance on held-out test data, not just the training data.
6. Apply regularization techniques like L1 (Lasso), L2 (Ridge) or Elastic Net to control model complexity and prevent overfitting.
7. Use simpler models with fewer parameters, such as linear regression or decision trees, instead of more complex models like neural networks.
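As a hedged illustration of a few of these points for a small image dataset like yours, a Keras fine-tuning setup with light augmentation, dropout and early stopping on the validation loss might look like this (the backbone, image size and datasets are placeholders):

import tensorflow as tf

base_model = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                               input_shape=(224, 224, 3))
base_model.trainable = False  # freeze the backbone when only ~163 images are available

model = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # light augmentation
    tf.keras.layers.RandomRotation(0.05),
    base_model,
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
]
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)

With so few samples, a stratified k-fold evaluation (point 3 above) is usually more informative than a single train/validation split.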
  • asked a question related to Dataset
Question
2 answers
What is the popular facial image dataset to detect Austisms, and what are the sources?
Relevant answer
Answer
YTUIA - see the research paper at https://www.mdpi.com/2075-4418/14/6/629
  • asked a question related to Dataset
Question
1 answer
Hi
I have attached two images of the scatter plots of some datasets. As per these plots, it seems that the data is not linearly separable.
Can anyone please confirm my understanding?
Thanks and Regards
Monika
Relevant answer
Answer
Hi, Which dataset do you use ? I am also searching for this answer.
  • asked a question related to Dataset
Question
4 answers
I currently have the correlation/covariance matrix for a set of variables, as well as the output from a regression analysis, but lack access to the underlying raw dataset. Given these constraints, would it be feasible to conduct a comprehensive analysis of endogeneity? If so, I would greatly appreciate guidance on methodologies or statistical techniques that could be employed to investigate the presence of endogeneity under these circumstances.
Thanks,
Harshavardhana
Relevant answer
Answer
Endogeneity analysis typically requires access to raw data or at least detailed information about the variables of interest and their relationships. However, conducting endogeneity analysis without access to raw data can be challenging and may limit the depth of the analysis. Here are some approaches you can consider if you don't have access to raw data:
  1. Literature Review: Start by reviewing existing literature on the topic of interest. Look for studies that have addressed endogeneity issues similar to yours and examine their methodologies, including how they handled endogeneity concerns. This can provide insights into potential strategies or techniques you can apply in your analysis.
  2. Theoretical Considerations: Based on your understanding of the subject matter and theoretical framework, try to identify potential sources of endogeneity in your analysis. Consider factors that could lead to correlation or causality issues between variables and think about how these issues might be addressed or mitigated.
  3. Instrumental Variables: If you have access to instrumental variables that are plausibly exogenous and relevant to your analysis, you can use them to address endogeneity concerns. Instrumental variables estimation requires careful selection and validation of instruments, so make sure to justify their relevance and validity in your context.
  4. Quasi-Experimental Methods: Explore quasi-experimental methods or natural experiments that exploit exogenous variation in the data. These methods can help identify causal effects while controlling for endogeneity. Examples include difference-in-differences, regression discontinuity design, and propensity score matching.
  5. Sensitivity Analysis: Perform sensitivity analysis to assess the robustness of your results to potential sources of endogeneity. This involves testing the sensitivity of your findings to different model specifications, control variables, and assumptions. Sensitivity analysis can provide insights into the reliability and stability of your results.
  6. Expert Consultation: If possible, consult with experts or researchers who have experience with the data or topic area. They may offer valuable insights or suggest alternative approaches to address endogeneity concerns given the limitations of the available data.
While conducting endogeneity analysis without access to raw data presents challenges, it's still possible to employ various strategies and techniques to mitigate endogeneity concerns and produce meaningful results. However, it's essential to acknowledge the limitations of the analysis and interpret the findings cautiously, considering the potential impact of unobserved factors and data limitations on the results.
  • asked a question related to Dataset
Question
1 answer
Hi all,
I am looking for public repositories or services that are willing to host some large scientific data sets in the range of hundreds of GBs. Besides the universities' infrastructures we have found Hugging Face as another option. At the same time, public funding agencies would prefer some public non-commercial platforms. Looking for ideas to complete the list below:
For small data:
  • University servers
  • Zenodo (<50GB per default)
  • GitHub (<2GB free version)
For large data:
  • Hugging Face (<5GB per file, multiple files possible)
  • ? any ideas?
Thanks!
Relevant answer
Answer
Hi,
Perhaps you can consider building a personal website if the budget allows.
Furthermore, your wave forecasting based on U-Net is very beautiful. Please allow me to express my gratitude for your work. It has been very inspiring to me.
  • asked a question related to Dataset
Question
3 answers
For example, the dependent variable is daily stock returns and the independent variables are company characteristics such as firm size, leverage, etc.
Relevant answer
Answer
I think the data could be used but the firm size needs to be the most recent and temporally close to your daily stock returns. Obviously, when you evaluate your models you should acknowledge these limitations.
  • asked a question related to Dataset
Question
17 answers
I urgently need the DEAP dataset. I didn't receive any username and password from the officials; can someone please help me with the dataset or the credentials if you have them?
Relevant answer
Answer
No, please ask your professor to request access to Deap dataset.
  • asked a question related to Dataset
Question
1 answer
Are synthetic thermal images useful in bio-medical image processing for diagnostic purposes? Please share some resources of such data sets.
Relevant answer
Answer
Well, we normally don't use thermal images to capture internal information and distributions (we would need to know the illness ROI), because they put more focus on structural information. That means this kind of image is more suitable for structures and for applications that need only boundary and structural information.
  • asked a question related to Dataset
Question
3 answers
1. I'm seeking soil property raster datasets with resolutions matching those of Sentinel or Landsat imagery, as SoilGrids data are currently available only at a 250m resolution. Therefore, I'm interested in finding options with finer resolutions of 10/30m. What are the best available alternatives with finer resolutions?
2. Is there a dataset specific to the Indian context that provides more accurate and locally optimized soil data?
Relevant answer
Answer
Subhadeep Mandal Detailed data on the spatial distribution of soils are usually a national resource; such data can be found in national repositories, but they are rarely open access.
  • asked a question related to Dataset
Question
2 answers
Anyone please explain this
Relevant answer
Shantanu Shukla You would choose a dataset based on the characteristics of the tests you want to perform. In a very general case, you can utilize the TPC-H (https://www.tpc.org/tpch/) benchmark datasets to begin your tests.
  • asked a question related to Dataset
Question
1 answer
In detail, when utilizing the data from 1998 to 2014 as the training dataset, the Ljung-Box (Q) statistic, particularly Ljung-Box (18), is not generated in SPSS. However, if the analysis incorporates the entire dataset spanning from 1998 to 2021, the statistics are produced without issue.
Relevant answer
Answer
I think the 0 degrees of freedom explains the problem; as you add more years, the degrees of freedom will increase, enabling the estimates to be calculated.
  • asked a question related to Dataset
Question
1 answer
I am currently researching phishing detection using URLs with a logistic regression model. I have a dataset with a 1:10 ratio: 20k legitimate URLs and 2k phishing URLs.
Relevant answer
Answer
You can rely on classical prediction metrics such as:
  • Precision, also known as 'positive predictive value', measures the proportion of correctly predicted positive instances out of all instances predicted as positive. It is calculated as the number of true positive results divided by the number of all samples predicted to be positive. You want high precision in scenarios where minimizing false positives is essential.
  • Recall, also known as 'sensitivity', quantifies the proportion of correctly predicted positive instances out of all actual positive instances. It is calculated as the number of true positive results divided by the number of all samples that should have been identified. You want high recall in scenarios where missing actual positives (false negatives) is costly.
A suitable combination of the above metrics for your purposes is the F2-score, which is a variant of the F1 score that puts a stronger emphasis on recall compared to the standard F1 score. Placing a stronger emphasis on recall rather than precision makes it suitable for tasks where capturing all positive instances is crucial.
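A short scikit-learn sketch of these metrics on an imbalanced split like yours (the labels below are made up, with 1 = phishing as the positive class):

from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # placeholder ground truth
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # placeholder model predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F2:       ", fbeta_score(y_true, y_pred, beta=2))  # weights recall more than precision

Given the 1:10 class ratio, it is also worth setting class_weight='balanced' in scikit-learn's LogisticRegression and inspecting the precision-recall curve rather than accuracy alone.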
Hope you found my answer useful.
Bests
  • asked a question related to Dataset
Question
10 answers
Greetings all, Our team, '#THE GLOBEST TEAM,' presents a #great opportunity to participate in several #American dataset clinical studies related to the #internal medicine field. Your mission is to #write a specific section of the article and we will review your work. Once we have reviewed the manuscript, you will make any necessary edits related to your mission. If you have previously participated in #10 original research studies or reviews, and you are a #well-scientific writer, please leave a comment with your #name, #Google Scholar account, and #email address. #Research #opportunity #clinicalresearch #USAresearch #National Center for Health Statistics #THE GLOBEST TEAM
Relevant answer
Answer
Interested in participating.
Name: Anooja Rani
  • asked a question related to Dataset
Question
2 answers
A dataset urgently needed for EEG signals in children with autism
Relevant answer
Answer
Look at this work: Sun S, Cao R, Rutishauser U, Yu R, Wang S. A uniform human multimodal dataset for emotion perception and judgment. Sci Data. 2023 Nov 7;10(1):773. doi: 10.1038/s41597-023-02693-z. PMID: 37935738; PMCID: PMC10630434.
You may find some useful information.
  • asked a question related to Dataset
Question
2 answers
I have read a paper entitled "Transcriptome analysis of the response to chronic constant hypoxia in zebrafish hearts". I want to know how the fold change values reported in this paper were calculated. Is there a specific formula, or was it determined entirely by software? I have also accessed the relevant GEO accessions (comparing hypoxia and normoxia samples as diseased and control samples, respectively) available in the NCBI GEO datasets for the same paper. However, the values given there are negative log FC for upregulated transcripts (based on individual profile transcript IDs), whereas the values given for the same upregulated transcripts in the paper are positive. I do not understand this discrepancy. I have also tried to calculate the FC values from the individual expression values for each transcript and each unique accession for the normoxia and hypoxia samples, but the values are completely different from the paper's data. I want to use the microarray data available from GEO for my work. Is there a specific method for processing data from such microarray expression sets? Please share it with me.
Relevant answer
Answer
Thank You
  • asked a question related to Dataset
Question
4 answers
I learned that multiple researchers were successful obtain MOOCs datasets from Stanford via the CAROL website: https://datastage.stanford.edu/. The data request form was placed at http://carol.stanford.edu/research. However, recently the domain name carol.stanford.edu as well as the Center for Advanced Research through Online Learning (CAROL) disappeared on the Internet. Consequently, I have no way to request for my needed datasets.
Do you know another URL to submit the data request form, or any alternative solution/repository to obtain some MOOC learners' interaction data from well known course on Coursera or edX?
Thanks in advance
Relevant answer
Answer
Jayashree Ganeshkumar sad, okay, thank you very much for the resources
  • asked a question related to Dataset
Question
3 answers
Can artificial intelligence help improve sentiment analysis of changes in Internet user awareness conducted using Big Data Analytics as relevant additional market research conducted on large amounts of data and information extracted from the pages of many online social media users?
In recent years, more and more companies and enterprises, before launching new product and service offerings as part of their market research, commission sentiment analysis of changes in public sentiment, changes in awareness of the company's brand, recognition of the company's mission and awareness of its offerings to specialized marketing research firms. This kind of sentiment analysis is carried out on computerized Big Data Analytics platforms, where a multi-criteria analytical process is carried out on a large set of data and information taken from multiple websites.
In terms of source websites from which data is taken, information is dominated by news portals that publish news and journalistic articles on a specific issue, including the company, enterprise or institution commissioning this type of study. In addition to this, the key sources of online data include the pages of online forums and social media, where Internet users conduct discussions on various topics, including product and service offers of various companies, enterprises, financial or public institutions.
In connection with the growing scale of e-commerce, including the sale of various types of products and services on the websites of online stores, online shopping portals, etc., as well as the growing importance of online advertising campaigns and promotional actions carried out on the Internet, the importance of the aforementioned analyses of Internet users' sentiment on specific topics is also growing, playing a complementary role to other, more traditionally conducted market research.
A key problem for this type of sentiment analysis is becoming the rapidly growing volume of data and information contained in posts, comments, banners and advertising spots posted on social media, as well as the constantly emerging new social media. This problem is partly solved by increasing computing power and multi-criteria processing of large amounts of data thanks to the use of increasingly improved microprocessors and Big Data Analytics platforms. In addition, in recent times, the possibilities of advanced multi-criteria processing of large sets of data and information in increasingly shorter timeframes may significantly increase when generative artificial intelligence technology is involved in the aforementioned data processing.
The key issues of opportunities and threats to the development of artificial intelligence technology are described in my article below:
OPPORTUNITIES AND THREATS TO THE DEVELOPMENT OF ARTIFICIAL INTELLIGENCE APPLICATIONS AND THE NEED FOR NORMATIVE REGULATION OF THIS DEVELOPMENT
I described the applications of Big Data technologies in sentiment analysis, business analytics and risk management in my co-authored article:
APPLICATION OF DATA BASE SYSTEMS BIG DATA AND BUSINESS INTELLIGENCE SOFTWARE IN INTEGRATED RISK MANAGEMENT IN ORGANIZATION
The use of Big Data Analytics platforms of ICT information technologies in sentiment analysis for selected issues related to Industry 4.0
In view of the above, I address the following question to the esteemed community of scientists and researchers:
Can artificial intelligence help improve sentiment analysis of changes in Internet users' awareness conducted using Big Data Analytics as relevant additional market research conducted on a large amount of data and information extracted from the pages of many online social media users?
Can artificial intelligence help improve sentiment analysis conducted on large data sets and information on Big Data Analytics platforms?
What do you think about this topic?
What is your opinion on this issue?
Please answer,
I invite everyone to join the discussion,
Thank you very much,
Best wishes,
Dariusz Prokopowicz
The above text is entirely my own work written by me on the basis of my research.
In writing this text I did not use other sources or automatic text generation systems.
Copyright by Dariusz Prokopowicz
Relevant answer
Answer
In my opinion, yes, artificial intelligence (AI) can indeed play a crucial role in improving sentiment analysis for changes in internet user awareness, especially when combined with big data analytics. Here's how:
  1. Natural Language Processing (NLP): AI techniques can be used to process and understand the natural language used in social media posts, comments, reviews, etc. This involves tasks such as text tokenization, part-of-speech tagging, named entity recognition, and more.
  2. Sentiment Analysis: AI algorithms can be trained to recognize and analyze the sentiment expressed in text data. This can help identify whether users are expressing positive, negative, or neutral opinions about specific topics, products, events, etc.
  3. Machine Learning Models: AI-powered machine learning models can be trained on large datasets of labelled social media data to predict sentiment accurately. These models can continuously learn and improve over time as they are exposed to more data.
  4. Deep Learning: Deep learning techniques, such as recurrent neural networks (RNNs) and transformers, can capture complex patterns in text data and improve sentiment analysis accuracy.
Thank You
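As a minimal illustration of points 1-2 above (not a production-scale Big Data pipeline), a pretrained transformer can score the sentiment of a batch of social-media posts; the Hugging Face pipeline API and its default English sentiment model are assumed here:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # swap in a multilingual model for non-English posts

posts = [
    "The new banking app is fantastic, transfers are instant.",
    "Customer support ignored my complaint for two weeks.",
]
for post, result in zip(posts, classifier(posts)):
    print(result["label"], round(result["score"], 3), "-", post)

At Big Data scale the same idea is typically wrapped in a distributed framework (e.g. Spark), and the per-post scores are then aggregated over time to track shifts in sentiment and awareness.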
  • asked a question related to Dataset
Question
5 answers
I'm doing some research to explore the application of eXplainable Artificial Intelligence (XAI) in the context of brain tumor detection. Specifically, I aim to develop a model that not only accurately detects the presence of brain tumors but also provides clear explanations for its decisions regarding positive or negative results. My main concerns are making sure that the model's decision-making process is transparent and comprehending the underlying reasoning behind its choices. I would be grateful for any thoughts, suggestions, or links to papers or web articles that address the practical application of XAI in this field (including the dataset types or anything that is related with XAi).
Thank you.
Relevant answer
Answer
I believe it is very important to begin your investigation by ensuring that the data collected is of high quality and undergoes standardized preprocessing if you want to effectively integrate XAI techniques into brain tumor detection. Validation through evaluation metrics and user feedback increases reliability, while iterative improvement based on user input enhances both accuracy and interpretability over time; this approach fosters trust and transparency in brain tumor detection systems, ultimately benefiting clinicians and patients. You can utilize specialized XAI tools like NeuroXAI, incorporate attention maps for insight into DL model decision-making, adopt open architecture frameworks for scalability to new XAI methods, and enhance model interpretability through information-flow visualization.
  • asked a question related to Dataset
Question
1 answer
Hello ResearchGate community! I'm working on a Fabric Defect Detection System project and need a diverse fabric defect dataset. Any recommendations or shared datasets would greatly benefit my research. Thank you for your support!
Relevant answer
  • asked a question related to Dataset
Question
4 answers
Hello everyone,
I am currently working on a Thesis about the impact of AI on Consulting firms.
I am looking for datasets surrounding this subject. If you have any data or anything else that could help me, I would be very happy to receive your help.
Thank you very much,
Thibaud
Relevant answer
Answer
Ali Abedi Madiseh Thank you very much for the insights, I will look into it !
  • asked a question related to Dataset
Question
3 answers
Hello ResearchGate community,
I'm currently analyzing a dataset derived from a survey of ~200 paired responses across two time points. The survey is of teachers and students, using 60 Likert-like items to assess beliefs about education. After factor analysis, I derived three core factors.
I'm now trying to assess the relative magnitude of change in factor scores over two time points. I say relative magnitude because, for all factors, the scores decreased. So I need to see 1) whether changes were significant and 2) the size of those changes.
Preliminary tests, including Shapiro-Wilk, Q-Q plots, and outlier detection, indicated non-normality, guiding me to utilize a Wilcoxon signed-rank test.
However, I'm at a crossroads regarding the appropriate effect size measure. Traditional non-parametric effect size measures, like rank biserial correlation, seem to fall short for my purpose, as they primarily address the probability of difference -- rather than the magnitude of change I'm interested in capturing. I've established that two factors saw a statistically significant change using the Wilcoxon signed-rank test. But I need to understand how big these decreases were and hopefully compare the two.
I'm contemplating justifying the use of Cohen's d or exploring median-based measures for a more accurate reflection of the change magnitude. But I'm struggling to find relevant info online. I've seen references to things like Hodges Lehmann, using simple median change, etc. But nothing solid.
Does anyone have insights or references on how to effectively apply a median measure in this context or justify using Cohen's d with the Wilcoxon signed-rank test for ordinal Likert data?
I appreciate any guidance or shared experiences on this matter.
Relevant answer
Answer
You can calculate a statistic like Cohen's d. It's just math. The issue is that, if your data are quite non-normal, is dividing the difference in means by the standard deviation a useful metric?
There is a nonparametric analogue using the difference in medians and the median absolute deviation. It's a simple idea, but I couldn't find any reference to it, so I named it after me †. If you use the median absolute deviation that has a constant of about 1.48 ‡, it will be close to Cohen's d for normal samples.
Or you could just use an unstandardized effect size statistic, like the difference in medians.
It may be helpful to also report the matched-pairs rank biserial correlation coefficient, because it's an r-like statistic that may be familiar to your audience. You could also use Pearson's r, where one variable is the dichotomous variable, if that fits your purpose better.
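A minimal Python sketch of the unstandardized and MAD-standardized median effect sizes described above, assuming paired scores in two arrays; the 1.4826 scaling (which makes the MAD comparable to an SD under normality) and the interpretation are assumptions to check against your own references, and the data here are simulated placeholders:
```python
# Minimal sketch: median-based effect sizes for paired data (time1 vs time2).
import numpy as np

rng = np.random.default_rng(0)
time1 = rng.normal(3.5, 0.6, 200)            # hypothetical factor scores at T1
time2 = time1 - rng.normal(0.3, 0.5, 200)    # hypothetical (lower) scores at T2

diff = time2 - time1
median_diff = np.median(diff)                             # unstandardized effect size
mad = 1.4826 * np.median(np.abs(diff - np.median(diff)))  # MAD scaled to approximate an SD
standardized = median_diff / mad                          # rough nonparametric analogue of Cohen's d

print(median_diff, standardized)
```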
  • asked a question related to Dataset
Question
1 answer
I have a data set from HPLC analysis. I would like to open the HPLC data file, but the Masslynx Software version (4.1) I have is not compatible with the file settings. Can anyone suggest an alternative software that I can use to open this data set?
The file was created to open in Masslynx Version 4.2, but I do not have access to this version.
Thank you.
Relevant answer
Answer
If it's raw data, follow these steps:
1. Say "Besma Allah -Al-Rahman -Al- Rahim."
2. Download Python and/or an editor such as Visual Studio Code, because Python supports most file and data formats, such as .csv, .tsv, .raw, .cell, .fasta, .fastq, etc.
3. Search for Python libraries that can open your data file format.
4. After reading your file, you can analyze it using NumPy, pandas, math and other libraries, and visualize your analysis results using Matplotlib, seaborn, Bokeh and other libraries.
"Good Luck"
  • asked a question related to Dataset
Question
2 answers
I found that the structure for TiFeSi is given differently in ICSD and Pearson crystal database as follows:
The Wyckoff positions are given in the Pearson database (data set no. 1822291):
Ti1 Ti 4 b 0.25 0.2207 0.0206
Ti2 Ti 4 b 0.25 0.4979 0.1677
Ti3 Ti 4 b 0.25 0.7996 0.0463
Fe1 Fe 8 c 0.5295 0.1236 0.3699
Fe2 Fe 4 a 0 0 0.0
Si1 Si 8 c 0.506 0.3325 0.2452
Si2 Si 4 b 0.25 0.0253 0.2554
The Wyckoff positions are given in the ICSD database (database code 41157):
Ti1 Ti0+ 4 b 0.25 0.2004(7) 0.2964(14)
Ti2 Ti0+ 4 b 0.25 0.7793(6) 0.2707(14)
Ti3 Ti0+ 4 b 0.25 0.9979(6) 0.9178(15)
Fe1 Fe0+ 8 c 0.0295(7) 0.3764(4) 0.12
Fe2 Fe0+ 4 a 0 0 0.2501(12)
Si1 Si0+ 8 c 0.0060(13) 0.1675(9) 0.9953(18)
Si2 Si0+ 4 b 0.25 0.9747(11) 0.5055(23)
Although the lattice parameters in both databases are almost the same.
Which one should I take for ab initio calculations or XRD Rietveld refinement?
Relevant answer
Answer
Thanks Martin Breza
  • asked a question related to Dataset
Question
1 answer
RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
res = hypotest_fun_out(*samples, **kwds)
The above warning occurred in Python. First, the dataset was normalised, and then this warning appeared while performing the t-test, although the output was still displayed. Kindly suggest some methods to avoid this warning.
Relevant answer
Answer
Why do you normalize before testing? If you are doing a paired t-test and the differences are small, this only makes the differences smaller. https://www.stat.umn.edu/geyer/3701/notes/arithmetic.html
  • asked a question related to Dataset
Question
4 answers
NA
Relevant answer
Answer
Hello!
Of course, there are places (i.e. national geoportals in Poland, Slovakia, Slovenia, etc.) where you can find point cloud data. See also this place: https://portal.opentopography.org/dataCatalog
All the best!
Bartek
  • asked a question related to Dataset
Question
2 answers
I am conducting a study on the "Impact of Land Use and Land Cover (LULC) Changes on Land Surface Temperature (LST)" and plan to use Google Earth Engine (GEE) for my analysis. I am at a crossroads in deciding between the "USGS Landsat 8 Level 2, Collection 2, Tier 1" dataset and the "USGS Landsat 8 Collection 2 Tier 1 TOA Reflectance" dataset for LULC classification.
Could the community provide insights on:
  1. Which dataset would be more suitable for LULC classification, especially in the context of analyzing its impact on LST?
  2. What specific pre-processing steps would be recommended for the preferred dataset within the GEE environment to ensure data integrity and robustness of the classification?
Any shared experiences, particularly those related to the use of these datasets in GEE for LULC and LST studies, would be incredibly valuable.
Thank you for your contributions!
Relevant answer
Answer
Based on my experience, I used Landsat 8 OLI/TIRS Collection 2 atmospherically corrected surface reflectance data for my purpose of extracting NDVI, LST, NDBSI and Wetness. You can see more details about the Landsat products on this page: https://developers.google.com/earth-engine/datasets/catalog/landsat-8
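As a rough, hedged sketch of the preprocessing this implies (Collection 2 Level 2 surface reflectance with scale factors and a QA_PIXEL cloud mask) using the Earth Engine Python API: the scale factors and QA bit positions follow the Landsat Collection 2 documentation as I recall them, and the region and dates are hypothetical, so verify everything against the catalog page linked above:
```python
# Rough sketch: load Landsat 8 Collection 2 Level 2, apply scale factors and a QA_PIXEL cloud/shadow mask.
import ee
ee.Initialize()  # assumes you have already authenticated with the Earth Engine CLI

def scale_and_mask(img):
    optical = img.select('SR_B.').multiply(0.0000275).add(-0.2)     # SR scale factors (per USGS docs)
    thermal = img.select('ST_B.*').multiply(0.00341802).add(149.0)  # ST scale factors (per USGS docs)
    qa = img.select('QA_PIXEL')
    # Cloud and cloud-shadow bits assumed to be 3 and 4 in the Collection 2 QA_PIXEL band.
    mask = qa.bitwiseAnd(1 << 3).eq(0).And(qa.bitwiseAnd(1 << 4).eq(0))
    return (img.addBands(optical, None, True)
               .addBands(thermal, None, True)
               .updateMask(mask))

roi = ee.Geometry.Point([90.4, 23.8]).buffer(20000)  # hypothetical study area
collection = (ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')
              .filterBounds(roi)
              .filterDate('2021-01-01', '2021-12-31')
              .map(scale_and_mask))

composite = collection.median().clip(roi)  # cloud-reduced composite for LULC classification and LST work
```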
  • asked a question related to Dataset
Question
2 answers
Hello everyone,
I am currently working on a Thesis about the impact of AI on Consulting firms.
I am looking for datasets surrounding this subject. If you have any data or anything else that could help me, I would be very happy to receive your help.
Thank you very much,
Thibaud
Relevant answer
Answer
In order to analyze a dataset, first we work with you to formulate a dataset that is ripe for machine learning. If you're still reading, chances are, you have a hypothesis about some insights that may be gained from your data.
Regards,
Shafagat
  • asked a question related to Dataset
Question
2 answers
I'm recently trying to perform an RNA seq data analysis and in 1st step, I faced a few questions in my mind, which I would like to understand. Please help to understand these questions.
1) In the 1st image, the raw data from NCBI-SRA have 1 and 2 marked at the ends of the reads. What does this mean? Are those the forward and reverse reads?
2) In the second image I was trying to run Trimmomatic on this dataset. I chose "paired-end as a collection", but it does not take any input even though my data are there in "fastqsanger.gz" format. Why is that? Should I treat this paired-end data as single-end data while running Trimmomatic?
3) In the 3rd and 4th images, I collected the same data from ENA, where they provide two separate files for the 1- and 2-marked data in SRA. I then tried to process them in Trimmomatic using "Paired-end as individual dataset" and ran it. Trimmomatic gives me 4 files. Why is that, and which one will be useful for alignment?
A big thank you in advance :)
Relevant answer
Answer
For NGS sequencers that have paired-end capability, those 1's and 2's refer to which reads they originate from, Read 1 & Read 2 or Forward & Reverse Reads (https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html). That makes it convenient to have it attached at the end, so you can see what kind of reads you are dealing with before processing them. In a similar fashion, within the FASTQ format specification, you also specify whether that particular read belongs to Read 1 or Read 2 (see Illumina sequence identifiers: https://en.wikipedia.org/wiki/FASTQ_format). Regarding your "fastqsanger.gz" format data, is this Sanger sequencing related data? These tools are developed for NGS applications. Regarding the output files, check https://www.biostars.org/p/199938/ and search for "Trimmomatic output files" on Google.
  • asked a question related to Dataset
Question
3 answers
I'm performing RNA-seq data analysis. I want to do healthy vs disease_stage_1, Healthy vs disease_stage_2, and Healthy vs disease_stage_3. In the case of healthy, disease_stage_1, disease_stage_2, and disease_stage_3 data sets, I have 19, 7, 8, and 15 biological replicates respectively.
Does this uneven number of replicates affect the data analysis?
Should I use an equal number of replicates for every dataset, e.g., 7 biological replicates (as the lowest number of replicates here is 7)?
Relevant answer
Answer
I agree with Alwala in general, less than 3 samples per group is not even worth considering. There is an app online that will help you come up with sample size and/or power given certain parameters (https://cqs-vumc.shinyapps.io/rnaseqsamplesizeweb/) as a useful estimate tool. Regarding your uneven biological replicates, check to see if the statistical method used for differential expression and library normalization can tolerate uneven sample sizes. In general IMO, 8-10 minimum is a pretty good starting point.
  • asked a question related to Dataset
Question
3 answers
Hi, where can I find a benchmark solar irradiance dataset with hourly resolution? And which criteria are essential for it?
Thanks in advance
Relevant answer
Answer
You're welcome. If you search with keywords such as "solar irradiance/irradiation dataset" or "solar irradiance/irradiation benchmark", you will find many datasets that contain hourly irradiation. Good luck to you.
Best regards,
Hossein
  • asked a question related to Dataset
Question
2 answers
I would like to explain my question: during a discussion with my colleague (Prof. Dr. Amer), he told me that datasets used in other fields, such as communication engineering, can also be used in civil engineering applications. He has used this in structural analysis, for example in slab shear and other parts. Is it as easy to use in geotechnical engineering (soil improvement, pile groups, etc.)?
Relevant answer
Answer
I agree too; we can use such datasets because they come with tolerances, and we know that soil analysis is only a technical approximation that is validated once laboratory testing is conducted.
  • asked a question related to Dataset
Question
2 answers
A large dataset of more than 350,000 instances has been imported into an Orange3 file. When running predictions, some of the resulting data are still incomplete, which makes them difficult to analyse. What is a technical solution for this?
Relevant answer
Answer
It is true that the missing values are few, but it is hard to trace manually where they are, because 350,000 instances are being processed.
I still have not found a practical solution in any of the widgets in the Orange3 menu that you suggested.
  • asked a question related to Dataset
Question
2 answers
Hi Everyone,
I am looking for a dataset to work on the customer churn prediction. I have found data regarding frontier airlines in statista.com but it was too expensive to buy. Are there any datasets that are related to the apparel industry along with customer feedback?
Relevant answer
Answer
Thank you for the interesting question, I’m doing research and will post it in the future
  • asked a question related to Dataset
Question
2 answers
Our work with the outstanding Canadian Prairie data is summarized in this BPI Book
Emerging Issues in Environment, Geography and Earth Science Vol. 4
ISBN 978-81-967981-3-0 (Print) ISBN 978-81-967981-4-7 (eBook) DOI: 10.9734/bpi/eieges/v4
We strongly recommend this overview. This Prairie data set is hourly for over fifty years and is calibrated back to standards. There is no comparable dataset for analyzing land-surface processes and the surface energy balance both diurnally and seasonally across time and climate change.
There are five chapters in the book as well as a brief overview of a further six papers of the Prairie data analysis, ECMWF model comparisons and model development.
Relevant answer
Answer
Great!
  • asked a question related to Dataset
Question
2 answers
Hello everyone.
I'm very new to ML and DL models, and I want to use a CNN and train and test it on my dataset. I have a very large dataset, and it is already split into 80% train and 20% test.
I wrote code in Python and used TensorFlow to train the CNN, but I'm not sure if it's correct, and now I am struggling with testing the CNN.
Can anyone help me make sure I trained the CNN correctly and tell me how to test it, either by giving advice or by pointing me to useful resources?
I will be truly grateful.
#CNN #ML #MachineLearning #DeepLearning #ConvolutionalNeuralNetwork
  • asked a question related to Dataset
Question
1 answer
Using simulation, how can one generate a dataset for task scheduling with task characteristics and VM characteristics so as to train ML models?
Relevant answer
Answer
Divya Nathaniel To create task characteristics and VM characteristics, you can try one of the examples in Cloudsim, where you can create characteristics such as length, size of the input file, number of processing elements, etc. for the task. As for the VM side, you can customize the processing power (MIPS), RAM, BW, etc.
  • asked a question related to Dataset
Question
1 answer
How does the incorporation of diverse datasets affect the performance and bias mitigation of ChatGPT in various conversational contexts?
Relevant answer
Answer
Are you asking about the analysis of datasets that you supply? If so, that will have no effect on the program's biases, because ChatGPT operates from an existing dataset that is already included in determining its operation. This dataset is currently up to date only as far as 2021.
  • asked a question related to Dataset
Question
10 answers
Please give valuable information.
Relevant answer
Answer
I guess what you are referring to as an unsupervised dataset is actually unlabeled data. In this case, introducing a prediction model would make no sense, as building such models relies on labels. For a prediction model, you have to provide both features and labels so the algorithm can figure out the relationship between them.
Therefore, you'll need to first create labels for your dataset using unsupervised techniques and then develop a supervised model on top of the obtained dataset.
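A minimal sketch of that two-step idea (cluster first to create pseudo-labels, then fit a supervised model on them), using scikit-learn on synthetic data; the cluster count and the feature matrix are hypothetical placeholders:
```python
# Minimal sketch: create labels with k-means, then train a supervised classifier on them.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))                                             # hypothetical unlabeled features

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # step 1: pseudo-labels

clf = RandomForestClassifier(random_state=0).fit(X, labels)               # step 2: supervised model
print(clf.predict(X[:5]))
```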
  • asked a question related to Dataset
Question
1 answer
...
Relevant answer
Answer
Using trusted sources (e.g., governments, statistical institutes) or establishing partnerships with an industrial data controller.
  • asked a question related to Dataset
Question
2 answers
I need an MRI image dataset for HCC to extract some features from it. I will be delighted if someone can mention a specific dataset.
Relevant answer
Answer
Two options: a solid tumor or cancer-cirrhosis.
  • asked a question related to Dataset
Question
3 answers
How can aerodynamicists generate a sufficient dataset for aerodynamic problems, and what is the time cost of this step (considering a simple 3D problem)?
Relevant answer
Answer
Aerodynamicists generate data sets for aerodynamic problems through a combination of experimental testing and computational simulations. For computational simulations, the process involves:
  1. Grid Generation: Creating a suitable mesh to discretize the geometry, defining the boundaries, and establishing the computational domain.
  2. Solver Setup: Selecting appropriate numerical methods and algorithms for solving the governing equations, considering turbulence models if necessary.
  3. Simulation Runs: Performing multiple simulation runs with varying parameters, boundary conditions, or geometry configurations to generate diverse data points.
  4. Post-Processing: Analyzing the simulation results, extracting relevant aerodynamic parameters, and assessing the performance of the design.
The time cost for this step depends on factors such as the complexity of the geometry, the desired level of accuracy, computational resources, and the number of simulations. For a simple 3D problem, it could range from hours to days per simulation run.
In experimental testing, wind tunnel experiments are conducted to gather aerodynamic data. The time cost depends on the complexity of the model, setup, and the number of experiments conducted.
Combining both computational and experimental approaches can provide a more comprehensive and accurate dataset but requires careful coordination. The time cost can vary widely based on the specific requirements of the aerodynamic problem and the available resources.
  • asked a question related to Dataset
Question
3 answers
I need help regarding datasets for early detection of neurological disorders.
Relevant answer
Answer
There must be country level databases for neurological conditions in some countries.
  • asked a question related to Dataset
Question
3 answers
If someone knows this, please explain it briefly!
Thanks a lot.
Relevant answer
Answer
Conducting research with a mixed dataset containing both image and text data involves defining clear research objectives, collecting and preprocessing diverse annotated data, integrating image and text features through a chosen model architecture (such as Multi-Modal Neural Networks or Transformers), training the model on the mixed dataset, and evaluating its performance using appropriate metrics. Fine-tuning and optimization may follow, considering ethical considerations and transparency in decision-making. The final steps involve interpreting and visualizing learned representations and communicating research findings through publications or presentations, recognizing the need for a solid understanding of computer vision and natural language processing concepts throughout the process.
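As a hedged, minimal sketch of the "integrating image and text features" step mentioned above, here is a toy PyTorch fusion model; the encoder sizes, vocabulary size and class count are hypothetical placeholders rather than a recommended architecture:
```python
# Toy sketch: fuse CNN image features with bag-of-words text embeddings, then classify.
import torch
import torch.nn as nn

class ImageTextFusion(nn.Module):
    def __init__(self, vocab_size=1000, n_classes=3):
        super().__init__()
        self.image_branch = nn.Sequential(               # tiny stand-in for a real image encoder
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())       # -> 16-dim image feature
        self.text_branch = nn.EmbeddingBag(vocab_size, 32)  # -> 32-dim averaged text feature
        self.classifier = nn.Linear(16 + 32, n_classes)

    def forward(self, image, token_ids, offsets):
        img_feat = self.image_branch(image)
        txt_feat = self.text_branch(token_ids, offsets)
        return self.classifier(torch.cat([img_feat, txt_feat], dim=1))  # simple concatenation fusion

model = ImageTextFusion()
images = torch.rand(2, 3, 64, 64)            # hypothetical image batch
tokens = torch.tensor([1, 5, 20, 3, 7])      # two token sequences, flattened
offsets = torch.tensor([0, 3])               # sequence 1 = tokens[0:3], sequence 2 = tokens[3:]
print(model(images, tokens, offsets).shape)  # torch.Size([2, 3])
```
In practice the two branches would be pretrained encoders (e.g., a CNN or vision transformer and a text transformer), but the concatenate-and-classify fusion shown here is the basic pattern.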
  • asked a question related to Dataset
Question
3 answers
I have a dataset from the Kaggle site that includes information on speed and location, and based on these values the attacker is identified.
I then feed this dataset into a learning algorithm for training.
After that, I have a second dataset resulting from a simulation of the vehicle network with an attack rate applied to it.
Then I run the algorithm on the second dataset to obtain the detection accuracy and the confusion matrix.
Is my way of thinking correct or not? Please tell me.
Relevant answer
Answer
There are many options depending on the specific goal and data you are working with. Since you mention that you are new to this field, I would recommend starting with some simple examples using public data sets. If you are going to use Python, there are many examples available using libraries like Scikit-Learn, keras, pytorch and datasets like Iris, Boston Housing, MNIST (handwritten digits) etc. These are well documented. In MATLAB, there are also many built-in examples of predictive models and classifiers trained on different types of data.
It depends on the language/environment you plan to use, the field of application, and the type of task (regression, classification, clustering, etc).
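To make the workflow described in the question concrete (train on one labelled dataset, evaluate on a second, separately generated dataset), here is a hedged scikit-learn sketch; the CSV file names and column names are hypothetical placeholders for your actual Kaggle and simulation files:
```python
# Sketch: train on dataset A (e.g. the Kaggle data), evaluate on dataset B (e.g. the simulated data).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

train = pd.read_csv("kaggle_attack_data.csv")      # hypothetical file with 'speed', 'x', 'y', 'attacker'
test = pd.read_csv("simulation_attack_data.csv")   # hypothetical file with the same columns

features = ["speed", "x", "y"]                     # hypothetical feature columns
clf = RandomForestClassifier(random_state=0).fit(train[features], train["attacker"])

pred = clf.predict(test[features])
print("accuracy:", accuracy_score(test["attacker"], pred))
print(confusion_matrix(test["attacker"], pred))
```
One caveat: if the simulated data are distributed very differently from the training data, the reported accuracy reflects that domain shift as much as the detector itself.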
  • asked a question related to Dataset
Question
7 answers
task x elemental + creation = relative
task defines available resource given force multiplication, economies of scale and LaGrange control(er$)
Relevant answer
Answer
JAS Pi.ai LLM (Context: Copy & Paste)
can you define complexity, riggor and oversight time for robot heuristic learning at this point?
Certainly!
Complexity: In the context of robot heuristic learning, complexity refers to the difficulty or intricacy of the task or problem that the robot is trying to learn. A complex task would typically require the robot to process and integrate a large amount of data, identify patterns or relationships, and make decisions or judgments based on incomplete or uncertain information. Examples of complex tasks for robots might include navigation in unstructured environments, object recognition and classification, or natural language understanding.
Rigor: In this context, rigor refers to the level of precision, accuracy, and thoroughness with which the robot's learning process is conducted. A rigorous learning process would typically involve carefully designed experiments or trials, rigorous data collection and analysis, and systematic evaluation and refinement of the robot's performance over time. The goal of a rigorous learning process is to ensure that the robot's performance is reliable, reproducible, and grounded in a solid understanding of the task or problem being addressed.
Oversight Time: Oversight time refers to the amount of time required for human experts or supervisors to monitor and guide the robot's learning process. This might include tasks such as designing and implementing learning trials, reviewing and analyzing data, adjusting the robot's parameters or algorithms, and evaluating the robot's performance. The amount of oversight time required will depend on the complexity of the task being learned, the rigor of the learning process, and the level of autonomy or adaptability desired for the robot. In general, more complex and rigorous learning processes are likely to require more oversight time from human experts.
It's worth noting that these concepts are not strictly defined or universally agreed upon in the field of robot learning, and different researchers or practitioners might use different terminology or frameworks to describe and analyze these aspects of the learning process.
show math for "advanced" context as such
Certainly! Here's an example of how these concepts might be incorporated into a mathematical model for robot heuristic learning:
Model:
Learning Rate (LR) = (C x D x H) / (R + OT)
Where:
  • C is the complexity of the task, measured on a numerical scale (e.g., 1-10)
  • D is the size of the data set, measured in GB or TB
  • H is the heuristic efficiency of the robot, measured on a numerical scale (e.g., 1-10)
  • R is the computational speed of the robot, measured in GHz
  • OT is the oversight time required for human experts or supervisors, measured in hours
  • LR is the learning rate of the robot, measured as the rate of improvement in performance over time
This model assumes that the learning rate of the robot is influenced by the complexity of the task, the amount of data available, and the robot's inherent heuristic abilities, but is also constrained by the available computational resources and the need for human oversight. The model predicts that as any of the positive factors (C, D, H) increase or the constraints (R, OT) decrease, the learning rate will increase.
Example Calculation:
Let's assume the following values for the variables:
  • C = 8
  • D = 4 TB
  • H = 6
  • R = 2 GHz
  • OT = 40 hours
Plugging these values into the model, we get:
LR = (8 x 4 TB x 6) / (2 GHz + 40 hours) = 0.0015 TB/hour
This result indicates that the robot's learning rate is approximately 0.0015 TB of data per hour, based on the provided values for the variables.
To interpret this result, we could estimate that if the robot is provided with a constant stream of data at a rate of 0.0015 TB/hour, it would take approximately 2,667 hours (about 111 days) to process and learn from the full 4 TB dataset, assuming a constant learning rate and no significant changes in the other variables. In reality, the learning rate may vary over time, and the robot may require additional data or adjustments to its algorithms to achieve optimal performance.
  • asked a question related to Dataset
Question
1 answer
Hello Everyone !
I am currently analyzing the emissions of around 300 companies over a time span of roughly 20 years (time series data!). I am wondering what is the best way to approach the analysis of this dataset and what methods I can use to draw insights from it.
I was thinking about starting by indexing my dataset (since companies don't have the same volume of emissions) and then averaging these indexes according to specific characteristics of the companies (e.g., size, country, etc.) in order to attempt to pick up trends.
After the descriptive statistical analysis, I was thinking that I could top my analysis with a regression analysis of emissions according to the type of company (investment company, state-owned, etc.). For that matter, is there a specific statistical test I can do to regress time series data according to a specific independent variable?
Let me know what you think of this approach...I am listening to your comments !
Cordially,
Diego Spaey
Relevant answer
Answer
Hi,
Your approach is good. Use ARIMA for trend analysis and panel data regression for regression analysis, ensuring to check for stationarity and autocorrelation in your time series data.
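As a hedged sketch of the regression step (emissions regressed on company type with year fixed effects, pooled across the panel) using statsmodels: the column names and file are hypothetical, and for a full panel treatment you may prefer dedicated panel estimators:
```python
# Sketch: pooled OLS of emissions on company type with year fixed effects (hypothetical columns).
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format panel: one row per company-year with company_id, year, type, emissions.
df = pd.read_csv("emissions_panel.csv")

model = smf.ols("emissions ~ C(type) + C(year)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["company_id"]})  # cluster standard errors by company
print(model.summary())
```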
Hope this helps.
  • asked a question related to Dataset
Question
4 answers
Basically I have 40 subjects, and for each I collected:
- coronary cannulation before TAVR as: selective, non-selective and unsuccessful.
Then I collected the same data after TAVR with the same 3 levels of outcome.
This is a case of repeated measures with a multilevel outcome. In addition, my contingency table is not "square" because there are no "non-selective" outcomes in the "before TAVR" group.
Here is my dataset (rows = BeforeTAVR, columns = AfterTAVR):
            AfterTAVR
BeforeTAVR    0    1    2
        0     1    0    0
        1     1   16   22
I cannot use McNemar's, Stuart-Maxwell's or Cochran's Q test due to the non-binary outcome and the non-square (3x2) matrix of my dataset.
Does anyone have suggestions? I would really appreciate it.
Relevant answer
Answer
Anyway, even if I consider "1 - always selective" before TAVR, I still need to compare this with 3 different levels of categorical outcome after TAVR as a repeated measure.
  • asked a question related to Dataset
Question
1 answer
I have selected two deep learning models, a CNN and an SAE, for data analysis of a 1-D digitized dataset. I need to justify the choice of these two DL models in comparison to other DL and standard ML models. I am using a GA to optimize the hyperparameter values of the two DL models. Can you give some input on this query? Thanks.
Relevant answer
Answer
Typically, the rationale for choosing a model can be training time, prediction time, and the value of the metric itself, either on a validation set or in cross-validation, depending on what you are using. It is better, of course, to use more than one metric as indicators, as well as a confusion (error) matrix together with recall and precision, or simply F1 or F-beta, depending on the problem you are solving.
  • asked a question related to Dataset
Question
1 answer
Hello,
I am looking for the best downscaling technique to correct a precipitation climate change dataset. I am not sure which of these two methods is more robust for my task.
Thanks!
Relevant answer
Answer
Salam Alaikum,
The two methods you mentioned and discuss their suitability for your task.
1. Downscaling Techniques:
a. Bias Correction: Pros: simple and widely used; corrects systematic errors. Cons: may not capture spatial variability well.
b. Equiratio Quantile Mapping: Pros: addresses biases and spatial variability. Cons: can be complex to implement.
Both methods have their merits, but Equiratio Quantile Mapping tends to be more robust in capturing spatial patterns and non-linear relationships. If you're looking for a method that considers both biases and spatial variability, this could be a good choice.
2. Code for Equiratio Quantile Mapping:
Implementing Equiratio Quantile Mapping involves statistical calculations. While I can't provide the entire code here, I can guide you on where to find resources:
  • Research Papers: Look for scientific papers or articles that detail the Equiratio Quantile Mapping method. These often include equations and explanations.
  • GitHub Repositories: Explore repositories on GitHub that focus on climate data analysis or downscaling techniques. Researchers and developers often share their code for others to use.
  • Online Forums: Platforms like the Esri Community, Stack Overflow, or other climate science forums might have discussions or shared code snippets related to Equiratio Quantile Mapping.
When implementing the code, ensure that it aligns with the specifics of your dataset and the goals of your downscaling process. If you encounter challenges or need clarification on specific aspects of the code, feel free to ask for guidance.
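Along those lines, here is a minimal numpy sketch of plain empirical quantile mapping for precipitation; note it is the basic variant, not the exact equiratio formulation (which applies a ratio-based correction at each quantile), and the gamma-distributed data are hypothetical, so treat it only as a starting point to adapt:
```python
# Minimal sketch: empirical quantile mapping of modelled precipitation onto observed quantiles.
import numpy as np

def quantile_map(obs_hist, mod_hist, mod_future, n_q=100):
    """Map future model values through historical model/observed quantiles."""
    q = np.linspace(0.01, 0.99, n_q)
    obs_q = np.quantile(obs_hist, q)
    mod_q = np.quantile(mod_hist, q)
    # An equiratio variant would instead scale mod_future by obs_q/mod_q at the matching quantile.
    return np.interp(mod_future, mod_q, obs_q)

rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 3.0, 5000)        # hypothetical observed daily precipitation
mod_h = rng.gamma(2.0, 4.0, 5000)      # hypothetical (biased) historical model run
mod_f = rng.gamma(2.0, 4.5, 5000)      # hypothetical future model run
corrected = quantile_map(obs, mod_h, mod_f)
```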
Remember to document your methodology and validate the results against observed data to ensure the downscaling technique is suitable for your specific climate change dataset. If you have further questions or need more assistance.
If you find my reply useful, please recommend it. Thanks.
  • asked a question related to Dataset
Question
1 answer
So, I have datasets of precipitation and temperature for 1980-2020; how do I plot a similar map? I am not sure how to proceed. I have annual precipitation for approximately 15 stations in my catchment. If I take the average annual rainfall values, it gives an average annual map, if I am not wrong. Then how do I find the change in precipitation? Should I use a formula to find the value for each station?
Relevant answer
Answer
Do you mean determining rainfall patterns?
  • asked a question related to Dataset
Question
4 answers
Hello,
I am trying to remove outliers from a dataset. I removed a few, but new ones keep appearing. What should I do?
Please do not post AI generated answers.
Relevant answer
Answer
It's probably because they're not outliers. It's just that your data follow a non-normal distribution. For example, a log-normal distribution. As you remove the most extreme observations, other observations then appear "extreme".
P.S. Don't remove "outliers". You shouldn't delete data just because it doesn't look like what you thought it should.
  • asked a question related to Dataset
Question
3 answers
Good morning, I have two datasets with exactly the same columns. I would like to select rows that have a matching ID between the two datasets (please see the tables below). I tried to merge the datasets with the rbind function, but all rows were included. Do you have any advice on how to keep only the rows with a matching ID?
Input
df1
ID VAR 1 VAR 2
a ... ...
b ... ...
c ... ...
d ... ...
df2
ID VAR 1 VAR 2
a ... ...
b ... ...
e ... ...
f ... ...
Output
df
ID VAR 1 VAR 2
a ... ...
b ... ...
Relevant answer
Answer
I found a solution to this problem.
df2[df2$ID %in% df1$ID,]
where df1 contains the identifiers you want to match and df2 is your dataset including all variables; the expression keeps only the rows of df2 whose ID also appears in df1.
  • asked a question related to Dataset
Question
3 answers
Firth logistic regression is a special version of usual logistic regression which handles separation or quasi-separation issues. To understand the Firth logistic regression, we have to go one step back.
What is logistic regression?
Logistic regression is a statistical technique used to model the relationship between a categorical outcome/predicted variable, y(usually, binary - yes/no, 1/0) and one or more independent/predictor or x variables.
What is maximum likelihood estimation?
Maximum likelihood estimation is a statistical technique to find the model that best represents the relationship between the outcome and the independent/predictor variables of the underlying data (your dataset). The estimation process calculates the likelihood of different models given the data and then selects the model that maximizes this likelihood.
What is separation?
Separation means empty bucket for a side! Suppose, you are trying to predict meeting physical activity recommendations (outcome - 1/yes and 0/no) and you have three independent or predictor variables like gender (male/female), socio-economic condition (rich/poor), and incentive for physical activity (yes/no). Suppose, you have a combination, gender = male, socio-economic condition = rich, incentive for physical activity = no, which always predict not meeting physical activity recommendation (outcome - 0/no). This is an example of complete separation.
What is quasi-separation?
Reconsider the above example. We have 50 adolescents for the combination- gender = male, socio-economic condition = rich, incentive for physical activity = no. For 49/48 (not exactly 50, near about 50) of them, outcome is "not meeting physical activity recommendation" (outcome - 0/no). This is the instance of quasi-separation.
How separation or quasi-separation may impact your night sleep?
When separation or quasi-separation is present in your data, traditional logistic regression will keep increasing the coefficients of the predictors/independent variables to an infinite level (to be honest, not infinite; the wording should be "without limit") to establish the bucket theory - one of the outcomes is completely or nearly empty. When this anomaly happens, it is actually suggesting that the traditional logistic regression model is not appropriate here.
There is a bookish name of the issue - convergence issue. But how to know convergence issues have occurred with the model?
- Very large co-efficient estimates. The estimates could be near infinite too!
- Along with large co-efficient estimates, you may see large standard errors too!
- It may also happen that logistic regression tried several times (known as iterations) but failed to get the best model or in bookish language, failed to converge.
What to do if such convergence issues have occurred?
Forget all the hard work you have done so far! You have to start a new journey with an alternative logistic regression, which is known as Firth logistic regression. But what does Firth logistic regression actually do? Without using too many technical terms, Firth logistic regression leads to more reliable coefficients, which ultimately helps to choose the best representative model for your data.
How to conduct Firth logistic regression?
First install the package "logistf" and load it in your R-environment.
install.packages("logistf")
library(logistf)
Now, assume you have a dataset "physical_activity" with a binary outcome variable "meeting physical activity recommendation" and three predictor/independent variables: gender (male/female), socio-economic condition (rich/poor), and incentive for physical activity (yes/no).
pa_model <- logistf(meet_PA ~ gender + sec + incentive, data = physical_activity)
Now, display the result.
summary(pa_model)
You got log odds. Now, we have to convert it into odds.
odds_ratios_pa <- exp(coef(pa_model))
print(odds_ratios_pa)
Game over! Now, how to explain the result?
Don't worry! There is nothing special. The explanation of Firth logistic regression's results is the same as for a traditional logistic regression model. However, if you are struggling with the explanation, let me know in the comments. I will try my best to reduce your stress!
Note: If you find any serious methodological issue here, my inbox is open!
Relevant answer
Answer
Thank you for this post. I am curious, can you conduct a Hosmer-Lemeshow goodness-of-fit test on your logistf model in R?
  • asked a question related to Dataset
Question
1 answer
Hi everyone! I tried to perform a classic One Way Anova with the package GAD in R, followed by a SNK test, which I always used, but it didn't work with this dataset, and I got the same error for both tests, which is the following:
"Error in if (colnames(tm.class)[j] == "fixed") tm.final[i, j] = 0 :
missing value where TRUE/FALSE needed"
I understand there is something that gives NA values in my dataset, but I do not know how to fix it. There are no NA values in the dataset itself. Here is the dataset:
temp Filtr_eff
gradi19 11.33
gradi19 15.90
gradi19 10.54
gradi26 11.01
gradi26 -1.33
gradi26 9.80
gradi30 -49.77
gradi30 -42.05
gradi30 -32.03
So, I have three different levels of the factor temp (gradi19, gradi26 and gradi30) and my variable is Filtr_eff. I also already set the factor as fixed.
Please help me: how do I fix the error? I could run the ANOVA with another package (the car library, for example, worked with this dataset) and I could use Tukey instead of SNK, but I want to understand why I got this error, since it has never happened to me before. Thanks!
PS: I attached the R and txt files
Relevant answer
Answer
No one answered, but I found the solution, so I am writing it here in case someone needs it in the future!
With the GAD package you have to change the name of the factor; it cannot be the same as the variable name. I changed it as in the script I leave here, and now it works!
  • asked a question related to Dataset
Question
2 answers
Short Course: Statistics, Calibration Strategies and Data Processing for Analytical Measurements
Pittcon 2024, San Diego, CA, USA (Feb 24-28, 2024)
Time: Saturday, February 24, 2024, 8:30 AM to 5:00 PM (Full day course)
Short Course: SC-2561
Presenter: Dr. Nimal De Silva, Faculty Scientist, Geochemistry Laboratories, University of Ottawa, Ontario, Canada K1N 6N5
Abstract:
Over the past few decades, instrumental analysis has come a long way in terms of sensitivity, efficiency, automation, and the use of sophisticated software for instrument control and data acquisition and processing. However, the full potential of such sophistication can only be realized with the user’s understanding of the fundamentals of method optimization, statistical concepts, calibration strategies and data processing, to tailor them to the specific analytical needs without blindly accepting what the instrument can provide. The objective of this course is to provide the necessary knowledge to strategically exploit the full potential of such capabilities and commonly available spreadsheet software. Topics to be covered include Analytical Statistics, Propagation of Errors, Signal Noise, Uncertainty and Dynamic Range, Linear and Non-linear Calibration, Weighted versus Un-Weighted Regression, Optimum Selection of Calibration Range and Standard Intervals, Gravimetric versus Volumetric Standards and their Preparation, Matrix effects, Signal Drift, Standard Addition, Internal Standards, Drift Correction, Matrix Matching, Selection from multiple responses, Use and Misuse of Dynamic Range, Evaluation and Visualization of Calibrations and Data from Large Data Sets of Multiple Analytes using EXCEL, etc. Although the demonstration data sets will be primarily selected from ICPES/MS and Chromatographic measurements, the concepts discussed will be applicable to any analytical technique, and scientific measurements in general.
Learning Objectives:
After this course, you will be familiar with:
- Statistical concepts, and errors relevant to analytical measurements and calibration.
- Pros and cons of different calibration strategies.
- Optimum selection of calibration type, standards, intervals, and accurate preparation of standards.
- Interferences, and various remedies.
- Efficient use of spreadsheets for post-processing of data, refining, evaluation, and validation.
Access to a personal laptop for the participants during the course would be helpful, although internet access during the course is not necessary. However, some sample- and work-out spreadsheets, and course material need to be distributed (emailed) to the participants day before the course.
Target Audience: Analytical Technicians, Chemists, Scientists, Laboratory Managers, Students
Register for Pittcon: https://pittcon.org/register
Relevant answer
Answer
Dear Thiphol:
Many thanks for your interest. Currently, I don't have a recorded video. However, I may offer this course in the future on-line in a webinar format if there is sufficient interest/inquiries.
Thanks again.
Nimal
  • asked a question related to Dataset
Question
2 answers
The datasets are provided as medians and interquartile ranges. Can we perform a pooled analysis? How do we convert variables to get the events as (N)?
Relevant answer
Answer
When you have data presented in the median and interquartile range (IQR) and you want to perform a pooled analysis to calculate proportions, you may face a challenge because the median and IQR summarize the central tendency and spread of the data. Still, they do not provide information about individual data points.
Proportions are typically calculated based on the counts of events or observations concerning the total number of events or observations. You need the actual data values rather than summary statistics like median and IQR to calculate proportions.
If your goal is to perform a pooled analysis and calculate proportions, you would ideally need access to the raw data or summary statistics that allow for the computation of proportions. If obtaining the individual data is impossible, you may need to explore alternative statistical methods or approaches that can accommodate summary statistics like medians and IQR.
If the median and IQR are the only summary statistics available, you might consider reaching out to the original data sources or authors of the study to request the raw data or additional information that would allow for more detailed analysis, including the calculation of proportions.
  • asked a question related to Dataset
Question
4 answers
...
Relevant answer
Answer
Dear Doctor
"Types of Feature Selection Methods in ML
  1. Information Gain. Information gain calculates the reduction in entropy from the transformation of a dataset. ...
  2. Chi-square Test. ...
  3. Fisher's Score. ...
  4. Mean Absolute Difference (MAD) ...
  5. Forward Feature Selection. ...
  6. Exhaustive Feature Selection. ...
  7. Recursive Feature Elimination. ...
  8. LASSO Regularization (L1)"
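A hedged scikit-learn sketch of two of the filter methods listed above (chi-square and mutual information, the latter closely related to information gain), applied to a built-in dataset for illustration:
```python
# Sketch: chi-square and mutual-information filter selection on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_nonneg = MinMaxScaler().fit_transform(X)          # chi2 requires non-negative features

chi_top5 = SelectKBest(chi2, k=5).fit(X_nonneg, y)
mi_top5 = SelectKBest(mutual_info_classif, k=5).fit(X, y)

print("chi2 picks:", chi_top5.get_support(indices=True))
print("mutual info picks:", mi_top5.get_support(indices=True))
```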
  • asked a question related to Dataset
Question
7 answers
I have three RNA-Seq datasets of the same tissue and want to analyse them on Galaxy. My initial literature survey gave me the idea that I can merge the three datasets if they are from the same model and tissue followed by making two groups Control and Test and then run the analysis. Am I correct?
Can somebody with more experience elaborate on this?
Or it is a better idea to analyse the three datasets separately and find the common mRNAs?
Relevant answer
Answer
There will likely be significant batch effects. I would analyze each set separately to get a higher power (which will be the case when the variability between the sets is large and won't be compensated by the reduction of standard errors due to the larger sample size).
You might consider pooling the p-values according to Fisher's method, if you need a single p-value per gene.
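A minimal sketch of Fisher's method for pooling a gene's p-values across the separately analysed datasets, using scipy; the example p-values are hypothetical:
```python
# Sketch: combine one gene's p-values from three separate DE analyses with Fisher's method.
from scipy.stats import combine_pvalues

p_values = [0.04, 0.20, 0.01]   # hypothetical p-values for the same gene in three datasets
statistic, pooled_p = combine_pvalues(p_values, method="fisher")
print(pooled_p)
```
In practice you would loop this over all genes shared by the three analyses (e.g., from a per-gene p-value table) and then correct the pooled p-values for multiple testing.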
  • asked a question related to Dataset
Question
1 answer
How to integrate two different ML or DL models in a single framework?
Relevant answer
Answer
Yes, you can integrate multiple ML or DL models trained on different datasets and diverse inputs. Think of it as orchestrating experts with different knowledge to solve a complex problem. Here are common approaches:
Ensemble Learning: Combine multiple models' predictions to create a more robust and accurate one. Think of it as a panel of experts voting on the best answer.
Stacking: Train a meta-model to learn how to best combine the predictions of individual models. Like having a manager who knows how to weigh each expert's opinion.
Pipelines: Chain models together sequentially, where each model's output becomes the input for the next. Like an assembly line, where each expert adds their expertise.
Multimodal Models: Design models that handle multiple input types, like text and images, fusing information from different sources. Like a multi-lingual expert who can integrate knowledge from different languages.
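A hedged sketch of the stacking approach described above, using scikit-learn's StackingClassifier on a built-in dataset:
```python
# Sketch: stack two base models and let a logistic-regression meta-model combine their predictions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))

print(cross_val_score(stack, X, y, cv=5).mean())
```
Note that this sketch trains both base models on the same dataset; combining models trained on genuinely different datasets usually requires aligning their input features (or exchanging predictions rather than raw inputs) first.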
  • asked a question related to Dataset
Question
1 answer
Outlier detection criteria in a dataset.
Relevant answer
Answer
Dear Shahzad, the Inter Quartile Range (IQR) method uses a range of values, defined with the help of the first and third quartiles, to identify outliers. The formula makes use of two values, termed the low fence and high fence, beyond which any value is considered an outlier.
Inter Quartile Range = IQR = (Q3 – Q1)
Where Q3 is the 3rd Quartile and Q1 is the 1st Quartile
Low Fence value = LF = Q1 – 1.5 * IQR
High Fence value = HF = Q3 + 1.5 * IQR
It is assumed here that the maximum value in any given set of data is within 1.5 times the IQR above Q3. Likewise, the minimum value is assumed to be within 1.5 times the IQR below Q1.
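A small numpy sketch of the fence calculation described above; the sample data are hypothetical:
```python
# Sketch: flag outliers using the 1.5 * IQR fences.
import numpy as np

x = np.array([4.1, 4.5, 4.8, 5.0, 5.2, 5.4, 5.9, 6.1, 6.3, 14.0])  # hypothetical data with one extreme value
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < low_fence) | (x > high_fence)]
print(low_fence, high_fence, outliers)   # the value 14.0 falls above the high fence
```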
For more details, please see the following two research papers:
Ramnath Takiar (2023):The Relationship between the SD and the Range and a method for the Identification of the Outliers, Bulletin of Mathematics and Statistics Research, Vol. 11(4), 62-75.
Ramnath Takiar (2023):A New Method to Identify the Outliers Based on the Interquartile Range, Bulletin of Mathematics and Statistics Research, Vol. 11(4), 103-114.
  • asked a question related to Dataset
Question
1 answer
I'm not finding any solutions for DE analysis in R and also can't figure out which dataset should be used for this type of analysis. Looking for some help!
Relevant answer
Answer
You can perform DESeq2 analysis on the quantified (count) dataset that is generated after alignment; you can also download already-quantified data directly from the GEO database. DESeq2-style analysis can also be performed in Python.
  • asked a question related to Dataset
Question
1 answer
I want to get data for climatic variables from the Climate Research Unit dataset for analysis
Relevant answer
Answer
Hey there Moses Owoicho Audu! Extracting data from the Climate Research Unit dataset involves a few steps, and it depends on the specific variables you're interested in. Assuming you're comfortable with programming, you can use languages like Python and tools like pandas to make the process smoother.
1. **Access the Dataset:**
First, make sure you have access to the Climate Research Unit dataset. You might need to download it from a reliable source or access it through an API if available.
2. **Data Format:**
Check the format of the dataset. It could be in CSV, Excel, or another format. Understanding the structure of the data is crucial for efficient extraction.
3. **Python and Pandas:**
Python is a popular language for data analysis. Use the pandas library to read and manipulate the data. Assuming your dataset is in CSV format, here's a simple example:
```python
import pandas as pd
# Replace 'your_dataset.csv' with the actual file name
dataset = pd.read_csv('your_dataset.csv')
# Now 'dataset' is a pandas DataFrame, and you can start analyzing the data
```
4. **Filtering Variables:**
Identify the climatic variables you're interested in and filter the dataset accordingly. For example, if you're looking at temperature and precipitation, create new DataFrames for each.
```python
temperature_data = dataset[['Date', 'Temperature']]
precipitation_data = dataset[['Date', 'Precipitation']]
```
5. **Time Series Analysis:**
Since you're dealing with climatic data, consider performing time series analysis. Pandas provides excellent support for this.
6. **Visualization:**
Use visualization libraries like Matplotlib or Seaborn to create plots and gain insights into the data.
```python
import matplotlib.pyplot as plt
plt.plot(temperature_data['Date'], temperature_data['Temperature'])
plt.title('Temperature Over Time')
plt.xlabel('Date')
plt.ylabel('Temperature')
```
Remember, the specifics depend on the structure of the dataset and your analysis goals. If you have more detailed requirements, feel free to share them!
  • asked a question related to Dataset
Question
1 answer
Seeking insights for algorithmic optimization.
Relevant answer
Answer
As a starting point, what did Google tell you?
  • asked a question related to Dataset
Question
4 answers
Hello everyone
I am working on my master's thesis.
I have 3 latent IVs and one ordered DV.
Can I use GSEM to deal with my dataset in Stata?
Thank you
Relevant answer
Answer
If you already have latent IVs then why not treat your DV the same way?
  • asked a question related to Dataset
Question
1 answer
I am an MSc student and my thesis is framed around developing a CNN-based approach to predict soil carbon hotspots using remote sensing data. Soil carbon hotspots are areas where the concentration of organic carbon in the soil is unusually high. These hotspots are important because they play a critical role in the global carbon cycle, helping to regulate the Earth's climate. This research will focus on developing a CNN-based approach to predict soil carbon hotspots, which can be used to identify areas that are particularly important for soil carbon sequestration. I would greatly appreciate assistance in assessing the dataset, which combines remote sensing and satellite data, so that I can use it in my thesis. Thank you for your time and consideration.
Relevant answer
Answer
I can't answer the question and would pose another. What is the quality of soil C signals in the source data?
  • asked a question related to Dataset
Question
1 answer
Hello!
Actually, I want to delineate groundwater potential zones using the FR model.
But I am confused about the groundwater well data that are used for different research purposes. Most papers divide the data into two sets (training and testing) for FR calculation and validation. However, my study area contains only 19 wells. So I am confused: can I divide them into training and testing datasets, or should all 19 wells be used for both the FR calculation and validation?
Please advise.
Thanks in advance.
Relevant answer
Answer
The FR is a bivariate statistical approach used to determine the probability of groundwater potential areas on the basis of the relationships between springs/wells and the independent variables, i.e., the factors influencing groundwater occurrence.
  • asked a question related to Dataset
Question
2 answers
Lately I'm working on scRNA-seq analysis. It took some time for me to find a proper dataset on GEO, whose accession ID is GSE157783. I expected to get three files for each sample from the dataset, but I ended up finding that the authors only provided 3 files in total.
Besides, I found the format of the 3 files different from that mentioned in the online courses. I suppose files ending with "tsv.gz" are needed, but here I just found 3 "tar.gz" files.
Hope someone can help :(
Relevant answer
Answer
Kantemir Bzhikhatlov Thanks for your kind help! : )
  • asked a question related to Dataset
Question
2 answers
I need your help PLEASE!
For my research paper, in order to develop my dataset, I filled the missing observations with an interpolation/extrapolation method. I need to ensure the quality and behavior of the data before starting my analysis. Could you kindly provide more details on the specific steps and methodologies to be employed to ensure the meaningfulness and verifiability of the results? I am particularly interested in understanding:
- The quality assurance measures taken before and after applying interpolation/extrapolation techniques.
- Whether there is a trend approach to be adopted to reflect developments within the periods of missing data.
- Whether there are any diagnostic tests to be conducted to validate the reliability of the filled data.
Thank you in advance for your time and consideration.
Relevant answer
Answer
Before applying interpolation or extrapolation techniques, consider:
1. Data verification
2. Data preprocessing
3. Model selection
4. Validation and cross-validation
5. Sensitivity analysis
6. Error estimation
After applying interpolation or extrapolation techniques, additional quality assurance measures can be taken:
1. Result validation
2. Sensitivity analysis
3. Result visualization
By following these quality assurance measures, the accuracy and reliability of the interpolation or extrapolation results can be improved, ensuring that the derived values are as valid and useful as possible.
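For the validation point above, one simple, hedged approach is a hold-out check: hide some known values, re-interpolate them, and compare against the originals. A pandas/numpy sketch with hypothetical data follows; the linear interpolation and series used here are placeholders for whichever method and variables you actually applied:
```python
# Sketch: hold-out validation of an interpolation scheme (hide known points, re-estimate, compare).
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
series = pd.Series(np.sin(np.linspace(0, 6, 120)) + rng.normal(0, 0.05, 120))  # hypothetical complete series

holdout = rng.choice(series.index[1:-1], size=15, replace=False)  # keep endpoints so interpolation is defined
masked = series.copy()
masked.loc[holdout] = np.nan

filled = masked.interpolate(method="linear")            # the technique being validated
rmse = np.sqrt(((filled.loc[holdout] - series.loc[holdout]) ** 2).mean())
print("hold-out RMSE:", rmse)
```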
  • asked a question related to Dataset
Question
2 answers
The iris images of the CASIA Iris V3 Lamp dataset were acquired under variations of illumination. Local thresholding fails to segment some iris images. The alternative is to use the adaptive thresholding technique: does this type of thresholding perform well?
Relevant answer
Answer
Thanks for your help.
  • asked a question related to Dataset
Question
2 answers
Hi,
I want to study doping-effect characterization using ellipsometry. I have 5 datasets of n & k values of doped thin films. Is there any software available to simulate ellipsometry and obtain parameters like reflection and delta for further analysis? I tried to find this in ANSYS Lumerical but couldn't find any good information about ellipsometry simulation.
Thanks.
Relevant answer
Answer
Dear friend Saurav Gautam
Hey there! Now, when it comes to diving into the world of ellipsometry simulation, I have got your back. Simulation tools are crucial for understanding the intricate details of thin films and their optical properties. While I might not have real-time information, let me recommend a few software options that were popular for ellipsometry simulations:
1. **FilmWizard by J.A. Woollam**: This software is designed for spectroscopic ellipsometry data analysis and simulation. It's widely used in both academia and industry.
2. **CompleteEASE by J.A. Woollam**: Another tool from J.A. Woollam, CompleteEASE is a comprehensive ellipsometry data analysis and simulation software.
3. **WVASE32 by J.A. Woollam**: This is a powerful software tool for spectroscopic ellipsometry data analysis and simulation.
4. **DeltaPsi2 by HORIBA Jobin Yvon**: This software is part of the ellipsometer offerings by HORIBA and is known for its user-friendly interface.
5. **TFCalc by Software Spectra Inc.**: While primarily known for thin film design, TFCalc also supports ellipsometric analysis and simulation.
Remember, the availability of specific features might vary across these tools, so it's a good idea to explore the documentation or contact the software providers for more detailed information.
Now, go forth and unravel the mysteries of your doped thin films with the power of simulation! If you need further insights or have any other questions, just shout out. I am here to assist!
  • asked a question related to Dataset
Question
7 answers
Supervised Learning
In supervised learning, the dataset is labeled, meaning each input has an associated output or target variable. For instance, if you're working on a classification problem to predict whether an email is spam or not, each email in the dataset would be labeled as either spam or not spam. Algorithms in supervised learning are trained using this labeled data. They learn the relationship between the input variables and the output by being guided or supervised by this known information. The ultimate goal is to develop a model that can accurately map inputs to outputs by learning from the labeled dataset. Common tasks include classification, regression, and ranking.
Unsupervised Learning
Unsupervised learning deals with unlabeled data, where the information does not have corresponding output labels. There is no specific target variable for the algorithm to predict. Algorithms in unsupervised learning aim to find patterns, structures, or relationships within the data without explicit guidance. For instance, clustering algorithms group similar data points together based on some similarity or distance measure. The primary goal is to explore and extract insights from the data, uncover hidden structures, detect anomalies, or reduce the dimensionality of the dataset without any predefined outcomes.

In short, supervised learning uses labeled data with known outcomes to train models for prediction or classification tasks, while unsupervised learning works with unlabeled data to discover inherent patterns or structures without explicit guidance on the expected output. Both have distinct applications and are used in different scenarios depending on the nature of the dataset and the desired outcomes.
Relevant answer
Answer
In the realm of machine learning, the main distinction between supervised and unsupervised learning lies in the nature of the dataset used for training.
Supervised Learning Dataset:
In supervised learning, the dataset consists of labeled examples, where each data instance is associated with a corresponding target or output value. The dataset includes both input features and the desired output or target variable. The aim of supervised learning is to learn a mapping function that can accurately predict the target variable based on the input features. The model is trained using labeled examples, allowing it to generalize and make predictions on unseen data. Common examples of supervised learning algorithms include linear regression, decision trees, and support vector machines.
Unsupervised Learning Dataset:
On the other hand, unsupervised learning involves unlabeled datasets, meaning they do not have corresponding target values. In this scenario, the model learns patterns, structures, or relationships within the data based solely on the input features. The objective of unsupervised learning is to discover inherent patterns or groupings within the data without prior knowledge of the desired output. Common unsupervised learning algorithms include clustering algorithms such as k-means clustering and dimensionality reduction techniques like principal component analysis (PCA).
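A minimal scikit-learn sketch of the contrast described above, using the built-in Iris data purely for illustration: the supervised model is scored against held-out labels, while the clustering step sees only the features. The specific estimators and split are arbitrary choices, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Supervised: labels guide training, so predictions can be scored against held-out labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("supervised accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Unsupervised: the same features without labels; the algorithm only finds structure
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
```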
  • asked a question related to Dataset
Question
2 answers
In the realm of machine learning, the availability of large and diverse datasets is often crucial for effective model training. However, in certain domains where data is limited or privacy concerns are paramount, exploring the use of synthetic datasets emerges as a compelling alternative.
Question: How can the adoption of synthetic datasets revolutionize machine learning applications in areas with data scarcity and stringent privacy considerations?
Relevant answer
Answer
Synthetic data is well suited to model validation and to testing the performance and accuracy of a model, method, or algorithm; however, real data remains essential for assessing how the method or algorithm actually performs in practice.
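As a concrete illustration of that validation role, a labelled synthetic dataset can be generated to exercise a pipeline end to end before any real (and possibly sensitive) data is touched. A minimal scikit-learn sketch follows; the class balance, feature counts, and choice of model are arbitrary values for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a privacy-free stand-in dataset with a controlled class imbalance
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
print("accuracy on synthetic hold-out:", accuracy_score(y_te, model.predict(X_te)))
```

Performance numbers obtained this way only validate the pipeline itself; as noted above, they say little about how the model will behave on the real distribution.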
  • asked a question related to Dataset
Question
5 answers
I have a dataset covering 1,900 companies, and for each company I surveyed 10 employees, including a question about each employee's risk preference. I now need to calculate ICC1 and ICC2 values for each company. Each company has a unique company_id, and the employee-level dataset (19,000 records) can be matched to companies via this company_id. How can I obtain the ICC1 and ICC2 values in R? I have been trying for a few days and hope someone can help me resolve this.
Relevant answer
Answer
P.S.: Paul Bliese has a multilevel tutorial for R in which he shows how to calculate the above-mentioned indices, as well as others, since each has its own specific problems, which would lead too far to discuss here.
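To complement the pointer to Bliese's tutorial: ICC(1) and ICC(2) come from the between- and within-group mean squares of a one-way ANOVA with company as the grouping factor, so they can also be computed by hand. The sketch below is in Python rather than R and assumes a `risk_preference` column name (the `company_id` column follows the question); it is meant only to make the formulas concrete, not to replace the multilevel package.

```python
import numpy as np
import pandas as pd

def icc1_icc2(df, group_col="company_id", value_col="risk_preference"):
    """ICC(1) and ICC(2) from a one-way ANOVA with `group_col` as the grouping factor."""
    groups = df.groupby(group_col)[value_col]
    grand_mean = df[value_col].mean()
    n_j = groups.size()                         # employees per company (10 here)
    J, N = len(n_j), n_j.sum()                  # number of companies, total employees
    ss_between = (n_j * (groups.mean() - grand_mean) ** 2).sum()
    ss_within = ((df[value_col] - groups.transform("mean")) ** 2).sum()
    ms_between = ss_between / (J - 1)
    ms_within = ss_within / (N - J)
    k = n_j.mean()                              # (average) group size
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between
    return icc1, icc2

# Hypothetical usage: df has one row per employee with company_id and risk_preference columns
# icc1, icc2 = icc1_icc2(df)
```

Note that ICC(1) and ICC(2) are conventionally reported once for the whole grouping structure (all companies together) rather than separately per company; per-company agreement is usually assessed with within-group indices such as r_wg, which Bliese's tutorial also covers.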