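[How to Write Papers in English] No. 77: “Analyzing your data (part 1): exploring your data”
October 27, 2020, 14:24
No. 76 covered “Considering the implications of your research results.”
The theme of No. 77 (this issue) is
“Analyzing your data (part 1): exploring your data.”
Obtaining data may be the most time-consuming part of doing research.
Over three articles, including this one, we will present
a helpful way to think about the process of data analysis.
Part 1 covers the following topics: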
・Decide what to look for
・Group and compare your data
・Propose physical explanations for your data
・Look for patterns and deviations
・Consider non-traditional ways to explore your data
・Learn from your exploration
Part 2 follows next, so we hope you will keep reading.
Analyzing your data (part 1 of 3): exploring your data
By Geoffrey Hart
Obtaining data, whether in the lab or the field, can be the most time-consuming part of research. But eventually, you’ll return to your office with your data and face the next challenge: exploring the data to see what you’ve learned and then sharing it with the world. In this three-part article, I’ll discuss one way to think of this process: explore your data first, formally analyze the data to confirm the first impressions that result from that exploration, and then decide how to present your discoveries.
Note: Data analysis is a large, complex subject. Whole books have been written about it, so I cannot cover the entire subject here. Instead, my goal is to describe a helpful way to think about the process.
Decide what to look for
Before you start your exploration, define what you’re looking for. This depends on your research hypothesis. For example, if your goal is:
- to get a sense of how your data are distributed: plot a histogram (i.e., a frequency or probability distribution)
- to determine the specific values of measured variables: calculate a “measure of central tendency” such as the mean or median
- to detect differences between experimental treatments: calculate differences
- to describe the variation: calculate standard deviations, standard errors, coefficients of variation, or confidence intervals
- to show trends: create graphs of variables as a function of changes in time or changes in another variable
- to detect relationships: examine scatterplots for the relationships between pairs of variables
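As a minimal sketch of the first few items in this list, the following Python example uses only the standard library. The two treatment groups, their values, and the function names `describe` and `histogram` are made up for illustration; confidence intervals are left out because they depend on the distribution you assume.

```python
import math
import statistics
from collections import Counter

# Hypothetical measurements for two treatments (made-up values for illustration).
control = [4.1, 4.8, 5.0, 5.2, 5.9, 6.3]
treated = [5.5, 6.1, 6.4, 6.8, 7.2, 7.9]


def describe(values):
    """Return basic exploratory summaries for one group of measurements."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample standard deviation (n - 1)
    se = sd / math.sqrt(n)             # standard error of the mean
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "se": se,
        "cv_percent": 100 * sd / mean,  # coefficient of variation
    }


def histogram(values, bin_width):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


summary_control = describe(control)
summary_treated = describe(treated)

# Difference between treatment means (treated minus control).
difference = summary_treated["mean"] - summary_control["mean"]
```

Calling `histogram(control, 1.0)` returns the counts per unit-wide bin, which is often enough to see the rough shape of a distribution before drawing a graph.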
Because this stage is exploratory, don’t focus too narrowly. Keep an open mind as you look for relationships and patterns. Some of these may be unexpected and may or may not relate directly to your research questions. The purpose of exploration is not to confirm your prejudices by looking only for a specific pattern; it’s to discover new things, including patterns you did not anticipate. An overly narrow focus can cause you to miss important results, particularly when those results contradict a research hypothesis.
Group and compare your data
Next, divide the data into groups that you can compare to reveal similarities and differences. If your research is based on clearly defined treatments, group the data initially by treatment. If you based your research on different geographic locations, group your data by location. Now begin to compare those groups to look for similarities and differences both within and between groups.
Examine the data for each group completely on its own merits, rather than starting with an assumption of what you’ll see. Assumptions bias our thinking so that we see only data that supports those assumptions (“confirmation bias”) instead of seeing what the data actually show. For example, a closer look at the data in one group may reveal the existence of two sub-groups. Consider a preliminary graph that shows two clusters of data points with the coordinates (x,y), with one group showing high values of x and y and the other showing low values of both variables, with a large gap between the groups that contains no intermediate values. It’s natural to analyze all of your data as a single group, but in this case, you clearly need to subdivide the data into two groups and analyze each group separately. If you can propose a plausible explanation of that clustering, you may have discovered evidence for a physical mechanism that explains the separation between the two groups.
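The two-cluster situation described above can be sketched in a few lines of Python. This is only one simple heuristic (splitting at the widest gap in x), not a formal clustering method, and the points and the function name `split_at_largest_gap` are hypothetical. A wide gap is a prompt to look for an explanation, not proof that two groups exist.

```python
# Hypothetical (x, y) observations that form two clusters (made-up values).
points = [(1.0, 2.1), (1.4, 2.0), (1.9, 2.8), (2.2, 3.1),
          (8.1, 9.0), (8.6, 9.4), (9.0, 9.9), (9.5, 10.2)]


def split_at_largest_gap(points):
    """Split points into two groups at the widest gap between sorted x values."""
    ordered = sorted(points)  # sorts by x first, then by y
    # Pair each gap between neighbouring x values with the index of its left point.
    gaps = [(ordered[i + 1][0] - ordered[i][0], i)
            for i in range(len(ordered) - 1)]
    widest_gap, index = max(gaps)
    return ordered[:index + 1], ordered[index + 1:], widest_gap


low_group, high_group, gap = split_at_largest_gap(points)
```

With these values, the widest gap in x is far larger than the spacing inside either cluster, so each group can then be summarized and plotted separately.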
Propose physical explanations for your data
Think carefully about the physical processes you’re studying and how those processes constrain the distribution of your data. Is it reasonable to assume that the relationship between two variables is linear for all possible values of each variable, with no minimum or maximum value and no regions where behavior changes? More often, there are discontinuities, thresholds, upper and lower limits, or other differences that constrain your data.
Note: See my previous article about linear regression (https://www.worldts.com/english-writing/eigo-ronbun74/index.html) for some additional thoughts on this subject.
If you studied a known physical process, the nature of that process may define separate groups of data. These groups may differ from the groups you were thinking of when you defined your research questions. For example, if your results are not normally distributed, what does this mean? Skewness may reveal a physical process that biases the measured values towards small or large values; for example, a forest that is harvested by selection harvesting in which the harvesters remove only the largest trees will show a skew towards a population dominated by small trees. In contrast, a bimodal distribution (one with two peaks) suggests a need to segment your population into two subgroups with different means and statistical distributions before you continue your analysis.
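To check for skewness numerically before formal testing, one option is the moment-based skewness coefficient, sketched below in Python. The tree diameters are invented for illustration (a stand with many small trees and a few larger ones), and `sample_skewness` is a hypothetical helper name. A positive value indicates a long tail towards large values, with most observations being small; a value near zero indicates a roughly symmetric distribution.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: third central moment divided by the cubed
    population standard deviation."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical tree diameters (cm): many small trees and a few large ones.
diameters = [8, 9, 10, 10, 11, 12, 12, 13, 15, 18, 24, 31]
skew = sample_skewness(diameters)
```

This coefficient does not detect bimodality; a histogram remains the simpler way to notice two peaks.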
Look for patterns and deviations
Although visual interpretations can guide how we analyze data by revealing patterns, don’t rely only on those subjective interpretations; our eyes often mislead us, particularly when we preferentially look for results that support our expectations. Confirm those interpretations using objective techniques. I’ll discuss those techniques in part 2 of this article.
Since you’re still exploring your data at this stage, avoid simplifications that can lead to incorrect interpretations of your data. For example, plot all graphs with both axes starting at 0. Although it’s tempting to plot only the range of data that contains your dataset to clarify details of the variations within that range, this conceals important context: how far your data lie from (0, 0) and how far they fall from the line y=x. Many authors reach incorrect conclusions when they ignore this context. If this additional context proves to be unimportant, you can subsequently graph only the part of the original graphs that contains your data.
Once you establish an overall understanding of the patterns in your data, look for deviations from those patterns (exceptions). Exceptions sometimes result from random statistical noise, but other times they reveal important exceptions to a pattern. Of course, they also may represent data-entry errors that will be much easier to fix now than in several months, when you’re revising your paper after peer review.
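One way to screen for such exceptions is to fit the overall pattern and then flag points whose residuals are unusually large. The Python sketch below assumes the pattern is a straight line and uses the median absolute deviation (MAD) as the yardstick; the data, the threshold of 3, and the function names are all assumptions made for illustration. A flagged point still needs a human decision about whether it is noise, a real exception, or a data-entry error.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_deviations(xs, ys, threshold=3.0):
    """Return indices of points whose residual lies more than
    threshold * MAD away from the median residual."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    centre = statistics.median(residuals)
    mad = statistics.median([abs(r - centre) for r in residuals])
    return [i for i, r in enumerate(residuals)
            if abs(r - centre) > threshold * mad]


# Hypothetical data that roughly follow y = 2x, with one suspicious value (21.0).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.9, 21.0, 14.1, 16.0]
suspects = flag_deviations(xs, ys)
```

Note that a single extreme value also pulls the fitted line towards itself, so after correcting or explaining a flagged point it is worth fitting again.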
Consider non-traditional ways to explore your data
When you’re exploring your data, consider unusual alternatives that can make patterns easier to see. For example, animation provides powerful insights. The human eye is exquisitely skilled at detecting visual changes, so you can learn much by animating a graph to show how a variable (e.g., vegetation cover in each month of the year) evolves during the year or how an organism (e.g., an infectious pathogen) moves through a community. Trying to detect such changes by examining a series of static graphs can reveal the same trend, but because this form of analysis is highly abstract, it is more difficult than actually seeing the change. The popular Origin software offers animation as a built-in tool (https://www.originlab.com/doc/Origin-Help/Graph-Animation), but you can also use less powerful tools, such as Excel and PowerPoint, to create animations (https://www.makeuseof.com/tag/animate-excel-charts-powerpoint/).
Thinking even farther outside the box, some researchers have used sound to explore their data by taking advantage of the power of human hearing to detect changes in frequency or loudness (https://www.frontiersin.org/articles/10.3389/fnins.2016.00009/full). For example, if you convert all your data to magnitudes, and map those magnitudes to a sound volume or sound frequency, you can play the resulting “song”. The changes in volume or frequency may reveal trends that would be difficult to detect in other ways.
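The mapping from magnitudes to sound frequency can be sketched with Python’s standard `wave` module. This is a rough illustration rather than the method used in the study cited above: the monthly values, the 220 to 880 Hz range, the tone length, and the output file name are all arbitrary assumptions.

```python
import math
import struct
import wave


def values_to_frequencies(values, low_hz=220.0, high_hz=880.0):
    """Linearly map each value onto a frequency between low_hz and high_hz."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [low_hz + (v - lo) / span * (high_hz - low_hz) for v in values]


def write_tones(path, frequencies, seconds_per_tone=0.25, sample_rate=44100):
    """Write one sine tone per frequency to a mono 16-bit WAV file."""
    samples_per_tone = int(seconds_per_tone * sample_rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        for freq in frequencies:
            frames = bytearray()
            for i in range(samples_per_tone):
                amplitude = math.sin(2 * math.pi * freq * i / sample_rate)
                # Scale to half of the 16-bit range to keep the volume moderate.
                frames += struct.pack("<h", int(amplitude * 0.5 * 32767))
            wav.writeframes(bytes(frames))


# Hypothetical monthly vegetation cover (%), January to December.
monthly_cover = [12, 15, 22, 35, 51, 64, 70, 66, 52, 33, 20, 14]
tones = values_to_frequencies(monthly_cover)

if __name__ == "__main__":
    write_tones("data_song.wav", tones)  # hypothetical output file name
```

Playing the resulting file gives a pitch that rises towards midsummer and falls again, so the seasonal trend becomes audible.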
If you found that visual animations or sound files helped you to understand your data, your readers and other researchers will also find them useful. In that case, provide the software you used and the resulting visual or auditory data as online supplemental information.
Learn from your exploration
One thing your exploration may reveal is a more complex situation than you expected when you designed your study. Don’t assume that the simplest explanation is the best explanation; that’s a misunderstanding of Occam’s principle, which is often incorrectly assumed to mean that the simplest solution is most likely to be correct. The correct interpretation is that you should not choose a solution more complex than what is necessary to explain your data; a complex explanation is sometimes the only realistic explanation.
Exploration is not only a way to decide what you’ve found: it’s also an important way to improve your future research. Exploration may reveal problems such as an inadequate sample size or inadequate stratification. Learn from those problems and design your next study to mitigate these problems.
In part 2 of this article, I’ll describe how to rigorously confirm what you found.
Acknowledgments
I’m grateful for the reality check on my statistical descriptions provided by Dr. Julian Norghauer (https://www.statsediting.com/about.html). Any errors in this article are my sole responsibility.
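Back issues
No. 1: Using if, in case, and when correctly: expressing degrees of certainty accurately in English
No. 2: English expressions for “equipment/device” (装置)
No. 3: Understanding the nuances of auxiliary verbs correctly: expressing “was able to” and “was not able to”
No. 4: Expressing “using ...”: the difference between by and with
No. 5: The differences between the pronoun it and the demonstrative pronouns this and that in technical English, and how to use them
No. 6: Using verbs of cause and effect correctly, part 1: cause → effect
No. 7: Using verbs of cause and effect, part 2: effect → cause
No. 8: Beware of overusing and misusing the passive voice
No. 9: Avoiding top-heavy English sentences
No. 10: How to phrase modifiers placed before a noun
No. 11: Using the passive voice effectively
No. 12: Using the appositive conjunction that
No. 13: English expressions for “technology” (技術)
No. 14: English expressions for “especially” (特別に)
No. 15: Using the possessive apostrophe + s (’s)
No. 16: Expressions for “that is” and “in other words”
No. 17: Expressing dimensions and weights
No. 18: Using the preposition of: Part 1
No. 19: Using the preposition of: Part 2
No. 20: English expressions for objects and substances
No. 21: Preferring one-word verbs to phrasal verbs
No. 22: Infinitives and gerunds: Part 1
No. 23: Choosing between infinitives and gerunds: Part 2
No. 24: Expressing reasons
No. 25: Generic expressions (including the use of a and the)
No. 26: English expressions for “research and development” (研究開発)
No. 27: Are values between 0 and 1 singular or plural?
No. 28: Tense: using present-tense verbs
No. 29: Using conjunctive adverbs such as then, however, therefore, and for example
No. 30: Commonly misused using and based on: participial constructions
No. 31: Expressing ratios and proportions (ratio, rate, proportion, percent, percentage)
No. 32: How to Write Papers in English: compilation
No. 33: Quality Review Issue No. 23: the tense of report and show
No. 34: Quality Review Issue No. 24: how to cite Japanese-language papers in a reference list
No. 35: Quality Review Issue No. 25: common mistakes when spelling out abbreviations
No. 36: Quality Review Issue No. 26: whether to put a space before % and ℃
No. 37: Quality Review Issue No. 27: should the article be repeated when nouns of the same kind appear in sequence?
No. 38: Quality Review Issue No. 22: adverb usage that Japanese writers especially tend to get wrong
No. 39: Quality Review Issue No. 21: differences among expressions such as previous, preceding, and earlier
No. 40: Quality Review Issue No. 20: the difference between using XX and by XX
No. 41: Quality Review Issue No. 19: choosing among verbs such as increase, rise, and surge
No. 42: Quality Review Issue No. 18: using the passive voice in papers
No. 43: Quality Review Issue No. 17: what is the difference between compared with and compared to?
No. 44: Are the prepositions needed in reported about and approach to?
No. 45: Choosing among think, propose, suggest, consider, and believe
No. 46: Quality Review Issue No. 14: Problematic prepositions in scientific writing: by, through, and with
No. 47: Quality Review Issue No. 13: modifying a noun from before versus from after
No. 48: Quality Review Issue No. 13: singular they
No. 49: Quality Review Issue No. 12: subtle differences in nuance among study, investigation, and research
No. 50: Is there a difference in usage between since and because?
No. 51: Choosing between Figure 1 and Fig. 1
No. 52: Present tense or past tense when a sentence includes an equation?
No. 53: Quality Review Issue No. 8: the difference between by 2020 and up to 2020
No. 54: Quality Review Issue No. 7: high-accuracy data or high accurate data? Using hyphens in compound adjectives
No. 55: Experimental design
No. 56: References
No. 57: Data analysis
No. 58: Emphasis
No. 59: Writing papers for collaborative research
No. 60: Abbreviations in papers
No. 61: Choosing the right article
No. 62: Capitalization
No. 63: Choosing the right dash
No. 64: The difficulty of word choice in English
No. 65: Past tense and active voice
No. 66: The “curse of knowledge”
No. 67: Citing the literature, part 1
No. 68: Citing the literature, part 2
No. 69: Preparing figures and tables for journals
No. 70: Drawing conclusions: the difference between the Abstract and the Conclusions
No. 71: Research ethics, part 1: research design and data reporting
No. 72: Research ethics, part 2: don’t waste the reader’s time
No. 73: Entering symbols and special characters
No. 74: Be careful with linear regression
No. 75: Avoiding plagiarism
No. 76: Considering the implications of your research results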