[How to Write an English Paper] Part 84: "Managing your study data and the supporting documentation (Part 3): Data validation and custom-developed software"

May 31, 2022, 17:38

Part 83 covered "Managing your study data and the supporting documentation (Part 2): Required project files, folder (directory) structure, and deleting data".

The theme of Part 84 (this installment) is
"Managing your study data and the supporting documentation (Part 3): Data validation and custom-developed software".

In Part 2 of this series, we discussed how to set up and organize your project folders (directories), and the precautions to take to protect important files and detect data-entry errors.

Before you develop any computer program, it is important to remember a famous warning from this field: "garbage in, garbage out", that is, if you enter flawed data, you will get nothing but flawed answers.
However carefully you handle your data, it is meaningless if the recorded data are wrong.

In this Part 3, we consider how to validate and process your data.

We hope you enjoy the article.


Managing your study data and the supporting documentation. Part 3: Data validation and custom-developed software

 
By Geoffrey Hart
 
In parts 1 and 2 of this series, I described how to set up an efficient project folder (directory) structure, the files and subfolders it should contain, and the precautions you should take to protect key files and detect data-entry errors. In this part, I will discuss some considerations about how to validate and process your data.
 
Note: This series of articles is based on the following book, but with the details modified for non-psychological research: Berenson, K.R. 2018. Managing Your Research Data and Documentation. American Psychological Association. 105 p. including index. (https://www.apa.org/pubs/books/4313048)
 
In Berenson’s wording, “Command” files are a type of program created to input, clean, validate, and process data. This name is based on the terminology used by the SPSS statistical software to describe a sequence of commands that will implement a series of analyses, such as calculating the means for two groups or treatments. A more general name would be “software scripts”, since that would include programs written using the R software to analyze a dataset, scripts that control the operation of a program (e.g., to set parameters for processing a specific type of data such as waveforms), and protocols or checklists that must be followed during data processing. These files also include data-entry forms that record survey responses and any software that you develop to perform specific analyses that aren’t available as a standard feature of other software.
 

Controlling data entry

Before you begin developing any computer programs, it’s important to remember a famous warning from computer science: “garbage in, garbage out”. This refers to the importance of ensuring that the data you enter is valid and ready for analysis. It doesn’t matter how carefully you obtain your data if the data that you record is erroneous. Ideally, data-entry software should use tools such as pick lists (a menu that provides a restricted list of choices so that you can only choose among the valid options), radio buttons (which let you choose only one option from a series of options), and autofill (the same technology that your smartphone uses to complete words as you type them) to ensure that you enter data correctly. For example, instead of typing a complex treatment name manually (thus creating a risk of typing errors), you can type all the valid names only once in your data-entry software and edit those names to ensure they are correct. Subsequently, anyone who uses the software to enter data can select the correct name from the carefully edited list, thereby eliminating name errors.
 
Note: Of course, it’s still possible to choose the wrong item from a list of options, so someone must still check the entered data. The sooner you do this after data entry, the easier it will be to detect and fix errors.
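To make the pick-list idea concrete, here is a minimal sketch in R (the statistical language mentioned elsewhere in this article); the treatment names and the enter_treatment() helper are hypothetical examples, not part of any specific data-entry package:

    # Restrict data entry to a pre-edited list of valid treatment names.
    valid_treatments <- c("control", "low_nitrogen", "high_nitrogen")  # hypothetical names

    enter_treatment <- function(name) {
      # Reject anything that is not in the valid list (e.g., a typo).
      if (!name %in% valid_treatments) {
        stop("Invalid treatment name: ", name)
      }
      name
    }

    enter_treatment("low_nitrogen")   # accepted
    # enter_treatment("low_nitroen")  # rejected at entry time with an error

This mimics a pick list in spirit: the valid names are typed and edited once, and every later entry must match one of them exactly.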
 
Grouping all your data from one experiment or treatment before you enter it into your software reduces the risk of another type of error. For example, first enter all the control data in a single file and include "control" in the file name so that it’s clear the file only contains data from the control treatment. Since you are not entering any non-control data in this file, this eliminates the need to type or select the word “control” and you can be confident that every line of data is correctly labeled as coming from the control treatment. Create additional files for each group of treatment data, so that each file includes only the results of a single treatment. After you complete your data entry and validation for the control and for each treatment, you can merge the files, if necessary.
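As a sketch of this workflow in R (the file and group names are hypothetical), each treatment lives in its own file, every row inherits its label from the file name rather than being typed by hand, and the files are merged only at the end:

    # One file per treatment; rows are labeled from the file name, not typed line by line.
    files <- c(control    = "control.csv",
               treatment1 = "treatment1.csv",
               treatment2 = "treatment2.csv")

    datasets <- lapply(names(files), function(group) {
      d <- read.csv(files[[group]])
      d$treatment <- group   # every row in this file gets the same, correct label
      d
    })

    all_data <- do.call(rbind, datasets)  # merge only after entry and validation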
 
Another “garbage in, garbage out” problem relates to how easy it is to create a programming error that will affect all of your analyses. Always ask an expert to review your program, and validate it carefully using test data with a known outcome. For simple calculations, you may be able to rely on approximations to provide a simple check. For example, you can quickly calculate a total or average in your head if you simplify the calculation by rounding decimal values to the nearest integer or the nearest multiple of 10. If your software provides approximately the same total or average, you can be more confident the program is working correctly. For additional validation, test extreme cases (what programmers call “edge cases”), such as unusually large or small values or a larger dataset created by duplicating 10 examples 100 times and running the software again to confirm that the average value doesn’t change.
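The duplication test described above is easy to automate. Here is a minimal sketch in R, using hypothetical values:

    # The mean must not change when every value is repeated 100 times.
    x     <- c(12, 15, 9, 22, 18, 11, 14, 20, 16, 13)  # 10 hypothetical values
    x_big <- rep(x, times = 100)                        # 1,000 values

    stopifnot(isTRUE(all.equal(mean(x), mean(x_big))))
    # stopifnot() is silent on success and raises an error on failure.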
 
Note: Professional computer programmers have graduated from a 2- to 4-year program of intensive study. If you spent a few hours or days reading the user manual for software such as the R statistical programming language, don’t expect to be as good at programming as someone who spent several years learning both that computer language and how to use it correctly. Always ask an expert to review your programs to increase the likelihood that they’re correct.
 
Frequency tables are also useful for validating data. For example, ensure that the total frequency for all data categories combined equals the total sample size (i.e., to ensure that you didn’t accidentally add or delete some of the data). If you know, for example, that the values of a given variable must fall within a certain range, any value outside that range is clearly an error. For example, if you use numbers from 1 to 12 to encode the months of the year, any value <1 or >12 that has a frequency of 1 or more is an error. If you correct any datapoints that you detect this way, clearly document this correction so that a future researcher can repeat your analysis and obtain the same result, or can change the data in a different way if they disagree with your reasons.
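As a sketch of both checks in R, using the month-code example with deliberately erroneous hypothetical values:

    month_code <- c(1, 2, 2, 3, 13, 7, 12, 0, 5)  # 0 and 13 are errors

    freq <- table(month_code)                      # frequency table
    stopifnot(sum(freq) == length(month_code))     # no data added or lost

    which(month_code < 1 | month_code > 12)        # positions of out-of-range values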
 

Dealing with missing data

Note: When you detect what seem to be errors in your data, never guess at the true value unless you have no alternative. It’s better to mark data as “missing” than to choose an incorrect value that will bias the results of your analyses. If it’s truly necessary to fill in missing data, carefully document what you did so that if you have to return to your original data to generate a new working copy, you’ll remember to make the same changes in the new copy.
 
Learn how your statistics software encodes missing data. For example, some older software may use a value such as 999. If that number could appear in your data, change the 999 to a different value or add steps in your software to highlight any values of 999 so that you can decide how to deal with this problem. That said, even the best handling of missing data is less effective than preventing the omission in the first place. For field research, ensure that all data fields have been filled in for each group of measurements (e.g., to ensure that you didn’t accidentally forget a measurement). If you record data manually using a data-entry program, review all of your choices and data before you save the data and move on to the next measurement.
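For example, here is a minimal sketch in R of replacing a hypothetical 999 sentinel with R’s own missing-value code, NA, while recording which rows were affected:

    scores <- c(72, 999, 85, 60, 999, 91)  # hypothetical data; 999 = missing

    which(scores == 999)        # record these positions in your documentation
    scores[scores == 999] <- NA # use R's explicit missing-value code

    mean(scores, na.rm = TRUE)  # later analyses must now handle NA deliberately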
 
Sometimes it's necessary to replace missing data with a plausible estimate; this is most common when that value must be added to your dataset to permit various calculations. When you have a logical way to do this, it may be acceptable to do so. For example, it’s reasonable to interpolate outdoor temperatures between 9:00 and 10:00 AM to provide an estimated value at 9:30 AM because ambient temperatures tend to change predictably over short periods. In contrast, it would be difficult to justify assigning a value of 50% to a missing test score in a class where students have, in the past, generally received test scores of 60 to 100%. When economic data is missing for one year in a longer study period, it’s common to interpolate between years to generate an estimate of the missing data. Similarly, for spatially distributed environmental data such as vegetation type or elevation, it’s often possible to extrapolate spatial trends. However, in both cases, you must be aware of situations that would prevent such extrapolation. For example, linear extrapolation won’t work for vegetation types near a sharp transition, such as between terrestrial and aquatic vegetation, and won’t work for elevations in an area with many abrupt changes in elevation (e.g., a cliff). In some cases, you will need to use more complex methods such as the mean of a 7-day moving window, linear regression, kriging interpolation, or randomly sampling values from a suitable “prior” distribution. Before you consider any of these methods, discuss the problem you want to solve with a statistician, who can provide expert advice.
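The temperature example can be reproduced with base R’s approx() function (the measured values here are hypothetical):

    # Linear interpolation between measurements at 9:00 and 10:00 AM.
    time_min <- c(0, 60)        # minutes after 9:00 AM
    temp_C   <- c(14.2, 16.8)   # hypothetical measured temperatures

    est <- approx(x = time_min, y = temp_C, xout = 30)  # estimate at 9:30 AM
    est$y  # 15.5 degrees C, the midpoint of the two measurements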
 
Whatever method you choose, it must be objective (i.e., the logic for your choice must be clear and mathematically correct), repeatable, and clearly described to ensure that if another researcher repeats your data analysis, they won’t bias the result by choosing a different method. Thus, develop objective and quantitative criteria for when to estimate missing data and when to choose a different solution.
 

Documenting how your software works

Professional programmers sometimes neglect a point that you should not neglect: document the meaning and assumptions of each line or group of lines in the software you develop. State what the software does, any assumptions that must be correct for that function to work properly, and the basis for those assumptions. This will help you remember what you did, and why, at each step of an algorithm. Perhaps more importantly, it will also help a colleague or team member validate your logic, and will help future researchers update your program to account for improved knowledge or methods that appear in your field of study after you created the original program. If you document your program code directly inside the program, you don’t need to create a separate document that will gradually diverge from the software (human nature means that you are more likely to forget to update a separate document), making it difficult to determine whether the description in the software or the one in the separate document is correct. Better still, update the documentation in the code before you change the code itself; that way, the documentation always remains valid.
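As a minimal sketch of this commenting style in R (the function, units, and the data dictionary mentioned are hypothetical):

    # Convert biomass from g/m2 to kg/ha (multiply by 10).
    # Assumption: the input column is recorded in g/m2, as stated in the
    # instrument's data dictionary; if that unit ever changes, update this factor.
    biomass_kg_ha <- function(biomass_g_m2) {
      biomass_g_m2 * 10
    }

Each group of lines states what it does, the assumption it relies on, and the basis for that assumption, so a reviewer can check the logic without leaving the code.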
 
Naming variables is particularly important. Many programming errors result from using a single name for two or more different meanings or from using two variable names for a single meaning. One solution is to develop a standard method for naming variables, since this will make it easier to remember what a variable does based on its name. This will also make it easier for you to reuse software you created in previous research in a new program. Create a document that clearly explains your naming conventions for variables and use it each time you create a new program or revise an old one. This is particularly important if the instruments you use to obtain a measurement assign cryptic variable names that only a machine could love. In such cases, create a table that lists the machine-assigned names in one column and the human-friendly names you create in a second column. This is particularly important if you are working in one language (e.g., Japanese) but writing papers in another language (e.g., English), since the way you choose variable names will differ between languages. For example, English variable names are generally based on the first letter of each word (e.g., Net Primary Production = NPP).
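Such a two-column table is easy to apply automatically. Here is a sketch in R with hypothetical machine-assigned names:

    # Lookup table: machine-assigned names -> human-friendly names.
    name_map <- c(V01_TMP = "air_temperature",
                  V02_RH  = "relative_humidity",
                  V03_NPP = "net_primary_production")

    d <- data.frame(V01_TMP = 14.2, V02_RH = 0.63, V03_NPP = 812)
    names(d) <- name_map[names(d)]  # rename the columns via the lookup table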
 
To avoid incorrect interpretation of your data, choose variable names that clearly reflect the meaning of the data. For example, some variables are based on binary logic in which 1 = yes and 0 = no. On this basis, your variable name should include the word “treated” if 1 = treated and “untreated” if 0 = not treated. If you use a variable named “response strength” with values ranging from 1 to 10, use 1 (a low number) for a low strength and 10 (a high number) for a high strength. Similarly, if you use a variable named “risk reduction”, a high value such as 10 should mean a large reduction in risk; don’t let 1 mean low risk, since the variable describes the reduction, not the risk itself. This may seem obvious, but I’ve corrected serious errors in many of the papers I edit that occurred when authors forgot the meaning of their variable names and reached an incorrect conclusion based on that misunderstanding. If you accidentally chose a confusing variable name, choose a clearer name and recode your data so that it agrees with that new name. Name these variables systematically (e.g., add “recoded” to the new variable name) so that when you review your data, it will be clear whether you’re using the raw data or the recoded data.
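Here is a minimal sketch in R of such a recoding, using the 1-to-10 response-strength example (the data and variable names are hypothetical):

    # Original scale ran the "wrong" way: 1 = strong, 10 = weak (confusing).
    d <- data.frame(response_strength = c(1, 4, 10, 7))

    # Reverse the scale so that larger values mean stronger responses, and
    # add "recoded" to the name so raw and recoded data can't be confused.
    d$response_strength_recoded <- 11 - d$response_strength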
 
In Part 4 of this article, I’ll discuss how to create and document your replication files (the files you will make available to future researchers who want to replicate your study or who want to use your data in meta-analyses).


