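[How to Write English Papers] No. 83: “Managing your study data and the supporting documentation (Part 2): Creating the necessary project files and folder (directory) structure and cleaning your data”

February 28, 2022, 17:50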

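No. 82 covered “Managing your study data and the supporting documentation (Part 1): Why rigorously documenting your research matters”.

The theme of No. 83 (this installment) is
“Managing your study data and the supporting documentation (Part 2): Creating the necessary project files and folder (directory) structure
and cleaning your data”.

Part 1 of this article discussed
the importance of rigorously managing your research data
and how to archive that data.

Project files contain all of the data that the researcher collected,
and documenting how the data were collected and analyzed
conveys everything that future researchers will need to know.

Part 2 explains
how to create and organize your project files
in a little more detail.

We hope you find it useful.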

 

Managing your study data and the supporting documentation. Part 2: Creating the necessary project files and folder (directory) structure and cleaning your data

By Geoffrey Hart

In part 1 of this series of articles, I discussed the importance of rigorously documenting your research data and methods to accompany the archived data. In this part, I'll provide more details on how to create and organize your project files.
 
Note: This series of articles is based on the following book, but with the details modified for non-psychological research: Berenson, K.R. 2018. Managing Your Research Data and Documentation. American Psychological Association. 105 p. including index. (https://www.apa.org/pubs/books/4313048)
 
Project files include all of the data you collected, descriptions of the methods you used to collect and analyze that data, and any explanations of things that future researchers will need to know before they can understand what you did and how they can build on your research.
 

Create a logical project structure on your computer

 
Every research project should have its own folder (directory) on your computer. That folder and all the subordinate folders it contains should be named sufficiently clearly to distinguish them from the many similar projects that you will manage during a given study and later in your research career. If you are early in your career and are only conducting one or two studies simultaneously, the folder name may be as simple as your name, a key word related to the subject, and the year. For example, if you store all of your research in a folder named “Research”, your current project might be named “Hart 2020 drought stress field study”. If you’re part of a research group, you may need to include the names of the principal investigators or to use a project structure defined by your employer.
 
Note: If your research group comprises people from multiple institutions, create a document that lists complete names and contact information for all investigators, with the contents valid at the time of the study. If possible, try to update this document periodically to include current contact information.
 
Avoid nicknames and abbreviations that only your research group will understand. After a few years, these shortcuts become problems because researchers may have moved to another institution or retired, and eventually nobody will remember the meaning of these shortcuts or why you created them. If it’s necessary to use a complex project naming system that only bureaucrats could love, create a document named “Explanation of project folder names” that provides the necessary explanations.
 
How you name the folders for each project depends on the nature of your research and how you approach the design and subsequent management of your studies. Berenson (2018) recommends creating the following subfolders:
  • Project files: All of the “paperwork” associated with a project (whether scans of paper documents or electronic copies), such as funding applications and approvals by the Institutional Review Board (i.e., the committee responsible for approving research). Also include documents that describe the requirements you must fulfill to satisfy the conditions attached to funding (e.g., how to cite the funding agency and grant number in the Acknowledgments section of a journal paper). Clearly list all the “deliverables” that you promised to provide in your funding proposal. It’s acceptable to do more work than you promised, but rarely acceptable to perform less.
  • Data files: All of your original data, formatted as “read only” so that it can’t be accidentally changed. Most computer operating systems let you apply this format directly from the file management system (e.g., the File Explorer in Windows, the Macintosh Finder).
 
Creating read-only files: To protect files against accidental modification, change their format to “read only”:
  • Macintosh: Select the file in a Finder window, and then press Command+I to display the file information dialog box. Scroll to the heading “Sharing & Permissions”. For the “everyone” settings, select “Read Only” under the heading “Privilege”.
  • Windows: Using the Windows File Explorer, right-click the file and select “Properties”. Select the checkbox beside “Read-only”.
  • Although these file protections are helpful, they are not a substitute for a rigorous backup and archiving strategy. Your employer’s computer staff should implement such a strategy for you, but to be safe, create your own backups too. For some thoughts on how to do this, see my article "Backing up your data… and other important things".
 
  • Working files: All files that represent your work in progress, such as “cleaned” data files (files from which outliers and erroneous data have been removed). The original data files must not appear in this folder.
  • Command files: This includes all data-processing scripts (e.g., for the R statistical software) and the code for any software you developed to analyze your data. (I’ll discuss command files and other software in more detail in Part 3 of this article.) If that software evolves during your research, create a version control system so that you can retain old versions of the software in case you need to return to an older version. This system doesn't have to be complex; it's sufficient to simply add the revision date to the file name. For example, if you wrote a program to normalize your data, create a folder with the name “Normalization software”, and use filenames such as “October-2021-version” for each version.
  • Replication files: Files that you can provide to someone who wants to replicate your analysis, using either your data or their own data. Depending on the nature of your research, some of the information may be confidential and should be excluded from the replication files.
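
As a rough illustration, the subfolders listed above could be created with a short script. This is only a sketch in Python: the function name `create_project_folders` is hypothetical, and the base folder and project name simply reuse the examples given earlier in this article.

```python
from pathlib import Path

# Subfolder names recommended by Berenson (2018), as listed above.
SUBFOLDERS = [
    "Project files",
    "Data files",
    "Working files",
    "Command files",
    "Replication files",
]


def create_project_folders(base: Path, project_name: str) -> Path:
    """Create the project folder and its subfolders; return the project path."""
    project = base / project_name
    for name in SUBFOLDERS:
        # exist_ok=True makes it safe to run the function more than once.
        (project / name).mkdir(parents=True, exist_ok=True)
    return project
```

For example, `create_project_folders(Path("Research"), "Hart 2020 drought stress field study")` would create the five subfolders inside that project folder. If you prefer a different set of folder names, change the `SUBFOLDERS` list.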
 
An alternative project folder structure might be folders named “original data”, “cleaned data”, “method documentation”, “software”, and “administrative paperwork”. In this series of articles, I’ll use Berenson’s suggested names so that if you consult her book, you can more easily find details that I don’t have room to discuss here. However, I’ll discuss her categories in an order that seems more logical to me, as it more closely follows the order in which you’ll probably perform your research. Most importantly, choose a folder structure that makes sense to you and your colleagues, since you’re the ones who will use it.
 
Whatever names and structure you choose, create a standard nomenclature for all files in a given category (e.g., methods) and document that nomenclature so that a year or more after you developed it, when you begin your next project, you can create names that are consistent with that older system of names and equally easy to understand. For example, your data files could be named using the format Site–Date–Treatment or Material–Trial number–Data type (where "data type" is replaced by words such as "video" or "chemical analyses"). This will make it much easier to organize and find files. Avoid cryptic coded names, even if you document the meanings of those names in a separate document. The ideal system should be easy to understand without referring to this document. Modern computers allow long file names, and carefully choosing folder names and file names within those folders reduces the risk of misinterpreting a name and assigning it to the wrong date or treatment.
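
One way to keep names consistent is to generate them rather than type them. The following Python sketch builds a name in the Site–Date–Treatment format mentioned above. The function name, the “--” separator, the ISO date format (year-month-day, which sorts chronologically), and the “.csv” extension are all assumptions chosen for illustration; substitute your own conventions.

```python
from datetime import date


def data_file_name(site: str, collected: date, treatment: str,
                   extension: str = "csv") -> str:
    """Build a file name in the Site--Date--Treatment format."""
    site = site.strip()
    treatment = treatment.strip()
    if not site or not treatment:
        raise ValueError("site and treatment must not be empty")
    # isoformat() gives YYYY-MM-DD, so names sort in date order.
    return f"{site}--{collected.isoformat()}--{treatment}.{extension}"
```

For example, `data_file_name("North plot", date(2021, 5, 14), "drought")` returns `North plot--2021-05-14--drought.csv` (the site and treatment names here are invented).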
 
Note: Don’t rely on computer time stamps (i.e., the date your computer assigns when you modify a file) to create versions of a file. If you open a file and save it, the time-stamp date changes. Instead, explicitly add the date. For example, “Cleaned data--2 March 2021--outliers removed”.
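
A small script can add the date for you when you save a new version. The Python sketch below copies a file to a new name that includes an explicit date and a short description; the function name `save_dated_version` and the exact name layout are hypothetical, not a prescribed format.

```python
import shutil
from datetime import date
from pathlib import Path


def save_dated_version(source: Path, label: str, when: date) -> Path:
    """Copy `source` to a new file whose name carries an explicit date and label."""
    new_name = f"{source.stem}--{when.isoformat()}--{label}{source.suffix}"
    target = source.with_name(new_name)
    if target.exists():
        # Refuse to overwrite an existing version.
        raise FileExistsError(f"{target} already exists; choose a different label")
    shutil.copy2(source, target)
    return target
```

Because the date is part of the name, it does not change if someone later opens and saves the file.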
 
The Project folder should also include any documents that guided your research, data-entry forms, and the formal protocols that you designed to support data collection in the field or laboratory. If your research group has developed a standard experimental protocol that defines the sequence for recording, storing, and processing data, store a copy of that protocol in the Project folder. You can subsequently use these documents as checklists to guide you as you begin to analyze your data. Such checklists also provide a convenient visual indicator of what work you’ve completed and what work you must still do.
 
Also include documents that explain any unique characteristics of the data that will affect how it is interpreted. For example, if your data from May and June 2021 were recorded (respectively) before and after a key instrument was recalibrated, or if one day of field data was recorded on a cloudy day with strong winds whereas the rest of the data was recorded on a calm sunny day, record this information. Don’t ignore these differences, since they may affect how you interpret your results. Data recorded under different conditions may still be usable, particularly if it can be combined with data collected under similar conditions on another day, but it may not be safe to combine this data with data collected under very different conditions without warning the reader what you have done.
 
Also document any problems you solved during analysis, such as factors that are likely to lead to missing data, how to deal with missing data and your criteria for including or excluding data. For example, particle physicists often use “five-sigma” (five standard deviations) as their criterion for a statistically significant result. If you believe that some data are outliers because they were recorded under conditions that might affect the results, record the criteria you used to define them as outliers (e.g., more than three standard deviations from the mean).
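
Writing the outlier criterion as code is one way to document it unambiguously. The following Python sketch flags values that lie more than a chosen number of standard deviations from the mean. The function name is hypothetical, and the use of the sample standard deviation is an assumption; record whichever definition you actually used.

```python
from statistics import mean, stdev


def flag_outliers(values: list[float], threshold: float = 3.0) -> list[bool]:
    """Return True for each value more than `threshold` SDs from the mean."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        # All values are identical, so none can be an outlier.
        return [False] * len(values)
    return [abs(v - centre) > threshold * spread for v in values]
```

The function only flags values; deciding whether to exclude a flagged value remains a judgment that you should document.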
 
If you will be working with human subjects, privacy and confidentiality considerations suggest that you should store all information that would identify individuals separately from the data provided by those individuals. For example, if I am study subject 1 in your research, I should be identified only as subject 1 in all data files. A separate password-protected file that contains all information on your subjects would record my full name and contact details under the identifier “subject 1”. (You would need this information if, for example, a clinical trial must be stopped because of serious side-effects and the researchers need to contact everyone affected by that problem.) This separation of subject from real identity is essential in double-blinded research.
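
The separation described above can be sketched in Python as follows. The function name and field names are hypothetical, and the sketch only splits the records. It does not password-protect anything, so the resulting key table must still be stored in a protected location, apart from the data.

```python
def split_identities(rows: list[dict], identifying_fields: list[str]):
    """Split records into a key table and a data table linked by subject ID."""
    key_table, data_table = [], []
    for number, row in enumerate(rows, start=1):
        subject_id = f"subject {number}"
        # Key table: identifying details only (store this separately).
        identity = {field: row[field] for field in identifying_fields}
        key_table.append({"subject_id": subject_id, **identity})
        # Data table: everything else, labelled only by the subject ID.
        data = {k: v for k, v in row.items() if k not in identifying_fields}
        data_table.append({"subject_id": subject_id, **data})
    return key_table, data_table
```

Only the data table would be used in analyses and shared with colleagues.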
 

Data Files

 
I would rename Berenson’s “Data files” folder to “Raw data” or “Original data” to emphasize that all files in this folder represent archival information that should never be changed. Working data belongs in the Working Files folder that I describe in the next section. This distinction is essential; if you damage or lose your original data, you may be unable to restore it, but if you damage or lose a working file, you can recreate it from the original data following the methods you have documented. Before you begin working with your original data files, change their permissions to “read only” (as I described earlier) to prevent accidental modification. Then create a new version of the file named “working copy”, “data ready for cleaning”, or some other appropriate name before you begin to clean or analyze the data.
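
If you have many data files, the two steps above (protect the original, then create a working copy) can be scripted. This is a minimal Python sketch; the function name and the “--working copy” suffix are assumptions. On Windows, Python’s `chmod` can only set or clear the read-only flag, which is all that is needed here.

```python
import shutil
import stat
from pathlib import Path


def protect_and_copy(original: Path, working_dir: Path) -> Path:
    """Mark `original` read-only and return a writable working copy of it."""
    # Remove write permission for the owner, the group, and everyone else.
    mode = original.stat().st_mode
    original.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

    working_dir.mkdir(parents=True, exist_ok=True)
    copy = working_dir / f"{original.stem}--working copy{original.suffix}"
    # copyfile copies the contents only, so the copy is not read-only.
    shutil.copyfile(original, copy)
    return copy
```

As noted earlier, a read-only setting protects against accidents but is not a substitute for backups.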
 
Carefully consider how you will analyze your data so that you can store your original data in logical groups that support that approach. For example, if you performed a longitudinal study, create subfolders with names based on the dates when you obtained the data. If your study is conducted at multiple locations, create subfolders with names based on the locations, such as “Field data” and “Laboratory data”. Within those folders, create subfolders named for the data type (e.g., video records vs. chemical analyses) or the recording method (e.g., recorded using a datalogger vs. manually recorded).
 
Also include the following files in the Data Files folder:
  • Data from other people: For example, in genetics research, retain a copy of the search results from a public database. For any research, retain a copy of the results of your literature search.
  • Metadata descriptions: Metadata is information that describes the nature of the data, such as the keywords you used for a literature search or details that you used to classify human subjects into different experimental groups.
  • Cleaned files: These are the files you will use in your analyses. That is, after you clean your data, store a copy of the cleaned data (set to “read only” for additional protection) before you begin your subsequent analysis. If you damage the cleaned data, you can simply make a new copy of the cleaned file instead of having to repeat the cleaning operation.
 

Working Files

Working files are files that you will continue changing until your analysis is complete. The goal of having a Working Files folder is to ensure that you aren’t working on your original or cleaned data files; if those files are damaged, recovering them may be very time-consuming or even impossible. After each analytical step that requires significant amounts of work, store a copy of the results in your Data Files folder so that you can obtain a new copy of that version of your data if necessary.
 
Note: “Cleaning” files involves removing obvious errors such as typing errors, extreme outliers, and data that you collected under inappropriate conditions (e.g., during a rainstorm for an instrument that requires dry air). I’ll discuss this more in Part 3 of this article.
 
Reviewing your data to detect problems should be done as soon as possible after you collect the data. For example, if you’re working in a laboratory, quickly review your data to ensure that there are no problems; if there are, you may be able to fix the problem (e.g., recalibrate the instrument) and immediately repeat the measurement; it may be impossible to repeat the measurements several hours or days later. This is particularly important during field research at remote locations. It’s sometimes possible to remain in the field for an extra day to repeat your data collection, but it may be impossible to return to a remote site if you only discover a problem after you return home.
 
Most lab or field researchers now record at least some of their data automatically, but some still record information on paper and transfer it to a computer. In such cases, it’s essential that you proofread the computer file created from the paper forms to detect data-entry errors. A common approach is to have one person read the data on paper while the other person checks to ensure the computer file says the same thing. You can automate this process, at least for small datasets, by having two people enter the same data into separate computer files. For example, you could create an Excel file named “Geoff’s input data” and a second file named “Matt’s input data”. You can then use Excel to create a third file that contains the results of subtracting the value in Matt’s file from the corresponding value in Geoff’s file. If the values are identical, the result will be 0. Any other value indicates a data-entry error, such as a missing row of data.
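
The same double-entry check can be done outside Excel. The Python sketch below assumes that both people saved their data as CSV files with the same layout; the function name is hypothetical. Note that it compares the entries as text rather than subtracting them, so it also catches differences in non-numeric columns.

```python
import csv
from itertools import zip_longest
from pathlib import Path


def compare_entries(file_a: Path, file_b: Path) -> list[tuple]:
    """Return (row number, row from A, row from B) for every row that differs."""
    with open(file_a, newline="") as a, open(file_b, newline="") as b:
        rows_a = list(csv.reader(a))
        rows_b = list(csv.reader(b))
    mismatches = []
    # zip_longest pads the shorter file with None, so missing rows are reported.
    pairs = zip_longest(rows_a, rows_b)
    for number, (row_a, row_b) in enumerate(pairs, start=1):
        if row_a != row_b:
            mismatches.append((number, row_a, row_b))
    return mismatches
```

An empty result means the two files agree. If a row is missing from the middle of one file, every later row will be reported as a mismatch, so start by checking the first row reported.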
 
For larger datasets, and particularly for machine-recorded data, use the search function or another tool provided by your spreadsheet or statistical software to highlight values that lie outside the expected or permitted range of values. For example, if you measured the length of an object (e.g., a leaf) to a precision of 0.1 mm, then values <0.1 mm represent errors. That is, any plant part that can be measured must have a length greater than 0.
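
If your data are not in a spreadsheet, a range check takes only a few lines of code. In this Python sketch, the function name is hypothetical and the limits are passed in by the caller; the lower limit of 0.1 mm follows the leaf example above, and you would choose an upper limit that suits your own data.

```python
def out_of_range(values: list[float], minimum: float,
                 maximum: float) -> list[tuple[int, float]]:
    """Return (position, value) for every value outside [minimum, maximum]."""
    # Positions start at 1 so they match row numbers in a data file
    # without a header. A NaN fails the comparison, so it is flagged too.
    return [
        (position, value)
        for position, value in enumerate(values, start=1)
        if not (minimum <= value <= maximum)
    ]
```

For example, with a lower limit of 0.1 and an invented upper limit of 300.0, the list `[12.3, 0.0, 45.1, -2.0, 512.0]` yields the second, fourth, and fifth values as suspect.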
 
In Part 3 of this article, I’ll discuss data validation and some software techniques you can use to process your data.
 



