catcheR_load - Data loading
===========================

This step loads single-cell data generated by either ``catcheR_10Xcatch`` or ``catcheR_scicatch``, after GTF-based annotation. The data are imported into a **Monocle** object, annotated with the experimental design, and then normalized and clustered.

Preparation
-----------

Before running ``catcheR_load``, prepare the following in a new working folder:

#. Copy the count matrix annotated with gene names from the previous step  
   (e.g., ``filtered_annotated_silencing_matrix_complete_all_samples.csv``)

#. Copy the file ``rc_barcodes_genes.csv`` 

#. Create a newline-separated plain text file listing the **control genes**  
   (e.g., SCR, B2M)

#. Create a newline-separated plain text file listing the **control samples** (if any)  
   (e.g., 1, 3). These sample names should match those used by ``aggr``  
   (see the input CSV file used for ``aggr``)

#. Create a newline-separated plain text file listing the **sample replicate labels**  
   (e.g., batch1, batch1, batch2, batch2), with one label per sample.  
   The order must match the sample order in the input matrix exactly.  
   This file is required for downstream **batch-aware analyses**.  
   If your dataset includes multiple experiments or processing batches, batch correction is recommended.

#. Create a **CSV file** (required) listing each sample along with its **annotation name**.  
   The annotation name is used in plots instead of the sample number.  
   An example file is available on `GitHub <https://github.com/alessandro-bertero/catcheR/blob/dev/input_examples/samples.csv>`_.

#. *(Optional)* Create a newline-separated plain text file listing **genes of interest**  
   whose expression will be visualized on the UMAP.
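
The newline-separated text files listed above can be written directly from R. The following is a minimal sketch, not part of ``catcheR`` itself: the helper ``write_list`` and the gene names ``GENE_A`` and ``GENE_B`` are hypothetical, the file names are borrowed from the example call in this page, and ``tempdir()`` stands in for your real working folder. The ``samples.csv`` file is not generated here because its layout is defined by the example on GitHub.

.. code-block:: r

   # Hypothetical helper: write one value per line to a plain text file.
   write_list <- function(values, path) {
     writeLines(as.character(values), con = path)
     invisible(path)
   }

   # Stand-in working folder; replace with your own path.
   work_dir <- file.path(tempdir(), "catcheR_load_inputs")
   dir.create(work_dir, showWarnings = FALSE, recursive = TRUE)

   control_genes     <- c("SCR", "B2M")
   control_samples   <- c(1, 3)                    # must match the names used by aggr
   replicates        <- c("batch1", "batch1", "batch2", "batch2")
   genes_of_interest <- c("GENE_A", "GENE_B")      # hypothetical placeholders

   write_list(control_genes,     file.path(work_dir, "controls.txt"))
   write_list(control_samples,   file.path(work_dir, "noTET.txt"))
   write_list(replicates,        file.path(work_dir, "replicates.txt"))
   write_list(genes_of_interest, file.path(work_dir, "genelist.txt"))

   # Read one file back to confirm that it holds one entry per line.
   readLines(file.path(work_dir, "replicates.txt"))

Keeping the replicate labels in a single vector makes it easier to check that their order follows the sample order of the input matrix before the file is written.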

Running ``catcheR_load``
------------------------

.. code-block:: r

   catcheR_load(
     group = "docker",
     folder, 
     expression.matrix,
     control_genes,
     control_samples = NULL,
     replicates = NULL,
     sample_names,
     resolution = 8e-4,
     genes = NULL
   )

**Example usage:**

.. code-block:: r

   catcheR_load(
     group = "docker",
     folder = "/path/to/working/folder/", 
     expression.matrix = "annotated_silencing_matrix_complete_all_samples.csv",
     control_genes = "controls.txt",
     control_samples = "noTET.txt",
     replicates = "replicates.txt",
     sample_names = "samples.csv",
     resolution = 8e-4,
     genes = "genelist.txt"
   )

The ``resolution`` argument sets the resolution parameter used by Monocle’s ``cluster_cells`` function. Higher values generally produce more, smaller clusters.
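
Before launching the call, it can help to confirm that every input file is present in the working folder. The sketch below assumes, based on the example call, that file names are resolved relative to ``folder``; the helper ``missing_inputs`` is hypothetical and not part of ``catcheR``, and a temporary folder is used only for demonstration.

.. code-block:: r

   # Hypothetical helper: return the expected input files that are absent
   # from the working folder.
   missing_inputs <- function(folder, files) {
     paths <- file.path(folder, files)
     files[!file.exists(paths)]
   }

   # Demonstration with a temporary folder that contains only one input.
   demo_folder <- file.path(tempdir(), "catcheR_check")
   dir.create(demo_folder, showWarnings = FALSE, recursive = TRUE)
   file.create(file.path(demo_folder, "controls.txt"))

   missing <- missing_inputs(demo_folder, c("controls.txt", "samples.csv"))
   missing

In real use, pass your working folder together with the file names given to ``catcheR_load``, and proceed only when the returned vector is empty.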

Outputs
-------

Running ``catcheR_load`` produces the following outputs:

#. ``expression_data.csv`` and ``cell_metadata.csv``  
   These can be used to create a Monocle Cell Data Set (CDS) and are also bundled in ``starting_cds.RData``, the ready-to-load R object.

#. ``UMAP.pdf``  
   Dimensionality reduction UMAP plot.
   
   .. image:: UMAP.pdf

#. ``UMAP_gene_expression.pdf``  
   Gene expression overlay on UMAP using genes from the ``genes`` argument.
   
   .. image:: UMAP_gene_expression.pdf

#. ``UMAP_clustering.pdf``  
   Clustering result visualized on UMAP at the specified resolution.
   
   .. image:: UMAP_clustering.pdf

#. ``processed_cds.RData``  
   The Monocle CDS after normalization, dimensionality reduction, clustering, and trajectory inference.
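
The ``.RData`` outputs can be restored in a later R session with base R's ``load``. Since the names of the objects stored inside these files are not listed here, the sketch below loads a file into its own environment and returns its contents as a named list, so the names can be inspected first. The helper ``load_rdata`` is hypothetical, and a small stand-in file is used for the demonstration.

.. code-block:: r

   # Hypothetical helper: load an .RData file into a private environment and
   # return its objects as a named list, leaving the global workspace untouched.
   load_rdata <- function(path) {
     env <- new.env()
     loaded <- load(path, envir = env)   # load() returns the object names
     mget(loaded, envir = env)
   }

   # Demonstration with a stand-in file; in practice point this at
   # "processed_cds.RData" or "starting_cds.RData".
   demo_path <- file.path(tempdir(), "demo.RData")
   demo_object <- data.frame(cell = c("c1", "c2"), cluster = c(1, 2))
   save(demo_object, file = demo_path)

   objects <- load_rdata(demo_path)
   names(objects)

Loading into a separate environment avoids silently overwriting variables that already exist in the session.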

Compatibility
-------------

At the end of this step, your data will be structured for use with the ``monocle3`` package.

However, the data can also be converted for use with other frameworks, such as **Seurat** or, through a ``SingleCellExperiment`` object, **Scanpy**:

.. code-block:: r

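   library(SeuratWrappers)  # provides as.Seurat() for Monocle 3 CDS objects
   library(Seurat)
   # `cds` is the processed CDS produced by catcheR_load, already loaded in the session
   seurat <- as.Seurat(cds, assay = NULL)
   # A SingleCellExperiment is an intermediate format that can be exported for Scanpy
   scanpy_sce <- as.SingleCellExperiment(seurat)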

.. note::

   For standard **iPS2-seq** perturbation analysis, always continue using the ``CDS`` object generated by ``catcheR``.

Next steps
----------

The following analyses can be performed after this step:

- ``catcheR_pseudotime``
- ``catcheR_modules``
- ``catcheR_enrichment``