Combining a conditional GroupBy and Count in LINQ


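I am struggling to work out a LINQ statement that summarizes my data. I am learning C# by developing a tool to help me clean up duplicate files. I already have a dictionary variable, fileResult, populated with file item information and defined as Dictionary<string, List<string>>. The list items include Path, FileHash and FolderDupFileCount (among others).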

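I have successfully used this LINQ expression to summarize all the distinct FileHash values, assign each a group ID, and count the files that share the same hash.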

var fileMD5Groups = fileResult.GroupBy(x => x.Value.FileHash).Select((x, xid) => 
                    new { x.Key, count = x.Distinct().Count(), id = xid + 1 }).ToDictionary(y => y.Key, z => z);


           // Group by Path and Count the files in this path that have duplicates
           // fileResult contains a field called FileHash
            var folderDuplicateCount =
                from file in fileResult
                group file by file.Value.Path into g
                where g.Count() > 1
                select new { Path = g.Key, FolderDupFileCount = g.Count() };
            // Convert to dictionary
            Dictionary<string, int> dupResults = folderDuplicateCount
                                                 .ToDictionary(x => x.Path, x => x.FolderDupFileCount);


Edit 1: Below is the complete method I am using.

public static bool UpdateFileHashResults(Dictionary<string, FolderItem> folderResult, Dictionary<string, FileItem> fileResult)
{
    var fileMD5Groups = fileResult.GroupBy(x => x.Value.FileHash).Select((x, xid) => new { x.Key, count = x.Distinct().Count(), id = xid + 1 }).ToDictionary(y => y.Key, z => z);

    // Group by Path and Count the files in this path which have the
    // same FileHash that are in other Path's
    // fileResult contains a field called FileHash
    var folderDuplicateCount =
        from file in fileResult
        group file by file.Value.Path into g
        where g.Count() > 1
        select new { Path = g.Key, FolderDupFileCount = g.Count() };
    Dictionary<string, int> dupResults = folderDuplicateCount.ToDictionary(x => x.Path, x => x.FolderDupFileCount);
    foreach (var file in fileResult.ToList())
    {
        var ik = file.Key;
        var ivMD5Hash = file.Value.FileHash;
        var fResult = fileResult[ik];
        var ivFileFolder = file.Value.Path;

        fResult.FileHashGroupID = fileMD5Groups[ivMD5Hash].id;
        fResult.FileHashCount = fileMD5Groups[ivMD5Hash].count;
        if (RS.FoldersFound)
        {
            var folResult = folderResult[ivFileFolder];
            fResult.FolderID = folResult.FolderID;
            var dupCount = 0;
            if (dupResults.ContainsKey(ivFileFolder))
            {
                dupCount = dupResults[ivFileFolder];
            }

            fResult.FolderDupFileCount = dupCount;
            folResult.FolderDupFileCount = dupCount;
        }
    }
    return true;
}

Now, var fileResult = fileListing.FindFiles(fileList) is the first assignment, using this interface:

public interface IFileListing
{
    Dictionary<string, FileItem> FindFiles(IEnumerable<string> files);
}

The folder results come from var folderResult = FolderListing.FindFolders(folderPaths); which uses the interface below.

public interface IFolderListing
{
    Dictionary<string, FolderItem> FindFolders(IEnumerable<string> folders);
}

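Desired result: I am trying to get results grouped by Path, counting the files in each folder that have the same FileHash as a file in another path. So if a path has 10 files, and 2 of them have the same hash as files in another path, the .FolderDupFileCount result for this path should be 2.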
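To make the desired result concrete, here is a minimal sketch on made-up data. SampleFile and CountCrossFolderDuplicates are hypothetical names used only for illustration; they stand in for the real FileItem type and are not part of the tool. The sketch reads "duplicate" strictly, as a hash that occurs in at least two different folders.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for FileItem; only the two fields needed here.
public class SampleFile
{
    public string Path { get; }
    public string FileHash { get; }

    public SampleFile(string path, string fileHash)
    {
        Path = path;
        FileHash = fileHash;
    }
}

public static class DesiredResultSketch
{
    // For each folder, count the files whose hash also occurs in a different folder.
    public static Dictionary<string, int> CountCrossFolderDuplicates(IEnumerable<SampleFile> files)
    {
        List<SampleFile> list = files.ToList();

        // Hash -> number of distinct folders in which that hash appears.
        Dictionary<string, int> foldersPerHash = list
            .GroupBy(f => f.FileHash)
            .ToDictionary(g => g.Key, g => g.Select(f => f.Path).Distinct().Count());

        // Keep only files whose hash spans more than one folder, then count per folder.
        return list
            .Where(f => foldersPerHash[f.FileHash] > 1)
            .GroupBy(f => f.Path)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

With files (C:\A, h1), (C:\A, h2), (C:\A, h3), (C:\B, h1), (C:\B, h2) and (C:\C, h4), this would give 2 for C:\A and 2 for C:\B, with no entry for C:\C. Two identical files inside the same folder are not counted under this reading unless the hash also shows up in another folder.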



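After learning more about LINQ and a lot more trial and error, I found a working solution. Thanks to NetMage, whose questions and comments helped me think the problem through. I also changed my lambda names as suggested, though I am not sure they are fully consistent.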



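Now the modified dupResults gives the correct result. I also added two extra computed fields, DupFilesHash = string.Concat(frg.Select(fvg => fvg.FileHash)) and id = frgId + 1. These fields are helpers for updating fields in my dictionaries and are assigned only under certain conditions. DupFilesHash is the concatenation of the hashes of the files in this path that have duplicates in other paths. That concatenated string is then hashed again to give a unique fingerprint representing this set of duplicates, which can be used to locate and match the same duplicates elsewhere.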
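The fingerprint step can be sketched as follows. Md5OfString is a hypothetical stand-in for HashTool.MD5StringHash, whose implementation is not shown in this post, and the sorting step is an assumption added here rather than something the method below does: without it, two folders holding the same duplicates in a different enumeration order would get different fingerprints.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class FingerprintSketch
{
    // Hypothetical stand-in for HashTool.MD5StringHash (not shown in the post).
    public static string Md5OfString(string input)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
            // Format the 16 digest bytes as 32 lowercase hex characters.
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Concatenate the file hashes and hash the result again.
    public static string FolderFingerprint(IEnumerable<string> fileHashes)
    {
        // Assumption: sort first so the fingerprint does not depend on enumeration order.
        string combined = string.Concat(fileHashes.OrderBy(h => h, StringComparer.Ordinal));
        return Md5OfString(combined);
    }
}
```

Under this sketch, FolderFingerprint(new[] { "b", "a" }) and FolderFingerprint(new[] { "a", "b" }) return the same string.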

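The biggest thing I could not figure out was that, after the first .GroupBy(frg => frg.Path), I seemed unable to access the other value fields. Then I came across an example showing frg.Select(fvg => fvg.FileHash), the light came on, and I learned something new.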
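What made this click can be shown in isolation. Each group produced by GroupBy is an IGrouping<TKey, TElement>: it exposes Key directly, and it is also an IEnumerable<TElement>, so the remaining fields are reached by enumerating the group. The sketch below uses a hypothetical FileEntry type and HashesByFolder helper purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for FileItem; only the two fields needed here.
public class FileEntry
{
    public string Path { get; }
    public string FileHash { get; }

    public FileEntry(string path, string fileHash)
    {
        Path = path;
        FileHash = fileHash;
    }
}

public static class GroupingSketch
{
    // Folder path -> hashes of the files in that folder.
    public static Dictionary<string, List<string>> HashesByFolder(IEnumerable<FileEntry> files)
    {
        return files
            .GroupBy(f => f.Path)
            .ToDictionary(
                g => g.Key,                                // the grouping key (Path)
                g => g.Select(f => f.FileHash).ToList());  // other fields come from the group's elements
    }
}
```

Writing g.FileHash does not compile because the group itself only has Key; g.Select(f => f.FileHash) works because it projects over the elements inside the group.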

public static bool UpdateFileHashResults(Dictionary<string, FolderItem> folderResult, Dictionary<string, FileItem> fileResult)
{
    // List of file hashes with a count of files with identical hashes
    var fileMD5Groups = fileResult.FileItemDictionaryToList()
        .GroupBy(kvg => kvg.FileHash)
        .Select((kvg, kvgId) => new { kvg.Key,
            count = kvg.Distinct().Count(),
            id = kvgId + 1 })
        .ToDictionary(krg => krg.Key, kvg => kvg);

    // List of all folders and a count of the number of files in this folder
    // that have the same file hash in another folder(s)
    var dupResults = fileResult.FileItemDictionaryToList()
        .Where(frg => fileMD5Groups[frg.FileHash].count > 1)
        .GroupBy(frg => frg.Path)
        .Select((frg, frgId) => new { Path = frg.Key,
            NumberOfFilesWithDuplicates = frg.Count(),
            DupFilesHash = string.Concat(frg.Select(fvg => fvg.FileHash)),
            id = frgId + 1 })
        .ToDictionary(frg => frg.Path, fvg => fvg);

    // Loop over all files and back load values into folder and file results dictionaries
    foreach (KeyValuePair<string, FileItem> file in fileResult.ToList())
    {
        string ik = file.Key;
        string ivMD5Hash = file.Value.FileHash;
        FileItem fResult = fileResult[ik];
        string ivFileFolder = file.Value.Path;
        int fileHashCount = fileMD5Groups[ivMD5Hash].count;

        fResult.FileHashGroupID = fileMD5Groups[ivMD5Hash].id;
        fResult.FileHashCount = fileHashCount;

        if (RS.FoldersFound)
        {
            FolderItem folResult = folderResult[ivFileFolder];
            fResult.FolderID = folResult.FolderID;
            int dupCount = 0;
            int dupID = 0;
            string dupFilesHash = "";
            if (dupResults.ContainsKey(ivFileFolder) && fileHashCount > 1)
            {
                dupCount = dupResults[ivFileFolder].NumberOfFilesWithDuplicates;
                dupID = dupResults[ivFileFolder].id;
                dupFilesHash = dupResults[ivFileFolder].DupFilesHash;
                dupFilesHash = HashTool.MD5StringHash(dupFilesHash);
            }

            fResult.FolderDupFileCount = dupCount;
            folResult.FolderDupFileCount = dupCount;
            fResult.FolderDupFileCountID = dupID;
            folResult.FolderDupFileCountID = dupID;
            fResult.FolderDupFilesHash = dupFilesHash;
        }
    }
    return true;
}


