Text files total 58.7 GB: can they be made any smaller? (Page 2)

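billq
Points: 1327
Post ID: 91751958

#11

2025-06-24 3:12

7z is the limit, no question about it.

Normal mode is enough.
Pushing the compression harder won't shrink it much further,
and an archive like that
takes more CPU and time to decompress.

That is exactly what the idiom "the gain isn't worth the loss" describes.
Don't fixate on size and ignore everything else.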

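studentss (OP)

These text files are going into cold storage.... they won't be decompressed often..... so they have to be small....

2025-06-24 9:36

studentss (OP)

Each text file is about 1 MB and there are about 77,000 of them..... so the archive is still large

2025-06-24 9:50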

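天龍國子民
Points: 5723
Post ID: 91753109

#12

2025-06-24 10:26

7-Zip, of course; it has the best compression ratio.
Text files are already the most compressible kind of file, so your case is probably at the limit too.
Otherwise split it, 1 into 10, about 6 GB per file?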

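studentss (OP)

Each text file is about 1 MB and there are about 77,000 of them..... so the archive is still large

2025-06-24 10:42

studentss (OP)

Splitting doesn't make the total size any smaller

2025-06-24 10:44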

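IQuit
Points: 7527
Post ID: 91753957

#13

2025-06-24 12:40

studentss wrote:
I have a pile of text files that are still 58.7 GB after 7z compression. Is there any other way to losslessly shrink them further?
I can't do that.... batch files are too hard
Original 77.285 GB -> 58.7 GB after compression

>I can't do that.... batch files are too hard

Too bad,
a useful tool and you can't use it.

>Original 77.285 GB -> 58.7 GB after compression

That is a really poor result: the archive is still 76% of the original.
Plain text usually compresses extremely well.

Actual test:

Plain text file size: 20,368,248,475 bytes = 19,424.68 MB = 18.97 GB

Compressed to .7z on 2023/02/20
2023/02/20 BigTextFile_Test_Sample.7z: 209 MB after compression, 1.08% of the original

Extracted and recompressed:
2025/06/24 BigTextFile_Test_Sample_NEW.7z: 209 MB after compression, 1.08% of the original --> no change
2025/06/24 BigTextFile_Test_Sample.rar: 5.15 MB after compression, 0.0266% of the original

Conclusion for this example:

Plain text file size: 20,368,248,475 bytes = 19,424.68 MB = 18.97 GB
WinRAR (5.15 MB after compression) compresses better than 7-Zip (209 MB after compression).

18.97 GB down to 5.15 MB, i.e. 0.0266% of the original, is extremely strong compression.

★★★ So 7-Zip does not beat WinRAR every time. ★★★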
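
For anyone checking the figures in this thread, here is a minimal Python sketch of the arithmetic. It takes "compression ratio" to mean compressed size divided by original size (so smaller is better), which is how the percentages above are used. The helper name `ratio_percent` is made up for illustration, and the inputs are just the rounded sizes quoted above.

```python
def ratio_percent(compressed: float, original: float) -> float:
    """Compressed size as a percentage of the original (both in the same unit)."""
    return compressed / original * 100


MIB = 1024 * 1024
sample_mb = 20_368_248_475 / MIB  # the 18.97 GB test file, expressed in MB

# OP's archive: 77.285 GB -> 58.7 GB
op_ratio = ratio_percent(58.7, 77.285)
# Test file compressed with 7-Zip: 209 MB
sevenzip_ratio = ratio_percent(209, sample_mb)
# Test file compressed with WinRAR: 5.15 MB (rounded figure from the post)
winrar_ratio = ratio_percent(5.15, sample_mb)

print(f"OP archive : {op_ratio:.0f}% of original")
print(f"7-Zip test : {sevenzip_ratio:.2f}% of original")
print(f"WinRAR test: {winrar_ratio:.4f}% of original")
```

Because 5.15 MB is itself a rounded number, the WinRAR line can differ from the quoted 0.0266% in the last digit; presumably the post's figure came from the exact byte count.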

PS:
2023/02/20 compressed to .7z
OS : Windows 7 x64
RAM : 12 GB

My Q8400 machine (Windows 7 x64) died after 14.5 years of use,
so I bought two second-hand PCs.
Windows 11 installed on 2023/11/29

2025/06/24 recompressed to .7z/.rar
OS : Windows 11 x64 23H2
RAM : 48 GB
7-Zip : V24.09
WinRAR : V7.01


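pc8801

It depends on what is in the files. For example, if there are no spaces or line breaks and it is packed solid with ordinary character codes, it is hard to compress.

2025-06-24 13:36

IQuit

To pc8801: True. In this example, WinRAR came out far smaller than 7-Zip, which busts the myth that 7-Zip always produces the smallest archive.

2025-06-24 13:40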

pc8801
Points: 10370
Post ID: 91754234

#14

2025-06-24 13:32

In a case like this, don't bother compressing.

It's just a waste of time and energy.

Buying a large-capacity storage device is the way to go.

studentss (OP)

Ah..... oh no..... ^0^

2025-06-24 17:22

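Gabriellin
Points: 1962
Post ID: 91754258

#15

2025-06-24 13:36

OP, try analyzing your own text files with the program below. As an example, I took a text file of about 1 MB; the analysis is shown below. Actually compressing it with 7-Zip's LZMA at the maximum setting still produced a 245 KB archive. If the content is inherently close to random and non-repeating, for example a store of SHA-1 hashes of random files, it is hard to compress any further, and you may need an algorithm specialized for text. If the data repeats a lot and can be preprocessed, clean it first: strip out whatever is unneeded or duplicated, then compress what remains.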

python .\file_analyzer.py .\NEWS.txt
================================================================================
File analysis report: NEWS.txt
================================================================================
File size: 1,085,460 bytes (1.04 MB)
File type: text file

📊 Entropy analysis:
Entropy: 5.062 / 8.000
Compression potential: medium
Estimated compression: some repetition, moderate room for compression (30-60%)

📝 Character distribution:
Digits: 4.2%
Letters: 68.7%
Chinese: 0.0%
Whitespace: 18.5%
Punctuation: 6.1%
Other: 2.5%

Most common characters/bytes:
1. '' '': 145,882 times (13.4%)
2. 'e': 81,626 times (7.5%)
3. 't': 60,892 times (5.6%)
4. 'o': 58,395 times (5.4%)
5. 'n': 52,346 times (4.8%)

🔄 Repetition analysis:
Total lines: 27,284
Unique lines: 13,853
Duplicate line ratio: 49.2%

Substring repetition rate:
4char: 64.2% (48524/75577)
8char: 41.7% (163620/392819)
16char: 25.3% (192131/760441)
32char: 19.2% (167112/868890)

Most common lines:
1. '\r': 8,531 times
2. '-----\r': 184 times
3. '-------\r': 137 times

🏷️ Data type: English text file

💡 Compression advice:
📦 Highly repetitive data; deduplicate before compressing
Command: sort file.txt | uniq > unique.txt

-----------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Text file compressibility analyzer.
Analyzes a file's entropy, repetition and character distribution
to judge how well it is likely to compress.
"""

import os
import sys
import math
import argparse
from collections import Counter
from pathlib import Path


class FileAnalyzer:
    def __init__(self, filepath):
        self.filepath = Path(filepath)
        self.file_size = 0
        self.data = b''
        self.text_data = ''
        self.is_text = True

    def load_file(self):
        """Load the file into memory (the whole file is read at once)."""
        try:
            self.file_size = self.filepath.stat().st_size
            with open(self.filepath, 'rb') as f:
                self.data = f.read()

            # Try to decode as text: UTF-8 first, then Big5, then GBK
            for encoding in ('utf-8', 'big5', 'gbk'):
                try:
                    self.text_data = self.data.decode(encoding)
                    break
                except UnicodeDecodeError:
                    continue
            else:
                # Nothing decoded cleanly: treat the file as binary
                self.is_text = False
                self.text_data = self.data.decode('utf-8', errors='ignore')

        except Exception as e:
            print(f"Error: cannot read file {self.filepath}: {e}")
            return False
        return True

    def calculate_entropy(self, data=None):
        """Compute the Shannon entropy of the data, in bits per byte."""
        if data is None:
            data = self.data

        if len(data) == 0:
            return 0

        counter = Counter(data)
        total = len(data)
        entropy = -sum((count / total) * math.log2(count / total)
                       for count in counter.values())
        return entropy

    def analyze_character_distribution(self):
        """Analyze the character distribution."""
        if not self.is_text:
            return self.analyze_byte_distribution()

        counter = Counter(self.text_data)
        total_chars = len(self.text_data)

        # Character categories
        categories = {
            'digits': 0,
            'letters': 0,
            'whitespace': 0,
            'punctuation': 0,
            'chinese': 0,
            'others': 0
        }

        for char, count in counter.items():
            if char.isdigit():
                categories['digits'] += count
            elif char.isalpha():
                if ord(char) > 127:  # probably Chinese or another non-ASCII letter
                    categories['chinese'] += count
                else:
                    categories['letters'] += count
            elif char.isspace():
                categories['whitespace'] += count
            elif char in '.,;:!?()[]{}"\'-':
                categories['punctuation'] += count
            else:
                categories['others'] += count

        # Convert to percentages
        for key in categories:
            categories[key] = (categories[key] / total_chars) * 100

        return categories, counter.most_common(10)

    def analyze_byte_distribution(self):
        """Analyze the byte distribution (binary files)."""
        counter = Counter(self.data)
        total_bytes = len(self.data)

        categories = {
            'printable': 0,
            'control': 0,
            'high_ascii': 0
        }

        for byte_val, count in counter.items():
            if 32 <= byte_val <= 126:  # printable ASCII
                categories['printable'] += count
            elif byte_val < 32:  # control characters
                categories['control'] += count
            else:  # DEL and bytes above 127
                categories['high_ascii'] += count

        # Convert to percentages
        for key in categories:
            categories[key] = (categories[key] / total_bytes) * 100

        return categories, counter.most_common(10)

    def analyze_repetition(self):
        """Analyze repetition of lines and substrings."""
        if not self.text_data:
            return {}

        lines = self.text_data.split('\n')
        total_lines = len(lines)
        unique_lines = len(set(lines))
        line_counter = Counter(lines)

        # Count repeated substrings of several lengths
        substrings = {}
        text_len = len(self.text_data)

        for length in [4, 8, 16, 32]:
            if text_len > length:
                substr_counter = Counter()
                for i in range(text_len - length + 1):
                    substr = self.text_data[i:i+length]
                    substr_counter[substr] += 1

                repeated_substrs = sum(1 for count in substr_counter.values() if count > 1)
                total_substrs = len(substr_counter)

                substrings[f'{length}char'] = {
                    'total': total_substrs,
                    'repeated': repeated_substrs,
                    'repetition_rate': (repeated_substrs / total_substrs * 100) if total_substrs > 0 else 0
                }

        return {
            'lines': {
                'total': total_lines,
                'unique': unique_lines,
                'duplicate_rate': ((total_lines - unique_lines) / total_lines * 100) if total_lines > 0 else 0
            },
            'substrings': substrings,
            'most_common_lines': line_counter.most_common(5)
        }

    def estimate_compression_potential(self, entropy):
        """Estimate compression potential from the entropy value."""
        if entropy >= 7.5:
            return "very poor", "near-random data, compresses very poorly (<10%)"
        elif entropy >= 6.5:
            return "poor", "low-repetition data, limited compression (10-30%)"
        elif entropy >= 5.0:
            return "medium", "some repetition, moderate room for compression (30-60%)"
        elif entropy >= 3.0:
            return "good", "high repetition, compresses well (60-80%)"
        else:
            return "excellent", "very high repetition, compresses extremely well (>80%)"

    def classify_data_type(self, entropy, char_dist, repetition):
        """Classify the data type; char_dist is a (categories, most_common) tuple."""
        if not self.is_text:
            return "binary file"

        # Feature-based classification
        if entropy > 7.0:
            if char_dist[0]['digits'] > 50:
                return "random numeric data"
            elif char_dist[0]['letters'] > 60:
                return "random letter data (possibly encrypted or encoded)"
            else:
                return "high-entropy text (near random)"

        elif repetition['lines']['duplicate_rate'] > 50:
            return "highly repetitive text (log or template data)"

        elif char_dist[0]['digits'] > 70:
            return "numeric data file"

        elif char_dist[0]['chinese'] > 30:
            return "Chinese text file"

        elif char_dist[0]['letters'] > 60:
            return "English text file"

        else:
            return "mixed-format text file"

    def generate_report(self):
        """Print the full analysis report."""
        if not self.load_file():
            return

        # An empty file would cause divisions by zero below
        if not self.data:
            print(f"File {self.filepath.name} is empty; nothing to analyze.")
            return

        print("=" * 80)
        print(f"File analysis report: {self.filepath.name}")
        print("=" * 80)

        # Basic information
        print(f"File size: {self.file_size:,} bytes ({self.file_size/1024/1024:.2f} MB)")
        print(f"File type: {'text file' if self.is_text else 'binary file'}")

        # Entropy analysis
        entropy = self.calculate_entropy()
        compression_level, compression_desc = self.estimate_compression_potential(entropy)

        print(f"\n📊 Entropy analysis:")
        print(f" Entropy: {entropy:.3f} / 8.000")
        print(f" Compression potential: {compression_level}")
        print(f" Estimated compression: {compression_desc}")

        # Character distribution
        char_dist, common_chars = self.analyze_character_distribution()
        print(f"\n📝 Character distribution:")

        if self.is_text:
            print(f" Digits: {char_dist['digits']:.1f}%")
            print(f" Letters: {char_dist['letters']:.1f}%")
            print(f" Chinese: {char_dist['chinese']:.1f}%")
            print(f" Whitespace: {char_dist['whitespace']:.1f}%")
            print(f" Punctuation: {char_dist['punctuation']:.1f}%")
            print(f" Other: {char_dist['others']:.1f}%")
        else:
            print(f" Printable: {char_dist['printable']:.1f}%")
            print(f" Control: {char_dist['control']:.1f}%")
            print(f" High ASCII: {char_dist['high_ascii']:.1f}%")

        print(f"\n Most common characters/bytes:")
        for i, (char, count) in enumerate(common_chars[:5], 1):
            if self.is_text:
                display_char = repr(char) if char in '\n\r\t ' else char
                print(f" {i}. '{display_char}': {count:,} times ({count/len(self.text_data)*100:.1f}%)")
            else:
                print(f" {i}. 0x{char:02X}: {count:,} times ({count/len(self.data)*100:.1f}%)")

        # Repetition analysis (text files only); stays empty for binary files
        repetition = {}
        if self.is_text:
            repetition = self.analyze_repetition()
            print(f"\n🔄 Repetition analysis:")
            print(f" Total lines: {repetition['lines']['total']:,}")
            print(f" Unique lines: {repetition['lines']['unique']:,}")
            print(f" Duplicate line ratio: {repetition['lines']['duplicate_rate']:.1f}%")

            print(f"\n Substring repetition rate:")
            for length, data in repetition['substrings'].items():
                print(f" {length}: {data['repetition_rate']:.1f}% ({data['repeated']}/{data['total']})")

            if repetition['most_common_lines']:
                print(f"\n Most common lines:")
                for i, (line, count) in enumerate(repetition['most_common_lines'][:3], 1):
                    preview = line[:50] + "..." if len(line) > 50 else line
                    preview = preview.replace('\n', '\\n').replace('\r', '\\r')
                    print(f" {i}. '{preview}': {count:,} times")

        # Data type classification (reuses the repetition result computed above)
        data_type = self.classify_data_type(entropy, (char_dist, common_chars), repetition)
        print(f"\n🏷️ Data type: {data_type}")

        # Compression advice
        print(f"\n💡 Compression advice:")
        if entropy > 7.0:
            print(" ⚠️ This file is close to random data; any compression algorithm will have limited effect")
            print(" Suggestion: check whether the data is already compressed or encrypted")
        elif entropy > 6.0:
            print(" 📦 Suggested: PAQ, ZPAQ or similar algorithms aimed at low-redundancy data")
            print(" Options: 7z -mx=9 -md=1536m or xz -9e")
        elif repetition.get('lines', {}).get('duplicate_rate', 0) > 30:
            print(" 📦 Highly repetitive data; deduplicate before compressing")
            print(" Command: sort file.txt | uniq > unique.txt")
        else:
            print(" 📦 A standard compressor is enough: 7z -mx=9, bzip2 -9, xz -6")

        print("=" * 80)


def main():
    parser = argparse.ArgumentParser(description='Analyze the compression characteristics of a text file')
    parser.add_argument('filepath', help='path of the file to analyze')
    parser.add_argument('-v', '--verbose', action='store_true', help='show detailed information')

    args = parser.parse_args()

    if not os.path.exists(args.filepath):
        print(f"Error: file '{args.filepath}' does not exist")
        sys.exit(1)

    analyzer = FileAnalyzer(args.filepath)
    analyzer.generate_report()


if __name__ == "__main__":
    main()
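
The point about SHA-1 hashes can be checked with a small experiment. The sketch below is separate from the analyzer above: it uses Python's standard `hashlib` and `lzma` modules, and the helper `lzma_ratio` and both synthetic samples are made up purely for illustration. It compresses 20,000 lines of hex SHA-1 digests and 20,000 identical log-style lines, then compares the results.

```python
import hashlib
import lzma


def lzma_ratio(data: bytes) -> float:
    """Return compressed size / original size using LZMA at the highest preset."""
    return len(lzma.compress(data, preset=9)) / len(data)


# Sample 1: one hex SHA-1 digest per line (hashes of the strings "0".."19999").
hash_lines = "\n".join(
    hashlib.sha1(str(i).encode()).hexdigest() for i in range(20_000)
).encode()

# Sample 2: the same number of lines, all identical.
repeated_lines = ("2025-06-24 INFO nothing happened\n" * 20_000).encode()

hash_ratio = lzma_ratio(hash_lines)
repeat_ratio = lzma_ratio(repeated_lines)

print(f"SHA-1 digests : {hash_ratio:.1%} of original size")
print(f"repeated lines: {repeat_ratio:.1%} of original size")
```

Hex digests carry only 4 bits of information per 8-bit character, so they can shrink to roughly half and no further, while the repeated lines collapse to almost nothing. Running the same function on a few real files shows which of the two cases the data is closer to.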

studentss (OP)

I don't know Python... ^0^

2025-06-24 17:38

studentss (OP)

Is there a VB6 version of this code?... ^0^.. I don't understand Python and can't use it

2025-06-28 20:02

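IQuit
Points: 7527
Post ID: 91754406

#16

2025-06-24 14:01

>6F
>小雞飯: At least update to version 7 or later; it's a small program and the update takes about 10 seconds.
>小雞飯: 7.0 adds support for dictionary sizes over 1 GB, which apparently can improve the compression ratio.

WinRAR V7.01

I have 48 GB of RAM,
and it already gives up with a 256 MB compression dictionary (laughing to tears).

It occasionally reports "Not enough memory for a 256 MB compression dictionary; changed to 128 MB."
That 1 GB dictionary size is not a matter of how much RAM you have;
the requirements are extremely high,
extremely high,
extremely high (laughing to tears).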
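
Why dictionary size matters at all can be shown with a toy case. This is only a sketch using LZMA from Python's standard `lzma` module, not WinRAR, and it says nothing about WinRAR's memory behaviour; the block size and the two dictionary sizes are arbitrary choices for illustration. The input is a 1 MiB incompressible block followed by an exact copy of itself, so the only redundancy sits 1 MiB back.

```python
import hashlib
import lzma

BLOCK = 1 << 20  # 1 MiB

# Deterministic pseudo-random (effectively incompressible) block built from SHA-256 digests.
block = b"".join(
    hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(BLOCK // 32)
)
data = block + block  # the second half repeats the first, 1 MiB later


def xz_size(payload: bytes, dict_size: int) -> int:
    """Compressed size using an LZMA2 filter with the given dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(payload, format=lzma.FORMAT_XZ, filters=filters))


small_dict = xz_size(data, 64 * 1024)        # 64 KiB: cannot reach back 1 MiB
large_dict = xz_size(data, 4 * 1024 * 1024)  # 4 MiB: covers the whole input

print(f"input       : {len(data):,} bytes")
print(f"64 KiB dict : {small_dict:,} bytes")
print(f"4 MiB dict  : {large_dict:,} bytes")
```

With the small dictionary the repeat is out of reach and the output stays about as large as the input; with the large one the second copy is encoded as back-references and the output drops to roughly half. A bigger dictionary only helps when the data really does repeat over such distances, and it costs more memory while compressing.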

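studentss (OP)

It doesn't need to get down to 20.6 MB.... I figure that would be next to impossible

2025-06-24 17:26

studentss (OP)

Shaving off even 1 GB would already be a blessing

2025-06-24 17:34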

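1tac
Points: 1066
Post ID: 91754691

#17

2025-06-24 14:52

The algorithms at the top of the leaderboard these days are all based on neural networks; the top entry even uses a Transformer. If the OP has an NVIDIA GPU, it might be worth a try.
http://mattmahoney.net/dc/text.html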

studentss (OP)

That's way too hard!

2025-06-24 18:54

studentss
Points: 71
Post ID: 91755935

OP

2025-06-24 18:50

After using WinRAR's ZIP format at best compression, it came out to 57.9 GB

studentss (OP)

Correction: it's 56.9 GB

2025-06-28 16:06

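Gabriellin
Points: 1962
Post ID: 91756166

#19

2025-06-24 19:51

If you only want to use off-the-shelf tools, download PeaZip, choose the ZPAQ format, and pick the highest compression level. Because it is wrapped in a GUI, it appears to adapt to the current environment automatically, so just wait for it to finish. The result should be smaller than common formats such as LZMA and ZIP; to go smaller still you would have to look for other, harder-to-use tools.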

PAQ files compressor, opener and extractor utility

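studentss (OP)

PeaZip with the ZIP format at maximum compression saves another 2 MB... nice... about 290 MB less overall

2025-06-28 11:32

studentss (OP)

ZPAQ format.... I tried it and it actually came out bigger.......... ^0^... it loses to WinRAR's ZIP format.... 4 MB more..... about 580 MB more overall

2025-06-28 11:37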
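
Since the results clearly depend on both the data and the format, it can save time to try several compressors on a small sample before running one over the whole set. Here is a minimal sketch with Python's standard library. It covers only zlib (the DEFLATE method used by ZIP), bzip2 and LZMA, not ZPAQ or RAR, and `compare_compressors` and the generated sample text are placeholders invented for this example.

```python
import bz2
import lzma
import zlib


def compare_compressors(data: bytes) -> dict:
    """Return {name: compressed size in bytes} for three stdlib compressors."""
    return {
        "zlib -9 (DEFLATE)": len(zlib.compress(data, 9)),
        "bz2 -9": len(bz2.compress(data, 9)),
        "lzma preset 9": len(lzma.compress(data, preset=9)),
    }


# Placeholder sample; in practice, read a few representative files instead.
sample = "\n".join(f"record {i}: value={i * i % 9973}" for i in range(50_000)).encode()

sizes = compare_compressors(sample)
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name:<20} {size:>10,} bytes  ({size / len(sample):.1%})")
```

A ratio measured on a sample and scaled up to the full data set is only a rough estimate, so it is worth sampling files from different parts of the collection.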

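pc8801
Points: 10370
Post ID: 91757900

#20

2025-06-25 9:17

Two days of eight-hour shifts at minimum wage would buy you a 1 TB external SSD, wouldn't it?

Who has that much spare time to keep fiddling with the compression ratio of a few dozen GB?

Once or twice, you might not notice,

but keep a log of the time you spend on it each time,

and a few years from now you'll realize you were just making work for yourself.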

studentss (OP)

Writing it all down with a pen would break my hand.... ^0^

2025-06-25 18:20

studentss (OP)

Or maybe I'll just post the text files online... ^0^

2025-06-25 18:40